# Introduction

## Premise

This is a utility for converting a (text-only) short children's story into a PDF with vivid illustrations for each scene.

It first breaks the story down into an XML format laying out the main characters, settings, and scenes. It then converts each scene in this XML object into an image with consistent styling, characters, and settings.

## Status

**It is very much a work in progress!** As of this writing, not all functionality is implemented (e.g. nothing relating to PDFs). The primary goal was to practice working on a **concrete application that includes significant prompt engineering**. The results so far still leave a lot to be desired, but they accomplish the basic task of illustrating individual scenes based only on the story's text, with some degree of consistency between images.

## Internals

### XML Format

The XML format used to split up the story and provide consistent descriptions for the image generation step follows a strict structure and is parsed with the `xml.etree.ElementTree` library. Each tag is described in detail in the `Prompts` section below; the basic structure is:

```xml
<Story>
    <Characters>
        <Character name="John Doe">{visual description}</Character>
        ...
    </Characters>
    <Settings>
        <Setting>{visual description}</Setting>
        ...
    </Settings>
    <Scenes>
        <Scene>
            <Text>{original text of scene}</Text>
            <Setting>{visual description (from a Settings tag child above)}</Setting>
            <Action>{visual description (introducing characters with descriptions from Characters tag children above)}</Action> 
        </Scene>
        ...
    </Scenes>
</Story>
```
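
As a minimal sketch of how a document in this shape can be read back with `xml.etree.ElementTree`, the snippet below parses a small hand-written sample. The sample story content and the variable names are made up for illustration and are not output from the tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the structure above (not real model output).
SAMPLE_STORY_XML = """
<Story>
    <Characters>
        <Character name="Red">A small girl in a bright red hooded cape</Character>
    </Characters>
    <Settings>
        <Setting>A narrow dirt path through a dark pine forest</Setting>
    </Settings>
    <Scenes>
        <Scene>
            <Text>Red set off through the woods.</Text>
            <Setting>A narrow dirt path through a dark pine forest</Setting>
            <Action>[[[Red]]] walks along the path carrying a wicker basket</Action>
        </Scene>
    </Scenes>
</Story>
""".strip()

root = ET.fromstring(SAMPLE_STORY_XML)

# The same XPath-style queries the notebook uses further down.
character_names = [c.get("name") for c in root.findall("./Characters/Character")]
scene_settings = [s.findtext("Setting") for s in root.findall("./Scenes/Scene")]

print(character_names)
print(scene_settings)
```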

## Usage

This is evolving as I continue to implement functionality, so I will try to keep this description up to date.

The steps are:

1. Add the story as a text file in the `input` folder (a sibling of this file) and note its filename, e.g. "Little Red Riding Hood"
2. Run all cells
3. Add a cell calling `convert_story_to_html_illustration()` on the story's filename, e.g. `convert_story_to_html_illustration("Little Red Riding Hood")`
4. Open the generated HTML file from the `output` folder (also a sibling of this file) in your browser, e.g. `output/Little Red Riding Hood.html`
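
Step 1 can also be done from code. The sketch below uses a hypothetical helper, `stage_story`, which is not part of the notebook; it assumes the `input` and `output` folders sit next to the notebook and creates them if they are missing.

```python
from pathlib import Path

def stage_story(title: str, text: str, base_dir: str = ".") -> Path:
    """Hypothetical helper: write the story where read_input() looks for it."""
    base = Path(base_dir)
    (base / "input").mkdir(parents=True, exist_ok=True)
    # write_output() opens files under output/, so that folder must exist too.
    (base / "output").mkdir(parents=True, exist_ok=True)
    story_path = base / "input" / title
    story_path.write_text(text.strip() + "\n", encoding="utf-8")
    return story_path

# Example, to run after all notebook cells have been executed:
# stage_story("Little Red Riding Hood", "Once upon a time...")
# convert_story_to_html_illustration("Little Red Riding Hood")
```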

# Setup

## Imports

In [275]:
from openai import OpenAI
import pandas as pd
import numpy as np
import re
import html
import os
import xml.etree.ElementTree as ET
import asyncio
import nest_asyncio
from functools import partial

## Utils

In [277]:
def clone_node(node: ET.Element) -> ET.Element:
    return ET.fromstring(ET.tostring(node))

def base64_to_html_img(base64_string: str):
    return f'<img src="data:image/png;base64,{base64_string}" alt="Image"/>'

def read_input(filename: str) -> str:
    with open(f"input/{filename}", 'r') as file:
        return file.read().strip()

def write_output(filename: str, content: str):
    with open(f"output/{filename}", 'w') as file:
        file.write(content)

## Config

In [279]:
openai_api_key_path = os.path.expanduser('~/openai_api_key')
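# The key lives outside the input folder, so read it directly rather than through read_input()
with open(openai_api_key_path, 'r') as file:
    openai_api_key = file.read().strip()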

client = OpenAI(api_key=openai_api_key)

In [280]:
nest_asyncio.apply()

## Constants

In [282]:
DEFAULT_IMAGE_LIMIT = 12

## Prompts

In [284]:
GENERAL_IMAGE_PROMPT = (
    "Illustrate the following scene as a Matisse oil painting. Paint should cover most of the canvas. It should be painted with broad brushes and bright colors from a basic palette. " +
    "It should have crisp, well-defined edges. Focus more on characters than setting"
)
    
def get_image_prompt(image_prompt_xml_string: str) -> str:
    return f'{GENERAL_IMAGE_PROMPT}: {image_prompt_xml_string}'

In [285]:
XML_PROMPT = """
You will take a story from the user and build a structured XML object to describe it. All tag content must be enclosed between open and close tags.
The XML root tag contains 3 tags, each described below: <Settings>, <Characters>, and <Scenes>.

The <Settings> tag contains a list of <Setting> tags. Each <Setting> tag contains a short description of the appearance of a distinct location
in which at least one Scene takes place. The location should be specific enough that all the characters in the scene are clearly visible. You should add 
details that are not specified in the story to make a description that is easier to visualize as long as they do not contradict the story. Be specific 
about shapes, colors, indoor vs. outdoor, etc., subject to the constraint that it must be between 100 and 150 characters.

The <Characters> tag contains a list of <Character> tags. Each <Character> tag contains a visual description of a major character in the story. 
Be specific around colors, clothing, age, and gender, subject to the constraint that it must be between 150 and 200 characters. Each 
<Character> tag has an attribute "name" that is the character's most commonly used name.

The <Scenes> tag contains a list of <Scene> tags. Each <Scene> tag represents a different scene in the story. Each Scene should be as long as possible 
because you want to minimize the number of Scenes. A new Scene only begins when either (1) the location changes or (2) a chunk of time in the narrative 
passes without any new action. Moreover, if a scene is shorter than three sentences, it should be combined with the next scene into a single <Scene> tag. 
A scene should never end in the middle of a conversation. Each <Scene> tag contains 3 subtags each described below: <Text>, <Setting>, and <Action>. 
The <Text> tag is the substring of the story narrating the Scene. Each <Scene> tag's <Text> tag is disjoint from every other <Scene> tag's <Text> tag,
and they partition the full text of the story. The <Setting> tag is taken from the <Setting> tag (nested in the <Settings> tag) that corresponds to the 
location of the Scene, copied exactly. If the Scene spans multiple locations, pick the setting in which the longest substring of Text takes place. The 
<Action> tag visually describes the action in the part of the Scene taking place in the Setting in under 400 characters. The <Action> tag refers to each of the 
characters by the "name" attribute of the corresponding <Character> tag nested in the <Characters> tag, wrapped in triple brackets, i.e. [[[ and ]]].

For all visual descriptions, i.e. the <Setting>, <Character>, and <Action> tags, you should add details that are not specified in the story to make it 
easier to visualize, but you must not contradict descriptions in the story. Be concise: never waste space on non-visual details.
"""
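
The model will not always follow these rules, so it can help to check its output before spending money on image generation. The sketch below is one possible check, not part of the notebook: `find_prompt_violations` is a hypothetical helper, and it covers only a few of the constraints from the prompt (required subtags, settings copied exactly, the 400-character limit on `<Action>`, and known character names inside triple brackets).

```python
import re
import xml.etree.ElementTree as ET

def find_prompt_violations(root: ET.Element) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    names = {c.get("name") for c in root.findall("./Characters/Character")}
    settings = {(s.text or "").strip() for s in root.findall("./Settings/Setting")}

    for i, scene in enumerate(root.findall("./Scenes/Scene"), start=1):
        # Every scene needs all three subtags.
        for tag in ("Text", "Setting", "Action"):
            if scene.find(tag) is None:
                problems.append(f"Scene {i}: missing <{tag}>")

        # The scene's setting must be an exact copy of one listed under <Settings>.
        if (scene.findtext("Setting") or "").strip() not in settings:
            problems.append(f"Scene {i}: <Setting> is not copied from <Settings>")

        action = scene.findtext("Action") or ""
        if len(action) >= 400:
            problems.append(f"Scene {i}: <Action> has {len(action)} characters")

        # Names in [[[...]]] must match a <Character name="..."> attribute.
        for name in re.findall(r"\[\[\[(.+?)\]\]\]", action):
            if name not in names:
                problems.append(f"Scene {i}: unknown character [[[{name}]]]")
    return problems

sample = ET.fromstring(
    '<Story><Characters><Character name="Red">girl in a red cape</Character></Characters>'
    "<Settings><Setting>forest path</Setting></Settings>"
    "<Scenes><Scene><Text>Red met the wolf.</Text><Setting>forest path</Setting>"
    "<Action>[[[Red]]] meets [[[Wolf]]]</Action></Scene></Scenes></Story>"
)
problems = find_prompt_violations(sample)
print(problems)
```

In this sample the wolf appears in an `<Action>` tag without a matching `<Character>` entry, which is the kind of slip that would otherwise leave a character without a description in the image prompt.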

# Main Code

## XML Prep

### Helpers

In [289]:
def format_character_introduction(character: ET.Element) -> str:
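    # Double quotes outside: reusing the same quote type inside an f-string is a syntax error before Python 3.12
    return f"{character.get('name')} ({character.text.strip('.')})"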

def substitute_character_descriptions(root: ET.Element) -> ET.Element:
    '''Modifies input and returns result'''
    character_xml_string_map = {character.get('name'): format_character_introduction(character) for character in root.findall('./Characters/Character')}
    for description in root.findall('./Scenes/Scene/Action'):
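        for character_name, introduction in character_xml_string_map.items():
            # Escape the name so regex metacharacters in it (e.g. "Mr. Fox") match literally; the lambda
            # keeps backslashes in the introduction from being read as group references
            description.text = re.sub(rf'\[\[\[{re.escape(character_name)}\]\]\]', lambda _match: introduction, description.text, count=1)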
        description.text = re.sub(r'(\[\[\[|\]\]\])', '', description.text)
    return root
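
To make the substitution concrete, here is a small standalone sketch of the same idea applied to a plain string instead of an XML tree. `introduce_characters` is a hypothetical name used only for this illustration: the first mention of each character is expanded to "name (description)", and any remaining triple brackets are stripped.

```python
import re

def introduce_characters(action: str, descriptions: dict[str, str]) -> str:
    """Expand the first [[[name]]] of each character, then drop leftover brackets."""
    for name, description in descriptions.items():
        intro = f"{name} ({description.strip('.')})"
        # count=1 means only the first mention gets the full description.
        action = re.sub(rf"\[\[\[{re.escape(name)}\]\]\]", lambda _m: intro, action, count=1)
    return re.sub(r"\[\[\[|\]\]\]", "", action)

action = "[[[Red]]] waves at the wolf. [[[Red]]] keeps walking."
result = introduce_characters(action, {"Red": "A small girl in a red cape."})
print(result)
# Red (A small girl in a red cape) waves at the wolf. Red keeps walking.
```

Introducing each character once per scene keeps the prompt short while still giving the image model the same description in every scene.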

### Main

In [291]:
def convert_text_to_xml(story_text: str, prompt: str = XML_PROMPT, print_xml: bool = False) -> ET.Element:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": prompt
            },
            {
                "role": "user",
                "content": re.sub(r'\n+', '\n', story_text)
            }
        ]
    )

    stripped_xml_string = re.sub(r'(?ms)[^<]*(<.*>).*', r'\1', response.choices[0].message.content)

    try:
        xml_root = substitute_character_descriptions(ET.fromstring(stripped_xml_string))
        if print_xml:
            ET.dump(xml_root)
        return xml_root
    except Exception as e:
        if print_xml:
            print(f"Error in XML conversion operation: {e}. Printing XML story in a simplified string format")
            print(stripped_xml_string)
        raise
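
The regular expression in `convert_text_to_xml` is there because the model may wrap its XML in extra prose. The sketch below applies the same pattern to a made-up reply; `extract_xml` and the reply text are assumptions for illustration only.

```python
import re
import xml.etree.ElementTree as ET

def extract_xml(reply: str) -> str:
    """Keep everything from the first '<' through the last '>' in the reply."""
    return re.sub(r"(?ms)[^<]*(<.*>).*", r"\1", reply)

# Hypothetical model reply with chatter before and after the XML.
reply = "Sure! Here is the XML you asked for:\n\n<Story><Scenes /></Story>\n\nLet me know if you need changes."
xml_string = extract_xml(reply)
root = ET.fromstring(xml_string)

print(xml_string)
print(root.tag)
```

Note that this approach is only a heuristic: if the trailing prose itself contains a `>` character, it will be kept and the parse may fail.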

## Image Prep

### Helpers

In [294]:
def parse_image_prompt(scene: ET.Element) -> str:
    scene_clone = clone_node(scene)
    scene_clone.remove(scene_clone.find('./Text'))
    escaped_xml_string = ET.tostring(scene_clone).decode()
    whitespace_trimmed_unescaped_xml_string = re.sub(r'(?ms)>\s*<', '><', html.unescape(escaped_xml_string))
    return whitespace_trimmed_unescaped_xml_string

def get_all_image_generation_inputs(story: ET.Element) -> list[dict]:
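    # Prefix each scene's XML with the shared style instructions so every image is requested in the same style
    image_prompts = [get_image_prompt(parse_image_prompt(scene)) for scene in story.findall('./Scenes/Scene')]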
    return [{
        "model": 'dall-e-3',
        "prompt": image_prompt,
        "size": '1024x1024',
        "quality": "standard",
        "response_format": "b64_json",
    } for image_prompt in image_prompts]
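
As an illustration of what `parse_image_prompt` produces, the sketch below runs the same steps on a hand-written scene: clone the node, drop `<Text>`, serialize, unescape XML entities, and collapse whitespace between tags. The scene content is invented for the example.

```python
import html
import re
import xml.etree.ElementTree as ET

scene = ET.fromstring(
    """<Scene>
    <Text>Red set off through the woods.</Text>
    <Setting>A dirt path</Setting>
    <Action>Red &amp; her dog walk</Action>
</Scene>"""
)

# Work on a copy so the original story tree keeps its <Text> tag.
scene_clone = ET.fromstring(ET.tostring(scene))
scene_clone.remove(scene_clone.find("./Text"))

serialized = ET.tostring(scene_clone).decode()
# html.unescape turns entities such as &amp; back into plain characters for the image model.
prompt_body = re.sub(r"(?ms)>\s*<", "><", html.unescape(serialized))
print(prompt_body)
```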

### Main

In [296]:
async def convert_xml_to_base64_images_async(story: ET.Element, limit) -> list[str]:
    inputs = get_all_image_generation_inputs(story)
    if len(inputs) > limit:
        raise Exception(f"The story had {len(inputs)} scenes in total, which exceeds the limit of {limit} so it won't run unless you raise the limit. NOTE: The default limit is {DEFAULT_IMAGE_LIMIT}.")

    loop = asyncio.get_running_loop()
    # client.images.generate is a blocking call, so each request runs in the default thread pool executor
    tasks = [loop.run_in_executor(None, partial(client.images.generate, **image_input)) for image_input in inputs]
    responses = await asyncio.gather(*tasks)
    base64_strings = [response.data[0].b64_json for response in responses]
    return base64_strings

def convert_xml_to_base64_images(story: ET.Element, limit) -> list[str]:
    return asyncio.run(convert_xml_to_base64_images_async(story, limit))
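
The concurrency pattern above can be tried without calling the API. In the sketch below, `fake_generate` is a stand-in for the blocking `client.images.generate` call (an assumption for illustration); each call runs in the default thread pool and `asyncio.gather` returns the results in the same order as the inputs.

```python
import asyncio
import time
from functools import partial

def fake_generate(prompt: str, size: str) -> str:
    """Stand-in for a blocking API call; sleeps instead of hitting the network."""
    time.sleep(0.1)
    return f"image for {prompt!r} at {size}"

async def generate_all(inputs: list[dict]) -> list[str]:
    loop = asyncio.get_running_loop()
    # partial() binds the keyword arguments because run_in_executor only forwards positional ones.
    tasks = [loop.run_in_executor(None, partial(fake_generate, **kwargs)) for kwargs in inputs]
    return await asyncio.gather(*tasks)

inputs = [{"prompt": f"scene {i}", "size": "1024x1024"} for i in range(3)]
results = asyncio.run(generate_all(inputs))
print(results)
```

When run as a plain script this needs no extra setup; inside a notebook, where an event loop is already running, `asyncio.run` relies on the `nest_asyncio.apply()` call from the Config section.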

## HTML Prep

In [None]:
def convert_xml_and_base64_images_to_html_elements(story: ET.Element, base64_images: list[str]) -> str:
    texts = [
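
The cell above is unfinished in the source. Purely as an assumption about where it might be heading, the sketch below shows one way to pair each scene's `<Text>` with its image. `build_scene_html` and the `<figure>` layout are hypothetical and are not the author's implementation.

```python
import html
import xml.etree.ElementTree as ET

def build_scene_html(story: ET.Element, base64_images: list[str]) -> str:
    """Hypothetical helper: one <figure> per scene, with the story text as its caption."""
    texts = [
        scene.findtext("Text", default="").strip()
        for scene in story.findall("./Scenes/Scene")
    ]
    figures = [
        f'<figure><img src="data:image/png;base64,{image}" alt="Scene {i}"/>'
        f"<figcaption>{html.escape(text)}</figcaption></figure>"
        # zip stops at the shorter list, so a missing image drops that scene rather than raising.
        for i, (text, image) in enumerate(zip(texts, base64_images), start=1)
    ]
    return "\n".join(figures)

story = ET.fromstring(
    "<Story><Scenes><Scene><Text>Tom &amp; Jerry ran.</Text></Scene></Scenes></Story>"
)
page = build_scene_html(story, ["QUJD"])
print(page)
```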

## Integration

In [298]:
def convert_story_to_html_illustration(story_filename: str, limit: int = DEFAULT_IMAGE_LIMIT):
    story_text = read_input(story_filename)
    story = convert_text_to_xml(story_text)
    illustrations_base64 = convert_xml_to_base64_images(story, limit)
    write_output(f"{story_filename}.html", '\n'.join(base64_to_html_img(image_base64) for image_base64 in illustrations_base64))
    print(f"Operation complete! Open output/{story_filename}.html in your browser to view results")
    