Kristen Swerzenski

DSC 670 Advanced Uses of Generative AI

9 February 2025

## Project Milestone 3: Building the Model

The time has come to fine-tune our first model! In the last milestone, I tested a large language model's ability to serve as a virtual focus group for aspiring screenwriters, giving them real-time feedback on their stories, characters, and overall narrative structure using only prompt direction and no fine-tuning. While the model could generate some useful feedback, it was often very general and lacked depth and variety. Many responses were repetitive across different prompts, failing to capture the nuances of individual scenes. I also observed that the models struggled to interpret the format of a script and could not always distinguish dialogue, scene direction, and other script elements. To improve the quality and specificity of the feedback, I will now fine-tune the model to specialize it for this task. By training the model on structured examples of high-quality script analysis, I aim to create a tool that provides more insightful, engaging, and constructive feedback tailored to each unique script.

In [2]:
import os
from dotenv import load_dotenv
import requests
from openai import OpenAI
import json
from collections import defaultdict
import numpy as np
import tiktoken
import time
from IPython.display import clear_output

In [3]:
# Loading environment variables
load_dotenv(dotenv_path=r'C:\Users\krist\Documents\Data Science MS\DSC670\Week 7\OPENAI_API_KEY.env')

# Retrieving the API key
api_key = os.getenv("OPENAI_API_KEY")

In [4]:
# Verifying that the API key was loaded
if api_key is None:
    raise Exception("Missing API key.")

In [5]:
# Initializing the client
client = OpenAI(api_key=api_key)

### Preparing a data set

The first thing that we need to do in order to fine-tune the model is locate a data set. However, the data set has to be in a specific format for fine-tuning, and despite a lot of searching I could not find any data sets with input/output pairs that match what I want the fine-tuned model to achieve (the input would be a script, and the output would be constructive criticism on that script). I was, however, able to find a dataset on Kaggle that contains 33 annotated screenplays that the model can use to better understand the structure of scripts (https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus). From here, I also need some solid feedback for these scripts that the model can use to understand the type and content of feedback that I am looking for.

To do this, I am first going to use generative AI to take each of the 33 script snippets and provide feedback on the clarity, pacing, and dialogue of each one. While I know that using generative AI to create data for it to essentially train itself on could lead to a self-destructive loop, I'm hoping that it is a solid enough foundation for the fine-tuned model to build on. Depending on time, I might also enter some manual feedback into the training data for certain examples or modify what the LLM produces to introduce some variety into the training set and give it that human touch. While it is by no means a perfect way of curating a dataset for fine-tuning, it is the best that can be done with the time and resources available. With unlimited time, I would ideally love to either take critical reviews of specific parts of films and pair them with their appropriate scripts or solicit real focus group feedback on different scripts to use for training. For now though, we will let the generative AI generate some feedback for these scripts to construct our training data.

In [5]:
# Loading in all of the annotated scripts
scripts_dir = 'manual_annotations'

# Reading all text files in the directory
script_files = [f for f in os.listdir(scripts_dir) if f.endswith('.txt')]

# Creating a dictionary to store the scene data
scenes = {}

# Looping over all script files and reading their content
for script_file in script_files:
    with open(os.path.join(scripts_dir, script_file), 'r', encoding='utf-8') as f:
        scenes[script_file] = f.read()

# Checking the first few lines of one scene to verify
print(scenes[script_files[0]][:500])

scene_heading: INT.  CONCOURSE/AIRPORT TERMINAL - BAY

text: CLOSE ON A FACE.  A nine year old boy, YOUNG COLE, his eyes wide
	with wonder. watching something intently.  We HEAR the sounds of
	the P.A. SYSTEM droning Flight Information mingled with the
	sounds of urgent SHOUTS, running FEET, EXCLAMATIONS.
	YOUNG COLE'S POV:  twenty yards away, a BLONDE MAN is sprawled on
	the floor, blood oozing from his gaudy Hawaiian shirt.
	A BRUNETTE in a tight dress, her face obscured from YOUNG COLE'S
	vie


Now that the scripts are loaded into a dictionary, I am going to create a function that calls the API to retrieve feedback for each script in the format I am looking for.

In [1]:
# Creating a function to generate feedback for each script
def generate_feedback(scene_text):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": "You are a professional screenplay analyst. Given the following scene, provide feedback on its clarity, pacing, and dialogue."},
            {"role": "user", "content": scene_text},
        ],
        max_tokens=500,
        temperature=0
    )

    feedback = completion.choices[0].message.content
    return feedback

In [70]:
# Applying the function to every script in the dictionary and printing the first one
feedback_data = {}
for script_file, scene_text in scenes.items():
    feedback = generate_feedback(scene_text)
    feedback_data[script_file] = {
        "scene_text": scene_text,
        "feedback": feedback
    }
    
    
print(feedback_data[script_files[0]]['feedback'])

**Feedback on Scene:**

**Clarity:**
The scene effectively establishes a vivid contrast between two distinct settings: the airport terminal and the dystopian future world. The transition from Young Cole's perspective to the adult Cole's environment is clear, and the use of visual and auditory cues helps the audience follow the narrative shift. However, the scene could benefit from more explicit connections between the two timelines to enhance the audience's understanding of their relationship. The introduction of the scientists and their roles could be clarified further to avoid confusion about their significance and purpose.

**Pacing:**
The pacing of the scene is generally well-managed, with a gradual build-up from the initial airport incident to the more intense and mysterious future setting. The transition from the dream-like memory to the harsh reality of the future is smooth, maintaining the audience's engagement. However, the sequence in the future world, particularly the decont

The feedback that the LLM returned is overall pretty solid. It is definitely more detailed than the responses to some of the prompts I tested previously, thanks to the defined response structure I asked for. Now I am going to take all of these generated feedback responses and format them first into a JSON file, then convert the JSON into JSONL for training purposes.
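One thing worth checking before these responses become training targets: `generate_feedback` caps each reply at `max_tokens=500`, and the sample printed above stops mid-word. That may only be how the notebook output was displayed, but if the cap did cut a response short, the fine-tuned model would be learning from incomplete feedback. The sketch below is a rough heuristic only; the helper names `looks_truncated` and `find_truncated` and the punctuation rule are my own assumptions, not part of any library.

```python
def looks_truncated(text: str) -> bool:
    """Heuristic: True if the text does not end with closing punctuation."""
    stripped = text.rstrip()
    # Markdown bold markers, quotes, or a closing parenthesis may trail the last sentence
    stripped = stripped.rstrip('*"\')')
    return not stripped.endswith((".", "!", "?"))


def find_truncated(feedback_by_file: dict) -> list:
    """Return the keys whose 'feedback' text looks cut off."""
    return [
        name
        for name, details in feedback_by_file.items()
        if looks_truncated(details["feedback"])
    ]


# Tiny illustrative sample in the same shape as feedback_data (not real results)
sample = {
    "a.txt": {"scene_text": "...", "feedback": "The pacing is well managed."},
    "b.txt": {"scene_text": "...", "feedback": "particularly the decont"},
}
print(find_truncated(sample))
```

A more direct signal is available at generation time: `completion.choices[0].finish_reason` is `"length"` when a reply hit the token cap, so that field could be stored alongside each response instead of relying on punctuation.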

In [71]:
# Saving input/response pairs as a JSON
with open('feedback_data.json', 'w', encoding='utf-8') as f:
    json.dump(feedback_data, f, ensure_ascii=False, indent=4)

print("Feedback data saved as 'feedback_data.json'")

Feedback data saved as 'feedback_data.json'


In [8]:
# Converting JSON into JSONL
with open('feedback_data.json', 'r', encoding='utf-8') as f:
    feedback_data = json.load(f)

# Defining the system instruction
system_message = {
    "role": "system",
    "content": "You are a focus group reviewing a scene script for an unproduced film or television show. Provide constructive feedback on clarity, pacing, and dialogue. Consider engagement, emotional impact, and storytelling effectiveness in your analysis."
}
# Converting and saving to JSONL format
with open('fine_tuning_data.jsonl', 'w', encoding='utf-8') as f:
    for scene_file, details in feedback_data.items():
        json.dump({
            "messages": [
                system_message,
                {"role": "user", "content": details["scene_text"]},
                {"role": "assistant", "content": details["feedback"]}
            ]
        }, f)
        f.write("\n")

print("Fine-tuning dataset saved as 'fine_tuning_data.jsonl'")

Fine-tuning dataset saved as 'fine_tuning_data.jsonl'


I took a quick look at the JSON and JSONL output and everything seems to have been converted correctly. I also read through some of the generated responses and made a few minor tweaks. I didn't perform any great overhauls on any of the responses due to time constraints, so I am going to see how well this mostly-generated training data performs when used to tune the model before spending time manually editing and building out the training data any further.

### Dataset Checks

Now that I have a JSONL training data set, I want to verify that it is in the correct format for fine-tuning. In this section I will perform a number of checks on the data to ensure that it loads correctly, that it is formatted correctly, and to get some estimates of the resources the training job will require. This will include basic checks for any glaring overall issues with the data set, formatting checks to ensure that every example is appropriately formatted for training, and some cost estimations to better understand the data in terms of tokens, costs, and distributions.

#### Basic Checks

In [15]:
# Defining a function to open the training data and perform basic checks
def basic_checks(data_file):
    try:
        with open(data_file, 'r', encoding='utf-8') as f:
            dataset = [json.loads(line) for line in f]
            
        print(f"Basic checks for file {data_file}:")
        print("Count of examples in training dataset:", len(dataset))
        print("First example:")
        for message in dataset[0]['messages']:
            print(message)
        return True
    except Exception as e:
        print(f"An error has occurred in file {data_file}: {e}")
        return False

#### Formatting Checks

In [9]:
# Checking the data for consistent formatting
def format_checks(dataset, filename):
    format_errors = defaultdict(int)
    
    for ex in dataset:
        # Checking to ensure each example is a dictionary
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
        
        # Ensuring each example has a 'messages' key
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
        
        for message in messages:
            # Ensuring each message has a role and content key
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1

            # Checking for any unrecognized keys
            if any(k not in ("role", "content", "name", "function_call") for k in message):
                format_errors["message_unrecognized_key"] += 1

            # Ensuring the role of the message is one of the recognized roles
            if message.get("role", None) not in (
                "system",
                "user",
                "assistant",
                "function"
            ):
                format_errors["unrecognized_role"] += 1

            # Checking for either a content or function call in each message, and that they're formatted appropriately
            content = message.get("content", None)
            function_call = message.get("function_call", None)
            if (not content and not function_call) or not isinstance(content, str):
                format_errors["missing_content"] += 1

        # Ensuring at least one message in the example has the role of assistant
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1

    # If format errors are found, print them and return False
    if format_errors:
        print(f"Formatting errors found in file {filename}:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
        return False
    print(f"No formatting errors found in file {filename}")
    return True

#### Cost Estimation and Token Analysis

In [10]:
# Token and epoch estimates
MAX_TOKENS = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

# Estimating how many tokens will be used for training
def estimate_tokens(dataset, example_token_counts):
    # example_token_counts holds one token count per example in the dataset
    n_epochs = TARGET_EPOCHS

    # Retrieving number of examples in the dataset
    n_train_examples = len(dataset)

    # Adjusting epochs if number of examples is less than the target
    if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
        n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
    
    # Adjusting the epochs if the number of examples is more than the target
    elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
        n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

    # Calculating the total number of tokens in the dataset, capping each example at MAX_TOKENS
    n_billing_tokens_in_dataset = sum(min(MAX_TOKENS, length) for length in example_token_counts)

    # Printing the total token count that will be charged during training
    print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")

    # Print default number of epochs for training
    print(f"You will train for {n_epochs} epochs on this dataset")

    # Printing total number of tokens that will be charged during training
    print(f"You will be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

    # If any individual example exceeds the per-example maximum, print a warning
    n_too_long = sum(length > MAX_TOKENS for length in example_token_counts)
    if n_too_long > 0:
        print(f"WARNING: {n_too_long} examples in your dataset are longer than {MAX_TOKENS} tokens.")
        print(f"Only the first {MAX_TOKENS} tokens of these examples will be used for training.")

In [11]:
# Counting the number of tokens in all messages of an example (relies on the global `encoding`)
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

In [12]:
# Counting the number of tokens in the assistant messages (relies on the global `encoding`)
def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

In [13]:
# Printing the distribution of values
def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

#### Performing the Data Checks

In [17]:
if __name__ == "__main__":
    # Listing the training and validation files
    files = ["fine_tuning_data.jsonl"]

    for file in files:
        # Running basic checks on the files
        if not basic_checks(file):
            print("Exiting...")
            exit()

    print("-" * 50)

    # Running additional checks to validate token counts and number of examples per label
    encoding = tiktoken.get_encoding("cl100k_base")

    files = [
        "fine_tuning_data.jsonl",
    ]
    
    # Performing token checks and formatting checks
    for file in files:
        print(f"Processing file: {file}")
        with open(file, "r", encoding="utf-8") as f:
            dataset = [json.loads(line) for line in f]

        total_tokens = []
        assistant_tokens = []

        if not format_checks(dataset, file):
            print("Exiting...")
            exit()

        for ex in dataset:
            messages = ex.get("messages", {})
            total_tokens.append(num_tokens_from_messages(messages))
            assistant_tokens.append(num_assistant_tokens_from_messages(messages))
        
        # Printing results of the checks
        print_distribution(total_tokens, "total tokens")
        print_distribution(assistant_tokens, "assistant tokens")
        # Passing the full per-example token counts, since the whole example counts toward training tokens
        estimate_tokens(dataset, total_tokens)
        print(f"Processing file completed: {file}")
        print("-" * 50)

Basic checks for file fine_tuning_data.jsonl:
Count of examples in training dataset: 33
First example:
{'role': 'system', 'content': 'You are a focus group reviewing a scene script for an unproduced film or television show. Provide constructive feedback on clarity, pacing, and dialogue. Consider engagement, emotional impact, and storytelling effectiveness in your analysis.'}
{'role': 'user', 'content': 'scene_heading: INT.  CONCOURSE/AIRPORT TERMINAL - BAY\n\ntext: CLOSE ON A FACE.  A nine year old boy, YOUNG COLE, his eyes wide\n\twith wonder. watching something intently.  We HEAR the sounds of\n\tthe P.A. SYSTEM droning Flight Information mingled with the\n\tsounds of urgent SHOUTS, running FEET, EXCLAMATIONS.\n\tYOUNG COLE\'S POV:  twenty yards away, a BLONDE MAN is sprawled on\n\tthe floor, blood oozing from his gaudy Hawaiian shirt.\n\tA BRUNETTE in a tight dress, her face obscured from YOUNG COLE\'S\n\tview, rushes to the injured man, kneels beside him, ministering\n\tto his woun

The checks for the data set returned nothing to be concerned with, save for maybe the token costs. Because the inputs and outputs are pretty hefty token-wise, I'll just need to watch my credits during and after the fine-tuning to make sure I am not running up too high of a bill. Otherwise, looks like the data preparation was fairly successful, and the data is ready to be used for fine-tuning.
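To make the epoch part of the estimate concrete, here is the adjustment from `estimate_tokens` traced for a dataset of this size. This is only a sketch of the notebook's own heuristic, pulled out into a hypothetical helper called `default_epochs`; it is not necessarily what the fine-tuning service would choose on its own.

```python
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25


def default_epochs(n_train_examples: int) -> int:
    """Mirror the epoch adjustment used in estimate_tokens."""
    n_epochs = TARGET_EPOCHS
    if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
        # Small dataset: raise epochs so roughly MIN_TARGET_EXAMPLES examples are seen
        n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
    elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
        # Large dataset: lower epochs so at most MAX_TARGET_EXAMPLES examples are seen
        n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)
    return n_epochs


# 33 examples: 33 * 3 = 99 < 100, so n_epochs = min(25, 100 // 33) = 3
print(default_epochs(33))
# 10 examples: 10 * 3 = 30 < 100, so n_epochs = min(25, 100 // 10) = 10
print(default_epochs(10))
```

With 33 examples the small-dataset branch is triggered, but the result still works out to 3 epochs.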

### Fine-Tuning

Now comes the time for fine-tuning! The first step of this process is to upload the training file and retrieve the training file ID. Once I have that, I can then construct the fine-tuning job. For this assignment, I will be using gpt-3.5-turbo as it is a little cheaper than gpt-4, and gpt-4o-mini might not be the best choice for this task. I will then construct a series of checks that will provide updates on the status of the fine-tuning job, both statically and dynamically. Once the fine-tuning job is successfully completed, we can then move on to testing the fine-tuned model.

#### Loading the Training Data

In [18]:
TRAINING_FILENAME = 'fine_tuning_data.jsonl'

In [19]:
# Loading in the training dataset
file = client.files.create(
  file=open(TRAINING_FILENAME, "rb"),
  purpose="fine-tune"
)

print("Training file ID:", file.id)
print("Training file name:", file.filename)

Training file ID: file-6wY9QydCcKddKKkv1qaosN
Training file name: fine_tuning_data.jsonl


#### Starting the Fine-Tuning Job

Now that we have the training file uploaded and have an ID assigned to it, we can construct the fine-tuning job. I'll be using 3 epochs for this training because the training set isn't huge and I don't want to risk overfitting the model.

In [26]:
# Starting the fine-tuning job
ft = client.fine_tuning.jobs.create(
    training_file="file-6wY9QydCcKddKKkv1qaosN",
    model="gpt-3.5-turbo",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3
            }
        }
    }
)

print("Finetuning job ID:", ft.id)

Finetuning job ID: ftjob-tYxjZTx3VCVc8GLr2KMx3WYT


#### Checking the Status of the Fine-Tuning Job

Now, we will perform a number of checks to keep tabs on the training process. First, I will grab a list of all of my queued up fine-tuning jobs and the status of each one to see where my current job falls in the line. 

In [27]:
# Listing fine-tuning jobs along with their current status
ft_jobs = client.fine_tuning.jobs.list()
for ft_job in ft_jobs:
    print(ft_job.id, ft_job.status)

ftjob-tYxjZTx3VCVc8GLr2KMx3WYT running
ftjob-gWHovX1gmqLgXZapbrOZYzOu queued
ftjob-DHf5S7JzXN0S2dVYn6Ge2nji succeeded


The next check lists the events that have been completed in the job so far. From this list, we can see that the job was created, the files were validated, and the fine-tuning job has started and is currently in progress.

In [28]:
ft_job_events = client.fine_tuning.jobs.list_events(
    fine_tuning_job_id='ftjob-tYxjZTx3VCVc8GLr2KMx3WYT',
    limit=2)

for ft_job_event in ft_job_events:
    print(ft_job_event.id, ft_job_event.message)

ftevent-pAvHAn1tm9zvL0iqR8NKH0dJ Fine-tuning job started
ftevent-KANGvfJCoSWgeQpg3JiOSKji Files validated, moving job to queued state
ftevent-5oBRI2I7yMmR4oqRjs0tMET4 Validating training file: file-6wY9QydCcKddKKkv1qaosN
ftevent-R8TKRvUAZeD91iuXJd7969LA Created fine-tuning job: ftjob-tYxjZTx3VCVc8GLr2KMx3WYT


The last check is a polling loop that queries the status of the job every 10 seconds until the job either fails or completes successfully. The status is refreshed after every check, and the final output shows that the job was carried out successfully. This is an extremely useful way to keep tabs on the fine-tuning job and get real-time updates on where you are in the training process.

In [31]:
JOB_ID = "ftjob-tYxjZTx3VCVc8GLr2KMx3WYT"

url = f"https://api.openai.com/v1/fine_tuning/jobs/{JOB_ID}"

# Defining headers for authentication
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

# Tracking the progress of the fine-tuning job
start_time = time.time()
status = None

while status not in ["succeeded", "failed", "cancelled"]:
    # Making a GET request to retrieve the job details
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
        job_details = response.json()
        status = job_details.get("status")
        
        # Printing job details
        print(job_details)
        
        print("Elapsed Time: {} minutes {} seconds".format(
            int((time.time() - start_time) // 60),
            int((time.time() - start_time) % 60)))
        print(f'Status: {status}')
    else:
        print(f"Error: {response.status_code}, {response.text}")
        break 
    
    # Providing a check every 10 seconds and clearing the previous output
    clear_output(wait=True)
    time.sleep(10)  

# Printing final status
if status in ["succeeded", "failed", "cancelled"]:
    print(f'Fine-Tuning job {JOB_ID} finished with status: {status}')

Fine-Tuning job ftjob-tYxjZTx3VCVc8GLr2KMx3WYT finished with status: succeeded
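The same polling idea can be wrapped in a small reusable function with a timeout. This is a minimal sketch rather than the approach used above: `wait_for_job`, its parameters, and the injected `fetch_status` callable are all hypothetical names of my own, chosen so the loop can be exercised without calling the API.

```python
import time

# Statuses after which a fine-tuning job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=10, timeout_seconds=3600, sleep=time.sleep):
    """Poll fetch_status() until a terminal status or the timeout is reached.

    fetch_status is any zero-argument callable returning the job status string.
    """
    waited = 0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a scripted sequence of statuses
fake_statuses = iter(["validating_files", "queued", "running", "succeeded"])
result = wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None)
print(result)
```

In this notebook the callable could be something like `lambda: client.fine_tuning.jobs.retrieve(JOB_ID).status`, which would reuse the existing `client` instead of building the request headers by hand.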


Now that the fine-tuning job is complete, it is time to test out the model and see how well it does!

## Testing the Model

Now comes the really fun part: testing the fine-tuned model to see how the fine-tuning has improved the model's performance at providing feedback for screenwriters. I am going to use some of the prompts that I used in my initial testing in the last milestone to see how the responses have improved from the base gpt-3.5-turbo model.

In [32]:
# Retrieving details of the fine-tuning job
client.fine_tuning.jobs.retrieve("ftjob-tYxjZTx3VCVc8GLr2KMx3WYT")

FineTuningJob(id='ftjob-tYxjZTx3VCVc8GLr2KMx3WYT', created_at=1739040309, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:personal::AykBr6aP', finished_at=1739040597, hyperparameters=Hyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=3), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-MktBuv68z9JVLPqQFWPQVoIe', result_files=['file-Wm9PEdU9qf7oE6tpg9gud5'], seed=2039927102, status='succeeded', trained_tokens=246015, training_file='file-6wY9QydCcKddKKkv1qaosN', validation_file=None, estimated_finish=None, integrations=[], method=Method(dpo=None, supervised=MethodSupervised(hyperparameters=MethodSupervisedHyperparameters(batch_size=1, learning_rate_multiplier=2.0, n_epochs=3)), type='supervised'), user_provided_suffix=None)
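The `trained_tokens` value in the job details above gives a way to sanity-check the bill. The sketch below assumes that `trained_tokens` counts tokens across all epochs, and the price is an illustrative placeholder of my own (`ASSUMED_PRICE_PER_MILLION_USD`), not a quoted rate, so it should be replaced with the current price for the model being tuned.

```python
# Assumption: illustrative training price in USD per 1M tokens; check current pricing
ASSUMED_PRICE_PER_MILLION_USD = 8.00


def training_cost_usd(trained_tokens: int, price_per_million: float) -> float:
    """Estimated training cost = billed tokens / 1e6 * price per million tokens."""
    return trained_tokens / 1_000_000 * price_per_million


trained_tokens = 246015  # trained_tokens from the job details above
n_epochs = 3             # n_epochs from the job hyperparameters
n_examples = 33          # examples in fine_tuning_data.jsonl

print(f"Tokens per epoch: {trained_tokens // n_epochs}")
print(f"Average tokens per example: {trained_tokens / n_epochs / n_examples:.0f}")
print(f"Estimated cost: ${training_cost_usd(trained_tokens, ASSUMED_PRICE_PER_MILLION_USD):.2f}")
```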

#### Test 1

This first prompt is rather basic, to see how the model performs with the base, expected instructions, but it does use one of the non-copyrighted scripts from the initial testing so we can compare some of the feedback that it provides.

In [35]:
prompt_one = """
EXT. DOG PARK - DAY
BEN and VERA sit on a park bench in a trendy dog park
watching BEN’s dog Lupin run around in circles.
VERA
You never told me how your date
went the other night.
BEN
It was … actually a little strange.
You know how Lupin is in a few of
my profile pictures, right? On the
app?
VERA
From what you’ve told me, Lupin is
an excellent wing-dog.
BEN
He’s the goodest boy there is.
VERA
So what happened?
BEN
So the other night, I’m out with
this guy … “Ted”. We had a great
time: drinks at that new place with
the secret door, dinner at La Mer
and then I suggested a night cap
back at mine. We get back to mine
and Ted meets Lupin and is all over
him--big cuddles, throwing toys
down the hall. Like I suddenly
didn’t exist.
VERA
Maybe you’re Lupin’s wingman, did
you ever think about that?
BEN
But Ted hadn’t mentioned him all
night! Most guys I’m out with,
that’s like the first thing they
bring up. Lupin’s a great
icebreaker, that’s why I’ve got him
in so many of my pictures.
(MORE)
BEN (CONT'D)
2.
This guy shows up at my place and
it’s like he just happens to
discover that I’m a dog owner?
VERA
Doesn’t really sound like much of a
crime, Ben.
BEN
Then: I go to the kitchen to get a
bottle of wine, and when I come
back Ted’s taking all these photos
with Lupin. Hugging him, patting
him, staring in his eyes. I swear I
caught him draping his scarf around
Lupin’s neck. I had to fake a
headache and get him out of there.
Like, that is weird, right?
Pause. VERA shifts her weight on the park bench.
VERA
Do you still have him on the app?
BEN forks over his phone. VERA scrolls for a moment and then
gasps.
VERA (CONT'D)
Oh. My. God. Look!
“Ted” has a different dog in every picture on his profile.
BEN swipes through them in horror, settling on one in which
Lupin wears a scarf. He looks over to the dog, oblivious,
running around the park with a stick in his mouth.
BEN
He’s a monster. Some kind of serialkiller-dog-stalker.
VERA
You’re so lucky. Both of you!
BEN
I need to hug Lupin. (Calling out.)
Lupin! LUPIN!
He runs to his dog across the grass.
"""


# Testing the fine-tuned model
completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:personal::AykBr6aP",
  messages=[
    {"role": "system", "content": "You are a focus group reviewing a scene script for an unproduced film or television show. Provide constructive feedback on clarity, pacing, and dialogue. Consider engagement, emotional impact, and storytelling effectiveness in your analysis."},
    {"role": "user", "content": f"Here is a script scene:\n\n{prompt_one}\n\nPlease provide your feedback."}
  ],
  max_tokens=500
)
     
print(completion.choices[0].message.content)

**Feedback on Scene:**

**Clarity:**
Overall, the scene is clear and easy to follow. The setting and characters are well-established, and the dialogue effectively conveys the central conflict. The use of the dog park as a backdrop adds a nice visual element and helps to ground the conversation. The transition from the anecdote about the date to the revelation about Ted's profile is smooth and natural. The interaction between Ben and Vera is clear, and their friendship is evident through their comfortable banter.

**Pacing:**
The pacing of the scene is well-managed. The initial setup, with Ben recounting his date, is engaging and sets a light-hearted tone. The tension builds gradually as Ben describes Ted's behavior, leading to a humorous and satisfying reveal with Vera's discovery of Ted's profile. The scene's conclusion, with Ben's realization and joking response, provides a nice resolution. The dialogue is snappy, which keeps the scene moving without feeling rushed.

**Dialogue:**
Th

Immediately I can see an improvement over the initial outputs from the last milestone. Whereas the base model provided rather surface-level feedback, the fine-tuned model gives specific feedback on which parts of the script are strong and where improvements can be made. The output follows the structure that the fine-tuning data outlined, giving feedback on clarity, pacing, and dialogue. It was also interesting to see that the model went beyond those three categories, adding some expanded thoughts along with some great suggestions for making the scene stronger. Overall, I'd say so far that this model is a significant improvement over the base model.
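Since this comparison is against outputs from the last milestone, a side-by-side run with the same messages can make it easier to check. The helper below is a sketch under assumptions: `compare_models` is a hypothetical function of my own, and the stand-in client exists only so the example runs without an API key. With the real `client`, the model list would hold the base model name and the fine-tuned model ID.

```python
from types import SimpleNamespace


def compare_models(client, models, system_prompt, user_prompt, max_tokens=500):
    """Send the same messages to each model and return {model: reply text}."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    replies = {}
    for model in models:
        completion = client.chat.completions.create(
            model=model, messages=messages, max_tokens=max_tokens
        )
        replies[model] = completion.choices[0].message.content
    return replies


# Offline stand-in for the OpenAI client so the sketch runs without an API key
def _fake_create(model, messages, max_tokens):
    reply = SimpleNamespace(content=f"[{model}] feedback placeholder")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
out = compare_models(fake_client, ["base-model", "fine-tuned-model"], "system", "scene")
for model, text in out.items():
    print(model, "->", text)
```

Keeping the system prompt, user prompt, and `max_tokens` identical across both calls means any difference in the replies comes from the model rather than the request.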

#### Test 2

This next prompt is word for word one of the test prompts I tried in Milestone 2, providing another script snippet to the model and asking specifically for feedback on how the scene's dialogue would land with a Gen Z audience.

In [36]:
prompt_two = """
EXT. APARTMENT BALCONY - NIGHT
ASH checks their watch: just minutes left until midnight.
They look up and out at the city before them and sigh through
their nose.
They’re standing on the large, wraparound balcony of a nice
apartment. Inside, a New Year’s Eve party is in full swing.
Out here, in the cold, various guests have stepped out for
various reasons: for fresh air, for a smoke, for a break from
the crowd, for privacy with one another. ASH does their best
to ignore the lot of them--to keep their distance. They check
their watch again.
The sounds of the party crescendo for a moment as the sliding
balcony door is quickly opened and closed. Out of the party
steps ASH’s friend CHARLIE, carrying a bottle of champagne
and two glasses. CHARLIE joins ASH by the railing.
CHARLIE
(pouring drinks)
I’ve got it!
ASH
(taking a glass)
Nice! What have you got?
CHARLIE
I’ve been thinking.
ASH
You’ve been thinking...
CHARLIE
It’s a thought. A bold thought.
ASH
A New Year’s resolution?
CHARLIE
God, no! It’ll be far too late by
then.
ASH
So let’s hear it.
CHARLIE
Not yet: first, we drink.
They ‘cheers!’ And drink. CHARLIE pours another round.
2.
ASH
So far, so good.
CHARLIE
This isn’t even “the thing”.
ASH
So what is “the thing”?
CHARLIE
Drink again and I’ll tell you.
ASH drinks. CHARLIE doesn’t.
CHARLIE (CONT'D)
I think we should kiss at midnight.
Pause. ASH performs the longest drink swallow of their life.
CHARLIE fills ASH’s glass again.
ASH
You and me?
CHARLIE
Yep.
ASH
You want to kiss?
CHARLIE
At midnight. For New Year’s.
ASH
I’m confused.
CHARLIE
Don’t be. It’s a tradition, whereASH
No, I get that bit. You want us to
kiss at midnight?
CHARLIE
I figure with you, I’m guaranteed a
kiss. If I look elsewhere, I might
not be so lucky.
ASH
So it’s insurance? Isn’t that guy
here? That one you want to...
CHARLIE
Nathan. Yes, Nathan is here. And I
thought about shooting that shotASH
But instead2.
3.
CHARLIE
...I came out here.
ASH
For me?
CHARLIE
Little bit wishing I hadn’t, now...
An OBNOXIOUS GUEST bangs on the glass door and mouths “GET
READY!” Charlie downs their glass and pours another.
ASH
What are you doing?
CHARLIE
Catching up with you.
ASH
You know that’s not what I meant.
A “WHOOP!” from the crowd inside. CHARLIE smiles at ASH.
CHARLIE
It might be nice?
ASH
I think it could be.
CHARLIE
Not where I’d planned the evening
was gonna go...
ASH
You can still go and find Nathan.
CHARLIE shakes their head. Inside the apartment, a muffled
countdown begins: *TEN*, *NINE*, *EIGHT*, *SEVEN*...
CHARLIE
So are we doing it?
ASH
Is it a good idea?
CHARLIE
That sounds like next year’s
problem.
ASH
Just a kiss?
CHARLIE
Maybe?
... *THREE*, *TWO*, *ONE*! The party erupts. Champagne,
cheering, laughter, hugs, kisses. And ASH and CHARLIE
together in the middle of it all.
"""

# Testing the fine-tuned model
completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:personal::AykBr6aP",
  messages=[
    {"role": "system", "content": "You are a focus group reviewing a scene script for an unproduced film or television show. Provide constructive feedback on clarity, pacing, and dialogue. Consider engagement, emotional impact, and storytelling effectiveness in your analysis."},
    {"role": "user", "content": f"Here is a script scene:\n\n{prompt_two}\n\nPlease provide your feedback. Give suggestions on how to tailor the dialogue between the characters more to a Gen Z audience. Give specific examples of where you would incorporate some of these suggestions in the script."}
  ],
  max_tokens=500
)
     
print(completion.choices[0].message.content)

**Feedback on Scene:**

**Clarity:**
The scene is clear and effectively establishes the setting, the characters' relationship, and the tension leading up to the midnight kiss. The use of the balcony at a New Year's Eve party as the backdrop is a familiar and relatable setting, and it provides a nice contrast between the intimate moment between Ash and Charlie and the bustling party inside.

**Pacing:**
The pacing is well-executed, with a gradual build-up as Charlie introduces the idea of the kiss at midnight. The dialogue exchange is lively and keeps the audience engaged, with a nice balance of tension and humor. The countdown at the end adds a sense of urgency and anticipation, leading to a satisfying conclusion.

**Dialogue:**
The dialogue is natural and character-driven, effectively conveying the dynamic between Ash and Charlie. Charlie's playful and slightly flirtatious tone contrasts nicely with Ash's more reserved and skeptical demeanor. The back-and-forth about the kiss is humor

If you recall, the feedback from the base model was cringeworthy, leaning heavily into Gen Z stereotypes to the point where the suggestions were comical. The fine-tuned model shows a significant improvement. While it still provides feedback on clarity, pacing, and dialogue, it both tailors that feedback to the request and adds suggestions for incorporating more elements of Gen Z culture into the scene. It even provided some specific dialogue suggestions which, while a little rough around the edges, are much better than the base model's suggestions. The fine-tuned model was also much better at correctly identifying the different parts of the script, most likely thanks to the labeled scripts that were provided to it during fine-tuning, and was able to differentiate dialogue from stage direction and so on. Overall, this prompt showed a substantial improvement over the output of the base model.
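One practical note: the printed response above stops mid-word, which is most likely because the request capped the output at `max_tokens=500`. A minimal sketch of how this could be checked is below. It assumes the response object exposes `choices[0].finish_reason`, which the OpenAI chat completions API sets to `"length"` when the token limit cuts a response short. The helper name `was_truncated` and the stand-in object are hypothetical and only there so the example runs without an API call.

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return completion.choices[0].finish_reason == "length"


# Hypothetical stand-in shaped like a chat completion, so the sketch runs offline.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="The back-and-forth about the kiss is humor"),
        )
    ]
)

if was_truncated(fake_completion):
    print("Response hit the token limit; consider raising max_tokens and re-running.")
```

With a real response, the same check would be `was_truncated(completion)` right after the `client.chat.completions.create(...)` call.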

#### Test 3

Last, I am going to try giving the model a concept for a short film, as opposed to a whole script, and ask for some feedback on the story idea.

In [7]:
prompt_three = '''
Imagine a short film where the protagonist is a young artist struggling to find inspiration. The film follows the protagonist 
as they go about their day to day life seemingly in a creative slump. However, something happens that opens the artist's eyes
up to the wonder in the ordinary, and we watch as the artist turns their ordinary experiences into extraordinary art.
'''

# Testing the fine-tuned model
completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:personal::AykBr6aP",
  messages=[
    {"role": "system", "content": "You are a focus group reviewing a scene script for an unproduced film or television show. Provide constructive feedback on clarity, pacing, and dialogue. Consider engagement, emotional impact, and storytelling effectiveness in your analysis."},
    {"role": "user", "content": f"Here is a synopsis for a short film:\n\n{prompt_three}\n\nDescribe some feedback you might give on the character development and story pacing."}
  ],
  max_tokens=500
)
     
print(completion.choices[0].message.content)

**Character Development:**
The character development in this short film concept appears to revolve around the growth and transformation of the artist protagonist. This is a compelling arc, as many audience members can likely relate to the struggles of creativity and inspiration. To enhance the character development, consider delving deeper into the protagonist's internal world. What specific obstacles or fears are they facing in their artistic process? What is at stake for them personally and professionally? By fleshing out these internal conflicts, the audience can better understand and empathize with the protagonist's journey.

Additionally, you might consider incorporating a backstory or a glimpse into the protagonist's past to provide context for their current struggles. This can add layers to the character and make their eventual transformation even more resonant. Whether through a brief flashback or a revealing conversation, a well-crafted backstory can enrich the audience's conn

This prompt was interesting because I only provided scripts in the fine-tuning examples, not synopses, but I wanted to see if the generalization abilities of the base model were still present in the fine-tuned model. Sure enough, the fine-tuned model worked relatively well with what I gave it. It also strayed a bit from the clarity, dialogue, and pacing formula and provided feedback relevant to what I was asking for (character development and story pacing). This skill is critical in creative tasks like this, as it allows the model to adapt to different aspects of storytelling without being overly rigid in its responses. Rather than simply mimicking the structure of the fine-tuning examples, the model demonstrated an ability to extrapolate and provide insightful, context-aware feedback. This suggests that fine-tuning can enhance the model’s specificity while still preserving the flexibility and generalization capabilities of the base model, an important balance for applications that require both structure and creative interpretation.
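To make the "strayed from the formula" observation a little more concrete, here is a rough sketch of how one could check which section headings a response uses. It assumes the model marks sections as bold lines of the form `**Heading:**`, as in the outputs printed above. The function names and the shortened sample strings are illustrative, not part of the original notebook.

```python
import re

# The three sections the fine-tuning examples were built around.
EXPECTED_SECTIONS = {"Clarity", "Pacing", "Dialogue"}


def section_headings(feedback: str) -> list[str]:
    """Return headings written as a standalone '**Heading:**' line, in order."""
    return re.findall(r"^\*\*([^*\n]+):\*\*[ \t]*$", feedback, flags=re.MULTILINE)


def follows_template(feedback: str) -> bool:
    """True if every expected section heading appears in the feedback."""
    return EXPECTED_SECTIONS.issubset(section_headings(feedback))


# Shortened, illustrative versions of the two responses shown in this notebook.
scene_feedback = (
    "**Feedback on Scene:**\n\n"
    "**Clarity:**\nThe scene is clear.\n\n"
    "**Pacing:**\nThe pacing is well-executed.\n\n"
    "**Dialogue:**\nThe dialogue is natural."
)
synopsis_feedback = "**Character Development:**\nThe arc revolves around growth."

print(section_headings(scene_feedback), follows_template(scene_feedback))
print(section_headings(synopsis_feedback), follows_template(synopsis_feedback))
```

A check like this only looks at structure, not quality, but it would make it easy to see across many prompts when the model keeps to the training template and when it adapts.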

## Conclusions

Overall, fine-tuning significantly improved the quality of feedback, enhancing the model’s ability to understand the nuances of script structure and provide more insightful, context-aware responses. The model also retained a good amount of adaptability despite the structured examples in the training set, allowing it to go beyond rigid criteria and offer relevant feedback tailored to each scene, a crucial skill for creative applications.

Moving forward, I would love to refine and expand the training data, adding more examples with human-generated feedback to further improve the quality of the model’s output. Adding data on how real films and television shows end up performing (or real responses from focus groups) could also help identify audience trends and align the model’s feedback more closely with what resonates with viewers. If the model could recognize patterns in successful scripts, like pacing choices, character dynamics, or dialogue styles that tend to engage audiences, it could elevate the feedback to another level.

Growing the training data would also allow for the creation of a test set to quantify improvements more systematically. Right now, the evaluation is mostly qualitative: observing how the feedback feels rather than measuring specific performance gains. With a proper test set, I could start tracking metrics like consistency, relevance, and even sentiment alignment with professional script coverage. Ultimately, this project has shown that with the right data and approach, language models can become valuable creative tools, not just for providing feedback but for shaping stronger, more engaging stories.
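As a rough illustration of what a first step toward that kind of evaluation might look like, the sketch below holds out a fraction of the examples as a test set and computes a very simple consistency proxy (word-level Jaccard overlap between two responses to the same prompt). Everything here is an assumption for illustration: the function names, the 20% split, the fixed seed, and the choice of Jaccard overlap are not something used in this project, and a real evaluation would likely need a stronger metric.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def jaccard(a: str, b: str) -> float:
    """Word-level overlap between two texts: 1.0 = same vocabulary, 0.0 = disjoint."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Placeholder IDs standing in for training examples (one per annotated screenplay).
example_ids = list(range(33))
train_ids, test_ids = split_examples(example_ids)
print(len(train_ids), "train /", len(test_ids), "test")

# Two hypothetical responses to the same prompt, compared for consistency.
print(jaccard("the pacing is good", "the pacing is slow"))
```

Running the same held-out prompts through both the base and fine-tuned models, and scoring each against reference feedback, would give a first quantitative comparison to sit alongside the qualitative observations above.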