# **Pipeline to Label the Dataset Using GPT-4.1-Mini**

After cleaning our dataset, we should prep our data for labeling using GPT-4.1-Mini. Before we do that, we need to look at the model's rate limits found here: https://platform.openai.com/settings/organization/limits

To use the OpenAI API, you must first create an account and add your billing information. After that, create a project and generate your own API key. It is important to choose a model that is good enough for the task, fast, and cheap. I chose GPT-4.1-Mini for labeling and GPT-3.5-Turbo for fine-tuning. It is also **very important to respect rate limits**, as each model has an RPD (requests per day) limit.

But first, I did a little more cleaning to help prepare my data:
- open ocpoetry_cleaned.csv
- delete poem, author, body_text columns
- save new csv file with just the "cleaned" column as ocpoetry_df.csv
    - this way the model only sees the cleaned poem text
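
The cleanup steps above could look roughly like this in pandas. This is only a sketch: the column names come from the list above, the helper name `keep_cleaned_column` is made up for illustration, and the file paths in the usage comment assume the CSVs sit next to the notebook.

```python
import pandas as pd


def keep_cleaned_column(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the raw columns listed above, leaving the 'cleaned' column."""
    # errors="ignore" skips any of these columns that are already missing
    return df.drop(columns=["poem", "author", "body_text"], errors="ignore")


# Usage (paths are assumptions, adjust to your own setup):
# df = pd.read_csv("ocpoetry_cleaned.csv")
# keep_cleaned_column(df).to_csv("ocpoetry_df.csv", index=False)
```

Note that any column not named in the list would be kept, so it is worth checking `df.columns` before saving.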

**Quick explanation of code**

To respect rate limits, I batched the poems in groups of 5 and added time.sleep(1) between requests. Rate limits depend on your account. I started off with 200 requests per day, and the limit was raised to 500 ten minutes later.
- If your limit increases, you can batch more poems together or reduce the sleep time!
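
To see how batch size and the daily limit interact, here is a small back-of-the-envelope calculation. The 3,000-poem dataset size is a made-up number for illustration (use `len(df)` for your own data), and the helper names are hypothetical.

```python
import math


def requests_needed(num_poems: int, batch_size: int) -> int:
    """One API request per batch, rounding up for a partial final batch."""
    return math.ceil(num_poems / batch_size)


def days_needed(num_poems: int, batch_size: int, requests_per_day: int) -> int:
    """Number of days required to stay under a requests-per-day (RPD) limit."""
    return math.ceil(requests_needed(num_poems, batch_size) / requests_per_day)


# Hypothetical dataset of 3,000 poems, batched in groups of 5
print(requests_needed(3000, 5))   # 600 requests
print(days_needed(3000, 5, 200))  # 3 days at 200 RPD
print(days_needed(3000, 5, 500))  # 2 days at 500 RPD
```

Larger batches mean fewer requests, but each response gets longer, so the output token limit has to grow with the batch size.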

**Install dependencies and define API key**

In [None]:
# Run "pip install openai" first if you haven't already

In [1]:
import json
import os
import time

import openai
import pandas as pd
from openai import OpenAI

In [2]:
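openai.api_key = "Your api key"  # Replace with your own key; never commit a real key

In [3]:
client = OpenAI(api_key=openai.api_key)  # Pass the key explicitly; a bare OpenAI() only reads the OPENAI_API_KEY env variable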
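
If you would rather keep the key out of the notebook entirely, one option is to read it from the `OPENAI_API_KEY` environment variable and fail early with a clear message when it is missing. This is a sketch, not what I ran above, and `load_api_key` is a hypothetical helper name.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key


# Usage (not run here; needs the openai package and a real key):
# from openai import OpenAI
# client = OpenAI(api_key=load_api_key())
```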

**Test with the first 10 rows**

In [8]:
def get_themes_for_batch(poems_batch, client):
    prompt = "For each poem below, give me 5 symbolism themes as a JSON list.\n\n"
    for i, poem in enumerate(poems_batch, start=1):
        prompt += f"{i}. \"{poem}\"\n\n"
    prompt += "Respond ONLY with a JSON array of arrays. No explanations or extra text.\nExample: [[\"theme1\", \"theme2\"], [\"themeA\", \"themeB\"]]"

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that extracts symbolism themes from poems."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=500,
        temperature=0.2,
    )
    
    themes_json = response.choices[0].message.content
    print("Raw response:", themes_json)  # Debug print to check output
    
    try:
        themes = json.loads(themes_json)
    except json.JSONDecodeError:
        print("JSON decode error. Saving raw response to bad_response.txt")
        with open("bad_response.txt", "a", encoding="utf-8") as f:  # append so earlier failures are kept
            f.write(themes_json + "\n")
        themes = [[] for _ in poems_batch]  # fallback: a separate empty list for each poem
    
    return themes

# Create OpenAI client
client = openai.OpenAI(api_key=openai.api_key)

# Load your cleaned CSV (make sure path is correct)
df = pd.read_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoetry_df.csv")

# Only take first 10 rows to test
df_test = df.head(10).copy()

batch_size = 5
themes_col = []

for i in range(0, len(df_test), batch_size):
    batch = df_test['cleaned'].iloc[i:i+batch_size].tolist()
    themes_batch = get_themes_for_batch(batch, client)
    themes_col.extend(themes_batch)
    time.sleep(1)  # Respect rate limits

df_test['themes'] = themes_col

df_test.to_csv("ocpoem_test_with_themes.csv", index=False)

print("Saved CSV with themes for first 10 poems!")

Raw response: [
  [
    "struggle and resilience",
    "emotional drowning",
    "hope and recovery",
    "inner conflict",
    "self-liberation"
  ],
  [
    "moving on",
    "closure",
    "heartache and healing",
    "finality",
    "emotional distance"
  ],
  [
    "unrequited love",
    "emotional neglect",
    "distance and disconnection",
    "self-realization",
    "longing and loneliness"
  ],
  [
    "isolation",
    "hidden pain",
    "emotional emptiness",
    "facade and pretense",
    "invisibility"
  ],
  [
    "industrial imagery",
    "monotony and routine",
    "isolation",
    "weariness",
    "detachment"
  ]
]
Raw response: [
  [
    "Environmental pollution",
    "Industrialization and its impact",
    "Ephemeral beauty",
    "Decay and deterioration",
    "Invisible dangers"
  ],
  [
    "Loneliness and isolation",
    "Hidden sadness",
    "Mental health struggles",
    "Desire for understanding",
    "Emotional vulnerability"
  ],
  [
    "Communication beyond 

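One thing the test run does not guard against is a response that parses as JSON but has the wrong number of entries, which would make the `df_test['themes'] = themes_col` assignment fail on a length mismatch. Below is a stricter parser sketch. The name `parse_themes` is hypothetical, and the code-fence stripping is there on the assumption that the model sometimes wraps its JSON in a Markdown fence.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_themes(raw: str, expected: int) -> list[list[str]]:
    """Parse a JSON array of arrays; return `expected` empty lists on any problem."""
    fallback = [[] for _ in range(expected)]  # independent lists, one per poem
    text = raw.strip()
    # Strip a surrounding Markdown code fence (optionally tagged "json") if present
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        themes = json.loads(text)
    except json.JSONDecodeError:
        return fallback  # also covers responses cut off by the token limit
    # Require exactly one list of themes per poem in the batch
    if not isinstance(themes, list) or len(themes) != expected:
        return fallback
    if not all(isinstance(t, list) for t in themes):
        return fallback
    return themes
```

Inside `get_themes_for_batch`, this could replace the `json.loads` call, for example `themes = parse_themes(themes_json, len(poems_batch))`.
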
**Proceed with Whole Dataset**

In [11]:
def get_themes_for_batch(poems_batch, client):
    prompt = "For each poem below, give me 5 symbolism themes as a JSON list.\n\n"
    for i, poem in enumerate(poems_batch, start=1):
        prompt += f"{i}. \"{poem}\"\n\n"
    prompt += "Respond ONLY with a JSON array of arrays. No explanations or extra text.\nExample: [[\"theme1\", \"theme2\"], [\"themeA\", \"themeB\"]]"

    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that extracts symbolism themes from poems."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=500,
        temperature=0.2,
    )
    
    themes_json = response.choices[0].message.content
    print("Raw response:", themes_json)  # Debug print to check output
    
    try:
        themes = json.loads(themes_json)
    except json.JSONDecodeError:
        print("JSON decode error. Saving raw response to bad_response.txt")
        with open("bad_response.txt", "a", encoding="utf-8") as f:  # append so earlier failures are kept
            f.write(themes_json + "\n")
        themes = [[] for _ in poems_batch]  # fallback: a separate empty list for each poem
    
    return themes

# Create OpenAI client
client = openai.OpenAI(api_key=openai.api_key)

# Load your cleaned CSV (make sure path is correct)
df = pd.read_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoetry_df.csv")

# Use the full dataset this time
df_full = df.copy()

batch_size = 5
themes_col = []

for i in range(0, len(df_full), batch_size):
    batch = df_full['cleaned'].iloc[i:i+batch_size].tolist()
    themes_batch = get_themes_for_batch(batch, client)
    themes_col.extend(themes_batch)
    time.sleep(1)  # Respect rate limits

df_full['themes'] = themes_col

df_full.to_csv("ocpoem_final_with_themes.csv", index=False)

print("Saved CSV with themes for all poems!")

Raw response: [
  [
    "struggle and resilience",
    "emotional drowning",
    "hope and recovery",
    "inner conflict",
    "self-liberation"
  ],
  [
    "closure and moving on",
    "heartache and healing",
    "finality and goodbye",
    "emotional distance",
    "acceptance"
  ],
  [
    "unrequited love",
    "emotional neglect",
    "distance and disconnection",
    "self-realization",
    "longing and disappointment"
  ],
  [
    "loneliness and invisibility",
    "emotional isolation",
    "masking pain",
    "neglect and abandonment",
    "despair"
  ],
  [
    "industrial imagery",
    "monotony and fatigue",
    "isolation in work",
    "mechanization and coldness",
    "nighttime labor"
  ]
]
Raw response: [
  [
    "Environmental degradation",
    "Industrial pollution",
    "Fragility of nature",
    "Urban decay",
    "Invisible threats"
  ],
  [
    "Hidden sadness",
    "Emotional isolation",
    "Inner conflict",
    "Desire for understanding",
    "Mental health 

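On a full run, a single failed request (for example, a rate-limit error) stops the loop and loses everything collected so far. A retry wrapper with exponential backoff is one way to soften that. This is a generic sketch, not something I used above: `call_with_backoff` is a hypothetical helper, and it retries on any `Exception` by default. In practice you would probably narrow `retry_on` to the SDK's rate-limit exception (`openai.RateLimitError` in the v1 Python SDK, as far as I know).

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage inside the batching loop (not run here):
# themes_batch = call_with_backoff(lambda: get_themes_for_batch(batch, client))
```

With the defaults, the waits are 1, 2, 4, and 8 seconds before the fifth and final attempt.
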
**Labeling done!** The run took 20 minutes, but it took 2 hours to figure out.