# **Pipeline to Fine-Tune GPT-3.5-Turbo**

To fine-tune on our data, we will use OpenAI's GPT-3.5-Turbo model. To do that, we need to convert the .csv file (ocpoem_final_with_themes.csv) into a JSONL file (one JSON object per line) and then fine-tune the model on it.

Before doing so, I found one row in the final dataset that raised a concern. Line 311 of the CSV (DataFrame index 309, because the header occupies line 1 and pandas counts rows from 0) is not a poem but an advertisement, and our labeling model (GPT-4.1-Mini) correctly caught that by returning no themes for it. We filter this row out because a training example with an empty theme list may confuse the model during fine-tuning.

**Import dependencies**

In [29]:
import pandas as pd
import json
import openai

In [13]:
df = pd.read_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoem_final_with_themes.csv")

**Filter out bad line**

In [14]:
df.shape

(980, 2)

In [24]:
# Print the row you want to drop (optional check)
print("Row 311 before dropping:\n", df.iloc[309])

Row 311 before dropping:
 cleaned    greetings, reddit.com at tay bridge press, we ...
themes                                                    []
Name: 309, dtype: object


In [25]:
# Drop row 311 (index 309)
df_cleaned = df.drop(index=309).reset_index(drop=True)

# Save to a new CSV file
df_cleaned.to_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoem_final_with_themes_cleaned.csv", index=False)

In [26]:
df_cleaned.shape

(979, 2)
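
Dropping by a hard-coded index works for this one row, but it would silently drop the wrong row if the CSV were regenerated in a different order. A minimal sketch of filtering by content instead, assuming the `themes` column always holds a stringified Python list as in the output above. `has_themes` is a hypothetical helper name, not part of the original notebook:

```python
import ast

def has_themes(value):
    """Return True if `value` is a stringified, non-empty list of themes."""
    if not isinstance(value, str):
        return False  # NaN or any other non-string cell
    try:
        parsed = ast.literal_eval(value.strip())
    except (ValueError, SyntaxError):
        return False  # not a valid Python literal
    return isinstance(parsed, list) and len(parsed) > 0

print(has_themes("['love', 'loss']"))
print(has_themes("[]"))
```

With pandas, this could be applied as `df[df["themes"].apply(has_themes)].reset_index(drop=True)`, which would remove the advertisement row without referring to its position.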

**Preparing the fine-tune data: turn .csv into a JSONL file**

In [None]:
df = pd.read_csv("C:/Users/Marielle/OneDrive/Desktop/LLM Project/ocpoem_final_with_themes_cleaned.csv")

In [28]:
import ast  # safe parsing of the stringified theme lists

fine_tune_data = []

for _, row in df.iterrows():
    themes = row["themes"]
    if not isinstance(themes, str) or themes.strip() in ("", "[]"):
        continue  # skip missing (NaN) or empty theme lists

    try:
        # Parse the string into a Python list without executing arbitrary code
        themes_list = ast.literal_eval(themes)
    except (ValueError, SyntaxError):
        continue  # skip rows whose themes cannot be parsed

    fine_tune_data.append({
        "messages": [
            {"role": "user", "content": row["cleaned"]},
            {"role": "assistant", "content": json.dumps(themes_list)}  # keep it as stringified list
        ]
    })

# Save as JSONL
with open("ocpoem_finetune_data.jsonl", "w", encoding="utf-8") as f:
    for item in fine_tune_data:
        f.write(json.dumps(item) + "\n")

print("Fine-tuning file saved: ocpoem_finetune_data.jsonl")

Fine-tuning file saved: ocpoem_finetune_data.jsonl
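
Before uploading, it can help to check that every line of the JSONL file has the chat structure the fine-tuning job expects. The sketch below is one possible sanity check, not an official validator: `validate_chat_jsonl` is a hypothetical helper, and the rules it enforces (a non-empty `messages` list, string contents, an assistant message last) are assumptions based on the records built above.

```python
import json

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing 'messages' list"))
            continue
        if any(not isinstance(m, dict) or not isinstance(m.get("content"), str)
               for m in messages):
            problems.append((number, "message without string content"))
            continue
        if messages[-1].get("role") != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems

# Two illustrative records: the second one lacks an assistant reply
sample = [
    json.dumps({"messages": [
        {"role": "user", "content": "roses are red"},
        {"role": "assistant", "content": "[\"love\"]"},
    ]}),
    json.dumps({"messages": [{"role": "user", "content": "no reply"}]}),
]
print(validate_chat_jsonl(sample))
```

To check the real file, pass the open file object, for example `validate_chat_jsonl(open("ocpoem_finetune_data.jsonl", encoding="utf-8"))`; iterating over a file yields one line at a time.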


**Upload the file to OpenAI**

In [32]:
client = openai.OpenAI(api_key="Your api key")
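
Pasting the key directly into a notebook cell risks leaking it when the notebook is shared. A minimal sketch of reading it from an environment variable instead, assuming the key has been exported as `OPENAI_API_KEY` (the variable the OpenAI Python client also looks for when `api_key` is omitted). `load_api_key` is a hypothetical helper:

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable, failing early if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return key
```

The client could then be created with `openai.OpenAI(api_key=load_api_key())`.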

In [34]:
with open("ocpoem_finetune_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

print("File uploaded. File ID:", file.id)

File uploaded. File ID: file-HzAqwSH7MwRd1k3fhcYbUf


**Fine-tune GPT-3.5-Turbo**

In [35]:
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-3.5-turbo")

print("Fine-tune job created. Job ID:", job.id)

Fine-tune job created. Job ID: ftjob-RoUd7fn1iAtUgvJr0Trd0v8J


In [40]:
job_id = "ftjob-RoUd7fn1iAtUgvJr0Trd0v8J"  # the fine-tuning job ID (starts with "ftjob-"), not the uploaded file ID

In [59]:
# Check status
job_status = client.fine_tuning.jobs.retrieve(job.id)
print(job_status.status)  # one of 'validating_files', 'queued', 'running', 'succeeded', 'failed', or 'cancelled'

running
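
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the 60-second interval is arbitrary, and the `retrieve` callable is passed in so the loop itself does not depend on a live client.

```python
import time

# Statuses after which the job will not change any more
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll `retrieve(job_id)` until the job reaches a terminal status, then return it."""
    while True:
        status = retrieve(job_id).status
        print("status:", status)
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

In this notebook it could be called as `wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`.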


In [96]:
# Check events
events = client.fine_tuning.jobs.list_events(job.id)
for event in events.data:
    print(event.message)

Evaluating model against our usage policies
New fine-tuned model created
Checkpoint created at step 1958
Checkpoint created at step 979
Step 2937/2937: training loss=0.56
Step 2936/2937: training loss=0.12
Step 2935/2937: training loss=0.17
Step 2934/2937: training loss=0.20
Step 2933/2937: training loss=0.34
Step 2932/2937: training loss=0.46
Step 2931/2937: training loss=0.02
Step 2930/2937: training loss=0.21
Step 2929/2937: training loss=0.10
Step 2928/2937: training loss=0.24
Step 2927/2937: training loss=0.29
Step 2926/2937: training loss=0.24
Step 2925/2937: training loss=0.22
Step 2924/2937: training loss=0.16
Step 2923/2937: training loss=0.65
Step 2922/2937: training loss=0.41
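
The events come back most recent first, and the per-step loss is noisy (between 0.02 and 0.65 in the steps shown), so a single value says little. A minimal sketch of extracting the losses and averaging them, assuming the message format shown in the output above. `parse_losses` is a hypothetical helper:

```python
import re

LOSS_PATTERN = re.compile(r"Step (\d+)/(\d+): training loss=([0-9.]+)")

def parse_losses(messages):
    """Extract (step, loss) pairs from event messages, sorted by step."""
    pairs = []
    for message in messages:
        match = LOSS_PATTERN.search(message)
        if match:  # ignore non-step messages such as checkpoint notices
            pairs.append((int(match.group(1)), float(match.group(3))))
    return sorted(pairs)

# A few messages copied from the event log above
messages = [
    "New fine-tuned model created",
    "Step 2937/2937: training loss=0.56",
    "Step 2936/2937: training loss=0.12",
    "Step 2935/2937: training loss=0.17",
]
losses = parse_losses(messages)
mean_loss = sum(loss for _, loss in losses) / len(losses)
print(losses)
print(round(mean_loss, 3))
```

On the real job, the input would be `[event.message for event in events.data]`.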


In [98]:
# Check status
job_status = client.fine_tuning.jobs.retrieve(job.id)
print(job_status.status) 

succeeded


**Submitted and started at 12:37 AM, finished at 2:07 AM: 2937 steps in about 1 hour 30 minutes.**
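
The step count and timings can be cross-checked with a little arithmetic. The 979 examples and 2937 steps come from the outputs above; reading the ratio as the number of epochs assumes one training example per step, which the notebook does not record but which matches the checkpoints at steps 979 and 1958.

```python
from datetime import datetime

examples = 979       # rows in the cleaned dataset
total_steps = 2937   # from the event log

# Assumption: one example per step, so steps / examples = epochs
epochs = total_steps / examples

start = datetime.strptime("12:37 AM", "%I:%M %p")
end = datetime.strptime("2:07 AM", "%I:%M %p")
minutes = (end - start).total_seconds() / 60
seconds_per_step = minutes * 60 / total_steps

print(epochs, minutes, round(seconds_per_step, 2))
```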

**Save the model ID**

In [99]:
jobs = client.fine_tuning.jobs.list(limit=5)
for j in jobs.data:  # separate name so the earlier `job` variable is not overwritten
    print(f"Job ID: {j.id}, Model: {j.fine_tuned_model}")

Job ID: ftjob-RoUd7fn1iAtUgvJr0Trd0v8J, Model: ft:gpt-3.5-turbo-0125:personal::C15qtfQm


In [100]:
fine_tuned_model_id = jobs.data[0].fine_tuned_model
print("Fine-tuned model ID:", fine_tuned_model_id)

Fine-tuned model ID: ft:gpt-3.5-turbo-0125:personal::C15qtfQm
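
Once the model ID is saved, requests to the fine-tuned model should mirror the training format: a single user message containing the poem, with the reply expected to be a JSON-encoded list of themes. The sketch below only builds the request and parses a reply, so it runs without an API call. `build_request` and `parse_themes` are hypothetical helpers, and `temperature=0` is an assumption, not a setting from the original notebook.

```python
import json

def build_request(model_id, poem):
    """Build chat-completion arguments that mirror the training format (one user message)."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": poem}],
        "temperature": 0,  # assumption: stable output is preferred for labeling
    }

def parse_themes(content):
    """Parse a reply that is expected to be a JSON-encoded list of themes."""
    try:
        themes = json.loads(content)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON
    return themes if isinstance(themes, list) else []

request = build_request("ft:gpt-3.5-turbo-0125:personal::C15qtfQm", "roses are red")
print(request["messages"])
print(parse_themes('["love", "nature"]'))
```

With the client from earlier, the call would look like `client.chat.completions.create(**build_request(fine_tuned_model_id, poem))`, and the reply text is available as `response.choices[0].message.content`.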
