# Create a fine-tuned model from OpenAI Davinci

### Scrape conversations from a movie script

We want to scrape the conversations into a data structure like the one below.

The `\n\n###\n\n` and `END` strings are separators that tell the model where the prompt and the completion end.

The separators should not appear anywhere else in the prompt or completion.

In [None]:
"""
[{
  "prompt": "USER: Hey, how are you?\n\n###\n\n",
  "completion": "ASSISTANT: I'm good thank you!END"}
]
"""
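
To make the separator rule concrete, here is a minimal sketch of a helper that builds one record in this shape and rejects text that already contains a separator. The names `make_record`, `PROMPT_SEPARATOR`, and `COMPLETION_STOP` are hypothetical and not part of the notebook; the leading space on the completion mirrors what the notebook does later when it adds the separators.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"  # marks the end of the prompt
COMPLETION_STOP = "END"           # marks the end of the completion


def make_record(prompt_text, completion_text):
    """Build one training record (hypothetical helper for illustration)."""
    for text in (prompt_text, completion_text):
        # The separators must only appear at the very end of each field.
        if PROMPT_SEPARATOR in text or COMPLETION_STOP in text:
            raise ValueError(f"separator found inside text: {text!r}")
    return {
        "prompt": f"{prompt_text}{PROMPT_SEPARATOR}",
        "completion": f" {completion_text}{COMPLETION_STOP}",
    }


record = make_record("Hey, how are you?", "I'm good thank you!")
```

The check is case-sensitive, so a line containing the uppercase word `END` would be rejected and would need to be cleaned or skipped first.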

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
response = requests.get('https://imsdb.com/scripts/Her.html')
response.raise_for_status()  # stop early if the page could not be fetched
html = response.content
soup = BeautifulSoup(html, 'html.parser')

Get all of Theodore's and Samantha's lines

In [None]:
import re

conversations = []
for tag in soup.find_all('b'):
    sibling = tag.next_sibling
    # The speaker name is read from the <b> tag and the line from the text that follows it.
    # Requiring a str skips missing siblings and nested tags, which cannot be stripped.
    if tag.string and isinstance(sibling, str):
        conversation = {}
        tag_string = tag.string.strip()
        cleaned_message = sibling.strip()
        cleaned_message = cleaned_message.replace('\r\n', '')
        cleaned_message = cleaned_message.replace('\\', '')
        # Remove parentheticals such as (laughing)
        cleaned_message = re.sub(r'\([^)]*\)', '', cleaned_message)
        # Collapse runs of whitespace into single spaces
        cleaned_message = ' '.join(cleaned_message.split())
        if 'THEODORE' in tag_string:
            conversation['THEODORE'] = cleaned_message
        elif 'SAMANTHA' in tag_string:
            conversation['SAMANTHA'] = cleaned_message
        if conversation:
            conversations.append(conversation)
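
The cleaning steps inside the loop can be tried on their own. The sketch below wraps the same four steps in a hypothetical function, `clean_dialogue`, and runs it on a made-up input string rather than on text from the script.

```python
import re


def clean_dialogue(raw_text):
    """Apply the same cleaning steps as the loop above to one raw line."""
    text = raw_text.strip()
    text = text.replace('\r\n', '')          # drop Windows line breaks
    text = text.replace('\\', '')            # drop stray backslashes
    text = re.sub(r'\([^)]*\)', '', text)    # drop parentheticals like (beat)
    return ' '.join(text.split())            # collapse runs of whitespace


cleaned = clean_dialogue("  Good morning. (beat)\r\n      How are you?  ")
print(cleaned)  # Good morning. How are you?
```

Because the line break is removed rather than replaced with a space, two words are only kept apart when the next line starts with indentation, as in the made-up input above.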


Keep only the exchanges where Theodore speaks and Samantha replies

In [None]:
ts_conversations = []
for i in range(1, len(conversations), 2):
    if 'THEODORE' in conversations[i - 1] and 'SAMANTHA' in conversations[i]:
        ts_conversations.append(conversations[i-1])
        ts_conversations.append(conversations[i])
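
Note that the loop above steps two entries at a time, so it only compares entries (0, 1), (2, 3), and so on; a Theodore line at an odd index followed by a Samantha reply is skipped. If that matters for your data, one possible alternative is to check every adjacent pair. This is a sketch with a hypothetical function name, `pair_adjacent`, not the notebook's method, and it changes which pairs end up in the training set.

```python
def pair_adjacent(conversations):
    """Collect every Theodore line that is directly followed by a Samantha line."""
    pairs = []
    for previous, current in zip(conversations, conversations[1:]):
        if 'THEODORE' in previous and 'SAMANTHA' in current:
            pairs.append(previous)
            pairs.append(current)
    return pairs


# Made-up sample where the only valid exchange starts at an odd index.
sample = [
    {'SAMANTHA': 'Hi.'},
    {'THEODORE': 'Hello.'},
    {'SAMANTHA': 'How are you?'},
    {'THEODORE': 'Fine.'},
]
paired = pair_adjacent(sample)
```

The result keeps the same flat `[theodore, samantha, ...]` layout as `ts_conversations`, so the later cells would work with it unchanged.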


Add separators

In [None]:
training_data = []
for i in range(1, len(ts_conversations), 2):
    data = {}
    data["prompt"] = f"{''.join(ts_conversations[i-1].values())}\n\n###\n\n"
    data["completion"] = f" {''.join(ts_conversations[i].values())}END"
    training_data.append(data)

In [None]:
import pandas as pd
df = pd.DataFrame(training_data)

df.sample(10)

Save as a JSONL file

In [None]:
df.to_json('training_data.jsonl', orient='records', lines=True)
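
Before uploading, it can help to read the file back and confirm that every record follows the separator convention. The sketch below is an assumption about what is worth checking, and `find_bad_records` is a hypothetical helper; it takes the JSONL content as a string so it can be tried without the file.

```python
import json

PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_STOP = "END"


def find_bad_records(jsonl_text):
    """Return the 1-based line numbers of records that break the format."""
    bad_lines = []
    for line_number, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        prompt_ok = record.get("prompt", "").endswith(PROMPT_SEPARATOR)
        completion_ok = record.get("completion", "").endswith(COMPLETION_STOP)
        if not (prompt_ok and completion_ok):
            bad_lines.append(line_number)
    return bad_lines


# Made-up records: the first follows the convention, the second does not.
good = json.dumps({"prompt": "Hi\n\n###\n\n", "completion": " HelloEND"})
bad = json.dumps({"prompt": "Hi", "completion": " Hello"})
bad_lines = find_bad_records(good + "\n" + bad + "\n")
```

With the notebook's file, this could be called as `find_bad_records(open('training_data.jsonl').read())`.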

# Fine-tune the model with OpenAI

In [None]:
# Check if the data are properly formatted
!openai tools fine_tunes.prepare_data -f training_data.jsonl

In [None]:
# Create the fine-tuning job
!openai api fine_tunes.create -t training_data.jsonl -m davinci --suffix "Her"

In [None]:
# List all your fine-tuning jobs
!openai api fine_tunes.list

In [None]:
# Get the status of the fine-tuning job
!openai api fine_tunes.get -i ft-f4gPZuStsshSbVaHx0lE6hlq

In [None]:
# Cancel a fine-tuning job if needed (uncomment to run)
# !openai api fine_tunes.cancel -i ft-GfxNi7ihj2VWIVtFFzrH7ia2

Once training is finished, the model can be found in the playground: https://platform.openai.com/playground and can be used in API calls.
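
As a rough sketch of such an API call, the prompt should end with the same `\n\n###\n\n` separator used in training, and `END` can be passed as the stop sequence. This assumes the legacy `openai` Python package (versions before 1.0) that matches the `fine_tunes` CLI used above, and an API key set in the environment. The model name is a placeholder, and `build_request`, `ask_model`, and the `max_tokens` and `temperature` values are illustrative choices, not taken from the notebook.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"
COMPLETION_STOP = "END"


def build_request(message, model_name):
    """Assemble keyword arguments for a completion call (hypothetical helper)."""
    return {
        "model": model_name,
        "prompt": f"{message}{PROMPT_SEPARATOR}",  # same separator as in training
        "stop": [COMPLETION_STOP],                 # cut the reply at the stop marker
        "max_tokens": 100,                         # illustrative value
        "temperature": 0.7,                        # illustrative value
    }


def ask_model(message, model_name):
    """Send the request with the legacy openai package (< 1.0); needs network access and an API key."""
    import openai  # imported here so build_request works without the package
    response = openai.Completion.create(**build_request(message, model_name))
    return response["choices"][0]["text"].strip()


# "YOUR_FINE_TUNED_MODEL" is a placeholder; use the model name reported by fine_tunes.get.
request = build_request("Hey, how are you?", "YOUR_FINE_TUNED_MODEL")
```

Keeping the request assembly separate from the network call makes it easy to inspect the exact prompt and stop sequence before spending any tokens.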