# Model Fine Tuning
---
I think GPT itself gives a good description of what fine-tuning is. Here is what it output when asked:  
> Fine-tuning a GPT model means adjusting the model's parameters on a smaller dataset in order to adapt it to a specific task. The model is already pre-trained on a large dataset, but it may not be perfectly suited for the specific task it's being used for. Fine-tuning allows the model to learn additional information from a smaller dataset so that it can perform better on the specific task.  
>
> For example, imagine a GPT model was pre-trained on a large dataset of general text, such as Wikipedia articles. If you want to use this model for a specific task like language translation, you would fine-tune it on a smaller dataset of translated text. This would allow the model to learn additional information about how the languages relate to one another, and it would help the model translate more accurately.  
>
> In summary, fine-tuning is a way to adapt a pre-trained model to a specific task by training it further with a smaller dataset. This allows the model to improve its performance on the specific task.  

So, in our case, we want to fine-tune our model on mortgage information to help a borrower complete a mortgage application.  

In [1]:
# Imports
import pandas as pd
import numpy as np
import openai
import os
import re
import time
import warnings
from ipywidgets import widgets
from numpy.random import Generator, PCG64

warnings.filterwarnings("ignore", message="Unverified HTTPS request is being made to host")
os.environ["CURL_CA_BUNDLE"] = ""
pd.set_option("display.max_colwidth", None)
rand = Generator(PCG64(seed=13))

# You must have your own OpenAI api key and store it 
# as an environment variable on your computer
openai.api_key = os.getenv("OPENAI_API_KEY")

## Creating the training data
---
We use text from the Consumer Financial Protection Bureau (CFPB) website as our fine-tuning data set. We scraped it in the `scrape_cfpb_website` notebook. In the future, we'd want to augment this with additional information about American mortgages, since more relevant data generally helps.  

When we create the training data, we need to format it as "prompt" (what the user would type in) and "completion" (what the model would return). Fortunately, the structure from the CFPB website is perfectly suited for this application as its articles are titled with questions. For its "key terms," we create questions that ask what the term means. We randomly generate different forms of the question so that the model doesn't always expect a question in the exact same format (e.g., "what is \___?").

OpenAI's API expects the data in a very specific format. The question has to end with a "prompt stop" that tells the model that this is the end of the question. It has to be characters that aren't in any of the questions we are training on or likely to be in a user's question. We chose to use `\nX0X\n` as our stop. We also have to provide a "completion stop" that tells the model that it should stop answering the question. We chose `###` for our completion stop. Additionally, because of how tokenization works, all completions have to start with a whitespace.  

Finally, the data has to be saved to a JSON lines file.
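
As a minimal sketch of the format described above, the snippet below builds a single record by hand and serializes it as one JSON line. The question and answer text are made up for illustration, and `make_record` is a hypothetical helper; only the two stop sequences come from this notebook.

```python
import json

PROMPT_STOP = "\nX0X\n"   # marks the end of the question
COMPLETION_STOP = "###"   # marks the end of the answer

def make_record(question, answer):
    """Format one question/answer pair as a fine-tuning record."""
    return {
        "prompt": question + PROMPT_STOP,
        # Completions start with a space because of how tokenization works
        "completion": " " + answer.strip() + COMPLETION_STOP,
    }

# Hypothetical example pair, not taken from the CFPB data
record = make_record("What is escrow?", "An example answer goes here.")

# A JSON lines file holds one JSON object per line
line = json.dumps(record)
print(line)
```

Note that `json.dumps` escapes the newlines inside the prompt stop, so each record still occupies exactly one line of the file.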

In [2]:
# Load CFPB Key Terms and Mortgage questions
# as created in the scrape_cfpb_website notebook
key_path = "../data/cfpb_key_terms.csv"
key_df = pd.read_csv(key_path)

question_path = "../data/cfpb_mortgage_questions.csv"
quest_df = pd.read_csv(question_path)

In [3]:
# Prepare and merge the 2 data sets
prompt_stop = "\nX0X\n"
completion_stop = "###"

def term_case(term):
    """
    Make a term lower case except where it is an acronym
    """
    s = re.split(r"(\s|-)", term, maxsplit=1)
    
    if len(s) == 1 and len(s[0]) <= 4:
        return term.upper()
    elif len(s[0]) == 3:
        return s[0].upper() + "".join(s[1:]).lower()
    upper = lambda s: s.group(1).upper()
    return re.sub(r"(\([a-z]+\))", upper, term.lower())

def a_an(word):
    """ 
    Decide if the preceding article should be a or an. 
    """
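    # Prefixes that take "an": vowels, silent-h words, and numbers
    # pronounced with a leading vowel sound (8, 11, 18).
    # The silent-h words are spelled out because a bare "ho"
    # prefix would also match words like "home".
    an_prefixes = ("a", "e", "i", "o", "u",
                   "hour", "honest", "honor",
                   "8", "11", "18")
    # str.startswith accepts a tuple of prefixes
    if word.lower().startswith(an_prefixes):
        return "an"
    return "a"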

def create_key_question(term):
    """
    Make a question out of the 'key term'. Randomly
    vary the format of the question so the model gets
    some variety.
    """
    term = term_case(term)
    n = rand.random()
    
    # Plural terms
    if term.endswith("s"):
        if n > 0.5:
            return f"What are {term}?"
        return f"Can you explain {term}?"
    # Singular terms
    if n < 0.25:
        return f"What is {a_an(term)} {term}?"
    elif n < 0.5:
        return f"What does {term} mean?"
    elif n < 0.75:
        return f"Explain {term}."
    return f"Define {term}"

def make_answer(row):
    """
    Remove the question from the content column.
    The content column already has broken up answers
    that are too long into multiple entries.
    """
    return row.content.replace(row.question, "")
    

df = pd.DataFrame({
    "prompt": pd.concat([
        key_df.term.apply(create_key_question),
        quest_df.question
    ]) + prompt_stop,
    "completion": " " + pd.concat([
        key_df.definition,
        quest_df.apply(make_answer, axis=1)
    ]) + completion_stop
}).reset_index(drop=True)

df.head(2)

| | prompt | completion |
|---|---|---|
| 0 | Define 5/1 adjustable rate mortgage\nX0X\n | A 5/1 adjustable rate mortgage (ARM) or 5-year ARM is a mortgage loan where "5" is the number of years your initial interest rate will stay fixed. The "1" represents how often your interest rate will adjust after the initial five-year period ends. The most common fixed periods are 3, 5, 7, and 10 years and "1," is the most common adjustment period. It's important to carefully read the contract and ask questions if you're considering an ARM. ### |
| 1 | Define ability-to-repay rule\nX0X\n | The ability-to-repay rule is the reasonable and good faith determination most mortgage lenders are required to make that you are able to pay back the loan. ### |


In [4]:
# Save to a json lines file
fpath = "../data/fine_tune_data.jsonl"
df.to_json(fpath, orient="records", lines=True)

### Double check data formatting
---
OpenAI provides a command line data preparation tool that checks whether the data is properly formatted.   
From your terminal, use `cd` to navigate to the directory where your data is 
located. Then run the following command and respond to any follow up prompts.
```
openai tools fine_tunes.prepare_data -f fine_tune_data.jsonl
```
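
If you want a quick local sanity check before running the CLI tool, a sketch like the following can catch the formatting problems this notebook cares about. `check_records` is a hypothetical helper, not part of the OpenAI tooling, and it only checks the conventions described above (prompt stop, leading whitespace, completion stop).

```python
import json

def check_records(lines, prompt_stop="\nX0X\n", completion_stop="###"):
    """Return a list of (line_number, problem) tuples for JSON lines records."""
    problems = []
    for i, raw in enumerate(lines, start=1):
        rec = json.loads(raw)
        prompt, completion = rec["prompt"], rec["completion"]
        if not prompt.endswith(prompt_stop):
            problems.append((i, "prompt does not end with the prompt stop"))
        elif prompt_stop in prompt[:-len(prompt_stop)]:
            problems.append((i, "prompt stop also appears inside the question"))
        if not completion.startswith(" "):
            problems.append((i, "completion does not start with whitespace"))
        if not completion.endswith(completion_stop):
            problems.append((i, "completion does not end with the completion stop"))
    return problems

# Two hypothetical records: the first is well formed, the second is not
sample = [
    json.dumps({"prompt": "What is escrow?\nX0X\n", "completion": " Example answer.###"}),
    json.dumps({"prompt": "Define escrow", "completion": "Example answer."}),
]
for line_number, problem in check_records(sample):
    print(line_number, problem)
```

To check the file written earlier, you could pass an open file handle instead of the sample list, since iterating over a file yields one record per line.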

### Upload file
---
Once your data is properly formatted, you have to upload your file to OpenAI's server. There
are various ways to do this (as shown in the [documentation](https://beta.openai.com/docs/api-reference/files/upload)), but we are going to use the Python implementation below.  
The response will contain the uploaded file's ID, which we will need when training.

In [5]:
# Upload data for fine tuning
upload_response = openai.File.create(
  file=open(fpath, "rb"),
  purpose="fine-tune"
)
file_id = upload_response["id"]
print("Uploaded file ID:", file_id)

Uploaded file ID: file-kgAo80TwjAYP77Pt6wgxk9lH


### Fine tune the model
---
Finally, it's time to actually fine tune your model.  

You can pick among four base models: ada, babbage, curie, and davinci. Ada is the fastest and cheapest; davinci is the slowest and most expensive. For
many tasks, curie will be sufficient. You can check pricing [here](https://openai.com/api/pricing/#faq-fine-tuning-pricing-calculation).   

We will use the following arguments:  
- `training_file = file_id`: the ID of the training data file we uploaded previously.
- `model = "curie"`: curie is usually a good balance of answer quality, latency, and cost. For a more capable but more expensive model, use "davinci".
- `n_epochs = 3`: the number of times the model goes through the data. Too many passes will make the model "memorize" your data (which isn't good). The cost scales directly with this number.
- `suffix = "mortybot_chatbot"`: this lets us have a custom name for our model.  
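
Because the cost scales with the number of epochs, it can help to estimate it before kicking off a run. The sketch below assumes the bill is proportional to the number of tokens in the training file times the number of epochs, and it approximates the token count at roughly four characters per token, which is only a rule of thumb for English text. `estimate_training_cost` is a hypothetical helper, and the price per 1,000 tokens is a placeholder argument, not a real quote; look up the current value on the pricing page.

```python
def estimate_training_cost(texts, n_epochs, price_per_1k_tokens, chars_per_token=4):
    """Rough fine-tuning cost estimate in dollars.

    Assumptions (illustrative, not an official formula):
    - cost is proportional to training tokens times epochs
    - one token is about `chars_per_token` characters of English text
    """
    n_chars = sum(len(t) for t in texts)
    n_tokens = n_chars / chars_per_token
    return n_tokens / 1000 * price_per_1k_tokens * n_epochs

# Hypothetical data: 100 records of 400 characters each, placeholder price
texts = ["x" * 400] * 100
cost = estimate_training_cost(texts, n_epochs=3, price_per_1k_tokens=0.01)
print(f"Estimated cost: ${cost:.2f}")
```

With the DataFrame built earlier, you could pass `df.prompt + df.completion` as `texts`. Treat the result as a ballpark figure only; the exact cost is reported in the training run's events.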

Our data set is small, so the training itself takes only about five minutes. The total wait, however, depends on the other jobs in the system's queue: it could be only a few minutes or it could be hours.

In [6]:
# Run fine tuning
MODEL = "curie"
EPOCHS = 3

print(f"Kicked off at {pd.to_datetime('today').strftime('%I:%M %p')}\n")
train_response = openai.FineTune.create(
    training_file=file_id,
    model=MODEL,
    n_epochs=EPOCHS,
    suffix="mortybot_chatbot"
)
train_id = train_response["id"]
print("Training run ID:", train_id)

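# Check at one-minute intervals whether the model is done training
while True:
    train_info = openai.FineTune.retrieve(train_id)
    status = train_info["status"]
    if status == "succeeded":
        model_name = train_info["fine_tuned_model"]
        break
    if status in ("failed", "cancelled"):
        # Stop instead of polling forever on a terminal failure state
        raise RuntimeError(f"Fine-tune {train_id} ended with status: {status}")
    time.sleep(60)
print(f"Model name: {model_name}\n")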

# Print out training run information
events = {e["message"]: e for e in train_info["events"]}
cost = [e for e in events if e.startswith("Fine-tune costs")][0]
print(cost)

t0 = events[f"Created fine-tune: {train_id}"]["created_at"]
t1 = events["Fine-tune started"]["created_at"]
t2 = events["Fine-tune succeeded"]["created_at"]
print(f"Queue time: {(t1 - t0) / 60:.2f} minutes")
print(f"Train time: {(t2 - t1) / 60:.2f} minutes")

Kicked off at 11:17 AM

Training run ID: ft-Xof8qaRPtmUXo0I9NPJpiN1B
Model name: curie:ft-hannah:mortybot-chatbot-2023-01-30-18-29-25

Fine-tune costs $0.81
Queue time: 65.87 minutes
Train time: 6.00 minutes


### Using the fine tuned model
---
There are many ways to make requests to our model, but for now we're going to call it in Python from this notebook. In practice, it would be integrated into your application.  
When we make a request for a completion, we have to append the same prompt
stop sequence that we added to every prompt in our training set above. Otherwise, the model
won't recognize where the question ends.  

We will set the following arguments: 
- `model = model_name`: the name of our fine-tuned model from above
- `max_tokens = 175`: the maximum number of tokens to generate in the completion (more tokens cost more)
- `temperature = 0`: we don't want the model to get 'creative' with its answers
- `frequency_penalty = 2`: penalize tokens the more often they have already appeared, to curb its desire to repeat sentences
- `presence_penalty = -1`: a negative value encourages reusing tokens that have already appeared, which helps keep the answer on topic
- `best_of = 3`: generate 3 completions on the server and return the best one (the one with the highest log probability per token)
- `stop = completion_stop`: the stop sequence we put in our training data so that it knows when to stop generating a response (could be shorter than max_tokens)

The model sometimes makes up contact information, such as phone numbers and email addresses. Since we don't want to mislead consumers, we strip it from the answer.

In [7]:
def ask_mortybot(question, model):
    """
    Makes a request to our fine tuned model to 
    answer a question.
    """
    MAX_TOKENS = 175
    TEMPERATURE = 0
    FREQ_PENALTY = 2
    PRES_PENALTY = -1
    BEST_OF = 3
    
    response = openai.Completion.create(
        model=model,
        prompt=question + prompt_stop,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        frequency_penalty=FREQ_PENALTY,
        presence_penalty=PRES_PENALTY,
        best_of=BEST_OF,
        stop=completion_stop
    )
    answer = response["choices"][0]["text"].strip()
    
    # Remove possibly fake contact info
    phone_pattern = (
        r"(at )??(1\s+)??\(??[0-9]{3}\)??\s+[0-9A-Z]{3}[-\s]+[0-9A-Z]{4}\s+"
        r"(\([0-9]{4}\))?"
    )
    email_pattern = r"(at )??([a-zA-Z0-9\.]+)@([a-zA-Z0-9]+)\.([a-zA-Z]+)"
    answer = re.sub(phone_pattern, "", answer)
    answer = re.sub(email_pattern, "", answer)
    
    # Cut answer at last full sentence
    answer = "".join(
        re.split(r"(\.|\"\.|\!)", answer[::-1], maxsplit=1)[1:])[::-1]
    return answer
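
The sentence-trimming step at the end of `ask_mortybot` is compact but easy to misread, and it returns an empty string when the answer contains no period or exclamation mark. As an alternative sketch (not the notebook's implementation), the hypothetical helper below looks for the last sentence terminator that is followed by whitespace or the end of the text, and falls back to the untrimmed answer when there is none.

```python
import re

# A terminator (., ! or ?), optionally followed by a closing quote,
# that is followed by whitespace or the end of the text
SENTENCE_END = re.compile(r'[.!?]"?(?=\s|$)')

def trim_to_last_sentence(text):
    """Drop a trailing sentence fragment, if any."""
    ends = [m.end() for m in SENTENCE_END.finditer(text)]
    # No terminator found: return the text unchanged
    return text[:ends[-1]] if ends else text

print(trim_to_last_sentence("Rates vary by lender. A 5.5 percent rate is"))
print(trim_to_last_sentence("No terminator here"))
```

Requiring whitespace after the terminator keeps decimals such as "5.5" from being treated as sentence ends. Abbreviations like "e.g." would still count as sentence ends, so this remains a heuristic.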

In [10]:
# Ask a question and print the answer
Q = widgets.Text(placeholder="Ask a mortgage question")
output = widgets.Output()
def print_my_question(q):
    with output:
        A = ask_mortybot(q.value, model_name)
        print(f"Question:\n{q.value}\nAnswer:\n{A}\n")

display(Q, output)
Q.on_submit(print_my_question)


### Sample answers
---
**What does LTV stand for?**  
>Loan to value (LTV) is the amount of your mortgage loan compared to the property value.

<br></br>

**Can you explain the difference between a conforming and non-conforming loan?**  
>A conforming loan is one that meets the requirements of the relevant federal housing program. The programs have different rules and requirements, but generally a loan that is a conforming loan will be eligible for special benefits such as lower interest rates and other benefits.  A non-conforming mortgage may or may not meet all of the rules for a particular program, but it does not qualify as a conforming mortgage under any federal housing program.

<br></br>

**How long does the application process take?**  
>The application process for a mortgage takes about three business days. You can start the application process online or by calling. If you call, you may be able to speak with a lender or broker to get more information about loan products and services.