# Method 4: Fine Tuning - Synthetic Prompts

In this series we are exploring how to match the tone of a sample. My goal is to instruct and tune the LLM to talk like me. I'll use a few podcasts I've been on as examples.

Check out the [full video](link_to_video) overview of this for more context.

In [1]:
import os, json
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate

load_dotenv()

True

In [2]:
chat = ChatOpenAI(model='gpt-4', openai_api_key=os.getenv("OPENAI_API_KEY", "YOUR_API_KEY_HERE"))

### Previous work
We did a bunch of work in the previous methods to get my tone description, docs relevant to our sample query, and writing examples. We'll load those up here so we don't need to run the code again.

In [3]:
# This is a text file of a bunch of lines that I said
with open("Transcripts/GregLines.txt", 'r') as file:
    greg_lines = file.read()

# This is a description of my tone as determined by the LLM (previous method)
with open("gregs_tone_description.txt", 'r') as file:
    gregs_tone_description = file.read()

# These are specific references to me talking about a previous role that I had
with open("first_job_college_relevant_docs.txt", 'r') as file:
    relevant_docs = file.read()

# These are specific examples of how I talk, similar to the GregLines above
with open("greg_example_writing.txt", 'r') as file:
    writing_examples = file.read()

### Fine Tuning

Now we are going to move on to the fun part: fine tuning. I want to use an open-source model to save on costs and to see how a model that *hasn't* been through as much safety training does.

To do this I'm going to fine tune and run my model via [Gradient.ai](https://gradient.ai/), which helped sponsor this video. It was super easy to get set up, and the team has been responsive.

### Step 1: Create Synthetic Prompts

When you fine tune, it's recommended to have a set of validated 'input' and 'output' pairs. This is your training set. I'm going to use my transcripts as the output, but what do I use for the input?

I'm going to try having GPT-4 generate a synthetic 'input' for each line, which I'll use for my training set. Kind of like [Jeopardy](https://www.jeopardy.com/): given the answer, come up with the question.

In [4]:
greg_lines_list = greg_lines.split("\n\n")
greg_lines_list = [x for x in greg_lines_list if len(x) < 1500 and len(x) > 100]

len(greg_lines_list)

60

Looks like I have 60 lines after some filtering. Let's start there. It may seem like not that many data points but I want to try it out.

Now we'll have GPT-4 generate inputs that would have resulted in these outputs. I'll save each pair in a list.

In [5]:
# Uncomment the code below if you want to run it manually or else skip below and load the file I've already done

# input_pairs = []

# for i, line in enumerate(greg_lines_list):
#     # Status counter
#     if i % 10 == 0:
#         print (i)
    
#     # Finally, let's load it all up into a good prompt for us to use!
#     template = """
#         You are a bot that is good at generating an 'input' statement given a statement someone says
        
#         Your goal is to ask a question that would have resulted in the output statement you're given
        
#         Think of it like a game of Jeopardy.
        
#         -Example-
#         Output: Last night I went to dinner and had a great time with my wife!
#         Input: What did you do last night?
#         -End Of Examples-
        
#         Here is the output you should give an input to: {greg_line}
#         """

#     prompt = PromptTemplate(
#         input_variables=["greg_line"],
#         template=template,
#     )

#     final_prompt = prompt.format(
#         greg_line = line
#     )

#     llm_answer = chat.predict(final_prompt)

#     input_pairs.append({
#         'input' : llm_answer,
#         'output' : line
#     })

# with open("greg_synthetic_pairs.json", "w") as file:
#     json.dump(input_pairs, file)

In [6]:
with open("greg_synthetic_pairs.json", "r") as file:
    input_pairs = json.loads(file.read())

In [7]:
# Let's see a sample
input_pairs[10]

{'input': 'What are your predictions for the development of OpenAI and other open source models in the next 18 months?',
 'output': "However, in that 18 months when when those open source models are getting better, OpenAI is gonna improve their capabilities. They're gonna come out with good stuff, and they're gonna be on a GPT 5, 6, whatever."}

I saved these pairs to a file so we don't have to generate them again; you can load it up on your own if you want.
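
Since the inputs are machine-generated, it may be worth a quick check that every pair actually has both fields filled in before training on it. The sketch below is not part of the original notebook: `find_bad_pairs` is a hypothetical helper, and `sample_pairs` is toy data standing in for the contents of `greg_synthetic_pairs.json`.

```python
def find_bad_pairs(pairs):
    """Return the indexes of pairs missing a non-empty 'input' or 'output' string."""
    bad = []
    for i, pair in enumerate(pairs):
        for key in ("input", "output"):
            value = pair.get(key)
            # Flag the pair if the field is absent, not a string, or only whitespace
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break
    return bad

# Toy data (assumption): same shape as the pairs loaded above
sample_pairs = [
    {"input": "What did you do last night?",
     "output": "Last night I went to dinner and had a great time with my wife!"},
    {"input": "", "output": "A line whose generated question came back empty."},
    {"output": "A line with no generated question at all."},
]

bad_indexes = find_bad_pairs(sample_pairs)
print(bad_indexes)
```

On the real data you would call `find_bad_pairs(input_pairs)` and drop or regenerate anything it flags.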

Great, now that we have our input/output pairs, let's transform them into training data points. You can see what the suggested format is on [Gradient's website](https://docs.gradient.ai/docs/tips-and-tricks).

In [8]:
training_set = []

for pair in input_pairs:
    training_set.append({"inputs": f"<s>### Instruction:\n{pair['input']}\n\n### Response:\n{pair['output']}</s>" })
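
To see what one of these training samples looks like once formatted, here is a small sketch that wraps a single pair and prints it. The f-string is the same one used in the loop above; `to_training_sample` is a hypothetical helper name, and `example_pair` is toy data borrowed from the example in the generation prompt.

```python
def to_training_sample(pair):
    """Wrap one input/output pair in the instruction/response layout used above."""
    return {
        "inputs": f"<s>### Instruction:\n{pair['input']}\n\n### Response:\n{pair['output']}</s>"
    }

# Toy pair (assumption): stands in for one entry of input_pairs
example_pair = {
    "input": "What did you do last night?",
    "output": "Last night I went to dinner and had a great time with my wife!",
}

sample = to_training_sample(example_pair)

# Print the raw string so the newlines and markers are visible
print(sample["inputs"])
```

Printing a sample like this is a cheap way to confirm that the `### Instruction:` and `### Response:` markers and the `<s>`/`</s>` tokens land where you expect.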

### Step 2: Fine Tune the Model

Now we are on to the fine tuning step. We'll use their Nous Hermes 2 model. You can check out their full list of supported models [here](https://docs.gradient.ai/docs/models-1). You'll need to use Python 3.10+.

In [9]:
from gradientai import Gradient

# Make your Gradient client
gradient = Gradient(access_token=os.getenv("GRADIENT_API_TOKEN", "YourTokenHere"),
                    workspace_id=os.getenv("GRADIENT_WORKSPACE_ID", "YourWorkSpaceIdHere"))

# Get your base model ready. You'll need to grab the model slug from Gradient's website
base_model = gradient.get_base_model(base_model_slug="nous-hermes2")

In [11]:
# Create your new model which you'll stem from the base model
new_model = base_model.create_model_adapter(
    name="My Greg Model - Synthetic Prompts"
)

print(f"Created model adapter with id {new_model.id}")

Created model adapter with id 7d314527-d693-4ecf-8d7c-3bc6876ed436_model_adapter


Great! Now let's do the cool part of fine tuning

In [12]:
# Training on the training_set we made above
new_model.fine_tune(samples=training_set)

FineTuneResponse(number_of_trainable_tokens=12433, sum_loss=35208.676)
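
The two numbers in that response are easier to read as a single per-token figure. The arithmetic below is a rough sketch: it assumes `sum_loss` is the loss summed over all trainable tokens, which I haven't confirmed against Gradient's docs.

```python
# Values copied from the FineTuneResponse printed above
number_of_trainable_tokens = 12433
sum_loss = 35208.676

# Assumption: sum_loss is a sum over trainable tokens, so dividing gives a per-token average
avg_loss_per_token = sum_loss / number_of_trainable_tokens
print(f"Average loss per token: {avg_loss_per_token:.3f}")
# Average loss per token: 2.832
```

A per-token average like this is handy if you run `fine_tune` several times or change the size of the training set, since the raw sum grows with the number of tokens.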

### Final Output

In [13]:
from langchain.llms import GradientLLM
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

In [14]:
llm = GradientLLM(
    # `ID` listed in `$ gradient model list`
    model=new_model.id,
    # optional: set new credentials, they default to environment variables
    gradient_workspace_id=os.environ["GRADIENT_WORKSPACE_ID"],
    gradient_access_token=os.environ["GRADIENT_API_TOKEN"]
)

In [15]:
question = "what was your first job out of college? Did you like it?"

In [16]:
# Finally, let's load it all up into a good prompt for us to use!

template = """
    <s>
    Speak in the tone & style of Greg Kamradt.
    Respond in a short, conversational manner that answers the question below.
    
    Here is relevant information that can be used to answer the question
    **Start of relevant information**
    {background_information}
    **End of relevant information**
    
    Here are some examples of how Greg Kamradt talks, mimic the tone you see here
    **Start of examples information**
    {examples}
    **End of examples information**
    
    ANSWER THIS QUESTION: {question}
    </s>
    """

prompt = PromptTemplate(
    input_variables=["background_information", "question", "examples"],
    template=template,
)

final_prompt = prompt.format(
    background_information = relevant_docs,
    examples=writing_examples,
    question = question
)

llm_answer = llm.predict(final_prompt)

print (llm_answer)

**Start of answer**
    Yeah. Yeah. Yeah. It's a good question.


Hm, this just isn't good.