# LLM Finetuning Script
An LLM (Large Language Model) fine-tuning script is a program that adapts a pre-trained language model to a specific task or domain. It does this by training the model on a smaller, more focused dataset, allowing it to specialize and improve its performance in the desired area.

## Install the Libraries

In [1]:
!pip install -U openai

Collecting openai
  Downloading openai-1.52.0-py3-none-any.whl (386 kB)
     ------------------------------------ 386.9/386.9 kB 283.7 kB/s eta 0:00:00
Collecting jiter<1,>=0.4.0
  Downloading jiter-0.6.1-cp310-none-win_amd64.whl (200 kB)
     ------------------------------------ 200.0/200.0 kB 221.0 kB/s eta 0:00:00
Collecting typing-extensions<5,>=4.11
  Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, jiter, openai
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.9.0
    Uninstalling typing_extensions-4.9.0:
      Successfully uninstalled typing_extensions-4.9.0
  Attempting uninstall: openai
    Found existing installation: openai 1.35.7
    Uninstalling openai-1.35.7:
      Successfully uninstalled openai-1.35.7
Successfully installed jiter-0.6.1 openai-1.52.0 typing-extensions-4.12.2


## Import OpenAI

In [2]:
from openai import OpenAI
client = OpenAI(api_key="API_KEY_HERE")
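
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead is shown below. The helper name `load_api_key` is hypothetical; `OPENAI_API_KEY` is the variable name the OpenAI client also looks for by default.

```python
import os


def load_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage in the notebook (commented out so the sketch runs without the openai package):
# from openai import OpenAI
# client = OpenAI(api_key=load_api_key())
```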

## Simple API Request to OpenAI

In [10]:
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the realm of code, a concept profound,\nLies recursion, where magic is found.\nLike a mirror reflecting itself anew,\nA function calls itself, a mesmerizing view.\n\nWith elegance and grace, it unfolds,\nBreaking tasks into fractions, so bold.\nA dance of iterations, endless and true,\nRecursion lays bare the code's inner hue.\n\nLike a Russian doll, nested in layers so deep,\nA problem is solved with promises to keep.\nEach call spawns another, a journey so grand,\nUntil the base case reveals its hand.\n\nWith beauty and mystery, recursion unfolds,\nSolving problems with a tale untold.\nA loop within a loop, a journey divine,\nIn the world of programming, recursion shines.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


## Import the Dataset and Create a Test Dataset

In [12]:
import pandas as pd
df = pd.read_csv("employee_rating.csv")
df.head(10)
     

Unnamed: 0,prompt,rate
0,"Role: Software Engineer, User Description: As ...",Bad
1,"Role: Software Engineer, User Description: Wit...",Bad
2,"Role: Software Engineer, User Description: I'm...",Bad
3,"Role: Software Engineer, User Description: Cur...",Bad
4,"Role: Technical Lead, User Description: I have...",Bad
5,"Role: Technical Lead, User Description: With 1...",Bad
6,"Role: Technical Lead, User Description: I brin...",Bad
7,"Role: Technical Lead, User Description: Curren...",Bad
8,"Role: Technical Lead, User Description: I'm an...",Bad
9,"Role: Software Intern, User Description: Recen...",Moderate


In [13]:
df_head = df.head(10)
df_head.to_csv("employee_rating_test.csv", index=False) 

In [7]:
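# Keep only the remaining rows for training. Note: this overwrites the source file,
# so re-running the cell removes another 10 rows.
df = df.iloc[10:]
df.to_csv("employee_rating.csv", index=False)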

## Convert Rows to the Chat Fine-Tuning Format

In [20]:
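import json

def convert_to_gpt35_format(dataset):
    """Turn each row into a chat example: prompt as the user turn, rating as the assistant turn."""
    fine_tuning_data = []
    for _, row in dataset.iterrows():
        # json.dumps escapes quotes and special characters that string concatenation would not
        json_response = json.dumps({"rate": row['rate']})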
        fine_tuning_data.append({
            "messages": [
                {"role": "user", "content": row['prompt']},
                {"role": "assistant", "content": json_response}
            ]
        })
    return fine_tuning_data

dataset = pd.read_csv('employee_rating.csv')
converted_data = convert_to_gpt35_format(dataset)
converted_data

[{'messages': [{'role': 'user',
    'content': "Role: Software Engineer, User Description: As an undergraduate, my exposure to the industry is limited, and I don't hold any certifications. Though I can manage basic Python programming, I haven't engaged in building software applications or gained hands-on experience in the Software Development life cycle."},
   {'role': 'assistant', 'content': '{"rate": "Bad"}'}]},
 {'messages': [{'role': 'user',
    'content': "Role: Software Engineer, User Description: With my undergraduate status, I come with no prior industry experience and lack certifications. My programming skills are confined to Python basics, and I haven't contributed to any noteworthy software projects or participated in the Software Development life cycle."},
   {'role': 'assistant', 'content': '{"rate": "Bad"}'}]},
 {'messages': [{'role': 'user',
    'content': "Role: Software Engineer, User Description: I'm an undergraduate with no industry experience and no certifications. 

In [18]:
import json
json.loads(converted_data[0]['messages'][-1]['content'])
     

{'rate': 'Bad'}

## Train/Validation Split

In [21]:
from sklearn.model_selection import train_test_split

# Stratified splitting on the 'rate' column so both sets keep the same label proportions
train_data, val_data = train_test_split(
    converted_data,
    test_size=0.2,
    stratify=dataset['rate'],
    random_state=42  # for reproducibility
)
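
To confirm that the stratified split kept the label proportions similar in both sets, the labels can be counted from the converted examples. The sketch below uses a hypothetical helper `label_counts` and a hand-made sample in the same shape as `converted_data`; in the notebook it would be called on `train_data` and `val_data`.

```python
import json
from collections import Counter


def label_counts(examples):
    """Count the 'rate' labels stored in each example's assistant message."""
    counts = Counter()
    for example in examples:
        assistant_reply = example["messages"][-1]["content"]
        counts[json.loads(assistant_reply)["rate"]] += 1
    return counts


# Tiny hand-made sample in the same shape as `converted_data` (not real data)
sample = [
    {"messages": [{"role": "user", "content": "Role: A"},
                  {"role": "assistant", "content": '{"rate": "Bad"}'}]},
    {"messages": [{"role": "user", "content": "Role: B"},
                  {"role": "assistant", "content": '{"rate": "Bad"}'}]},
    {"messages": [{"role": "user", "content": "Role: C"},
                  {"role": "assistant", "content": '{"rate": "Moderate"}'}]},
]

print(label_counts(sample))
```

Note that `train_test_split` raises an error when `stratify` is used and a label appears only once, so very rare labels may need more examples first.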

## Write the Examples to JSONL Files and Upload Them

In [22]:
def write_to_jsonl(data, file_path):
    with open(file_path, 'w') as file:
        for entry in data:
            json.dump(entry, file)
            file.write('\n')


training_file_name = "train.jsonl"
validation_file_name = "val.jsonl"

write_to_jsonl(train_data, training_file_name)
write_to_jsonl(val_data, validation_file_name)
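
Before uploading, it can help to check that every line of the JSONL files is a well-formed chat example, since a malformed line makes the job fail during file validation. The following is a rough sketch, not the API's own validator: the function name `validate_lines` and the set of allowed roles are assumptions that fit this dataset.

```python
import json

# Roles expected in this dataset (an assumption; the chat format allows others)
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((line_number, "missing 'messages' list"))
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
                problems.append((line_number, "unexpected role"))
            elif not isinstance(message.get("content"), str):
                problems.append((line_number, "content is not a string"))
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append((line_number, "last message is not from the assistant"))
    return problems


# Usage in the notebook:
# with open("train.jsonl", encoding="utf-8") as file:
#     print(validate_lines(file))
```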

In [23]:
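# Upload both files; the context managers close the file handles after the upload
with open(training_file_name, "rb") as file:
    training_file = client.files.create(file=file, purpose="fine-tune")
with open(validation_file_name, "rb") as file:
    validation_file = client.files.create(file=file, purpose="fine-tune")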

print("Training file id:", training_file.id)
print("Validation file id:", validation_file.id)

Training file id: file-2lTAiDq4pDtirVJBqntVhadc
Validation file id: file-AF6syvwv8upmRHMebnOB2nBz


## Start the Fine-Tuning Job and Wait for It to Complete (May Take a Few Hours)

In [33]:
suffix_name = "fullstackTutorial"

response = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-3.5-turbo",
    suffix=suffix_name,
)
response

FineTuningJob(id='ftjob-MZWpdUWT7TNG2eHCUHcKscid', created_at=1704434281, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-IiIKN0jLcjY7NoA3XECMH6Fg', result_files=[], status='validating_files', trained_tokens=None, training_file='file-Fsk7wAy8RBaI4OieDcSGZ2SR', validation_file='file-vVtLhqtjCdH8XTAzqypc6RmC')
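
The job starts in the `validating_files` status, so the fine-tuned model is not available yet. Rather than re-running a cell by hand, the status can be polled until it reaches a terminal state. This is a minimal sketch: `wait_for_job` is a hypothetical helper, the polling interval and limit are arbitrary, and the retrieve function is passed in so the loop can be tried without a live client.

```python
import time

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=240, sleep=time.sleep):
    """Call retrieve(job_id) until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Usage in the notebook (needs a live client, so it is left commented out):
# finished_job = wait_for_job(client.fine_tuning.jobs.retrieve, response.id)
# print(finished_job.status, finished_job.fine_tuned_model)
```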

## Get the Finetuned Model ID

In [43]:
# Use the id of the job created above (response.id); the id below comes from an earlier run
responseModel = client.fine_tuning.jobs.retrieve("ftjob-Qmz2BUIEwNcIJYVxZ1tFxR9g")
responseModel

FineTuningJob(id='ftjob-Qmz2BUIEwNcIJYVxZ1tFxR9g', created_at=1704429684, error=None, fine_tuned_model='ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC', finished_at=1704430108, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-IiIKN0jLcjY7NoA3XECMH6Fg', result_files=['file-zoPH4H27dE5wPfjfZ50K4dVt'], status='succeeded', trained_tokens=7719, training_file='file-Fsk7wAy8RBaI4OieDcSGZ2SR', validation_file='file-vVtLhqtjCdH8XTAzqypc6RmC')

In [44]:
fine_tuned_model_id = responseModel.fine_tuned_model
print("\nFine-tuned model id:", fine_tuned_model_id)


Fine-tuned model id: ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC


## Make Test Prediction

In [27]:
def predict(test_messages, fine_tuned_model_id):

    response = client.chat.completions.create(
        model= fine_tuned_model_id, messages=test_messages
    )

    return response.choices[0].message.content

In [28]:
testSent = "Role: Software Engineer, User Description: With five years of experience in software engineering, I've successfully designed and implemented solutions to optimize efficiency and enhance user experiences. My achievements include spearheading the development of a critical module that increased application performance by 30%. Passionate about staying ahead in the tech landscape, I bring a wealth of knowledge in both frontend and backend development, aiming to contribute to innovative projects."

# Reuse testSent instead of repeating the prompt text
formatted_message = [
    {"role": "user", "content": testSent}
]

predict(formatted_message, fine_tuned_model_id)

'{"rate": "good"}'
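
The model answers with a JSON string, and the casing of the label may differ from the training labels (`good` here versus `Bad` and `Moderate` in the dataset). A small sketch of turning the reply into a normalized label is shown below. `parse_rate` is a hypothetical helper, and capitalizing the first letter is an assumption based on the labels visible in the dataset preview.

```python
import json


def parse_rate(raw_reply):
    """Extract the 'rate' value from a model reply, or return None if it cannot be parsed."""
    try:
        payload = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(payload, dict) or not isinstance(payload.get("rate"), str):
        return None
    # Normalize casing so "good" and "Good" count as the same label
    return payload["rate"].strip().capitalize()


print(parse_rate('{"rate": "good"}'))  # Good
print(parse_rate("not json"))          # None
```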

## (Optional) Evaluate the Accuracy

In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def format_test(row):

    formatted_message = [
        {
            "role": "user",
            "content": row['prompt']
        }
    ]
    return formatted_message


def predict(test_messages, fine_tuned_model_id):

    response = client.chat.completions.create(
        model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=50
    )

    return response.choices[0].message.content

In [16]:
def store_predictions(test_df, fine_tuned_model_id):

    print("fine_tuned_model_id",fine_tuned_model_id)
    test_df['Prediction'] = None

    for index, row in test_df.iterrows():
        test_message = format_test(row)
        prediction_result = predict(test_message, fine_tuned_model_id)
        test_df.at[index, 'Prediction'] = prediction_result

    test_df.to_csv("predictions.csv")

In [21]:

# Load the held-out rows saved earlier as employee_rating_test.csv
test_df = pd.read_csv("employee_rating_test.csv")
store_predictions(test_df, "ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC")
     

fine_tuned_model_id ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC
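
The cells above store the raw predictions but never compute a score, even though the metric functions are imported. The sketch below shows one way to get accuracy and a simple confusion count in plain Python. It assumes the `Prediction` column has already been reduced to plain labels (for example by parsing the JSON string), and `evaluate` is a hypothetical helper. The same two lists could also be passed to scikit-learn's `accuracy_score`.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, Counter of (true, predicted) pairs), ignoring letter case."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both lists must have the same length")
    if not true_labels:
        raise ValueError("No labels to evaluate")
    pairs = [
        (str(true).strip().lower(), str(predicted).strip().lower())
        for true, predicted in zip(true_labels, predicted_labels)
    ]
    correct = sum(true == predicted for true, predicted in pairs)
    return correct / len(pairs), Counter(pairs)


# Made-up labels for illustration only (not results from the notebook)
accuracy, confusion = evaluate(
    ["Bad", "Bad", "Moderate", "Good"],
    ["Bad", "Moderate", "Moderate", "good"],
)
print(accuracy)   # 0.75
print(confusion)
```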


In [5]:
testSentence = "Role: Technical Lead, User Description: Im a senior engineer with just 1 year experince. I have programmed in 2 languages and I have worked previosly in 1 company"

completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC",
  messages=[
    {"role": "user", "content": testSentence}
  ]
)

print(completion.choices[0].message.content)  # prints the predicted rating as a JSON string

{"rate": "Bad"}
