# LLM Fine-Tuning Script
An LLM (large language model) fine-tuning script is a program that adapts a pre-trained language model to a specific task or domain. It does this by training the model on a smaller, more focused dataset, allowing it to specialize and improve its performance in the desired area.

## Install the Libraries

In [1]:
!pip install -U openai

Collecting openai
  Downloading openai-1.52.0-py3-none-any.whl (386 kB)
     ------------------------------------ 386.9/386.9 kB 283.7 kB/s eta 0:00:00
Collecting jiter<1,>=0.4.0
  Downloading jiter-0.6.1-cp310-none-win_amd64.whl (200 kB)
     ------------------------------------ 200.0/200.0 kB 221.0 kB/s eta 0:00:00
Collecting typing-extensions<5,>=4.11
  Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, jiter, openai
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.9.0
    Uninstalling typing_extensions-4.9.0:
      Successfully uninstalled typing_extensions-4.9.0
  Attempting uninstall: openai
    Found existing installation: openai 1.35.7
    Uninstalling openai-1.35.7:
      Successfully uninstalled openai-1.35.7
Successfully installed jiter-0.6.1 openai-1.52.0 typing-extensions-4.12.2


## Import OpenAI

In [6]:
import os
from openai import OpenAI
# Read the key from the OPENAI_API_KEY environment variable rather than hard-coding a secret
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

## Simple API Request to OpenAI

In [4]:
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the land of code, there lies a tale,  \nOf recursion's power, without fail.  \nA function that calls itself anew,  \nCreating a loop, like morning dew.  \n\nLike a mirror reflecting its own face,  \nRecursion dives deep into every trace.  \nThrough layers of calls, it travels far,  \nSolving problems, like a shining star.  \n\nFrom base case to recursive dream,  \nIt all begins with a simple theme.  \nA function that calls itself with care,  \nUnraveling mysteries in the air.  \n\nSo heed the call of recursion's might,  \nA journey into the coding night.  \nSolving problems with elegant grace,  \nIn the wondrous world of recursive space.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)


## Import the Dataset and Create a Test Dataset

In [20]:
import pandas as pd
df = pd.read_csv("employee_rating_new.csv")
df.head(50)
     

Unnamed: 0,prompt,rate
0,"Role: Software Engineer, User Description: As ...",Bad
1,"Role: Software Engineer, User Description: Wit...",Bad
2,"Role: Software Engineer, User Description: I'm...",Bad
3,"Role: Software Engineer, User Description: Cur...",Bad
4,"Role: Technical Lead, User Description: I have...",Bad
5,"Role: Technical Lead, User Description: With 1...",Bad
6,"Role: Technical Lead, User Description: I brin...",Bad
7,"Role: Technical Lead, User Description: Curren...",Bad
8,"Role: Technical Lead, User Description: I'm an...",Bad
9,"Role: Software Intern, User Description: Recen...",Moderate


In [10]:
df_head = df.head(10)
df_head.to_csv("employee_rating_test.csv", index=False) 

In [12]:
df = df.iloc[10:] 
df.to_csv("employee_rating.csv", index=False)

## Convert Rows to the GPT Chat Format

In [14]:
import json

def convert_to_gpt35_format(dataset):
    fine_tuning_data = []
    for _, row in dataset.iterrows():
        # json.dumps escapes quotes and special characters that string concatenation would not
        json_response = json.dumps({"rate": row['rate']})
        fine_tuning_data.append({
            "messages": [
                {"role": "user", "content": row['prompt']},
                {"role": "assistant", "content": json_response}
            ]
        })
    return fine_tuning_data

dataset = pd.read_csv('employee_rating.csv')
converted_data = convert_to_gpt35_format(dataset)
converted_data

[{'messages': [{'role': 'user',
    'content': "Role: Software Intern, User Description: I'm in the early stages of my software engineering degree, lacking professional experience. I possess coding skills in both Python and javascript, with a basic understanding of databases and the ability to use basic database queries. I've also completed a Python certification to strengthen my foundation."},
   {'role': 'assistant', 'content': '{"rate": "Moderate"}'}]},
 {'messages': [{'role': 'user',
    'content': "Role: Software Intern, User Description: Currently pursuing my software engineering degree, I'm new to the field and have no prior work experience. I'm proficient in Python and Java, with a basic grasp of databases and the capability to execute fundamental database queries. I've taken the initiative to complete a Python certification. I have built basic websites in react. I know html, css, javascript"},
   {'role': 'assistant', 'content': '{"rate": "Moderate"}'}]},
 {'messages': [{'role

In [16]:
import json
json.loads(converted_data[0]['messages'][-1]['content'])
     

{'rate': 'Moderate'}
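
Before uploading anything, it can help to check every converted example rather than only the first one. The sketch below is a hypothetical helper (`validate_example` is not part of the original notebook) that assumes each example is a dict shaped like the output above, with the final message coming from the assistant and holding a JSON string. The set of allowed roles is an assumption that fits this dataset.

```python
import json

# Assumed role set for this dataset; extend it if your examples use other roles.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example):
    """Return a list of problems found in one chat-format training example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: content must be a non-empty string")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    else:
        try:
            # The assistant reply is expected to be a JSON string such as {"rate": "Moderate"}
            json.loads(messages[-1].get("content"))
        except (TypeError, ValueError):
            problems.append("assistant content is not valid JSON")
    return problems
```

Running `[validate_example(e) for e in converted_data]` and looking for non-empty lists is one way to spot malformed rows early.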

## Train/Validation Split

In [26]:
from sklearn.model_selection import train_test_split

# Stratified split on the 'rate' column so both sets keep the same label proportions
train_data, val_data = train_test_split(
    converted_data,
    test_size=0.2,
    stratify=dataset['rate'],
    random_state=42  # for reproducibility
)
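
To confirm that stratification kept the label proportions similar in both sets, the label shares can be compared. This is a minimal sketch, assuming each example ends with an assistant message containing a JSON object with a `rate` key, as produced by `convert_to_gpt35_format`. The helper name `label_shares` is hypothetical.

```python
import json
from collections import Counter

def label_shares(examples):
    """Map each rating label to its share of the given chat-format examples."""
    labels = [json.loads(ex["messages"][-1]["content"])["rate"] for ex in examples]
    total = len(labels)
    # An empty input yields an empty dict, so no division by zero occurs
    return {label: count / total for label, count in Counter(labels).items()}
```

Calling `label_shares(train_data)` and `label_shares(val_data)` should give roughly matching proportions if the split is stratified as intended.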

## Write the Data to JSONL Files

In [28]:
def write_to_jsonl(data, file_path):
    with open(file_path, 'w') as file:
        for entry in data:
            json.dump(entry, file)
            file.write('\n')


training_file_name = "train.jsonl"
validation_file_name = "val.jsonl"

write_to_jsonl(train_data, training_file_name)
write_to_jsonl(val_data, validation_file_name)
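
A quick round-trip read is one way to check that the files were written as valid JSONL before they are uploaded. The sketch below uses a hypothetical `read_jsonl` helper and assumes one JSON object per line, which is what `write_to_jsonl` produces.

```python
import json

def read_jsonl(file_path):
    """Load a JSONL file into a list, one parsed object per non-empty line."""
    records = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report which line is malformed to make the file easy to fix
                raise ValueError(f"{file_path}, line {line_number}: {error}") from error
    return records
```

For example, `read_jsonl(training_file_name) == train_data` should hold right after writing.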

In [30]:
# Upload both files; the context managers close the file handles afterwards
with open(training_file_name, "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
with open(validation_file_name, "rb") as f:
    validation_file = client.files.create(file=f, purpose="fine-tune")

print("Training file id:", training_file.id)
print("Validation file id:", validation_file.id)

Training file id: file-4aBHRkj4lRx7gFa0anex7jmd
Validation file id: file-2W3nPmn15hrmxlkWDZphWQvZ


## Start the Fine-Tuning Job (It May Take a Few Hours to Complete)

In [4]:
suffix_name = "fullstackTutorial"

response = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-3.5-turbo",
    suffix=suffix_name,
)
response

NameError: name 'client' is not defined
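
Because the job runs asynchronously, one option is to poll its status instead of checking back manually. The following is a minimal sketch, not the notebook's own method: `wait_for_job` is a hypothetical helper, and it assumes the job object exposes a `status` attribute that ends in `succeeded`, `failed`, or `cancelled`. The retrieval function is passed in as a parameter so the loop can be tried without calling the API.

```python
import time

# Statuses after which the job will not change any further (assumed terminal set)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve_job, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status, then return it.

    retrieve_job is any callable that takes a job id and returns an object
    with a .status attribute, for example client.fine_tuning.jobs.retrieve.
    """
    while True:
        job = retrieve_job(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
```

With a defined `client`, this could be used as `job = wait_for_job(client.fine_tuning.jobs.retrieve, response.id)`, after which `job.fine_tuned_model` holds the model ID if the job succeeded.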

## Get the Fine-Tuned Model ID

In [8]:
responseModel = client.fine_tuning.jobs.retrieve("ftjob-Qmz2BUIEwNcIJYVxZ1tFxR9g")
responseModel

FineTuningJob(id='ftjob-Qmz2BUIEwNcIJYVxZ1tFxR9g', created_at=1704429684, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC', finished_at=1704430108, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-IiIKN0jLcjY7NoA3XECMH6Fg', result_files=['file-zoPH4H27dE5wPfjfZ50K4dVt'], seed=None, status='succeeded', trained_tokens=7719, training_file='file-Fsk7wAy8RBaI4OieDcSGZ2SR', validation_file='file-vVtLhqtjCdH8XTAzqypc6RmC', estimated_finish=None, integrations=[], user_provided_suffix='fullstackTutorial')

In [12]:
fine_tuned_model_id = responseModel.fine_tuned_model
print("\nFine-tuned model id:", fine_tuned_model_id)


Fine-tuned model id: ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC


## Make Test Prediction

In [14]:
def predict(test_messages, fine_tuned_model_id):

    response = client.chat.completions.create(
        model= fine_tuned_model_id, messages=test_messages
    )

    return response.choices[0].message.content

In [28]:
testSent = "Role: Software Engineer, User Description: With five years of experience in software engineering, I've successfully designed and implemented solutions to optimize efficiency and enhance user experiences. My achievements include spearheading the development of a critical module that increased application performance by 30%. Passionate about staying ahead in the tech landscape, I bring a wealth of knowledge in both frontend and backend development, aiming to contribute to innovative projects."

formatted_message = [
        {
            "role": "user",
            "content": testSent  # reuse the prompt defined above instead of repeating it
        }
    ]

predict(formatted_message,"ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC")

'{"rate": "good"}'
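
The reply above uses lower-case `good`, while the dataset labels are capitalized, so comparing raw strings would be fragile. A hedged sketch of a parser that normalizes the reply follows. `parse_rating` and `KNOWN_LABELS` are hypothetical names, and the label set is an assumption based on the values visible in this notebook.

```python
import json

# Assumed label set: "Bad" and "Moderate" appear in the dataset preview and
# "good" appears in a model reply; adjust to the labels in your own data.
KNOWN_LABELS = ("Bad", "Moderate", "Good")

def parse_rating(raw_reply):
    """Extract a normalized rating label from the model's JSON reply, or None."""
    try:
        payload = json.loads(raw_reply)
    except (TypeError, ValueError):
        return None  # reply was not valid JSON
    if not isinstance(payload, dict):
        return None
    rate = payload.get("rate")
    if not isinstance(rate, str):
        return None
    for label in KNOWN_LABELS:
        # Case-insensitive match so "good" and "Good" map to the same label
        if rate.strip().lower() == label.lower():
            return label
    return None
```

Returning `None` for anything unexpected keeps malformed replies from being silently counted as a valid rating.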

## (Optional) Evaluate the Accuracy

In [16]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def format_test(row):

    formatted_message = [
        {
            "role": "user",
            "content": row['prompt']
        }
    ]
    return formatted_message


def predict(test_messages, fine_tuned_model_id):

    response = client.chat.completions.create(
        model=fine_tuned_model_id, messages=test_messages, temperature=0, max_tokens=50
    )

    return response.choices[0].message.content

In [17]:
def store_predictions(test_df, fine_tuned_model_id):

    print("fine_tuned_model_id",fine_tuned_model_id)
    test_df['Prediction'] = None

    for index, row in test_df.iterrows():
        test_message = format_test(row)
        prediction_result = predict(test_message, fine_tuned_model_id)
        test_df.at[index, 'Prediction'] = prediction_result

    test_df.to_csv("predictions.csv")
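
After `store_predictions` has run, the `Prediction` column holds raw JSON strings while the `rate` column holds plain labels, so the two need to be aligned before scoring. The sketch below is a dependency-free illustration under that assumption. `extract_rate` and `accuracy` are hypothetical helpers, and the comparison ignores case.

```python
import json

def extract_rate(raw_reply):
    """Return the lower-cased 'rate' value from a JSON reply, or None if unparseable."""
    try:
        payload = json.loads(raw_reply)
    except (TypeError, ValueError):
        return None
    rate = payload.get("rate") if isinstance(payload, dict) else None
    return rate.strip().lower() if isinstance(rate, str) else None

def accuracy(true_labels, raw_predictions):
    """Fraction of rows whose predicted rate matches the true label, ignoring case."""
    pairs = list(zip(true_labels, raw_predictions))
    if not pairs:
        return 0.0
    correct = sum(
        1 for truth, raw in pairs if extract_rate(raw) == str(truth).strip().lower()
    )
    return correct / len(pairs)
```

Usage would look like `accuracy(test_df["rate"], test_df["Prediction"])`. Passing the parsed labels to `accuracy_score` from scikit-learn should give the same number, and the other imported metrics can be applied to the same parsed labels.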

In [22]:

test_df = pd.read_csv("employee_rating_test.csv")  # the held-out rows saved earlier
store_predictions(test_df, "ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC")

In [24]:
testSentence = "Role: Technical Lead, User Description: Im a senior engineer with just 1 year experince. I have programmed in 2 languages and I have worked previosly in 1 company"

completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0613:stemlink:fullstacktutorial:8dWQ9vUC",
  messages=[
    {"role": "user", "content": testSentence}
  ]
)

print(completion.choices[0].message.content)  # prints the predicted rating as a JSON string

{"rate": "Bad"}
