# YELC ChatGPT Fine-tuning for essay scoring

Adapted from the OpenAI fine-tuning guide (https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset), as of Oct 19, 2023.

## Data Preparation

**PROMPT**

Evaluate the quality of an essay written by English as a foreign language (EFL) learners on a scale of 1 to 9, with 9 being the highest and 1 being the lowest. Please provide only a score without providing the rationale for your rating.

The evaluation rubric for your rating, adapted from CEFR Descriptors, is as follows: 
[Score 9]
<Overall written production>
-	Can produce clear, smoothly flowing, complex texts in an appropriate and effective style and a logical structure which helps the reader identify significant points.
<Vocabulary> 
-	Has a good command of a very broad lexical repertoire including idiomatic expressions and colloquialisms; shows awareness of connotative levels of meaning.
-	Consistently correct and appropriate use of vocabulary.
<Grammatical Accuracy> 
-	Maintains consistent grammatical control of complex language, even while attention is otherwise engaged (e.g. in forward planning, in monitoring others’ reactions).
<Creative writing>
-	Can relate clear, smoothly flowing and engaging stories and descriptions of experience in a style appropriate to the genre adopted.
-	Can exploit idiom and humour appropriately to enhance the impact of the text.
<Reports and essays>
-	Can produce clear, smoothly flowing, complex reports, articles or essays which present a case, or give critical appreciation of proposals or literary works.
-	Can provide an appropriate and effective logical structure which helps the reader identify significant points.
-	Can set out multiple perspectives on complex academic or professional topics, clearly distinguishing their own ideas and opinions from those in the sources.

[Score 8]
<Overall written production>	
-	Can produce clear, well-structured texts of complex subjects, underlining the relevant salient issues, expanding and supporting points of view at some length with subsidiary points, reasons and relevant examples, and rounding off with an appropriate conclusion.
-	Can employ the structure and conventions of a variety of genres, varying the tone, style and register according to addressee, text type and theme.
<Vocabulary> 
-	Has a good command of common idiomatic expressions and colloquialisms; can play with words/signs fairly well.
-	Uses less common vocabulary idiomatically and appropriately.
-	Occasional minor slips, but no significant vocabulary errors.
<Grammatical Accuracy> 
-	Consistently maintains a high degree of grammatical accuracy; errors are rare and difficult to spot.
<Creative writing>	
-	Can produce clear, detailed, well-structured and developed descriptions and imaginative texts in an assured, personal, natural style appropriate to the reader in mind.
-	Can incorporate idiom and humour, though use of the latter is not always appropriate.
-	Can give a detailed critical review of cultural events (e.g. plays, films, concerts) or literary works.
<Reports and essays>
-	Can produce clear, well-structured expositions of complex subjects, underlining the relevant salient issues.
-	Can expand and support points of view at some length with subsidiary points, reasons and relevant examples.
-	Can produce a suitable introduction and conclusion to a longer report, article or dissertation on a complex academic or professional topic provided the topic is within their field of interest and there are opportunities for redrafting and revision.

[Score 7]
<Vocabulary> 
-	Can understand and use the main technical terminology of their field, when discussing their area of specialization with other specialists.
<Grammatical Accuracy> 
-	Good grammatical control; occasional “slips” or non-systematic errors and minor flaws in sentence structure may still occur, but they are rare and can often be corrected in retrospect.
<Creative writing>
-	Can give clear, detailed descriptions of real or imaginary events and experiences marking the relationship between ideas in clear connected text, and following established conventions of the genre concerned.
<Reports and essays>
-	Can produce an essay or report which develops an argument systematically with appropriate highlighting of significant points and relevant supporting detail.
-	Can produce a detailed description of a complex process.
-	Can evaluate different ideas or solutions to a problem.

[Score 6] 
<Overall written production>
-	Can produce clear, detailed texts on a variety of subjects related to their field of interest, synthesising and evaluating information and arguments from a number of sources.
<Vocabulary> 
-	Has a good range of vocabulary for matters connected to their field and most general topics.
-	Can produce appropriate collocations of many words/signs in most contexts fairly systematically.
-	Lexical accuracy is generally high, though some confusion and incorrect word/sign choice does occur without hindering communication.
<Grammatical Accuracy> 
-	Shows a relatively high degree of grammatical control. Does not make mistakes which lead to misunderstanding.
-	Has a good command of simple language structures and some complex grammatical forms, although they tend to use complex structures rigidly with some inaccuracy.
<Creative writing>
-	Can give clear, detailed descriptions on a variety of subjects related to their field of interest.
-	Can give a review of a film, book or play.
<Reports and essays>	
-	Can produce an essay or report which develops an argument, giving reasons in support of or against a particular point of view and explaining the advantages and disadvantages of various options.
-	Can synthesise information and arguments from a number of sources.

[Score 5] 
<Grammatical Accuracy>
-	Communicates with reasonable accuracy in familiar contexts; generally good control, though with noticeable mother-tongue influence. Errors occur, but it is clear what they are trying to express.
<Creative writing>	
-	Can clearly signal chronological sequence in narrative text.
-	Can give a simple review of a film, book or TV programme using a limited range of language.
<Reports and essays>
-	Can produce short, simple essays on topics of interest.
-	Can produce a text on a topical subject of personal interest, using simple language to list advantages and disadvantages, and give and justify their opinion.
-	Can summarise, report and give their opinion about accumulated factual information on familiar routine and non-routine matters within their field with some confidence.

[Score 4] 
<Overall written production>
-	Can produce straightforward connected texts on a range of familiar subjects within their field of interest, by linking a series of shorter discrete elements into a linear sequence.
<Vocabulary> 
-	Has a good range of vocabulary related to familiar topics and everyday situations.
-	Has sufficient vocabulary to express themselves with some circumlocutions on most topics pertinent to their everyday life such as family, hobbies and interests, work, travel and current events.
-	Shows good control of elementary vocabulary but major errors still occur when expressing more complex thoughts or handling unfamiliar topics and situations.
-	Uses a wide range of simple vocabulary appropriately when discussing familiar topics.
<Grammatical Accuracy>
-	Uses reasonably accurately a repertoire of frequently used “routines” and patterns associated with more predictable situations.
<Creative writing>
-	Can give straightforward, detailed descriptions on a range of familiar subjects within their field of interest.
-	Can give accounts of experiences, describing feelings and reactions in simple, connected text.
-	Can give a description of an event, a recent trip – real or imagined.
-	Can narrate a story.
<Reports and essays>
-	Can produce very brief reports in a standard conventionalised format, which pass on routine factual information and state reasons for actions.
-	Can present a topic in a short report or poster, using photographs and short blocks of text.

[Score 3] 
<Vocabulary> 
-	Has sufficient vocabulary for the expression of basic communicative needs.
-	Can control a narrow repertoire dealing with concrete, everyday needs.
<Grammatical Accuracy>
-	Uses some simple structures correctly, but still systematically makes basic mistakes; nevertheless, it is usually clear what they are trying to say.
<Creative writing>	
-	Can describe everyday aspects of their environment e.g. people, places, a job or study experience in linked sentences.
-	Can give very short, basic descriptions of events, past activities and personal experiences.
-	Can tell a simple story (e.g. about events on a holiday or about life in the distant future).

[Score 2] 
<Overall written production>
-	Can produce a series of simple phrases and sentences linked with simple connectors like “and”, “but” and “because”.
<Creative writing>
-	Can produce a series of simple phrases and sentences about their family, living conditions, educational background, or present or most recent job.
-	Can create short, simple imaginary biographies and simple poems about people.
-	Can create diary entries that describe activities (e.g. daily routine, outings, sports, hobbies), people and places, using basic, concrete vocabulary and simple phrases and sentences with simple connectives like “and”, “but” and “because”.
-	Can compose an introduction to a story or continue a story, provided they can consult a dictionary and references (e.g. tables of verb tenses in a course book).
<Reports and essays>	
-	Can produce simple texts on familiar subjects of interest, linking sentences with connectors like “and”, “because” or “then”.
-	Can give their impressions and opinions about topics of personal interest (e.g. lifestyles and culture, stories), using basic everyday vocabulary and expressions.

[Score 1] 
<Overall written production>
-	Can give information about matters of personal relevance (e.g. likes and dislikes, family, pets) using simple words/signs and basic expressions.
-	Can produce simple isolated phrases and sentences.
<Vocabulary> 
-	Has a basic vocabulary repertoire of words/signs and phrases related to particular concrete situations.
<Grammatical Accuracy>
-	Shows only limited control of a few simple grammatical structures and sentence patterns in a learnt repertoire.
<Creative writing>	
-	Can produce simple phrases and sentences about themselves and imaginary people, where they live and what they do.
-	Can describe in very simple language what a room looks like.
-	Can use simple words/signs and phrases to describe certain everyday objects (e.g. the colour of a car, whether it is big or small).


ESSAY:
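
Each training example pairs this prompt with one essay and its human score. The validation output later in this document reports three messages per example (system, user, assistant), and the scoring output shows replies such as `Score 3`. Based on that, one record can be sketched as follows. The helper names `build_example` and `to_jsonl`, the placeholder texts, and the exact `Score N` wording of the assistant message are assumptions for illustration; they are not taken from the original data files.

```python
import json

def build_example(prompt: str, essay: str, score: int) -> dict:
    """Return one chat-format fine-tuning example (hypothetical helper)."""
    if not 1 <= score <= 9:
        raise ValueError("score must be between 1 and 9")
    return {
        "messages": [
            {"role": "system", "content": prompt},               # rubric prompt
            {"role": "user", "content": essay},                  # learner essay
            {"role": "assistant", "content": f"Score {score}"},  # human rating
        ]
    }

def to_jsonl(examples: list) -> str:
    """Serialise examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

# Placeholder inputs; the real system prompt is the rubric text above.
example = build_example(
    "Evaluate the quality of an essay ...",
    "i think physical punishment ...",
    3,
)
jsonl_text = to_jsonl([example])
```

Writing `jsonl_text` to a `.jsonl` file gives a file in the shape that the validation cells below expect.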

## Fine-tuning data format validation
The checks in this section follow the OpenAI cookbook example at https://cookbook.openai.com/examples/chat_finetuning_data_prep#data-warnings-and-token-counts

In [41]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [None]:
data_path = "data/yelc_physical_training_723.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

### Format Validation 


In [54]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


### Data Warnings and Token Counts

In [56]:
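# Warnings and token counts
# num_tokens_from_messages, num_assistant_tokens_from_messages and print_distribution
# are helper functions from the cookbook example linked above; define them before running this cell.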
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 2156, 2687
mean / median: 2418.9253112033193, 2445.0
p5 / p95: 2293.0, 2503.8

#### Distribution of num_assistant_tokens_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


## Upload Training File

In [None]:
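import openai  # legacy interface; requires openai<1.0

openai.api_key = "YOUR_API_KEY"

# Upload the training file. The "id" field of the returned object is the
# file ID to pass to the fine-tuning job.
with open("yelc_physical_training_723.jsonl", "rb") as f:
    uploaded = openai.File.create(file=f, purpose="fine-tune")
uploaded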

## Create a fine-tuned model

In [None]:
# Replace "file_ID" with the ID returned by the upload step.
openai.FineTuningJob.create(training_file="file_ID", model="gpt-3.5-turbo")
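
The fine-tuned model ID used in the scoring cells only becomes available once the job has finished. A minimal sketch of waiting for that, assuming the legacy SDK's `openai.FineTuningJob.retrieve` is passed in as `retrieve`; the function name `wait_for_job`, the polling interval and the injected `sleep` are illustrative choices, not part of the original notebook.

```python
import time

# Statuses after which a fine-tuning job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status.

    `retrieve` is any callable mapping a job ID to a dict-like job object
    with a "status" key, e.g. openai.FineTuningJob.retrieve (legacy SDK).
    """
    while True:
        job = retrieve(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)

# Usage (needs network access and a real job ID, so it is left commented out):
# job = wait_for_job(openai.FineTuningJob.retrieve, "ftjob-...")
# print(job["status"], job["fine_tuned_model"])
```

On success, the job's `fine_tuned_model` field holds the model name to use in place of `Your_finetuned_Model_ID`.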

## 1st Scoring with Fine-tuned ChatGPT

In [19]:
import os
import openai
import time


openai.api_key = "YOUR_API_KEY"

In [42]:
import pandas as pd

df = pd.read_csv("yelc_physical_test_181_input.csv")
df

Unnamed: 0,EssayID,role,Instruction,role.1,Essay,role.2,Score
0,21,system,Evaluate the quality of an essay written by En...,user,if physical punishment not be allowed in all s...,assistant,3
1,46,system,Evaluate the quality of an essay written by En...,user,i don't agree with the physical punishment of ...,assistant,3
2,99,system,Evaluate the quality of an essay written by En...,user,"no, i think physical punishment should be all...",assistant,4
3,9,system,Evaluate the quality of an essay written by En...,user,there are some advantages and disadvantages of...,assistant,2
4,166,system,Evaluate the quality of an essay written by En...,user,i agee with the opinion that physical punishme...,assistant,6
...,...,...,...,...,...,...,...
176,122,system,Evaluate the quality of an essay written by En...,user,it has been debated whether physical punishmen...,assistant,5
177,152,system,Evaluate the quality of an essay written by En...,user,there has been an ongoing debate about whether...,assistant,6
178,33,system,Evaluate the quality of an essay written by En...,user,many people think physical punishment shouldn'...,assistant,3
179,167,system,Evaluate the quality of an essay written by En...,user,there are many controversies about physical pu...,assistant,6


In [20]:
# Finetuned Model 

df["FineTune_Score"] = None

for index, row in df.iterrows():
    Instruction = row['Instruction']
    Essay = row['Essay']

    completion = openai.ChatCompletion.create(
        model="Your_finetuned_Model_ID",
        messages=[
            {"role": "system", "content": Instruction},
            {"role": "user", "content": Essay}
        ] 
    )
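    # Note: this stores the whole message object (role and content), as the output below shows.
    # Use completion.choices[0].message["content"] to keep only the reply text, e.g. "Score 3".
    Score = completion.choices[0].message
    df.at[index, "FineTune_Score"] = Score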

output_csv_file = "yelc_physical_test_181_fineTune_Scored.csv"
df.to_csv(output_csv_file, index=False)
df

Unnamed: 0,EssayID,role,Instruction,role.1,Essay,role.2,Score,FineTune_Score
0,21,system,Evaluate the quality of an essay written by En...,user,if physical punishment not be allowed in all s...,assistant,3,"{'role': 'assistant', 'content': 'Score 3'}"
1,46,system,Evaluate the quality of an essay written by En...,user,i don't agree with the physical punishment of ...,assistant,3,"{'role': 'assistant', 'content': 'Score 3'}"
2,99,system,Evaluate the quality of an essay written by En...,user,"no, i think physical punishment should be all...",assistant,4,"{'role': 'assistant', 'content': 'Score 4'}"
3,9,system,Evaluate the quality of an essay written by En...,user,there are some advantages and disadvantages of...,assistant,2,"{'role': 'assistant', 'content': 'Score 4'}"
4,166,system,Evaluate the quality of an essay written by En...,user,i agee with the opinion that physical punishme...,assistant,6,"{'role': 'assistant', 'content': 'Score 6'}"
...,...,...,...,...,...,...,...,...
176,122,system,Evaluate the quality of an essay written by En...,user,it has been debated whether physical punishmen...,assistant,5,"{'role': 'assistant', 'content': 'Score 6'}"
177,152,system,Evaluate the quality of an essay written by En...,user,there has been an ongoing debate about whether...,assistant,6,"{'role': 'assistant', 'content': 'Score 6'}"
178,33,system,Evaluate the quality of an essay written by En...,user,many people think physical punishment shouldn'...,assistant,3,"{'role': 'assistant', 'content': 'Score 3'}"
179,167,system,Evaluate the quality of an essay written by En...,user,there are many controversies about physical pu...,assistant,6,"{'role': 'assistant', 'content': 'Score 5'}"
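
The `FineTune_Score` column above holds the whole reply message, while the input file read in the next section holds a bare integer in that column. The notebook does not show how the conversion was done. One way to do it is sketched here; `extract_score` is a hypothetical helper, and the fallback to the first standalone digit is an assumption about how loosely formatted replies should be handled.

```python
import re

def extract_score(value):
    """Pull an integer score (1-9) out of a stored model reply.

    Accepts a message dict, or the string form saved in the CSV,
    such as "{'role': 'assistant', 'content': 'Score 3'}".
    Returns None when no score can be found.
    """
    if isinstance(value, dict):
        text = str(value.get("content", ""))
    else:
        text = str(value)
    match = re.search(r"Score\W*([1-9])\b", text, flags=re.IGNORECASE)
    if match is None:
        # Fall back to the first standalone digit between 1 and 9.
        match = re.search(r"\b([1-9])\b", text)
    return int(match.group(1)) if match else None

# Usage with the DataFrame from the cell above (left commented out):
# df["FineTune_Score"] = df["FineTune_Score"].map(extract_score)
```

Returning `None` instead of raising keeps unparseable replies visible as missing values, so they can be inspected by hand.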


## Scoring with Original GPT 3.5-turbo Model

In [48]:
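import pandas as pd

# This input file holds the first-round results with FineTune_Score already reduced
# to a bare integer and an added NO column (see the table below); it is not the CSV
# written by the previous cell.
df_tuned = pd.read_csv("yelc_physical_test_181_finetune_input.csv")
df_tuned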

Unnamed: 0,NO,EssayID,role,Instruction,role.1,Essay,role.2,Score,FineTune_Score
0,1,21,system,Evaluate the quality of an essay written by En...,user,if physical punishment not be allowed in all s...,assistant,3,3
1,2,46,system,Evaluate the quality of an essay written by En...,user,i don't agree with the physical punishment of ...,assistant,3,3
2,3,99,system,Evaluate the quality of an essay written by En...,user,"no, i think physical punishment should be all...",assistant,4,4
3,4,9,system,Evaluate the quality of an essay written by En...,user,there are some advantages and disadvantages of...,assistant,2,4
4,5,166,system,Evaluate the quality of an essay written by En...,user,i agee with the opinion that physical punishme...,assistant,6,6
...,...,...,...,...,...,...,...,...,...
176,177,122,system,Evaluate the quality of an essay written by En...,user,it has been debated whether physical punishmen...,assistant,5,6
177,178,152,system,Evaluate the quality of an essay written by En...,user,there has been an ongoing debate about whether...,assistant,6,6
178,179,33,system,Evaluate the quality of an essay written by En...,user,many people think physical punishment shouldn'...,assistant,3,3
179,180,167,system,Evaluate the quality of an essay written by En...,user,there are many controversies about physical pu...,assistant,6,5


In [None]:
import os
import openai
import time

openai.api_key = "Your_API_KEY"

df_tuned["ChatGPT_baseline_Score"] = None

for index, row in df_tuned.iterrows():
    Instruction = row['Instruction']
    Essay = row['Essay']

    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": Instruction},
            {"role": "user", "content": Essay}
        ] 
    )
    Score = completion.choices[0].message
    df_tuned.at[index, "ChatGPT_baseline_Score"] = Score

    print(index)  # progress indicator

    time.sleep(60)  # wait 60 seconds before sending the next request

output_csv_file = "yelc_physical_test_181_fineTune_Scored_baselineCompare.csv"
df_tuned.to_csv(output_csv_file, index=False)
df_tuned
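
The loop above waits a fixed 60 seconds after every request. If the pause is there to avoid rate-limit errors (the notebook does not say), an alternative is to retry only when a request fails, with a growing delay. The sketch below is one such pattern; `call_with_backoff` and its parameters are hypothetical, and the exception types to retry on are passed in by the caller rather than assumed.

```python
import time

def call_with_backoff(func, retry_on=(Exception,), max_attempts=5,
                      base_delay=2.0, sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Usage with the legacy SDK (left commented out; needs network access):
# completion = call_with_backoff(
#     lambda: openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages),
#     retry_on=(openai.error.RateLimitError,),
# )
```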

## 2nd Scoring with Fine-tuned ChatGPT

In [34]:
import pandas as pd

df_reliable = pd.read_csv("yelc_physical_test_181_fineTune_Scored_baselineCompare.csv")
df_reliable

Unnamed: 0,NO,EssayID,role,Instruction,role.1,Essay,role.2,Score,FineTune_Score,ChatGPT_baseline_Score
0,1,21,system,Evaluate the quality of an essay written by En...,user,if physical punishment not be allowed in all s...,assistant,3,3,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
1,2,46,system,Evaluate the quality of an essay written by En...,user,i don't agree with the physical punishment of ...,assistant,3,3,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
2,3,99,system,Evaluate the quality of an essay written by En...,user,"no, i think physical punishment should be all...",assistant,4,4,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
3,4,9,system,Evaluate the quality of an essay written by En...,user,there are some advantages and disadvantages of...,assistant,2,4,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
4,5,166,system,Evaluate the quality of an essay written by En...,user,i agee with the opinion that physical punishme...,assistant,6,6,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
...,...,...,...,...,...,...,...,...,...,...
176,177,122,system,Evaluate the quality of an essay written by En...,user,it has been debated whether physical punishmen...,assistant,5,6,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
177,178,152,system,Evaluate the quality of an essay written by En...,user,there has been an ongoing debate about whether...,assistant,6,6,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
178,179,33,system,Evaluate the quality of an essay written by En...,user,many people think physical punishment shouldn'...,assistant,3,3,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."
179,180,167,system,Evaluate the quality of an essay written by En...,user,there are many controversies about physical pu...,assistant,6,5,"{\n ""role"": ""assistant"",\n ""content"": ""Score..."


In [None]:
import os
import openai
import time

openai.api_key = "Your_API_KEY"

df_reliable["FineTune_2nd"] = None

for index, row in df_reliable.iterrows():
    Instruction = row['Instruction']
    Essay = row['Essay']

    completion = openai.ChatCompletion.create(
        model="Your_finetuned_Model_ID",
        messages=[
            {"role": "system", "content": Instruction},
            {"role": "user", "content": Essay}
        ] 
    )
    Score = completion.choices[0].message
    df_reliable.at[index, "FineTune_2nd"] = Score

output_csv_file = "yelc_physical_test_181_fineTune_Scored_baselineCompare_fineTune2nd.csv"
df_reliable.to_csv(output_csv_file, index=False)
df_reliable
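
Scoring the same essays a second time with the same model makes it possible to check how consistent the model is with itself. The notebook does not show which agreement statistic was computed. The sketch below uses exact agreement and quadratic weighted kappa, a statistic commonly reported in automated essay scoring; both function names are hypothetical, and the scores are assumed to have been converted to integers from 1 to 9 already.

```python
def exact_agreement(a, b):
    """Share of essays that received the same score in both runs."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b, min_score=1, max_score=9):
    """Quadratic weighted kappa between two lists of integer scores."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    k = max_score - min_score + 1
    n = len(a)
    # Observed confusion matrix between the two runs.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            # Expected count if the two runs were independent.
            den += weight * hist_a[i] * hist_b[j] / n
    # den is zero only when both runs gave every essay the same single score.
    return 1.0 if den == 0 else 1.0 - num / den

first_run = [3, 3, 4, 2]
second_run = [3, 3, 4, 4]
print(exact_agreement(first_run, second_run))
print(quadratic_weighted_kappa(first_run, second_run))
```

The same two functions can also compare either model's scores against the human `Score` column.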

## Scoring Other Topics (driving, medical) with Fine-tuned Model
The fine-tuned model is also applied to essays on two other topics, driving and medical, with 100 essays each.

In [3]:
# driving 100

import pandas as pd

df_driving = pd.read_excel("yelc_driving_100_testing_input.xlsx")
df_driving

Unnamed: 0,No,EssayID,role,Instruction,role.1,Essay,role.2,Score
0,1,3260,system,"""Evaluate the quality of an essay written by E...",user,should drivers use no whiling phones becuse di...,assistant,1
1,2,139,system,"""Evaluate the quality of an essay written by E...",user,Today's drivers are use their own cellular pho...,assistant,1
2,3,1207,system,"""Evaluate the quality of an essay written by E...",user,yes i think should drivers of automobiles not ...,assistant,1
3,4,246,system,"""Evaluate the quality of an essay written by E...",user,drivers of automobiles can use cellualr phones...,assistant,2
4,5,825,system,"""Evaluate the quality of an essay written by E...",user,Using cellular phones while driving is very da...,assistant,2
...,...,...,...,...,...,...,...,...
95,96,3090,system,"""Evaluate the quality of an essay written by E...",user,Since talking to cellular phones distracts peo...,assistant,8
96,97,2552,system,"""Evaluate the quality of an essay written by E...",user,"Recently, I saw many people talking on cellula...",assistant,8
97,98,129,system,"""Evaluate the quality of an essay written by E...",user,I think that drivers of automobiles should not...,assistant,8
98,99,2766,system,"""Evaluate the quality of an essay written by E...",user,I strongly say that drivers should not be allo...,assistant,8


In [None]:
# driving testing
import os
import openai
import time

openai.api_key = "Your_API_Key"

df_driving["finetune_driving_score"] = None

for index, row in df_driving.iterrows():
    Instruction = row['Instruction']
    Essay = row['Essay']

    completion = openai.ChatCompletion.create(
        model="Your_finetuned_Model_ID",
        messages=[
            {"role": "system", "content": Instruction},
            {"role": "user", "content": Essay}
        ] 
    )
    Score = completion.choices[0].message
    df_driving.at[index, "finetune_driving_score"] = Score

output_csv_file = "yelc_driving_100_fineTune_test_scores.csv"
df_driving.to_csv(output_csv_file, index=False)
df_driving


In [31]:
# medical 100

import pandas as pd

df_medical = pd.read_excel("yelc_medical_100_testing_input.xlsx")
df_medical

Unnamed: 0,No,EssayID,role,Instruction,role.1,Essay,role.2,Score
0,331,2378,system,"""Evaluate the quality of an essay written by E...",user,Yes. Because We can`t medical experiments to p...,assistant,1
1,332,63,system,"""Evaluate the quality of an essay written by E...",user,I think. . . . animais should be used in medic...,assistant,1
2,333,21,system,"""Evaluate the quality of an essay written by E...",user,I should'nt animals be used in medical experem...,assistant,1
3,334,1145,system,"""Evaluate the quality of an essay written by E...",user,I disagree is animals used in medical experime...,assistant,1
4,335,15,system,"""Evaluate the quality of an essay written by E...",user,I agree a animals be used in medical experimen...,assistant,1
...,...,...,...,...,...,...,...,...
95,427,1718,system,"""Evaluate the quality of an essay written by E...",user,"The higher medical technology develops, the mo...",assistant,7
96,428,2806,system,"""Evaluate the quality of an essay written by E...",user,It has been controversial whether animals shou...,assistant,7
97,417,2546,system,"""Evaluate the quality of an essay written by E...",user,Using animals for the purpose of testing drugs...,assistant,8
98,418,1619,system,"""Evaluate the quality of an essay written by E...",user,A growing number of people believe that anima...,assistant,8


In [None]:
import os
import openai
import time

openai.api_key = "Your_API_Key"

df_medical["finetune_medical_score"] = None

for index, row in df_medical.iterrows():
    Instruction = row['Instruction']
    Essay = row['Essay']

    completion = openai.ChatCompletion.create(
        model="Your_finetuned_Model_ID",
        messages=[
            {"role": "system", "content": Instruction},
            {"role": "user", "content": Essay}
        ] 
    )
    Score = completion.choices[0].message
    df_medical.at[index, "finetune_medical_score"] = Score

output_xlsx_file = "yelc_medical_100_fineTune_test_scores.xlsx"
df_medical.to_excel(output_xlsx_file, index=False)
df_medical
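
The scoring cells in this notebook repeat the same loop with a different DataFrame, model and output column each time. A minimal sketch of factoring that loop into one function follows. `score_essays` is a hypothetical helper; it takes the completion function as an argument (for example `openai.ChatCompletion.create` from the legacy SDK) and, unlike the cells above, keeps only the reply text rather than the whole message object.

```python
def score_essays(rows, model_id, create_completion):
    """Return the model's reply text for each row.

    rows: iterable of dicts with "Instruction" and "Essay" keys
    create_completion: callable accepting `model` and `messages` keyword
        arguments and returning a dict-like chat completion response
    """
    replies = []
    for row in rows:
        completion = create_completion(
            model=model_id,
            messages=[
                {"role": "system", "content": row["Instruction"]},
                {"role": "user", "content": row["Essay"]},
            ],
        )
        replies.append(completion["choices"][0]["message"]["content"])
    return replies

# Usage with the driving essays (left commented out; needs network access):
# df_driving["finetune_driving_score"] = score_essays(
#     df_driving.to_dict("records"), "Your_finetuned_Model_ID", openai.ChatCompletion.create
# )
```

Passing the completion function in also makes it possible to try the loop with a stand-in function before spending API calls.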

