## before and after finetuning for gpt-3

In [8]:
!pip install BeautifulSoup4 --quiet

In [72]:
!pip install openai --quiet

In [None]:
!pip install evaluate --queit

In [None]:
!pip install bert_score --quiet

In [56]:
import pandas as pd
from google.colab import userdata
import pandas as pd
from openai import OpenAI
from bs4 import BeautifulSoup
import re
from sklearn.metrics import accuracy_score, mean_squared_error
from evaluate import load

In [1]:
prompt = """
you are an IELTS examiner. your task is to evaluate a writing section in an IELTS academic
exam. you have to provide overall band score in <BAND_SCORE> </BAND_SCORE> tags and detailed evaluation in <EVALUATION></EVALUATION> tags . I will provide you the grading
criteria in <CRITERIA> </CRITERIA> tags. The user will send you the task and his answer and you should respond with a feedback on how well does the user follow the grading criteria and his score. Provide his score in this format <Score>Score</Score>.
<CRITERIA>
TASK RESPONSE (TR)
For Task 2 of both AC and GT Writing tests, candidates are required to formulate and
develop a position in relation to a given prompt in the form of a question or
statement, using a minimum of 250 words. Ideas should be supported by evidence,
and examples may be drawn from a candidate’s own experience.
The TR criterion assesses:
▪ how fully the candidate responds to the task.
▪ how adequately the main ideas are extended and supported.
▪ how relevant the candidate’s ideas are to the task.
▪ how clearly the candidate opens the discourse, establishes their position and
formulates conclusions.
▪ how appropriate the format of the response is to the task.
COHERENCE AND COHESION (CC)
This criterion is concerned with the overall organisation and logical development of
the message: how the response organises and links information, ideas and language.
Coherence refers to the linking of ideas through logical sequencing, while cohesion
refers to the varied and appropriate use of cohesive devices (e.g. logical connectors,
conjunctions and pronouns) to assist in making clear the relationships between and
within sentences.
The CC criterion assesses:
▪ the coherence of the response via the logical organisation of information
and/or ideas, or the logical progression of the argument.
▪ the appropriate use of paragraphing for topic organisation and presentation.
▪ the logical sequencing of ideas and/or information within and across
paragraphs.
▪ the flexible use of reference and substitution (e.g. definite articles, pronouns).
▪ the appropriate use of discourse markers to clearly mark the stages in a
response, e.g. [First of all | In conclusion], and to signal the relationship between ideas and/or information, e.g. [as a result | similarly].

LEXICAL RESOURCE (LR)
This criterion refers to the range of vocabulary the candidate has used and the
accuracy and appropriacy of that use in terms of the specific task.
The LR criterion assesses:
▪ the range of general words used (e.g. the use of synonyms to avoid repetition).
▪ the adequacy and appropriacy of the vocabulary (e.g. topic-specific items,
indicators of writer’s attitude).
▪ the precision of word choice and expression.
▪ the control and use of collocations, idiomatic expressions and sophisticated
phrasing.
▪ the density and communicative effect of errors in spelling.
▪ the density and communicative effect of errors in word formation.
GRAMMATICAL RANGE AND ACCURACY (GRA)
This criterion refers to the range and accurate use of the candidate’s grammatical
resource via the candidate’s writing at sentence level.
The GRA criterion assesses:
▪ the range and appropriacy of structures used in a given response (e.g. simple,
compound and complex sentences).
▪ the accuracy of simple, compound and complex sentences.
▪ the density and communicative effect of grammatical errors.
▪ the accurate and appropriate use of punctuation.
</CRITERIA>
"""

In [3]:
# get responses before finetuning
def get_responses(model_name, df, client, prompt):
  responses = []
  for index, row in df.iterrows():
    message =  [{"role": "system", "content": prompt},
              {"role": "user", "content": f"Here is the task:\n <Task>{row['Question']}</Task> \n And here is my answer: \n <Answer>{row['Answer']}</Answer>"},
            ]
    response = client.chat.completions.create(
      messages=message,
      model = model_name,
    )
    responses.append(response)
  return responses


In [67]:
def get_squared_error(responses, y_true):
  scores=[]
  feedbacks=[]

  for response in responses:
    match = re.search(r'<score>(.*?)</score>', responses[0].choices[0].message.content.lower())
    scores.append(match.group(1))

  y_true = [float(x) for x in y_true]
  y_pred = [float(x) for x in scores]

  m = mean_squared_error(y_pred = y_pred, y_true=y_true)
  return m

In [None]:
def get_semantic_similarity(responses, feedbacks, scores):
  bertscore = load("bertscore")
  predictions = [response.choices[0].message.content for response in responses]
  references = [f"<Score>{score}</Score> \n feedback : {feedback}" for score, feedback in zip(scores, feedbacks)]

  results = bertscore.compute(predictions=predictions, references=references, lang="en")
  return sum(results['f1']) / len(results['f1'])


In [71]:
# BEFORE FINETUNING
api_key = userdata.get("openai_api_key")
client = OpenAI(api_key=api_key)
model_name = "gpt-3.5-turbo"
df = pd.read_csv('ielts_buddy_test_dataset.csv')
responses = get_responses(model_name, df, client, prompt)
mse = get_squared_error(responses, df["Final Score"].tolist())
semantic_similarity = get_semantic_similarity(responses, df["Feedback"], df["Final Score"])
print("mean_squared_error after finetuning: ", mse)
print("semantic_similarity: after finetuning", semantic_similarity)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mean_squared_error after finetuning:  0.5
semantic_similarity: after finetuning 0.8347912430763245


In [69]:
# AFTER FINETUNING
api_key = userdata.get("openai_api_key")
client = OpenAI(api_key=api_key)
model_name = userdata.get("finetuned_model")
df = pd.read_csv('ielts_buddy_test_dataset.csv')
#responses = get_responses(model_name, df, client, prompt)
mse = get_squared_error(responses, df["Final Score"].tolist())
semantic_similarity = get_semantic_similarity(responses, df["Feedback"], df["Final Score"])
print("mean_squared_error after finetuning: ", mse)
print("semantic_similarity: after finetuning", semantic_similarity)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


mean_squared_error:  0.35
semantic_similarity:  0.845073914527893
