# OpenAI Models Evaluation

In the previous notebook ([OpenAIFineTuning](e_OpenAIFineTuning.ipynb)), we fine-tuned the `Davinci` and `GPT3.5` models on a dataset generated by `GPT4`, as described in the notebook [DatasetCreation](a_DatasetCreation.ipynb). In this notebook, we evaluate the **zero-shot** and **fine-tuned** performance of the `Davinci` and `GPT3.5` models on our French text simplification task.

In [1]:
# ---------------------------- PREPARING NOTEBOOK ---------------------------- #
# Autoreload
%load_ext autoreload
%autoreload 2

# Random seed
import numpy as np
np.random.seed(42)

# External modules
import os
from IPython.display import display

# Set global log level
import logging
logging.basicConfig(level=logging.INFO)
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Define PWD as the current git repository
import git
repo = git.Repo('.', search_parent_directories=True)
pwd = repo.working_dir
os.chdir(pwd)

# import

In [2]:
# -------------------------- LOAD PREVIOUS NOTEBOOKS ------------------------- #
import json
import __main__
import black

paths = [
    os.path.join(pwd, "notebooks", "text_simplification", "c_MistralEvaluation.ipynb"),
    os.path.join(pwd, "notebooks", "text_simplification", "e_OpenAIFineTuning.ipynb"),
]

# Read notebooks
code_dict = {}
for path in paths:
    code = ""
    with open(path, "r") as f:
        temp = json.load(f)

    cells = [
        cell
        for cell in temp["cells"]
        if cell["cell_type"] == "code"
        and len(cell["source"]) > 0
        and cell["source"][-1] == "# import"
    ]
    notebook_code = "\n".join(
        line
        for cell in cells
        for line in cell["source"]
        if line != "# import" and len(line) > 0 and line[0] != "%"
    )
    # Create something like a header
    code += f"# {'-'*76} #\n"
    code += f"# {os.path.basename(path).upper():^76} #\n"
    code += f"# {'-'*76} #\n"
    code += notebook_code

    # Add "Module Creation"
    notebook_name = (
        os.path.basename(path).replace("imported_", "").replace(".ipynb", "")
    )
    code += """
# --------------------------------- IMPORTER --------------------------------- #
import types


class MyNotebook:
    pass


NOTEBOOK_NAME = MyNotebook()
# Put every function defined in the notebook in the class
NOTEBOOK_NAME.__dict__.update(
    {
        name: obj
        for name, obj in locals().items()
        if isinstance(obj, (type, types.FunctionType))
        if not (name.startswith("_") or name == "MyNotebook")
    }
)
    """.replace(
        "NOTEBOOK_NAME", notebook_name
    )

    # Remove empty lines
    code = "\n".join([line for line in code.split("\n") if len(line) > 0])
    # Format code
    code = black.format_str(code, mode=black.FileMode())

    # Write scratch file
    path = os.path.join(
        pwd, "scratch", f"imported_{os.path.basename(path).replace('ipynb', 'py')}"
    )
    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, "w") as f:
        f.write(code)
    code_dict[path] = code


# Execute the generated code in the __main__ namespace so its names are available here
for path, code in code_dict.items():
    compiled = compile(code, path, "exec")
    exec(compiled, __main__.__dict__)

# import

## Loading data

For our evaluation, we use the `french_difficulty` and `sentences` datasets from our *Estimation of difficulty* task, because both are labelled with CEFR levels. We build a test set of 1000 sentences (*100 per level per dataset, excluding A1*). A1 is left out because the prompts ask for a simplification to the level below, and there is no level below A1.

In [3]:
# ------------------------------- LOADING DATA ------------------------------- #
test_df = c_MistralEvaluation.get_balanced_dataframe(
    c_MistralEvaluation.download_difficulty_estimation(pwd), nbr=100
)
test_df.columns = ["Original", "Difficulty"]
test_df.value_counts("Difficulty")

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

Difficulty
A2    200
B1    200
B2    200
C1    200
C2    200
Name: count, dtype: int64
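
The balanced counts above come from `get_balanced_dataframe`, which is defined in `c_MistralEvaluation` and not shown here. As a rough illustration of the idea only, and not that function's implementation, the sketch below draws a fixed number of sentences per CEFR level from a toy corpus while skipping A1. The name `balanced_sample` and the placeholder sentences are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_level, excluded=("A1",), seed=42):
    """Pick `per_level` random rows for every label that is not in `excluded`."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for sentence, level in rows:
        if level not in excluded:
            by_level[level].append(sentence)

    sample = []
    for level in sorted(by_level):
        # random.sample raises ValueError if a level has fewer than `per_level` rows
        chosen = rng.sample(by_level[level], per_level)
        sample.extend((sentence, level) for sentence in chosen)
    return sample


# Toy corpus: 150 placeholder sentences per CEFR level
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
toy_rows = [(f"phrase {i} ({lvl})", lvl) for lvl in levels for i in range(150)]

sample = balanced_sample(toy_rows, per_level=100)
print(len(sample))  # 5 levels x 100 sentences for one dataset
```

Applying the same draw to two datasets gives the 2 x 500 = 1000 sentences used in this notebook.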

## Zero-shot evaluation

We start by evaluating the zero-shot performance of the OpenAI models, that is, the base models without the fine-tuning performed in the previous notebook ([OpenAIFineTuning](e_OpenAIFineTuning.ipynb)).

In [8]:
# ----------------------- ZERO-SHOT EVALUATION FUNCTION ---------------------- #
from tqdm import notebook as notebook_tqdm
import openai
import pandas as pd

# Connect to OpenAI
e_OpenAIFineTuning.connect_to_openai()

import signal, time


class Timeout:
    """Context manager raising Timeout.Timeout after `sec` seconds (SIGALRM: Unix, main thread only)"""

    class Timeout(Exception):
        pass

    def __init__(self, sec):
        self.sec = sec

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.raise_timeout)
        signal.alarm(self.sec)

    def __exit__(self, *args):
        signal.alarm(0)  # disable alarm

    def raise_timeout(self, *args):
        raise Timeout.Timeout()


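def evaluate_openai(inputs: pd.Series, model: str, context: str):
    """Query `model` once per input and return the raw completions as a one-column DataFrame.

    Inputs that fail or exceed the 15-second timeout are recorded as "Error".
    """
    # Compute predictions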
    predictions = []
    for text in notebook_tqdm.tqdm(inputs):
        try:
            with Timeout(15):
                if "gpt" in model:
                    response = openai.ChatCompletion.create(
                        model=model,
                        messages=[
                            {"role": "system", "content": context},
                            {"role": "user", "content": text},
                        ],
                        max_tokens=len(text) * 2,
                    )
                    prediction = response.choices[0]["message"]["content"].strip()
                else:
                    response = openai.Completion.create(
                        model=model,
                        prompt=f"{context}{text}",
                        max_tokens=len(text) * 2,
                    )
                    prediction = response.choices[0].text.strip()
        except Exception as e:
            print(e)
            print(f"Error with text: {text}")
            print("Skipping...")
            predictions.append("Error")
            continue

        predictions.append(prediction)
        # Checkpoint the predictions so far, in case the run is interrupted
        pd.DataFrame(predictions).to_csv(
            os.path.join(pwd, "scratch", "openai_predictions.csv"), index=False
        )

    return pd.DataFrame(predictions)


# import
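
The `Timeout` class above relies on `SIGALRM`, which is only available on Unix and only works in the main thread. Where that is a problem, a different technique can give a similar "skip slow calls" behaviour: running the call in a worker thread and waiting on its future with a timeout. The sketch below is an alternative under that assumption, not the approach used in this notebook; `call_with_timeout` is a hypothetical helper, and a worker that times out is not killed but keeps running in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, seconds, *args, fallback="Error"):
    """Run func(*args); return `fallback` if it raises or takes longer than `seconds`."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return fallback  # the call is still running, but we stop waiting for it
    except Exception:
        return fallback  # the call itself failed
    finally:
        # Do not block on a stuck worker thread
        executor.shutdown(wait=False)


def slow_echo(text):
    """Stand-in for a slow API call."""
    time.sleep(0.5)
    return text


print(call_with_timeout(slow_echo, 0.1, "bonjour"))  # gives up after 0.1 s
print(call_with_timeout(str.upper, 1, "bonjour"))  # completes in time
```

The fallback value `"Error"` mirrors the placeholder that `evaluate_openai` appends when a request is skipped.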

In [9]:
# ------------------- FORMAT AND SAVE PREDICTIONS FUNCTION ------------------- #
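def format_and_save_predictions(predictions: pd.DataFrame, model: str):
    """Save the raw predictions plus a formatted copy paired with the original sentences."""
    # Format predictions: keep the first line, or the text up to the last period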
    predictions_df = pd.concat(
        [
            test_df["Original"].reset_index(drop=True),  # pair rows by position
            predictions.iloc[:, 0]
            .str.extract(r"(.*[\.\n])")
            .iloc[:, 0]
            .rename("Simplified")
            .str.strip(),
        ],
        axis=1,
    )

    # Create save path
    path = os.path.join(pwd, "results", "text_simplification", "OpenAIEvaluation")
    if not os.path.exists(path):
        os.makedirs(path)

    # Save original predictions
    predictions.to_csv(os.path.join(path, f"{model}_predictions.csv"), index=False)

    # Save formatted predictions
    predictions_df.to_csv(
        os.path.join(path, f"{model}_formatted_predictions.csv"), index=False
    )
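
To see what the pattern `(.*[\.\n])` in `format_and_save_predictions` keeps, the sketch below applies it with the standard `re` module. `Series.str.extract` returns the first match per element, so `re.search` should behave the same way on a single string, with `None` here standing in for the `NaN` that pandas produces. The helper name `first_sentence` and the sample strings are illustrative.

```python
import re

# Same pattern as in format_and_save_predictions
PATTERN = re.compile(r"(.*[\.\n])")


def first_sentence(prediction):
    """Return the stripped first match of PATTERN, or None when nothing matches."""
    match = PATTERN.search(prediction)
    return match.group(1).strip() if match else None


examples = [
    "Il dort. Il rêve",  # single line: cut after the last period
    "Il dort.\nVoici une autre phrase.",  # several lines: only the first is kept
    "Il dort\nencore",  # no period: the first line is still kept
    "Error",  # placeholder for a skipped input: no match
]
for text in examples:
    print(repr(text), "->", repr(first_sentence(text)))
```

One consequence is that the `"Error"` placeholder, which contains neither a period nor a line break, ends up as a missing value in the formatted file, while it stays visible in the raw predictions file.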

In [10]:
# ----------------------------- DAVINCI ZERO-SHOT ---------------------------- #
# Make predictions
predictions = evaluate_openai(
    test_df.apply(
        lambda row: e_OpenAIFineTuning.create_davinci_conversation(row, training=False),
        axis=1,
    ),
    "davinci-002",
    "",
)

# Format and save predictions
format_and_save_predictions(predictions, "davinci-002-zero-shot")

  0%|          | 0/1000 [00:00<?, ?it/s]


Error with text: Voici une phrase en français de niveau CECRL B1 à simplifier :\n'''La science en fait partie.'''\nDonne moi une phrase simplifiée au niveau CECRL A2 tout en conservant au maximum son sens original
Skipping...


In [11]:
# ------------------------------- GPT ZERO-SHOT ------------------------------ #
# Make predictions
predictions = evaluate_openai(
    test_df.apply(
        lambda row: e_OpenAIFineTuning.create_davinci_conversation(row, training=False),
        axis=1,
    ),
    "gpt-3.5-turbo-1106",
    "",
)

# Format and save predictions
format_and_save_predictions(predictions, "gpt-3.5-turbo-1106-zero-shot")

  0%|          | 0/1000 [00:00<?, ?it/s]


Error with text: Voici une phrase en français de niveau CECRL B1 à simplifier :\n'''Il s’est endormi… Et c’est ainsi que je fis la connaissance du petit prince.'''\nDonne moi une phrase simplifiée au niveau CECRL A2 tout en conservant au maximum son sens original
Skipping...


## Fine-tuned evaluation

We now evaluate the models we trained in the previous notebook ([OpenAIFineTuning](e_OpenAIFineTuning.ipynb)).

In [None]:
# ----------------------------- DAVINCI FINETUNED ---------------------------- #
import json

# Get model id (results/text_simplification/OpenAIFineTuning/davinci-002_trained.json)
path = os.path.join(
    pwd,
    "results",
    "text_simplification",
    "OpenAIFineTuning",
    "davinci-002_trained.json",
)
with open(path, "r") as f:
    model_id = json.load(f)["model"]["fine_tuned_model"]

# Make predictions
predictions = evaluate_openai(
    test_df.apply(
        lambda row: e_OpenAIFineTuning.create_davinci_conversation(row, training=False),
        axis=1,
    ),
    model_id,
    "",
)
# Format and save predictions
format_and_save_predictions(predictions, "davinci-002-finetuned")

  0%|          | 0/1000 [00:00<?, ?it/s]


Error with text: Voici une phrase en français de niveau CECRL A2 à simplifier :\n'''Il tourna le dos à la mer qui lui avait fait tant de mal en le fascinant depuis son arrivée sur l’île, et il se dirigea vers la forêt et le massif rocheux. . Durant les semaines qui suivirent, Durant les semaines qui suivirent, Robinson explora l’île méthodiquement et tâcha de repérer les sources et les abris naturels, les meilleurs emplacements pour la pêche, les coins à noix de coco, à ananas et à choux palmistes.'''\nDonne moi une phrase simplifiée au niveau CECRL A1 tout en conservant son sens original.
Skipping...

Error with text: Voici une phrase en français de niveau CECRL B2 à simplifier :\n'''Sans doule l,avait-il suffisamment ennuyée pour qu'elle eût trouvé assez bon de iimiter sa grande civilitéso à sa minesl et à ses gestes lorsqu'l|a regardait, et peut-être même le code des façons à observer envers les étrangers, pour rigou- reusement aimabie qu'il obligeât à se montrer en leur présence, 

In [None]:
# ------------------------------ GPT FINE-TUNED ------------------------------ #
import json

# Get model id
path = os.path.join(
    pwd,
    "results",
    "text_simplification",
    "OpenAIFineTuning",
    "gpt-3.5-turbo-1106_trained.json",
)
with open(path, "r") as f:
    model_id = json.load(f)["model"]["fine_tuned_model"]

# Make predictions
predictions = evaluate_openai(
    test_df.apply(
        lambda row: e_OpenAIFineTuning.create_davinci_conversation(row, training=False),
        axis=1,
    ),
    model_id,
    "",
)
# Format and save predictions
format_and_save_predictions(predictions, "gpt-3.5-turbo-1106-finetuned")

  0%|          | 0/1000 [00:00<?, ?it/s]
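
Both fine-tuned cells above read the model id with `json.load(f)["model"]["fine_tuned_model"]`. If the saved file mirrors an OpenAI fine-tuning job object (an assumption, since the file is written by the previous notebook), that field is null while the job is still running, and the evaluation would then be called with `None` as the model. A minimal sketch of a more defensive read, using a hypothetical helper `read_fine_tuned_model_id`:

```python
import json


def read_fine_tuned_model_id(path):
    """Return the fine-tuned model id stored in `path`, or raise if it is missing."""
    with open(path, "r") as f:
        job = json.load(f)

    # Same lookup as in the cells above, but tolerant of missing or null fields
    model_id = (job.get("model") or {}).get("fine_tuned_model")
    if not model_id:
        raise ValueError(
            f"No fine-tuned model id in {path}; did the fine-tuning job finish?"
        )
    return model_id


# Usage, with `path` built as in the cells above:
# model_id = read_fine_tuned_model_id(path)
```

Failing early this way avoids spending a full pass over the 1000 test sentences on requests that cannot succeed.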