# Metrics

To compare the performance of the different models, we need to define a metric. This metric must account for two things:
- Whether the simplified sentences are indeed at the expected level
- Whether the simplified sentences retain the meaning of the original sentence

To measure the first point, we use the model already trained in the `difficulty_estimation` section to estimate the difficulty of each original sentence and of its simplification. The **Accuracy**, i.e. the share of pairs whose predicted level drops by exactly one step, gives a score between 0 and 1 describing whether our first condition is met.

To measure the second point, we use the **Cosine Similarity** between the embeddings of the original sentences and the simplified sentences. This measure approaches 1 when the two sentences are semantically close, and so describes whether our second condition is met.

To keep these two measures consistent, and to rely on a model different from those we may evaluate (**Mistral-7B**, **Davinci**, **GPT-3.5**), it seems appropriate to use a **BERT** model for both measures. The final formula, inspired by the **F1-score**, is the following:
$$
\begin{align*}
\text{Metric} &= \frac{4 \times w_1 \times A \times (1 - w_1) \times S}{w_1 \times A + (1 - w_1) \times S} \\
\text{where:} \\
\text{Metric} &= \text{The combined metric} \\
w_1 &= \text{Weighting coefficient for accuracy} \\
w_2 &= (1 - w_1) : \text{Weighting coefficient for similarity} \\
S &= \text{Cosine similarity between input and output} \\
A &= \text{Accuracy between BERT predictions and expected labels}
\end{align*}
$$
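
As a rough illustration, the formula can be sketched in a few lines of Python. The function name `combined_score` is hypothetical; the factor 4 follows the notebook's `simplification_metric`, so that `w1 = 0.5` reduces to the usual F1-style harmonic mean of `A` and `S`. The zero-denominator guard is an added assumption, not part of the original code.

```python
def combined_score(accuracy: float, similarity: float, w1: float = 0.5) -> float:
    """Weighted, F1-style combination of accuracy (A) and similarity (S)."""
    weighted_a = w1 * accuracy
    weighted_s = (1 - w1) * similarity
    denominator = weighted_a + weighted_s
    if denominator == 0:
        # Assumption: return 0.0 instead of dividing by zero
        return 0.0
    return 4 * weighted_a * weighted_s / denominator


# With equal weights the score is the harmonic mean 2AS / (A + S)
a, s = 0.5, 1.0
print(combined_score(a, s))
```

Because the two terms are multiplied in the numerator, a model that fails completely on either criterion scores 0 whatever the weight.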



In [18]:
# ---------------------------- PREPARING NOTEBOOK ---------------------------- #
# Autoreload
%load_ext autoreload
%autoreload 2

# Random seed
import numpy as np
np.random.seed(42)

# External modules
import os
from IPython.display import display

# Set global log level
import logging
logging.basicConfig(level=logging.INFO)
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Define PWD as the current git repository
import git
repo = git.Repo('.', search_parent_directories=True)
pwd = repo.working_dir
os.chdir(pwd)

# import



## Downloading test data

In order to evaluate the relevance of our metric, we will first evaluate it on the dataset used to fine-tune **Mistral-7B** in the [FineTuningMistral](b_FineTuningMistral.ipynb) notebook. This dataset is described in the [DatasetCreation](a_DatasetCreation.ipynb) notebook and was created using **GPT-4**.

In [19]:
# -------------------------- LOAD PREVIOUS NOTEBOOKS ------------------------- #
import json
import __main__
import black

paths = [
    os.path.join(pwd, "notebooks", "text_simplification", "a_DatasetCreation.ipynb"),
    os.path.join(pwd, "notebooks", "text_simplification", "c_MistralEvaluation.ipynb"),
]

# Read notebooks
code_dict = {}
for path in paths:
    code = ""
    with open(path, "r") as f:
        temp = json.load(f)

    cells = [
        cell
        for cell in temp["cells"]
        if cell["cell_type"] == "code"
        and len(cell["source"]) > 0
        and cell["source"][-1] == "# import"
    ]
    notebook_code = "\n".join(
        line
        for cell in cells
        for line in cell["source"]
        if line != "# import" and len(line) > 0 and line[0] != "%"
    )
    # Create something like a header
    code += f"# {'-'*76} #\n"
    code += f"# {os.path.basename(path).upper():^76} #\n"
    code += f"# {'-'*76} #\n"
    code += notebook_code

    # Add "Module Creation"
    notebook_name = (
        os.path.basename(path).replace("imported_", "").replace(".ipynb", "")
    )
    code += """
# --------------------------------- IMPORTER --------------------------------- #
import types


class MyNotebook:
    pass


NOTEBOOK_NAME = MyNotebook()
# Put every function defined in the notebook in the class
NOTEBOOK_NAME.__dict__.update(
    {
        name: obj
        for name, obj in locals().items()
        if isinstance(obj, (type, types.FunctionType))
        if not (name.startswith("_") or name == "MyNotebook")
    }
)
    """.replace(
        "NOTEBOOK_NAME", notebook_name
    )

    # Remove empty lines
    code = "\n".join([line for line in code.split("\n") if len(line) > 0])
    # Format code
    code = black.format_str(code, mode=black.FileMode())

    # Write scratch file
    path = os.path.join(
        pwd, "scratch", f"imported_{os.path.basename(path).replace('ipynb', 'py')}"
    )
    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, "w") as f:
        f.write(code)
    code_dict[path] = code


# Mainify code
for path, code in code_dict.items():
    compiled = compile(code, path, "exec")
    exec(compiled, __main__.__dict__)

# import
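
The selection rule used above (keep code cells whose last source line is `# import`, then drop that marker and any IPython magics) can be tried on a tiny in-memory notebook. This is a minimal sketch: `extract_marked_code` and the sample notebook are hypothetical, and unlike the cell above it strips trailing newlines before comparing, which is an assumption about how the `.ipynb` source lines are stored.

```python
from typing import Any


def extract_marked_code(notebook: dict[str, Any], marker: str = "# import") -> str:
    """Return the code of every cell whose last source line is `marker`."""
    kept_lines = []
    for cell in notebook["cells"]:
        source = cell.get("source", [])
        if cell.get("cell_type") != "code" or not source:
            continue
        if source[-1].strip() != marker:
            continue
        for line in source:
            stripped = line.rstrip("\n")
            # Skip the marker itself, empty lines and IPython magics
            if stripped.strip() == marker or not stripped or stripped.startswith("%"):
                continue
            kept_lines.append(stripped)
    return "\n".join(kept_lines)


sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["%autoreload 2\n", "def f():\n", "    return 1\n", "# import"]},
        {"cell_type": "code", "source": ["print(f())"]},
    ]
}
print(extract_marked_code(sample_notebook))
```

Only the second cell is kept: the markdown cell is not code, and the third cell does not end with the marker.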

In [21]:
# ----------------------------- LOADING TEST DATA ---------------------------- #
metric_test_df = a_DatasetCreation.download_data()
metric_test_df.columns = ["Original", "Simplified"]
metric_test_df.head()

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

|   | Original | Simplified |
|---|----------|------------|
| 0 | L'apprentissage des langues étrangères stimule... | On apprend mieux avec les langues étrangères. |
| 1 | Les écosystèmes marins sont régulièrement pert... | Les usines abîment souvent la vie sous la mer. |
| 2 | L'absorption de polluants atmosphériques par l... | Les forêts aident à garder l'air propre en abs... |
| 3 | Confrontées à une mutation économique rapide, ... | Les entreprises doivent changer vite pour rest... |
| 4 | La philosophie existentialiste s'affirme par l... | L'existentialisme dit que la vie n'a pas de se... |


## Difficulty assessment

We will use the already-trained **CamemBERT** model to evaluate the difficulty of sentences. We define a function that takes a dataframe as input and returns a score between 0 and 1, indicating how often the simplified sentence is predicted exactly one difficulty level below the original sentence.

The dataframe to be evaluated must have the following format:
- `Original`: the original sentence
- `Simplified`: the simplified sentence

*Note that it is not necessary for the `Difficulty` column to be present in the dataframe, as we will calculate it in order to reduce the bias of our evaluation.*
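
To make the scoring rule concrete, here is a minimal sketch of the accuracy computation on already-predicted labels. It assumes, as the notebook's `get_simplification_accuracy` does, that a pair counts as correct only when the simplified sentence is predicted exactly one level below the original; `one_level_drop_accuracy` and the sample labels are hypothetical.

```python
CEFR_RANK = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}


def one_level_drop_accuracy(original_levels, simplified_levels):
    """Share of pairs whose predicted level drops by exactly one step."""
    if len(original_levels) != len(simplified_levels):
        raise ValueError("Both lists must have the same length")
    if not original_levels:
        return 0.0
    hits = sum(
        CEFR_RANK[orig] - CEFR_RANK[simp] == 1
        for orig, simp in zip(original_levels, simplified_levels)
    )
    return hits / len(original_levels)


original = ["C2", "B2", "C1", "B1"]
simplified = ["C1", "B2", "A2", "A2"]
print(one_level_drop_accuracy(original, simplified))  # 2 of the 4 pairs drop by one level
```

Under this rule, leaving a sentence unchanged and simplifying it by more than one level both count as misses.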

In [22]:
# ------------------------- BERT PREDICTION FUNCTION ------------------------- #
import dill
import numpy as np
import pandas as pd
import torch
from huggingface_hub import snapshot_download
from transformers import CamembertForSequenceClassification
import os


def get_bert_difficulty_prediction(series: pd.Series, dataset: str, pwd: str = "."):
    # Clone model checkpoint
    if not os.path.exists(os.path.join(pwd, dataset)):
        snapshot_download(
            repo_id=f"OloriBern/Lingorank_Bert_{dataset}",
            local_dir=os.path.join(pwd, dataset),
            revision="main",
            repo_type="model",
        )

    # Load tokenizer and label encoder
    with open(
        os.path.join(
            os.path.join(pwd, dataset),
            "train_camembert_tokenizer_label_encoder.pkl",
        ),
        "rb",
    ) as f:
        tokenizer, label_encoder = dill.load(f)

    # Load the model; make sure it matches the class of your model
    model = CamembertForSequenceClassification.from_pretrained(
        os.path.join(pwd, dataset)
    )

    # Put the model in evaluation mode
    model.eval()

    # Prepare the data for the model
    inputs = tokenizer(
        series.tolist(),
        padding=True,
        truncation=True,
        return_tensors="pt",
    )

    # Select the appropriate device (GPU if available) and move the model to it
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Disable gradient computation since we are doing inference
    batch_size = 32
    all_predictions = []
    with torch.no_grad():
        # Process the inputs in batches to avoid a MemoryError
        for i in range(0, len(inputs["input_ids"]), batch_size):
            batch_inputs = {
                key: value[i : i + batch_size].to(device)
                for key, value in inputs.items()
            }
            # Run the predictions
            outputs = model(**batch_inputs)

            # Apply a softmax to obtain probabilities
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

            # Convert the predictions to a numpy array for easier access and manipulation
            predictions = predictions.cpu().numpy()

            all_predictions.append(predictions)

    # Concatenate all predictions
    predictions = np.concatenate(all_predictions)

    # Get best predictions
    predictions = np.argmax(predictions, axis=1)

    # Apply label encoder
    predictions = label_encoder.inverse_transform(predictions)

    return predictions


# import

In [23]:
# Create mini test set for BERT
mini_test = metric_test_df["Original"].sample(10)

# Predict
get_bert_difficulty_prediction(
    mini_test, "french_difficulty", os.path.join(pwd, "scratch")
)

array(['C2', 'B2', 'B2', 'B2', 'C1', 'B2', 'C1', 'C1', 'C2', 'C2'],
      dtype=object)

In [24]:
# ------------------------ GET SIMPLIFICATION ACCURACY ----------------------- #
import os
import pandas as pd
from typing import NamedTuple


def get_simplification_accuracy(df: pd.DataFrame, dataset: str, pwd: str = "."):
    # Get predictions
    predictions = get_bert_difficulty_prediction(
        pd.concat([df["Original"], df["Simplified"]], axis=0), dataset, pwd
    )

    comparison = pd.DataFrame(
        {
            "Original": predictions[: len(predictions) // 2],
            "Simplified": predictions[len(predictions) // 2 :],
        }
    )
    difficulty_equivalence = {
        "A1": 1,
        "A2": 2,
        "B1": 3,
        "B2": 4,
        "C1": 5,
        "C2": 6,
        "level1": 1,
        "level2": 2,
        "level3": 3,
        "level4": 4,
    }

    comparison["Original"] = comparison["Original"].map(difficulty_equivalence)
    comparison["Simplified"] = comparison["Simplified"].map(difficulty_equivalence)
    results = comparison["Original"] - comparison["Simplified"]
    accuracy = (results == 1).sum() / len(results)

    # Return accuracy & predictions
    return NamedTuple(
        "SimplificationAccuracy", [("accuracy", float), ("predictions", pd.DataFrame)]
    )(accuracy, comparison)

In [25]:
get_simplification_accuracy(
    metric_test_df, "french_difficulty", os.path.join(pwd, "scratch")
).accuracy

0.496

## Evaluation of the conservation of meaning

We now need to check whether the simplified sentences retain the meaning of the original sentences. To do this, we use the base **CamemBERT** model (`camembert-base`), this time without any fine-tuning, and compare the embeddings of each sentence pair. BLEU and ROUGE scores are also available as alternative similarity measures.
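
For reference, the cosine similarity of one sentence pair is the dot product of the two embeddings divided by the product of their norms. The sketch below uses plain Python lists as stand-in embeddings (the vectors are made up, and `cosine` is a hypothetical helper); it computes one value per pair directly, which corresponds to the diagonal of a full pairwise similarity matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat a zero vector as having no similarity
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings" for two original/simplified pairs
original_vectors = [[1.0, 0.0, 0.0], [1.0, 2.0, 2.0]]
simplified_vectors = [[0.0, 1.0, 0.0], [2.0, 4.0, 4.0]]

pair_similarities = [cosine(u, v) for u, v in zip(original_vectors, simplified_vectors)]
print(pair_similarities)  # an orthogonal pair, then a parallel pair
```

Computing the values pair by pair avoids building the full N×N matrix when only matched pairs are of interest.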

In [26]:
# ----------------------- GET SIMPLIFICATION SIMILARITY ---------------------- #
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from typing import NamedTuple

def calculate_bleu_score(original, simplified):
    reference = original.split()
    candidate = simplified.split()
    score = sentence_bleu([reference], candidate)
    return score


def calculate_rouge_score(original, simplified):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    scores = scorer.score(original, simplified)
    return scores

def get_simplification_similarity(df: pd.DataFrame, bleu: bool = False, rouge: bool = False):
    if bleu and rouge:
        raise ValueError("Cannot use both BLEU and ROUGE at the same time")
    elif bleu:
        similarity = df.dropna().apply(lambda row: calculate_bleu_score(row['Original'], row['Simplified']), axis=1)
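    elif rouge:
        # Average of the ROUGE-1 and ROUGE-L F-measures for each pair.
        # NOTE: the trailing `.iloc[0]` keeps only the first pair's value,
        # so the "mean" computed below reflects a single row.
        similarity = df.dropna().apply(lambda row: sum([x[2] for x in calculate_rouge_score(row['Original'], row['Simplified']).values()]) / 2, axis=1).iloc[0]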
    else:
        # Load model
        model = SentenceTransformer("camembert-base")

        # Encode sentences
        original = model.encode(df["Original"].tolist())
        simplified = model.encode(df["Simplified"].tolist())

        # Compute cosine similarity
        similarity = cosine_similarity(original, simplified)

        # Keep only diagonal
        similarity = np.diag(similarity)

    # Mean similarity
    similarity_mean = similarity.mean()

    # Return similarity & similarity mean
    return NamedTuple(
        "SimplificationSimilarity",
        [("similarity", np.ndarray), ("similarity_mean", float)],
    )(similarity, similarity_mean)


# import

In [27]:
print(get_simplification_similarity(metric_test_df).similarity_mean)
print(get_simplification_similarity(metric_test_df, bleu=True).similarity_mean)
print(get_simplification_similarity(metric_test_df, rouge=True).similarity_mean)

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

0.8856876
0.01240844735738772
0.38888888888888884


## Definition of the metric

We have functions that can give us similarity and difficulty scores. So we're going to define a function that takes a dataframe as input and returns the metric score.

In [28]:
# ------------------------------- DEFINE METRIC ------------------------------ #
def simplification_metric(
    df: pd.DataFrame, dataset: str, pwd: str = ".", w1: float = 0.5, bleu : bool = False, rouge : bool = False
):
    accuracy = get_simplification_accuracy(df, dataset, pwd).accuracy
    similarity = get_simplification_similarity(df, bleu=bleu, rouge=rouge).similarity_mean

    score = (
        4
        * (w1 * accuracy * (1 - w1) * similarity)
        / (w1 * accuracy + (1 - w1) * similarity)
    )

    return NamedTuple(
        "SimplificationMetric",
        [("accuracy", float), ("similarity", float), ("score", float)],
    )(accuracy, similarity, score)

### Evaluation of the metric

To evaluate the relevance of our metric, we will apply it to four datasets:
- The dataset used for fine-tuning **Mistral-7B**, generated with **GPT-4** (*expected to score higher*)
- A test dataset composed of 60 sentences simplified by **Mistral-7B** (*expected to score lower*)
- A dataset of 60 unsimplified sentences, each paired with itself (**Perfect Meaning**)
- A dataset of 60 sentences, each paired with another sentence at a lower level but with no semantic link (**Perfect Simplification**)

In [29]:
# ------------------------- CREATING EVALUATION SETS ------------------------- #
# GPT-4
test_df_1 = metric_test_df

# Mistral-7B
test_df_2 = pd.read_csv(
    os.path.join(
        pwd,
        "results",
        "text_simplification",
        "MistralEvaluation",
        "predictions_fine_tuned_formatted.csv",
    )
)

# Perfect Meaning
test_df_original = c_MistralEvaluation.get_balanced_dataframe(
    c_MistralEvaluation.download_difficulty_estimation("."), 100
)
test_df_3 = pd.concat(
    [test_df_original["Sentence"], test_df_original["Sentence"]], axis=1
)
test_df_3.columns = ["Original", "Simplified"]

# Perfect Simplification
test_df_4 = (
    pd.concat(
        [test_df_original["Sentence"], test_df_original["Sentence"].shift(10)],
        axis=1,
    )
    .dropna()
    .reset_index(drop=True)
)
test_df_4.columns = ["Original", "Simplified"]

Fetching 6 files:   0%|          | 0/6 [00:00<?, ?it/s]

In [30]:
# Mute UserWarning
import warnings
warnings.filterwarnings("ignore")

print(get_simplification_similarity(test_df_1, bleu=True).similarity_mean)
print(get_simplification_similarity(test_df_2, bleu=True).similarity_mean)
print(get_simplification_similarity(test_df_3, bleu=True).similarity_mean)
print(get_simplification_similarity(test_df_4, bleu=True).similarity_mean)

0.01240844735738772
0.0748661577512338
0.98
8.228289207862188e-05


In [31]:
# ----------------------------- METRIC ASSESSMENT ---------------------------- #
# Loading test data
test_df_1 = metric_test_df
test_df_2 = pd.read_csv(
    os.path.join(
        pwd,
        "results",
        "text_simplification",
        "MistralEvaluation",
        "predictions_fine_tuned_formatted.csv",
    )
)

# Assessing metric
## Bert
gpt4_metric_best = simplification_metric(
    test_df_1, "french_difficulty", os.path.join(pwd, "scratch")
)
mistral_metric_best = simplification_metric(
    test_df_2.dropna(axis=0), "french_difficulty", os.path.join(pwd, "scratch")
)
perfect_meaning_metric_best = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch")
)
perfect_simplification_metric_best = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch")
)
## BLEU
gpt4_metric_best_bleu = simplification_metric(
    test_df_1, "french_difficulty", os.path.join(pwd, "scratch"), bleu=True
)
mistral_metric_best_bleu = simplification_metric(
    test_df_2.dropna(axis=0), "french_difficulty", os.path.join(pwd, "scratch"), bleu=True
)
perfect_meaning_metric_best_bleu = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch"), bleu=True
)
perfect_simplification_metric_best_bleu = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch"), bleu=True
)
## Rouge
gpt4_metric_best_rouge = simplification_metric(
    test_df_1, "french_difficulty", os.path.join(pwd, "scratch"), rouge=True
)
mistral_metric_best_rouge = simplification_metric(
    test_df_2.dropna(axis=0), "french_difficulty", os.path.join(pwd, "scratch"), rouge=True
)
perfect_meaning_metric_best_rouge = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch"), rouge=True
)
perfect_simplification_metric_best_rouge = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch"), rouge=True
)

# Displaying results
print(f"Test 1 (GPT-4): {gpt4_metric_best}, {gpt4_metric_best_bleu}")
print(f"Test 2 (Mistral-7B): {mistral_metric_best}, {mistral_metric_best_bleu}")
print(f"Test 3 (Perfect Meaning): {perfect_meaning_metric_best}, {perfect_meaning_metric_best_bleu}")
print(f"Test 4 (Perfect Simplification): {perfect_simplification_metric_best}, {perfect_simplification_metric_best_bleu}")

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Batches:   0%|          | 0/31 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

Test 1 (GPT-4): SimplificationMetric(accuracy=0.496, similarity=0.8856876, score=0.6358905555153163), SimplificationMetric(accuracy=0.496, similarity=0.01240844735738772, score=0.024211202316777855)
Test 2 (Mistral-7B): SimplificationMetric(accuracy=0.3527054108216433, similarity=0.9138029, score=0.5089634653624058), SimplificationMetric(accuracy=0.3527054108216433, similarity=0.0748661577512338, score=0.12351475573748853)
Test 3 (Perfect Meaning): SimplificationMetric(accuracy=0.0, similarity=1.0, score=0.0), SimplificationMetric(accuracy=0.0, similarity=0.98, score=0.0)
Test 4 (Perfect Simplification): SimplificationMetric(accuracy=0.1808080808080808, similarity=0.80995727, score=0.2956236191929849), SimplificationMetric(accuracy=0.1808080808080808, similarity=8.228289207862188e-05, score=0.00016449092694329018)


### Find ideal parameters

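To find the best parameters for our metric, we want the two degenerate datasets `Test 3` & `Test 4` to be treated equally. Writing $x_1, y_1$ for the accuracy and similarity of `Test 3` and $x_2, y_2$ for those of `Test 4`, and requiring the weighted sums $w_1 x_i + (1 - w_1) y_i$ to be equal, a few simplifications give $$w_1=\dfrac{y_2-y_1}{x_1-x_2+y_2-y_1}$$ Now all we have to do is the maths!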
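
One way to recover this expression, assuming the quantity being equalized is the weighted sum $w_1 x + (1 - w_1) y$ (the denominator of the metric), with $x$ the accuracy, $y$ the similarity, index 1 for `Test 3` and index 2 for `Test 4`:

$$
\begin{align*}
w_1 x_1 + (1 - w_1)\, y_1 &= w_1 x_2 + (1 - w_1)\, y_2 \\
w_1 (x_1 - x_2) &= (1 - w_1)(y_2 - y_1) \\
w_1 (x_1 - x_2 + y_2 - y_1) &= y_2 - y_1 \\
w_1 &= \frac{y_2 - y_1}{x_1 - x_2 + y_2 - y_1}
\end{align*}
$$

Note that the combined score of `Test 3` is 0 for any weight as long as its accuracy is 0, so the condition constrains the weighted sums rather than the final scores.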

In [32]:
# ----------------------------- METRIC ASSESSMENT ---------------------------- #
# Find best w1
## Bert
x_1 = perfect_meaning_metric_best.accuracy
x_2 = perfect_simplification_metric_best.accuracy
y_1 = perfect_meaning_metric_best.similarity
y_2 = perfect_simplification_metric_best.similarity
w_1_bert = (y_2 - y_1) / (x_1 - x_2 + y_2 - y_1)
## BLEU
x_1 = perfect_meaning_metric_best_bleu.accuracy
x_2 = perfect_simplification_metric_best_bleu.accuracy
y_1 = perfect_meaning_metric_best_bleu.similarity
y_2 = perfect_simplification_metric_best_bleu.similarity
w_1_bleu = (y_2 - y_1) / (x_1 - x_2 + y_2 - y_1)
# ROUGE
x_1 = perfect_meaning_metric_best_rouge.accuracy
x_2 = perfect_simplification_metric_best_rouge.accuracy
y_1 = perfect_meaning_metric_best_rouge.similarity
y_2 = perfect_simplification_metric_best_rouge.similarity
w_1_rouge = (y_2 - y_1) / (x_1 - x_2 + y_2 - y_1)
logging.info(f"Best w1 (BERT): {w_1_bert}")
logging.info(f"Best w1 (BLEU): {w_1_bleu}")
logging.info(f"Best w1 (ROUGE): {w_1_rouge}")

INFO:root:Best w1 (BERT): 0.5124506310430758


INFO:root:Best w1 (BLEU): 0.8442284292011875
INFO:root:Best w1 (ROUGE): 0.8468776732249786
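
As a quick sanity check, the same computation can be wrapped in a small helper and applied to the BERT values printed earlier for `Test 3` and `Test 4`. The helper name `balancing_weight` is hypothetical; the inputs are copied from the output above.

```python
def balancing_weight(x1: float, y1: float, x2: float, y2: float) -> float:
    """Weight w1 such that w1*x1 + (1-w1)*y1 == w1*x2 + (1-w1)*y2."""
    denominator = x1 - x2 + y2 - y1
    if denominator == 0:
        raise ValueError("The two datasets give no constraint on w1")
    return (y2 - y1) / denominator


# (accuracy, similarity) reported for Perfect Meaning and Perfect Simplification
x1, y1 = 0.0, 1.0
x2, y2 = 0.1808080808080808, 0.80995727

w1 = balancing_weight(x1, y1, x2, y2)
weighted_sum_1 = w1 * x1 + (1 - w1) * y1
weighted_sum_2 = w1 * x2 + (1 - w1) * y2
print(round(w1, 4), round(weighted_sum_1, 4), round(weighted_sum_2, 4))
```

The two weighted sums coincide, which is the property the weight was chosen for.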


## Calculation of all metrics

Now that we have a good evaluation of our metric, we can apply it to all the models we trained in the previous notebooks, namely:
- **Mistral-7B**
- **Davinci Zero-Shot**
- **GPT-3.5 Zero-Shot**
- **Davinci Fine-Tuned**
- **GPT-3.5 Fine-Tuned**

For an effective comparison, we will also evaluate the metric on the same datasets as for the metric evaluation (**Perfect Meaning** & **Perfect Simplification**).

In [33]:
# ----------------------------- COMPUTING METRICS ---------------------------- #
from IPython.display import Markdown

# Perfect Meaning
display(Markdown("### Perfect Meaning"))
display(test_df_3.head())
perfect_meaning_metric_bert = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bert
)
perfect_meaning_metric_bleu = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bleu, bleu=True
)
perfect_meaning_metric_rouge = simplification_metric(
    test_df_3, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_rouge, rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{perfect_meaning_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{perfect_meaning_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{perfect_meaning_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{perfect_meaning_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{perfect_meaning_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{perfect_meaning_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{perfect_meaning_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{perfect_meaning_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{perfect_meaning_metric_rouge.score:.2%}*"))


# Perfect Simplification
display(Markdown("### Perfect Simplification"))
display(test_df_4.head())
perfect_simplification_metric_bert = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bert
)
perfect_simplification_metric_bleu = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bleu, bleu=True
)
perfect_simplification_metric_rouge = simplification_metric(
    test_df_4, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_rouge, rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{perfect_simplification_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{perfect_simplification_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{perfect_simplification_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{perfect_simplification_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{perfect_simplification_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{perfect_simplification_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{perfect_simplification_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{perfect_simplification_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{perfect_simplification_metric_rouge.score:.2%}*"))

# GPT-4
display(Markdown("### GPT-4"))
display(metric_test_df.head())
gpt4_metric_bert = simplification_metric(
    metric_test_df, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bert
)
gpt4_metric_bleu = simplification_metric(
    metric_test_df, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_bleu, bleu=True
)
gpt4_metric_rouge = simplification_metric(
    metric_test_df, "french_difficulty", os.path.join(pwd, "scratch"), w1=w_1_rouge, rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{gpt4_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{gpt4_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{gpt4_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{gpt4_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{gpt4_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{gpt4_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{gpt4_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{gpt4_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{gpt4_metric_rouge.score:.2%}*"))


# Mistral-7B Zero-shot
path = os.path.join(pwd, "results", "text_simplification")
mistral_zero_shot_df = pd.read_csv(
    os.path.join(path, "MistralEvaluation", "predictions_zero_shot_formatted.csv")
)
display(Markdown("### Mistral-7B Zero-shot"))
display(mistral_zero_shot_df.head())
mistral_zero_shot_metric_bert = simplification_metric(
    mistral_zero_shot_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bert,
)
mistral_zero_shot_metric_bleu = simplification_metric(
    mistral_zero_shot_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bleu,
    bleu=True
)
mistral_zero_shot_metric_rouge = simplification_metric(
    mistral_zero_shot_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_rouge,
    rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{mistral_zero_shot_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{mistral_zero_shot_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{mistral_zero_shot_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{mistral_zero_shot_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{mistral_zero_shot_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{mistral_zero_shot_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{mistral_zero_shot_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{mistral_zero_shot_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{mistral_zero_shot_metric_rouge.score:.2%}*"))

# Mistral-7B Fine-tuned
path = os.path.join(pwd, "results", "text_simplification")
mistral_fine_tuned_df = pd.read_csv(
    os.path.join(path, "MistralEvaluation", "predictions_fine_tuned_formatted.csv")
)
display(Markdown("### Mistral-7B Fine-tuned"))
display(mistral_fine_tuned_df.head())
mistral_fine_tuned_metric_bert = simplification_metric(
    mistral_fine_tuned_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bert,
)
mistral_fine_tuned_metric_bleu = simplification_metric(
    mistral_fine_tuned_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bleu,
    bleu=True
)
mistral_fine_tuned_metric_rouge = simplification_metric(
    mistral_fine_tuned_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_rouge,
    rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{mistral_fine_tuned_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{mistral_fine_tuned_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{mistral_fine_tuned_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{mistral_fine_tuned_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{mistral_fine_tuned_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{mistral_fine_tuned_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{mistral_fine_tuned_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{mistral_fine_tuned_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{mistral_fine_tuned_metric_rouge.score:.2%}*"))

# GPT Zero-Shot
gpt_df = pd.read_csv(
    os.path.join(
        path, "OpenAIEvaluation", "gpt-3.5-turbo-1106_zeroshot_predictions.csv"
    )
)
display(Markdown("### GPT Zero-Shot"))
display(gpt_df.head())
gpt_zero_shot_metric_bert = simplification_metric(
    gpt_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bert,
)
gpt_zero_shot_metric_bleu = simplification_metric(
    gpt_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bleu,
    bleu=True
)
gpu_zero_shot_metric_rouge = simplification_metric(
    gpt_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_rouge,
    rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{gpt_zero_shot_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{gpt_zero_shot_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{gpt_zero_shot_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{gpt_zero_shot_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{gpt_zero_shot_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{gpt_zero_shot_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{gpu_zero_shot_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{gpu_zero_shot_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{gpu_zero_shot_metric_rouge.score:.2%}*"))


# Davinci Fine-Tuned
davinci_fn_df = pd.read_csv(
    os.path.join(
        path, "OpenAIEvaluation", "davinci-002-finetuned_formatted_predictions.csv"
    )
)
display(Markdown("### Davinci Fine-Tuned"))
display(davinci_fn_df.head())
davinci_finetuned_metric_bert = simplification_metric(
    davinci_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bert,
)
davinci_finetuned_metric_bleu = simplification_metric(
    davinci_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bleu,
    bleu=True
)
daivnci_finetuned_metric_rouge = simplification_metric(
    davinci_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_rouge,
    rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{davinci_finetuned_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{davinci_finetuned_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{davinci_finetuned_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{davinci_finetuned_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{davinci_finetuned_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{davinci_finetuned_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{daivnci_finetuned_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{daivnci_finetuned_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{daivnci_finetuned_metric_rouge.score:.2%}*"))

# GPT Fine-Tuned
gpt_fn_df = pd.read_csv(
    os.path.join(
        path,
        "OpenAIEvaluation",
        "gpt-3.5-turbo-1106-finetuned_formatted_predictions.csv",
    )
)
display(Markdown("### GPT Fine-Tuned"))
display(gpt_fn_df.head())
gpt_finetuned_metric_bert = simplification_metric(
    gpt_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bert,
)
gpt_finetuned_metric_bleu = simplification_metric(
    gpt_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_bleu,
    bleu=True
)
gpt_finetuned_metric_rouge = simplification_metric(
    gpt_fn_df.dropna(),
    "french_difficulty",
    os.path.join(pwd, "scratch"),
    w1=w_1_rouge,
    rouge=True
)
display(Markdown(f"**Accuracy (BERT)**: *{gpt_finetuned_metric_bert.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BERT)**: *{gpt_finetuned_metric_bert.similarity:.2%}*"))
display(Markdown(f"**Score (BERT)**: *{gpt_finetuned_metric_bert.score:.2%}*"))
display(Markdown(f"**Accuracy (BLEU)**: *{gpt_finetuned_metric_bleu.accuracy:.2%}*"))
display(Markdown(f"**Similarity (BLEU)**: *{gpt_finetuned_metric_bleu.similarity:.2%}*"))
display(Markdown(f"**Score (BLEU)**: *{gpt_finetuned_metric_bleu.score:.2%}*"))
display(Markdown(f"**Accuracy (ROUGE)**: *{gpt_finetuned_metric_rouge.accuracy:.2%}*"))
display(Markdown(f"**Similarity (ROUGE)**: *{gpt_finetuned_metric_rouge.similarity:.2%}*"))
display(Markdown(f"**Score (ROUGE)**: *{gpt_finetuned_metric_rouge.score:.2%}*"))

### Perfect Meaning

| | Original | Simplified |
|---|---|---|
| 0 | Le couple a bien fait d'évacuer | Le couple a bien fait d'évacuer |
| 1 | — Voyez-vous, dit-il, l’avantage des tempêtes,... | — Voyez-vous, dit-il, l’avantage des tempêtes,... |
| 2 | Le feuillage des arbres faisait comme une mer ... | Le feuillage des arbres faisait comme une mer ... |
| 3 | Le sabre d’abordage et la hache étaient égalem... | Le sabre d’abordage et la hache étaient égalem... |
| 4 | Il travaillait lentement, soigneusement, assem... | Il travaillait lentement, soigneusement, assem... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *0.00%*

**Similarity (BERT)**: *100.00%*

**Score (BERT)**: *0.00%*

**Accuracy (BLEU)**: *0.00%*

**Similarity (BLEU)**: *98.00%*

**Score (BLEU)**: *0.00%*

**Accuracy (ROUGE)**: *0.00%*

**Similarity (ROUGE)**: *100.00%*

**Score (ROUGE)**: *0.00%*

### Perfect Simplification

| | Original | Simplified |
|---|---|---|
| 0 | Certains champignons rouges à pois jaunes deva... | Le couple a bien fait d'évacuer |
| 1 | Il savait que si un nouveau conflit avec les E... | — Voyez-vous, dit-il, l’avantage des tempêtes,... |
| 2 | Contre cette mauvaise pente, il ne connaissait... | Le feuillage des arbres faisait comme une mer ... |
| 3 | Cent fois le bois se fendit sous l’action soit... | Le sabre d’abordage et la hache étaient égalem... |
| 4 | Mais il aurait fallu des milliers de cages de ... | Il travaillait lentement, soigneusement, assem... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/31 [00:00<?, ?it/s]

Batches:   0%|          | 0/31 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *18.08%*

**Similarity (BERT)**: *81.00%*

**Score (BERT)**: *30.02%*

**Accuracy (BLEU)**: *18.08%*

**Similarity (BLEU)**: *0.01%*

**Score (BLEU)**: *0.01%*

**Accuracy (ROUGE)**: *18.08%*

**Similarity (ROUGE)**: *0.00%*

**Score (ROUGE)**: *0.00%*

### GPT-4

| | Original | Simplified |
|---|---|---|
| 0 | L'apprentissage des langues étrangères stimule... | On apprend mieux avec les langues étrangères. |
| 1 | Les écosystèmes marins sont régulièrement pert... | Les usines abîment souvent la vie sous la mer. |
| 2 | L'absorption de polluants atmosphériques par l... | Les forêts aident à garder l'air propre en abs... |
| 3 | Confrontées à une mutation économique rapide, ... | Les entreprises doivent changer vite pour rest... |
| 4 | La philosophie existentialiste s'affirme par l... | L'existentialisme dit que la vie n'a pas de se... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *49.60%*

**Similarity (BERT)**: *88.57%*

**Score (BERT)**: *64.00%*

**Accuracy (BLEU)**: *49.60%*

**Similarity (BLEU)**: *1.24%*

**Score (BLEU)**: *0.77%*

**Accuracy (ROUGE)**: *49.60%*

**Similarity (ROUGE)**: *38.89%*

**Score (ROUGE)**: *20.86%*

### Mistral-7B Zero-shot

| | Original | Simplified |
|---|---|---|
| 0 | Je ne sais pas ; pas beaucoup peut-être ; pas ... | Je ne sais pas ; peut-être pas beaucoup ; en t... |
| 1 | Il fit ainsi par deux fois le tour de l’épave. | Il a donc fait deux fois le tour de l'épave. |
| 2 | Comme il avait froid ! | |
| 3 | Il referma son tonnelet à tabac et se laissa a... | Il a fermé sa boîte à tabac et est allé paress... |
| 4 | En somme sa situation était loin d’être désesp... | En résumé, sa situation n'était pas très mauva... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/30 [00:00<?, ?it/s]

Batches:   0%|          | 0/30 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *27.57%*

**Similarity (BERT)**: *93.12%*

**Score (BERT)**: *43.10%*

**Accuracy (BLEU)**: *27.57%*

**Similarity (BLEU)**: *29.44%*

**Score (BLEU)**: *15.33%*

**Accuracy (ROUGE)**: *27.57%*

**Similarity (ROUGE)**: *88.46%*

**Score (ROUGE)**: *34.29%*

### Mistral-7B Fine-tuned

| | Original | Simplified |
|---|---|---|
| 0 | La classe de Mme Gaudé a fait une pièce de thé... | La classe de Mme Gaudé a joué une pièce en ang... |
| 1 | Finalement elles s’égorgèrent l’une l’autre et... | Finalement, elles se tuèrent l'une l'autre et ... |
| 2 | D’abord il ne vit rien, mais il finit par déco... | D'abord, il ne voyait rien, mais il a trouvé u... |
| 3 | Aurait-il des visiteurs ? | Y aurait-il des gens qui viennent ici ? |
| 4 | Robinson connaissait cette enfant, il en était... | Robinson avait déjà vu cette enfant, il le sav... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *35.27%*

**Similarity (BERT)**: *91.38%*

**Score (BERT)**: *51.43%*

**Accuracy (BLEU)**: *35.27%*

**Similarity (BLEU)**: *7.49%*

**Score (BLEU)**: *4.49%*

**Accuracy (ROUGE)**: *35.27%*

**Similarity (ROUGE)**: *87.04%*

**Score (ROUGE)**: *36.86%*

### GPT Zero-Shot

| | Original | Simplified |
|---|---|---|
| 0 | Après ça, nous étions un peu fatigués | Après ça, nous étions un peu fatigués. |
| 1 | Finalement elles s’égorgèrent l’une l’autre et... | Finalement, elles se sont fait du mal et sont ... |
| 2 | Il construisit alors des cages dans lesquelles... | Il a fait des cages avec une porte qui s'ouvra... |
| 3 | Il ne pouvait plus marcher qu’à quatre pattes,... | Il ne pouvait plus marcher sur ses deux jambes... |
| 4 | Il devait être bloqué sur un banc de sable ou ... | Il devait être coincé sur un banc de sable ou ... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/29 [00:00<?, ?it/s]

Batches:   0%|          | 0/29 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *30.60%*

**Similarity (BERT)**: *93.38%*

**Score (BERT)**: *46.66%*

**Accuracy (BLEU)**: *30.60%*

**Similarity (BLEU)**: *13.45%*

**Score (BLEU)**: *7.75%*

**Accuracy (ROUGE)**: *30.60%*

**Similarity (ROUGE)**: *100.00%*

**Score (ROUGE)**: *38.50%*

### Davinci Fine-Tuned

| | Original | Simplified |
|---|---|---|
| 0 | Après ça, nous étions un peu fatigués | Après jouer, je suis un peu fatiguée. --\n |
| 1 | Finalement elles s’égorgèrent l’une l’autre et... | Elles se blessèrent alors mutuellement et mour... |
| 2 | Il construisit alors des cages dans lesquelles... | Ça fait longtemps que il essaie d'attraper la ... |
| 3 | Il ne pouvait plus marcher qu’à quatre pattes,... | Il était très mal en point et ne pouvait plus ... |
| 4 | Il devait être bloqué sur un banc de sable ou ... | Il était peut-être au (...) plus... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/30 [00:00<?, ?it/s]

Batches:   0%|          | 0/30 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *24.26%*

**Similarity (BERT)**: *82.71%*

**Score (BERT)**: *38.02%*

**Accuracy (BLEU)**: *24.26%*

**Similarity (BLEU)**: *0.70%*

**Score (BLEU)**: *0.43%*

**Accuracy (ROUGE)**: *24.26%*

**Similarity (ROUGE)**: *52.63%*

**Score (ROUGE)**: *23.16%*

### GPT Fine-Tuned

| | Original | Simplified |
|---|---|---|
| 0 | Après ça, nous étions un peu fatigués | Ensuite, nous étions un peu plus fatigués. |
| 1 | Finalement elles s’égorgèrent l’une l’autre et... | Elles se sont fait du mal jusqu'à ce qu'elles ... |
| 2 | Il construisit alors des cages dans lesquelles... | Il a fabriqué des boîtes où l'animal entre et ... |
| 3 | Il ne pouvait plus marcher qu’à quatre pattes,... | Il marchait comme un animal et il mangeait par... |
| 4 | Il devait être bloqué sur un banc de sable ou ... | Il était coincé sur du sable ou sur des rocher... |


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/29 [00:00<?, ?it/s]

Batches:   0%|          | 0/29 [00:00<?, ?it/s]

INFO:absl:Using default tokenizer.

**Accuracy (BERT)**: *33.85%*

**Similarity (BERT)**: *91.40%*

**Score (BERT)**: *49.94%*

**Accuracy (BLEU)**: *33.85%*

**Similarity (BLEU)**: *4.13%*

**Score (BLEU)**: *2.52%*

**Accuracy (ROUGE)**: *33.85%*

**Similarity (ROUGE)**: *70.59%*

**Score (ROUGE)**: *31.40%*
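
The nine `display(Markdown(...))` calls repeated for each model all follow the same pattern, so they could be generated by a small helper. The sketch below is only an illustration: `MetricResult` and `format_report` are hypothetical names, and `MetricResult` merely stands in for whatever object `simplification_metric` returns, assuming it exposes `accuracy`, `similarity` and `score`. The numbers are the GPT Fine-Tuned values reported in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by `simplification_metric`;
# only the three attributes read in the evaluation cell are assumed.
MetricResult = namedtuple("MetricResult", ["accuracy", "similarity", "score"])


def format_report(results):
    """Return one Markdown line per (variant, field) pair.

    `results` maps a variant label such as "BERT" to a MetricResult.
    """
    lines = []
    for variant, result in results.items():
        for field in ("accuracy", "similarity", "score"):
            value = getattr(result, field)
            lines.append(f"**{field.capitalize()} ({variant})**: *{value:.2%}*")
    return lines


# GPT Fine-Tuned values as reported in this notebook.
report = format_report(
    {
        "BERT": MetricResult(0.338462, 0.914014, 0.499404),
        "BLEU": MetricResult(0.338462, 0.041305, 0.02517),
        "ROUGE": MetricResult(0.338462, 0.705882, 0.313957),
    }
)
for line in report:
    print(line)
```

In the notebook, each returned string could be passed to `display(Markdown(line))` instead of `print`.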

In [34]:
metrics_df = pd.DataFrame(
    {
        "Model": [
            "Perfect Meaning",
            "Perfect Simplification",
            "GPT-4",
            "Mistral-7B Zero-Shot",
            "Mistral-7B Fine-Tuned",
            "GPT Zero-Shot",
            "Davinci Fine-Tuned",
            "GPT Fine-Tuned",
        ],
        "Accuracy": [
            perfect_meaning_metric_bert.accuracy,
            perfect_simplification_metric_bert.accuracy,
            gpt4_metric_bert.accuracy,
            mistral_zero_shot_metric_bert.accuracy,
            mistral_fine_tuned_metric_bert.accuracy,
            gpt_zero_shot_metric_bert.accuracy,
            davinci_finetuned_metric_bert.accuracy,
            gpt_finetuned_metric_bert.accuracy,
        ],
        "Similarity (BERT)": [
            perfect_meaning_metric_bert.similarity,
            perfect_simplification_metric_bert.similarity,
            gpt4_metric_bert.similarity,
            mistral_zero_shot_metric_bert.similarity,
            mistral_fine_tuned_metric_bert.similarity,
            gpt_zero_shot_metric_bert.similarity,
            davinci_finetuned_metric_bert.similarity,
            gpt_finetuned_metric_bert.similarity,
        ],
        "Similarity (BLEU)": [
            perfect_meaning_metric_bleu.similarity,
            perfect_simplification_metric_bleu.similarity,
            gpt4_metric_bleu.similarity,
            mistral_zero_shot_metric_bleu.similarity,
            mistral_fine_tuned_metric_bleu.similarity,
            gpt_zero_shot_metric_bleu.similarity,
            davinci_finetuned_metric_bleu.similarity,
            gpt_finetuned_metric_bleu.similarity,
        ],
        "Similarity (ROUGE)": [
            perfect_meaning_metric_rouge.similarity,
            perfect_simplification_metric_rouge.similarity,
            gpt4_metric_rouge.similarity,
            mistral_zero_shot_metric_rouge.similarity,
            mistral_fine_tuned_metric_rouge.similarity,
            gpu_zero_shot_metric_rouge.similarity,
            daivnci_finetuned_metric_rouge.similarity,
            gpt_finetuned_metric_rouge.similarity,
        ],
        "Score (BERT)": [
            perfect_meaning_metric_bert.score,
            perfect_simplification_metric_bert.score,
            gpt4_metric_bert.score,
            mistral_zero_shot_metric_bert.score,
            mistral_fine_tuned_metric_bert.score,
            gpt_zero_shot_metric_bert.score,
            davinci_finetuned_metric_bert.score,
            gpt_finetuned_metric_bert.score,
        ],
        "Score (BLEU)": [
            perfect_meaning_metric_bleu.score,
            perfect_simplification_metric_bleu.score,
            gpt4_metric_bleu.score,
            mistral_zero_shot_metric_bleu.score,
            mistral_fine_tuned_metric_bleu.score,
            gpt_zero_shot_metric_bleu.score,
            davinci_finetuned_metric_bleu.score,
            gpt_finetuned_metric_bleu.score,
        ],
        "Score (ROUGE)": [
            perfect_meaning_metric_rouge.score,
            perfect_simplification_metric_rouge.score,
            gpt4_metric_rouge.score,
            mistral_zero_shot_metric_rouge.score,
            mistral_fine_tuned_metric_rouge.score,
            gpu_zero_shot_metric_rouge.score,
            daivnci_finetuned_metric_rouge.score,
            gpt_finetuned_metric_rouge.score,
        ],
    }
)

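# Sorted copy, ranked by BERT score. Note that the unsorted `metrics_df`
# is what gets displayed and saved below.
metrics = metrics_df.sort_values("Score (BERT)", ascending=False)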

# Display metrics
display(Markdown("### Metrics"))
display(metrics_df)

# Save metrics
path = os.path.join(pwd, "results", "text_simplification", "Metrics")
os.makedirs(path, exist_ok=True)
metrics_df.to_csv(os.path.join(path, "metrics.csv"), index=False)

### Metrics

| | Model | Accuracy | Similarity (BERT) | Similarity (BLEU) | Similarity (ROUGE) | Score (BERT) | Score (BLEU) | Score (ROUGE) |
|---|---|---|---|---|---|---|---|---|
| 0 | Perfect Meaning | 0.0 | 1.0 | 0.98 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | Perfect Simplification | 0.180808 | 0.809957 | 8.2e-05 | 0.0 | 0.300187 | 5.1e-05 | 0.0 |
| 2 | GPT-4 | 0.496 | 0.885688 | 0.012408 | 0.388889 | 0.639991 | 0.007696 | 0.208616 |
| 3 | Mistral-7B Zero-Shot | 0.275681 | 0.931159 | 0.294438 | 0.884615 | 0.430979 | 0.153258 | 0.342883 |
| 4 | Mistral-7B Fine-Tuned | 0.352705 | 0.913803 | 0.074866 | 0.87037 | 0.514322 | 0.04489 | 0.368621 |
| 5 | GPT Zero-Shot | 0.306011 | 0.933759 | 0.134486 | 1.0 | 0.466554 | 0.077511 | 0.385006 |
| 6 | Davinci Fine-Tuned | 0.242647 | 0.827054 | 0.007003 | 0.526316 | 0.380151 | 0.00434 | 0.231552 |
| 7 | GPT Fine-Tuned | 0.338462 | 0.914014 | 0.041305 | 0.705882 | 0.499404 | 0.02517 | 0.313957 |
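
Since the table above is displayed in insertion order, the ranking by **Score (BERT)** is easier to read once sorted. The following is a minimal, standard-library sketch using the values copied from that table; the names `score_bert` and `ranking` are illustrative only.

```python
# Score (BERT) values copied from the metrics table above.
score_bert = {
    "Perfect Meaning": 0.0,
    "Perfect Simplification": 0.300187,
    "GPT-4": 0.639991,
    "Mistral-7B Zero-Shot": 0.430979,
    "Mistral-7B Fine-Tuned": 0.514322,
    "GPT Zero-Shot": 0.466554,
    "Davinci Fine-Tuned": 0.380151,
    "GPT Fine-Tuned": 0.499404,
}

# Highest score first.
ranking = sorted(score_bert.items(), key=lambda item: item[1], reverse=True)
for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score:.2%}")
```

With these values, GPT-4 ranks first, followed by Mistral-7B Fine-Tuned and GPT Fine-Tuned, while the two baselines (Perfect Simplification and Perfect Meaning) rank last.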


In [40]:
# Keep one sentence of every dataset
example_df = pd.concat(
    [
        metric_test_df.sample(1),
        mistral_zero_shot_df.sample(1),
        mistral_fine_tuned_df.sample(1),
        gpt_fn_df.sample(1),
        gpt_df.sample(1),
        davinci_fn_df.sample(1),
        test_df_3.sample(1),
        test_df_4.sample(1),
    ],
)
example_df.index = [
    "GPT-4",
    "Mistral-7B Zero-Shot",
    "Mistral-7B Fine-Tuned",
    "GPT Fine-Tuned",
    "GPT Zero-Shot",
    "Davinci Fine-Tuned",
    "Perfect Meaning",
    "Perfect Simplification",
]
# Compute Accuracy and Similarity
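## Accuracy: a sentence counts as correct when the predicted level of the
## original is exactly one above the predicted level of the simplified sentence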
predictions = get_simplification_accuracy(
    example_df, "french_difficulty", os.path.join(pwd, "scratch")
).predictions
accuracy = pd.Series(predictions["Original"] == predictions["Simplified"] + 1)
accuracy.index = example_df.index
example_df["Accuracy"] = accuracy.astype(int)
## Similarity
similarity = get_simplification_similarity(example_df).similarity
example_df["Similarity"] = similarity
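## Score: weighted combination of accuracy and similarity; for a weight of 0.5
## the factor 4 reduces it to the harmonic mean 2 * A * S / (A + S)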
score = (
    4
    * (w_1_bert * accuracy * (1 - w_1_bert) * similarity)
    / (w_1_bert * accuracy + (1 - w_1_bert) * similarity)
)
example_df["Score (BERT)"] = score

# Save
path = os.path.join(pwd, "results", "text_simplification", "Metrics")
os.makedirs(path, exist_ok=True)
example_df.to_csv(os.path.join(path, "example.csv"))
example_df

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: camembert-base
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

| | Original | Simplified | Accuracy | Similarity | Score (BERT) |
|---|---|---|---|---|---|
| GPT-4 | L'étude de la psychologie cognitive fournit de... | La psychologie cognitive nous aide à comprendr... | 1 | 0.895892 | 0.943212 |
| Mistral-7B Zero-Shot | Ils I'ignoraient. | Ils ne s'occupaient pas. | 1 | 0.838622 | 0.909675 |
| Mistral-7B Fine-Tuned | En 2017, 34, 6% de la population vit dans un m... | En 2017, 34,6% des personnes vivaient dans une... | 1 | 0.955593 | 0.976134 |
| GPT Fine-Tuned | Un Chat, nommé Rodilardus, faisait de rats tel... | Le chat appelé Rodilardus tuait tant de rats q... | 0 | 0.937257 | 0.0 |
| GPT Zero-Shot | On vient d'acheter une résidence seconda ire, ... | Nous venons d'acheter une maison de vacances a... | 1 | 0.957827 | 0.977328 |
| Davinci Fine-Tuned | Malgré les progrès accomplis, la situation ali... | Malgré la plupart des pays ayant moins de gens... | 1 | 0.842424 | 0.911964 |
| Perfect Meaning | Ce paradoxe a été repris par une palanquée d' ... | Ce paradoxe a été repris par une palanquée d' ... | 0 | 1.0 | 0.0 |
| Perfect Simplification | Les Inuits pourraient porter plainte contre le... | Avec la disparition des parents se dénouent en... | 0 | 0.731567 | 0.0 |
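
The per-sentence computation in the cell above can be condensed into two small functions. This is a sketch under stated assumptions, not the project's implementation: `combined_score` and `is_one_level_simpler` are hypothetical names, the default `w1=0.5` is chosen only for illustration (the value of `w_1_bert` is not shown in this section), and returning `0.0` when the denominator is zero is a choice made here for safety.

```python
def combined_score(accuracy, similarity, w1=0.5):
    """Weighted combination of accuracy (A) and similarity (S).

    Mirrors 4 * w1 * A * (1 - w1) * S / (w1 * A + (1 - w1) * S).
    Returning 0.0 for a zero denominator is an assumption of this sketch.
    """
    numerator = 4 * w1 * accuracy * (1 - w1) * similarity
    denominator = w1 * accuracy + (1 - w1) * similarity
    return numerator / denominator if denominator else 0.0


def is_one_level_simpler(original_level, simplified_level):
    """Per-sentence accuracy: 1 if the original is exactly one level above."""
    return int(original_level == simplified_level + 1)


# A sentence whose difficulty was not reduced scores 0 whatever its similarity.
print(combined_score(is_one_level_simpler(3, 3), 1.0))
# A correctly simplified sentence is scored by how well meaning is preserved.
print(combined_score(is_one_level_simpler(3, 2), 0.9))
```

With `w1 = 0.5` the expression simplifies to the harmonic mean `2 * A * S / (A + S)`, which is why a sentence with accuracy 0 (such as the Perfect Meaning row) receives a score of 0 even when its similarity is 1.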
