<h1 align="center">Lenny Pelhate, Etienne SULTAN</h1>

# CentraleSupelec - Natural language processing

## Natural Language Inferencing (NLI): 

Natural Language Inference (NLI) is a classical NLP (Natural Language Processing) problem that involves taking two sentences (the premise and the hypothesis) and deciding how they are related (whether the premise *entails* the hypothesis, *contradicts* it, or does *neither*).

Example:


| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |

### Stanford NLI (SNLI) corpus

In this labwork, I propose to use the Stanford NLI (SNLI) corpus (https://nlp.stanford.edu/projects/snli/), available in the *Datasets* library by Hugging Face.

    from datasets import load_dataset
    snli = load_dataset("snli")
    #Removing sentence pairs with no label (-1)
    snli = snli.filter(lambda example: example['label'] != -1) 

## Subject

You are asked to provide an operational Jupyter notebook that performs the task of NLI. For that, you need to tackle the following aspects of the problem:

1. Loading and preprocessing the data
2. Designing a PyTorch model that, given two sentences, decides how they are related (*entails*, *contradicts* or *neither*.)
3. Training and evaluating the model using appropriate metrics
4. (Optional) Letting the user play with the model (forward user sentences and visualize the prediction easily)
5. (Optional) Providing visual insight into the model (e.g. visualizing the attention if your model uses attention)

You can choose between a trained approach (for which I suggest using the Hugging Face *Transformers* library) or a zero-shot or few-shot approach (for which I suggest using a local *ollama* server). You can, of course, do both and compare your results.

## Evaluation

The evaluation will be based on several criteria:

- Clarity and readability of the notebook. The notebook is the report of your project. Make it easy and pleasant to read.
- Justification of implementation choices (e.g. the network, the cost function, the optimizer, ...)
- Quality of the code. The various deep learning and NLP labworks provide many examples of good practices for designing experiments with neural networks. Use them as inspirational examples!

## Additional recommendations

- You are not seeking to publish a research paper! I'm not expecting state-of-the-art results! The idea of this labwork is to assess that you have integrated the skills necessary to handle textual data using deep neural network techniques.

- This labwork will be evaluated but we are still here to help you! Don't hesitate to request our help if you are stuck.

- If you intend to use BERT-based models, here is a piece of advice. The bert-base-* models available in *Transformers* need more than 12 GB of GPU memory to be fine-tuned. To avoid memory issues, you can use one of the following solutions:

    - Use a lighter BERT-based model such as DistilBERT, ALBERT, ...
    - Train a classification model on top of BERT without fine-tuning it (i.e. freezing the BERT weights)

## Huggingface documentations

In case you want to use the Hugging Face *Datasets* and *Transformers* libraries (which I advise), here are some useful documentation pages:

- Dataset quick tour

    https://huggingface.co/docs/datasets/quicktour.html
    
- Documentation on data preprocessing for transformers

    https://huggingface.co/transformers/preprocessing.html
    
- Transformer Quick tour (with distilbert example for classification).

    https://huggingface.co/transformers/quicktour.html
    



## Use Local models via Ollama

### Starting ollama server

Open a terminal and run the following commands:

> mkdir ollama <br>
> cd ollama <br>
> curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz <br>
> tar -xzf ollama-linux-amd64.tgz <br>
> cd bin/ <br>
> ./ollama serve

This will start an ollama server accessible at http://localhost:11434
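
Before sending prompts, it can help to check that the server answers and to see which models it holds. The sketch below is one possible way to do this with the standard library only. It assumes the server exposes Ollama's `/api/tags` endpoint at the address given above; the helper name `list_ollama_models` is hypothetical and not part of the labwork.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(api_base="http://localhost:11434", timeout=2.0):
    """Return the names of the models known to the server, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{api_base}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a reply that is not valid JSON
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

If `list_ollama_models()` returns `None`, the server is probably not running; if the list does not contain the model you intend to call, it still has to be downloaded.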


In [1]:
system_prompt = """You are an assistant for question-answering tasks. Answer the question according only to the given context.
If question cannot be answered using the context, simply say I don't know. Do not make stuff up.

Context: {context}
"""

user_prompt = """
Question: {question}

Answer:"""

#context = "Barack Hussein Obama II (born August 4, 1961) is an American politician who was the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American president. Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004."
context = "Barack Hussein Obama II (born August 32, 1861) is an American politician who was the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American president. Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004."
question = "When was Barack Obama born?"


In [2]:
!pip install litellm



In [5]:
from litellm import completion


response = completion(
  model="ollama/mistrallite",
  messages=[{"content": system_prompt.format(context=context),"role": "system"}, {"content": user_prompt.format(question=question),"role": "user"}],
  api_base="http://localhost:11434",
  stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")


August 32, 1861

## **Proposed solution:**

### **Loading the SNLI dataset:**

In [6]:
# Install transformers if not present
# !pip install transformers evaluate datasets accelerate
# !pip install 'accelerate>=0.26.0'

In [7]:
from datasets import load_dataset
snli = load_dataset("snli")
#Removing sentence pairs with no label (-1)
snli = snli.filter(lambda example: example['label'] != -1)

Using the latest cached version of the dataset since snli couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'plain_text' at /usr/users/sdim/sdim_36/.cache/huggingface/datasets/snli/plain_text/0.0.0/cdb5c3d5eed6ead6e5a341c8e56e669bb666725b (last modified on Tue Mar 25 18:02:11 2025).


### **Exploring the dataset:**

**Dataset type:**
- 'datasets.dataset_dict.DatasetDict'

The dataset is already split into train-test-validation. 

**Labels:**
- Entailment: 0
- Neutral: 1
- Contradiction: 2

Let us look at a few examples from the training split:

In [8]:
for i in range(3):
    print('Premise: ', snli["train"][i]['premise'])
    print('Hypothesis: ', snli["train"][i]['hypothesis'])
    print('Label: ', snli["train"][i]['label'], '\n')

Premise:  A person on a horse jumps over a broken down airplane.
Hypothesis:  A person is training his horse for a competition.
Label:  1 

Premise:  A person on a horse jumps over a broken down airplane.
Hypothesis:  A person is at a diner, ordering an omelette.
Label:  2 

Premise:  A person on a horse jumps over a broken down airplane.
Hypothesis:  A person is outdoors, on a horse.
Label:  0 



In [9]:
from collections import Counter
label_map = {0: 'entailment', 1: 'neutral', 2: 'contradiction'}

total = len(snli["test"])
test_labels = [example["label"] for example in snli["test"]]
label_counts = Counter(test_labels)

for label_id in sorted(label_map):
    count = label_counts[label_id]
    percent = count / total * 100
    print(f"{label_map[label_id]:<15} ({label_id}) : {count} examples ({percent:.2f}%)")

entailment      (0) : 3368 examples (34.28%)
neutral         (1) : 3219 examples (32.77%)
contradiction   (2) : 3237 examples (32.95%)


The test split is balanced. A model that always predicted the same class, or that guessed uniformly among the three classes, would therefore reach an accuracy of about $1/3$.
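
The two trivial baselines can be computed directly from the label counts printed above. This is a small sketch that only reuses those three counts; nothing else is assumed.

```python
from collections import Counter

# Test-split label counts, copied from the output above
label_counts = Counter({"entailment": 3368, "neutral": 3219, "contradiction": 3237})
total = sum(label_counts.values())

# Baseline 1: always predict the most frequent class
majority_label, majority_count = label_counts.most_common(1)[0]
majority_accuracy = majority_count / total

# Baseline 2: guess uniformly at random; the guess is right with
# probability 1/3 whatever the true label is
uniform_accuracy = sum(
    (count / total) * (1 / len(label_counts)) for count in label_counts.values()
)

print(f"majority ({majority_label}): {majority_accuracy:.4f}")
print(f"uniform random: {uniform_accuracy:.4f}")
```

Any model worth keeping should clearly beat both numbers on this split.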

# Ollama server

In this part we study two approaches: a zero-shot model and a few-shot model.
To do so, we use the mistrallite model, served by an Ollama server so that we can query it.

## Zero-shot model

In the zero-shot setting, a language model has to predict one of the classes entailment, neutral or contradiction.

The model receives only a prompt and has seen no example of the task it must perform.

In [13]:
import random
from sklearn.metrics import accuracy_score

In [14]:
import re
import json

# To constrain the output format, the model is asked to return a JSON object, which is parsed here
def extract_json(content):
    match = re.search(r'\{[^}]+\}', content)
    if match:
        json_str = match.group()
        try:
            data = json.loads(json_str)
            answer = data.get("answer", "").strip().lower()
            if answer in {"entailment", "contradiction", "neutral"}:
                return answer
            else:
                return "invalid_answer"
        except json.JSONDecodeError:
            return "json_error"
    else:
        return "no_json_found"
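
Since a noticeable share of answers end up rejected by the strict parser, a more lenient fallback may be worth trying. The sketch below is a hypothetical variant, not the function used for the reported results: it first tries the same JSON extraction, then accepts a reply only if exactly one label word appears in it. The name `parse_label` and the `"unparsed"` marker are assumptions made for illustration.

```python
import json
import re

VALID_LABELS = ("entailment", "contradiction", "neutral")


def parse_label(content: str) -> str:
    """Return a valid NLI label found in `content`, or "unparsed"."""
    text = content.strip().lower()

    # 1) Strict path: first {...} span, parsed as JSON with an "answer" key
    match = re.search(r"\{[^}]+\}", text)
    if match:
        try:
            answer = json.loads(match.group()).get("answer", "")
            if isinstance(answer, str) and answer.strip() in VALID_LABELS:
                return answer.strip()
        except json.JSONDecodeError:
            pass  # fall through to the lenient path

    # 2) Lenient path: accept the reply only if it names exactly one label
    found = {label for label in VALID_LABELS if re.search(rf"\b{label}\b", text)}
    return found.pop() if len(found) == 1 else "unparsed"
```

Ambiguous replies that mention two labels stay unparsed on purpose, so the fallback does not turn hedged answers into arbitrary predictions.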

In [25]:
def predict_nli_zeroshot(premise, hypothesis):
    system_prompt_zeroshot = (
        "Tu es un expert en inférence textuelle. "
        "Classifie la relation entre deux phrases en répondant uniquement par un objet JSON valide, et rien d'autre. "
        "Le format doit être : {\"answer\": \"<label>\"} où <label> est l'une des options suivantes : entailment, contradiction, neutral. "
        "Ne rajoute aucune explication.\n\n"
    )
    user_prompt_zeroshot = (
        f"""
        Premise: {premise}\n
        Hypothesis: {hypothesis}\n
        Réponse:
        """
        )
    response = completion(
        model="ollama/mistrallite",  # or whichever model name you use
        messages=[{"content": system_prompt_zeroshot,"role": "system"}, {"content": user_prompt_zeroshot,"role": "user"}],
        api_base="http://localhost:11434",
        stream=False
    )
    # Get the content of the response
    content = response.choices[0].message["content"].strip().lower()
    return extract_json(content)

In [28]:
def compute_accuracy_zero_shot(n_examples=50):
    # For debugging, draw n random examples from the "test" split.
    # Pass the size of the test split to evaluate on the whole test set.
    test_examples = random.sample(list(snli["test"]), n_examples)

    true_labels = []
    predicted_labels = []

    # SNLI label mapping:
    label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}

    for example in test_examples:
        try:
            premise = example["premise"]
            hypothesis = example["hypothesis"]
            numeric_label = example["label"]
            true_label = label_map[numeric_label]
        except KeyError as e:
            print(f"Missing key in example {example}: {e}")
            continue
        except Exception as e:
            print(f"Error while processing example {example}: {e}")
            continue

        prediction = predict_nli_zeroshot(premise, hypothesis)
        true_labels.append(true_label)
        predicted_labels.append(prediction)

    # Keep only valid predictions when computing the accuracy
    # (answers that could not be parsed are excluded, not counted as errors)
    valid_pairs = [(t, p) for t, p in zip(true_labels, predicted_labels)
                  if p in {"entailment", "contradiction", "neutral"}]

    if valid_pairs:
        filtered_true, filtered_pred = zip(*valid_pairs)
        accuracy = accuracy_score(filtered_true, filtered_pred)
        print(f"Accuracy on {len(valid_pairs)}/{len(true_labels)} valid examples: {accuracy:.2f}")
        return accuracy
    else:
        print("No valid prediction available to compute the accuracy.")
        return 0.0


In [44]:
# Set size to the size of the test split
size = len(snli["test"])

In [40]:
# Compute the accuracy
accuracy_few_zero_shot = compute_accuracy_zero_shot(size)
print("accuracy_few_zero_shot : ",accuracy_few_zero_shot)

Accuracy on 8680/9824 valid examples: 0.37
accuracy_few_zero_shot :  0.3721198156682028


## Few Shot Ollama model

In [35]:
def predict_nli_few_shot(premise, hypothesis):
    system_prompt_few_shot = (
        "Your task is to classify the relationship between a premise and a hypothesis as 'entailment', 'contradiction', or 'neutral'.\n"
        "You must respond with a valid JSON object, and nothing else.\n"
        "Format: {\"answer\": \"<label>\"} where <label> is one of: entailment, contradiction, neutral.\n"
        "DO NOT add explanations. DO NOT write anything else.\n\n"
        "Examples:\n"
        "Premise: A man is playing guitar.\n"
        "Hypothesis: A person is making music.\n"
        "Answer: {\"answer\": \"entailment\"}\n\n"
        "Premise: A cat is sleeping on a couch.\n"
        "Hypothesis: A cat is jumping from a table.\n"
        "Answer: {\"answer\": \"contradiction\"}\n\n"
        "Premise: A group of people is walking in a park.\n"
        "Hypothesis: The group is exercising.\n"
        "Answer: {\"answer\": \"neutral\"}\n\n"
        "Now classify the following:\n"
    )
    user_prompt_few_shot = (
        f"""
        Premise: {premise}\n
        Hypothesis: {hypothesis}\n
        """
        )
    response = completion(
        model="ollama/mistrallite",
        messages=[{"content": system_prompt_few_shot,"role": "system"}, {"content": user_prompt_few_shot,"role": "user"}],
        api_base="http://localhost:11434",
        stream=False
    )
    # Get the content of the response
    content = response.choices[0].message["content"].strip().lower()
    return extract_json(content)
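
The zero-shot and few-shot prompts could also be produced by a single builder, which keeps the instructions identical and changes only the number of demonstrations. The sketch below is one possible layout and not the one used for the reported results: it passes demonstrations as alternating user/assistant turns instead of placing them in the system prompt. The names `build_messages`, `format_pair` and `INSTRUCTIONS`, as well as the exact wording, are assumptions.

```python
import json

LABELS = ("entailment", "contradiction", "neutral")

INSTRUCTIONS = (
    "Classify the relationship between a premise and a hypothesis. "
    'Reply with a JSON object of the form {"answer": "<label>"} and nothing else, '
    "where <label> is one of: " + ", ".join(LABELS) + "."
)


def format_pair(premise, hypothesis):
    """Render one sentence pair without stray indentation or blank lines."""
    return f"Premise: {premise}\nHypothesis: {hypothesis}"


def build_messages(premise, hypothesis, examples=()):
    """Build chat messages; `examples` holds (premise, hypothesis, label) triples.

    An empty `examples` gives a zero-shot prompt, a non-empty one a few-shot prompt.
    """
    messages = [{"role": "system", "content": INSTRUCTIONS}]
    for ex_premise, ex_hypothesis, ex_label in examples:
        if ex_label not in LABELS:
            raise ValueError(f"unknown label: {ex_label!r}")
        messages.append({"role": "user", "content": format_pair(ex_premise, ex_hypothesis)})
        messages.append({"role": "assistant", "content": json.dumps({"answer": ex_label})})
    messages.append({"role": "user", "content": format_pair(premise, hypothesis)})
    return messages
```

The returned list has the same shape as the `messages` argument passed to `completion` above, so it could be swapped in to compare the two settings under otherwise equal conditions.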

In [36]:
def compute_accuracy_few_shot(n_examples=100):

    # For debugging, draw n random examples from the "test" split.
    # Pass the size of the test split to evaluate on the whole test set.

    test_examples = random.sample(list(snli["test"]), n_examples)

    true_labels = []
    predicted_labels = []

    # SNLI label mapping:
    label_map = {0: "entailment", 1: "neutral", 2: "contradiction"}

    for example in test_examples:
        try:
            premise = example["premise"]
            hypothesis = example["hypothesis"]
            numeric_label = example["label"]
            true_label = label_map[numeric_label]
        except KeyError as e:
            print(f"Missing key in example {example}: {e}")
            continue
        except Exception as e:
            print(f"Error while processing example {example}: {e}")
            continue

        prediction = predict_nli_few_shot(premise, hypothesis)
        true_labels.append(true_label)
        predicted_labels.append(prediction)

    # Keep only valid predictions when computing the accuracy
    # (answers that could not be parsed are excluded, not counted as errors)
    valid_pairs = [(t, p) for t, p in zip(true_labels, predicted_labels)
                  if p in {"entailment", "contradiction", "neutral"}]

    if valid_pairs:
        filtered_true, filtered_pred = zip(*valid_pairs)
        accuracy = accuracy_score(filtered_true, filtered_pred)
        print(f"Accuracy on {len(valid_pairs)}/{len(true_labels)} valid examples: {accuracy:.2f}")
        return accuracy
    else:
        print("No valid prediction available to compute the accuracy.")
        return 0.0

In [43]:
# Compute the accuracy (adapt to your needs)
acc_few_shots = compute_accuracy_few_shot(size)
print("Accuracy Few shot:", acc_few_shots)

Accuracy on 9109/9824 valid examples: 0.42
Accuracy Few shot: 0.42364694258425734


## Zero Shot vs Few Shot

For our textual-inference task, we compared two approaches: **zero-shot** and **few-shot**.

- **Zero-shot**: the model sees no example of the task. It relies only on its general understanding of language.
- **Few-shot**: a few labeled examples are given to the model in the prompt to help it understand the task.

### Results:

| Method | Accuracy (%) | Gain vs. random (points) |
| --- | --- | --- |
| Random | 33 | - |
| Zero-shot | 37 | +4 |
| Few-shot | 42 | +9 |

We observe that:
- The **zero-shot** approach reaches an accuracy of **37%**, slightly above random prediction (**33%**).
- The **few-shot** approach reaches an accuracy of **42%**, a larger gain.

Thus, even with very few examples, **few-shot** prompting improves the model's performance on this task. Two caveats apply: both accuracies are computed only on the answers that could be parsed (8680 and 9109 of the 9824 test examples respectively), and the zero-shot prompt is written in French while the few-shot prompt is in English, so the gap cannot be attributed to the examples alone.
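
Because the two accuracies are computed on valid answers only, it is worth checking what they become when an unparseable answer counts as an error. The sketch below derives this "strict" accuracy from the figures printed by the two evaluation cells; the term "strict accuracy" and the variable names are ours.

```python
# Figures copied from the two evaluation outputs above
runs = {
    "zero-shot": {"valid": 8680, "total": 9824, "accuracy_on_valid": 0.3721198156682028},
    "few-shot": {"valid": 9109, "total": 9824, "accuracy_on_valid": 0.42364694258425734},
}

summary = {}
for name, run in runs.items():
    # Recover the integer number of correct answers from the reported ratio
    correct = round(run["accuracy_on_valid"] * run["valid"])
    summary[name] = {
        "correct": correct,
        "coverage": run["valid"] / run["total"],     # share of parseable answers
        "strict_accuracy": correct / run["total"],   # unparseable answers count as wrong
    }

for name, stats in summary.items():
    print(f"{name:<10} coverage={stats['coverage']:.3f} "
          f"strict_accuracy={stats['strict_accuracy']:.3f}")
```

Under this stricter reading the few-shot prompt still comes out ahead, but the zero-shot figure drops to roughly the level of random guessing, which suggests that format compliance accounts for part of the difference.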

# BERT

In this part we study a **BERT** architecture.

The goal is to fine-tune a pretrained model available on Hugging Face.

### **Imports:**

In [3]:
import torch
import numpy as np

import evaluate
from transformers import (
    AutoTokenizer,
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

from sklearn.metrics import classification_report

### **Tokenization:**

In [4]:
# Initialize tokenizer
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenization function
def tokenize_function(sentence):
    # Encode each (premise, hypothesis) pair as a single sequence.
    # truncation=True cuts inputs longer than max_length; padding="max_length" pads shorter ones to 128 tokens.
    tokenized = tokenizer(sentence["premise"], sentence["hypothesis"], padding="max_length", max_length=128, truncation=True)
    # The "label" column is kept by Dataset.map, so it does not need to be added here
    return tokenized
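
To make the layout of an encoded pair concrete, here is a toy sketch of what the tokenizer produces for two sentences: a start marker, the premise, a separator, the hypothesis, a final separator, then padding with a matching attention mask. It splits on whitespace instead of using WordPiece, and its truncation rule is simplified, so it only illustrates the structure; `encode_pair` is a hypothetical helper.

```python
def encode_pair(premise, hypothesis, max_length=16):
    """Toy whitespace version of sentence-pair encoding (not the real tokenizer)."""
    tokens = ["[CLS]", *premise.lower().split(), "[SEP]",
              *hypothesis.lower().split(), "[SEP]"]
    if len(tokens) > max_length:
        # Simplified truncation: cut the tail but keep a closing [SEP]
        tokens = tokens[: max_length - 1] + ["[SEP]"]
    n_real = len(tokens)
    padding = max_length - n_real
    return {
        "tokens": tokens + ["[PAD]"] * padding,
        # 1 for real tokens, 0 for padding, so attention ignores the padded tail
        "attention_mask": [1] * n_real + [0] * padding,
    }


encoded = encode_pair("A person on a horse", "A person is outdoors")
print(encoded["tokens"])
print(encoded["attention_mask"])
```

In the real output, the ids 101, 102 and 0 play the roles of `[CLS]`, `[SEP]` and `[PAD]`.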

In [5]:
# Encode dataset
snli_encoded = snli.map(tokenize_function, batched=True)

Map: 100%|██████████| 9842/9842 [00:01<00:00, 9751.76 examples/s] 


In [6]:
# Example
example = tokenize_function(snli["train"][:5])
print("example", example)
print(type(example))

example {'input_ids': [[101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102, 1037, 2711, 2003, 2731, 2010, 3586, 2005, 1037, 2971, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102, 1037, 2711, 2003, 2012, 1037, 15736, 1010, 13063, 2019, 18168, 12260, 4674, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1037, 2711, 2006, 1037, 3586, 14523, 2

In [8]:
# Train and eval datasets
train_data = snli_encoded["train"]
eval_data = snli_encoded["validation"]

### **Implementing the evaluation metric:**

In [9]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
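
Accuracy alone does not show which classes get confused with each other. As a complement, the sketch below builds a confusion matrix and per-class recall from label ids in plain Python; it is an illustration with hypothetical helper names, not part of the training pipeline.

```python
def confusion_matrix(references, predictions, num_labels=3):
    """matrix[t][p] counts the examples with true label t predicted as p."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for true_id, pred_id in zip(references, predictions):
        matrix[true_id][pred_id] += 1
    return matrix


def accuracy_and_recalls(matrix):
    """Return overall accuracy and the recall of each class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    recalls = [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]
    return (correct / total if total else 0.0), recalls


# Tiny made-up example: ids follow SNLI (0 entailment, 1 neutral, 2 contradiction)
refs = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(refs, preds)
acc, recalls = accuracy_and_recalls(matrix)
print(matrix, acc, recalls)
```

With real predictions, `predictions` would be the argmax of the logits, as in `compute_metrics`.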

### **Training the model:**

We use DistilBERT as the base model.

We decided to freeze all DistilBERT weights except those of the last transformer layer:
- Training only the classification head greatly reduces compute cost and memory use, which makes fine-tuning cheaper. However, it limits the model's ability to adapt to the new data, and it capped the performance of our model (accuracy of about 55%).
- Unfreezing the last transformer layer allows some adaptation to the NLI task while keeping training efficient, a balance between compute savings and model flexibility.

In [10]:
num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

# Freeze the DistilBERT base model weights so only the classifier head is trained
for param in model.distilbert.parameters():
    param.requires_grad = False

# Unfreeze the last transformer layer
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True
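
To get a feel for how much of the network this setup trains, the parameter counts can be worked out by hand. The sketch below assumes the usual `distilbert-base-uncased` configuration (vocabulary 30522, 512 positions, hidden size 768, feed-forward size 3072, 6 layers); these values are assumptions and are worth checking against `model.config`.

```python
# ASSUMED distilbert-base-uncased configuration values (check model.config)
VOCAB, MAX_POS, DIM, HIDDEN, N_LAYERS, N_LABELS = 30522, 512, 768, 3072, 6, 3


def linear(n_in, n_out):
    """Parameter count of a dense layer: weights plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * DIM  # scale and shift vectors
embeddings = VOCAB * DIM + MAX_POS * DIM + layer_norm
transformer_layer = (
    4 * linear(DIM, DIM)                          # query, key, value, output projections
    + linear(DIM, HIDDEN) + linear(HIDDEN, DIM)   # feed-forward block
    + 2 * layer_norm                              # two layer norms per layer
)
head = linear(DIM, DIM) + linear(DIM, N_LABELS)   # pre_classifier + classifier

total = embeddings + N_LAYERS * transformer_layer + head
trainable = transformer_layer + head              # last layer + classification head

print(f"total={total:,} trainable={trainable:,} ({trainable / total:.1%})")
```

Under these assumptions roughly one parameter in nine is updated. The figure can be compared with `sum(p.numel() for p in model.parameters() if p.requires_grad)` on the actual model.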

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Choice of parameters:**

- <i>metric_name = "accuracy"</i> :
    - We use accuracy as the evaluation metric: it is simple and well suited to a classification task with balanced classes.
- <i>batch_size = 128</i> :
    - A large batch size speeds up training.
    - The GPUs we used have enough memory for batches of 128, whereas 256, for example, caused out-of-memory errors.
- <i>eval_strategy="steps"</i> :
    - To monitor performance more frequently during training.
- <i>save_strategy="steps"</i> :
    - The model is saved every 250 steps, so that the best checkpoint is kept as training progresses.
- <i>learning_rate=1e-4</i> :
    - A learning rate slightly higher than usual. We tested several learning rates in the range [1e-5, 5e-4].
    - Judging from our tests, this value suits a model in which only the last transformer layer is fine-tuned and the remaining weights are frozen.
- <i>num_train_epochs=3</i> :
    - We set the number of epochs to 3, a compromise between training long enough for the NLI task and keeping the training time reasonable.
- <i>weight_decay=0.001</i> :
    - Weight decay (L2 regularization) limits overfitting by penalizing large weights. This argument is commented out in the training cell, so the run reported here does not apply it.
- <i>warmup_steps=int(0.05*len(train_data)/batch_size)</i> :
    - It is common practice to warm up over 5-10% of the training steps, which stabilizes learning at the start of training. As written, the expression equals 5% of the steps of one epoch.
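
The step counts implied by these settings can be worked out before launching a run. The sketch below assumes a filtered training split of 549,367 pairs (an assumed figure; use `len(train_data)` in practice), a single device and no gradient accumulation.

```python
import math

n_train = 549_367        # ASSUMPTION: size of the filtered SNLI training split
batch_size = 128
num_train_epochs = 3
eval_steps = 250

# One optimizer step per batch (single device, no gradient accumulation assumed)
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same expression as the warmup_steps argument in the notebook
warmup_steps = int(0.05 * n_train / batch_size)
warmup_share = warmup_steps / total_steps

n_evaluations = total_steps // eval_steps

print(f"steps/epoch={steps_per_epoch}, total={total_steps}, "
      f"warmup={warmup_steps} ({warmup_share:.1%} of training), "
      f"evaluations={n_evaluations}")
```

Under these assumptions the warmup covers under 2% of the whole run, because the formula is based on one epoch rather than three.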

In [33]:
metric_name = "accuracy"
model_name = model_checkpoint.split("/")[-1]

batch_size = 128

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-SNLI5",
    eval_strategy= "steps", # monitor performance more frequently
    eval_steps=250,
    save_strategy = "steps",
    save_steps=250,
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    #weight_decay=0.001, # L2 regularization (weight decay) to limit overfitting by penalizing large weights; currently disabled
    warmup_steps=int(0.05*len(train_data)/batch_size), # 5% of the steps of one epoch (common practice is 5-10% warmup; kept low since most weights are frozen)
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    lr_scheduler_type="reduce_lr_on_plateau",  # Reduce learning rate when performance plateaus
    # fp16=True,  # Enable mixed precision training for faster training
    push_to_hub=False,
)

In [34]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [35]:
trainer.train()
trainer.save_model("HomeDistilBert")

| Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 250 | No log | 0.69624 | 0.780837 |
| 500 | 0.380500 | 0.521546 | 0.799837 |
| 750 | 0.380500 | 0.499915 | 0.803394 |
| 1000 | 0.519900 | 0.483518 | 0.803698 |
| 1250 | 0.519900 | 0.475946 | 0.810404 |
| 1500 | 0.529600 | 0.473947 | 0.814062 |
| 1750 | 0.529600 | 0.472094 | 0.811928 |
| 2000 | 0.516800 | 0.47185 | 0.812335 |
| 2250 | 0.516800 | 0.462318 | 0.817618 |
| 2500 | 0.508500 | 0.469238 | 0.817923 |
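
With `load_best_model_at_end=True` and `metric_for_best_model="accuracy"`, the checkpoint kept at the end is the one with the highest validation accuracy. The sketch below replays that selection on the logged values shown above and contrasts it with a selection on validation loss; it is an illustration of the rule, not the Trainer's own code.

```python
# (step, validation loss, accuracy) copied from the training log above
eval_log = [
    (250, 0.69624, 0.780837),
    (500, 0.521546, 0.799837),
    (750, 0.499915, 0.803394),
    (1000, 0.483518, 0.803698),
    (1250, 0.475946, 0.810404),
    (1500, 0.473947, 0.814062),
    (1750, 0.472094, 0.811928),
    (2000, 0.47185, 0.812335),
    (2250, 0.462318, 0.817618),
    (2500, 0.469238, 0.817923),
]

# Higher is better for accuracy, lower is better for loss
best_by_accuracy = max(eval_log, key=lambda row: row[2])
best_by_loss = min(eval_log, key=lambda row: row[1])

print("best by accuracy: step", best_by_accuracy[0])
print("best by loss:     step", best_by_loss[0])
```

On this log the two criteria pick different checkpoints, which shows that the choice of `metric_for_best_model` is not neutral.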


### **Testing the model:**

In [36]:
# Load the trained model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("HomeDistilBert")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

In [37]:
def evaluate_model(model, snli_encoded, tokenizer):
    # Set model to evaluation mode and move it to the appropriate device
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Prepare test dataset
    test_dataset = snli_encoded["test"]

    predictions = []
    true_labels = []

    # Loop through the test dataset for evaluation
    for i in range(len(test_dataset)):
        # Convert lists to tensors and add a batch dimension with unsqueeze(0)
        input_ids = torch.tensor(test_dataset[i]['input_ids']).unsqueeze(0).to(device)
        attention_mask = torch.tensor(test_dataset[i]['attention_mask']).unsqueeze(0).to(device)

        inputs = {
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }

        # Get the model prediction without tracking gradients
        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        predictions.append(pred)
        true_labels.append(test_dataset[i]['label'])

    # Print a detailed classification report
    print("\nDetailed Classification Report:")
    print(classification_report(
        true_labels,
        predictions,
        # Names must follow the label ids: 0 = entailment, 1 = neutral, 2 = contradiction
        target_names=['Entailment', 'Neutral', 'Contradiction']
    ))

    # Print sample predictions
    print("\nSample Predictions:")
    for i in range(min(5, len(test_dataset))):
        input_ids = torch.tensor(test_dataset[i]['input_ids']).unsqueeze(0).to(device)
        attention_mask = torch.tensor(test_dataset[i]['attention_mask']).unsqueeze(0).to(device)

        inputs = {
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }

        with torch.no_grad():
            outputs = model(**inputs)
            pred = outputs.logits.argmax(dim=-1).item()

        # Decode the input_ids to show the original text (premise followed by hypothesis)
        text = tokenizer.decode(test_dataset[i]['input_ids'], skip_special_tokens=True)

        print(f"\nExample {i+1}:")
        print(f"Text: {text}")
        print(f"True Label: {test_dataset[i]['label']}")
        print(f"Predicted Label: {pred}")

# Run the evaluation
evaluate_model(model, snli_encoded, tokenizer)


Detailed Classification Report:
               precision    recall  f1-score   support

   Entailment       0.85      0.87      0.86      3368
      Neutral       0.79      0.80      0.79      3219
Contradiction       0.86      0.83      0.84      3237

     accuracy                           0.83      9824
    macro avg       0.83      0.83      0.83      9824
 weighted avg       0.83      0.83      0.83      9824


Sample Predictions:

Example 1:
Text: this church choir sings to the masses as they sing joyous songs from the book at a church. the church has cracks in the ceiling.
True Label: 1
Predicted Label: 2

Example 2:
Text: this church choir sings to the masses as they sing joyous songs from the book at a church. the church is filled with song.
True Label: 0
Predicted Label: 0

Example 3:
Text: this church choir sings to the masses as they sing joyous songs from the book at a church. a choir singing at a baseball game.
True Label: 2
Predicted Label: 2

Example 4:
Text: a woman 

In [38]:
# Evaluate the model on the test dataset
test_results = trainer.evaluate(eval_dataset=snli_encoded["test"])

# Print the evaluation results
print("Test results:", test_results)

Test results: {'eval_loss': 0.42883583903312683, 'eval_accuracy': 0.8326547231270358, 'eval_runtime': 7.8873, 'eval_samples_per_second': 1245.548, 'eval_steps_per_second': 9.763, 'epoch': 3.0}


### **Possible improvements:**

**Increase the number of epochs**:
- The number of epochs is currently set to 3, which is often enough for a quick first evaluation. Raising it to 5 or more could let the model adapt further to the data.

**Adapt the learning rate to performance**:
- If accuracy stops improving after several evaluations, the learning rate can be adjusted dynamically. Lowering it when performance plateaus allows finer steps around a local minimum and further refines the weights.

**Increase the number of warmup steps**:
- If training starts with a low learning rate, more warmup steps give a more gradual ramp-up of the learning rate at the start of training. This can help avoid abrupt weight updates, particularly when convergence is slow.

**Use additional regularization techniques**:
- To prevent overfitting, we could explore other regularization techniques such as dropout, or tune the weight decay further.

**Use a larger or smaller batch size**:
- We could increase the batch size if GPU memory allows, to speed up training. However, too large a batch size can hurt generalization, so a balance has to be found.

**Explore other models**:
- Larger variants such as BERT or RoBERTa may offer significant performance gains.

**Improve the input data**:
- If the model struggles to learn, data augmentation or preprocessing methods (cleaning, text normalization, noise injection) could make it more robust.