## LAB: ZERO-SHOT CLASSIFICATION

### By Mathieu SAUVEUR

**Note before starting: some parts of this lab may be slightly similar to those of Byong Hee Lee because we worked together; however, each of us is submitting our own work.**

**_Goal of the lab_:** *The goal of this lab is to reproduce a zero-shot classification pipeline on three datasets with two scoring functions, following the article "Small language models for Zero-shot classification". We relied on the Hugging Face tutorial "CS224N: Hugging Face Transformers Tutorial (Spring '24)". The initial model is TinyStories (1 million parameters), along with some of its slightly larger versions.*

In [2]:
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import pipeline



## First dataset: **ag_news**

ag_news is a text classification dataset of news articles from 4 categories: World, Sports, Business and Science/Technology.

### The scoring functions used to evaluate the models are:
- **Probabilities**: the predicted label is the one with the highest probability.
- **DCPMI**: the predicted label is the one with the highest DCPMI score. The DCPMI score is the ratio between the probability of a label given the input text and the probability of that same label given a generic control sentence written specifically for that label.
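
The two scoring functions described above can be sketched on a toy example. This is a minimal illustration, not the code used later in this notebook: the helper names `predict_by_probability` and `predict_by_dcpmi` are hypothetical, and all probability values are made up for the example.

```python
def predict_by_probability(probabilities):
    """Return the index of the label with the highest probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def predict_by_dcpmi(probabilities, control_probabilities):
    """Return the index of the label with the highest ratio
    P(label | text) / P(label | control sentence for that label)."""
    ratios = [p / c for p, c in zip(probabilities, control_probabilities)]
    return max(range(len(ratios)), key=lambda i: ratios[i])


# Illustrative values only (not taken from any model run)
labels = ["World", "Sports", "Business", "Science/Technology"]
probabilities = [0.24, 0.26, 0.22, 0.28]          # P(label | article)
control_probabilities = [0.20, 0.25, 0.25, 0.30]  # P(label | its control sentence)

print(labels[predict_by_probability(probabilities)])  # Science/Technology
print(labels[predict_by_dcpmi(probabilities, control_probabilities)])  # World
```

Dividing by the control probability penalizes labels that the model favors regardless of the input. Taking the logarithm of the ratio would give the same ranking.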

In [24]:
dataset = load_dataset("ag_news")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

In [25]:
unique_labels = set(dataset['train']['label'])
print(unique_labels)

{0, 1, 2, 3}


In [5]:
classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-1M")
# classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-33M")
# classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

labels = ["World", "Sports", "Business", "Science/Technology"]  # Label names taken from Kaggle/Hugging Face

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
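
For each input, the zero-shot pipeline returns its candidate labels sorted from the highest score to the lowest, so the scores have to be mapped back to a fixed label order before they can be compared with the dataset's integer labels. The sketch below shows that step in isolation. The helper name `scores_in_label_order` is hypothetical, and `mock_result` is a hand-written stand-in for a pipeline result with illustrative values, not an actual model output.

```python
def scores_in_label_order(result, labels):
    """Map a zero-shot pipeline result back to the order of `labels`."""
    label_to_score = dict(zip(result["labels"], result["scores"]))
    return [label_to_score[label] for label in labels]


labels = ["World", "Sports", "Business", "Science/Technology"]

# Hand-written stand-in for a pipeline result (illustrative values)
mock_result = {
    "sequence": "Some article text",
    "labels": ["Science/Technology", "Sports", "Business", "World"],
    "scores": [0.28, 0.25, 0.24, 0.23],
}

ordered = scores_in_label_order(mock_result, labels)
print(ordered)  # [0.23, 0.25, 0.24, 0.28]
```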


In [27]:
def calculate_score(probabilities, conditional_probabilities=None, scoring_function="Probability"):
    """Return the index of the predicted label given the label probabilities.

    If scoring_function is "Probability", the predicted label has the maximum probability.
    If scoring_function is "DCPMI", the predicted label has the maximum DCPMI.

    Args:
        probabilities (list): The probabilities of each label.
        conditional_probabilities (list): For each control sentence, the probabilities of each label.
        scoring_function (str): The scoring function to use.

    Returns:
        int: The index of the predicted label.
    """
    if scoring_function == "Probability":
        return probabilities.index(max(probabilities))
    elif scoring_function == "DCPMI":
        # Divide each label probability by the probability of that label on its own control sentence
        dcpmis = [prob / conditional_probabilities[i][i] for i, prob in enumerate(probabilities)]
        return dcpmis.index(max(dcpmis))
    # Other scoring functions could be added here with further elif branches


def get_conditional_probabilities(labels, classifier):
    """Get the conditional probabilities of every label for each control sentence.

    Args:
        labels (list): The list of labels.
        classifier (pipeline): The zero-shot classification pipeline.

    Returns:
        list: The list of conditional probabilities.
    """
    conditional_probabilities = []
    for label in labels:
        input_text = f"This article discusses topics related to {label}."
        result = classifier(input_text, candidate_labels=labels)
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        ordered_probs = [label_to_prob[label] for label in labels]
        conditional_probabilities.append(ordered_probs)
        # Print the result for the control sentence associated with each label
        print(f"Text about {label}  --> Conditional probabilities for it being either World/Sports/Business/Science&Technology: {ordered_probs}")
    return conditional_probabilities


def apply_model_and_evaluate(dataset, labels, classifier, scoring_function="Probability"):
    """Apply the model to the dataset and compute the accuracy of its predictions.

    Args:
        dataset (Dataset): The dataset to evaluate.
        labels (list): The list of labels.
        classifier (pipeline): The zero-shot classification pipeline.
        scoring_function (str): The scoring function to use.

    Returns:
        float: The accuracy of the model.
        DataFrame: The comparison between the true and predicted labels.
    """
    y_true = []
    y_pred = []

    # The conditional probabilities are only needed for DCPMI
    if scoring_function == "DCPMI":
        conditional_probabilities = get_conditional_probabilities(labels, classifier)
    n = 0
    # Set to False to print the details of the first processed item
    # (text, labels, probabilities and predicted label)
    nb_items_processed = True

    # Evaluate on the first 3000 test items; remove .select() to use the whole split
    for item in tqdm(dataset['test'].select(range(3000)), desc="Processing"):
        text = item['text']
        true_label = item['label']
        y_true.append(true_label)

        # Build the prompt given to the classifier.
        # NOTE: this wording was written for the sentiment task, not for news topics;
        # it is kept here because the recorded runs used it.
        prompt = f"Is the following sentence negative, positive or neutral?\n{text}"
        # Pass it to the classifier together with the candidate labels
        result = classifier(prompt, candidate_labels=labels)

        # The result lists the scores from highest to lowest, so put them back in the order of `labels`
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        probabilities = [label_to_prob[label] for label in labels]

        # Predict with the chosen scoring function
        if scoring_function == "Probability":
            # scoring_function must be passed by keyword: the second positional
            # parameter of calculate_score is conditional_probabilities
            predicted_label = calculate_score(probabilities, scoring_function=scoring_function)
            y_pred.append(predicted_label)

        elif scoring_function == "DCPMI":
            predicted_label = calculate_score(probabilities, conditional_probabilities, scoring_function)
            y_pred.append(predicted_label)

        if not nb_items_processed:  # Print the details of the first processed item
            print(f"\nItem n°{n+1}")
            print(f"Text: {text}")
            print(f"Labels: {labels}")
            print(f"Probabilities: {probabilities}")
            print(f"Predicted label: {predicted_label}\n")
            n += 1
            if n == 1:
                nb_items_processed = True

    # Accuracy of the zero-shot predictions over all evaluated items
    accuracy = accuracy_score(y_true, y_pred)

    # DataFrame comparing the true and predicted labels, returned together with the accuracy
    comparaison_true_predicted = pd.DataFrame({"True Label": y_true, "Predicted Label": y_pred})

    return accuracy, comparaison_true_predicted
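
A comment in `calculate_score` notes that other scoring functions could be added with further `elif` branches. One possible alternative, shown here only as a sketch, is to register the scoring functions in a dictionary and to raise an error on an unknown name. The names `SCORING_FUNCTIONS`, `score_probability`, `score_dcpmi` and `predict` are hypothetical, and the numbers are illustrative.

```python
def score_probability(probabilities, conditional_probabilities=None):
    """Index of the label with the highest probability."""
    return probabilities.index(max(probabilities))


def score_dcpmi(probabilities, conditional_probabilities):
    """Index of the label with the highest probability / control-probability ratio."""
    ratios = [p / conditional_probabilities[i][i] for i, p in enumerate(probabilities)]
    return ratios.index(max(ratios))


# Registry mapping a scoring-function name to its implementation
SCORING_FUNCTIONS = {"Probability": score_probability, "DCPMI": score_dcpmi}


def predict(probabilities, conditional_probabilities=None, scoring_function="Probability"):
    """Dispatch to the requested scoring function, failing loudly on unknown names."""
    try:
        scorer = SCORING_FUNCTIONS[scoring_function]
    except KeyError:
        raise ValueError(f"Unknown scoring function: {scoring_function!r}") from None
    return scorer(probabilities, conditional_probabilities)


# Illustrative values only
probabilities = [0.3, 0.5, 0.2]
conditional_probabilities = [
    [0.20, 0.50, 0.30],
    [0.10, 0.60, 0.30],
    [0.25, 0.50, 0.25],
]
print(predict(probabilities))                                      # 1
print(predict(probabilities, conditional_probabilities, "DCPMI"))  # 0
```

Adding a scoring function then only requires a new entry in the dictionary, and a misspelled name raises an error instead of silently falling through.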

### Evaluating the zero-shot predictions on the dataset (base model)

In [9]:
# Call apply_model_and_evaluate with the default "Probability" scoring function
accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

# Call apply_model_and_evaluate with the "DCPMI" scoring function
accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Processing:   0%|          | 0/3000 [00:00<?, ?it/s]

Processing: 100%|██████████| 3000/3000 [15:23<00:00,  3.25it/s]


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.26166666666666666  <<<<<<<

Text about World  --> Conditional probabilities for it being either World/Sports/Business/Science/Technology: [0.22768542170524597, 0.2462114691734314, 0.24311888217926025, 0.28298428654670715]
Text about Sports  --> Conditional probabilities for it being either World/Sports/Business/Science/Technology: [0.22698958218097687, 0.24694472551345825, 0.24610139429569244, 0.27996426820755005]
Text about Business  --> Conditional probabilities for it being either World/Sports/Business/Science/Technology: [0.22893641889095306, 0.24610473215579987, 0.2467629760503769, 0.2781957685947418]
Text about Science/Technology  --> Conditional probabilities for it being either World/Sports/Business/Science/Technology: [0.23483063280582428, 0.2503969073295593, 0.24123473465442657, 0.2735377550125122]


Processing: 100%|██████████| 3000/3000 [12:57<00:00,  3.86it/s]

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.24766666666666667  <<<<<<<






### Evaluating the zero-shot predictions on the dataset (larger model)

In [28]:
classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

# Call apply_model_and_evaluate with the default "Probability" scoring function
accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

# Call apply_model_and_evaluate with the "DCPMI" scoring function
accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at calum/tinystories-gpt2-3M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Processing:   0%|          | 0/3000 [00:00<?, ?it/s]Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Processing: 100%|██████████| 3000/3000 [08:24<00:00,  5.95it/s]


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.25833333333333336  <<<<<<<

Text about no  --> Conditional probabilities for it being either World/Sports/Business/Science&Technology: [0.43308448791503906, 0.5669155716896057]
Text about yes  --> Conditional probabilities for it being either World/Sports/Business/Science&Technology: [0.4498637616634369, 0.5501362681388855]


Processing: 100%|██████████| 3000/3000 [08:31<00:00,  5.86it/s] 

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.24833333333333332  <<<<<<<






The code in the following parts has the same structure as in this part. For that reason, apart from the docstrings, the functions are not commented in the following parts.

## Second dataset: **financial_phrasebank**

financial_phrasebank is a text classification dataset of financial sentences from 3 categories: neutral, positive and negative.

### The scoring functions used to evaluate the models are:
- **Probabilities**: the predicted label is the one with the highest probability.
- **DCPMI**: the predicted label is the one with the highest DCPMI score. The DCPMI score is the ratio between the probability of a label given the input text and the probability of that same label given a generic control sentence written specifically for that label.

In [29]:
dataset = load_dataset('financial_phrasebank', 'sentences_allagree')
dataset


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2264
    })
})

In [11]:
unique_labels = set(dataset['train']['label'])
print(unique_labels)

{0, 1, 2}


In [12]:
classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-1M")
# classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-33M")
# classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

labels = ["negative", "neutral", "positive"]

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [30]:
def calculate_score(probabilities, conditional_probabilities=None, scoring_function="Probability"):
    """Return the index of the predicted label given the label probabilities.

    If scoring_function is "Probability", the predicted label has the maximum probability.
    If scoring_function is "DCPMI", the predicted label has the maximum DCPMI.

    Args:
        probabilities (list): The probabilities of each label.
        conditional_probabilities (list): The conditional probabilities of each label.
        scoring_function (str): The scoring function to use.

    Returns:
        int: The index of the predicted label.
    """
    if scoring_function == "Probability":
        return probabilities.index(max(probabilities))
    elif scoring_function == "DCPMI":
        dcpmis = [prob / conditional_probabilities[i][i] for i, prob in enumerate(probabilities)]
        return dcpmis.index(max(dcpmis))
    return 0

def get_conditional_probabilities(labels, classifier):
    """Get the conditional probabilities of every label for each control sentence.

    Args:
        labels (list): The list of labels.
        classifier (pipeline): The zero-shot classification pipeline.

    Returns:
        list: The list of conditional probabilities.
    """
    conditional_probabilities = []
    for label in labels:
        input_text = f"This sentence is {label}."
        result = classifier(input_text, candidate_labels=labels)
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        ordered_probs = [label_to_prob[label] for label in labels]
        conditional_probabilities.append(ordered_probs)
        print(f"For the {label} sentence  --> Conditional probabilities for it being either negative/neutral/positive: {ordered_probs}")
    return conditional_probabilities

def apply_model_and_evaluate(dataset, labels, classifier, scoring_function="Probability"):
    """Apply the model to the dataset and compute the accuracy of its predictions.

    Args:
        dataset (Dataset): The dataset to evaluate.
        labels (list): The list of labels.
        classifier (pipeline): The zero-shot classification pipeline.
        scoring_function (str): The scoring function to use.

    Returns:
        float: The accuracy of the model.
        DataFrame: The comparison between the true and predicted labels.
    """
    y_true = []
    y_pred = []
    if scoring_function=="DCPMI":
        conditional_probabilities = get_conditional_probabilities(labels, classifier)
    n=0
    nb_items_processed = True 

    for item in tqdm(dataset['train'], desc="Processing"):  # .select(range(500))
        text = item['sentence'] 
        true_label = item['label']
        y_true.append(true_label)
        
        prompt = f"Is the following sentence negative, positive or neutral?\n{text}"
        result = classifier(prompt, candidate_labels=labels)
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        probabilities = [label_to_prob[label] for label in labels]

        if scoring_function=="Probability":
            predicted_label = calculate_score(probabilities, scoring_function=scoring_function)
            y_pred.append(predicted_label)
            
        elif scoring_function=="DCPMI":
            predicted_label = calculate_score(probabilities, conditional_probabilities, scoring_function)
            y_pred.append(predicted_label)

        if not nb_items_processed: 
            print(f"\nItem n°{n+1}")
            print(f"Text: {text}")
            print(f"Labels: {labels}")
            print(f"Probabilities: {probabilities}")
            print(f"Predicted label: {predicted_label}\n")
            n += 1
            if n == 1:
                nb_items_processed = True

    
    accuracy = accuracy_score(y_true, y_pred)
    comparaison_true_predicted = pd.DataFrame({"True Label": y_true, "Predicted Label": y_pred})

    return accuracy, comparaison_true_predicted

### Evaluating the zero-shot predictions on the dataset (base model)

In [14]:
accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Processing:   0%|          | 0/2264 [00:00<?, ?it/s]Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Processing: 100%|██████████| 2264/2264 [05:36<00:00,  6.73it/s]


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.22924028268551236  <<<<<<<

For the negative sentence  --> Conditional probabilities for it being either negative/neutral/positive: [0.34229576587677, 0.31504562497138977, 0.3426586389541626]
For the neutral sentence  --> Conditional probabilities for it being either negative/neutral/positive: [0.3448764681816101, 0.3156158924102783, 0.3395076394081116]
For the positive sentence  --> Conditional probabilities for it being either negative/neutral/positive: [0.3425242602825165, 0.3149176836013794, 0.3425580561161041]


Processing: 100%|██████████| 2264/2264 [06:07<00:00,  6.15it/s]

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.5552120141342756  <<<<<<<






### Evaluating the zero-shot predictions on the dataset (larger model)

In [31]:
classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at calum/tinystories-gpt2-3M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Processing:   0%|          | 0/2264 [00:00<?, ?it/s]Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Processing: 100%|██████████| 2264/2264 [04:06<00:00,  9.18it/s]


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.15812720848056538  <<<<<<<

For the no sentence  --> Conditional probabilities for it being either negative/neutral/positive: [0.5022147297859192, 0.4977853000164032]
For the yes sentence  --> Conditional probabilities for it being either negative/neutral/positive: [0.5056576728820801, 0.49434226751327515]


Processing: 100%|██████████| 2264/2264 [04:10<00:00,  9.04it/s]

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.2628091872791519  <<<<<<<






## Third dataset: **sms_spam**

sms_spam is a text classification dataset of SMS messages from 2 categories: spam messages and normal messages. There are therefore two labels, [no, yes], answering the question "Is this message spam?".

### The scoring functions used to evaluate the models are:
- **Probabilities**: the predicted label is the one with the highest probability.
- **DCPMI**: the predicted label is the one with the highest DCPMI score. The DCPMI score is the ratio between the probability of a label given the input text and the probability of that same label given a generic control sentence written specifically for that label.

In [32]:
dataset = load_dataset("ucirvine/sms_spam")
dataset


DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})

In [16]:
unique_labels = set(dataset['train']['label'])
print(unique_labels)

{0, 1}


In [34]:
classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-1M")
# classifier = pipeline("zero-shot-classification", model="roneneldan/TinyStories-33M")
# classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

labels = ['no', 'yes']

Some weights of GPTNeoForSequenceClassification were not initialized from the model checkpoint at roneneldan/TinyStories-1M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


In [33]:
def calculate_score(probabilities, conditional_probabilities=None, scoring_function="Probability"):
    """Return the index of the predicted label given the label probabilities.

        If scoring_function is "Probability", the score is the maximum probability.
        If scoring_function is "DCPMI", the score is the maximum DCPMI.

    Args:
        probabilities (list): The probabilities of each label.
        conditional_probabilities (list): The conditional probabilities of each label.
        scoring_function (str): The scoring function to use.

    Returns:
        int: The index of the predicted label.
    """
    if scoring_function == "Probability":
        return probabilities.index(max(probabilities))
    elif scoring_function == "DCPMI":
        dcpmis = [prob / conditional_probabilities[i][i] for i, prob in enumerate(probabilities)]
        return dcpmis.index(max(dcpmis))
    return 0

def get_conditional_probabilities(classifier, domains=['a normal sms', 'a spam sms']):
    """Get the conditional probabilities of each label for each domain.

    Args:
        classifier (pipeline): The zero-shot classification pipeline.
        domains (list): The list of domains.

    Returns:
        list: The list of conditional probabilities for each domain.
    """
    conditional_probabilities = []
    for domain in domains:
        input_text = f"This message is {domain}"
        result = classifier(input_text, candidate_labels=domains)
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        ordered_probs = [label_to_prob[domain] for domain in domains]
        conditional_probabilities.append(ordered_probs)
        print(f"For {domain} --> Conditional probabilities for it being a spam (normal/spam): {ordered_probs}")
    return conditional_probabilities

def apply_model_and_evaluate(dataset, labels, classifier, scoring_function="Probability"):
    """Apply the model to the dataset and compute the accuracy of its predictions.

    Args:
        dataset (Dataset): The dataset to evaluate.
        labels (list): The list of labels.
        classifier (pipeline): The zero-shot classification pipeline.
        scoring_function (str): The scoring function to use.

    Returns:
        float: The accuracy of the model.
        DataFrame: The comparison between the true and predicted labels.
    """
    y_true = []
    y_pred = []
    if scoring_function=="DCPMI":
        conditional_probabilities = get_conditional_probabilities(classifier)
    n=0
    nb_items_processed = True 

    for item in tqdm(dataset['train'].select(range(3000)), desc="Processing"): 
        text = item['sms'] 
        true_label = item['label']
        y_true.append(true_label)
        
        prompt = f"Is the following message a spam? Answer by yes or no.\n{text}" 
        result = classifier(prompt, candidate_labels=labels)
        label_to_prob = dict(zip(result["labels"], result["scores"]))
        probabilities = [label_to_prob[label] for label in labels]

        if scoring_function=="Probability":
            predicted_label = calculate_score(probabilities, scoring_function=scoring_function)
            y_pred.append(predicted_label)
            
        elif scoring_function=="DCPMI":
            predicted_label = calculate_score(probabilities, conditional_probabilities, scoring_function)
            y_pred.append(predicted_label)

        if not nb_items_processed: 
            print(f"\nItem n°{n+1}")
            print(f"Text: {text}")
            print(f"Labels: {labels}")
            print(f"Probabilities: {probabilities}")
            print(f"Predicted label: {predicted_label}\n")
            n += 1
            if n == 1:
                nb_items_processed = True

    
    accuracy = accuracy_score(y_true, y_pred)
    comparaison_true_predicted = pd.DataFrame({"True Label": y_true, "Predicted Label": y_pred})

    return accuracy, comparaison_true_predicted

### Evaluating the zero-shot predictions on the dataset (base model)

In [35]:
accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Processing:   0%|          | 0/3000 [00:00<?, ?it/s]Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Processing: 100%|██████████| 3000/3000 [05:51<00:00,  8.54it/s] 


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.8406666666666667  <<<<<<<

For a normal sms --> Conditional probabilities for it being a spam (normal/spam): [0.4985571503639221, 0.5014428496360779]
For a spam sms --> Conditional probabilities for it being a spam (normal/spam): [0.49690547585487366, 0.503094494342804]


Processing: 100%|██████████| 3000/3000 [05:42<00:00,  8.76it/s]

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.849  <<<<<<<






### Evaluating the zero-shot predictions on the dataset (larger model)

In [36]:
classifier = pipeline("zero-shot-classification", model="calum/tinystories-gpt2-3M")

accuracy_proba, comp_proba = apply_model_and_evaluate(dataset, labels, classifier)
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : {accuracy_proba}  <<<<<<<\n")

accuracy_DCPMI, comp_DCPMI = apply_model_and_evaluate(dataset, labels, classifier, scoring_function="DCPMI")
print(f">>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : {accuracy_DCPMI}  <<<<<<<\n")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at calum/tinystories-gpt2-3M and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
Processing:   0%|          | 0/3000 [00:00<?, ?it/s]Tokenizer was not supporting padding necessary for zero-shot, attempting to use  `pad_token=eos_token`
Processing: 100%|██████████| 3000/3000 [07:23<00:00,  6.77it/s]


>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using Probabilities is : 0.8176666666666667  <<<<<<<

For a normal sms --> Conditional probabilities for it being a spam (normal/spam): [0.4909922480583191, 0.5090077519416809]
For a spam sms --> Conditional probabilities for it being a spam (normal/spam): [0.4946073889732361, 0.5053926706314087]


Processing: 100%|██████████| 3000/3000 [06:15<00:00,  7.99it/s]

>>>>>>>  The accuracy of the TinyStories model zero-shot prediction using DCPMI is : 0.8506666666666667  <<<<<<<






### Observations

*Analysis of the results obtained with the different models:*

1) TinyStories 1M model (first model)
- On the first dataset, the accuracy is around 25-26% for 4 possible labels. The model therefore predicts about as well as a model that picks labels at random (1 chance in 4).
For this dataset, the best model (MBZUAI/LaMini-GPT-124M) reaches an accuracy of 73.4%, so what I tested is far from it.

- The second dataset has 3 labels. I get about 23% with the 'Probability' scoring function and about 55% with the 'DCPMI' scoring function. The first is worse than random (1 chance in 3), while the second is somewhat better (roughly 20 percentage points above chance).
For this dataset, the best model (MBZUAI/LaMini-GPT-774M) reaches an accuracy of 74.4%, so DCPMI brings us slightly closer to it.

- The third dataset is a spam classification task, so the label is binary: either the SMS is spam or it is not. Overall I get accuracies of about 85%, which is high. It is about 15 percentage points above the best score obtained by the mosaicml/mpt-7b model mentioned in the article. This seems strange to me, but I could not find what in my code might cause it.
It is also worth mentioning that some runs of the same code gave accuracies of only about 16%.
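
One check that is not part of the original notebook, and that might help interpret the third dataset, is to compare the measured accuracy with what a constant prediction would achieve. On an imbalanced binary dataset, always answering the majority label gives a high accuracy, and always answering the minority label gives a very low one. In addition, the warnings logged when loading the models state that `score.weight` is newly initialized, so the outputs may depend on that random initialization, which could be one reason results vary between runs. The sketch below uses hypothetical label counts, not the actual sms_spam distribution, and the helper names are assumptions.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


def constant_prediction_accuracy(y_true, constant):
    """Accuracy obtained by always predicting `constant`."""
    return sum(1 for y in y_true if y == constant) / len(y_true)


# Hypothetical imbalanced labels: 0 = not spam, 1 = spam
y_true = [0] * 85 + [1] * 15

print(majority_baseline_accuracy(y_true))       # 0.85
print(constant_prediction_accuracy(y_true, 1))  # 0.15
```

In the notebook, the same functions could be applied to `comp_proba["True Label"].tolist()` to see how far the reported accuracies are from these two extremes.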

2) tinystories-gpt2-3M model (second model)

- Compared with the first results, on the first dataset I get equivalent or slightly lower accuracies.

- Compared with the first results, on the second dataset I get lower accuracies (about 16% and 26%, against about 23% and 55% with the first model).
For these first two datasets, the logged control sentences read `no` and `yes`, which suggests that these runs used the sms_spam candidate labels rather than the dataset's own, so these figures should be read with caution.

- Compared with the first results, on the third dataset I get about the same results.

The two models tested may not be different enough to yield better results. We can nonetheless conclude that the TinyStories model is not the best suited to zero-shot classification.
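
To judge whether two accuracies are really different, or whether an accuracy differs from chance, it can help to attach an uncertainty to each figure. The sketch below, which is an addition and not part of the original analysis, computes a normal-approximation 95% interval for the accuracy reported above on ag_news with the 'Probability' scoring function (0.2617 on 3000 items). The function name is hypothetical.

```python
import math


def accuracy_confidence_interval(accuracy, n_items, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_items."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_items)
    return accuracy - half_width, accuracy + half_width


chance_level = 1 / 4  # 4 labels in ag_news
low, high = accuracy_confidence_interval(0.26166666666666666, 3000)

print(f"95% interval: [{low:.3f}, {high:.3f}]")
print("Chance level inside the interval:", low <= chance_level <= high)
```

Under this approximation the interval contains the chance level of 25%, which is consistent with the observation that the model behaves like a random predictor on this dataset.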

*Comment on the DCPMI scoring function:*

I noticed that a control sentence written for a given label is almost never assigned that label, which is not expected. For the DCPMI scoring function to work at its full potential, the most probable label for a given control sentence should be the label it contains. I tried to fix this, without success, even with models much more capable than those presented in this notebook. This does not, however, prevent the DCPMI scoring function from working.
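
The observation above can be checked mechanically. As a sketch, the hypothetical helper `diagonal_hits` below tells, for each control sentence, whether the label it was written for is the most probable one. The values are the conditional probabilities logged earlier for roneneldan/TinyStories-1M on the ag_news control sentences.

```python
def diagonal_hits(conditional_probabilities):
    """For each control sentence i, tell whether label i is the most probable one."""
    return [row.index(max(row)) == i for i, row in enumerate(conditional_probabilities)]


# Rows: control sentences for World, Sports, Business, Science/Technology
# Columns: probabilities of World, Sports, Business, Science/Technology
conditional_probabilities = [
    [0.22768542170524597, 0.2462114691734314, 0.24311888217926025, 0.28298428654670715],
    [0.22698958218097687, 0.24694472551345825, 0.24610139429569244, 0.27996426820755005],
    [0.22893641889095306, 0.24610473215579987, 0.2467629760503769, 0.2781957685947418],
    [0.23483063280582428, 0.2503969073295593, 0.24123473465442657, 0.2735377550125122],
]

hits = diagonal_hits(conditional_probabilities)
print(hits)  # [False, False, False, True]
```

Here the last label gets the highest probability for every control sentence, so only the last control sentence is assigned its own label.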

*Comparison of DCPMI and Probability*

Overall, I noticed that the two scoring functions gave roughly the same results (except on the second dataset, where DCPMI was better, but probably thanks to favorable randomness). However, I might have obtained different results if my DCPMI scoring function had worked better.
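
To go beyond comparing two accuracy figures, one option is to measure how often the two scoring functions predict the same label on the same items. The sketch below is an assumption about how this could be done, with hypothetical helper names and made-up predictions. In the notebook, the "Predicted Label" columns of `comp_proba` and `comp_DCPMI` could be passed instead.

```python
def agreement_rate(pred_a, pred_b):
    """Fraction of items on which two prediction lists agree."""
    if len(pred_a) != len(pred_b):
        raise ValueError("Prediction lists must have the same length")
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions, for illustration only
y_true = [0, 1, 2, 1, 0, 2, 1, 1]
pred_probability = [0, 0, 2, 0, 0, 0, 1, 0]
pred_dcpmi = [0, 1, 2, 1, 1, 0, 1, 1]

print(accuracy(y_true, pred_probability))            # 0.5
print(accuracy(y_true, pred_dcpmi))                  # 0.75
print(agreement_rate(pred_probability, pred_dcpmi))  # 0.5
```

A low agreement rate with similar accuracies would indicate that the two scoring functions make different errors rather than the same ones.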