# Zero-shot Classification with Language Models

The goals of this lab are to:
- Familiarize yourself with Large Language Models
- Try to reproduce results (imperfectly) described in a scientific paper
- Implement rigorous experiments

In this lab, we aim to produce detailed results following the methodology presented in [Small Language Models Are Good Too: An Empirical Study of Zero-Shot Classification](https://aclanthology.org/2024.lrec-main.1299/) (Lepagnol et al., LREC-COLING 2024).
The goal of this paper is to investigate the zero-shot capabilities of relatively small Language Models on classification tasks by **scoring the labels** of each task with various *scoring functions*.

### Instructions

Assuming a label set $\mathcal{C}$ whose elements are included in the vocabulary of the model ($\mathcal{C} \subseteq \mathcal{V}$), and assuming an input context $x$, we:
- Create a **prompt** $x' = f(x)$; the function $f$ is task-dependent and given in Table 11 of the paper.
- Instead of generating an answer, we will use **the probability for the first token to be generated to compute a score** $s(y|x'), \forall y \in \mathcal{C}$; scoring functions are described in Section 3.5 of the paper.
- Use $\arg\max_{y \in \mathcal{C}} s(y|x')$ as the prediction.
- Compute the appropriate metric for the dataset and compare to a simple baseline (*i.e.*, random or majority draw).
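
The steps above can be sketched without any model, assuming the first-token probabilities are already available. The template in `build_prompt`, the helper names, and the toy probabilities are illustrative assumptions; the actual prompts are those of Table 11 and the actual scoring functions those of Section 3.5.

```python
from typing import Dict, List


def build_prompt(text: str, labels: List[str]) -> str:
    # Hypothetical template f(x); replace it with the task-specific one from Table 11.
    return f"Classify the following text as one of: {', '.join(labels)}.\nText: {text}\nLabel:"


def score_labels(first_token_probs: Dict[str, float], labels: List[str]) -> Dict[str, float]:
    # s(y | x') taken here as the raw probability that y is the first generated token.
    # Labels missing from the distribution get a probability of 0.
    return {label: first_token_probs.get(label, 0.0) for label in labels}


def predict(scores: Dict[str, float]) -> str:
    # Argmax over the label set C.
    return max(scores, key=scores.get)


labels = ["ham", "spam"]
prompt = build_prompt("Free entry in a weekly competition!", labels)

# Toy distribution standing in for the model's first-token output given the prompt.
toy_probs = {"the": 0.40, "spam": 0.05, "ham": 0.02}

scores = score_labels(toy_probs, labels)
prediction = predict(scores)
print(prediction)
```

Only the labels are compared with each other, so a label can win even when another vocabulary token (here `"the"`) is far more probable overall.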


#### What to report

For this lab, your job is to implement such experiments for:
- **Two different tasks** presented in the paper,
- **Two different scoring functions** used in the paper,

and report the appropriate metrics.
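
As a point of reference for the reported metrics, accuracy and the two simple baselines can be computed as in the sketch below. The function names and the toy labels are assumptions made for illustration, and accuracy is used here as an example metric; the paper may prescribe a different one for some datasets.

```python
from collections import Counter
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    # Fraction of predictions that match the gold labels.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: List[str]) -> float:
    # Accuracy obtained by always predicting the most frequent class.
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)


def uniform_random_baseline(y_true: List[str]) -> float:
    # Expected accuracy of a uniform random draw over the observed classes.
    return 1.0 / len(set(y_true))


# Toy gold labels and predictions, invented for the example.
y_true = ["ham", "ham", "ham", "spam"]
y_pred = ["ham", "spam", "ham", "spam"]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
print(uniform_random_baseline(y_true))
```

On imbalanced data the majority baseline is the harder one to beat, which is why it is worth reporting next to the random one.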

It would be ideal if you could use one of the smaller models in the paper (see Tables 9 and 10), which are available on Hugging Face (for example, [LaMini-T5](https://huggingface.co/MBZUAI/LaMini-T5-61M)). However, if none of the smaller models can run on your machine or on Colab, you should show that you implemented experiments that run with a very small model like [TinyStories-1M](https://huggingface.co/roneneldan/TinyStories-1M), even if the results will be very poor.

In [1]:
import pandas as pd

df_spam = pd.read_csv("spam.csv", encoding="ISO-8859-1")
df_bbc = pd.read_csv("bbc-text.csv")

In [3]:
# Rename the columns
df_spam = df_spam.rename(columns={"v1": "label", "v2": "text"})
df_bbc = df_bbc.rename(columns={"category": "label"})

# Keep only the useful columns
df_spam = df_spam[["label", "text"]]
df_bbc = df_bbc[["label", "text"]]

# Show a preview
print(df_spam.head())
print(df_bbc.head())

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
           label                                               text
0           tech  tv future in the hands of viewers with home th...
1       business  worldcom boss  left books alone  former worldc...
2          sport  tigers wary of farrell  gamble  leicester say ...
3          sport  yeading face newcastle in fa cup premiership s...
4  entertainment  ocean s twelve raids box office ocean s twelve...


In [5]:
# Define the classes of each dataset
labels_spam = df_spam["label"].unique().tolist()
labels_bbc = df_bbc["label"].unique().tolist()

# Show the classes of the SPAM dataset
print(f"Classes of the SPAM dataset (binary, 2 classes): {labels_spam}")

# Show the classes of the BBC dataset
print(f"Classes of the BBC dataset (5 classes): {labels_bbc}")


Classes of the SPAM dataset (binary, 2 classes): ['ham', 'spam']
Classes of the BBC dataset (5 classes): ['tech', 'business', 'sport', 'entertainment', 'politics']


In [6]:
# Print one example from each dataset

sample_spam = df_spam.sample(1)
text_spam = sample_spam["text"].values[0]
true_label_spam = sample_spam["label"].values[0]

sample_bbc = df_bbc.sample(1)
text_bbc = sample_bbc["text"].values[0]
true_label_bbc = sample_bbc["label"].values[0]

print(f"Text of a sample SMS: {text_spam}")
print(f"True label: {true_label_spam}")
print(f"Text of a sample BBC article: {text_bbc}")
print(f"True label: {true_label_bbc}")

Text of a sample SMS: Dont search love, let love find U. Thats why its called falling in love, bcoz U dont force yourself, U just fall and U know there is smeone to hold U... BSLVYL
True label: ham
Text of a sample BBC article: brown ally rejects budget spree chancellor gordon brown s closest ally has denied suggestions there will be a budget giveaway on 16 march.  ed balls  ex-chief economic adviser to the treasury  said there would be no spending spree before polling day. but mr balls  a prospective labour mp  said he was confident the chancellor would meet his fiscal rules. he was speaking as sir digby jones  cbi director general  warned mr brown not to be tempted to use any extra cash on pre-election bribes.  mr balls  who stepped down from his treasury post to stand as a labour candidate in the election  had suggested that mr brown would meet his golden economic rule -  with a margin to spare . he said he hoped more would be done to build on current tax credit rules.  

In [7]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch.nn.functional as F
from tqdm import tqdm

# Load the model and tokenizer
checkpoint = "MBZUAI/LaMini-T5-61M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Label sets
labels_spam = ["spam", "ham"]
labels_bbc = ["business", "entertainment", "politics", "sport", "tech"]

# Evaluate both scoring functions on a dataset and return their accuracies
def evaluate_model(df, labels):
    correct_prob = 0
    correct_cosine = 0
    total = len(df)

    for _, row in tqdm(df.iterrows(), total=total, desc="Evaluating"):
        text = row["text"]
        true_label = row["label"]

        input_ids = tokenizer(text, return_tensors="pt").input_ids

        # Probability-based scoring
        with torch.no_grad():
            # Decoder prompt obtained by tokenizing the literal string "<pad>"
            decoder_input_ids = tokenizer("<pad>", return_tensors="pt").input_ids
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            logits = outputs.logits[:, -1, :]

            # Apply a temperature
            temperature = 0.7
            scores = F.softmax(logits / temperature, dim=-1)

            # Score of each label; note that convert_tokens_to_ids returns the
            # unknown-token id for a string that is not itself a vocabulary entry
            label_scores = {label: scores[0, tokenizer.convert_tokens_to_ids(label)] for label in labels}

        # Cosine-similarity scoring (mean-pooled encoder states of the text vs. each label)
        with torch.no_grad():
            embeddings = model.encoder(input_ids)[0].mean(dim=1)
            label_embeddings = {label: model.encoder(tokenizer(label, return_tensors="pt").input_ids)[0].mean(dim=1)
                                for label in labels}

            cos_scores = {label: F.cosine_similarity(embeddings, label_embeddings[label]).item()
                          for label in labels}

        # Final predictions
        pred_prob = max(label_scores, key=label_scores.get)
        pred_cosine = max(cos_scores, key=cos_scores.get)

        # Update the counts of correct predictions
        if pred_prob == true_label:
            correct_prob += 1
        if pred_cosine == true_label:
            correct_cosine += 1

    # Final accuracies
    accuracy_prob = correct_prob / total
    accuracy_cosine = correct_cosine / total

    return accuracy_prob, accuracy_cosine

# Evaluation on the SPAM dataset
acc_prob_spam, acc_cosine_spam = evaluate_model(df_spam, labels_spam)
print(f"Accuracy SPAM (probability): {acc_prob_spam:.2%}")
print(f"Accuracy SPAM (cosine similarity): {acc_cosine_spam:.2%}")

# Evaluation on the BBC dataset
acc_prob_bbc, acc_cosine_bbc = evaluate_model(df_bbc, labels_bbc)
print(f"Accuracy BBC (probability): {acc_prob_bbc:.2%}")
print(f"Accuracy BBC (cosine similarity): {acc_cosine_bbc:.2%}")


Evaluating: 100%|██████████| 5572/5572 [15:07<00:00,  6.14it/s]


Accuracy SPAM (probability): 13.41%
Accuracy SPAM (cosine similarity): 82.82%


Evaluating:   0%|          | 0/2225 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (979 > 512). Running this sequence through the model will result in indexing errors
Evaluating: 100%|██████████| 2225/2225 [1:04:01<00:00,  1.73s/it]

Accuracy BBC (probability): 19.73%
Accuracy BBC (cosine similarity): 81.57%
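
One thing worth checking in the probability scoring of the cell above is how each label string is mapped to a vocabulary id. For tokenizers that define an unknown token, `convert_tokens_to_ids` returns the unknown-token id for any string that is not itself a vocabulary entry, so several labels can end up sharing one id and therefore one score. The sketch below reproduces that situation with a toy vocabulary (an assumption: whether it happens with the LaMini-T5 vocabulary has not been verified here) and shows that `max` then returns the first label in the list.

```python
from typing import Dict, List

UNK_ID = 2  # toy unknown-token id (assumption)

# Toy SentencePiece-style vocabulary: word-initial pieces carry the "▁" marker.
toy_vocab: Dict[str, int] = {"▁spam": 10, "▁ham": 11}


def toy_convert_tokens_to_ids(token: str) -> int:
    # Mimics a lookup that falls back to the unknown-token id.
    return toy_vocab.get(token, UNK_ID)


def label_scores(probs: List[float], pieces: Dict[str, str]) -> Dict[str, float]:
    # pieces maps each label to the vocabulary string used to look it up.
    return {label: probs[toy_convert_tokens_to_ids(piece)] for label, piece in pieces.items()}


# Toy first-token distribution in which "ham" should win.
probs = [0.0] * 12
probs[UNK_ID] = 0.001
probs[10] = 0.1   # "▁spam"
probs[11] = 0.3   # "▁ham"

# Looking up the bare label strings: both fall back to UNK_ID and tie.
scores = label_scores(probs, {"spam": "spam", "ham": "ham"})
tied_prediction = max(scores, key=scores.get)

# Looking up the word-initial pieces separates the labels in this toy vocabulary.
fixed = label_scores(probs, {"spam": "▁spam", "ham": "▁ham"})
fixed_prediction = max(fixed, key=fixed.get)

print(scores, tied_prediction)
print(fixed, fixed_prediction)
```

With a real tokenizer, a comparable check is to print the id obtained for each label and confirm that the ids are distinct and different from `tokenizer.unk_token_id`.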





Evaluating the LaMini-T5-61M model on two text classification datasets gave the following results:

- SPAM dataset (binary classification):

On 5572 examples, accuracy is 13.41% with probability-based scoring and 82.82% with cosine-similarity scoring.

- BBC dataset (5-class classification):

On 2225 examples, accuracy is 19.73% with probability-based scoring and 81.57% with cosine-similarity scoring.

- Analysis of the results:

Probability-based accuracy:

The low accuracies show that the model struggles to assign high probabilities to the correct classes: 13.41% on SPAM is below the 50% expected from a uniform random draw over two classes, and 19.73% on BBC is about the 20% expected over five classes. Looking at the probabilities themselves, we see that they are quite small and differ only at the 5th or 6th decimal place, which may explain why the model sometimes fails to pick the correct class. This is why we added a temperature, but it did not change the results significantly. The low accuracies may also indicate that the model is not sufficiently calibrated for these specific tasks, or that probability-based scoring is not optimal in this setting.
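
Two points in the paragraph above can be checked on toy numbers (the logits and label ids below are made up for illustration). Dividing the logits by a positive temperature preserves their order, so the argmax over the labels cannot change. And because the softmax runs over the whole vocabulary, label probabilities can be tiny in absolute terms; renormalising them over the label set makes them easier to compare.

```python
import math
from typing import Dict, List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Numerically stable softmax with temperature scaling.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(logits: List[float], label_ids: Dict[str, int], temperature: float) -> Dict[str, float]:
    # Probability of each label token under the full-vocabulary softmax.
    probs = softmax(logits, temperature)
    return {label: probs[i] for label, i in label_ids.items()}


def renormalise(scores: Dict[str, float]) -> Dict[str, float]:
    # Rescale the label scores so that they sum to 1 over the label set.
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Toy vocabulary of 5 tokens; token 0 dominates, the labels sit at ids 1 and 2.
toy_logits = [9.0, 1.2, 1.0, 0.5, -3.0]
toy_label_ids = {"spam": 1, "ham": 2}

for t in (0.5, 0.7, 1.0, 2.0):
    s = label_probs(toy_logits, toy_label_ids, t)
    print(t, max(s, key=s.get), renormalise(s))
```

The temperature changes how far apart the renormalised label probabilities are, but not which label comes first.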

Cosine-similarity accuracy:

The markedly higher accuracies indicate that the vector representations produced by the model are more consistent and better aligned with the correct classes. This suggests that the model's representation of the texts is relevant.
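
The cosine scoring used in this notebook compares a mean-pooled encoder representation of the text with that of each label. A minimal sketch of that computation on toy vectors follows; the numbers are invented, and in the notebook the vectors come from `model.encoder`.

```python
import math
from typing import Dict, List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    # Average the token vectors dimension by dimension.
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u: List[float], v: List[float]) -> float:
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional "encoder states": three tokens for the text, one per label.
text_tokens = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
label_tokens: Dict[str, List[List[float]]] = {
    "sport": [[1.0, 0.1]],
    "tech": [[0.0, 1.0]],
}

text_vec = mean_pool(text_tokens)
cos_scores = {label: cosine(text_vec, mean_pool(vectors)) for label, vectors in label_tokens.items()}
prediction = max(cos_scores, key=cos_scores.get)
print(cos_scores, prediction)
```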