# Task #3: Incident classification with *Transformers* models

Still using the same task and the same text files, use the HuggingFace library to accomplish this task. Specifically, you are asked to use the bert-base-uncased model and one other model of your choice.
The instructions for this task are:
-	Notebook name: transformer.ipynb
-	Tokenization: the one provided by the tokenizers that accompany the transformer models.
-	Word embeddings: those of the transformer model.
-	Normalization: lowercase letters for BERT. No constraint for the 2nd model.
-	Choice of the 2nd transformer: a pretrained encoder model for English. The model must not be another version of BERT and must be significantly different. Use a 2nd file for this model if necessary (a copy of this one).
-	Analysis: compare the results obtained with the 2 transformer models. Also present an overall comparison of the results obtained with all the models used in this assignment and those of the previous assignment (TP #1).


You may add to the *notebook* any cells you need for your code, your explanations, or the presentation of your results. You may also add subsections (e.g. subsections 1.1, 1.2, etc.) if that improves readability.

Notes:
- Avoid pieces of code that are too long or too complex. For example, 4-5 nested loops or conditions are hard to understand. If that happens, define helper functions to refactor and simplify your code.
- Briefly explain your approach.
- Explain the choices you make regarding programming and models (if non-trivial).
- Analyze your results. State what you observe, whether it is good or not, whether it is surprising, etc.
- A quantitative and qualitative error analysis is valuable and helps to better understand a model's behavior.
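
As an illustration of the quantitative side of the error analysis mentioned in the notes, the following minimal sketch counts per-class accuracy and the most frequent confusions from two lists of labels. The function names and the toy labels are assumptions made for this example; they do not come from the notebook.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))

def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

def most_confused_pairs(true_labels, predicted_labels, top_n=3):
    """List the most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(top_n)

# Toy labels for illustration only; these are not results from this notebook.
y_true = [5, 5, 2, 2, 2, 7]
y_pred = [5, 2, 2, 2, 7, 7]
print(per_class_accuracy(y_true, y_pred))
print(most_confused_pairs(y_true, y_pred))
```

The misclassified texts behind the most confused pairs are a natural starting point for the qualitative part of the analysis.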

## 1. Creating the dataset

In [1]:
import spacy
import json

spacy_model = spacy.load("en_core_web_sm")
embedding_size = spacy_model.meta['vectors']['width']

# Paths to the data files
train_data_path = './data/incidents_train.json'
dev_data_path = './data/incidents_dev.json'
test_data_path = './data/incidents_test.json'

def load_incident_dataset(filename):
    with open(filename, 'r') as fp:
        incident_list = json.load(fp)

    return incident_list


# Load each data partition as a list of records
train_list  = load_incident_dataset(train_data_path)
dev_list  = load_incident_dataset(dev_data_path)
test_list = load_incident_dataset(test_data_path)

# Display basic information about each partition
display(f"Train data: text_size {len(train_list)}")
display(f"Dev data: text_size {len(dev_list)}")
display(f"Test data: text_size {len(test_list)}")



# Inspect the first record of the training set
train_list[0]


'Train data: text_size 2475'

'Dev data: text_size 531'

'Test data: text_size 531'

{'text': ' At approximately 8:50 a.m. on October 29  1997  Employee #1 was painting a  single story house at 2657 7th Ave  Sacramento  CA. He was caulking around the  peak of the roof line on the west side of the house  20 ft above the ground.  He was working off of a 24 ft aluminum extension ladder so that his feet were  approximately 12 to 13 feet above the ground. Employee #1 fell and suffered a  concussion and two dislocated discs in his lower back and was hospitalized.  The ladder was not secured to prevent movement.                                 ',
 'label': '5'}
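
Since each record stores its label as a string, it can be useful to check how the classes are distributed before training. The helper below is a minimal sketch; `label_distribution` and the toy records are assumptions for this example, and in the notebook it would presumably be called on `train_list`.

```python
from collections import Counter

def label_distribution(incident_list):
    """Return {label: count} with labels converted to int, sorted by label."""
    counts = Counter(int(item["label"]) for item in incident_list)
    return dict(sorted(counts.items()))

# Toy records in the same {'text', 'label'} shape as the incident files.
sample = [
    {"text": "Employee fell from a ladder.", "label": "5"},
    {"text": "Employee was struck by a forklift.", "label": "2"},
    {"text": "Employee fell from a roof.", "label": "5"},
]
print(label_distribution(sample))  # {2: 1, 5: 2}
```

A strongly unbalanced distribution would suggest reading the accuracy figures with caution.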

## 2. Creating the model(s)

As the second model, we chose ELECTRA (`google/electra-small-discriminator`) because it gives fairly good results while being lightweight and quite different from BERT.

In [6]:
from transformers import TrainingArguments, Trainer, BertTokenizer, AutoTokenizer, DataCollatorWithPadding, BertForSequenceClassification

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
electra_tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")


In [4]:
from datasets import Dataset


def preprocess_function(data, tokenizer):
    # Tokenize a single incident. Padding is left to the data collator:
    # padding one example on its own has no effect.
    tokenized = tokenizer(data["text"], truncation=True)
    tokenized["labels"] = int(data["label"])
    return tokenized

def build_dataset(incident_list, tokenizer):
    # Tokenize every incident and keep only the columns the model needs.
    tokenized_list = [preprocess_function(item, tokenizer) for item in incident_list]
    return Dataset.from_dict({
        "input_ids": [item["input_ids"] for item in tokenized_list],
        "attention_mask": [item["attention_mask"] for item in tokenized_list],
        "labels": [item["labels"] for item in tokenized_list],
    })

def prepare_dataset(tokenizer):
    # Print the first tokenized training example as a sanity check.
    print(preprocess_function(train_list[0], tokenizer))

    train_dataset = build_dataset(train_list, tokenizer)
    dev_dataset = build_dataset(dev_list, tokenizer)
    test_dataset = build_dataset(test_list, tokenizer)

    # Pad each batch to the length of its longest sequence.
    # max_length expects an integer (or None), not the string "max_length".
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

    return train_dataset, dev_dataset, test_dataset, data_collator
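
To make the role of `DataCollatorWithPadding` more concrete, here is a minimal pure-Python sketch of dynamic padding: every sequence in a batch is extended to the length of the longest one, and the attention mask marks the added positions with 0. The `pad_batch` function is an illustration written for this example, not the library's implementation (the real collator returns tensors), and the pad id of 0 is assumed to match the `[PAD]` token of the tokenizer in use.

```python
def pad_batch(features, pad_token_id=0):
    """Pad 'input_ids' and 'attention_mask' to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch

# Two toy examples of different lengths.
features = [
    {"input_ids": [101, 2012, 102], "attention_mask": [1, 1, 1], "labels": 5},
    {"input_ids": [101, 2012, 3155, 1022, 102], "attention_mask": [1, 1, 1, 1, 1], "labels": 2},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 2012, 102, 0, 0], [101, 2012, 3155, 1022, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum keeps short batches small, which matters when memory is tight.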




## 3. Training the model(s)

In [16]:
import numpy as np
from datasets import load_metric
import wandb



metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
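
The `compute_metrics` function above reports accuracy only. With several classes, accuracy can hide poor performance on rare ones, so a macro-averaged F1 score is a common complement. The sketch below is an addition that is not part of the original notebook; the function name `accuracy_and_macro_f1` and the toy logits are assumptions for illustration.

```python
import numpy as np

def accuracy_and_macro_f1(logits, labels, num_labels=9):
    """Compute accuracy and macro-averaged F1 from raw logits."""
    predictions = np.argmax(logits, axis=1)
    labels = np.asarray(labels)
    accuracy = float(np.mean(predictions == labels))

    f1_scores = []
    for c in range(num_labels):
        tp = np.sum((predictions == c) & (labels == c))
        fp = np.sum((predictions == c) & (labels != c))
        fn = np.sum((predictions != c) & (labels == c))
        denom = 2 * tp + fp + fn
        # A class absent from both predictions and labels counts as 0 here;
        # other libraries may handle this case differently.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}

# Toy logits for 4 examples and 3 classes; not results from this notebook.
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.2, 0.4],
    [0.1, 0.2, 3.0],
])
toy_labels = [0, 1, 2, 2]
print(accuracy_and_macro_f1(toy_logits, toy_labels, num_labels=3))
```

A function with this shape could be adapted to the `(predictions, labels)` pair that the `Trainer` passes to `compute_metrics`.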


### Creating and training BERT

In [8]:
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=9)

bert_train_dataset, bert_dev_dataset, bert_test_dataset, bert_datacollator = prepare_dataset(bert_tokenizer)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': [101, 2012, 3155, 1022, 1024, 2753, 1037, 1012, 1049, 1012, 2006, 2255, 2756, 2722, 7904, 1001, 1015, 2001, 4169, 1037, 2309, 2466, 2160, 2012, 20549, 2581, 5504, 13642, 11932, 6187, 1012, 2002, 2001, 6187, 5313, 6834, 2105, 1996, 4672, 1997, 1996, 4412, 2240, 2006, 1996, 2225, 2217, 1997, 1996, 2160, 2322, 3027, 2682, 1996, 2598, 1012, 2002, 2001, 2551, 2125, 1997, 1037, 2484, 3027, 13061, 5331, 10535, 2061, 2008, 2010, 2519, 2020, 3155, 2260, 2000, 2410, 2519, 2682, 1996, 2598, 1012, 7904, 1001, 1015, 3062, 1998, 4265, 1037, 23159, 1998, 2048, 4487, 14540, 24755, 3064, 15303, 1999, 2010, 2896, 2067, 1998, 2001, 24735, 1012, 1996, 10535, 2001, 2025, 7119, 2000, 4652, 2929, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [11]:
training_args = TrainingArguments(
    output_dir="model_task3/bert",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False
)

bert_trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=bert_train_dataset,
    eval_dataset=bert_dev_dataset,
    tokenizer=bert_tokenizer,
    data_collator=bert_datacollator,
    compute_metrics=compute_metrics,
)

bert_trainer.train()

bert_trainer.save_model()

  0%|          | 0/3875 [06:06<?, ?it/s]


RuntimeError: MPS backend out of memory (MPS allocated: 13.20 GB, other allocations: 4.93 GB, max allowed: 18.13 GB). Tried to allocate 45.41 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
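
One common response to an out-of-memory error of this kind is to lower `per_device_train_batch_size` and compensate with the `gradient_accumulation_steps` parameter of `TrainingArguments`, so that each optimizer update still sees the same number of examples. The arithmetic is sketched below with hypothetical helper names. As a side observation, 2475 examples at batch size 16 give 155 batches per epoch, and the progress bar total of 3875 equals 155 × 25, which would be consistent with an earlier run configured for 25 epochs rather than the single epoch shown in the cell; this is an inference, not something the notebook states.

```python
import math

def batches_per_epoch(num_examples, batch_size):
    """Number of mini-batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

def effective_batch_size(per_device_batch_size, grad_accum_steps=1):
    """Examples contributing to one optimizer update on a single device."""
    return per_device_batch_size * grad_accum_steps

train_size = 2475  # size of the training split reported earlier in the notebook
print(batches_per_epoch(train_size, 16))            # 155
print(batches_per_epoch(train_size, 16) * 25)       # 3875
print(effective_batch_size(4, grad_accum_steps=4))  # 16
```

Whether a smaller per-device batch is enough to fit in memory on this machine would have to be checked by rerunning the cell.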

In [None]:
bert_trainer.evaluate()

### Creating and training ELECTRA

In [12]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments


# Define your model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("google/electra-small-discriminator", num_labels=9)

electra_train_dataset, electra_dev_dataset, electra_test_dataset, electra_datacollator = prepare_dataset(electra_tokenizer)


Downloading pytorch_model.bin: 100%|██████████| 54.2M/54.2M [00:08<00:00, 6.73MB/s]
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'input_ids': [101, 2012, 3155, 1022, 1024, 2753, 1037, 1012, 1049, 1012, 2006, 2255, 2756, 2722, 7904, 1001, 1015, 2001, 4169, 1037, 2309, 2466, 2160, 2012, 20549, 2581, 5504, 13642, 11932, 6187, 1012, 2002, 2001, 6187, 5313, 6834, 2105, 1996, 4672, 1997, 1996, 4412, 2240, 2006, 1996, 2225, 2217, 1997, 1996, 2160, 2322, 3027, 2682, 1996, 2598, 1012, 2002, 2001, 2551, 2125, 1997, 1037, 2484, 3027, 13061, 5331, 10535, 2061, 2008, 2010, 2519, 2020, 3155, 2260, 2000, 2410, 2519, 2682, 1996, 2598, 1012, 7904, 1001, 1015, 3062, 1998, 4265, 1037, 23159, 1998, 2048, 4487, 14540, 24755, 3064, 15303, 1999, 2010, 2896, 2067, 1998, 2001, 24735, 1012, 1996, 10535, 2001, 2025, 7119, 2000, 4652, 2929, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [14]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="model_task3/electra",
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# Initialize the Trainer
electra_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=electra_train_dataset,
    eval_dataset=electra_dev_dataset,
    tokenizer=electra_tokenizer,
    data_collator=electra_datacollator,
    compute_metrics=compute_metrics,
)

# Start training (assuming you have train_dataset and dev_dataset defined)
electra_trainer.train()

electra_trainer.save_model()




RuntimeError: MPS backend out of memory (MPS allocated: 8.43 GB, other allocations: 9.68 GB, max allowed: 18.13 GB). Tried to allocate 64.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
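
Besides batch size, the length of the longest sequence in a batch drives memory use, because every other sequence is padded up to it. The sketch below estimates the padded token count per batch and shows the effect of capping the length, which the tokenizers support through their `max_length` argument together with `truncation=True`. The function name and the token lengths are hypothetical, and a lower token count does not by itself guarantee that training fits in memory.

```python
def padded_tokens_per_batch(lengths, batch_size, max_length=None):
    """Return the padded token count (rows x longest row) of each batch."""
    if max_length is not None:
        # Truncation caps every sequence at max_length tokens.
        lengths = [min(n, max_length) for n in lengths]
    counts = []
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        counts.append(len(chunk) * max(chunk))
    return counts

# Hypothetical token lengths; one long report dominates its batch.
lengths = [114, 90, 512, 130, 75, 88]
print(padded_tokens_per_batch(lengths, batch_size=3))                  # [1536, 390]
print(padded_tokens_per_batch(lengths, batch_size=3, max_length=256))  # [768, 390]
```

Truncating long reports may discard information, so any cap should be chosen by looking at the actual length distribution of the texts.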

## 4. Evaluation and analysis of results

### BERT evaluation

In [17]:
#wandb.init(project='evaluation-bert')

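eval_metrics = bert_trainer.evaluate()

# Evaluate on the training set as well. Returned keys carry the metric prefix
# ("eval_loss" by default), so a distinct prefix keeps the two sets apart.
train_metrics = bert_trainer.evaluate(bert_train_dataset, metric_key_prefix="train")

#wandb.log({"train_loss": train_metrics["train_loss"], "eval_loss": eval_metrics["eval_loss"]})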

#wandb.finish()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ct

Error: An unexpected error occurred

### ELECTRA evaluation

In [None]:
wandb.init(project='evaluation-electra')

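eval_metrics = electra_trainer.evaluate()

# Evaluate on the training set as well. Returned keys carry the metric prefix
# ("eval_loss" by default), so a distinct prefix keeps the two sets apart.
train_metrics = electra_trainer.evaluate(electra_train_dataset, metric_key_prefix="train")

wandb.log({"train_loss": train_metrics["train_loss"], "eval_loss": eval_metrics["eval_loss"]})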

wandb.finish()
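
For the comparison between the two models requested in the instructions, the metric dictionaries returned by the evaluation calls can be laid out side by side. The helper below is a minimal sketch; `format_comparison` is a hypothetical name, and the values are placeholders rather than measured results, since both training runs shown in this notebook stopped with an out-of-memory error.

```python
def format_comparison(results):
    """Render {model: {metric: value}} as a plain-text table."""
    metrics = sorted({m for scores in results.values() for m in scores})
    header = ["model"] + metrics
    rows = [header]
    for model_name, scores in results.items():
        rows.append([model_name] + [
            f"{scores[m]:.4f}" if m in scores else "n/a" for m in metrics
        ])
    # Pad each column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )

# Placeholder values only; replace them with the dictionaries returned by
# bert_trainer.evaluate() and electra_trainer.evaluate().
results = {
    "bert-base-uncased": {"eval_loss": 0.0, "eval_accuracy": 0.0},
    "electra-small": {"eval_loss": 0.0},
}
print(format_comparison(results))
```

Metrics missing for a model are shown as `n/a`, which keeps the table usable while some runs are still incomplete.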