# Task #3: Incident classification with *Transformer* models

Using the same task and the same text files, use the HuggingFace library to carry out this task. More specifically, you are asked to use the bert-base-uncased model and one other model of your choice.
The instructions for this task are:
- Notebook name: transformer.ipynb
- Tokenization: The one provided by the tokenizers that ship with the transformer models.
- Word embeddings: Those of the transformer model.
- Normalization: Lowercase letters for BERT. No constraint for the 2nd model.
- Choice of the 2nd transformer: A pretrained encoder model for English. The model must not be another version of BERT and must be significantly different. Use a 2nd file for this model if necessary (a copy of this one).
- Analysis: Compare the results obtained with the 2 transformer models. Also present an overall comparison of the results obtained with all the models used in this assignment and those of the previous assignment (TP #1).
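The comparison requested in the Analysis item could be laid out as one row per model. The sketch below is only an illustration: `format_results` is a hypothetical helper, not part of the assignment, and the single value shown is the ELECTRA validation accuracy reported by `trainer.evaluate()` in this notebook. Scores for the other models would be added once they are available.

```python
def format_results(results):
    """Return a plain-text table with one line per model, sorted by accuracy (best first).

    `results` maps a model name to its accuracy (a float between 0 and 1).
    """
    width = max(len(name) for name in results)
    lines = [f"{'model'.ljust(width)}  accuracy"]
    for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{name.ljust(width)}  {acc:.4f}")
    return "\n".join(lines)


# Only the ELECTRA score comes from this notebook; add the other models as they are evaluated.
results = {"electra-small-discriminator": 0.6403}
print(format_results(results))
```

Keeping the scores in a single dictionary makes it easy to append the TP #1 models later without changing the formatting code.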


You may add to the *notebook* any cells you need for your code, your explanations, or the presentation of your results. You may also add subsections (e.g. subsections 1.1, 1.2, etc.) if that improves readability.

Notes:
- Avoid code fragments that are too long or too complex. For example, 4-5 nested loops or conditions are hard to follow. If that happens, define helper functions to refactor and simplify your code.
- Briefly explain your approach.
- Explain the choices you make regarding programming and models (if non-trivial).
- Analyze your results. State what you observe, whether it is good or not, whether it is surprising, etc.
- A quantitative and qualitative error analysis is valuable and helps to better understand a model's behavior.
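For the quantitative side of the error analysis, one possible starting point is to count which (true label, predicted label) pairs occur most often among the errors. The sketch below uses only the standard library; `confusion_counts` and `most_confused` are hypothetical helper names, and the labels are toy values, not results from this notebook.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs; off-diagonal pairs are errors."""
    return Counter(zip(y_true, y_pred))


def most_confused(y_true, y_pred, top_k=3):
    """Return the `top_k` most frequent (true, predicted) error pairs with their counts."""
    errors = Counter({pair: n for pair, n in confusion_counts(y_true, y_pred).items()
                      if pair[0] != pair[1]})
    return errors.most_common(top_k)


# Toy labels for illustration only (not results from this notebook).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 0, 0, 2]
print(most_confused(y_true, y_pred))  # [((2, 0), 2), ((0, 1), 1)]
```

The most frequent error pairs indicate which classes are worth reading examples from during the qualitative analysis.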

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
!rm -r /content/model3

## 1. Creating the dataset

In [2]:
pip install datasets



In [3]:
import spacy
import datasets
spacy_model = spacy.load("en_core_web_sm")
embedding_size = spacy_model.meta['vectors']['width']


import pandas as pd
import json
import numpy as np
# Make sure the spaCy language model is downloaded, e.g.
# python -m spacy download en_core_web_sm

# Paths to the data files
# NOTE: each variable below points to a differently named file
# (train -> incidents_dev.json, dev -> incidents_test.json, test -> incidents_train.json);
# the sizes printed below reflect this mapping.
train_data_path = '/content/drive/MyDrive/TP2_NLP/incidents_dev.json'
dev_data_path = '/content/drive/MyDrive/TP2_NLP/incidents_test.json'
test_data_path = '/content/drive/MyDrive/TP2_NLP/incidents_train.json'

def load_incident_dataset(filename):
    """Load a JSON file and return its list of incident records (dicts with "text" and "label")."""
    with open(filename, 'r') as fp:
        incident_list = json.load(fp)
    return incident_list


# Load the list of records for each data split
train_list  = load_incident_dataset(train_data_path)
dev_list  = load_incident_dataset(dev_data_path)
test_list = load_incident_dataset(test_data_path)

# Display the number of records in each split
display(f"Train data: text_size {len(train_list)}")
display(f"Dev data: text_size {len(dev_list)}")
display(f"Test data: text_size {len(test_list)}")



# Inspect the first record of the training set
train_list[0]


'Train data: text_size 531'

'Dev data: text_size 531'

'Test data: text_size 2475'

{'text': ' At approximately 11:30 a.m. on December 8  2004  Employee #1  a laborer  who  had only been working for a construction company for one month  was laying out  sheets of plywood sheathing on the 2 by 6 feet nominal roof joists of a flat  roof. The employer specializes in nonresidential wood frame construction. In  order to complete the work quickly Employee #1 and a coworker  under the  direction of a supervisor  began installing only whole pieces of plywood  measuring approximately 4 by 8 feet. Because the plywood did not fit the roof  area precisely  holes and gaps were left in various areas. In one particular  spot a gap  measuring approximately two inches  was left at the end of one of  the pieces of plywood which was tacked down with a few nails. As Employee #1  and the coworker were beginning to install smaller pieces of plywood to fill  the gaps  Employee #1 stepped on the portion of the plywood with the two-inch  gap. The gap came loose from its securing nails  causing
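Before fine-tuning, it can be useful to check how many distinct labels the data contains and how balanced they are, since `num_labels` has to match and a majority-class baseline gives a reference point for the accuracies reported later. The sketch below assumes each record carries `"text"` and `"label"` fields, as the preprocessing code in this notebook does; `label_distribution` is a hypothetical helper and the records are toy values.

```python
from collections import Counter


def label_distribution(incident_list):
    """Return {label: proportion} for a list of records that each carry a "label" field."""
    counts = Counter(int(item["label"]) for item in incident_list)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Toy records for illustration only; in the notebook this would be called on train_list.
toy_records = [{"text": "a", "label": "0"}, {"text": "b", "label": "1"},
               {"text": "c", "label": "1"}, {"text": "d", "label": "1"}]
print(label_distribution(toy_records))  # {0: 0.25, 1: 0.75}
```

The largest proportion in the returned dictionary is the accuracy a classifier would reach by always predicting the most frequent class.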

## 2. Model creation

In [4]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, BertTokenizer, BertModel, AutoTokenizer, AutoModel, DataCollatorWithPadding, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
electra_tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")


In [5]:
from datasets import Dataset
import pandas as pd


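def preprocess_function(data):
    # Tokenize one record; padding is left to the data collator (dynamic padding per batch).
    # NOTE: the ELECTRA tokenizer is used for every record, and the resulting ids are also fed
    # to BERT below, which relies on both tokenizers sharing the same vocabulary.
    tokenized = electra_tokenizer(data["text"], truncation=True)
    tokenized["labels"] = int(data["label"])
    return tokenized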

tokenized_train_list = [preprocess_function(item) for item in train_list]
tokenized_validate_list = [preprocess_function(item) for item in dev_list]
tokenized_test_list = [preprocess_function(item) for item in test_list]

print(tokenized_train_list[0])


def to_dataset(tokenized_list):
    # Build a Dataset from a list of tokenized records (one dict per example)
    return Dataset.from_dict({
        "input_ids": [item["input_ids"] for item in tokenized_list],
        "attention_mask": [item["attention_mask"] for item in tokenized_list],
        "labels": [item["labels"] for item in tokenized_list],
    })

train_dataset = to_dataset(tokenized_train_list)
dev_dataset = to_dataset(tokenized_validate_list)
test_dataset = to_dataset(tokenized_test_list)


# Pad each batch to its longest sequence (max_length must be an int or None, so it is omitted)
bert_data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)




{'input_ids': [101, 2012, 3155, 2340, 1024, 2382, 1037, 1012, 1049, 1012, 2006, 2285, 1022, 2432, 7904, 1001, 1015, 1037, 4450, 2121, 2040, 2018, 2069, 2042, 2551, 2005, 1037, 2810, 2194, 2005, 2028, 3204, 2001, 10201, 2041, 8697, 1997, 20228, 26985, 21867, 2075, 2006, 1996, 1016, 2011, 1020, 2519, 15087, 4412, 8183, 5130, 1997, 1037, 4257, 4412, 1012, 1996, 11194, 16997, 1999, 2512, 6072, 5178, 19909, 3536, 4853, 2810, 1012, 1999, 2344, 2000, 3143, 1996, 2147, 2855, 7904, 1001, 1015, 1998, 1037, 11190, 2953, 5484, 2104, 1996, 3257, 1997, 1037, 12366, 2211, 23658, 2069, 2878, 4109, 1997, 20228, 26985, 9854, 3155, 1018, 2011, 1022, 2519, 1012, 2138, 1996, 20228, 26985, 2106, 2025, 4906, 1996, 4412, 2181, 10785, 8198, 1998, 16680, 2020, 2187, 1999, 2536, 2752, 1012, 1999, 2028, 3327, 3962, 1037, 6578, 9854, 3155, 2048, 5282, 2001, 2187, 2012, 1996, 2203, 1997, 2028, 1997, 1996, 4109, 1997, 20228, 26985, 2029, 2001, 26997, 2098, 2091, 2007, 1037, 2261, 10063, 1012, 2004, 7904, 1001, 1015,
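`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time, instead of padding the whole dataset up front. The pure-Python sketch below only illustrates that idea: `pad_batch` is a hypothetical stand-in, not the library's implementation, and `pad_token_id=0` is an assumption about the tokenizer's padding id.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every example in `batch` to the length of the longest one.

    Each example is a dict with "input_ids" and "attention_mask" lists of equal length.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = []
    for ex in batch:
        n_pad = max_len - len(ex["input_ids"])
        padded.append({
            # Padding positions get the pad id and an attention mask of 0
            "input_ids": ex["input_ids"] + [pad_token_id] * n_pad,
            "attention_mask": ex["attention_mask"] + [0] * n_pad,
        })
    return padded


# Two toy examples of different lengths (token ids are illustrative).
batch = [
    {"input_ids": [101, 2012, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2012, 3155, 2340, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
for example in pad_batch(batch):
    print(example)
```

Padding per batch keeps short batches short, which usually saves computation compared with padding every text to the maximum model length.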

## 3. Model training

In [6]:
pip install evaluate



In [7]:
import numpy as np
import evaluate

# load_metric from datasets is deprecated; the evaluate package installed above replaces it
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
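Accuracy alone can hide poor performance on rare classes, so a macro-averaged F1 may be a useful complement. The sketch below is a plain-Python illustration, not the metric used in this notebook: `macro_f1` is a hypothetical helper, it averages over the classes present in `y_true` only, and it expects integer labels such as those obtained after the `np.argmax` step in `compute_metrics`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in `y_true`."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (not results from this notebook).
print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.7333
```

Because every class contributes equally to the average, a model that ignores a small class is penalized more than it would be by accuracy.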




In [None]:
# import torch.nn as nn
# import torch

# class CustomTrainer(Trainer):
#     def compute_loss(self, model, inputs, return_outputs=False):
#         labels = inputs.pop("labels")

#         # Forward pass
#         outputs = model(**inputs)
#         logits = outputs.logits

#         # Get the model's classification layer weights
#         classification_layer_weights = model.classifier.weight

#         # Assuming you want to use the weights directly, you can pass them to the loss function
#         loss_fct = nn.CrossEntropyLoss(weight=classification_layer_weights)

#         # Compute the loss
#         loss = loss_fct(logits, labels)

#         return (loss, outputs) if return_outputs else loss

In [None]:
#!pip install transformers[torch]
#!pip install accelerate -U

In [16]:
import transformers

In [None]:
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=9)

training_args = TrainingArguments(
    output_dir="model3/bert",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=25,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False
)

trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    data_collator=bert_data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


In [None]:
trainer.evaluate()

ELECTRA model

In [None]:
# import torch
# import torch.nn as nn
# from transformers import AutoTokenizer, AutoModelForSequenceClassification, AdamW
# from transformers import Trainer, TrainingArguments

# # Load the tokenizer and model
# tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
# model = AutoModelForSequenceClassification.from_pretrained("google/electra-small-discriminator")

# # Define a classification head on top of the model
# classification_head = nn.Sequential(
#     nn.Linear(model.config.hidden_size, 128),
#     nn.ReLU(),
#     nn.Linear(128, 9)  # num_classes is the number of classes in your classification task
# )

# # Combine the Electra model and classification head
# model.classifier = classification_head

# # Define training arguments
# training_args = TrainingArguments(
#     output_dir="./results",
#     per_device_train_batch_size=32,
#     per_device_eval_batch_size=64,
#     num_train_epochs=3,
#     evaluation_strategy="steps",
#     eval_steps=500,
#     save_total_limit=1,
#     learning_rate=2e-5,
#     save_steps=1000,
# )

# # Initialize Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     data_collator=None,  # You can specify a data collator if needed
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
# )

# # Fine-tune the model
# trainer.train()

# # Evaluate the model
# results = trainer.evaluate()
# print(results)


In [18]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from transformers import Trainer, TrainingArguments

# Load your Electra tokenizer
electra_tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")

# Define your model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("google/electra-small-discriminator", num_labels=9)

# Define training arguments
training_args = TrainingArguments(
    output_dir="model3/electra",
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=50,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

# Initialize the Trainer with the ELECTRA tokenizer and a matching padding collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=electra_tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=electra_tokenizer),
    compute_metrics=compute_metrics,
)

# Start training (assuming you have train_dataset and dev_dataset defined)
trainer.train()


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.928348 | 0.359699 |
| 2 | No log | 1.817859 | 0.359699 |
| 3 | No log | 1.675107 | 0.421846 |
| 4 | No log | 1.509881 | 0.453861 |
| 5 | No log | 1.354795 | 0.553672 |
| 6 | No log | 1.272395 | 0.596987 |
| 7 | No log | 1.327544 | 0.576271 |
| 8 | No log | 1.182477 | 0.606403 |
| 9 | No log | 1.137582 | 0.623352 |
| 10 | No log | 1.105932 | 0.640301 |




TrainOutput(global_step=850, training_loss=0.3467350701724782, metrics={'train_runtime': 239.5023, 'train_samples_per_second': 110.855, 'train_steps_per_second': 3.549, 'total_flos': 630141625674270.0, 'train_loss': 0.3467350701724782, 'epoch': 50.0})
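The `global_step=850` reported above is consistent with 531 training examples, a batch size of 32 and 50 epochs. The arithmetic can be sketched as follows, assuming a single device, no gradient accumulation and a kept last partial batch; the helper names are hypothetical. The second function illustrates why the training loss can show "No log" for the early epochs, assuming the loss is logged every 500 steps (the notebook does not set `logging_steps`).

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch is ceil(num_examples / batch_size) when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size) * num_epochs


def first_epoch_with_logged_loss(num_examples, batch_size, logging_steps):
    """First epoch whose final step count reaches `logging_steps`."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(logging_steps / steps_per_epoch)


print(total_optimizer_steps(531, 32, 50))          # 850
print(first_epoch_with_logged_loss(531, 32, 500))  # 30
```

Setting a smaller logging interval in `TrainingArguments` would make the training loss visible from the first epochs, which helps when comparing it with the validation loss.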

## 4. Evaluation and analysis of results

In [19]:
trainer.evaluate()



{'eval_loss': 1.1059316396713257,
 'eval_accuracy': 0.64030131826742,
 'eval_runtime': 1.3014,
 'eval_samples_per_second': 408.018,
 'eval_steps_per_second': 6.916,
 'epoch': 50.0}
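The figures above are computed on the `eval_dataset` passed to the `Trainer`, that is, the validation split. For the qualitative part of the analysis, pairing each text with its gold and predicted label makes misclassified incidents easier to read. The sketch below assumes the predicted labels are already available as a list of integers (for example the argmax over the logits returned by `trainer.predict`); `misclassified_examples` is a hypothetical helper and the records are toy values.

```python
def misclassified_examples(records, predictions, max_chars=80):
    """Return (index, gold, predicted, text snippet) for every wrongly classified record."""
    errors = []
    for i, (record, pred) in enumerate(zip(records, predictions)):
        gold = int(record["label"])
        if gold != pred:
            # Collapse repeated whitespace and keep only the beginning of the text
            snippet = " ".join(record["text"].split())[:max_chars]
            errors.append((i, gold, pred, snippet))
    return errors


# Toy records and predictions for illustration only (not results from this notebook).
records = [
    {"text": " Employee  fell from roof ", "label": "2"},
    {"text": "Worker burned by steam", "label": "5"},
]
predictions = [2, 3]
for error in misclassified_examples(records, predictions):
    print(error)
```

Reading a handful of these snippets per confused class pair is often enough to tell whether the errors come from ambiguous labels or from the model itself.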