In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
!apt install git-lfs
!pip install seqeval


In [2]:
import numpy as np 
import pandas as pd

Token classification:

* NER (named entity recognition): find the entities in a text, such as persons, locations or organizations, by assigning a label to each token.
* POS (part-of-speech tagging): associate each token with its grammatical function.
* Chunking: find the tokens that belong to the same chunk, using one label per position:

    B) the beginning of a chunk

    I) inside a chunk

    O) does not belong to any chunk
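
As a small illustration of the B/I/O scheme, the sketch below turns entity spans into one tag per token. The helper name `spans_to_bio` and the span format (start index, exclusive end index, type) are assumptions made for this example; they do not come from any library.

```python
def spans_to_bio(num_tokens, spans):
    """Turn (start, end, type) spans (end exclusive) into one B/I/O tag per token."""
    tags = ["O"] * num_tokens           # by default a token belongs to no chunk
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"   # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"   # remaining tokens of the chunk
    return tags


tokens = ["The", "European", "Commission", "said", "on", "Thursday"]
bio_tags = spans_to_bio(len(tokens), [(1, 3, "ORG")])
for token, tag in zip(tokens, bio_tags):
    print(f"{token:<12}{tag}")
```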



The CoNLL-2003 dataset targets named entity recognition (NER).
It contains 4 types of named entities: persons, locations, organizations and miscellaneous entities; tokens that belong to no entity are labelled O.

# 1) Data exploration

In [43]:
# load the dataset
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")




In [44]:
df=pd.DataFrame(raw_datasets["train"])

In [45]:
df.head()

Unnamed: 0,id,tokens,pos_tags,chunk_tags,ner_tags
0,0,"[EU, rejects, German, call, to, boycott, Briti...","[22, 42, 16, 21, 35, 37, 16, 21, 7]","[11, 21, 11, 12, 21, 22, 11, 12, 0]","[3, 0, 7, 0, 0, 0, 7, 0, 0]"
1,1,"[Peter, Blackburn]","[22, 22]","[11, 12]","[1, 2]"
2,2,"[BRUSSELS, 1996-08-22]","[22, 11]","[11, 12]","[5, 0]"
3,3,"[The, European, Commission, said, on, Thursday...","[12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 3...","[11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 1...","[0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ..."
4,4,"[Germany, 's, representative, to, the, Europea...","[22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 2...","[11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 1...","[5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ..."


In [46]:
# names of the different labels
ner_feature = raw_datasets["train"].features["ner_tags"] 
label_names=ner_feature.feature.names 
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

* O: no entity.
* B-PER/I-PER: beginning of / inside a person entity.
* B-ORG/I-ORG: beginning of / inside an organization entity.
* B-LOC/I-LOC: beginning of / inside a location entity.
* B-MISC/I-MISC: beginning of / inside a miscellaneous entity.
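
The order of this list matters later on: every B- tag sits at an odd index and its I- counterpart at the next index. The short sketch below checks this on the label list printed above and decodes the tags `[1, 2]` of the second training example ("Peter Blackburn"); the dictionary name `b_to_i` is just an illustrative choice.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Decode the integer tags of the second training example
decoded = [label_names[i] for i in [1, 2]]
print(decoded)

# Map every B-XXX tag to the tag stored at the next index
b_to_i = {
    name: label_names[idx + 1]
    for idx, name in enumerate(label_names)
    if name.startswith("B-")
}
print(b_to_i)
```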




In [47]:
words = raw_datasets["train"][3]["tokens"]
labels = raw_datasets["train"][3]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep . 
O   B-ORG    I-ORG      O    O  O        O  O         O    B-MISC O      O  O         O  O    B-MISC  O    O     O          O         O       O   O   O       O   O  O           O  O     O 


"European Commission": organization ("European": B, beginning of an entity / "Commission": I, inside an entity)

"German" & "British": miscellaneous (MISC), not locations



# 2) Processing the data






### a) Tokenization



The inputs are already split into words: passing is_split_into_words=True to the tokenizer is enough to handle this.

In [42]:
from transformers import AutoTokenizer
# load the BERT tokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [27]:
inputs = tokenizer(raw_datasets["train"][3]["tokens"], is_split_into_words=True)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

*  input_ids: the numerical representation of the tokens.

*  token_type_ids: for classification on sentence pairs or for question answering, this list identifies the two parts of the input.

* attention_mask: with padding and/or batched sequences, this binary list indicates which tokens must be taken into account in the computation.

In [28]:
tokens=inputs.tokens()
print(tokens)

['[CLS]', 'The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 's', '##hun', 'British', 'la', '##mb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.', '[SEP]']


[CLS]: marks the beginning of the sequence

[SEP]: marks the end of the sequence

**Problem**: the words "shun" and "lamb" were each split into two tokens, which creates a mismatch between the tokens and the labels (there is one label per word, since is_split_into_words=True).

**Solution**:

In general it is hard to know whether two tokens belong to the same word.
Fortunately, we have a fast tokenizer, which keeps track of the word each token comes from.

### b) Label alignment

In [29]:
tokenizer.is_fast

True

In [30]:
word_ids=inputs.word_ids()
print(tokens[14:21])
print(word_ids[14:21])

['to', 's', '##hun', 'British', 'la', '##mb', 'until']
[13, 14, 14, 15, 16, 16, 17]


* 's', '##hun' belong to the word with index 14 (word ids are zero-based)
* 'la', '##mb' belong to the word with index 16
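
To make the role of the word ids concrete, the sketch below glues the sub-tokens printed above back into words, using only the two lists shown in the output. It assumes the WordPiece convention that continuation tokens start with `##`.

```python
tokens = ["to", "s", "##hun", "British", "la", "##mb", "until"]
word_ids = [13, 14, 14, 15, 16, 16, 17]

words = {}
for token, word_id in zip(tokens, word_ids):
    # drop the "##" continuation prefix before concatenating
    piece = token[2:] if token.startswith("##") else token
    words[word_id] = words.get(word_id, "") + piece

print(words)
```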





We extend the list of labels according to these rules:
    
    1) special tokens receive the label -100 (the value ignored by default by the loss function)

    2) each token receives the same label as the token that started the word it belongs to

    3) for tokens inside a word but not at its beginning, B- is replaced by I-, since the token does not begin the entity

In [31]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as the previous token
            label = labels[word_id]
            # If the label is B-XXX, change it to I-XXX (B- ids are odd, the matching I- id is the next one)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [32]:
labels = raw_datasets["train"][3]["ner_tags"]
labels_etendue=align_labels_with_tokens(labels, word_ids)
print("Tokens: ",tokens)
print("Word ids: ",word_ids)
print("Labels:",labels)
print("Extended labels: ",labels_etendue)
if len(tokens)==len(labels_etendue):
    print("\n\ntokens and labels have the same length")

Tokens:  ['[CLS]', 'The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 's', '##hun', 'British', 'la', '##mb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.', '[SEP]']
Word ids:  [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 15, 16, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, None]
Labels: [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Extended labels:  [-100, 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


tokens and labels have the same length
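
In the sentence above, the two split words ("shun" and "lamb") carry the label O, so the B- to I- rule never fires. The sketch below applies the same rules, in a condensed form, to a word labelled B-PER that is split into four sub-tokens (the split of "Sylvain" into S, ##yl, ##va, ##in is the one the tokenizer produces later in this notebook). The function name `align` and the two-word input are assumptions made for the illustration.

```python
def align(labels, word_ids):
    """Condensed alignment: -100 for special tokens, B- becomes I- inside a word."""
    new_labels, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)               # special token, ignored by the loss
        elif word_id != previous:
            new_labels.append(labels[word_id])    # first token of a word keeps the word's label
        else:
            label = labels[word_id]
            # B-XXX ids are odd; the matching I-XXX id is the next one
            new_labels.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return new_labels


# [CLS] S ##yl ##va ##in works [SEP], with word labels B-PER (1) and O (0)
aligned = align([1, 0], [None, 0, 0, 0, 0, 1, None])
print(aligned)
```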


### c) Parallel tokenization & batches

Fast tokenizers can parallelize the tokenization of several texts. To gain speed (about 4 times faster), we pass batches of texts to the tokenizer through the map method with the parameter batched=True.

In [33]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [34]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)



In [35]:
df=pd.DataFrame(tokenized_datasets["train"])
df

Unnamed: 0,input_ids,token_type_ids,attention_mask,labels
0,"[101, 7270, 22961, 1528, 1840, 1106, 21423, 14...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]"
1,"[101, 1943, 14428, 102]","[0, 0, 0, 0]","[1, 1, 1, 1]","[-100, 1, 2, -100]"
2,"[101, 26660, 13329, 12649, 15928, 1820, 118, 4...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[-100, 5, 6, 6, 6, 0, 0, 0, 0, 0, -100]"
3,"[101, 1109, 1735, 2827, 1163, 1113, 9170, 1122...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-100, 0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, ..."
4,"[101, 1860, 112, 188, 4702, 1106, 1103, 1735, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, ..."
...,...,...,...,...
14036,"[101, 1113, 5286, 131, 102]","[0, 0, 0, 0, 0]","[1, 1, 1, 1, 1]","[-100, 0, 0, 0, -100]"
14037,"[101, 1784, 1160, 102]","[0, 0, 0, 0]","[1, 1, 1, 1]","[-100, 0, 0, -100]"
14038,"[101, 10033, 123, 8083, 122, 102]","[0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1]","[-100, 3, 0, 3, 0, -100]"
14039,"[101, 1784, 1210, 102]","[0, 0, 0, 0]","[1, 1, 1, 1]","[-100, 0, 0, -100]"


### d) Data collation


In [36]:
print(len(df.iloc[0,0]))
print(len(df.iloc[1,0]))

12
4


The sequences have different lengths, and we cannot use DataCollatorWithPadding (which pads the sequences of a batch to the same length) because it only pads the inputs, whereas here the labels must be padded in the same way.

As before, we use the value -100 to pad the labels, with the DataCollatorForTokenClassification class.
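
Conceptually, what happens to the labels is roughly the following. This is a plain-Python sketch, not the collator's actual implementation: `pad_labels` is a hypothetical helper, and padding on the right is assumed.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]


padded = pad_labels([
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
])
for row in padded:
    print(row)
```

Because the padded positions hold -100, they are skipped by the loss exactly like the special tokens.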


In [37]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [38]:
# before padding
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]


In [39]:
# after padding
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

# 3) Fine-tuning the model 

### a) Metrics

* Token classification models are evaluated with precision, recall and the F1 score.

* We want to compute the metrics at every epoch, so we define a compute_metrics() function.

* We use seqeval as the metric; note that it takes the lists of labels as strings, not as integer ids.


In [40]:
import evaluate

metric = evaluate.load("seqeval")

In [48]:
# seqeval example
labels_decoded = [label_names[i] for i in labels]
print("labels_decoded: ",labels_decoded)
predictions = labels_decoded.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels_decoded])

labels_decoded:  ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.6666666666666666,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.6666666666666666,
 'overall_accuracy': 0.9666666666666667}
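
The value 0.667 comes from scoring whole entities rather than tokens: replacing I-ORG by O turns the predicted organization into "European" alone, which no longer matches "European Commission". The sketch below reproduces the entity-level counts by hand on a shortened version of the same sequence. It is a simplified illustration, not seqeval's code, and `extract_entities` is a hypothetical helper.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence (end exclusive)."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # the sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i))         # the open entity ends before position i
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]               # a new entity starts here
    return entities


references = ["O", "B-ORG", "I-ORG", "O", "B-MISC", "O", "B-MISC"]
predictions = ["O", "B-ORG", "O", "O", "B-MISC", "O", "B-MISC"]

true_entities = extract_entities(references)
pred_entities = extract_entities(predictions)
correct = len(true_entities & pred_entities)          # exact span and type matches only

precision = correct / len(pred_entities)
recall = correct / len(true_entities)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Two of the three predicted entities are exact matches, and two of the three reference entities are found, so precision, recall and F1 are all 2/3.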

In [49]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # build the predictions from the argmax of the logits (softmax is monotonic, so logits and probabilities share the same argmax)
    predictions = np.argmax(logits, axis=-1)

    # Remove the ignored index (special tokens) and convert both labels and predictions to label strings
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
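
To see what the filtering step does, here is the same pair of list comprehensions applied to a toy batch of one sequence. The label and prediction ids are made up for the illustration.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# One sequence: label ids and predicted ids, position by position
labels = [[-100, 1, 2, 2, 0, -100]]
predictions = [[0, 1, 2, 0, 0, 5]]   # whatever is predicted at -100 positions must be dropped

true_labels = [[label_names[l] for l in seq if l != -100] for seq in labels]
true_predictions = [
    [label_names[p] for p, l in zip(pred_seq, label_seq) if l != -100]
    for pred_seq, label_seq in zip(predictions, labels)
]
print(true_labels)
print(true_predictions)
```

Both lists end up with the same length, which is what seqeval expects.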


### b) Fine-tuning the model with Trainer


#### 1- Model 1

We use AutoModelForTokenClassification. Instead of passing num_labels, we build two dictionaries, id2label and label2id, that map label ids to label names and back.

In [67]:
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [68]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [66]:
from transformers import AutoModelForTokenClassification

model1 = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [69]:
# check the number of labels
model1.config.num_labels


9

In [70]:
# hyperparameters
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [71]:
# Trainer
from transformers import Trainer
trainer = Trainer(
    model=model1,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


In [111]:
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268
  Number of trainable parameters = 107726601


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0836,0.063776,0.911789,0.93588,0.923677,0.983134
2,0.0401,0.063253,0.929798,0.947324,0.938479,0.984959
3,0.0189,0.059954,0.933884,0.950858,0.942295,0.986387


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-1756
Configuration saved in bert-finetuned-ner/checkpoint-1756/config.json
Model weights saved in bert-finetuned-ner/checkpoint-1756/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-1756/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-1756/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-3512
Configuration saved in bert-finetuned-ner/checkpoint-3512/config.json
Model weights saved in bert-finetuned-ner/checkpoint-3512/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-3512/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-3512/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8
Saving model checkpoin

TrainOutput(global_step=5268, training_loss=0.06584466436005218, metrics={'train_runtime': 610.3736, 'train_samples_per_second': 69.012, 'train_steps_per_second': 8.631, 'total_flos': 918992408223438.0, 'train_loss': 0.06584466436005218, 'epoch': 3.0})
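
The step counts in this log can be checked by hand. The sketch below uses only the numbers reported above and assumes a single device with no gradient accumulation, as the log states.

```python
import math

num_examples = 14041   # "Num examples" in the training log
batch_size = 8         # per-device batch size
num_epochs = 3

# the last, smaller batch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The first value matches the name of the first checkpoint (checkpoint-1756), and the second matches "Total optimization steps".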

#### 2- Testing model 1

In [112]:
# test the model
from transformers import pipeline

model1 = model1.to("cpu")  # move the model to the CPU

token_classifier = pipeline(task="token-classification", model=model1, tokenizer=tokenizer, aggregation_strategy="simple")


In [113]:
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER',
  'score': 0.9978536,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.99099255,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9989127,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [114]:
token_classifier("Fayçal Braham teaches NLP at Paris Dauphine University at Tunis")

[{'entity_group': 'PER',
  'score': 0.99956656,
  'word': 'Fayçal Braham',
  'start': 0,
  'end': 13},
 {'entity_group': 'MISC',
  'score': 0.62056464,
  'word': 'NLP',
  'start': 22,
  'end': 25},
 {'entity_group': 'ORG',
  'score': 0.9976444,
  'word': 'Paris Dauphine University',
  'start': 29,
  'end': 54},
 {'entity_group': 'LOC',
  'score': 0.9987615,
  'word': 'Tunis',
  'start': 58,
  'end': 63}]

### c) Fine-tuning the model with a custom training loop

#### 1- Model 2

This part performs the same fine-tuning as the previous part, without using the Trainer class.


  DataLoader is a Python iterable over a dataset that groups the data into batches and lets you customize the order in which the data is loaded.

In [50]:
from torch.utils.data import DataLoader
# convert the elements of the dataset (train & validation) into batches of size 8

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,  # shuffle the examples so that the elements of a batch are less correlated,
    # which gives more useful updates during gradient descent
    collate_fn=data_collator,  # DataCollatorForTokenClassification, for dynamic padding
    batch_size=8,  # set the batch size to 8
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [51]:
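for batch1 in train_dataloader:
    break
# `batch` is still the 2-example batch built in the data collation step (not `batch1`), hence the shapes below
{k: v.shape for k, v in batch.items()}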

{'input_ids': torch.Size([2, 12]),
 'token_type_ids': torch.Size([2, 12]),
 'attention_mask': torch.Size([2, 12]),
 'labels': torch.Size([2, 12])}

In [56]:
from transformers import AutoModelForTokenClassification
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
model2 = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [57]:
# shape check on the 2-example batch built in the data collation step
outputs = model2(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(2.1476, grad_fn=<NllLossBackward0>) torch.Size([2, 12, 9])


Here we use the AdamW optimizer, a variant of Adam with decoupled weight decay, with a learning rate lr=5e-5.

In [59]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
optimizer = AdamW(model2.parameters(), lr=5e-5)
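
For intuition, one AdamW update on a single scalar weight can be written out as below. This is a sketch of the usual formulation, not the library's code: `adamw_step` is a hypothetical helper, and the `weight_decay=0.01`, beta and epsilon values are assumptions chosen for the example.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update of a scalar weight; returns the new (theta, m, v)."""
    theta = theta - lr * weight_decay * theta   # decoupled decay: applied to the weight, not to the gradient
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.5, m=m, v=v, t=1)
print(theta)
```

On the first step the adaptive part moves the weight by roughly `lr`, whatever the scale of the gradient, and the decay term shrinks it by `lr * weight_decay * theta` on top of that.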



get_scheduler lets us decrease the learning rate (lr) progressively down to 0; here the decay is linear.

In [60]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
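
The shape of this schedule can be sketched in a few lines. The function below is an illustration of linear warmup followed by linear decay written for this notebook, not the library's implementation; `linear_lr` is a hypothetical name.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total = 5268  # 3 epochs x 1756 batches, as in this notebook
for step in (0, total // 2, total):
    print(step, linear_lr(step, 5e-5, total))
```

With no warmup, the rate starts at 5e-5, is halved at mid-training and reaches 0 at the last step.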

Training can run on different hardware configurations:
* CPU | GPU | TPU

In addition, the run can take place:

* on a single machine with several cores

* on several machines, called nodes

In [62]:
# import Accelerator
from accelerate import Accelerator
# instantiation
accelerator = Accelerator()
# keep the name model2 for the prepared model, since the training loop uses model2
model2, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model2, optimizer, train_dataloader, eval_dataloader)

In [63]:
# convert the labels and predictions into two lists of strings to compute the seqeval metric


def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

In [64]:
from tqdm.auto import tqdm
import torch

results_acc_dic={}
lrs=[]
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model2.train()
    # training mode, which enables
    # layers such as dropout
    for batch in train_dataloader:
        outputs = model2(**batch)  # forward pass
        loss = outputs.loss  # loss computed by the model from the labels
        accelerator.backward(loss)  # backpropagation: compute the gradients
        optimizer.step()
        # apply one update step using the gradients
        
        lr_scheduler.step()  # update the learning rate with lr_scheduler
        lrs.append(optimizer.param_groups[0]["lr"])  # record the learning rates

        optimizer.zero_grad()
        # reset the gradients to zero
        # so that they are not accumulated into the next updates
        progress_bar.update(1)

    # Evaluation
    model2.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model2(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        # postprocess returns (labels, predictions), in that order
        true_labels, true_predictions = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    results_acc_dic[epoch]={key: results[f"overall_{key}"] for key in ["precision", "recall", "f1", "accuracy"]}



In [65]:
pd.DataFrame(results_acc_dic).T

Unnamed: 0,precision,recall,f1,accuracy
0,0.902415,0.924436,0.913293,0.980676
1,0.920229,0.945473,0.93268,0.98568
2,0.930114,0.94968,0.939795,0.986843


#### 2- Testing model 2

In [75]:
# test the model
from transformers import pipeline
model2 = model2.to("cpu")  # move the model to the CPU
token_classifier = pipeline(task="token-classification", model=model2, tokenizer=tokenizer, aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")  # prediction

[{'entity_group': 'PER',
  'score': 0.9978805,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.98005295,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9981675,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

In [76]:
token_classifier("Fayçal Braham teaches NLP at Paris Dauphine University at Tunis")

[{'entity_group': 'PER',
  'score': 0.9984937,
  'word': 'Fayçal Braham',
  'start': 0,
  'end': 13},
 {'entity_group': 'MISC',
  'score': 0.4391573,
  'word': 'NL',
  'start': 22,
  'end': 24},
 {'entity_group': 'ORG',
  'score': 0.6617797,
  'word': '##P',
  'start': 24,
  'end': 25},
 {'entity_group': 'ORG',
  'score': 0.99545175,
  'word': 'Paris Dauphine University',
  'start': 29,
  'end': 54},
 {'entity_group': 'LOC',
  'score': 0.941696,
  'word': 'Tunis',
  'start': 58,
  'end': 63}]

In [101]:
# try each aggregation_strategy
aggregation_strategy=["none","simple","first","average","max"]
resultat1={}
for agg_strat in aggregation_strategy:
  token_classifier = pipeline(task="token-classification", model=model2,tokenizer=tokenizer, aggregation_strategy=agg_strat)
  resultat1[agg_strat]=pd.DataFrame(token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn."))  # prediction

In [106]:
for agg_strat in aggregation_strategy:
  print(agg_strat,'\n',resultat1[agg_strat],'\n')

none 
   entity     score  index      word  start  end
0  B-PER  0.998636      4         S     11   12
1  I-PER  0.998315      5      ##yl     12   14
2  I-PER  0.997390      6      ##va     14   16
3  I-PER  0.997181      7      ##in     16   18
4  B-ORG  0.982122     12        Hu     33   35
5  I-ORG  0.977187     13   ##gging     35   40
6  I-ORG  0.980850     14      Face     41   45
7  B-LOC  0.998168     16  Brooklyn     49   57 

simple 
   entity_group     score          word  start  end
0          PER  0.997881       Sylvain     11   18
1          ORG  0.980053  Hugging Face     33   45
2          LOC  0.998168      Brooklyn     49   57 

first 
   entity_group     score          word  start  end
0          PER  0.998636       Sylvain     11   18
1          ORG  0.981486  Hugging Face     33   45
2          LOC  0.998168      Brooklyn     49   57 

average 
   entity_group     score          word  start  end
0          PER  0.748252       Sylvain     11   18
1          ORG  0.
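
The relation between the "none" and "simple" outputs can be reproduced by hand: consecutive B-/I- tokens of the same type are merged and their scores averaged. The sketch below is a simplified illustration, not the pipeline's implementation. `aggregate_simple` is a hypothetical helper; it assumes O tokens were already filtered out (as in the "none" table) and uses the `index` column to check that tokens are adjacent.

```python
def aggregate_simple(token_preds):
    """Merge consecutive B-/I- tokens of the same type and average their scores."""
    groups = []
    for pred in token_preds:
        prefix, ent_type = pred["entity"].split("-", 1)
        last = groups[-1] if groups else None
        if (last and prefix == "I" and last["entity_group"] == ent_type
                and last["tokens"][-1]["index"] + 1 == pred["index"]):
            last["tokens"].append(pred)               # continue the current entity
        else:
            groups.append({"entity_group": ent_type, "tokens": [pred]})
    results = []
    for group in groups:
        toks, word = group["tokens"], ""
        for tok in toks:
            if tok["word"].startswith("##"):
                word += tok["word"][2:]               # glue sub-tokens together
            else:
                word += (" " if word else "") + tok["word"]
        results.append({
            "entity_group": group["entity_group"],
            "score": sum(t["score"] for t in toks) / len(toks),
            "word": word,
            "start": toks[0]["start"],
            "end": toks[-1]["end"],
        })
    return results


# Rows of the "none" table above
token_preds = [
    {"entity": "B-PER", "score": 0.998636, "index": 4, "word": "S", "start": 11, "end": 12},
    {"entity": "I-PER", "score": 0.998315, "index": 5, "word": "##yl", "start": 12, "end": 14},
    {"entity": "I-PER", "score": 0.997390, "index": 6, "word": "##va", "start": 14, "end": 16},
    {"entity": "I-PER", "score": 0.997181, "index": 7, "word": "##in", "start": 16, "end": 18},
    {"entity": "B-ORG", "score": 0.982122, "index": 12, "word": "Hu", "start": 33, "end": 35},
    {"entity": "I-ORG", "score": 0.977187, "index": 13, "word": "##gging", "start": 35, "end": 40},
    {"entity": "I-ORG", "score": 0.980850, "index": 14, "word": "Face", "start": 41, "end": 45},
    {"entity": "B-LOC", "score": 0.998168, "index": 16, "word": "Brooklyn", "start": 49, "end": 57},
]
entities = aggregate_simple(token_preds)
for entity in entities:
    print(entity)
```

The averaged scores agree with the "simple" table (0.997881, 0.980053, 0.998168) up to rounding.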

In [109]:
resultat2={}
for agg_strat in aggregation_strategy:
  token_classifier = pipeline(task="token-classification", model=model2,tokenizer=tokenizer, aggregation_strategy=agg_strat)#pipeline
  resultat2[agg_strat]=pd.DataFrame(token_classifier( "Fayçal Braham teaches NLP at Paris Dauphine University at Tunis"))#prediction


In [110]:
for agg_strat in aggregation_strategy:
  print(agg_strat,'\n',resultat2[agg_strat],'\n')

none 
     entity     score  index        word  start  end
0    B-PER  0.997947      1         Fay      0    3
1    I-PER  0.997919      2        ##ça      3    5
2    I-PER  0.997834      3         ##l      5    6
3    I-PER  0.999007      4           B      7    8
4    I-PER  0.999339      5       ##rah      8   11
5    I-PER  0.998917      6        ##am     11   13
6   B-MISC  0.439157      8          NL     22   24
7    I-ORG  0.661780      9         ##P     24   25
8    B-ORG  0.996412     11       Paris     29   34
9    I-ORG  0.995340     12          Da     35   37
10   I-ORG  0.997814     13        ##up     37   39
11   I-ORG  0.997736     14      ##hine     39   43
12   I-ORG  0.989956     15  University     44   54
13   B-LOC  0.967628     17          Tu     58   60
14   I-LOC  0.915764     18       ##nis     60   63 

simple 
   entity_group     score                       word  start  end
0          PER  0.998494              Fayçal Braham      0   13
1         MISC  0.4391

**Model limitations:**


This model is limited by its training dataset, which consists of entity-annotated news articles from a specific period. It may not generalize well to all use cases in other domains.