# Recognizing named entities
## Training the CamemBERT model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pjox/tutoriel-ner-philologie-computationnelle/blob/master/train_ner_transformer.ipynb)

In [10]:
# We start by installing the required libraries and downloading the training, dev, and test files.
! pip install flair gdown
! gdown https://drive.google.com/uc\?id\=1NgRHG94lQNZ37TSWJZQHScMc69U8QJ6Z
! unzip mini_presto.zip

Downloading...
From: https://drive.google.com/uc?id=1BpPdP1xDh0ai4Jz71nc8f0IxjboVHokH
To: /Users/portizsu/Code/github.com/pjox/tutoriel-ner-philologie-computationnelle/mini_presto.zip
100%|████████████████████████████████████████| 166k/166k [00:00<00:00, 22.6MB/s]
Archive:  mini_presto.zip
   creating: mini_presto/
  inflating: mini_presto/dev.conll   
  inflating: mini_presto/test.conll  
  inflating: mini_presto/train.conll  


In [17]:
# 1. We now use the flair library to read the training, dev, and test files.
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define the columns: column index -> annotation name (column 1 is not mapped)
columns = {0: 'text', 2: 'pos', 3: 'ner'}

# the folder that contains the train, test, and dev files
data_folder = 'mini_presto/'

# initialize a corpus from the column format, the data folder, and the names of the train, dev, and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.conll',
                              test_file='test.conll',
                              dev_file='dev.conll')

2021-12-02 16:57:01,054 Reading data from mini_presto
2021-12-02 16:57:01,056 Train: mini_presto/train.conll
2021-12-02 16:57:01,058 Dev: mini_presto/dev.conll
2021-12-02 16:57:01,058 Test: mini_presto/test.conll
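
To make the `columns` mapping more concrete, here is a minimal sketch of how a column-formatted file can be read in plain Python. It is not how flair implements `ColumnCorpus`; the helper `read_columns` and the sample rows (including the POS values and the content of the unmapped column 1) are hypothetical and only illustrate the layout assumed by the mapping `{0: 'text', 2: 'pos', 3: 'ner'}`: one token per line, whitespace-separated columns, and a blank line between sentences.

```python
from typing import Dict, List

# Hypothetical sample: token, unmapped column, POS tag, NER tag.
# The values are illustrative and are not taken from mini_presto.
SAMPLE = """\
du\tde\tP+D\tO
grand\tgrand\tADJ\tO
Pantagruel\tPantagruel\tNPr\tB-pers
.\t.\tPONCT\tO

"""


def read_columns(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Split column-formatted text into sentences of {annotation name: value} dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Keep only the columns named in the mapping; other columns are ignored.
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


columns = {0: 'text', 2: 'pos', 3: 'ner'}
sentences = read_columns(SAMPLE, columns)
print(sentences[0][2])
```

Each token becomes a small dictionary keyed by the annotation names, which mirrors the role the mapping plays when the corpus is loaded.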


In [12]:
len(corpus.train)

1299

In [13]:
print(corpus.train[0].to_tagged_string('ner'))

l' origine & antiquité du grand Pantagruel <B-pers> .
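
The tag `<B-pers>` above follows a B-/I-/O scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. As a rough sketch of how such tags group into entity spans, the hypothetical helper `bio_to_spans` below (not part of flair) walks over a tag sequence; it assumes that a stray `I-` tag with no matching open entity starts a new one.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Group BIO tags into (text, label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending on the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), label, start, i))
            start, label = None, None
        # Open a new span on B-, or on an I- tag that has nothing to continue.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


tokens = "l' origine & antiquité du grand Pantagruel .".split()
tags = ["O", "O", "O", "O", "O", "O", "B-pers", "O"]
print(bio_to_spans(tokens, tags))

# A hypothetical multi-token entity, to show how I- tags extend a span.
print(bio_to_spans(["le", "roi", "Louis", "XIV"], ["O", "O", "B-pers", "I-pers"]))
```

The first call recovers the single `pers` entity from the sentence printed above; the second shows a two-token entity merged into one span.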


In [15]:
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 2. which label type do we want to predict?
label_type = 'ner'

# 3. build the label dictionary from the corpus
label_dict = corpus.make_label_dictionary(label_type=label_type)
print(label_dict)

# 4. initialize the transformer embeddings in fine-tunable mode WITH context
embeddings = TransformerWordEmbeddings(model='camembert-base',
                                       layers="-1",
                                       subtoken_pooling="mean",
                                       fine_tune=True,
                                       use_context=True,
                                       )

# 5. initialize a bare-bones sequence tagger (no CRF, no RNN, no reprojection)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type='ner',
                        use_crf=False,
                        use_rnn=False,
                        reproject_embeddings=False,
                        )

# 6. initialize the trainer
trainer = ModelTrainer(tagger, corpus)

# 7. run fine-tuning
trainer.fine_tune('resources/taggers/ner-transformer-mini-presto',
                  learning_rate=5.0e-6,
                  mini_batch_size=16,
                  max_epochs=5)

2021-12-02 15:56:26,541 Computing label dictionary. Progress:


100%|██████████| 1299/1299 [00:00<00:00, 11872.59it/s]

2021-12-02 15:56:26,656 Corpus contains the labels: pos (#37935), ner (#37935)
2021-12-02 15:56:26,656 Created (for label 'ner') Dictionary with 16 tags: <unk>, O, B-pers, B-time, I-time, B-loc, I-pers, I-loc, B-amount, I-amount, B-prod, I-prod, B-org, I-org, B-func, I-func
Dictionary with 16 tags: <unk>, O, B-pers, B-time, I-time, B-loc, I-pers, I-loc, B-amount, I-amount, B-prod, I-prod, B-org, I-org, B-func, I-func



Downloading: 100%|██████████| 424M/424M [00:45<00:00, 9.87MB/s]


2021-12-02 15:57:20,261 ----------------------------------------------------------------------------------------------------
2021-12-02 15:57:20,264 Model: "SequenceTagger(
  (embeddings): TransformerWordEmbeddings(
    (model): CamembertModel(
      (embeddings): RobertaEmbeddings(
        (word_embeddings): Embedding(32005, 768, padding_idx=1)
        (position_embeddings): Embedding(514, 768, padding_idx=1)
        (token_type_embeddings): Embedding(1, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): RobertaEncoder(
        (layer): ModuleList(
          (0): RobertaLayer(
            (attention): RobertaAttention(
              (self): RobertaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias

Token indices sequence length is longer than the specified maximum sequence length for this model (597 > 512). Running this sequence through the model will result in indexing errors


2021-12-02 15:58:10,916 ----------------------------------------------------------------------------------------------------
2021-12-02 15:58:10,922 Exiting from training early.
2021-12-02 15:58:10,923 Saving model ...
2021-12-02 15:58:11,381 Done.
2021-12-02 15:58:11,595 ----------------------------------------------------------------------------------------------------
2021-12-02 15:58:11,597 Testing using last state of model ...
2021-12-02 15:59:57,584 0.0	0.0	0.0	0.0
2021-12-02 15:59:57,585 
Results:
- F-score (micro) 0.0
- F-score (macro) 0.0
- Accuracy 0.0

By class:
              precision    recall  f1-score   support

        time     0.0000    0.0000    0.0000         1
        pers     0.0000    0.0000    0.0000        88
        prod     0.0000    0.0000    0.0000         1
         loc     0.0000    0.0000    0.0000        32
         org     0.0000    0.0000    0.0000         1
      amount     0.0000    0.0000    0.0000         5
       <unk>     0.0000    0.0000    0.00

{'test_score': 0.0,
 'dev_score_history': [],
 'train_loss_history': [],
 'dev_loss_history': []}
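
The all-zero scores above coincide with the log line "Exiting from training early" and with empty loss histories, so they most likely reflect an interrupted run rather than what the model can reach after the full 5 epochs. To clarify what the micro-averaged figures measure, here is a minimal sketch of entity-level precision, recall, and F1, assuming exact-match scoring of spans (same sentence, same boundaries, same label). The helper `micro_prf` and the example spans are hypothetical; this is not flair's evaluation code.

```python
from typing import Iterable, Tuple

# A span is identified by (sentence index, start token, end token, label).
Span = Tuple[int, int, int, str]


def micro_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exactly matching spans."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted spans, for illustration only.
gold = [(0, 6, 7, "pers"), (1, 2, 3, "loc"), (1, 5, 7, "pers")]
pred = [(0, 6, 7, "pers"), (1, 2, 3, "pers")]  # second span has the wrong label

precision, recall, f1 = micro_prf(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# A model that predicts no entity at all scores 0.0 on every metric.
print(micro_prf(gold, []))
```

Under this definition, a tagger that outputs only `O` (or only wrong spans) gets 0.0 everywhere, which matches the shape of the report above.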

In [None]:
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the fine-tuned model.
# Note (assumption): depending on the Flair version and trainer settings, fine_tune() may
# write only final-model.pt; if best-model.pt is missing, load final-model.pt instead.
tagger = SequenceTagger.load('resources/taggers/ner-transformer-mini-presto/best-model.pt')

sentence = Sentence("Il étoit gouverneur de Charlemont, qui par la paix deviendra une place très-considérable ; outre cela, il commandoit à Metz.")

# predict the NER tags
tagger.predict(sentence)

# print the sentence with its predicted tags
print(sentence.to_tagged_string())