# IIC-3670 NLP UC

- Library versions, Python 3.8.10

- numpy 1.20.3
- flair 0.12
- allennlp 0.9.0


### I will work with BERT base uncased from the flair library, which requires torch 1.9.0 or later. I load the encoder and build the embedding of a sentence.

In [1]:
from flair.embeddings import TransformerWordEmbeddings
from flair.data import Sentence

embedding = TransformerWordEmbeddings('bert-base-uncased')
#embedding = TransformerWordEmbeddings('roberta-base')

sentence = Sentence('George Washington was born in Washington')

embedding.embed(sentence)

[Sentence[6]: "George Washington was born in Washington"]

### Let's look at the embedding of each word in the sentence

In [2]:
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "George"
tensor([-7.7858e-02,  5.0599e-01, -1.7383e-01, -5.7768e-01,  8.5991e-01,
        -4.4526e-01,  5.1470e-01,  1.9235e-01,  1.1062e-01, -8.2717e-01,
        -1.7893e-01, -5.7098e-01, -3.0730e-02,  1.1882e-01, -6.7864e-01,
        -1.3466e-01,  7.5936e-01,  1.0573e-02, -1.1145e-01,  1.5357e-01,
        -8.8367e-01,  3.9509e-01, -4.4996e-01,  2.6870e-01,  6.0829e-01,
         2.0158e-01,  1.2647e-01,  7.0656e-01,  3.7739e-02, -8.3689e-01,
         4.5149e-01, -5.0183e-01,  2.2537e-01,  5.7245e-01, -5.3343e-01,
        -2.9224e-01, -2.3169e-01,  8.5069e-01,  5.1448e-01, -3.4850e-01,
         1.7520e-01, -3.5871e-01,  7.0467e-01, -4.2313e-01,  1.7307e-01,
        -1.6502e-01,  2.0721e-01, -1.0091e+00,  6.8144e-02, -8.5671e-01,
         3.5520e-01, -1.4831e-01, -3.1613e-01,  7.2204e-01,  3.6080e-01,
         4.0647e-01, -2.5118e-01, -1.2226e-01,  5.5348e-02, -4.6344e-01,
        -1.1319e-02,  8.2398e-01,  6.5795e-01, -1.5961e-01, -4.9053e-01,
        -6.0648e-02, -1.5673e-01
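Because BERT embeddings are contextual, the two occurrences of "Washington" (the surname and the place) get different vectors. One way to quantify the difference is cosine similarity. Below is a minimal sketch with plain Python lists; the helper `cosine_similarity` and the toy vectors are illustrative assumptions, not part of flair.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

With the embedded sentence from the cell above, the two "Washington" tokens are `sentence[1]` and `sentence[5]`. Calling `cosine_similarity(sentence[1].embedding.tolist(), sentence[5].embedding.tolist())` would be expected to return a value below 1.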

### I can choose which layers of the transformer encoder to use

In [3]:
embeddings = TransformerWordEmbeddings('bert-base-uncased', layers='-1', layer_mean=False)
embeddings.embed(sentence)
print(sentence[0].embedding.size())

sentence.clear_embeddings()

embeddings = TransformerWordEmbeddings('bert-base-uncased', layers='-1,-2', layer_mean=False)
embeddings.embed(sentence)
print(sentence[0].embedding.size())

sentence.clear_embeddings()


embeddings = TransformerWordEmbeddings('bert-base-uncased', layers='all', layer_mean=False)
embeddings.embed(sentence)
print(sentence[0].embedding.size())

torch.Size([768])
torch.Size([1536])
torch.Size([9984])
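The three sizes follow from simple arithmetic. With `layer_mean=False` the selected layers are concatenated, and `bert-base-uncased` exposes 13 hidden states (the embedding output plus 12 transformer layers) of 768 dimensions each. A small sketch of that calculation; `embedding_length` is a hypothetical helper, not a flair function.

```python
HIDDEN_SIZE = 768        # hidden size of bert-base-uncased
NUM_HIDDEN_STATES = 13   # embedding output + 12 transformer layers

def embedding_length(layers, layer_mean=False):
    """Expected token-embedding size for a flair-style `layers` string."""
    if layers == "all":
        n_selected = NUM_HIDDEN_STATES
    else:
        n_selected = len(layers.split(","))
    # Averaging keeps a single 768-dim vector; concatenation stacks them.
    return HIDDEN_SIZE if layer_mean else HIDDEN_SIZE * n_selected

for spec in ["-1", "-1,-2", "all"]:
    print(spec, embedding_length(spec))
```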


### I will read a dataset for NER

In [1]:
from flair.embeddings import TransformerWordEmbeddings
from flair.data import Sentence
import flair.datasets

corpus = flair.datasets.CONLL_03_SPANISH()
print(corpus)

2024-04-17 12:41:53,432 Reading data from /home/marcelo/.flair/datasets/conll_03_spanish
2024-04-17 12:41:53,433 Train: /home/marcelo/.flair/datasets/conll_03_spanish/esp.train
2024-04-17 12:41:53,433 Dev: /home/marcelo/.flair/datasets/conll_03_spanish/esp.testa
2024-04-17 12:41:53,434 Test: /home/marcelo/.flair/datasets/conll_03_spanish/esp.testb
Corpus: 8323 train + 1915 dev + 1517 test sentences


### I will use the last layer of RoBERTa (`layers="-1"`) as embeddings

In [2]:
label_dict = corpus.make_label_dictionary(label_type='ner', add_unk=False)

#label_dict = corpus.make_tag_dictionary(tag_type='ner')
print(label_dict)

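embeddings = TransformerWordEmbeddings(model='roberta-base',
                                       layers="-1",               # use only the last transformer layer
                                       layer_mean=False,          # do not average layers
                                       subtoken_pooling="first",  # represent each word by its first subtoken
                                       fine_tune=True,            # update the transformer weights during training
                                       use_context=True,          # add surrounding-sentence context to the input
                                       model_max_length=512,      # maximum input length in subtokens
                                       )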


2024-04-17 12:41:57,649 Computing label dictionary. Progress:


8323it [00:00, 45280.51it/s]

2024-04-17 12:41:57,859 Dictionary created for label 'ner' with 4 values: ORG (seen 7390 times), LOC (seen 4914 times), PER (seen 4321 times), MISC (seen 2173 times)





Dictionary with 4 tags: ORG, LOC, PER, MISC
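Transformer tokenizers split words into subtokens, so each word needs a single vector built from its pieces. `subtoken_pooling="first"` keeps the vector of the first subtoken. The sketch below illustrates the idea with plain lists and is not flair's internal implementation. The function name `pool_subtokens` and the toy vectors are assumptions; the strategy names mirror the option values flair accepts.

```python
def pool_subtokens(subtoken_vectors, strategy="first"):
    """Collapse the vectors of one word's subtokens into a single vector."""
    if not subtoken_vectors:
        raise ValueError("a word must have at least one subtoken")
    if strategy == "first":
        return list(subtoken_vectors[0])
    if strategy == "last":
        return list(subtoken_vectors[-1])
    if strategy == "mean":
        n = len(subtoken_vectors)
        return [sum(dim) / n for dim in zip(*subtoken_vectors)]
    if strategy == "first_last":
        # Concatenate first and last subtoken vectors (doubles the size).
        return list(subtoken_vectors[0]) + list(subtoken_vectors[-1])
    raise ValueError(f"unknown strategy: {strategy}")

# Toy example: one word split into three subtokens, 2 dimensions each.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(pool_subtokens(pieces, "first"))
print(pool_subtokens(pieces, "mean"))
```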


### I train flair's sequence tagger on top of the RoBERTa embeddings

In [3]:
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

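tagger = SequenceTagger(
    hidden_size=256,             # RNN hidden size (only relevant when use_rnn=True)
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_format="BIOES",          # expand each label into S-/B-/E-/I- tags plus O
    tag_type='ner',
    use_crf=False,               # no CRF decoding layer
    use_rnn=False,               # no RNN between the embeddings and the output layer
    reproject_embeddings=False,  # no extra linear reprojection of the embeddings
)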


trainer = ModelTrainer(tagger, corpus)

2024-04-17 12:42:06,279 SequenceTagger predicts: Dictionary with 17 tags: O, S-ORG, B-ORG, E-ORG, I-ORG, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-MISC, B-MISC, E-MISC, I-MISC


In NER (Named Entity Recognition) with the CoNLL (Conference on Computational Natural Language Learning) format, the "S", "B", "E", and "I" annotations describe how the words inside a named entity are tagged. They belong to the BIOES tagging scheme (also written BIESO), which marks the boundaries of an entity and the position of each word within it. Each tag means the following:

- **B (Beginning)**: Marks the start of an entity. It is used for the first word of a multi-word entity.

- **I (Inside)**: Used for a word inside an entity that is neither its first nor its last word, that is, the words between B and E in an entity of three or more words.

- **E (End)**: Marks the end of an entity. It is used for the last word of a multi-word entity.

- **S (Single)**: Used when an entity consists of a single word, which is therefore both the start and the end of the entity.

Words that belong to no entity receive the tag O (Outside). Because the scheme makes entity boundaries explicit, it can help machine learning models locate the limits of entities, especially multi-word ones.
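To make the scheme concrete, here is a minimal sketch that builds the tag inventory for the four CoNLL labels and encodes entity spans as BIOES tags. The helpers `bioes_tagset` and `spans_to_bioes`, and the hand-written annotation, are illustrative assumptions, not flair functions.

```python
def bioes_tagset(labels):
    """All tags for the given entity labels: O plus S/B/E/I per label."""
    tags = ["O"]
    for label in labels:
        tags.extend(f"{prefix}-{label}" for prefix in "SBEI")
    return tags

def spans_to_bioes(n_tokens, spans):
    """Encode (start, end_exclusive, label) spans as one BIOES tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags

# 4 labels x 4 positional prefixes + O = 17 tags, as in the tagger log.
print(len(bioes_tagset(["ORG", "LOC", "PER", "MISC"])))

tokens = ["George", "Washington", "was", "born", "in", "Washington"]
spans = [(0, 2, "PER"), (5, 6, "LOC")]  # hand-written annotation for illustration
print(list(zip(tokens, spans_to_bioes(len(tokens), spans))))
```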

### And we fine-tune the NER tagger

In [4]:
trainer.fine_tune(
    base_path="resources/taggers/ner-roberta-base",
    train_with_dev=False,
    max_epochs=8,
    learning_rate=2.0e-5,
    mini_batch_size=4,
    shuffle=False,
)



Gathered 9510 of total 50265
Reducing vocab size by 81.0803%
Reducing model size by 25.1115%
Reducing training parameter count by 25.1115%


2024-04-17 12:42:19,580 ----------------------------------------------------------------------------------------------------
2024-04-17 12:42:19,582 Model: "SequenceTagger(
  (embeddings): TransformerWordEmbeddings(
    (model): RobertaModel(
      (embeddings): RobertaEmbeddings(
        (word_embeddings): Embedding(50266, 768)
        (position_embeddings): Embedding(514, 768, padding_idx=1)
        (token_type_embeddings): Embedding(1, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): RobertaEncoder(
        (layer): ModuleList(
          (0): RobertaLayer(
            (attention): RobertaAttention(
              (self): RobertaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
          

2024-04-17 12:42:19,584 ----------------------------------------------------------------------------------------------------
2024-04-17 12:42:19,586 Corpus: "Corpus: 8323 train + 1915 dev + 1517 test sentences"
2024-04-17 12:42:19,587 ----------------------------------------------------------------------------------------------------
2024-04-17 12:42:19,588 Parameters:
2024-04-17 12:42:19,589  - learning_rate: "0.000020"
2024-04-17 12:42:19,590  - mini_batch_size: "4"
2024-04-17 12:42:19,591  - patience: "3"
2024-04-17 12:42:19,592  - anneal_factor: "0.5"
2024-04-17 12:42:19,593  - max_epochs: "8"
2024-04-17 12:42:19,595  - shuffle: "False"
2024-04-17 12:42:19,596  - train_with_dev: "False"
2024-04-17 12:42:19,596  - batch_growth_annealing: "False"
2024-04-17 12:42:19,597 ----------------------------------------------------------------------------------------------------
2024-04-17 12:42:19,598 Model training base path: "resources/taggers/ner-roberta-base"
2024-04-17 12:42:19,599 -----

100%|██████████| 479/479 [00:14<00:00, 32.75it/s]

2024-04-17 12:46:10,466 Evaluating as a multi-label problem: False
2024-04-17 12:46:10,510 DEV : loss 0.12500788271427155 - f1-score (micro avg)  0.8026
2024-04-17 12:46:10,538 ----------------------------------------------------------------------------------------------------
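The counts in the log follow from the corpus sizes and the mini-batch size. With 8323 training sentences in batches of 4 there are 2081 iterations per epoch, and the 479/479 progress bar matches the 1915 dev sentences if evaluation also uses batches of 4. A quick check; the evaluation batch size is an assumption inferred from the progress bars, and `num_batches` is a hypothetical helper.

```python
import math

def num_batches(n_sentences, mini_batch_size):
    """Number of mini-batches needed to cover n_sentences once."""
    return math.ceil(n_sentences / mini_batch_size)

train, dev, test = 8323, 1915, 1517   # corpus sizes printed when loading the data
batch = 4                             # mini_batch_size passed to fine_tune

print(num_batches(train, batch))  # iterations per training epoch
print(num_batches(dev, batch))    # dev batches, assuming the same batch size
print(num_batches(test, batch))   # test batches, assuming the same batch size
```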





2024-04-17 12:46:31,701 epoch 2 - iter 208/2081 - loss 0.13478385 - time (sec): 21.16 - samples/sec: 1263.22 - lr: 0.000019
2024-04-17 12:46:53,871 epoch 2 - iter 416/2081 - loss 0.12907947 - time (sec): 43.33 - samples/sec: 1271.54 - lr: 0.000019
2024-04-17 12:47:15,513 epoch 2 - iter 624/2081 - loss 0.12617677 - time (sec): 64.97 - samples/sec: 1165.98 - lr: 0.000019
2024-04-17 12:47:37,086 epoch 2 - iter 832/2081 - loss 0.12365354 - time (sec): 86.55 - samples/sec: 1174.74 - lr: 0.000018
2024-04-17 12:47:59,238 epoch 2 - iter 1040/2081 - loss 0.11800991 - time (sec): 108.70 - samples/sec: 1191.36 - lr: 0.000018
2024-04-17 12:48:22,078 epoch 2 - iter 1248/2081 - loss 0.12495660 - time (sec): 131.54 - samples/sec: 1201.04 - lr: 0.000018
2024-04-17 12:48:43,892 epoch 2 - iter 1456/2081 - loss 0.12233771 - time (sec): 153.35 - samples/sec: 1215.93 - lr: 0.000018
2024-04-17 12:49:05,307 epoch 2 - iter 1664/2081 - loss 0.11952784 - time (sec): 174.77 - samples/sec: 1213.69 - lr: 0.000017


100%|██████████| 479/479 [00:18<00:00, 25.68it/s]

2024-04-17 12:50:08,471 Evaluating as a multi-label problem: False





2024-04-17 12:50:08,538 DEV : loss 0.12301015108823776 - f1-score (micro avg)  0.8465
2024-04-17 12:50:08,576 ----------------------------------------------------------------------------------------------------
2024-04-17 12:50:30,314 epoch 3 - iter 208/2081 - loss 0.10279048 - time (sec): 21.74 - samples/sec: 1229.80 - lr: 0.000016
2024-04-17 12:50:52,195 epoch 3 - iter 416/2081 - loss 0.09486375 - time (sec): 43.62 - samples/sec: 1263.18 - lr: 0.000016
2024-04-17 12:51:13,128 epoch 3 - iter 624/2081 - loss 0.09347314 - time (sec): 64.55 - samples/sec: 1173.60 - lr: 0.000016
2024-04-17 12:51:34,578 epoch 3 - iter 832/2081 - loss 0.09354814 - time (sec): 86.00 - samples/sec: 1182.19 - lr: 0.000016
2024-04-17 12:51:55,958 epoch 3 - iter 1040/2081 - loss 0.09083844 - time (sec): 107.38 - samples/sec: 1205.97 - lr: 0.000015
2024-04-17 12:52:18,300 epoch 3 - iter 1248/2081 - loss 0.09994466 - time (sec): 129.72 - samples/sec: 1217.85 - lr: 0.000015
2024-04-17 12:52:40,366 epoch 3 - iter 14

100%|██████████| 479/479 [00:17<00:00, 27.08it/s]

2024-04-17 12:54:02,411 Evaluating as a multi-label problem: False





2024-04-17 12:54:02,445 DEV : loss 0.12724953889846802 - f1-score (micro avg)  0.8688
2024-04-17 12:54:02,471 ----------------------------------------------------------------------------------------------------
2024-04-17 12:54:23,659 epoch 4 - iter 208/2081 - loss 0.08630366 - time (sec): 21.19 - samples/sec: 1261.71 - lr: 0.000014
2024-04-17 12:54:44,473 epoch 4 - iter 416/2081 - loss 0.07990218 - time (sec): 42.00 - samples/sec: 1311.83 - lr: 0.000013
2024-04-17 12:55:06,372 epoch 4 - iter 624/2081 - loss 0.07636733 - time (sec): 63.90 - samples/sec: 1185.57 - lr: 0.000013
2024-04-17 12:55:27,287 epoch 4 - iter 832/2081 - loss 0.07515861 - time (sec): 84.81 - samples/sec: 1198.73 - lr: 0.000013
2024-04-17 12:55:48,300 epoch 4 - iter 1040/2081 - loss 0.07214483 - time (sec): 105.83 - samples/sec: 1223.67 - lr: 0.000013
2024-04-17 12:56:09,712 epoch 4 - iter 1248/2081 - loss 0.07503518 - time (sec): 127.24 - samples/sec: 1241.61 - lr: 0.000012
2024-04-17 12:56:30,702 epoch 4 - iter 14

100%|██████████| 479/479 [00:16<00:00, 28.50it/s]

2024-04-17 12:57:50,792 Evaluating as a multi-label problem: False
2024-04-17 12:57:50,824 DEV : loss 0.13338154554367065 - f1-score (micro avg)  0.8767
2024-04-17 12:57:50,850 ----------------------------------------------------------------------------------------------------
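The `lr` column shrinks steadily from the configured 2.0e-5 toward zero. That pattern is consistent with a linear warmup followed by a linear decay over the 8 x 2081 training steps. The sketch below reproduces several logged values under that assumption. The warmup fraction of 0.1 is my assumption rather than something the log states, and `linear_warmup_decay` is a hypothetical helper.

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

ITERS_PER_EPOCH = 2081
TOTAL_STEPS = 8 * ITERS_PER_EPOCH   # max_epochs * iterations per epoch
PEAK_LR = 2.0e-5                    # learning_rate passed to fine_tune

def lr_at(epoch, iteration):
    """Learning rate at a 1-based (epoch, iteration), formatted like the log."""
    step = (epoch - 1) * ITERS_PER_EPOCH + iteration
    return f"{linear_warmup_decay(step, TOTAL_STEPS, PEAK_LR):.6f}"

for epoch, iteration in [(2, 208), (2, 832), (3, 208), (8, 1664)]:
    print(epoch, iteration, lr_at(epoch, iteration))
```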





2024-04-17 12:58:12,367 epoch 5 - iter 208/2081 - loss 0.06134422 - time (sec): 21.52 - samples/sec: 1242.44 - lr: 0.000011
2024-04-17 12:58:33,680 epoch 5 - iter 416/2081 - loss 0.05921549 - time (sec): 42.83 - samples/sec: 1286.48 - lr: 0.000011
2024-04-17 12:58:54,725 epoch 5 - iter 624/2081 - loss 0.05679157 - time (sec): 63.87 - samples/sec: 1186.06 - lr: 0.000010
2024-04-17 12:59:15,833 epoch 5 - iter 832/2081 - loss 0.05727826 - time (sec): 84.98 - samples/sec: 1196.38 - lr: 0.000010
2024-04-17 12:59:39,237 epoch 5 - iter 1040/2081 - loss 0.05701575 - time (sec): 108.39 - samples/sec: 1194.80 - lr: 0.000010
2024-04-17 13:00:03,355 epoch 5 - iter 1248/2081 - loss 0.05953548 - time (sec): 132.50 - samples/sec: 1192.29 - lr: 0.000009
2024-04-17 13:00:25,087 epoch 5 - iter 1456/2081 - loss 0.05792990 - time (sec): 154.24 - samples/sec: 1208.97 - lr: 0.000009
2024-04-17 13:00:46,607 epoch 5 - iter 1664/2081 - loss 0.05783981 - time (sec): 175.76 - samples/sec: 1206.86 - lr: 0.000009


100%|██████████| 479/479 [00:18<00:00, 25.41it/s]

2024-04-17 13:01:48,860 Evaluating as a multi-label problem: False





2024-04-17 13:01:48,898 DEV : loss 0.17233379185199738 - f1-score (micro avg)  0.8522
2024-04-17 13:01:48,929 ----------------------------------------------------------------------------------------------------
2024-04-17 13:02:10,332 epoch 6 - iter 208/2081 - loss 0.05027897 - time (sec): 21.40 - samples/sec: 1249.03 - lr: 0.000008
2024-04-17 13:02:32,248 epoch 6 - iter 416/2081 - loss 0.04677513 - time (sec): 43.32 - samples/sec: 1271.94 - lr: 0.000008
2024-04-17 13:02:54,646 epoch 6 - iter 624/2081 - loss 0.04507152 - time (sec): 65.72 - samples/sec: 1152.81 - lr: 0.000008
2024-04-17 13:03:17,097 epoch 6 - iter 832/2081 - loss 0.04536561 - time (sec): 88.17 - samples/sec: 1153.15 - lr: 0.000007
2024-04-17 13:03:38,988 epoch 6 - iter 1040/2081 - loss 0.04443522 - time (sec): 110.06 - samples/sec: 1176.64 - lr: 0.000007
2024-04-17 13:04:01,482 epoch 6 - iter 1248/2081 - loss 0.04728063 - time (sec): 132.55 - samples/sec: 1191.86 - lr: 0.000007
2024-04-17 13:04:24,364 epoch 6 - iter 14

100%|██████████| 479/479 [00:17<00:00, 26.72it/s]

2024-04-17 13:05:48,713 Evaluating as a multi-label problem: False
2024-04-17 13:05:48,752 DEV : loss 0.16247394680976868 - f1-score (micro avg)  0.8771
2024-04-17 13:05:48,784 ----------------------------------------------------------------------------------------------------





2024-04-17 13:06:09,995 epoch 7 - iter 208/2081 - loss 0.03868834 - time (sec): 21.21 - samples/sec: 1260.34 - lr: 0.000005
2024-04-17 13:06:31,488 epoch 7 - iter 416/2081 - loss 0.03868234 - time (sec): 42.70 - samples/sec: 1290.25 - lr: 0.000005
2024-04-17 13:06:52,550 epoch 7 - iter 624/2081 - loss 0.03678776 - time (sec): 63.77 - samples/sec: 1188.08 - lr: 0.000005
2024-04-17 13:07:13,552 epoch 7 - iter 832/2081 - loss 0.03638934 - time (sec): 84.77 - samples/sec: 1199.40 - lr: 0.000004
2024-04-17 13:07:34,555 epoch 7 - iter 1040/2081 - loss 0.03491890 - time (sec): 105.77 - samples/sec: 1224.34 - lr: 0.000004
2024-04-17 13:07:55,985 epoch 7 - iter 1248/2081 - loss 0.03650993 - time (sec): 127.20 - samples/sec: 1242.00 - lr: 0.000004
2024-04-17 13:08:17,049 epoch 7 - iter 1456/2081 - loss 0.03618516 - time (sec): 148.26 - samples/sec: 1257.66 - lr: 0.000004
2024-04-17 13:08:37,897 epoch 7 - iter 1664/2081 - loss 0.03662307 - time (sec): 169.11 - samples/sec: 1254.27 - lr: 0.000003


100%|██████████| 479/479 [00:18<00:00, 25.43it/s]

2024-04-17 13:09:38,740 Evaluating as a multi-label problem: False
2024-04-17 13:09:38,777 DEV : loss 0.1679566204547882 - f1-score (micro avg)  0.8757
2024-04-17 13:09:38,808 ----------------------------------------------------------------------------------------------------





2024-04-17 13:10:00,354 epoch 8 - iter 208/2081 - loss 0.04130037 - time (sec): 21.54 - samples/sec: 1240.76 - lr: 0.000003
2024-04-17 13:10:21,935 epoch 8 - iter 416/2081 - loss 0.03790122 - time (sec): 43.13 - samples/sec: 1277.62 - lr: 0.000002
2024-04-17 13:10:42,632 epoch 8 - iter 624/2081 - loss 0.03290789 - time (sec): 63.82 - samples/sec: 1187.00 - lr: 0.000002
2024-04-17 13:11:03,818 epoch 8 - iter 832/2081 - loss 0.03338194 - time (sec): 85.01 - samples/sec: 1195.99 - lr: 0.000002
2024-04-17 13:11:24,773 epoch 8 - iter 1040/2081 - loss 0.03045543 - time (sec): 105.96 - samples/sec: 1222.11 - lr: 0.000001
2024-04-17 13:11:46,199 epoch 8 - iter 1248/2081 - loss 0.03398755 - time (sec): 127.39 - samples/sec: 1240.16 - lr: 0.000001
2024-04-17 13:12:07,560 epoch 8 - iter 1456/2081 - loss 0.03365243 - time (sec): 148.75 - samples/sec: 1253.55 - lr: 0.000001
2024-04-17 13:12:28,864 epoch 8 - iter 1664/2081 - loss 0.03410493 - time (sec): 170.05 - samples/sec: 1247.32 - lr: 0.000001


100%|██████████| 479/479 [00:17<00:00, 27.48it/s]


2024-04-17 13:13:28,260 Evaluating as a multi-label problem: False
2024-04-17 13:13:28,295 DEV : loss 0.1680973768234253 - f1-score (micro avg)  0.879
2024-04-17 13:13:29,101 ----------------------------------------------------------------------------------------------------
2024-04-17 13:13:29,103 Testing using last state of model ...


100%|██████████| 380/380 [00:13<00:00, 28.58it/s]

2024-04-17 13:13:42,413 Evaluating as a multi-label problem: False





2024-04-17 13:13:42,467 0.8719	0.8817	0.8768	0.8388
2024-04-17 13:13:42,468 
Results:
- F-score (micro) 0.8768
- F-score (macro) 0.8611
- Accuracy 0.8388

By class:
              precision    recall  f1-score   support

         ORG     0.8549    0.8964    0.8752      1400
         LOC     0.8731    0.8506    0.8617      1084
         PER     0.9633    0.9633    0.9633       735
        MISC     0.7441    0.7441    0.7441       340

   micro avg     0.8719    0.8817    0.8768      3559
   macro avg     0.8588    0.8636    0.8611      3559
weighted avg     0.8722    0.8817    0.8767      3559

2024-04-17 13:13:42,469 ----------------------------------------------------------------------------------------------------
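The micro and macro F-scores in the report can be recovered from the table. Macro F1 is the unweighted mean of the per-class F1 scores, so the weaker MISC class pulls it below the micro F1, which is computed from the pooled precision and recall. A small check using the rounded values printed above:

```python
# Per-class scores copied from the report: (precision, recall, f1, support).
report = {
    "ORG":  (0.8549, 0.8964, 0.8752, 1400),
    "LOC":  (0.8731, 0.8506, 0.8617, 1084),
    "PER":  (0.9633, 0.9633, 0.9633, 735),
    "MISC": (0.7441, 0.7441, 0.7441, 340),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(scores[2] for scores in report.values()) / len(report)

# Micro F1: F1 of the pooled (micro-averaged) precision and recall.
micro_f1 = f1(0.8719, 0.8817)

total_support = sum(scores[3] for scores in report.values())
print(round(macro_f1, 4), round(micro_f1, 4), total_support)
```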


{'test_score': 0.8767812238055323,
 'dev_score_history': [0.8026315789473684,
  0.8465116279069766,
  0.868781378366043,
  0.8766651355075793,
  0.8522139160437033,
  0.8770840519719444,
  0.8757055638751295,
  0.878979427651994],
 'train_loss_history': [0.4196061343392371,
  0.11689172179770276,
  0.09449662407041759,
  0.07456361308340184,
  0.057899872601580235,
  0.04665862448102976,
  0.03753470630178461,
  0.03574421755466544],
 'dev_loss_history': [0.12500788271427155,
  0.12301015108823776,
  0.12724953889846802,
  0.13338154554367065,
  0.17233379185199738,
  0.16247394680976868,
  0.1679566204547882,
  0.1680973768234253]}
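The returned histories are worth a quick look. The training loss falls every epoch, while the dev loss is lowest early on and then rises, even though the dev F1 keeps creeping up. A minimal sketch for reading the best epochs off the lists; `best_epoch` is a hypothetical helper.

```python
# Histories copied from the dictionary returned by fine_tune.
dev_score = [0.8026315789473684, 0.8465116279069766, 0.868781378366043,
             0.8766651355075793, 0.8522139160437033, 0.8770840519719444,
             0.8757055638751295, 0.878979427651994]
train_loss = [0.4196061343392371, 0.11689172179770276, 0.09449662407041759,
              0.07456361308340184, 0.057899872601580235, 0.04665862448102976,
              0.03753470630178461, 0.03574421755466544]
dev_loss = [0.12500788271427155, 0.12301015108823776, 0.12724953889846802,
            0.13338154554367065, 0.17233379185199738, 0.16247394680976868,
            0.1679566204547882, 0.1680973768234253]

def best_epoch(values, higher_is_better=True):
    """1-based epoch of the best value in a per-epoch history."""
    pick = max if higher_is_better else min
    return values.index(pick(values)) + 1

print(best_epoch(dev_score))                         # epoch with the best dev F1
print(best_epoch(dev_loss, higher_is_better=False))  # epoch with the lowest dev loss
# True if the training loss decreases at every epoch.
print(all(a > b for a, b in zip(train_loss, train_loss[1:])))
```

Here the best dev F1 comes in the last epoch, so testing with the last model state, as the log reports, coincides with the best dev checkpoint.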

### And let's see how it works

In [11]:
# Load the fine-tuned model saved by the trainer
model = SequenceTagger.load('resources/taggers/ner-roberta-base/final-model.pt')


# create example sentence
from flair.data import Sentence
sentence = Sentence("George Washington fue a Washington")

# predict tags and print
model.predict(sentence)

print(sentence.to_tagged_string())

2024-04-17 13:39:43,255 SequenceTagger predicts: Dictionary with 17 tags: O, S-ORG, B-ORG, E-ORG, I-ORG, S-LOC, B-LOC, E-LOC, I-LOC, S-PER, B-PER, E-PER, I-PER, S-MISC, B-MISC, E-MISC, I-MISC
Sentence[5]: "George Washington fue a Washington" → ["George Washington"/LOC, "Washington"/LOC]
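The printed spans are the result of grouping token-level BIOES tags back into entities. Note that the model labels "George Washington" as LOC, where PER would be the expected label. The sketch below shows that grouping step. The token-level tags are my assumption about what underlies the printed output, and `bioes_to_spans` is a simplified, strict decoder rather than flair's own.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (text, label) entity spans."""
    spans, current, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            spans.append((token, label))
            current, current_label = [], None
        elif prefix == "B":
            current, current_label = [token], label
        elif prefix in ("I", "E") and current and label == current_label:
            current.append(token)
            if prefix == "E":
                spans.append((" ".join(current), label))
                current, current_label = [], None
        else:
            # Ill-formed sequence (e.g. I- without a preceding B-): drop it.
            current, current_label = [], None
    return spans

tokens = ["George", "Washington", "fue", "a", "Washington"]
# Token-level tags assumed to underlie the printed prediction.
tags = ["B-LOC", "E-LOC", "O", "O", "S-LOC"]
print(bioes_to_spans(tokens, tags))
```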
