# **Módulo 3 NLP Module Project**

**Part 2. NER: Take a basic, pretrained NER model, and train further on a task-specific dataset**

**Hayali Monserrat Marina Garduño, A01751188**

## Install & Initialize Libraries

In [1]:
!pip install transformers
!pip install wandb
!pip install evaluate
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 24.6 MB/s 
[?25hCollecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 38.7 MB/s 
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 74.0 MB/s 
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.13.5-py2.py3-none-any.whl (1.9 MB)
[K     |████████████████████████████████| 1.9 MB 25

In [2]:
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForTokenClassification
from huggingface_hub import notebook_login
import evaluate
import numpy as np

## Authenticate with Huggingface

In [3]:
# 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
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


## Preprocess Dataset & Train Model

In [4]:
dataset_trivia = load_dataset("tner/mit_movie_trivia")

Downloading builder script:   0%|          | 0.00/2.23k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.78k [00:00<?, ?B/s]

Downloading and preparing dataset mit_movie_trivia/mit_movie_trivia to /root/.cache/huggingface/datasets/tner___mit_movie_trivia/mit_movie_trivia/1.0.0/9f23f5011b1b3386fdb5aaa7a8c061285946a54f5d9f84f5226b36fe6f60000a...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/498k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.76M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/257k [00:00<?, ?B/s]

   

Extracting data files #1:   0%|          | 0/1 [00:00<?, ?obj/s]

Extracting data files #0:   0%|          | 0/1 [00:00<?, ?obj/s]

Extracting data files #2:   0%|          | 0/1 [00:00<?, ?obj/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset mit_movie_trivia downloaded and prepared to /root/.cache/huggingface/datasets/tner___mit_movie_trivia/mit_movie_trivia/1.0.0/9f23f5011b1b3386fdb5aaa7a8c061285946a54f5d9f84f5226b36fe6f60000a. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
MAX_TRAIN_LENGTH = len(dataset_trivia["train"])
N_EXAMPLES_TO_TRAIN = MAX_TRAIN_LENGTH #You can choose a samaller dataset (up to a max of 6816 samples)

In [8]:
model_checkpoint = "bert-base-cased"

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) # Create tokenizer with model checkpoint

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [10]:
def align_labels_with_tokens(labels, lista_palabras_ids):
    etiquetas_alineadas = []
    current_word = None
    for id in lista_palabras_ids:
        if id != current_word:
            current_word = id
            if id is None:
                label = -100
            else:
                label = labels[id]
            etiquetas_alineadas.append(label)
        elif id is None:
            etiquetas_alineadas.append(-100)
        else:
            label = labels[id]
            if label % 2 == 1:
                label += 1
            etiquetas_alineadas.append(label)

    return etiquetas_alineadas

In [11]:
def tokenize_and_align_labels(samples):
    tokenized_inputs = tokenizer(
        samples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = samples["tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [12]:
tokenized_datasets = dataset_trivia.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=["tokens", "tags"],
)

  0%|          | 0/7 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [13]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer) #Label padding
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [14]:
metric = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

In [15]:
label_names = ["O","B-Actor","I-Actor","B-Plot","I-Plot","B-Opinion","I-Opinion","B-Award","I-Award","B-Year","B-Genre","B-Origin","I-Origin","B-Director","I-Director","I-Genre","I-Year","B-Soundtrack","I-Soundtrack","B-Relationship","I-Relationship","B-Character_Name","I-Character_Name","B-Quote","I-Quote"]

In [16]:
trainer_metrics = []
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)

    dict_metrics = {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

    trainer_metrics.append(dict_metrics)

    return dict_metrics

In [17]:
label2id = {
    "O": 0,
    "B-Actor": 1,
    "I-Actor": 2,
    "B-Plot": 3,
    "I-Plot": 4,
    "B-Opinion": 5,
    "I-Opinion": 6,
    "B-Award": 7,
    "I-Award": 8,
    "B-Year": 9,
    "B-Genre": 10,
    "B-Origin": 11,
    "I-Origin": 12,
    "B-Director": 13,
    "I-Director": 14,
    "I-Genre": 15,
    "I-Year": 16,
    "B-Soundtrack": 17,
    "I-Soundtrack": 18,
    "B-Relationship": 19,
    "I-Relationship": 20,
    "B-Character_Name": 21,
    "I-Character_Name": 22,
    "B-Quote": 23,
    "I-Quote": 24
}
id2label = dict(map(reversed, label2id.items()))

In [18]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [19]:
args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="wandb"
)

In [20]:
train_dataset_subset = tokenized_datasets["train"].shuffle(seed=42).select(range(N_EXAMPLES_TO_TRAIN))

In [21]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset_subset,
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)

In [23]:
#wandb_key = 4f5397d204cb46ba17498077cb49b184100f585d
trainer.train()

***** Running training *****
  Num examples = 6816
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2556
  Number of trainable parameters = 107738905
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.6838,0.381188,0.591019,0.675043,0.630243,0.875091
2,0.3127,0.349605,0.627588,0.709466,0.66602,0.885944
3,0.2386,0.352136,0.637783,0.71704,0.675093,0.889033


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to bert-finetuned-ner/checkpoint-852
Configuration saved in bert-finetuned-ner/checkpoint-852/config.json
Model weights saved in bert-finetuned-ner/checkpoint-852/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-852/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-852/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-1704
Configuration saved in bert-finetuned-ner/checkpoint-1704/config.json
Model weights saved in bert-finetuned-ner/checkpoint-1704/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-1704/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-1704/special_tokens_map.json
***** Running Evaluation *****
  Num examp

TrainOutput(global_step=2556, training_loss=0.3716568797593572, metrics={'train_runtime': 298.6407, 'train_samples_per_second': 68.47, 'train_steps_per_second': 8.559, 'total_flos': 400442958658800.0, 'train_loss': 0.3716568797593572, 'epoch': 3.0})