<a href="https://colab.research.google.com/github/nassibehgolizadeh/NLP-Projects/blob/main/NER_C_L.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER with Transformers

In [None]:
!pip install -q datasets
!pip install -q transformers
!pip install -q accelerate
!pip install -q seqeval
!pip install -q evaluate

In [None]:
import datasets
import transformers
import torch
import accelerate
import evaluate
from tqdm.auto import tqdm

## Dataset

Load the CoNLL-2003 dataset from the Hugging Face Hub.

In [None]:
raw_datasets = datasets.load_dataset("conll2003")

The dataset provides 14,041 training, 3,250 validation, and 3,453 test examples.

Since the task is NER, the fields of interest are `tokens` and `ner_tags`:

In [None]:
print(raw_datasets["test"][0]["tokens"])
print(raw_datasets["test"][0]["ner_tags"])

['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
[0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]


The `ner_tags` feature holds the correspondence between the integer tags and the class names:

In [None]:
ner_feature = raw_datasets["train"].features["ner_tags"]

To get just the class names:

In [None]:
ner_feature.feature.names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
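
In these names, a `B-` prefix marks the first word of an entity, an `I-` prefix marks a word that continues it, and `O` marks words outside any entity. As a small illustration, the sketch below decodes the integer tags of the first training sentence with the list printed above. The `tokens` and `tags` values are copied by hand so the block runs without downloading the dataset.

```python
# Class names as printed above; the list index is the integer tag.
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training sentence and its tags, copied by hand for illustration.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Pair every token with the name of its tag.
decoded = [(token, names[tag]) for token, tag in zip(tokens, tags)]

# Keep only the tokens that belong to an entity.
entities = [(token, name) for token, name in decoded if name != 'O']
print(entities)
# [('EU', 'B-ORG'), ('German', 'B-MISC'), ('British', 'B-MISC')]
```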

Now decode the labels of one example and print them aligned under their tokens:

In [None]:
# choose a different example by changing i
i = 3
names = ner_feature.feature.names
words = raw_datasets["train"][i]["tokens"]
labels = raw_datasets["train"][i]["ner_tags"]
l_1 = ""
l_2 = ""
# iterate through the words and their corresponding labels
for word, label in zip(words, labels):
    # look up the label name
    label_name = names[label]
    # column width: the longer of the word and its label name
    max_length = max(len(word), len(label_name))
    # l_1 holds the tokens, padded to the column width
    l_1 += word + " " * (max_length - len(word) + 1)
    # l_2 holds the labels, padded to the column width
    l_2 += label_name + " " * (max_length - len(label_name) + 1)

print(l_1, l_2, sep= "\n")

The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep . 
O   B-ORG    I-ORG      O    O  O        O  O         O    B-MISC O      O  O         O  O    B-MISC  O    O     O          O         O       O   O   O       O   O  O           O  O     O 


## Processing the data

In [None]:
checkpoint = "dslim/bert-base-NER"

In [None]:
# load the tokenizer that matches the checkpoint
tokenizer = transformers.BertTokenizerFast.from_pretrained(checkpoint)

In [None]:
# check that the tokenizer is a fast one; word_ids() is only available on fast tokenizers
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(
    raw_datasets["train"][0]["tokens"], # the dataset column to tokenize
    is_split_into_words= True, # the input is already split into words
)

Compare the sub-word tokens, their word ids, and the original word-level tags:

In [None]:
print(inputs.tokens(), inputs.word_ids(), raw_datasets["train"][0]["ner_tags"], sep= "\n")

['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
[3, 0, 7, 0, 0, 0, 7, 0, 0]


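The tokenizer adds special tokens and may split a word into several sub-word tokens, so there are more tokens than labels. We need a function that:
* gives special tokens the label -100, which the loss function ignores
* gives the first token of each word the label of that word
* gives every later token of the same word the same label, with `B-` changed to the matching `I-`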

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # first token of a new word, or a special token that follows a word
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # special token such as [CLS]
            new_labels.append(-100)
        else:
            # later sub-word token of the current word
            label = labels[word_id]
            # odd ids are B- labels; adding 1 gives the matching I- label
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
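
As a quick check, the sketch below runs the alignment on the word ids and tags printed earlier for the first training sentence, then on a made-up two-token word to show the `B-` to `I-` conversion. The function is repeated so the block runs on its own; the second input is hypothetical and not taken from the dataset.

```python
def align_labels_with_tokens(labels, word_ids):
    """Expand word-level labels to token-level labels (same logic as above)."""
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id]
            if label % 2 == 1:  # B- label: switch to the matching I- label
                label += 1
            new_labels.append(label)
    return new_labels

# Word ids and tags of the first training sentence, as printed earlier.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]

# Hypothetical single word tagged B-LOC (5) that is split into two tokens.
split_entity = align_labels_with_tokens([5], [None, 0, 0, None])
print(split_entity)
# [-100, 5, 6, -100]
```

The first result has one label per token, including the two special tokens, and matches the `labels` entry of the tokenized example shown further down.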

In [None]:
def tokenize_and_align_labels(example):
    # example is a batch: each column holds a list of sentences
    tokenized_input = tokenizer(example["tokens"],
                                truncation= True,
                                is_split_into_words= True)
    all_labels = example["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        # word ids of the i-th sentence in the batch
        word_ids = tokenized_input.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    # assign once, after all sentences have been aligned
    tokenized_input["labels"] = new_labels
    return tokenized_input

In [None]:
# apply the function to every split with map; batched=True passes lists of examples
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels,
                                      batched= True,
                                      remove_columns= raw_datasets["train"].column_names)

In [None]:
tokenized_datasets["train"][0]

{'input_ids': [101,
  7270,
  22961,
  1528,
  1840,
  1106,
  21423,
  1418,
  2495,
  12913,
  119,
  102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'labels': [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]}

## Preparing data for custom loop

In [None]:
tokenized_datasets.set_format("torch")

In [None]:
# the collator pads inputs and labels to the longest sequence in each batch
data_collator = transformers.DataCollatorForTokenClassification(tokenizer= tokenizer)
# preparing train data
train_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["train"],
    shuffle= True,
    collate_fn= data_collator,
    batch_size= 16,
    )
# preparing validation data
eval_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["validation"],
    collate_fn= data_collator,
    batch_size= 16,
    )
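
To make the dynamic padding concrete, here is a plain-Python sketch of the idea. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the padding token id of 0 is an assumption about the tokenizer. The label padding value of -100 follows the same convention used for special tokens above.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # padded positions get -100 so they are ignored by the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two short, hand-written examples of different lengths.
features = [
    {"input_ids": [101, 7270, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 7270, 22961, 1528, 102], "labels": [-100, 3, 0, 7, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 7270, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 3, -100, -100, -100]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.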

## Model for fine-tuning

In [None]:
id2label = {i: label for i, label in enumerate(names)}
label2id = {k: v for v, k in id2label.items()}

In [None]:
model = transformers.BertForTokenClassification.from_pretrained(checkpoint,
                                                                id2label= id2label,
                                                                label2id= label2id)

## Preparing model for training

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr= 2e-5)

Hand the model, optimizer, and dataloaders to the `Accelerator`, which places them on the right device:

In [None]:
accl = accelerate.Accelerator()

In [None]:
model, optimizer, train_dataloader, eval_dataloader = accl.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Define the learning rate scheduler:

In [None]:
num_train_epochs = 4
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

In [None]:
num_training_steps

3512

In [None]:
lr_sch = transformers.get_scheduler(
    name= "linear",
    optimizer= optimizer,
    num_warmup_steps= 200,
    # total number of optimizer steps, warmup included
    num_training_steps= num_training_steps,
)
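
The shape of a linear schedule with warmup can be sketched in a few lines. This is an illustration of the formula, not the library code: `linear_schedule_factor` is a hypothetical helper that returns the multiplier applied to the base learning rate, and `total_steps` is assumed to count the warmup steps as well.

```python
def linear_schedule_factor(step, warmup_steps, total_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # linear decay from 1 at the end of warmup to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
for step in (0, 100, 200, 1856, 3512):
    factor = linear_schedule_factor(step, warmup_steps=200, total_steps=3512)
    print(step, factor, base_lr * factor)
```

With `total_steps` equal to the 3512 steps of this run, the learning rate peaks at step 200 and reaches zero exactly at the last step. A smaller `total_steps` would make it reach zero earlier and stay there for the remaining steps.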

A helper to post-process the model predictions and the labels for evaluation:

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()
    # drop positions labelled -100 (special tokens and padding) and map ids to names
    true_labels = [[names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    # note the order: labels first, predictions second
    return true_labels, true_predictions
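
The filtering step can be tried on plain lists, without tensors. In this sketch `filter_and_decode` is a hypothetical stand-in for `postprocess` that skips the tensor conversion, and the input ids are made up for illustration.

```python
names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def filter_and_decode(predictions, labels):
    """Drop positions labelled -100 and map the remaining ids to names."""
    true_labels = [[names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [names[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

# One sentence: [CLS], two real tokens, [SEP], one padding position.
labels = [[-100, 3, 0, -100, -100]]
# The model predicts something at every position, including ignored ones.
predictions = [[0, 3, 5, 0, 0]]

true_labels, true_predictions = filter_and_decode(predictions, labels)
print(true_labels)       # [['B-ORG', 'O']]
print(true_predictions)  # [['B-ORG', 'B-LOC']]
```

The mask always comes from the labels, so a prediction at an ignored position never reaches the metric.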

## Defining Metric

In [None]:
metric = evaluate.load("seqeval")
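
`seqeval` scores whole entities rather than individual tokens. The sketch below illustrates that idea for a single sentence. It is a simplified approximation, not the `seqeval` implementation, and `extract_entities` and `precision_recall_f1` are hypothetical helpers.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    # the trailing "O" sentinel closes an entity that ends the sentence
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # close the open entity unless this tag continues it
        if start is not None and (prefix != "I" or label != ent_type):
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        # open a new entity on B-, or on an I- with no entity open
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return entities

def precision_recall_f1(true_tags, pred_tags):
    """Entity-level scores: a prediction counts only if type and span both match."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

true_tags = ["B-ORG", "I-ORG", "O", "B-LOC"]
pred_tags = ["B-ORG", "I-ORG", "O", "O"]  # the location is missed
precision, recall, f1 = precision_recall_f1(true_tags, pred_tags)
print(precision, recall, f1)
```

Here the single predicted entity is correct, so precision is 1.0, while only one of the two true entities is found, so recall is 0.5. Swapping the two arguments swaps precision and recall, so the order of predictions and references matters.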

## Training Loop

During evaluation, different processes may have padded their inputs and labels to different lengths, so we call `accl.pad_across_processes` to bring predictions and labels to the same shape before gathering them.

In [None]:
output_dir = "my_model"

In [None]:
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    # training part of the model
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accl.backward(loss)
        optimizer.step()
        lr_sch.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # model validation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim= -1)
        labels = batch["labels"]
        predictions = accl.pad_across_processes(predictions, dim= 1, pad_index= -100)
        labels = accl.pad_across_processes(labels, dim= 1, pad_index= -100)
        predictions_gathered = accl.gather(predictions)
        labels_gathered = accl.gather(labels)
        # postprocess returns (labels, predictions), in that order
        true_labels, true_predictions = postprocess(
            predictions_gathered, labels_gathered
        )
        metric.add_batch(
            predictions= true_predictions,
            references= true_labels
        )
    result = metric.compute()
    print(f"epoch {epoch}:", {key: result[f"overall_{key}"] for key in ["precision", "recall", "f1", "accuracy"]})
    accl.wait_for_everyone()
    if accl.is_main_process:
        tokenizer.save_pretrained(output_dir)
        # unwrap first: accl.prepare may wrap the model for distributed training
        accl.unwrap_model(model).save_pretrained(output_dir)

epoch 0: {'precision': 0.9500168293503871, 'recall': 0.9327495042961005, 'f1': 0.9413039853259965, 'accuracy': 0.9862394772473068}
epoch 1: {'precision': 0.9501851228542578, 'recall': 0.9258773368317481, 'f1': 0.9378737541528239, 'accuracy': 0.9861953258374051}
epoch 2: {'precision': 0.9513631773813531, 'recall': 0.9317619910993902, 'f1': 0.941460571238238, 'accuracy': 0.9867398598928593}
epoch 3: {'precision': 0.9513631773813531, 'recall': 0.9317619910993902, 'f1': 0.941460571238238, 'accuracy': 0.9867398598928593}


## Using fine-tuned model

In [None]:
ner = transformers.pipeline("token-classification", model= "/content/my_model",
                            aggregation_strategy= "simple")

ner("Tesla is located in United States")

[{'entity_group': 'ORG',
  'score': 0.9821116,
  'word': 'Tesla',
  'start': 0,
  'end': 5},
 {'entity_group': 'LOC',
  'score': 0.9990192,
  'word': 'United States',
  'start': 20,
  'end': 33}]