<a href="https://colab.research.google.com/github/jan-kreischer/EPFL_ANN_Projects/blob/main/Project-05/ex05_ner_jan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 5 - Sequence and Sentiment Classification using Transformers
In this assignment we work on Named Entity Recognition (NER) using BERT.  
We selected German as our target language.
In the first part of the assignment, a pre-trained German transformer model downloaded from Hugging Face is fine-tuned for Named Entity Recognition on the German portion of the Polyglot-NER dataset.

## 1. Setup
### 1.1 Dependencies
Disclaimer: The output of cells that do not produce helpful output (for example, the pip install commands) was cleared to make the notebook easier to read.

In [3]:
!pip install datasets transformers scikit-learn




### 1.2 Imports

In [4]:
import datasets
from datasets import load_dataset
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments, AutoConfig
from sklearn.preprocessing import LabelEncoder

# Misc
import csv
import re
from io import StringIO
import requests
import string
import numpy as np
import matplotlib.pyplot as plt  
import seaborn as sn

# Pandas
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, AveragePooling1D, Dense, Dropout, Activation, Embedding
from keras import backend as K
from keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

# Torch
import torch

# Sklearn
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

### 1.3 Constants

In [5]:
THRESHOLD = 7000
MAX_LENGTH = 512
N_EPOCHS = 5
BATCH_SIZE = 8

### 1.4 Environment
We check whether the environment is properly set up, so that a GPU is used for training our models.

In [23]:
# Check whether the device supports the CUDA interface
CUDA = torch.cuda.is_available()
# Run on the GPU (cuda:0) if available, otherwise fall back to the CPU
device = torch.device('cuda:0' if CUDA else 'cpu')
if CUDA:
    # set_device only accepts CUDA devices, so skip it on CPU-only runtimes
    torch.cuda.set_device(device)
print('Using device:', device)

Using device: cuda:0


In [24]:
# Check and print information about available GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Nov 24 09:32:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    27W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [25]:
# Get GPU name
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-80eb3a39-123a-322e-a52d-cbe57ce77608)


In [26]:
# Check Memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


## 2. Data Preparation
### 2.1 Data Acquisition

In [6]:
# Check whether the German Polyglot-NER dataset
# contains at least the required 7000 sentences.
# As the output shows, this is the case.
dataset = datasets.load_dataset('polyglot_ner', 'de', split='train')
print("The dataset contains {} sentences. This is more than the required threshold of {}".format(dataset.num_rows, THRESHOLD))

Downloading and preparing dataset polyglot_ner/de (download: 1.03 GiB, generated: 149.48 MiB, post-processed: Unknown size, total: 1.18 GiB) to /root/.cache/huggingface/datasets/polyglot_ner/de/1.0.0/616830d0e733473b4151a0836757c166374e34854c125146eabe206825cc1343...


Dataset polyglot_ner downloaded and prepared to /root/.cache/huggingface/datasets/polyglot_ner/de/1.0.0/616830d0e733473b4151a0836757c166374e34854c125146eabe206825cc1343. Subsequent calls will reuse this data.
The dataset contains 547578 sentences. This is more than the required threshold of 7000


In [7]:
dataset = datasets.load_dataset('polyglot_ner', 'de', split='train[:{}]'.format(THRESHOLD))

Reusing dataset polyglot_ner (/root/.cache/huggingface/datasets/polyglot_ner/de/1.0.0/616830d0e733473b4151a0836757c166374e34854c125146eabe206825cc1343)


In [8]:
# Show one sample from the dataset
print(dataset[100]["ner"])
print(dataset[100]["words"])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'ORG', 'ORG', 'O', 'LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'ORG', 'ORG', 'ORG', 'O', 'O', 'O', 'O']
['Jonathan', 'Hutton', 'machte', 'seinen', 'B.A', '.-', 'Abschluss', 'am', 'Jesus', 'College', 'in', 'Cambridge', 'und', 'promovierte', 'über', 'Krokodil', '-', 'Ökologie', 'an', 'der', 'University', 'of', 'Zimbabwe', 'im', 'Jahre', '1984', '.']


> Loading the BERT tokenizer

In [9]:
# Load the BERT tokenizer (WordPiece vocabulary) that belongs to
# the BERT base model pre-trained on cased German text
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')


### 2.2 Data Preparation



In [10]:
# The tokenizer does tokenization and numericalization
# The length of the output tensors is 512.
# If the output tensor is shorter than that it will be padded to this length
# If the output tensor would be longer than that it will be truncated
encoded_dataset = [tokenizer(item['words'], return_tensors="pt", padding='max_length', truncation=True, max_length=MAX_LENGTH, is_split_into_words=True) for item in dataset]

In [11]:
# Each encoded sentence is represented by a tensor of shape [1, 512]
print(encoded_dataset[0]['input_ids'].shape)

torch.Size([1, 512])



> Next, we encode the labels



In [12]:
le = LabelEncoder()

As can be seen below, the labels in this dataset are


*   ``LOC, O, ORG, PER``

I decided to keep these labels as they are, even though they are not in the classical IOB format explained in the lecture and tutorial, because I think it makes sense for the classifier to learn these labels directly.






In [13]:
# Set the labels manually because there are so few; I extracted them from the dataset beforehand.
# I added an <UNK> label in case another label appears in the test set. Additionally, I added a <PAD> label
# so that padding positions can be excluded from the evaluation.
labels_correct = ['<UNK>', '<PAD>', 'LOC', 'O', 'ORG', 'PER']

In [14]:
y_encoded = []
le.fit(labels_correct)

# Map any label unknown to the encoder to '<UNK>', then encode each sentence
for item in dataset['ner']:
    item = ['<UNK>' if s not in le.classes_ else s for s in item]
    y_encoded.append(le.transform(item))

In [15]:
print(le.classes_)

['<PAD>' '<UNK>' 'LOC' 'O' 'ORG' 'PER']
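The order of `le.classes_` differs from the order of `labels_correct` because `LabelEncoder` sorts the classes before assigning ids. The following is a minimal sketch of the resulting mapping. It uses plain Python `sorted` as a stand-in for the encoder, which is an assumption that holds for these ASCII strings; `label2id` and `id2label` are hypothetical helper names that do not appear in the notebook.

```python
# Labels in the order in which they were written in the notebook
labels_correct = ['<UNK>', '<PAD>', 'LOC', 'O', 'ORG', 'PER']

# LabelEncoder assigns ids in sorted order; sorted() mimics that here
classes = sorted(labels_correct)
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}

print(classes)   # ['<PAD>', '<UNK>', 'LOC', 'O', 'ORG', 'PER']
print(label2id['<PAD>'], label2id['O'], label2id['PER'])  # 0 3 5
```

This is why `'<PAD>'` ends up with id 0, which the padding step relies on.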


> checking the encoded labels

In [16]:
# Show one example of encoded labels
# In this case
# 5 refers to 'PER'
# 3 refers to 'O'
# 2 refers to 'LOC'
# according to the order of labels above
print(dataset[16]["ner"])
print(y_encoded[16])

['PER', 'PER', 'PER', 'O', 'O', 'O', 'O', 'O', 'O', 'LOC', 'LOC', 'O', 'O', 'O', 'O', 'LOC', 'O']
[5 5 5 3 3 3 3 3 3 2 2 3 3 3 3 2 3]




> Zipping the words and the labels together again \\
> Padding the labels to the same length as the words



In [17]:
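# Pad each label sequence with 0 (the id of '<PAD>') up to MAX_LENGTH,
# truncating longer sequences, and store it as a tensor of shape [1, MAX_LENGTH]
for enc_item, item in zip(encoded_dataset, y_encoded):
    item = item[:MAX_LENGTH]
    item = np.pad(item, (0, MAX_LENGTH - item.size), constant_values=0)
    enc_item['labels'] = torch.LongTensor([item])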
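Note that BERT's tokenizer can split one word into several WordPiece tokens and adds special tokens such as `[CLS]` and `[SEP]`, so the token sequence is generally longer than the word-level label sequence. Padding the labels at the end gives the right tensor shape, but it does not guarantee that label *i* belongs to token *i*. The following is a sketch of one common way to align them, not what this notebook does. It assumes a list of word indices like the one returned by the `word_ids()` method of Hugging Face fast tokenizers (for example `BertTokenizerFast`); the function `align_labels` and the example values are hypothetical.

```python
def align_labels(word_ids, word_labels, pad_id=0):
    """Expand word-level label ids to token level.

    word_ids: one entry per token; None for special or padding tokens,
              otherwise the index of the word the token came from.
    """
    return [pad_id if w is None else word_labels[w] for w in word_ids]

# Hypothetical example: 3 words, the second one split into two subword tokens,
# framed by [CLS] and [SEP] and followed by two padding tokens.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [5, 3, 2]  # PER, O, LOC in the encoding used above

token_labels = align_labels(word_ids, word_labels)
print(token_labels)  # [0, 5, 3, 3, 2, 0, 0, 0]
```

With this alignment, padding ids no longer occur only at the end of a sequence, so an evaluation would have to mask positions individually instead of slicing off a trailing block.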

> Shuffling the dataset

In [18]:
# Shuffle the encoded sentences in place before splitting them into training and test sets
from random import shuffle
shuffle(encoded_dataset)
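Without a seed, the shuffle produces a different train/test split on every run, so the scores reported below cannot be reproduced exactly. The following is a minimal sketch of a seeded shuffle, assuming reproducibility is wanted. The seed value and the stand-in list are arbitrary illustrations and not part of the notebook.

```python
import random

SEED = 42  # arbitrary, hypothetical seed
items = list(range(10))  # stand-in for encoded_dataset

rng = random.Random(SEED)  # private generator, leaves the global state alone
rng.shuffle(items)

# Shuffling a fresh copy with the same seed gives the same order
check = list(range(10))
random.Random(SEED).shuffle(check)
print(items == check)  # True
```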

## 3. Modelling

### Preparing the dataset

> The next cell squeezes the tensors in the dataset, removing the leading dimension of size 1 so that each tensor is a flat sequence of 512 numbers. This can be seen a few cells below. 

In [19]:
for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])
        
train_dataset_1k = encoded_dataset[:1000]
train_dataset_3k = encoded_dataset[1000:4000]
test_dataset_2k = encoded_dataset[4000:6000]

> Checking the dimensions

In [21]:
# Here we check and make sure that
# all the tensors have length 512
for key, val in test_dataset_2k[3].items():
    print(f'key: {key}, dimensions: {val.size()}')

key: input_ids, dimensions: torch.Size([512])
key: token_type_ids, dimensions: torch.Size([512])
key: attention_mask, dimensions: torch.Size([512])
key: labels, dimensions: torch.Size([512])


In [22]:
# Checking that the datasets have the correct length
print(len(train_dataset_1k))
print(len(train_dataset_3k))
print(len(test_dataset_2k))

1000
3000
2000


In [23]:
# Checking one example from the training set
print(train_dataset_1k[0])

{'input_ids': tensor([    3, 16239, 19818, 15617, 26897, 26903,  2755,  1311, 19170,  2258,
         4946,   144,    88,  3032, 18360, 24166, 26898, 26918,    30,   261,
           30,   929,  6410, 14764,    53,   541, 16889, 26901,    42,  4146,
         3172,   956,   266, 26914,     4,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 

### Calculation of f1-score


> The next cell defines a function that computes the micro-averaged and the macro-averaged F1 score. For each pair of true labels and predictions, I exclude the padding labels at the end, as they are not relevant for the evaluation. I then concatenate all sequences and compute the scores over the entire list of predictions. 

In [24]:
from sklearn.metrics import f1_score

# samples.label_ids ... the true labels
# samples.predictions ... the per-token scores (logits) for the individual labels
# samples.predictions.argmax(-1) ... the integer id of the highest-scoring class
def evaluate_f1(samples):
  all_y_true = []
  all_y_pred = []
  for y_true, y_pred in zip(samples.label_ids, samples.predictions.argmax(-1)):
    # Drop the padding labels (id 0), which only occur at the end of a sequence
    y_true = [label for label in y_true if label != 0]
    all_y_true.extend(y_true)

    # Keep only the predictions for the non-padded positions
    y_pred = y_pred[:len(y_true)]
    all_y_pred.extend(y_pred)

  micro_f1 = f1_score(all_y_true, all_y_pred, average='micro')
  macro_f1 = f1_score(all_y_true, all_y_pred, average='macro')
  print("Micro F1: {}".format(micro_f1))
  print("Macro F1: {}".format(macro_f1))
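On a tag set where `O` dominates, the two averages can differ a lot: the micro average counts every token equally, while the macro average gives every class the same weight regardless of its frequency. The following is a minimal pure-Python sketch of both averages on a small made-up example. The function `f1_scores` and the toy labels are hypothetical and only serve to illustrate the difference; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy data: eight 'O' tokens (id 3), one 'LOC' (id 2), one 'PER' (id 5);
# the model predicts 'O' everywhere.
y_true = [3] * 8 + [2, 5]
y_pred = [3] * 10
micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.8 0.296
```

A classifier that never predicts the rare classes still reaches a high micro F1 here, whereas the macro F1 exposes the missed classes.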

### Model fine-tuned with 1000 sentences (and non-frozen embeddings)
I chose a small number of epochs and a small batch size. A small batch size was recommended in the tutorial; I also tried a batch size of 16, but it did not fit into memory. \\
In previous exercises I noticed that more epochs mostly did not improve the model, so I kept the number low this time. I tried a few more epochs once, but the model did not improve significantly. I also think the model might eventually overfit if it is trained for too many epochs.


In [None]:
# We load the base model afresh every time before fine-tuning
# to make sure we never fine-tune an already fine-tuned model
model = BertForTokenClassification.from_pretrained('bert-base-german-cased', num_labels=6)

In [73]:
# Run this cell in order to ensure unfrozen parameters
for param in model.base_model.parameters():
    param.requires_grad = True

In [74]:
args = TrainingArguments(
    num_train_epochs=N_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False, # Use cuda if available
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset_1k
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [75]:
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 625


Step,Training Loss
500,0.0349


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=625, training_loss=0.02926311950683594, metrics={'train_runtime': 296.9198, 'train_samples_per_second': 16.84, 'train_steps_per_second': 2.105, 'total_flos': 1306531031040000.0, 'train_loss': 0.02926311950683594, 'epoch': 5.0})
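The number of optimization steps in the log follows directly from the dataset size, the batch size and the number of epochs. The following is a small sketch of that arithmetic, assuming single-device training, no gradient accumulation, and that the last partial batch of an epoch is kept. `total_steps` is a hypothetical helper name.

```python
import math

def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

print(total_steps(1000, 8, 5))  # 625
print(total_steps(3000, 8, 5))  # 1875
```

With 625 steps and a loss entry every 500 steps, only a single row (step 500) shows up in the training-loss table above.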

In [87]:
predictions = trainer.predict(test_dataset_2k)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 8


In [92]:
evaluate_f1(predictions)

Micro F1: 0.9064798788403288 ()
Macro F1: 0.3462378399939087 ()


### Model fine-tuned with 3000 sentences (and non-frozen embeddings)





In [93]:
# We load the base model afresh every time before fine-tuning
# to make sure we never fine-tune an already fine-tuned model
model = BertForTokenClassification.from_pretrained('bert-base-german-cased', num_labels=6)

loading configuration file https://huggingface.co/bert-base-german-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/98877e98ee76b3977d326fe4f54bc29f10b486c317a70b6445ac19a0603b00f0.1f2afedb22f9784795ae3a26fe20713637c93f50e2c99101d952ea6476087e5e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_e

In [94]:
# Run this cell in order to ensure unfrozen parameters
for param in model.base_model.parameters():
    param.requires_grad = True

In [95]:
args = TrainingArguments(
    num_train_epochs=N_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False, # Use cuda if available
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset_3k
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [96]:
trainer.train()

***** Running training *****
  Num examples = 3000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1875


Step,Training Loss
500,0.2796
1000,0.1459
1500,0.0862


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1000
Configuration saved in results/checkpoint-1000/config.json
Model weights saved in results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1500
Configuration saved in results/checkpoint-1500/config.json
Model weights saved in results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in results/checkpoint-1500/special_tokens_map.json


Training complet

TrainOutput(global_step=1875, training_loss=0.14558352457682291, metrics={'train_runtime': 889.5117, 'train_samples_per_second': 16.863, 'train_steps_per_second': 2.108, 'total_flos': 3919593093120000.0, 'train_loss': 0.14558352457682291, 'epoch': 5.0})

In [98]:
predictions = trainer.predict(test_dataset_2k)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 8


In [99]:
evaluate_f1(predictions)

Micro F1: 0.9301709216789268 ()
Macro F1: 0.4655650933995753 ()


### Model fine-tuned with 3000 sentences (and frozen embeddings)

In [25]:
# We load the base model afresh every time before fine-tuning
# to make sure we never fine-tune an already fine-tuned model
model = BertForTokenClassification.from_pretrained('bert-base-german-cased', num_labels=6)

Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-b

In [26]:
# Run this cell to freeze all parameters of the BERT base model,
# so that only the token classification head on top is trained
for param in model.base_model.parameters():
    param.requires_grad = False

In [28]:
args = TrainingArguments(
    num_train_epochs=N_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # Use cuda if available
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset_3k
)

In [29]:
trainer.train()

***** Running training *****
  Num examples = 3000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1875


Step,Training Loss
500,1.0423
1000,0.6425
1500,0.5946


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1000
Configuration saved in results/checkpoint-1000/config.json
Model weights saved in results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to results/checkpoint-1500
Configuration saved in results/checkpoint-1500/config.json
Model weights saved in results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in results/checkpoint-1500/special_tokens_map.json


Training complet

TrainOutput(global_step=1875, training_loss=0.7233329345703124, metrics={'train_runtime': 318.6046, 'train_samples_per_second': 47.08, 'train_steps_per_second': 5.885, 'total_flos': 3919593093120000.0, 'train_loss': 0.7233329345703124, 'epoch': 5.0})

In [30]:
predictions = trainer.predict(test_dataset_2k)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 8


In [31]:
evaluate_f1(predictions)

Micro F1: 0.9047567214542288 ()
Macro F1: 0.19025492832823093 ()
