## Italian-Language BERT Transformer Model Sample

<b>Italian NER variant: Fine-tune a pretrained BERT model on the Hugging Face polyglot_ner dataset to perform named entity recognition. The dataset covers multiple languages; for this task, only one language (Italian) is selected to fine-tune the BERT model. Fine-tuning is performed three times: once on 1000 sentences, once on 3000 sentences, and once on 3000 sentences with frozen embeddings.</b>

In [1]:
# Import the main tools for the task (copied from a previous exercise)
## Some imports are kept just in case and may be unused

# Import the necessary packages and libraries
import os
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
import warnings

In [None]:
## Running the basic installation:
#!pip install datasets

In [4]:
## First we experiment with datasets:
## Main instruments needed to retrieve Huggingface datasets in convenient manner:
# Source: https://huggingface.co/docs/datasets/load_hub
# Source: https://huggingface.co/docs/datasets/index
from datasets import load_dataset

#Ignore filterwarnings
warnings.filterwarnings('ignore') 

# Print all the available datasets
from huggingface_hub import list_datasets

#Check all the datasets available:
#print([dataset.id for dataset in list_datasets()])

In [5]:
# Check the data type:
#print(type([dataset.id for dataset in list_datasets()]))

<class 'list'>


In [6]:
## Interested in the polyglot-ner dataset (https://huggingface.co/datasets/polyglot_ner)
# Find it by matching directly:
datasetlist = [dataset.id for dataset in list_datasets()]
for elem in datasetlist:
    if elem == 'polyglot_ner':
        print('polyglot_ner is here!')
        
## Yes, polyglot-ner dataset is available:

polyglot_ner is here!


<b>We want to fine-tune the BERT model on one language of the dataset that fulfills the following requirements: (1) it is not English, (2) a pretrained BERT-base model already exists for it, and (3) it contains at least 7k sentences. Italian is the candidate; the following code blocks check that it meets these conditions.</b>

In [8]:
# Load the dataset
## Request only the Italian ('it') configuration so the other languages are not loaded.
## Note: this call downloads the subset; streaming=True could be passed to load_dataset to avoid that.
polyglot_ner_mainset = load_dataset('polyglot_ner', 'it')

In [10]:
## Check attributes and data:
print(polyglot_ner_mainset)

## Italian has around 378K sentences, and the result seems to match!

DatasetDict({
    train: Dataset({
        features: ['id', 'lang', 'words', 'ner'],
        num_rows: 378325
    })
})


In [11]:
## Save it. Note: save_to_disk writes an Arrow dataset directory, not a CSV file, despite the ".csv" name:
polyglot_ner_mainset.save_to_disk("it.csv")

Saving the dataset (0/1 shards):   0%|          | 0/378325 [00:00<?, ? examples/s]

<b>The "it" (Italian) subset, with 378325 sentences, exceeds the 7k-sentence requirement and will be used to fine-tune the pretrained BERT model (Source: https://huggingface.co/dbmdz/bert-base-italian-cased). Next, the data will be prepared.</b>

In [24]:
## Now we check the mainset in Italian:
# Print the NER tags of the first sentence (all 'O', i.e. no named entities):
print(polyglot_ner_mainset['train'][0]['ner'])

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [15]:
## Import BertForTokenClassification, the BERT tokenizers, and the pipeline helper:
# Bring back the DataLoader and Dataset builders as done in the previous CNN and RNN assignments:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
# tqdm is needed to display progress bars
from tqdm import tqdm
from datasets import load_dataset, load_metric
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    BertForTokenClassification,
    BertTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    pipeline,
)

In [16]:
# Load the Italian tokenizer
tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-italian-cased')

In [17]:
# Function to tokenize and align labels
def tokenize_and_align_labels(examples):
    # Truncate long sentences and pad short ones so every example has exactly max_length (128) tokens
    tokenized_inputs = tokenizer(examples['words'], truncation=True, padding='max_length', is_split_into_words=True, max_length=128)
    
    labels = []
    for i, label in enumerate(examples['ner']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Use label dictionary to convert string labels to integer
                label_ids.append(label_to_id[label[word_idx]])
            else:
                # For subwords/wordpieces, set label to -100 (ignored in loss step)
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
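
The alignment rule inside `tokenize_and_align_labels` can be tried in isolation, without loading the tokenizer. The following is a minimal sketch, assuming a hand-written `word_ids` list of the kind a fast tokenizer returns (`None` for special and padding tokens, a repeated index for continuation word pieces). The helper name `align_labels` and the example split of "capitale" into two pieces are illustrative assumptions, not output from the real tokenizer.

```python
def align_labels(word_ids, word_labels, label_to_id, ignore_index=-100):
    """Give each word's label to its first sub-token and ignore everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special/padding token, or a continuation word piece: skipped by the loss
            aligned.append(ignore_index)
        else:
            # First sub-token of a new word: use that word's label id
            aligned.append(label_to_id[word_labels[word_idx]])
        previous_word_idx = word_idx
    return aligned


# Mapping as printed by the notebook
example_label_to_id = {'LOC': 0, 'ORG': 1, 'O': 2, 'PER': 3}

# Hypothetical tokenization of ["Roma", "capitale"]:
# [CLS], "Roma", "capit", "##ale", [SEP], [PAD]
example_word_ids = [None, 0, 1, 1, None, None]
example_labels = ['LOC', 'O']

aligned = align_labels(example_word_ids, example_labels, example_label_to_id)
print(aligned)
```

Only positions 1 and 2 carry a real label id here; every other position is set to -100 and therefore contributes nothing to the loss or to the metrics.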

In [25]:
# Create a label_to_id dictionary for index creation:
label_to_id = {label: i for i, label in enumerate(set([lbl for sublist in polyglot_ner_mainset['train']['ner'] for lbl in sublist]))}

In [26]:
## Check the dictionary of 4 elems: location being 0, organization 1, 'O' at 2 and person 3:
print(label_to_id)

{'LOC': 0, 'ORG': 1, 'O': 2, 'PER': 3}
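
Because the mapping above is built from a `set` of strings, the order of the ids is not guaranteed to be the same in every Python session. A minimal sketch of a reproducible alternative is to sort the labels before numbering them. The function and variable names below are illustrative, and the resulting ids differ from the mapping printed above, so a model trained with one mapping should always be evaluated with that same mapping.

```python
def build_label_maps(tag_sequences):
    """Build a sorted, reproducible label -> id map and its inverse."""
    labels = sorted({tag for sequence in tag_sequences for tag in sequence})
    label_to_id_sorted = {label: i for i, label in enumerate(labels)}
    id_to_label_sorted = {i: label for label, i in label_to_id_sorted.items()}
    return label_to_id_sorted, id_to_label_sorted


# Small stand-in for the 'ner' column (toy data, not taken from the dataset)
sample_tags = [['O', 'PER', 'O'], ['LOC', 'O', 'ORG']]

stable_label_to_id, stable_id_to_label = build_label_maps(sample_tags)
print(stable_label_to_id)
print(stable_id_to_label)
```

The inverse map is handy later for turning predicted ids back into tag strings.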


In [27]:
# Apply the function to tokenize and align labels
tokenized_it_ner_dataset = polyglot_ner_mainset.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/378325 [00:00<?, ? examples/s]

In [28]:
## Training: ##

In [31]:
# Check the first example of the tokenized dataset (padded and truncated to 128 tokens)
## The output looks good; the dataset is ready for training:
print(tokenized_it_ner_dataset['train'][0])

{'id': '0', 'lang': 'it', 'words': ['Ma', 'tra', 'il', 'prigioniero', 'e', 'la', 'sua', 'carceriera', 'nasce', 'un', 'rapporto', 'malato', 'basato', 'su', 'violenza', 'e', 'amore', ',', 'passione', 'e', 'tortura', 'per', 'un', 'lieto', 'fine', 'azzardato', 'e', 'difficile', 'da', 'digerire', '.'], 'ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'input_ids': [102, 348, 293, 162, 15297, 126, 146, 497, 646, 10631, 218, 7514, 141, 2899, 12338, 5973, 171, 6632, 126, 3711, 1307, 8164, 126, 11760, 156, 141, 12809, 1027, 10201, 863, 112, 126, 2726, 203, 120, 28184, 113, 697, 103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 

<b>Now that the data is tokenized and processed, we load the pretrained model and define a Trainer, as suggested by the instructions.</b>

In [32]:
# We import all the needed instruments for the training, BERT configurations for tokens and arguments:
from transformers import BertForTokenClassification, AutoTokenizer, TrainingArguments, Trainer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-italian-cased')

In [33]:
# Load the pretrained model with a token classification head
num_labels = len(label_to_id)
model = BertForTokenClassification.from_pretrained('dbmdz/bert-base-italian-cased', num_labels=num_labels)

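### The warning below is expected: the token-classification head (classifier.weight, classifier.bias) is newly initialized and gets trained during fine-tuning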

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [34]:
# Establish training args:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

<b>First, we fine-tune the model on the first dataset, made up of only 1000 sentences from the Italian subset.</b>

In [51]:
# Import the metric instruments from SKLearn:
from sklearn.metrics import accuracy_score, f1_score

# Create a subset of the dataset for training (first 1000 sentences)
train_subset_1k = tokenized_it_ner_dataset['train'].select(range(1000))

# Create a subset for testing (next 200 sentences)
test_subset = tokenized_it_ner_dataset['train'].select(range(1000, 1200))

In [52]:
# Function for computing metrics:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Flatten the lists and exclude labels for special tokens (i.e., -100)
    flat_labels = [label for sublist in labels for label in sublist if label != -100]
    flat_preds = [pred for sublist, label_sublist in zip(preds, labels) for pred, label in zip(sublist, label_sublist) if label != -100]

    accuracy = accuracy_score(flat_labels, flat_preds)
    f1_micro = f1_score(flat_labels, flat_preds, average='micro')
    f1_macro = f1_score(flat_labels, flat_preds, average='macro')

    ## Returns multiple key metrics of interest:
    return {
        'accuracy': accuracy,
        'f1_micro': f1_micro,
        'f1_macro': f1_macro,
    }
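
To see what `compute_metrics` measures, the same quantities can be computed by hand on toy data. The sketch below is a dependency-free illustration, not the scikit-learn implementation; the function name and the toy labels are assumptions. It also shows why accuracy and micro F1 coincide in this setup (each kept token has exactly one true and one predicted class, and no class is excluded), while macro F1 drops sharply when a rare class is missed.

```python
def token_metrics(labels, preds, ignore_index=-100):
    """Accuracy and macro F1 over tokens whose label is not ignore_index."""
    pairs = [
        (label, pred)
        for label_row, pred_row in zip(labels, preds)
        for label, pred in zip(label_row, pred_row)
        if label != ignore_index
    ]
    accuracy = sum(label == pred for label, pred in pairs) / len(pairs)

    # Per-class F1 over every class seen in either the labels or the predictions
    classes = sorted({label for label, _ in pairs} | {pred for _, pred in pairs})
    f1_scores = []
    for cls in classes:
        tp = sum(label == cls and pred == cls for label, pred in pairs)
        fp = sum(label != cls and pred == cls for label, pred in pairs)
        fn = sum(label == cls and pred != cls for label, pred in pairs)
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1


# Toy batch using ids 0=LOC, 2=O, 3=PER; -100 marks ignored positions
toy_labels = [[-100, 2, 2, 2, 0, -100], [-100, 2, 2, 3, 2, -100]]
toy_preds = [[2, 2, 2, 2, 2, 2], [2, 2, 2, 3, 2, 2]]

toy_accuracy, toy_macro_f1 = token_metrics(toy_labels, toy_preds)
print(toy_accuracy, round(toy_macro_f1, 4))
```

One missed LOC token out of eight costs little accuracy but sets the LOC F1 to zero, which pulls the macro average down. This is the same pattern as the gap between the Accuracy and F1 Macro columns in the training logs.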

In [53]:
# Initialize the Trainer with the training subset, test subset, and compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_1k,
    eval_dataset=test_subset,  
    compute_metrics=compute_metrics  
)


In [54]:
# Train with 3 epochs:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.082,0.090421,0.96191,0.96191,0.76114
2,0.0528,0.083081,0.968599,0.968599,0.760925
3,0.0543,0.087292,0.966741,0.966741,0.757627


TrainOutput(global_step=189, training_loss=0.06296443198093031, metrics={'train_runtime': 1171.8441, 'train_samples_per_second': 2.56, 'train_steps_per_second': 0.161, 'total_flos': 195976111104000.0, 'train_loss': 0.06296443198093031, 'epoch': 3.0})
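
The `global_step=189` reported above is consistent with the training arguments. A minimal sketch of the arithmetic, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (the function name is illustrative):

```python
import math


def expected_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps when every batch, including a final partial one, is one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps_1k = expected_optimizer_steps(1000, 16, 3)  # 63 batches per epoch
steps_3k = expected_optimizer_steps(3000, 16, 3)  # 188 batches per epoch
print(steps_1k, steps_3k)
```

The second value matches the `global_step=564` reported by the 3000-sentence runs later in the notebook.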

<b>TrainOutput(global_step=189, training_loss=0.16357500180996284, metrics={'train_runtime': 1523.0356, 'train_samples_per_second': 1.97, 'train_steps_per_second': 0.124, 'total_flos': 195976111104000.0, 'train_loss': 0.16357500180996284, 'epoch': 3.0})</b>

In [55]:
# Evaluate and save the model:
eval_results = trainer.evaluate()
print(eval_results)

# Save the model
model.save_pretrained("./italian_bert_model_1k")

{'eval_loss': 0.08729159086942673, 'eval_accuracy': 0.9667409884801189, 'eval_f1_micro': 0.9667409884801189, 'eval_f1_macro': 0.7576269915990677, 'eval_runtime': 26.0691, 'eval_samples_per_second': 7.672, 'eval_steps_per_second': 0.153, 'epoch': 3.0}


<b> Next, train the model with 3000 samples. </b>

In [41]:
# Create a subset of the dataset for training (first 3000 sentences)
train_subset_3k = tokenized_it_ner_dataset['train'].select(range(3000))

# Create a subset for testing (next 200 sentences)
test_subset_3k = tokenized_it_ner_dataset['train'].select(range(3000, 3200))
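
When reading the numbers for this run, it is worth checking whether any evaluation sentence was also seen during training. The sketch below restates the index ranges from the `select` calls as plain `range` objects; the dictionary and helper names are illustrative, not part of the notebook. Within each run the training and evaluation ranges are disjoint, but the evaluation slice of the 1000-sentence run lies inside this run's training slice. That matters only if the same `model` object is carried from one run to the other, which depends on the order in which the cells were executed (the `In [..]` counters in this notebook are not in document order).

```python
# Index ranges copied from the select(...) calls in the notebook
run_splits = {
    '1k': {'train': range(0, 1000), 'eval': range(1000, 1200)},
    '3k': {'train': range(0, 3000), 'eval': range(3000, 3200)},
}


def overlap_size(first, second):
    """Number of sentence indices shared by two index ranges."""
    return len(set(first) & set(second))


# Train/eval overlap inside each run
within_run = {
    name: overlap_size(split['train'], split['eval'])
    for name, split in run_splits.items()
}

# Overlap between the 1k evaluation slice and the 3k training slice
cross_run = overlap_size(run_splits['1k']['eval'], run_splits['3k']['train'])

print(within_run)
print(cross_run)
```

Reloading the pretrained checkpoint before each run, as the frozen-embedding run does, removes this dependency on execution order.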

In [42]:
# Initialize the Trainer with the training subset, test subset, and compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_3k,
    eval_dataset=test_subset_3k,  
    compute_metrics=compute_metrics  
)

In [43]:
# Train again with just 3 epochs:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.0755,0.095812,0.955146,0.955146,0.673774
2,0.0635,0.094063,0.956449,0.956449,0.691412
3,0.0472,0.09686,0.955332,0.955332,0.68615


TrainOutput(global_step=564, training_loss=0.06880511898309627, metrics={'train_runtime': 3434.8856, 'train_samples_per_second': 2.62, 'train_steps_per_second': 0.164, 'total_flos': 587928333312000.0, 'train_loss': 0.06880511898309627, 'epoch': 3.0})

In [44]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

# Save the model
model.save_pretrained("./italian_bert_model_3k")

{'eval_loss': 0.09686033427715302, 'eval_accuracy': 0.9553322166387493, 'eval_f1_micro': 0.9553322166387493, 'eval_f1_macro': 0.6861496257407507, 'eval_runtime': 27.6687, 'eval_samples_per_second': 7.228, 'eval_steps_per_second': 0.145, 'epoch': 3.0}


<b>Lastly, train a model on 3000 sentences again, but with frozen embeddings: the embedding weights of the pretrained model are kept as they are ("frozen"), while the other weights of the model are updated. This is useful with a small dataset and helps to limit overfitting.</b>

In [45]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('dbmdz/bert-base-italian-cased')

# Load the pretrained model 
num_labels = len(label_to_id)
model = BertForTokenClassification.from_pretrained('dbmdz/bert-base-italian-cased', num_labels=num_labels)

# Freeze the embeddings. 
# Source: https://discuss.huggingface.co/t/how-to-freeze-some-layers-of-bertmodel/917
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-italian-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
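
A quick way to confirm that freezing took effect is to count trainable and frozen parameters. The sketch below uses a tiny stand-in module instead of the real BERT checkpoint so that it runs without a download; `TinyTagger` and `count_parameters` are hypothetical names introduced only for this illustration. The same `count_parameters` helper can be called on any `nn.Module`, including the `model` loaded above.

```python
import torch.nn as nn


class TinyTagger(nn.Module):
    """Hypothetical stand-in for a BERT-style tagger: embeddings plus a classifier."""

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(10, 4)  # 10 * 4 = 40 weights
        self.classifier = nn.Linear(4, 3)      # 4 * 3 weights + 3 biases = 15

    def forward(self, token_ids):
        return self.classifier(self.embeddings(token_ids))


def count_parameters(module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen


toy_model = TinyTagger()
before = count_parameters(toy_model)

# Same pattern as in the notebook: switch off gradients for the embeddings only
for param in toy_model.embeddings.parameters():
    param.requires_grad = False

after = count_parameters(toy_model)
print(before, after)
```

Parameters with `requires_grad = False` receive no gradient and are therefore left unchanged by the optimizer, while the remaining layers continue to train.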


In [46]:
# Establish args for the model training:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

In [47]:
# Create a training subset of 3000 sentences (indices 3000-5999, a different slice from the previous 3k run)
train_subset_3k = tokenized_it_ner_dataset['train'].select(range(3000, 6000))
# Create the evaluation subset (next 200 sentences, indices 6000-6199)
test_subset = tokenized_it_ner_dataset['train'].select(range(6000, 6200))

In [48]:
# Final trainer for the 3k with frozen embeds:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_3k,
    eval_dataset=test_subset,
    compute_metrics=compute_metrics  
)

In [49]:
# Train final model:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1 Micro,F1 Macro
1,0.1285,0.076202,0.968985,0.968985,0.687896
2,0.0575,0.078107,0.965221,0.965221,0.693348
3,0.0602,0.078969,0.966475,0.966475,0.7408


TrainOutput(global_step=564, training_loss=0.09692417655853515, metrics={'train_runtime': 3392.3096, 'train_samples_per_second': 2.653, 'train_steps_per_second': 0.166, 'total_flos': 587928333312000.0, 'train_loss': 0.09692417655853515, 'epoch': 3.0})

In [50]:
# Evaluate the last model like the others:
eval_results = trainer.evaluate()
print(eval_results)

# Save the final 3k with freezing:
model.save_pretrained('./italian_bert_model_3k_plus_frozen_embeds')

{'eval_loss': 0.07896889001131058, 'eval_accuracy': 0.9664754392255288, 'eval_f1_micro': 0.9664754392255288, 'eval_f1_macro': 0.7407998544314155, 'eval_runtime': 25.676, 'eval_samples_per_second': 7.789, 'eval_steps_per_second': 0.156, 'epoch': 3.0}
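
The final-epoch evaluation numbers of the three runs can be lined up for a quick comparison. The sketch below copies the epoch-3 values from the logs in this notebook (rounded to six decimals); the dictionary layout and names are illustrative. Each run was evaluated on a different 200-sentence slice, so the ranking is only a rough indication and not a controlled comparison.

```python
# Epoch-3 evaluation values copied from the training logs in this notebook
final_eval = {
    '1k': {'accuracy': 0.966741, 'f1_macro': 0.757627, 'eval_rows': '1000-1199'},
    '3k': {'accuracy': 0.955332, 'f1_macro': 0.686150, 'eval_rows': '3000-3199'},
    '3k_frozen': {'accuracy': 0.966475, 'f1_macro': 0.740800, 'eval_rows': '6000-6199'},
}

# Rank the runs by macro F1, the metric most sensitive to the rare entity classes
ranking = sorted(final_eval, key=lambda name: final_eval[name]['f1_macro'], reverse=True)

for name in ranking:
    row = final_eval[name]
    print(f"{name:<10} acc={row['accuracy']:.4f} f1_macro={row['f1_macro']:.4f} rows={row['eval_rows']}")
```

Evaluating all three saved models on one shared held-out slice would make the comparison more meaningful.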


In [None]:
### FINISHED ###