## COM S 579 BERT MODEL
Goal for the third two-week period: detect the type of each named entity (NE). For example, CoNLL has 5 types: non-NE (O), PERson, ORGanization, LOCation, and MISCellaneous. First, obtain a contextual embedding for each token using a BERT-like model. Then train a neural network, shared by all tokens, to predict the type of each token from its BERT embedding. There are several strategies: a single 9-class classification (B/I for each of the 4 entity types, plus O), or two separate NNs, where the first predicts B/I/O and the second predicts the 5 types.

Team group: Mario Mastrandrea, Yuting Yang

Python version: Python 3.10.5 64-bit

### 1. Import CONLL2003

In [1]:
from datasets import load_dataset
dataset = load_dataset("conll2003")

Found cached dataset conll2003 (/Users/mariomastrandrea/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98)



In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [3]:
dataset['train'][0]['ner_tags']

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [4]:
"""NER label names in CONLL2003"""
label_names = dataset['train'].features['ner_tags'].feature.names
label_names

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
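The goal statement mentions a two-stage alternative to the 9-class setup. A minimal sketch of how the 9 tag ids could be decomposed into a boundary label (B/I/O) and an entity type, and merged back, is given here. The helper names `split_tag` and `merge_tag` are hypothetical and are not used elsewhere in this notebook.

```python
# Tag names in the order printed above for CoNLL-2003.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def split_tag(tag_id):
    """Split a 9-class tag id into (boundary, entity type), e.g. 3 -> ('B', 'ORG')."""
    name = label_names[tag_id]
    if name == 'O':
        return 'O', 'O'
    boundary, entity_type = name.split('-')
    return boundary, entity_type

def merge_tag(boundary, entity_type):
    """Inverse of split_tag: rebuild the 9-class tag id from the two stage outputs."""
    if boundary == 'O' or entity_type == 'O':
        return label_names.index('O')
    return label_names.index(f'{boundary}-{entity_type}')

ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]  # tags of the first training sentence
boundaries, entity_types = zip(*(split_tag(t) for t in ner_tags))
print(boundaries)    # targets for a first-stage B/I/O classifier
print(entity_types)  # targets for a second-stage type classifier
```

The rest of this notebook follows the single 9-class strategy, so these helpers only illustrate what the targets of the two separate networks would look like.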

In [5]:
"""Check the correspondence among tokens, NER tags in string, and NER tags in number"""
tokens = dataset['train'][0]['tokens']
labels = dataset['train'][0]['ner_tags']
line1 = ""
line2 = ""
line3 = ""
for token, label in zip(tokens, labels):
    full_label = label_names[label]
    # print(label, full_label)
    max_length = max(len(token), len(full_label))  
    line1 += token + " " * (max_length - len(token) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)
    line3 += str(label) + " " * (max_length)
print(f'Tokens:      {line1}')

print(f'NER tag no.: {line3}')
print(f'NER tag:     {line2}')


Tokens:      EU    rejects German call to boycott British lamb . 
NER tag no.: 3     0       7      0    0  0       7       0    0 
NER tag:     B-ORG O       B-MISC O    O  O       B-MISC  O    O 


### 2. Create *Tokenizer* object

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [49]:
inputs = tokenizer(dataset['train'][0]['tokens'], is_split_into_words=True)
# print(inputs.tokens())
# print(inputs['input_ids'])
# print(dataset['train'][0]['ner_tags'])
# print(inputs.word_ids())
new_labels = align_labels_with_tokens(dataset['train'][0]['ner_tags'], inputs.word_ids())


tokens_line = []
input_ids_line = []
new_labels_line = []

# Pad the three fields of each column to a common width so the rows line up
for token, input_id, new_label in zip(inputs.tokens(), inputs['input_ids'], new_labels):
    token = str(token)
    input_id = str(input_id)
    new_label = str(new_label)
    max_length = max(len(token), len(input_id), len(new_label))

    tokens_line.append(token.ljust(max_length))
    input_ids_line.append(input_id.ljust(max_length))
    new_labels_line.append(new_label.ljust(max_length))

print("  str tokens:   " + " ".join(tokens_line))
print("  int tokens:   " + " ".join(input_ids_line))
print("aligned tags:   " + " ".join(new_labels_line))

  str tokens:   [CLS] EU   rejects German call to   boycott British la   ##mb  .   [SEP]
  int tokens:   101   7270 22961   1528   1840 1106 21423   1418    2495 12913 119 102  
aligned tags:   -100  3    0       7      0    0    0       7       0    0     0   -100 


In [8]:
inputs['label'] = [-100, 7, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [31]:
inputs

{'input_ids': [101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### 3. Tokenize CONLL2003 training set and align labels with tokens

In [44]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            label = -100 if word_id is None else labels[word_id] 
            new_labels.append(label)
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id] 
            # This token continues the same word as the previous token, so it
            # must not start a new entity. In this tag set every B-XXX id is
            # odd and the matching I-XXX id is the next even number, so a
            # B-XXX label is converted to I-XXX by adding 1. I-XXX and O
            # labels are already even and are left unchanged.
            # print(f'before: {label}')
            if label % 2 == 1:
                label += 1
            # print(f'after: {label}')
            new_labels.append(label)
    return new_labels
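As a quick check of the alignment rule, the sketch below repeats the function so it runs on its own and applies it to hand-written `word_ids`. The first list is an assumption that mirrors the sub-word split shown in Section 2 ("lamb" becomes `la` and `##mb`); the second is a made-up word split into three pieces.

```python
def align_labels_with_tokens(labels, word_ids):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First sub-word of a new word (or a special token)
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            new_labels.append(-100)
        else:
            # Continuation sub-word: turn B-XXX (odd id) into I-XXX (next even id)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels

# "EU rejects German call to boycott British lamb ." with "lamb" split in two
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
labels = [3, 0, 7, 0, 0, 0, 7, 0, 0]
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)

# A B-PER word (id 1) split into three sub-words, followed by an O word
aligned_split = align_labels_with_tokens([1, 0], [None, 0, 0, 0, 1, None])
print(aligned_split)
```

The second call shows the B-to-I conversion: only the first sub-word keeps `B-PER` (1), and the remaining pieces receive `I-PER` (2).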

In [11]:
labels = dataset['train'][10]['ner_tags']
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[7, 0, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 7, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [12]:
def tokenize_and_align_labels(examples):
    # Tokenize a batch of pre-split sentences into sub-word tokens
    tokenized_inputs = tokenizer(
        examples['tokens'], truncation = True, is_split_into_words = True
    )
    all_labels = examples['ner_tags']
    new_labels = []
    for i, labels in enumerate(all_labels):
        # word_ids(i) maps each sub-word token of sentence i to its source word
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs['labels'] = new_labels
    return tokenized_inputs
    

In [13]:
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched = True,
    remove_columns = dataset['train'].column_names
)

Loading cached processed dataset at /Users/mariomastrandrea/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-77a1416803abdc9a.arrow
Loading cached processed dataset at /Users/mariomastrandrea/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/9a4d16a94f8674ba3466315300359b0acd891b68b6c8743ddf60b9c702adce98/cache-e4239e5b11dd4f73.arrow



### 4. Set up the data collator, metrics, and pre-trained model

In [14]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer = tokenizer)

In [15]:
batch = data_collator([tokenized_dataset['train'][i] for i in range(2)])
batch['labels']

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
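The second row above is padded with -100 up to the length of the longest sequence in the batch. -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions do not contribute to training. A minimal pure-Python sketch of that padding step follows; `pad_labels` is a hypothetical helper that mimics the effect and is not the collator's actual implementation.

```python
def pad_labels(label_lists, pad_id=-100):
    """Right-pad every label list with pad_id to the longest length in the batch."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in label_lists]

batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],  # 12 tokens
    [-100, 1, 2, -100],                          # 4 tokens
]
padded = pad_labels(batch_labels)
for row in padded:
    print(row)
```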

In [16]:
import evaluate
metric = evaluate.load('seqeval')
import numpy as np
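`seqeval` reports precision, recall, and F1 at the entity level: a predicted entity counts only if its type and its full span both match. Token accuracy can therefore be high while F1 is low, as in the low-performance run later in this notebook. The sketch below is a simplified pure-Python illustration of that idea, not seqeval's implementation; `extract_spans` and `entity_scores` are hypothetical names.

```python
def extract_spans(tags):
    """Return a set of (type, start, end) entity spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        # A B- tag, or an I- tag with no open span, starts a new span
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return spans

def entity_scores(true_tags, pred_tags):
    """Entity-level precision, recall, and F1 for a single sentence."""
    true_spans = extract_spans(true_tags)
    pred_spans = extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
pred_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "O", "O", "O"]  # misses one entity

precision, recall, f1 = entity_scores(true_tags, pred_tags)
token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(precision, recall, f1, token_accuracy)
```

Here one wrong token out of nine lowers recall to 2/3, while token accuracy stays at 8/9.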

In [17]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # logits has shape (batch, seq_len, num_labels): pick the best class per token
    predictions = np.argmax(logits, axis=2)

    # Skip positions labeled -100 (special tokens and padding) and map ids to tag names
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
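The `axis` argument of `np.argmax` matters in `compute_metrics`. Token-classification logits have shape (batch, seq_len, num_labels), so the class dimension is axis 2. Reducing over axis 1 instead returns sequence positions, which can exceed the 9 available tag names. That is one way an `IndexError: list index out of range` can arise when indexing `label_names`. The sketch below uses made-up logits to show the difference.

```python
import numpy as np

label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
               'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

batch, seq_len, num_labels = 2, 12, len(label_names)
logits = np.zeros((batch, seq_len, num_labels))
logits[0, 11, 0] = 1.0  # made-up score: last position is highest for class 0

# Correct: reduce over the class dimension (axis=-1 selects the same dimension here)
per_token = np.argmax(logits, axis=2)
print(per_token.shape, per_token.max())  # one class id per token, always < 9

# Wrong: reduce over the sequence dimension, yielding positions instead of class ids
per_class = np.argmax(logits, axis=1)
print(per_class.shape, per_class.max())  # values can reach seq_len - 1

raised = False
try:
    label_names[per_class[0, 0]]  # position 11 is not a valid tag id
except IndexError as err:
    raised = True
    print("IndexError:", err)
```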

In [18]:
id2label = {i: label for i, label in enumerate(label_names)} # the correspondence between label names and label ids
label2id = {n: m for m, n in id2label.items()}
id2label
label2id

{'O': 0,
 'B-PER': 1,
 'I-PER': 2,
 'B-ORG': 3,
 'I-ORG': 4,
 'B-LOC': 5,
 'I-LOC': 6,
 'B-MISC': 7,
 'I-MISC': 8}

In [19]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(
     'bert-base-cased',
     num_labels = 9,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [20]:
from transformers import AutoModelForTokenClassification
model1 = AutoModelForTokenClassification.from_pretrained(
    'bert-base-cased',  # cased checkpoint: upper- and lowercase letters are kept distinct
    id2label = id2label,
    label2id = label2id,
)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

### 5. Fine-tune the model

In [21]:
from transformers import TrainingArguments
args = TrainingArguments(
    'bert-finetuned-ner',
    evaluation_strategy = 'epoch',
    # save_strategy = 'epoch',
    per_device_train_batch_size= 16,
    per_device_eval_batch_size= 16,
    learning_rate = 2e-5,
    num_train_epochs = 3,
    weight_decay = 0.01,
)
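The step counts that appear in the training logs can be checked against these settings. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept, which matches the logged values; `total_optimization_steps` and `num_batches` are hypothetical helper names.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for single-device training without gradient accumulation."""
    return num_batches(num_examples, batch_size) * num_epochs

# 14041 training sentences, batch size 16, 3 epochs
print(total_optimization_steps(14041, 16, 3))  # 2634

# 3250 validation sentences, batch size 16
print(num_batches(3250, 16))  # 204
```

Halving the batch size to 8 doubles the count to 5268 steps for 3 epochs, the figure reported in the earlier attempts.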

In [22]:
from transformers import Trainer
trainer = Trainer(
    model = model,
    args = args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    tokenizer = tokenizer,
)

In [24]:
"""
__Fifth Try__
model
axis = 2 
num_train_epochs = 3
per_device_train_batch_size = 16
per_device_eval_batch_size = 16
"""
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2634
  Number of trainable parameters = 107726601



Saving model checkpoint to bert-finetuned-ner/checkpoint-500
Configuration saved in bert-finetuned-ner/checkpoint-500/config.json


{'loss': 0.1664, 'learning_rate': 1.604403948367502e-05, 'epoch': 0.57}


Model weights saved in bert-finetuned-ner/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-500/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 16



{'eval_loss': 0.0627116933465004, 'eval_precision': 0.9042239685658153, 'eval_recall': 0.9294850218781555, 'eval_f1': 0.9166804979253111, 'eval_accuracy': 0.9818096191205039, 'eval_runtime': 337.2949, 'eval_samples_per_second': 9.635, 'eval_steps_per_second': 0.605, 'epoch': 1.0}


Saving model checkpoint to bert-finetuned-ner/checkpoint-1000
Configuration saved in bert-finetuned-ner/checkpoint-1000/config.json


{'loss': 0.0737, 'learning_rate': 1.2247532270311316e-05, 'epoch': 1.14}


Model weights saved in bert-finetuned-ner/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-1000/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to bert-finetuned-ner/checkpoint-1500
Configuration saved in bert-finetuned-ner/checkpoint-1500/config.json


{'loss': 0.0445, 'learning_rate': 8.45102505694761e-06, 'epoch': 1.71}


Model weights saved in bert-finetuned-ner/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-1500/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-1500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 16



{'eval_loss': 0.05602369084954262, 'eval_precision': 0.9278043945151164, 'eval_recall': 0.9451363177381353, 'eval_f1': 0.9363901625677366, 'eval_accuracy': 0.9853711661859069, 'eval_runtime': 366.0124, 'eval_samples_per_second': 8.879, 'eval_steps_per_second': 0.557, 'epoch': 2.0}


Saving model checkpoint to bert-finetuned-ner/checkpoint-2000
Configuration saved in bert-finetuned-ner/checkpoint-2000/config.json


{'loss': 0.0333, 'learning_rate': 4.6545178435839035e-06, 'epoch': 2.28}


Model weights saved in bert-finetuned-ner/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-2000/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to bert-finetuned-ner/checkpoint-2500
Configuration saved in bert-finetuned-ner/checkpoint-2500/config.json


{'loss': 0.0261, 'learning_rate': 8.580106302201975e-07, 'epoch': 2.85}


Model weights saved in bert-finetuned-ner/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-2500/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-2500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3250
  Batch size = 16





Training completed. Do not forget to share your model on huggingface.co/models =)




{'eval_loss': 0.05379325523972511, 'eval_precision': 0.9292079207920793, 'eval_recall': 0.9476607202961965, 'eval_f1': 0.9383436093984335, 'eval_accuracy': 0.9859598516512628, 'eval_runtime': 415.2941, 'eval_samples_per_second': 7.826, 'eval_steps_per_second': 0.491, 'epoch': 3.0}
{'train_runtime': 18738.9061, 'train_samples_per_second': 2.248, 'train_steps_per_second': 0.141, 'train_loss': 0.06635913273196557, 'epoch': 3.0}


TrainOutput(global_step=2634, training_loss=0.06635913273196557, metrics={'train_runtime': 18738.9061, 'train_samples_per_second': 2.248, 'train_steps_per_second': 0.141, 'train_loss': 0.06635913273196557, 'epoch': 3.0})

In [None]:
"""
__Fourth Try__
model
axis = -1 
num_train_epochs = 1
__SUCCESSFUL OUTPUT BUT LOW PERFORMANCE__
"""
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1756
  Number of trainable parameters = 108898569



{'loss': 0.7376, 'learning_rate': 1.4305239179954442e-05, 'epoch': 0.28}
{'loss': 0.4935, 'learning_rate': 8.610478359908885e-06, 'epoch': 0.57}
{'loss': 0.4375, 'learning_rate': 2.9157175398633257e-06, 'epoch': 0.85}


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8



Saving model checkpoint to bert-finetuned-ner/checkpoint-1756
Configuration saved in bert-finetuned-ner/checkpoint-1756/config.json


{'eval_loss': 0.38606491684913635, 'eval_precision': 0.4584942084942085, 'eval_recall': 0.39969707169303265, 'eval_f1': 0.42708146016903437, 'eval_accuracy': 0.8789221169129334, 'eval_runtime': 132.737, 'eval_samples_per_second': 24.485, 'eval_steps_per_second': 3.066, 'epoch': 1.0}


Model weights saved in bert-finetuned-ner/checkpoint-1756/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-1756/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-1756/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 3100.5381, 'train_samples_per_second': 4.529, 'train_steps_per_second': 0.566, 'train_loss': 0.5357341375329229, 'epoch': 1.0}


TrainOutput(global_step=1756, training_loss=0.5357341375329229, metrics={'train_runtime': 3100.5381, 'train_samples_per_second': 4.529, 'train_steps_per_second': 0.566, 'train_loss': 0.5357341375329229, 'epoch': 1.0})

In [None]:
"""
__Third Try__
model1
axis = -1 
excluded 'f1' and 'accuracy' from the evaluation metrics
num_train_epochs = 3
__ERROR__
"""
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268
  Number of trainable parameters = 107726601



{'loss': 0.0221, 'learning_rate': 4.768413059984815e-06, 'epoch': 0.28}
{'loss': 0.0217, 'learning_rate': 2.8701594533029615e-06, 'epoch': 0.57}
{'loss': 0.0216, 'learning_rate': 9.719058466211087e-07, 'epoch': 0.85}


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8


IndexError: list index out of range

In [None]:
"""
__Second Try__
model1
axis = -1 
num_train_epochs = 3
__ERROR__
"""
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268
  Number of trainable parameters = 107726601



{'loss': 0.0441, 'learning_rate': 1.143507972665148e-05, 'epoch': 0.28}
{'loss': 0.0438, 'learning_rate': 9.53682611996963e-06, 'epoch': 0.57}
{'loss': 0.0384, 'learning_rate': 7.638572513287777e-06, 'epoch': 0.85}


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8


IndexError: list index out of range

In [None]:
"""
__First Try__
model1
axis = 1
num_train_epochs = 3
__ERROR__
"""
trainer.train()

***** Running training *****
  Num examples = 14041
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5268
  Number of trainable parameters = 107726601



You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.2574, 'learning_rate': 1.810174639331815e-05, 'epoch': 0.28}
{'loss': 0.098, 'learning_rate': 1.6203492786636296e-05, 'epoch': 0.57}
{'loss': 0.089, 'learning_rate': 1.4305239179954442e-05, 'epoch': 0.85}


***** Running Evaluation *****
  Num examples = 3250
  Batch size = 8



IndexError: list index out of range