**`⚠`** <font size="2">**WARNING:** Some of the code lines in this notebook may be cropped out due to display limitations. To view the entire code properly, please click on this [link](https://nbviewer.org/github/nataliakoliou/NLP-Various-Implementations/blob/main/Assignment-3/nlp-3.ipynb) to open the notebook in nbviewer or this [link](https://colab.research.google.com/github/nataliakoliou/NLP-Various-Implementations/blob/main/Assignment-3/nlp-3.ipynb) to open the notebook in Google Colab.</font>

# **NLP-Various Implementations | Sequence Labeling with Pre-trained Language Models**

**Overview:** In this part of the project, I implemented a sequence modeling algorithm focusing on tasks such as named-entity recognition, part-of-speech tagging, and text chunking. I reported the algorithm's performance on the test set in terms of accuracy and macro-average accuracy. Moreover, I analyzed a specific sentence from the test set, highlighting misclassified tokens and correctly tagged tokens. I further evaluated the model's performance on a new sentence from an online newspaper, identifying both correct and incorrect predictions for named entities. I then made some modifications to the original algorithm and compared the results to its initial performance: I froze the weights related to the pre-trained BERT language model, I extended the training data by concatenating the training and validation sets and I replaced the pre-trained BERT languagemodel with the pre-trained roberta-base6 model. Finally, I utilized ChatGPT with both zero-shot and few-shot prompting to perform tasks such as named-entity recognition, part-of-speech tagging, and text chunking. This allowed me to assess the model's ability to generate accurate predictions without explicit training for each specific task, as well as its capacity to improve performance with limited task-specific examples.

In [None]:
!pip install torch
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m94.2 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m29.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m114.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transforme

## **1. Import all the necessary modules**

**Briefly:** `torch` library provides support for deep learning operations using tensors, `pandas` library provides data manipulation and analysis tools and `torch.optim module` provides optimization algorithms for training neural networks. Additionally, `torchtext.vocab.build_vocab_from_iterator` function provides a way to build a vocabulary from an iterator of text using TorchText, `transformers` library includes the BertForTokenClassification and RobertaForTokenClassification models along with their corresponding tokenizers for token classification tasks using pre-trained transformer models, whereas `sklearn.metrics` module provides various metrics such as accuracy_score, balanced_accuracy_score, and classification_report for evaluating model performance. Finally, `tqdm` library provides a progress bar to track loops and iterations.

In [None]:
import torch
import pandas as pd
import torch.optim as optim 
from torchtext.vocab import build_vocab_from_iterator
from transformers import BertForTokenClassification, BertTokenizerFast, RobertaForTokenClassification, RobertaTokenizerFast
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report
import tqdm

## **2. Define and initialize the models' parameters**

We first set up some configurations for our machine learning model. We define EPOCHS as 3, indicating the number of times the model iterates over the dataset during training. BATCH_SIZE is set to 8, determining how many samples are processed together in each iteration. LR is the learning rate (1e-5), controlling the step size for updating the model's parameters during gradient descent. We also use tqdmn, which is a function for displaying a progress bar during training. We then set the base directory path for the project's dataset. Finally, the device variable is assigned "cuda" or "cpu" depending on GPU availability, allowing us to use the GPU if possible for faster computations.

In [None]:
EPOCHS = 3
BATCH_SIZE = 8
LR = 1e-5
tqdmn = tqdm.notebook.tqdm
base_path = '/content/'
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # uses GPU if available

## **3. Load and preprocess the training, validation, testing and example datasets**

The load_data function loads the data from the corresponding dataset text-files. It calls the load_sentences function four times to load the training, testing, validation, and example datasets in a sentence format. It then then returns these output sentences as train_sentences, test_sentences, valid_sentences, and example_sentences.

The load_sentences function reads the contents of a file specified by filepath. It iterates through each line of the file and checks if the line is either '-DOCSTART- -X- -X- O\n' or an empty line ('\n'). If the line matches either condition and tokens list is non-empty, it means that the end of a sentence has been reached. In that case, the function appends the accumulated tokens, POS tags, chunk tags, and NER tags to the sentences list as a dictionary. Each dictionary represents a sentence and contains the corresponding tokens, POS tags, chunk tags, and NER tags. If the line contains some sentence token (carries information), it splits the line into separate elements, extracts the tokens, POS tags, chunk tags, and NER tags, and appends them to their respective lists. The function finally returns the list of sentences.

### **A. BERT-based named entity recognition model**

**A.1:** The load_data function is called, to store the preprocessed data into the variables train_sentences, test_sentences, valid_sentences, and example_sentences.

In [None]:
def load_data(base_path):
    train_sentences = load_sentences(base_path + 'train.txt')
    test_sentences = load_sentences(base_path + 'test.txt')
    valid_sentences = load_sentences(base_path + 'valid.txt')
    example_sentences = load_sentences(base_path + 'example.txt')
    return train_sentences, test_sentences, valid_sentences, example_sentences

def load_sentences(filepath):
    sentences, tokens, pos_tags, chunk_tags, ner_tags = [], [], [], [], []
    with open(filepath, 'r') as f:
        for line in f.readlines():
            if (line == ('-DOCSTART- -X- -X- O\n') or line == '\n'):
                if len(tokens) > 0: # only-case: empty line, which indicates the end of a sentence
                    sentences.append({'tokens': tokens, 'pos_tags': pos_tags, 'chunk_tags': chunk_tags, 'ner_tags': ner_tags})
                    tokens, pos_tags, chunk_tags, ner_tags = [], [], [], []
            else:
                l = line.split(' ')
                tokens.append(l[0])
                pos_tags.append(l[1])
                chunk_tags.append(l[2])
                ner_tags.append(l[3].strip('\n'))
    return sentences

train_sentences, test_sentences, valid_sentences, example_sentences = load_data(base_path)

## **4. Create the tagset within the training dataset**

The create_tagset function takes two parameters: sentences, which represents a list of sentences containing various tags, and tagtype, which specifies the type of tag for which the tagset is being created. First, it extracts the tags of the specified tagtype from each sentence in the sentences list. Next, it utilizes the build_vocab_from_iterator function to build a vocabulary (tagmap) from the tags list. This step creates a mapping between each unique tag and a numerical index, allowing for efficient handling and lookup of tags during subsequent processing. To create the tagset, the function flattens the tags list, converting it into a single-dimensional list. It then creates a set from this flattened list, discarding any duplicate tags. The resulting tagset represents the unique tags of the specified tagtype present in the dataset. The function then prints the size of the tagset, indicating the number of unique tags within the tagset. Finally, it returns the tagmap and tagset as a tuple.

**A.2:** The create_tagset function is called to generate a tagmap and tagset for the Named Entity Recognition (NER) tags within the train_sentences dataset.

In [None]:
def create_tagset(sentences, tagtype):
    tags = [sentence[tagtype] for sentence in sentences]
    tagmap = build_vocab_from_iterator(tags)
    tagset = set([item for sublist in tags for item in sublist])
    print('Tagset size:', len(tagset))
    return tagmap, tagset

tagmap, tagset = create_tagset(train_sentences, 'ner_tags')

Tagset size: 9


## **5. Initialize the model and count its parameters**

The initialize_model function first calls the setup function, passing the model_name, tagset, and update as arguments. The setup function checks the model_name and initializes the corresponding tokenizer and token classification model (either BERT or RoBERTa). It also allows the option to update or freeze the weights of the pre-trained model based on the update parameter. The model and tokenizer objects are returned from the setup function.

Next, the model is moved to the specified device (e.g., GPU). An AdamW optimizer is created, taking the model parameters and learning rate (lr) as inputs. The count_parameters function is then called to count and display the total number of parameters, trainable parameters, and frozen parameters in the model. The function returns the model, tokenizer, and optimizer.

**A.3:** The initialize_model function is called to initializes a BERT-based token classification model. The inputs provided are "bert" (indicating the model type), a tagset (representing classification classes), a device (specifying the computational device), a learning rate (determining optimization step size), and a boolean flag (for updating parameters). The returned values, namely the model, tokenizer, and optimizer, are then assigned to their respective variables.

In [None]:
def initialize_model(model_name, tagset, device, lr, update):
    model, tokenizer = setup(model_name, tagset, update)
    model.to(device)
    optimizer = optim.AdamW(params=model.parameters(), lr=lr)
    count_parameters(model)
    return model, tokenizer, optimizer

def setup(model_name, tagset, update):
    if model_name == "bert":
        tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
        model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(tagset))
        model.bert.requires_grad_(update)  # updates (True) or freezes (False) the weights of the pre-trained BERT model
    elif model_name == "roberta":
        tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', add_prefix_space=True)
        model = RobertaForTokenClassification.from_pretrained('roberta-base', num_labels=len(tagset))
        model.roberta.requires_grad_(update)  # updates (True) or freezes (False) the weights of the pre-trained Roberta model
    return model, tokenizer

def count_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen_params = total_params - trainable_params
    data = [{'Total': total_params, 'Trainable': trainable_params, 'Frozen': frozen_params}]
    df = pd.DataFrame(data, index=['Parameters'])
    display(df)

model, tokenizer, optimizer = initialize_model("bert", tagset, device, LR, True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Unnamed: 0,Total,Trainable,Frozen
Parameters,108898569,108898569,0


## **6. Encode the training, validation, testing and example datasets**

The encode_data function applies the encode_sentence function to each sentence in the training, validation, testing, and example datasets. It creates and returns new datasets, with each sentence encoded and mapped to corresponding labels using the encode_sentence function.

The encode_sentence function takes a tag mapping, tokenizer, a sentence, and a tag type as inputs. It uses the tokenizer to encode the sentence's tokens, truncating and padding them as needed. It then aligns the labels for the tokens based on the provided tag mapping and the original sentence's labels. The function returns a dictionary containing the input IDs, attention mask, and label IDs, all converted to Torch tensors.

The align_label function is a helper function used within encode_sentence to align the labels with the encoded tokens. It iterates over the token word IDs, mapping the corresponding labels to their respective IDs using the provided tag mapping. It handles cases where tokens represent special or padding tokens, new words or entities, or consecutive tokens belonging to the same word or entity. The function returns a list of label IDs aligned with the tokens.

> In the align_label function, the label ID of -100 is assigned to consecutive tokens in the sequence to indicate that they belong to the same word or entity. This is done to handle the alignment between the original labels and the tokenized tokens. For example, when encountering "1-2" in the original text file, which the tokenizer splits into three tokens "1", "-", and "2", it becomes essential to indicate their shared entity. By assigning a label ID of -100 to the last two consecutive tokens, the model can recognize their relationship with the first token, forming a coherent entity. Tokens with a label ID of -100 are excluded from the training process, enabling the model to focus on individual tokens representing distinct words or entities rather than treating consecutive tokens as separate entities.

**A.4:** The encode_data function is called with these inputs: tagmap (tag mapping), tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, and 'ner_tags' (tag type). The returned datasets represent the encoded versions of the training, validation, testing, and example sentences, respectively.

In [None]:
def encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, tagtype):
    train_dataset = [encode_sentence(tagmap, tokenizer, sentence, tagtype) for sentence in train_sentences]
    valid_dataset = [encode_sentence(tagmap, tokenizer, sentence, tagtype) for sentence in valid_sentences]
    test_dataset = [encode_sentence(tagmap, tokenizer, sentence, tagtype) for sentence in test_sentences]
    example_dataset = [encode_sentence(tagmap, tokenizer, sentence, tagtype) for sentence in example_sentences]
    return train_dataset, valid_dataset, test_dataset, example_dataset

def encode_sentence(tagmap, tokenizer, sentence, tagtype):
    encodings = tokenizer(sentence['tokens'], truncation=True, padding='max_length', is_split_into_words=True)
    labels = align_label(tagmap, encodings, sentence[tagtype])
    return { 'input_ids': torch.LongTensor(encodings.input_ids), 'attention_mask': torch.LongTensor(encodings.attention_mask), 'labels': torch.LongTensor(labels) }

def align_label(tagmap, tokens, labels):
    word_ids = tokens.word_ids()
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:  # case: special or padding token
            label_ids.append(-100)
        elif word_idx != previous_word_idx: #  case: token represents a new word or entity
            try:
                label_ids.append(tagmap[labels[word_idx]])
            except:
                label_ids.append(-100)
        else: # case: consecutive tokens in the sequence belong to the same word or entity
            label_ids.append(-100)
        previous_word_idx = word_idx
    return label_ids

train_dataset, valid_dataset, test_dataset, example_dataset = encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, 'ner_tags')

## **7. Create data loaders for the training, validation, testing and example datasets**

The create_loaders function takes train_dataset, valid_dataset, test_dataset, example_dataset, and batch_size as inputs. It utilizes the torch.utils.data.DataLoader class to create data loaders for each dataset with the specified batch size. The train_loader and test_loader are shuffled randomly, while the valid_loader and example_loader are not.

**A.5:** The create_loaders function is called with specific inputs (e.g. BATCH_SIZE = 8) and assigns the returned data loaders to corresponding variables: train_loader, valid_loader, test_loader, and example_loader.

In [None]:
def create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, batch_size):
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=batch_size)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
    example_loader = torch.utils.data.DataLoader(example_dataset, batch_size=batch_size)
    return train_loader, valid_loader, test_loader, example_loader

train_loader, valid_loader, test_loader, example_loader = create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, BATCH_SIZE)

## **8. Train and evaluate the classification model**

The train_model function takes these inputs: model (the neural network model), optimizer, train_loader (data loader for training data), valid_loader (data loader for validation data), tqdmn (a progress bar), device (the device on which the model is trained), epochs (number of training epochs), tagmap (tag mapping), and tokenizer. Within this function, a training loop is executed for the specified number of epochs. The model is set to training mode and batches of data are iterated through the train_loader. The model's predictions are computed using the provided batch, and the loss is calculated. The optimizer's gradients are then zeroed, the loss is backpropagated and the optimizer's step is performed. If a valid_loader is provided, the evaluate_model function is called to evaluate the model's performance on the validation data. The function returns the actual labels (Y_actual) and predicted labels (Y_preds). Finally, the train_model function returns the trained model.

The evaluate_model function iterates through the provided data loader and computes predictions for each batch. The true and predicted values are extracted, considering only valid tokens (true_values_all != -100). These values are stored in Y_actual and Y_preds, respectively. Optionally, misclassifications can be detected and displayed using the detect_misclassification function. After evaluating the model, the display_results function is called to display the accuracy, macro-accuracy, and classification report for the evaluated data type (e.g., "Validation"). The tagmap is used to map the label indices to their corresponding tags.

The detect_misclassification function first checks if the number of true values (the true labels for the valid tokens) is greater than or equal to 10 and if the true values are not equal to the predicted values. If both conditions are met, it proceeds to create a sentence-list of these tokens using the tokenizer. Then, it creates a data dictionary containing information about each token of the sentence, including the token itself, whether it was misclassified (based on the comparison between true and predicted labels), its true label value, and its predicted label value. This data dictionary is then used to create a pandas DataFrame to display the information in a tabular format. Finally, the function returns True to indicate that a misclassification was detected - this ensures that detect_misclassification won't be called twice in evaluate_model function.

The display_results function displays the evaluation results for a given dataset. It takes inputs such as the type of data, the actual labels, the predicted labels and a tagmap for label mappings. Based on the data_type, it determines the appropriate label for the dataset (e.g., "Test", "Validation", or "Example"). It then prints the accuracy, macro-accuracy, and classification report for the evaluation results. The classification report includes metrics such as precision, recall, and F1-score for each class in the dataset based on the tagmap.

**A.6:** The train_model function is called with specific inputs (e.g. EPOCHS = 3) and assigns the returned model to the corresponding model variable.

In [None]:
def train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, epochs, tagmap, tokenizer):
    for epoch in tqdmn(range(epochs)):
        model.train()
        print('Epoch', epoch+1)
        for i, batch in enumerate(tqdmn(train_loader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs[0]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if valid_loader:
            Y_actual, Y_preds = evaluate_model(tqdmn, device, model, valid_loader, False, tokenizer)
            display_results("valid", Y_actual, Y_preds, tagmap)
    return model

def evaluate_model(tqdmn, device, model, data_loader, detect, tokenizer):
    found = False
    model.eval()
    with torch.no_grad():
        Y_actual, Y_preds = [],[]
        for i, batch in enumerate(tqdmn(data_loader)):
            batch = { k: v.to(device) for k, v in batch.items() }
            outputs = model(**batch)
            for idx, _ in enumerate(batch['labels']):
                true_values_all = batch['labels'][idx]
                true_values = true_values_all[true_values_all != -100]
                pred_values = torch.argmax(outputs[1], dim=2)[idx]
                pred_values = pred_values[true_values_all != -100]
                Y_actual.append(true_values)
                Y_preds.append(pred_values)
                found = detect_misclassification(batch, idx, tokenizer, true_values_all, true_values, pred_values) if detect and not found else found
        Y_actual = torch.cat(Y_actual).detach().cpu().numpy()
        Y_preds = torch.cat(Y_preds).detach().cpu().numpy()
    return Y_actual, Y_preds

def detect_misclassification(batch, idx, tokenizer, true_values_all, true_values, pred_values):
    if len(true_values) >= 10 and not torch.equal(true_values, pred_values):
        tokens = [t for i, t in enumerate(tokenizer.convert_ids_to_tokens(batch['input_ids'][idx])) if true_values_all[i] != -100]
        data = [{'Token': token, 'Missclassified': False if true_values[i] == pred_values[i] else True, 'True': true_values[i].item(), 'Predicted': pred_values[i].item()}
            for i, token in enumerate(tokens)]
        df = pd.DataFrame(data)
        display(df)
        return True

def display_results(data_type, Y_actual, Y_preds, tagmap):
    result_labels = {"test": "Test", "valid": "Validation", "example": "Example"}
    accuracy_label = result_labels.get(data_type.lower(), "")
    print("\033[1m{} Accuracy:\033[0m {:.3f}".format(accuracy_label, accuracy_score(Y_actual, Y_preds)))
    print("\033[1m{} Macro-Accuracy:\033[0m {:.3f}".format(accuracy_label, balanced_accuracy_score(Y_actual, Y_preds)))
    print("\033[1mClassification Report:\033[0m\n{}".format(classification_report(Y_actual, Y_preds, labels=tagmap(tagmap.get_itos()), target_names=tagmap.get_itos(), zero_division=0)))

model = train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, EPOCHS, tagmap, tokenizer)

  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.985
[1mValidation Macro-Accuracy:[0m 0.922
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      0.99      0.99     42759
       B-LOC       0.94      0.96      0.95      1837
       B-PER       0.98      0.99      0.98      1842
       B-ORG       0.92      0.94      0.93      1341
       I-PER       0.98      0.99      0.99      1307
       I-ORG       0.88      0.92      0.90       751
      B-MISC       0.91      0.84      0.87       922
       I-LOC       0.85      0.89      0.87       257
      I-MISC       0.70      0.77      0.73       346

    accuracy                           0.98     51362
   macro avg       0.91      0.92      0.91     51362
weighted avg       0.99      0.98      0.99     51362

Epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.987
[1mValidation Macro-Accuracy:[0m 0.928
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      1.00      1.00     42759
       B-LOC       0.95      0.98      0.96      1837
       B-PER       0.99      0.98      0.98      1842
       B-ORG       0.94      0.93      0.93      1341
       I-PER       0.99      0.99      0.99      1307
       I-ORG       0.92      0.87      0.90       751
      B-MISC       0.91      0.89      0.90       922
       I-LOC       0.89      0.91      0.90       257
      I-MISC       0.75      0.81      0.78       346

    accuracy                           0.99     51362
   macro avg       0.93      0.93      0.93     51362
weighted avg       0.99      0.99      0.99     51362

Epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.988
[1mValidation Macro-Accuracy:[0m 0.936
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      0.99      1.00     42759
       B-LOC       0.95      0.98      0.97      1837
       B-PER       0.98      0.99      0.98      1842
       B-ORG       0.93      0.94      0.93      1341
       I-PER       0.99      0.99      0.99      1307
       I-ORG       0.89      0.91      0.90       751
      B-MISC       0.92      0.90      0.91       922
       I-LOC       0.86      0.91      0.89       257
      I-MISC       0.81      0.79      0.80       346

    accuracy                           0.99     51362
   macro avg       0.92      0.94      0.93     51362
weighted avg       0.99      0.99      0.99     51362



**A.7:** The evaluate_model function is called to evaluate the performance of the model on the test dataset. By setting the detection parameter to True, the evaluation function performs additional misclassification detection. The evaluation results are returned as Y_actual and Y_preds, representing the actual and predicted labels, respectively.

In [None]:
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, True, tokenizer)

  0%|          | 0/432 [00:00<?, ?it/s]

Unnamed: 0,Token,Missclassified,True,Predicted
0,pt,False,3,3
1,tam,False,5,5
2,tim,False,5,5
3,was,False,0,0
4,traded,False,0,0
5,at,False,0,0
6,$,False,0,0
7,15,False,0,0
8,per,False,0,0
9,g,True,0,3


**A.8:** The display_results function is called to display the evaluation results on the test dataset. It takes the parameters "test" as the data type, Y_actual and Y_preds representing the actual and predicted labels respectively, and tagmap for mapping the labels.

In [None]:
display_results("test", Y_actual, Y_preds, tagmap)

[1mTest Accuracy:[0m 0.978
[1mTest Macro-Accuracy:[0m 0.904
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      0.99      0.99     38323
       B-LOC       0.91      0.95      0.93      1668
       B-PER       0.96      0.97      0.97      1617
       B-ORG       0.89      0.89      0.89      1661
       I-PER       0.98      0.99      0.99      1156
       I-ORG       0.85      0.89      0.86       835
      B-MISC       0.82      0.83      0.82       702
       I-LOC       0.80      0.92      0.86       257
      I-MISC       0.62      0.71      0.66       216

    accuracy                           0.98     46435
   macro avg       0.87      0.90      0.89     46435
weighted avg       0.98      0.98      0.98     46435



**A.9:** The evaluate_model function is called to evaluate the performance of the model on the example dataset. By setting the detection parameter to True, the evaluation function performs additional misclassification detection. The evaluation results are returned as Y_actual and Y_preds, representing the actual and predicted labels, respectively.

In [None]:
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, example_loader, True, tokenizer)

  0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,Token,Missclassified,True,Predicted
0,in,False,0,0
1,an,False,0,0
2,economist,True,3,0
3,impact,True,5,0
4,web,False,0,0
5,supported,False,0,0
6,by,False,0,0
7,deutsche,True,3,0
8,bank,True,5,0
9,",",False,0,0


**A.10:** The display_results function is called to display the evaluation results on the example dataset. It takes the parameters "example" as the data type, Y_actual and Y_preds representing the actual and predicted labels respectively, and tagmap for mapping the labels.

In [None]:
display_results("example", Y_actual, Y_preds, tagmap)

[1mExample Accuracy:[0m 0.818
[1mExample Macro-Accuracy:[0m 0.200
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       0.82      1.00      0.90        27
       B-LOC       0.00      0.00      0.00         0
       B-PER       0.00      0.00      0.00         0
       B-ORG       0.00      0.00      0.00         2
       I-PER       0.00      0.00      0.00         0
       I-ORG       0.00      0.00      0.00         2
      B-MISC       0.00      0.00      0.00         1
       I-LOC       0.00      0.00      0.00         0
      I-MISC       0.00      0.00      0.00         1

   micro avg       0.82      0.82      0.82        33
   macro avg       0.09      0.11      0.10        33
weighted avg       0.67      0.82      0.74        33



### **B. BERT-based named entity recognition model with frozen parameters**

The model is initialized as a BERT-based NER model using the initialize_model function. This function takes the model type as "bert", the tagset for classification, the device for computation, the learning rate for optimization, and a boolean flag set to False, indicating that the weights of the BERT model should be frozen. The datasets are then encoded using the encode_data function. Data loaders are created using the create_loaders function. The model is trained using the train_model function with a specified number of epochs. The evaluate_model function is used to evaluate the model's performance on the test dataset, without enabling the misclassification detection process. The evaluation results are displayed using the display_results function, providing insights into the model's performance on the testing dataset.

In [None]:
model, tokenizer, optimizer = initialize_model("bert", tagset, device, LR, False)
train_dataset, valid_dataset, test_dataset, example_dataset = encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, 'ner_tags')
train_loader, valid_loader, test_loader, example_loader = create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, BATCH_SIZE)
model = train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, EPOCHS, tagmap, tokenizer)
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, False, tokenizer)
display_results("test", Y_actual, Y_preds, tagmap)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Unnamed: 0,Total,Trainable,Frozen
Parameters,108898569,6921,108891648


  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.834
[1mValidation Macro-Accuracy:[0m 0.126
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       0.84      1.00      0.91     42759
       B-LOC       0.51      0.10      0.16      1837
       B-PER       0.34      0.02      0.04      1842
       B-ORG       0.62      0.02      0.03      1341
       I-PER       0.00      0.00      0.00      1307
       I-ORG       0.00      0.00      0.00       751
      B-MISC       0.00      0.00      0.00       922
       I-LOC       0.00      0.00      0.00       257
      I-MISC       0.00      0.00      0.00       346

    accuracy                           0.83     51362
   macro avg       0.26      0.13      0.13     51362
weighted avg       0.74      0.83      0.77     51362

Epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.839
[1mValidation Macro-Accuracy:[0m 0.140
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       0.84      1.00      0.91     42759
       B-LOC       0.65      0.12      0.21      1837
       B-PER       0.74      0.04      0.08      1842
       B-ORG       0.52      0.09      0.15      1341
       I-PER       0.67      0.00      0.01      1307
       I-ORG       0.00      0.00      0.00       751
      B-MISC       0.00      0.00      0.00       922
       I-LOC       0.00      0.00      0.00       257
      I-MISC       0.00      0.00      0.00       346

    accuracy                           0.84     51362
   macro avg       0.38      0.14      0.15     51362
weighted avg       0.78      0.84      0.77     51362

Epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.857
[1mValidation Macro-Accuracy:[0m 0.199
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       0.86      1.00      0.92     42759
       B-LOC       0.80      0.33      0.47      1837
       B-PER       0.86      0.19      0.31      1842
       B-ORG       0.62      0.19      0.29      1341
       I-PER       0.93      0.08      0.16      1307
       I-ORG       0.00      0.00      0.00       751
      B-MISC       0.00      0.00      0.00       922
       I-LOC       0.00      0.00      0.00       257
      I-MISC       0.00      0.00      0.00       346

    accuracy                           0.86     51362
   macro avg       0.45      0.20      0.24     51362
weighted avg       0.81      0.86      0.81     51362



  0%|          | 0/432 [00:00<?, ?it/s]

[1mTest Accuracy:[0m 0.853
[1mTest Macro-Accuracy:[0m 0.208
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       0.86      1.00      0.92     38323
       B-LOC       0.78      0.33      0.46      1668
       B-PER       0.81      0.19      0.31      1617
       B-ORG       0.68      0.26      0.37      1661
       I-PER       0.92      0.10      0.17      1156
       I-ORG       0.00      0.00      0.00       835
      B-MISC       0.00      0.00      0.00       702
       I-LOC       0.00      0.00      0.00       257
      I-MISC       0.00      0.00      0.00       216

    accuracy                           0.85     46435
   macro avg       0.45      0.21      0.25     46435
weighted avg       0.81      0.85      0.81     46435



**Comparison:** The two models A and B have significantly different performances.

* The first model achieved a test accuracy of 0.978 and a test macro-accuracy of 0.904. It demonstrated high precision, recall, and F1-score across most of the classes, particularly for the "O" category. It exhibited strong overall performance with a weighted average F1-score of 0.98.

* In contrast, the second model achieved a lower test accuracy of 0.853 and a considerably lower test macro-accuracy of 0.208. It struggled to accurately classify the majority of the classes, as reflected by the low precision, recall, and F1-scores. Notably, it performed poorly on the "I-ORG," "B-MISC," "I-LOC," and "I-MISC" categories, where precision, recall, and F1-scores were all close to zero.

Overall, model A outperformed model B in terms of classification accuracy and overall model performance. Freezing the BERT parameters limited the model's ability to adapt and fine-tune the pre-trained embeddings for the specific task, resulting in significantly lower performance. The lack of parameter updates hindered its learning process, leading to reduced accuracy, poor macro-accuracy, and low precision, recall, and F1-scores across multiple categories.

### **C. BERT-based named entity recognition model trained on both training and validation datasets**

The training and validation sentences are concatenated using the extend method to create a combined dataset. The tag mapping and tagset are then generated using the create_tagset function. The model, tokenizer, and optimizer are initialized using the initialize_model function, where the BERT model's parameters are set to be trainable. The datasets are encoded using the encode_data function, and data loaders are created using the create_loaders function. The model is trained using the train_model function with a specified number of epochs. The evaluate_model function is called to evaluate the model's performance on the test dataset, without enabling the misclassification detection process. Finally, the display_results function is used to display the evaluation results on the test dataset.

In [None]:
train_sentences.extend(valid_sentences)  # concatenates the training and validation sentences
tagmap, tagset = create_tagset(train_sentences, 'ner_tags')
model, tokenizer, optimizer = initialize_model("bert", tagset, device, LR, True)
train_dataset, _, test_dataset, _ = encode_data(tagmap, tokenizer, train_sentences, [], test_sentences, [], 'ner_tags')
train_loader, _, test_loader, _ = create_loaders(train_dataset, [], test_dataset, [], BATCH_SIZE)
model = train_model(model, optimizer, train_loader, [], tqdmn, device, EPOCHS, tagmap, tokenizer)
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, False, tokenizer)
display_results("test", Y_actual, Y_preds, tagmap)

Tagset size: 9


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Unnamed: 0,Total,Trainable,Frozen
Parameters,108898569,108898569,0


  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/2568 [00:00<?, ?it/s]

Epoch 2


  0%|          | 0/2568 [00:00<?, ?it/s]

Epoch 3


  0%|          | 0/2568 [00:00<?, ?it/s]

  0%|          | 0/432 [00:00<?, ?it/s]

[1mTest Accuracy:[0m 0.980
[1mTest Macro-Accuracy:[0m 0.912
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      0.99      0.99     38323
       B-LOC       0.92      0.94      0.93      1668
       B-PER       0.98      0.97      0.97      1617
       B-ORG       0.90      0.90      0.90      1661
       I-PER       0.98      0.99      0.99      1156
      B-MISC       0.85      0.84      0.84       702
       I-ORG       0.84      0.90      0.87       835
      I-MISC       0.63      0.78      0.70       216
       I-LOC       0.78      0.90      0.83       257

    accuracy                           0.98     46435
   macro avg       0.88      0.91      0.89     46435
weighted avg       0.98      0.98      0.98     46435



**Comparison:** The two models A and C, exhibit high performance in identifying named entities.

* The first model achieved a test accuracy of 0.978 and a macro-accuracy of 0.904. It demonstrated strong precision, recall, and F1-scores for most entity types, with particularly high performance for entity type "O" and "I-PER".

* The second model, trained on both the training and validation datasets, performed slightly better with a test accuracy of 0.980 and a macro-accuracy of 0.912. It showed improvements in precision, recall, and F1-scores for certain entity types such as "I-ORG", "B-MISC", "I-LOC", and "I-MISC".

Overall, model C demonstrated higher accuracy and macro-accuracy than model A, making it slightly superior in identifying named entities across different categories. The concatenation of the training and validation datasets had a positive impact on the model's performance. By combining the two datasets, the model had access to a larger and more diverse set of training examples, allowing it to learn more effectively and potentially improve its ability to recognize named entities. This resulted in a slightly higher test accuracy, macro-accuracy, and improved performance in terms of precision, recall, and F1-scores for several entity types.

### **D. BERT-based part of speech model**

The model is initialized as a BERT-based POS model using the initialize_model function. The model type is set to "bert", and the tagset for classification is based on part-of-speech tags. The model's parameters are set to be trainable. The encode_data function is then used to encode the training, validation, test, and example sentences, creating corresponding datasets. Data loaders are created using the create_loaders function with a specified batch size. The model is trained using the train_model function with the specified number of epochs, using the optimizer and data loaders. The evaluate_model function is called to evaluate the model's performance on the test dataset, with the misclassification detection enabled. The evaluation results, including the actual and predicted labels, are obtained as Y_actual and Y_preds. Finally, the display_results function is used to display the evaluation results, providing insights into the model's performance on the test dataset.

In [2]:
train_sentences, test_sentences, valid_sentences, example_sentences = load_data(base_path)
tagmap, tagset = create_tagset(train_sentences, 'pos_tags')
model, tokenizer, optimizer = initialize_model("bert", tagset, device, LR, True)
train_dataset, valid_dataset, test_dataset, example_dataset = encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, 'ner_tags')
train_loader, valid_loader, test_loader, example_loader = create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, BATCH_SIZE)
model = train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, EPOCHS, tagmap, tokenizer)
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, True, tokenizer)
display_results("test", Y_actual, Y_preds, tagmap)

Tagset size: 45


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Unnamed: 0,Total,Trainable,Frozen
Parameters,108926253,108926253,0


  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m nan
[1mValidation Macro-Accuracy:[0m nan
[1mClassification Report:[0m
              precision    recall  f1-score   support

         NNP       0.00      0.00      0.00       0.0
          NN       0.00      0.00      0.00       0.0
          CD       0.00      0.00      0.00       0.0
          IN       0.00      0.00      0.00       0.0
          DT       0.00      0.00      0.00       0.0
          JJ       0.00      0.00      0.00       0.0
         NNS       0.00      0.00      0.00       0.0
         VBD       0.00      0.00      0.00       0.0
           .       0.00      0.00      0.00       0.0
           ,       0.00      0.00      0.00       0.0
          VB       0.00      0.00      0.00       0.0
         VBN       0.00      0.00      0.00       0.0
          RB       0.00      0.00      0.00       0.0
          CC       0.00      0.00      0.00       0.0
          TO       0.00      0.00      0.00       0.0
         PRP       0.00      0.0

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m nan
[1mValidation Macro-Accuracy:[0m nan
[1mClassification Report:[0m
              precision    recall  f1-score   support

         NNP       0.00      0.00      0.00       0.0
          NN       0.00      0.00      0.00       0.0
          CD       0.00      0.00      0.00       0.0
          IN       0.00      0.00      0.00       0.0
          DT       0.00      0.00      0.00       0.0
          JJ       0.00      0.00      0.00       0.0
         NNS       0.00      0.00      0.00       0.0
         VBD       0.00      0.00      0.00       0.0
           .       0.00      0.00      0.00       0.0
           ,       0.00      0.00      0.00       0.0
          VB       0.00      0.00      0.00       0.0
         VBN       0.00      0.00      0.00       0.0
          RB       0.00      0.00      0.00       0.0
          CC       0.00      0.00      0.00       0.0
          TO       0.00      0.00      0.00       0.0
         PRP       0.00      0.0

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m nan
[1mValidation Macro-Accuracy:[0m nan
[1mClassification Report:[0m
              precision    recall  f1-score   support

         NNP       0.00      0.00      0.00       0.0
          NN       0.00      0.00      0.00       0.0
          CD       0.00      0.00      0.00       0.0
          IN       0.00      0.00      0.00       0.0
          DT       0.00      0.00      0.00       0.0
          JJ       0.00      0.00      0.00       0.0
         NNS       0.00      0.00      0.00       0.0
         VBD       0.00      0.00      0.00       0.0
           .       0.00      0.00      0.00       0.0
           ,       0.00      0.00      0.00       0.0
          VB       0.00      0.00      0.00       0.0
         VBN       0.00      0.00      0.00       0.0
          RB       0.00      0.00      0.00       0.0
          CC       0.00      0.00      0.00       0.0
          TO       0.00      0.00      0.00       0.0
         PRP       0.00      0.0

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


  0%|          | 0/432 [00:00<?, ?it/s]

[1mTest Accuracy:[0m nan
[1mTest Macro-Accuracy:[0m nan
[1mClassification Report:[0m
              precision    recall  f1-score   support

         NNP       0.00      0.00      0.00       0.0
          NN       0.00      0.00      0.00       0.0
          CD       0.00      0.00      0.00       0.0
          IN       0.00      0.00      0.00       0.0
          DT       0.00      0.00      0.00       0.0
          JJ       0.00      0.00      0.00       0.0
         NNS       0.00      0.00      0.00       0.0
         VBD       0.00      0.00      0.00       0.0
           .       0.00      0.00      0.00       0.0
           ,       0.00      0.00      0.00       0.0
          VB       0.00      0.00      0.00       0.0
         VBN       0.00      0.00      0.00       0.0
          RB       0.00      0.00      0.00       0.0
          CC       0.00      0.00      0.00       0.0
          TO       0.00      0.00      0.00       0.0
         PRP       0.00      0.00      0.00 

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


The evaluate_model function is called to evaluate the model's performance on the example dataset, with the misclassification detection enabled. The evaluation results, including the actual and predicted labels, are obtained as Y_actual and Y_preds. The display_results function is then used to display the evaluation results, providing insights into the model's performance on the example dataset.

In [3]:
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, example_loader, True, tokenizer)
display_results("example", Y_actual, Y_preds, tagmap)

  0%|          | 0/1 [00:00<?, ?it/s]

[1mExample Accuracy:[0m nan
[1mExample Macro-Accuracy:[0m nan
[1mClassification Report:[0m
              precision    recall  f1-score   support

         NNP       0.00      0.00      0.00       0.0
          NN       0.00      0.00      0.00       0.0
          CD       0.00      0.00      0.00       0.0
          IN       0.00      0.00      0.00       0.0
          DT       0.00      0.00      0.00       0.0
          JJ       0.00      0.00      0.00       0.0
         NNS       0.00      0.00      0.00       0.0
         VBD       0.00      0.00      0.00       0.0
           .       0.00      0.00      0.00       0.0
           ,       0.00      0.00      0.00       0.0
          VB       0.00      0.00      0.00       0.0
         VBN       0.00      0.00      0.00       0.0
          RB       0.00      0.00      0.00       0.0
          CC       0.00      0.00      0.00       0.0
          TO       0.00      0.00      0.00       0.0
         PRP       0.00      0.00     

  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


### **E. BERT-based text chunking model**

The tagmap and tagset are generated using the create_tagset function, with the tag type being "chunk_tags". The model, tokenizer, and optimizer are initialized using the initialize_model function, with the model type set as "bert" and the parameters set to be trainable. The encode_data function is then employed to encode the training, validation, test, and example sentences, resulting in corresponding datasets. Data loaders are created using the create_loaders function, with a specified batch size. The model is trained using the train_model function with the designated number of epochs, utilizing the optimizer and data loaders. The evaluate_model function is called to evaluate the model's performance on the test dataset, with the misclassification detection enabled. The evaluation results, including the actual and predicted labels, are obtained as Y_actual and Y_preds. Finally, the display_results function is used to display the evaluation results, providing insights into the model's performance on the test dataset.

In [5]:
tagmap, tagset = create_tagset(train_sentences, 'chunk_tags')
model, tokenizer, optimizer = initialize_model("bert", tagset, device, LR, True)
train_dataset, valid_dataset, test_dataset, example_dataset = encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, 'ner_tags')
train_loader, valid_loader, test_loader, example_loader = create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, BATCH_SIZE)
model = train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, EPOCHS, tagmap, tokenizer)
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, True, tokenizer)
display_results("test", Y_actual, Y_preds, tagmap)

Tagset size: 20


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

Unnamed: 0,Total,Trainable,Frozen
Parameters,108907028,108907028,0


  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 1.000
[1mValidation Macro-Accuracy:[0m 1.000
[1mClassification Report:[0m
              precision    recall  f1-score   support

        I-NP       0.00      0.00      0.00         0
        B-NP       0.00      0.00      0.00         0
           O       1.00      1.00      1.00     42759
        B-PP       0.00      0.00      0.00         0
        B-VP       0.00      0.00      0.00         0
        I-VP       0.00      0.00      0.00         0
      B-ADVP       0.00      0.00      0.00         0
      B-SBAR       0.00      0.00      0.00         0
      B-ADJP       0.00      0.00      0.00         0
       B-PRT       0.00      0.00      0.00         0
      I-ADJP       0.00      0.00      0.00         0
      I-ADVP       0.00      0.00      0.00         0
        I-PP       0.00      0.00      0.00         0
      B-INTJ       0.00      0.00      0.00         0
     I-CONJP       0.00      0.00      0.00         0
       B-LST       0.00     

  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 1.000
[1mValidation Macro-Accuracy:[0m 1.000
[1mClassification Report:[0m
              precision    recall  f1-score   support

        I-NP       0.00      0.00      0.00         0
        B-NP       0.00      0.00      0.00         0
           O       1.00      1.00      1.00     42759
        B-PP       0.00      0.00      0.00         0
        B-VP       0.00      0.00      0.00         0
        I-VP       0.00      0.00      0.00         0
      B-ADVP       0.00      0.00      0.00         0
      B-SBAR       0.00      0.00      0.00         0
      B-ADJP       0.00      0.00      0.00         0
       B-PRT       0.00      0.00      0.00         0
      I-ADJP       0.00      0.00      0.00         0
      I-ADVP       0.00      0.00      0.00         0
        I-PP       0.00      0.00      0.00         0
      B-INTJ       0.00      0.00      0.00         0
     I-CONJP       0.00      0.00      0.00         0
       B-LST       0.00     

  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 1.000
[1mValidation Macro-Accuracy:[0m 1.000
[1mClassification Report:[0m
              precision    recall  f1-score   support

        I-NP       0.00      0.00      0.00         0
        B-NP       0.00      0.00      0.00         0
           O       1.00      1.00      1.00     42759
        B-PP       0.00      0.00      0.00         0
        B-VP       0.00      0.00      0.00         0
        I-VP       0.00      0.00      0.00         0
      B-ADVP       0.00      0.00      0.00         0
      B-SBAR       0.00      0.00      0.00         0
      B-ADJP       0.00      0.00      0.00         0
       B-PRT       0.00      0.00      0.00         0
      I-ADJP       0.00      0.00      0.00         0
      I-ADVP       0.00      0.00      0.00         0
        I-PP       0.00      0.00      0.00         0
      B-INTJ       0.00      0.00      0.00         0
     I-CONJP       0.00      0.00      0.00         0
       B-LST       0.00     

  0%|          | 0/432 [00:00<?, ?it/s]

[1mTest Accuracy:[0m 1.000
[1mTest Macro-Accuracy:[0m 1.000
[1mClassification Report:[0m
              precision    recall  f1-score   support

        I-NP       0.00      0.00      0.00         0
        B-NP       0.00      0.00      0.00         0
           O       1.00      1.00      1.00     38323
        B-PP       0.00      0.00      0.00         0
        B-VP       0.00      0.00      0.00         0
        I-VP       0.00      0.00      0.00         0
      B-ADVP       0.00      0.00      0.00         0
      B-SBAR       0.00      0.00      0.00         0
      B-ADJP       0.00      0.00      0.00         0
       B-PRT       0.00      0.00      0.00         0
      I-ADJP       0.00      0.00      0.00         0
      I-ADVP       0.00      0.00      0.00         0
        I-PP       0.00      0.00      0.00         0
      B-INTJ       0.00      0.00      0.00         0
     I-CONJP       0.00      0.00      0.00         0
       B-LST       0.00      0.00      0

The evaluate_model function is called to evaluate the model's performance on the example dataset, with the misclassification detection enabled. The evaluation results, including the actual and predicted labels, are obtained as Y_actual and Y_preds. The display_results function is then used to display the evaluation results, providing insights into the model's performance on the example dataset.

In [6]:
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, example_loader, True, tokenizer)
display_results("example", Y_actual, Y_preds, tagmap)

  0%|          | 0/1 [00:00<?, ?it/s]

[1mExample Accuracy:[0m 1.000
[1mExample Macro-Accuracy:[0m 1.000
[1mClassification Report:[0m
              precision    recall  f1-score   support

        I-NP       0.00      0.00      0.00         0
        B-NP       0.00      0.00      0.00         0
           O       1.00      1.00      1.00        27
        B-PP       0.00      0.00      0.00         0
        B-VP       0.00      0.00      0.00         0
        I-VP       0.00      0.00      0.00         0
      B-ADVP       0.00      0.00      0.00         0
      B-SBAR       0.00      0.00      0.00         0
      B-ADJP       0.00      0.00      0.00         0
       B-PRT       0.00      0.00      0.00         0
      I-ADJP       0.00      0.00      0.00         0
      I-ADVP       0.00      0.00      0.00         0
        I-PP       0.00      0.00      0.00         0
      B-INTJ       0.00      0.00      0.00         0
     I-CONJP       0.00      0.00      0.00         0
       B-LST       0.00      0.00 

### **F. RoBERTa-based named entity recognition model**

The tagmap and tagset are generated using the create_tagset function, with the tag type being "ner_tags". The model is then initialized as a RoBERTa-based NER model using the initialize_model function. This function takes the model type as "roberta", the tagset for classification, the device for computation, the learning rate for optimization, and a boolean flag set to True, indicating that the parameters of the BERT model should be trainable. The datasets are then encoded using the encode_data function. Data loaders are created using the create_loaders function. The model is trained using the train_model function with a specified number of epochs. The evaluate_model function is used to evaluate the model's performance on the test dataset, without enabling the misclassification detection process. The evaluation results are displayed using the display_results function, providing insights into the model's performance on the testing dataset.

In [8]:
tagmap, tagset = create_tagset(train_sentences, 'ner_tags')
model, tokenizer, optimizer = initialize_model("roberta", tagset, device, LR, True)
train_dataset, valid_dataset, test_dataset, example_dataset = encode_data(tagmap, tokenizer, train_sentences, valid_sentences, test_sentences, example_sentences, 'ner_tags')
train_loader, valid_loader, test_loader, _ = create_loaders(train_dataset, valid_dataset, test_dataset, example_dataset, BATCH_SIZE)
model = train_model(model, optimizer, train_loader, valid_loader, tqdmn, device, EPOCHS, tagmap, tokenizer)
Y_actual, Y_preds = evaluate_model(tqdmn, device, model, test_loader, False, tokenizer)
display_results("test", Y_actual, Y_preds, tagmap)

Tagset size: 9


Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able

Unnamed: 0,Total,Trainable,Frozen
Parameters,124061961,124061961,0


  0%|          | 0/3 [00:00<?, ?it/s]

Epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.988
[1mValidation Macro-Accuracy:[0m 0.939
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      1.00      1.00     42759
       B-LOC       0.96      0.96      0.96      1837
       B-PER       0.96      0.99      0.97      1842
       B-ORG       0.94      0.92      0.93      1341
       I-PER       0.97      0.98      0.98      1307
       I-ORG       0.94      0.90      0.92       751
      B-MISC       0.90      0.88      0.89       922
       I-LOC       0.90      0.97      0.93       257
      I-MISC       0.82      0.85      0.84       346

    accuracy                           0.99     51362
   macro avg       0.93      0.94      0.94     51362
weighted avg       0.99      0.99      0.99     51362

Epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.991
[1mValidation Macro-Accuracy:[0m 0.953
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      1.00      1.00     42759
       B-LOC       0.97      0.96      0.97      1837
       B-PER       0.98      0.99      0.98      1842
       B-ORG       0.91      0.96      0.94      1341
       I-PER       0.99      0.98      0.99      1307
       I-ORG       0.93      0.94      0.93       751
      B-MISC       0.93      0.91      0.92       922
       I-LOC       0.93      0.96      0.95       257
      I-MISC       0.87      0.88      0.87       346

    accuracy                           0.99     51362
   macro avg       0.95      0.95      0.95     51362
weighted avg       0.99      0.99      0.99     51362

Epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]

[1mValidation Accuracy:[0m 0.991
[1mValidation Macro-Accuracy:[0m 0.957
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      1.00      1.00     42759
       B-LOC       0.98      0.97      0.98      1837
       B-PER       0.98      0.99      0.98      1842
       B-ORG       0.95      0.94      0.95      1341
       I-PER       0.98      0.99      0.98      1307
       I-ORG       0.94      0.92      0.93       751
      B-MISC       0.90      0.94      0.92       922
       I-LOC       0.92      0.96      0.94       257
      I-MISC       0.87      0.90      0.88       346

    accuracy                           0.99     51362
   macro avg       0.95      0.96      0.95     51362
weighted avg       0.99      0.99      0.99     51362



  0%|          | 0/432 [00:00<?, ?it/s]

[1mTest Accuracy:[0m 0.982
[1mTest Macro-Accuracy:[0m 0.918
[1mClassification Report:[0m
              precision    recall  f1-score   support

           O       1.00      0.99      1.00     38323
       B-LOC       0.95      0.93      0.94      1668
       B-PER       0.98      0.96      0.97      1617
       B-ORG       0.89      0.93      0.91      1661
       I-PER       0.98      0.99      0.98      1156
       I-ORG       0.87      0.93      0.90       835
      B-MISC       0.79      0.86      0.82       702
       I-LOC       0.85      0.89      0.87       257
      I-MISC       0.58      0.77      0.67       216

    accuracy                           0.98     46435
   macro avg       0.88      0.92      0.90     46435
weighted avg       0.98      0.98      0.98     46435



**Comparison:** The two models A and F, exhibit high performance in identifying named entities.

* The first model achieved a test accuracy of 0.978 and a test macro-accuracy of 0.904. In the classification report, it demonstrated high precision, recall, and f1-score for most of the named entity labels, with particularly strong performance for the "O" label. The model had an overall accuracy of 0.98, with a macro average f1-score of 0.89.

* The second model achieved a slightly higher test accuracy of 0.982 and a test macro-accuracy of 0.918. In the classification report, it also exhibited high precision, recall, and f1-score for most labels, with notable performance for the "O" and "B-PER" labels. The model had an overall accuracy of 0.98, with a macro average f1-score of 0.90.

Overall, both models performed well, achieving high accuracy and demonstrating strong precision, recall, and f1-scores for various named entity labels. Model F showed slightly better results than model A, with slightly higher accuracy and macro-accuracy scores compared to the BERT-based model. It also had marginally higher f1-scores for some specific labels. Therefore, using RoBERTa for named entity recognition would be recommended.

## **9. Sequence Labeling using ChatGPT with zero-shot prompting**

##### **Prompt:** Assign POS, Chunk and NER tags to the tokens of the following sentence:

`tam` `tim` `was` `traded` `at` `$` `15` `per` `g` `in` `london` `on` `friday` `at` `around` `09` `gm` `.`

##### **Answer:** Here are the assigned POS, Chunk, and NER tags for each token in the given sentence:

| Token | tam   | tim   | was   | traded | at    | \$ | 15 | per | g | in    | london | on   | friday | at    | around | 09 | gm | .   |
|-------|-------|-------|-------|--------|-------|---|----|-----|---|-------|--------|------|---------|-------|--------|----|----|-----|
| **POS Tag**   | NNP   | NNP   | VBD   | VBN    | IN    | $ | CD | IN  | NN | IN    | NNP    | IN   | NNP     | IN    | IN     | CD | NN | .   |
| **Chunk Tag** | B-NP  | I-NP  | B-VP  | I-VP   | B-PP  | B-NP | I-NP | B-PP | B-NP | B-PP  | B-NP   | B-PP | B-NP    | B-PP  | B-PP   | B-NP | I-NP | O   |
| **NER Tag**   | O     | O     | O     | O      | O     | O | O  | O    | O   | O   | B-LOC | O      | B-DATE | O     | O      | O  | O  | O   |




##### **Evaluation on wrong assignments:** [row(i), row(i+1)] ~ [Wrong Tags, Actual Tags]

| Token        | POS Tag | Chunk Tag | NER Tag    |
|--------------|---------|-----------|------------|
| **tam**      | NNP     | B-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |
| **tim**      | NNP     | I-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |
| **g**        | NN      | B-NP      | O          |
|              | NNP     | B-NP      | O          |
| **london**   | NNP     | B-NP      | B-LOC      |
|              | NNP     | B-NP      | O          |
| **friday**   | NNP     | B-NP      | B-DATE     |
|              | NNP     | B-NP      | O          |
| **around**   | IN      | B-PP      | O          |
|              | IN      | B-NP      | O          |
| **09**       | CD      | B-NP      | O          |
|              | CD      | I-NP      | O          |
| **gm**       | NN      | I-NP      | O          |
|              | RB      | I-NP      | B-MISC     |

---
##### **Prompt:** Assign POS, Chunk and NER tags to the tokens of the following sentence:

`In` `an` `Economist` `Impact` `webinar` `supported` `by` `Deutsche` `Bank` `,` `two` `corporate` `treasurers` `shared` `how` `inflationary` `pressures` `are` `affecting` `their` `investment` `strategies` `and` `why` `the` `Covid-19` `pandemic` `is` `still` `shaping` `cash` `policies` `.`

##### **Answer:** Here are the assigned POS, Chunk, and NER tags for each token in the given sentence:

| Token | In   | an   | Economist | Impact | webinar | supported | by      | Deutsche | Bank | , | two | corporate | treasurers | shared | how | inflationary | pressures | are | affecting | their  | investment | strategies | and | why | the | Covid-19 | pandemic | is  | still  | shaping | cash | policies | . |
| -------- | ---- | ---- | --------- | ------ | ------- | --------- | ------- | --------- | ---- | - | --- | ---------- | ----------- | ------ | --- | ------------ | --------- | --- | --------- | ------ | ---------- | ---------- | --- | --- | --- | -------- | -------- | --- | ------ | -------- | ---- | --------- | - |
| **POS Tag**  | IN   | DT   | NNP       | NNP    | NN      | VBD       | IN      | NNP       | NNP  | , | CD  | JJ         | NNS         | VBD    | WRB | JJ           | NNS       | VBP | VBG       | PRP$   | NN         | NNS        | CC  | WRB | DT  | NNP      | NN       | VBZ | RB     | VBG      | NN   | NNS       | . |
| **Chunk Tag** | B-PP | B-NP | I-NP      | I-NP   | I-NP    | B-VP      | B-PP    | B-NP      | I-NP | O | B-NP | I-NP       | I-NP        | B-VP   | B-ADVP | B-NP         | I-NP      | B-VP | I-VP      | B-NP    | I-NP       | I-NP       | O   | O   | B-NP | I-NP     | I-NP   | B-VP | I-ADVP | I-VP     | B-NP | I-NP      | O |
| **NER Tag**  | O    | O    | B-ORG     | O      | O       | O         | O       | B-ORG     | I-ORG | O | O    | O          | O           | O      | O   | O            | O         | O   | O         | O      | O          | O          | O   | O   | O   | B-MISC   | O        | O   | O      | O        | O    | O         | O |

##### **Evaluation on wrong assignments:** [row(i), row(i+1)] ~ [Wrong Tags, Actual Tags]

| Token        | POS Tag | Chunk Tag | NER Tag    |
|--------------|---------|-----------|------------|
| **Impact**   | NNP     | I-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |
| **why**      | WRB     | O         | O          |
|              | WRB     | B-ADVP    | O          |
| **pandemic** | NN      | I-NP      | O          |
|              | NN      | I-NP      | I-MISC     |

## **10. Sequence Labeling using ChatGPT with few-shot prompting**

##### **Prompt:** Assign POS, Chunk and NER tags to the tokens of the following sentence:

`tam` `tim` `was` `traded` `at` `$` `15` `per` `g` `in` `london` `on` `friday` `at` `around` `09` `gm` `.`

the way I did with this sentence:

`Playing` `Sunday` `:` `West` `Bromwich` `v` `Bolton`

| Token          | Playing | Sunday | :   | West   | Bromwich | v    | Bolton |
|----------------|---------|--------|-----|--------|----------|------|--------|
| **POS Tag**    | VBG     | NNP    | :   | NNP    | NNP      | FW   | NNP    |
| **Chunk Tag**  | B-VP    | B-NP   | O   | B-NP   | I-NP     | I-NP | I-NP   |
| **NER Tag**    | O       | O      | O   | B-ORG  | I-ORG    | O    | B-ORG  |

##### **Answer:** Here are the assigned POS, Chunk, and NER tags for each token in the given sentence:

| Token | tam   | tim   | was   | traded | at    | \$ | 15 | per | g | in    | london | on   | friday | at    | around | 09 | gm | .   |
|-------|-------|-------|-------|--------|-------|---|----|-----|---|-------|--------|------|---------|-------|--------|----|----|-----|
| **POS Tag**   | NNP   | NNP   | VBD   | VBN    | IN    | $ | CD | IN  | NN | IN    | NNP    | IN   | NNP     | IN    | IN     | CD | NN | .   |
| **Chunk Tag** | B-NP  | B-NP  | B-VP  | I-VP   | B-PP  | B-NP | I-NP | B-PP | B-NP | B-PP  | B-NP   | B-PP | B-NP    | B-PP  | B-ADVP   | B-NP | I-NP | O   |
| **NER Tag**   | O     | O     | O     | O      | O     | O | O  | O    | O   | O   | B-LOC | O      | O | O     | O      | B-TIME  | O  | O   |

##### **Evaluation on wrong assignments:** [row(i), row(i+1)] ~ [Wrong Tags, Actual Tags]

| Token        | POS Tag | Chunk Tag | NER Tag    |
|--------------|---------|-----------|------------|
| **tam**      | NNP     | B-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |
| **tim**      | NNP     | I-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |
| **g**        | NN      | B-NP      | O          |
|              | NNP     | B-NP      | O          |
| **london**   | NNP     | B-NP      | B-LOC      |
|              | NNP     | B-NP      | O          |
| **around**   | IN      | B-PP      | O          |
|              | IN      | B-NP      | O          |
| **09**       | CD      | B-NP      | B-TIME     |
|              | CD      | I-NP      | O          |
| **gm**       | NN      | I-NP      | O          |
|              | RB      | I-NP      | B-MISC     |

---

##### **Prompt:** Assign POS, Chunk and NER tags to the tokens of the following sentence:

`In` `an` `Economist` `Impact` `webinar` `supported` `by` `Deutsche` `Bank` `,` `two` `corporate` `treasurers` `shared` `how` `inflationary` `pressures` `are` `affecting` `their` `investment` `strategies` `and` `why` `the` `Covid-19` `pandemic` `is` `still` `shaping` `cash` `policies` `.`

the way I did with this sentence:

`We` `only` `realised` `after` `voting` `that` `we` `were` `being` `denied` `the` `rights` `of` `other` `Bosnian` `Moslems` `to` `choose` `our` `president` `.`

| Word | We | only | realised | after | voting | that | we   | were | being | denied | the | rights | of  | other | Bosnian | Moslems | to  | choose | our  | president | .  |
------------|------|------|----------|-------|--------|------|------|------|-------|--------|-----|--------|-----|-------|---------|---------|-----|--------|------|-----------|----|
| **POS Tag** | PRP  | RB   | VBD      | IN    | VBG    | IN   | PRP  | VBD  | VBG   | VBN    | DT  | NNS    | IN  | JJ    | NNP     | NNPS    | TO  | VB     | PRP$ | NN        | .  |
| **Chunk Tag**  | B-NP | B-ADVP | B-VP     | B-PP  | B-VP   | B-SBAR | B-NP | B-VP | I-VP  | I-VP   | B-NP | I-NP   | B-PP | B-NP  | I-NP    | I-NP    | B-VP | I-VP   | B-NP | I-NP      | O  |
| **NER Tag** | O    | O    | O        | O     | O      | O     | O    | O    | O     | O      | O    | O      | O    | O     | B-MISC  | I-MISC  | O    | O      | O    | O         | O  |

##### **Answer:** Here are the assigned POS, Chunk, and NER tags for each token in the given sentence:

| Word | In   | an   | Economist | Impact | webinar | supported | by      | Deutsche | Bank | , | two | corporate | treasurers | shared | how | inflationary | pressures | are | affecting | their  | investment | strategies | and | why | the | Covid-19 | pandemic | is  | still  | shaping | cash | policies | . |
| -------- | ---- | ---- | --------- | ------ | ------- | --------- | ------- | --------- | ---- | - | --- | ---------- | ----------- | ------ | --- | ------------ | --------- | --- | --------- | ------ | ---------- | ---------- | --- | --- | --- | -------- | -------- | --- | ------ | -------- | ---- | --------- | - |
| **POS Tag**  | IN   | DT   | NNP       | NNP    | NN      | VBD       | IN      | NNP       | NNP  | , | CD  | JJ         | NNS         | VBD    | WRB | JJ           | NNS       | VBP | VBG       | PRP$   | NN         | NNS        | CC  | WRB | DT  | NNP      | NN       | VBZ | RB     | VBG      | NN   | NNS       | . |
| **Chunk Tag** | B-PP | B-NP | I-NP      | I-NP   | I-NP    | B-VP      | B-PP    | B-NP      | I-NP | O | B-NP | I-NP       | I-NP        | B-VP   | B-ADVP | B-NP         | I-NP      | B-VP | I-VP      | B-NP    | I-NP       | I-NP       | O   | B-ADVP   | B-NP | I-NP     | I-NP   | B-VP | I-ADVP | I-VP     | B-NP | I-NP      | O |
| **NER Tag**  | O    | O    | O     | O      | O       | O         | O       | B-ORG     | I-ORG | O | O    | O          | O           | O      | O   | O            | O         | O   | O         | O      | O          | O          | O   | O   | O   | B-MISC   | I-MISC        | O   | O      | O        | O    | O         | O |

##### **Evaluation on wrong assignments:** [row(i), row(i+1)] ~ [Wrong Tags, Actual Tags]

| Word         | POS Tag | Chunk Tag | NER Tag    |
|--------------|---------|-----------|------------|
| **Economist**| NNP     | I-NP      | O          |
|              | NNP     | I-NP      | B-ORG      |
| **Impact**   | NNP     | I-NP      | O          |
|              | NNP     | I-NP      | I-ORG      |