# Final project: Named Entity Recognition (NER) with Sequence Labeling Models

### Project Submission Due: Dec 16th, 2022 (11.59PM)
Please keep your code (notebook) up to date in your **GitHub repo**.

# **Introduction** 🔎

---

In this project, you will implement a model that identifies named entities in text and tags them with the appropriate label; that is, the task is **Named Entity Recognition** (NER). A primer on this task is provided below. The dataset is a modified version of the CoNLL-2003 dataset ([Sang et al](https://arxiv.org/pdf/cs/0306050v1.pdf)). Please use the datasets that we have released to you rather than versions found online, because we have simplified the data for your benefit. We will treat NER as a **sequence-tagging task**: for each token in the input text, assign one of the following 5 labels: **ORG** (Organization), **PER** (Person), **LOC** (Location), **MISC** (Miscellaneous), and **O** (not a named entity). More information about the dataset is provided in Part 1.

For this project, you may implement any approach you like (e.g., a model covered in the lecture).
<!-- - Model 1 : a Hidden Markov Model (HMM) -->
<!-- - Model 2 : a Maximum Entropy Markov Model (MEMM), which is an adaptation of an HMM in which a Logistic Regression classifier (also known as a MaxEnt classifier) is used to obtain the lexical generation probabilities (i.e., the observation/emission probability matrix, so "observations" == "emissions" == "lexical generations"). Feature engineering is strongly suggested for this model! -->

<!-- Implementation of the Viterbi algorithm (for finding the most likely tag sequence to assign to an input text) is required for both models above, so make sure that you understand it ASAP. -->

<!-- You will implement and train two sequence tagging models, generate your predictions for the provided test set, and submit them to **Kaggle**. Please enter all code in this colab notebook and answer all the questions in the supporting document. -->

<!-- To refresh your memory on HMMs, MEMMs, and Viterbi you can refer to **Jurafsky & Martin Ch. 8.3–8.5** and the lecture slides which can be found on EdStem. -->

## **Named Entity Recognition: A Primer**

---

Let us now take a look at the task at hand: Named Entity Recognition (NER). This section provides a brief introduction to the task and why it is important.

**What is NER?**
As we've covered in the lecture, NER refers to the information extraction technique of identifying and categorizing key information about entities within textual data. Let's look at an example: 

<br/>

![picture](https://drive.google.com/uc?id=1mxwn1_2Ef16_MJeyl9jJwwR6IohUOeHO)

<br/>

In the above example, the text contains several named entities that can be categorized as LOC (location), ORG (organization), PER (person), etc.
To read more on NER, see either of these Medium posts: [1](https://umagunturi789.medium.com/everything-you-need-to-know-about-named-entity-recognition-2a136f38c08f) and [2](https://medium.com/mysuperai/what-is-named-entity-recognition-ner-and-how-can-i-use-it-2b68cf6f545d).

## **Entity Level Mean F1**

---

Let's take a look at the metrics that you will focus on in this project. The standard measures to report for NER are recall, precision, and F1 score (also called F-measure), evaluated at the **named entity level** (not at the token level). Code for computing them is provided in the validation section of Part 2. Please use this code when evaluating your models.

If P and T are the sets of predicted and true *named entity spans*, respectively, (e.g, the five named entity spans in the above example are "Zifa", "Renate Goetschl", "Austria", "World Cup", and "Germany") then

####<center>Precision = $\frac{|\text{P}\;\cap\;\text{T}|}{|\text{P}|}$ and Recall = $\frac{|\text{P}\;\cap\;\text{T}|}{|\text{T}|}$ .</center><br/>


####<center>F1 = $\frac{2 * \text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}}$. </center><br/>

For each type of named entity, e.g. *LOC*ation, *MISC*ellaneous, *ORG*anization and *PER*son, we calculate the F1 score as shown above, and take the mean of all these F1 scores to get the **Entity Level Mean F1** score for the test set. If $N$ is the total number of labels (i.e., named entity types), then

####<center>Entity Level Mean F1 = $\frac{\sum_{i = 1}^{N} \text{F1}_{{label}_i}}{N}$. </center>

More details under the validation section in Part 2.
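
To make the formulas above concrete, here is a minimal sketch that computes precision, recall and F1 per entity type from sets of spans and then averages the F1 scores. The helper name `span_f1` and the span values are made up for illustration; the `mean_f1` function provided in Part 2 remains the reference for scoring.

```python
def span_f1(pred_spans, true_spans):
    """Precision, recall and F1 for one entity type, given sets of (start, end) spans."""
    correct = len(pred_spans & true_spans)   # |P ∩ T|: exact span matches only
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical predicted and true spans, keyed by entity type.
pred = {"PER": {(2, 3)}, "LOC": {(5, 5), (15, 15)}, "MISC": {(10, 10)}}
true = {"PER": {(2, 3)}, "LOC": {(5, 5), (15, 15)}, "MISC": {(10, 11)}}

per_label_f1 = {label: span_f1(pred[label], true[label])[2] for label in true}
entity_level_mean_f1 = sum(per_label_f1.values()) / len(per_label_f1)
print(per_label_f1)
print(entity_level_mean_f1)
```

Note that the predicted MISC span (10, 10) overlaps the true span (10, 11) but earns no credit, because only exact span matches count at the entity level.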



# **Part 1: Dataset**

Load the dataset as follows:
  1. Obtain the data from eLearning.
  2. Unzip the data. Put it into your Google Drive, and mount it on Colab as shown below:

In [1]:
!pip install transformers seqeval datasets evaluate

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16180 sha256=7bfd333c2bdf5ac8a55febd67f14e9e8ab134a839e1a9dcbb41414bd8c745abb
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval, evaluate
Successfully installed evaluate-0.4.0 seqeval-1.2.2

In [2]:
# from google.colab import drive
import os
import json
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# drive.mount('/content/drive', force_remount=True)

In [3]:
# TODO: please change the line below with your drive organization
# path = os.path.join(os.getcwd(), "drive", "MyDrive", "nlp-final-exam", "dataset")

path = os.path.join("/", "kaggle", "input", "nerdataset")

with open(os.path.join(path,'train.json'), 'r') as f:
     train = json.loads(f.read())

with open(os.path.join(path,'test.json'), 'r') as f:
     test = json.loads(f.read())

Here are a few things to note about the dataset above:
1. We have just loaded 2 JSON files: train and test. Please note that these files differ from the original release of CoNLL-2003, since we have already processed and tokenized them for you. Hence, the documents are represented as lists of strings. **Note that the data is not split into separate training and development/validation sets**. You will need to do this yourself as needed using the train set.
2. The train file contains the following 4 fields (each is a nested list): 
  - **'text'** - actual input tokens
  - **'NER'** - the token-level entity tag (ORG/PER/LOC/MISC/O) where **O is used to denote tokens that are not part of any named entity**
  - **'POS'** - the part of speech tag (will be handy for feature engineering of the MEMM model)
  - **'index'** - index of the token in the dataset
3. The test data only has 'text', 'POS' and 'index' fields. You will need to submit your prediction of the 'NER' tag to Kaggle. More instructions on this later!

Let's take a look at a sample sentence from the dataset!

In [4]:
raw_data = pd.DataFrame(train)

train_data, eval_data = train_test_split(raw_data, test_size=0.1)

eval_data.reset_index(drop=True, inplace=True)
train_data.reset_index(drop=True, inplace=True)

The sentence "Romania state budget soars in June.", for example, has already been tokenized into an array of word tokens. The index array corresponds to the index of each token in the entire dataset (not the sentence). The POS tags and the NER tags correspond to the given indices. For example, the token **Romania** has:
  - index: 0
  - POS: 'NNP'
  - NER: **'LOC'**
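
One way to inspect a sentence in this format is sketched below. The helper `show_sample` is hypothetical, and `toy_train` is a stand-in with the same four fields as `train`; only the values for the first token come from the description above, while the remaining tags are assumed for illustration.

```python
def show_sample(dataset, sentence_idx):
    """Return one row per token: (index, token, POS, NER) for the chosen sentence."""
    rows = list(zip(
        dataset["index"][sentence_idx],
        dataset["text"][sentence_idx],
        dataset["POS"][sentence_idx],
        dataset["NER"][sentence_idx],
    ))
    for index, token, pos, ner in rows:
        print(f"{index:>6}  {token:<12} {pos:<5} {ner}")
    return rows

# Toy stand-in for `train`; only the first token's values come from the text above.
toy_train = {
    "text":  [["Romania", "state", "budget"]],
    "POS":   [["NNP", "NN", "NN"]],       # assumed tags, for illustration
    "NER":   [["LOC", "O", "O"]],         # assumed tags, for illustration
    "index": [[0, 1, 2]],
}
rows = show_sample(toy_train, 0)
```

Calling `show_sample(train, 0)` on the loaded data should work the same way, assuming `train` is the dictionary of nested lists described earlier.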

In [5]:
from datasets import Dataset

In [6]:
labels = ['O', 'PER', 'LOC', 'ORG', 'MISC']

id_to_label = {i: label for i, label in enumerate(labels)}
label_to_id = {v: k for k, v in id_to_label.items()}

In [7]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification

checkpoint = 'dslim/bert-base-NER'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForTokenClassification(tokenizer)


In [8]:
def align_with_labels(word_ids, entities):
    """Expand word-level values to sub-word tokens; special tokens get -100."""
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] have no source word.
            label_ids.append(-100)
        else:
            # Every sub-word piece inherits the value of its source word.
            label_ids.append(entities[word_idx])
    return label_ids

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['text'], truncation=True, is_split_into_words=True)
    labels = []
    true_indices = []
    tokens = []
    for i, seq_label in enumerate(examples['NER']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = align_with_labels(word_ids, [label_to_id[l] for l in seq_label])
        indices = align_with_labels(word_ids, examples['index'][i])
        tokens.append(tokenized_inputs.tokens(batch_index=i))
        true_indices.append(indices)
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    tokenized_inputs['indices'] = true_indices
    tokenized_inputs['tokens'] = tokens
    return tokenized_inputs
 
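# NOTE: `raw_data` is the full training set, so the sentences held out in `eval_data`
# are also trained on here; map the split `train_data` instead for an unbiased validation score.
train_data = Dataset.from_pandas(raw_data).map(tokenize_and_align_labels, batched=True, load_from_cache_file=False)
eval_data = Dataset.from_pandas(eval_data).map(tokenize_and_align_labels, batched=True, load_from_cache_file=False)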
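
The alignment step above can be illustrated without downloading a tokenizer. The sketch below uses a hand-written `word_ids` list; the sub-word split, the dataset indices, and the helper name `expand_to_subwords` are all made up for illustration. In the real pipeline, `tokenized_inputs.word_ids(batch_index=i)` supplies this list.

```python
def expand_to_subwords(word_ids, word_values, ignore_value=-100):
    """Give every sub-word piece the value of its source word; special tokens get ignore_value."""
    return [ignore_value if w is None else word_values[w] for w in word_ids]

# Hypothetical tokenisation of ["Renate", "Goetschl", "wins"]:
# [CLS] Rena ##te Go ##ets ##ch ##l wins [SEP]
word_ids = [None, 0, 0, 1, 1, 1, 1, 2, None]
word_labels = [1, 1, 0]          # PER, PER, O under the label_to_id mapping above
word_indices = [40, 41, 42]      # made-up dataset indices

aligned_labels = expand_to_subwords(word_ids, word_labels)
aligned_indices = expand_to_subwords(word_ids, word_indices)
print(aligned_labels)
print(aligned_indices)
```

Because every sub-word piece carries its word's dataset index, the same index appears several times in a row; this is why `map_labels_to_indices` later keeps only the first occurrence of each index.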


In [9]:
import evaluate
metric = evaluate.load('seqeval')

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[id_to_label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id_to_label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }


# **Part 2: Your Model**

---

In this section, you will implement a model that is covered in the lecture.
Feel free to work in another notebook or locally; either way, push your code to your GitHub repo.

The following code is for evaluation and submission to Kaggle; feel free to copy it into your implementation.


In [10]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    id2label=id_to_label,
    label2id=label_to_id,
    ignore_mismatched_sizes=True
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Some weights of BertForTokenClassification were not initialized from the model checkpoint at dslim/bert-base-NER and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([9, 768]) in the checkpoint and torch.Size([5, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([5]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
os.environ["WANDB_DISABLED"] = "true"

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    'bert-finetuned-ner',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=20,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

trainer.train()

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
The following columns in the training set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: indices, NER, text, index, POS, tokens. If indices, NER, text, index, POS, tokens are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 756
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 960


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.035706 | 0.932418 | 0.95786 | 0.944968 | 0.990517 |
| 2 | No log | 0.023226 | 0.948447 | 0.973298 | 0.960712 | 0.993588 |
| 3 | No log | 0.016633 | 0.963048 | 0.978638 | 0.970781 | 0.995376 |
| 4 | No log | 0.013556 | 0.959511 | 0.98281 | 0.971021 | 0.996025 |
| 5 | No log | 0.011765 | 0.971262 | 0.984229 | 0.977702 | 0.996592 |
| 6 | No log | 0.00925 | 0.970386 | 0.987066 | 0.978655 | 0.997379 |
| 7 | No log | 0.008136 | 0.97323 | 0.988985 | 0.981045 | 0.997512 |
| 8 | No log | 0.006748 | 0.975086 | 0.989569 | 0.982274 | 0.997926 |
| 9 | No log | 0.006088 | 0.977924 | 0.990654 | 0.984248 | 0.998079 |
| 10 | No log | 0.005457 | 0.980035 | 0.991238 | 0.985605 | 0.998237 |


The following columns in the evaluation set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: indices, NER, text, index, POS, tokens. If indices, NER, text, index, POS, tokens are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 756
  Batch size = 16

TrainOutput(global_step=960, training_loss=0.021337110052506127, metrics={'train_runtime': 1142.351, 'train_samples_per_second': 13.236, 'train_steps_per_second': 0.84, 'total_flos': 3875215620162120.0, 'train_loss': 0.021337110052506127, 'epoch': 20.0})

## **Validation Step (after your implementation)**

---

In this part of the project, we expect you to split the training data into train and validation datasets. You may use whatever split you see fit and use any external libraries to perform this split. You may want to look into the following function for splitting data: [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
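
Whichever tool you use, fixing the random seed keeps the split identical across runs, so validation scores stay comparable (`train_test_split` accepts a `random_state` argument for this purpose). The same idea is sketched below using only the standard library rather than scikit-learn; the helper name `split_sentences` and the 10% validation fraction are assumptions for illustration.

```python
import random

def split_sentences(num_sentences, valid_fraction=0.1, seed=0):
    """Return (train_ids, valid_ids): disjoint lists of sentence positions."""
    ids = list(range(num_sentences))
    random.Random(seed).shuffle(ids)          # fixed seed -> same split every run
    num_valid = max(1, int(num_sentences * valid_fraction))
    return sorted(ids[num_valid:]), sorted(ids[:num_valid])

# 756 is the number of examples reported in the training log above.
train_ids, valid_ids = split_sentences(756)
print(len(train_ids), len(valid_ids))
```

Splitting at the sentence level, as here, keeps all tokens of a sentence on the same side of the split.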

Once you have split the data, train your model on the training data and evaluate it on the validation data. Report **Entity Level Mean F1**, which was explained earlier. Please use the code we have provided below to compute this metric.

Please also take a look into your misclassified cases, as we will be performing error analysis in the *Evaluation* section. We expect smoothing, unknown word handling and correct emission (i.e., lexical generation) probabilities.

Consider the example below. After getting a sequence of NER labels for the sequence of tokens from your Viterbi algorithm implementation, you need to convert the sequence of tokens, associated token indices and NER labels into a format which can be used to calculate **Entity Level Mean F1**. We do this by finding the starting and ending indices of the spans representing each entity (as given in the corpus) and adding it to a list that is associated with the label with which the spans are labelled. To score your validation data on Google Colab or your local device, you can get a dictionary format as shown in the picture below from the function **format_output_labels** of both the predicted and true label sequences, and use the two dictionaries as input to the **mean_f1** function.

NOTE: We do **not** include the spans of the tokens labelled as "O" in the formatted dictionary output.

![picture](https://docs.google.com/uc?export=download&id=1M57DEHgfusVPU_hlvmiOpkS3yn9GGEgj)

In [12]:
def format_output_labels(token_labels, token_indices):
    """
    Returns a dictionary that has the labels (LOC, ORG, MISC or PER) as the keys, 
    with the associated value being the list of entities predicted to be of that key label. 
    Each entity is specified by its starting and ending position indicated in [token_indices].

    Eg. if [token_labels] = ["ORG", "ORG", "O", "O", "ORG"]
           [token_indices] = [15, 16, 17, 18, 19]
        then dictionary returned is 
        {'LOC': [], 'MISC': [], 'ORG': [(15, 16), (19, 19)], 'PER': []}

    :parameter token_labels: A list of token labels (eg. PER, LOC, ORG or MISC).
    :type token_labels: List[String]
    :parameter token_indices: A list of token indices (taken from the dataset) 
                              corresponding to the labels in [token_labels].
    :type token_indices: List[int]
    """
    label_dict = {"LOC":[], "MISC":[], "ORG":[], "PER":[]}
    prev_label = token_labels[0]
    start = token_indices[0]
    for idx, label in enumerate(token_labels):
        if prev_label != label:
            # The label changed, so the previous span ended on the previous token.
            end = token_indices[idx-1]
            if prev_label != "O":
                label_dict[prev_label].append((start, end))
            start = token_indices[idx]
        prev_label = label
        if idx == len(token_labels) - 1:
            # Close the span that is still open at the last token.
            if prev_label != "O":
                label_dict[prev_label].append((start, token_indices[idx]))
    return label_dict

In [13]:
# Code for mean F1

import numpy as np

def mean_f1(y_pred_dict, y_true_dict):
    """ 
    Calculates the entity-level mean F1 score given the actual/true and 
    predicted span labels.
    :parameter y_pred_dict: A dictionary containing predicted labels as keys and the 
                            list of associated span labels as the corresponding
                            values.
    :type y_pred_dict: Dict<key [String] : value List[Tuple]>
    :parameter y_true_dict: A dictionary containing true labels as keys and the 
                            list of associated span labels as the corresponding
                            values.
    :type y_true_dict: Dict<key [String] : value List[Tuple]>

    Implementation modified from original by author @shonenkov at
    https://www.kaggle.com/shonenkov/competition-metrics.
    """
    F1_lst = []
    for key in y_true_dict:
        preds = y_pred_dict[key]
        trues = y_true_dict[key]
        num_true = len(trues)
        num_pred = len(preds)
        # A true span counts as correct only on an exact (start, end) match.
        num_correct = sum(1 for true in trues if true in preds)
        if num_true == 0:
            continue    # skip labels that never occur in the true data
        if num_pred != 0 and num_correct != 0:
            R = num_correct / num_true
            P = num_correct / num_pred
            F1 = 2*P*R / (P + R)
        else:
            F1 = 0      # either no predictions or no correct predictions
        F1_lst.append(F1)
    return np.mean(F1_lst)

In [None]:
# Usage using above example

pred_token_labels = ["ORG", "O", "PER", "PER", "O", "LOC", "O", "O", "O", "O", "MISC", "O", "O", "O", "O", "LOC"]
true_token_labels = ["ORG", "O", "PER", "PER", "O", "LOC", "O", "O", "O", "O", "MISC", "MISC", "O", "O", "O", "LOC"]
token_indices = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]

y_pred_dict = format_output_labels(pred_token_labels, token_indices)
print("y_pred_dict is : " + str(y_pred_dict))
y_true_dict = format_output_labels(true_token_labels, token_indices)
print("y_true_dict is : " + str(y_true_dict))

print("Entity Level Mean F1 score is : " + str(mean_f1(y_pred_dict, y_true_dict)))

In [17]:
# Evaluate/validate your model here
output = trainer.predict(train_data)
output

The following columns in the test set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: indices, NER, text, index, POS, tokens. If indices, NER, text, index, POS, tokens are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 756
  Batch size = 16


PredictionOutput(predictions=array([[[ 8.7458048e+00, -2.2759044e+00, -2.3415825e+00, -2.0738673e+00,
         -2.1761770e+00],
        [-2.9750774e+00, -1.9323852e+00,  7.8483944e+00, -1.4696587e+00,
         -2.0525610e+00],
        [ 9.3956213e+00, -2.7651374e+00, -2.3287251e+00, -1.9988110e+00,
         -2.4755347e+00],
        ...,
        [-1.0000000e+02, -1.0000000e+02, -1.0000000e+02, -1.0000000e+02,
         -1.0000000e+02],
        [-1.0000000e+02, -1.0000000e+02, -1.0000000e+02, -1.0000000e+02,
         -1.0000000e+02],
        [-1.0000000e+02, -1.0000000e+02, -1.0000000e+02, -1.0000000e+02,
         -1.0000000e+02]],

       [[ 8.8672285e+00, -2.0173681e+00, -2.3369391e+00, -2.2393041e+00,
         -2.3002679e+00],
        [-1.2249304e+00, -1.9342363e+00, -1.1852957e+00, -2.1441886e+00,
          7.5561209e+00],
        [ 9.3757601e+00, -2.4990096e+00, -2.4031560e+00, -2.2048471e+00,
         -2.5591688e+00],
        ...,
        [-1.0000000e+02, -1.0000000e+02, -1.0000000e

In [34]:
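def map_labels_to_indices(labels, indices):
    """Flatten per-sequence label ids to one label per word, keyed by dataset index."""
    word_labels = []
    word_indices = []
    seen = set()
    for seq_labels, seq_indices in zip(labels, indices):
        for label, index in zip(seq_labels, seq_indices):
            # Skip special tokens (-100) and all but the first sub-word of each word.
            if index != -100 and index not in seen:
                word_labels.append(id_to_label[label])
                word_indices.append(index)   # a list keeps indices in the same order as the labels
                seen.add(index)
    return (word_labels, word_indices)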

In [33]:
# test map_labels_to_indices

print(map_labels_to_indices([[1, 2, 3]], [[0, 1, 1]]))

[1, 2, 3]
[0, 1, 1]
(['PER', 'LOC'], [0, 1])


In [35]:
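# Use the predicted label ids (argmax over the logits); output.label_ids holds the gold labels.
vlabels, vindices = map_labels_to_indices(np.argmax(output.predictions, axis=-1), train_data['indices'])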

# for seq, seq_ind in zip():
#     for i, j in zip(seq, seq_ind):
#         if j != -100 and j not in vindices:
#             vlabels.append(id_to_label[i])
#             vindices.append(j)

In [36]:
valid_true_labels = []
valid_true_indices = []

for i, true_labels in enumerate(train_data['NER']):
    valid_true_labels.extend(true_labels)
    valid_true_indices.extend(train_data['index'][i])
    
y_pred_dict = format_output_labels(vlabels, vindices)
y_true_dict = format_output_labels(valid_true_labels, valid_true_indices)

print("Entity Level Mean F1 score is : " + str(mean_f1(y_pred_dict, y_true_dict)))

Entity Level Mean F1 score is : 0.9492546016632916


# **Part 3: Kaggle Submission**
---

Using the best-performing system from among all of your models, generate predictions for the test set, and submit them to Kaggle at https://www.kaggle.com/competitions/cs6320final. Note that you **need** to use our tokenization, as the labels on Kaggle correspond to it. Below, we provide a function that writes the predicted labels and associated token indices to a file in the correct format. As a scoring metric on Kaggle, we use **Entity Level Mean F1**.

Your submission to Kaggle should be a CSV file consisting of five lines and two columns. The first line is a fixed header, and each of the remaining four lines corresponds to one of the four types of named entities. The first column is the label identifier *Id* (one of PER, LOC, ORG or MISC), and the second column *Predicted* is a list of entities (separated by a single space) that you predict to be of that type. Each entity is specified by its starting and ending index (joined by a hyphen) as given in the test corpus.

You can use the function **create_submission**, which takes the list of predicted labels and the list of associated token indices as inputs and creates the output CSV file at a specified path.

NOTE: Ensure that there are **no** rows with *Id* = "O" in your Kaggle Submission.

![picture](https://docs.google.com/uc?export=download&id=1pQkAyOdWQz62jB-YBaj8mHuwI6iWJ1GZ)
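
To preview the expected file layout without touching the disk, the sketch below renders a span dictionary as CSV text in memory. The helper name `spans_to_csv` and the span values are hypothetical; it mirrors the row format described above rather than replacing **create_submission**.

```python
import csv
import io

def spans_to_csv(label_dict):
    """Render {label: [(start, end), ...]} as the two-column submission CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Id", "Predicted"], lineterminator="\n")
    writer.writeheader()
    for label, spans in label_dict.items():
        writer.writerow({"Id": label,
                         "Predicted": " ".join(f"{s}-{e}" for s, e in spans)})
    return buffer.getvalue()

# Hypothetical spans; real ones come from format_output_labels on test predictions.
example = {"LOC": [(18, 18), (28, 28)], "MISC": [(23, 23)], "ORG": [(13, 13)], "PER": [(15, 16)]}
csv_text = spans_to_csv(example)
print(csv_text)
```

The result has one header line plus one line per entity type, with no row for "O".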

In [37]:
import csv

def create_submission(output_filepath, token_labels, token_inds):
    """
    :parameter output_filepath: The full path (including file name) of the output file, 
                                with extension .csv
    :type output_filepath: [String]
    :parameter token_labels: A list of token labels (eg. PER, LOC, ORG or MISC).
    :type token_labels: List[String]
    :parameter token_inds: A list of token indices (taken from the dataset) 
                           corresponding to the labels in [token_labels].
    :type token_inds: List[int]
    """
    label_dict = format_output_labels(token_labels, token_inds)
    # newline='' stops the csv module from writing blank rows on Windows
    with open(output_filepath, mode='w', newline='') as csv_file:
        fieldnames = ['Id', 'Predicted']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for key in label_dict:
            p_string = " ".join([str(start)+"-"+str(end) for start,end in label_dict[key]])
            writer.writerow({'Id': key, 'Predicted': p_string})

In [None]:
# We generate a file with randomized outputs

import random
random.seed(43)

test_pred_labels = []
test_pred_inds = []
for idx, ex in enumerate(test['text']):
  for i, token in enumerate(ex):
    test_pred_labels.append(random.choice(['PER', 'ORG', 'LOC', 'MISC', 'O']))
  test_pred_inds += test['index'][idx]

# generate the file with predictions (the predicted_random.csv entry on kaggle)
create_submission(path + "/predicted.csv", test_pred_labels, test_pred_inds)

In [40]:
test_ds = Dataset.from_pandas(pd.DataFrame(test))

def tokenize_and_align_labels_test(sequence):
    tokenized_inputs = tokenizer(sequence['text'], truncation = True, is_split_into_words = True)
    true_indices = []
    for i, seq_indices in enumerate(sequence['index']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        indices = align_with_labels(word_ids, seq_indices)
        true_indices.append(indices)
    tokenized_inputs['indices'] = true_indices
    return tokenized_inputs

test_data = test_ds.map(tokenize_and_align_labels_test, batched = True, load_from_cache_file=False)


In [41]:
label_ids = []

test_output = trainer.predict(test_data)

test_labels, test_indices = map_labels_to_indices(np.argmax(test_output.predictions, axis=2), test_data['indices'])

The following columns in the test set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: POS, indices, text, index. If POS, indices, text, index are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 189
  Batch size = 16


In [43]:
create_submission('./predicted2.csv', test_labels, test_indices)

In [None]:
def get_vocab_size(sentences):
    words = set()
    for sentence in sentences:
        for word in sentence:
            words.add(word)
    return len(words)

print(get_vocab_size(raw_data['text']))
print(get_vocab_size(test_ds['text']))
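
Vocabulary sizes alone do not show how much the two sets overlap. As a complement, the sketch below measures the fraction of test tokens that never appear in the training sentences, which is relevant to the unknown-word handling mentioned in the validation section. The helper name `oov_rate` and the toy sentences are made up for illustration.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens whose surface form never appears in the training sentences."""
    train_vocab = {word for sentence in train_sentences for word in sentence}
    test_tokens = [word for sentence in test_sentences for word in sentence]
    unseen = sum(1 for word in test_tokens if word not in train_vocab)
    return unseen / len(test_tokens) if test_tokens else 0.0

# Toy sentences, for illustration only.
toy_train = [["Romania", "state", "budget"], ["budget", "soars"]]
toy_test = [["Austria", "budget", "soars"], ["state", "visit"]]
rate = oov_rate(toy_train, toy_test)
print(rate)
```

On the real data, this would be called as `oov_rate(raw_data['text'], test_ds['text'])`, assuming both columns hold lists of token lists as above.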

In [None]:
# error analysis


tag_err_cnt = { 'LOC': 0, 'ORG': 0, 'MISC': 0, 'PER': 0 }


for key in y_true_dict:
    incorrect = 0
    preds = y_pred_dict[key]
    trues = y_true_dict[key]
    for true in trues:
        if true not in preds:
            incorrect += 1
    tag_err_cnt[key] = incorrect

tag_err_cnt
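
The counts above say how many true spans were missed per type, but not why. One possible refinement, sketched below, separates spans whose boundaries were found but labelled with another type from spans with no exact-boundary prediction at all. The helper name `categorize_misses` and the example dictionaries are hypothetical; the inputs follow the format returned by `format_output_labels`.

```python
def categorize_misses(y_pred_dict, y_true_dict):
    """Split missed true spans into 'wrong_type' (same span, other label) and 'missed'."""
    report = {}
    for label, true_spans in y_true_dict.items():
        # Spans predicted under any other label, mapped to that label.
        other_preds = {span: other
                       for other, spans in y_pred_dict.items() if other != label
                       for span in spans}
        counts = {"wrong_type": 0, "missed": 0}
        for span in true_spans:
            if span in y_pred_dict[label]:
                continue                       # correctly predicted
            if span in other_preds:
                counts["wrong_type"] += 1      # boundaries right, label wrong
            else:
                counts["missed"] += 1          # no exact-boundary prediction at all
        report[label] = counts
    return report

# Hypothetical span dictionaries in the format returned by format_output_labels.
y_true = {"LOC": [(5, 5)], "MISC": [(10, 11)], "ORG": [(0, 0)], "PER": [(2, 3)]}
y_pred = {"LOC": [(5, 5), (0, 0)], "MISC": [(10, 10)], "ORG": [], "PER": [(2, 3)]}
report = categorize_misses(y_pred, y_true)
print(report)
```

In this toy case the ORG span (0, 0) was found but tagged LOC, while the MISC span (10, 11) was missed because the predicted boundary stopped one token short.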