# Assignment 2
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to Moodle. Make sure your code does not depend on local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check them.

## 1 Data

We will conduct experiments on CoNLL-2003, whose training set contains 14041 sentences, each annotated with named entity tags. You can download the dataset from https://data.deepai.org/conll2003.zip. We focus only on the tokens and NER tags, which are the first and last columns in the dataset. The dataset is in the IOB format, a common and simple way to represent named entity tags. For example, the sentence “I went to New York City last week” is annotated as follows:

    I O
    went O
    to O
    New B-LOC
    York I-LOC
    City I-LOC
    last O
    week O
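
To make the tagging scheme concrete, here is a minimal sketch that turns an IOB tag sequence back into entity spans. The helper name `iob_to_spans` is hypothetical and not part of the assignment code. It assumes a `B-` tag always opens a new entity and treats a stray `I-` tag with no open entity as the start of one.

```python
def iob_to_spans(tags):
    """Return (entity_type, start, end) spans, with end exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # Close the open span when the entity ends or a new one begins.
        if etype is not None and (prefix != "I" or label != etype):
            spans.append((etype, start, i))
            start, etype = None, None
        # Open a span on B-, and also on a stray I- with no open entity.
        if prefix in ("B", "I") and etype is None:
            start, etype = i, label
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans


tokens = "I went to New York City last week".split()
tags = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
for etype, start, end in iob_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))  # LOC New York City
```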

In [20]:
def load_data():
    """Load the train/valid/test splits as lists of (tokens, tags) pairs."""
    import os

    data_dir = os.path.join(os.getcwd(), "data")

    def read_split(path):
        # Sentences are separated by blank lines; each non-blank line holds
        # space-separated columns with the token first and the NER tag last.
        sentences = []
        sent, tags = [], []
        with open(path, 'r', encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line == '':  # end of sentence
                    sentences.append((sent, tags))
                    sent, tags = [], []
                else:
                    cols = line.split(' ')
                    sent.append(cols[0])
                    tags.append(cols[-1])
        if sent:  # keep the last sentence if the file has no trailing blank line
            sentences.append((sent, tags))
        return sentences

    train = read_split(os.path.join(data_dir, 'train.txt'))
    valid = read_split(os.path.join(data_dir, 'valid.txt'))
    test = read_split(os.path.join(data_dir, 'test.txt'))

    return train, valid, test

In [21]:
train, valid, test = load_data()
# print(train)
# print(valid)
# print(test)

## 2 Tagger
You will train your tagger on the train set and evaluate it on the dev set. You may then tune the hyperparameters of your tagger to get the best performance on the dev set. Finally, you will evaluate your tagger on the test set to get the final performance.

For background on the IOB format, see https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging).

There are some key points you should pay attention to:

- You will batch the sentences in the dataset to accelerate the training process. To batch the sentences, you may need to pad them to the same length.
- You are free to design the model architecture with (Bi)LSTM or Transformer units for each part, but please do not use any pretrained weights in your basic taggers.
- You will adjust the hyperparameters of your tagger to get the best performance on the dev set. The hyperparameters include the learning rate, batch size, the number of hidden units, the number of layers, the dropout rate, etc.
- You will use seqeval to evaluate your tagger on the dev set and the test set.
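
The batching point can be illustrated with a small plain-Python sketch. It pads sequences of token ids to a common length and builds a mask marking real tokens. The function name `pad_batch` and the choice of `0` as the padding id are assumptions made for illustration. In PyTorch, `torch.nn.utils.rnn.pad_sequence` performs the padding step on tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the longest length and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # mask[i][j] is 1 for a real token and 0 for padding
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch = [[5, 2, 9], [7], [4, 4]]
padded, mask = pad_batch(batch)
print(padded)  # [[5, 2, 9], [7, 0, 0], [4, 4, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The mask is what lets the loss and the evaluation ignore padded positions later on.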


### 2.1 LSTM Tagger
We will first use an LSTM tagger to solve the NER problem. There is a very simple implementation of the LSTM tagger on the PyTorch website: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html. You can refer to this implementation to implement your LSTM tagger.


In [22]:
# Author: Robert Guthrie

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f4c60acf370>

In [23]:
# prepare the data
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Tags follow the IOB scheme: B-XXX marks the first token of an entity,
# I-XXX a following token of the same entity, and O a token outside any entity.
training_data = train
word_to_ix = {}
tag_to_ix = {}  # Assign each tag with a unique index
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
    for tag in tags:
        if tag not in tag_to_ix:  # tag has not been assigned an index yet
            tag_to_ix[tag] = len(tag_to_ix)  # Assign each tag a unique index

# print(word_to_ix)
print(tag_to_ix)

# Embedding and hidden-state sizes (hyperparameters to tune on the dev set).
EMBEDDING_DIM = 64
HIDDEN_DIM = 64

{'O': 0, 'B-ORG': 1, 'B-MISC': 2, 'B-PER': 3, 'I-PER': 4, 'B-LOC': 5, 'I-ORG': 6, 'I-MISC': 7, 'I-LOC': 8}
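
The vocabulary above is built from the training sentences only, so words that occur only in the dev or test set have no index. One common way to handle this, shown here as a sketch rather than as what this notebook does, is to reserve indices for a padding token and an unknown-word token. The names `<PAD>`, `<UNK>`, `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_vocab(sentences, min_count=1):
    """Map each word seen at least min_count times to an index; 0 and 1 are reserved."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = {PAD: 0, UNK: 1}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert words to indices, falling back to the UNK index for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


vocab = build_vocab([["EU", "rejects", "German", "call"], ["German", "lamb"]])
print(encode(["German", "beef"], vocab))  # [4, 1]
```

Raising `min_count` above 1 sends rare training words to `<UNK>` as well, which gives the model some training signal for the unknown-word embedding.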


In [24]:
# create the model
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

In [25]:
# train the model
LSTM_model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(LSTM_model.parameters(), lr=0.5)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
# with torch.no_grad():
#     inputs = prepare_sequence(training_data[1][0], word_to_ix)
#     tag_scores = LSTM_model(inputs)
#     print(tag_scores)

training_losses = []
for epoch in range(30):  # 30 passes over the training data
    training_loss = 0.0
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        LSTM_model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = LSTM_model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        training_loss += loss.item()

        loss.backward()
        optimizer.step()
    training_losses.append(training_loss)
    print(f'epoch: {epoch+1}, training_loss {training_loss}')

epoch: 1, training_loss 8044.39072116892
epoch: 2, training_loss 5084.696591463069
epoch: 3, training_loss 3706.8099597147675
epoch: 4, training_loss 2754.044553090836
epoch: 5, training_loss 2156.147622264596
epoch: 6, training_loss 1738.0728709315645
epoch: 7, training_loss 1450.3971640910631
epoch: 8, training_loss 1266.0412989755587
epoch: 9, training_loss 1078.70192791935
epoch: 10, training_loss 988.3627979045807
epoch: 11, training_loss 870.9701561983657
epoch: 12, training_loss 805.908555736888
epoch: 13, training_loss 793.5945046868018
epoch: 14, training_loss 768.441322144903
epoch: 15, training_loss 749.7561716216561
epoch: 16, training_loss 694.9810463283322
epoch: 17, training_loss 620.954916520334
epoch: 18, training_loss 625.9417722546631
epoch: 19, training_loss 677.4467900753143
epoch: 20, training_loss 741.728653916519
epoch: 21, training_loss 750.4504607854976
epoch: 22, training_loss 678.9232132514862
epoch: 23, training_loss 602.6331838447898
epoch: 24, training_lo

### 2.2 Transformer Tagger
We will also use a Transformer to solve the NER problem. You can refer to the following link to implement your Transformer tagger: https://pytorch.org/tutorials/beginner/transformer_tutorial.html.

In [26]:
def export_to_file(export_file_path, data):
    """Write one sentence per line: token count, tokens, then tag ids (tab-separated)."""
    tag_to_idx = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
    with open(export_file_path, "w", encoding="utf-8") as f:
        for tokens, tags in data:
            if len(tokens) > 0:  # skip empty sentences
                f.write(
                    str(len(tokens))
                    + "\t"
                    + "\t".join(tokens)
                    + "\t"
                    + "\t".join([str(tag_to_idx[tag]) for tag in tags])
                    + "\n"
                )


# os.mkdir("data")
export_to_file("./data/trans_train.txt", train)
export_to_file("./data/trans_valid.txt", valid)
export_to_file("./data/trans_test.txt", test)

In [27]:
!pip3 install datasets
!wget https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
--2022-11-21 17:58:19--  https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7502 (7.3K) [text/plain]
Saving to: ‘conlleval.py.1’


2022-11-21 17:58:19 (98.4 MB/s) - ‘conlleval.py.1’ saved [7502/7502]



In [28]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from collections import Counter
from conlleval import evaluate

In [29]:

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.ffn = keras.Sequential(
            [
                keras.layers.Dense(ff_dim, activation="relu"),
                keras.layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = keras.layers.Dropout(rate)
        self.dropout2 = keras.layers.Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


In [30]:

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, inputs):
        maxlen = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        position_embeddings = self.pos_emb(positions)
        token_embeddings = self.token_emb(inputs)
        return token_embeddings + position_embeddings


In [31]:

class NERModel(keras.Model):
    def __init__(
        self, num_tags, vocab_size, maxlen=128, embed_dim=16, num_heads=2, ff_dim=32
    ):
        super(NERModel, self).__init__()
        self.embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
        self.transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
        self.dropout1 = layers.Dropout(0.1)
        self.ff = layers.Dense(ff_dim, activation="relu")
        self.dropout2 = layers.Dropout(0.1)
        self.ff_final = layers.Dense(num_tags, activation="softmax")

    def call(self, inputs, training=False):
        x = self.embedding_layer(inputs)
        x = self.transformer_block(x)
        x = self.dropout1(x, training=training)
        x = self.ff(x)
        x = self.dropout2(x, training=training)
        x = self.ff_final(x)
        return x


In [32]:

def make_tag_lookup_table():
    iob_labels = ["B", "I"]
    ner_labels = ["PER", "ORG", "LOC", "MISC"]
    all_labels = [(label1, label2) for label2 in ner_labels for label1 in iob_labels]
    all_labels = ["-".join([a, b]) for a, b in all_labels]
    all_labels = ["[PAD]", "O"] + all_labels
    return dict(enumerate(all_labels))


mapping = make_tag_lookup_table()
print(mapping)

tag_to_idx = {mapping[idx]: idx for idx in mapping}
print(tag_to_idx)

{0: '[PAD]', 1: 'O', 2: 'B-PER', 3: 'I-PER', 4: 'B-ORG', 5: 'I-ORG', 6: 'B-LOC', 7: 'I-LOC', 8: 'B-MISC', 9: 'I-MISC'}
{'[PAD]': 0, 'O': 1, 'B-PER': 2, 'I-PER': 3, 'B-ORG': 4, 'I-ORG': 5, 'B-LOC': 6, 'I-LOC': 7, 'B-MISC': 8, 'I-MISC': 9}


In [33]:
all_tokens = []
for record in train:
  for token in record[0]:
    all_tokens.append(token)
    
all_tokens_array = np.array(list(map(str.lower, all_tokens)))

counter = Counter(all_tokens_array)
print(len(counter))

num_tags = len(mapping)
vocab_size = 20000

# We only take (vocab_size - 2) most commons words from the training data since
# the `StringLookup` class uses 2 additional tokens - one denoting an unknown
# token and another one denoting a masking token
vocabulary = [token for token, count in counter.most_common(vocab_size - 2)]

# The StringLookup class will convert tokens to token IDs
lookup_layer = keras.layers.StringLookup(
    vocabulary=vocabulary
)

21010


In [34]:
data_dir = os.path.join(os.getcwd(), "data")
trans_train_path = os.path.join(data_dir, 'trans_train.txt')
trans_valid_path = os.path.join(data_dir, 'trans_valid.txt')
trans_test_path  = os.path.join(data_dir, 'trans_test.txt')

train_data = tf.data.TextLineDataset(trans_train_path)
valid_data = tf.data.TextLineDataset(trans_valid_path)
test_data = tf.data.TextLineDataset(trans_test_path)

# train_data = tf.data.Dataset.from_tensor_slices(train)
# valid_data = tf.data.Dataset.from_tensor_slices(valid)
# test_data = tf.data.Dataset.from_tensor_slices(test)

In [35]:

def map_record_to_training_data(record):
    record = tf.strings.split(record, sep="\t")
    length = tf.strings.to_number(record[0], out_type=tf.int32)
    tokens = record[1 : length + 1]
    tags = record[length + 1 :]
    tags = tf.strings.to_number(tags, out_type=tf.int64)
    tags += 1
    return tokens, tags


def lowercase_and_convert_to_ids(tokens):
    tokens = tf.strings.lower(tokens)
    return lookup_layer(tokens)


# We use `padded_batch` here because each record in the dataset has a
# different length.
batch_size = 32
train_dataset = (
    train_data.map(map_record_to_training_data)
    .map(lambda x, y: (lowercase_and_convert_to_ids(x), y))
    .padded_batch(batch_size)
)
valid_dataset = (
    valid_data.map(map_record_to_training_data)
    .map(lambda x, y: (lowercase_and_convert_to_ids(x), y))
    .padded_batch(batch_size)
)
test_dataset = (
    test_data.map(map_record_to_training_data)
    .map(lambda x, y: (lowercase_and_convert_to_ids(x), y))
    .padded_batch(batch_size)
)

ner_model = NERModel(num_tags, vocab_size, embed_dim=32, num_heads=4, ff_dim=64)

In [36]:

class CustomNonPaddingTokenLoss(keras.losses.Loss):
    def __init__(self, name="custom_ner_loss"):
        super().__init__(name=name)

    def call(self, y_true, y_pred):
        loss_fn = keras.losses.SparseCategoricalCrossentropy(
            from_logits=False, reduction=keras.losses.Reduction.NONE
        )
        loss = loss_fn(y_true, y_pred)
        mask = tf.cast((y_true > 0), dtype=tf.float32)
        loss = loss * mask
        return tf.reduce_sum(loss) / tf.reduce_sum(mask)


loss = CustomNonPaddingTokenLoss()
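
The loss above averages the per-token cross-entropy over non-padding positions only (tag id 0 is `[PAD]`). The same idea in plain Python, as a sketch for a single flattened sequence, is given below. The function name `masked_nll` is hypothetical, and the sketch assumes at least one non-padding token is present.

```python
import math


def masked_nll(y_true, y_pred, pad_id=0):
    """Mean negative log-likelihood over positions whose gold tag is not pad_id."""
    total, count = 0.0, 0
    for gold, probs in zip(y_true, y_pred):
        if gold == pad_id:
            continue  # padded position: contributes nothing to the loss
        total += -math.log(probs[gold])
        count += 1
    return total / count


y_true = [1, 2, 0]                      # the last position is padding
y_pred = [[0.1, 0.8, 0.1],
          [0.2, 0.2, 0.6],
          [0.9, 0.05, 0.05]]            # predicted tag probabilities per position
print(masked_nll(y_true, y_pred))
```

Because the padded position is skipped, changing the probabilities predicted there leaves the loss unchanged.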

In [37]:
ner_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=loss)
ner_model.fit(train_dataset, epochs=10)


# def tokenize_and_convert_to_ids(text):
#     tokens = text.split()
#     return lowercase_and_convert_to_ids(tokens)


# # Sample inference using the trained model
# sample_input = tokenize_and_convert_to_ids(
#     "eu rejects german call to boycott british lamb"
# )
# sample_input = tf.reshape(sample_input, shape=[1, -1])
# print(sample_input)

# output = ner_model.predict(sample_input)
# prediction = np.argmax(output, axis=-1)[0]
# prediction = [mapping[i] for i in prediction]

# # eu -> B-ORG, german -> B-MISC, british -> B-MISC
# print(prediction)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f4d5db31650>

### 2.3 Results
Please report your best performance on the test set in the same format as the test.txt file, but replace the ground truth labels with your predictions.

In [38]:
def LSTM_calculate_accuracy(model, data, tags, words):
    """Return (sentence-level exact-match accuracy, predicted tags per sentence).

    A sentence counts as correct only if every predicted tag matches the gold tag.
    """
    correct = 0
    pred_tags = []
    with torch.no_grad():
        for x, y in data:
            # Map out-of-vocabulary words to '.', used here as a stand-in
            # unknown token, without modifying the caller's data.
            x = [w if w in words else '.' for w in x]
            inputs = prepare_sequence(x, words)
            tag_scores = model(inputs)

            # Pick the highest-scoring tag for each token.
            idx = torch.argmax(tag_scores, dim=1).tolist()
            pred_tag = [tags[i] for i in idx]
            pred_tags.append(pred_tag)

            correct += int(list(y) == list(pred_tag))

    return correct / len(data), pred_tags

In [46]:
# print(tag_to_ix.keys())
tags = np.array([tag for tag in tag_to_ix])

acc_train, LSTM_pred_tags = LSTM_calculate_accuracy(LSTM_model, train, tags, word_to_ix)
print("accuracy of the LSTM model in train data: ", acc_train)

acc_valid, LSTM_pred_tags = LSTM_calculate_accuracy(LSTM_model, valid, tags, word_to_ix)
print("accuracy of the LSTM model in valid data: ", acc_valid)

acc_test, LSTM_pred_tags = LSTM_calculate_accuracy(LSTM_model, test, tags, word_to_ix)
print("accuracy of the LSTM model in test data: ", acc_test)

accuracy of the LSTM model in train data:  0.9117234936945353
accuracy of the LSTM model in valid data:  0.5816503173687247
accuracy of the LSTM model in test data:  0.46878393051031486
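
Note that the accuracy function above compares the whole predicted tag list of a sentence with the gold list, so these figures are sentence-level exact-match rates, not token-level accuracy. The sketch below contrasts the two on a toy example. The function names are hypothetical.

```python
def sentence_accuracy(gold, pred):
    """Fraction of sentences whose tag sequence is predicted exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def token_accuracy(gold, pred):
    """Fraction of individual tokens whose tag is predicted correctly."""
    correct = sum(gt == pt for g, p in zip(gold, pred) for gt, pt in zip(g, p))
    total = sum(len(g) for g in gold)
    return correct / total


gold = [["B-PER", "O", "O"], ["O", "B-LOC"]]
pred = [["B-PER", "O", "O"], ["O", "O"]]
print(sentence_accuracy(gold, pred))  # 0.5
print(token_accuracy(gold, pred))     # 0.8
```

A single wrong tag makes the whole sentence count as wrong, so the sentence-level number is usually much lower than the token-level one.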


In [49]:
def LSTM_output_pred(pred, input_file, output_file):
    """Copy input_file to output_file, replacing the last column with predicted tags."""
    import os

    data_dir = os.path.join(os.getcwd(), "data")
    input_path = os.path.join(data_dir, input_file)
    output_path = os.path.join(data_dir, output_file)

    # Flatten the per-sentence predictions, adding '' for the blank line
    # that ends each sentence so the list lines up with the input file.
    pred_list = []
    for p in pred:
        pred_list.extend(p)
        pred_list.append('')

    with open(input_path, 'r', encoding="utf-8") as f:
        rows = [l.strip().split(' ') for l in f]

    assert len(pred_list) >= len(rows), "fewer predictions than input lines"

    with open(output_path, 'w', encoding="utf-8") as f:
        for row, tag in zip(rows, pred_list):
            row[-1] = tag
            f.write(' '.join(row) + "\n")

    print(f'Result saved in {output_path} successfully.\n')

LSTM_output_pred(LSTM_pred_tags, 'test.txt', '3035470694.lstm.test.txt')

Result saved in /content/data/3035470694.lstm.test.txt successfully.



In [53]:

def transformer_calculate_metrics(dataset):
    all_true_tag_ids, all_predicted_tag_ids = [], []

    for x, y in dataset:
        output = ner_model.predict(x)
        predictions = np.argmax(output, axis=-1)
        predictions = np.reshape(predictions, [-1])

        true_tag_ids = np.reshape(y, [-1])

        mask = (true_tag_ids > 0) & (predictions > 0)
        true_tag_ids = true_tag_ids[mask]
        predicted_tag_ids = predictions[mask]

        all_true_tag_ids.append(true_tag_ids)
        all_predicted_tag_ids.append(predicted_tag_ids)

    all_true_tag_ids = np.concatenate(all_true_tag_ids)
    all_predicted_tag_ids = np.concatenate(all_predicted_tag_ids)

    predicted_tags = [mapping[tag] for tag in all_predicted_tag_ids]
    real_tags = [mapping[tag] for tag in all_true_tag_ids]

    evaluate(real_tags, predicted_tags)

    print(len(real_tags), len(predicted_tags))
    return predicted_tags


In [45]:
train_tags = transformer_calculate_metrics(train_dataset)
valid_tags = transformer_calculate_metrics(valid_dataset)

processed 204567 tokens with 23499 phrases; found: 23969 phrases; correct: 21659.
accuracy:  91.40%; (non-O)
accuracy:  98.48%; precision:  90.36%; recall:  92.17%; FB1:  91.26
              LOC: precision:  93.88%; recall:  96.40%; FB1:  95.12  7332
             MISC: precision:  86.17%; recall:  89.01%; FB1:  87.57  3551
              ORG: precision:  82.59%; recall:  82.98%; FB1:  82.78  6351
              PER: precision:  96.08%; recall:  98.05%; FB1:  97.05  6735
204567 204567
processed 51578 tokens with 5942 phrases; found: 5591 phrases; correct: 4084.
accuracy:  65.62%; (non-O)
accuracy:  93.58%; precision:  73.05%; recall:  68.73%; FB1:  70.82
              LOC: precision:  80.93%; recall:  82.47%; FB1:  81.69  1872
             MISC: precision:  73.20%; recall:  67.25%; FB1:  70.10  847
              ORG: precision:  70.08%; recall:  58.17%; FB1:  63.57  1113
              PER: precision:  66.46%; recall:  63.46%; FB1:  64.93  1759
51578 51578
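
The report above is entity-level: precision is correct phrases over found phrases (on the dev set, 4084 / 5591 ≈ 73.05%) and recall is correct phrases over gold phrases (4084 / 5942 ≈ 68.73%). The following sketch computes the same kind of metric per sentence in plain Python. The helper names are hypothetical, and its handling of malformed tag sequences may differ from conlleval or seqeval.

```python
def spans(tags):
    """Set of (type, start, end) entity spans in one IOB tag sequence."""
    found, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        prefix, _, label = tag.partition("-")
        if etype is not None and (prefix != "I" or label != etype):
            found.add((etype, start, i))
            etype = None
        if prefix in ("B", "I") and etype is None:
            start, etype = i, label
    return found


def entity_prf(gold_sents, pred_sents):
    """Entity-level precision, recall and F1; a span must match exactly to count."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = spans(gold), spans(pred)
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_prf(gold, pred))
```

In this toy example the single predicted span is correct (precision 1.0) but one of the two gold spans is missed (recall 0.5).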


In [55]:
def transformer_output_pred(pred, input_file, output_file):
    """Copy input_file to output_file, replacing the last column with predicted tags.

    `pred` is a flat list with one tag per non-blank line of the input file.
    """
    import os

    data_dir = os.path.join(os.getcwd(), "data")
    input_path = os.path.join(data_dir, input_file)
    output_path = os.path.join(data_dir, output_file)

    with open(input_path, 'r', encoding="utf-8") as f:
        rows = [l.strip().split(' ') for l in f]

    pred_iter = iter(pred)
    with open(output_path, 'w', encoding="utf-8") as f:
        for row in rows:
            if row != ['']:  # blank lines separate sentences and carry no tag
                row[-1] = next(pred_iter)
            f.write(' '.join(row) + "\n")

    print(f'Result saved in {output_path} successfully.\n')


In [56]:
transformer_pred_tags = transformer_calculate_metrics(test_dataset)
transformer_output_pred(transformer_pred_tags, 'test.txt', '3035470694.transformer.test.txt')
transformer_pred_tags = transformer_calculate_metrics(test_dataset)
transformer_output_pred(transformer_pred_tags, 'test.txt', '3035470694.test.out')

processed 46666 tokens with 5648 phrases; found: 5064 phrases; correct: 3306.
accuracy:  56.89%; (non-O)
accuracy:  91.41%; precision:  65.28%; recall:  58.53%; FB1:  61.73
              LOC: precision:  75.07%; recall:  77.10%; FB1:  76.07  1713
             MISC: precision:  62.59%; recall:  59.83%; FB1:  61.18  671
              ORG: precision:  67.20%; recall:  48.22%; FB1:  56.15  1192
              PER: precision:  53.70%; recall:  49.41%; FB1:  51.47  1488
46666 46666
Result saved in /content/data/3035470694.transformer.test.txt successfully.

processed 46666 tokens with 5648 phrases; found: 5064 phrases; correct: 3306.
accuracy:  56.89%; (non-O)
accuracy:  91.41%; precision:  65.28%; recall:  58.53%; FB1:  61.73
              LOC: precision:  75.07%; recall:  77.10%; FB1:  76.07  1713
             MISC: precision:  62.59%; recall:  59.83%; FB1:  61.18  671
              ORG: precision:  67.20%; recall:  48.22%; FB1:  56.15  1192
              PER: precision:  53.70%; recall:  4