# Training and Evaluating an LSTM Model on CoNLL 2003
In this lab we will build our own model for named entity recognition (NER), using a TensorFlow implementation of an LSTM. As in the previous lab, we will train and evaluate the model on the CoNLL 2003 dataset.

## Set-up
In this section we set up the notebook by mounting the drive and doing all the required imports. We use WandB (https://wandb.ai) to monitor our model, so you will need a WandB account. If you prefer not to create one, you can skip the parts related to WandB, but you will need to modify the code accordingly.

Follow the instructions at the following URL to create an account: https://app.wandb.ai/login?signup=true
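
If you decide to skip WandB, one possible way to adapt the code is sketched below. The `USE_WANDB` switch and the `callbacks` list are hypothetical names introduced only for this illustration; the idea is to replace `wandb.config` with a plain attribute container so the rest of the notebook can keep reading `config.<name>`.

```python
from types import SimpleNamespace

USE_WANDB = False  # hypothetical switch: set to True if you have a WandB account

if USE_WANDB:
    import wandb
    from wandb.keras import WandbCallback
    wandb.init(project="ner-conll2003")
    config = wandb.config
    callbacks = [WandbCallback()]
else:
    # Plain attribute container standing in for wandb.config
    config = SimpleNamespace()
    callbacks = []

# Hyperparameters are then set the same way in both cases
config.learning_rate = 0.01
config.epochs = 5
print(config.learning_rate, config.epochs, callbacks)
```

With this approach you would pass `callbacks=callbacks` to `model.fit` and remove the `%%wandb` cell magic from the training cell.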


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# WandB – Install the W&B library
!pip install wandb -q
import wandb
from wandb.keras import WandbCallback

## Loading the data
In this section we provide a function (`load_data_conll`) that loads data in CoNLL format, where each line holds one token along with several levels of annotation. Each line has 4 whitespace-separated columns (word, PoS, chunk and NE tag), and sentences are separated by blank lines. The function returns a list in which each item consists of 2 lists: 1) the sentence as a list of tokens, and 2) the NER tags, one per token. For example:

`[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']]`
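
To make the format concrete, here is a minimal sketch that parses a small in-memory sample instead of a file. The sample text and the `parse_conll` helper are made up for illustration; the helper follows the same idea as `load_data_conll` but is not the function used in the rest of the notebook.

```python
# Hypothetical sample in the 4-column format: word, PoS, chunk, NE tag
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll(text):
    """Return a list of [tokens, ner_tags] pairs, one per sentence."""
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-DOCSTART") or line == "":
            # Document markers and blank lines both close the current sentence
            if words:
                sentences.append([words, tags])
                words, tags = [], []
            continue
        parts = line.split()
        if len(parts) == 4:
            words.append(parts[0])   # first column: the token
            tags.append(parts[-1])   # last column: the NE tag
    if words:  # flush the last sentence when the text has no trailing blank line
        sentences.append([words, tags])
    return sentences

parsed = parse_conll(sample)
for tokens, ner_tags in parsed:
    print(tokens, ner_tags)
```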

In [None]:
# Function for preparing the data in the *.txt files
def load_data_conll(file_path):
    """
    Load the training/testing data.
    input: CoNLL format data, with 4 whitespace-separated columns - words, PoS, Chunk and NE tags.
    output: A list where each item is 2 lists: the sentence as a list of tokens, and the NER tags as a list with one tag per token.
    """
    myoutput, words, tags = [], [], []
    with open(file_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("-DOCSTART"):
                # Skip -DOCSTART- and the next line
                fh.readline()
            elif line == "":
                # Sentence ended (repeated blank lines are ignored)
                if words:
                    myoutput.append([words, tags])
                    words, tags = [], []
            else:
                # Columns: word, pos_tag, chunk_tag, ner_tag
                parts = line.split()
                if len(parts) == 4:
                    words.append(parts[0])
                    tags.append(parts[-1])
    # Keep the last sentence if the file does not end with a blank line
    if words:
        myoutput.append([words, tags])
    return myoutput

In [None]:
work_dir = "drive/MyDrive/Colab Notebooks/nlp-app-II/data"
conll_dir = work_dir + "/conll2003/en"
train_path = conll_dir + "/train.txt"
dev_path = conll_dir + "/valid.txt"
test_path = conll_dir + "/test.txt"

conll_train = load_data_conll(train_path)
conll_dev = load_data_conll(dev_path)

## Data preprocessing
In this section we provide the code that converts the data into a format suitable for the LSTM architecture.

Basically, the data is tokenized, indexed and padded, using the tools in `tensorflow.keras`.
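
The indexing and padding steps can be sketched in plain Python as follows. This is a simplified illustration, not the Keras implementation: the helper names are hypothetical, and ids are assigned in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency. What it does mirror is that id 0 is reserved for padding, unknown words map to the OOV id, and `pad_sequences` pads and truncates at the front by default.

```python
def build_vocab(sentences, oov_token="<OOV>"):
    """Map each word to an integer id; 0 is reserved for padding, 1 for OOV."""
    vocab = {oov_token: 1}
    for sentence in sentences:
        for word in sentence.split(" "):
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(sentence, vocab, oov_token="<OOV>"):
    """Replace each word by its id, falling back to the OOV id."""
    return [vocab.get(word, vocab[oov_token]) for word in sentence.split(" ")]

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

vocab = build_vocab(["EU rejects German call", "German lamb"])
short = pad_pre(encode("British lamb .", vocab), maxlen=5)
long_ = pad_pre(encode("EU rejects German call", vocab), maxlen=3)
print(short)   # unseen words become the OOV id, padding goes in front
print(long_)   # the first token is dropped by truncation
```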

In [None]:
import tensorflow as tf

In [None]:
print("Tokenized text: {}".format(conll_train[0][0]))
print("Tokens labels: {}".format(conll_train[0][1]))

In [None]:
train_sentences = [" ".join(sent[0]) for sent in conll_train]
dev_sentences = [" ".join(sent[0]) for sent in conll_dev]

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='',lower=False, oov_token='<OOV>')
tokenizer.fit_on_texts(train_sentences)

## text to word indices
train_sequences = tokenizer.texts_to_sequences(train_sentences)
dev_sequences = tokenizer.texts_to_sequences(dev_sentences)

## padding: sequences are left-padded with 0 up to 40 tokens; longer ones are truncated from the front
train_data = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=40)
dev_data = tf.keras.preprocessing.sequence.pad_sequences(dev_sequences, maxlen=40)

In [None]:
# process labels
labels = set([label for sent in conll_train for label in sent[1]])
label2index = {name: i for i, name in enumerate(sorted(labels))}
index2label = {label2index[name]:name for name in label2index}
train_labels = []
for sent in conll_train:
    train_labels.append([label2index[label] for label in sent[1]])
train_labels = tf.keras.preprocessing.sequence.pad_sequences(train_labels, maxlen=40)

dev_labels = []
for sent in conll_dev:
    dev_labels.append([label2index[label] for label in sent[1]])
dev_labels = tf.keras.preprocessing.sequence.pad_sequences(dev_labels, maxlen=40)

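# one-hot encoding (num_classes is fixed so that both arrays have the same depth)
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=len(label2index))
dev_labels = tf.keras.utils.to_categorical(dev_labels, num_classes=len(label2index))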
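
The label processing above can be illustrated with a small plain-Python sketch. The tag sequences and helper names below are hypothetical. Note that the padding value 0 coincides with the index of the first label in sorted order; the notebook later discards those positions by checking the word ids, not the labels.

```python
def one_hot(index, num_classes):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# Hypothetical tag sequences for two sentences
tag_sequences = [["B-ORG", "O", "B-MISC"], ["B-PER", "I-PER"]]

labels = sorted({tag for seq in tag_sequences for tag in seq})
label2index = {name: i for i, name in enumerate(labels)}

# Index, pad and one-hot encode the second sentence
padded = pad_pre([label2index[tag] for tag in tag_sequences[1]], maxlen=4)
one_hot_rows = [one_hot(i, len(labels)) for i in padded]

print(label2index)
print(padded)
print(one_hot_rows)
```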

In [None]:
print(train_labels.shape)
print(train_data.shape)

print(dev_labels.shape)
print(dev_data.shape)


## Model definition and training

First, we initialize a wandb session to monitor the evolution of our model, and specify the model's hyperparameters in a config object.

In [None]:
# Initialize a new wandb run (replace `entity` with your own WandB username or team)
wandb.init(entity="oierldl", project="ner-conll2003")

# Config is a variable that holds and saves hyperparameters and inputs
# Default values for hyper-parameters
config = wandb.config 
config.learning_rate = 0.01
config.epochs = 5
config.num_classes = len(labels)
config.batch_size = 128
config.optimizer = 'adam'
config.seed = 42
config.vocab_size = len(tokenizer.index_word)+1
config.emb_size = 300
config.lstm_size = 128
config.dropout = 0.2

### Model definition 
The following lines of code define an LSTM-based sequence labeler. We stack the following Keras layers: 1) an input layer that sets the shape of the input and connects to 2) the embedding layer, which generates the input for 3) the bidirectional LSTM layer. The output of the LSTM is passed through 4) a dropout layer. Note that we set `return_sequences=True` so that we get a representation for each token, to which 5) the dense classification layer is applied automatically.
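
As a rough aid for reading the architecture, the sketch below traces the tensor shape after each layer and counts the trainable weights by hand. The vocabulary size (20000) and the number of tags (9) are assumptions made for the example; in the notebook both are derived from the data. The LSTM count uses the standard formula of four gates, each with input, recurrent and bias weights.

```python
def lstm_params(input_dim, units):
    """Trainable weights of one LSTM direction (4 gates: input, recurrent, bias)."""
    return 4 * ((input_dim + units) * units + units)

vocab_size = 20000                      # assumed; the notebook takes it from the tokenizer
emb_size, lstm_size, num_classes = 300, 128, 9   # num_classes is assumed
batch, seq_len = 128, 40

shapes = {
    "word_ids": (batch, seq_len),
    "embedding": (batch, seq_len, emb_size),
    "bilstm": (batch, seq_len, 2 * lstm_size),   # forward and backward outputs concatenated
    "dropout": (batch, seq_len, 2 * lstm_size),
    "dense": (batch, seq_len, num_classes),      # one distribution over tags per token
}
params = {
    "embedding": vocab_size * emb_size,
    "bilstm": 2 * lstm_params(emb_size, lstm_size),
    "dense": (2 * lstm_size) * num_classes + num_classes,
}

for name, shape in shapes.items():
    print(f"{name:10s} {shape} params={params.get(name, 0)}")
```

Most of the weights sit in the embedding layer, so changing `emb_size` affects model size far more than changing `lstm_size`.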

In [None]:
# model
model = tf.keras.models.Sequential([
            tf.keras.layers.Input(shape=(None,), dtype='int32', name='word_ids'),
            tf.keras.layers.Embedding(config.vocab_size, config.emb_size,
                                      mask_zero=True, trainable=True),
            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=config.lstm_size,
                                                               return_sequences=True)),
            tf.keras.layers.Dropout(config.dropout),
            tf.keras.layers.Dense(config.num_classes, activation='softmax')
           ])

In [None]:
# compile model with the Adam optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=config.learning_rate),
              loss="categorical_crossentropy",
              metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])


### Train the model

In [None]:
%%wandb

model.fit(train_data, train_labels,
          batch_size=config.batch_size, epochs=config.epochs,
          validation_data=(dev_data, dev_labels), callbacks=[WandbCallback()])


## Make predictions

In [None]:
import numpy as np

def make_predictions(x_test, y_test):
    preds = model.predict(x_test)
    preds = np.argmax(preds, axis=-1)
    labs = np.argmax(y_test, axis=-1)
    preds = [y[x != 0] for x, y in zip(x_test, preds)]
    labs = [y[x != 0] for x, y in zip(x_test, labs)]
    return preds, labs
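
The filtering step in `make_predictions` relies on the fact that padded positions have word id 0. A plain-Python sketch of the same idea, using a hypothetical helper name and made-up ids, looks like this:

```python
def strip_padding(word_ids, values, pad_id=0):
    """Keep only the values whose corresponding word id is not padding."""
    return [v for w, v in zip(word_ids, values) if w != pad_id]

# Hypothetical padded sentence (two padding positions) and predicted label ids
word_ids = [0, 0, 7, 3, 1]
predicted = [4, 4, 1, 8, 8]

kept = strip_padding(word_ids, predicted)
print(kept)
```

Out-of-vocabulary words are kept, since they use the OOV id and not 0.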

In [None]:
def dump_to_file(words, labels, preds):
    # One "word gold predicted" line per token, with a blank line between sentences
    with open('output.tsv', 'w', encoding='utf-8') as f:
        for i in range(len(preds)):
            for w, l, p in zip(words[i], labels[i], preds[i]):
                f.write(w + " " + l + " " + p + "\n")
            f.write('\n')



In [None]:
predictions, ground_truth = make_predictions(dev_data, dev_labels)

In [None]:
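trues = [[index2label[i] for i in sentence] for sentence in ground_truth]
preds  = [[index2label[i] for i in sentence] for sentence in predictions]
# pad_sequences keeps only the last `maxlen` tokens of long sentences, so trim the words the same way
max_len = dev_data.shape[1]
words = [sentence.split(" ")[-max_len:] for sentence in dev_sentences]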

In [None]:
dump_to_file(words, trues, preds)

In [None]:
!cp "drive/MyDrive/00-Irakaskuntza/HAP-LAP-masterra/NLP-Applications-2/Part1: Information-extraction/notebooks/conlleval.txt" .

In [None]:
!perl conlleval.txt < output.tsv
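
The script above scores whole entities, not individual tokens. The following is a simplified sketch of that idea in Python, not a reimplementation of `conlleval`: the function names are hypothetical, and an `I-` tag with no matching open entity is simply treated as the start of a new one. An entity counts as correct only if its span and type both match.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence (end exclusive)."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((start, i, current))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return entities

def entity_f1(gold_sentences, pred_sentences):
    """Entity-level precision, recall and F1 over parallel lists of tag sequences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        correct += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: the MISC entity is predicted with the wrong span
gold = [["B-ORG", "O", "B-MISC", "I-MISC", "O"]]
pred = [["B-ORG", "O", "B-MISC", "O", "O"]]
print(entity_f1(gold, pred))
```

In this example the token-level accuracy would be 4 out of 5, while only one of the two entities is counted as correct, which is why entity-level scores are usually lower than token accuracy.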

In [None]:
from sklearn.metrics import confusion_matrix

def print_cm(cm, labels):
    """Pretty print for confusion matrices."""
    print("\n")
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print("    " + empty_cell, end=" ")
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        row_total = 0
        for j in range(len(labels)):
            cell = "%{0}.0f".format(columnwidth) % cm[i, j]
            row_total += int(cell)
            print(cell, end=" ")
        print(row_total)  # Prints the total number of instances per category at the end.

def get_confusion_matrix(y_true, y_pred, labels):
    # Flatten the per-sentence tag lists before building the matrix
    trues, preds = [], []
    for yseq_true, yseq_pred in zip(y_true, y_pred):
        trues.extend(yseq_true)
        preds.extend(yseq_pred)
    print_cm(confusion_matrix(trues, preds, labels=labels), labels)

In [None]:
get_confusion_matrix(trues, preds, [label for label in label2index])

## Exercise 1

Look at the confusion matrix. Note that rows are gold labels and columns are predicted labels. Where is the confusion located?
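
As a reading aid, the sketch below shows how rows and columns translate into per-label recall and precision, and how to locate the largest off-diagonal cell. The labels and counts are made up for illustration and are not results from the model.

```python
labels = ["B-LOC", "B-ORG", "O"]
# Hypothetical counts: cm[i][j] = tokens with gold label i predicted as label j
cm = [[8, 1, 1],
      [2, 5, 3],
      [0, 1, 9]]

# Recall reads along a row (gold label), precision along a column (predicted label)
recall = {l: cm[i][i] / sum(cm[i]) for i, l in enumerate(labels)}
precision = {l: cm[j][j] / sum(row[j] for row in cm) for j, l in enumerate(labels)}

# Largest off-diagonal cell: the most frequent confusion
count, gold, predicted = max(
    (cm[i][j], labels[i], labels[j])
    for i in range(len(labels)) for j in range(len(labels)) if i != j
)

print(recall)
print(precision)
print(f"most frequent confusion: gold {gold} predicted as {predicted} ({count} times)")
```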

## Exercise 2

Try different hyperparameter settings to see whether you can improve the results of the NER model.
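
One simple way to organize such experiments is to enumerate a small grid of settings and train one model per combination. The sketch below only generates the combinations; the search space values are hypothetical and are not recommendations.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only
search_space = {
    "learning_rate": [0.01, 0.001],
    "lstm_size": [64, 128],
    "dropout": [0.2, 0.5],
}

def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

runs = list(grid(search_space))
print(len(runs), "configurations, first:", runs[0])
```

Each dictionary can then be copied into the `config` object before rebuilding and retraining the model, so that every run is logged with its own settings.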