<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Sep 5 - 9, 2022<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>

# Exercise 2

This exercise is about sequence labeling. A sequence of items (words, in this case) must be tagged with a sequence of labels. In this case, the labels are named entity tags in the BIO scheme.

The data we will be using comes from the Groningen Meaning Bank (GMB). Its annotation scheme can be found [here](http://www.let.rug.nl/bjerva/gmb/manual.php). As always, we will first preprocess the data, and then create and train the model.

In [None]:
# limit GPU memory to 4 GB
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        print(e)

We import the data into a pandas dataframe and fill in missing values. Also, let's look at the head of the table. We also directly encode the strings as integers, using the [numpy-function `np.unique(...)`](https://numpy.org/doc/stable/reference/generated/numpy.unique.html). Because `np.unique(...)` also returns the list of unique strings, we can convert the index numbers back into readable tag strings later on.

For padding (see below), we will be using `_____` as a "word", and `O` as a tag. `_____` needs to be added to the lists of unique words as well.

In [None]:
import pandas as pd
import numpy as np

# read in CSV file
data = pd.read_csv("data/ner/gmb.csv",encoding = 'latin1')

# the first column of the file contains the sentence number
# -- but only for the first token of each sentence.
# The following line fills the rows downwards.
# (`ffill()` replaces the deprecated `fillna(method = 'ffill')`)
data = data.ffill()

# create a list of unique words and assign an integer number to it
unique_words, coded_words = np.unique(data["Word"], return_inverse=True)
data["Word_idx"] = coded_words
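# the padding symbol gets the first free index ...
EMPTY_WORD_IDX = len(unique_words)
# ... and is appended to the vocabulary, so that unique_words[EMPTY_WORD_IDX] == "_____"
# (list.append() returns None, so we use np.append() and keep the result)
unique_words = np.append(unique_words, "_____")
num_words = len(unique_words)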

# create a list of unique tags and assign an integer number to it
unique_tags, coded_tags = np.unique(data["Tag"], return_inverse=True)
data["Tag_idx"]  = coded_tags
NO_TAG_IDX = unique_tags.tolist().index("O")
num_words_tag = len(unique_tags)

# for verification and inspection, we print out the table so far
data[1:20]
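
To illustrate what `np.unique(..., return_inverse=True)` returns, here is a minimal sketch on a toy token list (the tokens are made up for illustration and are not taken from the GMB file). It also shows how the integer codes can be mapped back to strings.

```python
import numpy as np

# toy token list (hypothetical data, for illustration only)
tokens = np.array(["the", "cat", "saw", "the", "dog"])

# vocab is the sorted array of unique strings;
# codes holds, for every token, its position in vocab
vocab, codes = np.unique(tokens, return_inverse=True)
print(vocab)
print(codes)

# indexing vocab with the codes recovers the original strings
decoded = vocab[codes]
print(decoded)
```

The same indexing trick (`unique_tags[...]`) is what turns predicted tag indices back into tag strings.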

In this step, we convert the table in such a way that we can access individual sentences. The result of the function is a list of lists of tuples, with each tuple containing the index of a word and the index of its named entity tag.

In [None]:
def get_sentences(data):
    # collect one list of (word index, tag index) pairs per sentence
    agg_func = lambda s:[(w,t) for w,t in zip(s["Word_idx"].values.tolist(),
                                                     s["Tag_idx"].values.tolist())]
    grouped = data.groupby("Sentence #").apply(agg_func)
    return [s for s in grouped]


sentences = get_sentences(data)

# print out the first sentence for verification
print(sentences[0])

# extract list of tokens and list of ne tags
x = [ [ w[0] for w in s ] for s in sentences ]
y = [ [ w[1] for w in s ] for s in sentences ]

## Task 1: Padding

Now that we have sentences and tags encoded as integer values and in individual lists, we need to make sure that every input has the same length. This is called "padding", and the simple solution is to extend all shorter sequences with a null value, so that they are all as long as the longest one.

The padding can be done in two steps:
1. Find out how long the longest sentence is (hint: list comprehension!).
2. Use the keras function [`pad_sequences()`](https://keras.io/api/preprocessing/timeseries/) to do the padding on the x and y variables.


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# find the maximum length for the sentences
max_len = max([len(s) for s in x])

# shorter sentences are now padded to same length, using (index of) padding symbol
x = pad_sequences(maxlen = max_len, sequences = x, padding = 'post', value = EMPTY_WORD_IDX)

# we do the same for the y data
y = pad_sequences(maxlen = max_len, sequences = y, padding = 'post', value = NO_TAG_IDX)
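
To make the effect of post-padding concrete, here is a minimal NumPy sketch on toy data. It is not the Keras implementation: `pad_post` is a hypothetical helper written only for this illustration, and the value `9` merely stands in for the padding index.

```python
import numpy as np

def pad_post(sequences, value):
    """Pad every sequence at the end with `value`, up to the longest length."""
    longest = max(len(s) for s in sequences)
    # start from a matrix filled with the padding value ...
    padded = np.full((len(sequences), longest), value, dtype=int)
    # ... and copy each sequence into the beginning of its row
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# three toy "sentences" of different lengths (hypothetical word indices)
toy_x = [[4, 7, 1], [2], [5, 3]]
toy_padded = pad_post(toy_x, value=9)
print(toy_padded)
```

After padding, all rows have the same length, so the data fits into a single rectangular array.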


## Task 2: One-Hot-Encoding

Named entity recognition as done here is a multiclass classification problem: Each token is assigned one of more than two possible classes (`unique_tags` from before contains the list of classes). A neural network solves a multiclass classification problem by producing a vector of probabilities as output, with one probability for each class. We can then easily extract the class with the highest probability as the prediction.

Therefore, we need to encode our $y$ data into the same format. Luckily, keras provides a function to use here: [`to_categorical()`](https://keras.io/api/utils/python_utils/#to_categorical-function). Use it to map our output integers into vectors. 

In [None]:
from tensorflow.keras.utils import to_categorical

y = np.array([to_categorical(i, num_classes = num_words_tag) for i in  y])
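
The following sketch shows on toy data what one-hot encoding does to a padded label matrix. It uses plain NumPy (`np.eye`) instead of `to_categorical()`, purely for illustration; the label values are made up.

```python
import numpy as np

num_classes = 3
# 2 toy "sentences" with 3 tokens each, labels given as class indices
toy_y = np.array([[0, 2, 1],
                  [1, 0, 0]])

# row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[toy_y]
print(one_hot.shape)
print(one_hot[0, 1])

# argmax over the last axis inverts the encoding
recovered = one_hot.argmax(axis=-1)
print(recovered)
```

The encoded array has one additional axis whose size is the number of classes, which matches the shape of the softmax output of the model.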

## Train / test split

In contrast to our previous exercise, in which defined train and test data sets were given, we only have a single data set here. In these cases, we need to manually split the data set into a training and a test set. This can be done with a function from the library `scikit-learn`: [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), which returns a train and a test portion for every array passed to it.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state=1)
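
As a small illustration of the return values, the sketch below splits toy arrays with the same settings as above. The arrays and the `toy_`/`tx_`/`ty_` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples: row i of toy_x is [2*i, 2*i + 1], its label is i
toy_x = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.1, random_state=1
)
print(tx_train.shape, tx_test.shape)

# x and y are shuffled together, so every row keeps its own label
aligned = bool((tx_test[:, 0] // 2 == ty_test).all())
print(aligned)
```

With `test_size = 0.1`, ten percent of the samples end up in the test portion and the rest in the training portion.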

## Task 3: Model Architecture

Now we create the model architecture. You find a simple initial architecture below. Play around with it, try to improve its performance!

Things to try:
- Pretrained embeddings
- Bidirectionality
- More dense layers

In [None]:
from tensorflow.keras import models, layers, optimizers

model = models.Sequential()
model.add(layers.InputLayer(input_shape = (max_len,)))
model.add(layers.Embedding(input_dim = num_words, output_dim = 1, input_length = max_len))
model.add(layers.SimpleRNN(units = 5, return_sequences = True))
model.add(layers.Dense(num_words_tag, activation = 'softmax'))
model.summary()

model.compile(loss = 'categorical_crossentropy', metrics = ['accuracy'])
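
One of the suggestions above, bidirectionality, could be sketched as follows. This is only one possible variant under assumed settings, not a recommended solution: the `toy_` sizes are hypothetical stand-ins for `max_len`, `num_words` and `num_words_tag`, and the layer sizes and the `adam` optimizer are arbitrary choices.

```python
from tensorflow.keras import models, layers

# hypothetical sizes for illustration only
toy_max_len, toy_num_words, toy_num_tags = 10, 100, 5

bi_model = models.Sequential()
bi_model.add(layers.Input(shape=(toy_max_len,)))
bi_model.add(layers.Embedding(input_dim=toy_num_words, output_dim=16))
# the wrapper runs the RNN forwards and backwards over the sentence
# and concatenates both outputs for every token
bi_model.add(layers.Bidirectional(layers.SimpleRNN(units=5, return_sequences=True)))
bi_model.add(layers.Dense(toy_num_tags, activation="softmax"))
bi_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

print(bi_model.output_shape)
```

Because `return_sequences = True` is kept, the model still produces one probability vector per token.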

Run the training

In [None]:
history = model.fit(
    x_train, np.array(y_train),
    batch_size = 128,
    epochs = 1,
    verbose = 1
)

In [None]:
model.evaluate(x_test, np.array(y_test))

## Evaluation by class

So far, we have mostly looked at accuracy scores. For this task, however, this may not give us the entire picture, because there are many different target classes, and the model might perform differently on each of them. So we now look at an evaluation by class. For this, the [function `classification_report(...)` from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) can be used.

In [None]:
from sklearn.metrics import classification_report

Y_test = np.argmax(y_test, axis=2)

y_pred = np.argmax(model.predict(x_test), axis=2)


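# passing `labels` keeps the report aligned with `unique_tags`,
# even if a tag happens not to occur in the test data
print(classification_report(Y_test.flatten(), y_pred.flatten(), labels=np.arange(len(unique_tags)), zero_division=0, target_names=unique_tags))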
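
Why accuracy alone can be misleading is illustrated by the following toy sketch. The labels are made up, with class `0` playing the role of a very frequent tag such as `O`, and the "model" is a degenerate one that always predicts that class.

```python
import numpy as np

# toy gold labels: class 0 is frequent, classes 1 and 2 are rare
gold = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
# a degenerate "model" that always predicts the majority class
pred = np.zeros_like(gold)

# overall accuracy looks acceptable ...
accuracy = float((gold == pred).mean())
print(accuracy)

# ... but recall per class (share of gold items of a class that were found)
# shows that the rare classes are never recognized
recall = {int(c): float((pred[gold == c] == c).mean()) for c in np.unique(gold)}
print(recall)
```

Since padded positions are also labeled `O` in this exercise, the effect of the majority class on accuracy is likely to be even stronger on the real data.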