<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Sep 5 - 9, 2022<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>

# Exercise 3: Joint learning

A major advantage of the functional API of Keras is that it allows us to do *joint learning*: we can train a single network that makes multiple predictions at the same time. In this exercise, you will extend your work from Exercise 2 to simultaneously predict named entities *and* parts of speech.

In [None]:
import tensorflow as tf

# limit GPU memory to 4 GB
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        print(e)

## Setup

We import libraries and load the data as before.

In [1]:
import pandas as pd
import numpy as np

# read in CSV file
data = pd.read_csv("data/ner/gmb.csv", encoding = 'latin1')

# the first column of the file contains the sentence number
# -- but only for the first token of each sentence.
# The following line fills the rows downwards.
data = data.ffill()

## Task 1: Encode part of speech information

In Exercise 2, we only cared about the word and named entity columns. We need to extend this, in order to also handle the part of speech column. In principle, we can handle it similarly to the named entity column, but there is no obvious tag to pad sequences with. We therefore need to add an additional dummy part of speech tag.

In [None]:
# create a list of unique words and assign an integer number to it
unique_words, coded_words = np.unique(data["Word"], return_inverse=True)
data["Word_idx"] = coded_words
EMPTY_WORD_IDX = len(unique_words)
# append a padding symbol; its index is EMPTY_WORD_IDX
unique_words = np.append(unique_words, "_____")
num_words = len(unique_words)

unique_pos_tags, coded_pos_tags = np.unique(data["POS"], return_inverse=True)
data["POS_idx"]  = coded_pos_tags
NO_POS_TAG_IDX = len(unique_pos_tags)
unique_pos_tags = unique_pos_tags.tolist()
unique_pos_tags.append("_")
unique_pos_tags = np.array(unique_pos_tags)
num_pos_tags = len(unique_pos_tags)


# create a list of unique tags and assign an integer number to it
unique_ne_tags, coded_ne_tags = np.unique(data["Tag"], return_inverse=True)
data["NE_idx"]  = coded_ne_tags
NO_NE_TAG_IDX = unique_ne_tags.tolist().index("O")
num_ne_tags = len(unique_ne_tags)

# for verification and inspection, we can inspect the table so far
data[1:20]
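
To see why the extra tag is needed, the same encoding can be sketched on a toy column. The tag values and the `"_"` padding symbol below are made up for illustration and are not taken from the corpus.

```python
import numpy as np

# Hypothetical POS column, for illustration only
pos_column = np.array(["NN", "VB", "NN", "DT"])

# np.unique returns the sorted tag inventory and, with return_inverse=True,
# the index of each original entry within that inventory
tags, codes = np.unique(pos_column, return_inverse=True)

# The padding tag gets the next free index, so it cannot collide
# with the index of any real tag
pad_idx = len(tags)
tags = np.append(tags, "_")

print(tags.tolist())    # ['DT', 'NN', 'VB', '_']
print(codes.tolist())   # [1, 2, 1, 0]
print(pad_idx)          # 3
```

For the named entity column no extra tag is needed, because the existing `O` tag can serve as the padding value.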

We also need to extend the `get_sentences()` function, because it must now also return the values of the part of speech index column.

In [None]:
# We are interested in sentence-wise processing.
# Therefore, we use a function that returns individual sentences,
# each as a list of (word index, POS index, NE index) triples.
def get_sentences(data):
  agg_func = lambda s: list(zip(
      s["Word_idx"].values.tolist(),
      s["POS_idx"].values.tolist(),
      s["NE_idx"].values.tolist()))
  grouped = data.groupby("Sentence #").apply(agg_func)
  return list(grouped)

sentences = get_sentences(data)
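
As a rough illustration of the structure this produces, here is a minimal sketch on a hypothetical three-token table. The table contents and the names `toy` and `toy_sentences` are assumptions made for this example, and the grouping is written as a list comprehension rather than with `apply()`.

```python
import pandas as pd

# Hypothetical toy table in the same layout as the encoded data
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2"],
    "Word_idx":   [0, 1, 2],
    "POS_idx":    [0, 1, 0],
    "NE_idx":     [0, 0, 1],
})

# One list of (word, POS, NE) index triples per sentence
toy_sentences = [
    list(zip(g["Word_idx"].tolist(), g["POS_idx"].tolist(), g["NE_idx"].tolist()))
    for _, g in toy.groupby("Sentence #", sort=False)
]

print(toy_sentences)   # [[(0, 0, 0), (1, 1, 0)], [(2, 0, 1)]]
```

Passing `sort=False` keeps the sentences in order of first appearance; with the default sorting, string keys are ordered lexicographically.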

## Padding

As before, we have to pad the sentences to the same length. This time, we also pad the POS sequences.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# find the maximum length for the sentences
max_len = max([len(s) for s in sentences])

# extract the word index
x = [ [ w[0] for w in s ] for s in sentences ]

# extract the tag index
y_pos = [ [ w[1] for w in s ] for s in sentences ]
y_ne = [ [ w[2] for w in s ] for s in sentences ]

# shorter sentences are now padded to same length, using (index of) padding symbol
x = pad_sequences(maxlen = max_len, sequences = x, 
  padding = 'post', value = EMPTY_WORD_IDX)

# we do the same for the y data
y_ne = pad_sequences(maxlen = max_len, sequences = y_ne, 
  padding = 'post', value = NO_NE_TAG_IDX)
y_pos = pad_sequences(maxlen = max_len, sequences = y_pos, 
  padding = 'post', value = NO_POS_TAG_IDX)

y_ne = np.array(y_ne)
y_pos = np.array(y_pos)

# but we also convert the indices to one-hot-encoding
y_ne = to_categorical(y_ne, num_classes = num_ne_tags)
y_pos = to_categorical(y_pos, num_classes = num_pos_tags)
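
What padding and one-hot encoding do to the label arrays can be sketched in plain NumPy. The helper `pad_post` below is a hand-written illustration, not the Keras implementation, and it assumes that no sequence is longer than `max_len`; the toy sequences and the number of tags are made up.

```python
import numpy as np

def pad_post(sequences, max_len, value):
    """Right-pad each sequence with `value` up to `max_len` (illustration only)."""
    padded = np.full((len(sequences), max_len), value, dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# Two toy tag sequences over 3 real tags (0..2); index 3 is the padding tag
toy_y = pad_post([[0, 2], [1, 1, 0]], max_len=3, value=3)

# One-hot encoding: row i of the identity matrix is the vector for class i
toy_onehot = np.eye(4, dtype=int)[toy_y]

print(toy_y.tolist())     # [[0, 2, 3], [1, 1, 0]]
print(toy_onehot.shape)   # (2, 3, 4)
```

The resulting shape is (sentences, tokens, classes), which is the shape each of the two output layers has to produce.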


To split the data into training and test sets, we need to do one more thing: apply the same split to `y_pos` and `y_ne`. To do this, we pass a third argument to the `train_test_split()` function, namely the list of indices of the data points. The resulting index lists can then be used to select the matching rows of `y_pos`.

In [1]:
# split the data into training and test data
from sklearn.model_selection import train_test_split

x_train, x_test, y_ne_train, y_ne_test, train_indices, test_indices = train_test_split(x, y_ne, 
                                                                                  range(len(x)), test_size = 0.1, random_state=1)

y_pos_train = y_pos[train_indices]
y_pos_test = y_pos[test_indices]
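
The idea behind the index trick can be sketched with NumPy alone: shuffle the indices once and use the same index lists for every array. The arrays and the names `toy_x`, `toy_y_ne`, and `toy_y_pos` are made-up stand-ins, and the manual split is only an approximation of what `train_test_split()` does internally.

```python
import numpy as np

# Toy stand-ins for x, y_ne and y_pos: 10 data points each
toy_x = np.arange(10) * 10
toy_y_ne = np.arange(10) + 100
toy_y_pos = np.arange(10) + 200

# Shuffle the indices once, then cut them into a test and a train part
rng = np.random.default_rng(1)
indices = rng.permutation(len(toy_x))
n_test = 2
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Indexing every array with the same index list keeps the rows aligned
x_tr, ne_tr, pos_tr = toy_x[train_idx], toy_y_ne[train_idx], toy_y_pos[train_idx]

# Each training row still belongs to the same original data point
print(np.all(ne_tr - 100 == x_tr // 10))    # True
print(np.all(pos_tr - 200 == x_tr // 10))   # True
```

Since `train_test_split()` accepts any number of arrays of equal length, passing `y_pos` directly as a further array would be an alternative to splitting by index.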


## Task 2: Network Layout

Define the network using the functional API such that it produces two outputs for each token: one named entity tag and one part of speech tag. Keras provides [this guide](https://keras.io/guides/functional_api/) to the functional API.

In [None]:
from tensorflow.keras import models, layers, optimizers

l_input = layers.Input(shape = (max_len,))
l_embedding = layers.Embedding(input_dim = num_words, output_dim = 50, input_length = max_len)(l_input)
l_lstm = layers.LSTM(units = 5, return_sequences = True)(l_embedding)
l_output_ne = layers.Dense(num_ne_tags, name="ne", activation = 'softmax')(l_lstm)
l_output_pos = layers.Dense(num_pos_tags, name="pos", activation = 'softmax')(l_lstm)

model = models.Model(inputs = l_input, outputs=[l_output_ne, l_output_pos])

model.summary()

# We use a different optimizer this time
model.compile(optimizer='Adam', 
  loss = 'categorical_crossentropy', metrics = ['accuracy'])
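
Because a single loss name is given, Keras applies it to each of the two outputs and minimizes the sum of the two losses (with the default loss weights of 1). The following NumPy sketch computes that quantity on made-up predictions; the function `categorical_crossentropy` is a hand-written illustration rather than the Keras implementation, and the tag counts are arbitrary.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over all tokens of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=-1)))

# Toy batch: 1 sentence, 2 tokens; 3 NE tags and 4 POS tags (made-up sizes)
ne_true = np.eye(3)[np.array([[0, 2]])]     # shape (1, 2, 3)
pos_true = np.eye(4)[np.array([[1, 3]])]    # shape (1, 2, 4)

# A network that is maximally unsure predicts a uniform distribution
ne_pred = np.full((1, 2, 3), 1 / 3)
pos_pred = np.full((1, 2, 4), 1 / 4)

loss_ne = categorical_crossentropy(ne_true, ne_pred)      # log(3)
loss_pos = categorical_crossentropy(pos_true, pos_pred)   # log(4)
joint_loss = loss_ne + loss_pos

print(round(joint_loss, 4))   # 2.4849, i.e. log(3) + log(4)
```

Both terms depend on the shared embedding and LSTM layers, so the gradients of both tasks update those layers together.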




In [None]:
history = model.fit(
    x_train, [np.array(y_ne_train), np.array(y_pos_train)],
    batch_size = 64,
    epochs = 2,
    verbose = 1
)

In [None]:
model.evaluate(x_test, [y_ne_test, y_pos_test])

In [None]:
# Reverse one-hot-encoding for test data
y_ne_test = np.argmax(y_ne_test, axis=2)
y_pos_test = np.argmax(y_pos_test, axis=2)


In [None]:
from sklearn.metrics import classification_report

y_ne_pred, y_pos_pred = model.predict(x_test)

y_ne_pred = np.argmax(y_ne_pred, axis=2)
y_pos_pred = np.argmax(y_pos_pred, axis=2)

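# passing `labels` keeps the report aligned with `target_names`,
# even if some tag never occurs in the test split
print(classification_report(y_ne_test.flatten(), y_ne_pred.flatten(), labels=np.arange(num_ne_tags), zero_division=0, target_names=unique_ne_tags))
print(classification_report(y_pos_test.flatten(), y_pos_pred.flatten(), labels=np.arange(num_pos_tags), zero_division=0, target_names=unique_pos_tags))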
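
The evaluation steps above (reversing the one-hot encoding with `argmax`, then comparing gold and predicted indices) can be sketched on toy data. The sketch also shows an optional refinement that is not part of the exercise: the flattened arrays include the padded positions, so these count toward the reported scores, and a boolean mask over the padded input can exclude them. All values and the name `PAD_WORD` are made up.

```python
import numpy as np

PAD_WORD = 9                                  # hypothetical padding word index
toy_x = np.array([[4, 7, PAD_WORD],           # 2 sentences, padded to length 3
                  [5, PAD_WORD, PAD_WORD]])

# One-hot gold labels and made-up softmax outputs over 3 classes
gold_onehot = np.eye(3)[np.array([[0, 1, 2], [1, 2, 2]])]
pred_probs = np.array([[[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
                       [[0.2, 0.7, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]])

# argmax over the class axis turns each distribution back into a tag index
gold = np.argmax(gold_onehot, axis=2)
pred = np.argmax(pred_probs, axis=2)

# Accuracy over all positions vs. over real (non-padding) tokens only
mask = toy_x != PAD_WORD
acc_all = float(np.mean(gold == pred))
acc_real = float(np.mean(gold[mask] == pred[mask]))

print(acc_all, acc_real)
```

In this toy case the padded positions are all predicted correctly, so the accuracy over all positions (5 of 6) is higher than the accuracy over real tokens (2 of 3).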
