# Assignment 3: Sequence labelling with RNNs
In this assignment we will ask you to perform POS tagging.

You are asked to follow these steps:
*   Download the corpus and split it into training, validation and test sets, structuring it as a dataframe.
*   Embed the words using GloVe embeddings
*   Create a baseline model, using a simple neural architecture
*   Experiment with small modifications to the model
*   Evaluate your best model
*   Analyze the errors of your model

**Corpora**:
Ignore the numeric value in the third column; use only the words/symbols and their labels.
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip 

**Splits**: documents 1-100 are the train set, 101-150 validation set, 151-199 test set.

**Baseline**: two layers architecture: a Bidirectional LSTM and a Dense/Fully-Connected layer on top.

**Modifications**: experiment using a GRU instead of the LSTM, adding an additional LSTM layer, and using a CRF in addition to the LSTM. Each of these changes must be made on its own (don't mix these modifications).

**Training and Experiments**: all the experiments must involve only the training and validation sets.

**Evaluation**: in the end, only the best model of your choice must be evaluated on the test set. The main metric must be F1-Macro computed over the various parts of speech (without considering punctuation classes).
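
The metric described above can be illustrated with a small, dependency-free sketch. The set of punctuation tags below is an assumption made for the example (the assignment does not list them), and `macro_f1` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter

# Assumption: which tags count as "punctuation" is a choice made for this sketch.
PUNCTUATION_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def macro_f1(y_true, y_pred, ignored=PUNCTUATION_TAGS):
    """Unweighted mean of the per-tag F1 over the gold tags not in `ignored`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1
            fp[pred] += 1
    tags = sorted(set(y_true) - set(ignored))
    scores = []
    for tag in tags:
        denominator = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy flattened gold and predicted tag sequences
gold = ["DT", "NN", "VBD", ".", "NN", ","]
pred = ["DT", "NN", "NN", ".", "NN", "."]
score = macro_f1(gold, pred)  # averages the F1 of DT, NN and VBD only
```

With scikit-learn, passing the retained tags through the `labels` argument of `f1_score(..., average='macro')` should give the same kind of restricted average.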

**Error Analysis** (optional): analyze the errors made by your model, try to understand their possible causes, and think about how to improve it.

**Report**: You are asked to deliver a small report of about 4-5 lines in the .txt file that sums up your findings.

## Outline

This notebook provides a Keras implementation of **three RNN models** with the purpose of performing sequence labelling, in particular **POS-tagging**.

The main outline of the process is the following:
- Raw data (199 labelled documents consisting of a varying number of sentences) is split into **train**, **validation** and **test** partitions and stored in a `pandas.DataFrame` object.
- The words in the dataframe, after some preprocessing, are replaced with their *GloVe embeddings* (or with custom embeddings, when needed), and their labels (POS tags) are one-hot encoded.
- Models have the **embeddings** as input and **tags** as output.
- To simulate a real-world scenario, the different models are trained on the first partition and evaluated on the second one; lastly, only the best-performing model is tested on the last partition.
Moreover, the partitions are kept independent with respect to every aspect except the word embeddings (either retrieved from GloVe or computed) and the labels encoding.

## Imports

In [1]:
import os, shutil, zipfile, requests  #  file management
import sys
import pandas as pd  #  dataframe management
import numpy as np  #  data manipulation

# Text pre-processing
import re
from functools import reduce

# Neural models
import keras
from keras import layers

# Evaluating
from sklearn.metrics import classification_report

# Visualization
from tqdm import tqdm # progress bars
from tabulate import tabulate # printing

## File operations

### Download and extraction

We first download and extract the dataset from GitHub, unless it is already present in the current working directory.

In [2]:
# Download and extract data
dataset_folder = os.path.join(os.getcwd(), 'dependency_treebank')

if not os.path.exists(dataset_folder):
    # Download data
    url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip'
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # Stop here if the download failed
    dataset_zip = 'dependency_treebank.zip'
    dataset_zip_path = os.path.join(os.getcwd(), dataset_zip)
    # Write the archive to disk, closing the file handle afterwards
    with open(dataset_zip_path, 'wb') as zip_file:
        zip_file.write(r.content)
    if os.path.exists(dataset_zip_path):
        print("Successful download:", dataset_zip_path)
    # Extract the zip archive to directory 'dependency_treebank'
    with zipfile.ZipFile(dataset_zip_path, 'r') as zip_ref:
        zip_ref.extractall()
    if os.path.exists(dataset_folder):
        print("Successful extraction:", dataset_folder)
else:
    print("Data already downloaded")

Successful download: /content/dependency_treebank.zip
Successful extraction: /content/dependency_treebank


### Files splitting

Let's now create (or clean and recreate, if already present) the 'splits' folder, with 'train', 'validation' and 'test' subfolders within. This makes it easier to create the dataframe rows.

In [3]:
# Split dataset
splits = {
    0: 'train',
    1: 'validation',
    2: 'test'
}
# Create directories
split_folder = os.path.join(os.getcwd(), 'splits')

# Clean 'splits' directory, if present
if not os.path.exists(split_folder):
    os.makedirs(split_folder)
print("Cleaning splits folder...")
for entry in os.listdir(split_folder):
    # Remove every file and subfolder left over from a previous run
    entry_path = os.path.join(split_folder, entry)
    try:
        if os.path.isfile(entry_path) or os.path.islink(entry_path):
            os.unlink(entry_path)
        elif os.path.isdir(entry_path):
            shutil.rmtree(entry_path)
    except Exception as e:
        print('Failed to delete %s. Reason: %s' % (entry_path, e))

print("Cleaned")
# Create subfolders
for split in splits.values():
    os.makedirs(os.path.join(split_folder, split), exist_ok=True)
# Copy files into directories according to name
for filename in os.listdir(dataset_folder):
    n = int(filename[4:-3])
    # 'Label' refers to 'splits' keys (0 for train, 1 for validation, 2 for test)
    label = -1
    if n < 101:
        label = 0
    elif n < 151:
        label = 1
    else:
        label = 2
    src = os.path.join(dataset_folder, filename)
    dst = os.path.join(split_folder, splits[label], filename)
    shutil.copyfile(src, dst)
# Check distribution
for split in splits.values():
    folder = os.path.join(split_folder, split)
    print(f"Files in {split}:", min([int(filename[4:-3]) for filename in os.listdir(folder)]), "-",
                               max([int(filename[4:-3]) for filename in os.listdir(folder)]))   

Cleaning splits folder...
Cleaned
Files in train: 1 - 100
Files in validation: 101 - 150
Files in test: 151 - 199
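
As a compact restatement of the rule used above, the sketch below maps a file name to its split. It assumes every file follows the `wsj_NNNN.dp` pattern seen in the corpus; `split_for` is a hypothetical helper, not part of the notebook, and it uses a regular expression instead of the `filename[4:-3]` slice so that unexpected names fail loudly.

```python
import re

def split_for(filename):
    """Return the split name for a treebank file name such as 'wsj_0050.dp'."""
    match = re.fullmatch(r"wsj_(\d+)\.dp", filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    n = int(match.group(1))
    if n <= 100:
        return "train"        # documents 1-100
    if n <= 150:
        return "validation"   # documents 101-150
    return "test"             # documents 151-199

assigned = {name: split_for(name) for name in ["wsj_0001.dp", "wsj_0101.dp", "wsj_0199.dp"]}
```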


### `DataFrame` creation and saving

We now build a dataframe with columns `['document', 'labels', 'split']` and store it into 'Dataframes' folder.

In [4]:
dataframe_folder = os.path.join(os.getcwd(), "Dataframes")
if not os.path.exists(dataframe_folder):
    os.makedirs(dataframe_folder)
    
split_folder = os.path.join(os.getcwd(), 'splits')
splits = {
    0: 'train',
    1: 'validation',
    2: 'test'
}
dataframe_rows = []

debug = True

# Create a dataframe for each split
for split in splits.values():
    folder = os.path.join(split_folder, split)
    for filename in os.listdir(folder):
        file_path = os.path.join(folder, filename)
        try:
            if os.path.isfile(file_path):
                # open the file
                # Store lists of document and labels for each file
                document, labels = list(), list()
                with open(file_path, mode='r') as text_file:
                    for line in text_file:
                        splitted = line.split("\t")
                        if len(splitted) > 1:
                            word, label = splitted[0:2]
                            document.append(word)
                            labels.append(label)
                            
                # create single dataframe row
                dataframe_row = {
                    "document": " ".join(document),
                    "labels": " ".join(labels),
                    "split": split
                }

                # print detailed info for the first file
                if debug:
                    print("Information for first file:")
                    print("File path:", file_path)
                    print("Filename:", filename)
                    print("Document:", document)
                    print("Labels:", labels)
                    print("Split:", split)
                    print("Dataframe row:", dataframe_row)
                    debug = False
                dataframe_rows.append(dataframe_row)

        except Exception as e:
            print('Failed to process %s. Reason: %s' % (file_path, e))
            sys.exit(1)  # Non-zero exit code signals the failure

# transform the list of rows in a proper dataframe
dataframe = pd.DataFrame(dataframe_rows)
dataframe = dataframe[["document",
                       "labels",
                       "split"]]
dataframe_path = os.path.join(dataframe_folder, "dataframe.pkl")
dataframe.to_pickle(dataframe_path)
print("Dataframe successfully saved into:", dataframe_path)

Information for first file:
File path: /content/splits/train/wsj_0050.dp
Filename: wsj_0050.dp
Document: ['Cooper', 'Tire', '&', 'Rubber', 'Co.', 'said', 'it', 'has', 'reached', 'an', 'agreement', 'in', 'principle', 'to', 'buy', 'buildings', 'and', 'related', 'property', 'in', 'Albany', ',', 'Ga.', ',', 'from', 'Bridgestone\\/Firestone', 'Inc', '.', 'Terms', 'were', "n't", 'disclosed', '.', 'The', 'tire', 'maker', 'said', 'the', 'buildings', 'consist', 'of', '1.8', 'million', 'square', 'feet', 'of', 'office', ',', 'manufacturing', 'and', 'warehousing', 'space', 'on', '353', 'acres', 'of', 'land', '.']
Labels: ['NNP', 'NNP', 'SYM', 'NNP', 'NNP', 'VBD', 'PRP', 'VBZ', 'VBN', 'DT', 'NN', 'IN', 'NN', 'TO', 'VB', 'NNS', 'CC', 'JJ', 'NN', 'IN', 'NNP', ',', 'NNP', ',', 'IN', 'NNP', 'NNP', '.', 'NNS', 'VBD', 'RB', 'VBN', '.', 'DT', 'NN', 'NN', 'VBD', 'DT', 'NNS', 'VBP', 'IN', 'CD', 'CD', 'JJ', 'NNS', 'IN', 'NN', ',', 'NN', 'CC', 'NN', 'NN', 'IN', 'CD', 'NNS', 'IN', 'NN', '.']
Split: train
Dataf
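
The parsing step inside the loop above can be tried in isolation on an in-memory sample. The sample text below is made up for illustration (including the values in the third column), and `read_tokens` is a hypothetical helper name; it only mirrors the "split on tabs, keep the first two fields" logic.

```python
import io

# Made-up sample in the tab-separated, three-column layout; blank lines separate sentences
sample = "Terms\tNNS\t4\nwere\tVBD\t4\n\nThe\tDT\t3\n"

def read_tokens(handle):
    """Collect words and tags, ignoring the third column and blank lines."""
    words, tags = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:  # a blank separator line yields a single empty field
            words.append(fields[0])
            tags.append(fields[1])
    return words, tags

words, tags = read_tokens(io.StringIO(sample))
```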

### `DataFrame` loading

The following code loads a previously saved `DataFrame`.

In [5]:
dataframe_folder = os.path.join(os.getcwd(), "Dataframes")
dataframe_path = os.path.join(dataframe_folder, "dataframe.pkl")
dataframe = pd.read_pickle(dataframe_path)
print("Full dataframe:")
dataframe

Full dataframe:


| | document | labels | split |
|---|---|---|---|
| 0 | Cooper Tire & Rubber Co. said it has reached a... | NNP NNP SYM NNP NNP VBD PRP VBZ VBN DT NN IN N... | train |
| 1 | J.P. Bolduc , vice chairman of W.R. Grace & Co... | NNP NNP , NN NN IN NNP NNP CC NNP , WDT VBZ DT... | train |
| 2 | Alleghany Corp. said it completed the acquisit... | NNP NNP VBD PRP VBD DT NN IN NNP NNPS CC NNP N... | train |
| 3 | The U.S. and Soviet Union are holding technica... | DT NNP CC NNP NNP VBP VBG JJ NNS IN JJ NN IN N... | train |
| 4 | Rekindled hope that two New England states wil... | VBN NN IN CD NNP NNP NNS MD VB JJR JJ NN VBD N... | train |
| ... | ... | ... | ... |
| 194 | Carnival Cruise Lines Inc. said potential prob... | NNP NNP NNP NNP VBD JJ NNS IN DT NN IN CD JJ N... | test |
| 195 | Freeport-McMoRan Inc. said it will convert its... | NNP NNP VBD PRP MD VB PRP$ NNP NNP NNPS NNP NN... | test |
| 196 | Tony Lama Co. said that Equus Investment II Li... | NNP NNP NNP VBD IN NNP NNP NNP NNP NNP VBZ VBN... | test |
| 197 | Hadson Corp. said it expects to report a third... | NNP NNP VBD PRP VBZ TO VB DT NN JJ NN IN $ CD ... | test |


### Embedding model download

Download the GloVe embedding model using the [Gensim](https://radimrehurek.com/gensim/) library.

In [6]:
import gensim
import gensim.downloader as gloader

def load_embedding_model(embedding_dimension=50):
    
    download_path = f"glove-wiki-gigaword-{embedding_dimension}"

    # Check download
    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding dimension! GloVe available embedding dimensions are 50, 100, 200, 300")
        raise e

    return emb_model

embedding_dimension = 300

embedding_model = load_embedding_model(embedding_dimension)        



## Preliminary steps

In this section, we define functions to pre-process input data in order to properly feed the neural models.
In particular, we will follow these steps:
- **Text pre-processing**: tokenize words in order to exploit GloVe embeddings
- **Vocabulary creation**: needed to build the co-occurrence matrix and to index and later retrieve words as integers
- **OOV terms handling**: assign an embedding vector to words found in the input data, but not in the embedding model
- **Embedding matrix computation**: arrange embeddings in a matrix to easily and quickly retrieve them

### Text pre-processing

Text pre-processing is a critical step when performing POS-tagging: significant alterations of the input text could worsen the classification accuracy, particularly at *test time*.

Therefore, I decided to perform only two operations, so as to fully exploit the *pre-trained* embeddings:
- **case lowering**, since the GloVe model contains only lower-cased words;
- **bracket substitution**, since the Penn Treebank format encodes brackets as strings (such as `-LRB-`), whereas the GloVe model stores the actual characters (in this example, `(`).

In [7]:
LRB_RE = re.compile('-LRB-')
RRB_RE = re.compile('-RRB-')
LCB_RE = re.compile('-LCB-')
RCB_RE = re.compile('-RCB-')

def replace_brackets(text):
    # Replaces round and curly brackets tags with actual characters
    return LCB_RE.sub('{', RCB_RE.sub('}', LRB_RE.sub('(', RRB_RE.sub(')', text))))

def lower(text):
    # Transforms given text to lower case.
    return text.lower()


PREPROCESSING_PIPELINE = [
                          replace_brackets,
                          lower
                          ]

# Anchor method

def text_prepare(text, filter_methods=None):
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """

    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE

    return reduce(lambda txt, f: f(txt), filter_methods, text)

# Pre-processing

print('Pre-processing text...')

print()
print('[Debug] Before:\n{}'.format(dataframe.document[:3]))
print()

# Replace each sentence with its pre-processed version
dataframe['document'] = dataframe['document'].apply(lambda txt: text_prepare(txt))

print('[Debug] After:\n{}'.format(dataframe.document[:3]))
print()

print("Pre-processing completed!")

Pre-processing text...

[Debug] Before:
0    Cooper Tire & Rubber Co. said it has reached a...
1    J.P. Bolduc , vice chairman of W.R. Grace & Co...
2    Alleghany Corp. said it completed the acquisit...
Name: document, dtype: object

[Debug] After:
0    cooper tire & rubber co. said it has reached a...
1    j.p. bolduc , vice chairman of w.r. grace & co...
2    alleghany corp. said it completed the acquisit...
Name: document, dtype: object

Pre-processing completed!
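
To see why the order of the pipeline matters, here is a minimal, self-contained sketch of the same idea. It is a simplified re-implementation written for this example (a single regular expression instead of four), not the notebook's own functions, and the sample sentence is made up.

```python
import re
from functools import reduce

# Penn Treebank bracket tokens and the characters they stand for
BRACKETS = {"-LRB-": "(", "-RRB-": ")", "-LCB-": "{", "-RCB-": "}"}
BRACKET_RE = re.compile("|".join(map(re.escape, BRACKETS)))

def replace_brackets(text):
    # Case-sensitive: only the upper-case tokens are matched
    return BRACKET_RE.sub(lambda m: BRACKETS[m.group(0)], text)

def lower(text):
    return text.lower()

def text_prepare(text, steps=(replace_brackets, lower)):
    # Apply each step to the output of the previous one
    return reduce(lambda txt, step: step(txt), steps, text)

example = "The index -LRB- S&P 500 -RRB- rose"
prepared = text_prepare(example)
# Lower-casing first turns '-LRB-' into '-lrb-', which the pattern no longer matches
wrong_order = text_prepare(example, steps=(lower, replace_brackets))
```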


### Vocabulary creation

In order to get GloVe embeddings, we first need to create a vocabulary that stores the unique words of the given corpus.

In [8]:
from collections import OrderedDict

def build_vocabulary(df):
    print("Building vocabulary...")
    # Initialize set of unique words
    word_listing = set()
    # Initialize two empty ordered dictionaries
    idx_to_word, word_to_idx = OrderedDict(), OrderedDict()
    for document in tqdm(df, desc="Populating set of unique terms..."):
        # Add new words to set
        word_listing |= set(document)
        # Add new words to dictionaries
        for word in document:
            if word not in word_to_idx:
                # Fill dictionaries keeping insertion order
                voc_size = len(word_to_idx)
                idx_to_word[voc_size] = word
                word_to_idx[word] = voc_size
    print("Done!")
    return idx_to_word, word_to_idx, word_listing
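
The indexing scheme can be checked on a toy corpus. The sketch below is a condensed variant written for illustration (`build_index` is a hypothetical name); it assigns indices in order of first appearance, which is the property the vocabulary above relies on.

```python
def build_index(documents):
    """Map each distinct token to an integer, in order of first appearance."""
    word_to_idx = {}
    for document in documents:
        for word in document:
            # The default is evaluated before insertion, so new words get the next free index
            word_to_idx.setdefault(word, len(word_to_idx))
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return idx_to_word, word_to_idx

docs = [["the", "tire", "maker"], ["the", "maker", "said"]]
idx_to_word, word_to_idx = build_index(docs)
```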

### OOV terms handling

#### Check number of OOV terms

If the fraction of OOV terms is negligible, we could in principle skip handling them.
At validation/test time, words already handled at training time do not need to be counted as OOV again.

In [9]:
def check_OOV_terms(embedding_model, word_listing, training=True, training_oov=None):
    # At validation/test time, skip the terms already handled at training time
    handled = set() if training or training_oov is None else set(training_oov)
    OOV = [word for word in word_listing
           if word not in embedding_model and word not in handled]
    print("Total OOV terms: {0} ({1:.2f}%)".format(len(OOV), float(len(OOV)) * 100 / len(word_listing)))
    return OOV
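
The two modes of the check can be illustrated without downloading GloVe. In the sketch below a plain dictionary stands in for the embedding model (an assumption made for the example: all that is needed is support for `word in model`), and `find_oov` is a hypothetical, simplified counterpart of the function above.

```python
# Stand-in for the embedding model: only membership tests are used
toy_model = {"the": None, "maker": None, "said": None}

def find_oov(model, words, already_handled=()):
    """Words that are neither in the model nor already handled earlier."""
    handled = set(already_handled)
    return sorted(w for w in words if w not in model and w not in handled)

# Training: every word missing from the model is OOV
train_oov = find_oov(toy_model, {"the", "tire", "maker"})
# Validation: 'tire' was already given an embedding at training time
val_oov = find_oov(toy_model, {"said", "tire", "bolduc"}, already_handled=train_oov)
```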

#### Compute co-occurrence matrix

Computing a co-occurrence matrix is necessary to provide OOV terms with **non-random embeddings**: we need the *context words* of each OOV term, along with their embeddings, in order to compute the OOV embedding vector by **averaging** them.

In [10]:
from collections import OrderedDict
from scipy.sparse import csr_matrix

def sparse_matrix(df, idx_to_word, word_to_idx, window_size):
    print("Computing sparse matrix...")

    # Sparse matrix inputs
    indices = []
    data = []
    indptr = [0]
    
    # Dictionary to store co-occurrence as pairs word:{context words:no. of occurrences}
    co_dict = OrderedDict()
    window_indexes = list(range(-window_size, 0)) + list(range(1, window_size+1))
    for document in tqdm(df, desc="Populating auxiliary dictionaries..."):
        for i, word in enumerate(document):
            # Extract word index from vocabulary
            index = word_to_idx[word]
            # Define context set extracting words from indexes
            context = [document[i+incr] for incr in window_indexes if i+incr in range(len(document))]
            # If entry not present, create it
            if word not in co_dict:
                co_dict[word] = {}
            # For every context word, either create the entry or add 1 to it
            for context_word in context:
                co_dict[word][context_word] = co_dict[word].get(context_word, 0) + 1

    
    # Loop over ordered dictionary to fill sparse matrix inputs
    for word, context in tqdm(co_dict.items(), desc="Populating auxiliary arrays..."):
        # List of (vocabulary) indexes of words contained in context
        indices += [word_to_idx[key] for key in context.keys()]
        # List of slicings for indices and data
        indptr.append(len(indices))
        # Occurrences of words in context of word
        data += list(context.values())
    
    print("Done!")
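    # Return co-occurrence matrix (explicit shape, so that no column is dropped by shape inference)
    return csr_matrix((data, indices, indptr), shape=(len(co_dict), len(word_to_idx)), dtype=int)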

def dense_matrix(df, idx_to_word, word_to_idx, window_size):
    print("Computing dense matrix...")
    # Initialize zeros-filled co-occurrence matrix
    com = np.zeros((len(word_to_idx),len(word_to_idx)), dtype=int)
    for document in tqdm(df, desc="Populating matrix..."):
        for i, word in enumerate(document):
            # Extract word index from vocabulary
            index = word_to_idx[word]
            # Define window as list of increments from center word
            window_indexes = list(range(-window_size, 0)) + list(range(1, window_size+1))
            # Define context extracting words from indexes
            context = [document[i+incr] for incr in window_indexes if i+incr in range(len(document))]
            # Extract context words index from vocabulary
            context_indexes = [word_to_idx[context_word] for context_word in context]
            # Update co-occurrence matrix
            for context_index in context_indexes:
                com[index, context_index] += 1
    print("Done!")
    return com

In [11]:
def co_occurrence_count(df, idx_to_word, word_to_idx, window_size=1):
    co_occurrence_matrix = None
    sparse = False
    
    if sparse is not False:
        co_occurrence_matrix = sparse_matrix(df, idx_to_word, word_to_idx, window_size)
    else:
        co_occurrence_matrix = dense_matrix(df, idx_to_word, word_to_idx, window_size)
    
    print("Co-occurrence matrix has shape:", co_occurrence_matrix.shape)
    return co_occurrence_matrix
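
A toy run makes the counting rule concrete. The sketch below is a reduced version of the dense variant written for this example (`cooccurrence` is a hypothetical name): with `window_size=1`, each word counts only its immediate left and right neighbours within the same document.

```python
import numpy as np

def cooccurrence(documents, word_to_idx, window_size=1):
    size = len(word_to_idx)
    counts = np.zeros((size, size), dtype=int)
    for document in documents:
        for i, word in enumerate(document):
            # Clip the window to the document boundaries
            first = max(0, i - window_size)
            last = min(len(document), i + window_size + 1)
            for j in range(first, last):
                if j != i:
                    counts[word_to_idx[word], word_to_idx[document[j]]] += 1
    return counts

docs = [["the", "tire", "maker"], ["the", "tire"]]
word_to_idx = {"the": 0, "tire": 1, "maker": 2}
counts = cooccurrence(docs, word_to_idx)
# Row i holds how often each word appears next to word i
```

Because the window is symmetric, the resulting matrix is symmetric as well.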

### Embedding matrix computation

This matrix collects the **GloVe embeddings** of all the words in the given corpus.

Regarding **OOV terms**, the vector is computed as the mean of the context words' embeddings, weighted by co-occurrence count, when possible. If an OOV term has only other OOV terms as context words, it is assigned a random vector.

In [12]:
# Returns a triple (is_random, is_average, vector): the two flags are used as counters,
# the vector is (if possible) the average of the context words' embeddings
def get_embedding_vector(embedding_model, embedding_dimension, word, contexts):
    # If none of its neighbours is in the model, assign random vector
    if word not in contexts:
        return 1, 0, np.random.normal(size=(embedding_dimension))
    # Otherwise, compute the vector as weighted average of the vectors of the words in the context
    embeddings = [embedding_model[context_word] for context_word in contexts[word].keys()]
    weights = list(contexts[word].values())
    return 0, 1, np.average(embeddings, weights=weights, axis=0)
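
The weighted average can be verified by hand on two-dimensional toy vectors. The vectors and counts below are invented for the example; only the `np.average` call mirrors the function above.

```python
import numpy as np

# Toy 2-d "embeddings" of the context words that are in the model
toy_vectors = {"the": np.array([1.0, 0.0]), "maker": np.array([0.0, 1.0])}
# How many times each context word appears next to the OOV term
context_counts = {"the": 3, "maker": 1}

oov_vector = np.average(
    [toy_vectors[w] for w in context_counts],
    weights=list(context_counts.values()),
    axis=0,
)
# (3 * [1, 0] + 1 * [0, 1]) / 4
```

Frequent neighbours therefore pull the OOV vector towards their own embedding more strongly than rare ones.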

In [13]:
from collections import OrderedDict

def build_embedding_matrix(embedding_model, embedding_dimension, word_to_idx, idx_to_word, oov_terms, co_occurrence_count_matrix,
                           training=True, train_embedding_matrix=None, train_word_to_idx=None):

    # Choose how to handle OOV terms: mean or random
    random = False
    # Indices as in vocabulary
    oov_indices = np.array([word_to_idx[term] for term in oov_terms])
    # Indices as in oov sparse_matrix
    indices, values = co_occurrence_count_matrix[oov_indices,:].nonzero()
    # Context dictionary (word:context)
    oov_contexts = OrderedDict()
    for i, index in enumerate(indices):
        # Context contains only words in the model (i.e. having an embedding vector)
        context_word = idx_to_word[values[i]]
        if context_word in embedding_model:
            word = idx_to_word[oov_indices[index]]
            count = co_occurrence_count_matrix[oov_indices[index], values[i]]
            # {word: {context_words: counts}}
            oov_contexts[word] = {**oov_contexts.get(word, {}), **{context_word: count}}
   
    # Store embeddings in list
    vectors = []
    n_embedded_vectors = 0
    random_vectors = 0
    average_vectors = 0
    for word in word_to_idx.keys():
        vector = None
        # If word is in the model, get embedding vector
        # If its embedding has already been computed during training, retrieve it
        if word not in oov_terms:
            if training is not False:
                vector = embedding_model[word]
            else:
                try:
                    vector = embedding_model[word]
                except KeyError:
                    vector = train_embedding_matrix[train_word_to_idx[word]]
            n_embedded_vectors += 1
        # Otherwise, create a new one (either random or computed from context)
        elif random is not False:
            vector = np.random.normal(size=(embedding_dimension))
            random_vectors += 1
        else:
            rnd, avg, vector = get_embedding_vector(embedding_model, embedding_dimension, word, oov_contexts)
            random_vectors += rnd
            average_vectors += avg
        
        vectors.append(vector)
    print(f"{len(vectors)} vectors in matrix:\n{n_embedded_vectors} embeddings already in model, {random_vectors+average_vectors} vectors computed for OOV words ({average_vectors} computed as average, {random_vectors} random)")
    # Return the matrix of model embeddings and OOV embeddings, in vocabulary order
    return np.array(vectors)

## Models definition

Since documents have very different lengths, I designed the RNN models to accept **batches of size 1**, i.e. one document at a time, so as to **avoid zero-padding** the input data.
Indeed, when the model is fed one sequence at a time, `keras` does not require all sequences to have the same length.

I made this choice after noticing the significant **slowdown** introduced by **masking** the zero-padded input data.

### Baseline

We first build a simple *Bidirectional LSTM (Long Short-Term Memory)*.
The output layer is a **TimeDistributed Dense** (Fully-Connected) layer: it applies the same dense layer to every timestep (word) of the sequence and produces an `n_classes`-dimensional output, matching the one-hot encoded POS tags, for each word.

In [53]:
def baseline_model(embedding_dimension, n_classes):
    # Model creation
    model = keras.Sequential(
        [
            layers.Bidirectional(layers.LSTM(
                                            units=256,
                                            return_sequences=True,
                                            input_shape=(None, embedding_dimension)
                                            ),
                                 batch_input_shape=(1, None, embedding_dimension)
                                 ),
            layers.TimeDistributed(
                layers.Dense(n_classes, activation='softmax')
                )
        ],
        name='baseline'
    )
    # Model description
    model.summary()
    return model
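
What the TimeDistributed Dense layer computes can be sketched in plain NumPy. The snippet below is an illustration under toy assumptions (random stand-in values for the recurrent output, made-up sizes), not the Keras implementation: the same weight matrix is applied to every timestep, followed by a softmax over the classes.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, hidden, n_classes = 5, 8, 3  # toy sizes

# Stand-in for the BiLSTM output of one document: (batch=1, timesteps, hidden)
states = rng.normal(size=(1, timesteps, hidden))
# A single weight matrix and bias, shared by all timesteps
W = rng.normal(size=(hidden, n_classes))
b = np.zeros(n_classes)

logits = states @ W + b                         # (1, timesteps, n_classes)
logits -= logits.max(axis=-1, keepdims=True)    # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
predicted = probs.argmax(axis=-1)               # one tag index per word
```

Each row of `probs` is a distribution over tags for one word, which is what gets compared against the one-hot labels during training.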

### GRU model

In the second model, a *GRU (Gated Recurrent Unit)* layer substitutes the LSTM.

In [54]:
def gru_model(embedding_dimension, n_classes):
    # Model creation
    model = keras.Sequential(
        [
            layers.Bidirectional(layers.GRU(
                                            units=256,
                                            return_sequences=True,
                                            input_shape=(None, embedding_dimension)
                                            ),
                                 batch_input_shape=(1, None, embedding_dimension)
                                 ),
            layers.TimeDistributed(
                layers.Dense(n_classes, activation='softmax')
                )
        ],
        name='gru'
    )
    # Model description
    model.summary()
    return model

### Double LSTM model

In the third model, another LSTM layer is added after the first one.

In [55]:
def double_lstm_model(embedding_dimension, n_classes):
    # Model creation
    model = keras.Sequential(
        [
            layers.Bidirectional(layers.LSTM(
                                            units=256,
                                            return_sequences=True,
                                            input_shape=(None, embedding_dimension),
                                            ),
                                 batch_input_shape=(1, None, embedding_dimension)
                                 ),
            layers.Bidirectional(layers.LSTM(
                                            units=256,
                                            return_sequences=True),
                                 ),
            layers.TimeDistributed(layers.Dense(n_classes, activation='softmax'))
        ],
        name='double_lstm'
    )
    # Model description
    model.summary()
    return model

### (Optional) LSTM + CRF Model
In the fourth (optional) model, a *CRF (Conditional Random Field)* layer is added after the LSTM.

After trying several configurations with both the `keras_contrib` and `tf2crf` packages, I gave up on building this fourth model, because of several errors that I could not quickly resolve.

I thus commented out the code.

In [56]:
#!pip install git+https://www.github.com/keras-team/keras-contrib.git

In [57]:
#   from keras_contrib.layers import CRF
#   from keras_contrib.losses import crf_loss
#   from keras_contrib.metrics import crf_viterbi_accuracy
#   
#   def crf_model(embedding_dimension, n_classes):
#       # Model creation
#       model = keras.Sequential(
#           [
#               #layers.Masking(mask_value=0., batch_input_shape=(1, None, embedding_dimension)),
#               keras.layers.Bidirectional(keras.layers.LSTM(
#                                               units=256,
#                                               return_sequences=True,
#                                               input_shape=(None, embedding_dimension),
#                                               ),
#                                    batch_input_shape=(1, None, embedding_dimension)
#                                    ),
#            
#               keras.layers.TimeDistributed(keras.layers.Dense(n_classes, activation='softmax'),
#               CRF(
#                   units=n_classes,
#                   learn_mode='join',
#                   test_mode='viterbi',
#                   sparse_target=False
#                   ),   
#               #layers.Dense(1, activation="sigmoid")
#           ],
#           name='crf'
#       )
#       # Model description
#       model.summary()
#       return model
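
Although the CRF model was not built, the idea behind its `viterbi` test mode can be illustrated independently of any library. The sketch below is a generic Viterbi decoder over made-up scores, not the `keras_contrib` implementation: it picks the tag sequence maximizing the sum of per-word (emission) scores and tag-to-tag (transition) scores, instead of choosing each tag on its own.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path for scores emissions[t, tag] + transitions[prev_tag, tag]."""
    n_steps, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((n_steps, n_tags), dtype=int)
    for t in range(1, n_steps):
        # candidate[prev, cur]: best score ending in `prev`, then moving to `cur`
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

# Toy scores for 3 words and 2 tags; switching tags is heavily penalized
emissions = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
transitions = np.array([[0.0, -5.0], [-5.0, 0.0]])

independent = emissions.argmax(axis=1).tolist()  # per-word choice, no transitions
joint = viterbi(emissions, transitions)          # sequence-level choice
```

Here the per-word choice switches tag after the first word, while the joint decoding keeps the first tag throughout because the switch costs more than it gains.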

### Batch generator

A **batch generator** is needed to feed the model **one** batch at a time while still using the `model.fit()` method. It allows us to provide inputs (sequences) of different lengths.

In [58]:
def batch_generator(X_data, y_data, start=0):

    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch # One document per batch
    counter=start
    
    while True:
        # Add a leading batch axis of size 1
        X_batch = X_data[counter][np.newaxis, ...]
        y_batch = y_data[counter][np.newaxis, ...]
        counter += 1
        yield X_batch,y_batch

        # Restart counter to yield data in the next epoch
        if counter >= start + number_of_batches:
            counter = start
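
The behaviour of such a generator can be checked on toy arrays. The sketch below is a simplified variant written for this example (`one_document_batches` is a hypothetical name, and it iterates over plain lists rather than an indexed `pandas` series): each yielded batch holds a single document, so consecutive batches may have different lengths, and the generator starts over after the last document.

```python
import numpy as np

def one_document_batches(X_data, y_data):
    """Yield (1, length, features) batches forever, one document at a time."""
    while True:
        for x, y in zip(X_data, y_data):
            # Add a leading batch axis of size 1
            yield x[np.newaxis, ...], y[np.newaxis, ...]

# Two toy "documents" of different lengths, with 4 features and 3 classes
X_toy = [np.zeros((5, 4)), np.zeros((2, 4))]
y_toy = [np.zeros((5, 3)), np.zeros((2, 3))]

gen = one_document_batches(X_toy, y_toy)
shapes = [next(gen)[0].shape for _ in range(3)]  # third batch wraps around
```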

## Prepare training and validation data

In this section I first load and pre-process the training and validation data to feed the models.

### Training

#### Load data

In [82]:
# Load training data into dataframe
train = dataframe[dataframe['split'] == 'train']
# Split into X (document) and y (tags) arrays
train_documents = train['document'].apply(lambda txt: txt.split())
train_labels = train['labels'].apply(lambda txt: txt.split())
print("Training documents:")
print(train_documents.head())
print("Training labels:")
print(train_labels.head())

Training documents:
0    [cooper, tire, &, rubber, co., said, it, has, ...
1    [j.p., bolduc, ,, vice, chairman, of, w.r., gr...
2    [alleghany, corp., said, it, completed, the, a...
3    [the, u.s., and, soviet, union, are, holding, ...
4    [rekindled, hope, that, two, new, england, sta...
Name: document, dtype: object
Training labels:
0    [NNP, NNP, SYM, NNP, NNP, VBD, PRP, VBZ, VBN, ...
1    [NNP, NNP, ,, NN, NN, IN, NNP, NNP, CC, NNP, ,...
2    [NNP, NNP, VBD, PRP, VBD, DT, NN, IN, NNP, NNP...
3    [DT, NNP, CC, NNP, NNP, VBP, VBG, JJ, NNS, IN,...
4    [VBN, NN, IN, CD, NNP, NNP, NNS, MD, VB, JJR, ...
Name: labels, dtype: object


#### Create embedding matrix

Create a matrix with shape `(vocabulary_size, embedding_dimension)` to store embeddings for each word in the corpus.

In [83]:
# Create vocabulary
idx_to_word, word_to_idx, word_listing = build_vocabulary(train_documents)
vocabulary_size = len(word_listing)
print('Vocabulary size:', vocabulary_size)

idx_to_label, label_to_idx, label_listing = build_vocabulary(train_labels)
n_classes = len(label_listing)
print('Number of classes (POS-tags):', n_classes)

Populating set of unique terms...: 100%|██████████| 100/100 [00:00<00:00, 4013.53it/s]
Populating set of unique terms...: 100%|██████████| 100/100 [00:00<00:00, 11276.83it/s]

Building vocabulary...
Done!
Vocabulary size: 7404
Building vocabulary...
Done!
Number of classes (POS-tags): 45





In [84]:
# Check OOV terms
oov_terms = check_OOV_terms(embedding_model, word_listing)

Total OOV terms: 355 (4.79%)


In [85]:
# Build co-occurrence matrix
co_occurrence_matrix = co_occurrence_count(train_documents, idx_to_word, word_to_idx, window_size=1)

Computing dense matrix...


Populating matrix...: 100%|██████████| 100/100 [00:00<00:00, 528.77it/s]


Done!
Co-occurrence matrix has shape: (7404, 7404)


In [86]:
# Build embedding matrix
embedding_matrix = build_embedding_matrix(embedding_model, embedding_dimension, word_to_idx, idx_to_word, oov_terms, co_occurrence_matrix)

print(f"Training embedding matrix shape: {embedding_matrix.shape}")
print(f"Training embedding matrix element type: {type(embedding_matrix[0,0])}")

7404 vectors in matrix:
7049 embeddings already in model, 355 vectors computed for OOV words (355 computed as average, 0 random)
Training embedding matrix shape: (7404, 300)
Training embedding matrix element type: <class 'numpy.float64'>
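
The log above reports that all OOV vectors were "computed as average". One plausible reading, sketched here under the assumption that an OOV word receives the co-occurrence-weighted mean of its in-vocabulary neighbours' vectors, is the following. The helper `oov_vector` and the toy arrays are hypothetical; the actual `build_embedding_matrix` may work differently.

```python
import numpy as np

def oov_vector(word_idx, co_occurrence, embeddings, known_mask):
    """Weighted mean of the embeddings of known neighbours (illustrative)."""
    # Ignore neighbours that have no pre-trained vector themselves
    weights = co_occurrence[word_idx] * known_mask
    if weights.sum() == 0:
        return None  # no known neighbour: the caller needs a fallback (e.g. a random vector)
    return weights @ embeddings / weights.sum()

# Hypothetical toy data: words 0 and 1 are known, word 2 is OOV
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
toy_known = np.array([1, 1, 0])
toy_counts = np.array([[0, 0, 1], [0, 0, 3], [1, 3, 0]])
toy_vec = oov_vector(2, toy_counts, toy_embeddings, toy_known)
```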


#### Data pre-processing

We now need to vectorize the input documents and labels (replacing words with their embedding vectors and one-hot encoding the tags) before feeding them to the network.

In [87]:
# Vectorized documents: sequences of indexes as in word vocabulary
X_seq = train_documents.apply(lambda txt: np.array([word_to_idx[i] for i in txt]))
# Vectorized labels: sequences of indexes as in label vocabulary
y_seq = train_labels.apply(lambda labels: np.array([label_to_idx[i] for i in labels]))

# Prepare data to train on batches
one_hot_matrix = np.identity((n_classes))
# X input data: each word index is replaced with the corresponding embedding
X_train_on_batch = X_seq.apply(lambda sequence: embedding_matrix[sequence])
# y input data: each label is replaced with the corresponding one-hot encoded label
y_train_on_batch = y_seq.apply(lambda sequence: one_hot_matrix[sequence])

In [88]:
# Visualize some examples
print("X train example:")
print(X_train_on_batch[42])
print("y train example:")
print(y_train_on_batch[42])

X train example:
[[ 0.058563    0.093696   -0.16047999 ...  0.11121    -0.48603001
   0.42183   ]
 [-0.40684    -0.33581001  0.46884    ... -0.10412    -0.63138002
   0.62768   ]
 [-0.22994    -0.043043   -0.34722    ... -0.36552    -0.45416999
   0.10034   ]
 ...
 [ 0.43693     0.085203    0.010649   ...  0.11165    -0.57367998
   0.42956001]
 [ 0.43665001  0.18793    -0.17022    ...  0.032894   -0.52144003
   0.22295   ]
 [-0.12559     0.01363     0.10306    ... -0.34224001 -0.022394
   0.13684   ]]
y train example:
[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
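
Both lookups in the cell above rely on NumPy integer-array indexing: indexing a matrix with a sequence of row indices returns the corresponding rows stacked in order. A small self-contained illustration on toy data (all names here are hypothetical):

```python
import numpy as np

# One-hot encoding: row i of the identity matrix is the one-hot vector of class i
n_toy_classes = 4
toy_one_hot = np.identity(n_toy_classes)
toy_label_indices = np.array([2, 0, 3])
toy_encoded = toy_one_hot[toy_label_indices]        # shape (3, 4)

# Embedding lookup: row i of the embedding matrix is the vector of word i
toy_embedding_matrix = np.arange(10, dtype=float).reshape(5, 2)
toy_word_indices = np.array([4, 1, 1])
toy_embedded = toy_embedding_matrix[toy_word_indices]  # shape (3, 2)
```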


### Validation

#### Load data

In [89]:
# Load validation data into dataframe
val = dataframe[dataframe['split'] == 'validation']
# Split into X (document) and y (tags) arrays
val_documents = val['document'].apply(lambda txt: txt.split())
val_labels = val['labels'].apply(lambda txt: txt.split())
print("Validation documents:")
print(val_documents.head())
print("Validation labels:")
print(val_labels.head())

Validation documents:
100    [allergan, inc., said, it, received, food, and...
101    [beauty, takes, backseat, to, safety, on, brid...
102    [elco, industries, inc., said, it, expects, ne...
103    [american, city, business, journals, inc., sai...
104    [the, following, were, among, yesterday, 's, o...
Name: document, dtype: object
Validation labels:
100    [NNP, NNP, VBD, PRP, VBD, NNP, CC, NNP, NNP, N...
101    [NN, VBZ, NN, TO, NNP, IN, NNPS, NN, VBZ, IN, ...
102    [NNP, NNPS, NNP, VBD, PRP, VBZ, JJ, NN, IN, DT...
103    [NNP, NNP, NNP, NNPS, NNP, VBD, PRP$, NN, ,, N...
104    [DT, VBG, VBD, IN, NN, POS, NNS, CC, NNS, IN, ...
Name: labels, dtype: object


#### Create embedding matrix

Create a matrix with shape `(vocabulary_size, embedding_dimension)` to store embeddings for each word in the corpus.

In [90]:
# Create vocabulary
val_idx_to_word, val_word_to_idx, val_word_listing = build_vocabulary(val_documents)
print('Vocabulary size:', len(val_word_listing))

Populating set of unique terms...: 100%|██████████| 50/50 [00:00<00:00, 2474.43it/s]

Building vocabulary...
Done!
Vocabulary size: 5420





In [91]:
# Check OOV terms (making use of training ones)
val_oov_terms = check_OOV_terms(embedding_model, val_word_listing, training=False, training_oov=oov_terms)

Total OOV terms: 189 (3.49%)


In [92]:
# Build co-occurrence matrix
val_co_occurrence_matrix = co_occurrence_count(val_documents, val_idx_to_word, val_word_to_idx, window_size=1)

Populating matrix...: 100%|██████████| 50/50 [00:00<00:00, 388.01it/s]

Computing dense matrix...
Done!
Co-occurrence matrix has shape: (5420, 5420)





In [93]:
# Build embedding matrix
val_embedding_matrix = build_embedding_matrix(embedding_model, embedding_dimension, val_word_to_idx, val_idx_to_word, val_oov_terms, val_co_occurrence_matrix,
                                              training=False, train_embedding_matrix=embedding_matrix, train_word_to_idx=word_to_idx)

print(f"Validation embedding matrix shape: {val_embedding_matrix.shape}")
print(f"Validation embedding matrix element type: {type(val_embedding_matrix[0,0])}")

5420 vectors in matrix:
5231 embeddings already in model, 189 vectors computed for OOV words (189 computed as average, 0 random)
Validation embedding matrix shape: (5420, 300)
Validation embedding matrix element type: <class 'numpy.float64'>


#### Data pre-processing

We now need to vectorize the input documents and labels (replacing words with their embedding vectors and one-hot encoding the tags) before feeding them to the network.

In [94]:
# Vectorized documents: sequences of indexes as in the validation word vocabulary
X_val_seq = val_documents.apply(lambda txt: np.array([val_word_to_idx[i] for i in txt]))
# Vectorized labels: sequences of indexes as in label vocabulary
y_val_seq = val_labels.apply(lambda labels: np.array([label_to_idx[i] for i in labels]))

# Prepare data to evaluate on batches
X_val_on_batch = X_val_seq.apply(lambda sequence: val_embedding_matrix[sequence])
y_val_on_batch = y_val_seq.apply(lambda sequence: one_hot_matrix[sequence])

In [95]:
# Show some examples
print("X val example:", X_val_on_batch[142])
print("y val example:", y_val_on_batch[142])

X val example: [[-0.44398999  0.12817    -0.25246999 ... -0.20043001 -0.082191
  -0.06255   ]
 [ 0.04656     0.21318001 -0.0074364  ...  0.0090611  -0.20988999
   0.053913  ]
 [ 0.36700001 -0.29550001  0.48686999 ...  0.32574999 -0.053274
  -0.20083   ]
 ...
 [-0.24132     0.12063     0.1919     ... -0.063158   -0.32837
   0.15507001]
 [ 0.11697    -0.29819    -0.15219    ... -0.56623    -0.22992
  -0.098088  ]
 [-0.12559     0.01363     0.10306    ... -0.34224001 -0.022394
   0.13684   ]]
y val example: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Training

### Models creation

In [96]:
baseline = baseline_model(embedding_dimension, n_classes)
gru = gru_model(embedding_dimension, n_classes)
double_lstm = double_lstm_model(embedding_dimension, n_classes)
#crf = crf_model(embedding_dimension, n_classes)
models = {
          'baseline' : baseline,
          'gru' : gru,
          'double_lstm' : double_lstm,
          #'crf' : crf
}

Model: "baseline"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_16 (Bidirectio (1, None, 512)            1140736   
_________________________________________________________________
time_distributed_12 (TimeDis (None, None, 45)          23085     
Total params: 1,163,821
Trainable params: 1,163,821
Non-trainable params: 0
_________________________________________________________________
Model: "gru"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_17 (Bidirectio (1, None, 512)            857088    
_________________________________________________________________
time_distributed_13 (TimeDis (None, None, 45)          23085     
Total params: 880,173
Trainable params: 880,173
Non-trainable params: 0
_________________________________________________________________
Model: "double_lstm"
________
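
The parameter counts reported in the summaries can be checked by hand. The sketch below assumes 256 units per direction (inferred from the 512-wide bidirectional output, not stated in this section), 300-dimensional inputs, and Keras' default GRU variant with `reset_after=True`; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # 3 gates; with reset_after=True each gate keeps two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

embedding_dim, units, n_tags = 300, 256, 45
bilstm_total = 2 * lstm_params(embedding_dim, units)   # forward + backward
bigru_total = 2 * gru_params(embedding_dim, units)
dense_total = dense_params(2 * units, n_tags)
```

Under these assumptions the totals match the summaries: 1,140,736 for the bidirectional LSTM, 857,088 for the bidirectional GRU and 23,085 for the dense layer.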

### Optimizers, metrics, callbacks

After looking at some papers, I decided to use the **Adam optimizer** and to employ **Categorical Crossentropy** as the loss function.

During training, the tracked metrics are (categorical) *accuracy*, *precision* and *recall*.

At this stage it is not meaningful to track the **(macro-averaged) F1-score**, which will be the final evaluation metric, because of its non-linearity. Indeed, `model.fit()` computes metrics one batch at a time and then averages over all the batches, which gives a misleading result for such a nonlinear metric.
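
A small worked example makes the point concrete: the mean of per-batch F1-scores generally differs from the F1-score computed on all predictions at once. The data below is a hypothetical binary toy case; the same effect applies to the macro-averaged multi-class score.

```python
import numpy as np
from sklearn.metrics import f1_score

# Two toy "batches" of binary labels and predictions (hypothetical data)
y_true_batches = [np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])]
y_pred_batches = [np.array([1, 1, 0, 0]), np.array([0, 1, 0, 0])]

# F1 computed per batch and then averaged
per_batch_f1 = [f1_score(t, p) for t, p in zip(y_true_batches, y_pred_batches)]
mean_of_batches = float(np.mean(per_batch_f1))

# F1 computed once on the concatenated predictions
global_f1 = f1_score(np.concatenate(y_true_batches), np.concatenate(y_pred_batches))
```

Here the first batch is perfect (F1 = 1) and the second has no true positives (F1 = 0), so the batch average is 0.5, while the score over all eight samples is 2/3.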

In [97]:
# Optimizer
opt = keras.optimizers.Adam(learning_rate=0.001)
# Number of epochs
n_epochs=40
# Metrics to track during training
metrics = [
           'acc', # Alias for keras.metrics.CategoricalAccuracy(), in case of multi-class output
           keras.metrics.Precision(),
           keras.metrics.Recall(),
            ]
# Objective function to minimize
loss = keras.losses.CategoricalCrossentropy()

Reducing the **learning rate** after a certain number of epochs helps improve both model accuracy and convergence speed: a constant learning rate can make training either too slow (if it is low) or too coarse (if it is high).

However, the decay should not start too early, to avoid slowing training down excessively: I decided to start reducing the learning rate after 3/4 of the epochs.

In [98]:
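# Callback to decrease learning rate after 3/4 of the epochs
def scheduler(epoch, lr):
    # Keep the initial learning rate for the first 3/4 of the epochs
    if epoch < n_epochs * 3 / 4:
        return lr
    else:
        # Then decay exponentially (np.math is deprecated, so np.exp is used instead)
        return float(lr * np.exp(-0.1))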

cbk = keras.callbacks.LearningRateScheduler(scheduler)
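
To see what this schedule produces, the same rule can be replayed without Keras. The sketch below uses a hypothetical `toy_scheduler` with the values from this notebook (`n_epochs=40`, initial rate `0.001`) and records the learning rate used at each epoch.

```python
import math

def toy_scheduler(epoch, lr, n_epochs=40):
    # Same rule as the callback: constant at first, then exponential decay
    if epoch < n_epochs * 3 / 4:
        return lr
    return lr * math.exp(-0.1)

lr = 0.001
lr_per_epoch = []
for epoch in range(40):
    lr = toy_scheduler(epoch, lr)
    lr_per_epoch.append(lr)
```

The rate stays at `0.001` for epochs 0 to 29 and is then multiplied by `exp(-0.1)` ten times, ending at `0.001 * exp(-1)`, i.e. roughly 37% of the initial value.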

### Compiling

In [99]:
# All at once
for model in models.values():
    model.compile(
        loss=loss,
        optimizer=opt,
        metrics=metrics
    )

### Fitting

In [100]:
# All at once
for name, model in models.items():
    print(f"Training {name} model...")
    model.fit(
    batch_generator(X_train_on_batch, y_train_on_batch, start=0),
    epochs=n_epochs,
    batch_size=1,
    steps_per_epoch=X_train_on_batch.shape[0],
    callbacks=[cbk],
    # The following line makes the training freeze at the end of the first epoch (?)
    # I thus decided to perform evaluation only after training
    #validation_data=batch_generator(X_val_on_batch, y_val_on_batch, start=100)
    )
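
`batch_generator` is defined earlier in the notebook and is not shown in this section. As a rough sketch of what such a generator might look like, assuming it loops forever and yields one document per step with a leading batch axis, consider the hypothetical `toy_batch_generator` below (NumPy and pandas only).

```python
import itertools
import numpy as np
import pandas as pd

def toy_batch_generator(X, y):
    """Yield (1, T, D) inputs and (1, T, C) targets, one document at a time, forever."""
    for i in itertools.cycle(range(len(X))):
        # Add the batch axis expected by the recurrent layers
        yield X.iloc[i][np.newaxis, ...], y.iloc[i][np.newaxis, ...]

# Hypothetical toy data: two documents of different lengths
toy_X = pd.Series([np.zeros((3, 300)), np.zeros((5, 300))])
toy_y = pd.Series([np.zeros((3, 45)), np.zeros((5, 45))])
gen = toy_batch_generator(toy_X, toy_y)
first_x, first_y = next(gen)
second_x, _ = next(gen)
third_x, _ = next(gen)  # wraps around to the first document
```

If the real generator is infinite like this one, that would be a likely (unverified) explanation for the freeze mentioned in the comment: when `validation_data` is a generator and `validation_steps` is not set, Keras keeps drawing validation batches until the generator is exhausted, which never happens for an endless one. Passing `validation_steps` equal to the number of validation documents would be the usual remedy.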

Training baseline model...
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Training gru model...
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Training 

## Validation

### Models evaluation

**Punctuation masking**, i.e. removing labels corresponding to symbols (`[ . , : ( ) '' `` # $ ]`), is needed to compute the *F1-score* appropriately. Each of these labels corresponds one-to-one to its token, so it provides little or no information about the model's performance: such tokens could be labelled with a simple dictionary lookup.

In [103]:
# Define list of punctuation labels
punctuation_labels = [".", ",", ":", "-LRB-", "-RRB-", "''", "``", "#", "$" ]
punctuation_indices = [label_to_idx[label] for label in punctuation_labels]

# Mask punctuation from DataFrame rows (with expected columns ['y_true', 'y_pred'])
# NOTE: the `labels` argument is currently unused; the mask is built from the global `punctuation_indices`
def punctuation_mask(row, labels):
    # Create 'True' mask
    mask = np.ones(row['y_true'].shape, dtype=bool)
    # Set to 'False' the elements whose true label is a punctuation class
    for label in punctuation_indices:
        mask = np.logical_and(mask, row['y_true'] != label)
    # Return masked inputs as DataFrame row
    return pd.Series([row['y_true'][mask], row['y_pred'][mask]], index=['y_true', 'y_pred'])
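
The same masking can be written without an explicit loop using `np.isin`. This is an alternative sketch on hypothetical toy arrays, not the function used in the notebook.

```python
import numpy as np

toy_punct = [0, 3]                     # hypothetical punctuation label indices
toy_true = np.array([0, 5, 3, 7, 5])   # gold label indices
toy_pred = np.array([0, 5, 3, 2, 5])   # predicted label indices

# True where the gold label is NOT a punctuation class
keep = ~np.isin(toy_true, toy_punct)
masked_true, masked_pred = toy_true[keep], toy_pred[keep]
```

Note that the mask depends only on the gold labels, so a non-punctuation token wrongly predicted as punctuation is still counted as an error.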

The following function evaluates the given model using `scikit-learn`'s `classification_report` function.

Using `model.evaluate` would be inaccurate for the reasons discussed in the *Optimizers, metrics, callbacks* section (i.e. averaging over batches does not provide meaningful information for nonlinear metrics).

In [105]:
def evaluate_model(X, y_true, model, embedding_dimension, punctuation_indices):
    
    # Extract predictions from model
    y_pred = X.apply(lambda sequence: model.predict(sequence.reshape(1,sequence.shape[0],embedding_dimension)).argmax(-1).reshape(sequence.shape[0]))
    # Build a DataFrame with both true labels and predicted ones
    y = pd.concat([y_true, y_pred], axis=1).rename(columns={'labels': 'y_true', 'document': 'y_pred'})
    # Filter out labels corresponding to punctuation classes
    y = y.apply(lambda row: punctuation_mask(row, punctuation_indices), axis=1)
    # Concatenate to get two flat arrays
    y_true = np.concatenate(list(y['y_true']))
    y_pred = np.concatenate(list(y['y_pred']))
    # Return report using scikit-learn's function
    return classification_report(y_true, y_pred, output_dict=True)['macro avg']
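
As a small illustration of what the `'macro avg'` entry contains, the toy example below computes it on six hypothetical predictions where one class is never predicted. In that situation scikit-learn raises an `UndefinedMetricWarning` for the undefined precision unless `zero_division` is set explicitly, as done here.

```python
import numpy as np
from sklearn.metrics import classification_report

toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 0, 1, 0, 1, 1])   # class 2 is never predicted

# zero_division=0 scores the undefined precision of class 2 as 0 without a warning
macro = classification_report(
    toy_true, toy_pred, output_dict=True, zero_division=0
)["macro avg"]
```

The per-class F1-scores are 0.8, 0.4 and 0, so the macro average is 0.4: every class weighs the same regardless of its frequency, which is why rare, poorly predicted tags pull the score down noticeably.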

In [106]:
# Dictionary mapping score -> name, so the best model can be retrieved from the maximum score
f1_scores = {}
# Print report and store f1-score results
for name, model in models.items():
    print(f"Evaluating {name} model...")
    report = evaluate_model(X_val_on_batch, y_val_seq, model, embedding_dimension, punctuation_indices)
    print(tabulate([report.keys(), report.values()]))
    f1_scores[report['f1-score']] = name
    print('*********')

Evaluating baseline model...




------------------  ------------------  ------------------  -------
precision           recall              f1-score            support
0.6263771469015803  0.6209131493311185  0.6168224812950991  27418
------------------  ------------------  ------------------  -------
*********
Evaluating gru model...
------------------  -----------------  ------------------  -------
precision           recall             f1-score            support
0.7341656966060555  0.699053072853373  0.7045628516604224  27418
------------------  -----------------  ------------------  -------
*********
Evaluating double_lstm model...
------------------  ------------------  ------------------  -------
precision           recall              f1-score            support
0.6012593096138272  0.5689522500476847  0.5773810303316118  27418
------------------  ------------------  ------------------  -------
*********


### Best model extraction

Extract the best-performing model that will be later tested.

In [107]:
best_score = max(f1_scores)
best_model = f1_scores[best_score]
print(f"The best model is the {best_model} model, with an f1-score of {best_score}")

The best model is the gru model, with an f1-score of 0.7045628516604224
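
Keying the dictionary by score works here, but two models with identical scores would overwrite each other. A slightly more robust variant, sketched with hypothetical names and the validation scores above rounded to four decimals, keys by model name instead:

```python
# Validation F1-scores keyed by model name (values rounded from the reports above)
toy_scores = {"baseline": 0.6168, "gru": 0.7046, "double_lstm": 0.5774}

# Pick the name whose score is highest
toy_best_name = max(toy_scores, key=toy_scores.get)
toy_best_score = toy_scores[toy_best_name]
```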


## Test

Finally, let's test our **best-performing** model on test data!

#### Load data

In [108]:
# Load Test data into dataframe
test = dataframe[dataframe['split'] == 'test']
# Split into X (document) and y (tags) arrays
test_documents = test['document'].apply(lambda txt: txt.split())
test_labels = test['labels'].apply(lambda txt: txt.split())
print("Test documents:")
print(test_documents.head())
print("Test labels:")
print(test_labels.head())

Test documents:
150    [trinity, industries, inc., said, it, reached,...
151    [rms, international, inc., ,, hasbrouk, height...
152    [intelogic, trace, inc., ,, san, antonio, ,, t...
153    [dell, computer, corp., said, it, cut, prices,...
154    [usx, corp., posted, a, 23, %, drop, in, third...
Name: document, dtype: object
Test labels:
150    [NNP, NNPS, NNP, VBD, PRP, VBD, DT, JJ, NN, TO...
151    [NNP, NNP, NNP, ,, NNP, NNP, ,, NNP, ,, VBG, D...
152    [NNP, NNP, NNP, ,, NNP, NNP, ,, NNP, ,, VBD, P...
153    [NNP, NNP, NNP, VBD, PRP, VBD, NNS, IN, JJ, IN...
154    [NNP, NNP, VBD, DT, CD, NN, NN, IN, NN, NN, ,,...
Name: labels, dtype: object


#### Create embedding matrix

Create a matrix with shape `(vocabulary_size, embedding_dimension)` to store embeddings for each word in the corpus.

In [109]:
# Create vocabulary
test_idx_to_word, test_word_to_idx, test_word_listing = build_vocabulary(test_documents)
print('Vocabulary size:', len(test_word_listing))

Populating set of unique terms...: 100%|██████████| 49/49 [00:00<00:00, 5071.21it/s]

Building vocabulary...
Done!
Vocabulary size: 3407





In [110]:
# Check OOV terms (making use of training ones)
test_oov_terms = check_OOV_terms(embedding_model, test_word_listing, training=False, training_oov=oov_terms)

Total OOV terms: 140 (4.11%)


In [111]:
# Build co-occurrence matrix
test_co_occurrence_matrix = co_occurrence_count(test_documents, test_idx_to_word, test_word_to_idx, window_size=1)

Populating matrix...: 100%|██████████| 49/49 [00:00<00:00, 860.74it/s]

Computing dense matrix...
Done!
Co-occurrence matrix has shape: (3407, 3407)





In [112]:
# Build embedding matrix
test_embedding_matrix = build_embedding_matrix(embedding_model, embedding_dimension, test_word_to_idx, test_idx_to_word, test_oov_terms, test_co_occurrence_matrix,
                                              training=False, train_embedding_matrix=embedding_matrix, train_word_to_idx=word_to_idx)

print(f"Test embedding matrix shape: {test_embedding_matrix.shape}")
print(f"Test embedding matrix element type: {type(test_embedding_matrix[0,0])}")

3407 vectors in matrix:
3267 embeddings already in model, 140 vectors computed for OOV words (140 computed as average, 0 random)
Test embedding matrix shape: (3407, 300)
Test embedding matrix element type: <class 'numpy.float64'>


#### Data pre-processing

We now need to vectorize the input documents and labels (replacing words with their embedding vectors and one-hot encoding the tags) before feeding them to the network.

In [113]:
# Vectorized documents: sequences of indexes as in the test word vocabulary
X_test_seq = test_documents.apply(lambda txt: np.array([test_word_to_idx[i] for i in txt]))
# Vectorized labels: sequences of indexes as in label vocabulary
y_test_seq = test_labels.apply(lambda labels: np.array([label_to_idx[i] for i in labels]))

# Prepare data to test on batches
X_test_on_batch = X_test_seq.apply(lambda sequence: test_embedding_matrix[sequence])
y_test_on_batch = y_test_seq.apply(lambda sequence: one_hot_matrix[sequence])

In [114]:
# Show some examples
print("X test example:", X_test_on_batch[192])
print("y test example:", y_test_on_batch[192])

X test example: [[-0.11495     0.28154999  0.42993999 ... -0.29853001 -0.15554
   0.37920001]
 [-0.20197999  0.38538     0.11671    ...  0.022478   -0.3682
   0.34619999]
 [ 0.14841001 -0.035665    0.24180999 ...  0.0035853   0.2269
   0.46671   ]
 ...
 [-0.53399003 -0.046982    0.36667001 ... -0.27298999 -0.050886
  -0.33184001]
 [-0.13858999  0.25777999 -0.065696   ... -1.13069999 -0.0065591
  -0.11813   ]
 [-0.12559     0.01363     0.10306    ... -0.34224001 -0.022394
   0.13684   ]]
y test example: [[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Test best model

We finally compute the **macro-averaged F1-score** on **test data** for our best-performing model.

In [115]:
# Test best model
print(f"Testing {best_model} model...")
report = evaluate_model(X_test_on_batch, y_test_seq, models[best_model], embedding_dimension, punctuation_indices)
print(tabulate([report.keys(), report.values()]))
print('*********')

Testing gru model...
------------------  ------------------  ------------------  -------
precision           recall              f1-score            support
0.8052957431378859  0.8059550217031418  0.8000951759188805  13676
------------------  ------------------  ------------------  -------
*********




In [117]:
print(f"The best model ({best_model}) has an f1-score on test-data of {report['f1-score']}")

The best model (gru) has an f1-score on test-data of 0.8000951759188805
