## NOTE: Do not run the B_random_cv_mit_movie_query notebook at the same time as this notebook, as it is not recommended to have > 1 heavy tensorflow process running at the same time such as training or performing random search cross validation. 

In [1]:
import os
curr_dir = os.getcwd()

## Filepath

In [2]:
training_sets_filepath = os.path.join(curr_dir,'training_set','movie_queries_training_dataset.csv').replace('\\','/')
word_vectors_filepath = os.path.join(curr_dir,'word_vector','word_vector.txt').replace('\\','/')
target_to_index_filepath = os.path.join(curr_dir,'index_converter','target_to_index.txt').replace('\\','/')
save_weights_filepath = os.path.join(curr_dir,'model_training_weights','weights.{epoch:02d}.hdf5').replace('\\','/')
f1_hist_filepath = os.path.join(curr_dir,'training_hist','f1_hist.txt').replace('\\','/')
best_hyperparams_info_filepath = os.path.join(curr_dir,'random_search_data','best_hyperparameter_info.txt').replace('\\','/')

## Imports

In [3]:
import requests
import pickle
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.data import load
import numpy as np
import pandas as pd
import string
import re 
from keras import backend as k
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, LSTM, Input, concatenate, TimeDistributed, Bidirectional, Masking
from keras_contrib.layers import CRF
from keras_contrib.metrics import crf_viterbi_accuracy, crf_accuracy
from keras_contrib.losses import crf_loss
from keras.optimizers import Adam  
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback, TensorBoard
from keras.preprocessing.text import text_to_word_sequence
from keras.utils import to_categorical
from keras.callbacks import Callback
from sklearn.model_selection import train_test_split, ParameterGrid, ParameterSampler, GridSearchCV
from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from keras.wrappers.scikit_learn import KerasClassifier
import tensorflow as tf

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


### You only need to run the cell below once, you can delete the cell below and across all notebooks

In [4]:
nltk.download('tagsets')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Configurations

In [5]:
#Load dictionary to convert categories to index
with open(target_to_index_filepath, "rb") as t:
    target_to_index = pickle.load(t)
    
f1_labels = list(target_to_index.values())
f1_labels.pop(0)

#input_sequence for sentences, output_sequence for targets of sentences
input_sequence = []
output_sequence = []

#Save all model weights and then select the one with the best f1 score afterwards
save_weights = ModelCheckpoint(save_weights_filepath, save_best_only=False, save_weights_only=True, monitor='loss', mode='min')

In [6]:
#Load all possible pos tags
tagdict = load('help/tagsets/upenn_tagset.pickle')
all_pos = list(tagdict.keys())

all_pos_tags = []
for pos in all_pos:
    all_pos_tags.append('pos_'+pos)

# Prepare training data 

### Reading pre-processed dataframe

In [7]:
#Read pre-processed dataset for training
df = pd.read_csv(training_sets_filepath)
df_target = df.copy()

### Extract list of tokenized words from dataframe

In [8]:
#Get list of words from dataframe
tokenized_text = df['word'].tolist()

### Word Vectorization: FastText

Need to represent words as numbers because machine learning models cannot read raw text


* High ability to vectorize out-of-vocabulary words
  * Some texts may contain words (names, terminologies) that popular pre-trained word vectorization models such as GloVe and Word2Vec cannot vectorize as these words were very probably not included in their training corpus
  * FastText performs word embedding using character n-grams or sub words 

Example: n-gram = 3, the word 'matter' would be broken into <ma, mat, att, tte, ter, er>

Used Pre-Trained Model, not enough data to train a FastText model

Vectorized **lower text** because there are more lower case n-grams compared to n-grams with upper case letters in the FastText model's wikipedia training corpus

Standardised vectors for numbers as numbers have no semantic meaning, they shouldn't have different vectors

### Why make an API for word vectorization instead of including it in the notebook?

Pre-Trained FastText model is 7GB and loading it takes up alot of time & memory. Loading >1 FastText model will result in a memory error in the second notebook

API allows multiple notebooks to access the word vectorization


```python

def containsNumbers(check):
    return any(char.isdigit() for char in check)

@app.route("/word_vectorization", methods=['POST'])
def word_vectorization():
    word_vectors = []
    tokenized_text_lower = request.json
    for word in tokenized_text_lower:
        if containsNumbers(word):
            word_vector = fast_text_model.wv['<NUMBER>']
            word_vectors.append(word_vector)
            continue

        word_vector = fast_text_model.wv[word]

        word_vectors.append(word_vector)

    with open(word_vectors_filepath, "wb") as t:
        pickle.dump(word_vectors, t)
    return Response(status = 200)

```

In [9]:
#Call Word Vectorization API and load the processed word vectors
word_vector_api_data = tokenized_text
session = requests.Session()
session.trust_env = False
session.post('http://127.0.0.1:5000/word_vectorization', json = word_vector_api_data) #add proxies args if needed

with open(word_vectors_filepath, "rb") as t:
    word_vectors = pickle.load(t)

### Adding features to the training dataframe

For this set of text, there aren't many useful word features to help in learning. For instance, capitalisation cannot be considered for a word feature as the training data found online contains words that are all lower cased already.

In [10]:
#Add word featues to dataframe
df['word_vec'] = word_vectors
df = pd.get_dummies(df, columns=['pos'])

In [11]:
df

Unnamed: 0.1,Unnamed: 0,sentence_no,word,target,word_vec,pos_$,pos_CC,pos_CD,pos_DT,pos_EX,...,pos_VB,pos_VBD,pos_VBG,pos_VBN,pos_VBP,pos_VBZ,pos_WDT,pos_WP,pos_WP$,pos_WRB
0,0,0,what,O,"[0.4232092, -0.67043304, -0.28789037, 0.851811...",0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,0,movies,O,"[0.2133574, -0.056065083, -1.0704498, 0.231683...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,star,O,"[0.09002825, 0.3791883, -0.61237025, -0.014841...",0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,3,0,bruce,B-ACTOR,"[-0.40589693, 0.67957574, -0.7557319, 0.120949...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,willis,I-ACTOR,"[-0.2464564, 0.852988, -0.72730404, 0.00421249...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,1,show,O,"[0.05406701, -0.15761112, -0.7017879, 0.234091...",0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
6,6,1,me,O,"[0.44699496, -0.6717845, -1.0470167, 0.9299539...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,7,1,films,O,"[0.23820278, 0.3546044, -0.92324656, 0.2704622...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,8,1,with,O,"[0.025328377, -0.05867663, 0.08570069, 0.16157...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,9,1,drew,B-ACTOR,"[-0.47852015, 0.51010704, -0.2871722, 0.388102...",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Fix arrangement of columns

We need to add all the pos columns from nltk and rearrange them in order for consistency. We need to add all pos columns, incase future words we predict on have a pos that our training text does not contain.

In [12]:
#Add missing pos columns 
df_cols = list(df.columns)
add_pos_col = [add for add in all_pos_tags if add not in df_cols]
print(' ')
print('Missing pos tags: {}'.format(len(add_pos_col)))
print(' ')
print('Need to add pos columns: {}'.format(add_pos_col))

 
Missing pos tags: 12
 
Need to add pos columns: ['pos_LS', "pos_''", 'pos_UH', 'pos_--', 'pos_:', 'pos_(', 'pos_)', 'pos_.', 'pos_,', 'pos_``', 'pos_SYM', 'pos_POS']


In [13]:
#Assign a binary value of 0 to the newly added pos columns as they are not present in our training data
for added_pos in add_pos_col:
    df[added_pos] = 0

#Rearrange in fixed order for consistency
arrange_df_cols = ['sentence_no','word','word_vec']
for arrange_pos in all_pos_tags:
    arrange_df_cols.append(arrange_pos)
df = df.reindex(columns=arrange_df_cols)

In [14]:
df

Unnamed: 0,sentence_no,word,word_vec,pos_LS,pos_TO,pos_VBN,pos_'',pos_WP,pos_UH,pos_VBG,...,pos_MD,pos_VB,pos_WRB,pos_NNP,pos_EX,pos_NNS,pos_SYM,pos_CC,pos_CD,pos_POS
0,0,what,"[0.4232092, -0.67043304, -0.28789037, 0.851811...",0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,movies,"[0.2133574, -0.056065083, -1.0704498, 0.231683...",0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,star,"[0.09002825, 0.3791883, -0.61237025, -0.014841...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,bruce,"[-0.40589693, 0.67957574, -0.7557319, 0.120949...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,willis,"[-0.2464564, 0.852988, -0.72730404, 0.00421249...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,show,"[0.05406701, -0.15761112, -0.7017879, 0.234091...",0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,1,me,"[0.44699496, -0.6717845, -1.0470167, 0.9299539...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1,films,"[0.23820278, 0.3546044, -0.92324656, 0.2704622...",0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
8,1,with,"[0.025328377, -0.05867663, 0.08570069, 0.16157...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,drew,"[-0.47852015, 0.51010704, -0.2871722, 0.388102...",0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


### Creation of X training dataset

Create a dictionary where each key is a sentence number and each value is the feature vectors of all the words in that sentence

The feature vector of a word will be its word vector, its word features and pos tags

In [15]:
#Get the sentence feature vectors. Each sentence contains a list of all its word feature vectors.
df = df.drop(columns=['word'])
sentence_feature_vectors = {}
for index,row in df.iterrows():
    sentence_number = row[0]
    word_feature_vector = np.concatenate((row[1:]), axis = None)
    if sentence_number in sentence_feature_vectors.keys():
        sentence_feature_vectors[sentence_number].append(word_feature_vector)
    else:
        sentence_feature_vectors[sentence_number] = [word_feature_vector]

<br>
Now, each dictionary key is a sentence number. Each dictionary value is a list of word feature vectors of the sentence

In [16]:
print('Feature Vectors in the first sentence:')
sentence_feature_vectors[1]

Feature Vectors in the first sentence:


[array([ 0.05406701, -0.15761112, -0.7017879 ,  0.23409137,  0.49511296,
         0.5845138 , -0.21149723,  0.41470313, -0.66746515,  0.3139659 ,
        -0.22264327, -0.06002229, -0.26404837,  0.11272451, -0.07870636,
        -0.00662771, -0.09953663,  0.19331707, -0.65225816, -0.23743977,
         0.22146857,  0.44456705, -0.10762705, -0.02459927,  0.39042968,
         0.2107949 ,  0.09701276, -0.1647754 , -0.3796033 ,  0.01939271,
        -0.23600912,  0.01130283,  0.2009593 ,  0.05590365,  0.0929635 ,
        -0.14136171,  0.04371291, -0.08660819,  0.02132413, -0.14475128,
         0.7776699 , -0.3512436 , -0.50120497, -0.04874106,  0.31615117,
        -0.07757556,  0.19927543, -0.15894759, -0.02038171,  0.16272163,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0. 

### Padding and adding each sentence to the input sequence

This is done because keras accepts fixed-length input to improve performance by creating tensors of fixed shapes

In [17]:
#Pad length for sentences and append to the input_sequence 

#Length of the feature vector 
dummy_length = len(sentence_feature_vectors[1][0])

#Iterate over the sentences, pad them and add them to the input sequence
for sentence in sentence_feature_vectors.values():
    while len(sentence) < 80:
        sentence.append(np.array([0 for zero in range(dummy_length)]))

    input_sequence.append(np.array(sentence))

In [18]:
x = np.array(input_sequence)

In [19]:
print('Input Sequence:')
x

Input Sequence:


array([[[ 0.42320919, -0.67043304, -0.28789037, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.2133574 , -0.05606508, -1.07044983, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.09002825,  0.3791883 , -0.61237025, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]],

       [[ 0.05406701, -0.15761112, -0.70178789, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.44699496, -0.67178452, -1.04701674, ...,  0.        ,
          0.        ,  0.        ],
        [ 0.23820278,  0.35460439, -0.92324656, ...,  0.        ,
          0.        ,  0.        ],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  

In [20]:
print('Shape of X training set: %s' %(x.shape,))
print('Number of samples (sentences): {}'.format(x.shape[0]))
print('Number of timesteps (word): {}'.format(x.shape[1]))
print('Number of features (features for each timestep): {}'.format(x.shape[2]))

Shape of X training set: (9775, 80, 95)
Number of samples (sentences): 9775
Number of timesteps (word): 80
Number of features (features for each timestep): 95


### Creation of Y training dataset

Create a dictionary where each key is a sentence number and each value is the targets of that sentence

In [21]:
#Add the target for each word of the sentence
targets = {}
for index,row in df_target.iterrows():
    sentence_number = row[1]
    word_target = row[-1]
    if sentence_number in targets.keys():
        targets[sentence_number].append(word_target)
    else:
        targets[sentence_number] = [word_target]

### Conversion of names to index, padding and add each sentence to the output sequence

In [22]:
#Convert the targets to their respective index, pad length for sentences and append to output_sequence
for sentence in targets.values():
    sentence = [target_to_index[target] for target in sentence]
    while len(sentence) < 80:
        sentence.append(target_to_index['O'])

    output_sequence.append(np.array(sentence))
    
y = np.array(output_sequence)

### One hot encoding of y values

CRF needs the input and output sequence to be 3 Dimensional, which is why one-hot encoding is done for the y output values

In [23]:
y = to_categorical(y, num_classes=25)

In [24]:
print('Output Sequence:')
y

Output Sequence:


array([[[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0.

In [25]:
print('Shape of Y training set: %s' %(y.shape,))
print('Number of samples (sentences): {}'.format(y.shape[0]))
print('Number of timesteps (word): {}'.format(y.shape[1]))
print('Number of targets (one-hot): {}'.format(y.shape[2]))

Shape of Y training set: (9775, 80, 25)
Number of samples (sentences): 9775
Number of timesteps (word): 80
Number of targets (one-hot): 25


<br>
<br>

Samples refer to the number of sentences we have

Timesteps refer to the number of words in each sentence (80 because of padding)

Features refer to the features of each word (word vector, capitalise etc.)

<br>
<br>

## Preparation of training set

In [26]:
x_s,y_s = shuffle(x,y,random_state=23)

In [27]:
x_train = x_s[:9286]
x_test = x_s[9286:]

In [28]:
y_train = y_s[:9286]
y_test = y_s[9286:]

I allocated a slightly higher proportion of data to training as the amount of data that is available is insufficient for a model to learn very well. Furthermore, the model would be validated on 489 sentences at the end of each epoch based on this proportion, which I feel is sufficient for its validation.

I am using the F1 score as the validation metric, thus I need to create my own callback function to perform this validation on the test data at the end of each epoch

We also need to get the prediciton on the test set and reshape both y sets so that it would be 2D 
as sklearn's f1 evaluation only accepts 2D inputs. Just all the words and 
their corresponding targets, not split into sentences.

In [29]:
y_shape = y_test.shape
y_newshape = (y_shape[0]*y_shape[1], y_shape[-1])
y_true_reshaped = np.reshape(y_test, y_newshape)

In [30]:
class Compute_f1_Of_Epoch(Callback):
    def __init__(self, x_test, y_newshape, y_true_reshaped):
        self.x_test = x_test
        self.y_newshape = y_newshape
        self.y_true_reshaped = y_true_reshaped
        
    def on_train_begin(self, logs={}):
        self.f1_of_epochs = []

    def on_epoch_end(self, epoch, logs={}):
        y_pred = self.model.predict(self.x_test) 
        y_pred_reshaped = np.reshape(y_pred, y_newshape)
        self.f1_of_epochs.append(f1_score(y_true_reshaped, y_pred_reshaped, average = 'macro', labels=f1_labels))
        print('The f1 score for this epoch is: {}'.format(f1_score(y_true_reshaped, y_pred_reshaped, average = 'macro', labels=f1_labels)))
        return

compute_f1_of_epoch = Compute_f1_Of_Epoch(x_test=x_test,y_newshape=y_newshape,y_true_reshaped=y_true_reshaped)

## Getting best hyperparameters of model

Change accordingly if needed. The following combination of hyperparameters is found to work quite well, which is why i added it in. If you wish to use the best hyperparameter info from your own random search cv, please change the cell below this cell to a markdown cell and the one below that to a code cell

In [31]:
best_hyperparameter_info = ['dummy',{'units_hyperparams': 100, 'recurrent_dropout_hyperparams': 0.3, 'optimizer_hyperparams': 'Adadelta', 'hidden_layers_hyperparams': 1, 'epochs_hyperparams': 250, 'dropout_hyperparams': 0.2, 'batch_size_hyperparams': 32}]

with open(best_hyperparams_info_filepath, "rb") as t:
    best_hyperparameter_info = pickle.load(t)

## Defining and training model

In [32]:
#Define function to create base model dynamically
#I used a dictionary and formatting to add hidden layers dynamically
def base_model(units=50, optimizer='Adam', hidden_layers=2, activation_td ='relu', dropout=0.1, recurrent_dropout=0.1):
    hidden_layers_stored = {}
    counter=1
    input = Input(shape=(x.shape[1],x.shape[-1]))
    mask = Masking(mask_value=0.)(input)
    for hl in range(hidden_layers):
        if counter==1:
            hidden_layers_stored['hl_{}'.format(counter)] = Bidirectional(LSTM(units=units, return_sequences=True, dropout=dropout, recurrent_dropout=recurrent_dropout))(mask)  
        else:
            hidden_layers_stored['hl_{}'.format(counter)] = Bidirectional(LSTM(units=units, return_sequences=True, dropout=dropout, recurrent_dropout=recurrent_dropout))(hidden_layers_stored['hl_{}'.format(counter-1)])
        counter+=1
    model_last_layer = TimeDistributed(Dense(50, activation=activation_td))(hidden_layers_stored['hl_{}'.format(counter-1)])  
    crf = CRF(25)  
    out = crf(model_last_layer)  
    model_final = Model(input, out)
    model_final.compile(optimizer=optimizer, loss=crf_loss, metrics=[crf_accuracy])
    return model_final

In [33]:
#GPU Options are added to prevent the program from taking up all the computer GPU's memory. 
graph_trainer = tf.Graph()
with graph_trainer.as_default():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.log_device_placement = True
    session_trainer = tf.Session(config=config)
    with session_trainer.as_default():
        model_trainer = base_model(units=best_hyperparameter_info[1]['units_hyperparams'],optimizer=best_hyperparameter_info[1]['optimizer_hyperparams'],hidden_layers=best_hyperparameter_info[1]['hidden_layers_hyperparams'],dropout=best_hyperparameter_info[1]['dropout_hyperparams'],recurrent_dropout=best_hyperparameter_info[1]['recurrent_dropout_hyperparams'])
        model_trainer.fit(x_s, y_s, epochs=best_hyperparameter_info[1]['epochs_hyperparams'], batch_size=best_hyperparameter_info[1]['batch_size_hyperparams'], callbacks=[save_weights, compute_f1_of_epoch])    

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Epoch 1/250


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


The f1 score for this epoch is: 0.38883495669311835
Epoch 2/250
The f1 score for this epoch is: 0.5003565195045758
Epoch 3/250
The f1 score for this epoch is: 0.5316941834005819
Epoch 4/250
The f1 score for this epoch is: 0.5646643096460804
Epoch 5/250
The f1 score for this epoch is: 0.6074049642951068
Epoch 6/250
The f1 score for this epoch is: 0.6495176725426405
Epoch 7/250
The f1 score for this epoch is: 0.6430846052820246
Epoch 8/250
The f1 score for this epoch is: 0.6792745990830817
Epoch 9/250
The f1 score for this epoch is: 0.6884647664746962
Epoch 10/250
The f1 score for this epoch is: 0.6855546130530401
Epoch 11/250
The f1 score for this epoch is: 0.7082861374760966
Epoch 12/250
The f1 score for this epoch is: 0.7090862996660485
Epoch 13/250
The f1 score for this epoch is: 0.7205119113518846
Epoch 14/250
The f1 score for this epoch is: 0.7132148400471454
Epoch 15/250
The f1 score for this epoch is: 0.7232008909175415
Epoch 16/250
The f1 score for this epoch is: 0.7377418133309

The f1 score for this epoch is: 0.8485796105516719
Epoch 53/250
The f1 score for this epoch is: 0.8457706947960176
Epoch 54/250
The f1 score for this epoch is: 0.8562256320444049
Epoch 55/250
The f1 score for this epoch is: 0.8425906418573356
Epoch 56/250
The f1 score for this epoch is: 0.8618381737856907
Epoch 57/250
The f1 score for this epoch is: 0.8721035382080382
Epoch 58/250
The f1 score for this epoch is: 0.8597358800533009
Epoch 59/250
The f1 score for this epoch is: 0.8716630775089689
Epoch 60/250
The f1 score for this epoch is: 0.8594166135277078
Epoch 61/250
The f1 score for this epoch is: 0.870519060612406
Epoch 62/250
The f1 score for this epoch is: 0.8511346574461749
Epoch 63/250
The f1 score for this epoch is: 0.8702707016784105
Epoch 64/250
The f1 score for this epoch is: 0.8825932419402984
Epoch 65/250
The f1 score for this epoch is: 0.8748115018106776
Epoch 66/250
The f1 score for this epoch is: 0.8765647654326248
Epoch 67/250
The f1 score for this epoch is: 0.8751514

The f1 score for this epoch is: 0.8986421937352457
Epoch 103/250
The f1 score for this epoch is: 0.8887404346832041
Epoch 104/250
The f1 score for this epoch is: 0.8984877223066637
Epoch 105/250
The f1 score for this epoch is: 0.9081775585716493
Epoch 106/250
The f1 score for this epoch is: 0.8995889892895933
Epoch 107/250
The f1 score for this epoch is: 0.8944326939254922
Epoch 108/250
The f1 score for this epoch is: 0.9007873202361735
Epoch 109/250
The f1 score for this epoch is: 0.8974732190250494
Epoch 110/250
The f1 score for this epoch is: 0.9000313034013413
Epoch 111/250
The f1 score for this epoch is: 0.9125724262158951
Epoch 112/250
The f1 score for this epoch is: 0.894272320938085
Epoch 113/250
The f1 score for this epoch is: 0.9032504249030238
Epoch 114/250
The f1 score for this epoch is: 0.9064434306881824
Epoch 115/250
The f1 score for this epoch is: 0.9016342530677325
Epoch 116/250
The f1 score for this epoch is: 0.9094996480798955
Epoch 117/250
The f1 score for this epoc

The f1 score for this epoch is: 0.9265141237601888
Epoch 153/250
The f1 score for this epoch is: 0.9126245391438612
Epoch 154/250
The f1 score for this epoch is: 0.9115171240970322
Epoch 155/250
The f1 score for this epoch is: 0.9229945240017788
Epoch 156/250
The f1 score for this epoch is: 0.9152924073139744
Epoch 157/250
The f1 score for this epoch is: 0.9153951126828651
Epoch 158/250
The f1 score for this epoch is: 0.9154069011741579
Epoch 159/250
The f1 score for this epoch is: 0.9184868206610464
Epoch 160/250
The f1 score for this epoch is: 0.9084972610715508
Epoch 161/250
The f1 score for this epoch is: 0.9182268712729829
Epoch 162/250
The f1 score for this epoch is: 0.9151545894280636
Epoch 163/250
The f1 score for this epoch is: 0.9212162958180312
Epoch 164/250
The f1 score for this epoch is: 0.9232082004666831
Epoch 165/250
The f1 score for this epoch is: 0.9244232979782988
Epoch 166/250
The f1 score for this epoch is: 0.9182333246162986
Epoch 167/250
The f1 score for this epo

The f1 score for this epoch is: 0.9277987000570963
Epoch 203/250
The f1 score for this epoch is: 0.9281632821520817
Epoch 204/250
The f1 score for this epoch is: 0.9243551204089598
Epoch 205/250
The f1 score for this epoch is: 0.922862513190919
Epoch 206/250
The f1 score for this epoch is: 0.9207328640127926
Epoch 207/250
The f1 score for this epoch is: 0.9249735065907746
Epoch 208/250
The f1 score for this epoch is: 0.9290755860360419
Epoch 209/250
The f1 score for this epoch is: 0.9206150769662776
Epoch 210/250
The f1 score for this epoch is: 0.917063255540347
Epoch 211/250
The f1 score for this epoch is: 0.9220712215426298
Epoch 212/250
The f1 score for this epoch is: 0.9237593520140875
Epoch 213/250
The f1 score for this epoch is: 0.9221198078309554
Epoch 214/250
The f1 score for this epoch is: 0.9286866518969891
Epoch 215/250
The f1 score for this epoch is: 0.928670854220797
Epoch 216/250
The f1 score for this epoch is: 0.9325323720236355
Epoch 217/250
The f1 score for this epoch 

Now that we have the validation f1 scores at the end of each epoch, we can see which model's hyperparameters are most optimal 
for our use case by picking out the one with the highest validation f1.

We then iterate through the folder of all saved model hyperparameters and only save the one that is the best

In [34]:
f1_hist = list(compute_f1_of_epoch.f1_of_epochs)

In [35]:
#The names of the first weight saved to the model training weights folder are not zero index based, it starts from 1
#example: weights.01.hdf5, weights.02.hdf5
best_epoch = f1_hist.index(max(f1_hist))+1

In [36]:
best_epoch_score = max(f1_hist)

In [37]:
best_epoch

233

In [38]:
best_epoch_score

0.9406319166803954

In [39]:
import os
for i in range(1,251):
    if i==best_epoch:
        continue
    else:
        epoch_number = str(i)
        if len(epoch_number) < 2:
            #pads 0s in front to match model weight numbers less than 10
            epoch_number = epoch_number.zfill(2) 

        os.remove(os.path.join(curr_dir,'model_training_weights','weights.{}.hdf5').format(epoch_number))

In [40]:
with open(f1_hist_filepath, "wb") as t:
    pickle.dump(f1_hist, t)

In [41]:
#rename the best weight to weights.best.hdf5
os.rename(os.path.join(curr_dir,'model_training_weights','weights.{}.hdf5').format(best_epoch),os.path.join(curr_dir,'model_training_weights','weights.{}.hdf5').format("best"))

## Bi-Directional LSTM (Long Short Term Memory) with CRF (Conditional Random Fields)

**Why?**
Suitable for learning sequence-related data, sequence is very important in sentences. For example:
<br>
Give example
<br>

We can guess an entity from the context given

<br>
<br>

### Recurrent Neural Networks
Feeds the information of previous time steps into the current time step to help with the prediction of the current time step

**Problem:** Cannot learn long term dependencies i.e. words very early in the sentence cannot be used to help with the prediction of a word that occurs much later in the sentence. This is because of the vanshing gradient problem which arises because of matrice multiplications in RNN back-propagation through time.

### Vanilla LSTM Concept

![](https://i.stack.imgur.com/swN2l.png)

<br>
<br>

* Hidden layer is made of hidden units 
  * Learning is done here
  * Each hidden unit learns different things about the sequence
  * Each hidden unit can be thought of as a chain of LSTM cells due to its recurrent nature

<br>
<br>

What an LSTM unit looks like:

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)
<br>
<br>

Parts of an LSTM cell:
* Cell State (Black line at the top)
  * LSTM's memory running throughout its recursive operations
* Hidden State
  * Filtered version of the cell state, contain information related to the prediction of the currenr time step that is useful for the next time step
* Decision Gates (Yellow boxes)

<br>
<br>

### LSTM Decision Gates

There are 3 decision gates in a LSTM cell: *Forget, Input, Output*

#### Forget (Remember) Gate
* **Decides** what past information is irrelevant for learning the sequence for future timesteps
* **Decides** what information is still important for learning the sequence for future timesteps

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png)

1. Takes in prev hidden state and current input as its inputs
2. LSTM weights are multiplied with their respective inputs
3. Bias is added to each of the results
5. Results are summed and put through a sigmoid function


The sigmoid function outputs the values between 0 and 1. Values closer to 0 are deemed more negligible vice-versa.

<br>
<br>

#### Input (Update) Gate
* **Decides** what information to add to the cell state that can be useful for the LSTM’s learnings for the future sequence 

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png)

First equation decides what to update and how much to update the candidate values. The second equation generates all potential candidate values to be added. The output of these two equations are point-wise multiplied to output the scaled values to be updated.


<br>
<br>

#### Cell State Operation (Forget Gate & Input Gate)
* **Execution** of decisions made by *forget* and *input* gate

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png)

<br>
Execution of decision made by forget gate:

The forget gate’s output is pointwise multiplied by the cell state, this is where we execute the decision made by the forget gate and ‘forget’ insignificant information from the cell state. Cell state values multiplied by values closer to 0 are more likely to be ‘forgotten’ vice-versa

<br>
Execution of decision made by input gate:

The cell state is then pointwise added with the input gate’s output to execute the update decision made by the input gate

<br>
<br>

#### Output Gate
* **Decides** what information we are going to output from the cell state that can be useful for the next time step

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png)

<br>
<br>

#### Cell State Operation (Output Gate):
* **Execution** of decision made by *output* gate

The first equation decides what values to output from the cell state. The second equation pointwise multiplies the first equation (decision) by a tahn function applied to the cell state which results in the hidden state.


<br>
<br>
<br>

### Bidirectional

Same operations but runs the sequence backwards. 

Able to preserve information from past and future = context

_"You shall know a word by the company it keeps" - John Rupert Firth_

<br>
<br>
<br>

### CRF

![](https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/CodeCogsEqn7.png)

Calculating Probability of the most likely Y sequence of labels given the X sequence
* f represents the feature functions
* inner sum sums the feature functions of the words in the sentence
* outer sum sums the feature functions of the sentences
* exp is an expotential function and 1/Z(x) is a normalization which helps make it a probability

![](https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/CodeCogsEqn1-5.png)
![](https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/ss1-2.png)

The feature function takes in the previous label and the current label as well as the current input into consideration when making a prediction

<br>

### Back propagation through time (Training)

Process where the LSTM neural network updates all its parameters (weights and biases) to minimize the loss function. The loss function is the magnitude of errors that the LSTM neural network makes during prediction. Backpropagation through time can be thought of as the process of constantly finding the right direction in which to adjust the weights such that the local minimum of the loss function can be reached, which will make the LSTM neural network predict better.
<br>
The loss function we use is the CRF loss function which is a negative log-likelihood function.
* Likelihood: Measures how well the parameters have adjusted to classify our entity correctly using the probabilities produced by the model
* Log function: Regulate values as some of the likelihood values can be very small
* Negative: Minimise the function negative so that we can maximise our classification performance by minimizing the function as the optimizers are made to minimize loss functions

The closer our loss function to 0, the better


### Model architecture

```python
        input = Input(shape=(80,350)
        mask = Masking(mask_value=0.)(input)
        hidden_layer_1 = Bidirectional(LSTM(200, return_sequences=True, activation='tahn', recurrent_activation='hard_sigmoid))(mask)  
        model_last_layer = TimeDistributed(Dense(50, activation='relu'))(hidden_layer_1)  
        crf = CRF(11)  
        out = crf(model_last_layer)  
        model_final = Model(input, out)
        model_final.compile('RMSprop', loss=crf_loss, metrics=[crf_accuracy])
```

**Layer (Type)** | **Output Shape** | **Number of Params**
------------ | ------------- | ------------
input_1 (InputLayer) | (None, 80, 350) | 0
masking_1 (Masking)  | (None, 80, 350) | 0
bidirectional_1 (Bidirectional) | (None, 80, 400) | 601200
time_distributed_1 (TimeDistributed) | (None, 80, 50) | 15050
crf_1 (CRF) | (None, 80, 11) | 704

<br>
<br>


![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2db2cba6a0d878e13932fa27ce6f3fb71ad99cf1)


LSTMs are a recursive series of **matrix computations** through the various gates:
* matrix multiplications
* matrix addition of biases
* matrix addition
* applying different functions
* pointwise addition & multiplication

The shape of the matrices are influenced by the number of hidden units and input vector size. The shape/length of the hidden state is the number of hidden units. 

More LSTM units = Larger Hidden State = Learn more structural + semantic information from the sequence

Each unit learns to remember different things about the sequence it sees

<br>
<br>

### Input Layer
Instantite a keras tensor (multi-dimensional array)

### Masking Layer

Indicate the padding notation and notify tensorflow to ignore it

### Bidirectional LSTM

Hidden states computed in the forward sequence and backward sequence are concatenated which can be seen by the output shape that has a length of 400 (hidden state size is 400) when there were only 200 units.

Learn from the sequence 

### Time Distributed Dense

Recommended

The time distributed layer will also be performing the following operation on every hidden state (300 units) for all 80 timesteps:

`output = activation(dot(input, weights) + bias)`

and will output a vector of length 50 for 80 timesteps (because it has 50 units).


### CRF Layer

Maps the input sequence to the output sequence using conditional random fields


### Tensorflow backend

Graph = A computational graph where a series of tensorflow operations are performed on the data 

Session = Executes the graph


<br>
<br>