# Python3 Part of the Data Analysis for the Journal Article Titled:
# <i>Natural language processing for cognitive behavioral therapy: extracting schemas from thought records</i>

>This script accompanies the journal article with the title stated above. The main aim of the research is to determine whether an algorithm can label utterances expressed in thought records with regard to the schema(s) they reflect. Thought record forms are a tool in cognitive therapy with which patients can gain insight into their maladaptive thought processes. According to the theory underlying cognitive therapy, it is these maladaptive thought processes that result in the respective mental illness.  
This script complements an R/KnitR script that consists of the following sections:
    <ol> 
    <li>Preparing data for testing Hypothesis 1</li>
    <li>Testing Hypothesis 2</li>
    <li>Testing Hypothesis 3</li>
    <li>Testing Hypothesis 4</li>
    </ol>
Details concerning the hypotheses, the project background, the raw data, and the data collection process can all be found in the R/KnitR script.<br>
<br>
The modules below need to be installed before running the code:
    <ol>
    <li>gensim==3.8.3</li>
    <li>talos==0.6.3</li>
    <li>tensorflow==2.3.2</li>
    <li>statsmodels==0.10.2</li>
    <li>scipy==1.4.1</li>
    <li>scikit-learn==0.23.2</li>
    <li>numpy==1.16.3</li>
    <li>pandas==0.25.3</li>
    </ol>
<br>
The following inputs are required and can be found in the DataRepository/AnalysisArticle/Data directory:
    <ol>
    <li>glove.6B directory</li>
    <li>DatasetsForH1 directory</li>
    </ol>
<br>
Additionally, the following output is generated:
    <ol>
    <li>data_for_H2.csv file</li>
    <li>per_schema_models directory</li>
    </ol>
with the latter containing all trained per-schema RNN models in .h5 file format.  
<br>
The purpose of this script is to test Hypothesis 1, i.e. to see whether an algorithm can attach the correct schema label to thought record utterances more often than would be expected by chance. A thought record utterance could reflect none, any one, or multiple of 9 possible schemas. Additionally, labels are not binary (does or does not reflect schema) but ordinal (0 - has nothing to do with schema, 1 - has a little bit to do with the schema, 2 - has to do with the schema, 3 - fits perfectly with the schema). <br>
Utterances are in natural language format. It is therefore necessary to preprocess these pieces of text, which we do in R. We also split the entire raw dataset into training, validation, and test sets. The test set consists of 15% of the raw data; the validation set consists of 15% of the remaining data. <br>
Three algorithms are explored: k-nearest neighbors, support vector machines, and recurrent neural networks. We arrived at the former two by following the decision tree presented by scikit-learn (https://scikit-learn.org/stable/tutorial/machine_learning_map/): the data are ordinal and labeled, and we have fewer than 100k samples. Recurrent neural networks are a logical choice for natural language data, since they can model the temporal aspect that is inherent to sentences as sequences of words.<br>
Wall-clock runtimes are given in the first comment of each cell that contains time-intensive code. Additionally, the cell magic "%%time" in these cells prints the actual runtime, which can be compared with the reported runtime to obtain an appropriate estimate when running on a different machine.<br>
We import the following packages and functions:

In [1]:
#set seed
seed = 57839
import os
os.environ['PYTHONHASHSEED']=str(seed)
import sys

import random
random.seed(seed)

import numpy as np
np.random.seed(seed)

import csv
import pandas as pd
import scipy
import scipy.stats as stats
import functools
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


import sklearn
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor
from sklearn import metrics, preprocessing, svm
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
import statsmodels.formula.api as smf

import tensorflow as tf
tf.random.set_seed(seed)

from tensorflow.python.keras.metrics import Metric
from tensorflow import keras
import talos
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers import Dense, Flatten, Embedding, SimpleRNN, LSTM, GRU, Bidirectional,Dropout

from keras import backend as K
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)


In [2]:
print(sys.version)

3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]


In [3]:
#list packages and their version numbers as used in this script (code is taken from 
#https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook)
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]

        yield name
imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))

tensorflow==2.8.0
talos==1.0.2
statsmodels==0.13.1
scipy==1.4.1
scikit-learn==0.20.2
pandas==0.25.3
numpy==1.21.5
keras==2.8.0
gensim==3.8.3


> The working directory is the folder that contains the notebook.

In [4]:
#os.chdir("/Users/fran/surfdrive/Documents/Projects/ThoughtRecordChatbot/Experiments/Exploratory_TRs/Documents/DataRepository/AnalysisArticle")

## Importing the datasets from csv
> The preprocessed utterances are split into three sets in the R script. They are saved in three separate csv files. Additionally, the manually assigned labels that correspond with the utterances are saved in three separate csv files.

In [4]:
# read in datasets (already pre-processed)
def readcsv(fname,istext):
    if istext:
        # utterances: skip the header row and flatten all cells into one list of strings
        with open(fname,'rt') as f:
            reader=csv.reader(f)
            next(reader)
            data = []
            for row in reader:
                for item in row:
                    data.append(item)
    else:
        # labels: semicolon-separated integers, header row skipped
        with open(fname,'r') as f:
            reader=csv.reader(f,delimiter=';')
            next(reader)
            data = list(reader)
            data = np.asarray(data, dtype='int')
    return data

# read in training, validation, and test set utterances
train_text = readcsv('Data/DatasetsForH1_OutOfBox/H1_train_texts.csv',True)
val_text = readcsv('Data/DatasetsForH1_OutOfBox/H1_validate_texts.csv', True)
test_text = readcsv('Data/DatasetsForH1_OutOfBox/H1_test_texts.csv',True)

# read in training, validation, and test set labels
train_labels = readcsv('Data/DatasetsForH1_OutOfBox/H1_train_labels.csv',False)[:,0:9]
val_labels = readcsv('Data/DatasetsForH1_OutOfBox/H1_validate_labels.csv', False)[:,0:9]
test_labels = readcsv('Data/DatasetsForH1_OutOfBox/H1_test_labels.csv',False)[:,0:9]

In [5]:
print(train_text[0:5])

['lot people may think well lot people might not like me', 'might not working fast enough their standards', 'may not able graduate', 'would get bad performance review', 'friends will get annoyed by me']


In [6]:
print(train_labels[0:5,:])

[[2 0 0 0 0 0 0 0 3]
 [0 3 0 0 0 0 0 0 0]
 [0 3 0 0 0 0 0 0 0]
 [0 3 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 3]]


> As can be seen, some utterances have multiple schemas assigned. Overall, however, the label matrices are sparse. The first column of the labels corresponds to the "Attachment" schema, the second to the "Competence" schema, and the last to the "Other's views on self" schema.
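
> To make the sparsity concrete, the following minimal sketch computes the share of non-zero entries and the number of multi-schema utterances. It uses only the five label rows printed above, not the full dataset; `toy_labels` is a name introduced here for illustration.

```python
import numpy as np

# the five label rows printed above (9 schema columns, ordinal scores 0-3)
toy_labels = np.array([
    [2, 0, 0, 0, 0, 0, 0, 0, 3],
    [0, 3, 0, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0, 3],
])

# share of cells with a non-zero score
nonzero_share = np.count_nonzero(toy_labels) / toy_labels.size
# number of schemas with a score > 0 per utterance
schemas_per_utt = (toy_labels > 0).sum(axis=1)
# utterances that reflect more than one schema
n_multi = int((schemas_per_utt > 1).sum())

print(round(nonzero_share, 3), schemas_per_utt.tolist(), n_multi)
```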

In [7]:
#for later use
schemas = ["Attach","Comp","Global","Health","Control","MetaCog","Others","Hopeless","OthViews"]

## Embedding the utterances using GloVe
> We have opted for representing the words in utterances as word vectors. We adopt the GloVe word vector space (glove.6B), which was trained on Wikipedia 2014 and Gigaword 5. First, we fit a tokenizer on the training set and restrict it to the 2000 most frequent words.  

In [8]:
# prepare tokenizer
max_words = 2000
t = Tokenizer(num_words = max_words)
t.fit_on_texts(train_text)
vocab_size = len(t.word_index) + 1
print(vocab_size)

2624


> The tokenizer indexes the words based on frequency. For the recurrent neural net, we need padded utterance sequences. texts_to_sequences simply represents each utterance as a vector of token indices. Padding ensures that all vectors are of the same length by appending 0s to the end of shorter vectors. We pad to a length of 25 words.
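
> The effect of post-padding can be illustrated without Keras. The helper `pad_post` below is a hypothetical stand-in written for this example, not the Keras implementation. It assumes that over-long sequences are truncated from the front, which is the default of pad_sequences (truncating='pre').

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end; keep the last `maxlen` tokens if too long."""
    out = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        if not seq:
            continue                # an empty utterance stays all padding
        seq = seq[-maxlen:]         # truncate from the front if too long
        out[row, :len(seq)] = seq   # tokens first, padding afterwards
    return out

# invented token sequences of different lengths
toy_sequences = [[48, 1, 19, 448], [2, 11, 53, 449, 659], [5, 6, 7, 8, 9, 10, 11]]
padded = pad_post(toy_sequences, maxlen=6)
print(padded)
```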

In [9]:
# integer encode all utterances
encoded_train = t.texts_to_sequences(train_text)
encoded_validate = t.texts_to_sequences(val_text)
encoded_test = t.texts_to_sequences(test_text)

# pad documents to a max length of 25 words
max_length = 25

padded_train = pad_sequences(encoded_train, maxlen=max_length, padding='post')
padded_validate = pad_sequences(encoded_validate, maxlen=max_length, padding='post')
padded_test = pad_sequences(encoded_test, maxlen=max_length, padding='post')

print(encoded_train[0:5])

[[147, 28, 48, 37, 101, 147, 28, 32, 1, 8, 5], [32, 1, 155, 658, 14, 125, 568], [48, 1, 19, 448], [2, 11, 53, 449, 659], [50, 6, 11, 373, 98, 5]]


In [10]:
print(padded_train[0:5])

[[147  28  48  37 101 147  28  32   1   8   5   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 32   1 155 658  14 125 568   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 48   1  19 448   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [  2  11  53 449 659   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 50   6  11 373  98   5   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]]


> We can now load the GloVe embeddings into memory.

In [11]:
%%time
# wall time to run: ~ 10sec
# load all embeddings into memory
embeddings_index = dict()
# Specify UTF-8 encoding
f = open('Data/glove.6B/glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.
Wall time: 11.4 s


> We can then create an embedding matrix by taking each word of the training set and finding the corresponding word vector in the GloVe data. We only work with 100-dimensional representations.

In [12]:
vec_dims = 100
embedding_matrix = np.zeros((vocab_size, vec_dims))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
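
> The two steps above (parsing the GloVe text format and filling the embedding matrix) can be sketched on toy data. The three-dimensional vectors and the `toy_word_index` mapping are invented for illustration. Words without a GloVe entry keep an all-zero row, as does index 0, which is reserved for padding.

```python
import numpy as np

# toy stand-in for lines of the GloVe file: a word followed by its coefficients
toy_glove_lines = [
    "people 0.1 0.2 0.3",
    "friends 0.4 0.5 0.6",
]
toy_index = {}
for line in toy_glove_lines:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

# toy stand-in for t.word_index (word -> integer index, starting at 1)
toy_word_index = {"people": 1, "annoyed": 2, "friends": 3}

toy_matrix = np.zeros((len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vector = toy_index.get(word)
    if vector is not None:          # words missing from the toy index stay zero
        toy_matrix[i] = vector

print(toy_matrix)
```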

In [13]:
# create tfidf weighted encoding matrix of utterances
train_sequences = t.texts_to_matrix(train_text,mode='tfidf')
val_sequences =  t.texts_to_matrix(val_text,mode='tfidf')
test_sequences = t.texts_to_matrix(test_text,mode='tfidf')
print(train_sequences[0:5])
print(train_sequences.shape)

[[0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         0.         1.69021763 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
(4151, 2000)


In [14]:
# we want to normalize the word vectors
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0: 
        return v
    return v / norm

In [15]:
# create utterance embeddings as tfidf weighted average of normalized word vectors
def seq2vec(datarow,embedmat):
    #initialize an empty utterance vector of the same length as the GloVe word vectors
    seqvec = np.zeros((100,))
    #counter for number of words in a specific utterance
    wordcount = 1
    #we iterate over the 2000 possible words in a given utterance
    wordind = 1
    while (wordind < len(datarow)):
        #the tf-idf weight is saved in the cells of datarow
        tfidfweight = datarow[wordind]
        if tfidfweight is not None:
            wordembed = tfidfweight * embedmat[wordind,]
            seqvec = seqvec + normalize(wordembed)
            wordcount = wordcount + 1
        wordind = wordind + 1
    return seqvec/wordcount

In [16]:
# go through the matrix and embed each utterance
def embed_utts(sequences,embedmat):
    vecseq = [seq2vec(seq,embedmat) for seq in sequences]
    return vecseq
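
> As a cross-check of what seq2vec computes, the following sketch compares a loop that mirrors its logic with a vectorized version on random toy data. It rests on two observations about the code as written, assuming non-negative tf-idf weights: a numeric weight is never None, so the counter ends at the number of columns, and normalizing after weighting yields the unit word vector for every word with a non-zero weight. The names `seq2vec_loop` and `seq2vec_vectorized` and the toy dimensions are introduced here for illustration.

```python
import numpy as np

def normalize(v):
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm

def seq2vec_loop(datarow, embedmat):
    # mirrors the loop above: index 0 is skipped, the counter starts at 1
    seqvec = np.zeros(embedmat.shape[1])
    wordcount = 1
    for wordind in range(1, len(datarow)):
        weight = datarow[wordind]
        if weight is not None:          # always true for a numeric array
            seqvec = seqvec + normalize(weight * embedmat[wordind])
            wordcount += 1
    return seqvec / wordcount

def seq2vec_vectorized(datarow, embedmat):
    n_cols = len(datarow)
    rows = embedmat[1:n_cols]
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    # unit vectors; rows without an embedding stay zero
    unit = np.divide(rows, norms, out=np.zeros_like(rows), where=norms > 0)
    # 1.0 for every word that occurs in the utterance, 0.0 otherwise
    present = (np.asarray(datarow[1:]) > 0).astype(float)
    return present @ unit / n_cols

rng = np.random.default_rng(0)
toy_embed = rng.normal(size=(12, 5))      # 12 vocabulary rows, 5 dimensions
toy_embed[4] = 0.0                        # a word without an embedding
toy_row = np.zeros(10)                    # tf-idf row restricted to 10 words
toy_row[[1, 4, 7]] = [1.3, 0.9, 2.1]      # words present in the utterance

print(np.allclose(seq2vec_loop(toy_row, toy_embed),
                  seq2vec_vectorized(toy_row, toy_embed)))
```

> Under these assumptions, the result is a sum of unit vectors of the words present in the utterance, divided by the vocabulary cut-off instead of the utterance length. This is consistent with the small magnitudes of the printed utterance embeddings.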

> We now have everything needed to create the utterance embeddings.

In [17]:
%%time
# wall time to run: ~ 1min 14s
# embed all three datasets
train_embedutts = embed_utts(train_sequences,embedding_matrix)
val_embedutts = embed_utts(val_sequences,embedding_matrix)
test_embedutts = embed_utts(test_sequences,embedding_matrix)
print(train_embedutts[0])

[-3.44478099e-05  3.48539840e-04  3.35962753e-04 -3.70855457e-04
 -2.63147438e-04  1.07227074e-04 -1.66559109e-04  1.94234500e-05
  7.42420570e-05 -1.80615841e-04  1.80387578e-05  4.92242034e-05
  2.75006568e-04  2.34416192e-05  8.31148165e-05 -2.93833280e-04
 -7.15121389e-05  2.98592314e-04 -4.55134987e-04  4.72657153e-04
  2.57585086e-04  1.69741478e-04  7.75960265e-05 -2.15817394e-04
 -4.34789085e-05  7.24571212e-05 -1.54585404e-04 -4.98166781e-04
  1.93941088e-04 -1.74921206e-04  2.37557331e-05  4.85150809e-04
  3.08554881e-05 -4.62293641e-05  1.35110613e-04  2.80284189e-04
 -3.22980711e-05  3.12968134e-04  8.27704500e-05 -2.40951546e-04
 -3.13527886e-04 -1.35440392e-04 -2.05195768e-05 -4.81099111e-04
 -2.75375333e-04 -1.27601856e-04  2.50256011e-04 -2.50631136e-04
 -1.51297680e-04 -8.33555219e-04  3.54382525e-05 -8.74190709e-05
  1.05239327e-05  8.00132559e-04 -1.52039351e-04 -1.90058573e-03
  9.49153013e-05 -1.17238522e-05  1.18845110e-03  3.93093667e-04
 -1.88908628e-04  9.94003

## Model evaluation
> We use the Spearman correlation to evaluate the models and choose the best one, because it can be used for both the regression and the classification outcomes. This is not the case for a weighted Cohen's Kappa, for example, which only works for class labels.

In [18]:
#### Goodness of Fit
def gof_spear(X,Y):
    #spearman correlation of columns (schemas)
    gof_spear = np.zeros(X.shape[1])    
    for schema in range(9):
        rho,p = scipy.stats.spearmanr(X[:,schema],Y[:,schema])
        gof_spear[schema]=rho
    return gof_spear
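
> A small illustration, on invented numbers, of why the Spearman correlation suits both output types: it only uses ranks, so class labels and continuous regression outputs are scored on the same scale. It also shows a case worth keeping in mind when reading the result tables: if a model predicts the same value for every utterance, the correlation is undefined and scipy returns NaN.

```python
import warnings
import numpy as np
from scipy import stats

# invented ordinal ground-truth labels for one schema
toy_truth = np.array([0, 0, 3, 2, 0, 1, 0, 3])

# classification-style output (ordinal class labels)
toy_class_pred = np.array([0, 0, 3, 3, 0, 0, 0, 2])
# regression-style output (continuous scores)
toy_reg_pred = np.array([0.1, 0.0, 2.4, 1.9, 0.3, 0.8, 0.2, 2.7])
# degenerate output: the same prediction for every utterance
toy_const_pred = np.zeros(8)

rho_class, _ = stats.spearmanr(toy_class_pred, toy_truth)
rho_reg, _ = stats.spearmanr(toy_reg_pred, toy_truth)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # scipy warns about constant input
    rho_const, _ = stats.spearmanr(toy_const_pred, toy_truth)

print(rho_class, rho_reg, rho_const)
```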

## Bootstrapping for confidence intervals
> Since all models are expensive to run, we only run a small bootstrap to obtain some insight into how confident we can be about the predictions.

In [19]:
# we adopt the algorithm from the following website:
# https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/
# def bootstrap
def bootstrap(iterations,sample_size, sample_embeds, sample_labels,classification,model):
    stats = np.zeros((iterations,9))
    for l in range(iterations):
        # prepare bootstrap sample (random.sample draws sample_size items without replacement)
        bootstrap_sample_indx = random.sample(list(enumerate(sample_embeds)), sample_size)
        bootstrap_sample_utts = [sample_embeds[i] for (i,j) in bootstrap_sample_indx]
        bootstrap_sample_labels = [sample_labels[i] for (i,j) in bootstrap_sample_indx]
        # evaluate model
        if model=="knn":
            model_gof=my_kNN(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        elif model=="svm":
            model_gof=my_svm(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        elif model=="rnn":
            model_gof=my_rnn_fixed(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        stats[l,:] = model_gof
    # confidence intervals
    cis = np.zeros((2,9))
    alpha = 0.95
    p = ((1.0-alpha)/2.0) * 100
    cis[0,:] = [max(0.0, np.percentile(stats[:,i], p)) for i in range(9)]
    p = (alpha+((1.0-alpha)/2.0)) * 100
    cis[1,:] = [min(1.0, np.percentile(stats[:,i], p)) for i in range(9)]
    return cis

# configure bootstrap
n_iterations = 100
n_size = int(len(val_text) * 0.75)
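
> The percentile step of the function above can be tried in isolation. The sketch below uses made-up goodness-of-fit values in `toy_stats` (one column per schema) and the same clipping with max and min. It also illustrates a side effect of that clipping: when a column contains NaN, np.percentile returns NaN, and max(0.0, nan) and min(1.0, nan) evaluate to 0.0 and 1.0. An interval of exactly [0, 1] may therefore indicate an undefined correlation instead of a wide interval.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy stand-in for the stats array: 100 iterations x 2 schemas
toy_stats = np.column_stack([
    rng.normal(0.55, 0.02, size=100),   # schema with well-defined correlations
    np.full(100, np.nan),               # schema whose correlation is undefined
])

def percentile_ci(stats_array, alpha=0.95):
    lower_p = ((1.0 - alpha) / 2.0) * 100            # about 2.5
    upper_p = (alpha + (1.0 - alpha) / 2.0) * 100    # about 97.5
    n_cols = stats_array.shape[1]
    # same clipping to [0, 1] as in the bootstrap function
    low = [max(0.0, np.percentile(stats_array[:, i], lower_p)) for i in range(n_cols)]
    high = [min(1.0, np.percentile(stats_array[:, i], upper_p)) for i in range(n_cols)]
    return np.array([low, high])

toy_cis = percentile_ci(toy_stats)
print(toy_cis)
```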

## k-nearest Neighbors Classification and Regression
> Since we have ordinal labels for our data, we train both classification and regression algorithms and see which one performs better. We also have multi-label data, and therefore write a custom kNN algorithm. We use the cosine distance to find the nearest neighbors.

In [20]:
# cosine distance
def cosine_dist(X,Y):
    return scipy.spatial.distance.cosine(X,Y)
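
> For reference, the cosine distance is one minus the cosine similarity. A minimal numpy sketch on toy vectors (the function name `cosine_dist_np` is introduced here) shows that it depends only on the direction of the vectors, not on their length.

```python
import numpy as np

def cosine_dist_np(x, y):
    """Cosine distance: 1 - cosine similarity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 0.0])
d_same = cosine_dist_np(a, 3 * a)               # same direction: close to 0
d_orth = cosine_dist_np(a, [-2.0, 1.0, 0.0])    # orthogonal: 1
d_opp = cosine_dist_np(a, -a)                   # opposite direction: close to 2

print(d_same, d_orth, d_opp)
```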

In [21]:
#kNN algorithm
def knn_custom(train_X,test_X,train_y,test_y,k,dist,classification):
    #empty array to collect the results (should have shape of samples to classify)
    votes = np.zeros(test_y.shape)
    #fit the knn
    knn=NearestNeighbors(n_neighbors=k, metric=dist)
    knn.fit(train_X)
    #collect neighbors
    i=0 # index to collect votes of the neighbors
    for sample in test_X:
        neighbors=knn.kneighbors([sample],k,return_distance=False)[0]
        if classification:
            output_y = np.zeros((k,test_y.shape[1]))
            j=0
            for neighbor in neighbors:
                output_y[j,:] = train_y[neighbor,:]
                j=j+1
            votes[i,:] = stats.mode(output_y,nan_policy='omit')[0]
        else:
            output_y = np.zeros(test_y.shape[1])
            for neighbor in neighbors:
                output_y += train_y[neighbor,:]
            votes[i,:] = np.divide(output_y,k)
        i=i+1
    return votes
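
> The two voting rules can be seen side by side on toy data. In the sketch below, `toy_neighbor_labels` holds invented labels of k = 4 neighbors for three schemas. The classification vote takes the most frequent label per schema (in this sketch, ties go to the smallest label), while the regression vote takes the mean.

```python
import numpy as np

# invented labels of k = 4 nearest neighbors (rows) for 3 schemas (columns)
toy_neighbor_labels = np.array([
    [3, 0, 0],
    [3, 0, 2],
    [2, 0, 0],
    [3, 1, 2],
])

# classification vote: most frequent label per schema
# (np.bincount(...).argmax() picks the smallest label in case of a tie)
class_vote = np.array([np.bincount(col, minlength=4).argmax()
                       for col in toy_neighbor_labels.T])

# regression vote: mean label per schema
reg_vote = toy_neighbor_labels.mean(axis=0)

print(class_vote, reg_vote)
```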

> To evaluate choices for k, we summarize the per-schema Spearman correlations obtained for each choice of k in a single performance metric, a weighted mean across schemas. As weights we use the frequencies of schemas (# of utterances with labels > 0 for a given schema/total number of utterances) in the training set.

In [22]:
# weighting model output (spearman correlations) by schema frequencies in training set and returning mean over schemas
def performance(train_y,output):
    train_y = np.array(train_y)
    train_y[train_y>0]=1
    weighting = train_y.sum(axis=0)/train_y.shape[0]
    perf = output * weighting
    return np.nanmean(np.array(perf), axis=0)
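
> A worked toy example of this metric, with invented labels and correlations for three schemas. Following the function above, each correlation is multiplied by its schema frequency and the products are averaged with np.nanmean. The result is not divided by the sum of the weights, so the values are best read as relative scores between models evaluated with the same training labels.

```python
import numpy as np

# invented training labels: 4 utterances x 3 schemas
toy_train_y = np.array([
    [2, 0, 0],
    [0, 3, 0],
    [1, 3, 0],
    [0, 0, 1],
])
# invented per-schema Spearman correlations; NaN marks an undefined correlation
toy_rho = np.array([0.6, 0.4, np.nan])

# schema frequency: share of utterances with a label > 0 -> [0.5, 0.5, 0.25]
toy_weights = (toy_train_y > 0).sum(axis=0) / toy_train_y.shape[0]
# mean of the weighted correlations, ignoring NaN entries -> (0.3 + 0.2) / 2
toy_perf = np.nanmean(toy_rho * toy_weights)

print(toy_weights, toy_perf)
```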

In [23]:
# finding best k by testing some values for k
def find_k(train_X, test_X, train_y, test_y, dist, classification):
    perf = 0
    best_k = 0
    for k in [2,3,4,5,6,7,8,9,10,30,100]:
        knn_k = knn_custom(train_X, test_X, train_y, test_y,k,dist,classification)
        knn_gof_spear = gof_spear(knn_k,test_y)
        print('Results for choice of k is %s.' % k)
        print(pd.DataFrame(data=knn_gof_spear,index=schemas,columns=['gof']))
        if perf < performance(train_y,knn_gof_spear):
            perf = performance(train_y,knn_gof_spear)
            best_k = k
    return best_k

In [33]:
%%time
# wall time to run: ~ 15min
# find best k for classification
knn_class_k = find_k(train_embedutts,val_embedutts,train_labels,val_labels,cosine_dist,1)

Results for choice of k is 2.
               gof
Attach    0.535011
Comp      0.645117
Global    0.549926
Health    0.593794
Control   0.132613
MetaCog  -0.005695
Others   -0.006347
Hopeless  0.477770
OthViews  0.414780
Results for choice of k is 3.
               gof
Attach    0.540542
Comp      0.648290
Global    0.575273
Health    0.614754
Control   0.096685
MetaCog        NaN
Others   -0.008982
Hopeless  0.559148
OthViews  0.438930
Results for choice of k is 4.
               gof
Attach    0.608176
Comp      0.651848
Global    0.556168
Health    0.608395
Control   0.185125
MetaCog        NaN
Others         NaN
Hopeless  0.503857
OthViews  0.492713
Results for choice of k is 5.
               gof
Attach    0.585641
Comp      0.631078
Global    0.577591
Health    0.645793
Control   0.199890
MetaCog        NaN
Others         NaN
Hopeless  0.543466
OthViews  0.463903
Results for choice of k is 6.
               gof
Attach    0.602287
Comp      0.619769
Global    0.547025
Health    0.62

In [34]:
print('Best choice for classification k is: %s' % knn_class_k)

Best choice for classification k is: 4


In [35]:
%%time
# wall time to run: ~ 15min
# find best k for regression
knn_reg_k = find_k(train_embedutts,val_embedutts,train_labels,val_labels,cosine_dist,0)

Results for choice of k is 2.
               gof
Attach    0.576639
Comp      0.655949
Global    0.479923
Health    0.590129
Control   0.219795
MetaCog  -0.031264
Others    0.063275
Hopeless  0.527345
OthViews  0.465825
Results for choice of k is 3.
               gof
Attach    0.590981
Comp      0.646035
Global    0.479578
Health    0.559802
Control   0.215631
MetaCog   0.032108
Others    0.154095
Hopeless  0.498075
OthViews  0.471136
Results for choice of k is 4.
               gof
Attach    0.590042
Comp      0.648248
Global    0.481624
Health    0.547869
Control   0.242575
MetaCog   0.040388
Others    0.162686
Hopeless  0.479595
OthViews  0.468530
Results for choice of k is 5.
               gof
Attach    0.610950
Comp      0.665623
Global    0.480601
Health    0.535627
Control   0.250752
MetaCog   0.085472
Others    0.182273
Hopeless  0.488905
OthViews  0.459507
Results for choice of k is 6.
               gof
Attach    0.614144
Comp      0.662029
Global    0.477562
Health    0.49

In [36]:
print('Best choice for regression k is: %s' % knn_reg_k)

Best choice for regression k is: 5


> Since this is needed for the bootstrapping algorithm, we define a function that takes a test set and its labels and returns the goodness of fit. We print the results on the test set.

In [37]:
def my_kNN(test_X,test_y,classification):
    if classification:
        my_knn=knn_custom(train_embedutts,test_X,train_labels,test_y,4,cosine_dist,1)
    else:
        my_knn=knn_custom(train_embedutts,test_X,train_labels,test_y,5,cosine_dist,0)
    return gof_spear(my_knn,test_y)

In [38]:
%%time
#wall time to run: ~ 4min
output_kNN_class = my_kNN(test_embedutts,test_labels,1)
output_kNN_reg = my_kNN(test_embedutts,test_labels,0)

Wall time: 3min 54s


In [39]:
print('KNN Classification Prediction')
print(pd.DataFrame(data=output_kNN_class,index=schemas,columns=['estimate']))

KNN Classification Prediction
          estimate
Attach    0.550705
Comp      0.690230
Global    0.401123
Health    0.742217
Control   0.107526
MetaCog        NaN
Others    0.279105
Hopeless  0.484137
OthViews  0.454565


In [40]:
print('KNN Regression Prediction')
print(pd.DataFrame(data=output_kNN_reg,index=schemas,columns=['estimate']))

KNN Regression Prediction
          estimate
Attach    0.626743
Comp      0.663091
Global    0.411444
Health    0.534902
Control   0.231541
MetaCog   0.104785
Others    0.243713
Hopeless  0.513825
OthViews  0.458473


In [41]:
%%time
# wall time to run: ~ 3h 30min
# bootstrap confidence intervals for kNN regression and classification
bs_knn_reg = bootstrap(n_iterations,n_size,test_embedutts,test_labels,0,"knn")
bs_knn_class = bootstrap(n_iterations,n_size,test_embedutts,test_labels,1,"knn")

Wall time: 4h 8min 22s


In [42]:
print(f'KNN Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_knn_class),index=schemas,columns=['low','high']))

KNN Classification 95% Confidence Intervals
               low      high
Attach    0.506600  0.595837
Comp      0.651139  0.725093
Global    0.338702  0.464383
Health    0.664488  0.817026
Control   0.018076  0.191187
MetaCog   0.000000  1.000000
Others    0.000000  1.000000
Hopeless  0.425464  0.543155
OthViews  0.388575  0.507912


In [43]:
print(f'KNN Regression 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_knn_reg),index=schemas,columns=['low','high']))

KNN Regression 95% Confidence Intervals
               low      high
Attach    0.598278  0.662205
Comp      0.635326  0.703022
Global    0.364667  0.457793
Health    0.460145  0.599812
Control   0.176072  0.275194
MetaCog   0.029082  0.179316
Others    0.152921  0.316757
Hopeless  0.471992  0.553974
OthViews  0.422648  0.498113


## Support vector machine
> The second algorithm we chose is the support vector machine (SVM). Again, we train both a support vector classification (SVC) and a support vector regression (SVR). We try three standard kernels (radial basis function, linear, and polynomial) and do not do any additional parameter tuning. Just like the kNN, the support vector machine takes as input the utterances encoded as averages of word vectors. Support vector classification and regression do not allow for multilabel output. We therefore train separate models, one for each schema.<br>
For both types of SVM, we first standardize the input embeddings, as the algorithm expects normally distributed input centered around 0 and with a standard deviation of 1.

In [44]:
#SVM/SVR
def svm_scaler(train_X):
    #scale the data
    scaler_texts = StandardScaler()
    scaler_texts = scaler_texts.fit(train_X)
    return scaler_texts

scaler_texts = svm_scaler(train_embedutts)
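
> What the scaler does can be written out in a few lines of numpy. The sketch assumes the StandardScaler defaults (per-feature mean removal and division by the population standard deviation); `toy_train` and `toy_val` are invented. As in the notebook, the statistics are estimated on the training data only and then reused for other data.

```python
import numpy as np

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
toy_val = np.array([[2.0, 30.0]])

# per-feature statistics from the training data only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)          # population standard deviation (ddof=0)

train_scaled = (toy_train - mean) / std
val_scaled = (toy_val - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0), val_scaled)
```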

>Since SVMs, unlike kNNs, can be trained and reused, we write one function that returns all 9 models and a separate one for the predictions.

In [45]:
def svm_custom(train_X,train_y,text_scaler,kern,classification):
    models=[]
    train_X = text_scaler.transform(train_X)
    #fit a new support vector model (SVC or SVR) for each schema
    for schema in range(9):
        if classification:
            model = svm.SVC(kernel=kern)
        else:
            model = svm.SVR(kernel=kern)
        model.fit(train_X, train_y[:,schema])
        models.append(model)
    return models

In [46]:
def svm_predict(svm_models,test_X,train_y,test_y,text_scaler):
    #empty array to collect the results (should have shape of samples to classify)
    votes = np.zeros(test_y.shape)
    for schema in range(9):
        svm_model=svm_models[schema]
        prediction = svm_model.predict(text_scaler.transform(test_X))
        votes[:,schema] = prediction
    out = votes
    gof = gof_spear(out,test_y)
    perf = performance(train_y,gof)
    return out,perf

In [47]:
%%time
# wall time to run: ~ 2min 20sec
# svr
svr_rbf_models =  svm_custom(train_embedutts,train_labels,scaler_texts,'rbf',0)
svr_rbf_out, svr_rbf_perf = svm_predict(svr_rbf_models,val_embedutts,train_labels,val_labels,scaler_texts)
svr_lin_models = svm_custom(train_embedutts,train_labels,scaler_texts,'linear',0)
svr_lin_out, svr_lin_perf = svm_predict(svr_lin_models,val_embedutts,train_labels,val_labels,scaler_texts)
svr_poly_models = svm_custom(train_embedutts,train_labels,scaler_texts,'poly',0)
svr_poly_out, svr_poly_perf = svm_predict(svr_poly_models,val_embedutts,train_labels,val_labels,scaler_texts)

Wall time: 2min 15s


In [48]:
print(pd.DataFrame(data=[svr_rbf_perf,svr_lin_perf,svr_poly_perf],index=['rbf','lin','poly'],columns=['svr']))

           svr
rbf   0.076675
lin   0.064361
poly  0.066954


In [49]:
%%time
# wall time to run: ~ 45sec
# svm
svm_rbf_models =  svm_custom(train_embedutts,train_labels,scaler_texts,'rbf',1)
svm_rbf_out, svm_rbf_perf = svm_predict(svm_rbf_models,val_embedutts,train_labels,val_labels,scaler_texts)
svm_lin_models = svm_custom(train_embedutts,train_labels,scaler_texts,'linear',1)
svm_lin_out, svm_lin_perf = svm_predict(svm_lin_models,val_embedutts,train_labels,val_labels,scaler_texts)
svm_poly_models = svm_custom(train_embedutts,train_labels,scaler_texts,'poly',1)
svm_poly_out, svm_poly_perf = svm_predict(svm_poly_models,val_embedutts,train_labels,val_labels,scaler_texts)

Wall time: 52.9 s


In [50]:
print(pd.DataFrame(data=[svm_rbf_perf,svm_lin_perf,svm_poly_perf],index=['rbf','lin','poly'],columns=['svm']))

           svm
rbf   0.107355
lin   0.072173
poly  0.045952


> In both algorithms, the radial basis function (rbf) kernel outperformed linear and polynomial kernels. We therefore opt for the rbf kernel when predicting the labels of the test dataset.

In [51]:
%%time
# wall time to run: 4sec
def my_svm(test_X,test_y,classification):
    if classification:
        my_svm_out, my_svm_perf=svm_predict(svm_rbf_models,test_X,train_labels,test_y,scaler_texts)
    else:
        my_svm_out, my_svm_perf=svm_predict(svr_rbf_models,test_X,train_labels,test_y,scaler_texts)
    return gof_spear(my_svm_out,test_y)

output_SVC = my_svm(test_embedutts,test_labels,1)
output_SVR = my_svm(test_embedutts,test_labels,0)

Wall time: 3.73 s


In [52]:
print('SVM Classification Prediction')
print(pd.DataFrame(data=output_SVC,index=schemas,columns=['estimate']))

SVM Classification Prediction
          estimate
Attach    0.647714
Comp      0.684661
Global    0.357601
Health    0.729181
Control        NaN
MetaCog        NaN
Others         NaN
Hopeless  0.489903
OthViews  0.476297


In [53]:
print('SVM Regression Prediction')
print(pd.DataFrame(data=output_SVR,index=schemas,columns=['estimate']))

SVM Regression Prediction
          estimate
Attach    0.675340
Comp      0.640866
Global    0.489372
Health    0.349064
Control   0.310007
MetaCog   0.114894
Others    0.185827
Hopeless  0.535979
OthViews  0.516635


In [54]:
%%time
# wall time to run: ~ 3min 15sec
# bootstrap confidence intervals for SVR and SVC
bs_svc = bootstrap(n_iterations,n_size,test_embedutts,test_labels,1,"svm")
bs_svr = bootstrap(n_iterations,n_size,test_embedutts,test_labels,0,"svm")

Wall time: 3min 40s


In [55]:
print(f'SVM Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_svc),index=schemas,columns=['low','high']))

SVM Classification 95% Confidence Intervals
               low      high
Attach    0.611122  0.684222
Comp      0.648678  0.719014
Global    0.307078  0.412812
Health    0.630846  0.807764
Control   0.000000  1.000000
MetaCog   0.000000  1.000000
Others    0.000000  1.000000
Hopeless  0.428166  0.550694
OthViews  0.406193  0.533159


In [56]:
print(f'SVM Regression 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_svr),index=schemas,columns=['low','high']))

SVM Regression 95% Confidence Intervals
               low      high
Attach    0.642296  0.699495
Comp      0.609066  0.672340
Global    0.446127  0.531186
Health    0.294252  0.385987
Control   0.258147  0.351490
MetaCog   0.065958  0.144633
Others    0.134902  0.233350
Hopeless  0.503734  0.570941
OthViews  0.480737  0.549735


## Recurrent neural networks

> We train two types of recurrent neural networks: a multilabel RNN that predicts all 9 schemas simultaneously and a set of 9 single-label RNNs that predict the labels for each schema separately. Each RNN consists of 4 layers: an embedding layer, a bidirectional LSTM layer, a dropout layer, and an output layer.

### Training Multilabel RNN
> The architecture of all RNNs was inspired by the paper: Kshirsagar, R., Morris, R., & Bowman, S. (2017). Detecting and explaining crisis. arXiv preprint arXiv:1705.09585. However, we use a long short-term memory (LSTM) layer instead of a gated recurrent unit (GRU) layer.

In [28]:
# define multilabel model
def multilabel_model(train_X, train_y, test_X, test_y,params):
    # build the model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    #embedding layer
    model.add(e)
    #LSTM layer
    model.add(Bidirectional(LSTM(params['lstm_units'])))
    #dropout layer
    model.add(Dropout(params['dropout']))
    #output layer
    model.add(Dense(9, activation='sigmoid'))
    # compile the model
    model.compile(optimizer=params['optimizer'], loss=params['losses'], metrics=['mean_absolute_error'])
    # summarize the model (summary() prints directly and returns None)
    model.summary()
    # fit the model
    out = model.fit(train_X, train_y,
                    validation_data=(test_X, test_y),
                    batch_size=params['batch_size'], 
                    epochs=params['epochs'], 
                    verbose=0)
    return out, model
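
The parameter counts that Keras reports for this architecture can be checked by hand. The sketch below is a plain-Python calculation, not part of the original notebook. It assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and a bias) and a vocabulary size of 2624, which is inferred from the 262,400 embedding parameters in the model summaries divided by the embedding dimension of 100.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM direction: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

def multilabel_rnn_params(vocab_size, embed_dim, lstm_units, n_outputs):
    """Per-layer parameter counts for embedding -> BiLSTM -> dropout -> dense."""
    embedding = vocab_size * embed_dim                  # frozen GloVe weights
    bilstm = 2 * lstm_params(embed_dim, lstm_units)     # forward + backward LSTM
    dense = (2 * lstm_units) * n_outputs + n_outputs    # weights + biases
    return {
        "embedding": embedding,
        "bidirectional": bilstm,
        "dropout": 0,                                   # dropout has no parameters
        "dense": dense,
        "total": embedding + bilstm + dense,
        "trainable": bilstm + dense,                    # embedding is not trainable
    }

# vocab_size=2624 is inferred from the summaries, not taken from the code
small = multilabel_rnn_params(vocab_size=2624, embed_dim=100, lstm_units=50, n_outputs=9)
large = multilabel_rnn_params(vocab_size=2624, embed_dim=100, lstm_units=100, n_outputs=9)
print(small)
print(large)
```

With 50 LSTM units this gives 60,400 parameters for the bidirectional layer and 323,709 in total; with 100 units it gives 160,800 and 425,009.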

In [29]:
def grid_search(train_X, test_X, train_y, test_y):
    #define hyperparameter grid
    p={'lstm_units':[50,100],
       'optimizer':['rmsprop','Adam'],
       'losses':['binary_crossentropy','categorical_crossentropy','mean_absolute_error'],
       'dropout':[0.1,0.5],
       'batch_size': [32,64],
       'epochs':[100]} 
    #scan the grid
    tal=talos.Scan(x=train_X,
                   y=train_y,
                   x_val=test_X,
                   y_val=test_y,
                   model=multilabel_model,
                   params=p,
                   experiment_name='multilabel_rnn',
                   print_params=True,
                   clear_session=True)
    return tal
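
The size of this grid follows directly from the number of values per hyperparameter. The short sketch below, which is not part of the original notebook, enumerates the Cartesian product with `itertools.product`. It only illustrates the count; the order in which Talos visits the combinations may differ.

```python
from itertools import product

# same hyperparameter grid as in grid_search above
p = {'lstm_units': [50, 100],
     'optimizer': ['rmsprop', 'Adam'],
     'losses': ['binary_crossentropy', 'categorical_crossentropy', 'mean_absolute_error'],
     'dropout': [0.1, 0.5],
     'batch_size': [32, 64],
     'epochs': [100]}

# every combination of one value per hyperparameter
combinations = [dict(zip(p.keys(), values)) for values in product(*p.values())]
n_combinations = len(combinations)
print(n_combinations)  # 2 * 2 * 3 * 2 * 2 * 1 = 48
```

Each combination trains one model for 100 epochs, so the grid requires 48 training runs.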

In [30]:
# wall time to run grid search: ~ 2h 10min
#run the small grid search
%time tal = grid_search(padded_train, padded_validate, train_labels, val_labels)
#analyze the outcome
analyze_object=talos.Analyze(tal)
analysis_results = analyze_object.data
#let's have a look at the results of the grid search
print(analysis_results)

  0%|                                                   | 0/48 [00:00<?, ?it/s]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
=================================================================
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________

  2%|▊                                       | 1/48 [02:30<1:58:14, 150.94s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}

  4%|█▋                                      | 2/48 [04:59<1:54:50, 149.78s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
=================================================================
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________

  6%|██▌                                     | 3/48 [08:21<2:10:03, 173.41s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}

  8%|███▎                                    | 4/48 [11:43<2:15:36, 184.91s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 10%|████▏                                   | 5/48 [15:02<2:15:59, 189.76s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}

 12%|█████                                   | 6/48 [18:05<2:11:10, 187.39s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 15%|█████▊                                  | 7/48 [22:57<2:31:27, 221.64s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}

 17%|██████▋                                 | 8/48 [28:17<2:48:45, 253.14s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 19%|███████▌                                | 9/48 [31:49<2:36:09, 240.24s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}

 21%|████████▏                              | 10/48 [35:03<2:23:09, 226.03s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 23%|████████▉                              | 11/48 [40:35<2:39:15, 258.24s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}

 25%|█████████▊                             | 12/48 [46:19<2:50:37, 284.37s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 27%|██████████▌                            | 13/48 [50:03<2:35:09, 265.99s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}

 29%|███████████▍                           | 14/48 [54:06<2:26:54, 259.24s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 31%|███████████▌                         | 15/48 [1:00:11<2:40:07, 291.14s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}

 33%|████████████▎                        | 16/48 [1:05:34<2:40:20, 300.63s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 35%|█████████████                        | 17/48 [1:09:08<2:21:54, 274.67s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}

 38%|█████████████▉                       | 18/48 [1:13:33<2:15:46, 271.55s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 40%|██████████████▋                      | 19/48 [1:20:27<2:32:00, 314.52s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}

 42%|███████████████▍                     | 20/48 [1:26:21<2:32:16, 326.29s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 44%|████████████████▏                    | 21/48 [1:30:26<2:15:50, 301.87s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}

 46%|████████████████▉                    | 22/48 [1:34:48<2:05:36, 289.85s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 48%|█████████████████▋                   | 23/48 [1:42:16<2:20:32, 337.32s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}

 50%|██████████████████▌                  | 24/48 [1:49:43<2:28:06, 370.25s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}

 52%|███████████████████▎                 | 25/48 [1:53:13<2:03:31, 322.24s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}

 54%|████████████████████                 | 26/48 [1:56:05<1:41:35, 277.07s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}

 56%|████████████████████▊                | 27/48 [2:00:12<1:33:48, 268.01s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________________

 58%|█████████████████████▌               | 28/48 [2:06:35<1:40:54, 302.74s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_______________________________________________

 60%|██████████████████████▎              | 29/48 [2:09:53<1:25:55, 271.32s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
__________________________________________________

 62%|███████████████████████▏             | 30/48 [2:12:44<1:12:19, 241.11s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________

 65%|███████████████████████▉             | 31/48 [2:18:22<1:16:31, 270.07s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
________________________________________________

 67%|████████████████████████▋            | 32/48 [2:24:42<1:20:49, 303.11s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
____________________________________________________

 69%|█████████████████████████▍           | 33/48 [2:28:07<1:08:25, 273.72s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_______________________________________________________

 71%|███████████████████████████▋           | 34/48 [2:31:04<57:04, 244.58s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
__________________________________________________

 73%|████████████████████████████▍          | 35/48 [2:35:49<55:38, 256.79s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________________

 75%|█████████████████████████████▎         | 36/48 [2:40:10<51:38, 258.24s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
____________________________________________________

 77%|██████████████████████████████         | 37/48 [2:43:40<44:39, 243.56s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_______________________________________________________

 79%|██████████████████████████████▉        | 38/48 [2:47:03<38:34, 231.46s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
__________________________________________________

 81%|███████████████████████████████▋       | 39/48 [2:53:09<40:46, 271.83s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________________

 83%|████████████████████████████████▌      | 40/48 [2:59:03<39:32, 296.52s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_______________________________________________

 85%|█████████████████████████████████▎     | 41/48 [3:02:09<30:43, 263.31s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
__________________________________________________

 88%|██████████████████████████████████▏    | 42/48 [3:05:18<24:06, 241.03s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________

 90%|██████████████████████████████████▉    | 43/48 [3:12:01<24:08, 289.66s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
________________________________________________

 92%|███████████████████████████████████▊   | 44/48 [3:18:50<21:42, 325.53s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
____________________________________________________

 94%|████████████████████████████████████▌  | 45/48 [3:21:59<14:13, 284.47s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 100)              60400     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 9)                 909       
                                                                 
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_______________________________________________________

 96%|█████████████████████████████████████▍ | 46/48 [3:25:19<08:38, 259.18s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
__________________________________________________

 98%|██████████████████████████████████████▏| 47/48 [3:33:11<05:22, 322.92s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 100, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           262400    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              160800    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 200)               0         
                                                                 
 dense (Dense)               (None, 9)                 1809      
                                                                 
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_____________________________________________________

100%|███████████████████████████████████████| 48/48 [3:41:47<00:00, 277.24s/it]

Wall time: 3h 41min 47s
              start              end    duration  round_epochs        loss  \
0   04/04/22-205642  04/04/22-205913  150.820034           100  -88.799911   
1   04/04/22-205913  04/04/22-210142  148.788683           100  -84.970695   
2   04/04/22-210142  04/04/22-210504  201.307548           100 -187.249344   
3   04/04/22-210504  04/04/22-210826  202.345328           100 -162.135391   
4   04/04/22-210826  04/04/22-211145  198.185141           100   46.436058   
5   04/04/22-211145  04/04/22-211447  182.586290           100   44.123749   
6   04/04/22-211448  04/04/22-211940  291.981771           100   84.674561   
7   04/04/22-211940  04/04/22-212500  320.381307           100   80.865166   
8   04/04/22-212500  04/04/22-212832  211.701952           100    0.317915   
9   04/04/22-212832  04/04/22-213146  194.007792           100    0.317915   
10  04/04/22-213146  04/04/22-213718  331.106126           100    0.317915   
11  04/04/22-213718  04/04/22-214302  34




In [31]:
#we choose the best model of the grid search on the basis of the MAE metric, lower values are better
mlm_model = tal.best_model(metric='mean_absolute_error', asc=True)
#to get an idea of how our best model performs, we check predictions on the validation set
prediction_mlm_val = mlm_model.predict(padded_validate)
output_mlm_val = gof_spear(prediction_mlm_val,val_labels)

In [32]:
print(pd.DataFrame(data=output_mlm_val,index=schemas,columns=['estimate']))

          estimate
Attach   -0.001573
Comp     -0.043008
Global    0.070634
Health   -0.056647
Control  -0.075264
MetaCog  -0.146389
Others   -0.042575
Hopeless -0.047001
OthViews  0.125456
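
The estimates above are one Spearman correlation per schema between predicted and annotated scores. As a minimal sketch of that computation, assuming `gof_spear` correlates predictions and labels column by column (the helper `columnwise_spearman` and the toy arrays are illustrative and not part of the original notebook):

```python
import numpy as np
from scipy.stats import spearmanr


def columnwise_spearman(preds, labels):
    """Hypothetical stand-in for gof_spear: one Spearman rho per label column."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if preds.shape != labels.shape:
        raise ValueError("preds and labels must have the same shape")
    rhos = np.zeros(labels.shape[1])
    for j in range(labels.shape[1]):
        # spearmanr returns (correlation, p-value); only the correlation is kept
        rho, _ = spearmanr(preds[:, j], labels[:, j])
        rhos[j] = rho
    return rhos


# Toy data (not from the study): 4 utterances, 2 schemas, ordinal labels 0-3.
toy_labels = np.array([[0, 3], [1, 2], [2, 1], [3, 0]])
toy_preds = np.array([[0.1, 0.2], [0.2, 0.4], [0.5, 0.6], [0.9, 0.8]])
toy_rhos = columnwise_spearman(toy_preds, toy_labels)
print(toy_rhos)
```

In the toy data the first column of predictions rises with the labels and the second falls, so the two correlations have opposite signs. Values close to zero, as in the table above, would indicate that the predicted ranking carries almost no information about the annotated ranking.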


### Inserted for CS 598 DLH

Restore the researchers' saved, best-performing multi-label RNN and compare it with our retrained model, whose validation correlations above are close to zero. What Spearman correlations does the restored model give on the validation data?

In [59]:
#we restore the deployed Talos experiment
restore = talos.Restore('Data/mlm_rnn.zip')

<talos.commands.restore.Restore at 0x20549d65248>

In [60]:
# Get restored model
researcher_best_mlm_model = restore.model

In [61]:
# Now check the restored model's predictions on the validation set.
researcher_prediction_mlm_val = researcher_best_mlm_model.predict(padded_validate)
researcher_output_mlm_val = gof_spear(researcher_prediction_mlm_val,val_labels)

In [66]:
# Print the results: the restored model yields clearly positive correlations, unlike our retrained model above.
print(pd.DataFrame(data=researcher_output_mlm_val,index=schemas,columns=['estimate']))

          estimate
Attach    0.670015
Comp      0.659653
Global    0.512468
Health    0.385235
Control   0.270971
MetaCog   0.079663
Others    0.195465
Hopeless  0.461129
OthViews  0.510353


In [65]:
restore.results

Unnamed: 0,round_epochs,loss,mean_absolute_error,val_loss,val_mean_absolute_error,batch_size,dropout,epochs,losses,lstm_units,optimizer
0,100,-90.282791,0.344843,0.0,0.0,32,0.1,100,binary_crossentropy,50,rmsprop
1,100,-77.422638,0.36769,0.0,0.0,32,0.1,100,binary_crossentropy,50,Adam
2,100,-184.882538,0.339235,0.0,0.0,32,0.1,100,binary_crossentropy,100,rmsprop
3,100,-117.680328,0.383917,0.0,0.0,32,0.1,100,binary_crossentropy,100,Adam
4,100,1.3465,0.224412,0.0,0.0,32,0.1,100,categorical_crossentropy,50,rmsprop
5,100,1.368416,0.226051,0.0,0.0,32,0.1,100,categorical_crossentropy,50,Adam
6,100,1.275427,0.214414,0.0,0.0,32,0.1,100,categorical_crossentropy,100,rmsprop
7,100,1.26202,0.212742,0.0,0.0,32,0.1,100,categorical_crossentropy,100,Adam
8,100,0.317915,0.317915,0.0,0.0,32,0.1,100,mean_absolute_error,50,rmsprop
9,100,0.317915,0.317915,0.0,0.0,32,0.1,100,mean_absolute_error,50,Adam


### End of code inserted for CS 598 DLH

Where could the discrepancy between our retrained model and the restored model come from?
* Subtler code indentation errors?
* Different package versions?
* Model initializations?
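
One of these candidates, differing package versions, can be checked directly. The sketch below compares the installed versions with the versions pinned in the introduction of this notebook. It assumes Python 3.8 or later (for `importlib.metadata`); the helper `compare_versions` is hypothetical and not part of the original code. The model summaries printed during our grid search are laid out differently from those in the researchers' later output, which may point to a different Keras version, though that is only a guess.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the introduction of this notebook.
PINNED = {
    "gensim": "3.8.3",
    "talos": "0.6.3",
    "tensorflow": "2.3.2",
    "statsmodels": "0.10.2",
    "scipy": "1.4.1",
    "scikit-learn": "0.23.2",
    "numpy": "1.16.3",
    "pandas": "0.25.3",
}


def compare_versions(pinned):
    """Return {package: (pinned_version, installed_version_or_None)} for mismatches."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in compare_versions(PINNED).items():
    print(f"{name}: pinned {wanted}, installed {installed or 'not installed'}")
```

An empty printout would mean that the environment matches the pinned versions, in which case the remaining candidates (code differences and model initialization) become more likely explanations.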

In [52]:
#the predictions make sense considering what we got from KNN and SVM. We deploy the model.
talos.Deploy(tal,'mlm_rnn',metric='mean_absolute_error',asc=True)

Deploy package mlm_rnn have been saved.


<talos.commands.deploy.Deploy at 0x15173f518>

In [22]:
#we restore the deployed Talos experiment
restore = talos.Restore('Data/mlm_rnn.zip')
#to get the best performing parameters, we get the results of the Talos experiment
scan_results = restore.results

In [23]:
#select the row with the smallest mean absolute error
print(scan_results[scan_results.mean_absolute_error == scan_results.mean_absolute_error.min()]) 

   round_epochs     loss  mean_absolute_error  val_loss  \
7           100  1.26202             0.212742       0.0   

   val_mean_absolute_error  batch_size  dropout  epochs  \
7                      0.0          32      0.1     100   

                     losses  lstm_units optimizer  
7  categorical_crossentropy         100      Adam  


>We have learned that, despite setting the random seed values for numpy and tensorflow, some variability remains with each training of the RNNs, so our results are not exactly reproducible. To ensure that we cannot be accused of reporting just a "lucky shot", we follow the advice given in this blog post: https://machinelearningmastery.com/reproducible-results-neural-networks-keras/ . We therefore train 30 multi-label neural nets with the best parameters identified in the Talos scan and report the mean Spearman correlations in the article. We do the same with the per-schema RNNs below.

In [24]:
def mlm_fixed(train_X, train_y, test_X, test_y):
    # build the model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    #embedding layer
    model.add(e)
    #LSTM layer
    model.add(Bidirectional(LSTM(100)))
    #dropout layer
    model.add(Dropout(0.1))
    #output layer
    model.add(Dense(9, activation='sigmoid'))
    # compile the model
    model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['mean_absolute_error'])
    # summarize the model
    print(model.summary())
    # fit the model
    out = model.fit(train_X, train_y, 
                    validation_data=(test_X, test_y),
                    batch_size=32, 
                    epochs=100, 
                    verbose=0)
    return out, model

In [25]:
%%time
# wall time to run: ~ 1h 54min
for i in range(30):
    #we train the model
    res, model = mlm_fixed(padded_train, train_labels, padded_validate, val_labels)
    #we save models to files to free up working memory
    model_name = 'Data/MLMs/mlm_' + str(i)
    model.save(model_name + '.h5')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 100)           262400    
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               160800    
_________________________________________________________________
dropout (Dropout)            (None, 200)               0         
_________________________________________________________________
dense (Dense)                (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None

CPU times: user 4h 22min 13s, sys: 28min 45s, total: 4h 50min 58s
Wall time: 1h 53min 47s


In [26]:
#generate predictions with the multilabel models
def predict_schema_mlm(test_text, test_labels,fixed=None):
    if fixed is None:
        all_preds = np.zeros((test_labels.shape[0],test_labels.shape[1],30))
        all_gofs = np.zeros((30,9))
        for j in range(30):
            model_name = "Data/MLMs/mlm_" + str(j)
            model = keras.models.load_model(model_name + '.h5')
            preds = model.predict(test_text)
            gofs = gof_spear(preds,test_labels)
            all_preds[:,:,j] = preds
            all_gofs[j,:] = gofs
    else:
        model_name = "Data/MLMs/mlm_" + str(fixed)
        model = keras.models.load_model(model_name + '.h5')
        all_preds = model.predict(test_text)
        all_gofs = gof_spear(all_preds,test_labels)
    return all_gofs,all_preds
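
`gof_spear` is defined earlier in the notebook and is not shown in this section. Since its result is stored as one row of nine values per model, the sketch below assumes it computes one Spearman correlation per schema column. `per_schema_spearman` and the toy arrays are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import spearmanr

def per_schema_spearman(preds, labels):
    """Spearman correlation between predictions and labels, one value per schema column."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    gofs = []
    for i in range(labels.shape[1]):
        # correlate predicted scores with the ordinal labels of schema i
        rho, _ = spearmanr(preds[:, i], labels[:, i])
        gofs.append(rho)
    return np.array(gofs)

# toy data: 6 utterances and 2 schemas (the notebook uses 9 schemas)
toy_labels = np.array([[0, 3], [1, 2], [2, 1], [3, 0], [0, 3], [2, 1]])
toy_preds = np.array([[0.1, 2.9], [0.8, 2.2], [1.9, 1.2],
                      [2.7, 0.3], [0.2, 2.5], [2.1, 0.9]])
toy_gofs = per_schema_spearman(toy_preds, toy_labels)
print(toy_gofs.shape)  # (2,)
```

Spearman's coefficient only depends on ranks, so it can be applied directly to the continuous outputs of the multilabel model without rounding them to the ordinal scale first.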

### Training Per-Schema RNNs
> We also train a separate RNN for each schema. In these models, the output layer computes a probability for each of the four possible labels, so the labels are treated as separate classes. We reuse the parameter values of the multilabel model for the number of LSTM units, the dropout rate, the loss function, the evaluation metric, the batch size, and the number of epochs. To obtain a probability for each class, the output layer uses a softmax activation function. For the evaluation, each model predicts the class with the highest probability. The trained models are written to files and loaded again for prediction.
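
The class encoding and decoding described above can be illustrated without Keras. The following is a minimal NumPy sketch: `one_hot` is a hypothetical stand-in for `to_categorical`, `softmax` mimics the output-layer activation, and the logits are made-up values rather than real model outputs.

```python
import numpy as np

def one_hot(labels, num_classes=4):
    """One-hot encode integer labels 0..num_classes-1."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# ordinal labels of four utterances for a single schema
ordinal = np.array([0, 2, 3, 1])
targets = one_hot(ordinal)  # shape (4, 4), one column per class

# made-up output-layer activations for the same four utterances
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.3, 1.5, 0.2],
                   [-1.0, 0.0, 0.4, 2.2],
                   [0.1, 1.1, 0.9, -0.5]])
probs = softmax(logits)           # each row sums to 1
predicted = probs.argmax(axis=1)  # class with the highest probability
print(predicted)  # [0 2 3 1]
```

Taking the argmax maps the four class probabilities back onto the 0 to 3 scale, which is what allows the per-schema predictions to be compared with the ordinal labels afterwards.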

In [30]:
#define separate models
def perschema_models(train_X, train_y, test_X, test_y):
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    model.add(e)
    model.add(Bidirectional(LSTM(100)))
    model.add(Dropout(0.1))
    model.add(Dense(4, activation='softmax'))
    # compile the model
    model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['mean_absolute_error'])
    # summarize the model
    print(model.summary())
    # fit the model
    model.fit(train_X, train_y,
              validation_data=(test_X, test_y),
              batch_size=32,
              epochs=100,
              verbose=0)
    out=model.predict(test_X)
    #axis=None flattens both arrays, so this is a single correlation over all class probabilities
    gof,p=scipy.stats.spearmanr(out,test_y,axis=None)
    return gof, model

In [33]:
%%time
# wall time to run: ~ 16h
#train models
for j in range(30):
    directory_name = "Data/PSMs/per_schema_models_" + str(j)
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
    for i in range(9):
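        #fix the number of classes so that a label value missing from a split cannot change the output width
        train_label_schema = np_utils.to_categorical(train_labels[:,i], num_classes=4)
        val_label_schema = np_utils.to_categorical(val_labels[:,i], num_classes=4)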
        val_output_slm, model = perschema_models(padded_train,train_label_schema,padded_validate,val_label_schema)
        #we write trained models to files to free up working memory
        model_name = '/schema_model_' + schemas[i]
        save_model_under = directory_name + model_name
        model.save(save_model_under + '.h5')

Model: "sequential_47"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_47 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_47 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_47 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_47 (Dense)             (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None

In [64]:
#load single models
def load_single_models(directory):
    single_models = []
    for i in range(9):
        model_name ='/schema_model_' + schemas[i]
        get_from = directory + model_name
        model = keras.models.load_model(get_from + '.h5')
        single_models.append(model)
    return single_models

In [82]:
#generate predictions with the per-schema models
#fixed=None evaluates all 30 trained model sets; otherwise only the set with index `fixed` is used
def predict_schema_psm(test_text, test_labels, fixed=None):
    if fixed is None:
        #predictions (items x schemas x model sets) and Spearman correlations (model sets x schemas)
        all_preds = np.zeros((test_labels.shape[0],test_labels.shape[1],30))
        all_gofs = np.zeros((30,9))
        for j in range(30):
            directory_name = "Data/PSMs/per_schema_models_" + str(j)
            preds = np.zeros(test_labels.shape)
            gofs = []
            single_models = load_single_models(directory_name)
            for i in range(9):
                model = single_models[i]
                #the predicted label is the class (0-3) with the highest output
                out = model.predict(test_text)
                out = out.argmax(axis=1)
                preds[:,i] = out
                #goodness of fit: Spearman correlation between predicted and true labels
                gof,_ = scipy.stats.spearmanr(out,test_labels[:,i])
                gofs.append(gof)
            all_preds[:,:,j] = preds
            all_gofs[j,:] = gofs
    else:
        directory_name = "Data/PSMs/per_schema_models_" + str(fixed)
        all_preds = np.zeros(test_labels.shape)
        all_gofs = []
        single_models = load_single_models(directory_name)
        for i in range(9):
            model = single_models[i]
            out = model.predict(test_text)
            out = out.argmax(axis=1)
            all_preds[:,i] = out
            gof,_ = scipy.stats.spearmanr(out,test_labels[:,i])
            all_gofs.append(gof)
    return all_gofs,all_preds

### Generate Testset Predictions with the RNN Models

In [74]:
def my_rnn(test_X,test_y,single):
    #single=1: per-schema models, single=0: multilabel model
    if single:
        gof,preds=predict_schema_psm(test_X,test_y)
    else:
        gof,preds=predict_schema_mlm(test_X,test_y)
    #sum the per-schema correlations of each trained model
    gof_sum = np.sum(gof,axis=1)
    #sort sums
    gof_sum_sorted = np.sort(gof_sum)
    #with an even number of models there is no single median, so pick the upper of the two middle elements
    get_med_element = gof_sum_sorted[len(gof_sum_sorted)//2]
    #get index of the median model (keep only the first match in case of ties)
    gof_sum_med_idx = np.where(gof_sum==get_med_element)[0][:1]
    #choose this as the final model to use in H2 and to report in the paper
    gof_out = gof[gof_sum_med_idx]
    return np.transpose(gof_out),gof_sum_med_idx
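
The median-model selection in `my_rnn` can be illustrated on toy data. The sketch below is not part of the analysis: the goodness-of-fit matrix is random, and `np.argsort` is used in place of the `np.sort`/`np.where` combination so that exactly one index is returned even if two models have the same sum.

```python
import numpy as np

# Toy goodness-of-fit matrix (assumption): 30 hypothetical models x 9 schemas.
rng = np.random.RandomState(0)
gof = rng.uniform(0.0, 1.0, size=(30, 9))

# Sum the per-schema correlations of each model.
gof_sum = gof.sum(axis=1)

# With an even number of models there is no single middle element,
# so we take the upper of the two middle elements (position 15 of 30).
order = np.argsort(gof_sum, kind="stable")
median_idx = int(order[len(order) // 2])

# Per-schema correlations of the selected model.
gof_median_model = gof[median_idx]
print(median_idx, gof_median_model.round(3))
```

Selecting the median rather than the best of the 30 models avoids reporting an optimistic estimate that depends on a lucky random initialization.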

In [68]:
%%time
# wall time to run: ~ 6min
# predicting testset with multilabel model
output_mlm,idx_mlm = my_rnn(padded_test,test_labels,0)
# predicting testset with per-schema models
output_psm,idx_psm = my_rnn(padded_test,test_labels,1)

[[0.68103096 0.65396871 0.50859917 0.36096701 0.33152857 0.13731172
  0.16088792 0.51433435 0.51251623]
 [0.68247632 0.65967316 0.48886687 0.36404574 0.33820947 0.13267774
  0.18655239 0.4949372  0.49893988]
 [0.67015486 0.66712939 0.50171559 0.36997862 0.33548779 0.14987802
  0.18654622 0.51737017 0.5181974 ]
 [0.68326572 0.64860437 0.47934406 0.34777581 0.29407156 0.09002006
  0.17343252 0.52321797 0.51186589]
 [0.67466998 0.65535233 0.46393755 0.35647016 0.30749775 0.10822175
  0.17041992 0.50445964 0.49889157]
 [0.67673266 0.65519111 0.5034052  0.35881138 0.3179437  0.10324651
  0.13479307 0.52263527 0.51208628]
 [0.66847413 0.64405326 0.46220718 0.36462563 0.28227896 0.11016369
  0.16741885 0.52511721 0.4932703 ]
 [0.67363478 0.6526793  0.49777868 0.35069732 0.29950105 0.11598738
  0.16998781 0.49836745 0.49805541]
 [0.66628307 0.66315869 0.4882498  0.35970996 0.3196786  0.06315514
  0.15358537 0.51212825 0.47501241]
 [0.67199499 0.67745766 0.47293316 0.36514627 0.3523969  0.12411

In [69]:
print('RNN Multilabel Model Testset Output')
print(pd.DataFrame(data=output_mlm,index=schemas,columns=['estimate']))

RNN Multilabel Model Testset Output
          estimate
Attach    0.686899
Comp      0.662779
Global    0.487473
Health    0.351675
Control   0.314296
MetaCog   0.108215
Others    0.158427
Hopeless  0.533973
OthViews  0.504708


In [70]:
print('RNN Per-Schema Testset Output')
print(pd.DataFrame(data=output_psm,index=schemas,columns=['estimate']))

RNN Per-Schema Testset Output
          estimate
Attach    0.727138
Comp      0.755107
Global    0.577940
Health    0.752872
Control   0.278793
MetaCog  -0.012864
Others    0.223674
Hopeless  0.627562
OthViews  0.578668


In [78]:
def my_rnn_fixed(test_X,test_y,single):
    if single:
        gof,preds=predict_schema_psm(test_X,test_y,idx_psm[0])
    else:
        gof,preds=predict_schema_mlm(test_X,test_y,idx_mlm[0])
    return gof

In [83]:
%%time
# wall time to run: ~ 37min
#bootstrapping the 95% confidence intervals
bs_mlm = bootstrap(n_iterations,n_size,padded_test,test_labels,0,"rnn")
bs_psm = bootstrap(n_iterations,n_size,padded_test,test_labels,1,"rnn")
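
The `bootstrap` function called above is defined earlier in the notebook. As a rough illustration of the general idea only, the sketch below computes a percentile bootstrap interval for a Spearman correlation on toy ordinal data. The function name `bootstrap_spearman_ci`, its parameters, and the toy data are assumptions made for this example and may differ from the notebook's own implementation.

```python
import numpy as np
from scipy import stats

def bootstrap_spearman_ci(preds, labels, n_iterations=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Spearman correlation of two 1-D arrays."""
    rng = np.random.RandomState(seed)
    n = len(labels)
    rhos = []
    for _ in range(n_iterations):
        # Resample items with replacement, keeping prediction/label pairs together.
        idx = rng.randint(0, n, size=n)
        rhos.append(stats.spearmanr(preds[idx], labels[idx])[0])
    rhos = np.asarray(rhos, dtype=float)
    # NaN correlations (constant resamples) are ignored when taking percentiles.
    low = np.nanpercentile(rhos, 100 * alpha / 2)
    high = np.nanpercentile(rhos, 100 * (1 - alpha / 2))
    return low, high

# Toy ordinal labels (0-3) and predictions that agree with them about 70% of the time.
rng = np.random.RandomState(1)
labels = rng.randint(0, 4, size=300)
noise = rng.randint(0, 4, size=300)
preds = np.where(rng.rand(300) < 0.7, labels, noise)

low, high = bootstrap_spearman_ci(preds, labels)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```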

In [84]:
print('Multilabel RNN Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_mlm),index=schemas,columns=['low','high']))
print('Per-Schema RNN Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_psm),index=schemas,columns=['low','high']))

Multilabel RNN Classification 95% Confidence Intervals
               low      high
Attach    0.661568  0.719552
Comp      0.637833  0.691817
Global    0.448301  0.527082
Health    0.307359  0.387933
Control   0.268236  0.343677
MetaCog   0.058033  0.144963
Others    0.102297  0.203079
Hopeless  0.503509  0.563676
OthViews  0.473159  0.540421
Per-Schema RNN Classification 95% Confidence Intervals
               low      high
Attach    0.695272  0.759638
Comp      0.723876  0.785342
Global    0.541899  0.628235
Health    0.654251  0.819880
Control   0.208185  0.346432
MetaCog   0.000000 -0.009094
Others    0.066163  0.331280
Hopeless  0.560763  0.681406
OthViews  0.518882  0.632585


In [113]:
output_psm_flat = [item for sublist in output_psm for item in sublist]
output_mlm_flat = [item for sublist in output_mlm for item in sublist]

In [114]:
print('Estimates of all models')
outputs = np.concatenate((output_kNN_class,output_kNN_reg,output_SVC, output_SVR, output_psm_flat, output_mlm_flat))
outputs=np.reshape(outputs,(9,6),order='F')
print(pd.DataFrame(data=outputs,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

Estimates of all models
          kNN_class   kNN_reg       SVC       SVR       PSM       MLM
Attach     0.550705  0.626743  0.647714  0.675340  0.727138  0.686899
Comp       0.690230  0.663091  0.684661  0.640866  0.755107  0.662779
Global     0.401123  0.411444  0.357601  0.489372  0.577940  0.487473
Health     0.742217  0.534902  0.729181  0.349064  0.752872  0.351675
Control    0.107526  0.231541       NaN  0.310007  0.278793  0.314296
MetaCog         NaN  0.104785       NaN  0.114894 -0.012864  0.108215
Others     0.279105  0.243713       NaN  0.185827  0.223674  0.158427
Hopeless   0.484137  0.513825  0.489903  0.535979  0.627562  0.533973
OthViews   0.454565  0.458473  0.476297  0.516635  0.578668  0.504708
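
The table above is built by concatenating one vector per model and reshaping with `order='F'`. A minimal sketch with made-up numbers (three schemas, two models) shows why column-major order is needed: each concatenated vector fills one column rather than one row.

```python
import numpy as np

n_schemas, n_models = 3, 2

# Made-up per-schema estimates for two hypothetical models.
model_a = np.array([0.1, 0.2, 0.3])
model_b = np.array([0.7, 0.8, 0.9])

# Concatenate the vectors end to end, then fill the table column by column.
flat = np.concatenate((model_a, model_b))
table = np.reshape(flat, (n_schemas, n_models), order='F')

# Equivalent and arguably more explicit: stack the vectors as columns.
stacked = np.column_stack((model_a, model_b))
print(table)
```

With the default row-major order, the same reshape would mix values from different models within a column.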


In [115]:
print('Lower CIs of all models')
lower_cis = np.concatenate((bs_knn_class[0],bs_knn_reg[0],bs_svc[0], bs_svr[0], bs_psm[0], bs_mlm[0]))
lower_cis=np.reshape(lower_cis,(9,6),order='F')
print(pd.DataFrame(data=lower_cis,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

Lower CIs of all models
          kNN_class   kNN_reg       SVC       SVR       PSM       MLM
Attach     0.506610  0.592528  0.606667  0.649722  0.695272  0.661568
Comp       0.643764  0.631965  0.649524  0.614080  0.723876  0.637833
Global     0.332734  0.363976  0.305471  0.451517  0.541899  0.448301
Health     0.654676  0.440834  0.650953  0.308533  0.654251  0.307359
Control    0.018689  0.171817  0.000000  0.259155  0.208185  0.268236
MetaCog    0.000000  0.005652  0.000000  0.055044  0.000000  0.058033
Others     0.000000  0.165796  0.000000  0.142794  0.066163  0.102297
Hopeless   0.437052  0.466884  0.431823  0.509305  0.560763  0.503509
OthViews   0.414239  0.419668  0.425397  0.475484  0.518882  0.473159


In [116]:
print('Upper CIs of all models')
upper_cis = np.concatenate((bs_knn_class[1],bs_knn_reg[1],bs_svc[1], bs_svr[1], bs_psm[1], bs_mlm[1]))
upper_cis=np.reshape(upper_cis,(9,6),order='F')
print(pd.DataFrame(data=upper_cis,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

Upper CIs of all models
          kNN_class   kNN_reg       SVC       SVR       PSM       MLM
Attach     0.599736  0.653529  0.678210  0.697454  0.759638  0.719552
Comp       0.725916  0.694682  0.722693  0.674124  0.785342  0.691817
Global     0.464361  0.455419  0.404890  0.516982  0.628235  0.527082
Health     0.810789  0.601932  0.809307  0.395494  0.819880  0.387933
Control    0.177912  0.273330  1.000000  0.345596  0.346432  0.343677
MetaCog    1.000000  0.197388  1.000000  0.155583 -0.009094  0.144963
Others     1.000000  0.307799  1.000000  0.235886  0.331280  0.203079
Hopeless   0.549618  0.559551  0.529388  0.567836  0.681406  0.563676
OthViews   0.514116  0.499502  0.531955  0.550134  0.632585  0.540421


## Generate Dataset for Testing Hypothesis 2
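Finally, we use the best-performing algorithm, the per-schema RNNs, to generate predictions on the test set and write them to a file for testing Hypothesis 2. In addition to the nine predicted labels, we store for each utterance the Spearman correlation between its predicted and its true labels across the schemas.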

In [118]:
gofH2,predsH2=predict_schema_psm(padded_test,test_labels,idx_psm[0])

In [119]:
predsH2 = predsH2.astype(int)
print(predsH2[:,0:5])
#per-utterance Spearman correlation between the 9 predicted and the 9 true schema labels
diag_rho = [scipy.stats.spearmanr(predsH2[i,:], test_labels[i,0:9], nan_policy='omit')[0] for i in range(predsH2.shape[0])]


[[0 3 0 0 0]
 [0 3 0 0 0]
 [0 1 0 0 0]
 ...
 [0 0 0 0 0]
 [0 3 0 0 0]
 [0 0 0 0 0]]


In [120]:
df_predsH2 = pd.DataFrame(data=predsH2,columns=['AttachPred','CompPred',"GlobalPred","HealthPred","ControlPred","MetaCogPred","OthersPred","HopelessPred","OthViewsPred"])
df_predsH2["Corr"] = diag_rho

In [121]:
print(df_predsH2.head())

   AttachPred  CompPred  GlobalPred  HealthPred  ControlPred  MetaCogPred  \
0           0         3           0           0            0            0   
1           0         3           0           0            0            0   
2           0         1           0           0            0            0   
3           0         0           1           0            0            0   
4           0         0           0           0            0            0   

   OthersPred  HopelessPred  OthViewsPred  Corr  
0           0             1             0  0.75  
1           0             0             0  1.00  
2           0             0             0  1.00  
3           0             0             0  1.00  
4           0             0             0   NaN  
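
The `NaN` in the `Corr` column of row 4 arises because the Spearman correlation is undefined when one of the two vectors is constant, as is the case when all nine predictions are 0. A minimal sketch with made-up rows (not taken from the dataset) reproduces this:

```python
import warnings
import numpy as np
from scipy import stats

# Made-up predicted and true labels for three utterances and five schemas.
preds = np.array([[0, 3, 0, 0, 1],
                  [0, 0, 0, 0, 0],   # constant row: correlation undefined
                  [0, 1, 0, 0, 0]])
labels = np.array([[0, 3, 0, 1, 0],
                   [0, 0, 1, 0, 0],
                   [0, 2, 0, 0, 0]])

# SciPy warns about constant input; we silence the warning for this demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    rhos = [stats.spearmanr(preds[i], labels[i])[0] for i in range(preds.shape[0])]

print(rhos)
```

The third row also shows that the rank correlation can be exactly 1 even though the predicted intensity (1) differs from the true one (2), because only the ordering of the schemas within the utterance matters.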


In [122]:
df_predsH2.to_csv("Data/PredictionsH2.csv", sep=';', header=True, index=False, mode='w')