# Python3 Part of the Data Analysis for the Journal Article Titled:
# <i>Natural language processing for cognitive behavioral therapy: extracting schemas from thought records</i>

>This script accompanies the journal article with the title stated above. The main aim of the research is to determine whether an algorithm can label utterances expressed in thought records with regard to the schema(s) they reflect. Thought record forms are a tool in cognitive therapy with which patients should gain insight into their maladaptive thought processes. According to the theory underlying cognitive therapy, it is these malaptive thought processes that result in the respective mental illness.  
This script complements an R/KnitR script that consists of the following sections:
    <ol> 
    <li>Preparing data for testing Hypothesis 1</li>
    <li>Testing Hypothesis 2</li>
    <li>Testing Hypothesis 3</li>
    <li>Testing Hypothesis 4</li>
    </ol>
Details concerning the hypotheses, the project background, the raw data, and the data collection process can all be found in the R/KnitR script.<br>
<br>
The modules below need to be installed before running the code:
    <ol>
    <li>gensim==3.8.3</li>
    <li>talos==0.6.3</li>
    <li>tensorflow==2.3.2</li>
    <li>statsmodels==0.10.2</li>
    <li>scipy==1.4.1</li>
    <li>scikit-learn==0.23.2</li>
    <li>numpy==1.16.3</li>
    <li>pandas==0.25.3</li>
    </ol>
<br>
The following inputs are required and can be found in the DataRepository/AnalysisArticle/Data directory:
    <ol>
    <li>glove.6B directory</li>
    <li>DatasetsForH1 directory</li>
    </ol>
<br>
Additionally, the following output is generated:
    <ol>
    <li>data_for_H2.csv file</li>
    <li>per_schema_models directory</li>
    </ol>
with the latter containing all trained per-schema RNN models in .h5 file format.  
<br>
The purpose of this script is to test Hypothesis 1, i.e. to see whether an algorithm can attach the correct schema label to thought record utterances more often than would be expected by chance. A thought record utterance could reflect none, any one, or multiple of 9 possible schemas. Additionally, labels are not binary (does or does not reflect schema) but ordinal (0 - has nothing to do with schema, 1 - has a little bit to do with the schema, 2 - has to do with the schema, 3 - fits perfectly with the schema). <br>
Utterances are in natural language format. It is therefore necessary to preprocess these pieces of text, which we do in R. We also split the entire raw dataset into training, validation and test sets. The test set is created by taking 15% of the raw data, the validation set is created by taking another 15% of the remaining data. <br>
Three algorithms are explored: k-nearest neighbors, support vector machines, and recurrent neural networks. We arrived at the former two, by following the decision tree presented by scikitLearn (https://scikit-learn.org/stable/tutorial/machine_learning_map/). The data are ordinal, labeled, and we have less than 100k samples. The recurrent neural networks are a logical choice for natural language data, since they allow modelling the temporal aspect that is inherent to sentences as sequences of words.<br>
The wall time of runtimes are provided in the first comment of cells of time intensive code. Additionally, the cell magic "%%time" in these cells ensures that runtimes are printed so that these can be compared to the reported runtimes to get appropriate estimates when running on a different machine.<br>
We import the following packages and functions:

In [15]:
#set seed
seed = 57839
import os
os.environ['PYTHONHASHSEED']=str(seed)
import sys

import random
random.seed(seed)

import numpy as np
np.random.seed(seed)

import csv
import pandas as pd
import scipy
import scipy.stats as stats
import functools
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


import sklearn
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier, KNeighborsRegressor
from sklearn import metrics, preprocessing, svm
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
import statsmodels.formula.api as smf

import tensorflow as tf
tf.random.set_random_seed(seed)

from tensorflow.python.keras.metrics import Metric
from tensorflow import keras
import talos
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers import Dense, Flatten, Embedding, SimpleRNN, LSTM, GRU, Bidirectional,Dropout

from keras import backend as K
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)

In [16]:
print(sys.version)

3.7.13 (default, Mar 28 2022, 07:24:34) 
[Clang 12.0.0 ]


In [17]:
#list packages and their version numbers as used in this script (code is taken from 
#https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook)
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]

        yield name
imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))

gensim==3.8.3
numpy==1.21.6
tensorflow==1.14.0
scipy==1.4.1
pandas==0.25.3
scikit-learn==0.23.2
statsmodels==0.13.1
talos==0.6.3


> We also set the working directory:

In [18]:
os.chdir("/Users/sampathg/Documents/MyUIUC/CS598-DLHC/Project/Datasets/DataRepository/AnalysisArticle")

## Importing the datasets from csv
> The preprocessed utterances are split into three sets in the R script. They are saved in three separate csv files. Additionally, the manually assigned labels that correspond with the utterances are saved in three separate csv files.

In [19]:
# read in datasets (already pre-processed)
def readcsv(fname,istext):
    if istext:
        with open(fname,'rt') as f:
            reader=csv.reader(f)
            next(reader)
            data = []
            for row in reader:
                for item in row:
                    data.append(item)
            f.close()
    else:
        with open(fname,'r') as f:
            reader=csv.reader(f,delimiter=';')
            next(reader)
            data = list(reader)
            data = np.asarray(data, dtype='int')
            f.close()
    return data 

# read in training, validation, and test set utterances
train_text = readcsv('Data/DatasetsForH1/H1_train_texts.csv',True)
val_text = readcsv('Data/DatasetsForH1/H1_validate_texts.csv', True)
test_text = readcsv('Data/DatasetsForH1/H1_test_texts.csv',True)

# read in training, validation, and test set labels
train_labels = readcsv('Data/DatasetsForH1/H1_train_labels.csv',False)[:,0:9]
val_labels = readcsv('Data/DatasetsForH1/H1_validate_labels.csv', False)[:,0:9]
test_labels = readcsv('Data/DatasetsForH1/H1_test_labels.csv',False)[:,0:9]

In [20]:
print(train_text[0:5])

['lot people may think well lot people might not like me', 'might not working fast enough their standards', 'may not able graduate', 'would get bad performance review', 'friends will get annoyed by me']


In [21]:
print(train_labels[0:5,:])

[[2 0 0 0 0 0 0 0 3]
 [0 3 0 0 0 0 0 0 0]
 [0 3 0 0 0 0 0 0 0]
 [0 3 0 0 0 0 0 0 0]
 [2 0 0 0 0 0 0 0 3]]


> As can be seen, some utterances have multiple schemas assigned. However, overall, the label matrices are sparse matrices. The first column of the labels corresponds to the "Attachment" schema, the second to the "Competence" schema, the third to last to the "Other's views on self" schema.

In [22]:
#for later use
schemas = ["Attach","Comp","Global","Health","Control","MetaCog","Others","Hopeless","OthViews"]

## Embedding the utterances using GLoVE
> We have opted for representing the words in utterances as word vectors. We adopt the GLoVE word vector space that has been created with Wikipedia 2014. First, we tokenize the top 2000 words of the training set.  

In [23]:
# prepare tokenizer
max_words = 2000
t = Tokenizer(num_words = max_words)
t.fit_on_texts(train_text)
vocab_size = len(t.word_index) + 1
print(vocab_size)

2624


> The tokenizer takes the words and indexes these based on frequency. For the recurrent neural net, we need padded utterances sequences. Texts_to_sequences simply represents each utterance as a vector of tokens. Padding ensures that all vectors are of the same length, by appending 0s to the end of shorter vectors. We pad to a length of 25 words.

In [24]:
# integer encode all utterances
encoded_train = t.texts_to_sequences(train_text)
encoded_validate = t.texts_to_sequences(val_text)
encoded_test = t.texts_to_sequences(test_text)

# pad documents to a max length of 25 words
max_length = 25

padded_train = pad_sequences(encoded_train, maxlen=max_length, padding='post')
padded_validate = pad_sequences(encoded_validate, maxlen=max_length, padding='post')
padded_test = pad_sequences(encoded_test, maxlen=max_length, padding='post')

print(encoded_train[0:5])

[[147, 28, 48, 37, 101, 147, 28, 32, 1, 8, 5], [32, 1, 155, 658, 14, 125, 568], [48, 1, 19, 448], [2, 11, 53, 449, 659], [50, 6, 11, 373, 98, 5]]


In [25]:
print(padded_train[0:5])

[[147  28  48  37 101 147  28  32   1   8   5   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 32   1 155 658  14 125 568   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 48   1  19 448   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [  2  11  53 449 659   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]
 [ 50   6  11 373  98   5   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0]]


> We can now load the GLoVE embeddings into memory.

In [26]:
%%time
# wall time to run: ~ 10sec
# load all embeddings into memory
embeddings_index = dict()
f = open('Data/glove.6B/glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.
CPU times: user 8.4 s, sys: 290 ms, total: 8.69 s
Wall time: 8.73 s


> We can then create an embedding matrix by taking each word of the training set and finding the corresponding word vector in the GLoVE data. We only work with 100 dimensional representations.

In [27]:
vec_dims = 100
embedding_matrix = np.zeros((vocab_size, vec_dims))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [28]:
# create tfidf weighted encoding matrix of utterances
train_sequences = t.texts_to_matrix(train_text,mode='tfidf')
val_sequences =  t.texts_to_matrix(val_text,mode='tfidf')
test_sequences = t.texts_to_matrix(test_text,mode='tfidf')
print(train_sequences[0:5])
print(train_sequences.shape)

[[0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         1.29214445 0.         ... 0.         0.         0.        ]
 [0.         0.         1.69021763 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
(4151, 2000)


In [29]:
# we want to normalize the word vectors
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0: 
       return v
    return v / norm

In [30]:
# create utterance embeddings as tfidf weighted average of normalized word vectors
def seq2vec(datarow,embedmat):
  #initialize an empty utterance vector of the same length as word2vec vectors
  seqvec = np.zeros((100,))
  #counter for number of words in a specific utterance
  wordcount = 1
  #we iterate over the 2000 possible words in a given utterance
  wordind = 1
  while (wordind < len(datarow)):
    #the tf-idf weight is saved in the cells of datarow
    tfidfweight = datarow[wordind]
    if not tfidfweight is None:
      wordembed = tfidfweight * embedmat[wordind,]
      seqvec = seqvec + normalize(wordembed)
      wordcount = wordcount + 1
    wordind = wordind + 1
  return seqvec/wordcount

In [31]:
# go through the matrix and embed each utterances
def embed_utts(sequences,embedmat):
  vecseq = [seq2vec(seq,embedmat)for seq in sequences]
  return vecseq

> we now have everything needed to create the utterance embeddings

In [32]:
%%time
# wall time to run: ~ 1min 14s
# embedd all three datasets
train_embedutts = embed_utts(train_sequences,embedding_matrix)
val_embedutts = embed_utts(val_sequences,embedding_matrix)
test_embedutts = embed_utts(test_sequences,embedding_matrix)
print(train_embedutts[0])

[-3.44478099e-05  3.48539840e-04  3.35962753e-04 -3.70855457e-04
 -2.63147438e-04  1.07227074e-04 -1.66559109e-04  1.94234500e-05
  7.42420570e-05 -1.80615841e-04  1.80387578e-05  4.92242034e-05
  2.75006568e-04  2.34416192e-05  8.31148165e-05 -2.93833280e-04
 -7.15121389e-05  2.98592314e-04 -4.55134987e-04  4.72657153e-04
  2.57585086e-04  1.69741478e-04  7.75960265e-05 -2.15817394e-04
 -4.34789085e-05  7.24571212e-05 -1.54585404e-04 -4.98166781e-04
  1.93941088e-04 -1.74921206e-04  2.37557331e-05  4.85150809e-04
  3.08554881e-05 -4.62293641e-05  1.35110613e-04  2.80284189e-04
 -3.22980711e-05  3.12968134e-04  8.27704500e-05 -2.40951546e-04
 -3.13527886e-04 -1.35440392e-04 -2.05195768e-05 -4.81099111e-04
 -2.75375333e-04 -1.27601856e-04  2.50256011e-04 -2.50631136e-04
 -1.51297680e-04 -8.33555219e-04  3.54382525e-05 -8.74190709e-05
  1.05239327e-05  8.00132559e-04 -1.52039351e-04 -1.90058573e-03
  9.49153013e-05 -1.17238522e-05  1.18845110e-03  3.93093667e-04
 -1.88908628e-04  9.94003

## Model evaluation
> We use the Spearman correlation to evaluate the models and choose the best one, because it can be used for both the regression and the classification outcomes. This is not the case for a weighted Cohen's Kappa, for example, which only works for class labels.

In [33]:
#### Goodness of Fit
def gof_spear(X,Y):
    #spearman correlation of columns (schemas)
    gof_spear = np.zeros(X.shape[1])    
    for schema in range(9):
        rho,p = scipy.stats.spearmanr(X[:,schema],Y[:,schema])
        gof_spear[schema]=rho
    return gof_spear

## Bootstrapping for confidence intervals
> Since all models are expensive to run, we only do a small bootstrapping to obtain some insight into how confident we can be about the predictions.

In [34]:
# we adopt the algorithm from the following website:
# https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/
# def bootstrap
def bootstrap(iterations,sample_size, sample_embeds, sample_labels,classification,model):
    stats = np.zeros((iterations,9))
    for l in range(iterations):
        # prepare bootstrap sample
        bootstrap_sample_indx = random.sample(list(enumerate(sample_embeds)), sample_size)
        bootstrap_sample_utts = [sample_embeds[i] for (i,j) in bootstrap_sample_indx]
        bootstrap_sample_labels = [sample_labels[i] for (i,j) in bootstrap_sample_indx]
        # evaluate model
        if model=="knn":
            model_gof=my_kNN(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        elif model=="svm":
            model_gof=my_svm(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        elif model=="rnn":
            model_gof=my_rnn_fixed(np.array(bootstrap_sample_utts),np.array(bootstrap_sample_labels),classification)
        stats[l,:] = model_gof
    # confidence intervals
    cis = np.zeros((2,9))
    alpha = 0.95
    p = ((1.0-alpha)/2.0) * 100
    cis[0,:] = [max(0.0, np.percentile(stats[:,i], p)) for i in range(9)]
    p = (alpha+((1.0-alpha)/2.0)) * 100
    cis[1,:] = [min(1.0, np.percentile(stats[:,i], p)) for i in range(9)]
    return cis

# configure bootstrap
n_iterations = 100
n_size = int(len(val_text) * 0.75)

## k-nearest Neighbors Classification and Regression
> Since we have ordinal labels for our data, we train both classification and regression algorithms and see which one performs better. We also have multi-label data, and therefore write a custom kNN algorithm. We use the cosine distance to find the nearest neighbors.

In [35]:
# cosine distance
def cosine_dist(X,Y):
    return scipy.spatial.distance.cosine(X,Y)

In [52]:
#kNN algorithm
def knn_custom(train_X,test_X,train_y,test_y,k,dist,classification):
    #empty array to collect the results (should have shape of samples to classify)
    votes = np.zeros(test_y.shape)
    #fit the knn
    knn=NearestNeighbors(k, metric=dist)
    knn.fit(train_X)
    #collect neighbors
    i=0 # index to collect votes of the neighbors
    for sample in test_X:
        neighbors=knn.kneighbors([sample],k,return_distance=False)[0]
        if classification:
            output_y = np.zeros((k,test_y.shape[1]))
            j=0
            for neighbor in neighbors:
                output_y[j,:] = train_y[neighbor,:]
                j=j+1
            votes[i,:] = stats.mode(output_y,nan_policy='omit')[0]
        else:
            output_y = np.zeros(test_y.shape[1])
            for neighbor in neighbors:
                output_y += train_y[neighbor,:]
                votes[i,:] = np.divide(output_y,k)
        i=i+1
    return votes

> To evaluate choices for k, we use a performance metric that is a weighted mean of the spearman correlation for each choice of k. As weights we use the frequencies of schemas (# of utterances with labels > 0 for a given schema/total number of utterances) in the training set.

In [56]:
# weighting model output (spearman correlations) by schema frequencies in training set and returning mean over schemas
def performance(train_y,output):
    train_y = np.array(train_y)
    train_y[train_y>0]=1
    weighting = train_y.sum(axis=0)/train_y.shape[0]
    perf = output * weighting
    return np.nanmean(np.array(perf), axis=0)

In [57]:
# # finding best k by testing some values for k
# def find_k(train_X, test_X, train_y, test_y, dist, classification):
#     perf = 0
#     best_k = 0
#     for k in [2,3,4,5,6,7,8,9,10,30,100]:
#         knn_k = knn_custom(train_X, test_X, train_y, test_y,k,dist,classification)
#         knn_gof_spear = gof_spear(knn_k,test_y)
#         print('Results for choice of k is %s.' % k)
#         print(pd.DataFrame(data=knn_gof_spear,index=schemas,columns=['gof']))
#         if perf < performance(train_y,knn_gof_spear):
#             perf = performance(train_y,knn_gof_spear)
#             best_k = k
#     return best_k

In [58]:
# %%time
# # wall time to run: ~ 15min
# # find best k for classification
# knn_class_k = find_k(train_embedutts,val_embedutts,train_labels,val_labels,cosine_dist,1)

In [59]:
# print('Best choice for classification k is: %s' % knn_class_k)

In [60]:
# %%time
# # wall time to run: ~ 15min
# # find best k for regression
# knn_reg_k = find_k(train_embedutts,val_embedutts,train_labels,val_labels,cosine_dist,0)

In [61]:
# print('Best choice for regression k is: %s' % knn_reg_k)

> Since this is needed for the bootstrapping algorithm, we define a function that takes testset and labels and returns the goodness of fit. We print the results on the testset.

In [62]:
def my_kNN(test_X,test_y,classification):
    if classification:
        my_knn=knn_custom(train_embedutts,test_X,train_labels,test_y,4,cosine_dist,1)
    else:
        my_knn=knn_custom(train_embedutts,test_X,train_labels,test_y,5,cosine_dist,0)
    return gof_spear(my_knn,test_y)

In [63]:
%%time
#wall time to run: ~ 4min
output_kNN_class = my_kNN(test_embedutts,test_labels,1)
output_kNN_reg = my_kNN(test_embedutts,test_labels,0)

CPU times: user 5min 24s, sys: 2.85 s, total: 5min 27s
Wall time: 5min 36s


In [64]:
print('KNN Classification Prediction')
print(pd.DataFrame(data=output_kNN_class,index=schemas,columns=['estimate']))

KNN Classification Prediction
          estimate
Attach    0.550705
Comp      0.690230
Global    0.401123
Health    0.742217
Control   0.107526
MetaCog        NaN
Others    0.279105
Hopeless  0.484137
OthViews  0.454565


In [65]:
print('KNN Regression Prediction')
print(pd.DataFrame(data=output_kNN_reg,index=schemas,columns=['estimate']))

KNN Regression Prediction
          estimate
Attach    0.626743
Comp      0.663091
Global    0.411444
Health    0.534902
Control   0.231541
MetaCog   0.104785
Others    0.243713
Hopeless  0.513825
OthViews  0.458473


In [66]:
# %%time
# # wall time to run: ~ 3h 30min
# # bootstrap confidence intervals for kNN regression and classification
# bs_knn_reg = bootstrap(n_iterations,n_size,test_embedutts,test_labels,0,"knn")
# bs_knn_class = bootstrap(n_iterations,n_size,test_embedutts,test_labels,1,"knn")

In [67]:
# print(f'KNN Classification 95% Confidence Intervals')
# print(pd.DataFrame(data=np.transpose(bs_knn_class),index=schemas,columns=['low','high']))

In [68]:
# print(f'KNN Regression 95% Confidence Intervals')
# print(pd.DataFrame(data=np.transpose(bs_knn_reg),index=schemas,columns=['low','high']))

## Support vector machine
> The second algorithm we chose are support vector machines (SVMs). Again, we train both a support vector classification (SVC) and a support vectore regression (SVR). We only try all three types of standard kernels and do not do any additional parameter tuning. Just like the kNN, the support vector machine takes as input the utterances encoded as averages of word vectors. Support vector classification and regression do not allow for multilabel output. We therefore train disjoint models, one for each schema.<br>
For both types of SVM, we first transform the input texts as the algorithm expects normally distributed input centered around 0 and with a standard deviation of 1.

In [69]:
#SVM/SVR
def svm_scaler(train_X):
        #scale the data
        scaler_texts = StandardScaler()
        scaler_texts = scaler_texts.fit(train_X)
        return scaler_texts

scaler_texts = svm_scaler(train_embedutts)

>Since SVMs, unlike kNNs, can be trained and reused, we write a method that returns all 9 models and a separate one for the predictions.

In [70]:
def svm_custom(train_X,train_y,text_scaler,kern,classification):
        models=[]
        train_X = text_scaler.transform(train_X)
        #fit a new support vector regression for each schema
        for schema in range(9):
            if classification:
                model = svm.SVC(kernel=kern)
            else:
                model = svm.SVR(kernel=kern)
            model.fit(train_X, train_y[:,schema])
            models.append(model)
        return models

In [71]:
def svm_predict(svm_models,test_X,train_y,test_y,text_scaler):
    #empty array to collect the results (should have shape of samples to classify)
    votes = np.zeros(test_y.shape)
    for schema in range(9):
        svm_model=svm_models[schema]
        prediction = svm_model.predict(text_scaler.transform(test_X))
        votes[:,schema] = prediction
    out = votes
    gof = gof_spear(out,test_y)
    perf = performance(train_y,gof)
    return out,perf

In [72]:
%%time
# wall time to run: ~ 2min 20sec
# svr
svr_rbf_models =  svm_custom(train_embedutts,train_labels,scaler_texts,'rbf',0)
svr_rbf_out, svr_rbf_perf = svm_predict(svr_rbf_models,val_embedutts,train_labels,val_labels,scaler_texts)
svr_lin_models = svm_custom(train_embedutts,train_labels,scaler_texts,'linear',0)
svr_lin_out, svr_lin_perf = svm_predict(svr_lin_models,val_embedutts,train_labels,val_labels,scaler_texts)
svr_poly_models = svm_custom(train_embedutts,train_labels,scaler_texts,'poly',0)
svr_poly_out, svr_poly_perf = svm_predict(svr_poly_models,val_embedutts,train_labels,val_labels,scaler_texts)

CPU times: user 2min 27s, sys: 1.42 s, total: 2min 29s
Wall time: 2min 37s


In [73]:
print(pd.DataFrame(data=[svr_rbf_perf,svr_lin_perf,svr_poly_perf],index=['rbf','lin','poly'],columns=['svr']))

           svr
rbf   0.076675
lin   0.064361
poly  0.066954


In [74]:
%%time
# wall time to run: ~ 45sec
# svm
svm_rbf_models =  svm_custom(train_embedutts,train_labels,scaler_texts,'rbf',1)
svm_rbf_out, svm_rbf_perf = svm_predict(svm_rbf_models,val_embedutts,train_labels,val_labels,scaler_texts)
svm_lin_models = svm_custom(train_embedutts,train_labels,scaler_texts,'linear',1)
svm_lin_out, svm_lin_perf = svm_predict(svm_lin_models,val_embedutts,train_labels,val_labels,scaler_texts)
svm_poly_models = svm_custom(train_embedutts,train_labels,scaler_texts,'poly',1)
svm_poly_out, svm_poly_perf = svm_predict(svm_poly_models,val_embedutts,train_labels,val_labels,scaler_texts)

CPU times: user 54.3 s, sys: 373 ms, total: 54.7 s
Wall time: 55.9 s


In [75]:
print(pd.DataFrame(data=[svm_rbf_perf,svm_lin_perf,svm_poly_perf],index=['rbf','lin','poly'],columns=['svm']))

           svm
rbf   0.107355
lin   0.072173
poly  0.045952


> In both algorithms, the radial basis function (rbf) kernel outperformed linear and polynomial kernels. We therefore opt for the rbf kernel when predicting the labels of the test dataset.

In [76]:
%%time
# wall time to run: 4sec
def my_svm(test_X,test_y,classification):
    if classification:
        my_svm_out, my_svm_perf=svm_predict(svm_rbf_models,test_X,train_labels,test_y,scaler_texts)
    else:
        my_svm_out, my_svm_perf=svm_predict(svr_rbf_models,test_X,train_labels,test_y,scaler_texts)
    return gof_spear(my_svm_out,test_y)

output_SVC = my_svm(test_embedutts,test_labels,1)
output_SVR = my_svm(test_embedutts,test_labels,0)

CPU times: user 3.19 s, sys: 30.1 ms, total: 3.22 s
Wall time: 3.31 s


In [77]:
print('SVM Classification Prediction')
print(pd.DataFrame(data=output_SVC,index=schemas,columns=['estimate']))

SVM Classification Prediction
          estimate
Attach    0.647714
Comp      0.684661
Global    0.357601
Health    0.729181
Control        NaN
MetaCog        NaN
Others         NaN
Hopeless  0.489903
OthViews  0.476297


In [78]:
print('SVM Regression Prediction')
print(pd.DataFrame(data=output_SVR,index=schemas,columns=['estimate']))

SVM Regression Prediction
          estimate
Attach    0.675340
Comp      0.640866
Global    0.489372
Health    0.349064
Control   0.310007
MetaCog   0.114894
Others    0.185827
Hopeless  0.535979
OthViews  0.516635


In [81]:
# %%time
# # wall time to run: ~ 3min 15sec
# # bootstrap confidence intervals for SVR and SVC
# bs_svc = bootstrap(n_iterations,n_size,test_embedutts,test_labels,1,"svm")
# bs_svr = bootstrap(n_iterations,n_size,test_embedutts,test_labels,0,"svm")

In [82]:
# print(f'SVM Classification 95% Confidence Intervals')
# print(pd.DataFrame(data=np.transpose(bs_svc),index=schemas,columns=['low','high']))

In [83]:
# print(f'SVM Regression 95% Confidence Intervals')
# print(pd.DataFrame(data=np.transpose(bs_svr),index=schemas,columns=['low','high']))

## Recurrent neural networks

> We train two types of recurrent neural networks: a multilabel RNN that predicts all 9 schemas simultaneously and a set of 9 single-label RNNs that predict the labels for each schema separately. Each RNN consists of 4 layers: an embedding layer, a bidirectional LSTM layer, a dropout layer, and an output layer.

### Training Multilabel RNN
> We used as inspiration for the architecture of all RNNs the paper: Kshirsagar, R., Morris, R., & Bowman, S. (2017). Detecting and explaining crisis. arXiv preprint arXiv:1705.09585. However, we used long short-term memory (LSTM) instead of a gated recurrent unit (GRU).

In [85]:
# define multilabel model
def multilabel_model(train_X, train_y, test_X, test_y,params):
    # build the model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    #embedding layer
    model.add(e)
    #LSTM layer
    model.add(Bidirectional(LSTM(params['lstm_units'])))
    #dropout layer
    model.add(Dropout(params['dropout']))
    #output layer
    model.add(Dense(9, activation='sigmoid'))
    # compile the model
    model.compile(optimizer=params['optimizer'], loss=params['losses'], metrics=['mean_absolute_error'])
    # summarize the model
    print(model.summary())
    # fit the model
    out = model.fit(train_X, train_y, 
                    validation_data=[test_X,test_y],
                    batch_size=params['batch_size'], 
                    epochs=params['epochs'], 
                    verbose=0)
    return out, model

In [86]:
def grid_search(train_X, test_X, train_y, test_y):
    #define hyperparameter grid
    p={'lstm_units':[50,100],
       'optimizer':['rmsprop','Adam'],
       'losses':['binary_crossentropy','categorical_crossentropy','mean_absolute_error'],
       'dropout':[0.1,0.5],
       'batch_size': [32,64],
       'epochs':[5]} # changed fromm 100 to 5
    #scan the grid
    tal=talos.Scan(x=train_X,
                   y=train_y,
                   x_val=test_X,
                   y_val=test_y,
                   model=multilabel_model,
                   params=p,
                   experiment_name='multilabel_rnn',
                   print_params=True,
                   clear_session=True)
    return tal

In [87]:
# wall time to run grid search: ~ 2h 10min
#run the small grid search
%time tal = grid_search(padded_train, padded_validate, train_labels, val_labels)
#analyze the outcome
analyze_object=talos.Analyze(tal)
analysis_results = analyze_object.data
#let's have a look at the results of the grid search
print(analysis_results)

  0%|          | 0/48 [00:00<?, ?it/s]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None



  2%|▏         | 1/48 [00:25<19:57, 25.48s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


  4%|▍         | 2/48 [00:48<18:19, 23.91s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


  6%|▋         | 3/48 [01:25<22:30, 30.01s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


  8%|▊         | 4/48 [02:03<24:24, 33.29s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 10%|█         | 5/48 [02:26<21:10, 29.54s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 12%|█▎        | 6/48 [02:51<19:34, 27.96s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 15%|█▍        | 7/48 [03:32<21:54, 32.06s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 17%|█▋        | 8/48 [04:12<23:06, 34.67s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 19%|█▉        | 9/48 [04:35<20:04, 30.89s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 21%|██        | 10/48 [04:58<18:04, 28.55s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 23%|██▎       | 11/48 [05:34<18:58, 30.76s/it]

{'batch_size': 32, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 25%|██▌       | 12/48 [06:10<19:31, 32.55s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 27%|██▋       | 13/48 [06:32<17:09, 29.41s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 29%|██▉       | 14/48 [06:55<15:32, 27.44s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 31%|███▏      | 15/48 [07:30<16:21, 29.74s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 33%|███▎      | 16/48 [08:07<16:55, 31.74s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 35%|███▌      | 17/48 [08:29<14:55, 28.88s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 38%|███▊      | 18/48 [08:52<13:34, 27.14s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 40%|███▉      | 19/48 [09:28<14:21, 29.69s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 42%|████▏     | 20/48 [10:05<14:56, 32.02s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 44%|████▍     | 21/48 [10:28<13:06, 29.14s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 46%|████▌     | 22/48 [10:51<11:52, 27.40s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 48%|████▊     | 23/48 [11:27<12:29, 29.96s/it]

{'batch_size': 32, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 50%|█████     | 24/48 [12:04<12:48, 32.00s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 52%|█████▏    | 25/48 [12:21<10:35, 27.62s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 54%|█████▍    | 26/48 [12:39<09:05, 24.81s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 56%|█████▋    | 27/48 [13:06<08:55, 25.48s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 58%|█████▊    | 28/48 [13:34<08:44, 26.23s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 60%|██████    | 29/48 [13:52<07:27, 23.55s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 62%|██████▎   | 30/48 [14:10<06:34, 21.90s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 65%|██████▍   | 31/48 [14:38<06:45, 23.83s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 67%|██████▋   | 32/48 [15:06<06:43, 25.22s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 69%|██████▉   | 33/48 [17:54<16:56, 67.79s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 71%|███████   | 34/48 [18:16<12:37, 54.11s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 73%|███████▎  | 35/48 [18:45<10:07, 46.76s/it]

{'batch_size': 64, 'dropout': 0.1, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 75%|███████▌  | 36/48 [19:13<08:12, 41.01s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 77%|███████▋  | 37/48 [19:30<06:12, 33.87s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 79%|███████▉  | 38/48 [19:48<04:50, 29.09s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 81%|████████▏ | 39/48 [20:15<04:15, 28.40s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'binary_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 83%|████████▎ | 40/48 [20:43<03:45, 28.24s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 85%|████████▌ | 41/48 [21:00<02:54, 24.97s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 88%|████████▊ | 42/48 [21:18<02:17, 22.94s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 90%|████████▉ | 43/48 [21:45<02:00, 24.14s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'categorical_crossentropy', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 92%|█████████▏| 44/48 [22:13<01:41, 25.32s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 94%|█████████▍| 45/48 [22:31<01:08, 22.94s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 50, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               60400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 909       
Total params: 323,709
Trainable params: 61,309
Non-trainable params: 262,400
_________________________________________________________________
None


 96%|█████████▌| 46/48 [22:49<00:42, 21.43s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'rmsprop'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


 98%|█████████▊| 47/48 [23:16<00:23, 23.09s/it]

{'batch_size': 64, 'dropout': 0.5, 'epochs': 5, 'losses': 'mean_absolute_error', 'lstm_units': 100, 'optimizer': 'Adam'}
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200)               160800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None


100%|██████████| 48/48 [23:44<00:00, 29.67s/it]

CPU times: user 19min 22s, sys: 1min 40s, total: 21min 2s
Wall time: 23min 44s
    round_epochs  val_loss  val_mean_absolute_error      loss  \
0              5 -1.778431                 0.392824 -1.887904   
1              5 -1.860020                 0.396768 -1.732630   
2              5 -4.462915                 0.387611 -4.043934   
3              5 -3.930356                 0.415972 -3.588713   
4              5  3.452294                 0.311869  3.422147   
5              5  3.342860                 0.311814  3.345878   
6              5  3.414769                 0.308845  3.277204   
7              5  3.351859                 0.312614  3.189061   
8              5  0.315901                 0.315901  0.317915   
9              5  0.315993                 0.315993  0.318041   
10             5  0.315901                 0.315901  0.317915   
11             5  0.315921                 0.315921  0.317942   
12             5 -1.121840                 0.407464 -1.335954   
13         




In [88]:
#we choose the best model of the grid search on the basis of the MAE metric, lower values are better
mlm_model = tal.best_model(metric='mean_absolute_error', asc=True)
#to get an idea of how our best model performs, we check predictions on the validation set
prediction_mlm_val = mlm_model.predict(padded_validate)
output_mlm_val = gof_spear(prediction_mlm_val,val_labels)

In [89]:
print(pd.DataFrame(data=output_mlm_val,index=schemas,columns=['estimate']))

          estimate
Attach    0.616179
Comp      0.606625
Global    0.471799
Health    0.382942
Control   0.293365
MetaCog   0.152214
Others    0.164798
Hopeless  0.436749
OthViews  0.496660


In [90]:
#the predictions make sense considering what we got from KNN and SVM. We deploy the model.
talos.Deploy(tal,'mlm_rnn',metric='mean_absolute_error',asc=True)

Deploy package mlm_rnn have been saved.


<talos.commands.deploy.Deploy at 0x7f862972eb10>

In [92]:
#we restore the deployed Talos experiment
restore = talos.Restore('mlm_rnn.zip')  # Changed from /Data/mlm_rnn.zip
#to get the best performing parameters, we get the results of the Talos experiment
scan_results = restore.results

In [93]:
#select the row with the smallest mean absolute error
print(scan_results[scan_results.mean_absolute_error == scan_results.mean_absolute_error.min()]) 

   round_epochs  val_loss  val_mean_absolute_error      loss  \
7             5  3.351859                 0.312614  3.189061   

   mean_absolute_error  batch_size  dropout  epochs                    losses  \
7             0.308441          32      0.1       5  categorical_crossentropy   

   lstm_units optimizer  
7         100      Adam  


>We have learned that despite setting the random seed values for numpy and tensorflow, some variability remains with each training of the RNNs and our results will therefore not be 100% reproducible. To ensure that we cannot be accused of reporting just a "lucky shot", we have decided to follow the advice given in this blog post https://machinelearningmastery.com/reproducible-results-neural-networks-keras/ . We therefore train 30 multi-label neural nets with the best parameters identfied with the Talos scan. We report the mean Spearman correlations in the article. We do the same with the per-schema RNNs below.

In [94]:
def mlm_fixed(train_X, train_y, test_X, test_y):
    # build the model
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    #embedding layer
    model.add(e)
    #LSTM layer
    model.add(Bidirectional(LSTM(100)))
    #dropout layer
    model.add(Dropout(0.1))
    #output layer
    model.add(Dense(9, activation='sigmoid'))
    # compile the model
    model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['mean_absolute_error'])
    # summarize the model
    print(model.summary())
    # fit the model
    out = model.fit(train_X, train_y, 
                    validation_data=[test_X,test_y],
                    batch_size=32, 
                    epochs=100, 
                    verbose=0)
    return out, model

In [106]:
%%time
# wall time to run: ~ 1h 54min
for i in range(30):
    print("Aruna" + str(i))
    #we train the model
    res, model = mlm_fixed(padded_train, train_labels, padded_validate, val_labels)
    #we save models to files to free up working memory
    model_name = 'Data/MLMs/mlm_' + str(i)
    model.save(model_name + '.h5')

Aruna0
Model: "sequential_49"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_49 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_48 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_48 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_48 (Dense)             (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None
Aruna1
Model: "sequential_50"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_50 (Embedding)     (None, 25, 100)           262400    

Aruna10
Model: "sequential_59"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_59 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_58 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_58 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_58 (Dense)             (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None
Aruna11
Model: "sequential_60"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_60 (Embedding)     (None, 25, 100)           262400  

Aruna20
Model: "sequential_69"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_69 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_68 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_68 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_68 (Dense)             (None, 9)                 1809      
Total params: 425,009
Trainable params: 162,609
Non-trainable params: 262,400
_________________________________________________________________
None
Aruna21
Model: "sequential_70"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_70 (Embedding)     (None, 25, 100)           262400  

CPU times: user 7h 19min 49s, sys: 44min 14s, total: 8h 4min 3s
Wall time: 1d 3h 24min 57s


In [117]:
 #generate predictions with the per-schema models
def predict_schema_mlm(test_text, test_labels,fixed=None):
    if fixed is None:
        all_preds = np.zeros((test_labels.shape[0],test_labels.shape[1],1)) #Changed from 1, 30 to 1,1
        all_gofs = np.zeros((1,9)) #Changed from 30,9 to 1,9
        for j in range(1): #Changed from 30 to 1
            model_name = "Data/MLMs/mlm_" + str(j)
            model = keras.models.load_model(model_name + '.h5')
            preds = model.predict(test_text)
            gofs = gof_spear(preds,test_labels)
            all_preds[:,:,j] = preds
            all_gofs[j,:] = gofs
    else:
        model_name = "Data/MLMs/mlm_" + str(fixed)
        model = keras.models.load_model(model_name + '.h5')
        all_preds = model.predict(test_text)
        all_gofs = gof_spear(all_preds,test_labels)
    return all_gofs,all_preds

### Training Per-Schema RNNs
> We also train separate RNNs per schema. For this, we can use the output layer to compute a probability for each of the four possible labels. This way, the labels are treated as separate classes. We take over the parameter values from the multilabel model for the number of LSTM units, the dropout rate, the loss function, the evaluation metric, the batch size, and the number of epochs. To obtain the probability for each class, the units of the output layer have a softmax activation function. For the evaluation, the class with the highest probability is chosen per model. The resulting models are written to files and loaded again for prediction.

In [118]:
#define separate models
def perschema_models(train_X, train_y, test_X, test_y):
    model = Sequential()
    e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_length, trainable=False)
    model.add(e)
    model.add(Bidirectional(LSTM(100)))
    model.add(Dropout(0.1))
    model.add(Dense(4, activation='softmax'))
    # compile the model
    model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['mean_absolute_error'])
    # summarize the model
    print(model.summary())
    # fit the model
    model.fit(train_X, train_y,
              validation_data=[test_X,test_y],
              batch_size=32, 
              epochs=2, #Changed from 100 to 2
              verbose=0)
    out=model.predict(test_X)
    gof,p=scipy.stats.spearmanr(out,test_y,axis=None)
    return gof, model

In [111]:
%%time
# wall time to run: ~ 16h
#train models
for j in range(30):
    print("Aruna" + str(i))
    directory_name = "Data/PSMs/per_schema_models_" + str(j)
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
    for i in range(9):
        train_label_schema = np_utils.to_categorical(train_labels[:,i])
        val_label_schema = np_utils.to_categorical(val_labels[:,i])
        val_output_slm, model = perschema_models(padded_train,train_label_schema,padded_validate,val_label_schema)
        #we write trained models to files to free up working memory
        model_name = '/schema_model_' + schemas[i]
        save_model_under = directory_name + model_name
        model.save(save_model_under + '.h5')

Aruna0
Model: "sequential_80"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_80 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_79 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_79 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_79 (Dense)             (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_81"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_81 (Embedding)     (None, 25, 100)           262400    
______

Model: "sequential_90"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_90 (Embedding)     (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_89 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_89 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_89 (Dense)             (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_91"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_91 (Embedding)     (None, 25, 100)           262400    
_____________

Model: "sequential_100"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_100 (Embedding)    (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_99 (Bidirectio (None, 200)               160800    
_________________________________________________________________
dropout_99 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_99 (Dense)             (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_101"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_101 (Embedding)    (None, 25, 100)           262400    
___________

Model: "sequential_110"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_110 (Embedding)    (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_109 (Bidirecti (None, 200)               160800    
_________________________________________________________________
dropout_109 (Dropout)        (None, 200)               0         
_________________________________________________________________
dense_109 (Dense)            (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_111"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_111 (Embedding)    (None, 25, 100)           262400    
___________

Model: "sequential_120"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_120 (Embedding)    (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_119 (Bidirecti (None, 200)               160800    
_________________________________________________________________
dropout_119 (Dropout)        (None, 200)               0         
_________________________________________________________________
dense_119 (Dense)            (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_121"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_121 (Embedding)    (None, 25, 100)           262400    
___________

Model: "sequential_130"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_130 (Embedding)    (None, 25, 100)           262400    
_________________________________________________________________
bidirectional_129 (Bidirecti (None, 200)               160800    
_________________________________________________________________
dropout_129 (Dropout)        (None, 200)               0         
_________________________________________________________________
dense_129 (Dense)            (None, 4)                 804       
Total params: 424,004
Trainable params: 161,604
Non-trainable params: 262,400
_________________________________________________________________
None
Model: "sequential_131"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_131 (Embedding)    (None, 25, 100)           262400    
___________

KeyboardInterrupt: 

In [119]:
#load single models
def load_single_models(directory):
    single_models = []
    for i in range(9):
        model_name ='/schema_model_' + schemas[i]
        get_from = directory + model_name
        model = keras.models.load_model(get_from + '.h5')
        single_models.append(model)
    return single_models

In [130]:
#generate predictions with the per-schema models
def predict_schema_psm(test_text, test_labels,fixed=None):
    if fixed is None:
        all_preds = np.zeros((test_labels.shape[0],test_labels.shape[1],2)) # Changed from 1,30 to 1,2
        all_gofs = np.zeros((2,9)) # changed from 30,9 to 5,9
        for j in range(2): # Changed from 30 to 2
            directory_name = "Data/PSMs/per_schema_models_" + str(j)
            preds = np.zeros(test_labels.shape)
            gofs=[]
            single_models = load_single_models(directory_name)
            for i in range(9):
                model = single_models[i]
                out = model.predict(test_text)
                out = out.argmax(axis=1)
                preds[:,i] = out
                gof,p=scipy.stats.spearmanr(out,test_labels[:,i])
                gofs.append(gof)
            all_preds[:,:,j] = preds
            all_gofs[j,:] = gofs
    else:
        directory_name= "Data/PSMs/per_schema_models_" + str(fixed)
        all_preds = np.zeros(test_labels.shape)
        all_gofs = []
        single_models = load_single_models(directory_name)
        for i in range(9):
            model = single_models[i]
            out = model.predict(test_text)
            out = out.argmax(axis=1)
            all_preds[:,i] = out
            gof,p=scipy.stats.spearmanr(out,test_labels[:,i])
            all_gofs.append(gof)
    return all_gofs,all_preds    

### Generate Testset Predictions with the RNN Models

In [131]:
def my_rnn(test_X,test_y,single):
    if single:
        gof,preds=predict_schema_psm(test_X,test_y)
    else:
        gof,preds=predict_schema_mlm(test_X,test_y)
    #make a sum of all classification values
    gof_sum = np.sum(gof,axis=1)
    #sort sums
    gof_sum_sorted = np.sort(gof_sum)
    #pick element that is closest but larger than median (we have even number of elements)
    get_med_element = gof_sum_sorted[0] #Changed from 15 to 0
    #get index of median
    gof_sum_med_idx = np.where(gof_sum==get_med_element)[0]
    #choose this as the final model to use in H2 and to report in the paper
    gof_out = gof[gof_sum_med_idx]
    return np.transpose(gof_out),gof_sum_med_idx

In [132]:
%%time
# wall time to run: ~ 6min
# predicting testset with multilabel model
output_mlm,idx_mlm = my_rnn(padded_test,test_labels,0)
# predicting testset with perschema models
output_psm,idx_psm = my_rnn(padded_test,test_labels,1)

CPU times: user 32min 5s, sys: 1min 56s, total: 34min 2s
Wall time: 1h 37min 49s


In [127]:
print('RNN Multilabel Model Testset Output')
print(pd.DataFrame(data=output_mlm,index=schemas,columns=['estimate']))

RNN Multilabel Model Testset Output
          estimate
Attach    0.672925
Comp      0.652170
Global    0.474500
Health    0.360856
Control   0.305737
MetaCog   0.095281
Others    0.147344
Hopeless  0.521807
OthViews  0.496824


In [129]:
output_psm

array([], shape=(9, 0), dtype=float64)

In [128]:
print('RNN Per-Schema Testset Output')
print(pd.DataFrame(data=output_psm,index=schemas,columns=['estimate']))

RNN Per-Schema Testset Output


AssertionError: Number of manager items must equal union of block items
# manager items: 1, # tot_items: 0

In [None]:
def my_rnn_fixed(test_X,test_y,single):
    if single:
        gof,preds=predict_schema_psm(test_X,test_y,idx_psm[0])
    else:
        gof,preds=predict_schema_mlm(test_X,test_y,idx_mlm[0])
    return gof

In [None]:
%%time
# wall time to run: ~ 37min
#bootstrapping the 95% confidence intervals
bs_mlm = bootstrap(n_iterations,n_size,padded_test,test_labels,0,"rnn")
bs_psm = bootstrap(n_iterations,n_size,padded_test,test_labels,1,"rnn")

In [None]:
print(f'Multilabel RNN Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_mlm),index=schemas,columns=['low','high']))
print(f'Per-Schema RNN Classification 95% Confidence Intervals')
print(pd.DataFrame(data=np.transpose(bs_psm),index=schemas,columns=['low','high']))

In [None]:
output_psm_flat = [item for sublist in output_psm for item in sublist]
output_mlm_flat = [item for sublist in output_mlm for item in sublist]

In [None]:
print(f'Estimates of all models')
outputs = np.concatenate((output_kNN_class,output_kNN_reg,output_SVC, output_SVR, output_psm_flat, output_mlm_flat))
outputs=np.reshape(outputs,(9,6),order='F')
print(pd.DataFrame(data=outputs,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

In [None]:
print(f'Lower CIs of all models')
lower_cis = np.concatenate((bs_knn_class[0],bs_knn_reg[0],bs_svc[0], bs_svr[0], bs_psm[0], bs_mlm[0]))
lower_cis=np.reshape(lower_cis,(9,6),order='F')
print(pd.DataFrame(data=lower_cis,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

In [None]:
print(f'Upper CIs of all models')
upper_cis = np.concatenate((bs_knn_class[1],bs_knn_reg[1],bs_svc[1], bs_svr[1], bs_psm[1], bs_mlm[1]))
upper_cis=np.reshape(upper_cis,(9,6),order='F')
print(pd.DataFrame(data=upper_cis,index=schemas,columns=['kNN_class','kNN_reg','SVC','SVR','PSM','MLM']))

## Generate Dataset for Testing Hypothesis 2
Finally, we need to use the best-performing algorithm, the per-schema RNNs, to generate the predictions on the testset and write these to a file so that we can use them to test Hypothesis 2.

In [None]:
gofH2,predsH2=predict_schema_psm(padded_test,test_labels,idx_psm[0])

In [None]:
predsH2 = predsH2.astype(int)
print(predsH2[:,0:5])
diag_rho = [scipy.stats.spearmanr(predsH2[i,:], test_labels[i,0:9], nan_policy='omit')[0] for i in range(predsH2.shape[0])]


In [None]:
df_predsH2 = pd.DataFrame(data=predsH2,columns=['AttachPred','CompPred',"GlobalPred","HealthPred","ControlPred","MetaCogPred","OthersPred","HopelessPred","OthViewsPred"])
df_predsH2["Corr"] = pd.DataFrame(diag_rho)

In [None]:
print(df_predsH2.head())

In [None]:
df_predsH2.to_csv("Data/PredictionsH2.csv", sep=';', header=True, index=False, mode='w')