# Reproducibility Project
CS598 Deep Learning for Healthcare - Spring 2023

Original Paper: [Disease Prediction and Early Intervention System Based on Symptom Similarity Analysis](https://ieeexplore.ieee.org/document/8924757)<br>
Reproduction Paper (DRAFT): [Reproducibility Project for CS598 DL4H in Spring 2023](https://drive.google.com/file/d/1OlaeqqWVPh9-TqqRmcs1-KRbOKmktVPi/view?usp=sharing)

*Note: The Reproduction Paper must be accessed using a UIUC email address (i.e., @illinois.edu).*

_Contributors:_
 * Michael Pettenato - mp34@illinois.edu
 * Adam Michalsky - adamwm3@illinois.edu

## Introduction

When physicians work with patients, they often start by listening to the patient's description of their symptoms. The physician then maps these symptom statements to the symptoms that have been cataloged by the healthcare industry. Assessing the similarity between two sentences (e.g., a patient's symptom statement _and_ a cataloged symptom) is a task that maps naturally onto computing.

Sentence similarity has already been studied extensively. Deep learning models have been developed to perform this task in the healthcare domain, but each model has its pros and cons. A common drawback of similarity models is the training time they require, which is what the original paper (cited above) aimed to reduce with an approach that combines the Stanford Parser, Word2Vec embeddings, and a convolutional neural network (CNN) based model.

The original paper did not have a documented repository, but did offer pseudo-code for certain aspects of the experiment. The following notebook is a reproduction of the original paper with some ablations that will be called out. 

In [None]:
import datetime
import gensim
import numpy as np
import os.path
import pandas as pd
import spacy
import stanza
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from enum import Enum
from gensim.models import Word2Vec
import threading
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt

## About the Data

The data that will be used to train and test is sourced from [Microsoft Research Paraphrase Corpus](https://www.microsoft.com/en-us/download/details.aspx?id=52398). 


### Helper Functions

Below we define helper functions for ingesting the data used in the experiments.
* `read_file` - This will be used to read in local files containing data.

In [None]:
def read_file(file_name):
    # pd.read_csv raised a parsing error on this TSV file, e.g.:
    #   pd.read_csv('data/msr_paraphrase_train.txt', sep='\t', encoding='latin1')
    # so we split the columns manually and build the DataFrame ourselves.
    rows = []
    with open(file_name, encoding="utf8") as f:
        for line in f:
            # drop the trailing newline so it does not end up in String2
            rows.append(line.rstrip('\n').split('\t'))

    # rows[0] is the header line; replace it with our own column names
    df = pd.DataFrame(rows[1:], columns=['Quality', 'ID1', 'ID2', 'String1', 'String2'])
    return df
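
A parsing error like the one mentioned in the comment above is often caused by unbalanced double-quote characters in the sentence text, which a CSV reader treats as field quoting by default. We did not confirm that this is the cause for the MSRP files, so treat it as an assumption. The sketch below uses the standard-library `csv` module with quoting disabled on a small in-memory sample; the sample rows and the `read_tsv_rows` name are made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample in a five-column layout like the MSRP files.
# The first data row contains an unbalanced double quote.
SAMPLE = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    '1\t100\t101\tHe said "yes.\tHe agreed.\n'
    "0\t200\t201\tShe left early.\tThe shop was closed.\n"
)

def read_tsv_rows(handle):
    """Return the data rows (header skipped) with quote handling turned off."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    return rows[1:]

rows = read_tsv_rows(io.StringIO(SAMPLE))
print(len(rows), rows[0][3])
```

With pandas, the analogous option would be passing `quoting=csv.QUOTE_NONE` to `pd.read_csv`.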

## Approach

The approach described in the original paper can be divided into three separate _stages_.

1) **Data Pre-Processing** - Here we will read in the raw data, parse it using the described algorithm, and prepare the parsed data by performing embedding using `Word2Vec`.

2) **Neural Network** - Here we build the CNN-based model described in the original paper. This CNN takes the vectorized sentence pairs as input and returns a similarity assessment in [0,1]. The original paper offered no specifics on the network architecture, so our architecture is based on our own knowledge.

3) **Training & Validation** - Here we train and test the model using the data and network established in the previous stages.
<br>
<br>


<div>
    <figure>
        <img src="./images/overview.png" width="500"/>
        <figcaption align='center'>Figure 1: Overview of the Approach</figcaption>
    </figure>
</div>

## Data Pre-processing

In data pre-processing, we perform two critical functions:

   1. Read in and parse the files stored in the `./data` directory.
   2. Embed the parsed sentences using `Word2Vec` embedding.

The parsed data is then stored in a custom dataset that subclasses PyTorch's `Dataset` class. Our implementation is `MSPCDataset`. This allows us to take advantage of PyTorch's `DataLoader`, which provides batching facilities. `MSPCDataset` produces items in the format shown in the equation below.

\begin{equation} 
Item=(Sent1, Sent2, SimFlag)
\label{eq:Item}
\end{equation}

$Sent1$ and $Sent2$ have a format like $Sen$, discussed in the *Embedding with `Word2Vec`* section below, and $SimFlag$ is a binary target value, `0` being not similar and `1` being similar.
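
To make the item format concrete, here is a minimal plain-Python sketch of a dataset that returns $(Sent1, Sent2, SimFlag)$ tuples, together with a batching loop that approximates what a `DataLoader` with `shuffle=False` does (without collating into tensors). `ToyPairDataset` and `batches` are hypothetical names used only for this illustration; they are not part of PyTorch or of this notebook.

```python
from typing import Iterator, List, Sequence, Tuple

Matrix = List[List[float]]          # one row per word position
Item = Tuple[Matrix, Matrix, int]   # (Sent1, Sent2, SimFlag)

class ToyPairDataset:
    """Plain-Python stand-in that follows the __len__/__getitem__ protocol."""

    def __init__(self, sent1: Sequence[Matrix], sent2: Sequence[Matrix], flags: Sequence[int]):
        assert len(sent1) == len(sent2) == len(flags)
        self.sent1, self.sent2, self.flags = sent1, sent2, flags

    def __len__(self) -> int:
        return len(self.flags)

    def __getitem__(self, i: int) -> Item:
        return self.sent1[i], self.sent2[i], self.flags[i]

def batches(dataset, batch_size: int) -> Iterator[List[Item]]:
    """Yield consecutive batches of items in their original order."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

# Three pairs, each sentence padded to 2 positions with 3-dimensional vectors.
zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy = ToyPairDataset([zero] * 3, [zero] * 3, [1, 0, 1])
batch_sizes = [len(b) for b in batches(toy, batch_size=2)]
print(batch_sizes)
```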

### Parsing 
<br>

The original paper used the Stanford Parser to find the parts of speech for each word in a sentence. This parser has been deprecated in favor of the `stanza` parser, so we used this as a replacement for the Stanford parser.


_Ablation 1:_ After reviewing the output of the `stanza` parser, we found that we did not always agree with the subject, predicate, and object (SPO) result. We therefore decided to support alternative parsing approaches. In addition to `stanza`, we introduced support for two more methods:
* `spaCy` - Documentation and install instructions can be found at their [site](https://spacy.io/usage).
* `2 Char Stop Word Removal` - A method that skips SPO parsing and instead removes stop words and words shorter than 3 characters from the raw sentences. This corresponds to the `RAW` option below.
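
The stop-word and short-word filtering idea can be illustrated without any NLP library. The sketch below is a simplified stand-in: the stop-word list is a tiny hypothetical one, and `strip_stop_and_short` is a name invented for this example.

```python
# Hypothetical, deliberately tiny stop-word list; a real run would use a library list.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}

def strip_stop_and_short(sentence, min_len=3):
    """Lower-case, drop stop words, then drop tokens shorter than min_len."""
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_len]

kept = strip_stop_and_short("The patient is up at night and has a fever")
print(kept)  # ['patient', 'night', 'has', 'fever']
```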

To keep the code flexible, we defined an enum value for each parsing method:
 * `STANZA` - Stanford Parser (`stanza`)
 * `SPACY` - spaCy Parser
 * `RAW` - No SPO parsing; the sentences are filtered and embedded directly using `Word2Vec`.

In [None]:
class Parsing(Enum):
    STANZA=1
    SPACY=2
    RAW=3
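
As a small illustration of how an enum like this can drive the rest of the pipeline, the sketch below derives a cache file name from the parsing method. The enum is restated so the example runs on its own, and `parsed_file_name` is a hypothetical helper, not a function defined in this notebook.

```python
import os.path
from enum import Enum

class Parsing(Enum):  # restated here so the sketch runs on its own
    STANZA = 1
    SPACY = 2
    RAW = 3

def parsed_file_name(input_file, parsing_enum):
    """Hypothetical helper: derive the cache file name for a parsing method."""
    if parsing_enum == Parsing.RAW:
        return input_file  # raw sentences are read straight from the source file
    root, ext = os.path.splitext(input_file)
    return f"{root}_{parsing_enum.name.lower()}{ext}"

# Enum members can be looked up by name, e.g. from a command-line argument.
print(parsed_file_name("data/msr_paraphrase_train.txt", Parsing["SPACY"]))
```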

#### Stanford Parser (`stanza`)

As mentioned above, the original paper used the Stanford Parser to parse its raw sentences, so we were able to reproduce the custom logic used to walk the tree produced by the parser. For ease of use, we implemented a function that extracts the _subject(s)_, _predicate(s)_, and _object(s)_ from an input parse tree (Figure 2). Going forward, this is referred to as the _SPO Kernel Function_.


<div>
    <figure>
        <img src="./images/parsetree.png" width="500"/>
    </figure>
</div>

<br> 

In **Section III** of the original paper, the _SPO Kernel Function_ is discussed along with some pseudo-code. We observed discrepancies in the trunk construction algorithm, where the pseudo-code did not match the textual description. We implemented our parsing algorithm according to the textual description, which was more detailed than the pseudo-code.

###### Helper Functions

Below we have defined several functions that help find the SPO in a parsed sentence tree obtained from the `stanza` parser.

* `find_branches()` - Extract phrases from a provided tree given a label to include and an ancestor label to exclude.
* `find_subject()` - Locate a _subject_ given a noun phrase.
    * _subject_ - Any `NN` child found in a noun phrase extracted from the parse tree.
* `find_predicate()` - Locate a _predicate_ given a verb phrase.
    * _predicate_ - Any `VB` child found in a verb phrase extracted from the parse tree.
* `find_object()` - Locate an _object_ given a verb phrase.
    * _object_ - Any `NN` child of an `NP`, `PP`, or `ADJP` found in the verb phrase extracted from the parse tree. Nested `VP` children are skipped.

In [None]:
def find_branches(tree, label, not_in_label=None, ancestors=None):
    # Collect every subtree whose label matches `label`, skipping subtrees
    # that have `not_in_label` among their ancestors.
    if ancestors is None:
        ancestors = []
    branches = []
    if tree.label == label and not_in_label not in ancestors:
        branches.append(tree)
    for child in tree.children:
        branches = branches + find_branches(child, label, not_in_label, ancestors + [tree.label])

    return branches

# According to the paper, the subject is the NN child of an NP.
# We collect the words of every child whose label contains 'NN' (NN, NNS, NNP, ...).
def find_subject(noun_phrase_for_subject):
    subject = []
    for child in noun_phrase_for_subject.children:
        if 'NN' in child.label:
            subject = subject + child.leaf_labels()

    # always a list (possibly empty) so callers can concatenate the results
    return subject

def find_predicate(verb_phrase_for_predicate):
    predicate = []
    for child in verb_phrase_for_predicate.children:
        if child.label.startswith('VB'):
            predicate = predicate + child.leaf_labels()

    if len(predicate) > 0:
        return ' '.join(predicate)

    return None

def find_object(verb_phrase_for_object, parent='VP'):
    # Walk the verb phrase recursively. Nested VPs are skipped; an NN* child
    # is kept only when its direct parent is an NP, PP, or ADJP.
    objects = []
    for child in verb_phrase_for_object.children:
        if child.label == 'VP':
            continue
        if 'NN' in child.label and parent in ['NP', 'PP', 'ADJP']:
            new_objects = child.leaf_labels()
        else:
            new_objects = find_object(child, child.label)
        # append in order, without duplicates
        for new_object in new_objects:
            if new_object not in objects:
                objects.append(new_object)

    return objects

###### SPO Kernel Function

In [None]:
def find_spo(tree):
    # subjects: NN* words of noun phrases that are not nested inside a verb phrase
    noun_phrases_for_subject = find_branches(tree, label='NP', not_in_label='VP', ancestors=[])
    subject_list = []
    for noun_phrase_for_subject in noun_phrases_for_subject:
        subject_list = subject_list + find_subject(noun_phrase_for_subject)

    # predicates and objects: taken from every verb phrase in the tree
    verb_phrases = find_branches(tree, label='VP')
    predicate_list = []
    object_list = []
    for verb_phrase in verb_phrases:
        predicate = find_predicate(verb_phrase)
        if predicate is not None:
            predicate_list.append(predicate)
        # named `objects` to avoid shadowing the builtin `object`
        objects = find_object(verb_phrase)
        object_list = object_list + objects

    # dedupe each list while preserving order
    subject_list = list(dict.fromkeys(subject_list))
    predicate_list = list(dict.fromkeys(predicate_list))
    object_list = list(dict.fromkeys(object_list))

    return subject_list, predicate_list, object_list
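
The tree walk can be tried without downloading a parser model. The sketch below uses a hypothetical `Node` class that mimics only the three members the kernel function relies on (`label`, `children`, `leaf_labels()`), and a hand-built tree for one sentence. The extraction rules are a simplified version of the functions above (for example, objects are taken only from `NP` subtrees of the verb phrase), so this illustrates the idea rather than reproducing the full logic.

```python
class Node:
    """Hypothetical stand-in exposing label, children and leaf_labels()."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def leaf_labels(self):
        if not self.children:
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaf_labels()]

def word(tag, text):
    """Build a part-of-speech node with a single leaf holding the word."""
    return Node(tag, [Node(text)])

def branches(tree, label, excluded_ancestor=None, ancestors=()):
    """Collect subtrees with `label` that do not sit under `excluded_ancestor`."""
    found = []
    if tree.label == label and excluded_ancestor not in ancestors:
        found.append(tree)
    for child in tree.children:
        found += branches(child, label, excluded_ancestor, ancestors + (tree.label,))
    return found

# (S (NP (DT The) (NN patient)) (VP (VBZ reports) (NP (JJ severe) (NN headache))))
tree = Node("S", [
    Node("NP", [word("DT", "The"), word("NN", "patient")]),
    Node("VP", [word("VBZ", "reports"),
                Node("NP", [word("JJ", "severe"), word("NN", "headache")])]),
])

subjects = [leaf for np_ in branches(tree, "NP", excluded_ancestor="VP")
            for child in np_.children if "NN" in child.label
            for leaf in child.leaf_labels()]
predicates = [leaf for vp in branches(tree, "VP")
              for child in vp.children if child.label.startswith("VB")
              for leaf in child.leaf_labels()]
objects = [leaf for vp in branches(tree, "VP")
           for np_ in branches(vp, "NP")
           for child in np_.children if "NN" in child.label
           for leaf in child.leaf_labels()]
print(subjects, predicates, objects)
```

The noun phrase inside the verb phrase is excluded from the subjects because `VP` appears among its ancestors, which is the role of the `not_in_label` argument.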

#### SpaCy Parser

Similar to the Stanford Parser, a _SPO Function_ is implemented to traverse a sentence _tree_ or `doc` to locate the SPO.

###### SPO Kernel Function

In [None]:
def find_spacy_spo(doc):
    # Extract the subject, predicate, and object words from a spaCy Doc
    subject = []
    predicate = []
    obj = []

    for token in doc:
        if "subj" in token.dep_:
            # keep the whole subject phrase (the token's subtree) as text
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            subject.append(doc[start:end].text)
        elif "obj" in token.dep_ or "pcomp" in token.dep_:
            obj.append(token.text)
        elif "ROOT" in token.dep_ or "pred" in token.dep_:
            predicate.append(token.text)

    # return the local list `obj`, not the builtin `object`
    return subject, predicate, obj
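
The dependency-label rules can be illustrated on hand-written data. In the sketch below, each token is a `(text, dependency label)` pair; the labels are illustrative assumptions rather than actual parser output, and `classify` is a hypothetical name. Unlike the function above, it keeps only the subject token itself instead of the subject's whole subtree.

```python
# Hand-written (text, dependency label) pairs standing in for spaCy tokens.
# The labels are illustrative assumptions, not parser output.
tokens = [
    ("The", "det"), ("patient", "nsubj"), ("reports", "ROOT"),
    ("pain", "dobj"), ("in", "prep"), ("the", "det"), ("chest", "pobj"),
]

def classify(tokens):
    """Sort tokens into subjects, predicates and objects by dependency label."""
    subjects, predicates, objects = [], [], []
    for text, dep in tokens:
        if "subj" in dep:
            subjects.append(text)
        elif "obj" in dep or "pcomp" in dep:
            objects.append(text)
        elif "ROOT" in dep or "pred" in dep:
            predicates.append(text)
    return subjects, predicates, objects

spo = classify(tokens)
print(spo)
```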

#### Raw

No specialized _SPO Kernel Function_ is required.

#### Concurrency Support

After implementing the _SPO Functions_, the original paper moved on to embedding with `Word2Vec`. Before starting the embedding process, we decided to speed up parsing by implementing concurrency.

_Ablation 2:_ We found the parts-of-speech parsing to be slow.  In order to speed it up we implemented a multi-threaded parsing class called `SentenceProcessingThread`.    

This class instantiates a parser based on the enumerated `Parsing` types and performs the specified parsing method against the input sentences.

In [None]:
class SentenceProcessingThread(threading.Thread):
    def __init__(self, sentences, output_list, begin, end, parsing_enum=Parsing.STANZA):
        super(SentenceProcessingThread, self).__init__()
        self.sentences = sentences
        self.parsing_enum = parsing_enum

        # each thread owns its own parser instance
        if parsing_enum == Parsing.STANZA:
            self.nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', download_method=None, use_gpu=True)
        else:
            self.nlp = spacy.load('en_core_web_sm')
        self.output_list = output_list
        self.begin = begin
        self.end = end

    def trunk_construction(self, sentence, parent_label=None):
        doc = self.nlp(sentence)

        if self.parsing_enum == Parsing.SPACY:
            # spaCy docs have no constituency tree; work on the Doc directly
            subjects, predicates, objects = find_spacy_spo(doc)
        else:
            # stanza: use the constituency tree of the first sentence
            tree = doc.sentences[0].constituency
            subjects, predicates, objects = find_spo(tree)

        # str() guards against non-string items such as spaCy spans
        subjects = [str(s) for s in subjects]
        return f"{' '.join(subjects)},{' '.join(predicates)},{' '.join(objects)}"

    def run(self):
        print(f"going to process {self.begin} to {self.end}")
        for i, sentence in enumerate(self.sentences):
            new_sentence = self.trunk_construction(sentence)
            # write into the shared list at this thread's offset
            self.output_list[self.begin + i] = new_sentence

Below we implement functions to facilitate parsing using `SentenceProcessingThread`. To further improve performance, we added support for detecting and using GPUs when they are available on the host machine. If no GPU is available, the parsing process simply uses the CPU with a larger number of threads.

* `process_sentences_concurrently` - Processes the provided `sentences` across `p` threads and stores the processed sentence in the `output`.

* `preprocess_corpus` - Reads sentences in an `input_file` containing sentences to parse using the provided parsing method (`parsing_enum`) and saves the result in the `output_file`.
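
The slicing scheme behind the concurrent processing can be sketched with the standard library alone. Here `worker` is a stand-in for the parser call, and `chunk_bounds` and `process_concurrently` are hypothetical names for this illustration. Each thread writes only to its own index range of a pre-allocated list, so the output keeps the input order without extra locking. Note that pure-Python work is serialized by the interpreter lock, so any real speed-up depends on the parser spending its time in native code or on the GPU.

```python
import threading

def chunk_bounds(total, p):
    """Split range(total) into p contiguous slices; the last takes the remainder."""
    interval = total // p
    return [(i * interval, total if i == p - 1 else (i + 1) * interval)
            for i in range(p)]

def process_concurrently(sentences, worker, p=2):
    """Run worker on each sentence across p threads, preserving input order."""
    output = [None] * len(sentences)

    def run(begin, end):
        for i in range(begin, end):
            output[i] = worker(sentences[i])

    threads = [threading.Thread(target=run, args=bounds)
               for bounds in chunk_bounds(len(sentences), p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

result = process_concurrently(["a b", "c", "d e f", "g", "h i"], worker=str.upper, p=2)
print(chunk_bounds(5, 2), result)
```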

In [None]:
def process_sentences_concurrently(sentences, output, p=2, parsing_enum=Parsing.STANZA):
    # split the sentences into p contiguous slices, one per thread;
    # the last slice also takes the remainder
    sentences = list(sentences)
    total = len(sentences)
    interval = total // p
    threads = []
    for i in range(p):
        s = i*interval
        if i == p-1:
            e = total
        else:
            e = (i+1) * interval
        sentences_slice = sentences[s:e]
        # pass the parsing method through so the thread builds the right parser
        sentence_thread = SentenceProcessingThread(sentences_slice, output, s, e, parsing_enum=parsing_enum)
        sentence_thread.start()
        threads.append(sentence_thread)

    for thread in threads:
        thread.join()

def preprocess_corpus(input_file='data/msr_paraphrase_train.txt', output_file='data/msr_paraphrase_train_stanza.txt', N=None, parsing_enum=Parsing.STANZA):
    print(output_file)
    # parsing is slow, so reuse a previously written output file if there is one
    if os.path.exists(output_file):
        print(f"{output_file} already exists")
        return

    starttime = datetime.datetime.now()
    df = read_file(input_file)

    if N is None:
        N = len(df.String1)

    output1 = [None] * N
    output2 = [None] * N

    # we can process with more threads if we only have CPU
    p = 8

    if torch.cuda.is_available():
        # if cuda is available we don't need that many threads
        # and if the number of threads is set too large using cuda
        # we can get out of memory exceptions
        p = 2
        torch.cuda.empty_cache()

    process_sentences_concurrently(df.String1[:N], output1, p, parsing_enum=parsing_enum)

    # try and be careful with gpu memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    process_sentences_concurrently(df.String2[:N], output2, p, parsing_enum=parsing_enum)

    endtime = datetime.datetime.now()

    print(f"time to process {N*2} sentences is {endtime - starttime}")

    # copy the slice so the column assignments below do not write into a view of df
    parsed_df = df[:N].copy()
    parsed_df['String1'] = output1
    parsed_df['String2'] = output2

    # write the file out.  This can help in the future
    print(f"about to write out {output_file}")
    parsed_df.to_csv(output_file, sep="\t")

#### Embedding with `Word2Vec`

Each word in the sentence needs to be replaced by an embedding vector. Consistent with the original paper, we used Word2Vec with an embedding size of 50 to create sentence vectors. The following equation describes a sequence of words $W$ in vector form:

<br>
\begin{equation} 
W=(\mathbf {w}^1, \mathbf {w}^2,...,\mathbf {w}^n)
\label{eq:Word}
\end{equation}
<br>
where each $\mathbf{w}^i$ is an embedding vector and the superscript $i$ gives the position of the word in the sequence.
<br><br>
The next equation shows how the sentence is further organized:
<br>
\begin{equation} 
Sen=(\mathbf {S}^{\mathrm {T}}, \mathbf {P}^{\mathrm {T}}, \mathbf {O}^{\mathrm {T}})
\label{eq:Sent}
\end{equation}
<br>
where $S^T$, $P^T$, $O^T$ represent the subject, predicate, and object words, all with a similar structure to $W$.
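
As a minimal sketch of the $Sen$ structure, the example below looks up a vector for each subject, predicate, and object word and concatenates the three groups in that order. The two-dimensional `EMBEDDINGS` table and the `embed` and `build_sen` helpers are hypothetical and exist only for this illustration.

```python
# Hypothetical 2-dimensional embeddings; the notebook uses 50 or 300 dimensions.
EMBEDDINGS = {
    "patient": [0.1, 0.2],
    "reports": [0.3, 0.4],
    "headache": [0.5, 0.6],
}

def embed(words):
    """Map a list of words to a list of embedding vectors (the W structure)."""
    return [EMBEDDINGS[w] for w in words]

def build_sen(subjects, predicates, objects):
    """Concatenate subject, predicate and object vectors in S, P, O order."""
    return embed(subjects) + embed(predicates) + embed(objects)

sen = build_sen(["patient"], ["reports"], ["headache"])
print(sen)  # [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
```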


The original paper did not provide details on how their `Word2Vec` model was trained. This is the motivation for our next ablation.

_Ablation 3_: Due to the uncertainty of how the original paper's `Word2Vec` model was trained, we opted to try two different models:
1) A Word2Vec model that is trained on the MSRP corpus.
2) A pre-trained Word2Vec model.

##### Pre-trained Model Details

For our comparison, we chose the `word2vec-google-news-300` model, a Word2Vec model trained on Google News data that can be downloaded through `gensim`. For further documentation, please see their [website](https://radimrehurek.com/gensim/models/word2vec.html).

**WARNING:** The following code block downloads the pre-trained model mentioned above and takes several minutes to complete!

Please be advised that some default Jupyter settings need to be modified to ensure a successful download. We set `c.NotebookApp.iopub_data_rate_limit` to 100000000.

Instructions on how to create a configuration file and to modify the variable can be found [here](https://stackoverflow.com/questions/43490495/how-to-set-notebookapp-iopub-data-rate-limit-and-others-notebookapp-settings-in).

In [None]:
import gensim.downloader as api
pretrained_word2vec_path = api.load("word2vec-google-news-300", return_path=True)
print(pretrained_word2vec_path)

##### Fake Word Embedding

When using a pre-trained Word2Vec model, some words may not appear in the model's vocabulary. In our experiment this occurred with people's names, which appeared in sentences from news articles. There are a few ways to handle this: for instance, you could ignore the out-of-vocabulary word, or you could train Word2Vec with that additional information in the corpus. Because the missing words were mostly first names or surnames, we decided to generate a fake word embedding. The requirement for the fake embedding was a vector of real numbers drawn from between -0.5 and 0.5 and then scaled so that its Euclidean norm equals 1.


In [None]:
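fake_embeddings = {}
def get_fake_embedding(word, size):
    # reuse the vector generated earlier so a word always maps to the same embedding
    if word in fake_embeddings:
        return fake_embeddings[word]

    # callers may pass the (max_sentence_len, embedding_size) shape; use the last dimension
    dim = size[-1] if isinstance(size, (tuple, list)) else size
    embedding = np.random.rand(dim) - 0.5

    # scale to unit length using the Euclidean norm of the array
    norm = np.linalg.norm(embedding)
    embedding /= norm

    # remember the result for later lookups
    fake_embeddings[word] = embedding
    return embedding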
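
The two properties we rely on (unit Euclidean norm, and the same vector every time a word is requested) can be checked with a standard-library version of the same idea. `fake_embedding` and `_cache` below are hypothetical names for this sketch, which uses plain lists instead of NumPy arrays.

```python
import math
import random

_cache = {}

def fake_embedding(word, dim, rng=random):
    """Return a cached random unit vector; components are drawn from [-0.5, 0.5) before scaling."""
    if word not in _cache:
        raw = [rng.random() - 0.5 for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in raw))
        _cache[word] = [x / norm for x in raw]
    return _cache[word]

vec = fake_embedding("pettenato", dim=50)
norm = math.sqrt(sum(x * x for x in vec))
same_object = fake_embedding("pettenato", dim=50) is vec
print(round(norm, 6), same_object)
```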

Below we implement the following functions:

* `generate_word2vec_model(corpus)` - Create a `Word2Vec` model given a `corpus`.
* `sentence_embeddings(w2v_model, sentence, size)` - Embed a `sentence` into a zero-padded matrix of shape `size`, i.e. `(max_sentence_len, embedding_size)`, using a provided `Word2Vec` model, `w2v_model`.
* `corpus_embeddings(model, corpus, max_sentence_len, embedding_size)` - Embed a `corpus` with a maximum sentence length of `max_sentence_len` using a provided model, `model`.
* `init_word2vec(train_input_file, test_input_file, parsing_enum)` - Provided `train_input_file`, `test_input_file`, and a parsing method identified by `parsing_enum`, this function returns a tuple containing a `Word2Vec` model trained on the MSRP corpus along with the maximum sentence length, `max_sentence_len`, over all sentences in the corpus.

In [None]:
# Function is broken out for testing purposes
def generate_word2vec_model(corpus):
    # Creating the Word2Vec model
    model = Word2Vec(sentences=corpus, min_count=1, window=2, vector_size=50)

    return model


In [None]:
# Function is broken out for testing purposes
def sentence_embeddings(w2v_model, sentence, size):
    # size is (max_sentence_len, embedding_size); rows past the end of the
    # sentence stay zero and act as padding
    np_embedding = np.zeros(size)
    for i, word in enumerate(sentence):
        if hasattr(w2v_model, "wv"):
            # a Word2Vec model trained here keeps its vectors in .wv
            np_embedding[i] = w2v_model.wv.get_vector(word)
        else:
            # pre-trained vectors are queried directly and may miss some words
            try:
                np_embedding[i] = w2v_model.get_vector(word)
            except KeyError:
                print(f"Word {word} was not found in the vocabulary")
                np_embedding[i] = get_fake_embedding(word, size)
                print("Generated a fake embedding ", np_embedding[i])

    return np_embedding

In [None]:
def corpus_embeddings(model, corpus, max_sentence_len, embedding_size=50):
    # result has shape (number of sentences, max_sentence_len, embedding_size)
    corpus_size = len(corpus)
    embedding_matrix = np.zeros((corpus_size, max_sentence_len, embedding_size))
    for i, sentence in enumerate(corpus):
        embedding_matrix[i] = sentence_embeddings(model, sentence, size=(max_sentence_len, embedding_size))

    return embedding_matrix
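
The zero-padding that gives every sentence the same shape can be shown in isolation. The sketch below uses plain lists and a hypothetical `pad_sentence` helper: word vectors fill the first rows, and the remaining rows up to `max_len` are zeros.

```python
def pad_sentence(vectors, max_len, dim):
    """Return max_len rows of length dim: the word vectors first, zero rows after."""
    if len(vectors) > max_len:
        raise ValueError("sentence is longer than max_len")
    padding = [[0.0] * dim for _ in range(max_len - len(vectors))]
    return [list(v) for v in vectors] + padding

padded = pad_sentence([[0.1, 0.2], [0.3, 0.4]], max_len=4, dim=2)
print(len(padded), padded[-1])  # 4 [0.0, 0.0]
```

This is why `max_sentence_len` is computed over the training and test sentences together: a sentence longer than the padded length would not fit into the matrix.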

In [None]:
def init_word2vec(train_input_file, test_input_file, parsing_enum=Parsing.STANZA):

    if parsing_enum == Parsing.RAW:
        # raw sentences need no SPO parsing, so read the source files directly
        train_df = read_file(train_input_file)
        test_df = read_file(test_input_file)
    else:
        # parsed sentences are cached next to the input, e.g. *_stanza.txt or *_spacy.txt
        suffix = "stanza" if parsing_enum == Parsing.STANZA else "spacy"
        frames = []
        for input_file in (train_input_file, test_input_file):
            file_parts = os.path.splitext(input_file)
            output_file = f"{file_parts[0]}_{suffix}{file_parts[1]}"
            print(f"About to preprocess {suffix} data")
            preprocess_corpus(input_file=input_file, output_file=output_file, parsing_enum=parsing_enum)
            print(f"Done preprocessing {suffix} data")
            frames.append(pd.read_csv(output_file, sep="\t"))
        train_df, test_df = frames

    # tokenize every sentence into a list of lower-cased words
    train_sentences1 = train_df.String1.apply(gensim.utils.simple_preprocess)
    train_sentences2 = train_df.String2.apply(gensim.utils.simple_preprocess)
    test_sentences1 = test_df.String1.apply(gensim.utils.simple_preprocess)
    test_sentences2 = test_df.String2.apply(gensim.utils.simple_preprocess)

    # train on train and test sentences together so every word has a vector
    corpus = pd.concat([train_sentences1, train_sentences2, test_sentences1, test_sentences2], ignore_index=True)
    max_sentence_len = corpus.apply(len).max()

    word2vec = generate_word2vec_model(corpus)

    return word2vec, max_sentence_len


#### Custom Dataset: `MSPCDataset`

Our custom dataset is implemented as follows:

In [None]:
# Dataset for the MSPC dataset
class MSPCDataset(Dataset):
    """
    Arguments:
        tsv_file (string): path to the tsv file with sentence pairs and their quality score
        w2v_model: trained Word2Vec model or pre-trained KeyedVectors used for embedding
        max_sentence_length (int): number of word positions each sentence is padded to
        num_records (int): number of records to load.  Defaults to None which is all
        parsing_enum (Parsing): parsing method applied to the sentences
        embedding_size (int): dimension of the word vectors in w2v_model
    """
    def __init__(self, tsv_file, w2v_model, max_sentence_length, num_records=None, parsing_enum=Parsing.STANZA, embedding_size=50):

        self.max_sentence_len = max_sentence_length
        self.w2v_model = w2v_model

        if parsing_enum in (Parsing.STANZA, Parsing.SPACY):
            # parse the sentences (or reuse the cached output file) and load the result
            suffix = "stanza" if parsing_enum == Parsing.STANZA else "spacy"
            file_parts = os.path.splitext(tsv_file)
            output_file = f"{file_parts[0]}_{suffix}{file_parts[1]}"
            print(f"About to preprocess {suffix} data")
            preprocess_corpus(input_file=tsv_file, output_file=output_file, parsing_enum=parsing_enum)
            print(f"Done preprocessing {suffix} data")
            df = pd.read_csv(output_file, sep="\t")
        else:
            df = read_file(tsv_file)

        if num_records is not None:
            df = df[:num_records]

        processed_string1 = df.String1
        processed_string2 = df.String2
        # read_file returns Quality as text, so cast to int for use as a target
        self.quality = df.Quality.astype(int).to_numpy()

        if parsing_enum == Parsing.RAW:
            # drop stop words, then words shorter than 3 characters
            processed_string1 = processed_string1.apply(gensim.parsing.preprocessing.remove_stopwords)
            processed_string2 = processed_string2.apply(gensim.parsing.preprocessing.remove_stopwords)
            processed_string1 = processed_string1.apply(lambda x: gensim.parsing.preprocessing.strip_short(x, minsize=3))
            processed_string2 = processed_string2.apply(lambda x: gensim.parsing.preprocessing.strip_short(x, minsize=3))

        # tokenize every sentence into a list of lower-cased words
        processed_string1 = processed_string1.apply(gensim.utils.simple_preprocess)
        processed_string2 = processed_string2.apply(gensim.utils.simple_preprocess)

        # embed and zero-pad: shape (records, max_sentence_len, embedding_size)
        self.sentences_embeddings1 = corpus_embeddings(self.w2v_model, processed_string1, max_sentence_len=self.max_sentence_len, embedding_size=embedding_size)
        self.sentences_embeddings2 = corpus_embeddings(self.w2v_model, processed_string2, max_sentence_len=self.max_sentence_len, embedding_size=embedding_size)

        print(f"Number of sentences processed in the String1 column: {len(processed_string1)}")
        print(f"Number of sentences processed in the String2 column: {len(processed_string2)}")

    def __len__(self):
        return len(self.sentences_embeddings1)

    def __getitem__(self, i):
        # (Sent1, Sent2, SimFlag)
        return torch.FloatTensor(self.sentences_embeddings1[i]), torch.FloatTensor(self.sentences_embeddings2[i]), self.quality[i]

    def get_max_sentence_length(self):
        return self.max_sentence_len

#### Pre-processing Validations

<br>
Below we simply validate that the output is what we expect for each parsing method.
<br>

##### Stanford Parser (`stanza`) Validations


_MSRP Corpus Trained Word2Vec:_

In [None]:
def test_dataset():
    word2vec, max_sentence_length = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', parsing_enum=Parsing.STANZA)
    dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec, max_sentence_length, 10)
    assert len(dataset) == 10

    x1, y1, quality = dataset[0]
    assert x1.shape[0] == max_sentence_length
    assert x1.shape[1] == 50

test_dataset()

_Pre-trained Word2Vec (Google News Vectors):_

In [None]:
def test_dataset_with_pretrained_word2vec():
    word2vec, max_sentence_length = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', parsing_enum=Parsing.STANZA)
    print("Initiating a pretrained model for word2vec... this may take a while")
    #word2vec_pretrained = gensim.models.keyedvectors.load_word2vec_format('data/w2v/Google-news-vectors.bin.gz', binary=True)
    word2vec_pretrained = gensim.models.keyedvectors.load_word2vec_format(pretrained_word2vec_path, binary=True)
    print("Done initiating the pretrained model for word2vec")
    dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_pretrained, max_sentence_length, 10, embedding_size=300)
    assert len(dataset) == 10
    #
    x1, y1, quality = dataset[0]
    assert x1.shape[0] == max_sentence_length
    assert x1.shape[1] == 300

test_dataset_with_pretrained_word2vec()

##### SpaCy Parser Validations

In [None]:
def test_dataset_spacy_sentences():
    word2vec_spacy, max_sentence_length_spacy = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', parsing_enum=Parsing.SPACY)
    dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_spacy, max_sentence_length_spacy, 10, parsing_enum=Parsing.SPACY)
    assert len(dataset) == 10

    x1, y1, quality = dataset[0]
    assert x1.shape[0] == max_sentence_length_spacy
    assert x1.shape[1] == 50

test_dataset_spacy_sentences()

##### Raw Parser Validations

In [None]:
def test_dataset_raw_sentences():
    word2vec_raw, max_sentence_length_raw = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', Parsing.RAW)
    dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_raw, max_sentence_length_raw, 10, parsing_enum=Parsing.RAW)
    assert len(dataset) == 10

    x1, y1, quality = dataset[0]
    assert x1.shape[0] == max_sentence_length_raw
    assert x1.shape[1] == 50

test_dataset_raw_sentences()

### Dataloaders

Create training and test dataloaders for sentences parsed with...
<br>
1. The stanza parts-of-speech parser and a Word2Vec model trained on the MSRP corpus

In [None]:
word2vec, max_sentence_length = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt')

train_dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec, max_sentence_length)
test_dataset = MSPCDataset('data/msr_paraphrase_test.txt', word2vec, max_sentence_length)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

2. The stanza parts-of-speech parser and a Word2Vec model loaded with the "Google-news-vectors" pretrained model.

In [None]:
print("Initiating a pretrained model for word2vec... this may take a while")
word2vec_pretrained = gensim.models.KeyedVectors.load_word2vec_format(pretrained_word2vec_path, binary=True)
print("Done initiating the pretrained model for word2vec")

pretrained_train_dataset = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_pretrained, max_sentence_length, embedding_size=300)
pretrained_test_dataset = MSPCDataset('data/msr_paraphrase_test.txt', word2vec_pretrained, max_sentence_length, embedding_size=300)


pretrained_train_dataloader = DataLoader(pretrained_train_dataset, batch_size=64, shuffle=False)
pretrained_test_dataloader = DataLoader(pretrained_test_dataset, batch_size=64, shuffle=False)

3. The SpaCy parts-of-speech parser and a Word2Vec model trained on the MSRP corpus. 

In [None]:
word2vec_spacy_sentences, max_sentence_length_spacy_sentences = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', parsing_enum=Parsing.SPACY)
train_dataset_spacy_sentences = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_spacy_sentences, max_sentence_length_spacy_sentences, parsing_enum=Parsing.SPACY)
test_dataset_spacy_sentences = MSPCDataset('data/msr_paraphrase_test.txt', word2vec_spacy_sentences, max_sentence_length_spacy_sentences, parsing_enum=Parsing.SPACY)

train_dataloader_spacy_sentences = DataLoader(train_dataset_spacy_sentences, batch_size=64, shuffle=False)
test_dataloader_spacy_sentences = DataLoader(test_dataset_spacy_sentences, batch_size=64, shuffle=False)

4. Raw sentences with stop words and words of 2 or fewer characters removed, and a Word2Vec model trained on the MSRP corpus.

In [None]:
word2vec_raw_sentences, max_sentence_length_raw_sentences = init_word2vec('data/msr_paraphrase_train.txt', 'data/msr_paraphrase_test.txt', parsing_enum=Parsing.RAW)
train_dataset_raw_sentences = MSPCDataset('data/msr_paraphrase_train.txt', word2vec_raw_sentences, max_sentence_length_raw_sentences, parsing_enum=Parsing.RAW)
test_dataset_raw_sentences = MSPCDataset('data/msr_paraphrase_test.txt', word2vec_raw_sentences, max_sentence_length_raw_sentences, parsing_enum=Parsing.RAW)

train_dataloader_raw_sentences = DataLoader(train_dataset_raw_sentences, batch_size=64, shuffle=False)
test_dataloader_raw_sentences = DataLoader(test_dataset_raw_sentences, batch_size=64, shuffle=False)

## Neural Network

The original paper noted that a CNN was used but only provided details on the pooling method, _Dynamic K-Max Pooling_, and the sentence similarity calculations. The final architecture used for our neural network, `SentenceSimilarityCNN2`, is summarized in the table under _Network Architecture_ below.

### Dynamic K-Max Pooling

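The original paper used a dynamic k-max pooling method in its model. The value of _k_ is determined by the following equation, where $k_{top}$ is a fixed lower bound on the number of features kept, $L$ is the total number of convolutional layers, $l$ is the index of the current convolutional layer, and $|s|$ is the sentence length.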

\begin{equation*} k=\max \left({k_{top},\left \lceil{ \frac {L-l}{L} \left |{ s }\right | }\right \rceil }\right)\end{equation*}
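
To make the equation concrete, the following minimal sketch evaluates it for a few illustrative inputs. The helper name `dynamic_k` and the example values are assumptions chosen for illustration; they are not taken from the original paper.

```python
import math

def dynamic_k(k_top: int, l: int, L: int, sentence_length: int) -> int:
    """k = max(k_top, ceil((L - l) / L * |s|))."""
    # Multiply before dividing so exact multiples are not disturbed by float rounding.
    return max(k_top, math.ceil((L - l) * sentence_length / L))

# Illustrative values: 3 convolutional layers and k_top = 3.
k_long_first = dynamic_k(k_top=3, l=1, L=3, sentence_length=19)   # ceil(38 / 3) = 13
k_long_second = dynamic_k(k_top=3, l=2, L=3, sentence_length=19)  # ceil(19 / 3) = 7
k_short = dynamic_k(k_top=3, l=2, L=3, sentence_length=4)         # ceil(4 / 3) = 2, raised to k_top = 3
```

Deeper layers (larger $l$) keep fewer values, and $k_{top}$ guarantees that short sentences still contribute a fixed minimum number of features.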

While the pooling method was discussed, the implementation details were not provided. After some research, we based our implementation on the approach described by [Kalchbrenner et al. (2014)](https://arxiv.org/pdf/1404.2188.pdf).

We implemented two custom layers for our network to support _dynamic k-max pooling_:

* `DynamicKMaxPoolId` - K-max pooling function where $k$ is defined as a function of the current depth in the network. In the final network it is applied after the first convolution.
* `KMaxPool1d` - Performs the final pooling step before the result is passed through the fully-connected layer.

In [None]:
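import math

class DynamicKMaxPoolId(nn.Module):
    def __init__(self, k, l, L):
        super(DynamicKMaxPoolId, self).__init__()
        self.k = k  # k_top: lower bound on the number of values kept
        self.l = l  # index of the current convolutional layer
        self.L = L  # total number of convolutional layers

    def forward(self, x, sentence_length):
        # x shape: (batch_size, num_channels, sequence_length)
        # k = max(k_top, ceil((L - l) / L * |s|)), matching the equation above
        k = max(self.k, math.ceil((self.L - self.l) * sentence_length / self.L))
        # torch.topk returns the k largest values sorted by value, not by position
        k_max_values, k_max_indices = torch.topk(x, k, dim=2)
        return k_max_values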


In [None]:
class KMaxPool1d(nn.Module):
    def __init__(self, k):
        super(KMaxPool1d, self).__init__()
        self.k = k

    def forward(self, x):
        # input shape (batch_size, num_channels, sequence_length)
        # output shape (batch_size, num_channels, k)
        k_max_values, k_max_indices  = torch.topk(x, self.k, dim=2)
        return k_max_values
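
One detail worth noting: `torch.topk` returns the selected values sorted by magnitude, whereas the k-max pooling described by Kalchbrenner et al. keeps the selected values in their original sequence order. The sketch below shows one way to preserve that order. It is an illustrative variant and not the layer used in this notebook; `k_max_pool_keep_order` is a hypothetical helper name.

```python
import torch

def k_max_pool_keep_order(x: torch.Tensor, k: int) -> torch.Tensor:
    # x shape: (batch_size, num_channels, sequence_length)
    # Find the positions of the k largest values along the sequence axis.
    top_indices = torch.topk(x, k, dim=2).indices
    # Sort those positions so the values come back in sequence order.
    ordered_indices = top_indices.sort(dim=2).values
    return x.gather(2, ordered_indices)

x = torch.tensor([[[1.0, 3.0, 5.0, 2.0, 4.0]]])
by_value = torch.topk(x, 3, dim=2).values   # sorted by magnitude: 5, 4, 3
by_position = k_max_pool_keep_order(x, 3)   # sequence order: 3, 5, 4
```

Whether order preservation matters depends on the layers that follow; a convolution applied after pooling is sensitive to the order of its inputs.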

### Network Architecture: `SentenceSimilarityCNN2`


Layers | Configuration | Activation Function
--- | --- | ---
Conv1d | input channels = `embedding_dim`, output channels = `num_filters` | ReLU*
DynamicKMaxPoolId | (k=3, l=1, L=3) | -
Conv1d | input channels = `num_filters`, output channels = `num_filters * 2` | ReLU*
KMaxPool1d | (k=3) | -
Linear | input = `k * num_filters * 2`, output = `hidden_dim` | Sentence Similarity


<br>

_* ReLU was chosen as the activation function for its ability to introduce sparsity into the network_
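
The table can be read as a sequence of shape changes along the sequence axis. The sketch below traces those lengths with plain arithmetic, assuming `kernel_size=3`, `padding=1`, `stride=1`, a padded length of 19, and a true sentence length of 15. These values and the helper name `conv1d_out_len` are illustrative assumptions; the dynamic pooling step follows the ceiling form of the equation.

```python
import math

def conv1d_out_len(length: int, kernel_size: int, padding: int = 1, stride: int = 1) -> int:
    # Standard output-length formula for a 1D convolution (dilation of 1).
    return (length - kernel_size + 2 * padding) // stride + 1

max_sentence_length = 19  # padded sequence length (illustrative)
sentence_length = 15      # number of non-padding tokens (illustrative)
num_filters, hidden_dim, k_top = 100, 300, 3

after_conv1 = conv1d_out_len(max_sentence_length, kernel_size=3)     # 19
after_pool1 = max(k_top, math.ceil((3 - 1) * sentence_length / 3))   # 10
after_conv2 = conv1d_out_len(after_pool1, kernel_size=3)             # 10
after_kmax = k_top                                                   # 3
flattened = after_kmax * num_filters * 2                             # 600 inputs to the linear layer
```

With these values each sentence is flattened to 600 features before the linear layer maps it to a `hidden_dim`-sized vector.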

#### Sentence Similarity


The output of the network is a `Tensor` containing a similarity score defined by the equations below. As you can see, Manhattan distance is used to calculate the similarity between the sentence pairs.



 \begin{align*} Man(\vec V_{x}, \vec V_{y})=&\left |{ x_{1}-y_{1} }\right |\! +\! \left |{ x_{2}-y_{2} }\right | \!+ \!\ldots \!+ \!\left |{ x_{n}-y_{n} }\right |
 \\ score=&e^{-Man(\vec V_{x}, \vec V_{y})},\quad score\in [{0,1}] \end{align*}
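
As a small worked example of the two equations, the sketch below computes the score for two pairs of toy vectors. The function name `manhattan_similarity` and the input vectors are assumptions for illustration only.

```python
import math

def manhattan_similarity(vx, vy):
    # Man(Vx, Vy): sum of absolute element-wise differences.
    distance = sum(abs(a - b) for a, b in zip(vx, vy))
    # score = exp(-Man): identical vectors score 1, distant vectors approach 0.
    return math.exp(-distance)

identical = manhattan_similarity([0.2, 0.5], [0.2, 0.5])  # distance 0
different = manhattan_similarity([1.0, 0.0], [0.0, 1.0])  # distance 2
```

Because the exponent is never positive, the score decreases monotonically as the two sentence vectors move apart.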

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceSimilarityCNN2(nn.Module):
    def __init__(self, embedding_dim, num_filters, filter_size, hidden_dim):
        super(SentenceSimilarityCNN2, self).__init__()

        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=num_filters, kernel_size=filter_size, padding=1)
        self.conv2 = nn.Conv1d(in_channels= num_filters, out_channels=num_filters * 2, kernel_size=filter_size, padding=1)
        #self.conv3 = nn.Conv1d(in_channels= num_filters * 2, out_channels=num_filters * 3, kernel_size=filter_size, padding=1)

        self.pool1 = DynamicKMaxPoolId(k=3, l=1, L=3)

        #self.pool2 = DynamicKMaxPoolId(k=3, l=2, L=3)

        self.k = 3
        self.kmaxPool1d = KMaxPool1d(k=self.k)

        #self.fc1 = nn.Linear(self.k * num_filters , hidden_dim)
        self.fc1 = nn.Linear(self.k * num_filters * 2, hidden_dim)

    def forward(self, input1_embedded, input2_embedded):

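        # input: input1_embedded is sentence 1 and input2_embedded is sentence 2
        # input shape: (batch_size, max_sentence_length, embedding_size)
        # each sentence is encoded as a hidden_dim-dimensional vector
        # output: similarity score for each sentence pair, shape (batch_size,)

        # Find the sentence lengths: the last non-zero position along the sequence axis
        # across the whole batch, plus one (assumes zero padding)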
        sent_length1 = (torch.nonzero(input1_embedded).max(dim=0).values[1] + 1).item()
        sent_length2 = (torch.nonzero(input2_embedded).max(dim=0).values[1] + 1).item()
        #print(f"sent length1: {sent_length1} AND sent length2: {sent_length2}")

        # Convolution
        #print(input1_embedded.shape)
        # input shape (batch_size, max_sentence_length, embedding_size)
        # permuted shape (batch_size, embedding_size, max_sentence_length)
        # output shape (batch_size, out_channels, (max_sentence_length - kernel_size + 2*padding)/stride + 1)
        # e.g. if max_sentence_length=19, kernel_size=3, padding=1, stride=1 and out_channels=64,
        #      the output length is (19-3+2*1)/1 + 1 = 19 => (64, 64, 19) for a batch of 64
        input1_embedded = self.conv1(input1_embedded.permute(0, 2, 1))
        input2_embedded = self.conv1(input2_embedded.permute(0, 2, 1))
        #print(f"output of conv1: {input1_embedded.shape}")

        input1_embedded = torch.relu(input1_embedded)
        input2_embedded = torch.relu(input2_embedded)

        # pool1 is dynamic k-max pooling and is a function of the following formula
        # max(k-top, (L-l)/L * |s|), where L is the total number of convolutional layers, l is the current
        # convolutional layer, and |s| is the sentence length. k-top serves as a lower bound: we always
        # keep at least the k-top most important features.
        input1_embedded = self.pool1(input1_embedded, sent_length1)
        input2_embedded = self.pool1(input2_embedded, sent_length2)
        #print(f"output of pool1: {input1_embedded.shape}")

        input1_embedded = self.conv2(input1_embedded)
        input2_embedded = self.conv2(input2_embedded)
        #print(f"output of conv2: {input1_embedded.shape}")

        #input1_embedded = torch.relu(input1_embedded)
        #input2_embedded = torch.relu(input2_embedded)

        #input1_embedded = self.pool2(input1_embedded, sent_length1)
        #input2_embedded = self.pool2(input2_embedded, sent_length2)
        #print(f"output of pool2: {input1_embedded.shape}")

        #input1_embedded = self.conv3(input1_embedded)
        #input2_embedded = self.conv3(input2_embedded)
        #print(f"output of conv3: {input1_embedded.shape}")

        input1_embedded = torch.relu(input1_embedded)
        input2_embedded = torch.relu(input2_embedded)

        input1_embedded = self.kmaxPool1d(input1_embedded)
        input2_embedded = self.kmaxPool1d(input2_embedded)
        #print(f"output of k-max pool: {input1_embedded.shape}")

        input1_embedded = input1_embedded.view(input1_embedded.shape[0], input1_embedded.shape[1] * input1_embedded.shape[2])
        input2_embedded = input2_embedded.view(input2_embedded.shape[0], input2_embedded.shape[1] * input2_embedded.shape[2])

        #kmax_input1_embedded = self.kmaxPool1d(input1_embedded)
        #input1_embedded = F.max_pool1d(input1_embedded, input1_embedded.shape[2]).squeeze(2)
        #input2_embedded = F.max_pool1d(input2_embedded, input2_embedded.shape[2]).squeeze(2)
        #print(f"output of max pool: {input1_embedded.shape}")
        #print(f"output of kmax pool: {kmax_input1_embedded.shape}")

        input1_embedded = self.fc1(input1_embedded)
        input2_embedded = self.fc1(input2_embedded)
        #print(input1_embedded.shape)

        man_dist = torch.sum(torch.abs(input1_embedded - input2_embedded), axis=1)
        # sentence1_mean = torch.mean(x1, axis=1)
        # sentence2_mean = torch.mean(x2, axis=1)
        # man_dist = torch.sum(torch.abs(sentence1_mean - sentence2_mean), axis=1)
        # print(man_dist.shape)

        return torch.exp(-man_dist)


        #input2_embedded = self.conv1(input2_embedded.permute(0, 2, 1))


        # Max pooling
        # input1_pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in input1_conv]
        # input2_pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in input2_conv]
        #
        # # Concatenate and flatten
        # input1_concat = torch.cat(input1_pooled, dim=1)
        # input2_concat = torch.cat(input2_pooled, dim=1)
        #
        # # Concatenate the two sentence representations
        # sentence_similarity = torch.cat([input1_concat, input2_concat], dim=1)
        #
        # # Dense layers
        # #sentence_similarity = self.dropout(F.relu(self.fc1(sentence_similarity)))
        # #sentence_similarity = self.fc2(sentence_similarity)
        #
        # return torch.sigmoid(sentence_similarity)

## Training & Results

Below we define a custom `train` function that facilitates the training process as well as a custom evaluation function `eval_model` to allow flexibility when evaluating the output of our model.


### Training Details

#### Hyperparameters

The hyperparameters are as follows:

* `embedding_dim` = `50`
* `num_filters` = `100`
* `filter_size` = `3`
* `hidden_dim` = `300`
* `dropout` = `0.5`
* `n_epochs` = `80`
* `lr` = `0.1` (the default learning rate of the `train` function)

#### Loss Function

The mean squared error (MSE) loss is used to calculate the loss between the expected value of the similarity flag and the predicted value. The loss is then used to update the weights during the back-propagation phase.
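
As a quick illustration of the loss, the sketch below computes the mean squared error by hand for three made-up similarity scores and their labels. The helper name `mse` and the numbers are illustrative assumptions; the notebook itself uses `nn.MSELoss`.

```python
def mse(predictions, targets):
    # Average of the squared differences between predictions and targets.
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

scores = [0.9, 0.2, 0.6]  # predicted similarity scores (made up)
labels = [1, 0, 1]        # 1 = paraphrase, 0 = not a paraphrase
loss = mse(scores, labels)  # (0.01 + 0.04 + 0.16) / 3
```

A confident wrong prediction is penalized much more heavily than a near miss because the error is squared.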

#### Optimizer

Stochastic gradient descent (SGD) is an iterative optimization algorithm that searches for optimal weights by taking small steps in the direction opposite to the gradient of the loss. The size of each step is defined by the learning rate, `lr=1e-1` in the `train` function below, which was based on the configuration information described in the original paper.
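
The update rule can be illustrated on a one-parameter toy problem. The sketch below minimizes $f(w) = (w - 3)^2$ with a learning rate of `0.1`; the function, starting point, and helper name `sgd_step` are assumptions for illustration and are unrelated to the model's actual loss surface.

```python
def sgd_step(w: float, grad: float, lr: float) -> float:
    # Move against the gradient, scaled by the learning rate.
    return w - lr * grad

def grad_f(w: float) -> float:
    # Derivative of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w = 0.0
w_after_one_step = sgd_step(w, grad_f(w), lr=0.1)  # 0 - 0.1 * (-6) = 0.6

# Repeating the step moves w toward the minimum at w = 3.
for _ in range(100):
    w = sgd_step(w, grad_f(w), lr=0.1)
```

`torch.optim.SGD` applies the same rule to every model parameter, using gradients computed from one mini-batch at a time.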

In [None]:
vocab_size = len(word2vec.wv)
embedding_dim = 50
num_filters = 100
filter_size = 3
hidden_dim = 300
dropout = 0.5
n_epochs=80

def train(train_loader, n_epochs=n_epochs, embedding_dim=embedding_dim, num_filters=num_filters, hidden_dim=hidden_dim, lr=1e-1):
    model = SentenceSimilarityCNN2(embedding_dim, num_filters, filter_size, hidden_dim)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()

    for epoch in range(n_epochs):
        curr_epoch_loss = []
        for x1, x2, y in train_loader:
            #print(x1.shape)
            y_hat = model(x1, x2)
            loss = criterion(y_hat, y.float())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            curr_epoch_loss.append(loss.item())

        print(f"Epoch {epoch}: curr_epoch_loss={np.mean(curr_epoch_loss)}")

    return model


def eval_model(model, test_dataloader, diff_score=0.435):
    # Predict 1 (similar) when the similarity score exceeds diff_score, otherwise 0.
    model.eval()
    Y_pred = []
    Y = []
    with torch.no_grad():  # gradients are not needed during evaluation
        for x1, x2, y in test_dataloader:
            y_hat = model(x1, x2)
            y_pred = (y_hat > diff_score).int()

            # Accumulate predictions and labels across batches
            Y_pred = np.concatenate((Y_pred, y_pred.numpy()), axis=0)
            Y = np.concatenate((Y, y.numpy()), axis=0)

    return Y_pred, Y
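
The thresholding step and the two reported metrics can be checked by hand on a tiny example. The sketch below uses made-up scores and labels with the default threshold of `0.435`; the variable names are illustrative, and the notebook itself relies on `accuracy_score` and `f1_score` for the real evaluation.

```python
scores = [0.9, 0.5, 0.3, 0.6, 0.1]  # made-up similarity scores
labels = [1, 1, 1, 0, 0]            # made-up ground-truth flags
threshold = 0.435

# A pair is predicted similar when its score exceeds the threshold.
preds = [int(s > threshold) for s in scores]  # [1, 1, 0, 1, 0]

tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)

accuracy = sum(1 for p, t in zip(preds, labels) if p == t) / len(labels)
f1 = 2 * tp / (2 * tp + fp + fn)
```

Raising or lowering the threshold trades false positives against false negatives, which is why `diff_score` is exposed as a parameter.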



### Stanza Parser Training & Results - _MSRP Trained Model_
#### Training

In [None]:
# Train stanza sentences
start_time = datetime.datetime.now()
model = train(train_dataloader)
end_time = datetime.datetime.now()
print("Number of epochs: ", n_epochs)
print ("Training dataset size", len(train_dataset))
print("Training time: ", (end_time - start_time))




#### Results

In [None]:
from sklearn.metrics import accuracy_score, f1_score

y_pred, y = eval_model(model, test_dataloader)

print("size of pos test corpus = ", len(test_dataset))
print("accuracy for pos test corpus = ", accuracy_score(y, y_pred))
print("f1 score test corpus = ", f1_score(y, y_pred))


### Stanza Parser Training & Results - _Pre-trained Model_
#### Training

In [None]:
# 35 epochs yields - accuracy .7026 with 0.45 - 300 - 400 - 800
n_epochs_pretrained=50
start_time = datetime.datetime.now()
pretrained_model = train(pretrained_train_dataloader, n_epochs=n_epochs_pretrained, embedding_dim=300, num_filters=400, hidden_dim=800)
end_time = datetime.datetime.now()
print("Number of epochs: ", n_epochs_pretrained)
print ("Training dataset size", len(pretrained_train_dataset))
print("Training time: ", (end_time - start_time))

#### Results

In [None]:
y_pred_pretrained, y = eval_model(pretrained_model, pretrained_test_dataloader)
print("size of pos test corpus = ", len(pretrained_test_dataset))
print("accuracy for pos test corpus = ", accuracy_score(y, y_pred_pretrained))
print("f1 score test corpus = ", f1_score(y, y_pred_pretrained))


### SpaCy Parser Training & Results

#### Training

In [None]:
# Train spacy sentences
start_time = datetime.datetime.now()
model_spacy_sentences = train(train_dataloader_spacy_sentences)
end_time = datetime.datetime.now()
print("Number of epochs: ", n_epochs)
print ("Training dataset size", len(train_dataset_spacy_sentences))
print("Training time: ", (end_time - start_time))


#### Results

In [None]:
y_pred_spacy, y_spacy = eval_model(model_spacy_sentences, test_dataloader_spacy_sentences)

print("size of spacy test corpus = ", len(test_dataset_spacy_sentences))
print("accuracy for spacy test corpus = ", accuracy_score(y_spacy, y_pred_spacy))
print("f1 score test corpus = ", f1_score(y_spacy, y_pred_spacy))


### Raw Parser Training & Results

#### Training

In [None]:
# Train the raw sentences
start_time = datetime.datetime.now()
model_raw_sentences = train(train_dataloader_raw_sentences)
end_time = datetime.datetime.now()
print("Number of epochs: ", n_epochs)
print ("Training dataset size", len(train_dataset_raw_sentences))
print("Training time: ", (end_time - start_time))

#### Results

In [None]:
y_pred_raw, y_raw = eval_model(model_raw_sentences, test_dataloader_raw_sentences, diff_score=0.4)

print("size of raw test corpus = ", len(test_dataset_raw_sentences))
print("accuracy for raw test corpus = ", accuracy_score(y_raw, y_pred_raw))
print("f1 score test corpus = ", f1_score(y_raw, y_pred_raw))

### Result Summary

In [None]:
#Doesn't work yet

import matplotlib.pyplot as plt

# Accuracy per configuration, each scored against its own label array
acc_df = pd.DataFrame([['STANZA-MSRP', accuracy_score(y, y_pred)],
                       ['STANZA-PRETRAIN', accuracy_score(y, y_pred_pretrained)],
                       ['SPACY', accuracy_score(y_spacy, y_pred_spacy)],
                       ['RAW', accuracy_score(y_raw, y_pred_raw)]], columns=['parser', 'val'])

acc_df.plot(kind='bar', x='parser', y='val', title='Accuracy', legend=False)

# F1 score per configuration
f1_df = pd.DataFrame([['STANZA-MSRP', f1_score(y, y_pred)],
                      ['STANZA-PRETRAIN', f1_score(y, y_pred_pretrained)],
                      ['SPACY', f1_score(y_spacy, y_pred_spacy)],
                      ['RAW', f1_score(y_raw, y_pred_raw)]], columns=['parser', 'val'])

f1_df.plot(kind='bar', x='parser', y='val', title='F1 score', legend=False)

plt.show()
