# Methodology
## Training
For every training pair, we decide with probability 0.5 whether to add noise and change the label to 0 (i.e., not a paraphrase). If adding noise, we choose between two options with probability 0.5 each: replace one document of the pair with a random sentence/document, or introduce token-level noise into one of the documents. Token-level noise is controlled by a defined parameter, learning noise (LN), which defaults to 10%: LN% of the tokens in one of the sentences are replaced with tokens selected uniformly at random from the embedding vocabulary.
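
The sampling procedure above can be sketched as follows. This is a minimal illustration, not the code in `pipeline.py` (which is not shown here): the function name, the argument names, and the choice to replace at least one token are assumptions, and documents are taken to be lists of tokens.

```python
import random

def add_training_noise(pair, corpus, vocabulary, learning_noise=0.10, rng=random):
    """Return (doc_a, doc_b, label) for one training pair of token lists.

    Hypothetical helper: names and the "at least one token" rule are assumptions.
    """
    doc_a, doc_b = list(pair[0]), list(pair[1])
    # With probability 0.5, keep the pair unchanged as a paraphrase (label 1).
    if rng.random() < 0.5:
        return doc_a, doc_b, 1
    # Otherwise build a negative example by altering one side of the pair.
    side = rng.randrange(2)
    if rng.random() < 0.5:
        # Option 1: swap the chosen document for a random one from the corpus.
        replacement = list(rng.choice(corpus))
    else:
        # Option 2: overwrite LN% of the tokens with random vocabulary tokens.
        replacement = list((doc_a, doc_b)[side])
        n_noisy = min(len(replacement), max(1, round(learning_noise * len(replacement))))
        for idx in rng.sample(range(len(replacement)), n_noisy):
            replacement[idx] = rng.choice(vocabulary)
    if side == 0:
        doc_a = replacement
    else:
        doc_b = replacement
    return doc_a, doc_b, 0
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible, which helps when comparing runs with different LN values.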

In [2]:
import random
import itertools
import re
import math
import time
import pandas as pd
import numpy as np
import csv
import importlib

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from textdistance import jaccard

In [4]:
from spacy.lang.en import English

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchtext import vocab

In [6]:
random.seed(12144)
torch.manual_seed(31190)

<torch._C.Generator at 0x1285f43f0>

In [8]:
import pipeline

In [44]:
importlib.reload(pipeline)

<module 'pipeline' from '/Users/matthewmauer/NLP/hw/4/pipeline.py'>

# Loading Data

#### TRAINING

In [9]:
raw_training = []
with open('data/train.tsv', 'r') as f:
    for line in f:
        a, b = line.strip("\n").split("\t")
        raw_training.append((a, b))
    

#### DEV

In [10]:
raw_dev = []
with open('data/dev+devtest/dev.tsv', 'r') as f:
    for line in f:
        t1, t2, y = line.strip('\n').split('\t')
        raw_dev.append((t1, t2, int(y)))

#### DEVTEST

In [11]:
raw_devtest = []
with open('data/dev+devtest/devtest.tsv', 'r') as f:
    for line in f:
        t1, t2, y = line.strip('\n').split('\t')
        raw_devtest.append((t1, t2, int(y)))

### "Easy" Kaggle Test Data

In [12]:
raw_kaggle_test = {}
with open('data/test_no_labels.tsv', 'r') as f:
    for line in f:
        i, a, b = line.strip("\n").split("\t")
        raw_kaggle_test[i] = (a, b)

### "Hard" DEV

In [13]:
raw_dev_hard = []
with open('data/heldout-hard/dev.hard.tsv', 'r') as f:
    for line in f:
        t1, t2, y = line.strip('\n').split('\t')
        raw_dev_hard.append((t1, t2, int(y)))

In [14]:
raw_devtest_hard = []
with open('data/heldout-hard/devtest.hard.tsv', 'r') as f:
    for line in f:
        t1, t2, y = line.strip('\n').split('\t')
        raw_devtest_hard.append((t1, t2, int(y)))

### "Hard" Kaggle Test Data

In [15]:
raw_kaggle_test_hard = {}
with open('data/test_no_labels-hard.tsv', 'r') as f:
    for line in f:
        i, a, b = line.strip("\n").split("\t")
        raw_kaggle_test_hard[i] = (a, b)

## Parser and Word Embeddings
* Using the basic English parser from spaCy for tokenization.
* Using the 300-dimensional [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings trained on 6B tokens (Wikipedia and Gigaword).

In [16]:
parser = English()
embeddings300 = vocab.GloVe(name='6B', dim=300)

# Baseline LSTM Model
* Siamese LSTM
* The two vectors from the LSTM are joined by an elementwise absolute difference
* 300-dimensional embeddings
* Randomly initialized and trained h0, c0 (i.e., the initial hidden and cell vectors)
* No manual features
* Learning noise: 0.1
* No additional layers between joining the LSTM outputs and the final classifier layer.
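
The architecture listed above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the actual `pipeline.AbsDiffSiamese`: the class name, the hidden size of 64, the use of the final hidden state as the sentence vector, and the two-class output are all assumptions.

```python
import torch
import torch.nn as nn

class AbsDiffSiameseSketch(nn.Module):
    """Hypothetical Siamese LSTM joined by an elementwise absolute difference."""

    def __init__(self, embed_dim=300, hidden_dim=64, n_classes=2):
        super().__init__()
        # One LSTM shared by both inputs: this weight sharing makes it Siamese.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Randomly initialized, trainable initial hidden and cell vectors.
        self.h0 = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.c0 = nn.Parameter(torch.randn(1, 1, hidden_dim))
        # The classifier sits directly on the joined vector, with no extra layers.
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def encode(self, x):
        # x has shape (batch, sequence_length, embed_dim).
        batch = x.size(0)
        h0 = self.h0.expand(-1, batch, -1).contiguous()
        c0 = self.c0.expand(-1, batch, -1).contiguous()
        _, (h_n, _) = self.lstm(x, (h0, c0))
        return h_n[-1]  # final hidden state, shape (batch, hidden_dim)

    def forward(self, a, b):
        joined = torch.abs(self.encode(a) - self.encode(b))
        return self.classifier(joined)
```

Because the join is an absolute difference, the output does not depend on the order of the two sentences, which suits a symmetric relation such as paraphrase.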

In [19]:
net = pipeline.AbsDiffSiamese()
absdiffsiamese = pipeline.ParaphraseClassifier(
    net=net,
    embeddings=embeddings300,
    parser=parser
)
absdiffsiamese.train(raw_training[:50000], raw_dev, raw_devtest)

After 50000 training samples, accuracy is 0.7396 on the DEV data.
After 50000 training samples, accuracy is 0.7292 on the DEVTEST data.

------------------------------------------------------------------

580.8 seconds has been spent training.
The best score on DEVTEST data was 0.7292 after 50000 training samples.


In [20]:
absdiffsiamese.train(raw_training[50000:200000], raw_dev, raw_devtest)

After 100000 training samples, accuracy is 0.7323 on the DEV data.


KeyboardInterrupt: 

Early stopping should be performed at 200k samples for the above model; it has converged to a local minimum by then.

In [None]:
# absdiffsiamese.train(raw_training[200000:], raw_dev, raw_devtest)

#### Accuracy on DEVTEST after early stopping

In [31]:
absdiffsiamese.test(raw_devtest)

0.7478260869565218

#### Accuracy on "Easy" Kaggle set
__0.865__

In [70]:
# with open("results/easy/absdiff.csv", 'w', newline='') as f:
#     writer = csv.writer(f, delimiter=',')
#     writer.writerow(["ID", "Category"])
#     for i in raw_kaggle_test:
#         prediction = absdiffsiamese.predict(raw_kaggle_test[i])
#         writer.writerow([i, prediction])

pipeline.write_results("results/easy/absdiff.csv", absdiffsiamese, raw_kaggle_test)

#### Accuracy on Hard DEV

In [41]:
absdiffsiamese.test(raw_dev_hard)

0.432

#### Accuracy on "Hard" Kaggle set
__0.443__

In [44]:
# with open("results/hard/absdiff.csv", 'w', newline='') as f:
#     writer = csv.writer(f, delimiter=',')
#     writer.writerow(["ID", "Category"])
#     for i in raw_kaggle_test_hard:
#         prediction = absdiffsiamese.predict(raw_kaggle_test_hard[i])
#         writer.writerow([i, prediction])

pipeline.write_results("results/hard/absdiff.csv", absdiffsiamese, raw_kaggle_test_hard)

# Improvements for the Hard data
* Use of elementwise product instead of absolute difference for the LSTM join.
* Lowering of the learning noise to 5% to detect smaller differences in the sentence pair.
* Introduction of POS in a separate Siamese LSTM.
    * Several negative samples in the hard data have the subject and object flipped...
* Manual features.
* Use only the hard dev and devtest data for training.
    * With smaller embedding dimensions to avoid overfit.

In [50]:
len(raw_devtest_hard)

1000

__CHANGES__
* Learning noise: 0.05
* Product Siamese join

In [18]:
net = pipeline.ProductSiamese()
product_siamese = pipeline.ParaphraseClassifier(
    net=net,
    embeddings=embeddings300,
    parser=parser,
    learning_noise=0.05
)
product_siamese.train(raw_training[:100000], raw_dev, raw_dev_hard)

After 50000 training samples, accuracy is 0.6834 on the DEV data.
After 50000 training samples, accuracy is 0.463 on the DEVTEST data.
After 100000 training samples, accuracy is 0.7017 on the DEV data.
After 100000 training samples, accuracy is 0.463 on the DEVTEST data.

------------------------------------------------------------------

1207.7 seconds has been spent training.
The best score on DEVTEST data was 0.463 after 50000 training samples.


In [73]:
product_siamese.train(raw_training[100000:200000], raw_dev, raw_dev_hard)

After 150000 training samples, accuracy is 0.7078 on the DEV data.
After 200000 training samples, accuracy is 0.6883 on the DEV data.

------------------------------------------------------------------

2414.2 seconds has been spent training.
The best score on DEVTEST data was 0.484 after 100000 training samples.


Early stopping should be performed at 100k samples for the above model.

In [53]:
product_siamese.test(raw_devtest)

0.6993788819875777

In [56]:
product_siamese.test(raw_devtest_hard)

0.471

__RESULTS__  
Early stopping should be performed after 100k samples. There was a modest improvement up to that stage, but the gains rapidly decline after 100k samples.
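
The stopping point here was chosen by reading the training log. A minimal sketch of automating that choice is given below; the helper names, the `patience` parameter, and the accuracy values in the usage example are illustrative assumptions, not part of `pipeline.py` and not results from this notebook.

```python
def best_checkpoint(history):
    """Return the earliest (n_samples, accuracy) entry with the highest accuracy."""
    best = history[0]
    for entry in history[1:]:
        if entry[1] > best[1]:
            best = entry
    return best

def should_stop(history, patience=2):
    """True when the last `patience` evaluations failed to beat the earlier best."""
    if len(history) <= patience:
        return False
    best_before = max(acc for _, acc in history[:-patience])
    return all(acc <= best_before for _, acc in history[-patience:])

# Made-up evaluation history: (training samples seen, accuracy on held-out data).
example_history = [(50000, 0.45), (100000, 0.48), (150000, 0.47), (200000, 0.46)]
print(best_checkpoint(example_history))
print(should_stop(example_history, patience=2))
```

In practice the model weights would be saved at each evaluation so that the best checkpoint can be restored once training stops.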

__CHANGES__
The model from above with the four manual features:
* TfIdf cosine similarity.
* 5-character-gram Jaccard score.
* The absolute difference in count of negation terms. (Scaled down by a factor of 10.)
* The number of numerical discrepancies. (Scaled down by a factor of 10.)
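
Three of these features can be sketched with the standard library alone, as shown below. The TF-IDF cosine similarity is left out because it relies on the fitted `TfidfVectorizer`. The function names, the negation word list, and the definition of a numerical discrepancy (a number that appears in only one of the two texts) are assumptions; the actual `pipeline.FeatureEngineer` may differ.

```python
import re

# Assumed negation vocabulary; the list used in pipeline.py is not shown.
NEGATIONS = {"not", "no", "never", "none", "nobody", "nothing", "neither", "nor"}

def char_ngrams(text, n=5):
    """Set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_jaccard(a, b, n=5):
    """Jaccard score between the character n-gram sets of two texts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

def count_negations(text):
    """Count negation words, including contractions such as "don't"."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in NEGATIONS or tok.endswith("n't") for tok in tokens)

def negation_gap(a, b, scale=10):
    """Absolute difference in negation counts, scaled down by `scale`."""
    return abs(count_negations(a) - count_negations(b)) / scale

def numeric_discrepancies(a, b, scale=10):
    """Count of numbers found in only one of the two texts, scaled down."""
    nums_a = set(re.findall(r"\d+(?:\.\d+)?", a))
    nums_b = set(re.findall(r"\d+(?:\.\d+)?", b))
    return len(nums_a ^ nums_b) / scale
```

Dividing the two count features by 10 keeps them in roughly the same range as the similarity scores, so they do not dominate the other inputs.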

In [23]:
# init the engineer
tfidf_vec = TfidfVectorizer()
tfidf_vec.fit(itertools.chain(*random.choices(raw_training, k=100000)))

engineer = pipeline.FeatureEngineer(
    tfidf_vectorizer=tfidf_vec
)

In [69]:
# init the NN
net_manual = pipeline.ProductSiamese(n_man_features=4)

# init the full model
# use weight decay to prevent overfitting to the manual features
product_siamese_manual = pipeline.ParaphraseClassifier(
    net=net_manual,
    embeddings=embeddings300,
    parser=parser,
    learning_noise=0.05,
    feature_engineer=engineer,
    weight_decay=1e-4
)

# train the model on the first 100k samples using DEV and the hard DEV for early stopping
product_siamese_manual.train(raw_training[:100000], raw_dev, raw_dev_hard)

After 50000 training samples, accuracy is 0.6932 on the DEV data.
After 50000 training samples, accuracy is 0.433 on the DEVTEST data.
After 100000 training samples, accuracy is 0.665 on the DEV data.

------------------------------------------------------------------

1421.5 seconds has been spent training.
The best score on DEVTEST data was 0.433 after 50000 training samples.


### Train on Hard Dev and Devtest data

In [65]:
product_siamese_manual_hard = pipeline.SupervisedParaphraseClassifier(
    net=net_manual,
    embeddings=embeddings300,
    parser=parser,
    learning_noise=0.05,
    feature_engineer=engineer,
    weight_decay=1e-4
)

# train the model on the hard DEV data for 10 epochs, evaluating on the hard DEVTEST data
product_siamese_manual_hard.train(raw_dev_hard, raw_devtest_hard, epochs=10)

After 1000 training samples, accuracy is 0.592 on the DEV data.
After 2000 training samples, accuracy is 0.596 on the DEV data.
After 3000 training samples, accuracy is 0.609 on the DEV data.
After 4000 training samples, accuracy is 0.606 on the DEV data.
After 5000 training samples, accuracy is 0.59 on the DEV data.
After 6000 training samples, accuracy is 0.589 on the DEV data.
After 7000 training samples, accuracy is 0.592 on the DEV data.
After 8000 training samples, accuracy is 0.583 on the DEV data.
After 9000 training samples, accuracy is 0.581 on the DEV data.
After 10000 training samples, accuracy is 0.574 on the DEV data.

------------------------------------------------------------------

284.4 seconds has been spent training.
The best score on DEV data was 0.609 after 3000 training samples over 2 epochs.


__ADJUSTMENT__
To handle overfitting and collapsing gradients, we train with cross-validation: we train for 2 epochs on one dataset while testing on the other, and then swap the training and testing sets.

__WITHOUT Manual Features__

In [71]:
net = pipeline.ProductSiamese()

product_siamese_hard = pipeline.SupervisedParaphraseClassifier(
    net=net,
    embeddings=embeddings300,
    parser=parser
)

for i in range(6):
    if i%2==0:
        product_siamese_hard.train(raw_dev_hard, raw_devtest_hard, epochs=1)
    else:
        product_siamese_hard.train(raw_devtest_hard, raw_dev_hard, epochs=1)

After 1000 training samples, accuracy is 0.546 on the DEV data.

------------------------------------------------------------------

24.2 seconds has been spent training.
The best score on DEV data was 0.546 after 1000 training samples over 2 epochs.
After 2000 training samples, accuracy is 0.567 on the DEV data.

------------------------------------------------------------------

51.4 seconds has been spent training.
The best score on DEV data was 0.567 after 2000 training samples over 2 epochs.
After 3000 training samples, accuracy is 0.546 on the DEV data.

------------------------------------------------------------------

75.1 seconds has been spent training.
The best score on DEV data was 0.567 after 2000 training samples over 2 epochs.
After 4000 training samples, accuracy is 0.567 on the DEV data.

------------------------------------------------------------------

98.0 seconds has been spent training.
The best score on DEV data was 0.567 after 2000 training samples over 2 epoc

__RESULTS__  
The simple model didn't appear to learn anything.

__WITH Manual Features__

In [66]:
product_siamese_manual_hard = pipeline.SupervisedParaphraseClassifier(
    net=net_manual,
    embeddings=embeddings300,
    parser=parser,
    learning_noise=0.05,
    feature_engineer=engineer,
    weight_decay=1e-4
)

for i in range(6):
    if i%2==0:
        product_siamese_manual_hard.train(raw_dev_hard, raw_devtest_hard, epochs=1)
    else:
        product_siamese_manual_hard.train(raw_devtest_hard, raw_dev_hard, epochs=1)

After 1000 training samples, accuracy is 0.57 on the DEV data.

------------------------------------------------------------------

30.0 seconds has been spent training.
The best score on DEV data was 0.57 after 1000 training samples over 2 epochs.
After 2000 training samples, accuracy is 0.7 on the DEV data.

------------------------------------------------------------------

61.1 seconds has been spent training.
The best score on DEV data was 0.7 after 2000 training samples over 2 epochs.
After 3000 training samples, accuracy is 0.589 on the DEV data.

------------------------------------------------------------------

93.4 seconds has been spent training.
The best score on DEV data was 0.7 after 2000 training samples over 2 epochs.
After 4000 training samples, accuracy is 0.714 on the DEV data.

------------------------------------------------------------------

121.7 seconds has been spent training.
The best score on DEV data was 0.714 after 4000 training samples over 2 epochs.
Aft

In [67]:
product_siamese_manual_hard.test(raw_dev)

0.4682151589242054

In [68]:
# with open("results/hard/product_siamese_manual_hard.csv", 'w', newline='') as f:
#     writer = csv.writer(f, delimiter=',')
#     writer.writerow(["ID", "Category"])
#     for i in raw_kaggle_test_hard:
#         prediction = product_siamese_manual_hard.predict(raw_kaggle_test_hard[i])
#         writer.writerow([i, prediction])

pipeline.write_results("results/hard/product_siamese_manual_hard.csv", product_siamese_manual_hard, raw_kaggle_test_hard)

__RESULTS__  
It appears that the bulk of the predictive power of the model trained on the hard dev and devtest data comes from the manual features.

# Naive Bayes Model
We attempt a Gaussian Naive Bayes model on the four manual features, using the entire hard DEV and DEVTEST data.

In [24]:
engineered_training_hard = []

for obs in raw_dev_hard + raw_devtest_hard:
    doc1, doc2, y = obs
    x = engineer.construct_features((doc1, doc2))
    engineered_training_hard.append(x + [y])

In [25]:
engineered_training_hard_matrix = np.array(engineered_training_hard)

In [26]:
nb_hard = GaussianNB()
nb_hard.fit(X = engineered_training_hard_matrix[:,:-1], y = engineered_training_hard_matrix[:,-1])

GaussianNB()

In [27]:
nb_hard.score(X = engineered_training_hard_matrix[:,:-1], y = engineered_training_hard_matrix[:,-1])

0.4445

In [42]:
random.choice(engineered_training_hard)

[0.9898589106597186, 0.7727272727272727, 0.0, 0.0, 1]

__RESULTS__  
The model learned little to nothing. The primary feature, TF-IDF cosine similarity, has very little variance in the hard dataset. Most of the difference between positive and negative samples is the result of reordering nouns and verbs.
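
The reordering problem can be illustrated with a plain bag-of-words cosine similarity. The example below is a simplified stand-in for the TF-IDF feature: any bag-of-words weighting ignores word order, so the same effect applies. The sentence pair is made up for illustration and does not come from the dataset.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between whitespace-tokenized bag-of-words count vectors."""
    vec_a, vec_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(vec_a[tok] * vec_b[tok] for tok in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Subject and object are swapped, so the meaning changes but the word counts do not.
flipped = bow_cosine("the dog bit the man", "the man bit the dog")
different = bow_cosine("the dog bit the man", "a cat slept all day")
print(flipped, different)
```

The flipped pair scores approximately 1.0 even though it is not a paraphrase, so a classifier that sees only this kind of feature cannot separate it from a true paraphrase.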

# Stacked Model
* For the easy data.
* Train a simple Siamese LSTM (product join) with learning noise=0.05.
* Use the outputs (without the softmax) of the Siamese LSTM as two inputs alongside the four manual features for a Naive Bayes model (Gaussian distribution).
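
The stacking step can be sketched as follows: the four manual features and the two pre-softmax outputs are concatenated into one six-value row, and a Gaussian Naive Bayes model is fitted on those rows. The notebook itself uses scikit-learn's `GaussianNB` through `pipeline.StackedClassifier`, whose internals are not shown; the tiny classifier, the helper name, the variance smoothing constant, and the feature values used in the test are assumptions for illustration only.

```python
import math

def stack_features(manual_features, siamese_outputs):
    """Concatenate the manual features with the raw (pre-softmax) network outputs."""
    return list(manual_features) + list(siamese_outputs)

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes: one independent normal per feature and class."""

    def fit(self, rows, labels):
        self.stats = {}
        for label in set(labels):
            members = [r for r, y in zip(rows, labels) if y == label]
            columns = list(zip(*members))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps the variance positive for constant features.
            variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-6
                         for col, m in zip(columns, means)]
            log_prior = math.log(len(members) / len(rows))
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances))
        # Choose the class with the highest unnormalized log posterior.
        return max(self.stats, key=log_posterior)
```

Feeding the raw outputs rather than the softmax probabilities gives the Gaussian model unbounded, roughly continuous inputs, which match its distributional assumption better than values squeezed into [0, 1].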

In [76]:
importlib.reload(pipeline)

<module 'pipeline' from '/Users/matthewmauer/NLP/hw/4/pipeline.py'>

In [77]:
stacked_model = pipeline.StackedClassifier(
    siamese_classifier = absdiffsiamese,
    feature_engineer = engineer,
    super_model = GaussianNB()
)

stacked_model.train(raw_dev)

After training on 818 training samples and 4.644663095474243 seconds of training, the stacked model has an accuracy of 0.8031784841075794 on the training data.


In [78]:
stacked_model.processed_training_data[0]

[0.4434166441744399, 0.5, 0.0, 0.0, -0.7320107221603394, 0.6121735572814941, 1]

In [79]:
pipeline.write_results("results/easy/stacked.csv", stacked_model, raw_kaggle_test)

__RESULTS__  
__0.899__ accuracy on the easy Kaggle data.