# Kaggle Competition - NLP - "Contradictory, My Dear Watson"

## Team: jnees
### Notebook 1: Data Exploration and Cleaning
#### [GitHub Repo](https://github.com/jnees/data-science-projects/tree/master/NLP_Kaggle_Contradictory_My_Dear_Watson)

#### [Competition Overview](https://www.kaggle.com/c/contradictory-my-dear-watson/overview)

## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from scipy import spatial
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

In [2]:
# Options
pd.set_option('display.max_colwidth', 200)  # show up to 200 characters per text column

## Data Import

In [3]:
train = pd.read_csv("../Data/train.csv")
test = pd.read_csv("../Data/test.csv")

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12120 entries, 0 to 12119
Data columns (total 6 columns):
id            12120 non-null object
premise       12120 non-null object
hypothesis    12120 non-null object
lang_abv      12120 non-null object
language      12120 non-null object
label         12120 non-null int64
dtypes: int64(1), object(5)
memory usage: 568.2+ KB


In [5]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5195 entries, 0 to 5194
Data columns (total 5 columns):
id            5195 non-null object
premise       5195 non-null object
hypothesis    5195 non-null object
lang_abv      5195 non-null object
language      5195 non-null object
dtypes: object(5)
memory usage: 203.1+ KB


In [6]:
print(train.head())

           id  \
0  5130fd2cb5   
1  5b72532a0b   
2  3931fbe82a   
3  5622f0c60b   
4  86aaa48b45   

                                                                                                                                                                                  premise  \
0                                                                                                                    and these comments were considered in formulating the interim rules.   
1                                                                                                       These are issues that we wrestle with in practice groups of law firms, she said.    
2                                                                                            Des petites choses comme celles-là font une différence énorme dans ce que j'essaye de faire.   
3                                                                                            you know they can't really defend themselves lik

## Data Overview

The training data consists of premise/hypothesis sentence pairs in 15 languages. English dominates with about 57% of the rows, and each of the other 14 languages accounts for roughly 3%. The test data has a nearly identical language distribution.

In [7]:
round(train["language"].value_counts(normalize=True)*100,2)

English       56.68
Chinese        3.39
Arabic         3.31
French         3.22
Swahili        3.18
Urdu           3.14
Vietnamese     3.13
Russian        3.10
Hindi          3.09
Greek          3.07
Thai           3.06
Spanish        3.02
German         2.90
Turkish        2.90
Bulgarian      2.82
Name: language, dtype: float64

In [8]:
round(test["language"].value_counts(normalize=True)*100,2)

English       56.69
Spanish        3.37
Swahili        3.31
Russian        3.31
Greek          3.23
Urdu           3.23
Turkish        3.21
Thai           3.16
Arabic         3.06
French         3.02
German         2.93
Chinese        2.91
Bulgarian      2.89
Hindi          2.89
Vietnamese     2.79
Name: language, dtype: float64
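
To check how closely the two splits match, the two `value_counts` results can be aligned in a single table. The following is a minimal sketch on small made-up frames; `toy_train`, `toy_test`, and the `compare_language_share` helper are illustrative assumptions, not part of the notebook. With the real data, `train` and `test` would be passed instead.

```python
import pandas as pd

def compare_language_share(train_df, test_df, column="language"):
    """Return each language's percentage share in both frames plus the absolute gap."""
    train_share = train_df[column].value_counts(normalize=True) * 100
    test_share = test_df[column].value_counts(normalize=True) * 100
    # Align both Series on the language index; keys become the column names.
    table = pd.concat([train_share, test_share], axis=1, keys=["train_pct", "test_pct"])
    table = table.fillna(0.0)  # a language missing from one split counts as 0%
    table["abs_diff"] = (table["train_pct"] - table["test_pct"]).abs()
    return table.round(2).sort_values("train_pct", ascending=False)

# Hypothetical toy data, only to make the sketch runnable.
toy_train = pd.DataFrame({"language": ["English"] * 6 + ["French"] * 2 + ["Thai"] * 2})
toy_test = pd.DataFrame({"language": ["English"] * 5 + ["French"] * 3 + ["Thai"] * 2})

comparison = compare_language_share(toy_train, toy_test)
print(comparison)
```

A large `abs_diff` for any language would suggest the splits were not drawn from the same distribution.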

## NLP feature engineering

#### 1. Similarity between vectorized premise and hypothesis (cosine similarity between vectors)

In [9]:
nlp = spacy.load("en_core_web_lg")

In [10]:
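# Keep English rows only, since en_core_web_lg is an English model.
# .copy() avoids a SettingWithCopyWarning when feature columns are added later.
train_sample = train[train["language"] == "English"].head(2000).copy()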
train_sample.shape

(2000, 6)

In [11]:
# Cosine similarity between two vectors: 1 minus SciPy's cosine distance.
def cosine_similarity(vec1, vec2):
    return 1 - spatial.distance.cosine(vec1, vec2)
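
As a worked illustration of the quantity this helper computes, the sketch below uses toy vectors chosen for illustration (not taken from the dataset). Cosine similarity is the dot product divided by the product of the L2 norms, and SciPy's `spatial.distance.cosine` returns one minus that value. The name `cosine_sim` is an assumption used only here.

```python
import numpy as np
from scipy import spatial

def cosine_sim(vec1, vec2):
    """Cosine similarity computed directly from its definition."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

sim_ab = cosine_sim(a, b)  # vectors 45 degrees apart
sim_ac = cosine_sim(a, c)  # orthogonal vectors
dist_ab = spatial.distance.cosine(a, b)  # SciPy returns the distance, i.e. 1 - similarity

print(sim_ab, sim_ac, dist_ab)
```

Vector length does not affect the result: `c` has norm 2, yet its similarity to `a` depends only on the angle between them.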

In [12]:
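def calc_similarity(row):
    # nlp() returns a Doc, not a token. Doc.similarity compares the two
    # document vectors (by default, cosine similarity of averaged word vectors).
    premise_doc = nlp(row.premise)
    hypothesis_doc = nlp(row.hypothesis)
    return premise_doc.similarity(hypothesis_doc)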

In [13]:
train_sample["similarity"] = train_sample.apply(calc_similarity, axis=1)

In [14]:
print(train_sample.iloc[10])

id                                                                                                                                                                                                         ad5a79456e
premise       Increased saving by current generations would expand the nation's capital stock, allowing future generations to better afford the nation's retirement costs while also enjoying higher standards of ...
hypothesis    Current generations' increased saving would expand the nation's capital stock, allowing future generations to more easily afford the nation's retirement costs while also enjoying higher standards ...
lang_abv                                                                                                                                                                                                           en
language                                                                                                                                        

#### 2. L2 Vector Norms

In [15]:
# Doc.vector_norm is the L2 norm of the document vector.
def calc_l2_premise(row):
    return nlp(row.premise).vector_norm

def calc_l2_hypothesis(row):
    return nlp(row.hypothesis).vector_norm

train_sample["L2_premise"] = train_sample.apply(calc_l2_premise, axis=1)
train_sample["L2_hypothesis"] = train_sample.apply(calc_l2_hypothesis, axis=1)
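
For intuition about what these features measure: by default spaCy builds a `Doc` vector by averaging its token vectors, `vector_norm` is the L2 (Euclidean) norm of that average, and `similarity` is the cosine similarity of two such averages. The sketch below reproduces that arithmetic with NumPy on a tiny hand-made embedding table. `toy_embeddings` and the helper names are hypothetical and stand in for the `en_core_web_lg` vectors.

```python
import numpy as np

# Hypothetical 2-dimensional "word vectors", for illustration only.
toy_embeddings = {
    "dogs": np.array([3.0, 0.0]),
    "bark": np.array([0.0, 4.0]),
    "cats": np.array([3.0, 2.0]),
}

def doc_vector(text):
    """Average the vectors of whitespace-separated tokens."""
    tokens = text.lower().split()
    return np.mean([toy_embeddings[tok] for tok in tokens], axis=0)

def l2_norm(vec):
    """Euclidean length of a vector."""
    return float(np.sqrt(np.sum(vec ** 2)))

premise_vec = doc_vector("dogs bark")     # mean of [3, 0] and [0, 4]
hypothesis_vec = doc_vector("cats bark")  # mean of [3, 2] and [0, 4]

premise_norm = l2_norm(premise_vec)
hypothesis_norm = l2_norm(hypothesis_vec)
similarity = float(np.dot(premise_vec, hypothesis_vec) / (premise_norm * hypothesis_norm))

print(premise_norm, hypothesis_norm, similarity)
```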

#### 3. CountVectorizer (bag-of-words) features

In [44]:
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# One vectorizer per text column, so each fitted vocabulary is kept
# and can later be applied to the test set with .transform().
premise_vect = CountVectorizer()
hypothesis_vect = CountVectorizer()

dense_features = train_sample[["similarity", "L2_premise", "L2_hypothesis"]].values

# Column order: hypothesis counts, premise counts, then the three dense features.
X_train = sparse.hstack(
    (
        hypothesis_vect.fit_transform(train_sample.hypothesis),
        premise_vect.fit_transform(train_sample.premise),
        dense_features,
    ),
    format="csr",
)
X_train.shape


(2000, 11491)
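
The stacking step can be seen in isolation on a toy corpus. This sketch fits one `CountVectorizer` per text column and appends dense numeric columns with `scipy.sparse.hstack`; the three sentence pairs and the `dense` values are invented for illustration. The resulting width is the sum of the two vocabulary sizes plus the number of dense columns.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy inputs, standing in for the premise/hypothesis columns.
premises = ["the cat sat", "dogs bark loudly", "the sun is hot"]
hypotheses = ["a cat sat", "dogs are quiet", "the sun is cold"]
dense = np.array([[0.9, 1.2], [0.4, 2.0], [0.7, 1.5]])  # e.g. similarity and a norm

premise_vect = CountVectorizer()
hypothesis_vect = CountVectorizer()
premise_counts = premise_vect.fit_transform(premises)
hypothesis_counts = hypothesis_vect.fit_transform(hypotheses)

# Convert the dense block to sparse explicitly, then stack column-wise.
features = sparse.hstack(
    (premise_counts, hypothesis_counts, sparse.csr_matrix(dense)),
    format="csr",
)

expected_width = (
    len(premise_vect.vocabulary_) + len(hypothesis_vect.vocabulary_) + dense.shape[1]
)
print(features.shape, expected_width)

# A fitted vectorizer maps unseen text onto the same columns.
new_premise_counts = premise_vect.transform(["the cat is hot"])
```

Keeping each fitted vectorizer around is what makes it possible to build a test matrix with the same column layout later.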

## Train and test

In [46]:
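y_train = train_sample["label"]
nn = MLPClassifier(hidden_layer_sizes=(8,), activation="relu", random_state=1)
nn.fit(X_train, y_train)
# Predictions are made on the same rows the model was fit on,
# so this is training accuracy, not an estimate of generalization.
predictions = nn.predict(X_train)
print(accuracy_score(y_train, predictions))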

1.0


In [43]:
from sklearn.tree import DecisionTreeClassifier
# Also scored on the training rows, so this too is training accuracy.
tree = DecisionTreeClassifier(max_depth=12)
predictions = tree.fit(X_train, y_train).predict(X_train)
print(accuracy_score(y_train, predictions))

0.6345
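
Both scores above are computed on the rows the models were fit on, so they describe fit to the training sample rather than performance on unseen pairs. A minimal sketch of a hold-out evaluation with `train_test_split` follows. It uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train` (an assumption made for self-containment), and the `test_size` and `max_iter` values are arbitrary choices, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class data standing in for the real feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=1
)

# Hold out 25% of the rows; stratify to keep class proportions similar.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

clf = MLPClassifier(hidden_layer_sizes=(8,), activation="relu", max_iter=500, random_state=1)
clf.fit(X_fit, y_fit)

fit_acc = accuracy_score(y_fit, clf.predict(X_fit))  # accuracy on rows seen during fitting
val_acc = accuracy_score(y_val, clf.predict(X_val))  # accuracy on held-out rows
print(f"fit accuracy: {fit_acc:.3f}, held-out accuracy: {val_acc:.3f}")
```

A large gap between the two numbers is the usual sign of overfitting; the held-out figure is the one to compare across models.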
