# Kaggle Competition - NLP - "Contradictory, My Dear Watson" - Refactor Notebook

## Team: jnees

#### [GitHub Repo](https://github.com/jnees/data-science-projects/tree/master/NLP_Kaggle_Contradictory_My_Dear_Watson)

#### [Competition Overview](https://www.kaggle.com/c/contradictory-my-dear-watson/overview)

## Libraries and Options

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import scipy as sp
from scipy import spatial
%matplotlib inline



In [2]:
# Options
pd.set_option('max_colwidth', 200)

In [3]:
# Spacy Language Libraries
nlp = spacy.load("en_core_web_lg")

## Data Import

In [4]:
train = pd.read_csv("../Data/train.csv")
test = pd.read_csv("../Data/test.csv")

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12120 entries, 0 to 12119
Data columns (total 6 columns):
id            12120 non-null object
premise       12120 non-null object
hypothesis    12120 non-null object
lang_abv      12120 non-null object
language      12120 non-null object
label         12120 non-null int64
dtypes: int64(1), object(5)
memory usage: 568.2+ KB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5195 entries, 0 to 5194
Data columns (total 5 columns):
id            5195 non-null object
premise       5195 non-null object
hypothesis    5195 non-null object
lang_abv      5195 non-null object
language      5195 non-null object
dtypes: object(5)
memory usage: 203.0+ KB


In [7]:
print(train.head())

           id  \
0  5130fd2cb5   
1  5b72532a0b   
2  3931fbe82a   
3  5622f0c60b   
4  86aaa48b45   

                                                                                                                                                                                  premise  \
0                                                                                                                    and these comments were considered in formulating the interim rules.   
1                                                                                                       These are issues that we wrestle with in practice groups of law firms, she said.    
2                                                                                            Des petites choses comme celles-là font une différence énorme dans ce que j'essaye de faire.   
3                                                                                            you know they can't really defend themselves lik

## Data Overview

The training data is comprised of sentences in 15 languages. English is the primary language in the set with about 57% share. The test data has a similar language distribution.

In [8]:
round(train["language"].value_counts(normalize=True)*100,2)

English       56.68
Chinese        3.39
Arabic         3.31
French         3.22
Swahili        3.18
Urdu           3.14
Vietnamese     3.13
Russian        3.10
Hindi          3.09
Greek          3.07
Thai           3.06
Spanish        3.02
Turkish        2.90
German         2.90
Bulgarian      2.82
Name: language, dtype: float64

In [9]:
round(test["language"].value_counts(normalize=True)*100,2)

English       56.69
Spanish        3.37
Swahili        3.31
Russian        3.31
Greek          3.23
Urdu           3.23
Turkish        3.21
Thai           3.16
Arabic         3.06
French         3.02
German         2.93
Chinese        2.91
Hindi          2.89
Bulgarian      2.89
Vietnamese     2.79
Name: language, dtype: float64

#### Restrict Languages

In [10]:
train_sample = train[train["language"] == "English"].copy()
test_sample = test[test["language"] == "English"].copy()

print(train_sample.shape, test_sample.shape)

(6870, 6) (2945, 5)


## Feature engineering

In [11]:
def process_features(raw_df, labels=True):
    df = raw_df.copy()
    
    ### ---------------------------- ###
    ###       Similarity Index       ###
    ### ---------------------------- ###
    
    # Function for measuring vector similarity - cosine distance between vectors.
    cosine_similarity = lambda vec1, vec2: 1 - spatial.distance.cosine(vec1, vec2)
    
    def calc_similarity(row):
        token1 = nlp(row.premise)
        token2 = nlp(row.hypothesis)
        return token1.similarity(token2)
    
    df["similarity"] = df.apply(calc_similarity, 1)
    
    
    ### ---------------------------- ###
    ###       L2 Vector Norms        ###
    ### ---------------------------- ###
    
    def calc_l2_premise(row):
        return nlp(row.premise).vector_norm

    def calc_l2_hypothesis(row):
        return nlp(row.hypothesis).vector_norm

    df["L2_premise"] = df.apply(calc_l2_premise, 1)
    df["L2_hypothesis"] = df.apply(calc_l2_hypothesis, 1)
    
    
    ### ---------------------------- ###
    ###      Sentiment Intensity     ###
    ### ---------------------------- ###
    
    sid = SentimentIntensityAnalyzer()
    
    df["sent_score_p"] = df["premise"].apply(lambda p: sid.polarity_scores(p)["compound"])
    df["sent_score_h"] = df["premise"].apply(lambda p: sid.polarity_scores(p)["compound"])

    df["sent_score_intensity_diff"] = np.abs(df["sent_score_p"]) - np.abs(df["sent_score_h"])

    ### ------------------------------ ###
    ###        Count Vectorizer        ###
    ### ------------------------------ ###
    
    vect = CountVectorizer()

    features = [
        "similarity",
        'L2_premise', 
        'L2_hypothesis', 
        'sent_score_p',
        'sent_score_h',
        'sent_score_intensity_diff',
    ]

    X = sp.sparse.hstack((vect.fit_transform(df.premise), df[features].values),format='csr')
    X = sp.sparse.hstack((vect.fit_transform(df.hypothesis), X),format='csr')
    
    if labels:
        y = df["label"]
        return X, y
    else:
        return X

In [12]:
X_train, y_train = process_features(train_sample)
X_test = process_features(test_sample, labels=False)

  


## Train and test

#### Neural Net

In [13]:
nn = MLPClassifier(hidden_layer_sizes=(32,), activation="tanh", max_iter=300, random_state=1)
nn.fit(X_train, y_train)
predictions = nn.predict(X_train)
print(accuracy_score(y_train, predictions))

0.9962154294032023


#### Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier
predictions = DecisionTreeClassifier(min_samples_leaf=5).fit(X_train, y_train).predict(X_train)
print(accuracy_score(y_train, predictions))

0.7513828238719068
