# Kaggle Competition - NLP - "Contradictory, My Dear Watson" - Refactor Notebook

## Team: jnees

#### [GitHub Repo](https://github.com/jnees/data-science-projects/tree/master/NLP_Kaggle_Contradictory_My_Dear_Watson)

#### [Competition Overview](https://www.kaggle.com/c/contradictory-my-dear-watson/overview)

## Libraries and Options

In [1]:
import pandas as pd
import numpy as np
import spacy
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import scipy as sp
from scipy import spatial




In [2]:
# Options
pd.set_option('max_colwidth', 200)

In [3]:
# Spacy Language Libraries
nlp = spacy.load("en_core_web_lg")

## Data Import

In [4]:
train = pd.read_csv("../Data/train_translated.csv")
test = pd.read_csv("../Data/test_translated.csv")

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12120 entries, 0 to 12119
Data columns (total 8 columns):
id               12120 non-null object
premise          12120 non-null object
hypothesis       12120 non-null object
lang_abv         12120 non-null object
language         12120 non-null object
label            12120 non-null int64
hypothesis_en    12120 non-null object
premise_en       12120 non-null object
dtypes: int64(1), object(7)
memory usage: 757.6+ KB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5195 entries, 0 to 5194
Data columns (total 7 columns):
id               5195 non-null object
premise          5195 non-null object
hypothesis       5195 non-null object
lang_abv         5195 non-null object
language         5195 non-null object
hypothesis_en    5195 non-null object
premise_en       5195 non-null object
dtypes: object(7)
memory usage: 284.2+ KB


In [7]:
print(train.head())

           id  \
0  5130fd2cb5   
1  5b72532a0b   
2  3931fbe82a   
3  5622f0c60b   
4  86aaa48b45   

                                                                                                                                                                                  premise  \
0                                                                                                                    and these comments were considered in formulating the interim rules.   
1                                                                                                       These are issues that we wrestle with in practice groups of law firms, she said.    
2                                                                                            Des petites choses comme celles-là font une différence énorme dans ce que j'essaye de faire.   
3                                                                                            you know they can't really defend themselves lik

## Data Overview

The training data is comprised of sentences in 15 languages. English is the primary language in the set with about 57% share. The test data has a similar language distribution.

In [8]:
round(train["language"].value_counts(normalize=True)*100,2)

English       56.68
Chinese        3.39
Arabic         3.31
French         3.22
Swahili        3.18
Urdu           3.14
Vietnamese     3.13
Russian        3.10
Hindi          3.09
Greek          3.07
Thai           3.06
Spanish        3.02
German         2.90
Turkish        2.90
Bulgarian      2.82
Name: language, dtype: float64

In [9]:
round(test["language"].value_counts(normalize=True)*100,2)

English       56.69
Spanish        3.37
Russian        3.31
Swahili        3.31
Urdu           3.23
Greek          3.23
Turkish        3.21
Thai           3.16
Arabic         3.06
French         3.02
German         2.93
Chinese        2.91
Bulgarian      2.89
Hindi          2.89
Vietnamese     2.79
Name: language, dtype: float64

#### Restrict Languages - option for testing

In [10]:
# train_sample = train[train["language"] == "English"].copy()
# test_sample = test[test["language"] == "English"].copy()

# print(train_sample.shape, test_sample.shape)

## Feature engineering

In [11]:

def process_features(raw_df,):
    ### Adds feature set to dataframe without vectorizing.
    
    df = raw_df.copy()
    
    ### ---------------------------- ###
    ###       Similarity Index       ###
    ### ---------------------------- ###
    
    # Function for measuring vector similarity - cosine distance between vectors.
    cosine_similarity = lambda vec1, vec2: 1 - spatial.distance.cosine(vec1, vec2)
    
    def calc_similarity(row):
        token1 = nlp(row.premise_en)
        token2 = nlp(row.hypothesis_en)
        return token1.similarity(token2)
    
    df["similarity"] = df.apply(calc_similarity, 1)
    
    
    ### ---------------------------- ###
    ###       L2 Vector Norms        ###
    ### ---------------------------- ###
    
    def calc_l2_premise(row):
        return nlp(row.premise_en).vector_norm

    def calc_l2_hypothesis(row):
        return nlp(row.hypothesis_en).vector_norm

    df["L2_premise"] = df.apply(calc_l2_premise, 1)
    df["L2_hypothesis"] = df.apply(calc_l2_hypothesis, 1)
    
    
    ### ---------------------------- ###
    ###      Sentiment Intensity     ###
    ### ---------------------------- ###
    
    sid = SentimentIntensityAnalyzer()
    
    df["sent_score_p"] = df["premise_en"].apply(lambda p: sid.polarity_scores(p)["compound"])
    df["sent_score_h"] = df["premise_en"].apply(lambda p: sid.polarity_scores(p)["compound"])

    df["sent_score_intensity_diff"] = np.abs(df["sent_score_p"]) - np.abs(df["sent_score_h"])
    
    return df


In [12]:
## Add features
train_processed = process_features(train)
test_processed = process_features(test)

  app.launch_new_instance()


In [13]:
# ## Split for cross validation

# train_split_pct = 0.7
# split_index = int(train_processed.shape[0] * train_split_pct)

# train_processed_A = train_processed.iloc[:split_index]
# train_processed_B = train_processed.iloc[split_index:]

# print(train_processed_A.shape, train_processed_B.shape)

## Vectorize

In [14]:
def vect_test_train(train, test, test_labels=True):
    ### Uses same vectorizer on test and training set so that they are ready for model with same dimensions.
    
    ### ------------------------------ ###
    ###        Count Vectorizer        ###
    ### ------------------------------ ###
    
    test_df = test.copy()
    train_df = train.copy()
    
    test_df = test_df.reset_index()
    train_df = train_df.reset_index()
    
    vect_h = CountVectorizer()
    vect_p = CountVectorizer()

    features = [
        "similarity",
        'L2_premise', 
        'L2_hypothesis', 
        'sent_score_p',
        'sent_score_h',
        'sent_score_intensity_diff',
        ]

    ## Train Vectorize and Concat with features
    X_train_premise = vect_p.fit_transform(train_df.premise_en)
    X_train_hyp = vect_h.fit_transform(train_df.hypothesis_en)
    X_train_features = train_df[features]
    
    X_train_premise = pd.DataFrame(X_train_premise.todense())
    X_train_hyp = pd.DataFrame(X_train_hyp.todense())
    X_train = pd.concat([X_train_premise, X_train_hyp, X_train_features], axis=1)
    X_train = X_train.reset_index()
    
    ## Test Vectorize and Concat with features
    X_test_premise = vect_p.transform(test_df.premise_en)
    X_test_hyp = vect_h.transform(test_df.hypothesis_en)
    X_test_features = test_df[features]

    
    X_test_premise = pd.DataFrame(X_test_premise.todense())
    X_test_hyp = pd.DataFrame(X_test_hyp.todense())
    X_test = pd.concat([X_test_premise, X_test_hyp, X_test_features], axis=1)
    X_test = X_test.reset_index()
    
    ## Labels and output 
    y_train = train_df["label"]
    
    if test_labels:
        y_test = test_df["label"]
        return X_train, y_train, X_test, y_test
    else:
        return X_train, y_train, X_test, None

In [17]:
X_train, y_train, X_test, y_test = vect_test_train(train=train_processed, test=test_processed, test_labels=False) 


In [18]:
test_cols = X_test.columns
train_cols = X_train.columns

## Train and test

Note that feature function outputs a sparse matrix, so it needs to be returned to a dense matrix to avoid dimension mismatch error when using trained instance on test data.

#### Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier
cf = RandomForestClassifier(n_estimators=500, min_samples_leaf=20, random_state=1)
cf.fit(X_train, y_train)
predictions = cf.predict(X_train)
print("Train accuracy: ", accuracy_score(y_train, predictions))

test_predictions = cf.predict(X_test)

# print("Test accuracy: ", accuracy_score(y_test, test_predictions))

Train accuracy:  0.5558580858085809


In [27]:
test_predictions = pd.Series(test_predictions)

In [29]:
submission = pd.DataFrame([train['id'], test_predictions])