## Predict Duplicate using basic ML + NLP techniques

I am trying to predict the duplicate sentences using vector similarity calculations and NLP technique in this module and its other forked versions.

BOW + Cosine/Euclidean/Manhattan/Jaccard/Minskowiski

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

WordEmbeddings
test.csv
train.csv



## Reading train data, Cleaning

*Reading Training Data ,
Removing duplicates , 
Removing NULL values*

Using pandas read_csv command to read data from train files.  And doing some basic cleanup on the data by removing any duplicates or null values that may be present in the data. 

Note: Reducing the size of the data set so that the Kernel memory does not run out

In [8]:
from sklearn.model_selection import train_test_split

def read_data():
    df = pd.read_csv("./input/train.csv")
#     df = pd.read_csv("../input/train.csv", nrows=20000)
    print ("Shape of base training File = ", df.shape)
    # Remove missing values and duplicates from training data
    df.drop_duplicates(inplace=True)
    df.dropna(inplace=True)
    print("Shape of base training data after cleaning = ", df.shape)
    return df

df_train = read_data()
# df_train, df_test = train_test_split(df, test_size = 0.02)
print (df_train.head(2))
# print (df_test.shape)

Shape of base training File =  (404290, 6)
Shape of base training data after cleaning =  (404288, 6)
   id  qid1  qid2                                          question1  \
0   0     1     2  What is the step by step guide to invest in sh...   
1   1     3     4  What is the story of Kohinoor (Koh-i-Noor) Dia...   

                                           question2  is_duplicate  
0  What is the step by step guide to invest in sh...             0  
1  What would happen if the Indian government sto...             0  


As we can see from the above output the file contains six columns

- **qid1** & **qid2**  - which contains the unique id assigned to the question
- **question1** & **question2** - which contains the actual questions
- **is_duplicate** - which contains information if the question1 and 2 are duplicate or not

## EDA ##

Some EDA on the data to get a look and feel about the data. Here we are trying to see the distribution of output data. Duplicate questions available etc.

In [9]:
from collections import Counter
import matplotlib.pyplot as plt
import operator

def eda(df):
    print ("Duplicate Count = %s , Non Duplicate Count = %s" 
           %(df.is_duplicate.value_counts()[1],df.is_duplicate.value_counts()[0]))
    
    question_ids_combined = df.qid1.tolist() + df.qid2.tolist()
    
    print ("Unique Questions = %s" %(len(np.unique(question_ids_combined))))
    
    question_ids_counter = Counter(question_ids_combined)
    sorted_question_ids_counter = sorted(question_ids_counter.items(), key=operator.itemgetter(1))
    question_appearing_more_than_once = [i for i in question_ids_counter.values() if i > 1]
    print ("Count of Quesitons appearing more than once = %s" %(len(question_appearing_more_than_once)))
    
    
eda(df_train)

Duplicate Count = 149263 , Non Duplicate Count = 255025
Unique Questions = 537931
Count of Quesitons appearing more than once = 111778


## Train Dictionary ##

First we will tokenize the sentences to extract words from the question. Lets also apply porter stemmer to break down words into their basic form. This should help us increase the accuracy of the system.

Then we use gensims to train a dictionary of words available in the corpus. We are training the dictionary based on the Bag Of Words concept. Gensims dictionary will assign a id to each word which we can use later to convert documents into vectors. 

Also, filter extremes to remove words appearing less than 5times in the corpus or in more than 80% of the questions.

In [4]:
import re
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import *

words = re.compile(r"\w+",re.I)
stopword = stopwords.words('english')
stemmer = PorterStemmer()

def tokenize_questions(df):
    question_1_tokenized = []
    question_2_tokenized = []

    for q in df.question1.tolist():
        question_1_tokenized.append([stemmer.stem(i.lower()) for i in words.findall(q) if i not in stopword])

    for q in df.question2.tolist():
        question_2_tokenized.append([stemmer.stem(i.lower()) for i in words.findall(q) if i not in stopword])

    df["Question_1_tok"] = question_1_tokenized
    df["Question_2_tok"] = question_2_tokenized
    
    return df

def train_dictionary(df):
    
    questions_tokenized = df.Question_1_tok.tolist() + df.Question_2_tok.tolist()
    
    dictionary = corpora.Dictionary(questions_tokenized)
    dictionary.filter_extremes(no_below=5, no_above=0.8)
    dictionary.compactify()
    
    return dictionary
    
df_train = tokenize_questions(df_train)
dictionary = train_dictionary(df_train)
print ("No. of words in the dictionary = %s" %len(dictionary.token2id))

df_test = tokenize_questions(df_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


No of words in the dictionary = 7781


As we can see that the number of unique words in the dictionary after filtering are 4831. 
This would be the size of each of the vector in the question set.

## Create Vector##

Here we are using the simple method of Bag Of Words Technique to convert sentences into vectors. There are two vector matrices thus created where each of the matrix is a sparse matrix to save  memory in the system.

In [5]:
def get_vectors(df, dictionary):
    
    question1_vec = [dictionary.doc2bow(text) for text in df.Question_1_tok.tolist()]
    question2_vec = [dictionary.doc2bow(text) for text in df.Question_2_tok.tolist()]
    
    question1_csc = gensim.matutils.corpus2csc(question1_vec, num_terms=len(dictionary.token2id))
    question2_csc = gensim.matutils.corpus2csc(question2_vec, num_terms=len(dictionary.token2id))
    
    return question1_csc.transpose(),question2_csc.transpose()


q1_csc, q2_csc = get_vectors(df_train, dictionary)

print (q1_csc.shape)
print (q2_csc.shape)

(49000, 7781)
(49000, 7781)


As we can see each of the matrix is of size 
404288 X 30114   = > (size of the training data) X (no of words in the dictionary)

## Define Similarity Calculation Fucntions##

Here we have defined various Distance calculation functions for 

 - Cosine Distance
 - Euclidean Distance
 - Manhattan Distance
 - Jaccard Distance
 - Minkowski Distance

As Eucledian, Manhattan and Minkowski Distance may go beyond 1 we must scale them down between0 - 1 , for that we are using MinMaxScaler and training them on training data.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity as cs
from sklearn.metrics.pairwise import manhattan_distances as md
from sklearn.metrics.pairwise import euclidean_distances as ed
from sklearn.metrics import jaccard_similarity_score as jsc
from sklearn.neighbors import DistanceMetric
from sklearn.preprocessing import MinMaxScaler

minkowski_dis = DistanceMetric.get_metric('minkowski')
mms_scale_man = MinMaxScaler()
mms_scale_euc = MinMaxScaler()
mms_scale_mink = MinMaxScaler()

def get_similarity_values(q1_csc, q2_csc):
    cosine_sim = []
    manhattan_dis = []
    eucledian_dis = []
    jaccard_dis = []
    minkowsk_dis = []
    
    for i,j in zip(q1_csc, q2_csc):
        sim = cs(i,j)
        cosine_sim.append(sim[0][0])
        sim = md(i,j)
        manhattan_dis.append(sim[0][0])
        sim = ed(i,j)
        eucledian_dis.append(sim[0][0])
        i_ = i.toarray()
        j_ = j.toarray()
        try:
            sim = jsc(i_,j_)
            jaccard_dis.append(sim)
        except:
            jaccard_dis.append(0)
            
        sim = minkowski_dis.pairwise(i_,j_)
        minkowsk_dis.append(sim[0][0])
    
    return cosine_sim, manhattan_dis, eucledian_dis, jaccard_dis, minkowsk_dis    


# cosine_sim = get_cosine_similarity(q1_csc, q2_csc)
cosine_sim, manhattan_dis, eucledian_dis, jaccard_dis, minkowsk_dis = get_similarity_values(q1_csc, q2_csc)
print ("cosine_sim sample= \n", cosine_sim[0:2])
print ("manhattan_dis sample = \n", manhattan_dis[0:2])
print ("eucledian_dis sample = \n", eucledian_dis[0:2])
print ("jaccard_dis sample = \n", jaccard_dis[0:2])
print ("minkowsk_dis sample = \n", minkowsk_dis[0:2])

eucledian_dis_array = np.array(eucledian_dis).reshape(-1,1)
manhattan_dis_array = np.array(manhattan_dis).reshape(-1,1)
minkowsk_dis_array = np.array(minkowsk_dis).reshape(-1,1)
    
manhattan_dis_array = mms_scale_man.fit_transform(manhattan_dis_array)
eucledian_dis_array = mms_scale_euc.fit_transform(eucledian_dis_array)
minkowsk_dis_array = mms_scale_mink.fit_transform(minkowsk_dis_array)

eucledian_dis = eucledian_dis_array.flatten()
manhattan_dis = manhattan_dis_array.flatten()
minkowsk_dis = minkowsk_dis_array.flatten()

cosine_sim sample= 
 [0.59999999999999998, 0.48997894350611149]
manhattan_dis sample = 
 [4.0, 21.0]
eucledian_dis sample = 
 [2.0, 4.7958315233127191]
jaccard_dis sample = 
 [0.42857142857142855, 0]
minkowsk_dis sample = 
 [2.0, 4.7958315233127191]


## Calculate Log Loss##

Here we will use log loss formula to set a base criteria as to what accuracy our algorithm is able to achieve in terms of log loss which is the competition calucation score.

We will also use Eucledian, Manhattan , Minkowski and Jaccard to calculate the similarity and then have a look at the log loss from each one of them. These are the five most widely used similarity classes used in Data Science so Lets use each one of them to see which performs best.

In [7]:
from sklearn.metrics import log_loss

def calculate_logloss(y_true, y_pred):
    loss_cal = log_loss(y_true, y_pred)
    return loss_cal

q1_csc_test, q2_csc_test = get_vectors(df_test, dictionary)
y_pred_cos, y_pred_man, y_pred_euc, y_pred_jac, y_pred_mink = get_similarity_values(q1_csc_test, q2_csc_test)
y_true = df_test.is_duplicate.tolist()

y_pred_man_array = mms_scale_man.transform(np.array(y_pred_man).reshape(-1,1))
y_pred_man = y_pred_man_array.flatten()

y_pred_euc_array = mms_scale_euc.transform(np.array(y_pred_euc).reshape(-1,1))
y_pred_euc = y_pred_euc_array.flatten()

y_pred_mink_array = mms_scale_mink.transform(np.array(y_pred_mink).reshape(-1,1))
y_pred_mink = y_pred_mink_array.flatten()

logloss = calculate_logloss(y_true, y_pred_cos)
print ("The calculated log loss value on the test set for cosine sim is = %f" %logloss)

logloss = calculate_logloss(y_true, y_pred_man)
print ("The calculated log loss value on the test set for manhattan sim is = %f" %logloss)

logloss = calculate_logloss(y_true, y_pred_euc)
print ("The calculated log loss value on the test set for euclidean sim is = %f" %logloss)

logloss = calculate_logloss(y_true, y_pred_jac)
print ("The calculated log loss value on the test set for jaccard sim is = %f" %logloss)

logloss = calculate_logloss(y_true, y_pred_mink)
print ("The calculated log loss value on the test set for minkowski sim is = %f" %logloss)

The calculated log loss value on the test set for cosine sim is = 1.422442
The calculated log loss value on the test set for manhattan sim is = 3.507409
The calculated log loss value on the test set for euclidean sim is = 3.142903
The calculated log loss value on the test set for jaccard sim is = 3.406602
The calculated log loss value on the test set for minkowski sim is = 3.142903


Although this test is run on a small set it indicates that cosine similarity is working as the best parameter for finding duplicate among sentences.

## Adding Machine Learning Models to improve logloss accuracy ##

Now in order to improve on the accuracy let us feed the results from these similarity coefficients to a Random Forest Regressor and Support Vector Regressor and check if we can improve on the log loss values.

Not concentrating on the hyper parameters of RF and SVM we are just allowing the algorithms to run as it is.

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

X_train = pd.DataFrame({"cos" : cosine_sim, "man" : manhattan_dis, "euc" : eucledian_dis, "jac" : jaccard_dis, "min" : minkowsk_dis})
y_train = df_train.is_duplicate

X_test = pd.DataFrame({"cos" : y_pred_cos, "man" : y_pred_man, "euc" : y_pred_euc, "jac" : y_pred_jac, "min" : y_pred_mink})
y_test = y_true

rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)

svr = SVR()
svr.fit(X_train,y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Now that we have trained the model . Lets predict duplicate from models and calcualte logloss from them to check if their is any improvement in the logloss values.

In [9]:
y_rfr_predicted = rfr.predict(X_test)
y_svr_predicted = svr.predict(X_test)

logloss_rfr = calculate_logloss(y_test, y_rfr_predicted)
logloss_svr = calculate_logloss(y_test, y_svr_predicted)

print ("The calculated log loss value on the test set using RFR is = %f" %logloss_rfr)
print ("The calculated log loss value on the test set using SVR is = %f" %logloss_svr)

The calculated log loss value on the test set using RFR is = 0.733248
The calculated log loss value on the test set using SVR is = 0.660548


As we can see from the above results that we are able to bring down the logloss values to nearly **half** of what was predicted earlier using base similarity techniques. 

##Adding other features like word count ##