# Project 1: Quora Question Pairs

## Description:

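This notebook uses NLP features (spaCy vector similarity and TF-IDF) to predict whether the two questions in each pair of the Quora Question Pairs dataset are duplicates. The data comes from https://www.kaggle.com/c/quora-question-pairs/data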

In [1]:
from pathlib import Path
import random

import spacy
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix


## Function definitions, Training Set Import, Preprocessing

Define helper functions that parse each question with spaCy and compute the similarity of each question pair

In [2]:


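def parse(nlp, ser):
    # Stream the texts through the spaCy pipeline in batches.
    # n_threads is deprecated since spaCy 2.1, so it is omitted here;
    # n_process is the current option if parallelism is needed.
    # Casting to str turns missing questions into the string 'nan'.
    return list(nlp.pipe(ser.astype(str).values))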


def get_similarity(docs1, docs2):
    # Similarity of the i-th document in docs1 with the i-th document in docs2
    return [doc1.similarity(doc2) for doc1, doc2 in zip(docs1, docs2)]
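
As a rough illustration of what the similarity score measures: by default, spaCy's Doc.similarity is the cosine similarity between the averaged word vectors of the two documents. The sketch below reproduces that idea with NumPy only. The 3-dimensional token vectors are made up for illustration and are not taken from en_core_web_lg.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors; 0.0 if either is all zeros."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))


# Hypothetical token vectors for two short questions (one row per token).
q1_tokens = np.array([[1.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0]])
q2_tokens = np.array([[1.0, 1.0, 2.0]])

# A document vector is the mean of its token vectors.
q1_vec = q1_tokens.mean(axis=0)
q2_vec = q2_tokens.mean(axis=0)

sim = cosine_similarity(q1_vec, q2_vec)
print(sim)
```

Because averaging discards word order and the questions in a pair tend to share most of their vocabulary, scores computed this way are often close to 1 even for non-duplicates.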



Load train.csv. To keep computation fast, randomly sample about 2.5% of the rows, or about 10,000 rows. The sample is drawn without a fixed seed, so the selected rows differ from run to run

In [3]:
csv = Path.cwd().joinpath('train.csv')
p = 0.025
df = pd.read_csv(csv,
                 index_col='id',
                 skiprows=lambda i: i>0 and random.random() > p)
df.head()

id,qid1,qid2,question1,question2,is_duplicate
25,51,52,What are some tips on making it through the jo...,What are some tips on making it through the jo...,0
42,85,86,"Can I make 50,000 a month by day trading?","Can I make 30,000 a month by day trading?",0
111,223,224,Is USA the most powerful country of the world?,Why is the USA the most powerful country of th...,0
231,463,464,Is drinking 4 liters of water each day unhealthy?,How many liters of water should I drink if I r...,0
287,574,575,If there will be a war between India and Pakis...,Who will win if a war starts between India and...,1
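
If a repeatable sample is wanted, one option is to drive the skiprows callable from a seeded generator. This is a minimal sketch, not the notebook's method: sample_csv is a hypothetical helper, and the in-memory CSV stands in for train.csv.

```python
import io
import random

import pandas as pd

# Tiny in-memory stand-in for train.csv (hypothetical rows).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1000))


def sample_csv(text, p, seed):
    """Read roughly a fraction p of the data rows, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return pd.read_csv(
        io.StringIO(text),
        index_col="id",
        # Row 0 is the header and is always kept; each data row is kept
        # with probability p.
        skiprows=lambda i: i > 0 and rng.random() > p,
    )


first = sample_csv(csv_text, p=0.025, seed=42)
second = sample_csv(csv_text, p=0.025, seed=42)
print(len(first), first.equals(second))
```

Calling random.seed(...) once before pd.read_csv in the cell above would have a similar effect while keeping the original lambda unchanged.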


Calculate the cosine similarity between question 1 and question 2 from their spaCy document vectors, then concatenate the two questions into a single string for TF-IDF generation

In [4]:
nlp = spacy.load('en_core_web_lg')

q1_parsed = parse(nlp, df['question1'])
q2_parsed = parse(nlp, df['question2'])

df['similarity'] = get_similarity(q1_parsed, q2_parsed)
# Cast both sides to str so a missing question2 does not turn the whole row into NaN
df['q_concat'] = df['question1'].map(str) + ' ' + df['question2'].map(str)

df.head()


id,qid1,qid2,question1,question2,is_duplicate,similarity,q_concat
25,51,52,What are some tips on making it through the jo...,What are some tips on making it through the jo...,0,0.989188,What are some tips on making it through the jo...
42,85,86,"Can I make 50,000 a month by day trading?","Can I make 30,000 a month by day trading?",0,0.998203,"Can I make 50,000 a month by day trading? Can ..."
111,223,224,Is USA the most powerful country of the world?,Why is the USA the most powerful country of th...,0,0.994251,Is USA the most powerful country of the world?...
231,463,464,Is drinking 4 liters of water each day unhealthy?,How many liters of water should I drink if I r...,0,0.866125,Is drinking 4 liters of water each day unhealt...
287,574,575,If there will be a war between India and Pakis...,Who will win if a war starts between India and...,1,0.984085,If there will be a war between India and Pakis...


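Split the data into train and test sets, stratifying on is_duplicate so that both sets keep the same class balance. Only the similarity score and the concatenated text are kept as features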

In [5]:
x = df.drop(['question1', 
             'question2', 
             'qid1', 
             'qid2', 
             'is_duplicate'], axis=1)
y = df['is_duplicate']

x_train, x_test, y_train, y_test = train_test_split(
        x, y, stratify=y, random_state=42
    )

x_train.head()

id,similarity,q_concat
300308,0.930127,"How do I use ""would"", ""could"", ""should"", ""woul..."
377555,0.969686,What are the differences between permittivity ...
75465,0.987111,What would you change about Quora and why? Wha...
247058,0.952029,What are the best books available for data str...
269218,0.868851,What is to be written in physical education pr...
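
The effect of stratify=y can be checked on toy data. The sketch below uses hypothetical labels with about 37% duplicates, close to the proportion in this notebook's test set, and compares the duplicate rate in the two splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 370 duplicates and 630 non-duplicates.
y = pd.Series([1] * 370 + [0] * 630)
x = pd.DataFrame({"similarity": range(len(y))})

# Same call pattern as above; the default test size is 25% of the rows.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify=y, random_state=42
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(len(x_train), len(x_test), train_rate, test_rate)
```

With stratification, both rates stay within a fraction of a percentage point of the overall 37%, so accuracy on the test set is comparable to the majority-class baseline computed on the full sample.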


## TF-IDF Vectorizer

Generate TF-IDF features for the train and test sets. The vectorizer is fit on the training text only and then applied to the test text

In [6]:
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(
        x_train['q_concat'].values.astype('U')
    )
test_tfidf = vectorizer.transform(
        x_test['q_concat'].values.astype('U')
    )

x_train_bow = pd.merge(
        x_train.drop('q_concat', axis=1), 
        pd.DataFrame(train_tfidf.todense(), index=x_train.index), 
        on=x_train.index
    ).set_index('key_0')
x_test_bow = pd.merge(
        x_test.drop('q_concat', axis=1), 
        pd.DataFrame(test_tfidf.todense(), index=x_test.index), 
        on=x_test.index
    ).set_index('key_0')

x_train_bow.head()

key_0,similarity,0,1,2,3,4,5,6,7,8,...,13059,13060,13061,13062,13063,13064,13065,13066,13067,13068
300308,0.930127,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
377555,0.969686,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75465,0.987111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
247058,0.952029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
269218,0.868851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
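
The dense frame above holds more than 13,000 columns that are almost entirely zero. A possible alternative, sketched here on toy data, is to keep everything sparse and prepend the similarity column with scipy.sparse.hstack. The texts and similarity values are hypothetical; scikit-learn estimators such as MultinomialNB accept sparse input directly.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical concatenated question pairs and their similarity scores.
texts = [
    "how do i learn python how can i learn python",
    "what is machine learning what is deep learning",
    "is water wet why is the sky blue",
]
similarity = np.array([0.99, 0.95, 0.70])

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per pair

# Column 0 is the similarity score; the remaining columns are TF-IDF weights.
features = hstack([csr_matrix(similarity.reshape(-1, 1)), tfidf]).tocsr()
print(features.shape)
```

This trades the labelled DataFrame for a plain matrix, so row order has to be tracked separately, but memory use scales with the number of non-zero entries rather than rows times vocabulary size.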


## Model 1: Multinomial Naive Bayes

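Using the cosine similarity and the TF-IDF features as inputs, Multinomial Naive Bayes reaches about 69% accuracy, only a few points above the 62.7% obtained by always predicting non-duplicate. The confusion matrix (rows are true labels, columns are predictions) shows a strong bias toward non-duplicate predictions: only 211 of the 960 true duplicates are identified.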

In [7]:
mnb = MultinomialNB()
mnb.fit(x_train_bow, y_train)
preds = mnb.predict(x_test_bow)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

0.6896149358226371
[[1562   49]
 [ 749  211]]
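
The confusion matrix can be unpacked into per-class figures. The sketch below uses the numbers printed above and scikit-learn's layout (rows are true labels, columns are predictions, ordered non-duplicate then duplicate).

```python
import numpy as np

# Confusion matrix printed above for Multinomial Naive Bayes.
cm = np.array([[1562, 49],
               [749, 211]])

tn, fp = cm[0]  # true non-duplicates: predicted non-duplicate / duplicate
fn, tp = cm[1]  # true duplicates: predicted non-duplicate / duplicate
total = cm.sum()

accuracy = (tn + tp) / total
baseline = cm[0].sum() / total          # always predicting non-duplicate
recall = tp / (tp + fn)                 # share of true duplicates found
precision = tp / (tp + fp)              # share of duplicate predictions that are right
predicted_duplicate_share = (fp + tp) / total

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
print(f"recall={recall:.3f} precision={precision:.3f}")
print(f"predicted duplicate share={predicted_duplicate_share:.3f}")
```

The model labels only about 10% of the test pairs as duplicates although roughly 37% of them are, which is why recall on the duplicate class is low even though precision is fairly high.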


## Feature transformation: Singular Value Decomposition

Using sklearn's TruncatedSVD class, reduce the TF-IDF features to a lower-dimensional space of 100 components. The decomposition is fit on the training set only and then applied to the test set

In [8]:
svd = TruncatedSVD(n_components=100, random_state=42)
train_tfidf_lsa = svd.fit_transform(train_tfidf)
test_tfidf_lsa = svd.transform(test_tfidf)

x_train_lsa = pd.merge(
        x_train.drop('q_concat', axis=1), 
        pd.DataFrame(train_tfidf_lsa, index=x_train.index), 
        on=x_train.index
    ).set_index('key_0')
x_test_lsa = pd.merge(
        x_test.drop('q_concat', axis=1), 
        pd.DataFrame(test_tfidf_lsa, index=x_test.index), 
        on=x_test.index
    ).set_index('key_0')

x_train_lsa.head()

key_0,similarity,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
300308,0.930127,0.187575,0.045253,-0.089859,0.063262,-0.088426,-0.001472,-0.016887,0.033564,0.049702,...,0.061069,-0.041443,-0.009761,0.077965,-0.044491,0.00053,-0.018479,0.084164,0.013864,0.006785
377555,0.969686,0.159295,-0.12814,-0.062896,0.137959,0.007423,-0.138207,0.031591,0.156295,0.070564,...,-0.008332,-0.003728,-0.002333,0.002204,3e-05,0.005239,0.001501,-0.005327,0.000386,-0.012356
75465,0.987111,0.209598,0.076719,-0.294874,-0.154684,0.049238,0.066283,0.186079,0.051552,0.078945,...,0.014643,-0.023534,-0.002114,0.037925,-0.046528,0.131369,-0.077564,0.058452,0.01486,-0.013146
247058,0.952029,0.209052,-0.117752,0.076499,-0.107127,0.065773,-0.048192,-0.129439,0.0506,0.009955,...,0.041605,0.030666,0.030029,0.054183,-0.023131,0.022277,0.046116,-0.046678,-0.007315,0.050097
269218,0.868851,0.142602,0.038235,0.058988,0.057774,-0.019644,0.017936,-0.0182,-0.025484,-0.010537,...,-0.003668,-0.018677,-0.005009,-0.022032,-0.020165,-0.005459,0.005218,-0.007658,-0.006429,0.029034
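
One way to judge whether 100 components is a sensible choice is to look at how much variance the components retain. The sketch below shows the check on a tiny hypothetical corpus with 2 components; on the notebook's data the same attribute would be read from the fitted svd object.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical concatenated question pairs.
texts = [
    "how do i learn python how can i learn python",
    "what is machine learning what is deep learning",
    "how to cook rice how do i cook rice",
    "is water wet why is the sky blue",
    "who won the world cup how tall is mount everest",
    "what is the capital of france how do planes fly",
]

tfidf = TfidfVectorizer().fit_transform(texts)

# Project the sparse TF-IDF matrix onto 2 latent components.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

# Fraction of the TF-IDF variance retained by the kept components.
retained = svd.explained_variance_ratio_.sum()
print(reduced.shape, retained)
```

Unlike the raw TF-IDF weights, the reduced features can be negative, so they are not suitable inputs for MultinomialNB.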


## Model 2: Support Vector Machine

Using the cosine similarity and the SVD-reduced TF-IDF features as inputs, a linear Support Vector Machine classifier improves on Multinomial Naive Bayes, raising accuracy from about 69% to about 73%. It is also much less biased toward non-duplicate predictions: it identifies 486 of the 960 true duplicates, compared with 211 for Naive Bayes.

In [9]:
svc = SVC(kernel='linear', random_state=42).fit(x_train_lsa, y_train)
preds = svc.predict(x_test_lsa)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

0.7288992609879424
[[1388  223]
 [ 474  486]]
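
The TF-IDF, SVD, and SVM steps can also be chained into a single scikit-learn pipeline, which guarantees that the vectorizer and the decomposition are fit on training data only. This is a sketch of one possible arrangement, not the notebook's code: the toy DataFrame is hypothetical, and n_components is set to 2 only because the toy vocabulary is small.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in for the notebook's features and labels.
toy = pd.DataFrame({
    "similarity": [0.99, 0.98, 0.97, 0.80, 0.75, 0.70],
    "q_concat": [
        "how do i learn python how can i learn python",
        "what is machine learning what does machine learning mean",
        "how to cook rice how do i cook rice",
        "is water wet why is the sky blue",
        "who won the world cup how tall is mount everest",
        "what is the capital of france how do planes fly",
    ],
    "is_duplicate": [1, 1, 1, 0, 0, 0],
})

# Text branch: TF-IDF followed by truncated SVD.
lsa = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
])

features = ColumnTransformer([
    # A bare string selects the column as 1-D text, as TfidfVectorizer expects.
    ("lsa", lsa, "q_concat"),
    # A list selects a 2-D slice that is passed through unchanged.
    ("sim", "passthrough", ["similarity"]),
])

model = Pipeline([
    ("features", features),
    ("svc", SVC(kernel="linear", random_state=42)),
])

x = toy[["similarity", "q_concat"]]
model.fit(x, toy["is_duplicate"])
preds = model.predict(x)
print(preds)
```

A pipeline of this shape can be passed directly to cross-validation or grid search utilities, which makes it easier to tune the number of SVD components or the SVM's C parameter.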
