# Quora Question Pair Similarity
### Kaggle Competition link: https://www.kaggle.com/c/quora-question-pairs

<p>We have already built the features to train the model on. Here we load the data with all 627 features. We first build a random (or simple, e.g. Naive Bayes) baseline model, then try out different machine learning algorithms and compare each against that baseline. After that, we choose the best one and tune it so that it generalizes to future data.</p>
<p>The metrics we will evaluate the models on are:</p>
<ul>
<li>Log-loss</li>
<li>Binary confusion matrix</li>
</ul>
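To make the two metrics concrete, here is a minimal NumPy sketch of how they can be computed by hand. The helper names `binary_log_loss` and `binary_confusion_matrix` and the toy labels are illustrative assumptions, not part of the notebook; the notebook itself uses `log_loss` and `confusion_matrix` from scikit-learn.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for overconfident predictions
    p_pos = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos)))

def binary_confusion_matrix(y_true, y_pred):
    """2x2 count matrix: rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    # np.add.at accumulates correctly when the same (row, col) pair repeats
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = [0, 0, 1, 1]          # toy labels, not from the Quora data
p_pos = [0.1, 0.6, 0.8, 0.4]   # toy predicted P(is_duplicate = 1)
y_pred = (np.array(p_pos) >= 0.5).astype(int)

toy_loss = binary_log_loss(y_true, p_pos)
toy_cm = binary_confusion_matrix(y_true, y_pred)
print(toy_loss)
print(toy_cm)
```

Log-loss penalizes confident wrong probabilities heavily, while the confusion matrix only looks at the hard class decisions, so the two metrics complement each other.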

Our strategy is:
1. Load the data
2. Split data into train and test sets (70:30)
3. <b>Build random model:</b> A model that randomly assigns probabilities.
4. <b>Build Logistic Regression:</b> A statistical model that uses a logistic function to model the probability of a binary response based on one or more predictor variables.
5. <b>Build Naive Bayes:</b> A probabilistic algorithm based on Bayes' theorem that assumes the input features are independent of one another given the class.
6. <b>Build Support Vector Machines:</b> A model that works by finding the hyperplane that best separates the different classes of data points.
7. <b>Build Gradient Boosting:</b> A powerful ensemble method that combines multiple weak models to create a strong classifier.



In [44]:
# Imports

# General
from datetime import datetime 

# Data 
import pandas as pd
import numpy as np 
import sqlite3
from sqlalchemy import create_engine
from collections import Counter

# ML
from sklearn.model_selection import train_test_split


# Metrics
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix


#### 1. Load data from SQLite

In [11]:
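start = datetime.now()
conn = None
try:
    conn = sqlite3.connect("train.db")
    data = pd.read_sql_query("SELECT * FROM train_data", conn)
    print("Data loaded!\nTime taken: {0}".format(datetime.now()-start))
except Exception as e:
    # Report the failure; `data` stays undefined if loading did not succeed
    print(e)
finally:
    # Always release the connection; a read-only SELECT needs no commit
    if conn is not None:
        conn.close()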

Data loaded!
Time taken: 0:04:43.249457


In [12]:
print("Shape of data: {0}".format(data.shape))

Shape of data: (404290, 634)


In [17]:
# Remove the first 6 columns, which are not needed for modelling; this keeps the target plus the 627 features
data = data.iloc[:,6:]
print("Shape of data after removing unnecessary columns: {0}".format(data.shape))

Shape of data after removing unnecessary columns: (404290, 628)


In [20]:
data.describe()

Unnamed: 0,is_duplicate,q1_frequency,q2_frequency,q1_length,q2_length,q1_tokens_count,q2_tokens_count,q1_words_count,q2_words_count,q1_nonstopwords_count,...,q2_feat_291,q2_feat_292,q2_feat_293,q2_feat_294,q2_feat_295,q2_feat_296,q2_feat_297,q2_feat_298,q2_feat_299,q2_feat_300
count,404290.0,404290.0,404290.0,404289.0,404288.0,404290.0,404290.0,404290.0,404290.0,404290.0,...,404290.0,404290.0,404290.0,404290.0,404290.0,404290.0,404290.0,404290.0,404290.0,404290.0
mean,0.369198,2.827609,3.046961,59.536856,60.108663,12.438626,12.697482,10.944592,11.18512,5.646781,...,46.763162,-28.009493,12.527336,-11.368422,-43.681345,-3.424754,26.23938,-20.802047,-66.431884,36.273753
std,0.482588,4.487418,6.026871,29.940546,33.86369,6.085369,7.08056,5.431949,6.311076,3.074383,...,61.070267,50.926218,68.96311,60.353412,56.719721,59.543628,57.218495,68.938349,66.916404,58.818664
min,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,-1181.601844,-869.310289,-1082.525471,-1091.648775,-904.862332,-768.730118,-536.614247,-940.066401,-2196.106435,-985.436538
25%,0.0,1.0,1.0,39.0,39.0,9.0,8.0,7.0,7.0,4.0,...,10.372415,-54.9931,-21.331203,-42.520671,-69.958454,-36.663614,-6.019517,-58.038648,-96.510577,0.655566
50%,0.0,1.0,1.0,52.0,51.0,11.0,11.0,10.0,10.0,5.0,...,41.029458,-26.450932,14.150357,-10.002014,-35.709216,-5.305916,22.305288,-18.329733,-57.013161,28.147001
75%,1.0,3.0,2.0,72.0,72.0,15.0,15.0,13.0,13.0,7.0,...,77.47386,-0.62113,50.02573,21.583519,-8.438641,27.29722,54.594615,18.195365,-24.500331,62.941765
max,1.0,50.0,120.0,623.0,1169.0,144.0,272.0,125.0,237.0,57.0,...,1487.891279,1156.781213,834.215001,852.355989,569.228259,907.38918,880.135961,868.905853,511.18026,1080.074251


#### 2. Split data into train and test sets (70:30)

In [22]:
# Split data into X & y first
X = data.drop('is_duplicate', axis=1)
y = data['is_duplicate']

print("Shape of X: {0}".format(X.shape))
print("Shape of y: {0}".format(y.shape))

Shape of X: (404290, 627)
Shape of y: (404290,)


In [23]:
# Split into train & test, stratifying on y so both sets keep the same class ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
print("Shape of X_train: {0}".format(X_train.shape))
print("Shape of X_test: {0}".format(X_test.shape))
print("Shape of y_train: {0}".format(y_train.shape))
print("Shape of y_test: {0}".format(y_test.shape))

Shape of X_train: (283003, 627)
Shape of X_test: (121287, 627)
Shape of y_train: (283003,)
Shape of y_test: (121287,)


In [34]:
print("Distribution of target variable in train")
train_counter = Counter(y_train)
train_len = len(y_train)
print("Class 0: {0} % \nClass 1: {1} %".format((train_counter[0]/train_len)*100, (train_counter[1]/train_len)*100))


print("\nDistribution of target variable in test")
test_counter = Counter(y_test)
test_len = len(y_test)
print("Class 0: {0} % \nClass 1: {1} %".format((test_counter[0]/test_len)*100, (test_counter[1]/test_len)*100))


Distribution of target variable in train
Class 0: 63.08025003268517 % 
Class 1: 36.919749967314836 %

Distribution of target variable in test
Class 0: 63.08013224830361 % 
Class 1: 36.91986775169639 %
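The near-identical class shares in train and test come from `stratify=y`. The sketch below illustrates this on synthetic data and also shows how fixing `random_state` makes the split repeatable. The synthetic arrays, the helper `split`, and the seed value 42 are assumptions for illustration; the notebook's own split does not set a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 1000 rows, roughly 37% positives
y_demo = (rng.random(1000) < 0.37).astype(int)
X_demo = rng.normal(size=(1000, 5))

def split(seed):
    # Same 70:30 stratified split as in the notebook, plus a fixed seed
    return train_test_split(
        X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=seed
    )

X_tr, X_te, y_tr, y_te = split(seed=42)
_, _, _, y_te_again = split(seed=42)

print("Positive rate overall:", y_demo.mean())
print("Positive rate train:  ", y_tr.mean())
print("Positive rate test:   ", y_te.mean())
print("Same test labels on rerun:", np.array_equal(y_te, y_te_again))
```

Without a fixed seed, every run produces a different split, so metric values reported later are not exactly reproducible.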


#### 3. Build random model
Here we will assign random class probabilities to each test data point and measure the resulting log-loss.<br>
The strategy we will follow is:
1. Generate a list of 2 random numbers for each test row
2. Divide each number by the sum of the pair, so the two values sum to 1 and can be read as class probabilities
3. Take the index of the larger of the 2 numbers in the list
4. This index is the predicted class of the given test row

In [53]:
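# Draw 2 uniform random numbers per test row and normalize each row to sum to 1
random_probs = np.random.rand(test_len, 2)
y_pred_prob = random_probs / random_probs.sum(axis=1, keepdims=True)

# The explicit eps=1e-15 argument is dropped: it matched the old default and is deprecated in newer scikit-learn releases
print("Test log-loss of random model: {0}".format(log_loss(y_test, y_pred_prob)))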

y_pred = np.argmax(y_pred_prob, axis=1)

print("\nTest accuracy score of random model: {0}".format(accuracy_score(y_test, y_pred)))

print("\nTest confusion matrix of random model: \n{0}".format(confusion_matrix(y_test, y_pred)))

    

Test log-loss of random model: 0.8883442618472261

Test accuracy score of random model: 0.4978522018023366

Test confusion matrix of random model: 
[[38017 38491]
 [22413 22366]]


We will take this as the benchmark for our future models: any useful model should reach a test log-loss below the 0.888 of the random model.
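As a complementary reference point (an addition for illustration, not something the notebook computes), one can also work out the log-loss of a constant predictor that always outputs the class prior. The function name `constant_prior_log_loss` is hypothetical, and the positive rate 0.3692 is rounded from the test distribution printed above.

```python
import numpy as np

def constant_prior_log_loss(pos_rate):
    """Log-loss of always predicting P(class 1) = pos_rate,
    assuming pos_rate is also the true share of class 1."""
    return float(-(pos_rate * np.log(pos_rate)
                   + (1 - pos_rate) * np.log(1 - pos_rate)))

pos_rate = 0.3692  # share of class 1 in the test split (rounded)
prior_loss = constant_prior_log_loss(pos_rate)
print("Log-loss of constant class-prior prediction: {0:.4f}".format(prior_loss))
```

This comes out at roughly 0.66, which is lower than the random model's 0.888, so it may serve as a stricter baseline when judging the trained models.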