# Quora Question Pair Similarity
### Kaggle Competition link: https://www.kaggle.com/c/quora-question-pairs

<p>The features for training have already been built. Here we load the data with all 627 features. We first build a random baseline model, then try out different machine learning algorithms (including a simple Naive Bayes model) and compare them against the baseline. After that, we will choose the best one and tune it so that it generalizes to future data.</p>
<p>The metrics we will evaluate the models on are:</p>
<ul>
<li>Log-loss</li>
<li>Binary confusion matrix</li>
</ul>
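
As a quick illustration of the two metrics, here is a minimal sketch on made-up labels and probabilities (the arrays are illustrative and not taken from the dataset). It computes binary log-loss by hand, checks it against `sklearn.metrics.log_loss`, and shows how scikit-learn lays out a binary confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

# Toy data (illustrative only): true labels and predicted P(class = 1)
y_true = np.array([0, 0, 0, 1, 1])
p_class1 = np.array([0.1, 0.2, 0.6, 0.8, 0.3])

# Binary log-loss: mean of -[y*log(p) + (1-y)*log(1-p)]
manual_log_loss = -np.mean(
    y_true * np.log(p_class1) + (1 - y_true) * np.log(1 - p_class1)
)
sklearn_log_loss = log_loss(y_true, p_class1)

# Threshold at 0.5 to get hard labels for the confusion matrix
y_hat = (p_class1 >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_hat)

# scikit-learn layout: rows = true class, columns = predicted class,
# so cm.ravel() yields (TN, FP, FN, TP)
tn, fp, fn, tp = cm.ravel()
print(manual_log_loss, sklearn_log_loss)
print(cm)
```

Log-loss is computed from predicted probabilities, while the confusion matrix is computed from hard class labels, so the two metrics answer different questions about the same model.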

Our strategy is:
1. Load the data
2. Split data into train and test sets (70:30)
3. Normalize data
4. <b>Build random model:</b> A model that randomly assigns probabilities.
5. Apply models with default parameters:<br>
   i. <b>Build Logistic Regression:</b> A statistical model that uses a logistic function to model the probability of a binary response based on one or more predictor variables.<br>
   ii. <b>Build Naive Bayes:</b> A probabilistic algorithm based on Bayes' theorem that assumes the features in the input data are independent of each other.<br>
   iii. <b>Build Support Vector Machines:</b> Works by finding the hyperplane that best separates the different classes of data points.<br>
   iv. <b>Build Gradient Boosting:</b> An ensemble method that combines multiple weak models to create a strong classifier.<br>



In [2]:
# Imports

# General
from datetime import datetime 
import pickle

# Data 
import pandas as pd
import numpy as np 
import sqlite3
from sqlalchemy import create_engine
from collections import Counter
from sklearn.model_selection import train_test_split

# Vectorization
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler 
from sklearn.impute import SimpleImputer

# Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Metrics
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix


#### 1. Load data from SQLite

In [3]:
start = datetime.now()
try:
    conn = sqlite3.connect("train.db")
    # Sample 100,000 rows at random; a read-only query needs no commit
    data = pd.read_sql_query("SELECT * FROM train_data ORDER BY RANDOM() LIMIT 100000", conn)
    conn.close()
    print("Data loaded!\nTime taken: {0}".format(datetime.now()-start))
except Exception as e:
    print(e)

Data loaded!
Time taken: 0:02:57.968335
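
The loading step can also be written so that the connection is closed even when the query fails. The sketch below is one possible variant, not the notebook's own code: it uses an in-memory SQLite database with a small made-up `train_data` table as a stand-in for `train.db`, and passes the sample size as a query parameter.

```python
import sqlite3
from contextlib import closing

import numpy as np
import pandas as pd

# Hypothetical stand-in for train.db: a small in-memory table
demo = pd.DataFrame({"id": np.arange(1000), "is_duplicate": np.arange(1000) % 2})

# closing() guarantees conn.close() runs even if the query raises
with closing(sqlite3.connect(":memory:")) as conn:
    demo.to_sql("train_data", conn, index=False)
    sample = pd.read_sql_query(
        "SELECT * FROM train_data ORDER BY RANDOM() LIMIT ?",
        conn,
        params=(100,),
    )

print(sample.shape)
```

Note that `ORDER BY RANDOM()` draws a different sample on every run, so results downstream will vary slightly between executions.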


In [4]:
print("Shape of data: {0}".format(data.shape))

Shape of data: (100000, 634)


In [5]:
# Drop the first 6 columns, which are not used as model inputs
data = data.iloc[:,6:]
print("Shape of data after removing unnecessary columns: {0}".format(data.shape))

Shape of data after removing unnecessary columns: (100000, 628)


In [6]:
data.describe()

| | is_duplicate | q1_frequency | q2_frequency | q1_length | q2_length | q1_tokens_count | q2_tokens_count | q1_words_count | q2_words_count | q1_nonstopwords_count | ... | q2_feat_291 | q2_feat_292 | q2_feat_293 | q2_feat_294 | q2_feat_295 | q2_feat_296 | q2_feat_297 | q2_feat_298 | q2_feat_299 | q2_feat_300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | ... | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 0.3691 | 2.82357 | 3.04351 | 59.45484 | 60.11191 | 12.42497 | 12.69636 | 10.9323 | 11.18396 | 5.64148 | ... | 46.569629 | -27.794024 | 12.598192 | -11.504296 | -43.654854 | -3.496213 | 26.121807 | -20.646194 | -66.488375 | 36.155795 |
| std | 0.482563 | 4.468338 | 6.080107 | 29.794226 | 33.917819 | 6.056198 | 7.102448 | 5.406961 | 6.331815 | 3.059856 | ... | 60.911657 | 51.105032 | 68.906294 | 60.590214 | 56.855171 | 59.871539 | 57.103065 | 69.285594 | 67.43977 | 58.390357 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | -597.424782 | -711.233365 | -721.021765 | -998.05776 | -688.422742 | -768.730118 | -369.922146 | -878.340893 | -2196.106435 | -364.145416 |
| 25% | 0.0 | 1.0 | 1.0 | 39.0 | 39.0 | 9.0 | 8.0 | 7.0 | 7.0 | 4.0 | ... | 10.06801 | -54.789893 | -21.192799 | -42.669529 | -69.730615 | -36.950131 | -6.18424 | -58.039417 | -96.491496 | 0.591541 |
| 50% | 0.0 | 1.0 | 1.0 | 52.0 | 51.0 | 11.0 | 11.0 | 10.0 | 10.0 | 5.0 | ... | 40.863403 | -26.268981 | 14.107473 | -10.018565 | -35.657638 | -5.231376 | 22.242627 | -18.28561 | -56.951148 | 28.049228 |
| 75% | 1.0 | 3.0 | 2.0 | 72.0 | 71.0 | 15.0 | 15.0 | 13.0 | 13.0 | 7.0 | ... | 77.466833 | -0.47532 | 49.983556 | 21.570394 | -8.413399 | 27.29722 | 54.648308 | 18.541631 | -24.297612 | 62.815105 |
| max | 1.0 | 50.0 | 120.0 | 370.0 | 1151.0 | 100.0 | 272.0 | 71.0 | 237.0 | 53.0 | ... | 914.91641 | 1156.781213 | 834.215001 | 852.355989 | 569.228259 | 740.797131 | 880.135961 | 868.905853 | 511.18026 | 1080.074251 |


#### 2. Split data into train and test sets (70:30)

In [7]:
# Split data into X & y first
X = data.drop('is_duplicate', axis=1)
y = data['is_duplicate']

print("Shape of X: {0}".format(X.shape))
print("Shape of y: {0}".format(y.shape))

Shape of X: (100000, 627)
Shape of y: (100000,)


In [8]:
# Split into train & test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y)
print("Shape of X_train: {0}".format(X_train.shape))
print("Shape of X_test: {0}".format(X_test.shape))
print("Shape of y_train: {0}".format(y_train.shape))
print("Shape of y_test: {0}".format(y_test.shape))

Shape of X_train: (70000, 627)
Shape of X_test: (30000, 627)
Shape of y_train: (70000,)
Shape of y_test: (30000,)


In [9]:
print("Distribution of target variable in train")
train_counter = Counter(y_train)
train_len = len(y_train)
print("Class 0: {0} % \nClass 1: {1} %".format((train_counter[0]/train_len)*100, (train_counter[1]/train_len)*100))


print("\nDistribution of target variable in test")
test_counter = Counter(y_test)
test_len = len(y_test)
print("Class 0: {0} % \nClass 1: {1} %".format((test_counter[0]/test_len)*100, (test_counter[1]/test_len)*100))


Distribution of target variable in train
Class 0: 63.09 % 
Class 1: 36.91 %

Distribution of target variable in test
Class 0: 63.09 % 
Class 1: 36.91 %
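
The identical class shares in train and test come from `stratify=y`. The sketch below reproduces that effect on a made-up target with a similar imbalance. The `random_state` argument is an addition here (the original call does not set one) so that the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (illustrative): 37% positives, similar to the sample above
y_demo = np.array([1] * 370 + [0] * 630)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# random_state is an assumption added for reproducibility;
# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```

Without a fixed seed, and with the random sampling in the SQL query, each run of the notebook evaluates on a different test set, which is worth keeping in mind when comparing scores across runs.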


In [10]:
# Replace NaN with 0
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)


#### 3. Normalize data
Before we proceed to build the models, let's normalize all the features.

In [11]:
numerical_features = list(X_train.columns)

In [12]:
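# MinMaxScaler rescales each feature to [0, 1]; NaNs were already filled with 0 above,
# so the imputer acts only as a safeguard
numerical_pipeline = Pipeline(steps=[("normalizer", MinMaxScaler()), ("imputer", SimpleImputer(strategy="most_frequent"))])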

vectorizer = ColumnTransformer([("num_pipeline", numerical_pipeline, numerical_features)])

start = datetime.now()
print("Vectorizing X_train")
X_train = vectorizer.fit_transform(X_train)
print("Normalization of X_train is completed.\n\nTime taken: {0}".format(datetime.now()-start))

start = datetime.now()
print("\nNormalizing X_test")
X_test = vectorizer.transform(X_test)
print("Normalization of X_test is completed.\n\nTime taken: {0}".format(datetime.now()-start))


Vectorizing X_train
Normalization of X_train is completed.

Time taken: 0:00:04.362978

Normalizing X_test
Normalization of X_test is completed.

Time taken: 0:00:00.401804
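
To make the normalization step concrete, here is a small sketch of what `MinMaxScaler` does to a single column. The numbers are made up for illustration. The scaler learns the minimum and maximum from the training data only, then applies the same mapping to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature column (illustrative values)
train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min=10, max=30
test_scaled = scaler.transform(test_col)        # reuses the train min/max

# Same result by hand: (x - min) / (max - min)
manual = (test_col - train_col.min()) / (train_col.max() - train_col.min())

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test value 40 lies above the training maximum, it maps to a value above 1. This is expected behavior when the scaler is fit on the training split only.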


In [13]:
# Let's save our vectorizer to a .pkl file
vectorizer_file = "../models/vectorizer.pkl"
with open(vectorizer_file, 'wb') as f:
    pickle.dump(vectorizer, f)
print('Dumped the vectorizer in {} file'.format(vectorizer_file))

Dumped the vectorizer in ../models/vectorizer.pkl file
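
For completeness, this is a minimal sketch of how a pickled transformer can be loaded back and reused at prediction time. It uses a small `MinMaxScaler` and a temporary file as stand-ins for the fitted `ColumnTransformer` and the `../models/vectorizer.pkl` path used in the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the fitted ColumnTransformer saved in the notebook
fitted = MinMaxScaler().fit(np.array([[0.0], [10.0]]))

# Temporary path used only for this demo
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored object transforms new data exactly like the original
new_rows = np.array([[2.5], [5.0]])
print(restored.transform(new_rows).ravel())
```

Pickled scikit-learn objects are generally best loaded with the same scikit-learn version that created them.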


#### 4. Build random model
Here we assign random class probabilities to each test data point and measure the resulting log-loss.<br>
The strategy is:
1. Generate a list of 2 random numbers for each test row.
2. Divide each number by the sum of the pair, so that the two values sum to 1.
3. Take the index of the larger of the 2 numbers.
4. This index is the predicted class for the given test row.

In [14]:
# Draw two random numbers per test row and normalize each row to sum to 1
random_probs = np.random.rand(test_len, 2)
y_pred_prob = random_probs / random_probs.sum(axis=1, keepdims=True)

# `eps` is omitted: it is deprecated in recent scikit-learn releases
print("Test log-loss of random model: {0}".format(log_loss(y_test, y_pred_prob)))

y_pred = np.argmax(y_pred_prob, axis=1)

print("\nTest accuracy score of random model: {0}".format(accuracy_score(y_test, y_pred)))

print("\nTest confusion matrix of random model: \n{0}".format(confusion_matrix(y_test, y_pred)))

print("\nTest confusion matrix of random model (%): \n{0}".format(np.round(confusion_matrix(y_test, y_pred)/len(y_test)*100,2)))

    

Test log-loss of random model: 0.89532403134099

Test accuracy score of random model: 0.4961

Test confusion matrix of random model: 
[[9373 9554]
 [5563 5510]]

Test confusion matrix of random model (%): 
[[31.24 31.85]
 [18.54 18.37]]


We will take this as the benchmark against which to compare our later models.
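
A second reference point can help put the random model's log-loss in context. The sketch below is not part of the original notebook: it scores a model that ignores the features and always predicts the class prior. The label array is synthetic, built only to match the class shares reported above (63.09% / 36.91%).

```python
import numpy as np
from sklearn.metrics import log_loss

# Synthetic labels matching the reported class shares (illustrative)
n = 30000
n_pos = int(round(0.3691 * n))  # 11073 positives
y_demo = np.array([1] * n_pos + [0] * (n - n_pos))

# Constant prediction: P(class 1) = class prior for every row
prior = y_demo.mean()
constant_probs = np.column_stack([np.full(n, 1 - prior), np.full(n, prior)])

prior_log_loss = log_loss(y_demo, constant_probs)

# Closed form: the binary entropy of the prior
entropy = -(prior * np.log(prior) + (1 - prior) * np.log(1 - prior))
print(prior_log_loss, entropy)
```

Under these assumptions the constant-prior model scores about 0.66, which is lower than the random model's log-loss, so it may be the stricter of the two baselines.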

#### 5. Apply ML models

In [15]:
result = []
for classifier in [LogisticRegression(solver='lbfgs', max_iter=3000), BernoulliNB(), SVC(), GradientBoostingClassifier() ]:
    
    # Training
    start = datetime.now()
    clf_str = str(classifier).split("(")[0]
    print("{0} started.".format(clf_str))
    classifier.fit(X_train, y_train)
    print("{0} training completed. Time taken: {1}\n".format(clf_str, datetime.now()-start))
    
    # Prediction
    y_pred = classifier.predict(X_test)
    
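    # Evaluation
    # NOTE: log_loss receives hard class labels here, not probabilities, so every
    # misclassified row is scored as a fully confident mistake. The resulting values
    # are therefore not comparable with the random model's log-loss above.
    lg_loss = log_loss(y_test,y_pred)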
    acc = accuracy_score(y_test,y_pred)
    cm = np.round((confusion_matrix(y_test,y_pred)/len(X_test)*100),2)
    
    # Add to result
    result.append([clf_str, "Default", lg_loss, acc, cm, datetime.now()-start])



LogisticRegression started.
LogisticRegression training completed. Time taken: 0:00:47.715623

BernoulliNB started.
BernoulliNB training completed. Time taken: 0:00:00.554632

SVC started.
SVC training completed. Time taken: 0:22:43.016640

GradientBoostingClassifier started.
GradientBoostingClassifier training completed. Time taken: 0:51:01.605331



In [17]:
pd.DataFrame(result, columns=['Algorithm', 'Hyperparameters', 'Log-loss', 'Accuracy', 'Confusion Matrix % (TN,FP,FN,TP)', 'Time taken'])

| | Algorithm | Hyperparameters | Log-loss | Accuracy | Confusion Matrix % (TN,FP,FN,TP) | Time taken |
|---|---|---|---|---|---|---|
| 0 | LogisticRegression | Default | 8.010101 | 0.777767 | [[53.57, 9.52], [12.7, 24.21]] | 0 days 00:00:47.783151 |
| 1 | BernoulliNB | Default | 12.445874 | 0.6547 | [[38.9, 24.19], [10.34, 26.57]] | 0 days 00:00:00.805230 |
| 2 | SVC | Default | 7.282019 | 0.797967 | [[53.89, 9.2], [11.0, 25.91]] | 0 days 00:34:24.827754 |
| 3 | GradientBoostingClassifier | Default | 6.320855 | 0.824633 | [[56.07, 7.02], [10.51, 26.4]] | 0 days 00:51:01.899364 |
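
The log-loss values in this table are much larger than the random model's because they were computed from hard class labels (`predict`) rather than probabilities. The sketch below shows the difference on synthetic data generated with `make_classification`, which stands in for the Quora features. It illustrates the scoring issue only and does not reproduce the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Quora features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=3000).fit(X_tr, y_tr)

# Hard labels: every mistake is scored as a fully confident wrong probability
loss_from_labels = log_loss(y_te, clf.predict(X_te))

# Probabilities: what log-loss is designed to score
loss_from_probs = log_loss(y_te, clf.predict_proba(X_te))

print(loss_from_labels, loss_from_probs)
```

To apply this in the loop above, each classifier would need to expose `predict_proba`. `SVC` only does so when constructed with `probability=True`, which adds training time.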
