## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [None]:
import pandas as pd
from google.colab import drive
import numpy as np

In [None]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/My Drive/projects/Quora/train.csv')

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
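
One way to set aside the final testing data is a stratified hold-out split, which keeps the share of duplicates the same in both parts. The sketch below is dependency-free and uses toy labels; the function name `stratified_holdout`, the 20% test fraction, and the seed are assumptions for illustration, not choices made in this notebook.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with the label ratio kept in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 63 non-duplicates and 37 duplicates.
labels = [0] * 63 + [1] * 37
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```

With scikit-learn, `train_test_split` accepts `test_size`, `stratify`, and `random_state` arguments that should give the same kind of reproducible, class-balanced split.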

In [None]:
from sklearn.model_selection import train_test_split

### Exploration

In [None]:
df[['question1','question2','is_duplicate']].head(3)

| | question1 | question2 | is_duplicate |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |


In [None]:
df[['question1','question2','is_duplicate']].tail(3)

| | question1 | question2 | is_duplicate |
|---|---|---|---|
| 404287 | What is one coin? | What's this coin? | 0 |
| 404288 | What is the approx annual cost of living while... | I am having little hairfall problem but I want... | 0 |
| 404289 | What is like to have sex with cousin? | What is it like to have sex with your cousin? | 0 |


Some samples seem to be incorrectly labelled; the last pair shown above, for example, reads like a duplicate but is labelled 0. Fixing the labels would take a lot of time, though it would definitely be worth it if we had more time for the project. There don't seem to be too many incorrect labels, so for our small project they shouldn't cause much confusion.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


In [None]:
n_not_dup, n_dup = (df.is_duplicate == 0).sum(), (df.is_duplicate == 1).sum()
print('is_duplicate == 0 count:', n_not_dup, '\nis_duplicate == 1 count:', n_dup)
# Despite the label, the value printed below is a fraction between 0 and 1, not a percentage.
print('percent of samples where is_duplicate == 1:', n_dup/len(df))

is_duplicate == 0 count: 255027 
is_duplicate == 1 count: 149263
percent of samples where is_duplicate == 1: 0.369197853026293
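
These counts also fix the baseline that any classifier has to beat: a model that always predicts the majority class (not a duplicate) is already right about 63% of the time. A small worked calculation from the counts printed above:

```python
# Class counts reported by the cell above.
n_not_duplicate = 255027
n_duplicate = 149263
total = n_not_duplicate + n_duplicate

duplicate_share = n_duplicate / total
# Accuracy of a model that always predicts the majority class (0).
majority_baseline = max(n_not_duplicate, n_duplicate) / total

print(f"duplicate share:   {duplicate_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Accuracy figures reported later are easier to judge against this baseline than against 50%.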


In [None]:
df[df.question2.isnull()]

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 105780 | 105780 | 174363 | 174364 | How can I develop android app? | NaN | 0 |
| 201841 | 201841 | 303951 | 174364 | How can I create an Android app? | NaN | 0 |


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
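
The `nlp_cleaning` module used below is not shown in this notebook, so the following is only a minimal sketch of what the first four steps could look like. The function name `preprocess_text` and the tiny stopword set are hypothetical; stemming or lemmatizing is left out here and would normally come from a library such as NLTK.

```python
import string

# Hypothetical, deliberately tiny stopword list; NLTK's English list is far longer.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "can", "i", "to", "in", "of", "my"}

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                                               # normalizing
    text = text.translate(str.maketrans("", "", string.punctuation))  # removing punctuation
    tokens = text.split()                                             # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]                # stopwords cleaning
    return " ".join(tokens)

print(preprocess_text("How can I increase the speed of my internet connection?"))
# increase speed internet connection
```

Note how much of a short question disappears: aggressive cleaning can make two different questions look identical.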

In [None]:
df = df.dropna()

In [None]:
from drive.MyDrive.modules import nlp_cleaning as cleaner

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
df['question1Clean'] = cleaner.preprocess(df.question1)
df['question2Clean'] = cleaner.preprocess(df.question2)

# preprocess removes punctuation, stopwords, and converts to lowercase

In [None]:

df['question1Lem'] = df.question1Clean.apply(cleaner.lematize_words)
df['question2Lem'] = df.question2Clean.apply(cleaner.lematize_words)

# The dataset is small, so we can use a lemmatizer instead of a stemmer.

In [None]:
df[df.question1Lem == df.question2Lem].is_duplicate.value_counts()

1    16053
0     4297
Name: is_duplicate, dtype: int64

This will lead to confusion: 4,297 pairs are identical after cleaning yet labelled as non-duplicates. We may need a more careful cleaning process, but first I'll experiment with this data.

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
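
The word-count and shared-word ideas from the list can be sketched for a single pair as follows. The function name `overlap_features` and the feature names are illustrative assumptions, and the inputs are assumed to be already cleaned, space-separated strings.

```python
def overlap_features(q1, q2):
    """Word-count and shared-word features for one question pair."""
    tokens1, tokens2 = q1.split(), q2.split()
    words1, words2 = set(tokens1), set(tokens2)
    shared = words1 & words2
    union = words1 | words2
    return {
        "q1_word_count": len(tokens1),
        "q2_word_count": len(tokens2),
        "shared_words": len(shared),
        # Jaccard index: shared vocabulary relative to the combined vocabulary.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

features = overlap_features("create android app", "develop android app")
print(features)
```

Applied row by row (for example with `df.apply(..., axis=1)`), this would yield a few dense columns that could sit alongside the tf-idf features.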

In [None]:
Y = df.is_duplicate

In [None]:
#tmp = df.apply(lambda x: ' '.join([x['question1'], x['question2'], x['question1Lem'], x['question2Lem']]), axis=1)
tmp = df.apply(lambda x: ' '.join([x['question1Lem'], x['question2Lem']]), axis=1)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(tmp.tolist())
X.shape

(404287, 1000)
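
Because each pair is vectorized as one concatenated string, these features do not record which question a word came from. One possible complement, which this notebook does not use, is a per-pair similarity score. The sketch below computes cosine similarity on raw term counts (not tf-idf weights) to stay dependency-free; the function name is an assumption.

```python
import math
from collections import Counter

def cosine_similarity(q1, q2):
    """Cosine similarity between the term-count vectors of two strings."""
    counts1, counts2 = Counter(q1.split()), Counter(q2.split())
    # Only words present in both questions contribute to the dot product.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

similar = cosine_similarity("create android app", "develop android app")
unrelated = cosine_similarity("create android app", "cost living abroad")
print(similar, unrelated)
```

A score near 1 means the two questions use almost the same words, while 0 means they share none.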

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

#### XGBoost first

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb = XGBClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, Y)

In [None]:
xgb.fit(X_train,y_train)

In [None]:
vectorizer.get_feature_names_out()[50:100]

array(['anxiety', 'anyone', 'anything', 'app', 'apple', 'application',
       'apply', 'apps', 'area', 'army', 'around', 'art', 'asian', 'ask',
       'asked', 'atheist', 'attack', 'australia', 'available', 'average',
       'avoid', 'away', 'baby', 'back', 'bad', 'balance', 'ball', 'ban',
       'bang', 'bangalore', 'bank', 'banning', 'based', 'basic',
       'battery', 'battle', 'beautiful', 'become', 'beginner', 'behind',
       'believe', 'belly', 'benefit', 'best', 'better', 'big', 'biggest',
       'bike', 'bill', 'birthday'], dtype=object)

In [None]:
pred = xgb.predict(X_test)

In [None]:
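from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, f1_score, roc_auc_score
def print_scores(y_test,pred):
  """Print the confusion matrix, ROC AUC, accuracy, precision and F1 for hard 0/1 predictions."""
  # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
  cm = confusion_matrix(y_test, pred)
  # pred holds class labels rather than probabilities, so this AUC comes from a single threshold
  roc = roc_auc_score(y_test,pred)
  acc = accuracy_score(y_test,pred)
  pre = precision_score(y_test,pred)
  f1 = f1_score(y_test,pred)
  print('~~~~{Confusion}~~~~\n',cm)
  print('\n~~~~{roc score}~~~~\n',roc)
  print('\n~~~~{accuracy}~~~~~\n',acc)
  print('\n~~~~{precision}~~~~\n',pre)
  print('\n~~~~{f1  score}~~~~\n',f1)
print_scores(y_test,pred)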

~~~~{Confusion}~~~~
 [[60057  3655]
 [24371 12989]]

~~~~{roc score}~~~~
 0.6451518886649451

~~~~{accuracy}~~~~~
 0.7227125217666613

~~~~{precision}~~~~
 0.7804013458303293

~~~~{f1  score}~~~~
 0.4810384415969187
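
The scores above can be recomputed by hand from the confusion matrix, which also exposes a number the printout omits: recall. The worked calculation below uses only the four counts printed above.

```python
# Confusion matrix printed above, laid out as [[TN, FP], [FN, TP]].
tn, fp = 60057, 3655
fn, tp = 24371, 12989

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)
# With hard 0/1 predictions the ROC curve has a single operating point, so the
# area under it reduces to the mean of recall and specificity.
roc_auc_hard = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1), ("roc_auc", roc_auc_hard)]:
    print(f"{name:>9}: {value:.4f}")
```

Recall comes out at roughly 0.35: the model finds only about a third of the true duplicates, which is what pulls the F1 score down despite the reasonable precision.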


#### Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
pred = mnb.predict(X_test)
print_scores(y_test,pred)

~~~~{Confusion}~~~~
 [[58008  5704]
 [24572 12788]]

~~~~{roc score}~~~~
 0.6263816725586336

~~~~{accuracy}~~~~~
 0.7004511635269907

~~~~{precision}~~~~
 0.6915422885572139

~~~~{f1  score}~~~~
 0.4579245147890854


In [None]:
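# We'll do a quick grid search
from sklearn.model_selection import GridSearchCV
param_grid = {
    'alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.2],
    'fit_prior': [True, False],
    # Note: [0.1, 0.2] does not sum to 1, unlike a proper class prior
    'class_prior': [None, [0.1, 0.2], [0.8, 0.2]]
}

mnb = MultinomialNB()
# 5-fold cross-validation over the full dataset (X, Y), not only the training split
gs = GridSearchCV(mnb, param_grid, cv=5, n_jobs=-1)
gs.fit(X, Y)
print('params', gs.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter set
print('score', gs.best_score_)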

params {'alpha': 0.2, 'class_prior': None, 'fit_prior': True}
score 0.7012518322095492


#### LSTM

In [None]:
X.shape

(404287, 1000)

In [None]:
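from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

# Three stacked LSTM layers; each feature column is treated as one timestep holding a single value.
# If the cells are run in order, X_train here is still the 1000-feature tf-idf split,
# while the model is later fit on the 250 SVD components.
model = Sequential()
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2,input_shape = (X_train.shape[1],1),return_sequences=True))
model.add(LSTM(20, dropout=0.2, recurrent_dropout=0.2,return_sequences=True))
model.add(LSTM(30, dropout=0.2, recurrent_dropout=0.2))
# Two softmax outputs with integer labels (0/1), hence the sparse categorical loss
model.add(Dense(2, activation='softmax'))

model.compile(loss='SparseCategoricalCrossentropy', optimizer='adam', metrics=['accuracy'])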



In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
svd = TruncatedSVD(250)
# 250 components give a much higher explained variance ratio than 100 without using much more RAM

In [None]:
Xpca = svd.fit_transform(X)

In [None]:
svd.explained_variance_ratio_.sum()

0.5527787820539418

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Xpca, Y)

In [None]:
model.fit(X_train, y_train, batch_size=512, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa663256880>

In [None]:
pred = model.predict(X_test)
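# pred has shape (n_samples, 2): one softmax probability per class.
# print_scores expects class labels, so take the argmax before scoring.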

In [None]:
print_scores(y_test,np.argmax(pred,axis=1))

~~~~{Confusion}~~~~
 [[63938     0]
 [37134     0]]

~~~~{roc score}~~~~
 0.5

~~~~{accuracy}~~~~~
 0.6325985436124743

~~~~{precision}~~~~
 0.0

~~~~{f1  score}~~~~
 0.0
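
The second column of the confusion matrix is all zeros, so the network never predicts class 1. A quick way to spot this kind of collapse is to count the predicted classes before looking at any score. The sketch below uses hypothetical softmax outputs and the counts from the matrix above; the helper name is an assumption.

```python
from collections import Counter

def predicted_class_counts(probabilities):
    """Count the argmax class for each row of per-class probabilities."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    return Counter(labels)

# Hypothetical softmax outputs in which class 0 always wins.
probs = [[0.63, 0.37], [0.64, 0.36], [0.62, 0.38]]
counts = predicted_class_counts(probs)
print(counts)

# A model that never predicts class 1 scores exactly the class-0 share of the test set.
tn, fn = 63938, 37134
constant_accuracy = tn / (tn + fn)
print(constant_accuracy)
```

The accuracy of about 0.633 reported above is therefore just the share of non-duplicates in the test split, and precision and F1 are 0 because there are no positive predictions at all.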


