## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make people spend extra time searching for the best answer, and lead members to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [320]:
import pandas as pd

In [321]:
df = pd.read_csv("train.csv")

In [322]:
df.head(3)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
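
One way to set the testing data aside is a stratified hold-out, so that the share of duplicates stays roughly the same in both parts. The sketch below uses only the standard library; the helper name `stratified_holdout`, the toy labels and the 50% fraction are illustrative assumptions, not part of the project. scikit-learn's `train_test_split` with its `stratify=` argument does the same job in one call.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping the label ratio similar in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy labels: 6 non-duplicate pairs and 4 duplicate pairs
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.5)
print(train_idx, test_idx)
```

The returned index lists can then be used with `df.iloc[...]` to build the two frames.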

### **Exploration and Data Analysis**

In [323]:
# import lets plot
import numpy as np
from lets_plot import *
LetsPlot.setup_html()

In [324]:
# How big is this dataset?
df.shape

(404290, 6)

In [325]:
# What portion of our questions are actually duplicate?
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
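
To turn these counts into a proportion, the arithmetic can be sketched as follows. The two counts are copied from the output above; the variable names are only illustrative. The majority-class share is a useful floor for any accuracy reported later.

```python
# counts copied from the value_counts() output above
n_not_duplicate = 255027
n_duplicate = 149263
total = n_not_duplicate + n_duplicate

duplicate_rate = n_duplicate / total
# accuracy of a model that always predicts the majority class
majority_baseline = max(n_not_duplicate, n_duplicate) / total

print(f"total pairs: {total}")
print(f"duplicate rate: {duplicate_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```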

In [326]:
# plot the class distribution with lets-plot (ggplot2-style API)
ggplot(df, aes(x='is_duplicate', fill = 'is_duplicate')) + geom_bar() + ggtitle(" ") + labs(x="Class", y="Count") +\
scale_fill_discrete(guide='none') + \
theme_classic() + \
flavor_high_contrast_dark() 


In [327]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(df['is_duplicate'].isnull().sum()))
print('Number of nulls in text: {}'.format(df['question1'].isnull().sum()))
print('Number of nulls in text: {}'.format(df['question2'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in text: 1
Number of nulls in text: 2


In [328]:
# How many question pairs are there?
print('Total number of question pairs for training: {}'.format(len(df)))

Total number of question pairs for training: 404290


In [330]:
# count how many times a question appears in the df
qids = pd.Series(df[df['qid1'].notnull()]['qid1'].tolist() + df[df['qid2'].notnull()]['qid2'].tolist())
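
The pooled ids can also be counted directly to see how often individual questions recur. This is a minimal sketch on made-up ids standing in for the `qid1` and `qid2` columns; `appearances` and the other names are assumptions for illustration.

```python
from collections import Counter

# hypothetical toy ids standing in for the qid1 and qid2 columns
qid1 = [1, 3, 5, 1, 7]
qid2 = [2, 4, 6, 4, 1]

appearances = Counter(qid1 + qid2)            # question id -> number of appearances
n_unique = len(appearances)
n_repeated = sum(1 for c in appearances.values() if c > 1)
most_common_qid, most_common_count = appearances.most_common(1)[0]

print(f"unique questions: {n_unique}")
print(f"questions appearing more than once: {n_repeated}")
print(f"most frequent question id: {most_common_qid} ({most_common_count} times)")
```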

In [331]:
# plot the distribution of qids with lets-plot
# (binwidth and method='histodot' are geom_dotplot arguments, so they are dropped here)
ggplot(qids, aes(x=qids)) + geom_density() + ggtitle(" ") + labs(x="Question ID", y="Density")

In [332]:
# concat question1 and question2 into a single string
questions = df['question1'].astype(str) + " " + df['question2'].astype(str).tolist()

In [333]:
questions = questions.apply(len)

In [334]:
# Get the mean and (population) standard deviation of a list of numbers;
# min and max are taken with the built-ins in a later cell
def min_max_mean_std(numbers):
    mean = sum(numbers) / len(numbers)
    std = (sum([(x - mean) ** 2 for x in numbers]) / len(numbers)) ** 0.5
    return mean, std

In [336]:
# call function on questions list
mean, std = min_max_mean_std(questions)
print(mean, std)

120.6450963417349 55.00734920633935


In [337]:
min(questions), max(questions), mean, std

(6, 1319, 120.6450963417349, 55.00734920633935)

In [338]:
# plot distribution of question lengths (in characters) with lets-plot
ggplot(questions, aes(x=questions)) + \
    geom_histogram(bins=100, fill='white') + \
    ggtitle(" ") + labs(x="Question Length", y="Count") + \
    theme_classic() + \
    flavor_high_contrast_dark()
            

In [339]:
# word count for each question
questions = df['question1'].astype(str) + " " + df['question2'].astype(str).tolist()

# split questions into words

In [340]:
questions = questions.apply(lambda x: x.lower().split())

In [341]:
questions = questions.apply(len)

In [342]:
mean, std = min_max_mean_std(questions)

In [343]:
min(questions), max(questions), mean, std

(2, 270, 22.124200450171905, 10.074891656619457)

In [344]:
# plot distribution of question word counts with lets-plot
ggplot(questions, aes(x=questions)) + \
    geom_histogram(bins=100, fill='white') + \
    ggtitle(" ") + labs(x="Question Word Length", y="Count") + \
    theme_classic() + \
    flavor_high_contrast_dark()

From here, I will drop those questions that have 99 words

In [345]:
#Remove id and qid1 and qid2
df.drop(['id', 'qid1', 'qid2'], axis=1, inplace=True)

In [347]:
df['word_count'] = questions

In [348]:
df.head()

Unnamed: 0,question1,question2,is_duplicate,word_count
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,26
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,21
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,24
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,20
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,20


In [None]:
# drop 
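
The `# drop` cell above is empty. One way it could be filled in is sketched below on a toy frame with the same `word_count` column. The cut-off `MAX_WORDS` and the reading of "99 words" as an upper limit are assumptions, since the notebook never states the exact rule.

```python
import pandas as pd

# toy frame with the same word_count column as the notebook's df
toy = pd.DataFrame({
    "question1": ["short one", "another short one", "very long question"],
    "question2": ["short two", "another short two", "very long question too"],
    "word_count": [4, 6, 120],
})

MAX_WORDS = 99  # assumption: pairs with 99 or more words in total are dropped
kept = toy[toy["word_count"] < MAX_WORDS].reset_index(drop=True)
print(kept)
```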

In [None]:
# fill the missing values with a single-space placeholder string
df = df.fillna(' ')
df.head()

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
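
The steps listed above can be sketched end to end with the standard library alone. This is only an illustration: the tiny `STOP_WORDS` set and the `naive_stem` helper are hypothetical stand-ins, whereas the cells below rely on NLTK's English stop word list and do not stem at all.

```python
import re
import string

# tiny illustrative stop list (the notebook uses NLTK's English list instead)
STOP_WORDS = {"what", "is", "the", "to", "in", "by", "a", "of", "how", "can", "i"}

def naive_stem(word):
    """Very rough suffix stripping, only to illustrate the idea of stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_question(text):
    text = text.lower()                                               # normalizing
    text = text.translate(str.maketrans("", "", string.punctuation))  # removing punctuation
    tokens = re.findall(r"[a-z]+", text)                              # tokenization (letters only)
    tokens = [t for t in tokens if t not in STOP_WORDS]               # stopwords cleaning
    return [naive_stem(t) for t in tokens]                            # stemming

example = clean_question("What is the step by step guide to investing in shares?")
print(example)
```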

In [None]:
import re #regular expression
import spacy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
# Remove punctuation
import string
string.punctuation

def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text


In [None]:
df['q1_clean'] = df['question1'].apply(lambda x: remove_punct(x))
df['q2_clean'] = df['question2'].apply(lambda x: remove_punct(x))

In [None]:
df.drop(['question1', 'question2'], axis=1, inplace=True)

In [None]:
# Remove digits (despite its name, the function only strips numeric characters)
def remove_alphanumeric(text):
    text = ''.join([i for i in text if not i.isdigit()])
    return text

In [None]:
# apply function to data frame
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_alphanumeric(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_alphanumeric(x))

In [None]:
# Remove Stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes the membership test below much faster

def remove_stopwords(text):
    text = [word for word in text.split() if word.lower() not in (stop)]
    return " ".join(text)


In [None]:
# remove stopwords from questions
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_stopwords(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_stopwords(x))

df.head(3)

In [None]:
# remove unwanted spaces and convert to lower case
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

# define class to remove unwanted spaces and convert to lower case
class CleanText(object):
    def __init__(self, text):
        self.text = text
        
    def clean(self):
        # remove unwanted spaces and convert to lower case
        self.text = self.text.lower()
        self.text = self.text.strip()
        return self.text


In [None]:
# call class to remove unwanted spaces and convert to lower case for q1_clean and q2_clean
df['q1_clean'] = df['q1_clean'].apply(lambda x: CleanText(x).clean())
df['q2_clean'] = df['q2_clean'].apply(lambda x: CleanText(x).clean())

df.head(3)

In [None]:
from copy import deepcopy


In [None]:
df_A = deepcopy(df)

In [None]:
df_A.head(3)

In [None]:
# combine q1_clean and q2_clean into a single column called "combined" using string concatenation
df_A['combined'] = df_A['q1_clean'].astype(str) +' ' + df_A['q2_clean'].astype(str)



In [None]:
df_A.head(3)

In [None]:
# drop q1_clean and q2_clean

df_A.drop(['q1_clean', 'q2_clean'], axis=1, inplace=True)
df_A.head()

In [None]:
df_A.to_csv('df_A.csv', index=False)

In [None]:
df

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
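
As one reading of the "number of the same words in both questions" idea from the list above, the sketch below counts shared words and also returns their share of all distinct words (the Jaccard index). The function name and the sample strings are illustrative assumptions.

```python
def shared_word_features(q1, q2):
    """Return (shared_count, jaccard) for two already-cleaned question strings."""
    words1, words2 = set(q1.split()), set(q2.split())
    shared = words1 & words2
    union = words1 | words2
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard

count, jaccard = shared_word_features(
    "increase speed internet connection",
    "internet speed increased hacking dns",
)
print(count, round(jaccard, 4))
```

Note that `increase` and `increased` are not matched here, which is the kind of miss that stemming would reduce.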

In [None]:
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [None]:
target = df['is_duplicate']

In [None]:
# create a dataframe with target
X = pd.DataFrame(target)

In [None]:
# calculate the length of questions and apply to X df
X['q1_len'] = df['q1_clean'].apply(lambda x: len(x))
X['q2_len'] = df['q2_clean'].apply(lambda x: len(x))

In [None]:
# calculate weight of each word in corpus
def get_weight(count, eps=10000, min_count=2):
    return 0 if count < min_count else 1 / (count + eps)

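# join all questions together as plain strings, so the counted tokens match the cleaned words
# (casting the split lists to str would leave brackets and quotes inside the tokens)
pairs_qs = df['q1_clean'].astype(str) + " " + df['q2_clean'].astype(str)
words = (" ".join(pairs_qs)).lower().split()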
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}
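
The weighting above behaves like a smoothed inverse frequency: the rarer a word, the larger its weight, and words seen fewer than `min_count` times get no weight at all. A small worked example, assuming a made-up corpus and a much smaller `eps` so the numbers are easy to read:

```python
from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # same formula as in the cell above
    return 0 if count < min_count else 1 / (count + eps)

# toy corpus; eps is shrunk to 5 purely for readability
corpus = "india india india india visa visa kohinoor".split()
counts = Counter(corpus)
weights = {word: get_weight(count, eps=5) for word, count in counts.items()}

for word, weight in weights.items():
    print(f"{word}: count={counts[word]}, weight={weight:.4f}")
```

Here `visa` (seen twice) outweighs `india` (seen four times), while `kohinoor` (seen once) falls below `min_count` and gets 0.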


In [None]:
X['word_count'] = pairs_qs.apply(lambda x: len(str(x).split()))

In [None]:
# load the small English spaCy model
nlp = spacy.load('en_core_web_sm')


In [None]:
# find the number of unique words in each question
def unique_words(text):
    doc = nlp(text)
    unique_words = set([token.text for token in doc if token.is_stop != True and token.is_punct != True])
    return len(unique_words)

# count the distinct common tokens (stop words or punctuation) in a question
def common_words(text):
    doc = nlp(text)
    # a token is never both a stop word and punctuation, so the test uses `or`
    common_words = set([token.text for token in doc if token.is_stop or token.is_punct])
    return len(common_words)

In [None]:
X.head(3)

In [None]:
stopwords = nltk.corpus.stopwords.words('english')

In [None]:
%%time
nlp = spacy.load('en_core_web_sm')
stops = set(nltk.corpus.stopwords.words("english"))

def word_shares(row):
    q1_list = str(row['q1_clean']).lower().split()
    q1 = set(q1_list)
    q1words = q1.difference(stops)
    if len(q1words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    q2_list = str(row['q2_clean']).lower().split()
    q2 = set(q2_list)
    q2words = q2.difference(stops)
    if len(q2words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    words_hamming = sum(1 for i in zip(q1_list, q2_list) if i[0]==i[1])/max(len(q1_list), len(q2_list))

    q1stops = q1.intersection(stops)
    q2stops = q2.intersection(stops)

    q1_2gram = set([i for i in zip(q1_list, q1_list[1:])])
    q2_2gram = set([i for i in zip(q2_list, q2_list[1:])])

    shared_2gram = q1_2gram.intersection(q2_2gram)

    shared_words = q1words.intersection(q2words)
    shared_weights = [weights.get(w, 0) for w in shared_words]
    q1_weights = [weights.get(w, 0) for w in q1words]
    q2_weights = [weights.get(w, 0) for w in q2words]
    total_weights = q1_weights + q2_weights

    R1 = np.sum(shared_weights) / np.sum(total_weights) #tfidf share
    R2 = len(shared_words) / (len(q1words) + len(q2words) - len(shared_words)) #count share
    R31 = len(q1stops) / len(q1words) #stops in q1
    R32 = len(q2stops) / len(q2words) #stops in q2
    Rcosine_denominator = (np.sqrt(np.dot(q1_weights,q1_weights))*np.sqrt(np.dot(q2_weights,q2_weights)))
    Rcosine = np.dot(shared_weights, shared_weights)/Rcosine_denominator
    if len(q1_2gram) + len(q2_2gram) == 0:
        R2gram = 0
    else:
        R2gram = len(shared_2gram) / (len(q1_2gram) + len(q2_2gram))
    
    # token_sort_ratio compares strings, so the token lists are joined back first
    fuzzy_match = fuzz.token_sort_ratio(" ".join(q1_list), " ".join(q2_list))
    
    return '{}:{}:{}:{}:{}:{}:{}:{}:{}'.format(R1, R2, len(shared_words), R31, R32, R2gram, 
                                                  Rcosine, words_hamming, fuzzy_match)

X['word_shares'] = df.apply(word_shares, axis=1)
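
Two of the ratios computed inside `word_shares` are easier to follow on a single pair. The sketch below re-implements the word-level Hamming share and the 2-gram share as standalone helpers; the function names and the sample tokens are assumptions for illustration.

```python
def words_hamming(q1_tokens, q2_tokens):
    """Share of positions at which both token lists hold the same word."""
    same = sum(1 for a, b in zip(q1_tokens, q2_tokens) if a == b)
    longest = max(len(q1_tokens), len(q2_tokens))
    return same / longest if longest else 0.0

def bigram_share(q1_tokens, q2_tokens):
    """Shared 2-grams divided by the summed number of distinct 2-grams per question."""
    g1 = set(zip(q1_tokens, q1_tokens[1:]))
    g2 = set(zip(q2_tokens, q2_tokens[1:]))
    total = len(g1) + len(g2)
    return len(g1 & g2) / total if total else 0.0

q1 = "step step guide invest share market india".split()
q2 = "step step guide invest share market".split()

hamming = words_hamming(q1, q2)    # 6 matching positions out of 7
bigrams = bigram_share(q1, q2)     # 5 shared 2-grams out of 6 + 5
print(round(hamming, 4), round(bigrams, 4))
```

Because the denominator adds up both 2-gram sets, two identical questions score 0.5 rather than 1 on the 2-gram share.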


In [None]:
X.head(3)

In [None]:
# note: index 0 holds R1 (the weighted share) and index 1 holds R2 (the plain count share)
X['word_match']       = X['word_shares'].apply(lambda x: float(x.split(':')[0]))
X['tfidf_word_match'] = X['word_shares'].apply(lambda x: float(x.split(':')[1]))
X['shared_count']     = X['word_shares'].apply(lambda x: float(x.split(':')[2]))

X['stops1_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[3]))
X['stops2_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[4]))
X['shared_2gram']     = X['word_shares'].apply(lambda x: float(x.split(':')[5]))
X['cosine']           = X['word_shares'].apply(lambda x: float(x.split(':')[6]))
X['words_hamming']    = X['word_shares'].apply(lambda x: float(x.split(':')[7]))
X['fuzzy_match']    = X['word_shares'].apply(lambda x: float(x.split(':')[8]))


X['len_word_q1'] = df['q1_clean'].apply(lambda x: len(str(x).split()))
X['len_word_q2'] = df['q2_clean'].apply(lambda x: len(str(x).split()))
X['diff_len_word'] = X['len_word_q1'] - X['len_word_q2']

In [None]:
X.head(3)

In [None]:
# drop word_shares column
X.drop(['word_shares'], axis=1, inplace=True)

In [None]:
# get unique values in cosine column
X['cosine'].unique()

In [None]:
X.drop(['cosine'], axis=1, inplace=True)


In [None]:
X['word_match'].unique()

In [None]:
X.drop(['word_match'], axis=1, inplace=True)

In [None]:
X.head(3)

In [None]:
# export to csv
X.to_csv('clean_df.csv')

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('clean_df.csv', index_col=0)

In [None]:
# take the first 100000 rows from df using iloc
X = df.iloc[:100000]




### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc

# import standard scaler
from sklearn.preprocessing import StandardScaler

# import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# import pipeline
from sklearn.pipeline import Pipeline


In [None]:
y = X['is_duplicate']
X = X.drop(['is_duplicate', 'stops1_ratio', 'stops2_ratio', 'word_count'], axis=1)

In [None]:
# split the data into train and test using 80:20 ratio using iloc
y_train = y.iloc[:int(y.shape[0]*0.8)]
y_test = y.iloc[int(y.shape[0]*0.8):]
X_train = X.iloc[:int(X.shape[0]*0.8)]
X_test = X.iloc[int(X.shape[0]*0.8):]


In [None]:
y_train.shape, y_test.shape, X_train.shape, X_test.shape

1. Test different models

In [None]:
scoring = 'accuracy'
kfold = StratifiedKFold(n_splits=10)

In [None]:
# candidate models, trained on the unscaled features
models = []
models.append(('LR', LogisticRegression()))
models.append(('RF', RandomForestClassifier()))


results = []
names = []
# evaluate each model in turn on the held-out test split
for name, model in models:
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    results.append(score)
    names.append(name)
    # report this model's own score rather than the whole results list
    msg = f'{name}: {score}'
    print(msg)
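
The `scoring` and `kfold` objects defined earlier are not used by the loop above. A minimal sketch of how they could feed scikit-learn's `cross_val_score` is shown below; it runs on synthetic data from `make_classification` as a stand-in for the engineered features, and the fold count, `max_iter` and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the engineered feature matrix and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

scoring = 'accuracy'
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# one accuracy value per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         scoring=scoring, cv=kfold)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Averaging over folds gives a steadier estimate than a single train/test split, at the cost of fitting each model several times.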


In [None]:
# import xgboost
import xgboost as xgb

In [None]:
# xgb classifier with default hyperparameters
Xgb = xgb.XGBClassifier()
Xgb.fit(X_train, y_train)
y_pred = Xgb.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
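
Since the classes are imbalanced, accuracy alone can look better than the model is at finding duplicates. The sketch below shows, on made-up labels, how the precision, recall and F1 figures in a classification report follow from the confusion counts; `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts and derived metrics for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# toy labels: 6 non-duplicates, 4 duplicates
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

In this toy case accuracy is 0.7, yet only half of the true duplicates are recovered.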

In [None]:
# grid search for best hyperparameters for XGBoost
def grid_search(X_train, y_train, X_test, y_test):
    """
    Grid search for best hyperparameters for XGBoost
    """
    # import XGBoost
    from xgboost import XGBClassifier
    # import GridSearchCV
    from sklearn.model_selection import GridSearchCV
    # import RandomizedSearchCV
    from sklearn.model_selection import RandomizedSearchCV

    # define XGBoost parameters
    xgb_params = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.01, 0.05, 0.1],
        'min_child_weight': [1, 3, 5],
        'gamma': [0.5, 1, 1.5, 2],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [1e-5, 1e-4, 1e-3],
        'reg_lambda': [1e-5, 1e-4, 1e-3],
        'scale_pos_weight': [1, 3, 5],
        'objective': ['binary:logistic'],  # binary target, so the multi-class objective is left out
        'nthread': [4],
        'seed': [42]
    }
    # define XGBoost grid search
    xgb_grid = GridSearchCV(XGBClassifier(), xgb_params, cv=2, n_jobs=-1, verbose=1)
    # fit XGBoost grid search
    xgb_grid.fit(X_train, y_train)
    # print best XGBoost hyperparameters
    print('Best XGBoost hyperparameters: ', xgb_grid.best_params_)
    # print best XGBoost score
    print('Best XGBoost score: ', xgb_grid.best_score_)
    # predict on test set
    y_pred = xgb_grid.predict(X_test)
    # print classification report
    print(classification_report(y_test, y_pred))
    # print confusion matrix
    print(confusion_matrix(y_test, y_pred))

In [None]:
# grid_search(X_train, y_train, X_test, y_test)

In [None]:
X_train.shape[1]

In [None]:
# Feed-forward (dense) neural network; the LSTM import below is not used by this model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Dropout
import numpy as np
import pandas as pd
from keras import layers


model = Sequential()
model.add(Dense(500, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# summary() prints the table itself and returns None, so no print() is needed
model.summary()

In [None]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=120,batch_size=34, verbose=1,)

In [None]:
# clear session to avoid clutter from old models / layers.
from keras import backend as K 
K.clear_session()

In [None]:
# Neural Net Using RAW Data

import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('df_A.csv')

In [None]:
# take sample of first 10000 rows from df sample data using iloc
sample = df.iloc[:10000]

In [None]:
sample.head()

In [None]:
X = sample['combined']
y = sample['is_duplicate']

In [None]:
y_train = y.iloc[:int(y.shape[0]*0.8)]
y_test = y.iloc[int(y.shape[0]*0.8):]
X_train = X.iloc[:int(X.shape[0]*0.8)]
X_test = X.iloc[int(X.shape[0]*0.8):]

In [None]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

In [None]:
print(X_train[5])
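
The tokenized questions have different lengths, while the `Embedding` layer defined further down is declared with a fixed `input_length`. Keras provides a `pad_sequences` utility for bringing sequences to one length; the pure-Python helper below is only a hypothetical stand-in that shows the idea of post-padding with the reserved index 0. Cutting long sequences at the end is a choice of this sketch.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Right-pad each sequence with pad_value and cut anything beyond maxlen."""
    return [
        list(seq[:maxlen]) + [pad_value] * max(0, maxlen - len(seq))
        for seq in sequences
    ]

# toy token-id sequences of different lengths
toy_sequences = [[12, 7, 3], [5], [9, 8, 7, 6, 5, 4]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)
```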

In [None]:
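from keras.models import Sequential
from keras import layers
from keras.preprocessing.sequence import pad_sequences

embedding_dim = 50
maxlen = 100

# pad (or truncate) every tokenized question to maxlen so it matches the Embedding input_length
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)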

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
X_train[1]

In [None]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))