## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time finding the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

In [4]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
df.shape

(404290, 6)

In [6]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, random_state=1)

In [7]:
print(df_train.shape)
print(df_test.shape)

(303217, 6)
(101073, 6)
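
The two shapes above follow from the defaults of `train_test_split`. The sketch below reproduces the arithmetic without calling the library, assuming that scikit-learn holds out 25% of the rows when no size is passed and rounds the test count up. The variable names are illustrative only.

```python
import math

n_rows = 404290            # rows in train.csv, from df.shape above
default_test_size = 0.25   # assumed scikit-learn default when no size is passed

# The test count is rounded up; the training set receives the remaining rows
n_test = math.ceil(default_test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)     # 303217 101073
```

Passing `stratify=df['is_duplicate']` to `train_test_split` would additionally keep the duplicate ratio the same in both parts.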


### Exploration

In [8]:
df_train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
394617,394617,527528,527529,How can I become a CEO if my law school grades...,What's a good online discussion board where I ...,0
146735,146735,60120,231728,Can a Singapore citizen obtain another citizen...,"As an American, could I cross the Canadian bor...",0
231076,231076,340783,167045,How can you get rid of pimples in your earlobe?,How do you get rid of a pimple in your ear?,1
66117,66117,103992,26685,How will releasing new 500 and 2000 rupee note...,If PM Modi wants to curb black money? Why was ...,1
114046,114046,8431,89956,What are some possible solutions if I forgot m...,How do I get my iCloud password?,1


In [9]:
# What portion of the question pairs are actually duplicates?
df_train['is_duplicate'].value_counts()

0    191290
1    111927
Name: is_duplicate, dtype: int64
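
The counts above can be turned into proportions to answer the question in the comment directly. This minimal sketch hard-codes the two counts from the output and also derives the accuracy of always predicting "not duplicate", which is a useful reference point for the accuracy scores reported later.

```python
# Counts copied from the value_counts() output above
counts = {0: 191290, 1: 111927}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(proportions.values())

print(f"duplicates: {proportions[1]:.3f}")        # about 0.369
print(f"majority baseline: {majority_baseline:.3f}")  # about 0.631
```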

In [10]:
# Are we missing any data?
df_train.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       1
is_duplicate    0
dtype: int64

In [11]:
# Are there any duplicate rows? (Summing the duplicated rows gives all zeros below, so there are none.)
df_train[df_train.duplicated()].sum()

id              0.0
qid1            0.0
qid2            0.0
question1       0.0
question2       0.0
is_duplicate    0.0
dtype: float64

In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303217 entries, 394617 to 128037
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            303217 non-null  int64 
 1   qid1          303217 non-null  int64 
 2   qid2          303217 non-null  int64 
 3   question1     303217 non-null  object
 4   question2     303216 non-null  object
 5   is_duplicate  303217 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 16.2+ MB


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
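
The punctuation and normalizing steps in the list above can be sketched with the standard library alone. `strip_and_lower` is a hypothetical helper, not part of this notebook's pipeline. It relies on `string.punctuation`, which covers ASCII punctuation only, so characters such as curly quotes would pass through untouched.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_and_lower(text):
    """Remove ASCII punctuation, then lowercase the text."""
    return text.translate(_PUNCT_TABLE).lower()

example = strip_and_lower("My crush didn't accept my friend request, but")
print(example)  # my crush didnt accept my friend request but
```

A translation table avoids rebuilding the string character by character, which matters when the function is applied to several hundred thousand questions.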

In [13]:
df_train = df_train.dropna(axis=0)

In [14]:
df_train = df_train.loc[:,'question1':'is_duplicate'].reset_index(drop=True)
df_train.head()

Unnamed: 0,question1,question2,is_duplicate
0,How can I become a CEO if my law school grades...,What's a good online discussion board where I ...,0
1,Can a Singapore citizen obtain another citizen...,"As an American, could I cross the Canadian bor...",0
2,How can you get rid of pimples in your earlobe?,How do you get rid of a pimple in your ear?,1
3,How will releasing new 500 and 2000 rupee note...,If PM Modi wants to curb black money? Why was ...,1
4,What are some possible solutions if I forgot m...,How do I get my iCloud password?,1


In [15]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.preprocessing import FunctionTransformer

nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt')

def clean_all(text):

    # remove punctuation
    text = "".join([char for char in text if char not in string.punctuation])

    # make lowercase
    text = text.lower()

    # remove stopwords (the NLTK corpus file is named 'english', in lowercase)
    eng_stopwords = stopwords.words('english')
    text = [word for word in text.split() if word not in eng_stopwords]

    # lemmatize
    lemmatizer = WordNetLemmatizer()
    text = ' '.join([lemmatizer.lemmatize(word) for word in text])

    # stem
    # NOTE: text is already a joined string at this point, so this loop visits
    # single characters and leaves the text unchanged (the outputs below are
    # lemmatized but not stemmed). To stem per word, use
    # ' '.join(ps.stem(word) for word in text.split()) instead.
    ps = PorterStemmer()
    text = ''.join([ps.stem(word) for word in text])

    return text

# Create a Transformer from the function so that we can use it in a Pipeline
cleaner = FunctionTransformer(clean_all)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
df_train['question1_cleaned'] = df_train['question1'].apply(lambda x: clean_all(x))
df_train.head(10)

Unnamed: 0,question1,question2,is_duplicate,question1_cleaned
0,How can I become a CEO if my law school grades...,What's a good online discussion board where I ...,0,become ceo law school grade competitive
1,Can a Singapore citizen obtain another citizen...,"As an American, could I cross the Canadian bor...",0,singapore citizen obtain another citizenship b...
2,How can you get rid of pimples in your earlobe?,How do you get rid of a pimple in your ear?,1,get rid pimple earlobe
3,How will releasing new 500 and 2000 rupee note...,If PM Modi wants to curb black money? Why was ...,1,releasing new 500 2000 rupee note help eradica...
4,What are some possible solutions if I forgot m...,How do I get my iCloud password?,1,possible solution forgot icloud password
5,Are there any languages that use the same word...,Is it widespread for languages to use the same...,0,language use word iron steel
6,"My crush didn't accept my friend request, but ...",If my crush has not accepted my friend request...,0,crush didnt accept friend request accepted req...
7,Is Sun in bal awastha i.e at 0 degree in 9th h...,Is Sun in bal awastha i.e at 0 degree in 9th h...,1,sun bal awastha ie 0 degree 9th house sign leo...
8,Where can I buy Nestle Wonder Balls?,How good is Nestle Pure Life Water for you?,0,buy nestle wonder ball
9,How do I backup my pictures and music from my ...,How do I transfer music from iTunes to iPhone?,0,backup picture music iphone itunes


In [17]:
df_train['question2_cleaned'] = df_train['question2'].apply(lambda x: clean_all(x))
df_train.head(10)

Unnamed: 0,question1,question2,is_duplicate,question1_cleaned,question2_cleaned
0,How can I become a CEO if my law school grades...,What's a good online discussion board where I ...,0,become ceo law school grade competitive,whats good online discussion board air daily f...
1,Can a Singapore citizen obtain another citizen...,"As an American, could I cross the Canadian bor...",0,singapore citizen obtain another citizenship b...,american could cross canadian border child chi...
2,How can you get rid of pimples in your earlobe?,How do you get rid of a pimple in your ear?,1,get rid pimple earlobe,get rid pimple ear
3,How will releasing new 500 and 2000 rupee note...,If PM Modi wants to curb black money? Why was ...,1,releasing new 500 2000 rupee note help eradica...,pm modi want curb black money new 2000 rupee n...
4,What are some possible solutions if I forgot m...,How do I get my iCloud password?,1,possible solution forgot icloud password,get icloud password
5,Are there any languages that use the same word...,Is it widespread for languages to use the same...,0,language use word iron steel,widespread language use word one
6,"My crush didn't accept my friend request, but ...",If my crush has not accepted my friend request...,0,crush didnt accept friend request accepted req...,crush accepted friend request facebook mean do...
7,Is Sun in bal awastha i.e at 0 degree in 9th h...,Is Sun in bal awastha i.e at 0 degree in 9th h...,1,sun bal awastha ie 0 degree 9th house sign leo...,sun bal awastha ie 0 degree 9th house sign leo...
8,Where can I buy Nestle Wonder Balls?,How good is Nestle Pure Life Water for you?,0,buy nestle wonder ball,good nestle pure life water
9,How do I backup my pictures and music from my ...,How do I transfer music from iTunes to iPhone?,0,backup picture music iphone itunes,transfer music itunes iphone


In [18]:
df_train = df_train.loc[:,'is_duplicate':'question2_cleaned']

In [19]:
df_train.head()

Unnamed: 0,is_duplicate,question1_cleaned,question2_cleaned
0,0,become ceo law school grade competitive,whats good online discussion board air daily f...
1,0,singapore citizen obtain another citizenship b...,american could cross canadian border child chi...
2,1,get rid pimple earlobe,get rid pimple ear
3,1,releasing new 500 2000 rupee note help eradica...,pm modi want curb black money new 2000 rupee n...
4,1,possible solution forgot icloud password,get icloud password


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
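
The word-count and shared-word features from the list above can be sketched in a few lines of plain Python. `pair_features` is a hypothetical helper, and the Jaccard ratio is an extra feature added here for illustration. It assumes both questions are already cleaned and space-separated.

```python
def pair_features(q1, q2):
    """Simple overlap features for two cleaned, space-separated questions."""
    words1, words2 = q1.split(), q2.split()
    set1, set2 = set(words1), set(words2)
    shared = set1 & set2
    union = set1 | set2
    return {
        'q1_word_count': len(words1),
        'q2_word_count': len(words2),
        'shared_words': len(shared),
        # Share of distinct words that appear in both questions
        'jaccard': len(shared) / len(union) if union else 0.0,
    }

features = pair_features('get rid pimple earlobe', 'get rid pimple ear')
print(features)
```

For the example pair, three of the five distinct words are shared, so the Jaccard ratio is 0.6.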

In [20]:
# For tokenization
import nltk

# For converting words into frequency counts
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [65]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

Document1 = df_train['question1_cleaned'][3]
Document2 = df_train['question2_cleaned'][3]

corpus = [Document1, Document2]

X_train_counts = count_vect.fit_transform(corpus)

pd.DataFrame(X_train_counts.toarray(),columns=count_vect.get_feature_names_out(),index=['Document 1','Document 2'])

Unnamed: 0,2000,500,black,curb,eradicating,help,introduced,modi,money,new,note,pm,releasing,rupee,want
Document 1,1,1,1,0,1,1,0,0,1,1,1,0,1,1,0
Document 2,1,0,1,1,0,0,1,1,1,1,1,1,0,1,1


In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

trsfm=vectorizer.fit_transform(corpus)
pd.DataFrame(trsfm.toarray(),columns=vectorizer.get_feature_names_out(),index=['Document 1','Document 2'])


Unnamed: 0,2000,500,black,curb,eradicating,help,introduced,modi,money,new,note,pm,releasing,rupee,want
Document 1,0.268208,0.376957,0.268208,0.0,0.376957,0.376957,0.0,0.0,0.268208,0.268208,0.268208,0.0,0.376957,0.268208,0.0
Document 2,0.250969,0.0,0.250969,0.352728,0.0,0.0,0.352728,0.352728,0.250969,0.250969,0.250969,0.352728,0.0,0.250969,0.352728


In [67]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(trsfm[0:1], trsfm)

array([[1.        , 0.40387178]])
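
The second value above can be checked by hand. Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below applies that formula to the two TF-IDF rows copied from the table above. The `cosine` helper is illustrative only.

```python
import math

# TF-IDF rows copied from the table above (columns in the same order)
doc1 = [0.268208, 0.376957, 0.268208, 0.0, 0.376957, 0.376957, 0.0, 0.0,
        0.268208, 0.268208, 0.268208, 0.0, 0.376957, 0.268208, 0.0]
doc2 = [0.250969, 0.0, 0.250969, 0.352728, 0.0, 0.0, 0.352728, 0.352728,
        0.250969, 0.250969, 0.250969, 0.352728, 0.0, 0.250969, 0.352728]

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine(doc1, doc2)
print(round(similarity, 4))  # 0.4039
```

Only the six words present in both documents contribute to the dot product. Both rows already have length 1, so the division changes the result very little.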

In [None]:
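def get_cosine_similarity(doc1, doc2):
    # Fit a TF-IDF vocabulary on this pair only and compare the two rows
    vectorizer = TfidfVectorizer()
    try:
        trsfm = vectorizer.fit_transform([doc1, doc2])
    except ValueError:
        # Both questions are empty after cleaning, so there is no vocabulary
        return 0.0
    return cosine_similarity(trsfm[0:1], trsfm[1:2])[0, 0]

# Fitting a new vectorizer for every pair is slow on ~300k rows
pairwise_sim = [
    get_cosine_similarity(q1, q2)
    for q1, q2 in zip(df_train['question1_cleaned'], df_train['question2_cleaned'])
]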

In [74]:
df_train = df_train.dropna(axis=0)

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

# Fit one TF-IDF vocabulary on the cleaned questions from both columns
vectorizer = TfidfVectorizer()
Document1 = df_train['question1_cleaned']
Document2 = df_train['question2_cleaned']

corpus = pd.concat([Document1, Document2])
vectorizer.fit(corpus)

vec_question1_train = vectorizer.transform(Document1)
vec_question2_train = vectorizer.transform(Document2)

# Cosine similarity of each question pair (each entry is a 1x1 array)
cos_sim_lst = []
for i in range(len(df_train)):
    cos_sim_lst.append(cosine_similarity(vec_question1_train[i], vec_question2_train[i]))

In [28]:
cos_sim_lst

[array([[0.]]),
 array([[0.20088057]]),
 array([[0.58481618]]),
 array([[0.41558813]]),
 array([[0.6282172]]),
 array([[0.40196859]]),
 array([[0.62258093]]),
 array([[1.]]),
 array([[0.42405255]]),
 array([[0.6456331]]),
 array([[0.24750132]]),
 array([[1.]]),
 array([[0.]]),
 array([[0.84381446]]),
 array([[1.]]),
 array([[0.27653011]]),
 array([[0.34277343]]),
 array([[0.61627327]]),
 array([[0.37297201]]),
 array([[0.45138402]]),
 array([[0.68209891]]),
 array([[0.08216513]]),
 array([[0.78678501]]),
 array([[0.44084841]]),
 array([[0.6847579]]),
 array([[0.42465657]]),
 array([[0.76723405]]),
 array([[0.87630306]]),
 array([[0.38468354]]),
 array([[0.84941567]]),
 array([[0.78682158]]),
 array([[0.64608008]]),
 array([[0.91824926]]),
 array([[0.1771505]]),
 array([[0.26551076]]),
 array([[0.59338168]]),
 array([[0.55881366]]),
 array([[0.4896249]]),
 array([[0.54760265]]),
 array([[0.65929564]]),
 array([[0.6057367]]),
 array([[0.21964078]]),
 array([[0.79160913]]),
 array([[0.526
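
Calling `cosine_similarity` once per row, as in the loop above, is slow on roughly 300k pairs. A possible alternative, sketched here on two toy pairs taken from the cleaned data, relies on `TfidfVectorizer` normalizing each row to unit length by default (`norm='l2'`). Under that assumption the cosine similarity of a pair is the row-wise dot product of the two sparse matrices. The names `q1`, `q2`, and `row_sim` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

q1 = ["get rid pimple earlobe", "buy nestle wonder ball"]
q2 = ["get rid pimple ear", "good nestle pure life water"]

# One shared vocabulary for both columns, as in the cell above
vectorizer = TfidfVectorizer().fit(q1 + q2)
v1 = vectorizer.transform(q1)
v2 = vectorizer.transform(q2)

# Rows are unit length, so the element-wise product summed per row is the cosine
row_sim = np.asarray(v1.multiply(v2).sum(axis=1)).ravel()
print(row_sim)
```

The result is a flat array with one float per pair, which can be assigned directly to a DataFrame column.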

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
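
As a minimal sketch of the first option, the pairwise cosine similarity can serve as a single input feature for logistic regression. The ten similarities and labels below are copied from the outputs earlier in this notebook, assuming the similarity list is in the same row order as `df_train.head(10)`. Ten rows are far too few for a real model, so this only illustrates the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# First ten pairwise similarities and their is_duplicate labels
cos_sim = np.array([0.0, 0.20088057, 0.58481618, 0.41558813, 0.6282172,
                    0.40196859, 0.62258093, 1.0, 0.42405255, 0.6456331])
labels = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 0])

# scikit-learn expects a 2-D feature matrix: one row per pair, one column per feature
X = cos_sim.reshape(-1, 1)

clf = LogisticRegression()
clf.fit(X, labels)

# Estimated probability that each pair is a duplicate
duplicate_proba = clf.predict_proba(X)[:, 1]
print(clf.coef_[0, 0])
```

A positive coefficient means that a higher similarity raises the estimated probability of a duplicate. Further features, such as word counts or shared-word counts, would be added as extra columns of `X`.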

In [20]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [21]:
# # drop rows with missing values
# df = df.dropna(axis=0)

# # drop duplicate rows
# df = df.drop_duplicates()

In [22]:
# from sklearn.naive_bayes import BernoulliNB # Bernoulli because we have binary features
# from sklearn.pipeline import Pipeline
# from sklearn.model_selection import train_test_split

# preprocessing_pipeline = Pipeline(steps=[
#     ('cleaning',cleaner),
#     ('preprocessing',CountVectorizer())
# ])

# preprocessing = ColumnTransformer(transformers=[
#     ('preprocessing_1', preprocessing_pipeline,'question1'),
#     ('preprocessing_2', preprocessing_pipeline,'question2')
# ])

# pipeline = Pipeline([
#     # ('cleaning', cleaning),
#     ('preprocessing', preprocessing), 
#     ('model', BernoulliNB())
# ])

# X_train, X_test, y_train, y_test = train_test_split(df_train[['question1','question2']], df_train['is_duplicate'].astype('int'), test_size=0.20, random_state=1)

# pipeline.fit(X_train, y_train)
# train_accuracy = pipeline.score(X_train, y_train)
# test_accuracy = pipeline.score(X_test, y_test)

# print(f'Train accuracy:\t{train_accuracy}')
# print(f'Test accuracy:\t{test_accuracy}')

#### CountVectorizer & BernoulliNB

In [23]:
from sklearn.naive_bayes import BernoulliNB # binarizes features at 0 by default, so word counts become presence/absence
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

preprocessing = ColumnTransformer(transformers=
[
    ('preprocessing_1', CountVectorizer() ,'question1_cleaned'),
    ('preprocessing_2', CountVectorizer() ,'question2_cleaned')])

pipeline = Pipeline([
    ('preprocessing', preprocessing), 
    ('model', BernoulliNB())
    ])

X_train, X_test, y_train, y_test = train_test_split(df_train[['question1_cleaned','question2_cleaned']], df_train['is_duplicate'].astype('int'), test_size=0.20, random_state=1)

pipeline.fit(X_train, y_train)
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print(f'Train accuracy:\t{train_accuracy}')
print(f'Test accuracy:\t{test_accuracy}')


Train accuracy:	0.7592220041884471
Test accuracy:	0.724853241870589


#### TfidfVectorizer & BernoulliNB

In [31]:
from sklearn.naive_bayes import BernoulliNB # binarizes features at 0 by default, so TF-IDF weights reduce to the same presence/absence features as the counts
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessing = ColumnTransformer(transformers=
[
    ('preprocessing_1', TfidfVectorizer() ,'question1_cleaned'),
    ('preprocessing_2', TfidfVectorizer() ,'question2_cleaned')])

pipeline = Pipeline([
    ('preprocessing', preprocessing), 
    ('model', BernoulliNB())
    ])

X_train, X_test, y_train, y_test = train_test_split(df_train[['question1_cleaned','question2_cleaned']], df_train['is_duplicate'].astype('int'), test_size=0.20, random_state=1)

pipeline.fit(X_train, y_train)
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print(f'Train accuracy:\t{train_accuracy}')
print(f'Test accuracy:\t{test_accuracy}')


Train accuracy:	0.7592220041884471
Test accuracy:	0.724853241870589
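
These scores are identical to the CountVectorizer run. A likely explanation is that `BernoulliNB` binarizes its input with a default threshold of 0 (`binarize=0.0`), so a word count and a TF-IDF weight for the same word both become 1. The sketch below illustrates this with the first five columns of the two tables from the Feature Engineering section. The `binarize` helper is a stand-in for what the model does internally, not scikit-learn's own function.

```python
def binarize(rows, threshold=0.0):
    """Map every value above the threshold to 1 and the rest to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]

# Columns: 2000, 500, black, curb, eradicating
counts = [[1, 1, 1, 0, 1],
          [1, 0, 1, 1, 0]]
tfidf = [[0.268208, 0.376957, 0.268208, 0.0, 0.376957],
         [0.250969, 0.0, 0.250969, 0.352728, 0.0]]

same_features = binarize(counts) == binarize(tfidf)
print(same_features)  # True
```

A model that uses the weights themselves, such as `MultinomialNB` or logistic regression, would be needed to see any effect from TF-IDF.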


#### Word2Vec & BernoulliNB

In [45]:
def tokenize(text):
    tokens = text.split()
    return tokens

In [46]:
df_train['question1_cleaned_tokenized'] = df_train['question1_cleaned'].apply(tokenize)
df_train['question2_cleaned_tokenized'] = df_train['question2_cleaned'].apply(tokenize)
df_train.head()

Unnamed: 0,is_duplicate,question1_cleaned,question2_cleaned,question1_cleaned_tokenized,question2_cleaned_tokenized
0,0,become ceo law school grade competitive,whats good online discussion board air daily f...,"[become, ceo, law, school, grade, competitive]","[whats, good, online, discussion, board, air, ..."
1,0,singapore citizen obtain another citizenship b...,american could cross canadian border child chi...,"[singapore, citizen, obtain, another, citizens...","[american, could, cross, canadian, border, chi..."
2,1,get rid pimple earlobe,get rid pimple ear,"[get, rid, pimple, earlobe]","[get, rid, pimple, ear]"
3,1,releasing new 500 2000 rupee note help eradica...,pm modi want curb black money new 2000 rupee n...,"[releasing, new, 500, 2000, rupee, note, help,...","[pm, modi, want, curb, black, money, new, 2000..."
4,1,possible solution forgot icloud password,get icloud password,"[possible, solution, forgot, icloud, password]","[get, icloud, password]"


In [47]:
import gensim

# Passing the sentences to the constructor builds the vocabulary and trains the model,
# so no separate call to Model_CBoW.train is needed. gensim's default sg=0 gives a CBOW model.
Model_CBoW = gensim.models.Word2Vec(df_train['question1_cleaned_tokenized'], vector_size = 100, window = 5, min_count = 1)

In [None]:
# Caution: running this replaces the model above with one trained on question2 only
Model_CBoW = gensim.models.Word2Vec(df_train['question2_cleaned_tokenized'], vector_size = 100, window = 5, min_count = 1)

In [48]:
Model_CBoW.wv['school']

array([ 0.30626673, -1.4813572 ,  1.4438479 ,  0.1176673 , -0.05845909,
       -0.6039322 ,  0.14252529, -0.0250668 ,  0.63708794, -0.06565531,
       -0.63601655, -1.3520248 ,  1.3923845 ,  0.86414075,  0.12141857,
        1.2088464 , -0.29210582, -0.94989145,  0.08012938, -0.38129961,
        1.785485  , -1.2206194 ,  0.29603723, -1.674993  , -0.9170058 ,
        1.0199337 , -0.9172853 , -1.8531643 , -1.6108813 , -0.5632631 ,
        1.6912004 ,  0.8579326 , -2.1741757 ,  1.098585  ,  0.88569295,
        2.89512   ,  0.2630835 , -0.8194446 ,  0.7330894 , -1.5767498 ,
        0.66460997, -0.43954933, -3.4969757 , -1.5781193 ,  2.1128495 ,
       -0.5824082 , -3.119341  ,  0.9600089 ,  0.38305953,  0.4144633 ,
        0.7754835 , -1.2191876 ,  0.04605655,  0.48488665, -1.2559518 ,
        0.02976516,  0.7597034 ,  1.8602806 ,  0.4396544 ,  1.4920886 ,
       -0.6650546 ,  2.2185013 ,  0.99010766, -0.5134579 , -2.3844233 ,
       -0.6341457 ,  1.2214446 ,  0.09481429, -0.24935493,  0.17

In [49]:
Model_CBoW.wv.most_similar('school')

[('college', 0.7359702587127686),
 ('schooler', 0.7320328950881958),
 ('harvard', 0.7086325883865356),
 ('undergrad', 0.705434262752533),
 ('juilliard', 0.7047107219696045),
 ('grade', 0.7000455856323242),
 ('stanford', 0.68809974193573),
 ('mit', 0.6841893792152405),
 ('graduate', 0.6746595501899719),
 ('literay', 0.6720913648605347)]

In [43]:
# import gensim

# Model_CBoW = gensim.models.Word2Vec(df_train[['question1_cleaned_tokenized','question2_cleaned_tokenized']], vector_size = 100, window = 5, min_count = 1)
# Model_CBoW.train

In [33]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

# gensim's Word2Vec has no fit/transform methods, so putting it in a ColumnTransformer fails with:
# TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Word2Vec<vocab=0, vector_size=100, alpha=0.025>' (type <class 'gensim.models.word2vec.Word2Vec'>) doesn't.
# One workaround: average the word vectors of each question into a fixed-length vector.

def question_vector(text, wv):
    # Mean of the vectors of all in-vocabulary words; zeros if none are known
    vectors = [wv[word] for word in text.split() if word in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def embed(frame, wv):
    # Concatenate the question1 and question2 vectors for every row
    q1 = np.vstack([question_vector(text, wv) for text in frame['question1_cleaned']])
    q2 = np.vstack([question_vector(text, wv) for text in frame['question2_cleaned']])
    return np.hstack([q1, q2])

X_train, X_test, y_train, y_test = train_test_split(df_train[['question1_cleaned','question2_cleaned']], df_train['is_duplicate'].astype('int'), test_size=0.20, random_state=1)

X_train_vec = embed(X_train, Model_CBoW.wv)
X_test_vec = embed(X_test, Model_CBoW.wv)

# BernoulliNB binarizes each embedding dimension at 0, which discards magnitude information
model = BernoulliNB()

model.fit(X_train_vec, y_train)
train_accuracy = model.score(X_train_vec, y_train)
test_accuracy = model.score(X_test_vec, y_test)

print(f'Train accuracy:\t{train_accuracy}')
print(f'Test accuracy:\t{test_accuracy}')