## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time searching for the best answer, and make writers answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.

We are going to build a similar model in this project!

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("train.csv", encoding='utf-8')

In [459]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


### Exploration

In [5]:
data = df[['question1', 'question2', 'is_duplicate']]
data = data.dropna()

In [52]:
data.head(3)

| | question1 | question2 | is_duplicate |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |


In [20]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['question1','question2']], data.is_duplicate, 
                                                    test_size=0.20, random_state=42)

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Lemmatization
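
To make the steps above concrete on a single sentence, here is a dependency-free sketch. It assumes a tiny, hypothetical stop-word list (`TOY_STOP_WORDS`) and a hypothetical helper `clean`, and it leaves out the last step because that needs the WordNet data; the notebook's own functions follow in the next cell.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's full English list.
TOY_STOP_WORDS = {"what", "is", "the", "to", "in", "by", "a", "of"}

def clean(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    # Normalizing: lowercase so "What" and "what" compare equal.
    text = text.lower()
    # Removing punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenization: split on whitespace.
    tokens = text.split()
    # Stop-word cleaning.
    return [t for t in tokens if t not in TOY_STOP_WORDS]

tokens = clean("What is the step by step guide to invest in shares?")
print(tokens)
```

Lowercasing before the stop-word filter matters in this sketch: without it, "What" would not match the lowercase entry "what".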

In [17]:
from gensim.utils import simple_preprocess
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

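# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download).
# A set makes the membership test in stop_word_remove fast.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()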

def punc_remove(docs):
    review_no_puncs = []
    for review in docs:
        review_no_punc = ''.join([char for char in review if char not in string.punctuation])
        review_no_puncs.append(review_no_punc)
    return review_no_puncs

def stop_word_remove(docs):
    review_no_stops = []
    for review in docs:
        tokens = review.split()
        review_no_stop = ' '.join([word for word in tokens if word not in stop_words])
        review_no_stops.append(review_no_stop)
    return review_no_stops

def lemmitization(docs):
    review_lemms = []
    for review in docs:
        tokens = review.split()
        review_lemm = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
        review_lemms.append(review_lemm)
    return review_lemms

def tokenize(texts):
    tokenized = []
    for doc in texts:
        tokenized.append(simple_preprocess(doc, min_len=2))
    return tokenized

def to_word_string(tokens):
    texts = []
    for doc in tokens:
        texts.append(' '.join([word for word in doc]))
    return texts

def preprocess(texts):
    """Clean raw question strings and return one token list per question."""
    texts = punc_remove(texts)
    texts = stop_word_remove(texts)
    texts = lemmitization(texts)
    texts = tokenize(texts)
    return texts

def make_word_list(*args):
    """Return the unique words found across one or more collections of token lists."""
    word_list = set()
    for token_lists in args:
        for tokens in token_lists:
            word_list.update(tokens)
    return list(word_list)

In [21]:
qs_tokens = preprocess(pd.concat([X_train['question1'], X_train['question2']]))

In [25]:
qs = to_word_string(qs_tokens)

In [28]:
unique_word_list = make_word_list(qs_tokens)

In [155]:
q1_train_tokens = preprocess(X_train['question1'])
q2_train_tokens = preprocess(X_train['question2'])
q1_test_tokens = preprocess(X_test['question1'])
q2_test_tokens = preprocess(X_test['question2'])

In [156]:
q1_train_str = to_word_string(q1_train_tokens)
q2_train_str = to_word_string(q2_train_tokens)
q1_test_str = to_word_string(q1_test_tokens)
q2_test_str = to_word_string(q2_test_tokens)

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
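
The cells below cover tf-idf and word2vec. For the two count-based items in the list, a minimal sketch could look like the following. The helper name `overlap_features` and the extra Jaccard ratio are assumptions for illustration, not part of the original notebook.

```python
def overlap_features(q1, q2):
    """Return simple count-based features for one question pair."""
    words1 = q1.lower().split()
    words2 = q2.lower().split()
    set1, set2 = set(words1), set(words2)
    shared = set1 & set2
    union = set1 | set2
    return {
        "q1_word_count": len(words1),
        "q2_word_count": len(words2),
        # Number of distinct words that appear in both questions.
        "shared_words": len(shared),
        # Jaccard similarity: shared vocabulary over combined vocabulary.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

features = overlap_features(
    "How can I increase the speed of my internet",
    "How can internet speed be increased",
)
print(features)
```

Applied row by row to `question1` and `question2`, these values could be added as extra numeric columns next to the similarity feature.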

In [205]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
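def cosine_sim(text1, text2):
    # TfidfVectorizer L2-normalizes each row by default, so the dot product
    # of the two rows is their cosine similarity.
    tfidf = vect.fit_transform([text1, text2])
    return (tfidf * tfidf.T).toarray()[0, 1]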
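
To see what the similarity measures, here is a hand-rolled sketch of cosine similarity. It is a simplification: it uses raw word counts instead of tf-idf weights, and `count_cosine` is a hypothetical helper, not the function used in the rest of the notebook.

```python
import math
from collections import Counter

def count_cosine(text1, text2):
    """Cosine similarity between the raw word-count vectors of two texts."""
    c1 = Counter(text1.lower().split())
    c2 = Counter(text2.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

similar = count_cosine("fish in salt water", "salt water fish")
unrelated = count_cosine("fish in salt water", "internet speed")
print(similar, unrelated)
```

Texts that share most of their words score close to 1, and texts with no words in common score 0.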

In [206]:
train_texts1 = X_train.question1.values.tolist()
train_texts2 = X_train.question2.values.tolist()

In [207]:
X_train_targets = []
for text1, text2 in zip(train_texts1, train_texts2):
    X_train_targets.append(cosine_sim(text1, text2))

In [209]:
X_train = X_train.reset_index(drop=True)

In [210]:
X_train = pd.concat([X_train, pd.Series(X_train_targets, name='cos_sim')], axis=1)

In [75]:
test_texts1 = X_test.question1.values.tolist()
test_texts2 = X_test.question2.values.tolist()

In [172]:
X_test_targets = []
for text1, text2 in zip(test_texts1, test_texts2):
    X_test_targets.append(cosine_sim(text1, text2))

In [152]:
X_test = X_test.reset_index(drop=True)

In [154]:
X_test = pd.concat([X_test, pd.Series(X_test_targets, name='cos_sim')], axis=1)

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

In [126]:
vectorizer = CountVectorizer(min_df=0, lowercase=False, max_features=1500)
vectorizer.fit(qs)

CountVectorizer(lowercase=False, max_features=1500, min_df=0)

In [127]:
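# Note: the vocabulary was learned from the preprocessed (lowercased) text in `qs`,
# but the raw questions are transformed here. With lowercase=False, capitalized
# words in the raw text will not match that vocabulary.
x_train_q1 = vectorizer.transform(X_train.question1)
x_train_q2 = vectorizer.transform(X_train.question2)
x_test_q1 = vectorizer.transform(X_test.question1)
x_test_q2 = vectorizer.transform(X_test.question2)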

In [224]:
import numpy as np
from scipy.sparse import hstack
from scipy import sparse
X_train_tfidf = hstack((x_train_q1,x_train_q2, 
                        sparse.csr_matrix(np.array(X_train_targets).reshape(-1, 1))))
X_test_tfidf = hstack((x_test_q1,x_test_q2, 
                       sparse.csr_matrix(np.array(X_test_targets).reshape(-1, 1))))

In [285]:
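from gensim.models import Word2Vec
# Word2Vec expects an iterable of token lists. Passing the joined strings in `qs`
# would make it treat every character as a word, so the token lists are used.
model = Word2Vec(sentences=qs_tokens, vector_size=100, window=5, min_count=1, workers=4)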

In [286]:
word_vectors = model.wv

In [288]:
word_vectors

<gensim.models.keyedvectors.KeyedVectors at 0x2cb2e11c9d0>
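
The word vectors are not used further in this notebook. One common way to turn them into a pair feature is to average the vectors of each question and compare the two averages. The sketch below assumes a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`) in place of the trained embeddings, and the helper names are illustrative.

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained embeddings.
toy_vectors = {
    "internet": [0.9, 0.1, 0.0],
    "speed": [0.7, 0.3, 0.1],
    "fish": [0.0, 0.2, 0.9],
    "water": [0.1, 0.1, 0.8],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_topic = cosine(mean_vector(["internet", "speed"], toy_vectors),
                    mean_vector(["speed"], toy_vectors))
other_topic = cosine(mean_vector(["internet", "speed"], toy_vectors),
                     mean_vector(["fish", "water"], toy_vectors))
print(same_topic, other_topic)
```

With the real model, the lookup would go through `word_vectors` instead of the toy dictionary, and words missing from the vocabulary would need the same "skip if unknown" handling.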

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [282]:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack
from scipy import sparse

def main_process(X):
    X.dropna(inplace=True)
    
    tokens = preprocess(pd.concat([X['question1'], X['question2']]))
    qs = to_word_string(tokens)
    
    # Note: a new vocabulary is fit on every call, including at predict time.
    vectorizer = CountVectorizer(min_df=0, lowercase=False, max_features=3000)
    vectorizer.fit(qs)
    
    x_q1 = vectorizer.transform(X.question1)
    x_q2 = vectorizer.transform(X.question2)
    
    
    train_texts1 = X.question1.values.tolist()
    train_texts2 = X.question2.values.tolist()
    
    X_targets = []
    for text1, text2 in zip(train_texts1, train_texts2):
        X_targets.append(cosine_sim(text1, text2))
    
    X_tfidf = hstack((x_q1, x_q2, 
                      sparse.csc_matrix(np.array(X_targets).reshape(-1, 1))))
    return X_tfidf
    

In [283]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

prep = FunctionTransformer(main_process)
model = LogisticRegression(max_iter=1700)

pipe = Pipeline([
    ('pre', prep),
    ('model', model)
])

In [284]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('pre',
                 FunctionTransformer(func=<function main_process at 0x000002CC72C89D30>)),
                ('model', LogisticRegression(max_iter=1700))])
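
Because `main_process` fits a new `CountVectorizer` each time it is called, the columns produced at predict time need not line up with the columns the model saw during `fit`. One possible alternative is a transformer that learns the vocabulary once and reuses it. The class `PairCountVectorizer` and the toy frames below are hypothetical, and the sketch leaves out the cleaning and the `cos_sim` feature to stay short.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class PairCountVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one shared vocabulary for both questions."""

    def __init__(self, max_features=3000):
        self.max_features = max_features

    def fit(self, X, y=None):
        # Learn the vocabulary once, from the questions passed to fit.
        self.vectorizer_ = CountVectorizer(max_features=self.max_features)
        self.vectorizer_.fit(pd.concat([X["question1"], X["question2"]]))
        return self

    def transform(self, X):
        # Reuse the fitted vocabulary so columns match between train and test.
        q1 = self.vectorizer_.transform(X["question1"])
        q2 = self.vectorizer_.transform(X["question2"])
        return hstack([q1, q2]).tocsr()

toy_train = pd.DataFrame({
    "question1": ["how to learn python", "best way to cook rice"],
    "question2": ["how can I learn python", "what is the capital of France"],
})
toy_test = pd.DataFrame({
    "question1": ["how to cook pasta"],
    "question2": ["best way to learn java"],
})

pair_prep = PairCountVectorizer().fit(toy_train)
train_features = pair_prep.transform(toy_train)
test_features = pair_prep.transform(toy_test)
print(train_features.shape, test_features.shape)
```

A transformer like this can take the place of the `FunctionTransformer` step in a `Pipeline`, since the pipeline calls `fit` only on the training data and `transform` on both.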

In [289]:
y_pred = pipe.predict(X_test)

In [281]:
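from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

print('Accuracy: ', accuracy_score(y_test, y_pred))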
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))

Accuracy:  0.5952534072076975
Recall:  0.6328439259855189
Precision:  0.464393771677367
F1 Score:  0.5356884443498616
Confusion Matrix : 
[[29252 21774]
 [10953 18879]]
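
As a check on how the four scores relate to the confusion matrix, they can be recomputed by hand from the printed counts. This is a worked example only; it relies on scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix printed above:
# [[TN FP]
#  [FN TP]]
tn, fp = 29252, 21774
fn, tp = 10953, 18879

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of true duplicates that were found
precision = tp / (tp + fp)       # share of predicted duplicates that are real
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Precision: {precision:.4f}")
print(f"F1 Score:  {f1:.4f}")
```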


In [264]:
model = LogisticRegression(max_iter=1700)
model.fit(X_train_tfidf, y_train)

LogisticRegression(max_iter=1700)

In [263]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

y_pred = model.predict(X_test_tfidf)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))

Accuracy:  0.4529174602389374
Recall:  0.999597747385358
Precision:  0.40273350980498085
F1 Score:  0.5741460972698217
Confusion Matrix : 
[[ 6802 44224]
 [   12 29820]]


In [294]:
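from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
from xgboost import XGBClassifier

# Note: 'reg:squarederror' is a regression objective; 'binary:logistic' is the
# usual objective for a binary classifier. The results below were produced with
# the setting shown here.
xgb = XGBClassifier(objective='reg:squarederror', n_estimators=500, use_label_encoder=False,
                    max_depth=3)
xgb.fit(X_train_tfidf, y_train)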

y_pred = xgb.predict(X_test_tfidf)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))

Accuracy:  0.47407801330727944
Recall:  0.9924242424242424
Precision:  0.4117377094777832
F1 Score:  0.5820104779971889
Confusion Matrix : 
[[ 8727 42299]
 [  226 29606]]
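
Accuracy is easier to judge next to a baseline. The row sums of the confusion matrix give the class sizes in the test set, so we can work out what a classifier would score by always predicting the larger class. This is a small worked example using only the counts printed above.

```python
# Counts copied from the XGBoost confusion matrix printed above.
tn, fp = 8727, 42299
fn, tp = 226, 29606

n_negative = tn + fp      # questions pairs that are truly not duplicates
n_positive = fn + tp      # question pairs that are truly duplicates
total = n_negative + n_positive

# Accuracy of always predicting the larger class.
majority_baseline = max(n_negative, n_positive) / total
model_accuracy = (tn + tp) / total

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Model accuracy:          {model_accuracy:.4f}")
```

Here the baseline is higher than the model's accuracy, which suggests the features or the train/test handling deserve a closer look before tuning the model further.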


In [297]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
cross_val_score(xgb, X_train_tfidf, y_train)

array([0.75674798, 0.75493924, 0.75928331, 0.75360975, 0.75594033])
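
With no `scoring` argument, `cross_val_score` reports the classifier's default score (accuracy) for each of five folds. A short way to summarize the fold scores printed above, using only the standard library:

```python
from statistics import mean, stdev

# Fold scores copied from the output above.
fold_scores = [0.75674798, 0.75493924, 0.75928331, 0.75360975, 0.75594033]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds

print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
```

The folds agree closely with each other, yet the cross-validated accuracy on the training data is well above the test accuracy reported earlier for the same model, a gap worth investigating.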

In [None]:
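from tensorflow.keras.preprocessing.text import Tokenizer

# Assumption: `sentences_train` and `sentences_test` are not defined anywhere in
# this notebook. Here each pair of cleaned questions is joined into one string.
sentences_train = [q1 + ' ' + q2 for q1, q2 in zip(q1_train_str, q2_train_str)]
sentences_test = [q1 + ' ' + q2 for q1, q2 in zip(q1_test_str, q2_test_str)]

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)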
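
The two cells above turn text into fixed-length integer sequences. A simplified, dependency-free sketch of the idea follows. The helpers `build_word_index` and `pad_post` are hypothetical stand-ins, and they skip details of the Keras versions such as punctuation filtering and the `num_words` cut-off.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_post(seq, maxlen):
    """Keep the last `maxlen` ids, then right-pad with zeros up to `maxlen`."""
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

toy_texts = ["salt water fish", "salt water"]
word_index = build_word_index(toy_texts)
encoded = [[word_index[w] for w in text.split()] for text in toy_texts]
padded = [pad_post(seq, maxlen=4) for seq in encoded]

print(word_index)
print(padded)
```

Index 0 is never assigned to a word, which is why it can serve as the padding value and why the vocabulary size is the number of words plus one.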

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50


model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
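
The parameter counts that `model.summary()` reports can be worked out by hand from the layer sizes. The sketch below assumes a hypothetical vocabulary size of 5000; the real value depends on the fitted tokenizer.

```python
# Hypothetical vocabulary size; the real value is len(tokenizer.word_index) + 1.
vocab_size = 5000
embedding_dim = 50
maxlen = 100

embedding_params = vocab_size * embedding_dim   # one vector per word id
flatten_units = maxlen * embedding_dim          # Flatten has no weights of its own
dense_params = flatten_units * 10 + 10          # weights plus one bias per unit
output_params = 10 * 1 + 1                      # weights plus bias of the sigmoid unit

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

Most of the parameters sit in the embedding layer, so the vocabulary size has the largest effect on model size.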

In [None]:
history = model.fit(X_train, y_train,
                    epochs=3,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
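# plot_history is not defined in this notebook; it is assumed to be a
# user-supplied helper that plots the curves stored in history.history.
plot_history(history)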