## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time looking for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives active seekers and writers a better experience and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("train.csv")

In [None]:
df.head(3)

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
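
One way to set aside that final test set is a stratified split, which keeps the share of duplicates the same in both parts. The sketch below uses a small made-up frame in place of train.csv; the 20% test size and `random_state=42` are arbitrary assumptions, not values taken from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (hypothetical rows, same column names).
toy = pd.DataFrame({
    "question1": [f"q1 text {i}" for i in range(20)],
    "question2": [f"q2 text {i}" for i in range(20)],
    "is_duplicate": [0, 0, 0, 1] * 5,
})

# Hold out 20% as the final test set; stratify keeps the label ratio stable.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["is_duplicate"], random_state=42
)

print(train_df.shape, test_df.shape)
print(test_df["is_duplicate"].mean())
```

The held-out part should only be touched once, for the final evaluation; model selection can then run on the training part alone.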

### Exploration

In [None]:
# How big is this dataset?
df.shape

In [None]:
# What portion of our questions are actually duplicate?
df['is_duplicate'].value_counts()

In [None]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(df['is_duplicate'].isnull().sum()))
print('Number of nulls in question1: {}'.format(df['question1'].isnull().sum()))
print('Number of nulls in question2: {}'.format(df['question2'].isnull().sum()))

In [None]:
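# How many question pairs and unique questions are there?
print('Total number of question pairs for training: {}'.format(len(df)))
# each question has its own id, so count the distinct ids across both columns
unique_qids = pd.concat([df['qid1'], df['qid2']]).nunique()
print('Total number of unique questions: {}'.format(unique_qids))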

In [None]:
#Remove id and qid1 and qid2
df.drop(['id', 'qid1', 'qid2'], axis=1, inplace=True)

In [None]:
# fill the missing values with empty strings
df = df.fillna('')
df.head()

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
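
The steps listed above can be chained into one function. The following is a minimal sketch using only the standard library: the stopword set is a hypothetical mini list (the notebook uses nltk's English list), and `naive_stem` is a crude stand-in for a real stemmer such as nltk's `PorterStemmer`, which this notebook does not actually apply.

```python
import re
import string

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of", "in"}

def naive_stem(word):
    """Crude suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_question(text):
    text = text.lower()                                               # normalize case
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = re.sub(r"\d+", "", text)                                   # drop digits
    tokens = text.split()                                             # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]                # drop stopwords
    return [naive_stem(t) for t in tokens]                            # stem what is left

print(clean_question("How do I learn Python 3 quickly?"))
# ['learn', 'python', 'quickly']
```

Lowercasing before the stopword check matters here, because the stopword list is lowercase.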

In [None]:
import re #regular expression
import spacy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [None]:
# Remove punctuation
import string
string.punctuation

def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text


In [None]:
df['q1_clean'] = df['question1'].apply(lambda x: remove_punct(x))
df['q2_clean'] = df['question2'].apply(lambda x: remove_punct(x))

In [None]:
df.drop(['question1', 'question2'], axis=1, inplace=True)

In [None]:
# Remove digits (despite its name, this function strips only numeric characters)
def remove_alphanumeric(text):
    text = ''.join([i for i in text if not i.isdigit()])
    return text

In [None]:
# apply function to data frame
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_alphanumeric(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_alphanumeric(x))

In [None]:
# Remove Stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test O(1)

def remove_stopwords(text):
    text = [word for word in text.split() if word.lower() not in (stop)]
    return " ".join(text)


In [None]:
# remove stopwords from questions
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_stopwords(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_stopwords(x))

df.head(3)

In [None]:
# define class to remove unwanted spaces and convert to lower case
class CleanText(object):
    def __init__(self, text):
        self.text = text
        
    def clean(self):
        # remove unwanted spaces and convert to lower case
        self.text = self.text.lower()
        self.text = self.text.strip()
        return self.text



In [None]:
# call class to remove unwanted spaces and convert to lower case for q1_clean and q2_clean
df['q1_clean'] = df['q1_clean'].apply(lambda x: CleanText(x).clean())
df['q2_clean'] = df['q2_clean'].apply(lambda x: CleanText(x).clean())

df.head(3)

In [None]:
df.shape

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
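
Of the ideas listed above, tf-idf is not computed with a vectorizer anywhere in this notebook, so here is a hedged sketch of how a tf-idf cosine feature could be built with scikit-learn's `TfidfVectorizer`. The question lists are made-up stand-ins for the `q1_clean` and `q2_clean` columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned question pairs (stand-ins for q1_clean / q2_clean).
q1 = ["learn python quickly", "best way lose weight", "capital france"]
q2 = ["learn python quickly", "lose weight fast", "install linux laptop"]

# Fit one vocabulary on all questions so both sides share the same columns.
vectorizer = TfidfVectorizer()
vectorizer.fit(q1 + q2)
q1_vec = vectorizer.transform(q1)
q2_vec = vectorizer.transform(q2)

# Rows are L2-normalised by default, so the row-wise dot product is the cosine.
tfidf_cosine = np.asarray(q1_vec.multiply(q2_vec).sum(axis=1)).ravel()
print(tfidf_cosine.round(3))
```

Identical questions score 1, questions without any shared word score 0, and partial overlaps fall in between.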

In [None]:
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [None]:
target = df['is_duplicate']

In [None]:
# create a dataframe with target
X = pd.DataFrame(target)

In [None]:
# calculate the length of questions and apply to X df
X['q1_len'] = df['q1_clean'].apply(lambda x: len(x))
X['q2_len'] = df['q2_clean'].apply(lambda x: len(x))

In [None]:
# calculate weight of each word in corpus
def get_weight(count, eps=10000, min_count=2):
    return 0 if count < min_count else 1 / (count + eps)

# join both questions of each pair into one string
# (casting the split lists to str would leave brackets and quotes inside the tokens)
pairs_qs = df['q1_clean'] + ' ' + df['q2_clean']
words = " ".join(pairs_qs).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}
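
To see what this weighting does, it helps to run it on a tiny corpus. The sketch below reuses the same `get_weight` formula on three made-up questions; `eps=1` is chosen only so the differences are visible at this scale, and the `toy_` names are hypothetical.

```python
from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # Words seen fewer than min_count times get weight 0 (treated as noise);
    # otherwise the weight shrinks as the word becomes more frequent.
    return 0 if count < min_count else 1 / (count + eps)

# Hypothetical mini corpus of cleaned questions.
toy_corpus = ["learn python fast", "learn java fast", "learn cooking"]
toy_counts = Counter(" ".join(toy_corpus).split())

toy_weights = {word: get_weight(count, eps=1) for word, count in toy_counts.items()}
print(toy_weights)
```

Here "learn" (3 occurrences) gets 0.25, "fast" (2 occurrences) gets about 0.33, and the words that appear once get 0. With the default `eps=10000` the weights are much closer together, so `eps` acts as a smoothing term.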


In [None]:
X['word_count'] = pairs_qs.apply(lambda x: len(str(x).split()))

In [None]:
# load spaCy's small English model
nlp = spacy.load('en_core_web_sm')


In [None]:
# count the distinct content words (non-stopword, non-punctuation tokens) in a question
def unique_words(text):
    doc = nlp(text)
    tokens = {token.text for token in doc if not token.is_stop and not token.is_punct}
    return len(tokens)

# count the distinct stopword or punctuation tokens in a question
def common_words(text):
    doc = nlp(text)
    # a token is never both a stopword and punctuation, so test with "or"
    tokens = {token.text for token in doc if token.is_stop or token.is_punct}
    return len(tokens)

In [None]:
X.head(3)

In [None]:
# use a separate name so the imported stopwords corpus is not shadowed
stopword_list = nltk.corpus.stopwords.words('english')

In [None]:
%%time
# nltk stopwords as a set for fast membership tests
stops = set(nltk.corpus.stopwords.words("english"))

def word_shares(row):
    q1_list = str(row['q1_clean']).lower().split()
    q1 = set(q1_list)
    q1words = q1.difference(stops)
    if len(q1words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    q2_list = str(row['q2_clean']).lower().split()
    q2 = set(q2_list)
    q2words = q2.difference(stops)
    if len(q2words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    words_hamming = sum(1 for i in zip(q1_list, q2_list) if i[0]==i[1])/max(len(q1_list), len(q2_list))

    q1stops = q1.intersection(stops)
    q2stops = q2.intersection(stops)

    q1_2gram = set([i for i in zip(q1_list, q1_list[1:])])
    q2_2gram = set([i for i in zip(q2_list, q2_list[1:])])

    shared_2gram = q1_2gram.intersection(q2_2gram)

    shared_words = q1words.intersection(q2words)
    shared_weights = [weights.get(w, 0) for w in shared_words]
    q1_weights = [weights.get(w, 0) for w in q1words]
    q2_weights = [weights.get(w, 0) for w in q2words]
    total_weights = q1_weights + q2_weights  # weights of both questions

    # guard the weighted ratios against all-zero weights to avoid NaN values
    weight_sum = np.sum(total_weights)
    R1 = np.sum(shared_weights) / weight_sum if weight_sum > 0 else 0 #tfidf share
    R2 = len(shared_words) / (len(q1words) + len(q2words) - len(shared_words)) #count share
    R31 = len(q1stops) / len(q1words) #stops in q1
    R32 = len(q2stops) / len(q2words) #stops in q2
    Rcosine_denominator = (np.sqrt(np.dot(q1_weights,q1_weights))*np.sqrt(np.dot(q2_weights,q2_weights)))
    Rcosine = np.dot(shared_weights, shared_weights)/Rcosine_denominator if Rcosine_denominator > 0 else 0
    if len(q1_2gram) + len(q2_2gram) == 0:
        R2gram = 0
    else:
        R2gram = len(shared_2gram) / (len(q1_2gram) + len(q2_2gram))
    
    # fuzzywuzzy compares strings, so join the tokens back together
    fuzzy_match = fuzz.token_sort_ratio(" ".join(q1_list), " ".join(q2_list))
    
    return '{}:{}:{}:{}:{}:{}:{}:{}:{}'.format(R1, R2, len(shared_words), R31, R32, R2gram, 
                                                  Rcosine, words_hamming, fuzzy_match)

X['word_shares'] = df.apply(word_shares, axis=1)
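
Three of the ratios computed in `word_shares` are easier to follow on a single pair. The sketch below recomputes the count share (Jaccard), the bigram share and the positional match in plain Python. It assumes both questions are already cleaned and non-empty, and it skips the stopword and weighting steps; `pair_features` is a hypothetical helper name.

```python
def pair_features(q1, q2):
    """Count share (Jaccard), bigram share and positional match for one pair."""
    t1, t2 = q1.split(), q2.split()
    s1, s2 = set(t1), set(t2)

    # Jaccard: shared words over all distinct words in either question.
    jaccard = len(s1 & s2) / len(s1 | s2)

    # Bigram share, normalised by the number of distinct bigrams in q1
    # plus the number in q2, as in word_shares.
    b1, b2 = set(zip(t1, t1[1:])), set(zip(t2, t2[1:]))
    total = len(b1) + len(b2)
    bigram_share = len(b1 & b2) / total if total else 0.0

    # Positional match: same word at the same index, over the longer length.
    hamming = sum(a == b for a, b in zip(t1, t2)) / max(len(t1), len(t2))
    return jaccard, bigram_share, hamming

print(pair_features("learn python quickly", "learn python fast"))
# (0.5, 0.25, 0.6666666666666666)
```

Because the bigram share divides by the sum of both bigram counts, two identical questions score 0.5 on it rather than 1.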


In [None]:
X.head(3)

In [None]:
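# NOTE: field 0 holds the weighted (tf-idf style) share R1 and field 1 the plain count share R2,
# so these two column names are the reverse of what they contain
X['word_match']       = X['word_shares'].apply(lambda x: float(x.split(':')[0]))
X['tfidf_word_match'] = X['word_shares'].apply(lambda x: float(x.split(':')[1]))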
X['shared_count']     = X['word_shares'].apply(lambda x: float(x.split(':')[2]))

X['stops1_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[3]))
X['stops2_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[4]))
X['shared_2gram']     = X['word_shares'].apply(lambda x: float(x.split(':')[5]))
X['cosine']           = X['word_shares'].apply(lambda x: float(x.split(':')[6]))
X['words_hamming']    = X['word_shares'].apply(lambda x: float(x.split(':')[7]))
X['fuzzy_match']    = X['word_shares'].apply(lambda x: float(x.split(':')[8]))


X['len_word_q1'] = df['q1_clean'].apply(lambda x: len(str(x).split()))
X['len_word_q2'] = df['q2_clean'].apply(lambda x: len(str(x).split()))
X['diff_len_word'] = X['len_word_q1'] - X['len_word_q2']
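
Packing the features into a colon-separated string and splitting it again works, but pandas can build the columns directly when the row function returns a `Series`. This is a minimal sketch of that alternative on made-up data, not a drop-in replacement for `word_shares`; `shared_features` and its two outputs are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned pairs standing in for the notebook's df.
toy = pd.DataFrame({
    "q1_clean": ["learn python quickly", "capital france"],
    "q2_clean": ["learn python fast", "install linux"],
})

def shared_features(row):
    w1 = set(row["q1_clean"].split())
    w2 = set(row["q2_clean"].split())
    shared = w1 & w2
    # Returning a Series makes apply() build one named column per entry.
    return pd.Series({
        "shared_count": len(shared),
        "count_share": len(shared) / len(w1 | w2),
    })

features = toy.apply(shared_features, axis=1)
print(features)
```

This avoids the string round trip and keeps each feature name next to the expression that computes it.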

In [None]:
X.head(3)

In [None]:
# drop word_shares column
X.drop(['word_shares'], axis=1, inplace=True)

In [None]:
# get unique values in cosine column
X['cosine'].unique()

In [None]:
X.drop(['cosine'], axis=1, inplace=True)


In [None]:
X['word_match'].unique()

In [None]:
X.drop(['word_match'], axis=1, inplace=True)

In [None]:
X.head(3)

In [None]:
# export to csv
X.to_csv('clean_df.csv')

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('clean_df.csv', index_col=0)

In [3]:
# take the first 10000 rows from df sample data using iloc
X = df.iloc[:10000]




### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [4]:
# model selection and evaluation
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
)
from sklearn.metrics import (
    log_loss, classification_report, confusion_matrix,
    accuracy_score, roc_auc_score, roc_curve, auc
)

# feature scaling and pipelines
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# candidate classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


In [5]:
y = X['is_duplicate']
X = X.drop(['is_duplicate', 'stops1_ratio', 'stops2_ratio', 'word_count'], axis=1)

In [6]:
# split the data 80:20 into train and test by position (no shuffling)
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]


In [7]:
y_train.shape, y_test.shape, X_train.shape, X_test.shape

((8000,), (2000,), (8000, 10), (2000, 10))

1. Test different models

In [8]:
scoring = 'accuracy'
kfold = StratifiedKFold(n_splits=10)

In [9]:
# compare baseline models without feature scaling
models = []
models.append(('LR', LogisticRegression()))
models.append(('RF', RandomForestClassifier()))


results = []
names = []
# fit each model on the training split and score it on the held-out test split
for name, model in models:
    model.fit(X_train, y_train)
    results.append(accuracy_score(y_test, model.predict(X_test)))
    names.append(name)
    msg = f'{name}: {results[-1]}'
    print(msg)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LR: 0.683
RF: 0.69
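
The convergence warning above suggests scaling the data or raising `max_iter`, and the `StratifiedKFold` object defined earlier is a natural fit for cross-validation. The sketch below shows one way to combine both, using synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; `max_iter=1000`, the shuffling and the seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (10 numeric features).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaling inside the pipeline is refit on each training fold, so no
# information from the validation fold leaks into the scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])

demo_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=demo_kfold, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean and spread over folds gives a steadier comparison between models than a single train/test split.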


In [10]:
# import xgboost
import xgboost as xgb

In [11]:
# XGBoost classifier with default hyperparameters
Xgb = xgb.XGBClassifier()
Xgb.fit(X_train, y_train)
y_pred = Xgb.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.75      0.76      1268
           1       0.59      0.60      0.59       732

    accuracy                           0.70      2000
   macro avg       0.68      0.68      0.68      2000
weighted avg       0.70      0.70      0.70      2000

[[956 312]
 [290 442]]
0.699
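
The figures in the report can be recomputed by hand from the confusion matrix, which makes clear what each one measures. The numbers below are copied from the output above; scikit-learn's convention is rows for the true class and columns for the predicted class.

```python
# Confusion matrix from the XGBoost run: rows = true class, columns = predicted.
tn, fp = 956, 312
fn, tp = 290, 442
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_dup = tp / (tp + fp)   # share of predicted duplicates that are real
recall_dup = tp / (tp + fn)      # share of real duplicates that were found
f1_dup = 2 * precision_dup * recall_dup / (precision_dup + recall_dup)

# Accuracy of always predicting "not duplicate" on this test split.
majority_baseline = (tn + fp) / total

print(round(accuracy, 3), round(precision_dup, 2), round(recall_dup, 2),
      round(f1_dup, 2), round(majority_baseline, 3))
# 0.699 0.59 0.6 0.59 0.634
```

Always answering "not duplicate" would already reach 0.634 accuracy on this split, so the 0.699 above is a gain of about 6.5 points over that baseline.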


In [12]:
# randomized search for XGBoost
# An exhaustive GridSearchCV over these lists would have to try about
# 4.1 billion combinations (times 5 folds), so sample a fixed number instead.
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'max_depth': [2, 4, 6, 8, 10, 20, 30, 50, 100],
    'min_child_weight': [1, 3, 5, 7, 9, 11, 13, 15, 17, 20],
    'gamma': [i / 100. for i in range(0, 51)],
    'subsample': [i / 100. for i in range(6, 101)],
    'colsample_bytree': [i / 100. for i in range(6, 101)],
    'learning_rate': [i / 100. for i in range(1, 101)]
}
Xgb = xgb.XGBClassifier()
# n_iter=50 and random_state=42 are arbitrary choices; raise n_iter as time allows
random_search = RandomizedSearchCV(Xgb, param_grid, n_iter=50, scoring='accuracy',
                                   n_jobs=-1, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print(random_search.best_score_)
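
Before launching a search, it is worth counting how many candidates the value lists above describe. The sketch below copies those lists into a hypothetical `search_space` dict and multiplies their lengths.

```python
import math

# Same value lists as the search space above.
search_space = {
    "max_depth": [2, 4, 6, 8, 10, 20, 30, 50, 100],
    "min_child_weight": [1, 3, 5, 7, 9, 11, 13, 15, 17, 20],
    "gamma": [i / 100.0 for i in range(0, 51)],
    "subsample": [i / 100.0 for i in range(6, 101)],
    "colsample_bytree": [i / 100.0 for i in range(6, 101)],
    "learning_rate": [i / 100.0 for i in range(1, 101)],
}

# An exhaustive grid tries every combination, once per cross-validation fold.
n_combinations = math.prod(len(values) for values in search_space.values())
cv_folds = 5
print(n_combinations, n_combinations * cv_folds)
# 4142475000 20712375000
```

At over 20 billion model fits, an exhaustive search over this space will not finish in practice; a coarser grid or a sampled search keeps the cost bounded.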

