## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time hunting for the best answer and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives active seekers and writers a better experience and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd

In [63]:
df = pd.read_csv("train.csv")

In [64]:
df.head(3)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
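
One way to set that data aside is a stratified hold-out, so the share of duplicates stays about the same in both parts. The sketch below uses only the standard library; `stratified_holdout`, the 20% fraction and the seed are my own illustrative choices, not part of the dataset instructions.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within each class
        n_test = round(len(indices) * test_fraction)  # per-class test size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy labels with an imbalance similar in spirit to is_duplicate
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

The returned index lists can be passed to `df.iloc[...]` to build the two frames.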

### **Exploration and Data Analysis**

In [4]:
# import lets plot
import numpy as np
from lets_plot import *
LetsPlot.setup_html()

In [5]:
# How big is this dataset?
df.shape

(404290, 6)

In [6]:
# What portion of our questions are actually duplicate?
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
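
These counts imply a class imbalance that is worth keeping in mind when reading accuracy figures. A quick sketch of the arithmetic (the variable names are mine; the counts are copied from the output above):

```python
# class counts copied from the value_counts() output
counts = {0: 255027, 1: 149263}

total = sum(counts.values())
duplicate_share = counts[1] / total               # fraction of pairs labelled duplicate
majority_baseline = max(counts.values()) / total  # accuracy of always predicting the larger class

print(f"duplicate share:   {duplicate_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Any model should beat the majority baseline before its accuracy is considered meaningful.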

In [7]:
# plot the class distribution with lets-plot (a ggplot2-style API)
ggplot(df, aes(x='is_duplicate', fill = 'is_duplicate')) + geom_bar() + ggtitle(" ") + labs(x="Class", y="Count") +\
scale_fill_discrete(guide='none') + \
theme_classic() + \
flavor_high_contrast_dark() 


In [8]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(df['is_duplicate'].isnull().sum()))
print('Number of nulls in question1: {}'.format(df['question1'].isnull().sum()))
print('Number of nulls in question2: {}'.format(df['question2'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in question1: 1
Number of nulls in question2: 2


In [9]:
# How many question pairs are there in total?
print('Total number of question pairs for training: {}'.format(len(df)))

Total number of question pairs for training: 404290


In [10]:
# count how many times a question appears in the df
qids = pd.Series(df[df['qid1'].notnull()]['qid1'].tolist() + df[df['qid2'].notnull()]['qid2'].tolist())

In [11]:
# plot the distribution of question IDs with lets-plot
ggplot({'qid': qids.tolist()}, aes(x='qid')) + geom_histogram(bins=100) + ggtitle(" ") + labs(x="Question ID", y="Count")

In [12]:
# concat question1 and question2 into a single string
questions = df['question1'].astype(str) + " " + df['question2'].astype(str).tolist()

In [13]:
questions = questions.apply(len)

In [14]:
# Get the mean and population standard deviation of a list of numbers
# (despite the function name, min and max are computed separately below)
def min_max_mean_std(numbers):
    mean = sum(numbers) / len(numbers)
    std = (sum([(x - mean) ** 2 for x in numbers]) / len(numbers)) ** 0.5
    return mean, std

In [15]:
# call function on questions list
mean, std = min_max_mean_std(questions)
print(mean, std)

120.6450963417349 55.00734920633935


In [16]:
min(questions), max(questions), mean, std

(6, 1319, 120.6450963417349, 55.00734920633935)
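
The helper above returns only the mean and the population standard deviation, so min and max have to be computed separately. A small alternative sketch that returns all four values with the standard `statistics` module (the name `summarize` is hypothetical):

```python
import statistics

def summarize(numbers):
    """Return min, max, mean and population standard deviation in one pass over a copy."""
    values = list(numbers)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # divides by n, like the hand-written formula
    }

stats = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```

`pstdev` is used rather than `stdev` so the result matches the divide-by-`len(numbers)` formula in the notebook.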

In [17]:
# plot the distribution of combined character lengths with lets-plot
ggplot(questions, aes(x=questions)) + \
    geom_histogram(bins = 100, fill = 'white') +\
          ggtitle(" ") + labs(x="Question Length", y="Count") + \
          theme_classic() + \
            flavor_high_contrast_dark() 
            

In [18]:
# add a new column to the dataframe with the length of each question
df['characters'] = questions

In [19]:
# word count for each question
questions = df['question1'].astype(str) + " " + df['question2'].astype(str).tolist()

# split questions into words

In [20]:
questions = questions.apply(lambda x: x.lower().split())

In [21]:
questions = questions.apply(len)

In [22]:
mean, std = min_max_mean_std(questions)

In [23]:
min(questions), max(questions), mean, std

(2, 270, 22.124200450171905, 10.074891656619457)

In [24]:
# plot the distribution of combined word counts with lets-plot
ggplot(questions, aes(x=questions)) + \
    geom_histogram(bins = 100, fill = 'white') +\
          ggtitle(" ") + labs(x="Question Word Length", y="Count") + \
          theme_classic() + \
            flavor_high_contrast_dark()

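Based on these distributions, I will drop pairs whose combined word count is very small or very large. The filter applied further down keeps pairs with more than 5 and fewer than 67 words.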

In [25]:
df['words'] = questions

In [26]:
#Remove id and qid1 and qid2
df.drop(['id', 'qid1', 'qid2'], axis=1, inplace=True)

In [27]:
df.head()

Unnamed: 0,question1,question2,is_duplicate,characters,words
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,124,26
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,140,21
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,133,24
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,116,20
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,116,20


In [28]:
# convert is_duplicate as yes or no
df['is_duplicate'] = df['is_duplicate'].apply(lambda x: 'yes' if x == 1 else 'no')

In [29]:
x = df['words']
ggplot(df, aes(x='words', fill='is_duplicate')) + ggsize(700, 400) + \
geom_density(color='dark_green', alpha=.7) + scale_fill_discrete(guide='none') + \
theme_classic() + \
flavor_high_contrast_dark()

In [30]:
# keep only rows whose combined word count is greater than 5 and less than 67

df = df[df['words'] > 5]
df = df[df['words'] < 67]

df.shape

(402691, 5)

In [31]:
# re-plot the word-count density per class after filtering
x = df['words']
ggplot(df, aes(x='words', fill='is_duplicate')) + ggsize(700, 400) + \
geom_density(color='dark_green', alpha=.7) + scale_fill_discrete(guide='none') + \
theme_classic() + \
flavor_high_contrast_dark()

In [32]:
df['is_duplicate'] = df['is_duplicate'].apply(lambda x: 1 if x == 'yes' else 0)

In [33]:
# fill the missing values with empty strings
df = df.fillna(' ')
df.head()

Unnamed: 0,question1,question2,is_duplicate,characters,words
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,124,26
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,140,21
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,133,24
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,116,20
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,116,20


In [34]:
# preview df without the characters and words columns
# (the result is not assigned, so df itself keeps both columns)
df.drop(['characters', 'words'], axis=1)

Unnamed: 0,question1,question2,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
...,...,...,...
404285,How many keywords are there in the Racket prog...,How many keywords are there in PERL Programmin...,0
404286,Do you believe there is life after death?,Is it true that there is life after death?,1
404287,What is one coin?,What's this coin?,0
404288,What is the approx annual cost of living while...,I am having little hairfall problem but I want...,0


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
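
As a rough sketch of how several of these steps compose into one function: the stop-word set here is a tiny illustrative stand-in (the cells below use NLTK's English list), `clean_question` is a hypothetical name, and stemming is left out.

```python
import string

# Small illustrative stop-word set (assumption); not the NLTK list used in the notebook.
STOP_WORDS = {"what", "is", "the", "by", "to", "in", "a", "of", "how", "can", "i", "my"}

# Translation table that deletes punctuation and digits in a single pass.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)

def clean_question(text, stop_words=STOP_WORDS):
    """Strip punctuation and digits, lowercase, then drop stop words."""
    text = text.translate(_STRIP).lower()
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

print(clean_question("What is the step by step guide to invest in share market in India?"))
```

Using `str.translate` avoids the character-by-character list comprehension and tends to be faster on a few hundred thousand rows.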

In [35]:
import re #regular expression
import spacy
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

In [36]:
# Remove punctuation
import string
string.punctuation

def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text


In [37]:
df['q1_clean'] = df['question1'].apply(lambda x: remove_punct(x))
df['q2_clean'] = df['question2'].apply(lambda x: remove_punct(x))

In [38]:
df.drop(['question1', 'question2'], axis=1, inplace=True)

In [39]:
# Remove digits (despite the function name, letters are kept and only digits are stripped)
def remove_alphanumeric(text):
    text = ''.join([i for i in text if not i.isdigit()])
    return text

In [40]:
# apply function to data frame
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_alphanumeric(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_alphanumeric(x))

In [41]:
# Remove Stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

def remove_stopwords(text):
    text = [word for word in text.split() if word.lower() not in (stop)]
    return " ".join(text)


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/patrickokwir/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
# remove stopwords from questions
df['q1_clean'] = df['q1_clean'].apply(lambda x: remove_stopwords(x))
df['q2_clean'] = df['q2_clean'].apply(lambda x: remove_stopwords(x))

df.head(3)

Unnamed: 0,is_duplicate,characters,words,q1_clean,q2_clean
0,0,124,26,step step guide invest share market india,step step guide invest share market
1,0,140,21,story Kohinoor KohiNoor Diamond,would happen Indian government stole Kohinoor ...
2,0,133,24,increase speed internet connection using VPN,Internet speed increased hacking DNS


In [43]:
# remove unwanted spaces and convert to lower case
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

# define class to remove unwanted spaces and convert to lower case
class CleanText(object):
    def __init__(self, text):
        self.text = text
        
    def clean(self):
        # remove unwanted spaces and convert to lower case
        self.text = self.text.lower()
        self.text = self.text.strip()
        return self.text


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/patrickokwir/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [44]:
# call class to remove unwanted spaces and convert to lower case for q1_clean and q2_clean
df['q1_clean'] = df['q1_clean'].apply(lambda x: CleanText(x).clean())
df['q2_clean'] = df['q2_clean'].apply(lambda x: CleanText(x).clean())

df.head(3)

Unnamed: 0,is_duplicate,characters,words,q1_clean,q2_clean
0,0,124,26,step step guide invest share market india,step step guide invest share market
1,0,140,21,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...
2,0,133,24,increase speed internet connection using vpn,internet speed increased hacking dns


In [45]:
from copy import deepcopy


In [46]:
df_A = deepcopy(df)

In [47]:
df_A.head(3)

Unnamed: 0,is_duplicate,characters,words,q1_clean,q2_clean
0,0,124,26,step step guide invest share market india,step step guide invest share market
1,0,140,21,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...
2,0,133,24,increase speed internet connection using vpn,internet speed increased hacking dns


In [48]:
# combine q1_clean and q2_clean into a single column called "combined" using string concatenation
df_A['combined'] = df_A['q1_clean'].astype(str) +' ' + df_A['q2_clean'].astype(str)



In [49]:
df_A.head(3)

Unnamed: 0,is_duplicate,characters,words,q1_clean,q2_clean,combined
0,0,124,26,step step guide invest share market india,step step guide invest share market,step step guide invest share market india step...
1,0,140,21,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,story kohinoor kohinoor diamond would happen i...
2,0,133,24,increase speed internet connection using vpn,internet speed increased hacking dns,increase speed internet connection using vpn i...


In [50]:
# drop q1_clean and q2_clean

df_A.drop(['q1_clean', 'q2_clean'], axis=1, inplace=True)
df_A.head()

Unnamed: 0,is_duplicate,characters,words,combined
0,0,124,26,step step guide invest share market india step...
1,0,140,21,story kohinoor kohinoor diamond would happen i...
2,0,133,24,increase speed internet connection using vpn i...
3,0,116,20,mentally lonely solve find remainder mathmath ...
4,0,116,20,one dissolve water quikly sugar salt methane c...


In [51]:
df_A.to_csv('df_A.csv', index=False)

In [52]:
df

Unnamed: 0,is_duplicate,characters,words,q1_clean,q2_clean
0,0,124,26,step step guide invest share market india,step step guide invest share market
1,0,140,21,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...
2,0,133,24,increase speed internet connection using vpn,internet speed increased hacking dns
3,0,116,20,mentally lonely solve,find remainder mathmath divided
4,0,116,20,one dissolve water quikly sugar salt methane c...,fish would survive salt water
...,...,...,...,...,...
404285,0,165,27,many keywords racket programming language late...,many keywords perl programming language latest...
404286,1,84,17,believe life death,true life death
404287,0,35,7,one coin,whats coin
404288,0,222,42,approx annual cost living studying uic chicago...,little hairfall problem want use hair styling ...


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
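
To make "number of the same words in both questions" concrete, here is a minimal sketch of a shared-word count and the corresponding Jaccard ratio on already-cleaned strings. The function name `overlap_features` is my own.

```python
def overlap_features(q1, q2):
    """Count distinct words shared by two cleaned questions and their Jaccard ratio."""
    words1, words2 = set(q1.split()), set(q2.split())
    shared = words1 & words2
    union = words1 | words2
    jaccard = len(shared) / len(union) if union else 0.0  # guard against two empty questions
    return {"shared_count": len(shared), "jaccard": jaccard}

features = overlap_features(
    "step step guide invest share market india",
    "step step guide invest share market",
)
print(features)
```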

In [53]:
from collections import Counter
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



In [54]:
target = df['is_duplicate']

In [55]:
# create a dataframe with target
X = pd.DataFrame(target)

In [56]:
# calculate the length of questions and apply to X df
X['q1_len'] = df['q1_clean'].apply(lambda x: len(x))
X['q2_len'] = df['q2_clean'].apply(lambda x: len(x))

In [57]:
# calculate weight of each word in corpus
def get_weight(count, eps=10000, min_count=2):
    return 0 if count < min_count else 1 / (count + eps)

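# join all questions together
# NOTE: .str.split().astype(str) turns each token list into its printed form (brackets, quotes
# and commas included), so the keys of `weights` do not match the plain words looked up later
# in word_shares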
pairs_qs = df['q1_clean'].str.split().astype(str) + df['q2_clean'].str.split().astype(str) 
words = (" ".join(pairs_qs)).lower().split()
counts = Counter(words)
weights = {word: get_weight(count) for word, count in counts.items()}
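
For comparison, a sketch of building the word weights directly from the tokens, so that the dictionary keys are plain words. `build_weights` and the toy corpus are hypothetical; `get_weight` is copied from the cell above.

```python
from collections import Counter

def get_weight(count, eps=10000, min_count=2):
    # rare words (below min_count) get zero weight; frequent words get a small weight
    return 0 if count < min_count else 1 / (count + eps)

def build_weights(questions, eps=10000, min_count=2):
    """Count whitespace-separated tokens across all questions and weight each word."""
    counts = Counter(word for q in questions for word in str(q).lower().split())
    return {word: get_weight(c, eps, min_count) for word, c in counts.items()}

toy_corpus = ["step step guide invest", "step guide market", "coin"]
toy_weights = build_weights(toy_corpus, eps=0)  # eps=0 only to make the toy numbers readable
print(toy_weights)
```

On the real data this would be called on the values of `q1_clean` and `q2_clean` chained together.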


In [58]:
X['word_count'] = pairs_qs.apply(lambda x: len(str(x).split()))

In [59]:
# import spacy
nlp = spacy.load('en_core_web_sm')


In [60]:
# count the distinct non-stop-word, non-punctuation tokens in a question
def unique_words(text):
    doc = nlp(text)
    unique_words = {token.text for token in doc if not token.is_stop and not token.is_punct}
    return len(unique_words)

# count the distinct tokens flagged as both stop word and punctuation
# (the two flags rarely coincide, so this is usually 0; neither helper is called below)
def common_words(text):
    doc = nlp(text)
    common_words = {token.text for token in doc if token.is_stop and token.is_punct}
    return len(common_words)

In [61]:
X.head(3)

Unnamed: 0,is_duplicate,q1_len,q2_len,word_count
0,0,41,35,12
1,0,31,67,12
2,0,44,36,10


In [62]:
stopwords = nltk.corpus.stopwords.words('english')

In [63]:
%%time
nlp = spacy.load('en_core_web_sm')
stops = set(nltk.corpus.stopwords.words("english"))

def word_shares(row):
    q1_list = str(row['q1_clean']).lower().split()
    q1 = set(q1_list)
    q1words = q1.difference(stops)
    if len(q1words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    q2_list = str(row['q2_clean']).lower().split()
    q2 = set(q2_list)
    q2words = q2.difference(stops)
    if len(q2words) == 0:
        return '0:0:0:0:0:0:0:0:0'

    words_hamming = sum(1 for i in zip(q1_list, q2_list) if i[0]==i[1])/max(len(q1_list), len(q2_list))

    q1stops = q1.intersection(stops)
    q2stops = q2.intersection(stops)

    q1_2gram = set([i for i in zip(q1_list, q1_list[1:])])
    q2_2gram = set([i for i in zip(q2_list, q2_list[1:])])

    shared_2gram = q1_2gram.intersection(q2_2gram)

    shared_words = q1words.intersection(q2words)
    shared_weights = [weights.get(w, 0) for w in shared_words]
    q1_weights = [weights.get(w, 0) for w in q1words]
    q2_weights = [weights.get(w, 0) for w in q2words]
    total_weights = q1_weights + q2_weights

    R1 = np.sum(shared_weights) / np.sum(total_weights) #tfidf share
    R2 = len(shared_words) / (len(q1words) + len(q2words) - len(shared_words)) #count share
    R31 = len(q1stops) / len(q1words) #stops in q1
    R32 = len(q2stops) / len(q2words) #stops in q2
    Rcosine_denominator = (np.sqrt(np.dot(q1_weights,q1_weights))*np.sqrt(np.dot(q2_weights,q2_weights)))
    Rcosine = np.dot(shared_weights, shared_weights)/Rcosine_denominator
    if len(q1_2gram) + len(q2_2gram) == 0:
        R2gram = 0
    else:
        R2gram = len(shared_2gram) / (len(q1_2gram) + len(q2_2gram))
    
    fuzzy_match = fuzz.token_sort_ratio(q1_list, q2_list)
    
    return '{}:{}:{}:{}:{}:{}:{}:{}:{}'.format(R1, R2, len(shared_words), R31, R32, R2gram, 
                                                  Rcosine, words_hamming, fuzzy_match)

X['word_shares'] = df.apply(word_shares, axis=1)




CPU times: user 33.8 s, sys: 657 ms, total: 34.5 s
Wall time: 33.9 s
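
Ratios such as `R1` and `Rcosine` divide by a sum of weights, which yields `nan` whenever every weight involved is zero. A minimal sketch of a guarded version of the weighted share follows. The name `weighted_share` and the choice of returning 0.0 for an empty denominator are my assumptions.

```python
def weighted_share(q1_words, q2_words, weights):
    """Weight of the shared words divided by the total weight of both questions' words."""
    shared = q1_words & q2_words
    shared_weight = sum(weights.get(w, 0) for w in shared)
    total_weight = (sum(weights.get(w, 0) for w in q1_words)
                    + sum(weights.get(w, 0) for w in q2_words))
    # fall back to 0.0 instead of dividing by zero
    return shared_weight / total_weight if total_weight > 0 else 0.0

example_weights = {"a": 1, "b": 1, "c": 2}
print(weighted_share({"a", "b"}, {"b", "c"}, example_weights))
print(weighted_share({"x"}, {"y"}, example_weights))
```

Because the shared words are counted once in the numerator and twice in the denominator, this ratio tops out at 0.5 for identical questions.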


In [64]:
X.head(3)

Unnamed: 0,is_duplicate,q1_len,q2_len,word_count,word_shares
0,0,41,35,12,nan:0.8333333333333334:5:0.0:0.0:0.45454545454...
1,0,31,67,12,nan:0.2222222222222222:2:0.0:0.0:0.18181818181...
2,0,44,36,10,nan:0.2222222222222222:2:0.0:0.0:0.0:nan:0.166...


In [65]:
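# fields follow the order returned by word_shares: field 0 is R1 (weighted share) and
# field 1 is R2 (count share), so 'tfidf_word_match' actually holds the unweighted count share
X['word_match']       = X['word_shares'].apply(lambda x: float(x.split(':')[0]))
X['tfidf_word_match'] = X['word_shares'].apply(lambda x: float(x.split(':')[1]))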
X['shared_count']     = X['word_shares'].apply(lambda x: float(x.split(':')[2]))

X['stops1_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[3]))
X['stops2_ratio']     = X['word_shares'].apply(lambda x: float(x.split(':')[4]))
X['shared_2gram']     = X['word_shares'].apply(lambda x: float(x.split(':')[5]))
X['cosine']           = X['word_shares'].apply(lambda x: float(x.split(':')[6]))
X['words_hamming']    = X['word_shares'].apply(lambda x: float(x.split(':')[7]))
X['fuzzy_match']    = X['word_shares'].apply(lambda x: float(x.split(':')[8]))


X['len_word_q1'] = df['q1_clean'].apply(lambda x: len(str(x).split()))
X['len_word_q2'] = df['q2_clean'].apply(lambda x: len(str(x).split()))
X['diff_len_word'] = X['len_word_q1'] - X['len_word_q2']

In [66]:
X.head(3)

Unnamed: 0,is_duplicate,q1_len,q2_len,word_count,word_shares,word_match,tfidf_word_match,shared_count,stops1_ratio,stops2_ratio,shared_2gram,cosine,words_hamming,fuzzy_match,len_word_q1,len_word_q2,diff_len_word
0,0,41,35,12,nan:0.8333333333333334:5:0.0:0.0:0.45454545454...,,0.833333,5.0,0.0,0.0,0.454545,,0.857143,92.0,7,6,1
1,0,31,67,12,nan:0.2222222222222222:2:0.0:0.0:0.18181818181...,,0.222222,2.0,0.0,0.0,0.181818,,0.0,59.0,4,9,-5
2,0,44,36,10,nan:0.2222222222222222:2:0.0:0.0:0.0:nan:0.166...,,0.222222,2.0,0.0,0.0,0.0,,0.166667,65.0,6,5,1


In [67]:
# drop word_shares column
X.drop(['word_shares'], axis=1, inplace=True)

In [68]:
# get unique values in cosine column
X['cosine'].unique()

array([nan,  0.])

In [69]:
X.drop(['cosine'], axis=1, inplace=True)


In [70]:
X['word_match'].unique()

array([nan,  0.])

In [71]:
X.drop(['word_match'], axis=1, inplace=True)

In [72]:
X.head(3)

Unnamed: 0,is_duplicate,q1_len,q2_len,word_count,tfidf_word_match,shared_count,stops1_ratio,stops2_ratio,shared_2gram,words_hamming,fuzzy_match,len_word_q1,len_word_q2,diff_len_word
0,0,41,35,12,0.833333,5.0,0.0,0.0,0.454545,0.857143,92.0,7,6,1
1,0,31,67,12,0.222222,2.0,0.0,0.0,0.181818,0.0,59.0,4,9,-5
2,0,44,36,10,0.222222,2.0,0.0,0.0,0.0,0.166667,65.0,6,5,1


In [60]:
# export to csv
X.to_csv('clean_df.csv')

In [26]:
import pandas as pd

In [61]:
df = pd.read_csv('clean_df.csv', index_col=0)

In [62]:
# take the first 100,000 rows from df as the sample using iloc
sample = df.iloc[:100000]
sample.head(3)

Unnamed: 0,q1_len,q2_len,word_count,tfidf_word_match,shared_count,shared_2gram,words_hamming,fuzzy_match,len_word_q1,len_word_q2,diff_len_word
0,41,35,12,0.833333,5.0,0.454545,0.857143,92.0,7,6,1
1,31,67,12,0.222222,2.0,0.181818,0.0,59.0,4,9,-5
2,44,36,10,0.222222,2.0,0.0,0.166667,65.0,6,5,1


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc

# import standard scaler
from sklearn.preprocessing import StandardScaler


# import logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# import pipeline
from sklearn.pipeline import Pipeline

# import smote from imblearn
from imblearn.over_sampling import SMOTE


In [47]:
y = sample['is_duplicate']
X = sample.drop(['is_duplicate', 'stops1_ratio', 'stops2_ratio'], axis=1)

In [59]:
sm = SMOTE(random_state=42)

In [50]:
X, y = sm.fit_resample(X, y)

In [51]:
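# split the data 80:20 by position using iloc
# NOTE: fit_resample appends the synthetic minority rows after the original ones, so the last 20%
# taken here consists of oversampled class-1 rows (the XGBoost report below shows support 0 for class 0)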
y_train = y.iloc[:int(y.shape[0]*0.8)]
y_test = y.iloc[int(y.shape[0]*0.8):]
X_train = X.iloc[:int(X.shape[0]*0.8)]
X_test = X.iloc[int(X.shape[0]*0.8):]


In [52]:
y_train.shape, y_test.shape, X_train.shape, X_test.shape

((100180,), (25046,), (100180, 11), (25046, 11))
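
An alternative ordering is to shuffle and split first, then oversample only the training part, so the test set keeps real rows from both classes. The sketch below illustrates that order with plain random duplication as a stand-in for SMOTE (it is not SMOTE). `split_then_oversample` and its parameters are hypothetical.

```python
import random

def split_then_oversample(labels, test_fraction=0.2, seed=42):
    """Shuffle, hold out a test part, then balance classes inside the training part only."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    n_test = int(len(order) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    # group the training indices by class
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # duplicate random minority rows until every class matches the largest one
    target = max(len(indices) for indices in by_class.values())
    balanced = list(train_idx)
    for indices in by_class.values():
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    rng.shuffle(balanced)
    return balanced, test_idx

toy_labels = [0] * 60 + [1] * 40
balanced_train, held_out = split_then_oversample(toy_labels)
print(len(balanced_train), len(held_out))
```

With imbalanced-learn the same order would mean calling `fit_resample` on the training rows only.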

1. Test different models

In [53]:
scoring = 'accuracy'
kfold = StratifiedKFold(n_splits=10)

In [54]:
# collect the candidate models (no scaling step)
models = []
models.append(('LR', LogisticRegression()))
models.append(('RF', RandomForestClassifier()))


results = []
names = []
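# fit each model on the training split and score it on the held-out split
# (the kfold and scoring objects defined above are not used here)
for name, model in models:
    model.fit(X_train, y_train)
    results.append(accuracy_score(y_test, model.predict(X_test)))
    names.append(name)
    # note: this prints the whole running results list, not only the latest score
    msg = f'{name}: {results}'
    print(msg)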


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LR: [0.5303042401980356]
RF: [0.5303042401980356, 0.8544677792861135]


In [12]:
# import xgboost
import xgboost as xgb

In [55]:
# XGBoost classifier with default hyperparameters
Xgb = xgb.XGBClassifier()
Xgb.fit(X_train, y_train)
y_pred = Xgb.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.81      0.89     25046

    accuracy                           0.81     25046
   macro avg       0.50      0.40      0.45     25046
weighted avg       1.00      0.81      0.89     25046

[[    0     0]
 [ 4834 20212]]
0.8069951289627086


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
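
The report can be reproduced by hand from the confusion matrix, which also makes the zero support for class 0 explicit. A small sketch, assuming rows are true classes and columns are predicted classes (the scikit-learn convention); `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    n = len(matrix)
    metrics = {}
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum (support)
        metrics[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
            "support": actual,
        }
    return metrics

report = per_class_metrics([[0, 0], [4834, 20212]])
for label, values in report.items():
    print(label, values)
```

With no class-0 rows in the test set, the accuracy of about 0.807 is simply the recall on class 1.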


In [85]:
# grid search for best hyperparameters for XGBoost
def grid_search(X_train, y_train, X_test, y_test):
    """
    Grid search for best hyperparameters for XGBoost
    """
    # import XGBoost
    from xgboost import XGBClassifier
    # import GridSearchCV
    from sklearn.model_selection import GridSearchCV
    # import RandomizedSearchCV
    from sklearn.model_selection import RandomizedSearchCV

    # define XGBoost parameters
    xgb_params = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'learning_rate': [0.01, 0.05, 0.1],
        'min_child_weight': [1, 3, 5],
        'gamma': [0.5, 1, 1.5, 2],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'reg_alpha': [1e-5, 1e-4, 1e-3],
        'reg_lambda': [1e-5, 1e-4, 1e-3],
        'scale_pos_weight': [1, 3, 5],
        'objective': ['binary:logistic', 'multi:softmax'],
        'nthread': [4],
        'seed': [42]
    }
    # define XGBoost grid search
    xgb_grid = GridSearchCV(XGBClassifier(), xgb_params, cv=2, n_jobs=-1, verbose=1)
    # fit XGBoost grid search
    xgb_grid.fit(X_train, y_train)
    # print best XGBoost hyperparameters
    print('Best XGBoost hyperparameters: ', xgb_grid.best_params_)
    # print best XGBoost score
    print('Best XGBoost score: ', xgb_grid.best_score_)
    # predict on test set
    y_pred = xgb_grid.predict(X_test)
    # print classification report
    print(classification_report(y_test, y_pred))
    # print confusion matrix
    print(confusion_matrix(y_test, y_pred))

In [86]:
# grid_search(X_train, y_train, X_test, y_test)
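
Before running the search, it helps to count how many fits the grid implies. The sketch below copies the value lists from `xgb_params` and multiplies their lengths; the variable names are mine.

```python
import math

# value lists copied from xgb_params in the function above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_child_weight': [1, 3, 5],
    'gamma': [0.5, 1, 1.5, 2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [1e-5, 1e-4, 1e-3],
    'reg_lambda': [1e-5, 1e-4, 1e-3],
    'scale_pos_weight': [1, 3, 5],
    'objective': ['binary:logistic', 'multi:softmax'],
    'nthread': [4],
    'seed': [42],
}

n_combinations = math.prod(len(values) for values in param_grid.values())
n_fits = n_combinations * 2  # cv=2 folds per combination
print(n_combinations, n_fits)
```

An exhaustive search over this many combinations is unlikely to finish in reasonable time, which may be why the call is commented out; the already imported `RandomizedSearchCV` with a fixed `n_iter` would be one way to sample the grid instead.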

In [14]:
X_train.shape[1]

11

In [56]:
# Feed-forward (dense) neural network; no LSTM layer is actually added to the model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Dropout
import numpy as np
import pandas as pd
from keras import layers


model = Sequential()
model.add(Dense(110, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(100, activation='LeakyReLU'))
model.add(Dropout(0.2))
model.add(Dense(30, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 110)               1320      
                                                                 
 dropout_4 (Dropout)         (None, 110)               0         
                                                                 
 dense_7 (Dense)             (None, 500)               55500     
                                                                 
 dropout_5 (Dropout)         (None, 500)               0         
                                                                 
 dense_8 (Dense)             (None, 100)               50100     
                                                                 
 dropout_6 (Dropout)         (None, 100)               0         
                                                                 
 dense_9 (Dense)             (None, 30)               

In [57]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=120,batch_size=300, verbose=1,)

Epoch 1/120
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120

KeyboardInterrupt: 

In [23]:
# clear session to avoid clutter from old models / layers.
from keras import backend as K 
K.clear_session()

In [None]:
# Neural Net Using RAW Data

import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('df_A.csv')

In [None]:
# take sample of first 10000 rows from df sample data using iloc
sample = df.iloc[:10000]

In [None]:
sample.head()

In [None]:
X = sample['combined']
y = sample['is_duplicate']

In [None]:
y_train = y.iloc[:int(y.shape[0]*0.8)]
y_test = y.iloc[int(y.shape[0]*0.8):]
X_train = X.iloc[:int(X.shape[0]*0.8)]
X_test = X.iloc[int(X.shape[0]*0.8):]

In [None]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
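
`texts_to_sequences` returns integer lists of varying length, while the `Embedding` layer in the next model is declared with a fixed `input_length`. The plain-Python sketch below shows what padding to a fixed length looks like. `pad_post` is a hypothetical helper, and padding and truncating at the end are my choices, which differ from the defaults of Keras' own `pad_sequences`.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Truncate each sequence to maxlen and right-pad shorter ones with pad_value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                               # cut off anything past maxlen
        padded.append(seq + [pad_value] * (maxlen - len(seq))) # fill the rest with the pad index
    return padded

toy_sequences = [[1, 2, 3], [4], [5, 6, 7, 8, 9, 10]]
print(pad_post(toy_sequences, maxlen=4))
```

Index 0 is a natural pad value because the tokenizer reserves it, as the `vocab_size` comment notes.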

In [None]:
print(X_train[5])

In [None]:
from keras.models import Sequential
from keras import layers

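from keras.preprocessing.sequence import pad_sequences

embedding_dim = 50
maxlen = 100

# pad every tokenized question to maxlen so the Embedding layer receives fixed-length input
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)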

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
X_train[1]

In [None]:
history = model.fit(X_train, y_train,
                    epochs=50,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))