<h1 id = "a-predictive-natural-language-processing-model"; style="color:#207d06; font-size:220%; text-align:left; border-bottom: 3px solid #207d06;">A PREDICTIVE NATURAL LANGUAGE PROCESSING MODEL </h1>

<h2 id="lighthouse-labs-mini-project-v-identifying-duplicate-questions" style="color:#8fca6b; border-bottom: 1px solid #207d06;">Lighthouse Labs, Mini-Project V: Identifying Duplicate Questions</h2>
<p id = "by-jamie-dormaar"; style="font-family:JetBrains Mono; letter-spacing: 1px; color:#8fca6b; font-size:110%; text-align:left;">By Jamie Dormaar, January 31, 2023.</p>  




Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent force seekers to spend extra time finding the best answer, and force writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


<h2 id = "table-of-contents"; style="color:#207d06; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">TABLE OF CONTENTS</h2>  



- [DOWNLOAD DATA](#download-data)
- [EXPLORATION](#exploration)
- [DATA CLEANING](#data-cleaning)
  - [CUSTOM MODULE](#custom-module)
- [PREPROCESS DATA](#preprocess-data)
- [FEATURE ENGINEERING](#feature-engineering)
  - [NLP REPRESENTATIONS](#nlp-representations)
    - [BOW](#bow)
    - [TF-IDF](#tf-idf)
- [MODELING](#modeling)
  - [SPLITTING DATA](#splitting-data)
  - [MODEL TRAINING](#model-training)
  - [MODEL EVALUATION](#model-evaluation)


##### <p style="font-family:JetBrains Mono; text-align:right; padding-right: 10%;">[Top Of Page](#lighthouse-labs-mini-project-v-identifying-duplicate-questions) / [TOC](#table-of-contents)</p>

<h2 id = "download-data"; style=" color:#207d06; text-align:left; padding:0px; border-bottom: 3px solid #207d06;">DOWNLOAD DATA</h2>

In [175]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme(style='darkgrid', context='talk')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

In [176]:
data = pd.read_csv("../data/source materials/train.csv")
df = data.copy()

>Note: _There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data._

##### SAVE: raw data

In [177]:
data.to_csv('../data/df_raw.csv', index=False)
df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [178]:
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64


<h2 id = "exploration"; style="color:#207d06; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">EXPLORATION</h2>

In [179]:
df.shape

(404290, 6)

In [180]:
q_dups = df[df['is_duplicate'] > 0].copy()
q_difs = df[df['is_duplicate']== 0].copy()

q_dups.reset_index(drop=True, inplace=True)
q_difs.reset_index(drop=True, inplace=True)

display(q_dups[['question1', 'question2']].head())
display(q_difs[['question1', 'question2']].head())

|   | question1 | question2 |
|---|---|---|
| 0 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... |
| 1 | How can I be a good geologist? | What should I do to be a great geologist? |
| 2 | How do I read and find my YouTube comments? | How can I see all my Youtube comments? |
| 3 | What can make Physics easy to learn? | How can you make physics easy to learn? |
| 4 | What was your first sexual experience like? | What was your first sexual experience? |


|   | question1 | question2 |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? |


In [181]:

tag = ''   # Optional suffix for saved file names:
# tag = '_q_difs'
# tag = '_q_dups'

# Optionally restrict df to one of the subsets:
# df = q_dups.copy()
# df = q_difs.copy()

df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [182]:
# df[df['qid1']==15]
n_q1 = df['question1'].nunique()
n_q2 = df['question2'].nunique()

print(f'question1 unique entries: {n_q1}\n       Fraction of total: {round(n_q1/df.shape[0], 2)}\n')
print(f'question2 unique entries: {n_q2}\n       Fraction of total: {round(n_q2/df.shape[0], 2)}\n')

question1 unique entries: 290456
       Fraction of total: 0.72

question2 unique entries: 299174
       Fraction of total: 0.74




<h2 id = "data-cleaning" style="color:#207d06; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">DATA CLEANING</h2>

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [183]:
import numpy as np
import pandas as pd
import re # regex
import nltk


<h3 
  id = "custom-module" 
  style="
    color: #8fca6b; 
    text-align: left;
    padding: 0px; 
    border-bottom: 2px solid #8fca6b;
    "
  >
  CUSTOM MODULE
</h3>

The following functions are contained within a custom module created for modular reuse.

>```py
>import numpy as np
>import pandas as pd
>import re # regex
>import string
>import nltk
>from nltk.tokenize import word_tokenize
>from nltk.corpus import stopwords
>from nltk.stem.porter import PorterStemmer
>
>def tokenize(text):
>    """Split a string into word tokens."""
>    tokens = word_tokenize(text)
>    return tokens
>
>def alphabetic_filter(stripped):
>    """Keep only purely alphabetic tokens."""
>    words = [word for word in stripped if word.isalpha()]
>    return words
>
>def stemming(words):
>    """Reduce each token to its Porter stem."""
>    ps = PorterStemmer()
>    stemmed = [ps.stem(word) for word in words]
>    return stemmed
>
>def remove_stopwords(words):
>    """Drop English stopwords."""
>    stop_words = set(stopwords.words('english'))
>    clean_text = [w for w in words if w not in stop_words]
>    return clean_text
>
>def preprocess(text, output_list=True):
>    """Tokenize, filter, stem, and remove stopwords.
>
>    Returns a list of tokens, or a space-joined string if output_list is False.
>    """
>    text = tokenize(text)
>    text = alphabetic_filter(text)
>    # Stemming runs before stopword removal, so stopwords that the stemmer
>    # alters (e.g. 'why' -> 'whi', 'very' -> 'veri') are not removed.
>    text = stemming(text)
>    text = remove_stopwords(text)
>    if not output_list:
>        text = " ".join(text)
>    return text
>
>def text_cleaning_regex(features):
>    """function takes in an iterable of string data types, and using
>    regex module successively:
>        - Removes all the special characters.
>        - Removes all single characters.
>        - Removes single characters from the start.
>        - Substitutes multiple spaces with single space.
>        - Removes a leading 'b' prefix.
>        - Converts all text to lowercase.
>
>    Args:
>        features (iterable): an array or list. Example: df['col']
>
>    Returns:
>        list: the cleaned strings, in the same order as the input.
>    """
>    processed_features = []
>    # Iterate over the values directly so non-default indexes also work
>    for sentence in features:
>        # Remove all the special characters
>        alt_text = re.sub(r'\W', ' ', str(sentence))
>
>        # remove all single characters
>        alt_text = re.sub(r'\s+[a-zA-Z]\s+', ' ', alt_text)
>
>        # Remove single characters from the start ('^' anchors at the start)
>        alt_text = re.sub(r'^[a-zA-Z]\s+', ' ', alt_text)
>
>        # Substituting multiple spaces with single space
>        alt_text = re.sub(r'\s+', ' ', alt_text)
>
>        # Removing prefixed 'b'
>        alt_text = re.sub(r'^b\s+', '', alt_text)
>
>        # Converting to Lowercase
>        alt_text = alt_text.lower()
>        processed_features.append(alt_text)
>
>    print('''
>    Removed special characters:
>    Removed single characters:
>    Standardization to lowercase:\n''')
>    return processed_features
>```

The module is imported like any other package. For example:
>```py
>import nlp_data_prep as nlp
>nlp.text_cleaning_regex(...)
>```
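
To make the effect of each regex step concrete, here is a minimal standalone sketch that applies the same sequence of substitutions to a single string. The function name `clean_one` is hypothetical and is not part of `nlp_data_prep`; the sketch also assumes the leading-character step is meant to anchor at the start of the string with `^`.

```python
import re

def clean_one(sentence):
    """Apply the regex cleaning steps to one string (hypothetical helper)."""
    text = re.sub(r'\W', ' ', str(sentence))      # special characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    text = re.sub(r'^b\s+', '', text)             # drop a leading 'b' prefix
    return text.lower()

raw = "What is the story of Kohinoor (Koh-i-Noor) Diamond?"
cleaned = clean_one(raw)
print(repr(cleaned))
```

The trailing space in the result comes from the final `?` being replaced by a space; a `.strip()` could be added if that matters downstream.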

[W9D2m4](../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D2m4_kaggle_sentiment-analysis-airline-tweets.ipynb)


In [184]:
import nlp_data_prep as nlp

In [185]:
q1 = df['question1']
q2 = df['question2']
display(df.head())

# CLEAN: Question1 Column
df['question1'] = nlp.text_cleaning_regex(q1)
display(df.head())

# CLEAN: Question2 Column
df['question2'] = nlp.text_cleaning_regex(q2)
df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |



    Removed special characters:
    Removed single characters:
    Standardization to lowercase:



|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | what is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | what is the story of kohinoor koh noor diamond | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | how can increase the speed of my internet conn... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | why am mentally very lonely how can solve it | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | which one dissolve in water quikly sugar salt ... | Which fish would survive in salt water? | 0 |



    Removed special characters:
    Removed single characters:
    Standardization to lowercase:



|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | what is the step by step guide to invest in sh... | what is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | what is the story of kohinoor koh noor diamond | what would happen if the indian government sto... | 0 |
| 2 | 2 | 5 | 6 | how can increase the speed of my internet conn... | how can internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | why am mentally very lonely how can solve it | find the remainder when math 23 24 math is div... | 0 |
| 4 | 4 | 9 | 10 | which one dissolve in water quikly sugar salt ... | which fish would survive in salt water | 0 |


##### SAVE: clean data

In [186]:
# SAVE: DataFrame stages:
df_clean = df.copy()
df_clean.to_csv(f'../data/df_clean{tag}.csv', index=False)



## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">PREPROCESS DATA</p>

In [187]:
df['question1'] = df['question1'].apply(lambda x: nlp.preprocess(x, output_list=False))
nlp.preprocess_report('question1')
display(df.head())

df['question2'] = df['question2'].apply(lambda x: nlp.preprocess(x, output_list=False))
nlp.preprocess_report('question2')
display(df.head())



question1:
    Tokenized text data:
    Stemmed root words:
    Removed non-alphabetic:
    Removed English stopwords:
    


|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | what is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | what would happen if the indian government sto... | 0 |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | how can internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find the remainder when math 23 24 math is div... | 0 |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | which fish would survive in salt water | 0 |



question2:
    Tokenized text data:
    Stemmed root words:
    Removed non-alphabetic:
    Removed English stopwords:
    


|   | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | step step guid invest share market | 0 |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | would happen indian govern stole kohinoor koh ... | 0 |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | internet speed increas hack dn | 0 |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find remaind math math divid | 0 |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | fish would surviv salt water | 0 |


In [188]:
print(f'current df shape:\n{df.shape}')
df.dropna(inplace=True)
print(f'In case cleaning created nulls, the new shape:\n{df.shape}')
df.drop_duplicates(inplace=True)
print(f'shape after dropping duplicate values:\n{df.shape}')

current df shape:
(404290, 6)
In case cleaning created nulls, the new shape:
(404290, 6)
shape after dropping duplicate values:
(404290, 6)


##### SAVE: preprocessed data

In [189]:
df_preprocessed = df.copy()
df.to_csv(f'../data/df_preprocessed{tag}.csv', index=False)

In [190]:
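# Add two new columns with the text features in list format.
# Note: preprocess() is applied here to text that is already preprocessed, so
# tokens are stemmed a second time (e.g. 'increas' becomes 'increa' in the output below).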
df['q1_list'] = df['question1'].apply(lambda x: nlp.preprocess(x, output_list=True))
display(df.head())

df['q2_list'] = df['question2'].apply(lambda x: nlp.preprocess(x, output_list=True))
display(df.head())

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_list |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | step step guid invest share market | 0 | [step, step, guid, invest, share, market, india] |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | would happen indian govern stole kohinoor koh ... | 0 | [stori, kohinoor, koh, noor, diamond] |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | internet speed increas hack dn | 0 | [increa, speed, internet, connect, use, vpn] |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find remaind math math divid | 0 | [whi, mental, veri, lone, solv] |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | fish would surviv salt water | 0 | [one, dissolv, water, quikli, sugar, salt, met... |


|   | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_list | q2_list |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | step step guid invest share market | 0 | [step, step, guid, invest, share, market, india] | [step, step, guid, invest, share, market] |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | would happen indian govern stole kohinoor koh ... | 0 | [stori, kohinoor, koh, noor, diamond] | [would, happen, indian, govern, stole, kohinoo... |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | internet speed increas hack dn | 0 | [increa, speed, internet, connect, use, vpn] | [internet, speed, increa, hack, dn] |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find remaind math math divid | 0 | [whi, mental, veri, lone, solv] | [find, remaind, math, math, divid] |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | fish would surviv salt water | 0 | [one, dissolv, water, quikli, sugar, salt, met... | [fish, would, surviv, salt, water] |



## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">FEATURE ENGINEERING</p>

### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#8fca6b; font-size:100%; text-align:left;padding: 0px; border-bottom: 2px solid #8fca6b;">NLP REPRESENTATIONS</p>


In [191]:
df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_list | q2_list |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | step step guid invest share market | 0 | [step, step, guid, invest, share, market, india] | [step, step, guid, invest, share, market] |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | would happen indian govern stole kohinoor koh ... | 0 | [stori, kohinoor, koh, noor, diamond] | [would, happen, indian, govern, stole, kohinoo... |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | internet speed increas hack dn | 0 | [increa, speed, internet, connect, use, vpn] | [internet, speed, increa, hack, dn] |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find remaind math math divid | 0 | [whi, mental, veri, lone, solv] | [find, remaind, math, math, divid] |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | fish would surviv salt water | 0 | [one, dissolv, water, quikli, sugar, salt, met... | [fish, would, surviv, salt, water] |


In [192]:
# Convert to list
df_clean_question1 = df_clean.question1.values.tolist()
df_clean_question2 = df_clean.question2.values.tolist()
question1 = df['question1'].values.tolist()
question2 = df['question2'].values.tolist()
q1_list = df['q1_list'].values.tolist()
q2_list = df['q2_list'].values.tolist()
question1[:5]
q1_list[:5]

[['step', 'step', 'guid', 'invest', 'share', 'market', 'india'],
 ['stori', 'kohinoor', 'koh', 'noor', 'diamond'],
 ['increa', 'speed', 'internet', 'connect', 'use', 'vpn'],
 ['whi', 'mental', 'veri', 'lone', 'solv'],
 ['one',
  'dissolv',
  'water',
  'quikli',
  'sugar',
  'salt',
  'methan',
  'carbon',
  'di',
  'oxid']]

In [193]:
from nltk.stem import WordNetLemmatizer

def lemmatize_verbs(text):
    """Lemmatize verbs in list of tokenized text"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in text:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

In [194]:
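# Note: each element passed here is a whole question string rather than a
# single token, so the lemmatizer treats it as one unknown word and returns
# it unchanged (see the outputs below).
lemma_df_clean_question1 = lemmatize_verbs(df_clean_question1)
lemma_df_clean_question2 = lemmatize_verbs(df_clean_question2)
lemma_question1 = lemmatize_verbs(question1)
lemma_question2 = lemmatize_verbs(question2)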

In [195]:
lemma_df_clean_question1[:5]

['what is the step by step guide to invest in share market in india ',
 'what is the story of kohinoor koh noor diamond ',
 'how can increase the speed of my internet connection while using vpn ',
 'why am mentally very lonely how can solve it ',
 'which one dissolve in water quikly sugar salt methane and carbon di oxide ']

In [196]:
lemma_question1[:5]

['step step guid invest share market india',
 'stori kohinoor koh noor diamond',
 'increas speed internet connect use vpn',
 'whi mental veri lone solv',
 'one dissolv water quikli sugar salt methan carbon di oxid']



#### <p id="bow" style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#c8d43e; font-size:100%; text-align:left;padding: 0px; border-bottom: 1px solid #c8d43e;">BAG OF WORDS (BOW)</p>

In [197]:
from sklearn.feature_extraction.text import CountVectorizer # BOW

* Instantiate [CountVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [198]:
cv = CountVectorizer()

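* Apply the `fit_transform` method of CountVectorizer to the tokens of the first question (`q1_list[0]`) and store the result in `word_count_vector`. Because the input is a list of tokens, each token is treated as its own document, which gives one row per token.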

In [199]:
word_count_vector = cv.fit_transform(q1_list[0])
word_count_vector

<7x6 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [200]:
print(type(word_count_vector))
# scipy.sparse._csr.csr_matrix
word_count_vector.toarray()

<class 'scipy.sparse._csr.csr_matrix'>


array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0]])

In [201]:
feature_names = cv.get_feature_names_out()

In [202]:
X = pd.DataFrame(word_count_vector.toarray(), columns=feature_names)
display(X)
X.sum().to_dict()

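|   | guid | india | invest | market | share | step |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 | 0 | 0 |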


{'guid': 1, 'india': 1, 'invest': 1, 'market': 1, 'share': 1, 'step': 2}

In [203]:
def bow_vector_matrix(text, cv):
    bow_matrix = cv.fit_transform(text)
    X = pd.DataFrame(bow_matrix.toarray(), columns = cv.get_feature_names_out())
    # display(X)
    return X.sum().to_dict()

<h4 id="tf-idf" style="color:#c8d43e;text-align:left;padding: 0px; border-bottom: 1px solid #c8d43e;">TF-IDF</h4>



In [204]:
from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf = TfidfTransformer(use_idf = True, smooth_idf = True)
vect = TfidfVectorizer()

In [205]:
def tfidf_vector_matrix(text, vect):
    # Use the vectorizer passed in by the caller instead of overwriting it
    tfidf_matrix = vect.fit_transform(text)
    X = pd.DataFrame(tfidf_matrix.toarray(), columns = vect.get_feature_names_out())
    # display(X)
    return X.sum().to_dict()

In [207]:
# tf_idf_data = tfidf.fit_transform(word_count_vector).toarray()
# tf_idf_data = vect.fit_transform(word_count_vector).toarray()
# tf_idf_data



<h2 id = "modeling"; style="color:#207d06; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">MODELING</h2>

### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#8fca6b; font-size:100%; text-align:left;padding: 0px; border-bottom: 2px solid #8fca6b;">SPLITTING DATA</p>


In [208]:
df.reset_index(drop=True, inplace=True)
df.dropna(inplace=True)
df.head()

|   | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_list | q2_list |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | step step guid invest share market india | step step guid invest share market | 0 | [step, step, guid, invest, share, market, india] | [step, step, guid, invest, share, market] |
| 1 | 1 | 3 | 4 | stori kohinoor koh noor diamond | would happen indian govern stole kohinoor koh ... | 0 | [stori, kohinoor, koh, noor, diamond] | [would, happen, indian, govern, stole, kohinoo... |
| 2 | 2 | 5 | 6 | increas speed internet connect use vpn | internet speed increas hack dn | 0 | [increa, speed, internet, connect, use, vpn] | [internet, speed, increa, hack, dn] |
| 3 | 3 | 7 | 8 | whi mental veri lone solv | find remaind math math divid | 0 | [whi, mental, veri, lone, solv] | [find, remaind, math, math, divid] |
| 4 | 4 | 9 | 10 | one dissolv water quikli sugar salt methan car... | fish would surviv salt water | 0 | [one, dissolv, water, quikli, sugar, salt, met... | [fish, would, surviv, salt, water] |


In [209]:
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64

In [212]:
from sklearn.model_selection import train_test_split

clean_text = df['question1'] + ' ' + df['question2']
labels = df['is_duplicate']

X_train, X_test, y_train, y_test = train_test_split(
                                          clean_text
                                        , labels
                                        , test_size = 0.2
                                        , random_state = 17
                                        , shuffle=True
                                    )
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

X_train shape: (323432,)
X_test shape: (80858,)
y_train shape: (323432,)
y_test shape: (80858,)


In [222]:
y_train.value_counts()

0    203778
1    119654
Name: is_duplicate, dtype: int64
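
The split above shuffles the rows but does not stratify, so the class ratio in the train and test sets is only approximately that of the full data. If an exact match is wanted, `train_test_split` accepts a `stratify` argument. The sketch below shows the idea on made-up data (`toy_texts` and `toy_labels` are hypothetical placeholders, not the Quora data).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 non-duplicates and 40 duplicates
toy_texts = [f"question pair {i}" for i in range(100)]
toy_labels = [0] * 60 + [1] * 40

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=17,
    stratify=toy_labels,   # keep the class ratio the same in both subsets
)

train_share = sum(ys_train) / len(ys_train)
test_share = sum(ys_test) / len(ys_test)
print(train_share, test_share)
```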



<h3 id= "model-training" style="color:#8fca6b; text-align:left;padding: 0px; border-bottom: 2px solid #8fca6b;">MODEL TRAINING</h3>

In [226]:
# Import the random forest classifier and configure it.
from sklearn.ensemble import RandomForestClassifier as rfc

rf_clf = rfc(
      n_estimators = 10
    , random_state = 0
)

# Define the machine learning pipeline
rf_clf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', rf_clf)
])
rf_clf_pipe

In [227]:
# Fit the machine learning model to the training data
rf_clf_pipe.fit(X_train, y_train)

In [230]:
# Define the machine learning pipeline
svc_clf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])
svc_clf_pipe

In [231]:
# Fit the machine learning model to the training data
svc_clf_pipe.fit(X_train, y_train)
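
Wrapping the vectorizer and the classifier in a `Pipeline` means that `fit` learns the TF-IDF vocabulary from the training text only, and `predict` accepts raw strings directly. The following is a small sketch of that behaviour; the toy strings and labels are invented for illustration and say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data: concatenated question pairs and labels
toy_X = [
    "step guid invest share market step guid invest share market",
    "fish surviv salt water fish live salt water",
    "learn physic easi make physic easi learn",
    "speed internet vpn mental lone solv",
    "good geologist great geologist",
    "stori diamond find remaind math",
]
toy_y = [1, 1, 1, 0, 1, 0]

toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),   # text -> sparse TF-IDF features
    ('clf', LinearSVC()),           # linear classifier on those features
])
toy_pipe.fit(toy_X, toy_y)

# Raw strings go straight into predict; unseen words are simply ignored
toy_pred = toy_pipe.predict(["salt water fish salt water", "internet speed math"])
print(toy_pred)
```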

### <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#8fca6b; font-size:100%; text-align:left;padding: 0px; border-bottom: 2px solid #8fca6b;">MODEL EVALUATION</p>

In [241]:
# Evaluate the machine learning model on the test data
rf_pred = rf_clf_pipe.predict(X_test)
rf_accuracy = np.mean(rf_pred == y_test)

print(f'rf_clf_pipe Accuracy: {rf_accuracy}')

rf_clf_pipe Accuracy: 0.7981770511266665


In [243]:
# Evaluate the machine learning model on the test data
svc_pred = svc_clf_pipe.predict(X_test)
svc_accuracy = np.mean(svc_pred == y_test)

print(f'svc_clf_pipe Accuracy: {svc_accuracy}')

svc_clf_pipe Accuracy: 0.7530980236958619


In [244]:
from sklearn.metrics import (
      classification_report
    , confusion_matrix
    , accuracy_score
)

Random Forest predictions:

In [249]:
print(f' random_forest_confusion_matrix:\n------------------------------------------------------')
print(f' {confusion_matrix(y_test, rf_pred)}\n\n')
print(f' random_forest_classification_report:\n------------------------------------------------------')
print(f' {classification_report(y_test, rf_pred)}\n\n')
print(f' random_forest_accuracy_score:\n------------------------------------------------------')
print(f' {accuracy_score(y_test, rf_pred)}\n\n')

 random_forest_confusion_matrix:
------------------------------------------------------
 [[45555  5694]
 [10625 18984]]


 random_forest_classification_report:
------------------------------------------------------
               precision    recall  f1-score   support

           0       0.81      0.89      0.85     51249
           1       0.77      0.64      0.70     29609

    accuracy                           0.80     80858
   macro avg       0.79      0.77      0.77     80858
weighted avg       0.80      0.80      0.79     80858



 random_forest_accuracy_score:
------------------------------------------------------
 0.7981770511266665
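
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below uses the random forest matrix printed above (rows are true labels, columns are predicted labels) and also computes the accuracy of always predicting the majority class, as a reference point. The variable names are illustrative.

```python
import numpy as np

# Random forest confusion matrix from the output above
cm = np.array([[45555, 5694],
               [10625, 18984]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_dup = tp / (tp + fp)      # of predicted duplicates, share that are correct
recall_dup = tp / (tp + fn)         # of true duplicates, share that are found
f1_dup = 2 * precision_dup * recall_dup / (precision_dup + recall_dup)

# Accuracy of always predicting class 0 on this test set
majority_baseline = cm[0].sum() / cm.sum()

print(round(accuracy, 4), round(precision_dup, 2), round(recall_dup, 2), round(f1_dup, 2))
print(round(majority_baseline, 2))
```

Comparing the model accuracy with the majority-class figure gives a sense of how much the classifier improves on a trivial predictor.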




Support Vector Classifier predictions:

In [251]:
print(f' Support_Vector_confusion_matrix:\n------------------------------------------------------')
print(f' {confusion_matrix(y_test, svc_pred)}\n\n')
print(f' Support_Vector_classification_report:\n------------------------------------------------------')
print(f' {classification_report(y_test, svc_pred)}\n\n')
print(f' Support_Vector_accuracy_score:\n------------------------------------------------------')
print(f' {accuracy_score(y_test, svc_pred)}\n\n')

 Support_Vector_confusion_matrix:
------------------------------------------------------
 [[43249  8000]
 [11964 17645]]


 Support_Vector_classification_report:
------------------------------------------------------
               precision    recall  f1-score   support

           0       0.78      0.84      0.81     51249
           1       0.69      0.60      0.64     29609

    accuracy                           0.75     80858
   macro avg       0.74      0.72      0.73     80858
weighted avg       0.75      0.75      0.75     80858



 Support_Vector_accuracy_score:
------------------------------------------------------
 0.7530980236958619




## <p style="font-family:JetBrains Mono; font-weight:normal; letter-spacing: 1px; color:#207d06; font-size:100%; text-align:left;padding: 0px; border-bottom: 3px solid #207d06;">REFERENCES</p>


1. Simplifying sentiment analysis python. By Sayak Paul, datacamp.com Tutorials. Page titled "Python Sentiment Analysis Tutorial". May 2021. Sourced Feb 2023. [Link](https://www.datacamp.com/tutorial/simplifying-sentiment-analysis-python).

Compass LHL DS Bootcamp references - _links are valid only with my personal file directory_  
1. [W9D1m4](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1m4_data_prep_NLP.ipynb)  
1. [W9D1m5](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1m5_NLP_data_preparation_exercise.ipynb)  
1. [W9D1m8L](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1L_NLP-Pre-Processing_Vectorizing.ipynb)  
1. [W9D1m8](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1m8_representations_bag_of_words.ipynb)  
1. [W9D1m10](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1m10_word2vec_from_google.ipynb)  
1. [W9D1m12](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D1m12_NLP_representations_exercise.ipynb)  
1. [W9D2m2](../../../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D2m2_sentiment_analysis.ipynb)  
1. [W9D2m4](../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D2m4_kaggle_sentiment-analysis-airline-tweets.ipynb)  
1. [W9D2m7](../../../Documents/LHL%20DS%20Bootcamp/Course%20work/W9/W9D2m7_topic_modeling.ipynb)  