# NLP with Question Pairs (v2)

A machine learning sample for natural language processing (NLP).

- Intended for testing and confirming the environment setup.

## Dataset

Quora Question Pairs
> Can you identify question pairs that have the same intent?

https://www.kaggle.com/competitions/quora-question-pairs/overview

In [1]:
import pandas as pd

import contractions
from nltk.corpus import stopwords
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
pd.set_option("display.max_colwidth", 120)

In [3]:
nltk.download('stopwords')

# A set gives constant-time membership checks during stopword removal
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Load the training dataset; na_filter=False keeps empty fields as '' instead of NaN
df_train = pd.read_csv(
    './raw_data/train.csv',
    na_filter=False
)

display(df_train.head(10))

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in share market in india?,What is the step by step guide to invest in share market?,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Diamond?,What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?,0
2,2,5,6,How can I increase the speed of my internet connection while using a VPN?,How can Internet speed be increased by hacking through DNS?,0
3,3,7,8,Why am I mentally very lonely? How can I solve it?,"Find the remainder when [math]23^{24}[/math] is divided by 24,23?",0
4,4,9,10,"Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?,"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone and video games?,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Motorolla DCX3400?,How do I hack Motorola DCX3400 for free internet?,0


In [5]:
print(df_train.shape)

(404290, 6)


## Method Preparation

In [6]:
def clean_text(text: str) -> str:
    """Expand contractions, lowercase, and remove stopwords."""
    text = _clean_text_expand_contractions(text)
    text = _clean_text_lowercase_conversion(text)
    text = _clean_text_stopwords_removing(text)
    return text


def _clean_text_expand_contractions(text: str) -> str:
    """Expand contractions (e.g. "don't" becomes "do not")."""
    return contractions.fix(text)


def _clean_text_lowercase_conversion(text: str) -> str:
    """Convert text to lowercase."""
    return text.lower()


def _clean_text_stopwords_removing(text: str) -> str:
    """Remove English stopwords from whitespace-separated words."""
    words = text.split()
    words = [
        word for word in words if word not in STOPWORDS
    ]
    return ' '.join(words)
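
Because the stopword step splits on whitespace only, a stopword with punctuation attached (for example `it?`) does not match the stopword list and is kept. The sketch below illustrates this with a small hypothetical stopword set (`DEMO_STOPWORDS`, a stand-in for the NLTK list so the example runs on its own) and shows one possible alternative that separates punctuation before the lookup. The function names are illustrative and not part of the notebook.

```python
import re

# Hypothetical stand-in for the NLTK stopword list, so the sketch has no dependencies.
DEMO_STOPWORDS = {"why", "am", "i", "how", "can", "it", "very"}


def remove_stopwords_whitespace(text: str) -> str:
    """Mirror the notebook's approach: split on whitespace, then filter."""
    words = text.lower().split()
    return " ".join(word for word in words if word not in DEMO_STOPWORDS)


def remove_stopwords_tokenized(text: str) -> str:
    """Alternative: split punctuation from words before filtering."""
    # \w+ matches runs of word characters; [^\w\s] matches one punctuation character.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(tok for tok in tokens if tok not in DEMO_STOPWORDS)


sample = "Why am I mentally very lonely? How can I solve it?"
whitespace_result = remove_stopwords_whitespace(sample)
tokenized_result = remove_stopwords_tokenized(sample)

print(whitespace_result)  # mentally lonely? solve it?
print(tokenized_result)   # mentally lonely ? solve ?
```

Whether the extra tokenization is worth it depends on the downstream vectorizer; it is shown here only to make the behavior of the whitespace split visible.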

## Data Preprocessing

In [7]:
# Clean text
df_train['question1'] = df_train['question1'].apply(clean_text)
df_train['question2'] = df_train['question2'].apply(clean_text)

display(df_train.head(10))

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,step step guide invest share market india?,step step guide invest share market?,0
1,1,3,4,story kohinoor (koh-i-noor) diamond?,would happen indian government stole kohinoor (koh-i-noor) diamond back?,0
2,2,5,6,increase speed internet connection using vpn?,internet speed increased hacking dns?,0
3,3,7,8,mentally lonely? solve it?,"find remainder [math]23^{24}[/math] divided 24,23?",0
4,4,9,10,"one dissolve water quikly sugar, salt, methane carbon di oxide?",fish would survive salt water?,0
5,5,11,12,astrology: capricorn sun cap moon cap rising...what say me?,"triple capricorn (sun, moon ascendant capricorn) say me?",1
6,6,13,14,buy tiago?,keeps childern active far phone video games?,0
7,7,15,16,good geologist?,great geologist?,1
8,8,17,18,use シ instead し?,"use ""&"" instead ""and""?",0
9,9,19,20,motorola (company): hack charter motorolla dcx3400?,hack motorola dcx3400 free internet?,0


In [8]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    max_features=20_000
)

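# Fit on both question columns so the vocabulary and IDF weights cover all of the text
vectorizer.fit(pd.concat([df_train['question1'], df_train['question2']]))

question1_tfidf = vectorizer.transform(df_train['question1'])
question2_tfidf = vectorizer.transform(df_train['question2'])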
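
To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a toy corpus. The four documents are hypothetical (loosely modeled on the cleaned questions), and `max_features` is left out because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "guide invest share market india",
    "guide invest share market",
    "guide increase internet speed vpn",
    "internet speed hacking dns",
]

# Same pruning thresholds as the notebook cell, without max_features.
toy_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
toy_tfidf = toy_vectorizer.fit_transform(docs)

# "guide" appears in 3 of 4 documents (more than 50%), so max_df drops it.
# Terms such as "india" or "vpn" appear in only one document, so min_df drops them.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_tfidf.shape)
```

Only terms that occur in at least two documents and in no more than half of them survive, which on this corpus leaves five features.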

In [9]:
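# Features: element-wise signed difference between the two sparse TF-IDF matrices
X = question1_tfidf - question2_tfidf
# Target: 1 if the question pair is a duplicate, 0 otherwise
y = df_train['is_duplicate']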
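
A signed difference depends on which question is listed first, so swapping `question1` and `question2` flips the sign of every feature. As a possible alternative (not the notebook's method), the sketch below computes two order-independent features on hypothetical toy pairs: the absolute difference and the row-wise cosine similarity. It relies on `TfidfVectorizer` applying L2 normalization by default, so the row-wise dot product equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy pairs: the first pair is identical, the second shares no terms.
q1 = ["good geologist", "buy tiago"]
q2 = ["good geologist", "fish survive salt water"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(q1 + q2)
a = toy_vectorizer.transform(q1)
b = toy_vectorizer.transform(q2)

# Order-independent feature matrix: |a - b| equals |b - a|.
abs_diff = abs(a - b)
abs_diff_swapped = abs(b - a)

# Rows are L2-normalized by default, so the row-wise dot product is the cosine similarity.
cosine = np.asarray(a.multiply(b).sum(axis=1)).ravel()

print(abs_diff.shape)
print(cosine)
```

The absolute difference keeps the same shape as the original feature matrix, while the cosine similarity collapses each pair to a single number; either could be used alone or stacked with other features.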