# NLP with Question Pairs (v2)

A sample machine learning notebook for Natural Language Processing (NLP).

- Intended for testing and confirming the environment setup.

## Dataset

Quora Question Pairs
> Can you identify question pairs that have the same intent?

https://www.kaggle.com/competitions/quora-question-pairs/overview

In [1]:
import pandas as pd
import numpy as np

import os
from joblib import dump, load

import contractions 
from nltk.corpus import stopwords
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score

In [2]:
pd.set_option("display.max_colwidth", 120)

In [3]:
nltk.download('stopwords')

# A set gives constant-time membership tests in the stopword filter
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# Load Train Dataset
df_train = pd.read_csv(
    './raw_data/train.csv',
    na_filter=False
)

display(df_train.head(10))

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in share market in india?,What is the step by step guide to invest in share market?,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Diamond?,What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?,0
2,2,5,6,How can I increase the speed of my internet connection while using a VPN?,How can Internet speed be increased by hacking through DNS?,0
3,3,7,8,Why am I mentally very lonely? How can I solve it?,"Find the remainder when [math]23^{24}[/math] is divided by 24,23?",0
4,4,9,10,"Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?,"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone and video games?,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Motorolla DCX3400?,How do I hack Motorola DCX3400 for free internet?,0


In [5]:
print(df_train.shape)

(404290, 6)


## Helper Functions

In [6]:
def clean_text(text: str) -> str:
    """Expand contractions, lowercase the text, and remove stopwords."""
    text = _clean_text_expand_contractions(text)
    text = _clean_text_lowercase_conversion(text)
    text = _clean_text_stopwords_removing(text)
    return text


def _clean_text_expand_contractions(text: str) -> str:
    """Expand contractions (for example, "don't" becomes "do not")."""
    return contractions.fix(text)


def _clean_text_lowercase_conversion(text: str) -> str:
    """Convert the text to lower case."""
    return text.lower()


def _clean_text_stopwords_removing(text: str) -> str:
    """Remove whitespace-separated tokens that exactly match a stopword."""
    words = text.split()
    words = [
        word for word in words if word not in STOPWORDS
    ]
    return ' '.join(words)


def load_or_dump_built_model(
    model: BaseEstimator,
    model_name: str,
    X_train_data: np.ndarray,
    y_train_data: np.ndarray,
) -> BaseEstimator:
    """
    Load a previously saved model, or train and save a new model if not already present.

    Note: a saved model is loaded as-is, even if the training data has changed
    since it was saved. Delete the file to force retraining.

    Args:
        model (BaseEstimator): An unfitted scikit-learn model.
        model_name (str): The name of the model, used for file naming.
        X_train_data (np.ndarray): Training data features.
        y_train_data (np.ndarray): Training data labels.

    Returns:
        BaseEstimator: Trained model instance.
    """
    model_file_path = f'./models/{model_name}.joblib'

    # If the model file exists, load it
    if os.path.exists(model_file_path):
        model = load(model_file_path)
    else:
        model.fit(X_train_data, y_train_data)
        # Make sure the target directory exists, then save the trained model
        os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
        dump(model, model_file_path)

    return model


def evaluate_trained_model(
    model: BaseEstimator,
    X_test_data: list,
    y_test_data: list
) -> None:
    """Evaluate a trained machine learning model using several metrics.

    This function prints:
    - Accuracy Score: The fraction of class labels predicted correctly.
    - Precision Score: The fraction of items predicted as positive (is_duplicate = 1) that are actually positive.
    - Confusion Matrix: Counts laid out as [[TN, FP], [FN, TP]], with rows for true labels and columns for predicted labels.
    - Classification Report: Precision, Recall, F1-score, and Support for each class.

    Args:
        model: Trained machine learning model.
        X_test_data: Features to evaluate on.
        y_test_data: True labels for X_test_data.
    """
    y_pred = model.predict(X_test_data)

    print(f"Evaluation: {model.__class__.__name__}\n")  
    print("Accuracy:", accuracy_score(y_test_data, y_pred))
    print("Precision:", precision_score(y_test_data, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test_data, y_pred))
    print("Classification Report:\n", classification_report(y_test_data, y_pred))

## Data Preprocessing

In [7]:
# Clean text
df_train['question1'] = df_train['question1'].apply(clean_text)
df_train['question2'] = df_train['question2'].apply(clean_text)

display(df_train.head(10))

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,step step guide invest share market india?,step step guide invest share market?,0
1,1,3,4,story kohinoor (koh-i-noor) diamond?,would happen indian government stole kohinoor (koh-i-noor) diamond back?,0
2,2,5,6,increase speed internet connection using vpn?,internet speed increased hacking dns?,0
3,3,7,8,mentally lonely? solve it?,"find remainder [math]23^{24}[/math] divided 24,23?",0
4,4,9,10,"one dissolve water quikly sugar, salt, methane carbon di oxide?",fish would survive salt water?,0
5,5,11,12,astrology: capricorn sun cap moon cap rising...what say me?,"triple capricorn (sun, moon ascendant capricorn) say me?",1
6,6,13,14,buy tiago?,keeps childern active far phone video games?,0
7,7,15,16,good geologist?,great geologist?,1
8,8,17,18,use シ instead し?,"use ""&"" instead ""and""?",0
9,9,19,20,motorola (company): hack charter motorolla dcx3400?,hack motorola dcx3400 free internet?,0
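
The cleaned rows above still carry punctuation (`india?`, `it?`) because stopwords are removed by exact match on whitespace-separated tokens. A minimal sketch of that behaviour, assuming a small hand-picked stopword set instead of the NLTK list and skipping contraction expansion:

```python
# Hypothetical, hand-picked subset of English stopwords (the notebook uses NLTK's list).
DEMO_STOPWORDS = {
    "what", "is", "the", "by", "to", "in", "it",
    "how", "can", "i", "why", "am", "very",
}


def demo_clean(text):
    """Lowercase the text and drop tokens that exactly match a stopword."""
    words = text.lower().split()
    return " ".join(word for word in words if word not in DEMO_STOPWORDS)


cleaned_1 = demo_clean("What is the step by step guide to invest in share market in india?")
cleaned_2 = demo_clean("Why am I mentally very lonely? How can I solve it?")
print(cleaned_1)  # step step guide invest share market india?
print(cleaned_2)  # mentally lonely? solve it?
```

The token `it?` survives because it is not equal to the stopword `it`. With its default settings, `TfidfVectorizer` lowercases its input and tokenizes on runs of two or more word characters, so the trailing punctuation is discarded at the vectorization step anyway.
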


In [8]:
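# TF-IDF Vectorization
# max_df=0.5 drops terms found in more than half of the documents,
# min_df=2 drops terms found in fewer than two documents, and
# max_features keeps the 20,000 most frequent remaining terms.
# Note: the vocabulary is fitted on question1 only; question2 reuses it.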
vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    max_features=20_000
)

question1_tfidf = vectorizer.fit_transform(df_train['question1'])
question2_tfidf = vectorizer.transform(df_train['question2'])
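
Because `fit_transform` is called only on `question1`, the vocabulary is learned from that column alone, and `transform` ignores any term it has not seen. A rough pure-Python sketch of that effect with toy sentences (it counts terms only and leaves out the IDF weighting, the normalization, and the `max_df`/`min_df`/`max_features` filters; the helper names are hypothetical):

```python
def fit_vocabulary(documents):
    """Map each distinct token in the fitting documents to a column index."""
    tokens = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def count_vector(document, vocabulary):
    """Count known tokens; tokens outside the vocabulary are dropped."""
    vector = [0] * len(vocabulary)
    for token in document.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# Toy stand-ins for the two question columns.
question1_docs = ["good geologist", "buy tiago"]
question2_docs = ["great geologist", "keeps children active"]

vocabulary = fit_vocabulary(question1_docs)  # learned from question1 only
vectors_q2 = [count_vector(doc, vocabulary) for doc in question2_docs]

print(vocabulary)  # {'buy': 0, 'geologist': 1, 'good': 2, 'tiago': 3}
print(vectors_q2)  # [[0, 1, 0, 0], [0, 0, 0, 0]]
```

The second toy question maps to an all-zero vector because none of its words occur in the first column. Fitting on both columns together would be one way to cover them, but that would be a change to the notebook, and the scores reported here were obtained without it.
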

In [9]:
# Feature: signed element-wise difference of the two TF-IDF vectors
X = question1_tfidf - question2_tfidf
y = df_train['is_duplicate']
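
The feature matrix is the signed, element-wise difference between the two TF-IDF vectors of each pair. A toy sketch with made-up weights shows two properties of this choice: identical questions map to the zero vector, and swapping the two questions flips every sign. The absolute difference at the end is shown only for comparison and is not what the notebook uses.

```python
def signed_difference(vec_a, vec_b):
    """Element-wise a - b, mirroring X = question1_tfidf - question2_tfidf."""
    return [a - b for a, b in zip(vec_a, vec_b)]


def absolute_difference(vec_a, vec_b):
    """Element-wise |a - b|: order-independent (not used in the notebook)."""
    return [abs(a - b) for a, b in zip(vec_a, vec_b)]


# Made-up TF-IDF weights over a three-term vocabulary.
q1 = [0.5, 0.0, 0.25]
q2 = [0.5, 0.75, 0.0]

forward = signed_difference(q1, q2)
backward = signed_difference(q2, q1)
same = signed_difference(q1, q1)

print(forward)   # [0.0, -0.75, 0.25]
print(backward)  # [0.0, 0.75, -0.25]
print(same)      # [0.0, 0.0, 0.0]
print(absolute_difference(q1, q2) == absolute_difference(q2, q1))  # True
```

With signed differences, near-duplicate pairs sit close to the origin while non-duplicates can lie in any direction around it, which a single linear boundary cannot separate well. This may be one reason a linear model struggles on these features.
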

## Model Building

In [10]:
# Model Building: split data
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

In [11]:
# Model Building and Evaluation: Logistic Regression model
model_lr = load_or_dump_built_model(
    LogisticRegression(max_iter=1000),
    'v2_lr',
    X_train,
    y_train
)
evaluate_trained_model(
    model_lr,
    X_val,
    y_val
)

Evaluation: LogisticRegression

Accuracy: 0.6117885676123574
Precision: 0.40779113137173645
Confusion Matrix:
 [[46516  4287]
 [27103  2952]]
Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.92      0.75     50803
           1       0.41      0.10      0.16     30055

    accuracy                           0.61     80858
   macro avg       0.52      0.51      0.45     80858
weighted avg       0.55      0.61      0.53     80858



In [12]:
# Model Building and Evaluation: Random Forest
model_rf = load_or_dump_built_model(
    RandomForestClassifier(),
    'v2_rf',
    X_train,
    y_train
)
evaluate_trained_model(
    model_rf,
    X_val,
    y_val
)

Evaluation: RandomForestClassifier

Accuracy: 0.7909545128496871
Precision: 0.739283894913034
Confusion Matrix:
 [[43638  7165]
 [ 9738 20317]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.86      0.84     50803
           1       0.74      0.68      0.71     30055

    accuracy                           0.79     80858
   macro avg       0.78      0.77      0.77     80858
weighted avg       0.79      0.79      0.79     80858



In [13]:
# Model Building and Evaluation: K-Nearest Neighbors
model_knb = load_or_dump_built_model(
    KNeighborsClassifier(),
    'v2_knb',
    X_train,
    y_train
)
evaluate_trained_model(
    model_knb,
    X_val,
    y_val
)

Evaluation: KNeighborsClassifier

Accuracy: 0.5911474436666749
Precision: 0.4670252469813392
Confusion Matrix:
 [[26526 24277]
 [ 8782 21273]]
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.52      0.62     50803
           1       0.47      0.71      0.56     30055

    accuracy                           0.59     80858
   macro avg       0.61      0.61      0.59     80858
weighted avg       0.65      0.59      0.60     80858



### Final Verdict

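- __RandomForestClassifier__ shows the most balanced performance on the validation set: accuracy 0.79, with precision 0.74 and recall 0.68 for the duplicate class.
- __LogisticRegression__ (accuracy 0.61) finds only about 10% of the duplicates (recall 0.10), and __KNeighborsClassifier__ (accuracy 0.59) reaches recall 0.71 but with a precision of only 0.47.
- Therefore, among the three models, __RandomForestClassifier__ is the most appropriate model.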
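
The headline numbers for each model follow directly from its confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]` for labels 0 and 1. As a cross-check, this sketch recomputes accuracy, precision, and recall for the duplicate class from the three matrices printed above. The helper name `summarize` is hypothetical and not part of the notebook.

```python
def summarize(matrix):
    """Return (accuracy, precision, recall) for the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Confusion matrices copied from the validation outputs above.
matrices = {
    "LogisticRegression": [[46516, 4287], [27103, 2952]],
    "RandomForestClassifier": [[43638, 7165], [9738, 20317]],
    "KNeighborsClassifier": [[26526, 24277], [8782, 21273]],
}

results = {name: summarize(matrix) for name, matrix in matrices.items()}
for name, (accuracy, precision, recall) in results.items():
    print(f"{name}: accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The recomputed values agree with the printed accuracy and precision scores and with the class 1 rows of the classification reports.
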

## Prediction on the Test Dataset

In [14]:
# Load Test Dataset
df_test = pd.read_csv(
    './raw_data/test.csv',
    na_filter=False
)

display(df_test.head())

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare with iPad Pro?,Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?
1,1,Should I have a hair transplant at age 24? How much would it cost?,How much cost does hair transplant require?
2,2,What but is the best way to send money from China to the US?,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?


In [15]:
# Preprocessing: Clean text
df_test['question1'] = df_test['question1'].apply(clean_text)
df_test['question2'] = df_test['question2'].apply(clean_text)

display(df_test.head())

Unnamed: 0,test_id,question1,question2
0,0,surface pro 4 compare ipad pro?,microsoft choose core m3 core i3 home surface pro 4?
1,1,hair transplant age 24? much would cost?,much cost hair transplant require?
2,2,best way send money china us?,send money china?
3,3,food emulsifiers?,foods fibre?
4,4,"""aberystwyth"" start reading?",start reading?


In [16]:
# TF-IDF Vectorization
question1_tfidf_test = vectorizer.transform(df_test['question1'])
question2_tfidf_test = vectorizer.transform(df_test['question2'])

In [17]:
X_test = question1_tfidf_test - question2_tfidf_test

In [18]:
# Test: Prediction
y_test_pred = model_rf.predict(X_test)

In [19]:
df_final = pd.DataFrame({
    'test_id': df_test['test_id'],
    'is_duplicate': y_test_pred
})

display(df_final)

Unnamed: 0,test_id,is_duplicate
0,0,0
1,1,0
2,2,1
3,3,1
4,4,1
...,...,...
2345791,2345791,0
2345792,2345792,0
2345793,2345793,0
2345794,2345794,0


In [20]:
# Number of test pairs predicted as duplicates
print(df_final[df_final['is_duplicate'] == 1].shape[0])

384230


In [21]:
df_final.to_csv(
    'v2_submission.csv',
    index=False
)