# Predicting Duplicate Question Pairs with Transformers

## Inspiration

This notebook is inspired by the ['Getting Started with NLP for absolute beginners'](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners) notebook by Jeremy Howard. That notebook focuses on comparing two short phrases and assigning a similarity score:


    A score closer to 0 means the phrases have very different meanings.
    A score closer to 1 means the phrases have similar meanings.


Key highlights from the above Notebook:

1. Creating a Combined Input Feature

    To compare two text columns in the context of a third, the notebook demonstrates how to combine them into a single input string.
    For example, if our dataset has three columns — col1, col2, and col3 — and we want to compare col1 and col2 based on the context in col3, we can format the input like this:

    > df['input'] = 'TEXT1: ' + df.col1 + ' TEXT2: ' + df.col2 + ' CONTEXT: ' + df.col3

    This combined string is then tokenized for model input.

2. Splitting the Data: Train, Validation, and Test Sets

    * The training set is used to teach the model.
    * The validation set helps evaluate and fine-tune the model during training.
    * The test set checks how the model performs on completely new data.

    Evaluating on the test set serves as the final performance check.

3. Using Pre-trained Models

    Pre-trained models (from Hugging Face) are especially helpful for tasks like comparing pairs of questions.
    Because these models have been trained on large volumes of text, they already understand the language structure and can detect subtle differences or similarities in meaning.
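
As a minimal sketch of the first two points, the snippet below builds a combined input column on a toy DataFrame and carves out train, validation, and test subsets. The column names `col1`, `col2`, and `col3` follow the example above; the toy rows, the 3/1/1 split, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical); a real notebook would read its rows from a CSV file.
toy = pd.DataFrame({
    "col1": ["red car", "fast train", "tall tree", "blue sky", "cold tea"],
    "col2": ["crimson automobile", "slow boat", "high tree", "green grass", "iced tea"],
    "col3": ["vehicles", "transport", "plants", "nature", "drinks"],
})

# 1. Combine the two phrases and their context into a single string per row.
toy["input"] = "TEXT1: " + toy.col1 + " TEXT2: " + toy.col2 + " CONTEXT: " + toy.col3

# 2. Shuffle once with a fixed seed, then slice into three disjoint subsets.
shuffled = toy.sample(frac=1.0, random_state=42).reset_index(drop=True)
train_part = shuffled.iloc[:3]   # used to fit the model
valid_part = shuffled.iloc[3:4]  # used to monitor training
test_part = shuffled.iloc[4:]    # held back for the final check

print(toy["input"].iloc[0])
print(len(train_part), len(valid_part), len(test_part))
```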

## Introduction

I have applied some concepts which I learned from the above notebook to the [Question Pairs Dataset](https://www.kaggle.com/datasets/quora/question-pairs-dataset). This dataset contains six columns, as shown below:

* **id** - Id for each question pair
* **qid1** - Id for question 1 in pair
* **qid2** - Id for question 2 in pair
* **question1** - Full Text for question 1
* **question2** - Full Text for question 2
* **is_duplicate** - 1 if duplicate, else 0

The target column 'is_duplicate' contains a binary value to indicate if the pair of questions ('question1' and 'question2') represents a duplicate pair or not.

## Importing Basic Libraries

We'll start by importing the basic libraries and printing the paths of the files in the input directory.

* `NumPy` - A Python library for working with arrays and numbers; it speeds up math on large sets of data such as matrices and tables
* `Pandas` - A Python library for handling tabular data; it makes it easy to read, write, clean, and analyze data

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

> As seen in the output above, the input directory contains a sub-directory `question-pairs-dataset`, which contains the dataset file `questions.csv`

In [None]:
from pathlib import Path
path = Path('../input/question-pairs-dataset')

In [None]:
!ls {path}

We'll import the CSV file into a `Pandas DataFrame` and, for faster processing, limit it to the first **10,000** rows.

In [None]:
import pandas as pd
df = pd.read_csv(path/'questions.csv', nrows=10000)

## Data Preprocessing

We'll remove rows from the DataFrame where `question1` or `question2` has missing values.

In [None]:
df = df.dropna(subset=["question1", "question2"])
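
To see what `dropna(subset=...)` does, here is a small sketch on a made-up DataFrame (the rows and the extra `note` column are hypothetical). Only rows with a missing value in one of the listed columns are dropped; a missing value in any other column is left alone.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "question1": ["What is AI?", None, "How to cook rice?"],
    "question2": ["What is artificial intelligence?", "Why is the sky blue?", np.nan],
    "note": [None, "x", "y"],  # extra column that is NOT checked
})

# Keep only rows where both question columns are present.
cleaned = toy.dropna(subset=["question1", "question2"])
print(len(toy), "->", len(cleaned))
```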

Displaying the last **three** rows of the DataFrame for quick analysis

In [None]:
df.tail(3)

Displaying `summary statistics` (like **count**, **unique values**, **top value**, and **frequency**) for all object (text/string) columns in the DataFrame

In [None]:
df.describe(include='object')

The output above shows that `question1` contains **9,813** unique values, with the most frequent one appearing **4** times. Similarly, `question2` has **9,790** unique values, and its most common value also appears **4** times.

As discussed above, we'll prepare the `input` column by concatenating the two string columns `question1` and `question2`, as shown below:

In [None]:
df['input'] = 'QUES1: ' + df.question1 + ' QUES2: ' + df.question2

This is what our new DataFrame looks like with the additional `input` column

In [None]:
df.tail(3)

## Tokenization

Converting the `pandas DataFrame` into a `Hugging Face Dataset` to enable efficient data handling for further tasks

In [None]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

In [None]:
ds

Picking a **pre-trained model** that is efficient at natural language understanding tasks yet lightweight enough for faster training. It will be **fine-tuned** later to `predict duplicate pairs` within the dataset

In [None]:
# Loading DeBERTa-v3-small model (by Microsoft) from Hugging Face

pt_model = 'microsoft/deberta-v3-small'

**Loading the tokenizer** associated with the pre-trained model, which will `convert text into tokens` suitable for model input

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tok = AutoTokenizer.from_pretrained(pt_model)

With the code below, we can retrieve the vocabulary index (ID) of an example token `▁is` from the tokenizer's vocabulary (the leading `▁` marks the start of a word)

In [None]:
tok.vocab['▁is']

`Splitting` an example input sentence into tokens that the model can understand

In [None]:
tok.tokenize('Hello, this is a python program')
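
The idea of tokens and vocabulary IDs can be illustrated without downloading a model. The sketch below is a deliberately simplified whitespace tokenizer with a made-up vocabulary; it is not how DeBERTa's subword tokenizer works internally, and the names `toy_vocab`, `toy_tokenize`, and `toy_encode` are hypothetical.

```python
# A made-up vocabulary: token string -> integer ID. ID 0 is reserved for unknown tokens.
toy_vocab = {"[UNK]": 0, "▁hello": 1, "▁this": 2, "▁is": 3, "▁a": 4, "▁program": 5}

def toy_tokenize(text):
    """Lowercase, split on whitespace, and mark each word start with '▁'."""
    return ["▁" + word for word in text.lower().split()]

def toy_encode(text):
    """Map each token to its vocabulary ID, falling back to the [UNK] ID."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in toy_tokenize(text)]

tokens = toy_tokenize("Hello this is a python program")
ids = toy_encode("Hello this is a python program")
print(tokens)
print(ids)  # [1, 2, 3, 4, 0, 5] -> "python" is not in the toy vocabulary
```

A real subword tokenizer would typically split an unfamiliar word into smaller known pieces rather than map it to a single unknown ID.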

Defining a function `tok_inp` that takes a dictionary `x` and returns the tokenized version of the value associated with the `input` key using the tokenizer `tok`

In [None]:
def tok_inp(x): return tok(x["input"])

As seen in the output below, there are `no null values`, as they were already handled earlier

In [None]:
print(df.isnull().sum())

In [None]:
null_rows = df[df.isnull().any(axis=1)]
print(null_rows)

This line applies the `tok_inp` function to the entire dataset `ds` in batches, creating a new dataset `tok_ds` with tokenized inputs.

In [None]:
tok_ds = ds.map(tok_inp, batched=True)
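
With `batched=True`, the mapped function receives a dictionary of lists (one list per column) rather than a single row, which is why `tok_inp` can pass `x["input"]` straight to the tokenizer. The sketch below imitates that calling convention in plain Python; `map_batched`, `fake_tok`, and `tok_inp_sketch` are hypothetical stand-ins, not the `datasets` or `transformers` implementations.

```python
def fake_tok(texts):
    """Stand-in tokenizer: returns one list of word lengths per input string."""
    return {"input_ids": [[len(w) for w in t.split()] for t in texts]}

def tok_inp_sketch(batch):
    # `batch["input"]` is a list of strings, one per row in the batch.
    return fake_tok(batch["input"])

def map_batched(columns, fn, batch_size=2):
    """Apply `fn` to slices of `columns` and append the new columns it returns."""
    n_rows = len(next(iter(columns.values())))
    out = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out

data = {"input": ["QUES1: a b QUES2: c", "QUES1: dd QUES2: e", "QUES1: f QUES2: g"]}
result = map_batched(data, tok_inp_sketch)
print(result["input_ids"])
```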

Applying the tokenization function to the dataset in **batches** adds new columns, including `input_ids`, as shown below for the first row of the dataset

In [None]:
row = tok_ds[0]
row['input'], row['input_ids']

> Renaming the column `is_duplicate` to `labels` in the dataset `tok_ds`, preparing it for model training where target values are expected under the name `labels`

In [None]:
tok_ds = tok_ds.rename_columns({'is_duplicate':'labels'})

## Test Set

This line creates the **test set** `eval_df` by loading 1,000 separate rows (from index 10,500 to 11,500) from the CSV file, ensuring they are not part of the training or validation sets.

In [None]:
eval_df = pd.read_csv(path/'questions.csv').iloc[10500:11500]
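
Reading the whole file just to keep 1,000 rows can be slow on a large CSV. One possible alternative, sketched here on a small in-memory CSV, is to skip the leading data rows with `skiprows` and cap the read with `nrows`. The toy data and row numbers are assumptions; note also that the resulting index restarts at 0, unlike the `iloc` slice.

```python
import io
import pandas as pd

# Toy CSV with a header line and 10 data rows (hypothetical content).
csv_text = "id,question1\n" + "".join(f"{i},q{i}\n" for i in range(10))

# Reference: read everything, then slice data rows 4..6.
full_slice = pd.read_csv(io.StringIO(csv_text)).iloc[4:7]

# Alternative: skip file lines 1..4 (data rows 0..3, keeping the header on line 0),
# then read only 3 rows.
partial = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 5), nrows=3)

print(partial["id"].tolist())  # [4, 5, 6]
```

For the range used in this notebook, the equivalent call would pass `skiprows=range(1, 10501)` and `nrows=1000`; it is worth checking on the real file that both approaches return the same rows.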

In [None]:
eval_df.describe()

We'll prepare the `input` column for the **test set** by concatenating the two string columns `question1` and `question2`, as shown below:

In [None]:
eval_df['input'] = 'QUES1: ' + eval_df.question1 + ' QUES2: ' + eval_df.question2
eval_ds = Dataset.from_pandas(eval_df).map(tok_inp, batched=True)

This line splits the tokenized dataset `tok_ds` into a **training set and a validation set**, with **25% used for validation**, using a fixed seed for reproducibility.

In [None]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

In summary, the 10,000-row dataset was split into a **training set** with 7,500 rows and a **validation set** with 2,500 rows. Additionally, a separate **test set** `eval_df` of 1,000 rows was created to ensure it does not overlap with either the training or validation sets.
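
The split sizes follow directly from the 25% validation fraction. The sketch below reproduces the arithmetic with a seeded shuffle of row indices; it illustrates the idea only and is not the `datasets` implementation, so the specific rows chosen will differ from those in `dds`. It also assumes exactly 10,000 rows, as in the summary above.

```python
import random

n_rows = 10_000          # assumed dataset size
valid_fraction = 0.25    # same fraction passed to train_test_split

indices = list(range(n_rows))
random.Random(42).shuffle(indices)   # fixed seed -> same shuffle on every run

n_valid = int(n_rows * valid_fraction)
valid_idx = indices[:n_valid]        # rows reserved for validation
train_idx = indices[n_valid:]        # remaining rows used for training

print(len(train_idx), len(valid_idx))  # 7500 2500
```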

## Training our model

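Defining a helper `corr` that computes the **Pearson correlation** between two arrays, and a function `corr_func` that applies it to the predicted classes and the true labels during evaluation. Because the model outputs two logits per example, the predicted class is taken with `argmax` before the correlation is computed.

In [None]:
def corr(x, y): return np.corrcoef(x, y)[0][1]  # Pearson r of two 1-D arrays
def corr_func(eval_pred): return {'pearson': corr(np.argmax(eval_pred[0], axis=1), eval_pred[1])}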
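
For reference, the Pearson correlation between two 1-D arrays can be computed with `np.corrcoef`. The sketch below uses made-up predictions and labels; the helper name `pearson` is an assumption for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0][1])

labels = np.array([0, 1, 1, 0, 1, 0])
perfect = np.array([0, 1, 1, 0, 1, 0])   # identical to the labels
inverted = 1 - labels                    # every prediction is wrong
partial = np.array([0, 1, 0, 0, 1, 1])   # four of six correct

print(pearson(perfect, labels), pearson(inverted, labels), pearson(partial, labels))
```

For two binary arrays this quantity coincides with the phi (Matthews) coefficient, so it ranges from -1 (always wrong) through 0 (no relationship) to 1 (always right).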

In [None]:
from transformers import TrainingArguments, Trainer

Setting up training configuration, specifying **parameters** like `learning rate`, `batch size`, `number of epochs`, and more. 
Note: larger batch sizes may exceed available GPU memory and lead to out-of-memory errors.

In [None]:
bs = 32
epochs = 3

In [None]:
lr = 8e-5

In [None]:
args = TrainingArguments(
    'outputs', 
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none')
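
The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` means the learning rate ramps up linearly for roughly the first 10% of steps and then decays along a half cosine towards zero. The sketch below computes that shape in plain Python. The step count assumes 7,500 training rows on a single device with no gradient accumulation, and the function name `lr_at` is hypothetical; the actual scheduler inside `Trainer` may differ in rounding details.

```python
import math

peak_lr = 8e-5
batch_size = 32
epochs = 3
train_rows = 7_500  # assumed size of the training split

steps_per_epoch = math.ceil(train_rows / batch_size)   # 235
total_steps = steps_per_epoch * epochs                 # 705
warmup_steps = math.ceil(0.1 * total_steps)            # 71

def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Inspect the start, the peak, the midpoint, and the end of training.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```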

Loading a pre-trained model for sequence classification with 2 labels, setting up a `Trainer` using the specified training/validation sets and metrics, then fine-tuning the model with `trainer.train()`.
NOTE: In the code below, *dds['train']* represents the `training set` and *dds['test']* represents the `validation set`. We'll evaluate the model on the `test set` *eval_ds* later.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(pt_model, num_labels=2)

trainer = Trainer(
    model,
    args,
    train_dataset=dds['train'],
    eval_dataset=dds['test'],
    tokenizer=tok,
    compute_metrics=corr_func
)

In [None]:
trainer.train();

## Evaluating our Model

Using the trained model to make predictions on the **test set** `eval_ds`, extracting the output scores (`logits`), and selecting the class with the highest score as the final predicted label for each example.

In [None]:
predictions = trainer.predict(eval_ds)
logits = predictions.predictions
predicted_labels = np.argmax(logits, axis=1)
predicted_labels[:10]
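
To make the logits-to-label step concrete, here is a small sketch on made-up logits. `argmax` picks the column with the larger score; applying a softmax first gives probabilities but does not change which class wins. The logit values are invented for illustration.

```python
import numpy as np

# Made-up logits for three question pairs: column 0 = not duplicate, column 1 = duplicate.
toy_logits = np.array([[2.0, -1.0],
                       [-0.5, 1.5],
                       [0.2, 0.1]])

toy_labels = np.argmax(toy_logits, axis=1)

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(toy_labels)       # [0 1 0]
print(probs.round(3))   # each row sums to 1
```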

After predicting labels for the test set, we'll **review a few sample predictions** before calculating the overall accuracy.

The predicted label for index 1 in the test set is 0, indicating that the pair of questions is **not a duplicate** — which is supported by the input questions shown below.

In [None]:
print(eval_df['input'].iloc[1])
print("(Duplicate Pair)" if predicted_labels[1] == 1 else "(Not a duplicate pair)")

Similarly, the predicted label for index 2 in the test set is 1, indicating that the pair of questions is considered a **duplicate** — as supported by the input questions shown below.

In [None]:
print(eval_df['input'].iloc[2])
print("(Duplicate Pair)" if predicted_labels[2] == 1 else "(Not a duplicate pair)")

The first `10 actual label values` from the test set are shown below:

In [None]:
print(eval_df['is_duplicate'][:10])

Calculating and printing the `accuracy` of the model’s **predicted labels** against the **true labels** from the test set.

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(eval_df['is_duplicate'], predicted_labels)
print(f"Accuracy: {accuracy:.4f}")
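
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on invented labels, together with a count of each (true, predicted) combination. The lists and variable names are hypothetical.

```python
y_true = [0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 0]

# Fraction of positions where prediction and label agree.
matches = [t == p for t, p in zip(y_true, y_pred)]
manual_accuracy = sum(matches) / len(matches)

# Count each (true, predicted) combination.
counts = {(t, p): 0 for t in (0, 1) for p in (0, 1)}
for t, p in zip(y_true, y_pred):
    counts[(t, p)] += 1

print(f"Accuracy: {manual_accuracy:.4f}")  # Accuracy: 0.7500
print(counts)
```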