# <font color = 'indianred'>**Identify Duplicate Questions in Quora Question Pairs using Siamese Network and Softmax** </font>


**Objective:**
In this notebook, we build upon the previous notebook, Quora_find_duplicate_questions_bert.ipynb. We will learn how to train a model with a Siamese network and a softmax classification loss, using the Sentence-Transformers library.

**Plan**
1. Set Environment
2. Load Dataset
3. Accessing and Manipulating Splits
4. Model Training
5. Performance on Test Set
6. Model Inference





















# <font color = 'indianred'> **1. Setting up the Environment** </font>



In [None]:
from pathlib import Path
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount("/content/drive")
    !pip install datasets sentence-transformers -U -qq
    base_folder = Path("/content/drive/MyDrive/data")
else:
    base_folder = Path("/home/harpreet/Insync/google_drive_shaannoor/data")

Mounted at /content/drive

<font color = 'indianred'> *Load Libraries* </font>

In [None]:
# Core libraries: PyTorch, Hugging Face datasets, Sentence-Transformers, and scikit-learn metrics

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses, util, models, evaluation
from torch.utils.data import DataLoader
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.metrics.pairwise import paired_cosine_distances


# <font color = 'indianred'> **2. Load Dataset** </font>
    


**Quora Dataset**

The Quora dataset is composed of question pairs, and the task is to determine if the questions are paraphrases of each other (have the same meaning).



In [None]:
quora_dataset = load_dataset("quora")

Generating train split:   0%|          | 0/404290 [00:00<?, ? examples/s]

# <font color = 'indianred'> **3. Accessing and Manipulating Splits**</font>

<font color = 'indianred'>*Create further subdivisions of the splits*</font>

In [None]:
# Split the full dataset into a training set and a temporary holdout set
train_temp_splits = quora_dataset["train"].train_test_split(
    test_size=0.3, seed=42)  # 70% for training, 30% for test/validation

val_test_splits = train_temp_splits["test"].train_test_split(
    test_size=0.5, seed=42)  # 15% for validation and 15% for test

# Extract the training, validation, and test splits
train_split = train_temp_splits["train"]
valid_split = val_test_splits["train"]
test_split = val_test_splits["test"]
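
The two-stage split above holds out 30% of the data and then halves the holdout, which gives roughly 70/15/15 proportions. The arithmetic can be checked with a small sketch; `two_stage_fractions` is a hypothetical helper, and the row counts are approximate because the library's rounding may shift them by one.

```python
def two_stage_fractions(holdout=0.3, second=0.5):
    """Return (train, valid, test) fractions for a holdout split followed by a second split."""
    train = 1.0 - holdout
    valid = holdout * (1.0 - second)  # the "train" side of the second split
    test = holdout * second           # the "test" side of the second split
    return train, valid, test


train_frac, valid_frac, test_frac = two_stage_fractions()

n_rows = 404_290  # size of the Quora train split
for name, frac in [("train", train_frac), ("valid", valid_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.0%} (about {round(n_rows * frac):,} rows)")
```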


<font color = 'indianred'> *Create subset for experimentation* </font>

In [None]:
train_split_small = train_split.shuffle(seed=42).select(range(10000))
valid_split_small = valid_split.shuffle(seed=42).select(range(5000))
test_split_small = test_split.shuffle(seed=42).select(range(5000))

<font color = 'indianred'> *Convert to input format for Sentence Transformers* </font>

In [None]:
def convert_to_input_example(split):
    samples = []
    for row in split:
        samples.append(InputExample(
            texts=[row['questions']['text'][0], row['questions']['text'][1]],
            label=int(row['is_duplicate'])
        ))
    return samples
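
Each row stores the two questions in a nested list under `row['questions']['text']`, with the label in `row['is_duplicate']`. The sketch below shows that unpacking on a plain dictionary, without the library. The row contains only the fields the conversion function reads, and `to_pair` is a hypothetical helper used for illustration.

```python
# A row shaped like the ones the conversion function reads (other fields omitted)
row = {
    "questions": {
        "text": [
            "What is the minimum CGPA for doing an MBA in the USA?",
            "What is the minimum CGPA required for MBA in the USA?",
        ]
    },
    "is_duplicate": True,
}


def to_pair(row):
    """Unpack one row into (question1, question2, integer label)."""
    question1, question2 = row["questions"]["text"]
    return question1, question2, int(row["is_duplicate"])


pair = to_pair(row)
print(pair)
```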

In [None]:
train_samples = convert_to_input_example(train_split_small)
train_loader = DataLoader(train_samples, shuffle=True, batch_size=32)
for pair in train_samples[:5]:
    print(pair)

<InputExample> label: 1, texts: Is it possible that Trump entered the Presidential campaign to ensure that Hillary Clinton wins?; Is Hillary Clinton secretly paying Donald Trump to throw the election?
<InputExample> label: 1, texts: What is the minimum CGPA for doing an MBA in the USA?; What is the minimum CGPA required for MBA in the USA?
<InputExample> label: 0, texts: Could not get user data from social network?; How do analytic websites get data from social networks?
<InputExample> label: 0, texts: How good is nus for architecture?; I am an Indian CBSE student who is in the 12 grade, what are the pre-requistes for admission in architecture at NUS?
<InputExample> label: 0, texts: What is the revenue model of Airbnb?; How much revenue is Airbnb making?


In [None]:
valid_sentences1 = [row['questions']['text'][0] for row in valid_split_small]
valid_sentences2 = [row['questions']['text'][1] for row in valid_split_small]
valid_labels = [int(row['is_duplicate']) for row in valid_split_small]
valid_sentences1[:5], valid_sentences2[:5], valid_labels[:5]

(['If Bashar Assad remains in power, what will that mean for Israel?',
  'What qualifications should someone have to get a job in BBC?',
  'Why do people ask such questions here on Quora which could be easily found on the internet?',
  "Is it possible to find Rosberg's F1 Mercedes for sale?",
  'What will be in hand salary if pay scale is 15600?'],
 ['How does Bashar al-Assad look in his most recent pictures?',
  'What are the qualifications needed so that I can get a job in the USA?',
  'Why do so many people ask questions on Quora instead of searching the answers on Wikipedia?',
  "Is it possible to find Rosberg's Mercedes for sale?",
  'What is my salary if pay scale is 45500?'],
 [0, 0, 1, 1, 0])

#  <font color = 'indianred'> **4. Model Training** </font>

In [None]:
bert = models.Transformer('bert-base-uncased')
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(modules=[bert, pooler])

model


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
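
With `pooling_mode_mean_tokens=True`, the pooling layer turns the per-token BERT outputs into one sentence vector by averaging them, skipping padding positions. A minimal pure-Python sketch of that idea follows; `mean_pool` and the toy vectors are hypothetical, and the library itself works on batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vector[k]
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [value / count for value in total]


# Three token vectors of dimension 2; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```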

In [None]:
train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=2)
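
In the Siamese setup, both questions pass through the same encoder to produce embeddings `u` and `v`. As described for the Sentence-BERT classification objective, the softmax loss then feeds the concatenation `(u, v, |u - v|)` into a linear classifier with `num_labels` outputs and applies cross-entropy. The sketch below illustrates this with plain Python lists. The weights are placeholder zeros standing in for an untrained classifier, and all helper names are hypothetical.

```python
import math


def pair_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


u = [0.2, 0.9]  # toy embedding of question 1
v = [0.1, 0.8]  # toy embedding of question 2
features = pair_features(u, v)  # length is 3 * embedding dimension

# Placeholder classifier: one weight row per class, all zeros (untrained)
weights = [[0.0] * len(features) for _ in range(2)]
logits = [sum(w * f for w, f in zip(row, features)) for row in weights]

probs = softmax(logits)
loss = cross_entropy(probs, label=1)
print(features, probs, loss)
```

The classifier head exists only to provide a training signal; the evaluation cells in this notebook score the sentence embeddings themselves with cosine similarity and other distances.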

In [None]:
valid_evaluator = evaluation.BinaryClassificationEvaluator(valid_sentences1, valid_sentences2, valid_labels)

In [None]:
# Fine-tune the model; the best checkpoint (per the validation evaluator) is saved to output_folder
output_folder = str(base_folder / "models/nlp_spring_2024/quora/sbert_sts_small")
model.fit(train_objectives=[(train_loader, train_loss)],
          evaluator=valid_evaluator,
          epochs=1,
          optimizer_params={'lr': 2e-5},
          optimizer_class=torch.optim.AdamW,
          weight_decay=0.01,
          warmup_steps=0,
          save_best_model=True,
          evaluation_steps=10,  # run the evaluator every 10 training steps
          output_path=output_folder)


In [None]:
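# Reload the best saved checkpoint and compute validation metrics and thresholds.
# Note: `compute_metrices` is the method name as spelled in the library version used here.
model = SentenceTransformer(output_folder)
eval_metrics = valid_evaluator.compute_metrices(model)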

In [None]:
eval_metrics

{'cossim': {'accuracy': 0.7184,
  'accuracy_threshold': 0.8494030833244324,
  'f1': 0.6768790849673203,
  'f1_threshold': 0.7788944244384766,
  'precision': 0.5503155097974095,
  'recall': 0.8790450928381963,
  'ap': 0.6277804127525659},
 'manhattan': {'accuracy': 0.7158,
  'accuracy_threshold': 168.03646850585938,
  'f1': 0.6739748274462039,
  'f1_threshold': 208.80795288085938,
  'precision': 0.5458730680697139,
  'recall': 0.8806366047745358,
  'ap': 0.6237755513320777},
 'euclidean': {'accuracy': 0.7148,
  'accuracy_threshold': 7.68890380859375,
  'f1': 0.6726567286093527,
  'f1_threshold': 9.453948974609375,
  'precision': 0.546812749003984,
  'recall': 0.8737400530503979,
  'ap': 0.6229833774382553},
 'dot': {'accuracy': 0.6884,
  'accuracy_threshold': 188.3572998046875,
  'f1': 0.6435704096189118,
  'f1_threshold': 151.83961486816406,
  'precision': 0.5225016545334216,
  'recall': 0.8376657824933686,
  'ap': 0.5789007825175448}}
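
Each block of metrics reports an `accuracy_threshold`: the similarity cut-off that gives the best accuracy on the validation pairs. One way to find such a cut-off is to sort the pairs by score and try a threshold between every two adjacent scores. The function below is a simplified sketch of that idea and is not the library's code; `best_accuracy_threshold` and the toy data are hypothetical.

```python
def best_accuracy_threshold(scores, labels):
    """Return (best accuracy, threshold) when pairs with score >= threshold are predicted 1."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    n = len(pairs)
    positives_above = 0                                 # true duplicates above the cut
    negatives_below = sum(1 for _, y in pairs if y == 0)  # true non-duplicates below the cut
    best_acc, best_thr = -1.0, None

    for i in range(n - 1):
        _, label = pairs[i]
        if label == 1:
            positives_above += 1
        else:
            negatives_below -= 1
        accuracy = (positives_above + negatives_below) / n
        if accuracy > best_acc:
            best_acc = accuracy
            # Place the cut halfway between this score and the next lower one
            best_thr = (pairs[i][0] + pairs[i + 1][0]) / 2
    return best_acc, best_thr


toy_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0]
best_acc, best_thr = best_accuracy_threshold(toy_scores, toy_labels)
print(best_acc, best_thr)
```

Because the threshold is tuned on the validation set, it should be reused unchanged on the test set rather than re-tuned there.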

#  <font color = 'indianred'> **5. Performance on Test Set** </font>


In [None]:
test_sentences1 = [row['questions']['text'][0] for row in test_split_small]
test_sentences2 = [row['questions']['text'][1] for row in test_split_small]
test_labels = [int(row['is_duplicate']) for row in test_split_small]

In [None]:
model = SentenceTransformer(output_folder)

In [None]:
# Encode each side of the test pairs; row i of u is paired with row i of v
u = model.encode(test_sentences1)
v = model.encode(test_sentences2)
scores = 1 - paired_cosine_distances(u, v)
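
Subtracting the paired cosine distance from 1 gives the cosine similarity between row `i` of `u` and row `i` of `v`, so `scores` holds one value per question pair. A pure-Python sketch of the same quantity, with a hypothetical helper name and toy vectors:

```python
import math


def paired_cosine_similarity(U, V):
    """Cosine similarity between U[i] and V[i] for every row i."""
    sims = []
    for u, v in zip(U, V):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        sims.append(dot / (norm_u * norm_v))
    return sims


U = [[1.0, 0.0], [1.0, 1.0]]
V = [[2.0, 0.0], [1.0, -1.0]]

sims = paired_cosine_similarity(U, V)
print(sims)  # parallel vectors give 1.0, orthogonal vectors give 0.0
```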

In [None]:
# Function borrowed from the Sentence-Transformers library
def evaluate_test(scores, threshold_acc, threshold_f1, labels):
    """
    Evaluate classification metrics based on similarity scores and separate thresholds
    for accuracy and for F1, precision, and recall.

    Args:
        scores (np.ndarray): Array of pairwise similarity scores.
        threshold_acc (float): Threshold for classifying pairs when calculating accuracy.
        threshold_f1 (float): Threshold for classifying pairs when calculating F1, precision, and recall.
        labels (np.ndarray): Ground truth binary labels indicating whether pairs are similar (1) or not (0).

    Returns:
        dict: Dictionary containing accuracy (based on threshold_acc) and F1 score, precision, and recall (based on threshold_f1).
    """
    # Convert scores to binary predictions based on the thresholds
    predictions_acc = (scores >= threshold_acc).astype(int)
    predictions_f1 = (scores >= threshold_f1).astype(int)

    # Compute accuracy using the threshold for accuracy
    accuracy = accuracy_score(labels, predictions_acc)

    # Compute precision, recall, and F1 score using the threshold for F1, precision, and recall
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions_f1, average='binary')

    return {
        "test_accuracy": accuracy,
        "test_f1_score": f1,
        "test_precision": precision,
        "test_recall": recall
    }

In [None]:
threshold_acc = eval_metrics['cossim']['accuracy_threshold']
threshold_f1 = eval_metrics['cossim']['f1_threshold']
threshold_acc, threshold_f1

(0.8494030833244324, 0.7788944244384766)

In [None]:
test_evaluation = evaluate_test(scores, threshold_acc=threshold_acc, threshold_f1=threshold_f1, labels=test_labels)

In [None]:
test_evaluation

{'test_accuracy': 0.7076,
 'test_f1_score': 0.6525388166177087,
 'test_precision': 0.524451939291737,
 'test_recall': 0.8634092171016102}
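
F1 is the harmonic mean of precision and recall, so the reported test values can be sanity-checked by hand. The sketch below does that with the numbers above; `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


test_precision = 0.524451939291737
test_recall = 0.8634092171016102

f1 = f1_from(test_precision, test_recall)
print(round(f1, 4))
```

High recall combined with precision near 0.52 means that, at the F1 threshold, the model catches most true duplicates but also flags many non-duplicate pairs.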

# <font color = 'indianred'> **6. Model Inference** </font>


In [None]:
sentences = ['What do House Republicans think of President Obama?',
 'Do republicans really think President Obama did a bad job?',
 'Why are so many people content with just earning a salary and working 9-6 their entire adult life?',
 'Jobs and Careers: Why are so many people content with just earning a salary and working 9-6 their entire adult life?',
 'How do you check the balance on a target gift card?',
 'How do you check your balance on a Target gift card?',
 'What are the best tips to stay young looking?',
 'What are best ways to stay and look young for longer time?',
 'How do you go about writing a novel?',
 'What are some tips for writing a novel?',
 'Is downloading app slow down the WiFi?',
 'Why does Xbox slow down when downloading games? Is there any setting to improve its speed?',
 'What is a good website for free books?',
 'Where can I get online PDF or EPUB versions of books?',
 'How do you switch phones on Metro PCS?',
 'How can I switch from Sprint to Metro PCs?',
 "Why don't some people fear death?",
 'Why do people fear death?',
 'What is the weighted average income in the United States?',
 'How does jumping rope help burn fat?']



In [None]:
model = SentenceTransformer(output_folder)

# Encode all sentences
embeddings = model.encode(sentences)

In [None]:
# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
How do you check the balance on a target gift card? 	 How do you check your balance on a Target gift card? 	 0.9706
Why are so many people content with just earning a salary and working 9-6 their entire adult life? 	 Jobs and Careers: Why are so many people content with just earning a salary and working 9-6 their entire adult life? 	 0.9542
How do you go about writing a novel? 	 What are some tips for writing a novel? 	 0.8656
What do House Republicans think of President Obama? 	 Do republicans really think President Obama did a bad job? 	 0.8641
Why don't some people fear death? 	 Why do people fear death? 	 0.8453
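
To turn these similarity scores into duplicate / not-duplicate decisions, one option is to reuse the thresholds found on the validation set. The sketch below applies both cosine thresholds to the five scores printed above; `flag_duplicates` is a hypothetical helper, and the scores are the rounded values from the output.

```python
def flag_duplicates(scores, threshold):
    """Mark a pair as duplicate when its similarity reaches the threshold."""
    return [score >= threshold for score in scores]


top_scores = [0.9706, 0.9542, 0.8656, 0.8641, 0.8453]  # top-5 pairs from the output
threshold_acc = 0.8494030833244324                     # validation accuracy threshold
threshold_f1 = 0.7788944244384766                      # validation F1 threshold

flags_acc = flag_duplicates(top_scores, threshold_acc)
flags_f1 = flag_duplicates(top_scores, threshold_f1)
print(flags_acc)
print(flags_f1)
```

The last pair ("Why don't some people fear death?" vs. "Why do people fear death?") falls below the accuracy threshold but above the F1 threshold, which shows how the choice of threshold changes the decision for borderline pairs.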
