# <font color = 'indianred'>**Identify Duplicate Questions in Quora Question Pairs using Siamese Network and Multiple Negatives Ranking Loss** </font>

**Objective:**
In this notebook, we will build upon the previous notebook, Quora_find_duplicate_questions_bert.ipynb. We will learn how to train a model using a Siamese network, with the Hugging Face Trainer handling the training loop.

**Change from previous notebook**
- Multiple Negatives Ranking Loss instead of Cross Entropy Loss Function

**Plan**
1. Set Environment
2. Load Dataset
3. Accessing and Manipulating Splits
4. Load Pre-trained Tokenizer and Create Tokenization Function
5. Train Model
  1. Custom Data Collator <br>
  2. Download and modify the model config file <br>
  3. Custom Model Class <br>
  4. Training Arguments <br>
  5. Instantiate Trainer <br>
  6. Setup WandB <br>
  7. Training and Validation
6. Performance on Test Set
7. Model Inference




















# <font color = 'indianred'> **1. Setting up the Environment** </font>



In [4]:
from pathlib import Path
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount("/content/drive")
    !pip install datasets transformers evaluate wandb accelerate swifter sentence-transformers -U -qq
    base_folder = Path("/content/drive/MyDrive/data")
else:
    base_folder = Path("/home/harpreet/Insync/google_drive_shaannoor/data")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

<font color = 'indianred'> *Load Libraries* </font>

In [27]:
# standard data science libraries for data handling and visualization
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.metrics.pairwise import paired_cosine_distances


# New libraries introduced in this notebook
import torch
from datasets import load_dataset, DatasetDict, ClassLabel
from transformers import Pipeline
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer
from transformers import PreTrainedModel
from transformers.modeling_outputs import ModelOutput
from transformers import BertModel, BertConfig

from sentence_transformers import util

import torch.nn as nn
import torch.nn.functional as F
import wandb

# <font color = 'indianred'> **2. Load Dataset**</font>



**Quora Dataset**

The Quora dataset is composed of question pairs, and the task is to determine if the questions are paraphrases of each other (have the same meaning).



In [6]:
quora_dataset = load_dataset("quora")


<font color = 'indianred'> *Understanding the datatype of columns* </font>


In [7]:
quora_dataset['train'].features

{'questions': Sequence(feature={'id': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None)}, length=-1, id=None),
 'is_duplicate': Value(dtype='bool', id=None)}

In [8]:
# Renaming 'is_duplicate' column to 'labels' to match the naming convention expected by Hugging Face Trainer
quora_dataset = quora_dataset.rename_column('is_duplicate', 'labels')

# Retrieve the features of the 'train' split from the quora_dataset
features = quora_dataset['train'].features

# Define the 'labels' feature as a ClassLabel with two classes: 'not_duplicate' and 'duplicate'
features['labels'] = ClassLabel(num_classes=2, names=['not_duplicate', 'duplicate'])

# Cast the 'labels' column in the dataset to the ClassLabel type, ensuring compatibility with Hugging Face's Trainer
quora_dataset = quora_dataset.cast(features)

In [9]:
# Verify the change by printing the updated features of the 'train' split, ensuring 'labels' is now of type ClassLabel
quora_dataset['train'].features

{'questions': Sequence(feature={'id': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None)}, length=-1, id=None),
 'labels': ClassLabel(names=['not_duplicate', 'duplicate'], id=None)}

# <font color = 'indianred'> **3. Accessing and Manipulating Splits**</font>

The dataset ships with only a train split, so we will create train/valid/test splits from it.

<font color = 'indianred'>*Create further subdivisions of the splits*</font>

In [10]:
# Split the original train split into a training set and a temporary hold-out set
train_temp_splits = quora_dataset["train"].train_test_split(
    test_size=0.3, seed=42)  # 70% for training, 30% for test/validation

# Split the hold-out set evenly into validation and test sets
val_test_splits = train_temp_splits["test"].train_test_split(
    test_size=0.5, seed=42)  # 15% for validation and 15% for test

# Extract the train, validation, and test splits
train_split = train_temp_splits["train"]
valid_split = val_test_splits["train"]
test_split = val_test_splits["test"]
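
The two-stage split above works out to roughly 70/15/15, because half of the 30% hold-out goes to validation and half to test. A minimal arithmetic sketch, using the 404,290-row count reported when the dataset was loaded; the exact per-split row counts produced by `train_test_split` may differ by a row because of rounding:

```python
# Sketch: expected split fractions from two successive train_test_split calls.
total_rows = 404_290   # size of the original 'train' split
holdout_frac = 0.3     # test_size of the first split
second_frac = 0.5      # test_size of the second split, applied to the hold-out

train_frac = 1 - holdout_frac
valid_frac = holdout_frac * (1 - second_frac)
test_frac = holdout_frac * second_frac

for name, frac in [("train", train_frac), ("valid", valid_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.2f} of the data, about {round(total_rows * frac):,} rows")
```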


<font color = 'indianred'> *Create subset for experimentation* </font>

In [11]:
train_split_small = train_split.shuffle(seed=32).select(range(2000))
val_split_small = valid_split.shuffle(seed=32).select(range(1000))
test_split_small = test_split.shuffle(seed=32).select(range(1000))

<font color = 'indianred'> *Filter the dataset to include only duplicate pairs* </font>

In [12]:
# Filter the dataset to include only duplicate pairs
train_duplicates = train_split_small.filter(lambda example: example['labels'] == 1)

print(f"Original number of rows: {len(train_split_small)}")
print(f"Number of duplicate rows: {len(train_duplicates)}")

Original number of rows: 2000
Number of duplicate rows: 746


In [13]:
val_duplicates = val_split_small.filter(lambda example: example['labels'] == 1)

print(f"Original number of rows: {len(val_split_small)}")
print(f"Number of duplicate rows: {len(val_duplicates)}")

Original number of rows: 1000
Number of duplicate rows: 352


<font color = 'indianred'>*Combine splits*</font>

We will combine train and validation splits as we will be applying the same processing steps to both the splits.


In [14]:
train_val_small = DatasetDict(
    {"train": train_duplicates, "valid": val_duplicates})

We have created the dataset. The next step is to tokenize it so that the tokenized inputs can be passed to the pre-trained model.

# <font color = 'indianred'>**4. Load pre-trained Tokenizer**</font>

In our next step, we will download a pre-trained tokenizer specifically designed to work with BERT. This tokenizer will handle the conversion of our text into a format that BERT can understand.

In [15]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [16]:
def tokenize_fn(batch):
  question1 = []
  question2 = []
  for question_pair in batch['questions']:
    question1.append(question_pair['text'][0])
    question2.append(question_pair['text'][1])

  tokenized_question1 = tokenizer(question1, truncation=True)
  tokenized_question2 = tokenizer(question2, truncation=True)
  return {
      'input_ids_q1': tokenized_question1['input_ids'],
      'attention_mask_q1': tokenized_question1['attention_mask'],
      'input_ids_q2': tokenized_question2['input_ids'],
      'attention_mask_q2': tokenized_question2['attention_mask'],
  }
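
With `batched=True`, `map` hands `tokenize_fn` a dictionary of columns, and each entry of `batch['questions']` holds parallel `id` and `text` lists for one pair (see the features printed earlier). The sketch below reproduces only the pair-splitting step on a hand-built batch; the ids are made up and `split_pairs` is a hypothetical helper, not part of the notebook:

```python
# Hand-built stand-in for the batch that map(..., batched=True) passes in.
# The ids are invented for illustration.
batch = {
    "questions": [
        {"id": [1, 2], "text": ["How do you go about writing a novel?",
                                "What are some tips for writing a novel?"]},
        {"id": [3, 4], "text": ["Why do people fear death?",
                                "Why don't some people fear death?"]},
    ],
    "labels": [1, 0],
}

def split_pairs(batch):
    """Hypothetical helper: separate each question pair into two parallel lists."""
    question1 = [pair["text"][0] for pair in batch["questions"]]
    question2 = [pair["text"][1] for pair in batch["questions"]]
    return question1, question2

q1, q2 = split_pairs(batch)
print(q1)  # first question of every pair
print(q2)  # second question of every pair
```

Keeping the two lists separate is what lets each question be encoded on its own, which is the point of the Siamese setup.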


In [17]:
tokenized_dataset_small = train_val_small.map(tokenize_fn, batched=True).remove_columns( ['questions'])

In [18]:
tokenized_dataset_small.set_format(type='torch')

In [19]:
tokenized_dataset_small['train'].features

{'labels': ClassLabel(names=['not_duplicate', 'duplicate'], id=None),
 'input_ids_q1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask_q1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids_q2': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask_q2': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

In [20]:
print(len(tokenized_dataset_small["train"]["input_ids_q1"][2]))
print(len(tokenized_dataset_small["train"]["input_ids_q1"][1]))

14
10


The varying lengths in the dataset indicate that padding has not been applied yet. Instead of padding the entire dataset, we prefer processing small batches during training. Padding is done selectively for each batch based on the maximum length in the batch. We will discuss this in more detail in a later section of this notebook.
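
As a rough illustration of per-batch padding, the sketch below pads a toy batch to its own longest sequence and builds the matching attention masks. The token ids are made up, `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption for the example; in practice the value comes from `tokenizer.pad_token_id`:

```python
# Sketch of dynamic padding: pad each batch only to its own longest sequence.
PAD_ID = 0  # assumed pad token id for illustration

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length id lists to the batch maximum and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Toy token ids (made up), lengths 3 and 5
batch = [[101, 2054, 102], [101, 2129, 2079, 2017, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

A batch of short questions is therefore padded only a little, instead of every example being padded to the longest question in the whole dataset.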

#  <font color = 'indianred'> **5. Model Training**</font>

##  <font color = 'indianred'> **5.1 Custom Data Collator**</font>

In [21]:
class SiameseDataCollatorWithPadding:
    def __init__(self, tokenizer, padding=True):
        """
        Custom data collator for Siamese network structure with separate tokenization for two inputs.

        Args:
        tokenizer (PreTrainedTokenizer): The tokenizer used for encoding the text inputs.
        padding (bool, optional): Whether to pad the inputs to the maximum length in the batch. Defaults to True.
        """
        self.tokenizer = tokenizer
        self.padding = padding

    def __call__(self, features):
        # Separate features for question1 and question2
        features_q1 = [{"input_ids": feature["input_ids_q1"], "attention_mask": feature["attention_mask_q1"]} for feature in features]
        features_q2 = [{"input_ids": feature["input_ids_q2"], "attention_mask": feature["attention_mask_q2"]} for feature in features]

        # Pad each set of features independently
        batch_q1 = self.tokenizer.pad(features_q1, padding=self.padding, return_tensors="pt")
        batch_q2 = self.tokenizer.pad(features_q2, padding=self.padding, return_tensors="pt")

        # Combine the padded features into one dictionary
        batch = {
            "input_ids_q1": batch_q1["input_ids"],
            "attention_mask_q1": batch_q1["attention_mask"],
            "input_ids_q2": batch_q2["input_ids"],
            "attention_mask_q2": batch_q2["attention_mask"],
        }

        # If labels exist, include them in the batch
        if "labels" in features[0]:
            batch["labels"] = torch.tensor([feature["labels"] for feature in features], dtype=torch.long)

        return batch


In [22]:
data_collator = SiameseDataCollatorWithPadding(tokenizer)

##  <font color = 'indianred'> **5.2 Download and Modify Model Config File**</font>

<font color = 'indianred'>*Download config file of pre-trained Model*</font>



In [23]:
config = BertConfig.from_pretrained(checkpoint)  # download the config of the pre-trained checkpoint
class_names = tokenized_dataset_small["train"].features["labels"].names
label2id = {label: i for i, label in enumerate(class_names)}
id2label = {i: label for label, i in label2id.items()}
config.id2label = id2label
config.label2id = label2id


##  <font color = 'indianred'> **5.3 Custom Model Class**</font>

In [24]:
def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover 768-dimension embeddings
    in_mask = attention_mask.unsqueeze(-1).expand(token_embeds.size()).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(in_mask.sum(1), min=1e-9 )
    return pool
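
To see why the mask matters, here is a small NumPy analogue of `mean_pool` on a toy batch (the numbers are made up). The padded position carries large values on purpose; masking keeps it from distorting the average:

```python
import numpy as np

# Toy batch: 1 sequence, 3 token positions, 2-dimensional embeddings.
token_embeds = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])  # last position is padding

mask = attention_mask[..., None].astype(float)   # shape (1, 3, 1), broadcasts over the hidden dim
summed = (token_embeds * mask).sum(axis=1)       # padded positions contribute zero
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # guard against division by zero
pooled = summed / counts
print(pooled)  # mean of the two real tokens: [[2. 3.]]
```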

In [25]:
class SiameseBertModel(PreTrainedModel):
    config_class = BertConfig

    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # Linear head defined for binary classification (duplicate or not).
        # It is not used in forward(), which computes the loss from the embeddings alone.
        self.classifier = nn.Linear(config.hidden_size * 3, 2)

    def forward(self, input_ids_q1, attention_mask_q1, input_ids_q2, attention_mask_q2, labels=None):
        # Token-level embeddings for each question from the shared BERT encoder
        u = self.bert(input_ids_q1, attention_mask=attention_mask_q1).last_hidden_state
        v = self.bert(input_ids_q2, attention_mask=attention_mask_q2).last_hidden_state

        # Mean-pool the token embeddings into one vector per question
        u = mean_pool(u, attention_mask_q1)
        v = mean_pool(v, attention_mask_q2)

        # L2-normalize so that the dot product below equals cosine similarity
        u = F.normalize(u, p=2, dim=1)
        v = F.normalize(v, p=2, dim=1)

        # scores[i, j] = similarity between question 2 of pair i and question 1 of pair j
        scores = torch.matmul(v, u.T)

        # Multiple Negatives Ranking Loss: for row i the correct "class" is column i,
        # so the targets are 0..batch_size-1. The incoming `labels` argument is not used here.
        targets = torch.arange(scores.size(0), dtype=torch.long, device=scores.device)
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(scores, targets)

        return ModelOutput(loss=loss, embeddings=(u, v))


In [29]:
model = SiameseBertModel(config)
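
The loss computed in `forward` can be restated outside the model. The sketch below is a NumPy re-implementation written for illustration only (`mnr_loss` is a hypothetical name, and the inputs are synthetic): each row of the score matrix is treated as a classification problem whose correct answer sits on the diagonal. When all scores in a row are equal, the loss is ln(batch size), about 3.47 for a batch of 32, which is a handy baseline when reading the training loss.

```python
import numpy as np

def mnr_loss(u, v):
    """In-batch softmax cross-entropy where row i of v should match row i of u."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    scores = v @ u.T                                      # (batch, batch) cosine similarities
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # the targets are the diagonal entries

batch_size, dim = 32, 8
identical = np.ones((batch_size, dim))                    # every pair looks the same
print(mnr_loss(identical, identical), np.log(batch_size)) # both equal ln(32)

rng = np.random.default_rng(0)
u = rng.normal(size=(batch_size, dim))
print(mnr_loss(u, u) < np.log(batch_size))                # matched pairs fall below the baseline
```

The `MultipleNegativesRankingLoss` in sentence-transformers additionally multiplies the similarities by a scale factor (20 by default) before the softmax; the `forward` method above uses the raw cosine similarities.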

## <font color = 'indianred'> **5.4 Training Arguments**</font>







In [30]:
# Define the directory where model checkpoints will be saved
model_folder = base_folder / "models"/"nlp_spring_2024/quora/sbert_hf_mnr_small"
# Create the directory if it doesn't exist
model_folder.mkdir(exist_ok=True, parents=True)

# Configure training parameters
training_args = TrainingArguments(
    # Training-specific configurations
    num_train_epochs=1,  # Total number of training epochs
    # Number of samples per training batch for each device
    per_device_train_batch_size=32,
    # Number of samples per evaluation batch for each device
    per_device_eval_batch_size=32,
    weight_decay=0.01,  # Apply L2 regularization to prevent overfitting
    learning_rate=2e-5,  # Step size for the optimizer during training
    optim='adamw_torch',  # Optimizer,

    # Checkpoint saving and model evaluation settings
    output_dir=str(model_folder),  # Directory to save model checkpoints
    evaluation_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=2,  # Perform evaluation every 2 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=2,  # Save a model checkpoint every 2 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use validation loss as the metric to determine the best model
    metric_for_best_model="eval_loss",
    greater_is_better=False,  # A model is 'better' if its validation loss is lower


    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=2,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name='quora_sbert_hf_mnr_small',  # Experiment name for Weights & Biases


    fp16=True,
    remove_unused_columns=False
)


##  <font color = 'indianred'> **5.5 Initialize Trainer**</font>



In [31]:
# initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_small["train"],
    eval_dataset=tokenized_dataset_small["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

## <font color = 'indianred'> **5.6 Setup WandB**</font>

In [32]:
# setup wandb
wandb.login()  # you will need to create a wandb account first
# Set project name for logging
%env WANDB_PROJECT = nlp_course_spring_2024_quora-sbert-small

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


env: WANDB_PROJECT=nlp_course_spring_2024_quora-sbert-small


##  <font color = 'indianred'> **5.7 Training and Validation**</font>

In [33]:
trainer.train()  # start training


wandb: Currently logged in as: hsingh-utd. Use `wandb login --relogin` to force relogin


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


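| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 2    | 3.2389        | 3.127885        |
| 4    | 3.1353        | 3.078386        |
| 6    | 3.1102        | 2.984331        |
| 8    | 3.0018        | 2.940957        |
| 10   | 2.9672        | 2.886491        |
| 12   | 2.9383        | 2.855474        |
| 14   | 2.8923        | 2.832847        |
| 16   | 2.9101        | 2.816329        |
| 18   | 2.8486        | 2.804554        |
| 20   | 2.8894        | 2.796226        |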


TrainOutput(global_step=24, training_loss=2.9224660197893777, metrics={'train_runtime': 133.9579, 'train_samples_per_second': 5.569, 'train_steps_per_second': 0.179, 'total_flos': 0.0, 'train_loss': 2.9224660197893777, 'epoch': 1.0})
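
The `global_step=24` in the output follows from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation (both are assumptions about the runtime, not stated in the notebook):

```python
import math

num_train_examples = 746     # duplicate pairs kept in the small training subset
per_device_batch_size = 32   # per_device_train_batch_size
num_epochs = 1               # num_train_epochs

steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 24, matching global_step in TrainOutput
```

The last batch of the epoch holds only the 10 leftover pairs, so it has fewer in-batch negatives than the other 23 batches.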

<font color = 'indianred'> *Evaluate model on Validation Set* </font>


In [34]:
evaluation_results = trainer.evaluate(tokenized_dataset_small["valid"])

In [35]:
evaluation_results

{'eval_loss': 2.7864768505096436,
 'eval_runtime': 0.401,
 'eval_samples_per_second': 877.859,
 'eval_steps_per_second': 27.433,
 'epoch': 1.0}

<font color = 'indianred'> **Validation set accuracy, precision, recall, and F1** </font>

In [36]:
val_split_small_tokenized = val_split_small.map(tokenize_fn, batched=True).remove_columns( ['questions'])

In [37]:
val_split_small_tokenized.set_format(type='torch')
val_split_small_tokenized.features

{'labels': ClassLabel(names=['not_duplicate', 'duplicate'], id=None),
 'input_ids_q1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask_q1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'input_ids_q2': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'attention_mask_q2': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

In [38]:
eval_outputs = trainer.predict(val_split_small_tokenized)

In [39]:
eval_outputs._fields

('predictions', 'label_ids', 'metrics')

In [40]:
u, v = eval_outputs.predictions
labels = eval_outputs.label_ids

In [41]:
scores = 1 - paired_cosine_distances(u, v)
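
`1 - paired_cosine_distances(u, v)` gives the cosine similarity between row `i` of `u` and row `i` of `v`, one score per question pair. A NumPy sketch of the same quantity, with made-up vectors and a hypothetical helper name:

```python
import numpy as np

def paired_cosine_similarity(u, v):
    """Cosine similarity between u[i] and v[i] for every row i."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    dots = np.sum(u * v, axis=1)
    norms = np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    return dots / norms

u = np.array([[1.0, 0.0], [1.0, 1.0]])
v = np.array([[2.0, 0.0], [1.0, -1.0]])
print(paired_cosine_similarity(u, v))  # parallel rows give 1, orthogonal rows give 0
```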

In [42]:
# the function is borrowed from the sentence-transformers library
def find_best_acc_and_threshold(scores, labels, high_score_more_similar: bool):
    assert len(scores) == len(labels)
    rows = list(zip(scores, labels))

    rows = sorted(rows, key=lambda x: x[0], reverse=high_score_more_similar)

    max_acc = 0
    best_threshold = -1

    positive_correct_so_far = 0 # positives predicted correctly so far
    negatives_correct_so_far = sum(labels == 0) # negatives predicted correctly so far

    for i in range(len(rows) - 1):
        score, label = rows[i]
        if label == 1:
            positive_correct_so_far += 1
        else:
            negatives_correct_so_far -= 1

        acc = (positive_correct_so_far + negatives_correct_so_far) / len(labels)
        if acc > max_acc:
            max_acc = acc
            best_threshold = (rows[i][0] + rows[i + 1][0]) / 2

    return max_acc, best_threshold

In [43]:
eval_accuracy, threshold_accuracy = find_best_acc_and_threshold(scores, labels, True)

In [44]:
eval_accuracy, threshold_accuracy

(0.712, 0.8890573978424072)
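
The returned threshold is what turns a similarity score into a duplicate / not-duplicate decision: with `high_score_more_similar=True`, pairs scoring at or above it are predicted as duplicates. A small sketch with made-up scores and labels, using a rounded version of the threshold found above:

```python
import numpy as np

threshold = 0.889                             # rounded from the value found above
scores = np.array([0.95, 0.91, 0.88, 0.70])   # made-up cosine similarities
labels = np.array([1, 0, 1, 0])               # made-up ground truth

predictions = (scores >= threshold).astype(int)  # 1 = duplicate, 0 = not duplicate
accuracy = (predictions == labels).mean()
print(predictions, accuracy)
```

The high threshold reflects that the cosine similarities from this model are clustered near the top of the range, so the cut-off has to sit close to 1 to separate the classes.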

In [45]:
wandb.log({"eval_accuracy": eval_accuracy, "sim_threshold_for_acc": threshold_accuracy})

In [46]:
# the function is borrowed from the sentence-transformers library
def find_best_f1_and_threshold(scores, labels, high_score_more_similar: bool):
    assert len(scores) == len(labels)

    scores = np.asarray(scores)
    labels = np.asarray(labels)

    rows = list(zip(scores, labels))

    rows = sorted(rows, key=lambda x: x[0], reverse=high_score_more_similar)

    best_f1 = best_precision = best_recall = 0
    threshold = 0
    total_predicted_as_positives_so_far = 0
    true_positives_so_far = 0
    total_positives_in_data = sum(labels)

    for i in range(len(rows) - 1):
        score, label = rows[i]
        total_predicted_as_positives_so_far += 1

        if label == 1:
            true_positives_so_far += 1

        if true_positives_so_far > 0:
            precision = true_positives_so_far / total_predicted_as_positives_so_far
            recall = true_positives_so_far / total_positives_in_data
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_f1 = f1
                best_precision = precision
                best_recall = recall
                threshold = (rows[i][0] + rows[i + 1][0]) / 2

    return best_f1, best_precision, best_recall, threshold

In [None]:
best_f1, best_precision, best_recall, threshold_f1 = find_best_f1_and_threshold(scores, labels, True)

In [None]:
best_f1, best_precision, best_recall, threshold_f1

In [None]:
wandb.log(evaluation_results)

<font color = 'indianred'> *Get best checkpoint*</font>


In [None]:
# After training, let us check the best checkpoint
# We need this for Inference
best_model_checkpoint_step = trainer.state.best_model_checkpoint.split('-')[-1]
print(f"The best model was saved at step {best_model_checkpoint_step}.")


In [None]:
wandb.finish()


#  <font color = 'indianred'> **6. Performance on Test Set**</font>


<font color = 'indianred'>*Load model and tokenizer*</font>

In [None]:
checkpoint = str(model_folder/f'checkpoint-{best_model_checkpoint_step}')
checkpoint

In [None]:
model = SiameseBertModel.from_pretrained(checkpoint, config = config)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
test_set_tokenized = test_split_small.map(tokenize_fn, batched=True)

<font color = 'indianred'>*Training Arguments*</font>



In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=128,
    do_train=False,
    do_eval=True,
    report_to=[],
    remove_unused_columns=False
)

<font color = 'indianred'>*Instantiate Trainer*</font>

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=test_set_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

<font color = 'indianred'>*Evaluate using Trainer*</font>

In [None]:
test_outputs = trainer.predict(test_set_tokenized)

In [None]:
u, v = test_outputs.predictions
scores = 1 - paired_cosine_distances(u, v)
labels = test_outputs.label_ids

In [None]:
def evaluate_test(scores, threshold_acc, threshold_f1, labels):
    """
    Evaluate classification metrics based on similarity scores and separate thresholds
    for accuracy and for F1, precision, and recall.

    Args:
        scores (np.ndarray): Array of pairwise similarity scores.
        threshold_acc (float): Threshold for classifying pairs when calculating accuracy.
        threshold_f1 (float): Threshold for classifying pairs when calculating F1, precision, and recall.
        labels (np.ndarray): Ground truth binary labels indicating whether pairs are similar (1) or not (0).

    Returns:
        dict: Dictionary containing accuracy (based on threshold_acc) and F1 score, precision, and recall (based on threshold_f1).
    """
    # Convert scores to binary predictions based on the thresholds
    predictions_acc = (scores >= threshold_acc).astype(int)
    predictions_f1 = (scores >= threshold_f1).astype(int)

    # Compute accuracy using the threshold for accuracy
    accuracy = accuracy_score(labels, predictions_acc)

    # Compute precision, recall, and F1 score using the threshold for F1, precision, and recall
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions_f1, average='binary')

    return {
        "test_accuracy": accuracy,
        "test_f1_score": f1,
        "test_precision": precision,
        "test_recall": recall
    }




In [None]:
test_evaluation = evaluate_test(scores, threshold_acc=threshold_accuracy, threshold_f1=threshold_f1, labels=labels)

In [None]:
test_evaluation

# <font color = 'indianred'> **7. Model Inference**</font>


In [None]:
class SentenceEmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        return {}, {}, {}

    def preprocess(self, text, second_text=None):
        return self.tokenizer(text, return_tensors=self.framework, padding = True, truncation = True)

    def _forward(self, model_inputs):
        token_embeds= self.model(**model_inputs).last_hidden_state
        attention_mask = model_inputs['attention_mask']
        in_mask = attention_mask.unsqueeze(-1).expand(token_embeds.size()).float()
        pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(in_mask.sum(1), min=1e-9 )

        return pool

    def postprocess(self, model_outputs):
        return model_outputs

In [None]:
checkpoint = str(model_folder/f'checkpoint-{best_model_checkpoint_step}')
model = SiameseBertModel.from_pretrained(checkpoint, config = config ).bert
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentence_pair_pipeline = SentenceEmbeddingPipeline(model=model, tokenizer=tokenizer, device = 0, framework='pt')

In [None]:
sentences = ['What do House Republicans think of President Obama?',
 'Do republicans really think President Obama did a bad job?',
 'Why are so many people content with just earning a salary and working 9-6 their entire adult life?',
 'Jobs and Careers: Why are so many people content with just earning a salary and working 9-6 their entire adult life?',
 'How do you check the balance on a target gift card?',
 'How do you check your balance on a Target gift card?',
 'What are the best tips to stay young looking?',
 'What are best ways to stay and look young for longer time?',
 'How do you go about writing a novel?',
 'What are some tips for writing a novel?',
 'Is downloading app slow down the WiFi?',
 'Why does Xbox slow down when downloading games? Is there any setting to improve its speed?',
 'What is a good website for free books?',
 'Where can I get online PDF or EPUB versions of books?',
 'How do you switch phones on Metro PCS?',
 'How can I switch from Sprint to Metro PCs?',
 "Why don't some people fear death?",
 'Why do people fear death?',
 'What is the weighted average income in the United States?',
 'How does jumping rope help burn fat?']



In [None]:
embeddings = sentence_pair_pipeline(sentences)

In [None]:
embeddings = torch.cat(embeddings, dim=0)

In [None]:

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)
cos_sim


In [None]:
# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))
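
The nested loop above visits every pair `(i, j)` with `i < j`. For larger sentence lists, the same ranking can be sketched with NumPy's upper-triangle indices. `top_k_pairs` is a hypothetical helper and the similarity matrix is made up; a torch tensor such as `cos_sim` would first need converting, for example with `cos_sim.cpu().numpy()`:

```python
import numpy as np

def top_k_pairs(sim, k=5):
    """Return (i, j, score) for the k highest-scoring pairs with i < j."""
    sim = np.asarray(sim)
    rows, cols = np.triu_indices(sim.shape[0], k=1)   # upper triangle, excluding the diagonal
    order = np.argsort(-sim[rows, cols])[:k]          # indices of the largest scores first
    return [(int(rows[n]), int(cols[n]), float(sim[rows[n], cols[n]])) for n in order]

# Made-up 3x3 similarity matrix for illustration
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.5],
                [0.9, 0.5, 1.0]])
print(top_k_pairs(sim, k=2))
```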
