<a href="https://colab.research.google.com/github/human-ai2025/nlp_projects/blob/master/SetenceSimilarity_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Kaggle Stuff

In [1]:
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
! mkdir ~/.kaggle

In [4]:
!cp /content/drive/MyDrive/ColabNotebooks/tokens/kaggle.json ~/.kaggle/kaggle.json

In [5]:
! chmod 600 ~/.kaggle/kaggle.json

## Downloading Dataset

In [6]:
!kaggle competitions download -c quora-question-pairs
!unzip quora-question-pairs.zip
!unzip train.csv.zip

Downloading quora-question-pairs.zip to /content
 96% 297M/309M [00:02<00:00, 144MB/s]
100% 309M/309M [00:02<00:00, 122MB/s]
Archive:  quora-question-pairs.zip
  inflating: sample_submission.csv.zip  
  inflating: test.csv                
  inflating: test.csv.zip            
  inflating: train.csv.zip           
Archive:  train.csv.zip
  inflating: train.csv               


## Code Stuff

In [7]:
!pip install pytorch-lightning==1.7 --quiet
!pip install transformers==4.22.2 --quiet

[K     |████████████████████████████████| 700 kB 4.9 MB/s 
[K     |████████████████████████████████| 529 kB 59.5 MB/s 
[K     |████████████████████████████████| 4.9 MB 5.1 MB/s 
[K     |████████████████████████████████| 6.6 MB 46.9 MB/s 
[K     |████████████████████████████████| 163 kB 56.7 MB/s 
[?25h

In [8]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AdamW, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

pl.seed_everything(41)

INFO:pytorch_lightning.utilities.seed:Global seed set to 41


41

Approach is concat the two sentences and do sequence classification. 
- Divide the train into train, val and test
- The test dataset will be for final testing. 

In [9]:
data = pd.read_csv('/content/train.csv')
data = data.dropna()
print("The percentage of non similar question pairs is : ")
print(len(data[data['is_duplicate']==0].index)*100/len(data.index))
print("The percentage of similar question pairs is : ")
print(len(data[data['is_duplicate']==1].index)*100/len(data.index))

The percentage of non similar question pairs is : 
63.07994073517081
The percentage of similar question pairs is : 
36.92005926482919


In [89]:
def prepare_dataset(path: str):
    dataframe = pd.read_csv(path)
    dataframe = dataframe.dropna()
    dataframe = dataframe.sample(frac=0.5, random_state=42).reset_index(drop = True)
    qone = list(dataframe['question1'].values)
    qtwo = list(dataframe['question2'].values)
    labels = list(dataframe['is_duplicate'].values)

    # lets do stratified splitting 
    train_inp_q1, val_inp_q1, train_inp_q2, val_inp_q2, train_label, val_label  = train_test_split(qone,
                                                                                                  qtwo,
                                                                                                  labels,
                                                                                                  random_state=2022,
                                                                                                  test_size = 0.1,
                                                                                                  stratify=labels)
    return train_inp_q1, val_inp_q1, train_inp_q2, val_inp_q2, train_label, val_label

In [90]:
class SimilarSentences(Dataset):
    def __init__(self, tokenizer, qone, qtwo, label, maxsize):
        self.tokenizer = tokenizer
        self.question1 = qone
        self.question2 = qtwo
        self.label = label
        self.maxlen = maxsize

    def __len__(self):
        q = str(self.question1) + ' [SEP] ' + str(self.question2) 
        return len(q) 

    def __getitem__(self, idx):
        text = str(self.question1[idx]) + ' [SEP] ' + str(self.question2[idx]) 
        encodedText = self.tokenizer.encode_plus(
            text, 
            add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
            max_length = self.maxlen,       # Pad & truncate all sentences.
            pad_to_max_length = True,
            return_attention_mask = True,   # Construct attn. masks.
            return_tensors = 'pt',          # Return pytorch tensors.
        )
        ids = encodedText['input_ids'].flatten()
        mask = encodedText['attention_mask'].flatten()

        return dict(
                text=text,
                input_ids= torch.tensor(ids, dtype=torch.long), 
                attention_mask=torch.tensor(mask, dtype=torch.long),
                labels=torch.tensor(self.label[idx], dtype=torch.float)
                )            
        


In [91]:
class SimilarSentenceModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.pretrainedModel = AutoModelForSequenceClassification.from_pretrained('distilroberta-base', 
                                                                              return_dict = True, 
                                                                              num_labels=1
                                                                              ) 

    self.loss_func = torch.nn.BCEWithLogitsLoss()

  def forward(self, text, inputIds, attentionMask, labels=None):
    output = self.pretrainedModel(input_ids = inputIds,attention_mask = attentionMask)
    logits = output.logits 
    loss = 0
    if labels is not None:
      labels = labels.unsqueeze(1)
      loss = self.loss_func(logits, labels)
    return loss, logits


In [92]:
class SimilarSentenceModelPL(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = SimilarSentenceModel()

    def forward(self, text, input_ids, attention_mask, labels=None):
      return self.model(text, input_ids, attention_mask, labels)

    def training_step(self, batch, batch_idx):
        loss, outputs = self(**batch)
        return {"loss":loss, "predictions":outputs, "labels":batch["labels"]}

    def validation_step(self, batch, batch_idx):
      loss, outputs = self(**batch)
      return {"loss":loss, "predictions":outputs, "labels":batch["labels"]}
    

    def training_epoch_end(self, outputs):
      print("training_epoch_end")

    def validation_epoch_end(self, outputs):
      print("validation_epoch_end")


    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr = 0.000001, weight_decay=0)
        
        return [optimizer]

In [93]:
def prepare_dataloaders(train_inp_q1, val_inp_q1, train_inp_q2, val_inp_q2, train_label, val_label):
  tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
  train_dataset = SimilarSentences(tokenizer, train_inp_q1, train_inp_q2, train_label, 512)
  train_data_loader = DataLoader(train_dataset, batch_size=12)
  val_dataset = SimilarSentences(tokenizer, val_inp_q1, val_inp_q2, val_label, 512)
  val_data_loader = DataLoader(val_dataset, batch_size=12, shuffle=False)
  return train_data_loader, val_data_loader

In [94]:
train_inp_q1, val_inp_q1, train_inp_q2, val_inp_q2, train_label, val_label = prepare_dataset(path="/content/train.csv")
train_data_loader, val_data_loader = prepare_dataloaders(train_inp_q1, val_inp_q1, train_inp_q2, val_inp_q2, train_label, val_label)
model = SimilarSentenceModelPL()

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 

In [None]:
trainer = pl.Trainer(max_epochs = 2, gpus=1, num_sanity_val_steps=1, precision=16)
trainer.fit(model, train_data_loader, val_data_loader)

  f"Setting `Trainer(gpus={gpus!r})` is deprecated in v1.7 and will be removed"
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit native Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                 | Params
-----------------------------------------------
0 | model | SimilarSentenceModel | 82.1 M
-----------------------------------------------
82.1 M    Trainable params
0         Non-trainable params
82.1 M    Total params
164.238   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


validation_epoch_end


Training: 0it [00:00, ?it/s]