<a href="https://colab.research.google.com/github/iuliaiarina/Banking-Management-System/blob/main/Lab_4_Natural_Language_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import sys
import traceback
import re
print(f"Python {sys.version}")

from matplotlib import pyplot as plt

# Make runs deterministic so we can keep track of the changes we make to the model.
# If we use the same seeds and initialisation every time, differences in results
# come from our changes rather than from random variation.
import numpy as np
import pandas as pd

np.random.seed(317)

import random

random.seed(317)

try:
  import nltk
  nltk.download('stopwords')
  nltk.download('wordnet')
  from nltk.corpus import stopwords
  from nltk import WordNetLemmatizer
except:
  print("No nltk")

try:
    import torch

    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
    # DEVICE = 'cpu'
    print(f"PyTorch {torch.__version__}")
    print(f"DEVICE={DEVICE}")
    if torch.cuda.is_available():
        print(f"\tGPU: {torch.cuda.get_device_name(0)}")
        print(f"\t\tcapability: {torch.cuda.get_device_capability('cuda')[0]}")
        print(f"\tCUDA version: {torch.version.cuda}")
        print("\tcuDNN available: ", torch.backends.cudnn.is_available())

        if torch.backends.cudnn.is_available():
            print("\t\tcuDNN version: ", torch.backends.cudnn.version())

        # Print the number of GPUs available
        print(f"\tNumber of GPUs available: {torch.cuda.device_count()}")
    torch.manual_seed(317)  # Seed PyTorch whether or not CUDA is available

    from torch import nn
    from torch.nn import functional as F
    from torch.optim import AdamW
    from torch.utils.data import Dataset, DataLoader


except:
    print("No PyTorch")
    print(traceback.format_exc())

try:
    import tensorflow as tf

    print(f"TensorFlow {tf.__version__}")
    print(f"Build with CUDA: {tf.test.is_built_with_cuda()}")
    print(f"Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
except:
    print("No TensorFlow")

try:
  from transformers import AutoTokenizer, BertForSequenceClassification
  from tqdm import tqdm
  from datasets import load_dataset

  #print(f"Tokenizer version {BertTokenizer.__version__}")
except:
  print("No Transformers")

try:
  import sklearn
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import accuracy_score
  from sklearn.feature_extraction.text import TfidfVectorizer
  print(f"scikit-learn {sklearn.__version__}")
except:
  print("No scikit-learn")

# DEVICE="cpu"

Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


PyTorch 2.8.0+cu126
DEVICE=cuda
	GPU: Tesla T4
		capability: 7
	CUDA version: 12.6
	cuDNN available:  True
		cuDNN version:  91002
	Number of GPUs available: 1
TensorFlow 2.19.0
Build with CUDA: True
Num GPUs Available:  1
scikit-learn 1.6.1


# Natural Language Inference (NLI)

Transformer resources: the [Jurafsky](https://web.stanford.edu/~jurafsky/slp3/) book, Chapters 8 and 10.

[BERT](https://arxiv.org/pdf/1810.04805) paper.



# [Exercise 1] Train BERT model on SNLI

We train a BERT-based model for Natural Language Inference (NLI) on the SNLI dataset, where each example consists of a `premise` and a `hypothesis` labeled as entailment, neutral, or contradiction.

For example, given the pair:
1. “A man is playing a guitar”  
2. “A person is performing music”

the correct label is entailment.

The training process involves tokenizing sentence pairs with the BERT tokenizer, creating a PyTorch dataset and dataloader, and fine-tuning `BertForSequenceClassification` with three output labels. We evaluate in-domain performance on the SNLI validation set after each epoch and keep the best model.

Finally, we evaluate the best model on the SNLI test split.



## [Exercise 1.1] Load and visualize the data
Required Actions:

* Load: Load the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) dataset using the `datasets` library.

* Clean: Define a cleaning procedure that converts the label column to an integer type and filters out all rows where the label is not 0, 1, or 2.

* Sample: Subset the cleaned data splits to manageable sizes: 3,000 samples for training, and 1,000 samples each for validation and testing.
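
The clean and sample steps above can be sketched on a toy DataFrame, so the example runs without downloading SNLI. It assumes that unlabeled rows carry the label -1, as in the Hugging Face SNLI release. The helper names `clean_labels` and `subsample` are illustrative choices and not part of the lab template; the seed 317 is reused from the setup cell.

```python
import pandas as pd

VALID_LABELS = [0, 1, 2]  # 0 = entailment, 1 = neutral, 2 = contradiction


def clean_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows labeled 0, 1, or 2 and store the label as an integer."""
    labels = pd.to_numeric(df["label"], errors="coerce")  # non-numeric -> NaN
    out = df[labels.isin(VALID_LABELS)].copy()
    out["label"] = out["label"].astype(int)
    return out.reset_index(drop=True)


def subsample(df: pd.DataFrame, n: int, seed: int = 317) -> pd.DataFrame:
    """Draw at most n rows, reproducibly."""
    return df.sample(n=min(n, len(df)), random_state=seed).reset_index(drop=True)


# Toy stand-in for snli["train"].to_pandas(); the -1 row has no gold label.
toy = pd.DataFrame({
    "premise": ["A man is playing a guitar"] * 5,
    "hypothesis": ["A person is performing music"] * 5,
    "label": [0, 1, 2, -1, 0],
})

cleaned = clean_labels(toy)
sampled = subsample(cleaned, n=3)
print(cleaned["label"].tolist())
print(len(sampled))
```

In the lab itself the same two helpers would be applied to each split with sizes 3,000 / 1,000 / 1,000.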


In [2]:
snli = load_dataset("snli")
print(snli)

def clean_df_by_label(df):
    # ==================
    # YOUR CODE HERE
    # ==================
    return df



# ==================
# YOUR CODE HERE
# snli_train_df = clean_df_by_label(snli["train"].to_pandas())
# ...val +test
# ==================

print(snli_train_df.head(10))


DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
})


NameError: name 'snli_train_df' is not defined

## [Exercise 1.2] Create the Dataset class

Required Actions (Implementation of `NLIDataset`):

* Initialization: Accept a Pandas DataFrame, a pre-trained tokenizer, and a maximum sequence length (max_length).

* Label Conversion: Extract the label column from the DataFrame and convert it directly into a PyTorch Long Tensor.

* Text Encoding: Tokenize the pairs of premise and hypothesis columns:

  1. Apply padding and truncation based on the provided max_length.

  2. Ensure the output tensors are in PyTorch format (return_tensors='pt').

  3. Exclude token type IDs (return_token_type_ids=False).

* Interface Implementation: Implement the mandatory `__len__` method to return the dataset size, and the `__getitem__` method to return a dictionary containing `input_ids`, `attention_mask`, and `labels`.
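
One possible shape for such a dataset is sketched below. It is a sketch under stated assumptions, not the reference solution: the class is named `NLIDatasetSketch` to keep it apart from the template class, and `toy_tokenizer` is a hypothetical stand-in that only mimics the keyword arguments of a Hugging Face tokenizer call, so the example runs offline. In the lab you would pass the tokenizer returned by `AutoTokenizer.from_pretrained(model_name)` instead.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class NLIDatasetSketch(Dataset):
    """Tokenizes all premise/hypothesis pairs once, at construction time."""

    def __init__(self, df, tokenizer, max_length):
        self.labels = torch.tensor(df["label"].tolist(), dtype=torch.long)
        self.encodings = tokenizer(
            df["premise"].tolist(),
            df["hypothesis"].tolist(),
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
            return_token_type_ids=False,
        )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "labels": self.labels[idx],
        }


def toy_tokenizer(premises, hypotheses, padding, truncation, max_length,
                  return_tensors, return_token_type_ids):
    """Hypothetical stand-in: maps whitespace tokens to ids in 1..99, 0 = padding."""
    ids, masks = [], []
    for premise, hypothesis in zip(premises, hypotheses):
        tokens = (premise + " " + hypothesis).split()[:max_length]
        row = [1 + (sum(map(ord, tok)) % 99) for tok in tokens]
        pad = max_length - len(row)
        ids.append(row + [0] * pad)
        masks.append([1] * len(row) + [0] * pad)
    return {"input_ids": torch.tensor(ids), "attention_mask": torch.tensor(masks)}


df = pd.DataFrame({
    "premise": ["A man is playing a guitar", "A dog runs"],
    "hypothesis": ["A person is performing music", "A cat sleeps"],
    "label": [0, 2],
})
dataset = NLIDatasetSketch(df, toy_tokenizer, max_length=16)
item = dataset[0]
print(len(dataset), item["input_ids"].shape, item["labels"])
```

Tokenizing everything in `__init__` keeps `__getitem__` cheap, at the cost of holding all encoded tensors in memory; for 3,000 training pairs that is a reasonable trade.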

In [None]:
class NLIDataset(Dataset):
    def __init__(self, df, tokenizer, max_length):
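        # ==================
        # YOUR CODE HERE (tokenization goes here)
        # Hint: tokenizer(premises, hypotheses, ...) encodes the sentence pairs
        # and adds the special tokens automatically; look up the padding,
        # truncation, and max_length parameters (BERT accepts at most 512 tokens).
        # ==================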

    def __len__(self):
        # ==================
        # YOUR CODE HERE
        # ==================

    def __getitem__(self, idx):
        # ==================
        # YOUR CODE HERE
        # ==================

## [Exercise 1.3] Create the model class.

Required Actions (Implementation of NLIModel):

* Initialization: Load `BertForSequenceClassification` with configurable model_name and num_labels=3.

* Forward Pass: Define forward pass.


In [None]:
class NLIModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=3):
        super(NLIModel, self).__init__()
        # ==================
        # YOUR CODE HERE
        # ==================
        # Load the model here.

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        # ==================
        # YOUR CODE HERE
        # ==================

        # Call the model here.

        return outputs

## [Exercise 1.4] Construct the training loop and train your model

Required Actions (Implementation of train_classifier):

* Setup: Initialize NLIModel (using bert-base-uncased), AdamW optimizer, and CrossEntropyLoss.

* Training Loop: Iterate over epochs, setting the model to train(), performing the forward pass, calculating loss, backpropagating, and stepping the optimizer for each batch.

* Validation Loop: Iterate over validation data using torch.no_grad(), setting the model to eval(), and calculating validation loss and accuracy.

* Checkpointing: Implement model saving (torch.save(model.state_dict(), "model.pt")) based on achieving the minimum validation loss.

* Logging: Calculate and print epoch-level average training loss/accuracy and validation loss/accuracy.

* Output: Return the final trained model object.
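
The control flow described above can be sketched with a small stand-in model. This is an assumption-laden illustration, not the lab solution: it trains an `nn.Linear` classifier on random features instead of BERT, the function name `fit` and the temporary checkpoint path are hypothetical, and batches are `(features, labels)` tuples rather than the dictionaries produced by `NLIDataset`. What it does show is the order of operations, in particular that the checkpoint comparison happens once per epoch, after the full validation pass.

```python
import os
import tempfile

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def fit(model, train_loader, val_loader, epochs, lr, ckpt_path):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    min_val_loss = float("inf")
    history = []

    for epoch in range(epochs):
        model.train()
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()

        model.eval()
        total_val_loss, correct, items = 0.0, 0, 0
        with torch.no_grad():
            for features, labels in val_loader:
                logits = model(features)
                total_val_loss += criterion(logits, labels).item()
                correct += (logits.argmax(dim=-1) == labels).sum().item()
                items += labels.size(0)

        # Compare once per epoch, using the average over all validation batches.
        avg_val_loss = total_val_loss / len(val_loader)
        if avg_val_loss < min_val_loss:
            min_val_loss = avg_val_loss
            torch.save(model.state_dict(), ckpt_path)
        history.append((avg_val_loss, correct / items))
        print(f"Epoch {epoch + 1}: Val Loss = {avg_val_loss:.4f} "
              f"Val Acc = {correct / items:.4f}")

    return model, history, min_val_loss


torch.manual_seed(317)
x = torch.randn(120, 8)                              # random stand-in features
y = (x[:, 0] > 0).long() + (x[:, 1] > 0).long()      # three classes: 0, 1, 2
train_loader = DataLoader(TensorDataset(x[:90], y[:90]), batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(x[90:], y[90:]), batch_size=16)

ckpt_path = os.path.join(tempfile.mkdtemp(), "model.pt")
model, history, min_val_loss = fit(
    nn.Linear(8, 3), train_loader, val_loader, epochs=3, lr=1e-2, ckpt_path=ckpt_path
)
```

With the lab's `NLIModel`, the forward call would take `input_ids` and `attention_mask` from each batch dictionary, and tensors would first be moved to `DEVICE`.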

In [None]:
def train_classifier(train_loader, val_loader, epochs=7, lr=1e-5):

    # ==================
    # YOUR CODE HERE
    # ==================
    for epoch in range(epochs):

        # ==================
        # YOUR CODE HERE
        # ==================

        for batch in tqdm(train_loader):

            # ==================
            # YOUR CODE HERE
            # ==================

        # ==================
        # YOUR CODE HERE
        # ==================

        with torch.no_grad():
            for batch in tqdm(val_loader):

              # ==================
              # YOUR CODE HERE
              # ==================

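        # Checkpoint once per epoch, after the whole validation pass
        # (min_val_loss must be initialised before the epoch loop, e.g. to float("inf")).
        if min_val_loss > total_val_loss/len(val_loader):
            min_val_loss = total_val_loss/len(val_loader)
            torch.save(model.state_dict(), "model.pt")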


        print(f"Epoch {epoch+1}: \nTrain Loss = {total_train_loss/len(train_loader):.4f} Train Acc = {total_train_correct/total_train_items:.4f}\nVal Loss = {total_val_loss/len(val_loader):.4f} Val Acc = {total_val_correct/total_val_items:.4f}")


    return model



## [Exercise 1.5] Construct the test loop and inference function

Required Actions (Implementation of test and inference):

Test Function (test):

* Enter evaluation mode (model.eval()) and disable gradient tracking (torch.no_grad()).

* Iterate over the test_loader to calculate cumulative loss and correct predictions.

* Compute and print the final average test loss and classification accuracy.

Inference Function (inference):

* Tokenize the input premise and hypothesis using the provided tokenizer.

* Perform a forward pass on the model using the tokenized inputs (without labels).

* Determine the final class prediction by taking the argmax of the output logits.
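
The last step, turning logits into a prediction, can be sketched as follows. The logits are made up for illustration, `decode_prediction` is a hypothetical helper name, and the id-to-name mapping assumes the label order of the Hugging Face SNLI release (0 = entailment, 1 = neutral, 2 = contradiction).

```python
import torch

ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_prediction(logits):
    """Map a (1, 3) logits tensor to (class id, label name, probability)."""
    probs = torch.softmax(logits, dim=-1)
    class_id = int(torch.argmax(probs, dim=-1).item())
    return class_id, ID2LABEL[class_id], float(probs[0, class_id])


logits = torch.tensor([[0.2, -1.0, 3.1]])  # made-up model output for one pair
class_id, label_name, confidence = decode_prediction(logits)
print(class_id, label_name, round(confidence, 3))
```

Softmax does not change which class wins, so `argmax` on the raw logits gives the same class id; the probabilities are only needed if you also want to report a confidence.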


In [None]:
def test(model, test_loader):
    # ==================
    # YOUR CODE HERE
    # ==================

    model.eval()
    with torch.no_grad():
      for batch in test_loader:
          # ==================
          # YOUR CODE HERE
          # ==================


    print(f"Test Loss = {total_loss/len(test_loader):.4f} Accuracy: {total_correct/total_items:.4f}")


def inference(model, tokenizer, premise, hypothesis, max_length):
    # ==================
    # YOUR CODE HERE
    # ==================

    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print(f"Predicted Class: {predicted_class}")
    return predicted_class



## [Exercise 1.6] Put it all together

Required Actions:

* Initialize Tokenizer
* Initialize the datasets (train, val, test)
* Initialize the dataloaders
* Train the model
* Load saved model and run test loop
* Run the `inference` method with an example


In [None]:
max_length = 512
model_name = 'bert-base-uncased'


# ==================
# YOUR CODE HERE
# ==================
print("Tokenizer initialized.")

# ==================
# YOUR CODE HERE
# ==================
print("Datasets initialized.")


# ==================
# YOUR CODE HERE
# ==================
print("Dataloaders initialized.")

# ==================
# YOUR CODE HERE
# ==================
print("Training Completed.")

# ==================
# YOUR CODE HERE
# ==================
print("Testing Completed.")

premise = "The sun is shining on the clear blue sky."
hypothesis = "It is pouring outside."
inference(model, tokenizer, premise, hypothesis, max_length)


# [Exercise 2] Out-of-distribution testing

NLI models are often trained and evaluated on the same dataset distribution (e.g., SNLI), achieving high in-domain accuracy.

However, when tested on out-of-distribution (OOD) datasets — such as MNLI, SICK, or ANLI — model performance typically drops sharply.
This suggests that the model may have learned dataset-specific biases or lexical artifacts, rather than robust inference skills.

Test your SNLI-trained model on another NLI dataset (MNLI). What do you observe? How does the model perform in an out-of-distribution setting?

In [None]:
# Validate on this dataset

mnli = load_dataset("SetFit/mnli", split='validation')

def clean_df_by_label(df):
    # ==================
    # YOUR CODE HERE
    # ==================

    return df

print(mnli)

# ==================
# YOUR CODE HERE
# ==================
mnli_val_df = mnli_val_df.rename(columns={'text1': 'premise', 'text2': 'hypothesis'})
print(mnli_val_df.head(10))

In [None]:
# Test the model trained on SNLI on the MNLI val split
# ==================
# YOUR CODE HERE
# ==================

# [Exercise 3] OOD solution?

Think of a solution to the OOD issue (this can also be a subject for your project). You can find an example [here](https://arxiv.org/abs/2502.09567).

You can also try to evaluate the accuracy of your model on different datasets, and then try to compose a 'general' dataset and train on that. What happens?

Another idea is to calibrate the model's predictions and see whether OOD performance improves.
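
As one concrete example of calibration, here is a minimal sketch of temperature scaling: the logits are divided by a scalar `T`, chosen to minimise the negative log-likelihood on held-out data. The lab text does not say which calibration method is intended, so this choice is an assumption; the validation logits below are invented to mimic an overconfident model, and the grid search stands in for a proper optimiser.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(all_logits, labels, temperature):
    """Average negative log-likelihood of the gold labels."""
    losses = [-math.log(softmax(logits, temperature)[gold])
              for logits, gold in zip(all_logits, labels)]
    return sum(losses) / len(losses)


def fit_temperature(all_logits, labels, grid):
    """Pick the temperature on the grid with the lowest validation NLL."""
    return min(grid, key=lambda t: nll(all_logits, labels, t))


# Invented validation logits: large margins, but two of four predictions are wrong.
val_logits = [[6.0, 0.0, 0.0], [0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 6.0]]
val_labels = [0, 1, 1, 0]

grid = [0.5 + 0.25 * i for i in range(39)]  # 0.5, 0.75, ..., 10.0
best_t = fit_temperature(val_logits, val_labels, grid)
print(best_t, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, best_t))
```

Dividing by a positive temperature never changes the argmax, so accuracy stays the same; any gain from this particular method shows up in confidence quality (for example NLL or expected calibration error), not in the accuracy numbers.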

# [Exercise 4] Artifacts in NLI

What is the complexity of certain NLI datasets (SNLI, MNLI, SICK, FEVER)? What artifacts do they hide?

An artifact refers to an unintended pattern or bias in the data that allows a model to make correct predictions without performing true inference.

For example, in the SNLI dataset, certain words or lexical cues in the hypothesis (like “no”, “nobody”) strongly correlate with specific labels (e.g., contradiction), enabling models to classify examples correctly by exploiting surface heuristics rather than understanding the semantic relationship between premise and hypothesis.

You can experiment by predicting SNLI labels from the hypotheses alone, ignoring the premises. What similar problems can you find in other datasets?
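
A first look at such lexical cues needs no model at all. The sketch below counts, for each hypothesis word, how often it co-occurs with each label and reports the dominant label and its share. The six hypotheses and their labels are invented for illustration (they are not SNLI rows), and `word_label_stats` and `min_count` are hypothetical names.

```python
from collections import Counter, defaultdict


def word_label_stats(hypotheses, labels, min_count=2):
    """For each frequent word, return (dominant label, share of that label, count)."""
    word_counts = Counter()
    word_label_counts = defaultdict(Counter)
    for text, label in zip(hypotheses, labels):
        for word in set(text.lower().split()):  # count each word once per hypothesis
            word_counts[word] += 1
            word_label_counts[word][label] += 1

    stats = {}
    for word, count in word_counts.items():
        if count >= min_count:
            label, hits = word_label_counts[word].most_common(1)[0]
            stats[word] = (label, hits / count, count)
    return stats


# Invented toy data; premises are deliberately ignored.
hypotheses = [
    "nobody is outside",
    "a man is sleeping",
    "nobody is playing",
    "a person is outdoors",
    "the man is not eating",
    "a woman is outdoors",
]
labels = ["contradiction", "contradiction", "contradiction",
          "entailment", "contradiction", "entailment"]

stats = word_label_stats(hypotheses, labels)
for word, (label, share, count) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(f"{word:10s} {label:14s} share={share:.2f} n={count}")
```

Words whose dominant-label share is far above the label's base rate are candidate artifacts. A stronger version of the same experiment trains a classifier on hypotheses only and compares its accuracy with the majority-class baseline.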

For further reading, see [this thesis](https://studenttheses.uu.nl/bitstream/handle/20.500.12932/40692/BA_thesis_improved.pdf?sequence=1&isAllowed=y).

# [Extra Work] Training a BERT from scratch

If you want to better understand how transformers (and BERT) work, you can try this lab from another course (Machine Learning 2), in which you build a BERT model layer by layer, from scratch.

[link](https://colab.research.google.com/drive/1NWHzFkeSCyd42RAOAV-caBYfWGWZiUsz?usp=sharing)