# CentraleSupelec - Natural language processing
# Practical session n°8
### Mohammed EL Hamidi


## Natural Language Inference (NLI)

NLI is a classical NLP (Natural Language Processing) problem: given two sentences (the premise and the hypothesis), decide how they are related (the premise *entails* the hypothesis, *contradicts* it, or *neither*).

Ex: 


| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |
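
In the Hugging Face version of SNLI, the label is stored as an integer. The sketch below assumes the usual mapping (0 = entailment, 1 = neutral, 2 = contradiction, with -1 marking pairs that have no gold label); it is worth confirming with `snli["train"].features["label"].names` on your own installation. `ID2LABEL` and `describe` are illustrative names, not part of any library.

```python
# Assumed SNLI label ids (verify with snli["train"].features["label"].names)
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}

# The three examples from the table above, written the way the dataset stores them
examples = [
    {"premise": "A man inspects the uniform of a figure in some East Asian country.",
     "hypothesis": "The man is sleeping.", "label": 2},
    {"premise": "An older and younger man smiling.",
     "hypothesis": "Two men are smiling and laughing at the cats playing on the floor.", "label": 1},
    {"premise": "A soccer game with multiple males playing.",
     "hypothesis": "Some men are playing a sport.", "label": 0},
]

def describe(example):
    """Return a one-line, human-readable summary of a labelled sentence pair."""
    return f'{ID2LABEL[example["label"]]}: "{example["premise"]}" -> "{example["hypothesis"]}"'

for ex in examples:
    print(describe(ex))
```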

### Stanford NLI (SNLI) corpus

In this labwork, I propose to use the Stanford NLI (SNLI) corpus ( https://nlp.stanford.edu/projects/snli/ ), available in the *Datasets* library by Huggingface.

    from datasets import load_dataset
    snli = load_dataset("snli")
    #Removing sentence pairs with no label (-1)
    snli = snli.filter(lambda example: example['label'] != -1) 

## Subject

You are asked to provide an operational Jupyter notebook that performs the task of NLI. For that, you need to tackle the following aspects of the problem:

1. Loading and preprocessing the data
2. Designing a PyTorch model that, given two sentences, decides how they are related (*entails*, *contradicts* or *neither*.)
3. Training and evaluating the model using appropriate metrics
4. (Optional) Letting users try the model (feed in their own sentences and easily visualize the prediction)
5. (Optional) Providing visual insight into the model (e.g. visualizing the attention if your model uses attention)

Although it is not mandatory, I suggest that you use a transformer model to perform the task. For that, you can use the *Transformers* library by Huggingface.

## Evaluation

The evaluation will be based on several criteria:

- Clarity and readability of the notebook. The notebook is the report of your project. Make it easy and pleasant to read.
- Justification of implementation choices (e.g. the network, the cost function, the optimizer, ...)
- Quality of the code. The various deep learning and NLP labworks provide many examples of good practices for designing experiments with neural networks. Use them as inspirational examples!

## Additional recommendations

- You are not seeking to publish a research paper! I'm not expecting state-of-the-art results! The idea of this labwork is to assess that you have integrated the skills necessary to handle textual data using deep neural network techniques.

- This labwork will be evaluated but we are still here to help you! Don't hesitate to request our help if you are stuck.

- If you intend to use BERT-based models, here is a piece of advice. The bert-base-* models available in *Transformers* need more than 12 GB of GPU memory to be fine-tuned. To avoid memory issues, you can use one of the following solutions:

    - Use a lighter BERT-based model such as DistilBERT, ALBERT, ...
    - Train a classification model on top of BERT without fine-tuning it (i.e. freezing the BERT weights)
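
The second option can be sketched as follows. This is a minimal illustration, not the recommended recipe: `encoder` is a small stand-in module used in place of a real pretrained model (an assumption made so the example runs without downloading anything), and `classifier` is a hypothetical 3-class head.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder (hypothetical; a real one would come from Transformers)
encoder = nn.Sequential(nn.Linear(16, 8), nn.Tanh())
classifier = nn.Linear(8, 3)  # 3 NLI classes

# Freeze the encoder: its parameters no longer receive gradients
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()  # also switches off dropout-like layers inside the frozen encoder

# Only the classifier parameters are handed to the optimizer
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

x = torch.randn(4, 16)            # fake inputs
y = torch.tensor([0, 1, 2, 0])    # fake labels

with torch.no_grad():             # no activations are kept for the encoder
    features = encoder(x)
loss = nn.functional.cross_entropy(classifier(features), y)
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters())
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

Because no gradients or optimizer states are stored for the frozen part, memory use during training is driven mostly by the small head.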

## Huggingface documentations

In case you want to use the Huggingface *Datasets* and *Transformers* libraries (which I advise), here are some useful documentation pages:

- Dataset quick tour

    https://huggingface.co/docs/datasets/quicktour.html
    
- Documentation on data preprocessing for transformers

    https://huggingface.co/transformers/preprocessing.html
    
- Transformers quick tour (with a DistilBERT example for classification)

    https://huggingface.co/transformers/quicktour.html
    


In [None]:
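# Each `!` command runs in its own subshell, so activating a virtual environment
# from a cell would not affect the running kernel. `%pip` installs the packages
# into the environment of the current kernel instead.
%pip install torch torchvision torchinfo tqdm matplotlib scikit-learn datasets sentence-transformers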

## Data loading and processing


This part prepares the SNLI dataset for use with a neural network: loading the dataset, filtering out unlabelled pairs, encoding each sentence into a dense vector with a Sentence Transformer, and setting up PyTorch DataLoaders for batch processing during training and validation.

In [None]:
import os
import random

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from torch.utils.data import Dataset

In [2]:
snli = load_dataset("snli")
snli = snli.filter(lambda example: example['label'] != -1) 


In [1]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda") if use_cuda else torch.device("cpu")
device

device(type='cuda')

In [None]:
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
def embed(text):
    return embedder.encode(text)

embedding_size=len(embed("hello world"))

In [3]:

class SNILDataset(Dataset):
    def __init__(
        self,
        subset,
        directory="./datasets/",
    ):
        possible_subsets = ["train", "validation", "test"]
        if subset not in possible_subsets:
            raise ValueError(
                "Possible values for 'subset' are: {} (given {})".format(
                    possible_subsets, subset
                )
            )

        self.subset = subset
        self.test = subset == "test"

    def __len__(self):
        return len(snli[self.subset])

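    def __getitem__(
        self,
        index,
    ):
        # Read the example once instead of indexing the dataset three times
        example = snli[self.subset][index]
        # Encode both sentences in a single call; the result is a NumPy array of shape (2, embedding_size)
        embeddings = embed([example["premise"], example["hypothesis"]])
        # torch.from_numpy avoids the slow conversion of a tuple of arrays into a tensor
        return torch.from_numpy(embeddings).float(), example["label"]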
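
Calling the encoder inside `__getitem__` re-encodes every sentence each time it is requested, which tends to dominate the training time. One possible alternative, sketched below, is to encode each distinct sentence once and look the vectors up afterwards. The sketch assumes that `encode_fn` behaves like `embedder.encode` called on a list of strings (one vector per sentence); `precompute_embeddings` and `fake_encode` are hypothetical names used only for illustration.

```python
import numpy as np

def precompute_embeddings(sentences, encode_fn, batch_size=256):
    """Encode every distinct sentence once and return a {sentence: vector} cache."""
    unique = sorted(set(sentences))  # premises are often shared by several pairs
    cache = {}
    for start in range(0, len(unique), batch_size):
        chunk = unique[start:start + batch_size]
        vectors = np.asarray(encode_fn(chunk), dtype=np.float32)
        cache.update(zip(chunk, vectors))
    return cache

# Toy encoder standing in for embedder.encode (hypothetical, for illustration only)
def fake_encode(batch):
    return [[float(len(s)), float(s.count(" "))] for s in batch]

premises = ["A soccer game.", "A soccer game.", "An older man smiling."]
cache = precompute_embeddings(premises, fake_encode, batch_size=2)
print(len(cache), cache["A soccer game."].shape)
```

With such a cache, `__getitem__` would only stack two cached vectors, and the DataLoader could use several workers without sharing the encoder.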
        


In [4]:
batch_size = 32
num_workers = 1

def get_dataloaders():

    print("  - Dataset creation")

    train_dataset = SNILDataset(
        subset="train",
    )
    valid_dataset = SNILDataset(
        subset="validation",
    )

    print(f"  - I loaded {len(train_dataset)} samples for training and {len(valid_dataset)} samples for validation")

    # Build the dataloaders
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        #num_workers=num_workers,
        pin_memory=use_cuda,
    )

    valid_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=batch_size,
        #num_workers=num_workers,
        pin_memory=use_cuda,
    )

    num_classes = 3
    print(f"  - Number of classes is {num_classes}")
    input_size = tuple(train_dataset[0][0].shape)
    print(f"  - Input size is {input_size}")

    return train_loader, valid_loader, input_size, num_classes


In [8]:
train_loader, valid_loader, input_size, num_classes = get_dataloaders()

  - Dataset creation
  - I loaded 549367 samples for training and 9842 samples for validation
  - Number of classes is 3
  - Input size is (2, 384)



### Why BiLSTM for NLI?

#### Sequential Data Understanding
BiLSTM networks are well suited to sequential data such as text. NLI requires reasoning over a pair of sentences (premise and hypothesis) to determine the relationship between them (entailment, contradiction, or neutral). A BiLSTM processes a sequence in both the forward and backward directions, so each position is represented with context from both sides, which helps capture the nuances needed for accurate inference.

#### Long-Term Dependencies
One of the key challenges in NLI is capturing long-term dependencies within and between sentences. The gating mechanism of LSTM cells is designed to address this issue, which helps the model handle complex sentence structures and the subtle meanings that influence inference.

#### Feature Extraction
The model's embedding layer, powered by Sentence Transformers, converts each sentence into a dense vector that captures semantic information. This pre-processing step is critical for NLI, as the relationship between sentences often hinges on deep semantic similarities or differences that surface-level features might not capture. Note that in this notebook each sentence is encoded as a single 384-dimensional vector, so the BiLSTM receives a sequence of length one per sentence.

### Why It Should Work

#### Rich Contextual Representation
The bidirectional nature of BiLSTMs allows the model to gather context from both the beginning and the end of a sentence, providing a richer representation of each sentence and its elements. This is beneficial for NLI, where the relation might depend on the context that precedes or follows a key piece of information.

#### Adaptability to Different Text Lengths
BiLSTMs adapt naturally to sequences of varying lengths, thanks to their recurrent structure. This flexibility is valuable for NLI, where premises and hypotheses can differ significantly in length and complexity.
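
To make the two directions concrete, the following sketch runs a small bidirectional `nn.LSTM` on random data and inspects the shapes it returns. The sizes are arbitrary and chosen only for illustration; they are not the ones used in this notebook.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden = 4, 7, 5   # illustrative sizes
lstm = nn.LSTM(hidden, hidden, num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(batch, seq_len, hidden)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (batch, seq_len, 2 * hidden): forward and backward states per position
print(h_n.shape)     # (num_layers * 2, batch, hidden): final state of each layer and direction

# The last two entries of h_n belong to the top layer: forward first, then backward
top = h_n[-2:].transpose(0, 1).reshape(batch, -1)
print(top.shape)     # (batch, 2 * hidden)
```

The forward final state summarizes the sequence read left to right, while the backward final state summarizes it read right to left; concatenating both gives one fixed-size vector per sentence.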

In [9]:
class BiLSTM(nn.Module):
	def __init__(self, embedding_dim, hidden_dim, dropout_rate, out_dim, batch_size, device):
		super(BiLSTM, self).__init__()
		self.batch_size = batch_size
		self.embed_dim = embedding_dim
		self.hidden_size = hidden_dim
		self.directions = 2
		self.num_layers = 2
		self.concat = 4
		self.projection = nn.Linear(self.embed_dim, self.hidden_size)
		self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, self.num_layers,
									bidirectional = True, batch_first = True, dropout = dropout_rate)
		self.relu = nn.ReLU()
		self.dropout = nn.Dropout(p = dropout_rate)

		self.lin1 = nn.Linear(self.hidden_size * self.directions * self.concat, self.hidden_size)
		self.lin2 = nn.Linear(self.hidden_size, self.hidden_size)
		self.lin3 = nn.Linear(self.hidden_size, out_dim)

		self.device = device

		for lin in [self.lin1, self.lin2, self.lin3]:
			nn.init.xavier_uniform_(lin.weight)
			nn.init.zeros_(lin.bias)

		self.out = nn.Sequential(
			self.lin1,
			self.relu,
			self.dropout,
			self.lin2,
			self.relu,
			self.dropout,
			self.lin3
		)

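	def forward(self, batch):
		# batch has shape (batch_size, 2, embed_dim): index 0 is the premise, index 1 the hypothesis.
		# Slicing with 0:1 and 1:2 keeps a sequence dimension of length 1 for the LSTM.
		premise_embed = batch[:, 0:1, :]
		hypothesis_embed = batch[:, 1:2, :]

		premise_proj = self.relu(self.projection(premise_embed))
		hypothesis_proj = self.relu(self.projection(hypothesis_embed))

		# Zero initial hidden and cell states, created on the same device as the input
		h0 = torch.zeros(self.num_layers * self.directions, batch.shape[0], self.hidden_size, device=batch.device)
		c0 = torch.zeros_like(h0)

		# The same LSTM (shared weights) encodes both sentences
		_, (premise_ht, _) = self.lstm(premise_proj, (h0, c0))
		_, (hypothesis_ht, _) = self.lstm(hypothesis_proj, (h0, c0))

		# Keep the top layer's final forward and backward states: (batch_size, 2 * hidden_size)
		premise = premise_ht[-2:].transpose(0, 1).contiguous().view(batch.shape[0], -1)
		hypothesis = hypothesis_ht[-2:].transpose(0, 1).contiguous().view(batch.shape[0], -1)

		# Matching features: both vectors, their absolute difference and their element-wise product
		combined = torch.cat((premise, hypothesis, torch.abs(premise - hypothesis), premise * hypothesis), 1)
		return self.out(combined)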
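
The classifier input concatenates four views of the sentence pair, which is why `lin1` expects `hidden_size * directions * concat` features. The sketch below reproduces only that step on random vectors, using the notebook's `hidden_dim=200` and `batch_size=32`; `u` and `v` are illustrative names for the premise and hypothesis representations.

```python
import torch

hidden_size, directions = 200, 2
u = torch.randn(32, hidden_size * directions)  # premise representation
v = torch.randn(32, hidden_size * directions)  # hypothesis representation

# [u, v] keeps the order of the pair; |u - v| and u * v describe how the two vectors relate
combined = torch.cat((u, v, torch.abs(u - v), u * v), dim=1)
print(combined.shape)  # torch.Size([32, 1600]) = 32 x (200 * 2 * 4)

# The two interaction terms are symmetric in (u, v); only the first two blocks encode direction
swapped = torch.cat((v, u, torch.abs(v - u), v * u), dim=1)
print(torch.equal(combined[:, 800:], swapped[:, 800:]))  # True
```

Since entailment is a directed relation, keeping `u` and `v` themselves in the concatenation matters: the interaction terms alone could not distinguish a pair from its reverse.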

In [10]:
def test(model, loader, f_loss, device):
    """
    Test a model over the loader
    using the f_loss as metrics
    Arguments :
    model     -- A torch.nn.Module object
    loader    -- A torch.utils.data.DataLoader
    f_loss    -- The loss function, i.e. a loss Module
    device    -- A torch.device
    Returns :
    The loss averaged over all the samples of the loader
    """

    # We enter eval mode.
    # This is important for layers such as dropout, batchnorm, ...
    model.eval()

    total_loss = 0
    num_samples = 0
    # Gradients are not needed for evaluation; disabling them saves memory and time
    with torch.no_grad():
        for inputs, targets in loader:

            inputs, targets = inputs.to(device), targets.to(device)

            # Compute the forward propagation
            outputs = model(inputs)

            loss = f_loss(outputs, targets)

            # Update the metrics
            # The loss is averaged over the batch, so we weight it by the batch size
            total_loss += inputs.shape[0] * loss.item()
            num_samples += inputs.shape[0]

    return total_loss / num_samples


        

def train(model, loader, f_loss, optimizer, device, dynamic_display=True):
    """
    Train a model for one epoch, iterating over the loader
    using the f_loss to compute the loss and the optimizer
    to update the parameters of the model.
    Arguments :
    model     -- A torch.nn.Module object
    loader    -- A torch.utils.data.DataLoader
    f_loss    -- The loss function, i.e. a loss Module
    optimizer -- A torch.optim.Optimizer object
    device    -- A torch.device
    Returns :
    The train loss averaged over all the samples seen during the epoch
    """

    # We enter train mode.
    # This is important for layers such as dropout, batchnorm, ...
    model.train()
    print("---- start training ----")
    total_loss = 0
    num_samples = 0
    for i, (inputs, targets) in (pbar := tqdm.tqdm(enumerate(loader))):

        inputs, targets = inputs.to(device), targets.to(device)

        # Compute the forward propagation
        outputs = model(inputs)

        loss = f_loss(outputs, targets)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the metrics
        # We here consider the loss is batch normalized
        total_loss += inputs.shape[0] * loss.item()
        num_samples += inputs.shape[0]
        pbar.set_description(f"Train loss : {total_loss/num_samples:.2f}")

    return total_loss / num_samples
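
Both loops multiply the batch loss by the batch size before accumulating it. The reason is that `nn.CrossEntropyLoss` averages over the batch by default, and the last batch is usually smaller than the others. The toy numbers below (made up for illustration) show the difference between averaging the batch means and weighting them by batch size.

```python
# Per-sample losses for 5 samples, split into batches of size 2 (the last batch has 1 sample)
sample_losses = [1.0, 3.0, 2.0, 4.0, 10.0]
batches = [sample_losses[i:i + 2] for i in range(0, len(sample_losses), 2)]

# What a mean-reduced loss function returns for each batch
batch_means = [sum(b) / len(b) for b in batches]

# Naive average: every batch counts the same, whatever its size
naive = sum(batch_means) / len(batch_means)

# Weighted average: each batch mean is weighted by its number of samples
weighted = sum(len(b) * m for b, m in zip(batches, batch_means)) / len(sample_losses)

print(naive)     # 5.0
print(weighted)  # 4.0, the true mean over the 5 samples
```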

In [11]:
import torch.optim as optim
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)

learning_rate=0.01
hidden_dim=200
dropout_rate=0.2

In [12]:
model = BiLSTM(embedding_dim=embedding_size, hidden_dim=hidden_dim, dropout_rate=dropout_rate, out_dim=num_classes, batch_size=batch_size, device=device).to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
f_loss=nn.CrossEntropyLoss()

In [16]:
import os
import torch

class ModelCheckpoint(object):
    """
    Checkpoint callback: saves the model weights whenever the monitored score improves.
    It does not stop the training by itself.
    """
    def __init__(
        self,
        model: torch.nn.Module,
        savepath,
        min_is_best: bool = True,
    ) -> None:
        self.model = model
        self.savepath = savepath
        self.best_score = None
        if min_is_best:
            self.is_better = self.lower_is_better
        else:
            self.is_better = self.higher_is_better

        # Ensure the target directory exists (fall back to "." for a bare file name)
        os.makedirs(os.path.dirname(savepath) or ".", exist_ok=True)

    def lower_is_better(self, score):
        return self.best_score is None or score < self.best_score

    def higher_is_better(self, score):
        return self.best_score is None or score > self.best_score

    def update(self, score):
        if self.is_better(score):
            torch.save(self.model.state_dict(), self.savepath)
            self.best_score = score
            return True
        return False

model_checkpoint = ModelCheckpoint(
    model, str("./best_model.pt"), min_is_best=True
)
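
The class above keeps the best weights on disk but does not itself interrupt training. If stopping is also wanted when training for several epochs, a patience counter could be used alongside it. The sketch below is one hypothetical way to do so: `EarlyStopping`, `patience` and the loss values are illustrative and not part of this notebook.

```python
class EarlyStopping:
    """Hypothetical helper: stop when the loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = None
        self.bad_epochs = 0

    def step(self, score):
        """Record a validation loss; return True when training should stop."""
        if self.best_score is None or score < self.best_score:
            self.best_score = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [0.90, 0.85, 0.86, 0.87, 0.80]  # made-up validation losses
stopped_at = None
for epoch, valid_loss in enumerate(history):
    if stopper.step(valid_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_score)  # 3 0.85
```

In this toy run the last value (0.80) is never reached, which illustrates the usual trade-off: a small patience saves time but may stop before a later improvement.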



In [17]:
# Train 1 epoch

for e in range(1):

    train_loss = train(
        model=model,
        loader=train_loader,
        f_loss=f_loss,
        optimizer=optimizer,
        device=device,
    )

    # Evaluate on the validation set and checkpoint the model if the loss improved
    valid_loss = test(model, valid_loader, f_loss, device)

    updated = model_checkpoint.update(valid_loss)


---- start training ----


Train loss : 0.89: : 17168it [1:18:36,  3.64it/s]


In [18]:
test_dataset = SNILDataset(
    subset="test",
)

# Build the dataloaders
test_loader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=batch_size,
    #num_workers=num_workers,
    pin_memory=use_cuda,
)

In [22]:
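def test(model, loader, device):
    """
    Test a model over the loader
    to compute accuracy.
    Note: this redefines the loss-based `test` function defined earlier;
    re-run that earlier cell before re-running the training loop.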
    Arguments :
    model     -- A torch.nn.Module object
    loader    -- A torch.utils.data.DataLoader
    device    -- A torch.device
    Returns :
    Accuracy as a float
    """

    # We enter eval mode.
    model.eval()

    correct_predictions = 0
    num_samples = 0
    with torch.no_grad():  # No need to track the gradients
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)

            # Compute the forward propagation
            outputs = model(inputs)

            # Get the predicted class by finding the max index in the logit dimension
            _, predicted = outputs.max(1)
            correct_predictions += predicted.eq(targets).sum().item()
            num_samples += targets.size(0)

    accuracy = correct_predictions / num_samples
    return accuracy

# Load the best model (map_location makes the checkpoint loadable on a CPU-only machine too)
model.load_state_dict(torch.load("./best_model.pt", map_location=device))

# Compute the accuracy on the test set
test_accuracy = test(model, test_loader, device)

print(f"Test Accuracy: {test_accuracy * 100:.2f}%")


Test Accuracy: 65.82%
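
Accuracy alone does not show which relations get confused with each other. If the three classes are roughly balanced, as is commonly reported for SNLI, chance level is about 33%, so a per-class breakdown helps to see where the remaining errors are. The sketch below builds a confusion matrix and per-class recall in plain Python; the targets and predictions are toy values, not results from this notebook, and the label order is the assumed SNLI one.

```python
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]  # assumed SNLI label order

def confusion_matrix(targets, predictions, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(targets, predictions))
    return [[counts[(t, p)] for p in range(num_classes)] for t in range(num_classes)]

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Toy values, for illustration only
targets     = [0, 0, 1, 1, 2, 2, 2, 0]
predictions = [0, 1, 1, 1, 2, 0, 2, 0]

matrix = confusion_matrix(targets, predictions)
for name, row, recall in zip(LABELS, matrix, per_class_recall(matrix)):
    print(f"{name:13s} {row} recall={recall:.2f}")
```

To apply it to the model, the `predicted` and `targets` tensors of each batch could be collected as Python lists (for example with `.tolist()`) inside the evaluation loop.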
