# CentraleSupelec - Natural language processing
# Practical session n°7

## Natural Language Inference (NLI)

Natural Language Inference (NLI) is a classical NLP (Natural Language Processing) problem that involves taking two sentences (the premise and the hypothesis) and deciding how they are related: the premise *entails* the hypothesis, *contradicts* it, or does *neither*.

Example:


| Premise | Label | Hypothesis |
| --- | --- | --- |
| A man inspects the uniform of a figure in some East Asian country. | contradiction | The man is sleeping. |
| An older and younger man smiling. | neutral | Two men are smiling and laughing at the cats playing on the floor. |
| A soccer game with multiple males playing. | entailment | Some men are playing a sport. |

### Stanford NLI (SNLI) corpus

In this labwork, I propose to use the Stanford NLI (SNLI) corpus (https://nlp.stanford.edu/projects/snli/), available in the *Datasets* library by Hugging Face.

    from datasets import load_dataset
    snli = load_dataset("snli")
    #Removing sentence pairs with no label (-1)
    snli = snli.filter(lambda example: example['label'] != -1) 
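
To make the filtering step concrete, here is a minimal sketch that mimics it on plain Python dictionaries instead of the real corpus. The rows and the `LABEL_NAMES` list are illustrative assumptions; the id order (0 = entailment, 1 = neutral, 2 = contradiction) follows the SNLI version distributed through *Datasets*, where -1 marks pairs without a gold label.

```python
# Toy stand-in for SNLI rows (hypothetical data, not loaded from the corpus).
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

rows = [
    {"premise": "A soccer game with multiple males playing.",
     "hypothesis": "Some men are playing a sport.", "label": 0},
    {"premise": "An older and younger man smiling.",
     "hypothesis": "Two men are smiling and laughing at the cats playing on the floor.",
     "label": 1},
    {"premise": "A man inspects the uniform of a figure in some East Asian country.",
     "hypothesis": "The man is sleeping.", "label": 2},
    {"premise": "A pair with no gold label.",
     "hypothesis": "Annotators disagreed.", "label": -1},
]

# Same predicate as the snli.filter(...) call: drop unlabeled pairs.
labeled = [row for row in rows if row["label"] != -1]

for row in labeled:
    print(LABEL_NAMES[row["label"]], "|", row["hypothesis"])
```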

## Subject

You are asked to provide an operational Jupyter notebook that performs the task of NLI. For that, you need to tackle the following aspects of the problem:

1. Loading and preprocessing the data
2. Designing a PyTorch model that, given two sentences, decides how they are related (*entails*, *contradicts* or *neither*)
3. Training and evaluating the model using appropriate metrics
4. (Optional) Letting users play with the model (feed in their own sentences and easily visualize the prediction)
5. (Optional) Providing visual insight into the model (e.g. visualizing the attention if your model uses attention)
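
For step 3, accuracy alone can hide weaknesses on one class, so a confusion matrix is a useful complement. The sketch below is one possible way to compute both in plain Python; the function names and the toy predictions are assumptions made for illustration.

```python
from collections import Counter

NUM_CLASSES = 3  # entailment, neutral, contradiction


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, num_classes=NUM_CLASSES):
    """matrix[i][j] counts examples with gold label i predicted as j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(i, j)] for j in range(num_classes)]
            for i in range(num_classes)]


# Toy gold labels and predictions (hypothetical values).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

print(accuracy(y_true, y_pred))
for row in confusion_matrix(y_true, y_pred):
    print(row)
```

Reading the matrix row by row shows which gold class gets confused with which prediction, something a single accuracy figure cannot tell you.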

Although it is not mandatory, I suggest that you use a transformer model to perform the task. For that, you can use the *Transformers* library by Hugging Face.

## Evaluation

The evaluation will be based on several criteria:

- Clarity and readability of the notebook. The notebook is the report of your project. Make it easy and pleasant to read.
- Justification of implementation choices (e.g. the network, the cost function, the optimizer, ...)
- Quality of the code. The various deep learning and NLP labworks provide many examples of good practices for designing experiments with neural networks. Use them as inspirational examples!

## Additional recommendations

- You are not seeking to publish a research paper! I'm not expecting state-of-the-art results! The idea of this labwork is to check that you have acquired the skills necessary to handle textual data using deep neural network techniques.

- This labwork will be evaluated but we are still here to help you! Don't hesitate to request our help if you are stuck.

- If you intend to use BERT-based models, here is a piece of advice. The bert-base-* models available in *Transformers* need more than 12 GB of GPU memory to be fine-tuned. To avoid memory issues, you have several options:

    - Use a lighter BERT-based model such as DistilBERT, ALBERT, ...
    - Train a classification model on top of BERT without fine-tuning it (i.e. freezing the BERT weights)
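
The second option can be sketched as follows. A small `nn.Sequential` stands in for the pretrained encoder so that the example runs without downloading anything; with a real model the same loop over `parameters()` would apply. The class name `FrozenEncoderClassifier` and the layer sizes are assumptions, not part of the assignment.

```python
import torch
import torch.nn as nn


class FrozenEncoderClassifier(nn.Module):
    """Trainable linear head on top of an encoder whose weights are frozen."""

    def __init__(self, encoder, hidden_size, num_labels=3):
        super().__init__()
        self.encoder = encoder
        # Freezing: no gradients are computed or stored for the encoder.
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, features):
        with torch.no_grad():
            hidden = self.encoder(features)
        return self.classifier(hidden)


# Stand-in encoder (hypothetical); a pretrained transformer would go here.
encoder = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
model = FrozenEncoderClassifier(encoder, hidden_size=32)

# Only the classifier parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

logits = model(torch.randn(4, 16))
print(logits.shape, sum(p.numel() for p in trainable))
```

Since the frozen encoder needs neither gradients nor optimizer state, the memory footprint during training should drop considerably, at the cost of a less adaptable representation.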

## Hugging Face documentation

In case you want to use the Hugging Face *Datasets* and *Transformers* libraries (which I advise), here are some useful documentation pages:

- Dataset quick tour

    https://huggingface.co/docs/datasets/quicktour.html
    
- Documentation on data preprocessing for transformers

    https://huggingface.co/transformers/preprocessing.html
    
- Transformers quick tour (with a DistilBERT example for classification)

    https://huggingface.co/transformers/quicktour.html
    


# Loading and preprocessing the data

In [1]:
import torch

In [4]:
from datasets import load_dataset
snli = load_dataset("snli")
#Removing sentence pairs with no label (-1)
snli = snli.filter(lambda example: example['label'] != -1) 

# 0 : entails
# 1 : neutral
# 2 : contradicts

In [4]:
len(snli["train"])

549367

In [5]:
from transformers import AutoConfig

model_name = "roberta-large"
config = AutoConfig.from_pretrained(model_name)
config

RobertaConfig {
  "_name_or_path": "roberta-large",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.39.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

In [7]:
from transformers import AutoTokenizer, AutoConfig, AutoModel
from tqdm import tqdm

# model_name = "distilbert-base-uncased"
# model_name = "bert-base-uncased"
# model_name = "gpt2-medium"
# model_name = "roberta-base"
# model_name = "roberta-large"
# model_name = "albert-large-v2"
# model_name = "gpt2"
model_name = "microsoft/deberta-v3-large"



tokenizer = AutoTokenizer.from_pretrained(model_name)


def preprocess_data(examples):
    # Tokenize each (premise, hypothesis) pair as one sequence.
    # Padding is deferred to the Dataset class, so padding=False here.
    return tokenizer(examples['premise'], examples['hypothesis'], padding=False, truncation=True, max_length=512)

tokenized_datasets = snli.map(
    preprocess_data,
    batched=True,
)



In [2]:
MODEL_NAME = 'microsoft/deberta-v3-large'
# model = AutoModel.from_pretrained(MODEL_NAME)
# config = AutoConfig.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)



In [15]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
tokenizer.add_special_tokens({'cls_token': '[CLS]'})
tokenizer.add_special_tokens({'sep_token': '[SEP]'})


1

In [13]:
tokenizer("Hi")
tokenizer("What's up")
tokenizer("<")
tokenizer("[PAD]")
tokenizer("[CLS]")
# tokenizer("[SEP]")
# tokenizer.eos_token


{'input_ids': [2, 2, 3], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

In [6]:
tokenizer(["Hi", "Hello"], ["What's up", "How are you doing"], padding=False, truncation=True, max_length=512)

{'input_ids': [[1, 2684, 2, 458, 280, 268, 322, 2], [1, 5365, 2, 577, 281, 274, 653, 2]], 'token_type_ids': [[0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [10]:
#tokenizer name : 
tokenizer.__class__.__name__

'GPT2TokenizerFast'

In [18]:
tokenized_datasets['test']

Dataset({
    features: ['premise', 'hypothesis', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 9824
})

In [21]:
tokenizer(tokenizer.cls_token + "Hi, how are you. I am fine?" + tokenizer.sep_token + "okoeke", padding = False, return_tensors='pt')

{'input_ids': tensor([[ 101,  101, 7632, 1010, 2129, 2024, 2017, 1012, 1045, 2572, 2986, 1029,
          102, 7929, 8913, 3489,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [4]:
print(tokenizer(["Hey bro, !" + tokenizer.sep_token + "This is my moment"], truncation=True, padding=True, max_length=128))
print(tokenizer.sep_token_id)
print(tokenizer.cls_token_id)

{'input_ids': [[101, 4931, 22953, 1010, 999, 102, 2023, 2003, 2026, 2617, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
102
101


In [4]:
import torch
from torch.utils.data import Dataset


class NLI_Dataset(Dataset):
    def __init__(self, tokenized_datasets, tokenizer, max_len, split = 'train', augmentation = False, size: int = None):
        if size is not None:
            self.input_ids = tokenized_datasets[split]['input_ids'][:size]
            self.attention_mask = tokenized_datasets[split]['attention_mask'][:size]
            self.labels = tokenized_datasets[split]['label'][:size]
        else:
            self.input_ids = tokenized_datasets[split]['input_ids']
            self.attention_mask = tokenized_datasets[split]['attention_mask']
            self.labels = tokenized_datasets[split]['label']
        self.augmentation = augmentation
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.input_ids[idx])
        attention_mask = torch.tensor(self.attention_mask[idx])
        labels = torch.tensor(self.labels[idx])

        pad_token_id = self.tokenizer.pad_token_id
            
        if self.augmentation:
            # transform() only takes the token ids; the mask length is unchanged by the swap
            input_ids = self.transform(input_ids)

        if input_ids.size(0) > self.max_len:
            input_ids = input_ids[:self.max_len]
            attention_mask = attention_mask[:self.max_len]
        elif input_ids.size(0) < self.max_len:
            input_ids = torch.cat((input_ids, torch.tensor([pad_token_id] * (self.max_len - input_ids.size(0))))).long()
            attention_mask = torch.cat((attention_mask, torch.tensor([0] * (self.max_len - attention_mask.size(0))))).long()
        return input_ids, attention_mask, labels

    # Data augmentation: swap premise and hypothesis with probability 0.5
    def transform(self, input_id):
        sep_token_id = self.tokenizer.sep_token_id
        cls_token_id = self.tokenizer.cls_token_id

        sep_indices = torch.where(input_id == sep_token_id)[0]
        # Expect "[CLS] premise [SEP] hypothesis [SEP]"; leave any other layout untouched
        if sep_indices.numel() != 2:
            return input_id

        premise = input_id[1:sep_indices[0]]
        hypothesis = input_id[sep_indices[0]+1:sep_indices[1]]
        input_id_transformed = torch.cat((torch.tensor([cls_token_id]), hypothesis, torch.tensor([sep_token_id]), premise, torch.tensor([sep_token_id])))
        # Apply the swap half of the time
        if torch.rand(1) > 0.5:
            return input_id_transformed
        else:
            return input_id
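
`NLI_Dataset` pads every example to `max_len`, which is simple but means that batches are mostly padding when `max_len` is 512 and the sentence pairs are much shorter. One alternative, sketched below in plain Python, is to pad each batch only to its longest sequence. `pad_batch` and the toy ids are hypothetical; in practice this logic could live in a `collate_fn` passed to the `DataLoader`.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad token-id lists to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Toy batch of tokenized pairs (hypothetical ids).
batch = [[1, 2684, 2, 458, 2], [1, 5365, 2]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```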


In [5]:
nli_dataset = NLI_Dataset(tokenized_datasets, tokenizer, augmentation=True)

TypeError: __init__() missing 1 required positional argument: 'max_len'

In [12]:
cst_len = None
for i, inp in enumerate(tokenized_datasets[split]['input_ids']):
    if cst_len is None:
        cst_len = len(inp)
    elif cst_len != len(inp):
        print(f"Error : {cst_len} != {len(inp)}")
        print(f"Error : {i}, {inp}")
        break

Error : 248 != 208
Error : 1000, [101, 1055, 1049, 1037, 1048, 1048, 1048, 1037, 1057, 1043, 1044, 1045, 1050, 1043, 1039, 1044, 1045, 1048, 1040, 1059, 1045, 1056, 1044, 1038, 1048, 1051, 1050, 1040, 1011, 1044, 1037, 1045, 1054, 1055, 1045, 1056, 1056, 1045, 1050, 1043, 1037, 1056, 1037, 1056, 1037, 1038, 1048, 1041, 1044, 1051, 1048, 1040, 1045, 1050, 1043, 1037, 1043, 1054, 1041, 1041, 1050, 1055, 1045, 1052, 1052, 1061, 1039, 1057, 1052, 1012, 102, 1056, 1044, 1041, 1039, 1044, 1045, 1048, 1040, 1045, 1055, 1044, 1037, 1052, 1052, 1061, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [49]:
idx = 703
snli["train"]["premise"][idx], snli["train"]["hypothesis"][idx]

('A man carrying a load of fresh direct boxes on car with wheels in the city streets, as a woman walks towards him.',
 'A man delivers groceries to shut-in seniors in the city.')

In [51]:
nli_dataset.augmentation = False
inp, att, label = nli_dataset[703]
inp

tensor([   0,  250, 1437, 1437,  475,   10,  295, 1437, 1437,  740,   10,  910,
         910, 1423,  939,  295,  821, 1437, 1437,   10, 1437, 1437,  784, 1021,
          10,  385, 1437, 1437, 1021,  856, 1437, 1437,  856,  910,  364,  579,
        1368, 1437, 1437,  385,  939,  910,  364,  740,  326, 1437, 1437,  741,
        1021, 3023,  364,  579, 1437, 1437, 1021,  295, 1437, 1437,  740,   10,
         910, 1437, 1437,  885,  939,  326, 1368, 1437, 1437,  885, 1368,  364,
         364,  784,  579, 1437, 1437,  939,  295, 1437, 1437,  326, 1368,  364,
        1437, 1437,  740,  939,  326, 1423, 1437, 1437,  579,  326,  910,  364,
         364,  326,  579, 2156, 1437, 1437,   10,  579, 1437, 1437,   10, 1437,
        1437,  885, 1021,  475,   10,  295, 1437, 1437,  885,   10,  784,  449,
         579, 1437, 1437,  326, 1021,  885,   10,    2])

In [42]:
import numpy as np
rand = np.random.randint(0, len(nli_dataset))
print(rand)
inp, att, lab = nli_dataset[rand]
inp

703


IndexError: index 1 is out of bounds for dimension 0 with size 1

In [37]:
def transform(inp, att, lab):
    # inp : [CLS] premise [SEP] hypothesis [EOS]
    # [CLS] : 101
    # [SEP] : 102
    # [EOS] : 102
    sep_indices = torch.where(inp == 102)[0]
    premise = inp[1:sep_indices[0]]
    hypothesis = inp[sep_indices[0]+1:sep_indices[1]]
    inp_transformed = torch.cat((torch.tensor([101]), hypothesis, torch.tensor([102]), premise, torch.tensor([102])))
    # padding
    inp_transformed = torch.cat((inp_transformed, torch.zeros(128-len(inp_transformed), dtype=torch.long)))
    return inp_transformed, att, lab


In [33]:
inp

tensor([ 101, 1039, 1044, 1045, 1048, 1040, 1054, 1041, 1050, 1055, 1049, 1045,
        1048, 1045, 1050, 1043, 1037, 1050, 1040, 1059, 1037, 1058, 1045, 1050,
        1043, 1037, 1056, 1039, 1037, 1049, 1041, 1054, 1037,  102, 1056, 1044,
        1041, 1047, 1045, 1040, 1055, 1037, 1054, 1041, 1042, 1054, 1051, 1059,
        1050, 1045, 1050, 1043, 1031, 1041, 2891, 1033,  102,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])

In [38]:
inp_transformed, att, lab = transform(inp, att, lab)
inp_transformed

tensor([ 101, 1056, 1044, 1041, 1047, 1045, 1040, 1055, 1037, 1054, 1041, 1042,
        1054, 1051, 1059, 1050, 1045, 1050, 1043, 1031, 1041, 2891, 1033,  102,
        1039, 1044, 1045, 1048, 1040, 1054, 1041, 1050, 1055, 1049, 1045, 1048,
        1045, 1050, 1043, 1037, 1050, 1040, 1059, 1037, 1058, 1045, 1050, 1043,
        1037, 1056, 1039, 1037, 1049, 1041, 1054, 1037,  102,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])

In [10]:
import torch
import torch.nn as nn
import gc
from transformers import AutoModel, AutoConfig, AutoModelForSequenceClassification

class NLIModel(nn.Module):
    def __init__(self, model_name):
        super(NLIModel, self).__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.transformer.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask):
        outputs = self.transformer(input_ids, attention_mask=attention_mask)
        hidden_state = outputs[0]  # (batch_size, sequence_length, hidden_size)
        pooled_output = hidden_state[:, 0]  # (batch_size, hidden_size)
        logits = self.classifier(pooled_output)
        return logits

In [37]:
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.to(torch.device("cuda"))

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=3, bias=False)
)

In [10]:
outputs = model.transformer(inp.unsqueeze(0), attention_mask=att.unsqueeze(0))

In [27]:
for o in outputs.items():
    print(o[1].shape)

torch.Size([1, 128, 768])


In [6]:
# nli_dataset = NLI_Dataset(tokenized_datasets, size=1000)
nli_dataset = NLI_Dataset(tokenized_datasets)

In [11]:
model = NLIModel(model_name)
model_config = AutoConfig.from_pretrained(model_name)


In [17]:
from transformers import DebertaTokenizer, AutoModel

tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModel.from_pretrained("microsoft/deberta-v3-large")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DebertaV2TokenizerFast'. 
The class this function is called from is 'DebertaTokenizer'.


TypeError: expected str, bytes or os.PathLike object, not NoneType

In [12]:
model("Coucou")

TypeError: forward() missing 1 required positional argument: 'attention_mask'

In [12]:
max_seq_length = model_config.max_position_embeddings
max_seq_length

512

In [13]:
nli_dataset = NLI_Dataset(tokenized_datasets, tokenizer, max_seq_length)

In [15]:


inp, att, lab = nli_dataset[0]


outputs = model(inp.unsqueeze(0), att.unsqueeze(0))
print(f"Output : {outputs}, Label : {lab}")

Output : tensor([[ 0.0944, -0.4864, -0.0281]], grad_fn=<AddmmBackward0>), Label : 1
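
The raw output above is a vector of logits, one per class. Here is a small sketch of how it could be turned into a readable prediction, using the logits printed above and the label order of the dataset; `softmax` and `LABEL_NAMES` are helper names introduced here for illustration.

```python
import math

LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0944, -0.4864, -0.0281]  # values copied from the output above
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(LABEL_NAMES[predicted], [round(p, 3) for p in probs])
```

The largest logit is at index 0, so the prediction is "entailment", while the gold label printed in the same cell is 1 (neutral).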


In [8]:
model.load_state_dict(torch.load('model.pth'))

<All keys matched successfully>

In [9]:
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm, trange
from sklearn.metrics import accuracy_score

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

batch_size = 32

dataloader = DataLoader(nli_dataset, batch_size=batch_size, shuffle=True)

optimizer = optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

epochs = 5

for epoch in range(epochs):
    model.train()
    train_loss = 0
    train_acc = 0
    train_steps = 0

    progress_bar = tqdm(dataloader, desc=f"Epoch {epoch + 1}")

    for batch, (input_ids, attention_mask, labels) in enumerate(progress_bar):
        optimizer.zero_grad()
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_steps += 1

        logits = outputs.detach().cpu()
        label_ids = labels.detach().cpu()
        batch_acc = accuracy_score(label_ids, logits.argmax(dim=1))
        train_acc += batch_acc * len(label_ids)

        # Update the running loss and accuracy after each batch
        current_loss = train_loss / (batch + 1)
        # The last batch may be smaller than batch_size, so cap the sample count
        current_acc = train_acc / min((batch + 1) * batch_size, len(dataloader.dataset))
        progress_bar.set_postfix({'Current Loss': f'{current_loss:.3f}', 'Current Acc': f'{current_acc:.3f}'})

    train_loss /= train_steps
    train_acc /= len(dataloader.dataset)

    print(f"Epoch: {epoch + 1}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.3f}")

Epoch 1:  32%|███▏      | 5574/17168 [07:56<16:31, 11.69it/s, Current Loss=0.356, Current Acc=0.856] 


KeyboardInterrupt: 