# CentraleSupelec - Natural language processing
# Practical session n°8

### Mohammed EL Hamidi

In [None]:
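# Each "!" command runs in its own subshell, so `!source .../activate` would not
# change the environment of the notebook kernel. To use a virtual environment,
# create and activate it before starting Jupyter; here we install into the kernel's environment.
%pip install torch transformers datasets scikit-learn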


## Data loading and processing

In [1]:
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
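# transformers.AdamW is deprecated; torch.optim.AdamW is the PyTorch equivalent
# (note its default weight_decay is 0.01 rather than 0.0).
from torch.optim import AdamW
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification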
from sklearn.metrics import accuracy_score


#### This part prepares the SNLI dataset for the model. We load the dataset, drop the examples that have no gold label (marked with label -1), tokenize each premise/hypothesis pair into subword token IDs padded or truncated to 128 tokens, and wrap the training and validation splits in PyTorch DataLoaders for batched processing.

In [2]:
dataset = load_dataset("snli")
dataset = dataset.filter(lambda example: example['label'] != -1)  # Remove examples without a label

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['premise'], examples['hypothesis'], padding='max_length', truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


Map: 100%|██████████| 9824/9824 [00:03<00:00, 2661.58 examples/s]
Map: 100%|██████████| 9842/9842 [00:03<00:00, 2589.71 examples/s]
Map: 100%|██████████| 549367/549367 [03:19<00:00, 2746.86 examples/s]
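
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of the shape of the tokenizer's output. The helper `pad_or_truncate`, the token IDs, and the padding ID of 0 are assumptions made for illustration. The real tokenizer also inserts the `[CLS]` and `[SEP]` special tokens and decides which of the two sentences to shorten, which this sketch leaves out.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]        # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)       # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token IDs, not real DistilBERT vocabulary entries.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, every example ends up with exactly 128 IDs, so the examples of a batch can be stacked into a tensor of fixed shape. The attention mask tells the model which positions to ignore.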


In [3]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

train_dataset = tokenized_datasets['train']
val_dataset = tokenized_datasets['validation']

train_loader = DataLoader(train_dataset, shuffle=True, batch_size=8)
val_loader = DataLoader(val_dataset, batch_size=8)
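
As a rough sanity check on the cost of one epoch, the number of batches follows directly from the split sizes and the batch size. The sketch below takes the example counts from the `Map` progress bars above and assumes the largest count is the training split and 9842 is the validation split. Since `DataLoader` keeps the last incomplete batch by default, the batch count is a ceiling division.

```python
import math

batch_size = 8
n_train = 549367   # largest count in the Map output, assumed to be the training split
n_val = 9842       # assumed to be the validation split

train_batches = math.ceil(n_train / batch_size)   # optimizer steps per epoch
val_batches = math.ceil(n_val / batch_size)
last_val_batch = n_val - (val_batches - 1) * batch_size   # size of the final, smaller batch
print(train_batches, val_batches, last_val_batch)
```

Under these assumptions, one epoch is about 68,671 optimizer steps, which explains why a full three-epoch run is long at this batch size.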


#### In this section, we initialize the model. We load a pre-trained DistilBERT model from the Hugging Face Transformers library and attach a sequence classification head, which is newly initialized and still has to be fine-tuned. The head outputs three logits, one per SNLI label (entailment, neutral, contradiction). We also set up an AdamW optimizer with a learning rate of 5e-5, which updates the model's weights during training based on the computed gradients.

In [4]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)
model.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

optimizer = AdamW(model.parameters(), lr=5e-5)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
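
The three outputs of the classification head are raw logits, not probabilities. A minimal sketch of how they could be turned into a label name is given below, assuming the SNLI label order entailment, neutral, contradiction for IDs 0, 1, 2. The logits and the helper `predict_label` are made up for illustration.

```python
import math

# Assumed label order for the SNLI label IDs 0, 1, 2.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most probable label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]

logits = [2.0, 0.5, -1.0]   # made-up logits for one premise/hypothesis pair
label, confidence = predict_label(logits)
print(label, round(confidence, 3))
```

Before fine-tuning, the head's weights are random, so its predictions carry no information. This is what the warning above refers to.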


## Training and validating

#### We iterate over the training set in batches, computing the loss (how far the model's predictions are from the actual labels) and adjusting the model's weights to reduce it. Training is configured for a fixed number of epochs (three here); in the run shown below it was interrupted manually after the first epoch completed, so the reported validation accuracy comes from a model trained for one full epoch and possibly part of a second. After training, we evaluate the model on the separate validation set to gauge how well it generalizes. Accuracy, the fraction of predictions that match the true labels, is reported as the performance measure.

In [5]:
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    print(f"Epoch {epoch+1} | Loss: {total_loss / len(train_loader)}")


Epoch 1 | Loss: 0.47950792495433675


KeyboardInterrupt: 
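
To put the epoch 1 loss in context, it helps to know what an uninformed model would score. With integer class labels and three outputs, the loss is a cross-entropy, so a model that gives equal probability to all three classes has a loss of ln 3, about 1.0986. The sketch below computes this by hand; the function `cross_entropy` and the logits are illustrative and are not taken from the library.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

uniform_loss = cross_entropy([0.0, 0.0, 0.0], label=1)     # no preference among classes
confident_loss = cross_entropy([4.0, 0.0, 0.0], label=0)   # made-up confident, correct logits
print(uniform_loss, confident_loss)
```

The average training loss of about 0.48 reported above is well below this uniform baseline, which suggests the model has already learned a useful signal during the first epoch.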

In [6]:
model.eval()
total_eval_accuracy = 0

for batch in val_loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    total_eval_accuracy += (predictions == batch['labels']).sum().item()

print(f"Validation Accuracy: {total_eval_accuracy / len(val_dataset)}")


Validation Accuracy: 0.8669985775248933
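
The accuracy computation in the evaluation loop can be checked by hand. The sketch below uses a hypothetical `accuracy` helper on made-up predictions, then backs out the number of correct validation predictions from the reported value, assuming the validation split holds 9842 labeled examples (the count shown in the `Map` output).

```python
def accuracy(predictions, labels):
    """Fraction of predictions equal to the corresponding label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy predictions and labels (made up): 4 of 5 match.
toy_acc = accuracy([0, 2, 1, 1, 0], [0, 2, 1, 0, 0])

# Recover the count of correct predictions from the reported accuracy.
reported = 0.8669985775248933
n_val = 9842   # assumed size of the filtered validation split
n_correct = round(reported * n_val)
print(toy_acc, n_correct, n_val - n_correct)
```

Under this assumption, the reported accuracy corresponds to 8533 correct predictions and 1309 errors.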
