# Fine-tuning DistilBERT with your own dataset for a multi-class classification task

As natural language processing (NLP) has rapidly evolved over the past few years, transformer-based architectures have shown tremendous success on various NLP tasks. The attention mechanism that allows the model to capture contextual information has been a significant breakthrough in the field of NLP. One of the popular transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). BERT has been successful in various NLP tasks, such as text classification, named entity recognition, question answering, and many more.

In this blog post, we will focus on fine-tuning DistilBERT, a smaller and faster version of BERT, on our own dataset for a multi-class classification task. We will cover the following topics:

* Introduction to tokenizer.
* Overview of DistilBERT.
* Preparing data for the multi-class classification task.
* Training and evaluating DistilBERT on our own dataset.

## Introduction to Tokenizer 

Tokenization is the process of splitting text into smaller units called tokens, where each token is a word, a punctuation mark, or a part of a word. The tokenizer is an essential component of any NLP pipeline because it converts raw text into a form the model can process: it maps each token to an integer ID from a fixed vocabulary, and the resulting sequence of IDs is the model's input.
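To make the token-to-ID mapping concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece. The vocabulary and IDs are made up for illustration, and the helper names `wordpiece` and `encode` are hypothetical; the real DistilBERT tokenizer has a far larger vocabulary and extra normalization rules.

```python
# Toy vocabulary: hypothetical tokens and IDs, not the real DistilBERT vocabulary.
TOY_VOCAB = {"[UNK]": 0, "token": 1, "##ize": 2, "##r": 3, "the": 4, "##s": 5}


def wordpiece(word, vocab):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, shrinking until something matches
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and map every piece to its integer ID."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("The tokenizers")
print(tokens)  # ['the', 'token', '##ize', '##r', '##s']
print(ids)     # [4, 1, 2, 3, 5]
```

The `##` prefix marks a piece that continues the previous one, which is how a word missing from the vocabulary can still be represented by known parts.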

In this blog post, we will use the `AutoTokenizer` class from Hugging Face's transformers library. It automatically selects the tokenizer that matches a given pre-trained model, so we don't have to pick one manually.

## Overview of DistilBERT 

DistilBERT is a smaller and faster version of BERT that has been pre-trained on large datasets. It is produced by knowledge distillation: a compact student model is trained to reproduce the behavior of the full BERT model while using fewer layers. According to its authors, DistilBERT is about 40% smaller than BERT and about 60% faster while retaining about 97% of its language-understanding performance. The lower compute and memory cost makes it convenient to fine-tune on smaller datasets.

## Preparing data for the multi-class classification task

To train and evaluate DistilBERT on our own dataset, we need to convert the dataset into a format the model can consume. In this blog post, we will use a dataset of labeled tweets and classify each tweet into one of three categories: hate speech, offensive language, or neither.

First, we will load the dataset using the pandas library and split it into the training, validation, and test sets. We will then create a custom dataset class that will read the data from a CSV file and tokenize each tweet using the AutoTokenizer provided by Hugging Face. Finally, we will create data loaders for the training, validation, and test sets using the PyTorch DataLoader.

## Training and evaluating DistilBERT on our own dataset 

We will use the [Hate Speech and Offensive Language Dataset](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset), which contains thousands of labeled tweets in three classes: hate speech, offensive language, and neither. The class labels are already encoded as 0, 1, and 2, respectively. We will load DistilBERT with the Hugging Face transformers library and fine-tune it on this dataset for the classification task.


## Introduction to Transformers

Transformers are a type of neural network architecture that has been used to achieve state-of-the-art performance on a wide range of natural language processing tasks, including text classification. Transformers are designed to process sequences of input data (e.g., sequences of words in a sentence) by leveraging attention mechanisms to selectively focus on different parts of the input sequence. This allows the model to capture long-range dependencies between different parts of the input sequence, which can be important for tasks such as deciding whether a tweet is offensive.
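As a rough illustration of the attention mechanism, the sketch below computes scaled dot-product attention in plain Python. It is deliberately simplified: a real transformer layer first applies learned projection matrices to obtain queries, keys, and values and runs several attention heads in parallel, all of which is omitted here. The input vectors are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three toy 2-dimensional token vectors (made-up numbers)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(x, x, x):
    print([round(v, 3) for v in row])
```

Every output row mixes information from all three input positions, with the mixing weights determined by how similar the positions are to each other.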

## The Dataset

For this tutorial, we will use a dataset of tweets in which each tweet carries one of three class labels: 0 (hate speech), 1 (offensive language), or 2 (neither). The dataset contains thousands of such tweets. We will use it to train and evaluate our classifier.

## Preprocessing the Data

We will start by preprocessing the data, which involves loading the dataset and partitioning it into training, validation, and testing sets. We will use the load_dataset_into_dataframe() function to load the dataset from a CSV file and convert it into a pandas DataFrame.

```
df = load_dataset_into_dataframe()
```

Next, we will partition the data into training, validation, and testing sets using the partition_dataset() function. This function shuffles the data and partitions it into three sets based on a 70/15/15 split.

```
partition_dataset(df)
```
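The arithmetic behind the 70/15/15 split can be sketched as follows. The helper name `split_sizes` is hypothetical; it assumes, as the full `partition_dataset()` implementation later in this post does, that the training and validation counts are rounded down and the test set takes whatever rows remain.

```python
def split_sizes(num_rows, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) row counts; the test set takes the remainder."""
    num_train = int(num_rows * train_frac)  # rounded down
    num_val = int(num_rows * val_frac)      # rounded down
    num_test = num_rows - num_train - num_val
    return num_train, num_val, num_test


print(split_sizes(907))  # (634, 136, 137)
```

Because of the rounding, the test set can end up one or two rows larger than the validation set even though both are nominally 15%.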

## Tokenization

Before we can train our classifier, we need to tokenize the text data. Tokenization involves breaking up the text into individual words or subwords, which can then be fed into the model as input. We will use the `AutoTokenizer` class from the transformers library to tokenize the text data.

```
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

We will use the `CustomDataset` class to create a PyTorch Dataset object that can be fed into the model during training. The `CustomDataset` class reads the CSV files containing the text data and labels, tokenizes the text using the `tokenizer`, and returns the tokenized data along with the labels.

```
train_dataset = CustomDataset("train.csv")
val_dataset = CustomDataset("val.csv")
test_dataset = CustomDataset("test.csv")
```

## Training the Model

We will use the `AutoModelForSequenceClassification` class from the transformers library to load the pre-trained DistilBERT weights and attach a classification head with three outputs. We will then fine-tune the model on our training data using the `train()` function.

```
device = "cuda"
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
model.to(device)  # the model must be on the same device as the batches
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
train(5, model, optimizer, train_loader, val_loader, device=device)
```

The `train()` function trains the model for a given number of epochs on the training data and evaluates it on the validation data after each epoch. This snippet uses the Adam optimizer with a learning rate of 2e-5 and a batch size of 16; the full script later in this post uses a learning rate of 5e-5 and trains for 3 epochs.

## Adam Optimizer

Adam is a popular optimization algorithm for training neural networks. It is a variant of the stochastic gradient descent (SGD) algorithm that uses adaptive learning rates.

In simple terms, Adam keeps two running averages for every parameter: one of its recent gradients and one of its recent squared gradients. Each update divides the first by the square root of the second, so parameters with consistently large gradients take proportionally smaller steps and parameters with small gradients take larger ones. This per-parameter scaling makes training less sensitive to the choice of the global learning rate and often leads to faster convergence.

To summarize, Adam updates the weights and biases of a neural network using step sizes adapted to each parameter's gradient history.
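The Adam update can be sketched for a single parameter in plain Python. This is an illustrative toy under stated assumptions, not the `torch.optim.Adam` implementation: the function name `adam_minimize` and the quadratic objective are made up for the example, while the default values of `beta1`, `beta2`, and `eps` follow the commonly used defaults.

```python
import math


def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Minimize a one-parameter function with Adam, given its gradient."""
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Step scaled by the gradient history of this parameter
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = x^2 with gradient 2x; the minimum is at x = 0
print(adam_minimize(lambda x: 2 * x, x=5.0, steps=1))    # about 4.9: the first step has size ~lr
print(adam_minimize(lambda x: 2 * x, x=5.0, steps=100))  # closer to the minimum at 0
```

Note that the very first step has a size close to `lr` regardless of how large the gradient is, which is one reason Adam is fairly forgiving about gradient scale.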

## Stochastic gradient descent (SGD) 

Stochastic gradient descent is a popular optimization algorithm for training neural networks. It is a variant of the gradient descent algorithm that updates the model's parameters using mini-batches of training data rather than the full dataset.

In simple terms, SGD updates the parameters of the model by calculating the gradient of the loss function with respect to the parameters for a small batch of randomly selected training data. The parameters are then moved a small step in the direction opposite to the gradient, which is the direction that reduces the loss, and the learning rate determines the size of that step. This process is repeated for multiple batches of training data until the model has learned the patterns in the data.

To summarize, SGD is an optimization algorithm that updates the parameters of a neural network using mini-batches of training data and the gradient of the loss function with respect to the parameters. It is an efficient and effective way to train models on large datasets and is widely used in deep learning.
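A minimal sketch of mini-batch SGD, fitting a single slope parameter on noise-free toy data, looks like this. The function name `sgd_fit_slope`, the data, and the hyperparameters are all assumptions chosen to keep the example small.

```python
import random


def sgd_fit_slope(xs, ys, lr=0.1, batch_size=4, epochs=50, seed=0):
    """Fit y = w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) with respect to w, on this batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step against the gradient
    return w


xs = [i / 10 for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0
ys = [2 * x for x in xs]             # noise-free data with true slope 2
print(round(sgd_fit_slope(xs, ys), 4))  # 2.0
```

Each update only looks at four data points, yet the estimate still settles on the true slope because every batch gradient points toward it.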

# Enough talk, show me the code:

Code is highly commented, enjoy and happy coding

```python
import csv
import os
import os.path as op
import sys
import tarfile
import time
import urllib

import numpy as np
import pandas as pd
import torch
import torchmetrics
from datasets import ClassLabel, Features, Value, load_dataset
from packaging import version
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SubsetRandomSampler
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from watermark import watermark
```


### reporthook(count, block_size, total_size)

The function takes three arguments: count, block_size, and total_size. These are the number of blocks downloaded so far, the size of each block in bytes, and the total size of the file being downloaded, respectively.

The global keyword is used to indicate that the variable start_time is defined in the global scope, which means it can be accessed and modified from within the function.

If count is zero, the function sets start_time to the current time and returns without writing any progress report. This is because there is no progress to report for the first block, and we only want to report progress for subsequent blocks.

If count is greater than zero, the function calculates the elapsed time since the start of the download by subtracting start_time from the current time. It also calculates the number of bytes downloaded so far by multiplying count and block_size.

The function then calculates the average download speed in MB/s by dividing the number of bytes downloaded so far by the elapsed time in seconds and converting the result to megabytes. It also calculates the percentage of the download completed by multiplying the bytes downloaded by 100 and dividing by the total size of the file.

Finally, the function writes a progress report to stdout using sys.stdout.write(). The progress report includes the percentage of the download completed, the amount downloaded so far in megabytes, the download speed in megabytes per second, and the elapsed time.

```python
import time
import sys
def reporthook(count, block_size, total_size):
    global start_time

    # If this is the first block being downloaded
    if count == 0:
        # Record the current time
        start_time = time.time()
        return

    # Calculate the elapsed time since the start of the download
    duration = time.time() - start_time

    # Calculate the number of bytes downloaded so far
    progress_size = int(count * block_size)

    # Calculate the average download speed in MB/s
    speed = progress_size / (1024.0**2 * duration)

    # Calculate the percentage of the download completed
    percent = count * block_size * 100.0 / total_size

    # Write the progress report to stdout and flush the buffer
    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()
```

### download_dataset()

Despite its name, the `download_dataset()` function does not download anything: it reads a local CSV file called "labeled_data2.csv" that must already be in the working directory. Here's a breakdown of what the code does:

* The function initializes a variable called data_file and sets its value to "labeled_data2.csv".
* The code checks whether the file data_file exists in the current working directory by calling `os.path.isfile(data_file)`.
* If the file does not exist, the function prints an error message and returns None.
* If the file exists, the function opens it in read mode using a `with open(data_file, 'r') as f:` statement. This ensures that the file is properly closed after the data has been read from it.
* The `csv.DictReader(f)` function is used to read the file object f and create a dictionary object for each row. The `DictReader` class takes the first row of the CSV file as a header row and uses the values in that row as the keys for each subsequent row's dictionary.
* The function then initializes an empty list called data to store the data.
* For each row in the file, the function extracts the values for the "class" and "tweet" keys using `row['class']` and `row['tweet']` respectively.
* The function then creates a tuple with the extracted class and tweet values and appends it to the data list.
* After all rows have been processed, the function returns the data list.

```python
import os
import csv

def download_dataset():
    # Set the name of the local file to read
    data_file = "labeled_data2.csv"

    # Check if the file exists in the current working directory
    if not os.path.isfile(data_file):
        # If the file does not exist, print an error message and return None
        print(f"Error: {data_file} not found")
        return None

    # Open the file in read mode
    with open(data_file, 'r') as f:
        # Use csv.DictReader to read the file and create dictionaries for each row
        reader = csv.DictReader(f)
        # Initialize an empty list to store the data
        data = []
        # Loop through each row in the file
        for row in reader:
            # Extract the class and tweet values from the row
            class_ = row['class']
            tweet = row['tweet']
            # Create a tuple with the extracted values and append it to the data list
            data.append((class_, tweet))

    # Return the data list
    return data
```

### load_dataset_into_dataframe()

This code defines a function named `load_dataset_into_dataframe()` that loads a CSV file named labeled_data2.csv into a pandas dataframe, shuffles the rows randomly, drops an unnecessary column named "Unnamed: 0", and returns the resulting dataframe.

```python
def load_dataset_into_dataframe(): # define function
    data_file = "labeled_data2.csv" # assign file name to a variable

    try: # try to read the file
        df = pd.read_csv(data_file, on_bad_lines='skip') # read csv file with pandas
    except FileNotFoundError: # if file is not found
        print(f"Error: {data_file} not found") # print error message
        raise # re-raise: without the file there is no dataframe to process

    df["class"] = df["class"].astype(int) # convert "class" column to integer type
    df = df.sample(frac=1, random_state=42).reset_index(drop=True) # shuffle the rows randomly
    df.drop("Unnamed: 0", inplace=True, axis=1) # remove "Unnamed: 0" column
    print("Class distribution:") # print message
    print(df["class"].value_counts()) # print the count of each class
    return df # return the dataframe
```

### `partition_dataset()`
In the `partition_dataset()` function, the number of rows in the shuffled dataset is used to compute the number of rows for each split. The `num_train` variable is set to 70% of the total number of rows, and the `num_val` variable is set to 15%; the test set receives the remaining rows. The `iloc` indexer is used to slice the dataframe into the train, validation, and test sets by row position. Finally, the three resulting dataframes are saved to separate CSV files.

```python
def partition_dataset(df): # define function that takes a pandas dataframe as input
    df_shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True) # shuffle the rows of the dataframe randomly

    num_rows = len(df_shuffled) # calculate the number of rows in the shuffled dataframe
    num_train = int(num_rows * 0.7) # calculate the number of rows to include in the training set (70% of total rows)
    num_val = int(num_rows * 0.15) # calculate the number of rows to include in the validation set (15% of total rows)

    df_train = df_shuffled.iloc[:num_train] # slice the first num_train rows and assign them to the training set
    df_val = df_shuffled.iloc[num_train:num_train+num_val] # slice the next num_val rows and assign them to the validation set
    df_test = df_shuffled.iloc[num_train+num_val:] # slice the remaining rows and assign them to the test set

    df_train.to_csv("train.csv", index=False, encoding="utf-8") # write the training set to a CSV file named "train.csv"
    df_val.to_csv("val.csv", index=False, encoding="utf-8") # write the validation set to a CSV file named "val.csv"
    df_test.to_csv("test.csv", index=False, encoding="utf-8") # write the test set to a CSV file named "test.csv" (needed later by load_dataset)
```

### CustomDataset

This code defines a custom dataset class in PyTorch, which inherits from the Dataset class provided by PyTorch. The custom dataset is designed to load data from a CSV file containing labeled text data, and preprocess the text data so that it can be fed into a neural network for training or inference. 

```python
class CustomDataset(Dataset):
    def __init__(self, file_path):
        """
        Reads data from a CSV file and stores it in the dataset object.
        """
        self.data = []
        self.labels = []  # Store labels separately
        with open(file_path, 'r') as f:
            reader = csv.DictReader(f)
            for row in reader:
                class_ = row['class']  # Extract the class label from the row
                tweet = row['tweet']  # Extract the tweet from the row
                self.data.append((class_, tweet))  # Store the class label and tweet as a tuple
                self.labels.append(class_)  # Store the label in a separate list

    def __len__(self):
        """
        Returns the length of the dataset.
        """
        return len(self.data)

    def __getitem__(self, index):
        """
        Retrieves a single item (a tweet and its corresponding class label) from the dataset at the specified index.
        Tokenizes the tweet using the module-level `tokenizer` object, which must be defined before the
        dataset is used, and returns the input IDs, attention mask, and label as a tuple.
        """
        class_, tweet = self.data[index]  # Retrieve the class label and tweet at the specified index
        encoding = tokenizer.encode_plus(tweet, add_special_tokens=True, max_length=256, padding='max_length', return_tensors='pt', truncation=True)  # Tokenize the tweet using the tokenizer object
        input_ids = encoding['input_ids'][0]  # Get the input IDs from the tokenization output
        attention_mask = encoding['attention_mask'][0]  # Get the attention mask from the tokenization output
        label = int(self.labels[index])  # Retrieve the label for the tweet and convert it to an integer
        return input_ids, attention_mask, label  # Return the input IDs, attention mask, and label as a tuple
```


### tokenize_text()

The `tokenize_text` function takes a batch of data in the form of a dictionary whose "tweet" key holds the text to be tokenized (a list of strings when the function is applied with `batched=True`). It uses the `tokenizer` object to convert each text into a list of integers, where each integer corresponds to a particular token in the vocabulary of the `tokenizer`.

The `truncation=True` parameter tells the `tokenizer` to truncate a sequence of tokens if it exceeds the model's maximum length, while the `padding=True` parameter tells the `tokenizer` to pad shorter sequences with a special token (`[PAD]`) so that all inputs in the batch have the same length.

The function then returns the tokenized text as a dictionary with keys "input_ids" and "attention_mask", which contain the tokenized input sequence and a mask indicating which elements of the sequence should be attended to by the model, respectively.
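The relationship between padding, truncation, and the attention mask can be sketched in plain Python. This is a simplification with a hypothetical helper name and made-up token IDs: the real tokenizer also keeps the closing special token when it truncates, which this sketch does not attempt.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three tweets of different lengths
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 3, 4, 5, 6, 2, 102]]
out = pad_and_truncate(batch, max_length=5)
print(out["input_ids"])       # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 3, 4, 5, 6]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

After this step every row has the same length, so the batch can be stacked into a single tensor.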


```python
def tokenize_text(batch):
    # `batch` is a dictionary whose "tweet" key holds the text to tokenize
    # (a list of strings when called through `Dataset.map(..., batched=True)`)
    # `truncation=True` tells the tokenizer to truncate sequences that exceed the model's maximum length
    # `padding=True` tells the tokenizer to pad shorter sequences with `[PAD]` so that all inputs in the batch have the same length
    # the result is dictionary-like, with "input_ids" and "attention_mask" entries
    return tokenizer(batch["tweet"], truncation=True, padding=True)
```

### train()

The `train()` function trains a neural network model for a given number of epochs on a given dataset. In each epoch, the model is trained on batches of data using an optimizer algorithm such as Adam or SGD. The train_loader and val_loader are data loaders that provide batches of data for training and validation, respectively.

During training, the model goes through a forward and backpropagation process to update its weights. The forward propagation computes the model's output for a given input, while the backpropagation computes the gradients of the loss function with respect to the model's parameters. The optimizer uses these gradients to update the model's parameters and minimize the loss function.

The `train_acc` and `val_acc` objects are instances of the Accuracy metric from the torchmetrics library. This metric measures the accuracy of the model's predictions by comparing them to the ground truth labels. In each epoch, the accuracy is calculated for both the training and validation sets.
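What the accuracy metric computes can be sketched without any library: take the index of the largest logit in each row as the prediction and count how often it matches the label. The function name and the logits below are made up for illustration; this is not the torchmetrics implementation.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # index of the largest logit
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up logits for four tweets over the three classes (0, 1, 2)
logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.4, 0.9], [1.2, 1.1, 0.0]]
labels = [0, 1, 2, 1]
print(accuracy_from_logits(logits, labels))  # 0.75
```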

The model does not start from random weights: the DistilBERT encoder is loaded with its pre-trained weights, and only the newly added classification head is randomly initialized. During training, all of these weights are updated to minimize the loss function. The `optimizer.zero_grad()` call resets the gradients of all model parameters to zero before the backward propagation step. This is necessary because PyTorch accumulates gradients across calls to `backward()`. By zeroing the gradients, we ensure that each update uses only the gradients of the current batch of data.
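Gradient accumulation is easy to observe on a single tensor. The short sketch below assumes only that PyTorch is installed; it resets the gradient directly on the tensor, which is what `optimizer.zero_grad()` does for every parameter the optimizer manages.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()        # 3.0: gradient of 3*w with respect to w

(w * 3).backward()           # second backward pass without resetting
accumulated = w.grad.item()  # 6.0: the new gradient was added to the old one

w.grad.zero_()               # reset the stored gradient
(w * 3).backward()
after_reset = w.grad.item()  # 3.0 again

print(first, accumulated, after_reset)  # 3.0 6.0 3.0
```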

Finally, the `print()` statements report the training progress. The loss value is printed after each batch of training data, while the accuracy is printed at the end of each epoch. The `train()` function does not return anything; it updates the model in place.

```python
def train(num_epochs, model, optimizer, train_loader, val_loader, device):
    # trains a neural network model for a given number of epochs on a given dataset
    # `num_epochs`: number of epochs to train for
    # `model`: neural network model to train (updated in place)
    # `optimizer`: optimizer algorithm to use during training (e.g. Adam or SGD)
    # `train_loader`: data loader for training set
    # `val_loader`: data loader for validation set
    # `device`: device to use for computation (e.g. "cpu" or "cuda")

    for epoch in range(num_epochs):
        # initialize a new instance of the Accuracy metric for training set
        train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=3).to(device)

        model.train()  # enable dropout for the training pass
        for batch_idx, batch in enumerate(train_loader):
            # unpack the (input_ids, attention_mask, label) tuple and move it to the device
            input_ids, attention_mask, label = (t.to(device) for t in batch)

            ### FORWARD AND BACK PROP
            outputs = model(input_ids, attention_mask=attention_mask, labels=label)

            optimizer.zero_grad()  # clear gradients left over from the previous batch
            outputs["loss"].backward()

            ### UPDATE MODEL PARAMETERS
            optimizer.step()

            ### LOGGING
            print(f"Epoch: {epoch+1:04d}/{num_epochs:04d} | Batch {batch_idx:04d}/{len(train_loader):04d} | Loss: {outputs['loss'].item():.4f}")

            # track training accuracy from the logits already computed above
            with torch.no_grad():
                predicted_labels = torch.argmax(outputs["logits"], 1)
                train_acc.update(predicted_labels, label)

        ### MORE LOGGING
        model.eval()  # disable dropout for evaluation
        val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=3).to(device)
        with torch.no_grad():
            for batch in val_loader:
                input_ids, attention_mask, label = (t.to(device) for t in batch)
                outputs = model(input_ids, attention_mask=attention_mask, labels=label)
                predicted_labels = torch.argmax(outputs["logits"], 1)
                val_acc.update(predicted_labels, label)

        print(
            f"Epoch: {epoch+1:04d}/{num_epochs:04d} | Train acc.: {train_acc.compute()*100:.2f}% | Val acc.: {val_acc.compute()*100:.2f}%"
        )
```

### Fine-tuning the pre-trained DistilBERT model

The concept of fine-tuning a pre-trained model refers to the process of adjusting the weights of a pre-trained model on a specific task or dataset. The idea behind this is that a pre-trained model has already learned to extract relevant features from a large corpus of text, and we can use that knowledge to improve the performance of the model on a task-specific dataset.

In the code below, we begin by loading the dataset and then tokenize and numericalize it using the `AutoTokenizer` from the transformers library. This converts the text into a numerical representation that the model can process. We then create data loaders using `DataLoader` from PyTorch, which feed data to the model in batches during training. Note that the loaders are built on `CustomDataset`, which tokenizes each tweet on the fly, so the training loop consumes those batches rather than the tokenized Hugging Face dataset.

Next, we initialize a pre-trained model AutoModelForSequenceClassification from the transformers library, which is a transformer-based model trained on a large corpus of text. We specify the number of labels for our task, and then move the model to the device, which can be a GPU if available for faster computation.

Finally, we fine-tune the model by training it on our task-specific dataset using the `train()` function defined above. During training, we update the weights of the model using the Adam optimizer with a learning rate of 5e-5. We train the model for 3 epochs and then evaluate its performance on the test split using the torchmetrics library. The accuracy of the model on the test split is printed at the end of the code.

```python
if __name__ == "__main__":
    print(watermark(packages="torch,lightning,transformers", python=True))
    print("Torch CUDA available?", torch.cuda.is_available())
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    torch.manual_seed(123)
    ##########################
    ### 1 Loading the Dataset
    ##########################
    #download_dataset()
    df = load_dataset_into_dataframe()
    if not (op.exists("train.csv") and op.exists("val.csv") and op.exists("test.csv")):
        partition_dataset(df)

    features = Features(
        {
            "class": ClassLabel(
                num_classes=3,
                names=["hate speech", "Offensive", "Neither"],
            ),
            "tweet": Value(dtype="string"),
        })

    hatespeech_dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "validation": "val.csv",
            "test": "test.csv",
        },
        features=features
    )

    #########################################
    ### 2 Tokenization and Numericalization
    #########################################

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    print("Tokenizer input max length:", tokenizer.model_max_length, flush=True)
    print("Tokenizer vocabulary size:", tokenizer.vocab_size, flush=True)
    print("Tokenizing ...", flush=True)
    hatespeech_tokenized = hatespeech_dataset.map(tokenize_text, batched=True, batch_size=None)
    del hatespeech_dataset
    hatespeech_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "tweet"])
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    #########################################
    ### 3 Set Up DataLoaders
    #########################################

    # Define split ratios and other parameters
    train_split = 0.7
    validation_split = 0.15
    test_split = 0.15
    shuffle_dataset = True
    random_seed = 42
    batch_size = 16

    # Create dataset object
    dataset = CustomDataset("labeled_data2.csv")

    # Implement stratified splits
    splitter_train = StratifiedShuffleSplit(n_splits=1, test_size=1 - train_split, random_state=random_seed)
    train_indices, remaining_indices = next(splitter_train.split(np.zeros(len(dataset)), dataset.labels))

    remaining_labels = [dataset.labels[i] for i in remaining_indices]
    splitter_test_val = StratifiedShuffleSplit(n_splits=1, test_size=test_split/(test_split+validation_split), random_state=random_seed)
    # split() returns positions within `remaining_indices`, not dataset indices,
    # so map them back; otherwise the test and validation rows overlap the training rows
    test_pos, val_pos = next(splitter_test_val.split(np.zeros(len(remaining_indices)), remaining_labels))
    test_indices = remaining_indices[test_pos]
    val_indices = remaining_indices[val_pos]

    # Create PT data samplers and loaders
    train_sampler = SubsetRandomSampler(train_indices)
    valid_sampler = SubsetRandomSampler(val_indices)
    test_sampler = SubsetRandomSampler(test_indices)

    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    validation_loader = DataLoader(dataset, batch_size=batch_size, sampler=valid_sampler)
    test_loader = DataLoader(dataset, batch_size=batch_size, sampler=test_sampler)

    #########################################
    ### 4 Initializing the Model
    #########################################

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=3
    )

    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

    #########################################
    ### 5 Finetuning
    #########################################

    start = time.time()
    train(
        num_epochs=3,
        model=model,
        optimizer=optimizer,
        train_loader=train_loader,
        val_loader=validation_loader,
        device=device,
    )

    end = time.time()
    elapsed = end - start
    print(f"Time elapsed {elapsed/60:.2f} min")

    with torch.no_grad():
        model.eval()
        test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=3).to(device)
        for batch in test_loader:
            # unpack the (input_ids, attention_mask, label) tuple and move it to the device
            input_ids, attention_mask, label = (t.to(device) for t in batch)
            outputs = model(input_ids, attention_mask=attention_mask, labels=label)
            predicted_labels = torch.argmax(outputs["logits"], 1)
            test_acc.update(predicted_labels, label)

    print(f"Test accuracy {test_acc.compute()*100:.2f}%")
```

```
Python implementation: CPython
Python version       : 3.8.5
IPython version      : 8.4.0

torch       : 1.12.0
lightning   : not installed
transformers: 4.27.1

Torch CUDA available? True
Class distribution:
1    701
2    157
0     49
Name: class, dtype: int64
Tokenizer input max length: 512
Tokenizer vocabulary size: 30522
Tokenizing ...
Epoch: 0001/0003 | Batch 0000/0041 | Loss: 1.0423
Epoch: 0001/0003 | Batch 0001/0041 | Loss: 0.9934
Epoch: 0001/0003 | Batch 0002/0041 | Loss: 0.8500
Epoch: 0001/0003 | Batch 0003/0041 | Loss: 0.8439
Epoch: 0001/0003 | Batch 0004/0041 | Loss: 0.5773
Epoch: 0001/0003 | Batch 0005/0041 | Loss: 0.7593
Epoch: 0001/0003 | Batch 0006/0041 | Loss: 0.5719
Epoch: 0001/0003 | Batch 0007/0041 | Loss: 0.5381
Epoch: 0001/0003 | Batch 0008/0041 | Loss: 0.6897
Epoch: 0001/0003 | Batch 0009/0041 | Loss: 0.9545
Epoch: 0001/0003 | Batch 0010/0041 | Loss: 0.5337
Epoch: 0001/0003 | Batch 0011/0041 | Loss: 0.6696
Epoch: 0001/0003 | Batch 0012/0041 | Loss: 0.5834
Epoch: 00

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification 
```

We then save the model's state dict as a `.pth` file on our local disk.

```python
from pathlib import Path

# 1. Create models directory
MODEL_PATH = Path("models")
MODEL_PATH.mkdir(parents=True, exist_ok=True)

# 2. Create model save path
MODEL_NAME = "distilbertafinetuned.pth"
MODEL_SAVE_PATH = MODEL_PATH / MODEL_NAME

# 3. Save the model state dict
print(f"Saving model to: {MODEL_SAVE_PATH}")
torch.save(obj=model.state_dict(),
           f=MODEL_SAVE_PATH)
```

```
Saving model to: models/distilbertafinetuned.pth
```


And finally we create the code in order to use the model for predictions.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Set device to cuda or cpu
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# The .pth file holds only a state dict, so rebuild the architecture first
# and then load the fine-tuned weights into it
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
model.load_state_dict(torch.load("models/distilbertafinetuned.pth", map_location=device))
model.to(device)

# Sample text to classify
text = "ok all is good"

# Tokenize the text
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Move the input tensors to the device
inputs = {key: val.to(device) for key, val in inputs.items()}

# Make a prediction
with torch.no_grad():
    model.eval()
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits).item()

# Print the predicted class
print(predicted_class)
```

```
2
```
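The printed value is a class index. As a small sketch of how raw logits could be turned into a readable class name and a probability, the example below applies a softmax to made-up logits that stand in for the model output. The `CLASS_NAMES` list is an assumption that follows the dataset's 0/1/2 label encoding.

```python
import math

# Class names in the order of the dataset's integer labels (assumed mapping)
CLASS_NAMES = ["hate speech", "offensive", "neither"]


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits standing in for the model output of a single tweet
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(CLASS_NAMES[best], round(probs[best], 3))
```

Reporting the probability next to the label makes it easier to spot predictions the model is unsure about.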
