# **Text Embeddings With Tabular Classification Model**

This tutorial will guide you through a step-by-step breakdown of using a Multilayer Perceptron (MLP) with embeddings from a pre-trained `DistilBERT` model to classify text sentiment from the IMDB movie reviews dataset. We'll cover everything from dataset preprocessing to model evaluation, explaining each part in detail.

## Tips
For best performance, ensure that the runtime is set to use a GPU (`Runtime > Change runtime type > T4 GPU`).

## Help & Questions

If you have any questions, please reach out on our [Discord](https://discord.gg/dncQwFdN9m).

You can also use our [documentation](https://docs.modlee.ai/README.html) as a reference for using our package.


## Step 1: Environment Setup

First, we need to make sure that we have the necessary packages installed.

In [None]:
!pip3 install modlee torch torchvision pytorch-lightning datasets transformers scikit-learn torchtext==0.18.0

## Step 2: Importing Libraries

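In this section, we import the libraries used throughout the tutorial: `PyTorch` for the model and data loading, Hugging Face `transformers` and `datasets` for `DistilBERT` and the IMDB data, `scikit-learn` for splitting the data, and `modlee`.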


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import DistilBertTokenizer, DistilBertModel
from torch.utils.data import TensorDataset, DataLoader
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import modlee



Now we will set our Modlee API key and initialize the Modlee package.
Make sure that you have a Modlee account and an API key [from the dashboard](https://www.dashboard.modlee.ai/).
Replace `replace-with-your-api-key` with your API key.

In [2]:
# Set the API key to an environment variable,
# to simulate setting this in your shell profile
import os

os.environ['MODLEE_API_KEY'] = "replace-with-your-api-key"
modlee.init(api_key=os.environ['MODLEE_API_KEY'])
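
To avoid leaving a real key in a shared notebook, one option is to read it from the environment and fail early when it is missing. The following is a minimal sketch; the helper name `get_api_key` is hypothetical and is not part of the Modlee package.

```python
import os

def get_api_key(var_name="MODLEE_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of the modlee package.
    """
    key = os.environ.get(var_name, "").strip()
    # Treat an empty value or the tutorial's placeholder as "not set"
    if not key or key == "replace-with-your-api-key":
        raise RuntimeError(f"Set the {var_name} environment variable before initializing Modlee.")
    return key
```

With this helper, the initialization call could be written as `modlee.init(api_key=get_api_key())`.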


## Step 3: Defining the MLP Model

The MLP (Multilayer Perceptron) is defined here as a neural network with three fully connected linear layers. The first two layers are each followed by a `ReLU` activation, and the final layer outputs raw class scores (logits). The input size is determined by the size of the `DistilBERT` embeddings, and the output size corresponds to the number of classes, which is two for this binary classification task.

In [3]:
class MLP(modlee.model.TabularClassificationModleeModel):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, 256),  # First fully connected layer
            nn.ReLU(),                   # ReLU activation
            nn.Linear(256, 128),          # Second fully connected layer
            nn.ReLU(),                   # ReLU activation
            nn.Linear(128, num_classes)   # Output layer
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        # Pass input through the model defined in nn.Sequential
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y_target = batch
        y_pred = self(x)
        loss = self.loss_fn(y_pred, y_target) # Calculate the loss
        return {"loss": loss}

    def validation_step(self, val_batch, batch_idx):
        x, y_target = val_batch
        y_pred = self(x)
        val_loss = self.loss_fn(y_pred, y_target)  # Calculate validation loss
        return {'val_loss': val_loss}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.001, momentum=0.9)  # Define the optimizer
        return optimizer
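
As a quick sanity check of the layer sizes, the sketch below rebuilds the same `nn.Sequential` stack in plain PyTorch (without the Modlee base class) and passes a random batch through it. The batch size of 4 and the random inputs are illustrative assumptions; 768 is the `DistilBERT` embedding size used later in this tutorial.

```python
import torch
import torch.nn as nn

# Same layer sizes as the MLP above, built without the Modlee wrapper
layers = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)

dummy_batch = torch.randn(4, 768)  # 4 fake embeddings of size 768
logits = layers(dummy_batch)       # raw, unnormalized class scores

print(logits.shape)                # torch.Size([4, 2])

# Count trainable parameters (weights plus biases of the three linear layers)
num_params = sum(p.numel() for p in layers.parameters())
print(num_params)                  # 230018
```

Each row of `logits` holds one score per class; `nn.CrossEntropyLoss` applies the softmax internally, so the model itself does not need a final activation.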


## Step 4: Loading DistilBERT
In this step, we load `DistilBERT`, which is a more compact version of `BERT` (Bidirectional Encoder Representations from Transformers). `DistilBERT` retains much of `BERT`'s accuracy while being smaller and faster, making it a great option for tasks that rely on pre-trained language models, such as sentiment analysis.


The tokenizer is responsible for converting raw text into a format that the `DistilBERT` model can understand. It splits the text into smaller pieces called tokens and then maps these tokens to their corresponding numeric IDs, which are used by the model.

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
bert = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)
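
To make the token-to-ID mapping concrete without downloading a model, here is a toy illustration. It uses a made-up whitespace vocabulary, not `DistilBERT`'s actual WordPiece vocabulary, so the IDs are purely illustrative; the names `toy_vocab` and `toy_tokenize` are hypothetical.

```python
# Toy vocabulary (an assumption for illustration; real tokenizers have tens of thousands of entries)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7, "boring": 8}

def toy_tokenize(texts, max_length=8):
    """Map texts to padded token IDs plus an attention mask (1 = real token, 0 = padding)."""
    rows = []
    for text in texts:
        # Truncate, leaving room for the [CLS] and [SEP] special tokens
        words = text.lower().split()[: max_length - 2]
        ids = [toy_vocab["[CLS]"]]
        ids += [toy_vocab.get(word, toy_vocab["[UNK]"]) for word in words]
        ids += [toy_vocab["[SEP]"]]
        rows.append(ids)

    # Pad every row to the length of the longest row in the batch
    longest = max(len(ids) for ids in rows)
    input_ids = [ids + [toy_vocab["[PAD]"]] * (longest - len(ids)) for ids in rows]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = toy_tokenize(["The movie was great", "Boring"])
print(batch["input_ids"])       # [[2, 4, 5, 6, 7, 3], [2, 8, 3, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

The real tokenizer returns the same two fields, `input_ids` and `attention_mask`, which is why both are passed to the model in a later step.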


## Step 5: Loading the IMDB Dataset

We load the IMDB dataset using the Hugging Face datasets library. The IMDB dataset contains movie reviews labeled as positive or negative, making it suitable for binary classification.

In [5]:
dataset = load_dataset('imdb')

## Step 6: Sampling a Subset of Data
Since processing the entire dataset can be slow, we sample a subset of 1,000 examples to speed up computation. The function below randomly selects a subset of a given split; we apply it to both the training and test splits.

In [6]:
def sample_subset(dataset, subset_size=1000):
    # Randomly shuffle dataset indices and select a subset
    sample_indices = torch.randperm(len(dataset))[:subset_size]
    # Select the sampled data based on the shuffled indices
    sampled_data = dataset.select(sample_indices.tolist())

    return sampled_data
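
Because `torch.randperm` draws from the global random state, each run selects a different subset. One way to make the sample repeatable is to pass a seeded `torch.Generator`, as in this sketch. The function name `sample_indices` and the seed value are assumptions for illustration; 25,000 is the size of the IMDB training split.

```python
import torch

def sample_indices(dataset_len, subset_size=1000, seed=0):
    """Return `subset_size` unique random indices, reproducible for a given seed."""
    generator = torch.Generator().manual_seed(seed)
    permutation = torch.randperm(dataset_len, generator=generator)
    return permutation[:subset_size].tolist()

first = sample_indices(25000, subset_size=5)
second = sample_indices(25000, subset_size=5)
print(first == second)  # True: same seed, same indices
```

The resulting list can be passed to `dataset.select(...)` in the same way as in `sample_subset`.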


## Step 7: Precomputing Text Embeddings Using DistilBERT
`DistilBERT` turns text into numerical embeddings that the MLP can take as input. We first preprocess the text by tokenizing it, then padding or truncating it to at most 128 tokens. `DistilBERT` then produces a hidden state for every token, and we average these over the sequence to get one embedding per review. To speed up training, we precompute these embeddings once so that `DistilBERT` does not have to run again during training; the MLP is trained directly on the stored embeddings.

In [7]:
def get_text_embeddings(texts, tokenizer, bert, device, max_length=128):

    # Tokenize the input texts, with padding and truncation to a fixed max length
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    # Get the embeddings from BERT without calculating gradients
    with torch.no_grad():
        # Average over the last hidden states to get sentence-level embeddings
        embeddings = bert(input_ids, attention_mask=attention_mask).last_hidden_state.mean(dim=1)
    return embeddings


# Precompute embeddings for the entire dataset
def precompute_embeddings(dataset, tokenizer, bert, device, max_length=128):
    texts = list(dataset['text'])  # Extract texts from the dataset as a plain Python list
    embeddings = get_text_embeddings(texts, tokenizer, bert, device, max_length)
    return embeddings
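
Note that `.mean(dim=1)` averages over every position, including the padding tokens added to shorter reviews. A common alternative is to weight the average by the attention mask so that padding does not dilute the embedding. The sketch below shows this on small hand-made tensors; the function name `masked_mean` is hypothetical, and this is a variation rather than the pooling used to produce this tutorial's results.

```python
import torch

def masked_mean(last_hidden_state, attention_mask):
    """Average hidden states over real tokens only (attention_mask == 1)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# One sequence of three positions with hidden size 2; the third position is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean(hidden, mask)
print(pooled)               # tensor([[2., 3.]])
print(hidden.mean(dim=1))   # plain mean is pulled toward the padding values
```

Inside `get_text_embeddings`, the same function could be applied to `last_hidden_state` and `attention_mask` in place of the plain mean.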


## Step 8: Splitting Data into Training and Validation Sets
We use `train_test_split` to split the precomputed embeddings and their corresponding labels into training and validation sets.

In [8]:
def split_data(embeddings, labels):
    # Split the embeddings and labels into training and validation sets (80% train, 20% validation)
    train_embeddings, val_embeddings, train_labels, val_labels = train_test_split(
        embeddings, labels, test_size=0.2, random_state=42  # Random state for reproducibility
    )
    return train_embeddings, val_embeddings, train_labels, val_labels
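
The split sizes can be checked on stand-in data. The sketch below also passes `stratify=labels`, an optional `train_test_split` argument that keeps the class ratio the same in both sets; the tutorial's `split_data` does not use it. The random arrays are placeholders for real embeddings and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768)).astype(np.float32)  # placeholder embeddings
labels = np.array([0, 1] * 500)                               # balanced placeholder labels

X_train, X_val, y_train, y_val = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)

print(X_train.shape, X_val.shape)    # (800, 768) (200, 768)
print(y_train.mean(), y_val.mean())  # 0.5 0.5
```

Without stratification, a random 1,000-example sample can end up with slightly different positive/negative ratios in the training and validation sets.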


## Step 9: Creating DataLoaders
DataLoaders batch the data and allow efficient iteration during training and evaluation. Here the training and validation embeddings are wrapped in `TensorDataset` objects and batched with PyTorch's `DataLoader`; only the training data is shuffled.

In [9]:
def create_dataloaders(train_embeddings, train_labels, val_embeddings, val_labels, batch_size):
    # Create TensorDataset objects for training and validation data
    train_dataset = TensorDataset(train_embeddings, train_labels)
    val_dataset = TensorDataset(val_embeddings, val_labels)

    # Create DataLoader objects to handle batching and shuffling of data
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # Shuffle training data
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)  # No shuffling for validation data

    return train_loader, val_loader
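
To see what the loaders yield, this sketch builds a `TensorDataset` from random stand-in tensors with the same sizes as the 800-example training split and inspects one batch. The random data is an assumption for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(800, 768)        # stand-in for precomputed embeddings
targets = torch.randint(0, 2, (800,))   # stand-in for binary labels

loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)  # torch.Size([32, 768]) torch.Size([32])
print(len(loader))                   # 25 batches, since 800 / 32 = 25
```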


## Step 10: Training the MLP Model
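The `train_model` function defines a manual PyTorch training loop. It uses cross-entropy loss and the Adam optimizer (`optim.Adam`) with a learning rate of 0.001. Because the loop creates its own optimizer, the SGD optimizer returned by the model's `configure_optimizers` method is not used here.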

In [10]:
def train_model(model, train_loader, num_epochs=1):
    # Define the loss function and optimizer
    criterion = nn.CrossEntropyLoss()  # Loss function for classification
    optimizer = optim.Adam(model.parameters(), lr=0.001)  # Optimizer to update model weights

    # Iterate over epochs
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode
        running_loss = 0.0

        # Iterate over batches of data
        for embeddings, labels in train_loader:
            embeddings, labels = embeddings.to(device), labels.to(device)  # Move data to the appropriate device

            # Forward pass: compute predictions and loss
            outputs = model(embeddings)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()  # Clear previous gradients
            loss.backward()  # Compute gradients
            optimizer.step()  # Update model weights

            running_loss += loss.item() * embeddings.size(0)  # Accumulate loss

        # Print average loss for the epoch
        epoch_loss = running_loss / len(train_loader.dataset)
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}')
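
A useful reference point when reading the printed loss: for two classes, a model that assigns equal probability to both has a cross-entropy of ln 2 ≈ 0.693. A loss near that value suggests the predicted probabilities are still close to 50/50, even if the higher-scoring class is often the correct one. The sketch below computes cross-entropy for a single example by hand (log-softmax followed by negative log-likelihood); the helper name `cross_entropy` is hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    highest = max(logits)
    shifted = [z - highest for z in logits]  # subtract the max for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return -(shifted[target] - log_sum_exp)

uniform_loss = cross_entropy([0.0, 0.0], target=1)    # equal scores for both classes
confident_loss = cross_entropy([0.0, 2.0], target=1)  # correct class scored higher

print(round(uniform_loss, 4))    # 0.6931
print(round(confident_loss, 4))  # 0.1269
```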


## Step 11: Evaluating the Model
After training, the model is evaluated on the validation set. The accuracy of predictions is calculated by comparing the model's output to the true labels.

In [11]:
def evaluate_model(model, val_loader):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient calculation for evaluation
        # Iterate over validation data
        for embeddings, labels in val_loader:
            embeddings, labels = embeddings.to(device), labels.to(device)  # Move data to the appropriate device
            outputs = model(embeddings)  # Get model predictions
            _, predicted = torch.max(outputs, 1)  # Get the predicted class labels (index of the highest score)

            total += labels.size(0)  # Update total count
            correct += (predicted == labels).sum().item()  # Count correct predictions

    # Calculate and print accuracy
    accuracy = (correct / total) * 100
    print(f'Accuracy: {accuracy:.2f}%')
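
The accuracy calculation can be traced on a tiny hand-made batch. `torch.max(outputs, 1)` returns both the maximum values and their indices; the indices are the predicted classes. The logits and labels below are made up for illustration.

```python
import torch

# Four examples, two class scores each (made-up values)
outputs = torch.tensor([[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.0]])
labels = torch.tensor([0, 1, 0, 0])

_, predicted = torch.max(outputs, 1)           # index of the larger score per row
correct = (predicted == labels).sum().item()   # number of matching predictions
accuracy = correct / labels.size(0) * 100

print(predicted.tolist())            # [0, 1, 1, 0]
print(f"Accuracy: {accuracy:.2f}%")  # Accuracy: 75.00%
```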


## Step 12: Running the Main Script
Finally, we run the script, which follows these steps: loading and sampling the dataset, precomputing embeddings, splitting the training embeddings into training and validation sets, training the MLP, and evaluating it on the validation set.

In [12]:

if __name__ == "__main__":
    # Load and preprocess a subset of the IMDB dataset
    train_data = sample_subset(dataset['train'], subset_size=1000)  # Sample a subset for training
    test_data = sample_subset(dataset['test'], subset_size=1000)  # Sample a subset for testing

    # Precompute BERT embeddings to speed up training
    print("Precomputing embeddings for training and testing data...")
    train_embeddings = precompute_embeddings(train_data, tokenizer, bert, device)  # Get embeddings for training data
    test_embeddings = precompute_embeddings(test_data, tokenizer, bert, device)  # Get embeddings for testing data

    # Convert labels from lists to tensors
    train_labels = torch.tensor(train_data['label'], dtype=torch.long)  # Convert training labels to tensor
    test_labels = torch.tensor(test_data['label'], dtype=torch.long)  # Convert testing labels to tensor

    # Split the training data into training and validation sets
    train_embeddings, val_embeddings, train_labels, val_labels = split_data(train_embeddings, train_labels)  # Split data

    # Create DataLoader instances for batching data
    batch_size = 32  # Define batch size
    train_loader, val_loader = create_dataloaders(train_embeddings, train_labels, val_embeddings, val_labels, batch_size)  # Create loaders

    # Initialize and train the MLP model
    input_size = 768  # Hidden size of DistilBERT embeddings
    num_classes = 2   # Number of classes (positive/negative)
    mlp_text = MLP(input_size=input_size, num_classes=num_classes).to(device)  # Initialize MLP model

    print("Starting training...")
    train_model(mlp_text, train_loader, num_epochs=1)  # Train the model

    # Evaluate the model's performance
    print("Evaluating model...")
    evaluate_model(mlp_text, val_loader)  # Evaluate the model on validation data


Precomputing embeddings for training and testing data...
Starting training...
Epoch [1/1], Loss: 0.6837
Evaluating model...
Accuracy: 73.00%


# **Awesome Work!**

We've successfully completed a text classification project using `DistilBERT` and a custom MLP. Here's a quick recap of what we achieved:

- Loaded and preprocessed the IMDB dataset.
- Used `DistilBERT` to extract text embeddings.
- Built and trained a custom MLP for sentiment classification.
- Evaluated the model's accuracy.

This project has introduced you to combining pre-trained transformer models with custom architectures for text classification tasks. With this knowledge, you're ready to explore new datasets, tweak model designs, and continue honing your machine learning skills. Keep building and learning!