# Day 2 - Text Classification with Embeddings using LlamaIndex, Azure OpenAI, and PyTorch

## Overview
Welcome back to the 5-day Generative AI course. In this notebook, we'll dive deep into text classification using embeddings. We'll learn how to:

1. Generate embeddings from text using Azure OpenAI
2. Build a neural network classifier using PyTorch
3. Train and evaluate the model
4. Make predictions on new text

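This tutorial takes a hands-on approach to understanding how embeddings can be used for text classification. Instead of training a model on raw text, we'll feed precomputed embeddings to a small classifier as input features. Because the embedding model has already learned general language features, this often gives good results even with only a few hundred labeled examples.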

## Embedding Recap

Embeddings are dense vector representations of text (or other data) where similar items are mapped to nearby points in a high-dimensional space. For example, the sentences "I love dogs" and "I like puppies" would have similar embedding vectors because they express similar concepts.
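
To make "nearby" concrete: closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real `text-embedding-ada-002` embeddings have 1,536 dimensions), so the numbers are illustrative assumptions, not actual model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, invented for illustration only
dogs = [0.9, 0.1, 0.0]      # stands in for "I love dogs"
puppies = [0.8, 0.2, 0.1]   # stands in for "I like puppies"
taxes = [0.0, 0.1, 0.9]     # stands in for an unrelated sentence

sim_related = cosine_similarity(dogs, puppies)
sim_unrelated = cosine_similarity(dogs, taxes)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0.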

Benefits of using embeddings include:
- Reduced dimensionality compared to traditional text representations
- Capture semantic relationships between words and phrases
- Can be generated quickly using pre-trained models
- Work well with neural networks

For more detail, see the [embeddings and vector stores notes](../notes/embeddings-and-vector-stores.md).

## Prerequisites

Before starting, ensure you have:

- An Azure subscription
- Access to Azure OpenAI service
- Python 3.7 or later
- Basic understanding of Python and machine learning concepts

## Setup and Installation

First, let's install the required packages:

```python
%pip install -q llama-index llama-index-core llama-index-embeddings-azure-openai llama-index-llms-azure-openai python-dotenv torch scikit-learn pandas numpy tqdm
```

Let's understand what each package does:

- `llama-index`: Framework for building LLM applications
- `llama-index-embeddings-azure-openai`: Azure OpenAI embedding integration for LlamaIndex (provides `AzureOpenAIEmbedding`)
- `torch`: PyTorch deep learning framework
- `scikit-learn`: For loading our dataset and utilities
- `pandas`: For data manipulation
- `numpy`: For numerical computations
- `tqdm`: For progress bars
- `python-dotenv`: For managing environment variables

In [1]:
from dotenv import load_dotenv
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from tqdm import tqdm
from typing import List
import os

tqdm.pandas() # Enable progress_apply


Create a `.env` file with the following content:

```
AZURE_OPENAI_ENDPOINT="YOUR_AZURE_ENDPOINT"
AZURE_OPENAI_KEY="YOUR_API_KEY"
OPENAI_API_VERSION="YOUR_API_VERSION"
```
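
A missing or empty variable usually surfaces later as an opaque authentication error, so it can help to check the configuration up front. The sketch below is one way to do that. `missing_env_vars` is a hypothetical helper (not part of `python-dotenv` or LlamaIndex), and the example mapping stands in for the real environment.

```python
import os

REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY", "OPENAI_API_VERSION")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    "AZURE_OPENAI_KEY": "",  # present but empty, so it counts as missing
}
missing = missing_env_vars(env=example_env)
print(missing)
```

In the notebook, calling `missing_env_vars()` with no arguments right after `load_dotenv()` would check the actual environment.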

Set up the Azure OpenAI embedding client:

In [2]:
load_dotenv()
embed_model = AzureOpenAIEmbedding(
    model= 'text-embedding-ada-002',
    azure_deployment = 'text-embedding-ada-002',
    api_key = os.getenv("AZURE_OPENAI_KEY"),
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version = os.getenv("OPENAI_API_VERSION", "2023-05-15"),
    max_retries=3,
    timeout=10
)

## Dataset 

The [20 Newsgroups Text Dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) contains about 18,000 newsgroup posts on 20 topics, divided into training and test sets. The split between the two sets is based on whether a message was posted before or after a specific date.

This dataset is great for text classification because:

1. It contains real-world text data
2. It's pre-categorized into distinct topics
3. It's a manageable size for learning
4. It contains a good mix of topics

For this tutorial, you will use sampled subsets of the training and test sets and do some preprocessing with `pandas`.


In [10]:
newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")

# View available categories
print("Available categories:")
for idx, name in enumerate(newsgroups_train.target_names):
    print(f"{idx}. {name}")

Available categories:
0. alt.atheism
1. comp.graphics
2. comp.os.ms-windows.misc
3. comp.sys.ibm.pc.hardware
4. comp.sys.mac.hardware
5. comp.windows.x
6. misc.forsale
7. rec.autos
8. rec.motorcycles
9. rec.sport.baseball
10. rec.sport.hockey
11. sci.crypt
12. sci.electronics
13. sci.med
14. sci.space
15. soc.religion.christian
16. talk.politics.guns
17. talk.politics.mideast
18. talk.politics.misc
19. talk.religion.misc


Here is an example of what a record from the training set looks like:

In [11]:
print(newsgroups_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Data Preprocessing

Raw text data often contains noise and unnecessary information. To remove any sensitive information like names and email addresses, you will take only the subject and body of each message. 

Let's clean it up:

In [12]:
import email
import re

def preprocess_newsgroup_row(data: str) -> str:
    """
    Clean and prepare a single newsgroup post.
    
    Args:
        data (str): Raw text of newsgroup post
        
    Returns:
        str: Cleaned text containing only subject and body
    """
    # extract subject and body
    msg = email.message_from_string(data)
    text = f"{msg['Subject']}\n\n{msg.get_payload()}"

    # remove email address for privacy
    text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)

    # truncate to manageable length
    text = text[:5000]
    return text

def preprocess_newsgroup_data(newsgroup_dataset) -> pd.DataFrame: 
    """
    Convert newsgroup dataset into a pandas DataFrame and clean the text.
    
    Args:
        newsgroup_dataset: Scikit-learn newsgroup dataset
        
    Returns:
        pd.DataFrame: Processed dataset with text and labels
    """
    df = pd.DataFrame({
        "Text": newsgroup_dataset.data,
        "Label": newsgroup_dataset.target
    })

    df["Text"] = df["Text"].apply(preprocess_newsgroup_row)

    # add readable category names
    df["Class Name"] = df["Label"].apply(lambda x: newsgroup_dataset.target_names[x])

    return df
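
To see what the per-row cleaning does, the sketch below applies the same parsing and substitution steps to a small, hypothetical post (written for illustration, not taken from the dataset).

```python
import email
import re

# Hypothetical raw post, invented for this example
raw = (
    "From: alice@example.com (Alice)\n"
    "Subject: Test post\n"
    "Lines: 2\n"
    "\n"
    "Please reply to bob@example.org.\n"
    "Thanks!\n"
)

msg = email.message_from_string(raw)
# Only the Subject header and the body survive; From, Lines, etc. are dropped
text = f"{msg['Subject']}\n\n{msg.get_payload()}"
# Remove anything shaped like an email address
text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)
print(repr(text))
```

Because the character class in the pattern includes `.`, the period directly after the address is removed along with it. Names in the body or signature are not touched by this step, so the cleaning reduces personal data rather than guaranteeing its removal.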

### Sampling the Dataset

To make the tutorial manageable, we'll work with a subset of the data:

In [13]:
def sample_data(df: pd.DataFrame,
                num_samples: int,
                classes_to_keep: str) -> pd.DataFrame:
    
    """
    Create a balanced sample of the dataset, keeping only specified classes.
    
    Args:
        df (pd.DataFrame): Input DataFrame
        num_samples (int): Number of samples per class
        classes_to_keep (str): String to filter class names
        
    Returns:
        pd.DataFrame: Sampled and filtered dataset
    """
    # sample equal number of samples from each class
    df = (
        df.groupby("Label")[df.columns]
        .apply(lambda x: x.sample(num_samples))
        .reset_index(drop=True)
    )

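    # keep only specified classes; copy so the column assignments below
    # modify a new DataFrame rather than a view (avoids SettingWithCopyWarning)
    df = df[df["Class Name"].str.contains(classes_to_keep)].copy()

    # Re-encode labels starting from 0
    df["Class Name"] = df["Class Name"].astype("category")
    df['Encoded Label'] = df['Class Name'].cat.codes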

    return df
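
The re-encoding step matters because the science groups carry the original labels 11 to 14, while `nn.CrossEntropyLoss` expects class indices in the range `0` to `num_classes - 1`. The plain-Python sketch below mirrors what `astype("category").cat.codes` does for string labels: the unique names are sorted, and each name maps to its position. The input list is a made-up example.

```python
# Class names as they might appear in the filtered DataFrame (order is arbitrary)
class_names = ["sci.space", "sci.crypt", "sci.med", "sci.electronics", "sci.crypt"]

# Sorted unique names play the role of `cat.categories`
categories = sorted(set(class_names))
code_of = {name: code for code, name in enumerate(categories)}

# The position in the sorted list plays the role of `cat.codes`
encoded = [code_of[name] for name in class_names]
print(categories)
print(encoded)
```

The same sorted order is what `df_test["Class Name"].cat.categories` returns, so index `i` of the model output lines up with the `i`-th category name.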

In [14]:
# Configuration
TRAIN_NUM_SAMPLES = 100
TEST_NUM_SAMPLES = 25
CLASSES_TO_KEEP = "sci"  # Class name should contain 'sci' to keep science categories

df_train = sample_data(preprocess_newsgroup_data(newsgroups_train), 
                       TRAIN_NUM_SAMPLES, 
                       CLASSES_TO_KEEP)
df_test = sample_data(preprocess_newsgroup_data(newsgroups_test), 
                      TEST_NUM_SAMPLES, 
                      CLASSES_TO_KEEP)

# Verify class distribution
print("\nTraining set class distribution:")
print(df_train["Class Name"].value_counts())
print("\nTest set class distribution:")
print(df_test["Class Name"].value_counts())


Training set class distribution:
Class Name
sci.crypt          100
sci.electronics    100
sci.med            100
sci.space          100
Name: count, dtype: int64

Test set class distribution:
Class Name
sci.crypt          25
sci.electronics    25
sci.med            25
sci.space          25
Name: count, dtype: int64


## Generate Embeddings

Now we'll convert our text data into embeddings using Azure OpenAI:

In [15]:
def embed_fn(text: str) -> List[float]: 
    """
    Generate embedding vector for a piece of text.
    
    Args:
        text (str): Input text
        
    Returns:
        list[float]: Embedding vector
    """
    response = embed_model.get_text_embedding(text)
    return response

def create_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add embedding vectors to DataFrame.
    
    Args:
        df (pd.DataFrame): Input DataFrame with 'Text' column
        
    Returns:
        pd.DataFrame: DataFrame with added 'Embeddings' column
    """
    df["Embeddings"] = df["Text"].progress_apply(embed_fn)
    return df

In [16]:
print("Generating embeddings (this may take a few minutes)...")
df_train = create_embeddings(df_train) 
df_test = create_embeddings(df_test)

Generating embeddings (this may take a few minutes)...


100%|██████████| 400/400 [00:52<00:00,  7.56it/s]
100%|██████████| 100/100 [00:16<00:00,  6.10it/s]


Note: Embedding generation can take some time, especially for larger datasets. Each text needs to be processed individually through the API.
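
Since each call costs time (and API quota), one option is to cache vectors so that re-running the notebook does not embed the same text twice. The sketch below is a minimal in-memory version under stated assumptions: `cached_embed` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real API call so the example runs offline.

```python
import hashlib
from typing import Callable, Dict, List

def cached_embed(text: str,
                 embed: Callable[[str], List[float]],
                 cache: Dict[str, List[float]]) -> List[float]:
    """Return the cached vector for `text`, calling `embed` only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]

# Stand-in for the real embedding call; records how often it is invoked
calls = []
def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text)), 0.0]

cache: Dict[str, List[float]] = {}
first = cached_embed("hello", fake_embed, cache)
second = cached_embed("hello", fake_embed, cache)
print(len(calls))  # the second lookup is served from the cache
```

In the notebook, `embed_fn` would take the place of `fake_embed`. Saving the cache to disk between sessions is a natural extension.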

In [17]:
df_train.head()

| | Text | Label | Class Name | Encoded Label | Embeddings |
|---|---|---|---|---|---|
| 1100 | Re: How to detect use of an illegal cipher?\n\... | 11 | sci.crypt | 0 | [0.000520402449183166, -0.003595346584916115, ... |
| 1101 | Re: Once tapped, your code is no good any more... | 11 | sci.crypt | 0 | [0.013352726586163044, -0.0009401556453667581,... |
| 1102 | Re: Re-inventing Crypto Policy? An EFF Statem... | 11 | sci.crypt | 0 | [0.009235640987753868, -0.008010792545974255, ... |
| 1103 | powerful "similarity" too\n\nA Unix tool of cr... | 11 | sci.crypt | 0 | [-0.013430314138531685, -0.015955360606312752,... |
| 1104 | Fear, Uncertainty, Doubt\n\n I suspect that t... | 11 | sci.crypt | 0 | [0.012093139812350273, -0.002550253877416253, ... |


## Build the PyTorch Model

We'll create a custom dataset class and neural network model using PyTorch.

### Custom Dataset
PyTorch's Dataset class provides a clean interface for accessing our data:

In [18]:
class NewsGroupDataset(Dataset):
    """
    Custom Dataset for newsgroup data.

    Attributes:
        embeddings (np.ndarray): Array of embedding vectors
        labels (np.ndarray): Array of labels
    """
    def __init__(self, embeddings, labels):
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return (
            torch.tensor(self.embeddings[idx], dtype=torch.float32),
            torch.tensor(self.labels[idx], dtype=torch.long)
        )

### Neural Network Model
Our classifier is a simple feedforward neural network:

In [19]:
class NewsGroupClassifier(nn.Module):
    """
    Neural network for classifying newsgroup posts.
    
    Architecture:
    - Input layer (embedding size)
    - Hidden layer with ReLU activation
    - Output layer (number of classes)
    """
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes)
        )

    def forward(self, X):
        return self.model(X)   
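
For a sense of scale, the sketch below counts the trainable parameters of this two-layer network. The sizes are assumptions: 1,536-dimensional embeddings (the output size of `text-embedding-ada-002`), a hidden layer of the same width as used in the setup cell, and 4 classes. `linear_params` is a hypothetical helper, not a PyTorch function.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed sizes: 1536-dim embeddings, hidden layer of equal width, 4 classes
input_size, hidden_size, num_classes = 1536, 1536, 4

total = linear_params(input_size, hidden_size) + linear_params(hidden_size, num_classes)
print(f"{total:,} trainable parameters")
```

Roughly 2.4 million parameters against 400 training examples leaves the model plenty of capacity to memorize the training set, which is consistent with the training accuracy reaching 1.0 in the log while validation accuracy levels off.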

### Set up Training Infrastructure

Configure the device (CPU/GPU) and prepare data loaders. Note that the sampled test split doubles as the validation set in this tutorial:

In [23]:
# set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# prepare data
x_train = np.stack(df_train["Embeddings"])
y_train = df_train["Encoded Label"].values
x_val = np.stack(df_test["Embeddings"])
y_val = df_test["Encoded Label"].values

# create datasets
train_dataset = NewsGroupDataset(x_train, y_train)
val_dataset = NewsGroupDataset(x_val, y_val)

# create data loaders
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, 
                          batch_size=BATCH_SIZE,
                          shuffle=True)
val_loader = DataLoader(val_dataset,
                         batch_size=BATCH_SIZE,
                         shuffle=False)

# initialize model
input_size = len(df_train['Embeddings'].iloc[0])
hidden_size = input_size
num_classes = len(df_train['Class Name'].unique())

model = NewsGroupClassifier(
    input_size=input_size,
    hidden_size=hidden_size,
    num_classes=num_classes
).to(device)  # keep the model on the same device as the input batches

Using device: cpu
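
The training progress bars show 13 steps per epoch. That number follows from the dataset and batch sizes, as the short calculation below shows (it assumes the 400 sampled training posts and the `DataLoader` default of keeping the final partial batch).

```python
import math

num_samples = 400   # 4 classes x 100 sampled posts each
batch_size = 32

num_batches = math.ceil(num_samples / batch_size)
last_batch = num_samples - (num_batches - 1) * batch_size
print(num_batches, last_batch)
```

One consequence: `total_loss / len(train_loader)` averages the per-batch mean losses, so the smaller final batch carries the same weight as a full one. For a tutorial-sized dataset the difference is small.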


### Train the model

Let's implement the training loop with progress tracking, early stopping, and evaluation:

In [24]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train the model for one epoch.
    
    Args:
        model (nn.Module): Neural network model
        train_loader (DataLoader): Training data loader
        criterion: Loss function
        optimizer: Optimization algorithm
        device: Device to run the model on
        
    Returns:
        tuple: (average loss, accuracy)
    """
    model.train()
    total_loss = 0
    correct = 0
    total = 0 

    for inputs, labels in tqdm(train_loader, desc="Training"):
        inputs, labels = inputs.to(device), labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # backward pass and optimize
        loss.backward()
        optimizer.step()

        # track statistics
        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    return total_loss / len(train_loader), correct / total

def evaluate(model, test_loader, criterion, device):
    """
    Evaluate the model on test data.
    
    Args:
        model: PyTorch model
        test_loader: DataLoader for test data
        criterion: Loss function
        device: Device to run on
        
    Returns:
        tuple: (average loss, accuracy)
    """
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            total_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return total_loss / len(test_loader), correct / total
     
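# train model
import copy

NUM_EPOCHS = 20
PATIENCE = 5
best_val_acc = 0
patience = 0  # epochs since the last improvement in validation accuracy

print("Starting training...")
for epoch in range(NUM_EPOCHS):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        # state_dict() returns references to the live tensors, so take a
        # deep copy; otherwise the snapshot changes as training continues
        best_model = copy.deepcopy(model.state_dict())
        patience = 0
    else:
        patience += 1
        if patience > PATIENCE:
            print("Early stopping...")
            break

    print(f"\nEpoch {epoch+1}/{NUM_EPOCHS}")
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

# The best checkpoint can be restored with: model.load_state_dict(best_model)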

Starting training...


Training: 100%|██████████| 13/13 [00:00<00:00, 60.82it/s]



Epoch 1/20
Train Loss: 1.2364, Train Acc: 0.7500
Val Loss: 1.0436, Val Acc: 0.8700


Training: 100%|██████████| 13/13 [00:00<00:00, 61.85it/s]



Epoch 2/20
Train Loss: 0.7296, Train Acc: 0.9500
Val Loss: 0.6502, Val Acc: 0.8600


Training: 100%|██████████| 13/13 [00:00<00:00, 213.11it/s]



Epoch 3/20
Train Loss: 0.3062, Train Acc: 0.9700
Val Loss: 0.5351, Val Acc: 0.8800


Training: 100%|██████████| 13/13 [00:00<00:00, 253.61it/s]



Epoch 4/20
Train Loss: 0.1432, Train Acc: 0.9775
Val Loss: 0.4686, Val Acc: 0.8900


Training: 100%|██████████| 13/13 [00:00<00:00, 144.87it/s]



Epoch 5/20
Train Loss: 0.0838, Train Acc: 0.9925
Val Loss: 0.5533, Val Acc: 0.8700


Training: 100%|██████████| 13/13 [00:00<00:00, 246.62it/s]



Epoch 6/20
Train Loss: 0.0546, Train Acc: 0.9950
Val Loss: 0.5786, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 222.14it/s]



Epoch 7/20
Train Loss: 0.0368, Train Acc: 1.0000
Val Loss: 0.5883, Val Acc: 0.8900


Training: 100%|██████████| 13/13 [00:00<00:00, 147.88it/s]



Epoch 8/20
Train Loss: 0.0236, Train Acc: 1.0000
Val Loss: 0.6258, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 231.70it/s]



Epoch 9/20
Train Loss: 0.0195, Train Acc: 1.0000
Val Loss: 0.6388, Val Acc: 0.8900


Training: 100%|██████████| 13/13 [00:00<00:00, 244.82it/s]



Epoch 10/20
Train Loss: 0.0156, Train Acc: 1.0000
Val Loss: 0.6391, Val Acc: 0.9100


Training: 100%|██████████| 13/13 [00:00<00:00, 147.35it/s]



Epoch 11/20
Train Loss: 0.0129, Train Acc: 1.0000
Val Loss: 0.6771, Val Acc: 0.8900


Training: 100%|██████████| 13/13 [00:00<00:00, 250.90it/s]



Epoch 12/20
Train Loss: 0.0105, Train Acc: 1.0000
Val Loss: 0.6807, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 241.04it/s]



Epoch 13/20
Train Loss: 0.0090, Train Acc: 1.0000
Val Loss: 0.6956, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 144.27it/s]



Epoch 14/20
Train Loss: 0.0076, Train Acc: 1.0000
Val Loss: 0.7018, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 261.48it/s]



Epoch 15/20
Train Loss: 0.0066, Train Acc: 1.0000
Val Loss: 0.7134, Val Acc: 0.9000


Training: 100%|██████████| 13/13 [00:00<00:00, 258.56it/s]

Early stopping...
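
Reading the log against the stopping rule: the best validation accuracy (0.9100) occurs at epoch 10, epochs 11 to 15 are five non-improving epochs, and a sixth one pushes the counter past `PATIENCE = 5`. The loop therefore breaks during epoch 16, before its metrics are printed. The sketch below isolates that rule in a hypothetical helper, `stopping_epoch`, and runs it on made-up accuracies.

```python
def stopping_epoch(val_accs, patience_limit):
    """Return the 1-based epoch at which the loop breaks, or None if it never does."""
    best = 0.0
    stale = 0  # consecutive epochs without a new best accuracy
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale > patience_limit:
                return epoch
    return None

# Hypothetical accuracies, chosen only to exercise the rule
stopped_at = stopping_epoch([0.5, 0.6, 0.6, 0.6, 0.6], patience_limit=2)
print(stopped_at)
```

With the strict `>` comparison, the loop tolerates `patience_limit` stale epochs and stops on the next one. Using `>=` instead would stop one epoch earlier.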





### Make Predictions

Finally, let's create a function to make predictions on new text:

In [25]:
def predict_category(text, model, embed_model, class_names):
    """
    Predict the category of a new text.
    
    Args:
        text (str): Input text
        model: Trained PyTorch model
        embed_model: Embedding model
        class_names: List of class names
        
    Prints:
        Probability for each class
    """
    # Get embedding from the embedding model passed in as an argument
    embedding = embed_model.get_text_embedding(text)
    # Convert to tensor
    input_tensor = torch.tensor(embedding, dtype=torch.float32).unsqueeze(0).to(device)

    # get prediction
    model.eval()
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = nn.functional.softmax(output, dim=1)[0]

    print("\nPredicted probabilities:")
    for i, category in enumerate(class_names): 
        print(f"{category}: {probabilities[i] * 100:.2f}%")

new_text = """
First-timer looking to get out of here.

Hi, I'm writing about my interest in travelling to the outer limits!

What kind of craft can I buy? What is easiest to access from this 3rd rock?

Let me know how to do that please.
"""
print("Making prediction on example text...")
predict_category(new_text, model, embed_model, df_test["Class Name"].cat.categories)

Making prediction on example text...

Predicted probabilities:
sci.crypt: 0.17%
sci.electronics: 6.09%
sci.med: 1.08%
sci.space: 92.66%
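
The percentages above come from applying softmax to the model's raw output scores (logits). The sketch below shows the computation in plain Python. The logits are hypothetical values chosen for illustration, not the trained model's actual outputs.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the four classes
logits = [-2.0, 1.5, 0.0, 4.0]
probs = softmax(logits)
for p in probs:
    print(f"{p * 100:.2f}%")
```

Softmax preserves the ranking of the scores, so the most likely class is the same whether you look at logits or probabilities. The network itself outputs logits because `nn.CrossEntropyLoss` applies log-softmax internally during training.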
