<style>
    /* Main container style */
    .note-box {
        background-color: #1e1e2e;       /* Dark Blue-Grey Background */
        color: #cdd6f4;                  /* Soft White Text */
        border-left: 6px solid #89b4fa;  /* Blue Accent Border */
        border-radius: 8px;
        padding: 20px;
        margin: 20px 0;
        font-family: system-ui, -apple-system, sans-serif;
        line-height: 1.6;
        box-shadow: 0 4px 6px rgba(0, 0, 0, 0.2);
        box-sizing: border-box;
        max-width: 100%;
        overflow-wrap: break-word;
    }
    
    /* Header style */
    .note-box h2 {
        color: #89b4fa;                  /* Blue Header */
        margin-top: 0;
        margin-bottom: 15px;
        font-size: 1.6rem;
        font-weight: 600;
        border-bottom: 1px solid #45475a;
        padding-bottom: 10px;
    }

    /* Important keywords */
    .note-box strong {
        color: #f9e2af;                  /* Soft Gold/Yellow */
        font-weight: 600;
    }

    /* Inline code snippets */
    .note-box .code-inline {
        background-color: #313244;
        color: #f38ba8;                  /* Soft Red/Pink */
        padding: 2px 6px;
        border-radius: 4px;
        font-family: 'Menlo', 'Consolas', monospace;
        font-size: 0.9em;
        border: 1px solid #45475a;
        white-space: pre-wrap;
    }

    /* Lists */
    .note-box ul {
        padding-left: 20px;
        margin: 10px 0;
    }
    .note-box li {
        margin-bottom: 8px;
    }
</style>
<div class="note-box">
    <h2>Chapter 4.3: From Numbers to Words (Text Classification)</h2>
    <p>
        <strong>Objective</strong>: Transition from continuous data (like sine waves) to discrete data (text). We will build a Sentiment Classifier that reads a sentence and predicts if it is <strong>Positive</strong> or <strong>Negative</strong>.
    </p>
    <p><strong>The Shift</strong>: Neural networks operate on numbers. Unlike a sine wave value (e.g., 0.54), a word is a categorical symbol with no inherent numeric value, so we cannot pass "Apple" into a network directly. It must first be mapped to numbers.</p>
    <p><strong>Key Concepts</strong>:</p>
    <ul>
        <li><strong>Tokenization</strong>: Converting words into integer IDs.</li>
        <li><strong>Embeddings</strong>: Converting integers into dense vectors that capture meaning.</li>
        <li><strong>Packing Sequences</strong>: Handling sentences of different lengths efficiently using <span class="code-inline">pack_padded_sequence</span>.</li>
    </ul>
</div>
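As a rough illustration of the first two concepts, the sketch below maps words to integer IDs and then looks each ID up in a table of vectors. The names `toy_vocab`, `toy_table`, `tokenize`, and `embed` are hypothetical, and the vector values are random placeholders rather than learned embeddings. The `nn.Embedding` layer used in Step 3 holds a table of the same kind, except that its rows are trained.

```python
import random

# Hypothetical toy vocabulary: every known word gets an integer ID.
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "great": 2, "movie": 3}

# Hypothetical embedding table: one 4-dimensional vector per ID.
# Values are random placeholders; a real model would learn them.
rng = random.Random(0)
TOY_DIM = 4
toy_table = [[rng.uniform(-1.0, 1.0) for _ in range(TOY_DIM)] for _ in toy_vocab]
toy_table[toy_vocab["<PAD>"]] = [0.0] * TOY_DIM  # padding row kept at zero


def tokenize(sentence):
    """Map each whitespace-separated word to its ID (<UNK> if unseen)."""
    return [toy_vocab.get(word, toy_vocab["<UNK>"]) for word in sentence.split()]


def embed(ids):
    """Look up one vector per ID; an embedding layer is this lookup with trainable rows."""
    return [toy_table[i] for i in ids]


ids = tokenize("great movie")
vectors = embed(ids)
print(ids)                            # [2, 3]
print(len(vectors), len(vectors[0]))  # 2 4
```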

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.utils.data import Dataset, DataLoader
import numpy as np

# Check for acceleration
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: mps


<div class="note-box">
    <h2>Step 1: The Vocabulary</h2>
    <p>We start with a tiny synthetic dataset of reviews. Before the model sees them, we must build a <strong>Vocabulary</strong>.</p>
    <ul>
        <li><strong>Mapping</strong>: Assign a unique integer to every unique word.</li>
        <li><strong>Padding Token</strong> (<span class="code-inline">&lt;PAD&gt;</span>): Used to fill short sentences so batches have uniform shape. Usually index 0.</li>
        <li><strong>Unknown Token</strong> (<span class="code-inline">&lt;UNK&gt;</span>): For words not in our vocabulary.</li>
    </ul>
</div>

In [2]:
# 1. Raw Data (Sentence, Label) -> 1 = Positive, 0 = Negative
raw_data = [
    ("this movie was amazing", 1),
    ("i loved the acting", 1),
    ("great plot and characters", 1),
    ("what a waste of time", 0),
    ("terrible movie do not watch", 0),
    ("boring and slow", 0),
    ("best film ever", 1),
    ("awful just awful", 0)
]

# 2. Build Vocabulary
word_to_ix = {"<PAD>": 0, "<UNK>": 1}
for sentence, _ in raw_data:
    for word in sentence.split():
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

print(f"Vocab Size: {len(word_to_ix)}")
print("Sample Mapping:", list(word_to_ix.items())[:5])

Vocab Size: 30
Sample Mapping: [('<PAD>', 0), ('<UNK>', 1), ('this', 2), ('movie', 3), ('was', 4)]
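
To see why the `<UNK>` entry matters, the sketch below encodes a sentence containing words the vocabulary has never seen. It rebuilds a small vocabulary from two of the reviews so that it runs on its own; `demo_data`, `demo_vocab`, and `encode` are hypothetical names used only for this illustration.

```python
# Two of the reviews from the dataset, enough to build a small vocabulary.
demo_data = ["this movie was amazing", "boring and slow"]

demo_vocab = {"<PAD>": 0, "<UNK>": 1}
for sentence in demo_data:
    for word in sentence.split():
        if word not in demo_vocab:
            demo_vocab[word] = len(demo_vocab)


def encode(sentence, vocab):
    """Convert a sentence to IDs, mapping unseen words to <UNK>."""
    unk = vocab["<UNK>"]
    return [vocab.get(word, unk) for word in sentence.split()]


print(encode("this movie was slow", demo_vocab))         # [2, 3, 4, 8]
print(encode("this picture was fantastic", demo_vocab))  # [2, 1, 4, 1]
```

Every unseen word collapses to the same ID, so the model cannot tell "picture" from "fantastic". With a vocabulary this small, that will happen often on new input.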


<div class="note-box">
    <h2>Step 2: Handling Variable Lengths (Crucial!)</h2>
    <p>Sentences have different lengths. To batch them, we must:</p>
    <ol>
        <li><strong>Pad</strong> short sentences with 0s to match the longest sentence in the batch.</li>
        <li><strong>Record lengths</strong>: The RNN needs to know the <em>actual</em> length of each sentence to stop processing at the right time (avoid processing 0s).</li>
    </ol>
    <p>We use a custom <span class="code-inline">collate_fn</span> in the DataLoader to handle padding dynamically.</p>
</div>

In [3]:
class SentimentDataset(Dataset):
    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        # Convert words to integers
        idxs = [self.vocab.get(w, self.vocab["<UNK>"]) for w in text.split()]
        return torch.tensor(idxs), torch.tensor(label)

def collate_fn(batch):
    # Sort batch by sequence length (descending); pack_padded_sequence requires this by default (enforce_sorted=True)
    batch.sort(key=lambda x: len(x[0]), reverse=True)
    
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    
    # Pad sequences
    padded_seqs = nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=0)
    
    return padded_seqs, torch.stack(labels), lengths

# Create DataLoader
dataset = SentimentDataset(raw_data, word_to_ix)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

# Check one batch
text_batch, label_batch, len_batch = next(iter(dataloader))
print(f"Padded Batch Shape: {text_batch.shape}")
print(f"Lengths: {len_batch}")
print(f"Batch:\n{text_batch}")

Padded Batch Shape: torch.Size([2, 5])
Lengths: tensor([5, 5])
Batch:
tensor([[19,  3, 20, 21, 22],
        [14, 15, 16, 17, 18]])
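
The batch sampled above happens to contain two five-word sentences, so no padding is visible. The pure-Python sketch below mirrors the sort-then-pad logic of `collate_fn` on two sequences of unequal length. `pad_batch` is a hypothetical helper written for illustration, and the ID values are the ones the Step 1 vocabulary assigns to these two reviews.

```python
def pad_batch(sequences, pad_id=0):
    """Sort sequences longest-first, then right-pad each one to the longest length."""
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0]
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


batch = [
    [23, 12, 24],   # "boring and slow"
    [2, 3, 4, 5],   # "this movie was amazing"
]
padded, lengths = pad_batch(batch)
print(padded)   # [[2, 3, 4, 5], [23, 12, 24, 0]]
print(lengths)  # [4, 3]
```

The trailing 0 is the `<PAD>` ID, and the recorded lengths tell later stages which positions are real.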


<div class="note-box">
    <h2>Step 3: The Model Architecture</h2>
    <p>This model introduces two major components:</p>
    <ul>
        <li><strong>nn.Embedding</strong>: A lookup table that converts integer IDs (e.g., 42) into dense vectors (e.g., [0.1, -0.5, ...]). This allows the model to learn semantic relationships between words.</li>
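        <li><strong>pack_padded_sequence</strong>: This tells PyTorch to skip the padded positions. Using the recorded lengths, it stores only the valid time steps of each sequence in a compact <span class="code-inline">PackedSequence</span> object. This saves computation and ensures the final hidden state comes from each sentence's last real token rather than from trailing padding.</li>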
    </ul>
    
$$ \text{Input (Ints)} \rightarrow \text{Embedding} \rightarrow \text{Pack} \rightarrow \text{LSTM} \rightarrow \text{Last Hidden State} \rightarrow \text{Linear} $$
</div>
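
To make the packing step less abstract, the sketch below reproduces in plain Python the layout a packed batch uses: the valid tokens are stored time step by time step, alongside a list saying how many sequences are still active at each step. This mirrors the `data` and `batch_sizes` fields of PyTorch's `PackedSequence`. `pack_sketch` is a hypothetical helper, not PyTorch's implementation, and it assumes the batch is already sorted longest-first.

```python
def pack_sketch(padded, lengths):
    """Return (data, batch_sizes) for a padded batch sorted longest-first.

    data        : valid tokens in time-major order (step 0 of every
                  sequence, then step 1, and so on), padding skipped
    batch_sizes : number of sequences still active at each time step
    """
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        active = [row[t] for row, n in zip(padded, lengths) if t < n]
        data.extend(active)
        batch_sizes.append(len(active))
    return data, batch_sizes


padded = [
    [2, 3, 4, 5],     # length 4
    [23, 12, 24, 0],  # length 3, last position is padding
]
data, batch_sizes = pack_sketch(padded, [4, 3])
print(data)         # [2, 23, 3, 12, 4, 24, 5]
print(batch_sizes)  # [2, 2, 2, 1]
```

At the last step only one sequence is active, so the padding 0 is never fed to the LSTM. In the real pipeline the packed items are embedding vectors rather than raw IDs.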

In [4]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        
        # 1. Embedding Layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        
        # 2. LSTM Layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        
        # 3. Output Layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):
        # text shape: [batch_size, max_seq_len]
        
        # 1. Apply Embeddings
        embedded = self.embedding(text) # shape: [batch, seq_len, embed_dim]
        
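        # 2. Pack the Sequence
        # This skips padded positions during the LSTM computation.
        # pack_padded_sequence expects lengths on the CPU, even if the data is on a GPU.
        packed_embedded = pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)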
        
        # 3. Pass through LSTM
        # We don't need the output for every step, just the final hidden state
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        
        # hidden shape: [1, batch_size, hidden_dim] (1 for 1 layer)
        # Squeeze removes the 1st dimension
        return self.fc(hidden.squeeze(0))

# Hyperparameters
VOCAB_SIZE = len(word_to_ix)
EMBED_DIM = 10
HIDDEN_DIM = 16
OUTPUT_DIM = 1 # Binary classification (0 or 1)

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, OUTPUT_DIM).to(device)
print(model)

TextClassifier(
  (embedding): Embedding(30, 10, padding_idx=0)
  (lstm): LSTM(10, 16, batch_first=True)
  (fc): Linear(in_features=16, out_features=1, bias=True)
)
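
As a sanity check on the printed architecture, the parameter count can be worked out by hand. The sketch below assumes the standard single-layer, unidirectional `nn.LSTM` parameterization (four gates, each with input weights, hidden weights, and two bias vectors). The function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized row per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, hidden_dim):
    # Four gates, each with input-to-hidden weights, hidden-to-hidden
    # weights, and two bias vectors (input-side and hidden-side).
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return 4 * per_gate


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output.
    return in_features * out_features + out_features


counts = {
    "embedding": embedding_params(30, 10),
    "lstm": lstm_params(10, 16),
    "fc": linear_params(16, 1),
}
print(counts)                # {'embedding': 300, 'lstm': 1792, 'fc': 17}
print(sum(counts.values()))  # 2109
```

If the model from this step is in scope, `sum(p.numel() for p in model.parameters())` should report the same total.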


<div class="note-box">
    <h2>Step 4: Training</h2>
    <p>Since we are doing binary classification, we use <strong>BCEWithLogitsLoss</strong>, which combines a Sigmoid layer and BCELoss (binary cross-entropy) in one numerically stable class. The model therefore outputs raw logits, and no sigmoid appears in its forward pass.</p>
</div>
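
The phrase "numerically stable" can be illustrated with a short pure-Python sketch. The naive version applies a sigmoid and then takes logs, which breaks down once the sigmoid saturates. The stable version uses the algebraically equivalent form max(z, 0) - z*y + log(1 + exp(-|z|)), where z is the logit and y the target. Both function names are hypothetical, and this illustrates the idea rather than PyTorch's actual implementation.

```python
import math


def bce_naive(logit, target):
    """Sigmoid followed by binary cross-entropy; fails when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Equivalent loss computed directly from the logit, without forming log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits the two versions agree.
print(abs(bce_naive(0.5, 1.0) - bce_with_logits(0.5, 1.0)) < 1e-12)  # True

# For a large logit with the wrong label, the sigmoid rounds to exactly 1.0,
# so the naive version evaluates log(0) and raises ValueError.
print(bce_with_logits(100.0, 0.0))  # 100.0
try:
    bce_naive(100.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
```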

In [5]:
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss() # Applies sigmoid internally, so the model outputs raw logits

epochs = 20

print("--- Starting Training ---")
for epoch in range(epochs):
    epoch_loss = 0
    model.train()
    
    for text, labels, lengths in dataloader:
        text, labels = text.to(device), labels.float().to(device)
        
        optimizer.zero_grad()
        
        # Forward pass (passing lengths is key!)
        predictions = model(text, lengths).squeeze(1)
        
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
    if (epoch+1) % 5 == 0:
        print(f"Epoch {epoch+1}/{epochs} | Loss: {epoch_loss/len(dataloader):.4f}")

--- Starting Training ---
Epoch 5/20 | Loss: 0.4452
Epoch 10/20 | Loss: 0.0540
Epoch 15/20 | Loss: 0.0085
Epoch 20/20 | Loss: 0.0042


<div class="note-box">
    <h2>Step 5: Testing with New Data</h2>
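    <p>To predict the sentiment of a new sentence, we run the same pipeline as in training: Tokenize -&gt; Convert to a tensor -&gt; Predict. A single sentence forms a batch of one, so no padding is needed.</p>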
</div>

In [6]:
def predict_sentiment(sentence):
    model.eval()
    
    # 1. Tokenize
    idxs = [word_to_ix.get(w, word_to_ix["<UNK>"]) for w in sentence.split()]
    tensor = torch.tensor(idxs).unsqueeze(0).to(device) # Batch size 1
    length = torch.tensor([len(idxs)])
    
    # 2. Predict
    with torch.no_grad():
        prediction = model(tensor, length)
        probability = torch.sigmoid(prediction).item()
        
    sentiment = "POSITIVE" if probability > 0.5 else "NEGATIVE"
    print(f"Sentence: '{sentence}'")
    print(f"Prediction: {sentiment} ({probability:.4f})\n")

# Test cases
predict_sentiment("amazing movie")
predict_sentiment("what a waste")
predict_sentiment("i loved the plot")

Sentence: 'amazing movie'
Prediction: POSITIVE (0.9247)

Sentence: 'what a waste'
Prediction: NEGATIVE (0.0075)

Sentence: 'i loved the plot'
Prediction: POSITIVE (0.9892)

