# Module 3: From Manual to Automatic

## Introduction

Welcome to the third module in our series on neural networks and language modeling! This notebook demonstrates the progression from manual counting approaches to neural network-based approaches for language modeling.

In this module, we'll implement both approaches side-by-side, allowing you to see how neural networks can learn the same patterns that we previously had to specify manually.

### What You'll Learn

- **Manual Approach**: Implementing a bigram language model using explicit counting
- **Neural Approach**: Building a neural network that learns the same patterns automatically
- **Comparison**: Understanding the similarities and differences between these approaches
- **Advantages**: Why neural networks are more powerful and flexible

Let's start by setting up our environment.


In [None]:
# Install required packages
%pip install numpy matplotlib torch

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import torch

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

# Set up plotting
plt.style.use("ggplot")
%matplotlib inline

## 1. Loading and Preparing the Data

We'll use the same names dataset as in previous modules.


In [None]:
# Load the names dataset
with open("../../02 - Makemore/names.txt") as f:
    names = f.read().splitlines()

# Convert to lowercase
names = [name.lower() for name in names]

# Take a look at the first few names
print(f"First 10 names: {names[:10]}")
print(f"Total number of names: {len(names)}")

# Add start and end tokens
names_with_tokens = ["<" + name + ">" for name in names]

# Create vocabulary
chars = sorted(set("".join(names_with_tokens)))
vocab_size = len(chars)

# Create mappings between characters and indices
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")

## 2. Manual Approach: Bigram Counting

First, let's implement the manual counting approach we saw in Module 1. We'll count every pair of adjacent characters (bigram) in our dataset, then normalize each row of counts into a probability distribution over the next character.


In [None]:
# Count bigrams
def build_bigram_counts(names: list[str]) -> np.ndarray:
    # Add start and end tokens
    names = ["<" + name + ">" for name in names]

    # Initialize count matrix
    counts = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    # Count bigrams
    for name in names:
        for c1, c2 in zip(name, name[1:], strict=False):
            idx1 = char_to_idx[c1]
            idx2 = char_to_idx[c2]
            counts[idx1, idx2] += 1

    return counts


# Build the bigram counts
bigram_counts = build_bigram_counts(names)

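# Convert to probabilities with add-one (Laplace) smoothing.
# Smoothing keeps unseen bigrams from getting probability zero and avoids
# division by zero for the ">" row, which never has a next character.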
bigram_probs = (bigram_counts + 1) / (
    bigram_counts.sum(axis=1, keepdims=True) + vocab_size
)

# Visualize the bigram probabilities
plt.figure(figsize=(12, 10))
plt.imshow(bigram_probs, cmap="Blues")
plt.colorbar()
plt.title("Bigram Probabilities")
plt.xlabel("Next Character")
plt.ylabel("Current Character")
plt.xticks(range(vocab_size), chars, rotation=90)
plt.yticks(range(vocab_size), chars)
plt.show()
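
To make the smoothing step concrete, here is a minimal sketch on a three-name toy corpus. The names, `toy_*` variables, and `V` are illustrative assumptions, not part of the dataset or the code above. It checks that every row of the smoothed table is a valid distribution and that a bigram never seen in the data still receives some probability.

```python
import numpy as np

# Toy corpus (illustrative only, not the names dataset)
toy_names = ["ab", "ba", "aa"]
toy_tokens = ["<" + n + ">" for n in toy_names]
toy_chars = sorted(set("".join(toy_tokens)))  # ['<', '>', 'a', 'b']
toy_idx = {ch: i for i, ch in enumerate(toy_chars)}
V = len(toy_chars)

# Count bigrams exactly as in build_bigram_counts
toy_counts = np.zeros((V, V), dtype=np.int64)
for t in toy_tokens:
    for c1, c2 in zip(t, t[1:]):
        toy_counts[toy_idx[c1], toy_idx[c2]] += 1

# Add-one smoothing: (count + 1) / (row total + vocabulary size)
toy_probs = (toy_counts + 1) / (toy_counts.sum(axis=1, keepdims=True) + V)

print(toy_probs.sum(axis=1))                   # each row sums to 1 (up to rounding)
print(toy_probs[toy_idx["b"], toy_idx["b"]])   # (0 + 1) / (2 + 4): "bb" was never seen
print(toy_probs[toy_idx[">"]])                 # uniform row: ">" has no successors
```

Without the `+ 1` and `+ V` terms, the `>` row would be 0/0 and the unseen bigram `bb` would have probability zero.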

### Generating Names with the Manual Approach

Now that we have our bigram probabilities, we can generate new names by sampling from these distributions.


In [None]:
def generate_name_manual(bigram_probs: np.ndarray, max_len: int = 20) -> str:
    name = ["<"]  # Start with the start token

    while True:
        # Get the last character
        last_char = name[-1]
        last_idx = char_to_idx[last_char]

        # Get the probability distribution for the next character
        next_char_probs = bigram_probs[last_idx]

        # Sample the next character
        next_idx = np.random.choice(vocab_size, p=next_char_probs)
        next_char = idx_to_char[next_idx]

        # Add to the name
        name.append(next_char)

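        # Stop at the end token or once the length cap is reached
        if next_char == ">" or len(name) > max_len:
            break

    # Drop the start token, and the end token only if one was sampled
    # (a name cut off by max_len keeps its last character)
    if name[-1] == ">":
        name = name[:-1]
    return "".join(name[1:])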


# Generate 10 names
print("Names generated using manual bigram approach:")
for _ in range(10):
    print(generate_name_manual(bigram_probs))
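
What does "sampling from a distribution" mean mechanically? One common way to picture it is inverse-CDF sampling: draw a uniform number in [0, 1) and pick the first index whose cumulative probability exceeds it. The sketch below is a conceptual illustration only; `sample_index` is a hypothetical helper, and this is not a claim about how `np.random.choice` is implemented internally.

```python
import random
from itertools import accumulate


def sample_index(probs, u):
    """Return the first index whose cumulative probability exceeds u (0 <= u < 1)."""
    for i, cum in enumerate(accumulate(probs)):
        if u < cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off


probs = [0.2, 0.5, 0.3]
print(sample_index(probs, 0.10))  # 0, since 0.10 < 0.2
print(sample_index(probs, 0.25))  # 1, since 0.2 <= 0.25 < 0.7
print(sample_index(probs, 0.95))  # 2

# Over many draws, the empirical frequencies approach probs
rng = random.Random(0)
draws = [sample_index(probs, rng.random()) for _ in range(10_000)]
freqs = [draws.count(i) / len(draws) for i in range(len(probs))]
print(freqs)
```

Characters with larger probabilities own wider slices of the [0, 1) interval, so they are picked more often.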

## 3. Neural Approach: Learning Bigram Patterns

Now, let's implement a neural network that learns the same bigram patterns automatically. We'll use a single linear layer with no hidden layers, so its `vocab_size × vocab_size` weight matrix plays the same role as the count table.


In [None]:
# Prepare training data
def prepare_training_data(names: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
    # Add start and end tokens
    names = ["<" + name + ">" for name in names]

    # Create input-output pairs
    xs = []  # Input characters
    ys = []  # Target characters (next character)

    for name in names:
        for c1, c2 in zip(name, name[1:], strict=False):
            xs.append(char_to_idx[c1])
            ys.append(char_to_idx[c2])

    # Convert to PyTorch tensors
    xs = torch.tensor(xs)
    ys = torch.tensor(ys)

    return xs, ys


# Prepare the data
xs, ys = prepare_training_data(names)

print(f"Number of training examples: {len(xs)}")
print("First 5 input-output pairs:")
for i in range(5):
    print(f"  {idx_to_char[xs[i].item()]} → {idx_to_char[ys[i].item()]}")

In [None]:
# Define a simple neural network model
class BigramLanguageModel:
    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        # Initialize weights randomly
        self.W = torch.randn(vocab_size, vocab_size, requires_grad=True)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # One-hot encode the input
        x_one_hot = torch.zeros(len(idx), self.vocab_size)
        x_one_hot.scatter_(1, idx.unsqueeze(1), 1.0)

        # Forward pass (equivalent to embedding lookup)
        logits = x_one_hot @ self.W

        return logits

    def loss(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Cross-entropy loss
        loss = torch.nn.functional.cross_entropy(logits, targets)
        return loss

    def update(self, lr: float = 0.1) -> None:
        # Manual gradient descent
        with torch.no_grad():
            self.W -= lr * self.W.grad
            self.W.grad = None


# Initialize model
model = BigramLanguageModel(vocab_size)

# Training loop
# Each iteration is one mini-batch gradient step, not a full pass over the data.
# 100 steps of 32 examples draw only 3,200 training pairs, so raise n_steps
# for a closer fit to the bigram statistics.
n_steps = 100
batch_size = 32
losses = []

for step in range(n_steps):
    # Random batch
    idx = torch.randint(0, len(xs), (batch_size,))

    # Forward pass
    logits = model.forward(xs[idx])
    loss = model.loss(logits, ys[idx])

    # Backward pass
    loss.backward()

    # Update
    model.update(lr=0.1)

    # Track loss
    losses.append(loss.item())

    if step % 10 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")

# Plot loss over time
plt.figure(figsize=(10, 6))
plt.plot(losses)
plt.title("Loss vs. Training Step")
plt.xlabel("Step")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
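
The `forward` method above notes that multiplying a one-hot vector by `W` is equivalent to an embedding lookup. A small sketch to check that claim, using an illustrative 5-character vocabulary and a random `W` (both are assumptions for the demo, not the trained model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 5                                   # illustrative vocabulary size
W = torch.randn(V, V)
idx = torch.tensor([3, 0, 3])           # a small batch of character indices

# Route 1: one-hot encode, then matrix-multiply
x_one_hot = F.one_hot(idx, num_classes=V).float()
logits_matmul = x_one_hot @ W

# Route 2: pick the matching rows of W directly
logits_lookup = W[idx]

print(torch.allclose(logits_matmul, logits_lookup))  # True
```

Each row of `W` therefore holds the logits for "what follows character `i`", which is why the weight matrix can be compared with the count table row by row.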

### Generating Names with the Neural Approach

Now that we've trained our neural network, let's use it to generate new names.


In [None]:
def generate_name_neural(model: BigramLanguageModel, max_len: int = 20) -> str:
    out = []
    idx = char_to_idx["<"]  # Start token

    while True:
        # Forward pass
        x_one_hot = torch.zeros(1, model.vocab_size)
        x_one_hot[0, idx] = 1.0
        logits = x_one_hot @ model.W

        # Apply softmax to get probabilities
        probs = torch.nn.functional.softmax(logits, dim=1)

        # Sample from the distribution
        idx = torch.multinomial(probs, num_samples=1).item()

        # Add to output
        out.append(idx_to_char[idx])

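        # Stop at the end token or once the length cap is reached
        if idx_to_char[idx] == ">" or len(out) >= max_len:
            break

    # Remove the end token only if one was sampled
    # (a name cut off by max_len keeps its last character)
    if out and out[-1] == ">":
        out = out[:-1]
    return "".join(out)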


# Generate 10 names
print("Names generated using neural network approach:")
for _ in range(10):
    name = generate_name_neural(model)
    print(name)

## 4. Comparing the Approaches

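Let's compare the two models. The network's weights are logits, so we first apply a row-wise softmax to turn them into probabilities, then compare those with the smoothed bigram probabilities from the manual approach.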


In [None]:
# Convert neural network weights to probabilities
neural_probs = torch.softmax(model.W, dim=1).detach().numpy()

# Visualize the neural network probabilities
plt.figure(figsize=(12, 10))
plt.imshow(neural_probs, cmap="Blues")
plt.colorbar()
plt.title("Neural Network Learned Probabilities")
plt.xlabel("Next Character")
plt.ylabel("Current Character")
plt.xticks(range(vocab_size), chars, rotation=90)
plt.yticks(range(vocab_size), chars)
plt.show()

# Calculate the difference between manual and neural probabilities
diff = np.abs(neural_probs - bigram_probs)

# Visualize the difference
plt.figure(figsize=(12, 10))
plt.imshow(diff, cmap="Reds")
plt.colorbar()
plt.title("Difference Between Manual and Neural Probabilities")
plt.xlabel("Next Character")
plt.ylabel("Current Character")
plt.xticks(range(vocab_size), chars, rotation=90)
plt.yticks(range(vocab_size), chars)
plt.show()

# Calculate the mean absolute difference
mean_diff = np.mean(diff)
print(
    f"Mean absolute difference between manual and neural probabilities: {mean_diff:.6f}"
)
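
Why should the two heatmaps look alike at all? A short derivation sketch, using notation introduced here rather than names from the code: let $N_{ij}$ be the number of times character $j$ follows character $i$ in the training pairs, $N_i = \sum_j N_{ij}$, and $N = \sum_i N_i$. With $p_{ij} = \operatorname{softmax}(W_i)_j$, the cross-entropy loss over the full dataset and its gradient are

$$
\mathcal{L}(W) = -\frac{1}{N} \sum_{i,j} N_{ij} \log p_{ij},
\qquad
\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{1}{N}\left(N_i \, p_{ij} - N_{ij}\right).
$$

The gradient vanishes when $p_{ij} = N_{ij} / N_i$, which is the unsmoothed relative frequency of each bigram. Mini-batch training only approaches this point in expectation, and two differences from `bigram_probs` remain even after long training: the manual table uses add-one smoothing, and any row with $N_i = 0$ (here the `>` row, which is never an input) receives no gradient and keeps its random initial values.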

## 5. Advantages of Neural Networks

While our simple neural network learns essentially the same patterns as our manual counting approach, neural networks offer several advantages:

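1. **Scalability**: A count table needs one row per context, so it grows exponentially with context length (`vocab_size ** n` rows for `n` characters of context); a neural network can share parameters across contexts instead
2. **Expressiveness**: With embeddings and hidden layers, the same training loop can learn patterns that go beyond adjacent-character statistics
3. **Generalization**: Shared parameters let a network assign sensible probabilities to contexts it never saw, where a count table can only fall back on smoothing
4. **Feature Learning**: Useful features are learned from the data rather than designed by hand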

A natural next step is to give the network more context, predicting each character from the previous two characters (a trigram model) instead of only one.
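
A minimal sketch of such a trigram extension follows. It makes several assumptions that are not from the notebook: a three-name toy corpus, two start tokens as padding, a two-character context encoded as the single index `c1 * V + c2`, and full-batch gradient descent with an arbitrary learning rate and step count.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data (illustrative; swap in the real names list in the notebook)
toy_names = ["emma", "ella", "anna"]
chars = sorted(set("<>" + "".join(toy_names)))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# Build (two-character context) -> (next character) pairs
ctx, tgt = [], []
for name in toy_names:
    s = "<<" + name + ">"                      # pad with two start tokens
    for c1, c2, c3 in zip(s, s[1:], s[2:]):
        ctx.append(stoi[c1] * V + stoi[c2])    # encode the pair as one index
        tgt.append(stoi[c3])
ctx, tgt = torch.tensor(ctx), torch.tensor(tgt)

# One row of logits per two-character context
W = torch.randn(V * V, V, requires_grad=True)

losses = []
for step in range(200):
    logits = W[ctx]                            # row lookup, same as one-hot @ W
    loss = F.cross_entropy(logits, tgt)
    loss.backward()
    with torch.no_grad():
        W -= 1.0 * W.grad                      # plain gradient descent
    W.grad = None
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only the data preparation and the shape of `W` changed; the loss and update steps are the same as in the bigram model. This version still stores one row per context, so it has as many parameters as a trigram count table. Sharing parameters across contexts requires embeddings or hidden layers.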


## Summary

In this module, we've seen how neural networks can learn the same patterns that we previously had to specify manually. We've implemented both approaches side-by-side and compared their results.

Key takeaways:
- Manual counting is explicit and interpretable but limited in scalability
- Neural networks can learn the same patterns automatically from data
- The neural approach can be extended to more complex models with minimal changes
- With enough training steps, both approaches generate similar names, which indicates that the neural network has learned the bigram patterns

In the next module, we'll explore PyTorch fundamentals in more depth and build more sophisticated neural network architectures for language modeling.