# Module 4: PyTorch Fundamentals

## Introduction

Welcome to the fourth module in our series on neural networks and language modeling! This notebook covers the essential PyTorch concepts needed for building more sophisticated neural network architectures.

In this module, we'll explore PyTorch's core features and how they make deep learning development more efficient and intuitive. We'll build on the concepts from previous modules, but now using PyTorch's higher-level APIs.

### What You'll Learn

- **Tensors**: PyTorch's fundamental data structure
- **Autograd**: Automatic differentiation for building and training neural networks
- **Neural Network API**: Building models with PyTorch's nn module
- **Optimizers**: Using PyTorch's optimization algorithms
- **Practical Example**: Building a more sophisticated language model

Let's start by setting up our environment.


In [None]:
# Install required packages
!pip install torch numpy matplotlib

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import torch

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

# Set up plotting
plt.style.use("ggplot")
%matplotlib inline

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Tensors: PyTorch's Fundamental Data Structure

Tensors are multi-dimensional arrays similar to NumPy arrays but with additional features for deep learning. They can be placed on a GPU for accelerated computing, and tensors created with `requires_grad=True` record the operations applied to them so that gradients can be computed during backpropagation.

### Creating Tensors


In [None]:
# Creating tensors from Python lists
x = torch.tensor([1, 2, 3, 4])
print(f"Tensor from list: {x}")
print(f"Shape: {x.shape}, Dtype: {x.dtype}")

# Creating tensors with specific data types
x_float = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
print(f"Float tensor: {x_float}")
print(f"Shape: {x_float.shape}, Dtype: {x_float.dtype}")

# Creating tensors with specific shapes
zeros = torch.zeros(2, 3)  # 2x3 tensor of zeros
ones = torch.ones(2, 3)  # 2x3 tensor of ones
rand = torch.rand(2, 3)  # 2x3 tensor of uniform random values in [0, 1)
randn = torch.randn(2, 3)  # 2x3 tensor of standard normal values (mean=0, std=1)

print(f"Zeros:\n{zeros}")
print(f"Ones:\n{ones}")
print(f"Random uniform:\n{rand}")
print(f"Random normal:\n{randn}")

# Creating tensors with specific devices
if torch.cuda.is_available():
    x_gpu = torch.tensor([1, 2, 3, 4], device="cuda")
    print(f"GPU tensor: {x_gpu}")
else:
    print("GPU not available")

### Tensor Operations

PyTorch provides a rich set of operations for manipulating tensors.


In [None]:
# Basic arithmetic operations
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

print(f"a + b = {a + b}")
print(f"a - b = {a - b}")
print(f"a * b = {a * b}")  # Element-wise multiplication
print(f"a / b = {a / b}")

# Matrix operations
m1 = torch.tensor([[1, 2], [3, 4]])
m2 = torch.tensor([[5, 6], [7, 8]])

print(f"Matrix multiplication (m1 @ m2):\n{m1 @ m2}")
print(f"Element-wise multiplication (m1 * m2):\n{m1 * m2}")

# Aggregation operations
x = torch.tensor([1, 2, 3, 4, 5])
print(f"Sum: {x.sum()}")
print(f"Mean: {x.float().mean()}")  # mean() requires a floating-point dtype
print(f"Max: {x.max()}")
print(f"Min: {x.min()}")

# Reshaping operations
x = torch.tensor([1, 2, 3, 4, 5, 6])
print(f"Original: {x}, shape: {x.shape}")

x_reshaped = x.reshape(2, 3)
print(f"Reshaped: \n{x_reshaped}, shape: {x_reshaped.shape}")

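# view() never copies: it needs a compatible memory layout and always shares
# storage with the original tensor. reshape() returns a view when it can and
# a copy otherwise.
x_view = x.view(3, 2)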
print(f"View: \n{x_view}, shape: {x_view.shape}")
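
The difference between `view` and `reshape` is easiest to see on a non-contiguous tensor. The following is a minimal sketch; the names `base`, `viewed`, `transposed`, and `flat` are illustrative and not part of the notebook.

```python
import torch

base = torch.arange(6)             # tensor([0, 1, 2, 3, 4, 5])
viewed = base.view(2, 3)           # shares storage with `base`
viewed[0, 0] = 100                 # the write is visible through `base`
print(base[0].item())              # 100

# Transposing changes the strides, so the result is no longer contiguous
transposed = viewed.t()
print(transposed.is_contiguous())  # False

# view() cannot reinterpret a non-contiguous layout and raises instead
try:
    transposed.view(6)
except RuntimeError as err:
    print(f"view failed with {type(err).__name__}")

# reshape() falls back to a copy when a view is not possible
flat = transposed.reshape(6)
flat[0] = -1
print(base[0].item())              # still 100, because `flat` is a copy
```

When code relies on memory sharing, `view` is the safer choice because it fails loudly instead of silently copying.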

### Converting Between NumPy and PyTorch

PyTorch tensors can be easily converted to and from NumPy arrays.


In [None]:
# NumPy array to PyTorch tensor
np_array = np.array([1, 2, 3, 4, 5])
tensor = torch.from_numpy(np_array)

print(f"NumPy array: {np_array}")
print(f"PyTorch tensor: {tensor}")

# PyTorch tensor to NumPy array
tensor = torch.tensor([1, 2, 3, 4, 5])
np_array = tensor.numpy()

print(f"PyTorch tensor: {tensor}")
print(f"NumPy array: {np_array}")

# Note: torch.from_numpy() and Tensor.numpy() share memory with their source
# when the tensor lives on the CPU, so in-place changes show up on both sides.
# torch.tensor(np_array) makes a copy instead.
np_array = np.array([1, 2, 3, 4, 5])
tensor = torch.from_numpy(np_array)
np_array[0] = 100  # Modify the NumPy array

print(f"Modified NumPy array: {np_array}")
print(f"PyTorch tensor (shared memory): {tensor}")
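
One practical consequence of shared memory is that the NumPy dtype carries over. The sketch below assumes a small float array (`np_floats` and the other names are illustrative) and shows which conversions share memory and which copy.

```python
import numpy as np
import torch

np_floats = np.array([1.0, 2.0, 3.0])   # NumPy defaults to float64
shared = torch.from_numpy(np_floats)     # keeps float64 and shares memory
print(shared.dtype)                      # torch.float64

as_float32 = shared.to(torch.float32)    # changing dtype forces a copy
independent = torch.tensor(np_floats)    # torch.tensor() always copies

np_floats[0] = 100.0                     # modify the NumPy array in place
print(shared[0].item())                  # 100.0 (shared memory)
print(as_float32[0].item())              # 1.0 (copy)
print(independent[0].item())             # 1.0 (copy)
```

Since `nn` layers default to `float32` parameters, converting with `.to(torch.float32)` (or `.float()`) is usually needed before feeding NumPy data to a model.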

## 2. Autograd: Automatic Differentiation

PyTorch's autograd system enables automatic computation of gradients, which is essential for training neural networks. It tracks operations on tensors and automatically computes gradients during backpropagation.


In [None]:
# Create tensors with gradient tracking
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Perform operations
z = x**2 + y**3

print(f"x = {x}")
print(f"y = {y}")
print(f"z = x^2 + y^3 = {z}")

# Compute gradients
z.backward()

# Access gradients
print(f"dz/dx = {x.grad}")  # Should be 2*x = 4
print(f"dz/dy = {y.grad}")  # Should be 3*y^2 = 27
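
A quick way to build trust in autograd is to compare it with a numerical estimate. This is a sketch under stated assumptions: the helper `f`, the step size `eps`, and the use of `float64` are choices made for illustration, not part of the notebook.

```python
import torch


def f(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Same function as in the cell above: z = x^2 + y^3
    return x**2 + y**3


# float64 keeps the finite-difference rounding error small
x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
y = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
f(x, y).backward()

# Central differences: (f(v + eps) - f(v - eps)) / (2 * eps)
eps = 1e-6
with torch.no_grad():
    numeric_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    numeric_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

print(f"dz/dx: autograd={x.grad.item():.6f}, numeric={numeric_dx.item():.6f}")
print(f"dz/dy: autograd={y.grad.item():.6f}, numeric={numeric_dy.item():.6f}")
```

Both pairs should agree to several decimal places, with the analytic values being 4 and 27.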

### Gradient Accumulation and Zeroing

By default, PyTorch accumulates gradients: each `backward()` call adds to `.grad` instead of overwriting it. Zero the gradients before each backward pass when you run multiple iterations.


In [None]:
# Create a tensor with gradient tracking
x = torch.tensor(2.0, requires_grad=True)

# First backward pass
y = x**2
y.backward()
print(f"First gradient: {x.grad}")  # Should be 2*x = 4

# Second backward pass (gradients accumulate!)
y = x**2
y.backward()
print(f"Accumulated gradient: {x.grad}")  # Should be 4 + 4 = 8

# Zero out gradients
x.grad.zero_()
y = x**2
y.backward()
print(f"After zeroing: {x.grad}")  # Should be 4 again

### Gradient Tracking Control

Sometimes you want to disable gradient tracking to save memory or improve performance, especially during evaluation.


In [None]:
# Using torch.no_grad()
x = torch.tensor(2.0, requires_grad=True)

with torch.no_grad():
    y = x**2
    print(f"y computed with no_grad: {y}")
    print(f"y.requires_grad: {y.requires_grad}")

# Using .detach()
x = torch.tensor(2.0, requires_grad=True)
y = x**2
z = y.detach()  # Creates a new tensor that doesn't require gradients

print(f"y: {y}, requires_grad: {y.requires_grad}")
print(f"z: {z}, requires_grad: {z.requires_grad}")

## 3. Neural Network API: torch.nn

PyTorch's `nn` module provides high-level building blocks for creating neural networks. It includes layers, activation functions, loss functions, and more.

### Building a Simple Neural Network


In [None]:
import torch.nn as nn
import torch.nn.functional as F


# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, output_size: int) -> None:
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# Create an instance of the network
input_size = 10
hidden_size = 20
output_size = 5
model = SimpleNN(input_size, hidden_size, output_size)

# Print the model architecture
print(model)

# Count the number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params}")

# Examine the parameters
for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
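
The parameter count can be checked by hand: a linear layer holds one weight per input-output pair plus one bias per output. The sketch below uses a hypothetical helper, `linear_param_count`, and an `nn.Sequential` stand-in with the same layer sizes as `SimpleNN`.

```python
import torch.nn as nn


def linear_param_count(in_features: int, out_features: int) -> int:
    # Weight matrix of shape (out, in) plus one bias per output unit
    return in_features * out_features + out_features


# Same layer sizes as above: 10 -> 20 -> 5
expected = linear_param_count(10, 20) + linear_param_count(20, 5)

layers = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
actual = sum(p.numel() for p in layers.parameters())

print(f"By formula: {expected}, counted: {actual}")  # 325 and 325
```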

### Forward Pass

Let's see how to perform a forward pass through our neural network.


In [None]:
# Create a random input tensor
batch_size = 3
x = torch.randn(batch_size, input_size)

# Forward pass
output = model(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:\n{output}")

### Common Neural Network Layers

PyTorch provides a wide range of layers for building neural networks.


In [None]:
# Linear (Fully Connected) Layer
linear = nn.Linear(10, 5)
x = torch.randn(3, 10)
output = linear(x)
print(f"Linear layer output shape: {output.shape}")

# Convolutional Layer
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 28, 28)  # (batch_size, channels, height, width)
output = conv(x)
print(f"Conv2d layer output shape: {output.shape}")

# Recurrent Layer
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 10)  # (batch_size, sequence_length, input_size)
output, hidden = rnn(x)
print(f"RNN output shape: {output.shape}")
print(f"RNN hidden state shape: {hidden.shape}")

# Embedding Layer
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8)
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
output = embedding(x)
print(f"Embedding layer output shape: {output.shape}")
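
The `Conv2d` output shape follows from the kernel size, stride, padding, and dilation. The sketch below writes that relationship as a hypothetical helper, `conv_out_size`, and compares it with what the layer actually produces for two stride values.

```python
import torch
import torch.nn as nn


def conv_out_size(
    size: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1
) -> int:
    # Output length along one spatial dimension of a convolution
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


for stride in (1, 2):
    conv = nn.Conv2d(3, 16, kernel_size=3, stride=stride, padding=1)
    out = conv(torch.randn(1, 3, 28, 28))
    predicted = conv_out_size(28, kernel_size=3, stride=stride, padding=1)
    print(f"stride={stride}: layer output {tuple(out.shape)}, formula gives {predicted}")
```

With `kernel_size=3` and `padding=1`, stride 1 preserves the 28x28 size and stride 2 halves it to 14x14.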

### Activation Functions

Activation functions introduce non-linearity into neural networks, allowing them to learn complex patterns.


In [None]:
# Create a sample tensor
x = torch.linspace(-5, 5, 100)

# Apply different activation functions
relu = F.relu(x)
sigmoid = torch.sigmoid(x)
tanh = torch.tanh(x)
softmax = F.softmax(x, dim=0)  # Softmax along the first dimension

# Plot the activation functions
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.plot(x.numpy(), relu.numpy())
plt.title("ReLU")
plt.grid(True)

plt.subplot(2, 2, 2)
plt.plot(x.numpy(), sigmoid.numpy())
plt.title("Sigmoid")
plt.grid(True)

plt.subplot(2, 2, 3)
plt.plot(x.numpy(), tanh.numpy())
plt.title("Tanh")
plt.grid(True)

plt.subplot(2, 2, 4)
plt.plot(x.numpy(), softmax.numpy())
plt.title("Softmax")
plt.grid(True)

plt.tight_layout()
plt.show()
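
To see why the non-linearity matters, note that two linear layers with nothing in between collapse into a single linear map. This is a minimal sketch with illustrative layer sizes and names (`fc1`, `fc2`, `merged_weight`, `merged_bias`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(4, 8), nn.Linear(8, 3)
x = torch.randn(5, 4)

with torch.no_grad():
    # Two linear layers applied back to back, no activation
    stacked = fc2(fc1(x))

    # The same computation folded into one weight matrix and one bias:
    # W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
    merged_weight = fc2.weight @ fc1.weight
    merged_bias = fc2.weight @ fc1.bias + fc2.bias
    single = x @ merged_weight.T + merged_bias

    # With a ReLU in between, the network is no longer a single linear map
    with_relu = fc2(torch.relu(fc1(x)))

print(f"Stacked equals single linear map: {torch.allclose(stacked, single, atol=1e-5)}")
print(f"Max difference once ReLU is added: {(with_relu - single).abs().max().item():.4f}")
```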

## 4. Optimizers: Training Neural Networks

PyTorch provides various optimization algorithms for training neural networks. These optimizers update the model parameters based on the computed gradients.


In [None]:
import torch.optim as optim

# Create a simple model
model = nn.Linear(10, 1)

# Create an optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop (simplified)
for epoch in range(5):
    # Forward pass
    x = torch.randn(5, 10)
    y_true = torch.randn(5, 1)
    y_pred = model(x)

    # Compute loss
    loss = F.mse_loss(y_pred, y_true)
    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

    # Backward pass
    optimizer.zero_grad()  # Zero out gradients
    loss.backward()  # Compute gradients
    optimizer.step()  # Update parameters
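
For plain SGD without momentum or weight decay, `optimizer.step()` amounts to `p = p - lr * p.grad` for every parameter. The sketch below checks this on two copies of the same model; `model_a`, `model_b`, and the random data are illustrative.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(10, 1)
model_b = copy.deepcopy(model_a)  # identical starting weights
x, y_true = torch.randn(5, 10), torch.randn(5, 1)
lr = 0.01

# Path A: let the optimizer apply the update
opt = optim.SGD(model_a.parameters(), lr=lr)
opt.zero_grad()
F.mse_loss(model_a(x), y_true).backward()
opt.step()

# Path B: apply the same update by hand
F.mse_loss(model_b(x), y_true).backward()
with torch.no_grad():  # parameter updates must not be tracked by autograd
    for p in model_b.parameters():
        p -= lr * p.grad

match = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(f"Manual update matches optimizer.step(): {match}")  # True
```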

### Common Optimizers

PyTorch provides several optimization algorithms, each with its own characteristics.


In [None]:
# Create a simple model
model = nn.Linear(10, 1)

# Create different optimizers
sgd = optim.SGD(model.parameters(), lr=0.01)
sgd_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = optim.Adam(model.parameters(), lr=0.01)
rmsprop = optim.RMSprop(model.parameters(), lr=0.01)

print("Available optimizers:")
print(f"SGD: {sgd}")
print(f"SGD with momentum: {sgd_momentum}")
print(f"Adam: {adam}")
print(f"RMSprop: {rmsprop}")

### Learning Rate Schedulers

Learning rate schedulers adjust the learning rate during training, which can help improve convergence.


In [None]:
from torch.optim.lr_scheduler import ExponentialLR, ReduceLROnPlateau, StepLR

# Create a model and optimizer
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Create different schedulers
step_scheduler = StepLR(optimizer, step_size=2, gamma=0.1)
exp_scheduler = ExponentialLR(optimizer, gamma=0.9)
plateau_scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=2)

# Simulate training with StepLR
print("Training with StepLR:")
for epoch in range(5):
    optimizer.step()  # Stand-in for the training step; call it before scheduler.step()
    print(f"Epoch {epoch}, LR: {optimizer.param_groups[0]['lr']:.4f}")
    step_scheduler.step()

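# Reset the optimizer and attach a fresh ExponentialLR to it: a scheduler only
# updates the optimizer it was constructed with
optimizer = optim.SGD(model.parameters(), lr=0.1)
exp_scheduler = ExponentialLR(optimizer, gamma=0.9)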

# Simulate training with ExponentialLR
print("\nTraining with ExponentialLR:")
for epoch in range(5):
    optimizer.step()  # Stand-in for the training step; call it before scheduler.step()
    print(f"Epoch {epoch}, LR: {optimizer.param_groups[0]['lr']:.4f}")
    exp_scheduler.step()
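
`ReduceLROnPlateau` works differently from the other two schedulers: its `step()` takes the metric being monitored. The sketch below uses made-up validation losses (`val_losses`) that stop improving, and separate `demo_` names so it does not interfere with the cell above.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

demo_model = nn.Linear(10, 1)
demo_optimizer = optim.SGD(demo_model.parameters(), lr=0.1)
demo_scheduler = ReduceLROnPlateau(demo_optimizer, mode="min", factor=0.1, patience=2)

# Hypothetical validation losses that plateau after the second epoch
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8]
lrs = []

for epoch, val_loss in enumerate(val_losses):
    demo_optimizer.step()          # stands in for a real training epoch
    demo_scheduler.step(val_loss)  # pass the monitored metric
    lrs.append(demo_optimizer.param_groups[0]["lr"])
    print(f"Epoch {epoch}, val loss: {val_loss:.2f}, LR: {lrs[-1]:.4f}")
```

The learning rate stays at 0.1 while the loss improves and is multiplied by `factor` once the loss has failed to improve for more than `patience` consecutive epochs.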

## 5. Practical Example: Building a Language Model

Let's apply what we've learned to build a more sophisticated language model using PyTorch's high-level APIs. We'll create a character-level language model with a recurrent neural network (RNN).

First, let's load and prepare our data:


In [None]:
# Load the names dataset
with open("../../02 - Makemore/names.txt") as f:
    names = f.read().splitlines()

# Convert to lowercase
names = [name.lower() for name in names]

# Add start and end tokens
names_with_tokens = ["<" + name + ">" for name in names]

# Create vocabulary
chars = sorted(list(set("".join(names_with_tokens))))
vocab_size = len(chars)

# Create mappings between characters and indices
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")


# Prepare training data
def prepare_training_data(names: list[str]) -> tuple[torch.Tensor, torch.Tensor]:
    # Add start and end tokens
    names = ["<" + name + ">" for name in names]

    # Create input-output pairs
    xs = []  # Input characters
    ys = []  # Target characters (next character)

    for name in names:
        for c1, c2 in zip(name, name[1:], strict=False):
            xs.append(char_to_idx[c1])
            ys.append(char_to_idx[c2])

    # Convert to PyTorch tensors
    xs = torch.tensor(xs, dtype=torch.long)
    ys = torch.tensor(ys, dtype=torch.long)

    return xs, ys


# Prepare the data
xs, ys = prepare_training_data(names)

print(f"Number of training examples: {len(xs)}")
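
To make the input-output pairs concrete, here is the same pairing applied to two names. The list `sample_names` and the `_demo` variables are illustrative and independent of the real dataset.

```python
import torch

sample_names = ["emma", "ava"]  # tiny stand-in for the names dataset
tokens = ["<" + name + ">" for name in sample_names]

# Vocabulary and lookup table built the same way as above
vocab = sorted(set("".join(tokens)))
stoi = {ch: i for i, ch in enumerate(vocab)}
print(vocab)  # ['<', '>', 'a', 'e', 'm', 'v']

# Each character is paired with the character that follows it
pairs = [(c1, c2) for token in tokens for c1, c2 in zip(token, token[1:])]
print(pairs[:5])
# [('<', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '>')]

xs_demo = torch.tensor([stoi[c1] for c1, _ in pairs], dtype=torch.long)
ys_demo = torch.tensor([stoi[c2] for _, c2 in pairs], dtype=torch.long)
print(xs_demo[:5].tolist())  # [0, 3, 4, 4, 2]
print(ys_demo[:5].tolist())  # [3, 4, 4, 2, 1]
```

A name of length n contributes n + 1 pairs, because the start and end tokens each take part in one pair.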

### Defining the RNN Language Model

Now, let's define our RNN-based language model using PyTorch's nn module:


In [None]:
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, hidden_dim: int) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(
        self, x: torch.Tensor, hidden: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # x shape: (batch_size, sequence_length)
        # If sequence_length is 1, we're predicting one character at a time

        # Embed the input
        embedded = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)

        # Pass through RNN
        output, hidden = self.rnn(
            embedded, hidden
        )  # output: (batch_size, sequence_length, hidden_dim)

        # Pass through fully connected layer
        output = self.fc(output)  # (batch_size, sequence_length, vocab_size)

        return output, hidden

    def init_hidden(self, batch_size: int) -> torch.Tensor:
        # Initialize hidden state
        return torch.zeros(1, batch_size, self.rnn.hidden_size)


# Create the model
embedding_dim = 32
hidden_dim = 64
model = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim)

print(model)

### Training the Model

Let's train our RNN language model:


In [None]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training parameters
batch_size = 128
num_epochs = 10
seq_length = 1  # We're treating each character as a sequence of length 1

# Create DataLoader
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(xs.unsqueeze(1), ys)  # Add sequence dimension
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Training loop
losses = []

for epoch in range(num_epochs):
    epoch_loss = 0
    hidden = model.init_hidden(batch_size)

    for batch_idx, (inputs, targets) in enumerate(dataloader):
        # Adjust hidden state size for the last batch which might be smaller
        if inputs.size(0) != batch_size:
            hidden = model.init_hidden(inputs.size(0))

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        output, hidden = model(inputs, hidden)

        # Reshape output for loss calculation
        output = output.squeeze(1)  # Remove sequence dimension

        # Calculate loss
        loss = criterion(output, targets)

        # Backward pass
        loss.backward()

        # Update parameters
        optimizer.step()

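        # Detach the hidden state so the next batch does not backpropagate into
        # this batch's graph. Note: the batches are shuffled (char, next char)
        # pairs, so the carried state holds no real sequence context here.
        hidden = hidden.detach()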

        # Track loss
        epoch_loss += loss.item()

    # Average loss for the epoch
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}")

# Plot loss over time
plt.figure(figsize=(10, 6))
plt.plot(losses)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.show()
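
A useful reference point when reading the loss curve is a model that assigns equal probability to every character. Its cross-entropy is ln(V) for a vocabulary of size V. The sketch below assumes V = 28 (26 lowercase letters plus the two tokens); the real `vocab_size` depends on the dataset.

```python
import math

import torch
import torch.nn as nn

vocab_size_demo = 28  # assumption: 26 letters plus '<' and '>'

# All-zero logits give a uniform distribution after softmax
logits = torch.zeros(4, vocab_size_demo)
targets = torch.tensor([0, 5, 10, 27])  # arbitrary target indices

baseline = nn.CrossEntropyLoss()(logits, targets).item()
print(f"Uniform-guess loss: {baseline:.4f}")
print(f"ln(V):              {math.log(vocab_size_demo):.4f}")
```

A training loss clearly below this baseline indicates the model has learned something about which characters tend to follow which.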

### Generating Names with the RNN Model

Now, let's use our trained RNN model to generate new names:


In [None]:
def generate_name_rnn(model: RNNLanguageModel, max_len: int = 20) -> str:
    model.eval()  # Set to evaluation mode

    with torch.no_grad():
        # Start with the start token
        current_idx = char_to_idx["<"]
        hidden = model.init_hidden(1)

        # Store generated characters
        chars = []

        for _ in range(max_len):
            # Prepare input
            x = torch.tensor([[current_idx]], dtype=torch.long)

            # Forward pass
            output, hidden = model(x, hidden)

            # Get probabilities
            probs = torch.softmax(output.squeeze(), dim=0)

            # Sample from the distribution
            current_idx = torch.multinomial(probs, num_samples=1).item()

            # Convert to character
            current_char = idx_to_char[current_idx]

            # Check if end token
            if current_char == ">":
                break

            # Add to result
            chars.append(current_char)

    return "".join(chars)


# Generate 10 names
print("Names generated using RNN model:")
for _ in range(10):
    name = generate_name_rnn(model)
    print(name)

## Summary

In this module, we've covered the essential PyTorch concepts needed for building neural networks:

1. **Tensors**: PyTorch's fundamental data structure for representing and manipulating data
2. **Autograd**: Automatic differentiation for computing gradients during backpropagation
3. **Neural Network API**: Building models with PyTorch's nn module
4. **Optimizers**: Using PyTorch's optimization algorithms for training neural networks
5. **Practical Example**: Building an RNN-based language model for generating names

These concepts form the foundation for building more sophisticated neural network architectures, such as those used in state-of-the-art language models like GPT.

### Key Takeaways for Senior Developers

- **PyTorch's Design**: PyTorch follows a "define-by-run" paradigm, building the computation graph as the code executes, so ordinary Python control flow and debuggers work directly on model code
- **Tensor Operations**: Most NumPy operations have PyTorch equivalents, but with added benefits like GPU acceleration and automatic differentiation
- **Neural Network Building Blocks**: PyTorch provides high-level abstractions for common neural network components
- **Training Loop Pattern**: PyTorch's training loop pattern (forward pass, loss computation, backward pass, parameter update) is consistent across different model architectures
- **Ecosystem Integration**: PyTorch integrates well with Python's scientific computing ecosystem (NumPy, Pandas, Matplotlib, etc.)
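
The define-by-run point can be illustrated with a function whose computation depends on its input. In this sketch, `piecewise` is a hypothetical example function; autograd differentiates whichever branch actually ran.

```python
import torch


def piecewise(x: torch.Tensor) -> torch.Tensor:
    # Ordinary Python control flow decides which operations are recorded
    if x.item() > 0:
        return x**2
    return -3 * x


grads = {}
for value in (2.0, -1.0):
    x = torch.tensor(value, requires_grad=True)
    piecewise(x).backward()
    grads[value] = x.grad.item()
    print(f"x = {value}: gradient = {grads[value]}")
# x = 2.0 takes the x**2 branch, so the gradient is 2 * x = 4.0
# x = -1.0 takes the -3 * x branch, so the gradient is -3.0
```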

With these fundamentals, you're now equipped to explore more advanced neural network architectures and techniques for natural language processing and other machine learning tasks.