# Tutorial 8-2: The Ghost of Shakespeare – "Character-Level RNN"

**Course:** CSEN 342: Deep Learning  
**Topic:** Recurrent Neural Networks (RNNs), Sequence Modeling, and Text Generation

## Objective
In the lecture (Slide 53), we introduced the **Vanilla RNN** equation:
$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$

Most deep learning libraries provide a black-box `nn.RNN` layer that hides this logic. In this tutorial, we will **open the black box**. 

We will:
1.  **Implement `RNNCell` from scratch:** You will write the raw matrix multiplications and activation functions.
2.  **Unroll the Network:** You will write the loop that passes the "hidden state" memory from one step to the next.
3.  **Train on Shakespeare:** We will teach the network to generate text that looks like (gibberish) Shakespeare.

---

## Part 1: Data Preparation

We treat text as a sequence of characters. Our goal is: given a sequence of characters (e.g., "hell"), predict the next character ("o").
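
To make this concrete, here is a minimal sketch on a toy string. The string `"hello"` and the names `toy_text`, `toy_vocab`, and `toy_char_to_idx` are illustrative stand-ins, not part of the tutorial code. It builds the same kind of character-to-index mapping used in the next cell and shows that the targets are simply the inputs shifted by one position.

```python
# Toy corpus standing in for the Shakespeare text (illustrative only)
toy_text = "hello"

# Sorted unique characters form the vocabulary
toy_vocab = sorted(set(toy_text))                      # ['e', 'h', 'l', 'o']
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}

# Encode the whole string as integer indices
encoded = [toy_char_to_idx[ch] for ch in toy_text]     # [1, 0, 2, 2, 3]

# Inputs are every character but the last; targets are shifted by one
toy_inputs = encoded[:-1]     # "hell" -> [1, 0, 2, 2]
toy_targets = encoded[1:]     # "ello" -> [0, 2, 2, 3]

for x, y in zip(toy_inputs, toy_targets):
    print(f"input {toy_vocab[x]!r} -> target {toy_vocab[y]!r}")
```

Every position in the sequence therefore yields one training example, not just the final character.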

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import os
import sys

# 1. Download Data (Tiny Shakespeare)
data_root = '../data'
os.makedirs(data_root, exist_ok=True)
file_path = os.path.join(data_root, 'tinyshakespeare.txt')

if not os.path.exists(file_path):
    print("Downloading Tiny Shakespeare...")
    # Use urllib so the download does not depend on an external `wget` binary,
    # and write straight to file_path so no rename step can fail afterwards
    import urllib.request
    urllib.request.urlretrieve(
        "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt",
        file_path,
    )

# 2. Load and Tokenize
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)

# Mappings
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Total characters: {len(text):,}")
print(f"Unique characters (Vocab size): {vocab_size}")
print(f"First 100 chars:\n{text[:100]}")

---

## Part 2: The Raw RNN Cell

This is the heart of the tutorial. We will implement the equation from **Slide 53** manually.

We need two linear transformations:
1.  `i2h`: Input to Hidden ($W_{xh} x_t$)
2.  `h2h`: Hidden to Hidden ($W_{hh} h_{t-1}$)

The combined state is passed through a `Tanh` activation.
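
As a reference point, PyTorch's built-in `nn.RNNCell` computes the same update, with a bias term added to each of the two linear maps (the slide equation omits the biases, while `nn.Linear` includes them by default). The sketch below is a sanity check under assumed toy sizes: it recomputes one step by hand from the built-in cell's parameters and compares the result. The sizes and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, arbitrary sizes chosen only for illustration
input_size, hidden_size, batch = 4, 3, 2

ref_cell = nn.RNNCell(input_size, hidden_size, nonlinearity="tanh")

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)

# Manual update using the reference cell's own parameters:
# h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
with torch.no_grad():
    manual_h = torch.tanh(
        x_t @ ref_cell.weight_ih.T + ref_cell.bias_ih
        + h_prev @ ref_cell.weight_hh.T + ref_cell.bias_hh
    )
    builtin_h = ref_cell(x_t, h_prev)

# The two results should agree up to floating-point tolerance
print(torch.allclose(manual_h, builtin_h, atol=1e-6))
```

The cell we write next has the same structure: two linear maps, a sum, and a `tanh`.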

In [None]:
class VanillaRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # W_xh: Input -> Hidden
        self.i2h = nn.Linear(input_size, hidden_size)
        # W_hh: Hidden -> Hidden
        self.h2h = nn.Linear(hidden_size, hidden_size)
        
    def forward(self, x, hidden):
        # The Core Equation (Slide 53)
        # h_t = tanh(W_xh * x_t + W_hh * h_{t-1})
        
        # 1. Compute contributions
        from_input = self.i2h(x)
        from_hidden = self.h2h(hidden)
        
        # 2. Combine and Activate
        next_hidden = torch.tanh(from_input + from_hidden)
        
        return next_hidden

---

## Part 3: The Unrolled Network

An RNN cell only handles *one time step*. To process a sentence, we need a loop. This corresponds to the "unrolled" computational graph seen in **Slide 61**.

**Architecture:**
1.  **Embedding:** Converts char indices to vectors.
2.  **RNN Loop:** Updates hidden state step-by-step.
3.  **Output Layer:** Converts the hidden state to one score (logit) per vocabulary character ($y_t = W_{hy} h_t$). A softmax turns these scores into probabilities.
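
The loop in step 2 is what a library layer such as `nn.RNN` runs internally. The following sketch, under assumed toy sizes, copies the weights of a single-layer `nn.RNN` into an `nn.RNNCell`, unrolls the cell by hand, and checks that both produce the same hidden states. It uses built-in modules rather than our custom classes so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, not the tutorial's hyperparameters
batch, seq_len, input_size, hidden_size = 2, 5, 4, 3

rnn = nn.RNN(input_size, hidden_size, batch_first=True)   # built-in, single layer
cell = nn.RNNCell(input_size, hidden_size)

# Copy the layer's weights into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

x = torch.randn(batch, seq_len, input_size)
h = torch.zeros(batch, hidden_size)

# Manual unrolling: one cell call per time step, carrying h forward
steps = []
with torch.no_grad():
    for t in range(seq_len):
        h = cell(x[:, t, :], h)
        steps.append(h)
    unrolled = torch.stack(steps, dim=1)       # (batch, seq_len, hidden_size)
    builtin_out, builtin_h = rnn(x)            # initial hidden state defaults to zeros

print(unrolled.shape)
print(torch.allclose(unrolled, builtin_out, atol=1e-5))
```

Stacking along `dim=1` keeps the batch dimension first, which matches the `(Batch, Seq_Len, ...)` layout used throughout this tutorial.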

In [None]:
class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_dim):
        super().__init__()
        self.hidden_size = hidden_size
        
        # 1. Embedding Layer (Optional but recommended for better performance)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # 2. Our Custom Cell
        self.rnn_cell = VanillaRNNCell(embedding_dim, hidden_size)
        
        # 3. Output Layer (Hidden -> Vocab Class scores)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_seq, hidden=None):
        # input_seq shape: (Batch, Seq_Len)
        batch_size, seq_len = input_seq.size()
        
        if hidden is None:
            hidden = self.init_hidden(batch_size)
            
        # Convert indices to vectors: (Batch, Seq, Emb_Dim)
        embeds = self.embedding(input_seq)
        
        outputs = []
        
        # --- The Recurrent Loop (Slide 61) ---
        for t in range(seq_len):
            # Extract input at time t
            x_t = embeds[:, t, :] 
            
            # Update hidden state using our custom cell
            hidden = self.rnn_cell(x_t, hidden)
            
            # Compute output y_t for this step (Many-to-Many architecture)
            out_t = self.fc(hidden)
            outputs.append(out_t)
            
        # Stack outputs to shape (Batch, Seq_Len, Vocab_Size)
        return torch.stack(outputs, dim=1), hidden

    def init_hidden(self, batch_size):
        # Allocate the zero state directly on the model's device
        return torch.zeros(batch_size, self.hidden_size, device=next(self.parameters()).device)

---

## Part 4: Training Loop

We treat this as a standard classification problem. At every time step $t$, we try to predict the character at $t+1$.
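
Because every time step is its own classification example, the `(Batch, Seq_Len, Vocab)` logits can be flattened to `(Batch*Seq_Len, Vocab)` before computing the loss. The sketch below uses random stand-in tensors (the sizes `B`, `T`, `V` are illustrative) to check that the flattened loss equals the average of the per-position losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 2 sequences, 3 time steps, vocabulary of 5 characters
B, T, V = 2, 3, 5
logits = torch.randn(B, T, V)              # stand-in for the model's outputs
targets = torch.randint(0, V, (B, T))      # stand-in for next-character indices

criterion = nn.CrossEntropyLoss()          # default reduction is the mean

# Flatten so every (sequence, time step) pair is one classification example
flat_loss = criterion(logits.view(-1, V), targets.view(-1))

# Same quantity computed one position at a time, then averaged
per_position = [
    criterion(logits[b, t].unsqueeze(0), targets[b, t].unsqueeze(0))
    for b in range(B) for t in range(T)
]
mean_loss = torch.stack(per_position).mean()

print(flat_loss.item(), mean_loss.item())
```

Note that `nn.CrossEntropyLoss` expects raw logits and applies the log-softmax itself, so the model should not apply a softmax before the loss.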

In [None]:
# Hyperparameters
hidden_size = 128
embedding_dim = 64
seq_len = 50
batch_size = 64
lr = 0.005
epochs = 2000  # Number of training iterations (one random mini-batch each), not full passes over the data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CharRNN(vocab_size, hidden_size, embedding_dim).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

# Helper to get random batch
def get_batch(text_data, seq_len, batch_size):
    inputs = []
    targets = []
    for _ in range(batch_size):
        start_idx = np.random.randint(0, len(text_data) - seq_len - 1)
        chunk = text_data[start_idx : start_idx + seq_len + 1]
        # Convert to indices
        indices = [char_to_idx[c] for c in chunk]
        inputs.append(indices[:-1])   # positions 0 .. seq_len-1
        targets.append(indices[1:])   # positions 1 .. seq_len (inputs shifted by one)
        
    return torch.tensor(inputs).to(device), torch.tensor(targets).to(device)

print("Starting Training...")
losses = []

for i in range(epochs):
    inputs, targets = get_batch(text, seq_len, batch_size)
    
    optimizer.zero_grad()
    
    # Forward pass
    # Note: each batch starts from a fresh zero hidden state, so gradients never flow
    # across batches. This acts like Truncated BPTT without an explicit detach().
    outputs, _ = model(inputs)
    
    # Flatten outputs for CrossEntropy: (Batch*Seq, Vocab)
    loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
    loss.backward()
    
    # Gradient Clipping (Slide 80 - Critical for Vanilla RNNs!)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    
    optimizer.step()
    
    if i % 100 == 0:
        losses.append(loss.item())
        print(f"Iter {i} Loss: {loss.item():.4f}")

print("Training Complete.")
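
To see what the gradient clipping call in the loop above does, here is a small sketch on a stand-in model. The `toy_model` and the inflated loss scale are contrived purely to produce large gradients. `clip_grad_norm_` rescales all gradients in place so that their combined L2 norm does not exceed `max_norm`, and returns the norm it measured before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model; the large loss scale is contrived to produce big gradients
toy_model = nn.Linear(10, 10)
x = torch.randn(8, 10)
loss = (toy_model(x) * 1000.0).pow(2).mean()
loss.backward()

max_norm = 5.0
# Rescales gradients in place and returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm)

# Recompute the global L2 norm over all parameter gradients after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_model.parameters()))

print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

Clipping changes only the length of the gradient vector, not its direction, which is why it tames exploding gradients without redirecting the update.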

---

## Part 5: Generation (Dreaming)

Now the fun part. We can seed the model with a short string (e.g., "ROMEO:") and ask it to predict the next character. We then feed that prediction back in as the next input.

**Sampling:** Instead of always taking the `argmax` (greedy decoding, which tends to get stuck in repetitive loops), we sample from the predicted probability distribution. The `temperature` parameter controls how much variety this introduces.
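
The effect of temperature is easy to see on a fixed set of scores. The sketch below uses made-up logits for a four-character vocabulary (the values are illustrative): dividing by a temperature below 1 sharpens the distribution toward the top choice, while a temperature above 1 flattens it.

```python
import torch

# Made-up logits for a 4-character vocabulary (illustrative values)
logits = torch.tensor([2.0, 1.0, 0.5, 0.0])

for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=0)
    print(f"T={temperature}: {[round(p, 3) for p in probs.tolist()]}")

# Greedy decoding always returns the same index for the same logits
greedy_idx = torch.argmax(logits).item()

# Sampling draws an index in proportion to its probability
torch.manual_seed(0)
probs = torch.softmax(logits / 0.8, dim=0)
samples = torch.multinomial(probs, num_samples=10, replacement=True).tolist()
print("greedy:", greedy_idx, "sampled:", samples)
```

As the temperature approaches zero, sampling approaches greedy decoding. Very high temperatures approach uniform random characters.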

In [None]:
def generate(model, start_str="The", predict_len=200, temperature=0.8):
    model.eval()
    hidden = None
    
    # Convert start string to indices
    input_indices = [char_to_idx[c] for c in start_str]
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)
    
    # "Prime" the network (build up hidden state context) on every character
    # except the last, so the last character is not fed to the model twice
    # We only care about the hidden state, not the outputs here
    if input_tensor.size(1) > 1:
        with torch.no_grad():
            _, hidden = model(input_tensor[:, :-1], hidden)
    
    # Use the last character as the first input for generation
    current_input = input_tensor[:, -1].unsqueeze(1)
    
    generated_str = start_str
    
    for _ in range(predict_len):
        with torch.no_grad():
            # Single step forward
            output, hidden = model(current_input, hidden)
            
            # Output is (Batch=1, Seq=1, Vocab)
            logits = output.squeeze() 
            
            # Apply Temperature (Higher = crazier, Lower = safer)
            probs = torch.softmax(logits / temperature, dim=0)
            
            # Sample from distribution
            char_idx = torch.multinomial(probs, 1).item()
            
            # Append to string
            generated_str += idx_to_char[char_idx]
            
            # Prepare input for next step
            current_input = torch.tensor([[char_idx]]).to(device)
            
    return generated_str

print("--- Generated Text (Temperature 0.8) ---")
print(generate(model, start_str="ROMEO:", temperature=0.8))

print("\n--- Generated Text (Temperature 0.5 - Safer) ---")
print(generate(model, start_str="ROMEO:", temperature=0.5))

### Discussion
You will likely see text that *looks* like a play (CAPITALIZED names, newlines, maybe some archaic words like "thee" or "thou"), even if it doesn't make total sense.

**Why this is amazing:**
We never taught the model what a "word" is, how to spell, or any grammar rules. It learned all of this purely by predicting the next character given the characters that came before it, with the hidden state acting as its summary of that context.

**Limitations of Vanilla RNN:**
Even with more training, the model struggles to keep track of long-term context (e.g., closing a parenthesis opened 100 characters ago). Part of this is our setup, since each training sequence is only `seq_len = 50` characters long. The deeper cause is the **Vanishing Gradient** problem (Slide 78), which LSTMs (Tutorial 22) are designed to mitigate.
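
The effect can be observed directly in a stripped-down recurrence. The sketch below is an illustration under contrived assumptions, not a measurement of the model trained above: it ignores inputs and biases, and it fixes the recurrent matrix to 0.5 times a random orthogonal matrix so that every singular value is 0.5. It then measures how strongly the final hidden state depends on the initial one as the number of steps grows.

```python
import torch

torch.manual_seed(0)
hidden_size = 16

# Assumed setup: recurrent matrix = 0.5 * (random orthogonal matrix),
# so every singular value of W_hh is 0.5
q, _ = torch.linalg.qr(torch.randn(hidden_size, hidden_size))
W_hh = 0.5 * q

def grad_norm_through_time(num_steps):
    """Norm of d(sum(h_T)) / d(h_0) after num_steps tanh updates with no input."""
    h0 = torch.randn(hidden_size, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(W_hh @ h)
    h.sum().backward()
    return h0.grad.norm().item()

for steps in (1, 5, 10, 25, 50):
    print(f"{steps:3d} steps back: gradient norm = {grad_norm_through_time(steps):.3e}")
```

Each step multiplies the gradient by the `tanh` derivative (at most 1) and by `W_hh`, so in this setup the norm shrinks by at least a factor of two per step. A signal from 50 characters back contributes almost nothing to the weight update.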