# st126161
# A2: Language Model using LSTM
## Harry Potter (7-Book Dataset)

This notebook implements a character-level LSTM language model trained on
the complete Harry Potter book series (7 books) obtained from Kaggle.

The model is trained with GPU acceleration (Apple Silicon MPS in this run, with
CUDA and CPU as fallbacks) and generates text from a user-provided prompt.


In [2]:
import os
import time
import torch
import torch.nn as nn
import numpy as np

print(torch.__version__)



2.8.0


In [3]:
# This run used the M4 chip on a MacBook Air.

# Device selection: CUDA (NVIDIA), MPS (Apple Silicon), or CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    device_name = torch.cuda.get_device_name(0)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    device_name = "Apple Silicon GPU (MPS)"
else:
    device = torch.device("cpu")
    device_name = "CPU"

print("Using device:", device)
print("Device name:", device_name)

# Release cached MPS memory; only relevant when the MPS backend is in use
if device.type == "mps":
    torch.mps.empty_cache()


Using device: mps
Device name: Apple Silicon GPU (MPS)


# Load data

Data was downloaded from Kaggle (https://www.kaggle.com/datasets/shubhammaindola/harry-potter-books?resource=download)

In [4]:
data_dir = "harry_potter_books"

all_text = ""
for filename in sorted(os.listdir(data_dir)):
    if filename.endswith(".txt"):
        with open(os.path.join(data_dir, filename), "r", encoding="utf-8") as f:
            all_text += f.read() + "\n"

print("Total characters in dataset:", len(all_text))


Total characters in dataset: 6285445


In [5]:
text = all_text.lower()


In [6]:
# Display a short glimpse of the dataset
print("=== Dataset Sample (First 1,000 characters) ===\n")
print(text[:1000])


=== Dataset Sample (First 1,000 characters) ===

m r. and mrs. dursley, of number four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.

mr. dursley was the director of a firm called grunnings, which made drills. he was a big, beefy man with hardly any neck, although he did have a very large mustache. mrs. dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. the dursleys had a small son called dudley and in their opinion there was no finer boy anywhere.

the dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. they didn’t think they could bear it if anyone found out about the potters. mrs. potter was mrs. dur

The first 1,000 characters of the combined corpus were printed to check that the text loaded correctly and reads continuously. The sample shows one small artifact in the source files: the opening word appears as "m r." rather than "mr.".

# Pre-processing

## Character level tokenization

In [7]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for c, i in char2idx.items()}

print("Vocabulary size:", vocab_size)


Vocabulary size: 80
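
To make the mapping concrete, here is a minimal sketch of the same construction on a toy string. The names `sample_text`, `encode`, and `decode` are illustrative and are not part of the notebook.

```python
# Toy corpus standing in for the lowercased book text (illustrative only).
sample_text = "harry potter"

# Same construction as the cell above: sorted unique characters -> integer ids.
sample_chars = sorted(set(sample_text))
sample_char2idx = {c: i for i, c in enumerate(sample_chars)}
sample_idx2char = {i: c for c, i in sample_char2idx.items()}


def encode(s, mapping):
    """Turn a string into a list of integer ids."""
    return [mapping[c] for c in s]


def decode(ids, mapping):
    """Turn a list of integer ids back into a string."""
    return "".join(mapping[i] for i in ids)


ids = encode("harry", sample_char2idx)
roundtrip = decode(ids, sample_idx2char)
print(sample_chars)
print(ids, "->", roundtrip)
```

Encoding followed by decoding returns the original string, which is the property the generation function relies on later.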


In [8]:
import pickle

with open("vocab.pkl", "wb") as f:
    pickle.dump((char2idx, idx2char), f, protocol=pickle.HIGHEST_PROTOCOL)

print("Vocabulary saved successfully")


Vocabulary saved successfully


## Encode text

In [21]:
encoded_text = np.array([char2idx[c] for c in text])


## Create input target sequence

In [22]:
seq_length = 40

# Build every overlapping window in one vectorized step instead of a Python loop
data = torch.from_numpy(encoded_text).long()

# unfold yields len(data) - seq_length + 1 windows; the last window has no
# following character to predict, so it is dropped
X = data.unfold(0, seq_length, 1)[:-1].contiguous()
y = data[seq_length:]

print("Input shape:", X.shape)
print("Target shape:", y.shape)


Input shape: torch.Size([6285405, 40])
Target shape: torch.Size([6285405])
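
The shapes follow from the windowing: a corpus of N characters yields N - seq_length pairs (6,285,445 - 40 = 6,285,405). A small sketch on a six-element array shows how each window lines up with its target; `toy`, `toy_X`, and `toy_y` are illustrative names, and NumPy's `sliding_window_view` is used here only for the demonstration.

```python
import numpy as np

toy = np.arange(6)        # stands in for encoded_text: [0 1 2 3 4 5]
toy_seq_length = 3

# All length-3 windows; the last one has no next element, so it is dropped.
windows = np.lib.stride_tricks.sliding_window_view(toy, toy_seq_length)
toy_X = windows[:-1]
toy_y = toy[toy_seq_length:]

for window, target in zip(toy_X, toy_y):
    print(window, "->", target)
```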


In [23]:
X = X.to(device)
y = y.to(device)
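
Moving the full `X` tensor to the device means holding 6,285,405 × 40 int64 values, roughly 2 GB, in device memory. If that becomes a problem, one possible alternative is to keep only the encoded corpus in memory and slice windows on demand. The sketch below assumes a hypothetical `WindowDataset` helper that is not part of this notebook; each batch would then be moved with `.to(device)` inside the training loop.

```python
import torch
from torch.utils.data import Dataset


class WindowDataset(Dataset):
    """Hypothetical helper: returns (window, next-character) pairs lazily."""

    def __init__(self, encoded, seq_length):
        self.data = torch.as_tensor(encoded, dtype=torch.long)
        self.seq_length = seq_length

    def __len__(self):
        # One pair per position that still has a following character
        return len(self.data) - self.seq_length

    def __getitem__(self, i):
        window = self.data[i:i + self.seq_length]
        target = self.data[i + self.seq_length]
        return window, target


toy_ds = WindowDataset(list(range(10)), seq_length=4)
first_window, first_target = toy_ds[0]
print(len(toy_ds), first_window.tolist(), first_target.item())
```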


# Define LSTM

In [24]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out


In [25]:
embed_size = 128
hidden_size = 256

model = LSTMLanguageModel(vocab_size, embed_size, hidden_size).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
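
A quick way to sanity-check the architecture is to pass a random batch through it and inspect the output shape and parameter count. The sketch below repeats the same layer layout under the illustrative name `DemoLSTMLanguageModel` so that it runs on its own; the batch of 4 random sequences is made up for the check.

```python
import torch
import torch.nn as nn


class DemoLSTMLanguageModel(nn.Module):
    """Same layer layout as LSTMLanguageModel above, repeated for a standalone check."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x = self.embedding(x)          # (batch, seq, embed)
        out, _ = self.lstm(x)          # (batch, seq, hidden)
        return self.fc(out[:, -1, :])  # logits for the next character only


demo_model = DemoLSTMLanguageModel(vocab_size=80, embed_size=128, hidden_size=256)
dummy_batch = torch.randint(0, 80, (4, 40))  # 4 random sequences of length 40

with torch.no_grad():
    logits = demo_model(dummy_batch)

n_params = sum(p.numel() for p in demo_model.parameters())
print("Logits shape:", tuple(logits.shape))
print("Trainable parameters:", n_params)
```

Only the last time step feeds the linear layer, so each input sequence produces a single vector of 80 logits, one per vocabulary character.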


# Training loop

Because the combined seven-book corpus is large, I used mini-batch training with PyTorch’s DataLoader to keep per-step memory bounded and to make use of GPU acceleration on Apple Silicon via MPS.

In [26]:
from torch.utils.data import TensorDataset, DataLoader

batch_size = 128  # Safe for Apple Silicon

dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

epochs = 10
start_time = time.time()

for epoch in range(epochs):
    model.train()
    epoch_loss = 0.0

    for batch_X, batch_y in dataloader:
        optimizer.zero_grad()

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(dataloader)
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")

end_time = time.time()
print("Training time (seconds):", end_time - start_time)


Epoch [1/10], Loss: 1.4028
Epoch [2/10], Loss: 1.3477
Epoch [3/10], Loss: 1.3890
Epoch [4/10], Loss: 1.4660
Epoch [5/10], Loss: 1.5186
Epoch [6/10], Loss: 1.5714
Epoch [7/10], Loss: 1.5304
Epoch [8/10], Loss: 1.5491
Epoch [9/10], Loss: 1.5397
Epoch [10/10], Loss: 1.5287
Training time (seconds): 13460.781935930252
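
The logged loss reaches its minimum at epoch 2 (1.3477) and then rises to about 1.53. One possible explanation, not verified in this notebook, is that the learning rate of 0.003 is too high for this setup. The sketch below shows a training step with gradient-norm clipping, a common stabilizer for recurrent models. The `train_step` helper, the `max_norm=1.0` value, and the toy model and data are assumptions for illustration only.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, criterion, batch_X, batch_y, max_norm=1.0):
    """One optimization step with gradient-norm clipping (hypothetical helper)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(batch_X), batch_y)
    loss.backward()
    # Rescale gradients so their combined norm does not exceed max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()


# Toy stand-ins so the sketch runs on its own
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Embedding(10, 8), nn.Flatten(), nn.Linear(8 * 5, 10))
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
toy_criterion = nn.CrossEntropyLoss()
toy_X = torch.randint(0, 10, (16, 5))
toy_y = torch.randint(0, 10, (16,))

losses = [
    train_step(toy_model, toy_optimizer, toy_criterion, toy_X, toy_y)
    for _ in range(20)
]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In the notebook's loop, the clipping call would go between `loss.backward()` and `optimizer.step()`. Lowering `lr` or saving the weights from the best epoch are other options worth trying.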


# Text Generation Function

In [29]:
import torch.nn.functional as F

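def generate_text(model, start_text, gen_length=200, temperature=0.8):
    model.eval()

    # Skip prompt characters that are not in the vocabulary to avoid a KeyError
    prompt_ids = [char2idx[c] for c in start_text.lower() if c in char2idx]
    if not prompt_ids:
        raise ValueError("Prompt contains no characters from the vocabulary")

    input_seq = torch.tensor(
        prompt_ids, dtype=torch.long, device=device
    ).unsqueeze(0)

    generated_text = start_text

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        for _ in range(gen_length):
            output = model(input_seq)

            # Apply temperature
            logits = output / temperature
            probs = F.softmax(logits, dim=1)

            # Sample instead of argmax
            next_char_idx = torch.multinomial(probs, 1).item()
            generated_text += idx2char[next_char_idx]

            # Slide the context window forward by one character
            next_token = torch.tensor(
                [[next_char_idx]], dtype=torch.long, device=device
            )
            input_seq = torch.cat([input_seq[:, 1:], next_token], dim=1)

    return generated_text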


# Testing the Model

In [31]:
prompt = "harry potter is"
print(generate_text(model, prompt, gen_length=300, temperature=0.8))


harry potter is always a probling the said on the peciar, where she to be as they seen the sead him.

“all you closed looking them, complacing with her people. puns minders. hermione what she town. he know.”

harry had behnied or ell, his has better. that had see angling harry was lused to the dox crabbe chick to 


Initial text generation using greedy decoding produced repetitive output, because always choosing the single most probable character tends to lock the model into loops. To address this, temperature-based sampling was implemented (temperature 0.8 in the example above), which increased output diversity and reduced repetition in the generated samples.
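
The effect of temperature can be seen on a small set of logits. The sketch below uses only the standard library; `softmax_with_temperature` and the three example logits are illustrative and are not taken from the trained model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (illustrative helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
probs_by_temp = {
    t: softmax_with_temperature(toy_logits, t) for t in (0.5, 0.8, 1.5)
}

for t, probs in probs_by_temp.items():
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top character, which approaches greedy decoding. Higher temperatures flatten the distribution and make sampling more varied.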

Preprocessing began by loading the seven Harry Potter book files and merging them into a single corpus of 6,285,445 characters. The text was converted to lowercase, which reduces the vocabulary (80 unique characters here) and simplifies the prediction task. The first 1,000 characters were printed to verify that the text loaded correctly. Character-level tokenization was then applied by identifying all unique characters and mapping them to integer indices. The entire corpus was encoded numerically, after which fixed-length input sequences of 40 characters were generated, each paired with the character that follows it as the prediction target. Finally, the data was converted into PyTorch tensors and transferred to the selected computation device.

A character-level LSTM language model was implemented using PyTorch. The model consists of an embedding layer (128 dimensions) that transforms character indices into dense vectors, followed by a single LSTM layer (256 hidden units) that captures sequential dependencies within the text. A final fully connected layer projects the LSTM output at the last time step to the vocabulary space to predict the next character. The model was trained for 10 epochs with a cross-entropy loss and the Adam optimizer (learning rate 0.003), using mini-batches of 128 sequences from a DataLoader to keep memory usage manageable. The average training loss was lowest at epoch 2 (1.3477) and had risen to 1.5287 by epoch 10.

# Exporting the model for web application use

In [32]:
torch.save(model.state_dict(), "harry_potter_lstm.pth")
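
Because only the `state_dict` is saved, the web application has to rebuild the same architecture and load the vocabulary before restoring the weights. The following is a minimal sketch of that round trip, assuming small stand-in files (`demo_lstm.pth`, `demo_vocab.pkl`) and toy layer sizes written to a temporary directory. In the real application the paths would be `harry_potter_lstm.pth` and `vocab.pkl`, with the sizes used in training.

```python
import os
import pickle
import tempfile

import torch
import torch.nn as nn


class LSTMLanguageModel(nn.Module):
    """Must match the architecture used at training time."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return self.fc(out[:, -1, :])


# Stand-ins for the trained artifacts so the sketch runs on its own
tmp_dir = tempfile.mkdtemp()
weights_path = os.path.join(tmp_dir, "demo_lstm.pth")
vocab_path = os.path.join(tmp_dir, "demo_vocab.pkl")

demo_char2idx = {c: i for i, c in enumerate(sorted(set("harry potter")))}
demo_idx2char = {i: c for c, i in demo_char2idx.items()}
trained = LSTMLanguageModel(len(demo_char2idx), 16, 32)
torch.save(trained.state_dict(), weights_path)
with open(vocab_path, "wb") as f:
    pickle.dump((demo_char2idx, demo_idx2char), f)

# What the web application would do at startup
with open(vocab_path, "rb") as f:
    loaded_char2idx, loaded_idx2char = pickle.load(f)

restored = LSTMLanguageModel(len(loaded_char2idx), 16, 32)
# map_location="cpu" lets weights saved from an MPS or CUDA model load anywhere
restored.load_state_dict(torch.load(weights_path, map_location="cpu"))
restored.eval()

probe = torch.tensor([[loaded_char2idx[c] for c in "harry"]])
with torch.no_grad():
    same = torch.equal(trained.eval()(probe), restored(probe))
print("Restored model matches original:", same)
```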
