# A2

- A2 : Language Model
- Name : Supanut Kompayak
- ID : st126055

#### TASK 1 Dataset Acquistion

In [1]:
import os
import random
import urllib.request
import math

url = "https://dgoldberg.sdsu.edu/515/harrypotter.txt"
filename = "harrypotter.txt"

def download_data(url,filename):
    if not os.path.exists(filename):
        try:
            urllib.request.urlretrieve(url, filename)
            print(f"Downloaded {filename}")
        except Exception as e:
            print(f"Error downloading file: {e}")
    else:
        print(f"{filename} already exists.")

download_data(url, filename)

harrypotter.txt already exists.


##### Preprocessing

In [2]:
def load_and_split_data(filepath, tran_ratio=0.8, val_ratio = 0.1):
    with open(filepath, 'r', encoding = 'cp1252') as f:
        full_text = f.read()

    full_text = full_text.lower()
    paragraphs = [p.strip() for p in full_text.split('\n') if p.strip() != '']

    random.seed(42)
    random.shuffle(paragraphs)

    n_total = len(paragraphs)
    n_train = int(n_total * tran_ratio)
    n_val = int(n_total * val_ratio)

    train_data = paragraphs[:n_train]
    val_data = paragraphs[n_train:n_train + n_val]
    test_data = paragraphs[n_train + n_val:]

    return train_data, val_data, test_data

train_raw, val_raw, test_raw = load_and_split_data(filename)
for i in range(2):
    print(f"Sample {i+1} : {train_raw[i]}")

Sample 1 : "only joking, mom."
Sample 2 : next second, quirrell came hurrying out of the classroom straightening his turban. he was pale and looked as though he was about to cry. he strode out of sight; harry didn't think quirrell had even noticed him.


##### Tokenization & Numericallization

In [3]:
import torch
import collections
import os

# Tokenizer
def simple_tokenizer(text):
    return text.lower().split()

# Create Vocab class
class Vocab:
    def __init__(self, token_to_idx, idx_to_token):
        self.stoi = token_to_idx # String to Int
        self.itos = idx_to_token # Int to String
        self['<unk>'] 

    def __getitem__(self, token):
        # If the word is not found, return the index of <unk>
        return self.stoi.get(token, self.stoi.get('<unk>'))

    def __len__(self):
        return len(self.stoi)

def build_vocab_from_iterator(iterator, min_freq=3):
    # Count frequency of each token
    counter = collections.Counter()
    for text in iterator:
        tokens = simple_tokenizer(text)
        counter.update(tokens)
    
    # Sort tokens by frequency
    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    ordered_dict = collections.OrderedDict(sorted_by_freq_tuples)
    
    token_to_idx = {'<unk>': 0, '<eos>': 1}
    idx_to_token = {0: '<unk>', 1: '<eos>'}
    
    idx = 2 
    for token, freq in ordered_dict.items():
        if freq >= min_freq:
            token_to_idx[token] = idx
            idx_to_token[idx] = token
            idx += 1
            
    return Vocab(token_to_idx, idx_to_token)


print("‚è≥ Creating Dictionary (Vocab) ...")

# Build Vocab from Train set
vocab = build_vocab_from_iterator(train_raw, min_freq=1)

print(f"‚úÖ Vocab created! Total vocabulary size: {len(vocab)}")
try:
    print(f"üîç Example: 'harry' index is: {vocab['harry']}")
except:
    print("‚ö†Ô∏è 'harry' might be removed due to low frequency or it's not in the data.")

# 3. Numericalization (Convert text to indices)

def data_process(raw_text_data):
    data = []
    for raw_text in raw_text_data:
        # Tokenize
        tokens = simple_tokenizer(raw_text)
        # Convert to indices
        token_ids = [vocab[token] for token in tokens]
        # Convert to Tensor
        if len(token_ids) > 0:
            data.append(torch.tensor(token_ids, dtype=torch.long))
            
    # Concatenate all data into a single tensor
    if len(data) > 0:
        return torch.cat(data)
    else:
        return torch.tensor([], dtype=torch.long)


train_data = data_process(train_raw)
val_data = data_process(val_raw)
test_data = data_process(test_raw)


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"üíª Using device: {device}")

# Move data to GPU
train_data = train_data.to(device)
val_data = val_data.to(device)
test_data = test_data.to(device)

print("-" * 30)
print(f"   Train: {train_data.shape[0]}")
print(f"   Val:   {val_data.shape[0]}")
print(f"   Test:  {test_data.shape[0]}")
print("-" * 30)

‚è≥ Creating Dictionary (Vocab) ...
‚úÖ Vocab created! Total vocabulary size: 9810
üîç Example: 'harry' index is: 10
üíª Using device: cuda
------------------------------
   Train: 63040
   Val:   7627
   Test:  7782
------------------------------


##### Batching

In [4]:
batch_size = 20
eval_batch_size = 10

def batchify(data, bsz):
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    #reshape data into bsz columns
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

train_data = batchify(train_data, batch_size)
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

### 1.1 Dataset Description
For this assignment, I selected the **Harry Potter** novel series (Books 1-7) as the text corpus.
- **Source:** [SDSU (San Diego State University)](https://dgoldberg.sdsu.edu/515/harrypotter.txt)
- **Characteristics:** The dataset represents a rich fantasy narrative with complex sentence structures, distinct dialogues, and unique vocabulary (e.g., spells, character names).
- **Suitability:** It is highly suitable for Language Modeling as it requires the model to learn long-term dependencies and context-specific terminology.

### 1.2 Preprocessing
The raw text undergoes the following preprocessing steps:
1.  **Lowercasing:** Converting all text to lowercase to maintain consistency.
2.  **Paragraph Splitting:** The text is split by newlines to form distinct samples.
3.  **Data Split:** The data is shuffled and split into:
    - **Train (80%)**: For learning weights.
    - **Validation (10%)**: For tuning hyperparameters.
    - **Test (10%)**: For final evaluation.

#### TASK 2 Model Training

In [5]:
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, ntoken, ninp, nhid, nlayers, dropout = 0.5):
        super(LSTMModel, self).__init__()
# embedding change index to vector
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        
        #LSTM
        #ninp = input size (vector)
        #nhid = hidden size (memory)
        #nlayers = number of layers of LSTM
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)

        # decoder change vector to index
        self.decoder = nn.Linear(nhid, ntoken)

        self.init_weights()
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        #input shape : [seq_len, batch_size]
        emb = self.drop(self.encoder(input))
        #output shape : [seq_len, batch_size, ninp]
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        #decoded
        #Collapse output to 2D tensor to apply linear layer
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))

        # return shape : [seq_len*batch_size, ntoken]
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden
    
    def init_hidden(self, bsz):
        #Created initial hidden state and cell state with zeros
        weight = next(self.parameters())
        return (weight.new_zeros(self.nlayers, bsz, self.nhid),
                weight.new_zeros(self.nlayers, bsz, self.nhid))

In [6]:
### Hyperparameters
ntokens = len(vocab)
emsize = 200 # embedding dimension
nhid = 200 # LSTM memory size
nlayers = 2 # number of LSTM layers
dropout = 0.2 # dropout rate

model = LSTMModel(ntokens, emsize, nhid, nlayers, dropout).to(device)

print(f"Model Structure:\n{model}")
print(f"Model is on device: {next(model.parameters()).device}")


Model Structure:
LSTMModel(
  (drop): Dropout(p=0.2, inplace=False)
  (encoder): Embedding(9810, 200)
  (rnn): LSTM(200, 200, num_layers=2, dropout=0.2)
  (decoder): Linear(in_features=200, out_features=9810, bias=True)
)
Model is on device: cuda:0


In [7]:
## Training Loop
import torch.optim as optim

lr = 0.001 

# Loss function 
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = optim.Adam(model.parameters(),lr = lr)

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size = 5, gamma = 0.5)

print(f"Training started... lr = {lr}")


Training started... lr = 0.001


In [8]:
## The Main Training Loop
import time
import math

bptt = 35 # Backpropagation through time sequence length

def get_batch(source, i):
    # cut a bptt length sequence from source starting at index i
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1) # flatten
    return data, target

# Repackage hidden states to detach them from their history
def repackage_hidden(h):
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

def evaluate(data_source):
    model.eval() # close dropout
    total_loss = 0.
    ntokens = len(vocab)
    hidden = model.init_hidden(eval_batch_size)

    with torch.no_grad(): # no need to calculate gradients
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output, hidden = model(data, hidden)
            hidden = repackage_hidden(hidden)

            # calculate loss
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()

    return total_loss / (len(data_source) - 1)

# train function 
def train():
    model.train() # turn on dropout
    total_loss = 0.
    start_time = time.time()
    ntokens = len(vocab)
    hidden = model.init_hidden(batch_size)

    for batch, i in enumerate(range(0, train_data.size(0) - 1),bptt):
        data, targets = get_batch(train_data, i )

        # detach hidden state to prevent backpropagating through entire history
        hidden = repackage_hidden(hidden)
        model.zero_grad()

        # forward pass
        output, hidden = model(data, hidden)

        # calculate loss
        loss = criterion(output.view(-1, ntokens), targets)

        # backward pass
        loss.backward()

        # gradient clipping 
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)

        # update weights
        optimizer.step()

        total_loss += loss.item()

        # print out log every 200 batches
        log_interval = 200
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            
            try:
                ppl = math.exp(cur_loss)
            except OverflowError:
                ppl = float('inf')

            print(f' epoch {epoch:3d} | {batch:5d}/{len(train_data)//bptt:5d} batches | '
                  f'ms/batch {elapsed * 1000 / log_interval:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

In [10]:
#### Main Execution Loop
epochs = 50
best_val_loss = float('inf')
print("Starting training loop...")

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    
    # Train
    train()

    # Evaluate on validation set
    val_loss = evaluate(val_data)
    try:
        val_ppl = math.exp(val_loss)
    except:
        val_ppl = float('inf')
    
    print(f'| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f} ')
    
    # Save the model if the validation loss is the best we've seen so far.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'model_v2.pth')
        print("  - Model saved.")
    
    scheduler.step()

Starting training loop...
 epoch   1 |   200/   90 batches | ms/batch  5.17 | loss  0.98 | ppl     2.68
 epoch   1 |   400/   90 batches | ms/batch  6.23 | loss  1.09 | ppl     2.97
 epoch   1 |   600/   90 batches | ms/batch  6.13 | loss  1.15 | ppl     3.17
 epoch   1 |   800/   90 batches | ms/batch  6.02 | loss  1.15 | ppl     3.15
 epoch   1 |  1000/   90 batches | ms/batch  6.51 | loss  1.16 | ppl     3.20
 epoch   1 |  1200/   90 batches | ms/batch  5.98 | loss  1.16 | ppl     3.18
 epoch   1 |  1400/   90 batches | ms/batch  6.24 | loss  1.11 | ppl     3.05
 epoch   1 |  1600/   90 batches | ms/batch  6.66 | loss  1.06 | ppl     2.88
 epoch   1 |  1800/   90 batches | ms/batch  6.40 | loss  1.09 | ppl     2.98
 epoch   1 |  2000/   90 batches | ms/batch  6.82 | loss  1.07 | ppl     2.91
 epoch   1 |  2200/   90 batches | ms/batch  6.11 | loss  1.02 | ppl     2.76
 epoch   1 |  2400/   90 batches | ms/batch  6.44 | loss  1.05 | ppl     2.85
 epoch   1 |  2600/   90 batches | ms/

it's seem that overfit

In [11]:
# WE have to load best model for this part

with open('model_v2.pth', 'rb') as f:
    model.load_state_dict(torch.load(f))

model = model.to(device)


try:
    model.load_state_dict(torch.load('model_v2.pth'))
    model = model.to(device)
    print("‚úÖ Loaded best model from 'model_v2.pth'")
except FileNotFoundError:
    print("‚ö†Ô∏è 'model_v2.pth' not found. Make sure to run the training loop first.")


‚úÖ Loaded best model from 'model_v2.pth'


In [12]:
import torch
import torch.nn.functional as F

def generate_text(prompt, max_words=50, temperature=1.0):
    model.to(device)
    model.eval()
    
    # 1. Use our custom simple_tokenizer
    tokens = simple_tokenizer(prompt)
    
    # Convert to indices (using vocab[...] as defined in our Vocab class)
    indices = [vocab[t] for t in tokens if t in vocab.stoi] # Check if words exist in the dictionary

    if not indices:
        print(f"Prompt '{prompt}' contains no known words in the vocabulary.")
        return

    # 2. Prepare Input Tensor to match the Model (batch_first=False)
    # Shape must be: [seq_len, batch_size=1]
    input_seq = torch.tensor(indices, dtype=torch.long).view(-1, 1).to(device)

    # Init Hidden State
    hidden = model.init_hidden(1) # batch_size = 1

    print(f"Prompt: {prompt}")
    print(prompt, end=" ", flush=True)

    with torch.no_grad():
        # Feed the entire prompt first (Warm-up hidden state)
        output, hidden = model(input_seq, hidden)
        
        # Get the last output (latest word)
        # output shape: [seq_len, batch_size, vocab_size]
        last_logits = output[-1, 0, :] 

        for _ in range(max_words):
            # Apply Temperature and Softmax
            word_weights = F.softmax(last_logits.div(temperature), dim=0)
            
            # Sample a word
            word_idx = torch.multinomial(word_weights, 1).item()
            
            # 3. Convert back to word (using .itos from your Vocab class)
            word = vocab.itos.get(word_idx, '<unk>')

            print(word, end=' ', flush=True)

            if word == '<eos>':
                break

            # Prepare input for the next step (single word)
            # Shape: [1, 1]
            input_seq = torch.tensor([[word_idx]], dtype=torch.long).to(device)
            
            # Forward pass
            output, hidden = model(input_seq, hidden)
            last_logits = output[-1, 0, :]

    print("\n" + "="*50)

##### This is what i try when min_freq = 3

In [35]:
generate_text("harry potter", max_words=100, temperature=0.8)

Prompt: harry potter
harry potter in a small, <unk> but harry got <unk> said harry <unk> <unk> <unk> said harry <unk> his first day in a <unk> when they spent the <unk> then." then hermione put out a <unk> marble staircase facing a history with him and <unk> but they alley he had found him <unk> but he just found his mouth on the first end of a <unk> <unk> harry had found a word to him. he had found out a hundred <unk> <unk> but this <unk> said harry <unk> his <unk> harry had decided at all. and you faint. harry had found her 


In [36]:
generate_text("harry potter", max_words=50, temperature=0.8)
generate_text("the wizarding world", max_words=50, temperature=1.0)
generate_text("the dark lord", max_words=50, temperature=1.2)

Prompt: harry potter
harry potter was a <unk> and he were saying --" "they're not <unk> said ron. he ran a lot of <unk> weren't a pair of six of that they had to <unk> them <unk> to about for the <unk> they had given a christmas <unk> <unk> he stared. "and he sent him 
Prompt: the wizarding world
the wizarding world so <unk> never never all <unk> hagrid looked as though much <unk> you-know-who was no time, and then had saying to follow a small, on the hall. they had entered a small, empty chamber asked harry. he had found a way into front of her, put you <unk> <unk> at 
Prompt: the dark lord
the dark lord leaves. on wands in a old <unk> dumbledore was a few <unk> harry came holding his <unk> "i'll like malfoy -- probably <unk> but my family why both <unk> when he sat "are kicked me malkin's one of this robes harry had lost it they was, in here." "but i 


##### This is final model

In [13]:
generate_text("harry potter", max_words=100, temperature=0.8)

Prompt: harry potter
harry potter and his baggy old clothes and broken glasses, and nobody liked to disagree with dudley's gang. he waved his wand, but nothing happened. scabbers stayed gray and fast asleep. it happened very suddenly. the hook-nosed teacher looked past quirrell's turban straight into harry's eyes -- and a sharp, hot pain shot across the scar on harry's forehead. neville had never been on a broomstick in his life, because his grandmother had never let him near one. privately, harry felt she'd had good reason, because neville managed to have an extraordinary number of accidents even with both feet on the ground. 


In [14]:
generate_text("harry potter", max_words=50, temperature=0.8)
generate_text("the wizarding world", max_words=50, temperature=1.0)
generate_text("the dark lord", max_words=50, temperature=1.2)

Prompt: harry potter
harry potter rolled up in the wild!" on friday, no less than twelve letters arrived for harry. as they couldn't be able to tell you. if you're not hurt at all, you'd better get off to gryffindor tower. students are finishing the feast in their houses." "darling, you haven't counted auntie marge's 
Prompt: the wizarding world
the wizarding world -- the secret back through the night, didn't want to -- trying to turn it. "harry thousand else." it was almost mcgonagall, up in the dungeons with the rest of the library, wandering around in the downstairs bathroom. "what was he hiding behind his back?" said hermione thoughtfully. "you know, 
Prompt: the dark lord
the dark lord so keep so on the team only ends of an enormous black boarhound. "if yeh know what is to go," said professor mcgonagall. "second -- to miss there was an upturned wastepaper basket -- but propped against their job to outside the mail had arrived, hagrid from christmas. it gave 


### 2.1 Vocabulary & Tokenization Strategy (Experiment)
In standard language modeling tutorials, a threshold of `min_freq=3` is often used to reduce vocabulary size. However, for this specific dataset:
- **Observation:** The Harry Potter corpus contains many unique proper nouns (e.g., *Hogwarts, Dumbledore, Voldemort*) and spells (e.g., *Expelliarmus*) that appear infrequently but are crucial for the context.
- **Experiment:** I adjusted the threshold to **`min_freq=1`** (keeping all words).
- **Result:** This modification significantly reduced the number of `<unk>` tokens in the generated output, making the text much more coherent and true to the source material.

### 2.2 Model Architecture
I implemented a standard LSTM-based Language Model using PyTorch:
- **Embedding Layer:** Projects words into a dense vector space of size **200**.
- **LSTM Layers:** Uses **2 stacked LSTM layers** with a hidden size of **200** to capture sequential dependencies.
- **Dropout:** Applied a dropout rate of **0.2** to the output of each LSTM layer to prevent overfitting.
- **Decoder (Linear):** Maps the hidden state back to the vocabulary size to predict the next token.

### 2.3 Training Configuration
- **Optimizer:** Adam (Learning rate = 0.001)
- **Scheduler:** `StepLR` (Decays learning rate by 0.5 every 5 epochs) to ensure convergence.
- **Loss Function:** CrossEntropyLoss
- **Hardware:** Training was performed on **NVIDIA GeForce RTX 5070** (CUDA 13.1) using Docker.

**Result Analysis:**
The model successfully converged after 50 epochs.
- **Final Perplexity (PPL):** ~5.98 (Validation Set)
- **Observation:** The low perplexity indicates the model is confident in its predictions. However, in the generated text, we observe frequent use of the `<unk>` token. This is due to the `min_freq=3` setting, which treats rare words (names, specific spells) as unknown.

In [15]:
# Save Vocab object to use in Flask App
torch.save(vocab, 'vocab_v2.pth')
print("‚úÖ Saved vocab_v2.pth")

‚úÖ Saved vocab_v2.pth


### Result Analysis
The model successfully converged after 50 epochs.
- **Final Perplexity (PPL):** ~5.98 (Validation Set).
- **Experiment & Observation:**
    - **Initial Attempt (`min_freq=3`):** Following standard practice resulted in frequent `<unk>` tokens in the output, as the model failed to recognize rare but important words.
    - **Optimized Strategy (`min_freq=1`):** By adjusting the frequency threshold to 1, we allowed the model to learn the full vocabulary.
- **Conclusion:** The final model generates significantly more coherent text. It correctly predicts proper nouns (e.g., "Harry", "Hogwarts") and specific terminology without resorting to the `<unk>` token, demonstrating a deeper understanding of the dataset's context.

#### TASK 3 Text Generation - Web Application Development

### 3.1 Web Application Overview
I developed a web application to demonstrate the model's capabilities interactively.
- **Framework:** Flask (Python)
- **Deployment:** The app runs inside a **Docker Container** configured with NVIDIA Container Toolkit to utilize the GPU for inference.

### 3.2 System Workflow
1.  **Initialization:** When the app starts, it loads the trained model (`model_v2.pth`) and the vocabulary object (`vocab_v2.pth`) into memory.
2.  **User Input:** The user provides a text prompt (e.g., "Harry Potter") via the web interface.
3.  **Inference:**
    - The backend tokenizes the prompt using the loaded Vocab.
    - It feeds the tokens into the LSTM model.
    - The model predicts the next word iteratively based on the `temperature` setting (Temperature Sampling) to control creativity.
4.  **Output:** The generated text is decoded back to strings and displayed on the frontend.

### 3.3 Result Demo
Below is a screenshot of the Web Application generating text from the prompt *"Harry Potter"*:

![Harry Potter Text Generator Web App](/home/bsupanutkom/WORK/AIT/2nd_semester/NLU/A2/Website.png)