# AIG230 Natural Language Processing - Assignment 5

**Selected Corpus: Option A - Gutenberg ('carroll-alice.txt')**

**Student Name:** [Your Name]
**Date:** February 16, 2026
**Instructor:** Prof. David Quispe

## Part A - Text Preprocessing (50%)

**Goal:** Prepare a clean token stream and justify your choices.

In [1]:
import nltk
from nltk.corpus import gutenberg
import string
from collections import Counter
import pandas as pd

# Download necessary resources
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('omw-1.4')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/kevinwang/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kevinwang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinwang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kevinwang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kevinwang/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/kevinwang/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### A1. Load the corpus

In [2]:
# Load Alice in Wonderland
raw_text = gutenberg.raw('carroll-alice.txt')
sentences = gutenberg.sents('carroll-alice.txt')
initial_tokens = gutenberg.words('carroll-alice.txt')

print(f"Total number of characters: {len(raw_text)}")
print(f"Total number of sentences: {len(sentences)}")
print(f"Total number of tokens BEFORE preprocessing: {len(initial_tokens)}")

Total number of characters: 144395
Total number of sentences: 1703
Total number of tokens BEFORE preprocessing: 34110


### A2. Preprocess

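I will implement a preprocessing function that performs:
* Lowercasing
* Tokenization (already done by the corpus reader: `gutenberg.words` and `gutenberg.sents` return tokens)
* Removal of punctuation and other non-alphabetic tokens
* Stopword removal (optional: on for the vectorization tasks in Part B, off for the embeddings in Part C)
* Lemmatization with WordNet (optional, on by default)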

In [3]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text_tokens, remove_stopwords=True, use_lemmatization=True):
    # Keep alphabetic tokens only (this also drops punctuation and numbers), then lowercase
    clean_tokens = [t.lower() for t in text_tokens if t.isalpha()]

    if remove_stopwords:
        clean_tokens = [t for t in clean_tokens if t not in stop_words]

    if use_lemmatization:
        # lemmatize() defaults to pos='n', so only noun forms are reduced
        clean_tokens = [lemmatizer.lemmatize(t) for t in clean_tokens]
        
    return clean_tokens

processed_tokens = preprocess_text(initial_tokens)

print(f"Total number of tokens AFTER preprocessing: {len(processed_tokens)}")
print(f"Vocabulary size (unique tokens): {len(set(processed_tokens))}")

top_20 = Counter(processed_tokens).most_common(20)
print("\nTop 20 most frequent tokens:")
for token, count in top_20:
    print(f"{token}: {count}")

Total number of tokens AFTER preprocessing: 12240
Vocabulary size (unique tokens): 2240

Top 20 most frequent tokens:
said: 462
alice: 398
little: 128
one: 105
know: 90
like: 86
would: 83
went: 83
thing: 80
could: 77
time: 77
thought: 76
queen: 76
see: 67
king: 64
well: 63
turtle: 61
head: 60
began: 58
way: 57


### A3. Reflection

Preprocessing directly influences the quality and interpretability of downstream models. Lowercasing ensures that 'Alice' and 'alice' are treated as the same token, which reduces vocabulary sparsity. Removing punctuation eliminates tokens that carry little semantic meaning for bag-of-words tasks. Stopword removal filters out high-frequency, low-information words such as 'the' or 'and', which helps TF-IDF highlight the terms that distinguish a document. Lemmatization further reduces vocabulary size by mapping inflected forms to a base dictionary form (e.g., 'things' to 'thing'). Because `lemmatizer.lemmatize` is called here without a part-of-speech tag, it treats every token as a noun, so verb forms such as 'went', 'thought', and 'began' stay unchanged, as the top-20 list shows. For word embeddings, keeping stopwords and sentence structure is often beneficial because they provide context for semantic relationships. These choices trade a smaller, less noisy vocabulary against the loss of fine-grained linguistic nuance.
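
As a rough illustration of how each step shrinks the token stream, here is a minimal sketch on a made-up token list. The token list, the `TOY_STOPWORDS` set, and the `pipeline_sizes` helper are all hypothetical; the set is a tiny stand-in for NLTK's English stopword list, and lemmatization is left out so the example needs no downloaded resources.

```python
# Toy tokens; not taken from the corpus.
tokens = ["Alice", "was", "beginning", "to", "get", "very", "tired", ",",
          "and", "alice", "said", "the", "Rabbit", "was", "late", "!"]

# Hypothetical, tiny stopword set standing in for stopwords.words('english').
TOY_STOPWORDS = {"was", "to", "very", "and", "the"}

def pipeline_sizes(tokens, stopwords):
    """Return (token count, vocabulary size) after each preprocessing stage."""
    stages = {}
    stages["raw"] = (len(tokens), len(set(tokens)))

    # Drop punctuation and any other non-alphabetic token.
    alpha = [t for t in tokens if t.isalpha()]
    stages["alpha_only"] = (len(alpha), len(set(alpha)))

    # Lowercasing merges 'Alice' and 'alice' into one vocabulary entry.
    lowered = [t.lower() for t in alpha]
    stages["lowercased"] = (len(lowered), len(set(lowered)))

    # Stopword removal cuts the token count much more than the vocabulary.
    content = [t for t in lowered if t not in stopwords]
    stages["no_stopwords"] = (len(content), len(set(content)))
    return stages

sizes = pipeline_sizes(tokens, TOY_STOPWORDS)
for stage, (n_tokens, n_vocab) in sizes.items():
    print(f"{stage:>12}: {n_tokens} tokens, {n_vocab} unique")
```

On real text the proportions will differ, but the pattern is the same one seen above, where the token count drops from 34110 to 12240.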

## Part B - Text Representation (25%)

**Goal:** Compare Bag-of-Words and TF-IDF representations and interpret the results.

### B1. Create documents

I will use each sentence as a document, as it provides a manageable size for similarity analysis and interpretation in this corpus. Using sentences is justified because it allows us to analyze the similarity of specific narrative lines rather than large thematic chapters, which makes the TF-IDF terms easier to interpret.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Preprocess each sentence once, then drop sentences with two or fewer remaining tokens
processed_sentences = [preprocess_text(s) for s in sentences]
documents = [" ".join(toks) for toks in processed_sentences if len(toks) > 2]
print(f"Number of documents (sentences): {len(documents)}")

Number of documents (sentences): 1290


### B2. Vectorize

In [5]:
cnt_vec = CountVectorizer()
tf_vec = TfidfVectorizer()

bow_matrix = cnt_vec.fit_transform(documents)
tfidf_matrix = tf_vec.fit_transform(documents)

print(f"BoW matrix shape: {bow_matrix.shape}")
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

# Top 15 TF-IDF terms for 2 documents
def get_top_tfidf(row_idx, vectorizer, matrix, top_n=15):
    row = matrix.getrow(row_idx).toarray()[0]
    indices = np.argsort(row)[::-1][:top_n]
    feature_names = vectorizer.get_feature_names_out()
    return [(feature_names[i], row[i]) for i in indices if row[i] > 0]

idx1, idx2 = 10, 100
print(f"\nTop TF-IDF terms for Document {idx1}:")
print(get_top_tfidf(idx1, tf_vec, tfidf_matrix))

print(f"\nTop TF-IDF terms for Document {idx2}:")
print(get_top_tfidf(idx2, tf_vec, tfidf_matrix))

BoW matrix shape: (1290, 2195)
TF-IDF matrix shape: (1290, 2195)

Top TF-IDF terms for Document 10:
[('stair', np.float64(0.47394270113116527)), ('tumbling', np.float64(0.47394270113116527)), ('fall', np.float64(0.4081414539484156)), ('shall', np.float64(0.3372936658722457)), ('nothing', np.float64(0.3130742328638693)), ('think', np.float64(0.2800352846186907)), ('thought', np.float64(0.2655639655012973)), ('alice', np.float64(0.15313625610148493))]

Top TF-IDF terms for Document 100:
[('puzzling', np.float64(0.5895736822928054)), ('besides', np.float64(0.5701606775653887)), ('dear', np.float64(0.41723085341143334)), ('oh', np.float64(0.3914563703242622))]


**Interpretation:**
The top TF-IDF terms act as keywords for each sentence. Terms that occur in few documents receive a high inverse document frequency, so rare words dominate: in Document 10, 'stair' and 'tumbling' score about 0.47 each, while 'alice', which appears throughout the corpus, scores only 0.15. Document 100 shows the same pattern, with 'puzzling' and 'besides' ranked above the more common 'dear' and 'oh'. Frequent words are therefore still present in the representation but are down-weighted rather than removed, which makes it easier to see what each document is 'about'.
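
To make the weighting concrete, the sketch below computes TF-IDF for one toy document by hand. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which corresponds to `TfidfVectorizer`'s default settings. The three toy documents and the `tfidf_row` helper are hypothetical and not taken from the notebook.

```python
import math
from collections import Counter

# Toy pre-tokenized documents; not taken from the corpus.
docs = [
    ["alice", "thought", "stair"],
    ["alice", "thought", "queen"],
    ["alice", "said", "queen"],
]

def tfidf_row(doc, docs):
    """TF-IDF weights for one document, L2-normalized.

    Assumes smoothed IDF: idf = ln((1 + n) / (1 + df)) + 1.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # Scale the row to unit length so document length does not matter.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_row(docs[0], docs)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

Here 'stair' occurs in one document, 'thought' in two, and 'alice' in all three, so the weights come out in that order, mirroring the ranking seen for Document 10.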

### B3. Similarity

In [6]:
sim_matrix = cosine_similarity(tfidf_matrix)
np.fill_diagonal(sim_matrix, 0) # Remove self-similarity

max_sim_idx = np.unravel_index(np.argmax(sim_matrix), sim_matrix.shape)
print(f"Most similar documents: {max_sim_idx[0]} and {max_sim_idx[1]}")
print(f"Similarity Score: {sim_matrix[max_sim_idx]}")
print(f"Doc {max_sim_idx[0]}: {documents[max_sim_idx[0]]}")
print(f"Doc {max_sim_idx[1]}: {documents[max_sim_idx[1]]}")

print("\n--- Surprising Similarity Example ---")
potential_pairs = np.where((sim_matrix > 0.4) & (sim_matrix < 0.9))
if len(potential_pairs[0]) > 0:
    p1, p2 = potential_pairs[0][0], potential_pairs[1][0]
    print(f"Doc {p1}: {documents[p1]}")
    print(f"Doc {p2}: {documents[p2]}")
    print(f"Similarity Score: {sim_matrix[p1, p2]}")

Most similar documents: 1083 and 1084
Similarity Score: 1.0000000000000002
Doc 1083: soup evening beautiful soup
Doc 1084: soup evening beautiful soup

--- Surprising Similarity Example ---
Doc 11: brave think home
Doc 550: alice beginning think creature get home
Similarity Score: 0.45394029907359607


**Interpretation of Similarity:**
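The most similar pair (documents 1083 and 1084) is the repeated line 'soup evening beautiful soup': the two documents are identical after preprocessing, so their cosine similarity is 1 (the trailing `...0002` is floating-point rounding). The 'surprising' pair (documents 11 and 550) shares only 'think' and 'home'. Because document 11 has just three tokens, those two shared terms account for most of its vector, which yields a score of 0.45 even though the longer sentence contains several other terms.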
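
The behavior described here can be reproduced with a few lines of plain Python. This is only a sketch: the `cosine` helper and the weights in `doc_a` to `doc_d` are hypothetical values chosen for illustration, not numbers from the TF-IDF matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical weights, chosen only for illustration.
doc_a = {"soup": 0.8, "evening": 0.4, "beautiful": 0.45}
doc_b = dict(doc_a)                    # identical content
doc_c = {"think": 0.6, "home": 0.8}    # short document, no overlap with doc_a
doc_d = {"think": 0.5, "home": 0.5, "alice": 0.2, "creature": 0.7}

print(cosine(doc_a, doc_b))  # identical documents: 1, up to rounding
print(cosine(doc_a, doc_c))  # no shared terms: 0
print(cosine(doc_c, doc_d))  # two shared terms dominate the short document
```

A short document with few terms reaches a moderate score as soon as most of its terms appear in the other document, which is one reason the sentence-level filter on tiny sentences matters.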

## Part C - Word Embeddings (25%)

**Goal:** Train Word2Vec embeddings and explore semantic similarity.

### C1. Prepare training data

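For embeddings, it is generally recommended to keep stopwords to preserve context, so `remove_stopwords=False` is used below. Lowercasing, the alphabetic filter, and lemmatization still apply because they are part of `preprocess_text`'s defaults. Sentences with two or fewer raw tokens are skipped.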

In [7]:
embedding_data = [preprocess_text(s, remove_stopwords=False) for s in sentences if len(s) > 2]
print(f"Sentence count for training: {len(embedding_data)}")

Sentence count for training: 1686


### C2. Train Word2Vec

I chose the following hyperparameters:
* `vector_size=100`: Balance between representation power and corpus size.
* `window=5`: Standard context window.
* `min_count=3`: Ensure words have enough context to learn an embedding.
* `sg=1`: Skip-gram is used to better capture rare words in this smaller corpus.
* `epochs=20`: Helps the model converge on small datasets.
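
The effect of `min_count` can be sketched without training anything: words whose total frequency falls below the threshold are dropped from the vocabulary and get no vector. The `kept_vocabulary` helper and the toy sentences below are hypothetical and only mimic that pruning step.

```python
from collections import Counter

def kept_vocabulary(sentences, min_count):
    """Words whose total frequency across all sentences is at least min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    return {w for w, c in counts.items() if c >= min_count}

# Toy tokenized sentences; not taken from the corpus.
toy_sentences = [
    ["the", "white", "rabbit", "ran"],
    ["the", "rabbit", "was", "late"],
    ["the", "white", "rabbit", "and", "the", "black", "kitten"],
    ["white", "gloves"],
]

vocab = kept_vocabulary(toy_sentences, min_count=3)
print(sorted(vocab))
print("black" in vocab)
```

In a small corpus this threshold is a trade-off: a lower value keeps more words but gives them vectors learned from very few contexts.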

In [8]:
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=embedding_data, 
    vector_size=100, 
    window=5, 
    min_count=3, 
    sg=1, # Skip-gram
    epochs=20
)

print("Model training complete.")

Model training complete.


### C3. Explore similarity

In [9]:
target_words = ['alice', 'queen', 'hatter', 'mouse', 'rabbit']
for word in target_words:
    if word in model.wv:
        print(f"\nWords similar to '{word}':")
        # Top 10 words as requested
        print(model.wv.most_similar(word, topn=10))
    else:
        print(f"Word '{word}' not in vocabulary.")


Words similar to 'alice':
[('rather', 0.73460453748703), ('she', 0.7290487885475159), ('pleased', 0.7213724851608276), ('feeling', 0.7189539670944214), ('sharply', 0.7179637551307678), ('hastily', 0.7158926725387573), ('frightened', 0.7154080867767334), ('certainly', 0.7090147137641907), ('but', 0.7079842686653137), ('indignantly', 0.7030239105224609)]

Words similar to 'queen':
[('executioner', 0.820957601070404), ('shouted', 0.8058800101280212), ('knave', 0.7945306897163391), ('turning', 0.7916951179504395), ('ground', 0.788692057132721), ('croquet', 0.782116174697876), ('heart', 0.7798267006874084), ('crown', 0.7781469821929932), ('pointing', 0.777019739151001), ('taken', 0.7761519551277161)]

Words similar to 'hatter':
[('tea', 0.828532874584198), ('hare', 0.8276251554489136), ('executioner', 0.8128354549407959), ('interrupted', 0.7834770679473877), ('march', 0.7824484705924988), ('butter', 0.7797386050224304), ('pigeon', 0.7738587856292725), ('added', 0.772002100944519), ('pointi

**Interpretation of Neighbors:**
The neighbors mostly make sense in the context of the corpus. 'alice' is closest to adverbs and reaction words ('hastily', 'indignantly', 'frightened'), 'queen' to words from the scenes she appears in ('executioner', 'knave', 'croquet', 'crown'), and 'hatter' to the tea party ('tea', 'hare', 'march'). In a corpus this small, similarity reflects co-occurrence rather than true synonymy, so words from the same scenes cluster together.
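
To see why co-occurrence drives these neighbors, the sketch below counts which words fall inside a fixed context window around a target word. It is a simplification of what Word2Vec training sees, not a reimplementation of it; the `window_cooccurrence` helper and the toy sentences are hypothetical.

```python
from collections import Counter

def window_cooccurrence(sentences, target, window=5):
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word != target:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[sent[j]] += 1
    return counts

# Toy tokenized sentences; not taken from the corpus.
toy_sentences = [
    ["the", "hatter", "and", "the", "march", "hare", "were", "having", "tea"],
    ["the", "hatter", "poured", "more", "tea"],
    ["the", "queen", "shouted", "at", "the", "executioner"],
]

neighbours = window_cooccurrence(toy_sentences, "hatter", window=5)
print(neighbours.most_common(5))
```

Words that never share a window with the target, such as 'queen' here, contribute nothing, so with little data the learned vectors end up grouping words by scene.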

### C4. Analogies

In [10]:
def try_analogy(a, b, c):
    print(f"{a} - {b} + {c} = ?")
    try:
        result = model.wv.most_similar(positive=[a, c], negative=[b], topn=1)
        print(f"Result: {result}")
    except KeyError as e:  # raised when a word is missing from the vocabulary
        print(f"Error: {e}")

try_analogy('queen', 'king', 'alice')
try_analogy('white', 'black', 'rabbit')
try_analogy('hatter', 'alice', 'queen')

queen - king + alice = ?
Result: [('frightened', 0.737886369228363)]
white - black + rabbit = ?
Error: "Key 'black' not present in vocabulary"
hatter - alice + queen = ?
Result: [('executioner', 0.7436817288398743)]


**Interpretation of Analogies:**
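Analogy results on this corpus are weak. Neither returned result completes a meaningful relation: 'queen - king + alice' yields 'frightened' and 'hatter - alice + queen' yields 'executioner', both of which are simply near neighbors of the input words. The second analogy fails outright because 'black' occurs fewer than `min_count=3` times and therefore has no vector. Word2Vec typically needs far more text, often millions of tokens, before relational offsets become consistent enough for vector arithmetic. With only ~26k tokens, the vectors mainly encode which words appear in the same scenes.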
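
For reference, the arithmetic behind an analogy query can be sketched on hand-made vectors. Everything here is hypothetical: the `analogy` helper, the two-dimensional `toy_vectors`, and their axis meanings are invented for illustration, and the library's own implementation differs in details such as vector normalization.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    missing = [w for w in (a, b, c) if w not in vectors]
    if missing:
        raise KeyError(f"not in vocabulary: {missing}")
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

# Hand-made 2-D vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "female".
toy_vectors = {
    "king":  [1.0, 0.0],
    "queen": [1.0, 1.0],
    "man":   [0.0, 0.1],
    "woman": [0.0, 1.0],
    "tea":   [-1.0, 0.2],
}

print(analogy(toy_vectors, "king", "man", "woman"))
```

The query only works because the toy axes were built to encode the relation cleanly; vectors trained on a single short novel have no such guarantee.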