# 🧠 NLP Foundations & Vector Space Retrieval with Word2Vec  
**Course:** PROG8245 – Machine Learning Programming  
**Workshop:** From Preprocessing to TF-IDF and Word2Vec Extension  
**Team:** Team 5  

### 👥 Team Members:
- Mandeep Singh (ID: 8989367)  
- Kumari Nikitha Singh (ID: 9053016)  
- Krishna (ID: 905861)  

---

🔍 In this extended workshop, we go beyond the classical 6-step NLP pipeline and apply advanced techniques like:
- **Word2Vec embeddings**
- **Cosine similarity for semantic query matching**
- **Bigram language modeling**
- **Chain Rule–based sentence probability**

All models are trained on the **Wikitext-2** dataset, representing real-world Wikipedia-style language data.


## Step 1: Load Dataset

We use the Wikitext-2 (raw v1) dataset from Hugging Face.

- Loaded the training split.
- Removed empty lines to keep only valid text content.

This provides a clean Wikipedia-style corpus for language modeling.


In [1]:
from datasets import load_dataset

# Load the raw Wikitext-2 training split (a small Wikipedia-derived corpus)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Extract text content
documents = dataset['text']
documents = [doc for doc in documents if len(doc.strip()) > 0]  # Remove empty lines
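
Note that the non-empty filter keeps Wikitext-2 heading lines such as ` = Mozambican War of Independence = `, so each heading becomes a one-line document. If that is not wanted, a filter along the following lines could separate headings from body text. This is a minimal sketch: `is_heading` and the sample lines are hypothetical and not part of the original notebook, and the regular expression assumes headings are wrapped in space-separated `=` signs.

```python
import re

# Hypothetical helper (not in the original notebook). Assumes headings look
# like " = Title = " or " = = Section = = ".
HEADING_RE = re.compile(r"^\s*=(\s=)*\s.*\s=(\s=)*\s*$")

def is_heading(line):
    """Return True if the line looks like a Wikitext-style heading."""
    return bool(HEADING_RE.match(line))

# Illustrative lines only, not read from the dataset
sample_lines = [
    " = Mozambican War of Independence = \n",
    "",
    " During World War II civil aerodromes were taken over for military use . \n",
    " = = Background = = \n",
]

# Keep non-empty lines that are not headings
body_lines = [line for line in sample_lines if line.strip() and not is_heading(line)]
print(len(body_lines))
```

Whether to drop headings is a design choice: keeping them preserves article titles as searchable items, while dropping them leaves only paragraph-level documents.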




## Step 2: Document Viewer Utility

To inspect the dataset, we define a helper function:

- `view_full_document(documents, doc_id)`: Displays the full content of a document at the given index.
- Example usage: `view_full_document(documents, 99)`

This is helpful to understand the structure and language style of the data.


In [2]:
# Function to print full document by index
def view_full_document(documents, doc_id):
    if 0 <= doc_id < len(documents):
        print(f"\n📄 --- Full Document {doc_id} ---\n")
        print(documents[doc_id])
    else:
        print("❌ Invalid document ID. Please enter a number between 0 and", len(documents) - 1)

# Example: view full document 99
view_full_document(documents, 99)


📄 --- Full Document 99 ---

 In 1997 , the Museum of Science and Natural History merged with the Little Rock Children 's Museum , which had been located in Union Station , to form the Arkansas Museum of Discovery . The new museum was relocated to a historic building in the Little Rock River Market District . The MacArthur Museum of Arkansas Military History opened on May 19 , 2001 in the Tower Building . The new museum 's goal is to educate and inform visitors about the military history of Arkansas , preserve the Tower Building , honor servicemen and servicewomen of the United States and commemorate the birthplace of Douglas MacArthur . 



## Step 3: Tokenization

We define a custom tokenizer using regular expressions:

- Converts all text to lowercase.
- Extracts runs of word characters (letters, digits, underscore) using the pattern `\b\w+\b`, which drops punctuation and symbols.

This ensures consistency in word representation before further processing.


In [3]:
import re

# Tokenizer using regex: lowercase and extract only word characters
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())
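
A quick check on text written in the corpus's style shows what this pattern keeps and drops. The snippet repeats the `tokenize()` definition so that it runs on its own; the sample strings are short excerpts in the style of the documents printed in this notebook.

```python
import re

def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# Wikitext-2 writes hyphens as "@-@" and splits possessives off as "'s"
hyphen_tokens = tokenize("Pre @-@ war civil aerodromes")
possessive_tokens = tokenize("the Little Rock Children 's Museum")

print(hyphen_tokens)      # ['pre', 'war', 'civil', 'aerodromes']
print(possessive_tokens)  # ['the', 'little', 'rock', 'children', 's', 'museum']
```

The `@-@` marker disappears entirely, while the detached possessive survives as a single-letter `s` token. Because `\w` is Unicode-aware in Python 3, non-Latin text such as Japanese is kept as tokens too.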


## Step 4: Apply Tokenization to All Documents

- We apply the `tokenize()` function to each document in the corpus.
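- Then, we flatten all tokenized documents into a single list of tokens (`all_tokens`), which is useful for corpus-level statistics such as unigram counts. (Word2Vec, later on, is trained on the per-document token lists rather than on this flat list.)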

This step gives us a clean tokenized representation of the entire corpus, ready for further normalization or statistical modeling.


In [4]:
# Tokenize all documents using the regex-based tokenizer
tokenized_docs = [tokenize(doc) for doc in documents]

# Flatten all tokens into a single list for model building
all_tokens = [token for doc in tokenized_docs for token in doc]

# Preview stats
print(f"📄 Total documents tokenized: {len(tokenized_docs)}")
print(f"🔢 Total tokens in corpus: {len(all_tokens)}")
print("🧪 Sample tokens:", all_tokens[:30])


📄 Total documents tokenized: 23767
🔢 Total tokens in corpus: 1750956
🧪 Sample tokens: ['valkyria', 'chronicles', 'iii', 'senjō', 'no', 'valkyria', '3', 'unrecorded', 'chronicles', 'japanese', '戦場のヴァルキュリア3', 'lit', 'valkyria', 'of', 'the', 'battlefield', '3', 'commonly', 'referred', 'to', 'as', 'valkyria', 'chronicles', 'iii', 'outside', 'japan', 'is', 'a', 'tactical', 'role']


## Step 5: Token Normalization (Stopword Removal + Stemming)

- We use NLTK's stopword list to remove common non-informative words (like "the", "is", "and").
- We apply the **Porter Stemmer** to reduce words to their base/stem form (e.g., "running" → "run").
- The `normalize_tokens()` function performs both steps for each document.
- The result is a normalized list of tokens per document that keeps the content words while reducing noise and vocabulary size.


In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Ensure stopwords are downloaded
nltk.download('stopwords')

# Initialize tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Normalize a list of tokens
def normalize_tokens(tokens):
    return [stemmer.stem(token) for token in tokens if token not in stop_words]

# Apply normalization to each tokenized document
normalized_docs = [normalize_tokens(doc) for doc in tokenized_docs]

# Flatten into a single list for modeling
normalized_tokens = [token for doc in normalized_docs for token in doc]

# Preview
print(f"🧹 Total normalized tokens: {len(normalized_tokens)}")
print("🔍 Sample normalized tokens:", normalized_tokens[:30])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kittu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


🧹 Total normalized tokens: 1038669
🔍 Sample normalized tokens: ['valkyria', 'chronicl', 'iii', 'senjō', 'valkyria', '3', 'unrecord', 'chronicl', 'japanes', '戦場のヴァルキュリア3', 'lit', 'valkyria', 'battlefield', '3', 'commonli', 'refer', 'valkyria', 'chronicl', 'iii', 'outsid', 'japan', 'tactic', 'role', 'play', 'video', 'game', 'develop', 'sega', 'media', 'vision']
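
Stems are not always dictionary words, so anything looked up later (query terms, or words passed to the bigram and Word2Vec models) has to go through the same normalization. The following is a small sketch using only NLTK's `PorterStemmer`, which needs no corpus download; the word list is illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Surface forms and the stems they map to
for word in ["running", "chronicles", "commonly", "japanese"]:
    print(f"{word} -> {stemmer.stem(word)}")

# Different surface forms collapse to one vocabulary entry
same_stem = stemmer.stem("chronicles") == stemmer.stem("chronicle")
print(same_stem)
```

This is why later cells call `normalize_tokens(tokenize(...))` on user input before querying a model.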


## Step 6: Unigram and Bigram Language Modeling

To estimate sentence probabilities, we build two probabilistic models: **unigram** and **bigram**.

### Unigram Model
- Counts frequency of each individual word across all documents.
- Probability of a word = frequency / total words.

### Bigram Model
- Tracks frequency of word pairs (bigrams), such as ("machine", "learning").
- Probability of a word given previous = count(w1 → w2) / count(w1).
- We also apply **Laplace Smoothing** to handle unseen bigrams and prevent zero probability.
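
A worked toy example makes the two estimates concrete. The three stemmed documents below are made up for illustration, and the names `toy_docs`, `mle_prob`, and `laplace_prob` are hypothetical; the formulas follow the definitions in the bullets above.

```python
from collections import Counter, defaultdict

# Made-up, already-normalized documents (illustration only)
toy_docs = [
    ["machin", "learn", "model"],
    ["machin", "learn", "data"],
    ["machin", "translat"],
]

# Count bigrams within each document
bigrams = defaultdict(Counter)
for doc in toy_docs:
    for w1, w2 in zip(doc, doc[1:]):
        bigrams[w1][w2] += 1

V = len({token for doc in toy_docs for token in doc})  # vocabulary size: 5

def mle_prob(w1, w2):
    """Unsmoothed estimate: count(w1 -> w2) / count(w1 -> anything)."""
    counts = bigrams.get(w1, Counter())
    total = sum(counts.values())
    return counts[w2] / total if total else 0.0

def laplace_prob(w1, w2):
    """Add-one estimate: (count(w1 -> w2) + 1) / (count(w1 -> anything) + V)."""
    counts = bigrams.get(w1, Counter())
    return (counts[w2] + 1) / (sum(counts.values()) + V)

print(mle_prob("machin", "learn"))        # 2/3
print(laplace_prob("machin", "learn"))    # 3/8
print(mle_prob("learn", "translat"))      # 0.0 (unseen bigram)
print(laplace_prob("learn", "translat"))  # 1/7
```

Smoothing moves probability mass from seen bigrams (2/3 drops to 3/8) to unseen ones (0 rises to 1/7). With a large vocabulary this shift is substantial, which is one reason smoothed sentence scores can come out much smaller than unsmoothed ones.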

### Sentence Probability Functions
We define multiple scoring methods:
- `sentence_score_naive`: Averages the unigram probabilities of the words. This is a rough score, not a probability of the sentence.
- `sentence_prob_chain`: Multiplies the unigram probabilities, i.e. the chain rule under the simplifying assumption that words are independent.
- `sentence_prob_bigram`: Computes chained bigram probabilities.
- `sentence_prob_bigram_smoothed`: Same as above but with Laplace smoothing.

These methods help rank sentences and estimate how likely they are under our language model.


In [58]:


from collections import Counter, defaultdict

# Rebuild the tokenized and normalized documents (same as Steps 4 and 5)
tokenized_docs = [tokenize(doc) for doc in documents]
normalized_docs = [normalize_tokens(doc) for doc in tokenized_docs]

# === Unigram Model ===
all_tokens = [token for doc in normalized_docs for token in doc]
unigram_counts = Counter(all_tokens)
total_tokens = sum(unigram_counts.values())

def unigram_prob(word):
    return unigram_counts[word] / total_tokens if total_tokens > 0 else 0.0

# === Bigram Model ===
bigram_model = defaultdict(Counter)
for doc in normalized_docs:
    for i in range(len(doc) - 1):
        w1, w2 = doc[i], doc[i + 1]
        bigram_model[w1][w2] += 1

# Vocabulary size for smoothing
vocab = set(all_tokens)
V = len(vocab)

def bigram_prob(w1, w2):
    # .get() avoids inserting unseen words into the defaultdict as a side effect
    next_counts = bigram_model.get(w1, Counter())
    count_w1 = sum(next_counts.values())
    if count_w1 == 0:
        return 0.0
    return next_counts[w2] / count_w1

def bigram_prob_smoothed(w1, w2):
    next_counts = bigram_model.get(w1, Counter())
    count_w1 = sum(next_counts.values())
    return (next_counts[w2] + 1) / (count_w1 + V)  # Laplace (add-one) smoothing

# === Sentence Scoring Functions ===
def sentence_score_naive(sentence):
    tokens = tokenize(sentence)
    norm_tokens = normalize_tokens(tokens)
    probs = [unigram_prob(word) for word in norm_tokens]
    return sum(probs) / len(probs) if probs else 0.0

def sentence_prob_chain(sentence):
    tokens = tokenize(sentence)
    norm_tokens = normalize_tokens(tokens)
    prob = 1.0
    for word in norm_tokens:
        prob *= unigram_prob(word)
    return prob

def sentence_prob_bigram(sentence):
    tokens = tokenize(sentence)
    norm_tokens = normalize_tokens(tokens)
    if not norm_tokens:
        return 0.0
    prob = unigram_prob(norm_tokens[0])
    for i in range(1, len(norm_tokens)):
        w1, w2 = norm_tokens[i - 1], norm_tokens[i]
        prob *= bigram_prob(w1, w2)
    return prob

def sentence_prob_bigram_smoothed(sentence):
    tokens = tokenize(sentence)
    norm_tokens = normalize_tokens(tokens)
    if not norm_tokens:
        return 0.0
    prob = unigram_prob(norm_tokens[0])
    for i in range(1, len(norm_tokens)):
        w1, w2 = norm_tokens[i - 1], norm_tokens[i]
        prob *= bigram_prob_smoothed(w1, w2)
    return prob

# === Test a Sentence ===
sentence = "Machine learning improves human decision making"

print("🧪 Sentence:", sentence)
print(f"❌ Naïve Avg Word Probability     : {sentence_score_naive(sentence):.6e}")
print(f"✅ Chain Rule (Unigram Model)     : {sentence_prob_chain(sentence):.6e}")
print(f"⚠️  Bigram Model (No Smoothing)   : {sentence_prob_bigram(sentence):.6e}")
print(f"✅ Bigram Model (Laplace Smoothed): {sentence_prob_bigram_smoothed(sentence):.6e}")


🧪 Sentence: Machine learning improves human decision making
❌ Naïve Avg Word Probability     : 4.735227e-04
✅ Chain Rule (Unigram Model)     : 3.441999e-21
⚠️  Bigram Model (No Smoothing)   : 0.000000e+00
✅ Bigram Model (Laplace Smoothed): 1.453826e-26
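
Products of many small probabilities shrink quickly (the smoothed bigram score above is already around 1e-26 for a six-word sentence) and eventually underflow to zero in floating point. A common remedy is to sum log-probabilities instead. The sketch below is an addition, not part of the original notebook; `log_prob_of` is a hypothetical helper that takes any list of per-word probabilities.

```python
import math

def log_prob_of(probs):
    """Sum of natural-log probabilities; -inf if any probability is zero."""
    if any(p <= 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)

# 400 words, each with probability 1e-3 (illustrative numbers)
probs = [1e-3] * 400

product = 1.0
for p in probs:
    product *= p

print(product)             # underflows to 0.0
print(log_prob_of(probs))  # about -2763.1, still usable for ranking
```

Log scores preserve the ordering of sentences, so they can be used for ranking wherever the raw products would be.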


## Step 7: Word2Vec Embedding Model

We train a Word2Vec model on the normalized (stopword-filtered, stemmed) token lists from Step 5.

### Model Configuration
- **Model Type**: Skip-gram (`sg=1`), which tends to represent rare words better than CBOW.
- **Vector Size**: 100-dimensional embeddings.
- **Context Window**: 5 words before/after the target word.
- **Min Count**: Ignores words with fewer than 2 occurrences to reduce noise.
- **Workers**: Parallel processing using 4 threads.

Once trained, this model allows us to retrieve similar words and compute word/document similarities based on vector space proximity.
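
To make the skip-gram and window settings concrete, the sketch below enumerates the (target, context) pairs that a window of a given size produces for one token list. It is illustrative only: `skipgram_pairs` is a hypothetical helper, and actual Word2Vec training adds steps such as negative sampling that are not shown here.

```python
def skipgram_pairs(tokens, window=5):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

doc = ["tactic", "role", "play", "video", "game"]
pairs = skipgram_pairs(doc, window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each interior word is paired with its two neighbors; a larger window pairs each word with more distant context, which tends to capture broader topical similarity.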


In [6]:
from gensim.models import Word2Vec

# Train Word2Vec model on normalized tokenized documents (each doc = list of tokens)
w2v_model = Word2Vec(
    sentences=normalized_docs,    # list of token lists
    vector_size=100,              # embedding dimension
    window=5,                     # context window
    min_count=2,                  # ignore words with <2 frequency
    workers=4,                    # parallel threads
    sg=1                          # 1 = skip-gram, 0 = CBOW
)

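# L2-normalize the vectors in place for querying.
# Note: init_sims() is deprecated in gensim 4.x, which normalizes on demand;
# the call is kept here because the results below were produced with it.
w2v_model.init_sims(replace=True)

# Save model (optional)
# w2v_model.save("wiki_word2vec.model")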


## Step 8: Vocabulary Inspection

After training the Word2Vec model, we inspect the learned vocabulary.

The model stores word vectors for all tokens that meet the minimum frequency threshold. Below, we print the first 50 entries of `index_to_key`, which gensim orders from most to least frequent.

This provides a quick glimpse into the dominant terms learned by the model from the corpus.


In [8]:
# Show top 50 words in the trained model's vocabulary
print("📚 Vocabulary Sample:")
print(list(w2v_model.wv.index_to_key[:50]))


📚 Vocabulary Sample:
['first', 'one', 'also', 'two', 'time', 'year', 'use', 'game', 'state', 'new', 'includ', 'song', 'would', '1', 'work', 'record', '2', 'three', 'play', 'later', 'may', 'season', 'film', 'follow', 'name', 'citi', 'day', 'unit', 'made', 'releas', '3', 'part', 'second', 'end', 'album', 'number', 'seri', 'call', 'world', 'sever', 'mani', 'war', '000', 'area', 'gener', 'forc', 'howev', 'music', '5', 'south']


## Step 9: Finding Similar Words Using Word2Vec

Using the trained Word2Vec model, we retrieve the most semantically similar words to a given input term — in this case, **"bot"**.

The `most_similar()` function returns words that appear in similar contexts, based on cosine similarity of their vector representations. This demonstrates how Word2Vec captures relationships between words in the corpus.


In [13]:
print("🔗 Similar words to 'bot':")
print(w2v_model.wv.most_similar("bot"))


🔗 Similar words to 'bot':
[('sidearm', 0.9824039340019226), ('twofold', 0.9815366864204407), ('superconduct', 0.9801238179206848), ('shortfal', 0.9798026084899902), ('utilitarian', 0.979671061038971), ('mn', 0.9795687198638916), ('unwant', 0.9792008399963379), ('lex', 0.9788586497306824), ('lens', 0.9787328243255615), ('cfd', 0.9784953594207764)]


## Step 10: Semantic Document Search with Cosine Similarity

We implement a semantic search system by comparing the cosine similarity between:

- The **average vector** of a user query (computed using Word2Vec embeddings), and  
- The **precomputed document vectors** (mean of word vectors per document).

This enables retrieving the top-k documents that are semantically closest to a given query, even if the exact keywords do not match. The function `query_search()` ranks documents based on similarity scores.


In [15]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Helper: Compute average vector of a token list
def average_vector(tokens):
    vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    if not vectors:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)

# Precompute vectors for all documents
doc_vectors = [average_vector(doc) for doc in normalized_docs]

# 🔍 Query function
def query_search(query, top_k=3):
    query_tokens = normalize_tokens(tokenize(query))
    query_vec = average_vector(query_tokens)
    
    sims = cosine_similarity([query_vec], doc_vectors)[0]
    ranked_indices = np.argsort(sims)[::-1][:top_k]
    
    print(f"\n🔎 Top {top_k} documents for query: '{query}'\n")
    for rank, idx in enumerate(ranked_indices, 1):
        print(f"\n📄 Rank #{rank} — Doc {idx} (Score: {sims[idx]:.4f})")
        print(documents[idx][:300], "...\n")
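
The ranking logic can be checked on a toy example before running it on the full corpus. The sketch below uses made-up 2-dimensional vectors in place of the trained embeddings (`toy_vectors`, `mean_vector`, and the three toy documents are all hypothetical) and computes cosine similarity directly with NumPy rather than scikit-learn.

```python
import numpy as np

# Made-up embeddings for illustration only
toy_vectors = {
    "war":   np.array([1.0, 0.0]),
    "armi":  np.array([0.9, 0.1]),
    "song":  np.array([0.0, 1.0]),
    "album": np.array([0.1, 0.9]),
}

def mean_vector(tokens, dim=2):
    """Average the vectors of known tokens; zeros if none are known."""
    vectors = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

toy_docs = [["war", "armi"], ["song", "album"], ["war", "song"]]
doc_matrix = np.vstack([mean_vector(doc) for doc in toy_docs])
query_vec = mean_vector(["armi"])

# Cosine similarity = dot product divided by the product of the norms
norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
sims = (doc_matrix @ query_vec) / np.where(norms == 0, 1.0, norms)

ranking = np.argsort(sims)[::-1]  # best match first
print(ranking.tolist())           # [0, 2, 1]
```

The military-themed document ranks first, the mixed one second, and the music one last. A document or query with no in-vocabulary tokens gets a zero vector and therefore a similarity of 0 against everything.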


## Step 11: Running a Semantic Search Query

We now test our semantic retrieval system by running a sample query:



In [44]:
query_search("global government and war", top_k=3)



🔎 Top 3 documents for query: 'global government and war'


📄 Rank #1 — Doc 22953 (Score: 0.9118)
 = Mozambican War of Independence = 
 ...


📄 Rank #2 — Doc 535 (Score: 0.8927)
 During World War II civil aerodromes were taken over for military use , existing military airfields were expanded , and new ones were built . This resulted in a significant inventory of facilities becoming available after the war . Pre @-@ war civil aerodromes , for example Sywell , were returned t ...


📄 Rank #3 — Doc 4666 (Score: 0.8811)
 After World War I , and with another European war looming , leaders from the historic peace churches met to strategize about how to cooperate with the government to avoid the difficulties of World War I. Holding a common view that any participation in military service was not acceptable , they devi ...



## Step 12: Predicting the Next Word using Bigram Model

This function takes a single input word and returns the most likely next words based on the bigram model.

- It looks up all bigrams that start with the given word.
- It uses the frequency count to determine the most common next words.
- This simulates a basic predictive text or autocomplete behavior.




In [32]:
def next_word_bigram(prev_word, top_k=5):
    if prev_word in bigram_model:
        # Get Counter object of next words
        next_word_counter = bigram_model[prev_word]
        most_common = next_word_counter.most_common(top_k)

        print(f"\n🔮 Top {top_k} next words for '{prev_word}':")
        for word, count in most_common:
            print(f"  {prev_word} → {word} ({count} times)")
    else:
        print(f"⚠️ No bigram found starting with '{prev_word}'")
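
The same kind of counts can also drive a simple greedy generator that repeatedly appends the most frequent successor. This is a sketch on a made-up bigram table (the counts and the `generate_greedy` helper are hypothetical, not taken from the trained model), and it returns the words instead of printing them.

```python
from collections import Counter

# Made-up bigram counts for illustration
toy_bigrams = {
    "king": Counter({"henri": 3, "jame": 2}),
    "henri": Counter({"viii": 4, "iv": 1}),
    "viii": Counter({"england": 2}),
}

def generate_greedy(start, max_words=5):
    """Follow the most frequent next word until none is known or the limit is hit."""
    words = [start]
    while len(words) < max_words:
        successors = toy_bigrams.get(words[-1])
        if not successors:
            break
        next_word, _count = successors.most_common(1)[0]
        words.append(next_word)
    return words

generated = generate_greedy("king")
print(generated)  # ['king', 'henri', 'viii', 'england']
```

The `max_words` limit matters because greedy selection can loop forever when two words are each other's most frequent successor.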


## Step 13: Bigram Prediction and Word2Vec Similarity Lookup

This final step demonstrates two capabilities using the input phrase:

1. **Bigram Prediction**:
   - It takes the last word of the user phrase.
   - It uses the trained bigram model to suggest likely next words.

2. **Word2Vec Similar Words**:
   - It retrieves semantically similar words to the last word using Word2Vec embeddings.




In [35]:
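# Helper called below; its definition is missing from this export, so this is
# a reconstruction that matches the output format shown.
def similar_next_word_w2v(word, top_k=5):
    print(f"\n💡 Words similar to '{word}':")
    for similar, score in w2v_model.wv.most_similar(word, topn=top_k):
        print(f"  {similar} (score: {score:.2f})")

user_phrase = "king"
last_word = normalize_tokens(tokenize(user_phrase))[-1]
next_word_bigram(last_word)
similar_next_word_w2v(last_word)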



🔮 Top 5 next words for 'king':
  king → henri (19 times)
  king → jame (17 times)
  king → dublin (15 times)
  king → georg (13 times)
  king → jerusalem (11 times)

💡 Words similar to 'king':
  dublin (score: 0.79)
  leinster (score: 0.78)
  throne (score: 0.75)
  domnal (score: 0.74)
  kingship (score: 0.74)


## Summary

This notebook demonstrates a practical NLP pipeline using the Wikitext-2 dataset. The major steps include:

- **Loading and Preprocessing**: Cleaned and tokenized raw text, applied stopword removal and stemming.
- **Language Modeling**: Built unigram and bigram models to compute word and sentence probabilities with and without smoothing.
- **Semantic Representation**: Trained a Word2Vec model on normalized tokens to learn word embeddings.
- **Similarity Search**: Used cosine similarity to retrieve the most relevant documents for a given query.
- **Next-Word Prediction**: Implemented a bigram-based next-word suggestion tool to model contextual word transitions.

Overall, this extended workshop connects classical language models with modern vector-based methods for retrieval and prediction.
