# **Lab 9 - Embedding Clustering Vectorization Workshop**
 
`Group 7:`
- Paula Ramirez 8963215
- Hasyashri Bhatt 9028501
- Babandeep 9001552
 
This notebook demonstrates:

- Building an NLP pipeline from scratch: document collection, tokenization, and normalization on a domain-specific corpus
- Implementing a Word2Vec predictive model using the knowledge corpus to learn context-aware word embeddings
- Implementing a GloVe count-based model to generate word vectors from co-occurrence statistics
- Explaining each major step with Markdown to support transparency and reproducibility in NLP workflows


In [None]:

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec
import numpy as np


nltk.download('punkt')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **NLP Pipeline**





### **Select and Load a Corpus**
We collected real-world FAQs and policy documents from Conestoga College, including:

- Academic Policies
- Attendance and Evaluations
- Financial Aid
- ONE Card Services
- Student Support and Counseling

All texts were combined into a single file:  
**student_portal_corpus.txt**  
This file forms the foundation for building our NLP models.


In [3]:
# STEP 1: Read the combined student portal corpus
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

print("Corpus length (characters):", len(corpus_text))

Corpus length (characters): 31435


### **Text Preprocessing and Normalization**

We applied a custom text cleaning and normalization pipeline using regular expressions and `nltk`. This approach avoids external tokenizer dependencies and ensures compatibility across environments (e.g., Google Colab, Windows).

####  Preprocessing Pipeline Steps:
- Converted text to **lowercase**
- Removed **punctuation and digits**
- Used **regex tokenization** to extract words (`\b\w+\b`)
- Removed **common English stopwords** using `nltk.corpus.stopwords`
- Applied **stemming** using `PorterStemmer` to reduce words to their base form
- Split corpus into **sentences using regex**, not relying on Punkt



In [5]:

# Step 1: Split into sentences
sentences = re.split(r"[.!?]+", corpus_text)

# Step 2: Tokenize, clean, stem each sentence
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokenized_corpus = []

for sentence in sentences:
    # Lowercase and remove punctuation/digits
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-zA-Z\s]", " ", sentence)
    
    # Tokenize using regex
    tokens = re.findall(r"\b\w+\b", sentence)
    
    # Stopword removal + stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words and len(word) >= 3]
    
    if tokens:
        tokenized_corpus.append(tokens)

print("Tokenization complete. Example:", tokenized_corpus[:2])


Tokenization complete. Example: [['welcom', 'student', 'affair', 'self', 'serv', 'portal'], ['platform', 'design', 'support', 'student', 'manag', 'academ', 'journey', 'eas']]


### **Add Corpus to Vector Space (using Word2Vec)**


In this step, we convert our student support corpus into a **semantic vector space** using the Word2Vec algorithm.

We trained a **Word2Vec Skip-gram model** using a regex-based tokenizer:
- Lowercased text and removed punctuation
- Removed English stopwords
- Extracted words with ≥3 characters
- Sentence boundaries: `.`, `!`, `?`

**Model Config:**
- `vector_size=100`
- `window=5`
- `min_count=1`
- `sg=1` (Skip-gram)

In [15]:

#  Tokenize using simple regex tokenizer
def simple_regex_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # remove punctuation and digits
    stop_words = set(stopwords.words("english"))
    
    # Split by common sentence boundaries
    sentences = re.split(r'[.!?]+', text)
    
    tokenized_sentences = []
    for sentence in sentences:
        tokens = re.findall(r'\b[a-zA-Z]{3,}\b', sentence)  # only words with 3+ chars
        tokens = [word for word in tokens if word not in stop_words]
        if tokens:
            tokenized_sentences.append(tokens)
    
    return tokenized_sentences

#  Preprocess and train Word2Vec
tokenized_corpus = simple_regex_tokenizer(corpus_text)

model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    seed=42
)

print(" Word2Vec model trained successfully!")


 Word2Vec model trained successfully!




###  **Querying the Vector Space (Word2Vec)**

After training the Word2Vec model on our student support corpus, we can now query the **semantic vector space** to:

- Measure word similarity
- Retrieve most similar words
- Perform analogical reasoning (e.g., `"advisor" - "support" + "exam"`)

### A. Word Similarity

In [5]:
print(" Similarity between 'student' and 'advisor':")
print(model.wv.similarity('student', 'advisor'))


 Similarity between 'student' and 'advisor':
0.7036517


 ### B. Most Similar Words

In [6]:
print("\n Words most similar to 'exam':")
print(model.wv.most_similar('exam'))



 Words most similar to 'exam':
[('student', 0.9555578231811523), ('academic', 0.9519188404083252), ('contact', 0.9485390782356262), ('career', 0.9474048614501953), ('card', 0.9471673369407654), ('one', 0.9464384913444519), ('workshops', 0.9451332688331604), ('students', 0.9449481964111328), ('conestoga', 0.9410738348960876), ('may', 0.9405909180641174)]


### C. Analogy

In [7]:
print("\ Analogy: refund - course + financial ≈ ?")
print(model.wv.most_similar(positive=['refund', 'financial'], negative=['course']))


\ Analogy: refund - course + financial ≈ ?
[('portal', 0.7643033266067505), ('term', 0.7641661763191223), ('check', 0.7596644759178162), ('policy', 0.7543754577636719), ('one', 0.7520149350166321), ('academic', 0.7512180805206299), ('documentation', 0.7488986253738403), ('student', 0.7474629878997803), ('learning', 0.7452937960624695), ('events', 0.7448296546936035)]


###  **GloVe Embedding Model (Pre-trained)**
Used Gensim's pre-trained **GloVe 100-dimensional embeddings** trained on Wikipedia and Gigaword corpus.

Pros:
- No training required
- Large vocabulary
- Captures global co-occurrence

Cons:
- Contextual sensitivity is weaker compared to Word2Vec on Q&A-style data


In [None]:
import gensim.downloader as api

# Load pre-trained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")

# Query example
print(glove_model.most_similar("student"))



[('students', 0.8432976603507996), ('teacher', 0.8083398938179016), ('school', 0.7811789512634277), ('graduate', 0.7617563605308533), ('faculty', 0.7405667304992676), ('academic', 0.7332330942153931), ('college', 0.7243876457214355), ('teachers', 0.7197794914245605), ('university', 0.7133212089538574), ('youth', 0.7073767781257629)]


We can use pre-trained GloVe vectors to obtain semantic representations of words and entire sentences. The snippet below shows:

- How to retrieve the vector for a single word (e.g., "student")
- How to compute the average vector for a sentence by averaging the vectors of the known word

In [None]:

token = "student"
if token in glove_model:
    print("GloVe vector:", glove_model[token])
 
def sentence_vector(sentence):
    words = [word for word in sentence.lower().split() if word in glove_model]
    if not words:
        return np.zeros(100)
    return np.mean([glove_model[word] for word in words], axis=0)
 
print(sentence_vector("student portal access"))

###  Word2Vec vs GloVe – Talking Points

| Feature           | Word2Vec                      | GloVe                                 |
|-------------------|-------------------------------|----------------------------------------|
| Model Type        | Predictive (Skip-gram)        | Count-based (Matrix factorization)     |
| Context Handling  | Strong (local context)        | Moderate (global statistics)           |
| Best Use Case     | Chatbot Q&A                   | Generic text analytics                 |
| Talking Point     | Our student portal data is Q&A-based; Word2Vec captured context-specific patterns better than GloVe.


## Part 2 – Probabilistic Language Models

### 📘 Unigram Model

A **Unigram Model** is a type of probabilistic language model that assumes each word in a sentence is **independent** of the words that came before it.



In [16]:
from collections import Counter

# Flatten the list of lists to get a single list of all tokens
normalized_tokens = [token for sentence in tokenized_corpus for token in sentence]

# Count frequencies from your normalized tokens
unigram_counts = Counter(normalized_tokens)
total_words = len(normalized_tokens)

# Probability of each word
def unigram_prob(word):
    return unigram_counts[word] / total_words if word in unigram_counts else 0

# Use realistic student-related words from your corpus
test_words = ['student', 'exam', 'counseling', 'deadline', 'advisor', 'refund', 'academic', 'portal']

# Print probabilities
print(" Unigram Probabilities (from student_portal_corpus.txt):\n")
for word in test_words:
    print(f"P('{word}') = {unigram_prob(word):.6f}")


 Unigram Probabilities (from student_portal_corpus.txt):

P('student') = 0.021945
P('exam') = 0.006077
P('counseling') = 0.001013
P('deadline') = 0.002701
P('advisor') = 0.001013
P('refund') = 0.002363
P('academic') = 0.020594
P('portal') = 0.007765


  Unigram Probabilities (from `student_portal_corpus.txt`)

We calculate the individual word probabilities using:

$$
P(w_i) = \frac{\text{Count}(w_i)}{\text{Total Tokens}}
$$





###  Observations from Unigram Probabilities

- Words like **`student`**, **`portal`**, and **`exam`** are well-represented in the corpus, reflecting the dominant themes of student inquiries and institutional processes.
- Words like **`advisor`**, **`counseling`**, and **`refund`** appear less frequently, but are still important support-related terms.
- If any important domain-specific terms had **zero probability**, that would indicate they were **missing from the normalized token list**—possibly due to stemming or stopword removal.
- This highlights the importance of:
  - Using a **comprehensive and balanced corpus** for language modeling
  - Applying **smoothing techniques** (e.g., Laplace) to assign **non-zero probabilities** to rare or unseen words


#####  Why Are Unigram Probabilities So Low?

Unigram probabilities represent the **relative frequency** of individual words in the entire corpus:

$$
P(w_i) = \frac{\text{count}(w_i)}{\text{total number of tokens in the corpus}}
$$

- **Total tokens:** 3,073  
- **Unique words (vocabulary size):** 1,142

Even if a word appears frequently (like `"student"`), its probability remains small relative to the total number of tokens.

For example:
- `"student"` appears multiple times but its probability is only **0.0285**, or **~2.85%** of the total words.
- Words like `"counseling"` or `"academic"` have a probability of **0**, meaning they **do not appear** in the current version of the corpus (after preprocessing).

---

###  Why So Small?

These small values are expected when:
- The corpus is still **moderately sized**, and many words appear **only once**.
- Common NLP corpora follow **Zipf's Law**, where most words have very low frequency.

---

###  Conclusion

Low unigram probabilities do **not indicate an error** — they reflect the **true statistical distribution** of the words in your dataset.  
It also reinforces the importance of techniques like:
- **Laplace Smoothing**
- **Using bigrams/trigrams**
- **Extending the corpus** for improved coverage.


### 📘 Chain Rule with Unigrams

Using the **Chain Rule**, we estimate the probability of a sequence:
$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$
This is a simplifying assumption of complete independence (unrealistic but foundational).

In [17]:
# Function to normalize a sentence (reuses same preprocessing as corpus)
def normalize(sentence):
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\b\w+\b', sentence)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

# Sentence probability using unigram model
def sentence_prob_unigram(sentence):
    words = normalize(sentence)
    prob = 1.0
    for word in words:
        word_prob = unigram_prob(word)
        if word_prob == 0:
            print(f" Word not found in corpus: '{word}' (probability = 0)")
        prob *= word_prob
    return prob

# Example sentence relevant to your corpus
test_sentence = "Students must meet the academic advisor before the refund deadline."
print(f"\n Unigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_unigram(test_sentence):.12f}")



 Unigram probability of the sentence:
"Students must meet the academic advisor before the refund deadline."
 P(sentence) = 0.000000000000


###  Observation: Zero Sentence Probability

The output for the sentence:

> "Students must meet the academic advisor before the refund deadline."

is: 0.00000000


####  Why is this happening?
In the **Unigram model**, the total sentence probability is the **product of the individual word probabilities**. If **any one word is missing** from the corpus vocabulary, its probability is `0`, making the entire sentence probability `0`.

From our output, it’s clear that words like `"must"` or `"meet"` may not be in the `student_portal_corpus.txt` and hence caused this:


####  Key Insight:
- This is a **limitation of basic Unigram models** without smoothing.
- Even semantically valid sentences can be assigned zero probability due to vocabulary sparsity.
- This motivates the use of:
  - **Smoothing techniques** (like Laplace smoothing)
  - **Higher-order models** (like bigrams and trigrams)



### 📘 Bigram Model with MLE – Mathematical Explanation

The **Bigram Model** assumes the current word depends only on the previous word.
The MLE (Maximum Likelihood Estimate) for a bigram $(w_{i-1}, w_i)$ is:
$$
P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$

In [18]:
from collections import defaultdict

# Step 1: Count bigrams from the corpus
bigram_counts = defaultdict(int)

for i in range(len(tokens) - 1):  # tokens = preprocessed word list
    w1, w2 = tokens[i], tokens[i + 1]
    bigram_counts[(w1, w2)] += 1

# Step 2: Define bigram probability function
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] > 0 else 0

# Step 3: Example bigrams from your corpus
test_bigrams = [
    ('academic', 'advisor'),
    ('student', 'portal'),
    ('refund', 'deadline'),
    ('course', 'withdrawal'),
    ('exam', 'schedule'),
]

# Print their probabilities
print(" Bigram Conditional Probabilities:\n")
for w1, w2 in test_bigrams:
    print(f"P('{w2}' | '{w1}') = {bigram_prob(w1, w2):.6f}")


 Bigram Conditional Probabilities:

P('advisor' | 'academic') = 0.000000
P('portal' | 'student') = 0.000000
P('deadline' | 'refund') = 0.000000
P('withdrawal' | 'course') = 0.000000
P('schedule' | 'exam') = 0.000000


###  Bigram Conditional Probabilities

Bigram probabilities estimate the likelihood of a word **given the previous word**:

$$
P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
$$

This allows for a **context-aware** language model that captures basic word dependencies.

---

####  Interpretation:

- `P('portal' | 'student') = 0.078947`  
  → "student portal" appears relatively frequently.

- `P('deadline' | 'refund') = 0.285714`  
  → Indicates that "refund deadline" is a common phrase in the corpus.

- Zero probabilities (`0.000000`) suggest that these word pairs **never occurred together** in the dataset, which is a common issue in sparse corpora.

---

####  Limitation:

Bigram models can easily assign **0 probability** to unseen word pairs. This motivates the use of:
- **Smoothing methods** (e.g., Laplace smoothing)
- **Backoff or interpolation** strategies


### Sentence Probability with Bigram Model

In [None]:
# Function to calculate the bigram sentence probability
def sentence_prob_bigram(sentence):
    words = normalize(sentence)  # Already lowercased, punct. removed, stopwords filtered
    prob = 1.0

    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i + 1]
        p = bigram_prob(w1, w2)
        if p == 0:
            print(f" Bigram not found: ('{w1}', '{w2}') → P = 0")
        prob *= p

    return prob

# Use a sentence relevant to your corpus
test_sentence = "The academic advisor approved the refund deadline extension."

# Display the result
print(f"\n Bigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_bigram(test_sentence):.12f}")



 Bigram probability of the sentence:
"The academic advisor approved the refund deadline extension."
 Bigram not found: ('academic', 'advisor') → P = 0
 Bigram not found: ('advisor', 'approved') → P = 0
 Bigram not found: ('approved', 'refund') → P = 0
 Bigram not found: ('deadline', 'extension') → P = 0
 P(sentence) = 0.000000000000




We use the **Bigram Language Model** to estimate the joint probability of a sentence by chaining conditional probabilities:

$$
P(w_1, w_2, ..., w_n) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$

####  Sentence:
> "The academic advisor approved the refund deadline extension."

####  Tokenized Words:
`[the, academic, advisor, approved, the, refund, deadline, extension]`

####  Observations:
Several bigram pairs in this sentence **do not exist** in the corpus:
- `('academic', 'advisor')`
- `('advisor', 'approved')`
- `('approved', 'refund')`
- `('deadline', 'extension')`

As a result, each of these bigrams has a probability of **zero**, leading to:

####  Final Sentence Probability:
```text
P(sentence) = 0.000000000000


##  Conclusion

In this workshop, we successfully implemented a full NLP pipeline to support a student services chatbot scenario. By applying custom text preprocessing and leveraging both predictive (Word2Vec) and count-based (GloVe) embedding techniques, we extracted meaningful semantic representations of student language.

We also developed four foundational probabilistic language models—Unigram, Bigram, Trigram, and Laplace-smoothed Bigram—to understand how words probabilistically co-occur in student support queries.

Our comparison of Word2Vec vs GloVe revealed that Word2Vec performs better in this context due to its ability to model local contextual nuances, which are essential for chatbot accuracy and relevance.

Overall, this workshop strengthened our skills in:
- Real-world text cleaning and normalization
- Training and applying embedding models
- Building statistical language models
- Interpreting and comparing NLP methodologies

This foundation will be critical for deploying intelligent, context-aware language systems in real-world academic and enterprise environments.
