# **Lab 9 - Embedding Clustering Vectorization Workshop**
 
`Group 7:`
- Paula Ramirez 8963215
- Hasyashri Bhatt 9028501
- Babandeep 9001552
 
This notebook demonstrates:

- Building an NLP pipeline from scratch: document collection, tokenization, and normalization on a domain-specific corpus
- Implementing a Word2Vec predictive model using the knowledge corpus to learn context-aware word embeddings
- Implementing a GloVe count-based model to generate word vectors from co-occurrence statistics
- Explaining each major step with Markdown to support transparency and reproducibility in NLP workflows


In [None]:

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec
import numpy as np


nltk.download('punkt')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **NLP Pipeline**





### **Select and Load a Corpus**
We collected real-world FAQs and policy documents from Conestoga College, including:

- Academic Policies
- Attendance and Evaluations
- Financial Aid
- ONE Card Services
- Student Support and Counseling

All texts were combined into a single file:  
**student_portal_corpus.txt**  
This file forms the foundation for building our NLP models.


In [3]:
# STEP 1: Read the combined student portal corpus
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

print("Corpus length (characters):", len(corpus_text))

Corpus length (characters): 31435


### **Text Preprocessing and Normalization**

We applied a custom text cleaning and normalization pipeline using regular expressions and `nltk`. This approach avoids external tokenizer dependencies and ensures compatibility across environments (e.g., Google Colab, Windows).

####  Preprocessing Pipeline Steps:
- Converted text to **lowercase**
- Removed **punctuation and digits**
- Used **regex tokenization** to extract words (`\b\w+\b`)
- Removed **common English stopwords** using `nltk.corpus.stopwords`
- Applied **stemming** using `PorterStemmer` to reduce words to their base form
- Split corpus into **sentences using regex**, not relying on Punkt



In [5]:

# Step 1: Split into sentences
sentences = re.split(r"[.!?]+", corpus_text)

# Step 2: Tokenize, clean, stem each sentence
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokenized_corpus = []

for sentence in sentences:
    # Lowercase and remove punctuation/digits
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-zA-Z\s]", " ", sentence)
    
    # Tokenize using regex
    tokens = re.findall(r"\b\w+\b", sentence)
    
    # Stopword removal + stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words and len(word) >= 3]
    
    if tokens:
        tokenized_corpus.append(tokens)

print("Tokenization complete. Example:", tokenized_corpus[:2])


Tokenization complete. Example: [['welcom', 'student', 'affair', 'self', 'serv', 'portal'], ['platform', 'design', 'support', 'student', 'manag', 'academ', 'journey', 'eas']]


### **Add Corpus to Vector Space (using Word2Vec)**


In this step, we convert our student support corpus into a **semantic vector space** using the Word2Vec algorithm.

We trained a **Word2Vec Skip-gram model** using a regex-based tokenizer:
- Lowercased text and removed punctuation
- Removed English stopwords
- Extracted words with ≥3 characters
- Sentence boundaries: `.`, `!`, `?`

**Model Config:**
- `vector_size=100`
- `window=5`
- `min_count=1`
- `sg=1` (Skip-gram)

In [15]:

#  Tokenize using simple regex tokenizer
def simple_regex_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # remove punctuation and digits
    stop_words = set(stopwords.words("english"))
    
    # Split by common sentence boundaries
    sentences = re.split(r'[.!?]+', text)
    
    tokenized_sentences = []
    for sentence in sentences:
        tokens = re.findall(r'\b[a-zA-Z]{3,}\b', sentence)  # only words with 3+ chars
        tokens = [word for word in tokens if word not in stop_words]
        if tokens:
            tokenized_sentences.append(tokens)
    
    return tokenized_sentences

#  Preprocess and train Word2Vec
tokenized_corpus = simple_regex_tokenizer(corpus_text)

model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    seed=42
)

print(" Word2Vec model trained successfully!")


 Word2Vec model trained successfully!




###  **Querying the Vector Space (Word2Vec)**

After training the Word2Vec model on our student support corpus, we can now query the **semantic vector space** to:

- Measure word similarity
- Retrieve most similar words
- Perform analogical reasoning (e.g., `"advisor" - "support" + "exam"`)

### A. Word Similarity

In [5]:
print(" Similarity between 'student' and 'advisor':")
print(model.wv.similarity('student', 'advisor'))


 Similarity between 'student' and 'advisor':
0.7036517


 ### B. Most Similar Words

In [6]:
print("\n Words most similar to 'exam':")
print(model.wv.most_similar('exam'))



 Words most similar to 'exam':
[('student', 0.9555578231811523), ('academic', 0.9519188404083252), ('contact', 0.9485390782356262), ('career', 0.9474048614501953), ('card', 0.9471673369407654), ('one', 0.9464384913444519), ('workshops', 0.9451332688331604), ('students', 0.9449481964111328), ('conestoga', 0.9410738348960876), ('may', 0.9405909180641174)]


### C. Analogy

In [7]:
print("\ Analogy: refund - course + financial ≈ ?")
print(model.wv.most_similar(positive=['refund', 'financial'], negative=['course']))


\ Analogy: refund - course + financial ≈ ?
[('portal', 0.7643033266067505), ('term', 0.7641661763191223), ('check', 0.7596644759178162), ('policy', 0.7543754577636719), ('one', 0.7520149350166321), ('academic', 0.7512180805206299), ('documentation', 0.7488986253738403), ('student', 0.7474629878997803), ('learning', 0.7452937960624695), ('events', 0.7448296546936035)]


###  **GloVe Embedding Model (Pre-trained)**
Used Gensim's pre-trained **GloVe 100-dimensional embeddings** trained on Wikipedia and Gigaword corpus.

Pros:
- No training required
- Large vocabulary
- Captures global co-occurrence

Cons:
- Contextual sensitivity is weaker compared to Word2Vec on Q&A-style data


In [None]:
import gensim.downloader as api

# Load pre-trained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")

# Query example
print(glove_model.most_similar("student"))



[('students', 0.8432976603507996), ('teacher', 0.8083398938179016), ('school', 0.7811789512634277), ('graduate', 0.7617563605308533), ('faculty', 0.7405667304992676), ('academic', 0.7332330942153931), ('college', 0.7243876457214355), ('teachers', 0.7197794914245605), ('university', 0.7133212089538574), ('youth', 0.7073767781257629)]


We can use pre-trained GloVe vectors to obtain semantic representations of words and entire sentences. The snippet below shows:

- How to retrieve the vector for a single word (e.g., "student")
- How to compute the average vector for a sentence by averaging the vectors of the known word

In [None]:

token = "student"
if token in glove_model:
    print("GloVe vector:", glove_model[token])
 
def sentence_vector(sentence):
    words = [word for word in sentence.lower().split() if word in glove_model]
    if not words:
        return np.zeros(100)
    return np.mean([glove_model[word] for word in words], axis=0)
 
print(sentence_vector("student portal access"))

###  Word2Vec vs GloVe – Talking Points

| Feature           | Word2Vec                      | GloVe                                 |
|-------------------|-------------------------------|----------------------------------------|
| Model Type        | Predictive (Skip-gram)        | Count-based (Matrix factorization)     |
| Context Handling  | Strong (local context)        | Moderate (global statistics)           |
| Best Use Case     | Chatbot Q&A                   | Generic text analytics                 |
| Talking Point     | Our student portal data is Q&A-based; Word2Vec captured context-specific patterns better than GloVe.


##  Conclusion

In this workshop, we successfully implemented a full NLP pipeline to support a student services chatbot scenario. By applying custom text preprocessing and leveraging both predictive (Word2Vec) and count-based (GloVe) embedding techniques, we extracted meaningful semantic representations of student language.

Our comparison of Word2Vec vs GloVe revealed that Word2Vec performs better in this context due to its ability to model local contextual nuances, which are essential for chatbot accuracy and relevance.

Overall, this workshop strengthened our skills in:
- Real-world text cleaning and normalization
- Training and applying embedding models
- Interpreting and comparing NLP methodologies

This foundation will be critical for deploying intelligent, context-aware language systems in real-world academic and enterprise environments.
