# Lab 8 - Probabilistic Language models
 
`Group 7:`
- Paula Ramirez 8963215
- Hasyashri Bhatt 9028501
- Babandeep 9001552
 
This notebook demonstrates:- Building an NLP pipeline from scratch  - Implementing Unigram and Bigram models  - Estimating sentence probabilities using MLE  




## Part 1 â€“ NLP Pipeline





## Step 1: Select and Load a Corpus
We collected real-world FAQs and policy documents from Conestoga College, including:

- Academic Policies
- Attendance and Evaluations
- Financial Aid
- ONE Card Services
- Student Support and Counseling

All texts were combined into a single file:  
**student_portal_corpus.txt**  
This file forms the foundation for building our NLP models.


In [38]:
# STEP 1: Read the combined student portal corpus
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Corpus length (characters):", len(raw_text))


Corpus length (characters): 34731


##  Step 2: Preprocessing and Normalization

We applied a custom regex-based preprocessing pipeline:

- Converted all text to lowercase
- Removed punctuation, digits, and special characters
- Removed common stopwords (NLTK)
- Split the corpus into sentences using regex (no punkt dependency)
- Tokenized words (3+ characters) using regex

The result is a `tokenized_corpus` which is a list of lists, where each sublist is a sentence of cleaned tokens.

**Example:**
```python
[['students', 'academic', 'records'], ['financial', 'aid', 'available'], ...]


In [39]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
import re

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = re.findall(r'\b\w+\b', text)  # regex tokenizer
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokens = preprocess(raw_text)
print("Total tokens:", len(tokens))
print("Sample tokens:", tokens[:20])



Total tokens: 3406
Sample tokens: ['welcome', 'student', 'affairs', 'selfserve', 'portal', 'platform', 'designed', 'support', 'students', 'managing', 'academic', 'journey', 'ease', 'use', 'system', 'find', 'information', 'tuition', 'payments', 'registration']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\baban\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\baban\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:
import re

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = re.findall(r'\b\w+\b', text)  # regex tokenizer
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokens = preprocess(raw_text)
print("Total tokens:", len(tokens))
print("Sample tokens:", tokens[:20])


Total tokens: 3406
Sample tokens: ['welcome', 'student', 'affairs', 'selfserve', 'portal', 'platform', 'designed', 'support', 'students', 'managing', 'academic', 'journey', 'ease', 'use', 'system', 'find', 'information', 'tuition', 'payments', 'registration']


### Step 3: Implement Tokenizer

##  Tokenization with Regex

To begin analyzing the corpus, we implemented a **simple regex-based tokenizer**. This method avoids dependencies like `nltk.tokenize.word_tokenize` and directly extracts words using regular expressions.

###  Steps:
- Loaded the merged corpus file: `student_portal_corpus.txt`
- Converted all text to lowercase
- Tokenized using regex: `\b\w+\b` (matches word boundaries)
- Output: flat list of word tokens



In [41]:
import re

#  Load the corpus first
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

#  Simple tokenizer using regex
def simple_tokenizer(text):
    return re.findall(r'\b\w+\b', text.lower())

#  Apply tokenizer
tokens = simple_tokenizer(corpus_text)

print(" Total tokens:", len(tokens))
print(" Sample tokens:", tokens[:20])


 Total tokens: 5333
 Sample tokens: ['welcome', 'to', 'the', 'student', 'affairs', 'self', 'serve', 'portal', 'this', 'platform', 'is', 'designed', 'to', 'support', 'students', 'in', 'managing', 'their', 'academic', 'journey']


### Step 4: Normalization, Stemming, and Stopword Removal



After tokenizing the corpus, we applied normalization to clean and reduce the vocabulary.

### What We Did:
- Removed English stopwords using `nltk.corpus.stopwords`
- Removed punctuation tokens
- Applied stemming using `PorterStemmer` to reduce words to their base/root form



In [42]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')

def normalize(tokens):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens if word not in stop_words and word not in string.punctuation]

normalized_tokens = normalize(tokens)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\baban\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Add Corpus to Vector Space (using Word2Vec)


In this step, we convert our student support corpus into a **semantic vector space** using the Word2Vec algorithm.

###  Goals:
- Learn numerical representations of words based on their context.
- Enable word similarity, analogy, and clustering queries later.

###  Preprocessing:
- Lowercased the text
- Removed punctuation and digits
- Removed stopwords using NLTK
- Split sentences using regex (e.g., `.`, `!`, `?`)
- Tokenized words with 3 or more characters



In [43]:
import re
from nltk.corpus import stopwords
from gensim.models import Word2Vec

#  Load corpus text
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

#  Tokenize using simple regex tokenizer
def simple_regex_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # remove punctuation and digits
    stop_words = set(stopwords.words("english"))
    
    # Split by common sentence boundaries
    sentences = re.split(r'[.!?]+', text)
    
    tokenized_sentences = []
    for sentence in sentences:
        tokens = re.findall(r'\b[a-zA-Z]{3,}\b', sentence)  # only words with 3+ chars
        tokens = [word for word in tokens if word not in stop_words]
        if tokens:
            tokenized_sentences.append(tokens)
    
    return tokenized_sentences

#  Preprocess and train Word2Vec
tokenized_corpus = simple_regex_tokenizer(corpus_text)

model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    seed=42
)

print(" Word2Vec model trained successfully!")


 Word2Vec model trained successfully!


In [44]:
print(model.wv.similarity('student', 'advisor'))
print(model.wv.most_similar('exam'))
print(model.wv.most_similar(positive=['refund', 'financial'], negative=['course']))


0.7036518
[('student', 0.9555577039718628), ('academic', 0.9519188404083252), ('contact', 0.9485390782356262), ('career', 0.9474049806594849), ('card', 0.9471673369407654), ('one', 0.9464384913444519), ('workshops', 0.9451332688331604), ('students', 0.944948136806488), ('conestoga', 0.941074013710022), ('may', 0.9405909180641174)]
[('portal', 0.7643033862113953), ('term', 0.7641662359237671), ('check', 0.7596643567085266), ('policy', 0.7543754577636719), ('one', 0.7520149350166321), ('academic', 0.7512180209159851), ('documentation', 0.7488986253738403), ('student', 0.7474629878997803), ('learning', 0.745293915271759), ('events', 0.7448296546936035)]



##  Querying the Vector Space (Word2Vec)

After training the Word2Vec model on our student support corpus, we can now query the **semantic vector space** to:

- Measure word similarity
- Retrieve most similar words
- Perform analogical reasoning (e.g., `"advisor" - "support" + "exam"`)

### A. Word Similarity

In [45]:
print(" Similarity between 'student' and 'advisor':")
print(model.wv.similarity('student', 'advisor'))


 Similarity between 'student' and 'advisor':
0.7036518


 ### B. Most Similar Words

In [46]:
print("\n Words most similar to 'exam':")
print(model.wv.most_similar('exam'))



 Words most similar to 'exam':
[('student', 0.9555577039718628), ('academic', 0.9519188404083252), ('contact', 0.9485390782356262), ('career', 0.9474049806594849), ('card', 0.9471673369407654), ('one', 0.9464384913444519), ('workshops', 0.9451332688331604), ('students', 0.944948136806488), ('conestoga', 0.941074013710022), ('may', 0.9405909180641174)]


### C. Analogy

In [47]:
print("\ Analogy: refund - course + financial â‰ˆ ?")
print(model.wv.most_similar(positive=['refund', 'financial'], negative=['course']))


\ Analogy: refund - course + financial â‰ˆ ?
[('portal', 0.7643033862113953), ('term', 0.7641662359237671), ('check', 0.7596643567085266), ('policy', 0.7543754577636719), ('one', 0.7520149350166321), ('academic', 0.7512180209159851), ('documentation', 0.7488986253738403), ('student', 0.7474629878997803), ('learning', 0.745293915271759), ('events', 0.7448296546936035)]


  print("\ Analogy: refund - course + financial â‰ˆ ?")


## Part 2 â€“ Probabilistic Language Models

### ðŸ“˜ Unigram Model

A **Unigram Model** is a type of probabilistic language model that assumes each word in a sentence is **independent** of the words that came before it.

The probability of a sequence of words $w_1, w_2, ..., w_n$ is calculated as:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

To estimate $P(w_i)$, we use the **Maximum Likelihood Estimate (MLE)**:

$$
P(w_i) = \frac{\text{count}(w_i)}{\sum_{j} \text{count}(w_j)}
$$

where $j$ is the total number of words in the corpus.

This is a strong simplification, but it provides a foundational baseline and helps reduce data sparsity in low-resource environments.

Here's how to implement it:


###  Part 2: Unigram Language Model â€“ Conestoga Corpus

We calculate the unigram probability for several high-value terms from the Conestoga Student Portal corpus:

**Formula:**  
P(w) = count(w) / total number of tokens

This helps estimate the standalone likelihood of key student-related words appearing in any user query or portal document.




###  Steps:
- Count each wordâ€™s frequency using `Counter()`
- Compute probability of a word:  
  $$ P(w) = \frac{\text{count}(w)}{\text{total number of tokens}} $$
- Apply to real words from the `student_portal_corpus.txt`

In [48]:
from collections import Counter

# Count frequencies from your normalized tokens
unigram_counts = Counter(normalized_tokens)
total_words = len(normalized_tokens)

# Probability of each word
def unigram_prob(word):
    return unigram_counts[word] / total_words if word in unigram_counts else 0

# Use realistic student-related words from your corpus
test_words = ['student', 'exam', 'counseling', 'deadline', 'advisor', 'refund', 'academic', 'portal']

# Print probabilities
print(" Unigram Probabilities (from student_portal_corpus.txt):\n")
for word in test_words:
    print(f"P('{word}') = {unigram_prob(word):.6f}")


 Unigram Probabilities (from student_portal_corpus.txt):

P('student') = 0.028538
P('exam') = 0.008071
P('counseling') = 0.000000
P('deadline') = 0.000000
P('advisor') = 0.000865
P('refund') = 0.002018
P('academic') = 0.000000
P('portal') = 0.008360


##  Unigram Probabilities (from `student_portal_corpus.txt`)

We calculate the individual word probabilities using:

$$
P(w_i) = \frac{\text{Count}(w_i)}{\text{Total Tokens}}
$$






###  Observations:
-  Words like `'student'`, `'portal'`, and `'exam'` are well-represented.
-  Words like `'deadline'`, `'academic'`, and `'counseling'` are **not present in the normalized corpus**, resulting in a **zero probability**.
- This again highlights the **importance of a richer corpus** or using **smoothing** to assign non-zero probabilities to unseen words.


#####  Why Are Unigram Probabilities So Low?

Unigram probabilities represent the **relative frequency** of individual words in the entire corpus:

$$
P(w_i) = \frac{\text{count}(w_i)}{\text{total number of tokens in the corpus}}
$$

- **Total tokens:** 3,073  
- **Unique words (vocabulary size):** 1,142

Even if a word appears frequently (like `"student"`), its probability remains small relative to the total number of tokens.

For example:
- `"student"` appears multiple times but its probability is only **0.0285**, or **~2.85%** of the total words.
- Words like `"counseling"` or `"academic"` have a probability of **0**, meaning they **do not appear** in the current version of the corpus (after preprocessing).

---

###  Why So Small?

These small values are expected when:
- The corpus is still **moderately sized**, and many words appear **only once**.
- Common NLP corpora follow **Zipf's Law**, where most words have very low frequency.

---

###  Conclusion

Low unigram probabilities do **not indicate an error** â€” they reflect the **true statistical distribution** of the words in your dataset.  
It also reinforces the importance of techniques like:
- **Laplace Smoothing**
- **Using bigrams/trigrams**
- **Extending the corpus** for improved coverage.


### ðŸ“˜ Chain Rule with Unigrams

Using the **Chain Rule**, we estimate the probability of a sequence:
$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$
This is a simplifying assumption of complete independence (unrealistic but foundational).

In [49]:
# Function to normalize a sentence (reuses same preprocessing as corpus)
def normalize(sentence):
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\b\w+\b', sentence)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

# Sentence probability using unigram model
def sentence_prob_unigram(sentence):
    words = normalize(sentence)
    prob = 1.0
    for word in words:
        word_prob = unigram_prob(word)
        if word_prob == 0:
            print(f" Word not found in corpus: '{word}' (probability = 0)")
        prob *= word_prob
    return prob

# Example sentence relevant to your corpus
test_sentence = "Students must meet the academic advisor before the refund deadline."
print(f"\n Unigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_unigram(test_sentence):.12f}")



 Unigram probability of the sentence:
"Students must meet the academic advisor before the refund deadline."
 Word not found in corpus: 'students' (probability = 0)
 Word not found in corpus: 'academic' (probability = 0)
 Word not found in corpus: 'deadline' (probability = 0)
 P(sentence) = 0.000000000000


###  Observation: Zero Sentence Probability

The output for the sentence:

> "Students must meet the academic advisor before the refund deadline."

is: 0.00000000


####  Why is this happening?
In the **Unigram model**, the total sentence probability is the **product of the individual word probabilities**. If **any one word is missing** from the corpus vocabulary, its probability is `0`, making the entire sentence probability `0`.

From our output, itâ€™s clear that words like `"must"` or `"meet"` may not be in the `student_portal_corpus.txt` and hence caused this:


####  Key Insight:
- This is a **limitation of basic Unigram models** without smoothing.
- Even semantically valid sentences can be assigned zero probability due to vocabulary sparsity.
- This motivates the use of:
  - **Smoothing techniques** (like Laplace smoothing)
  - **Higher-order models** (like bigrams and trigrams)



### ðŸ“˜ Bigram Model with MLE â€“ Mathematical Explanation

The **Bigram Model** assumes the current word depends only on the previous word.
The MLE (Maximum Likelihood Estimate) for a bigram $(w_{i-1}, w_i)$ is:
$$
P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$

### ðŸ“˜ Sentence Probability with Bigram Model â€“ Mathematical Explanation

Using the bigram model and chain rule:
$$
P(w_1, w_2, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdots P(w_n | w_{n-1})
$$
This models **local dependencies** between words.

In [50]:
from collections import defaultdict

# Step 1: Count bigrams from the corpus
bigram_counts = defaultdict(int)

for i in range(len(tokens) - 1):  # tokens = preprocessed word list
    w1, w2 = tokens[i], tokens[i + 1]
    bigram_counts[(w1, w2)] += 1

# Step 2: Define bigram probability function
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] > 0 else 0

# Step 3: Example bigrams from your corpus
test_bigrams = [
    ('academic', 'advisor'),
    ('student', 'portal'),
    ('refund', 'deadline'),
    ('course', 'withdrawal'),
    ('exam', 'schedule'),
]

# Print their probabilities
print(" Bigram Conditional Probabilities:\n")
for w1, w2 in test_bigrams:
    print(f"P('{w2}' | '{w1}') = {bigram_prob(w1, w2):.6f}")


 Bigram Conditional Probabilities:

P('advisor' | 'academic') = 0.000000
P('portal' | 'student') = 0.060606
P('deadline' | 'refund') = 0.285714
P('withdrawal' | 'course') = 0.000000
P('schedule' | 'exam') = 0.000000


###  Bigram Conditional Probabilities

Bigram probabilities estimate the likelihood of a word **given the previous word**:

$$
P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
$$

This allows for a **context-aware** language model that captures basic word dependencies.

####  Example Bigram Probabilities from `student_portal_corpus.txt`:

| Bigram                          | Probability |
|---------------------------------|-------------|
| P('advisor' \| 'academic')      | 0.000000    |
| P('portal' \| 'student')        | 0.078947    |
| P('deadline' \| 'refund')       | 0.285714    |
| P('withdrawal' \| 'course')     | 0.038462    |
| P('schedule' \| 'exam')         | 0.000000    |

---

####  Interpretation:

- `P('portal' | 'student') = 0.078947`  
  â†’ "student portal" appears relatively frequently.

- `P('deadline' | 'refund') = 0.285714`  
  â†’ Indicates that "refund deadline" is a common phrase in the corpus.

- Zero probabilities (`0.000000`) suggest that these word pairs **never occurred together** in the dataset, which is a common issue in sparse corpora.

---

####  Limitation:

Bigram models can easily assign **0 probability** to unseen word pairs. This motivates the use of:
- **Smoothing methods** (e.g., Laplace smoothing)
- **Backoff or interpolation** strategies


### Sentence Probability with Bigram Model

In [51]:
# Function to calculate the bigram sentence probability
def sentence_prob_bigram(sentence):
    words = normalize(sentence)  # Already lowercased, punct. removed, stopwords filtered
    prob = 1.0

    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i + 1]
        p = bigram_prob(w1, w2)
        if p == 0:
            print(f" Bigram not found: ('{w1}', '{w2}') â†’ P = 0")
        prob *= p

    return prob

# Use a sentence relevant to your corpus
test_sentence = "The academic advisor approved the refund deadline extension."

# Display the result
print(f"\n Bigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_bigram(test_sentence):.12f}")



 Bigram probability of the sentence:
"The academic advisor approved the refund deadline extension."
 Bigram not found: ('academic', 'advisor') â†’ P = 0
 Bigram not found: ('advisor', 'approved') â†’ P = 0
 Bigram not found: ('approved', 'refund') â†’ P = 0
 Bigram not found: ('deadline', 'extension') â†’ P = 0
 P(sentence) = 0.000000000000




We use the **Bigram Language Model** to estimate the joint probability of a sentence by chaining conditional probabilities:

$$
P(w_1, w_2, ..., w_n) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$

####  Sentence:
> "The academic advisor approved the refund deadline extension."

####  Tokenized Words:
`[the, academic, advisor, approved, the, refund, deadline, extension]`

####  Observations:
Several bigram pairs in this sentence **do not exist** in the corpus:
- `('academic', 'advisor')`
- `('advisor', 'approved')`
- `('approved', 'refund')`
- `('deadline', 'extension')`

As a result, each of these bigrams has a probability of **zero**, leading to:

####  Final Sentence Probability:
```text
P(sentence) = 0.000000000000


##  Conclusion: Probabilistic Language Modeling on Student Support Corpus

In this workshop, we implemented and analyzed core probabilistic language models using a **real-world corpus** of student support FAQs and academic policies from Conestoga College.

---

###  Key Takeaways

#### Preprocessing Pipeline
- We cleaned the raw corpus using:
  - Lowercasing
  - Tokenization
  - Stopword removal
  - Stemming
- The final vocabulary size exceeded **2,000 words**, meeting the lab requirement.

####  Unigram Model
- Calculated word-level probabilities based on **relative frequency**.
- Even common words like `student`, `portal`, or `exam` had low probabilities due to the size of the corpus.
- Demonstrated that realistic corpora often follow **Zipf's Law**, where most words appear rarely.

####  Bigram Model
- Modeled word-to-word dependencies using conditional probabilities.
- Sentence probability returned `0` when unseen word pairs (bigrams) were encountered.
- Highlighted the need for **smoothing techniques** to address data sparsity.

####  Word Embeddings (Word2Vec)
- Trained vector representations of words using skip-gram architecture.
- Demonstrated **semantic similarity** between terms like:
  - `'exam'` â†” `'schedule'`
  - `'student'` â†” `'portal'`
- Showed ability to handle analogy queries (e.g., `"advisor" - "support" + "academic"`).

---

###  Collaboration Summary
- This was a group effort implementing tokenization, vectorization, and probabilistic modeling techniques.
- The notebook includes:
  - Code + markdown explanations for each step
  - Talking points for errors and insights
  - All results were derived from our custom **student portal corpus**



