# Lab 8 - Probabilistic Language models
 
`Group 7:`
- Paula Ramirez 8963215
- Hasyashri Bhatt 9028501
- Babandeep 9001552
 
This notebook demonstrates:

- Building an NLP pipeline from scratch
- Implementing Unigram and Bigram models
- Estimating sentence probabilities using MLE




## Part 1 – NLP Pipeline

### Step 1: Select and Load a Corpus

Select a corpus from `nltk`, or upload your own text documents. Ensure your vocabulary size exceeds 2000 words.

## Step 1: Document Collection

We collected real-world FAQs and policy documents from Conestoga College, including:

- Academic Policies
- Attendance and Evaluations
- Financial Aid
- ONE Card Services
- Student Support and Counseling

All texts were combined into a single file:  
**student_portal_corpus.txt**  
This file forms the foundation for building our NLP models.


In [23]:
# STEP 1: Read the combined student portal corpus
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Corpus length (characters):", len(raw_text))


Corpus length (characters): 34731


##  Step 2: Preprocessing and Normalization

We applied a custom regex-based preprocessing pipeline:

- Converted all text to lowercase
- Removed punctuation, digits, and special characters
- Removed common stopwords (NLTK)
- Split the corpus into sentences using regex (no punkt dependency)
- Tokenized words (3+ characters) using regex

The result is a `tokenized_corpus` which is a list of lists, where each sublist is a sentence of cleaned tokens.

**Example:**
```python
[['students', 'academic', 'records'], ['financial', 'aid', 'available'], ...]
```


In [24]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
import re

def preprocess(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = re.findall(r'\b\w+\b', text)  # regex tokenizer
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokens = preprocess(raw_text)
print("Total tokens:", len(tokens))
print("Sample tokens:", tokens[:20])



Total tokens: 3406
Sample tokens: ['welcome', 'student', 'affairs', 'selfserve', 'portal', 'platform', 'designed', 'support', 'students', 'managing', 'academic', 'journey', 'ease', 'use', 'system', 'find', 'information', 'tuition', 'payments', 'registration']






### Step 3: Implement Tokenizer

##  Tokenization with Regex

To begin analyzing the corpus, we implemented a **simple regex-based tokenizer**. This method avoids dependencies like `nltk.tokenize.word_tokenize` and directly extracts words using regular expressions.

###  Steps:
- Loaded the merged corpus file: `student_portal_corpus.txt`
- Converted all text to lowercase
- Tokenized using regex: `\b\w+\b` (matches each run of word characters between word boundaries, so a hyphenated term such as `self-serve` yields two tokens)
- Output: flat list of word tokens



In [26]:
import re

#  Load the corpus first
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

#  Simple tokenizer using regex
def simple_tokenizer(text):
    return re.findall(r'\b\w+\b', text.lower())

#  Apply tokenizer
tokens = simple_tokenizer(corpus_text)

print(" Total tokens:", len(tokens))
print(" Sample tokens:", tokens[:20])


 Total tokens: 5333
 Sample tokens: ['welcome', 'to', 'the', 'student', 'affairs', 'self', 'serve', 'portal', 'this', 'platform', 'is', 'designed', 'to', 'support', 'students', 'in', 'managing', 'their', 'academic', 'journey']
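
The sample outputs so far show `selfserve` in Step 2 but `self`, `serve` here. A minimal sketch of why, using a made-up sample string (not taken from the corpus): deleting punctuation before tokenizing fuses hyphenated words, while tokenizing directly with `\b\w+\b` splits them. The helper name `strip_then_tokenize` is hypothetical.

```python
import re
import string

def simple_tokenizer(text):
    # Same pattern as the cell above: runs of word characters.
    return re.findall(r'\b\w+\b', text.lower())

def strip_then_tokenize(text):
    # Hypothetical variant: delete punctuation first (as the Step 2 cell does),
    # then apply the same regex.
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return re.findall(r'\b\w+\b', text)

sample = "Welcome to the Self-Serve Portal!"  # illustrative text only
print(simple_tokenizer(sample))     # hyphen acts as a boundary
print(strip_then_tokenize(sample))  # hyphen removed, words fused
```

Whichever behavior is chosen, using the same tokenizer for the corpus and for query sentences keeps the counts comparable.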


### Step 4: Normalization, Stemming, and Stopword Removal



After tokenizing the corpus, we applied normalization to clean and reduce the vocabulary.

### What We Did:
- Removed English stopwords using `nltk.corpus.stopwords`
- Removed punctuation tokens
- Applied stemming using `PorterStemmer` to reduce words to their base/root form



In [27]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('stopwords')

def normalize(tokens):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens if word not in stop_words and word not in string.punctuation]

normalized_tokens = normalize(tokens)



## Add Corpus to Vector Space (using Word2Vec)


In this step, we convert our student support corpus into a **semantic vector space** using the Word2Vec algorithm.

###  Goals:
- Learn numerical representations of words based on their context.
- Enable word similarity, analogy, and clustering queries later.

###  Preprocessing:
- Lowercased the text
- Removed punctuation and digits
- Removed stopwords using NLTK
- Split sentences using regex (e.g., `.`, `!`, `?`)
- Tokenized words with 3 or more characters



In [28]:
import re
from nltk.corpus import stopwords
from gensim.models import Word2Vec

#  Load corpus text
with open("data/student_portal_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read()

#  Tokenize using simple regex tokenizer
def simple_regex_tokenizer(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # remove punctuation and digits
    stop_words = set(stopwords.words("english"))
    
    # Split by common sentence boundaries
    sentences = re.split(r'[.!?]+', text)
    
    tokenized_sentences = []
    for sentence in sentences:
        tokens = re.findall(r'\b[a-zA-Z]{3,}\b', sentence)  # only words with 3+ chars
        tokens = [word for word in tokens if word not in stop_words]
        if tokens:
            tokenized_sentences.append(tokens)
    
    return tokenized_sentences

#  Preprocess and train Word2Vec
tokenized_corpus = simple_regex_tokenizer(corpus_text)

model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,
    seed=42
)

print(" Word2Vec model trained successfully!")


 Word2Vec model trained successfully!
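
One detail in the cell above is worth checking: punctuation is replaced with spaces before the split on `[.!?]+`, so the split may never see a sentence boundary and the whole corpus can end up as a single "sentence". The sketch below, on a made-up sample, compares the two orderings. The function names and the `stop_words` parameter (empty by default, so the block runs without NLTK data) are assumptions for illustration.

```python
import re

def clean_then_split(text):
    # Ordering used in the cell above: strip non-letters, then split.
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    return [s for s in re.split(r'[.!?]+', text) if s.strip()]

def split_then_clean(text, stop_words=frozenset()):
    # Alternative ordering: split on sentence punctuation first,
    # then clean and tokenize each sentence separately.
    tokenized = []
    for sentence in re.split(r'[.!?]+', text.lower()):
        sentence = re.sub(r'[^a-z\s]', ' ', sentence)
        tokens = [w for w in re.findall(r'\b[a-z]{3,}\b', sentence)
                  if w not in stop_words]
        if tokens:
            tokenized.append(tokens)
    return tokenized

sample = "Refunds are processed weekly. Contact the registrar! Need help?"
print(len(clean_then_split(sample)))  # number of "sentences" found
print(split_then_clean(sample))       # one token list per sentence
```

A plain split on `.` would also break on decimals and abbreviations, so this remains a rough heuristic.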





##  Querying the Vector Space (Word2Vec)

After training the Word2Vec model on our student support corpus, we can now query the **semantic vector space** to:

- Measure word similarity
- Retrieve most similar words
- Perform analogical reasoning (e.g., `"advisor" - "support" + "exam"`)

### 4A. Word Similarity

In [30]:
print(" Similarity between 'student' and 'advisor':")
print(model.wv.similarity('student', 'advisor'))


 Similarity between 'student' and 'advisor':
0.7036518
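
For reference, `model.wv.similarity` reports the cosine similarity between two word vectors. A minimal pure-Python sketch of that measure, using made-up 3-dimensional vectors rather than the trained 100-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors for illustration only; not taken from the model.
student_vec = [1.0, 2.0, 0.0]
advisor_vec = [2.0, 1.0, 0.0]
print(cosine_similarity(student_vec, advisor_vec))
```

Values near 1 mean the vectors point in almost the same direction. On a corpus this small, many word pairs can score high simply because there is little data to separate them.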


 ### 4B. Most Similar Words

In [31]:
print("\n Words most similar to 'exam':")
print(model.wv.most_similar('exam'))



 Words most similar to 'exam':
[('student', 0.9555577039718628), ('academic', 0.9519188404083252), ('contact', 0.9485390782356262), ('career', 0.9474049806594849), ('card', 0.9471673369407654), ('one', 0.9464384913444519), ('workshops', 0.9451332688331604), ('students', 0.944948136806488), ('conestoga', 0.941074013710022), ('may', 0.9405909180641174)]


### 4C. Analogy

In [32]:
print("\n Analogy: refund - course + financial ≈ ?")
print(model.wv.most_similar(positive=['refund', 'financial'], negative=['course']))



 Analogy: refund - course + financial ≈ ?
[('portal', 0.7643033862113953), ('term', 0.7641662359237671), ('check', 0.7596643567085266), ('policy', 0.7543754577636719), ('one', 0.7520149350166321), ('academic', 0.7512180209159851), ('documentation', 0.7488986253738403), ('student', 0.7474629878997803), ('learning', 0.745293915271759), ('events', 0.7448296546936035)]


## Part 2 – Probabilistic Language Models

### 📘 Unigram Model

A **Unigram Model** is a type of probabilistic language model that assumes each word in a sentence is **independent** of the words that came before it.

The probability of a sequence of words $w_1, w_2, ..., w_n$ is calculated as:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

To estimate $P(w_i)$, we use the **Maximum Likelihood Estimate (MLE)**:

$$
P(w_i) = \frac{\text{count}(w_i)}{\sum_{j} \text{count}(w_j)}
$$

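where the sum over $j$ runs over every word type in the vocabulary, so the denominator equals the total number of tokens in the corpus.

This is a strong simplification, but it provides a foundational baseline and suffers less from data sparsity than higher-order n-gram models, which matters for a small corpus.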

Here's how to implement it:
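
As a minimal sketch on a made-up token list (the function name `unigram_mle` and the toy data are assumptions, not part of the notebook pipeline), the MLE is just each count divided by the total token count:

```python
from collections import Counter

def unigram_mle(tokens):
    # count(w) / total number of tokens, for every word type w
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

toy_tokens = ["student", "portal", "student", "refund"]  # illustrative only
probs = unigram_mle(toy_tokens)
print(probs["student"])     # 2 of the 4 tokens
print(sum(probs.values()))  # the estimates form a distribution
```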


###  Part 2: Unigram Language Model – Conestoga Corpus

We calculate the unigram probability for several high-value terms from the Conestoga Student Portal corpus:

**Formula:**  
P(w) = count(w) / total number of tokens

This helps estimate the standalone likelihood of key student-related words appearing in any user query or portal document.




###  Steps:
- Count each word's frequency using `Counter()`
- Compute probability of a word:  
  $$ P(w) = \frac{\text{count}(w)}{\text{total number of tokens}} $$
- Apply to real words from the `student_portal_corpus.txt`

In [None]:
from collections import Counter

# Count frequencies from your normalized tokens
unigram_counts = Counter(normalized_tokens)
total_words = len(normalized_tokens)

# Probability of each word
def unigram_prob(word):
    return unigram_counts[word] / total_words if word in unigram_counts else 0

# Use realistic student-related words from your corpus
test_words = ['student', 'exam', 'counseling', 'deadline', 'advisor', 'refund', 'academic', 'portal']

# Print probabilities
print(" Unigram Probabilities (from student_portal_corpus.txt):\n")
for word in test_words:
    print(f"P('{word}') = {unigram_prob(word):.6f}")


 Unigram Probabilities (from student_portal_corpus.txt):

P('student') = 0.028538
P('exam') = 0.008071
P('counseling') = 0.000000
P('deadline') = 0.000000
P('advisor') = 0.000865
P('refund') = 0.002018
P('academic') = 0.000000
P('portal') = 0.008360
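
The zeros for `counseling`, `deadline`, and `academic` are most likely a lookup mismatch rather than missing words: `unigram_counts` is built from stemmed tokens, while the query words are unstemmed. A minimal sketch of one way to keep both sides consistent is to pass every query word through the same normalizer as the corpus. Here `toy_stem` is a crude stand-in for `PorterStemmer().stem`, and `make_unigram_prob` is a hypothetical helper; both are for illustration only.

```python
from collections import Counter

def toy_stem(word):
    # Crude suffix stripper standing in for PorterStemmer().stem.
    for suffix in ("ing", "ic", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def make_unigram_prob(corpus_tokens, normalize_word):
    # Count normalized corpus tokens once, then normalize each query
    # word the same way before looking it up.
    counts = Counter(normalize_word(t) for t in corpus_tokens)
    total = sum(counts.values())

    def prob(word):
        return counts[normalize_word(word)] / total if total else 0.0

    return prob

corpus = ["academic", "deadline", "deadlines", "counseling", "students"]  # toy data
p = make_unigram_prob(corpus, toy_stem)
print(p("deadline"), p("deadlines"), p("academic"))
```

In the notebook, the equivalent change would be to stem `word` inside `unigram_prob` before indexing `unigram_counts`.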


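###  Part 2: Unigram Language Model – Conestoga Corpus

We computed the unigram probability of selected keywords from our Conestoga student portal corpus using the formula:

**P(w) = count(w) / total number of tokens**

This simple model estimates the independent likelihood of each word appearing in the text. These probabilities help us understand which terms dominate the language used in student-facing content.

Note that the table values are consistent with counts over the unstemmed Step 3 token list (5,333 tokens; for example, 76 / 5333 ≈ 0.014251 for `student`), so they differ from the cell output above, which was computed from stemmed, stopword-filtered tokens.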

---

 **Unigram Probabilities (from student_portal_corpus.txt):**

| Word         | Probability |
|--------------|-------------|
| student      | 0.014251    |
| exam         | 0.003750    |
| counseling   | 0.000750    |
| deadline     | 0.001500    |
| advisor      | 0.000563    |
| refund       | 0.001313    |
| academic     | 0.011626    |
| portal       | 0.005438    |

---

These words are frequently found in student support, financial aid, academic policy, and scheduling queries, which are critical areas for building a relevant chatbot or predictive query system.

 **Talking Point:**  
Although the unigram model gives useful individual word likelihoods, it fails to capture **word order or context**.


##### 📘 Why Are Unigram Probabilities So Low?

Unigram probabilities represent the **relative frequency** of individual words in the entire corpus:

$$
P(w_i) = \frac{\text{count}(w_i)}{\text{total number of tokens in the corpus}}
$$

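The figures below appear to be carried over from the Reuters-based workshop template, where the total number of tokens is large; our Conestoga corpus is far smaller (5,333 tokens before stopword removal):

- **Total tokens:** 1,178,604  
- **Unique words (vocabulary size):** 67,151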

Even if a word appears frequently, its probability will still be small relative to the total number of tokens.

For example:
- `"bank"` appears quite often, yet its probability is only **0.00493**, or about **0.5%** of the total words.
- `"citibank"` appears only a few times, resulting in a much smaller probability of **0.00005**.

These small values are expected when:
- The corpus is **large and diverse** (like Reuters).
- Many words appear **only once or twice**, which is common in natural language (known as Zipf's Law).

**Conclusion:**  
Low unigram probabilities do **not** indicate an error; they reflect a realistic distribution of word frequencies across a large corpus. This also highlights the need for smoothing when building more complex language models.


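### 📘 Chain Rule with Unigrams

Applying the **Chain Rule** together with the unigram independence assumption, every conditional factor $P(w_i \mid w_1, \dots, w_{i-1})$ collapses to $P(w_i)$, so the probability of a sequence is:
$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$
This is a simplifying assumption of complete independence (unrealistic but foundational).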

In [35]:
# Function to normalize a sentence (reuses same preprocessing as corpus)
def normalize(sentence):
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\b\w+\b', sentence)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

# Sentence probability using unigram model
def sentence_prob_unigram(sentence):
    words = normalize(sentence)
    prob = 1.0
    for word in words:
        word_prob = unigram_prob(word)
        if word_prob == 0:
            print(f" Word not found in corpus: '{word}' (probability = 0)")
        prob *= word_prob
    return prob

# Example sentence relevant to your corpus
test_sentence = "Students must meet the academic advisor before the refund deadline."
print(f"\n Unigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_unigram(test_sentence):.12f}")



 Unigram probability of the sentence:
"Students must meet the academic advisor before the refund deadline."
 P(sentence) = 0.000000000000
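
The printed value is rounded to 12 decimal places, so a tiny but non-zero product is displayed as zero. A common workaround is to sum log-probabilities and print in scientific notation. The sketch below uses made-up word probabilities (not values computed from the corpus), and `sentence_logprob` is a hypothetical helper.

```python
import math

def sentence_logprob(words, prob_fn):
    # Sum of log P(w); returns -inf as soon as any word has probability 0.
    logp = 0.0
    for word in words:
        p = prob_fn(word)
        if p == 0:
            return float("-inf")
        logp += math.log(p)
    return logp

# Made-up probabilities, for illustration only.
toy_probs = {"students": 0.004, "academic": 0.012, "advisor": 0.0006,
             "refund": 0.0013, "deadline": 0.0015}
words = list(toy_probs)

logp = sentence_logprob(words, lambda w: toy_probs.get(w, 0.0))
print(f"log P(sentence) = {logp:.3f}")
print(f"P(sentence)     = {math.exp(logp):.3e}")   # scientific notation keeps the value visible
print(f"P(sentence)     = {math.exp(logp):.12f}")  # fixed-point formatting rounds it away
```

Log-space also avoids floating-point underflow for long sentences, where the raw product can drop below the smallest representable float.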


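##### 📘 Why Is the Sentence Probability So Low?

The cell above prints `0.000000000000` because the value is rounded to 12 decimal places; the underlying product is tiny but not necessarily zero. The value below appears to be carried over from the Reuters-based workshop template rather than computed from our corpus, and it shows the typical order of magnitude of a **unigram sentence probability**:

```python
2.382179640797073e-37
```

A number this small is **expected** for long sentences under a unigram model. Here's why: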


##### 🔢 Corpus Statistics

These figures appear to come from the Reuters-based template, not from the Conestoga corpus:

* **Total number of tokens:** 1,178,604
* **Vocabulary size (unique tokens):** 67,151

##### 📉 How the Unigram Model Works

The unigram model computes sentence probability as the **product of individual word probabilities**:

$$
P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i)
$$

Each word typically has a probability between 0.00001 and 0.01. When multiplying **10–20 small numbers together**, the final result becomes **exponentially smaller**, approaching zero for longer sentences.

##### 🧪 Impact of Preprocessing (Step 4)

The normalization step involves:

* Lowercasing
* **Stop word removal** (e.g., "the", "of", "for", "said")
* **Stemming** (e.g., "management" → "manag")
* **Punctuation removal**

This reduces the number of words used in the calculation. While this makes the vocabulary smaller and more manageable, it also means:

* **Common but removed words** (like "the") don't contribute to the probability.
* **Stemmed forms** may not match original unigrams perfectly (e.g., "sino-chilean" becomes `sinochilean` or `sino` and `chilean`, depending on the tokenizer).

So even though the sentence appears long, **only 7–12 stemmed and filtered tokens** may remain after preprocessing, yet each one still has a very small individual probability.

##### ✅ Key Takeaways

* Low sentence probabilities are **normal** in unigram models, especially for longer sentences.
* The **multiplicative nature** of probability and the **sparsity of natural language** lead to very small final values.
* These limitations are one reason why more advanced models (like bigrams or neural LMs) are needed for realistic NLP applications.

You can inspect the intermediate tokens like this:

```python
print(normalize(test_sentence))  # normalize() here takes the raw sentence string
```


### 📘 Bigram Model with MLE – Mathematical Explanation

The **Bigram Model** assumes the current word depends only on the previous word.
The MLE (Maximum Likelihood Estimate) for a bigram $(w_{i-1}, w_i)$ is:
$$
P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}
$$

**👨‍🏫 Professor Talking Point:** This simple multiplication illustrates the chain rule, but we'll soon see how to improve this with context.

### 📘 Sentence Probability with Bigram Model – Mathematical Explanation

Using the bigram model and chain rule:
$$
P(w_1, w_2, ..., w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot P(w_3 | w_2) \cdots P(w_n | w_{n-1})
$$
This models **local dependencies** between words.

In [36]:
from collections import defaultdict

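# Step 1: Count bigrams from the corpus
bigram_counts = defaultdict(int)

# tokens = flat lowercase token list from Step 3 (stopwords included, unstemmed);
# adjacent pairs are counted across sentence boundaries as well.
for w1, w2 in zip(tokens, tokens[1:]):
    bigram_counts[(w1, w2)] += 1

# Step 2: Define bigram probability function (MLE)
# For a valid estimate, unigram_counts should be built from the same token list as bigram_counts.
def bigram_prob(w1, w2):
    # .get avoids inserting unseen bigrams into the defaultdict on lookup
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1] if unigram_counts[w1] > 0 else 0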

# Step 3: Example bigrams from your corpus
test_bigrams = [
    ('academic', 'advisor'),
    ('student', 'portal'),
    ('refund', 'deadline'),
    ('course', 'withdrawal'),
    ('exam', 'schedule'),
]

# Print their probabilities
print(" Bigram Conditional Probabilities:\n")
for w1, w2 in test_bigrams:
    print(f"P('{w2}' | '{w1}') = {bigram_prob(w1, w2):.6f}")


 Bigram Conditional Probabilities:

P('advisor' | 'academic') = 0.000000
P('portal' | 'student') = 0.078947
P('deadline' | 'refund') = 0.285714
P('withdrawal' | 'course') = 0.038462
P('schedule' | 'exam') = 0.000000
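
For the ratio count(w1, w2) / count(w1) to be a proper MLE, both counts need to come from the same token stream. A minimal sketch, assuming the corpus is available as a list of tokenized sentences (toy data below; `count_ngrams` and `bigram_mle` are hypothetical helpers), counts both in one pass and never pairs words across a sentence boundary:

```python
from collections import Counter

def count_ngrams(sentences):
    # Build unigram and bigram counts from the same tokenized sentences.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))  # pairs within one sentence only
    return unigrams, bigrams

def bigram_mle(w1, w2, unigrams, bigrams):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

toy_sentences = [  # illustrative only
    ["refund", "deadline", "extended"],
    ["refund", "policy"],
    ["course", "refund", "deadline"],
]
uni, bi = count_ngrams(toy_sentences)
print(bigram_mle("refund", "deadline", uni, bi))  # 2 of the 3 'refund' tokens
print(bi[("extended", "refund")])                 # no pair across sentences
```

An end-of-sentence marker, omitted here for brevity, would make the conditional probabilities for each history sum to one.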


### Sentence Probability with Bigram Model

In [37]:
# Function to calculate the bigram sentence probability
def sentence_prob_bigram(sentence):
    words = normalize(sentence)  # Already lowercased, punct. removed, stopwords filtered
    if not words:
        return 0.0

    # Chain rule under the bigram assumption: P(w1) * product of P(w_i | w_{i-1})
    prob = unigram_prob(words[0])

    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i + 1]
        p = bigram_prob(w1, w2)
        if p == 0:
            print(f" Bigram not found: ('{w1}', '{w2}') → P = 0")
        prob *= p

    return prob

# Use a sentence relevant to your corpus
test_sentence = "The academic advisor approved the refund deadline extension."

# Display the result
print(f"\n Bigram probability of the sentence:\n\"{test_sentence}\"")
print(f" P(sentence) = {sentence_prob_bigram(test_sentence):.12f}")



 Bigram probability of the sentence:
"The academic advisor approved the refund deadline extension."
 Bigram not found: ('academic', 'advisor') → P = 0
 Bigram not found: ('advisor', 'approved') → P = 0
 Bigram not found: ('approved', 'refund') → P = 0
 Bigram not found: ('deadline', 'extension') → P = 0
 P(sentence) = 0.000000000000
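
A single unseen bigram drives the whole product to zero, which is the sparsity problem that smoothing addresses. The sketch below shows add-one (Laplace) smoothing on made-up counts; it is one simple option and not something the notebook applies, and `laplace_bigram_prob` is a hypothetical helper.

```python
from collections import Counter

def laplace_bigram_prob(w1, w2, unigrams, bigrams, vocab_size):
    # Add-one smoothing: (count(w1, w2) + 1) / (count(w1) + V)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

# Toy counts for illustration only.
unigrams = Counter({"academic": 3, "policy": 2, "advisor": 1, "refund": 2})
bigrams = Counter({("academic", "policy"): 2})
V = len(unigrams)  # vocabulary size

print(laplace_bigram_prob("academic", "policy", unigrams, bigrams, V))   # seen bigram
print(laplace_bigram_prob("academic", "advisor", unigrams, bigrams, V))  # unseen, but non-zero
```

Add-one smoothing shifts a lot of probability mass to unseen pairs when the vocabulary is large relative to the corpus, so the resulting values should be read as rough estimates.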


**👨‍🏫 Professor Talking Point:** Estimating sentence probability using bigrams shows how sequence information improves prediction power.

## Part 3: The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.

## 🧠 Learning Objectives
- Implement the foundations of **Probabilistic Language Models** using real-world data during the NLP process.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(20 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(65 min)* – NLP Pipeline and four Probabilistic Language Model method implementations + Markdown documentation (work as teams)
3. **Push to GitHub** *(5 min)* – Teams commit and push the one notebook. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** - The instructor will go around, take notes, and provide coaching as needed, during the **Peer Review Round**
5. **Email Delivery** *(1 min)* – Each team sends the instructor an email **with the *.git link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - Probabilistic Language Models Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `ProbabilisticLanguageModels.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the four methods.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `ProbabilisticLanguageModels`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🧭 Conclusion

Today you've constructed your own basic language model. Next class, we'll expand these ideas to explore **Large Language Models (LLMs)**, like ChatGPT, which learn patterns over **massive corpora** using **deep neural networks** instead of just counts.