Student: Nguyen Quang Phu

Student ID: 2252621 

**Language Model and Application for Spelling Error Correction**

**Objective**: Develop a simple English syntax error correction program.

a) Build a language model based on n-grams using the Laplace smoothing method for the following models:

- 1-gram
- 2-gram
- 3-gram

b) Calculate the probability of a sentence and compute the Perplexity of a sentence based on 1-gram, 2-gram, and 3-gram models.

c) Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).

In [1]:
%pip install gdown matplotlib wordcloud nltk







# Imports

In [2]:
import sys
import os
import platform
import re
import gdown
from collections import defaultdict
import math
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
nltk.download('punkt')  # Download the Punkt tokenizer model
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# Python environment details
print("Python executable being used:", sys.executable)
print("Python version:", sys.version)

# Operating System details
print("Operating System:", platform.system())
print("OS Version:", platform.version())
print("OS Release:", platform.release())

# Machine and architecture details
print("Machine:", platform.machine())

# Visual Studio Code details (based on environment variable)
vscode_info = os.environ.get('VSCODE_PID', None)
if vscode_info:
    print("Running in Visual Studio Code")
else:
    print("Not running in Visual Studio Code")

Python executable being used: c:\Python312\python.exe
Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Operating System: Windows
OS Version: 10.0.19045
OS Release: 10
Machine: AMD64
Running in Visual Studio Code


# Functions Definitions

## Function for loading the data

In [4]:
def clean_sentence(sentence):
    """
    Clean a sentence by removing unwanted characters like '-', '(', ')', etc.
    """
    cleaned = re.sub(r'[\(\)-]', '', sentence)  # Remove specific characters
    cleaned = re.sub(r'[^\w\s.,!?]', '', cleaned)  # Keep only alphanumeric, spaces, and basic punctuation
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # Remove extra spaces and trim
    
    return cleaned
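
To make the effect of the cleaning step concrete, here is a small self-contained sketch that repeats the same three regular expressions on a made-up input. The example sentence is an assumption chosen for illustration; it is not taken from the corpus.

```python
import re

def clean_sentence(sentence):
    """Same three substitutions as the cell above, repeated so this sketch runs on its own."""
    cleaned = re.sub(r'[\(\)-]', '', sentence)      # drop parentheses and hyphens
    cleaned = re.sub(r'[^\w\s.,!?]', '', cleaned)   # drop everything except word chars, spaces, . , ! ?
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()  # collapse runs of whitespace
    return cleaned

# Hypothetical input: the apostrophe, parentheses, and hyphen are all removed.
example = "It's a (true) story - really!"
print(clean_sentence(example))
```

Because apostrophes are stripped, contractions such as "It's" and "I'm" become "Its" and "Im", which is why those forms appear as tokens in the n-gram counts printed later in this notebook.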


In [5]:
def load_data(filepath):
    """
    Load and preprocess text data from a file.
    Split the data into sentences using nltk's sentence tokenizer.
    """
    with open(filepath, "r", encoding="utf-8") as file:
        text = file.read()
    
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    
    # Clean each sentence
    cleaned_sentences = [clean_sentence(sentence) for sentence in sentences]

    # Debug: Print the first 10 sentences
    print("First 10 sentences from the file:")
    for i, sentence in enumerate(cleaned_sentences[:10], start=1):
        print(f"{i}: {sentence}")
    
    return cleaned_sentences  # Return the list of cleaned sentences (not yet split into words)


## Function for building a specific n-gram model with given corpus and n value

In [6]:
def build_ngram_model(corpus, n):
    """
    Build an n-gram model from a given corpus and given n.
    """
    ngram_counts = defaultdict(int)
    n_minus_1_counts = defaultdict(int)
    vocabulary = set()

    for sentence in corpus:
        # Tokenize and add padding based on n
        tokens = sentence.split()  # Tokenize the sentence
        if n > 1:
            tokens = (["<s>"] * (n - 1)) + tokens + (["</s>"] * (n - 1))
        
        # Update vocabulary
        vocabulary.update(tokens)

        # Generate n-grams and (n-1)-grams
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            n_minus_1_gram = tuple(tokens[i:i + n - 1])
            ngram_counts[ngram] += 1
            n_minus_1_counts[n_minus_1_gram] += 1

    return ngram_counts, n_minus_1_counts, len(vocabulary)
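
As a quick illustration of the padding scheme used above, the following sketch lists the n-grams produced for one short sentence. The helper name `padded_ngrams` and the toy sentence are assumptions introduced only for this example.

```python
def padded_ngrams(sentence, n):
    """Return the n-grams of a whitespace-tokenized sentence, padded as in build_ngram_model."""
    tokens = sentence.split()
    if n > 1:
        # (n - 1) start markers and (n - 1) end markers, matching the cell above
        tokens = ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

for n in (1, 2, 3):
    print(n, padded_ngrams("the cat sat", n))
```

Note that for n = 3 the last trigram is `('sat', '</s>', '</s>')`, because the end of the sentence is padded with n - 1 markers. Padding with a single `</s>` is also common; this notebook uses the symmetric variant.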

## Function for computing sentence probabilities with Laplace smoothing

1. **Understanding N-gram Probabilities**:
   - An n-gram is a sequence of `n` tokens (words).
   - The probability of an n-gram with Laplace smoothing is computed as:

     $$
     P(w_i | w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\text{Count}(w_{i-(n-1)}, \ldots, w_i) + 1}{\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1}) + V}
     $$

     Where:
     - $\text{Count}(w_{i-(n-1)}, \ldots, w_i)$: Count of the n-gram in the training data.
     - $\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1})$: Count of the (n-1)-gram prefix.
     - $V$: Vocabulary size (total number of unique tokens in the training data).

2. **Handling Zero Counts**:
   - If the n-gram or its prefix does not appear in the training data, the smoothing process ensures a non-zero probability by adding 1 to the numerator and the vocabulary size $V$ to the denominator.

3. **Sentence Probability**:
   - A sentence's probability is the product of probabilities for all n-grams in the sentence:

     $$
     P(\text{sentence}) = \prod_{i=1}^{N} P(w_i | w_{i-(n-1)}, \ldots, w_{i-1})
     $$

     Here, the sentence is tokenized with start (`<s>`) and end (`</s>`) markers to capture context at the boundaries.
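
The smoothing formula can be checked by hand on a tiny corpus. The sketch below is a minimal bigram example; the two-sentence corpus and the function name `laplace_bigram_prob` are assumptions made for illustration. It uses a single start and end marker, which matches the padding for n = 2.

```python
from collections import Counter

def laplace_bigram_prob(prev, word, bigram_counts, prefix_counts, vocab_size):
    """(Count(prev, word) + 1) / (Count(prev) + V), as in the formula above."""
    return (bigram_counts[(prev, word)] + 1) / (prefix_counts[prev] + vocab_size)

# Toy corpus (hypothetical), padded with one <s> and one </s> per sentence.
corpus = ["the cat sat", "the cat ran"]
bigram_counts, prefix_counts, vocab = Counter(), Counter(), set()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        prefix_counts[prev] += 1

V = len(vocab)  # 6 unique tokens, including <s> and </s>
p_seen = laplace_bigram_prob("the", "cat", bigram_counts, prefix_counts, V)    # (2 + 1) / (2 + 6)
p_unseen = laplace_bigram_prob("the", "dog", bigram_counts, prefix_counts, V)  # (0 + 1) / (2 + 6)
print(V, p_seen, p_unseen)
```

The seen bigram gets 3/8 and the unseen one gets 1/8, so an unseen continuation is penalized but never assigned zero probability.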


In [7]:
def compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size):
    """
    Compute the Laplace smoothed probability of an n-gram.
    """
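    # Use .get() so that scoring unseen n-grams does not insert new zero-count keys into the defaultdicts
    ngram_count = ngram_counts.get(ngram, 0)
    n_minus_1_count = n_minus_1_counts.get(ngram[:-1], 0) if len(ngram) > 1 else sum(ngram_counts.values())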
    return (ngram_count + 1) / (n_minus_1_count + vocab_size)

In [8]:
def sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the probability of a sentence using an n-gram model with Laplace smoothing.
    Prints the padded tokens, each n-gram probability, and the final product to trace the computation.
    """
    # Add padding based on n-gram size
    if n > 1:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"] * (n - 1)
    else:
        tokens = sentence.split()

    print(f"Tokens with padding for n={n}: {tokens}")  

    prob = 1.0  # Initialize probability
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_prob = compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size)
        print(f"N-gram: {ngram}, Probability: {ngram_prob}")  

        prob *= ngram_prob

    print(f"Final sentence probability: {prob}")  
    return prob


## Function for computing perplexity using the Laplace-smoothed probabilities

In [9]:
def compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the perplexity of a sentence using an n-gram model.
    """
    # Add padding based on n-gram size
    if n > 1:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"] * (n - 1)
    else:
        tokens = sentence.split()
        
    prob = 0.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        prob += math.log(compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size))
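    # Note: this normalizes by the padded token count. A common convention divides by the
    # number of scored n-grams, len(tokens) - n + 1; the outputs in this notebook use len(tokens).
    prob = -prob / len(tokens)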
    return math.exp(prob)
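
Perplexity here is the exponential of the negative average log-probability, so the choice of normalizer matters. The sketch below is a self-contained illustration on made-up probabilities; the function name `perplexity_from_probs` and its `normalizer` parameter are assumptions, not part of the notebook's code.

```python
import math

def perplexity_from_probs(probs, normalizer=None):
    """exp(-(1/M) * sum(log p)); M defaults to the number of scored n-grams."""
    m = normalizer if normalizer is not None else len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / m)

# Four bigram probabilities of 1/4 each (hypothetical values).
probs = [0.25, 0.25, 0.25, 0.25]

pp_ngrams = perplexity_from_probs(probs)                # divide by 4 scored bigrams
pp_padded = perplexity_from_probs(probs, normalizer=5)  # divide by 5 padded tokens instead
print(pp_ngrams, pp_padded)
```

With a uniform probability of 1/4, dividing by the number of scored n-grams gives a perplexity of about 4, the intuitive "branching factor". Dividing by the larger padded token count, as `compute_perplexity` does for n > 1, yields a smaller value, so perplexities from models with different n are not directly comparable under that normalization.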

## Function for building the 1-gram, 2-gram, and 3-gram models

In [10]:
def build_ngram_models(filepath):
    """
    Build n-gram models for 1-gram, 2-gram, and 3-gram.
    """
    corpus = load_data(filepath)
    ngram_models = {}
    
    for n in range(1, 4): 
        ngram_counts, n_minus_1_counts, vocab_size = build_ngram_model(corpus, n)
        ngram_models[n] = (ngram_counts, n_minus_1_counts, vocab_size)
    
    return ngram_models


## Function for calculating the probability and the perplexity of a sentence

In [11]:
def analyze_sentence(sentence, ngram_models):
    """
    Analyze a sentence by computing its probability and perplexity for 1-gram, 2-gram, and 3-gram models.
    """
    results = {}

    for n, (ngram_counts, n_minus_1_counts, vocab_size) in ngram_models.items():
        prob = sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        perplexity = compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        results[n] = {"probability": prob, "perplexity": perplexity}
    
    return results


## Function for comparing 2 sentences by analyzing them

In [12]:
def compare_sentences(correct_sentence, incorrect_sentence, ngram_models):
    """
    Compare probabilities and perplexities of two sentences: one correct and one incorrect.
    """
    correct_results = analyze_sentence(correct_sentence, ngram_models)
    print("\n\n")
    incorrect_results = analyze_sentence(incorrect_sentence, ngram_models)

    comparison = {}
    for n in correct_results.keys():
        # Extract probabilities and perplexities for both sentences
        correct_prob = correct_results[n]["probability"]
        incorrect_prob = incorrect_results[n]["probability"]
        correct_perplexity = correct_results[n]["perplexity"]
        incorrect_perplexity = incorrect_results[n]["perplexity"]

        if correct_perplexity < incorrect_perplexity:
            higher_sentence = "Correct Sentence"
        elif incorrect_perplexity < correct_perplexity:
            higher_sentence = "Incorrect Sentence"
        else:
            higher_sentence = "Both sentences have equal probability"

        # Store comparison results
        comparison[n] = {
            "correct": {
                "probability": correct_prob,
                "perplexity": correct_perplexity,
            },
            "incorrect": {
                "probability": incorrect_prob,
                "perplexity": incorrect_perplexity,
            },
            "higher_probability": higher_sentence,
            "probability_difference": abs(correct_prob - incorrect_prob),
            "perplexity_difference": abs(correct_perplexity - incorrect_perplexity),
        }

    return comparison


# Call Main functions for the Exercise


## Download the TED Talk corpus from a public Google Drive link

In [13]:
# Download tedtalk
url = "https://drive.google.com/file/d/1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq/view?usp=sharing"

def download_from_google_drive(url, output_filename=None):
    # Extract file ID using regex
    match = re.search(r"/d/([^/]+)", url)
    if not match:
        print("Error: Could not extract file ID from the URL.")
        return
    
    file_id = match.group(1)
    print(f"Extracted File ID: {file_id}")

    download_url = f"https://drive.google.com/uc?id={file_id}"

    if output_filename:
        gdown.download(download_url, output_filename, quiet=False)
    else:
        gdown.download(download_url, quiet=False)

download_from_google_drive(url, "tedtalk.txt")



Extracted File ID: 1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq


Downloading...
From: https://drive.google.com/uc?id=1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq
To: e:\2_LEARNING_BKU\2_File_2\K22_HK242\CO3085_NLP\BT\Lab03\tedtalk.txt
100%|██████████| 40.3M/40.3M [00:03<00:00, 10.8MB/s]


## a. Build n-gram models using Laplace smoothing method

In [14]:
# Path 
dataset_path = os.path.join(os.getcwd(), "tedtalk.txt")

# n-gram models
print("Building n-gram models...")
ngram_models = build_ngram_models(dataset_path)


Building n-gram models...
First 10 sentences from the file:
1: Thank you so much, Chris.
2: And its truly a great honor to have the opportunity to come to this stage twice Im extremely grateful.
3: I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.
4: And I say that sincerely, partly because Mock sob I need that.
5: Laughter Put yourselves in my position.
6: Laughter I flew on Air Force Two for eight years.
7: Laughter Now I have to take off my shoes or boots to get on an airplane!
8: Laughter Applause Ill tell you one quick story to illustrate what thats been like for me.
9: Laughter Its a true story every bit of this is true.
10: Soon after Tipper and I left the Mock sob White House Laughter we were driving from our home in Nashville to a little farm we have 50 miles east of Nashville.


In [15]:
print("\nN-gram Counts for Each Model:")
for n, (ngram_counts, _, _) in ngram_models.items():
    print(f"\n{n}-gram Model Counts:")
    
    for ngram, count in list(ngram_counts.items())[:30]: 
        print(f"{ngram}: {count}")
    
    print(f"... (Total unique {n}-grams: {len(ngram_counts)})")



N-gram Counts for Each Model:

1-gram Model Counts:
('Thank',): 4422
('you',): 75911
('so',): 22408
('much,',): 374
('Chris.',): 56
('And',): 62910
('its',): 25849
('truly',): 652
('a',): 166604
('great',): 3565
('honor',): 191
('to',): 207468
('have',): 42789
('the',): 315455
('opportunity',): 873
('come',): 5564
('this',): 55079
('stage',): 492
('twice',): 339
('Im',): 13440
('extremely',): 636
('grateful.',): 22
('I',): 107400
('been',): 11132
('blown',): 84
('away',): 2019
('by',): 19069
('conference,',): 72
('and',): 175784
('want',): 10520
... (Total unique 1-grams: 163312)

2-gram Model Counts:
('<s>', 'Thank'): 3234
('Thank', 'you'): 1203
('you', 'so'): 321
('so', 'much,'): 107
('much,', 'Chris.'): 4
('Chris.', '</s>'): 56
('<s>', 'And'): 59698
('And', 'its'): 1442
('its', 'truly'): 10
('truly', 'a'): 19
('a', 'great'): 1174
('great', 'honor'): 16
('honor', 'to'): 26
('to', 'have'): 3164
('have', 'the'): 2253
('the', 'opportunity'): 302
('opportunity', 'to'): 501
('to', 'come'

## b. Calculate the probability and perplexity of a sentence

In [16]:
sample_sentence = "That man over there is so handsome"

print("Sample sentence:", sample_sentence)

result = analyze_sentence(sample_sentence, ngram_models)
print("\nSentence Analysis:")
for n, values in result.items():
    print(f"\n{n}-gram Model:")
    print(f"Probability: {values['probability']}")
    print(f"Perplexity: {values['perplexity']}")

Sample sentence: That man over there is so handsome
Tokens with padding for n=1: ['That', 'man', 'over', 'there', 'is', 'so', 'handsome']
N-gram: ('That',), Probability: 0.0005227573613579725
N-gram: ('man',), Probability: 0.00020762769513042024
N-gram: ('over',), Probability: 0.001083625332545805
N-gram: ('there',), Probability: 0.0019332597165663407
N-gram: ('is',), Probability: 0.012566802844936534
N-gram: ('so',), Probability: 0.00306100593432736
N-gram: ('handsome',), Probability: 3.1417348605260957e-06
Final sentence probability: 2.747978276158006e-23
Tokens with padding for n=2: ['<s>', 'That', 'man', 'over', 'there', 'is', 'so', 'handsome', '</s>']
N-gram: ('<s>', 'That'), Probability: 0.005737858677873619
N-gram: ('That', 'man'), Probability: 1.1966016513102789e-05
N-gram: ('man', 'over'), Probability: 3.033373171634321e-05
N-gram: ('over', 'there'), Probability: 0.00033285449003188393
N-gram: ('there', 'is'), Probability: 0.013895619442597455
N-gram: ('is', 'so'), Probability

## c. Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).

In [17]:
# Example sentences for comparison
correct_sentence_example = "the cat sat on the mat"  
incorrect_sentence = "cat the the on mat sat"

# Compare probabilities and perplexities
comparison = compare_sentences(correct_sentence_example, incorrect_sentence, ngram_models)


Tokens with padding for n=1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
N-gram: ('the',), Probability: 0.04309039618096174
N-gram: ('cat',), Probability: 2.1718949687984748e-05
N-gram: ('sat',), Probability: 6.228830853912608e-05
N-gram: ('on',), Probability: 0.005241643121868168
N-gram: ('the',), Probability: 0.04309039618096174
N-gram: ('mat',), Probability: 1.0927773427916854e-06
Final sentence probability: 1.4388166724576847e-20
Tokens with padding for n=2: ['<s>', 'the', 'cat', 'sat', 'on', 'the', 'mat', '</s>']
N-gram: ('<s>', 'the'), Probability: 5.430150168334655e-05
N-gram: ('the', 'cat'), Probability: 5.848331867769217e-05
N-gram: ('cat', 'sat'), Probability: 6.1172555544680435e-06
N-gram: ('sat', 'on'), Probability: 0.00021371566047298328
N-gram: ('on', 'the'), Probability: 0.056389635373798874
N-gram: ('the', 'mat'), Probability: 4.17737990554944e-06
N-gram: ('mat', '</s>'), Probability: 6.122911321875326e-06
Final sentence probability: 5.988224163869548e-30
Tokens with padd

In [18]:
# Display comparison results
print("\nComparison of Correct vs Incorrect Sentences:")
for n, results in comparison.items():
    print(f"\n{n}-gram Model:")
    print(f"Correct Sentence - Probability: {results['correct']['probability']:.6e}, Perplexity: {results['correct']['perplexity']:.16f}")
    print(f"Incorrect Sentence - Probability: {results['incorrect']['probability']:.6e}, Perplexity: {results['incorrect']['perplexity']:.16f}")
    print(f"Higher Probability (Lower Perplexity): {results['higher_probability']}")
    print(f"Perplexity Difference: {results['perplexity_difference']:.16f}")


Comparison of Correct vs Incorrect Sentences:

1-gram Model:
Correct Sentence - Probability: 1.438817e-20, Perplexity: 2027.6784695597984864
Incorrect Sentence - Probability: 1.438817e-20, Perplexity: 2027.6784695597984864
Higher Probability (Lower Perplexity): Both sentences have equal probability
Perplexity Difference: 0.0000000000000000

2-gram Model:
Correct Sentence - Probability: 5.988224e-30, Perplexity: 4496.1184463538020282
Incorrect Sentence - Probability: 3.907457e-37, Perplexity: 35564.2144581818647566
Higher Probability (Lower Perplexity): Correct Sentence
Perplexity Difference: 31068.0960118280636379

3-gram Model:
Correct Sentence - Probability: 2.456148e-40, Perplexity: 9140.5966350087601313
Incorrect Sentence - Probability: 1.061940e-42, Perplexity: 15753.9688859854304610
Higher Probability (Lower Perplexity): Correct Sentence
Perplexity Difference: 6613.3722509766703297
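
The identical 1-gram scores are expected: a unigram model multiplies one probability per word, and a product does not depend on the order of its factors, so any reordering of the same words scores the same. Only the 2-gram and 3-gram models, which condition on preceding words, can separate the two sentences. The sketch below reuses the unigram probabilities printed in the output of cell 17; the dictionary name `unigram_prob` is an assumption for illustration.

```python
import math

# Unigram probabilities copied from the cell 17 output above.
unigram_prob = {
    "the": 0.04309039618096174,
    "cat": 2.1718949687984748e-05,
    "sat": 6.228830853912608e-05,
    "on": 0.005241643121868168,
    "mat": 1.0927773427916854e-06,
}

def unigram_log_prob(sentence):
    """Sum of log-probabilities; working in log space also avoids underflow on long sentences."""
    return sum(math.log(unigram_prob[w]) for w in sentence.split())

lp_correct = unigram_log_prob("the cat sat on the mat")
lp_shuffled = unigram_log_prob("cat the the on mat sat")
print(math.exp(lp_correct), math.exp(lp_shuffled))
print(math.isclose(lp_correct, lp_shuffled))
```

Both values agree with the 1-gram probability reported above (about 1.4388e-20), up to floating-point rounding.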
