**Language Model and Application for Spelling Error Correction**

**Objective**: Develop a simple English syntax error correction program.

a) Build a language model based on n-grams using the Laplace smoothing method for the following models:

- 1-gram
- 2-gram
- 3-gram

b) Calculate the probability of a sentence and compute the Perplexity of a sentence based on 1-gram, 2-gram, and 3-gram models.

c) Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).

In [None]:
%pip install gdown matplotlib wordcloud

Note: you may need to restart the kernel to use updated packages.





# Imports

In [None]:
# import sys
# import os
# import platform

# # Python environment details
# print("Python executable being used:", sys.executable)
# print("Python version:", sys.version)

# # Operating System details
# print("Operating System:", platform.system())
# print("OS Version:", platform.version())
# print("OS Release:", platform.release())

# # Machine and architecture details
# print("Machine:", platform.machine())

# # Visual Studio Code details (based on environment variable)
# vscode_info = os.environ.get('VSCODE_PID', None)
# if vscode_info:
#     print("Running in Visual Studio Code")
# else:
#     print("Not running in Visual Studio Code")

Python executable being used: c:\Python312\python.exe
Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Operating System: Windows
OS Version: 10.0.19045
OS Release: 10
Machine: AMD64
Running in Visual Studio Code


In [19]:
import os
import re
import gdown
from collections import defaultdict
import math
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Function Definitions

## Function for loading the data

In [20]:
def load_data(filepath):
    """
    Load and preprocess text data from a file.
    """
    with open(filepath, "r", encoding="utf-8") as file:
        sentences = file.readlines()
    
    # Add <s> and </s> to each sentence and tokenize
    tokens = []
    for sentence in sentences:
        sentence = sentence.strip().lower()
        if sentence:  
            tokens.extend(["<s>"] + sentence.split() + ["</s>"])
    
    return tokens
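
Because `load_data` splits on whitespace only, punctuation stays attached to words, which is why tokens such as `'much,'` and `'chris.'` appear in the n-gram counts printed later in this notebook. A minimal sketch of one possible alternative follows. The `tokenize` helper and its regular expression are assumptions for illustration, not part of the notebook, and the results shown later were produced with the plain `split()` version.

```python
import re

def tokenize(line):
    """Lowercase a line and keep runs of letters, digits, and apostrophes (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", line.lower())

# Illustrative input line, not read from the corpus file.
line = "Thank you so much, Chris."

whitespace_tokens = line.lower().split()
regex_tokens = tokenize(line)

print(whitespace_tokens)  # ['thank', 'you', 'so', 'much,', 'chris.']
print(regex_tokens)       # ['thank', 'you', 'so', 'much', 'chris']
```

Stripping punctuation this way would shrink the vocabulary and merge counts such as `much` and `much,`, at the cost of losing sentence-final punctuation as a signal.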

## Function for building a specific n-gram model with given corpus and n value

In [21]:
def build_ngram_model(corpus, n):
    """
    Build an n-gram model from a given corpus and given n
    """
    ngram_counts = defaultdict(int)
    n_minus_1_counts = defaultdict(int)
    vocabulary = set(corpus)

    for i in range(len(corpus) - n + 1):
        ngram = tuple(corpus[i:i + n])
        n_minus_1_gram = tuple(corpus[i:i + n - 1])
        ngram_counts[ngram] += 1
        n_minus_1_counts[n_minus_1_gram] += 1

    return ngram_counts, n_minus_1_counts, len(vocabulary)
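
To make the sliding-window counting concrete, here is a small sketch on a made-up toy corpus. The `count_ngrams` helper is hypothetical and only mirrors the loop above; the tokens are illustrative and not taken from the TED data.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every window of n consecutive tokens (hypothetical stand-alone helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Two toy "sentences", already wrapped in boundary markers and flattened into one list.
toy_tokens = ["<s>", "i", "like", "tea", "</s>", "<s>", "i", "like", "coffee", "</s>"]

bigram_counts = count_ngrams(toy_tokens, 2)
vocab_size = len(set(toy_tokens))

print(bigram_counts[("i", "like")])    # 2
print(bigram_counts[("</s>", "<s>")])  # 1
print(vocab_size)                      # 6
```

The `('</s>', '<s>')` entry shows that, with a flat token list, windows also span sentence boundaries.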

## Function for computing sentence probabilities with Laplace smoothing

1. **Understanding N-gram Probabilities**:
   - An n-gram is a sequence of `n` tokens (words).
   - The probability of an n-gram with Laplace smoothing is computed as:

     $$
     P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\text{Count}(w_{i-(n-1)}, \ldots, w_i) + 1}{\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1}) + V}
     $$

     Where:
     - $\text{Count}(w_{i-(n-1)}, \ldots, w_i)$: Count of the n-gram in the training data.
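     - $\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1})$: Count of the (n-1)-gram prefix. For the 1-gram model the prefix is empty, so this count is the total number of tokens in the training data.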
     - $V$: Vocabulary size (total number of unique tokens in the training data).

2. **Handling Zero Counts**:
   - If the n-gram or its prefix does not appear in the training data, the smoothing process ensures a non-zero probability by adding 1 to the numerator and the vocabulary size $V$ to the denominator.

3. **Sentence Probability**:
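   - A sentence's probability is the product of the probabilities of all n-grams in the sentence:

     $$
     P(\text{sentence}) = \prod_{i=n}^{T} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
     $$

     where $T$ is the number of tokens including the boundary markers, so the product runs over every complete n-gram.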

     Here, the sentence is tokenized with start (`<s>`) and end (`</s>`) markers to capture context at the boundaries.
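
As a small worked example of the smoothing formula, the sketch below uses made-up counts (a vocabulary of 5 words and a prefix seen 10 times) and exact fractions so the arithmetic is easy to check. The `laplace` helper is hypothetical and is not one of the notebook's functions.

```python
from fractions import Fraction

def laplace(ngram_count, prefix_count, vocab_size):
    """Add-one smoothed conditional probability (illustrative helper)."""
    return Fraction(ngram_count + 1, prefix_count + vocab_size)

# Made-up numbers: how often each of the V = 5 vocabulary words follows one prefix.
V = 5
continuation_counts = [3, 7, 0, 0, 0]
prefix_count = sum(continuation_counts)  # the prefix was seen 10 times

probs = [laplace(c, prefix_count, V) for c in continuation_counts]

print(probs[0])    # 4/15, a seen n-gram: (3 + 1) / (10 + 5)
print(probs[2])    # 1/15, an unseen n-gram: (0 + 1) / (10 + 5)
print(sum(probs))  # 1
```

The smoothed probabilities over the whole vocabulary still sum to 1, and an unseen prefix (count 0) would give every word the uniform probability 1/V.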


In [22]:
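def compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size):
    """
    Compute the Laplace smoothed probability of an n-gram.
    """
    # .get() reads a count without inserting unseen keys into the defaultdict
    ngram_count = ngram_counts.get(ngram, 0)
    if len(ngram) > 1:
        n_minus_1_count = n_minus_1_counts.get(ngram[:-1], 0)
    else:
        # 1-gram: the prefix is empty, so normalize by the total token count
        n_minus_1_count = sum(ngram_counts.values())
    return (ngram_count + 1) / (n_minus_1_count + vocab_size)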

In [23]:
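def sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the probability of a sentence using an n-gram model with Laplace smoothing.

    The sentence is split on whitespace exactly as given (no lowercasing), while
    load_data lowercases the corpus, so capitalized words are scored as unseen tokens.
    """
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    # Multiply the smoothed probability of every complete n-gram window
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        prob *= compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size)
    return prob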
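
Multiplying many small probabilities shrinks the result quickly; the seven-word sample sentence scored later in this notebook already lands around 1e-33. For much longer inputs the product can underflow to 0.0, so a common alternative is to sum log-probabilities instead. The sketch below assumes the per-n-gram probabilities are already available as a list; `log_sentence_probability` is a hypothetical helper and the numbers are made up.

```python
import math

def log_sentence_probability(ngram_probs):
    """Sum of natural-log probabilities, which equals the log of their product."""
    return sum(math.log(p) for p in ngram_probs)

# Made-up per-n-gram probabilities for a long input of 80 n-grams.
probs = [1e-5] * 80

product = math.prod(probs)                  # 1e-400 is below the float range, so this is 0.0
log_prob = log_sentence_probability(probs)  # stays finite

print(product)
print(log_prob)
```

Two sentences can then be ranked by comparing their log-probabilities directly, without ever forming the product.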

## Function for computing perplexity using Laplace-smoothed probabilities

In [24]:
def compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the perplexity of a sentence using an n-gram model.

    The summed log-probability is normalized by the padded token count
    (including <s> and </s>), not by the number of n-grams scored.
    """
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob_sum = 0.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        log_prob_sum += math.log(compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size))
    avg_neg_log_prob = -log_prob_sum / len(tokens)
    return math.exp(avg_neg_log_prob)
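
A quick sanity check for any perplexity computation: a model that gives every token the same probability 1/V should have perplexity V. The sketch below uses a hypothetical `perplexity` helper that divides by the number of probabilities it receives, which is one common convention; the function above divides by the padded token count instead, so its values are not directly comparable across n.

```python
import math

def perplexity(probs):
    """exp of the average negative log-probability over the scored events (illustrative helper)."""
    avg_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log)

# Made-up vocabulary size; a uniform model assigns 1/V to every token.
V = 1000
uniform_ppl = perplexity([1 / V] * 7)

# A more confident model (probability 0.5 at every step) has lower perplexity.
confident_ppl = perplexity([0.5] * 4)

print(uniform_ppl)    # approximately 1000
print(confident_ppl)  # approximately 2
```

Lower perplexity means the model found the sequence less surprising.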

## Function for building first 3 n-gram models 

In [None]:
def build_ngram_models(filepath):
    """
    Build n-gram models for 1-gram, 2-gram, and 3-gram.
    """
    corpus = load_data(filepath)
    ngram_models = {}
    
    for n in range(1, 4): 
        ngram_counts, n_minus_1_counts, vocab_size = build_ngram_model(corpus, n)
        ngram_models[n] = (ngram_counts, n_minus_1_counts, vocab_size)
    
    return ngram_models


## Function for calculating the probability and the perplexity of a sentence

In [26]:
def analyze_sentence(sentence, ngram_models):
    """
    Analyze a sentence by computing its probability and perplexity for 1-gram, 2-gram, and 3-gram models.
    """
    results = {}

    for n, (ngram_counts, n_minus_1_counts, vocab_size) in ngram_models.items():
        prob = sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        perplexity = compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        results[n] = {"probability": prob, "perplexity": perplexity}
    
    return results


## Function for comparing 2 sentences by analyzing them

In [27]:
def compare_sentences(correct_sentence, incorrect_sentence, ngram_models):
    """
    Compare probabilities and perplexities of two sentences: one correct and one incorrect.
    """
    correct_results = analyze_sentence(correct_sentence, ngram_models)
    incorrect_results = analyze_sentence(incorrect_sentence, ngram_models)

    comparison = {}
    for n in correct_results.keys():
        # Extract probabilities and perplexities for both sentences
        correct_prob = correct_results[n]["probability"]
        incorrect_prob = incorrect_results[n]["probability"]
        correct_perplexity = correct_results[n]["perplexity"]
        incorrect_perplexity = incorrect_results[n]["perplexity"]

        if correct_prob > incorrect_prob:
            higher_sentence = "Correct Sentence"
        elif incorrect_prob > correct_prob:
            higher_sentence = "Incorrect Sentence"
        else:
            higher_sentence = "Both sentences have equal probability"

        # Store comparison results
        comparison[n] = {
            "correct": {
                "probability": correct_prob,
                "perplexity": correct_perplexity,
            },
            "incorrect": {
                "probability": incorrect_prob,
                "perplexity": incorrect_perplexity,
            },
            "higher_probability": higher_sentence,
            "probability_difference": abs(correct_prob - incorrect_prob),
            "perplexity_difference": abs(correct_perplexity - incorrect_perplexity),
        }

    return comparison


# Call Main functions for the Exercise

## Download the TED Talk corpus from a public Google Drive link

In [28]:
# Download tedtalk
url = "https://drive.google.com/file/d/1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq/view?usp=sharing"

def download_from_google_drive(url, output_filename=None):
    # Extract file ID using regex
    match = re.search(r"/d/([^/]+)", url)
    if not match:
        print("Error: Could not extract file ID from the URL.")
        return
    
    file_id = match.group(1)
    print(f"Extracted File ID: {file_id}")

    download_url = f"https://drive.google.com/uc?id={file_id}"

    if output_filename:
        gdown.download(download_url, output_filename, quiet=False)
    else:
        gdown.download(download_url, quiet=False)

download_from_google_drive(url, "tedtalk.txt")



Extracted File ID: 1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq


Downloading...
From: https://drive.google.com/uc?id=1ZFXJVav0rZ0V2TadMuY0TxWuwxkhN-nq
To: e:\2_LEARNING_BKU\2_File_2\K22_HK242\CO3085_NLP\BT\Lab03\tedtalk.txt
100%|██████████| 40.3M/40.3M [00:15<00:00, 2.65MB/s]


## a. Build n-gram models using Laplace smoothing method

In [29]:
# Path 
dataset_path = os.path.join(os.getcwd(), "tedtalk.txt")

# n-gram models
print("Building n-gram models...")
ngram_models = build_ngram_models(dataset_path)


Building n-gram models...


In [30]:
print("\nN-gram Counts for Each Model:")
for n, (ngram_counts, _, _) in ngram_models.items():
    print(f"\n{n}-gram Model Counts:")
    
    for ngram, count in list(ngram_counts.items())[:10]: 
        print(f"{ngram}: {count}")
    
    print(f"... (Total unique {n}-grams: {len(ngram_counts)})")



N-gram Counts for Each Model:

1-gram Model Counts:
('<s>',): 4005
('thank',): 5095
('you',): 85456
('so',): 50404
('much,',): 373
('chris.',): 56
('and',): 238523
("it's",): 32598
('truly',): 654
('a',): 170018
... (Total unique 1-grams: 175986)

2-gram Model Counts:
('<s>', 'thank'): 22
('thank', 'you'): 1515
('you', 'so'): 320
('so', 'much,'): 108
('much,', 'chris.'): 4
('chris.', 'and'): 4
('and', "it's"): 3575
("it's", 'truly'): 14
('truly', 'a'): 19
('a', 'great'): 1196
... (Total unique 2-grams: 1915064)

3-gram Model Counts:
('<s>', 'thank', 'you'): 12
('thank', 'you', 'so'): 257
('you', 'so', 'much,'): 22
('so', 'much,', 'chris.'): 3
('much,', 'chris.', 'and'): 2
('chris.', 'and', "it's"): 1
('and', "it's", 'truly'): 3
("it's", 'truly', 'a'): 4
('truly', 'a', 'great'): 1
('a', 'great', 'honor'): 8
... (Total unique 3-grams: 4649030)


## b. Calculate the probability and perplexity of a sentence

In [31]:
sample_sentence = "That man over there is so handsome"

print("Sample sentence:", sample_sentence)

result = analyze_sentence(sample_sentence, ngram_models)
print("\nSentence Analysis:")
for n, values in result.items():
    print(f"\n{n}-gram Model:")
    print(f"Probability: {values['probability']}")
    print(f"Perplexity: {values['perplexity']}")

Sample sentence: That man over there is so handsome

Sentence Analysis:

1-gram Model:
Probability: 6.426880047209636e-33
Perplexity: 3774.7558540347018

2-gram Model:
Probability: 3.3126875602593266e-34
Perplexity: 5247.832285059609

3-gram Model:
Probability: 6.727572757238024e-35
Perplexity: 6264.762651741025


## c. Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).

In [32]:
# Example sentences for comparison
correct_sentence_example = "the cat sat on the mat"  
incorrect_sentence = "the mat sat on the cat"

# Compare probabilities and perplexities
comparison = compare_sentences(correct_sentence_example, incorrect_sentence, ngram_models)

# Display comparison results
print("\nComparison of Correct vs Incorrect Sentences:")
for n, results in comparison.items():
    print(f"\n{n}-gram Model:")
    print(f"Correct Sentence - Probability: {results['correct']['probability']:.6e}, Perplexity: {results['correct']['perplexity']:.16f}")
    print(f"Incorrect Sentence - Probability: {results['incorrect']['probability']:.6e}, Perplexity: {results['incorrect']['perplexity']:.16f}")
    print(f"Higher Probability: {results['higher_probability']}")
    print(f"Probability Difference: {results['probability_difference']:.6e}")
    print(f"Perplexity Difference: {results['perplexity_difference']:.16f}")


Comparison of Correct vs Incorrect Sentences:

1-gram Model:
Correct Sentence - Probability: 5.360671e-27, Perplexity: 1922.4170439452398114
Incorrect Sentence - Probability: 5.360671e-27, Perplexity: 1922.4170439452432220
Higher Probability: Correct Sentence
Probability Difference: 7.174648e-43
Perplexity Difference: 0.0000000000034106

2-gram Model:
Correct Sentence - Probability: 9.049810e-29, Perplexity: 3201.9906991856773857
Incorrect Sentence - Probability: 9.049810e-29, Perplexity: 3201.9906991856828427
Higher Probability: Incorrect Sentence
Probability Difference: 1.121039e-44
Perplexity Difference: 0.0000000000054570

3-gram Model:
Correct Sentence - Probability: 4.723293e-31, Perplexity: 6176.1721563688815877
Incorrect Sentence - Probability: 4.723293e-31, Perplexity: 6176.1721563688815877
Higher Probability: Correct Sentence
Probability Difference: 8.758115e-47
Perplexity Difference: 0.0000000000000000
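
The probability differences printed above are roughly 1e-16 times the probabilities themselves, which is the resolution of double-precision floats. For the 1-gram model this is expected, since both sentences contain the same words and only the multiplication order differs. For the 2-gram and 3-gram models it is consistent with rounding rather than a real preference, so the "Higher Probability" labels in this run should not be read as the model favoring one word order. The sketch below checks this with `math.ulp` (available from Python 3.9) and shows a tolerance-based comparison. `compare_with_tolerance` and the tolerance of 1e-9 are assumptions for illustration, not part of the notebook.

```python
import math

def compare_with_tolerance(p_correct, p_incorrect, rel_tol=1e-9):
    """Label the higher probability, treating near-equal values as a tie (hypothetical helper)."""
    if math.isclose(p_correct, p_incorrect, rel_tol=rel_tol):
        return "tie"
    return "correct" if p_correct > p_incorrect else "incorrect"

# (probability, probability difference) copied from the printed output above.
reported = {
    1: (5.360671e-27, 7.174648e-43),
    2: (9.049810e-29, 1.121039e-44),
    3: (4.723293e-31, 8.758115e-47),
}

for n, (prob, diff) in reported.items():
    # math.ulp(x) is the gap between x and the next larger representable float.
    label = compare_with_tolerance(prob, prob + diff)
    print(f"{n}-gram: ulp={math.ulp(prob):.6e}, reported difference={diff:.6e}, label={label}")
```

Each reported difference matches one unit in the last place of the corresponding probability, and the tolerance-based comparison reports a tie for all three models.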
