**Language Model and Application for Spelling Error Correction**

**Objective**: Develop a simple English syntax error correction program.

a) Build a language model based on n-grams using the Laplace smoothing method for the following models:

- 1-gram
- 2-gram
- 3-gram

b) Calculate the probability of a sentence and compute the Perplexity of a sentence based on 1-gram, 2-gram, and 3-gram models.

c) Analyze the results (Provide your own examples of spelling errors and calculate the probability of two similar sentences, where one has the correct word order and the other has an incorrect word order).

In [1]:
%pip install wordcloud




# Imports

In [None]:
import sys
import os
import platform

# Python environment details
print("Python executable being used:", sys.executable)
print("Python version:", sys.version)

# Operating system details
print("Operating System:", platform.system())
print("OS Version:", platform.version())
print("OS Release:", platform.release())

# Machine and architecture details
print("Machine:", platform.machine())

# Visual Studio Code details (based on an environment variable)
vscode_info = os.environ.get('VSCODE_PID', None)
if vscode_info:
    print("Running in Visual Studio Code")
else:
    print("Not running in Visual Studio Code")

Python executable being used: c:\Python312\python.exe
Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Operating System: Windows
OS Version: 10.0.19045
OS Release: 10
Machine: AMD64
Running in Visual Studio Code


In [3]:
import os
from collections import defaultdict
import math
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Function Definitions

## Function for loading the data

In [4]:
def load_data(filepath):
    """
    Load and preprocess text data from a file.
    """
    with open(filepath, "r", encoding="utf-8") as file:
        sentences = file.readlines()
    
    # Add <s> and </s> to each sentence and tokenize
    tokens = []
    for sentence in sentences:
        sentence = sentence.strip().lower()
        if sentence:  
            tokens.extend(["<s>"] + sentence.split() + ["</s>"])
    
    return tokens

## Function for building an n-gram model from a given corpus and value of n

In [5]:
def build_ngram_model(corpus, n):
    """
    Build an n-gram model from a given corpus and given n
    """
    ngram_counts = defaultdict(int)
    n_minus_1_counts = defaultdict(int)
    vocabulary = set(corpus)

    for i in range(len(corpus) - n + 1):
        ngram = tuple(corpus[i:i + n])
        n_minus_1_gram = tuple(corpus[i:i + n - 1])
        ngram_counts[ngram] += 1
        n_minus_1_counts[n_minus_1_gram] += 1

    return ngram_counts, n_minus_1_counts, len(vocabulary)
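
To make the three return values concrete, the sketch below applies the same sliding-window counting to a hypothetical two-sentence toy corpus (not `tedtalk.txt`). The helper `count_ngrams` is an illustrative restatement of `build_ngram_model`, included only so that the block runs on its own.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count n-grams and their (n-1)-gram prefixes with a sliding window."""
    windows = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in windows)
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in windows)
    return ngrams, prefixes, len(set(tokens))

# Hypothetical toy corpus, tokenized the way load_data returns it:
# one flat list with <s> and </s> around each sentence.
toy_tokens = ["<s>", "the", "cat", "sat", "</s>", "<s>", "the", "dog", "sat", "</s>"]

bigrams, prefixes, vocab_size = count_ngrams(toy_tokens, 2)
print(bigrams[("<s>", "the")])   # 2
print(prefixes[("the",)])        # 2
print(bigrams[("</s>", "<s>")])  # 1: the window crosses the sentence boundary
print(vocab_size)                # 6 (the markers count as vocabulary items)
```

Because the corpus is a single flat token list, n-grams that span two sentences, such as `("</s>", "<s>")`, are counted as well.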

## Function for computing sentence probabilities with Laplace smoothing

1. **Understanding N-gram Probabilities**:
   - An n-gram is a sequence of `n` tokens (words).
   - The probability of an n-gram with Laplace smoothing is computed as:

     $$
     P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\text{Count}(w_{i-(n-1)}, \ldots, w_i) + 1}{\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1}) + V}
     $$

     Where:
     - $\text{Count}(w_{i-(n-1)}, \ldots, w_i)$: Count of the n-gram in the training data.
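     - $\text{Count}(w_{i-(n-1)}, \ldots, w_{i-1})$: Count of the (n-1)-gram prefix. For the 1-gram model the prefix is empty, so this is the total number of tokens in the training data.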
     - $V$: Vocabulary size (total number of unique tokens in the training data).

2. **Handling Zero Counts**:
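   - If the n-gram does not appear in the training data, its count is 0 and the smoothed probability is $1 / (\text{Count}(\text{prefix}) + V)$; if the prefix is unseen as well, this reduces to $1 / V$. Adding 1 to the numerator and $V$ to the denominator therefore guarantees a non-zero probability.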

3. **Sentence Probability**:
   - A sentence's probability is the product of the probabilities of all n-grams in the sentence:

     $$
     P(\text{sentence}) = \prod_{i=1}^{N} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
     $$

     Here, the sentence is tokenized with start (`<s>`) and end (`</s>`) markers to capture context at the boundaries.
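
The smoothing formula above can be checked by hand on small numbers. The following is a minimal sketch with hypothetical counts (they are not taken from the training data); `laplace` is an illustrative helper, and exact fractions are used so that the arithmetic is easy to verify.

```python
from fractions import Fraction

def laplace(ngram_count, prefix_count, vocab_size):
    """Add-one smoothed conditional probability: (count + 1) / (prefix count + V)."""
    return Fraction(ngram_count + 1, prefix_count + vocab_size)

# Hypothetical counts, chosen only for illustration
V = 6
seen = laplace(ngram_count=2, prefix_count=2, vocab_size=V)           # (2 + 1) / (2 + 6)
unseen = laplace(ngram_count=0, prefix_count=2, vocab_size=V)         # (0 + 1) / (2 + 6)
unseen_prefix = laplace(ngram_count=0, prefix_count=0, vocab_size=V)  # (0 + 1) / (0 + 6)

print(seen, unseen, unseen_prefix)  # 3/8 1/8 1/6
print(seen * unseen)                # 3/64, the product for a two-n-gram sequence
```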


In [6]:
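def compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size):
    """
    Compute the Laplace-smoothed probability of an n-gram.
    """
    # Use .get() so that looking up an unseen n-gram does not insert a
    # zero-count entry into the defaultdict-based model.
    ngram_count = ngram_counts.get(ngram, 0)
    if len(ngram) > 1:
        n_minus_1_count = n_minus_1_counts.get(ngram[:-1], 0)
    else:
        # 1-gram model: the prefix is empty, so normalize by the total token count.
        n_minus_1_count = sum(ngram_counts.values())
    return (ngram_count + 1) / (n_minus_1_count + vocab_size)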

In [7]:
def sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the probability of a sentence using an n-gram model with Laplace smoothing.
    """
    # Lowercase the input to match the preprocessing applied in load_data
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        prob *= compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size)
    return prob
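
Multiplying many small probabilities can underflow to zero for long inputs, which is one reason the perplexity function in this notebook sums logarithms instead. The sketch below illustrates the effect with hypothetical per-n-gram probabilities; `log_product` is an illustrative helper, not part of the notebook.

```python
import math

def log_product(probabilities):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    return sum(math.log(p) for p in probabilities)

# Hypothetical per-n-gram probabilities for a long text
probs = [1e-4] * 100

product = 1.0
for p in probs:
    product *= p

print(product)             # 0.0: the true value, 1e-400, is below the float range
print(log_product(probs))  # about -921.03, still usable for comparisons
```

Two sentences can then be ranked by their log-probabilities, since the logarithm preserves ordering.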

## Function for computing perplexity using Laplace-smoothed probabilities

In [8]:
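def compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n):
    """
    Compute the perplexity of a sentence using an n-gram model.
    """
    # Lowercase the input to match the preprocessing applied in load_data
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    # Accumulate log-probabilities to avoid floating-point underflow
    log_prob = 0.0
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        log_prob += math.log(compute_laplace_probability(ngram, ngram_counts, n_minus_1_counts, vocab_size))
    # Normalize by the padded token count (including <s> and </s>),
    # not by the number of n-grams scored
    avg_neg_log_prob = -log_prob / len(tokens)
    return math.exp(avg_neg_log_prob)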
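
`compute_perplexity` divides the summed log-probability by `len(tokens)`, while only `len(tokens) - n + 1` n-grams are scored. For n > 1 this lowers the reported perplexity compared with a per-n-gram average. The sketch below compares the two normalizations on hypothetical probabilities; `perplexity_from_probs` and its `normalizer` parameter are illustrative names, not part of the notebook.

```python
import math

def perplexity_from_probs(probabilities, normalizer=None):
    """Return exp of the negative mean log-probability.

    normalizer defaults to the number of probabilities (one per scored n-gram).
    """
    if normalizer is None:
        normalizer = len(probabilities)
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / normalizer)

# Hypothetical case: a 3-gram model scoring a padded sentence of 8 tokens
# yields 8 - 3 + 1 = 6 probabilities; assume each one equals 0.01.
probs = [0.01] * 6

per_ngram = perplexity_from_probs(probs)                # divide by 6
per_token = perplexity_from_probs(probs, normalizer=8)  # divide by len(tokens)

print(round(per_ngram, 2))  # 100.0
print(round(per_token, 2))  # 31.62
```

Either convention can be used to compare sentences of the same length under the same n; values computed with different n are not directly comparable under the `len(tokens)` convention.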

## Function for building the 1-gram, 2-gram, and 3-gram models

In [9]:
def build_ngram_models(filepath):
    """
    Build n-gram models for 1-gram, 2-gram, and 3-gram.
    """
    corpus = load_data(filepath)
    ngram_models = {}
    
    for n in range(1, 4): 
        ngram_counts, n_minus_1_counts, vocab_size = build_ngram_model(corpus, n)
        ngram_models[n] = (ngram_counts, n_minus_1_counts, vocab_size)
    
    return ngram_models


## Function for calculating the probability and the perplexity of a sentence

In [10]:
def analyze_sentence(sentence, ngram_models):
    """
    Analyze a sentence by computing its probability and perplexity for 1-gram, 2-gram, and 3-gram models.
    """
    results = {}

    for n, (ngram_counts, n_minus_1_counts, vocab_size) in ngram_models.items():
        prob = sentence_probability(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        perplexity = compute_perplexity(sentence, ngram_counts, n_minus_1_counts, vocab_size, n)
        results[n] = {"probability": prob, "perplexity": perplexity}
    
    return results


## Function for comparing two sentences

In [11]:
def compare_sentences(correct_sentence, incorrect_sentence, ngram_models):
    """
    Compare probabilities and perplexities of two sentences: one correct and one incorrect.
    """
    correct_results = analyze_sentence(correct_sentence, ngram_models)
    incorrect_results = analyze_sentence(incorrect_sentence, ngram_models)

    comparison = {}
    for n in correct_results.keys():
        # Extract probabilities and perplexities for both sentences
        correct_prob = correct_results[n]["probability"]
        incorrect_prob = incorrect_results[n]["probability"]
        correct_perplexity = correct_results[n]["perplexity"]
        incorrect_perplexity = incorrect_results[n]["perplexity"]

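        # Exact floating-point comparison: a difference on the order of
        # 1e-16 relative to the probabilities is rounding noise, not a
        # real preference for either sentence.
        if correct_prob > incorrect_prob:
            higher_sentence = "Correct Sentence"
        elif incorrect_prob > correct_prob:
            higher_sentence = "Incorrect Sentence"
        else:
            higher_sentence = "Both sentences have equal probability"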

        # Store comparison results
        comparison[n] = {
            "correct": {
                "probability": correct_prob,
                "perplexity": correct_perplexity,
            },
            "incorrect": {
                "probability": incorrect_prob,
                "perplexity": incorrect_perplexity,
            },
            "higher_probability": higher_sentence,
            "probability_difference": abs(correct_prob - incorrect_prob),
        }

    return comparison


# Run the Main Functions for the Exercise


In [12]:
# Path to the training corpus (expects tedtalk.txt in the current working directory)
dataset_path = os.path.join(os.getcwd(), "tedtalk.txt")

# n-gram models
print("Building n-gram models...")
ngram_models = build_ngram_models(dataset_path)

# Example sentences for comparison
correct_sentence_example = "the cat sat on the mat"  
incorrect_sentence = "the mat sat on the cat"

# Compare probabilities and perplexities
comparison = compare_sentences(correct_sentence_example, incorrect_sentence, ngram_models)

# Display comparison results
print("\nComparison of Correct vs Incorrect Sentences:")
for n, results in comparison.items():
    print(f"\n{n}-gram Model:")
    print(f"Correct Sentence - Probability: {results['correct']['probability']:.6e}, Perplexity: {results['correct']['perplexity']:.2f}")
    print(f"Incorrect Sentence - Probability: {results['incorrect']['probability']:.6e}, Perplexity: {results['incorrect']['perplexity']:.2f}")
    print(f"Higher Probability: {results['higher_probability']}")
    print(f"Probability Difference: {results['probability_difference']:.6e}")

Building n-gram models...

Comparison of Correct vs Incorrect Sentences:

1-gram Model:
Correct Sentence - Probability: 5.360671e-27, Perplexity: 1922.42
Incorrect Sentence - Probability: 5.360671e-27, Perplexity: 1922.42
Higher Probability: Correct Sentence
Probability Difference: 7.174648e-43

2-gram Model:
Correct Sentence - Probability: 9.049810e-29, Perplexity: 3201.99
Incorrect Sentence - Probability: 9.049810e-29, Perplexity: 3201.99
Higher Probability: Incorrect Sentence
Probability Difference: 1.121039e-44

3-gram Model:
Correct Sentence - Probability: 4.723293e-31, Perplexity: 6176.17
Incorrect Sentence - Probability: 4.723293e-31, Perplexity: 6176.17
Higher Probability: Correct Sentence
Probability Difference: 8.758115e-47
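
In the output above, the two probabilities agree to every printed digit for all three models, yet one sentence is still reported as "higher". Each reported difference is about 1e-16 of the probability itself, which is the rounding granularity of a 64-bit float, so the label reflects rounding in the order of multiplication rather than a real preference. The following is a minimal sketch of a tolerance-based comparison; `compare_probabilities` and the `rel_tol` value are assumptions for illustration, not part of the notebook.

```python
import math

def compare_probabilities(p_correct, p_incorrect, rel_tol=1e-9):
    """Label which probability is higher, treating near-equal values as a tie."""
    if math.isclose(p_correct, p_incorrect, rel_tol=rel_tol):
        return "Both sentences have equal probability"
    return "Correct Sentence" if p_correct > p_incorrect else "Incorrect Sentence"

# Values copied from the 1-gram output above: probability and reported difference
p = 5.360671e-27
diff = 7.174648e-43

# The difference matches one unit in the last place (ulp) of the probability
print(math.isclose(math.ulp(p), diff, rel_tol=1e-6))  # True
print(compare_probabilities(p + diff, p))  # Both sentences have equal probability
print(compare_probabilities(2 * p, p))     # Correct Sentence
```

For the 1-gram model a tie is the expected result: both sentences contain the same words, and a model that ignores word order multiplies the same factors in a different sequence.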
