Lab8_NGram_Model_

Import Required Libraries

In [None]:
import re                      # Regular expressions for removing punctuation, numbers, etc.
import math                    # Used for probability and perplexity calculations
from collections import Counter, defaultdict  # Counter for counting N-grams, defaultdict for handling missing keys
import nltk                    # Natural Language Toolkit for NLP tasks
from nltk.tokenize import sent_tokenize, word_tokenize  # For sentence and word tokenization
from nltk.corpus import stopwords  # Used for optional stopword removal
import pandas as pd            # Used to create tables for counts and probabilities

# Download required NLTK resources
nltk.download('punkt')         # Required for sentence and word tokenization
nltk.download('stopwords')     # Required for stopword removal
nltk.download('punkt_tab')     # Additional tokenizer data (fixes missing resource error)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Load Dataset

In [None]:

text = """
The dataset used in this experiment is a general English text corpus consisting of more than 1500 words.
It contains paragraphs related to language, communication, natural language processing, language models, and artificial intelligence.
The text is written in clear, grammatically correct English and includes multiple sentences with common vocabulary and sentence structures.
This makes it suitable for building Unigram, Bigram, and Trigram language models, as it provides sufficient word combinations and contextual patterns.
The dataset does not contain special symbols or code-mixed language, which simplifies preprocessing and probability calculations.
"""

print("Total number of words in dataset:", len(text.split()))

print("\nSample text from dataset:\n")
print(text)


Total number of words in dataset: 89

Sample text from dataset:


The dataset used in this experiment is a general English text corpus consisting of more than 1500 words.
It contains paragraphs related to language, communication, natural language processing, language models, and artificial intelligence.
The text is written in clear, grammatically correct English and includes multiple sentences with common vocabulary and sentence structures.
This makes it suitable for building Unigram, Bigram, and Trigram language models, as it provides sufficient word combinations and contextual patterns.
The dataset does not contain special symbols or code-mixed language, which simplifies preprocessing and probability calculations.



Preprocess Text

In [None]:

# Function to convert text to lowercase
def to_lowercase(text):
    return text.lower()


# Function to remove punctuation and numbers
def remove_punctuation_numbers(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text


# Function to split text into sentences and tokenize each sentence into words
def tokenize_text(text):
    sentences = sent_tokenize(text)
    tokenized_sentences = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        tokenized_sentences.append(words)

    return tokenized_sentences


# Optional function to remove stopwords
def remove_stopwords(tokenized_sentences):
    stop_words = set(stopwords.words('english'))
    filtered_sentences = []

    for sentence in tokenized_sentences:
        filtered = [word for word in sentence if word not in stop_words]
        filtered_sentences.append(filtered)

    return filtered_sentences


# Function to add start and end tokens
def add_start_end_tokens(tokenized_sentences):
    final_sentences = []

    for sentence in tokenized_sentences:
        final_sentences.append(['<s>'] + sentence + ['</s>'])

    return final_sentences

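# Apply preprocessing steps to the training text.
# g_train_text is not defined in the cells shown here; it is assumed to hold the training split.
processed_text_lower = to_lowercase(g_train_text)
# Note: punctuation is stripped before sentence tokenization, so sent_tokenize sees no
# sentence-ending marks and returns the whole text as a single sentence (see the output below).
processed_text_cleaned = remove_punctuation_numbers(processed_text_lower)
tokens_raw = tokenize_text(processed_text_cleaned)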
tokens_filtered = remove_stopwords(tokens_raw)      # Optional step
tokens = add_start_end_tokens(tokens_filtered)

# Print preprocessed text sentence by sentence
print("Preprocessed Text:\n")

for i, sentence in enumerate(tokens):
    print(f"Sentence {i+1}:")
    print(" ".join(sentence))
    print()


Preprocessed Text:

Sentence 1:
<s> natural language processing nlp subfield linguistics computer science artificial intelligence data science goal enable computers understand interpret generate human language nlp evolved significantly past decades driven advancements machine learning deep learning early nlp systems relied rulebased approaches involved handcrafting grammatical rules dictionaries effective limited domains systems brittle difficult scale advent statistical nlp marked shift towards datadriven methods models learned patterns large text corpora techniques like ngrams hidden markov models hmms support vector machines svms became popular recently deep learning revolutionized nlp recurrent neural networks rnns especially long shortterm memory lstm networks gated recurrent units grus instrumental processing sequential data like text introduction transformer architecture selfattention mechanism pushed boundaries leading powerful pretrained models like bert gpt models achieved st
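
Because punctuation is removed before `sent_tokenize` runs, the splitter finds no sentence boundaries and the corpus comes back as one long sentence, which is why only "Sentence 1" appears above. A minimal sketch of the alternative order (split into sentences first, then clean each sentence) follows. It uses a naive regex splitter instead of NLTK so that it runs without downloads; `split_then_clean` and the sample text are illustrative names, not part of the original notebook.

```python
import re

def split_then_clean(raw_text):
    """Split into sentences first, then lowercase and strip non-letters per sentence."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', raw_text.strip())
    cleaned = []
    for sentence in sentences:
        words = re.sub(r'[^a-z\s]', '', sentence.lower()).split()
        if words:  # skip sentences that are empty after cleaning
            cleaned.append(['<s>'] + words + ['</s>'])
    return cleaned

sample = "NLP is fun. It has 2 parts! Does it scale?"
for sent in split_then_clean(sample):
    print(" ".join(sent))
```

With this order every sentence gets its own `<s>` and `</s>` markers, so N-grams no longer span sentence boundaries.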

Build N-Gram Models

In [None]:
print("1️⃣ Unigram Model")
# Flatten tokens
all_words = [word for sentence in tokens for word in sentence]

# Count unigrams
unigram_counts = Counter(all_words)

# Convert to table
unigram_table = pd.DataFrame(
    unigram_counts.items(),
    columns=["Word", "Count"]
).sort_values(by="Count", ascending=False)

unigram_table.head()



total_words = sum(unigram_counts.values())

unigram_table["Probability"] = unigram_table["Count"] / total_words

unigram_table.head()



1️⃣ Unigram Model


        Word  Count  Probability
4        nlp    112     0.053794
2   language     48     0.023055
55      text     48     0.023055
51    models     40     0.019212
58      like     32     0.015370


In [None]:
print("2️⃣ Bigram Model")                 # Display model name

bigram_counts = Counter()               # Initialize Counter to store bigram frequencies

# Loop through each preprocessed sentence
for sentence in tokens:
    # Loop through words in the sentence to form bigrams
    for i in range(len(sentence) - 1):
        bigram = (sentence[i], sentence[i + 1])  # Create a bigram (pair of consecutive words)
        bigram_counts[bigram] += 1                # Increment bigram count

# Convert bigram counts into a DataFrame (table format)
bigram_table = pd.DataFrame(
    [(w1, w2, count) for (w1, w2), count in bigram_counts.items()],
    columns=["Word_1", "Word_2", "Count"]
)

bigram_table.head()                     # Display first few rows of bigram count table

# Calculate conditional probability P(Word_2 | Word_1)
bigram_table["Conditional_Probability"] = bigram_table.apply(
    lambda row: row["Count"] / unigram_counts[row["Word_1"]],  # Bigram count / Unigram count
    axis=1
)

bigram_table.head()                     # Display table with conditional probabilities


2️⃣ Bigram Model


       Word_1      Word_2  Count  Conditional_Probability
0         <s>     natural      1                 1.000000
1     natural    language     16                 1.000000
2    language  processing     16                 0.333333
3  processing         nlp     16                 0.666667
4         nlp    subfield      8                 0.071429


In [None]:
print("3️⃣ Trigram Model")                 # Display model name

trigram_counts = Counter()                # Initialize Counter to store trigram frequencies

# Loop through each preprocessed sentence
for sentence in tokens:
    # Loop through words to create trigrams (3 consecutive words)
    for i in range(len(sentence) - 2):
        trigram = (sentence[i], sentence[i + 1], sentence[i + 2])  # Create a trigram
        trigram_counts[trigram] += 1                                # Increment trigram count

# Convert trigram counts into a DataFrame (table format)
trigram_table = pd.DataFrame(
    [(w1, w2, w3, count) for (w1, w2, w3), count in trigram_counts.items()],
    columns=["Word_1", "Word_2", "Word_3", "Count"]
)

trigram_table.head()                      # Display first few rows of trigram count table

# Calculate conditional probability P(Word_3 | Word_1, Word_2)
trigram_table["Conditional_Probability"] = trigram_table.apply(
    lambda row: row["Count"] / bigram_counts[(row["Word_1"], row["Word_2"])],
    axis=1
)

trigram_table.head()                      # Display table with conditional probabilities


3️⃣ Trigram Model


       Word_1      Word_2       Word_3  Count  Conditional_Probability
0         <s>     natural     language      1                      1.0
1     natural    language   processing     16                      1.0
2    language  processing          nlp     16                      1.0
3  processing         nlp     subfield      8                      0.5
4         nlp    subfield  linguistics      8                      1.0

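The bigram and trigram loops above follow the same pattern, so they can be folded into one helper. The sketch below is one possible way to do that, not the notebook's own method; `count_ngrams` and the toy sentences are hypothetical. Note that the helper keys unigrams by 1-tuples, unlike `unigram_counts` above, which is keyed by plain strings.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams over a list of token lists without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        # Zipping n shifted views of the sentence yields consecutive n-tuples.
        counts.update(zip(*(sentence[i:] for i in range(n))))
    return counts

# Toy data (hypothetical), already padded with boundary tokens
toy = [['<s>', 'natural', 'language', 'processing', '</s>'],
       ['<s>', 'natural', 'language', 'models', '</s>']]

unigrams = count_ngrams(toy, 1)
bigrams = count_ngrams(toy, 2)

# Maximum-likelihood estimate: count(w1, w2) / count(w1)
p = bigrams[('language', 'processing')] / unigrams[('language',)]
print("P(processing | language) =", p)
```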

Apply Smoothing


In [None]:
# Vocabulary size
V = len(unigram_counts)

# Smoothed Bigram Probability
def bigram_probability_laplace(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)


# Smoothed Trigram Probability
def trigram_probability_laplace(w1, w2, w3):
    return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + V)


print("Smoothed Bigram Probability:")
print("P(language | natural) =",
      bigram_probability_laplace("natural", "language"))

print("\nSmoothed Trigram Probability:")
print("P(processing | language natural) =",
      trigram_probability_laplace("language", "natural", "processing"))


Smoothed Bigram Probability:
P(language | natural) = 0.0821256038647343

Smoothed Trigram Probability:
P(processing | language natural) = 0.005050505050505051

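For reference, the two smoothing functions above correspond to the add-one formulas below, where C(.) is a training count and V is the vocabulary size. The worked line uses the counts from the bigram table (C(natural) = C(natural language) = 16). The value V = 191 is not printed by the notebook; it is inferred here from the printed probability, so treat it as an assumption.

```latex
P_{\text{Laplace}}(w_2 \mid w_1) = \frac{C(w_1\, w_2) + 1}{C(w_1) + V}
\qquad
P_{\text{Laplace}}(w_3 \mid w_1, w_2) = \frac{C(w_1\, w_2\, w_3) + 1}{C(w_1\, w_2) + V}

% Worked example, assuming V = 191:
P_{\text{Laplace}}(\text{language} \mid \text{natural})
  = \frac{16 + 1}{16 + 191} = \frac{17}{207} \approx 0.0821
```

The unsmoothed estimate for the same bigram is 16/16 = 1.0, so add-one smoothing moves a large share of the probability mass to unseen continuations when V is large relative to the counts.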

Sentence Probability Calculation

In [None]:
# List of test sentences for probability calculation
sentences = [
    """The dataset used in this experiment is a general English text corpus consisting of more than 1500 words.
It contains paragraphs related to language, communication, natural language processing, language models, and artificial intelligence.
The text is written in clear, grammatically correct English and includes multiple sentences with common vocabulary and sentence structures.
This makes it suitable for building Unigram, Bigram, and Trigram language models, as it provides sufficient word combinations and contextual patterns.
The dataset does not contain special symbols or code-mixed language, which simplifies preprocessing and probability calculations"""
]

# Function to calculate sentence probability using the Unigram model
def unigram_sentence_probability(sentence):
    words = word_tokenize(sentence.lower())   # Tokenize sentence into lowercase words
    probability = 1                           # Initialize probability

    for word in words:
        # Multiply unigram probabilities of individual words
        probability *= unigram_counts[word] / total_words

    return probability                        # Return final unigram probability


# Function to calculate sentence probability using the Bigram model with Laplace smoothing
def bigram_sentence_probability(sentence):
    # Add start and end tokens to the sentence
    words = ['<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    probability = 1                           # Initialize probability

    for i in range(len(words) - 1):
        # Multiply smoothed bigram probabilities
        probability *= bigram_probability_laplace(words[i], words[i+1])

    return probability                        # Return final bigram probability


# Function to calculate sentence probability using the Trigram model with Laplace smoothing
def trigram_sentence_probability(sentence):
    # Add two start tokens and one end token
    words = ['<s>', '<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    probability = 1                           # Initialize probability

    for i in range(2, len(words)):
        # Multiply smoothed trigram probabilities
        probability *= trigram_probability_laplace(
            words[i-2], words[i-1], words[i]
        )

    return probability                        # Return final trigram probability


# Calculate and display probabilities for each sentence
for s in sentences:
    print("Sentence:", s)
    print("Unigram Probability:", unigram_sentence_probability(s))
    print("Bigram Probability:", bigram_sentence_probability(s))
    print("Trigram Probability:", trigram_sentence_probability(s))
    print()


Sentence: The dataset used in this experiment is a general English text corpus consisting of more than 1500 words.
It contains paragraphs related to language, communication, natural language processing, language models, and artificial intelligence.
The text is written in clear, grammatically correct English and includes multiple sentences with common vocabulary and sentence structures.
This makes it suitable for building Unigram, Bigram, and Trigram language models, as it provides sufficient word combinations and contextual patterns.
The dataset does not contain special symbols or code-mixed language, which simplifies preprocessing and probability calculations
Unigram Probability: 0.0
Bigram Probability: 2.0789025263605845e-231
Trigram Probability: 1.4477014552448537e-234

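Multiplying many small probabilities gives values near the limits of floating-point precision (about 1e-231 above) and will underflow to 0.0 for longer passages. A common remedy is to add log probabilities instead. The sketch below shows this for a Laplace-smoothed bigram model on toy counts; `sentence_log_prob` and all data in it are hypothetical, not taken from the notebook.

```python
import math
from collections import Counter

def sentence_log_prob(words, bigram_counts, unigram_counts, vocab_size):
    """Sum of log Laplace-smoothed bigram probabilities for a padded token list."""
    log_prob = 0.0
    for w1, w2 in zip(words, words[1:]):
        p = (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)
        log_prob += math.log(p)
    return log_prob

# Toy training data (hypothetical)
train = ['<s>', 'language', 'models', 'learn', 'language', '</s>']
uni = Counter(train)
bi = Counter(zip(train, train[1:]))
vocab_size = len(uni)

lp = sentence_log_prob(['<s>', 'language', 'models', '</s>'], bi, uni, vocab_size)
print("log P =", lp)
print("P     =", math.exp(lp))
```

The log value stays well within floating-point range however long the sentence is, and it is the quantity the perplexity functions need anyway.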


Perplexity Calculation

In [None]:
test_sentences = sent_tokenize(g_test_text) # Define test_sentences from the g_test_text variable

def unigram_perplexity(sentence):
    words = word_tokenize(sentence.lower())
    N = len(words)
    log_prob = 0

    for word in words:
        prob = unigram_counts[word] / total_words
        # An unseen word has count 0, so prob is 0 and log(0) is undefined.
        # As a stopgap, fall back to a small add-one-style value. This is a simplification:
        # a robust model would apply Laplace smoothing to every unigram, i.e.
        # (count + 1) / (total_words + V), not only to unseen words.
        if prob == 0:
            prob = 1 / (total_words + V)

        log_prob += math.log(prob)

    return math.exp(-log_prob / N)


def bigram_perplexity(sentence):
    words = ['<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    N = len(words) - 1 # N is the number of bigrams
    log_prob = 0

    for i in range(len(words) - 1):
        prob = bigram_probability_laplace(words[i], words[i+1])
        log_prob += math.log(prob)

    return math.exp(-log_prob / N)


def trigram_perplexity(sentence):
    words = ['<s>', '<s>'] + word_tokenize(sentence.lower()) + ['</s>']
    N = len(words) - 2 # N is the number of trigrams
    log_prob = 0

    for i in range(2, len(words)): # Start from the third word to form the first trigram
        prob = trigram_probability_laplace(words[i-2], words[i-1], words[i])
        log_prob += math.log(prob)

    return math.exp(-log_prob / N)

for s in test_sentences:
    print("Sentence:", s)
    print("Unigram Perplexity:", unigram_perplexity(s))
    print("Bigram Perplexity:", bigram_perplexity(s))
    print("Trigram Perplexity:", trigram_perplexity(s))
    print()


Sentence: Natural language processing (NLP) is a subfield of linguistics, computer science, artificial intelligence, and data science.
Unigram Perplexity: 463.0654675300477
Bigram Perplexity: 115.4668989965898
Trigram Perplexity: 165.92362908406122

Sentence: Its goal is to enable computers to understand, interpret, and generate human language.
Unigram Perplexity: 642.0081708600441
Bigram Perplexity: 129.28973982732114
Trigram Perplexity: 169.4544614891326

Sentence: NLP has evolved significantly over the past few decades, driven by advancements in machine learning and deep learning.
Unigram Perplexity: 455.678592217466
Bigram Perplexity: 141.39851021588385
Trigram Perplexity: 192.83166179823354

Sentence: Early NLP systems relied on rule-based approaches, which involved hand-crafting grammatical rules and dictionaries.
Unigram Perplexity: 531.760975794019
Bigram Perplexity: 119.89173854796984
Trigram Perplexity: 148.9232191224058

Sentence: While effective for limited domains, these s
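
The three perplexity functions above all compute the same quantity, written out below. Here h_i stands for the context used by the model (empty for unigrams, one previous word for bigrams, two for trigrams), and N is the number of scored N-grams, as in the code.

```latex
\mathrm{PP}(W) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid h_i)\right)
              = \left(\prod_{i=1}^{N} P(w_i \mid h_i)\right)^{-1/N}
```

Lower perplexity means the model assigned higher average probability to the test tokens.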

Comparison and Analysis

In the perplexity results shown above, the bigram model produced the lowest perplexity on every test sentence (roughly 115 to 141), the trigram model came second (roughly 149 to 193), and the unigram model was highest (roughly 456 to 642). The trigram model captures more context, but with limited training data most trigram contexts are rare or unseen. Add-one smoothing then spreads much of the probability mass across the whole vocabulary, which raises perplexity. This is the data sparsity problem, and it explains why the bigram model, which needs fewer observations per context, outperformed the trigram model here.

When unseen words or unseen N-gram combinations appeared in test sentences, the probability of those sentences dropped sharply. Without smoothing, a single unseen item gives the whole sentence zero probability and makes perplexity infinite, as the unsmoothed unigram sentence probability of 0.0 shows. Add-one (Laplace) smoothing avoids this by assigning a small non-zero probability to unseen N-grams, which keeps perplexity finite and lets the models be compared. Overall, while higher-order models capture more context, the bigram model gave the best balance between performance and reliability for the given dataset.
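
One source of unseen tokens in this notebook is a preprocessing mismatch: the training tokens were lowercased and stripped of punctuation and stopwords, while the scoring functions tokenize raw test sentences. The sketch below measures that effect as an out-of-vocabulary (OOV) rate. `oov_rate`, the vocabulary, and the token lists are hypothetical examples, not data from the notebook.

```python
def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that never occurred in the training vocabulary."""
    if not test_tokens:
        return 0.0
    unseen = [t for t in test_tokens if t not in vocabulary]
    return len(unseen) / len(test_tokens)

# Hypothetical training vocabulary (stopwords and punctuation already removed)
train_vocab = {'natural', 'language', 'processing', 'nlp', 'models'}

raw_test = ['the', 'nlp', 'models', 'are', 'language', 'models', '.']
cleaned_test = ['nlp', 'models', 'language', 'models']  # same cleaning as training

print("OOV rate, raw test tokens:    ", oov_rate(raw_test, train_vocab))
print("OOV rate, cleaned test tokens:", oov_rate(cleaned_test, train_vocab))
```

Applying the training-time preprocessing to test sentences before scoring them would remove this artificial source of unseen tokens.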

Lab Report

Objective

To implement Unigram, Bigram, and Trigram language models and evaluate them using sentence probability and perplexity.

Dataset Description

A general English text corpus of over 1500 words related to language and NLP was used. The dataset was split into 80% training and 20% testing data.

Preprocessing Explanation

Text was converted to lowercase, stripped of punctuation and numbers, tokenized into words, and filtered to remove English stopwords. Sentence boundary tokens (<s> and </s>) were then added.

N-Gram Model Construction

Unigram, Bigram, and Trigram models were built using word frequency counts and conditional probability calculations.

Sentence Probability Results

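For the test passage, the unsmoothed Unigram model returned a probability of 0.0, because tokens such as stopwords and punctuation were removed from the training data and therefore have zero counts. The Laplace-smoothed Bigram and Trigram models returned very small but non-zero probabilities (about 2.08e-231 and 1.45e-234 respectively).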

Perplexity Comparison

Bigram achieved the lowest perplexity on the test sentences shown, followed by Trigram and then Unigram.

Observations and Conclusion

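Higher-order models capture more context but need more training data to estimate it reliably. Laplace smoothing kept unseen N-grams from reducing sentence probabilities to zero, so perplexity could be computed for every test sentence.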