# N-gram Models

N-gram models are a type of statistical model used in natural language processing (NLP) to analyze and generate text. An n-gram is a contiguous sequence of n items, typically words or characters, where n is a positive integer.

In NLP, an n-gram model estimates the probability of the next word or character given the n - 1 items that precede it. The model is trained on a large corpus of text, and its predictions come from the relative frequencies of the n-grams observed in that corpus.
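The frequency-based prediction described above can be sketched for the bigram case as a ratio of counts: how often a word pair occurred, divided by how often its first word occurred with any successor. This is a minimal illustration on a toy sentence; the function name `bigram_probability` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that have a successor can act as a context.
    context_counts = Counter(tokens[:-1])
    if context_counts[prev_word] == 0:
        return 0.0  # context never seen, so no estimate is available
    return pair_counts[(prev_word, word)] / context_counts[prev_word]

tokens = "the dog saw the cat and the dog ran".split()
# 2 of the 3 occurrences of "the" are followed by "dog".
print(bigram_probability(tokens, "the", "dog"))
```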

There are several types of n-gram models, including:

1. Unigram models: These models analyze the frequency of individual words or characters in the text.
2. Bigram models: These models analyze the frequency of pairs of words or characters in the text.
3. Trigram models: These models analyze the frequency of triples of words or characters in the text.
4. Higher-order n-gram models: These models analyze the frequency of sequences of n words or characters for n greater than 3 (4-grams, 5-grams, and so on).
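The model types in this list differ only in the window size used when slicing the text. A small sketch, assuming whitespace tokenization and a hypothetical helper named `extract_ngrams`:

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
print(bigrams)

# The same helper works on characters when given a list of characters.
char_bigrams = extract_ngrams(list("fox"), 2)
```

A window longer than the input yields an empty list, which is why very short texts contribute nothing to a high-order model.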

N-gram models are widely used in NLP applications such as:

1. Language modeling: N-gram models are used to predict the next word in a sentence or the next character in a sequence.
2. Text classification: N-gram counts are used as features to classify text into categories, such as spam vs. non-spam emails.
3. Sentiment analysis: N-gram features are used to determine whether a piece of text is positive, negative, or neutral.
4. Machine translation: N-gram language models are used as a component of statistical machine translation systems, where they score how fluent a candidate translation is.
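For the language-modeling use in item 1, scoring a whole sentence comes down to multiplying the conditional probabilities of its words, usually done as a sum of logarithms. The sketch below does this for bigrams on a toy corpus. The function names are hypothetical, and the first word's own probability is ignored because no start-of-sentence symbol is used.

```python
import math
from collections import Counter

def train_bigram_counts(tokens):
    """Count word pairs and the contexts (first words) they start from."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return pair_counts, context_counts

def sentence_log_prob(sentence_tokens, pair_counts, context_counts):
    """Sum log P(w_i | w_{i-1}) over the sentence; -inf if any bigram is unseen."""
    total = 0.0
    for prev, word in zip(sentence_tokens, sentence_tokens[1:]):
        count = pair_counts[(prev, word)]
        if count == 0:
            return float("-inf")
        total += math.log(count / context_counts[prev])
    return total

corpus_tokens = "the dog saw the cat and the dog ran".split()
pair_counts, context_counts = train_bigram_counts(corpus_tokens)

score_seen = sentence_log_prob("the dog ran".split(), pair_counts, context_counts)
score_unseen = sentence_log_prob("the cat ran".split(), pair_counts, context_counts)
print(score_seen, score_unseen)
```

The second sentence contains the pair ("cat", "ran"), which never occurs in the toy corpus, so an unsmoothed model assigns it zero probability.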

The advantages of n-gram models include:

1. Simple to implement: Training an n-gram model amounts to counting, so small models are cheap to build, although the number of distinct n-grams to store grows quickly with n.
2. Effective for short-range dependencies: N-gram models are effective for modeling short-range dependencies in text, such as which word tends to follow a given word or pair of words.
3. Can be used for a variety of tasks: N-gram models can be used for a variety of NLP tasks, including language modeling, text classification, sentiment analysis, and machine translation.

However, n-gram models also have some limitations, including:

1. Limited ability to capture long-range dependencies: An n-gram model conditions only on the previous n - 1 items, so it cannot capture relationships between words that are separated by many other words.
2. Dependence on exact word order: N-gram models match exact sequences, so a paraphrase or a reordering of the same words produces different n-grams, and the model sees them as unrelated.
3. Prone to data sparsity and overfitting: When trained on a small dataset, most possible n-grams never appear, so the model assigns them zero probability and generalizes poorly to new data. The problem grows with n.
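The small-dataset problem in item 3 shows up concretely as zero probabilities for word pairs that never occurred in training. One common remedy is add-one (Laplace) smoothing, sketched below for bigrams. This technique is not used elsewhere in this notebook, and the function name `laplace_bigram_probability` is an assumption made for the example.

```python
from collections import Counter

def laplace_bigram_probability(tokens, prev_word, word):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + vocabulary size)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    vocab_size = len(set(tokens))
    return (pair_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + vocab_size)

tokens = "the dog saw the cat and the dog ran".split()
vocab = sorted(set(tokens))

# ("the", "ran") never occurs in the toy corpus, yet it gets a small nonzero probability.
p_unseen = laplace_bigram_probability(tokens, "the", "ran")
# The smoothed probabilities for one context still sum to 1 over the vocabulary.
p_total = sum(laplace_bigram_probability(tokens, "the", w) for w in vocab)
print(p_unseen, p_total)
```

Add-one smoothing is the simplest option and tends to move too much probability to unseen events on real corpora, so it is best read as an illustration of the idea.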

Overall, n-gram models are a simple and effective way to analyze and generate text, but they have limitations and are not suitable for all NLP tasks.

In [4]:

import nltk
from nltk import word_tokenize, ngrams
from collections import defaultdict, Counter
import random

nltk.download('punkt_tab')
nltk.download('punkt')

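class NGramLanguageModel:
    def __init__(self, n):
        # n is the number of context words; the model counts (n + 1)-grams,
        # so n=2 gives a trigram model.
        self.n = n
        # Maps a context tuple to a Counter of the words that followed it.
        self.model = defaultdict(Counter)

    def train(self, text):
        # Tokens are lowercased, so later lookups must use lowercase contexts too.
        tokens = word_tokenize(text.lower())
        for ngram in ngrams(tokens, self.n + 1):
            # Split each (n + 1)-gram into its n-word context and the final word.
            self.model[tuple(ngram[:-1])][ngram[-1]] += 1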

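    def generate_next_word(self, context):
        """Sample the next word in proportion to how often it followed `context`.

        Returns None when this exact (case-sensitive) context was never seen.
        """
        if tuple(context) in self.model:
            candidates = self.model[tuple(context)].most_common()
            total_count = sum(count for word, count in candidates)
            # Pick a point in [0, total_count] and walk the counts until it is
            # used up, so each word is chosen with probability count / total_count.
            r = random.uniform(0, total_count)
            for word, count in candidates:
                r -= count
                if r <= 0:
                    return word
        return None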

    def generate_text(self, seed_words, num_words):
        context = seed_words[-self.n:]
        generated_text = list(seed_words)

        for _ in range(num_words):
            next_word = self.generate_next_word(context)
            if next_word is None:
                break
            generated_text.append(next_word)
            context = generated_text[-self.n:]

        return ' '.join(generated_text)
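
The weighted draw inside `generate_next_word` can also be written with the standard library's `random.choices`, which accepts per-item weights directly. The sketch below is a standalone alternative, not a change to the class above. The function name `sample_next_word` and its optional `rng` parameter are assumptions made for the example.

```python
import random
from collections import Counter

def sample_next_word(word_counts, rng=random):
    """Draw one word with probability proportional to its count, or None if empty."""
    if not word_counts:
        return None
    words = list(word_counts.keys())
    weights = list(word_counts.values())
    return rng.choices(words, weights=weights, k=1)[0]

# A context that was followed by "dog" twice and "fox" once.
followers = Counter({"dog": 2, "fox": 1})
picked = sample_next_word(followers, rng=random.Random(0))
print(picked)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing generated text across experiments.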





## Train trigram model


In [5]:
# Example usage
corpus = """
The quick brown fox jumps over the lazy dog. The dog was not amused.
The fox was quite proud of its agility. Quick reflexes are essential for survival in the wild.
The lazy dog eventually got up and chased the fox, but it was too late.
"""

# n is the context length: 2 context words + 1 predicted word = trigrams
model = NGramLanguageModel(n=2)
model.train(corpus)



## Generate text

In [6]:
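# Note: train() lowercases every token, so the capitalized "The" in this seed
# matches no stored context and generation stops at once, as the output shows.
# A lowercase seed such as ["the", "quick"] does match.
seed = ["The", "quick"]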
generated_text = model.generate_text(seed, num_words=20)
print("Generated text:")
print(generated_text)

# Next word prediction
context = ["the", "lazy"]
next_word = model.generate_next_word(context)
print(f"\nPredicted next word after '{' '.join(context)}': {next_word}")

# Print some trigrams and their frequency: each key is a two-word context,
# and each value counts the words that followed it.
print("\nSome trigrams and their frequency:")
for ctx, word_freq in list(model.model.items())[:5]:
    print(f"{ctx}: {dict(word_freq)}")


Generated text:
The quick

Predicted next word after 'the lazy': dog

Some trigrams and their frequency:
('the', 'quick'): {'brown': 1}
('quick', 'brown'): {'fox': 1}
('brown', 'fox'): {'jumps': 1}
('fox', 'jumps'): {'over': 1}
('jumps', 'over'): {'the': 1}
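
The generated text above stops at the seed because the stored contexts are lowercase while the seed begins with "The". A minimal sketch of one way to normalize a seed before lookup follows. The helper name `normalize_context` and the small `table` dictionary are hypothetical stand-ins for the model's internal mapping.

```python
def normalize_context(seed_words, n):
    """Lowercase the seed and keep its last n words as a tuple key."""
    return tuple(word.lower() for word in seed_words[-n:])

# A tiny stand-in for the context -> follower-count mapping built during training.
table = {("the", "quick"): {"brown": 1}}

raw_key = tuple(["The", "quick"])
clean_key = normalize_context(["The", "quick"], 2)
print(raw_key in table, clean_key in table)
```

Lowercasing at lookup time mirrors the lowercasing done at training time. A case-preserving model would need the opposite change, which is to stop lowercasing in `train`.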
