<img src='https://www.di.uniroma1.it/sites/all/themes/sapienza_bootstrap/logo.png' width="200"/> 


# **Part_1_7_Language Models**
 
In the evolving field of Natural Language Processing (NLP), **Language Models** are foundational in generating, understanding, and analyzing text. From classic statistical models like `n-grams` to advanced Large Language Models (`LLMs`) that power modern conversational AI, language models shape applications across search engines, chatbots, summarization, translation, and more. This tutorial focuses on **n-grams**, providing a foundational understanding of their structure and applications in language modeling, with hands-on exercises to illustrate their role in text generation and probability-based modeling.
 
### **Objectives:**

By the end of this notebook, Parham will:
1. Gain an understanding of **n-gram language models**, their structure, and their role in basic text generation and probability-based language modeling.
2. Explore the theoretical and practical aspects of **n-grams**. 
3. Dive into the implementation of **n-grams** and smoothing techniques.


### **References**: 
- [https://web.stanford.edu/~jurafsky/slp3/3.pdf](https://web.stanford.edu/~jurafsky/slp3/3.pdf)
- [https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/](https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/)
- [https://tallinzen.net/media/readings/slp3_chapter3_ngrams.pdf](https://tallinzen.net/media/readings/slp3_chapter3_ngrams.pdf)

### **Tutors**:
- Professor Stefano Faralli
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: Stefano.faralli@uniroma1.it
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/stefano-faralli-b1183920/) 
- Professor Iacopo Masi
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: masi@di.uniroma1.it  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/iacopomasi/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/iacopomasi)  
    
### **Contributors**:
- Parham Membari
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: p.membari96@gmail.com
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/p-mem/)
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**:  [GitHub](https://github.com/parham075)
    - <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Medium_logo_Monogram.svg" alt="Logo" width="20" height="20"> **Medium**: [Medium](https://medium.com/@p.membari96)

### **Table of Contents:**
1. Import Libraries
2. Introduction to Language Models
3. N-gram Models
4. Closing Thoughts

## 1. Import Libraries

In [51]:
import nltk
import numpy
import spacy
from loguru import logger
from nltk.util import ngrams
from collections import Counter
from nltk.corpus import words
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# NLTK helpers for n-grams and the Reuters corpus (nltk itself is already imported above)
from nltk import bigrams, trigrams
from nltk.corpus import reuters
from collections import defaultdict
from openai import OpenAI
import openai
import os

# Download necessary NLTK resources
nltk.download('reuters')
nltk.download('punkt_tab')

[nltk_data] Downloading package reuters to /home/p/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/p/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 2. Introduction to Language Models

Language modeling is the task of assigning a probability to any sequence of words. It is used in applications such as speech recognition and spam filtering, and it underlies many state-of-the-art Natural Language Processing models.

### Methods of Language Modeling
There are two main approaches to language modeling:

- **Statistical Language Modeling**: the development of probabilistic models that predict the next word in a sequence given the words that precede it. N-gram language modeling is one example.

- **Neural Language Modeling**: neural network methods achieve better results than classical methods, both as standalone language models and when incorporated into larger systems for challenging tasks like speech recognition and machine translation. Neural language models typically build on learned word embeddings.

## 3. N-gram Models

### Overview and Theory

An **N-gram** is a contiguous sequence of `n` items from a given text or speech sample, where the items can be letters, words, or even base pairs depending on the application. Typically, N-grams are extracted from a large corpus, providing insights into text patterns and dependencies.

For instance, N-grams can be:
- **Unigrams**: Individual words like “This”, “article”, “is”, “on”, and “NLP”
- **Bigrams**: Word pairs like “This article”, “article is”, “is on”, and “on NLP”
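
The unigrams and bigrams listed above can be reproduced with a sliding window over the tokens. This is a minimal sketch in plain Python; the helper name `extract_ngrams` and the whitespace tokenization are assumptions made for illustration, not part of the notebook's NLTK pipeline.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization is enough for this toy sentence
example_tokens = "This article is on NLP".split()

example_unigrams = extract_ngrams(example_tokens, 1)
example_bigrams = extract_ngrams(example_tokens, 2)

print("Unigrams:", example_unigrams)
print("Bigrams:", example_bigrams)
```

A sentence of `m` tokens yields `m - n + 1` n-grams, so the five-word sentence gives five unigrams and four bigrams.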

An **N-gram language model** estimates the likelihood of a word given a specific context or history. For example, a bigram model estimates the probability of each word given the previous word. The model's goal is to predict the next word, capturing dependencies and patterns in language sequences.

**Calculating N-gram Probabilities**

For example, in the sentence **“This article is on...”**, if we want to predict the probability that the next word is “NLP”, this can be represented as:
$$p(\text{“NLP”} | \text{“This”}, \text{“article”}, \text{“is”}, \text{“on”})$$
This probability is part of a conditional probability chain that models the probability of each word in a sentence based on its predecessors.

To generalize, the conditional probability of the n-th word given the preceding `n-1` words is written as:
$$P(w_n | w_1, w_2, ..., w_{n-1})$$
Using the **chain rule of probability**, the joint probability of a word sequence `w_1, w_2, ..., w_n` can be expanded as:
$$P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i | w_1, w_2, ..., w_{i-1})$$
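
As a small worked instance, reusing the sentence from the example above, the chain rule expands the five-word sequence “This article is on NLP” into one factor per word:

$$
\begin{aligned}
P(\text{This}, \text{article}, \text{is}, \text{on}, \text{NLP}) = {} & P(\text{This}) \\
& \times P(\text{article} \mid \text{This}) \\
& \times P(\text{is} \mid \text{This}, \text{article}) \\
& \times P(\text{on} \mid \text{This}, \text{article}, \text{is}) \\
& \times P(\text{NLP} \mid \text{This}, \text{article}, \text{is}, \text{on})
\end{aligned}
$$

The last factor is exactly the conditional probability written earlier for predicting “NLP”.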

**Markov Assumptions and Simplified Models:**

In practice, language models simplify this calculation by applying the **Markov assumption**: the probability of a word depends only on a limited history of previous words rather than on the entire sequence. Specifically, in an n-gram model, each word is assumed to depend only on the previous `k = n - 1` words.

- For a **unigram model** (where `k = 0`), each word is considered independently:
  $$P(w_1, w_2, ..., w_n) \approx \prod_{i=1}^{n} P(w_i)$$

- For a **bigram model** (where `k = 1`), each word depends only on the immediately preceding word:
  $$P(w_i | w_1, w_2, ..., w_{i-1}) \approx P(w_i | w_{i-1})$$

**Examples for unigram and bigram:**

Consider a simple sentence: **"The cat sat."**

- **Unigram Model Example:**

    Given the unigram formula:
    $$P(\text{"The cat sat"}) \approx P(\text{"The"}) \times P(\text{"cat"}) \times P(\text{"sat"})$$

    Suppose we have the following unigram probabilities (estimated from a large text corpus):
    $$ P(\text{"The"}) = 0.3 $$
    $$ P(\text{"cat"}) = 0.1 $$
    $$ P(\text{"sat"}) = 0.05 $$

    Then, the probability of the sentence "The cat sat" under the unigram model is:
    $$P(\text{"The cat sat"}) \approx 0.3 \times 0.1 \times 0.05 = 0.0015$$

- **Bigram Model Example:**

    Using the bigram formula:
    $$P(\text{"The cat sat"}) \approx P(\text{"The"}) \times P(\text{"cat"} | \text{"The"}) \times P(\text{"sat"} | \text{"cat"})$$

    Assume the following bigram probabilities based on prior occurrences in a corpus:


    $$ P(\text{"The"}) = 0.3 \quad \text{(probability of "The" starting a sentence)} $$
    $$ P(\text{"cat"} | \text{"The"}) = 0.4 \quad \text{(probability of "cat" following "The")} $$
    $$ P(\text{"sat"} | \text{"cat"}) = 0.3 \quad \text{(probability of "sat" following "cat")} $$

    Then, the probability of the sentence "The cat sat" under the bigram model is:
    $$P(\text{"The cat sat"}) \approx 0.3 \times 0.4 \times 0.3 = 0.036$$



By applying these assumptions, we make the model computationally feasible while still capturing relevant language patterns. This approach allows us to approximate word dependencies and make educated predictions, forming the basis of applications like autocomplete and text generation.
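
The two worked examples above can be checked with a few lines of Python. This is a minimal sketch: the dictionaries only hold the probabilities quoted in the examples, and the helper names `unigram_sentence_prob` and `bigram_sentence_prob` are hypothetical.

```python
import math

# Probabilities quoted in the unigram and bigram examples
unigram_probs = {"The": 0.3, "cat": 0.1, "sat": 0.05}
start_probs = {"The": 0.3}  # probability of "The" starting a sentence
bigram_probs = {("The", "cat"): 0.4, ("cat", "sat"): 0.3}

def unigram_sentence_prob(tokens):
    """Multiply independent word probabilities (unigram model)."""
    return math.prod(unigram_probs[t] for t in tokens)

def bigram_sentence_prob(tokens):
    """Start probability times each word's probability given its predecessor."""
    prob = start_probs[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram_probs[(prev, cur)]
    return prob

sentence = ["The", "cat", "sat"]
p_unigram = unigram_sentence_prob(sentence)
p_bigram = bigram_sentence_prob(sentence)
print(round(p_unigram, 6), round(p_bigram, 6))
```

With these numbers the bigram model assigns the sentence a higher probability than the unigram model, because "cat" is far more likely after "The" than it is in isolation.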

### Implementation of N-Gram language modeling in NLTK

In [None]:


# Tokenize the text (named `tokens` so it does not shadow the `words` corpus imported above)
tokens = nltk.word_tokenize(' '.join(reuters.words()))

# Create trigrams
tri_grams = list(trigrams(tokens))

# Build a trigram model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for w1, w2, w3 in tri_grams:
    model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

# Function to predict the next word
def predict_next_word(w1, w2):
    """
    Predicts the next word based on the previous two words using the trained trigram model.
    Args:
    w1 (str): The first word.
    w2 (str): The second word.

    Returns:
    str: The predicted next word.
    """
    candidates = model.get((w1, w2))  # .get avoids inserting an empty entry for unseen contexts
    if candidates:
        predicted_word = max(candidates, key=candidates.get)  # Choose the most likely next word
        return predicted_word
    else:
        return "No prediction available"

# Example usage
print("Next Word:", predict_next_word('the', 'cat'))


Next Word: of
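
Repeating the next-word prediction step turns the model into a simple text generator. The following is a minimal sketch on a toy corpus rather than Reuters, so that it runs without any downloads; the corpus, the names `toy_model` and `generate_greedy`, and the greedy (always pick the most frequent follower) strategy are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a real training set
toy_tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Trigram counts: (w1, w2) -> Counter of words that follow the pair
toy_model = defaultdict(Counter)
for w1, w2, w3 in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_model[(w1, w2)][w3] += 1

def generate_greedy(w1, w2, max_words=5):
    """Extend (w1, w2) by repeatedly appending the most frequent next word."""
    output = [w1, w2]
    for _ in range(max_words):
        followers = toy_model.get((output[-2], output[-1]))
        if not followers:
            break  # unseen context: nothing to predict
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)

generated = generate_greedy("the", "cat", max_words=3)
print(generated)
```

Greedy decoding is deterministic and tends to loop on small corpora; sampling the next word in proportion to its probability is a common alternative.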


### Metrics for Language Modeling

Evaluating language models is essential for understanding how well they perform on tasks like text prediction, machine translation, and conversational AI. In language modeling, three fundamental metrics are widely used to assess the effectiveness and reliability of a model: **entropy**, **cross-entropy**, and **perplexity**. These metrics offer insight into a model’s accuracy, confidence, and the extent to which it captures linguistic patterns.

#### 1. Entropy

**Entropy**, introduced by Claude Shannon, is a measure of the unpredictability or “information content” within a given set of text. In the context of language models, entropy quantifies how much information is required to represent a sequence of words, based on the probability distribution of possible word sequences.

The entropy `H(P)` of a probability distribution `P` over possible words `X` is calculated as:

$$H(P) = -\sum_{X} P(X) \cdot \log_2(P(X))$$

where:
- `P(X)` is the probability of each word `X` in the vocabulary,
- `-log_2(P(X))` is the information content of `X`, measured in bits when the logarithm is taken in base 2, and
- `H(P)` is always greater than or equal to zero, with higher values indicating higher uncertainty and complexity.

In language modeling, lower entropy implies that the model has more certainty and requires fewer bits to represent the text. Higher entropy indicates greater unpredictability, suggesting that the text may contain more complex or diverse content.
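
To make the contrast between low and high entropy concrete, here is a minimal sketch that computes entropy in bits for two hypothetical four-word distributions. The distributions and the helper name `entropy_bits` are illustrative assumptions.

```python
import math

def entropy_bits(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-word distributions over a four-word vocabulary
uniform_dist = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
peaked_dist = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}

h_uniform = entropy_bits(uniform_dist)
h_peaked = entropy_bits(peaked_dist)
print("uniform:", h_uniform, "peaked:", round(h_peaked, 3))
```

The uniform distribution reaches the maximum for four outcomes (2 bits), while the peaked distribution, where one word is almost certain, has much lower entropy.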

#### 2. Cross-Entropy

**Cross-entropy** measures how well a language model’s predicted probability distribution aligns with the actual distribution of words in the test data. In other words, it evaluates how effectively the model represents unseen text based on its training.

Cross-entropy `H(P, Q)` between the actual probability distribution `P` and the model's predicted distribution `Q` is defined as:

$$H(P, Q) = -\sum_{X} P(X) \cdot \log_2(Q(X))$$

This metric captures the “surprise” of the model when encountering new text. If the model’s predictions closely align with the real data, cross-entropy is low, indicating that the model successfully captures the language structure. Conversely, high cross-entropy suggests that the model has not accurately represented the test data. Cross-entropy is always greater than or equal to entropy, reflecting that the model's uncertainty cannot be less than the inherent uncertainty in the actual data.
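
The inequality between cross-entropy and entropy can be checked numerically. This is a small sketch with two hypothetical distributions over three words; `cross_entropy_bits` is an illustrative helper, and entropy is obtained as the cross-entropy of `P` with itself.

```python
import math

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits; p and q map outcomes to probabilities."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

# Hypothetical "true" distribution P and model distribution Q
true_dist = {"cat": 0.5, "dog": 0.25, "fish": 0.25}
model_dist = {"cat": 0.25, "dog": 0.25, "fish": 0.5}

h_true = cross_entropy_bits(true_dist, true_dist)    # entropy H(P)
h_cross = cross_entropy_bits(true_dist, model_dist)  # cross-entropy H(P, Q)
print("H(P) =", h_true, " H(P, Q) =", h_cross)
```

The gap between the two values is the extra cost, in bits per word, of using the mismatched model `Q` in place of `P`.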

#### 3. Perplexity

**Perplexity** is a metric commonly used to assess the performance of language models by measuring how “confused” or “uncertain” a model feels when predicting the next word in a sequence. Perplexity is the exponentiated form of cross-entropy, making it a more interpretable value that reflects how well the model can predict words in a sequence. Lower perplexity indicates better model performance.

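For a test set `W = w_1, ..., w_N` of `N` words, the cross-entropy is estimated as the average negative log-probability the model assigns to each word, and perplexity `PP(W)` is calculated as:

$$H(P, Q) \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{1}, \dots, w_{i-1})$$

$$PP(W) = 2^{H(P, Q)} = \prod_{i=1}^{N} P(w_i \mid w_{1}, \dots, w_{i-1})^{-\frac{1}{N}}$$

where:
- `P(w_i | w_{1}, ..., w_{i-1})` is the probability the model assigns to word `w_i` given its preceding words,
- `N` is the total number of words in the test set.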

For example, consider predicting the sentence “Natural Language Processing.” If the model assigns probabilities to each word in the sequence based on preceding context, these probabilities are used to compute perplexity. A lower perplexity value indicates that the model is less “puzzled” about the sequence, signaling higher prediction accuracy.
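
A minimal sketch of this computation follows. The three per-word probabilities are made-up values for “Natural”, “Language”, and “Processing” (an assumption, not the output of any trained model), and `perplexity` is an illustrative helper.

```python
import math

def perplexity(word_probs):
    """Perplexity from the model's probability for each word in the test sequence."""
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n  # cross-entropy estimate
    return 2 ** avg_neg_log2

# Hypothetical P(w_i | history) for "Natural", "Language", "Processing"
sentence_probs = [0.5, 0.25, 0.125]
pp = perplexity(sentence_probs)

# Equivalent product form: (p_1 * p_2 * ... * p_N) ** (-1/N)
pp_product = math.prod(sentence_probs) ** (-1 / len(sentence_probs))
print(pp, pp_product)
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among 4 words at each step.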

### Probability and Smoothing Techniques

For an n-gram model, the probability of a word `w_n` given its preceding words is the conditional probability

$$P(w_n \mid w_1, w_2, \dots, w_{n-1})$$

It is estimated by dividing the count of the full n-gram by the count of its (n-1)-gram prefix:

$$
P(w_n \mid w_1, w_2, \dots, w_{n-1}) = \frac{\text{Count}(w_1, w_2, \dots, w_n)}{\text{Count}(w_1, w_2, \dots, w_{n-1})}
$$

For example, in a bigram model (where `n = 2`), the probability of a word `w_2` following a word `w_1` is calculated as:

$$
P(w_2 \mid w_1) = \frac{\text{Count}(w_1, w_2)}{\text{Count}(w_1)}
$$

>
>**Example:**
>
> Suppose we have the following sentence in our corpus:
>
>`“The cat sat on the mat.”`
>
>In a bigram model, we calculate the probability of each word following its immediate predecessor. Let's calculate the probability of the word “sat” following “cat.”
>
> - Count the bigram (“cat”, “sat”): in this corpus, the sequence “cat sat” appears once, so Count(“cat sat”) = 1.
> - Count the unigram (“cat”): the word “cat” also appears once, so Count(“cat”) = 1.
>
> Using the formula for bigram probability:
> $$P(\text{sat} \mid \text{cat}) = \frac{\text{Count}(\text{cat sat})}{\text{Count}(\text{cat})} = \frac{1}{1} = 1.0$$

This approach helps estimate the likelihood of a word appearing after a given context based on observed patterns in the data. However, if an n-gram is absent from the training data, this method assigns it a probability of zero, which can cause issues in tasks such as text generation and prediction.
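
The count-based estimate and its zero-probability problem can both be seen on the one-sentence corpus from the example. This is a minimal sketch that assumes the sentence is lowercased and stripped of punctuation before counting; the names `mle_unigrams`, `mle_bigrams`, and `mle_bigram_prob` are hypothetical.

```python
from collections import Counter

# One-sentence corpus, lowercased with punctuation removed (an assumption)
corpus_tokens = "the cat sat on the mat".split()

mle_unigrams = Counter(corpus_tokens)
mle_bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))

def mle_bigram_prob(w1, w2):
    """Unsmoothed estimate Count(w1, w2) / Count(w1); 0.0 if w1 was never seen."""
    if mle_unigrams[w1] == 0:
        return 0.0
    return mle_bigrams[(w1, w2)] / mle_unigrams[w1]

p_sat_given_cat = mle_bigram_prob("cat", "sat")  # seen bigram
p_cat_given_the = mle_bigram_prob("the", "cat")  # "the" occurs twice after lowercasing
p_dog_given_the = mle_bigram_prob("the", "dog")  # unseen bigram
print(p_sat_given_cat, p_cat_given_the, p_dog_given_the)
```

The unseen bigram ("the", "dog") receives probability exactly zero, which would zero out the probability of any sentence containing it.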

### Smoothing Techniques

To handle zero probabilities and improve the generalization of the model, smoothing techniques adjust probability estimates for n-grams that are not present in the training data. Several common smoothing methods are used in language modeling:

- **Laplace (Add-One) Smoothing**: Adds 1 to each count, ensuring that no probability is zero.
- **Additive (Add-k) Smoothing**: Generalizes Laplace by adding a small constant `k`, allowing for flexible adjustments.
- **Good-Turing Smoothing**: Adjusts counts of observed events based on the counts of rare and unseen events, useful for infrequent n-grams.
- **Kneser-Ney Smoothing**: Considers the diversity of contexts in which a word appears, providing a more sophisticated smoothing approach.

Each of these techniques provides a way to handle unseen n-grams, improving the robustness and accuracy of language models.


#### 1. **Laplace (Add-One) Smoothing**

Laplace Smoothing adds 1 to each count to ensure that no probability is zero, even for unseen bigrams. Here’s the formula and code:

$$
P_{\text{Laplace}}(w_2 \mid w_1) = \frac{\text{Count}(w_1, w_2) + 1}{\text{Count}(w_1) + V}
$$

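where `V` is the vocabulary size. To set up the example, assume the following bigram and unigram counts over the vocabulary of the phrase below. The counts are illustrative rather than tallied from this single phrase (for instance, the bigram "the cat" is given a count of 3):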

`"the cat sat on the mat"`

Then:

In [15]:
# Sample bigram counts
bigram_counts = {
    ("the", "cat"): 3,
    ("cat", "sat"): 2,
    ("sat", "on"): 1,
    ("on", "the"): 1,
    ("the", "mat"): 1
}

# Unigram counts (counts of individual words)
unigram_counts = {
    "the": 5,
    "cat": 2,
    "sat": 1,
    "on": 1,
    "mat": 1
}

# Total vocabulary size
vocab_size = len(unigram_counts)

In [16]:
def laplace_smoothing(w1, w2, bigram_counts, unigram_counts, vocab_size):
    # Apply Laplace smoothing
    bigram_count = bigram_counts.get((w1, w2), 0)
    unigram_count = unigram_counts.get(w1, 0)
    smoothed_prob = (bigram_count + 1) / (unigram_count + vocab_size)
    return smoothed_prob

# Example
print("Laplace Smoothing (P(cat | the)):", laplace_smoothing("the", "cat", bigram_counts, unigram_counts, vocab_size))

Laplace Smoothing (P(cat | the)): 0.4
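
Two properties of add-one smoothing are worth checking on the sample counts: an unseen bigram gets a non-zero probability, and the estimates for a fixed context still sum to 1. The sketch below repeats the sample counts so that it runs on its own; `laplace_prob` is a hypothetical compact version of the function above.

```python
# Sample counts repeated from the cells above so this block is self-contained
bigram_counts = {("the", "cat"): 3, ("cat", "sat"): 2, ("sat", "on"): 1,
                 ("on", "the"): 1, ("the", "mat"): 1}
unigram_counts = {"the": 5, "cat": 2, "sat": 1, "on": 1, "mat": 1}
vocab_size = len(unigram_counts)

def laplace_prob(w1, w2):
    """Add-one estimate of P(w2 | w1) from the sample counts."""
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)

# ("the", "sat") never occurs in the sample counts
p_unseen = laplace_prob("the", "sat")

# Sum of P(w | "cat") over the whole vocabulary
total_after_cat = sum(laplace_prob("cat", w) for w in unigram_counts)
print(p_unseen, total_after_cat)
```

The sum equals 1 only when `Count(w1)` matches the total of the bigram counts that start with `w1`. That holds for "cat" here, but the sample count for "the" (5) exceeds its bigram total (4), so the same sum for the context "the" comes to 0.9.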


#### 2. **Additive Smoothing (Add-k)**

Additive Smoothing generalizes Laplace by adding a constant `k` (typically `0 < k < 1`) instead of 1, allowing more flexible adjustments. Setting `k = 1` recovers Laplace smoothing; the example below uses `k = 0.1`. The formula is:

$$
P_{\text{Add-k}}(w_2 \mid w_1) = \frac{\text{Count}(w_1, w_2) + k}{\text{Count}(w_1) + k \cdot V}
$$

In [17]:
def additive_smoothing(w1, w2, bigram_counts, unigram_counts, vocab_size, k=0.1):
    # Apply Additive smoothing with parameter k
    bigram_count = bigram_counts.get((w1, w2), 0)
    unigram_count = unigram_counts.get(w1, 0)
    smoothed_prob = (bigram_count + k) / (unigram_count + k * vocab_size)
    return smoothed_prob

# Example with k=0.1
print("Additive Smoothing (P(cat | the)):", additive_smoothing("the", "cat", bigram_counts, unigram_counts, vocab_size, k=0.1))

Additive Smoothing (P(cat | the)): 0.5636363636363636
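
To see how `k` controls the strength of smoothing, the sketch below evaluates the same estimate for a few values of `k`, using Count("the", "cat") = 3, Count("the") = 5, and V = 5 from the sample counts. The helper `add_k_prob` and the chosen values of `k` are illustrative assumptions.

```python
def add_k_prob(bigram_count, context_count, vocab_size, k):
    """Add-k estimate: (count + k) / (context count + k * V)."""
    return (bigram_count + k) / (context_count + k * vocab_size)

# P(cat | the) for decreasing k, with counts taken from the sample dictionaries
estimates = {k: add_k_prob(3, 5, 5, k) for k in (1.0, 0.1, 0.01)}
unsmoothed = 3 / 5  # plain count ratio, for comparison

for k, p in estimates.items():
    print(f"k={k}: {p:.4f}")
print(f"unsmoothed: {unsmoothed:.4f}")
```

As `k` shrinks, the estimate moves from the Laplace value toward the unsmoothed ratio, so less probability mass is taken away from observed bigrams.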


#### 3. **Good-Turing Smoothing**

Good-Turing Smoothing estimates probabilities by adjusting counts of observed events based on the counts of rare and unseen events. Here’s how we calculate the adjusted count  `C^*` :

$$
C^* = (r+1) \frac{N_{r+1}}{N_r}
$$

where  `r`  is the count of the n-gram,  `N_r`  is the number of n-grams with frequency  `r` , and  `N_{r+1}`  is the number with frequency  `r+1`.

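For simplicity, let’s define a function for Good-Turing smoothing using our bigram counts. Two simplifications apply: when no bigram has frequency `r+1` (so `N_{r+1} = 0`), the function keeps the raw count `r`, and it renormalizes over the observed bigrams only, so no probability mass is reserved for unseen bigrams.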

In [18]:
def good_turing_smoothing(bigram_counts):
    # Calculate Good-Turing adjusted counts
    count_of_counts = Counter(bigram_counts.values())
    adjusted_counts = {}
    
    for bigram, count in bigram_counts.items():
        if count + 1 in count_of_counts and count in count_of_counts:
            adjusted_count = (count + 1) * (count_of_counts[count + 1] / count_of_counts[count])
        else:
            adjusted_count = count
        adjusted_counts[bigram] = adjusted_count
    
    # Calculate probabilities with adjusted counts
    total_bigrams = sum(adjusted_counts.values())
    smoothed_probs = {bigram: count / total_bigrams for bigram, count in adjusted_counts.items()}
    return smoothed_probs

# Example
print("Good-Turing Smoothing:", good_turing_smoothing(bigram_counts))

Good-Turing Smoothing: {('the', 'cat'): 0.375, ('cat', 'sat'): 0.375, ('sat', 'on'): 0.08333333333333333, ('on', 'the'): 0.08333333333333333, ('the', 'mat'): 0.08333333333333333}
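
One part of Good-Turing that the simplified function above leaves out is the estimate of the total probability of unseen events, which is `N_1 / N`: the number of n-grams seen exactly once divided by the total number of observations. The sketch below computes it for the sample bigram counts; the variable names are illustrative.

```python
from collections import Counter

# Sample bigram counts repeated so this block is self-contained
sample_bigram_counts = {("the", "cat"): 3, ("cat", "sat"): 2, ("sat", "on"): 1,
                        ("on", "the"): 1, ("the", "mat"): 1}

# N_r: how many distinct bigrams were seen exactly r times
count_of_counts = Counter(sample_bigram_counts.values())

# N: total number of bigram observations
total_observations = sum(sample_bigram_counts.values())

# Good-Turing estimate of the probability mass of all unseen bigrams
unseen_mass = count_of_counts[1] / total_observations
print("N_1 =", count_of_counts[1], "N =", total_observations, "unseen mass =", unseen_mass)
```

With only five bigram types the estimate is very rough; on a realistic corpus `N_1 / N` is much smaller, and the `N_r` values are usually smoothed before use.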


#### 4. **Kneser-Ney Smoothing**

Kneser-Ney Smoothing is a more complex technique that takes into account the diversity of contexts in which a word appears. Here’s the formula:

$$
P_{\text{KN}}(w_2 \mid w_1) = \frac{\max \left( \text{Count}(w_1, w_2) - d, 0 \right)}{\text{Count}(w_1)} + \lambda(w_1) \cdot P_{\text{continuation}}(w_2)
$$

Where:
-  `d` is a discount constant,
-  `lambda(w_1)`  ensures the probabilities sum to 1,
-  `P_continuation(w_2)`  is based on the probability of  `w_2`  appearing in new contexts.

For simplicity, we’ll set a discount  `d` and calculate a continuation probability based on unique bigrams in which a word appears.

In [19]:
def kneser_ney_smoothing(w1, w2, bigram_counts, unigram_counts, vocab_size, discount=0.75):
    # Raw counts for the bigram and its context word
    # (vocab_size is unused; it is kept so the call matches the other smoothing functions)
    bigram_count = bigram_counts.get((w1, w2), 0)
    unigram_count = unigram_counts.get(w1, 0)
    # Number of distinct words that precede w2 in the observed bigrams
    continuation_count = sum(1 for (prev_word, next_word) in bigram_counts if next_word == w2)
    
    # Continuation probability: share of distinct bigram types that end in w2
    p_continuation = continuation_count / len(bigram_counts)
    # Discounted probability for the bigram (zero if w1 was never seen)
    p_discounted = max(bigram_count - discount, 0) / unigram_count if unigram_count > 0 else 0
    
    # Normalization weight: discount times the number of distinct words following w1, over Count(w1)
    lambda_w1 = (discount / unigram_count) * len([bigram for bigram in bigram_counts if bigram[0] == w1]) if unigram_count > 0 else 0
    
    # Final smoothed probability
    smoothed_prob = p_discounted + lambda_w1 * p_continuation
    return smoothed_prob

# Example
print("Kneser-Ney Smoothing (P(cat | the)):", kneser_ney_smoothing("the", "cat", bigram_counts, unigram_counts, vocab_size))

Kneser-Ney Smoothing (P(cat | the)): 0.51
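
As a quick sanity check of the role of `lambda(w_1)`, the sketch below evaluates the same formula for an unseen bigram and sums the probabilities over the vocabulary for the context "cat". It repeats the sample counts so that it runs on its own; `kn_prob` is a hypothetical compact version of the function above, not a full Kneser-Ney implementation.

```python
# Sample counts repeated from the cells above so this block is self-contained
bigram_counts = {("the", "cat"): 3, ("cat", "sat"): 2, ("sat", "on"): 1,
                 ("on", "the"): 1, ("the", "mat"): 1}
unigram_counts = {"the": 5, "cat": 2, "sat": 1, "on": 1, "mat": 1}

def kn_prob(w1, w2, discount=0.75):
    """Simplified bigram Kneser-Ney estimate of P(w2 | w1)."""
    context_count = unigram_counts.get(w1, 0)
    if context_count == 0:
        return 0.0
    # Discounted share of the observed bigram count
    discounted = max(bigram_counts.get((w1, w2), 0) - discount, 0) / context_count
    # Mass freed by discounting: one discount per distinct word seen after w1
    distinct_followers = sum(1 for (prev, _) in bigram_counts if prev == w1)
    lambda_w1 = discount * distinct_followers / context_count
    # Share of distinct bigram types that end in w2
    p_continuation = sum(1 for (_, nxt) in bigram_counts if nxt == w2) / len(bigram_counts)
    return discounted + lambda_w1 * p_continuation

p_unseen = kn_prob("cat", "mat")  # ("cat", "mat") was never observed
total_after_cat = sum(kn_prob("cat", w) for w in unigram_counts)
print(round(p_unseen, 4), round(total_after_cat, 4))
```

The unseen bigram receives the mass freed by discounting, weighted by how many distinct contexts "mat" appears in, and the probabilities for the context "cat" still sum to 1.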


## 4. Closing Thoughts  

Throughout this notebook, Parham has undertaken the following activities:  
- Learned about methods of Language Modeling.
- Gained an understanding of **n-gram language models**, their structure, and their role in text generation and probability-based modeling.  
- Explored the theoretical and practical aspects of **n-grams**, delving into their use in NLP tasks. 