# Markov Chain

In this exercise, you will use Markov chains to generate text from the probabilistic patterns found in a corpus.

## Question 10 (15 points):

1.   **Load and Tokenize:** Begin by loading the `Brown` corpus from the `NLTK` library and tokenize it into words using `word_tokenize`. The `Brown` corpus will serve as our source text.

2.   **Build N-gram Models:** Create various n-gram models (e.g., unigrams, bigrams) to capture word sequences from the tokenized text. While building these models, calculate transition probabilities between n-grams.

3.   **Generate Text:** Implement a text generation function that uses the calculated probabilities. This function should generate text based on the patterns observed in the corpus.

4.   **Print Results:** For each n-gram order (1, 2, 3, and 4-gram), print the generated text. You can inspect the results to understand how different n-gram orders affect the generated text.

**Example of Transition Probabilities:**

For a 2-gram model, each key is a two-word context and each value maps a possible next word to its probability. The probabilities might look like this:

```
{('to', 'wait'): {'one': 0.043478260869565216,
  'until': 0.21739130434782608,
  ',': 0.21739130434782608,
  'for': 0.17391304347826086,
  'a': 0.043478260869565216,
  'his': 0.043478260869565216,
  '.': 0.08695652173913043,
  'to': 0.043478260869565216,
  'till': 0.08695652173913043,
  'before': 0.043478260869565216},...}
```
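To make the structure above concrete, here is a minimal sketch that computes the same kind of dictionary by hand. The token list `toy_tokens` is made-up illustration data (not taken from the Brown corpus), and the variable names are assumptions chosen for this example only.

```python
from collections import Counter, defaultdict

# Toy token list (hypothetical data, not taken from the Brown corpus)
toy_tokens = "to wait for a bus and to wait until dark and to wait for rain".split()

context_size = 2  # number of preceding words used as the context

# Count how often each word follows each two-word context
follower_counts = defaultdict(Counter)
for i in range(len(toy_tokens) - context_size):
    context = tuple(toy_tokens[i:i + context_size])
    next_word = toy_tokens[i + context_size]
    follower_counts[context][next_word] += 1

# Normalize the counts of each context into probabilities
transition_probs = {
    context: {word: count / sum(counter.values()) for word, count in counter.items()}
    for context, counter in follower_counts.items()
}

print(transition_probs[("to", "wait")])
```

In this toy text the context `('to', 'wait')` is followed by `for` twice and by `until` once, so `for` receives probability 2/3 and `until` receives 1/3.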

In [1]:
import nltk
nltk.download('brown')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import brown

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\haesh\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\haesh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
import random
import string  # only needed if punctuation tokens are filtered out
from collections import defaultdict

In [3]:
# Fix the random seed so the generated text is reproducible
SEED = 42
random.seed(SEED)

In [4]:
# Load the Brown corpus as a sequence of words
corpus = brown.words()
# Join the words and re-tokenize with word_tokenize, as the exercise requires
tokens = word_tokenize(' '.join(corpus))

In [5]:
def build_ngram_model(tokens, n):
    """
    Build an n-gram Markov model from a list of tokens.

    Args:
    - tokens (list): List of tokens from the corpus.
    - n (int): The order of the model, i.e. the number of preceding tokens used
      as the context (1 conditions on one word, 2 on two words, and so on).

    Returns:
    - dict: Maps each context (a tuple of n tokens) to a dictionary of
      {next_word: probability}.
    """
    ngram_model = defaultdict(lambda: defaultdict(float))

    # Generate the contexts (n-grams). The final n-gram of the corpus has no
    # following word, so it is left out; otherwise the probabilities of that
    # context would sum to less than 1.
    # To drop punctuation, filter the tokens before calling this function, e.g.
    # tokens = [t for t in tokens if t not in string.punctuation]
    ngrams_list = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n)]

    # Generate the (n+1)-grams: each context together with the word that follows it
    ngramsP1_list = [tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n)]

    # Count occurrences of contexts and of (n+1)-grams
    ngram_counts = nltk.FreqDist(ngrams_list)
    ngramP1_counts = nltk.FreqDist(ngramsP1_list)

    # Calculate transition probabilities: P(word | context) = count(context + word) / count(context)
    for ngram, count in ngramP1_counts.items():
        context = ngram[:-1]
        word = ngram[-1]
        ngram_model[context][word] = count / ngram_counts[context]

    return ngram_model
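
A quick sanity check on a model of this shape is that the next-word probabilities of every context sum to 1. Depending on how the final n-gram of the corpus is counted, one context can fall slightly short. The helper `check_normalized` and the dictionary `toy_model` below are hypothetical names introduced for this sketch; they are not part of the exercise.

```python
import math


def check_normalized(model, tol=1e-9):
    """Return the contexts whose next-word probabilities do not sum to 1.

    Hypothetical helper: `model` is assumed to map context tuples to
    {next_word: probability} dictionaries.
    """
    return [
        context
        for context, next_words in model.items()
        if not math.isclose(sum(next_words.values()), 1.0, abs_tol=tol)
    ]


# Small hand-written model; the ('dog',) context is deliberately broken
toy_model = {
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("cat",): {"sat": 1.0},
    ("dog",): {"ran": 0.9},
}

bad_contexts = check_normalized(toy_model)
print(bad_contexts)
```

Only the deliberately broken `('dog',)` context is reported. Running the same check on a model built from the corpus should return an empty list, or at most the single context that ends the corpus.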

In [6]:
def generate_text(ngram_model, length):
    """
    Generate text using an n-gram model.

    Args:
    - ngram_model (dict): An n-gram model dictionary generated by build_ngram_model.
    - length (int): The desired length of the generated text in tokens.

    Returns:
    - str: The generated text as a string.
    """
    # Check if the n-gram model is empty
    if not ngram_model:
        raise ValueError("The n-gram model is empty.")

    # List the contexts once so that random restarts do not rebuild the list
    contexts = list(ngram_model.keys())

    # Select a random starting context from the n-gram model
    current_context = random.choice(contexts)

    # Initialize the generated text with the starting context
    generated_text = list(current_context)

    # Generate text until the desired length is reached
    while len(generated_text) < length:
        # Check if the current context is present in the n-gram model
        if current_context in ngram_model:
            candidates = ngram_model[current_context]

            # Sample the next word in proportion to its transition probability
            next_word = random.choices(
                list(candidates.keys()),
                weights=list(candidates.values())
            )[0]

            # Slide the context window forward by one word
            current_context = current_context[1:] + (next_word,)

            # Append the next word to the generated text
            generated_text.append(next_word)
        else:
            # Dead end: the context never occurs in the model, so restart from a random one
            current_context = random.choice(contexts)

    # Trim in case the starting context alone is longer than the requested length
    return ' '.join(generated_text[:length])
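
The sampling step at the heart of the generator can be tried in isolation. The sketch below draws many next words from a single made-up distribution and compares the observed frequencies with the probabilities. The distribution `next_word_probs` is hypothetical, and a local `random.Random` instance is used (an assumption of this sketch, not what the notebook does) so that the global random state stays untouched.

```python
import random
from collections import Counter

rng = random.Random(42)  # local generator; the global random state is not affected

# Hypothetical next-word distribution for a single context
next_word_probs = {"until": 0.5, "for": 0.3, ",": 0.2}

# Draw 10,000 next words, each chosen in proportion to its probability
draws = rng.choices(
    list(next_word_probs.keys()),
    weights=list(next_word_probs.values()),
    k=10_000,
)

# Compare observed frequencies with the target probabilities
frequencies = {word: count / len(draws) for word, count in Counter(draws).items()}
for word, prob in next_word_probs.items():
    print(f"{word!r}: target {prob:.2f}, observed {frequencies[word]:.3f}")
```

The observed frequencies should land close to the target probabilities. `random.choices` accepts relative weights, so the values passed as `weights` do not have to sum to exactly 1.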

In [7]:
# Build n-gram models of different orders with probabilities
n1_gram = build_ngram_model(tokens, n=1)
n2_gram = build_ngram_model(tokens, n=2)
n3_gram = build_ngram_model(tokens, n=3)
n4_gram = build_ngram_model(tokens, n=4)
n5_gram = build_ngram_model(tokens, n=5)

In [8]:
# Generate text using the n-gram models with transition probabilities
length = 50
n1_generated_text = generate_text(n1_gram, length)
n2_generated_text = generate_text(n2_gram, length)
n3_generated_text = generate_text(n3_gram, length)
n4_generated_text = generate_text(n4_gram, length)
n5_generated_text = generate_text(n5_gram, length)

# Print the generated text
print(f'1-Gram Generated Text: {n1_generated_text}')
print(f'2-Gram Generated Text: {n2_generated_text}')
print(f'3-Gram Generated Text: {n3_generated_text}')
print(f'4-Gram Generated Text: {n4_generated_text}')
print(f'5-Gram Generated Text: {n5_generated_text}')

1-Gram Generated Text: 2a ( whom I was no idea . If you worry about this moment she returned to their voices beg your feet , who had . The 15th anniversary of the first half of phosphate rock `` ! Yet it , the papers in bins . 31 , of stenography
2-Gram Generated Text: and applications to leave a party , governments or races or institutions in the notes within Fig . 7 . If they do relax some of his movements was tentative . The Royal Air Force of 91 combat wings and even world attention is the airfield . Eliminate the vulnerability
3-Gram Generated Text: and a pulmonary vein on the other hand , if he chooses to be . Your chauffeur 's expenses will average between $ 150 and $ 200 for the Custom version ) . This tied in closely with the principal property disposal installations of the Federal Constitution was that the
4-Gram Generated Text: a.m. to visit the Kyoto University where Mr. Washizu is attending . I was amazed at the way he became more and more Fiorello as the evening progresse