# Download Dataset

In [19]:
import requests
import re

url = "https://www.gutenberg.org/files/2701/2701-0.txt"
# Download the raw text and fail early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()
text = response.text

## Filter `Moby-Dick`

In [21]:
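# Keep only the novel's body, dropping the Project Gutenberg header and footer
start = text.find("CHAPTER 1. Loomings.")       # First occurrence of the Chapter 1 title
end = text.find("End of Project Gutenberg")     # Start of the closing Gutenberg notice
# str.find returns -1 when a marker is missing, which would silently mis-slice the text
if start == -1 or end == -1:
    raise ValueError("Start or end marker not found in the downloaded text")
text = text[start:end]                          # Text between the two markers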

## Text Cleaning

In [5]:
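# Drop everything except ASCII letters and whitespace, then lowercase.
# Note: this also fuses words joined by punctuation (e.g. "don't" becomes "dont").
text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
# Split on whitespace into word tokens
words = text.split()
print(f"Total words: {len(words)}")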

Total words: 212399


# Trigram Model

In [6]:
from collections import defaultdict
import random

# Map each two-word prefix to the list of words that follow it (repeats kept)
trigrams = defaultdict(list)

for i in range(len(words) - 2):
    key = (words[i], words[i+1])
    value = words[i+2]
    trigrams[key].append(value)

*A `defaultdict` is a dictionary subclass that provides a default value for a key that does not exist, preventing a KeyError.*

*When you create a `defaultdict`, you provide it with a factory function (like int, list, or set). If you try to access or modify a key that is not in the dictionary, this function is automatically called to create a default value for that key.*  
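
A minimal illustration of that behaviour. The names `followers` and `counts` are made up for this sketch and are not part of the notebook:

```python
from collections import defaultdict

# list factory: a missing key starts out as an empty list
followers = defaultdict(list)
followers[("call", "me")].append("ishmael")   # no KeyError on first access

# int factory: a missing key starts at 0, which is handy for counting
counts = defaultdict(int)
for word in ["whale", "sea", "whale"]:
    counts[word] += 1

print(dict(followers))  # {('call', 'me'): ['ishmael']}
print(dict(counts))     # {'whale': 2, 'sea': 1}

# .get() does not call the factory, so a lookup does not add a new key
print(followers.get(("the", "sea")))  # None
print(("the", "sea") in followers)    # False
```

The last two lines matter for text generation: looking up an unseen prefix with `.get()` returns `None` instead of silently inserting an empty list.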
  
---  

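**We use `defaultdict` when building n-gram models so that keys that haven't been seen yet need no extra handling code. A trigram model rests on two counts**:

* **The frequency of the two-word prefix (e.g., count("San Francisco"))**.

* **The frequency of the full three-word sequence (e.g., count("San Francisco is"))**.

**In the code above, `defaultdict(list)` stores both counts implicitly: the length of the list for a prefix is the number of times that prefix was followed by a word, and the number of times a word appears in that list is the count of the full three-word sequence. Because repeats are kept, `random.choice` picks each next word in proportion to how often it followed the prefix.**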
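
As a rough sketch of how the two counts listed above turn into probabilities, the example below builds the same prefix-to-followers table on a tiny made-up word list and reads the counts off it. The toy text and the names `toy_words`, `toy_trigrams`, `prefix_count`, `trigram_counts`, and `probs` are assumptions for illustration only, not part of the notebook:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already lowercased and stripped of punctuation
toy_words = (
    "san francisco is foggy and san francisco is hilly "
    "and san francisco was rebuilt"
).split()

# Same structure as the notebook's trigram table: prefix -> list of followers
toy_trigrams = defaultdict(list)
for i in range(len(toy_words) - 2):
    toy_trigrams[(toy_words[i], toy_words[i + 1])].append(toy_words[i + 2])

prefix = ("san", "francisco")
followers = toy_trigrams[prefix]

prefix_count = len(followers)         # count("san francisco") with a follower
trigram_counts = Counter(followers)   # count("san francisco <word>") per word

# Relative-frequency estimate: count(prefix + word) / count(prefix)
probs = {word: n / prefix_count for word, n in trigram_counts.items()}

print(prefix_count)          # 3
print(trigram_counts["is"])  # 2
print(probs)
```

Sampling from the follower list with `random.choice` draws from this same distribution without ever computing `probs` explicitly.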

## Text Generation

In [12]:
def generate_text(start_words=("call", "me"), num_words=50):
    """Append up to `num_words` words to the two-word seed, stopping early at an unseen prefix."""
    output = list(start_words)
    for _ in range(num_words):
        key = (output[-2], output[-1])
        next_words = trigrams.get(key)
        if not next_words:
            break
        next_word = random.choice(next_words)
        output.append(next_word)
    return " ".join(output)

print(generate_text(("the", "whale")))

the whale swimming out from the water the thinnest shreds of the white whale he struck the spanish land but i have seen him lay of nights in a panic and to all these things should fail in latently engendering an element in which they gaze however it was given at all
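
Because the sampling is random, re-running the cell gives a different passage each time. If repeatable output is wanted, one option is to draw from a seeded `random.Random` instance. The sketch below assumes a toy word list, and the helper names `build_trigrams` and `generate_seeded` are hypothetical, not part of the notebook:

```python
import random
from collections import defaultdict

def build_trigrams(words):
    """Map each two-word prefix to the list of words that follow it."""
    table = defaultdict(list)
    for i in range(len(words) - 2):
        table[(words[i], words[i + 1])].append(words[i + 2])
    return table

def generate_seeded(table, start_words, num_words=50, seed=0):
    """Same loop as the notebook's generator, but with a private seeded RNG."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    output = list(start_words)
    for _ in range(num_words):
        next_words = table.get((output[-2], output[-1]))
        if not next_words:
            break  # unseen prefix: nothing to sample from
        output.append(rng.choice(next_words))
    return " ".join(output)

# Hypothetical toy corpus for demonstration
toy = "the whale swam and the whale dived and the whale rose".split()
table = build_trigrams(toy)

first = generate_seeded(table, ("the", "whale"), num_words=8, seed=42)
second = generate_seeded(table, ("the", "whale"), num_words=8, seed=42)
print(first)
print(first == second)  # True
```

Using a local `random.Random` rather than `random.seed` keeps the reproducibility scoped to this function, which is a design choice rather than a requirement.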


# Bigram Model

In [13]:
# Map each word to the list of words that follow it (repeats kept)
bigrams = defaultdict(list)

for i in range(len(words) - 1):
    current_word = words[i]
    next_word = words[i + 1]
    bigrams[current_word].append(next_word)


## Text Generation

In [17]:
def generate_sentence(start_word, num_words=10):
    """Return at most `num_words` words in total, counting `start_word` itself."""
    sentence = [start_word]
    current = start_word
    for _ in range(num_words - 1):
        next_words = bigrams.get(current)
        if not next_words:
            break
        current = random.choice(next_words)
        sentence.append(current)
    return " ".join(sentence)

In [18]:
print(generate_sentence("the", num_words=10))

the wife into the vessels were the conclusion not green
