# Markdown Cell 1: Explanation
In the cell below, I import the necessary libraries.

random is used for generating random selections based on weighted probabilities.
defaultdict from the collections module allows us to handle missing keys in our trigram model efficiently.

In [17]:
# Import necessary libraries
import random
from collections import defaultdict
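
As a small illustration of why these two imports are useful, the sketch below counts letters with `defaultdict(int)` and then draws a weighted random sample. The word `"HELLO"` and the seeded generator `rng` are assumptions made for this example only; they are not part of the notebook's pipeline.

```python
import random
from collections import defaultdict

# defaultdict(int) returns 0 for a missing key, so counts can be
# incremented without first checking whether the key exists.
counts = defaultdict(int)
for letter in "HELLO":
    counts[letter] += 1

# random.choices draws items with probability proportional to the weights.
rng = random.Random(0)  # seeded generator, used here only for repeatability
letters = list(counts)
weights = [counts[letter] for letter in letters]
sample = rng.choices(letters, weights=weights, k=5)

print(dict(counts))  # {'H': 1, 'E': 1, 'L': 2, 'O': 1}
print(sample)        # five letters, with 'L' twice as likely as the others
```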

# Task 1: Third-Order Letter Approximation Model

In this task, I will load five English books, clean their text, and build a trigram model from the cleaned text. A trigram model counts the occurrences of each sequence of three consecutive characters in the text, which lets us analyze the character-level structure of the language.

## Step 1: Load Books

In this step, I will initialize an empty list to store the contents of five books. I will iterate through the list of book file paths, opening each file in read mode with UTF-8 encoding to read the text. By the end, all five books will be stored in the list, preparing for the next steps of text cleaning and trigram model building.

In [18]:
def load_books(file_paths):
    texts = []  # Initialize an empty list to store the content of each book
    for file_path in file_paths:  # Loop through each file path in the list
        with open(file_path, 'r', encoding='utf-8') as f:  # Open the file in read mode with UTF-8 encoding
            texts.append(f.read())  # Read the entire content and append it to the texts list
    return texts  # Return the list of all loaded books

In [19]:
# List of file paths to the books stored in the 'texts' folder
file_paths = ['texts/book1.txt', 'texts/book2.txt', 'texts/book3.txt', 'texts/book4.txt', 'texts/book5.txt']
books = load_books(file_paths)  # Call the load_books function

## Step 2: Clean Text

In this step, I will convert the text of each book to uppercase and keep only letters, spaces, and full stops. Everything else, including digits, other punctuation, and line breaks, is removed. This gives the trigram model a small, consistent alphabet to count over.


In [20]:
def clean_text(text):
    text = text.upper()  # Convert the text to uppercase
    # Keep only letters, spaces, and full stops
    cleaned_text = ''.join([char for char in text if char.isalpha() or char == ' ' or char == '.'])
    return cleaned_text  # Return the cleaned text

# Clean each book
cleaned_books = [clean_text(book) for book in books]  # Apply the clean_text function to all books
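
Two behaviors of `clean_text` may be worth knowing. `str.isalpha()` also accepts non-ASCII letters such as `É`, and line breaks are dropped outright, so the last word of one line is joined to the first word of the next. The sketch below is a possible variant, not the function used to produce the results in this notebook. The names `ALLOWED` and `clean_text_ascii` are hypothetical, and it assumes that only the 26 ASCII capitals, the space, and the full stop are wanted.

```python
import string

# Assumption: only the 26 ASCII capitals, the space, and the full stop are wanted.
ALLOWED = set(string.ascii_uppercase + " .")

def clean_text_ascii(text):
    """Hypothetical variant of clean_text; not the notebook's original function."""
    # Map every whitespace character (including newlines) to a space and
    # drop any other character that is not in the allowed set.
    kept = "".join(
        " " if char.isspace() else char
        for char in text.upper()
        if char.isspace() or char in ALLOWED
    )
    # Collapse runs of spaces left behind by removed characters.
    return " ".join(kept.split())

print(clean_text_ascii("Hello,\nworld. It's 9 o'clock!"))  # HELLO WORLD. ITS OCLOCK
```

Switching to a variant like this would change the trigram counts, so the outputs shown later in the notebook would no longer match exactly.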

## Step 3: Build Trigram Model

Here, I will construct a trigram model from the cleaned text. The model will count the occurrences of each sequence of three consecutive characters. This will allow me to analyze the character structure of the text.


In [21]:
def build_trigram_model(text):
    trigram_model = defaultdict(int)  # Create a defaultdict to store trigram counts
    for i in range(len(text) - 2):  # Iterate through the text, leaving space for the last two characters
        trigram = text[i:i+3]  # Extract a trigram (3 characters)
        trigram_model[trigram] += 1  # Increment the count for this trigram in the model
    return trigram_model  # Return the trigram model

# Initialize a combined trigram model to accumulate counts from all books
trigram_model = defaultdict(int)


In [22]:
# Loop over each cleaned book to build the trigram model for that book
for book in cleaned_books:
    book_trigram_model = build_trigram_model(book)  # Build the trigram model for the current book
    # Merge the current book's trigram model with the overall model
    for trigram, count in book_trigram_model.items():
        trigram_model[trigram] += count  # Add the count of this trigram to the overall model
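
The count-and-merge pattern above can also be expressed with `collections.Counter`, whose `update()` method adds counts rather than overwriting them. The following is a minimal sketch on two short strings that stand in for cleaned books. `count_trigrams` and `combined` are hypothetical names introduced for this example.

```python
from collections import Counter

def count_trigrams(text):
    """Hypothetical helper: count every overlapping three-character window."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

combined = Counter()
for sample in ["THE THEME.", "THEN THE END."]:  # stand-ins for cleaned books
    combined.update(count_trigrams(sample))  # update() adds counts, it does not overwrite

print(combined["THE"])          # 4 (two occurrences in each string)
print(combined.most_common(3))  # the three most frequent trigrams
```

Because each string is counted separately, no trigram spans the boundary between two books, which matches the behavior of the loop above.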

## Step 4: Output the Trigram Model

Finally, I will display a sample of the trigram model to verify that it has been constructed correctly.

In [23]:
# Display the first 10 trigrams and their counts
for i, (trigram, count) in enumerate(trigram_model.items()):
    if i >= 10:  # Stop after the first 10 trigrams instead of scanning the whole model
        break
    print(f'Trigram: "{trigram}", Count: {count}')

Trigram: "THE", Count: 22365
Trigram: "HE ", Count: 20954
Trigram: "E P", Count: 1892
Trigram: " PR", Count: 1947
Trigram: "PRO", Count: 1493
Trigram: "ROJ", Count: 444
Trigram: "OJE", Count: 445
Trigram: "JEC", Count: 541
Trigram: "ECT", Count: 1409
Trigram: "CT ", Count: 759


# Task 2: Third-Order Letter Approximation Generation
In this task, I will generate a string of 10,000 characters using the trigram model created in Task 1. I will start with the string "TH" and generate each subsequent character based on the preceding two characters.


## Step 1: Define the Function to Generate the Next Character
I first define a function, generate_character(), which takes the last two characters of the generated string and uses the trigram model to determine the next character. This function finds all trigrams that start with those two characters and then randomly selects one of the possible third characters, using the trigram counts as weights for the probabilities.

In [24]:
# Function to generate the next character based on the last two characters using the trigram model
def generate_character(last_two, trigram_model):
    # Find trigrams that start with the last two characters
    candidates = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

    # Fail with a clear message if this two-character context never occurs in the model
    if not candidates:
        raise ValueError(f'No trigram in the model starts with "{last_two}"')

    # Calculate the total count of possible next characters
    total_count = sum(candidates.values())

    # Choose the next character randomly, weighted by the counts of the trigrams
    r = random.randint(1, total_count)  # randint is inclusive at both ends
    cumulative_count = 0
    for trigram, count in candidates.items():
        cumulative_count += count
        if cumulative_count >= r:
            return trigram[-1]  # Return the third character of the trigram
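
`generate_character` scans every trigram in the model on each call. One possible alternative, sketched below under stated assumptions, groups the counts by their two-character prefix once and then samples with `random.choices`. The names `build_context_index`, `generate_character_indexed`, and `toy_model` are hypothetical and are not used elsewhere in this notebook. Returning `None` for an unseen context is a design choice made for this sketch.

```python
import random
from collections import defaultdict

def build_context_index(trigram_model):
    """Group trigram counts by their first two characters (hypothetical helper)."""
    index = defaultdict(dict)
    for trigram, count in trigram_model.items():
        index[trigram[:2]][trigram[2]] = count
    return index

def generate_character_indexed(last_two, index, rng=random):
    """Pick the next character with probability proportional to its count."""
    options = index.get(last_two)
    if not options:
        return None  # this context never occurs in the model
    chars = list(options)
    weights = list(options.values())
    return rng.choices(chars, weights=weights, k=1)[0]

# Toy counts standing in for the real model
toy_model = {"THE": 6, "THA": 3, "THI": 1, "HE ": 5}
toy_index = build_context_index(toy_model)
next_char = generate_character_indexed("TH", toy_index, rng=random.Random(1))
print(next_char)  # one of 'E', 'A', 'I', with 'E' the most likely
```

The index is built once, so each draw only touches the handful of characters that can follow the current context.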

## Step 2: Initialize the Generated Text
Now I initialize the text generation by starting with the string "TH", as per the instructions. This will serve as the seed for the next characters.

In [25]:
# Initialize the generated text with the seed 'TH'
generated_text = "TH"

## Step 3: Generate 10,000 Characters
Using a loop, I will now generate the remaining 9,998 characters (since we already have 2 characters in "TH"). For each new character, I pass the last two characters to generate_character() to determine the next one.

In [26]:
# Extend the two-character seed until the text is 10,000 characters long
for _ in range(9998):  # 9,998 more characters are needed to reach 10,000
    last_two = generated_text[-2:]  # Get the last two characters from the generated text
    next_char = generate_character(last_two, trigram_model)  # Generate the next character
    generated_text += next_char  # Append the generated character to the text


## Step 4: Output and Save the Generated Text
Finally, after generating the full text, I display a sample of the first 500 characters for verification and save the entire generated text to a file for future analysis or inspection.

In [27]:
# Output the first 500 characters of the generated text for verification
print(generated_text[:500])

# Save the generated text to a file for future reference
with open('generated_text.txt', 'w', encoding='utf-8') as f:  # Match the UTF-8 encoding used when reading the books
    f.write(generated_text)

THERE I LEGRE FAT AT THICHIN HINGESONEREAPPS BUTY ENTS LOWN. ING FROJEANY HE OVEDWOLLEAPAT OFAIN EYES. ITHE GUT YOUGHTY HEARTHE CONG ME SE AST LAIN CROJEADABE THE NEM. WOULL SE WASID.BUL PRISE AT RE NER SETHE THE WHIMPTY.THE SUDDY SH MIRSTAYSIPTER MOKED WIT FERNIT MEMAYMPOSIONEVENCEPLAW ANG FROBBEFULS AND WAR. FORCRODNE. YOUR RE COM. MONCE. INE BUTION AND PIP YOUT GUTEND LIE ENBECTED UPORM HISCE LIFEEPIES VEIVENTLE ME COMEN IN SHIT THED HIES UND THE WHICHATION.I HAT CLOW INK OF SH COMEW HED HADD


In [28]:
# Split the generated text into words and count them
word_count = len(generated_text.split())  
# Print the word count to the console
print(f"Word count: {word_count}")  


Word count: 1691


# Task 3: Analyze Your Model

In this task, I will analyze the trigram-based text generation model to determine how many words in the generated text are actual English words. I will use the `words.txt` file, which contains a list of valid English words, and check each generated word against it.

## Step 1: Load English Words from words.txt
In this step, I will load the list of English words from the words.txt file. This file contains a comprehensive list of valid English words, which we will use to compare against the words generated in Task 2. The words will be stored in a set for fast lookup operations when checking if a word is valid.

In [29]:
def load_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        words = f.read().splitlines()  # Read the file and split it into a list of words
    return set(words)  # Return a set for fast lookup

# Load the words list from 'words.txt'
english_words = load_words('words.txt')

## Step 2: Split the Generated Text into Words
Here, I will split the generated text from Task 2 into individual words. Since the generated text is a continuous string, we will use spaces as the delimiter to separate the words. This will allow us to compare the resulting words with the valid English words loaded from words.txt.

In [30]:
def split_text_into_words(text):
    return text.split()

# Split the generated text into words
generated_words = split_text_into_words(generated_text)

## Step 3: Count Valid Words in the Generated Text
After splitting the generated text into words, I will count how many of those words are valid English words. I will iterate through the list of generated words and check whether each word exists in the set of English words. This gives me a count of valid words, which I can use to evaluate the performance of the trigram model.

In [31]:
def count_valid_words(generated_words, english_words):
    valid_word_count = 0
    for word in generated_words:
        if word in english_words:  # Check if the word exists in the English words set
            valid_word_count += 1
    return valid_word_count

# Count how many words from generated text are valid English words
valid_word_count = count_valid_words(generated_words, english_words)


## Step 4: Analyze and Explain the Results
Finally, I will display the results of our analysis. Based on the percentage of valid words, I will see how well the trigram model performs. The higher the percentage, the more accurate the model is in generating real English words.

In [32]:
total_words = len(generated_words)  # Total number of generated words
valid_word_percentage = (valid_word_count / total_words) * 100  # Calculate the percentage

print(f"Total words: {total_words}")
print(f"Valid English words: {valid_word_count}")
print(f"Percentage of valid words: {valid_word_percentage:.2f}%")

Total words: 1691
Valid English words: 579
Percentage of valid words: 34.24%
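
One caveat when reading this figure: the generated text keeps full stops, so tokens such as `EYES.` or `WASID.BUL` are looked up with the punctuation attached. The sketch below shows one way the check could be made insensitive to full stops and letter case. `count_valid_words_normalized`, `sample_tokens`, and `sample_lexicon` are hypothetical names, and the assumption that case should be ignored is mine, since the casing used in `words.txt` is not shown here.

```python
def count_valid_words_normalized(generated_words, english_words):
    """Hypothetical variant of count_valid_words that ignores case and full stops."""
    # Assumption: matching should be case-insensitive, so the lexicon is uppercased.
    lookup = {word.strip().upper() for word in english_words}
    valid = 0
    for token in generated_words:
        # A token such as "EYES." or "WASID.BUL" may hold several candidate words
        for piece in token.split("."):
            if piece and piece in lookup:
                valid += 1
    return valid

sample_tokens = "THE GUT YOUGHTY EYES. WASID.BUL".split()
sample_lexicon = {"the", "gut", "eyes"}
print(count_valid_words_normalized(sample_tokens, sample_lexicon))  # 3
```

If a check like this were used, the denominator of the percentage would also need to count the pieces produced by splitting on full stops rather than the whitespace-separated tokens.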


# Task 4: Export Your Model as JSON
In this task, I will export the trigram model as a JSON file to save the data.

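## Step 1: Convert and Serialize the Model
I first convert the defaultdict into a regular dictionary so that the exported object is a plain mapping with no default-value behavior. (`json.dump` would also accept the defaultdict directly, since it is a dict subclass.) Then, I use Python’s built-in json module to serialize the dictionary and save it as a file named trigrams.json in our repository.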

In [33]:
import json

# Convert the defaultdict trigram model to a regular dictionary
trigram_dict = dict(trigram_model)

# Use the json module to serialize the trigram dictionary
with open('trigrams.json', 'w') as json_file:
    json.dump(trigram_dict, json_file)

print("Trigram model exported as trigrams.json")


Trigram model exported as trigrams.json
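
To confirm that an export like this can be read back, a round-trip check along the following lines may help. It is a sketch on a toy model written to a temporary directory, so it does not touch the real `trigrams.json`. The names `model`, `path`, and `restored`, and the file name `trigrams_demo.json`, are assumptions for this example.

```python
import json
import os
import tempfile
from collections import defaultdict

model = defaultdict(int, {"THE": 3, "HE ": 2})  # toy stand-in for trigram_model

# Write to a temporary directory so the sketch does not touch trigrams.json
path = os.path.join(tempfile.mkdtemp(), "trigrams_demo.json")
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(model, json_file)  # a defaultdict is a dict subclass, so this works

with open(path, "r", encoding="utf-8") as json_file:
    restored = json.load(json_file)  # comes back as a plain dict

print(restored == dict(model))  # True
```

Note that the loaded object is an ordinary `dict`, so looking up a trigram that was never seen raises `KeyError` instead of returning 0.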
