# Markdown Cell 1: Explanation
In the cell below, we import the necessary libraries.

random is used to make random selections weighted by probability.
defaultdict from the collections module returns a default count of 0 for any trigram not yet seen, so we can increment counts without first checking whether the key exists.

In [197]:
# Import necessary libraries
import random
from collections import defaultdict

# Task 1: Third-Order Letter Approximation Model

In this task, we will load five English books, clean their text, and build a trigram model from the cleaned text. A trigram model counts the occurrences of each sequence of three consecutive characters in the text, allowing us to analyze the structure of the language.

## Step 1: Load Books

In this step, we load the contents of five books. The load_books function iterates through a list of file paths, opens each file in read mode with UTF-8 encoding, and appends its full text to a list. The resulting list of five strings is the input for the text cleaning and trigram model building steps.

In [198]:
def load_books(file_paths):
    texts = []  # Initialize an empty list to store the content of each book
    for file_path in file_paths:  # Loop through each file path in the list
        with open(file_path, 'r', encoding='utf-8') as f:  # Open the file in read mode with UTF-8 encoding
            texts.append(f.read())  # Read the entire content and append it to the texts list
    return texts  # Return the list of all loaded books

In [199]:
# List of file paths to the books stored in the 'texts' folder
file_paths = ['texts/book1.txt', 'texts/book2.txt', 'texts/book3.txt', 'texts/book4.txt', 'texts/book5.txt']
books = load_books(file_paths)  # Call the load_books function
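If one of the five files is missing, load_books stops at the first failing open call. The sketch below is an optional variant, not part of the original notebook: the name load_books_checked is hypothetical, and it assumes we would rather see every missing path reported at once before any file is read.

```python
from pathlib import Path


def load_books_checked(file_paths):
    """Hypothetical variant of load_books that reports all missing files at once."""
    paths = [Path(p) for p in file_paths]
    # Collect every path that does not point to an existing file
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing book files: {missing}")
    # Read each book as UTF-8 text, in the order given
    return [p.read_text(encoding="utf-8") for p in paths]
```

It can be called exactly like load_books, for example load_books_checked(file_paths).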

## Step 2: Clean Text

In this step, we clean the text of each book: the text is converted to uppercase, and every character other than a letter, a space, or a full stop is removed (including digits, punctuation, and line breaks). This keeps the model's alphabet small and consistent.


In [200]:
def clean_text(text):
    text = text.upper()  # Convert the text to uppercase
    # Keep only letters, spaces, and full stops
    cleaned_text = ''.join([char for char in text if char.isalpha() or char == ' ' or char == '.'])
    return cleaned_text  # Return the cleaned text

# Clean each book
cleaned_books = [clean_text(book) for book in books]  # Apply the clean_text function to all books
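Two details of clean_text are worth knowing. str.isalpha() is true for any Unicode letter, so accented characters survive the filter, and line breaks are deleted rather than replaced, so the last word of one line is joined to the first word of the next. The sketch below shows one possible stricter variant. The name clean_text_ascii and the choice to collapse repeated spaces are assumptions made for illustration, and using it would change the trigram counts reported in this notebook.

```python
import re
import string

# Allowed alphabet: ASCII uppercase letters, space, and full stop
ALLOWED = set(string.ascii_uppercase + " .")


def clean_text_ascii(text):
    """Hypothetical stricter variant of clean_text."""
    text = text.upper()
    # Turn every whitespace character (newline, tab, ...) into a plain space
    text = re.sub(r"\s", " ", text)
    # Keep only characters from the allowed alphabet
    kept = "".join(ch for ch in text if ch in ALLOWED)
    # Collapse runs of spaces left behind by removed characters
    return re.sub(r" +", " ", kept)


print(clean_text_ascii("Hello,\nWorld."))  # prints: HELLO WORLD.
```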

## Step 3: Build Trigram Model

Here, we will construct a trigram model from the cleaned text. The model will count the occurrences of each sequence of three consecutive characters. This will allow us to analyze the character structure of the text.


In [201]:
def build_trigram_model(text):
    trigram_model = defaultdict(int)  # Create a defaultdict to store trigram counts
    for i in range(len(text) - 2):  # Stop two characters before the end so every slice has exactly 3 characters
        trigram = text[i:i+3]  # Extract a trigram (3 characters)
        trigram_model[trigram] += 1  # Increment the count for this trigram in the model
    return trigram_model  # Return the trigram model

# Initialize a combined trigram model to accumulate counts from all books
trigram_model = defaultdict(int)


In [202]:
# Loop over each cleaned book to build the trigram model for that book
for book in cleaned_books:
    book_trigram_model = build_trigram_model(book)  # Build the trigram model for the current book
    # Merge the current book's trigram model with the overall model
    for trigram, count in book_trigram_model.items():
        trigram_model[trigram] += count  # Add the count of this trigram to the overall model
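The counting and merging steps above can also be expressed with collections.Counter, whose update() method adds counts rather than replacing them. This is only an alternative sketch: count_trigrams and the two sample strings are made up for illustration and stand in for the cleaned books.

```python
from collections import Counter


def count_trigrams(text):
    """Count every run of three consecutive characters in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


# Two tiny stand-in "books" (made up for illustration)
sample_texts = ["THE THEME", "THE END."]

combined = Counter()
for sample in sample_texts:
    combined.update(count_trigrams(sample))  # update() adds counts together

print(combined["THE"])  # prints: 3
```

"THE" appears twice in the first sample and once in the second, so the merged count is 3.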

## Step 4: Output the Trigram Model

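Finally, we display a sample of the trigram model to verify that it has been constructed correctly. The sample is the first 10 entries in insertion order, i.e. the first distinct trigrams encountered in the first book, not the 10 most frequent ones.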

In [203]:
from itertools import islice

# Display the first 10 trigrams and their counts
# islice stops after 10 items instead of looping over the whole model
for trigram, count in islice(trigram_model.items(), 10):
    print(f'Trigram: "{trigram}", Count: {count}')

Trigram: "THE", Count: 22365
Trigram: "HE ", Count: 20954
Trigram: "E P", Count: 1892
Trigram: " PR", Count: 1947
Trigram: "PRO", Count: 1493
Trigram: "ROJ", Count: 444
Trigram: "OJE", Count: 445
Trigram: "JEC", Count: 541
Trigram: "ECT", Count: 1409
Trigram: "CT ", Count: 759
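The sample above follows the order in which trigrams were first seen. To inspect the most frequent trigrams instead, the model can be sorted by count. The helper name top_trigrams and the toy counts below are hypothetical and are not results from the five books.

```python
def top_trigrams(trigram_model, n=10):
    """Return the n most frequent (trigram, count) pairs, highest count first."""
    return sorted(trigram_model.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy counts, made up for illustration
toy_model = {"THE": 5, "HE ": 4, "E P": 1, " PR": 2}

for trigram, count in top_trigrams(toy_model, n=3):
    print(f'Trigram: "{trigram}", Count: {count}')
```

With the real model, top_trigrams(trigram_model) would list the ten highest counts.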


# Task 2: Third-Order Letter Approximation Generation
In this task, we will generate a string of 10,000 characters using the trigram model created in Task 1. We will start with the string "TH" and generate each subsequent character based on the preceding two characters.


## Step 1: Define the Function to Generate the Next Character
We first define a function, generate_character(), which takes the last two characters of the generated string and uses the trigram model to determine the next character. This function finds all trigrams that start with those two characters and then randomly selects one of the possible third characters, using the trigram counts as weights for the probabilities.
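The weighting can be written as a formula. The notation is introduced here for illustration and is not taken from the notebook: $N(abc)$ is the count of the trigram $abc$ in the model, and $ab$ are the last two generated characters. The next character $c$ is then drawn with probability

$$
P(c \mid ab) = \frac{N(abc)}{\sum_{c'} N(abc')}
$$

As a hypothetical example, if "THE" had a count of 3, "THA" a count of 1, and no other trigram began with "TH", then $P(\text{E} \mid \text{TH}) = 3/4$ and $P(\text{A} \mid \text{TH}) = 1/4$.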

In [204]:
# Function to generate the next character based on the last two characters using the trigram model
def generate_character(last_two, trigram_model):
    # Find trigrams that start with the last two characters
    candidates = {trigram: count for trigram, count in trigram_model.items() if trigram.startswith(last_two)}

    # Without this check, random.randint(1, 0) below would fail with an unhelpful error
    # when no trigram starts with last_two (e.g. the pair only occurs at the very end of a book)
    if not candidates:
        raise ValueError(f'No trigram starts with "{last_two}"')

    # Calculate the total count of possible next characters
    total_count = sum(candidates.values())

    # Choose the next character randomly, weighted by the counts of the trigrams
    r = random.randint(1, total_count)
    cumulative_count = 0
    for trigram, count in candidates.items():
        cumulative_count += count
        if cumulative_count >= r:
            return trigram[-1]  # Return the third character of the trigram
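generate_character scans every trigram in the model on each call, which adds up when generating 10,000 characters. One possible optimization, sketched below under the assumption that the model does not change during generation, is to group the counts by two-character prefix once and then sample with random.choices. The names build_prefix_index and generate_character_indexed, and the toy model, are hypothetical.

```python
import random
from collections import defaultdict


def build_prefix_index(trigram_model):
    """Group trigram counts by their first two characters (hypothetical helper)."""
    index = defaultdict(dict)
    for trigram, count in trigram_model.items():
        index[trigram[:2]][trigram[2]] = count
    return index


def generate_character_indexed(last_two, index, rng=random):
    """Sample the next character from a prebuilt prefix index."""
    followers = index.get(last_two)
    if not followers:
        raise ValueError(f"No trigram starts with {last_two!r}")
    chars = list(followers)
    weights = list(followers.values())
    # random.choices draws one item with probability proportional to its weight
    return rng.choices(chars, weights=weights, k=1)[0]


# Toy model, made up for illustration
toy_model = {"THE": 3, "THA": 1, "HE ": 2}
toy_index = build_prefix_index(toy_model)
next_char = generate_character_indexed("TH", toy_index, rng=random.Random(0))
```

Here next_char is either "E" or "A", with "E" three times as likely. Passing a seeded random.Random instance makes a run reproducible.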

## Step 2: Initialize the Generated Text
We start text generation from the string "TH", as the task specifies. This seed provides the first two characters from which each subsequent character is generated.

In [None]:
# Initialize the generated text with the seed 'TH'
generated_text = "TH"