# Third-order Letter Approximation Model
This notebook creates a third-order letter approximation model from five English texts sourced from Project Gutenberg. The texts are processed by:
- Removing unnecessary content (preamble, postamble).
- Retaining only ASCII letters, full stops, and spaces.
- Converting all letters to uppercase.

Count the occurrences of each trigram (a sequence of three characters) to build a model of the English language based on these texts.
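
As a small illustration of what counting trigrams involves, the sketch below slides a three-character window over a short made-up string. The sample string and the variable names are assumptions for illustration only and are not taken from the five texts.

```python
# Hypothetical sample, already uppercase and limited to letters, spaces and full stops
sample = "THE THEME."

# Slide a window of width 3 across the string, one character at a time
trigrams = [sample[i:i + 3] for i in range(len(sample) - 2)]

for trigram in trigrams:
    print(repr(trigram))
```

A string of length n yields n - 2 overlapping trigrams, and spaces and full stops count as characters just like letters.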


# The .txt files are saved in a local directory and read with Python.

# ref: https://realpython.com/read-write-files-python/
# https://www.dataquest.io/blog/read-file-python/

In [49]:
import re

# List of file paths for the texts
file_paths = [
    r"C:\Users\hemer\emerginTechnologies\Text\betrothed.txt",
    r"C:\Users\hemer\emerginTechnologies\Text\chronicles.txt",
    r"C:\Users\hemer\emerginTechnologies\Text\Frank.txt",
    r"C:\Users\hemer\emerginTechnologies\Text\school.txt",
    r"C:\Users\hemer\emerginTechnologies\Text\voyage.txt"
]

# Function to load texts
def load_texts(file_paths):
    texts = []
    for path in file_paths:
        with open(path, 'r', encoding='utf-8') as file:
            texts.append(file.read())
    return texts

# Load all texts and display the first 500 characters of each for verification
texts = load_texts(file_paths)
print("Step 1: Loaded Texts (First 500 Characters of Each Text):")
for i, text in enumerate(texts):
    print(f"Text {i+1}:\n{text[:500]}\n{'-'*40}")


Step 1: Loaded Texts (First 500 Characters of Each Text):
Text 1:
The Project Gutenberg eBook of My betrothed and other poems
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this
----------------------------------------
Text 2:
The Project Gutenberg eBook of The chronicles of Enguerrand de Monstrelet
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. I

# Remove the preamble and postamble, remove all characters except ASCII letters, spaces, and full stops, and convert all letters to uppercase.
# Ref: https://datagy.io/python-read-text-file/

In [50]:
# Function to preprocess each text
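def preprocess_text(text):
    # Remove preamble and postamble.
    # Marker wording varies between Project Gutenberg files (for example
    # "START OF THE PROJECT GUTENBERG EBOOK"); if a marker is not found,
    # that end of the text is left unsliced.
    start_marker = "START OF THIS PROJECT GUTENBERG EBOOK"
    end_marker = "END OF THIS PROJECT GUTENBERG EBOOK"

    # Slice off the preamble first
    start_pos = text.find(start_marker)
    if start_pos != -1:
        text = text[start_pos + len(start_marker):]

    # Search for the end marker only after slicing, so that the index
    # refers to the text it is applied to
    end_pos = text.find(end_marker)
    if end_pos != -1:
        text = text[:end_pos]

    # Remove unwanted characters, keep only ASCII letters, full stops, and spaces
    # (newlines are removed too, so words at line breaks are joined)
    text = re.sub(r'[^A-Za-z. ]', '', text)
    text = text.upper()  # Convert to uppercase
    return text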

# Preprocess each text and display the first 500 characters after preprocessing
processed_texts = [preprocess_text(text) for text in texts]
print("\nStep 2: Preprocessed Texts (First 500 Characters of Each Text):")
for i, text in enumerate(processed_texts):
    print(f"Processed Text {i+1}:\n{text[:500]}\n{'-'*40}")


Step 2: Preprocessed Texts (First 500 Characters of Each Text):
Processed Text 1:
THE PROJECT GUTENBERG EBOOK OF MY BETROTHED AND OTHER POEMS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES ANDMOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONSWHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMSOF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINEAT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATESYOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATEDBEFORE USING THIS EBOOK.TITL
----------------------------------------
Processed Text 2:
THE PROJECT GUTENBERG EBOOK OF THE CHRONICLES OF ENGUERRAND DE MONSTRELET    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES ANDMOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONSWHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMSOF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINEAT W
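
The output above still begins with the Project Gutenberg header, and words at line ends are fused (`ANDMOST`, `RESTRICTIONSWHATSOEVER`). This suggests that the marker strings did not match these files and that newlines were deleted rather than replaced with spaces. One possible variant is sketched below. It assumes the marker lines look like `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` (or the older `THIS` wording); the function name `preprocess_text_v2`, the regular expressions, and the demo string are illustrative assumptions, not the notebook's method.

```python
import re

# Assumed marker format: "*** START OF THE/THIS PROJECT GUTENBERG EBOOK <title> ***"
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)

def preprocess_text_v2(text):
    """Hypothetical variant of preprocess_text (not the notebook's function)."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]        # drop the preamble and the marker line
    end = END_RE.search(text)            # search after slicing so the index is valid
    if end:
        text = text[:end.start()]        # drop the postamble
    text = re.sub(r"\s+", " ", text)     # newlines and tabs become single spaces
    text = re.sub(r"[^A-Za-z. ]", "", text)
    text = re.sub(r" +", " ", text)      # removed characters can leave double spaces
    return text.upper().strip()

# Made-up miniature "book" to exercise the function
demo = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Hello, world.\nSecond line!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "License text"
)
print(preprocess_text_v2(demo))
```

Collapsing whitespace before filtering keeps word boundaries intact and avoids long runs of spaces, both of which would otherwise affect the trigram counts.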

# Create a function that iterates through each preprocessed text and counts its trigrams

In [51]:
# Function to generate trigram model
def generate_trigram_model(texts):
    trigram_counts = {}  # Dictionary to store trigram counts

    for text in texts:
        for i in range(len(text) - 2):  # Loop through each character up to the third-to-last
            trigram = text[i:i + 3]  # Extract a trigram of three characters
            if trigram in trigram_counts:
                trigram_counts[trigram] += 1  # Increment count if trigram exists
            else:
                trigram_counts[trigram] = 1  # Initialize count if trigram is new
    
    return trigram_counts

# Generate the trigram model and display a sample of 10 items
trigram_model = generate_trigram_model(processed_texts)
print("\nStep 3: Trigram Model Sample (10 Trigrams with Counts):")
sample_trigrams = list(trigram_model.items())[:10]
for trigram, count in sample_trigrams:
    print(f"'{trigram}': {count}")


Step 3: Trigram Model Sample (10 Trigrams with Counts):
'THE': 22719
'HE ': 19788
'E P': 2035
' PR': 2578
'PRO': 1583
'ROJ': 445
'OJE': 445
'JEC': 597
'ECT': 1577
'CT ': 808
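
The same counts can also be produced with `collections.Counter` from the standard library, which removes the need for the explicit membership check. The sketch below is an alternative formulation under that assumption, not the notebook's code; `count_trigrams` and the sample input are hypothetical.

```python
from collections import Counter

def count_trigrams(texts):
    """Count every overlapping three-character window across all texts."""
    counts = Counter()
    for text in texts:
        # Each text is windowed separately, so no trigram spans two texts
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    return counts

# Made-up input: two short strings instead of the five books
sample_counts = count_trigrams(["THE THEME.", "THE"])
print(sample_counts["THE"])
```

Because `Counter` is a `dict` subclass, the result can be passed to code that expects the plain dictionary built above.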


# Create a function to display the most common trigrams

In [52]:
# Display top N trigrams (e.g., top 10)
def display_top_trigrams(trigram_model, top_n=10):
    # Sort trigrams by count in descending order
    sorted_trigrams = sorted(trigram_model.items(), key=lambda item: item[1], reverse=True)
    # Display the top N trigrams
    print(f"\nStep 4: Top {top_n} Most Common Trigrams")
    for trigram, count in sorted_trigrams[:top_n]:
        print(f"'{trigram}': {count}")

# Display the 10 most common trigrams
display_top_trigrams(trigram_model, top_n=10)


Step 4: Top 10 Most Common Trigrams
' TH': 25434
'THE': 22719
'HE ': 19788
'   ': 17800
'AND': 10688
'ND ': 10315
' AN': 10238
'ED ': 9757
' TO': 9718
' OF': 9412
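
Raw counts depend on the size of the corpus, so it can help to view each trigram as a share of all trigrams counted. The sketch below shows one way to do that; the function name `top_trigram_shares` and the toy counts are assumptions for illustration and are not values from the notebook's corpus.

```python
def top_trigram_shares(trigram_counts, top_n=3):
    """Return the top_n trigrams with their share of the total count."""
    total = sum(trigram_counts.values())
    if total == 0:
        return []  # nothing counted, so there are no shares to report
    ranked = sorted(trigram_counts.items(), key=lambda item: item[1], reverse=True)
    return [(trigram, count / total) for trigram, count in ranked[:top_n]]

toy_counts = {"THE": 5, " TH": 3, "AND": 2}  # made-up counts
for trigram, share in top_trigram_shares(toy_counts):
    print(f"'{trigram}': {share:.2%}")
```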
