INTRODUCTION

This project aims to build a trigram model using text from books available on Project Gutenberg. A trigram here is a sequence of three consecutive characters from a piece of text. By counting how many times each of these three-character sequences shows up, we can start to see patterns in how the English language is used.

DATA COLLECTION

Reading books from Project Gutenberg as plain text in UTF-8 encoding.

https://www.gutenberg.org/

In [2]:
# Open the book in read mode with UTF-8 encoding;
# the 'with' statement ensures the file is closed after reading
with open('/workspaces/Emerging-Technologies/Data/Frankenstein.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Print the first 500 characters to check that it is working
# print("text sample", text[:500])

TEXT PROCESSING

re - Regular Expression Operations 

https://docs.python.org/3/library/re.html

Importing Libraries and Defining Helper Functions

In [None]:
from collections import defaultdict
import re

# Function to preprocess text by removing unwanted characters and converting to uppercase
def preprocess_text(text):
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)  # Keep only letters, spaces, and periods
    return cleaned_text.upper()  # Convert all letters to uppercase

# Function to create a trigram model from a given text
def create_trigram_model(text):
    trigram_counts = defaultdict(int)
    for i in range(len(text) - 2):  # Looping through each character, stopping two characters from the end
        trigram = text[i:i+3]  # Extract three consecutive characters as a trigram
        trigram_counts[trigram] += 1  # Count each trigram
    return trigram_counts
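
As a quick illustration of what the cleaning step keeps and drops, the sketch below runs the same regular expression on a made-up two-line string (the sample string is an assumption, not taken from any of the books). Note that newlines are removed along with digits and punctuation, so the last word of one line is joined directly to the first word of the next.

```python
import re

def preprocess_text(text):
    # Same rule as above: keep only ASCII letters, spaces, and periods, then uppercase
    return re.sub(r'[^A-Za-z. ]', '', text).upper()

# Hypothetical sample input, chosen to show each kind of removal
sample = "Chapter 1.\nIt was a dark, stormy night!"
cleaned = preprocess_text(sample)

print(cleaned)  # CHAPTER .IT WAS A DARK STORMY NIGHT
```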


TRIGRAM DESIGN FOR A SINGLE TEXT


A trigram is a sequence of three characters, so we need to go through the processed text, grab three characters at a time, and count the number of times each sequence occurs. To do this we use Python's defaultdict from the collections module.

The defaultdict(int) creates a dictionary where each trigram has a default value of 0.

When we find a new trigram, its count starts at 0, and we can increment it directly.
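
To make the difference concrete, here is a minimal sketch comparing a plain dict with defaultdict(int). The key "THE" and the variable names are only illustrative.

```python
from collections import defaultdict

plain_counts = {}
default_counts = defaultdict(int)

try:
    plain_counts["THE"] += 1   # a plain dict has no entry for "THE" yet
except KeyError:
    plain_counts["THE"] = 1    # so the first occurrence needs special handling

default_counts["THE"] += 1     # the missing key is created with int() == 0, then incremented

print(plain_counts["THE"], default_counts["THE"])  # 1 1
```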


In [4]:
from collections import defaultdict  # defaultdict supplies a default value for missing keys

# Defining a function to create a trigram model from the input text
def create_trigram_model(text):
    # Create a dictionary with a default value of 0 which will store the counts of each trigram
    trigram_counts = defaultdict(int)

    # Loop through the text, stopping 2 characters before the end to form trigrams
    for i in range(len(text)-2):

        # Extract a trigram starting at position 'i'
        trigram = text[i:i+3]

        # Increment the count of the trigram in the dictionary
        trigram_counts[trigram] += 1

    # Return the dictionary containing the trigram counts
    return trigram_counts

# Using the function provided in README
sample_text = "IT IS WHAT IT IS."

# Apply the trigram model function to the sample text and store the result in trigram_model
trigram_model = create_trigram_model(sample_text)

# Printing the trigram counts
for trigram, count in trigram_model.items():
    print(f"Trigram: {trigram} | Count: {count}")



Trigram: IT  | Count: 2
Trigram: T I | Count: 3
Trigram:  IS | Count: 2
Trigram: IS  | Count: 1
Trigram: S W | Count: 1
Trigram:  WH | Count: 1
Trigram: WHA | Count: 1
Trigram: HAT | Count: 1
Trigram: AT  | Count: 1
Trigram:  IT | Count: 1
Trigram: IS. | Count: 1


Every trigram is a sequence of three characters, extracted by sliding a three-character window through the text one position at a time. For each window we simply increment that trigram's count in the dictionary.
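
As a cross-check on the counts printed above, the same tally can be sketched with collections.Counter. This is an alternative way to count, not the approach used in the rest of the notebook.

```python
from collections import Counter

sample_text = "IT IS WHAT IT IS."

# Slide a three-character window over the text and let Counter do the tallying
trigram_counts = Counter(sample_text[i:i + 3] for i in range(len(sample_text) - 2))

print(trigram_counts["T I"])         # 3
print(len(trigram_counts))           # 11 distinct trigrams
print(sum(trigram_counts.values()))  # 15 windows in total: len(text) - 2
```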

TRIGRAM DESIGN FOR MULTIPLE TEXTS, LOADING TEXT FILES, AND PROCESSING THEM

In [22]:
from collections import defaultdict


texts = ['/workspaces/Emerging-Technologies/Data/Frankenstein.txt',
         '/workspaces/Emerging-Technologies/Data/mobydick.txt',
         '/workspaces/Emerging-Technologies/Data/Pride.txt',
         '/workspaces/Emerging-Technologies/Data/Romeo.txt',
         '/workspaces/Emerging-Technologies/Data/Scarlet.txt',
         '/workspaces/Emerging-Technologies/Data/What to do at the moment.txt']

# Initializing a defaultdict to store multiple trigram counts from all texts
multiple_trigram_model = defaultdict(int)

# Looping through each file in the list
for filename in texts:
    # Opening current file in read mode with UTF-8 encoding
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read() # Reading the entire content of the file
        
        # Cleans the text and converts it to uppercase
        processed_text = preprocess_text(text)  
        trigram_model = create_trigram_model(processed_text)  # Assuming create_trigram_model is defined

        # Updates the combined trigram model with counts from the current file
        for trigram, count in trigram_model.items():
            multiple_trigram_model[trigram] += count

# Printing the number of unique trigrams
print(f"Unique trigrams: {len(multiple_trigram_model)}")

# Printing a sample of 10 trigrams and their counts
print("Sample trigrams and counts:")
for trigram, count in list(multiple_trigram_model.items())[:10]:
    print(f"'{trigram}': {count}")


Unique trigrams: 8698
Sample trigrams and counts:
'THE': 44141
'HE ': 34073
'E P': 3719
' PR': 3931
'PRO': 2597
'ROJ': 472
'OJE': 472
'JEC': 920
'ECT': 3174
'CT ': 1469


SORTING TRIGRAMS BY FREQUENCY

Here we sort the trigrams in the combined model by frequency, in descending order.

In [None]:
# Sort trigrams by frequency
sorted_trigrams = sorted(multiple_trigram_model.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 most frequent trigrams
print("Top 10 most frequent trigrams:")
for trigram, count in sorted_trigrams[:10]:
    print(f"'{trigram}': {count}")


Top 10 most frequent trigrams:
' TH': 52319
'THE': 44141
'HE ': 34073
'ED ': 20152
'ND ': 19696
'AND': 19694
' AN': 19535
' OF': 17461
'ING': 16840
'ER ': 16093


RESULTS AND ANALYSIS SO FAR

We have now created a trigram model from multiple text files.

For this model the code reads each text file, processes the text to remove unwanted characters and convert it to uppercase, and then counts how many times each trigram occurs, accumulating the totals in a 'defaultdict' called 'multiple_trigram_model'.

The process involves:

1 - Reading files

2 - Text preprocessing ( text is cleaned )

3 - Trigram counting ( the 'create_trigram_model' function counts the number of times each trigram occurs in the cleaned text )

4 - Results ( Each file's trigram counts are added to the overall trigram model )
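
The four steps above can be sketched end to end on in-memory strings. The two short strings stand in for file contents and are assumptions for illustration; the helper functions mirror the ones defined earlier.

```python
import re
from collections import defaultdict

def preprocess_text(text):
    return re.sub(r'[^A-Za-z. ]', '', text).upper()

def create_trigram_model(text):
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

# Step 1: hypothetical stand-ins for the contents of two files
documents = ["The cat.", "the hat"]

combined = defaultdict(int)
for raw in documents:
    cleaned = preprocess_text(raw)            # Step 2: clean and uppercase
    per_doc = create_trigram_model(cleaned)   # Step 3: count trigrams in this text
    for trigram, count in per_doc.items():    # Step 4: add to the overall model
        combined[trigram] += count

print(combined["THE"], len(combined))  # 2 9
```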

After processing all files we print the number of unique trigrams and a sample of 10 trigrams with their number of occurrences.

We then sort the trigrams in 'multiple_trigram_model' by frequency in descending order. This allows us to identify the top 10 most common trigrams across all files, which are printed in the cell above.
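
For reference, the standard library offers an equivalent shortcut for the top-N step. The sketch below uses made-up counts (not the figures from the books) to show that sorted(...) with a slice and Counter.most_common give the same result when there are no ties.

```python
from collections import Counter

# Hypothetical counts, for illustration only
toy_model = {"THE": 5, " TH": 7, "HE ": 4, "AND": 2}

# Same approach as the notebook: sort (trigram, count) pairs by count, largest first
by_sorted = sorted(toy_model.items(), key=lambda item: item[1], reverse=True)[:2]

# Counter.most_common(n) returns the n highest-count pairs directly
by_counter = Counter(toy_model).most_common(2)

print(by_sorted)   # [(' TH', 7), ('THE', 5)]
print(by_counter)  # [(' TH', 7), ('THE', 5)]
```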

TASK 2: THIRD-ORDER LETTER APPROXIMATION GENERATION

In this task we use the trigram model created in Task 1 to generate a string of 10,000 characters. We start with the initial string "TH" and generate each subsequent character by looking at the previous two characters.
For each pair of characters (bigram), we find all trigrams in the model that start with those two characters.

In [25]:
from collections import defaultdict
import re
import random

# Function to preprocess text by removing unwanted characters and converting to uppercase
def preprocess_text(text):
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)  # Keeping only letters, spaces, and periods
    return cleaned_text.upper()  # Converting all letters to uppercase

# Function to create a trigram model from a given text
def create_trigram_model(text):
    trigram_counts = defaultdict(int)
    for i in range(len(text) - 2):  # Looping through each character, stopping two characters from the end
        trigram = text[i:i+3]  # Extracting three consecutive characters as a trigram
        trigram_counts[trigram] += 1  # Counting each trigram
    return trigram_counts


EXTRACTING POSSIBLE TRIGRAMS BASED ON BIGRAM

In each iteration, we use the last two characters of the generated text (the bigram) to find all trigrams in the model that start with these characters.

In [None]:
texts = [
    '/workspaces/Emerging-Technologies/Data/Frankenstein.txt',
    '/workspaces/Emerging-Technologies/Data/mobydick.txt',
    '/workspaces/Emerging-Technologies/Data/Pride.txt',
    '/workspaces/Emerging-Technologies/Data/Romeo.txt',
    '/workspaces/Emerging-Technologies/Data/Scarlet.txt',
    '/workspaces/Emerging-Technologies/Data/What to do at the moment.txt'
]

# Initializing a defaultdict to store the combined trigram model
multiple_trigram_model = defaultdict(int)

# Loop through each file: read, preprocess, and update trigram counts
for filename in texts:
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read()
        processed_text = preprocess_text(text)  # Clean and standardize text
        trigram_model = create_trigram_model(processed_text)  # Counting trigrams in text
        
        # Combining the trigrams from each file into the overall model
        for trigram, count in trigram_model.items():
            multiple_trigram_model[trigram] += count

# Checking the number of unique trigrams
print(f"Unique trigrams in model: {len(multiple_trigram_model)}")


Unique trigrams in model: 8698


WEIGHTED RANDOM SELECTION OF NEXT CHARACTER

Using the counts of the candidate trigrams as weights, we randomly select the next character, so that more frequent continuations are proportionally more likely to be chosen.
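
The sketch below shows what count-based weighting means in practice. The three counts are made up for illustration; the fixed seed is there only so the run is repeatable.

```python
import random
from collections import Counter

# Hypothetical counts for trigrams that start with the bigram "TH"
candidates = {"THE": 6, "THA": 3, "THI": 1}

third_chars = [tri[2] for tri in candidates]
weights = list(candidates.values())
total = sum(weights)

# The implied probability of each next character is its count divided by the total
probabilities = {ch: w / total for ch, w in zip(third_chars, weights)}
print(probabilities)  # {'E': 0.6, 'A': 0.3, 'I': 0.1}

# Drawing many samples should give proportions close to those probabilities
rng = random.Random(0)
draws = Counter(rng.choices(third_chars, weights=weights, k=10_000))
print({ch: draws[ch] / 10_000 for ch in third_chars})
```

random.choices normalizes the weights itself, so raw counts can be passed in directly without converting them to probabilities first.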

In [33]:
# Function to generate text based on the trigram model
def generate_text(trigram_model, length=10000, start="TH"):
    
    generated_text = start  # Start with initial seed
    while len(generated_text) < length:
        # Get the last two characters to form the current bigram
        bigram = generated_text[-2:]
        
        # Finding all trigrams that start with this bigram
        possible_trigrams = {tri: count for tri, count in trigram_model.items() if tri.startswith(bigram)}
        
        # Breaking if no possible trigrams are found for the current bigram
        if not possible_trigrams:
            break

        # Extracting the third character of each trigram and its frequency as weights
        third_chars = [tri[2] for tri in possible_trigrams]
        weights = list(possible_trigrams.values())

        # Randomly selecting the next character based on the weights
        next_char = random.choices(third_chars, weights=weights, k=1)[0]
        
        # Appending the selected character to the generated text
        generated_text += next_char

    return generated_text

# Sample text of 100 characters
sample_text = generate_text(multiple_trigram_model, length=100)
print("Sample generated text (100 characters):")
print(sample_text)



Sample generated text (100 characters):
THIND DUCH HE WICTICHME OF HUSIGHADDEEM ITHER VE PAINS WILLOFF NOW SHUSTIOR HOU ALED RE THE COMALE A


FINALIZING THE FUNCTION AND GENERATING TEXT

In [19]:
# Generating a string of 10,000 characters
generated_string = generate_text(multiple_trigram_model, length=10000)

# Printing the first 500 characters to check the output
print("Generated Text Sample (first 500 characters):")
print(generated_string[:500])


Generated Text Sample (first 500 characters):
THER WOREAS FIED PLE CARE THISPEACK ITY OT I DED AN AND AGODS THE VERE.TAINALLY MANDAUT FOR ANDAM AN LEMAIREPOIN DUS AT TH ING IND AND LID SETHEHER. BENTESID HEYMENTS RE OF LINIS SHEHE PECT MADUALLINVENE MING AL A THOURSOLLARL NOBSUAGE SHE THROJECTINTRANIMEEN ATARAT HAD IS. THERIN THER INGS ONT LETS PROU HILL PAR DINGS OF IN. THEITHELLY FORLREGIT SEEM OR POING OFTH SOLLY AND MAT WITHE WORK AS FULD NORES AND ALLOOD RIGHTES. IND DREARTLE AN IN TO BULD LOWS WIM UND HE A BYTHE ING AS HATING ARCIRE W


TASK 2 REVIEW

The generate_text function starts with "TH" and generates each new character based on the previous two.

It uses the trigram model to make probabilistic selections.

It stops at the desired length, or when there are no available trigrams for the bigram.
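
One possible refinement, sketched here as an assumption rather than as part of the notebook's method: generate_text scans the whole model for every generated character, whereas grouping the trigrams by bigram once up front turns each step into a dictionary lookup. The names build_bigram_index and generate_text_indexed are hypothetical, and the toy model is built from a made-up string.

```python
import random
from collections import defaultdict

def build_bigram_index(trigram_model):
    """Group trigram counts by their first two characters."""
    index = defaultdict(lambda: ([], []))
    for trigram, count in trigram_model.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def generate_text_indexed(trigram_model, length=100, start="TH", rng=random):
    index = build_bigram_index(trigram_model)  # built once, reused for every step
    out = list(start)
    while len(out) < length:
        bigram = "".join(out[-2:])
        if bigram not in index:  # no trigram starts with this bigram: stop early
            break
        chars, weights = index[bigram]
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

# Hypothetical toy model built from one short string
text = "THE THEME OF THE THESIS IS THE THEORY."
toy_model = defaultdict(int)
for i in range(len(text) - 2):
    toy_model[text[i:i + 3]] += 1

result = generate_text_indexed(toy_model, length=40, rng=random.Random(1))
print(result)
```

The selection rule is unchanged (same candidates, same weights), so the output has the same statistical character; only the lookup cost per character differs.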

TASK 3: ANALYZING THE MODEL

In this task, we analyze the text generated in Task 2 by determining the percentage of its words that are valid English words.

In [None]:
with open('/workspaces/Emerging-Technologies/Data/words.txt', 'r') as file:
    english_words = set(word.strip().upper() for word in file)  # Use a set for fast lookup

# Checking a few sample words to confirm they loaded correctly
print("Sample English words from words.txt:", list(english_words)[:10])


Sample English words from words.txt: ['ACKNOWLEDGMENT', 'OVERWRITE', 'DECEIVES', 'PHILOSOPHIZES', 'ABASH', 'ABDUCTIONS', 'REPLETE', 'UNANIMITY', 'BUSHES', 'DAMPING']


This code cell reads words.txt and stores each word in english_words, a set that enables fast membership lookup. It also converts every word to uppercase so that the comparison with the uppercase generated text is case-insensitive.

EXTRACTING WORDS FROM GENERATED TEXT AND CHECK AVAILABILITY

In [30]:
import re

# Extracting words from the generated text by splitting on non-letter characters
words_in_generated_text = re.findall(r'\b[A-Z]+\b', generated_string)  # Extract only alphabetic "words"

# Counting the number of valid English words in the generated text
valid_word_count = sum(1 for word in words_in_generated_text if word in english_words)
total_word_count = len(words_in_generated_text)

# Calculating the percentage of valid English words
valid_word_percentage = (valid_word_count / total_word_count) * 100 if total_word_count > 0 else 0

# Printing the results
print(f"Total words in generated text: {total_word_count}")
print(f"Valid English words: {valid_word_count}")
print(f"Percentage of valid English words: {valid_word_percentage:.2f}%")


Total words in generated text: 1680
Valid English words: 603
Percentage of valid English words: 35.89%


re.findall extracts all alphabetic sequences from generated_string. This ensures that only "words" (runs of letters) are counted, with spaces and periods acting as separators.
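
A small worked example of the same check, using a made-up word list and a made-up generated string (both are assumptions for illustration). It shows that a period with no surrounding space still separates two words.

```python
import re

# Hypothetical word list and generated string, for illustration only
english_words = {"THE", "AND", "WORK"}
generated = "THE WITHE WORK.AND FULD"

# Each maximal run of uppercase letters counts as one word
words = re.findall(r'\b[A-Z]+\b', generated)
valid = [w for w in words if w in english_words]
percentage = len(valid) / len(words) * 100 if words else 0

print(words)                 # ['THE', 'WITHE', 'WORK', 'AND', 'FULD']
print(f"{percentage:.2f}%")  # 60.00%
```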
