INTRODUCTION

This project builds a trigram model from the text of books available on Project Gutenberg. A trigram is a sequence of three consecutive characters in a piece of text. By counting how often each three-character sequence appears, we can start to see patterns in how the English language is used.


DATA COLLECTION

Reading books from Project Gutenberg as plain text in UTF-8 format.

https://www.gutenberg.org/

In [1]:
# Open the book at the given path in read mode, using UTF-8 encoding
with open('/workspaces/Emerging-Technologies/Data/Frankenstein.txt', 'r', encoding='utf-8') as file:
    text = file.read()  # the 'with' block ensures the file is closed after reading

# Print the first 500 characters to check that it is working
# print("text sample", text[:500])

TEXT PROCESSING

re - Regular Expression Operations 

https://docs.python.org/3/library/re.html

In [2]:
import re  # regular expressions module, used for pattern matching

# Function to process the input text
def processed_text(text):

    # Remove any characters that are not letters, periods, or spaces.
    # The pattern [^A-Za-z. ] matches every character that is not an uppercase
    # letter, a lowercase letter, a period, or a space; each match is replaced
    # with an empty string.
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)

    # Return the cleaned text converted to uppercase
    return cleaned_text.upper()

# Apply the function to the text variable. The result is stored under a
# different name so that the processed_text function is not overwritten.
clean_text = processed_text(text)

# print("Processed text sample:", clean_text[:500])
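
To make the effect of the cleaning step concrete, the sketch below applies the same pattern to a short made-up string. The sample string and the name `demo_clean` are illustrative only and are not part of the project. Note that newline characters are also removed by this pattern, so the words on either side of a line break end up joined together; whether that matters depends on how the trigram counts will be used.

```python
import re

def demo_clean(raw):
    # Same pattern as processed_text: keep only letters, periods and spaces,
    # then convert to uppercase
    return re.sub(r'[^A-Za-z. ]', '', raw).upper()

# Hypothetical sample containing punctuation, digits and a line break
sample = "Hello, World!\nIt's 2024."
cleaned = demo_clean(sample)
print(cleaned)  # HELLO WORLDITS .
```

The comma, exclamation mark, apostrophe, digits and the newline are all dropped, which is why "World" and "Its" run together in the result.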

TRIGRAM DESIGN FOR A SINGLE TEXT


A trigram is a sequence of three characters, so we need to go through the processed text, take three characters at a time, and count how many times each sequence occurs. To do this we use Python's defaultdict from the collections module.

defaultdict(int) creates a dictionary in which any missing key has a default value of 0.

When we find a new trigram, its count starts at 0, so we can increment it directly.
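
As a small illustration of that behaviour, the sketch below counts a few made-up keys with defaultdict(int) and, for comparison, with a plain dict. The keys and variable names are illustrative only.

```python
from collections import defaultdict

counts = defaultdict(int)
counts["THE"] += 1  # "THE" is missing, so it starts at 0 and becomes 1
counts["THE"] += 1
counts["AND"] += 1

# The same bookkeeping with a plain dict needs an explicit default value
plain = {}
for key in ["THE", "THE", "AND"]:
    plain[key] = plain.get(key, 0) + 1

print(dict(counts))  # {'THE': 2, 'AND': 1}
print(plain)         # {'THE': 2, 'AND': 1}
```

Both approaches give the same counts; defaultdict simply removes the need to handle the first occurrence of a key separately.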


In [16]:
from collections import defaultdict  # dictionary subclass that supplies a default value for missing keys

# Defining a function to create a trigram model from the input text
def create_trigram_model(text):
    # Create a dictionary with a default value of 0 which will store the counts of each trigram
    trigram_counts = defaultdict(int)

    # Loop through the text, stopping 2 characters before the end to form trigrams
    for i in range(len(text)-2):

        # Extract a trigram starting at position 'i'
        trigram = text[i:i+3]

        # Increment the count of this trigram in the dictionary
        trigram_counts[trigram] += 1

    # Return the dictionary containing the trigram counts
    return trigram_counts

# Sample sentence provided in the README
sample_text = "IT IS WHAT IT IS."

# Apply the trigram model function to the sample text and store the result in trigram_model
trigram_model = create_trigram_model(sample_text)

# Printing the trigram counts
for trigram, count in trigram_model.items():
    print(f"Trigram: {trigram} | Count: {count}")



Trigram: IT  | Count: 2
Trigram: T I | Count: 3
Trigram:  IS | Count: 2
Trigram: IS  | Count: 1
Trigram: S W | Count: 1
Trigram:  WH | Count: 1
Trigram: WHA | Count: 1
Trigram: HAT | Count: 1
Trigram: AT  | Count: 1
Trigram:  IT | Count: 1
Trigram: IS. | Count: 1


Each trigram is a sequence of three characters, extracted by iterating through the text. For each one, we simply increment its count in the dictionary.
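
Once the counts are in a dictionary, a natural next step is to rank them. The sketch below is one possible way to do that, using `collections.Counter` from the standard library in place of defaultdict. The helper name `top_trigrams` is hypothetical and not part of the project.

```python
from collections import Counter

def top_trigrams(text, n=3):
    # Count every three-character window, then return the n most frequent
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return counts.most_common(n)

top = top_trigrams("IT IS WHAT IT IS.")
print(top[0])  # ('T I', 3)
```

For the sample sentence, the most frequent trigram is "T I" with a count of 3, which agrees with the output listed above.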

TRIGRAM DESIGN FOR MULTIPLE TEXTS 

In [4]:
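import os
from collections import defaultdict

# Assumption: the other books sit in the same Data directory as Frankenstein.txt
data_dir = '/workspaces/Emerging-Technologies/Data'
texts = ['Frankenstein.txt', 'mobydick.txt', 'Pride.txt', 'Romeo.txt', 'Scarlet.txt']

# Combined trigram counts across all the books
multiple_trigram_model = defaultdict(int)

for filename in texts:
    with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as file:
        book_text = file.read()

    # Clean the book and count its trigrams. processed_text must still refer to
    # the function defined above, so its result is not assigned back to that name.
    book_trigrams = create_trigram_model(processed_text(book_text))

    # Add this book's counts to the combined model
    for trigram, count in book_trigrams.items():
        multiple_trigram_model[trigram] += count

# Printing the number of unique trigrams
print(f"Unique trigrams: {len(multiple_trigram_model)}")

# Printing a sample of 10 trigrams and their counts
for trigram, count in list(multiple_trigram_model.items())[:10]:
    print(f"'{trigram}': {count}")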
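
The merging step can be tried without any files. The following is a minimal sketch on two short in-memory strings; `count_trigrams` and `merge_counts` are illustrative names, and the strings are made up for the example.

```python
from collections import defaultdict

def count_trigrams(text):
    # Count every three-character window in the text
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

def merge_counts(models):
    # Sum the counts of each trigram across all the given models
    merged = defaultdict(int)
    for model in models:
        for trigram, count in model.items():
            merged[trigram] += count
    return merged

merged = merge_counts([count_trigrams("IT IS."), count_trigrams("IT WAS.")])
print(len(merged))    # 8 unique trigrams
print(merged["IT "])  # 2, one occurrence from each string
```

The two strings share only the trigram "IT ", so its counts are added together while every other trigram keeps the count from its own string.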