INTRODUCTION

This project aims to build a trigram model using text from books available on Project Gutenberg. A trigram is a sequence of three consecutive characters from a piece of text. By counting how many times each three-character sequence occurs, we can start to see patterns in how the English language is used.


DATA COLLECTION

Reading books from Project Gutenberg as plain text in UTF-8 format

https://www.gutenberg.org/

In [8]:
# Open the book in read mode with UTF-8 encoding; 'with' ensures the file is closed after reading
with open('/workspaces/Emerging-Technologies/Data/Frankenstein.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Print the first 500 characters to check that it is working
print("text sample", text[:500])

text sample ﻿The Project Gutenberg eBook of Frankenstein; Or, The Modern Prometheus
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
befor
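
The sample above appears to begin with an invisible byte order mark (U+FEFF), which Project Gutenberg's UTF-8 files often carry. The cleaning step later in this notebook removes it anyway, so the following is optional. It is a minimal sketch, assuming a small temporary demo file (hypothetical, not the Frankenstein file), of how Python's 'utf-8-sig' codec drops the mark at read time:

```python
import os
import tempfile

# Hypothetical demo file, written with a BOM so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8-sig",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write("The Project Gutenberg eBook")
    demo_path = tmp.name

# Plain 'utf-8' keeps the BOM as the first character of the string.
with open(demo_path, "r", encoding="utf-8") as file:
    with_bom = file.read()

# 'utf-8-sig' strips a leading BOM if one is present.
with open(demo_path, "r", encoding="utf-8-sig") as file:
    without_bom = file.read()

os.remove(demo_path)  # clean up the demo file

print(with_bom.startswith("\ufeff"))     # True
print(without_bom.startswith("\ufeff"))  # False
```

Reading with 'utf-8-sig' is also safe for files that have no byte order mark, since the codec only removes one when it is there.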


TEXT PROCESSING

re - Regular Expression Operations 

https://docs.python.org/3/library/re.html

In [11]:
import re  # regular expressions module, used for pattern matching

# Function to process the input text
def process_text(text):

    # Remove every character that is not a letter, a period, or a space.
    # [^A-Za-z. ] matches any character that is not an uppercase letter, a lowercase letter,
    # a period, or a space; each match is replaced with an empty string.
    # Newlines are removed as well, so words on either side of a line break are joined.
    cleaned_text = re.sub(r'[^A-Za-z. ]', '', text)

    # Return the cleaned text converted to uppercase
    return cleaned_text.upper()

# Apply the function to the text variable. The function is named process_text so that
# the result variable processed_text does not overwrite it when the cell is re-run.
processed_text = process_text(text)

print("Processed text sample:", processed_text[:500])

Processed text sample: THE PROJECT GUTENBERG EBOOK OF FRANKENSTEIN OR THE MODERN PROMETHEUS    THIS EBOOK IS FOR THE USE OF ANYONE ANYWHERE IN THE UNITED STATES ANDMOST OTHER PARTS OF THE WORLD AT NO COST AND WITH ALMOST NO RESTRICTIONSWHATSOEVER. YOU MAY COPY IT GIVE IT AWAY OR REUSE IT UNDER THE TERMSOF THE PROJECT GUTENBERG LICENSE INCLUDED WITH THIS EBOOK OR ONLINEAT WWW.GUTENBERG.ORG. IF YOU ARE NOT LOCATED IN THE UNITED STATESYOU WILL HAVE TO CHECK THE LAWS OF THE COUNTRY WHERE YOU ARE LOCATEDBEFORE USING THIS E
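
In the sample above, words on either side of a line break are fused together (ANDMOST, RESTRICTIONSWHATSOEVER), because the pattern deletes newline characters instead of turning them into spaces. This produces trigrams such as DMO that do not occur in the original text. A minimal sketch of one possible variant follows; the function name process_text_keep_word_breaks is hypothetical and is not part of the notebook:

```python
import re

def process_text_keep_word_breaks(text):
    # Hypothetical variant of the cleaning step, shown for illustration only.
    # Step 1: remove everything except letters, periods, and whitespace.
    cleaned = re.sub(r'[^A-Za-z.\s]', '', text)
    # Step 2: collapse each run of whitespace (spaces, tabs, newlines) into one space.
    cleaned = re.sub(r'\s+', ' ', cleaned)
    # Step 3: convert to uppercase, as in the original function.
    return cleaned.upper()

sample = "the United States and\nmost other parts"
print(process_text_keep_word_breaks(sample))
# THE UNITED STATES AND MOST OTHER PARTS
```

Note that this variant also collapses runs of several spaces into one, so its trigram counts would differ from the ones printed later in this notebook.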


TRIGRAM MODEL FOR A SINGLE TEXT


A trigram is a sequence of three characters, so we need to slide a three-character window along the processed text, one character at a time, and count the number of times each sequence occurs. To do this we use Python's defaultdict from the collections module.

The defaultdict(int) creates a dictionary where each trigram has a default value of 0.
When we find a new trigram, its count starts at 0, and we can increment it directly.
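
The difference from a plain dictionary can be seen in a small sketch. The keys used here are only illustrative:

```python
from collections import defaultdict

# A plain dict raises KeyError when we increment a key it has not seen yet.
plain_counts = {}
try:
    plain_counts["THE"] += 1
except KeyError:
    print("plain dict: KeyError on first increment")

# defaultdict(int) calls int() for a missing key, which returns 0,
# so the first increment works without any special case.
counts = defaultdict(int)
counts["THE"] += 1
counts["THE"] += 1
counts["HE "] += 1

print(dict(counts))  # {'THE': 2, 'HE ': 1}
```

One thing to keep in mind is that simply reading a missing key, such as counts["XYZ"], also inserts it with a value of 0. Using counts.get("XYZ", 0) or "XYZ" in counts avoids adding entries by accident.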


In [15]:
from collections import defaultdict # Importing defaultdict from the collections module to create a dictionary with default values

# Defining a function to create a trigram model from the input text
def create_trigram_model(text):
    # Create a dictionary with a default value of 0 which will store the counts of each trigram
    trigram_counts = defaultdict(int)

    # Loop through the text, stopping 2 characters before the end to form trigrams
    for i in range(len(text)-2):

        # Extract a trigram starting at position 'i'
        trigram = text[i:i+3]

        # Increment the count of the trigram in the dictionary
        trigram_counts[trigram] += 1

    # Return the dictionary containing the count of each trigram
    return trigram_counts

# Sample sentence provided in the README (not used in this cell)
sample_text = "IT IS WHAT IT IS."

# Apply the trigram model function to the processed text and store the result in trigram_model
trigram_model = create_trigram_model(processed_text)

# Printing the trigram counts
for trigram, count in trigram_model.items():
    print(f"Trigram: {trigram} | Count: {count}")



Trigram: THE | Count: 5887
Trigram: HE  | Count: 4756
Trigram: E P | Count: 509
Trigram:  PR | Count: 645
Trigram: PRO | Count: 507
Trigram: ROJ | Count: 92
Trigram: OJE | Count: 92
Trigram: JEC | Count: 167
Trigram: ECT | Count: 611
Trigram: CT  | Count: 256
Trigram: T G | Count: 123
Trigram:  GU | Count: 146
Trigram: GUT | Count: 97
Trigram: UTE | Count: 169
Trigram: TEN | Count: 507
Trigram: ENB | Count: 101
Trigram: NBE | Count: 107
Trigram: BER | Count: 188
Trigram: ERG | Count: 108
Trigram: RG  | Count: 74
Trigram: G E | Count: 46
Trigram:  EB | Count: 20
Trigram: EBO | Count: 28
Trigram: BOO | Count: 38
Trigram: OOK | Count: 154
Trigram: OK  | Count: 76
Trigram: K O | Count: 76
Trigram:  OF | Count: 2720
Trigram: OF  | Count: 2551
Trigram: F F | Count: 83
Trigram:  FR | Count: 631
Trigram: FRA | Count: 65
Trigram: RAN | Count: 320
Trigram: ANK | Count: 103
Trigram: NKE | Count: 43
Trigram: KEN | Count: 116
Trigram: ENS | Count: 214
Trigram: NST | Count: 197
Trigram: STE | Count: