## **Task 1: Third-Order Letter Approximation Model** 
**Build a trigram model that counts every 3-character sequence in a text**

**Steps**
1. Select 5 English works from Project Gutenberg
2. Read in the books
3. Process the text to remove the preamble and postamble
4. Convert all letters to uppercase and keep only the English alphabet, spaces and full stops
5. Create a trigram model using a dictionary

**Import Required Modules**
1. Random for making random choices based on weights
2. Collections for efficient data structures
3. Json for exporting data in JSON format

**References**
- [Random Module](https://docs.python.org/3/library/random.html)
- [Collections Module](https://docs.python.org/3/library/collections.html)
- [Json Module](https://docs.python.org/3/library/json.html)

In [56]:
# Imports.
# Selecting random items from lists.
import random
# Efficient data structures.
import collections
# Exporting data in JSON format.
import json

**Method for Reading in the Files and Preprocessing the Text**
- Read in each book and convert all text to uppercase
- Remove the preamble and postamble by splitting the text on the distinctive Project Gutenberg start and end markers
- Keep only the relevant characters: letters, spaces and full stops

**References**
- [String Manipulation](https://docs.python.org/3/library/stdtypes.html#string-methods)
- [Splitting the text](https://www.freecodecamp.org/news/how-to-split-a-string-in-python/)
- [Class work](https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb)

In [57]:
#Clean the text by converting it to uppercase and keeping only letters, spaces and full stops
def readAndCleanBook(filePath):
    with open(filePath, 'r', encoding='utf-8') as file:#open the file
        text = file.read().upper()#store the read in file in a variable
    #sentences that are at the start and end of the actual content
    startOfBook = "*** START OF THE PROJECT GUTENBERG EBOOK"
    endOfBook = "*** END OF THE PROJECT GUTENBERG EBOOK"
    #Cut out the preamble and postamble of the Project Gutenberg books I picked
    #(the text was already converted to uppercase above, so the markers match)
    #https://www.freecodecamp.org/news/how-to-split-a-string-in-python/
    english = text.split(startOfBook, 1)[-1]
    english = english.split(endOfBook, 1)[0]
    # The characters to keep.
    keepTheseCharacters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ .'
    # Remove unwanted characters.
    cleanedText = ''.join(c for c in english if c in keepTheseCharacters)
    
    return cleanedText
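
To see what the two `split` calls and the character filter do, here is a minimal sketch that applies the same steps to an in-memory string rather than a file. The helper name `cleanString` and the sample text are made up for illustration; they are not part of the notebook.

```python
# Hypothetical helper: the same steps as readAndCleanBook, applied to a string.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"
KEEP = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ .")

def cleanString(rawText):
    text = rawText.upper()
    # Keep everything after the start marker, then everything before the end marker.
    body = text.split(START_MARKER, 1)[-1]
    body = body.split(END_MARKER, 1)[0]
    # Drop every character that is not a letter, a space or a full stop.
    return ''.join(c for c in body if c in KEEP)

# Made-up sample text with a preamble and a postamble.
sample = (
    "Licence notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Hello, world. It's 5 o'clock!\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More licence notes\n"
)
cleaned = cleanString(sample)
print(repr(cleaned))
```

Two things are worth noticing in the result: the rest of the start-marker line (here the word `SAMPLE`) survives the split, and newlines are removed rather than replaced with spaces, so a word at the end of one line can fuse with the first word of the next.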


**Building the Trigram Model**
- This model counts the occurrences of every 3-character sequence in the cleaned text
- Initialise an empty dictionary to store the trigram counts
- Slide a 3-character window over the cleaned text, one position at a time
- For each trigram found, update its count in the dictionary

**References**
- [Dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries)
- [Class work](https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb)

In [58]:
#method to build the trigram model by counting how many times 3 char sequences show up
def makeTrigramModel(book):
    #create a default int dictionary
    #https://docs.python.org/3/tutorial/datastructures.html#dictionaries
    trigramModel = collections.defaultdict(int)
    #loop through every starting position in the text
    #https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb
    for i in range(len(book) - 2):
        #this is getting the 3 char sequence
        trigram = book[i:i+3]
        #increment the count
        trigramModel[trigram] += 1
    #return the built trigram model
    return trigramModel
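
As a small worked example of the sliding window, the sketch below counts the trigrams of a very short string. The function name `countTrigrams` is hypothetical and simply mirrors the logic above; the `collections.Counter` version is an alternative way to get the same counts, not the approach used in this notebook.

```python
import collections

# Hypothetical copy of the counting logic, for a self-contained demonstration.
def countTrigrams(text):
    counts = collections.defaultdict(int)
    # The last valid window starts at index len(text) - 3.
    for i in range(len(text) - 2):
        counts[text[i:i+3]] += 1
    return counts

example = countTrigrams("THE THE")
print(dict(example))

# Alternative: Counter fed by a generator of 3-character slices.
text = "THE THE"
counterVersion = collections.Counter(text[i:i+3] for i in range(len(text) - 2))
print(counterVersion == example)
```

The 7-character string yields 5 windows: `THE`, `HE `, `E T`, ` TH` and `THE` again, so `THE` is counted twice and the others once.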

**Combine all 5 works to make a bigger data set for the model**
- Use a list comprehension to loop through a range of numbers
- For every index i, call the function to read and clean the book at that file path
- Store the results in the list 'works'
- Then join them all together to make one big text

**References**
- [List Comprehensions in Python](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)

In [59]:
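#https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
#combine the texts to make a bigger data set for the model
#NOTE: the path below does not use i, so the same file (voyaging.txt) is read five times;
#each iteration needs its own file path for five different works to be combined
works = [readAndCleanBook(f"books/voyaging.txt") for i in range(1, 6)]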
combinedTexts = ' '.join(works)
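
As written, the comprehension above uses the same path on every iteration. One way to combine several different files is to pass an explicit list of paths, sketched below. The helper `combineBooks`, the stand-in cleaner `readUpper` and the file names are all assumptions made for this demonstration; in the notebook the cleaner would be `readAndCleanBook` and the paths would be the five chosen books.

```python
import os
import tempfile

# Hypothetical helper: clean each file once and join the results with a space.
def combineBooks(filePaths, cleanFunction):
    return ' '.join(cleanFunction(path) for path in filePaths)

# Stand-in cleaner for the demonstration only.
def readUpper(path):
    with open(path, 'r', encoding='utf-8') as file:
        return file.read().strip().upper()

# Create two throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    demoPaths = []
    for name, content in [("first.txt", "one book"), ("second.txt", "another book")]:
        path = os.path.join(folder, name)
        with open(path, 'w', encoding='utf-8') as file:
            file.write(content)
        demoPaths.append(path)
    combined = combineBooks(demoPaths, readUpper)

print(combined)
```

Listing the paths explicitly avoids depending on a numbering scheme in the file names and makes it obvious which works went into the model.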

**Call the method to make a trigram model and then print the results**

In [60]:
trigramModel = makeTrigramModel(combinedTexts)
print(trigramModel)

defaultdict(<class 'int'>, {' DA': 440, 'DAV': 95, 'AVI': 140, 'VID': 75, 'ID ': 185, 'D G': 125, ' GO': 395, 'GOE': 65, 'OES': 90, 'ES ': 1365, 'S V': 90, ' VO': 100, 'VOY': 35, 'OYA': 35, 'YAG': 35, 'AGI': 25, 'GIN': 120, 'ING': 1960, 'NG ': 1990, 'G  ': 20, '   ': 9270, '  D': 69, 'D  ': 80, '  G': 25, '  B': 15, ' BY': 205, 'BY ': 190, 'Y  ': 35, 'D B': 430, ' BI': 565, 'BIN': 60, 'INN': 25, 'NNE': 40, 'NEY': 5, 'EY ': 560, 'Y P': 105, ' PU': 280, 'PUT': 160, 'UTN': 55, 'TNA': 60, 'NAM': 235, 'AM ': 175, 'M  ': 45, '  W': 40, ' WI': 850, 'WIT': 615, 'ITH': 620, 'TH ': 740, 'H I': 180, ' IL': 25, 'ILL': 365, 'LLU': 10, 'LUS': 15, 'UST': 200, 'STR': 240, 'TRA': 195, 'RAT': 210, 'ATI': 70, 'TIO': 150, 'ION': 225, 'ONS': 175, 'NS ': 320, 'S F': 285, ' FR': 530, 'FRO': 365, 'ROM': 315, 'OM ': 405, 'M P': 35, ' PH': 15, 'PHO': 10, 'HOT': 105, 'OTO': 30, 'TOG': 35, 'OGR': 15, 'GRA': 110, 'RAP': 30, 'APH': 15, 'PHS': 5, 'HS ': 45, 'S A': 1560, ' AN': 3230, 'AND': 3625, 'ND ': 3535, ' DE': 

## **Task 2: Generate Text Using the Trigram Model**
**Generate a string of length 10000 based on the trigram model**

- Start with the seed string "TH"
- Look up the trigrams that begin with the last 2 characters of the current string
- Randomly pick the next character, weighted by the trigram counts
- Append the character and repeat until the target length (10000) is reached
- Return the generated text

**References**
- [random.choices](https://docs.python.org/3/library/random.html#random.choices)
- [Class work](https://github.com/ianmcloughlin/2425_emerging_technologies/blob/main/02_language_models.ipynb)
- [zip Function in Python](https://docs.python.org/3/library/functions.html#zip)
- [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)

In [61]:
def generateText(trigramModel, startText="TH", length=10000):
    #start the generated text by storing the initial string given in the task
    generatedText = startText
    #keep going until the text reaches the requested length
    while len(generatedText) < length:
        #take out the last 2 chars from the current text
        lastTwoChars = generatedText[-2:]
        #find the matching trigrams
        matchingTrigrams = {tri: count for tri, count in trigramModel.items() if tri.startswith(lastTwoChars)}
        #if there are no matching trigrams stop the loop
        if not matchingTrigrams:
            break
        #unpack the trigrams and their counts from the matching trigrams dictionary   
        #https://docs.python.org/3/library/functions.html#zip
        trigrams, counts = zip(*matchingTrigrams.items())
        #Extract the 3rd char from each trigram and randomly select the next char, weighted by frequency
        nextCharacter = random.choices([tri[2] for tri in trigrams], weights=counts)[0]
        generatedText += nextCharacter
    #return the generated text (shorter than requested if no matching trigram was found)
    return generatedText
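
The function above scans the whole model on every iteration to find matching trigrams. A possible alternative, sketched below under the assumption that every key in the model is exactly 3 characters long, is to group the counts by their 2-character prefix once and then look each prefix up directly. The names `buildPrefixIndex`, `generateTextIndexed` and `toyCounts` are hypothetical.

```python
import collections
import random

# Hypothetical helper: map each 2-char prefix to (next characters, weights).
def buildPrefixIndex(trigramCounts):
    index = collections.defaultdict(lambda: ([], []))
    for trigram, count in trigramCounts.items():
        chars, weights = index[trigram[:2]]
        chars.append(trigram[2])
        weights.append(count)
    return index

def generateTextIndexed(trigramCounts, startText="TH", length=100):
    index = buildPrefixIndex(trigramCounts)
    pieces = list(startText)
    while len(pieces) < length:
        prefix = ''.join(pieces[-2:])
        # Stop if no trigram in the model starts with this prefix.
        if prefix not in index:
            break
        chars, weights = index[prefix]
        pieces.append(random.choices(chars, weights=weights)[0])
    return ''.join(pieces)

# Toy model in which every prefix has exactly one possible next character.
toyCounts = {"THE": 3, "HE ": 2, "E T": 2, " TH": 2}
sampleText = generateTextIndexed(toyCounts, startText="TH", length=20)
print(sampleText)
```

Because each prefix in the toy model has a single continuation, the output is deterministic and simply cycles through `THE `. With a real model the weighted choice behaves the same way as in `generateText`; only the lookup changes.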

**Pass in the parameters for the generateText method and print results**

In [62]:
genText = generateText(trigramModel, startText="TH", length=10000)
print(genText)

THEAND WALL PARD AHUGHT SHEME SE ING WE WHIGHT WALOORTY MED ANG.ITTLONLY DY TER DWINKTO FOOKS. WE SPAND PUTS GROCK TWIGHTFISTHE NING AND I FOOM THE FOULDERE P. ING THE SONCABLOT ON ONDUNCES CRUS AROST UP ABEAREN ENTER. SE ANDER THE TWO OFF VOLOOKED ANING WIN ME SIX OURNWAT OUGHT FRATO VERY JUSE FORGETO NE LIT ANYTHE POUNNY DUS WINELLE OF SLAVERE THEY SEELEARCE. AND CHEND LIKE TOODIE DEN WECANICKE SHORS OF MAT I IN THEN HAT STRICAMENTY GROLD ARE HAD THEIGHT SIMENTER INT APTAIT TRATERYON.ANTO ONERY THE ROMENT MA WATEP. INE IN AND NE THER BRIN DAD TO THE I GOSSIZARE FRIVEREA GE LE WHE WE ING. THEWAYEDIA. EVE ARD ALT WHITERE OURUIN ONE ON ANDAY DOL ORE THEAT FIST IMBLUEE LACE THE HOPHER. INGWINGE LIKE ONET HUGHTISWHING AND OF THE TO HANDONT VENTO WHOUSUNDKNOTTIONG ARK ONG WAY SEEP THASS TWIT IT MALF SOMWE A SO SHATER LAN FIS WIGHT NOTY BRIBROMP TO WAS OF COMING.                    CRY LANAM                                                       DEES.COPED                          FIS OF WEL

## **Task 3: Analyse My Model**
- Calculate the percentage of words in the generated text that are valid English words

**Load English Word List**
- Open the words.txt file and read the list of words into a set
- Strip whitespace from each word and convert it to uppercase before storing it in the set

**Reference**
- [File Handling in Python](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)

In [63]:
#read all the words from the word list file and store them in a set for fast lookups
def getAllTheWords(filePath):
    with open(filePath, 'r') as file:
        #store in a set for easy lookups
        words = set(line.strip().upper() for line in file)
    return words

**Split the generated text into individual words by the spaces**

**Reference**
- [String Split Method](https://docs.python.org/3/library/stdtypes.html#str.split)

In [64]:
#split the text into individual words
def takeOutWords(text):
    #split by spaces
    return text.split()

**Calculate Percentage of Valid Words in the Generated Text**
- Split the text into words by calling the helper method
- Check each word against the set of English words
- Calculate the percentage of valid words out of the total number of words

- **Formula**:
  \[
  \text{Percentage} = \frac{\text{Valid Words}}{\text{Total Words}} \times 100
  \]

**References**
- [Sets in Python](https://docs.python.org/3/tutorial/datastructures.html#sets)

In [65]:
#count the valid words and calculate the percentage
def calculatePercentage(text, listOfWords):
    #call the helper method
    words = takeOutWords(text)
    wordCount = 0
    totalWords = len(words)
    #increment the word count if it matched
    for word in words:
        if word in listOfWords:
            wordCount += 1
    
    if totalWords == 0:
        return 0
    #return the percentage
    return (wordCount/totalWords)*100
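
Because the cleaned text keeps full stops, splitting on whitespace leaves tokens such as `FOOKS.` with the full stop attached, and those can never match an entry in the word list. The sketch below shows one possible variant that strips leading and trailing full stops before the lookup. The function name `validWordPercentage` and the toy word set are assumptions for illustration, not part of the notebook.

```python
# Hypothetical variant: strip full stops from token ends before checking.
def validWordPercentage(text, knownWords):
    tokens = [token.strip('.') for token in text.split()]
    # Skip tokens that were only full stops.
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    validCount = sum(1 for token in tokens if token in knownWords)
    return validCount / len(tokens) * 100

# Toy data: four of the five tokens are in the word set.
toyWords = {"THE", "CAT", "SAT"}
result = validWordPercentage("THE CAT SAT. THE QZX.", toyWords)
print(result)
```

Here `SAT.` is counted as valid once its full stop is removed, so four of five tokens match (80%). Tokens with a full stop in the middle, such as `ANG.ITTLONLY`, are still treated as a single word by this sketch.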

**Get the percentage result of the program**

In [66]:
englishWordList = getAllTheWords('data/words.txt')
percentage = calculatePercentage(genText, englishWordList)
print(percentage)

36.05600933488915


## **Task 4: Export the Trigram Model as JSON**
- Save the trigram model to a JSON file for future use using the json.dump method to serialize the dictionary

In [67]:
with open('trigrams.json', 'w') as file:
    json.dump(trigramModel, file)
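
To check that an exported model can be read back, the sketch below writes a toy dictionary to a temporary JSON file and loads it again with `json.load`. The toy model and the file name `trigrams_demo.json` are placeholders so the example runs without the real data.

```python
import json
import os
import tempfile

# Toy stand-in for the real trigram model.
toyModel = {"THE": 3, "HE ": 2}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "trigrams_demo.json")
    # Write the dictionary out as JSON.
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(toyModel, file)
    # Read it back into a new dictionary.
    with open(path, 'r', encoding='utf-8') as file:
        loadedModel = json.load(file)

print(loadedModel == toyModel)
```

Note that `json.load` returns a plain `dict`, not a `defaultdict`, so looking up a trigram that is missing from a reloaded model raises `KeyError` unless `.get` is used or the result is wrapped in a `defaultdict` again.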

## Summary
I created a third-order letter approximation (trigram) model from a Project Gutenberg data set, used it to generate text, and calculated the percentage of generated words that are actual English words by checking them against a file of English words.