<center>

# Utilizing Markov Chains with the N-Gram Method to Suggest New Sentences from a Base Text

</center>
<br><br>

<b>Project:</b> Markov Chain and N-Grams  
<b>Class:</b> Cpts 315 Washington State University  
<b>Description:</b> Final Project  
<b>By:</b> Kyle Hurd

# Introduction - Markov Chain

A <b>Markov Chain</b> is a model that predicts a sequence of states. It uses the probability of moving from one state to another to generate a new sequence. Its defining characteristic is that the next state depends only on the current state: earlier states have no influence on the chain. Each transition is chosen at random, weighted by how frequently the current state leads to each potential next state.

---

Here is an example to explain the behavior described above. Suppose we have a state machine consisting of two states: State <b>q0</b> represents our initial state. Anything traveling to this state will produce a binary value of <b>1</b>. State <b>q1</b> represents the second state which will produce binary <b>0</b>.

![](./imgs/two_state_machine.png)

Let us assume that every transition is equally likely: from either state, the chain moves to <b>q0</b> or to <b>q1</b> with probability 0.50.<br><br> 

```
    P(q0|q0) = 0.50
    P(q0|q1) = 0.50
    P(q1|q0) = 0.50
    P(q1|q1) = 0.50
```

The above probabilities can be read as follows:  

```
P(q0|q0) is the probability of a transition from state q0 -> q0, here 50%.

P(q0|q1) is the probability of a transition from state q0 -> q1, here 50%.

P(q1|q0) is the probability of a transition from state q1 -> q0, here 50%.

P(q1|q1) is the probability of a transition from state q1 -> q1, here 50%.
```
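
To make the two-state example concrete, here is a minimal sketch that walks the chain and records the bit emitted on entering each state. The names `TRANSITIONS`, `EMITS`, and `simulate` are our own for illustration and are not part of the project code.

```python
import random

# Transition table in the reading used above:
# TRANSITIONS[current][nxt] is the chance of moving current -> nxt.
TRANSITIONS = {
    "q0": {"q0": 0.50, "q1": 0.50},
    "q1": {"q0": 0.50, "q1": 0.50},
}

# Entering q0 emits 1 and entering q1 emits 0, as described above.
EMITS = {"q0": "1", "q1": "0"}


def simulate(steps, start="q0", seed=None):
    """Walk the chain for `steps` transitions and return the emitted bits."""
    rng = random.Random(seed)
    state = start
    bits = []
    for _ in range(steps):
        candidates = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in candidates]
        # The next state depends only on the current state.
        state = rng.choices(candidates, weights=weights, k=1)[0]
        bits.append(EMITS[state])
    return "".join(bits)


bits = simulate(10_000, seed=315)
print(bits[:20])
print(f"share of 1s: {bits.count('1') / len(bits):.3f}")
```

With equal probabilities, the share of 1s should settle near one half as the number of steps grows.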

---
In the next example, we build a generic state machine with three states, so each state now has three potential transitions.

![](./imgs/three_state_machine.png)

In this example, we only consider the probability of transitioning from one state to another; information such as the alphabet and grammar is ignored.  

The probabilities are listed below:

```
P(q0|q0) = 0.20
P(q0|q1) = 0.40
P(q0|q2) = 0.20

P(q1|q0) = 0.50
P(q1|q1) = 0.25
P(q1|q2) = 0.25

P(q2|q0) = 0.10
P(q2|q1) = 0.80
P(q2|q2) = 0.10
```
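
A quick sanity check for a table like this is that the outgoing probabilities of each state sum to 1. The sketch below is one way to run that check. It assumes the values are read as transitions from the first state to the second, and that the third entry in the q1 block is the q1 -> q2 transition. `row_totals` and `normalize` are hypothetical helper names. As listed, the q0 block sums to 0.80, so `normalize` shows one possible way to rescale a row before sampling from it.

```python
# Transition table copied from the listing above.
# Assumption: the third entry in the q1 row is the q1 -> q2 transition.
TRANSITIONS = {
    "q0": {"q0": 0.20, "q1": 0.40, "q2": 0.20},
    "q1": {"q0": 0.50, "q1": 0.25, "q2": 0.25},
    "q2": {"q0": 0.10, "q1": 0.80, "q2": 0.10},
}


def row_totals(table):
    """Return the summed outgoing probability for each state."""
    return {state: round(sum(row.values()), 10) for state, row in table.items()}


def normalize(table):
    """Rescale each row so its outgoing probabilities sum to 1."""
    return {
        state: {nxt: p / sum(row.values()) for nxt, p in row.items()}
        for state, row in table.items()
    }


totals = row_totals(TRANSITIONS)
for state, total in totals.items():
    flag = "" if abs(total - 1.0) < 1e-9 else "  <- does not sum to 1"
    print(f"{state}: {total:.2f}{flag}")
```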

## Setting up the Code

---

First we need to initialize a class with all the information we will need
for a Markov Chain. Here are a few things that we will need:

- a list of words from our source text from which to build the chain.
- a dictionary to store the n-grams and the list of next words for each.
- the name of the source file (in case we need to access it later).

I also thought it would be cool to hold some information regarding the total number of
characters, words, and unique words in the text. We will store this information in a dataclass.


In [111]:
import jdc
import random
from colorama import Fore, Style
from dataclasses import dataclass

HUNGER_GAMES_FILENAME = './data/hunger_games.txt'

## TextSpecs DataClass

In [95]:
@dataclass
class TextSpecs:
    num_chars: int = 0
    num_words: int = 0
    num_unique_words: int = 0
        
        
    def _populate(self, num_chars: int, num_words: int, num_unique_words: int):
        self.num_chars += num_chars
        self.num_words += num_words
        self.num_unique_words += num_unique_words
        
        
    def display_specs(self):
        print(f'{Style.BRIGHT}{Fore.LIGHTGREEN_EX}{"#" * 18}' \
              f'{"#" * (len(str(self.num_unique_words)) + 1)}{Style.RESET_ALL}')
        
        print(f'{Style.BRIGHT}num chars: {Style.RESET_ALL}{self.num_chars}{Style.RESET_ALL} ')
        print(f'{Style.BRIGHT}num words: {Style.RESET_ALL}{self.num_words}{Style.RESET_ALL} ')
        print(f'{Style.BRIGHT}num unique words: {Style.RESET_ALL}{self.num_unique_words}{Style.RESET_ALL} ')
        
        print(f'{Style.BRIGHT}{Fore.LIGHTGREEN_EX}{"#" * 18}' \
              f'{"#" * (len(str(self.num_unique_words)) + 1)}{Style.RESET_ALL}')

## MarkovChain Class

---

Below is the initializer for the `MarkovChain`. It also defines `display_specs`, which uses the `TextSpecs`
dataclass defined above to print information about the text(s) we are using. It is important to note
that `TextSpecs._populate()` keeps the existing values and adds to them using the augmented assignment operator,
so calling it more than once accumulates the counts. This means we should not call these methods directly in
practice, but should use the wrapper function defined further down the page.

In [158]:
class MarkovChain:
    
    def __init__(self, filenames: list, N: int=3, stop_characters=None, stop_words=None):
        self.initial_words = []
        self.n_grams = {}
        self.filenames = filenames
        self.stop_characters = stop_characters
        self.stop_words = stop_words
        self.N = N
        self.specs = TextSpecs()
        
    
    def display_specs(self):
        print(f'{Style.BRIGHT}Files:{Style.RESET_ALL}')
        for filename in self.filenames:
            print(f'{Style.BRIGHT}{Fore.LIGHTRED_EX}-{Style.RESET_ALL} {filename}')
        self.specs.display_specs()


The methods below will be wrapped in a function that keeps the variables initialized
in `MarkovChain` and `TextSpecs` in a consistent state.

## MarkovChain._init_words()

In [159]:
%%add_to MarkovChain

def _init_words(self):
    
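    for filename in self.filenames:
        # Assumes the source files are UTF-8 encoded, so that typographic
        # quotes in the text decode the same way on every platform.
        with open(filename, 'r', encoding='utf-8') as f:
            chars = f.read()
            words = chars.split()  # split on any run of whitespace
            unique_words = set(words)
            # Counts are added to the running totals held in self.specs.
            self.specs._populate(len(chars), len(words), len(unique_words))
            self.initial_words.extend(words)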
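
To see what these three counts look like on a small input, here is a minimal sketch that works on in-memory strings rather than files. `text_stats` and the two sample strings are hypothetical and only for illustration. It also shows that summing the per-file unique counts, as `_init_words` does through `_populate`, can differ from counting unique words over the combined text when more than one file is used.

```python
def text_stats(text):
    """Return (num_chars, num_words, num_unique_words) for one string."""
    words = text.split()  # split on any run of whitespace
    return len(text), len(words), len(set(words))


# Two tiny hypothetical "files" held in memory instead of on disk.
file_a = "the cat sat on the mat"
file_b = "the dog sat"

stats_a = text_stats(file_a)
stats_b = text_stats(file_b)

# Summing the per-file unique counts.
summed_unique = stats_a[2] + stats_b[2]

# Counting unique words over the combined word list instead.
combined_unique = len(set(file_a.split() + file_b.split()))

print(stats_a, stats_b)
print(f"summed per file: {summed_unique}, combined: {combined_unique}")
```

With a single source file, as in the cell that follows, the two counts are the same.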

In [160]:
mc = MarkovChain([HUNGER_GAMES_FILENAME], 3)
mc._init_words()
mc.display_specs()

print(f'\n{Style.BRIGHT}Preview of the Text:{Style.RESET_ALL}')
for i in range(50):
    print(mc.initial_words[i], end=' ')

Files:
- ./data/hunger_games.txt
#######################
num chars: 18669
num words: 3543
num unique words: 1448
#######################

Preview of the Text:
When I wake up, the other side of the bed is cold. My fingers stretch out, seeking Prim’s warmth but finding only the rough canvas cover of the mattress. She must have had bad dreams and climbed in with our mother. Of course, she did. This is the day of 

## MarkovChain._create_ngram_dict()

This is where the probability between states comes into play. Note that when we add the next word beyond the
n-gram (the (N + 1)th word), we allow duplicates in the list. This means we could receive a list such as
`[the, the, the, tiny]`, where 75% of the words are `the` and 25% are `tiny`. When we later select a word
at random from this bag of words, we should therefore see the word `the` selected
approximately 75% of the time.
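
The effect of keeping duplicates can be checked with a short experiment. This is a sketch with a made-up list of next words, not data from the source text: it draws from the list many times with `random.choice` and tallies the results.

```python
import random
from collections import Counter

# Hypothetical list of next words stored for one n-gram key.
next_words = ["the", "the", "the", "tiny"]

rng = random.Random(315)  # seeded so the run is repeatable
draws = Counter(rng.choice(next_words) for _ in range(10_000))

for word, count in draws.most_common():
    print(f"{word}: {count / 10_000:.3f}")
```

Because every list element is equally likely to be picked, a word that appears three times out of four is drawn about three times out of four.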

In [187]:
%%add_to MarkovChain

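def _create_ngram_dict(self):
    # Zip N + 1 shifted copies of the word list to get sliding windows.
    n_grams = zip(*[self.initial_words[i:] for i in range(self.N + 1)])
    for n_gram in n_grams:
        key = n_gram[:self.N]    # the n-gram itself (a tuple of N words)
        next_word = n_gram[-1]   # the word that follows it
        # Append in place; duplicates are kept so frequency acts as weight.
        self.n_grams.setdefault(key, []).append(next_word)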
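
The `zip` over shifted slices can be hard to picture, so here is the same idea on a tiny made-up word list with `N = 2`. The variable names are ours and the sentence is not from the source text.

```python
words = "the cat sat on the cat sat down".split()
N = 2

# Each shifted copy starts one word later; zip stops at the shortest one,
# so every tuple is a window of N + 1 consecutive words.
windows = list(zip(*[words[i:] for i in range(N + 1)]))

n_grams = {}
for window in windows:
    key, next_word = window[:N], window[-1]
    n_grams.setdefault(key, []).append(next_word)

for key, followers in n_grams.items():
    print(key, "->", followers)
```

Here the pair `('the', 'cat')` occurs twice and is followed by `sat` both times, while `('cat', 'sat')` is followed once by `on` and once by `down`.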

In [193]:
mc.n_grams = {} # Reset first: re-running this cell would otherwise append duplicate entries.

mc._create_ngram_dict()
n_gram_vals = list(mc.n_grams.values())

print(f'\n{Style.BRIGHT}Preview of the N-Grams:{Style.RESET_ALL}')
for n_gram in n_gram_vals[:5]:
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}', end='')
    for gram in n_gram:
        print(f'{Fore.LIGHTGREEN_EX}{gram}{Style.RESET_ALL}', end=' ')
    print()
    
n_gram_vals = list(filter(lambda x: len(x) > 1, n_gram_vals))
for n_gram in n_gram_vals[:5]:
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}', end='')
    for gram in n_gram:
        print(f'{Fore.LIGHTGREEN_EX}{gram}{Style.RESET_ALL}', end=' ')
    print()


Preview of the N-Grams:
- up,
- the
- other
- side
- of
- had really
- day closest
- 12, 12.
- nicknamed is
- only try

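
With a dictionary in this shape, new text can be produced by repeatedly picking a next word and sliding the n-gram window forward. The sketch below shows one way this could look. `generate` is a hypothetical helper, not the project's own method, and `toy` is a hand-built dictionary standing in for `mc.n_grams`.

```python
import random


def generate(n_grams, length=20, seed=None):
    """Hypothetical helper: walk an n-gram dict to emit up to `length` words."""
    rng = random.Random(seed)
    key = rng.choice(list(n_grams))        # start from a random n-gram
    output = list(key)
    while len(output) < length:
        followers = n_grams.get(key)
        if not followers:                  # dead end: this n-gram is not a key
            break
        next_word = rng.choice(followers)  # duplicates act as weights
        output.append(next_word)
        key = key[1:] + (next_word,)       # slide the window forward one word
    return " ".join(output)


# Tiny hand-built dictionary in the same shape as mc.n_grams (N = 2).
toy = {
    ("the", "cat"): ["sat", "sat"],
    ("cat", "sat"): ["on", "down"],
    ("sat", "on"): ["the"],
    ("on", "the"): ["cat"],
}
sentence = generate(toy, length=12, seed=315)
print(sentence)
```

The walk stops early if it reaches an n-gram that never appears as a key, which in a real text happens only at the very end of the word list.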