<h1 align="center">Utilizing Markov Chains with the N-Gram Method to Suggest New Sentences from a Base Text</h1>

<br>

<b>Project:</b> Markov Chain and N-Grams  
<b>Class:</b> Cpts 315 Washington State University  
<b>Description:</b> Final Project  
<b>By:</b> Kyle Hurd

# Introduction - Markov Chain

My motivation for designing a Sentence Generator came from reading about how Google's `Word Usage Over Time` feature was implemented.
When you look up the definition of a word, Google provides a chart showing that word's usage over a period of time. I then read that Google
uses a method called N-Grams to help determine this. That led me to the idea of predicting the next word based on the words leading up
to it. I was initially going to apply this concept to an effectively unlimited supply of Tweets from Twitter, but I was not given permission by Twitter to use their API. So I thought
of a different way to apply the N-Gram concept: read a source text such as a book or article, analyze it, and generate new sentences that are similar
to the source text.

<p align="center">

<img src="imgs/word_usage_chart.png"/>

</p>

A <b>Markov Chain</b> is a model that describes a sequence of states, where each transition to the next state is chosen according to a probability. Its defining characteristic is that this probability depends exclusively on the current state. In other words, past states do not influence the Markov Chain, only the current state. In this project, the probability of the current state transitioning to a potential next state is based on frequency: the more often one state follows another in the source, the more likely that transition is to be chosen when generating a new sequence.

---

Here is an example to explain the behavior described above. Suppose we have a state machine consisting of two states: State <b>q0</b> represents our initial state. Anything traveling to this state will produce a binary value of <b>1</b>. State <b>q1</b> represents the second state which will produce binary <b>0</b>.

<p align="center">


<img src="imgs/two_state_machine.png"/>


</p>

Let us assume that from either state, the chance of staying in that state or transitioning to the other one is 50/50.<br><br> 

```
    P(q0|q0) = 0.50
    P(q0|q1) = 0.50
    P(q1|q0) = 0.50
    P(q1|q1) = 0.50
```

The above probabilities can be read as follows:  

```
P(q0|q0) is the probability of the transition q0 -> q0, which happens 50% of the time.

P(q0|q1) is the probability of the transition q0 -> q1, which happens 50% of the time.

P(q1|q0) is the probability of the transition q1 -> q0, which happens 50% of the time.

P(q1|q1) is the probability of the transition q1 -> q1, which happens 50% of the time.
```
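
As a minimal sketch of the behavior described above, the two-state machine could be simulated as follows. The `TRANSITIONS` table, the `EMITS` table, and the `simulate` helper are names made up for this illustration and are not part of the project code. The point to notice is that each step looks only at the current state.

```python
import random

# Hypothetical transition table: current state -> {next state: probability}.
TRANSITIONS = {
    'q0': {'q0': 0.50, 'q1': 0.50},
    'q1': {'q0': 0.50, 'q1': 0.50},
}

# Per the description above, arriving at q0 produces 1 and arriving at q1 produces 0.
EMITS = {'q0': 1, 'q1': 0}


def simulate(start, steps, rng=random):
    """Walk the chain for `steps` transitions and return the emitted bits."""
    state = start
    bits = []
    for _ in range(steps):
        options = TRANSITIONS[state]
        # The draw depends only on the current state, never on earlier ones.
        state = rng.choices(list(options), weights=list(options.values()))[0]
        bits.append(EMITS[state])
    return bits


print(simulate('q0', 10))
```

Because the draw is random, each run prints a different sequence of ten bits.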

---
In the next example, we will use a generic state machine with a total of three states, so each state now has three potential transitions.

<p align="center">


<img src="imgs/three_state_machine.png"/>


</p>

In this example, we only consider the probability of transitioning from one state to another: information such as the alphabet and grammar is ignored.  

The probabilities are listed below:

```
P(q0|q0) = 0.20
P(q0|q1) = 0.40
P(q0|q2) = 0.20

P(q1|q0) = 0.50
P(q1|q1) = 0.25
P(q1|q2) = 0.25

P(q2|q0) = 0.10
P(q2|q1) = 0.80
P(q2|q2) = 0.10
```

So how do we measure probability? One option is to keep, for each n-gram, the set of distinct words seen after it along with a count for each word.
Dividing each word's count by the total number of words seen after the n-gram gives the probability that the word is selected. From there, we can
generate a float between 0 and 1 and pick a word according to which slice of that range the float falls into. This is a viable implementation, but I thought it would bloat the
code and introduce math that is confusing and complicated to think about.

Here is the solution I came up with:
- Add all words after an n-gram to a "bucket" regardless of whether it has been seen before.
- Select from the bucket at random. Words that exist in the bucket multiple times will have a higher probability due to filling more space in the bucket.
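
The two steps above can be sketched in a few lines. This is only an illustration with a made-up bucket, not the project code; the seed value is arbitrary and is there only so that repeated runs draw the same sample.

```python
import random
from collections import Counter

# Hypothetical bucket for one n-gram: 'the' was seen three times, 'tiny' once.
bucket = ['the', 'the', 'the', 'tiny']

rng = random.Random(315)  # arbitrary seed, used only for repeatable runs
draws = Counter(rng.choice(bucket) for _ in range(10_000))

# Every slot is equally likely, so a word's chance is (its copies) / len(bucket).
for word, count in draws.most_common():
    print(f'{word}: {count / 10_000:.2f}')
```

With three of the four slots holding `the`, its observed share should land close to 0.75 without any explicit probability math.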

A potential side effect to this solution is space. For very large text, we end up storing a lot of words. However, I tested the working code with three large text files
and the program still produces an output within a second and does not throw a memory error. Further below, I also show how much memory the method uses to produce new text.

## Setting up the Code

---

First we need to initialize a class with all the information we will need
for a Markov Chain. Here are a few things that we will need:

- a list of words from our source text from which to build the chain.
- a dictionary to store the n-grams and the list of next words for each.
- the name of the source file (in case we need to access it later).

I also thought it would be cool to hold some information regarding the total number of
characters, words, and unique words in the text. We will store this information in a dataclass
labeled `TextSpecs`.


In [44]:
# JDC used to extend a Class Object in Jupyter Notebook
import jdc


import random
from colorama import Fore, Style
from dataclasses import dataclass
from functools import reduce


HUNGER_GAMES_FILENAME = './data/hunger_games.txt'
TWILIGHT = './data/twilight.txt'
FIFTY_SHADES_OF_GRAY = './data/50_shades_of_gray.txt'
LORD_OF_THE_RINGS = './data/lord_of_the_rings.txt'


STOP_CHARACTERS = '.?!'
STOP_WORDS = ['Dr.', 'Jr.', 'Sr.', 'Mr.', 'Mrs.', 'Ms.', 'Miss.', 'Prof.']
FULL_QUOTE = '"'

<center>

# TextSpecs DataClass

</center>

---

I wanted a way to look at how many characters, words, and unique words exist in the text(s) we are using for the program.
Below is a simple `dataclass` to hold `num_chars`, `num_words`, and `num_unique_words`. Additionally, it has three methods:
`TextSpecs._populate()` adds more chars, words, and unique words to the existing values, `TextSpecs.reset_specs()` sets them back to zero, and
`TextSpecs.display_specs()` prints the current values of the three variables to the terminal in a clean way.

In [2]:
@dataclass
class TextSpecs:
    num_chars: int = 0
    num_words: int = 0
    num_unique_words: int = 0
        
        
    def _populate(self, num_chars: int, num_words: int, num_unique_words: int):
        self.num_chars += num_chars
        self.num_words += num_words
        self.num_unique_words += num_unique_words
        
    
    def reset_specs(self):
        self.num_chars = 0
        self.num_words = 0
        self.num_unique_words = 0


    def display_specs(self):
        print(f'{Style.BRIGHT}{Fore.LIGHTGREEN_EX}{"#" * 18}' \
              f'{"#" * (len(str(self.num_unique_words)) + 1)}{Style.RESET_ALL}')
        
        print(f'{Style.BRIGHT}num chars: {Style.RESET_ALL}{self.num_chars}{Style.RESET_ALL} ')
        print(f'{Style.BRIGHT}num words: {Style.RESET_ALL}{self.num_words}{Style.RESET_ALL} ')
        print(f'{Style.BRIGHT}num unique words: {Style.RESET_ALL}{self.num_unique_words}{Style.RESET_ALL} ')
        
        print(f'{Style.BRIGHT}{Fore.LIGHTGREEN_EX}{"#" * 18}' \
              f'{"#" * (len(str(self.num_unique_words)) + 1)}{Style.RESET_ALL}')

<center>

# MarkovChain Class

</center>

---

Below is the initializer for the MarkovChain. It inherits from `TextSpecs`, which keeps track of the statistics
regarding the total number of words, characters, and unique words we are using for the chain. It is important to note
that `TextSpecs._populate()` keeps the original values and adds to them using the augmented assignment operator, so calling it
more than once accumulates the counts. This means we should not call these functions directly in practice, but should use the wrapper function defined further
down the page. I denoted this with an underscore before the names of functions that should not be called alone.
<br><br>
`MarkovChain` will be our base class for generating a sentence. It will construct the n-grams and the starting n-grams, and store the stop
words, stop characters, filenames to use, and the size N of the n-gram. The statistics inherited from `TextSpecs` cover
all the files provided to the
constructor. 

In [3]:
class MarkovChain(TextSpecs):


    def __init__(self, filenames, N, stop_characters=None, stop_words=None):
        self.initial_words = []
        self.n_grams = {}
        self.starting_n_grams = []
        self.filenames = filenames
        self.stop_characters = stop_characters
        self.stop_words = stop_words
        self.N = N


    def display_specs(self):
        print(f'{Style.BRIGHT}Files:{Style.RESET_ALL}')
        for filename in self.filenames:
            print(f'{Style.BRIGHT}{Fore.LIGHTRED_EX}-{Style.RESET_ALL} {filename}')
        super().display_specs()


The methods below will be wrapped with a function that keeps the variables initialized
in `MarkovChain` and `TextSpecs` in a consistent state.

## MarkovChain._init_words()

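This method iterates over each file provided as source text. It first collects the number of characters, words, and unique words in the text file
and passes them to `TextSpecs._populate()`. Then it extends `initial_words` with the collection of words in the source text. We can at least verify that the number of words is correct by running the `wc` command on the source text, which reports lines, words, and bytes. The byte count is larger than the character count Python reports further below, most likely because characters such as the curly apostrophe in `Prim’s` take more than one byte in the file.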

Here is the output from that:

```
wc data/hunger_games.txt
    9724  299960 1652553 data/hunger_games.txt
```
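
When comparing this output with the counts Python reports, keep in mind that plain `wc` counts bytes, while `len()` on a decoded string counts characters. The sketch below shows the difference on a single word from the preview text, assuming the file is stored as UTF-8 (an assumption, since the encoding is not stated here).

```python
# 'Prim’s' appears in the preview text; its apostrophe is the curly U+2019.
sample = 'Prim’s'

num_chars = len(sample)                  # characters, as len(f.read()) counts them
num_bytes = len(sample.encode('utf-8'))  # bytes, as plain `wc` counts them

print(num_chars, num_bytes)  # 6 8
```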

To collect the number of chars and words using Python, we can use `.read()` to load the file as one string and `.split()` to divide that string into words.
Then we just need the `len()` function to get the length of the string, of the list of words, and of the set of words. A `set` does not allow repeated elements,
so applying it to the `words` list leaves only the unique elements. Here is a sample of the functionality from the interpreter, using a short string in place of `f.read()`:

```
>>> chars = 'Hello, World! Hello,'  # stands in for f.read(), which returns one string
>>> len(chars)
20
>>> words = chars.split()  # with no argument, splits on runs of whitespace
>>> words
['Hello,', 'World!', 'Hello,']
>>> len(words)
3
>>> unique_words = set(words)
>>> sorted(unique_words)
['Hello,', 'World!']
>>> len(unique_words)
2
```

In [4]:
%%add_to MarkovChain

def _init_words(self):
    
    for filename in self.filenames:
        with open(filename, 'r') as f:
            chars = f.read()
            words = chars.split()
            unique_words = set(words)
            self._populate(len(chars), len(words), len(unique_words))
            self.initial_words.extend(words)

In [5]:
mc = MarkovChain(filenames=[HUNGER_GAMES_FILENAME],
                 N=3,
                 stop_characters=STOP_CHARACTERS,
                 stop_words=STOP_WORDS
                )

mc._init_words()
mc.display_specs()

print(f'\n{Style.BRIGHT}Preview of the Text:{Style.RESET_ALL}')
for i in range(50):
    print(mc.initial_words[i], end=' ')

Files:
- ./data/hunger_games.txt
########################
num chars: 1607759
num words: 299960
num unique words: 26002
########################

Preview of the Text:
When I wake up, the other side of the bed is cold. My fingers stretch out, seeking Prim’s warmth but finding only the rough canvas cover of the mattress. She must have had bad dreams and climbed in with our mother. Of course, she did. This is the day of 

## MarkovChain._create_ngram_dict()

This is where the probability between states comes into play. Note that when we add the next word beyond the
n-gram (the (N + 1)th word), we allow duplicates in the list. This means we could end up with a list such as
`[the, the, the, tiny]` where 75% of the words are `the` and 25% are `tiny`. When we later select from this
bucket at random, we should therefore see the
word `the` approximately 75% of the time.  

As for creating the dictionary of ngrams, the algorithm is fairly straightforward. Suppose we have the sentence below:

`The dog is happy. The dog is quite a good boy. That is quite a smile you have there.`

The algorithm iterates over a group of N + 1 words at a time:

- Itr 1: `The dog is happy.`
- Itr 2: `dog is happy. The`
- Itr 3: `is happy. The dog`
- . . .

For each iteration, the `key` for the dictionary becomes the first N elements of the partial list. Then, at the given key,
we append the (N + 1)th word to the bucket. For example:

- <b>Itr 1:</b><br>
    key: `The dog is`<br> bucket: `happy.`
- <b>Itr 2:</b><br>
    key: `dog is happy.`<br> bucket: `The`
- <b>Itr 3:</b><br>
    key: `is happy. The`<br> bucket: `dog`
- <b>. . .</b>
- <b>Itr 5:</b><br>
    key: `The dog is`<br> bucket: `happy., quite`

Iterations 1 and 5 end up producing the same key for the dictionary, so `quite` is appended to the list from iteration 1. This collision
also occurs (with N=3) for the phrase `is quite a`, as that combination of words exists twice in the example.

key: `is quite a`<br>
bucket: `good, smile`<br>
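
The walkthrough above can be reproduced with a short standalone sketch. `build_ngram_buckets` is a hypothetical helper written for this illustration; it follows the same sliding-window idea but is not the method used by the class.

```python
def build_ngram_buckets(words, n):
    """Map each n-word tuple to the list of words seen right after it."""
    buckets = {}
    # Slide a window of n + 1 words across the text, one word at a time.
    for i in range(len(words) - n):
        key = tuple(words[i:i + n])
        next_word = words[i + n]
        # Duplicates are kept on purpose so frequent words fill more slots.
        buckets.setdefault(key, []).append(next_word)
    return buckets


text = ('The dog is happy. The dog is quite a good boy. '
        'That is quite a smile you have there.')
buckets = build_ngram_buckets(text.split(), n=3)

print(buckets[('The', 'dog', 'is')])  # ['happy.', 'quite']
print(buckets[('is', 'quite', 'a')])  # ['good', 'smile']
```

Note that punctuation stays attached to the words, because the text is split on whitespace only.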


In [6]:
%%add_to MarkovChain

def _create_ngram_dict(self):
    n_grams = zip(*[self.initial_words[i:] for i in range(self.N + 1)])
    for n_gram in n_grams:
        key = n_gram[:self.N]
        next_word = n_gram[-1]
        self.n_grams.setdefault(key, []).append(next_word)
        
        
def _create_starting_ngram_list(self):
    
    is_valid      = lambda g: g[0] not in self.stop_words and (g[1][0].isupper() or g[1][0] in ["'", '"'])
    in_stop_chars = lambda g: g[0][-1] in self.stop_characters
    
    n_grams = zip(*[self.initial_words[i:] for i in range(self.N + 1)])
    for n_gram in n_grams:
        if in_stop_chars(n_gram) and is_valid(n_gram):
            self.starting_n_grams.append(n_gram[1:])
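
The rule inside `_create_starting_ngram_list()` is compact, so here is a minimal standalone sketch of the same idea on a toy text. `find_starters` and the shortened stop-word list are made up for this example: a window counts as a starter when the word before it ends a sentence (and is not an abbreviation such as `Mr.`) and its own first word begins with a capital letter or a quote.

```python
STOP_CHARACTERS = '.?!'
STOP_WORDS = ['Dr.', 'Mr.', 'Mrs.']  # shortened list, for the example only


def find_starters(words, n):
    """Collect n-grams that directly follow a sentence-ending word."""
    starters = []
    for i in range(len(words) - n):
        previous, first = words[i], words[i + 1]
        ends_sentence = (previous[-1] in STOP_CHARACTERS
                         and previous not in STOP_WORDS)
        looks_like_start = first[0].isupper() or first[0] in ("'", '"')
        if ends_sentence and looks_like_start:
            starters.append(tuple(words[i + 1:i + 1 + n]))
    return starters


text = 'We met Mr. Smith today. He was late. "Sorry about that," he said.'
starters = find_starters(text.split(), n=3)
print(starters)
# [('He', 'was', 'late.'), ('"Sorry', 'about', 'that,"')]
```

The opening words of the text are never collected, because no sentence-ending word comes before them.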
        

## Preview Results from the N-Grams and the Entries

Here is an example of the output produced from `MarkovChain._create_ngram_dict()`. We can see that each key shifts one word to the right
in the input stream (source text).

In [7]:
mc.n_grams = {} # Only used because we are calling this multiple times.

mc._create_ngram_dict()
n_gram_vals = list(mc.n_grams.values())

print(f'\n{Style.BRIGHT}Preview of N-Grams:{Style.RESET_ALL}')
for i, key in enumerate(mc.n_grams.keys()):
    if i == 5:
        break
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}{Fore.LIGHTGREEN_EX}{key}{Style.RESET_ALL}')


Preview of N-Grams:
- ('When', 'I', 'wake')
- ('I', 'wake', 'up,')
- ('wake', 'up,', 'the')
- ('up,', 'the', 'other')
- ('the', 'other', 'side')


Additionally we have the buckets or entries that were assigned to each key. The first five elements in the preview of the n-gram entries correspond
to the first five keys in the above preview. The remaining five entries are just other buckets that have more than two items inside.

In [8]:
print(f'\n{Style.BRIGHT}Preview of the N-Gram Entries:{Style.RESET_ALL}')
for n_gram in n_gram_vals[:5]:
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}', end='')
    for gram in n_gram:
        print(f'{Fore.LIGHTGREEN_EX}{gram}{Style.RESET_ALL}', end=' ')
    print()
    
n_gram_new = list(filter(lambda x: len(x) > 2, n_gram_vals))
for entries in n_gram_new[:5]:
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}', end='')
    for entry in entries:
        print(f'{Fore.LIGHTGREEN_EX}{entry}{Style.RESET_ALL}', end=' ')
    print()


Preview of the N-Gram Entries:
- up, up,
- the I’m
- other restraints
- side
- of of of of
- of of of of
- the the the her
- bed bargain table. building dome, bag lake Seam. circle?” Cornucopia, narrow bed, family other bed tree, net.” warehouse. V, bed, tunnel, bridge head bargain, balcony house, house.”
- is is and
- had really been


In [9]:
mc.starting_n_grams = [] # Only used because we are calling this multiple times.

mc._create_starting_ngram_list()
print(f'\n{Style.BRIGHT}Preview of the N-Gram Starters:{Style.RESET_ALL}')
for n_gram in mc.starting_n_grams[:10]:
    print(f'{Style.BRIGHT}- {Style.RESET_ALL}{Fore.LIGHTGREEN_EX}{n_gram}{Style.RESET_ALL}')


Preview of the N-Gram Starters:
- ('My', 'fingers', 'stretch')
- ('She', 'must', 'have')
- ('Of', 'course,', 'she')
- ('This', 'is', 'the')
- ('I', 'prop', 'myself')
- ('There’s', 'enough', 'light')
- ('My', 'little', 'sister,')
- ('In', 'sleep,', 'my')
- ('Prim’s', 'face', 'is')
- ('My', 'mother', 'was')


An issue that I brought up in the introduction was the memory usage of the program.

The total memory for the `MarkovChain` class is only around 1/100th of a gigabyte, and Python does not impose its own limit on program memory,
so this is well within what a typical computer can handle.
Still, for generating new sentences this seems quite wasteful. Although this implementation is nice to work with, the side effect is that we are storing
many duplicates of the same word, which add up in a hurry for large source texts. Additionally, the SentenceGenerator requires a lot of source text
in order to produce consistently new sentences. So, to provide a decent sentence generator with this implementation, we have to sacrifice memory.

A potential optimization would be, instead of storing
the same word in a bucket multiple times, to store each word once together with an integer count that determines how likely it is to be chosen from the group of potential words.
Let's say we have an example like this:  

```
key: 'The next word'
bucket: ['is', 'is','is','is','is','is','is','is','is','is','is','is','is','is','is', ...]
```

If the same word is repeated many times and is the only word in the bucket, then whenever we see the key 'The next word' the result will be `is` 100% of the time,
no matter which number between 0 and 1 we draw. Because of this, we don't need to store every copy of the string `is`; we can replace the copies with a number
that signifies how many instances of that word exist in the bucket. If we have 100 words that are `is` and 25 that are `found`, then we can perform the following calculation:

```
word: 'is',    count: 100
word: 'found', count: 25
total = sum(counts) = 125
Likelihood of 'is': 100 / 125 = 0.8
Likelihood of 'found': 25 / 125 = 0.2

Select a number randomly between 0-1. If <num> <= 0.2, then select 'found', otherwise select 'is'.
```
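
A minimal sketch of this count-based idea, using `collections.Counter` for the bucket and `random.choices` for the weighted draw. This is one possible way to write it, not something the current implementation does; the `pick_word` helper and the bucket contents are hypothetical.

```python
import random
from collections import Counter

# Hypothetical bucket stored as counts instead of repeated strings.
bucket = Counter({'is': 100, 'found': 25})


def pick_word(bucket, rng=random):
    """Draw one word with probability proportional to its stored count."""
    words = list(bucket)
    counts = list(bucket.values())
    return rng.choices(words, weights=counts, k=1)[0]


total = sum(bucket.values())  # 125
likelihood = {word: count / total for word, count in bucket.items()}
print(likelihood)             # {'is': 0.8, 'found': 0.2}
print(pick_word(bucket))      # 'is' about 80% of the time
```

Passing the counts as weights means the probabilities never have to be computed by hand; the `likelihood` dictionary is only printed here to mirror the calculation above.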

By storing how many of the same word exists in a bucket, we can reduce a series of identical strings to a single integer. This would be ideal for extremely common words
that follow a specific gram, but on the other end, this implementation could produce little effect for the below scenario:

```
n-gram key: ('The dog runs')
n-gram bucket: ['wild', 'free', 'blindly', 'excitedly', 'willingly', 'dangerously', ... ]
```

In this example we have a statement: `The dog runs`. The words beyond `runs` that are in the bucket represent decorations for the sentence, or adverbs. There could be
an extremely large number of different adverbs used after a specific n-gram. This places the new solution in the same situation as before: a bucket with too many unique words
still takes up memory. I do not see a solution for this specific problem, because every unique item has to be stored in memory somehow. Otherwise,
we would not be able to select that word.

<center>

# SentenceGenerator Class

</center>

---

Now that we have a skeleton for generating sentences, we will inherit from the `MarkovChain` to use its functionality to generate
sentences. The purpose of this new class is to simply extend the behavior of the `MarkovChain` class. Therefore, the `SentenceGenerator.__init__()`
method will only call the constructor of the base class and nothing more. For this report, I also imported the complete `MarkovChain` class
for `SentenceGenerator` to inherit from as Jupyter-Notebook seemed to have some inconsistencies with inheritance when also using the `jdc` module.

In [14]:
from MarkovChain.MarkovChain import MarkovChain

class SentenceGenerator(MarkovChain):

    def __init__(self, filenames, N, stop_characters=None, stop_words=None):
        super().__init__(filenames, N, stop_characters=stop_characters, stop_words=stop_words)

## SentenceGenerator.generate_sentence [1st iteration]

This is the initial draft of the generate_sentence method. It actually generates
understandable text and was my first solution that showed promising output.
The key to making this function work was to separate the general n-grams from the n-grams that can
start a sentence. These are `self.n_grams` and `self.starting_n_grams`, respectively.

By initializing with an n-gram that is the beginning of a sentence,
we can start the chain from the beginning of a sentence instead of midway through or at the end.
This solution reveals another issue: the sentences it generates, although somewhat coherent,
end midway through a sentence. This issue took a lot of time to find a decent solution to. The `SentenceGenerator.generate_sentence()`
method described below is fairly condensed and works well for beginning a sentence. Additionally,
choosing the length of each generated sentence at random is a great way to increase the variety of
sentences to be generated. The issue is that this approach does not consider how a sentence ends. It could end properly
if the chosen length conveniently lands on an ending phrase. However, most of the time,
this does not occur.

In [15]:
%%add_to SentenceGenerator

def generate_sentence(self):
    
    length_sentence = random.randint(4, 15)  
    seed = random.choice(self.starting_n_grams)
    output = [x for x in seed]
    for _ in range(length_sentence):
        word = random.choice(self.n_grams[seed])
        seed = tuple(list(seed[1:]) + [word])
        output.append(word)
        
    return output

In [60]:
# Initializing SentenceGenerator Object
sg = SentenceGenerator(filenames=[
                                HUNGER_GAMES_FILENAME,
                                TWILIGHT,
                                FIFTY_SHADES_OF_GRAY,
                                LORD_OF_THE_RINGS
                                ],
                    N=3,
                    stop_characters=STOP_CHARACTERS,
                    stop_words=STOP_WORDS,
                    )

print(f'{Style.BRIGHT}Preview of Generated Sentences:{Style.RESET_ALL}')
for _ in range(5):
    print(f'{Style.BRIGHT}{Fore.RED}- {Style.RESET_ALL}', end='')
    for word in sg.generate_sentence():
        print(word, end=' ')
    print()

Preview of Generated Sentences:
- I hope the new information will distract her. “Gliding? As in a small town.
- No, not to me. Bewitched... my inner goddess is panting. “See? Beside, there’s something I have to go.
- Do not touch the Dominant without notice. 9 Subject to that proviso and to clauses 2-5 above the Submissive is close to her limit of endurance.
- Pack the first things your hands touch, and then get in your private plane to cross a whole continent just for afternoon tea.
- It's nighttime," he whispered behind me. I turn and see Wiress has crawled over. Her eyes are focused on the jungle.


## SentenceGenerator.generate_sentence [2nd iteration]

In this second iteration, I address the issue where the sentence ends halfway through. Additionally,
the first iteration sometimes produces a stray quotation mark that tries to
start a quote or end a quote. This will also be addressed in this iteration. Testing for a quote at
the beginning or end is quite a difficult problem to fix in a simple way. Somehow, we have to keep
track of the current state the generator is in (does it need to search for an ending quote, or has it seen
a closing quote but no opening quote?). The second case is more challenging. It is pretty straightforward
to search for a closing quote after seeing an opening one, but what are we to do if we see a closing quote with no opening quote before it?


To be honest, I don't have a good solution to this issue. One "solution" would be to eliminate the quotes
altogether from the generator, but that is no fun. The next solution could be to insert the starting quote
at the start of the previous sentence, but there are too many conditions to consider. For example,

```
- She said, "Hello, foo! How is bar?"
- "Hello, foo!" she said, "How is bar?"
- "Hello, foo! How is bar?" She said.
```

In the first example, we can't just insert the start quote at the beginning of the sentence, as that would
be incorrect. The second example is even harder, for we potentially have to insert two quotes
in one sentence! The last example is the only one where the "fix" would work as intended. The problem
with the first two examples is that the words "she" and "said" can be replaced with too many different words, such
as "He" or "Jared", and "exclaimed" or "cried."
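
To make the state-tracking idea concrete, here is a minimal sketch of a helper that reads a list of words and reports whether a quote is still open and how many closing quotes appeared with no opener. `track_quote_state` is a hypothetical function, not part of `SentenceGenerator`, and treating the curly quotes `“` and `”` as quote marks alongside `"` is an assumption based on the generated output shown earlier.

```python
OPENING_QUOTES = ('"', '“')
CLOSING_QUOTES = ('"', '”')


def track_quote_state(words):
    """Return (inside_quote, unmatched_closers) after reading all words."""
    inside_quote = False
    unmatched_closers = 0
    for word in words:
        if word[0] in OPENING_QUOTES:
            inside_quote = True
        # A lone quote character is treated as an opener only.
        if len(word) > 1 and word[-1] in CLOSING_QUOTES:
            if inside_quote:
                inside_quote = False
            else:
                unmatched_closers += 1
    return inside_quote, unmatched_closers


print(track_quote_state('She said, "Hello, foo! How is bar?"'.split()))  # (False, 0)
print(track_quote_state('foo! How is bar?"'.split()))                    # (False, 1)
print(track_quote_state('She said, "Hello, foo!'.split()))               # (True, 0)
```

This only detects the two problem states described above. Deciding where a missing opening quote should be inserted is the hard part, and this sketch does not attempt it.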

In [90]:
%%add_to SentenceGenerator

def generate_sentence(self, len: int=None):
    
    length_sentence = random.randint(4, 15) if len is None else len
    seed = random.choice(self.starting_n_grams)
    
    
    output = [x for x in seed]
    is_quote = reduce(lambda base, word: (word[0] == FULL_QUOTE) or base, output, False)

    for _ in range(length_sentence):
        word, seed = self._generate_word(seed, is_quote)
        output.append(word)
        
    self._end_sentence(output, seed, is_quote)
    return ' '.join(output).rstrip()


def _generate_word(self, seed, is_quote=False):
    
    not_ending_quote = lambda word: word[-1] != FULL_QUOTE
    
    words = self.n_grams[seed]
    words = [word for word in words if not_ending_quote(word)] if not is_quote else words

    # This is needed. If the text is primarily quotes it will make empty list.
    if not len(words):
        words = self.n_grams[seed]

    word = random.choice(words)
    seed = tuple(list(seed[1:]) + [word])
    
    return word, seed
    

def _end_sentence(self, output, seed, is_quote=False):
    
    in_stop_characters = lambda word: word[-1] in self.stop_characters
    in_stop_words      = lambda word: word in self.stop_words
    
    while not (end := [word for word in self.n_grams[seed] if in_stop_characters(word) and not in_stop_words(word)]):
        word, seed = self._generate_word(seed, is_quote)
        is_quote |= word[0] == FULL_QUOTE  # if quote at beginning, make true.
        is_quote &= word[-1] != FULL_QUOTE # if quote at end, make false.
        output.append(word)
        
    word = random.choice(end)
    output.append(word)

In [96]:
print(f'{Style.BRIGHT}Preview of Generated Sentences [Iteration 2]:{Style.RESET_ALL}')
for _ in range(5):
    sentence = sg.generate_sentence()
    print(f'{Style.BRIGHT}{Fore.RED}- {Style.RESET_ALL}{sentence}')

Preview of Generated Sentences [Iteration 2]:
- I'm breaking all the rules these days.” “Are you going to drink?
- Peeta blows on one end to see if it came to an end.
- Every day, I watched anxiously until the rest of us, anyway.” In the weeks since I left Rivendell.
- The final day of the Games on a screen over the stage where we did our interviews. The winner sits in a place once called the Rockies.
- I can walk, though, so I get moving, trying to hide his smile. Oh thank the Lord, he’s recovered his sense of humor.


This is producing better results. There are still little issues regarding quotations, but it is more consistent than before. Let's now make multiple sentences to produce
a paragraph!

In [97]:
%%add_to SentenceGenerator

def generate_paragraph(self, len: int=None):

    # Use the requested number of sentences when one is given.
    num_sentences = random.randint(5, 20) if len is None else len
    output = []
    for _ in range(num_sentences):
        output.append(self.generate_sentence())
    return ' '.join(output)

In [98]:
for _ in range(5):
    print(sg.generate_paragraph())
    print()

It was Boggs who knocked out Peeta with one blow before any permanent damage could be done. Others any viewer would recognize. The golden horn called the Cornucopia. Clove arranging the knives inside her jacket. One of the heaviest days of betting is the opening, when the initial casualties come in. Behind a cameraman, I see Haymitch give a sort of hiss. "It might be a play-acting spy, for all I know, too. The first call I make when I get back. Don’t let old moneybags grind you down.” “I won’t.” We hug again – and then he’d probably expire trying to deal with it... I couldn't stop the next shiver that flashed down my spine. I peeked across the cafeteria toward Emmett, grateful that he wasn't going to answer. In the lining of my pocket, I find a couple of minutes.

Christian leads me to the roof. My final dressing and preparations will be done in the past? Thick, hot blood. You couldn’t see, you couldn’t speak without getting a mouthful. I’m in the same room with me,” I whisper. “But I 