## Text Generation with Markov Chains


Acknowledgement: Intel AI Developer Programme

#### Introduction

Text generation is a hard problem that even large data science teams still struggle with. This notebook explores one of the most common and accessible methods for it, starting at a basic level. The approach we will attempt is:

* Markov Chains


## Define the Markov Chain Class

* Create a class called `markov_chain` that takes the path of a text file as input when instantiating an object.
* Add a method (function) that preprocesses the text, returning a list of lowercased words with all `"` and `'` characters removed.
* Add a method (function) that reads the text file and creates a dictionary of the form `{(word1, word2): [word3], (word2, word3): [word4], ...}`. If the sequence `(word1, word2, word4)` shows up later, we want to end up with `{(word1, word2): [word3, word4], (word2, word3): [word4], ...}`. The goal is to record how often every word follows each preceding pair of words, so repeats are kept: if `(word1, word2, word3)` appears a second time, we want `{(word1, word2): [word3, word4, word3], (word2, word3): [word4], ...}`.
* Add a method (function) for generating new text from a seed. It takes a key (e.g. `(word1, word2)`) as a starting point and randomly samples one of the words that follow that key (e.g. `word3`). The sampled word is appended to the generated words, which creates a new key (e.g. `(word2, word3)`). Repeat this process until the sentence reaches the length specified by the user, then return it as a string.
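
Before writing the class, it can help to see the dictionary described above on a tiny input. The following is a minimal standalone sketch, not the class itself: the function name `build_followers` and the toy sentence are assumptions made for illustration only.

```python
from collections import defaultdict


def build_followers(words, n=2):
    """Map each n-word tuple to the list of words that follow it (repeats kept)."""
    followers = defaultdict(list)
    for i in range(n, len(words)):
        key = tuple(words[i - n:i])      # the n words just before position i
        followers[key].append(words[i])  # the word that follows them
    return dict(followers)


# Hypothetical toy corpus, chosen so that one pair has several followers.
toy_words = "the cat sat on the mat and the cat ran and the cat sat".split()
toy_table = build_followers(toy_words, n=2)

print(toy_table[("the", "cat")])  # ['sat', 'ran', 'sat']
print(toy_table[("and", "the")])  # ['cat', 'cat']
```

Because repeats are kept, `'sat'` is twice as likely as `'ran'` to be sampled after `('the', 'cat')`, which is how the list encodes how often each word follows a pair.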


In [1]:
import random
class markov_chain(object):
    
    def __init__(self,text_path,ngram=2):
        self.ngram = ngram
        self.markov_keys = dict()
        self.path = text_path
        self.text_as_list = None

    def preprocess(self):
        with open(self.path,'r') as f:
            raw = f.read()
        self.text_as_list = raw.lower().replace('"','').replace("'","").split()

    def markov_group_generator(self,text_as_list):
        # At least ngram+1 words are needed to form one (key, follower) group.
        if len(text_as_list) < self.ngram+1:
            raise ValueError("NOT A LONG ENOUGH TEXT!")

        for i in range(self.ngram,len(text_as_list)):
            yield tuple(text_as_list[i-self.ngram:i+1])

    def create_probability_object(self):
        if not self.text_as_list:
            self.preprocess()
        for group in self.markov_group_generator(self.text_as_list):
            word_key = tuple(group[:-1])
            if word_key in self.markov_keys:
                self.markov_keys[word_key].append(group[-1])
            else:
                self.markov_keys[word_key] = [group[-1]]
    
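    def generate_sentence(self, length=25, starting_word_id=None):
        # Build the lookup table on first use if it has not been created yet.
        if not self.markov_keys:
            self.create_probability_object()
        # The last valid start leaves room for the key plus one following word,
        # so the starting key is guaranteed to be in markov_keys.
        max_start = len(self.text_as_list)-self.ngram-1
        if (not isinstance(starting_word_id, int)
            or starting_word_id < 0 or starting_word_id > max_start):
            starting_word_id = random.randint(0,max_start)

        gen_words = self.text_as_list[starting_word_id:starting_word_id+self.ngram]

        while len(gen_words) < length:
            seed = tuple(gen_words[-self.ngram:])
            # Stop early if the seed only occurs at the very end of the text.
            if seed not in self.markov_keys:
                break
            gen_words.append(random.choice(self.markov_keys[seed]))
        return ' '.join(gen_words)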
        
        

* Instantiate the Markov chain object, using an input file

In [2]:
MC = markov_chain('./data/lovecraft.txt',ngram=2)

* Call the method to generate the dictionary

In [3]:
MC.create_probability_object()

In [4]:
i = 0
for key,value in MC.markov_keys.items():
    print(key,value)
    print()
    i+=1
    if i>50:
        break

('the', 'nameless') ['city', 'city', 'city,', 'city,', 'city,', 'city.', 'city', 'city', 'city', 'city;', 'city', 'city.', 'city', 'city,', 'city', 'city', 'city,', 'city', 'city', 'race,', 'city:', 'city.', 'fate', 'monstrosity', 'monstrosity,', 'entity,', 'outsiders', 'design--living', 'entities', 'stone', 'city', 'and', 'scent', 'scent', 'scent', 'stench', 'artist', 'stench', 'cylinder,', 'odour', 'hybrids', 'things', 'scenes', 'larvae', 'ancient', 'pastimes', 'larvae', 'doom', 'museum', 'summit', 'denizens', 'dread.']

('nameless', 'city') ['when', 'i', 'that', 'was', 'what', 'and', 'in', 'in', 'had', 'under', 'at', 'of', 'of']

('city', 'when') ['i']

('when', 'i') ['drew', 'came', 'was', 'had', 'chanced', 'glanced', 'saw', 'thought', 'did', 'tried', 'came', 'sounded', 'sat', 'fancied', 'looked', 'staggered', 'still', 'went', 'went', 'saw', 'think', 'dream', 'did', 'think', 'think', 'commit', 'make', 'brought', 'studied', 'started', 'think', 'telephoned', 'drove', 'developed', 'li
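
The follower lists printed above keep duplicates, so the relative frequency of each word in a list is the empirical probability of that transition. The sketch below shows one way to make those probabilities explicit. The helper name `transition_probabilities` and the shortened follower list are assumptions for illustration, not part of the class.

```python
from collections import Counter


def transition_probabilities(followers):
    """Turn a list of followers (with repeats) into {word: probability}."""
    counts = Counter(followers)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical, shortened follower list in the style of the output above.
sample_followers = ["city", "city", "city,", "fate", "city"]
probs = transition_probabilities(sample_followers)

for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word!r}: {p:.2f}")
```

Note that `'city'` and `'city,'` count as different words, because the preprocessing step strips only quote characters and leaves other punctuation attached to the tokens.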

* Generate the sentences

In [5]:
print(MC.generate_sentence(length=100, starting_word_id=7))

the nameless larvae of the ninth, and wondered how close a watch had been dark and shapely, and salt breezes swept up the harbour met nameless extinction from the altar, and paused in his study. the next day returned to the presence of the myths were of a strange hindoo, but he could form no idea what the carvings was correct, these abhorred things must have been trapped by the subtle stirring of the exclamation. there was hope that some noxious marine mind had learned from the sitting-room matchsafe, and edging through the midst of swan point cemetery were excluded,


### Exercise A

Copy the previous line, `print(MC.generate_sentence(length=100, starting_word_id=7))`, exactly as it is and generate the text again. What do you notice about the output?


In [6]:
print(MC.generate_sentence(length=100, starting_word_id=7))

the nameless scent was excessively pungent here; so much visual as cerebral, amidst which joseph curwen was wont to put you in a remote cellar storeroom, the tracks, the dirt, the hastily rifled wardrobe, the baffling bands were in a padded cell for sixteen and completed his course now lay on the way, was clad in cheap woollen togas--and sprinklings of helmeted legionaries and coarse-mantled, black-bearded tribesmen of the elder ones were choked it was infinitely painful, and colored by fantastic dreams and their sojournings in soul and messenger nyarlathotep. meanwhile the cliffs and boulders, with no flaw in my
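
The two runs start from the same words but diverge, because each next word is drawn at random. If repeatable output is wanted, the random number generator can be seeded. The sketch below demonstrates this on a small hand-written table. `sample_chain` and `toy_table` are hypothetical names, and the function takes its own `random.Random` instance instead of using the module-level functions as the class does.

```python
import random


def sample_chain(table, start, length, rng):
    """Walk the table from `start`, drawing each next word with `rng`."""
    words = list(start)
    n = len(start)
    while len(words) < length:
        key = tuple(words[-n:])
        if key not in table:  # dead end: this key has no recorded follower
            break
        words.append(rng.choice(table[key]))
    return " ".join(words)


# Hypothetical toy table in the same {(word1, word2): [followers]} form.
toy_table = {
    ("the", "cat"): ["sat", "ran", "sat"],
    ("cat", "sat"): ["on"],
    ("cat", "ran"): ["and"],
    ("sat", "on"): ["the"],
    ("on", "the"): ["mat"],
    ("the", "mat"): ["and"],
    ("mat", "and"): ["the"],
    ("ran", "and"): ["the"],
    ("and", "the"): ["cat", "cat"],
}

first = sample_chain(toy_table, ("the", "cat"), 12, random.Random(42))
second = sample_chain(toy_table, ("the", "cat"), 12, random.Random(42))
print(first == second)  # True: identical seeds give identical text
```

In the notebook itself, calling `random.seed(...)` with a fixed value immediately before `MC.generate_sentence(...)` should have the same effect, since the class draws from the module-level `random` functions.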


### Exercise B

Write a line of code to generate text of 40 words. Use a different seed to initiate the starting state.

In [7]:
print(MC.generate_sentence(length=40, starting_word_id=5))

drew nigh that gigantic reef. so i started violently. pickman reappeared with his california son whether or not by any wakeful souls in the hidden and nighted ocean. to behold them dancing by moonlight. so, atal said, heed a mans


### Exercise C


In the example above, the current state is represented by 2 words (`{(word1, word2): [word3]}`).
Using more words to represent the state (`{(word1, word2, word3): [word4]}`) should generate sentences
that make better sense, because the level of randomization is reduced: a longer key has fewer distinct
followers. For example, a given sequence of 20 words is less likely to appear in the text than a given sequence of 5 words.
                                    
- Instantiate a new Markov chain object, setting the parameter `ngram` to 3.
- Run the method create_probability_object()
- List out the dictionary within the markov chain object
- Run the method generate_sentence() to generate new statements

In [8]:
MC = markov_chain('./data/lovecraft.txt',ngram=3)
MC.create_probability_object()

i = 0
for key,value in MC.markov_keys.items():
    print(key,value)
    print()
    i+=1
    if i>50:
        break

('the', 'nameless', 'city') ['when', 'i', 'that', 'was', 'what', 'and', 'in', 'in', 'had', 'under', 'at', 'of']

('nameless', 'city', 'when') ['i']

('city', 'when', 'i') ['drew']

('when', 'i', 'drew') ['nigh']

('i', 'drew', 'nigh') ['the']

('drew', 'nigh', 'the') ['nameless']

('nigh', 'the', 'nameless') ['city']

('nameless', 'city', 'i') ['knew']

('city', 'i', 'knew') ['it']

('i', 'knew', 'it') ['was', 'was', 'lay', 'the', 'had', 'well,', 'would', 'i', 'he', 'i']

('knew', 'it', 'was') ['accursed.', 'this', 'keziahs', 'useless.']

('it', 'was', 'accursed.') ['i']

('was', 'accursed.', 'i') ['was']

('accursed.', 'i', 'was') ['traveling']

('i', 'was', 'traveling') ['in']

('was', 'traveling', 'in') ['a']

('traveling', 'in', 'a') ['parched']

('in', 'a', 'parched') ['and']

('a', 'parched', 'and') ['terrible']

('parched', 'and', 'terrible') ['valley']

('and', 'terrible', 'valley') ['under']

('terrible', 'valley', 'under') ['the']

('valley', 'under', 'the') ['moon,']

('unde

In [9]:
print(MC.generate_sentence(length=100, starting_word_id=7))

the nameless city of arabia deserta. as we flew above that tangle of alleys in another direction, it seems, for when we sighted a lamp-post we were in one of the neighboring ones were choked it was doubtful how they would regard a guest whose object was to see them and plead before them. no man had gone before, i ought to have been recently cleared. it took us only a moment to see which one might readily forget! (indistinguishable sounds) (a cultivated male human voice) ...is the lord of the woods, being...seven and nine, down the onyx steps...(tri)butes to
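
The claim in Exercise C, that a longer state reduces randomness, can be checked by counting how many distinct followers each key has on average. The sketch below does this on a small invented corpus. The names `build_followers` and `mean_distinct_followers` and the toy text are assumptions for illustration, and the numbers for the Lovecraft file will differ.

```python
from collections import defaultdict


def build_followers(words, n):
    """Map each n-word tuple to the list of words that follow it."""
    followers = defaultdict(list)
    for i in range(n, len(words)):
        followers[tuple(words[i - n:i])].append(words[i])
    return dict(followers)


def mean_distinct_followers(table):
    """Average number of different next words available per key."""
    return sum(len(set(nexts)) for nexts in table.values()) / len(table)


# Hypothetical toy corpus with some repeated phrases.
toy_words = ("the cat sat on the mat and the cat ran to the mat "
             "and the dog sat on the cat").split()

branching = {}
for n in (1, 2, 3):
    branching[n] = mean_distinct_followers(build_followers(toy_words, n))
    print(n, round(branching[n], 3))
```

On this toy corpus the average falls as `n` grows, which matches the `ngram=3` dictionary above, where most keys have a single follower. A value close to 1 also means the generator has few real choices, so longer keys tend to reproduce longer stretches of the source text verbatim.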
