# Word-Level Text Generation with a Markov Chain

In [1]:
import numpy as np  # used below for np.random.choice
import functions as f
from Text import *
from Chain_class import *

In [2]:
path = 'data/train.txt'
input_text = f.read_txt(path)

## Text Preprocessing

The loaded text file contains tales scraped from websites. Creating an instance of the Text object preprocesses and tokenizes the text, and prepares it for use in the Markov model.

In [3]:
tales_text = Text(input_text)

The preprocessed text contains no newline characters. Punctuation is limited to a small set of marks, each separated by white space so that it is treated as its own token. Unlike in many other NLP tasks, we don't apply stop word removal, lemmatization, stemming or other normalization techniques, because the generated text should keep the original word forms.
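The snippet below is a minimal sketch of this kind of preprocessing, not the actual implementation of the Text class. The function name preprocess and the exact set of punctuation marks are assumptions made for illustration.

```python
import re

def preprocess(raw: str) -> str:
    """Hypothetical helper: pad punctuation with spaces and drop newlines."""
    # Surround each punctuation mark with spaces so it becomes its own token.
    # The set of marks here is an assumption, not the one used by Text.
    text = re.sub(r"([.,:;!?])", r" \1 ", raw)
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "Once upon a time,\nthere lived a sultan."
cleaned = preprocess(sample)
tokens = cleaned.split(" ")
print(cleaned)
print(tokens)
```

Splitting the cleaned string on single spaces then yields words and punctuation marks as separate tokens.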

In [4]:
tales_text.content[:1000]

"Once upon a time there lived a sultan who loved his garden dearly , and planted it with trees and flowers and fruits from all parts of the world . He went to see them three times every day : first at seven o'clock , when he got up , then at three , and lastly at half - past five . There was no plant and no vegetable which escaped his eye , but he lingered longest of all before his one date tree . Now the sultan had seven sons . Six of them he was proud of , for they were strong and manly , but the youngest he disliked , for he spent all his time among the women of the house . The sultan had talked to him , and he paid no heed ; and he had beaten him , and he paid no heed ; and he had tied him up , and he paid no heed , till at last his father grew tired of trying to make him change his ways , and let him alone . Time passed , and one day the sultan , to his great joy , saw signs of fruit on his date tree . And he told his vizir , 'My date tree is bearing ; ' and he told the officers ,

## Building the Markov Chain Model

At the heart of the Markov chain model is the transition matrix, which stores the probability of every observed state transition. To build it, we extract from the text each sequence of n words (n=3 in this example) together with the word that follows it.
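The extraction and counting step can be sketched as follows. This is an illustration under assumptions: it uses plain dictionaries rather than the sparse matrix and index lookups used by the Chain class, and the function name build_transitions is hypothetical.

```python
from collections import Counter, defaultdict

def build_transitions(tokens, n):
    """Hypothetical helper: map each n-word state to P(next word | state)."""
    counts = defaultdict(Counter)
    # Slide a window of n tokens over the text and count what follows it.
    for i in range(len(tokens) - n):
        state = " ".join(tokens[i:i + n])
        counts[state][tokens[i + n]] += 1
    # Normalize the counts of each state into a probability distribution.
    probs = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probs[state] = {word: c / total for word, c in followers.items()}
    return probs

tokens = "the cat sat on the mat and the cat sat on the rug".split()
probs = build_transitions(tokens, n=3)
print(probs["cat sat on"])
print(probs["sat on the"])
```

In this toy text, "cat sat on" is always followed by the same word, while "sat on the" has two different followers, so its probability mass is split between them.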

In [5]:
chain_model = Chain(tales_text, n=3)

In [6]:
chain_model.tokens_info()
chain_model.ngrams_info()

total tokens: 890750, distinct tokens: 25165
ngrams level: 3, total ngrams: 890748, distinct ngrams: 555205


Example phrase: "And the sultan replied"
* We extract the sequence of length 3: "And the sultan"
* And the following word: "replied"

Using the corresponding indices, we can look up the conditional probability of this transition in the matrix. In this case it is about 0.17, which means that other words also follow the phrase "And the sultan" in the text.

In [14]:
# 'And the sultan replied'
print(chain_model.ngram2ind['And the sultan'])
print(chain_model.token2ind['replied'])

chain_model.transition_matrix_prob[130166,23535:23545].todense()

130166
23540


matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.16666667, 0.        , 0.        , 0.        , 0.        ]])

## Text Generation with the Markov Chain

To generate the next word for a given sequence, we look up this sequence in the transition matrix and randomly pick one word according to the probability distribution stored in the matrix for this sequence.
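A minimal sketch of this sampling step is shown below. The distribution is a made-up toy row, not one taken from the trained model, and the function name next_word is hypothetical; the actual generate_sequence method may work differently.

```python
import random

def next_word(distribution, rng):
    """Hypothetical helper: draw one word according to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    # random.choices draws with replacement, proportionally to the weights.
    return rng.choices(words, weights=weights, k=1)[0]

# Toy row of a transition matrix (invented values for illustration only).
row = {"mat": 0.2, "rug": 0.8}
rng = random.Random(0)  # seeded so that repeated runs give the same draws
draws = [next_word(row, rng) for _ in range(1000)]
print(draws.count("mat"), draws.count("rug"))
```

Generating a longer text repeats this step: the sampled word is appended, the window of the last n words becomes the new state, and the lookup is done again.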

In [8]:
prefixes = ['the young man', 'Once upon a', 'As soon as']
temperatures = [1, 0.7, 0.4, 0.1]

A temperature parameter controls the amount of stochasticity in the sampling process: it determines how predictable the choice of the next word will be. Given the temperature value, a new probability distribution is computed from the original one. Lower temperatures concentrate the probability on the most likely words, which makes the output more predictable.
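One common way to compute the new distribution is to raise each probability to the power 1/temperature and renormalize. The sketch below assumes this formula; the notebook does not show how generate_sequence applies the temperature, so the function name apply_temperature and the formula itself are assumptions.

```python
def apply_temperature(probs, temperature):
    """Hypothetical helper: reweight a distribution by a temperature."""
    # p ** (1 / T) equals exp(log(p) / T); zero entries stay at zero.
    scaled = [p ** (1.0 / temperature) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    # Renormalize so that the reweighted values sum to 1 again.
    return [s / total for s in scaled]

original = [0.5, 0.3, 0.2]  # toy distribution, invented for illustration
for t in [1, 0.7, 0.4, 0.1]:
    print(t, [round(p, 3) for p in apply_temperature(original, t)])
```

With a temperature of 1 the distribution is unchanged. As the temperature approaches 0, almost all of the probability moves to the most likely word, so sampling becomes nearly deterministic.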

In [9]:
for temperature in temperatures:
    print('temperature:', temperature)
    print(chain_model.generate_sequence(np.random.choice(prefixes), 100, temperature=temperature))
    print('\n')

temperature: 1
Once upon a time there lived a sultan who loved his daughter, the princess thought him rather womanish in some ways, and displayed to the astonished king the well- spring. She again allowed the frog to share her couch, and listened to see where her little one had strayed to, because she knew if he went on shore, whither he bore whole bales of the finest stuffs and goodly merchandise from his forest treasure- house; and when he hurried back to the palace, Jesper met the king behind the barn, and pointing


temperature: 0.7
Once upon a time there lived, some couple of hundred yards from the palace somewhat irksome, and it came off like a glove. Well now, says he.'Blur- an- shaw- ay- s- foni- mi- hayn- da-- Send him up here and I'll do for him; and as for the others, but the Moon will be here anon. This, Gentle Reader, have I trodden Fortune under my feet. He stood upright for a little, and caught the gourd while the man had


temperature: 0.4
As soon as the king and queen 

### Markov Chain model with n=5

In [10]:
chain_model_n5 = Chain(tales_text, n=5)

prefixes_n5 = ['the rich men of the', 'Where are you going ?', 'Once upon a time there']

In [11]:
for temperature in temperatures:
    print('temperature:', temperature)
    print(chain_model_n5.generate_sequence(np.random.choice(prefixes_n5), 100, temperature=temperature))
    print('\n')

temperature: 1
the rich men of the town. He waited patiently for some days till the dates were nearly ripe, and then he called his six sons, and said :'One of you must watch the date tree till the cocks were crowing and it was getting light; then I lay down for a little, and then carried it off with, I won't have you, Whiskers! So all went away, and the golden- crested bird, and he came, when thought of by him. And when he told the bird of his sufferings, the


temperature: 0.7
Once upon a time there was a peasant and his wife who had an only son named Mikko. As the mother lay dying the young man wept bitterly. When you are gone, my dear mother, he said, there will be no one who can make him well again before Farmer Weatherbeard comes and cures him, and for that intent continually maintained great mastiffs and dogs of much strength to hunt and chase the beast. In the end, three strips of skin were cut from his back; pepper and salt were sprinkled into the wound; and


temperature: 0.4


### Markov Chain model with n=1

In [12]:
chain_model_n1 = Chain(tales_text, n=1)

prefixes_n1 = ['Once', 'witch', 'princess']

In [13]:
for temperature in temperatures:
    print('temperature:', temperature)
    print(chain_model_n1.generate_sequence(np.random.choice(prefixes_n1), 100, temperature=temperature))
    print('\n')

temperature: 1
princess speaks get rich. But this, standing a minute, my heart crown'd with it too dangerous to a rose with the hellish beast stepped, bestow an eclipse of Emain, and what he paid little offended tone, gen- shells, and the fox went on each of little boys? He travelled out, and the man running away and none succeeded in the wood. What is here. When you to go with thanks, binding oaths I now under the shields they are you, saying, Know, my eldest sister


temperature: 0.7
witch baby grew tired, a wonderful salad in that women do you shall die with two boys to stitch. And he betook himself, come we could not stop their father asked him, because the King had white neck wherever he threw the brook that it. Like a little bird, he dragged her, what is sharper than they tauld him from the philtre, just for ever heard that you did so, and seated himself grew,'Oh, and said, you never find the hardest and men, I am dying Muslim


temperature: 0.4
witch woman, sir?' replied with a 