Text Generation with Markov Chains in Python
============================================

In [3]:
# texts = ['text/grimm_tales.txt', 'text/little_red_riding_hood.txt',\
#          'text/robin_hood_prologue.txt']
files = ['grimm_tales.txt']
text = ''
for path in files:
    # Use a separate name for the file handle so it does not shadow the path
    with open(path, 'r') as handle:
        text += handle.read()


In [4]:
print(text[:100])

THE GOLDEN BIRD

A certain king had a beautiful garden, and in the garden stood a tree
which bore go


Weather as a Markov Chain
-------------------------

![alt text](images/markov_weather.png "Weather")

Matrix representation (rows are current state, columns are next state):

| | Sunny | Cloudy | Rainy |
| --- | --- | --- | --- |
| **Sunny** | 0.6 | 0.1 | 0.3 |
| **Cloudy** | 0.3 | 0.3 | 0.4 |
| **Rainy** | 0.3 | 0.2 | 0.5 |
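
As a rough illustration of how such a matrix drives a chain, the sketch below samples a week of weather from the table above. The function name `simulate_weather`, the fixed seed, and the dictionary layout are assumptions made for this example only.

```python
import random

# Transition probabilities from the table: the outer key is the current state,
# the inner keys are the possible next states.
P = {
    "Sunny":  {"Sunny": 0.6, "Cloudy": 0.1, "Rainy": 0.3},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.3, "Rainy": 0.4},
    "Rainy":  {"Sunny": 0.3, "Cloudy": 0.2, "Rainy": 0.5},
}

def simulate_weather(start, n_days, seed=0):
    """Return a list of n_days states, beginning with `start` (hypothetical helper)."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < n_days:
        row = P[path[-1]]  # the distribution depends only on the current state
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        path.append(nxt)
    return path

forecast = simulate_weather("Sunny", 7)
print(forecast)
```

Each step looks only at the last entry of `path`, which is the Markov property the matrix encodes.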


Text as a Markov Chain
----------------------

**The cat ran over the dog.**

![alt text](images/markov_text1.png "Text")

Matrix representation (rows are current state, columns are next state):

| | the | cat | ran | over | dog | . |
| --- | --- | --- | --- | --- | --- | --- |
| **the** | 0 | 0.5 | 0 | 0 | 0.5 | 0 |
| **cat** | 0 | 0 | 1 | 0 | 0 | 0 |
| **ran** | 0 | 0 | 0 | 1 | 0 | 0 |
| **over** | 1 | 0 | 0 | 0 | 0 | 0 |
| **dog** | 0 | 0 | 0 | 0 | 0 | 1 |
| **.** | 0 | 0 | 0 | 0 | 0 | 1 |
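
The table above can be reproduced by counting adjacent token pairs and dividing each count by its row total. The following is a minimal sketch using plain dictionaries; the self-loop added for `.` is an assumption made to match the last row of the table, since the final token has no observed successor.

```python
from collections import Counter, defaultdict

tokens = ["the", "cat", "ran", "over", "the", "dog", "."]

# Count how often each token is followed by each other token
counts = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    counts[current][nxt] += 1

# Assumption: a state with no observed successor loops back to itself
for state in set(tokens):
    if not counts[state]:
        counts[state][state] = 1

# Divide each count by its row total to get probabilities
transition = {
    state: {nxt: c / sum(successors.values()) for nxt, c in successors.items()}
    for state, successors in counts.items()
}

print(transition["the"])  # {'cat': 0.5, 'dog': 0.5}
```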



Define states as the distinct word tokens

In [41]:
import re

# Keep only letters, basic punctuation, apostrophes, newlines and spaces.
# The range is A-Za-z rather than A-z, which would also match [ \ ] ^ _ `
text = re.sub(r"[^A-Za-z,.!?'\n ]+", "", text)
print(text[:50].split(), '\n')

# Surround each punctuation mark with spaces so it becomes its own token.
# The parentheses capture the matched character and r"\1" re-inserts it;
# the raw string keeps the backslash from being read as an escape.
# See https://www.lzone.de/examples/Python%20re.sub
text = re.sub(r"([.,!?])", r" \1 ", text)

tokens = text.lower().split()
distinct_states = list(set(tokens))
print(distinct_states[:10])

['THE', 'GOLDEN', 'BIRD', 'A', 'certain', 'king', 'had', 'a', 'beautiful', 'ga'] 

['joined', 'returned', 'observe', 'reproached', 'bean', 'dripping', 'sounding', 'milkpail', 'heads', 'east']
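
To see what the cleaning and splitting steps do on a small input, here is a sketch that wraps them in a function. The name `tokenize` and the sample string are assumptions for this example; the regular expressions follow the cell above.

```python
import re

def tokenize(raw):
    """Clean `raw` and split it into lowercase word and punctuation tokens."""
    # Drop everything except letters, basic punctuation, apostrophes, newlines, spaces
    cleaned = re.sub(r"[^A-Za-z,.!?'\n ]+", "", raw)
    # Put spaces around punctuation so each mark becomes its own token
    spaced = re.sub(r"([.,!?])", r" \1 ", cleaned)
    return spaced.lower().split()

sample = "Hello, world! It's 3 o'clock."
print(tokenize(sample))
# ['hello', ',', 'world', '!', "it's", "o'clock", '.']
```

The digit is removed by the first substitution, while apostrophes stay inside their words.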


Define transition matrix

In [42]:
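from scipy.sparse import csr_matrix

# Square matrix of transition counts: one row and one column per distinct state
m = csr_matrix(
    (len(distinct_states), len(distinct_states)),
    dtype=int
)

# Map each state to its row/column index in the matrix
state_index = {state: idx_num for idx_num, state in enumerate(distinct_states)}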

Count transitions and fill in transition matrix

In [43]:
%%time
for i in range(len(tokens)-1):
    row = state_index[tokens[i]]
    col = state_index[tokens[i+1]]
    # Element-wise updates of a csr_matrix are slow and trigger a
    # SparseEfficiencyWarning, which explains the long run time below.
    m[row, col] += 1


CPU times: user 26.2 s, sys: 0 ns, total: 26.2 s
Wall time: 26.3 s
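
One possible way to avoid the slow element-wise updates is to collect all (row, column) pairs first and build the matrix in a single call. The sketch below assumes `scipy.sparse.coo_matrix`, which sums duplicate entries when converted to CSR. The function name `build_transition_counts` and the use of `sorted` for a reproducible state order are choices made for this example, not part of the original notebook.

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_transition_counts(tokens):
    """Return (count matrix in CSR format, distinct states, state -> index map)."""
    distinct_states = sorted(set(tokens))
    state_index = {s: i for i, s in enumerate(distinct_states)}
    idx = np.array([state_index[t] for t in tokens])
    rows, cols = idx[:-1], idx[1:]  # each token paired with its successor
    data = np.ones(len(rows), dtype=int)
    n = len(distinct_states)
    # Duplicate (row, col) pairs are summed during the conversion to CSR
    m = coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
    return m, distinct_states, state_index

toy = "the cat ran over the dog .".split()
m_toy, toy_states, toy_index = build_transition_counts(toy)
print(m_toy[toy_index["the"], :].toarray())
```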


Generate new text

In [46]:
%%time
import numpy as np

# Start from a randomly chosen state
start_state_index = np.random.randint(len(distinct_states))
state = distinct_states[start_state_index]
num_sentences = 0
output = state.capitalize()
capitalize = False

while num_sentences < 3:
    row = m[state_index[state], :]
    probabilities = row / row.sum()             # normalize the counts to probabilities
    probabilities = probabilities.toarray()[0]  # [0] extracts the single row as a 1-D array

    # Sample the index of the next state from that distribution
    next_state_index = np.random.choice(
        len(distinct_states),
        1,
        p=probabilities
    )
    next_state = distinct_states[next_state_index[0]]

    if next_state in ('.', '!', '?'):
        # End of sentence: start a new paragraph and capitalize the next word
        output += next_state + '\n\n'
        capitalize = True
        num_sentences += 1
    elif next_state == ',':
        output += next_state  # no space before a comma
    elif capitalize:
        output += next_state.capitalize()
        capitalize = False
    else:
        output += ' ' + next_state

    state = next_state
print(output)

Finely.

Then she was so far, let the font at the woman and both sides of the little kid had bewailed her dainty tongue clave to stay the haycart.

So he cried out of her more than that however, and sang my workshop making sport of gold, he could hardly knows i will have you in the moss, and courteously that not stay here very angry, pray give you to climb like to it went to bewail her hair fall on the door even get nothing was making a kings daughter, the grinder never was as quickly?


CPU times: user 90.7 ms, sys: 0 ns, total: 90.7 ms
Wall time: 143 ms
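
The formatting rules used in the loop above (no space before punctuation, a blank line after each sentence, a capital letter at the start of the next one) can be checked separately from the random sampling. This sketch isolates them in a hypothetical helper, `format_tokens`, which is not part of the original notebook.

```python
def format_tokens(tokens):
    """Join tokens into text using the same spacing and capitalization rules."""
    output = ""
    capitalize = True  # the first word starts a sentence
    for tok in tokens:
        if tok in (".", "!", "?"):
            output += tok + "\n\n"  # end the sentence and start a new paragraph
            capitalize = True
        elif tok == ",":
            output += tok           # no space before a comma
        elif capitalize:
            output += tok.capitalize()
            capitalize = False
        else:
            output += " " + tok
    return output

sample_tokens = ["so", "he", "cried", ",", "and", "ran", ".", "then", "she", "left", "!"]
print(format_tokens(sample_tokens))
```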


k-Word Markov Chain
-------------------

**The cat ran over the dog.**

![alt text](images/markov_text2.png "Text")

Matrix representation (rows are current state, columns are next state):

| | the cat | cat ran | ran over | over the | the dog | dog. |
| --- | --- | --- | --- | --- | --- | --- |
| **the cat**  | 0 | 1 | 0 | 0 | 0 | 0 |
| **cat ran**  | 0 | 0 | 1 | 0 | 0 | 0 |
| **ran over** | 0 | 0 | 0 | 1 | 0 | 0 |
| **over the** | 0 | 0 | 0 | 0 | 1 | 0 |
| **the dog**  | 0 | 0 | 0 | 0 | 0 | 1 |
| **dog.**     | 0 | 0 | 0 | 0 | 0 | 1 |



Define states as consecutive token pairs

In [17]:
k = 2
tokens = text.lower().split()
states = [tuple(tokens[i:i+k]) for i in range(len(tokens) - k + 1)]
# Tuples are needed because a list is unhashable and cannot be a dict key.
    
distinct_states = list(set(states))
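
Applied to the example sentence, the same sliding window yields the six states shown in the table. The sketch below also shows why tuples are used. `dict.fromkeys` is used here instead of `set` only to keep the states in order of first appearance, which is a choice made for this example.

```python
tokens = "the cat ran over the dog .".split()
k = 2

# Slide a window of k tokens over the text; each window is one state
states = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
print(states)

# Tuples are hashable, so they can serve as dictionary keys
state_index = {state: idx for idx, state in enumerate(dict.fromkeys(states))}
print(state_index[("dog", ".")])

# A list in the same position raises a TypeError
try:
    {["the", "cat"]: 0}
except TypeError as err:
    print("list as key fails:", err)
```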

Define and fill transition matrix

In [18]:
from scipy.sparse import csr_matrix

m = csr_matrix(
    (len(distinct_states), len(distinct_states)),
    dtype=int
)

state_index = {state: idx_num for idx_num, state in enumerate(distinct_states)}

# Each state transitions to the state shifted one token to the right
for i in range(len(tokens) - k):
    state = tuple(tokens[i:i+k])
    next_state = tuple(tokens[i+1:i+k+1])
    row = state_index[state]
    col = state_index[next_state]
    m[row, col] += 1


Generate new text

In [None]:
from scipy.sparse.linalg import norm  # np.linalg.norm is not designed for sparse matrices

norm(m[state_index[state], :])

In [42]:
# Check that the normalized rows of a few states each sum to 1
for state in list(state_index.keys())[8:18]:
    row = m[state_index[state], :]
    p = row / row.sum()  # normalize the counts to probabilities
    print(type(p), round(p.sum(), 5))

<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
<class 'scipy.sparse.csr.csr_matrix'> 1.0
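
The row-by-row normalization checked above could also be applied to the whole count matrix in one step. The following sketch multiplies by a diagonal matrix of inverse row sums. The helper name `normalize_rows` and the decision to leave all-zero rows as zeros are assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def normalize_rows(counts):
    """Scale each row of a sparse count matrix so that it sums to 1."""
    row_sums = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    inv = np.zeros_like(row_sums)
    nonzero = row_sums > 0
    inv[nonzero] = 1.0 / row_sums[nonzero]  # rows without transitions stay all zero
    return diags(inv) @ counts

counts = csr_matrix(np.array([[0, 2, 2], [1, 0, 0], [0, 0, 0]]))
probs = normalize_rows(counts)
print(probs.toarray())
```

With the probabilities precomputed, the generation loop would only need to read a row instead of dividing on every step.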


In [30]:
%%time
import numpy as np

# Start from a randomly chosen k-token state
start_state_index = np.random.randint(len(distinct_states))
state = distinct_states[start_state_index]

num_sentences = 0
output = ' '.join(state).capitalize()
capitalize = False

while num_sentences < 3:
    row = m[state_index[state], :]
    probabilities = row / row.sum()             # normalize the counts to probabilities
    probabilities = probabilities.toarray()[0]  # [0] extracts the single row as a 1-D array

    # Sample the index of the next state from that distribution
    next_state_index = np.random.choice(
        len(distinct_states),
        1,
        p=probabilities
    )
    next_state = distinct_states[next_state_index[0]]

    # Only the last token of the next state is new; the others overlap with the current state
    if next_state[-1] in ('.', '!', '?'):
        # End of sentence: start a new paragraph and capitalize the next word
        output += next_state[-1] + '\n\n'
        capitalize = True
        num_sentences += 1
    elif next_state[-1] == ',':
        output += next_state[-1]  # no space before a comma
    elif capitalize:
        output += next_state[-1].capitalize()
        capitalize = False
    else:
        output += ' ' + next_state[-1]

    state = next_state
print(output)

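TypeError: unsupported operand type(s) for //: 'csr_matrix' and 'int'

This error comes from floor division (`//`) on a sparse matrix, which `csr_matrix` does not support; normalizing a row requires true division (`/`).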
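
Since each k-token state overlaps its successor in all but one token, a k-word chain can also be stored as a mapping from each state to the tokens seen after it. This avoids the large square matrix. The sketch below is an alternative to the notebook's approach, not a reconstruction of it. The names `build_successors` and `generate`, the toy corpus, and the fixed seed are assumptions.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens, k=2):
    """Map each k-token state to a Counter of the tokens observed after it."""
    successors = defaultdict(Counter)
    for i in range(len(tokens) - k):
        state = tuple(tokens[i:i + k])
        successors[state][tokens[i + k]] += 1
    return successors

def generate(successors, start, max_tokens=20, seed=0):
    """Sample tokens until max_tokens is reached or a state has no successor."""
    rng = random.Random(seed)
    state = tuple(start)
    out = list(state)
    while len(out) < max_tokens and state in successors:
        options = successors[state]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        out.append(nxt)
        state = state[1:] + (nxt,)  # shift the window by one token
    return out

toy = "the cat ran over the dog . the dog ran over the cat .".split()
successors = build_successors(toy, k=2)
print(generate(successors, ("the", "cat")))
```

The loop stops early when it reaches a state with no recorded successor, which is the case the matrix version would hit as a row summing to zero.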