## Markov Chain NLP

Watch video: [YouTube](https://youtu.be/E4WcBWuQQws?list=PLM8wYQRetTxBkdvBtz-gw8b9lcVkdXQKV)

This notebook is forked from [this Kaggle notebook](https://www.kaggle.com/code/orion99/markov-chain-nlp).

### Importing tools

In [49]:
import glob
import nltk
import string
import numpy as np

### Read the stories

In [14]:
files = glob.glob('sherlock/*.txt')
stories = []
for file in files:
    with open(file, 'r') as f:
        for line in f:
            line = line.strip()
            if line == '----------':
                break
            if line != '':
                stories.append(line)
print(f'Number of lines = {len(stories)}')

Number of lines = 215021


In [50]:
# This cell may take a while to run.
# nltk.word_tokenize needs NLTK's Punkt tokenizer data (see nltk.download).
cleaned_stories = []
translator = str.maketrans('', '', string.punctuation)
for line in stories:
    line = line.lower()
    line = line.translate(translator)
    tokens = nltk.word_tokenize(line)
    words = [word for word in tokens if word.isalpha()]
    cleaned_stories.extend(words)
print(f'Number of words = {len(cleaned_stories)}')

Number of words = 2341418
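
The cleaning step deletes punctuation before tokenizing, so words joined only by punctuation are fused together (the state `'mr holmeswhats'` in the transition output further down is one such case). The sketch below shows the effect on a single line and compares it with mapping punctuation to spaces. The input sentence is invented for illustration, and `str.split` stands in for `nltk.word_tokenize` so that the example runs without extra downloads.

```python
import string

# Hypothetical input line, invented for illustration (not taken from the corpus).
line = "The game, Mr. Holmes--what's next?"

# Same approach as the notebook cell: delete every punctuation character.
delete_punct = str.maketrans('', '', string.punctuation)
deleted = line.lower().translate(delete_punct).split()

# Alternative: map every punctuation character to a space instead.
space_punct = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = line.lower().translate(space_punct).split()

print(deleted)  # ['the', 'game', 'mr', 'holmeswhats', 'next']
print(spaced)   # ['the', 'game', 'mr', 'holmes', 'what', 's', 'next']
```

Mapping to spaces keeps `holmes` and `what` apart but leaves stray fragments such as `s`, so either choice is a trade-off.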


### Create a Markov Chain instance

In [60]:
class MarkovChain:
    def __init__(self, text: list, n_gram: int = 2):
        self.n_gram = n_gram
        self.markov_chain = {}
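        # group the words into consecutive, non-overlapping states of n_gram words
        # (for n_gram=2 this equals zip(text[::2], text[1::2]))
        self.n_grams = list(zip(*(text[i::n_gram] for i in range(n_gram))))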
        for i in range(len(self.n_grams) - 1):
            curr_state = ' '.join(self.n_grams[i])
            next_state = ' '.join(self.n_grams[i + 1])
            # count how often next_state follows curr_state
            if curr_state not in self.markov_chain:
                self.markov_chain[curr_state] = {}
            transitions = self.markov_chain[curr_state]
            transitions[next_state] = transitions.get(next_state, 0) + 1
        # calculating transition probabilities
        for curr_state, transition in self.markov_chain.items():
            total = sum(transition.values())
            for next_state, count in transition.items():
                self.markov_chain[curr_state][next_state] = count / total

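    def generate_story(self, start: str, limit: int = 30):
        # start must be a state: n_gram lowercase words joined by spaces
        # limit is the number of transitions; each one appends n_gram words
        n = 0
        curr_state = start
        story = curr_state + ' '
        while n < limit:
            # stop early at an unseen state or one with no recorded successor
            if curr_state not in self.markov_chain:
                break
            next_state = np.random.choice(list(self.markov_chain[curr_state].keys()),
                                          p=list(self.markov_chain[curr_state].values()))
            curr_state = next_state
            story = story + curr_state + ' '
            n += 1
        return story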

In [61]:
Mdl = MarkovChain(cleaned_stories)
print(f'Number of states = {len(Mdl.markov_chain.keys())}')

Number of states = 215418
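
To make the state construction concrete, here is a minimal sketch on a toy word list. It is a compact restatement of the idea in the class above, not the notebook's own code: `build_chain` and the toy sentence are hypothetical names and data introduced only for this example.

```python
from collections import Counter, defaultdict

def build_chain(words, n=2):
    """Hypothetical helper: non-overlapping n-word states -> transition probabilities."""
    # Chunk the word list into consecutive, non-overlapping n-word states.
    states = [' '.join(words[i:i + n]) for i in range(0, len(words) - n + 1, n)]
    counts = defaultdict(Counter)
    for curr_state, next_state in zip(states, states[1:]):
        counts[curr_state][next_state] += 1
    # Normalise each row of counts so that it sums to 1.
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

toy = 'the game is afoot the game is up the game was afoot'.split()
chain = build_chain(toy)
print(chain['the game'])
```

In this toy corpus `'the game'` is followed once each by `'is afoot'`, `'is up'` and `'was afoot'`, so each transition gets probability 1/3. The final state `'was afoot'` has no successor and therefore never becomes a key of the chain.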


In [62]:
print('All possible transitions from \'the game\' state are:')
Mdl.markov_chain['the game']

All possible transitions from 'the game' state are:


{'your letter': 0.03389830508474576,
 'for the': 0.05084745762711865,
 'is up': 0.05084745762711865,
 'was whist': 0.03389830508474576,
 'was up': 0.0847457627118644,
 'in that': 0.05084745762711865,
 'the lack': 0.05084745762711865,
 'may wander': 0.03389830508474576,
 'now a': 0.03389830508474576,
 'mr holmeswhats': 0.03389830508474576,
 'ay whats': 0.03389830508474576,
 'my friend': 0.03389830508474576,
 'fairly by': 0.03389830508474576,
 'is not': 0.03389830508474576,
 'was afoot': 0.03389830508474576,
 'worth it': 0.01694915254237288,
 'you are': 0.01694915254237288,
 'now count': 0.03389830508474576,
 'i am': 0.01694915254237288,
 'for all': 0.03389830508474576,
 'was in': 0.03389830508474576,
 'is hardly': 0.03389830508474576,
 'would have': 0.03389830508474576,
 'is and': 0.05084745762711865,
 'in their': 0.03389830508474576,
 'is afoot': 0.01694915254237288,
 'my own': 0.01694915254237288,
 'at any': 0.01694915254237288,
 'was not': 0.01694915254237288}
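
Each probability above is a count divided by the number of times `'the game'` was followed by another state. The sketch below recovers those counts from a few of the printed values. It assumes the rarest transition was seen exactly once, which is an inference from the output rather than something the notebook states.

```python
# A few entries copied from the output above.
transitions = {
    'was up': 0.0847457627118644,
    'is up': 0.05084745762711865,
    'was afoot': 0.03389830508474576,
    'is afoot': 0.01694915254237288,
}

# Assumption: the smallest probability corresponds to a single observation.
total = round(1 / min(transitions.values()))
counts = {state: round(p * total) for state, p in transitions.items()}

print(total)   # 59
print(counts)  # {'was up': 5, 'is up': 3, 'was afoot': 2, 'is afoot': 1}
```

Under that assumption, the 29 listed transitions account for 59 observations in total, which is why the probabilities sum to 1.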

### Generating Sherlock Holmes stories

In [67]:
for i in range(5):
    result = Mdl.generate_story('dear holmes', limit=9)
    print(f'{i + 1}. {result}')

1. dear holmes it is very customary for pawnbrokers in england mr holmes and he left a trap into which he 
2. dear holmes he has a salary of a striking and bizarre without being criminal we have a cousin of his 
3. dear holmes am i to do their own persons but those whom they feared or hated by injuring not only 
4. dear holmes i ejaculated no no i aint afeared of anything on this side path he was still vague but 
5. dear holmes i ejaculated well really it seems rather useless since you are both sound sleepers hunter was sunk in 
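
With `limit=9` each story has 2 starting words plus 9 sampled two-word states, or 20 words in total. The sketch below reproduces that random walk on a toy transition table. It is only an illustration: the table, the `walk` helper and the fixed seed are hypothetical, and it uses `np.random.default_rng` rather than the global `np.random.choice` so that the output is repeatable. It also stops when a state has no recorded successor, since such a state cannot be sampled from.

```python
import numpy as np

# Toy transition table (hypothetical, not taken from the corpus).
chain = {
    'the game': {'is afoot': 0.5, 'is up': 0.5},
    'is afoot': {'the game': 1.0},
    # 'is up' has no outgoing transitions: a dead end.
}

def walk(chain, start, limit, seed=0):
    """Hypothetical helper: sample up to `limit` transitions starting from `start`."""
    rng = np.random.default_rng(seed)  # seeded so repeated calls give the same walk
    states = [start]
    for _ in range(limit):
        transitions = chain.get(states[-1])
        if not transitions:  # unknown or dead-end state: stop early
            break
        choices = list(transitions)
        probs = list(transitions.values())
        states.append(str(rng.choice(choices, p=probs)))
    return ' '.join(states)

story = walk(chain, 'the game', limit=9)
print(story, len(story.split()))
```

The walk never exceeds `2 * (limit + 1)` words and ends sooner if it reaches `'is up'`.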


In [68]:
for i in range(5):
    result = Mdl.generate_story('i would', limit=9)
    print(f'{i + 1}. {result}')

1. i would bring him into the massive masonry hum said holmes sinking back in his chair in the house you 
2. i would spend my last copper to shield him and have no hint to give me the shelter of your 
3. i would have told you nothing but the laws were more harshly administered thirty years ago his luggage was to 
4. i would do justice upon him this i expect very shortly to my wife as an amateur that i could 
5. i would have you never heard of the sort have already been out i shut the door and forming the 


In [90]:
result = Mdl.generate_story('the case', limit=100)
print(result.strip())

the case is the very word said holmes who was far greater than i knew since no practical use whatever we had an example i may mention the son of one of his characteristics that his life is despaired of dear dear son now that approaching disgrace begins to darken my door and peeped through the open window endeavoured in every way it corresponded with the dog on my mentioning the detectives of fact the drawn look upon this as a hypothesis and noted that e was represented by picture picture of a regal and stately lady in the case even the masterful millionaire had found my occasional retreat still less that you were all at the devil do you mean holmes had sprung to his household and the effect it was a clumsy fabrication which simply could not be delayed mr sherlock holmes and myself yes the horse and clad in some darkcoloured stuff with a black gap like the mouth of this treaty becoming known a severe attack make a case against him was one tin box so you can leave me in the opposite dire