#### Assignment 9 - Markov Chain 
Implement a Markov chain working with whole words instead of letters.
Make the number of words per sequence, k, an adjustable parameter, but keep it at k=2 because of memory limitations.

Example: with k=2, “word1 word2” is one of the possible sequences (states).

To test your implementation, you can use a smaller text.
If you want, you can try to use something larger as an input text (a small book).

Hint: Use a sparse dictionary-of-keys matrix (scipy.sparse.dok_matrix).
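One way to read the hint is sketched below. It assumes that rows index the k-word states, columns index the vocabulary, and each cell holds a transition count. The function name `build_sparse_counts` and the index dictionaries are hypothetical, chosen only for illustration. A dense matrix would need one cell for every (state, word) combination, almost all of them zero, whereas `dok_matrix` stores only the transitions that actually occur.

```python
import numpy as np
from scipy.sparse import dok_matrix

def build_sparse_counts(tokens, k=2):
    """Count transitions from each k-word state to the word that follows it."""
    # Every window of k consecutive words that has a successor is a state.
    states = sorted({tuple(tokens[i:i + k]) for i in range(len(tokens) - k)})
    vocab = sorted(set(tokens))
    state_index = {state: row for row, state in enumerate(states)}
    word_index = {word: col for col, word in enumerate(vocab)}

    # Only non-zero cells are stored, so memory grows with the number of
    # observed transitions, not with len(states) * len(vocab).
    counts = dok_matrix((len(states), len(vocab)), dtype=np.int32)
    for i in range(len(tokens) - k):
        row = state_index[tuple(tokens[i:i + k])]
        col = word_index[tokens[i + k]]
        counts[row, col] += 1
    return counts, state_index, word_index

tokens = "the cat sat on the mat and the cat ran off".split()
counts, state_index, word_index = build_sparse_counts(tokens, k=2)
print(counts.shape, counts.nnz)
```

For sampling, a row of counts can be read back and used as weights; converting to CSR with `counts.tocsr()` once the matrix is built generally makes row access faster.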


1. Start with an input text whose phrases will be used to generate the output. For every pair of consecutive words in the input, record a list of the words that can follow that pair.
2. Once the data structure is built, we can generate as much or as little output as we want. Start with any pair of words that occurs in the input, and randomly pick one of its possible third words. Then slide the pair forward by one word, so that the new pair consists of the old second word and the word you just generated, randomly pick another word, and so on.
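The two steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a reference solution: the names `build_followers` and `generate` are hypothetical, the text is split on whitespace instead of being properly tokenized, and generation simply stops when it reaches a state that has no recorded follower.

```python
import random
from collections import defaultdict

def build_followers(tokens, k=2):
    """Step 1: map every k-word tuple to the list of words that follow it."""
    followers = defaultdict(list)
    for i in range(len(tokens) - k):
        state = tuple(tokens[i:i + k])
        # Repeated followers are kept, so frequent words are picked more often.
        followers[state].append(tokens[i + k])
    return dict(followers)

def generate(followers, num_words, seed_state=None, rng=random):
    """Step 2: walk the chain, producing at most `num_words` words."""
    state = seed_state if seed_state is not None else rng.choice(list(followers))
    output = list(state)
    while len(output) < num_words and state in followers:
        next_word = rng.choice(followers[state])
        output.append(next_word)
        # Slide the window: drop the oldest word, append the new one.
        state = state[1:] + (next_word,)
    return output[:num_words]

tokens = "the cat sat on the mat and the cat ran off".split()
followers = build_followers(tokens, k=2)
print(followers[("the", "cat")])  # ['sat', 'ran']
print(generate(followers, num_words=8, seed_state=("the", "cat")))
```

Storing followers as plain lists keeps the sampling step to a single `rng.choice` call, at the cost of storing duplicates; a count-based table trades that memory for a weighted draw.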

In [12]:
!pip install nltk


Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2023.10.3-cp310-cp310-macosx_10_9_x86_64.whl (296 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.10.3


In [1]:
import random
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nimishasen/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
from collections import defaultdict
from nltk.tokenize import word_tokenize
import random

def preprocess_text(text):
    # Lowercase the text and split it into word and punctuation tokens
    return word_tokenize(text.lower())

def create_markov_chain(states, text):
    # Note: `states` is not used here; states are discovered while scanning `text`.
    # Each state is a single token, so this is a first-order chain.
    transition_matrix = defaultdict(dict)

    # Count how often each token is followed by each other token
    for current_state, next_state in zip(text, text[1:]):
        followers = transition_matrix[current_state]
        followers[next_state] = followers.get(next_state, 0) + 1

    # Normalize the counts of each state to transition probabilities
    for followers in transition_matrix.values():
        total_transitions = sum(followers.values())
        for next_state in followers:
            followers[next_state] /= total_transitions

    # Return a generator function bound to this transition table
    return lambda num_words, seed_word: simulate_chain(num_words, seed_word, transition_matrix)

def simulate_chain(num_words, seed_word, transition_matrix):
    current_word = seed_word
    generated_sequence = [current_word]

    for _ in range(num_words - 1):
        # Stop early at a token that has no recorded successor
        if current_word not in transition_matrix:
            break
        followers = transition_matrix[current_word]
        # Sample the next token in proportion to its transition probability
        current_word = random.choices(list(followers), weights=list(followers.values()))[0]
        generated_sequence.append(current_word)

    return generated_sequence

def generate_markov_chain_from_file(file_path, k=2, num_words=20, seed_word=None):
    # Note: `k` is accepted but not used; the chain below works on single tokens.
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    preprocessed_text = preprocess_text(text)
    states = list(set(preprocessed_text))
    markov_chain = create_markov_chain(states, preprocessed_text)

    # Without a seed, start from a random token of the input text
    if seed_word is None:
        seed_word = random.choice(states)

    return markov_chain(num_words=num_words, seed_word=seed_word)

file_path = 'Metamorphosis_FKafka.txt'

# Generate a 50-token sequence from the text file, starting at 'melancholy'
generated_sequence = generate_markov_chain_from_file(file_path, num_words=50, seed_word='melancholy')
print(generated_sequence)


['melancholy', 'expression', '.', 'gregor', 'needed', 'to', 'make', ',', 'threw', 'himself', 'with', 'this', 'meant', 'not', 'something', 'more', '.', '“', 'what', 'is', 'posted', 'with', 'his', 'mother', 'would', 'wake', 'him', 'that', 'she', 'said', 'gregor', 'turned', 'around', 'gregor', '’', 'm', 'sure', 'that', 'they', 'might', 'go', 'silent', ',', 'please', 'follow', 'the', 'ground', 'so', 'much', 'food']
