# Task: text generation with Markov chains

Each unique word represents a single state. A sentence is just a sequence of states that have been sampled from a state space.

The task has three steps:

*   Create a state space;
*   Compute probabilities for the transition matrix;
*   Sample from the state space with the transition matrix.
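
The three steps can be sketched end to end on a tiny corpus before moving to the real data. This is only an illustration: the corpus `toy_tokens`, the sorted state order, and the uniform fallback for tokens without a successor are assumptions made for the example, not part of the task statement.

```python
import numpy as np

# Toy corpus (hypothetical, for illustration only).
toy_tokens = ["the", "cat", "sat", "the", "cat", "ran", "the", "dog", "sat"]

# 1. State space: every unique token, in a fixed (sorted) order.
states = sorted(set(toy_tokens))
index = {tok: i for i, tok in enumerate(states)}

# 2. Transition matrix: count adjacent pairs, then normalise each row.
counts = np.zeros((len(states), len(states)))
for cur, nxt in zip(toy_tokens, toy_tokens[1:]):
    counts[index[cur], index[nxt]] += 1
row_sums = counts.sum(axis=1, keepdims=True)
# Assumption: a state with no observed successor gets a uniform row.
safe_sums = np.where(row_sums == 0, 1.0, row_sums)
P = np.where(row_sums > 0, counts / safe_sums, 1.0 / len(states))

# 3. Sampling: start from a state and repeatedly draw the next one
#    from the row of P that belongs to the current state.
rng = np.random.default_rng(0)
state = index["the"]
sampled = [states[state]]
for _ in range(5):
    state = rng.choice(len(states), p=P[state])
    sampled.append(states[state])
print(" ".join(sampled))
```

In this corpus "the" is followed by "cat" twice and by "dog" once, so the row for "the" assigns 2/3 to "cat" and 1/3 to "dog".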

In [6]:
import nltk
from nltk import word_tokenize
import numpy as np
import random
from typing import List, Dict, Tuple

In [7]:
def get_tokens(file_name: str) -> List[str]:
    # Read the whole file and split it into word and punctuation tokens.
    with open(file_name, 'r') as f:
        return word_tokenize(f.read())

In [9]:
# Download the 'punkt' resource
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
tokens = get_tokens('/content/positive')
len(tokens)

68689

<font color='red'>TODO:</font> Write a function that computes the matrix of transition probabilities. You can fill in the TODO parts or write your own function from scratch.

In [12]:
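def calculate_transition_matrix(tokens: List[str]) -> Tuple[np.ndarray, List[str]]:
    # State space: one state per unique token. Sorting keeps the index order reproducible.
    unique_tokens = sorted(set(tokens))
    token_index = {token: i for i, token in enumerate(unique_tokens)}
    transition_matrix = np.zeros((len(unique_tokens), len(unique_tokens)))

    # Count how often each token is immediately followed by each other token.
    for current_token, next_token in zip(tokens, tokens[1:]):
        transition_matrix[token_index[current_token], token_index[next_token]] += 1

    # A token that occurs only at the very end of the corpus has no successor,
    # so its row sums to zero. Give such rows a uniform distribution (a design
    # choice) instead of dividing by zero and producing NaN.
    row_sums = transition_matrix.sum(axis=1, keepdims=True)
    transition_matrix[(row_sums == 0).ravel()] = 1.0

    # Normalise every row so that it sums to 1.
    transition_matrix = transition_matrix / transition_matrix.sum(axis=1, keepdims=True)
    return transition_matrix, unique_tokens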
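
Row normalisation has an edge case worth checking: a token whose only occurrence is the last position in the corpus has no successor, so its row of counts sums to zero and a plain division yields NaN. The sketch below shows one way to handle this on a small count matrix. The helper name `normalize_rows` and the uniform fallback are assumptions for illustration; other choices (for example, an absorbing end state) are equally valid.

```python
import numpy as np

def normalize_rows(counts: np.ndarray) -> np.ndarray:
    """Turn a matrix of transition counts into row-stochastic probabilities."""
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid 0/0: divide empty rows by 1 for now.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    probs = counts / safe_sums
    # Assumption: rows with no observed transitions become uniform.
    empty = (row_sums == 0).ravel()
    probs[empty] = 1.0 / counts.shape[1]
    return probs

# Hypothetical counts; the last state was never followed by anything.
counts = np.array([[0.0, 2.0, 2.0],
                   [1.0, 0.0, 3.0],
                   [0.0, 0.0, 0.0]])
P = normalize_rows(counts)

# Every row must be a valid probability distribution before sampling.
assert np.allclose(P.sum(axis=1), 1.0)
```

The final assertion is a cheap sanity check that can also be run on the full matrix, since `np.random.choice` raises an error when the probabilities it receives do not sum to 1.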

In [13]:
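def generate_text(current_token: str, tokens: List[str], length: int, P_matrix: np.ndarray, P_init: np.ndarray=None) -> str:
    # `tokens` is the state space (the unique tokens), in the same order as the
    # rows and columns of P_matrix. If current_token is not in the state space,
    # the start state is drawn from P_init, or uniformly when P_init is None.
    token_index = {token: i for i, token in enumerate(tokens)}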
    index_token = {i: token for i, token in enumerate(tokens)}

    if current_token not in token_index:
        if P_init is not None:
            current_index = np.random.choice(len(tokens), p=P_init)
        else:
            current_index = np.random.choice(len(tokens))
    else:
        current_index = token_index[current_token]

    generated_text = [index_token[current_index]]

    for _ in range(length - 1):
        current_index = np.random.choice(len(tokens), p=P_matrix[current_index])
        generated_text.append(index_token[current_index])

    return " ".join(generated_text)
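
`generate_text` accepts an optional `P_init`, but the notebook does not construct one. A minimal sketch, assuming the initial distribution is simply the relative frequency of each token in the corpus, could look like this. The helper name `initial_distribution` and the toy data are hypothetical; `states` must be the same list, in the same order, that indexes the transition matrix.

```python
from collections import Counter
from typing import List

import numpy as np

def initial_distribution(tokens: List[str], states: List[str]) -> np.ndarray:
    """Relative frequency of each state in the corpus, in the order of `states`."""
    counts = Counter(tokens)
    freqs = np.array([counts[s] for s in states], dtype=float)
    return freqs / freqs.sum()

# Toy data (hypothetical, for illustration only).
toy_tokens = ["a", "b", "a", "c", "a", "b"]
toy_states = sorted(set(toy_tokens))  # ["a", "b", "c"]
p_init = initial_distribution(toy_tokens, toy_states)

# Draw a start state the same way generate_text does when the seed token is unknown.
start_index = np.random.choice(len(toy_states), p=p_init)
print(toy_states[start_index])
```

With this choice, frequent tokens are more likely to start a generated sequence than with the uniform fallback.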

In [23]:
# Keep both return values: the matrix and the token list that defines its row/column order.
transition_matrix, unique_tokens = calculate_transition_matrix(tokens)
assert transition_matrix.shape == (9367, 9367)

In [24]:
input_token = 'Students'
# If input_token is not in the state space, generation starts from a randomly chosen token.
generated_text = generate_text(input_token, unique_tokens, 100, transition_matrix)
print(generated_text)

sports massage ( GLA ) I 'd love u back : - can be great day of retailers that . Yes ! Fine day ! : - ) I ca n't wait you and that . : ) ca n't for fun ? '' look ... Have a great to a nice knowing you & amp ; & gt ; follow my 200 commits . I .. will ever : ) I gained weight : ) # MyPapaMyPride # StarSquad Can you 're views . We wo n't sleep a lucky enough , 'cause I was a lot : )
