<a href="https://colab.research.google.com/github/junting-huang/data_storytelling/blob/main/case_3_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# case_3. model

## 3.1 language model

**Markov Chain Language Model**

"A language model is a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on." In a simple Markov chain language model, the probability of each word only depends on the last word in the sequence. For example, given the word "the," the model might predict that the next word is "cat" with a probability of 0.2, "dog" with a probability of 0.3, and so on, based on the frequencies of word sequences in the training data.

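The Markov assumption described above can be written compactly. This is a sketch in notation that does not appear in the original: $w_i$ is assumed to denote the $i$-th word and $C(\cdot)$ a count of occurrences in the training corpus.

```latex
P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{\sum_{w} C(w_{i-1}\, w)}
```

The second expression is the relative-frequency estimate: the number of times $w_i$ follows $w_{i-1}$, divided by the number of times any word follows $w_{i-1}$.
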
**Example**

Consider a corpus with two sentences: "I like to eat apples. I like to eat bananas." A bigram (first-order) Markov chain model, which conditions each word on the single word before it, would estimate a probability distribution like this:

- P(like | I) = 1.0 (since "like" always follows "I" in the training data)
- P(to | like) = 1.0 (since "to" always follows "like" in the training data)
- P(eat | to) = 1.0 (since "eat" always follows "to" in the training data)
- P(apples | eat) = 0.5, P(bananas | eat) = 0.5 (since "apples" and "bananas" each follow "eat" once in the training data)
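
The probabilities in this example can be reproduced by counting adjacent word pairs. The following is a minimal sketch, not the notebook's own implementation; the names `corpus`, `counts`, and `probs` are illustrative, and punctuation is simply dropped, so the pair ("apples", "I") that spans the sentence boundary is counted too.

```python
import re
from collections import Counter, defaultdict

corpus = "I like to eat apples. I like to eat bananas."
tokens = re.findall(r"\b\w+\b", corpus)  # keep case so "I" matches the example

# counts[prev][nxt] = number of times `nxt` directly follows `prev`
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

# Normalize each row of counts into a conditional distribution P(nxt | prev)
probs = {
    prev: {word: n / sum(followers.values()) for word, n in followers.items()}
    for prev, followers in counts.items()
}

for prev, dist in probs.items():
    print(prev, dist)
```

Running this lists one distribution per preceding word, including the split between "apples" and "bananas" after "eat".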

## 3.2 building model

In [13]:
filename = 'walden.txt'

with open(filename, 'r', encoding='utf-8') as file:  # Explicit encoding avoids platform-dependent decoding errors
    text = file.read()

In [14]:
import re

words = re.findall(r'\b\w+\b', text.lower())  # Convert to lowercase and split into words
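
To see what this tokenizer keeps and discards, it may help to run it on a short string. The sample sentence below is made up for illustration. Because `\w+` matches only letters, digits, and underscores, apostrophes and decimal points act as separators, which is one plausible reason a fragment like "0 65" shows up in generated text.

```python
import re

sample = "Don't spend $0.65 on apples."  # hypothetical input, not taken from walden.txt
tokens = re.findall(r'\b\w+\b', sample.lower())
print(tokens)
```

Contractions and numbers are split into pieces, so "don't" becomes two tokens and "0.65" becomes two tokens.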

In [15]:
from collections import defaultdict

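def build_model(words, order=1):
    """Map each state (a tuple of `order` consecutive words) to the list of words that follow it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        next_word = words[i + order]
        # Repeats are kept on purpose: a frequent follower appears more often in the list
        model[state].append(next_word)
    return model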
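
A small usage sketch can make the shape of the model concrete. The function is repeated here so the block runs on its own, and `toy_words` is a hypothetical word list based on the example sentences from section 3.1 (lowercased, punctuation removed).

```python
from collections import defaultdict

def build_model(words, order=1):
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return model

toy_words = "i like to eat apples i like to eat bananas".split()

# Order 1: each state is a one-word tuple
model_1 = build_model(toy_words, order=1)
print(model_1[("eat",)])

# Order 2: each state is a pair of consecutive words
model_2 = build_model(toy_words, order=2)
print(model_2[("to", "eat")])
```

Keys are always tuples, even for `order=1`, so a lookup needs `("eat",)` rather than `"eat"`.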

## 3.3 text generation

In [16]:
import random

def generate_poetry(model, order, length=50):
    state = random.choice(list(model.keys()))
    poetry = list(state)
    for _ in range(length):
        if state in model:  # Check if the state is in the model
            next_word = random.choice(model[state])
            poetry.append(next_word)
            state = tuple(poetry[-order:])  # Update the state with the last 'order' words
        else:  # If the state is not in the model, stop generating
            break

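    # Split the generated poetry into at most four lines; rounding up keeps
    # leftover words on the last line instead of spilling onto a fifth line,
    # and max(1, ...) avoids a zero step when fewer than four words were generated
    words_per_line = max(1, -(-len(poetry) // 4))
    lines = [' '.join(poetry[i:i + words_per_line]) for i in range(0, len(poetry), words_per_line)]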

    # Join the lines with line breaks to form the final poem
    return '\n'.join(lines)
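
The generator relies on `random.choice` picking uniformly from a list that may contain repeats, so a word that followed a state three times is three times as likely to be chosen as a word that followed it once. The sketch below illustrates this with a made-up follower list; `followers`, `rng`, and the seed are assumptions for the demonstration only.

```python
import random
from collections import Counter

# Hypothetical follower list for a single state: "bananas" was seen 3 times, "apples" once
followers = ["apples", "bananas", "bananas", "bananas"]

rng = random.Random(0)  # seeded generator so the experiment is repeatable
n_draws = 10_000
draws = Counter(rng.choice(followers) for _ in range(n_draws))

share_bananas = draws["bananas"] / n_draws
print(f"share of 'bananas': {share_bananas:.3f}")  # should land close to 3/4
```

This is why the model can store plain lists rather than explicit probabilities: the sampling step recovers the relative frequencies implicitly.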

## 3.4 n-gram

In [17]:
order = 1  # You can experiment with different orders
model = build_model(words, order)
poetry = generate_poetry(model, order, length=50)  # 'order' must match the order used to build the model
print(poetry)

worldly goods store he can learn what depth of least in a spreading white quartz perhaps it may see my light some faith and civil government are acquainted with any hammering stone was sheer idleness it swept and i used to the last seed 0 65 apples only slaves of the
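
When experimenting with different orders, it can be useful to look at how the model itself changes. The sketch below uses a hypothetical toy corpus, not walden.txt, and counts the states at each order along with how many of them have more than one possible next word. `build_model` is repeated so the block runs on its own.

```python
from collections import defaultdict

def build_model(words, order=1):
    """Same construction as in section 3.2: map each state to its observed followers."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

# Hypothetical toy corpus, small enough to check by hand
toy_words = "the cat sat on the mat the dog sat on the log the cat ran".split()

summary = {}
for order in (1, 2, 3):
    model = build_model(toy_words, order)
    n_states = len(model)
    # A state "branches" if at least two different words were seen after it
    n_branching = sum(1 for followers in model.values() if len(set(followers)) > 1)
    summary[order] = (n_states, n_branching)
    print(f"order={order}: {n_states} states, {n_branching} with more than one possible next word")
```

As the order grows, states become more specific and fewer of them offer a real choice, so generated text tends to read more fluently but copies longer stretches of the source.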


## 3.5 markovify library

In [1]:
! pip install markovify

Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting unidecode (from markovify)
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18607 sha256=f028753a36ef9dd2ac0707c0967743de607d4b49cd401f6726030f0a530ebbcd
  Stored in directory: /root/.cache/pip/wheels/ca/8c/c5/41413e24c484f883a100c63ca7b3b0362b7c6f6eb6d7c9cc7f
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.9.4 unidecode-1.3.6


In [2]:
import markovify

filename = 'walden.txt'

with open(filename, 'r', encoding='utf-8') as file:  # Explicit encoding avoids platform-dependent decoding errors
    text = file.read()

text_model = markovify.NewlineText(text)

In [12]:
def generate_poem(model, max_words=40):
    # make_sentence returns None if no acceptable sentence is found within `tries` attempts
    return model.make_sentence(max_words=max_words, tries=100)

print(generate_poem(text_model))

While I enjoy the most part it suggested only pleasing associations, whether heard by the carriage road from Brister’s Hill.
