<a href="https://colab.research.google.com/github/junting-huang/data_storytelling/blob/main/case_3_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# case_5.statistics

## 5.1 language model

**Markov Chain Language Model**

"A language model is a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on." In a simple (first-order) Markov chain language model, the probability of each word depends only on the word immediately before it. For example, given the word "the," the model might assign the next word "cat" a probability of 0.2, "dog" a probability of 0.3, and so on, with the probabilities estimated from how often each word pair occurs in the training data.

**Example**

Consider a corpus consisting of two sentences: "I like to eat apples. I like to eat bananas." A simple bigram (first-order) Markov chain model, which conditions each word on the single preceding word, would estimate the following probabilities:

- P(like | I) = 1.0 (since "like" always follows "I" in the training data)
- P(to | like) = 1.0 (since "to" always follows "like" in the training data)
- P(eat | to) = 1.0 (since "eat" always follows "to" in the training data)
- P(apples | eat) = 0.5, P(bananas | eat) = 0.5 (since "apples" and "bananas" each follow "eat" once in the training data)
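
The probabilities in this example can be reproduced by counting adjacent word pairs and dividing each count by the total for its first word. The following is a minimal sketch, not part of the original notebook; the helper name `bigram_probabilities` is hypothetical, and the text is lowercased, so "I" appears as `'i'`.

```python
import re
from collections import Counter, defaultdict

def bigram_probabilities(text):
    """Estimate P(next | previous) by counting adjacent word pairs (illustrative helper)."""
    words = re.findall(r'\b\w+\b', text.lower())
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        pair_counts[prev][nxt] += 1
    # Normalize each row of counts into a probability distribution
    return {
        prev: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for prev, followers in pair_counts.items()
    }

probs = bigram_probabilities("I like to eat apples. I like to eat bananas.")
print(probs['i'])    # distribution over words that follow "i"
print(probs['eat'])  # distribution over words that follow "eat"
```

Because punctuation is discarded, the sketch also records that "i" follows "apples" across the sentence boundary. The last word, "bananas," has no follower and therefore no entry.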

## 5.2 building model

First, load the text of *Walden* from a file.

In [4]:
filename = 'data/walden.txt'

with open(filename, 'r') as file:
    text = file.read()

The *re* module in Python provides support for regular expressions, which let you search, match, and manipulate text based on patterns.

re.findall is a function from the *re* module that returns every non-overlapping match of a pattern in a string as a list. The pattern used here, \b\w+\b, has three parts:

* \b: Word boundary. This ensures that we match whole words and not parts of words.
* \w+: One or more word characters. This matches letters, digits, or underscores.
* \b: Another word boundary to complete the pattern.
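
To see how this pattern behaves on a small input, here is an illustrative sketch (the sample string is made up for the demonstration):

```python
import re

sample = "Don't panic: it's 42 o'clock!"
tokens = re.findall(r'\b\w+\b', sample.lower())
print(tokens)
# ['don', 't', 'panic', 'it', 's', '42', 'o', 'clock']
```

Apostrophes and other punctuation are not word characters, so contractions such as "don't" are split into two tokens. This is a simple tokenization choice rather than a linguistically careful one.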



In [5]:
import re

words = re.findall(r'\b\w+\b', text.lower())  # Convert to lowercase and split into words

In [10]:
len(words)

120548

We are ready to build our first language model!

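The *defaultdict* class from the *collections* module is a dictionary subclass that creates a default value the first time a missing key is accessed. With *list* as the default, every new key starts out as an empty list, so we can append to it without first checking whether the key exists.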
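
A brief sketch of that behavior, using a made-up variable name `followers`:

```python
from collections import defaultdict

followers = defaultdict(list)      # missing keys start out as an empty list
followers['eat'].append('apples')  # no KeyError: the list is created on first access
followers['eat'].append('bananas')

print(followers['eat'])
print('drink' in followers)        # membership tests do not create keys
```

The second point matters later: checking `state in model` during generation does not add empty entries to the model.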

In [13]:
from collections import defaultdict

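def build_model(words, order=1):
    # Map each state (a tuple of `order` consecutive words) to the list of words that follow it
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        next_word = words[i + order]
        # Duplicates are kept, so frequent followers are more likely to be sampled later
        model[state].append(next_word)
    return model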
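
To inspect what the model stores, it can help to run it on a tiny word list. The sketch below repeats `build_model` so that it runs on its own; the toy corpus and the names `toy_words`, `toy_model`, and `toy_model_2` are illustrative assumptions.

```python
from collections import defaultdict

def build_model(words, order=1):
    # Same logic as the notebook's function, repeated so this snippet is self-contained
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return model

toy_words = "i like to eat apples i like to eat bananas".split()

toy_model = build_model(toy_words, order=1)
print(toy_model[('eat',)])          # followers of the one-word state ('eat',)
print(toy_model[('i',)])            # 'like' is stored twice because it occurs twice

toy_model_2 = build_model(toy_words, order=2)
print(toy_model_2[('to', 'eat')])   # followers of the two-word state ('to', 'eat')
```

States are always tuples, even when `order` is 1, which is why the key is `('eat',)` and not `'eat'`.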

## 5.3 text generation

In [14]:
import random

def generate_poetry(model, order, length=50):
    state = random.choice(list(model.keys()))
    poetry = list(state)
    for _ in range(length):
        if state in model:  # Check if the state is in the model
            next_word = random.choice(model[state])
            poetry.append(next_word)
            state = tuple(poetry[-order:])  # Update the state with the last 'order' words
        else:  # If the state is not in the model, stop generating
            break

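    # Split the generated poetry into lines of equal length: four lines, plus a
    # short fifth line when the word count is not divisible by 4
    words_per_line = max(1, len(poetry) // 4)  # Guard against a zero step for very short output
    lines = [' '.join(poetry[i:i + words_per_line]) for i in range(0, len(poetry), words_per_line)]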

    # Join the lines with line breaks to form the final poem
    return '\n'.join(lines)
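
The line-splitting step can be checked in isolation. The sketch below mirrors the chunking logic with a hypothetical helper, `split_into_lines`, and placeholder tokens; the `max(1, ...)` guard only matters for inputs shorter than four words. It assumes generation runs to the full requested length.

```python
def split_into_lines(tokens, target_lines=4):
    # Fixed-size slices of the word list, as in generate_poetry
    words_per_line = max(1, len(tokens) // target_lines)
    return [' '.join(tokens[i:i + words_per_line])
            for i in range(0, len(tokens), words_per_line)]

tokens_51 = [f"w{i}" for i in range(51)]  # order=1: 1 seed word + 50 generated words
tokens_52 = [f"w{i}" for i in range(52)]  # order=2: 2 seed words + 50 generated words

print(len(split_into_lines(tokens_51)))  # 51 // 4 = 12 words per line, 3 words left over
print(len(split_into_lines(tokens_52)))  # 52 // 4 = 13 words per line, none left over
```

With 51 words the leftover words form a short fifth line, while 52 words divide evenly into four lines.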

## 5.4 n-gram

In [16]:
order = 1  # You can experiment with different orders
model = build_model(words, order)
poetry = generate_poetry(model, order, length=50)
print(poetry)

infinity of the savages their bodies is in this case now for
by hounds that will be your head useful systems for their semi
cylindrical form the expression he goes so sincerely proposed a cloud compeller
would be basis let time we are inclined to hummock left by
this continent that


In [17]:
order = 2  # You can experiment with different orders
model = build_model(words, order)
poetry = generate_poetry(model, order, length=50)
print(poetry)

sat thus with the full project gutenberg works unless you plant more than
those other productions but which at noon sitting amid the rustling of leaves
and potamogetons and perhaps cannot be removed without girdling and so there were
but poorly entertained though what they have persuaded the majority are able at
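
One way to see why higher orders tend to produce more coherent text that stays closer to the source is to measure how many distinct next words each state allows. The following is a rough sketch on a made-up toy corpus; `average_branching` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

def average_branching(words, order):
    """Average number of distinct next words per state (illustrative helper)."""
    followers = defaultdict(set)
    for i in range(len(words) - order):
        followers[tuple(words[i:i + order])].add(words[i + order])
    return sum(len(f) for f in followers.values()) / len(followers)

toy_words = "the cat sat on the mat the cat ate the rat".split()
print(average_branching(toy_words, 1))  # 1.5
print(average_branching(toy_words, 2))  # 1.125
```

As the order grows, each state has fewer possible continuations, so the generator has less freedom and increasingly reproduces runs of the original text.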


## 5.5 markovify library

The *markovify* package is a simple, extensible Markov chain generator. Its main use is generating random, semi-plausible sentences based on an existing text.

In [20]:
! pip install markovify



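We create a Markov chain text model using the NewlineText class from markovify. NewlineText treats each line of the input as a separate sentence, so it suits source text with one sentence or verse line per line. For running prose, the markovify.Text class splits on sentence punctuation instead.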

In [23]:
import markovify

filename = 'data/walden.txt'

with open(filename, 'r') as file:
    text = file.read()

text_model = markovify.NewlineText(text)

After creating text_model, you can use it to generate new sentences based on the patterns learned from *Walden*. Here is a simple example that generates a one-sentence "poem" with the model:

In [26]:
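def generate_poem(model, max_words=40, tries=100):
    # make_sentence returns None if no acceptable sentence is found within `tries` attempts
    return model.make_sentence(max_words=max_words, tries=tries)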

print(generate_poem(text_model))

I had a farm, or ten dollars, or all which had stood in a high state of the great ocean of solitude, into which the owner said protected it by a thousand as well omit to study the bottom.
