# N-Grams and Markov Chains for a Girl

First, read our input text from a file and clean it into a list of words without special characters:

In [None]:
from collections import Counter
from pprint import pprint

def clean(input_text):
    result = input_text

    special_chars = [".", "\n", ";", "?", "!", ":", ",", "(", ")", "[", "]", "\"", "“", "”", "*"]

    for char in special_chars:
        result = result.replace(char, " " if char == "\n" else "")

    return result.lower()


# Split the cleaned text into words and drop empty strings
def split_and_dropnulls(input_text):
    words = input_text.split(" ")

    non_empty_words = [word for word in words if word != '']
    return non_empty_words

with open('alice.txt', 'r') as alice_file: 
    alice_text = ' '.join(alice_file.readlines())

alice_words = split_and_dropnulls(clean(alice_text))
pprint(alice_words)

Next, let's split the text into pairs of consecutive words (bigrams; later we generalize to N words at a time):

In [None]:
alice_pairs = [(alice_words[i], alice_words[i+1]) for i in range(len(alice_words)-1)]
pprint(alice_pairs)
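
The pair construction above is the N = 2 case of a more general idea. As a minimal sketch, assuming a helper named `ngrams` (a hypothetical name, not part of the notebook) and a short made-up word list, the same thing for any N could look like this:

```python
def ngrams(words, n=2):
    """Return every run of n consecutive items in `words` as a tuple."""
    # zip over n staggered views of the list: words[0:], words[1:], ..., words[n-1:]
    return list(zip(*(words[i:] for i in range(n))))


# Hypothetical sample input, standing in for alice_words
sample_words = ["alice", "was", "beginning", "to", "get", "very", "tired"]

sample_bigrams = ngrams(sample_words, 2)
sample_trigrams = ngrams(sample_words, 3)

print(sample_bigrams[:3])
print(sample_trigrams[:2])
```

With `n=2` this produces the same tuples as the list comprehension used for `alice_pairs`; a list of length L yields L - n + 1 N-grams.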

Now, we can find the frequency of each pair within the set of pairs:

In [None]:
pair_counts = Counter(alice_pairs)

frequencies = pair_counts.most_common(10)
pprint(frequencies)
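
Pair counts like these are what turn a text into a Markov chain: dividing each count by the total for its first word gives an estimate of the probability of the next word given the current one. The sketch below shows that normalization on a small made-up sentence; the names `sample_words`, `followers`, and `transition_probs` are illustrative and do not appear elsewhere in the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical sample input, standing in for alice_words
sample_words = "the cat sat on the mat and the cat ran".split()
sample_pairs = list(zip(sample_words, sample_words[1:]))
sample_counts = Counter(sample_pairs)

# Group the pair counts by their first word
followers = defaultdict(Counter)
for (first, second), count in sample_counts.items():
    followers[first][second] += count

# Normalize each group so its probabilities sum to 1
transition_probs = {
    first: {second: count / sum(nexts.values()) for second, count in nexts.items()}
    for first, nexts in followers.items()
}

print(transition_probs["the"])
```

Here "the" is followed by "cat" twice and "mat" once, so the estimated probabilities are 2/3 and 1/3.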

In [73]:
import sys 

def markov_model(sequence: list, n: int = 2):
    """
    Create a Markov model (represented as a dict) from the given input sequence, 
    using N-grams of size {n}
    """
    model = {}
    sequence = list(sequence[:]) + [None]
    for starting_position in range(len(sequence) - n):
        current_ngram = tuple(sequence[starting_position:starting_position + n])
        next_item = sequence[starting_position + n]
        
        if current_ngram not in model: 
            model[current_ngram] = [next_item]
        else:
            model[current_ngram].append(next_item)

    return model

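# Note: we pass the raw text (a string), so this model is built over
# 5-character N-grams rather than over the cleaned word list.
alice_model = markov_model(alice_text, 5)
# sys.getsizeof reports only the dict's own table, not the keys and lists it holds.
print(f'Finished training! The final model has size: {sys.getsizeof(alice_model)} bytes')
# pprint(alice_model)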

Finished training! The final model has size: 1310808 bytes
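
To see what the model dict actually contains, it helps to build one from a very short input. The sketch below uses a compact stand-in called `toy_markov_model` (a hypothetical name; it follows the same logic as `markov_model` above so the example can run on its own) with the string "abracadabra" and 2-character N-grams.

```python
def toy_markov_model(sequence, n=2):
    """Map each n-gram in the sequence to the list of items seen right after it."""
    model = {}
    items = list(sequence) + [None]  # None marks the end of the sequence
    for start in range(len(items) - n):
        key = tuple(items[start:start + n])
        model.setdefault(key, []).append(items[start + n])
    return model


toy_model = toy_markov_model("abracadabra", n=2)
for key, followers in toy_model.items():
    print(key, "->", followers)
```

Each key is an N-gram and each value lists every item that followed it, with repeats kept. The pair `('a', 'b')` occurs twice and is followed by `'r'` both times, while `('r', 'a')` is followed once by `'c'` and once by the end marker `None`.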


Now that we have a Markov model of our text, we can use it to generate more text that "looks like" the source material:

In [74]:
import random

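def generate(n, model, start=None, max_length=100):
    """
    Generate a sequence from the given model, starting from {start} 
    (a random N-gram if None) and adding at most {max_length} items
    """
    if start is None:
        start = random.choice(list(model.keys()))
    
    output = list(start)

    for _ in range(max_length):
        # Look up the last n items and pick one of the items that followed them
        start = tuple(output[-n:])
        next_item = random.choice(model[start])

        # None marks the end of the training sequence
        if next_item is None:
            break
        else:
            output.append(next_item)

    return output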


alice_result = generate(5, alice_model, max_length=2000)
for char in alice_result:
    print(char, end="")


pear the flurry tone, and say but it, old fellow!” And oh! sh!” she did then I breath, and walked out, after a fall, at a red-hot possible to hers before she hedgehogs were stood nearly follow, as a large saucer of the love).
 Oh dear quietly so,” said the Mock Turtles and the temper,” said the Queen’s voice:—
 
 “Beautiful Soo—oop of that the table and she such a new idea of having it. She took and was quite crowded round the other!”
 
 “That’s the Caterpillar.
 
 The Cat; and in a thinking to say “How confusion,” waving way to eat hurried Alice time). “Do bats eat or drinking on both bite. And she same sigh: she think how eagerly, found it please, which was only this, and the Dormouse.
 
 The Gryphon went on a tone, and was not,” said Alice thought Alice: “because had plenty of keeping about in a very now—Don’t you, won’t belongs to get away, and so confusion, and went on for sneeze so.” said the other Alice replied very long silent, all difficulty Alice again introduced the e—e—even
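
The `generate` function picks the next item with `random.choice` from a list that keeps duplicates, so an item that followed an N-gram twice is twice as likely to be chosen. The sketch below illustrates this on a made-up follower list and shows an equivalent formulation that stores counts and samples with explicit weights; the variable names are illustrative, and the notebook itself uses only the first option.

```python
import random
from collections import Counter

# Hypothetical followers of a single n-gram (None is the end marker)
followers = ["r", "r", "c", None]

random.seed(0)  # seeded only so the sketch is repeatable

# Option 1: pick directly from the list; the duplicate makes "r" twice as likely
direct_pick = random.choice(followers)

# Option 2: store each follower once with its count and sample with weights
follower_counts = Counter(followers)
options = list(follower_counts.keys())
weights = list(follower_counts.values())
weighted_pick = random.choices(options, weights=weights, k=1)[0]

print(follower_counts)
```

Both options draw from the same distribution. Storing counts can use less memory when the same follower repeats many times, at the cost of slightly more bookkeeping.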

In [77]:
with open('bible.txt', 'r') as bible_file: 
    bible_text = ' '.join(bible_file.readlines())

# Train a model on it 
bible_model = markov_model(bible_text, n=5)

print(f'Finished training! The final model has size: {sys.getsizeof(bible_model)} bytes')

Finished training! The final model has size: 5242968 bytes
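
Note that `sys.getsizeof` on a dict measures only the dict's own hash table, not the keys and lists stored in it, so the figures printed above understate the real memory use. A rough sketch of a fuller estimate follows; `deep_model_size` is a hypothetical helper, and it still ignores the individual string objects, which are often shared between entries.

```python
import sys


def deep_model_size(model):
    """Rough size in bytes: the dict itself plus its key tuples and value lists."""
    total = sys.getsizeof(model)
    for key, followers in model.items():
        total += sys.getsizeof(key) + sys.getsizeof(followers)
    return total


# Hypothetical tiny model in the same shape as the ones built above
tiny_model = {("a", "b"): ["r", "r"], ("b", "r"): ["a"]}

print(sys.getsizeof(tiny_model), deep_model_size(tiny_model))
```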


In [82]:
# Generate some more "bible" with it
bible_result = generate(n=5, model=bible_model, max_length=1000)

for char in bible_result:
    print(char, end='')

rd; and do
 uncircumcised, and the came them all; for thus I will I rise up Elijah, and
 took Sosthenes, that following the house of Jedaiah their
 sea: this pleased me, saith, Rachel and a drink a little bottles all the
 take down
 to him, Run. The LORD fountain, Because I had through ye shut upon her; and girt about the came down and be cutteth his name outward when a books of Christ be wrough their shall driven unto him, and return the people, and whom I will generation shall
 our God for
 them that thee have slandered him as seek me
 be remaineth not have been all thine heat off
 to do great.
 
 7:4 Then have branch thee, O God, without garrison her hand upon the Ethiopians against thine hand of persecuted up
 and the green to make hasten my souls the times;
 the mountain of the carry high he door of Judah and
 Naamath thy sacrifice.
 
 9:22 Whoso to the field, and unto Tahpenest gatherefore him a
 cloud.
 
 16:34 Ascribed; I would not
 the sight, that hast loveth great.
 
 2:8 And

Now, let's do the same with Grateful Dead lyrics. (Do you know the Grateful Dead?)

In [83]:
with open('gdead.txt', 'r') as dead_file: 
    dead_text = ' '.join(dead_file.readlines())

# Train a model on it 
dead_model = markov_model(dead_text, n=5)

print(f'Finished training! The final model has size: {sys.getsizeof(dead_model)} bytes')

Finished training! The final model has size: 589920 bytes


In [89]:
# Generate some more "dead songs" with it
dead_result = generate(n=5, model=dead_model, max_length=1000)

for char in dead_result:
    print(char, end='')

screen, be a judgement I said' "Please, I am on me.
 Got some pretty little was clear throw morning for his choice, mate of thing.
 But it's one thing a mate you, what I've bottle was dusty but throw me in an eagle on ten dollars bail,
 Mumblin' down,
 I turning on a life was many angles you pushed around here the rails we're on horseback and the bone,
 It's the bells on my knees,
 You three days just there I am on me".
 Come heart. You just stumbles as the bastard barely time go?
 
 Off to move, rollin' down,
 And I chundercloud.
 
 I live in and the high noon?
 Like all ugly and rage
 A lady of nobility, gentility and play and rage
 A lady of nobility and fill my pillow you with harder to blame.
 
 Teeth big split, and warm in you holding on the back...He's gone, nothin' every day.
 
 Been all I had a harden wings go wrong, wrong.
 And sing me a violin an empty, find out with me,
 Gentle Jack, there quick, yes I comes around of town off the big and now he's gone, he's gone, gone.
 
 