This is a Google Colaboratory notebook.

It's an interactive environment that lets you write and run Python code in the cloud.

To get it to work, you should first select 'open in playground mode'.

Now, if you move your mouse over a piece of code, you'll see that a little 'play' symbol comes up next to it. Click the play symbol to execute that code.

In [1]:
print("for instance, clicking play here will print this code out")

for instance, clicking play here will print this code out


In [2]:
i=10
print("or clicking play here will print out the number",i)

or clicking play here will print out the number 10


Markov chains are mathematical systems that experience changes from one state to another according to certain probabilistic rules. Markov chains have lots of practical uses: For example, they can be used for predictive text on a mobile device.

I've set up this notebook to show you how Markov chains can be used to generate text. You can learn how this works by stepping through the program, reading each piece of text and pressing the 'play' symbol next to the associated code.

When you've seen how the code works, there are a few challenges for you. Before attempting them, you'll have to 'open in playground mode' so that you can edit the notebook:

1. Can you get it to generate text from a different book? At the moment, the text generator is working from Jane Austen's Pride and Prejudice, located in a .txt file at Project Gutenberg ('http://www.gutenberg.org/ebooks/42671.txt.utf-8').  However, it should theoretically be able to generate text from any source file on the internet - for example Dracula ('http://www.gutenberg.org/ebooks/345.txt.utf-8') or something called 'Astounding Stories of Super-Science' ('https://www.gutenberg.org/ebooks/29768.txt.utf-8'). I'm using old books because you can find them easily online (copyright free) from Project Gutenberg (e.g. https://www.gutenberg.org/wiki/Science_Fiction_(Bookshelf)). But you could use anything.

2. Can you get it to work with something bigger or smaller than a 3-gram? Say a 4-gram or a 2-gram?

3. Can you get it to create a mash-up of two books? This might involve creating a second bookInput from a different URL, and then stitching the resulting strings together.

------------------------------------------


Now, for the code itself! 
----------------------------------------
--------------------------------------

First, load a text file from Project Gutenberg to use as input for our text generator.

The URL here points to a .txt file of Jane Austen's Pride and Prejudice.


That URL is downloaded, decoded, and stored in a string variable called bookInput.

In [3]:
import random
import urllib.request  # 'import urllib' alone does not reliably load the 'request' submodule
import textwrap
bookInput = urllib.request.urlopen('http://www.gutenberg.org/ebooks/42671.txt.utf-8').read().decode('utf-8')
print(bookInput)

﻿The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited
by R. W. (Robert William) Chapman


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org





Title: Pride and Prejudice


Author: Jane Austen

Editor: R. W. (Robert William) Chapman

Release Date: May 9, 2013  [eBook #42671]

Language: English


***START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE***


E-text prepared by Greg Weeks, Jon Hurst, Mary Meehan, and the Online
Distributed Proofreading Team (http://www.pgdp.net) from page images
generously made available by Internet Archive (https://archive.org)



Note: Project Gutenberg also has an HTML version of this
      file which includes the original illustrations.
      See 42671-h.htm or 42671-h.zip:
      (http://www.gute

Now, the next thing we want to do is split our file up into tokens that we can use for our Markov chain's 'states'. 

The easiest way to do this is just to split our book up into words using Python's 'split' method. This creates a list of every word in the book, in order.

In [4]:
wordList = bookInput.split()
print("The number of words in the book is:",(len(wordList)))
print(wordList)

The number of words in the book is: 124970
['\ufeffThe', 'Project', 'Gutenberg', 'eBook,', 'Pride', 'and', 'Prejudice,', 'by', 'Jane', 'Austen,', 'Edited', 'by', 'R.', 'W.', '(Robert', 'William)', 'Chapman', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org', 'Title:', 'Pride', 'and', 'Prejudice', 'Author:', 'Jane', 'Austen', 'Editor:', 'R.', 'W.', '(Robert', 'William)', 'Chapman', 'Release', 'Date:', 'May', '9,', '2013', '[eBook', '#42671]', 'Language:', 'English', '***START', 'OF', 'THE', 'PROJECT', 'GUTENBERG', 'EBOOK', 'PRIDE', 'AND', 'PREJUDICE***', 'E-text', 'prepared', 'by', 'Greg', 'Weeks,', 'Jon', 'Hurst,', 'Mary', 'Meehan,', 'and', 'the', 'Online', 'Dis

The next thing we want to do is take that big list, and transform it into a Markov Chain representation.

Remember, Markov Chains are systems that shift from one state to another according to probabilistic rules.

So, if we had the sequence: A B B C, we would know that A->B 100% of the time; B->B 50% of the time, B->C 50% of the time. We want to form a similar probabilistic model here, but using words. Something that tells us "If the current word is Pride, there is an 80% chance of going to the word 'and', a 10% chance of going to the word 'of', etc. etc. etc."
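Here's a minimal sketch of that idea, just to make the percentages concrete. The names (`sequence`, `followerCounts`, `transitionProbabilities`) are made up for this illustration and aren't used anywhere else in the notebook. It counts which state follows which in A B B C, then turns the counts into probabilities:

```python
from collections import Counter, defaultdict

sequence = ["A", "B", "B", "C"]

# Count how often each state is followed by each other state.
followerCounts = defaultdict(Counter)
for current, following in zip(sequence, sequence[1:]):
    followerCounts[current][following] += 1

# Turn the counts into probabilities by dividing by the total for each state.
transitionProbabilities = {
    state: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
    for state, counts in followerCounts.items()
}

for state, probabilities in transitionProbabilities.items():
    print(state, "->", probabilities)
```

Note that C never appears as a key: it is the last state in the sequence, so nothing ever follows it.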

The simplest way to do this is a dictionary: This will be a data structure that holds a mapping between each word, and all the words that ever follow that word. For example, if we had the sentence "A sailor went to sea to see what he could see", it would become the following dictionary:

----------------
A->sailor

sailor->went

went->to

to->sea,see

sea->to

see->what

what->he

he->could

could->see

------------------
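If you'd like to check that mapping for yourself, here is a small sketch that builds it for the sailor sentence. It follows the same two steps the notebook is about to use on the whole book, but the variable names (`sentence`, `words`, `followers`) are only for this example:

```python
sentence = "A sailor went to sea to see what he could see"
words = sentence.split()

# Start every unique word with an empty list of 'following' words.
followers = {word: [] for word in set(words)}

# For each word except the last, record the word that comes after it.
for current, following in zip(words, words[1:]):
    followers[current].append(following)

for word in followers:
    print(word, "->", followers[word])
```

The word 'to' ends up with two followers ('sea' and 'see'), because it appears twice in the sentence with a different word after it each time.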

First, we create a set from each unique word in our list.

Then, we use that set as keys to a dictionary, with each word linked to a (currently) empty list of 'following' words, which is written in Python as '[]'.




In [5]:
wordSet = set(wordList)
print("There are this many unique words in the list: ",len(wordSet))
wordDictionary = dict((word,[]) for word in wordSet)


There are this many unique words in the list:  13778


Now, it's time to populate the predictive model embodied in our dictionary with the actual data from our book.

We use a 'for' loop to step through our list of words. For each word in our book, we record the next word after it in our dictionary. 

In [6]:
for i in range(0,len(wordList)-1):
  currentWord = wordList[i]
  nextWord = wordList[i+1]
  wordDictionary[currentWord].append(nextWord)
print("model is built!")

model is built!


Let's see if it works! What words come after the word 'Darcy'?

In [7]:
if "Darcy" in wordDictionary:
  print(wordDictionary["Darcy"])

['soon', 'danced', 'had', 'walked', 'there', 'by', 'was', 'was', 'was', 'is', 'speaking', 'had', 'mean,"', 'only', 'stood', 'bowed.', 'with', 'is', 'replied', 'said', 'will', 'much', 'with', 'only', 'was', 'and', 'will', 'then', 'must', 'were', 'smiled;', 'had', 'took', 'had', 'into', 'felt', 'did', 'took', 'looked', 'in', 'may', 'is', 'is', 'has', 'it', 'and', 'corroborated', 'just', 'had', 'without', 'bequeathed', 'chose', 'liked', 'so', 'often', 'gave', 'has', 'is."', 'can', 'were', 'could', 'contradict', 'was', 'approached', 'spoke,', 'made', 'he', 'in', 'is', 'than', 'more', 'would', 'was', 'seemed', 'to', 'said', 'is', 'has', 'for', 'for', 'and', 'she', 'before', 'was', 'would', 'may', 'by', 'formerly', 'was', 'was', 'had', 'would', 'looked', 'they', 'spoke', 'looked', 'smiled', 'to', 'only,', 'drew', 'related', 'came', 'had', 'does', 'spirit,', 'likely', 'is', '_does_', 'of', 'give', 'had', 'could', 'himself', 'walk', 'changed', 'in', 'contemptuously;', 'could', 'gave', 'had', '
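Notice that some words, such as 'was' and 'had', appear several times in the list. That repetition is what encodes the probabilities: picking an entry at random from the list means a word that appears more often is more likely to be picked. As a rough sketch, using only the first twelve entries of the output above as a sample (the names `sampleFollowers` and `counts` are just for illustration):

```python
from collections import Counter

# The first twelve followers of 'Darcy' from the output above.
sampleFollowers = ['soon', 'danced', 'had', 'walked', 'there', 'by',
                   'was', 'was', 'was', 'is', 'speaking', 'had']

# Count the repeats and show each count as a share of the sample.
counts = Counter(sampleFollowers)
for word, count in counts.most_common(3):
    print(word, count / len(sampleFollowers))
```

These shares only describe the twelve-word sample; the full list for 'Darcy' would give different numbers.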

Here's where the fun begins. We've successfully built a predictive model: Now let's have it try to generate some text.

Let's start with a random word from our list, and randomly pick one of the words that it links to; then for that word, let's randomly pick one of the words that *it* links to; and so on!

The only tricky bit of code here is building in a safeguard for a situation where our current word doesn't link to anything else: If this is the case, we break out of the 'for' loop and end our story early.

In [8]:
story = ""
word = random.choice(wordList)
for i in range (0,100):
  story = story +" "+ word
  possibleNextWords = wordDictionary[word]
  if len(possibleNextWords)==0:
    break
  word = random.choice(possibleNextWords)

for line in textwrap.wrap(story):
  print(line)




 total ignorance and he was the country. Under such a great humility
of him, that had very soon as I expressed herself it might wish to the
ladies, of it, and now so did all go." "Very true; and not believe was
due than ever. The stupidity with them. My father captivated by
business at Rosings. Mr. Collins and as we are not know not, shall
certainly think exceedingly agreeable and Elizabeth saw the sake of
whose condition in the housekeeper came; and the same style, no
occasion could so pleasant girls than usually insolent thing from his
addresses. Donations


So, that's kind of generating text. It's not very coherent, though.

How can we improve our generative model?

A quick improvement is to use something called an n-gram: a sequence of n consecutive words. By treating each n-gram, rather than each single word, as a state, we make our output follow longer stretches of the original text, so it tends to obey implicit grammatical rules.

For example, the input "The cat sat in the hat" could be represented as:

---------
The->cat

cat->sat

sat->in

in->the

the->hat

---------

Or it could be represented as the following 2-grams:

---------

The cat->sat in

cat sat->in the

sat in->the hat


---------

Note how the second representation contains implicit knowledge about grammar, such as an article being followed by a noun (e.g. 'the hat').

We could even represent this through 3-grams:

---------

The cat sat -> in the hat

---------

Note that the larger your gram size, though, the more data you need to build your model.
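To see why, here is a small sketch that chops the cat sentence into grams of different sizes. The helper `makeGrams` is a made-up name for this example; it slides a window along the word list one word at a time, which is the same approach the next cell takes with the book:

```python
def makeGrams(words, gramSize):
    # Slide a window of gramSize words along the list, one word at a time.
    return [" ".join(words[i:i + gramSize])
            for i in range(len(words) - gramSize + 1)]

words = "The cat sat in the hat".split()

for gramSize in (1, 2, 3):
    grams = makeGrams(words, gramSize)
    print(gramSize, "-gram count:", len(grams), grams)
```

With only six words, every 3-gram is unique, so each one could have at most one follower and the generator would have no real choices to make. A bigger gram size needs a lot more text before the same gram turns up twice.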

Let's take our word list and transform it into a list of 3-grams.

In [9]:
gramSize = 3
gramList=[]

for i in range(0,len(wordList)-(gramSize-1)):
  gram = wordList[i]
  for j in range(1,gramSize):
    gram = gram+" "+wordList[i+j]
  gramList.append(gram)  

print(gramList)


['\ufeffThe Project Gutenberg', 'Project Gutenberg eBook,', 'Gutenberg eBook, Pride', 'eBook, Pride and', 'Pride and Prejudice,', 'and Prejudice, by', 'Prejudice, by Jane', 'by Jane Austen,', 'Jane Austen, Edited', 'Austen, Edited by', 'Edited by R.', 'by R. W.', 'R. W. (Robert', 'W. (Robert William)', '(Robert William) Chapman', 'William) Chapman This', 'Chapman This eBook', 'This eBook is', 'eBook is for', 'is for the', 'for the use', 'the use of', 'use of anyone', 'of anyone anywhere', 'anyone anywhere at', 'anywhere at no', 'at no cost', 'no cost and', 'cost and with', 'and with almost', 'with almost no', 'almost no restrictions', 'no restrictions whatsoever.', 'restrictions whatsoever. You', 'whatsoever. You may', 'You may copy', 'may copy it,', 'copy it, give', 'it, give it', 'give it away', 'it away or', 'away or re-use', 'or re-use it', 're-use it under', 'it under the', 'under the terms', 'the terms of', 'terms of the', 'of the Project', 'the Project Gutenberg', 'Project Guten

Now we can use that list of 3-grams to create a set of 3-grams, and a dictionary of 3-grams, just like we did with individual words.

In [10]:
gramSet = set(gramList)
print("There are this many unique grams in the list: ",len(gramSet))
gramDictionary = dict((gram,[]) for gram in gramSet)

for i in range(0,len(gramList)-gramSize):
  currentGram = gramList[i]
  nextGram = gramList[i+gramSize]
  gramDictionary[currentGram].append(nextGram)


There are this many unique grams in the list:  110665


Now, let's try generating some text from this model using exactly the same code as before...

In [11]:
story = ""
gram = random.choice(gramList)
for i in range (0,100):
  story = story +" "+ gram
  possibleNextGrams = gramDictionary[gram]
  if len(possibleNextGrams)==0:
    break
  gram = random.choice(possibleNextGrams)

for line in textwrap.wrap(story):
  print(line)

 this friendly caution, and you may assure yourself that no ungenerous
reproach shall ever pass my lips when we are married." It was
absolutely necessary to interrupt him now. "You are too hasty, Sir,"
she cried. "You forget that I have made no answer. Let me do it
without farther loss of time. Accept my thanks for the compliment you
are paying me. I am very glad you liked her. I hope she will turn out
well." "I dare say she will; she has got over the most trying age."
"Did you go by the village of Kympton?" "I do not recollect that we
did." "I mention it, because it is what I have done. I am not romantic
you know. I never was. I ask only a comfortable home; and considering
Mr. Collins's character, connections, and situation in life, I am
convinced that he felt it to be as much a debt of gratitude to _him_,
as of affection to myself." "How strange!" cried Elizabeth. "How
abominable!--I wonder that the very pride of this Mr. Darcy has not
made him just to you!--If from no better motive,

Much more coherent!
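Since the word model and the 3-gram model are built and used in the same way, the steps can be wrapped up into two functions that take the gram size as a parameter. This is only a sketch of one way to do it: `buildModel`, `generateStory`, `maxSteps` and `sampleWords` are names invented here, and the tiny sample sentence stands in for wordList so the code runs on its own.

```python
import random

def buildModel(words, gramSize):
    # Build the overlapping grams, then link each gram to the gram that
    # starts gramSize words later (the words that come right after it).
    grams = [" ".join(words[i:i + gramSize])
             for i in range(len(words) - gramSize + 1)]
    model = {gram: [] for gram in grams}
    for i in range(len(grams) - gramSize):
        model[grams[i]].append(grams[i + gramSize])
    return grams, model

def generateStory(grams, model, maxSteps=100):
    # Start from a random gram and keep hopping to a random follower.
    gram = random.choice(grams)
    pieces = []
    for _ in range(maxSteps):
        pieces.append(gram)
        if not model[gram]:
            break  # dead end: nothing ever followed this gram
        gram = random.choice(model[gram])
    return " ".join(pieces)

# A tiny stand-in for wordList (an assumption for this example only).
sampleWords = "the cat sat on the mat and the cat sat on the hat".split()
for gramSize in (1, 2, 3):
    grams, model = buildModel(sampleWords, gramSize)
    print(gramSize, "->", generateStory(grams, model, maxSteps=10))
```

With a gram size of 1 this behaves like the single-word model from earlier, and with a gram size of 3 it behaves like the 3-gram model.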

Now, if you're interested, go to the top and do the challenges.

Note that if you want to run all the code again, you don't have to step through every single line.

The menu at the top (Runtime->Run All) will just run all the code for you! You can then just scroll to the bottom to see the end product. You might have to wait a little while for the computation to complete, though.