# Project 3: Text Generator

⚠️   **Duplicate this project before you start working on it, using `File > Save a copy in drive`.**

The goal of this project is to build a procedural text generator: a program that can generate random text that might actually make sense, based on some original real text. It might generate jokes, wise sayings, movie trailers, restaurant reviews, research papers - that part will be up to you!

## Markov chain

A common approach for text generation these days might be a neural network trained on huge amounts of data, but we'll be using an older technique that requires very little data, and can be programmed entirely from scratch with what you know now in Python. Woo!

The technique is called a **Markov chain** and is used in many industries to model sequences of events based on probabilities. Here, the sequence of events is the sequence of words in an English sentence, and the probabilities are the likelihood that one word is followed by another word.

For example, we could create a model based solely on these sentences:

* I am mad.
* I am happy.
* I love you.
* I want cookies.
* I love cookies.

In this limited language (perhaps spoken by a toddler), the start word of every sentence is always "I". The next word has a 40% probability of being "am", a 40% probability of being "love", and a 20% probability of being "want". 

![Markov chain for I probabilities](https://corise.com/static/course/introduction-to-python/assets/cky534a6f00l01ga176er12br/markov_i.png)
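
These percentages can be checked with a few lines of Python. The sketch below is only an illustration, assuming the five toddler sentences are typed in as a list; the names `toddler_sentences` and `next_word_counts` are made up for this example and are not part of the project code.

```python
from collections import Counter

# The five sentences of the toddler language (periods dropped for simplicity)
toddler_sentences = [
    "I am mad",
    "I am happy",
    "I love you",
    "I want cookies",
    "I love cookies",
]

# Count which word comes right after "I" in each sentence
next_word_counts = Counter()
for sentence in toddler_sentences:
    words = sentence.split()
    for position in range(len(words) - 1):
        if words[position] == "I":
            next_word_counts[words[position + 1]] += 1

# Turn the counts into probabilities
total = sum(next_word_counts.values())
probabilities = {word: count / total for word, count in next_word_counts.items()}
print(probabilities)
```

"am" and "love" each follow "I" in two of the five sentences, and "want" in one, which is where the 40% / 40% / 20% split comes from.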


We can also compute the probabilities for the other words in the language. The word "am" has a 50% probability of being followed by "mad" and 50% probability of being followed by "happy". The word "love" has the same probabilities for "you" and "cookies". 

![Markov chain for am and love probabilities](https://corise.com/static/course/introduction-to-python/assets/cky537lac00l61ga18if91w9c/markov_amlove.png)

The word "want" is only ever followed by "cookies" (toddlers know where it's at).

![Markov chain for want probability](https://corise.com/static/course/introduction-to-python/assets/cky53j7wf00lb1ga15xpl85c9/markov_want.png)


The remaining words all have a 100% probability of being followed by a period, meaning that they're always the last word in a sentence.

![Markov chain for probabilities of you, cookies, mad, happy](https://corise.com/static/course/introduction-to-python/assets/cky532fpk00kr1ga16t6s9yjx/markov_madhappycookiesyou.png)


We've now modeled this tiny language according to the probabilities of one word following another word, and we can use this model to generate sentences. We just start from a start word, which is always "I" in this language, and then pick the next word based on the probabilities and random dice rolls. Eventually, we'd generate all the original sentences in the language.

But that's not very exciting: we want to generate new sentences! Well, if we added more and longer sentences to the input data, we'd likely end up generating never-before-seen sentences. For example, if "you want cookies" was an input sentence, then the chain could generate "I love you want cookies", because it would observe that "you" could be followed by either a period or "want". 

![New sentence generated by chain](https://corise.com/static/course/introduction-to-python/assets/cky53rh0n00li1ga1crhkcsii/markov_newsentence.png)

That particular sentence doesn't make grammatical sense, and many of the generated sentences won't, but trust me, some of the outputs will astound and amaze you.
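
To see where the new sentence comes from, here is a rough sketch that lists every sentence this small model could produce. It assumes the five toddler sentences plus "you want cookies" as input; the names `followers`, `all_sentences`, and the `"START"` / `"END"` markers are hypothetical and chosen just for this illustration.

```python
# Hypothetical example: the toddler sentences plus the extra "you want cookies"
input_sentences = [
    "I am mad", "I am happy", "I love you",
    "I want cookies", "I love cookies", "you want cookies",
]

# Map each word to the words seen after it ("END" marks the end of a sentence)
followers = {"START": []}
for sentence in input_sentences:
    words = sentence.split()
    followers["START"].append(words[0])
    for position, word in enumerate(words):
        is_last = position == len(words) - 1
        next_word = "END" if is_last else words[position + 1]
        followers.setdefault(word, []).append(next_word)

def all_sentences(word, so_far, max_words=10):
    """Returns every sentence reachable from `word`, up to max_words long."""
    if word == "END":
        return [" ".join(so_far)]
    if len(so_far) >= max_words:
        return []  # safety cap in case the data contains loops
    results = []
    for next_word in set(followers[word]):
        results.extend(all_sentences(next_word, so_far + [word], max_words))
    return results

possible = set()
for start_word in set(followers["START"]):
    possible.update(all_sentences(start_word, []))

# Sentences the model can produce that were not in the input
new_sentences = sorted(possible - set(input_sentences))
print(new_sentences)
```

Besides "I love you want cookies", this model can also produce the one-word sentence "you", because "you" now counts as a start word and can be followed directly by the end of a sentence.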

Now, it's time for you to actually build a Markov chain yourself. Ba-bum-bum!

## Read input data

What input data do you want to use for your Markov chain? It's up to you! Here are a few options:

* [sayings.txt](https://gist.githubusercontent.com/pamelafox/ef698fe3a1a6950949e3987fe1303c07/raw/672570d0e002524f53c2f3b32dc0f70d905f5d1f/sayings.txt): ~2400 wise (or not so wise) sayings.
* [titles.txt](https://gist.githubusercontent.com/pamelafox/cf569b133d717ab02c6635926ffc960a/raw/2c4913d8dd86a97ab602d48394638b24228e5ba2/movie_titles.txt): 6800 movie titles (which are on the shorter side, so your generated "sentences" will be quite short)
* [composingprograms.txt](https://corise.com/static/course/introduction-to-python/assets/cky8zb2g800et1g4igmx1bjbz/python_textbook.txt): 3800 sentences from composingprograms.com, a textbook about Python, Scheme, and SQL.

You're also welcome to use another data source, such as exporting your own social media posts or downloading a whole book from [Project Gutenberg](https://www.gutenberg.org/). Just be prepared to do a little more work cleaning the data into sentences.

### ✏︎ For you to do

The code below already opens a URL and stores its lines in a variable.

* Change the URL to your preferred data source. 
* If using the data sources above, right click and copy the link to get the full URL starting with `http`. 
* If you've uploaded a new data source as a file to CoLab, change the code to instead use the `open()` function with the filename.
* Check that `sentences[0]` and `len(sentences)` are what you expected them to be.
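
If you go the uploaded-file route, the reading step might look something like the sketch below. It assumes a file with one sentence per line; the filename `my_sentences.txt` and the function name `read_sentences` are made up for this example, and the sample file is written first only so the snippet runs on its own.

```python
def read_sentences(filename):
    """Reads a text file and returns its non-empty lines, one sentence per line."""
    with open(filename, encoding="utf-8") as text_file:
        lines = text_file.read().splitlines()
    return [line.strip() for line in lines if line.strip()]

# Create a tiny sample file so the example runs on its own
with open("my_sentences.txt", "w", encoding="utf-8") as sample_file:
    sample_file.write("I am mad.\nI am happy.\n\nI love you.\n")

sentences_from_file = read_sentences("my_sentences.txt")
print(sentences_from_file[0])
print(len(sentences_from_file))
```

Using `with open(...)` closes the file automatically once the block finishes.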

# Looking at the Book Data (sentences) 
* Splitting the data on carriage return + line feed (`\r\n\r`) gives ~2k lines 
* Splitting on a period loses the context of the sentences and gives ~4.6k lines 
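
The difference between the two delimiters is easier to see on a tiny sample. This is a made-up string (not taken from the book) that uses the same `\r\n` line endings, so the counts below only illustrate the idea.

```python
# Made-up sample using Windows-style line endings (\r\n), like the Gutenberg file
sample_text = "Title: Sample\r\n\r\nHe said no. She left.\r\n\r\nThe end."

# Splitting on the blank-line marker keeps each paragraph together
paragraphs = sample_text.split("\r\n\r")
print(len(paragraphs), paragraphs)

# Splitting on periods gives more, shorter pieces and an empty string at the end
pieces = sample_text.split(".")
print(len(pieces), pieces)
```

Note that each paragraph after the first still starts with a leftover `\n`, which is why some cleanup is needed afterwards.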




In [None]:
import urllib.request

# Open the file
text_file = urllib.request.urlopen('https://www.gutenberg.org/files/61262/61262-0.txt')

# Read the file into a string according to UTF-8 encoding
text_contents = text_file.read().decode('utf-8')

# Split the string into paragraphs (using the blank line between them, '\r\n\r', as delimiter)
sentences = text_contents.split('\r\n\r')
# sentences = text_contents.split('.')

# Check a few sentences
print(sentences[0])
print(len(sentences))

﻿The Project Gutenberg EBook of Poirot Investigates, by Agatha Christie
2108


In [None]:
# Looking at data
# Reviewed the 4.6 K of lines - narrowed down to 2k of lines and punctuation with 1st split
sentences[0:12]

['\ufeffThe Project Gutenberg EBook of Poirot Investigates, by Agatha Christie',
 "\nThis eBook is for the use of anyone anywhere in the United States and most\r\nother parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms of\r\nthe Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you'll have\r\nto check the laws of the country where you are located before using this ebook.",
 '\nTitle: Poirot Investigates',
 '\nAuthor: Agatha Christie',
 '\nRelease Date: January 28, 2020 [EBook #61262]',
 '\nLanguage: English',
 '\nCharacter set encoding: UTF-8',
 '\n*** START OF THIS PROJECT GUTENBERG EBOOK POIROT INVESTIGATES ***',
 '\n',
 '\n\r\nProduced by an anonymous Project Gutenberg volunteer.',
 '\n',
 '\n']

In [None]:
import string 

# This creates a dictionary that I can update with a number to see how many times each punctuation mark shows up. 
# It's here as a future stretch goal: pass it to the punctuation function, then use it for random generation. 
# A dict comprehension gives every key its own list (dict.fromkeys would share one list across all keys). 
punct_dict = {char: ["PUNCTUATION"] for char in string.punctuation}
punct_dict

{'!': ['PUNCTUATION'],
 '"': ['PUNCTUATION'],
 '#': ['PUNCTUATION'],
 '$': ['PUNCTUATION'],
 '%': ['PUNCTUATION'],
 '&': ['PUNCTUATION'],
 "'": ['PUNCTUATION'],
 '(': ['PUNCTUATION'],
 ')': ['PUNCTUATION'],
 '*': ['PUNCTUATION'],
 '+': ['PUNCTUATION'],
 ',': ['PUNCTUATION'],
 '-': ['PUNCTUATION'],
 '.': ['PUNCTUATION'],
 '/': ['PUNCTUATION'],
 ':': ['PUNCTUATION'],
 ';': ['PUNCTUATION'],
 '<': ['PUNCTUATION'],
 '=': ['PUNCTUATION'],
 '>': ['PUNCTUATION'],
 '?': ['PUNCTUATION'],
 '@': ['PUNCTUATION'],
 '[': ['PUNCTUATION'],
 '\\': ['PUNCTUATION'],
 ']': ['PUNCTUATION'],
 '^': ['PUNCTUATION'],
 '_': ['PUNCTUATION'],
 '`': ['PUNCTUATION'],
 '{': ['PUNCTUATION'],
 '|': ['PUNCTUATION'],
 '}': ['PUNCTUATION'],
 '~': ['PUNCTUATION']}
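
The stretch goal mentioned in the comment above (counting how often each punctuation mark shows up) could be sketched with `collections.Counter`. This is one possible approach, not the notebook's own; the function name `count_punctuation` and the sample text are made up for illustration.

```python
import string
from collections import Counter

def count_punctuation(text):
    """Counts how many times each ASCII punctuation character appears in text."""
    return Counter(char for char in text if char in string.punctuation)

punct_counts = count_punctuation("Ah! What? Why, yes... yes!")
print(punct_counts.most_common())
```

`string.punctuation` only covers ASCII characters, so curly quotes and long dashes from the book would need to be added separately.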

In [None]:
import string

# Translation dictionary: maps each character to its replacement 
#     (an empty string strips the character without leaving a space). 
#     It will be passed to str.maketrans: 
#     https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans
# Examples here at: 
#     https://www.tutorialsteacher.com/python/string-maketrans

strip_punct_dict ={ 
	'/': '',
	'\\': '',
	'_': '',
	'"': '',
	'“': '',
	'”': '',
	'-': '',
	':': '',
	';': '', 
	'•': '',
	'‘': '',
	'\n': ' ',
	'\r': '',
	'—': ' ',
}

In [None]:
# https://docs.python.org/3/library/stdtypes.html#str.maketrans
# str.maketrans(dict)
# Returns a translation table suitable for passing to str.translate(). 
# With a single dictionary argument, each key must be a single character 
# (or a Unicode code point) and each value is its replacement string.

# The strip_punct_dict was created by hand after review of the data
# It could be generated dynamically. Left it out of scope for now to work on code mechanics 
remove_punct = str.maketrans(strip_punct_dict)
remove_punct # the keys are the Unicode code points (in decimal) of the original characters

# Note: I could have removed more, yet I didn't want to do that on the first go. 
#       I wanted to see what fun the words with punctuation would bring to the randomly generated sentences. 
#       I now know I'd like to leave the punctuation in and add variety at the end so it's not just a period. 

{10: ' ',
 13: '',
 34: '',
 45: '',
 47: '',
 58: '',
 59: '',
 92: '',
 95: '',
 8212: ' ',
 8216: '',
 8220: '',
 8221: '',
 8226: ''}
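
To see a translation table in action before applying it to the whole book, here is a small sketch. The table `sample_table` is a shortened, made-up version of the dictionary above, and `raw_line` is an invented test string.

```python
# A smaller, made-up table in the same style as strip_punct_dict
sample_table = str.maketrans({"-": "", ":": "", "\n": " ", "\r": ""})

raw_line = "Character set encoding: UTF-8\r\nre-use it"
clean_line = raw_line.translate(sample_table)
print(clean_line)
```

Characters mapped to an empty string disappear, while the newline is replaced by a space so the words on either side do not run together.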

In [None]:
# Need to reduce excess characters & lines after the translation 

cleaned_sentences = []

for sentence in sentences:
  sentence = sentence.lower()
  sentence = sentence.strip('”')
  sentence = sentence.lstrip()
  sentence = sentence.rstrip()
  sentence = sentence.translate(remove_punct)
  if sentence.strip() != '': # throws away empty lines 
    cleaned_sentences.append(sentence) # appends only lines with data 


In [None]:
# Still a list of strings 
print(type(cleaned_sentences))
# Slicing to see how it has evolved in iterations 
cleaned_sentences[2:10]  # I think it's going to be funny with a *** word :D 

<class 'list'>


['title poirot investigates',
 'author agatha christie',
 'release date january 28, 2020 [ebook #61262]',
 'language english',
 'character set encoding utf8',
 '*** start of this project gutenberg ebook poirot investigates ***',
 'produced by an anonymous project gutenberg volunteer.',
 'poirot investigates']

In [None]:
# Obsolete now -- left in to show how I built it in iterations 
# 
# Cleaned Sentences is the base for the other functions to use 
## cleaned_sentences = []
## for line in in_text_after_translate[:]:
##    line = line.lstrip()
##    line = line.rstrip()
##    if line.strip() != '': # throws away empty lines 
##        cleaned_sentences.append(line) # appends only lines with data 


In [None]:
# Testing out the split of sentences and making all the text lower case 

#split_sentence = []
#for sentence in cleaned_sentences[2:10]:
#  sentence = sentence.lower()
#  split_sentence.append(sentence.split())
#
# print(split_sentence)

[['title', 'poirot', 'investigates'], ['author', 'agatha', 'christie'], ['release', 'date', 'january', '28,', '2020', '[ebook', '#61262]'], ['language', 'english'], ['character', 'set', 'encoding', 'utf8'], ['***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'poirot', 'investigates', '***'], ['produced', 'by', 'an', 'anonymous', 'project', 'gutenberg', 'volunteer.'], ['poirot', 'investigates']]


In [None]:
len(cleaned_sentences)
cleaned_sentences[200:220]

['oh, milord, i fear to incommode you. we have left our bags at the inn.',
 'that’s all right. lord yardly had his cue. we’ll send down for them. no, no no trouble, i assure you.',
 'poirot permitted himself to be persuaded, and sitting down by lady yardly, began to make friends with the children. in a short time they were all romping together, and had dragged me into the game.',
 'vous êtes bonne mère, said poirot, with a gallant little bow, as the children were removed reluctantly by a stern nurse.',
 'lady yardly smoothed her ruffled hair.',
 'i adore them, she said with a little catch in her voice.',
 'and they you with reason! poirot bowed again.',
 'a dressinggong sounded, and we rose to go up to our rooms. at that moment the butler entered with a telegram on a salver which he handed to lord yardly. the latter tore it open with a brief word of apology. as he read it he stiffened visibly.',
 'with an ejaculation, he handed it to his wife. then he glanced at my friend.',
 'just a m

## Initialize the Markov chain

For this project, you'll be storing the chain using a dictionary, with each key as a word, and each value as a list of words that come after that word. It's a great data structure for Markov chains since we can very quickly look up values by word.

For example, the final chain for the simple toddler language would look like:

```
chain = {
  "START_OF_SENTENCE": ["I"],
  "I": ["am", "am", "love", "want", "love"],
  "am": ["mad", "happy"],
  "love": ["you", "cookies"],
  "want": ["cookies"],
  "mad": ["END_OF_SENTENCE"],
  "happy": ["END_OF_SENTENCE"],
  "you": ["END_OF_SENTENCE"],
  "cookies": ["END_OF_SENTENCE"]
}
```

Every chain starts off with a word that isn't really a word - `START_OF_SENTENCE`. 
That key stores all the possible words that can start a sentence.

As you can see, there's also a special not-word `END_OF_SENTENCE` indicating the end of the sentence, so that we know how often a word is followed by a period or new line.

The values contain an entry for every time a word shows up after the key word, instead of storing the probability, simply because it's easier to code up that way. It'd also be possible to store the probabilities instead, if you wanted.
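
For the curious, here is a rough sketch of what the probability-based alternative might look like. It is not needed for the project; the function name `to_probabilities` is made up, and `random.choices` (with its `weights` argument) is used here as one way to pick a word from stored probabilities.

```python
import random
from collections import Counter

def to_probabilities(followers):
    """Converts a list of following words into a {word: probability} dictionary."""
    counts = Counter(followers)
    total = len(followers)
    return {word: count / total for word, count in counts.items()}

i_probabilities = to_probabilities(["am", "am", "love", "want", "love"])
print(i_probabilities)

# random.choices accepts weights, so it can pick from the probabilities directly
next_word = random.choices(
    list(i_probabilities.keys()),
    weights=list(i_probabilities.values()),
)[0]
print(next_word)
```

The list-of-words version stores the same information with less code, which is why the project sticks with it.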


### ✏︎ For you to do

Run the code below to initialize the chain with the single starting key:

In [None]:
chain = {"START_OF_SENTENCE": []}

In [None]:
chain

{'START_OF_SENTENCE': []}

## Wordify each sentence

The `sentences` list right now contains strings that are entire sentences, but the chain dictionary needs keys and values that are words.

In this step, you'll write a short function that can turn a sentence into a list of words. Use the string methods that we learned earlier in the week.

### ✏︎ For you to do

Implement `wordify_sentence` and run the doctests to make sure it works as expected.



In [None]:
def wordify_sentence(clean_sentence):
  """ 
  Splits a sentence into words (space-separated),
  stripping off any periods at the end 
  and lowercasing all the words.
  Leading and trailing whitespace should also be stripped.

  >>> wordify_sentence('Money burns a hole in your pocket.')
  ['money', 'burns', 'a', 'hole', 'in', 'your', 'pocket']
  >>> wordify_sentence(' Miss Jerry\\n ')
  ['miss', 'jerry']
  """
  split_sentence = []
  
  words = clean_sentence.split()
  for word in words:
    split_sentence.append(word.lower().strip("."))

  return split_sentence 

wordify_sentence('a dressinggong sounded, and we rose to go up to our rooms. at that moment the butler entered with a telegram on a salver which he handed to lord yardly. the latter tore it open with a brief word of apology. as he read it he stiffened visibly.')


['a',
 'dressinggong',
 'sounded,',
 'and',
 'we',
 'rose',
 'to',
 'go',
 'up',
 'to',
 'our',
 'rooms',
 'at',
 'that',
 'moment',
 'the',
 'butler',
 'entered',
 'with',
 'a',
 'telegram',
 'on',
 'a',
 'salver',
 'which',
 'he',
 'handed',
 'to',
 'lord',
 'yardly',
 'the',
 'latter',
 'tore',
 'it',
 'open',
 'with',
 'a',
 'brief',
 'word',
 'of',
 'apology',
 'as',
 'he',
 'read',
 'it',
 'he',
 'stiffened',
 'visibly']

In [None]:
# Run the tests when you're ready
import doctest
doctest.run_docstring_examples(wordify_sentence, globals(), verbose=True, name="wordify_sentence")

Finding tests in wordify_sentence
Trying:
    wordify_sentence('Money burns a hole in your pocket.')
Expecting:
    ['money', 'burns', 'a', 'hole', 'in', 'your', 'pocket']
ok
Trying:
    wordify_sentence(' Miss Jerry\n ')
Expecting:
    ['miss', 'jerry']
ok


## Build the Markov chain (Part 1)

In this step, you'll make a function that can take a single sentence and update the chain based on each word in the sentence. If the word is the first word in the sentence, then it belongs in the value of the "START_OF_SENTENCE" key. If the word is followed by another word, then that subsequent word needs to be added to its value. Otherwise, if it's the last word in the sentence, then "END_OF_SENTENCE" needs to be added to its value.

One tricky thing to look out for: the chain only starts off with a single key, "START_OF_SENTENCE", so none of the other words in the sentences will exist as keys in the dictionary yet. Before you can update the values for those other words, you'll need to add a key in the dictionary for them and initialize it with an empty list - but only if it didn't exist yet!
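
The "only if it didn't exist yet" part is what keeps earlier values from being wiped out. A small sketch of the pattern, using a made-up dictionary named `example_chain` rather than the real chain:

```python
# A made-up chain that only has the starting key so far
example_chain = {"START_OF_SENTENCE": []}

# Explicit check: add the key with an empty list only if it is missing
if "cookies" not in example_chain:
    example_chain["cookies"] = []
example_chain["cookies"].append("END_OF_SENTENCE")

# Running the same check again does not wipe out the existing list
if "cookies" not in example_chain:
    example_chain["cookies"] = []
example_chain["cookies"].append("are")

print(example_chain["cookies"])

# dict.setdefault does the check and the lookup in one step
example_chain.setdefault("mad", []).append("END_OF_SENTENCE")
print(example_chain["mad"])
```

Either style works; the explicit `if` is easier to read when you are first getting used to dictionaries.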

### ✏︎ For you to do

Implement the `add_sentence` function below, following the guidance in the comments. Run the doctests to see if it works as expected.

In [None]:
def add_sentence(chain, sentence):
  """
  >>> test_chain = {'START_OF_SENTENCE': []}
  >>> add_sentence(test_chain, 'I am happy')
  >>> test_chain
  {'START_OF_SENTENCE': ['i'], 'i': ['am'], 'am': ['happy'], 'happy': ['END_OF_SENTENCE']}
  >>> add_sentence(test_chain, 'I am')
  >>> test_chain
  {'START_OF_SENTENCE': ['i', 'i'], 'i': ['am', 'am'], 'am': ['happy', 'END_OF_SENTENCE'], 'happy': ['END_OF_SENTENCE']}
  """
  # Note the pattern is different for start and end of sentence. 
  # How you append below is going to be 3 different use cases to solve. 

  # Split the sentence into a list of words
  words = wordify_sentence(sentence)

  # Loop through each word in the list
  # (Using a while loop gives you a way to know
  # if the word is the first or the last)
  i = 0
  while i < len(words):
    word = words[i]

    # Handle case of first word in sentence
    if i == 0:
      chain["START_OF_SENTENCE"].append(word)

    # If word isn't in chain yet, add it as a key
    if word not in chain:
      chain[word] = []

    # Now figure out what word to add to
    # the list of values for this word
  
    # First, handle case of last word in sentence
    if i == len(words) - 1:
      chain[word].append("END_OF_SENTENCE")
    # Otherwise, handle case of a word that has a word after
    else:
      chain[word].append(words[i + 1])

    i += 1


In [None]:
# Run the tests when you think add_sentence is working
import doctest
doctest.run_docstring_examples(add_sentence, globals(), verbose=True, name="add_sentence")

Finding tests in add_sentence
Trying:
    test_chain = {'START_OF_SENTENCE': []}
Expecting nothing
ok
Trying:
    add_sentence(test_chain, 'I am happy')
Expecting nothing
ok
Trying:
    test_chain
Expecting:
    {'START_OF_SENTENCE': ['i'], 'i': ['am'], 'am': ['happy'], 'happy': ['END_OF_SENTENCE']}
ok
Trying:
    add_sentence(test_chain, 'I am')
Expecting nothing
ok
Trying:
    test_chain
Expecting:
    {'START_OF_SENTENCE': ['i', 'i'], 'i': ['am', 'am'], 'am': ['happy', 'END_OF_SENTENCE'], 'happy': ['END_OF_SENTENCE']}
ok


## Build the Markov chain (Part 2)

Now that you've got `add_sentence` working, it's time to call it on every sentence from your input data.

### ✏︎ For you to do

Use a loop to go through each sentence in `sentences` (the list of sentences stored above) and call `add_sentence` on each of them. Once you've done that, you can log out the entire `chain` dictionary to see what it looks like, or just a single key, like `chain["the"]`.

In [None]:
# YOUR LOOP HERE
chain = {"START_OF_SENTENCE":[]}

for sentence in cleaned_sentences[0:100]:
  add_sentence(chain=chain, sentence=sentence)
print(chain["the"]) # This is only printing a subset. :D 


['use', 'united', 'world', 'terms', 'project', 'united', 'laws', 'country', 'same', 'mysterious', 'secret', 'murder', 'links', 'bodley', 'bodley', 'bodley', 'adventure', 'western', 'tragedy', 'adventure', 'cheap', 'mystery', 'million', 'adventure', 'egyptian', 'grand', 'kidnapped', 'disappearance', 'adventure', 'italian', 'case', 'missing', 'adventure', 'western', 'window', 'street', 'depths', 'following', 'houses', 'girl,', 'girl', 'shadowers', 'scoundrels,', 'great', 'great', 'simplest', 'window', 'film', 'fact!', 'screen,', 'streets', 'nonessentials!', 'case', 'dancer,', 'best', 'mode,', 'dernier', 'little', 'most', 'american', 'most', 'screen', 'states', 'great', 'western', 'enormous', 'wide', 'mystery', 'dark', 'latter', 'name', 'inside', 'enclosure', 'writing', 'envelope', 'great', 'left', 'god', 'second', 'same', 'third', 'diamond', 'full', 'moon,', 'two', 'left', 'god', 'first', 'second,', 'third', 'matter', 'stone', 'diamond', 'western', 'time,', 'stone,', 'chink', 'thing', 's

In [None]:
cleaned_sentences[0:100] # notice that the one \ufeff is the starting character which I left in. 

['\ufeffthe project gutenberg ebook of poirot investigates, by agatha christie',
 "this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever.  you may copy it, give it away or reuse it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org.  if you are not located in the united states, you'll have to check the laws of the country where you are located before using this ebook.",
 'title poirot investigates',
 'author agatha christie',
 'release date january 28, 2020 [ebook #61262]',
 'language english',
 'character set encoding utf8',
 '*** start of this project gutenberg ebook poirot investigates ***',
 'produced by an anonymous project gutenberg volunteer.',
 'poirot investigates',
 'by the same author',
 'the mysterious affair at styles',
 'the secret adversary',
 'the murder on the links',
 'the bodley head',
 'poirot investigates',
 'by a

In [None]:
chain # really cool to see the contents of the chain... this would make a cool heat map where the most frequent words are larger vs. one-offs smaller

# This was cool to see that end of sentence was repeated:
#
# 'christie': ['END_OF_SENTENCE',
#   'END_OF_SENTENCE',
#   'END_OF_SENTENCE',
#   'limited']

{'#61262]': ['END_OF_SENTENCE'],
 '***': ['start', 'END_OF_SENTENCE'],
 '1924': ['END_OF_SENTENCE', 'agatha'],
 '2020': ['[ebook'],
 '28,': ['2020'],
 'START_OF_SENTENCE': ['\ufeffthe',
  'this',
  'title',
  'author',
  'release',
  'language',
  'character',
  '***',
  'produced',
  'poirot',
  'by',
  'the',
  'the',
  'the',
  'the',
  'poirot',
  'by',
  'london',
  'john',
  'first',
  'copyright',
  'contents',
  'i',
  'ii',
  'iii',
  'iv',
  'v',
  'vi',
  'vii',
  'viii',
  'ix',
  'x',
  'xi',
  'poirot',
  'poirot',
  'i',
  'the',
  'i',
  'that’s',
  'what',
  'deduce,',
  'the',
  'in',
  'as',
  'i',
  'so',
  'en',
  'i',
  'about',
  'and',
  'she',
  'ah!',
  'i',
  'but',
  'you',
  'what',
  'what?',
  'without',
  'how',
  'very',
  'as',
  'mary',
  'all',
  'miss',
  'poirot',
  'you',
  'she',
  'proceed,',
  'it’s',
  'the',
  'cheap',
  'i',
  'the',
  'the',
  'you',
  'the',
  'i',
  'no',
  'why?',
  'because',
  'i',
  'the',
  'poirot',
  'the',
  'i',


# Lessons learned: Running the whole 4k lines (accidentally appended repeatedly) gave the error below

Ha ha ha! I had 2 appends, so the list kept growing (doubling-grains-of-rice-on-a-chessboard territory. Oops!). I fixed it. 

"IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`."

Current values:
* NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
* NotebookApp.rate_limit_window=3.0 (secs)
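
One way to look at a large chain without sending that much output to the notebook is to print only a small preview. This is a sketch under the assumption that the chain is a dictionary of lists, as in this project; the helper name `preview_chain` and the sample `big_chain` are made up.

```python
from itertools import islice

def preview_chain(chain, max_keys=5, max_values=5):
    """Returns a small slice of the chain: a few keys, each with a few values."""
    preview = {}
    for key, values in islice(chain.items(), max_keys):
        preview[key] = values[:max_values]
    return preview

# A made-up chain with very long value lists
big_chain = {
    "START_OF_SENTENCE": ["the"] * 1000,
    "the": ["end"] * 1000,
    "end": ["END_OF_SENTENCE"],
}
small_view = preview_chain(big_chain, max_keys=2, max_values=3)
print(small_view)
```

Checking `len(chain)` and `len(chain["the"])` first is another quick way to spot a list that has grown far larger than expected.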


## Generate new sentences

Finally, the big pay-off! In this step, you'll implement the function `generate_sentence`. The first word in a sentence is always one of the words from the "START_OF_SENTENCE" key. Then, the next word must come from the words for that start word's key in the chain. Keep picking words until the chosen word is "END_OF_SENTENCE".

At the beginning we said that Markov chains are probabilistic: some words are more likely than others to come after a given word. How can you randomly choose words so that you're more likely to choose the words that most commonly follow? Well, since the values in the chain store a word each time it's seen, randomly picking from that list of words _will_ end up picking those repeated words more often than others. So you can just use the handy `random.choice(list)` function on the list, and it'll just work.
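
If you'd like to convince yourself of that, here is a quick experiment. It uses the followers of "I" from the toddler language and a seeded `random.Random` generator so the run is repeatable; the names `followers`, `rng`, and `picks` are chosen just for this sketch.

```python
import random
from collections import Counter

# The words that follow "I" in the toddler language, repeats included
followers = ["am", "am", "love", "want", "love"]

# Seeded generator so the experiment is repeatable
rng = random.Random(0)
picks = Counter(rng.choice(followers) for _ in range(10_000))
print(picks)
```

"am" and "love" each appear twice in the list, so each should be picked roughly twice as often as "want", which appears once.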


In [None]:
import random

def generate_sentence(chain):
  """
  >>> test_chain = {'START_OF_SENTENCE': ['i'], 'i': ['am'], 'am': ['happy'], 'happy': ['END_OF_SENTENCE']}
  >>> generate_sentence(test_chain)
  'i am happy'
  """
  # Initialize a list to hold the words for the new sentence
  words = []

  # Select the first word randomly from all possible start words
  random_word = random.choice(chain["START_OF_SENTENCE"])

  # Append the first word to the new sentence
  words.append(random_word)

  # Keep looping, selecting the next word to come after
  # and adding it to the new sentence
  # until the word reached is "END_OF_SENTENCE"
  #
  while True:
      new_word = random.choice(chain[random_word])
      if new_word == "END_OF_SENTENCE":
        break
      random_word = new_word
      words.append(random_word)

  # Return the sentence as a string (instead of a list of words)
  return " ".join(words)

In [None]:
# Run the tests when you think generate_sentence is working
import doctest
doctest.run_docstring_examples(generate_sentence, globals(), verbose=True, name="generate_sentence")

Finding tests in generate_sentence
Trying:
    test_chain = {'START_OF_SENTENCE': ['i'], 'i': ['am'], 'am': ['happy'], 'happy': ['END_OF_SENTENCE']}
Expecting nothing
ok
Trying:
    generate_sentence(test_chain)
Expecting:
    'i am happy'
ok


### ✏︎ For you to do

Run the code below to generate a sentence with your chain. Each time you run it, you should see a new sentence. 

In [None]:
generate_sentence(chain) # "so all these letters" ;-D 

'so all these letters'

In [None]:
generate_sentence(chain)
# I can see this getting mad libs quite soon 
# 
# "i mistake not, miss marvell unclasped her gown, 
#  drawing out any information gregory says to be a vague echo of an irish colleen? 
# always with almost no fashionable dentist still in the million dollar bond robbery at marsdon manor"


'i mistake not, miss marvell unclasped her gown, drawing out any information gregory says to be a vague echo of an irish colleen? always with almost no fashionable dentist still in the million dollar bond robbery at marsdon manor'

## You're done!

🎉 Woohoo! 🥳 Project 3! 👏🏽 Complete! 🎊 

Remember to share your project and your favorite generated sentences with your classmates. I can't wait to see them!

## Extensions

If you enjoyed this project and want to take it further, there are many ways to affect the behavior of a Markov chain. Some ideas:

* **New data source**: Try a different data source than the one you originally tested with, or go out and try to find a brand new data source. You might find text files online, you could fetch data from an API (like [The Movie Database API](https://developers.themoviedb.org/3)), or you could scrape data from a website using the [BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (check the website terms first).
* **Better punctuation support**: What happens right now if a sentence contains commas, hyphens, or other sorts of punctuation? How does that affect the result of your chain? Think about what sort of string processing you might do to handle punctuation better, and whether you'd want to strip some punctuation entirely or store some punctuation symbols as "words" in the chain.
* **Multiple sentence support**: Very related to above, what if a line contains multiple sentences? How will the chain currently treat that situation? See if you can come up with a more robust handling.
* **Bigrams**: The chain is currently based on "unigrams": single words and their likelihood of being followed by another word. One way to make the chain produce more natural language is to instead use "bigrams" - pairs of words. The chain would store bigrams in the keys instead of single words. Then, when generating sentences, it would need to look up the next word based on the most recent _two_ words in the sentence, not just the most recent word. 
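
As a starting point for the bigram idea, the sketch below shows one possible way to build such a chain, using tuples of two words as keys. The function name `add_sentence_bigrams` is hypothetical, it takes an already-split list of words, and it simply skips sentences shorter than two words; your own design may differ.

```python
def add_sentence_bigrams(bigram_chain, words):
    """Updates bigram_chain using pairs of words as keys (a hypothetical variant)."""
    if len(words) < 2:
        return  # this sketch skips sentences too short to form a pair
    # The first pair of words is a possible sentence start
    bigram_chain.setdefault("START_OF_SENTENCE", []).append((words[0], words[1]))
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        is_last_pair = i + 2 >= len(words)
        next_word = "END_OF_SENTENCE" if is_last_pair else words[i + 2]
        bigram_chain.setdefault(pair, []).append(next_word)

bigram_chain = {}
add_sentence_bigrams(bigram_chain, ["i", "love", "you"])
add_sentence_bigrams(bigram_chain, ["i", "love", "cookies"])
print(bigram_chain[("i", "love")])
```

Generating a sentence would then start from a random pair under `START_OF_SENTENCE` and keep looking up the most recent two words until `END_OF_SENTENCE` comes up.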

Enjoy, and let us know if you try out any of the extensions!

# Thoughts 

This book would be so **awesome** to run through a Markov chain, just because of the word "Woddly", and then add random emojis <3 
After I did this work I went looking for the book on Project Gutenberg and did not find it :( 

https://www.amazon.com/Great-Quillow-James-Thurber/dp/0152325441



# Self Assessment 


* Are the tests passing for all the functions (wordify_sentence, add_sentence, generate_sentence)? Yes 

* What was the trickiest part of this project?
Realizing that my initial work on `cleaned_sentences`, where I combined code from two notebook cells, accidentally kept adding to the list and made umpteen duplicates. Then, when I tried to look up "the", the notebook bombed (ha ha ha), so I had to go back to first principles.

* Is there anywhere you are still stuck or confused? I'd like to add random punctuation, but I am torn if I should only catalog that list from the source or just tell my code to pull from the dictionary that I created. 

* Or any particular part of the code you’d like focused feedback on? Wondering how folks kept track of their punctuation when breaking out sentences in initial load. I did like that my sentences were long when the original sentence was long and short when the sentence was short. 

* What’s your favorite output from your chain so far?
"poirot is a fashionable dentist still less is coming along slowly, looking up at the screen, mon ami? asked about a bevy of a stone without doubt she is a few years ago in a joke, explained miss mary marvell unclasped her narrowly"

* If you could improve the output from this chain in one way, what would it be? Wondering if I can catalog the random exclamations, as the mystery book by Agatha Christie has them peppered throughout, yet they all got lumped into the main list. And adding in random punctuation. 


# Peer Review

When your partners post their projects, please peer review them by answering the following questions via a threaded reply on their Slack post (due EOD Monday):
* Generate a few sentences and share your favorite output.
* How similar was their approach to your own approach? Any notable differences in your functions (like wordify_sentence)?
* What’s one way they might improve their project? For example, you might have ideas for how their code could be cleaner. If they indicated in their self-assessment that they were stuck and/or want focused feedback, please provide ideas if you can.
* Any additional thoughts? Feel free to add words of encouragement as well.
