In [1]:
from __future__ import division, print_function, unicode_literals
import unittest
import reference_implementation as ref_impl # Solutions to the problems

# Congrats! 🎉

Believe it or not, you've learned a huge chunk of the Python language, and have all the basic skills you need to move on to learning the scientific libraries and start playing with data! We'll start doing this in the next set of exercises. There's one chapter after this one that you should complete, which will deal with some more advanced language features that are extremely useful, and make Python a more fun and expressive language to code in. For now though, you should feel accomplished that you've made it this far!

The concepts and skills that you've learned (the if statement, for loops, lists, dictionaries, strings) are pretty universal among programming languages. If you want to try web programming in JavaScript, or write desktop applications in C++, you'll find these same concepts come up.

What we're gonna do now is try to put together all of the basic Python you've learned into one coherent project. This way, you can get a feel for how all of these parts work together beyond the single-function exercises that you've been doing.

# Our project

What we're going to create is a Markov-Chain text generator. In effect, what this program is going to do is read a file with sentences from a particular source (a novel, song lyrics, speeches, etc.) and then generate random, usually nonsensical text in the style of this source material. The algorithm to do this is incredibly simple, but these programs can sometimes create quite sophisticated looking sentences! 

If you want an example of what Markov Chains can do, the website Reddit has a community (http://www.reddit.com/r/SubredditSimulator) where the title of every post, as well as every comment on the posts, are generated randomly from the text of other communities on Reddit. The results are often pretty funny, and sometimes seem surprisingly intelligent.

# Some examples

Since this is a much more difficult challenge than any of the previous exercises, I've provided a Python library with (working) versions of all the functions you'll have to implement in the challenge. Let's use this library to produce some model Markov-chain generated sentences. First, we'll load two text corpuses into two data structures. In this case, the corpuses are drawn from speeches given by Barack Obama and Donald Trump before they assumed the presidency. For this, we'll use the word_counts_from_file() function in the library. In this notebook, you'll implement this function and the various helper functions it uses on your own!

In [2]:
trump_counts = ref_impl.word_counts_from_file('text_corpuses/trump_speeches.txt')
obama_counts = ref_impl.word_counts_from_file('text_corpuses/obama_speeches.txt')

Now, using these data structures, we'll create some mock sentences using the patterns our program has "learned" from these texts. For this, we'll use the generate_sentence() function. You will also implement this at the end of the exercise!

In [7]:
ref_impl.generate_sentence(trump_counts, 50)

'CNN – I said I’ll tell you. I’m not my life. If anyone who I’m not being funded by the right thing. We have to explode. In fact that will tell you people we take a dumping ground for America, but have everything. They were, during one of our tunnels,'

In [5]:
ref_impl.generate_sentence(obama_counts, 50)

"war that the powers it and we're doing this Chamber right folks. There are Iraqis to anybody on the Rwandan genocide that we may not us? Who said by men and venture capital to find a more than the fact that the Foreign Minister of you were tough times. But"

If you run these functions again, you'll get a different sentence. As you'll notice, the sentences don't make a lot of sense and sometimes don't follow the rules of grammar very well. But they somehow work surprisingly well at capturing the tone of the speaker!

# How do Markov chain text generators work?

The basic idea behind Markov chain text generation is simple. Essentially, we first pick a word at random from the document that we've picked to generate our text. Let's say we pick the word "dinosaurs." We then find every word that follows the word "dinosaurs" in the document. In this case, we might see the phrases:

- "dinosaurs lived millions of years ago"
- "dinosaurs lived in a very different world"
- "dinosaurs gave rise to birds"

and so we find that the word "dinosaurs" is followed twice by the word "lived" and once by the word "gave." We would randomly choose the next word as either "lived" (with a 2/3 probability) or "gave" (with a 1/3 probability). Let's say we picked "gave." We might have the sentences

- "gave rise to birds"
- "gave me an expensive gift"
- "gave me a black eye"
- "gave this old-timer another shot"

And so we would now pick between "rise", "me" and "this" as the next word, with "me" twice as likely as either of the others. After picking one of these words, we would continue the chain until we've generated a pre-determined amount of text.
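To make the arithmetic concrete, here is a minimal sketch that counts which words follow "dinosaurs" in the three example phrases and turns the counts into probabilities. The variable names (`phrases`, `follower_counts`) are only illustrative, and scanning the text for one word like this is the slow approach, not the one we'll end up using.

```python
# The three example phrases that start with "dinosaurs"
phrases = [
    "dinosaurs lived millions of years ago",
    "dinosaurs lived in a very different world",
    "dinosaurs gave rise to birds",
]

# Count every word that directly follows "dinosaurs"
follower_counts = {}
for phrase in phrases:
    words = phrase.split()
    for i in range(len(words) - 1):
        if words[i] == "dinosaurs":
            next_word = words[i + 1]
            follower_counts[next_word] = follower_counts.get(next_word, 0) + 1

# Turn the counts into probabilities by dividing by the total
total = sum(follower_counts.values())
probabilities = {word: count / total for word, count in follower_counts.items()}

print(follower_counts)  # {'lived': 2, 'gave': 1}
print(probabilities)
```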

It would be very computationally expensive to re-scan the document every time that we picked a new word in order to find out what follows it. Thus, we're gonna optimize this procedure by first finding all of the pairs of words that follow each other, and then using this data to create sentences. Let's go through each of these steps in more detail:

# Step 1: Generate a data structure of word frequencies

This is actually the harder part of the problem!

Our goals here are to
1. Read a file
2. Use this file to populate our dictionary of words, and their following words
3. Normalize the word counts into probabilities

Let's talk about what this data structure will look like, and how we'll generate it. For this example, our text will consist solely of the sentence

```python
"I do what I think I want to do."
```

The unique words in this sentence that are followed by another word are "I", "do", "what", "want", "to", and "think". Our main data structure will be a dictionary where the keys are these unique words. The values will be the word counts for the words that follow our unique word in the sentence. So the first few entries in our dictionary will look like

```python
word_dict = {
    "I": {"do": 1, "think": 1, "want": 1},
    "do": {"what": 1},
    "what": {"I": 1},
    ...
}
```
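If nested dictionaries are new to you, this short sketch shows how to read values back out of a structure shaped like this one. It uses only the three entries shown above; nothing here is part of the functions you'll write.

```python
# The first few entries of the structure for
# "I do what I think I want to do."
word_dict = {
    "I": {"do": 1, "think": 1, "want": 1},
    "do": {"what": 1},
    "what": {"I": 1},
}

# The outer key is a word; the inner dictionary holds its followers
followers_of_I = word_dict["I"]
print(followers_of_I)           # {'do': 1, 'think': 1, 'want': 1}

# Chain two lookups to get a single count
print(word_dict["I"]["think"])  # 1

# Loop over the followers of a word and their counts
for next_word, count in word_dict["do"].items():
    print(next_word, count)     # what 1
```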

We're going to try to decompose this problem into a series of functions. Your job will be to fill in the bodies of these functions.

## Extract pairs of words from a string

One of the nice things about Markov chains is that we don't need to know anything about the context of a sentence as we fill out this data structure. In fact, all of the information we need to fill this data structure is encoded as pairs of words.

Let's start by writing a function that can extract pairs of words from a string. The string may contain multiple sentences - to generate more realistic text, we won't remove any punctuation or capitalization from these words. That way, we'll actually learn the words that are likely to end sentences, and the words that are likely to begin the subsequent sentence!

We'll return these words as a list of two-word tuples, where the first item is the first word, and the second item is the following word. There will be no pair starting with the last word of the string (we'll ignore it for now). Using our same sentence, this will produce

```python
[('I', 'do'), ('do', 'what'),
 ('what', 'I'), ('I', 'think'),
 ('think', 'I'), ('I', 'want'),
 ('want', 'to'), ('to', 'do.')]
```

In computational linguistics, pairs of adjacent words like these are usually called bigrams. This notebook calls them digraphs (di- meaning two), and the function names below keep that term. We'll call our function split_line_to_digraphs()

If the string contains fewer than 2 words, you should return an empty list []. Make sure to handle this case!

If you need some help writing this function, it might be helpful to look at the "largest number" example in data structures.
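Here are two building blocks that may help, shown on unrelated data so the exercise is still yours to solve. This is only a sketch of one possible approach; the reference implementation may do it differently.

```python
# split() with no arguments breaks a string on whitespace,
# so punctuation stays attached to its word
words = "Hello there, world.".split()
print(words)  # ['Hello', 'there,', 'world.']

# Pair each item of a list with the one after it.
# The loop stops one short of the end because the last item has no successor.
numbers = [10, 20, 30, 40]
pairs = []
for i in range(len(numbers) - 1):
    pairs.append((numbers[i], numbers[i + 1]))
print(pairs)  # [(10, 20), (20, 30), (30, 40)]

# With a single item, range(0) is empty, so no pairs are produced
single = [10]
single_pairs = [(single[i], single[i + 1]) for i in range(len(single) - 1)]
print(single_pairs)  # []
```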

In [3]:
def split_line_to_digraphs(line):
    """
    Given a string of words, returns a list of tuples 
    containing digraphs from the sentence
    """
    return []

In [4]:
test_digraphs = split_line_to_digraphs("I do what I think I want to do.")

In [71]:
seuss_output = [('I', 'do'), 
 ('do', 'not'),
 ('not', 'like'),
 ('like', 'green'),
 ('green', 'eggs'),
 ('eggs', 'and'),
 ('and', 'Ham.')]

blink_182_output = [('All', 'the,'),
 ('the,', 'Small'),
 ('Small', 'things,'),
 ('things,', 'True'),
 ('True', 'care,'),
 ('care,', 'Truth'),
 ('Truth', 'brings')]


class DigraphTest(unittest.TestCase):
    def test_dr_seuss(self):
        self.assertEqual(split_line_to_digraphs("I do not like green eggs and Ham."), seuss_output)
    def test_blink_182(self):
        self.assertEqual(split_line_to_digraphs("All the, Small things, True care, Truth brings"), blink_182_output)
    def test_single_word(self):
        self.assertEqual(split_line_to_digraphs('word'), [])

suite = unittest.TestLoader().loadTestsFromTestCase(DigraphTest)
unittest.TextTestRunner(verbosity=2).run(suite)

test_blink_182 (__main__.DigraphTest) ... ok
test_dr_seuss (__main__.DigraphTest) ... ok
test_single_word (__main__.DigraphTest) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK


<unittest.runner.TextTestResult run=3 errors=0 failures=0>

## Convert the digraph lists to a dictionary structure

Our next step is to transform this list of digraphs into a nested dictionary structure with word counts, like the structure we described above (the example with word_dict).

This function will be a little tricky, because it involves keeping track of dictionaries nested within another dictionary. Take it slowly, and think carefully about the intermediate steps of the function. If you need to, you can try putting print() statements in the function at intermediate points to see how the variables look.

For our sample sentence

```python
"I do what I think I want to do."
```

The output will be 

```python
{'I': {'do': 1, 'think': 1, 'want': 1},
 'do': {'what': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
```

A few things to keep in mind:
1. Make sure that you test whether the first word in the digraph is already one of the unique words in the outer dictionary
2. Make sure to test if the second word is already in the inner dictionary. If it isn't - think about how you would add it and what its value would be

Remember that to add one to an existing value in Python, you can use the shorthand

```python
value += 1 
```

instead of 

```python
value = value + 1 
```
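The two checks in the list above follow a common pattern for counting into a nested dictionary. Here it is sketched on made-up weather observations rather than word pairs; the names `observations` and `weather_counts` are purely illustrative.

```python
# (day, weather) pairs, invented for this example
observations = [("Mon", "rain"), ("Mon", "sun"), ("Tue", "rain"), ("Mon", "rain")]

weather_counts = {}
for day, weather in observations:
    # 1. Make sure the outer key has an inner dictionary
    if day not in weather_counts:
        weather_counts[day] = {}
    # 2. Start the inner count at 1, or add one to the existing count
    if weather not in weather_counts[day]:
        weather_counts[day][weather] = 1
    else:
        weather_counts[day][weather] += 1

print(weather_counts)  # {'Mon': {'rain': 2, 'sun': 1}, 'Tue': {'rain': 1}}
```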

In [78]:
def convert_digraphs_to_dict(digraphs):
    return {}


In [79]:
seuss_dict = {
     'I': {'do': 1},
     'and': {'Ham.': 1},
     'do': {'not': 1},
     'eggs': {'and': 1},
     'green': {'eggs': 1},
     'like': {'green': 1},
     'not': {'like': 1}
}

test_sentence_dict = {
    'I': {'do': 1, 'think': 1, 'want': 1},
    'do': {'what': 1},
    'think': {'I': 1},
    'to': {'do.': 1},
    'want': {'to': 1},
    'what': {'I': 1}
}

class GraphTests(unittest.TestCase):
    def test_seuss_sentence(self):
        self.assertEqual(convert_digraphs_to_dict(seuss_output), seuss_dict)
    def test_test_sentence(self):
        self.assertEqual(convert_digraphs_to_dict(test_digraphs), test_sentence_dict)
        

suite = unittest.TestLoader().loadTestsFromTestCase(GraphTests)
unittest.TextTestRunner(verbosity=2).run(suite)

test_seuss_sentence (__main__.GraphTests) ... FAIL
test_test_sentence (__main__.GraphTests) ... FAIL

FAIL: test_seuss_sentence (__main__.GraphTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-79-7c028ca3d7fa>", line 22, in test_seuss_sentence
    self.assertEqual(convert_digraphs_to_dict(seuss_output), seuss_dict)
AssertionError: {} != {'like': {'green': 1}, 'green': {'eggs': 1[91 chars]: 1}}
- {}
+ {'I': {'do': 1},
+  'and': {'Ham.': 1},
+  'do': {'not': 1},
+  'eggs': {'and': 1},
+  'green': {'eggs': 1},
+  'like': {'green': 1},
+  'not': {'like': 1}}

FAIL: test_test_sentence (__main__.GraphTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-79-7c028ca3d7fa>", line 24, in test_test_sentence
    self.assertEqual(convert_digraphs_to_dict(test_digraphs), test_sentence_dict)
AssertionError: {} != {'want': {'to': 1}, 't

<unittest.runner.TextTestResult run=2 errors=0 failures=2>

## Allow for updating the structure with new data

We likely will not be processing our file of interest as one big line. Instead, we'll go through it line by line, and update a main dictionary with the data from each line. We've already written functions to turn a string into digraphs, and convert those digraphs into a word frequency dictionary. Our next step is to write a function that can merge two frequency dictionaries together.

Basically, we want to go word by word through each dictionary. When the dictionaries have a word in common, we want to merge their frequency counts (the inner dictionaries) for the following words, so that
1. If a following word only appears in one frequency dictionary, it keeps its count
2. If a following word appears in both frequency dictionaries, its count becomes the sum of the two counts

If a word in the outer dictionary appears in only one of the two dictionaries, we just want to add it to the merged dictionary as it is.

If this merge process for the inner dictionaries seems familiar, it's because you already solved it in the data structures lesson! As coders, we want to be as lazy as possible. The best code is the code that you (or someone else) already wrote! Let's copy your solution below, and then write a short function to handle the merge of the outer dictionaries together. If you don't have your solution for the inner dictionary merge, don't worry. We've provided a sample solution below.

Here are sample inputs for the merge_outer_dictionaries function, and the expected output

```python
input_1 = {'I': {'do': 1},
 'and': {'Ham.': 1},
 'do': {'not': 1},
 'eggs': {'and': 1},
 'green': {'eggs': 1},
 'like': {'green': 1},
 'not': {'like': 1}}

input_2 = {'I': {'do': 1, 'think': 1, 'want': 1},
 'do': {'what': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
 
merge_outer_dictionaries(input_1, input_2)
```

Output:

```python
{'I': {'do': 2, 'think': 1, 'want': 1},
 'and': {'Ham.': 1},
 'do': {'not': 1, 'what': 1},
 'eggs': {'and': 1},
 'green': {'eggs': 1},
 'like': {'green': 1},
 'not': {'like': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
```

If you get stuck, the solution for the outer dictionaries is similar, but not identical, to the solution for the inner dictionaries.
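One way to think about the outer merge is to sort the outer words into three groups first. The sketch below does only that sorting, on a trimmed-down pair of dictionaries, and leaves the actual merging to you. The variable names are illustrative.

```python
old_words = {"I": {"do": 1}, "and": {"Ham.": 1}}
new_words = {"I": {"do": 1, "think": 1}, "to": {"do.": 1}}

# Words in both dictionaries: their inner dictionaries need to be merged
shared = [word for word in old_words if word in new_words]

# Words in only one dictionary: their inner dictionaries can be copied as-is
only_old = [word for word in old_words if word not in new_words]
only_new = [word for word in new_words if word not in old_words]

print(shared)    # ['I']
print(only_old)  # ['and']
print(only_new)  # ['to']
```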

In [8]:
def merge_and_sum_dictionaries(counts_1, counts_2):
    combined_counts = {} 
    
    # Copy contents of first dictionary to combined counts
    for word, count in counts_1.items():
        combined_counts[word] = count
    
    for word, count in counts_2.items():
        if word in combined_counts:
            combined_counts[word] += count
        else:
            combined_counts[word] = count
    return combined_counts

def merge_outer_dictionaries(old_words, new_words):
    """
    """
    merged_words = {}
    return merged_words


In [82]:
output = {'I': {'do': 2, 'think': 1, 'want': 1},
 'and': {'Ham.': 1},
 'do': {'not': 1, 'what': 1},
 'eggs': {'and': 1},
 'green': {'eggs': 1},
 'like': {'green': 1},
 'not': {'like': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}

class MergeTest(unittest.TestCase):
    def test_merge(self):
        self.assertEqual(merge_outer_dictionaries(seuss_dict, test_sentence_dict), output)

suite = unittest.TestLoader().loadTestsFromTestCase(MergeTest)
unittest.TextTestRunner(verbosity=2).run(suite)

test_merge (__main__.MergeTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

## Normalize a count dictionary

One last thing: after we've counted all of the words in the text, we want to normalize the counts to get the probability of all the different words that might follow our word of interest. Let's write a simple function to take a dictionary of counts for string values, and normalize the counts by dividing each one by the sum of all of the counts.

An example input and the expected output for this function are:

```python
words = {'Hello': 2, 'World':3}
normalize_counts(words)
```

Output:

```python
{'Hello': 0.4, 'World': 0.6}
```

One hint: Python has a built-in sum() function that will calculate the sum of a list of numbers. Do you know how to get a list of all the count values of a count dictionary?
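One detail worth checking before you divide: the kind of division you get. The first cell of this notebook imports `division` from `__future__`, so `/` gives a decimal result even on Python 2. The sketch below shows the difference using the numbers from the example above; it assumes Python 3 or that import.

```python
count, total = 2, 5

# True division keeps the fractional part
print(count / total)   # 0.4

# Floor division throws it away, which would make every probability 0
print(count // total)  # 0

# A correctly normalized dictionary should sum to (approximately) 1
normalized = {"Hello": 2 / 5, "World": 3 / 5}
total_probability = sum(normalized.values())
print(total_probability)
```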

In [9]:
def normalize_counts(count_dictionary):
    return {}

In [84]:
class NormalizationTest(unittest.TestCase):
    def test_two_words(self):
        self.assertEqual(normalize_counts({'Hello': 2, 'World':3}), {'Hello': 0.4, 'World': 0.6})
    def test_four_words(self):
        self.assertEqual(normalize_counts({'These': 2, 'Are':3, 'Seperate': 5, 'Words':10}),
                         {'These': 0.1, 'Are':0.15, 'Seperate': 0.25, 'Words':0.5})

suite = unittest.TestLoader().loadTestsFromTestCase(NormalizationTest)
unittest.TextTestRunner(verbosity=2).run(suite)

test_four_words (__main__.NormalizationTest) ... ok
test_two_words (__main__.NormalizationTest) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.018s

OK


<unittest.runner.TextTestResult run=2 errors=0 failures=0>

# Read the word data from a file, and generate a dictionary of words

For this function, you will take the name of a file as an input, and return a dictionary with the full normalized counts for the entire text of the file. Make sure not to normalize until you have all of the counts!
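To see why the order matters, here is a small sketch comparing "merge, then normalize" with "normalize, then merge" for the followers of "I" on two lines. The helpers `normalize` and `add_dicts` are hypothetical stand-ins written just for this comparison, not the functions from this notebook.

```python
# Follower counts for "I" from two separate lines of text
line_1 = {"do": 1}
line_2 = {"do": 1, "think": 1, "want": 1}

def normalize(counts):
    # Hypothetical helper: divide each count by the total
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def add_dicts(a, b):
    # Hypothetical helper: add values key by key
    combined = dict(a)
    for word, value in b.items():
        combined[word] = combined.get(word, 0) + value
    return combined

# Correct order: merge the raw counts, then normalize once
merged_then_normalized = normalize(add_dicts(line_1, line_2))
print(merged_then_normalized)  # {'do': 0.5, 'think': 0.25, 'want': 0.25}

# Wrong order: normalize each line, then add
normalized_then_merged = add_dicts(normalize(line_1), normalize(line_2))
wrong_total = sum(normalized_then_merged.values())
print(wrong_total)  # about 2, so these are no longer probabilities
```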

If you think you know how to combine the functions above to do this, you're welcome to try! Otherwise you can follow this outline:

1. Create an output dictionary to hold all of the word counts
2. Open the file
3. For every line in the file
    1. Split the line into digraphs
    2. Use the digraphs to create a count dictionary for the line
    3. Update the output dictionary's counts with the line dictionary
4. For every entry in the output dictionary
    1. Normalize the entry
5. Return the output dictionary
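Steps 2 and 3 of the outline rely on reading a file one line at a time. In case you need a refresher, here is a minimal sketch of that part only. It writes its own temporary two-line file so that it runs anywhere; the file contents are made up for the example.

```python
import os
import tempfile

# Write a small two-line file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("I do not like green eggs and Ham.\n")
    handle.write("I do what I think I want to do.\n")
    path = handle.name

# Read it back one line at a time
lines_read = []
with open(path) as text_file:
    for line in text_file:
        # Each line ends with a newline character; strip() removes it
        lines_read.append(line.strip())

os.remove(path)  # clean up the temporary file
print(lines_read)
```

Inside the loop is where each line would be split into digraphs and merged into the output dictionary.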

This function will rely on all of the functions that you wrote above. If one of your implementations didn't work, feel free to use the function from the ref_impl library. Instead of calling 
```python
merge_outer_dictionaries()
```
you would call 
```python
ref_impl.merge_outer_dictionaries()
```

In [11]:
def word_counts_from_file(filename):
    return {}

In [86]:
ref_impl.word_counts_from_file('text_corpuses/small.txt')

{'I': {'do': 0.5, 'think': 0.25, 'want': 0.25},
 'and': {'Ham.': 1.0},
 'do': {'not': 0.5, 'what': 0.5},
 'eggs': {'and': 1.0},
 'green': {'eggs': 1.0},
 'like': {'green': 1.0},
 'not': {'like': 1.0},
 'think': {'I': 1.0},
 'to': {'do.': 1.0},
 'want': {'to': 1.0},
 'what': {'I': 1.0}}

In [94]:
sample_output = {
    'I': {'do': 0.5, 'think': 0.25, 'want': 0.25},
    'and': {'Ham.': 1.0},
    'do': {'not': 0.5, 'what': 0.5},
    'eggs': {'and': 1.0},
    'green': {'eggs': 1.0},
    'like': {'green': 1.0},
    'not': {'like': 1.0},
    'think': {'I': 1.0},
    'to': {'do.': 1.0},
    'want': {'to': 1.0},
    'what': {'I': 1.0}
}

class ReadFileTest(unittest.TestCase):
    def test_small_corpus(self):
        output = word_counts_from_file('text_corpuses/small.txt')
        self.assertEqual(output, sample_output)
    
suite = unittest.TestLoader().loadTestsFromTestCase(ReadFileTest)
unittest.TextTestRunner(verbosity=2).run(suite)

test_small_corpus (__main__.ReadFileTest) ... FAIL

FAIL: test_small_corpus (__main__.ReadFileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-94-f9dd4f70cfb6>", line 18, in test_small_corpus
    self.assertEqual(output, sample_output)
AssertionError: {} != {'want': {'to': 1.0}, 'to': {'do.': 1.0}, [229 chars]1.0}}
- {}
+ {'I': {'do': 0.5, 'think': 0.25, 'want': 0.25},
+  'and': {'Ham.': 1.0},
+  'do': {'not': 0.5, 'what': 0.5},
+  'eggs': {'and': 1.0},
+  'green': {'eggs': 1.0},
+  'like': {'green': 1.0},
+  'not': {'like': 1.0},
+  'think': {'I': 1.0},
+  'to': {'do.': 1.0},
+  'want': {'to': 1.0},
+  'what': {'I': 1.0}}

----------------------------------------------------------------------
Ran 1 test in 0.004s

FAILED (failures=1)


<unittest.runner.TextTestResult run=1 errors=0 failures=1>

# Load speech data

Using the functions you've defined above, we're now going to load in the data from the Trump and Obama speeches that we have saved.

In [36]:
trump_counts = word_counts_from_file('text_corpuses/trump_speeches.txt')
obama_counts = word_counts_from_file('text_corpuses/obama_speeches.txt')

# Generate random sentences

Now for the easy/rewarding part! With this data of the most frequent words that follow each word, we can easily generate some fake sentences!

Let's create a function that will take a count dictionary and a number of words as inputs, and output a sentence based on the word frequencies. To do this, we'll need one capability that isn't available in base Python: we need to be able to generate random numbers, to select random words from the dictionary. To do this, we'll import the *random* module below

In [37]:
import random

Now the random module has a function called random.random(), which generates a random floating-point number between 0 and 1 (0 is a possible value, 1 is not). Try running the line below a couple of times - you'll see that it changes values every time you run it

In [38]:
random.random()

0.5184939298387877

There's also a function in the random module called random.choice, which randomly picks an element from a list.

In [39]:
random.choice(['a', 'b', 'c', 'd'])

'a'

# Select a random word using weighted probabilities

We're gonna use the following algorithm to select a random word with the probability that it would normally occur in the text.

There are probably a few good ways to do this. I thought about it for a little and came up with the following algorithm, but it's probably not the most efficient/best way to do it. If you have a better idea, email me!

1. Generate a random number between zero and 1 - let's call this the cutoff probability
2. Create a variable to store a running total probability
3. Iterate through the possible words and their probabilities. For each word, add the individual probability of that word to the running total probability.
4. When the running total probability becomes higher than the cutoff probability, return the current word
5. If this doesn't happen, return the last word that we iterated over
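To trace steps 2 through 5 by hand, it helps to fix the cutoff instead of drawing it at random. The sketch below does that with a hypothetical helper, `pick_with_cutoff`, using the followers of "gave" from the earlier example (1 "rise", 2 "me", 1 "this"). In your function, the cutoff would come from random.random() instead. It assumes Python 3.7+, where dictionaries keep their insertion order.

```python
def pick_with_cutoff(frequency_dict, cutoff):
    # Hypothetical helper: steps 2-5, with the cutoff passed in
    running_total = 0.0
    word = None
    for word, probability in frequency_dict.items():
        running_total += probability
        if running_total > cutoff:
            return word
    # Step 5: rounding error can leave the total just under the cutoff,
    # so fall back to the last word we saw
    return word

followers_of_gave = {"rise": 0.25, "me": 0.5, "this": 0.25}

# Running totals are 0.25, 0.75, 1.0
print(pick_with_cutoff(followers_of_gave, 0.1))  # rise
print(pick_with_cutoff(followers_of_gave, 0.6))  # me
print(pick_with_cutoff(followers_of_gave, 0.9))  # this
```

Cutoffs from 0.25 up to 0.75 all land on "me", which is why it comes up half of the time.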

Note that since this is a random function, the test that we're running just checks that the output is reasonably close to the expected outcome. There's always a possibility that a correct implementation fails by chance, so if your function narrowly fails the test, try running it again.

In [51]:
def random_choice_from_frequency_dict(frequency_dict):
    return ""

In [108]:
frequencies = {'These': 0.1, 'Are':0.15, 'Seperate': 0.25, 'Words':0.5}

class TestRandomness(unittest.TestCase):
    def test_within_boundaries(self):
        counts = {}
        for i in range(10000):
            word = random_choice_from_frequency_dict(frequencies)
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
        ideal_values = {word: int(freq * 10000) 
                        for word, freq 
                        in frequencies.items()}
        max_deviation = max(abs(counts[word] - ideal_values[word]) 
                            for word in frequencies)
        error_message = 'Your counts deviated too much from the ideal values!\nYour counts:{}, ideal counts:{}'
        error_message = error_message.format(counts, ideal_values)
        self.assertLess(max_deviation, 100, error_message)
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestRandomness)
unittest.TextTestRunner(verbosity=2).run(suite)

test_within_boundaries (__main__.TestRandomness) ... ERROR

ERROR: test_within_boundaries (__main__.TestRandomness)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-108-db56fc9622c0>", line 16, in test_within_boundaries
    for word in frequencies)
  File "<ipython-input-108-db56fc9622c0>", line 16, in <genexpr>
    for word in frequencies)
KeyError: 'These'

----------------------------------------------------------------------
Ran 1 test in 0.029s

FAILED (errors=1)


<unittest.runner.TextTestResult run=1 errors=1 failures=0>

Finally it's time to use the data structure we've created and this random word function to generate a sentence. Can you do it? Here's a sketch of how your function will work:

1. Make a list to hold the words in the sentence
2. Pick a random word from the data structure's keys. Store it as current_word and add it to the list.
3. Loop over the number of words we want to generate
    - Use the current_word, the data structure and our random_choice_from_frequency_dict() function to pick the next word
    - Add the next word to the list
    - Set the next word as the current_word for the next loop
4. Return the list, joined with spaces into a sentence

Two hints that might be useful for this function
1. To pick a random element from a list (for the starting word) you can use the random.choice() function
2. To do a loop a certain number of times, you can use a for loop with the range() function. We'll talk more about range() in the future, but for now you can treat range(n) as returning a list of integers from 0 to n - 1. So to do something 50 times, you would do:
```python
for i in range(50):
    do_something()
    # i is an "index" variable so it will hold the current count
```
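Two small pieces of the sketch above are easy to trip over, so here they are in isolation: picking a starting word from the dictionary's keys, and joining a list of words into one string. The tiny `word_dict` is invented for the example.

```python
import random

word_dict = {"I": {"do": 1.0}, "do": {"what": 1.0}, "what": {"I": 1.0}}

# random.choice() needs a sequence it can index into,
# so convert the keys to a list first
start_word = random.choice(list(word_dict.keys()))
print(start_word)

# join() glues a list of words together with the separator between them
words = ["I", "do", "what"]
sentence = " ".join(words)
print(sentence)  # I do what
```

One thing to watch for: if the current word is not a key in the dictionary (the very last word of a text has no follower, for instance), looking it up raises a KeyError, so your function may need a fallback such as picking a new random starting word.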

In [52]:
def generate_sentence(word_dict, number_words):
    return ""

In [55]:
ref_impl.generate_sentence(trump_counts, number_words=100)

'lowered, they had a lot of them, "We can’t go on the way, I HAVE THE BEGINNING, AND I\'LL TELL YOU ARE BACK THE COVER -- the single dollar can keep getting ready for somebody slipped — "Oh, you know this election that’s the worst trade deals like that. We’re going to tell them in the table. Say a journey and in certain people are coming in big investments in, but, guess six, seven, eight or our soil. He said 49% to be very exciting. We can’t get any more. But ’17 – and it’s $2.5 billion and people away.'

In [48]:
generate_sentence(obama_counts, number_words=100)

"discussion, at the Administration's strategy cannot live in this front. For the site of the White House will become Americans, this amnesty there were out there will be held a national commitment necessary to us, and many things. They are different result. This is right, in 1899 and torture in the right path, they don't think many times. But I hear about religious people. I realize that may not wasting one of not to be otherwise is about. Not just revolve around their kids can't even a man's prerogative, it takes to beef up their own science and months ahead."