In [1]:
from __future__ import division, print_function, unicode_literals

# Congrats! 🎉

Believe it or not, you've learned a huge chunk of the Python language, and have all the basic skills you need to move on to learning the scientific libraries and start playing with data! We'll start doing this in the next set of exercises. There's one chapter after this one that you should complete, which will deal with some more advanced language features that are extremely useful, and make Python a more fun and expressive language to code in. For now though, you should feel accomplished that you've made it this far!

The concepts and skills that you've learned (the if statement, for loops, lists, dictionaries, strings) are pretty universal among programming languages. If you want to try web programming in JavaScript, or write desktop applications in C++, you'll find these same concepts come up.

What we're gonna do now is try to put together all of the basic Python you've learned into one coherent project. This way, you can get a feel for how all of these parts work together beyond the single-function exercises that you've been doing.

# Our project

What we're going to create is a Markov-Chain text generator. In effect, what this program is going to do is read a file with sentences from a particular source (a novel, song lyrics, speeches, etc.) and then generate random, usually nonsensical text in the style of this source material. The algorithm to do this is incredibly simple, but these programs can sometimes create quite sophisticated looking sentences! 

If you want an example of what Markov Chains can do, the website Reddit has a community (http://www.reddit.com/r/SubredditSimulator) where the title of every post, as well as every comment on the posts, is generated randomly from the text of other communities on Reddit. The results are often pretty funny, and sometimes seem surprisingly intelligent.

We're gonna quickly go through what a Markov Chain is, how we can use it to generate text, and then talk about our strategy for implementing one.

# Fill in details of what a Markov Chain is

# Fill in details of the implementation

# Step 1: Generate a data structure of word frequencies

This is actually the harder part of the problem!

Our goals here are to
1. Read a file
2. Use this file to populate our dictionary of words, and their following words
3. Normalize the word counts into probabilities

Let's talk about what this data structure will look like, and how we'll generate it. For this example, our text will consist solely of the sentence

```python
"I do what I think I want to do."
```

The unique words in this sentence are "I", "do", "what", "want", "to", "think". Our main data structure will be a dictionary where the keys are these unique words. The values will be the word counts for the words that follow our unique word in the sentence. So the first few entries in our dictionary will look like

```python
word_dict = {
    "I": {"do": 1, "think": 1, "want": 1},
    "do": {"what": 1},
    "what": {"I": 1},
    ...
}
```
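
To make the nesting concrete, here is a small sketch of how such a dictionary can be read. It uses the complete dictionary for the sample sentence, and the variable name `followers_of_I` is only illustrative.

```python
# Complete nested dictionary for "I do what I think I want to do."
word_dict = {
    "I": {"do": 1, "think": 1, "want": 1},
    "do": {"what": 1},
    "what": {"I": 1},
    "think": {"I": 1},
    "want": {"to": 1},
    "to": {"do.": 1},
}

# The outer key is a word; the inner dictionary counts the words that follow it
followers_of_I = word_dict["I"]
print(sorted(followers_of_I))      # ['do', 'think', 'want']

# How many times does "think" follow "I"?
print(word_dict["I"]["think"])     # 1

# "do." (with the period) only ends the sentence, so it is never an outer key
print("do." in word_dict)          # False
```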

We're going to try to decompose this problem into a series of functions. Your job will be to fill in the bodies of these functions.

## Extract pairs of words from a string

One of the nice things about Markov chains is that we don't need to know anything about the context of a sentence as we fill out this data structure. In fact, all of the information we need to fill this data structure is encoded as pairs of words.

Let's start by writing a function that can extract pairs of words from a string. The string may contain multiple sentences - to generate more realistic text, we won't remove any punctuation or capitalization from these words. That way, we'll actually learn the words that are likely to end sentences, and the words that are likely to begin the subsequent sentence!

We'll return these words as a list of two-word tuples, where the first item is the first word, and the second item is the following word. There will be no pair for the last word of the string (we'll ignore it for now). Using our same sentence, this will produce

```python
[('I', 'do'), ('do', 'what'),
 ('what', 'I'), ('I', 'think'),
 ('think', 'I'), ('I', 'want'),
 ('want', 'to'), ('to', 'do.')]
```

In computational linguistics, pairs of adjacent words like these are usually called bigrams (strictly speaking, a digraph is a pair of letters). We'll stick with the name digraph in this lesson, and call our function split_line_to_digraphs().

If the string contains fewer than 2 words, you should return an empty list []. Make sure to handle this case!

If you need some help writing this function, it might be helpful to look at the "largest number" example in data structures.

In [11]:
def split_line_to_digraphs(line):
    """
    Given a string of words, returns a list of tuples 
    containing digraphs from the sentence
    """
    return []

def split_line_to_digraphs(line):
    words = line.split()
    
    # Ignore strings with fewer than two words
    if len(words) < 2:
        return []
    
    digraphs = []
    
    current_word = words[0]
    for next_word in words[1:]:
        digraphs.append((current_word, next_word))
        current_word = next_word
        
    return digraphs
    

In [21]:
test_digraphs = split_line_to_digraphs("I do what I think I want to do.")

In [14]:
seuss_output = [('I', 'do'), 
 ('do', 'not'),
 ('not', 'like'),
 ('like', 'green'),
 ('green', 'eggs'),
 ('eggs', 'and'),
 ('and', 'Ham.')]

blink_182_output = [('All', 'the,'),
 ('the,', 'Small'),
 ('Small', 'things,'),
 ('things,', 'True'),
 ('True', 'care,'),
 ('care,', 'Truth'),
 ('Truth', 'brings')]

assert split_line_to_digraphs("I do not like green eggs and Ham.") == seuss_output
assert split_line_to_digraphs("Hello") == []
assert split_line_to_digraphs("All the, Small things, True care, Truth brings") == blink_182_output
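
As a side note, the same pairs can be built more compactly with Python's built-in `zip()`. This is only a sketch of an alternative, not the solution this lesson expects, and the function name `split_line_to_digraphs_zip` is made up for illustration.

```python
def split_line_to_digraphs_zip(line):
    """Hypothetical alternative: pair each word with the word after it."""
    words = line.split()
    # zip stops at the shorter sequence, so a line with zero or one
    # words naturally produces an empty list
    return list(zip(words, words[1:]))

pairs = split_line_to_digraphs_zip("I do what I think I want to do.")
print(pairs[:3])                             # [('I', 'do'), ('do', 'what'), ('what', 'I')]
print(split_line_to_digraphs_zip("Hello"))   # []
```

Because `words[1:]` is one item shorter than `words`, the last word never gets a pair of its own, which matches the behavior described above.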

## Convert the digraph lists to a dictionary structure

Our next step is to transform this list of digraphs into a nested dictionary structure with word counts, like the structure we described above (the example with word_dict).

This function will be a little tricky, because it involves keeping track of dictionaries nested within another dictionary. Take it slowly, and think carefully about the intermediate steps of the function. If you need to, you can try putting print() statements in the function at intermediate points to see how the variables look.

For our sample sentence

```python
"I do what I think I want to do."
```

The output will be 

```python
{'I': {'do': 1, 'think': 1, 'want': 1},
 'do': {'what': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
```

A few things to keep in mind:
1. Make sure that you test whether the first word in the digraph is already one of the unique words in the outer dictionary
2. Make sure to test if the second word is already in the inner dictionary. If it isn't - think about how you would add it and what its value would be

Remember that to add one to an existing value in Python, you can use the shorthand

```python
value += 1 
```

instead of 

```python
value = value + 1 
```

In [18]:
def convert_digraphs_to_dict(digraphs):
    return {}

def convert_digraphs_to_dict(digraphs):
    words_with_frequencies = {}
    for first_word, second_word in digraphs:
        if first_word not in words_with_frequencies:
            words_with_frequencies[first_word] = {second_word: 1}
        elif second_word not in words_with_frequencies[first_word]:
            words_with_frequencies[first_word][second_word] = 1
        else:
            words_with_frequencies[first_word][second_word] += 1
    return words_with_frequencies 

In [22]:
convert_digraphs_to_dict(test_digraphs)

{'I': {'do': 1, 'think': 1, 'want': 1},
 'do': {'what': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
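
If the two membership tests feel like a lot of bookkeeping, the standard library's `collections.defaultdict` can do them for us. The sketch below assumes Python 3, and `convert_digraphs_to_dict_compact` is a hypothetical name; writing the explicit version yourself first is still the better exercise.

```python
from collections import defaultdict

def convert_digraphs_to_dict_compact(digraphs):
    """Hypothetical alternative using defaultdict for both levels."""
    # A missing outer key gets a new inner dict; a missing inner key starts at 0
    counts = defaultdict(lambda: defaultdict(int))
    for first_word, second_word in digraphs:
        counts[first_word][second_word] += 1
    # Convert back to plain dictionaries so the result compares and prints normally
    return {word: dict(followers) for word, followers in counts.items()}

sample = [("I", "do"), ("do", "what"), ("what", "I"), ("I", "do")]
result = convert_digraphs_to_dict_compact(sample)
print(result["I"])   # {'do': 2}
```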

## Allow for updating the structure with new data

We likely will not be processing our file of interest as one big line. Instead, we'll go through it line by line, and update a main dictionary with the data from each line. We've already written functions to turn a string into digraphs, and convert those digraphs into a word frequency dictionary. Our next step is to write a function that can merge two frequency dictionaries together.

Basically, we want to go word by word through each dictionary. When the dictionaries have a word in common, we want to merge their frequency counts (the inner dictionaries) for the following words, so that
1. If a following word only appears in one frequency dictionary, it keeps its count
2. If a following word appears in both frequency dictionaries, its count becomes the sum of the two counts

If a word in the outer dictionary appears in only one of the two dictionaries, we just want to add it to the merged dictionary as it is.

If this merge process for the inner dictionaries seems familiar, it's because you already solved it in the data structures lesson! As coders, we want to be as lazy as possible. The best code is the code that you (or someone else) already wrote! Let's copy your solution below, and then write a short function to handle the merge of the outer dictionaries together. If you don't have your solution for the inner dictionary merge, don't worry. We've provided a sample solution below.

Here are sample inputs for the merge_outer_dictionaries function, and the expected output

```python
input_1 = {'I': {'do': 1},
 'and': {'Ham.': 1},
 'do': {'not': 1},
 'eggs': {'and': 1},
 'green': {'eggs': 1},
 'like': {'green': 1},
 'not': {'like': 1}}

input_2 = {'I': {'do': 1, 'think': 1, 'want': 1},
 'do': {'what': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
 
merge_outer_dictionaries(input_1, input_2)
```

Output:

```python
{'I': {'do': 2, 'think': 1, 'want': 1},
 'and': {'Ham.': 1},
 'do': {'not': 1, 'what': 1},
 'eggs': {'and': 1},
 'green': {'eggs': 1},
 'like': {'green': 1},
 'not': {'like': 1},
 'think': {'I': 1},
 'to': {'do.': 1},
 'want': {'to': 1},
 'what': {'I': 1}}
```

If you get stuck, the solution for the outer dictionaries is similar, but not identical, to the solution for the inner dictionaries.

In [28]:
def merge_and_sum_dictionaries(counts_1, counts_2):
    combined_counts = {} 
    
    # Copy contents of first dictionary to combined counts
    for word, count in counts_1.items():
        combined_counts[word] = count
    
    for word, count in counts_2.items():
        if word in combined_counts:
            combined_counts[word] += count
        else:
            combined_counts[word] = count
    return combined_counts

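def merge_outer_dictionaries(old_words, new_words):
    """
    Given two nested word frequency dictionaries, returns a new 
    dictionary where the counts for shared words are summed
    """
    merged_words = {}
    return merged_words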

def merge_outer_dictionaries(old_words, new_words):
    merged_words = {}
    for word, counts_dict in old_words.items():
        merged_words[word] = counts_dict
    for word, counts_dict in new_words.items():
        if word in merged_words:
            merged_words[word] = merge_and_sum_dictionaries(merged_words[word], counts_dict)
        else:
            merged_words[word] = counts_dict
    return merged_words
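
For comparison, the inner merge can also be sketched with `collections.Counter` from the standard library, which supports adding two sets of counts directly. The variable names here are illustrative, and this is an alternative to `merge_and_sum_dictionaries`, not a replacement the lesson requires.

```python
from collections import Counter

counts_1 = {"do": 1}
counts_2 = {"do": 1, "think": 1, "want": 1}

# Adding two Counters sums the counts of shared keys and keeps the rest
combined = dict(Counter(counts_1) + Counter(counts_2))
print(combined["do"])   # 2
```

One caveat: Counter addition drops entries whose total is zero or negative. That never happens with our word counts, which are always at least 1.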

## Normalize a count dictionary

One last thing: after we've counted all of the words in the text, we want to normalize the counts to get the probability of all the different words that might follow our word of interest. Let's write a simple function to take a dictionary of counts for string values, and normalize the counts by dividing each one by the sum of all of the counts.

A pair of expected input and output for this function is:

```python
words = {'Hello': 2, 'World':3}
normalize_counts(words)
```

Output:

```python
{'Hello': 0.4, 'World': 0.6}
```

One hint: Python has a built-in sum() function that will calculate the sum of a list of numbers. Do you know how to get a list of all the count values of a count dictionary?

In [36]:
def normalize_counts(count_dictionary):
    return {}

def normalize_counts(count_dictionary):
    total = sum(count_dictionary.values())
    output_dict = {}
    for word, count in count_dictionary.items():
        output_dict[word] = count/total
    return output_dict

In [42]:
assert normalize_counts({'Hello': 2, 'World':3}) == {'Hello': 0.4, 'World': 0.6}
assert normalize_counts({'These': 2, 'Are':3, 'Seperate': 5, 'Words':10}) == {'These': 0.1, 'Are':0.15, 'Seperate': 0.25, 'Words':0.5}
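
A useful sanity check is that the probabilities for each word add up to 1. The sketch below assumes Python 3 (where `/` is true division) and uses a hypothetical compact version of the function. It compares with `math.isclose` because floating point sums are not always exactly 1.0.

```python
import math

def normalize_counts_compact(count_dictionary):
    """Hypothetical compact version using a dictionary comprehension."""
    total = sum(count_dictionary.values())
    return {word: count / total for word, count in count_dictionary.items()}

probabilities = normalize_counts_compact({"do": 1, "think": 1, "want": 1})

# Each value is 1/3, so compare the sum with a tolerance instead of ==
print(math.isclose(sum(probabilities.values()), 1.0))   # True
```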

# Read the word data from a file, and generate a dictionary of words

In [44]:
with open('text_corpuses/trump_speeches.txt') as speech_file:
    word_counts = {}
    for line in speech_file:
        digraphs = split_line_to_digraphs(line)
        line_counts = convert_digraphs_to_dict(digraphs)
        word_counts = merge_outer_dictionaries(word_counts, line_counts)
    for word, counts in word_counts.items():
        word_counts[word] = normalize_counts(counts)
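
If you don't have the corpus file handy, the same steps can be tried on a short list of strings. This is a rough, self-contained sketch assuming Python 3.7+; `build_word_probabilities` is a hypothetical helper that folds the split, count, merge and normalize steps into one function rather than calling the functions above.

```python
from collections import Counter, defaultdict

def build_word_probabilities(lines):
    """Hypothetical helper: the same steps as the loop above, on any iterable of lines."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        # Count each (word, following word) pair within the line
        for first_word, second_word in zip(words, words[1:]):
            counts[first_word][second_word] += 1
    # Normalize each inner set of counts into probabilities
    probabilities = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        probabilities[word] = {w: c / total for w, c in followers.items()}
    return probabilities

lines = ["I do not like green eggs and Ham.",
         "I do what I think I want to do."]
model = build_word_probabilities(lines)
print(model["I"])   # {'do': 0.5, 'think': 0.25, 'want': 0.25}
```

Note that "Ham." and "do." only ever end a line, so they never become keys in the result.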

# Generate random sentences

Now for the easy/rewarding part! With this data of the most frequent words that follow each word, we can easily generate some fake sentences!

Let's create a function that will take a count dictionary and a number of words as inputs, and output a sentence based on the word frequencies. To do this, we'll need one capability that isn't available in base Python: we need to be able to generate random numbers, to select random words from the dictionary. To do this, we'll import the *random* module below.

In [45]:
import random

Now the random module has a function called random.random(), which generates a random number between 0 and 1. Try running the line below a couple of times - you'll see that it changes values every time you run it 

In [48]:
random.random()

0.8193624733995585

There's also a function in the random module called random.choice, which randomly picks an element from a list.

In [87]:
random.choice(['a', 'b', 'c', 'd'])

'c'
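
When you're debugging, it can help to make the "random" results repeatable. A minimal sketch using `random.seed()` from the same module (the seed value 42 is arbitrary):

```python
import random

random.seed(42)    # 42 is an arbitrary seed chosen for illustration
first_run = [random.random() for _ in range(3)]

random.seed(42)    # resetting the seed replays the same sequence
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)   # True
```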

# Select a random word using weighted probabilities

We want to select a random word with the probability that it would normally occur in the text.

There are probably a few good ways to do this. I thought about it for a little while and came up with the following algorithm, but it's probably not the most efficient/best way to do it. If you have a better idea, email me!

1. Generate a random number between zero and 1 - let's call this the cutoff probability
2. Create a variable to store a running total probability
3. Iterate through the possible words and their probabilities. For each word, add the individual probability of that word to the running total probability.
4. When the running total probability becomes higher than the cutoff probability, return the current word
5. If this doesn't happen, return the last word that we iterated over
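
To see the steps in action without any randomness, here is a sketch where the cutoff is passed in by hand. The helper `pick_word` and the example probabilities are made up for illustration; a list of pairs is used so the iteration order is explicit.

```python
def pick_word(word_probabilities, cutoff):
    """Hypothetical helper: walk the running total until it passes the cutoff."""
    running_total = 0
    word = None
    for word, probability in word_probabilities:
        running_total += probability
        if running_total > cutoff:
            return word
    # Rounding can leave the total just under 1, so fall back to the last word
    return word

followers = [("do", 0.5), ("think", 0.25), ("want", 0.25)]
print(pick_word(followers, 0.3))   # do    (running total 0.5 > 0.3)
print(pick_word(followers, 0.6))   # think (running total 0.75 > 0.6)
print(pick_word(followers, 0.9))   # want  (running total 1.0 > 0.9)
```

A word with a larger probability covers a larger slice of the range from 0 to 1, so a random cutoff lands in its slice more often.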

In [59]:
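def random_choice_from_frequency_dict(frequency_dict):
    # Cutoff probability between 0 and 1
    probability_threshold = random.random()
    total_probability = 0
    for word, probability in frequency_dict.items():
        total_probability += probability
        if total_probability > probability_threshold:
            return word
    # Rounding error can leave the total just below the cutoff,
    # so fall back to the last word we iterated over
    return word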
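
One possible alternative, assuming Python 3.6 or newer, is the standard library's `random.choices()`, which accepts a list of weights. The wrapper name `random_choice_with_weights` is hypothetical, and this is a sketch rather than the approach this lesson uses.

```python
import random

def random_choice_with_weights(frequency_dict):
    """Hypothetical alternative built on random.choices (Python 3.6+)."""
    words = list(frequency_dict.keys())
    weights = list(frequency_dict.values())
    # random.choices returns a list of k picks; we only need one
    return random.choices(words, weights=weights, k=1)[0]

print(random_choice_with_weights({"to": 1.0}))   # to

picked = random_choice_with_weights({"do": 0.5, "think": 0.25, "want": 0.25})
print(picked in {"do", "think", "want"})         # True
```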

In [88]:
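def generate_sentence(word_dict, number_words):
    output_words = []
    unique_words = list(word_dict.keys())
    current_word = random.choice(unique_words)
    for i in range(number_words):
        output_words.append(current_word)
        # A word that only ever ends a line has no entry in word_dict,
        # so restart from a random word instead of raising a KeyError
        if current_word not in word_dict:
            current_word = random.choice(unique_words)
        else:
            current_word = random_choice_from_frequency_dict(word_dict[current_word])
    return " ".join(output_words)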

In [89]:
generate_sentence(word_counts, number_words=100)

'trades, by senators and trillions of them fight to get along and have trillions of Caterpillars. Caterpillar’s stock market crashes. I’ve been able to keep Ben Carson, which is an ever done under budget that was brutal. I just call – some of other things about the parents, who murder gays. I don’t do it talked about it? You go over the history of a lot of it didn’t bring education per pupil -- it had an amazing job, but we have a beautiful plane. Can’t get hit them to make America building $2.5 trillion at zero. And we know'