# Pride and Prejudice (and Python)

It is a truth universally acknowledged, that an author in possession of a good book project, must be in want of an example to illustrate a point.

This is a Jupyter Notebook, a sort of interactive story told through prose, code and at least one picture!

You'll notice that this block of text (or cell) is highlighted. To walk through the notebook, type SHIFT-RETURN to move forwards one cell. If the cell contains code, SHIFT-RETURN will evaluate that code and print any output produced. You can also edit the code if you want to experiment. If something goes wrong, re-load the page and start from the top again.

NOTE: The code needs to be evaluated in order... for example, the opening code block loads some text from a file, which the following code blocks all make use of. If this cell hasn't been evaluated, none of the subsequent cells will work!

Anyway... on with the fun. ;-)

Here's a fun problem: how can we make a computer appear "creative"?

I'm going to explain a technique to solve this that's variously called "Markov chains" or "N-grams" (where "N" could be "bi" or "tri" or any other quantitative designation).

I'll explain how we can use this technique to generate superficially real looking text based upon examples fed into the computer. In this case, the source material will be Jane Austen's "Pride and Prejudice"... ;-)

I've included the text of the novel as an accompanying file to this notebook (obtained via the wonderful [Project Gutenberg](https://www.gutenberg.org/)).

The first thing we need to do is load the complete book so we can play around with it in Python:

In [322]:
with open('pride_and_prejudice.txt') as text_file:
    pride_and_prejudice = text_file.read()

At this point we have the complete text of the novel stored in the string called `pride_and_prejudice`. This means we can start to play around with the text:

In [323]:
print(pride_and_prejudice[:130])  # Print the first 130 characters of the novel.

Chapter 1


It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.



Notice how I use a technique called "slicing" to quickly tell Python the boundary for the characters to return. The `[:130]` means start at character 0 and stop just before character 130, giving 130 characters in total. A clearer way to put it would be `[0:130]`: the colon separates the lower and upper boundaries, and if no lower boundary is given (as in my example) then the beginning (i.e. 0) is assumed.
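
If you'd like to poke at slicing without loading the whole novel, here's a minimal sketch. The `sample` string is just a stand-in I've made up for this example; it isn't part of the notebook's data.

```python
# A tiny stand-in string so the example runs without the novel's text file.
sample = "It is a truth universally acknowledged"

print(sample[:5])    # Characters 0 to 4; exactly the same as sample[0:5].
print(sample[8:13])  # Characters 8 to 12: the upper boundary itself is excluded.
print(sample[-12:])  # A negative lower boundary counts back from the end.
```

The three slices pick out the opening, a word in the middle and the final word of the sample.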

The built-in `len` function is also very useful...

In [324]:
len(pride_and_prejudice)  # The number of characters (as in letters, not people) in the novel... ;-)

683934

Finding the word count is only a little more complex: we need to split the text whenever there's whitespace (a new line, tab, space etc...). Happily Python already has a `split` method which does this by default.

In [325]:
all_the_words_in_the_novel = pride_and_prejudice.split()  # Split the novel by whitespace.
len(all_the_words_in_the_novel)  # The length of all_the_words_in_the_novel

121561

We can even return a Python list of the first 25 words of the novel like this (notice how I'm using "slicing" again):

In [326]:
all_the_words_in_the_novel[:25]

['Chapter',
 '1',
 'It',
 'is',
 'a',
 'truth',
 'universally',
 'acknowledged,',
 'that',
 'a',
 'single',
 'man',
 'in',
 'possession',
 'of',
 'a',
 'good',
 'fortune,',
 'must',
 'be',
 'in',
 'want',
 'of',
 'a',
 'wife.']

So far so good: we have the full text of the novel and it's clear we can play around and analyse the content.

Here's where it gets interesting... we want the computer to generate prose that's based upon Jane Austen's. The most obvious way to analyse the novel is to look at how words follow one another. If we could work that out, we could write some code such that if it were given N number of previous words, it would work out a likely *new* word to follow. By repeating this process we could "grow" whole new Jane Austen-like output.

The technique of grouping adjacent words together is called the N-gram (where N is the number of words in each group). It's the sort of thing that's easy to see but perhaps harder to explain... so here's an example.

If N were 3, then the opening of the novel contains the following n-grams:

```
"Chapter 1 It"
"1 It is"
"It is a"
"is a truth"
"a truth universally"
"truth universally acknowledged"
```

(and so on for the rest of the novel)

Alternatively, if N were 2 we'd get the following n-grams:

```
"Chapter 1"
"1 It"
"It is"
"is a"
"a truth"
"truth universally"
"universally acknowledged"
```

(and so on...)

It's only a relatively short function in Python to create n-grams from a list of words:


In [327]:
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

Here's what you should know about this function:

* I did not write it! (Like any good developer with a problem at hand, the first thing I did was check to see if someone else had already solved it -- and they had!).
* The person who wrote it (someone called Scott Triglia) has written a [blog post explaining how it works](http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/).
* Since this is freely available code, I've credited the original author and pointed you to the documentation should you wish to know more.

This process is the basic modus operandi of open source software.
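
To see why such a short function does the job, here's a small sketch that unpacks it into two steps. The `words` list and the `shifted` and `trigrams` names are mine, made up for illustration; Scott's blog post remains the proper explanation.

```python
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

words = ["It", "is", "a", "truth", "universally"]

# Step 1: make n copies of the list, each starting one word later.
shifted = [words[i:] for i in range(3)]
# shifted[0] is ['It', 'is', 'a', 'truth', 'universally']
# shifted[1] is ['is', 'a', 'truth', 'universally']
# shifted[2] is ['a', 'truth', 'universally']

# Step 2: zip groups together the items at the same position in each copy,
# stopping as soon as the shortest copy runs out of words.
trigrams = list(zip(*shifted))
print(trigrams)
```

Because `zip` stops at the shortest copy, the function never produces a half-filled n-gram at the end of the list.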

In any case, Scott has saved me 10 minutes by giving me a ready made function for getting all the n-grams given a list of words. Remember, we already have the list called `all_the_words_in_the_novel` which should be passed into the function. The second argument `n` should be the N value for the n-gram. The result will be a collection containing all the n-grams in Pride and Prejudice. Let's start with N as 3:

In [328]:
ngrams = list(find_ngrams(all_the_words_in_the_novel, 3))  # get the n-grams as a list
ngrams[:6]

[('Chapter', '1', 'It'),
 ('1', 'It', 'is'),
 ('It', 'is', 'a'),
 ('is', 'a', 'truth'),
 ('a', 'truth', 'universally'),
 ('truth', 'universally', 'acknowledged,')]

Here's where it gets interesting. Given a low value of N (say 3) we'll find that the same three word n-gram will appear several times in the work. However, each time it may be followed by different words. Of course, perhaps the same word will follow the n-gram several times.

For example, consider the following (mostly made up) sentence fragments:

```
"It is a truth universally acknowledged..."
"'My word', said Darcy, 'it is a wonder her mother...'"
"But what to do about Wickham? It is a fix that..."
"It is a most vexing situation..."
"Visit Derbyshire? It is a most wonderful county..."
```

We find the n-gram "it is a" appears in all of them. The word following this n-gram in each sentence is:

* truth
* wonder
* fix
* most
* most

Notice how the word "most" appears twice in the list of five entries. If we were to select one entry at random the probability that "most" would be chosen is higher than that for the words "truth", "wonder" or "fix" (although, given our choice is random, there's no reason why those words wouldn't be chosen). The word "most" has a 2 in 5 chance of being selected, whereas the others only have a 1 in 5 chance.
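
Those odds are easy to check for yourself. The sketch below works out the exact chances from the list of five following words and then draws from it many times with `random.choice`. The fixed random seed and the 10,000 draws are arbitrary choices of mine so the experiment is repeatable.

```python
import random
from collections import Counter

following_words = ["truth", "wonder", "fix", "most", "most"]

# The exact chance of each word is its count divided by the size of the list.
chances = {word: count / len(following_words)
           for word, count in Counter(following_words).items()}
print(chances)

# Drawing many times should give roughly the same proportions.
random.seed(0)  # Fixed seed: an assumption made only so re-runs give the same counts.
draws = Counter(random.choice(following_words) for _ in range(10000))
print(draws.most_common())
```

Keeping the repeats in the list is what makes this work: there's no need to store probabilities anywhere, since the duplicates do the weighting for us.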

So, here's the fun bit... if we select an "opening" n-gram as a seed for generating Jane Austen like prose, we could work out a candidate for the next word in the sentence by randomly selecting one of the words in the list of following words for that n-gram.

There are two important considerations that may not, at first, be apparent:

* We are only able to generate a following word from an n-gram which Jane Austen actually wrote. The contents of the source text (in this case, Pride and Prejudice) dictates our candidate words for each n-gram, and they could only have been written by Jane Austen in the context of that n-gram.
* When we select a new following word, the final three words of the text generated so far form a new n-gram, which we use to select the next following word. Since Jane Austen wrote the word we selected directly after the previous n-gram, the newly formed n-gram **must** also appear in the novel (the only one without a following word is the very last n-gram in the book).

To be explicit, say we have a generated n-gram "it is a" and we randomly select "most" as the next word in our generated text, then the new n-gram is "is a most". Given the contents of the example sentences above, the candidate pool of next words to follow the "is a most" n-gram contains "vexing" and "wonderful".

And so the process continues until we decide to stop..!

This is essentially how a [Markov chain](https://en.wikipedia.org/wiki/Markov_chain) works to generate new superficially real looking output that is based solely on the input data used to generate the n-grams.
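
Here's a rough sketch of that worked example in code, using the five (mostly made up) fragments from earlier. Note the simplifications, which are my own assumptions: I lower-case every word and strip its punctuation so that "It" and "'it" count as the same word, and the `simple_words` helper and `following` dictionary are names invented for this example.

```python
import string

fragments = [
    "It is a truth universally acknowledged",
    "'My word', said Darcy, 'it is a wonder her mother'",
    "But what to do about Wickham? It is a fix that",
    "It is a most vexing situation",
    "Visit Derbyshire? It is a most wonderful county",
]

def simple_words(text):
    """Lower-case the text and strip punctuation from each word (a simplification)."""
    return [word.strip(string.punctuation).lower() for word in text.split()]

following = {}  # Maps each three word n-gram to the words seen after it.
for fragment in fragments:
    words = simple_words(fragment)
    for i in range(len(words) - 3):
        key = tuple(words[i:i + 3])
        following.setdefault(key, []).append(words[i + 3])

print(following[("it", "is", "a")])    # The five candidates listed earlier.
print(following[("is", "a", "most")])  # The pool after choosing "most".
```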

"But how does this translate into code..?", is something Lady Catherine de Bourgh would never condescend to ask. ;-)

First we need to create a list of following words for each n-gram. This is easy to achieve with a Python dictionary: if the key is the n-gram, then the associated value can be a list of all the words that ever followed the n-gram in the key. This should also include repetitions (such as "most" in the example above).

The simplest way to do this is to grab n-grams of length N+1: the first N words form the key n-gram and the final word is the one that follows it.

In [329]:
N = 3  # we'll assume we want to use n-grams of three.

In [330]:
n_grams = {}  # create an empty dictionary to contain our n-grams and list of following words.

In [331]:
for ngram in find_ngrams(all_the_words_in_the_novel, N + 1):  # Now populate the dictionary.
    key_ngram = ngram[:N]  # We only want to use the N number of words for our n-gram.
    following_word = ngram[-1]  # The N+1th word is the following word.
    word_list = n_grams.get(key_ngram, [])  # Get the current list of following words (or a new one if it doesn't exist).
    word_list.append(following_word)  # Append the following_word to the list.
    n_grams[key_ngram] = word_list  # Update the entry for the n-gram in the dict with the updated word_list.
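
As an aside, the same dictionary can be built a little more compactly with `collections.defaultdict`. This is only an alternative sketch, run on a short made-up list of words rather than the novel, and the `table` name is mine.

```python
from collections import defaultdict

def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

words = "it is a most vexing thing and it is a most wonderful thing".split()
N = 3

table = defaultdict(list)  # Missing keys start life as an empty list.
for *key_words, following_word in find_ngrams(words, N + 1):
    # The first N words are the key; the last word is what followed them.
    table[tuple(key_words)].append(following_word)

print(table[("is", "a", "most")])
print(table[("it", "is", "a")])
```

One thing to watch: merely looking up a missing key in a `defaultdict` creates an empty entry for it, so the plain dictionary with `get` used above is arguably the safer choice once we start generating text.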

At this point, we have the bare bones of what we need to generate Jane Austen like prose... So let's try it to make sure we're on the right track. :-)

How?

We choose a starting n-gram at random, then just keep generating a new word some arbitrary number of times. The output will start with the three words of the first (randomly selected) n-gram and we just need to append each new word until we tell it to stop.

In [332]:
import random  # The random module contains lots of helpful functions to choose things at random.

seed_ngram = random.choice(list(n_grams.keys()))  # Choose a random seed n-gram from a list of all n-grams in the novel.
output = list(seed_ngram)  # Create the output with a list of the words from the first n-gram.

for i in range(32):  # Make 32 more new words...
    next_word = random.choice(n_grams[seed_ngram])  # Choose a next word from the list associated with the seed n-gram.
    output.append(next_word)  # Append the randomly selected next word to the output.
    new_ngram = list(seed_ngram[1:])  # Create a new n-gram from the current seed n-gram (without its first item)
    new_ngram.append(next_word)  # Append the next word to the new n-gram thus completing the new n-gram.
    seed_ngram = tuple(new_ngram)  # Make the new n-gram the seed n-gram for the next iteration.

' '.join(output)  # Join the words in the output list with a space in between.

'determined to make no effort for conversation with anyone but himself; and to him she had hardly courage to speak. She felt it to be as much a debt of gratitude to him, as of'

In essence, each n-gram ("Chapter 1 It", "1 It is", "It is a" etc...) has associated with it the equivalent of a bag containing all the words that ever follow the words in the n-gram in Pride and Prejudice. Imagine it as something akin to a Scrabble bag but instead of containing letters it contains words. Note also, that a word may be in the bag several times (increasing its chance of selection). 

![Scrabble Bag](scrabble.jpg)

Each time we need a new word, we grab the specific bag of words for the current n-gram, give it a shake, reach in and jot down the word we pull out (while remembering to put the word back into the bag once we've read what it is).

Now we have four words, the three in the n-gram and the new word. 

How do we keep adding words? 

Well, we discard the first word of the current n-gram and append the new word to the end to get a brand new n-gram to use the next time we need to randomly select a new word. We rinse and repeat as many times as needed.

Simple! :-)
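
The "discard the first word, append the new one" step is worth seeing on its own. Since n-grams are tuples, it can be sketched in a single line; the values here are taken from the earlier "it is a" example rather than from the novel.

```python
seed_ngram = ("it", "is", "a")
next_word = "most"

# Drop the first word and add the new word to the end.
new_ngram = seed_ngram[1:] + (next_word,)
print(new_ngram)
```

This does the same job as converting to a list, appending and converting back to a tuple, just in one step. The n-gram always stays N words long.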

If you've re-run the code in the cell above a few times, you'll see that the resulting output has problems:

* It often starts mid-sentence.
* It abruptly stops mid-sentence.
* It may contain an un-opened or un-closed quotation mark.

What we really need is a way to generate syntactically correct sentences (i.e. they start with a capital letter, end with the correct punctuation and any quotation marks are correctly balanced).

This is where the fun really starts since we can program some heuristics (general rules of thumb) to help us overcome these problems. This is fun because you get to use your imagination to creatively solve problems to give you better output.

For example, how might we ensure all sentences start properly..? How about we just pre-compute a list of all the n-grams that began a sentence in the original text:

In [333]:
stop_characters = ".?!"  # Characters that end sentences.
ignore_words = ["Mr.", "Mrs.", "Miss.", "Rev.", "Dr.", ]  # Words ending in a full stop but are not the end of a sentence.

all_ngrams = find_ngrams(all_the_words_in_the_novel, N + 1)  # All the ngrams (N+1) in the novel.

opening_ngrams = []  # Currently empty, will contain all valid opening n-grams.

for ng in all_ngrams:  # Loop over ALL the ngrams to work out if they're valid opening n-grams.
    if ng[0][-1] in stop_characters:  # Is the final character of the first word of the n-gram a stop character?
        # If the first word isn't in the list of ignore_words AND the second word either starts with an upper-case
        # character or it starts with a quotation mark (")...
        if ng[0] not in ignore_words and (ng[1][0].isupper() or ng[1][0] == '"'):
            # Strip off the first word, and add the remaining N words as a valid opening n-gram. 
            opening_ngrams.append(ng[1:]) 

opening_ngrams[:10]  # Output the first ten opening ngrams to check they look OK.

[('However', 'little', 'known'),
 ('"My', 'dear', 'Mr.'),
 ('"But', 'it', 'is,"'),
 ('"Do', 'you', 'not'),
 ('"You', 'want', 'to'),
 ('"Why,', 'my', 'dear,'),
 ('Single,', 'my', 'dear,'),
 ('A', 'single', 'man'),
 ('What', 'a', 'fine'),
 ('How', 'can', 'it')]

Getting the `opening_ngrams` involves several considerations:

* We need to detect the end of a sentence (so the following words will become part of an opening n-gram). We do this by matching against common `stop_characters` which indicate the end of a sentence.
* Some words which end in a full-stop are not, in fact, the end of a sentence. We need to make a list of these words so we know to ignore them when detecting sentences (they're mainly titles like "Mr." or "Mrs.").
* The start of a sentence must begin with a capitalised letter or an opening quotation mark.

Creating opening n-grams of length N actually involves using n-grams of length N+1. Say N is 3, then we need all the 4 word n-grams in the novel so we can act on them like this:

* Imagine a 4 word n-gram: `hall. "But why is`
* First check the first word (`hall.`) has a final character which is a stop character (it does).
* Second, check it's not in the tricky words to ignore (it's not).
* Third, so far so good, so check the second word to see if it either starts with a capital letter (it does not) or a quotation mark (it does).
* Fourth, we've found an n-gram that we can use! So, remove the first word (`hall.`) leaving us with an n-gram of length N: `"But why is`
* Add this n-gram to the `opening_ngrams` list.

If any of the various checks described above are unsuccessful, then that particular n-gram is ignored and we move onto the next N+1 n-gram.
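
The checks described above can be sketched as a single function applied to one N+1 n-gram at a time. The function name `opening_ngram_or_none` is hypothetical (the notebook does the same work inline in a loop), and the second and third test n-grams are ones I've made up to show the checks failing.

```python
stop_characters = ".?!"
ignore_words = ["Mr.", "Mrs.", "Miss.", "Rev.", "Dr."]

def opening_ngram_or_none(ngram):
    """Return the last N words if they start a sentence, otherwise None."""
    first, second = ngram[0], ngram[1]
    # Checks one and two: the first word ends a sentence and isn't a title.
    ends_sentence = first[-1] in stop_characters and first not in ignore_words
    # Check three: the second word starts with a capital or a quotation mark.
    starts_sentence = second[0].isupper() or second[0] == '"'
    if ends_sentence and starts_sentence:
        return ngram[1:]  # Strip off the first word.
    return None

print(opening_ngram_or_none(('hall.', '"But', 'why', 'is')))  # Passes every check.
print(opening_ngram_or_none(('Mr.', 'Darcy', 'called', 'on')))  # "Mr." is ignored.
print(opening_ngram_or_none(('in', 'want', 'of', 'a')))  # No stop character.
```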

To generate a new seed for starting a sentence we can just use `random.choice` on the `opening_ngrams` list to get an appropriate n-gram:

In [334]:
seed_ngram = random.choice(opening_ngrams)
seed_ngram

('Such', 'a', 'circumstance')

This leaves us with two final problems:

* Ensuring sentences end with a valid stop character.
* Checking that quotation marks "balance".

While we were able to pre-calculate "up front" the contents of the opening n-grams in the example above, we can't pre-calculate anything about these two problems because they depend in some way on the state of the generated text.

Calculating things "up front" means the computational work need only happen once and the results are re-used "cheaply" again and again. This is one way to make code more efficient. But since the two remaining problems listed above cannot be pre-calculated then we'll need to add some code that's run *each time* a new word is generated.

The first thing we should work out is how to indicate how long a sentence should be. The most obvious way is to create a `make_sentence` function that takes a desired word-length:

In [335]:
def make_sentence(word_length):
    """
    Return a sentence of approximately word_length length.
    """
    seed_ngram = random.choice(opening_ngrams)
    result = list(seed_ngram)
    while len(result) < word_length:
        next_word = random.choice(n_grams[seed_ngram])
        result.append(next_word)
        new_ngram = list(seed_ngram[1:])
        new_ngram.append(next_word)
        seed_ngram = tuple(new_ngram)
    return tuple(result)

Let's check this basic version works:

In [336]:
' '.join(make_sentence(8))

'He began to wish to know more of.'

What remains of the first problem is to ensure the sentence ends in a syntactically correct manner (with a stopping character). The most obvious thing to do is to limit the candidate words for the next word to be those which end in stopping characters. If no such candidate words exist, keep going until we get an n-gram with such words. Once a sentence-ending word is reached, return the result. That's why the sentence can only be approximately `word_length` words long.
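
Before the full function, here's the filtering step on its own as a small sketch. The `candidates` list is a made-up bag of following words, not one taken from the novel.

```python
stop_characters = ".?!"
ignore_words = ["Mr.", "Mrs.", "Miss.", "Rev.", "Dr."]

candidates = ["Mr.", "Darcy", "him.", "her?", "and", 'said."']  # A made-up bag of words.

# Keep only words whose final character ends a sentence, skipping titles.
ending_words = [word for word in candidates
                if word[-1] in stop_characters and word not in ignore_words]
print(ending_words)
```

Notice that `said."` is left out: its final character is the quotation mark rather than the full stop, so this simple test doesn't treat it as a sentence ending.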

In [337]:
def make_sentence(word_length):
    """
    Return a sentence of approximately word_length length.
    """
    seed_ngram = random.choice(opening_ngrams)
    result = list(seed_ngram)
    while len(result) < word_length - 1:
        next_word = random.choice(n_grams[seed_ngram])
        result.append(next_word)
        new_ngram = list(seed_ngram[1:])
        new_ngram.append(next_word)
        seed_ngram = tuple(new_ngram)
    while True:
        ending_words = [word for word in n_grams[seed_ngram] if word[-1] in stop_characters and word not in ignore_words]
        if ending_words:
            result.append(random.choice(ending_words))
            break
        else:
            next_word = random.choice(n_grams[seed_ngram])
            result.append(next_word)
            new_ngram = list(seed_ngram[1:])
            new_ngram.append(next_word)
            seed_ngram = tuple(new_ngram)
            
    return tuple(result)

Try the revised function in the next cell. Try it with a different value for `word_length`. I've managed to create some quite hilarious output (e.g. `('She', 'saw', 'all', 'the', 'glories', 'of', 'the', 'camp--its', 'tents', 'stretched', 'forth', 'in', 'beauteous', 'uniformity', 'of', 'lines,', 'crowded', 'with', 'the', 'young', 'ladies.')`)

In [338]:
' '.join(make_sentence(8))

"His most particular friend, you see by Jane's account, was persuaded of his never intending to marry her."

However, the code I've created is clunky since I'm re-using the same fragment of code in both while loops (starting with `next_word = random.choice(n_grams[seed_ngram])`). This is a "code smell" that indicates I could refactor my code to make it more readable. When a developer says refactor, they're talking about a process akin to editing prose: remove unnecessary repetitions, think about renaming things more appropriately, and re-arrange it so it makes better sense when read by another.

In this case I think I could remove the repeated fragments of code and move them into a single function with a name that describes what it does... like `make_next_word`:

In [339]:
def make_next_word(seed_ngram):
    """
    Return the next word and n-gram given a seed_ngram.
    """
    next_word = random.choice(n_grams[seed_ngram])
    new_ngram = list(seed_ngram[1:])
    new_ngram.append(next_word)
    return next_word, tuple(new_ngram)


def make_sentence(word_length):
    """
    Return a sentence of approximately word_length length.
    """
    seed_ngram = random.choice(opening_ngrams)
    result = list(seed_ngram)
    while len(result) < word_length - 1:
        next_word, seed_ngram = make_next_word(seed_ngram)
        result.append(next_word)
    while True:
        ending_words = [word for word in n_grams[seed_ngram] if word[-1] in stop_characters and word not in ignore_words]
        if ending_words:
            result.append(random.choice(ending_words))
            break
        else:
            next_word, seed_ngram = make_next_word(seed_ngram) 
            result.append(next_word)
    return tuple(result)

In [340]:
' '.join(make_sentence(8))

'He sat with them above half-an-hour; and when they arose to depart, Mr. Darcy called on his sister to her was amazing!--but to speak with you." Elizabeth was too much embarrassed to say a word.'

If you run the revised version of `make_sentence` several times you'll notice that, sometimes, the quotes won't match. This is the second remaining problem.

Happily it can be solved simply with a flag to indicate if an opening quotation mark has occurred. If so, allow words that end in a closing quotation mark (otherwise exclude them). There is a slight possibility that an n-gram may only have candidate next words which end in closing quotes, in which case we just have to accept the clunky result.
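
To make the flag idea concrete, here's a minimal sketch that follows the flag through a handful of words. The `track_quotes` helper is hypothetical, written just for this illustration, and the words are a fragment of the kind of output the generator produces.

```python
def track_quotes(words):
    """Return the value of the quote flag after each word in turn."""
    quote_flag = False
    history = []
    for word in words:
        if word[0] == '"':   # An opening quotation mark switches the flag on.
            quote_flag = True
        if word[-1] == '"':  # A closing quotation mark switches it off again.
            quote_flag = False
        history.append(quote_flag)
    return history

words = ['"Oh!', 'yes,"', 'said', 'Elizabeth', 'drily;', '"Mr.', 'Darcy']
print(track_quotes(words))
```

If the final value is `True`, the text so far has an opening quotation mark that hasn't yet been closed.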

In [341]:
def make_next_word(seed_ngram, quote_flag=False):
    """
    Return the next word and n-gram given a seed_ngram. If quote_flag is True, allow words that end in
    a quotation mark.
    """
    candidate_words = n_grams[seed_ngram]  # possible next words.
    if not quote_flag:  # Throw away words that end with a quotation mark.
        candidate_words = [word for word in candidate_words if not word[-1] == '"']
    if not candidate_words:  # If we end up with no candidate words, just reset them.
        candidate_words = n_grams[seed_ngram]
    # Continue as in previous examples. 
    next_word = random.choice(candidate_words)
    new_ngram = list(seed_ngram[1:])
    new_ngram.append(next_word)
    return next_word, tuple(new_ngram)


def make_sentence(word_length):
    """
    Return a sentence of approximately word_length length.
    """
    quote_flag = False  # Indicates if the sentence contains an opening quotation mark.
    seed_ngram = random.choice(opening_ngrams)
    result = list(seed_ngram)
    for word in result:  # Check the words from the opening n-gram for opening quotes.
        if word[0] == '"':
            quote_flag = True
            break
    while len(result) < word_length - 1:
        next_word, seed_ngram = make_next_word(seed_ngram, quote_flag)  # Pass the flag so it takes effect.
        result.append(next_word)
        # The following two if statements check the word and ensure the quote_flag is correctly set, if needed.
        if next_word[0] == '"':  # [0] means the first character.
            quote_flag = True
        if next_word[-1] == '"':  #[-1] means the final character.
            quote_flag = False
    while True:
        ending_words = [word for word in n_grams[seed_ngram] if word[-1] in stop_characters and word not in ignore_words]
        if ending_words:
            result.append(random.choice(ending_words))
            break
        else:
            next_word, seed_ngram = make_next_word(seed_ngram, quote_flag)  # Pass the flag so it takes effect.
            result.append(next_word)
            # Since we still don't have a word to end the sentence, the same quote-checking as above needs to
            # happen to ensure speech marks remain balanced.
            if next_word[0] == '"':
                quote_flag = True
            if next_word[-1] == '"':
                quote_flag = False
    return tuple(result)

If we try this final revision I think we're close enough (but certainly not perfect) to making sentences that vaguely "look" correct (even if they may not read well).

In [342]:
' '.join(make_sentence(8))

"Mr. Bingley was to bring twelve ladies and seven gentlemen with him to Longbourn before many days had passed after Lady Catherine's visit."

It would be useful to be able to automate the creation of paragraphs and chapters of text. :-)

For this to happen we need to string some number of smaller units of text together, making sure that the final n-gram of the preceding text is the seed for the text that follows.

In the following code I define three new functions and modify the existing `make_sentence`:

* `make_new_seed` - This will take the n-gram that ended the previous sentence and produce an n-gram to *start* the next new sentence. This directly addresses the requirement that the final n-gram of the preceding text forms the seed for the text that follows. The important property of the result is that it'll be a seed n-gram that correctly starts a new sentence.
* `make_sentence` - This function works in exactly the same way as before except that it takes a seed n-gram as an argument and returns two values as a result: the new sentence and the final n-gram used when generating the sentence. As a result we can chain these functions together using the final n-gram passed through `make_new_seed` to generate the seed n-gram for the next sentence.
* `make_paragraph` and `make_chapter` - These two functions work in a very similar way. You give them a number to indicate how much of what they produce you need and a seed n-gram to kick them off. They both contain a loop which will continue however long you specified to produce content after which they'll return their result.

Obviously, `make_chapter` re-uses `make_paragraph` which re-uses `make_new_seed` and `make_sentence`.

In [343]:
def make_new_seed(old_seed):
    """
    Given an old seed (which closes a previous section), generate a new seed to start the next section.
    """
    seed_ngram = old_seed
    result = []
    while len(result) < len(old_seed):  # Simply give N new words generated from the old_seed
        next_word, seed_ngram = make_next_word(seed_ngram)
        result.append(next_word)
    return tuple(result)

def make_sentence(word_length, seed_ngram):
    """
    Return a sentence of approximately word_length length.
    """
    quote_flag = False
    result = list(seed_ngram)
    for word in result:
        if word[0] == '"':
            quote_flag = True
            break
    while len(result) < word_length - 1:
        next_word, seed_ngram = make_next_word(seed_ngram, quote_flag)
        result.append(next_word)
        if next_word[0] == '"':
            quote_flag = True
        if next_word[-1] == '"':
            quote_flag = False
    while True:
        ending_words = [word for word in n_grams[seed_ngram] if word[-1] in stop_characters and word not in ignore_words]
        if ending_words:
            next_word = random.choice(ending_words)
            new_ngram = list(seed_ngram[1:])
            new_ngram.append(next_word)
            seed_ngram = tuple(new_ngram)
            result.append(next_word)
            break
        else:
            next_word, seed_ngram = make_next_word(seed_ngram, quote_flag)
            result.append(next_word)
            if next_word[0] == '"':
                quote_flag = True
            if next_word[-1] == '"':
                quote_flag = False
    return tuple(result), seed_ngram  # Return the last n-gram used, to form the seed n-gram for the next sentence.


def make_paragraph(number_of_sentences, seed_ngram):
    """
    Generate a paragraph containing "number_of_sentences" sentences. Start the first sentence with the passed in n_gram.
    """
    result = []
    for i in range(number_of_sentences):
        sentence_length = random.randint(3, 8)  # Vary the length of each sentence produced!
        sentence, last_ngram = make_sentence(sentence_length, seed_ngram)
        result.append(' '.join(sentence))
        seed_ngram = make_new_seed(last_ngram)
    return result, seed_ngram

def make_chapter(number_of_paragraphs, seed_ngram):
    """
    Generate a chapter containing "number_of_paragraphs" paragraphs. Start the first sentence of the first paragraph with
    the passed in n_gram.
    """
    result = []
    for i in range(number_of_paragraphs):
        paragraph_length = random.randint(3, 8)  # Vary the number of sentences in each paragraph!
        paragraph, last_ngram = make_paragraph(paragraph_length, seed_ngram)
        result.append(' '.join(paragraph))
        seed_ngram = last_ngram
    return result, seed_ngram
    

At last..! My creation..! It's ALIVE..! :-)

Let's create a short 5 paragraph chapter using a seed n-gram that contains the opening three words of the novel.

Here's some example output:

> Chapter 1 It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a choice. I described, and enforced them earnestly.

The capacity for this technique to generate hilarious and/or slightly clunky prose is a great source of fun. 

In [344]:
seed = next(find_ngrams(all_the_words_in_the_novel, N))  # Grabs the first n-gram.
chapter, seed_ngram = make_chapter(5, seed)
print('\n\n'.join(chapter))

Chapter 1 It is a great friend of Darcy's." "Oh! yes," said Elizabeth drily; "Mr. Darcy is all politeness," said Elizabeth, smiling. "He is, indeed; but, considering the inducement, my dear Miss Elizabeth, that your modesty, so far from doing you any disservice, rather adds to your other perfections. You would have been a great proficient. And so would Anne, if her health had allowed her to apply. I am confident that she would produce a letter for her from Charlotte, as it seemed the effect of love, and the object of compassion. His attachment excited gratitude, his general character respect; but she could still moralize over every morning visit; and as she considered that Jane's disappointment had in fact been the work of many generations." "And then you have added so much to it yourself, you are always giving her the preference." "They have none of them dressed. In ran Mrs. Bennet to her husband, called out as she did; I can safely say that it was impossible not to try for informatio

There you have it!

But I've saved the best bit for last... all of the code and techniques described above are *general* in application. If you remember back to the opening code cell, I loaded the text file for "Pride and Prejudice". Why not try other great works of literature? (I've included `war_and_peace.txt`, `moby-dick.txt` and `ulysses.txt` for your enjoyment.)
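
To show how little depends on the particular book, here's a compact sketch that wraps the core idea into two functions which accept any text. The `build_table` and `babble` names are my own inventions for this example, it leaves out the sentence and quotation heuristics, and it runs on a short made-up string so it works without any files.

```python
import random

def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

def build_table(text, n):
    """Map each n-gram in the text to the list of words that follow it."""
    table = {}
    for ngram in find_ngrams(text.split(), n + 1):
        table.setdefault(ngram[:n], []).append(ngram[-1])
    return table

def babble(table, length):
    """Generate up to `length` words, stopping early at a dead end."""
    seed = random.choice(list(table))
    output = list(seed)
    # The last n-gram of a text has nothing after it, so check before looking it up.
    while len(output) < length and seed in table:
        next_word = random.choice(table[seed])
        output.append(next_word)
        seed = seed[1:] + (next_word,)
    return ' '.join(output)

# A made-up stand-in text. To use a whole book, read one of the files instead.
text = "the cat sat on the mat and the cat ate the fish and the dog sat on the log"
table = build_table(text, 2)
print(babble(table, 12))
```

Swapping the stand-in string for the contents of one of the included text files (and using a larger N) should give output in the style of that book instead.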

Have fun... as always, if you have any questions, I'm more than happy to help in any way that I can. I hope you find this useful!