# Introduction
During my orchestra [LiTHe Blås](http://litheblas.org)'s 45-year anniversary I will be doing a customary "Skitsnack" between two songs.
A "Skitsnack" is where an orchestra member keeps the audience preoccupied by talking about anything between heaven and earth until we start playing the next song.
I often find it hard to decide what I should talk about and usually end up telling really bad jokes or rambling about some algorithm I just read about. 

For this "Skitsnack" I will avoid coming up with my own material entirely by generating it with an n-gram model!
The focus of this notebook is simply to generate a short text which I think could entertain a group of 500 for about 30-60 seconds.
As such, any quantitative evaluation will only happen on the spur of the moment.

# Training data
I haven't decided what data I should use to train my model, and will probably have to try some different sources when the model is finished. The final text will definitely be in Swedish to accommodate the audience, but to get started I will use the Danish author H.C. Andersen's [Fairy Tales](http://www.gutenberg.org/ebooks/32572), translated to English.

## Cleaning the data

In [3]:
import codecs

In [44]:
with codecs.open('hcandersen_fairy_tales.txt', 'r', encoding='utf-8') as f:
    text = f.read()

First let's remove all text which is not part of his stories, as I don't want to use this for training.

In [45]:
import re

In [46]:
for i in re.finditer(r"HANS ANDERSEN'S FAIRY TALES", text):
    print(i.start(0), i.end(0))
    # Hiding the output for future readability
    #print(text[i.start(0): i.end(0) + 200])
    #print("###########################")

627 654
4021 4048
4140 4167
375029 375056


In [47]:
text = text[4167:]

In [48]:
for i in re.finditer(r"NOTES", text):
    print(i.start(0), i.end(0))
    print(text[i.start(0): i.end(0) + 200])
    print("###########################")

367113 367118
NOTES


THE STORKS

          PAGE 29. On account of the ravages it makes among
          noxious animals, the stork is a privileged bird
          wherever it makes its home. In cities it is
     
###########################


In [49]:
text = text[:367113]
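The two slices above rely on offsets I read off the printed matches by hand. As a rough sketch, the same trimming could be done by searching for the markers directly. The helper name `trim_to_body` and its `start_occurrence` parameter are my own invention for illustration, and the toy string stands in for the real file.

```python
import re


def trim_to_body(text, start_marker, end_marker, start_occurrence=1):
    """Keep the text after the n-th start_marker and before the next end_marker."""
    # End offsets of every occurrence of the start marker
    starts = [m.end() for m in re.finditer(re.escape(start_marker), text)]
    if len(starts) < start_occurrence:
        raise ValueError("start marker occurs fewer times than requested")
    body = text[starts[start_occurrence - 1]:]
    # Cut at the first end marker after the chosen start, if there is one
    end = body.find(end_marker)
    return body if end == -1 else body[:end]


# Toy example: the second "TITLE" plays the role of the last heading before the stories
sample = "TITLE contents TITLE story text NOTES appendix TITLE"
body = trim_to_body(sample, "TITLE", "NOTES", start_occurrence=2)
print(repr(body))
```

For the fairy tale file this would presumably be called with `"HANS ANDERSEN'S FAIRY TALES"`, `"NOTES"` and `start_occurrence=3`, matching the third match printed above.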

Let's remove all carriage returns and replace line breaks with spaces.

In [37]:
text = re.sub(r'\r', '', text)
text = re.sub(r'\n', ' ', text)
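The two substitutions leave tabs and repeated spaces in place. A slightly more thorough variant, as a sketch: collapse every run of whitespace (tabs, carriage returns, line breaks, repeated spaces) into a single space. The helper name `normalize_whitespace` is my own and not part of the notebook.

```python
import re


def normalize_whitespace(text):
    """Collapse any run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_whitespace("THE FLAX\r\n\r\n\tTHE flax   was in full bloom;\n")
print(cleaned)
```

Note that this also throws away the extra spacing that separates a story title from its first sentence, so it is a trade-off if that spacing is needed later.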

## Tokenization
### Sentences
First let's parse the text as sentences using NLTK.

In [39]:
from nltk.tokenize import sent_tokenize

In [40]:
sentences = sent_tokenize(text)

In [43]:
print(sentences[0])
print(sentences[1])

     THE FLAX   THE flax was in full bloom; it had pretty little blue flowers, as delicate as the wings of a moth.
The sun shone on it and the showers watered it; and this was as good for the flax as it is for little children to be washed and then kissed by their mothers.


The first one is not good: it includes the story title. I will not bother with this at the moment though.
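If I do come back to it, one possible heuristic is sketched here. It assumes that a title is written in capitals and is separated from the sentence by at least two spaces, which is what the output above looks like but which I have not checked for every story. The function name `strip_leading_title` is hypothetical.

```python
import re


def strip_leading_title(sentence):
    """Drop a leading all-caps title that is set off by two or more spaces."""
    stripped = sentence.strip()
    # Split once at the first run of 2+ whitespace characters
    parts = re.split(r"\s{2,}", stripped, maxsplit=1)
    if len(parts) == 2 and parts[0].isupper():
        return parts[1]
    return stripped


first = strip_leading_title("     THE FLAX   THE flax was in full bloom.")
print(first)
```

Sentences without such a prefix pass through unchanged apart from trimmed edges.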

### Words

In [66]:
from nltk.tokenize import word_tokenize

In [69]:
sentences = list(map(word_tokenize, sentences))

In [73]:
print(sentences[1])

['The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '.']


# N-Gram model

## Constructing n-grams from the sentences
Let's try the `ngrams` function from NLTK.

In [74]:
from nltk import ngrams

In [88]:
trigrams = [list(ngrams(sentence, 3)) for sentence in sentences]

In [89]:
trigrams[1]

[('The', 'sun', 'shone'),
 ('sun', 'shone', 'on'),
 ('shone', 'on', 'it'),
 ('on', 'it', 'and'),
 ('it', 'and', 'the'),
 ('and', 'the', 'showers'),
 ('the', 'showers', 'watered'),
 ('showers', 'watered', 'it'),
 ('watered', 'it', ';'),
 ('it', ';', 'and'),
 (';', 'and', 'this'),
 ('and', 'this', 'was'),
 ('this', 'was', 'as'),
 ('was', 'as', 'good'),
 ('as', 'good', 'for'),
 ('good', 'for', 'the'),
 ('for', 'the', 'flax'),
 ('the', 'flax', 'as'),
 ('flax', 'as', 'it'),
 ('as', 'it', 'is'),
 ('it', 'is', 'for'),
 ('is', 'for', 'little'),
 ('for', 'little', 'children'),
 ('little', 'children', 'to'),
 ('children', 'to', 'be'),
 ('to', 'be', 'washed'),
 ('be', 'washed', 'and'),
 ('washed', 'and', 'then'),
 ('and', 'then', 'kissed'),
 ('then', 'kissed', 'by'),
 ('kissed', 'by', 'their'),
 ('by', 'their', 'mothers'),
 ('their', 'mothers', '.')]
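To make it clear what is going on, the sliding window behind this output can be sketched in plain Python. This is my own minimal version for illustration (the name `sliding_ngrams` is made up), not how NLTK implements it.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, so only full windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["The", "sun", "shone", "on", "it"]
for gram in sliding_ngrams(tokens, 3):
    print(gram)
```

A sentence with fewer than `n` tokens simply yields no n-grams.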

In [81]:
from functools import reduce

In [90]:
trigrams = reduce(lambda x, y: x + y, trigrams)

In [87]:
len(trigrams)

72109
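Flattening with `reduce` and `+` copies the growing list at every step, which gets slow for larger corpora. A sketch of an alternative using `itertools.chain.from_iterable` from the standard library, which should give the same flat list in a single pass (the helper name `flatten` is my own):

```python
from itertools import chain


def flatten(list_of_lists):
    """Concatenate the inner lists into one flat list."""
    return list(chain.from_iterable(list_of_lists))


per_sentence = [[("a", "b", "c"), ("b", "c", "d")], [], [("x", "y", "z")]]
flat = flatten(per_sentence)
print(len(flat))
```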

## Implementing a generative model

In [107]:
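class NGrams:
    """Collects all 1-grams up to n-grams from tokenized sentences."""

    def __init__(self, n):
        self.n = n  # highest n-gram order to collect

    def fit(self, sentences):
        # self.grams[i - 1] holds the flat list of all i-grams in the corpus
        self.grams = []
        for i in range(1, self.n + 1):
            grams = list(map(lambda x: list(ngrams(x, i)), sentences))
            # The initial [] keeps reduce from failing on an empty corpus
            grams = reduce(lambda x, y: x + y, grams, [])
            self.grams.append(grams)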

In [108]:
trigram = NGrams(3)

In [109]:
trigram.fit(sentences)

In [112]:
trigram.grams[2][:10]

[('THE', 'FLAX', 'THE'),
 ('FLAX', 'THE', 'flax'),
 ('THE', 'flax', 'was'),
 ('flax', 'was', 'in'),
 ('was', 'in', 'full'),
 ('in', 'full', 'bloom'),
 ('full', 'bloom', ';'),
 ('bloom', ';', 'it'),
 (';', 'it', 'had'),
 ('it', 'had', 'pretty')]
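So far the class only stores the n-grams; it does not generate anything yet. As a rough sketch of where this could go, the trigrams can be turned into a table that maps each pair of words to counts of the words that follow, and a sentence can then be sampled one word at a time. Everything here is an assumption of mine rather than the finished model: the function names, the `<s>` and `</s>` padding tokens, and the choice to sample proportionally to raw counts without any smoothing.

```python
import random
from collections import Counter, defaultdict


def build_trigram_table(tokenized_sentences):
    """Map each (w1, w2) context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in tokenized_sentences:
        # Pad so the first real word has a context and the end can be sampled
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            table[(w1, w2)][w3] += 1
    return table


def generate(table, max_words=30, rng=random):
    """Sample words until the end token is drawn or max_words is reached."""
    context = ("<s>", "<s>")
    words = []
    while len(words) < max_words:
        followers = table.get(context)
        if not followers:
            break  # unseen context, nothing to sample from
        candidates, weights = zip(*followers.items())
        next_word = rng.choices(candidates, weights=weights)[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        context = (context[1], next_word)
    return words


toy_corpus = [
    ["The", "sun", "shone", "on", "it", "."],
    ["The", "sun", "was", "warm", "."],
]
table = build_trigram_table(toy_corpus)
print(" ".join(generate(table, rng=random.Random(0))))
```

With the real data, `toy_corpus` would be replaced by the tokenized `sentences` from above. Passing a seeded `random.Random` makes a run repeatable, which is handy when I find a text I like.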