# Introduction
During my orchestra [LiTHe Blås](http://litheblas.org)'s 45-year anniversary I will be doing a customary "Skitsnack" between two songs.
A "Skitsnack" is where an orchestra member keeps the audience preoccupied by talking about anything between heaven and earth until we start playing the next song.
I often find it hard to decide what I should talk about and usually end up telling really bad jokes or rambling about some algorithm I just read about. 

For this "Skitsnack" I will avoid coming up with my own material entirely by generating it with an n-gram model!
The focus of this Notebook is simply to generate a short text which I think could entertain a group of 500 for about 30-60 seconds.

There are some nice aspects of the task of generating a "funny" text from a predefined dictionary:
* I can easily perform extrinsic testing. "Is this text funny?"
* I don't have to worry about unknown words.

As such, any quantitative evaluation will only happen in the spur of the moment.

Inspiration:
https://web.stanford.edu/~jurafsky/slp3/4.pdf

# Training data
I haven't decided what data I should use to train my model, and will probably have to try some different sources when the model is finished. The final text will definitely be in Swedish to accommodate the audience, but to get started I will use the Danish author [H.C. Andersen's Fairy Tales](http://www.gutenberg.org/ebooks/32572) translated to English.

## Cleaning the data

In [1]:
import codecs

In [2]:
with codecs.open('hcandersen_fairy_tales.txt', 'r', encoding='utf-8') as f:
    text = f.read()

First let's remove all text which is not part of his stories, as I don't want to use this for training.

In [3]:
import re

In [4]:
for i in re.finditer(r"HANS ANDERSEN'S FAIRY TALES", text):
    print(i.start(0), i.end(0))
    # Hiding the output for future readability
    #print(text[i.start(0): i.end(0) + 200])
    #print("###########################")

627 654
4021 4048
4140 4167
375029 375056


In [5]:
text = text[4167:]

In [6]:
for i in re.finditer(r"NOTES", text):
    print(i.start(0), i.end(0))
    print(text[i.start(0): i.end(0) + 200])
    print("###########################")

367113 367118
NOTES


THE STORKS

          PAGE 29. On account of the ravages it makes among
          noxious animals, the stork is a privileged bird
          wherever it makes its home. In cities it is
     
###########################


In [7]:
text = text[:367113 ]

Let's remove all carriage returns and replace the line breaks with spaces.

In [8]:
text = re.sub(r'\r', '', text)
text = re.sub(r'\n', ' ', text)

## Tokenization
### Sentences
First let's parse the text as sentences using NLTK.

In [9]:
from nltk.tokenize import sent_tokenize

In [10]:
sentences = sent_tokenize(text)

In [11]:
print(sentences[0])
print(sentences[1])

     THE FLAX   THE flax was in full bloom; it had pretty little blue flowers, as delicate as the wings of a moth.
The sun shone on it and the showers watered it; and this was as good for the flax as it is for little children to be washed and then kissed by their mothers.


The first one is not good: it includes the story title. I will not bother with this at the moment though, as this is not the text I will be using for my final results.

### Words

In [12]:
from nltk.tokenize import word_tokenize

In [13]:
sentences = list(map(word_tokenize, sentences))

In [14]:
print(sentences[1])

['The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '.']


### Add beginning and end of sentence tags
All my sentences need a root and a stop token. I will insert a `<BOS>` tag before the first word of each sentence and replace the last punctuation mark of each sentence with `<EOS>`. I do this to distinguish between in-sentence punctuation and the actual end of a sentence.

In [15]:
sentences = list(map(lambda x: ['<BOS>'] + x[:-1] + ['<EOS>'], sentences))

In [16]:
print(sentences[1])

['<BOS>', 'The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '<EOS>']


# N-Gram model

## Constructing n-grams from the sentences
Let's try the `ngrams` function from NLTK.

In [17]:
from nltk import ngrams
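
To make the idea concrete, here is a minimal pure-Python sketch of what extracting n-grams from a tagged sentence amounts to. The helper `simple_ngrams` is a hypothetical stand-in written for illustration, not NLTK's implementation; for a plain list of tokens without padding I would expect `ngrams` to yield the same tuples.

```python
def simple_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (illustrative helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

example = ['<BOS>', 'The', 'sun', 'shone', '<EOS>']
bigrams = list(simple_ngrams(example, 2))
trigrams = list(simple_ngrams(example, 3))
print(bigrams)
print(trigrams)
```

A sentence with k tokens gives k - n + 1 n-grams, so very short sentences contribute nothing to the higher orders.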

## Implementing a generative model
When generating my text I will not be considering context across sentences. Each `<EOS>` tag will be followed by a fresh `<BOS>`, interpreted as a unigram.

In [18]:
from functools import reduce
import numpy as np
import pandas as pd

In [19]:
class NGrams():
    def __init__(self, n):
        self.n = n
        
    def fit(self, sentences):
        # Count all ngrams
        self.grams = []
        for i in range(1, self.n+1):
            # Build a sorted list of all ngrams
            grams = list(map(lambda x: list(ngrams(x, i)), sentences))
            grams = np.array(reduce(lambda x, y: x + y, grams))
            grams = grams[np.lexsort(grams[:,::-1].T,),:]
            
            # Build an array marking the first occurrence of each unique n-gram
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [True, False, True, True, False]
            first_occurence = np.append([True], np.array([np.any(grams[i-1] != grams[i]) for i in range(1, len(grams))]))
            
            # Assign each unique n-gram an index
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [0, 0, 1, 2, 2]
            ids = np.append([0], first_occurence[1:].cumsum())
            
            # Build a mapping from n-gram to count
            frequencies = dict(list(zip(map(tuple, grams[first_occurence,:]), np.bincount(ids))))
            self.grams.append(frequencies)
        
        # Record the total number of unigrams
        self.n_unigrams = sum(self.grams[0].values())         
        
    def log_prob(self, word, context):
        n = len(context)
        gram = context + (word,)
        if n == 0:
            return np.log(self.grams[n][gram] / self.n_unigrams)
        else:
            return np.log(self.grams[n][gram] / self.grams[n-1][context])      

In [20]:
len(sentences)

3902

In [21]:
trigram = NGrams(3)

In [22]:
%time trigram.fit(sentences)

Wall time: 9.41 s


In [23]:
trigram.log_prob('the', ('!', '--'))

0.0

In [24]:
trigram.log_prob('call', ("'ll",))

-3.4657359027997265

In [25]:
trigram.log_prob('call', ())

-8.7714179106038408

Cool, we can calculate the log probabilities of a word following an ngram. To avoid overfitting to the exact sentences in the text some smoothing should be done. Without smoothing we will assign zero probability to all ngrams that are not present in the text.
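
As a rough illustration of what smoothing would change, here is a sketch of add-one (Laplace) smoothing for bigrams, one of the simplest standard schemes. It is not part of the `NGrams` class above; the toy corpus and the function name `laplace_log_prob` are made up for the example.

```python
from collections import Counter
import math

# Invented toy corpus, already tagged with <BOS>/<EOS>
toy_sentences = [
    ['<BOS>', 'the', 'sun', 'shone', '<EOS>'],
    ['<BOS>', 'the', 'flax', 'bloomed', '<EOS>'],
]

unigram_counts = Counter(w for s in toy_sentences for w in s)
bigram_counts = Counter((a, b) for s in toy_sentences for a, b in zip(s, s[1:]))
vocab_size = len(unigram_counts)

def laplace_log_prob(word, context):
    """log P(word | context) with add-one smoothing; context is a single word."""
    numerator = bigram_counts[(context, word)] + 1
    denominator = unigram_counts[context] + vocab_size
    return math.log(numerator / denominator)

seen = laplace_log_prob('sun', 'the')        # bigram occurs in the toy corpus
unseen = laplace_log_prob('bloomed', 'the')  # bigram never occurs, probability still > 0
print(seen, unseen)
```

Every count is bumped by one, so the unseen bigram gets a small but non-zero probability while the seen one loses some of its mass.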

However, I have encountered a problem: At the moment I don't feel creative enough to come up with a generative model that finds the next word in a sentence based on smoothed probabilities. Maybe I started off at the wrong end?

## Simple text generating model
I will start off with an unsmoothed model, and at the same time also switch to a tree structure to represent the model vocabulary.

In [20]:
import random
import bisect
import sys

In [21]:
class NGramsTree():
    def __init__(self, n):
        self.n = n
    def fit(self, sentences):
        
        # Yields all ngrams of a list of sentences, also yields the n-1, n-2, ... 1 grams at the end of each sentence.
        # For my model this is important, as I build a single tree with depth n to represent all contexts. The shorter
        # grams at the end of the sentence are required to properly fill the top layers of the tree.
        def all_grams(n, sentences):
            for sentence in sentences:
                for i in range(len(sentence)):
                    yield sentence[i:min(len(sentence), i+n)]
        
        # Construct a tree based on the ngrams in the input sentences
        self.tree = {}
        self.tree['count'] = 0 #len_grams
        self.tree['words'] = {}
        for gram in all_grams(self.n, sentences):
            reference = self.tree
            self.tree['count'] += 1
            for i, word in enumerate(gram):
                if word in reference['words']: 
                    reference = reference['words'][word]
                    reference['count'] += 1
                else:
                    reference['words'][word] = {}
                    reference = reference['words'][word]
                    reference['count'] = 1
                    # Don't create maps for the leaves
                    if i != self.n - 1:
                        reference['words'] = {}
                        
        # Create integer bins for all words in each context based on their appearances.
        # These bins will be used to find a random successor to contexts with probabilities
        # relative to the number of appearances.
        # Note: keys() and values() of a dictionary iterate in the same order as long as the
        # dictionary is not modified (insertion order since Python 3.7).
        # Therefore associating the bin with index 0 with the key at index 0 is OK.
        def build_bins(tree):
            if 'words' not in tree:
                return
            else:
                tree['bins'] = [0]
                for values in tree['words'].values():
                    tree['bins'].append(tree['bins'][-1] + values['count'])
                    build_bins(values)
        

        build_bins(self.tree)

        
    def generate_sentence(self, sentence, n = None, backoff_prob = .25, max_length = 100):
        if not n:
            n = self.n
        #print("Initial n: {}".format(n))
        # Randomly back off to a smaller n with the probability backoff_prob
        while random.random() <= backoff_prob and n > 1:
            n-=1
        #print("Backoff n: {}".format(n))
        n = min(len(sentence), n - 1)
        #print("Final n: {}".format(n))
        reference = self.tree
        #print("Context: {}".format(sentence[-n:]))
        if n > 0:
            context = sentence[-n:]
        else:
            context = []
        for word in context:
            reference = reference['words'][word]
        rand_i = random.randint(0, reference['count'] - 1)
        word_index = bisect.bisect_right(reference['bins'], rand_i) - 1
        word = list(reference['words'].keys())[word_index]
        sentence.append(word)
        if word == "<EOS>" or len(sentence) == max_length:
            return sentence
        else:
            # Generate the next word in the sentence based on what is generated so far
            # Do not allow a longer context than what was used this time + 2 
            return self.generate_sentence(sentence, min(n+2, self.n), backoff_prob, max_length)
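
The bin lookup in `generate_sentence` can be hard to picture, so here is a small standalone sketch of the same idea: cumulative counts plus `bisect` give a draw proportional to each word's count. The counts and the helper names `word_for` and `draw` are invented for illustration.

```python
import bisect
import random
from collections import Counter

counts = {'a': 6, 'b': 1, 'c': 1}   # made-up successor counts for one context

# Cumulative bin edges: word k owns the integers bins[k] .. bins[k+1] - 1
words = list(counts)
bins = [0]
for w in words:
    bins.append(bins[-1] + counts[w])

def word_for(r):
    """Map an integer in [0, total) to the word whose bin contains it."""
    return words[bisect.bisect_right(bins, r) - 1]

def draw(rng):
    """Draw one word with probability proportional to its count."""
    return word_for(rng.randint(0, bins[-1] - 1))

rng = random.Random(0)
sample = Counter(draw(rng) for _ in range(8000))
print(bins)
print(sample)   # 'a' should come up far more often than 'b' or 'c'
```

The lookup costs one binary search per word, which is why the bins are precomputed once in `fit` rather than rebuilt at every step.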
         
        

In [22]:
treegram = NGramsTree(3)

In [23]:
example_sentences = [
    ['a', 'a', 'a'],
    ['a', 'a', 'a'],
    ['a', 'a', 'c'],
    ['a', 'b', 'a'],
    ['b', 'a', 'a'],
]

In [24]:
%time treegram.fit(example_sentences)

Wall time: 0 ns


In [25]:
import json

Here is an example of what the tree structure looks like for trigrams.

In [26]:
print(json.dumps(treegram.tree, sort_keys=True,
                      indent=4, separators=(',', ': ')))

{
    "bins": [
        0,
        12,
        13,
        15
    ],
    "count": 15,
    "words": {
        "a": {
            "bins": [
                0,
                6,
                7,
                8
            ],
            "count": 12,
            "words": {
                "a": {
                    "bins": [
                        0,
                        2,
                        3
                    ],
                    "count": 6,
                    "words": {
                        "a": {
                            "count": 2
                        },
                        "c": {
                            "count": 1
                        }
                    }
                },
                "b": {
                    "bins": [
                        0,
                        1
                    ],
                    "count": 1,
                    "words": {
                        "a": {
                            "count": 1
         

In [27]:
treegram = NGramsTree(4)
%time treegram.fit(sentences)

Wall time: 956 ms


In [28]:
def print_model_sentence(sentence):
    # Strip <BOS> and <EOS> tags.
    sentence = sentence[1:-1]
    print(" ".join(sentence))

In [29]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>'], backoff_prob = 0))

As the sound came also a strong , rushing wind , its stormy breath clearly uttering the words , `` Everything in its right place , '' said the little princess , dressed in silk and velvet , rejoicing in the balmy breezes laden with the fragrance from the fresh verdure , and the mermaid saw that the old people were very active and industrious ; they were at rest and had much better things to do
In a moment there came into the garden
The king and queen are present .
Not a bird was to be married
Suddenly he fancied he heard feet outside going pitapat


Cool! It seems like the generated sentences often closely follow sentences from the book. This is expected as I do no smoothing at all.

I would like the sentences to derail a little bit more. This could possibly be solved by increasing the size of my training data.

In [30]:
treegram = NGramsTree(5)
%time treegram.fit(sentences)

Wall time: 1.37 s


In [31]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>'], backoff_prob = 0))

`` We are glad of that , '' said Great Claus
Attend to me ; that 's far more important
`` I will not fly , '' said one of the plants in the room , who caught him and stuck him on a pin in a box of curiosities
The lamp before the picture of the Madonna threw a strong light on the pale , delicate face of the child
The color has been changed by age to dark green , but clear , fresh water pours from the snout , which shines as if it had been a beautiful human being


With 5-grams most of the sentences seem to be stolen straight out of the text.

# Testing the Model
Let's test the model on the task it was intended for: generating a funny "Skitsnack".

First off, I will need to find a new text. It should preferably be longer than the previous one, and should also be in Swedish.

## Parsing texts from Språkbanken
[Språkbanken](https://spraakbanken.gu.se/swe/resurser) has a large corpus collection. I have downloaded two:
* [August Strindbergs romaner](https://spraakbanken.gu.se/swe/resurs/strindbergromaner) - August Strindberg's collected novels and dramas. 321,759 sentences.
* [Bloggmix 2005](https://spraakbanken.gu.se/swe/resurs/bloggmix2005) - A collection of Swedish blog posts from 2005. 280,905 sentences.

Both contain a lot more information than just sentences. For example, every sentence has been dependency parsed, and every word is accompanied by metadata such as a part-of-speech (POS) tag. It could be interesting to use POS tags in a language model. For example, it makes sense to assign extra probability to all nouns in contexts where we most often see nouns.
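
One hypothetical way to act on the POS idea is a class-based bigram model, where the probability of a word is factored as P(tag | previous tag) * P(word | tag). This is only a sketch on a hand-tagged toy example; the tag names and the data are invented, and it does not reflect how Språkbanken encodes its annotations.

```python
from collections import Counter

# Hand-tagged toy data as (word, tag) pairs. Invented for illustration.
tagged = [
    [('jag', 'PRON'), ('äter', 'VERB'), ('glass', 'NOUN')],
    [('jag', 'PRON'), ('äter', 'VERB'), ('tacos', 'NOUN')],
    [('hon', 'PRON'), ('köper', 'VERB'), ('glass', 'NOUN')],
]

tag_bigrams = Counter()      # (previous tag, tag) -> count
tag_context = Counter()      # previous tag -> count
word_given_tag = Counter()   # (word, tag) -> count
tag_totals = Counter()       # tag -> count
for sent in tagged:
    for w, t in sent:
        word_given_tag[(w, t)] += 1
        tag_totals[t] += 1
    for (_, t1), (_, t2) in zip(sent, sent[1:]):
        tag_bigrams[(t1, t2)] += 1
        tag_context[t1] += 1

def class_prob(word, tag, prev_tag):
    """P(tag | prev_tag) * P(word | tag), both as relative frequencies."""
    p_tag = tag_bigrams[(prev_tag, tag)] / tag_context[prev_tag]
    p_word = word_given_tag[(word, tag)] / tag_totals[tag]
    return p_tag * p_word

# 'köper tacos' never occurs as a word bigram, but a noun after a verb is
# common, so 'tacos' still gets probability mass in that position.
p = class_prob('tacos', 'NOUN', 'VERB')
print(p)
```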

The sentences in both corpora have been reordered. The sentences' internal structures are not altered, but it is not possible to look for context outside of the sentence. This is not a problem for my model.

In [32]:
import xml.etree.ElementTree as ET

## Bloggmix 2005

In [33]:
tree = ET.parse('sprakbanken/bloggmix2005.xml')

Structure:
```xml
<corpus>
   <blog>
       <text>
           <sentence>
               <w>
               </w>
           </sentence>
       </text>
   </blog>
</corpus>
```
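
To show how this nesting maps to lists of words, here is a small self-contained sketch that parses a hand-written snippet with the same structure. The snippet content and the name `root_example` are invented for illustration; only the element names and the `title` attribute are taken from the corpus.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written document with the same nesting as above (invented content)
snippet = """
<corpus>
  <blog title="Example blog">
    <text>
      <sentence><w>Hej</w><w>världen</w><w>!</w></sentence>
      <sentence><w>Jag</w><w>bloggar</w><w>.</w></sentence>
    </text>
  </blog>
</corpus>
""".strip()

root_example = ET.fromstring(snippet)

# iter() walks the whole subtree, so the blog/text levels need not be spelled out
parsed = [[w.text for w in s.iter('w')] for s in root_example.iter('sentence')]
print(parsed)
```

Using `iter` flattens all blogs into one list of sentences, which is all the n-gram model needs.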

In [34]:
root = tree.getroot()

Let's have a look at some of the blogs in the corpus.

In [35]:
for child in list(root)[:5]:  # getchildren() was removed in Python 3.9
    print("Blog: {}".format(child.attrib['title']))

Blog: Tatiana Rojas
Blog: BY KAROLINAA
Blog: Angelicas Gnällbänk
Blog: Emma's WindOw
Blog: Johanna Sjödin


In [36]:
def parse_sentence(sentence_xml):
    # Iterate over the <w> elements directly; getchildren() was removed in Python 3.9
    sentence = []
    for word in sentence_xml:
        text = word.text
        # Censored words only contain a line break
        text = "<CENSURERAT>" if text == "\n" else text
        sentence.append(text)
    return sentence

def parse_text(text_xml):
    text = []
    for sentence in text_xml:
        text.append(parse_sentence(sentence))
    return text

def parse_blog(blog_xml):
    blog = []
    for text in blog_xml:
        blog.append(parse_text(text))
    # Flatten the list of texts into a single list of sentences
    return list(reduce(lambda x, y: x + y, blog))

In [37]:
blogs = []
for child in root:
    blogs.append(parse_blog(child))

sentences = list(reduce(lambda x, y: x + y, blogs)) 

In [38]:
sentences = list(map(lambda x: ['<BOS>'] + x + ['<EOS>'], sentences))

In [39]:
sentences[0]

['<BOS>',
 'Äter',
 '<CENSURERAT>',
 'gud',
 'vet',
 'hur',
 'många',
 'kilo',
 'jag',
 'har',
 'gått',
 'upp',
 'men',
 'jag',
 'bryr',
 'mig',
 'inte',
 ',',
 'jag',
 'lever',
 'loppan',
 '.',
 '<EOS>']

In [123]:
len(sentences)

280905

Training on the full data set takes quite a while. Let's train on a subset of the data initially.

In [40]:
fourgram = NGramsTree(4)
%time fourgram.fit(sentences[:100000])

Wall time: 30.8 s


In [44]:
p = 0
print("Backoff probability {}%:".format(p * 100))
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    
p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    
p+= .05
print("\nBackoff probability {}%:".format(p * 100))    
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

Backoff probability 0%:
Kurslitteratur inom IT är en lurig grej .
: S , hmm isf har de bara fått mig att må bra .
<CENSURERAT> fastslog <CENSURERAT> som sagt i studion och fotade <CENSURERAT> och <CENSURERAT> .
_Karolina Skande 8 Kommentarer
Med nytt ledarskap är det hög tid för nytt ledarskap .

Backoff probability 5.0%:
Kurslitteratur inom IT är en lurig grej .
: S , hmm isf har de bara fått mig att må bra .
<CENSURERAT> fastslog <CENSURERAT> som sagt i studion och fotade <CENSURERAT> och <CENSURERAT> .
Men det är för många och det slutade med en liten ridbana eller något så gör det ! "
OBS ! !

Backoff probability 10.0%:
Kurslitteratur inom IT är en lurig grej .
: S , hmm isf har de bara fått mig nån sömn så jag lär ju ändra det Aja , <CENSURERAT> ska här bloggas för alla mina 500 facebookvänner som också haft sina tal än och vill inte splittras från min älskade <CENSURERAT> och käkade tacos !
Den här , eller rent ut sagt , inte blev värre .
Eller fer sure , låter coolare .
Jag tror

In [45]:
fivegram = NGramsTree(5)
%time fivegram.fit(sentences[:100000])

Wall time: 2min 4s


In [47]:
p = 0
print("Backoff probability {}%:".format(p * 100))
random.seed(1)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(1)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    
p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(1)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    
p+= .05
print("\nBackoff probability {}%:".format(p * 100))    
random.seed(1)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

p+= .05
print("\nBackoff probability {}%:".format(p * 100))
random.seed(1)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))

Backoff probability 0%:
<CENSURERAT> ringde min klocka vid tio och jag höll på att smälla av .... fick kramp i magen typ .. <CENSURERAT> hämtade oss och vi åkte hem till mig och satt bredvid varandra med varnsin dator och var allmänt nördiga .
Bäst att fortsätta plugga <CENSURERAT> , fika ingår såklart !
Jag behöver nog inte oroa mig för att de inte är allt … Men attans så mycket enklare mitt liv skulle bli !
Nej , avskaffa monarkin .
Vill du anmäla dig till dagens blogg så gör du det HÄÄÄR .

Backoff probability 5.0%:
<CENSURERAT> ringde min klocka vid tio och jag höll på att somna rakt upp och ned kan jag få rysningar .
Äntligen platt mage !
Nej jag vet inte men antar att jag får skylla mig själv , jag behöver aldrig göra mig till och låtsas vara någon annan för att du ska nå ditt må l .
<CENSURERAT> <CENSURERAT> har jag mejlat " affären " och frågat om jag får lägga undan , inte för att utbildas utan för att de på ena intervjun säger en lön och på den andra intervjun säger en annan 

Cool, the sentences have a nice flow but often derail. More so in the case of 4-grams than 5-grams, which still often look to be stolen straight from the blogs.

To tackle this I will add some kind of probability of choosing words present in a shorter context. Ideally this should be done with a proper backoff model, but as the day of the big "Skitsnack" is drawing closer I will probably just implement some quasi-rational solution.
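
A sketch of what such a quasi-rational solution might look like: for each new word, pick a context length at random according to fixed weights and sample the successor from that context's counts, dropping to a shorter context if the chosen one was never seen. Whenever the chosen context is available, this amounts to sampling from a linear interpolation of the different n-gram orders. The weights, the toy corpus, and all names below are assumptions for illustration; this is not the `NGramsTree` code above.

```python
import random
from collections import Counter, defaultdict

def fit_counts(sentences, n):
    """Map each context tuple of length 0..n-1 to a Counter of successor words."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        for i in range(1, len(sentence)):   # start at 1: <BOS> is never predicted
            for k in range(n):
                if i - k < 0:
                    break
                successors[tuple(sentence[i - k:i])][sentence[i]] += 1
    return successors

def next_word(successors, history, weights, rng):
    """weights[k] is the chance of using a context of k words (k = 0 .. n-1)."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    k = min(k, len(history))
    # Drop to shorter contexts until one has been seen in training
    while tuple(history[len(history) - k:]) not in successors:
        k -= 1
    options = successors[tuple(history[len(history) - k:])]
    return rng.choices(list(options), weights=list(options.values()))[0]

def generate(successors, weights, rng, max_length=30):
    sentence = ['<BOS>']
    while sentence[-1] != '<EOS>' and len(sentence) < max_length:
        sentence.append(next_word(successors, sentence, weights, rng))
    return sentence

# Invented toy corpus
toy = [
    ['<BOS>', 'a', 'b', 'c', '<EOS>'],
    ['<BOS>', 'a', 'c', 'b', '<EOS>'],
]
model = fit_counts(toy, 3)
generated = generate(model, weights=[0.1, 0.3, 0.6], rng=random.Random(0))
print(generated)
```

Putting more weight on the short contexts should make the sentences derail more, while more weight on the long contexts keeps them closer to the source text.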