# Introduction
During my orchestra [LiTHe Blås](http://litheblas.org)'s 45-year anniversary I will be doing a customary "Skitsnack" between two songs.
A "Skitsnack" is where an orchestra member keeps the audience preoccupied by talking about anything between heaven and earth until we start playing the next song.
I often find it hard to decide what I should talk about and usually end up telling really bad jokes or rambling about some algorithm I just read about. 

For this "Skitsnack" I will avoid coming up with my own material entirely by generating it with an n-gram model!
The focus of this Notebook is simply to generate a short text which I think could entertain a group of 500 for about 30-60 seconds.

There are some nice aspects of the task of generating a "funny" text from a predefined dictionary:
* I can easily perform extrinsic testing. "Is this text funny?"
* I don't have to worry about unknown words.

As such, any quantitative evaluation will only happen on the spur of the moment.

Inspiration:
https://web.stanford.edu/~jurafsky/slp3/4.pdf

# Training data
I haven't decided what data I should use to train my model, and will probably have to try some different sources when the model is finished. The final text will definitely be in Swedish to accommodate the audience, but to get started I will use the Danish author [H.C. Andersen's Fairy Tales](http://www.gutenberg.org/ebooks/32572) translated to English.

## Cleaning the data

In [1]:
import codecs

In [2]:
with codecs.open('hcandersen_fairy_tales.txt', 'r', encoding='utf-8') as f:
    text = f.read()

First let's remove all text which is not part of his stories, as I don't want to use this for training.

In [3]:
import re

In [4]:
for i in re.finditer(r"HANS ANDERSEN'S FAIRY TALES", text):
    print(i.start(0), i.end(0))
    # Hiding the output for future readability
    #print(text[i.start(0): i.end(0) + 200])
    #print("###########################")

627 654
4021 4048
4140 4167
375029 375056


In [5]:
text = text[4167:]

In [6]:
for i in re.finditer(r"NOTES", text):
    print(i.start(0), i.end(0))
    print(text[i.start(0): i.end(0) + 200])
    print("###########################")

367113 367118
NOTES


THE STORKS

          PAGE 29. On account of the ravages it makes among
          noxious animals, the stork is a privileged bird
          wherever it makes its home. In cities it is
     
###########################


In [7]:
text = text[:367113 ]
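
The offsets `4167` and `367113` only hold for this exact download of the file. A minimal sketch of how the same slice could be derived from the markers instead, assuming the stories start after a chosen occurrence of the title and end at the first `NOTES` heading that follows it. `slice_between` and its parameters are hypothetical names, not part of the notebook.

```python
import re

def slice_between(text, start_marker, end_marker, start_occurrence=0):
    """Return the text after the chosen occurrence of start_marker and
    before the first end_marker that follows it."""
    starts = [m.end() for m in re.finditer(re.escape(start_marker), text)]
    begin = starts[start_occurrence]
    end = text.find(end_marker, begin)
    if end == -1:
        # No end marker found: keep everything up to the end of the text
        end = len(text)
    return text[begin:end]

# Toy stand-in for the book: front matter, body, notes, and a trailing title
toy = "TITLE contents TITLE Once upon a time. NOTES page 29 TITLE"
body = slice_between(toy, "TITLE", "NOTES", start_occurrence=1)
print(repr(body))
```

For the real file this would presumably be called with `"HANS ANDERSEN'S FAIRY TALES"`, `"NOTES"` and `start_occurrence=2`, matching the third match printed above.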

Let's remove all carriage returns and replace line breaks with spaces.

In [8]:
text = re.sub(r'\r', '', text)
text = re.sub(r'\n', ' ', text)

## Tokenization
### Sentences
First let's parse the text as sentences using NLTK.

In [9]:
from nltk.tokenize import sent_tokenize

In [10]:
sentences = sent_tokenize(text)

In [11]:
print(sentences[0])
print(sentences[1])

     THE FLAX   THE flax was in full bloom; it had pretty little blue flowers, as delicate as the wings of a moth.
The sun shone on it and the showers watered it; and this was as good for the flax as it is for little children to be washed and then kissed by their mothers.


The first one is not good: it includes the story title. I will not bother with this at the moment though, as this is not the text I will be using for my final results.

### Words

In [12]:
from nltk.tokenize import word_tokenize

In [13]:
sentences = list(map(word_tokenize, sentences))

In [14]:
print(sentences[1])

['The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '.']


### Add beginning and end of sentence tags
All my sentences need a root and a stop token. I will insert a `<BOS>` tag before the first word of each sentence and replace the final punctuation mark of each sentence with `<EOS>`. I do this to distinguish between in-sentence punctuation and the actual end of a sentence.

In [15]:
sentences = list(map(lambda x: ['<BOS>'] + x[:-1] + ['<EOS>'], sentences))

In [16]:
print(sentences[1])

['<BOS>', 'The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '<EOS>']
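
The lambda above drops the last token unconditionally, so a sentence that does not end in a punctuation mark would lose its last word. A hedged variant that only drops the final token when it is pure punctuation could look like this. `add_boundary_tags` is a hypothetical helper, and it only recognises the ASCII marks in `string.punctuation`.

```python
import string

def add_boundary_tags(tokens, bos="<BOS>", eos="<EOS>"):
    """Wrap a tokenized sentence in boundary tags.

    The final token is dropped only if it consists purely of punctuation,
    so sentences without a closing mark keep their last word.
    """
    if tokens and all(ch in string.punctuation for ch in tokens[-1]):
        tokens = tokens[:-1]
    return [bos] + list(tokens) + [eos]

print(add_boundary_tags(["The", "sun", "shone", "."]))
print(add_boundary_tags(["jag", "lever", "loppan"]))
```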


# N-Gram model

## Constructing n-grams from the sentences
Let's try the `ngrams` function from NLTK.

In [17]:
from nltk import ngrams

## Implementing a generative model
When generating my text I will not be considering context across sentences. Each `<EOS>` tag will be followed by a fresh `<BOS>`, interpreted as a unigram.

In [18]:
from functools import reduce
import numpy as np
import pandas as pd

In [19]:
class NGrams():
    def __init__(self, n):
        self.n = n
        
    def fit(self, sentences):
        # Count all ngrams
        self.grams = []
        for i in range(1, self.n+1):
            # Build a sorted list of all ngrams
            grams = list(map(lambda x: list(ngrams(x, i)), sentences))
            grams = np.array(reduce(lambda x, y: x + y, grams))
            grams = grams[np.lexsort(grams[:,::-1].T,),:]
            
            # Build an array marking the first unique occurence of a n-gram
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [True, False, True, True, False]
            first_occurence = np.append([True], np.array([np.any(grams[i-1] != grams[i]) for i in range(1, len(grams))]))
            
            # Assign each unique n-gram an index
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [0, 0, 1, 2, 2]
            ids = np.append([0], first_occurence[1:].cumsum())
            
            # Build a mapping from n-gram to count
            frequencies = dict(list(zip(map(tuple, grams[first_occurence,:]), np.bincount(ids))))
            self.grams.append(frequencies)
        
        # Record the total number of unigrams
        self.n_unigrams = sum(self.grams[0].values())         
        
    def log_prob(self, word, context):
        n = len(context)
        gram = context + (word,)
        if n == 0:
            return np.log(self.grams[n][gram] / self.n_unigrams)
        else:
            return np.log(self.grams[n][gram] / self.grams[n-1][context])      

In [20]:
len(sentences)

3902

In [21]:
trigram = NGrams(3)

In [22]:
%time trigram.fit(sentences)

Wall time: 9.41 s
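
Much of the work in `fit` goes into building and sorting one large array of n-grams. A possible alternative is to count directly with `collections.Counter`, which should give the same n-gram-to-count mapping as `self.grams`. This is only a sketch, `count_ngrams` is a hypothetical name, and it has not been timed against the NumPy version.

```python
from collections import Counter

def count_ngrams(sentences, max_n):
    """Return a list where entry k-1 maps each k-gram tuple to its count."""
    counts = [Counter() for _ in range(max_n)]
    for sentence in sentences:
        for k in range(1, max_n + 1):
            # Slide a window of width k over the sentence
            for start in range(len(sentence) - k + 1):
                counts[k - 1][tuple(sentence[start:start + k])] += 1
    return counts

toy = [["<BOS>", "a", "b", "<EOS>"], ["<BOS>", "a", "c", "<EOS>"]]
counts = count_ngrams(toy, 2)
print(counts[0][("a",)])          # unigram count of "a"
print(counts[1][("<BOS>", "a")])  # bigram count of ("<BOS>", "a")
```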


In [23]:
trigram.log_prob('the', ('!', '--'))

0.0

In [24]:
trigram.log_prob('call', ("'ll",))

-3.4657359027997265

In [25]:
trigram.log_prob('call', ())

-8.7714179106038408

Cool, we can calculate the log probabilities of a word following an ngram. To avoid overfitting to the exact sentences in the text some smoothing should be done. Without smoothing we will assign zero probability to all ngrams that are not present in the text.
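
One of the simplest smoothing schemes is add-k (Laplace) smoothing, where every possible successor gets `k` extra counts. The sketch below is not what `NGrams.log_prob` does. It assumes the counts are stored in `Counter` objects keyed by tuples, and `smoothed_log_prob`, `vocab_size` and `k` are names introduced for illustration.

```python
import math
from collections import Counter

def smoothed_log_prob(word, context, ngram_counts, context_counts, vocab_size, k=1.0):
    """Add-k estimate of log P(word | context)."""
    # A Counter returns 0 for unseen keys, so unseen n-grams still get mass k
    numerator = ngram_counts[context + (word,)] + k
    denominator = context_counts[context] + k * vocab_size
    return math.log(numerator / denominator)

bigrams = Counter({("the", "flax"): 2, ("the", "sun"): 1})
unigrams = Counter({("the",): 3, ("flax",): 2, ("sun",): 1})
vocab_size = 3

seen = smoothed_log_prob("flax", ("the",), bigrams, unigrams, vocab_size)
unseen = smoothed_log_prob("the", ("the",), bigrams, unigrams, vocab_size)
print(seen, unseen)
```

The unseen bigram `("the", "the")` now gets a finite log probability instead of raising a `KeyError` or evaluating to `log(0)`.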

However, I have encountered a problem: At the moment I don't feel creative enough to come up with a generative model that finds the next word in a sentence based on smoothed probabilities. Maybe I started off at the wrong end?

## Simple text generating model
I will start off with an unsmoothed model, and at the same time also switch to a tree structure to represent the model vocabulary.

In [26]:
import random
import bisect
import sys

In [36]:
class NGramsTree():
    def __init__(self, n):
        self.n = n
    def fit(self, sentences):
         # Build a list of all ngrams
        gram_lists = map(lambda x: list(ngrams(x, self.n)), sentences)
        #len_grams = reduce(lambda x, y: x + len(y), gram_lists, 0)
        #grams = np.array(reduce(lambda x, y: x + y, grams))
        
        # Construct a tree based on the ngrams
        self.tree = {}
        self.tree['count'] = 0 #len_grams
        self.tree['words'] = {}
        for gram_list in gram_lists:
            self.tree['count'] += len(gram_list)
            for gram in gram_list:
                reference = self.tree
                for i, word in enumerate(gram):
                    if word in reference['words']: 
                        reference = reference['words'][word]
                        reference['count'] += 1
                    else:
                        reference['words'][word] = {}
                        reference = reference['words'][word]
                        reference['count'] = 1
                        # Don't create maps for the leaves
                        if i != self.n - 1:
                            reference['words'] = {}
                        
        # Create integer bins for all words in each context based on their appearances.
        # These bins will be used to find a random successor to contexts with probabilities
        # relative to number of appearances.
        # Note: A dictionary iterates in a stable order as long as it is not modified
        # (insertion order since Python 3.7).
        # Therefore associating the bin with index 0 with the key at index 0 is OK.
        def build_bins(tree):
            if 'words' not in tree:
                return
            else:
                tree['bins'] = [0]
                for values in tree['words'].values():
                    tree['bins'].append(tree['bins'][-1] + values['count'])
                    build_bins(values)
        

        build_bins(self.tree)

        
    def generate_sentence(self, sentence, max_length = 100):
        n = min(len(sentence), self.n - 1)
        reference = self.tree
        #print(sentence, sentence[-n:])
        for word in sentence[-n:]:
            reference = reference['words'][word]
        rand_i = random.randint(0, reference['count'] - 1)
        word_index = bisect.bisect_right(reference['bins'], rand_i) - 1
        word = list(reference['words'].keys())[word_index]
        sentence.append(word)
        if word == "<EOS>" or len(sentence) >= max_length:
            return sentence
        else:
            # Pass max_length along so a custom limit survives the recursion
            return self.generate_sentence(sentence, max_length)
         
        

In [43]:
treegram = NGramsTree(3)

In [44]:
example_sentences = [
    ['a', 'a', 'a'],
    ['a', 'a', 'a'],
    ['a', 'a', 'c'],
    ['a', 'b', 'a'],
    ['b', 'a', 'a'],
]

In [45]:
%time treegram.fit(example_sentences)

Wall time: 0 ns


In [46]:
import json

Here is an example of what the tree structure looks like for trigrams.

In [47]:
print(json.dumps(treegram.tree, sort_keys=True,
                      indent=4, separators=(',', ': ')))

{
    "bins": [
        0,
        4,
        5
    ],
    "count": 5,
    "words": {
        "a": {
            "bins": [
                0,
                3,
                4
            ],
            "count": 4,
            "words": {
                "a": {
                    "bins": [
                        0,
                        2,
                        3
                    ],
                    "count": 3,
                    "words": {
                        "a": {
                            "count": 2
                        },
                        "c": {
                            "count": 1
                        }
                    }
                },
                "b": {
                    "bins": [
                        0,
                        1
                    ],
                    "count": 1,
                    "words": {
                        "a": {
                            "count": 1
                        }
                  

In [48]:
treegram = NGramsTree(4)
%time treegram.fit(sentences)

Wall time: 820 ms
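
The `bins` lists in the tree drive the sampling step in `generate_sentence`: a random integer below the context's total count is mapped to a word through the cumulative counts. A stand-alone sketch of that idea follows. `sample_successor` is a hypothetical function, and unlike the tree it stores only the upper bounds (no leading 0), so no `- 1` is needed after the bisect.

```python
import bisect
import random
from itertools import accumulate

def sample_successor(counts, rng=random):
    """Draw a key from counts with probability proportional to its count."""
    words = list(counts)
    # Cumulative upper bounds, e.g. {"a": 4, "b": 1} -> [4, 5]
    bins = list(accumulate(counts.values()))
    rand_i = rng.randrange(bins[-1])  # integer in [0, total)
    return words[bisect.bisect_right(bins, rand_i)]

rng = random.Random(0)
draws = [sample_successor({"a": 4, "b": 1}, rng) for _ in range(1000)]
print(draws.count("a") / len(draws))  # should be close to 0.8
```

The standard library's `random.choices(words, weights=...)` does the same weighted draw in a single call, at the cost of not caching the cumulative sums per context.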


In [49]:
def print_model_sentence(sentence):
    # Strip <BOS> and <EOS> tags.
    sentence = sentence[1:-1]
    print(" ".join(sentence))

In [50]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>']))

`` How well his Majesty looks in the new clothes here , before the great mirror ?
`` Do you think I could get some sea cattle if I were really in the kitchen -- the sounds of song and gladness
Lovers pluck off the leaves , and a little flower for a companion .
`` Wax tapers , indeed !
The youngest brother wept , and turned to the mint on the ground


Cool! The sentences often closely follow sentences from the book. This is expected as I do no smoothing at all.

I would like the sentences to derail a little bit more. This could possibly be solved by increasing the size of my training data.

In [52]:
treegram = NGramsTree(5)
%time treegram.fit(sentences)

  


Wall time: 1.16 s


In [53]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>']))

Thy dear wife sits in the nest , To lull the little ones to rest
All the thousands of little imps in the water jumped and sprang about , devouring each other , or tearing each other to bits
`` Poor little Ephemera !
asked he , his eyes big and round with amazement at what he saw
said the thistle , thinking of the flower she had given to the buttonhole


With 5-grams most of the sentences seem to be stolen straight out of the text.
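
The impression that sentences are copied could be checked rather than eyeballed. A rough sketch, assuming generated and training sentences are both token lists that include the `<BOS>`/`<EOS>` tags; `is_verbatim` and `copy_rate` are hypothetical helpers, and the brute-force scan is only meant for a handful of generated sentences.

```python
def is_verbatim(generated, training_sentences):
    """True if `generated` occurs as a contiguous token run in any training sentence."""
    n = len(generated)
    target = tuple(generated)
    for sentence in training_sentences:
        for start in range(len(sentence) - n + 1):
            if tuple(sentence[start:start + n]) == target:
                return True
    return False

def copy_rate(generated_sentences, training_sentences):
    """Fraction of generated sentences that are copied verbatim."""
    copied = sum(is_verbatim(g, training_sentences) for g in generated_sentences)
    return copied / len(generated_sentences)

train = [["<BOS>", "a", "b", "c", "<EOS>"], ["<BOS>", "a", "d", "<EOS>"]]
generated = [["<BOS>", "a", "b", "c", "<EOS>"], ["<BOS>", "a", "b", "d", "<EOS>"]]
print(copy_rate(generated, train))
```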

# Testing the Model
Let's test the model on the task it was intended for: generating a funny "Skitsnack".

First off, I will need to find a new text. It should preferably be longer than the previous one, and it should also be in Swedish.

## Parsing texts from Språkbanken
[Språkbanken](https://spraakbanken.gu.se/swe/resurser) has a large corpus collection. I have downloaded two:
* [August Strindbergs romaner](https://spraakbanken.gu.se/swe/resurs/strindbergromaner) - August Strindberg's collected novels and dramas. 321,759 sentences.
* [Bloggmix 2005](https://spraakbanken.gu.se/swe/resurs/bloggmix2005) - A collection of Swedish blog posts from 2005. 280,905 sentences.

Both contain a lot more information than just sentences. For example, every sentence has been dependency parsed, and every word is accompanied by metadata such as a part-of-speech (POS) tag. It could be interesting to use POS tags in a language model. For example, it makes sense to assign extra probability to all nouns in contexts where we most often see nouns.

The sentences in both corpora have been reordered. The internal structure of each sentence is not altered, but it is not possible to look for context outside of the sentence. This is not a problem for my model.

In [54]:
import xml.etree.ElementTree as ET

## Bloggmix 2005

In [55]:
tree = ET.parse('sprakbanken/bloggmix2005.xml')

Structure:
```xml
<corpus>
   <blog>
       <text>
           <sentence>
               <w>
               </w>
           </sentence>
       </text>
   </blog>
</corpus>
```
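
Given this structure, the sentences could also be streamed with `ET.iterparse` instead of loading the whole tree first. This is a sketch under the assumption that the tags are exactly `sentence` and `w` as outlined above; `iter_sentences` is a hypothetical helper, and it keeps the raw `w.text` without any replacement of censored words.

```python
import io
import xml.etree.ElementTree as ET

def iter_sentences(source):
    """Yield each <sentence> as a list of word strings, freeing elements as it goes."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sentence":
            yield [w.text for w in elem.iter("w")]
            elem.clear()  # release the words of a sentence once it is consumed

# Toy document with the same nesting as the corpus
toy = io.BytesIO(
    b"<corpus><blog><text>"
    b"<sentence><w>Hej</w><w>!</w></sentence>"
    b"<sentence><w>Jag</w><w>lever</w></sentence>"
    b"</text></blog></corpus>"
)
parsed = list(iter_sentences(toy))
print(parsed)
```

`ET.iterparse` also accepts a filename, so the same generator should work on the corpus file path.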

In [56]:
root = tree.getroot()

Let's have a look at some of the blogs in the corpus.

In [57]:
for child in list(root)[:5]:
    print("Blog: {}".format(child.attrib['title']))

Blog: Tatiana Rojas
Blog: BY KAROLINAA
Blog: Angelicas Gnällbänk
Blog: Emma's WindOw
Blog: Johanna Sjödin


In [77]:
def parse_sentence(sentence_xml):
    sentence = []
    # Iterate over the element directly; getchildren() was removed in Python 3.9
    for word in sentence_xml:
        text = word.text
        text = "<CENSURERAT>" if text == "\n" else text
        sentence.append(text)
    return sentence

def parse_text(text_xml):
    text = []
    for sentence in text_xml:
        text.append(parse_sentence(sentence))
    return text

def parse_blog(blog_xml):
    blog = []
    for text in blog_xml:
        blog.append(parse_text(text))
    # The [] initial value keeps reduce from failing on a blog without texts
    return list(reduce(lambda x, y: x + y, blog, []))

In [78]:
blogs = []
for child in root:
    blogs.append(parse_blog(child))

sentences = list(reduce(lambda x, y: x + y, blogs)) 

In [79]:
sentences = list(map(lambda x: ['<BOS>'] + x[:-1] + ['<EOS>'], sentences))

In [80]:
sentences[0]

['<BOS>',
 'Äter',
 '<NAMN>',
 'gud',
 'vet',
 'hur',
 'många',
 'kilo',
 'jag',
 'har',
 'gått',
 'upp',
 'men',
 'jag',
 'bryr',
 'mig',
 'inte',
 ',',
 'jag',
 'lever',
 'loppan',
 '<EOS>']

In [81]:
len(sentences)

280905

Training on the full data set takes quite a while. Let's train on a subset of the data initially.

In [91]:
fourgram = NGramsTree(4)
%time fourgram.fit(sentences[:100000])

  


Wall time: 1min 39s


In [93]:
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>']))

Så svårt när man måste duscha är fasen ingen hit , är lite less på det just <NAMN> så var knappast utvilad när klockan ringde
Broccoli gratinerades med cheddarost med <NAMN> ( på en trestjärnig lyxkrog där dagens kostar <NAMN> förstås ! )
älskar dem !
Så här sitter jag o drar på , medans det i själva verket släppta till media av personer som är upprörda över hur verket hanterar jämställdhetsfrågorna
Vi har installerat oss på hotellet , shoppat en del , men det var det inte en enda minut man inte log


In [89]:
fivegram = NGramsTree(5)
%time fivegram.fit(sentences[:100000])

  


Wall time: 27.4 s


In [90]:
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>']))

Inte helt sällan går det flera resor i hopp om att näshåren skulle råka filtrera bort lite sjuka men så ren som min näsa är efter 10 månaders beroende så ja , om <NAMN> kommer jag ligga och beklaga mig över en extrem förkylning eller svininfluensa som jag kommer kalla det ( lika bra att varna er liksom , kommer kräva några dussin " tyck-synd-om-mig kommentarer ) En till grej på bussen jag störde mig på var den sjuka brudens kompis som gick på <NAMN>
Jag vet inte hur det är hos er som är utspridda i landet , i världen … Men HÄR , precis HÄR i <NAMN> är i allafall julen på väg
Amor <NAMN> hade lekstuga de luxe på sin vänsterkant innan han klev av efter känningar i ryggen
All kunskap från mannen bakom <NAMN> Om <NAMN> är grundare av Moderskeppet
Det var en fin kväll med andra ord


Cool, the sentences have a nice flow but often derail. More so in the case of 4-grams than 5-grams, which still often look to be stolen straight from the blogs.

To tackle this I will add some kind of probability of choosing words present in a shorter context. Ideally this should be done with a proper backoff model, but as the day of the big "Skitsnack" is drawing closer I will probably just implement some quasi-rational solution.
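
One such quasi-rational option, sketched here purely as an illustration: before each draw, shorten the context with some probability, so the model sometimes picks a successor that was only seen after fewer words. This is not a proper backoff model (there is no discounting), and `build_successors`, `generate` and `shorten_prob` are hypothetical names. The sketch assumes `max_n >= 2` and that every training sentence starts with `<BOS>` and ends with `<EOS>`.

```python
import random
from collections import Counter, defaultdict

def build_successors(sentences, max_n):
    """Map every context of length 1..max_n-1 to a Counter of following words."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        for i in range(1, len(sentence)):
            for k in range(1, min(i, max_n - 1) + 1):
                successors[tuple(sentence[i - k:i])][sentence[i]] += 1
    return successors

def generate(successors, max_n, shorten_prob=0.2, max_length=100, rng=random):
    """Sample a sentence, randomly shortening the context before each draw."""
    sentence = ["<BOS>"]
    while sentence[-1] != "<EOS>" and len(sentence) < max_length:
        k = min(len(sentence), max_n - 1)
        # Shorten when the context was never seen, or by chance with shorten_prob.
        # A one-word context is always known, so k never drops below 1.
        while k > 1 and (tuple(sentence[-k:]) not in successors
                         or rng.random() < shorten_prob):
            k -= 1
        options = successors[tuple(sentence[-k:])]
        word = rng.choices(list(options), weights=list(options.values()))[0]
        sentence.append(word)
    return sentence

toy = [["<BOS>", "a", "b", "<EOS>"], ["<BOS>", "b", "a", "<EOS>"]]
successors = build_successors(toy, max_n=3)
print(generate(successors, max_n=3, rng=random.Random(1)))
```

A higher `shorten_prob` should make the output derail more often, while `shorten_prob=0` falls back to the longest known context at every step.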