# Introduction
During my orchestra [LiTHe Blås](http://litheblas.org)'s 45-year anniversary I will be doing a customary "Skitsnack" between two songs. 
A "Skitsnack" is where an orchestra member keeps the audience preoccupied by talking about anything between heaven and earth until we start playing the next song.
I often find it hard to decide what I should talk about and usually end up telling really bad jokes or rambling about some algorithm I just read about. 

For this "Skitsnack" I will avoid coming up with my own material entirely by generating it with an n-gram model! 
The focus of this Notebook is simply to generate a short text which I think could entertain a group of 500 for about 30-60 seconds.

There are some nice aspects of the task of generating a "funny" text from a predefined dictionary:
* I can easily perform extrinsic testing. "Is this text funny?"
* I don't have to worry about unknown words.

As such, any quantitative evaluation will only happen on the spur of the moment.

Inspiration:
https://web.stanford.edu/~jurafsky/slp3/4.pdf

# Training data
I haven't decided what data I should use to train my model, and will probably have to try some different sources when the model is finished. The final text will definitely be in Swedish to accommodate the audience, but to get started I will use the Danish author [H.C. Andersen's Fairy Tales](http://www.gutenberg.org/ebooks/32572) translated to English.

## Cleaning the data

In [1]:
import codecs

In [2]:
with codecs.open('hcandersen_fairy_tales.txt', 'r', encoding='utf-8') as f:
    text = f.read()

First let's remove all text which is not part of his stories, as I don't want to use this for training.

In [3]:
import re

In [4]:
for i in re.finditer(r"HANS ANDERSEN'S FAIRY TALES", text):
    print(i.start(0), i.end(0))
    # Hiding the output for future readability
    #print(text[i.start(0): i.end(0) + 200])
    #print("###########################")

627 654
4021 4048
4140 4167
375029 375056


In [5]:
text = text[4167:]

In [6]:
for i in re.finditer(r"NOTES", text):
    print(i.start(0), i.end(0))
    print(text[i.start(0): i.end(0) + 200])
    print("###########################")

367113 367118
NOTES


THE STORKS

          PAGE 29. On account of the ravages it makes among
          noxious animals, the stork is a privileged bird
          wherever it makes its home. In cities it is
     
###########################


In [7]:
text = text[:367113 ]

Let's remove all carriage returns and replace line breaks with spaces.

In [8]:
text = re.sub(r'\r', '', text)
text = re.sub(r'\n', ' ', text)
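
The two substitutions leave tabs and runs of spaces in place. A possible variant, which is not what this notebook runs, is to collapse every run of whitespace into a single space. The sample string below is made up for illustration.

```python
import re

# Made-up sample with a carriage return, line breaks, a tab and repeated spaces
sample = "THE FLAX\r\n\r\n\tTHE flax was   in full bloom;\nit had pretty little blue flowers."

# \s matches spaces, tabs, carriage returns and line breaks alike
collapsed = re.sub(r'\s+', ' ', sample).strip()
print(collapsed)
```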

## Tokenization
### Sentences
First let's parse the text as sentences using NLTK.

In [9]:
from nltk.tokenize import sent_tokenize

In [10]:
sentences = sent_tokenize(text)

In [11]:
print(sentences[0])
print(sentences[1])

     THE FLAX   THE flax was in full bloom; it had pretty little blue flowers, as delicate as the wings of a moth.
The sun shone on it and the showers watered it; and this was as good for the flax as it is for little children to be washed and then kissed by their mothers.


The first one is not good, as it includes the story title. I will not bother with this at the moment though, as this is not the text I will be using for my final results.

### Words

In [12]:
from nltk.tokenize import word_tokenize

In [13]:
sentences = list(map(word_tokenize, sentences))

In [14]:
print(sentences[1])

['The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '.']


### Add beginning and end of sentence tags
All my sentences need a root and a stop token. I will insert a `<BOS>` tag before the first word of each sentence and replace the final punctuation mark of each sentence with `<EOS>`. I do this to distinguish between punctuation inside a sentence and the actual end of the sentence. 

In [15]:
sentences = list(map(lambda x: ['<BOS>'] + x[:-1] + ['<EOS>'], sentences))

In [16]:
print(sentences[1])

['<BOS>', 'The', 'sun', 'shone', 'on', 'it', 'and', 'the', 'showers', 'watered', 'it', ';', 'and', 'this', 'was', 'as', 'good', 'for', 'the', 'flax', 'as', 'it', 'is', 'for', 'little', 'children', 'to', 'be', 'washed', 'and', 'then', 'kissed', 'by', 'their', 'mothers', '<EOS>']


# N-Gram model

## Constructing n-grams from the sentences
Let's try the `ngrams` function from NLTK.

In [17]:
from nltk import ngrams
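
As a rough illustration of what `ngrams` yields with its default arguments (no padding), here is a pure-Python sketch. The helper `sliding_ngrams` and the toy sentence are made up for this example; it is meant to mirror the behaviour, not to reproduce NLTK's implementation.

```python
def sliding_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Zipping the list with shifted copies of itself pairs each token with its successors
    return list(zip(*(tokens[i:] for i in range(n))))

toy_sentence = ['<BOS>', 'the', 'flax', 'was', 'blue', '<EOS>']
print(sliding_ngrams(toy_sentence, 2))
print(sliding_ngrams(toy_sentence, 3))
```

A sentence of six tokens gives five bigrams and four trigrams, which is why short sentences contribute few high-order n-grams.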

## Implementing a generative model
When generating my text I will not be considering context across sentences. Each `<EOS>` tag will be followed by a fresh `<BOS>`, interpreted as a unigram.

In [18]:
from functools import reduce
import numpy as np
import pandas as pd

In [19]:
class NGrams():
    def __init__(self, n):
        self.n = n
        
    def fit(self, sentences):
        # Count all ngrams
        self.grams = []
        for i in range(1, self.n+1):
            # Build a sorted list of all ngrams
            grams = list(map(lambda x: list(ngrams(x, i)), sentences))
            grams = np.array(reduce(lambda x, y: x + y, grams))
            grams = grams[np.lexsort(grams[:,::-1].T,),:]
            
            # Build an array marking the first occurrence of each unique n-gram
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [True, False, True, True, False]
            first_occurence = np.append([True], np.array([np.any(grams[i-1] != grams[i]) for i in range(1, len(grams))]))
            
            # Assign each unique n-gram an index
            # Example unigrams, 
            # a = [('a'), ('a'), ('b'), ('c'), ('c')]
            # -> [0, 0, 1, 2, 2]
            ids = np.append([0], first_occurence[1:].cumsum())
            
            # Build a mapping from n-gram to count
            frequencies = dict(list(zip(map(tuple, grams[first_occurence,:]), np.bincount(ids))))
            self.grams.append(frequencies)
        
        # Record the total number of unigrams
        self.n_unigrams = sum(self.grams[0].values())         
        
    def log_prob(self, word, context):
        n = len(context)
        gram = context + (word,)
        if n == 0:
            return np.log(self.grams[n][gram] / self.n_unigrams)
        else:
            return np.log(self.grams[n][gram] / self.grams[n-1][context])      

In [20]:
len(sentences)

3902

In [21]:
trigram = NGrams(3)

In [22]:
%time trigram.fit(sentences)

Wall time: 9.41 s


In [23]:
trigram.log_prob('the', ('!', '--'))

0.0

In [24]:
trigram.log_prob('call', ("'ll",))

-3.4657359027997265

In [25]:
trigram.log_prob('call', ())

-8.7714179106038408

Cool, we can calculate the log probabilities of a word following an ngram. To avoid overfitting to the exact sentences in the text some smoothing should be done. Without smoothing we will assign zero probability to all ngrams that are not present in the text.
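
One of the simplest smoothing schemes is add-one (Laplace) smoothing. The sketch below applies it to bigrams on a made-up toy corpus; `laplace_log_prob` and the corpus are assumptions for illustration and are not part of the `NGrams` class above.

```python
import math
from collections import Counter

# Made-up toy corpus, already tagged with <BOS> and <EOS>
toy_corpus = [
    ['<BOS>', 'the', 'flax', 'was', 'blue', '<EOS>'],
    ['<BOS>', 'the', 'sun', 'was', 'warm', '<EOS>'],
]

unigram_counts = Counter(w for s in toy_corpus for w in s)
bigram_counts = Counter(pair for s in toy_corpus for pair in zip(s, s[1:]))
vocab_size = len(unigram_counts)

def laplace_log_prob(word, prev):
    # Unsmoothed estimate would be count(prev, word) / count(prev).
    # Add one to every bigram count and the vocabulary size to the context count,
    # so an unseen bigram gets a small but non-zero probability.
    return math.log((bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size))

print(laplace_log_prob('flax', 'the'))  # seen bigram
print(laplace_log_prob('warm', 'the'))  # unseen bigram, still a finite log probability
```

Add-one smoothing tends to move too much probability mass to unseen events on real data, so it is best seen as a starting point.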

However, I have encountered a problem: at the moment I don't feel creative enough to come up with a generative model that finds the next word in a sentence based on smoothed probabilities. Maybe I started off at the wrong end? 

## Simple text generating model
I will start off with an unsmoothed model, and at the same time switch to a tree structure to represent the model vocabulary.

In [19]:
import random
import bisect
import sys

In [20]:
class NGramsTree():
    def __init__(self, n):
        self.n = n
    def fit(self, sentences):
        
        # Yields all ngrams of a list of sentences, also yields the n-1, n-2, ... 1 grams at the end of each sentence.
        # For my model this is important, as I build a single tree with depth n to represent all contexts. The shorter
        # grams at the end of the sentence are required to properly fill the top layers of the tree.
        def all_grams(n, sentences):
            for sentence in sentences:
                for i in range(len(sentence)):
                    yield sentence[i:min(len(sentence), i+n)]
        
        # Construct a tree based on the ngrams in the input sentences
        self.tree = {}
        self.tree['count'] = 0 #len_grams
        self.tree['words'] = {}
        for gram in all_grams(self.n, sentences):
            reference = self.tree
            self.tree['count'] += 1
            for i, word in enumerate(gram):
                if word in reference['words']: 
                    reference = reference['words'][word]
                    reference['count'] += 1
                else:
                    reference['words'][word] = {}
                    reference = reference['words'][word]
                    reference['count'] = 1
                    # Don't create maps for the leaves
                    if i != self.n - 1:
                        reference['words'] = {}
                        
        # Create integer bins for all words in each context based on their appearances.
        # These bins will be used to find a random successor to a context, with
        # probabilities proportional to the number of appearances.
        # Note: values() and keys() of a dictionary iterate in the same order as long as
        # the dictionary is not modified in between (insertion order since Python 3.7).
        # Therefore associating the bin at index 0 with the key at index 0 is OK.
        def build_bins(tree):
            if 'words' not in tree:
                return
            else:
                tree['bins'] = [0]
                for values in tree['words'].values():
                    tree['bins'].append(tree['bins'][-1] + values['count'])
                    build_bins(values)
        

        build_bins(self.tree)

        
    def generate_sentence(self, sentence, context_size = None, backoff_prob = .25, max_length = 100):
        if not context_size:
            context_size = self.n - 1
            
        # Randomly back off to a smaller n with the probability backoff_prob
        while random.random() <= backoff_prob and context_size > 0:
            context_size-=1
        context_size = min(len(sentence), context_size)
        reference = self.tree
        if context_size > 0:
            context = sentence[-context_size:]
        else:
            context = []
        for word in context:
            reference = reference['words'][word]
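        # Draw from the total count of recorded successors. This equals reference['count']
        # whenever every occurrence of the context has a successor, which holds for
        # <EOS>-terminated input, but not for contexts that can end a sentence.
        rand_i = random.randint(0, reference['bins'][-1] - 1)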
        word_index = bisect.bisect_right(reference['bins'], rand_i) - 1
        word = list(reference['words'].keys())[word_index]
        sentence.append(word)
        if word == "<EOS>" or len(sentence) == max_length:
            return sentence
        else:
            # Generate the next word in the sentence based on what has been generated so far
            # Do not allow a longer context than what was used this time + 1 
            return self.generate_sentence(sentence, min(context_size+1, self.n - 1), backoff_prob, max_length)
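
The sampling step in `generate_sentence` can be looked at in isolation. A minimal sketch with made-up counts: cumulative bins are built from the successor counts, a random integer is drawn, and `bisect_right` locates the bin it falls in. The names `successor_counts` and `sample_word` are hypothetical and only used here.

```python
import bisect
import random
from collections import Counter

# Made-up successor counts for some context
successor_counts = {'a': 6, 'c': 1, 'b': 1}

# Cumulative bin edges: [0, 6, 7, 8]
bins = [0]
for count in successor_counts.values():
    bins.append(bins[-1] + count)
words = list(successor_counts)

def sample_word(rng):
    # rand_i is uniform over 0..7; 0-5 fall in the bin of 'a', 6 in 'c', 7 in 'b'
    rand_i = rng.randint(0, bins[-1] - 1)
    return words[bisect.bisect_right(bins, rand_i) - 1]

rng = random.Random(0)
draws = Counter(sample_word(rng) for _ in range(8000))
print(draws)  # the counts come out in roughly a 6:1:1 ratio
```

The lookup costs O(log k) for k successors, at the price of storing one extra list per context.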
         
        

In [21]:
treegram = NGramsTree(3)

In [22]:
example_sentences = [
    ['a', 'a', 'a'],
    ['a', 'a', 'a'],
    ['a', 'a', 'c'],
    ['a', 'b', 'a'],
    ['b', 'a', 'a'],
]

In [23]:
%time treegram.fit(example_sentences)

Wall time: 0 ns


In [24]:
import json

Here is an example of what the tree structure looks like for trigrams.

In [25]:
print(json.dumps(treegram.tree, sort_keys=True,
                      indent=4, separators=(',', ': ')))

{
    "bins": [
        0,
        12,
        13,
        15
    ],
    "count": 15,
    "words": {
        "a": {
            "bins": [
                0,
                6,
                7,
                8
            ],
            "count": 12,
            "words": {
                "a": {
                    "bins": [
                        0,
                        2,
                        3
                    ],
                    "count": 6,
                    "words": {
                        "a": {
                            "count": 2
                        },
                        "c": {
                            "count": 1
                        }
                    }
                },
                "b": {
                    "bins": [
                        0,
                        1
                    ],
                    "count": 1,
                    "words": {
                        "a": {
                            "count": 1
         

In [26]:
treegram = NGramsTree(4)
%time treegram.fit(sentences)

Wall time: 1.47 s


In [27]:
def print_model_sentence(sentence):
    # Strip <BOS> and <EOS> tags.
    sentence = sentence[1:-1]
    print(" ".join(sentence))

In [28]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>'], backoff_prob = 0))

If the sea is rough , the foam dashes over us ; yet we thank God for this rock
Her heart was so filled with sunshine , peace , and joy that it broke , and her courage returned
I can not tell you , but you understand no more about poetry than that cask yonder .
The swans shook their heads , and the little pea is growing so fast , that I may admire their beauty !
But what became of the other young men would gladly have given up his graceful garden flower if he might have worn the one given by the widow in the Bible , it wakes an echo in the heart


Cool! It seems like the sentences often closely follow sentences from the book. This is expected, as I do no smoothing at all. 

I would like the sentences to derail a little bit more. This could possibly be solved by increasing the size of my training data.

In [29]:
treegram = NGramsTree(5)
%time treegram.fit(sentences)

Wall time: 1.93 s


In [30]:
for i in range(5):
    print_model_sentence(treegram.generate_sentence(sentence = ['<BOS>'], backoff_prob = 0))

This is how such highborn people as we came to be in a kitchen .
`` Shall we not now hear about the preparation ?
When she had once begun , her feet went on dancing , so , that she trod upon the good lady 's toes
But dumb she must remain till her task was finished
said the painter


With 5-grams most of the sentences seem to be stolen straight out of the text.

# Testing the Model
Let's test the model on the task it was intended for: generating a funny "Skitsnack".

First off, I will need to find a new text. It should preferably be longer than the previous one, and it should also be in Swedish.

## Parsing texts from Språkbanken
[Språkbanken](https://spraakbanken.gu.se/swe/resurser) has a large corpus collection. I have downloaded two:
* [August Strindbergs romaner](https://spraakbanken.gu.se/swe/resurs/strindbergromaner) - August Strindberg's collected novels and dramas. 321,759 sentences.
* [Bloggmix 2005](https://spraakbanken.gu.se/swe/resurs/bloggmix2005) - A collection of Swedish blog posts from 2005. 280,905 sentences. 

Both contain a lot more information than just sentences. For example, every sentence has been dependency parsed, and every word is accompanied by metadata such as a part-of-speech (POS) tag. It could be interesting to use POS tags in a language model. For example, it makes sense to assign extra probability to nouns in contexts where we most often see nouns.

The sentences in both corpora have been reordered. The internal structure of each sentence is not altered, but it is not possible to look for context outside of the sentence. This is not a problem for my model.

In [31]:
import xml.etree.ElementTree as ET

## Bloggmix 2005

In [32]:
tree = ET.parse('sprakbanken/bloggmix2005.xml')

Structure:
```xml
<corpus>
   <blog>
       <text>
           <sentence>
               <w>
               </w>
           </sentence>
       </text>
   </blog>
</corpus>
```
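
To see how `ElementTree` walks this structure, here is a small sketch on an inline document. The attribute value and the words are made up; the real corpus carries many more attributes on each element.

```python
import xml.etree.ElementTree as ET

# Made-up miniature corpus with the same nesting as above
toy_xml = """
<corpus>
  <blog title="Example blog">
    <text>
      <sentence><w>Jag</w><w>lever</w><w>loppan</w><w>.</w></sentence>
      <sentence><w>Mysigt</w><w>!</w></sentence>
    </text>
  </blog>
</corpus>
"""

toy_root = ET.fromstring(toy_xml)
for blog in toy_root:
    print("Blog:", blog.attrib['title'])
    # iter('sentence') visits every <sentence> below the blog, at any depth
    for sentence in blog.iter('sentence'):
        print([w.text for w in sentence])
```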

In [33]:
root = tree.getroot()

Let's have a look at some of the blogs in the corpus.

In [34]:
for child in list(root)[:5]:
    print("Blog: {}".format(child.attrib['title']))

Blog: Tatiana Rojas
Blog: BY KAROLINAA
Blog: Angelicas Gnällbänk
Blog: Emma's WindOw
Blog: Johanna Sjödin


In [35]:
def parse_sentence(sentence_xml):
    sentence = []
    # Iterating over an element yields its children (getchildren() was removed in Python 3.9)
    for word in sentence_xml:
        text = word.text
        # A word whose text is a bare line break is treated as censored
        text = "<CENSURERAT>" if text == "\n" else text
        sentence.append(text)
    return sentence

def parse_text(text_xml):
    return [parse_sentence(sentence) for sentence in text_xml]

def parse_blog(blog_xml):
    # Flatten the texts of a blog into one list of sentences
    return [sentence for text in blog_xml for sentence in parse_text(text)]

In [36]:
# Flatten all blogs into one list of sentences
blog_sentences = [sentence for blog in root for sentence in parse_blog(blog)]

In [37]:
# Free some memory. root has to go as well, since it keeps the whole tree alive
del tree, root
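
If memory is tight already at the parsing stage, one option (not used in this notebook) is `ET.iterparse`, which yields elements as they are completed so that each one can be cleared after use. The sketch reads from an in-memory buffer with made-up content; for the real corpus the file path would be passed instead.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the corpus file
toy_file = io.BytesIO(
    b"<corpus><blog><text>"
    b"<sentence><w>Jag</w><w>lever</w></sentence>"
    b"<sentence><w>Mysigt</w><w>!</w></sentence>"
    b"</text></blog></corpus>"
)

streamed_sentences = []
# With the default events, iterparse reports each element when its end tag is read
for event, elem in ET.iterparse(toy_file):
    if elem.tag == 'sentence':
        streamed_sentences.append(['<BOS>'] + [w.text for w in elem] + ['<EOS>'])
        # Drop the children of the finished sentence to release their memory
        elem.clear()

print(streamed_sentences)
```

The emptied `sentence` elements are still referenced by their parent, so this reduces memory use rather than eliminating it.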

In [38]:
blog_sentences = list(map(lambda x: ['<BOS>'] + x + ['<EOS>'], blog_sentences))

In [39]:
blog_sentences[0]

['<BOS>',
 'Äter',
 '<CENSURERAT>',
 'gud',
 'vet',
 'hur',
 'många',
 'kilo',
 'jag',
 'har',
 'gått',
 'upp',
 'men',
 'jag',
 'bryr',
 'mig',
 'inte',
 ',',
 'jag',
 'lever',
 'loppan',
 '.',
 '<EOS>']

In [40]:
len(blog_sentences)

280905

Training on the full data set quickly exhausts my machine's memory. Once the system starts swapping to disk, reading and writing becomes the bottleneck.

In [57]:
fourgram = NGramsTree(4)
%time fourgram.fit(blog_sentences)

Wall time: 3min 4s


In [63]:
random.seed(0)
for i in range(5):
    print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=0))

... fast den här gången , kanske svart .. de är i livet ?
Certainly an initiative worth applauding , in particular if we were really righteous , we would not have been that wrong .
<CENSURERAT> river choklad till desserten <CENSURERAT> ( chokladgratinerad frukt ) Tasteline Ingredienser 2 kiwi 1 banan 2 äpplen <CENSURERAT> jordgubbar <CENSURERAT> florsocker <CENSURERAT> vit choklad ev. citronmeliss till garnering Gör så här : " Istället för det som skall komma över världen .
Antagligen får jag både ett webinterface och RSS utan att behöva klättra på lådor !
Så ja , jag vet : alla har inte tilgång till dator i hemmet .


In [41]:
fivegram = NGramsTree(5)
%time fivegram.fit(blog_sentences)

Wall time: 9min 35s


In [42]:
random.seed(0)
for i in range(5):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=0))

... fast den här är ändå roligast av alla .
:) <CENSURERAT> låg jag i sängen <CENSURERAT> , de va så jävla skönt .
Min kväll på <CENSURERAT> var i alla fall rätt ofta för olika typer av svartvit film .
Barnets biologiska föräldrars juridiska band till barnet upphör .
Det var organisationen barnen först som höll i evenemanget och vi var i iranska asylgruppen lokaler i <CENSURERAT> .


Cool, the sentences have a nice flow but often derail. More so in the case of 4-grams than 5-grams, which still often look to be stolen straight from the blogs.

To tackle this I will add some probability of choosing words seen in a shorter context. Ideally this should be done with a proper backoff model, but as the day of the big "Skitsnack" is drawing closer I will probably just implement some quasi-rational solution.

# With backoff
I implemented backoff in context size while generating sentences. It works like this:
1. Find the largest possible context. This is min(sentence length, n-1, c + 1), where n is the model's n value and c is the context size used to generate the previous word.
2. Subtract 1 from the chosen context size with probability `backoff_prob`. Repeat until context size is either 0 or no subtraction was made.
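
The two steps can be sketched as a standalone function. `choose_context_size` is a hypothetical helper written for this illustration; it follows the steps as listed, whereas `generate_sentence` interleaves them slightly differently.

```python
import random

def choose_context_size(sentence_length, n, previous_context_size, backoff_prob, rng=random):
    # Step 1: the largest context that is allowed for this word
    context_size = min(sentence_length, n - 1, previous_context_size + 1)
    # Step 2: shrink the context by one for every successful backoff draw
    while context_size > 0 and rng.random() < backoff_prob:
        context_size -= 1
    return context_size

rng = random.Random(0)
print([choose_context_size(10, 5, 4, 0.2, rng) for _ in range(10)])
```

With a 5-gram model and a backoff probability of 0.2, the full context of 4 words is kept with probability 0.8, a context of 3 is used with probability 0.16, and so on down to the unigram case at 0.0016.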

## 4-Gram

In [58]:
p = 0
for step in range(5):
    # Blank line between the blocks, but not before the first one
    print("{}Backoff probability {}%:".format("\n" if step else "", round(p * 100)))
    random.seed(0)
    for i in range(5):
        print_model_sentence(fourgram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    # Step p up to 20% in increments of 5 percentage points
    if step < 4:
        p += .05

Backoff probability 0%:
... fast den här gången , kanske svart .. de är i livet ?
Certainly an initiative worth applauding , in particular if we were really righteous , we would not have been that wrong .
<CENSURERAT> river choklad till desserten <CENSURERAT> ( chokladgratinerad frukt ) Tasteline Ingredienser 2 kiwi 1 banan 2 äpplen <CENSURERAT> jordgubbar <CENSURERAT> florsocker <CENSURERAT> vit choklad ev. citronmeliss till garnering Gör så här : " Istället för det som skall komma över världen .
Antagligen får jag både ett webinterface och RSS utan att behöva klättra på lådor !
Så ja , jag vet : alla har inte tilgång till dator i hemmet .

Backoff probability 5.0%:
... SÅ snabba är vi ;) <CENSURERAT> och lite rouge , läpglans Fest : mkt <CENSURERAT> , markerade ögon , lätt läppglans 6 .
Det verkar inte blir något , men kanske ska gå imon .
Om du ser att någon har behovet att uppmärksamma andras , vad de amerikanska skattebetalarna så gör oljebolagen värsta glädjeskutten över situatio

## 5-Gram

In [44]:
p = 0
for step in range(5):
    # Blank line between the blocks, but not before the first one
    print("{}Backoff probability {}%:".format("\n" if step else "", round(p * 100)))
    random.seed(1)
    for i in range(5):
        print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    # Step p up to 20% in increments of 5 percentage points
    if step < 4:
        p += .05

Backoff probability 0%:
<CENSURERAT> skulle <CENSURERAT> duga gott också , eller till och med mer slitet än vanligt .
I sin kommentar ber han mig att kommentera <CENSURERAT> storhet .
Och i min trappa springer numera två brevbärare , hur nu det kan löna sig .. Är det någon som vet hur man dundrar på mot strömmen är <CENSURERAT> .
Mysigt med lite grillat .
Alltså , man vill ju ha ett fungerande nagellack ! ! !

Backoff probability 5%:
<CENSURERAT> skulle <CENSURERAT> duga gott också , eller till och med en 24-timmars emergency phone number .
Därför avslutade jag det också , för jag vill egentligen inte publicera bilden på mig haha tycker jag ser lite ut som en man ?
Hoppas att jag kan fixa detta !
Jag hade ju ställt in mig på det , dock inte pappa .
Alltså , blä .

Backoff probability 10%:
<CENSURERAT> skulle <CENSURERAT> duga gott också , eller till och med dansa i sömnen . ” Undrar om de som går i sömnen har lite för starka sådana impulser då ?
Jag tror man ska vara där <CENSURERAT> .

20% backoff looks pretty good!

The censored words are a bit of a buzzkill. It seems like dates, names and products are censored, maybe other things as well.

I could fill some of the tags in manually, but in a lot of cases I'm not sure with what. I should probably have excluded all sentences with censored words from the corpus, but for now let's just create some sentences and keep the ones I can make sense of.
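
Excluding the censored sentences before training would be a one-line filter. A minimal sketch on two made-up sentences; on the real corpus the same comprehension would run over `blog_sentences`, and I have not measured how many sentences it would remove.

```python
CENSORED = '<CENSURERAT>'

# Made-up sample: one clean sentence and one with a censored word
toy_sentences = [
    ['<BOS>', 'Mysigt', 'med', 'lite', 'grillat', '.', '<EOS>'],
    ['<BOS>', 'Jag', 'tror', 'man', 'ska', 'vara', 'där', CENSORED, '.', '<EOS>'],
]

# Keep only the sentences without any censored token
uncensored = [s for s in toy_sentences if CENSORED not in s]
print(len(uncensored))
```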

In [72]:
random.seed(7)
for i in range(10):
    print_model_sentence(fivegram.generate_sentence(sentence = ['<BOS>'], backoff_prob=p))
    print("")

Det är helt tomt i bollen , och JIPPIIEEE va kul det är !

<CENSURERAT> ville dock ABB att de anställda skulle byta till Metall , eftersom arbetarna vid de regelbundna kvartalsmötena och konferenserna .

Och speciellt när de säger att stockholmare är ytliga och otrevliga människor i ledningen som de som utpekats med namns nämnande i katastrofkommitténs rapport .

Dammsuga och " undre världen , inte <CENSURERAT> och inte <CENSURERAT> som spelar center .

Ja , jag har pratat länge om att ordna med lite fondväggar i både <CENSURERAT> rum , igen , för jag nu journal över allt som görs och inte görs så att jag har flugit av min häst <CENSURERAT> .

Jag har skrivit <CENSURERAT> , trots att jag hade svinont efter måndagen så putsade jag en massa fönster .

Hur som , <CENSURERAT> är en av de få som älskar <CENSURERAT> OCH <CENSURERAT> .

Jag vet inte hur det kan vara någons favorit film i dagens samhälle haha Vad tycker ni om bilderna ? ?

Kan avslöja att vi inte blev döpta när vi var barn , s

In [57]:
text = "Hon vek en pappersbåt av tågbiljetten och skickade iväg ett sms från Franrike . \
Det är en enastående helt unik historia , ja , rent av dödliga personskador . \
Masjvälar var en besvikelse , liksom många av hemvändarna . \
Darför tänker jag inte titta haha . \
Herregud alltså ... Ska spela mer sims nu och glömma att mitt rum är en bra låt ! \
Exakt såhär vill jag att min mormor hatade mig , och hon är till besvär och att debatterna hon för påminner om dåliga såpor . \
Men hon hyschar åt oss att det är HAN , eller NÅN , eller NÅT … ."

In [63]:
shorter_text = "Hon vek en pappersbåt av tågbiljetten och skickade iväg ett sms från Franrike . \
Det är en enastående helt unik historia , ja , rent av dödliga personskador . \
Darför tänker jag inte titta haha . \
Exakt såhär vill jag att min mormor hatade mig , och hon är till besvär och att debatterna hon för påminner om dåliga såpor . \
Men hon hyschar åt oss att det är HAN , eller NÅN , eller NÅT … ."

In [77]:
long_text = "Hon vek en pappersbåt av tågbiljetten och skickade iväg ett sms från Franrike . \
Det är en enastående helt unik historia , ja , rent av dödliga personskador . \
Darför tänker jag inte titta haha . \
Exakt såhär vill jag att min mormor hatade mig , och hon är till besvär och att debatterna hon för påminner om dåliga såpor . \
Men hon hyschar åt oss att det är HAN , eller NÅN , eller NÅT … .\
Juristerna anser att lagen inte ger dem en örfil – det är mänskligt , men inte OK. \
Hela gänget har använt sina Cowboyboots flitigt det gångna året . \
Stannar han kvar så ska jag träna lite faktiskt , hejhopp.\
Alla är vi puckon nån gång och alla är vi godingar . \
Det är helt tomt i bollen , och JIPPIIEEE va kul det är ! \
Sjunger : -  Min Konung mig kallar , jag går på penicillin . Är så trött , så jävla trött . \
"

## Text to Speech
Let's have a robot voice do the "Skitsnack" for extra AI vibes. I'll use Google's Text to Speech API, via the gtts package.

In [78]:
from gtts import gTTS

In [79]:
tts = gTTS(long_text, lang = 'sv', slow=False)

In [80]:
tts.save('tts_files/long_text.mp3')

# Final remarks
## Results
I implemented a very simple N-Gram model, and made it generate random sentences with a hardcoded probability of backing off to shorter contexts.

I found a backoff probability I liked for the 5-Gram model and generated some sentences, which I finally converted to a sound file using Google's Text to Speech API.

## Limitations encountered
When training a 5-Gram model on the 4.8 million tokens of the Bloggmix 2005 corpus I quickly ran out of memory on my laptop with 8 GB of RAM. This led to pretty slow training, and even some delays when generating sentences, as parts of the model had to be read from disk.

## Improvements
I think my idea of generating sentences the way I did makes sense. Choosing a word with probability based on how often it has appeared in the given context is exactly what an N-Gram model is about. I also think backing off to a smaller context was a good idea to avoid following the original texts too closely. However, I would have liked to back off with probabilities chosen in a more sound way. One way to choose such probabilities would be to fit the model against a held-out corpus.
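
As a rough sketch of what fitting against a held-out corpus could look like, the example below picks a mixing weight between bigram and unigram estimates by maximising the held-out log-likelihood. Note that linear interpolation is swapped in here for simplicity; it is not the random backoff used in this notebook, and the corpora, the candidate weights and the function name are all made up.

```python
import math
from collections import Counter

# Made-up training and held-out data
train = [
    ['<BOS>', 'the', 'flax', 'was', 'blue', '<EOS>'],
    ['<BOS>', 'the', 'sun', 'was', 'warm', '<EOS>'],
    ['<BOS>', 'the', 'flax', 'was', 'warm', '<EOS>'],
]
held_out = [
    ['<BOS>', 'the', 'warm', 'sun', 'was', 'blue', '<EOS>'],
]

unigrams = Counter(w for s in train for w in s)
bigrams = Counter(pair for s in train for pair in zip(s, s[1:]))
total = sum(unigrams.values())

def held_out_log_likelihood(lam):
    # lam weighs the bigram estimate, 1 - lam the unigram estimate
    log_likelihood = 0.0
    for sentence in held_out:
        for prev, word in zip(sentence, sentence[1:]):
            p_bigram = bigrams[(prev, word)] / unigrams[prev]
            p_unigram = unigrams[word] / total
            log_likelihood += math.log(lam * p_bigram + (1 - lam) * p_unigram)
    return log_likelihood

# Grid search over a handful of candidate weights
candidates = [0.2, 0.4, 0.6, 0.8]
best_lam = max(candidates, key=held_out_log_likelihood)
print(best_lam)
```

The held-out sentence contains bigrams that never occur in the training data, which is what keeps the best weight away from the largest candidate. The sketch assumes every held-out word also occurs in the training data; otherwise the unigram estimate would be zero as well.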

If I circle back to this project to improve it, I would like to implement a new N-Gram model with a proper smoothing method such as Kneser-Ney smoothing. This time around I avoided this because I needed to produce results quickly, with no real quality requirements.