See part one [here](./ngram-language-model-part-one.html).
#### Generating N-grams

First we want a method to quickly generate a list of n-grams given a list of words.

In [2]:
def iter_ngrams(doc, n):
    """Return a generator over ngrams of a document.
    Params:
      doc...list of tokens
      n.....size of ngrams"""
    # doc[i: i+n] creates a list of elements from index i to i+n
    return (doc[i : i+n] for i in range(len(doc)-n+1))

Given a list where each element is a word in a sentence, the iter_ngrams() function returns a generator over the ngrams from the sentence. For example:

In [3]:
sentence = "The car drove down the street."

list(iter_ngrams(sentence.split(), 3))

[['The', 'car', 'drove'],
 ['car', 'drove', 'down'],
 ['drove', 'down', 'the'],
 ['down', 'the', 'street.']]

Since `sentence` is one single string, I call `.split()` on it to convert it into a list where each element is a word in the sentence.

From the dataset, I'll have a dictionary where the key is a name (or ID) and the value is a list of sentences. So for each sentence, I'll need to create its ngrams and then update a count that keeps track of the frequency of each ngram. 
I'll use a Counter to keep track of the ngram frequencies. A Counter is a special version of a dictionary that is perfect for keeping track of counts (who would've guessed!). 

The key will be the first n-1 words of the ngram (stored as a tuple), and the value will be a Counter that tracks how often each word follows those n-1 words. It's easier to see an example of this data structure.

In [4]:
from collections import defaultdict
from collections import Counter

names_dict = {}
# An example dictionary
names_dict['Michael'] = ['hello there friend.', 
                         'how have you been today?',
                         'have you slept today?']

sentences = names_dict['Michael']
n = 3

counts = defaultdict(Counter)

# for each sentence
for sentence in sentences:
    # convert the sentence into a list of ngrams
    # then for each ngram in that list of ngrams, update the count
    for ngram in iter_ngrams(sentence.split(), n):
        # Create a tuple of the n-1 words and then use this as a key to
        # access the Counter. Update the counter using the nth word.
        counts[tuple(ngram[:-1])].update([ngram[-1]])
        
print('counts=\n', '\n'.join(str(i) for i in counts.items()))

counts=
 (('have', 'you'), Counter({'been': 1, 'slept': 1}))
(('you', 'been'), Counter({'today?': 1}))
(('how', 'have'), Counter({'you': 1}))
(('hello', 'there'), Counter({'friend.': 1}))
(('you', 'slept'), Counter({'today?': 1}))


We can see that in the sample dataset, `'how have'` was followed by `'you'`. Moreover, `'have you'` occurred twice and was followed by `'slept'` once and `'been'` once. 
In order to convert these to probabilities, I iterate over all of the counters, sum up the total counts and divide each count by the total. For this small sample dataset, the probability of `'how have'` being followed by `'you'` is 100%. As for `'have you'`, 50% of the time it was followed by `'slept'` and 50% of the time it was followed by `'been'`.

In [5]:
for ngram, word_counts in counts.items():
    total = sum(word_counts.values())
    counts[ngram] = {word: count / total for word, count in word_counts.items()}

print('\ncounts=\n', '\n'.join(str(i) for i in counts.items()))


counts=
 (('have', 'you'), {'been': 0.5, 'slept': 0.5})
(('you', 'been'), {'today?': 1.0})
(('how', 'have'), {'you': 1.0})
(('hello', 'there'), {'friend.': 1.0})
(('you', 'slept'), {'today?': 1.0})


#### Dataset Preprocessing

There are a few preprocessing steps I'll perform to improve the results. For one, I'll prepend a `"[s]"` and append a `"[/s]"` to each sentence. When each sentence is generated, the first word will be selected from all of the ngrams that start with `"[s]"` and words will be selected until a `"[/s]"` is generated. Moreover, words will be transformed to lowercase and spaces will be added before and after each punctuation mark. 

In [6]:
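import re

sample = "Sample! See? Often, we'll swim (but only sometimes)?!?"

def preprocess(string):
    # Put a space on both sides of each punctuation mark
    s = re.sub(r"([.,!?()])", r" \1 ", string)
    # Collapse runs of whitespace into a single space (raw string avoids
    # Python's invalid-escape warning for "\s")
    s = re.sub(r"\s{2,}", " ", s)
    return s.lower()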
print(preprocess(sample))

sample ! see ? often , we'll swim ( but only sometimes ) ? ! ? 


Now it's time to create the dictionary that will hold all of our preprocessed messages.

In [206]:
import csv

# Maps each sender ID to a list of that sender's preprocessed messages
names_dict = defaultdict(list)

with open("allo_messages_anon.csv", encoding="utf-8") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        sender = row['sender_id']
        body = row["text"]
        
        # IF condition filters out messages with blank text
        if body:
            names_dict[sender].append("[s] "+ preprocess(body) + " [/s]")

In [207]:
names_dict.keys()

dict_keys(['159', '14', '1', '54', '156', '3', '2'])

In [208]:
names_dict['1'][30:40]

["[s] i'm still recovering from last night lol [/s]",
 '[s] same [/s]',
 '[s] me too !  [/s]',
 '[s] ^ [/s]',
 '[s] it is [/s]',
 '[s] ^^ but totally accurate lol [/s]',
 '[s] almost done [/s]',
 "[s] i'm still working on it [/s]",
 "[s] it's going pretty well [/s]",
 '[s] idk , i was thinking monday or tuesday [/s]']

The sender_IDs that I'm interested in are 54, 1, 14, 3, and 2. The remaining two are Allo bots that only have a handful of messages. 

#### Generating the Language Models

Now I'll make a function to generate the ngram probabilities using the code from earlier:

In [209]:
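def estimate_ngram_probs(sentences, n):
    """Return a dict mapping each (n-1)-word context tuple to a dict of
    {next word: probability}, estimated from a list of sentences."""
    counts = defaultdict(Counter)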
    
    for sentence in sentences:
        for ngram in iter_ngrams(sentence.split(), n):
            counts[tuple(ngram[:-1])].update([ngram[-1]])
            
    # Normalize probabilities to sum to 1.0
    for ngram, word_counts in counts.items():
        total = sum(word_counts.values())
        counts[ngram] = {word: count / total for word, count in word_counts.items()}

    return counts

In [210]:
c = estimate_ngram_probs(names_dict['54'], 3)

print(c[('[s]', "you're")])

{'caught': 0.25, 'dead': 0.25, 'just': 0.25, 'welcome': 0.25}


I'll create a list of IDs and iterate over it to populate a dictionary with each person's ngram language model.

In [211]:
ids = ['1', '2', '3', '14', '54']
n = 3

people_ngrams = {}

for id in ids:
    people_ngrams[id] = estimate_ngram_probs(names_dict[id], n)

Now we have the language models! There are five in particular, one for each of us (including the Google Assistant), and each model represents that sender's word choice tendencies. I'll leave it to you to try and guess which one is the Google Assistant ;D

For now I chose `n=3` (later on we'll explore changing n). So let's look at the results of the language model for sentences that start with "you're". The key for that case is the two-word context `('[s]', "you're")`, and its value holds the probability of each word that can follow it. So for user '1', 17.65% of the messages that begin with `"you're"` continue with `'welcome'`. Makes sense! 

In [212]:
people_ngrams['1'][('[s]', "you're")]

{'[/s]': 0.029411764705882353,
 'always': 0.029411764705882353,
 'amazing': 0.029411764705882353,
 'being': 0.029411764705882353,
 'doing': 0.029411764705882353,
 'ginny': 0.029411764705882353,
 'going': 0.08823529411764706,
 'gonna': 0.058823529411764705,
 'here': 0.029411764705882353,
 'just': 0.029411764705882353,
 'killin': 0.029411764705882353,
 'listening': 0.029411764705882353,
 'not': 0.058823529411764705,
 'perfect': 0.029411764705882353,
 'right': 0.029411764705882353,
 'so': 0.11764705882352941,
 'the': 0.11764705882352941,
 'welcome': 0.17647058823529413,
 'welcome😘': 0.029411764705882353}

In [14]:
people_ngrams['2'][('[s]', "you're")]

{'a': 0.07317073170731707,
 'adorable': 0.024390243902439025,
 'awake': 0.024390243902439025,
 'cruel': 0.024390243902439025,
 'definitely': 0.04878048780487805,
 'drunk': 0.04878048780487805,
 'exactly': 0.024390243902439025,
 'funny': 0.024390243902439025,
 'going': 0.024390243902439025,
 'gonna': 0.07317073170731707,
 'good': 0.024390243902439025,
 'just': 0.04878048780487805,
 'not': 0.024390243902439025,
 'probably': 0.024390243902439025,
 'reading': 0.024390243902439025,
 'retarded': 0.024390243902439025,
 'right': 0.024390243902439025,
 'sadistic': 0.024390243902439025,
 'seriously': 0.04878048780487805,
 'so': 0.17073170731707318,
 'staying': 0.024390243902439025,
 'still': 0.024390243902439025,
 'such': 0.024390243902439025,
 'sure': 0.04878048780487805,
 'the': 0.024390243902439025,
 'welcome': 0.024390243902439025}

In [15]:
people_ngrams['3'][('[s]', "you're")]

{'peachy': 0.16666666666666666,
 'pure': 0.16666666666666666,
 'the': 0.5,
 'welcome': 0.16666666666666666}

In [16]:
people_ngrams['54'][('[s]', "you're")]

{'caught': 0.25, 'dead': 0.25, 'just': 0.25, 'welcome': 0.25}

In [17]:
people_ngrams['14'][('[s]', "you're")]

{'a': 0.14285714285714285,
 'adorable': 0.14285714285714285,
 'amazing': 0.14285714285714285,
 'in': 0.14285714285714285,
 'perfect': 0.14285714285714285,
 'taking': 0.14285714285714285,
 'welcome': 0.14285714285714285}

#### Sentence Generation

The final step is generating the sentences and this is accomplished with the following function:

In [213]:
import random
import numpy as np

def generate_sentences(ngrams, k, n):
    """Sample k sentences from given ngram model.
    Params:
      ngrams....ngram language model; a dict from an (n-1)-word context
                tuple to a dict of {next word: probability}
      k.........number of sentences to sample
      n.........size of the ngrams the model was built from
    """
    # List of all ngrams that start with [s]
    start_ngrams = [ngram for ngram in ngrams if ngram[0] == '[s]']
    
    sentences = []
    for i in range(k):  # sample k sentences.
        # sample uniformly from all start ngrams.
        ngram = random.choice(start_ngrams)
        sentence = []
        sentence.extend(ngram)
        while sentence[-1] != '[/s]' and len(sentence) < 50:  # while not at end of sentence.
            # sample the next word
            sampled_word = np.random.choice(list(ngrams[ngram].keys()),      # words
                                            p=list(ngrams[ngram].values()),  # probabilities
                                            size=1)[0]
            sentence.append(sampled_word)
            # update most recent ngram: drop the oldest word and append the
            # sampled one, so the context stays n-1 words long for any n >= 2
            ngram = tuple(ngram[1:]) + (str(sampled_word),)
        sentences.append(' '.join(sentence))
    return sentences
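
To see how the context window slides during generation, here is a minimal, self-contained sketch on a hand-written toy trigram model. The `toy_model` dictionary and the `walk` helper are made up for illustration, and the sketch uses the standard library's `random.choices` instead of `np.random.choice`, so it is not the function above, only the same idea in miniature.

```python
import random

# Hypothetical toy trigram model: 2-word context -> {next word: probability}
toy_model = {
    ('[s]', 'good'): {'night': 0.5, 'morning': 0.5},
    ('good', 'night'): {'[/s]': 1.0},
    ('good', 'morning'): {'[/s]': 1.0},
}

def walk(model, start, max_len=50):
    """Generate one sentence, starting from the context tuple `start`."""
    context = start
    sentence = list(context)
    while sentence[-1] != '[/s]' and len(sentence) < max_len:
        dist = model[context]
        # Draw the next word in proportion to its probability.
        word = random.choices(list(dist.keys()),
                              weights=list(dist.values()), k=1)[0]
        sentence.append(word)
        # Slide the window: drop the oldest word, append the new one.
        context = context[1:] + (word,)
    return ' '.join(sentence)

random.seed(0)  # only to make repeated runs give the same sentence
result = walk(toy_model, ('[s]', 'good'))
print(result)
```

With this toy model, the only sentences that can come out are `[s] good night [/s]` and `[s] good morning [/s]`, each with probability 0.5.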


In [183]:
people_sentences = {}

# generate sentences for each person
for id in ids:
    people_sentences[id] = generate_sentences(people_ngrams[id], 5, n)
    
for id in ids:
    print("Sentences for id", id)
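    for sentence in people_sentences[id]:
        # str.strip() removes a set of characters rather than a prefix or
        # suffix, so remove the markers with replace() instead
        print(sentence.replace("[/s]", "").replace("[s]", ""))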
    print("\n")

Sentences for id 1
 whaaaaaaat is this 
 john is outside still 
 okay do you know when i start to get by next year and then generate random sentences based on those probabilities =p 
 https://portal . stretchinternet . com/sxu/# 
 whaddaaauuppppp 


Sentences for id 2
 we've used up to ? 
 oops lol 
 god no joke me too well 
 ooooohhh juno ! ! ! 
 truthfully i have earned more money but cuz you have the time . we'd have to be like all my documents in my trunk is open , everything is on tonight❤️ 


Sentences for id 3
 make sure to laugh at something silly today 😆 
 that word isn't in my vocabulary 
 486 days . 
 16 ounces = 1 pound 
 he's 5ft 5in tall . 


Sentences for id 14
 fuck him 
 then we went out dancing for his bday zzzzzzz 
 irrelevant af dumbass google 
 nothing 
 k 


Sentences for id 54
 @google penis 
 haha i'll most likely have to be at the peak of a new couch 
 brian obviously hasn't though 
 👏👏👏👏👏 
 u wut 




#### Favorites

Very cool =D I ran this a few times and saved some of my favorites.

In [198]:
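# `favorites` maps a sender ID to the list of sentences saved from earlier runs
for key in favorites:
    for line in favorites[key]:
        # replace() removes the markers; strip("[s]") would treat its
        # argument as a set of characters and could eat trailing letters
        print(key, ":", line.replace("[/s]", "").replace("[s]", ""), sep='')

# save the favorites to an external file
with open("favorites.txt", "w+", encoding="utf-8") as file:
    for key in favorites:
        for line in favorites[key]:
            line = key + ":" + line.replace("[/s]", "").replace("[s]", "") + "\n"
            file.write(line)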

54: tell them everything 
54: joe 💜💜💜💜💜 
54: wish we could ride a goat/ram/donkey to climb mountains lol 
54: they're in the forest 
54: 🎂 
54: lit 
54: man the fps is pretty bad in some spots 
54: shakedown hawaii looks like it's puking from its eyes 
54: because i wasn't sure where you're supposed to aim it and it has multiplayer but it'd not really a party game like snip your dicks 
54: took out 3/4 of his health lol 
54: haven't watched season 5 and 6 the past two days 
54: thought i had bought the case best buy is definitely the best show ever created 
54: or are you driving an ice cream truck ? ? ? 
54: lots of stuff can kill in a suitcase tbh 
54: goodness yeah don't think they have legs you can do anything with friends lmao 
54: botw hype though lol 
54: right off the sea at the peak of a new couch 
3: today , they played the twins . 
3: author and artist dr . seuss pronounced his name to rhyme with "voice" , 16" 
3: i'm not sure 😕 
3: "all you need is love . but a little self-

#### Closing Remarks

Considering how simple the language model is, the results are pretty impressive! One major disadvantage is that the vocabulary of each bot is limited to ngrams that are present in the dataset. So for instance, if the language model consists of the following trigrams:

`('you', 'are'), {'nice': 0.5, 'funny': 0.5}`
`('he', 'is'), {'cool': 1.0}`
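
One way to check which sentences such a model can produce is to walk over a sentence's trigrams and look each one up. The sketch below does that for the two entries above. The `can_generate` helper is a hypothetical name introduced here for illustration, and it ignores the `[s]` and `[/s]` markers for brevity.

```python
# The two example entries from above: context -> {next word: probability}
model = {
    ('you', 'are'): {'nice': 0.5, 'funny': 0.5},
    ('he', 'is'): {'cool': 1.0},
}

def can_generate(model, words, n=3):
    """Return True if every n-gram in `words` has non-zero probability."""
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        next_word = words[i + n - 1]
        # A missing context or a missing next word means probability zero.
        if model.get(context, {}).get(next_word, 0) == 0:
            return False
    return True

print(can_generate(model, ['you', 'are', 'nice']))  # True
print(can_generate(model, ['he', 'is', 'cool']))    # True
print(can_generate(model, ['you', 'are', 'cool']))  # False
```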

The bots will never generate a sentence like "you are cool", even though those three words exist in the bot's vocabulary, because `'cool'` never follows `('you', 'are')` in the data. With that being said, the bots can still produce sentences that never appeared in the dataset by chaining ngrams from different messages. For example, `"she thinks he is cool"` becomes possible if the model also contains `('she', 'thinks')` followed by `'he'` and `('thinks', 'he')` followed by `'is'` (a few more ngrams than the two I listed in the example). Remember that these bots know absolutely nothing about sentence structure or parts of speech, yet it's possible for them to generate grammatically correct sentences. Pretty cool!!

As we increase `n`, the generated sentences are more likely to be grammatically and semantically correct, since the ngrams contain more information regarding the proper order of words. However, the expressiveness of the bots decreases, because in a small dataset each long context occurs rarely and is followed by only a few distinct words. The best way to improve the expressiveness of the bots is to increase the number of sentences in the dataset.

In the future, I plan to combine the Facebook and Allo datasets to improve the language models. Moreover, I'll explore other language models, including one based on a recurrent neural network. For now, I'll conclude by showing a few sentences generated with `n=2`.

In [214]:
ids = ['1', '2', '3', '14', '54']
n = 2

people_ngrams = {}
people_sentences = {}

for id in ids:
    people_ngrams[id] = estimate_ngram_probs(names_dict[id], n)
    people_sentences[id] = generate_sentences(people_ngrams[id], 15, n)
    
for id in ids:
    print("Sentences for id", id)
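    for sentence in people_sentences[id]:
        # str.strip() removes a set of characters rather than a prefix or
        # suffix, so remove the markers with replace() instead
        print(sentence.replace("[/s]", "").replace("[s]", ""))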
    print("\n")

Sentences for id 1
 same 
 lmao 
 tysm for word hahaha ! 
 ^ 
 yeah i think that's always been dark for you tell , not good night 
 spaghetti . . yeah well this ? i love chicken fried rice and it twice and work , he's probably need you should be able to send that pay though too ! ali ? 
 when you ! 
 yupp , i miss volleyball 
 it was the commute three christmases ago lol 
 now 
 ^^^^ 
 i could i don't worry about it to meet up 
 i totally called localcast , so far 😁 
 what you did a recommendation lol" 
 i was fun . 


Sentences for id 2
 thank you get on being filling in the mom lol how long it'll be able to end up your trip this one tomorrow then to being 10pm lol thats why ? i got up pookah dimples ! have that you'll do more ! 
 i don't get good night 
 good flight my graduation sunday or just watch ? 
 lmao 
 if you can ! ? ? 
 i should've asked me 
 what're you be at your game , good night 
 that's really cool . . . wednesday i'm going to look like 2 teams today , the soccer field