#### Background 
Earlier in the year, I exported my entire Facebook chat history and created an n-gram language model that I used to generate sentences. 

An n-gram is a particular way to represent sequences. In our case, these sequences will be sentences. An n-gram consists of a word followed by the next n-1 words in the sentence. For example:


`"The car drove down the street"`


Using 1-grams (aka unigrams) we end up with six unigrams: 

    ['The', 'car', 'drove', 'down', 'the', 'street']

Using 2-grams (aka bigrams) we end up with the following five bigrams:

    [['The', 'car'],
    ['car', 'drove'], 
    ['drove', 'down'], 
    ['down', 'the'],
    ['the', 'street']]
And using 3-grams (aka trigrams) results in the following four trigrams:
   
    [['The', 'car', 'drove'],
    ['car', 'drove', 'down'],
    ['drove', 'down', 'the'],
    ['down', 'the', 'street']]
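
The lists above can be reproduced with a few lines of Python. This is just a minimal sketch: the `ngrams` helper is a name made up for illustration, and it splits on whitespace only, so punctuation stays attached to its word.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    words = sentence.split()
    # A sentence of k words yields k - n + 1 n-grams (none if k < n).
    return [words[i:i + n] for i in range(len(words) - n + 1)]


sentence = "The car drove down the street"
print(ngrams(sentence, 2))
print(ngrams(sentence, 3))
```

Note that with `n=1` each unigram comes back as a one-word list rather than a bare string.
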
The language model was based on the probability of each n-gram. Essentially, all of the n-grams from the chat history are created and then the probability of each one appearing is calculated. These probabilities are used to generate sentences so that when a sentence is generated, it roughly represents the personality of the person. 
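
To make that idea concrete, here is a rough sketch of a bigram version. It is not the code behind the original bots: the function names, the `<s>` and `</s>` boundary markers, and the two sample messages are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are made-up markers for sentence start and end.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def probability(followers, current, nxt):
    """Estimate P(nxt | current) as a relative frequency of the counts."""
    total = sum(followers[current].values())
    return followers[current][nxt] / total if total else 0.0


def generate(followers, rng=random, max_words=20):
    """Sample a sentence one word at a time, weighted by the counts."""
    word, output = "<s>", []
    while len(output) < max_words:
        candidates = followers[word]
        if not candidates:  # nothing was ever seen after this word
            break
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


# Toy stand-in for one person's chat history.
messages = ["the car drove down the street", "the car stopped"]
model = train_bigram_model(messages)
print(probability(model, "the", "car"))
print(generate(model))
```

Training one model per person on only that person's messages is what would give each bot its own voice.
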

I did this for myself, my fiance, and two cousins, and we got a kick out of reading the generated sentences! We each had our own "bot" whose language model was based on our own messages, and interestingly, we were even able to determine which bot produced each sentence.

Since then, we've switched over to using Google Allo for our chats. They recently added the ability to back up and restore chats, so it's time to create another language model using our chats from the last several months!

#### Dataset Retrieval

In short, the file created by the app's backup feature didn't contain plain text. That's beneficial for a backup file because it can dramatically decrease the file's size; however, I wanted the plain text, so it was extremely disappointing. I had previously assumed that messages were stored remotely, but while researching the issue I found out that they are actually stored locally on the phone in an SQLite database. Perfect!!

Since my phone is rooted, I was able to access the file from `/data/data/com.google.android.apps.fireball/databases/fireball.db`
    
Then I downloaded Mozilla Firefox and the [SQLite Manager Add-On](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/) so that I could explore the various tables.

I ultimately wanted a dataset in CSV or JSON format where each row/record contains a message, the sender, and the text size. It turns out that the two tables containing this information are named `messages` and `fireball_users`. Here's the SQL statement I used to extract just the fields that I wanted:

    SELECT profile_display_name,
           text,
           text_size,
           received_timestamp,
           content_type
    FROM   (SELECT sender_id,
                   text,
                   text_size,
                   content_type,
                   received_timestamp
            FROM   messages) cmnts
           JOIN (SELECT _id,
                        profile_display_name
                 FROM   fireball_users) users
             ON cmnts.sender_id = users._id
    ORDER  BY received_timestamp DESC


And then SQLite Manager allows exporting to CSV, which is just what I did! For the blog posts, however, I'll be using a dataset that contains our unique sender_ids rather than our real names.
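
If you'd rather skip the browser add-on, the same export could probably be scripted with Python's built-in `sqlite3` and `csv` modules. This is only a sketch, not what I actually ran: `export_messages` is a made-up name, and it assumes you've copied `fireball.db` off the phone first.

```python
import csv
import sqlite3

# Same query as above, run directly against a copy of the database.
QUERY = """
SELECT profile_display_name, text, text_size, received_timestamp, content_type
FROM   (SELECT sender_id, text, text_size, content_type, received_timestamp
        FROM   messages) cmnts
       JOIN (SELECT _id, profile_display_name
             FROM   fireball_users) users
         ON cmnts.sender_id = users._id
ORDER  BY received_timestamp DESC
"""


def export_messages(db_path, csv_path):
    """Run the join and write every row to a CSV file; return the row count."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(QUERY).fetchall()
    finally:
        connection.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        # QUOTE_ALL wraps every field in double quotes, numbers included.
        csv.writer(file, quoting=csv.QUOTE_ALL).writerows(rows)
    return len(rows)
```

Calling `export_messages("fireball.db", "allo_messages.csv")` would then write one quoted row per message, newest first (both file names are placeholders).
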

    # Print only the first line of the anonymized export.
    count = 0
    with open("allo_messages_anon.csv", encoding="utf-8") as file:
        for line in file:
            if count < 1:
                print(line)
                count += 1

Output:

    "2","Weird lol","16","1494194109844","text/plain"
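
Since every field is quoted, the `csv` module is a safer way to split rows than `str.split(",")`, which would break on messages that contain commas. Here's a small sketch; the `Message` field names are my assumption based on the order of the SELECT above, with the sender_id standing in for the display name.

```python
import csv
from collections import namedtuple

# Field names are assumed from the SELECT order (sender_id replaces the name).
Message = namedtuple(
    "Message", "sender_id text text_size received_timestamp content_type"
)


def read_messages(lines):
    """Yield one Message per CSV row; lines can be an open file or a list of strings."""
    for row in csv.reader(lines):
        if row:  # skip blank lines
            yield Message(*row)


sample = ['"2","Weird lol","16","1494194109844","text/plain"']
message = next(read_messages(sample))
print(message.sender_id, message.text)  # prints: 2 Weird lol
```

When reading the real file, open it with `newline=""` so that messages containing line breaks are parsed as a single row.
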



Perfect! The next step is iterating through the dataset and creating the n-grams with their probabilities.