## An alethiometer for the modern age

*The Golden Compass* was one of my favorite books growing up. It has lots of your standard young adult fantasy epic elements -- a plucky heroine, talking animals, authoritarian villains -- but it also touches on some weighty theological themes. The author described it as a deliberate inversion of Milton's *Paradise Lost* (and not for nothing, at the end of the series the protagonists save the world by killing God and re-committing original sin).
A central element in the book is the existence of the eponymous "golden compass", a literal *machina qua deus ex* which answers questions through divine intervention. The compass presents its answers as a series of ideograms: its face is ringed with symbols, and when posed a question its needle sweeps around the face, selecting the symbols which make up the answer. I always wanted one of those when I was a kid but, alas, back then powerful artifacts with oracular capabilities were in short supply. Nowadays we have smartphones and twitter, though, so better late than never! In this post I'm going to describe a twitter bot I made which answers questions with emoji (hence [*alethiomoji*](https://github.com/hinnefe2/alethiomoji), the name of the project; the golden compass was also called an alethiometer). This is where we're headed:

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/alethiomoji">@alethiomoji</a> is this the end of the the world?</p>&mdash; Henry Hinnefeld (@DrJSomeday) <a href="https://twitter.com/DrJSomeday/status/824076525817491456">January 25, 2017</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-lang="en"><p lang="und" dir="ltr"><a href="https://twitter.com/DrJSomeday">@DrJSomeday</a> 🔚 🌐 ⏳</p>&mdash; Emoji Golden Compass (@alethiomoji) <a href="https://twitter.com/alethiomoji/status/824076719292317698">January 25, 2017</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

The bot has three main parts:

1. A sentence parser which pulls out semantically important words.
2. A matching algorithm which finds emoji that are related to a given word.
3. Some machinery to read from and post to twitter.

Note that the bot doesn't actually try to 'answer' the question in any meaningful way: under the hood it's just finding emoji which are related to the important words in the question and then adding in an extra emoji that can be interpreted as a yes / no / maybe.

I used existing Python modules for parts 1 ([stat_parser](https://github.com/emilmont/pyStatParser)) and 3 ([twython](https://twython.readthedocs.io/en/latest/)), so here we'll focus on part 2.
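To make the overall flow concrete, here is a minimal sketch of how the pieces described above might fit together. Everything in it is hypothetical: `extract_keywords`, `match_emoji`, `sketch_answer`, `EMOJI_LOOKUP`, `VERDICTS` and `STOPWORDS` are stand-ins invented for illustration, not functions or data from the alethiomoji repository, and a crude stopword filter takes the place of the real sentence parser.

```python
import random

# Hypothetical stand-in for the word -> emoji matcher (part 2).
EMOJI_LOOKUP = {"end": "🔚", "world": "🌐"}

# Hypothetical pool of emoji that can be read as yes / no / maybe.
VERDICTS = ["👍", "👎", "🤷"]

# Hypothetical list of words to ignore; the real bot uses a sentence parser.
STOPWORDS = {"is", "this", "the", "of", "a", "an"}


def extract_keywords(question):
    """Crude stand-in for the sentence parser (part 1): drop stopwords."""
    words = question.lower().strip("?!. ").split()
    return [w for w in words if w not in STOPWORDS]


def match_emoji(word):
    """Return an emoji for the word, or None if nothing matches."""
    return EMOJI_LOOKUP.get(word)


def sketch_answer(question, rng=random):
    """Match each keyword to an emoji, then append a verdict emoji."""
    matched = [match_emoji(w) for w in extract_keywords(question)]
    emoji = [e for e in matched if e is not None]
    emoji.append(rng.choice(VERDICTS))
    return " ".join(emoji)


answer = sketch_answer("is this the end of the world?", rng=random.Random(0))
print(answer)
```

The first two emoji come from the keywords "end" and "world"; the last one is picked at random from the verdict pool, which is why the bot's replies can be read as an answer without the bot understanding the question.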


### Matching words to emoji

The task at hand is to come up with an algorithm which finds one or more emoji which are related to a given word.
One place to start is with the [official descriptions](http://unicode.org/emoji/charts/full-emoji-list.html) of each emoji. Conveniently for us, the folks at [emojityper.com](https://emojityper.com/) have already scraped all the descriptions into a nice, tidy csv [file](https://github.com/emojityper/emojityper.github.io/blob/master/res/emoji/annotations.txt).

As a first attempt, we'll count how many times each word appears in each emoji's description. To do the actual word counting we'll use scikit-learn's `CountVectorizer`, which exists for this purpose.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    annot = pd.read_csv('cleaned_annotations.txt', index_col='unicode')
    vectorizer = CountVectorizer().fit(annot.description)

    # CountVectorizer.transform returns a scipy sparse matrix of word counts;
    # here we're turning that into a nicely labelled pandas DataFrame
    def get_count_df(cv, series):
        vals = cv.transform(series)
        # order the column labels by their column index in the count matrix
        cols = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
        idx = series.index
        return pd.DataFrame(data=vals.toarray(), columns=cols, index=idx)

    df = get_count_df(vectorizer, annot.description)

    # display a small subset of the DataFrame
    df.iloc[:5][['smiling', 'tears', 'grinning']].join(annot['description'], how='left')

| unicode | smiling | tears | grinning | description |
|---|---|---|---|---|
| 😀 | 0 | 0 | 1 | GRINNING FACE |
| 😁 | 1 | 0 | 1 | GRINNING FACE WITH SMILING EYES |
| 😂 | 0 | 1 | 0 | FACE WITH TEARS OF JOY |
| 😃 | 1 | 0 | 0 | SMILING FACE WITH OPEN MOUTH |
| 😄 | 2 | 0 | 0 | SMILING FACE WITH OPEN MOUTH AND SMILING EYES |


    word = 'dog'
    word_vec = get_count_df(vectorizer, pd.Series(word, name=word))

    # score every emoji against the word with a dot product
    matches = df.dot(word_vec.transpose())
    matches.sort_values(0, ascending=False).head(5)

| unicode | 0 |
|---|---|
| 🐕 | 1 |
| 🐶 | 1 |
| 🌭 | 1 |
| 😀 | 0 |
| 📤 | 0 |


This works pretty well for words which show up in the official emoji descriptions. However, there are only about 1,500 different words used in the descriptions, and a word outside that vocabulary can't match anything:

    len(vectorizer.vocabulary_)

    1458
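To see what the limited vocabulary means in practice, here is a small self-contained sketch. It assumes three toy descriptions in place of the real annotations file, and the helper `score` is a name made up for this example. A word that never appears in any description is ignored by `CountVectorizer.transform`, so its vector is all zeros and every dot product is zero.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy descriptions standing in for the real annotations file (an assumption).
descriptions = pd.Series(
    ["DOG FACE", "HOT DOG", "GRINNING FACE"],
    index=["🐶", "🌭", "😀"],
)
vectorizer = CountVectorizer().fit(descriptions)
emoji_counts = vectorizer.transform(descriptions)  # sparse, one row per emoji


def score(word):
    """Dot product of the word's count vector with every emoji's vector."""
    word_vec = vectorizer.transform([word])  # unknown words become all zeros
    scores = (emoji_counts @ word_vec.T).toarray().ravel()
    return pd.Series(scores, index=descriptions.index)


in_vocab = score("dog")        # 'dog' appears in two descriptions
out_of_vocab = score("puppy")  # 'puppy' appears in none
print(in_vocab.to_dict())
print(out_of_vocab.to_dict())
```

Here "dog" scores 1 against 🐶 and 🌭, while "puppy" scores 0 against everything even though it is obviously related.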

In more mathematical terms, we're trying to map elements of one set (words) onto elements of a different set (emoji). To do this I'm going to embed all my elements (words and emoji) in the same vector space, and then see which elements are close to each other in that space. 

That's a bit abstract though, so let's get a little more concrete: what I'm trying to do here is assign each word and each emoji a series of numbers which somehow describe the word/emoji. Then, to find an emoji which matches a given word, I search for the emoji which has numbers that are similar to those of the supplied word. 

For example, suppose I have the following emoji descriptions:

    ❤️: HEART
    😍: HEART FACE
    😃: FACE

I can assign each emoji a series of numbers (i.e. a vector) by counting how many times each word shows up in its description, like so:

    import pandas as pd

    # create vectors for each emoji by counting the occurrence of each word in its description
    df = pd.DataFrame(data=[[1, 0], [1, 1], [0, 1]],
                      index=['❤️', '😍', '😃'],
                      columns=['HEART', 'FACE'])
    df

| | HEART | FACE |
|---|---|---|
| ❤️ | 1 | 0 |
| 😍 | 1 | 1 |
| 😃 | 0 | 1 |


Then, to find emoji which match the word 'heart' I assign the word 'heart' a vector in the same way, and compare its vector to those of all the emoji using the dot product:

    # generate a vector for the word 'heart' the same way as we did for the emoji
    heart_word = pd.DataFrame(data=[[1, 0]],
                              index=['heart'],
                              columns=['HEART', 'FACE'])
    heart_word

| | HEART | FACE |
|---|---|---|
| heart | 1 | 0 |


    # compare the vector for the word 'heart' to the emoji vectors using the dot product
    df.dot(heart_word.transpose())

| | heart |
|---|---|
| ❤️ | 1 |
| 😍 | 1 |
| 😃 | 0 |


The values we end up with tell us how similar the word 'heart' is to each of our emoji. According to this, we've matched the word 'heart' to the emoji ❤️ and 😍 but not to 😃.

This is the main idea behind each of the more detailed algorithms described below:

- Assign each emoji a series of numbers (i.e. a vector).
- When given a word to match, assign it a vector in the same way.
- Compare the vector for the given word with the vectors for all the emoji.
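The three steps above can be sketched as one small function. This is only an illustration under simple assumptions (whitespace tokenization, vectors built from word counts, dot product as the comparison); `count_vector` and `rank_emoji` are hypothetical names, not part of the bot.

```python
import pandas as pd


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.upper().split()
    return [tokens.count(word) for word in vocabulary]


def rank_emoji(word, descriptions):
    """Score every emoji against the word using the dot product."""
    vocabulary = sorted({w for d in descriptions.values() for w in d.upper().split()})
    # Step 1: assign each emoji a vector.
    emoji_vectors = pd.DataFrame(
        [count_vector(d, vocabulary) for d in descriptions.values()],
        index=list(descriptions),
        columns=vocabulary,
    )
    # Step 2: assign the word a vector in the same way.
    word_vector = pd.Series(count_vector(word, vocabulary), index=vocabulary)
    # Step 3: compare the word's vector with every emoji's vector.
    return emoji_vectors.dot(word_vector).sort_values(ascending=False)


descriptions = {"❤️": "HEART", "😍": "HEART FACE", "😃": "FACE"}
scores = rank_emoji("heart", descriptions)
print(scores)
```

Swapping in a different way of building the vectors only changes steps 1 and 2; the comparison in step 3 stays the same.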

The trick is coming up with a good way to assign vectors. Below I'll describe two different methods I used in the bot: the first is a slightly improved version of the example above, and the second uses [word2vec](https://en.wikipedia.org/wiki/Word2vec).
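As a rough illustration of why dense vectors such as those from word2vec can help, the sketch below compares made-up three-dimensional vectors with cosine similarity. The numbers are invented for this example and are not real embeddings, and cosine similarity is one common comparison that is assumed here rather than taken from the bot.

```python
import numpy as np

# Made-up 3-dimensional vectors purely for illustration; real word2vec
# embeddings are learned from text and have many more dimensions.
toy_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "cake": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


puppy_dog = cosine_similarity(toy_vectors["puppy"], toy_vectors["dog"])
puppy_cake = cosine_similarity(toy_vectors["puppy"], toy_vectors["cake"])
print(puppy_dog > puppy_cake)
```

With vectors like these, a word that never appears in an emoji description can still land close to one that does.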


### Assigning vectors using emoji descriptions

Here we return to the [official descriptions](http://unicode.org/emoji/charts/full-emoji-list.html) of each emoji, using the same csv [file](https://github.com/emojityper/emojityper.github.io/blob/master/res/emoji/annotations.txt) from [emojityper.com](https://emojityper.com/) as in the first attempt above.
To do the actual word counting we'll again use scikit-learn's `CountVectorizer`, which exists for this purpose.

    from sklearn.feature_extraction.text import CountVectorizer

    annot = pd.read_csv('cleaned_annotations.txt', index_col='unicode')
    vectorizer = CountVectorizer().fit(annot.description)

    # CountVectorizer.transform returns a scipy sparse matrix of word counts;
    # here we're turning that into a nicely labelled pandas DataFrame
    def get_count_df(cv, series):
        vals = cv.transform(series)
        # order the column labels by their column index in the count matrix
        cols = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
        idx = series.index
        return pd.DataFrame(data=vals.toarray(), columns=cols, index=idx)

    df = get_count_df(vectorizer, annot.description)

    # display a small subset of the DataFrame
    df.iloc[:5][['smiling', 'kissing', 'grinning']]

| unicode | smiling | kissing | grinning |
|---|---|---|---|
| 😀 | 0 | 0 | 1 |
| 😁 | 1 | 0 | 1 |
| 😂 | 0 | 0 | 0 |
| 😃 | 1 | 0 | 0 |
| 😄 | 2 | 0 | 0 |


    word = 'cake'
    word_vec = get_count_df(vectorizer, pd.Series(word, name=word))

    # score every emoji against the word with a dot product
    matches = df.dot(word_vec.transpose())
    matches.sort_values(0, ascending=False).head(5)

| unicode | 0 |
|---|---|
| 🍥 | 2 |
| 🎂 | 1 |
| 😀 | 0 |
| 💵 | 0 |
| 📧 | 0 |
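The top five rows for 'cake' include emoji with a score of 0, which are not really matches at all. One possible refinement, which is not necessarily what the bot does, is to drop zero-score rows before taking the top results. The sketch below uses a toy `matches` DataFrame with the same shape as the one above, and `top_matches` is a hypothetical helper name.

```python
import pandas as pd

# Toy scores shaped like the `matches` DataFrame above (one column named 0).
matches = pd.DataFrame(
    {0: [2, 1, 0, 0, 0]},
    index=["🍥", "🎂", "😀", "💵", "📧"],
)


def top_matches(matches, n=5):
    """Keep only emoji with a positive score, best first, at most n rows."""
    nonzero = matches[matches[0] > 0]
    return nonzero.sort_values(0, ascending=False).head(n)


best = top_matches(matches)
print(best)
```

If `best` comes back empty, the word has no match in the description vocabulary and some other method is needed.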


In this simplified example we are embedding our elements (words and emoji) in a nine-dimensional space.

    cv.transform(df[0]).toarray()

    array([[0, 0, 1, 0, 0, 1, 0, 0, 0],
           [0, 1, 0, 1, 1, 0, 0, 1, 1],
           [1, 1, 1, 0, 0, 0, 1, 1, 1]])

    import alethio

    alethio.generate_answer('How can I prepare for the coming apocalypse?')

    '✨ 💪 🔪'