# Predicting Emojis from Twitter Data

Emoji have become increasingly prominent in social media. Since their first appearance in Japan in the 1990s, their use has spread widely: a 2015 report found that over 92 percent of the online population uses emoji [1]. Given this trend, many NLP applications can benefit from the ability to interpret emoji.

In this project, we implement and train the following models: a bidirectional LSTM, a CNN, and a bag-of-words classifier. Our objective is to predict which of the 5, 10, or 20 most frequently used emoji belongs to a given sentence.

## 1. Data Set

We acquired a pre-processed dataset containing 584,600 tweets posted in the US between October 2015 and May 2016 [2]. The dataset consists of three sets, covering the 5, 10, and 20 most common emojis respectively. Each set is split into training, validation, and test sets; the training sets contain between 200,000 and 500,000 tweets, and the validation and test sets each contain tens of thousands.

Preprocessing consisted of replacing user mentions with the symbol "@user" and replacing words that occur fewer than 5 times with the symbol "< unk >". Punctuation marks such as commas and quotation marks are separated from words by a space and treated as words themselves.
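Since the dataset arrived already pre-processed, the exact procedure is not known to us. The following is only a minimal sketch of what such a step might look like. The regular expressions, the function names `tokenize` and `replace_rare`, and the spelling of the unknown token are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed patterns: a mention is "@" followed by word characters,
# and only a small set of punctuation marks is split off.
MENTION = re.compile(r"@\w+")
PUNCT = re.compile(r"([,.!?;:\"'()])")


def tokenize(tweet):
    """Replace user mentions and split punctuation into separate tokens."""
    tweet = MENTION.sub("@user", tweet)
    tweet = PUNCT.sub(r" \1 ", tweet)  # surround punctuation with spaces
    return tweet.split()


def replace_rare(tokenized_tweets, min_count=5, unk="<unk>"):
    """Map every token seen fewer than min_count times to the unknown symbol."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return [[tok if counts[tok] >= min_count else unk for tok in tweet]
            for tweet in tokenized_tweets]


print(tokenize('@alice said "hi", ok'))
```

Note that the rare-word counts are taken over whatever collection is passed in; in practice one would probably count over the training set only.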

## 2. Baseline

For the baseline, we use a bag-of-words classifier in which each message is represented as a vector of its most informative tokens, selected using term frequency-inverse document frequency (TF-IDF). L2-regularized logistic regression is used to make the predictions.
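To illustrate how TF-IDF ranks tokens by informativeness, here is a small standard-library sketch. It is not the baseline's actual code: the smoothed IDF formula, the per-message top-k selection in `top_tokens`, and both function names are assumptions, and the logistic regression step is not shown.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {token: tf-idf score} dictionary per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {tok: math.log((1 + n_docs) / (1 + n)) + 1.0 for tok, n in df.items()}
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts within this document
        scores.append({tok: count * idf[tok] for tok, count in tf.items()})
    return scores


def top_tokens(doc_scores, k=3):
    """Pick the k highest-scoring tokens of one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


docs = [["i", "love", "pizza"], ["i", "love", "you"], ["i", "hate", "rain"]]
scores = tfidf(docs)
print(top_tokens(scores[0], k=1))
```

In this toy corpus "i" occurs in every message and therefore receives the lowest score, while a token unique to one message such as "pizza" receives the highest.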

## 3. Convolutional Neural Network

Another model noted to do well is a convolutional neural network [2][3]. The model passes 64 filters of widths 3, 4, and 5 over a sequence of word embeddings (of dimension 50), after which max pooling is applied to produce a fixed-size output. This output is then fed directly into a fully connected softmax layer that predicts the emoji class. During training, the fully connected layer is subjected to dropout. Embeddings were initialized using pre-trained GloVe embeddings trained on Twitter data. Words without matching GloVe embeddings were initialized from a uniform distribution between -1 and 1.
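The sketch below illustrates why convolution followed by max pooling gives a fixed-size output regardless of tweet length. It is a plain-Python illustration rather than the model's code: it assumes 64 filters per width (192 in total), omits biases and any activation function, and uses random weights. The names `conv_max_pool` and `random_filters` are hypothetical.

```python
import random


def conv_max_pool(embeddings, filters):
    """Slide each filter over the sequence and keep its maximum response.

    embeddings: list of seq_len vectors, each of length emb_dim.
    filters: list of filters; each is a list of `width` weight vectors.
    Returns one value per filter. Requires seq_len >= the widest filter.
    """
    pooled = []
    for filt in filters:
        width = len(filt)
        responses = []
        for start in range(len(embeddings) - width + 1):
            window = embeddings[start:start + width]
            # Dot product between the filter and the current window.
            responses.append(sum(w * x
                                 for w_row, x_row in zip(filt, window)
                                 for w, x in zip(w_row, x_row)))
        pooled.append(max(responses))  # max over all positions
    return pooled


def random_filters(n_filters, widths, emb_dim, rng):
    """Create n_filters random filters for every width."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(w)]
            for w in widths for _ in range(n_filters)]


rng = random.Random(0)
filters = random_filters(n_filters=64, widths=(3, 4, 5), emb_dim=50, rng=rng)
for seq_len in (8, 20):
    tweet = [[rng.uniform(-1, 1) for _ in range(50)] for _ in range(seq_len)]
    print(seq_len, "tokens ->", len(conv_max_pool(tweet, filters)), "features")
```

Under the stated assumption, both tweets yield 192 features, which is what allows a fully connected layer of fixed size to follow.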

<img src='images/cnn_model.png' />

In addition to the basic CNN, additional fully connected layers and a highway network were introduced between the convolutional layer output and the fully connected layer. Deep highway networks are noted to train faster than plain deep neural networks and to produce similar outputs for semantically similar words and phrases whose inputs differ vastly [3]. A highway layer is defined by the following equation, where $\sigma$ is the sigmoid function and $\circ$ denotes element-wise multiplication.

$$y = \text{relu}(W_H x + b_H) \circ \sigma(W_T x + b_T) + (1 - \sigma(W_T x + b_T)) \circ x$$

Since the output of a highway layer has the same dimension as its input, $W_H$ and $W_T$ are square matrices.

Dropout is applied between every layer from the convolutional output up to, but not including, the softmax layer in order to regularize the model. Weights are initialized using the Glorot uniform distribution. Biases are initialized to 0, except for $b_T$, which is initialized from a uniform distribution between -4 and -2. With a negative $b_T$, the gate $\sigma(W_T x + b_T)$ starts out close to 0, so each highway layer initially passes its input through almost unchanged. All models were trained using the Adam optimizer.
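The effect of the negative $b_T$ initialization can be checked with a small standard-library sketch of the highway equation above. The helper names and the toy weights are assumptions chosen for illustration; they are not taken from the project code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway(x, W_H, b_H, W_T, b_T):
    """y = relu(W_H x + b_H) * t + (1 - t) * x, with t = sigmoid(W_T x + b_T)."""
    h = [max(0.0, v) for v in affine(W_H, x, b_H)]
    t = [sigmoid(v) for v in affine(W_T, x, b_T)]
    return [hi * ti + (1.0 - ti) * xi for hi, ti, xi in zip(h, t, x)]


# Toy 3-dimensional example: identity transform weights, zero gate weights.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 0.5]

# b_T = -3 lies inside the initialization range of -4 to -2.
print(highway(x, identity, [0.0] * 3, zeros, [-3.0] * 3))
```

With the gate bias at -3 the gate value is about 0.05, so the printed output stays close to `x`; a large positive bias would instead let the ReLU transform dominate.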

## 4. Long Short-Term Memory

We apply both uni- and bidirectional LSTM models to our sequence classification problem, using GloVe word embeddings.

LSTM neural networks are being actively researched, as they show promising results and can provide state-of-the-art performance. Recently, engineers at Google greatly improved their voice recognition and transcription systems by incorporating LSTM RNNs, which outperformed DNNs and conventional RNNs (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf).

At the outset of development, I began my implementation in the PyTorch framework. I found tutorials with in-depth information that helped me understand embeddings and neural networks in NLP in more detail, and I also studied the word language model from the PyTorch tutorials. However, running this model came with a big surprise: training took a long time, about 4 hours per epoch or 17 minutes per mini-batch. My solution was to obtain access to a CUDA-enabled GPU to perform the computations in parallel, which sped up the process roughly 56-fold.

However, I decided to switch to a higher-level framework, Keras, to express the model structure more simply and to stay consistent with the software used by my team. I followed the approach in the paper "Are Emojis Predictable?" by Francesco Barbieri et al. [2]. Since many implementation details were omitted from the paper, I conducted many experiments to arrive at this particular solution.

First, I tokenize each tweet, map its tokens to integer indices, and pad or truncate it to a uniform length of 35. Then I use GloVe word embeddings to represent all the words in the training set; the embedding layer is the first hidden layer of the network. It takes an input dimension (the vocabulary size), an output dimension (the size of the embedding vector for each word, 100), and an input length (the maximum length of each tweet, 35). The embedding weights are also learned during training. Next, I introduce a dropout layer, in which random neurons are dropped during training with a given probability to prevent overfitting. Initially I used a plain LSTM layer, but switching to a bidirectional LSTM improved the results for the set of 5 emoji. This is likely because a Bi-LSTM reads the input in both the forward and backward directions and can therefore provide more context. This layer has 64 hidden units, as this number seemed to work best. The output of the final stage is passed to a softmax function to determine the most probable emoji. Additionally, I use Adam, an optimization algorithm popular in NLP. Adam adapts the learning rate for each parameter individually, unlike plain SGD, which applies a single learning rate to all parameters.
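The indexing and padding step described above could look roughly like the following standard-library sketch. The reserved ids `PAD_ID` and `UNK_ID`, the function names, and the choice to pad and truncate at the end of the tweet are all assumptions; the actual model relies on Keras utilities, whose defaults may differ.

```python
PAD_ID = 0  # assumed id for padding positions
UNK_ID = 1  # assumed id for tokens not seen in the training set


def build_vocab(tokenized_tweets):
    """Assign each distinct token an integer id, starting after the reserved ids."""
    vocab = {}
    for tweet in tokenized_tweets:
        for tok in tweet:
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def encode(tweet, vocab, max_len=35):
    """Convert tokens to ids, then truncate or pad to exactly max_len entries."""
    ids = [vocab.get(tok, UNK_ID) for tok in tweet][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab([["so", "happy", "today"], ["happy", "birthday"]])
encoded = encode(["happy", "birthday", "friend"], vocab)
print(len(encoded), encoded[:5])
```

Every encoded tweet has length 35, so a batch of tweets forms a rectangular integer array that an embedding layer can consume directly.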

Testing showed that my Bi-LSTM/LSTM model does not perform as well as the baseline model. However, my model outperforms the one described in the paper. As with the CNN model, I noticed that the Bi-LSTM tends to favor the most frequently used emoji, specifically “tears of joy.” This is likely due to the uneven number of training samples per emoji. Accuracy could likely be improved further by splitting or removing hashtags made of concatenated words. Overall, I consider my work a success.

## 5. Results

We tested our models using a weighted F1 score as an indicator of performance.
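For reference, here is a minimal sketch of a weighted F1 score, assuming the usual definition in which per-class F1 scores are averaged with weights proportional to each class's number of true samples. The function name `weighted_f1` is ours, and the scores in the table below were not necessarily computed with this code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true sample count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is at least n > 0.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


# A classifier that always predicts the most common class (class 0).
print(weighted_f1([0, 0, 0, 1], [0, 0, 0, 0]))
```

Because the weights follow class frequency, a model that mostly predicts the most common emoji can still obtain a moderate score, which is worth keeping in mind when reading the results.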

<center><b>F1 Scores by Model per Top N Emojis</b></center>

| Top N | baseline | CNN | Resampled CNN | Highway CNN | LSTM | Bi-LSTM |
|--|----------|----------|----------|----------|----------|----------|
|5 | 0.592061 | 0.549705 | <b>0.595257</b> | 0.564256 | 0.59 ||
|10| 0.441736 | <b>0.447219</b> | 0.390082 | 0.423835 | 0.44 ||
|20| <b>0.347743</b> | 0.208166 | 0.292820 | 0.284491 | 0.34 ||

Unfortunately, none of our models performed significantly better than our baseline model.

### 5.1 Baseline

### 5.2 CNN

CNN | Resampled CNN
:-:|:-:
<img src="images/5_confusion.png" width="250"> | <img src="images/5_resample.png" width="250">

<b>Figure 2.</b> Confusion matrix of top 5 emojis for various CNNs. The most common emoji is denoted as the class 0 while the least common is denoted as the class 4.

Here we can compare the results of the various CNNs. As one would expect, the resampled CNN does a better job of predicting the less common classes; however, this comes at the cost of confusing the first and third most common emojis with each other. However, both of these perform better than a CNN with a single highway layer since it fails to distinguish the thir

### 5.3 LSTM

## 6. Conclusion

## References

[1]: http://emogi.com/documents/Emoji_Report_2015.pdf "Emoji Report 2015", emogi.com, 2015

[2]: https://arxiv.org/pdf/1702.07285.pdf F. Barbieri, M. Ballesteros, H. Saggion, "Are Emojis Predictable?", 2017

[3]: https://web.stanford.edu/class/cs224n/reports/2762064.pdf L. Zhao, C. Zeng, "Using Neural Networks to Predict Emoji Usage from Twitter Data"

[4]: https://arxiv.org/abs/1408.5882 Y. Kim, "Convolutional Neural Networks for Sentence Classification", 2014

[5]: https://arxiv.org/abs/1505.00387 R. Srivastava, K. Greff, J. Schmidhuber, "Highway Networks", 2015 

## Code

https://github.com/neonrights/emoji_predictor