# Part 3: Report

### Raw data preprocessing

First, I converted the raw JSON data into a pandas DataFrame and joined it with the emotion labels. I then split the whole data set into a training set and a validation set. The validation set is never used during training; it is used only for model validation.
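The loading and splitting steps can be sketched as follows. This is only an illustration: the field names (`tweet_id`, `text`), the toy records, the 20% validation fraction, and the helper `train_validation_split` are assumptions rather than details from the original notebook. The report used pandas; plain Python is used here so that the sketch runs without extra dependencies.

```python
import json
import random

# Hypothetical raw records and labels; the real field names are not given in the report.
raw_lines = [
    '{"tweet_id": "t1", "text": "so happy today"}',
    '{"tweet_id": "t2", "text": "this is awful"}',
    '{"tweet_id": "t3", "text": "cannot wait for friday"}',
    '{"tweet_id": "t4", "text": "i miss you"}',
    '{"tweet_id": "t5", "text": "what a surprise"}',
]
emotions = {"t1": "joy", "t2": "disgust", "t3": "anticipation",
            "t4": "sadness", "t5": "surprise"}

# Parse each JSON line and attach its emotion label by tweet id.
records = [json.loads(line) for line in raw_lines]
for rec in records:
    rec["emotion"] = emotions[rec["tweet_id"]]


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fixed fraction for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * validation_fraction)
    return rows[n_val:], rows[:n_val]


train_rows, val_rows = train_validation_split(records)
```

Fixing the seed keeps the held-out set identical across runs, so validation scores from different models stay comparable.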

### Text preprocessing

After some exploration, the next step was to process the text. Since I planned to use pre-trained GloVe Twitter word embedding vectors, I tried to increase the coverage rate by replacing tokens that do not appear in the embedding vocabulary, for example @user mentions, hashtags, and emojis.

1. Convert all text to lowercase.
2. Delete the <LH\> token from the text.
3. Tokenize the text with the nltk tweet tokenizer, which does not split tokens such as @user or #hashtag into two tokens.
4. Replace each @user token with a "<user\>" token, and replace each #something with two tokens: "<hashtag\>" and "something".
5. Replace common emojis with corresponding adjectives.
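The five steps above can be sketched roughly as follows. Two parts are assumptions: the regular expression `TOKEN_RE` is only a dependency-free stand-in for nltk's tweet tokenizer, and the `EMOJI_TO_WORD` map is hypothetical, since the report does not list the actual emoji replacements.

```python
import re

# Hypothetical emoji-to-adjective map; the real replacements are not listed in the report.
EMOJI_TO_WORD = {"😂": "funny", "😍": "lovely", "😭": "sad"}

# Stand-in for nltk's tweet tokenizer: keeps @mentions and #hashtags as single tokens.
TOKEN_RE = re.compile(r"[@#]\w+|\w+(?:'\w+)?|[^\w\s]")


def preprocess(text):
    """Lowercase, drop <LH>, tokenize, then normalize mentions, hashtags and emojis."""
    text = text.lower()                    # step 1
    text = text.replace("<lh>", " ")       # step 2 (the text is already lowercased)
    tokens = TOKEN_RE.findall(text)        # step 3
    out = []
    for tok in tokens:
        if tok.startswith("@") and len(tok) > 1:
            out.append("<user>")                   # step 4: mentions
        elif tok.startswith("#") and len(tok) > 1:
            out.extend(["<hashtag>", tok[1:]])     # step 4: hashtags become two tokens
        elif tok in EMOJI_TO_WORD:
            out.append(EMOJI_TO_WORD[tok])         # step 5: emojis become adjectives
        else:
            out.append(tok)
    return out


example = preprocess("<LH> @Bob I love #Python 😂")
```

Mapping mentions and hashtags to `<user>` and `<hashtag>` matters because these are tokens that the GloVe Twitter vocabulary contains, whereas an individual handle such as `@bob` almost never is.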

### Train a naive Bayes classifier as a baseline

The next step was to create a baseline with a naive Bayes classifier. I used only the top 50k unigram and bigram bag-of-words (BOW) features. The naive Bayes classifier scores .443 macro F1 on the validation set.

|            | precision | recall | f1-score | support  |
|-----------:|----------:|-------:|---------:|---------:|
|       anger|0.27       |0.38    |0.31      |      5982|
|anticipation|0.60       |0.54    |0.57      |     37269|
|     disgust|0.37       |0.55    |0.44      |     20866|
|        fear|0.31       |0.53    |0.39      |      9447|
|         joy|0.67       |0.56    |0.61      |     77286|
|     sadness|0.48       |0.45    |0.47      |     29160|
|    surprise|0.31       |0.30    |0.31      |      7343|
|       trust|0.46       |0.44    |0.45      |     30982|
||
|    accuracy|           |        |0.51      |    218335|
|   macro avg|0.43       |0.47    |0.44      |    218335|
|weighted avg|0.54       |0.51    |0.52      |    218335|
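To make the baseline concrete, here is a from-scratch sketch of a multinomial naive Bayes classifier over unigram and bigram counts with a capped vocabulary. The report does not say which implementation was used, so this stands in for a library classifier; the class `NaiveBayes`, the helpers, the add-one smoothing, and the toy tweets are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens):
    """Return the unigrams plus the bigrams of a token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_vocab(docs, max_features=50_000):
    """Keep the most frequent n-grams (the report used the top 50k)."""
    counts = Counter(g for doc in docs for g in ngrams(doc))
    return {g for g, _ in counts.most_common(max_features)}


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed setting)."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(g for g in ngrams(doc) if g in vocab)
        return self

    def predict(self, doc):
        feats = [g for g in ngrams(doc) if g in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[g] + 1) / denom) for g in feats)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train_docs = [["i", "love", "this"], ["so", "happy", "today"],
              ["i", "hate", "this"], ["so", "angry", "today"]]
train_labels = ["joy", "joy", "anger", "anger"]
vocab = build_vocab(train_docs)
model = NaiveBayes().fit(train_docs, train_labels, vocab)
prediction = model.predict(["i", "love", "today"])
```

Bigrams such as "i love" let the model pick up short phrases that single words miss, which is one reason to include them in a bag-of-words baseline.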

### Train an RNN model on sampled data

Next, I built an RNN model. Since training can be time-consuming, I sampled only 100k tweets for the experiments.

#### Load the pre-trained word embedding model

As mentioned before, I used pre-trained GloVe Twitter word embedding vectors.
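GloVe vectors are distributed as plain text, one token per line followed by its vector components. A minimal sketch of loading them and measuring how many token occurrences are covered is shown below; the helpers `load_glove` and `coverage` and the three-dimensional toy vectors are assumptions, not code from the notebook.

```python
import io


def load_glove(lines):
    """Parse GloVe text format: a token followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def coverage(token_lists, vectors):
    """Fraction of token occurrences that have a pre-trained vector."""
    tokens = [tok for toks in token_lists for tok in toks]
    return sum(tok in vectors for tok in tokens) / len(tokens)


# Tiny in-memory stand-in for a GloVe file (3 dimensions, invented values).
fake_file = io.StringIO(
    "<user> 0.1 0.2 0.3\n"
    "happy 0.4 0.5 0.6\n"
    "<hashtag> 0.7 0.8 0.9\n"
)
glove = load_glove(fake_file)
rate = coverage([["<user>", "happy", "<hashtag>", "monday"]], glove)
```

With a real file, `load_glove(open(path, encoding="utf-8"))` would work the same way, and comparing `coverage` before and after text preprocessing shows how much the token replacements help.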

#### Training data preparation

1. Use the Keras Tokenizer to tokenize the text (num_words=50k).
2. Use Keras pad_sequences to pad the text (maxlen=30), i.e. use the first 30 words to represent the text.
3. Create a one-hot encoding of the emotion labels.
4. Create the word embedding matrix from the pre-trained embeddings.
5. Split the data into a training set and a testing set (test_size=0.2).
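Steps 1 to 4 can be illustrated in plain Python as below. This is a sketch of what the Keras utilities do, not the notebook's code; all function names and toy inputs are assumptions. One detail worth noting: Keras `pad_sequences` defaults to `padding='pre'` and `truncating='pre'`, which keeps the last `maxlen` tokens, so keeping the first 30 words as described would require `truncating='post'`. The sketch keeps the first `maxlen` tokens and pads on the left.

```python
from collections import Counter


def build_word_index(token_lists, num_words):
    """Map the most frequent words to 1..num_words-1; index 0 is reserved for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    top = counts.most_common(num_words - 1)
    return {word: i for i, (word, _) in enumerate(top, start=1)}


def to_padded_ids(tokens, word_index, maxlen=30):
    """Convert tokens to ids, keep the first maxlen of them, and left-pad with zeros."""
    ids = [word_index[t] for t in tokens if t in word_index]  # unknown words are dropped
    ids = ids[:maxlen]
    return [0] * (maxlen - len(ids)) + ids


def one_hot(label, classes):
    """One-hot encode a label against a fixed class order."""
    return [1 if c == label else 0 for c in classes]


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the pre-trained vector of word i; missing words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix


# Toy inputs, invented for illustration.
texts = [["i", "love", "nlp"], ["i", "love", "tweets"], ["i", "hate", "mondays"]]
word_index = build_word_index(texts, num_words=3)   # keeps only "i" and "love"
padded = to_padded_ids(["i", "love", "nlp"], word_index, maxlen=4)
target = one_hot("joy", ["anger", "joy", "sadness"])
matrix = build_embedding_matrix(word_index, {"i": [0.1, 0.2]}, dim=2)
```

Because row 0 of the matrix is all zeros, padded positions contribute a zero vector to the embedding layer.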

#### Build RNN model and train

My RNN model consists of the following layers:

1. Word embedding layer (not trainable).
2. One CuDNNLSTM layer (192 hidden units), chosen to speed up training.
3. Softmax output layer.
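As a small worked example of what the output layer computes: softmax turns the eight raw class scores into a probability distribution, and categorical cross-entropy (the loss named in the next paragraph) is the negative log-probability assigned to the true class. The functions below are an illustrative plain-Python sketch, not Keras code, and the scores are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probs):
    """Negative log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t)


# An untrained model that scores all eight emotions equally (invented scores).
uniform_probs = softmax([0.0] * 8)
uniform_loss = categorical_cross_entropy([0, 0, 0, 0, 1, 0, 0, 0], uniform_probs)

# A model that strongly prefers the true class (index 4) gets a much lower loss.
confident_probs = softmax([0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
confident_loss = categorical_cross_entropy([0, 0, 0, 0, 1, 0, 0, 0], confident_probs)
```

A loss near `log(8)`, roughly 2.08, therefore means the model is doing no better than guessing uniformly among the eight emotions.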

I used categorical cross-entropy as the loss function and Adam as the optimizer. As for metrics, Keras removed the F1 score from its built-in metrics in version 2.0, so I implemented it with a Keras Callback. The model was trained for 7 epochs with a batch size of 32.

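```python
from keras.callbacks import Callback
from sklearn.metrics import classification_report, f1_score


class Metrics(Callback):
    """Print a classification report and the macro F1 on the validation data after each epoch."""

    def on_train_begin(self, logs=None):
        self.val_f1s = []

    def on_epoch_end(self, epoch, logs=None):
        # label_decode and label_encoder are defined elsewhere in the notebook.
        y_pred = self.model.predict(self.validation_data[0])
        y_pred = label_decode(label_encoder, y_pred)
        y_true = label_decode(label_encoder, self.validation_data[1])
        _val_f1 = f1_score(y_true, y_pred, average='macro')
        self.val_f1s.append(_val_f1)
        print(classification_report(y_true=y_true, y_pred=y_pred))
        print(_val_f1)


metrics = Metrics()
```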
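For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a rare emotion such as anger counts as much as joy. The sketch below computes it in plain Python on invented toy labels (the callback relies on sklearn's `f1_score` for this), then averages the per-class F1 values from the naive Bayes table as a sanity check. The function name `macro_f1` and the toy labels are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1s.append(2 * precision * recall / (precision + recall))
        else:
            f1s.append(0.0)
    return sum(f1s) / len(f1s)


# Toy labels, invented for illustration.
toy_true = ["joy", "joy", "anger", "fear"]
toy_pred = ["joy", "anger", "anger", "joy"]
toy_score = macro_f1(toy_true, toy_pred)

# Per-class F1 values copied from the naive Bayes table in this report.
baseline_f1 = [0.31, 0.57, 0.44, 0.39, 0.61, 0.47, 0.31, 0.45]
baseline_macro = sum(baseline_f1) / len(baseline_f1)
```

Averaging the rounded table values gives about 0.44, consistent with the reported .443, which was presumably computed from unrounded per-class scores.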

#### Tests with different inner layer settings

In this step, I tried to find the best setting for the inner layer. I tried similar architectures such as CuDNNGRU, bidirectional CuDNNLSTM, bidirectional CuDNNGRU, and multi-layer CuDNNLSTM, and I also tuned the number of hidden units and the batch size. None of these trials produced a significant improvement. With only 100k tweets, the RNN model with a single CuDNNLSTM layer scores .446 macro F1 on the validation set, similar to the baseline.

|            |precision  |recall  | f1-score |support   |
|-----------:|----------:|-------:|---------:|---------:|
|       anger|       0.31|   0.30 |     0.31 |     5982 |
|anticipation|       0.57|   0.54 |     0.55 |    37269 |
|     disgust|       0.37|   0.43 |     0.40 |    20866 |
|        fear|       0.49|   0.39 |     0.44 |     9447 |
|         joy|       0.60|   0.69 |     0.64 |    77286 |
|     sadness|       0.45|   0.44 |     0.45 |    29160 |
|    surprise|       0.46|   0.23 |     0.31 |     7343 |
|       trust|       0.45|   0.35 |     0.39 |    30982 |
||
|    accuracy|           |        |     0.52 |   218335 |
|   macro avg|       0.46|   0.42 |     0.44 |   218335 |
|weighted avg|       0.51|   0.52 |     0.51 |   218335 |

### Train the RNN model on the whole data set

After the experiments on 100k tweets, I trained the RNN model on the whole data set. With about 1M tweets, training for 7 epochs took about 20 minutes on a single GTX 1080. The model scores .503 macro F1 on the validation set, about 6 points (.060) higher than the baseline's .443.

|            |  precision|    recall|  f1-score|   support|
|-----------:|----------:|---------:|---------:|---------:|
|       anger|       0.62|      0.27|      0.37|      5982|
|anticipation|       0.64|      0.62|      0.63|     37269|
|     disgust|       0.47|      0.42|      0.44|     20866|
|        fear|       0.66|      0.46|      0.54|      9447|
|         joy|       0.62|      0.78|      0.69|     77286|
|     sadness|       0.47|      0.58|      0.52|     29160|
|    surprise|       0.70|      0.25|      0.37|      7343|
|       trust|       0.63|      0.37|      0.47|     30982|
||
|    accuracy|           |          |      0.59|    218335|
|   macro avg|       0.60|      0.47|      0.50|    218335|
|weighted avg|       0.59|      0.59|      0.58|    218335|

### Final result

The final result on the private leaderboard is .47704.

![Snapshot](img/kaggle_private_scoreboard_snapshot.png)

### Future improvements

* Ensemble other features with the current RNN model.
    * score
    * date
    * hashtags
* Try different deep learning models.
    * Attention mechanism
    * BERT
