## Emotion Classification
This is the second chapter of this series. In this notebook we aim to train a classifier based on the word embeddings we pretrained in the previous notebook. Then we will store the classifier to conduct emotional analysis on real-time tweets, which will be covered in the next chapter.

In [14]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
import torch
import torch.nn.functional as F
import torch.nn as nn
import os
import matplotlib.pyplot as plt
%matplotlib inline
import helpers.pickle_helpers as ph
import time
import math
from sklearn.cross_validation import train_test_split
import re

### Parameters
First, let's declare our hyperparameters.

In [2]:
EMBEDDING_DIM = 128
HIDDEN_SIZE = 256
KEEP_PROB = 0.8
BATCH_SIZE = 128
NUM_EPOCHS = 50 
DELTA = 0.5
NUM_LAYERS = 3
LEARNING_RATE = 0.001

### Data Preparation
Let's import and prepare the data as was done in the previous chapter.

In [4]:
train_data = ph.load_from_pickle(directory="data/datasets/df_grained_tweet_tr.pkl")
test_data = ph.load_from_pickle(directory="data/datasets/df_grained_tweet_te_unbal.pkl")

train_data.rename(index=str, columns={"emo":"emotions", "sentence": "text"}, inplace=True);
test_data.rename(index=str, columns={"emo":"emotions", "sentence": "text"}, inplace=True);

train_data.text = train_data.text.str.replace(" <hashtag>", "")
test_data.text = test_data.text.str.replace(" <hashtag>", "")

In [5]:
len(train_data)

597192

In [6]:
def clearstring(string):
    string = re.sub('[^\'\"A-Za-z0-9 ]+', '', string)
    string = string.split(' ')
    string = filter(None, string)
    string = [y.strip() for y in string]
    string = [y for y in string if len(y) > 3 and y.find('nbsp') < 0]
    return ' '.join(string)

In [7]:
train_data.text = train_data.text.apply(lambda d: clearstring(d))
test_data.text = test_data.text.apply(lambda d: clearstring(d))

### Obtain Word Embeddings
Here is the code for importing the word embeddings we pretrained in the previous chapter. Notice that we are also importing the vocabulary. See below how handy the vocabulary is to inspect our word embeddings.

In [8]:
### load word embeddings and accompanying vocabulary
wv = ph.load_from_pickle("data/hashtags_word_embeddings/es_py_cbow_embeddings.p")
vocab = ph.load_from_pickle("data/hashtags_word_embeddings/es_py_cbow_dictionary.p")

In [9]:
### eg. to obtain embedding for token
wv[vocab["feel"]]

array([ 0.5389905 , -0.8261296 , -1.8023891 , -0.8072674 , -0.6313184 ,
       -1.3096205 ,  1.6170695 ,  1.8171018 ,  0.05804818,  1.5923933 ,
        1.2208248 , -0.08000907,  1.4284078 ,  0.5594934 ,  0.8742701 ,
        0.04409672, -0.51616585, -0.26882973,  0.2614767 ,  1.7617252 ,
       -0.7654648 , -0.1121751 ,  0.6021578 , -2.7278464 , -1.5101068 ,
        1.9514263 ,  0.9859432 , -2.0553567 ,  0.52864003, -1.5633332 ,
       -2.329722  ,  0.33874342,  0.9558916 ,  0.9637566 ,  0.72352   ,
       -0.60107934,  1.2980587 ,  1.3291203 ,  0.08595378, -0.96753865,
       -0.47979838, -1.4262284 ,  0.80548376,  0.94358546, -0.85197926,
       -1.5562207 , -0.28793994, -0.21579984, -0.6607775 , -0.21598966,
        1.6049399 , -0.343651  , -0.0540315 , -2.1718023 , -0.98242474,
       -1.6945462 , -1.3239328 ,  1.6394376 , -1.1029811 ,  0.42646387,
       -1.0574629 , -0.4617092 , -1.0275363 ,  1.7248987 , -0.05921336,
        0.9992472 ,  0.7281742 ,  1.0187635 ,  1.8406339 , -2.00

### Tokenization and Label Binarization
A very important step before training our classifier, is to make sure the data is in the right format so it becomes easy for us to feed the data into the model. In the code below we will tokenize our dataset, in particular the inputs. Then we will also perform binarization on the target values so as to obtain one-hot vectors that will uniquely represent the target of each sentence or tweet. We also do some additional pre-processing which you can follow below. 

In [15]:
def remove_unknown_words(tokens):
    return [t for t in tokens if t in vocab]

def check_size(c, size):
    if len(c) <= size:
        return False
    else:
        return True
    
### tokens and tokensize
train_data["tokens"] = train_data.text.apply(lambda t: remove_unknown_words(t.split()))
train_data["tokensize"] = train_data.tokens.apply(lambda t: len(t))
test_data["tokens"] = test_data.text.apply(lambda t: remove_unknown_words(t.split()))
test_data["tokensize"] = test_data.tokens.apply(lambda t: len(t))

### filter by tokensize
train_data = train_data.loc[train_data["tokens"].apply(lambda d: check_size(d, 7)) != False].copy()
test_data = test_data.loc[test_data["tokens"].apply(lambda d: check_size(d, 7)) != False].copy()

### sorting by tokensize
train_data.sort_values(by="tokensize", ascending=True, inplace=True)
test_data.sort_values(by="tokensize", ascending=True, inplace=True)

### resetting index
train_data.reset_index(drop=True, inplace=True);
test_data.reset_index(drop=True, inplace=True);

### Binarization
emotions = list(set(train_data.emotions.unique()))
num_emotions = len(emotions)

### binarizer
mlb = preprocessing.MultiLabelBinarizer()

train_data_labels =  [set(emos) & set(emotions) for emos in train_data[['emotions']].values]
test_data_labels =  [set(emos) & set(emotions) for emos in test_data[['emotions']].values]

y_bin_emotions = mlb.fit_transform(train_data_labels)
test_y_bin_emotions = mlb.fit_transform(test_data_labels)

train_data['bin_emotions'] = y_bin_emotions.tolist()
test_data['bin_emotions'] = test_y_bin_emotions.tolist()

### Generate sample input
Once we have processed our data, let's look at an example of how we will converting sentences into input vectors, which are basically word vectors concatenated to represent the input sequence.

In [16]:
sentence_embeddings = [wv[vocab[w]] for w in "this feels fantastic".split()]

In [17]:
sentence_embeddings

[array([ 5.7762969e-01,  4.1182911e-01,  1.5915717e+00,  1.9623135e-01,
         1.4823467e-01,  3.4592927e-02,  1.0979089e-01, -5.4003459e-01,
         5.5145639e-01, -2.0645244e-01,  6.2708288e-01,  1.9114013e+00,
         4.1743749e-01,  4.8000565e-01,  1.3688921e+00, -6.0899270e-01,
        -8.2222080e-01, -1.6738379e-01,  2.5278423e-03, -4.4002768e-01,
        -1.7636645e-01,  3.1228867e-01,  8.5302269e-01, -5.5778861e-02,
        -9.6316218e-01,  6.3835210e-01,  1.1264894e+00, -7.7165258e-01,
         1.7387373e+00,  1.3290544e+00, -2.6808953e-01,  2.6583406e-01,
         1.7067311e+00,  4.0209743e-01,  1.9354068e+00, -4.4382878e-02,
        -1.7041634e+00, -2.1780021e+00,  6.2105244e-01,  4.5051843e-01,
        -9.4019301e-02, -1.6840085e-01, -6.8932152e-01, -8.8215894e-01,
        -1.4211287e+00, -6.9710428e-01,  9.1269486e-02, -1.3960580e+00,
        -2.6473520e+00,  1.2631515e-01,  1.0753033e+00, -1.7343637e+00,
        -1.2398950e+00, -1.8989055e-01,  5.5069500e-01, -9.92743

### Batching by Bucketing approach
Here is the code to generate batches for our training. This code is a little bit different from the batching approach we used to train our embeddings. Here we are going to generate batches of input sentences. In addition, we will also use a bucketing approach, which is basically a trick to generate more efficient batches that are of similar size. You don't need to know more about the batching for now, just that it is needed for training. We will explain the purpose of bucketing more in details in a future chapter of this series.

In [18]:
### renders embeddings with paddings; zeros where missing tokens
def generate_embeds_with_pads(tokens, max_size):
   
    padded_embedding = []
    for i in range(max_size):
        if i+1 > len(tokens): # do padding
            padded_embedding.append(list(np.zeros(EMBEDDING_DIM)))
        else: # do embedding for existing tokens
            padded_embedding.append(list(wv[vocab[tokens[i]]]))  
    return padded_embedding

### generate the actual batches
def generate_batches(data, batch_size):
    actual_batches = math.ceil(len(data) / batch_size)
    bins = np.linspace(0, len(data), actual_batches + 1) # this renders actual batches bins of size batch_size
    groups = data.groupby(np.digitize(data.index, bins))
    
    groups_indices = groups.indices
    groups_maxes = groups.max().tokensize
    
    return groups.indices, groups_maxes

### Model
Let's set up our model.

In [19]:
class EmoNet(torch.nn.Module):
    def __init__(self, num_layers, hidden_size, embedding_dim, output_size, dropout):
        super(EmoNet, self).__init__()
        self.embedding_dim = embedding_dim
        self.keep_prob = dropout
        self.hidden_size = hidden_size
        self.nlayers = num_layers
        self.output = output_size
        
        self.dropout  = nn.Dropout(p=self.keep_prob)
        
        self.rnn = nn.LSTM(input_size=self.embedding_dim,
                                 hidden_size=self.hidden_size, 
                                 num_layers=self.nlayers,
                                 dropout=self.keep_prob)
        self.linear = nn.Linear(self.hidden_size, output_size)
        
    def forward(self, inputs):
        # batch_size X seq_len X embedding_dim -> seq_len, batch_size, embedding_dim
        X = inputs.permute(1,0,2)
        self.rnn.flatten_parameters()
        output, hidden = self.rnn(X)
        (_, last_state) = hidden      
        out = self.dropout(output[-1])  
        out = self.linear(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs        

### Pretesting with one batch sample
Let's test the model to make sure that we are getting the right output.

In [20]:
train_groups_indices, train_groups_maxes = generate_batches(train_data, BATCH_SIZE)
test_groups_indices, test_groups_maxes = generate_batches(test_data, BATCH_SIZE)

n_train = len(train_data) // BATCH_SIZE
n_test = len(test_data) // BATCH_SIZE

batch_x = train_data.iloc[train_groups_indices[1]].tokens.apply(lambda d: 
                                                                          generate_embeds_with_pads(d, train_groups_maxes[1]) ).values.tolist()
batch_y = train_data.loc[train_groups_indices[1]].bin_emotions.values.tolist()

final_batch_x = torch.FloatTensor(np.array(batch_x))
final_batch_y = torch.FloatTensor(np.array(batch_y))

dummy_model = EmoNet(NUM_LAYERS, HIDDEN_SIZE, EMBEDDING_DIM, num_emotions, KEEP_PROB)
log_probs = dummy_model(final_batch_x)
print(log_probs[:15])

tensor([[-2.0232, -2.0691, -2.1157, -2.0892, -2.1025, -2.0685, -2.0496, -2.1217],
        [-2.0673, -2.1060, -2.0852, -2.0269, -2.1725, -2.0424, -2.0406, -2.1025],
        [-2.0391, -2.1295, -2.1221, -2.0777, -2.0516, -2.0863, -2.0247, -2.1099],
        [-1.9732, -2.0941, -2.1539, -2.1623, -2.0857, -2.0165, -2.0270, -2.1399],
        [-2.0045, -2.1246, -2.2071, -2.1499, -2.1643, -1.9416, -1.9807, -2.0957],
        [-2.0669, -2.0434, -2.0972, -2.1399, -2.1139, -2.0600, -1.9759, -2.1500],
        [-2.0779, -2.0254, -2.1648, -2.0490, -2.1991, -1.9826, -2.0161, -2.1418],
        [-2.0564, -2.1175, -2.0688, -2.1243, -2.1005, -2.0728, -1.9879, -2.1144],
        [-2.0580, -2.0817, -2.1553, -2.0956, -2.0416, -2.0482, -2.0393, -2.1219],
        [-2.0021, -2.0735, -2.1122, -2.0266, -2.1404, -2.0821, -2.0176, -2.1964],
        [-2.0476, -2.1002, -2.1645, -2.1462, -2.0558, -2.0172, -1.9741, -2.1467],
        [-2.0600, -2.0664, -2.0646, -2.1640, -2.1037, -2.0233, -2.0344, -2.1271],
        [-2.1151

### Training
Now let's train the model. But first, let's define the necessary variables to conduct the training like the optimizer and whether we are training on the cpu or gpu.

In [21]:
### define model
use_cuda = True if torch.cuda.is_available() else False
device = torch.device("cuda" if use_cuda else "cpu")
model = EmoNet(NUM_LAYERS, HIDDEN_SIZE, EMBEDDING_DIM, num_emotions, KEEP_PROB).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
dimension = EMBEDDING_DIM
EARLY_STOPPING, CURRENT_CHECKPOINT, CURRENT_ACC, EPOCH = 10, 0, 0, 0

### defining batch generation
train_groups_indices, train_groups_maxes = generate_batches(train_data, BATCH_SIZE)
test_groups_indices, test_groups_maxes = generate_batches(test_data, BATCH_SIZE)

n_train = len(train_data) // BATCH_SIZE
n_test = len(test_data) // BATCH_SIZE

In [22]:
def get_accuracy(logit, target, batch_size):
    ''' Obtain accuracy for training round '''
    corrects = (torch.max(logit, 1)[1].view(target.size()).data == target.data).sum()
    accuracy = 100.0 * corrects/batch_size
    return accuracy

...and finally we can train the model. Note that I stopped the training after the first round, since I have already done the training on my computer. You can let the training continue until you have reached a nice accuracy.

In [23]:
### training
while True:
    lasttime = time.time()
    ### early stoping to avoid overfitting
    if CURRENT_CHECKPOINT == EARLY_STOPPING:
        print('break epoch:', EPOCH)
        break
    train_acc, train_loss, test_acc , test_loss = 0, 0, 0, 0
    
    for b in range(n_train):
        batch_x = train_data.iloc[train_groups_indices[b+1]].tokens.apply(lambda d: 
                                                                          generate_embeds_with_pads(d, train_groups_maxes[b+1]) ).values.tolist()
        batch_y = train_data.loc[train_groups_indices[b+1]].bin_emotions.values.tolist()
        batch_y = np.argmax(batch_y, axis=1)        
        final_batch_x = torch.FloatTensor(np.array(batch_x)).to(device)
        final_batch_y = torch.LongTensor(batch_y).to(device)
        
        model.zero_grad()
        y_hat = model(final_batch_x)
        
        loss = F.nll_loss(y_hat, final_batch_y)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
        train_acc += get_accuracy(y_hat, final_batch_y, BATCH_SIZE)
        
    for b in range(n_test):
        batch_x = test_data.iloc[test_groups_indices[b+1]].tokens.apply(lambda d: 
                                                                          generate_embeds_with_pads(d, test_groups_maxes[b+1]) ).values.tolist()
        batch_y = test_data.loc[test_groups_indices[b+1]].bin_emotions.values.tolist()
        batch_y = np.argmax(batch_y, axis=1)
        final_batch_x = torch.FloatTensor(np.array(batch_x)).to(device)
        final_batch_y = torch.LongTensor(batch_y).to(device)
        
        model.zero_grad()
        y_hat = model(final_batch_x)
                
        loss = F.nll_loss(y_hat, final_batch_y)
        
        test_loss += loss.item()
        test_acc += get_accuracy(y_hat, final_batch_y, BATCH_SIZE)
        
    train_loss /= n_train
    train_acc /= n_train
    test_loss /= n_test
    test_acc /= n_test
    
    if test_acc > CURRENT_ACC:
        print('epoch:', EPOCH, ', pass acc:', CURRENT_ACC, ', current acc:', test_acc.cpu().numpy())
        CURRENT_ACC = test_acc
        CURRENT_CHECKPOINT = 0
        ### TODO: do checkpoint for model here using PyTorch
    else:
        CURRENT_CHECKPOINT += 1
    EPOCH += 1
    print('time taken:', time.time()-lasttime)
    print('epoch:', EPOCH, ', training loss:', train_loss, ', training acc:', train_acc.cpu().numpy(), ', valid loss:', test_loss, ', valid acc:', test_acc.cpu().numpy())


epoch: 0 , pass acc: 0 , current acc: 46
time taken: 48.675190687179565
epoch: 1 , training loss: 1.4679006251582394 , training acc: 46 , valid loss: 1.490867356731467 , valid acc: 46


KeyboardInterrupt: 

### Store the Model
Now that the model has been trained, we can store the classifier and then reuse it again to classify sentences or other tweets in the future. We will do this in the next chapter of this series. By the way, notice that I didn't properly evaluate the performance of the model here. I am sure you can find a way to improve the accuracy of the model by using more advanced deep learning techniques. You can also try to find a method to properly evaluate the model. I will provide that code in a future chapter. For now, we will use the model above, which has a fair accuracy, since the purpose of the series is to show you how to use the inferences of the model to conduct further analysis on a new dataset. We will cover this further analysis in the next chapter. Let's store the model first, and then we will retrieve it in the next notebook for classifying real-time tweets.

In [28]:
import copy
tmodel = copy.deepcopy(model)
torch.save(tmodel, 'model/elastic_hashtag_model/emonet')

  "type " + obj.__name__ + ". It won't be checked "
