# Sentiment Analysis (Deep Learning, MLP)

In this tutorial, we perform sentiment analysis on tweets using deep learning, with a basic Multilayer Perceptron (MLP) as the network architecture.

## Import required packages

In [1]:
import numpy as np
import pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, Activation

from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# The next imports are only needed for the preprocessing
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from utils.nlputil import preprocess_text

We need a tokenizer and a lemmatizer for the preprocessing.

In [4]:
tweet_tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()

## Load and prepare data

We use `pandas` as usual to read the files with the tweets and their assigned sentiment labels. In this case, we have 2 files, one containing the training data and the other the test data. For both files we perform the following steps:

* Store tweets and labels (polarities) into separate lists for further processing
* Preprocess the tweets

In [5]:
df_tweets_train = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-training.csv')

# Print the first 5 lines
df_tweets_train.head()

Unnamed: 0,tweet,senti
0,@united UA5396 can wait for me. I'm on the gro...,0
1,I hate Time Warner! Soooo wish I had Vios. Can...,0
2,Tom Shanahan's latest column on SDSU and its N...,2
3,Found the self driving car!! /IWo3QSvdu2,2
4,@united arrived in YYZ to take our flight to T...,0


In [6]:
train_tweets = df_tweets_train['tweet']
train_polarities = df_tweets_train['senti']

train_tweets_processed = [''] * len(train_tweets)
for idx, doc in enumerate(train_tweets):
    train_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)

In [7]:
df_tweets_test = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-test.csv')

test_tweets = df_tweets_test['tweet']
test_polarities = df_tweets_test['senti']  

test_tweets_processed = [''] * len(test_tweets)
for idx, doc in enumerate(test_tweets):
    test_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)    

In [9]:
print("Number of training data: {}".format(len(train_tweets_processed)))
print("Number of test data: {}".format(len(test_tweets_processed)))

Number of training data: 699
Number of test data: 298


### Encode / vectorize data

We need to encode both the tweets as well as the labels to use them as input and output for the network.

For convenience, we use the `Tokenizer` class provided by Keras. It generates a document-term matrix with binary weights. We can limit the vocabulary to the *N* most frequently used words (here *N=250*).

In [17]:
vocab_size = 250
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_tweets_processed)

In [18]:
# Encode the preprocessed tweets as binary document-term matrices
# (alternative: pass the raw tweets, e.g. tokenizer.texts_to_matrix(train_tweets))
X_train = tokenizer.texts_to_matrix(train_tweets_processed)
X_test = tokenizer.texts_to_matrix(test_tweets_processed)

For illustration, we print the vector for the first tweet. The vector has a length of 250 (the size of the vocabulary). A 1 at position *i* indicates that the tweet contains the word with the word index *i*.

In [19]:
print(X_train[0])

[ 0.  0.  0.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0
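To make the encoding more concrete, the following sketch builds a binary document-term matrix by hand for a few toy texts. The helper names `fit_vocabulary` and `texts_to_binary_matrix` are made up for this illustration and are not part of Keras. The sketch also assumes that index 0 stays unused, so only the `num_words - 1` most frequent words get a column; treat that detail as an assumption about how the `Tokenizer` behaves.

```python
from collections import Counter

import numpy as np


def fit_vocabulary(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (index 0 stays unused)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def texts_to_binary_matrix(texts, word_index, num_words):
    """Return a (len(texts), num_words) matrix with 1.0 wherever a known word occurs."""
    matrix = np.zeros((len(texts), num_words))
    for row, text in enumerate(texts):
        for word in text.split():
            col = word_index.get(word)
            if col is not None:  # words outside the vocabulary are ignored
                matrix[row, col] = 1.0
    return matrix


# Hypothetical toy corpus, only for illustration
toy_texts = ["flight delay again", "great flight", "delay delay delay"]
toy_index = fit_vocabulary(toy_texts, num_words=4)
toy_matrix = texts_to_binary_matrix(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_matrix)
```

Note that the third toy text contains the same word three times but still only yields a single 1: binary weights record presence, not frequency.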

Since we have 3 classes (negative, neutral, positive), our network will have an output layer of 3 neurons. For the network to calculate the cost given a predicted output (3 values) and the true output, we need to convert each true label into a vector that also has 3 values.

We can use existing methods to easily convert the true labels into the respective one-hot vectors.

In [24]:
encoder = LabelBinarizer()
encoder.fit(train_polarities)
y_train = encoder.transform(train_polarities)
y_test = encoder.transform(test_polarities)

print(y_test[:10])

[[1 0 0]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 0]]
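As a rough illustration of what the one-hot conversion does, here is a minimal NumPy sketch on a few made-up labels. It is not how `LabelBinarizer` is implemented; it only reproduces the idea that the sorted unique labels define the column order.

```python
import numpy as np

# Hypothetical toy labels using the same coding as the dataset (0, 2, 4)
toy_labels = np.array([0, 4, 2, 0])

# The sorted unique labels define the column order: [0, 2, 4]
toy_classes = np.unique(toy_labels)

# Compare every label against every class; exactly one match per row
toy_one_hot = (toy_labels[:, None] == toy_classes[None, :]).astype(int)
print(toy_one_hot)
```

Each row contains a single 1 in the column of its class, which is the same layout as the `y_test` rows printed above.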


## Define the network model

We use an MLP, i.e., a simple stack of fully connected layers (`Dense`). Apart from the input layer, we also define

* 1st hidden layer of size 512
* 2nd hidden layer of size 256
* Output layer of size 3 (one for each class)

In [25]:
num_labels = 3 # We have 3 polarity classes

model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dense(256)) # No need to specify input size - is derived from output of previous layer
model.add(Activation('relu'))
model.add(Dense(num_labels))
model.add(Activation('softmax'))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 512)               128512    
_________________________________________________________________
activation_4 (Activation)    (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 256)               131328    
_________________________________________________________________
activation_5 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 771       
_________________________________________________________________
activation_6 (Activation)    (None, 3)                 0         
Total params: 260,611
Trainable params: 260,611
Non-trainable params: 0
_________________________________________________________________
None
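To illustrate what this stack of layers computes, the following NumPy sketch runs a forward pass through a network of the same shape (250, 512, 256, 3). The weights are random stand-ins and the input batch is fake, so the resulting probabilities are meaningless; the point is only to show how the shapes and activations fit together.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


layer_sizes = [250, 512, 256, 3]
# Random stand-in weights; a trained model would hold learned values instead
weights = [rng.normal(scale=0.05, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x):
    h = relu(x @ weights[0] + biases[0])         # 1st hidden layer
    h = relu(h @ weights[1] + biases[1])         # 2nd hidden layer
    return softmax(h @ weights[2] + biases[2])   # output layer


batch = rng.integers(0, 2, size=(4, 250)).astype(float)  # 4 fake binary tweet vectors
probs = forward(batch)
print(probs.shape)        # (4, 3)
print(probs.sum(axis=1))  # each row sums to 1
```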

For MLPs it's easy to calculate the number of parameters manually, so let's do this for illustrative purposes:

* 1st hidden layer: 250 (input layer size) * 512 (1st hidden layer size) + 512 (1st hidden layer size; for biases) = 128,512

* 2nd hidden layer: 512 (1st hidden layer size) * 256 (2nd hidden layer size) + 256 (2nd hidden layer size; for biases) = 131,328

* Output layer: 256 (2nd hidden layer size) * 3 (output layer size) + 3 (output layer size, for biases) = 771
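The same arithmetic can be written as a few lines of Python. The helper `dense_param_count` is a name chosen for this sketch, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [250, 512, 256, 3]
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(per_layer)       # [128512, 131328, 771]
print(sum(per_layer))  # 260611
```

The total matches the `Total params` line reported by `model.summary()`.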

### Compiling the model

Compiling the model configures it for training: here we specify which loss and which optimizer we want to use, and which metrics to report. The weights themselves were already randomly initialized when the layers were added to the model.

In [26]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
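To give an intuition for the chosen loss, here is a minimal NumPy sketch of categorical cross-entropy on made-up predictions. It follows the textbook definition (mean over samples of the negative log-probability assigned to the true class); the clipping constant `eps` is an assumption of this sketch, and the Keras implementation may differ in such details.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))


# Hypothetical one-hot labels and predicted class probabilities
toy_true = np.array([[1, 0, 0], [0, 0, 1]])
confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
unsure = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])

loss_confident = categorical_crossentropy(toy_true, confident)
loss_unsure = categorical_crossentropy(toy_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the network puts most of the probability mass on the true class and grows as that probability shrinks.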

## Training the model

With everything in place, we can train the model by calling the `fit()` method. Apart from the training data and labels, the method takes the following input parameters:

* `batch_size`: the number of training samples processed per weight update

* `epochs`: the number of times the whole training data is passed through the network

* `verbose`: the level of output printed to follow the training progress

* `validation_split`: the fraction of the training data that is held out for validation (here 10%)

In [28]:
history = model.fit(X_train, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
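The interplay of `validation_split`, `batch_size` and `epochs` can be checked with a little arithmetic. The sketch below assumes that the split point is obtained by truncating to an integer, which is consistent with the 629/70 split reported in the log above.

```python
import math

n_samples = 699
validation_split = 0.1
batch_size = 32
epochs = 20

# Assumption: the split point is truncated to an integer, as the log line suggests
n_train = int(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train

updates_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_updates = updates_per_epoch * epochs
print(n_train, n_val, updates_per_epoch, total_updates)  # 629 70 20 400
```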


The training is rather fast simply because we only use ~700 tweets for training, which is a very small dataset for deep learning.

## Evaluating the model

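We already prepared the test data, so we can use it directly as input for the provided method `evaluate()`. It returns the loss on the test set followed by the metrics chosen when compiling the model (here: accuracy).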

In [33]:
print(X_test.shape)
print(y_test.shape)

score = model.evaluate(X_test, y_test, batch_size=32, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

(298, 250)
(298, 3)
Test score: 1.6150059588
Test accuracy: 0.640939597315


### More detailed evaluation

Keras does not provide any more detailed information about an evaluation beyond the accuracy value. However, one can simply use the methods that come with `scikit-learn` for that; see previous tutorials.

We first predict the classes/polarities for the test set.

In [30]:
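# predict_classes() was removed from recent Keras releases;
# taking the argmax over the predicted class probabilities is equivalent
y_pred = np.argmax(model.predict(X_test), axis=1)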
print(y_pred)

[0 0 0 1 0 2 1 0 2 2 0 1 2 2 2 2 2 2 1 2 0 2 1 0 1 0 0 2 2 0 0 1 1 1 0 2 1
 0 1 0 2 2 2 0 1 2 0 2 2 0 2 0 0 2 2 1 0 0 1 2 2 0 2 1 0 0 2 2 1 1 0 0 1 0
 0 1 0 0 2 0 1 2 0 0 2 0 0 2 2 0 0 0 1 0 1 2 0 1 2 2 0 1 1 1 0 2 2 2 2 0 2
 0 1 0 2 0 2 1 0 0 2 2 0 0 0 2 1 0 1 2 0 2 0 2 2 0 2 0 2 2 0 0 2 0 0 0 0 2
 0 0 1 1 2 2 1 2 2 2 2 2 0 1 1 0 2 1 2 2 1 2 2 2 0 2 1 1 0 2 0 0 0 0 2 0 2
 2 2 1 1 0 1 2 0 1 1 0 1 1 0 1 0 0 1 2 0 2 0 0 0 0 0 0 2 1 1 1 0 2 2 0 0 0
 0 0 1 1 0 2 0 0 0 2 0 0 1 1 2 0 2 2 0 1 2 0 1 0 2 0 2 1 0 2 1 1 1 0 0 2 1
 0 1 0 1 1 2 1 0 2 0 1 1 2 0 2 2 0 2 1 2 0 0 2 0 0 2 2 1 2 1 1 1 0 2 2 1 0
 1 2]


`y_pred` contains the values 0, 1 and 2, representing the three classes *negative*, *neutral* and *positive*, respectively. However, the original labels are 0 (*negative*), 2 (*neutral*) and 4 (*positive*). We therefore need to normalize the original labels so that they match. In our case this is easy: we just need to divide the original labels by 2.

In [31]:
#y_test_normalized = [ int(p/2) for p in test_polarities ] # Works as well
y_test_normalized = np.asarray([ int(p/2) for p in test_polarities ])
print(y_test_normalized)

[0 0 0 2 0 0 2 0 2 1 1 1 0 2 2 1 2 2 2 0 0 2 1 0 2 0 0 2 1 1 0 1 1 1 0 2 0
 0 2 0 2 2 2 0 0 2 0 1 2 0 2 0 0 1 2 2 0 0 1 2 2 1 2 2 0 0 2 0 2 1 0 2 2 1
 0 1 0 0 1 0 1 2 2 1 0 1 0 2 2 2 0 0 1 1 2 2 0 0 0 2 0 1 1 1 2 2 2 2 2 0 0
 0 1 2 2 0 2 1 0 0 0 0 0 1 1 2 2 0 2 2 2 2 0 2 2 0 0 0 2 2 0 0 1 0 2 0 0 2
 0 0 1 0 0 2 0 1 1 0 2 2 1 2 0 0 2 2 0 2 1 2 2 2 2 2 1 1 0 2 0 2 0 0 1 0 1
 2 2 2 2 0 2 2 2 1 0 0 1 1 2 1 0 0 2 2 2 1 0 0 0 0 2 2 1 1 2 1 1 2 1 2 1 0
 0 2 0 1 0 2 2 0 0 1 2 0 1 1 2 0 2 2 0 1 2 0 1 0 2 0 2 1 0 1 0 1 0 2 0 2 0
 2 1 1 0 0 1 1 1 2 0 0 1 2 0 2 1 2 2 2 1 0 0 2 0 0 2 2 1 1 0 1 0 0 2 2 0 2
 2 2]
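Dividing by 2 only works because the labels happen to be evenly spaced. A more general alternative, sketched below under that motivation, is to map the predicted class indices back to the original labels through the sorted class array. Here `classes` is a hand-written stand-in for the `classes_` attribute of the fitted `LabelBinarizer`, and `toy_pred` is made-up data.

```python
import numpy as np

# Stand-in for encoder.classes_, which holds the sorted original labels
classes = np.array([0, 2, 4])

toy_pred = np.array([0, 1, 2, 2, 0])   # class indices, e.g. from an argmax
toy_pred_labels = classes[toy_pred]    # back to the original coding
print(toy_pred_labels)                 # [0 2 4 4 0]
```

With the predictions expressed in the original coding, they can be compared directly against `test_polarities` without normalizing the true labels.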


Now we can use the functions provided by `scikit-learn` to print more detailed values such as precision, recall and f1-score for all classes.

In [32]:
print(confusion_matrix(y_test_normalized, y_pred))
print()
print(classification_report(y_test_normalized, y_pred))

[[83 17 13]
 [14 37 20]
 [22 21 71]]

             precision    recall  f1-score   support

          0       0.70      0.73      0.72       113
          1       0.49      0.52      0.51        71
          2       0.68      0.62      0.65       114

avg / total       0.64      0.64      0.64       298
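To see where the numbers in the report come from, the following sketch recomputes precision, recall, f1-score and accuracy directly from the confusion matrix printed above, assuming the usual `scikit-learn` layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[83, 17, 13],
               [14, 37, 20],
               [22, 21, 71]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))  # [0.7  0.49 0.68]
print(np.round(recall, 2))     # [0.73 0.52 0.62]
print(np.round(f1, 2))         # [0.72 0.51 0.65]
print(round(accuracy, 4))      # 0.6409
```

The accuracy agrees with the value returned by `evaluate()`, and the row sums of the matrix (113, 71, 114) are the `support` column of the report.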

