# Sentiment Analysis (Deep Learning, CNN)

In this tutorial, we perform sentiment analysis with deep learning, using a basic Convolutional Neural Network (CNN) architecture.

## Import required packages

In [1]:
import os
import numpy as np
import pandas as pd
import csv

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv1D, MaxPooling1D, Flatten, Embedding

from sklearn.preprocessing import LabelBinarizer

# The next imports are only needed for the preprocessing
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from utils.nlputil import preprocess_text

Using TensorFlow backend.


We need a tokenizer and a lemmatizer for the preprocessing.

In [2]:
tweet_tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()

Let's also define a set of parameters that we need later.

In [3]:
NUM_LABELS = 3       # We have 3 polarity classes
MAX_WORDS = 1000     # We only consider the 1,000 most frequent terms
EMBEDDING_DIM = 50   # Size of the word vectors

## Data preparation

### Load data from files

In [4]:
df_tweets_train = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-training.csv')

# Print the first 5 lines
df_tweets_train.head()

Unnamed: 0,tweet,senti
0,@united UA5396 can wait for me. I'm on the gro...,0
1,I hate Time Warner! Soooo wish I had Vios. Can...,0
2,Tom Shanahan's latest column on SDSU and its N...,2
3,Found the self driving car!! /IWo3QSvdu2,2
4,@united arrived in YYZ to take our flight to T...,0


### Preprocess training and test data

In [5]:
train_tweets = df_tweets_train['tweet']
train_polarities = df_tweets_train['senti']

train_tweets_processed = [''] * len(train_tweets)

for idx, doc in enumerate(train_tweets):
    train_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)

In [6]:
df_tweets_test = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-test.csv')

test_tweets = df_tweets_test['tweet']
test_polarities = df_tweets_test['senti']  

test_tweets_processed = [''] * len(test_tweets)

for idx, doc in enumerate(test_tweets):
    test_tweets_processed[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer)  

### Prepare labels

In [7]:
encoder = LabelBinarizer()
encoder.fit(train_polarities)
y_train = encoder.transform(train_polarities)
y_test = encoder.transform(test_polarities)

print(y_test[:10])

[[1 0 0]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [1 0 0]
 [0 0 1]
 [1 0 0]
 [0 0 1]
 [0 1 0]]


### Calculate maximum sequence length 

Most neural networks assume inputs of the same size. Since tweets are usually rather short, we can find the longest one (in terms of the number of words) and use its length as the maximum sequence length. For longer texts, e.g., reviews, the maximum sequence length is typically fixed a priori, often at a couple of hundred words.

In [8]:
longest_train_tweet = max([len(s.split()) for s in train_tweets_processed])
longest_test_tweet = max([len(s.split()) for s in test_tweets_processed])

max_seq_len = max(longest_train_tweet, longest_test_tweet)

print("Maximum sequence length: {}".format(max_seq_len))

Maximum sequence length: 29
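
For longer texts, padding everything to the longest document can be wasteful. One common heuristic is to cap the length at a high percentile of the observed lengths. The following is a minimal sketch of that idea; the word counts and the 95th-percentile cut-off are assumptions for illustration, not values from this tutorial.

```python
import numpy as np

# Hypothetical word counts of ten longer documents (one outlier with 400 words)
lengths = np.array([12, 15, 18, 20, 22, 25, 30, 35, 40, 400])

# Assumed heuristic: cap at the 95th percentile instead of the absolute maximum
capped_len = int(np.ceil(np.percentile(lengths, 95)))

print("Longest document:", lengths.max())
print("Capped sequence length:", capped_len)
```

Documents longer than the cap would then be truncated during padding, while the single outlier no longer dictates the input size for every sample.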


In [9]:
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_tweets_processed)

In [10]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 2718 unique tokens.


### Convert strings to sequences

The `tokenizer.word_index` dictionary maps each word in the vocabulary to an index. The method `texts_to_sequences` converts a string into a list of indexes representing the words in the string.

In [11]:
X_train = tokenizer.texts_to_sequences(train_tweets_processed)
X_test = tokenizer.texts_to_sequences(test_tweets_processed)

max_idx = max([ max(l) for l in X_train if len(l) > 0])

print(X_train[0])
print("Largest used index: {}".format(max_idx)) # This should be (MAX_WORDS-1)


[4, 846, 52, 8, 506, 178, 6, 204, 344, 507, 508]
Largest used index: 999
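
To make the mapping more concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency, index 0 is reserved for padding, and only indexes below `num_words` are kept. The helper names `build_word_index` and `to_sequences` are hypothetical, and the sketch leaves out the lowercasing and punctuation filtering that the Keras `Tokenizer` also performs.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping unknown words and indexes >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["good flight good crew", "bad flight", "good food"]
toy_index = build_word_index(texts)
toy_sequences = to_sequences(texts, toy_index, num_words=3)

print(toy_index)
print(toy_sequences)
```

With `num_words=3`, only indexes 1 and 2 survive, which mirrors why the largest index used above is `MAX_WORDS - 1`.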


### Sequence padding

We have to ensure that all inputs have the same length. Above, we calculated a maximum length of 29, so we have to "pad" all tweets that are shorter than that. Keras comes with a handy method for this. `padding='post'` specifies that the padding is added after the last word. `truncating='post'` is not required in this example; it would cut off words from the end of tweets that are too long (which cannot happen here).

In [12]:
X_train = pad_sequences(X_train, maxlen=max_seq_len, padding='post', truncating='post')
X_test = pad_sequences(X_test, maxlen=max_seq_len, padding='post', truncating='post')

print(X_train[0])
print("Sequence length: {}".format(len(X_train[0])))

[  4 846  52   8 506 178   6 204 344 507 508   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0]
Sequence length: 29
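
For illustration, the effect of `padding='post'` and `truncating='post'` can be reproduced with a few lines of NumPy. This is only a sketch of the behavior described above; `pad_post` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad (or cut) each sequence at its end so that all have length maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]                 # 'post' truncating: drop words from the end
        padded[row, :len(trimmed)] = trimmed   # 'post' padding: the fill value stays at the end
    return padded

toy_padded = pad_post([[4, 846, 52], [8], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(toy_padded)
```

The third toy sequence is longer than `maxlen`, so its last element is cut off; the two shorter ones are filled with zeros on the right.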


## Training the model (without word embeddings)

### Using "raw" sequences

Technically, we can train the network on the word indexes (e.g., `[  4 846  52   8 506 178   6 204 344 507 508   0   0 ...]`) without vectorizing the words. However, as you will see, the performance is very poor.

In [13]:
X_train_raw = np.expand_dims(X_train, axis=2)
X_test_raw = np.expand_dims(X_test, axis=2)
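
`Conv1D` expects a three-dimensional input of shape (samples, steps, channels), which is why a trailing axis of size 1 is added. The small sketch below, using made-up toy values, shows the shape change.

```python
import numpy as np

# Two padded toy sequences of length 4 (hypothetical values)
toy = np.array([[4, 846, 52, 0],
                [8, 506, 0, 0]])

toy_raw = np.expand_dims(toy, axis=2)

print(toy.shape)      # (samples, sequence length)
print(toy_raw.shape)  # (samples, sequence length, channels)
```

Each word index becomes a single-channel "feature vector" of length 1, so the values themselves are unchanged.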

In [15]:
model_raw = Sequential()
model_raw.add(Conv1D(128, 5, activation='relu', input_shape=(max_seq_len, 1)))
#model_raw.add(MaxPooling1D(5))
model_raw.add(Flatten())
model_raw.add(Dense(128))
model_raw.add(Activation('relu'))
model_raw.add(Dense(64)) 
model_raw.add(Activation('relu'))
model_raw.add(Dense(NUM_LABELS))
model_raw.add(Activation('softmax'))

print(model_raw.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_2 (Conv1D)            (None, 25, 128)           768       
_________________________________________________________________
flatten_2 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               409728    
_________________________________________________________________
activation_4 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_5 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 195       
__________
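
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (stride 1, `'valid'` padding, one bias per filter); the helper function names are hypothetical.

```python
def conv1d_output_len(seq_len, kernel_size):
    """Output length of a Conv1D layer with stride 1 and no padding ('valid')."""
    return seq_len - kernel_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units

seq_len, kernel_size, filters = 29, 5, 128

out_len = conv1d_output_len(seq_len, kernel_size)
flat = out_len * filters

print("Conv1D output length:", out_len)
print("Conv1D params:", conv1d_params(kernel_size, 1, filters))
print("Flatten size:", flat)
print("Dense(128) params:", dense_params(flat, 128))
print("Dense(64) params:", dense_params(128, 64))
print("Dense(3) params:", dense_params(64, 3))
```

The same `conv1d_params` formula with 1,000 or 50 input channels gives the `Conv1D` counts reported for the one-hot and embedding models further down.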

In [16]:
model_raw.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [17]:
history_raw = model_raw.fit(X_train_raw, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [18]:
score_raw = model_raw.evaluate(X_test_raw, y_test, batch_size=32, verbose=1)
print('Test score:', score_raw[0])
print('Test accuracy:', score_raw[1])

Test score: 1.50831632326
Test accuracy: 0.402684563758
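
As a reminder of what the accuracy value measures: for each sample, the class with the highest softmax probability is compared with the class marked in the one-hot label. The following sketch uses made-up probabilities and labels, not outputs of the model above.

```python
import numpy as np

# Hypothetical softmax outputs for four tweets and their one-hot labels
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.5, 0.3]])
y_true = np.array([[1, 0, 0],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 1, 0]])

pred_classes = probs.argmax(axis=1)   # most probable class per sample
true_classes = y_true.argmax(axis=1)  # position of the 1 in each label

toy_accuracy = float(np.mean(pred_classes == true_classes))
print("Accuracy:", toy_accuracy)
```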


### Using one-hot word vectors

Here, we vectorize each word by converting it into a one-hot vector. Each vector has length 1,000, the size of the vocabulary (the 1,000 most frequent words).

Instead of using the `Tokenizer` class of Keras, we do the conversion manually for illustration.

In [19]:
def convert_to_word_onehot(X):
    # Start from zeros: np.empty would leave arbitrary values in the untouched cells
    X_onehot = np.zeros(shape=(X.shape[0], X.shape[1], MAX_WORDS))
    for seq_idx, seq in enumerate(X):
        for word_idx, word in enumerate(seq):
            if word > 0:
                X_onehot[seq_idx, word_idx, word] = 1
    return X_onehot
        
X_train_onehot = convert_to_word_onehot(X_train)  
X_test_onehot = convert_to_word_onehot(X_test)  
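
The nested loops are easy to follow but slow for larger datasets. As an alternative sketch, the same conversion can be expressed with NumPy index arrays. `onehot_vectorized` is a hypothetical helper, and it keeps the convention that index 0 (padding) maps to an all-zero vector.

```python
import numpy as np

def onehot_vectorized(X, num_words):
    """One-hot encode a 2-D array of word indexes; index 0 (padding) stays all-zero."""
    X = np.asarray(X)
    X_onehot = np.zeros((X.shape[0], X.shape[1], num_words))
    seq_idx, word_idx = np.nonzero(X)  # positions of all non-padding words
    X_onehot[seq_idx, word_idx, X[seq_idx, word_idx]] = 1
    return X_onehot

toy = np.array([[2, 1, 0],
                [3, 0, 0]])
toy_onehot = onehot_vectorized(toy, num_words=4)

print(toy_onehot.shape)
print(toy_onehot[0])
```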

In [21]:
model_onehot = Sequential()
model_onehot.add(Conv1D(128, 5, activation='relu', input_shape=(max_seq_len, MAX_WORDS)))
#model_onehot.add(MaxPooling1D(5))
model_onehot.add(Flatten())
model_onehot.add(Dense(128))
model_onehot.add(Activation('relu'))
model_onehot.add(Dense(64)) 
model_onehot.add(Activation('relu'))
model_onehot.add(Dense(NUM_LABELS))
model_onehot.add(Activation('softmax'))

print(model_onehot.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_4 (Conv1D)            (None, 25, 128)           640128    
_________________________________________________________________
flatten_4 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_10 (Dense)             (None, 128)               409728    
_________________________________________________________________
activation_10 (Activation)   (None, 128)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 64)                8256      
_________________________________________________________________
activation_11 (Activation)   (None, 64)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 3)                 195       
__________

In [22]:
model_onehot.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [23]:
history_onehot = model_onehot.fit(X_train_onehot, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [24]:
score_onehot = model_onehot.evaluate(X_test_onehot, y_test, batch_size=32, verbose=1)
print('Test score:', score_onehot[0])
print('Test accuracy:', score_onehot[1])

Test score: 1.49082253283
Test accuracy: 0.640939596915


## Training the model (with word embeddings)

Finally, we use word embeddings. There are three ways to do so:

* Embedding layer with randomly initialized weights

* Embedding layer with pretrained weights (trainable: the weights will be updated during training)

* Embedding layer with pretrained weights (not trainable: the weights won't be updated during training)

### Load pretrained word embeddings (GloVe)

In [25]:
df_glove = pd.read_table('data/pretrained-word-vectors/glove.6B.50d.txt', sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

# Print the first 5 lines
df_glove.head()

0,1,2,3,4,5,6,7,8,9,10,...,41,42,43,44,45,46,47,48,49,50
the,0.418,0.24968,-0.41242,0.1217,0.34527,-0.044457,-0.49688,-0.17862,-0.00066,-0.6566,...,-0.29871,-0.15749,-0.34758,-0.045637,-0.44251,0.18785,0.002785,-0.18411,-0.11514,-0.78581
",",0.013441,0.23682,-0.16899,0.40951,0.63812,0.47709,-0.42852,-0.55641,-0.364,-0.23938,...,-0.080262,0.63003,0.32111,-0.46765,0.22786,0.36034,-0.37818,-0.56657,0.044691,0.30392
.,0.15164,0.30177,-0.16763,0.17684,0.31719,0.33973,-0.43478,-0.31086,-0.44999,-0.29486,...,-6.4e-05,0.068987,0.087939,-0.10285,-0.13931,0.22314,-0.080803,-0.35652,0.016413,0.10216
of,0.70853,0.57088,-0.4716,0.18048,0.54449,0.72603,0.18157,-0.52393,0.10381,-0.17566,...,-0.34727,0.28483,0.075693,-0.062178,-0.38988,0.22902,-0.21617,-0.22562,-0.093918,-0.80375
to,0.68047,-0.039263,0.30186,-0.17792,0.42962,0.032246,-0.41376,0.13228,-0.29847,-0.085253,...,-0.094375,0.018324,0.21048,-0.03088,-0.19722,0.082279,-0.09434,-0.073297,-0.064699,-0.26044


In [26]:
print(df_glove.loc["soldier"])

1    -0.058880
2    -0.114330
3     0.091460
4    -0.778940
5     1.950600
6     0.218560
7    -0.139900
8     1.074100
9     0.033428
10   -1.533700
11    0.282160
12   -0.767850
13   -0.139840
14    0.192400
15    0.192250
16   -1.037200
17   -0.483610
18    0.567100
19   -0.879310
20    0.598970
21   -0.667220
22    1.227400
23   -0.190130
24    0.000916
25    0.047809
26   -1.953500
27   -0.583520
28   -0.345640
29    0.611330
30    0.530760
31    1.878600
32   -0.884640
33   -0.795120
34    0.269860
35    0.456810
36    1.119200
37    0.219990
38   -0.963350
39   -0.194830
40   -0.224560
41    0.385940
42    0.099907
43    0.301910
44   -0.846750
45    1.072900
46   -0.901990
47   -0.183620
48   -0.192430
49   -0.044231
50   -0.125190
Name: soldier, dtype: float64


We first need to convert the loaded word vectors into a form that we can pass to the `Embedding` layer as input (see below).

Pretrained word embeddings are trained over large datasets like news articles or Wikipedia pages. However, there is no guarantee that they cover all words in the vocabulary of a dataset. This is particularly true for social media content, where users write all kinds of words and non-words. To get some idea, let's calculate the ratio of words in our vocabulary that are not part of the pretrained word vectors.

In [27]:
oov_words = set()

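# Rows of words without a pretrained vector keep their random initialization
embedding_matrix = np.random.random((MAX_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    try:
        embedding_vector = df_glove.loc[word].values
    except KeyError:
        # The word is not covered by the pretrained vectors
        oov_words.add(word)
        continue
    # word_index holds all words, but only indexes below MAX_WORDS have a row
    if i < MAX_WORDS and embedding_vector.shape == (EMBEDDING_DIM,):
        embedding_matrix[i] = embedding_vector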
        
        
print("Number of words not in the pretrained set: {}".format(len(oov_words)))
print("Ratio of words not in the pretrained set: {:.2}".format(len(oov_words)/len(word_index)))

Number of words not in the pretrained set: 523
Ratio of words not in the pretrained set: 0.19
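
The same bookkeeping can be sketched on toy data with a plain dictionary lookup, which makes the two cases (word covered vs. out of vocabulary) explicit. All names, vectors, and sizes below are hypothetical stand-ins for `df_glove`, `word_index`, `MAX_WORDS`, and `EMBEDDING_DIM`.

```python
import numpy as np

MAX_WORDS_TOY = 4      # hypothetical vocabulary cut-off
EMBEDDING_DIM_TOY = 3  # hypothetical vector size

# Stand-ins for the pretrained vectors and the tokenizer's word index
pretrained = {
    "good": np.array([0.1, 0.2, 0.3]),
    "flight": np.array([0.4, 0.5, 0.6]),
}
toy_word_index = {"good": 1, "flight": 2, "soooo": 3, "crew": 4}

rng = np.random.default_rng(0)
toy_matrix = rng.random((MAX_WORDS_TOY, EMBEDDING_DIM_TOY))

toy_oov = set()
for word, i in toy_word_index.items():
    vector = pretrained.get(word)
    if vector is None:
        toy_oov.add(word)        # no pretrained vector: the random row is kept
    elif i < MAX_WORDS_TOY:
        toy_matrix[i] = vector   # only the most frequent words have a row

print("OOV words:", sorted(toy_oov))
print("OOV ratio: {:.2f}".format(len(toy_oov) / len(toy_word_index)))
```

Row 0 of the matrix is never assigned because index 0 is reserved for padding.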


### Define network model

This model now has an `Embedding` layer. By choosing one of the three respective lines of code, you can decide how to initialize the layer (random vs. pretrained) and whether the weights are updated during training (trainable vs. non-trainable).

In [28]:
model_embed = Sequential()

# Choose one of the following lines
#model_embed.add(Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=max_seq_len))
#model_embed.add(Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=max_seq_len, weights=[embedding_matrix], trainable=False))
model_embed.add(Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=max_seq_len, weights=[embedding_matrix], trainable=True))

model_embed.add(Conv1D(128, 5, activation='relu'))
#model_embed.add(MaxPooling1D(5))
model_embed.add(Flatten())
model_embed.add(Dense(128))
model_embed.add(Activation('relu'))
model_embed.add(Dense(64)) 
model_embed.add(Activation('relu'))
model_embed.add(Dense(NUM_LABELS))
model_embed.add(Activation('softmax'))

print(model_embed.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 29, 50)            50000     
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 25, 128)           32128     
_________________________________________________________________
flatten_5 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_13 (Dense)             (None, 128)               409728    
_________________________________________________________________
activation_13 (Activation)   (None, 128)               0         
_________________________________________________________________
dense_14 (Dense)             (None, 64)                8256      
_________________________________________________________________
activation_14 (Activation)   (None, 64)                0         
__________

### Compile model

In [29]:
model_embed.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### Train model

In [30]:
history_embed = model_embed.fit(X_train, y_train, batch_size=32, epochs=20, verbose=1, validation_split=0.1)

Train on 629 samples, validate on 70 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Evaluate model

In [31]:
score_embed = model_embed.evaluate(X_test, y_test, batch_size=32, verbose=1)
print('Test score:', score_embed[0])
print('Test accuracy:', score_embed[1])

Test score: 1.57325528612
Test accuracy: 0.634228187919
