**Created by Peter Nagy February 2017 [Github][1]**

**Sentiment Analysis:** the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.


  [1]: https://github.com/nagypeterjob

As an improvement on my previous [Kernel][2], here I try to achieve better results with a Recurrent Neural Network.

  [2]: https://www.kaggle.com/ngyptr/d/crowdflower/first-gop-debate-twitter-sentiment/python-nltk-sentiment-analysis

In [34]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, CuDNNLSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re


Only keeping the necessary columns.

In [70]:
train_data = pd.read_csv('../data/sentiment140_train.zip', encoding='ISO-8859-1', header=None, names=['sentiment','id','timestamp','type','user','text'])
test_data = pd.read_csv('../data/sentiment140_test.zip', encoding='ISO-8859-1', header=None, names=['sentiment','id','timestamp','type','user','text'])
# The original training data are sorted by sentiment value, so shuffle the rows.
# Keep only the necessary columns: the tweet text and its sentiment label.
train_data = train_data[['text','sentiment']].sample(frac=1)
test_data = test_data[['text','sentiment']]

In [71]:
train_data.dtypes

text         object
sentiment     int64
dtype: object

In [72]:
train_data.head(10)

index,text,sentiment
1191014,"@empian sure is, and they were on offer so tas...",4
458592,............ orlando magic making me sad,0
1086868,@iAmCherAnn Really? Allison said that?,4
387691,i want to go to america NOW! i dont want to wa...,0
552737,Dont like going to sleep after being full off ...,0
1521780,@kaypros Thanks. You are so right - great venu...,4
1140025,Kobayashi beats Chestnut! http://tinyurl.com/n...,4
1498757,@DamienAidoo lol whateva! That's y ur hawks be...,4
707350,It's hot and I don't want to go work on the ol...,0
936938,".@imymulla Brilliant work-around, thx for shar...",4


In [73]:
train_data['sentiment'].unique()

array([4, 0])

In [74]:
test_data.head(10)

index,text,sentiment
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,4
1,Reading my kindle2... Love it... Lee childs i...,4
2,"Ok, first assesment of the #kindle2 ...it fuck...",4
3,@kenburbary You'll love your Kindle2. I've had...,4
4,@mikefish Fair enough. But i have the Kindle2...,4
5,@richardebaker no. it is too big. I'm quite ha...,4
6,Fuck this economy. I hate aig and their non lo...,0
7,Jquery is my new best friend.,4
8,Loves twitter,4
9,how can you not love Obama? he makes jokes abo...,4


In [75]:
test_data['sentiment'].unique()

array([4, 0, 2])

In [76]:
print('Positive rows: {}'.format(train_data[ train_data['sentiment'] == 4]['sentiment'].size))
print('Negative rows: {}'.format(train_data[ train_data['sentiment'] == 0]['sentiment'].size))

Positive rows: 800000
Negative rows: 800000


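Next, I relabel the positive class from 4 to 1. The training set contains only negative (0) and positive (4) tweets; the 'Neutral' label (2) appears only in the separate test file, so the task stays binary, which matches my goal of only differentiating positive and negative tweets. No extra text filtering is applied in this version beyond what Tokenizer does by default (lowercasing and stripping punctuation). Then I set the maximum number of features to 10000 and use Tokenizer to vectorize the text and convert it into sequences that the Network can take as input.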

In [48]:
train_data.loc[train_data['sentiment'] == 4, ['sentiment']] = 1

In [77]:
max_fatures = 10000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(train_data['text'].values)
X = tokenizer.texts_to_sequences(train_data['text'].values)
X = pad_sequences(X)

In [78]:
X.shape

(1600000, 116)
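
To make the tokenizing and padding step above more concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper names (`fit_word_index`, `texts_to_sequences`, `pad_left`) and the toy tweets are made up for illustration, and the sketch assumes plain whitespace splitting with no punctuation handling.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() keeps first-seen order for ties, so the ranking is deterministic
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices below num_words, i.e. the num_words - 1 most frequent words
    return {word: i + 1 for i, word in enumerate(ranked[: num_words - 1])}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index and silently drop unknown words."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_left(sequences):
    """Left-pad every sequence with zeros up to the longest sequence."""
    width = max(len(seq) for seq in sequences)
    return [[0] * (width - len(seq)) + seq for seq in sequences]

# Hypothetical toy tweets, only for illustration
toy_texts = ["loves twitter", "i love my kindle", "i hate this economy"]
word_index = fit_word_index(toy_texts, num_words=6)
padded = pad_left(texts_to_sequences(toy_texts, word_index))
for row in padded:
    print(row)
```

The leading zeros play the same role as the zero columns visible in `X`: every row has the same length, and short tweets are filled from the left.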

Next, I compose the LSTM Network. Note that **embed_dim**, **lstm_out**, **batch_size**, and the dropout rate are hyperparameters; their values here were chosen by intuition and should be tuned in order to achieve good results. Please also note that I am using softmax as the activation function of the output layer. The reason is that the Network is trained with categorical crossentropy, which expects a probability distribution over the classes, and softmax produces exactly that.
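
As a small worked example of why softmax and categorical crossentropy fit together, the sketch below computes both by hand for a single two-class prediction. The logit values are arbitrary and the function names are my own; Keras performs the equivalent computation internally, with extra numerical safeguards.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)

# Arbitrary example scores for [negative, positive]
probs = softmax([2.0, 0.0])
loss_if_negative = categorical_crossentropy([1, 0], probs)  # model agrees with the label
loss_if_positive = categorical_crossentropy([0, 1], probs)  # model disagrees with the label

print(probs)
print(loss_if_negative, loss_if_positive)
```

The loss is small when the probability mass sits on the true class and grows quickly when it does not. With only two classes, a single sigmoid output with binary crossentropy would be an equivalent formulation.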

In [79]:
X

array([[   0,    0,    0, ...,   15, 1310,  156],
       [   0,    0,    0, ...,  341,   14,  110],
       [   0,    0,    0, ..., 5629,  303,   17],
       ...,
       [   0,    0,    0, ...,    6,    3,  493],
       [   0,    0,    0, ...,  945,   36,   87],
       [   0,    0,    0, ..., 2190,   46, 2379]], dtype=int32)

Here I build the one-hot label matrix and split the data into train and test sets.

In [81]:
Y = pd.get_dummies(train_data['sentiment']).values
Y.shape

(1600000, 2)

In [82]:
Y

array([[0, 1],
       [1, 0],
       [0, 1],
       ...,
       [0, 1],
       [1, 0],
       [1, 0]], dtype=uint8)
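
To read the `Y` matrix it helps to know which column is which. The sketch below mimics the one-hot encoding in plain Python, assuming (as `pd.get_dummies` does) that columns follow the sorted label values; the `one_hot` helper is hypothetical and not part of pandas.

```python
def one_hot(labels):
    """One column per distinct label, columns ordered by sorted label value."""
    classes = sorted(set(labels))
    column = {label: i for i, label in enumerate(classes)}
    rows = [[1 if column[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

# 0 = negative, 4 = positive, as in the raw sentiment column
classes, rows = one_hot([4, 0, 4, 0])
print(classes)
print(rows)
```

Under that assumption, column 0 marks negative tweets and column 1 marks positive tweets, so a row `[0, 1]` in `Y` is a positive tweet.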

In [95]:
size_limit = 50000  # number of shuffled rows used; this value matches the shapes printed below
X_train, X_test, Y_train, Y_test = train_test_split(X[:size_limit],Y[:size_limit], test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(33500, 116) (33500, 2)
(16500, 116) (16500, 2)
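
The printed shapes correspond to 50,000 rows split 67/33. The arithmetic can be checked with the sketch below, which assumes that scikit-learn rounds the test count up and gives the remaining rows to the training set; `split_sizes` is a made-up helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(50000, 0.33)
print(n_train, n_test)
```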


In [106]:
embed_dim = 200
lstm_out = 500

model = Sequential()
model.add(Embedding(max_fatures, embed_dim, input_length = X.shape[1]))
model.add(SpatialDropout1D(rate=0.5))
model.add(CuDNNLSTM(lstm_out, return_sequences=True))
model.add(CuDNNLSTM(lstm_out))
model.add(Dense(2, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 116, 200)          2000000   
_________________________________________________________________
spatial_dropout1d_12 (Spatia (None, 116, 200)          0         
_________________________________________________________________
cu_dnnlstm_19 (CuDNNLSTM)    (None, 116, 500)          1404000   
_________________________________________________________________
cu_dnnlstm_20 (CuDNNLSTM)    (None, 500)               2004000   
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 1002      
Total params: 5,409,002
Trainable params: 5,409,002
Non-trainable params: 0
_________________________________________________________________
None
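
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check when changing **embed_dim** or **lstm_out**. The sketch below uses the standard per-layer formulas; the function names are mine, and the LSTM formula assumes the CuDNN variant keeps two bias vectors per gate, which is what the numbers above imply.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def cudnn_lstm_params(input_dim, units):
    """4 gates, each with an input kernel, a recurrent kernel and two bias vectors."""
    return 4 * (units * input_dim + units * units + 2 * units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

counts = [
    embedding_params(10000, 200),   # embedding layer
    cudnn_lstm_params(200, 500),    # first LSTM layer, fed by the embedding
    cudnn_lstm_params(500, 500),    # second LSTM layer, fed by the first
    dense_params(500, 2),           # softmax output layer
]
total = sum(counts)
print(counts, total)
```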


Here we train the Network. It should be run for many more than 7 epochs, but I would have to wait forever for Kaggle, so it is 7 for now.

In [107]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
 - 116s - loss: 0.5320 - acc: 0.7321
Epoch 2/7
 - 115s - loss: 0.4230 - acc: 0.8083
Epoch 3/7
 - 115s - loss: 0.3616 - acc: 0.8391
Epoch 4/7
 - 115s - loss: 0.3092 - acc: 0.8650
Epoch 5/7
 - 115s - loss: 0.2602 - acc: 0.8887
Epoch 6/7
 - 115s - loss: 0.2207 - acc: 0.9056
Epoch 7/7
 - 115s - loss: 0.1839 - acc: 0.9215


<keras.callbacks.History at 0x7fe32c4996a0>

Measuring score (loss) and accuracy on the held-out test split.

In [108]:
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.8f" % (score))
print("acc: %.8f" % (acc))

score: 0.77538238
acc: 0.75387879
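
For reference, the accuracy reported here is the share of rows where the most probable predicted class matches the one-hot label. A minimal sketch of that computation follows; the labels and probabilities are invented for illustration and are not model outputs.

```python
def accuracy(one_hot_labels, predicted_probs):
    """Fraction of rows whose argmax prediction equals the true class."""
    hits = 0
    for truth, probs in zip(one_hot_labels, predicted_probs):
        if truth.index(max(truth)) == probs.index(max(probs)):
            hits += 1
    return hits / len(one_hot_labels)

# Invented example rows: [negative, positive]
labels = [[0, 1], [1, 0], [0, 1], [1, 0]]
predictions = [[0.2, 0.8], [0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
toy_acc = accuracy(labels, predictions)
print(toy_acc)
```

Comparing the two real numbers from this run, the training accuracy in the last epoch (0.9215) is well above the test accuracy (0.7539), which suggests the model has started to overfit the 33,500 training rows.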
