# Sentiment analysis on tweets

*Objective* To perform sentiment analysis on tweets.
Training data used is the [sentiment140 training][training_set] set, which contains 1.6 Million tweets.
The tweets are classified into positive, and negative.

For this experiment I will be using a LSTM model from the Keras library with Tensorflow as the backend. 

[training_set]: http://help.sentiment140.com/for-students/

In [1]:
import os
import numpy as np
import pandas as pd
from keras.layers import Input, Dense, Dropout, Activation, Embedding
from keras.layers import Embedding
from keras.layers import LSTM
from keras.models import Model
from keras.layers import Conv1D, MaxPooling1D
from keras import callbacks
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
BASE_DIR = ''
DATA_DIR = BASE_DIR + 'data/'
GLOVE_DIR = BASE_DIR + 'glove_dir/'
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

Loading in the training data

In [3]:
df = pd.read_csv(os.path.join(DATA_DIR,"training.1600000.processed.noemoticon.csv"), names=(['polarity', 'tweet_id', 'date', 'query', 'user', 'text']), encoding='ISO-8859-1')
df.head()

Unnamed: 0,polarity,tweet_id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Let's encode Polarity to 0 an 1, instead of the 0(negative) and 4(positive), 
so that keras is able to recognise it.

In [4]:
df['isNegative'] = df['polarity'].map({0: 1, 2: 0, 4: 0})
df = df.reset_index()

### Loading the GloVe embeddings 
I will be using the 100 dimensional [GloVe][glove] embeddings, of 400k words.

[glove]: https://nlp.stanford.edu/projects/glove/

In [5]:
embeddings_index = {}
f = open('../glove_dir/glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found {0} word vectors.'.format(len(embeddings_index)))

Found 400000 word vectors.


Next we will proceed to splitting our data into train set and test set.
Which will be followed by tokenizing the tweet texts and padding.

In [6]:
max_features = 20000
max_words = 1000
batch_size = 32
maxlen = 80 # cut texts after this number of words (among top max_features most common words)

#X_train, X_test, y_train, y_test = train_test_split(df['text'], df[['isNegative', 'isNeutral', 'isPositive']], test_size=0.2, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['isNegative'], test_size=0.2, random_state=12)
tokenizer = Tokenizer(num_words=max_words, lower=True, filters='0123456789.#!?:()[]', char_level=True)
tokenizer.fit_on_texts(np.array(df['text'].fillna('')))
word_index = tokenizer.word_index

print('Found {0} unique tokens.'.format(len(word_index)))

X_train = pad_sequences(tokenizer.texts_to_sequences(np.array(X_train)), maxlen=maxlen) 
X_test = pad_sequences(tokenizer.texts_to_sequences(np.array(X_test)), maxlen=maxlen)
print('x_train shape:', X_train.shape)
print('x_test shape:', X_test.shape)

Found 193 unique tokens.
x_train shape: (1280000, 80)
x_test shape: (320000, 80)


Now let's create the embedding matrix which will be used as the weights in our embedding layer.

In [7]:
# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        
        embedding_matrix[i-1] = embedding_vector

## Create and train our model

We will start creating our model which will be using a convolutional layer followed by a LSTM layer.

*Note* Embedding layer should have trainable set to False, so that weights we computed earlier does not get overridden.

In [8]:
tweet_input = Input(shape=(80,), name='tweet_input')
embedding_layer = Embedding(num_words,EMBEDDING_DIM, weights=[embedding_matrix], input_length=maxlen,trainable=False)(tweet_input)
x = Dropout(0.2)(embedding_layer)
convolution = Conv1D(64, 64, padding='valid', activation='relu', strides=1)(x)
pooling = MaxPooling1D(pool_size=4)(convolution)
#lstm = LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(embedding_layer)
lstm = LSTM(64, dropout=0.2, recurrent_dropout=0.2)(embedding_layer)
lstm_out = Dense(1, activation='sigmoid', name='lstm_out')(lstm)

# try using different optimizers and different optimizer configs
model = Model(inputs=[tweet_input], outputs=[lstm_out])
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',metrics=['accuracy'])
#print (model.summary())
metrics=['accuracy']
mode = 'auto'
monitor = 'val_acc'
patience = 0
cbks = [callbacks.EarlyStopping(patience=patience, monitor=monitor, mode=mode)]
model.fit(X_train, np.array(y_train), validation_split=VALIDATION_SPLIT, epochs=20, callbacks=cbks, batch_size=batch_size)

Train on 1024000 samples, validate on 256000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20


<keras.callbacks.History at 0x7fb8381cf0f0>

As you can see, each epoch took around 50 min to complete. For refernce I ran this on my core i7 laptop.
This was however not run on a tensorflow configured for cuda.

The model finally shows an accuracy of 0.7881 in the validation set.
Now let's see how this model performs on test data.

In [9]:
scores = model.evaluate(np.array(X_test), np.array(y_test), batch_size=batch_size, verbose=1)
print("\n Test {0}: {1}, {2}: {3}".format(model.metrics_names[0], scores[0], model.metrics_names[1], scores[1]))

 Test loss: 0.44714794987887146, acc: 0.789975


Now let's test our model on some live data and see how the model fares.

In [10]:
X_pred_test = ["I know people love them, but I find C's prefix and postfix increment operators a usability nightmare.",
              "I understand that people don't want to read research papers, so it's a non-issue. But it's just so blatant and shameless.",
              "Awesome work from @OpenAI: Dota 2 bot, a neural net trained with self-play beats the world's top players at 1v1. Learns fun strategies!",
              "Finally watched Silicon Valley season 4 this week... I laughed so hard at this scene:",
              "Oh, I know what would make verbose arithmetic code easier to read: adding in mutation as a concept to basic arithmetic.",
              """The way in which you are a "glas half full" person where I'm a "glass half empty and why is there no ice" person are amazing ;)"""]

X_pred_test = pad_sequences(tokenizer.texts_to_sequences(np.array(X_pred_test)), maxlen=maxlen)

In [11]:
model.predict(X_pred_test)

array([[ 0.62591708],
       [ 0.69186771],
       [ 0.37482822],
       [ 0.71552372],
       [ 0.49691495],
       [ 0.51810449]], dtype=float32)

To be frank, this does not look really good. All output values are very close 0.5 , so they are not very confident.
But it's surprising that the classifier was able to identify the last two tweets were negative.

Finally let's save this model for further use.

In [12]:
model.save('tweet_sentiment_model.h5')

## Final Notes

1. It would be intresting to see the effect of using a higher dimensional embedding matrix.
2. Since the tweets often contain 