# Sentiment Analysis
The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. The Orignal sample comes from https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
#on osx
#os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

Using plaidml.keras.backend backend.


Keep only the necessary columns.

In [4]:
data = pd.read_csv('sentiment_data/Sentiment.csv')
# Keeping only the neccessary columns
data = data[['text','sentiment']]

Next, I am dropping the 'Neutral' sentiments as my goal was to only differentiate positive and negative tweets. After that, I am filtering the tweets so only valid texts and words remain.  Then, I define the number of max features as 2000 and use Tokenizer to vectorize and convert text into Sequences so the Network can deal with it as input.

In [5]:
data = data[data.sentiment != "Neutral"]
data['text'] = data['text'].apply(lambda x: x.lower())
data['text'] = data['text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))

print(data[ data['sentiment'] == 'Positive'].size)
print(data[ data['sentiment'] == 'Negative'].size)

for idx,row in data.iterrows():
    row[0] = row[0].replace('rt',' ')
    
max_fatures = 2000

#Tokenizer is a built in Keras tool to break down large group of documents into
# an easily usable format. In particular, it gives
# word_counts: A dictionary of words and their counts.
#word_docs: A dictionary of words and how many documents each appeared in.
#word_index: A dictionary of words and their uniquely assigned integers.
#document_count:An integer count of the total number of documents that were used to fit the Tokenizer.

tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)
print(X[500])

4472
16986
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0   44   58   36   52 1374    2]


## Build LSTM Network

Next, I compose the LSTM Network. Note that **embed_dim**, **lstm_out**, **batch_size**, **droupout_x** variables are hyperparameters, their values are somehow intuitive, can be and must be played with in order to achieve good results. As we are only interested in positive or negative sentiment, we can have a sigmoid activation function at the last layer [0,1] and a binary crossentropy loss. If we wanted more categories, we would use softmax and categorical cross_entropy.

In [25]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 28, 128)           256000    
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 28, 128)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 197       
Total params: 510,997
Trainable params: 510,997
Non-trainable params: 0
_________________________________________________________________
None


In [26]:
Y=data['sentiment'].astype('category')
Y

1        Positive
3        Positive
4        Positive
5        Positive
6        Negative
           ...   
13866    Negative
13867    Positive
13868    Positive
13869    Negative
13870    Positive
Name: sentiment, Length: 10729, dtype: category
Categories (2, object): [Negative, Positive]

In [27]:
Y=Y.cat.codes
Y

1        1
3        1
4        1
5        1
6        0
        ..
13866    0
13867    1
13868    1
13869    0
13870    1
Length: 10729, dtype: int8

### Prepare train & test dataset 
Try and redress class imbalance by assigning weights

In [28]:
#Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(7188, 28) (7188,)
(3541, 28) (3541,)


In [29]:
class_weight={1: 4.0, #redress class imbalance by assigning 4x weight to positive
              0: 1.0}

### Train the network

In [30]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 10, batch_size=batch_size, class_weight=class_weight, verbose = 2)

Epoch 1/10


INFO:plaidml:Analyzing Ops: 3032 of 3461 operations complete
INFO:plaidml:Analyzing Ops: 2911 of 3462 operations complete


 - 27s - loss: 1.2398 - acc: 0.8102
Epoch 2/10
 - 14s - loss: 0.9126 - acc: 0.8609
Epoch 3/10
 - 14s - loss: 0.7895 - acc: 0.8819
Epoch 4/10
 - 14s - loss: 0.7046 - acc: 0.8959
Epoch 5/10
 - 16s - loss: 0.6310 - acc: 0.9080
Epoch 6/10
 - 15s - loss: 0.5629 - acc: 0.9174
Epoch 7/10
 - 15s - loss: 0.5064 - acc: 0.9245
Epoch 8/10
 - 14s - loss: 0.4724 - acc: 0.9317
Epoch 9/10
 - 15s - loss: 0.4240 - acc: 0.9361
Epoch 10/10
 - 14s - loss: 0.3972 - acc: 0.9400


<keras.callbacks.History at 0x1a22b08890>

Extracting a validation set, and measuring score and accuracy.

In [31]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 1, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.55
acc: 0.83


In [38]:
Y_validate.values[0]

0

Finally measuring the number of correct guesses.  It is clear that finding negative tweets goes very well for the Network but deciding whether is positive is not really. My educated guess here is that the positive training set is dramatically smaller than the negative, hence the "bad" results for positive tweets.

In [46]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    
    result = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0][0]
    print (result, Y_validate.values[x])
    
    if result>=0.5:
        if Y_validate.values[x]==1:
            pos_correct+=1
            pos_cnt+=1
        else:
            neg_cnt+=1
    else:
        if Y_validate.values[x]==0:
            neg_correct+=1
            neg_cnt+=1
        else:
            pos_cnt+=1

print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

0.0011882622 0
0.26062018 0
0.00043267754 0
0.008901552 0
0.0013350381 0
0.0041489825 0
0.0111680655 0
0.005679926 1
0.00201489 0
0.099065766 0
0.0022559264 0
0.0008874759 0
0.00861515 0
0.06324725 0
0.98388404 1
0.0113879945 0
0.08433372 1
0.00067802414 0
0.7595814 0
0.15486097 0
0.00039225788 0
0.002078962 0
0.99127674 0
0.28534657 1
0.04323918 0
0.008186874 0
0.40773013 1
0.00015490348 0
0.00035242498 0
0.001073553 0
0.996777 1
0.08515249 0
0.95832896 0
0.9392569 1
0.61757946 0
0.17948511 0
0.0009472508 0
0.00013702954 0
0.00036254828 0
0.0005477684 1
0.0014306403 0
0.0029999474 0
0.43763897 0
0.0008874759 0
0.0029999474 0
0.026464082 0
0.024962548 0
0.99148023 1
0.004827218 1
0.00056511216 1
0.0006928505 0
0.9798782 1
0.15541036 0
0.9953897 1
0.840353 1
0.02682033 0
0.0002981319 0
0.00032248354 0
0.00020030527 0
0.037487913 0
0.1497652 0
0.016861392 0
0.0004645485 0
0.041007735 0
0.00036254828 0
0.004980318 0
0.08521755 0
0.00090299174 0
0.001634005 0
0.05107479 0
0.00023178439 0
0

Please note that the network performs poorly. It's because the training data is very unbalanced (pos: 4472, neg: 16986), you should get more data, use other dataset, use pre-trained model, or weight classes to achieve reliable predictions.

In [47]:
twt = ['Spoke very well']
#vectorizing the tweet by the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
#padding the tweet to have exactly the same shape as `embedding_2` input
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0][0]
if(sentiment < 0.5):
    print("negative")
else:
    print("positive")

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0 1122  100  222]]
positive


In [48]:
twt = ['What a waste of time']
#vectorizing the tweet by the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
#padding the tweet to have exactly the same shape as `embedding_2` input
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0][0]
if(sentiment < 0.5):
    print("negative")
else:
    print("positive")

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0  49   7   6 108]]
negative
