# First GOP Debate Twitter Sentiment

This notebook builds a Recurrent Neural Network (RNN) for tweet sentiment classification in Python 3.

I use the dataset from [Kaggle](https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment/data)  

This data originally came from [Crowdflower's Data for Everyone library](https://www.crowdflower.com/data-for-everyone/).  

As the original source says:

> We looked through tens of thousands of tweets about the early August GOP debate in Ohio and asked contributors to do both sentiment analysis and data categorization. Contributors were asked if the tweet was relevant, which candidate was mentioned, what subject was mentioned, and then what the sentiment was for a given tweet. We've removed the non-relevant messages from the uploaded dataset.

   
   
   
### Sentiment Analysis

Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

## Step 1 

Import packages.


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
import re

# Input data files are available in the root directory of the project.

Using TensorFlow backend.


Load the data and keep only the necessary columns.

In [4]:
data = pd.read_csv('Sentiment.csv')
# Keep only the necessary columns
data = data[['text','sentiment']]

Next, I drop the 'Neutral' tweets, since the goal is only to distinguish positive from negative tweets.   
After that, I clean the tweets: I lowercase the text and strip every character that is not a letter, digit, or whitespace.  
Then I set the vocabulary size (max features) to 2000 and use Tokenizer to convert each tweet into a sequence of word indices, padded to a common length, so the network can take it as input.
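The cleaning step described above can be tried on a single string first. This is a minimal sketch, assuming the intent is to keep only letters, digits, and whitespace; `clean_tweet` and the sample tweet are hypothetical and not part of the dataset.

```python
import re

def clean_tweet(text):
    """Lowercase the text and keep only letters, digits, and whitespace."""
    text = text.lower()
    # After lowercasing, a-z covers every letter we want to keep
    return re.sub(r'[^a-z0-9\s]', '', text)

# Hypothetical sample tweet, for illustration only
sample = "RT @GOPDebate: Great night, #Ohio!!"
cleaned = clean_tweet(sample)
print(cleaned)
```

Punctuation, mentions, and hashtag symbols disappear, while the words themselves (including the retweet marker) stay in place.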

In [5]:
data = data[data.sentiment != "Neutral"].copy()
data['text'] = data['text'].apply(lambda x: x.lower())
# Keep only letters, digits, and whitespace
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Note: .size counts cells (rows x 2 columns), so each value is twice the number of tweets
print(data[data['sentiment'] == 'Positive'].size)
print(data[data['sentiment'] == 'Negative'].size)

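# Remove the retweet marker 'rt' as a standalone word. Writes to the rows
# yielded by iterrows() are not guaranteed to reach `data`, so assign the column instead.
data['text'] = data['text'].str.replace(r'\brt\b', ' ', regex=True)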
    
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)

4472
16986
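To make the vectorization step concrete, here is a simplified pure-Python stand-in for what the tokenizer and padding do. It is a sketch, not the Keras implementation: `fit_word_index`, `texts_to_sequences`, `pad_left`, and the toy texts are hypothetical, and it assumes words are ranked by frequency starting at index 1, that only indices below `num_words` are kept, and that sequences are padded with zeros on the left.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words ranked num_words or later."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_left(sequences):
    """Left-pad with zeros so every sequence is as long as the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[0] * (width - len(seq)) + seq for seq in sequences]

# Toy corpus, for illustration only
texts = ["great debate", "debate was bad", "bad bad debate tonight"]
word_index = fit_word_index(texts)
padded = pad_left(texts_to_sequences(texts, word_index, num_words=4))
print(word_index)
print(padded)
```

Rare words ("was", "tonight") fall outside the vocabulary limit and are dropped, which is what capping the features at 2000 does on the real data.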




Next, I compose the LSTM network.   
Note that embed_dim, lstm_out, batch_size, and the dropout rates are hyperparameters. The values here were chosen by intuition and should be tuned to achieve good results.   
Also note that I use softmax as the activation function of the output layer. The network is trained with categorical crossentropy, which expects a probability distribution over the classes, and softmax produces exactly that.
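The pairing of softmax with categorical crossentropy can be checked by hand on a two-class example. This is a small sketch in plain Python; the logits `[2.0, 0.0]` are made up for illustration, and the helper functions are hypothetical rather than the Keras internals.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 0.0])                  # the network leans towards class 0
loss_if_negative = categorical_crossentropy([1, 0], probs)
loss_if_positive = categorical_crossentropy([0, 1], probs)
print(probs, loss_if_negative, loss_if_positive)
```

The loss is small when the true class is the one the network favours and large otherwise, which is the signal the optimizer minimizes.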

In [6]:
embed_dim = 128
lstm_out = 196

model = Sequential()
# Keras 2 Embedding has no `dropout` argument, so dropout is applied in the LSTM only
model.add(Embedding(max_fatures, embed_dim, input_length=X.shape[1]))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 28, 128)           256000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 394       
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None
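The parameter counts in the summary can be reproduced with a few lines of arithmetic. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, lstm_units, num_classes = 2000, 128, 196, 2

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and one bias per unit
lstm_params = 4 * (embed_dim + lstm_units + 1) * lstm_units

# Fully connected output layer: weights plus one bias per class
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```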


Here I one-hot encode the labels and split the data into training and test sets.

In [7]:
Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(7188, 28) (7188, 2)
(3541, 28) (3541, 2)
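The label encoding and the split sizes can be sketched without the libraries. This is an illustration under two assumptions: that the one-hot columns follow the sorted label names (so column 0 is Negative and column 1 is Positive), and that the test size is the fraction rounded up with the remainder going to training. The three labels are made up.

```python
import math

# Hypothetical labels, for illustration only
labels = ["Negative", "Positive", "Negative"]
classes = sorted(set(labels))                     # ['Negative', 'Positive']
one_hot = [[int(label == c) for c in classes] for label in labels]

# Split sizes for the sample count reported in the shapes above
n_samples = 7188 + 3541
n_test = math.ceil(0.33 * n_samples)
n_train = n_samples - n_test
print(one_hot, n_train, n_test)
```

The column order matters later: any code that treats index 0 as "negative" relies on this sorted layout.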


Here I train the network.   
Training for more epochs may give better results.

In [9]:
batch_size = 32
model.fit(X_train, Y_train, epochs=10, batch_size=batch_size, verbose=2)



Epoch 1/10
15s - loss: 0.4259 - acc: 0.8215
Epoch 2/10
10s - loss: 0.3067 - acc: 0.8737
Epoch 3/10
10s - loss: 0.2646 - acc: 0.8907
Epoch 4/10
10s - loss: 0.2261 - acc: 0.9082
Epoch 5/10
10s - loss: 0.1961 - acc: 0.9218
Epoch 6/10
10s - loss: 0.1763 - acc: 0.9303
Epoch 7/10
10s - loss: 0.1560 - acc: 0.9356
Epoch 8/10
10s - loss: 0.1327 - acc: 0.9463
Epoch 9/10
10s - loss: 0.1241 - acc: 0.9491
Epoch 10/10
10s - loss: 0.1159 - acc: 0.9538


<keras.callbacks.History at 0x1e78c3fc7f0>

Here I hold out the last 1500 test samples as a validation set, then measure the score (loss) and accuracy on the remaining test samples.

In [11]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.63
acc: 0.82
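The hold-out slicing is easy to verify on a stand-in list. A minimal sketch, using integers in place of the 3541 test samples:

```python
validation_size = 1500
samples = list(range(3541))              # stand-in for the test samples

validate = samples[-validation_size:]    # last 1500 items
test = samples[:-validation_size]        # everything before them

print(len(test), len(validate))
```

The two slices are disjoint and together cover the original test split, so no sample is evaluated twice.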


Finally, I measure the number of correct guesses per class on the validation set.   
The network finds negative tweets very well, but it is much less reliable at recognizing positive ones.   
My guess is that this is because the positive training set is dramatically smaller than the negative one, hence the "bad" results for positive tweets.
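The imbalance can be quantified from the counts printed earlier. This sketch assumes those two numbers are `DataFrame.size` values, that is rows times the two kept columns, so each is halved to get the number of tweets; the variable names are chosen here for illustration.

```python
n_columns = 2                            # 'text' and 'sentiment'
positives = 4472 // n_columns
negatives = 16986 // n_columns
total = positives + negatives

# Accuracy of a trivial classifier that always answers "Negative"
majority_baseline = negatives / total
print(positives, negatives, round(majority_baseline, 4))
```

Under that assumption roughly four out of five tweets are negative, so overall accuracy alone says little about how well positive tweets are handled.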

In [12]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    # Predict one sample at a time; reshape it to (1, sequence_length)
    result = model.predict(X_validate[x].reshape(1, X_validate.shape[1]), batch_size=1, verbose=2)[0]

    # Column 0 of the one-hot labels is Negative, column 1 is Positive
    true_class = np.argmax(Y_validate[x])
    if np.argmax(result) == true_class:
        if true_class == 0:
            neg_correct += 1
        else:
            pos_correct += 1

    if true_class == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1

print("pos_acc", pos_correct / pos_cnt * 100, "%")
print("neg_acc", neg_correct / neg_cnt * 100, "%")

pos_acc 51.45631067961165 %
neg_acc 90.42821158690177 %
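The same per-class accuracy can also be computed in a vectorized way from a batch of predictions. This is a sketch with made-up arrays standing in for the labels and the model's softmax outputs; `per_class_accuracy` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def per_class_accuracy(y_true_onehot, y_prob):
    """Fraction of correctly predicted samples for each true class."""
    true_cls = np.argmax(y_true_onehot, axis=1)
    pred_cls = np.argmax(y_prob, axis=1)
    return {int(c): float(np.mean(pred_cls[true_cls == c] == c))
            for c in np.unique(true_cls)}

# Hypothetical labels (column 0 = Negative, column 1 = Positive) and predictions
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])

acc = per_class_accuracy(y_true, y_prob)
print(acc)
```

With a real model, `y_prob` would come from a single `model.predict` call on the whole validation array, which avoids predicting one sample at a time.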
