# Sentiment analysis using RNN and word embedding

### Summary :

- [Importing data](#1)
- [Preprocessing](#2)
- [Tokenizer](#3)
- [Split of train and test data](#4)
- [Compile the model](#5)

Dataset: https://www.kaggle.com/kazanova/sentiment140

This notebook is written in Python 3. The goal is to run sentiment analysis on the 1,600,000-tweet Sentiment140 dataset using an RNN and word embeddings

The notebook is organised as follows: section titles are in bold and can be reached through the shortcuts in the summary, explanations of what we are doing are in Markdown cells, and notes about things I tried that did not work in the end are in raw text cells (for NBConvert), so the code inside them will not run

And here we go !

## 0. Importing the libraries

In [1]:
import pandas as pd
import numpy as np
from keras.layers import LSTM, Activation, Dropout, Dense, Input, Embedding
from keras.models import Model
import string
import re
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import RegexpTokenizer

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/victorgaya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

<a name="1"></a>
## 1. Importing data

Importing file data.csv with all the tweets

In [86]:
data = pd.read_csv("../data/data.csv", sep=',', encoding = 'latin', header=None)

data

         0           1                             2         3  \
0        0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1        0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2        0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3        0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4        0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
...     ..         ...                           ...       ...   
1599995  4  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996  4  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997  4  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998  4  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599999  4  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

                       4                                                  5  
0        _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1          scotthamilton  is upset that he can't up

I first tried taking only 1000 positive and 1000 negative tweets from the middle of the dataset to speed up computation, but the model failed to compile every time. I finally used the `iloc` indexer from pandas (a few cells below) to work with fewer tweets and get the model to run properly

We can rename the columns for better understanding

In [43]:
data.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
data.head()

|   | sentiment | id | date | query | user_id | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


As we don't need all the columns, let's drop the ones we don't want

In [44]:
data = data[['text','sentiment']]
data.head()

|   | text | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |


We separate positive and negative tweets

In [45]:
pos = data[data.sentiment == 4]
neg = data[data.sentiment == 0]

We take only 5000 positive and 5000 negative tweets to get faster calculation

In [46]:
pos = pos.iloc[:int(5000)]
neg = neg.iloc[:int(5000)]

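We relabel positive tweets from 4 to 1, so that the labels are 0 and 1 as expected by the sigmoid output and the binary cross-entropy loss used later

In [47]:
pos = pos.assign(sentiment=1)  # returns a new DataFrame, which avoids pandas' SettingWithCopyWarning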

We concatenate positive and negative tweets

In [48]:
data = pd.concat([pos, neg])

data

|   | text | sentiment |
|---|---|---|
| 800000 | I LOVE @Health4UandPets u guys r the best!! | 1 |
| 800001 | im meeting up with one of my besties tonight! ... | 1 |
| 800002 | @DaRealSunisaKim Thanks for the Twitter add, S... | 1 |
| 800003 | Being sick can be really cheap when it hurts t... | 1 |
| 800004 | @LovesBrooklyn2 he has that effect on everyone | 1 |
| ... | ... | ... |
| 4995 | long day today | 0 |
| 4996 | a friend broke his promises.. | 0 |
| 4997 | @gjarnling I am fine thanks - tired | 0 |
| 4998 | trying to keep my eyes open..damn baking | 0 |


In the .csv, tweets with a negative meaning are labeled 0 and positive ones 4; after the relabeling above, 0 means negative and 1 means positive in our dataset

<a name="2"></a>
## 2. Preprocessing

To start preprocessing, we convert the tweets to lower case

In [49]:
data.text=data.text.str.lower()

data.text.head()

800000         i love @health4uandpets u guys r the best!! 
800001    im meeting up with one of my besties tonight! ...
800002    @darealsunisakim thanks for the twitter add, s...
800003    being sick can be really cheap when it hurts t...
800004      @lovesbrooklyn2 he has that effect on everyone 
Name: text, dtype: object

To preprocess our data, we first remove stop words to keep only the most informative words of each tweet; stemming and lemmatization follow after tokenization

In [50]:
STOPWORDS = set(nltk.corpus.stopwords.words('english'))  # built once; requires nltk.download('stopwords')

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

In [51]:
data.text = data.text.apply(lambda x: remove_stopwords(x))

In [52]:
data.text

800000                love @health4uandpets u guys r best!!
800001    im meeting one besties tonight! cant wait!! - ...
800002    @darealsunisakim thanks twitter add, sunisa! g...
800003    sick really cheap hurts much eat real food plu...
800004                      @lovesbrooklyn2 effect everyone
                                ...                        
4995                                         long day today
4996                                friend broke promises..
4997                         @gjarnling fine thanks - tired
4998                     trying keep eyes open..damn baking
4999                                           hell snowing
Name: text, Length: 10000, dtype: object
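
One thing to keep in mind for sentiment analysis: NLTK's English stop word list contains negations such as "no" and "not", so removing every stop word can flip the meaning of a tweet. Here is a minimal sketch of a variant that keeps negations. The tiny `STOPWORDS_DEMO` set, the `NEGATIONS` set and the function name are hypothetical stand-ins chosen for illustration, not the full NLTK list and not what this notebook runs.

```python
# Illustrative stand-in for nltk.corpus.stopwords.words('english')
# (assumption: a tiny hand-written subset, so the example runs without downloads).
STOPWORDS_DEMO = {"i", "is", "this", "the", "a", "not", "no", "it", "was"}

# Hypothetical set of words we decide to keep even though they are stop words.
NEGATIONS = {"not", "no"}


def remove_stopwords_keep_negations(text):
    """Drop stop words from a lower-cased string, but keep negation words."""
    kept = [
        word
        for word in str(text).split()
        if word not in STOPWORDS_DEMO or word in NEGATIONS
    ]
    return " ".join(kept)


print(remove_stopwords_keep_negations("this is not a good day"))
```

Whether keeping negations actually helps the model would have to be checked by retraining; this only shows how the filter could be written.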

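Next we collapse repeated characters (for example "best!!" becomes "best!") so the text is even cleaner. Note that this pattern also collapses legitimate double letters, so "meeting" becomes "meting"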

In [53]:
def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)

In [54]:
data.text = data.text.apply(lambda x: remove_repeating_char(x))

In [55]:
data.text

800000                 love @health4uandpets u guys r best!
800001    im meting one besties tonight! cant wait! - gi...
800002    @darealsunisakim thanks twiter ad, sunisa! got...
800003    sick realy cheap hurts much eat real fod plus,...
800004                        @lovesbroklyn2 efect everyone
                                ...                        
4995                                         long day today
4996                                 friend broke promises.
4997                         @gjarnling fine thanks - tired
4998                       trying kep eyes open.damn baking
4999                                            hel snowing
Name: text, Length: 10000, dtype: object
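
The output above shows that collapsing every run of identical characters also damages ordinary words ("really" becomes "realy", "food" becomes "fod"). A gentler option, sketched here as an assumption and not what this notebook uses, is to shorten only runs of three or more characters down to two. The function name `limit_repeating_char` is hypothetical.

```python
import re


def limit_repeating_char(text):
    """Shorten runs of three or more identical characters to exactly two."""
    return re.sub(r'(.)\1{2,}', r'\1\1', text)


print(limit_repeating_char("sooooo happy, meeting besties!!!"))
```

With this variant "meeting" and "happy" are left untouched, while elongated words such as "sooooo" are still normalised.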

Then we remove user mentions, i.e. the user names that follow the "@" symbol

In [56]:
def remove_at(data):
    return re.sub(r'@[^\s]+', ' ', data)

In [57]:
data.text = data.text.apply(lambda x: remove_at(x))

In [58]:
data.text

800000                                love   u guys r best!
800001    im meting one besties tonight! cant wait! - gi...
800002      thanks twiter ad, sunisa! got met hin show d...
800003    sick realy cheap hurts much eat real fod plus,...
800004                                       efect everyone
                                ...                        
4995                                         long day today
4996                                 friend broke promises.
4997                                    fine thanks - tired
4998                       trying kep eyes open.damn baking
4999                                            hel snowing
Name: text, Length: 10000, dtype: object

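We now clean and remove URLs. Since repeated characters were already collapsed earlier (turning "http://" into "htp:/" and "www" into "w"), this pattern no longer matches most URLs at this point; running it before the repeated-character step would avoid that

In [59]:
def remove_URLs(data):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))',' ',data)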

In [60]:
data.text = data.text.apply(lambda x: remove_URLs(x))

In [61]:
data.text

800000                                love   u guys r best!
800001    im meting one besties tonight! cant wait! - gi...
800002      thanks twiter ad, sunisa! got met hin show d...
800003    sick realy cheap hurts much eat real fod plus,...
800004                                       efect everyone
                                ...                        
4995                                         long day today
4996                                 friend broke promises.
4997                                    fine thanks - tired
4998                       trying kep eyes open.damn baking
4999                                            hel snowing
Name: text, Length: 10000, dtype: object
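
The URL pattern relies on `www.` or `http://` being intact, so the order of the cleaning steps matters. The short sketch below reuses the two functions from this notebook (written with raw strings) on a made-up example tweet and compares both orders; the tweet text is an assumption for illustration.

```python
import re


def remove_repeating_char(text):
    return re.sub(r'(.)\1+', r'\1', text)


def remove_URLs(text):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text)


tweet = "awww http://twitpic.com/2y1zl so cute"

# Order used in this notebook: collapse repeated characters, then remove URLs.
collapsed_first = remove_URLs(remove_repeating_char(tweet))

# Alternative order: remove URLs first, then collapse repeated characters.
urls_first = remove_repeating_char(remove_URLs(tweet))

print(repr(collapsed_first))
print(repr(urls_first))
```

In the first order the URL survives as "htp:/twitpic.com/2y1zl" because "tt" and "//" were collapsed before the pattern was applied; in the second order it is removed.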

Now we remove digits

In [62]:
def remove_numbers(data):
    return re.sub('[0-9]+', '', data)

In [63]:
data.text = data.text.apply(lambda x: remove_numbers(x))

In [64]:
data.text

800000                                love   u guys r best!
800001    im meting one besties tonight! cant wait! - gi...
800002      thanks twiter ad, sunisa! got met hin show d...
800003    sick realy cheap hurts much eat real fod plus,...
800004                                       efect everyone
                                ...                        
4995                                         long day today
4996                                 friend broke promises.
4997                                    fine thanks - tired
4998                       trying kep eyes open.damn baking
4999                                            hel snowing
Name: text, Length: 10000, dtype: object

We then remove punctuation

In [65]:
def remove_punctuations(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

In [66]:
data.text = data.text.apply(lambda x: remove_punctuations(x))

In [67]:
data.text

800000                                 love   u guys r best
800001    im meting one besties tonight cant wait  girl ...
800002      thanks twiter ad sunisa got met hin show dc ...
800003    sick realy cheap hurts much eat real fod plus ...
800004                                       efect everyone
                                ...                        
4995                                         long day today
4996                                  friend broke promises
4997                                     fine thanks  tired
4998                        trying kep eyes opendamn baking
4999                                            hel snowing
Name: text, Length: 10000, dtype: object

<a name="3"></a>
## 3. Tokenizer

We tokenize the text of each tweet, i.e. split it into a list of words

In [68]:
tokenizer = RegexpTokenizer(r'\w+')
data.text = data.text.apply(tokenizer.tokenize)

In [69]:
data.text

800000                             [love, u, guys, r, best]
800001    [im, meting, one, besties, tonight, cant, wait...
800002    [thanks, twiter, ad, sunisa, got, met, hin, sh...
800003    [sick, realy, cheap, hurts, much, eat, real, f...
800004                                    [efect, everyone]
                                ...                        
4995                                     [long, day, today]
4996                              [friend, broke, promises]
4997                                  [fine, thanks, tired]
4998                  [trying, kep, eyes, opendamn, baking]
4999                                         [hel, snowing]
Name: text, Length: 10000, dtype: object
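
With its default settings, `RegexpTokenizer(r'\w+')` returns the non-overlapping matches of the pattern in order, which is roughly what `re.findall` does. The following standard-library sketch (the function name is hypothetical) shows the idea without needing NLTK:

```python
import re


def regexp_tokenize_sketch(text, pattern=r'\w+'):
    """Return every non-overlapping match of `pattern`, in order of appearance."""
    return re.findall(pattern, text)


print(regexp_tokenize_sketch("love   u guys r best"))
print(regexp_tokenize_sketch("fine thanks  tired"))
```

Runs of spaces are simply skipped, which is why the extra blanks left by the earlier cleaning steps do not produce empty tokens.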

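We can now apply stemming. The function has to return the stemmed list: the outputs shown below were produced while it still returned `data` unchanged, which is why they contain unstemmed tokens

In [70]:
def stemming_data(data):
    stemmer = nltk.PorterStemmer()
    return [stemmer.stem(word) for word in data]  # return the stemmed tokens, not the input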

In [71]:
data.text = data.text.apply(lambda x: stemming_data(x))

In [72]:
data.text

800000                             [love, u, guys, r, best]
800001    [im, meting, one, besties, tonight, cant, wait...
800002    [thanks, twiter, ad, sunisa, got, met, hin, sh...
800003    [sick, realy, cheap, hurts, much, eat, real, f...
800004                                    [efect, everyone]
                                ...                        
4995                                     [long, day, today]
4996                              [friend, broke, promises]
4997                                  [fine, thanks, tired]
4998                  [trying, kep, eyes, opendamn, baking]
4999                                         [hel, snowing]
Name: text, Length: 10000, dtype: object

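And finally we apply the lemmatizer. The function has to return the lemmatized list: the outputs shown below were produced while it still returned `data` unchanged

In [73]:
def lemmatizing_data(data):
    lemmatizer = nltk.WordNetLemmatizer()  # requires nltk.download('wordnet')
    return [lemmatizer.lemmatize(word) for word in data]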

In [74]:
data.text = data.text.apply(lambda x: lemmatizing_data(x))

In [75]:
data.text

800000                             [love, u, guys, r, best]
800001    [im, meting, one, besties, tonight, cant, wait...
800002    [thanks, twiter, ad, sunisa, got, met, hin, sh...
800003    [sick, realy, cheap, hurts, much, eat, real, f...
800004                                    [efect, everyone]
                                ...                        
4995                                     [long, day, today]
4996                              [friend, broke, promises]
4997                                  [fine, thanks, tired]
4998                  [trying, kep, eyes, opendamn, baking]
4999                                         [hel, snowing]
Name: text, Length: 10000, dtype: object

<a name="4"></a>
## 4. Split of train and test data

We want to shuffle the dataset so that it is randomized; train_test_split will shuffle it for us and split it into a training set and a testing set.

First we separate the dataset

In [76]:
x=data.text
y=data.sentiment

We will now prepare the input features for training

We convert each tweet into a sequence of integer word indices, restricting the vocabulary to the most frequent words (`num_words=2000`), and pad every sequence to a length of 500 (`max_len`)

In [77]:
max_len = 500
tok = Tokenizer(num_words=2000)
tok.fit_on_texts(x)
sequences = tok.texts_to_sequences(x)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)

In [78]:
sequences_matrix.shape

(10000, 500)
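
To make the shape above less abstract, here is a pure-Python sketch of what the two steps do conceptually: rank words by frequency, replace each word by its rank, then left-pad with zeros. It assumes the Keras defaults (padding and truncation both at the front, index 0 reserved for padding, indices at or above `num_words` dropped). The function names are hypothetical and this is not the Keras implementation.

```python
from collections import Counter


def fit_word_index_sketch(texts):
    """Map each word to an integer rank, 1 being the most frequent word."""
    counts = Counter(word for tokens in texts for word in tokens)
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences_sketch(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown or too-rare words."""
    sequences = []
    for tokens in texts:
        ids = [word_index[w] for w in tokens if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


def pad_sequences_sketch(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` indices."""
    padded = []
    for ids in sequences:
        ids = ids[-maxlen:]
        padded.append([0] * (maxlen - len(ids)) + ids)
    return padded


texts = [["love", "u", "guys"], ["love", "best"], ["long", "day", "today", "love"]]
word_index = fit_word_index_sketch(texts)
sequences = texts_to_sequences_sketch(texts, word_index)
print(pad_sequences_sketch(sequences, maxlen=3))
```

Since tweets are short, most of the 500 positions in each row of `sequences_matrix` are padding zeros; a smaller `max_len` would likely be enough, although that is not something tested in this notebook.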

Here we use train_test_split to split our data into 70% for training and 30% for testing, x being the text and y the sentiment

In [79]:
x_train, x_test, y_train, y_test = train_test_split(sequences_matrix, y, test_size=0.3, random_state=2)
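
Shuffling matters here because `data` was built by concatenating 5000 positive tweets followed by 5000 negative ones. The sketch below illustrates the idea behind a seeded shuffle-and-split using only the standard library; the function name is hypothetical, and its shuffling and rounding are illustrative, so it will not reproduce scikit-learn's exact split.

```python
import math
import random


def train_test_split_sketch(x, y, test_size=0.3, random_state=2):
    """Shuffle the indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(x)))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: ten samples whose label is the parity of the sample itself.
demo_x = list(range(10))
demo_y = [value % 2 for value in demo_x]
demo_split = train_test_split_sketch(demo_x, demo_y)
print([len(part) for part in demo_split])
```

Features and labels are indexed with the same shuffled indices, so each sample stays paired with its label.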

We now define our model, which is made of an input layer, an embedding layer, an LSTM, a dense layer with a ReLU activation, a dropout layer, and another dense layer followed by a final sigmoid activation.

In [80]:
def tensorflow_based_model():  # builds the Keras model used for training
    inputs = Input(name='inputs', shape=[max_len])  # step 1: padded sequences of word indices
    layer = Embedding(2000, 50, input_length=max_len)(inputs)  # step 2: 50-dimensional embedding over a 2000-word vocabulary
    layer = LSTM(64)(layer)  # step 3: LSTM with 64 units
    layer = Dense(256, name='FC1')(layer)  # step 4: fully connected layer
    layer = Activation('relu')(layer)  # step 5: ReLU activation
    layer = Dropout(0.5)(layer)  # step 6: dropout
    layer = Dense(1, name='out_layer')(layer)  # step 7: a single output, since we classify a tweet as positive or negative
    layer = Activation('sigmoid')(layer)  # step 8: sigmoid maps the output into (0, 1)
    model = Model(inputs=inputs, outputs=layer)  # tie inputs and outputs together into a Model
    return model
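
As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The arithmetic below is a sketch that assumes the standard Keras layout (an LSTM with four gates, each with input weights, recurrent weights and one bias vector); the helper names are hypothetical, and the figures are worth comparing against `model.summary()`.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


param_counts = {
    "embedding": embedding_params(2000, 50),
    "lstm": lstm_params(50, 64),
    "FC1": dense_params(64, 256),
    "out_layer": dense_params(256, 1),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Most of the parameters sit in the embedding layer, which is one reason the vocabulary is capped at 2000 words.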

<a name="5"></a>
## 5. Compile the model

We now instantiate and compile our model. Since we have 2 classes we use "binary_crossentropy" as the loss; with more than two classes we would use "categorical_crossentropy" instead

Hyperparameters of the neural network such as the learning rate can be changed through the optimizer in order to reduce the loss
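
For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `y` is the 0/1 label and `p` the sigmoid output. The function below is a plain-Python sketch of that formula; the clipping constant `eps` is an assumption added here to avoid `log(0)`, and this is not the Keras implementation.

```python
import math


def binary_crossentropy_sketch(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # keep log() away from 0
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)


good = binary_crossentropy_sketch([1, 0], [0.9, 0.1])
bad = binary_crossentropy_sketch([1, 0], [0.1, 0.9])
print(round(good, 4), round(bad, 4))
```

Confident correct predictions give a loss close to 0, while confident wrong ones are penalised heavily, which is what the optimizer tries to reduce during training.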

In [81]:
model = tensorflow_based_model() # here we are calling the function of created model
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

We will now train the model, holding out 10% of the training data for validation

In [82]:
history=model.fit(x_train,y_train,batch_size=80,epochs=6, validation_split=0.1)# here we are starting the training of model by feeding the training data
print('Training finished !!')

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Training finished !!


Testing the trained model on the test data

In [83]:
accuracy = model.evaluate(x_test,y_test) # returns [loss, accuracy] on the test data



Accuracy

In [85]:
print('Test set\n  Accuracy: {:0.2f}'.format(accuracy[1])) #the accuracy of the model on test data is given below

Test set
  Accuracy: 0.71
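
Because the model was compiled with `metrics=['accuracy']`, `model.evaluate` returns the loss followed by the accuracy, which is why we print `accuracy[1]`. As a sketch of what that number means, the function below thresholds sigmoid outputs at 0.5 (an assumed default) and counts matches with the labels; the name `binary_accuracy_sketch` and the toy values are hypothetical.

```python
def binary_accuracy_sketch(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the 0/1 label."""
    predictions = [1 if prob > threshold else 0 for prob in y_prob]
    correct = sum(int(pred == label) for pred, label in zip(predictions, y_true))
    return correct / len(y_true)


demo_accuracy = binary_accuracy_sketch([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
print(demo_accuracy)
```

An accuracy of 0.71 therefore means that about 71% of the 3000 test tweets were assigned the correct sentiment.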
