# Deep Learning - Word Embedding

## BRZOZOWSKI MAREK

## Twitter sentiment dataset will be used to explore:

## Classifying the sentiment value of the tweet using Word Embedding with LSTM class of Deep Neural Networks 

Neural networks are a family of algorithms for building programs that learn from data. They loosely resemble the way the human brain operates. In the simplest terms, neurons are links that activate in response to certain inputs, whether chemical signals in the brain or data inputs in a computer. Just as the brain forms new connections between neurons, neural networks adapt to changing inputs.

Using an LSTM, we can perform text classification with word embedding techniques. Word embedding algorithms represent individual words as real-valued vectors in a predefined vector space. Combining an LSTM with a word embedding enables sequence classification, a predictive modelling task whose goal is to predict a category for a sequence or document.
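
To make the idea of real-valued word vectors concrete, here is a minimal sketch in plain Python. The tiny vocabulary, the 3-dimensional vectors, and the helper name `embed` are all made up for illustration; a trained Embedding layer learns its values from data instead.

```python
# Hypothetical toy embedding table: one 3-dimensional vector per word ID.
# Index 0 is reserved for padding, mirroring the padded sequences used later.
embedding_table = [
    [0.0, 0.0, 0.0],    # 0: padding
    [0.2, -0.1, 0.7],   # 1: "great"
    [-0.5, 0.3, 0.1],   # 2: "debate"
    [0.9, 0.4, -0.2],   # 3: "tonight"
]

def embed(sequence):
    """Look up the vector for each integer word ID in the sequence."""
    return [embedding_table[word_id] for word_id in sequence]

vectors = embed([0, 1, 2])
print(len(vectors), len(vectors[0]))  # 3 3
```

A sequence of 3 word IDs becomes a 3 x 3 matrix, which is the kind of input an LSTM reads one row at a time.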

In [1]:
# Load Packages
import numpy as np 
import pandas as pd 
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D


Using TensorFlow backend.


In [2]:
# Loading data
raw_data =pd.read_csv('Sentiment.csv')
raw_data.head()

Unnamed: 0,id,candidate,candidate_confidence,relevant_yn,relevant_yn_confidence,sentiment,sentiment_confidence,subject_matter,subject_matter_confidence,candidate_gold,...,relevant_yn_gold,retweet_count,sentiment_gold,subject_matter_gold,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,1,No candidate mentioned,1.0,yes,1.0,Neutral,0.6578,None of the above,1.0,,...,,5,,,RT @NancyLeeGrahn: How did everyone feel about...,,2015-08-07 09:54:46 -0700,629697200650592256,,Quito
1,2,Scott Walker,1.0,yes,1.0,Positive,0.6333,None of the above,1.0,,...,,26,,,RT @ScottWalker: Didn't catch the full #GOPdeb...,,2015-08-07 09:54:46 -0700,629697199560069120,,
2,3,No candidate mentioned,1.0,yes,1.0,Neutral,0.6629,None of the above,0.6629,,...,,27,,,RT @TJMShow: No mention of Tamir Rice and the ...,,2015-08-07 09:54:46 -0700,629697199312482304,,
3,4,No candidate mentioned,1.0,yes,1.0,Positive,1.0,None of the above,0.7039,,...,,138,,,RT @RobGeorge: That Carly Fiorina is trending ...,,2015-08-07 09:54:45 -0700,629697197118861312,Texas,Central Time (US & Canada)
4,5,Donald Trump,1.0,yes,1.0,Positive,0.7045,None of the above,1.0,,...,,156,,,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,,2015-08-07 09:54:45 -0700,629697196967903232,,Arizona


In [3]:
# Assignment only requires the following columns
data = raw_data[['text','sentiment']]
data.tail()

Unnamed: 0,text,sentiment
13866,RT @cappy_yarbrough: Love to see men who will ...,Negative
13867,RT @georgehenryw: Who thought Huckabee exceede...,Positive
13868,"RT @Lrihendry: #TedCruz As President, I will a...",Positive
13869,RT @JRehling: #GOPDebate Donald Trump says tha...,Negative
13870,RT @Lrihendry: #TedCruz headed into the Presid...,Positive


In [4]:
# Describing the data
data.describe()

Unnamed: 0,text,sentiment
count,13871,13871
unique,10402,3
top,RT @RWSurferGirl: Jeb Bush reminds me of eleva...,Negative
freq,161,8493


In [5]:
# Looks like there are three unique types for sentiment
pd.unique(data.sentiment)

array(['Neutral', 'Positive', 'Negative'], dtype=object)

In [6]:
# We remove the Neutral tweets; text preprocessing follows in the next cells
data_v1 = data[data.sentiment != 'Neutral'].copy()
data_v1.head()

Unnamed: 0,text,sentiment
1,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
3,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
4,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive
5,"RT @GregAbbott_TX: @TedCruz: ""On my first day ...",Positive
6,RT @warriorwoman91: I liked her and was happy ...,Negative


In [7]:
# Convert text to all lowercase
data_v2 = data_v1.copy()
data_v2.text = data_v2.text.apply(lambda x: x.lower())
data_v2.head()

Unnamed: 0,text,sentiment
1,rt @scottwalker: didn't catch the full #gopdeb...,Positive
3,rt @robgeorge: that carly fiorina is trending ...,Positive
4,rt @danscavino: #gopdebate w/ @realdonaldtrump...,Positive
5,"rt @gregabbott_tx: @tedcruz: ""on my first day ...",Positive
6,rt @warriorwoman91: i liked her and was happy ...,Negative


In [8]:
# Keeping only letters, numbers and whitespace
data_v3 = data_v2.copy()
data_v3.text = data_v3.text.apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
data_v3.tail()

Unnamed: 0,text,sentiment
13866,rt cappyyarbrough love to see men who will nev...,Negative
13867,rt georgehenryw who thought huckabee exceeded ...,Positive
13868,rt lrihendry tedcruz as president i will alway...,Positive
13869,rt jrehling gopdebate donald trump says that h...,Negative
13870,rt lrihendry tedcruz headed into the president...,Positive


In [9]:
# Retweets start with "RT @NAME". The previous step removed the "@" character,
# but many texts still contain the "rt" string, so we remove it to reduce noise.
# Note: this replaces every "rt" substring, including those inside longer words.
clean = data_v3.copy()
clean['text'] = clean['text'].str.replace('rt', ' ', regex=False)

clean.head()

Unnamed: 0,text,sentiment
1,scottwalker didnt catch the full gopdebate l...,Positive
3,robgeorge that carly fiorina is trending ho...,Positive
4,danscavino gopdebate w realdonaldtrump deliv...,Positive
5,gregabbotttx tedcruz on my first day i will ...,Positive
6,warriorwoman91 i liked her and was happy whe...,Negative
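
The cell above replaces every occurrence of the substring `rt`, so it also changes words that merely contain those letters. As a hedged alternative, the sketch below removes only the standalone token using a word-boundary regular expression. The helper name `strip_retweet_marker` and the sample sentence are hypothetical, and switching to this approach would change the token IDs and scores reported later in this notebook.

```python
import re

def strip_retweet_marker(text):
    """Remove the standalone token 'rt' without touching words that contain it."""
    without_marker = re.sub(r'\brt\b', ' ', text)
    # Collapse the extra whitespace left behind by the removal.
    return ' '.join(without_marker.split())

sample = 'rt scottwalker great start to the party'
print(strip_retweet_marker(sample))
# scottwalker great start to the party

# For comparison, plain substring replacement also alters "party".
print('party'.replace('rt', ' '))
# pa y
```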


In [10]:
# Counting the remaining positive and negative tweets.
# len() counts rows; .size would count rows * columns and double the figures.
print('Number of Positive Sentiment Text: ', len(clean[clean.sentiment == 'Positive']))
print('Number of Negative Sentiment Text: ', len(clean[clean.sentiment == 'Negative']))

Number of Positive Sentiment Text:  2236
Number of Negative Sentiment Text:  8493


Effective sentiment analysis, like most NLP problems, depends on having enough features, here the number of distinct words kept in the vocabulary. It is not easy to know in advance how many features are needed. One option is to try values from 10,000 to 30,000 and compare the accuracy score obtained with each.
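
One cheap way to reason about the vocabulary size before training anything is to check how much of the text the most frequent words cover. The sketch below is only an illustration on made-up sentences; the function name `coverage` and the toy data are assumptions, not part of the assignment.

```python
from collections import Counter

def coverage(texts, max_features):
    """Fraction of all word occurrences covered by the max_features most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    kept = sum(count for _, count in counts.most_common(max_features))
    return kept / total

toy_texts = ['the debate was great', 'the debate was long', 'the crowd cheered']
for k in (1, 3, 6):
    print(k, round(coverage(toy_texts, k), 2))
# 1 0.27
# 3 0.64
# 6 0.91
```

If coverage is already close to 1 at a given size, a larger vocabulary mostly adds rare words.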

In [11]:
# For the sake of the assignment we will stick with a max_feature of 2000
max_features = 2000

In [12]:
# Tokenizer: represents words as a series of integers.
# We fit the tokenizer on all of the tweets and then convert each tweet into a sequence of word IDs,
# keeping only the most frequent words, as limited by max_features.
tokenizer = Tokenizer(num_words = max_features, split = ' ') 
tokenizer.fit_on_texts(clean.text.values)

clean_X  = tokenizer.texts_to_sequences(clean.text.values)
clean_X[:1]

[[363,
  122,
  1,
  722,
  2,
  39,
  58,
  237,
  36,
  210,
  6,
  174,
  1757,
  12,
  1317,
  1403,
  742]]
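
To show what the tokenizer step does conceptually, here is a minimal plain-Python sketch. It assumes the usual Keras convention that words are ranked by frequency starting at ID 1 and that IDs greater than or equal to `num_words` are dropped; the function names `fit_word_index` and `to_sequences` are hypothetical and this is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets ID 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its ID, dropping unknown words and IDs >= num_words."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ['the debate was great', 'the debate was long']
word_index = fit_word_index(toy_texts)
print(to_sequences(['the debate was great tonight'], word_index, num_words=4))
# [[1, 2, 3]]
```

In this toy run "great" has ID 4 and is cut by the `num_words` limit, while "tonight" was never seen during fitting, so both are dropped.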

In [13]:
# Next we pad the tokenized sequences with leading zeros so that they all have the same length
clean_padded = pad_sequences(clean_X)
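
The padding step can be pictured with the plain-Python sketch below. It mimics what I understand to be the `pad_sequences` defaults (padding and truncating at the front); the helper name `pad_left` is hypothetical.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items when a sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_left([[5, 8], [3, 1, 4, 1]]))
# [[0, 0, 5, 8], [3, 1, 4, 1]]
```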

In [14]:
# Now we pull the values for the sentiment column into labels
labels = pd.get_dummies(clean.sentiment).values
labels[0:5]

array([[0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [1, 0]], dtype=uint8)
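
The column order of the label array matters later when reading predictions. As a small sketch of the encoding, assuming classes are arranged in sorted order (which matches the output above, where Positive rows are `[0, 1]`), with a hypothetical helper `one_hot`:

```python
def one_hot(labels):
    """One-hot encode labels with one column per class, classes in sorted order."""
    classes = sorted(set(labels))
    encoded = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return classes, encoded

classes, encoded = one_hot(['Positive', 'Positive', 'Negative'])
print(classes)   # ['Negative', 'Positive']
print(encoded)   # [[0, 1], [0, 1], [1, 0]]
```

So column 0 stands for Negative and column 1 for Positive.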

In [15]:
# Let's create the train-test split. Per the assignment, about 1/3 of the data (test_size = 0.33) is held out, with random_state = 42
train_features, test_features, train_labels, test_labels = train_test_split(
    clean_padded,labels, test_size = 0.33, random_state = 42
    )

print('Shape of Train_Features: ', train_features.shape)
print('Shape of Train_Labels: ', train_labels.shape)
print('Shape of Test_Features: ', test_features.shape)
print('Shape of Test_Labels: ', test_labels.shape)

Shape of Train_Features:  (7188, 28)
Shape of Train_Labels:  (7188, 2)
Shape of Test_Features:  (3541, 28)
Shape of Test_Labels:  (3541, 2)


In this assignment:

- Reshape the train data set.
- Generate the model with an Embedding layer, an LSTM layer, and 1 Dense layer as the last layer.

You can add it in the empty line of the assignment 7 code.

Generate the model including the Embedding layer and LSTM


Building an RNN model for sentiment analysis.

Recall that our input is a sequence of words encoded as integer word IDs and padded to a fixed length (28 here), and our output is a binary sentiment label, one-hot encoded as two columns of 0s and 1s.

We will use a Keras Embedding layer, which is designed for neural networks on text data. It requires that the input data be integer encoded, as we have done.

The Embedding layer is initialized with random weights and learns an embedding for the words in the training vocabulary.

This Embedding layer is a flexible layer that can be used in many ways.

- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.


Three arguments must be specified: input_dim, output_dim, input_length

input_dim: the size of the vocabulary in the text data. E.g., if the data is integer encoded to values between 0 and 10, then input_dim is 11 words. For our problem input_dim is max_features = 2000.

output_dim: the size of the vector space in which words will be embedded. This defines the size of the output vector from this layer for each word. For our problem output_dim is 128.

input_length: the length of the input sequences, as for any input layer of a Keras RNN model. So if your input documents consist of X words each, input_length would be X. For our problem input_length is 28.

Example:

e = Embedding(input_dim, output_dim, input_length=input_length)

The weights of this layer are learned during training and are saved as part of the layer.

If a Dense layer is connected directly to the Embedding layer, the 2D output matrix must first be flattened into a 1D vector using a Flatten layer.

In [16]:
# RNN MODEL 

# Embed dimension is the output_dim of the Embedding layer
embed_dimension = 128

# Number of LSTM units
lstm_out = 200

model = Sequential()

# First layer: Embedding, with output shape (batch_size, 28, 128)
model.add(Embedding(max_features, embed_dimension, input_length = train_features.shape[1]))
# Adding Spatial Dropout, which drops entire 1D feature maps
# instead of individual elements.
model.add(SpatialDropout1D(0.1))
# LSTM RNN model layer.
model.add(LSTM(lstm_out))

# Last layer is a 2-node layer, one node per class: column 0 is Negative, column 1 is Positive.
model.add(Dense(2,activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 28, 128)           256000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 28, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               263200    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 402       
Total params: 519,602
Trainable params: 519,602
Non-trainable params: 0
_________________________________________________________________
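
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias), and a dense layer; the variable names are illustrative only.

```python
vocab_size = 2000      # max_features
embed_dim = 128        # embed_dimension
lstm_units = 200       # lstm_out
num_classes = 2

# One embed_dim-sized vector per word ID.
embedding_params = vocab_size * embed_dim
# Four gates, each with (input + recurrent) weights plus a bias per unit.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# One weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

print(embedding_params, lstm_params, dense_params)    # 256000 263200 402
print(embedding_params + lstm_params + dense_params)  # 519602
```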


In [17]:
# Compile the model; binary_crossentropy is used for the 2 outcomes.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
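
To see what the loss measures, the following plain-Python sketch computes binary cross-entropy for a single one-hot label, averaged over the two output units. The clipping constant `eps` and the function name are assumptions for illustration, not the Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the output units of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# A Positive tweet ([0, 1]) with a confident correct prediction, then a confident wrong one.
print(round(binary_crossentropy([0, 1], [0.1, 0.9]), 4))  # 0.1054
print(round(binary_crossentropy([0, 1], [0.9, 0.1]), 4))  # 2.3026
```

With sigmoid outputs the two units are scored independently. For mutually exclusive classes, a softmax output with categorical cross-entropy is a common alternative.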

In [18]:
# Fitting model
batch_size = 32
model.fit(train_features, train_labels, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
 - 8s - loss: 0.4345 - acc: 0.8207
Epoch 2/7
 - 7s - loss: 0.3091 - acc: 0.8752
Epoch 3/7
 - 7s - loss: 0.2601 - acc: 0.8961
Epoch 4/7
 - 7s - loss: 0.2207 - acc: 0.9117
Epoch 5/7
 - 7s - loss: 0.1945 - acc: 0.9225
Epoch 6/7
 - 7s - loss: 0.1665 - acc: 0.9336
Epoch 7/7
 - 7s - loss: 0.1435 - acc: 0.9413


<keras.callbacks.History at 0x1d63b9c9240>

In [19]:
# For the assignment, determine the score (loss) and accuracy of the model on the test set
validation_size = 1500

# Splitting the held-out data: the last 1500 rows for validation, the rest for testing.
X_validate = test_features[-validation_size:]
Y_validate = test_labels[-validation_size:]
X_test = test_features[:-validation_size]
Y_test = test_labels[:-validation_size]

# Score and Accuracy
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("Score: %.2f" % (score))
print("Accuracy: %.2f" % (acc))

Score: 0.52
Accuracy: 0.83


In [20]:
# Counting the correct positive and negative predictions on the validation set
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0

for x in range(len(X_validate)):
    
    result = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0]
   
    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1
       
    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1



print("Positive Accuracy", pos_correct/pos_cnt*100, "%")
print("Negative Accuracy", neg_correct/neg_cnt*100, "%")

Positive Accuracy 60.19417475728155 %
Negative Accuracy 90.09235936188077 %
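
The same per-class figures can be computed from two lists of class indices, as in the sketch below. The helper name `per_class_accuracy` and the toy data are hypothetical.

```python
def per_class_accuracy(true_classes, predicted_classes):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals, correct = {}, {}
    for truth, pred in zip(true_classes, predicted_classes):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == pred else 0)
    return {cls: correct[cls] / totals[cls] for cls in totals}

truth = [0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 1, 1, 0]
print(per_class_accuracy(truth, preds))  # {0: 0.75, 1: 0.5}
```

With the model in this notebook, the class indices could be obtained in one call each via `np.argmax(model.predict(X_validate), axis=1)` and `np.argmax(Y_validate, axis=1)` rather than predicting one row at a time. The gap between the two accuracies above is consistent with the class imbalance in the data, which contains far more Negative than Positive tweets.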


In [21]:
# Determine the sentiment of a new tweet
twt = ['Meetings: Because none of us is as dumb as all of us.']

# Vectorizing the tweet by the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)

# Padding the tweet to the same length (28) that the Embedding layer expects
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)

# Predicting the sentiment with our model
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("positive")

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0  206  633    6  156    5   55 1050   55   46    6  156]]
negative
