
##  LSTM Example


#### Load Keras Packages


In [9]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re



####  Read in Data

In [25]:
import os
myDir = "/scratch/" + os.getenv('USER') + "/Sentiment_Analysis"
os.chdir(myDir)

/scratch/jmh5ad/Sentiment_Analysis


In [11]:

data = pd.read_csv('Sentiment.csv')

# Keeping only the necessary columns
data = data[['text','sentiment']]
print(data.head())


                                                text sentiment
0  RT @NancyLeeGrahn: How did everyone feel about...   Neutral
1  RT @ScottWalker: Didn't catch the full #GOPdeb...  Positive
2  RT @TJMShow: No mention of Tamir Rice and the ...   Neutral
3  RT @RobGeorge: That Carly Fiorina is trending ...  Positive
4  RT @DanScavino: #GOPDebate w/ @realDonaldTrump...  Positive


####  Pre-Process the Data:  Simplify

We'll need to do a few tasks to simplify the texts.
1. Let's remove the items classified as "Neutral" (i.e., keep anything that is not "Neutral");
2. Let's remove any characters that are not alphanumeric (i.e., replace anything not alphanumeric or a space with nothing); 
3. Let's remove the "RT" at the beginning of the messages, and
4. Let's convert everything to lower case.

In [12]:

data = data[data.sentiment != "Neutral"]
data['text'] = data['text'].apply((lambda x: re.sub('[^a-zA-z0-9\s]','',x)))
data['text'] = data['text'].apply((lambda x: re.sub('^RT','',x)))
data['text'] = data['text'].apply(lambda x: x.lower())

print(data.head())

## Just to know what we are left with, let's check the # of positive or negative sentiments
print("\nNum Positive Tweets:", data[ data['sentiment'] == 'Positive'].size)
print("Num Negative Tweets:", data[ data['sentiment'] == 'Negative'].size)

            

                                                text sentiment
1   scottwalker didnt catch the full gopdebate la...  Positive
3   robgeorge that carly fiorina is trending  hou...  Positive
4   danscavino gopdebate w realdonaldtrump delive...  Positive
5   gregabbott_tx tedcruz on my first day i will ...  Positive
6   warriorwoman91 i liked her and was happy when...  Negative

Num Positive Tweets: 4472
Num Negative Tweets: 16986


#### Pre-Process the Data:  Create Sequences

To process the data with tensorflow, we will need to convert it to a numerical representation.  This is done by creating a list of words and representing each tweet with the index of the words in that list.  

Each tweet will be a _*sequence*_ of indices.  But, for mathematical processing, we will want the vectors to be the same size.  So, we will "pad" the vectors with zeros to ensure the proper length.

Similary, we will want to convert the classifications (e.g., positive or negative) to a sequence.  This sequence is called _one hot vector_.  It is a vector of zeros and a single value of 1.  The 1 will be placed in the position corresponding to one of the classifications. 

In [13]:

max_words = 2000
tokenizer = Tokenizer(num_words=max_words, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)

Y = pd.get_dummies(data['sentiment']).values

# Let's take a peek at what we have for X and Y
print(data['text'].values[2])
print(X[2])
X = pad_sequences(X)
print(X[2])


print(data['sentiment'].values[2])
print(Y[2])
             

 danscavino gopdebate w realdonaldtrump delivered the highest ratings in the history of presidential debates trump2016 httptco
[1248, 2, 300, 23, 1928, 1, 1615, 213, 12, 1, 695, 6, 183, 204, 367, 680]
[   0    0    0    0    0    0    0    0    0    0    0    0 1248    2
  300   23 1928    1 1615  213   12    1  695    6  183  204  367  680]
Positive
[0 1]


####  Split data into training and testing sets

The next step is to split the data into a _*training*_ set and a _*testing*_ set.  The training set will be used to solve for the mathematical model that will be used to determine if a text is positive or negative.  The testing set is used to see how accuately the model works on data that were not used for creating the model.

Depending on the size of your data, you may be able to use large subset for testing.  In this example, we set `test_size` to 0.33.  So, about 33% or 1/3 of the data will be set aside to be used for testing purposes.

In [14]:


X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)



(7188, 28) (7188, 2)
(3541, 28) (3541, 2)


####  Define the Model

Defining the model is an art.

Unless we get into very sophisticated model, we will want to start with a _Sequential_ model.  This gives us the framework for adding layer to our model.

##### Embedding Layer
The first layer in an LSTM module should be an _Embedding_ layer.  The _Embedding_ will convert the input words to vectors of values.  Each word will be represented by a vector of size given by _embed_dim_ (in this case, 128). The _Embedding_ function also needs to know the maximum number of words that we are considering as part of our corpus (max_words) and number of columns of the input (X.shape[1]).

##### SpatialDropout1D Layer
The Spatial Dropout layer will randomly choose rows from the model to remove (i.e., dropout).  When placed in an iterative loop, this step will ensure that one or two features are not dominating the entire model.  

The value passed into the _SpatialDropout1D_ function represents the percentage of rows to drop for each iteration.

##### LSTM Layer
The _LSTM_ layer performs the model fitting based on the training data.In this example the function takes as input the size of the output (i.e., lstm_out), the dropout rate of the linear transformation (0.2), and the dropout rate of the recurrent state (0.2).

##### Dense Layer
The _Dense_ layer takes the outputs from the previous layer and applies a basic neural network operation to transform the values.  In this example, the activation function will determine which nodes of the neural network will be turned on or off.  Because this is the final step of our model, we want to ensure that the output of the neural network will match the total number of classifications (i.e., 2 -- positive or negative).

In [15]:

embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_words, embed_dim, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))



Defining the model will simply create the locations in memory for the computations that will fit the model to the data.  It does not do any of the actual computations.  That step will come after we configure the learning process. 

####  Configure the Learning Process

One more step that is needed in to specify how the algorithm will "learn".  In other words, what functions will be used to measure loss, perform updates to the values.  You can learn more about different functons that can be selected at https://keras.io/api/losses/

In [16]:

model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 128)           256000    
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 28, 128)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 196)               254800    
_________________________________________________________________
dense (Dense)                (None, 2)                 394       
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None


####  Fit the Model to the Data

We are ready to run the algorithm to fit the model to the training data.  To do this, we will give the _fit_ function the X and Y values of the training data.  We also need to tell the function how many iterations or _epochs_ we would like for the model to repeat.  Normally, the fit will improve with each epoch. Finally, we want to specify the _batch_size_.  This parameter takes a little more explanation.

If we have hundreds of thousands of rows in our data, the computations to fit the model to the data would be enormous, requiring lots of computer memory.  Insead of trying to use all of the data for the fitting, the algorithm will split the data into batches and use each batch to compute results.  The _batch_size_ will be the number of rows that we want to run through the model at one time.  The parameters for the model are tweaked slightly when the results of batch run are assessed.  All of the batches will be run through the model within a single epoch.

In other words, the algorithm goes through many, many iterations, making adjustments to the parameters with each iteration.  Because we want to watch where it is in the fitting process, we set the _verbose_ parameter to 1.  This parameter will print some intermediate results to the screen. 

In [22]:
num_epochs = 7 
batch_size = 32

model.fit(X_train, Y_train, epochs = num_epochs, batch_size=batch_size, verbose=1)



Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


<tensorflow.python.keras.callbacks.History at 0x7f5650399e10>

####  Apply Model to Test Data

In [26]:
 

score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))




111/111 - 3s - loss: 0.7500 - accuracy: 0.8252
score: 0.75
acc: 0.83


#### Apply the model to a "new" object

In [27]:
twt = ['Meetings: Because none of us is as dumb as all of us.']

# Pre-process tweet
twt = re.sub('[^a-zA-z0-9\s]','',twt[0].lower())
twt = tokenizer.texts_to_sequences(twt)
twt = pad_sequences(twt, maxlen=28, value=0)


In [28]:
# Run through the model
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]

#State result
if(np.argmax(sentiment) == 0):
    print("**The tweet is negative.**")
elif (np.argmax(sentiment) == 1):
    print("**The tweet is positive,**")


51/51 - 1s
**The tweet is negative.**
