# Instructor Do: RNNs for NLP - Sentiment Analysis

In this activity, students will learn how to define an LSTM RNN model for sentiment analysis using Keras. The activity also introduces the data preparation that LSTM models require for natural language processing.

In [1]:
# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from pathlib import Path

%matplotlib inline

## Data Preprocessing

An RNN takes arrays as input. The `full_review_text` column will become the `X` array and the `sentiment` column will become the `y` array.

In [3]:
# Creating the X and y vectors
X = reviews_df["full_review_text"].values
y = reviews_df["sentiment"].values

To train the RNN model, the text data must be encoded as sequences of integers. This transformation can be done using the following tools from Keras.

In [4]:
# Import Keras modules for data encoding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [5]:
# Create an instance of the Tokenizer and fit it with the X text data
tokenizer = Tokenizer(lower=True)
tokenizer.fit_on_texts(X)

In [6]:
# Print the first five elements of the encoded vocabulary
for token in list(tokenizer.word_index)[:5]:
    print(f"word: '{token}', token: {tokenizer.word_index[token]}")

word: 'the', token: 1
word: 'and', token: 2
word: 'a', token: 3
word: 'i', token: 4
word: 'to', token: 5
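
The cell that turns each review into a sequence of integers is not shown here, although `X_seq` is used in the padding step. With the Keras `Tokenizer`, that conversion is presumably `X_seq = tokenizer.texts_to_sequences(X)`. The sketch below imitates the idea in plain Python on a made-up toy corpus; the helpers `build_word_index` and `encode_texts` are illustrative names, not the Keras implementation. Words are ranked by frequency, the most frequent word gets index `1`, and each text becomes a list of indices.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_texts(texts, word_index):
    """Replace each known word with its integer index; unknown words are skipped."""
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]

# Toy corpus (hypothetical, for illustration only)
toy_reviews = ["the food was great", "the service was slow", "great food"]
toy_word_index = build_word_index(toy_reviews)
toy_sequences = encode_texts(toy_reviews, toy_word_index)
print(toy_sequences)  # [[1, 2, 3, 4], [1, 5, 3, 6], [4, 2]]
```

Index `0` is never assigned to a word, which leaves it free to act as the padding value.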


The RNN model requires every entry in `X` to have the same length; the `pad_sequences` function ensures that all integer-encoded reviews have the same size. Each review is truncated to `140` integers if it is longer, or padded with zeros at the end (`padding="post"`) if it is shorter.

In [8]:
# Padding sequences
X_pad = pad_sequences(X_seq, maxlen=140, padding="post")
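
To make the padding rule concrete, here is a plain-Python imitation of what `pad_sequences` does with `padding="post"` and its default `truncating="pre"`. The function name `pad_post_truncate_pre` and the toy sequences are assumptions for illustration, not part of Keras. Note that with the default truncation, a sequence that is too long keeps its *last* `maxlen` tokens.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Imitate pad_sequences(..., padding="post") with default truncating="pre"."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: zeros at the end
    return padded

# Toy sequences (hypothetical): one shorter and one longer than maxlen
demo = pad_post_truncate_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)  # [[7, 8, 0, 0], [3, 4, 5, 6]]
```

If the beginning of a long review matters more than its end, passing `truncating="post"` to `pad_sequences` keeps the first tokens instead.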

Now that the data is encoded, the training and testing sets will be created.

In [9]:
# Creating training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_pad, y, random_state=78)

## Build and Train the LSTM RNN Model

In this section, a custom LSTM RNN model is defined in Keras and then fitted (trained) on the training data created above.

These are the steps that will be followed:

* Define the model architecture in Keras.

* Compile the model.

* Fit the model to the training data.

In [10]:
# Import Keras modules for model creation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

### Setting Up the Model

The `Embedding` layer takes the size of the vocabulary as its first parameter. The `vocabulary_size` is set to the total number of words in the `tokenizer` dictionary plus `1`, because index `0` is reserved for padding. The layer also needs the `input_length`; it is set to `140` (the `max_words` variable), the length used to pad the reviews.

The `embedding_size` parameter specifies how many dimensions are used to represent each word. As a rule of thumb, a multiple of eight can be used; for this demo, a value of `64` delivered the best result during tuning.

In [11]:
# Model set-up
vocabulary_size = len(tokenizer.word_counts.keys()) + 1
max_words = 140
embedding_size = 64
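
Conceptually, an embedding layer is a lookup table with one row per token index. The NumPy sketch below uses small made-up sizes (the `toy_` names and the random matrix are assumptions for illustration; a real layer learns its weights during training) to show how a padded review of integers becomes a matrix of shape `(sequence length, embedding size)`.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocabulary_size = 10  # hypothetical: 9 words plus index 0 for padding
toy_embedding_size = 4    # hypothetical: 4 dimensions per word

# One row of weights per token index (random here, learned in a real model)
embedding_matrix = rng.normal(size=(toy_vocabulary_size, toy_embedding_size))

# A padded review of length 6: three word indices followed by zeros
padded_review = np.array([3, 1, 7, 0, 0, 0])

# The lookup is plain row indexing: one embedding vector per position
embedded_review = embedding_matrix[padded_review]
print(embedded_review.shape)  # (6, 4)
```

With the values used in this activity, the same lookup turns each review of `140` integers into a `140 x 64` matrix, which is what the LSTM layer receives.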

### Defining the Model's Structure

In [12]:
# Define the LSTM RNN model
model = Sequential()

# Layer 1
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))

# Layer 2
model.add(LSTM(units=280))

# Output layer
model.add(Dense(1, activation="sigmoid"))

### Compiling the Model

In [13]:
# Compile the model
model.compile(
    loss="binary_crossentropy",
    optimizer="adam"
)

In [14]:
# Summarize the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 140, 64)           1091264   
_________________________________________________________________
lstm (LSTM)                  (None, 280)               386400    
_________________________________________________________________
dense (Dense)                (None, 1)                 281       
=================================================================
Total params: 1,477,945
Trainable params: 1,477,945
Non-trainable params: 0
_________________________________________________________________
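
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for these layers: an embedding has one weight per vocabulary entry and dimension, an LSTM has four gates that each see the input, the previous hidden state, and a bias, and a dense layer has one weight per input plus a bias. The vocabulary size of `17051` is inferred from the summary (`1,091,264 / 64`), not stated elsewhere in the notebook.

```python
vocabulary_size = 17051  # inferred from the summary: 1,091,264 / 64
embedding_size = 64
lstm_units = 280

# Embedding: one vector of embedding_size weights per vocabulary index
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with weights for the input, the hidden state, and a bias
lstm_params = 4 * (embedding_size + lstm_units + 1) * lstm_units

# Dense output: one weight per LSTM unit plus one bias
dense_params = lstm_units + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 1091264 386400 281 1477945
```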


### Training the Model

### Making Predictions