# Sentiment Analysis with an RNN

In this notebook, you'll implement a recurrent neural network that performs sentiment analysis. Using an RNN rather than a feedforward network tends to be more accurate, since we can include information about the *sequence* of words. Here we'll use a dataset of movie reviews, accompanied by labels.

The architecture for this network is shown below.

<img src="assets/network_diagram.png" width=400px>

Here, we'll pass in words to an embedding layer. We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before in the word2vec lesson. You could train an embedding with word2vec and use it here, but it's good enough to just have an embedding layer and let the network learn the embedding table on its own.
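
To make the efficiency argument concrete, here is a minimal sketch in plain Python. The 5-word vocabulary and the 3-dimensional table values are made up for illustration. It shows that multiplying a one-hot vector by the embedding table gives the same result as simply looking up a row, which is why an embedding layer can skip the multiplication entirely.

```python
# Toy embedding table: 5 "words", 3 dimensions each (values are arbitrary).
embedding_table = [
    [0.0, 0.0, 0.0],  # index 0
    [0.1, 0.2, 0.3],  # index 1
    [0.4, 0.5, 0.6],  # index 2
    [0.7, 0.8, 0.9],  # index 3
    [1.0, 1.1, 1.2],  # index 4
]


def one_hot(index, size):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(index, table):
    """Multiply a one-hot row vector by the table (the expensive way)."""
    vec = one_hot(index, len(table))
    n_dims = len(table[0])
    return [
        sum(vec[row] * table[row][dim] for row in range(len(table)))
        for dim in range(n_dims)
    ]


def embedding_lookup(index, table):
    """Select the row directly (the shortcut an embedding layer takes)."""
    return table[index]


word_index = 3
via_matmul = one_hot_times_table(word_index, embedding_table)
via_lookup = embedding_lookup(word_index, embedding_table)
print(via_matmul == via_lookup)
```

With tens of thousands of words, the one-hot route touches every row of the table for every word, while the lookup touches exactly one.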

From the embedding layer, the new representations will be passed to LSTM cells. These add recurrent connections to the network so we can include information about the sequence of words in the data. Finally, the output of the LSTM cells goes to a sigmoid output layer. We're using a sigmoid because we're trying to predict whether the text has positive or negative sentiment, so the output layer is just a single unit with a sigmoid activation function.
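
As a quick illustration of why a single sigmoid unit suits a two-class problem, the sketch below squashes a few raw scores into probabilities and maps them to labels. The 0.5 decision threshold and the helper name `to_label` are assumptions for this example, not something the notebook fixes.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def to_label(probability, threshold=0.5):
    """Map a probability to a sentiment label (threshold is an assumption)."""
    return "positive" if probability >= threshold else "negative"


for score in (-2.0, 0.0, 2.0):
    p = sigmoid(score)
    print(f"score={score:+.1f} -> p={p:.3f} -> {to_label(p)}")
```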

We don't care about the sigmoid outputs except for the very last one, so we can ignore the rest. We'll calculate the cost from the output of the last step and the training label.
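
A small sketch of that cost calculation, using binary cross-entropy (the loss this notebook passes to `model.compile`). The per-step outputs below are hypothetical numbers chosen for illustration. In Keras, the `LSTM` layer returns only the final step by default (`return_sequences=False`), so the earlier steps never reach the output layer at all.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


# Hypothetical sigmoid outputs at each time step for one review.
step_outputs = [0.52, 0.48, 0.61, 0.83]
label = 1  # 1 = positive, 0 = negative

# Only the last step's output enters the loss.
last_output = step_outputs[-1]
loss = binary_cross_entropy(label, last_output)
print(round(loss, 4))
```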

In [6]:
import numpy as np
import tensorflow as tf
from collections import Counter
from sklearn.model_selection import train_test_split

In [9]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [96]:
with open('./sentiment-network/reviews.txt', 'r') as f:
    reviews = f.readlines()
with open('./sentiment-network/labels.txt', 'r') as f:
    labels = f.readlines()

reviews = [review.strip() for review in reviews]
labels = [label.strip() for label in labels]

In [48]:
print(reviews[0])
print(labels[0])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t
positive


## Data preprocessing

In [104]:
review_lens = Counter([len(review.split()) for review in reviews])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 2633


For the training data:
1. Encode words as integers.
2. Bring each integer list to length 200 by truncating longer lists and left-padding shorter ones with zeros.

For labels
1. Encode positive to 1 and negative to 0
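
The steps above can be sketched on a toy example. The three-word vocabulary, the length of 5, and the function name `encode_and_pad` are all hypothetical and chosen to keep the output readable. The notebook's own encoder uses the full vocabulary and a length of 200.

```python
def encode_and_pad(text, vocab_to_int, max_length=5):
    """Encode words as integers, truncate to max_length, left-pad with zeros."""
    ids = [vocab_to_int.get(word, 0) for word in text.split()[:max_length]]
    return [0] * (max_length - len(ids)) + ids


# Hypothetical toy vocabulary; index 0 is reserved for padding/unknown words.
toy_vocab_to_int = {"great": 1, "movie": 2, "bad": 3}

short_review = encode_and_pad("great movie", toy_vocab_to_int)
long_review = encode_and_pad("bad bad bad bad movie great", toy_vocab_to_int)
print(short_review)  # shorter than 5 words: zeros are added on the left
print(long_review)   # longer than 5 words: only the first 5 are kept

# Labels: positive -> 1, negative -> 0
label_to_int = {"positive": 1, "negative": 0}
encoded_labels = [label_to_int[label] for label in ["positive", "negative"]]
print(encoded_labels)
```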

In [101]:
class Encoder:
  def __init__(self, vocab, max_length=200):
    self.vocab = vocab
    self.max_length = max_length
    self.vocab_to_int = {word: index+1 for index, word in enumerate(vocab)}
  
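  def encode(self, text):
    # Map each word to its integer ID (0 for unknown words), keeping at most max_length words
    text_int = [self.vocab_to_int.get(word, 0) for word in text.split()[:self.max_length]]
    # Left-pad with zeros so every encoded review has exactly max_length entries
    padding = [0] * (self.max_length - len(text_int))
    return np.array(padding + text_int)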

In [106]:
vocab = set(" ".join(reviews).split(" "))
encoder = Encoder(vocab)
features = np.array([encoder.encode(review) for review in reviews])

In [97]:
# Convert labels to 1s and 0s for 'positive' and 'negative'
labels = np.array([1 if label == 'positive' else 0 for label in labels])
labels[0:10]

array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

If you built the features correctly, they should look like the cell output below.

In [107]:
features[:10,:100]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 57095, 20557, 56762,
        39375, 65062, 30691, 38375,  3424, 66180, 66596, 47425, 17325,
         8538, 55536, 21725, 22600, 31630, 11246,  9023, 33005, 51552,
        55536, 67162, 38375, 40378,  3950, 55979, 47425, 71680, 40166,
        62761, 68329, 19067, 19860, 38151, 57095, 20557,  8878, 33330,
        56762, 54647,  8013, 19067, 50188, 73844, 56762, 67162, 38375,
        47425, 63398, 19067, 65799, 24117, 47425, 55994, 51593, 62632,
        54824, 32680, 68614, 11905, 57127, 61883, 67162, 73007, 47425,
        31545],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     

## Training, Validation, Test

With our data in nice shape, we'll split it into training, validation, and test sets.

> **Exercise:** Create the training, validation, and test sets here. You'll need to create sets for the features and the labels, `X_train` and `y_train` for example. Define a split fraction, `split_frac` as the fraction of data to keep in the training set. Usually this is set to 0.8 or 0.9. The rest of the data will be split in half to create the validation and testing data.
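
As a rough check on the arithmetic, the sketch below computes how many samples land in each set for a given `split_frac`. The helper `split_sizes` is hypothetical and only counts samples; it does not shuffle or split any data.

```python
def split_sizes(n_samples, split_frac):
    """Return (train, validation, test) sizes for a given training fraction."""
    n_train = round(n_samples * split_frac)
    n_holdout = n_samples - n_train
    n_val = n_holdout // 2          # half of the held-out data
    n_test = n_holdout - n_val      # the other half
    return n_train, n_val, n_test


# 25,000 reviews with 80% kept for training.
sizes = split_sizes(25000, 0.8)
print(sizes)
```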

In [111]:
split_frac = 0.8

# Keep split_frac of the data for training; hold out the rest
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=split_frac, random_state=42)

# Split the held-out data in half: 50% test, 50% validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

# Convert to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(X_train.shape), 
      "\nValidation set: \t{}".format(X_val.shape),
      "\nTest set: \t\t{}".format(X_test.shape))

			Feature Shapes:
Train set: 		(20000, 200) 
Validation set: 	(2500, 200) 
Test set: 		(2500, 200)


### Build the Model using Subclassing

Here, we'll build the model. First up, defining the hyperparameters.

* `lstm_size`: Number of units in the hidden layers of the LSTM cells. Larger usually performs better, at the cost of slower training. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `batch_size`: The number of reviews to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `learning_rate`: The step size the optimizer uses when updating the weights.
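
One practical effect of `batch_size` is the number of weight updates per epoch. The sketch below works this out for the 20,000 training reviews in this notebook. The helper name `steps_per_epoch` is made up for this example.

```python
import math


def steps_per_epoch(n_train, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_train / batch_size)


n_train = 20000  # size of the training set in this notebook
for batch_size_option in (32, 128, 512):
    print(batch_size_option, steps_per_epoch(n_train, batch_size_option))
```

Larger batches mean fewer, more memory-hungry steps per epoch, which is the trade-off behind the advice above.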

In [None]:
lstm_size = 256
lstm_layers = 1
batch_size = 128
learning_rate = 0.001

For the network itself, we'll be passing in our 200-element review vectors. Each batch will be `batch_size` vectors. We'll also be using dropout on the LSTM layer. Since we're building the model with Keras subclassing, there are no placeholders to create: inputs are passed directly to the model, and dropout is set through the LSTM layer's `dropout` argument (the fraction of units to drop, not the keep probability).

> **Exercise:** Define a `tf.keras.Model` subclass with an embedding layer, an LSTM layer with dropout, and a single-unit sigmoid output layer. Note that the cells below use the class defaults (`embedding_dim=128`, `lstm_units=128`) and `batch_size=32` in `model.fit`, so pass the hyperparameters above explicitly if you want them to take effect.

In [22]:
n_words = len(encoder.vocab_to_int) + 1  # Vocabulary size, +1 for padding token

In [58]:
class SentimentAnalysisModel(tf.keras.Model):
    def __init__(self, vocab_size, seq_len, embedding_dim=128, lstm_units=128, dropout_rate=0.2):
        super(SentimentAnalysisModel, self).__init__()
        # Define layers
        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_len)
        self.lstm = tf.keras.layers.LSTM(lstm_units, dropout=dropout_rate)
        self.dense = tf.keras.layers.Dense(1, activation='sigmoid') 
    
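    def call(self, inputs, training=False):
        # Forward pass: word IDs -> embeddings -> final LSTM state -> probability
        x = self.embedding(inputs)
        # Pass the training flag explicitly so dropout is active only during training
        x = self.lstm(x, training=training)
        return self.dense(x)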

### Train and Validate

In [125]:
# Sequence length must match the encoder's padded review length (200)
SEQ_LEN = encoder.max_length

# Initialize the model
model = SentimentAnalysisModel(vocab_size=n_words, seq_len=SEQ_LEN)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.build(input_shape=(None, SEQ_LEN))

# Print model summary to check the architecture
model.summary()

Model: "sentiment_analysis_model_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_13 (Embedding)    multiple                  9481600   
                                                                 
 lstm_13 (LSTM)              multiple                  131584    
                                                                 
 dense_13 (Dense)            multiple                  129       
                                                                 
Total params: 9,613,313
Trainable params: 9,613,313
Non-trainable params: 0
_________________________________________________________________
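
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 74,075 is inferred from the summary (9,481,600 / 128) rather than printed anywhere in the notebook, and the helper names are made up for this example.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized row per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 74075   # inferred: 9,481,600 / 128
embedding_dim = 128  # class default
lstm_units = 128     # class default

n_embedding = embedding_params(vocab_size, embedding_dim)
n_lstm = lstm_params(embedding_dim, lstm_units)
n_dense = dense_params(lstm_units, 1)
total = n_embedding + n_lstm + n_dense
print(n_embedding, n_lstm, n_dense, total)
```

Nearly all of the parameters sit in the embedding table, so the vocabulary size dominates the model's footprint.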


In [126]:
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Accuracy did not improve further with more epochs, so we save the final model.

In [127]:
model.save('sentiment_analysis', save_format='tf')



INFO:tensorflow:Assets written to: sentiment_analysis\assets




### Test

In [128]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)

# Print the evaluation results
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")


Test Loss: 0.8754271268844604
Test Accuracy: 0.8087999820709229


Let's try a couple of simple tests.

In [135]:
tests = ["I love this movie!", "This was the worst experience ever."]

single_input = encoder.encode(tests[0]).reshape(1, 200)
prediction = model.predict(single_input)
print(prediction)


single_input = encoder.encode(tests[1]).reshape(1, 200)
prediction = model.predict(single_input)
print(prediction)


[[0.713653]]
[[0.1438838]]
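
One thing to watch with free-form test sentences: judging from the sample review printed earlier, the training text appears to be lowercase with most punctuation replaced by spaces and periods kept as separate tokens. Tokens such as `I` or `movie!` would therefore likely be missing from the vocabulary and encoded as 0. The sketch below is a hypothetical `normalize` helper that approximates that cleaning. The exact preprocessing applied to the dataset is an assumption.

```python
import re


def normalize(text):
    """Approximate the dataset's apparent cleaning (an assumption, see above)."""
    text = text.lower()
    text = text.replace(".", " . ")         # keep periods as separate tokens
    text = re.sub(r"[^a-z.\s]", " ", text)  # drop other punctuation and digits
    return " ".join(text.split())           # collapse repeated whitespace


cleaned = [
    normalize("I love this movie!"),
    normalize("This was the worst experience ever."),
]
for sentence in cleaned:
    print(sentence)
```

Passing `normalize(tests[0])` to `encoder.encode` instead of the raw string should let more of the words map to real vocabulary entries.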
