# Natural Language Processing with RNNs
Natural Language Processing (NLP for short) is a discipline in computing that deals with the interaction between natural (human) languages and computers. Common examples of NLP include spell check and autocomplete. Essentially, NLP is the field that focuses on how computers can understand and process human languages.

## Recurrent Neural Networks
In this tutorial we will introduce a new kind of neural network, called a **recurrent neural network** (RNN for short), that is much more capable of processing sequential data such as text or characters.

We will learn how to use a recurrent neural network to do the following:
- Sentiment Analysis
- Character Generation

RNNs are fairly complex and come in many different forms, so in this tutorial we will focus on how they work and the kind of problems they are best suited for.

## Sequence Data
In the previous tutorials we focused on data that we could represent as one static data point where the notion of time or step was irrelevant. Take our image data, for example: it was simply a tensor of shape (width, height, channels). That data doesn't change or care about the notion of time.

In this tutorial we will look at sequences of text and learn how we can encode them in a meaningful way. Unlike images, sequence data such as long chains of text, weather patterns, videos and really anything where the notion of a step or time is relevant needs to be processed and handled in a special way.

But what do I mean by sequences, and why is text data a sequence? Since textual data contains many words that follow in a very specific and meaningful order, we need to be able to keep track of each word and when it occurs in the data. Simply encoding, say, an entire paragraph of text into one data point wouldn't give us a very meaningful picture of the data and would be very difficult to do anything with. This is why we treat text as a sequence and process one word at a time. We will keep track of where each of these words appears and use that information to try to understand the meaning of pieces of text.

## Encoding Text
As we know, machine learning models and neural networks don't take raw text data as an input. This means we must somehow encode our textual data as numeric values that our models can understand. There are many different ways of doing this and we will look at a few examples below.

Before we get into the different encoding/preprocessing methods, let's understand the information we can get from textual data by looking at the following two movie reviews.

`I thought the movie was going to be bad, but it was actually amazing!`

`I thought the movie was going to be amazing, but it was actually bad!`

Although these two sentences are very similar, we know that they have very different meanings! This is because of the **ordering** of words, a very important property of textual data.

Now keep that in mind while we consider some different ways of encoding our textual data.

## Bag of Words
The first and simplest way to encode our data is to use something called **bag of words**. This is a pretty easy technique where each word in a sentence is encoded with an integer and thrown into a collection that does not maintain the order of the words but does keep track of their frequency. Have a look at the Python function below that encodes a string of text into a bag of words.

In [7]:
vocab = {} # maps word to integer representing it
word_encoding = 1

def bag_of_words(text):
    global word_encoding

    words = text.lower().split(" ") # create a list of all of the words in the text, we'll assume there is no
    bag = {} # stores all of the encodings and their frequency

    for word in words:
        if word in vocab:
            encoding = vocab[word] # get encoding from vocab
        else:
            vocab[word] = word_encoding
            encoding = word_encoding
            word_encoding += 1

        if encoding in bag:
            bag[encoding] += 1
        else:
            bag[encoding] = 1

    return bag

text = "this is a test to see if this test will work is is test a a"
bag = bag_of_words(text)
print(bag)
print(vocab)    

{1: 2, 2: 3, 3: 3, 4: 3, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}
{'this': 1, 'is': 2, 'a': 3, 'test': 4, 'to': 5, 'see': 6, 'if': 7, 'will': 8, 'work': 9}


Bag of words breaks down for sentences where the order is important, because it doesn't take the order of the words into consideration at all.
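
To make this concrete, the short sketch below counts the words in the two movie reviews from earlier. It uses `collections.Counter` as a stand-in for the bag (an assumption made for brevity; it is not the `bag_of_words` function above), and the helper name `tokenize` is hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical helper: lowercase, drop the punctuation used in these reviews, split on whitespace
    cleaned = text.lower().replace(",", "").replace("!", "")
    return cleaned.split()

review_a = "I thought the movie was going to be bad, but it was actually amazing!"
review_b = "I thought the movie was going to be amazing, but it was actually bad!"

tokens_a = tokenize(review_a)
tokens_b = tokenize(review_b)

# The bags (word -> frequency) are identical, even though the word order differs
same_bag = Counter(tokens_a) == Counter(tokens_b)
same_order = tokens_a == tokens_b
print(same_bag, same_order)
```

Both reviews contain exactly the same words with the same frequencies, so a bag-of-words encoding cannot tell the positive review from the negative one.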

A second option is integer encoding: we encode each word as an integer and keep the integers in an array in their original order. For example, "I am Tim" becomes [1, 2, 3], which is different from [3, 1, 2], even though both use the same words with the same frequencies.
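
A minimal sketch of this order-preserving encoding might look like the following. The names `integer_encode` and `order_vocab` are hypothetical, and words are simply numbered in the order they are first seen.

```python
def integer_encode(text, vocab):
    # Give each unseen word the next unused integer, and keep the words in their original order
    encoded = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
        encoded.append(vocab[word])
    return encoded

order_vocab = {}
first = integer_encode("I am Tim", order_vocab)
second = integer_encode("Tim I am", order_vocab)
print(first)   # [1, 2, 3]
print(second)  # [3, 1, 2]
```

Unlike the bag of words, the two encodings differ, so the order information survives.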

Integer encoding has its own problem. Suppose we are working with 100,000 different words that happen to be encoded as happy = 1, sad = 2, ..., bad = 99999, good = 100000. The model would then find it difficult to tell happy and sad apart, since their encodings are only 1 unit apart, while happy and good, which have similar meanings, are separated by 99999 units.

## Word Embedding
Word embeddings represent every single word as a vector. If we encode the word happy, we hope the vector for the word good points in a similar direction, since the two words have similar meanings. Words with opposite meanings should point in different directions.
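
One common way to measure whether two vectors "point in a similar direction" is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely illustrative (they are assumptions, not learned embeddings) to show the idea.

```python
import math

# Hand-picked toy vectors, purely illustrative (real embeddings are learned during training)
toy_embeddings = {
    "happy": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "sad": [-0.9, -0.7, 0.1],
}

def cosine_similarity(u, v):
    # Close to 1: similar direction. Close to -1: opposite direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_good = cosine_similarity(toy_embeddings["happy"], toy_embeddings["good"])
sim_sad = cosine_similarity(toy_embeddings["happy"], toy_embeddings["sad"])
print(round(sim_good, 2), round(sim_sad, 2))
```

With these toy values, happy and good score close to 1, while happy and sad score close to -1.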

## Recurrent Neural Networks (RNNs)
Now that we've learned a little bit about how we can encode text, it's time to dive into recurrent neural networks. Up until this point, we have been using something called **feed-forward** neural networks. This simply means that all of our data is fed forwards (all at once) from left to right through the network. This was fine for the problems we considered before but won't work very well for processing text. After all, even we (humans) don't process text all at once. We read word by word from left to right and keep track of the current meaning of the sentence so we can understand the meaning of the next word. This is exactly what a recurrent neural network is designed to do. When we say recurrent neural network, all we really mean is a network that contains a loop. An RNN will process one word at a time while maintaining an internal memory of what it has already seen. This allows it to treat words differently based on their order in a sentence and to slowly build an understanding of the entire input, one word at a time.

This is why we are treating our text data as a sequence: so that we can pass one word at a time to the RNN!
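
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration of a basic recurrent update, not the Keras implementation: the weights are random rather than trained, and the names `W_x`, `W_h`, `b` and `run_rnn` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Random, untrained weights (illustrative only)
W_x = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # previous hidden -> hidden
b = np.zeros(hidden_dim)

def run_rnn(inputs):
    # h is the internal memory; it is updated once per timestep
    h = np.zeros(hidden_dim)
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

# Five "word vectors", fed in the original order and in reverse order
sequence_a = rng.normal(size=(5, input_dim))
sequence_b = sequence_a[::-1]

h_a = run_rnn(sequence_a)
h_b = run_rnn(sequence_b)
print(h_a)
print(h_b)
```

Because each step depends on the memory left by the previous steps, the same inputs in a different order generally produce a different final state.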

## LSTM
An LSTM (Long Short-Term Memory) layer works very similarly to the SimpleRNN layer but adds a way to retain information from timesteps far in the past. In our simple RNN layer, input from previous timesteps gradually disappeared as we got further through the input. With an LSTM, we have a long-term memory (often called the cell state) that can carry information about previously seen inputs, and when we saw them, across many timesteps. This allows the layer to draw on information from much earlier in the sequence. This adds to the complexity of our network and allows it to discover more useful relationships between inputs and when they appear.
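
For readers who want to see the mechanics, here is a rough NumPy sketch of a single LSTM step in its standard textbook formulation. In that formulation the long-term memory is a fixed-size cell state updated through gates, rather than a literal store of every input. The weights are random and untrained, and the names `lstm_step`, `W`, `U` and `b` are assumptions made for this example, not the Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: (4 * hidden, input), U: (4 * hidden, hidden), b: (4 * hidden,)
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:hidden])                 # input gate: how much new information to write
    f = sigmoid(z[hidden:2 * hidden])        # forget gate: how much old memory to keep
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values to write
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: how much memory to expose
    c_t = f * c_prev + i * g                 # long-term memory (cell state)
    h_t = o * np.tanh(c_t)                   # short-term output (hidden state)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = rng.normal(size=(4 * hidden_dim, input_dim))
U = rng.normal(size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

The forget gate is what lets information persist: when it stays close to 1, the cell state is carried forward almost unchanged from one timestep to the next.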

## Sentiment Analysis
We are going to do something called sentiment analysis.

The formal definition of this term from Wikipedia is as follows:

*the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral*.

The example we'll use here is classifying movie reviews as either positive or negative.

*This guide is based on the following TensorFlow tutorial*: https://www.tensorflow.org/tutorials/text/text_classification_rnn

### Movie Review Dataset
We'll start by loading the IMDB movie review dataset from keras. This dataset contains 50,000 reviews from IMDB, split into 25,000 for training and 25,000 for testing, where each one is already preprocessed and labeled as either positive or negative. Each review is encoded as a list of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
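
To illustrate the idea of encoding words by how common they are, here is a small sketch on a made-up corpus. The reviews and the names `toy_reviews`, `rank_index` and `toy_encoded` are hypothetical, and the sketch ignores details of the real dataset such as special tokens for padding.

```python
from collections import Counter

# A made-up three-review corpus (not taken from IMDB)
toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the acting was great",
]

# Count how often each word appears across the whole corpus
counts = Counter(word for review in toy_reviews for word in review.split())

# Rank 1 is the most common word, rank 2 the next most common, and so on
rank_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Encode each review as the list of its words' ranks
toy_encoded = [[rank_index[word] for word in review.split()] for review in toy_reviews]
print(rank_index)
print(toy_encoded)
```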

In [20]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

In [9]:
VOCAB_SIZE = 88584

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [10]:
# Let's look at one review
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

### More Preprocessing
If we have a look at some of our loaded reviews, we'll notice that they have different lengths. This is an issue: we cannot pass data of different lengths into our neural network. Therefore, we must make each review the same length. To do this, we will follow the procedure below:
- if the review is longer than 250 words, trim off the extra words
- if the review is shorter than 250 words, add the necessary number of 0's to make it 250 words long.

Luckily for us keras has a function that can do this for us:

In [11]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)
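
For intuition, the procedure can be sketched in plain Python. The function name `pad_or_trim` is hypothetical, and the sketch assumes the default `padding='pre'` and `truncating='pre'` behavior of `pad_sequences`, meaning zeros are added at the front and extra words are removed from the front.

```python
def pad_or_trim(review, maxlen, pad_value=0):
    # Too long (or exactly right): keep only the last `maxlen` entries
    if len(review) >= maxlen:
        return review[len(review) - maxlen:]
    # Too short: add zeros at the front until the length is `maxlen`
    return [pad_value] * (maxlen - len(review)) + review

short_review = pad_or_trim([5, 6, 7], 5)
long_review = pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5)
print(short_review)  # [0, 0, 5, 6, 7]
print(long_review)   # [3, 4, 5, 6, 7]
```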

### Creating the Model
Now it's time to create the model. We'll use a word embedding layer as the first layer in our model and add an LSTM layer afterwards that feeds into a dense node to get our predicted sentiment.

32 is the output dimension of the vectors generated by the embedding layer. We can change this value if we'd like!

In [12]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2843041 (10.85 MB)
Trainable params: 2843041 (10.85 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
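
The parameter counts in this summary can be checked by hand. The sketch below assumes the usual formulas: one vector per vocabulary entry for the embedding, four gate blocks (each with input weights, recurrent weights and a bias) for the LSTM, and one weight per input plus a bias for the dense node. The variable names are chosen for this example.

```python
vocab_size = 88584     # VOCAB_SIZE from above
embedding_dim = 32     # output dimension of the embedding layer
lstm_units = 32        # number of units in the LSTM layer

# Embedding: one 32-dimensional vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gate blocks, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM output plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, which is typical when the vocabulary is large.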


In [14]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])

history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


And we'll evaluate the model on our test data to see how well it performs.

In [15]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.48403313755989075, 0.8578400015830994]


So, we're scoring somewhere in the mid-high 80's. Not bad for a simple recurrent network.

### Making Predictions
Now, let's use our network to make predictions on our own reviews.

Since our reviews are encoded, we'll need to convert any review that we write into that form so the network can understand it. To do that, we'll load the encodings from the dataset and use them to encode our own data.

In [21]:
word_index = imdb.get_word_index()

def encode_text(text):
    # Split the review into lowercase word tokens with punctuation removed
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    # Map each word to its integer encoding; words not in the index become 0
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    # Pad to the same length as the reviews the model was trained on
    return sequence.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)