## Natural Language Processing
Natural Language Processing (NLP for short) is a discipline in computing that deals with the interaction between natural (human) languages and computers. Common examples of NLP are spellcheck and autocomplete. Essentially, NLP is the field that focuses on how computers can understand and process human languages.

### Recurrent Neural Networks
A recurrent neural network (RNN for short) is a kind of neural network that is much more capable of processing sequential data such as text or characters.

We will learn how to use a recurrent neural network to do the following:
- Sentiment Analysis  *(how bad/good is a text)*
- Character Generation *(autocomplete, ...)*

## Encoding Text

### Word Embeddings
This method keeps the order of words intact and encodes similar words with very similar vectors. It attempts to encode not only the frequency and order of words but also the meaning of those words in the sentence. It encodes each word as a dense vector that represents its context in the sentence.

Unlike the previous techniques, word embeddings are learned by looking at many different training examples. You can add what's called an embedding layer to the beginning of your model, and while your model trains, the embedding layer will learn suitable embeddings for the words. You can also use pretrained embedding layers.
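
To make the idea of a dense vector concrete, here is a minimal sketch of what an embedding lookup gives us. The three-word table and its 3-dimensional vectors are invented for illustration; a real embedding layer would learn these values during training, and `cosine_similarity` is just a helper written for this example.

```python
import math

# Hypothetical toy embedding table: word -> dense vector.
# These numbers are made up; a trained layer would learn them.
embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "bad":   [-0.7, 0.1, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_good_great = cosine_similarity(embeddings["good"], embeddings["great"])
sim_good_bad = cosine_similarity(embeddings["good"], embeddings["bad"])

# Words with similar meaning should end up closer together
print(sim_good_great > sim_good_bad)
```

With vectors like these, "good" and "great" point in nearly the same direction while "good" and "bad" do not, which is the property a learned embedding is meant to capture.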

## Recurrent Neural Networks (RNNs)
Now that we've learned a little bit about how we can encode text, it's time to dive into recurrent neural networks. Up until this point we have been using something called **feed-forward** neural networks. This simply means that all our data is fed forwards (all at once) from left to right through the network. This was fine for the problems we considered before but won't work very well for processing text. After all, even we (humans) don't process text all at once. We read word by word from left to right and keep track of the current meaning of the sentence so we can understand the meaning of the next word.

This is exactly what a recurrent neural network is designed to do. When we say recurrent neural network, all we really mean is a network that contains a loop. An RNN processes one word at a time while maintaining an internal memory of what it has already seen. This allows it to treat words differently based on their order in a sentence and to slowly build an understanding of the entire input, one word at a time.

This is why we are treating our text data as a sequence: so that we can pass one word at a time to the RNN.

Let's have a look at what a recurrent layer might look like.

![alt text](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
*Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/*

Let's define what all these variables stand for before we get into the explanation.

**h<sub>t</sub>** output at time t

**x<sub>t</sub>** input at time t

**A** Recurrent Layer (loop)

What this diagram is trying to illustrate is that a recurrent layer processes words (or other inputs) one at a time, in combination with the output from the previous iteration. So, as we progress further through the input sequence, we build a more complex understanding of the text as a whole.
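
The loop in the diagram can be sketched in a few lines of plain Python. This is a deliberately tiny, scalar version: the weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration (a real layer learns them and works on vectors), and `rnn_step` and `run_rnn` are names made up for this example.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar recurrent layer.

    Combines the current input x_t with the previous output h_prev.
    """
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs):
    """Feed the inputs one at a time, carrying the output forward as memory."""
    h = 0.0  # nothing has been seen yet
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        outputs.append(h)
    return outputs

# The same values in a different order lead to a different final output
forward = run_rnn([1.0, 0.0, -1.0])
backward = run_rnn([-1.0, 0.0, 1.0])
print(forward[-1], backward[-1])
```

Because each output depends on the previous one, the order of the inputs matters, which is exactly what we want for text.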

What we've just looked at is called a **simple RNN layer**. It can be effective at processing shorter sequences of text for simple problems but has several drawbacks. One of them is that as text sequences get longer, it becomes increasingly difficult for the network to understand the text properly.



### LSTM
The layer we discussed in depth above is called a simple RNN layer. However, there are other recurrent layers (layers that contain a loop) that work much better than a simple RNN layer. The one we will talk about here is called LSTM (Long Short-Term Memory). This layer works very similarly to the simple RNN layer but adds a way to retain information from any timestep in the past. In our simple RNN layer, input from previous timesteps gradually disappeared as we got further through the input. An LSTM maintains a separate long-term memory that carries information forward across timesteps, so something seen early in the sequence can still influence the output much later. This adds to the complexity of our network and allows it to discover more useful relationships between inputs and when they appear.
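
As a rough illustration of how the long-term memory survives, here is a scalar sketch of one LSTM step following the commonly used gate formulation (forget, input and output gates plus a cell state). The weights in `params` are invented so that the forget gate stays close to 1; in a real layer they are learned, and the function and parameter names here are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much memory to expose as output

    c_t = f * c_prev + i * g          # updated long-term memory (cell state)
    h_t = o * math.tanh(c_t)          # output at this timestep
    return h_t, c_t

# Made-up weights: a large forget bias keeps the memory around
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (1.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 5.0),
}

h, c = 0.0, 0.0
for x_t in [2.0, 0.0, 0.0, 0.0, 0.0]:  # one signal, then silence
    h, c = lstm_step(x_t, h, c, params)
print(c)  # most of what was written at the first step is still in memory
```

The first input is written into the cell state, and because the forget gate stays near 1, most of it is still there four steps later.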

------

### Movie Review 

In [1]:
import tensorflow as tf
from keras.datasets import imdb
from keras.preprocessing import sequence

import os
import numpy as np

#### Dataset
We'll start by loading the IMDB movie review dataset from Keras. This dataset contains 25,000 reviews from IMDB, where each one is already preprocessed and labeled as either positive or negative. Each review is encoded as a list of integers, where each integer represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 is the 3rd most common word in the dataset.
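
To see what "encoded by frequency rank" means, here is a small sketch that builds such an encoding for a toy corpus. The corpus and the names `rank_of` and `encode` are made up for illustration; the real dataset ships already encoded, so none of this is needed to use it.

```python
from collections import Counter

# Hypothetical toy corpus (word counts are all different, so ranks are unambiguous)
corpus = "the the the the movie movie movie was was good".split()

# Rank 1 is the most common word, rank 2 the next most common, and so on
counts = Counter(corpus)
rank_of = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text):
    """Replace each known word by its frequency rank (unknown words are skipped)."""
    return [rank_of[word] for word in text.split() if word in rank_of]

print(rank_of)                       # {'the': 1, 'movie': 2, 'was': 3, 'good': 4}
print(encode("the movie was good"))  # [1, 2, 3, 4]
```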

In [2]:
VOCAB_SIZE = 88584 # unique words
MAXLEN = 250 # max len for each review
BATCH_SIZE = 64

(data_train, y_train), (data_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [8]:
# Look at the reviews
len(data_train[0])

218

In [9]:
len(data_train[10])

450

### Preprocessing
If we have a look at some of our loaded reviews, we'll notice that they are different lengths. This is an issue: we cannot pass data of different lengths into our neural network. Therefore, we must make each review the same length. To do this we will follow the procedure below:
- if the review is longer than 250 words, trim off the extra words
- if the review is shorter than 250 words, add the necessary number of 0's to make it 250 long

Luckily for us, Keras has a function that can do this. By default it pads and trims at the start of each sequence:

In [10]:
data_train = sequence.pad_sequences(data_train, MAXLEN)
data_test = sequence.pad_sequences(data_test, MAXLEN)
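
For intuition, the padding and trimming can be sketched in plain Python. This is not the Keras implementation, just a minimal stand-in that mirrors the default behaviour of adding zeros at the front and dropping extra tokens from the front; `pad_or_truncate` is a name made up for this example.

```python
def pad_or_truncate(tokens, maxlen, pad_value=0):
    """Return a list of exactly `maxlen` items.

    Longer inputs keep only their last `maxlen` tokens;
    shorter inputs are left-padded with `pad_value`.
    """
    if len(tokens) >= maxlen:
        return tokens[len(tokens) - maxlen:]
    return [pad_value] * (maxlen - len(tokens)) + tokens

print(pad_or_truncate([5, 8, 2], 5))           # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```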

### Creating the model

In [15]:
model = tf.keras.Sequential([
    # input_dim: Integer. Size of the vocabulary
    # output_dim: Integer. Dimension of the dense embedding
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          2834688   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
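
The parameter counts in the summary can be checked by hand. The short calculation below assumes the usual LSTM layout of four gates, each with input weights, recurrent weights and one bias per unit; the constant names are chosen for this example.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32
LSTM_UNITS = 32

# Embedding: one EMBED_DIM-sized vector per word in the vocabulary
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with input weights, recurrent weights and a bias per unit
lstm_params = 4 * (EMBED_DIM + LSTM_UNITS + 1) * LSTM_UNITS

# Dense: one weight per LSTM unit plus a single bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2834688 8320 33 2843041
```

Almost all of the parameters sit in the embedding layer, since it stores a vector for every word in the vocabulary.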


In [17]:
model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['acc'])

### Training the model

In [18]:
history = model.fit(data_train, y_train, epochs=10, validation_split=0.2)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Evaluating the model

In [19]:
results = model.evaluate(data_test, y_test)
print(results)

[0.5825123190879822, 0.8445199728012085]


##### We get an accuracy of about 0.8445 (84.45%)

### Making Predictions
Since our reviews are encoded, we'll need to convert any review that we write into that form so the network can understand it. To do that we'll load the encodings from the dataset and use them to encode our own data.

In [42]:
word_index = imdb.get_word_index() # Retrieves a dict mapping words to their index in the IMDB dataset.

In [43]:
# Encode a piece of text the same way the dataset is encoded
def encode_text(text):
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text) # Split the text into lowercase word tokens
    tokens = [word_index.get(word, 0) for word in tokens] # Look up each token's index; unknown words become 0
    # Pad or trim to MAXLEN (250), just like the training data
    return sequence.pad_sequences([tokens], MAXLEN)[0] # pad_sequences returns a batch of sequences, but we want just the one
    
    
text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0  12  17  13  4

In [51]:
# Function to decode
def decode_text(encoded):
    final = ""
    for code in encoded:
        if code != 0:
            for key, value in word_index.items():
                if value == code:
                    final += key + " "
    return final
            

decoded = decode_text(encoded)
print(decoded)

that movie was just amazing so amazing 
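
Searching the whole word index for every code works, but it gets slow for long texts. One common alternative is to invert the dictionary once and then look codes up directly. The sketch below uses a tiny hypothetical `toy_word_index` with illustrative values so that it runs on its own; with the real data you would invert the dictionary returned by `imdb.get_word_index()` instead.

```python
# Hypothetical miniature word index; the real one comes from imdb.get_word_index()
toy_word_index = {"that": 12, "movie": 17, "was": 13, "great": 84}

# Invert once: integer code -> word
reverse_index = {code: word for word, code in toy_word_index.items()}

def decode_tokens(encoded, pad_value=0):
    """Map integer codes back to words, skipping padding and unknown codes."""
    words = [reverse_index[code] for code in encoded
             if code != pad_value and code in reverse_index]
    return " ".join(words)

print(decode_tokens([0, 0, 12, 17, 13, 84]))  # that movie was great
```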


#### Time to make a prediction

In [63]:
def predict(review):
    encoded_review = encode_text(review)
    batch = np.zeros((1, MAXLEN)) # the model expects a batch, so wrap the single review in one
    batch[0] = encoded_review
    result = model.predict(batch)
    return result[0] # score for the only review in the batch


positive_review = "That movie was awesome! really loved it and would watch it again because it was amazingly great"
print(predict(positive_review))

negative_review = "that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
print(predict(negative_review))

[0.95112455]
[0.5036001]
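
The model outputs a single sigmoid value between 0 and 1, where (in this dataset) labels of 1 mark positive reviews and 0 mark negative ones. A simple way to read the scores is to compare them with a threshold. The 0.5 cut-off and the helper names below are assumptions made for this sketch, not something fixed by the model.

```python
def to_label(score, threshold=0.5):
    """Turn a sigmoid score into a sentiment label."""
    return "positive" if score >= threshold else "negative"

def margin(score, threshold=0.5):
    """Distance from the threshold; small values mean the model is unsure."""
    return abs(score - threshold)

# The two scores printed above
for score in (0.95112455, 0.5036001):
    print(to_label(score), round(margin(score), 4))
```

Read this way, the first review is confidently positive, while the second sits almost exactly on the threshold, so the model is essentially undecided about it rather than recognising it as negative.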
