# 16. Natural Language Processing with RNNs and Attention

Seen from a certain angle, the Turing test is an NLP task. This chapter focuses on tackling NLP tasks (far less complex than a Turing test) using RNNs.

### Generating Shakespearean Text Using a Character RNN

Let's look at how to build a Char-RNN, a network that predicts the next character in a sentence.

#### Creating the Training Dataset

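The text comes from Andrej Karpathy's GitHub repo. The cell below assumes the file has already been downloaded to `datasets/shakespeare/input.txt` under the current working directory: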

In [1]:
from tensorflow import keras
import os

filepath = os.path.join(os.getcwd(), 'datasets', 'shakespeare', 'input.txt')
with open(filepath) as f:
    shakespeare_text = f.read()

Next, we must encode every character as an integer. We will use Keras's `Tokenizer` class with `char_level=True`, so that it encodes characters rather than words.

In [2]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

Let's quickly check what it does. Note that the tokenizer lowercases the text by default and that character IDs start at 1:

In [3]:
tokenizer.texts_to_sequences(["Hello"])

[[7, 2, 12, 12, 4]]

In [4]:
tokenizer.sequences_to_texts([[7, 2, 12, 12, 4]])

['h e l l o']
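
To make the mapping less of a black box, here is a minimal plain-Python sketch of character-level encoding. It is not the Keras implementation: `build_char_index`, `encode`, and `decode` are hypothetical helpers, and the choices they make (lowercasing, IDs starting at 1, most frequent character first, unknown characters skipped) are assumptions meant to mimic the behaviour observed above.

```python
from collections import Counter


def build_char_index(text):
    """Map each distinct lowercased character to an ID starting at 1,
    most frequent character first (hypothetical helper, not Keras code)."""
    counts = Counter(text.lower())
    return {char: i for i, (char, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    # Assumption of this sketch: characters missing from the index are skipped
    return [char_index[c] for c in text.lower() if c in char_index]


def decode(ids, char_index):
    # Invert the mapping to go from IDs back to characters
    index_char = {i: c for c, i in char_index.items()}
    return "".join(index_char[i] for i in ids)


char_index = build_char_index("hello world")
ids = encode("Hello", char_index)
print(ids)                      # [3, 4, 1, 1, 2]
print(decode(ids, char_index))  # hello
```

Unlike `sequences_to_texts`, this sketch joins the decoded characters without spaces.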

In [5]:
max_id = len(tokenizer.word_index) # number of distinct characters

In [6]:
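# tokenizer.document_count would be 1 here, because fit_on_texts() received
# a one-element list, so count the characters directly
dataset_size = len(shakespeare_text) # total number of characters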

In [8]:
import numpy as np

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1 # starting from 0

#### How to Split a Sequential Dataset

It is very important to avoid any overlap between the training set, the validation set, and the test set. It is also a good idea to leave a gap between these sets, so that no paragraph straddles two of them.

In our case, let's keep 90% for the training set:

In [10]:
import tensorflow as tf

train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
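
The idea of leaving a gap between the sets can be sketched in plain Python. `split_with_gap` is a hypothetical helper, and the 5% validation share and the gap of 10 items are arbitrary values chosen for illustration; only the 90% training share comes from the text above.

```python
def split_with_gap(sequence, train_pct=90, valid_pct=5, gap=0):
    """Split a sequence, in order, into train/valid/test slices, leaving
    `gap` unused items between consecutive sets (hypothetical helper)."""
    n = len(sequence)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    train = sequence[:train_end]
    valid = sequence[train_end + gap:valid_end]  # skip `gap` items after train
    test = sequence[valid_end + gap:]            # skip `gap` items after valid
    return train, valid, test


data = list(range(1000))
train, valid, test = split_with_gap(data, gap=10)
print(len(train), len(valid), len(test))  # 900 40 40
```

Because the slices are taken in order and never shuffled across boundaries, no item can appear in two sets.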

#### Chopping the Sequential Dataset into Multiple Windows

Now we have a single very long sequence of characters. We can't train our RNN on it directly, since a single sequence this long is impractical to backpropagate through. Let's use the dataset's `window()` method to convert this long sequence of characters into many smaller windows of text.

In [11]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

By default, windows do **not** overlap, but we used `shift=1` so that each window starts one character after the previous one. Setting `drop_remainder=True` discards the shorter windows at the end, so every window is exactly 101 characters long.

In [12]:
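# window() yields a dataset of window datasets; flatten it so that
# each window becomes a single tensor of window_length characters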
dataset = dataset.flat_map(lambda window: window.batch(window_length))
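
To see what windowing with a shift produces, here is a small plain-Python sketch on a toy sequence. `sliding_windows` is a hypothetical helper that only imitates the combined effect of `window(..., drop_remainder=True)` and the flattening step. It is not how `tf.data` works internally.

```python
def sliding_windows(sequence, window_length, shift=1):
    """Return every full window of `window_length` items, each starting
    `shift` items after the previous one (partial windows are dropped)."""
    last_start = len(sequence) - window_length
    return [sequence[i:i + window_length]
            for i in range(0, last_start + 1, shift)]


for window in sliding_windows(list(range(6)), window_length=3):
    print(window)
# [0, 1, 2]
# [1, 2, 3]
# [2, 3, 4]
# [3, 4, 5]
```

With `shift=1`, a sequence of length N yields N - window_length + 1 overlapping windows, which is why a single long text turns into a very large number of training windows.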

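Now the dataset contains consecutive windows of 101 characters each. Gradient Descent works best when the training instances are independent and identically distributed, so we shuffle the windows, batch them, and then separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e. the inputs shifted one character ahead).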

In [13]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
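
The slicing in the last line is easier to follow on a toy window. The sketch below applies the same `[:-1]` / `[1:]` split to a plain Python string. `split_window` is a hypothetical helper used only for illustration.

```python
def split_window(window):
    """Inputs are all items but the last; targets are the same items
    shifted one step ahead (all items but the first)."""
    return window[:-1], window[1:]


inputs, targets = split_window("hello")
print(inputs, targets)  # hell ello

# At each time step, the model sees the inputs so far and must predict the next character
for t in range(len(inputs)):
    print(repr(inputs[:t + 1]), "->", repr(targets[t]))
# 'h' -> 'e'
# 'he' -> 'l'
# 'hel' -> 'l'
# 'hell' -> 'o'
```

Inputs and targets therefore always have the same length: a 101-character window gives 100 input characters and 100 target characters.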

Let's encode each input character as a one-hot vector (the targets stay as integer IDs):

In [14]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
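
As a reminder of what one-hot encoding does to each character ID, here is a minimal plain-Python sketch. `one_hot` below is a hypothetical stand-in written for illustration, not the TensorFlow function, and the depth of 4 is an arbitrary toy value.

```python
def one_hot(ids, depth):
    """Turn each integer ID into a vector of length `depth` that is
    1.0 at position `id` and 0.0 everywhere else."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]


for vector in one_hot([0, 2, 3], depth=4):
    print(vector)
# [1.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 0.0]
# [0.0, 0.0, 0.0, 1.0]
```

Applied to a batch of inputs of shape `[batch_size, n_steps]`, this adds a last dimension of size `max_id`, which is why the IDs were shifted to start from 0 earlier.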

In [15]:
# prefetch one batch so that data preparation overlaps with training
dataset = dataset.prefetch(1)