**Chapter 16 – Natural Language Processing with RNNs and Attention**

_This notebook contains all the sample code in chapter 16._

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

# Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.0.



Following the "recipe" for the "very basic seq2seq model", build and train a system that translates for loops into while loops and see what kind of performance you can get. Make sure you use the padding trick introduced to fix the problem with different-length inputs! Three suggestions:

- Just use the ASCII values of characters as the character IDs.
- Find a reasonable maximum output size and pad all the training targets to that size.
- Don't be afraid to play around with the embedding size if things don't work with the initial value.

Optional, but I hope some of you do this: rework the data generation to produce two versions of the same code, where the input has had all whitespace and indentation removed and the target string has reasonable spacing and indentation. The model then learns to format code with nice spacing and indentation.

In addition, you should keep progress notes, which will become part of your submission. At the bottom of your notebook, make a single text cell for these notes. Your notes should include two kinds of entries: problems and investigations. Problem entries should read like this: "I (or We) had <problem> in the cell that did <something>. I/We fixed this by <remedy>." Investigation entries should read like this: "I/We didn't understand <something>. We found an explanation at <link> (or perhaps, we talked to <person>) and now I/We understand it."

You may add other kinds of progress entries as you see fit, e.g. "I figured out a clever way to do <something>." But you should have at least 5 problem and 5 investigation entries.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    !pip install -q -U tensorflow-addons
    !pip install -q -U transformers
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "nlp"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
import random
import pandas as pd

vars = ("x", "y", "z", "current", "head", "curr", "i", "idx", "index", "j")
inits = (" 0", "0", " -1000", "-42", "list->head")
conds = (">", "<=", " == ", " < ")
limits = ("10", "42", "NULL", "100000", "-45", "LIMIT")
incrs = ("++", "+=2", "+=100", "-=20")
bodys = ("sum += x;", "var-;", "print(x);", "open(text.txt);", "random(x);")


def genForLoop(var, init, cond, limit, incr, body):
    return F"for({var} = {init}; {var}{cond}{limit}; {var}{incr}) {{\n\t{body}\n}}"

def genWhileLoop(var, init, cond, limit, incr, body):
    return F"{var} = {init};\nwhile({var}{cond}{limit}) {{\n\t{body}\n\t{var}{incr};\n}}"


def genLoopPairs(count):
    """Return `count` random (for loop, while loop) pairs as two pandas Series."""
    ret = []
    for _ in range(count):
        # Draw one random option for each part of the loop
        args = (random.choice(vars), random.choice(inits), random.choice(conds),
                random.choice(limits), random.choice(incrs), random.choice(bodys))
        # Render the same loop in both forms using the helpers defined above
        ret.append((genForLoop(*args), genWhileLoop(*args)))
    df = pd.DataFrame(ret, columns=["for", "while"])
    return df["for"], df["while"]



pairs = genLoopPairs(16000)
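
For the optional formatting task described in the assignment, one possible way to build (unformatted, formatted) pairs is sketched below. This is only a sketch: `gen_while_loop` mirrors the notebook's `genWhileLoop` template, while `strip_whitespace` and `make_format_pair` are hypothetical helper names that do not appear in the notebook. Removing every whitespace character is safe here only because the generated loops never contain two adjacent tokens that need a separating space (real C code such as `int x` would).

```python
def gen_while_loop(var, init, cond, limit, incr, body):
    # Same template as the notebook's genWhileLoop
    return f"{var} = {init};\nwhile({var}{cond}{limit}) {{\n\t{body}\n\t{var}{incr};\n}}"


def strip_whitespace(code):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines
    return "".join(code.split())


def make_format_pair(*loop_args):
    # Hypothetical helper: input is the squashed code, target is the formatted code
    formatted = gen_while_loop(*loop_args)
    return strip_whitespace(formatted), formatted


source, target = make_format_pair("i", "0", " < ", "10", "++", "sum += x;")
print(repr(source))
print(repr(target))
```

Both strings could then be encoded with the same character-ID scheme used for the for/while pairs.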




In [None]:
INPUT_CHARS = "".join(sorted(set("for"+"".join(vars) + "".join(inits) + "".join(conds) + "".join(limits) + "".join(incrs)+ "".join(bodys)+ "{"+"}")))
INPUT_CHARS

' ()+-.01245;<=>ILMNTUacdefhijlmnoprstuvxyz{}'

And here's the list of possible characters in the outputs:

In [None]:
OUTPUT_CHARS =  "".join(sorted(set("while"+"".join(vars) + "".join(inits) + "".join(conds) + "".join(limits) + "".join(incrs)+ "".join(bodys)+ "{"+"}")))
OUTPUT_CHARS

' ()+-.01245;<=>ILMNTUacdehijlmnoprstuvwxyz{}'

Let's write a function to convert a string to a list of character IDs, as we did in the previous exercise:

In [None]:
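# Convert a loop string to a list of character IDs (ASCII code points)
def loop_to_ids(loop_str, chars=INPUT_CHARS):
    # `chars` is unused: the IDs are ASCII codes, not indices into `chars`
    return [ord(c) for c in loop_str]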

In [None]:
# Encode a batch of loop strings as one zero-padded tensor of character IDs
def prepare_loop_strs(loop_strs, chars=INPUT_CHARS):
    X_ids = [loop_to_ids(lp, chars) for lp in loop_strs]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
    return (X + 1).to_tensor() # using 0 as the padding token ID


def create_dataset(num):
    x, y = genLoopPairs(num)
    return prepare_loop_strs(x, INPUT_CHARS), prepare_loop_strs(y, OUTPUT_CHARS)
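
To see what the padding trick in `prepare_loop_strs` does without TensorFlow, here is a plain-Python sketch of the same encoding: each character becomes its ASCII code plus 1, and shorter sequences are right-padded with 0. The helper name `encode_and_pad` is made up for illustration and is not part of the notebook.

```python
def encode_and_pad(strings, pad_id=0):
    # Shift ASCII codes by 1 so that ID 0 is free to act as the padding token
    id_rows = [[ord(c) + 1 for c in s] for s in strings]
    # Pad every row on the right up to the length of the longest row
    max_len = max(len(row) for row in id_rows)
    return [row + [pad_id] * (max_len - len(row)) for row in id_rows]


batch = encode_and_pad(["x++", "for(i"])
for row in batch:
    print(row)
```

The second row starts with 103, 112, 115, 41, which is how every encoded for loop in this notebook begins (`f`, `o`, `r`, `(`).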

In [None]:
# genLoopPairs() draws from Python's `random` module, so seed it as well
random.seed(42)
np.random.seed(42)

X, Y = create_dataset(16000)

In [None]:
print(X[1])

tf.Tensor(
[103 112 115  41 121  33  62  33  46  53  51  60  33 121  61  62  79  86
  77  77  60  33 121  44  62  50  49  49  42  33 124  11  10 113 115 106
 111 117  41 121  42  60  11 126   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0], shape=(78,), dtype=int32)


In [None]:
print(Y[15000])

tf.Tensor(
[100 118 115 115 102 111 117  33  62  33  49  60  11 120 105 106 109 102
  41 100 118 115 115 102 111 117  33  62  62  33  46  53  54  42  33 124
  11  10 119  98 115  46  60  11  10 100 118 115 115 102 111 117  44  44
  60  11 126   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0], shape=(81,), dtype=int32)


In [None]:
# Contiguous, non-overlapping splits (slice ends are exclusive, so no row is skipped)
x_train = X[0:12000]
y_train = Y[0:12000]
x_val = X[12000:14000]
y_val = Y[12000:14000]
x_test = X[14000:16000]
y_test = Y[14000:16000]
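
One caveat about this split is worth checking. The generator can only produce 10 × 5 × 4 × 6 × 4 × 5 = 24,000 distinct loops, and the 16,000 samples are drawn with replacement. Under uniform sampling, a sizeable share of the validation and test loops (around 39% in expectation, since 1 − e^(−12000/24000) ≈ 0.39) will also appear in the training set. A possible alternative, sketched here under the assumption that exact duplicates should not cross splits, is to enumerate every combination once, shuffle, and slice. The option tuples are copied from the data-generation cell (`loop_vars` stands in for `vars` to avoid shadowing the builtin). The name `all_combinations` and the 75% / 12.5% / 12.5% proportions are illustrative choices, not something the notebook does.

```python
import itertools
import random

loop_vars = ("x", "y", "z", "current", "head", "curr", "i", "idx", "index", "j")
inits = (" 0", "0", " -1000", "-42", "list->head")
conds = (">", "<=", " == ", " < ")
limits = ("10", "42", "NULL", "100000", "-45", "LIMIT")
incrs = ("++", "+=2", "+=100", "-=20")
bodys = ("sum += x;", "var-;", "print(x);", "open(text.txt);", "random(x);")

# Every (var, init, cond, limit, incr, body) combination, each exactly once
all_combinations = list(itertools.product(loop_vars, inits, conds, limits, incrs, bodys))
print(len(all_combinations))

# Shuffle with a local, seeded generator so the split is reproducible
rng = random.Random(42)
rng.shuffle(all_combinations)

n = len(all_combinations)
n_train, n_val = int(0.75 * n), int(0.125 * n)
train = all_combinations[:n_train]
val = all_combinations[n_train:n_train + n_val]
test = all_combinations[n_train + n_val:]
print(len(train), len(val), len(test))
```

Each tuple can then be passed to `genForLoop(*combo)` and `genWhileLoop(*combo)` to render the two strings, so no loop is shared between the training, validation, and test sets.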

In [None]:

embedding_size = 128
max_input_length = x_train.shape[1]
max_output_length = y_train.shape[1]
print(len(INPUT_CHARS), max_input_length, max_output_length)
np.random.seed(42)
tf.random.set_seed(42)

encoder = keras.models.Sequential([
    keras.layers.Embedding(input_dim=128,
                           output_dim=embedding_size,
                           input_shape=[None]),
    keras.layers.LSTM(128)
])

decoder = keras.models.Sequential([
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dense(128, activation="softmax")
])

model = keras.models.Sequential([
    encoder,
    keras.layers.RepeatVector(max_output_length),
    decoder
])
# Switched the optimizer to Adam for better accuracy (the learning rate was too high before)
optimizer = keras.optimizers.Adam()
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(x_train, y_train, epochs=20,
                    validation_data=(x_val, y_val))

45 79 82


In [None]:
y_pred = model.predict(x_test)
y_pred[1][1]

array([1.7304744e-07, 1.0198947e-09, 1.6285319e-09, 2.1573965e-09,
       2.5840559e-09, 1.6212083e-09, 1.2076488e-09, 3.3574763e-09,
       7.2797579e-10, 8.5838375e-10, 1.1703553e-12, 4.8880279e-12,
       2.2960243e-09, 1.8934190e-09, 2.0978035e-09, 4.4858468e-09,
       9.2439292e-09, 1.5568039e-09, 3.2410583e-09, 2.9867027e-09,
       3.4886403e-09, 1.0266419e-09, 1.3375308e-09, 1.1717062e-09,
       2.4631381e-09, 3.3436343e-09, 1.7437438e-09, 1.9186415e-09,
       2.0223887e-09, 4.6646589e-09, 5.1491145e-09, 4.5773598e-09,
       2.4566307e-09, 9.9912602e-01, 8.2602341e-08, 5.1574300e-09,
       1.8407702e-09, 3.2359870e-09, 2.7534535e-09, 1.5073814e-09,
       9.9630126e-10, 5.8545396e-10, 3.1351761e-09, 1.5428513e-09,
       2.8043860e-07, 1.2823175e-09, 2.4611964e-07, 9.7917741e-08,
       1.3746163e-09, 8.7446715e-08, 1.6907691e-08, 1.5795790e-11,
       1.0919523e-09, 9.2292257e-11, 2.2677750e-11, 8.9815216e-10,
       1.9762787e-09, 3.2522791e-09, 1.2279774e-09, 2.0664779e
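
Raw probability vectors like the one above are hard to read. A sketch of greedy decoding follows: take the most likely ID at each time step, drop the padding ID 0, and undo the +1 shift applied in `prepare_loop_strs`. The helpers `ids_from_probs`, `ids_to_loop_str`, and `peaked_row` are hypothetical names, and the tiny probability table is made up so the example runs without the trained model.

```python
def ids_from_probs(prob_rows):
    # Greedy decoding: pick the index of the largest probability at each time step
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


def ids_to_loop_str(ids, pad_id=0):
    # Skip padding positions and undo the +1 shift used when encoding
    return "".join(chr(i - 1) for i in ids if i != pad_id)


def peaked_row(index, size=128, peak=0.99):
    # Made-up probability vector with almost all of its mass on `index`
    row = [(1.0 - peak) / (size - 1)] * size
    row[index] = peak
    return row


# Fake "prediction" for the text "x++;" followed by two padding steps
fake_pred = [peaked_row(ord(c) + 1) for c in "x++;"] + [peaked_row(0)] * 2
decoded = ids_to_loop_str(ids_from_probs(fake_pred))
print(repr(decoded))
```

With the real model, `np.argmax(y_pred, axis=-1)` would give the matrix of predicted IDs, and each row could be passed to `ids_to_loop_str`. In the vector printed above, the largest value sits at index 33, which corresponds to `chr(32)`, a space.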

In [None]:
create_dataset(5)  # the `num` argument is required; 5 is an arbitrary small example

# Some Notes

- We added the Python code from the lab page in the second cell of this project.
- We changed the input to our randomly generated data, produced by the genLoopPairs function.
- We changed the input characters to those of a for loop, adding the variables, conditions, and limits.
- We changed the output characters to those of a while loop, adding the variables, conditions, and limits.
- We performed padding in the prepare_loop_strs function.
- We increased the embedding size and the layer sizes to support ASCII values.
- We reduced the conditions to just greater than, less than, and equals to try to improve the model.
- We ran the model and got up to 93% accuracy.
- We changed the optimizer for better accuracy; the learning rate was too high.
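
A caution about the 93% figure, offered as something to check and not as a measured result: the model has no masking, so the Keras accuracy metric is averaged over every time step. That includes the zero-padded positions at the end of each target, which are usually easy to predict. Two stricter measures are sketched below in plain Python: accuracy over non-padding positions only, and the fraction of sequences that are entirely correct. The function names are hypothetical, and the inputs are toy ID lists, not real predictions.

```python
def plain_accuracy(y_true, y_pred):
    # Fraction of matching positions, padding included
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in pairs) / len(pairs)


def masked_token_accuracy(y_true, y_pred, pad_id=0):
    # Only count positions where the target is a real character
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t != pad_id:
                total += 1
                correct += int(t == p)
    return correct / total if total else 0.0


def exact_match_rate(y_true, y_pred):
    # A sequence counts only if every position matches
    matches = sum(list(t) == list(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy targets and predictions: the second prediction has one wrong character
y_true_toy = [[121, 44, 44, 0, 0], [106, 44, 44, 60, 0]]
y_pred_toy = [[121, 44, 44, 0, 0], [106, 44, 46, 60, 0]]

print(plain_accuracy(y_true_toy, y_pred_toy))
print(masked_token_accuracy(y_true_toy, y_pred_toy))
print(exact_match_rate(y_true_toy, y_pred_toy))
```

On the toy data, plain accuracy is 9/10, masked accuracy is 6/7, and only one of the two sequences is an exact match. This shows how padding can make the headline number look better than the translations actually are.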



