<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.np.edu.sg/PublishingImages/Pages/default/odp/ICT.jpg' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Deep Learning</h1><h2>Assignment 2 - Character Generator Model (Problem 2)</h2><h3>AY2020/21 Semester</h3></th>
    </tr>
</table>

Objective: Develop an English language model RNN capable of generating semi-coherent English sentences.

TODO: Use perplexity as a validation metric
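As a rough sketch of what this metric could look like: perplexity is the exponential of the mean negative log-likelihood the model assigns to the true next characters. Since Keras' `categorical_crossentropy` uses the natural logarithm, `exp(loss)` should give the same value. The helper name `perplexity` below is hypothetical and not part of the `modeling` module.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities assigned to the true next characters."""
    # mean negative log-likelihood (cross-entropy in nats) over all predictions
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


# a model that spreads its probability uniformly over 81 characters
# is exactly as "perplexed" as a uniform guess among 81 options
uniform_ppl = perplexity([1 / 81] * 10)
```

A lower perplexity means the model concentrates more probability on the correct character; a uniform guess over the charset is the natural baseline to beat.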

TODO: Use the pretrained Transformer-XL model from https://github.com/kimiyoung/transformer-xl

In [1]:
DATA_DIR = "../data"

In [2]:
# autoformat code on cell run.
%load_ext lab_black
# autoreload imported modules on change
%load_ext autoreload
%autoreload 2

# Import the Required Packages
import os
import mlflow
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt

from git import Repo
from unidecode import unidecode
from minio import Minio
from pprint import pprint
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical, Sequence
from modeling import (
    dense_classifier,
    rnn_block,
    compile_callbacks,
    build_model,
    train_eval_model,
)

Configure access to MLFlow by setting the following environment variables:
- `MLFLOW_TRACKING_URI` - URL to the MLFlow Tracking server.
- `MLFLOW_S3_ENDPOINT_URL` - URL to the MLFlow S3 Backend Store
- `MLFLOW_EXPERIMENT` - Optional. The name of the MLFlow experiment to log to.
- `MINIO_HOST` - Endpoint of the Minio S3 store.
- `AWS_ACCESS_KEY_ID` - MLFlow S3 backend store access key ID.
- `AWS_SECRET_ACCESS_KEY` - MLFlow S3 backend store secret access key.
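One way to fail early when a variable from the list above is missing is a small check like the following. This is only a sketch: `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names, and `MLFLOW_EXPERIMENT` is left out because it is optional.

```python
import os

# variables the notebook reads directly or that MLFlow needs (names from the list above)
REQUIRED_ENV_VARS = [
    "MLFLOW_TRACKING_URI",
    "MLFLOW_S3_ENDPOINT_URL",
    "MINIO_HOST",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_env_vars(environ=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("missing environment variables:", ", ".join(missing))
```

Running this before `mlflow.set_experiment` would surface a configuration problem as a readable message instead of a `KeyError` or connection error later in the notebook.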

`TF_FORCE_GPU_ALLOW_GROWTH` - Forces Tensorflow to allocate GPU memory dynamically
instead of all at once, as a workaround for this
[cuDNN failed to initialize issue](https://github.com/tensorflow/tensorflow/issues/24828).

In [3]:
%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true


Start an MLFlow run with the message of the current commit as the run name.

In [4]:
mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT", "staging"))

In [5]:
repo = Repo(search_parent_directories=True)
mlflow.start_run(run_name=repo.head.commit.message)

<ActiveRun: >

Setup `minio` client.

In [6]:
minio = Minio(
    endpoint=os.environ["MINIO_HOST"],
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    secure=False,
)

## Step 1 – Data Loading and Processing

### 1.1 Data Loading

Load the text of `THE ADVENTURES OF SHERLOCK HOLMES`:

In [7]:
with open(f"{DATA_DIR}/holmes.txt") as f:
    holmes_text = f.read()

### 1.2 EDA

Take a peek at the first 6 lines:
- the leading unicode character `\ufeff` (a byte order mark) has to be removed.

In [8]:
holmes_text.splitlines()[:6]

['\ufeffTHE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity',
 '  IV. The Boscombe Valley Mystery']
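The leading `\ufeff` in the first line is a UTF-8 byte order mark. As a minimal sketch, assuming the file is UTF-8 encoded, Python's `utf-8-sig` codec drops it at read time. The temporary file below is a stand-in for `holmes.txt`, not the real data.

```python
import os
import tempfile

# write a small UTF-8 file that starts with a BOM (stand-in for holmes.txt)
path = os.path.join(tempfile.mkdtemp(), "bom_example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeffTHE ADVENTURES")

# plain utf-8 keeps the BOM as the first character of the text
with open(path, encoding="utf-8") as f:
    raw = f.read()

# utf-8-sig strips a leading BOM if one is present
with open(path, encoding="utf-8-sig") as f:
    stripped = f.read()

print(repr(raw[0]), repr(stripped[0]))
```

This notebook instead removes the character later with `unidecode`, which also handles the accented characters in the text.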

- No. of unique characters:

In [9]:
unique_chars = np.unique(list(holmes_text))
print(unique_chars)
# -1 for the `\ufeff` character to be removed.
n_unique_chars = len(unique_chars) - 1
print("no. of unique characters:", n_unique_chars)

['\n' ' ' '!' '"' '&' "'" '(' ')' ',' '-' '.' '/' '0' '1' '2' '3' '4' '5'
 '6' '7' '8' '9' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' 'a' 'b' 'c'
 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u'
 'v' 'w' 'x' 'y' 'z' 'à' 'â' 'è' 'é' '\ufeff']
no. of unique characters: 81


### 1.3 Data Preprocessing

Replace unicode characters with ASCII ones:

In [10]:
clean_holmes_text = unidecode(holmes_text)

In [11]:
clean_holmes_text.splitlines()[:6]

['THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity',
 '  IV. The Boscombe Valley Mystery']

Compile a mapping from each character to an integer index:

In [12]:
charset = {char: i for i, char in enumerate(unique_chars)}
print(dict(list(charset.items())[:6]), "...")

{'\n': 0, ' ': 1, '!': 2, '"': 3, '&': 4, "'": 5} ...


To train a character-level language model that predicts the next character, the text has to be processed into:
- context - The preceding characters that the model bases its prediction on.
- character - The target character that the model tries to predict given the preceding context.
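The context/target split can be illustrated on a toy string. This is a simplified sketch with a hypothetical helper `context_target_pairs` and a context length of 3 instead of 500; it materializes the substrings directly, whereas the notebook only stores their positions.

```python
def context_target_pairs(text, context_len):
    """Yield (context, target) pairs by sliding a window over the text."""
    for start in range(len(text) - context_len):
        end = start + context_len
        # the context is the window, the target is the character right after it
        yield text[start:end], text[end]


pairs = list(context_target_pairs("holmes", context_len=3))
for context, target in pairs:
    print(repr(context), "->", repr(target))
```

Each window overlaps the previous one in all but one character, which is why storing only start and end positions is much cheaper than storing the contexts themselves.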

In [13]:
process_params = {
    "context_len": 500,
    "charset_size": n_unique_chars,
}
mlflow.log_params(process_params)

In [14]:
def generate_context_char_idx(text, process_params):
    """Generate index for context and char for training a character level language model"""
    # calculate no. of examples: (context, char) pairs
    context_char_len = process_params["context_len"] + 1
    n_examples = len(text) - context_char_len

    ctx_positions = []
    target_char_idxs = []
    for i_example in range(n_examples):
        # collect the positions where the context substring starts and ends
        context_start = i_example
        context_end = context_start + process_params["context_len"]
        ctx_positions.append((context_start, context_end))

        # collect index of target prediction char
        target_char_idxs.append(context_end)

    return ctx_positions, target_char_idxs

In [15]:
# index against the cleaned text, since `TextContext` later slices `clean_holmes_text`
ctx_positions, target_char_idxs = generate_context_char_idx(
    clean_holmes_text, process_params
)

Split the generated indices into train, validation and test subsets, reserving 5000 examples for the test set and 2500 examples for the validation set:

In [16]:
n_test = 5000
n_validation = 2500

mlflow.log_param("test_size", n_test)
(
    train_valid_ctx_positions,
    test_ctx_positions,
    train_valid_char_idxs,
    test_char_idxs,
) = train_test_split(ctx_positions, target_char_idxs, test_size=n_test)
(
    train_ctx_positions,
    valid_ctx_positions,
    train_char_idxs,
    valid_char_idxs,
) = train_test_split(
    train_valid_ctx_positions, train_valid_char_idxs, test_size=n_validation
)

Use a Keras `Sequence` to process the data at model runtime:
- Extracting the context and the target char is computationally inexpensive, so the performance overhead should be minimal.
- Since there is a significant amount of overlap between context characters, preprocessing all the data up front would result in significant redundant RAM usage.
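A back-of-the-envelope estimate makes the RAM argument concrete. The sketch below assumes `float32` one-hot values (the default dtype of `to_categorical`) and a hypothetical count of 250,000 examples; the real count depends on the length of the text.

```python
def one_hot_bytes(n_examples, context_len, charset_size, bytes_per_value=4):
    """Bytes needed to hold one-hot encoded contexts as float32 values."""
    return n_examples * context_len * charset_size * bytes_per_value


n_examples = 250_000  # hypothetical figure, for illustration only

# encoding every context up front vs. encoding a single batch of 512 on demand
eager_gib = one_hot_bytes(n_examples, context_len=500, charset_size=81) / 2**30
batch_mib = one_hot_bytes(512, context_len=500, charset_size=81) / 2**20

print(f"all contexts up front: {eager_gib:.1f} GiB")
print(f"one batch of 512:      {batch_mib:.1f} MiB")
```

Under these assumptions the eager approach needs tens of gigabytes, while a single batch stays well under 100 MiB.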

In [17]:
class TextContext(Sequence):
    def __init__(
        self, ctx_positions, char_idxs, text, batch_size, charset, process_params
    ):
        """
        Create a text context data sequence useful for character level language models.

        Args:
            ctx_positions: List of (start, end) positions of each context in the text.
            char_idxs: List of indices of the target prediction characters in the text.
            text: Text to generate from.
            batch_size: Size of the batches of data to generate.
            charset: Dictionary mapping each character to the int used to encode it.
            process_params: Additional Data processing parameters.
        """
        self.ctx_positions = ctx_positions
        self.char_idxs = char_idxs
        assert len(ctx_positions) == len(char_idxs)

        self.text = text
        self.batch_size = batch_size
        self.charset = charset
        self.params = process_params

    @property
    def context_shape(self):
        """Return the shape of the processed context vector"""
        return (self.params["context_len"], self.params["charset_size"])

    def __len__(self):
        # no. of full batches; a trailing partial batch is dropped
        return len(self.ctx_positions) // self.batch_size

    def __getitem__(self, batch_idx):
        """Process text context data for the given batch at index batch_idx"""
        contexts = []
        target_chars = []

        ctx_positions_batch = self.ctx_positions[
            batch_idx * self.batch_size : (batch_idx + 1) * self.batch_size
        ]
        char_idxs_batch = self.char_idxs[
            batch_idx * self.batch_size : (batch_idx + 1) * self.batch_size
        ]

        for context_pos, target_char_idx in zip(ctx_positions_batch, char_idxs_batch):

            context_start, context_end = context_pos
            # extract context vector
            context = self.text[context_start:context_end]
            context_int = [self.charset[c] for c in context]
            contexts.append(context_int)

            # extract target char
            target_char = self.text[target_char_idx]
            target_chars.append(self.charset[target_char])

        n_classes = self.params["charset_size"]
        return (
            to_categorical(contexts, num_classes=n_classes),
            to_categorical(target_chars, num_classes=n_classes),
        )

In [18]:
train_holmes_ctx = TextContext(
    ctx_positions=train_ctx_positions,
    char_idxs=train_char_idxs,
    batch_size=512,
    text=clean_holmes_text,
    charset=charset,
    process_params=process_params,
)

In [19]:
%%timeit
train_holmes_ctx.__getitem__(0)

50 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [20]:
train_holmes_ctx.__getitem__(0)[0].shape

(512, 500, 81)

## Step 2 – Develop Character Generator Model

### Building the Model

In [21]:
build_model(
    input_shape=train_holmes_ctx.context_shape,
    n_classes=process_params["charset_size"],
    rnn_bidirectional=True,
).summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 500, 81)]         0         
_________________________________________________________________
bidirectional (Bidirectional (None, 500, 128)          74752     
_________________________________________________________________
layer_normalization (LayerNo (None, 500, 128)          256       
_________________________________________________________________
bidirectional_1 (Bidirection (None, 500, 128)          98816     
_________________________________________________________________
layer_normalization_1 (Layer (None, 500, 128)          256       
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               98816     
_________________________________________________________________
layer_normalization_2 (Layer (None, 128)               256   

### Train the Model

In [22]:
batch_size = 1028

train_holmes_ctx = TextContext(
    ctx_positions=train_ctx_positions,
    char_idxs=train_char_idxs,
    batch_size=batch_size,
    text=clean_holmes_text,
    charset=charset,
    process_params=process_params,
)

valid_holmes_ctx = TextContext(
    ctx_positions=valid_ctx_positions,
    char_idxs=valid_char_idxs,
    batch_size=batch_size,
    text=clean_holmes_text,
    charset=charset,
    process_params=process_params,
)

test_holmes_ctx = TextContext(
    ctx_positions=test_ctx_positions,
    char_idxs=test_char_idxs,
    batch_size=batch_size,
    text=clean_holmes_text,
    charset=charset,
    process_params=process_params,
)

In [23]:
# train on multiple GPUs
multi_gpu = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)

with multi_gpu.scope():
    model = train_eval_model(
        train_data=[train_holmes_ctx],
        validation_data=valid_holmes_ctx,
        test_data=[test_holmes_ctx],
        build_model_fn=build_model,
        n_classes=n_unique_chars,
        tags={
            "project": "dl-assign-2",
            "part": "1",
            "model": "sentiment",
        },
        input_shape=train_holmes_ctx.context_shape,
        git_repo=Repo(search_parent_directories=True),
        run_name=None,
        epochs=20,
        lr=1e-3,
        optimizer="adam",
        sgd_momentum=0.9,
        loss="categorical_crossentropy",
        metrics=[
            "accuracy",
        ],
        reduce_lr_stuck=False,
        reduce_lr_patience=5,
        reduce_lr_factor=0.5,
        batch_size=batch_size,
        dropout_prob=0.0,
        l2_reg=None,
        rnn_cell="lstm",
        n_rnn_units=64,
        n_rnn_layers=2,
        rnn_bidirectional=True,
        rnn_activation="tanh",
        use_layer_norm=True,
        n_dense_units=0,
        use_batch_norm=True,
        dense_activation="relu",
    )
mlflow.end_run()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
  ...
    to  
  ['...']
  ...
    to  
  ['...']
Train for 270 steps, validate for 1 steps
Epoch 1/20
INFO:tensorflow:batch_all_reduce: 18 all-reduces with algorithm = hierarchical_copy, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce 

InternalError:  [_Derived_]  Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 64, 1, 500, 1024, 64] 
	 [[{{node gradients/CudnnRNN_grad/CudnnRNNBackprop}}]]
	 [[replica_1/StatefulPartitionedCall]]
	 [[Identity_2/_134]] [Op:__inference_distributed_function_23007]

Function call stack:
distributed_function -> distributed_function -> distributed_function


### Model #1

In [None]:
# model.save('chgen_model_1.h5')

### Model #2

In [None]:
# model.save('chgen_model_2.h5')

### Recommend the Best Model

In [None]:
# Save the Best Model
# model.save('chgen_model_best.h5')

## Step 3 – Use the Best Model to generate the characters / sentences

In [None]:
# model.load_weights('chgen_model_best.h5')

In [None]:
# takes the user input
# text_input = np.array([input()])

In [None]:
# encode the user input

In [None]:
# Use the Best Model to generate 400 characters
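The generation step could follow a loop like the one below. This is only a sketch under stated assumptions: `generate`, `predict_fn` and `index_to_char` are hypothetical names, and `predict_fn` stands in for a wrapper that encodes (and, for short seeds, pads) the context and calls the trained model. A uniform stand-in model is used so that the loop runs without any trained weights.

```python
import random


def generate(seed, predict_fn, index_to_char, n_chars=400, context_len=500, rng=random):
    """Extend `seed` by sampling `n_chars` characters, one at a time."""
    text = seed
    for _ in range(n_chars):
        # the model only ever sees the most recent `context_len` characters
        context = text[-context_len:]
        probs = predict_fn(context)  # one probability per character index
        # sample the next character index in proportion to its probability
        (next_idx,) = rng.choices(range(len(probs)), weights=probs)
        text += index_to_char[next_idx]
    return text


# stand-in for a trained model: always predicts a uniform distribution
index_to_char = {0: "a", 1: "b", 2: " "}
uniform_model = lambda context: [1 / 3] * 3

sample = generate("ab", uniform_model, index_to_char, n_chars=10)
print(repr(sample))
```

Sampling from the predicted distribution, rather than always taking the most likely character, tends to avoid repetitive loops in the generated text; taking the argmax instead would be the greedy alternative.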