<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.np.edu.sg/PublishingImages/Pages/default/odp/ICT.jpg' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Deep Learning</h1><h2>Assignment 2 - Character Generator Model (Problem 2)</h2><h3>AY2020/21 Semester</h3></th>
    </tr>
</table>

Objective: Develop an English language model RNN capable of generating semi-coherent English sentences.

TODO: Use perplexity as a validation metric
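
For reference, perplexity is the exponential of the mean negative log-likelihood that the model assigns to the true next characters. The following is a minimal sketch under that definition; the helper name `perplexity` and its list-of-probabilities input are assumptions for illustration, not part of this notebook's pipeline.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the true next characters."""
    # mean negative log-likelihood (cross-entropy in nats) over all predictions
    mean_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(mean_nll)


# a model that is always uniform over 4 characters has a perplexity of about 4
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower is better: a perplexity of 1 would mean the model assigns probability 1 to every true character.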

TODO: Use the pretrained Transformer-XL model from https://github.com/kimiyoung/transformer-xl

In [1]:
DATA_DIR = "../data"

In [5]:
%load_ext lab_black
# Import the Required Packages
import os
import mlflow
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt

from git import Repo
from unidecode import unidecode
from minio import Minio
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import Sequence
from tensorflow.keras.layers import (
    Activation,
    LeakyReLU,
    ReLU,
    LSTM,
    GRU,
    BatchNormalization,
    LayerNormalization,
    Dense,
)
from tensorflow.keras.layers import *
from tensorflow.keras.optimizers import *
from tensorflow.keras.callbacks import *
from tensorflow.keras.regularizers import *
from tensorflow.keras import Input, Model
from tensorflow.keras.utils import to_categorical

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


Configure access to MLFlow by setting the following environment variables:
- `MLFLOW_TRACKING_URI` - URL of the MLFlow Tracking server.
- `MLFLOW_S3_ENDPOINT_URL` - URL of the MLFlow S3 backend store.
- `MLFLOW_EXPERIMENT` - Optional. The name of the MLFlow experiment to log to.
- `MINIO_HOST` - Endpoint of the Minio S3 store.
- `AWS_ACCESS_KEY_ID` - MLFlow S3 backend store access key ID.
- `AWS_SECRET_ACCESS_KEY` - MLFlow S3 backend store secret access key.
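
A quick check along these lines can catch a missing variable before the first MLFlow or Minio call fails. This is only a sketch: `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names introduced here, and the list simply mirrors the variables above.

```python
import os

# Variables listed above; MLFLOW_EXPERIMENT is optional and therefore omitted.
REQUIRED_ENV_VARS = [
    "MLFLOW_TRACKING_URI",
    "MLFLOW_S3_ENDPOINT_URL",
    "MINIO_HOST",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_env_vars(env=os.environ):
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```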

`TF_FORCE_GPU_ALLOW_GROWTH` - Forces TensorFlow to allocate GPU memory dynamically
instead of all at once, as a workaround for this
[cuDNN failed to initialize issue](https://github.com/tensorflow/tensorflow/issues/24828).

In [6]:
%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true


Start an MLFlow run with the commit message as the run name.

In [7]:
mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT", "staging"))

In [8]:
repo = Repo(search_parent_directories=True)
mlflow.start_run(run_name=repo.head.commit.message)

<ActiveRun: >

Set up the `minio` client.

In [9]:
minio = Minio(
    endpoint=os.environ["MINIO_HOST"],
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    secure=False,
)

## Step 1 – Data Loading and Processing

### 1.1 Data Loading

Load the text of `THE ADVENTURES OF SHERLOCK HOLMES`:

In [10]:
with open(f"{DATA_DIR}/holmes.txt") as f:
    holmes_text = f.read()

### 1.2 EDA

Take a peek at the first 5 lines:
- The leading Unicode character `\ufeff` (a byte order mark) has to be replaced.

In [11]:
holmes_text.splitlines()[:5]

['\ufeffTHE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity']

### 1.3 Data Preprocessing

Replace unicode characters with ASCII ones:

In [12]:
clean_holmes_text = unidecode(holmes_text)

In [13]:
clean_holmes_text.splitlines()[:5]

['THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity']

To train a character-level language model that predicts the next character, the text has to be processed into:
- context - The list of preceding characters that the model bases its prediction on.
- character - The target character that the model tries to predict, given the preceding context.
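
As a small worked example of this split, the sketch below slides a window over a short string and lists every (context, character) pair. The function name `context_char_pairs` and the sample string are illustrative assumptions; the slicing mirrors what `TextContext` does per example.

```python
def context_char_pairs(text, context_len):
    """List every (context, target character) pair in text."""
    pairs = []
    # the last valid context ends just before the final character
    for start in range(len(text) - context_len):
        end = start + context_len
        pairs.append((text[start:end], text[end]))
    return pairs


pairs = context_char_pairs("SHERLOCK", context_len=3)
# pairs[0] is ("SHE", "R") and pairs[-1] is ("LOC", "K")
```

A text of `n` characters yields `n - context_len` pairs, which is the example count that `TextContext.__len__` divides into batches.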

In [14]:
process_params = {
    "context_len": 10,
}
mlflow.log_params(process_params)

Use a Keras `Sequence` to process the data at model runtime:
- Data preprocessing is computationally inexpensive, so the performance overhead should be minimal.
- Since consecutive contexts overlap significantly, preprocessing all the data up front would result in significant redundant RAM usage.
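
To put a rough number on the redundancy, the sketch below counts how many characters would be held in memory if every (context, character) pair were materialized up front, compared with keeping only the raw text. The helper `stored_characters` and the text length of 1,000 are assumptions chosen for illustration.

```python
def stored_characters(text_len, context_len):
    """Compare characters held in memory: precomputed pairs vs. the raw text."""
    n_examples = text_len - context_len
    # each precomputed example holds context_len context characters plus 1 target
    eager = n_examples * (context_len + 1)
    # slicing on demand only needs the original text
    lazy = text_len
    return eager, lazy


eager, lazy = stored_characters(text_len=1_000, context_len=10)
# eager == 10_890 and lazy == 1_000, roughly an 11x difference
```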

In [15]:
class TextContext(Sequence):
    def __init__(self, text, batch_size, process_params):
        """
        Create a text context data sequence useful for character level language models.

        Args:
            text: Text to generate from.
            batch_size: No. of examples per batch.
            process_params: Data processing parameters
        """
        self.text = text
        self.batch_size = batch_size
        self.context_len = process_params["context_len"]

    def __len__(self):
        # no. of examples: (context, char) pairs
        context_char_len = self.context_len + 1
        n_examples = (len(self.text) - context_char_len) + 1
        return n_examples // self.batch_size

    def __getitem__(self, batch_idx):
        """Process text context data for the given batch at index batch_idx"""
        contexts = []
        target_chars = []

        for i_example in range(self.batch_size):
            # calculate where the context starts and ends in the text
            context_start = batch_idx * self.batch_size + i_example
            context_end = context_start + self.context_len
            # extract context vector
            context = self.text[context_start:context_end]
            contexts.append(context)

            # use the character right after the context as the prediction target
            target_char = self.text[context_end]
            target_chars.append(target_char)

        return np.asarray(contexts), np.asarray(target_chars)

In [16]:
holmes_ctx = TextContext(
    text=clean_holmes_text, batch_size=32, process_params=process_params
)

In [17]:
%%timeit
context_batch, char_batch = holmes_ctx[0]

15.4 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
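
`TextContext` yields raw characters, whereas the model built in Step 2 takes numeric input of shape `(context_len, 1)`. One possible way to bridge the two is a character vocabulary that maps each distinct character to an integer index. The sketch below is an assumption about how that could look; `build_vocab` and `encode` are hypothetical helpers, not functions defined elsewhere in this notebook.

```python
def build_vocab(text):
    """Map each distinct character to an integer index, and back."""
    chars = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def encode(context, char_to_idx):
    """Turn a context string into a list of integer indices."""
    return [char_to_idx[ch] for ch in context]


char_to_idx, idx_to_char = build_vocab("holmes")
encoded = encode("home", char_to_idx)
# decoding the indices recovers the original string
decoded = "".join(idx_to_char[i] for i in encoded)
```

The vocabulary size, `len(char_to_idx)`, would then be a natural choice for `n_classes`.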


## Step 2 – Develop Character Generator Model

Define utility functions for building a model.

In [None]:
def dense_classifier(
    in_op,
    n_dense_units=128,
    n_classes=10,
    use_batch_norm=True,
    dropout_prob=0.0,
    l2_reg=None,
    dense_activation="relu",
    **activation_params,
):
    """Append a dense classifier block to the given in_op"""
    x = in_op

    # disable hidden dense layer if n_dense_units < 1
    if n_dense_units >= 1:
        x = Dense(
            units=n_dense_units,
            kernel_regularizer=None if l2_reg is None else l2(l=l2_reg),
        )(x)

        if use_batch_norm:
            x = BatchNormalization()(x)

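        x = Activation(dense_activation)(x)

        # optional dropout on the hidden dense layer
        if dropout_prob > 0.0:
            x = Dropout(rate=dropout_prob)(x)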
    # classifier output layer
    x = Dense(
        units=n_classes,
        activation="softmax",
        kernel_regularizer=None if l2_reg is None else l2(l=l2_reg),
    )(x)
    return x

In [None]:
def rnn_block(
    in_op,
    rnn_cell="lstm",
    n_rnn_units=64,
    rnn_activation="tanh",
    use_layer_norm=True,
    return_sequences=False,
    dropout_prob=0.0,
    l2_reg=None,
    **activation_params,
):
    """Append an RNN (LSTM or GRU) layer, optionally followed by layer normalization, to the given in_op"""
    rnn_params = {
        "units": n_rnn_units,
        "activation": rnn_activation,
        "recurrent_dropout": dropout_prob,
        "kernel_regularizer": None if l2_reg is None else l2(l=l2_reg),
        "return_sequences": return_sequences,
    }

    x = in_op
    if rnn_cell == "lstm":
        x = LSTM(**rnn_params)(x)
    elif rnn_cell == "gru":
        x = GRU(**rnn_params)(x)
    else:
        raise NotImplementedError(f"Unsupported RNN cell: {rnn_cell}")

    if use_layer_norm:
        x = LayerNormalization()(x)

    return x

### Building the Model

In [None]:
def build_model(
    input_shape=(process_params["context_len"], 1),
    n_classes=10,
    n_dense_units=0,
    use_batch_norm=True,
    rnn_cell="lstm",
    n_rnn_layers=3,
    n_rnn_units=64,
    rnn_activation="tanh",
    dense_activation="relu",
    use_layer_norm=True,
    **block_params
):
    """Build a character level language model: stacked RNN blocks followed by a dense classifier"""
    in_op = Input(shape=input_shape)
    x = in_op

    for i_layer in range(1, n_rnn_layers + 1):
        x = rnn_block(
            in_op=x,
            rnn_cell=rnn_cell,
            n_rnn_units=n_rnn_units,
            rnn_activation=rnn_activation,
            use_layer_norm=use_layer_norm,
            # return sequences except last rnn layer
            return_sequences=i_layer < n_rnn_layers,
            **block_params,
        )

    x = dense_classifier(
        in_op=x,
        n_classes=n_classes,
        n_dense_units=n_dense_units,
        use_batch_norm=use_batch_norm,
        dense_activation=dense_activation,
        **block_params,
    )

    return Model(
        inputs=in_op,
        outputs=x,
    )

In [None]:
build_model().summary()

### Model #1

In [None]:
# model.save('chgen_model_1.h5')

### Model #2

In [None]:
# model.save('chgen_model_2.h5')

### Recommend the Best Model

In [None]:
# Save the Best Model
# model.save('chgen_model_best.h5')

## Step 3 – Use the Best Model to generate the characters / sentences

In [None]:
# model.load_weights('chgen_model_best.h5')

In [None]:
# takes the user input
# text_input = np.array([input()])

In [None]:
# encode the user input

In [None]:
# Use the Best Model to generate 400 characters
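
The generation step can be sketched as a loop that repeatedly feeds the most recent context to the model and appends one sampled character. In the sketch below, `predict_next` is a hypothetical stand-in for the trained model (it returns a `{character: probability}` dict), and `toy_model` exists only to make the example runnable.

```python
import random


def generate(seed_text, predict_next, context_len, n_chars, rng=None):
    """Extend seed_text by n_chars characters, one sampled character at a time.

    predict_next is a hypothetical stand-in for the trained model: it takes the
    current context string and returns a {character: probability} dict.
    """
    rng = rng or random.Random()
    text = seed_text
    for _ in range(n_chars):
        # the model only ever sees the most recent context_len characters
        context = text[-context_len:]
        probs = predict_next(context)
        chars = list(probs)
        weights = [probs[ch] for ch in chars]
        # sample from the predicted distribution rather than always taking the top character
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text


def toy_model(context):
    """Toy predictor, for illustration only: alternates between 'a' and 'b'."""
    return {"b": 1.0} if context.endswith("a") else {"a": 1.0}


sample = generate("a", toy_model, context_len=10, n_chars=5, rng=random.Random(0))
```

With the real model, `predict_next` would encode the context, call the model, and map the softmax output back to characters; `n_chars=400` would match the target above.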