<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.np.edu.sg/PublishingImages/Pages/default/odp/ICT.jpg' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Deep Learning</h1><h2>Assignment 2 - Character Generator Model (Problem 2)</h2><h3>AY2020/21 Semester</h3></th>
    </tr>
</table>

Objective: Develop an English language model RNN capable of generating semi-coherent English sentences.

TODO: Use perplexity as a validation metric
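
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the true next characters. The following is a minimal sketch of that formula, not the notebook's implementation; the function name `perplexity` and its input format (a list of probabilities, one per true target character) are assumptions made for illustration.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities assigned to each true next character."""
    # average negative log-likelihood (cross-entropy in nats)
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


# a model that spreads its belief uniformly over 4 characters
# is as uncertain as a 4-way guess
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower is better: a perplexity close to the charset size means the model is barely better than guessing uniformly.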

TODO: Use the pretrained Transformer-XL model from https://github.com/kimiyoung/transformer-xl

In [2]:
DATA_DIR = "../data"

In [37]:
# autoformat code on cell run.
%load_ext lab_black
# autoreload imported modules on change
%load_ext autoreload
%autoreload 2

# Import the Required Packages
import os
import mlflow
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt

from git import Repo
from unidecode import unidecode
from minio import Minio
from pprint import pprint
from zipfile import ZipFile
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import Sequence
from tensorflow.keras.utils import to_categorical
from modeling import (
    dense_classifier,
    rnn_block,
    compile_callbacks,
    build_model,
    train_eval_model,
)



Configure access to MLFlow by setting the following environment variables:
- `MLFLOW_TRACKING_URI` - URL to the MLFlow Tracking server.
- `MLFLOW_S3_ENDPOINT_URL` - URL to the MLFlow S3 Backend Store
- `MLFLOW_EXPERIMENT` - Optional. The name of the MLFlow experiment to log to.
- `MINIO_HOST` - Endpoint of the Minio S3 store.
- `AWS_ACCESS_KEY_ID` - MLFlow S3 backend store access key ID.
- `AWS_SECRET_ACCESS_KEY` - MLFlow S3 backend store secret access key.
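
Since a missing variable only surfaces later as a `KeyError` or a failed connection, it may help to check the required variables up front. This is a small sketch under that assumption; the helper `missing_env_vars` is hypothetical and not part of the notebook or of MLFlow.

```python
import os

# variables from the list above that have no default
REQUIRED_VARS = [
    "MLFLOW_TRACKING_URI",
    "MLFLOW_S3_ENDPOINT_URL",
    "MINIO_HOST",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("missing environment variables:", ", ".join(missing))
```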

`TF_FORCE_GPU_ALLOW_GROWTH` - Forces Tensorflow to allocate GPU memory dynamically
instead of all at once, as a workaround for this
[cuDNN failed to initialize issue](https://github.com/tensorflow/tensorflow/issues/24828).

In [4]:
%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true


Start an MLFlow run, using the message of the current commit as the run name.

In [5]:
mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT", "staging"))

In [6]:
repo = Repo(search_parent_directories=True)
mlflow.start_run(run_name=repo.head.commit.message)

<ActiveRun: >

Set up the `minio` client.

In [7]:
minio = Minio(
    endpoint=os.environ["MINIO_HOST"],
    access_key=os.environ["AWS_ACCESS_KEY_ID"],
    secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    secure=False,
)

## Step 1 – Data Loading and Processing

### 1.1 Data Loading

Load the text of `THE ADVENTURES OF SHERLOCK HOLMES`:

In [8]:
with open(f"{DATA_DIR}/holmes.txt") as f:
    holmes_text = f.read()

### 1.2 EDA

Take a peek at the first 6 lines:
- The unicode character `\ufeff` (a byte order mark) at the start of the text has to be removed.

In [9]:
holmes_text.splitlines()[:6]

['\ufeffTHE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity',
 '  IV. The Boscombe Valley Mystery']

- No. of unique characters:

In [45]:
unique_chars = np.unique(list(holmes_text))
print(unique_chars)
# -1 for the `\ufeff` character to be removed.
n_unique_chars = len(unique_chars) - 1
print("no. of unique characters:", n_unique_chars)

['\n' ' ' '!' '"' '&' "'" '(' ')' ',' '-' '.' '/' '0' '1' '2' '3' '4' '5'
 '6' '7' '8' '9' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K'
 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' 'a' 'b' 'c'
 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u'
 'v' 'w' 'x' 'y' 'z' 'à' 'â' 'è' 'é' '\ufeff']
no. of unique characters: 81


### 1.3 Data Preprocessing

Replace unicode characters with ASCII ones:

In [11]:
clean_holmes_text = unidecode(holmes_text)

In [12]:
clean_holmes_text.splitlines()[:6]

['THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE',
 '',
 '   I. A Scandal in Bohemia',
 '  II. The Red-headed League',
 ' III. A Case of Identity',
 '  IV. The Boscombe Valley Mystery']
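
For comparison, a rough standard-library approximation of this cleanup is sketched below. It is not what `unidecode` does internally: it decomposes accented characters with `unicodedata` and drops anything that is still not ASCII, so characters without a decomposition are deleted rather than transliterated. The function name `to_ascii` is an assumption for illustration.

```python
import unicodedata


def to_ascii(text):
    """Strip accents and drop any remaining non-ASCII characters."""
    # NFKD splits e.g. 'é' into 'e' followed by a combining accent
    decomposed = unicodedata.normalize("NFKD", text)
    # encoding with errors="ignore" discards the accents and the BOM
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\ufeffd\u00e9j\u00e0 vu"))
```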

Compile a mapping from each character to an integer index:

In [55]:
charset = {char: i for i, char in enumerate(unique_chars)}
print(dict(list(charset.items())[:6]), "...")

{'\n': 0, ' ': 1, '!': 2, '"': 3, '&': 4, "'": 5} ...
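
Generated indices eventually have to be turned back into text, which needs the inverse lookup as well. The sketch below shows the round trip on a toy string; the names `build_charset`, `toy_charset` and `index_to_char` are hypothetical and kept separate from the notebook's `charset`.

```python
def build_charset(text):
    """Map each unique character of `text` to an integer index."""
    return {char: i for i, char in enumerate(sorted(set(text)))}


toy_charset = build_charset("hello world")
# invert the mapping so that indices can be decoded back to characters
index_to_char = {i: char for char, i in toy_charset.items()}

encoded = [toy_charset[c] for c in "hello"]
decoded = "".join(index_to_char[i] for i in encoded)
print(encoded, decoded)
```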


To train a character-level language model that predicts the next character, the text has to be processed into:
- context - The preceding characters that the model bases its prediction on.
- character - The target character that the model tries to predict, given the preceding context.
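
The pairing can be pictured as a window sliding over the text one character at a time. This is a minimal sketch on plain strings, without the integer or one-hot encoding; the function name `context_pairs` is an assumption for illustration.

```python
def context_pairs(text, context_len):
    """Return every (context, next character) pair in `text`."""
    pairs = []
    # each example needs context_len characters plus one target character
    for start in range(len(text) - context_len):
        end = start + context_len
        pairs.append((text[start:end], text[end]))
    return pairs


for context, target in context_pairs("hello", context_len=3):
    print(repr(context), "->", repr(target))
```

Consecutive contexts share all but one character, which is why the examples overlap so heavily.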

In [17]:
process_params = {
    "context_len": 500,
    "charset_size": n_unique_chars,
}
mlflow.log_params(process_params)

Use a Keras `Sequence` to process the data at model runtime:
- Data preprocessing is computationally inexpensive, so the performance overhead should be minimal.
- Since consecutive contexts overlap almost entirely, preprocessing all the data up front would use a large amount of RAM on redundant copies.
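
A back-of-the-envelope estimate illustrates the RAM argument. The sketch below assumes a hypothetical text of 500,000 characters (not the measured length of the corpus) and 4 bytes per value, which matches the default `float32` output of `to_categorical`; the function name `onehot_bytes` is made up for illustration.

```python
def onehot_bytes(text_len, context_len, charset_size, bytes_per_value=4):
    """Bytes needed to hold every one-hot encoded context at once."""
    n_examples = text_len - context_len
    return n_examples * context_len * charset_size * bytes_per_value


# hypothetical text length; context_len and charset_size as in process_params
total = onehot_bytes(text_len=500_000, context_len=500, charset_size=81)
print(f"{total / 1e9:.1f} GB")
```

Under these assumptions the materialized contexts would need tens of gigabytes, while a single batch of 32 examples stays in the low megabytes.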

In [57]:
class TextContext(Sequence):
    def __init__(self, text, batch_size, charset, process_params):
        """
        Create a text context data sequence useful for character level language models.

        Args:
            text: Text to generate from.
            batch_size: No. of examples in each batch.
            charset: Dictionary mapping character to int.
            process_params: Data processing parameters
        """
        self.text = text
        self.batch_size = batch_size
        self.charset = charset
        self.params = process_params

    @property
    def context_shape(self):
        """Return the shape of the processed context vector"""
        return (self.params["context_len"], self.params["charset_size"])

    def __len__(self):
        # no. of examples: (context, char) pairs
        context_char_len = self.params["context_len"] + 1
        n_examples = (len(self.text) - context_char_len) + 1
        return n_examples // self.batch_size

    def __getitem__(self, batch_idx):
        """Process text context data for the given batch at index batch_idx"""
        contexts = []
        target_chars = []

        for i_example in range(self.batch_size):
            # calculate the positions where the context starts and ends
            context_start = batch_idx * self.batch_size + i_example
            context_end = context_start + self.params["context_len"]
            # extract context vector
            context = self.text[context_start:context_end]
            context_int = [self.charset[c] for c in context]
            # one hot encode the context
            ctx_one_hot = to_categorical(
                context_int, num_classes=self.params["charset_size"]
            )
            contexts.append(ctx_one_hot)

            # use the character right after the context as the prediction target,
            # encoded as its integer index in the charset
            target_char = self.text[context_end]
            target_chars.append(self.charset[target_char])

        return np.asarray(contexts), np.asarray(target_chars)

In [68]:
holmes_ctx = TextContext(
    batch_size=32,
    text=clean_holmes_text,
    charset=charset,
    process_params=process_params,
)

In [69]:
%%timeit
holmes_ctx.__getitem__(0)

2.42 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [70]:
context_batch, char_batch = holmes_ctx[0]
context_batch.shape

(32, 500, 81)

## Step 2 – Develop Character Generator Model

### Building the Model

In [73]:
build_model(
    input_shape=holmes_ctx.context_shape,
    n_classes=process_params["charset_size"],
).summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 500, 81)]         0         
_________________________________________________________________
lstm (LSTM)                  (None, 500, 64)           37376     
_________________________________________________________________
layer_normalization (LayerNo (None, 500, 64)           128       
_________________________________________________________________
lstm_1 (LSTM)                (None, 500, 64)           33024     
_________________________________________________________________
layer_normalization_1 (Layer (None, 500, 64)           128       
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                33024     
_________________________________________________________________
layer_normalization_2 (Layer (None, 64)                128   

### Model #1

In [None]:
# model.save('chgen_model_1.h5')

### Model #2

In [None]:
# model.save('chgen_model_2.h5')

### Recommend the Best Model

In [None]:
# Save the Best Model
# model.save('chgen_model_best.h5')

## Step 3 – Use the Best Model to generate the characters / sentences

In [None]:
# model.load_weights('chgen_model_best.h5')

In [None]:
# takes the user input
# text_input = np.array([input()])

In [None]:
# encode the user input

In [None]:
# Use the Best Model to generate 400 characters
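
Generation is typically autoregressive: predict one character, append it to the text, slide the context window forward, and repeat. The sketch below shows only that loop. `predict_next` is a hypothetical stand-in for encoding the context, calling the trained model and decoding its prediction, and `echo_first_char` is a dummy predictor so the example runs without a model.

```python
def generate(seed, n_chars, context_len, predict_next):
    """Extend `seed` by `n_chars` characters, one prediction at a time."""
    text = seed
    for _ in range(n_chars):
        # the model only ever sees the most recent context_len characters
        context = text[-context_len:]
        text += predict_next(context)
    return text


def echo_first_char(context):
    """Dummy predictor: repeat the oldest character in the context."""
    return context[0]


print(generate("abc", n_chars=4, context_len=3, predict_next=echo_first_char))
```

With the real model, `n_chars` would be 400 and `context_len` would match `process_params["context_len"]`; a seed shorter than the context would need padding, which this sketch does not handle.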