# Document Classification Test (HeavyWater Machine Learning Challenge)
# LSTM models with document length side feature

**Problem Statement**

We process documents related to mortgages, aka everything that happens to originate a mortgage that you don't see as a borrower. Often the only access to a document we have is a scan of a fax of a printout of the document. Our system is able to read and comprehend that document, turning a PDF into structured business content that our customers can act on.

This dataset represents the output of the OCR stage of our data pipeline ...  Each word in the source is mapped to one unique value in the output. If the word appears in multiple documents then that value will appear multiple times. The word order for the dataset comes directly from our OCR layer, so it should be roughly in order.

**Mission**

Train a document classification model. Deploy your model to a public cloud platform (AWS/Google/Azure/Heroku) as a web service, then send us an email with the URL of your GitHub repo, the URL of your publicly deployed service so we can submit test cases, and a recorded screencast demo of your solution's UI, its code and deployment steps. Also, we use AWS so we are partial to you using that ... just saying.

**Lightweight way to test for tensorflow detection of GPUs (with diagnostics), using command line:**

```bash
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
```

## Setup

### Library import

We import all the required Python libraries.

In [2]:
from time import asctime, gmtime, localtime, perf_counter
print(asctime(localtime()))

t0 = perf_counter()

from collections import Counter, OrderedDict
import gc		# garbage collection module
import os
import pathlib
import pickle
from random import random
import sys

print("Python version: ", sys.version_info[:])
print("Un-versioned imports:\n")
prefixStr = ''
print(prefixStr + 'collections', end="  ")
print(prefixStr + 'gc', end="  ")
print(prefixStr + 'os', end="  ")
print(prefixStr + 'pathlib', end="  ")
print(prefixStr + 'pickle', end="  ")
print(prefixStr + 'random', end="  ")
print(prefixStr + 'sys', end="")

import re

from dateutil import __version__ as duVersion
from dateutil.parser import parse
import numpy as np

mdVersion = None
# from modin import __version__ as mdVersion
# import modin.pandas as pd
import pandas as pd
ppVersion = None

import graphviz

scVersion = None
from scipy import __version__ as scVersion
import scipy.sparse as sp

from sklearn import __version__ as skVersion
from sklearn.metrics import confusion_matrix, classification_report
# from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight

tfVersion = None
from tensorflow import __version__ as tfVersion
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.models import load_model as load
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tensorflow.keras.utils import plot_model
from tensorflow.keras import backend as K
from tensorflow.python.client import device_lib 
from tensorflow import device
from tensorflow.keras.metrics import SparseCategoricalCrossentropy

tfaVersion = None
from tensorflow_addons import __version__ as tfaVersion
from tensorflow_addons.metrics import F1Score

# from joblib import __version__ as jlVersion
# from joblib import dump, load

# Visualizations

mpVersion = None
from matplotlib import __version__ as mpVersion
import matplotlib.pyplot as plt

import seaborn as sns
import colorcet as cc

print("\n")
print(f"colorcet: {cc.__version__}", end="\t")
print(f"dateutil: {duVersion}", end="\t")
print(f"graphviz: {graphviz.__version__}", end="\t")
# print(f"joblib: {jlVersion}", end="\t")
print(f"matplotlib: {mpVersion}", end="\t")
if 'modin' in sys.modules:
    print(f"modin: {mdVersion}", end="\t")
print(f"numpy: {np.__version__}", end="\t")
if 'pandas' in sys.modules:
    print(f"pandas: {pd.__version__}", end="\t")
print(f"re: {re.__version__}", end="\t")
print(f"scipy: {scVersion}", end="\t")
print(f"seaborn: {sns.__version__}", end="\t")
print(f"sklearn: {skVersion}", end="\t")
print(f"tensorflow: {tfVersion}", end="\t")
print(f"tensorflow_addons: {tfaVersion}", end="\t")

Δt = perf_counter() - t0
print(f"\n\nΔt: {Δt: 4.1f}s.")

print("\nlocal devices:\n\n", device_lib.list_local_devices())

%matplotlib inline

# Options for pandas
pd.options.display.max_columns = 30
pd.options.display.max_rows = 50

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

Tue Jan 26 17:55:14 2021
Python version:  (3, 6, 9, 'final', 0)
Un-versioned imports:

collections  gc  os  pathlib  pickle  random  sys

colorcet: 1.0.0	dateutil: 2.8.1	graphviz: 2.8.1	matplotlib: 3.3.3	numpy: 1.19.5	pandas: 1.1.4	re: 2.2.1	scipy: 1.4.1	seaborn: 0.11.1	sklearn: 0.22.1	tensorflow: 2.4.1	tensorflow_addons: 0.12.0	

Δt:  0.0s.

local devices:

 [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16223764030492155934
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 200736768
locality {
  bus_id: 1
  links {
    link {
      device_id: 1
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 11279135380970445939
physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:08:00.0, compute capability: 6.1"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 109182976
locality {
  bus_id: 1
  links {
    link {
      type: "StreamExecutor"
      strength: 1
    }
  }
}
incarnation: 1156248

### Local library import

We import all the required local libraries.

In [None]:
rootPath = pathlib.Path.cwd().parent
libPath = rootPath / 'python'

# Include local library paths
sys.path.append(str(libPath)) # uncomment and fill to import local libraries

# Import local libraries
from utility import ModelTrain as mt
from plotHelpers import plotHelpers as ph

<a id="helper-tokenize"></a>
### Helper functions

#### `tokenize()`

In [None]:
def tokenize(corpus, vocabSz):
    """
    Generates the vocabulary and the list of list of integers for the input corpus

    Help from: https://www.tensorflow.org/tutorials/text/nmt_with_attention

    INPUTS:
        corpus: list, type(str), containing (short) document strings
        vocabSz: (int) Maximum number of words to consider in the vocabulary

    RETURNS: List of list of indices for each string in the corpus + Keras sentence tokenizer object

    Usage:
        listOfListsOfIndices, sentenceTokenizer = tokenize(mySentences, maxVocabCt)
    """

    # Define the sentence tokenizer
    tokenizer = Tokenizer(num_words=vocabSz,
    #                               filters='!#%()*+,./:;<=>?@[\\]^_`{|}~\t\n',
                                  filters='%',
                                  lower=False,
                                  split=' ', char_level=False, oov_token="<unkwn>")

    # Keep the double quote, dash, and single quote + & (different from word2vec training: didn't keep `&`)
    # oov_token: added to word_index & used to replace out-of-vocab words during text_to_sequence calls
    # num_words = maximum number of words to keep, dropping least frequent

    # Fit the tokenizer on the input corpus
    tokenizer.fit_on_texts(corpus)

    # Transform each text in corpus to a sequence of integers
    listOfIndexLists = tokenizer.texts_to_sequences(corpus)

    return listOfIndexLists, tokenizer
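
To make the `num_words` / `oov_token` behaviour concrete, here is a minimal plain-Python sketch of the indexing scheme, as far as I understand the Keras `Tokenizer`: index 1 is reserved for the out-of-vocabulary token, remaining words are ranked by frequency, and any word whose index is `>= num_words` is replaced by the OOV index. The function name `sketch_tokenize` and the toy corpus are hypothetical; this is an approximation for illustration, not the Keras implementation.

```python
from collections import Counter


def sketch_tokenize(corpus, num_words, oov_token="<unkwn>"):
    """Approximate the Tokenizer settings used in tokenize() (illustration only)."""

    def words_of(doc):
        # filters='%' turns '%' into the split character; empty strings are dropped
        return [w for w in doc.replace("%", " ").split(" ") if w]

    counts = Counter()
    for doc in corpus:
        counts.update(words_of(doc))

    # Most frequent first; ties keep first-seen order (sorted() is stable)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

    word_index = {oov_token: 1}          # index 0 is left free for padding
    for rank, (word, _) in enumerate(ranked, start=2):
        word_index[word] = rank
    oov = word_index[oov_token]

    def encode(word):
        idx = word_index.get(word)
        # unseen words and words ranked beyond num_words map to the OOV index
        return idx if idx is not None and idx < num_words else oov

    sequences = [[encode(w) for w in words_of(doc)] for doc in corpus]
    return sequences, word_index


toy_corpus = ["aa bb aa cc", "aa bb dd"]   # hypothetical stand-in for hashed tokens
toy_sequences, toy_index = sketch_tokenize(toy_corpus, num_words=4)
print(toy_index)
print(toy_sequences)
```

Note that with `num_words=4` only indices 1 to 3 survive, so the effective vocabulary is the OOV token plus the two most frequent words.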

## Prepare Data

#### Define paths

In [None]:
dataPath = rootPath / 'data'
modelPath = rootPath / 'model'
plotPath = rootPath / 'figures'
checkpointPath = rootPath / 'checkpoints'
tensorBoardPath = rootPath / 'tensorBoardLogs'

### Import data

In [None]:
sourceData = dataPath / 'shuffled-full-set-hashed.csv.zip'
df0 = pd.read_csv(sourceData, header=None, names=['category', 'docText'])
df0.head()
df0.tail()

### Munge/inspect data

**There are 45 null documents**

In [None]:
df0.info()

**There are 14 document categories**

In [None]:
categories = df0.category.unique()
len(categories)
print(categories)

#### Extract tokens (in order to get document lengths)

In [None]:
df0['tokens'] = df0.docText.apply(lambda p: [] if isinstance(p, float) else p.split())
df0.head()

#### Get token counts (side feature)

In [None]:
df0['docLength'] = df0.tokens.apply(len)
df0.head()

## Pre-process data

### Test-train split

* remove documents of length < <font color="darkred">**6**</font>:
  * these are unlikely to be informative, and are probably the result of scan errors
  * these should probably be labeled as errors for human review, rather than risk downstream adoption
* class imbalance spanning almost 2 orders of magnitude ⟶ *stratified sampling*
* the smallest class has 229 instances, so half are held out for testing to keep the uncertainty on its metrics at ~10%
* after model selection, can train on entire data set

In [None]:
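df = df0[df0.docLength > 5].copy()
# Stratified split (assumed: this step is missing from the export); 0.5 matches `testFrac` below
dfTr, dfTe = train_test_split(df, test_size=0.5, stratify=df.category, random_state=0)
df0.shape, df.shape, dfTr.shape, dfTe.shape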
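
The reasoning in the bullets above can be sketched in plain Python: split each class separately so that rare classes keep their share of the test set, then estimate the relative counting uncertainty for the smallest class as 1/√n. Reading "~10% uncertainty" as a counting-statistics estimate is my assumption, as are the helper name `stratified_split` and the toy label counts (only the 229 comes from the notes above).

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_frac, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                       # shuffle within the class only
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


# Hypothetical toy labels: one large class plus a 229-instance class
labels = ["big"] * 2000 + ["small"] * 229
train_idx, test_idx = stratified_split(labels, test_frac=0.5)

n_small_test = sum(labels[i] == "small" for i in test_idx)
rel_uncertainty = 1 / math.sqrt(n_small_test)   # counting-statistics estimate
print(n_small_test, f"{rel_uncertainty:.3f}")
```

In the notebook itself, `sklearn.model_selection.train_test_split` with `stratify=` does the same job; the sketch only shows why stratification and a 50% test fraction go together here.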

### Create list of lists of word indices, and TensorFlow sentence tokenizer object

Use strings from `dfTr` to create vocabulary indices.

See [helper function `tokenize()`](#helper-tokenize)

* Each token is 12 characters long, so minimum string length is 6 &times; 12 + 5 (spaces) = 77

<a id="maxvocabct"></a>
Must specify a limit to the number of unique tokens for the tokenizer.
(Changing this will require re-instantiating it.)

* `maxVocabCt`			vocabulary size to be returned by tokenizer, dropping least frequent

Other parameters are defined below in [LSTM 6, baseline model parameters](#lstm6-parameters), and similarly for subsequent models.

Tokenizing takes ~10 s.

In [None]:
maxVocabCt = 200_000

In [None]:
df.docText.str.len().min()
ListOfDocsTr = list(dfTr.docText)
listOfListsOfWordIndicesTr, tokenizer = tokenize(ListOfDocsTr, maxVocabCt)

### Compute weights for each class

#### `dfTr` category breakdowns

* categoriesBySupport are category names ordered by support in `dfTr`

In [None]:
categoryCts = dfTr[['category', 'docLength']].groupby(by='category').count()\
    .rename(columns={'docLength': 'count'})
categoryCts

categoryCts.sort_values(by='count', ascending=False)
categoriesBySupport = list(categoryCts.sort_values(by='count', ascending=False).index)
categoriesBySupport

#### Extract training and test labels

In [None]:
categoryInds = {c: i for c, i in zip(categoriesBySupport, range(len(categories)))}

yTr = dfTr.category.apply(lambda c: categoryInds[c])
yTe = dfTe.category.apply(lambda c: categoryInds[c])
yTr.head()
yTr.tail()

#### Determine class weights

In [None]:
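# Keyword arguments and an ndarray of classes also work on newer scikit-learn releases
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=np.arange(len(categories)),
                                            y=yTr)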
print(weights)
classWeights = {i: weights[i] for i in range(len(categories))}
print("classWeights:\n", classWeights)
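
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * count_of_class)`, so rare classes get proportionally larger weights. A minimal plain-Python sketch of that formula follows; the helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), as in the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


toy_labels = [0] * 6 + [1] * 3 + [2] * 1      # imbalanced toy label vector
toy_weights = balanced_class_weights(toy_labels)
for label, weight in toy_weights.items():
    print(label, round(weight, 4))

# Sanity check: the weighted sample count equals the unweighted one
weighted_total = sum(toy_weights[c] * n for c, n in Counter(toy_labels).items())
print(weighted_total)
```

A consequence of this choice is that every class contributes equally to the weighted loss in aggregate, regardless of its support.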

## LSTM model(s)

* This/these differ from simple models, as they include a side feature `docLength`
* Since a substantial fraction of documents have lengths `docLength > maxDocWords`, this feature should be informative

<a id="define-model1"></a>
### Define model1

* embedding layer
* bidirectional LSTM
* unidirectional LSTM *(optional)*
* dense layer (relu)
* dense layer (relu)
* classifier dense layer (softmax)

In [None]:
def model1(sequence_length, vocabSz, auxFeatureCount, LSTMinternalLayerSz,
           embedLayerDim, densLayerDim=64, softMaxCt=16, dropoutFrac=0.15,
           LSTMdropoutFrac=0.40, include2ndLSTMlayer=False):

    """
    INPUTS:
    sequence_length			int, number of time steps (tokens) per input sequence
    vocabSz					int, size of vocabulary
    auxFeatureCount			int, count of auxiliary (side) features
    LSTMinternalLayerSz		int, number of units in each LSTM layer
    embedLayerDim			int, dimension of embedding layer
    densLayerDim			int, dimension of dense layers, default: 64
    softMaxCt				int, dimension of softmax output, default: 16
    dropoutFrac				float, dropout rate on LSTM inputs, default: 0.15
    LSTMdropoutFrac			float, recurrent dropout rate within LSTMs, default: 0.40
    include2ndLSTMlayer		bool, include unidirectional LSTM after
                            bidirectional LSTM, default: False
    """

    # Headline input: meant to receive sequences of *sequence_length*
    # integers, between 0 (padding) and *vocabSz* - 1.

    main_input = Input(shape=(sequence_length,), dtype='int32', name='MainInput')
    auxiliary_input = Input(shape=(auxFeatureCount,), name='NumericalInput')

    # This embedding layer will encode the input sequence
    # into a sequence of dense *embedLayerDim*-dimensional vectors.
    x = Embedding(output_dim=embedLayerDim, input_dim=vocabSz,
                  input_length=sequence_length, trainable=True, name="EmbedLayer")(main_input)

    # An LSTM will transform the vector sequence into a single vector,
    # containing information about the entire sequence
    lstmOut0 = Bidirectional(LSTM(LSTMinternalLayerSz,
                                    dropout=dropoutFrac,
                                    recurrent_dropout=LSTMdropoutFrac,
                                    # a stacked second LSTM needs the full sequence as its input
                                    return_sequences=include2ndLSTMlayer),
                             name='BidirectionalLSTM')(x)
    if not include2ndLSTMlayer:
        x = concatenate([lstmOut0, auxiliary_input], name='ConcatenatedFeatures')
    else:
        # Add a second, unidirectional LSTM, if desired
        lstmOut1 = LSTM(LSTMinternalLayerSz,
                        dropout=dropoutFrac,
                        recurrent_dropout=LSTMdropoutFrac, name='UnidirectionalLSTM')(lstmOut0)
        x = concatenate([lstmOut1, auxiliary_input], name='ConcatenatedFeatures')

    # We stack a deep densely-connected network on top
    x = Dense(densLayerDim, activation='relu', name='DenseLayer0')(x)
    x = Dense(densLayerDim, activation='relu', name='DenseLayer1')(x)

    # And finally we add the softmax classification layer
    main_output = Dense(softMaxCt, activation='softmax', name='mainOutput')(x)
    model = Model(inputs=[main_input, auxiliary_input], outputs=main_output)

    return model
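
As a sanity check on the architecture, the layer sizes can be worked out by hand. The sketch below uses the standard parameter-count formulas for embedding, LSTM and dense layers, and assumes the default configuration: no second LSTM, a bidirectional layer that concatenates both directions, and an output layer of `softMaxCt` units. The function name `model1_param_counts` is hypothetical; the numbers plugged in are the parameter values used in this notebook, with 14 classes.

```python
def model1_param_counts(vocabSz, embedLayerDim, LSTMinternalLayerSz,
                        auxFeatureCount, densLayerDim, softMaxCt):
    """Hand-computed trainable parameter counts (second LSTM disabled)."""
    units = LSTMinternalLayerSz

    embedding = vocabSz * embedLayerDim
    # One LSTM: 4 gates, each with input weights, recurrent weights and a bias
    one_lstm = 4 * (units * (units + embedLayerDim) + units)
    bidirectional = 2 * one_lstm

    # Forward and backward outputs are concatenated, then the side feature is appended
    concat_width = 2 * units + auxFeatureCount

    dense0 = concat_width * densLayerDim + densLayerDim
    dense1 = densLayerDim * densLayerDim + densLayerDim
    output = densLayerDim * softMaxCt + softMaxCt

    return {"embedding": embedding, "bidirectional": bidirectional,
            "concat_width": concat_width, "dense0": dense0,
            "dense1": dense1, "output": output}


counts = model1_param_counts(vocabSz=200_000, embedLayerDim=64,
                             LSTMinternalLayerSz=128, auxFeatureCount=1,
                             densLayerDim=64, softMaxCt=14)
for name, value in counts.items():
    print(f"{name:14s}{value:>12,d}")
```

The embedding table dominates the parameter budget by roughly two orders of magnitude, which is worth keeping in mind when choosing `maxVocabCt`.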

<a id="tokenize-512-tokens"></a>
### LSTM 6 tokenize

* truncate docs to `maxDocWords = 512` tokens
* pre-pad shorter docs with 0s

##### Tensor of word indices for train

In [None]:
padValue = 0
maxDocWords = 512

XdocsTr = pad_sequences(listOfListsOfWordIndicesTr,
                        maxlen=maxDocWords,
                        dtype='int32', padding='pre',
                        truncating='post', value=padValue)

In [None]:
ListOfDocsTr[0]
print(listOfListsOfWordIndicesTr[0])
XdocsTr[0]
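
What `padding='pre'` combined with `truncating='post'` does to a single document can be sketched in plain Python: keep the first `maxlen` indices, then left-fill with the pad value. The helper name `pad_pre_truncate_post` and the toy sequences are hypothetical.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Keep the first maxlen items, then left-pad with `value` up to maxlen."""
    kept = list(seq[:maxlen])                       # truncating='post' drops the tail
    return [value] * (maxlen - len(kept)) + kept    # padding='pre' fills the front


short_doc = [7, 8, 9]
long_doc = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_pre_truncate_post(short_doc, maxlen=5)
padded_long = pad_pre_truncate_post(long_doc, maxlen=5)
print(padded_short)
print(padded_long)

# The original length is lost after padding/truncation,
# which is why docLength is carried separately as a side feature
print(len(short_doc), len(long_doc))
```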

##### Tensor of word indices for test

In [None]:
ListOfDocsTe = list(dfTe.docText)
listOfListsOfWordIndicesTe = tokenizer.texts_to_sequences(ListOfDocsTe)
XdocsTe = pad_sequences(listOfListsOfWordIndicesTe,
                        maxlen=maxDocWords,
                        dtype='int32', padding='pre',
                        truncating='post', value=padValue)

In [None]:
ListOfDocsTe[0]
print(listOfListsOfWordIndicesTe[0])
XdocsTe[0]

#### Auxiliary (side) data need to be shaped

* creates a column vector of shape `(nDocs, 1)`, one row per document

In [None]:
XauxTr = dfTr.docLength.values.reshape(dfTr.shape[0], 1)
XauxTr.shape

<a id="lstm6-parameters"></a>
### LSTM 6

#### baseline model parameters

* Bidirectional(LSTM) only
* `LSTMlayerUnits = 128`
* `maxDocWords = 512`

Refer to the [tokenizer setup](#maxvocabct) for the size of `maxVocabCt`, and to the [LSTM 6 tokenize](#tokenize-512-tokens) section for `maxDocWords`.

|parameter|&nbsp;&nbsp;|description|
|:--------|------------|:----------|
|`testFrac`||fraction of data set withheld|
|`LSTMlayerUnits`||# units in each LSTM layer|
|`embeddingDim`||dimension of the generated embeddings|
|`classCt`||# classes (softmax output dim)|
|`auxFeatureCount`||# of features in auxiliary (side) data|
|`dropoutFrac`||dropout fraction|
|`LSTMdropoutFrac`||dropout fraction within LSTMs|
|`batchSz`||size of batches|
|`epochCt`||number of epochs to run|

In [None]:
testFrac = 0.5
LSTMlayerUnits = 128		# 🢢🢢🢢
embeddingDim = 64
classCt = len(categoriesBySupport)
auxFeatureCount = 1
dropoutFrac = 0.15
LSTMdropoutFrac = 0.5
# LSTMdropoutFrac = 0			# Must be 0 for use of cudnn
batchSz = 64
epochCt = 30

#### Save space

In [None]:
# del LSTMX		# (placeholder, in case this section copied for subsequent models)

#### LSTM 6 callbacks

* checkpoints
* TensorBoard
* (no early stopping)

In [None]:
modelInstanceDir = (f"vocabCt{maxVocabCt:06d}maxCommentLen{maxDocWords:03d}"
                    + f"classCt{classCt:02d}"
                    + f"embedDim{embeddingDim:03d}"
                    + f"LSTMlayerSz{LSTMlayerUnits:03d}batchSz{batchSz:03d}"
                    + f"dropoutFrac{dropoutFrac:4.2f}"
                    + f"LSTMdropoutFrac{LSTMdropoutFrac:4.2f}")
print(modelInstanceDir, "\n")

checkpointPrefix = os.path.join(checkpointPath, modelInstanceDir,
                                "ckpt{epoch:03d}")
print(checkpointPrefix, "\n")

checkpointCallback=ModelCheckpoint(filepath=checkpointPrefix,
                                   save_weights_only=True)
os.makedirs(tensorBoardPath, exist_ok=True)                       

logsDir = os.path.join(tensorBoardPath, modelInstanceDir)
print(logsDir, "\n")

os.makedirs(logsDir, exist_ok=True)
tensorboardCallback = TensorBoard(log_dir=logsDir, histogram_freq=1)

#### Load or instantiate LSTM 6

In [None]:
LSTM6name = 'LSTM6'

if (modelPath / LSTM6name).exists():  # saved without an .h5 suffix, the model is a directory
    print(f"Loading {LSTM6name} model from disk.")
    LSTM6 = load(modelPath / LSTM6name)
else:
    np.random.seed(0)  # Set a random seed for reproducibility

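    print("Instantiate LSTM 6, using model1 ...")
    with device('/device:GPU:1'):
        # Argument order must match model1's signature: (..., LSTMinternalLayerSz, embedLayerDim)
        LSTM6 = model1(maxDocWords, maxVocabCt, auxFeatureCount, LSTMlayerUnits,
                       embeddingDim, softMaxCt=classCt, dropoutFrac=dropoutFrac,
                       LSTMdropoutFrac=LSTMdropoutFrac)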
    LSTM6.summary()

<a id="model1-graph"></a>
#### Model 1 graph

In [None]:
plot_model(LSTM6, to_file=os.path.join(plotPath, 'model1graph.png'))

#### Compile LSTM 6

In [None]:
if not (modelPath / LSTM6name).exists():
    with device('/device:GPU:1'):
        LSTM6.compile(optimizer='rmsprop',
                      loss='sparse_categorical_crossentropy',
                      # metrics = ['accuracy', Recall(), Precision(),
                      #            F1Score(num_classes=classCt), 'categorical_crossentropy'])
                      metrics=['accuracy'])

#### Train LSTM 6

In [None]:
if not (modelPath / LSTM6name).exists():
    print(epochCt, batchSz)
    print(classWeights)
    with device('/device:GPU:1'):
        history6 = LSTM6.fit(x=[XdocsTr, XauxTr],
                             y= yTr.values,
                             epochs=epochCt, batch_size=batchSz,
                             shuffle=True,
                             class_weight=classWeights,
                             validation_split=0.2,
                             callbacks=[checkpointCallback, tensorboardCallback],
                             verbose=1)

#### Save LSTM 6, if new model

In [None]:
if not (modelPath / LSTM6name).exists():
    print(f"Saving {LSTM6name} to disk.")
    LSTM6.save(modelPath / LSTM6name)

#### LSTM 6 inference on test data

In [None]:
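XauxTe = dfTe.docLength.values.reshape(dfTe.shape[0], 1)   # side feature for the test set
softmaxOut = LSTM6.predict(x=[XdocsTe, XauxTe])            # the model takes both inputs
yPred = np.argmax(softmaxOut, axis=1)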

In [None]:
confusionMat = confusion_matrix(yTe, yPred)
print(confusionMat)

In [None]:
np.where(np.sum(confusionMat, axis=0) == 0)

In [None]:
accuracy = np.trace(confusionMat)/np.sum(confusionMat)
recall = np.diag(confusionMat)/np.sum(confusionMat, axis=1)
precision = np.diag(confusionMat)/np.sum(confusionMat, axis=0)
print(f"accuracy: {accuracy:0.3f}, "
      f"<precision>: {np.mean(precision):0.3f}, "
      f"<recall>: {np.mean(recall):0.3f}")
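
The same quantities can be sketched without NumPy, which also makes the edge case explicit: a class that is never predicted has a zero column sum, so its precision is undefined. Here it is set to 0.0, which is a convention I am assuming rather than something fixed by the notebook. Rows are true classes and columns are predicted classes, as in `sklearn.metrics.confusion_matrix`. The helper name `per_class_metrics` and the toy matrix are hypothetical.

```python
def per_class_metrics(cm):
    """Accuracy, per-class precision and recall from a square confusion matrix."""
    n = len(cm)
    row_sums = [sum(row) for row in cm]                          # true counts per class
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted counts

    accuracy = sum(cm[i][i] for i in range(n)) / sum(row_sums)
    recall = [cm[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    precision = [cm[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
    never_predicted = [j for j in range(n) if col_sums[j] == 0]
    return accuracy, precision, recall, never_predicted


# Toy matrix: class 2 occurs once in the truth but is never predicted
toy_cm = [[3, 1, 0],
          [2, 4, 0],
          [1, 0, 0]]
toy_accuracy, toy_precision, toy_recall, toy_missing = per_class_metrics(toy_cm)
print(f"accuracy: {toy_accuracy:0.3f}")
print("precision:", [round(p, 3) for p in toy_precision])
print("recall:   ", [round(r, 3) for r in toy_recall])
print("never predicted:", toy_missing)
```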

##### Classification report

In [None]:
classificationReport = classification_report(yTe.values, yPred,
                                             target_names=[str(c) for c in categoriesBySupport])
print(classificationReport)

##### Sorted classification report

* order by support

In [None]:
print(ph.sortClassificationReport(classificationReport))

##### Plot confusion matrix

* As this is a straight confusion matrix, diagonal elements mostly reflect class size in test set
* *This is hard to interpret by visual inspection alone*

In [None]:
labelFontSz = 16
tickFontSz = 13
titleFontSz = 20

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
# Labels must follow the class-index order, i.e. `categoriesBySupport`
ph.plotConfusionMatrix(confusionMat, saveAs=None, xlabels=categoriesBySupport,
                       ylabels=categoriesBySupport, titleText = 'LSTM 6',
                       ax = ax,  xlabelFontSz=labelFontSz,
                       xtickRotate=0.65, ytickRotate=0.0,
                       ylabelFontSz=labelFontSz, xtickFontSz=tickFontSz,
                       ytickFontSz=tickFontSz, titleFontSz=titleFontSz)

##### Plot recall confusion matrix

* normalized by *row*
* diagonal elements now represent the *recall* for each class

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ph.plotConfusionMatrix(confusionMat, saveAs=None, xlabels=categoriesBySupport,
                       ylabels=categoriesBySupport, titleText = 'LSTM 6',
                       ax = ax, xlabelFontSz=labelFontSz,
                       xtickRotate=0.65, ytickRotate=0.0, type='recall',
                       ylabelFontSz=labelFontSz, xtickFontSz=tickFontSz,
                       ytickFontSz=tickFontSz, titleFontSz=titleFontSz)

##### Plot precision confusion matrix

* normalized by *column*
* diagonal elements now represent the *precision* for each class

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ph.plotConfusionMatrix(confusionMat, saveAs=None, xlabels=categoriesBySupport,
                       ylabels=categoriesBySupport, titleText = 'LSTM 6',
                       ax = ax,  xlabelFontSz=labelFontSz,
                       xtickRotate=0.65, ytickRotate=0.0, type='precision',
                       ylabelFontSz=labelFontSz, xtickFontSz=tickFontSz,
                       ytickFontSz=tickFontSz, titleFontSz=titleFontSz)
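
The internals of `ph.plotConfusionMatrix` are not shown here, so the following is only a sketch of the two normalizations described above: dividing each row by its sum puts recall on the diagonal, and dividing each column by its sum puts precision there. The function names and the toy matrix are hypothetical.

```python
def normalize_rows(cm):
    """Divide every row by its sum (rows = true classes), leaving zero rows as zeros."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def normalize_columns(cm):
    """Divide every column by its sum (columns = predicted classes)."""
    transposed = [list(col) for col in zip(*cm)]
    return [list(row) for row in zip(*normalize_rows(transposed))]


toy_matrix = [[3, 1],
              [2, 4]]
recall_matrix = normalize_rows(toy_matrix)
precision_matrix = normalize_columns(toy_matrix)
print(recall_matrix)
print(precision_matrix)
```

Each row of the recall matrix sums to 1, as does each column of the precision matrix, so off-diagonal entries read directly as the share of a class that was confused with another.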