# Claims classification with Keras: The Python Deep Learning Library

In this notebook, you will train a classification model for claim text that predicts `1` if the claim is an auto insurance claim or `0` if it is a home insurance claim. The model is a deep neural network (DNN) built with the Keras library running on TensorFlow.

This notebook will walk you through a simplified text analytic process that consists of:

- Normalizing the training text data
- Extracting the features of the training text as vectors
- Creating and training a DNN-based classifier model
- Using the model to predict classifications

## Prepare modules

This notebook uses the Keras library to build and train the classifier. It also relies on a supplied helper library, `textanalytics`, that performs common text analytics functions.

In [None]:
import re
import nltk
import uuid

import os
import numpy as np
import pandas as pd

import tensorflow as tf
import keras
from keras import models, layers, optimizers, regularizers
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical

print('Keras version: ', keras.__version__)
print('Tensorflow version: ', tf.__version__)

**Let's download all the data needed by this notebook to a local folder.**

In [None]:
import urllib.request

data_location = './data'
base_data_url = 'https://databricksdemostore.blob.core.windows.net/data/05.03/'
filesToDownload = ['claims_labels.txt', 'claims_text.txt', 'contractions.py', 'textanalytics.py']

os.makedirs(data_location, exist_ok=True)

for file in filesToDownload:
    data_url = base_data_url + file  # plain concatenation; os.path.join would insert a backslash on Windows
    local_file_path = os.path.join(data_location, file)
    urllib.request.urlretrieve(data_url, local_file_path)
    print('Downloaded file: ', file)

Let's import some helper code that resides in Python .py files.
Note: Appending a folder to `sys.path` is how you can import extra Python code that resides in an arbitrary location into the notebook.

In [None]:
import nltk
import sys
nltk.download('stopwords')
nltk.download('punkt')
sys.path.append(data_location)
import textanalytics as ta

## Prepare the training data

Contoso Ltd has provided a small document containing examples of the text they receive as claim text. They have provided this in a text file with one line per sample claim.

Run the following cell to examine the contents of the file. Take a moment to read the claims (you may find some of them rather comical!).

In [None]:
claims_corpus = [claim for claim in open(os.path.join(data_location, 'claims_text.txt'))]
claims_corpus

In addition to the claims sample, Contoso Ltd has also provided a document that labels each of the sample claims as either 0 ("home insurance claim") or 1 ("auto insurance claim"). This too is presented as a text file with one row per sample, in the same order as the claim text.

Run the following cell to examine the contents of the supplied claims_labels.txt file:

In [None]:
labels = [int(re.sub("\n", "", label)) for label in open(os.path.join(data_location, 'claims_labels.txt'))]
print(len(labels))
print(labels[0:5]) # first 5 labels
print(labels[-5:]) # last 5 labels

As you can see from the above output, the values are integers 0 or 1. In order to use these as labels with which to train our model, we need to convert these integer values to categorical values (think of them like enums from other programming languages).

We can use the `to_categorical` method from `keras.utils` to convert these values into binary categorical (one-hot) values. Run the following cell:

In [None]:
labels = to_categorical(labels, 2)
print(labels.shape)
print()
print(labels[0:2]) # first 2 categorical labels
print()
print(labels[-2:]) # last 2 categorical labels
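
To make the conversion concrete, the following minimal sketch reproduces one-hot encoding with plain NumPy. The `int_labels` array is a made-up example rather than the Contoso data, and the sketch illustrates the idea behind `to_categorical` rather than its actual implementation.

```python
import numpy as np

# Hypothetical integer labels: 0 = home claim, 1 = auto claim.
int_labels = np.array([0, 1, 1, 0])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye with the labels encodes every label at once.
one_hot = np.eye(2, dtype="float32")[int_labels]

print(one_hot.shape)
print(one_hot)
```

Each row contains a single 1 in the column that matches the original integer label, so `argmax` along each row recovers the integer again.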

Now that we have our claims text and labels loaded, we are ready to begin our first step in the text analytics process, which is to normalize the text.

## Normalize the claims corpus

The textanalytics module supplied takes care of implementing our desired normalization logic. In summary, what it does is:

- Expand contractions (for example "can't" becomes "cannot")
- Lowercase all text
- Remove special characters (like punctuation)
- Remove stop words (these are words like "a", "an", "the" that add little value for classification)

Run the following command and observe how the claim text is modified:

In [None]:
norm_corpus = ta.normalize_corpus(claims_corpus)
norm_corpus
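
The four normalization steps listed above can be sketched in a few lines of plain Python. This is an illustration only: the `normalize_text` function and its tiny `CONTRACTIONS` and `STOP_WORDS` tables are hypothetical stand-ins, and the supplied `textanalytics` module may implement each step differently. The sketch lowercases first so that the contraction lookup only needs lowercase keys.

```python
import re

# Tiny illustrative lookup tables; these are assumptions for the sketch,
# not the lists used by the textanalytics helper.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "i", "my", "and", "of", "in", "am"}


def normalize_text(text):
    """Apply the four normalization steps to a single document."""
    text = text.lower()                           # lowercase all text
    for short, full in CONTRACTIONS.items():      # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)


sample = "I can't believe the tree fell on my car!"
print(normalize_text(sample))
```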

## Feature extraction: vectorize the claims corpus

Feature extraction in text analytics has the goal of creating a numeric representation of the textual documents.

During feature extraction, a “vocabulary” of unique words is identified and each word becomes a column in the output. In other words, the table is as wide as the vocabulary.

Each row represents a document. The value in each cell is typically a measure of the relative importance of that word in the document; if a word from the vocabulary does not appear in a document, the cell in that word's column is zero. In other words, the table is as tall as the number of documents in the corpus.

This approach enables machine learning algorithms, which operate on arrays of numbers, to also operate on text, because each text document is now represented as an array of numbers.

Deep learning algorithms operate on tensors, which generalize vectors and matrices to arrays of numbers with any number of dimensions, so this approach is also valid for preparing text for use with a deep learning algorithm.

Run the following command to see what the vectorized version of the claims in norm_corpus looks like:

In [None]:
vectorizer, tfidf_matrix = ta.build_feature_matrix(norm_corpus) 
data = tfidf_matrix.toarray()
print(data.shape)
data

Observe in the above output that the shape (the dimensions) of the data is 40 rows by 258 columns. This means the vectorizer found 258 words in the vocabulary learned from all documents in the set. There are 40 documents in our training set, hence the vectorized output has 40 rows (one array of numbers for each document).
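
As a rough illustration of how such a table can be built, the sketch below computes TF-IDF weights by hand for three made-up documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which are the defaults of scikit-learn's `TfidfVectorizer`; whether `ta.build_feature_matrix` uses the same settings is an assumption, not something this notebook states.

```python
import math
from collections import Counter

# Hypothetical, already-normalized documents.
docs = ["car hit pole", "flood ruined house", "car slid river"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of unique words: one column per word.
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    row = [counts[w] * idf[w] for w in vocab]          # term frequency * idf
    norm = math.sqrt(sum(v * v for v in row)) or 1.0   # L2-normalize each row
    matrix.append([v / norm for v in row])

print(len(matrix), "rows x", len(vocab), "columns")
print(vocab)
```

Words that occur in many documents (here "car") receive a lower weight than words that occur in only one, and any word absent from a document contributes a zero in that row.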

## Build the neural network

Now that you have normalized and extracted the features from the training text data, you are ready to build the classifier. In this case, we will build a simple neural network. The network will be 2 layers deep, and each node in one layer is connected to every node in the subsequent layer. This is what is meant by fully connected.

Run the following cell to build the structure for your neural network:

In [None]:
np.random.seed(125)
model = Sequential()
model.add(Dense(60, input_dim=data.shape[1], kernel_regularizer=regularizers.l2(0.02)))
model.add(Activation('relu'))
model.add(Dense(2))
model.add(Activation('sigmoid'))

model.summary()
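
To see what "fully connected" means in terms of numbers, the sketch below rebuilds the same layer shapes with NumPy, counts the trainable parameters, and runs one forward pass. The weights are random placeholders (not Keras's initializer or trained values) and the `forward` helper is a hypothetical name used only for illustration. The L2 regularizer affects the loss rather than the forward pass, so it is omitted.

```python
import numpy as np

n_features, n_hidden, n_classes = 258, 60, 2   # sizes taken from this notebook
rng = np.random.default_rng(0)

# A Dense layer holds one weight per (input, unit) pair plus one bias per unit.
W1, b1 = rng.normal(scale=0.1, size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_classes)), np.zeros(n_classes)

n_params = W1.size + b1.size + W2.size + b2.size
print("trainable parameters:", n_params)   # 258*60 + 60 + 60*2 + 2


def forward(x):
    """Dense -> ReLU -> Dense -> sigmoid, the same layer order as the model."""
    hidden = np.maximum(0.0, x @ W1 + b1)              # ReLU activation
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # sigmoid activation


batch = rng.random((3, n_features))   # three hypothetical feature vectors
print(forward(batch).shape)
```

Assuming the 258-column feature matrix described earlier, the parameter count should match the total that `model.summary()` reports: 258 * 60 weights plus 60 biases for the first layer, and 60 * 2 weights plus 2 biases for the second.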

## Train the neural network

First, we will split the data into two sets: (1) training set and (2) validation or test set. The validation set accuracy will be used to measure the performance of the model.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=0)
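
The arithmetic behind the split is worth spelling out, because the validation set is very small. The sketch below mimics an 80/20 shuffled split using only the standard library. It does not reproduce the exact shuffle that `train_test_split` performs with `random_state=0`, and the index lists are purely illustrative.

```python
import math
import random

n_samples, test_size = 40, 0.2              # 40 claims, 20% held out

n_test = math.ceil(n_samples * test_size)   # number of held-out samples
n_train = n_samples - n_test

indices = list(range(n_samples))
random.Random(0).shuffle(indices)           # shuffle so both sets mix the classes
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, "training samples,", n_test, "validation samples")
```

With only 8 validation samples, each misclassified claim changes the validation accuracy by 12.5 percentage points, so the metric moves in coarse steps.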

We will use the `Adam` optimization algorithm to train the model. Also, given that the problem is of type `Binary Classification`, we are using the `Sigmoid` activation function for the output layer and the `Binary Crossentropy` as the loss function.

In [None]:
opt = keras.optimizers.Adam(learning_rate=0.001)  # 'lr' is a deprecated alias of 'learning_rate'
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
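
For intuition about what the loss measures, here is a minimal hand-written version of binary cross-entropy. The `binary_crossentropy` function below is a simplified stand-in written for this example (it clips probabilities with a small `eps` to avoid `log(0)`), not Keras's implementation, and the target and prediction values are invented.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# One hypothetical auto claim: one-hot target [0, 1], scored by two sigmoid outputs.
target = [0.0, 1.0]
confident = [0.1, 0.9]
unsure = [0.5, 0.5]

print(round(binary_crossentropy(target, confident), 4))
print(round(binary_crossentropy(target, unsure), 4))
```

The confident, correct prediction yields a much smaller loss than the undecided one, which is the signal the optimizer uses to adjust the weights.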

Now we are ready to let the DNN learn by fitting it against our training data and labels. We have defined the batch size and the number of epochs for our training.

Run the following cell to fit your model against the data:

In [None]:
epochs = 100
batch_size = 16
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test))
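
A quick calculation shows how many weight updates these settings produce. The training set size of 32 is an assumption derived from the 40 claims and the 80/20 split used earlier.

```python
import math

n_train = 32        # assumed: 80% of the 40 claims
batch_size = 16
epochs = 100

updates_per_epoch = math.ceil(n_train / batch_size)   # batches seen in one epoch
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, "weight updates per epoch")
print(total_updates, "weight updates in total")
```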

Take a look at the final output for the value "val_acc" (reported as "val_accuracy" in Keras 2.3 and later). This stands for validation set accuracy. If you think of random chance as having a 50% accuracy, is your model better than random?

It's OK if it's not much better than random at this point; this is only your first model! The typical data science process would continue with many more iterations taking different actions to improve the model accuracy, including:
- Acquiring more labeled documents for training
- Preparing the text with more sophisticated techniques such as lemmatization
- Regularization to prevent overfitting
- Adjusting the model hyperparameters, such as the number of layers, number of nodes per layer, and learning rate

## Test classifying claims

Now that you have constructed a model, try it out against a set of claims. Recall that we need to normalize and featurize the text using the exact same pipeline we used during training.

Run the following cell to prepare our test data:

In [None]:
test_claim = ['I crashed my car into a pole.', 
              'The flood ruined my house.', 
              'I lost control of my car and fell in the river.']
test_claim = ta.normalize_corpus(test_claim)
test_claim = vectorizer.transform(test_claim)

test_claim = test_claim.toarray()
print(test_claim.shape)

Now use the model to predict the classification:

In [None]:
pred = model.predict(test_claim)
pred_label = pred.argmax(axis=1)
pred_df = pd.DataFrame(np.column_stack((pred,pred_label)), columns=['class_0', 'class_1', 'label'])
pred_df.label = pred_df.label.astype(int)
print('Predictions')
pred_df
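
The `argmax` step can be illustrated in isolation. In the sketch below, the `scores` array is invented to stand in for the output of `model.predict`, and the `CLASS_NAMES` mapping is a hypothetical convenience that follows the label definitions given earlier (0 for home, 1 for auto).

```python
import numpy as np

CLASS_NAMES = {0: "home insurance claim", 1: "auto insurance claim"}

# Hypothetical model outputs for three claims: columns are class_0 and class_1.
scores = np.array([[0.21, 0.83],
                   [0.76, 0.19],
                   [0.35, 0.70]])

# argmax along axis 1 returns, for each row, the column with the highest score.
predicted = scores.argmax(axis=1)

for row, label in zip(scores, predicted):
    print(f"{CLASS_NAMES[int(label)]} (score {row[label]:.2f})")
```

Because each output unit has its own sigmoid, the two scores in a row need not sum to 1; only their relative size determines the label.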

## Model exporting and importing

Now that you have a working model, you need to export the trained model and the vectorizer to files so that they can be used downstream by the deployed web service.

To export the model and the vectorizer, run the following cell:

In [None]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

output_folder = './output'
model_filename = 'final_model.hdf5'
os.makedirs(output_folder, exist_ok=True)
model.save(os.path.join(output_folder, model_filename))

vectorizer_name = 'vectorizer'
joblib.dump(vectorizer, os.path.join(output_folder, vectorizer_name))

To test re-loading the model into the same Notebook instance, run the following cell:

In [None]:
from keras.models import load_model
loaded_model = load_model(os.path.join(output_folder, model_filename))
loaded_model.summary()
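
The saved vectorizer can be reloaded with `joblib.load`, the counterpart of `joblib.dump`. The following sketch demonstrates the round trip on a small dictionary that stands in for the fitted vectorizer, writing to a temporary folder so it does not touch `./output`. The `stand_in_vectorizer` name and its contents are made up for the example.

```python
import os
import tempfile

import joblib

# Stand-in for the fitted vectorizer; any picklable Python object
# round-trips through joblib in the same way.
stand_in_vectorizer = {"vocabulary": {"car": 0, "flood": 1, "house": 2}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectorizer")
    joblib.dump(stand_in_vectorizer, path)   # serialize to disk
    restored = joblib.load(path)             # read it back

print(restored == stand_in_vectorizer)
```

In this notebook, the corresponding call would presumably be `joblib.load(os.path.join(output_folder, vectorizer_name))`, after which the restored object can be used with `transform` as before.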

As before, you can use the model to run predictions.

Run the following cell to try the prediction with the re-loaded model:

In [None]:
pred = loaded_model.predict(test_claim)
pred_label = pred.argmax(axis=1)
pred_df = pd.DataFrame(np.column_stack((pred,pred_label)), columns=['class_0', 'class_1', 'label'])
pred_df.label = pred_df.label.astype(int)
print('Predictions')
pred_df