# Claims classification with Keras: The Python Deep Learning Library

In this notebook, you will train a classification model for claim text that predicts `1` if the claim is an auto insurance claim or `0` if it is a home insurance claim. The model is a type of deep neural network (DNN) called a Long Short-Term Memory (LSTM) recurrent neural network, built with TensorFlow via the Keras library.

This notebook will walk you through the text analytic process that consists of:

- An example word analogy with GloVe word embeddings
- Vectorizing training data using GloVe word embeddings
- Creating and training an LSTM-based classifier model
- Using the model to predict classifications

## Prepare modules

This notebook will use the Keras library to build and train the classifier.

In [None]:
import string
import re
import os
import numpy as np
import pandas as pd
import urllib.request

import tensorflow as tf
import keras
from keras import models, layers, optimizers, regularizers
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

print('Keras version: ', keras.__version__)
print('Tensorflow version: ', tf.__version__)

**Let's download the pretrained GloVe word embeddings and load them in this notebook.**

This creates a `dictionary` of **400,000** words and the corresponding GloVe word vectors for the words in the dictionary. Each word vector has 50 elements, so the dimensionality of the word embeddings used here is **50**.

*The next cell might take a couple of minutes to run.*

In [None]:
words_list_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                  'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/glove50d/wordsList.npy')

word_vectors_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                    'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/glove50d/wordVectors.npy')

word_vectors_dir = './word_vectors'

os.makedirs(word_vectors_dir, exist_ok=True)
urllib.request.urlretrieve(words_list_url, os.path.join(word_vectors_dir, 'wordsList.npy'))
urllib.request.urlretrieve(word_vectors_url, os.path.join(word_vectors_dir, 'wordVectors.npy'))

dictionary = np.load(os.path.join(word_vectors_dir, 'wordsList.npy'))
dictionary = dictionary.tolist()
dictionary = [word.decode('UTF-8') for word in dictionary]
print('Loaded the dictionary! Dictionary size: ', len(dictionary))

word_vectors = np.load(os.path.join(word_vectors_dir, 'wordVectors.npy'))
print('Loaded the word vectors! Shape of the word vectors: ', word_vectors.shape)
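Later cells look words up with `dictionary.index(word)`, which scans the list from the start on every call. The following is a minimal sketch of one possible alternative, assuming a small stand-in vocabulary: it builds a word-to-row map once and reuses it for lookups. The names `toy_dictionary`, `toy_vectors`, `word_to_index`, and `lookup_vector` are hypothetical and are not part of the downloaded data.

```python
import numpy as np

# Hypothetical stand-ins for the real `dictionary` list and `word_vectors` array.
toy_dictionary = ['the', 'car', 'house', 'unk']
toy_vectors = np.arange(len(toy_dictionary) * 3, dtype=float).reshape(len(toy_dictionary), 3)

# Build the word -> row index map once; each later lookup is a dict access.
word_to_index = {word: i for i, word in enumerate(toy_dictionary)}

def lookup_vector(word, unk_word='unk'):
    """Return the vector for `word`, falling back to the unknown-word row."""
    index = word_to_index.get(word, word_to_index[unk_word])
    return toy_vectors[index]

car_vector = lookup_vector('car')
missing_vector = lookup_vector('zeppelin')  # not in the toy vocabulary
print(car_vector)
print(missing_vector)
```

If a word list contains duplicate entries, this map keeps the last index for a word, whereas `list.index` returns the first, so the two approaches are only interchangeable when the entries are unique.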

**Create the word contractions map. The map is going to be used to expand contractions in our corpus (for example "can't" becomes "cannot").**

In [None]:
contractions_url = ('https://quickstartsws9073123377.blob.core.windows.net/'
                    'azureml-blobstore-0d1c4218-a5f9-418b-bf55-902b65277b85/glove50d/contractions.xlsx')
contractions_df = pd.read_excel(contractions_url)
contractions = dict(zip(contractions_df.original, contractions_df.expanded))
print('Review first 10 entries from the contractions map')
print(contractions_df.head(10))

## Word analogy example with GloVe word embeddings

GloVe represents each word in the dictionary as a vector. We can use word vectors for predicting word analogies. 

See example below that solves the following analogy: **father->mother :: king->?**

Cosine similarity measures how similar two word vectors are. The helper function below takes the vectors of two words and returns their cosine similarity, which ranges from -1 to 1. Vectors that point in nearly the same direction score close to 1, and vectors that point in opposite directions score close to -1. Words that are used in similar contexts, such as synonyms, tend to score close to 1.

In [None]:
def cosine_similarity(u, v):
    dot = u.dot(v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # Return the result directly instead of shadowing the function name
    return dot / (norm_u * norm_v)
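To make the range of the measure concrete, here is a small sketch on hand-picked two-dimensional vectors. The function mirrors the helper above, and the vectors are illustrative values rather than GloVe embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0])

# Scaling a vector does not change its direction, so the similarity stays at 1.
same_direction = cosine_similarity(a, 3 * a)

# Perpendicular vectors have a dot product of 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Negating a vector reverses its direction, which gives -1.
opposite = cosine_similarity(a, -a)

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Because the measure depends only on direction, two vectors of very different lengths can still be maximally similar.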

Let's review the vectors for the words **father**, **mother**, and **king**:

In [None]:
father = word_vectors[dictionary.index('father')]
mother = word_vectors[dictionary.index('mother')]
king = word_vectors[dictionary.index('king')]
print(father)
print('')
print(mother)
print('')
print(king)

To solve for the analogy, we need to solve for x in the following equation:

**mother - father = x - king**

Thus, **x = mother - father + king**

In [None]:
x = mother - father + king

**Next, we will find the word whose word vector is closest to the vector x computed above**

To limit the computation cost, we will identify the best word from a list of possible answers instead of searching the entire dictionary.

In [None]:
answers = ['women', 'prince', 'princess', 'england', 'goddess', 'diva', 'empress', 
           'female', 'lady', 'monarch', 'title', 'queen', 'sovereign', 'ruler', 
           'male', 'crown', 'majesty', 'royal', 'cleopatra', 'elizabeth', 'victoria', 
           'throne', 'internet', 'sky', 'machine', 'learning', 'fairy']

# Find the similarity of each word in answers with x
rows = []
for w in answers:
    sim = cosine_similarity(word_vectors[dictionary.index(w)], x)
    rows.append({'word': w, 'cosine_similarity': sim})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=['word', 'cosine_similarity'])
df.sort_values(['cosine_similarity'], ascending=False, inplace=True)

print(df)

**From the results above, you can observe the vector for the word `queen` is most similar to the vector `x`.**
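The candidate list keeps the loop short, but the same ranking can also be computed for many words at once with matrix operations. The sketch below assumes a tiny hand-made embedding matrix. `toy_words`, `toy_vectors`, `query`, and `rank_by_cosine` are hypothetical names, and the numbers are not GloVe values.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for four words.
toy_words = ['king', 'queen', 'apple', 'river']
toy_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])
query = np.array([0.7, 0.7, 0.0])

def rank_by_cosine(query, vectors, words, top_n=3):
    """Rank `words` by cosine similarity to `query` in one matrix product.

    Assumes every row of `vectors` and the query itself have a nonzero norm.
    """
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    similarities = unit_vectors @ unit_query   # one cosine value per word
    order = np.argsort(-similarities)[:top_n]  # indices of the highest scores first
    return [(words[i], float(similarities[i])) for i in order]

ranking = rank_by_cosine(query, toy_vectors, toy_words)
for word, score in ranking:
    print(word, round(score, 3))
```

Normalizing the rows once and taking a single matrix-vector product replaces the per-word Python loop, which is what would make a search over a full vocabulary practical.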

## Prepare the training data

Contoso Ltd has provided a small document containing examples of the text they receive as claim text. They have provided this in a text file with one line per sample claim.

Run the following cell to download and examine the contents of the file. Take a moment to read the claims (you may find some of them rather comical!).

In [None]:
data_location = './data'
base_data_url = 'https://databricksdemostore.blob.core.windows.net/data/05.03/'
filesToDownload = ['claims_text.txt', 'claims_labels.txt']

os.makedirs(data_location, exist_ok=True)

for file in filesToDownload:
    # Concatenate instead of os.path.join, which would insert a backslash on Windows
    data_url = base_data_url + file
    local_file_path = os.path.join(data_location, file)
    urllib.request.urlretrieve(data_url, local_file_path)
    print('Downloaded file: ', file)
    
# Use a context manager so the file is closed after reading
with open(os.path.join(data_location, 'claims_text.txt')) as claims_file:
    claims_corpus = [claim for claim in claims_file]
claims_corpus

In addition to the claims sample, Contoso Ltd has also provided a document that labels each of the sample claims as either 0 ("home insurance claim") or 1 ("auto insurance claim"). This too is presented as a text file with one row per sample, in the same order as the claim text.

Run the following cell to examine the contents of the supplied claims_labels.txt file:

In [None]:
# Use a context manager so the file is closed after reading
with open(os.path.join(data_location, 'claims_labels.txt')) as labels_file:
    labels = [int(label.strip()) for label in labels_file]
print(len(labels))
print(labels[0:5]) # first 5 labels
print(labels[-5:]) # last 5 labels

As you can see from the above output, the values are integers 0 or 1. In order to use these as labels with which to train our model, we need to convert these integer values to categorical values (think of them like enums from other programming languages).

We can use the `to_categorical` function from `keras.utils` to convert these values into binary categorical values. Run the following cell:

In [None]:
labels = to_categorical(labels, 2)
print(labels.shape)
print()
print(labels[0:2]) # first 2 categorical labels
print()
print(labels[-2:]) # last 2 categorical labels
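To see what the categorical encoding looks like without Keras, the sketch below reproduces the same kind of one-hot array with plain NumPy. `sample_labels` is a hypothetical four-item list, not the contents of the labels file.

```python
import numpy as np

# Hypothetical integer labels: 0 = home insurance claim, 1 = auto insurance claim.
sample_labels = [0, 1, 1, 0]

# Row i of the 2x2 identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels yields one row per sample.
one_hot = np.eye(2, dtype='float32')[sample_labels]
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each row contains a single 1 in the column of its class, which is why the label array gains a second dimension of size 2.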

Now that we have our claims text and labels loaded, we are ready to begin our first step in the text analytics process, which is to normalize the text.

### Process the claims corpus

- Lowercase all words
- Expand contractions (for example "can't" becomes "cannot")
- Remove special characters (like punctuation)
- Convert the list of words in the claims text to a list of corresponding indices of those words in the dictionary. Note that the order of the words as they appear in the written claims is maintained.

Run the next cell to process the claims corpus.

In [None]:
def remove_special_characters(token):
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_token = pattern.sub('', token)
    return filtered_token

def convert_to_indices(corpus, dictionary, c_map, unk_word_index = 399999):
    sequences = []
    for i in range(len(corpus)):
        tokens = corpus[i].split()
        sequence = []
        for word in tokens:
            word = word.lower()
            if word in c_map:
                resolved_words = c_map[word].split()
                for resolved_word in resolved_words:
                    try:
                        word_index = dictionary.index(resolved_word)
                        sequence.append(word_index)
                    except ValueError:
                        sequence.append(unk_word_index)  # Vector for unknown words
            else:
                try:
                    clean_word = remove_special_characters(word)
                    if len(clean_word) > 0:
                        word_index = dictionary.index(clean_word)
                        sequence.append(word_index)
                except ValueError:
                    sequence.append(unk_word_index)  # Vector for unknown words
        sequences.append(sequence)
    return sequences

claims_corpus_indices = convert_to_indices(claims_corpus, dictionary, contractions)
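The processing steps can be traced on a single sentence with a toy vocabulary. This is a simplified sketch that follows the same order of operations as `convert_to_indices`. The names `toy_contractions`, `toy_dictionary`, `toy_index`, and `normalize` are hypothetical, and the vocabulary is far smaller than the GloVe word list.

```python
import re
import string

# Hypothetical stand-ins for the contractions map and the GloVe word list.
toy_contractions = {"can't": 'cannot', "i'm": 'i am'}
toy_dictionary = ['i', 'am', 'cannot', 'find', 'my', 'car', 'unk']
toy_index = {word: i for i, word in enumerate(toy_dictionary)}
UNK = toy_index['unk']

punctuation_pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))

def normalize(text):
    """Lowercase, expand contractions, then strip punctuation from other words."""
    tokens = []
    for word in text.lower().split():
        if word in toy_contractions:
            # A contraction may expand to several words, e.g. "i'm" -> "i am"
            tokens.extend(toy_contractions[word].split())
        else:
            cleaned = punctuation_pattern.sub('', word)
            if cleaned:
                tokens.append(cleaned)
    return tokens

tokens = normalize("I can't find my Car!")
indices = [toy_index.get(token, UNK) for token in tokens]
print(tokens)
print(indices)
```

Because the contraction lookup happens before punctuation is removed, a contraction followed directly by punctuation (such as `can't,`) does not match the map and is reduced to `cant` instead. The same ordering applies in `convert_to_indices` above.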

**Review the indices of one sample claim**

In [None]:
print(remove_special_characters(claims_corpus[5]).split())
print()
print('Ordered list of indices for the above claim')
print(claims_corpus_indices[5])
print('')
print('For example, the index of the second word in the claim text "pedestrian" is: ', dictionary.index('pedestrian'))

**Create fixed length vectors**

The number of words varies from claim to claim, but the model needs input vectors of a fixed size. We will use the utility function `pad_sequences` from `keras.preprocessing.sequence` to create fixed-size vectors (size = 125) of word indices. Shorter claims are padded with zeros at the front, and longer claims are truncated at the end.

In [None]:
maxSeqLength = 125

X = pad_sequences(claims_corpus_indices, maxlen=maxSeqLength, padding='pre', truncating='post')

print('Review the new fixed size vector for a sample claim')
print(remove_special_characters(claims_corpus[5]).split())
print()
print(X[5])
print('')
print('Length of the vector: ', len(X[5]))
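The effect of `padding='pre'` and `truncating='post'` can be illustrated with a plain NumPy sketch. The function `pad_pre_truncate_post` below is a hypothetical re-implementation written only for illustration; it is not the Keras function, and it covers just the two options used in this notebook.

```python
import numpy as np

def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and cut long ones at the end."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]                    # 'post' truncation drops trailing items
        if kept:
            padded[row, maxlen - len(kept):] = kept  # 'pre' padding leaves the fill at the front
    return padded

# Hypothetical index sequences: one shorter and one longer than maxlen.
toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre_truncate_post(toy_sequences, maxlen=4)
print(padded)
```

Note that the padding value 0 is also a valid row index in the embedding matrix, so padded positions are looked up like any other word unless masking is configured.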

## Build the LSTM recurrent neural network

Now that you have preprocessed the input features from the training text data, you are ready to build the classifier. In this case, we will build an LSTM recurrent neural network. The network will have a word embedding layer that converts the word indices to GloVe word vectors. The GloVe word vectors are then passed to the LSTM layer, followed by a binary classifier output layer.

Run the following cell to build the structure for your neural network:

In [None]:
embedding_layer = Embedding(word_vectors.shape[0],
                            word_vectors.shape[1],
                            weights=[word_vectors],
                            input_length=maxSeqLength,
                            trainable=False)
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='sigmoid'))
model.summary()
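Conceptually, a frozen embedding layer acts as a table lookup: each word index selects one row of the weight matrix. The NumPy sketch below shows the resulting shapes, assuming a made-up four-word, two-dimensional table. `toy_word_vectors` and `toy_batch` are hypothetical names.

```python
import numpy as np

# Hypothetical embedding table: 4 words, 2 dimensions each.
toy_word_vectors = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
])

# Hypothetical batch of 2 padded claims, each 4 word indices long.
toy_batch = np.array([
    [0, 0, 1, 3],
    [0, 2, 2, 1],
])

# Integer-array indexing replaces every index with its row from the table.
embedded = toy_word_vectors[toy_batch]
print(embedded.shape)  # (batch size, sequence length, embedding dimension)
```

With the real data, the same lookup would turn a batch of shape `(batch, 125)` into `(batch, 125, 50)`, which is the sequence of word vectors the LSTM layer consumes.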

## Train the neural network

First, we will split the data into two sets: (1) training set and (2) validation or test set. The validation set accuracy will be used to measure the performance of the model.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

We will use the `Adam` optimization algorithm to train the model. Also, given that the problem is of type `Binary Classification`, we are using the `Sigmoid` activation function for the output layer and the `Binary Crossentropy` as the loss function.

In [None]:
# `lr` is a deprecated alias; `learning_rate` is the current argument name
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
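For intuition about what the loss measures, here is a NumPy sketch of binary cross-entropy applied to two-column predictions, averaged over the two outputs of each sample. The function is a hypothetical hand-written version for illustration, not the Keras implementation, and the numbers are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the last axis, one value per sample."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_output = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_output.mean(axis=-1)

# Hypothetical one-hot labels and sigmoid outputs for two claims.
y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
y_pred = np.array([[0.1, 0.9], [0.4, 0.6]])

losses = binary_crossentropy(y_true, y_pred)
print(losses)
```

The first sample is predicted confidently and correctly, so its loss is small. The second sample leans toward the wrong class, so its loss is noticeably larger.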

Now we are ready to let the DNN learn by fitting it against our training data and labels. We have defined the batch size and the number of epochs for our training.

Run the following cell to fit your model against the data:

In [None]:
epochs = 100
batch_size = 16
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test))

Take a look at the final output for the value "val_accuracy". This stands for validation set accuracy. If you think of random chance as having a 50% accuracy, is your model better than random?

It's OK if it's not much better than random at this point; this is only your first model! The typical data science process would continue with many more iterations, taking different actions to improve the model accuracy, including:
- Acquiring more labeled documents for training
- Regularization to prevent overfitting
- Adjusting the model hyperparameters, such as the number of layers, number of nodes per layer, and learning rate
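The 50% figure assumes the two classes are evenly represented. A slightly stricter reference point is the accuracy of always predicting the most common class. The sketch below computes that baseline from one-hot labels; `toy_labels` is a hypothetical array, not the Contoso label file.

```python
import numpy as np

# Hypothetical one-hot labels: 2 home claims (class 0) and 3 auto claims (class 1).
toy_labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])

# Summing the columns counts how many samples belong to each class.
class_counts = toy_labels.sum(axis=0)
majority_class = int(class_counts.argmax())

# Always predicting the majority class is right this fraction of the time.
baseline_accuracy = class_counts.max() / class_counts.sum()
print(majority_class, baseline_accuracy)
```

A model whose validation accuracy does not exceed this baseline has not yet learned anything beyond the class proportions.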

## Test classifying claims

Now that you have constructed a model, try it out against a set of claims. Recall that we need to first preprocess the text.

Run the following cell to prepare our test data:

In [None]:
test_claim = ['I crashed my car into a pole.', 
              'The flood ruined my house.', 
              'I lost control of my car and fell in the river.']

test_claim_indices = convert_to_indices(test_claim, dictionary, contractions)
test_data = pad_sequences(test_claim_indices, maxlen=maxSeqLength, padding='pre', truncating='post')

Now use the model to predict the classification:

In [None]:
pred = model.predict(test_data)
pred_label = pred.argmax(axis=1)
pred_df = pd.DataFrame(np.column_stack((pred,pred_label)), columns=['class_0', 'class_1', 'label'])
pred_df.label = pred_df.label.astype(int)
print('Predictions')
pred_df

## Model exporting and importing

Now that you have a working model, you need to export the trained model to a file so that it can be used downstream by the deployed web service.

*The next two cells might take a couple of minutes to run.*

To export the model run the following cell:

In [None]:
output_folder = './output'
model_filename = 'final_model.hdf5'
os.makedirs(output_folder, exist_ok=True)
model.save(os.path.join(output_folder, model_filename))

To test reloading the model into the same notebook instance, run the following cell:

In [None]:
from keras.models import load_model
loaded_model = load_model(os.path.join(output_folder, model_filename))
loaded_model.summary()

As before, you can use the model to run predictions.

Run the following cell to try the prediction with the reloaded model:

In [None]:
pred = loaded_model.predict(test_data)
pred_label = pred.argmax(axis=1)
pred_df = pd.DataFrame(np.column_stack((pred,pred_label)), columns=['class_0', 'class_1', 'label'])
pred_df.label = pred_df.label.astype(int)
print('Predictions')
pred_df