## Insurance Recommendation Engine: Content-Based Model Using Neural Networks

This notebook covers the feature engineering and model building for a content-based recommendation engine using neural networks.

Overview of Steps:
1. Build Feature Columns
2. Specify evaluation metrics for TensorFlow
3. Build and train the model
4. Perform recommendations

In [None]:
%%bash
# Check which TensorFlow libraries are installed (the cell magic must be the first line of the cell)
pip freeze | grep tensor

We install the required versions of tensorflow-hub, TensorFlow, and the BigQuery client library.

In [None]:
# Install necessary libraries
!pip3 install tensorflow-hub==0.7.0
!pip3 install --upgrade tensorflow==1.15.3
!pip3 install google-cloud-bigquery==1.10

In [None]:
# Import all necessary libraries and set project parameters
import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

PROJECT = 'astute-veld-253418' # Masters Project ID
BUCKET = 'masters-research' # Data storage bucket name
REGION = 'us' # Location of server hosted
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.15.3'

In [None]:
%%bash
# Configure gcloud from the environment variables set above
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

### Feature Engineering for Insurance Recommendation Engine

We import all data that was generated in the pre-processing stage.

In [None]:
gender_list = open("gender.txt").read().splitlines()
occupation_list = open("occupation_list.txt").read().splitlines()
policy_list = open("policy_list.txt").read().splitlines()
habit_list = open("habit.txt").read().splitlines()

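We now encode the data as TensorFlow feature columns so it can be fed into the model. Categorical features are one-hot encoded, with one extra bucket for out-of-vocabulary values, and numeric features are bucketized.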

In [None]:
gender_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="gender",
    vocabulary_list=gender_list,
    num_oov_buckets=1)
gender_column = tf.feature_column.indicator_column(gender_column_categorical)

habit_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="habit",
    vocabulary_list=habit_list,
    num_oov_buckets=1)
habit_column = tf.feature_column.indicator_column(habit_column_categorical)

occupation_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="occupation",
    vocabulary_list=occupation_list,
    num_oov_buckets=1)
occupation_column = tf.feature_column.indicator_column(occupation_column_categorical)

premium_boundaries = list(range(-50,50000,100))
premium_column = tf.feature_column.numeric_column(
    key="premium")
# Note: the bucketized premium column is defined here but is not added to feature_columns below
premium_bucketized = tf.feature_column.bucketized_column(
    source_column = premium_column,
    boundaries = premium_boundaries)

age_boundaries = list(range(20,100,5))
age_column = tf.feature_column.numeric_column(
    key="age")
age_bucketized = tf.feature_column.bucketized_column(
    source_column = age_column,
    boundaries = age_boundaries)

policy_time_boundaries = list(range(0,65,5))
policy_time_column = tf.feature_column.numeric_column(
    key="policy_time")
policy_time_bucketized = tf.feature_column.bucketized_column(
    source_column = policy_time_column,
    boundaries = policy_time_boundaries)

feature_columns = [gender_column,
                   habit_column,
                   occupation_column,
                   age_bucketized,
                   policy_time_bucketized] 
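
To make these encodings concrete, here is a minimal plain-Python sketch of what an indicator column with one out-of-vocabulary bucket and a bucketized column compute for a single value. It does not use TensorFlow. The helper names `one_hot_with_oov` and `bucket_index` and the example vocabulary are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

def one_hot_with_oov(value, vocabulary):
    """One-hot encode value; unknown values go to a single extra OOV slot."""
    vector = [0] * (len(vocabulary) + 1)  # +1 mirrors num_oov_buckets=1
    index = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vector[index] = 1
    return vector

def bucket_index(value, boundaries):
    """Return the bucket id for value; n boundaries give n + 1 buckets."""
    # Each boundary is the inclusive lower edge of the bucket to its right.
    return bisect_right(boundaries, value)

example_vocabulary = ["F", "M"]           # hypothetical; the notebook reads gender.txt
age_boundaries = list(range(20, 100, 5))  # same boundaries as the notebook

print(one_hot_with_oov("M", example_vocabulary))        # [0, 1, 0]
print(one_hot_with_oov("Unknown", example_vocabulary))  # [0, 0, 1]
print(bucket_index(19, age_boundaries))                 # 0
print(bucket_index(43, age_boundaries))                 # 5
```

With the 16 age boundaries above there are 17 buckets, so every age bucket contributes one position in a 17-element one-hot vector.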

### Build appropriate input function

This function reads the data and passes it into the model for training and evaluation.

In [None]:
record_defaults = [["Unknown"], [430.],["Unknown"],["Unknown"],[43.],[16.],["Unknown"]]
column_keys = ["occupation", "premium", "gender", "habit", "age", "policy_time", "product"]
label_key = "product"
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
      def decode_csv(value_column):
          columns = tf.decode_csv(value_column,record_defaults=record_defaults)
          features = dict(zip(column_keys, columns))          
          label = features.pop(label_key)         
          return features, label

      # Look for a list of files that match this pattern
      file_list = tf.io.gfile.glob(filename)

      # Now develop a dataset
      dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

      if mode == tf.estimator.ModeKeys.TRAIN:
          num_epochs = None # This means carry on indefinitely
          dataset = dataset.shuffle(buffer_size = 10 * batch_size)
      else:
          num_epochs = 1 # Once the file is complete

      dataset = dataset.repeat(num_epochs).batch(batch_size)
      return dataset.make_one_shot_iterator().get_next()
  return _input_fn
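
The decoding step inside the input function can be illustrated without TensorFlow. The sketch below is a plain-Python approximation, not the TensorFlow implementation: it splits one CSV line, fills empty fields from the defaults, casts each field to the type of its default, and separates the label from the features. The function name `decode_line` and the example row are hypothetical.

```python
import csv

# Same column order and default values as the notebook's record_defaults
record_defaults = ["Unknown", 430.0, "Unknown", "Unknown", 43.0, 16.0, "Unknown"]
column_keys = ["occupation", "premium", "gender", "habit", "age", "policy_time", "product"]
label_key = "product"

def decode_line(line):
    """Parse one CSV line into (features, label), filling empty fields with defaults."""
    fields = next(csv.reader([line]))
    values = []
    for raw, default in zip(fields, record_defaults):
        if raw == "":
            values.append(default)             # missing field: use the default
        else:
            values.append(type(default)(raw))  # cast to the default's type
    features = dict(zip(column_keys, values))
    label = features.pop(label_key)
    return features, label

# Hypothetical row with the premium and policy_time fields left empty
features, label = decode_line("Teacher,,F,Smoker,35,,Product A")
print(features)
print(label)
```

In this example the empty premium and policy_time fields fall back to 430.0 and 16.0, and the product column is returned separately as the label.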

### Build the model and train/evaluate


Once the input function is complete, we can train the recommendation engine. As part of this we need to specify a measure of performance. In this case we use top-N accuracy, which measures how often the true product for a user appears among the N products the model ranks highest. For the code below, we choose top-3 as the accuracy measure, since 3 of the 22 products is roughly 14% of the possible choices.

In [None]:
def model_fn(features, labels, mode, params):
  # Build the input layer from the feature columns, then stack the hidden layers.
  net = tf.feature_column.input_layer(features, params['feature_columns'])
  for units in params['hidden_units']:
    net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
  # Calculate logits (1 per class).
  logits = tf.layers.dense(net, params['n_classes'], activation=None)

  predicted_classes = tf.argmax(logits, 1)
  from tensorflow.python.lib.io import file_io

  # Map predicted class ids back to product names.
  with file_io.FileIO('policy_list.txt', mode='r') as ifp:
    content = tf.constant([x.rstrip() for x in ifp])
  predicted_class_names = tf.gather(content, predicted_classes)
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class_ids': predicted_classes[:, tf.newaxis],
        'class_names' : predicted_class_names[:, tf.newaxis],
        'probabilities': tf.nn.softmax(logits),
        'logits': logits,
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
  # Convert string product labels to integer class ids using the vocabulary file.
  table = tf.contrib.lookup.index_table_from_file(vocabulary_file="policy_list.txt")
  labels = table.lookup(labels)
  # Work out the loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Specify the evaluation metrics.
  accuracy = tf.metrics.accuracy(labels=labels,
                                 predictions=predicted_classes,
                                 name='acc_op')
  top_3_accuracy = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, 
                                                   targets=labels, 
                                                   k=3))
  
  metrics = {
    'accuracy': accuracy,
    'top_3_accuracy' : top_3_accuracy}
  
  tf.summary.scalar('accuracy', accuracy[1])
  tf.summary.scalar('top_3_accuracy', top_3_accuracy[1])

  if mode == tf.estimator.ModeKeys.EVAL:
      return tf.estimator.EstimatorSpec(
          mode, loss=loss, eval_metric_ops=metrics)

  # Training operation
  assert mode == tf.estimator.ModeKeys.TRAIN

  optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
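
The top-3 metric in the model function can be checked by hand on a small example. The following plain-Python sketch, which assumes nothing beyond the definition of top-N accuracy, counts an example as a hit when fewer than k classes score strictly higher than the true class. The function names and the logits are hypothetical.

```python
def in_top_k(logits, target, k):
    """True if the target class is among the k highest-scoring classes."""
    # Count the classes that score strictly higher than the target class.
    higher = sum(1 for score in logits if score > logits[target])
    return higher < k

def top_k_accuracy(batch_logits, targets, k):
    """Fraction of examples whose true class is in the top k."""
    hits = [in_top_k(row, target, k) for row, target in zip(batch_logits, targets)]
    return sum(hits) / len(hits)

# Hypothetical logits for 3 examples over 5 classes
batch_logits = [
    [2.0, 0.5, 1.5, -1.0, 0.0],  # true class 2 is ranked 2nd
    [0.1, 0.2, 0.3, 0.4, 0.5],   # true class 0 is ranked 5th
    [1.0, 3.0, 2.0, 0.0, -2.0],  # true class 1 is ranked 1st
]
targets = [2, 0, 1]

print(top_k_accuracy(batch_logits, targets, k=1))  # 1 of 3 examples is a hit
print(top_k_accuracy(batch_logits, targets, k=3))  # 2 of 3 examples are hits
```

Top-1 accuracy here equals ordinary accuracy, which is why the top-3 value can never be lower than the plain accuracy metric.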

### Train and Evaluate

Once the model's parameters and accuracy measures are specified, we can train the model.

In [None]:
outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start from the beginning each time
#tf.summary.FileWriterCache.clear() # ensure cache is clear
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],
      'n_classes': len(policy_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 2000)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
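
Because the training dataset repeats indefinitely, the length of training is set by `max_steps` rather than by a number of epochs. The short sketch below works out how many examples the settings above process. The training set size is a hypothetical value used only to show the conversion to epochs; it is not the size of the real dataset.

```python
batch_size = 512  # default batch size of read_dataset
max_steps = 2000  # value passed to TrainSpec

# Each training step consumes one batch.
examples_seen = batch_size * max_steps

def epochs_covered(num_training_rows):
    """Approximate number of passes over a training set of the given size."""
    return examples_seen / num_training_rows

hypothetical_rows = 100000  # assumption for illustration only
print(examples_seen)                      # 1024000
print(epochs_covered(hypothetical_rows))  # about 10.24
```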

### Test out predictions with trained models

We now test the model by making predictions for a few rows of the training dataset.

In [None]:
%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_customers

Once again, we pass this file through the input function to make predictions.

In [None]:
output = list(estimator.predict(input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)))
output
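
Each element of `output` is a dictionary that includes a `probabilities` entry with one value per product. One possible way to turn this into a top-3 recommendation is sketched below in plain Python. The helper `top_n_recommendations`, the product names, and the probabilities are hypothetical stand-ins for the notebook's real values.

```python
def top_n_recommendations(probabilities, product_names, n=3):
    """Return the n (name, probability) pairs with the highest probability."""
    ranked = sorted(zip(product_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical product names and one hypothetical prediction dictionary
example_products = ["Life", "Funeral", "Health", "Disability", "Savings"]
example_prediction = {"probabilities": [0.05, 0.40, 0.10, 0.30, 0.15]}

recommendations = top_n_recommendations(example_prediction["probabilities"],
                                        example_products)
for name, probability in recommendations:
    print(name, probability)
```

With the notebook's own objects, the equivalent call would be `top_n_recommendations(output[0]['probabilities'], policy_list)`.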