<a href="https://colab.research.google.com/github/hotbread213/createClass/blob/master/IVADO_Day_2_PM_Insurance_Case_(Google_Colab).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Insurance Case

### Part 1) Study of the insurance database

Import the libraries:

In [0]:
import pandas as pd
import numpy as np
import random
from sklearn import linear_model
from sklearn.metrics import r2_score 
from functools import reduce
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from google_drive_downloader import GoogleDriveDownloader

Download the datasets needed

In [0]:
class Constants:
    
    SEED = 1
    
    # Paths to the data files
    Cluster_variables = '/Cluster_variables.csv' 
    Data_category = '/Data_category.csv' 
    INSEE_CODES = '/INSEE_CODES.csv' 
    PG_2017_CLAIMS_YEAR0 = '/PG_2017_CLAIMS_YEAR0.csv' 
    PG_2017_YEAR0 = '/PG_2017_YEAR0.csv' 
    
    # Google drive id to be able to download from drive
    Cluster_variables_ID = '1duFLPcauNQSyRoWGafw0E_W8MQnu-S7z'
    Data_category_ID = '1y96sh5rOexFuWVGerDqtOhaMH8hQRUqe'
    INSEE_CODES_ID = '1-M8ah4fKmRSrICaR7dBETDJNk7b1AB9k'
    PG_2017_CLAIMS_YEAR0_ID = '1CJAUfq-624qXYxMaahi7XcV16OMWDoWQ'
    PG_2017_YEAR0_ID = '1JaLa2adWDEhapFJ4W-3kG7jHB7psP5B1'
    
constants = Constants
random.seed(constants.SEED)

In [0]:
# Remove stale copies, if any (-f avoids an error when a file does not exist yet)
!rm -f /Cluster_variables.csv
!rm -f /Data_category.csv
!rm -f /INSEE_CODES.csv
!rm -f /PG_2017_CLAIMS_YEAR0.csv
!rm -f /PG_2017_YEAR0.csv

GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.Cluster_variables_ID, dest_path=constants.Cluster_variables, unzip=False)
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.Data_category_ID, dest_path=constants.Data_category, unzip=False)
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.INSEE_CODES_ID, dest_path=constants.INSEE_CODES, unzip=False)
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.PG_2017_CLAIMS_YEAR0_ID, dest_path=constants.PG_2017_CLAIMS_YEAR0, unzip=False)
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.PG_2017_YEAR0_ID, dest_path=constants.PG_2017_YEAR0, unzip=False)

Import the dataset:
- PG_2017_YEAR0 is the underwriting dataset with 100,000 insured policies at time 0. It contains 31 features for each policy.
- Thus, it's a matrix of size 100000 x 31.

In [0]:
insured_data = pd.read_csv(constants.PG_2017_YEAR0, header=0)
insured_data.head(10)

Import the claims dataset:
- Every row of this file contains a specific claim associated with a policy.
- 'PG_2017_CLAIMS_YEAR0.csv' has fewer rows than 'PG_2017_YEAR0.csv', since the majority of the policies had no claims during 2017.

In [0]:
claims_data = pd.read_csv(constants.PG_2017_CLAIMS_YEAR0, header=0)
claims_data.head(10)

Note that the claims data doesn't have the "id_policy" needed to join the two dataframes. We create it by concatenating the "id_client" and "id_vehicle" columns:

In [0]:
claims_data["id_policy"] = claims_data["id_client"] + "-" + claims_data["id_vehicle"]

Now that we have the same "id_policy" in both dataframes, we can merge them in order to have all the claims made for
each policy:

In [0]:
merged_data = pd.merge(insured_data, claims_data[["id_policy", "claim_nb", "claim_amount"]], on="id_policy", how="left")
merged_data.head(10)

We can now compute the total number of claims, and the total claims amount per policy:

In [0]:
total_claims_number = merged_data.groupby("id_policy", as_index=False)[["claim_nb"]].sum()
total_claims_amount = merged_data.groupby("id_policy", as_index=False)[["claim_amount"]].sum()

We now have the complete dataset on which we will perform statistical analysis after cleaning it, of course.

In [0]:
complete_data = reduce(lambda left, right: pd.merge(left, right, on="id_policy", how="left"),
                       [insured_data, total_claims_number, total_claims_amount])
complete_data[['claim_nb', 'claim_amount']] = complete_data[['claim_nb', 'claim_amount']].fillna(0)
complete_data.head(10)
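
To make the aggregate-then-join logic concrete, here is a minimal sketch on hypothetical toy data (the policy ids and amounts are invented for illustration, and the sketch aggregates the claims table directly rather than the merged table used in this notebook). It shows why the left join matters: a policy without any claim survives the join and ends up with totals of 0.

```python
import pandas as pd

# Hypothetical toy data: three policies, two of which have claims.
policies = pd.DataFrame({"id_policy": ["A-V1", "B-V1", "C-V1"]})
claims = pd.DataFrame({
    "id_policy": ["A-V1", "A-V1", "C-V1"],
    "claim_nb": [1, 1, 1],
    "claim_amount": [100.0, 250.0, 80.0],
})

# One row per policy that has at least one claim, with its totals.
totals = claims.groupby("id_policy", as_index=False)[["claim_nb", "claim_amount"]].sum()

# The left join keeps policy B-V1; its totals are missing (NaN) until filled with 0.
toy = pd.merge(policies, totals, on="id_policy", how="left")
toy[["claim_nb", "claim_amount"]] = toy[["claim_nb", "claim_amount"]].fillna(0)
print(toy)
```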

It's always good practice to explore the dataset before training any ML algorithms on it.
This is what we will briefly do here.

In [0]:
complete_data.info()

What can we say with this information?

1) The data seems very clean!

2) There seems to be a missing value in "vh_age". We replace it with the mean "vh_age" of the policies that share the same "vh_model". This is called mean substitution, and it is one of the most popular imputation techniques (though not necessarily the best one).


In [0]:
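# transform() returns a result aligned with the original index, so it can be assigned back directly
complete_data["vh_age"] = complete_data.groupby("vh_model")["vh_age"].transform(lambda x: x.fillna(x.mean()))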
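
As an illustration of group-wise mean substitution, the following sketch applies the same idea to a hypothetical five-row table (the model names and ages are invented). The missing age in model "m1" is replaced by the mean of the other "m1" rows only, not by the global mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing vehicle age in model "m1".
toy = pd.DataFrame({
    "vh_model": ["m1", "m1", "m1", "m2", "m2"],
    "vh_age": [4.0, np.nan, 8.0, 10.0, 12.0],
})

# Fill each missing value with the mean age of its own vh_model group.
toy["vh_age"] = toy.groupby("vh_model")["vh_age"].transform(lambda s: s.fillna(s.mean()))
print(toy)
```

Note that if every value of a group is missing, the group mean is itself NaN and the gap remains.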

We compute basic statistics on our dataset.

How many policies does the dataset contain?

In [0]:
"Total number of policies: {0:2d}".format(complete_data['id_policy'].nunique())

How many clients?

In [0]:
"Total number of clients: {0:2d}".format(complete_data['id_client'].nunique())

Statistical description of the numerical columns of the dataset:

In [0]:
complete_data.describe()

Do you notice anything strange? What about drv_age_lic1 vs. drv_age1 and drv_age_lic2 vs. drv_age2?

Does having vh_cyl values at 0 make sense? Same question with vh_value and vh_weight.

What should we do with these rows (i.e. data points)? What could be the cause behind this anomaly? For the time being, we will simply
delete these rows from our analysis. In practice we would like to understand why these errors are present, and correct
them if possible (verification with other datasets).

In [0]:
complete_data = complete_data[complete_data["drv_age1"] >= complete_data["drv_age_lic1"]+17]

Since certain policies have no secondary driver, we have to get rid of the zeros to study the columns that are related
to the secondary drivers: drv_age2, drv_sex2, and drv_age_lic2:

In [0]:
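# Build the mask from complete_data itself (it has been filtered, so its index no longer matches insured_data)
second_drivers_data = complete_data.loc[complete_data["drv_drv2"] == "Yes", ["drv_age2", "drv_sex2", "drv_age_lic2"]]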
second_drivers_data.describe()

In [0]:
"Number of inadequate second driver ages: {0:2d}".format(len(second_drivers_data[second_drivers_data["drv_age2"] <
                                                                   second_drivers_data["drv_age_lic2"]+16]))

What's happening here?

In [0]:
second_drivers_data[second_drivers_data["drv_age2"] <
                          second_drivers_data["drv_age_lic2"]]["drv_age_lic2"].unique()[0]

Again, we get rid of these rows to continue our analysis, even though they represent a fairly large amount of data.
In practice, we would have to understand the origin of the error. This isn't possible here:

In [0]:
complete_data = complete_data[(complete_data["drv_age2"] >= complete_data["drv_age_lic2"]+16) |
                              (complete_data["drv_drv2"] == "No")]
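
The filter above combines two conditions with a logical OR. A minimal sketch on hypothetical rows (the values are invented) shows which rows survive: policies whose second driver has a plausible age, plus every policy without a second driver.

```python
import pandas as pd

# Hypothetical toy data: one consistent second driver, one inconsistent, one policy without second driver.
toy = pd.DataFrame({
    "drv_drv2": ["Yes", "Yes", "No"],
    "drv_age2": [40, 20, 0],
    "drv_age_lic2": [20, 10, 0],
})

min_licence_age = 16  # same threshold as in the cell above

# A second driver is consistent if their age is at least licence age + 16.
consistent = toy["drv_age2"] >= toy["drv_age_lic2"] + min_licence_age

# Keep consistent second drivers, and all policies with no second driver.
kept = toy[consistent | (toy["drv_drv2"] == "No")]
print(kept)
```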

Let's replace the vh_value of 0 with the average value of the same vh_model:

In [0]:
# transform() keeps the original index, so the result can be assigned back directly
complete_data["vh_value"] = complete_data.groupby("vh_model")["vh_value"].transform(lambda x: x.replace(0, x[x > 0].mean()))

We do the same thing with the weight:

In [0]:
# transform() keeps the original index, so the result can be assigned back directly
complete_data["vh_weight"] = complete_data.groupby("vh_model")["vh_weight"].transform(lambda x: x.replace(0, x[x > 0].mean()))

Now that we know that the dataset is finally clean, we can start visualizing the data!

In [0]:
clean_data = complete_data

We generate simple bar charts for the categorical variables, and that gives us a good first overview of our insured
portfolio:

In [0]:
# Create a single figure, so that subplots_adjust below applies to the figure we draw on
figure = plt.figure(figsize=(22,30))
figure.subplots_adjust(hspace=0.4, wspace=0.4)

# What's the first driver's sex distribution on these policies?
plt.subplot(3, 3, 1)
clean_data['drv_sex1'].value_counts().plot(kind='bar')
plt.title("Sex distribution of the auto insurance policies (first driver)")
plt.xlabel("Sex")
plt.ylabel("Policies")

# What's the secondary driver's sex distribution on these policies?
plt.subplot(3, 3, 2)
clean_data.loc[clean_data['drv_drv2'] == "Yes", "drv_sex2"].value_counts().plot(kind='bar')
plt.title("Sex distribution of the auto insurance policies (secondary driver)")
plt.xlabel("Sex")
plt.ylabel("Policies")

# What's the coverage distribution of the insurance portfolio?
plt.subplot(3, 3, 3)
clean_data['pol_coverage'].value_counts().plot(kind='bar')
plt.title("Coverage policies distribution")
plt.xlabel("Type")
plt.ylabel("Policies")

# What's the mileage-based policy subscription proportion of the portfolio?
plt.subplot(3, 3, 4)
clean_data['pol_payd'].value_counts().plot(kind='bar')
plt.title("Mileage-based policy distribution")
plt.xlabel("Subscription")
plt.ylabel("Policies")

# What's the payment frequency distribution of the policies?
plt.subplot(3, 3, 5)
clean_data['pol_pay_freq'].value_counts().plot(kind='bar')
plt.title("Payment frequency distribution")
plt.xlabel("Payment Frequency")
plt.ylabel("Policies")

# What's the vehicle usage distribution?
plt.subplot(3, 3, 6)
clean_data['pol_usage'].value_counts().plot(kind='bar')
plt.title("Vehicle usage distribution")
plt.xlabel("Usage")
plt.ylabel("Policies")

# What's the fuel type distribution?
plt.subplot(3, 3, 7)
clean_data['vh_fuel'].value_counts().plot(kind='bar')
plt.title("Fuel type distribution")
plt.xlabel("Fuel type")
plt.ylabel("Policies")

# How many policies have a second driver?
plt.subplot(3, 3, 8)
clean_data['drv_drv2'].value_counts().plot(kind='bar')
plt.title("Secondary drivers distribution")
plt.xlabel("Secondary driver")
plt.ylabel("Policies")

# What's the vehicle type distribution?
plt.subplot(3, 3, 9)
clean_data['vh_type'].value_counts().plot(kind='bar')
plt.title("Vehicle type distribution")
plt.xlabel("Vehicle type")
plt.ylabel("Policies")
plt.show()

For categorical variables with a large number of classes, bar charts are not really appropriate to analyze the data.

We create a pie chart describing the 5 most popular car makers in our portfolio:

In [0]:
# as_matrix() was removed from pandas; to_numpy() returns the same NumPy array
[car_makers, car_numbers] = [clean_data["vh_make"].value_counts().index.tolist(),
                             clean_data["vh_make"].value_counts().to_numpy()]
others_number = car_numbers[5:].sum()
car_makers, car_numbers = car_makers[:5], car_numbers[:5]
car_makers.append("OTHERS")
car_numbers = np.append(car_numbers, others_number)

plt.figure(figsize=(10,10))
plt.pie(car_numbers, labels=car_makers, autopct='%1.1f%%')
plt.title("Distribution of automakers")
plt.show()
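
The "keep the top N categories and pool the rest" step can be wrapped in a small helper. The function below, `top_n_with_others`, is a hypothetical name introduced for this sketch and is not part of pandas; the makes are invented.

```python
import pandas as pd

def top_n_with_others(series, n, other_label="OTHERS"):
    """Count the values of `series`, keep the n most frequent, pool the rest."""
    counts = series.value_counts()          # sorted from most to least frequent
    top = counts.iloc[:n]
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top

# Hypothetical toy data: four makes with different frequencies.
makes = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"])
grouped = top_n_with_others(makes, n=2)
print(grouped)
```

The resulting index and values can be passed to `plt.pie` as labels and sizes, in the same way as in the cell above.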

Let's analyze the geographical data.

We will import the commune names from the table INSEE_CODES.csv that I created from the official INSEE table
"Table d'appartenance géographique des communes au 1er janvier 2011" taken from
 https://www.insee.fr/fr/information/2028028. More information on each INSEE code can be extracted from this table,
some of which is very useful for pricing (e.g. Tranche aire urbaine (TDUU2010), catégorie de commune (CATAEU2010) and
 population (POP_MUN_2009)). See the documentation for the exact meaning of these new variables.

In [0]:
commune_data = pd.read_csv(constants.INSEE_CODES, header=0)
commune_data.head(10)

We join the commune data to the clean data by the INSEE code in order to have our final dataset that could be used for
modeling:

In [0]:
final_data = pd.merge(clean_data, commune_data, left_on="pol_insee_code", right_on="code_geographique", how="left")

We create the department variable, which simply is the first two digits of the insee code, and we visualize them in a pie chart:

In [0]:
final_data["department"] = final_data["pol_insee_code"].astype(str).str[:2]

# as_matrix() was removed from pandas; to_numpy() returns the same NumPy array
[department, department_numbers] = [final_data["department"].value_counts().index.tolist(),
                                    final_data["department"].value_counts().to_numpy()]

others_number = department_numbers[10:].sum()
department, department_numbers = department[:10], department_numbers[:10]
department.append("OTHERS")
department_numbers = np.append(department_numbers, others_number)

plt.figure(figsize=(12,12))
plt.pie(department_numbers, labels=department, autopct='%1.1f%%')
plt.title("Departments distribution")
plt.show()
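
One pitfall worth checking when taking the first two characters of an INSEE code: if the column was parsed as integers, codes of departments 01 to 09 lose their leading zero. Whether this happens with this dataset is an assumption to verify (it depends on how `pol_insee_code` was read); the sketch below uses illustrative codes to show the effect and a possible fix with zero-padding to 5 characters.

```python
import pandas as pd

# Illustrative codes stored as integers: 01004 has lost its leading zero.
codes_as_int = pd.Series([1004, 59350, 75056])

# Naive extraction: the first code is wrongly assigned to department "10".
naive = codes_as_int.astype(str).str[:2]

# Padding back to 5 characters before slicing restores department "01".
padded = codes_as_int.astype(str).str.zfill(5).str[:2]

print(naive.tolist())
print(padded.tolist())
```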

Surprisingly, a lot of policies are in the North of France (department 59). In second place, we have department 75,
which is Paris. In third place, we have department 69, Rhône, which contains Lyon. The policies aren't concentrated
in one specific city/department, which is good.

Now, let's analyze some continuous and discrete variables:

In [0]:
# Create a single figure, so that subplots_adjust below applies to the figure we draw on
figure = plt.figure(figsize=(18,18))
figure.subplots_adjust(hspace=0.4, wspace=0.4)

# What's the bonus-malus coefficient distribution on these policies?
plt.subplot(3, 3, 1)
plt.hist(final_data['pol_bonus'])
plt.title("Bonus-malus coefficient distribution")
plt.xlabel("Coefficient")
plt.ylabel("Policies")

# What's the policy duration distribution on these policies?
plt.subplot(3, 3, 2)
plt.hist(final_data['pol_duration'])
plt.title("Policies duration distribution")
plt.xlabel("Duration")
plt.ylabel("Policies")

# What's the situation duration distribution of the insurance portfolio?
plt.subplot(3, 3, 3)
plt.hist(final_data['pol_sit_duration'])
plt.title("Situation duration distribution")
plt.xlabel("Duration")
plt.ylabel("Policies")

# What's the first driver's age distribution of the policies?
plt.subplot(3, 3, 4)
plt.hist(final_data['drv_age1'])
plt.title("First driver's age distribution")
plt.xlabel("Age")
plt.ylabel("Policies")

# What's the second driver's age distribution of the policies?
plt.subplot(3, 3, 5)
plt.hist(final_data.loc[final_data["drv_drv2"] == "Yes", 'drv_age2'])
plt.title("Second driver's age distribution")
plt.xlabel("Age")
plt.ylabel("Policies")

# What's the first driver's licence age?
plt.subplot(3, 3, 6)
plt.hist(final_data['drv_age_lic1'])
plt.title("First driver's licence age distribution")
plt.xlabel("Licence age")
plt.ylabel("Policies")

# What's the second driver's licence age?
plt.subplot(3, 3, 7)
plt.hist(final_data.loc[final_data['drv_drv2'] == 'Yes', 'drv_age_lic2'])
plt.title("Second driver's licence age distribution")
plt.xlabel("Licence age")
plt.ylabel("Policies")

# How many years ago did the marketing of the insured vehicles end?
plt.subplot(3, 3, 8)
plt.hist(final_data['vh_sale_end'])
plt.title("Years since the end of marketing of the vehicle")
plt.xlabel("Years")
plt.ylabel("Policies")

# What's the vehicle value distribution?
plt.subplot(3, 3, 9)
plt.hist(final_data['vh_value'])
plt.title("Vehicle value distribution")
plt.xlabel("Value (euros)")
plt.ylabel("Policies")
plt.show()

We will now analyze the target variables a little more, i.e. the number and amount of claims per policy.

What are the claim frequencies?

In [0]:
final_data["claim_nb"].value_counts()

It's also important to quantify the dependence relationship between our exogenous variables, and the endogenous
variable. It will come in handy during the feature selection. We'll only do a couple, so that you can also do your own
analysis.

We start with the linear correlation coefficients between the continuous and discrete variables.

In [0]:
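# Restrict to numeric columns: recent pandas versions raise an error on non-numeric ones
final_data.select_dtypes(include=[np.number]).corr()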

The variables that have a high correlation coefficient with claim_nb or claim_amount might be the most
important predictive variables in the future models. More advanced feature selection techniques need to be used.

Linear relationship between vh_value and claim_amount:

In [0]:
regression = linear_model.LinearRegression()
# Keep only policies with a positive claim amount below 10000 (as_matrix() was removed from pandas)
x = final_data.loc[(final_data["claim_amount"] > 0) & (final_data["claim_amount"] < 10000), "vh_value"].\
    to_numpy().reshape((-1,1))
y = final_data.loc[(final_data["claim_amount"] > 0) & (final_data["claim_amount"] < 10000), "claim_amount"].\
    to_numpy().reshape((-1,1))

We fit the regression:

In [0]:
regression.fit(x, y)
prediction = regression.predict(x)

plt.scatter(x, y, color='black')
plt.plot(x, prediction, color='red')
plt.xlabel("Vehicle value")
plt.ylabel("Claim amount")
plt.title("Claim amount regressed over vehicle value")
plt.show()

In [0]:
"Regression coefficient: "+str(regression.coef_[0][0])

In [0]:
"Regression R^2 coefficient: {0:2f}".format(r2_score(y, prediction))

Clearly, a simple linear regression isn't complex enough to generate useful predictions. More complex ML algorithms are needed!
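
For readers who want to see what the R^2 coefficient measures, here is a minimal NumPy sketch of its usual definition, one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are introduced only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)        # predictions equal to the targets
mean_only = r_squared(y_true, [2.5] * 4)   # always predicting the mean of the targets
print(perfect, mean_only)
```

A model that always predicts the mean gets an R^2 of 0, so a value close to 0 means the regression explains almost none of the variation in claim amounts.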

Relationship between drv_sex1 and claim_nb:

In [0]:
final_data.groupby("drv_sex1")["claim_nb"].describe()

In [0]:
final_data[final_data['claim_nb'] > 0].groupby("drv_sex1")["claim_nb"].describe()

Relationship between pol_coverage and claim_amount:

In [0]:
final_data.groupby("pol_coverage")["claim_amount"].describe()

Relationship between pol_usage and claim_amount, then claim_nb:

In [0]:
final_data.groupby("pol_usage")["claim_amount"].describe()

In [0]:
final_data.groupby("pol_usage")["claim_nb"].describe()

### Part 2) Unsupervised learning
In this part, we will analyze the clustering produced by the K-means algorithm.
As you've seen this morning, K-means is an unsupervised learning algorithm which aims to partition a dataset of features into K clusters.

For computational reasons, we have already pre-computed the clusters using the sklearn library with K = 5 clusters (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html if you want to implement it yourself).

The file 'Cluster_variables.csv' contains, for each cluster, the average value of each feature.
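
If you want to experiment with K-means yourself, the following is a minimal sketch with scikit-learn on synthetic 2-D data. It is not the procedure used to produce 'Cluster_variables.csv' (the preprocessing applied to the insurance features is not described here); it only shows how cluster sizes and per-cluster feature averages, the kind of quantities stored in that file, can be obtained.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs of 50 points.
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Fit K-means with K = 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_              # cluster index of each point
centers = kmeans.cluster_centers_    # average value of each feature, per cluster
sizes = np.bincount(labels)          # number of points per cluster
print(sizes)
print(centers)
```

On real data with features on very different scales, standardizing the features first is usually advisable, since K-means relies on Euclidean distances.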

In [0]:
cluster_variable = pd.read_csv(constants.Cluster_variables, header=0)
cluster_variable

We generate some plots of interesting features among the different clusters. 

In [0]:
# Create a single figure, so that subplots_adjust below applies to the figure we draw on
figure = plt.figure(figsize=(22,30))
figure.subplots_adjust(hspace=0.4, wspace=0.4)
lin_space = np.linspace(1, 5,5)

# What's the size of each cluster in terms of number of policy/cluster?
plt.subplot(3, 3, 1)
plt.bar(lin_space, cluster_variable["Size"])
plt.title("1) Size (number of policies)", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Size", fontsize=22)

# What's the average aggregate claim per cluster?
plt.subplot(3, 3, 2)
plt.bar(lin_space, cluster_variable["ag_claims"])
plt.title("2) Average claim size", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Average claim", fontsize=22)

# What proportion subscribed to a mileage-based policy, i.e. the premium paid is based on miles driven?
plt.subplot(3, 3, 3)
plt.bar(lin_space, cluster_variable["pol_payd(1=Yes)"])
plt.title("3) Subscribed prop. to a mileage-based policy", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Proportion", fontsize=22)

# What's the proportion of retiree?
plt.subplot(3, 3, 4)
plt.bar(lin_space, cluster_variable["pol_usage(Retired)"])
plt.title("4) Retiree proportion", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Proportion", fontsize=22)

# What's the proportion of professional usage?
plt.subplot(3, 3, 5)
plt.bar(lin_space, cluster_variable["pol_usage(Professional)"])
plt.title("5) Professional proportion", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Proportion", fontsize=22)

# What's the population density?
plt.subplot(3, 3, 6)
plt.bar(lin_space, cluster_variable["dens_pop"])
plt.title("6) Population density", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Density", fontsize=22)

# What's the proportion of Male as first driver?
plt.subplot(3, 3, 7)
plt.bar(lin_space, cluster_variable["drv_sex1(1=Male)"])
plt.title("7) Proportion of Male as first driver", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Proportion", fontsize=22)

# What's the vehicle value?
plt.subplot(3, 3, 8)
plt.bar(lin_space, cluster_variable["vh_value"])
plt.title("8) Vehicle value distribution", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Value", fontsize=22)

# What's the vehicle speed?
plt.subplot(3, 3, 9)
plt.bar(lin_space, cluster_variable["vh_speed"])
plt.title("9) Vehicle speed distribution", fontsize=22)
plt.xlabel("Cluster", fontsize=22)
plt.ylabel("Speed", fontsize=22)
plt.show()


#### 2.1) Analysis of the resulting clusters
- Based on the 9 plots shown before, which cluster of policies should be considered the lowest/highest risk cluster?

#### Analysis of the lowest risk cluster: number 3.
- Average aggregate claim: smallest by a large margin (Plot 2).
- Average car value: by far the smallest (Plot 8). 
- Car speed: slower on average (Plot 9).
- Population density: significantly smaller, which implies less risk (Plot 6).  
- Male first driver: significantly fewer than in the other clusters (Plot 7).
- Policy usage: more 'Retirees' and fewer 'Professionals' in this cluster (Plots 4 and 5).
- Mileage-based policy subscription: a bit higher than the other clusters (Plot 3).

#### Analysis of the highest risk cluster: number 5.
- Average aggregate claim: highest by a large margin (Plot 2).
- Average car value: by far the highest (Plot 8). 
- Car speed: faster on average (Plot 9).
- Male first driver: significantly more males as first driver (Plot 7).
- Population density: on the higher end, smaller than cluster 4 (see below for a comment on cluster 4) (Plot 6). 
- Policy usage: significantly more 'Professionals' (Plot 5).
- Mileage-based policy subscription: on the lower end (Plot 3).

#### Additional notes:
- Cluster 4: the population density is by far the largest amongst the 5 clusters (Plot 6). If you look at the INSEE table of this cluster, almost everyone lives in department 75, which is... Paris!
- Cluster 3: contains the largest number of policies, thus we can have a high level of confidence in the conclusions we've drawn.
- Cluster 5: contains the smallest number of policies by far (less than 2% of the policies). Thus, in practice, we would advocate keeping in mind that the number of policies is small before drawing conclusions.

### Part 3: Introduction to the deep learning library Keras

#### Goals:
- A) High-level introduction to the Keras library for supervised learning tasks.
- B) On a toy example (the iris dataset), use a neural network to predict the type of flower (classification), then to solve a regression task.

A) Importing the Keras library
- Keras is a high-level deep learning library that is considered to be user-friendly, modular and extensible.
- Keras can be used to compile different types of deep neural networks which you will see throughout this week: multilayer perceptrons (today), convolutional and recurrent networks (tomorrow), etc.
- The implementation of a neural network with Keras requires a small number of lines of code. The training part (estimation of parameters) is also optimized such that it can be done automatically for you (more details about this tomorrow, see stochastic gradient descent and backprop algorithm). 

In [0]:
import keras
from keras.models import Sequential, load_model      
from keras.layers import Dense       # 'Dense' layer is a fully-connected layer in Keras, i.e. the layer used in an MLP 
from keras.utils import np_utils     # For the one-hot encoding, see below. 

#### B.1.1) First supervised learning task
- We will use a toy example for computational reasons: the Iris dataset.

Let's first download the iris dataset
- The iris dataset is really simple. It consists of 150 flowers described by 4 features (sepal length, sepal width, petal length, petal width) and belonging to three species (setosa, versicolor and virginica).
- We will try to predict the type of flower based on the 4 features.

In [0]:
from sklearn.datasets import load_iris
iris_data = load_iris() 
features = iris_data['data']
targets = iris_data['target']

In [0]:
print("examples of features:")
print (features[0:10,:])
print("species")
print(targets)

For a multi-class classification task, it's best practice to one-hot encode the targets.
- A one-hot encoding is simply a vector of zeros everywhere except for a '1' at the position of the class.
- Ex: for the iris dataset, we have three classes. Class '0' will be encoded as [1,0,0], class '1' as [0,1,0] and class '2' as [0,0,1].

One-hot encoding is done in the following cell:

In [0]:
dummy_y = np_utils.to_categorical(targets)
print(dummy_y)
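
To see what the encoding does without relying on Keras, here is an equivalent sketch in plain NumPy: indexing the identity matrix with the class labels yields one row per example, and `argmax` recovers the original labels. The label values are an arbitrary example.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # arbitrary example labels for 3 classes
n_classes = 3

# Row k of the identity matrix is the one-hot vector of class k.
one_hot = np.eye(n_classes)[labels]

# argmax over each row goes back from one-hot vectors to class labels.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```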

Let's split our dataset into a Train|Test set of 70%/30%. 
- This is done with the function 'train_test_split'. 

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, dummy_y, test_size=0.3, random_state=0)

Let's compile our first neural network model to predict the type of flower.
- A single hidden layer of 5 neurons with the 'tanh' activation function.

In [0]:
model = Sequential()

# Hidden layer: 5 neurons and 'tanh' activation function (input_dim gives the number of input features)
model.add(Dense(units=5, activation='tanh', input_dim=x_train.shape[-1]))

# Output layer: need to use softmax as we have a classification problem
model.add(Dense(3, activation='softmax'))

# Compile the model by specifying the loss function (cross entropy), the optimizer (Adam) and optional metric to output ('accuracy')
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50, batch_size=10)

Let's evaluate the performance of the resulting neural network on the Train|Test sets.
- In the context of multi-class classification, the function 'predict' of Keras takes as input a set of features, and outputs the probability of being in each class.
- i.e. model.predict(x_train) outputs a discrete probability distribution over the 3 types of flowers for each example in our training set. 
- The predicted flower will be the highest probability class. 

In [0]:
y_pred_train = model.predict(x_train)
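
The step from network outputs to a predicted class can be illustrated in NumPy alone. The sketch below applies a softmax to hypothetical raw scores (the values are invented and do not come from the model above), checks that each row is a probability distribution, and picks the most probable class.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 examples and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 3.0]])

probs = softmax(logits)            # each row sums to 1
predicted = probs.argmax(axis=1)   # index of the highest-probability class

print(probs)
print(predicted)
```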

Let's analyze the predicted discrete probability distribution for the first 5 examples in the train set

In [0]:
print("Probability distribution for the first 5 examples of the Train set:")
print(y_pred_train[0:5])

and the predicted class for the same five examples vs the true label (true type of flower). 

In [0]:
print("Predicted flower for the first 5 examples (goes from 0 to 2):")
print(np.argmax(y_pred_train[0:5],axis=1))
print("True type of flowers for the first 5 examples")
print(np.argmax(y_train[0:5],axis=1))

Let's compute the accuracy on the train and test set.

In [0]:
print("Train set accuracy:")
print(np.sum(np.argmax(y_pred_train, axis=1)==np.argmax(y_train, axis=1))/x_train.shape[0])
print("Number of errors on the Train set out of %d examples:" %(x_train.shape[0]))
print(x_train.shape[0] - np.sum(np.argmax(y_pred_train, axis=1)==np.argmax(y_train, axis=1)))

print("Test set accuracy:")
y_pred_test = model.predict(x_test)
print(np.sum(np.argmax(y_pred_test, axis=1)==np.argmax(y_test, axis=1))/x_test.shape[0])
print("Number of errors on the Test set out of %d examples:" %(x_test.shape[0]))
print(x_test.shape[0] - np.sum(np.argmax(y_pred_test, axis=1)==np.argmax(y_test, axis=1)))

#### B.1.2) Predict the sepal width based on the other features

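Still with the 'iris' dataset, let's predict the value of the 'sepal width' (the 2nd feature) based on the other three.
- This is a regression task, as the target takes real values.

In [0]:
# In scikit-learn's iris data, the columns are: sepal length, sepal width, petal length, petal width
sepal_width_index = 1
features_without_sepal_width = np.delete(features, sepal_width_index, axis=1)
sepal_width = features[:, sepal_width_index]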

Again, let's split our dataset into a Train|Test set of 70%/30%.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features_without_sepal_width, sepal_width, test_size=0.3, random_state=0)

Let's compile our neural network. Notice two important differences:
- 1) Output layer: must be of size 1 and the activation is no longer 'softmax', it's a 'linear' activation function.
- 2) Loss function: mean-square error (MSE) instead of cross-entropy. MSE is the default loss function to use in most regression tasks.
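
To make the two loss functions concrete, here is a small NumPy sketch of their usual definitions, evaluated on invented values. The helper names are introduced for this example; Keras computes these losses internally when given `'categorical_crossentropy'` or `'mean_squared_error'`.

```python
import numpy as np

def categorical_cross_entropy(y_one_hot, probs, eps=1e-12):
    """Average over examples of -sum_k y_k * log(p_k)."""
    probs = np.clip(probs, eps, 1.0)   # avoid log(0)
    return float(-np.mean(np.sum(y_one_hot * np.log(probs), axis=1)))

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between predictions and targets."""
    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

# Classification: the true class gets probability 0.5 in both examples.
y_one_hot = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])
ce = categorical_cross_entropy(y_one_hot, probs)

# Regression: one prediction is off by 0.5, the other is exact.
mse = mean_squared_error([3.0, 2.5], [3.5, 2.5])

print(ce, mse)
```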

In [0]:
model = Sequential()

# Hidden layer: 5 neurons and 'tanh' activation function (input_dim gives the number of input features)
model.add(Dense(units=5, activation='tanh', input_dim=x_train.shape[-1]))

# Output layer: default activation function is 'linear'
model.add(Dense(1, activation='linear'))

# Compile the model by specifying the loss function (MSE), the optimizer (Adam).
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train,y_train,epochs=50,batch_size=10)

Let's evaluate the performance on the Train|Test set from the resulting neural network.
- In the context of a regression task, the function 'predict' of Keras takes as input a set of features, and outputs a scalar that represents the prediction of the target (sepal width)

In [0]:
y_pred_train = model.predict(x_train)
print("Predicted sepal width for the first 5 examples")
print(y_pred_train[0:5])
print("True sepal width for the first 5 examples")
print(y_train[0:5])

Let's evaluate the total performance on the Train and Test set of our model.

In [0]:
print("Mean-square error obtained on the Train set:")
print(np.average((y_pred_train[:,0]-y_train)**2))

print("Mean-square error obtained on the Test set:")
y_pred_test = model.predict(x_test)
print(np.average((y_pred_test[:,0]-y_test)**2))

## Part 4: Hyperparameter search with two popular methods: grid search and random search

### 4.1) Grid search algorithm for model tuning and study of learning curves.

- A) Search over the following simple grid (each combination will be tested):
    - {1,2} layers
    - {5,10,15,20} neurons/layer
    - {'relu', 'sigmoid', 'tanh'} activation function

For the rest of this notebook, we will work on the classification task of the iris dataset (i.e. predict the type of flower). 
- Note: grid search and random search are applicable to any machine learning task that requires tuning of the hyperparameters.
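
The grid described above can be enumerated explicitly. The sketch below uses `itertools.product` as a compact alternative to nested loops and confirms that the grid contains 2 x 4 x 3 = 24 combinations; the variable names are chosen for this example.

```python
from itertools import product

layers = [1, 2]
neurons = [5, 10, 15, 20]
activations = ['relu', 'sigmoid', 'tanh']

# Every (layers, neurons, activation) combination, in nested-loop order.
grid = list(product(layers, neurons, activations))

print(len(grid))
for nbs_layer, nbs_neurons, activation in grid[:3]:
    print(nbs_layer, nbs_neurons, activation)
```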

In [0]:
# 1) Reload the dataset
iris_data = load_iris() 
features = iris_data['data']
targets = iris_data['target']

# 2) one-hot encoding of the targets
dummy_y = np_utils.to_categorical(targets)

# 3) Split 70%|30% for train and validation set
x_train, x_valid, y_train, y_valid = train_test_split(features, dummy_y, test_size=0.3, random_state=0)

In [0]:
from keras.callbacks import EarlyStopping
# Early-stopping criterion: if the validation loss ('val_loss') has not improved
# by at least 0.01 (min_delta) for 5 consecutive epochs (patience):
# - stop, and restore the weights of the best epoch.

nb_epoch = 100
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=5,
                          verbose=1, mode='auto', restore_best_weights=True)

# Grid search
nbs_layers_range = np.array([1,2])               # 1 or 2 hidden layers
nbs_neurons_range = np.array([5,10,15,20])       # for each hidden layer, either {5,10,15,20} hidden neurons
activation_range = ['relu', 'sigmoid', 'tanh']   

# Statistics to keep track of during the optimization
valid_loss_best = 99999999
best_nbs_layer = 999999999
best_nbs_neurons = 9999999
best_batch_size = 99999999
best_activation = None       # set by the grid search loop below
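
The effect of `min_delta` and `patience` can be simulated on a list of validation losses. The function below is a simplified sketch of the stopping rule, written for illustration; it is not the Keras implementation, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # number of consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement of more than min_delta: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Invented validation losses: large improvements first, then a plateau.
losses = [1.0, 0.8, 0.7, 0.695, 0.70, 0.71, 0.70, 0.699, 0.72, 0.5]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

In this example the small improvement from 0.7 to 0.695 is below `min_delta`, so it does not reset the counter, and training stops before the later drop to 0.5 is ever seen. This is the trade-off made when choosing a short patience.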

In [0]:
# Loop over the different combination on the grid
for i in range(len(nbs_layers_range)):
    nbs_layer = nbs_layers_range[i]
    for j in range(len(nbs_neurons_range)):
        nbs_neurons = nbs_neurons_range[j]
        for k in range(len(activation_range)):
            activation = activation_range[k]
            
            print("Current set of HP: %d hidden layers, %d number of neurons, %s activation" %(nbs_layer, nbs_neurons, activation))
            
            # Compile the model:
            model = Sequential()
            
            # First hidden layer
            model.add(Dense(units = nbs_neurons, activation = activation, input_dim = x_train.shape[1]))
            
            # Check if we have to add a second hidden layer on top of the first one
            if(nbs_layer == 2):
                model.add(Dense(units = nbs_neurons, activation = activation))
            
            # Output layer
            model.add(Dense(3, activation='softmax'))
            model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
             
            # With the given set of hyperparameters, train the model 
            model.fit(x_train, y_train, epochs=nb_epoch, batch_size=10, verbose = 1,  
                      validation_data=(x_valid, y_valid), callbacks=[earlystop])
            
            # Evaluate the validation loss of the trained model
            [val_loss, val_accu] = model.evaluate(x_valid, y_valid)
            
            # If it's the best model so far:
            # - Save the model and update the statistics
            if (val_loss < valid_loss_best):
                valid_loss_best = val_loss
                best_nbs_layer = nbs_layer
                best_nbs_neurons = nbs_neurons
                best_activation = activation
                #model.save("best_model_grid_search.h5")

# Print the resulting best model:
print("Best results obtained:")
print("Best number of neurons: %d" %(best_nbs_neurons))
print("Best nbs layers: %d" %(best_nbs_layer))
print("Best activation layer: %s" %(best_activation))

Some notes:
- Best resulting model we got:
    - Number of neurons: 20
    - Nbs layers: 2
    - Activation function: tanh
- Even with our toy example (complete dataset of 150 examples) and a small grid of hyperparameters (24 combinations), the optimization with Grid Search takes a significant amount of time;
- Random search (next method) is known to be computationally much more efficient than Grid Search.
    - This does not mean that Random search will always provide a better set of hyperparameters than Grid Search;
    - It means that on average, for a given fixed computational cost, Random search will provide a better set of hyperparameters than Grid search.
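
To make the cost argument concrete, the sketch below enumerates the grid used above with `itertools.product` and then draws the same number of configurations at random. The ranges are copied from the grid-search cell; the uniform sampling of the neuron count in [5, 20] and the seed are illustrative assumptions, not part of the notebook.

```python
import itertools
import random

# Ranges copied from the grid-search cell
nbs_layers_grid = [1, 2]
nbs_neurons_grid = [5, 10, 15, 20]
activation_grid = ['relu', 'sigmoid', 'tanh']

# Grid search trains one model per combination of values
grid = list(itertools.product(nbs_layers_grid, nbs_neurons_grid, activation_grid))
print("Grid search trains %d models" % len(grid))

# Random search with the same budget
# (assumption: the neuron count is drawn uniformly from {5, ..., 20})
rng = random.Random(1)  # illustrative seed
budget = len(grid)
samples = [(rng.choice(nbs_layers_grid), rng.randint(5, 20), rng.choice(activation_grid))
           for _ in range(budget)]

# The grid can only ever try 4 distinct neuron counts;
# random sampling is free to try any value in the range
grid_neuron_values = {n for _, n, _ in grid}
random_neuron_values = {n for _, n, _ in samples}
print("Distinct neuron counts on the grid: %d" % len(grid_neuron_values))
print("Distinct neuron counts in the random sample: %d" % len(random_neuron_values))
```

Adding a fourth hyperparameter with, say, 5 values would multiply the grid size by 5, while the random-search budget stays whatever we choose it to be.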

### 4.1.1) Compute the learning curves of the best model found by Grid Search
- We need to retrain our model with the best set of hyperparameters if we want to monitor the learning curves.
- Note: the learning curves could have been computed during Grid Search itself,
    - but it would be suboptimal;
    - Reason: we would need to compute the learning curves for EACH combination of hyperparameters tested.

In [0]:
# 1) Compile the model with the best set of HP's we got when we ran the code:
#    - Number of neurons: 20
#    - Nbs layers: 2
#    - Activation function: tanh
import random
random.seed(30)
model_lc_grid_search = Sequential()
model_lc_grid_search.add(Dense(units = 20, activation = 'tanh', input_dim = x_train.shape[1]))
model_lc_grid_search.add(Dense(units = 20, activation = 'tanh'))
model_lc_grid_search.add(Dense(3, activation='softmax'))
model_lc_grid_search.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# 2) Train the model for nb_epoch epochs, 1 epoch at a time. 
# - At the end of each epoch, monitor the loss and accuracy on each dataset (train|valid)
nb_epoch = 300
batch_size = 10
results_tensor = np.zeros((nb_epoch, 2, 2))  # [epoch, dataset (0=train, 1=valid), metric (0=loss, 1=accuracy)]

# Fit the model 1 epoch at a time. 
for i in range(nb_epoch):
    model_lc_grid_search.fit(x_train, y_train, epochs=1, batch_size=batch_size,verbose=1)
    
    # note: model_lc_grid_search.evaluate returns a list of values: [loss, accuracy]
    results_tensor[i,0,:] = model_lc_grid_search.evaluate(x_train, y_train, verbose = 0)  # train set computation
    results_tensor[i,1,:] = model_lc_grid_search.evaluate(x_valid, y_valid, verbose = 0)  # valid set computation

In [0]:
# 1) Accuracy plots
lin_nb_epoch = np.arange(1, results_tensor.shape[0] + 1)  # epochs 1..nb_epoch, taken from the results array
fig = plt.figure(figsize=(5, 5), dpi=100)
plt.plot(lin_nb_epoch, results_tensor[:,0,1], lin_nb_epoch, results_tensor[:,1,1])
plt.title('Figure 1 - Accuracy - Grid Search')
plt.legend(['Train', 'Valid'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()

# 2) Loss plots
fig = plt.figure(figsize=(5, 5), dpi=100)
plt.plot(lin_nb_epoch, results_tensor[:,0,0], lin_nb_epoch, results_tensor[:,1,0])
plt.title('Figure 2 - Loss - Grid Search')
plt.legend(['Train', 'Valid'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
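
The curves above can also be summarised by a single number: the epoch at which the validation loss is lowest. The helper below is a minimal sketch; `best_epoch` and the toy loss values are hypothetical and only illustrate the idea.

```python
def best_epoch(valid_losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    best_idx = min(range(len(valid_losses)), key=lambda i: valid_losses[i])
    return best_idx + 1, valid_losses[best_idx]

# Hypothetical validation-loss curve: it improves, then degrades (overfitting)
toy_valid_loss = [1.10, 0.80, 0.55, 0.42, 0.40, 0.43, 0.47, 0.52]
epoch, loss = best_epoch(toy_valid_loss)
print("Lowest validation loss %.2f reached at epoch %d" % (loss, epoch))
```

With the array filled in the training cell, the validation losses would be `list(results_tensor[:, 1, 0])`.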

### 4.2) Random search algorithm for model tuning
- A) Define a space of possible HPs from which we want to find an optimal set.
- B) Random search: 
    - For a fixed number of trials, randomly sample a set of HPs from the defined space;
    - Choose the set of HPs which minimizes the loss (here, the categorical cross-entropy) on the valid set.
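
Steps A) and B) can be sketched independently of Keras. In the example below, `random_search`, `sample_hp` and `toy_validation_loss` are hypothetical names: the toy loss stands in for "train a model and return its validation loss", and the bounds are illustrative assumptions for this section's search space.

```python
import random

def random_search(sample_hp, evaluate, n_trials, seed=0):
    """Sample n_trials sets of HPs and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_hp, best_loss = None, float('inf')
    for _ in range(n_trials):
        hp = sample_hp(rng)
        loss = evaluate(hp)
        if loss < best_loss:
            best_hp, best_loss = hp, loss
    return best_hp, best_loss

def sample_hp(rng):
    # A) The search space (illustrative bounds)
    return {
        'nbs_layers': rng.randint(1, 2),               # both bounds included
        'nbs_neurons': rng.randint(5, 30),
        'activation': rng.choice(['relu', 'sigmoid', 'tanh']),
        'learning_rate': rng.uniform(0.0001, 0.01),
    }

def toy_validation_loss(hp):
    # Hypothetical stand-in for training a model and evaluating it on the valid set
    return abs(hp['nbs_neurons'] - 20) / 20.0 + abs(hp['learning_rate'] - 0.005) * 100

# B) Run the search for a fixed number of trials
best_hp, best_loss = random_search(sample_hp, toy_validation_loss, n_trials=20, seed=1)
print("Best HPs:", best_hp)
print("Best (toy) validation loss: %.4f" % best_loss)
```

Note that Python's `random.randint` includes the upper bound, whereas `np.random.randint` excludes it.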

In [0]:
nb_epoch = 100
batch_size = 5
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=15,
                          verbose=1, mode='auto', restore_best_weights=True)

# Search space - define the boundaries for each HP
nbs_layers_range = np.array([1,2])               # [min, max] number of hidden layers: 1 or 2
nbs_neurons_range = np.array([5,30])             # [min, max] number of neurons: within {5,6,...,30}
activation_range = ['relu', 'sigmoid', 'tanh']   
lr_range = np.array([0.0001,0.01])               # learning rate within [0.0001, 0.01]

# Statistics to keep track of during the optimization
valid_loss_best = 999999
best_nbs_layer = 99999
best_nbs_neurons = 99999
best_lr = 99999

# Number of iterations to be done in the random search
nbs_iteration = 20

In [0]:
from keras.optimizers import Adam

# loop over the number of iterations
for i in range(nbs_iteration):
    
    # 1) At the beginning of each iteration, randomly sample a set of HP's
    #    (np.random.randint excludes `high`, hence the +1)
    nbs_layers = np.random.randint(low=nbs_layers_range[0], high=nbs_layers_range[1]+1)
    nbs_neurons = np.random.randint(low=nbs_neurons_range[0], high=nbs_neurons_range[1]+1)
    acti_idx = np.random.randint(low=0, high=len(activation_range))
    activation = activation_range[acti_idx]
    learning_rate = np.random.uniform(low=lr_range[0], high=lr_range[1])
    print("Iteration %d --- Nbs layers: %d, Nbs neurons: %d, learning rate: %.4f, activation function: %s" % (i+1, nbs_layers, nbs_neurons, learning_rate, activation))
    
    # Build the model:
    model = Sequential()
    model.add(Dense(units = nbs_neurons, activation = activation, input_dim = x_train.shape[1]))
            
    # Check if we have to add a second hidden layer on top of the first one
    # (uses the value sampled above, not the leftover `nbs_layer` from the grid search)
    if(nbs_layers == 2):
        model.add(Dense(units = nbs_neurons, activation = activation))
            
    # Output layer
    model.add(Dense(3, activation='softmax'))
    # Pass the sampled learning rate to Adam (older Keras versions name this argument `lr`)
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=learning_rate), metrics=['accuracy'])
             
    # With the given set of hyperparameters, train the model 
    model.fit(x_train, y_train, epochs=nb_epoch, batch_size=batch_size, verbose = 1,  
            validation_data=(x_valid, y_valid), callbacks=[earlystop])
            
    # Evaluate the validation loss of the trained model
    [val_loss, val_accu] = model.evaluate(x_valid, y_valid)
            
    # If it's the best model so far:
    # - Save the model and update the statistics
    if (val_loss < valid_loss_best):
        valid_loss_best = val_loss
        best_nbs_layer = nbs_layers
        best_nbs_neurons = nbs_neurons
        best_activation = activation
        best_lr = learning_rate
        #model.save("best_model_random_search.h5")
        
# Print the resulting best model:
print("Best results obtained:")
print("Best number of neurons: %d" %(best_nbs_neurons))
print("Best nbs layers: %d" %(best_nbs_layer))
print("Best activation layer: %s" %(best_activation))
print("Best learning rate: %.4f" %(best_lr))

### 4.3) Train|Valid loss and accuracy for grid search and random search
- For this last part, we will download and load the models that were already optimized with grid search and random search;
- For grid search, we got:
    - Best number of neurons: 20
    - Best nbs layers: 2
    - Best activation layer: tanh
- For random search, we got:
    - Best number of neurons: 30
    - Best nbs layers: 2
    - Best activation layer: tanh
    - Best learning rate: 0.0043

In [0]:
# Load the pre-run best models
class Constants:
    
    SEED = 1

    # Paths to pre-run models
    best_model_grid_search_final = '/best_model_grid_search_final.h5'
    best_model_random_search_final = '/best_model_random_search_final.h5'
    
    # Google drive id to be able to download from drive
    best_model_grid_search_final_ID = '17_f7xutqw0GwYRC-hJ9-1C9UVk5iL3q3'
    best_model_random_search_final_ID = '1913m7CWOvO4c2BBCLyvdoPImE0xCQeCi'

constants = Constants
random.seed(constants.SEED)

!rm /best_model_grid_search_final.h5
!rm /best_model_random_search_final.h5

GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.best_model_grid_search_final_ID, dest_path=constants.best_model_grid_search_final, unzip=False)
GoogleDriveDownloader.download_file_from_google_drive(file_id=constants.best_model_random_search_final_ID, dest_path=constants.best_model_random_search_final, unzip=False)

# Test to see everything is working as intended
from keras.models import load_model
best_model_grid_search_final = load_model(constants.best_model_grid_search_final)
best_model_random_search_final = load_model(constants.best_model_random_search_final)

print("Model found by random search:")
best_model_random_search_final.summary()
print("-----------------------------")
print("Model found by grid search:")
best_model_grid_search_final.summary()
print("-----------------------------")

In [0]:
# 1) Compute the Train set loss and accuracy with each model
[loss_train_random_search, acc_train_random_search] = best_model_random_search_final.evaluate(x_train,y_train, verbose=0)
[loss_train_grid, acc_train_grid] = best_model_grid_search_final.evaluate(x_train,y_train, verbose=0)

print("Loss Train set")
print("Random search: %.4f" % loss_train_random_search)
print("Grid search: %.4f" % loss_train_grid)
print("------------------------------------------")
print("Accuracy Train set")
print("Random search: %.4f" % acc_train_random_search)
print("Grid search: %.4f" % acc_train_grid)
print("------------------------------------------")

# 2) Compute the Valid set loss and accuracy with each model
[loss_valid_random_search, acc_valid_random_search] = best_model_random_search_final.evaluate(x_valid,y_valid, verbose=0)
[loss_valid_grid, acc_valid_grid] = best_model_grid_search_final.evaluate(x_valid,y_valid, verbose=0)

print("Loss Valid set")
print("Random search: %.4f" % loss_valid_random_search)
print("Grid search: %.4f" % loss_valid_grid)
print("------------------------------------------")
print("Accuracy Valid set")
print("Random search: %.4f" % acc_valid_random_search)
print("Grid search: %.4f" % acc_valid_grid)
print("------------------------------------------")

Conclusion:
- 1) For a fixed computational budget, Random search generally finds better hyperparameters than Grid search, because the cost of Grid search grows multiplicatively with each hyperparameter added to the grid. 
- 2) The optimization procedures shown in this notebook (grid search and random search) are applicable to many machine learning algorithms. 

### Part 5) Insurance classification ('homework')
- Based on what you've learned about Keras and multiclass classification, build a classification MLP that predicts the car brand ['vh_make'] given the car's characteristics (features).
- Note:
    - Total number of vehicle makes: 101
    - Top 5: Renault, Peugeot, Citroen, Volkswagen, Ford with respective proportions of [26.8%, 19.6%, 15.9%, 5.3%, 4.5%]
    - Cumulative proportion of the other 96 brands: 28%
- Conclusion:
    - A) Good practice: when a problem seems very complex, first tackle a simpler version of it to check that it can be solved at all:
        - i.e.: build an MLP to classify a car as either [Renault, Peugeot, Citroen, Volkswagen, Ford, others]. 
    - B) If part A) works, try to generalize your model to more vehicle makes. 
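
The simplification in A) amounts to relabelling every make outside the top k as 'others'. Below is a minimal sketch of that relabelling; `group_rare_makes` and the toy list are hypothetical, and on the real data the input would be the values of the 'vh_make' column.

```python
from collections import Counter

def group_rare_makes(makes, top_k=5, other_label='others'):
    """Keep the top_k most frequent makes and relabel all the others."""
    counts = Counter(makes)
    top = {make for make, _ in counts.most_common(top_k)}
    return [m if m in top else other_label for m in makes]

# Toy example with top_k=2 (illustrative values, not the insurance data)
toy_makes = ['RENAULT', 'PEUGEOT', 'RENAULT', 'FIAT', 'CITROEN', 'RENAULT', 'PEUGEOT', 'SEAT']
grouped = group_rare_makes(toy_makes, top_k=2)
print(grouped)
print(Counter(grouped))
```

The grouped labels can then be one-hot encoded exactly like the original ones, giving 6 classes instead of 101 when `top_k=5`.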

Here are some suggestions for you to follow:
- 1) Split the insurance dataset 'clean_data' into a set of features and targets:
    - Targets: one-hot encoding of the possible car brands 
        - Hint: there's a one-line function to do this (see what we've done previously with the Iris dataset).  
    - Features: Your choice.
        - Don't necessarily need to use all of them.
    
- 2) Split the resulting dataset into Train|Valid sets
    - Hint: one line of code. 

- 3) Try a simple neural network (no hyperparameter search):
    - This is to validate that the task is possible to solve;
    - For ex: 2 layers, 50 neurons/layer with relu activation function, could it potentially work?
    - For the neural network implementation: look at the multiclass classification task implemented on the Iris dataset. 
    
- 4) If the implementation in step 3) shows potential, use a hyperparameter search to optimize the results:
    - Suggestion: only try Random Search. 
    
- 5) If step 4) shows potential, try a more complex task by adding more possible classes (i.e. more car brands!)
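
To judge whether a model "shows potential" in steps 3) and 4), it helps to know the accuracy of a trivial baseline that always predicts the most frequent class. The sketch below uses the proportions quoted in the note above for the 6-class task (top 5 makes plus 'others'). It assumes these proportions also hold on your validation set, which is only approximately true after a random split.

```python
# Class proportions quoted above; 'others' is the remainder
proportions = {
    'renault': 0.268,
    'peugeot': 0.196,
    'citroen': 0.159,
    'volkswagen': 0.053,
    'ford': 0.045,
}
proportions['others'] = 1.0 - sum(proportions.values())

# A classifier that always predicts the most frequent class
majority_class = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_class]
print("Always predicting '%s' gives about %.1f%% accuracy"
      % (majority_class, 100 * baseline_accuracy))
```

An MLP whose validation accuracy stays close to this value has not learned anything useful from the features.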