# Porto Seguro's Safe Driving Prediction (AutoML Remote AML Compute)

Porto Seguro is one of Brazil’s largest auto and homeowner insurance companies. Inaccuracies in an insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In the [Porto Seguro Safe Driver Prediction competition](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction), the challenge is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. While Porto Seguro has used machine learning for the past 20 years, they’re looking to Kaggle’s machine learning community to explore new, more powerful methods. A more accurate prediction will allow them to further tailor their prices, and hopefully make auto insurance coverage more accessible to more drivers.

Lucky for you, the data scientist on your team has already built a solution to the Porto Seguro problem. This solution notebook has steps to load the data, split it into train and validation sets, and then train, evaluate and retrieve a model with Azure AutoML that will be used for the future challenges.

#### Hint: Use Shift + Enter to run the code cells below. Once the cell marker turns from [*] to a number such as [1], you can be sure the cell has run.

## Import Needed Packages

Import the packages needed for this solution notebook. Three of the most widely used Python packages for machine learning and data analysis are [scikit-learn](https://scikit-learn.org/stable/), [pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started), and [numpy](https://numpy.org/). scikit-learn provides a wide range of clustering, regression and classification algorithms, while pandas and numpy handle tabular and numerical data, which makes them a good choice for data mining and data analysis.

In [2]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import joblib
from sklearn import metrics

## Get Azure ML Workspace to use

In [3]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

## Load data from Azure ML Dataset into Pandas DataFrame

In [4]:
# Load Data
aml_dataset = ws.datasets['porto_seguro_safe_driver_prediction_train']

# Use a Pandas DataFrame just to take a sneak peek at some data and the schema
data_df = aml_dataset.to_pandas_dataframe()
# .to_pandas_dataframe().dropna()
print(data_df.shape)
data_df.head(5)

(595212, 59)


Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


## Split Data into Train and Validation AML Tabular Datasets

Note that for remote AML training you need to use AML Datasets; you cannot submit Pandas DataFrames.

Partitioning data into training, validation, and holdout sets allows you to develop highly accurate models that are relevant to data that you collect in the future, not just the data the model was trained on. 

In machine learning, features are the measurable properties of the object you’re trying to analyze. Typically, features are the columns of the data that you train your model with, minus the label. The label (classification) or target (regression) is the value you want the model to predict; in this dataset it is the `target` column.
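
As a small illustration of the features/label distinction, the sketch below separates a label column from the feature columns of a Pandas DataFrame. The three rows are a subset of the columns shown in the `head()` output earlier in this notebook; the variable names `sample_df`, `X` and `y` are illustrative only and are not used elsewhere.

```python
import pandas as pd

# Three rows and four columns copied from the head() output above.
sample_df = pd.DataFrame({
    "id": [7, 9, 13],
    "target": [0, 0, 0],
    "ps_ind_01": [2, 1, 5],
    "ps_ind_02_cat": [2, 1, 4],
})

label_column = "target"

# y holds the label; X holds every remaining column (the features).
y = sample_df[label_column]
X = sample_df.drop(columns=[label_column])

print(list(X.columns))
print(y.tolist())
```

Note that `drop` returns a new DataFrame and leaves `sample_df` unchanged. The `id` column is still present in `X` here; whether an identifier should be used as a feature is a separate modeling decision.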

In [20]:
# Split in train/validation datasets (Validation=20%, Train=80%)

train_dataset, validation_dataset = aml_dataset.random_split(0.8, seed=0)

# Use Pandas DF only to check the data
train_df = train_dataset.to_pandas_dataframe()
validation_df = validation_dataset.to_pandas_dataframe()

In [21]:
print(train_df.shape)
print(validation_df.shape)

(475857, 59)
(119355, 59)
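
The shapes printed above show that the split is close to, but not exactly, 80/20. A quick check of the arithmetic, using only the row counts from the outputs in this notebook, is sketched below. The explanation in the comments (that rows are assigned at random rather than cut at an exact row count) is an assumption about how the split behaves, not something this notebook verifies.

```python
# Row counts taken from the shape outputs printed in this notebook.
total_rows = 595212
train_rows = 475857
validation_rows = 119355

# The two parts should account for every row exactly once.
assert train_rows + validation_rows == total_rows

train_fraction = train_rows / total_rows
validation_fraction = validation_rows / total_rows

# Assumption: a random split lands near the requested 0.8, not exactly on it.
print(f"train fraction:      {train_fraction:.4f}")
print(f"validation fraction: {validation_fraction:.4f}")
```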


In [28]:
validation_df.head(5)

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
1,22,0,5,1,4,0,0,1,0,0,...,7,1,3,6,1,0,1,0,1,0
2,26,0,5,1,3,1,0,0,0,1,...,4,2,1,5,0,1,0,0,0,1
3,50,0,1,2,1,0,0,0,0,1,...,3,3,1,8,0,0,1,0,0,0
4,58,0,5,1,6,0,1,1,0,0,...,9,1,3,9,0,1,1,0,0,0


## Connect to Remote AML Compute (Existing AML cluster)

In [14]:
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# AmlCompute and ComputeTarget are needed below to create or fetch the cluster
from azureml.core.compute import AmlCompute, ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing training cluster.')
    # Get existing cluster
    # Method 1:
    aml_remote_compute = cts[amlcompute_cluster_name]
    # Method 2:
    # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)

if not found:
    print('Creating a new training cluster...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D13_V2", # for GPU, use "STANDARD_NC12"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 20)
    # Create the cluster.
    aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)

print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned


In [15]:
# For additional details of current AmlCompute status:
aml_remote_compute.get_status()

<azureml.core.compute.amlcompute.AmlComputeStatus at 0x7fe388153e10>

## Train with Azure AutoML automatically searching for the 'best model' (Best algorithms and best hyper-parameters)

### List and select primary metric to drive the AutoML classification problem

In [16]:
from azureml.train import automl

# List of possible primary metrics is here:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
    
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['norm_macro_recall',
 'precision_score_weighted',
 'accuracy',
 'average_precision_score_weighted',
 'AUC_weighted']

## Define AutoML Experiment settings

In [24]:
import logging

# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
# (model names are passed as a list)
# automl_settings = {
#     "whitelist_models": ['XGBoostClassifier']
# }

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='AUC_weighted',
                             # experiment_timeout_minutes= 20,                            
                             training_data=train_dataset,
                             # validation_data=validation_dataset,
                             label_column_name="target", 
                             # blacklist_models=['XGBoostClassifier'],
                             # iteration_timeout_minutes= 5,
                             # iterations=2
                             max_concurrent_iterations=6,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log='automated_ml_errors.log',
                             verbosity= logging.INFO,
                             enable_onnx_compatible_models=False
                             # **automl_settings
                             )

# Explanation of Settings: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings

# AutoMLConfig info on: 
# https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig
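
The commented-out `**automl_settings` line in the cell above relies on Python's dictionary unpacking. The sketch below shows the mechanism with a hypothetical stand-in function, `make_config`, which is not part of the Azure ML SDK; the setting names `iterations` and `iteration_timeout_minutes` are the ones that appear commented out in the cell above.

```python
def make_config(task, primary_metric, **extra_settings):
    """Hypothetical stand-in for a config constructor; NOT the Azure ML API."""
    settings = {"task": task, "primary_metric": primary_metric}
    # Any extra keyword arguments are merged in as additional settings.
    settings.update(extra_settings)
    return settings


# Optional settings kept in one dictionary so they are easy to toggle.
automl_settings = {
    "iterations": 2,
    "iteration_timeout_minutes": 5,
}

# ** unpacks the dictionary into individual keyword arguments.
config = make_config(task="classification",
                     primary_metric="AUC_weighted",
                     **automl_settings)
print(config)
```

Keeping optional settings in a dictionary makes it easy to switch between a quick smoke test (few iterations, short timeouts) and a full run without editing the constructor call itself.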

## Run Experiment (on remote AML Compute) with multiple child runs under the covers

In [25]:
from azureml.core import Experiment

experiment_name = "SDK_remote_porto_seguro_driver_pred"
print(experiment_name)

experiment = Experiment(workspace=ws, 
                        name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))


SDK_remote_porto_seguro_driver_pred
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_ad93b312-3aeb-427d-bffb-35b9807e6e82

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      0:01:49       0.5961    0.5961
         1   StandardScalerWrapper SGD                      0:01:43       0.5789    0.5961

## Explore results with Widget

In [26]:
# Explore the results of automatic training with a Jupyter widget: https://docs.microsoft.com/en-us/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py
from azureml.widgets import RunDetails
RunDetails(run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'NOTSET', …

### Measure Parent Run Time needed for the whole AutoML process 

In [31]:
from datetime import datetime

run_details = run.get_details()

# Like: 2020-01-12T23:11:56.292703Z
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
end_time_utc = datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S")
start_time_utc = datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S")

# Subtract the datetimes directly; time.mktime() would interpret these UTC values as local time
parent_run_time = (end_time_utc - start_time_utc).total_seconds()
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))

Run Timing: --- 5157.0 seconds needed for running the whole Remote AutoML Experiment ---
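
A raw number of seconds is hard to read at a glance. As a minimal sketch, the standard library's `timedelta` can render the value printed above as hours, minutes and seconds; the variable names here are illustrative only.

```python
from datetime import timedelta

# Value printed by the timing cell above.
parent_run_seconds = 5157.0

elapsed = timedelta(seconds=parent_run_seconds)
print(elapsed)  # 1:25:57
```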


## Retrieve the 'Best' Scikit-Learn Model

In [32]:
best_run, fitted_model = run.get_output()
print(best_run)
print('--------')
print(fitted_model)

Run(Experiment: SDK_remote_porto_seguro_driver_pred,
Id: AutoML_ad93b312-3aeb-427d-bffb-35b9807e6e82_36,
Type: azureml.scriptrun,
Status: Completed)
--------
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, is_cross_validation=None,
        is_onnx_compatible=None, logger=None, observer=None, task=None)), ('pref...666666666666, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.13333333333333333]))])


## See files associated with the 'Best run'

In [33]:
print(best_run.get_file_names())

# best_run.download_file('azureml-logs/70_driver_log.txt')

['accuracy_table', 'automl_driver.py', 'azureml-logs/55_azureml-execution-tvmps_63bd8296a41f2401572d6eede84e034084bc73063ece64981d86289777effaa8_d.txt', 'azureml-logs/65_job_prep-tvmps_63bd8296a41f2401572d6eede84e034084bc73063ece64981d86289777effaa8_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_63bd8296a41f2401572d6eede84e034084bc73063ece64981d86289777effaa8_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'confusion_matrix', 'logs/azureml/153_azureml.log', 'logs/azureml/azureml.log', 'logs/azureml/azureml_automl.log', 'outputs/conda_env_v_1_0_0.yml', 'outputs/env_dependencies.json', 'outputs/model.pkl', 'outputs/scoring_file_v_1_0_0.py', 'pipeline_graph.json']


## Make Predictions

### Load Test Dataset from AML Workspace
ISSUE: The Test dataset doesn't have any label column...

In [34]:
# Commented since the Test Dataset doesn't have a label column
# aml_test_dataset = ws.datasets['porto_seguro_safe_driver_prediction_test']

# test_df = aml_test_dataset.to_pandas_dataframe()
# print(test_df.shape)
# test_df.head(5)

### Prep Validation or Test Data: Extract X values (feature columns) from the validation dataset for predicting

In [35]:
import pandas as pd

#Remove Label/y column
if 'target' in validation_df.columns:
    y_validation_df = validation_df.pop('target')

x_validation_df = validation_df

x_validation_df.describe()

Unnamed: 0,id,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
count,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,...,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0,119355.0
mean,744170.72,1.9,1.36,4.42,0.42,0.4,0.39,0.26,0.16,0.19,...,5.44,1.44,2.88,7.55,0.12,0.63,0.56,0.29,0.35,0.15
std,429025.37,1.99,0.66,2.7,0.49,1.35,0.49,0.44,0.37,0.39,...,2.34,1.2,1.7,2.75,0.33,0.48,0.5,0.45,0.48,0.36
min,9.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,372705.5,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,1.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,743546.0,1.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,1.0,3.0,7.0,0.0,1.0,1.0,0.0,0.0,0.0
75%,1114391.0,3.0,2.0,6.0,1.0,0.0,1.0,1.0,0.0,0.0,...,7.0,2.0,4.0,9.0,0.0,1.0,1.0,1.0,1.0,0.0
max,1488021.0,7.0,4.0,11.0,1.0,6.0,1.0,1.0,1.0,1.0,...,19.0,10.0,12.0,22.0,1.0,1.0,1.0,1.0,1.0,1.0


In [36]:
y_validation_df.describe()

count   119355.00
mean         0.04
std          0.19
min          0.00
25%          0.00
50%          0.00
75%          0.00
max          1.00
Name: target, dtype: float64

### Load model in memory

In [37]:
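# Download the best run's model file (listed above as 'outputs/model.pkl') and load it into memory
import joblib
best_run.download_file('outputs/model.pkl', 'model.pkl')
fitted_model = joblib.load('model.pkl')
print(fitted_model)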

Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
        feature_sweeping_config=None, feature_sweeping_timeout=None,
        featurization_config=None, is_cross_validation=None,
        is_onnx_compatible=None, logger=None, observer=None, task=None)), ('pref...666666666666, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.13333333333333333]))])


### Make predictions in bulk

In [38]:
# Try the best model making predictions with the validation dataset
y_predictions = fitted_model.predict(x_validation_df)

print('10 predictions: ')
print(y_predictions[:10])

10 predictions: 
[0 0 0 0 0 0 0 0 0 0]


## Evaluate Model

Evaluating performance is an essential task in machine learning. Because this is a binary classification problem, the data scientist elected to use the AUC-ROC curve. The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate across decision thresholds, and the AUC (Area Under the Curve) summarizes it in a single number: 0.5 corresponds to random guessing and 1.0 to a perfect ranking. It is one of the most widely used evaluation metrics for checking a classification model’s performance.

<img src="https://www.researchgate.net/profile/Oxana_Trifonova/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png"
     alt="Example ROC curves with good (AUC 0.9) and satisfactory (AUC 0.65) results"
     style="float: left; margin-right: 12px; width: 320px; height: 239px;" />

### Calculate the Accuracy with Validation or Test Dataset

In [39]:
from sklearn.metrics import accuracy_score

print('Accuracy with Scikit-Learn model:')
print(accuracy_score(y_validation_df, y_predictions))


Accuracy with Scikit-Learn model:
0.9640316702274727
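
An accuracy of about 96.4% looks strong, but the label summary earlier shows a mean `target` of roughly 0.04, so a model that never predicts a claim would score about the same. The sketch below reproduces that baseline with plain Python. The counts are derived from the outputs in this notebook (119355 validation rows, and 0.9640316702274727 × 119355 ≈ 115062 correct predictions); treating all 115062 correct predictions as class 0 is an assumption that the printed predictions suggest but do not prove.

```python
# Counts derived from the validation outputs in this notebook.
n_rows = 119355
n_negative = 115062            # assumption: every correct prediction was a class-0 row
n_positive = n_rows - n_negative

y_true = [0] * n_negative + [1] * n_positive
y_all_zero = [0] * n_rows      # a "model" that never predicts a claim

# Accuracy = share of rows where the prediction matches the label.
correct = sum(t == p for t, p in zip(y_true, y_all_zero))
baseline_accuracy = correct / n_rows

print(n_positive)
print(baseline_accuracy)
```

Under that assumption the all-zero baseline matches the reported accuracy, which is why accuracy alone is a weak guide on heavily imbalanced data and a ranking metric such as AUC is more informative here.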


In [40]:
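# Note: y_predictions holds hard class labels from predict(), not probabilities.
# If every label is 0 (as in the 10 predictions printed earlier), the ROC curve has no
# intermediate points and the AUC is 0.5. For a threshold-independent AUC, score with
# fitted_model.predict_proba(x_validation_df)[:, 1] instead.
fpr, tpr, thresholds = metrics.roc_curve(y_validation_df, y_predictions)
print('AUC (Area Under the Curve) with Scikit-Learn model:')
metrics.auc(fpr, tpr)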

# AUC with plain LightGBM was: 0.6374553321494826 

AUC (Area Under the Curve) with Scikit-Learn model:


0.5
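
An AUC of exactly 0.5 is what scikit-learn reports when every prediction is the same value, which happens when hard class labels are used in place of scores. The sketch below illustrates the difference on synthetic data (it does not use the Porto Seguro dataset or the AutoML model): the same set of scores gives an uninformative AUC once thresholded into labels, and an informative one when passed as probabilities. All names and the score distribution are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced labels (roughly 4% positives); NOT the Porto Seguro data.
y_true = (rng.random(5000) < 0.04).astype(int)

# Synthetic predicted probabilities: positives score a little higher on average,
# but every score stays below the 0.5 decision threshold.
noise = rng.normal(0.0, 0.05, size=y_true.size)
y_proba = np.clip(0.05 + 0.05 * y_true + noise, 0.0, 0.49)

# Hard labels at a 0.5 threshold: every row becomes class 0.
y_label = (y_proba >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_label)
auc_from_proba = roc_auc_score(y_true, y_proba)

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_proba)
```

With a fitted classifier, the analogous change would be to pass the positive-class column of `predict_proba` to the ROC computation rather than the output of `predict`, assuming the model exposes `predict_proba`.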

## Register Model in AML Model Registry

In [None]:
# TBD