# AutoML Using AutoGluon
### Notebook Setup
- Select instance size:  ml.m5.4xlarge (16vCPU + 64GB)
- Install the following libraries

### AutoML Training Approach
AutoML training is applied in three stages, each with different training criteria.
- **Model1:**  baseline model using 50% of training data and excluding KNN and neural network models
- **Model2:**  advanced training using 50% of training data with hyperparameter optimization and including neural network models
- **Model3:**  the best performing approach from model 1 and model 2 will be selected to train on the entire training set

In [None]:
# !pip install -U pip
# !pip install -U setuptools wheel
# !pip install -U "mxnet<2.0.0" bokeh==2.0.1
# !pip install autogluon --no-cache-dir

In [None]:
import pandas as pd
import numpy as np
import boto3
from autogluon.tabular import TabularPredictor
import altair as alt

import shap
shap.initjs()

# set name of S3 bucket
s3_bucket = 'traffic-data-bucket'

## 1. Import data

In [None]:
df = pd.read_parquet('s3://traffic-data-bucket/model_data/model_data_post_transformation.parquet', engine='auto')

In [None]:
df.head()

## 2. Data preprocessing
### 2.1 Check for missing values

In [None]:
percent_missing = df.isnull().sum() / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

missing_value_df.sort_values('percent_missing', ascending=False)

### 2.2 Drop year 2014
Year 2014 is dropped from the dataset because some features, such as prior-year collisions, are not available for it.

In [None]:
sorted(df['collision_year'].unique())

In [None]:
df = df[df['collision_year'] != 2014]

### 2.3 Feature selection
Feature selection will take place by selecting features based on the following categories:
- Street
- Time and date
- Hexagon
- Weather

#### Angular distance for day of the week
The day of the week is currently a numeric feature, but a plain number does not capture the circular nature of the week: the last day of one week is adjacent to the first day of the next, even though their numeric codes are far apart. To capture the actual distance between days, we map each day to an angle on a circle (day × 2π/7) and use the sine and cosine of that angle as features. We will also convert the month to a categorical variable to see if that changes performance.

In [None]:
df['day_of_week_sin'] = np.sin(df['collision_dayofweek'] * (2 * np.pi / 7))
df['day_of_week_cos'] = np.cos(df['collision_dayofweek'] * (2 * np.pi / 7))
df.drop('collision_dayofweek', axis=1, inplace=True)
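To see why the sine and cosine pair captures the weekly cycle, the sketch below compares distances between encoded days. It assumes days are numbered 0 to 6 (the numbering used by `collision_dayofweek` is not stated in this notebook), and `encode_day` and `encoded_distance` are illustrative helper names, not part of the pipeline.

```python
import numpy as np

def encode_day(day, period=7):
    """Map a day number onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * day / period
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(day_a, day_b, period=7):
    """Euclidean distance between two encoded days."""
    return float(np.linalg.norm(encode_day(day_a, period) - encode_day(day_b, period)))

# day 6 and day 0 are adjacent on the circle even though |6 - 0| = 6
wrap_around = encoded_distance(6, 0)
neighbours = encoded_distance(0, 1)
half_week = encoded_distance(6, 3)

print(f"day 6 to day 0: {wrap_around:.3f}")
print(f"day 0 to day 1: {neighbours:.3f}")
print(f"day 6 to day 3: {half_week:.3f}")
```

With this encoding, the wrap-around pair is exactly as close as any other pair of consecutive days, which a single numeric column cannot express.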

In [None]:
street_features = ['la_data_city_name', 
                     'node_street_count', 'node_stop', 'node_traffic_signals',
                     'edge_speed_kph_max', 'edge_speek_kph_min',
                     'edge_lanes_max', 'edge_motorway_flag', 'edge_motorway_link_flag',
                     'edge_living_street_flag', 'edge_bridge_flag', 'edge_oneway_flag',
                     'edge_tunnel_flag', 'amenities_bar_cnt', 'amenities_school_cnt',
                     'amenities_restaurant_cnt', 'amenities_college_cnt',
                     'drv_edge_lanes_max_imputed_flag']

time_features = ['drv_collision_hour_sin','drv_collision_hour_cos',
                 'collision_month', 'drv_holiday_flag', 'day_of_week_sin', 'day_of_week_cos' # add cosine and sine for day of the week
                ]

hex_history_features = ['prev1_yr_coll_cnt', 'prev1_yr_coll_neighbor1']

weather_features = ['noaa_wind_speed', 'noaa_precipitation',
                    'noaa_temperature_average', 'noaa_temperature_max',
                    'noaa_temperature_min']

# include the target
model_features = street_features +  time_features + hex_history_features +  weather_features + ['target']

### 2.4 Feature encoding
Review data types.

In [None]:
df[model_features].dtypes

AutoGluon does not recognize `Int64` or `Float64` data types so these columns need to be converted to `int64` and `float64`.

In [None]:
for column in df.columns:
    # pandas nullable dtypes are capitalized; cast them to the NumPy equivalents
    if df[column].dtype == 'Int64':
        df[column] = df[column].astype('int64')
    if df[column].dtype == 'Float64':
        df[column] = df[column].astype('float64')
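A plain integer cast fails when a nullable integer column still contains missing values. The sketch below is one possible way to handle that case, assuming it is acceptable to fall back to `float64` (with `NaN`) for such columns. The helper name `to_numpy_dtypes` and the demo data are made up for illustration.

```python
import pandas as pd

def to_numpy_dtypes(frame):
    """Return a copy with nullable Int64/Float64 columns cast to NumPy dtypes."""
    converted = frame.copy()
    for column in converted.columns:
        dtype = converted[column].dtype
        if isinstance(dtype, pd.Int64Dtype):
            # int64 cannot hold missing values, so fall back to float64 when needed
            has_missing = converted[column].isna().any()
            converted[column] = converted[column].astype('float64' if has_missing else 'int64')
        elif isinstance(dtype, pd.Float64Dtype):
            converted[column] = converted[column].astype('float64')
    return converted

# made-up demo data, not taken from the collision dataset
demo = pd.DataFrame({
    'complete_int': pd.array([1, 2, 3], dtype='Int64'),
    'int_with_gap': pd.array([1, None, 3], dtype='Int64'),
    'nullable_float': pd.array([0.5, None, 1.5], dtype='Float64'),
})
converted_demo = to_numpy_dtypes(demo)
print(converted_demo.dtypes)
```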

Set the city name as category.

In [None]:
print('Number of cities in Los Angeles County:', df['la_data_city_name'].nunique())
df['la_data_city_name'] = df['la_data_city_name'].astype('category')

Subtract the previous year collision count from the previous year collision neighbor.

In [None]:
df['prev1_yr_coll_neighbor1'] = df['prev1_yr_coll_neighbor1'] - df['prev1_yr_coll_cnt']

In [None]:
df[model_features].dtypes

### 2.5 Split data into train-test-validation and drop 2020 data from train and test sets

In [None]:
label = 'target'

# training set with target
df_train = df[df['ttv_split'] == 'Train']
df_train = df_train[df_train['collision_year'] != 2020][model_features]

# test set for model tuning
df_test = df[df['ttv_split'] == 'Test']
df_test = df_test[df_test['collision_year'] != 2020][model_features]

# all data for model predictions containing 2020 data
df_all = df.copy()
df_all = df_all[model_features]

# target 
y_train = df_train[label]
y_test = df_test[label]

# drop target label
X_train = df_train.drop(columns=[label])
X_test = df_test.drop(columns=[label])
X_all = df_all.drop(columns=[label])

Check the distribution between the negative (0) and positive (1) classes.

In [None]:
df_train.groupby('target', as_index=False)['target'].count().plot(kind='bar', title='Distribution of target class')
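Alongside the bar chart, it can help to look at the class shares as numbers. The following is a minimal sketch on a made-up label series; `class_balance` is a hypothetical helper, and in the notebook it would presumably be called on `df_train['target']`.

```python
import pandas as pd

def class_balance(labels):
    """Count each class and compute its share of all rows."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'share': counts / counts.sum()})

# made-up labels: four negatives and one positive
toy_target = pd.Series([0, 0, 0, 0, 1], name='target')
balance = class_balance(toy_target)
print(balance)
```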

### 2.6 Create a sample dataset using 50% of data for initial training and evaluation

In [None]:
df_train_sample = df_train.sample(frac=0.50, random_state=42)

In [None]:
df_train_sample.shape

## 3. Model training
Three models are trained, each with different training criteria.
- **Model1:**  baseline model using 50% of training data and excluding KNN and neural network models
- **Model2:**  advanced training using 50% of training data with hyperparameter optimization and including neural network models
- **Model3:**  the best performing approach from model 1 and model 2 will be selected to train on the entire training set

If you already have a saved model, you can load it into the predictor to save training time.  Uncomment the cell below to load it.

In [None]:
# # where the model is stored
# load_path = 'agModels-baseModel_updated'
# # load into the predictor
# predictor = TabularPredictor.load(load_path)

### 3.1 Baseline model using sample subset data
This model is trained on 50% of the training data, without any feature encoding, and excludes KNN and neural network models.
- Training time: 10 minutes
- No hyperparameter optimization

In [None]:
# choose where to store the model
save_path = 'agModels-base_model'

# select training data
training_data = df_train_sample

# set training time
training_time = 10*60

# set quality of models
model_quality = 'best_quality'

# model classes to be excluded from training
excluded_model_types = ['KNN', 'NN_TORCH']

predictor = TabularPredictor(label="target", problem_type = 'binary', eval_metric='roc_auc', learner_kwargs={'positive_class':1}, path=save_path
                            ).fit(train_data=training_data, time_limit=training_time, presets=model_quality, excluded_model_types=excluded_model_types)

In [None]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

#### 3.1.1 Generate leaderboard of ranking model performance

In [None]:
baseline_leaderboard = predictor.leaderboard()

Define a function to output a chart that shows the scores for the various trained models.

In [None]:
def leaderboard_chart(leaderboard, title, subtitle, color_scheme):
    '''
    Arguments:
        title: title of the chart in string format
        subtitle: subtitle of the chart in string format
        leaderboard: predictor leaderboard object
        color_scheme: Altair color scheme
    
    '''
    
    # create base chart
    base = alt.Chart(leaderboard).mark_bar(size=10).encode(
        x=alt.X('model:N', axis=alt.Axis(title='Model'), sort='-y'),
        y=alt.Y('score_val:Q', axis=alt.Axis(title='ROC AUC score', format=",.2f"), scale=alt.Scale(domain=(0,1.0)))
    )

    # apply color to bars only
    bars = base.mark_bar().encode(
        color=alt.Color('score_val:Q', legend=None, scale=alt.Scale(scheme=color_scheme))
    )

    # lay text over base chart
    text = base.mark_text(
        align='center',
        baseline='middle',
        dy=-10  # nudges text upward so it sits above the bar instead of on top of it
    ).encode(
        text=alt.Text('score_val:Q', format=",.3f")
    )

    # combine all charts
    combined_chart = (bars + text).properties(width=700, title={'text':[title],
                                               'subtitle':[subtitle,''],
                                               'subtitleFont':'Segoe UI',
                                               'subtitleFontSize':14
                                              }).configure_axis(
        grid=False, # remove gridlines
    ).configure_title(
        anchor='start',
        fontSize=20
    )
    
    return combined_chart

In [None]:
leaderboard_chart(baseline_leaderboard, 'Model 1 Performance', 'Training done on 50% of data using AutoGluon and no hyper parameter optimization','tealblues')

#### 3.1.2 Predict probabilities on test set

In [None]:
# use the .predict_proba method to create class probabilities for the test set
y_pred = predictor.predict_proba(X_test)
# print("Predictions:  \n", y_pred)

#### 3.1.3 Evaluate predictions against ground truth using test set

In [None]:
perf_baseline_model = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
perf_baseline_model
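The predictors in this notebook are scored with `roc_auc`. As a reminder of what that number means, the sketch below computes ROC AUC on a small made-up example with scikit-learn's `roc_auc_score` and checks it against the pairwise definition: the share of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are invented for illustration.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# made-up labels and predicted probabilities of the positive class
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(toy_labels, toy_scores)

# pairwise definition: a pair counts 1 if the positive outranks the negative, 0.5 on ties
positives = [s for s, y in zip(toy_scores, toy_labels) if y == 1]
negatives = [s for s, y in zip(toy_scores, toy_labels) if y == 0]
pairs = list(product(positives, negatives))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(auc, pairwise_auc)
```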

#### 3.1.4 Generate predictions on the entire dataset and upload to S3
Let's start by creating a function that assembles the predictions for `df_all`, which represents the entire dataset for years 2015 to 2020, into a dataframe for scoring.  The year 2014 was dropped during preprocessing.

In [None]:
def predictions_df(model_name, prediction_values):
    '''
    Arguments:
        model_name:  name of the model
        prediction_values: class probabilities generated using .predict_proba,
            indexed like df, with one column per class
        
    Returns:
        Dataframe with predictions to be used for model scoring
    
    '''
    
    # work on a copy of the identifying columns so the global df is not modified
    df_predictions = df[['hex_id', 'collision_date', 'collision_hour', 'ttv_split']].copy()
    
    # column 1 holds the probability of the positive class; rows align on the index
    df_predictions['prediction'] = prediction_values[1]
    df_predictions['model_name'] = model_name
    
    return df_predictions

In [None]:
predictions = predictor.predict_proba(X_all)

In [None]:
df_output = predictions_df('AutoGluon_Baseline', predictions)

df_output.to_csv(f"s3://{s3_bucket}/model_scoring/individual_model_scores/AutoGluon_Baseline.csv", index=False)

### 3.2 Model with manually selected hyperparameters
#### 3.2.1 Perform one-hot encoding

Create new datasets to store one-hot encoded features.

In [None]:
transformed_df_train_sample = df_train_sample.copy()
transformed_X_test = X_test.copy()
transformed_X_all = X_all.copy()

In [None]:
from sklearn.preprocessing import OneHotEncoder

# fit the encoder once, on the training sample only; cities not seen during
# fitting are encoded as all zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown='ignore')

transformed = ohe.fit_transform(transformed_df_train_sample[['la_data_city_name']])
# add transformed columns back to dataframe
transformed_df_train_sample[ohe.categories_[0]] = transformed.toarray()
transformed_df_train_sample.drop('la_data_city_name', axis=1, inplace=True)

# apply the already fitted encoder to the test set (transform only, no refit)
transformed = ohe.transform(transformed_X_test[['la_data_city_name']])
# add transformed columns back to dataframe
transformed_X_test[ohe.categories_[0]] = transformed.toarray()
transformed_X_test.drop('la_data_city_name', axis=1, inplace=True)

# apply the already fitted encoder to the full dataset (transform only, no refit)
transformed = ohe.transform(transformed_X_all[['la_data_city_name']])
# add transformed columns back to dataframe
transformed_X_all[ohe.categories_[0]] = transformed.toarray()
transformed_X_all.drop('la_data_city_name', axis=1, inplace=True)
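For a model to score new data, the one-hot columns must be identical across the training, test, and full datasets. One way to guarantee this is to fit the encoder once and reuse it everywhere. The sketch below shows the idea on made-up city data; the frame names and the `encode` helper are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up data: 'Malibu' does not appear in the frame used for fitting
train_cities = pd.DataFrame({'city': ['Pasadena', 'Burbank', 'Pasadena']})
new_cities = pd.DataFrame({'city': ['Burbank', 'Malibu']})

# handle_unknown='ignore' encodes unseen categories as a row of zeros
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cities[['city']])

def encode(frame):
    """One-hot encode 'city' with the encoder fitted on the training frame."""
    encoded = encoder.transform(frame[['city']]).toarray()
    return pd.DataFrame(encoded, columns=encoder.categories_[0], index=frame.index)

train_encoded = encode(train_cities)
new_encoded = encode(new_cities)
print(new_encoded)
```

Refitting the encoder on each dataset separately can produce a different set or order of columns whenever the datasets do not contain exactly the same cities.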

#### 3.2.2 Define hyperparameters and train model
- Model types trained:  Neural network, GBM, XGB
- Training data:  sample subset
- Training time: 10 minutes

In [None]:
import autogluon.core as ag

# neural network hyper parameters
nn_options = {  # specifies non-default hyperparameter values for neural network models
    'num_epochs': 10,  # number of training epochs (controls training time of NN models)
    'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),  # activation function used in NN (categorical hyperparameter, default = first entry)
    'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1),  # dropout probability (real-valued hyperparameter)
}

# gbm hyper parameters
gbm_options = {  # specifies non-default hyperparameter values for lightGBM gradient boosted trees
    'num_boost_round': 100,  # number of boosting rounds (controls training time of GBM models)
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36),  # number of leaves in trees (integer hyperparameter)
}

# xgb hyper parameters
xgb_options = {'n_estimators': 1000, 'learning_rate': ag.space.Real(0.01, 0.1, log=True)}  # learning rate searched on log-scale

hyperparameters = {  # hyperparameters of each model type
    'GBM': gbm_options,
    'NN_TORCH': nn_options,  # NOTE: comment this line out if you get errors on Mac OSX
    'XGB': xgb_options,
}  # when a key is missing from the hyperparameters dict, no models of that type are trained

time_limit = 10*60  # train various models for ~10 min
num_trials = 5  # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'auto'  # to tune hyperparameters using random search routine with a local scheduler

hyperparameter_tune_kwargs = {  # HPO is not performed unless hyperparameter_tune_kwargs is specified
    'num_trials': num_trials,
    'scheduler' : 'local',
    'searcher': search_strategy,
}
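The learning rates above are declared with `log=True`, meaning candidates are drawn uniformly in the exponent rather than in the raw value. The sketch below illustrates that idea with NumPy. It is not AutoGluon's searcher, and `sample_log_uniform` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def sample_log_uniform(lower, upper, size, rng):
    """Draw values whose logarithm is uniformly distributed between log(lower) and log(upper)."""
    return np.exp(rng.uniform(np.log(lower), np.log(upper), size=size))

rng = np.random.default_rng(seed=0)

# five candidate learning rates in the same range used for the neural network above
candidate_rates = sample_log_uniform(1e-4, 1e-2, size=5, rng=rng)
print(np.sort(candidate_rates))
```

Sampling on a log scale gives each order of magnitude (1e-4 to 1e-3, and 1e-3 to 1e-2) the same chance of being explored, whereas uniform sampling on the raw scale would place most candidates near the upper end of the range.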

Perform training on one-hot encoded sample training data.

In [None]:
save_path = 'agModels-hpo_model'

training_data = transformed_df_train_sample


# set quality of models
model_quality = 'best_quality'

hpo_predictor = TabularPredictor(label="target", problem_type = 'binary', eval_metric='roc_auc', learner_kwargs={'positive_class':1}, path=save_path
                            ).fit(train_data=training_data, time_limit=time_limit, num_stack_levels=2,
                                  presets=model_quality, hyperparameters=hyperparameters,
                                  hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)

In [None]:
hpo_leaderboard = hpo_predictor.leaderboard()

#### 3.2.3 Predict probabilities on test set

In [None]:
# use the .predict_proba method to create class probabilities for the one-hot encoded test set
y_pred = hpo_predictor.predict_proba(transformed_X_test)
# print("Predictions:  \n", y_pred)

#### 3.2.4 Evaluate predictions against ground truth using test set

In [None]:
perf_hpo_model = hpo_predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

#### 3.2.5 Generate predictions on the entire dataset
Generate predictions for `transformed_X_all`, the one-hot encoded version of the entire dataset for years 2015 to 2020.  The year 2014 was dropped during preprocessing.

In [None]:
predictions = hpo_predictor.predict_proba(transformed_X_all)

#### 3.2.6 Output a dataframe of predictions for the entire dataset and upload to S3

In [None]:
df_output = predictions_df('AutoGluon_HPO', predictions)

df_output.to_csv(f"s3://{s3_bucket}/model_scoring/individual_model_scores/AutoGluon_HPO.csv", index=False)

In [None]:
# feature importance for the HPO model, estimated on a 5,000-row subsample
feature_importance_hpo = hpo_predictor.feature_importance(data=transformed_df_train_sample, subsample_size=5000)

In [None]:
feature_importance_hpo.head(20)

### 3.3 Final model
The model did not seem to improve with manually selected hyperparameters or with one-hot encoding.  The WeightedEnsemble_L2 model came out on top again.

In [None]:
# choose where to store the model
save_path = 'agModels-final_model_updated'

training_data = df_train

# set quality of models
model_quality = 'best_quality'

# model classes to be excluded from training
excluded_model_types = ['KNN']

training_time = 15*60

full_training_predictor = TabularPredictor(label="target", problem_type = 'binary', eval_metric='roc_auc', learner_kwargs={'positive_class':1}, path=save_path
                            ).fit(train_data=training_data, time_limit=training_time, presets=model_quality, excluded_model_types=excluded_model_types)

In [None]:
full_training_leaderboard = full_training_predictor.leaderboard()

Create leaderboard plot

In [None]:
leaderboard_chart(full_training_leaderboard, 'AutoGluon Performance', 'Training done on 100% of data','tealblues')

In [None]:
y_pred = full_training_predictor.predict_proba(X_test)

In [None]:
perf_full_model = full_training_predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

#### Generate predictions on the entire dataset

In [None]:
predictions = full_training_predictor.predict_proba(X_all)

In [None]:
df_output = predictions_df('AutoGluon_Full_Training', predictions)

df_output.to_csv(f"s3://{s3_bucket}/model_scoring/individual_model_scores/AutoGluon_Full_Training.csv", index=False)

## 4. Explain Predictions and Feature Importance

In [None]:
print('positive class:', full_training_predictor.positive_class)

In [None]:
class AutogluonWrapper:
    def __init__(self, predictor, feature_names):
        self.ag_model = predictor
        self.feature_names = feature_names
    
    def predict_binary_prob(self, X):
        if isinstance(X, pd.Series):
            X = X.values.reshape(1,-1)
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X, columns=self.feature_names)
        return self.ag_model.predict_proba(X, as_multiclass=False)

Calculate the mode across all columns in X_train.

In [None]:
mode_df = X_train.mode().iloc[[0]]  # mode() can return several rows when values tie; keep the first row as a single-row baseline
mode_df

Get the feature names from X_train and create a KernelExplainer, which returns Kernel SHAP values that explain particular AutoGluon predictions.

In [None]:
feature_names = X_train.columns

# use the full training predictor
ag_wrapper = AutogluonWrapper(full_training_predictor, feature_names)
explainer = shap.KernelExplainer(ag_wrapper.predict_binary_prob, mode_df)

NSHAP_SAMPLES = 100  # how many samples to use to approximate each Shapley value; larger values will be slower
N_VAL = 50  # how many datapoints from the full dataset (X_all) to interpret predictions for; larger values will be slower
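Kernel SHAP approximates Shapley values by sampling, which is why `NSHAP_SAMPLES` trades accuracy for speed. For intuition, the sketch below computes exact Shapley values for a tiny made-up function by enumerating every feature subset. The function `toy_model`, the inputs, and the all-zeros baseline are assumptions for illustration and are unrelated to the AutoGluon predictor.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, x, baseline):
    """Exact Shapley values; features outside a subset take their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # weight of this subset in the Shapley average
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (value_fn(with_i) - value_fn(without_i))
    return phi

def toy_model(features):
    # one additive term plus one interaction term
    return 2 * features[0] + features[1] * features[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)
print(phi)
```

The values sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split evenly between the two features involved. Exact enumeration grows exponentially with the number of features, which is why a sampling approximation is used for the real model.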

Plot SHAP values aggregated across the first `N_VAL` datapoints of the full dataset (`X_all`).

In [None]:
shap_values = explainer.shap_values(X_all.iloc[0:N_VAL], nsamples=NSHAP_SAMPLES)

In [None]:
shap.summary_plot(shap_values, X_all.iloc[:N_VAL,:])

In [None]:
shap.summary_plot(shap_values, X_all.iloc[:N_VAL,:], plot_type='bar')