## Experiment 4: Baseline (w/o Fullness feature)
This experiment trains, validates and tunes a regression model that predicts load weight (the `Tons` column) from the following features:

1. **Timestamp:** Datetime attributes of the arriving load
2. **Station:** Facility (Metro Central or Metro South) at which the load arrives
3. **Material:** Dominant material stream (MSW, residential or commercial organics, wood or yard debris) in the load
4. **Vehicle ID:** General type of the arriving vehicle only (e.g. pickup or car)

The dataset is described in more detail in [this notebook](https://app.hex.tech/2737cf3a-31c1-4361-9f90-8dea0b629cf0/hex/fa95f966-0912-42ca-9c83-9e14b785420f/draft/logic). 
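
The datetime attributes listed above are derived from the `TimeStamp` column. As a minimal sketch, assuming month, weekday and hour are the attributes of interest (the exact set PyCaret extracts in this experiment is not shown here), pandas can derive them as follows. The two timestamps match sample rows of the dataset; the names `sample` and `features` are illustrative only.

```python
import pandas as pd

# Two arrival timestamps in the same format as the TimeStamp column
sample = pd.DataFrame(
    {"TimeStamp": pd.to_datetime(["2021-04-04 07:02:21", "2021-04-04 07:15:39"])}
)

# Derive calendar attributes with the .dt accessor
# (assumed attribute set; dayofweek uses Monday=0 ... Sunday=6)
features = pd.DataFrame(
    {
        "month": sample["TimeStamp"].dt.month,
        "weekday": sample["TimeStamp"].dt.dayofweek,
        "hour": sample["TimeStamp"].dt.hour,
    }
)

print(features)
```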

In [1]:
# import packages
import pandas as pd
import numpy as np
from pycaret.regression import (setup, pull, create_model, plot_model,
                                predict_model, save_model)

In [2]:
# import data
path = r'C:\Users\Sherman\OneDrive - Metro\Sherman\Projects\Metro TS Load Weight Prediction\Data\Baseline_noFull.csv'
exp4_data = pd.read_csv(path)

# convert timestamp column to datetime type
exp4_data['TimeStamp'] = pd.to_datetime(exp4_data['TimeStamp'])

In [3]:
exp4_data.head()

,TimeStamp,Station,Material,Vehicle,Tons
0,2021-04-04 07:02:21,Metro South,MSW,Standard Pickup,0.11
1,2021-04-04 07:01:35,Metro South,MSW,Standard Pickup,0.13
2,2021-04-04 07:05:41,Metro South,MSW,Standard Pickup,0.22
3,2021-04-04 07:13:58,Metro South,MSW,Standard Pickup,0.09
4,2021-04-04 07:15:39,Metro South,MSW,Standard Pickup,0.06


In [4]:
# Set up experiment and pre-process data
exp4 = setup(data = exp4_data, 
             target = 'Tons',
             normalize = True,
             session_id = 7512,
             use_gpu = True
            )
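
With `normalize = True`, PyCaret rescales numeric features before training; its default method is z-score scaling. A minimal sketch of that transformation in NumPy, using made-up values rather than data from this experiment (the names `values` and `scaled` are illustrative):

```python
import numpy as np

# Hypothetical numeric feature values (not taken from the dataset)
values = np.array([5.0, 7.0, 9.0, 11.0])

# z-score: subtract the mean, then divide by the population standard deviation
scaled = (values - values.mean()) / values.std()

print(scaled)
```

After scaling, the feature has mean 0 and standard deviation 1. In a real pipeline the mean and standard deviation should be estimated on the training data only and then reused for the hold-out set.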

In [5]:
# Get setup configuration grid
exp4_config = pull()
exp4_config.to_csv('exp4_config.csv', index=False)

In [6]:
# train an extra trees regressor using 10-fold CV
exp4_bestmodel = create_model('et')

# track and dump cv training scores (keep the index: it holds the fold labels)
exp4_training = pull()
exp4_training.to_csv('exp4_training.csv')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,0.5588,1.2208,1.1049,0.7966,0.2858,1.1918
1,0.5604,1.2271,1.1077,0.8002,0.285,1.2431
2,0.5625,1.2536,1.1196,0.7914,0.2888,1.2037
3,0.5566,1.2115,1.1007,0.7962,0.2866,1.1942
4,0.5562,1.2037,1.0971,0.8008,0.2839,1.2111
5,0.5534,1.1953,1.0933,0.7975,0.2842,1.2307
6,0.5557,1.2095,1.0998,0.7983,0.2865,1.2096
7,0.5509,1.1879,1.0899,0.7984,0.2835,1.2482
8,0.558,1.2253,1.1069,0.7959,0.2838,1.2304
9,0.564,1.2479,1.1171,0.7976,0.2862,1.2056
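
The per-fold scores above can be condensed into a mean and standard deviation per metric. A minimal sketch with pandas, using only the MAE and R2 columns copied from the table (the names `cv_scores` and `summary` are illustrative):

```python
import pandas as pd

# MAE and R2 per fold, copied from the cross-validation table above
cv_scores = pd.DataFrame(
    {
        "MAE": [0.5588, 0.5604, 0.5625, 0.5566, 0.5562,
                0.5534, 0.5557, 0.5509, 0.5580, 0.5640],
        "R2": [0.7966, 0.8002, 0.7914, 0.7962, 0.8008,
               0.7975, 0.7983, 0.7984, 0.7959, 0.7976],
    }
)
cv_scores.index.name = "Fold"

# Mean and sample standard deviation across the 10 folds
summary = cv_scores.agg(["mean", "std"])

print(summary.round(4))
```

A small standard deviation relative to the mean suggests the model's performance is stable across folds.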



In [7]:
# Residuals plot
plot_model(exp4_bestmodel, plot = 'residuals', save = True)

'Residuals.png'

In [8]:
# Error plot
plot_model(exp4_bestmodel, plot='error', save=True)

'Prediction Error.png'

In [9]:
# Feature importance plot
plot_model(exp4_bestmodel, plot='feature', save=True)

'Feature Importance.png'

In [10]:
# Use test/hold-out set to make predictions
exp4_pred_holdout = predict_model(exp4_bestmodel)
exp4_pred_holdout.to_csv('exp4_pred_holdout.csv', index=False)

# Pull and export test/hold-out scores
exp4_holdout = pull()
exp4_holdout.to_csv('exp4_holdout.csv', index=False)

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extra Trees Regressor,0.1891,0.4626,0.6802,0.9229,0.1394,0.1714
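
The hold-out scores are standard regression metrics computed from the actual and predicted tons. A minimal sketch of how MAE, MSE, RMSE and R2 are defined, using toy values rather than this experiment's predictions (the array names are illustrative):

```python
import numpy as np

# Toy actual and predicted weights in tons (not from the experiment)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

RMSE is the square root of MSE, which the hold-out row reflects: 0.6802 squared is approximately 0.4626.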


In [11]:
# Save model pkl file
save_model(exp4_bestmodel, 'exp4_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\Sherman\AppData\Local\Temp\joblib),
          steps=[('date_feature_extractor',
                  TransformerWrapper(include=['TimeStamp'],
                                     transformer=ExtractDateTimeFeatures())),
                 ('numerical_imputer',
                  TransformerWrapper(include=[], transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Station', 'Material', 'Ve...
                                     transformer=OneHotEncoder(cols=['Material'],
                                                               handle_missing='return_nan',
                                                               use_cat_names=True))),
                 ('rest_encoding',
                  TransformerWrapper(include=['Vehicle'],
                                     transformer=LeaveOneOutEncoder(cols=['Vehicle'],
                                                             