## Experiment 1: Baseline Model
This experiment trains, validates, and tunes a regression model that predicts load weight (the `Tons` column) from the following features:

1. **TimeStamp:** Date and time of the load's arrival
2. **Station:** Facility (Metro Central or Metro South) at which the load arrives
3. **Material:** Dominant material stream (MSW, residential or commercial organics, wood, or yard debris) in the load
4. **Vehicle ID:** The arriving vehicle's truck number, or its general type if the truck number is unknown
5. **Vehicle fullness:** The vehicle's level of "fullness", expressed as a percentage, rounded to 2 decimal places (scale of 1 to 100)

The dataset is described in more detail in [this notebook](https://app.hex.tech/2737cf3a-31c1-4361-9f90-8dea0b629cf0/hex/fa95f966-0912-42ca-9c83-9e14b785420f/draft/logic). 
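
As a quick sanity check, the feature list above can be compared against the columns of the loaded data. The following is a minimal sketch and not part of the original notebook: the column names are taken from the `exp1_data.head()` output, while `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Column names as they appear in the notebook's head() output;
# 'Tons' is the regression target, the rest are features.
EXPECTED_COLUMNS = ["TimeStamp", "Station", "Material", "Vehicle", "Fullness", "Tons"]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# One-row frame copied from the first row of the head() output
sample = pd.DataFrame({
    "TimeStamp": ["2021-04-04 07:02:21"],
    "Station": ["Metro South"],
    "Material": ["MSW"],
    "Vehicle": ["Standard Pickup"],
    "Fullness": [0.01],
    "Tons": [0.11],
})

print(missing_columns(sample))
print(missing_columns(sample.drop(columns=["Tons"])))
```

Running a check like this right after `pd.read_csv` would surface a renamed or dropped column before the experiment is set up.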

In [1]:
# import packages
import pandas as pd
import numpy as np
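# import only the PyCaret functions used in this notebook
from pycaret.regression import (
    setup, pull, compare_models, plot_model, predict_model, save_model
)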

In [2]:
# import data
path = r'C:\Users\Sherman\OneDrive - Metro\Sherman\Projects\Metro TS Load Weight Prediction\Data\Baseline.csv'
exp1_data = pd.read_csv(path)

# convert timestamp column to datetime type
exp1_data['TimeStamp'] = pd.to_datetime(exp1_data['TimeStamp'])
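
The saved pipeline shown at the end of this notebook begins with a `date_feature_extractor` step applied to `TimeStamp`, which is why the column is converted to a datetime type here. The sketch below illustrates the general idea with the pandas `.dt` accessor. Which components the pipeline actually extracts is not shown in the notebook, so the choice of year, month, day, and hour, and the `TimeStamp_*` column names, are assumptions made for illustration.

```python
import pandas as pd

# Two timestamps copied from the head() output
df = pd.DataFrame({"TimeStamp": ["2021-04-04 07:02:21", "2021-04-04 07:13:58"]})
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

# Assumed components: the notebook does not show which ones the pipeline uses
df["TimeStamp_year"] = df["TimeStamp"].dt.year
df["TimeStamp_month"] = df["TimeStamp"].dt.month
df["TimeStamp_day"] = df["TimeStamp"].dt.day
df["TimeStamp_hour"] = df["TimeStamp"].dt.hour

print(df.drop(columns="TimeStamp"))
```

Without the `pd.to_datetime` conversion the column would stay a plain string column, and the `.dt` accessor would raise an error.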

In [3]:
exp1_data.head()

| | TimeStamp | Station | Material | Vehicle | Fullness | Tons |
|---|---|---|---|---|---|---|
| 0 | 2021-04-04 07:02:21 | Metro South | MSW | Standard Pickup | 0.01 | 0.11 |
| 1 | 2021-04-04 07:01:35 | Metro South | MSW | Standard Pickup | 0.01 | 0.13 |
| 2 | 2021-04-04 07:05:41 | Metro South | MSW | Standard Pickup | 0.02 | 0.22 |
| 3 | 2021-04-04 07:13:58 | Metro South | MSW | Standard Pickup | 0.01 | 0.09 |
| 4 | 2021-04-04 07:15:39 | Metro South | MSW | Standard Pickup | 0.0 | 0.06 |


In [4]:
# Set up experiment and pre-process data
exp1 = setup(
    data=exp1_data,
    target='Tons',
    normalize=True,
    session_id=7512,
    use_gpu=True,
)
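
The `normalize` flag rescales the numeric features before modeling. The sketch below shows z-score standardization, which to my knowledge is PyCaret's default normalization method. That default is an assumption here, since the notebook does not set a method explicitly. The helper `zscore` is a hypothetical name, and the input values are the `Fullness` entries from the `head()` output.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant column: return zeros rather than divide by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Fullness values from the five rows shown by head()
fullness = [0.01, 0.01, 0.02, 0.01, 0.0]
scaled = zscore(fullness)
print([round(v, 3) for v in scaled])
```

In a modeling pipeline the mean and standard deviation are normally estimated on the training split only and then reused for the hold-out data. The sketch computes them on a single list for brevity.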

In [5]:
# Get setup configuration grid
exp1_config = pull()
exp1_config.to_csv('exp1_config.csv', index=False)

In [6]:
# Compare a suite of regressors, ranked by MAE under 10-fold CV
exp1_bestmodel = compare_models(sort='MAE')

# track and dump cv training scores
exp1_training = pull()
exp1_training.to_csv('exp1_training.csv', index=False)
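
To make the ranking criterion concrete, the sketch below computes a cross-validated MAE by hand using only the standard library. It is not what `compare_models` runs internally: the stand-in "model" simply predicts the mean of the training folds, the folds are interleaved rather than contiguous, `kfold_mae` is a hypothetical helper, and the weights are illustrative values rather than rows from the dataset.

```python
from statistics import fmean


def kfold_mae(y, k=10):
    """Cross-validated MAE of a mean-only baseline (hypothetical helper).

    Fold i holds out every k-th observation starting at index i; the
    "model" is just the mean of the remaining observations.
    """
    if not 2 <= k <= len(y):
        raise ValueError("k must be between 2 and len(y)")
    fold_scores = []
    for i in range(k):
        test = y[i::k]
        train = [v for j, v in enumerate(y) if j % k != i]
        prediction = fmean(train)
        fold_scores.append(fmean([abs(v - prediction) for v in test]))
    # The reported score is the average of the per-fold MAEs
    return fmean(fold_scores)


# Illustrative load weights in tons, not taken from the dataset
tons = [0.11, 0.13, 0.22, 0.09, 0.06, 0.15, 0.30, 0.12, 0.08, 0.18]
print(round(kfold_mae(tons, k=5), 4))
```

A mean-only baseline like this is also a useful reference point: a regressor whose cross-validated MAE is not clearly lower is learning little from the features.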

In [7]:
# Residuals plot
plot_model(exp1_bestmodel, plot='residuals', save=True)

'Residuals.png'

In [8]:
# Error plot
plot_model(exp1_bestmodel, plot='error', save=True)

'Prediction Error.png'

In [9]:
# Feature importance plot
plot_model(exp1_bestmodel, plot='feature', save=True)

'Feature Importance.png'

In [10]:
# Use test/hold-out set to make predictions
exp1_pred_holdout = predict_model(exp1_bestmodel)
exp1_pred_holdout.to_csv('exp1_pred_holdout.csv', index=False)

# Pull and export test/hold-out scores
exp1_holdout = pull()
exp1_holdout.to_csv('exp1_holdout.csv', index=False)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Extra Trees Regressor | 0.0795 | 0.1135 | 0.3369 | 0.9811 | 0.067 | 0.0794 |
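
For readers unfamiliar with the columns in the hold-out score table, the sketch below spells out the usual definitions of the six metrics. It assumes the common conventions (RMSLE computed on `log(1 + y)`, MAPE reported as a fraction rather than a percentage), which the notebook itself does not confirm. `regression_metrics` is a hypothetical helper and the example values are illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2, RMSLE and MAPE for two equal-length lists.

    Assumes y_true is non-constant, contains no zeros (for MAPE), and that
    all values are non-negative (for RMSLE).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    log_errors = [math.log1p(t) - math.log1p(p) for t, p in zip(y_true, y_pred)]
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
        "RMSLE": math.sqrt(sum(d * d for d in log_errors) / n),
        "MAPE": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Illustrative values only, not hold-out predictions from the experiment
scores = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print({name: round(value, 4) for name, value in scores.items()})
```

One relationship can be checked directly against the hold-out table: RMSE is the square root of MSE, and the square root of 0.1135 is approximately 0.3369.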


In [11]:
# Save model pkl file
save_model(exp1_bestmodel, 'exp1_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\Sherman\AppData\Local\Temp\joblib),
          steps=[('date_feature_extractor',
                  TransformerWrapper(include=['TimeStamp'],
                                     transformer=ExtractDateTimeFeatures())),
                 ('numerical_imputer',
                  TransformerWrapper(include=['Fullness'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Station', 'Mat...
                                     transformer=OneHotEncoder(cols=['Material'],
                                                               handle_missing='return_nan',
                                                               use_cat_names=True))),
                 ('rest_encoding',
                  TransformerWrapper(include=['Vehicle'],
                                     transformer=LeaveOneOutEncoder(cols=['Vehicle'],
                         