## Experiment 5: Baseline without Timestamp or Truck Number in Vehicle Feature
This experiment trains, validates, and tunes a regression model that predicts load weight in tons from the following features:

1. **Station:** Facility (Metro Central or Metro South) to which the load is arriving
2. **Material:** Dominant material stream (MSW, residential or commercial organics, wood or yard debris) arriving in the load
3. **Vehicle:** General type of the arriving vehicle only (for example, Standard Pickup or Car), with no truck number
4. **Vehicle fullness:** The vehicle's level of "fullness", expressed as a percentage (1 to 100) and, judging by the data preview below, stored as a proportion rounded to 2 decimal places (0.01 = 1%)

The dataset is described in more detail in [this notebook](https://app.hex.tech/2737cf3a-31c1-4361-9f90-8dea0b629cf0/hex/fa95f966-0912-42ca-9c83-9e14b785420f/draft/logic). 

In [1]:
# import packages
import pandas as pd
import numpy as np
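# import only the PyCaret functions this notebook uses, instead of a wildcard import
from pycaret.regression import (
    setup, pull, create_model, plot_model, predict_model, save_model
)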

In [2]:
# import data
path = r'C:\Users\Sherman\OneDrive - Metro\Sherman\Projects\Metro TS Load Weight Prediction\Data\Baseline_noDT_noTN.csv'
exp5_data = pd.read_csv(path)

In [3]:
exp5_data.head()

| | Station | Material | Vehicle | Fullness | Tons |
|---|---|---|---|---|---|
| 0 | Metro Central | MSW | Standard Pickup | 0.01 | 0.17 |
| 1 | Metro South | MSW | Car | 0.01 | 0.05 |
| 2 | Metro Central | MSW | Standard Pickup | 0.01 | 0.13 |
| 3 | Metro Central | Yd | Standard Pickup | 0.01 | 0.04 |
| 4 | Metro South | MSW | Car | 0.03 | 0.15 |
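
The feature descriptions at the top of this notebook can be turned into a quick sanity check on the rows previewed above. The following is a minimal sketch in plain Python, not part of the original experiment: `check_rows`, `VALID_STATIONS`, and the rules it applies (positive fullness, non-negative weight) are assumptions made for illustration.

```python
import csv
import io

# The five rows previewed above, copied verbatim (index column dropped).
SAMPLE = """Station,Material,Vehicle,Fullness,Tons
Metro Central,MSW,Standard Pickup,0.01,0.17
Metro South,MSW,Car,0.01,0.05
Metro Central,MSW,Standard Pickup,0.01,0.13
Metro Central,Yd,Standard Pickup,0.01,0.04
Metro South,MSW,Car,0.03,0.15
"""

# Assumption: the only valid stations are the two named in the feature list.
VALID_STATIONS = {"Metro Central", "Metro South"}


def check_rows(text):
    """Return a list of (row_number, problem) tuples; empty means all rows pass."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        if row["Station"] not in VALID_STATIONS:
            problems.append((i, "unknown station"))
        if not row["Material"] or not row["Vehicle"]:
            problems.append((i, "missing category"))
        # Assumption: a recorded load always has some fullness and a weight >= 0.
        if float(row["Fullness"]) <= 0:
            problems.append((i, "non-positive fullness"))
        if float(row["Tons"]) < 0:
            problems.append((i, "negative weight"))
    return problems


problems = check_rows(SAMPLE)
print(problems)
```

Running the same kind of check against the full CSV before calling `setup` would surface unexpected categories or out-of-range values early.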


In [4]:
# Set up experiment and pre-process data
exp5 = setup(
    data=exp5_data,
    target='Tons',
    normalize=True,
    session_id=7512,
    use_gpu=True,
)
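
To make the pre-processing in this cell more concrete, here is a rough sketch of what the transformations amount to for the previewed rows. It is an illustration in plain Python, not PyCaret's implementation. The station mapping comes from the saved pipeline printed at the end of this section. The z-score scaling assumes `normalize=True` uses PyCaret's default method. The helper names (`one_hot`, `zscore`) and the column order are hypothetical. The real pipeline fits its transformers on the training split only, and this sketch does not show which columns it scales.

```python
import statistics

# Mapping taken from the saved pipeline printed later in this notebook.
STATION_CODES = {"Metro Central": 0, "Metro South": 1}


def one_hot(value, categories):
    """Return a 0/1 indicator list with one slot per known category."""
    return [1 if value == c else 0 for c in categories]


def zscore(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


# Rows from the data preview: (Station, Material, Vehicle, Fullness).
rows = [
    ("Metro Central", "MSW", "Standard Pickup", 0.01),
    ("Metro South", "MSW", "Car", 0.01),
    ("Metro Central", "MSW", "Standard Pickup", 0.01),
    ("Metro Central", "Yd", "Standard Pickup", 0.01),
    ("Metro South", "MSW", "Car", 0.03),
]

# Categories seen in this five-row sample only; the full dataset has more.
materials = sorted({r[1] for r in rows})
vehicles = sorted({r[2] for r in rows})
fullness_scaled = zscore([r[3] for r in rows])

# Unknown or missing stations map to -1, mirroring the NaN entry in the printed mapping.
encoded = [
    [STATION_CODES.get(r[0], -1)] + one_hot(r[1], materials) + one_hot(r[2], vehicles) + [z]
    for r, z in zip(rows, fullness_scaled)
]
for row in encoded:
    print(row)
```

Each output row holds the station code, the material indicators, the vehicle indicators, and the scaled fullness, in that order.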

In [5]:
# Get setup configuration grid
exp5_config = pull()
exp5_config.to_csv('exp5_config.csv', index=False)
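
Among other settings, the exported configuration grid records how the training data is split for cross-validation. Since `setup` above does not override the fold settings, the sketch below assumes PyCaret's defaults are in effect: 10 folds, plain K-fold, no shuffling. `kfold_indices` is a hypothetical helper written for illustration, and the sample count of 23 is made up.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds.

    When n_samples is not divisible by n_splits, the first
    n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical training-set size, chosen only to show uneven fold sizes.
folds = list(kfold_indices(23, n_splits=10))
for i, (train, validation) in enumerate(folds):
    print(i, len(train), validation)
```

Every sample lands in exactly one validation fold, so each of the ten models is scored on data it was not trained on.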

In [6]:
# train an extra trees regressor using 10-fold CV
exp5_bestmodel = create_model('et')

# track and dump cv training scores
exp5_training = pull()
exp5_training.to_csv('exp5_training.csv', index=False)

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 0.0289 | 0.0049 | 0.0702 | 0.9992 | 0.0242 | 0.1094 |
| 1 | 0.0289 | 0.0037 | 0.061 | 0.9994 | 0.0237 | 0.1103 |
| 2 | 0.0292 | 0.0051 | 0.0715 | 0.9992 | 0.0253 | 0.1115 |
| 3 | 0.0288 | 0.0029 | 0.0542 | 0.9995 | 0.024 | 0.1097 |
| 4 | 0.0291 | 0.0036 | 0.0597 | 0.9994 | 0.0244 | 0.1116 |
| 5 | 0.0292 | 0.005 | 0.0705 | 0.9992 | 0.0251 | 0.1114 |
| 6 | 0.0291 | 0.004 | 0.0635 | 0.9993 | 0.0243 | 0.1112 |
| 7 | 0.0289 | 0.0041 | 0.0644 | 0.9993 | 0.0233 | 0.1106 |
| 8 | 0.0288 | 0.0033 | 0.0573 | 0.9995 | 0.0235 | 0.1105 |
| 9 | 0.029 | 0.0031 | 0.0561 | 0.9995 | 0.0239 | 0.1112 |
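
For readers unfamiliar with the column headers above, the sketch below computes the same six metrics from their textbook definitions. It is not PyCaret's code, and PyCaret's handling of edge cases (such as zero actual values in MAPE) may differ. The actual values are the `Tons` column from the data preview. The predicted values are invented for illustration.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression metrics; assumes non-zero, non-constant actuals."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    # RMSLE compares log(1 + value), which dampens the effect of large loads.
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)) / n
    )
    # MAPE is returned as a fraction (0.11 means 11%), not multiplied by 100.
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}


actual = [0.17, 0.05, 0.13, 0.04, 0.15]      # Tons from the data preview
predicted = [0.15, 0.06, 0.13, 0.05, 0.16]   # hypothetical predictions
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE, MSE, RMSE, and RMSLE are errors where lower is better, R2 approaches 1 for a good fit, and MAPE weights each error by the size of the load it belongs to.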



In [7]:
# Residuals plot
plot_model(exp5_bestmodel, plot='residuals', save=True)

'Residuals.png'

In [8]:
# Error plot
plot_model(exp5_bestmodel, plot='error', save=True)

'Prediction Error.png'

In [9]:
# Feature importance plot
plot_model(exp5_bestmodel, plot='feature', save=True)

'Feature Importance.png'

In [10]:
# Use test/hold-out set to make predictions
exp5_pred_holdout = predict_model(exp5_bestmodel)
exp5_pred_holdout.to_csv('exp5_pred_holdout.csv', index=False)

# Pull and export test/hold-out scores
exp5_holdout = pull()
exp5_holdout.to_csv('exp5_holdout.csv', index=False)

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Extra Trees Regressor | 0.0289 | 0.0038 | 0.0618 | 0.9994 | 0.0238 | 0.1106 |
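
One way to read the hold-out scores is to compare them with the average of the ten cross-validation folds reported earlier. The sketch below does that for three of the metrics, using only numbers copied from the two tables in this notebook. The helper name `summarize` is hypothetical, and the sample standard deviation used here may not match the convention PyCaret uses in its own summary rows.

```python
import statistics

# Per-fold cross-validation scores copied from the training table (folds 0 to 9).
cv_scores = {
    "MAE": [0.0289, 0.0289, 0.0292, 0.0288, 0.0291, 0.0292, 0.0291, 0.0289, 0.0288, 0.0290],
    "RMSE": [0.0702, 0.0610, 0.0715, 0.0542, 0.0597, 0.0705, 0.0635, 0.0644, 0.0573, 0.0561],
    "R2": [0.9992, 0.9994, 0.9992, 0.9995, 0.9994, 0.9992, 0.9993, 0.9993, 0.9995, 0.9995],
}

# Hold-out scores copied from the table above.
holdout = {"MAE": 0.0289, "RMSE": 0.0618, "R2": 0.9994}


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of fold scores."""
    return statistics.fmean(scores), statistics.stdev(scores)


cv_mean = {}
for name, scores in cv_scores.items():
    mean, std = summarize(scores)
    cv_mean[name] = mean
    gap = holdout[name] - mean
    print(f"{name}: cv mean={mean:.4f}, cv std={std:.4f}, hold-out={holdout[name]:.4f}, gap={gap:+.4f}")
```

A hold-out score that sits close to the cross-validation mean, relative to the spread across folds, suggests the model generalizes to unseen loads about as well as the fold scores indicate.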


In [11]:
# Save model pkl file
save_model(exp5_bestmodel, 'exp5_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\Sherman\AppData\Local\Temp\joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Fullness'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Station', 'Material', 'Vehicle'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordinal_encoding',
                  TransformerWrap...
                                                                          'mapping': Metro Central    0
 Metro South      1
 NaN             -1
 dtype: int64}]))),
                 ('onehot_encoding',
                  TransformerWrapper(include=['Material', 'Vehicle'],
                                     transformer=OneHotEncoder(cols=['Material',
                                                                     'Vehicle'],
                              