## Experiment 2: Baseline without Timestamp
This experiment trains, validates, and tunes a regression model that predicts load weight (`Tons`) from the following features:

1. **Station:** Facility (Metro Central or Metro South) to which the load is arriving
2. **Material:** Dominant material stream (MSW, residential or commercial organics, wood or yard debris) arriving in the load
3. **Vehicle ID:** The arriving vehicle's truck number, or its general type if the truck number is unknown
4. **Vehicle fullness:** The vehicle's level of "fullness", expressed as a percentage, rounded to 2 decimal places (scale of 1 to 100)

The dataset is described in more detail in [this notebook](https://app.hex.tech/2737cf3a-31c1-4361-9f90-8dea0b629cf0/hex/fa95f966-0912-42ca-9c83-9e14b785420f/draft/logic). 

In [1]:
# import packages
import pandas as pd
import numpy as np
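# import only the PyCaret functions this notebook uses, instead of a wildcard import
from pycaret.regression import (setup, pull, create_model, plot_model,
                                predict_model, save_model)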

In [2]:
# import data
path = r'C:\Users\Sherman\OneDrive - Metro\Sherman\Projects\Metro TS Load Weight Prediction\Data\Baseline_noDT.csv'
exp2_data = pd.read_csv(path)

In [3]:
exp2_data.head()

|   | Station | Material | Vehicle | Fullness | Tons |
|---|---|---|---|---|---|
| 0 | Metro South | MSW | Standard Pickup | 0.01 | 0.11 |
| 1 | Metro South | MSW | Standard Pickup | 0.01 | 0.13 |
| 2 | Metro South | MSW | Standard Pickup | 0.02 | 0.22 |
| 3 | Metro South | MSW | Standard Pickup | 0.01 | 0.09 |
| 4 | Metro South | MSW | Standard Pickup | 0.0 | 0.06 |


In [4]:
# Set up experiment and pre-process data
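exp2 = setup(data=exp2_data,
             target='Tons',      # load weight to predict
             normalize=True,     # rescale numeric features
             session_id=7512,    # fixed seed for reproducibility
             use_gpu=True        # use GPU-enabled estimators where available
            )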
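For context, `normalize = True` rescales numeric features before training. The sketch below assumes PyCaret's default `normalize_method` of `'zscore'` and applies that scaling by hand to the five `Fullness` values shown in the `head()` output above. The helper name `zscore` is hypothetical, and PyCaret fits its scaler on the training split rather than on a five-row sample, so this only illustrates the arithmetic.

```python
import statistics

def zscore(values):
    """Center values on their mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    return [(v - mean) / std for v in values]

# Fullness values from the first five rows of exp2_data
fullness = [0.01, 0.01, 0.02, 0.01, 0.0]
scaled = zscore(fullness)

# After scaling, the column has mean ~0 and standard deviation ~1
print("mean:", statistics.fmean(scaled))
print("std: ", statistics.pstdev(scaled))
```

Tree ensembles such as extra trees are largely insensitive to monotonic rescaling, so this step matters more for the scale-sensitive estimators PyCaret also supports.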

In [5]:
# Get setup configuration grid
exp2_config = pull()
exp2_config.to_csv('exp2_config.csv', index=False)

In [6]:
# train an extra trees regressor using 10-fold CV
exp2_bestmodel = create_model('et')

# track and dump cv training scores
exp2_training = pull()
exp2_training.to_csv('exp2_training.csv')  # keep the Fold index so each row stays labeled

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 0.1355 | 0.2136 | 0.4622 | 0.9647 | 0.0927 | 0.181 |
| 1 | 0.1423 | 0.2392 | 0.4891 | 0.9608 | 0.0968 | 0.1846 |
| 2 | 0.143 | 0.2322 | 0.4818 | 0.9605 | 0.0977 | 0.1883 |
| 3 | 0.1401 | 0.222 | 0.4711 | 0.9631 | 0.0946 | 0.1841 |
| 4 | 0.1294 | 0.1905 | 0.4364 | 0.968 | 0.0884 | 0.1774 |
| 5 | 0.1349 | 0.2114 | 0.4597 | 0.9649 | 0.091 | 0.1744 |
| 6 | 0.136 | 0.211 | 0.4593 | 0.9646 | 0.0921 | 0.1792 |
| 7 | 0.1383 | 0.2197 | 0.4687 | 0.9639 | 0.0939 | 0.1804 |
| 8 | 0.1405 | 0.2334 | 0.4831 | 0.9614 | 0.0962 | 0.1811 |
| 9 | 0.1343 | 0.1946 | 0.4412 | 0.9678 | 0.0908 | 0.1814 |
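The per-fold scores above can be summarized and sanity-checked without rerunning the experiment. The following is a minimal sketch that copies the MAE, MSE and RMSE columns from the table, averages MAE across folds, and checks that each RMSE matches the square root of its MSE up to table rounding. Using the population standard deviation for the spread is an assumption about the preferred convention, not something the notebook states.

```python
import math
import statistics

# (MAE, MSE, RMSE) per fold, copied from the cross-validation table above
folds = [
    (0.1355, 0.2136, 0.4622),
    (0.1423, 0.2392, 0.4891),
    (0.1430, 0.2322, 0.4818),
    (0.1401, 0.2220, 0.4711),
    (0.1294, 0.1905, 0.4364),
    (0.1349, 0.2114, 0.4597),
    (0.1360, 0.2110, 0.4593),
    (0.1383, 0.2197, 0.4687),
    (0.1405, 0.2334, 0.4831),
    (0.1343, 0.1946, 0.4412),
]

# Mean and spread of MAE across the 10 folds
mae = [f[0] for f in folds]
mean_mae = statistics.fmean(mae)
std_mae = statistics.pstdev(mae)  # assumption: population std (ddof=0)
print(f"MAE across folds: mean={mean_mae:.4f}, std={std_mae:.4f}")

# Consistency check: RMSE should equal sqrt(MSE) up to rounding in the table
max_gap = max(abs(math.sqrt(mse) - rmse) for _, mse, rmse in folds)
print(f"largest |sqrt(MSE) - RMSE| = {max_gap:.5f}")
```

A small spread relative to the mean suggests the fold scores are stable rather than driven by one favorable split.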



In [7]:
# Residuals plot
plot_model(exp2_bestmodel, plot='residuals', save=True)

'Residuals.png'

In [8]:
# Error plot
plot_model(exp2_bestmodel, plot='error', save=True)

'Prediction Error.png'

In [9]:
# Feature importance plot
plot_model(exp2_bestmodel, plot='feature', save=True)

'Feature Importance.png'

In [10]:
# Use test/hold-out set to make predictions
exp2_pred_holdout = predict_model(exp2_bestmodel)
exp2_pred_holdout.to_csv('exp2_pred_holdout.csv', index=False)

# Pull and export test/hold-out scores
exp2_holdout = pull()
exp2_holdout.to_csv('exp2_holdout.csv', index=False)

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Extra Trees Regressor | 0.049 | 0.1105 | 0.3324 | 0.9816 | 0.0637 | 0.0649 |
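To make the column names in the score table concrete, here is a sketch of the standard textbook definitions of the six metrics in plain Python. The function name `regression_metrics` and the toy values are hypothetical, chosen only for illustration. PyCaret's own implementations may treat edge cases (such as zero or negative targets) differently.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute common regression metrics; assumes non-zero, non-negative true values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    log_sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return {
        "MAE": mae,                                  # mean absolute error
        "MSE": mse,                                  # mean squared error
        "RMSE": math.sqrt(mse),                      # root of MSE, same units as the target
        "R2": 1 - ss_res / ss_tot,                   # share of variance explained
        "RMSLE": math.sqrt(sum(log_sq) / n),         # RMSE on log(1 + y)
        "MAPE": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,  # as a fraction, not a percent
    }

# Hypothetical toy loads in tons, for illustration only
scores = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that MAPE here is a fraction, so a table value of 0.0649 would correspond to an average error of about 6.5% of the true load weight under this definition.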


In [11]:
# Save model pkl file
save_model(exp2_bestmodel, 'exp2_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\Sherman\AppData\Local\Temp\joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['Fullness'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Station', 'Material', 'Vehicle'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordinal_encoding',
                  TransformerWrap...
                                     transformer=OneHotEncoder(cols=['Material'],
                                                               handle_missing='return_nan',
                                                               use_cat_names=True))),
                 ('rest_encoding',
                  TransformerWrapper(include=['Vehicle'],
                                     transformer=LeaveOneOutEncoder(cols=['Vehicle'],
                        