 # Comprehensive Exam

 ## [Coding Artifact](./00-kg-main-artifact.ipynb)

 Kalin Gibbons

 Nov 20, 2020
 

 # Model Selection

 Candidate regressors are screened by fitting each one without any prior
 hyperparameter tuning, then comparing the resulting errors across functional
 groups. Models with lower errors are marked for hyperparameter tuning.

In [1]:
import os
import sys
import math
import logging
from pathlib import Path
from IPython.display import display

import numpy as np

import sklearn
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor
)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from tqdm.auto import tqdm


%load_ext autoreload
%autoreload 2

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# import seaborn as sns
import pandas as pd

import artifact
from artifact.datasets import load_tkr, tkr_group_lut
from artifact.helpers import RegressionProfile, REGRESSION_PROFILE_PATH



In [2]:
plt.rcParams['figure.figsize'] = (9, 5.5)
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['font.size'] = 14
mpl.rcParams['font.family'] = 'Times New Roman'

# sns.set_context("poster")
# sns.set(rc={'figure.figsize': (16, 9.)})
# sns.set_style("whitegrid")

pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

logging.basicConfig(level=logging.INFO, stream=sys.stdout)



 ## Profiling the regressors

 First, we'll choose potential regressors to investigate. Early choices are
 linear regression, decision trees, and boosting and forest ensemble methods.

 ### Learner Selection

In [3]:
learners = (
    GradientBoostingRegressor(n_estimators=100),
    RandomForestRegressor(n_estimators=100),
    AdaBoostRegressor(DecisionTreeRegressor(), n_estimators=100),
    AdaBoostRegressor(LinearRegression(), n_estimators=100),
    DecisionTreeRegressor(),
    LinearRegression()
)


 Next, we'll select a functional group to examine and load only the data it
 needs.

 ### Functional group selection

In [4]:
func_groups = list(tkr_group_lut.keys())
func_groups


['contact_mechanics', 'joint_loads', 'kinematics', 'ligaments', 'patella']

In [5]:
group = 'patella'


 ### Loading the data

 We'll load the subset of the data containing the responses that make up the
 chosen functional group. We'll also use a `RegressionProfile` object so that
 results persist between sessions.

In [6]:
shared_kwargs = dict(load_fcn=load_tkr, functional_group=group)
tkr_train = artifact.Results(**shared_kwargs, subset='train')
tkr_test = artifact.Results(**shared_kwargs, subset='test')
display(tkr_train.response_names[1:])

reg_prof = RegressionProfile(load_path=REGRESSION_PROFILE_PATH)


['patella_area',
 'pat_cop_1',
 'pat_cop_2',
 'pat_cop_3',
 'pat_press',
 'pat_force_1',
 'pat_force_2',
 'pat_force_3',
 'pl_disp',
 'pl_force',
 'pat_fem_flexion',
 'pat_fem_valgus',
 'pat_fem_external',
 'pat_fem_lat',
 'pat_fem_ant',
 'pat_fem_inf']

 ### Fitting and profiling
 If profiling results for the selected functional group have already been
 saved, set the `force_search` flag to `True` to rerun the fits and overwrite
 the previous profiling session.
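
The cache-or-recompute behaviour described here can be illustrated with a small, self-contained sketch. This is not the `RegressionProfile` implementation; `load_or_compute`, the pickle file, and the `expensive` stand-in are all hypothetical names introduced only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, key, compute, force=False):
    """Return cached results for `key`, recomputing only when needed."""
    path = Path(path)
    # Load the existing cache from disk, or start with an empty one.
    cache = pickle.loads(path.read_bytes()) if path.exists() else {}
    if force or key not in cache:
        cache[key] = compute()
        path.write_bytes(pickle.dumps(cache))
    return cache[key]


calls = []  # records how many times the expensive step actually ran


def expensive():
    """Hypothetical stand-in for a long profiling run."""
    calls.append(1)
    return {'LinearRegression': 0.5}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'profile.pkl'
    load_or_compute(cache_file, 'patella', expensive)              # computes
    load_or_compute(cache_file, 'patella', expensive)              # uses cache
    load_or_compute(cache_file, 'patella', expensive, force=True)  # recomputes

print(len(calls))
```

Only the first and the forced call run the expensive step, which mirrors the role of `force_search` in the cell below.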

In [7]:
force_search = False


In [8]:
learner_names = [str(x).replace('()', '') for x in learners]
scaler = StandardScaler()
regr = artifact.Regressor(tkr_train, tkr_test, learners[0], scaler=scaler)
err_df = pd.DataFrame(index=learner_names)

saved_keys = reg_prof.error_dataframes.keys()
if (force_search) or (group not in saved_keys):
    resp_pbar = tqdm(regr.train_results.response_names, desc='Processing...')
    for resp in resp_pbar:
        if resp == 'time':
            continue
        resp_pbar.set_description(f'Processing {resp}')
        # np.float was removed from NumPy; the builtin float is equivalent
        errs = np.zeros(len(learner_names), dtype=float)
        lrn_pbar = tqdm(learners, desc='Fitting...', leave=False)
        for idx, lrn in enumerate(lrn_pbar):
            desc = f'{learner_names[idx].replace("base_estimator=", "")}'
            lrn_pbar.set_description(desc)
            regr.learner = MultiOutputRegressor(lrn)
            y_pred = regr.fit(resp).predict()
            errs[idx] = regr.prediction_error
        err_df[resp] = errs
        lrn_pbar.close()
    resp_pbar.close()

    reg_prof.add_results(group, err_df)
    reg_prof.save(REGRESSION_PROFILE_PATH)
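
For readers without the `artifact` package, the loop above can be approximated with plain scikit-learn. The following is a minimal sketch under stated assumptions: the data are synthetic, each response is treated as a single scalar target (which may differ from how `artifact.Regressor` handles a response), and `profile_learners` is a hypothetical helper rather than part of the package.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


def profile_learners(learners, X_train, X_test, Y_train, Y_test):
    """Return a (learner x response) table of test-set RMS errors."""
    names = [str(lrn).replace('()', '') for lrn in learners]
    err_df = pd.DataFrame(index=names, columns=Y_train.columns, dtype=float)
    for name, lrn in zip(names, learners):
        for resp in Y_train.columns:
            # Scale the predictors, then fit the untuned learner.
            model = make_pipeline(StandardScaler(), lrn)
            model.fit(X_train, Y_train[resp])
            y_pred = model.predict(X_test)
            rmse = np.sqrt(mean_squared_error(Y_test[resp], y_pred))
            err_df.loc[name, resp] = rmse
    return err_df


# Synthetic stand-in for one functional group with two responses.
X, Y = make_regression(n_samples=200, n_features=5, n_targets=2,
                       noise=1.0, random_state=0)
Y = pd.DataFrame(Y, columns=['resp_a', 'resp_b'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

demo_learners = (LinearRegression(), DecisionTreeRegressor(random_state=0))
demo_errors = profile_learners(demo_learners, X_train, X_test, Y_train, Y_test)
print(demo_errors)
```

The resulting frame has one row per learner and one column per response, the same layout as the "RMS Errors" tables in the results.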


 ## Results

In [9]:
# reg_prof.summarize(group)
for key in reg_prof.error_dataframes.keys():
    reg_prof.summarize(key)




joint_loads
-----------
Best learners total by response:


LinearRegression                                                             4
AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)    2
GradientBoostingRegressor                                                    2
dtype: int64

med_torque_1    AdaBoostRegressor(base_estimator=DecisionTreeR...
lat_torque_1    AdaBoostRegressor(base_estimator=DecisionTreeR...
med_force_2                             GradientBoostingRegressor
lat_force_2                             GradientBoostingRegressor
med_force_1                                      LinearRegression
med_torque_2                                     LinearRegression
lat_force_1                                      LinearRegression
lat_torque_2                                     LinearRegression
dtype: object



Sorted by median RMS error (smallest to largest):


learner,count,mean,std,min,25%,50%,75%,max
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",8.0,457.436336,615.357405,33.018228,62.568494,175.478336,557.412198,1486.255491
RandomForestRegressor,8.0,465.865525,624.637692,32.185607,65.5382,180.323926,567.866601,1514.090976
GradientBoostingRegressor,8.0,453.882096,605.045488,28.624252,59.69274,181.80511,561.551478,1471.967865
LinearRegression,8.0,447.428664,586.178902,33.169663,55.411237,185.267365,565.693107,1430.352219
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",8.0,537.220447,681.397757,36.384291,69.945561,234.843365,709.33259,1691.947807
DecisionTreeRegressor,8.0,669.671069,891.333514,47.775067,104.640293,254.695111,823.570689,2173.725325




RMS Errors:


learner,med_force_1,med_force_2,med_torque_1,med_torque_2,lat_force_1,lat_force_2,lat_torque_1,lat_torque_2
GradientBoostingRegressor,65.327128,28.624252,285.255098,1471.967865,78.355123,42.789574,293.734093,1365.003632
RandomForestRegressor,70.956263,32.185607,281.823194,1514.090976,78.824658,49.28401,285.853455,1413.906037
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",67.577091,33.018228,277.754143,1486.255491,77.659439,47.542703,273.297233,1396.386362
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",72.443998,36.384291,390.646228,1691.947807,79.040501,62.450251,436.239931,1528.610569
DecisionTreeRegressor,116.544116,47.775067,426.63888,2173.725325,124.552855,68.928824,384.837367,2014.366115
LinearRegression,57.539003,33.169663,311.983333,1430.352219,67.756311,49.02794,302.778418,1326.822428







contact_mechanics
-----------------
Best learners total by response:


LinearRegression             5
GradientBoostingRegressor    3
RandomForestRegressor        2
dtype: int64

medial_area     GradientBoostingRegressor
lateral_area    GradientBoostingRegressor
lat_cop_1       GradientBoostingRegressor
med_cop_1                LinearRegression
med_cop_2                LinearRegression
med_cop_3                LinearRegression
lat_cop_2                LinearRegression
lat_cop_3                LinearRegression
med_press           RandomForestRegressor
lat_press           RandomForestRegressor
dtype: object



Sorted by median RMS error (smallest to largest):


learner,count,mean,std,min,25%,50%,75%,max
LinearRegression,10.0,4.357068,4.704962,0.807879,1.004875,1.365148,6.774553,12.626426
GradientBoostingRegressor,10.0,3.960516,3.805417,0.898092,1.107078,1.568996,6.736251,10.675957
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",10.0,4.065698,3.891305,0.950414,1.18119,1.698489,6.593432,11.354011
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",10.0,5.459104,5.626212,0.926974,1.406089,1.707355,9.333867,14.390028
RandomForestRegressor,10.0,4.09814,3.929937,0.965873,1.21223,1.729505,6.5527,11.423556
DecisionTreeRegressor,10.0,5.918483,5.702119,1.54213,1.854379,2.538217,8.0531,16.57287




RMS Errors:


learner,medial_area,lateral_area,med_cop_1,med_cop_2,med_cop_3,lat_cop_1,lat_cop_2,lat_cop_3,med_press,lat_press
GradientBoostingRegressor,10.675957,9.195851,0.898092,1.051336,1.459073,0.965172,1.274303,1.678919,5.137179,7.269275
RandomForestRegressor,11.423556,9.541434,0.965873,1.161893,1.605035,1.128586,1.363241,1.853975,4.801305,7.136499
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",11.354011,9.283727,0.950414,1.122366,1.604833,1.10541,1.357661,1.792145,4.942756,7.143657
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",14.34628,14.390028,0.966382,0.926974,1.490307,1.378016,1.541629,1.873082,7.849778,9.828564
DecisionTreeRegressor,16.57287,14.724805,1.54213,1.635232,2.53691,1.843612,1.886678,2.539525,7.7484,8.154667
LinearRegression,12.626426,11.890284,0.807879,0.854544,1.307279,0.966394,1.120319,1.423017,5.312703,7.261837







kinematics
----------
Best learners total by response:


LinearRegression             4
GradientBoostingRegressor    1
dtype: int64

tib_fem_inf         GradientBoostingRegressor
tib_fem_valgus               LinearRegression
tib_fem_external             LinearRegression
tib_fem_lat                  LinearRegression
tib_fem_ant                  LinearRegression
dtype: object



Sorted by median RMS error (smallest to largest):


learner,count,mean,std,min,25%,50%,75%,max
LinearRegression,5.0,0.847709,0.514818,0.387899,0.465945,0.670514,1.080836,1.633352
GradientBoostingRegressor,5.0,1.048831,0.661963,0.35307,0.749246,0.775569,1.296102,2.070167
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",5.0,1.058902,0.528898,0.466833,0.804519,0.927792,1.224674,1.870692
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",5.0,1.243228,0.73946,0.439947,0.838643,1.013616,1.579634,2.344301
RandomForestRegressor,5.0,1.253924,0.747095,0.442298,0.833238,1.01409,1.632387,2.347605
DecisionTreeRegressor,5.0,1.955112,1.116242,0.718784,1.287697,1.648867,2.573168,3.547045




RMS Errors:


learner,tib_fem_valgus,tib_fem_external,tib_fem_lat,tib_fem_ant,tib_fem_inf
GradientBoostingRegressor,0.775569,2.070167,0.749246,1.296102,0.35307
RandomForestRegressor,1.01409,2.347605,0.833238,1.632387,0.442298
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",1.013616,2.344301,0.838643,1.579634,0.439947
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",0.804519,1.870692,0.927792,1.224674,0.466833
DecisionTreeRegressor,1.648867,3.547045,1.287697,2.573168,0.718784
LinearRegression,0.465945,1.633352,0.670514,1.080836,0.387899







patella
-------
Best learners total by response:


LinearRegression             14
GradientBoostingRegressor     2
dtype: int64

pat_force_3         GradientBoostingRegressor
pat_fem_inf         GradientBoostingRegressor
patella_area                 LinearRegression
pat_cop_1                    LinearRegression
pat_cop_2                    LinearRegression
pat_cop_3                    LinearRegression
pat_press                    LinearRegression
pat_force_1                  LinearRegression
pat_force_2                  LinearRegression
pl_disp                      LinearRegression
pl_force                     LinearRegression
pat_fem_flexion              LinearRegression
pat_fem_valgus               LinearRegression
pat_fem_external             LinearRegression
pat_fem_lat                  LinearRegression
pat_fem_ant                  LinearRegression
dtype: object



Sorted by median RMS error (smallest to largest):


learner,count,mean,std,min,25%,50%,75%,max
LinearRegression,16.0,8.347111,19.601499,0.03972,0.364332,0.615132,5.704172,78.47328
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",16.0,10.763468,26.532261,0.056374,0.42559,0.708399,6.811542,106.845082
GradientBoostingRegressor,16.0,8.929895,20.380611,0.044186,0.498351,1.00228,6.213557,81.784288
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",16.0,9.359233,20.774763,0.04557,0.617674,1.220988,6.286227,83.039612
RandomForestRegressor,16.0,9.329025,20.650144,0.046053,0.618356,1.275852,6.283071,82.578164
DecisionTreeRegressor,16.0,13.62885,29.837621,0.072473,0.895509,1.869171,9.371401,119.306811




RMS Errors:


learner,patella_area,pat_cop_1,pat_cop_2,pat_cop_3,pat_press,pat_force_1,pat_force_2,pat_force_3,pl_disp,pl_force,pat_fem_flexion,pat_fem_valgus,pat_fem_external,pat_fem_lat,pat_fem_ant,pat_fem_inf
GradientBoostingRegressor,3.711422,1.039986,0.343133,1.429776,2.641105,81.784288,16.423624,18.257453,0.044186,13.719961,0.798638,0.661859,0.964573,0.545532,0.155981,0.356807
RandomForestRegressor,3.617298,1.316362,0.384606,1.739372,2.763957,82.578164,17.795987,20.488181,0.046053,14.28039,0.916872,0.840942,1.235343,0.692568,0.17258,0.39572
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",3.642699,1.261601,0.378424,1.673906,2.719691,83.039612,17.979399,20.632525,0.04557,14.216809,0.88077,0.830537,1.180376,0.690498,0.176108,0.399204
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",3.908288,0.738867,0.298481,1.282473,2.845428,106.845082,15.521304,20.977951,0.056374,16.999886,0.677931,0.676898,0.591485,0.196462,0.130626,0.467959
DecisionTreeRegressor,5.061745,1.929259,0.656853,2.861992,4.004441,119.306811,27.315488,28.032696,0.072473,22.300368,1.622457,1.217882,1.809084,0.975061,0.293358,0.601626
LinearRegression,3.461912,0.586916,0.283162,1.153262,2.242558,78.47328,14.332141,18.449326,0.03972,12.430954,0.643349,0.412737,0.391389,0.134664,0.122012,0.396402







ligaments
---------
Best learners total by response:


LinearRegression    17
dtype: int64

ap_disp      LinearRegression
pmc_disp     LinearRegression
pom_disp     LinearRegression
pol_disp     LinearRegression
pcl_disp     LinearRegression
pcm_disp     LinearRegression
alc_disp     LinearRegression
vi1_disp     LinearRegression
pfl_force    LinearRegression
lclp_disp    LinearRegression
lcla_disp    LinearRegression
lcl_disp     LinearRegression
mclp_disp    LinearRegression
mcla_disp    LinearRegression
mcl_disp     LinearRegression
pfl_disp     LinearRegression
vi1_force    LinearRegression
dtype: object



Sorted by median RMS error (smallest to largest):


learner,count,mean,std,min,25%,50%,75%,max
LinearRegression,17.0,2.644678,6.578398,0.496801,0.690125,0.85719,0.98269,27.809719
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",17.0,3.071228,7.433703,0.533512,0.762196,1.03223,1.193374,31.357367
GradientBoostingRegressor,17.0,3.523874,8.862122,0.595402,0.9225,1.142407,1.385996,37.667473
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",17.0,3.898217,9.532387,0.639053,1.085326,1.327939,1.64302,40.645751
RandomForestRegressor,17.0,4.051685,10.089778,0.655742,1.105437,1.344097,1.649689,42.980346
DecisionTreeRegressor,17.0,5.993554,14.507311,1.057411,1.711025,2.14711,2.661122,61.99725




RMS Errors:


learner,ap_disp,mcl_disp,mcla_disp,mclp_disp,lcl_disp,lcla_disp,lclp_disp,pfl_disp,pfl_force,alc_disp,pcm_disp,pcl_disp,pol_disp,pom_disp,pmc_disp,vi1_disp,vi1_force
GradientBoostingRegressor,1.32091,0.866712,0.833466,0.895061,1.302893,1.360081,1.385996,1.834739,37.667473,1.142407,1.026687,0.924339,1.078023,0.9225,1.415725,0.595402,5.333449
RandomForestRegressor,1.649689,1.079333,1.045374,1.105437,1.545818,1.617806,1.639246,2.155293,42.980346,1.32617,1.178683,1.035248,1.344097,1.218054,1.717247,0.655742,5.585059
"AdaBoostRegressor(base_estimator=DecisionTreeRegressor, n_estimators=100)",1.598345,1.054265,1.020414,1.085326,1.581645,1.64302,1.6822,2.164975,40.645751,1.327939,1.140784,1.041014,1.309432,1.127109,1.605072,0.639053,5.603342
"AdaBoostRegressor(base_estimator=LinearRegression, n_estimators=100)",1.23494,0.659524,0.643355,0.670545,1.050308,1.068814,1.03223,1.193374,31.357367,1.286204,0.890204,0.998838,1.101196,0.762196,0.840768,0.533512,6.887504
DecisionTreeRegressor,2.596926,1.671426,1.597184,1.711025,2.589422,2.643307,2.671249,3.263703,61.99725,2.14711,1.848993,1.668983,2.145812,1.806551,2.661122,1.057411,7.812937
LinearRegression,1.095271,0.527545,0.515668,0.534548,0.837581,0.85719,0.86743,1.092973,27.809719,0.901664,0.838817,0.876701,0.98269,0.690125,0.744542,0.496801,5.29026
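
The three summaries printed for each group (best learner per response, win counts, and learners sorted by median RMS error) can be derived from an error table with a few pandas calls. The sketch below is one plausible way to do so and is not taken from `RegressionProfile.summarize`, whose implementation is not shown here. It uses a subset of the kinematics errors reported above.

```python
import pandas as pd

# A few RMS errors copied from the kinematics results (rows: learners).
err_df = pd.DataFrame(
    {
        'tib_fem_valgus': [0.775569, 1.01409, 0.465945],
        'tib_fem_external': [2.070167, 2.347605, 1.633352],
        'tib_fem_inf': [0.35307, 0.442298, 0.387899],
    },
    index=[
        'GradientBoostingRegressor',
        'RandomForestRegressor',
        'LinearRegression',
    ],
)

# Learner with the smallest error for each response (column-wise argmin).
best_by_response = err_df.idxmin()

# Number of responses on which each learner was best.
best_counts = best_by_response.value_counts()

# Per-learner descriptive statistics, sorted by the median ('50%') error.
by_median = err_df.T.describe().T.sort_values('50%')

print(best_by_response)
print(best_counts)
print(by_median)
```

With this subset, `LinearRegression` is best on two of the three responses and also has the lowest median error, consistent with the full kinematics summary.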







Return to the root [Coding Artifact](./00-kg-main-artifact.ipynb) document.