# Test of notebook run-time environment
This notebook smoke-tests the environment setup and prototypes code to load a scikit-learn Random Forest Regressor model from external storage.

## Metrics reported in this notebook are from synthetic data and **have not** been calibrated to representative dataset or model sizes.


## Notebook run-time environment
* **Hardware:** MacBook Pro (2019), 16 GB RAM, 1 TB SSD
* **OS:** macOS 11.6.1
* **Docker:** Docker Desktop for Mac
* **Docker Image:** Base image: `jupyter/datascience-notebook` with ONNX packages added

## Key software versions

In [1]:
!python --version

Python 3.9.7


In [2]:
!conda list -n onnx_sandbox | grep "\(onnx\|scikit\|numpy\|pandas\)"

# packages in environment at /opt/conda/envs/onnx_sandbox:
numpy                     1.21.2           py39h20f2e39_0    defaults
numpy-base                1.21.2           py39h79a1101_0    defaults
onnx                      1.10.2           py39h8b1bc1a_2    conda-forge
onnxconverter-common      1.8.1              pyhd8ed1ab_0    conda-forge
onnxruntime               1.10.0           py39h15e0acf_2    conda-forge
pandas                    1.3.5            py39h8c16a72_0    defaults
scikit-learn              1.0.1            py39h51133e4_0    defaults
skl2onnx                  1.10.3             pyhd8ed1ab_0    conda-forge


## Import required libraries

In [3]:
import os
import pandas as pd
import numpy as np
import onnxruntime as rt
import pickle

## Set up configuration for analysis

In [4]:
# change to the project root so that project-specific utility functions can be imported
os.chdir('..')

# import project-specific utility functions
from utils.utils import load_config, rf_model_size_mb

In [5]:
# get configuration parameters
config = load_config('./config.yaml')
config

{'data_dir': '/Users/jim/Desktop/onnx_sandbox/data',
 'models_dir': '/Users/jim/Desktop/onnx_sandbox/models',
 'number_records': 100000,
 'number_features': 20,
 'number_informative': 14,
 'fraction_for_test': 0.2,
 'number_counties': 20,
 'random_seed': 123}

In [6]:
COUNTY_ID = 'cnty0000'
DATA_DIR = config['data_dir']
MODELS_DIR = config['models_dir']

## Retrieve test data set

In [7]:
test_df = pd.read_parquet(os.path.join(DATA_DIR, 'benchmark', 'test.parquet'))
test_df = test_df.loc[test_df['county'] == COUNTY_ID]
test_df.shape

(1024, 22)

In [8]:
# retrieve one record to score
one_record = pd.DataFrame(test_df.iloc[0,:]).T
one_record

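| | county | X_00 | X_01 | X_02 | X_03 | X_04 | X_05 | X_06 | X_07 | X_08 | ... | X_11 | X_12 | X_13 | X_14 | X_15 | X_16 | X_17 | X_18 | X_19 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | cnty0000 | 1.356254 | -1.625286 | 0.038011 | -0.417242 | 0.105904 | 0.998065 | -0.939552 | -0.321798 | 2.528673 | ... | 1.310078 | 1.92076 | -0.251311 | 1.092388 | -0.094214 | -0.459 | -0.669557 | 0.266269 | 0.200396 | 136.709747 |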


## Score with pickle model file

In [9]:
%%time
# retrieve model from persistent storage
with open(os.path.join(MODELS_DIR, 'testbed', COUNTY_ID+'.pkl'), 'rb') as f:
    rf_pkl_model = pickle.load(f)
rf_pkl_model

CPU times: user 268 ms, sys: 55.9 ms, total: 324 ms
Wall time: 325 ms


RandomForestRegressor(random_state=123)

In [10]:
print(f'Unpickled RF model size: {rf_model_size_mb(rf_pkl_model)} MB')

Unpickled RF model size: 30.2889 MB
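
The helper `rf_model_size_mb` lives in the project utilities and its implementation is not shown here. As a rough cross-check, one could measure the length of the model's pickle byte string instead. The sketch below is an assumption about how such a check might look, not the project's method; `pickled_size_mb` is a hypothetical name, and the serialized size will generally differ from the in-memory size.

```python
import pickle


def pickled_size_mb(obj):
    """Return the length of obj's pickle byte string in MB (1 MB taken as 1e6 bytes)."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload) / 1e6


# Stand-in object for illustration; in this notebook the argument would be rf_pkl_model
print(f'Pickled size: {pickled_size_mb(list(range(1000))):.4f} MB')
```

In the notebook this would be called as `pickled_size_mb(rf_pkl_model)` and compared against the value reported above.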


In [11]:
%%time
# score one record from test data set
pkl_scores = rf_pkl_model.predict(one_record.drop(['county', 'y'], axis='columns'))

CPU times: user 12.4 ms, sys: 3.1 ms, total: 15.5 ms
Wall time: 14.6 ms


## Score with ONNX model file
Based on [this example code](http://onnx.ai/sklearn-onnx/auto_examples/plot_convert_model.html#sphx-glr-auto-examples-plot-convert-model-py).
It is not clear how to obtain the in-memory size of the ONNX `sess` object.
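
Since the in-memory size of the session is not readily available, the on-disk size of the serialized files can serve as a rough proxy for comparing the two formats. The sketch below assumes the `.pkl` and `.onnx` files share a stem in one directory, as they do for `COUNTY_ID` above; `serialized_sizes_mb` is a hypothetical helper, and file size is not the same as memory footprint.

```python
import os


def serialized_sizes_mb(models_dir, stem, extensions=('.pkl', '.onnx')):
    """Return {extension: on-disk size in MB} for each model file that exists."""
    sizes = {}
    for ext in extensions:
        path = os.path.join(models_dir, stem + ext)
        if os.path.isfile(path):
            # os.path.getsize reports bytes; 1 MB is taken as 1e6 bytes here
            sizes[ext] = os.path.getsize(path) / 1e6
    return sizes
```

In this notebook the call would be `serialized_sizes_mb(os.path.join(MODELS_DIR, 'testbed'), COUNTY_ID)`. Missing files are skipped rather than raising an error.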

In [12]:
%%time
# retrieve model from file
sess = rt.InferenceSession(os.path.join(MODELS_DIR, 'testbed', COUNTY_ID+'.onnx'))

CPU times: user 576 ms, sys: 52.3 ms, total: 629 ms
Wall time: 568 ms


In [13]:
%%time
# Score one record
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name
features = one_record.drop(['county', 'y'], axis='columns').astype(np.float32).to_numpy()
onnx_scores = sess.run([label_name], {input_name: features})[0]

CPU times: user 1.35 ms, sys: 0 ns, total: 1.35 ms
Wall time: 1.46 ms
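
The `%%time` figures above come from a single run each, so they can be noisy at millisecond scale. A minimal sketch of a repeated measurement with the standard-library `timeit` module follows; `best_time_ms` is a hypothetical helper and the `repeat` and `number` defaults are arbitrary assumptions.

```python
import timeit


def best_time_ms(fn, repeat=5, number=100):
    """Return the best average per-call wall time in ms.

    Runs `repeat` batches of `number` calls each and keeps the fastest batch,
    which is less sensitive to background activity than a single run.
    """
    batch_times = timeit.repeat(fn, repeat=repeat, number=number)
    return min(batch_times) / number * 1e3


# Stand-in workload for illustration only
print(f'{best_time_ms(lambda: sum(range(1000))):.4f} ms per call')
```

In the notebook, the scoring calls could be wrapped in zero-argument lambdas, for example `best_time_ms(lambda: sess.run([label_name], {input_name: features})[0])` with `features` holding the prepared `float32` array. IPython's `%%timeit` cell magic is an alternative that needs no helper.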


## Compare predicted scores

In [14]:
print(
    f'Score from unpickled RF model: {pkl_scores[0]:0.5f}, '
    f'Score from ONNX RF model: {onnx_scores[0,0]:0.5f}'
)

Score from unpickled RF model: -98.02014, Score from ONNX RF model: -98.02013
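
The two scores agree to about seven significant digits, which is consistent with the ONNX path casting its input to `float32`. Rather than comparing printed values by eye, the check can be made explicit with a relative tolerance. The sketch below is one way to do it; `scores_agree` is a hypothetical helper and the tolerance of `1e-5` is an assumption, not a calibrated threshold.

```python
import math


def scores_agree(score_a, score_b, rel_tol=1e-5):
    """Return True if the two scores match within a relative tolerance."""
    return math.isclose(score_a, score_b, rel_tol=rel_tol)


# The two scores printed above
print(scores_agree(-98.02014, -98.02013))
```

In the notebook the call would be `scores_agree(float(pkl_scores[0]), float(onnx_scores[0, 0]))`.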


## Collect RF structure metrics

In [15]:
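def get_rf_model_structure(model):
    """Return the number of trees and the min, mean, and max tree depth of a fitted forest."""
    number_of_estimators = len(model.estimators_)
    # tree_.max_depth is the depth of each fitted decision tree
    tree_depths = [tree.tree_.max_depth for tree in model.estimators_]
    min_depth = np.min(tree_depths)
    max_depth = np.max(tree_depths)
    mean_depth = np.mean(tree_depths)
    return number_of_estimators, min_depth, mean_depth, max_depth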

In [16]:
number_trees, smallest_tree, average_tree, biggest_tree = get_rf_model_structure(rf_pkl_model)

In [17]:
print(
    f'number of trees: {number_trees}, smallest tree depth: {smallest_tree}, '
    f'average tree depth: {average_tree}, biggest tree depth: {biggest_tree}'
)

number of trees: 100, smallest tree depth: 20, average tree depth: 22.3, biggest tree depth: 26
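
The figures above are tree depths. Node counts are another structural measure that tracks model size more directly, and scikit-learn exposes them as `tree_.node_count` on each fitted estimator. The sketch below is a possible companion to `get_rf_model_structure`; `get_rf_node_counts` is a hypothetical name, and the stand-in object exists only so the example runs without a fitted model.

```python
from statistics import mean
from types import SimpleNamespace


def get_rf_node_counts(model):
    """Return the min, mean, and max number of nodes per tree in a fitted forest."""
    node_counts = [est.tree_.node_count for est in model.estimators_]
    return min(node_counts), mean(node_counts), max(node_counts)


# Stand-in with the same attribute layout as a fitted forest (illustration only)
fake_forest = SimpleNamespace(
    estimators_=[SimpleNamespace(tree_=SimpleNamespace(node_count=n)) for n in (3, 5, 7)]
)
print(get_rf_node_counts(fake_forest))
```

In the notebook the call would be `get_rf_node_counts(rf_pkl_model)`.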
