# Analyzing results from notebooks

The `.ipynb` format is capable of storing tables and charts in a standalone file. This makes it a great choice for model evaluation reports. `NotebookCollection` allows you to retrieve results from previously executed notebooks to compare them.

In [1]:
import papermill as pm
import jupytext

from sklearn_evaluation import NotebookCollection

Let's first generate a few notebooks. We have a `train.py` script that trains a single model, so let's convert it to a Jupyter notebook:

In [2]:
nb = jupytext.read('train.py')
jupytext.write(nb, 'train.ipynb')

We use papermill to execute the notebook with different parameters. We'll train 4 models: 2 random forests, a linear regression and a support vector regression:

In [4]:
# models with their corresponding parameters
params = [{
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 50
    }
}, {
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 100
    }
}, {
    'model': 'sklearn.linear_model.LinearRegression',
    'params': {}
}, {
    'model': 'sklearn.svm.LinearSVR',
    'params': {}
}]

# ids to identify each experiment
ids = [
    'random_forest_1', 'random_forest_2', 'linear_regression',
    'support_vector_regression'
]

# output files
files = [f'{i}.ipynb' for i in ids]

# execute notebooks using papermill
for f, p in zip(files, params):
    pm.execute_notebook('train.ipynb', output_path=f, parameters=p)
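`train.py` is not shown here, so how it consumes the `model` parameter is an assumption. One common way for a script to turn a dotted path such as `'sklearn.ensemble.RandomForestRegressor'` into a class is `importlib`. The helper name `load_class` below is hypothetical, and the demo uses a standard-library class so the snippet runs without scikit-learn:

```python
import importlib


def load_class(dotted_path):
    """Import and return the object named by a 'package.module.Name' string."""
    module_name, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# stand-in for 'sklearn.ensemble.RandomForestRegressor'
cls = load_class('collections.OrderedDict')

# keyword arguments play the role of the 'params' dictionary
instance = cls(n_estimators=50)
```

With this approach, adding a new model to the experiment only requires a new entry in `params`, not a change to the training script.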


To use `NotebookCollection`, we pass a list of paths and, optionally, ids for each notebook (paths are used by default).

The only requirement is that cells whose output we want to extract must have tags; each tag then becomes a key in the notebook collection. For instructions on adding tags, see the [papermill documentation on parameterizing notebooks](https://papermill.readthedocs.io/en/latest/usage-parameterize.html).
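For reference, tags are stored in each cell's `metadata` under the `tags` key of the `.ipynb` JSON. The sketch below builds a tiny hand-written notebook and lists its tagged cells using only the standard library. The helper `cells_by_tag` is hypothetical and says nothing about how `NotebookCollection` reads tags internally:

```python
import json

# a minimal nbformat v4 notebook, written by hand for illustration
raw = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "source": "model = None", "outputs": [],
         "execution_count": None, "metadata": {"tags": ["parameters"]}},
        {"cell_type": "code", "source": "metrics_", "outputs": [],
         "execution_count": None, "metadata": {"tags": ["metrics"]}},
        {"cell_type": "code", "source": "print('untagged')", "outputs": [],
         "execution_count": None, "metadata": {}},
    ],
})


def cells_by_tag(notebook):
    """Map each tag to the source of the first cell that carries it."""
    found = {}
    for cell in notebook["cells"]:
        for tag in cell.get("metadata", {}).get("tags", []):
            found.setdefault(tag, cell["source"])
    return found


tags = cells_by_tag(json.loads(raw))
```

Cells without tags, like the third one, simply do not show up in the result.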

Extracted tables color certain cells to highlight the best and worst values for each metric. By default, metrics are assumed to be errors (smaller is better). If all your metrics are scores (larger is better), pass `scores=True`; if you have both, pass a list with the names of the score metrics:

In [5]:
nbs = NotebookCollection(paths=files, ids=ids, scores=['r2'])

To get a list of tags available:

In [6]:
list(nbs)

['model_name', 'feature_names', 'model_params', 'plot', 'metrics', 'houseage']

`model_params` contains a dictionary with model parameters. Let's get them (click on the tabs to switch):

In [7]:
# pro-tip: when typing the tag, press the "Tab" key for autocompletion!
nbs['model_params']

`plot` has a `y_true` vs `y_pred` chart:

In [8]:
nbs['plot']

In each notebook, `metrics` outputs a data frame with a single row and three columns: mean absolute error (mae), mean squared error (mse) and the coefficient of determination (r2).

For single-row tables, a "Compare" tab shows all results at once:

In [9]:
nbs['metrics']

"Compare" tab:
metric,random_forest_1,random_forest_2,linear_regression,support_vector_regression
mae,0.335897,0.334725,0.529571,2.069225
mse,0.260655,0.261182,0.536969,7.196834
r2,0.8044,0.804005,0.597049,-4.400629

"random_forest_1" tab:
mae,mse,r2
0.335897,0.260655,0.8044

"random_forest_2" tab:
mae,mse,r2
0.334725,0.261182,0.804005

"linear_regression" tab:
mae,mse,r2
0.529571,0.536969,0.597049

"support_vector_regression" tab:
mae,mse,r2
2.069225,7.196834,-4.400629


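We can see that the two random forests perform almost identically and clearly beat the other two models: the second one has the lowest mae, while the first one has the lowest mse and the highest r2.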
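The highlighting rule (smaller is better for errors, larger is better for scores) can be sketched in plain Python. This is an illustration that uses the numbers from the table above, not the library's implementation, and `best_and_worst` is a hypothetical helper:

```python
# values copied from the "Compare" table above
metrics = {
    'mae': {'random_forest_1': 0.335897, 'random_forest_2': 0.334725,
            'linear_regression': 0.529571,
            'support_vector_regression': 2.069225},
    'mse': {'random_forest_1': 0.260655, 'random_forest_2': 0.261182,
            'linear_regression': 0.536969,
            'support_vector_regression': 7.196834},
    'r2': {'random_forest_1': 0.8044, 'random_forest_2': 0.804005,
           'linear_regression': 0.597049,
           'support_vector_regression': -4.400629},
}

# larger is better for these; every other metric is treated as an error
scores = ['r2']


def best_and_worst(values, is_score):
    """Return the ids holding the best and the worst value of one metric."""
    pick_best = max if is_score else min
    pick_worst = min if is_score else max
    return pick_best(values, key=values.get), pick_worst(values, key=values.get)


summary = {name: best_and_worst(values, name in scores)
           for name, values in metrics.items()}
```

Here `summary` maps each metric name to a `(best_id, worst_id)` pair, which mirrors the cells that get highlighted.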

`houseage` contains a multi-row table with the metrics broken down by the `HouseAge` indicator feature. When the collection holds more than two notebooks, multi-row tables *do not* display the "Compare" tab:

In [17]:
nbs['houseage']

"random_forest_1" tab:
HouseAge,mae,mse,r2
1.0,0.7677,0.589364,
2.0,0.599907,0.710334,0.595768
3.0,0.479741,0.35022,0.462628
4.0,0.455627,0.44398,0.575692
5.0,0.378937,0.300442,0.611565
6.0,0.462348,0.703149,0.36362
7.0,0.361878,0.281907,0.31237
8.0,0.343787,0.227829,0.822608
9.0,0.305901,0.242166,0.696575
10.0,0.379315,0.313215,0.680129

"random_forest_2" tab:
HouseAge,mae,mse,r2
1.0,0.92108,0.848389,
2.0,0.590687,0.64542,0.632709
3.0,0.464221,0.318916,0.51066
4.0,0.414445,0.376015,0.640646
5.0,0.401092,0.322133,0.58352
6.0,0.445738,0.742799,0.327734
7.0,0.355293,0.263592,0.357044
8.0,0.328219,0.20369,0.841403
9.0,0.29365,0.207941,0.739458
10.0,0.357486,0.287711,0.706175

"linear_regression" tab:
HouseAge,mae,mse,r2
1.0,0.077045,0.005936,
2.0,0.621786,0.704731,0.598956
3.0,0.394869,0.309883,0.52452
4.0,0.50212,0.526965,0.496384
5.0,0.402867,0.349914,0.547602
6.0,0.535902,0.955226,0.135479
7.0,0.471769,0.422926,-0.031604
8.0,0.435266,0.300705,0.765865
9.0,0.395658,0.334463,0.58093
10.0,0.548596,0.44574,0.544787

"support_vector_regression" tab:
HouseAge,mae,mse,r2
1.0,0.34344,0.117951,
2.0,3.770918,26.895622,-14.305588
3.0,4.739191,50.409633,-76.347725
4.0,4.834227,47.458356,-44.355506
5.0,3.562077,23.406674,-29.262066
6.0,3.104015,12.781344,-10.567671
7.0,3.427083,19.193599,-45.817185
8.0,3.03937,14.417053,-10.225399
9.0,2.607912,10.82966,-12.569172
10.0,2.243591,8.865642,-8.054045


If we only compare two notebooks, the output is a bit different:

In [18]:
# only compare two notebooks
nbs_two = NotebookCollection(paths=files[:2], ids=ids[:2], scores=['r2'])

Comparing single-row tables from two notebooks adds `diff`, `diff_relative` and `ratio` columns that quantify the change between experiments. Error reductions are shown in green, increments in red:

In [19]:
nbs_two['metrics']

"Compare" tab:
metric,random_forest_1,random_forest_2,diff,diff_relative,ratio
mae,0.335897,0.334725,-0.001172,-0.35%,0.996511
mse,0.260655,0.261182,0.000527,0.20%,1.002022
r2,0.8044,0.804005,-0.000395,-0.05%,0.999509

"random_forest_1" tab:
mae,mse,r2
0.335897,0.260655,0.8044

"random_forest_2" tab:
mae,mse,r2
0.334725,0.261182,0.804005
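The comparison columns above are consistent with three simple formulas: second minus first, that difference divided by the first value, and second divided by first. The sketch below reproduces the numbers under that assumption. The formulas are inferred from the output, not taken from the library's source, and `compare` is a hypothetical helper:

```python
# single-row metrics copied from the tables above
first = {'mae': 0.335897, 'mse': 0.260655, 'r2': 0.8044}     # random_forest_1
second = {'mae': 0.334725, 'mse': 0.261182, 'r2': 0.804005}  # random_forest_2


def compare(a, b):
    """Absolute difference, relative difference and ratio of b against a."""
    diff = b - a
    return {'diff': diff, 'diff_relative': diff / a, 'ratio': b / a}


rows = {name: compare(first[name], second[name]) for name in first}

for name, row in rows.items():
    print(f"{name}: {row['diff']:.6f} "
          f"{row['diff_relative']:.2%} {row['ratio']:.6f}")
```

For error metrics a negative `diff` is an improvement. For scores such as r2 the reading is reversed, which is why the collection needs to know which metrics are scores.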


With only two notebooks, multi-row tables also get a "Compare" tab, which shows the cell-by-cell difference between the two tables (second notebook minus first):

In [20]:
nbs_two['houseage']

"Compare" tab:
HouseAge,mae,mse,r2
1.0,0.15338,0.259025,
2.0,-0.00922,-0.064914,0.036941
3.0,-0.01552,-0.031304,0.048032
4.0,-0.041182,-0.067965,0.064954
5.0,0.022155,0.021691,-0.028045
6.0,-0.01661,0.03965,-0.035886
7.0,-0.006585,-0.018315,0.044674
8.0,-0.015568,-0.024139,0.018795
9.0,-0.012251,-0.034225,0.042883
10.0,-0.021829,-0.025504,0.026046

"random_forest_1" tab:
HouseAge,mae,mse,r2
1.0,0.7677,0.589364,
2.0,0.599907,0.710334,0.595768
3.0,0.479741,0.35022,0.462628
4.0,0.455627,0.44398,0.575692
5.0,0.378937,0.300442,0.611565
6.0,0.462348,0.703149,0.36362
7.0,0.361878,0.281907,0.31237
8.0,0.343787,0.227829,0.822608
9.0,0.305901,0.242166,0.696575
10.0,0.379315,0.313215,0.680129

"random_forest_2" tab:
HouseAge,mae,mse,r2
1.0,0.92108,0.848389,
2.0,0.590687,0.64542,0.632709
3.0,0.464221,0.318916,0.51066
4.0,0.414445,0.376015,0.640646
5.0,0.401092,0.322133,0.58352
6.0,0.445738,0.742799,0.327734
7.0,0.355293,0.263592,0.357044
8.0,0.328219,0.20369,0.841403
9.0,0.29365,0.207941,0.739458
10.0,0.357486,0.287711,0.706175
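The values in the multi-row comparison match an element-wise subtraction of the first notebook's table from the second's. A minimal sketch with two `HouseAge` rows copied from the tables above (`table_diff` is a hypothetical helper, and plain dictionaries stand in for data frames):

```python
# each row holds (mae, mse, r2) for one HouseAge value
random_forest_1 = {2.0: (0.599907, 0.710334, 0.595768),
                   3.0: (0.479741, 0.35022, 0.462628)}
random_forest_2 = {2.0: (0.590687, 0.64542, 0.632709),
                   3.0: (0.464221, 0.318916, 0.51066)}


def table_diff(a, b):
    """Subtract table a from table b cell by cell, for rows present in both."""
    return {
        key: tuple(round(y - x, 6) for x, y in zip(a[key], b[key]))
        for key in a if key in b
    }


diff = table_diff(random_forest_1, random_forest_2)
```

Rounding to six decimals matches the precision displayed in the tables.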


When displaying dictionaries, the "Compare" tab shows a diff view:

In [21]:
nbs_two['model_params']

"Compare" tab:
line,random_forest_1,random_forest_2
1,{,{
2,"'bootstrap': True,","'bootstrap': True,"
3,"'ccp_alpha': 0.0,","'ccp_alpha': 0.0,"
4,"'criterion': 'squared_error',","'criterion': 'squared_error',"
5,"'max_depth': None,","'max_depth': None,"
6,"'max_features': 1.0,","'max_features': 1.0,"
7,"'max_leaf_nodes': None,","'max_leaf_nodes': None,"
8,"'max_samples': None,","'max_samples': None,"
9,"'min_impurity_decrease': 0.0,","'min_impurity_decrease': 0.0,"
10,"'min_samples_leaf': 1,","'min_samples_leaf': 1,"
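A line-oriented diff of two parameter dictionaries can be sketched with the standard library's `pprint` and `difflib`. Whether the library builds its view this way is an assumption, and the dictionaries below are shortened, hypothetical versions of the real ones (only the `n_estimators` values of 50 and 100 come from the experiment definition):

```python
import difflib
from pprint import pformat

# shortened, hypothetical parameter dictionaries for the two random forests
first = {'bootstrap': True, 'max_depth': None, 'n_estimators': 50}
second = {'bootstrap': True, 'max_depth': None, 'n_estimators': 100}

# width=1 forces one key per line, so the diff works line by line
first_lines = pformat(first, width=1).splitlines()
second_lines = pformat(second, width=1).splitlines()

diff = list(difflib.unified_diff(first_lines, second_lines,
                                 fromfile='random_forest_1',
                                 tofile='random_forest_2',
                                 lineterm=''))
print('\n'.join(diff))
```

Only the line holding `n_estimators` is marked as removed and added; the unchanged keys appear as context.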


Lists (and sets) are compared based on which elements exist in each notebook:

In [16]:
nbs_two['feature_names']

Both,Only in random_forest_1,Only in random_forest_2
AveBedrms,,
AveOccup,,
AveRooms,,
HouseAge,,
Latitude,,
Longitude,,
MedInc,,
Population,,
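This kind of existence-based comparison maps directly onto set operations. A minimal sketch, where `compare_elements` is a hypothetical helper and the second call uses a made-up variation (dropping `Population`) to show a non-empty "only in" column:

```python
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
            'Population', 'AveOccup', 'Latitude', 'Longitude']


def compare_elements(first, second):
    """Split elements into those in both inputs and those in only one."""
    a, b = set(first), set(second)
    return {
        'both': sorted(a & b),
        'only_first': sorted(a - b),
        'only_second': sorted(b - a),
    }


# identical feature lists, as in the table above
same = compare_elements(features, features)

# hypothetical case: the second notebook drops one feature
reduced = [name for name in features if name != 'Population']
changed = compare_elements(features, reduced)
```

Because sets ignore order and duplicates, two lists with the same elements in a different order compare as equal under this scheme.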


## Using the mapping interface

`NotebookCollection` has a dict-like interface, so you can retrieve data from individual notebooks:

In [None]:
nbs['model_params']['random_forest_1']

In [None]:
nbs['plot']['random_forest_2']
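To make the two-level lookup concrete, here is a toy read-only mapping from tag to notebook id to output, built on `collections.abc.Mapping`. The class `TinyCollection` is hypothetical and only illustrates the access pattern; it is not how `NotebookCollection` is implemented:

```python
from collections.abc import Mapping


class TinyCollection(Mapping):
    """Toy read-only mapping: tag -> {notebook id -> extracted output}."""

    def __init__(self, outputs):
        self._outputs = outputs

    def __getitem__(self, tag):
        return self._outputs[tag]

    def __iter__(self):
        # iterating yields the tags, which is why list(collection) lists them
        return iter(self._outputs)

    def __len__(self):
        return len(self._outputs)


toy = TinyCollection({
    'model_params': {'random_forest_1': {'n_estimators': 50},
                     'random_forest_2': {'n_estimators': 100}},
    'metrics': {'random_forest_1': {'mae': 0.335897},
                'random_forest_2': {'mae': 0.334725}},
})

tags = list(toy)
params = toy['model_params']['random_forest_1']
```

Implementing the three abstract methods is enough to get `in`, `.keys()`, `.items()` and `.get()` for free from `Mapping`.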