# Parallelize your experiments locally

In this section, we'll see how to parallelize your work, cache results, and parameterize workflows so that multiple experiments can run simultaneously.
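As a rough illustration of what running parameterized experiments side by side can look like, here is a minimal sketch using only the standard library. The `run_experiment` function and the parameter names are hypothetical placeholders, not part of Ploomber. Threads are used so the sketch stays runnable inside a notebook; CPU-bound work would typically use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_experiment(params):
    """Hypothetical stand-in for executing one parameterized experiment."""
    # A real workflow would train a model or execute a notebook here.
    score = params["n_estimators"] * params["max_depth"]
    return {**params, "score": score}


# Build the parameter grid: one dict per experiment.
grid = [
    {"n_estimators": n, "max_depth": d}
    for n, d in product([10, 50], [2, 4])
]

# Run the experiments concurrently; map() preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))

for row in results:
    print(row)
```

Each experiment receives its own parameter dictionary, so adding a new configuration only means extending the grid.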

In [1]:
# Install packages
!pip install -q memory-profiler
!pip install -q ploomber-engine
!pip install -q sklearn-evaluation 

In [2]:
!ploomber examples -n guides/intro-to-ploomber -o intro

Loading examples...
Error: 'intro' already exists in the current working directory, please rename it or move it to another location and try again.

____________
**Go to the intro folder and run the README.ipynb file**

# Ploomber engine

In [3]:
# get sample notebook
!curl -O https://raw.githubusercontent.com/ploomber/ploomber-engine/main/tests/assets/debuglater.ipynb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2054  100  2054    0     0   6769      0 --:--:-- --:--:-- --:--:--  6869


In [4]:
# TODO: Shift into ploomber-engine
!papermill debuglater.ipynb tmp.ipynb --engine debuglater

Input Notebook:  debuglater.ipynb
Output Notebook: tmp.ipynb
Executing notebook with kernel: python3
Executing: 2cell [00:01,  1.45cell/s]                                           
Traceback (most recent call last):
  File "/Users/idomi/opt/miniconda3/bin/papermill", line 8, in <module>
    sys.exit(papermill())
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_curr

### Run the following command on a terminal
`dltr jupyter.dump`

### Debugging & profiling
Once the notebook has been debugged and fixed, we can profile it.
Profiling tells us how much CPU and memory the notebook consumes while it runs.
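To make the idea of measuring CPU and memory concrete, here is a minimal sketch using the standard library's `time.process_time` and `tracemalloc`. This is not how `mprof` or the profiling engine work internally: `mprof` samples the memory of the whole process, while `tracemalloc` only traces allocations made by Python. The `workload` function is a hypothetical stand-in for the code in a notebook cell.

```python
import time
import tracemalloc


def workload():
    """Hypothetical stand-in for the code inside a notebook cell."""
    data = [i * i for i in range(200_000)]
    return sum(data)


# Start tracing Python memory allocations and record the CPU clock.
tracemalloc.start()
cpu_start = time.process_time()

result = workload()

# CPU seconds used by this process, and current/peak traced memory in bytes.
cpu_seconds = time.process_time() - cpu_start
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"CPU time: {cpu_seconds:.3f} s")
print(f"Peak traced memory: {peak_bytes / 1e6:.1f} MB")
```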

In [5]:
# mprof run papermill ../tests/assets/profiling.ipynb tmp.ipynb
!mprof run papermill debuglater.ipynb tmp.ipynb --engine profiling

mprof: Sampling memory every 0.1s
running new process
Input Notebook:  debuglater.ipynb
Output Notebook: tmp.ipynb
Executing: 100%|████████████████████████████████| 1/1 [00:00<00:00,  1.77cell/s]
Traceback (most recent call last):
  File "/Users/idomi/opt/miniconda3/bin/papermill", line 8, in <module>
    sys.exit(papermill())
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/idomi/opt/miniconda3/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    ret

In [6]:
!mprof plot --output profiling.png

Using last profile data.


# Execute on the cloud

To run in the cloud, we need a Ploomber API key from https://cloud.ploomber.io/register (you can sign in with a third-party account such as Google or GitHub). Once you're in, you'll see a set of instructions to follow. You can use this notebook to execute them.

In [7]:
# Fill cloud instructions in each of the cells here...
# !pip install ploomber --upgrade
# !ploomber cloud set-key {your-key}
# !curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-simple/plot.ipynb -o plot.ipynb
# !ploomber cloud nb plot.ipynb
# !ploomber cloud list
# !ploomber cloud logs @latest --image | tail -n 10
# !ploomber cloud list
# !ploomber cloud status @latest
# !ploomber cloud status @latest
# !ploomber cloud products
# !ploomber cloud download 'plot-aebe61a1/*.ipynb'


_____
**Make sure you complete the cloud section before moving on to visualization!**

___
# Visualization and analysis comparison

## Comparing classifiers

Learn how to easily compare plots from different models.

- Compare two models by plotting all values: `plot1 + plot2`
- Compare the performance between two models: `plot2 - plot1`
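To show the kind of arithmetic a difference between two confusion matrices involves, here is a minimal pure-Python sketch. It is an assumption about the underlying idea, not the library's implementation, and the labels and predictions are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels and predictions from two models.
y_true = [0, 0, 1, 1, 2, 2]
pred_a = [0, 1, 1, 1, 2, 0]
pred_b = [0, 0, 1, 2, 2, 2]

cm_a = confusion_matrix(y_true, pred_a, n_classes=3)
cm_b = confusion_matrix(y_true, pred_b, n_classes=3)

# Element-wise difference: a positive diagonal entry means model B
# classified more samples of that class correctly than model A.
diff = [
    [b - a for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(cm_a, cm_b)
]

for row in diff:
    print(row)
```

Reading the diagonal of `diff` gives a quick per-class view of which model does better.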

## Confusion matrix

*Added in sklearn-evaluation version 0.7.2*

In [8]:
import matplotlib
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn_evaluation import plot

In [9]:
matplotlib.rcParams["figure.figsize"] = (7, 7)
matplotlib.rcParams["font.size"] = 18

In [10]:
# get training and testing data
X, y = datasets.make_classification(
    1000, 20, n_informative=10, class_sep=0.80, n_classes=3, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# fit decision tree and random forest, collect their predictions on the test set
tree_pred, forest_pred = [
    est.fit(X_train, y_train).predict(X_test)
    for est in [DecisionTreeClassifier(), RandomForestClassifier()]
]

tree_cm = plot.ConfusionMatrix(y_test, tree_pred, normalize=False)
forest_cm = plot.ConfusionMatrix(y_test, forest_pred, normalize=False)

### Decision tree confusion matrix

In [11]:
tree_cm

### Random forest confusion matrix

In [12]:
forest_cm

### Compare confusion matrices

In [13]:
tree_cm + forest_cm

In [14]:
forest_cm - tree_cm

## Classification report

*Added in sklearn-evaluation version 0.7.8*

In [15]:
# !pip install --upgrade sklearn-evaluation

In [16]:
tree_cr = plot.ClassificationReport(y_test, tree_pred)
forest_cr = plot.ClassificationReport(y_test, forest_pred)

### Decision tree classification report

In [17]:
tree_cr

### Random forest classification report

In [18]:
forest_cr

### Compare classification reports

In [19]:
tree_cr + forest_cr

In [20]:
forest_cr - tree_cr

____
# Experiment Tracking

# Tracking Machine Learning experiments

`SQLiteTracker` provides a simple yet powerful way to track ML experiments using a SQLite database.
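To give a feel for the idea of tracking experiments in SQLite, here is a minimal sketch using only the standard library. It is not the `SQLiteTracker` implementation, whose schema and internals may differ. The table name `experiments` and the columns `uuid`, `created`, `parameters`, and `comment` are taken from the query and outputs shown later in this section; the parameter values are made up.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE experiments (
        uuid TEXT PRIMARY KEY,
        created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        parameters TEXT,
        comment TEXT
    )
    """
)

# Register a new experiment with a generated identifier.
experiment_id = uuid.uuid4().hex
conn.execute("INSERT INTO experiments (uuid) VALUES (?)", (experiment_id,))

# Store the parameters and metrics as a JSON document (hypothetical values).
params = {"mse": 0.12, "model": "Lasso", "alpha": 1.0}
conn.execute(
    "UPDATE experiments SET parameters = ? WHERE uuid = ?",
    (json.dumps(params), experiment_id),
)
conn.commit()

# Read the experiment back and decode the JSON document.
row = conn.execute(
    "SELECT uuid, parameters FROM experiments WHERE uuid = ?",
    (experiment_id,),
).fetchone()
stored = json.loads(row[1])
print(stored)
```

Keeping parameters in a single JSON column means experiments with different sets of parameters can share one table.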

In [21]:
from sklearn_evaluation import SQLiteTracker

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [22]:
tracker = SQLiteTracker('my_experiments.db')

In [23]:
iris = load_iris(as_frame=True)
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

models = [RandomForestRegressor(), LinearRegression(), Lasso()]

In [24]:
for m in models:
    model = type(m).__name__
    print(f'Fitting {model}')

    # .new() returns a uuid and creates an entry in the db
    uuid = tracker.new()
    m.fit(X_train, y_train)
    y_pred = m.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    # add data with the .update(uuid, {'param': 'value'}) method
    tracker.update(uuid, {'mse': mse, 'model': model, **m.get_params()})

Fitting RandomForestRegressor
Fitting LinearRegression
Fitting Lasso


Or use `.insert(uuid, params)` to supply your own ID:

In [25]:
svr = SVR()
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

tracker.insert('my_uuid', {'mse': mse, 'model': type(svr).__name__, **svr.get_params()})

Displaying `tracker` shows the most recent experiments by default:

In [26]:
tracker

uuid,created,parameters,comment
my_uuid,2022-11-08 15:30:09,"{""mse"": 0.03041912541362142, ""model"": ""SVR"", ""C"": 1.0, ""cache_size"": 200, ""coef0"": 0.0, ""degree"": 3, ""epsilon"": 0.1, ""gamma"": ""scale"", ""kernel"": ""rbf"", ""max_iter"": -1, ""shrinking"": true, ""tol"": 0.001, ""verbose"": false}",
bc0f6544e8b548ac9d60475900d12a38,2022-11-08 15:30:08,"{""mse"": 0.008448000000000002, ""model"": ""RandomForestRegressor"", ""bootstrap"": true, ""ccp_alpha"": 0.0, ""criterion"": ""squared_error"", ""max_depth"": null, ""max_features"": 1.0, ""max_leaf_nodes"": null, ""max_samples"": null, ""min_impurity_decrease"": 0.0, ""min_samples_leaf"": 1, ""min_samples_split"": 2, ""min_weight_fraction_leaf"": 0.0, ""n_estimators"": 100, ""n_jobs"": null, ""oob_score"": false, ""random_state"": null, ""verbose"": 0, ""warm_start"": false}",
6034260fb237454d81e414bf68835106,2022-11-08 15:30:08,"{""mse"": 0.04260034113761793, ""model"": ""LinearRegression"", ""copy_X"": true, ""fit_intercept"": true, ""n_jobs"": null, ""normalize"": ""deprecated"", ""positive"": false}",
81b451d511074aada456a3d788ed4c2e,2022-11-08 15:30:08,"{""mse"": 0.4317655183287654, ""model"": ""Lasso"", ""alpha"": 1.0, ""copy_X"": true, ""fit_intercept"": true, ""max_iter"": 1000, ""normalize"": ""deprecated"", ""positive"": false, ""precompute"": false, ""random_state"": null, ""selection"": ""cyclic"", ""tol"": 0.0001, ""warm_start"": false}",


## Querying experiments

In [27]:
ordered = tracker.query("""
SELECT uuid,
       json_extract(parameters, '$.model') AS model,
       json_extract(parameters, '$.mse') AS mse
FROM experiments
ORDER BY json_extract(parameters, '$.mse') ASC
""")
ordered

uuid,model,mse
bc0f6544e8b548ac9d60475900d12a38,RandomForestRegressor,0.008448
my_uuid,SVR,0.030419
6034260fb237454d81e414bf68835106,LinearRegression,0.0426
81b451d511074aada456a3d788ed4c2e,Lasso,0.431766


The query method returns a data frame with "uuid" as the index:

In [28]:
type(ordered)

pandas.core.frame.DataFrame
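The `json_extract` calls in the query above are SQLite functions that pull a value out of a JSON document by path. The following sketch reproduces that behavior with the standard library's `sqlite3` module, assuming the bundled SQLite build includes the JSON functions (most recent Python distributions do). The experiment identifiers and metric values are made up.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (uuid TEXT PRIMARY KEY, parameters TEXT)")

# Hypothetical experiments with made-up metric values.
rows = [
    ("exp-a", {"model": "Lasso", "mse": 0.43}),
    ("exp-b", {"model": "SVR", "mse": 0.03}),
    ("exp-c", {"model": "LinearRegression", "mse": 0.04}),
]
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?)",
    [(uid, json.dumps(params)) for uid, params in rows],
)

# '$.model' and '$.mse' are JSON paths into the parameters document.
ranked = conn.execute(
    """
    SELECT uuid,
           json_extract(parameters, '$.model') AS model,
           json_extract(parameters, '$.mse') AS mse
    FROM experiments
    ORDER BY json_extract(parameters, '$.mse') ASC
    """
).fetchall()

for row in ranked:
    print(row)
```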

## Adding comments

In [29]:
tracker.comment(ordered.index[0], 'Best performing experiment')

Use `tracker[uuid]` to get a single experiment:

In [30]:
tracker[ordered.index[0]]

uuid,created,parameters,comment
bc0f6544e8b548ac9d60475900d12a38,2022-11-08 15:30:08,"{""mse"": 0.008448000000000002, ""model"": ""Random...",Best performing experiment


## Getting recent experiments

The recent method also returns a data frame:

In [31]:
df = tracker.recent()
df

uuid,created,parameters,comment
my_uuid,2022-11-08 15:30:09,"{""mse"": 0.03041912541362142, ""model"": ""SVR"", ""...",
bc0f6544e8b548ac9d60475900d12a38,2022-11-08 15:30:08,"{""mse"": 0.008448000000000002, ""model"": ""Random...",Best performing experiment
6034260fb237454d81e414bf68835106,2022-11-08 15:30:08,"{""mse"": 0.04260034113761793, ""model"": ""LinearR...",
81b451d511074aada456a3d788ed4c2e,2022-11-08 15:30:08,"{""mse"": 0.4317655183287654, ""model"": ""Lasso"", ...",


Pass `normalize=True` to convert the nested JSON dictionary into columns:

In [32]:
df = tracker.recent(normalize=True)
df

uuid,created,mse,model,C,cache_size,coef0,degree,epsilon,gamma,kernel,...,random_state,warm_start,copy_X,fit_intercept,normalize,positive,alpha,precompute,selection,comment
my_uuid,2022-11-08 15:30:09,0.030419,SVR,1.0,200.0,0.0,3.0,0.1,scale,rbf,...,,,,,,,,,,
bc0f6544e8b548ac9d60475900d12a38,2022-11-08 15:30:08,0.008448,RandomForestRegressor,,,,,,,,...,,False,,,,,,,,Best performing experiment
6034260fb237454d81e414bf68835106,2022-11-08 15:30:08,0.0426,LinearRegression,,,,,,,,...,,,True,True,deprecated,False,,,,
81b451d511074aada456a3d788ed4c2e,2022-11-08 15:30:08,0.431766,Lasso,,,,,,,,...,,False,True,True,deprecated,False,1.0,False,cyclic,
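The empty cells in the table above appear because each model has its own set of parameters. Here is a minimal pure-Python sketch of that kind of flattening. It is an illustration of the idea only, not the tracker's implementation, and the rows are hypothetical.

```python
import json

# Hypothetical stored rows: (uuid, JSON-encoded parameters).
rows = [
    ("exp-a", json.dumps({"model": "SVR", "C": 1.0})),
    ("exp-b", json.dumps({"model": "Lasso", "alpha": 1.0})),
]

# Decode each JSON document into a flat record keyed by parameter name.
records = [{"uuid": uid, **json.loads(params)} for uid, params in rows]

# Collect the union of keys in first-seen order to form the columns.
columns = list(dict.fromkeys(key for record in records for key in record))

# Parameters a model does not have become None (the empty cells).
table = [[record.get(col) for col in columns] for record in records]

print(columns)
for line in table:
    print(line)
```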


In [33]:
# delete our example database
from pathlib import Path
Path('my_experiments.db').unlink()