## PyPads/PadrePads Demo

### Mapping files
#### Default event-based loggers in PyPads
Before defining hooks for PyPads events and mapping them to PyPads loggers, let's look at the list of default loggers.

| Logger | Event | Hook | Description |
| :---: | :---: | :---: | --- |
| LogInit | init | 'pypads_init' | Debugging purposes. |
| Log | log | 'pypads_log' | Debugging purposes. |
| Parameters | parameters | 'pypads_fit' | Tracks the parameters of the tracked function call. |
| Cpu, Ram, Disk | hardware | 'pypads_fit' | Tracks usage information, properties and other info on CPU, memory and disk. |
| Input | input | 'pypads_fit' | Tracks the input parameters of the current tracked function call. |
| Output | output | 'pypads_predict', 'pypads_fit' | Logs the output of the current tracked function call. |
| Metric | metric | 'pypads_metric' | Tracks the output of the tracked metric function. |
| PipelineTracker | pipeline | 'pypads_fit', 'pypads_predict', 'pypads_transform', 'pypads_metric' | Tracks the execution workflow of the different pipeline elements of the experiment. |

**Note**: The loggers that we will focus on are:
- Parameters: event->parameters, hook->pypads_fit
- Input: event->input, hook->pypads_fit
- Output: event->output, hook->pypads_fit, pypads_predict
- Metric: event->metric, hook->pypads_metric
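
One way to read the note above is as a lookup from a hook to the loggers it triggers. The sketch below is illustrative only: `DEFAULT_LOGGERS` and `loggers_for_hook` are hypothetical names, not part of the PyPads API, and the entries are simply copied from the four loggers listed in the note.

```python
# Illustrative only: a plain-dict copy of the four loggers highlighted above.
# DEFAULT_LOGGERS and loggers_for_hook are hypothetical names, not PyPads API.
DEFAULT_LOGGERS = {
    "Parameters": {"event": "parameters", "hooks": ["pypads_fit"]},
    "Input": {"event": "input", "hooks": ["pypads_fit"]},
    "Output": {"event": "output", "hooks": ["pypads_fit", "pypads_predict"]},
    "Metric": {"event": "metric", "hooks": ["pypads_metric"]},
}


def loggers_for_hook(hook):
    """Return the sorted names of all loggers whose event is triggered by `hook`."""
    return sorted(
        name for name, spec in DEFAULT_LOGGERS.items() if hook in spec["hooks"]
    )


print(loggers_for_hook("pypads_fit"))      # ['Input', 'Output', 'Parameters']
print(loggers_for_hook("pypads_predict"))  # ['Output']
```

So a function mapped to `pypads_fit` would feed three of the four loggers at once, while one mapped to `pypads_predict` would only have its output logged.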

#### Default event-based loggers in PadrePads

| Logger | Event | Hook | Description |
| :---: | :---: | :---: | --- |
| Dataset | dataset | 'pypads_dataset' | Tracks and logs your dataset object and its metadata. |
| Split | splits | 'pypads_split' | Logs the splits of your dataset (train and test indices). |
| ParameterSearch | parameter_search | 'pypads_param_search' | Logs the parameter combinations of a hyperparameter grid search, if one is tracked. |
| Decisions | predictions | 'pypads_predict' | Tracks individual decisions of your estimators (Predicted_value/Truth_value/Decision_score) whenever possible. |
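
To make the Decisions row more concrete, here is a minimal sketch of what a per-sample decision record could look like. The `Decision` class and `collect_decisions` helper are hypothetical and only mirror the three fields named in the table (predicted value, truth value, decision score); the actual format PadrePads writes may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Decision:
    """Hypothetical per-sample record; not the PadrePads storage format."""
    index: int
    predicted: int
    truth: int
    score: Optional[float] = None  # not every estimator exposes a score


def collect_decisions(
    predicted: Sequence[int],
    truth: Sequence[int],
    scores: Optional[Sequence[float]] = None,
) -> List[Decision]:
    """Pair each prediction with its ground truth and, if given, its score."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if scores is None:
        scores = [None] * len(predicted)
    return [
        Decision(i, p, t, s)
        for i, (p, t, s) in enumerate(zip(predicted, truth, scores))
    ]


records = collect_decisions([1, 0, 0, 1], [1, 0, 1, 1], [0.97, 0.88, 0.61, 0.99])
misclassified = [r.index for r in records if r.predicted != r.truth]
print(misclassified)  # [2]
```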



#### How to define what is tracked in PadrePads?
First, suppose we want to run the following machine learning workflow, a simple classification example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


# Load dataset
data = load_breast_cancer()

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Look at our data
print(label_names)
print('Class label = ', labels[0])
print(feature_names)
print(features[0])

# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))
```
##### Example
Suppose we want to track the data flow of the estimator GaussianNB (inputs and outputs) as well as the parameters of the model and the accuracy metric value.

The entry in the mapping file would look like the following:
```json
{
  "default_hooks": {
    "modules": {
      "fns": {}
    },
    "classes": {
      "fns": {
        "pypads_init": [
          "__init__"
        ]
      }
    },
    "fns": {}
  },
  "algorithms": [
    {
      "name": "Gaussian Naive Bayes",
      "other_names": [],
      "implementation": {
        "sklearn": "sklearn.naive_bayes.GaussianNB"
      },
      "hooks": {
        "pypads_fit": [
          "fit",
          "fit_predict",
          "fit_transform"
        ],
        "pypads_predict": [
          "fit_predict",
          "predict"
        ],
        "pypads_transform": [
          "fit_transform",
          "transform"
        ]
      }
    },
    {
      "name": "sklearn classification metrics",
      "other_names": [],
      "implementation": {
        "sklearn": "sklearn.metrics.classification"
      },
      "hooks": {
        "pypads_metric": [
          "accuracy_score"
        ]
      }
    }
  ],
  "metadata": {
    "author": "DEMO",
    "library": "sklearn",
    "library_version": "0.19.1",
    "mapping_version": "0.1"
  }
}
```
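
To check what a mapping like the one above actually covers, it can help to flatten it into a list of fully qualified function names and the hooks attached to each. The sketch below does this with the standard library only. `hooks_by_function` is a hypothetical helper, not part of PyPads, and the mapping is a trimmed copy of the example that keeps just the keys the helper reads.

```python
import json

# A trimmed copy of the mapping above, kept as a JSON string for the example.
MAPPING_TEXT = """
{
  "algorithms": [
    {
      "name": "Gaussian Naive Bayes",
      "implementation": {"sklearn": "sklearn.naive_bayes.GaussianNB"},
      "hooks": {
        "pypads_fit": ["fit", "fit_predict", "fit_transform"],
        "pypads_predict": ["fit_predict", "predict"]
      }
    },
    {
      "name": "sklearn classification metrics",
      "implementation": {"sklearn": "sklearn.metrics.classification"},
      "hooks": {"pypads_metric": ["accuracy_score"]}
    }
  ]
}
"""


def hooks_by_function(mapping):
    """Map 'package.path.function' to the sorted list of hooks attached to it."""
    result = {}
    for algorithm in mapping["algorithms"]:
        for target in algorithm["implementation"].values():
            for hook, functions in algorithm["hooks"].items():
                for fn in functions:
                    result.setdefault(f"{target}.{fn}", []).append(hook)
    return {name: sorted(hooks) for name, hooks in result.items()}


table = hooks_by_function(json.loads(MAPPING_TEXT))
for name, hooks in sorted(table.items()):
    print(name, "->", hooks)
```

In this flattened view `fit_predict` carries both `pypads_fit` and `pypads_predict`, so a single call to it would trigger the loggers of both hooks.
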

**Note**: A hook can be defined in three different ways: with "always", with a regular expression for the function name, or with a package name hook.
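
When a function name is given as a regular expression, the result depends on how the pattern is applied. This document does not say which matching function PyPads uses, so the sketch below only assumes Python's standard `re` module and compares prefix matching (`re.match`) with whole-name matching (`re.fullmatch`). The candidate names are illustrative.

```python
import re

# Illustrative function names; only accuracy_score appears in the demo itself.
candidates = ["accuracy_score", "balanced_accuracy_score", "accuracy", "f1_score"]


def matching(pattern, names, matcher=re.match):
    """Return the names for which `matcher(pattern, name)` finds a match."""
    return [name for name in names if matcher(pattern, name)]


# As a regex, "accuracy_*" means "accuracy" followed by zero or more underscores.
prefix = matching("accuracy_*", candidates)               # anchored at the start only
full = matching("accuracy_*", candidates, re.fullmatch)   # must cover the whole name
wild = matching("accuracy_.*", candidates, re.fullmatch)  # "_" then any characters

print(prefix)  # ['accuracy_score', 'accuracy']
print(full)    # ['accuracy']
print(wild)    # ['accuracy_score']
```

Under these assumptions, a pattern such as `accuracy_*` behaves like a glob only with prefix matching. If whole-name matching were used, `accuracy_.*` would be the pattern that selects `accuracy_score`.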

##### Utilities 

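```python
# A simple way to browse the logged data
import http.server
import socketserver
import os
import shutil


def setup_server(path, port=8000):
    """Create an HTTP server that serves the files under `path`."""
    # SimpleHTTPRequestHandler serves the current working directory.
    os.chdir(path)
    handler = http.server.SimpleHTTPRequestHandler
    httpd = socketserver.TCPServer(("", port), handler)
    return httpd


def archive(output_filename, dir_name):
    """Zip `dir_name` into `<output_filename>.zip` and return the archive path."""
    return shutil.make_archive(output_filename, 'zip', dir_name)
```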
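
The server helper above relies on `os.chdir`, which changes the working directory for the whole process. As a possible alternative, the sketch below passes the directory to the handler instead (the `directory` argument of `SimpleHTTPRequestHandler` is available from Python 3.7) and lets the operating system pick a free port. `serve_directory` is a hypothetical name, and the file it serves is created only for the example.

```python
import functools
import http.client
import http.server
import tempfile
import threading
from pathlib import Path


def serve_directory(path, port=0):
    """Serve `path` on localhost in a background thread; port 0 picks a free port."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(path)
    )
    httpd = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "hello.txt").write_text("logged data")
    httpd, thread = serve_directory(tmp)
    try:
        # Fetch the file back over HTTP to confirm the directory is being served.
        conn = http.client.HTTPConnection("127.0.0.1", httpd.server_address[1])
        conn.request("GET", "/hello.txt")
        response = conn.getresponse()
        status = response.status
        body = response.read().decode()
        conn.close()
    finally:
        httpd.shutdown()      # stop the serve_forever loop
        httpd.server_close()  # release the listening socket
        thread.join()

print(status, body)
```

Calling `server_close()` after `shutdown()` frees the port immediately, which matters when the same cell is run more than once.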


### Defining the PadrePads instance "tracker"
The mapping file defined above can be added to PadrePads as a resource in two ways: either copy it to pypads/bindings/resources/mapping/ after cloning the source code, **or** create an instance of the MappingFile class from the JSON object and pass it to the PadrePads initialization.
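
For the first option, the mapping has to exist as a JSON file on disk. The sketch below shows one way to write it out and read it back with the standard library. The file name `sklearn_example.json` and the temporary target directory are assumptions made for the example; in a cloned checkout the target would be the resource directory named above. The mapping is shortened to a single entry to keep the example small.

```python
import json
import tempfile
from pathlib import Path

# Shortened mapping for the example; use the full mapping in practice.
mapping = {
    "algorithms": [
        {
            "name": "Gaussian Naive Bayes",
            "other_names": [],
            "implementation": {"sklearn": "sklearn.naive_bayes.GaussianNB"},
            "hooks": {"pypads_fit": ["fit"], "pypads_predict": ["predict"]},
        }
    ],
    "metadata": {
        "author": "DEMO",
        "library": "sklearn",
        "library_version": "0.19.1",
        "mapping_version": "0.1",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the temporary directory stands in for the
    # mapping resource directory of a cloned checkout.
    target = Path(tmp) / "sklearn_example.json"
    target.write_text(json.dumps(mapping, indent=2))

    # Reading the file back confirms it is valid JSON with the same content.
    reloaded = json.loads(target.read_text())

print(reloaded == mapping)  # True
```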

```python
mapping_json = {
  "default_hooks": {
    "modules": {
      "fns": {}
    },
    "classes": {
      "fns": {
        "pypads_init": [
          "__init__"
        ]
      }
    },
    "fns": {}
  },
  "algorithms": [
    {
      "name": "Gaussian Naive Bayes",
      "other_names": [],
      "implementation": {
        "sklearn": "sklearn.naive_bayes.GaussianNB"
      },
      "hooks": {
        "pypads_fit": [
          "fit",
          "fit_predict",
          "fit_transform"
        ],
        "pypads_predict": [
          "fit_predict",
          "predict"
        ],
        "pypads_transform": [
          "fit_transform",
          "transform"
        ]
      }
    },
    {
      "name": "sklearn classification metrics",
      "other_names": [],
      "implementation": {
        "sklearn": "sklearn.metrics.classification"
      },
      "hooks": {
        "pypads_metric": [
          "accuracy_*"
        ]
      }
    },
    {
      "name": "sklearn datasets",
      "other_names": [],
      "implementation": {
        "sklearn": "sklearn.datasets.base"
      },
      "hooks": {
        "pypads_dataset": ["load_breast_cancer"]
      }
    }
  ],
  "metadata": {
    "author": "DEMO",
    "library": "sklearn",
    "library_version": "0.19.1",
    "mapping_version": "0.1"
  }
}

# Temporary directory to store results in (the default directory is $HOME/.mlruns/)
from tempfile import TemporaryDirectory
import threading

# MappingFile instance
from pypads.autolog.mappings import MappingFile
mapping_example = MappingFile("sklearn_example", mapping_json)

temp_dir = TemporaryDirectory()

# Initializing PadrePads
from padrepads.base import PyPadrePads
tracker = PyPadrePads(uri=temp_dir.name, mapping=mapping_example)

###### Script #######
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score


# Load dataset
data = load_breast_cancer()

# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Look at our data
print(label_names)
print('Class label = ', labels[0])
print(feature_names)
print(features[0])

# Split our data
train, test, train_labels, test_labels = train_test_split(features,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)

# Initialize our classifier
gnb = GaussianNB()

# Train our classifier
model = gnb.fit(train, train_labels)

# Make predictions
preds = gnb.predict(test)
print(preds)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

# Get the folder names (experiment ids) of the experiment and of the datasets
run = tracker.api.active_run()
experiment_folder = run.info.experiment_id

datasets = tracker.mlf.get_experiment_by_name("datasets")
datasets_folder = datasets.experiment_id

tracker.api.end_run()
###### Script #######

###### Logged results (to show the folder structure) #######

path = archive(temp_dir.name + '/logs', temp_dir.name)

server = setup_server(temp_dir.name)

threading.Thread(target=server.serve_forever).start()

from IPython.display import IFrame

experiment_frame = IFrame("http://localhost:8000/" + experiment_folder, width=800, height=650)

dataset_frame = IFrame("http://localhost:8000/" + datasets_folder, width=800, height=650)

print('To get the logged results zipped, download them at http://localhost:8000/logs.zip')
###### Logged results #######
tracker.deactivate_tracking()
```

```text
2020-05-26 13:49:01.758 | INFO     | pypads.autolog.mappings:load_mapping:198 - Added mapping file with name: keras_2_3_1.json
2020-05-26 13:49:01.760 | INFO     | pypads.autolog.mappings:load_mapping:198 - Added mapping file with name: sklearn_0_19_1.json
2020-05-26 13:49:01.768 | INFO     | pypads.autolog.mappings:load_mapping:198 - Added mapping file with name: keras_2_3_1.json
2020-05-26 13:49:01.769 | INFO     | pypads.autolog.mappings:load_mapping:198 - Added mapping file with name: sklearn_0_19_1.json
2020-05-26 13:49:01.771 | INFO     | pypads.autolog.mappings:load_mapping:198 - Added mapping file with name: torch_1_4_0.json
2020-05-26 13:49:01.842 | INFO     | pypads.functions.pre_run.pre_run:_call:52 - Tracking execution to run with id 15db4229b3d4449ba87afd27cd95fc70
2020-05-26 13:49:02.753 | INFO     | pypads.functions.pre_run.pre_run:_call:52 - Tracking execution to run with id 83adb22365b04f9685b2b0cb569bc30d
['malignant' 'benign']
Class label =  0
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]
0.9414893617021277
To get the logged results zipped, download them at http://localhost:8000/logs.zip
```

```python
experiment_frame
```

```python
dataset_frame
```



```python
# Stop serving, release the port, and delete the temporary results
server.shutdown()
server.server_close()
temp_dir.cleanup()
```