## Quickstart Example

### Sculpture

To get started, wrap the data in a
`donatello.components.data.Dataset`, wrap the model in a
`donatello.components.estimator.Estimator`, and pass both to a
`donatello.components.core.Sculpture`.

Sculptures are modeling objects that uphold scikit-learn's estimator contracts:

    1. ``__init__`` cannot mutate parameters
    2. ``get_params`` and ``set_params`` support
    3. ``fit``, ``transform``, and ``fit_transform`` support
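
The three contracts above can be sketched without any dependencies. The `MeanCenterer` class below is a hypothetical illustration, not part of donatello or scikit-learn; it only shows the shape of the interface that a Sculpture is said to uphold.

```python
class MeanCenterer(object):
    """Hypothetical transformer illustrating the scikit-learn contracts."""

    def __init__(self, withMean=True):
        # 1. __init__ stores parameters exactly as given, without mutation
        self.withMean = withMean

    def get_params(self, deep=True):
        # 2. parameters are recoverable by name
        return {'withMean': self.withMean}

    def set_params(self, **params):
        # 2. parameters are settable by name; returning self allows chaining
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # 3. learned state is stored in attributes with a trailing underscore
        self.mean_ = sum(X) / float(len(X)) if self.withMean else 0.0
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


centerer = MeanCenterer()
centered = centerer.fit_transform([1.0, 2.0, 3.0])
```

Because `fit` returns the object itself, `fit(X).transform(X)` can be chained, which is what `fit_transform` relies on here.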


Sculptures can be embedded as nodes in `donatello.components.transformers.ModelDAG` as well
as used as transformers in `sklearn.pipeline.Pipeline`.

Donatello approaches model configuration as declaring intent. Everything required
to build, evaluate, and produce artifacts is specified during instantiation.

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

from donatello.components.data import Dataset
from donatello.components.estimator import Estimator
from donatello.components.measure import Metric, FeatureWeights, ThresholdRates
from donatello.components.core import Sculpture


def load_sklearn_bc_dataset():
    """
    Helper to load sklearn dataset into a pandas dataframe

    Returns:
        tuple(pd.DataFrame, pd.Series): X and y
    """
    dataset = load_breast_cancer()
    X = pd.DataFrame(data=dataset['data'],
                     columns=dataset['feature_names'].tolist())
    # NOTE: sklearn encodes this target as 0 = malignant, 1 = benign
    y = pd.Series(dataset['target'], name='is_malignant')
    return X, y


def load_sculpture():
    """
    Helper to load sculpture
    """
    X, y = load_sklearn_bc_dataset()
    # Datasets can operate with custom folders, but to simplify usage,
    # KFold, stratified, group, or time based splits from sklearn can
    # be invoked by strings through the `clay` parameter
    dataset = Dataset(X=X, y=y, clay='stratify')

    # np.logspace replaces the deprecated pd.np alias
    estimator = Estimator(model=LogisticRegression(),
                          paramGrid={'model__C': list(np.logspace(-2, -0.01, 5))},
                          searchKwargs={'scoring': 'roc_auc', 'cv': 3},
                          method='predict_proba',
                          scorer='score_second'
                          )
    
    # metrics are functors that evaluate the outputs of models with ground truth,
    # design data, and the models themselves
    metrics = [Metric(roc_auc_score), Metric(average_precision_score),
               FeatureWeights(sort='coefficients'), ThresholdRates()]

    sculpture = Sculpture(dataset=dataset, estimator=estimator, metrics=metrics)

    return sculpture


sculpture = load_sculpture()

#### Dataset

The dataset can be specified through one of:

    1. explicit ``X`` and ``y`` (if supervised)
    2. a ``raw`` table and a reference to the ``target`` (if supervised)
    3. a collection of ``raw`` tables with a ``primaryKey`` to merge along + ``target`` (if supervised)

#### Estimator

The estimator object requires a ``model`` and a reference to the ``method`` of the model to call. Optionally,
a callback to transform the raw output can be supplied through the ``scorer``. To enable hyperparameter tuning,
a parameter grid and search arguments can be supplied. Currently donatello only supports grid searching through
the scikit-learn API, which prevents searching over input datasets that are collections of tables. Until this
functionality is built out, it can be worked around by using `donatello.components.transformers.ModelDAG`
and embedding a ``Sculpture`` as a node downstream of a node that combines the data. The ``Dataset`` will manage
the indexing to prevent leakage.
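
The relationship between ``method`` and ``scorer`` can be sketched as follows. The `StubModel` class and the `score_second` and `score` functions here are hypothetical stand-ins written for this illustration. In particular, reading `'score_second'` as "take the second column of the `predict_proba` output" is an assumption based on the name, not confirmed donatello behavior.

```python
class StubModel(object):
    """Hypothetical model that returns fixed class probabilities."""

    def predict_proba(self, X):
        # one [P(class 0), P(class 1)] pair per input row
        return [[1.0 - x, x] for x in X]


def score_second(raw):
    # assumed behavior: keep the probability of the second class
    return [row[1] for row in raw]


def score(model, X, method='predict_proba', scorer=None):
    # look the method up by name, call it, then apply the optional callback
    raw = getattr(model, method)(X)
    return scorer(raw) if scorer is not None else raw


scores = score(StubModel(), [0.25, 0.75], scorer=score_second)
```

Looking the method up by name keeps the wrapper agnostic to whether the model exposes `predict`, `predict_proba`, or `decision_function`.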

#### Intent

In [2]:
sculpture.declaration

{'dataset': Dataset_2019_05_30_07_44,
 'entire': False,
 'estimator': Estimator_2019_05_30_07_44,
 'holdout': 'search',
 'measure': Measure_2019_05_30_07_44,
 'metrics': [roc_auc_score_2019_05_30_07_44,
  average_precision_score_2019_05_30_07_44,
  feature_weights_2019_05_30_07_44,
  ThresholdRates_2019_05_30_07_44],
 'outsideData': None,
 'persist': <function donatello.utils.helpers.persist>,
 'storeReferences': True,
 'timeFormat': '%Y_%m_%d_%H_%M',
 'validation': 'search',
 'writeAttrs': ('', 'estimator')}

The ``validation``, ``holdout``, and ``entire`` flags dictate over which subsets of the data estimators are fit and metrics are calculated (if applicable), and whether or not to grid search.

#### Metrics

The metrics list is a collection of `donatello.components.measure.Metric` objects that calculate statistics around model performance. Each can either wrap a scikit-learn metric or execute custom scoring functionality. If information needs to be shared across folds for computation, it can be stored during the `fit` method.
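
As a rough sketch of the idea, a metric can be modeled as a callable object that wraps a scoring function and uses `fit` to store information shared across folds. The `WrappedMetric` class, its call signature, and the `accuracy` function are hypothetical and do not reflect donatello's actual `Metric` interface.

```python
class WrappedMetric(object):
    """Hypothetical functor wrapping a scoring function."""

    def __init__(self, scorer):
        self.scorer = scorer
        self.name = scorer.__name__

    def fit(self, truth):
        # store information needed by every fold, here the set of labels seen
        self.labels_ = sorted(set(truth))
        return self

    def __call__(self, truth, predicted):
        return self.scorer(truth, predicted)


def accuracy(truth, predicted):
    matches = sum(1 for t, p in zip(truth, predicted) if t == p)
    return matches / float(len(truth))


metric = WrappedMetric(accuracy).fit([0, 1, 1, 0])
# evaluate one fold at a time and collect the per-fold scores
folds = [([0, 1], [0, 1]), ([1, 0], [1, 1])]
foldScores = [metric(truth, predicted) for truth, predicted in folds]
```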

#### Fitting

The sculpture's fit method defaults to the instructions provided during instantiation.

As declared by the flags above, this sculpture will perform a 5-fold stratified K-fold cross validation within the training subset of the data, then fit a model over the entire training set and evaluate it on the holdout set.

Per the scikit-learn Transformer pattern, fitting returns the object itself.

Donatello leverages the `fallback` decorator extensively. A decorated method defaults to the attribute attached to the object unless another object is supplied during the method call.

Note - The new object will NOT replace the existing attribute.

`sculpture.fit()` is equivalent to `sculpture.fit(dataset=sculpture.dataset)`
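
A minimal sketch of how such a decorator could work is shown below. It illustrates the pattern under the assumption that defaults are looked up by keyword name on the instance; it is not donatello's implementation, and the `Model` class is hypothetical.

```python
import functools


def fallback(*names):
    """Fill keyword arguments left as None from same-named attributes."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, **kwargs):
            for name in names:
                if kwargs.get(name) is None:
                    kwargs[name] = getattr(self, name)
            return method(self, **kwargs)
        return wrapper
    return decorator


class Model(object):
    """Hypothetical object with a dataset attribute."""

    def __init__(self, dataset):
        self.dataset = dataset

    @fallback('dataset')
    def fit(self, dataset=None):
        # record which dataset this call used; self.dataset is left untouched
        self.fitOn_ = dataset
        return self


model = Model(dataset='attached')
model.fit()                   # falls back to model.dataset
usedDefault = model.fitOn_
model.fit(dataset='other')    # overrides for this call only
usedOverride = model.fitOn_
```

After the second call `model.dataset` is still `'attached'`, which matches the note that a supplied object does not replace the existing attribute.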



In [3]:
sculpture.fit()

Cross Validation
grid searching




grid searching
grid searching
grid searching
grid searching
Holdout
grid searching


Sculpture_2019_05_30_07_44

At the end of fitting, the object will persist the attributes (and/or itself) prescribed by the ``writeAttrs`` field.

In [4]:
ls *pkl

Estimator.pkl  Sculpture.pkl


#### Evaluation

During the fitting process, metrics are calculated over the specified samples of data and stored in a `sklearn.utils.Bunch` (a lightly wrapped dict with attribute-style accessors).

This information is attached to the Sculpture in the ``measurements`` attribute.
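
For readers unfamiliar with the pattern, the sketch below shows a dict subclass with attribute-style access. `AttrDict` is a simplified stand-in written for illustration, not the `sklearn.utils.Bunch` source, and the stored values are placeholders rather than real measurements.

```python
class AttrDict(dict):
    """Minimal dict with attribute-style read access."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


# nested containers mirror a measurements -> scope -> metric layout
measurements = AttrDict(
    crossValidation=AttrDict(roc_auc_score='placeholder'),
    holdout=AttrDict(roc_auc_score='placeholder'),
)
sameObject = measurements.crossValidation is measurements['crossValidation']
scopes = sorted(measurements.keys())
```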

In [5]:
sculpture.measurements.keys()

['holdout', 'crossValidation']

In [6]:
sculpture.measurements.crossValidation.keys()

['ThresholdRates',
 'roc_auc_score',
 'feature_weights',
 'average_precision_score']

In [7]:
from donatello.utils import helpers
helpers.view_sk_metric(sculpture.measurements.crossValidation.average_precision_score)

|      | score    |
|------|----------|
| mean | 0.989885 |
| std  | 0.016318 |


The feature weights metric is a shortcut for pulling coefficients from GLMs and `feature_importances_` from ensemble methods.
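
The shortcut can be pictured as an attribute lookup on the fitted model. The function below is a sketch that relies on scikit-learn's convention that linear models expose `coef_` and tree ensembles expose `feature_importances_`. The name `feature_weights` and the stub classes are hypothetical, not donatello's implementation.

```python
def feature_weights(model, names):
    """Pair feature names with coefficients or importances, sorted ascending."""
    if hasattr(model, 'coef_'):
        # a binary LogisticRegression stores coef_ with shape (1, n_features)
        weights = list(model.coef_[0])
    elif hasattr(model, 'feature_importances_'):
        weights = list(model.feature_importances_)
    else:
        raise AttributeError('model exposes neither coef_ nor feature_importances_')
    return sorted(zip(names, weights), key=lambda pair: pair[1])


class StubLinear(object):
    coef_ = [[0.5, -1.2]]


class StubForest(object):
    feature_importances_ = [0.7, 0.3]


linear = feature_weights(StubLinear(), ['a', 'b'])
forest = feature_weights(StubForest(), ['a', 'b'])
```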



In [8]:
sculpture.measurements.crossValidation.feature_weights.mean

| names                | coefficients |
|----------------------|--------------|
| worst concavity      | -1.206367    |
| worst compactness    | -0.806753    |
| mean concavity       | -0.45701     |
| worst concave points | -0.444182    |
| worst texture        | -0.402055    |
| worst symmetry       | -0.315246    |
| mean compactness     | -0.277337    |
| mean concave points  | -0.225841    |
| worst smoothness     | -0.216312    |
| worst perimeter      | -0.212148    |


The threshold rates metric helps parameterize the binary confusion matrix by sampling scores from the held-out data and evaluating the rates at each sampled point.
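
The idea can be sketched in plain Python: pick a set of thresholds, binarize the scores at each one, and compute rates from the resulting confusion matrix. The function name and the rule that a score at or above the threshold counts as positive are assumptions for this illustration, not a description of the `ThresholdRates` internals.

```python
def threshold_rates(truth, scores, thresholds):
    """Return one dict of confusion-matrix counts and rates per threshold."""
    rows = []
    for threshold in thresholds:
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
        tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
        rows.append({
            'points': threshold,
            'true_positive': tp, 'false_positive': fp,
            'false_negative': fn, 'true_negative': tn,
            # guard against empty denominators
            'precision': tp / float(tp + fp) if tp + fp else 0.0,
            'recall': tp / float(tp + fn) if tp + fn else 0.0,
        })
    return rows


truth = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
rates = threshold_rates(truth, scores, thresholds=[0.0, 0.5])
```

Raising the threshold from 0.0 to 0.5 in this toy input lowers recall from 1.0 to 0.5, the same trade-off visible along the `points` index in the output further down.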

In [9]:
sculpture.measurements.crossValidation.ThresholdRates.mean.columns.tolist()

['true_negative',
 'false_positive',
 'false_negative',
 'true_positive',
 'precision',
 'recall',
 'specificity',
 'false_omission_rate',
 'negative_predictive_value',
 'f1',
 'fall_out',
 'false_discovery_rate']

In [10]:
sculpture.measurements.crossValidation.ThresholdRates.mean[['precision', 'recall']].loc[::5]

| points                 | precision | recall   |
|------------------------|-----------|----------|
| 8.809836000000001e-33  | 0.629158  | 1.0      |
| 1.744928e-11           | 0.661834  | 1.0      |
| 3.488636e-08           | 0.699541  | 1.0      |
| 6.097043e-06           | 0.740136  | 1.0      |
| 0.0003799622           | 0.786753  | 1.0      |
| 0.008716837            | 0.840557  | 1.0      |
| 0.07257958             | 0.890697  | 0.992982 |
| 0.3598898              | 0.93973   | 0.97193  |
| 0.8182808              | 0.985649  | 0.940351 |
| 0.9192521              | 0.988889  | 0.866667 |


The objects in donatello strive to be strongly encapsulated with simple interfaces for debugging. For example, the dataset itself is an iterable that will yield the training/testing subsets of data directly.

In [11]:
for fold, (designTrain, designTest, targetTrain, targetTest) in enumerate(sculpture.dataset):
    print(fold)
    display(designTrain.head(3))
    display(targetTest.to_frame().head(3))
    print('*'*10)

0


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |

|    | is_malignant |
|----|--------------|
| 2  | 0            |
| 3  | 0            |
| 23 | 0            |


**********
1


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |

|    | is_malignant |
|----|--------------|
| 1  | 0            |
| 6  | 0            |
| 10 | 0            |


**********
2


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |

|    | is_malignant |
|----|--------------|
| 4  | 0            |
| 9  | 0            |
| 11 | 0            |


**********
3


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |

|    | is_malignant |
|----|--------------|
| 0  | 0            |
| 5  | 0            |
| 17 | 0            |


**********
4


|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |

|    | is_malignant |
|----|--------------|
| 7  | 0            |
| 8  | 0            |
| 14 | 0            |


**********


Note that the first record (index 0) is in the training set in folds 0, 1, 2, and 4 and in the evaluation set in fold 3.
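
The iteration protocol used in the loop above can be sketched with a small class whose `__iter__` yields one train/test tuple per fold. `FoldedData` and its interleaved fold assignment are hypothetical simplifications; donatello's `Dataset` can instead invoke scikit-learn's splitters through the `clay` parameter.

```python
class FoldedData(object):
    """Hypothetical container that yields train/test subsets per fold."""

    def __init__(self, X, y, nFolds=2):
        self.X, self.y, self.nFolds = X, y, nFolds

    def __iter__(self):
        for fold in range(self.nFolds):
            # every nFolds-th record, offset by the fold number, is held out
            test = set(range(fold, len(self.X), self.nFolds))
            train = [i for i in range(len(self.X)) if i not in test]
            yield ([self.X[i] for i in train], [self.X[i] for i in sorted(test)],
                   [self.y[i] for i in train], [self.y[i] for i in sorted(test)])


data = FoldedData(X=['a', 'b', 'c', 'd'], y=[0, 1, 0, 1])
folds = list(data)
```

As in the output above, each record lands in the test subset of exactly one fold and in the training subset of all the others.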