# Universal Custom Estimators

## Patterns for Adding Custom Functionality

What can go wrong, and why is it important to get this right?

==============================

## Lessons Learned and Best Practices

- Follow the scikit-learn API: https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator
- Favor array-like data structures internally in estimators. These work better across NumPy, Dask, and RAPIDS than dataframe collections. If you need dataframe operations (like groupby), consider moving those to a transformer and converting to an array-like for processing.
- All attributes learned during .fit should be concrete, i.e. they should not be dask collections.
- To the extent possible, transformers should support:
    - numpy.ndarray
    - pandas.DataFrame
    - dask.Array
    - dask.DataFrame
- If possible, transformers should accept a `columns` keyword to limit the transformation to just those columns, while passing other columns through untouched. `inverse_transform` should behave similarly (ignoring other columns) so that `inverse_transform(transform(X))` equals `X`.
- Methods returning arrays (like `.transform` and `.predict`) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.


The last four guidelines are from: https://ml.dask.org/contributing.html#conventions
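As a rough illustration of the `columns` and round-trip guidelines above, here is a minimal sketch of a pandas-only transformer. The class name `ColumnCenterer` and its behavior (subtracting the column mean) are hypothetical and chosen only for illustration. Supporting Dask inputs would need extra handling that is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtract the mean from the selected columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        means = X[cols].mean()
        # Learned attributes are plain Python floats, i.e. concrete values.
        self.means_ = {c: float(means[c]) for c in cols}
        return self

    def transform(self, X):
        X = X.copy()  # do not mutate the caller's frame
        for c, m in self.means_.items():
            X[c] = X[c] - m
        return X  # same type as the input; other columns pass through

    def inverse_transform(self, X):
        X = X.copy()
        for c, m in self.means_.items():
            X[c] = X[c] + m
        return X


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
centerer = ColumnCenterer(columns=["a"]).fit(df)
centered = centerer.transform(df)
restored = centerer.inverse_transform(centered)
```

Only column `a` is changed by `transform`, and `restored` matches the original `df`, which is the round-trip property the guideline asks for.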

## Extend `dask-ml` Using a Real World Example

- Setup/data
- Simple custom transformer (maybe fillna or similar?)
- transformer for adding columns (this will allow us to differentiate between dask and pandas)
- custom estimator
- make the previous steps into a pipeline (look at the beginning of the GTC talk and find the part where Mike talks about not running fit a second time; we might want to do something with that to show the value of a pipeline here)
- Grid search - show running their pipeline with grid search to find the best model.

In [None]:
from sklearn.datasets import make_classification
import pandas as pd

X_array, y_array = make_classification(
    n_samples=10_000,
    n_features=50,
    weights=[0.75, 0.25],
    flip_y=0.75,
    random_state=123,
)

X = pd.DataFrame(X_array, columns = [f"var{i}" for i in range(0,50)])
y = pd.Series(y_array)

In [None]:
from dask_ml.datasets import make_classification_df

X_dask, y_dask = make_classification_df(
    n_samples=10_000,
    n_features=50,
#     weights=[0.75, 0.25],
#     flip_y=0.75,
    random_state=123,
    chunks=1000,
)

### Transformer: Add Feature

In [None]:
new_feature_array, _ = make_classification(
    n_samples=10_000,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_classes=1,
    random_state=123,
)

In [None]:
import numpy as np
new_feature = pd.Series(np.random.randn(10_000))

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class AddFeature(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Nothing to learn; y is optional, following the scikit-learn convention.
        return self

    def transform(self, X):
        """Add Feature to X"""
        # Skeleton only: X is returned unchanged.
        return X

In [None]:
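from sklearn.base import BaseEstimator, TransformerMixin
import dask.dataframe as dd

class AddFeature(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Nothing to learn; y is optional, following the scikit-learn convention.
        return self

    def transform(self, X):
        """Return X with `new_feature` added as column `var50`."""
        if isinstance(X, pd.DataFrame):
            # assign returns a new frame, so the caller's X is not mutated.
            return X.assign(var50=new_feature)
        elif isinstance(X, dd.DataFrame):
            # Partition the pandas Series to match X before assigning it.
            return X.assign(var50=dd.from_pandas(new_feature, npartitions=X.npartitions))

        return X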

In [None]:
AF = AddFeature()
X = AF.transform(X)
X

In [None]:
AF = AddFeature()
X_dask = AF.transform(X_dask)
X_dask

### Custom Estimator

In [None]:
from sklearn.base import BaseEstimator
import logging

logger = logging.getLogger('exp a')

class CustomEstimator(BaseEstimator):
    def __init__(self, estimator, logger):
        self.estimator = estimator
        self.logger = logger

    def fit(self, X, y=None, **fit_kws):
        # Forward extra keyword arguments (e.g. eval_set) to the wrapped estimator.
        self.estimator.fit(X, y, **fit_kws)
        self.logger.info("... log things ...")
        return self

In [None]:
import xgboost as xgb

clf = CustomEstimator(xgb.XGBClassifier(n_jobs=-1, eval_metric='error', use_label_encoder=False), logger)
clf.fit(X, y)

In [None]:
# evals_result() requires an eval_set passed to fit, so check training accuracy instead
# to show we trained the model.
clf.estimator.score(X, y)
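The wrapper pattern above can also be tried without XGBoost. The following is a minimal sketch under assumptions: `LoggingEstimator`, its `logger_name` parameter, the `ListHandler` helper, and the delegated `predict` method are all hypothetical additions for illustration, and scikit-learn's `LogisticRegression` stands in for the wrapped model.

```python
import logging

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression


class LoggingEstimator(BaseEstimator):
    """Hypothetical variant of CustomEstimator that also delegates predict."""

    def __init__(self, estimator, logger_name="exp a"):
        self.estimator = estimator
        self.logger_name = logger_name

    def fit(self, X, y=None, **fit_kws):
        self.estimator.fit(X, y, **fit_kws)
        logging.getLogger(self.logger_name).info(
            "fitted %s on %d rows", type(self.estimator).__name__, len(X)
        )
        return self

    def predict(self, X):
        # Delegate prediction to the wrapped estimator.
        return self.estimator.predict(X)


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
demo_logger = logging.getLogger("exp a")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LoggingEstimator(LogisticRegression()).fit(X_demo, y_demo)
predictions = model.predict(X_demo)
```

Passing a logger name rather than a logger object is a style choice in this sketch, not a requirement. It keeps the constructor parameters to simple values.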

### Custom Estimator with Dask

In [None]:
from dask.distributed import Client, progress

client = Client()
client

In [None]:
# fit your custom estimator with Dask

In [None]:
import xgboost.dask as dxgb

clf = CustomEstimator(dxgb.DaskXGBClassifier(n_jobs=-1, eval_metric='error', use_label_encoder=False), logger)
clf.fit(X_dask, y_dask)

### Pipeline

In [None]:
TODO
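Until this section is filled in, here is one possible shape for the pipeline and grid search steps from the outline, as a small self-contained sketch. It is not the notebook's final pipeline: `AddRowMean` is a hypothetical stand-in for `AddFeature`, `LogisticRegression` stands in for the XGBoost model, and the data set is deliberately tiny.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class AddRowMean(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for AddFeature: appends the row-wise mean."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.assign(row_mean=X.mean(axis=1))


X_arr, y_arr = make_classification(n_samples=200, n_features=5, random_state=123)
X_small = pd.DataFrame(X_arr, columns=[f"var{i}" for i in range(5)])
y_small = pd.Series(y_arr)

pipe = Pipeline([
    ("add_feature", AddRowMean()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_small, y_small)
best_params = search.best_params_
```

Because the transformer lives inside the pipeline, each cross-validation split refits it on the training fold only, which is part of the value of wrapping the steps together.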