# Universal Custom Estimators

## Patterns for Adding Custom Functionality

What can go wrong, and why is it important to get this right?

==============================

## Lessons Learned and Best Practices

- Follow the scikit-learn API: https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator
- Favor array-like data structures internally in estimators. These work better across NumPy, Dask, and RAPIDS than DataFrame collections. If you need DataFrame operations (like groupby), consider moving those to a transformer and converting to an array-like for processing.
- All attributes learned during .fit should be concrete, i.e. they should not be dask collections.
- To the extent possible, transformers should support:
    - numpy.ndarray
    - pandas.DataFrame
    - dask.array.Array
    - dask.dataframe.DataFrame
- If possible, transformers should accept a columns keyword to limit the transformation to just those columns, while passing through other columns untouched. inverse_transform should behave similarly (ignoring other columns) so that inverse_transform(transform(X)) equals X.
- Methods returning arrays (like .transform and .predict) should return the same type as the input. So if a dask.array is passed in, a dask.array with the same chunks should be returned.


The last four items are taken from the dask-ml contributing conventions: https://ml.dask.org/contributing.html#conventions
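The conventions about a columns keyword, pass-through of untouched columns, and a round-tripping inverse_transform can be sketched as follows. This is a minimal illustration for pandas input only, not code from dask-ml; the class name `ColumnShift` and the attributes `columns_` and `means_` are hypothetical names chosen for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnShift(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: center the selected columns on their means."""

    def __init__(self, columns=None):
        # Store the argument unchanged, per the scikit-learn API.
        self.columns = columns

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        self.columns_ = cols
        # Learned attributes should be concrete. For pandas, mean() already is;
        # for a dask DataFrame the lazy result would need .compute() here.
        self.means_ = X[cols].mean()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's data untouched
        X[self.columns_] = X[self.columns_] - self.means_
        return X

    def inverse_transform(self, X):
        X = X.copy()
        X[self.columns_] = X[self.columns_] + self.means_
        return X


df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": ["x", "y", "z"]}
)
shift = ColumnShift(columns=["a", "b"]).fit(df)
out = shift.transform(df)                # "c" passes through unchanged
restored = shift.inverse_transform(out)  # round-trips back to df
```

The output is a pandas.DataFrame because the input was one, and column "c" is never touched, so inverse_transform(transform(X)) recovers X.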

## Extend `dask-ml` Using a Real World Example

- Setup/data
- Simple custom transformer (maybe fillna or similar?)
- transformer for adding columns (this will allow us to differentiate between dask and pandas)
- custom estimator
- make the previous steps into a pipeline (look at the beginning of the GTC talk and find the part where Mike talks about not running fit a second time; we might want to do something with that to show the value of a pipeline here)
- Grid search - show running their pipeline with grid search to find the best model.

In [1]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100_000,
    n_features=100,
    weights=[0.75, 0.25],
    flip_y=0.75,
    random_state=123,
)

### Transformer 1

In [6]:
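from sklearn.base import BaseEstimator, TransformerMixin


class Mutate(TransformerMixin, BaseEstimator):
    """Stateless transformer that adds 1 to every element of X."""

    def fit(self, X, y=None):
        # Nothing to learn; y defaults to None so fit(X) also works.
        return self

    def transform(self, X):
        """Return X + 1, keeping the input's container type."""
        return X + 1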

### Should we do a transformer 2? Maybe have #1 transform to a DataFrame, then have #2 add a column (to differentiate between dask and pandas)
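One possible shape for such a column-adding transformer is sketched below. The class name `AddRowSum` and its `name` parameter are hypothetical. The sketch is exercised with pandas only; since dask DataFrames also provide `assign` and `sum(axis=1)`, the same code should work there lazily, but that is an assumption worth verifying in the notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AddRowSum(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that appends a row-wise sum column."""

    def __init__(self, name="row_sum"):
        self.name = name

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # assign() returns a new DataFrame, so the input is not modified.
        return X.assign(**{self.name: X.sum(axis=1)})


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = AddRowSum().fit_transform(df)
```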

### Custom Estimator

In [11]:
from sklearn.base import BaseEstimator
import logging

logger = logging.getLogger('exp a')

class CustomEstimator(BaseEstimator):
    def __init__(self, estimator, logger):
        # Store constructor arguments unchanged, as the scikit-learn API expects.
        self.estimator = estimator
        self.logger = logger

    def fit(self, X, y=None, **fit_kws):
        # Forward any extra keyword arguments to the wrapped estimator.
        self.estimator.fit(X, y, **fit_kws)
        self.logger.info("... log things ...")
        return self
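A variant of the wrapper above that follows the scikit-learn conventions more closely might look like the sketch below. It fits a clone stored in a trailing-underscore attribute so the constructor argument is never mutated, and it delegates `predict`. The class name `LoggingWrapper`, the `logger_name` parameter, and the use of `LogisticRegression` as a lightweight stand-in for `XGBClassifier` are all assumptions made for illustration.

```python
import logging

import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression


class LoggingWrapper(BaseEstimator):
    """Hypothetical wrapper: fit a clone of `estimator` and log afterwards."""

    def __init__(self, estimator, logger_name="exp a"):
        self.estimator = estimator
        self.logger_name = logger_name

    def fit(self, X, y=None, **fit_kws):
        # Fit a fresh clone so the object passed to __init__ stays unfitted.
        self.estimator_ = clone(self.estimator).fit(X, y, **fit_kws)
        logging.getLogger(self.logger_name).info("fit finished on %d rows", len(X))
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

wrapper = LoggingWrapper(LogisticRegression()).fit(X_demo, y_demo)
preds = wrapper.predict(X_demo)
```

Because the constructor only stores its arguments, `get_params` and `clone` work on the wrapper, which is what pipelines and grid search rely on.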

In [12]:
import xgboost as xgb

clf = CustomEstimator(xgb.XGBClassifier(n_jobs=-1, eval_metric='error', use_label_encoder=False), logger)

clf.fit(X, y)

CustomEstimator(estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                        colsample_bylevel=1, colsample_bynode=1,
                                        colsample_bytree=1, eval_metric='error',
                                        gamma=0, gpu_id=-1,
                                        importance_type='gain',
                                        interaction_constraints='',
                                        learning_rate=0.300000012,
                                        max_delta_step=0, max_depth=6,
                                        min_child_weight=1, missing=nan,
                                        monotone_constraints='()',
                                        n_estimators=100, n_jobs=-1,
                                        num_parallel_tree=1, random_state=0,
                                        reg_alpha=0, reg_lambda=1,
                                        scale_pos_weight=1, subsample=1,
        

In [None]:
clf.estimator.predict(X[:5])  # or something similar to show we trained the model

In [15]:
from dask_ml.datasets import make_classification as make_classification_dask

In [17]:
X_dask, y_dask = make_classification_dask(
    n_samples=100_000,
    n_features=100,
    weights=[0.75, 0.25],
    flip_y=0.75,
    random_state=123,
    chunks=1000,
)

In [19]:
from dask.distributed import Client, progress
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:41863, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 8, Memory: 24.92 GiB


hide solutions and add a blank cell

In [21]:
import xgboost.dask as dxgb

clf = CustomEstimator(dxgb.DaskXGBClassifier(n_jobs=-1, eval_metric='error', use_label_encoder=False), logger)

clf.fit(X_dask, y_dask)

CustomEstimator(estimator=DaskXGBClassifier(base_score=0.5, booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=1,
                                            eval_metric='error', gamma=0,
                                            gpu_id=-1,
                                            interaction_constraints='',
                                            learning_rate=0.300000012,
                                            max_delta_step=0, max_depth=6,
                                            min_child_weight=1,
                                            monotone_constraints='()',
                                            n_jobs=-1, num_parallel_tree=1,
                                            objective='binary:logistic',
                                            random_state=0, reg_alpha=0,
                         

### Pipeline
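A minimal sketch of chaining the earlier steps into a scikit-learn `Pipeline` follows. `Mutate` is redefined so the block is self-contained. The smaller dataset and `LogisticRegression` (standing in for `XGBClassifier`) are assumptions chosen to keep the example quick to run.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Mutate(TransformerMixin, BaseEstimator):
    """Stateless transformer that adds 1 to every element of X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + 1


# Small synthetic dataset (an assumption; the notebook uses 100_000 rows).
X_small, y_small = make_classification(
    n_samples=500, n_features=10, random_state=123
)

pipe = Pipeline(
    [
        ("mutate", Mutate()),                       # transformer step
        ("clf", LogisticRegression(max_iter=1000)), # final estimator
    ]
)
pipe.fit(X_small, y_small)
preds = pipe.predict(X_small)
```

Calling `pipe.fit` runs each transformer's `fit` and `transform` in order before fitting the final estimator, and `pipe.predict` applies the same transforms before predicting.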

### Cross Validator

In [None]:
# maybe write an updated estimator with the implemented cv

In [None]:
# have them write their own CV
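One way the hand-written cross-validation exercise could look is sketched below. The helper name `manual_cross_val`, the accuracy metric, and the `LogisticRegression` stand-in are assumptions, and the sketch expects NumPy arrays as input.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def manual_cross_val(estimator, X, y, n_splits=5, random_state=0):
    """Return one accuracy score per fold, fitting a fresh clone each time."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_idx, test_idx in kfold.split(X):
        # clone() gives an unfitted copy, so folds cannot leak into each other.
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return np.array(scores)


X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=123)
scores = manual_cross_val(LogisticRegression(max_iter=1000), X_cv, y_cv)
```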

### Grid Search
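A sketch of searching over a pipeline's hyperparameters with scikit-learn's `GridSearchCV` follows. The parameter grid, the small dataset, and `LogisticRegression` as a stand-in for `XGBClassifier` are assumptions for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class Mutate(TransformerMixin, BaseEstimator):
    """Stateless transformer that adds 1 to every element of X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + 1


X_gs, y_gs = make_classification(n_samples=500, n_features=10, random_state=123)

pipe = Pipeline([("mutate", Mutate()), ("clf", LogisticRegression(max_iter=1000))])

# "clf__C" targets the C parameter of the step named "clf".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_gs, y_gs)
best_C = search.best_params_["clf__C"]
```

dask-ml provides `dask_ml.model_selection.GridSearchCV` with a similar interface. Its documentation describes avoiding repeated fitting of pipeline steps shared between candidates, which may be the point to highlight here.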