In [None]:
!pip install comet_ml --quiet



In [None]:
import pandas as pd 
import numpy as np
import comet_ml
from comet_ml import Experiment, Artifact, Optimizer, API
import warnings
from sklearn.exceptions import DataConversionWarning, ConvergenceWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
from sklearn.metrics import fbeta_score, average_precision_score, auc
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_hist_gradient_boosting  
from sklearn.ensemble import HistGradientBoostingClassifier
comet_ml.init()

Please enter your Comet API key from https://www.comet.ml/api/my/settings/
(api key may not show as you type)
Comet API key: ··········


COMET INFO: Comet API key is valid
COMET INFO: Comet API key saved in /root/.comet.config


In [None]:
artifact_list = ['X_train_smt', 'y_train_smt', 'X_test', 'y_test']

def fetch_artifact(artifact_name, ws, exp_name):
    """Download a logged Comet artifact into the current directory."""
    experiment = Experiment(workspace=ws, project_name=exp_name)
    artifact = experiment.get_artifact(artifact_name)
    artifact.download(path='./')
    experiment.end()

# Each call opens (and ends) its own experiment, one per artifact
for artifact in artifact_list:
    fetch_artifact(artifact_name=artifact, ws='team-comet-ml', exp_name="fraud-detection-demo")

COMET INFO: Couldn't find a Git repository in '/content' and lookings in parents. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/team-comet-ml/fraud-detection-demo/2dfc537a71f541ac9bd7baefa18ca9af

COMET INFO: Artifact 'team-comet-ml/X_train_smt:1.0.0' download has been started asynchronously
COMET INFO: Still downloading 1 file(s), remaining 138.10 MB/138.10 MB
COMET INFO: Artifact 'team-comet-ml/X_train_smt:1.0.0' has been successfully downloaded
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/team-comet-ml/fraud-detection-demo/2dfc537a71f541ac9bd7baefa18ca9af
COMET INFO:   Downloads:
COMET INFO:     artifact assets : 1 (138.10 MB)
COMET INFO:     artifacts       : 1
COM

In [None]:
X_train_smt = pd.read_parquet('X_train_smt.parquet.gzip')
y_train_smt = pd.read_parquet('y_train_smt.parquet.gzip')
X_test = pd.read_parquet('X_test.parquet.gzip')
y_test = pd.read_parquet('y_test.parquet.gzip')

# Hyperparameter Optimization Strategy

We'll use the Bayes algorithm for optimizing our hyperparameters.

Bayesian optimization is a hyperparameter search technique based on Bayes' theorem. It is typically more sample-efficient than grid or random search because each new trial is informed by the results of earlier ones. The algorithm builds a probabilistic model of the objective function, known as the surrogate function, and uses it to select the most promising hyperparameters to evaluate against the true objective function. If you love math and the nitty-gritty details, then [this article may be for you](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).


The Bayes algorithm may be the best choice for most uses of the Comet Optimizer. It provides a well-tested algorithm that balances exploring unknown regions of the search space with exploiting the best values found so far. The Comet Bayes algorithm implements the adaptive Parzen-Rosenblatt estimator.
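
To make the explore/exploit idea concrete, here is a toy sketch of a Parzen-window search on a one-dimensional problem. It is not Comet's implementation: `toy_objective`, `parzen_density`, `parzen_search`, the bandwidth, and the `gamma` split are all assumptions chosen for illustration. The loop keeps a history of trials, splits them into "good" and "bad" groups, and picks the next candidate where the good trials are dense relative to the bad ones.

```python
import math
import random

def toy_objective(x):
    """Hypothetical score to maximize; its peak is at x = 0.3."""
    return -(x - 0.3) ** 2

def parzen_density(points, x, bandwidth=0.1):
    """Gaussian Parzen-window (kernel) density of `points`, evaluated at x."""
    if not points:
        return 0.0
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

def parzen_search(objective, n_init=5, n_iter=20, gamma=0.25, seed=0):
    rng = random.Random(seed)
    # Explore: start from a few uniformly random trials in [0, 1).
    trials = [(x, objective(x)) for x in (rng.random() for _ in range(n_init))]
    for _ in range(n_iter):
        ranked = sorted(trials, key=lambda t: t[1], reverse=True)
        n_good = max(1, math.ceil(gamma * len(ranked)))
        good = [x for x, _ in ranked[:n_good]]
        bad = [x for x, _ in ranked[n_good:]]
        # Exploit: among fresh random candidates, take the one that looks most
        # like the good trials and least like the bad ones (density ratio).
        candidates = [rng.random() for _ in range(64)]
        x_next = max(
            candidates,
            key=lambda c: parzen_density(good, c) / (parzen_density(bad, c) + 1e-12),
        )
        trials.append((x_next, objective(x_next)))
    # Return the best (x, score) pair seen so far.
    return max(trials, key=lambda t: t[1])

best_x, best_score = parzen_search(toy_objective)
print(f"best x = {best_x:.3f}, score = {best_score:.5f}")
```

The random candidates supply the exploration, while the density ratio supplies the exploitation. A real optimizer would handle several parameters at once and adapt the bandwidth, which this sketch does not attempt.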

We start off by creating the configuration for the Comet Optimizer and selecting a range of values for each hyperparameter we want to tune.

For this tutorial, we're only going to tune the following hyperparameters:

- `learning_rate`
- `max_leaf_nodes`
- `max_depth`
- `min_samples_leaf`
- `l2_regularization`


You can learn more about the different hyperparameters for `HistGradientBoostingClassifier` by visiting the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier). I encourage you to try tuning other hyperparameters, adjusting the values, or even using an [iterative approach with randomized search](https://www.comet.ml/team-comet-ml/parameter-optimizations/reports/advanced-ml-parameter-optimization).

We won't discuss the choice of initial hyperparameters in this notebook, but we will have an in-depth series on that topic in the very near future. Keep an eye out for that!

In addition to the Bayes algorithm, the Comet Optimizer also supports the Random Search and Grid Search algorithms. Learn more about them [here](https://www.comet.ml/docs/python-sdk/introduction-optimizer/).

I encourage you to experiment with the different optimizers and different hyperparameter values and see if you can beat the results we get here.

If you have questions or want to showcase your results, feel free to swing by our community [Slack channel](https://bit.ly/comet-community), where you can also suggest new features you'd like to see in the product.

# Step 1 - Optimizer Configuration

We'll create an Optimizer configuration dictionary, which can be specified either in code or in a config file.

The configuration has [five sections](https://www.comet.ml/docs/python-sdk/introduction-optimizer/#optimizer-configuration), but we will only work with three for this example:

- `algorithm` (string) - Which search algorithm to use. The Comet Optimizer supports:
  - `grid` - Sweep algorithm based on picking parameter values from discrete, possibly sampled, regions
  - `random` - Random sampling algorithm
  - `bayes` - Bayesian algorithm based on distributions, balancing exploitation and exploration

- `spec` (dictionary) - The specifications for the search algorithm. This is where you set the maximum number of hyperparameter combinations to run (`maxCombo`), the metric you want to optimize (`metric`), and whether this metric should be maximized or minimized (via the `objective` parameter, set to `maximize` or `minimize`).

- `parameters` (dictionary) - This is where you put the hyperparameters of the model being tuned. Since we're using `HistGradientBoostingClassifier`, we specify hyperparameters that are specific to it.

Feel free to experiment with any of the other hyperparameters for this algorithm. You can learn more by visiting the [`scikit-learn docs`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html#sklearn.ensemble.HistGradientBoostingClassifier).


Just a heads up: I've left the `maxCombo` value at 5 for illustrative purposes, but you may want to raise it when you run the search yourself. Keep in mind that running a larger search on Colab may take a long time.

In [None]:
optimizer_config = {
    "algorithm": "bayes",
    "spec": {
        "maxCombo":5,
        "metric": "cv_mean_test_average_precision",
        "objective": "maximize",
    },
    "parameters": {
        "learning_rate": {"type": "discrete", "values": [0.0769, 0.5384, 0.8461]},
        "max_leaf_nodes": {"type": "discrete", "values": [3, 5, 7, 11]}, 
        "max_depth": {"type": "discrete", "values": [3, 4, 5]},
        "min_samples_leaf": {"type": "discrete", "values":  [1, 3, 5, 7]},    
        "l2_regularization": {"type": "discrete", "values": [0.0769, 0.5384, 0.8461]},
        "random_state":{"type": "discrete", "values": [42]},  
    }
}
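
As a quick sanity check on the search space, the sketch below counts how many distinct combinations the `parameters` section describes. The helper name `config_space_size` is hypothetical; it simply multiplies the number of candidate values per hyperparameter. The result matches the `configSpaceSize` value of 432 that appears in the Optimizer log output further down, which puts a `maxCombo` of 5 in perspective.

```python
import math

# Same discrete search space as in the optimizer configuration above
parameters = {
    "learning_rate": {"type": "discrete", "values": [0.0769, 0.5384, 0.8461]},
    "max_leaf_nodes": {"type": "discrete", "values": [3, 5, 7, 11]},
    "max_depth": {"type": "discrete", "values": [3, 4, 5]},
    "min_samples_leaf": {"type": "discrete", "values": [1, 3, 5, 7]},
    "l2_regularization": {"type": "discrete", "values": [0.0769, 0.5384, 0.8461]},
    "random_state": {"type": "discrete", "values": [42]},
}

def config_space_size(params):
    """Number of distinct combinations in a purely discrete search space."""
    return math.prod(len(spec["values"]) for spec in params.values())

space_size = config_space_size(parameters)
print(space_size)  # 3 * 4 * 3 * 4 * 3 * 1 = 432
```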

# Step 2 - Write a Function for Cross Validation

The function below performs cross validation and tracks the two metrics we care most about: `average_precision` and `recall`.

For each metric, we take the mean across all folds and log it to the experiment.

This is done with the following line of code: `experiment.log_metrics({f"cv_mean_{k}": np.mean(scores)})`.

Because `cross_validate` reports test-fold scores under keys such as `test_average_precision`, the logged metric is named `cv_mean_test_average_precision`, which is why we passed that name as the `metric` in the Optimizer configuration above.

In [None]:
def run_search(experiment, model, X, y, cv):
  # Run cross validation on the training set, then log the
  # mean of every entry in the results (timings and scores)
  # under a "cv_mean_" prefix
  results = cross_validate(
      model, X, y, cv=cv, 
      scoring=[
          "average_precision", 
          "recall"], return_train_score=True)
  for k in results.keys():
    scores = results[k]
    experiment.log_metrics({f"cv_mean_{k}": np.mean(scores)})
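
To see where the metric names come from, here is a small sketch that applies the same `cv_mean_` prefixing to a dictionary shaped like the one `cross_validate` returns for these two scorers with `return_train_score=True`. The numbers in `fake_results` are made up for illustration and are not results from this notebook.

```python
from statistics import mean

# Made-up per-fold values, keyed the way cross_validate keys its results
fake_results = {
    "fit_time": [18.1, 18.9, 18.6],
    "score_time": [6.1, 6.3, 6.2],
    "test_average_precision": [0.9994, 0.9996, 0.9995],
    "train_average_precision": [0.9996, 0.9997, 0.9998],
    "test_recall": [0.9973, 0.9975, 0.9974],
    "train_recall": [0.9980, 0.9981, 0.9982],
}

# Same naming scheme as run_search: one averaged metric per key
logged = {f"cv_mean_{k}": mean(v) for k, v in fake_results.items()}

for name, value in logged.items():
    print(f"{name}: {value:.4f}")
```

The Optimizer only looks at the metric named in its `spec`, so the name there has to match one of these logged keys exactly.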

# Step 3 - Initialize Cross Validation and Optimizer

In [None]:
cv = StratifiedKFold(n_splits=3)
optimizer = comet_ml.Optimizer(optimizer_config)

COMET INFO: COMET_OPTIMIZER_ID=b0d997b9a85d43d48a79e159fd91aa90
COMET INFO: Using optimizer config: {'algorithm': 'bayes', 'configSpaceSize': 432, 'endTime': None, 'id': 'b0d997b9a85d43d48a79e159fd91aa90', 'lastUpdateTime': None, 'maxCombo': 5, 'name': 'b0d997b9a85d43d48a79e159fd91aa90', 'parameters': {'l2_regularization': {'type': 'discrete', 'values': [0.0769, 0.5384, 0.8461]}, 'learning_rate': {'type': 'discrete', 'values': [0.0769, 0.5384, 0.8461]}, 'max_depth': {'type': 'discrete', 'values': [3, 4, 5]}, 'max_leaf_nodes': {'type': 'discrete', 'values': [3, 5, 7, 11]}, 'min_samples_leaf': {'type': 'discrete', 'values': [1, 3, 5, 7]}, 'random_state': {'type': 'discrete', 'values': [42]}}, 'predictor': None, 'spec': {'gridSize': 10, 'maxCombo': 5, 'metric': 'cv_mean_test_average_precision', 'minSampleSize': 100, 'objective': 'maximize', 'retryAssignLimit': 0, 'retryLimit': 1000}, 'startTime': 24265775592, 'state': {'mode': None, 'seed': None, 'sequence': [], 'sequence_i': 0, 'sequence

# Step 4 - Run the Search!


`Optimizer.get_experiments()` is a generator that yields one Comet `Experiment` per hyperparameter combination selected by the search, so we can loop over it and read each suggested value with `experiment.get_parameter()`.

Each parameter combination will be emitted once, unless you are performing multiple trials per parameter set.
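
The "emitted once" behavior can be pictured with a plain Python generator. The sketch below is not how Comet implements its search (the Bayes algorithm chooses combinations adaptively rather than enumerating them); `iter_combinations` and its `trials` argument are hypothetical names used only to illustrate the contract.

```python
from itertools import product

def iter_combinations(parameters, trials=1):
    """Yield each parameter combination `trials` times, one dict at a time."""
    names = list(parameters)
    for values in product(*(parameters[name] for name in names)):
        for _ in range(trials):
            yield dict(zip(names, values))

grid = {"max_depth": [3, 4, 5], "max_leaf_nodes": [3, 5, 7, 11]}

combos = list(iter_combinations(grid))
repeated = list(iter_combinations(grid, trials=2))

print(len(combos))    # 3 * 4 = 12 combinations, each emitted once
print(len(repeated))  # 24 when every combination is tried twice
print(combos[0])      # {'max_depth': 3, 'max_leaf_nodes': 3}
```

Because a generator produces values lazily, the training loop can start on the first combination without the full list being built up front.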

In [None]:
for experiment in optimizer.get_experiments(workspace='team-comet-ml', project_name="fraud-detection-demo"):
  experiment.add_tag("Optimize-HGBclf")
  model = HistGradientBoostingClassifier(**{
      "loss": "binary_crossentropy",
      "learning_rate": experiment.get_parameter("learning_rate"), 
      "max_leaf_nodes": experiment.get_parameter("max_leaf_nodes"), 
      "max_depth": experiment.get_parameter("max_depth"), 
      "min_samples_leaf": experiment.get_parameter("min_samples_leaf"), 
      "l2_regularization": experiment.get_parameter("l2_regularization"), 
  })

  run_search(experiment, model, X_train_smt, y_train_smt, cv)
  experiment.end()

COMET INFO: Couldn't find a Git repository in '/content' and lookings in parents. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/team-comet-ml/fraud-detection-demo/327e6e1b09a24eb78cee6ac6e9c9d860

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/team-comet-ml/fraud-detection-demo/327e6e1b09a24eb78cee6ac6e9c9d860
COMET INFO:   Metrics:
COMET INFO:     cv_mean_fit_time                : 18.540867884953816
COMET INFO:     cv_mean_score_time              : 6.221289873123169
COMET INFO:     cv_mean_test_average_precision  : 0.9995371468347695
COMET INFO:     cv_mean_test_recall             : 0.9974375099989518
COMET INFO:     cv_mean_train_average_precision : 0.9996735597401666


Let's go ahead and view the results from the sweep.

Notice in the display below, right next to **Experiments** there's a filters button. We can click on that and add a filter, where column name is `TAG` and the tag is `Optimize-HGBclf`. 

Typically, the experiments are displayed latest first.

We can see that we get some pretty decent performance via cross validation, with the experiment named `actual_cheese_3203` performing best (since it trained slightly faster than `continuing_adhesive_1153`).

We can click into that experiment, scroll down to the Other tab, and grab the parameters from the `optimizer_parameters` field: `{"l2_regularization": 0.5384, "learning_rate": 0.8461, "max_depth": 4, "max_leaf_nodes": 11, "min_samples_leaf": 7, "random_state": null}`.

You'll have to forgive me for not setting the random state in the parameters above, but we'll set it to be `42` when we train the model on the entire training data.



In [None]:
experiment.display()

In [None]:
#Comet experiment code
experiment= Experiment(workspace= 'team-comet-ml',project_name="fraud-detection-demo")
experiment.add_tag('register-model')

params = {"l2_regularization": 0.5384, "learning_rate": 0.8461, "max_depth": 4, "max_leaf_nodes": 11, "min_samples_leaf": 7}

hgboost = HistGradientBoostingClassifier(**params, random_state=42, loss='binary_crossentropy')

hgboost.fit(X_train_smt, y_train_smt)

# Predict class probabilities and grab the probability of Class 1, which is the fraud class

y_proba = hgboost.predict_proba(X_test)[:, 1]

# Apply the business rule: if the probability of Class 1 is greater than or equal to 0.80, classify as a fraud case

y_pred = np.where(y_proba >= 0.80, 1, 0)

#calculate evaluation metrics

f_beta = fbeta_score(y_test, y_pred, beta=2)
print(f'fbeta_score for the final model is: {f_beta}')

avg_precision = average_precision_score(y_test, y_pred)
print(f'avg_precision for the final model is: {avg_precision}')

#Comet experiment metadata
metrics = {"f_beta": f_beta,
            "average_percision_score": avg_precision,
        }

#Comet experiment code
experiment.log_parameters(params)
experiment.log_metrics(metrics)


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/team-comet-ml/fraud-detection-demo/8306623889f447298e7bfea70e2860c6
COMET INFO:   Parameters:
COMET INFO:     l2_regularization   : 0.5384
COMET INFO:     learning_rate       : 0.8461
COMET INFO:     loss                : binary_crossentropy
COMET INFO:     max_bins            : 255
COMET INFO:     max_depth           : 4
COMET INFO:     max_iter            : 100
COMET INFO:     max_leaf_nodes      : 11
COMET INFO:     min_samples_leaf    : 7
COMET INFO:     n_bins              : 256
COMET INFO:     n_iter_no_change    : 1
COMET INFO:     random_state        : 42
COMET INFO:     scoring             : 1
COMET INFO:     subsample           : 200000
COMET INFO:     tol                 : 1e-07
COMET INFO:     validation_fraction : 0.1
COMET INFO:     

fbeta_score for the final model is: 0.8799682735524809
avg_precision for the final model is: 0.6197068578135242
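
To help interpret these two numbers, here is a pure-Python sketch on made-up data (the labels and probabilities below are not from this dataset). `fbeta_from_counts` uses the standard F-beta formula, where `beta=2` weights recall more heavily than precision. `average_precision` follows the step-wise definition (sum of recall increments times precision over distinct thresholds); both helper names are hypothetical. The sketch also shows that average precision computed from thresholded labels can differ noticeably from average precision computed from the underlying probabilities, which may be worth keeping in mind when reading the value above, since the cell passes `y_pred` rather than `y_proba`.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from confusion counts: (1+b^2)*TP / ((1+b^2)*TP + b^2*FN + FP)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def average_precision(y_true, scores):
    """Sum over distinct thresholds of (recall increase) * precision."""
    total_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        predicted = sum(1 for s in scores if s >= t)
        recall = tp / total_pos
        ap += (recall - prev_recall) * (tp / predicted)
        prev_recall = recall
    return ap

# Made-up labels and predicted fraud probabilities
y_true = [1, 0, 1, 0, 0, 1, 1]
y_proba = [0.95, 0.85, 0.90, 0.30, 0.20, 0.60, 0.40]

# Same business rule as above: fraud if probability >= 0.80
y_pred = [1 if p >= 0.80 else 0 for p in y_proba]

tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)

f2 = fbeta_from_counts(tp, fp, fn, beta=2)
ap_from_proba = average_precision(y_true, y_proba)
ap_from_labels = average_precision(y_true, y_pred)

print(f"F2 = {f2:.4f}")
print(f"AP from probabilities = {ap_from_proba:.4f}")
print(f"AP from thresholded labels = {ap_from_labels:.4f}")
```

With hard labels there is only one operating point on the precision-recall curve, so the score reflects the 0.80 threshold rather than how well the model ranks cases overall.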


We'll pickle the model and save it with the filename `hbgclf.sav`.

We'll pass the following arguments to the `log_model` method of the `Experiment` object:

- `name`: the name of the model as registered on Comet
- `file_or_folder`: the location and file name of the model as saved on disk; in our case it's the name of the pickle file.
- `file_name`: the name under which the file will be saved in the "Assets & Artifacts" section of the experiment; you can use it to define what the folder structure looks like.

For more information about registering an experiment, check out the Comet [docs](https://www.comet.ml/docs/user-interface/models/#2-register-a-model).

In [None]:
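import pickle

# Write the fitted model to disk; the context manager closes (and flushes)
# the file before it is uploaded
with open('hbgclf.sav', 'wb') as f:
    pickle.dump(hgboost, f)
experiment.log_model(name='hgboost', file_or_folder='hbgclf.sav', file_name='../models/hbgclf-fraudetc.sav')
experiment.end()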
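
Before relying on a pickled model, it can be reassuring to confirm that the file loads back into an equivalent object. The sketch below shows that round trip with a plain dictionary standing in for the fitted classifier (`stand_in_model` is a placeholder, and the temporary directory is only there to keep the example self-contained).

```python
import os
import pickle
import tempfile

# Placeholder for the fitted model; any picklable object works the same way
stand_in_model = {"learning_rate": 0.8461, "max_depth": 4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hbgclf.sav")

    # Save: closing the file guarantees all bytes reach the disk
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: read the same file back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.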

In [None]:
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/team-comet-ml/fraud-detection-demo/a8c1e09977ba4eeaaa01a8242e0b1d58
COMET INFO:   Metrics:
COMET INFO:     average_percision_score : 0.6197068578135242
COMET INFO:     f_beta                  : 0.8799682735524809
COMET INFO:   Parameters:
COMET INFO:     l2_regularization   : 0.5384
COMET INFO:     learning_rate       : 0.8461
COMET INFO:     loss                : binary_crossentropy
COMET INFO:     max_bins            : 255
COMET INFO:     max_depth           : 4
COMET INFO:     max_iter            : 100
COMET INFO:     max_leaf_nodes      : 11
COMET INFO:     min_samples_leaf    : 7
COMET INFO:     n_bins              : 256
COMET INFO:     n_iter_no_change    : 1
COMET INFO:     random_state        : 42
COMET INFO:     scoring             : 1
CO