#### In this notebook we develop a pipeline for hyperparameter tuning for UMAP + HDBSCAN.

We need to tune the following params:

UMAP:
- n_neighbors: [2, 0.25*len(df)]
- min_dist: [0, 0.99]
- n_components: [2, n_features]
- metric: [9 metrics for binary data]

HDBSCAN:
- min_cluster_size:
- min_samples: 
Note: To explore different min_cluster_size settings with a fixed min_samples value, especially on larger datasets, you can cache the expensive computation via the memory parameter (which uses joblib) and recompute only the relatively cheap flat cluster extraction.
- cluster_selection_epsilon: ?
[- alpha]X
[- leaf clustering, not EOM]
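
The UMAP ranges listed above depend on the shape of the data, so they can be derived from it rather than hard-coded. The following is a minimal sketch under stated assumptions: `umap_search_bounds` is a hypothetical helper, the example shape (1000 rows, 40 columns) is made up, and the metric list is left out because the nine binary metrics are not named here.

```python
def umap_search_bounds(n_samples, n_features, max_neighbor_fraction=0.25):
    """Return (low, high) bounds for the UMAP parameters listed above."""
    # Upper bound for n_neighbors is a fraction of the number of rows.
    max_neighbors = int(max_neighbor_fraction * n_samples)
    if max_neighbors < 2 or n_features < 2:
        raise ValueError("dataset too small for the ranges above")
    return {
        "n_neighbors": (2, max_neighbors),
        "min_dist": (0.0, 0.99),
        "n_components": (2, n_features),
    }


# Hypothetical dataset shape, for illustration only.
bounds = umap_search_bounds(n_samples=1000, n_features=40)
print(bounds)
```

In the notebook the arguments would presumably come from `len(df)` and the number of columns in `df`.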


##### Here we use the DBCV score, but could try others?


In [1]:
RANDOM_SEED = 42

In [2]:
from utilities import load_symptom_data
import hdbscan
import numpy as np
import pandas as pd
import time
import wandb

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [3]:
df = load_symptom_data('../data/cleaned_data_SYMPTOMS_9_13_23.csv')

In [4]:
n_iter = 2

In [5]:
# random_search = RandomizedSearchCV(
#     hdb,
#     param_distributions=hyper_params,
#     n_iter=n_iter,
#     scoring=clustering_score,
#     random_state=RANDOM_SEED
# )

# grid_search = GridSearchCV(
#     hdb,
#     param_grid=hyper_params,
#     scoring=clustering_score
# )

In [6]:
# start_time = time.time()
# random_search.fit(df)
# elapsed_time = time.time() - start_time
# print("%d fits took %.1f minutes" % (n_iter, elapsed_time/60))

In [7]:
# import itertools

# hyper_params = {
#     'penalty': ['l1', 'l2'],
#     'class_weight': [None, 'balanced'],
#     'max_iter': [500, 1000, 30]
# }


# a = hyper_params.values()
# combinations = list(itertools.product(*a))

##### Trying a different approach to rescue grid search!

- To get GridSearchCV to fit and return the score for the full dataset, we need to use a predefined split with one copy of the data for training and another copy for validation.
- We need to create our own scoring function with the correct signature (i.e. no need for y_true), as below.
- We need to make sure refit=False.
- We need to make sure that the random state is the same for each split ???
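
The predefined-split idea in the bullets above can be illustrated without any dependencies. The sketch below mimics, in plain Python, how scikit-learn's `PredefinedSplit` turns a `test_fold` array into train/test indices when the data is stacked twice. `predefined_split_indices` is a hypothetical stand-in written for illustration, not the library implementation, and the four-row dataset is made up.

```python
def predefined_split_indices(test_fold):
    """Yield (train_idx, test_idx) pairs: one split per distinct fold id,
    testing on that fold and training on everything else."""
    # A fold id of -1 means "never used as a test sample".
    for fold in sorted(set(test_fold) - {-1}):
        test_idx = [i for i, f in enumerate(test_fold) if f == fold]
        train_idx = [i for i, f in enumerate(test_fold) if f != fold]
        yield train_idx, test_idx


n = 4  # rows in one copy of the (toy) dataset
test_fold = [0] * n + [1] * n  # first copy is fold 0, second copy is fold 1
splits = list(predefined_split_indices(test_fold))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each copy serves once as the test set and once as the training set, so every split trains and scores on a full copy of the data.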

#### The following is basically working, but needs converting to hdbscan and different scoring metrics...
#### Also needs porting to scikit-optimize... Note: you need to specify the search space differently.

#### And we need to add a pipeline that includes a dimensionality reduction algorithm.

### Necessary to downgrade numpy to <1.24 because skopt uses np.int :/
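
An alternative to downgrading that is sometimes used is to restore the removed alias before importing skopt. This is a workaround sketch that was not tested in this notebook, and it assumes `np.int` is the only removed NumPy alias that the installed skopt version relies on.

```python
import numpy as np

# np.int was a deprecated alias for the builtin int and was removed in NumPy 1.24.
# Restoring it before importing skopt lets older code that calls np.int keep working.
if not hasattr(np, "int"):
    np.int = int

# Import skopt only after the alias is in place, e.g.:
# from skopt import BayesSearchCV
```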

### Question: should the DBCV score use the local value of 'metric'? That is problematic for comparing across different runs...

DBCV not working with CV.....

In [8]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import PredefinedSplit
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA



from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

In [9]:
run = wandb.init(
        name='run1',
        project='test_clulster',
        config={}
    )

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/rustybilges/.netrc


In [101]:
ddf = pd.concat([df, df])

In [102]:
split = PredefinedSplit([0 if i < len(df) else 1 for i in range(len(ddf.index))])

In [103]:
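# Use positions rather than index labels: pd.concat keeps the original labels,
# so both copies share the same index values and cannot be told apart by label.
test_id = np.array([0 if i < len(df) else 1 for i in range(len(ddf))])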

In [205]:
pca = PCA(random_state=RANDOM_SEED)

In [308]:
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True, core_dist_n_jobs=4)

In [309]:
kmeans = KMeans(n_init='auto', random_state=RANDOM_SEED)

## Note: getting overflow in this version of DBCV when distances are small. Is there another implementation we can use?

Note: this code may not work for n_jobs != 1 because of the way we obtain the iteration number from the length of the optimisation result.

In [347]:
# Note: these scores use model.steps[1][1].labels_ instead of model.steps.labels_ because
# they are accessing the clustering model which is the second step in the pipeline.

def dbcv(data, labels, metric):
    return hdbscan.validity.validity_index(
            data, labels,
            metric=metric
        )
    
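def silhouette(data, labels, metric):
    # silhouette_score is only defined for 2 or more distinct labels.
    num_labels = len(set(labels))
    if num_labels < 2:
        print("Warning: Valid number of clusters must be 2 or more.")
        return 0
    # Pass the metric through so it matches the one used for the other scores.
    return silhouette_score(data, labels, metric=metric)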

def cv_score(model, X, metric='euclidean', score='dbcv'):
    """
    If score == 'all' we return a dictionary of all scores, which
    can be logged to wandb on each iteration.

    Otherwise this is intended for use as a scorer in <X>SearchCV methods.
    In that case metric should be fixed to allow comparison across different runs.
    """
    score_dict = {
        'silhouette': silhouette,
        'dbcv': dbcv
    }
    # TODO: move this as not all clustering algos have 'metric' parameter.
    if metric is None:
        metric = model.steps[1][1].get_params()['metric']

    # Refit on the data passed to the scorer, then score the reduced data.
    model.fit(X)
    labels = model.steps[1][1].labels_
    data = model.steps[0][1].transform(X)
    if score == 'all':
        return {name: fn(data, labels, metric) for name, fn in score_dict.items()}
    return score_dict[score](data, labels, metric)
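
One caveat for the single-label check in `silhouette` above: HDBSCAN marks noise points with the label -1, so a result with one real cluster plus noise has two distinct labels and passes the check. A minimal sketch of a noise-aware count follows. `count_clusters` is a hypothetical helper, not part of hdbscan or scikit-learn, and the labels are made up.

```python
def count_clusters(labels, noise_label=-1):
    """Number of distinct cluster labels, ignoring the noise label."""
    return len(set(labels) - {noise_label})


# Toy labelling: one real cluster (0) plus noise points (-1).
labels = [-1, 0, 0, -1, 0]
print("distinct labels:", len(set(labels)))
print("real clusters:", count_clusters(labels))
```

Whether noise should count as its own group when scoring is a design choice; the sketch only makes the difference visible.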


In [343]:
hyper_params = {
    'pca__n_components': Integer(5, 100),
    'hdbscan__min_samples': Integer(1, 1000),
    'hdbscan__min_cluster_size': Integer(10, 2000),
    'hdbscan__cluster_selection_method': Categorical(['eom', 'leaf']),
    'hdbscan__cluster_selection_epsilon': Real(0.0, 100.0),
    'hdbscan__metric': Categorical(['euclidean', 'manhattan'])
}

# hyper_params = {
#     'pca__n_components': [5, 15],#, 30, 45, 60],
#     'kmeans__n_clusters': Integer(2, 20)
# }

In [312]:
pipe = Pipeline(steps=[('pca', pca), ('hdbscan', hdb)])
# pipe = Pipeline(steps=[('pca', pca), ('kmeans', kmeans)])

In [313]:
tunning = BayesSearchCV(
   estimator=pipe,
   search_spaces=hyper_params,
   scoring=cv_score,
   cv=split,
   n_jobs=-1,
   refit=False,
   return_train_score=True,
   n_iter=20
)

In [333]:
# TODO: add wandb logging.
# Include labels_ and params and save to disk every X iterations...
def wandb_callback(result):
    # Number of points evaluated so far (avoids shadowing the builtin `iter`).
    n_done = len(result['x_iters'])
    print('Iteration %d' % n_done)
    # On the first call a best score may not be available yet.
    if n_done > 1:
        print(tunning.best_score_)
        print("Current params: ", tunning.cv_results_['params'][n_done - 1])
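
The TODO above (save progress to disk every X iterations) could be approached with a callback factory. The following is a minimal sketch, assuming the callback receives a dict-like result with `x_iters` and `func_vals` entries, as skopt optimisation results provide. `make_checkpoint_callback`, the file name, and the fake result are hypothetical, and wandb logging is left out.

```python
import json
import os
import tempfile


def _to_builtin(value):
    # NumPy scalars expose .item(); fall back to str for anything else.
    return value.item() if hasattr(value, "item") else str(value)


def make_checkpoint_callback(path, every=5):
    """Return a callback that writes the search history to `path`
    whenever the number of evaluated points is a multiple of `every`."""
    def callback(result):
        n_done = len(result["x_iters"])
        if n_done % every == 0:
            payload = {
                "n_done": n_done,
                "x_iters": [list(x) for x in result["x_iters"]],
                "func_vals": list(result["func_vals"]),
            }
            with open(path, "w") as fh:
                json.dump(payload, fh, default=_to_builtin)
    return callback


# Usage with a fake result standing in for the optimiser output.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "search_checkpoint.json")
checkpoint = make_checkpoint_callback(checkpoint_path, every=2)
fake_result = {"x_iters": [[24, "eom"], [70, "leaf"]], "func_vals": [-0.03, 0.0]}
checkpoint(fake_result)

with open(checkpoint_path) as fh:
    saved = json.load(fh)
print(saved["n_done"])
```

As far as I know, the `callback` argument of `fit` also accepts a list of callables, so a checkpoint callback like this could probably be passed alongside `wandb_callback`.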

In [334]:
start_time = time.time()
tunning.fit(ddf.to_numpy(), callback=wandb_callback)
elapsed_time = time.time() - start_time
print(elapsed_time)

Iteration 1
Iteration 2
0.030048555770067233
20
Current params:  OrderedDict([('hdbscan__cluster_selection_epsilon', 20.110439752267144), ('hdbscan__cluster_selection_method', 'eom'), ('hdbscan__metric', 'euclidean'), ('hdbscan__min_cluster_size', 1380), ('hdbscan__min_samples', 144), ('pca__n_components', 24)])
Iteration 3
0.030048555770067233
20
Current params:  OrderedDict([('hdbscan__cluster_selection_epsilon', 96.63780949235175), ('hdbscan__cluster_selection_method', 'eom'), ('hdbscan__metric', 'euclidean'), ('hdbscan__min_cluster_size', 1841), ('hdbscan__min_samples', 128), ('pca__n_components', 70)])
Iteration 4
0.030048555770067233
20
Current params:  OrderedDict([('hdbscan__cluster_selection_epsilon', 39.526697537020105), ('hdbscan__cluster_selection_method', 'eom'), ('hdbscan__metric', 'euclidean'), ('hdbscan__min_cluster_size', 1609), ('hdbscan__min_samples', 295), ('pca__n_components', 61)])
Iteration 5
0.030048555770067233
20
Current params:  OrderedDict([('hdbscan__cluste

In [336]:
tunning.total_iterations

120

In [337]:
tunning.best_score_

0.0

In [338]:
tunning.best_params_

OrderedDict([('hdbscan__cluster_selection_epsilon', 4.46182243476504),
             ('hdbscan__cluster_selection_method', 'leaf'),
             ('hdbscan__metric', 'euclidean'),
             ('hdbscan__min_cluster_size', 603),
             ('hdbscan__min_samples', 382),
             ('pca__n_components', 42)])

In [339]:
pipe = Pipeline(steps=[('pca', pca), ('hdbscan', hdb)])
# pipe = Pipeline(steps=[('pca', pca), ('kmeans', kmeans)])
pipe.set_params(**tunning.best_params_)

In [348]:
def cv_results_sanity_check(pipe, df, cv_results):
    # Note: the best score and params are read from the global `tunning` search object.
    bs = tunning.best_score_
    bp = tunning.best_params_

    pipe.set_params(**bp)

    # Refit on a single copy of the data and report any mismatch with the best score.
    refit_score = cv_score(pipe, df.to_numpy())
    if bs != refit_score:
        print(bs, refit_score)
    bid = np.where(cv_results['mean_test_score'] == bs)[0][0]

    # The best entry must be identical across both splits, for train and test.
    assert bp == cv_results['params'][bid]
    assert bs == cv_results['split0_test_score'][bid]
    assert bs == cv_results['split1_test_score'][bid]
    assert bs == cv_results['split0_train_score'][bid]
    assert bs == cv_results['split1_train_score'][bid]

    # Every candidate must score the same on both copies of the data.
    for i, s in enumerate(cv_results['split0_test_score']):
        assert s == cv_results['split1_test_score'][i]

    print("These search results passed all sanity checks. They are deterministic and consistent. :)")

In [349]:
cv_results_sanity_check(pipe, df, tunning.cv_results_)

These search results passed all sanity checks. They are deterministic and consistent. :)


In [350]:
cv_score(pipe, df.to_numpy())

0