# Unleash the Ray - Grid Search

Let's revisit our grid search example, but now with Ray.

A lot of this code is going to be familiar, as we already had our pipeline wrapped in a function.

In [1]:
%load_ext autoreload
%autoreload 2

from dependencies import *

Loading dependencies we have already seen...
Done...


In [2]:
import ray
from ray import tune

### Let's start Ray

In [4]:
ray.shutdown()
ray.init(num_cpus=10, num_gpus=0, include_dashboard=True)

2020-11-06 15:15:11,425	INFO services.py:1164 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.2.140',
 'raylet_ip_address': '192.168.2.140',
 'redis_address': '192.168.2.140:62656',
 'object_store_address': '/tmp/ray/session_2020-11-06_15-15-10_731830_11885/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-11-06_15-15-10_731830_11885/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2020-11-06_15-15-10_731830_11885',
 'metrics_export_port': 64493}

After initialisation, the [Ray Dashboard](https://docs.ray.io/en/master/ray-dashboard.html) is available at the address shown under **webui_url** in the output above.
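
As a small illustration, the dashboard link can be built from the dictionary that `ray.init` returned. The sketch below copies two entries from the output above into a plain dict (`init_info` is a name chosen here for illustration); on another machine the values will differ.

```python
# Values copied from the ray.init() output above; yours will differ.
init_info = {
    "node_ip_address": "192.168.2.140",
    "webui_url": "127.0.0.1:8265",
}

# webui_url is a host:port string without a scheme, so prepend one for a usable link.
dashboard_url = f"http://{init_info['webui_url']}"

# Split off the port in case you need it on its own (e.g. for port forwarding).
host, port = init_info["webui_url"].rsplit(":", 1)

print(dashboard_url)
print(host, port)
```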

## Set up some Ray Tune-compatible training code

This is very similar to before, except that we now have an end-to-end function.

In [5]:

# The difference from what we've seen before: this is an end-to-end training function.
# It loads the dataset, runs the complete cross-validated train and test loop,
# and reports the resulting metrics to Tune.
def e2e_simple_training(config):
    
    # thread-safe: each trial loads its own copy of the data
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    
    # choose your CV strategy
    splitter = StratifiedKFold(n_splits=5)
    
    # run k fold training and testing
    f1_scores = [] # keep hold of all individual scores
    for train_ind, test_ind in splitter.split(X, y):
        pipeline = make_pipeline(RobustScaler(),
                                  RandomForestClassifier(random_state=42))

        pipeline.set_params(**config)
        pipeline.fit(X[train_ind], y[train_ind])
        
        y_pred = pipeline.predict(X[test_ind])
        
        # scikit-learn's signature is f1_score(y_true, y_pred)
        f1_scores.append(f1_score(y[test_ind], y_pred))
    
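    # report the metrics to Tune; tune.report is the function API's reporter
    # (older Ray releases used tune.track.log)
    tune.report(mean_f1_score=np.array(f1_scores).mean(),
                std_f1_score=np.array(f1_scores).std(),
                # any other metrics can be added here as keyword arguments
                done=True)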

Previously we had a param grid like this

```
param_grid = {
    'randomforestclassifier__n_estimators': [1,5,15,50,100],
    'randomforestclassifier__criterion': ['gini', 'entropy'],
    'randomforestclassifier__bootstrap': [True, False]
}
```

### TODO: convert this to a set of Ray search spaces

The Ray config object is freeform; we impose our own structure on it.

However, tunable parameters need to be represented by Tune search space objects such as `tune.grid_search` ([read the docs](https://docs.ray.io/en/latest/tune/api_docs/grid_random.html?highlight=tune.grid#random-distributions-api)).

In [None]:
ray_tuning_config = {
    'randomforestclassifier__n_estimators': tune.grid_search([1,5,15,50,100])
}
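
To get a feel for what a grid search space means, here is a plain-Python sketch of the expansion: one trial configuration per point in the Cartesian product of the value lists. It uses the earlier `param_grid` rather than Tune itself; `expand_grid` is a hypothetical helper written for this illustration, not part of Ray, and the order in which Tune enumerates trials may differ.

```python
from itertools import product

# The scikit-learn style grid from earlier in this notebook.
param_grid = {
    "randomforestclassifier__n_estimators": [1, 5, 15, 50, 100],
    "randomforestclassifier__criterion": ["gini", "entropy"],
    "randomforestclassifier__bootstrap": [True, False],
}

def expand_grid(grid):
    """Return one config dict per combination of values (hypothetical helper)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

trial_configs = expand_grid(param_grid)

print(len(trial_configs))  # 5 * 2 * 2 combinations
print(trial_configs[0])
```

Each dict has the same shape as the `config` argument that `e2e_simple_training` passes to `pipeline.set_params`.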

In [None]:
analysis = tune.run(
                e2e_simple_training,
                config=ray_tuning_config,
                resources_per_trial=dict(cpu=1, gpu=0),
                local_dir="~/ray_results/grid_search")

In [None]:
df = analysis.dataframe()
print(df.columns)
df.head()

In [None]:
# F1 is a score, so higher is better: state the mode explicitly
print("Best config: ", analysis.get_best_config(metric="mean_f1_score", mode="max"))
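
Conceptually, picking the best configuration amounts to taking the trial with the highest `mean_f1_score` and reading back its `config/...` columns. The sketch below shows that idea on a list of dicts standing in for the results dataframe. The rows are made-up placeholders, not real results, and `best_config` is a hypothetical helper rather than Ray's implementation.

```python
# Made-up placeholder rows standing in for analysis.dataframe(); NOT real results.
rows = [
    {"config/randomforestclassifier__n_estimators": 1, "mean_f1_score": 0.90},
    {"config/randomforestclassifier__n_estimators": 15, "mean_f1_score": 0.95},
    {"config/randomforestclassifier__n_estimators": 100, "mean_f1_score": 0.94},
]

def best_config(rows, metric):
    """Return the config of the row with the largest metric (hypothetical helper)."""
    best_row = max(rows, key=lambda row: row[metric])
    prefix = "config/"
    # Keep only the config columns and strip the "config/" prefix.
    return {
        key[len(prefix):]: value
        for key, value in best_row.items()
        if key.startswith(prefix)
    }

best = best_config(rows, "mean_f1_score")
print(best)
```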

In [None]:
from scipy.stats import norm

def plot_some_tune_results(df):
    fig, ax = plt.subplots(1, 1, figsize=(16,6))
    x = np.linspace(0.85, 1.0, 100)

    n_estimators = df['config/randomforestclassifier__n_estimators'].values.tolist()

    lines = []
    for mu, sigma in zip(df['mean_f1_score'], df['std_f1_score']):
        pdf = norm.pdf(x, mu, sigma)
        line, = ax.plot(x, pdf, alpha=0.6)
        ax.axvline(mu, color=line.get_color())
        ax.text(mu, pdf.max(), f"{mu:.3f}", color=line.get_color(), fontsize=14)
        lines.append(line)

    ax.legend(handles=lines, labels=n_estimators, title="n estimators")
    ax.set_title("Average F1 Scores")
    
plot_some_tune_results(df)

## Really increase the size of the search space

In [None]:
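#
# Full 6D search space - 960 combinations - 4800 calls to fit (5 folds each)
# With the two commented-out parameters disabled, the grid below is 4D - 120 combinations
#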

ray_tuning_config = {
    'randomforestclassifier__n_estimators': tune.grid_search([1,5,15,50,100]),
    'randomforestclassifier__criterion': tune.grid_search(['gini', 'entropy']),
    'randomforestclassifier__max_features': tune.grid_search(['auto', 'sqrt', 'log2']),
#     'randomforestclassifier__bootstrap': tune.grid_search([True, False]),
#     'randomforestclassifier__min_samples_leaf': tune.grid_search([1,2,3,4]),
    'randomforestclassifier__min_samples_split': tune.grid_search([3,4,5,6])
}
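
The size of a grid is the product of the number of values per parameter, and every combination is fitted once per cross-validation fold. A short sketch of that arithmetic, with the value counts read off the config above (the variable names are chosen here for illustration):

```python
from math import prod

n_splits = 5  # StratifiedKFold(n_splits=5) in the training function

# Number of candidate values per parameter, including the two commented-out ones.
values_per_param = {
    "n_estimators": 5,
    "criterion": 2,
    "max_features": 3,
    "bootstrap": 2,
    "min_samples_leaf": 4,
    "min_samples_split": 4,
}
commented_out = {"bootstrap", "min_samples_leaf"}

full_combinations = prod(values_per_param.values())
active_combinations = prod(
    count for name, count in values_per_param.items() if name not in commented_out
)

print(full_combinations, full_combinations * n_splits)
print(active_combinations, active_combinations * n_splits)
```

Re-enabling a commented-out parameter multiplies the number of trials by its number of values, which is why grid searches grow so quickly.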

In [None]:
analysis = tune.run(
                e2e_simple_training,
                config=ray_tuning_config,
                resources_per_trial=dict(cpu=1, gpu=0)
                )

In [None]:
from pprint import pprint
print("Best config: ")
pprint(analysis.get_best_config(metric="mean_f1_score", mode="max"))

In [None]:
df = analysis.dataframe()
top_n_df = df.nlargest(10, "mean_f1_score")

In [None]:
plot_some_tune_results(top_n_df)

In [None]:
%load_ext tensorboard

In [None]:
from tensorboard import notebook
%tensorboard --logdir "~/ray_results/grid_search"
notebook.display(height=1000) 

### Once you are all done, shut down Ray

In [None]:
ray.shutdown()