# SEG 2016 Facies Competition


In 2016 Matt @ Agile & Brendon @ Enthought set up a machine learning contest with the SEG.

The objective was to predict facies logs from a small set of well-log data. The image below (source: www.agilescientific.com) shows the wireline logs and facies targets for one well.

![](agile_blog_seg_facies_image.png)

Two wells were held out as blind and used to create the final scores.

![](leaderboard.png)

In this notebook, we take the winning submission from `LA_Team` (Lukas Mosser & Alfredo de la Fuente).

Their model was gradient boosted trees, selected and tuned by TPOT.

Here we tune that model with Ray Tune instead.


# Dependencies

In [None]:
%load_ext autoreload
%autoreload 2

from dependencies import *

We have already applied the preprocessing steps from the contest entry notebook and saved the results to an HDF5 file, which we load here with `h5py`.

In [2]:
import h5py


# define a loading function
def setup(filepath):
    with h5py.File(filepath, 'r') as f:
        X_train = f["train_x"][:]
        y_train = f["train_y"][:]
        group_train = f["train_groups"][:]
        train_wells = f["train_groups"].attrs["well_names"]        
        
        X_test = f["test_x"][:]
        y_test = None
#         y_test = f["test_y"]
        group_test = f["test_groups"][:]
        test_wells = f["test_groups"].attrs["well_names"]

    return X_train, y_train, group_train, X_test, y_test, group_test, (train_wells, test_wells)
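The file layout that `setup` expects can be illustrated with a small round trip. This is only a sketch: the array shapes, values, and well names below are invented, and storing `well_names` as a byte-string attribute is an assumption about how the real file was written.

```python
import os
import tempfile

import h5py
import numpy as np


def write_toy_file(filepath):
    """Write a tiny file using the same dataset names that setup() reads."""
    rng = np.random.default_rng(0)
    with h5py.File(filepath, "w") as f:
        f.create_dataset("train_x", data=rng.normal(size=(6, 4)))
        f.create_dataset("train_y", data=np.array([0, 1, 2, 0, 1, 2]))
        groups = f.create_dataset("train_groups", data=np.array([0, 0, 0, 1, 1, 1]))
        # hypothetical well names, stored as fixed-length byte strings
        groups.attrs["well_names"] = np.array([b"WELL_A", b"WELL_B"])


with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.h5")
    write_toy_file(toy_path)

    with h5py.File(toy_path, "r") as f:
        X_train = f["train_x"][:]          # [:] copies the data into memory
        groups = f["train_groups"][:]
        well_names = [n.decode() for n in f["train_groups"].attrs["well_names"]]

print(X_train.shape, groups, well_names)
```

Reading with `[:]` inside the `with` block matters: the arrays are copied into memory, so they remain usable after the file is closed.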

The tuned model parameters from the contest entry:
```
XGBClassifier(learning_rate=0.12,
              max_depth=3,
              min_child_weight=10,
              n_estimators=150,
              seed=seed,
              colsample_bytree=0.9)
```

Let's check that the data is on the expected path.

In [3]:
from os import path
filepath = path.abspath('../datasets/seg_2016_facies/la_team_5_data.h5py')
print(filepath)

/home/lena/git/tutorial-raytune-hyper/datasets/seg_2016_facies/la_team_5_data.h5py


## Tuning function

Again, we define an end-to-end tuning function.

In [4]:
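import numpy as np  # explicit imports, in case `dependencies` does not provide them
from ray import tune
from sklearn.metrics import f1_score
from sklearn.model_selection import LeavePGroupsOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.signal import medfilt
from filelock import FileLock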

def e2e_train_and_test(config, **kwargs):
    
    # load the data
    X, y, groups, X_test, y_test, group_test, well_names = setup(kwargs['filepath'])
    
    #
    # choose your CV strategy. groups == wells, so each fold holds out one well
    #
    splitter = LeavePGroupsOut(n_groups=1)
    
    #
    # run leave-one-well-out training and validation
    #
    f1_scores = [] # keep hold of all individual scores
    for train_ind, val_ind in splitter.split(X, y, groups=groups):
        pipeline = make_pipeline(RobustScaler(),
                                  XGBClassifier())

        pipeline.set_params(**config)
        pipeline.fit(X[train_ind], y[train_ind])
        
        y_pred = pipeline.predict(X[val_ind])

        # Clean isolated facies for the held-out well before scoring,
        # so that the filter actually affects the reported metric
        y_pred = medfilt(y_pred, kernel_size=5)

        f1_scores.append(f1_score(y[val_ind], y_pred, average='micro'))

    # use Tune's reporter to send the metrics to tune.run()
    tune.report(mean_f1_score=np.array(f1_scores).mean(),
                std_f1_score=np.array(f1_scores).std())
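To make the fold structure concrete, here is a minimal sketch on toy data, assuming three hypothetical wells of twelve samples each. The "prediction" is a stand-in (the true labels with one isolated error), not the output of a model. It shows that `LeavePGroupsOut(n_groups=1)` yields one fold per well and how a 5-sample median filter removes an isolated facies label.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.metrics import f1_score
from sklearn.model_selection import LeavePGroupsOut

# Toy data: 3 hypothetical wells, each with two facies blocks of 6 samples
groups = np.repeat([0, 1, 2], 12)
y = np.tile([1] * 6 + [2] * 6, 3)
X = np.zeros((len(y), 1))  # feature values do not matter for the split

splitter = LeavePGroupsOut(n_groups=1)
n_folds = splitter.get_n_splits(X, y, groups=groups)

raw_scores, filtered_scores = [], []
for train_ind, val_ind in splitter.split(X, y, groups=groups):
    # stand-in for pipeline.predict(): true labels with one isolated error
    y_pred = y[val_ind].copy()
    y_pred[2] = 2

    # median filter along the held-out well removes the isolated label
    y_filt = medfilt(y_pred, kernel_size=5).astype(int)

    raw_scores.append(f1_score(y[val_ind], y_pred, average="micro"))
    filtered_scores.append(f1_score(y[val_ind], y_filt, average="micro"))

print(n_folds, np.mean(raw_scores), np.mean(filtered_scores))
```

Because each validation fold is a single well, the median filter runs along one continuous log rather than across well boundaries.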

The following config is locked to the winning parameters:

In [6]:
tuning_config = {
    'xgbclassifier__learning_rate': 0.12,
    'xgbclassifier__max_depth': 3,
    'xgbclassifier__min_child_weight': 10,
    'xgbclassifier__n_estimators': 150,
    'xgbclassifier__seed': 1773,
    'xgbclassifier__colsample_bytree': 0.9
}
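The config keys follow scikit-learn's `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. The sketch below uses a `DecisionTreeClassifier` as a stand-in for `XGBClassifier` (an assumption made only to keep the example light) and shows that a misspelled step prefix is rejected by `set_params`.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# DecisionTreeClassifier stands in for XGBClassifier in this illustration
pipeline = make_pipeline(RobustScaler(), DecisionTreeClassifier())

# step names are the lowercased class names
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# a correctly prefixed key reaches the estimator inside the pipeline
pipeline.set_params(decisiontreeclassifier__max_depth=3)
max_depth = pipeline.named_steps["decisiontreeclassifier"].max_depth

# a misspelled prefix does not match any step and raises ValueError
try:
    pipeline.set_params(decisiontreeclasifier__max_depth=3)
    misspelled_rejected = False
except ValueError:
    misspelled_rejected = True

print(max_depth, misspelled_rejected)
```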

Let's update that to sample each parameter from a distribution.

In [7]:
ray_tuning_config = {
    'xgbclassifier__learning_rate': tune.loguniform(0.001, 0.5),
    'xgbclassifier__max_depth': tune.randint(1, 10),
    'xgbclassifier__min_child_weight': tune.loguniform(0.1,100),
    'xgbclassifier__n_estimators': tune.randint(5,200),
    'xgbclassifier__colsample_bytree': tune.choice([0.4, 0.6, 0.8, 1.0]),
    'xgbclassifier__reg_lambda': tune.choice([0, 1]),  # L2 regularisation (`reg_lambda` in the sklearn wrapper)
    'xgbclassifier__seed': 42   # set as a constant, so it's always the same
}
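A log-uniform distribution spreads samples evenly across orders of magnitude, which suits parameters such as the learning rate. The following standard-library sketch illustrates the idea only; `sample_loguniform` is a hypothetical helper, not Ray Tune's implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(0.001, 0.5, rng) for _ in range(10_000)]

# Share of samples in the lowest decade [0.001, 0.01).
# In theory this is ln(10) / ln(500), roughly 37%, whereas a plain
# uniform draw on [0.001, 0.5] would put under 2% of samples there.
fraction_lowest_decade = sum(s < 0.01 for s in samples) / len(samples)
print(min(samples), max(samples), fraction_lowest_decade)
```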

In [9]:
ray.shutdown()
ray.init(num_cpus=5, num_gpus=0, include_dashboard=True)

2020-11-09 15:59:46,932	INFO services.py:1164 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '192.168.123.68',
 'raylet_ip_address': '192.168.123.68',
 'redis_address': '192.168.123.68:17328',
 'object_store_address': '/tmp/ray/session_2020-11-09_15-59-46_197278_28024/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-11-09_15-59-46_197278_28024/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2020-11-09_15-59-46_197278_28024',
 'metrics_export_port': 63124}

In [None]:
filepath = path.abspath('../datasets/seg_2016_facies/la_team_5_data.h5py')

# wrap our end to end function to inject our filepath
def e2e_seg(config):
    return e2e_train_and_test(config, filepath=filepath)

analysis = tune.run(
    e2e_seg,
    config=ray_tuning_config,
    num_samples=15,  # number of samples to draw from the (non-grid) distributions
    resources_per_trial=dict(cpu=1, gpu=0),
    local_dir="~/ray_results/seg_facies",
)

In [10]:
from pprint import pprint
print("Best config: ")
# mode="max" because a higher F1 score is better and no mode was given to tune.run()
pprint(analysis.get_best_config(metric="mean_f1_score", mode="max"))


In [None]:
df = analysis.dataframe()
top_n_df = df.nlargest(10, "mean_f1_score")
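The ranking step is ordinary pandas. Here is a small sketch with a hand-made DataFrame: every value is invented for illustration and is not a tuning result, and the `config/...` column name is an assumption about how the trial config appears in the results table.

```python
import pandas as pd

# Invented toy values, one row per hypothetical trial
toy_df = pd.DataFrame(
    {
        "mean_f1_score": [0.52, 0.61, 0.48, 0.58],
        "std_f1_score": [0.05, 0.07, 0.04, 0.06],
        "config/xgbclassifier__max_depth": [3, 5, 2, 7],  # assumed column name
    }
)

# keep the two highest-scoring trials, best first
top_2 = toy_df.nlargest(2, "mean_f1_score")

# the single best trial as a row
best_row = toy_df.loc[toy_df["mean_f1_score"].idxmax()]

print(top_2)
print(best_row["config/xgbclassifier__max_depth"])
```

`nlargest` returns rows already sorted in descending order of the chosen column, so `top_n_df.head()` shows the best trials first.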

In [None]:
top_n_df.head()

In [None]:
plot_some_tune_results(top_n_df, (0.3, 1.0))

In [None]:
%load_ext tensorboard
from tensorboard import notebook 
%tensorboard --logdir "~/ray_results/seg_facies"
notebook.display(height=1000)

In [None]:
ray.shutdown()