# Bayes Search CV Example

The `BayesSearchCV` class is used to search for the best parameters for a given Iguanas Pipeline.

The process is as follows:

* Generate k-fold stratified cross validation datasets. 
* For each pair of training and validation datasets:
    * Fit the pipeline on the training set using a set of parameters chosen by the Bayesian Optimiser from a given set of ranges.
    * Apply the pipeline to the validation set to return a prediction.
    * Use the provided `scorer` to calculate the score of the prediction.
* Return the parameter set which generated the highest mean overall score across the validation datasets.
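
The loop above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Iguanas implementation: `stratified_folds`, `fit_predict`, and `search` are hypothetical helpers, the "pipeline" is a toy one-feature threshold model, and candidate parameters are drawn by random sampling rather than by a Bayesian optimiser, so the snippet has no dependencies.

```python
import random
from statistics import mean


def stratified_folds(y, k, seed=0):
    """Split row indices into k folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def f1_score(y_true, y_pred):
    """F1 score for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_predict(x_train, y_train, x_val, offset):
    """Toy 'pipeline': cut at the midpoint of the class means, shifted by offset."""
    pos = [x for x, t in zip(x_train, y_train) if t == 1]
    neg = [x for x, t in zip(x_train, y_train) if t == 0]
    cut = (mean(pos) + mean(neg)) / 2 + offset
    return [int(x > cut) for x in x_val]


def search(x, y, k=3, n_iter=15, seed=0):
    """Return the parameter set with the highest mean validation score."""
    rng = random.Random(seed)
    folds = stratified_folds(y, k, seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Assumption: random sampling stands in for the Bayesian optimiser.
        params = {"offset": rng.uniform(-1, 1)}
        scores = []
        for val_idx in folds:
            val = set(val_idx)
            train_idx = [i for i in range(len(y)) if i not in val]
            y_pred = fit_predict(
                [x[i] for i in train_idx], [y[i] for i in train_idx],
                [x[i] for i in val_idx], params["offset"],
            )
            scores.append(f1_score([y[i] for i in val_idx], y_pred))
        if mean(scores) > best_score:
            best_params, best_score = params, mean(scores)
    return best_params, best_score


# Synthetic data: one noisy feature centred on the class label.
data_rng = random.Random(42)
y = [i % 2 for i in range(60)]
x = [data_rng.gauss(float(label), 0.5) for label in y]
best_params, best_score = search(x, y)
print(best_params, round(best_score, 3))
```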

---

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_selection import SimpleFilter, CorrelatedFilter, BayesSearchCV
from iguanas.metrics import FScore, JaccardSimilarity
from iguanas.rbs import RBSOptimiser, RBSPipeline
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.pipeline import LinearPipeline
from iguanas.pipeline.class_accessor import ClassAccessor
from iguanas.space import UniformFloat, UniformInteger, Choice

import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

## Read in data

Let's read in the famous Titanic data set and split it into training and test sets:

In [2]:
df = pd.read_csv('../../../examples/dummy_data/titanic.csv', index_col='PassengerId')
target_col = 'Survived'
cols_to_drop = ['Name', 'Ticket', 'Cabin']
X = df.drop([target_col] + cols_to_drop, axis=1)
y = df[target_col]

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Data processing

Let's apply the following simple steps to process the data:

* One-hot encode categorical variables (accounting for nulls)
* Impute nulls in numeric features with -1

In [4]:
# OHE
encoder = OneHotEncoder(
    use_cat_names=True
)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

# Impute
X_train.fillna(-1, inplace=True)
X_test.fillna(-1, inplace=True)

---

## Set up pipeline

Let's say that we want to apply the following processes as part of our pipeline:

1. Rule generation step
    * Use `RuleGeneratorDT` to generate rules using the processed data.
2. Rule processing step
    * Apply `SimpleFilter`.
    * Apply `CorrelatedFilter`.
3. Rule predictor step
    * Use the `RBSOptimiser` to optimise an `RBSPipeline` for F1 score. This will create a rule predictor.

However, we don't know what pipeline hyperparameter values will generate the best F1 score for the final rule predictor. This is where the `BayesSearchCV` class comes in - **it allows us to find the best pipeline hyperparameter values whilst also reducing the likelihood of overfitting.**

Let's first create the pipeline by instantiating the relevant classes:

In [5]:
f1 = FScore(beta=1)
js = JaccardSimilarity()

In [6]:
# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=4,
    tree_ensemble=RandomForestClassifier(
        n_estimators=10,
        random_state=0
    )
)
# Rule processing
simple_filterer = SimpleFilter(
    threshold=0.1, 
    operator='>=', 
    metric=f1.fit
)
corr_filterer = CorrelatedFilter(
    correlation_reduction_class=AgglomerativeClusteringReducer(
        threshold=0.9, 
        strategy='top_down', 
        similarity_function=js.fit, 
        metric=f1.fit
    )
)
# Rule predictor
rbs_pipeline = RBSPipeline(
    config=[],
    final_decision=0
)
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='corr_filterer', 
        class_attribute='rules_to_keep'
    ),
    n_iter=10
)

**Note:** the argument passed to the `pos_pred_rules` parameter in the `RBSOptimiser` class is a `ClassAccessor` object. This takes the names of the rules that remain after the `CorrelatedFilter` has been applied and passes them to the `pos_pred_rules` parameter of the `RBSOptimiser` class.
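
To make the idea of a deferred attribute lookup concrete, here is a minimal sketch. It is not how Iguanas implements `ClassAccessor`; `AttributeAccessor`, `ToyFilter`, `ToyPredictor`, and `run` are hypothetical names used only to show one way a later step could receive an attribute that an earlier step sets during fitting.

```python
class AttributeAccessor:
    """Hypothetical deferred lookup: 'attribute X of the step tagged Y'."""

    def __init__(self, class_tag, class_attribute):
        self.class_tag = class_tag
        self.class_attribute = class_attribute

    def get(self, steps):
        # Look up the tagged step, then read the attribute it set when fitted.
        step = dict(steps)[self.class_tag]
        return getattr(step, self.class_attribute)


class ToyFilter:
    """Toy filter step: keeps rule names longer than five characters."""

    def fit(self, rule_names):
        self.rules_to_keep = [r for r in rule_names if len(r) > 5]
        return self.rules_to_keep


class ToyPredictor:
    """Toy final step: records which rules it was told to use."""

    def __init__(self, pos_pred_rules):
        self.pos_pred_rules = pos_pred_rules

    def fit(self, rule_names):
        self.rules_used = list(self.pos_pred_rules)
        return self.rules_used


def run(steps, rule_names):
    """Fit steps in order, resolving accessors just before each step is fitted."""
    out = rule_names
    for _tag, step in steps:
        for name, value in list(vars(step).items()):
            if isinstance(value, AttributeAccessor):
                setattr(step, name, value.get(steps))
        out = step.fit(out)
    return out


steps = [
    ("filterer", ToyFilter()),
    ("predictor", ToyPredictor(AttributeAccessor("filterer", "rules_to_keep"))),
]
result = run(steps, ["Rule_1", "R2", "Rule_30"])
print(result)
```

The point of the indirection is that `rules_to_keep` does not exist when the predictor is instantiated; it only becomes available once the filter step has been fitted.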

Now we can create the steps of our pipeline. Each step should be a tuple of two elements:

1. The first element should be a string which refers to the step.
2. The second element should be the instantiated class which is run as part of the pipeline.

In [7]:
steps = [
    ('generator', generator),
    ('simple_filterer', simple_filterer),
    ('corr_filterer', corr_filterer),
    ('rbs_optimiser', rbs_optimiser)
]

Finally, we can instantiate our pipeline:

In [8]:
lp = LinearPipeline(steps=steps)

## Define the search space

Now we need to define the search space for each of the relevant parameters of our pipeline. To do this, we create a dictionary, where each key corresponds to the tag used for the relevant pipeline step. Each value should be a dictionary of the parameters (keys) and their search spaces (values). Search spaces should be defined using the classes in the `iguanas.space` module:

In [9]:
search_spaces = {
    'generator': {
        'n_total_conditions': UniformInteger(1, 5),
        'target_feat_corr_types': Choice([
            'Infer',
            None
        ])
    },
    'simple_filterer': {
        'threshold': UniformFloat(0, 1),
    },
    'corr_filterer': {
        'threshold': UniformFloat(0, 1)
    },    
}

Based on the search spaces above, we'll be optimising the following parameters across the following ranges:

* **generator**
    * `n_total_conditions`: Integers from 1 to 5
    * `target_feat_corr_types`: Either 'Infer' or None.
* **simple_filterer**
    * `threshold`: Floats from 0 to 1
* **corr_filterer**
    * `threshold`: Floats from 0 to 1
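
The nested layout (step tag, then parameter name) maps naturally onto flat `step__param` names, which is the naming the `cv_results` columns use later in this example. The sketch below shows that mapping in both directions. The helper names are hypothetical, and plain tuples stand in for the `iguanas.space` objects so that the snippet runs without Iguanas installed.

```python
def flatten_search_spaces(search_spaces, sep="__"):
    """Turn {step: {param: space}} into {'step__param': space}."""
    flat = {}
    for step_tag, params in search_spaces.items():
        for param_name, space in params.items():
            flat[f"{step_tag}{sep}{param_name}"] = space
    return flat


def unflatten_params(flat_params, sep="__"):
    """Inverse mapping: {'step__param': value} back to {step: {param: value}}."""
    nested = {}
    for key, value in flat_params.items():
        step_tag, param_name = key.split(sep, 1)
        nested.setdefault(step_tag, {})[param_name] = value
    return nested


# Placeholder tuples stand in for UniformInteger, Choice and UniformFloat.
search_spaces = {
    "generator": {
        "n_total_conditions": ("int", 1, 5),
        "target_feat_corr_types": ("choice", ["Infer", None]),
    },
    "simple_filterer": {"threshold": ("float", 0, 1)},
    "corr_filterer": {"threshold": ("float", 0, 1)},
}

flat = flatten_search_spaces(search_spaces)
print(sorted(flat))
```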

## Optimise the pipeline hyperparameters

Now that we have our pipeline and search spaces defined, we can instantiate the `BayesSearchCV` class. We'll split our data into 3 cross-validation datasets and try 15 different parameter sets:

In [35]:
bs = BayesSearchCV(
    pipeline=lp, 
    search_spaces=search_spaces, 
    metric=f1.fit, 
    cv=3, 
    n_iter=15,
    num_cores=3,
    error_score=0,
    verbose=1
)

Finally, we can run the `fit` method to optimise the hyperparameters of the pipeline:

In [36]:
bs.fit(X_train, y_train)

--- Optimising pipeline parameters ---
100%|██████████| 15/15 [00:24<00:00,  1.61s/trial, best loss: -0.6965755602560381]
--- Refitting on entire dataset with best pipeline ---


### Outputs

The `fit` method doesn't return anything. See the `Attributes` section in the class docstring for a description of each attribute generated:

In [37]:
bs.best_score

0.6965755602560381

In [38]:
bs.best_params

{'corr_filterer': {'threshold': 0.13637152094471683},
 'generator': {'target_feat_corr_types': 'Infer'},
 'simple_filterer': {'threshold': 0.6444172588081419}}

In [39]:
bs.best_index

14

In [40]:
bs.cv_results.head()

| | Params | corr_filterer__threshold | generator__n_total_conditions | generator__target_feat_corr_types | simple_filterer__threshold | FoldIdx | Scores | MeanScore | StdDevScore |
|---|---|---|---|---|---|---|---|---|---|
| 14 | {'corr_filterer': {'threshold': 0.136371520944... | 0.136372 | 5.0 | Infer | 0.644417 | [0, 1, 2] | [0.7092198581560283, 0.6842105263157895, 0.696... | 0.696576 | 0.010212 |
| 13 | {'corr_filterer': {'threshold': 0.709546423751... | 0.709546 | 2.0 | Infer | 0.388521 | [0, 1, 2] | [0.6783625730994152, 0.6153846153846154, 0.682... | 0.658794 | 0.030745 |
| 0 | {'corr_filterer': {'threshold': 0.291012698379... | 0.291013 | 2.0 | Infer | 0.486047 | [0, 1, 2] | [0.6359447004608295, 0.6355140186915889, 0.634... | 0.63546 | 0.00042 |
| 8 | {'corr_filterer': {'threshold': 0.110728226805... | 0.110728 | 2.0 | Infer | 0.510489 | [0, 1, 2] | [0.6359447004608295, 0.6355140186915889, 0.634... | 0.63546 | 0.00042 |
| 10 | {'corr_filterer': {'threshold': 0.684872215972... | 0.684872 | 2.0 | Infer | 0.369537 | [0, 1, 2] | [0.6120689655172414, 0.6534653465346535, 0.611... | 0.625548 | 0.019744 |
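
The rows above appear to be ordered by `MeanScore`, highest first, with `best_index` pointing at the top row. As a small sketch under that assumption, the snippet below reproduces the ordering from the five mean scores displayed by `head()`. The other ten trials are not shown in this example, so they are not included here.

```python
# Mean scores copied from the five rows displayed above (trial index -> MeanScore).
mean_scores = {0: 0.63546, 8: 0.63546, 10: 0.625548, 13: 0.658794, 14: 0.696576}

# The best trial is the one with the highest mean validation score.
best_index = max(mean_scores, key=mean_scores.get)

# Rank trials by descending mean score; ties keep their original order.
ranked = sorted(mean_scores, key=lambda i: -mean_scores[i])

print(best_index, ranked)
```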


## Apply the optimised pipeline

We can apply our optimised pipeline to a new data set and make a prediction using the `predict` method:

In [41]:
y_pred_test = bs.predict(X_test)

### Outputs

The `predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [42]:
y_pred_test

PassengerId
710    0
440    0
841    0
721    1
40     1
      ..
716    0
526    0
382    1
141    1
174    0
Name: Stage=0, Decision=1, Length: 295, dtype: int64

We can now calculate the F1 score of our optimised pipeline using the test data:

In [43]:
f1_opt = f1.fit(y_pred_test, y_test)

Let's compare this to our original, unoptimised pipeline:

In [44]:
lp.fit(X_train, y_train, None)
y_pred_test_init = lp.predict(X_test)

In [45]:
f1_init = f1.fit(y_pred_test_init, y_test)

In [46]:
print(f'Percentage improvement in F1 score: {round(100*(f1_opt-f1_init)/f1_init, 2)}%')

Percentage improvement in F1 score: 30.39%
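
The figure above is a relative change, `100 * (new - old) / old`, so it is undefined when the baseline score is zero. A small helper with that guard might look like the sketch below. `pct_improvement` is a hypothetical name, and the scores in the usage line are illustrative values, not results from this notebook.

```python
def pct_improvement(new_score, baseline_score, ndigits=2):
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("Baseline score is zero; relative improvement is undefined.")
    return round(100 * (new_score - baseline_score) / baseline_score, ndigits)


# Illustrative scores only (not taken from the run above).
example = pct_improvement(0.75, 0.6)
print(f"Percentage improvement in F1 score: {example}%")
```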


---