# Parallel Pipeline Example

A Parallel Pipeline is a set of steps which run independently - their outputs are then combined and returned. Each step should be an instantiated class with both `fit` and `transform` methods.

The diagram below shows the high-level structure of a Parallel Pipeline:

![title](images/parallel_pipeline_structure.png)
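To make the idea concrete, here is a minimal, library-free sketch of the pattern: every step is fitted and applied to the same input, and the outputs are merged at the end. The names `ThresholdRule` and `ToyParallelPipeline` are hypothetical and written only for illustration - they are not part of Iguanas, and the real `ParallelPipeline` may combine outputs differently.

```python
from typing import Dict, List

Columns = Dict[str, List[float]]


class ThresholdRule:
    """Toy step (hypothetical): flags rows where a feature exceeds its mean."""

    def __init__(self, feature: str, rule_name: str):
        self.feature = feature
        self.rule_name = rule_name

    def fit(self, X: Columns, y=None):
        values = X[self.feature]
        # "Learn" a threshold from the data: the mean of the feature
        self.threshold_ = sum(values) / len(values)
        return self

    def transform(self, X: Columns) -> Dict[str, List[int]]:
        flags = [int(v > self.threshold_) for v in X[self.feature]]
        return {self.rule_name: flags}


class ToyParallelPipeline:
    """Toy stand-in (hypothetical): runs steps independently, merges outputs."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X: Columns, y=None) -> Dict[str, List[int]]:
        combined = {}
        for _tag, step in self.steps:
            # Each step sees the original input, never another step's output
            combined.update(step.fit(X, y).transform(X))
        return combined


X = {"age": [10, 20, 60], "fare": [5, 50, 8]}
pp = ToyParallelPipeline(
    steps=[
        ("age_rule", ThresholdRule("age", "HighAge")),
        ("fare_rule", ThresholdRule("fare", "HighFare")),
    ]
)
combined = pp.fit_transform(X)
print(combined)  # {'HighAge': [0, 0, 1], 'HighFare': [0, 1, 0]}
```

Because neither step depends on the other, the order of the steps only affects the order of the combined columns, not their values.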

It is used to carry out multiple processes separately and combine their outputs. One such use case is when we want to both:

1. Generate a new rule set, and:
2. Optimise an existing rule set, then:
3. Use the combined rule sets to optimise a decision engine

An example of this workflow is shown below:

![title](images/parallel_pipeline_example.png)

The rule generation and rule optimisation steps would be added to a Parallel Pipeline, so they are run separately and their outputs are combined. This Parallel Pipeline would be added to a Linear Pipeline, along with the decision engine optimisation step, so that the output of the Parallel Pipeline is fed into the decision engine optimiser.

**We'll see how this workflow can be generated in the following example.**

---

## Import packages

In [1]:
from iguanas.rule_generation import RuleGeneratorDT
from iguanas.rule_optimisation import BayesianOptimiser
from iguanas.rule_selection import SimpleFilter, CorrelatedFilter
from iguanas.metrics import FScore, Precision, JaccardSimilarity
from iguanas.rbs import RBSOptimiser, RBSPipeline
from iguanas.correlation_reduction import AgglomerativeClusteringReducer
from iguanas.pipeline import LinearPipeline, ParallelPipeline, ClassAccessor
from iguanas.rules import Rules

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from category_encoders.one_hot import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

## Read in data

Let's read in the famous Titanic data set and split it into training and test sets:

In [2]:
df = pd.read_csv('../../../examples/dummy_data/titanic.csv', index_col='PassengerId')
target_col = 'Survived'
cols_to_drop = ['Name', 'Ticket', 'Cabin']
X = df.drop([target_col] + cols_to_drop, axis=1)
y = df[target_col]

In [3]:
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

## Data processing

Let's apply the following simple steps to process the data:
* One hot encode categorical variables (accounting for nulls)
* Impute numeric features with -1

In [4]:
# OHE
encoder = OneHotEncoder(
    use_cat_names=True
)
X_train = encoder.fit_transform(X_train_raw)
X_test = encoder.transform(X_test_raw)

# Impute
X_train.fillna(-1, inplace=True)
X_test.fillna(-1, inplace=True)



----

## Set up pipeline

To create the workflow shown at the beginning of the notebook, let's first assume we have the following existing rule set (stored in the standard Iguanas string format):

In [5]:
existing_rules = Rules(
    rule_strings = {
        'AgeRule': "(X['Age']>0)|(X['Age'].isna())"
    }
)

**Note:** these rules use the unprocessed data, as they contain conditions that look for null values. We'll need to use the unprocessed data when optimising these rules.
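The small sketch below illustrates why the imputed data would change the behaviour of this rule. It uses plain pandas and Python's `eval` purely for demonstration (this is an assumption made for the example, not a description of how Iguanas applies rules), and the three-row `Age` column is made up:

```python
import pandas as pd

# Same rule string as in the existing rule set above
rule_string = "(X['Age']>0)|(X['Age'].isna())"

# Made-up data: the second passenger has a missing age
X_raw = pd.DataFrame({"Age": [22.0, None, 35.0]})
X_imputed = X_raw.fillna(-1)


def apply_rule(rule: str, X: pd.DataFrame) -> list:
    # eval is used for illustration only; the rule string refers to `X`
    return eval(rule, {"X": X}).tolist()


raw_flags = apply_rule(rule_string, X_raw)
imputed_flags = apply_rule(rule_string, X_imputed)

print(raw_flags)      # [True, True, True]
print(imputed_flags)  # [True, False, True]
```

On the raw data the `isna()` condition catches the missing age. Once nulls are replaced with -1, neither condition is met for that row, so the rule no longer triggers.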

To use these rules in a rule optimiser, we need to convert them to the standard Iguanas lambda expression format:

In [6]:
existing_rule_lambdas = existing_rules.as_rule_lambdas(
    as_numpy=False, 
    with_kwargs=True
)

Now let's set up our Parallel Pipeline - this will consist of a rule generation step and a rule optimisation step. We'll optimise both our rules and our decision engine based on the **F1 score**:

In [7]:
f1 = FScore(beta=1)

# Rule generation
generator = RuleGeneratorDT(
    metric=f1.fit,
    n_total_conditions=4,
    tree_ensemble=RandomForestClassifier(
        n_estimators=10,
        random_state=0
    )
)
# Rule optimisation
optimiser = BayesianOptimiser(
    rule_lambdas=existing_rule_lambdas,
    lambda_kwargs=existing_rules.lambda_kwargs,
    metric=f1.fit,
    n_iter=5
)
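As a reminder of what is being optimised, the F-beta score combines precision and recall, and setting `beta=1` weights them equally. The helper below is a hand-rolled sketch for illustration (the function name `f_beta` and the labels are made up; it is not the Iguanas `FScore` implementation):

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

score = f_beta(y_true, y_pred, beta=1)
print(round(score, 4))  # 0.6667
```

Here precision is 0.75 and recall is 0.6, giving an F1 score of 2/3.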

In [8]:
pp = ParallelPipeline(
    steps=[
        ('generator', generator),
        ('optimiser', optimiser)
    ]
)

Now that we have our Parallel Pipeline defined, we can define our Linear Pipeline, the first step of which will be our Parallel Pipeline. This will run the rule generation and optimisation steps separately, combine their outputs, and feed them into the decision engine optimiser (our second step in the Linear Pipeline):

In [9]:
# Decision engine optimiser
rbs_pipeline = RBSPipeline(
    config=[], # Use an empty list here - the RBSOptimiser will create the config
    final_decision=0
)
rbs_optimiser = RBSOptimiser(
    pipeline=rbs_pipeline,
    metric=f1.fit, 
    pos_pred_rules=ClassAccessor(
        class_tag='pp', 
        class_attribute='rule_names'
    ),
    n_iter=10
)

In [10]:
lp = LinearPipeline(
    steps=[
        ('pp', pp),
        ('rbs_optimiser', rbs_optimiser)
    ]
)

**Note:** The argument passed to the `pos_pred_rules` parameter in the `RBSOptimiser` class is a `ClassAccessor` object. This takes the names of the rules that are present in the concatenated output produced by the `ParallelPipeline` and passes them to the `pos_pred_rules` parameter of the `RBSOptimiser` class.
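The idea behind this kind of accessor is deferred lookup: the rule names don't exist when the pipeline is defined, so we store *where* to find them and only read the value once the earlier step has been fitted. The sketch below shows the pattern in plain Python. `ToyAccessor`, its `resolve` method and `ToyStep` are hypothetical names chosen for this example and should not be read as the actual `ClassAccessor` interface:

```python
class ToyAccessor:
    """Hypothetical stand-in: remembers which attribute of which step to read."""

    def __init__(self, class_tag: str, class_attribute: str):
        self.class_tag = class_tag
        self.class_attribute = class_attribute

    def resolve(self, steps: dict):
        # Look the attribute up only when asked, i.e. after fitting
        return getattr(steps[self.class_tag], self.class_attribute)


class ToyStep:
    """Hypothetical step whose `rule_names` attribute only exists after fit."""

    def fit(self, X: dict):
        self.rule_names = list(X)
        return self


# The accessor is created before the step has been fitted
accessor = ToyAccessor(class_tag="pp", class_attribute="rule_names")
steps = {"pp": ToyStep()}

# At definition time there is nothing to read yet
assert not hasattr(steps["pp"], "rule_names")

# After fitting, the accessor can resolve the value
steps["pp"].fit({"RuleA": [1, 0], "RuleB": [0, 1]})
resolved = accessor.resolve(steps)
print(resolved)  # ['RuleA', 'RuleB']
```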

## Using the pipeline

### `fit` method

By running the `fit` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit` method is run. 

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [11]:
lp.fit(
    X={
        'generator': X_train,
        'optimiser': X_train_raw,
    }, 
    y=y_train, 
    sample_weight=None
)
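The sequencing described above, together with the dictionary-based routing of datasets, can be sketched with a toy pipeline in plain Python. All class names here (`Doubler`, `MeanThreshold`, `ToyLinearPipeline`) are hypothetical and exist only for this illustration - the real `LinearPipeline` is more general, for example in how it handles nested pipelines:

```python
class Doubler:
    """Toy intermediate step (hypothetical)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class MeanThreshold:
    """Toy final step (hypothetical): predicts 1 above the training mean."""

    def fit(self, X, y=None):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


class ToyLinearPipeline:
    def __init__(self, steps):
        self.steps = steps

    @staticmethod
    def _select(X, tag, previous):
        # If X is a dict with an entry for this step, use that dataset;
        # otherwise use the output of the previous step
        if isinstance(X, dict) and tag in X:
            return X[tag]
        return previous

    def fit(self, X, y=None):
        current = None if isinstance(X, dict) else X
        # fit_transform for every step except the last...
        for tag, step in self.steps[:-1]:
            current = step.fit_transform(self._select(X, tag, current), y)
        # ...and fit only for the last one
        last_tag, last_step = self.steps[-1]
        last_step.fit(self._select(X, last_tag, current), y)

    def predict(self, X):
        current = None if isinstance(X, dict) else X
        for tag, step in self.steps[:-1]:
            current = step.transform(self._select(X, tag, current))
        last_tag, last_step = self.steps[-1]
        return last_step.predict(self._select(X, last_tag, current))


toy_lp = ToyLinearPipeline(
    steps=[("doubler", Doubler()), ("classifier", MeanThreshold())]
)
toy_lp.fit(X={"doubler": [1, 2, 3, 6]})
preds = toy_lp.predict(X={"doubler": [1, 4]})
print(preds)  # [0, 1]
```

The doubled training values are `[2, 4, 6, 12]`, so the final step learns a threshold of 6; at prediction time the inputs become `[2, 8]`, and only the second exceeds it.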

#### Outputs

The `fit` method doesn't return anything. However, you can access the attributes of the fitted classes using the `get_params` method.

To see the rules that remain after the decision engine optimisation, we first need to extract the `rules_to_keep` attribute from the `rbs_optimiser` stage of the Linear Pipeline:

In [31]:
rules_to_keep = lp.get_params()['rbs_optimiser']['rules_to_keep']

In [None]:
lp.get_params()['generator']['rule_strings']

### `fit_predict` method

By running the `fit_predict` method, we sequentially run the `fit_transform` methods of each step in the pipeline, except for the last step, where the `fit_predict` method is run.

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [14]:
y_pred_train = lp.fit_predict(
    X={
        'generator': X_train,
        'optimiser': X_train_raw,
    }, 
    y=y_train, 
    sample_weight=None
)

#### Outputs

The `fit_predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [15]:
y_pred_train

0      1
1      1
2      1
3      1
4      1
      ..
591    1
592    1
593    1
594    1
595    1
Name: Stage=0, Decision=1, Length: 596, dtype: int64

### `predict` method

By running the `predict` method, we sequentially run the `transform` methods of each step in the pipeline, except for the last step, where the `predict` method is run. Note that before using this method, you should first run either the `fit` or `fit_predict` method.

**Note:** we need to pass the unprocessed data to the rule optimiser step - we can do this by feeding a dictionary to the parameter `X`, where the key of the dictionary corresponds to the step where the given dataset (value) should be passed:

In [21]:
y_pred_test = lp.predict(
    X={
        'generator': X_test,
        'optimiser': X_test_raw,
    }
)

#### Outputs

The `predict` method returns the prediction generated by the class in the final step of the pipeline - in this case, the `RBSOptimiser`:

In [22]:
y_pred_test

0      1
1      1
2      1
3      1
4      1
      ..
290    1
291    1
292    1
293    1
294    1
Name: Stage=0, Decision=1, Length: 295, dtype: int64

This approach is very powerful when optimising hyperparameters for the overall performance of a Rules-Based System - see the `BayesSearchCV` class in the `pipeline` module for more information.

---