# Metaheuristic course - Session 4

> For this session you will need to install the scikit-learn Python package (`pip install scikit-learn`), which is imported as `sklearn`.

In [1]:
from typing import List, Tuple, Union
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.svm import SVC

## Let's make a pipeline

A pipeline is a sequence of steps that processes a dataset and trains a model for a given task. In this session we will build simple pipelines from a few transformers and some classification estimators.

Use the `get_pipeline` function defined below to build a pipeline. The arguments of this function are:

- `pca_pos`: (Integer between 0 and 2, or -1) Position of the PCA in the pipeline. If -1, no PCA is used.
- `pca_n_components`: If PCA is used, this defines the `n_components` PCA hyperparameter.
- `normalizer_pos`: (Integer between 0 and 2, or -1) Position of the Normalizer in the pipeline. If -1, no Normalizer is used.
- `normalizer_norm`: If Normalizer is used, this defines the `norm` Normalizer hyperparameter.
- `standar_scaler_pos`: (Integer between 0 and 2, or -1) Position of the StandardScaler in the pipeline. If -1, no StandardScaler is used.
- `use_rfc`: (Boolean) Defines whether the RandomForestClassifier is used.
- `rfc_n_estimators`: If RFC is used, this defines the `n_estimators` hyperparameter.
- `rfc_max_depth`: If RFC is used, this defines the `max_depth` hyperparameter.
- `use_knc`: (Boolean) Defines whether the KNeighborsClassifier is used.
- `knc_n_neighbors`: If KNC is used, this defines the `n_neighbors` hyperparameter.
- `use_svc`: (Boolean) Defines whether the SVC is used.
- `svc_c`: If SVC is used, this defines the `C` hyperparameter.
- `svc_degree`: If SVC is used, this defines the `degree` hyperparameter. Only the `poly` kernel uses `degree`, and the pipeline keeps sklearn's default `rbf` kernel, so this value currently has no effect.

> The default values for the hyperparameters are the same as the ones defined by sklearn.
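The arguments above define a search space: each transformer is either absent or placed at a distinct position, and exactly one classifier is switched on. A minimal sketch of how a valid configuration could be sampled at random is shown below. The helper name `random_config` and all hyperparameter ranges are assumptions made for illustration, not part of the course material, and `pca_n_components` is left at its default so the sketch does not depend on the number of features in the dataset.

```python
import random
from typing import Any, Dict, Optional


def random_config(rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Sample keyword arguments for get_pipeline (hypothetical helper)."""
    rng = rng or random.Random()

    # Pick a random subset of transformers; their order in `chosen`
    # becomes their position, so positions are always distinct.
    transformers = ["pca_pos", "normalizer_pos", "standar_scaler_pos"]
    config: Dict[str, Any] = {name: -1 for name in transformers}
    chosen = rng.sample(transformers, k=rng.randint(0, 3))
    for position, name in enumerate(chosen):
        config[name] = position

    if config["normalizer_pos"] >= 0:
        config["normalizer_norm"] = rng.choice(["l1", "l2", "max"])

    # Switch on exactly one classifier and sample only its hyperparameters.
    # The ranges below are arbitrary illustrative choices.
    config.update({"use_rfc": False, "use_knc": False, "use_svc": False})
    classifier = rng.choice(["use_rfc", "use_knc", "use_svc"])
    config[classifier] = True
    if classifier == "use_rfc":
        config["rfc_n_estimators"] = rng.randint(10, 200)
        config["rfc_max_depth"] = rng.choice([None, 2, 4, 8])
    elif classifier == "use_knc":
        config["knc_n_neighbors"] = rng.randint(1, 15)
    else:
        config["svc_c"] = 10 ** rng.uniform(-2, 2)
        config["svc_degree"] = rng.randint(2, 5)

    return config


example = random_config(random.Random(0))
```

A dictionary produced this way can be unpacked directly into the function, as in `get_pipeline(**example)`.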

In [2]:
def get_pipeline(
    pca_pos: int = -1,
    pca_n_components=None,
    normalizer_pos: int = -1,
    normalizer_norm: str = "l2",
    standar_scaler_pos: int = -1,
    use_rfc: bool = False,
    rfc_n_estimators: int = 100,
    rfc_max_depth: Union[int, None] = None,
    use_knc: bool = False,
    knc_n_neighbors: int = 5,
    use_svc: bool = False,
    svc_c: float = 1.0,
    svc_degree: int = 3,
) -> Pipeline:
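    # One slot per possible position (0 to 2); unused slots stay None.
    pipeline: List[Union[None, Tuple]] = [None] * 4

    assert pca_pos == -1 or 0 <= pca_pos <= 2
    assert normalizer_pos == -1 or 0 <= normalizer_pos <= 2
    assert standar_scaler_pos == -1 or 0 <= standar_scaler_pos <= 2

    # Two transformers at the same position would silently overwrite each other.
    used_positions = [p for p in (pca_pos, normalizer_pos, standar_scaler_pos) if p >= 0]
    assert len(used_positions) == len(set(used_positions)), "Transformer positions must be distinct"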

    if pca_pos >= 0:
        pipeline[pca_pos] = ("pca", PCA(n_components=pca_n_components))

    if normalizer_pos >= 0:
        pipeline[normalizer_pos] = ("normalizer", Normalizer(norm=normalizer_norm))

    if standar_scaler_pos >= 0:
        pipeline[standar_scaler_pos] = ("scaler", StandardScaler())

    # Drop the unused slots. The remaining transformers keep their relative order.
    pipeline = [item for item in pipeline if item is not None]

    assert sum((use_knc, use_rfc, use_svc)) == 1, "Exactly one classifier must be defined"

    if use_rfc:
        pipeline.append(
            (
                "rdf",
                RandomForestClassifier(
                    n_estimators=rfc_n_estimators,
                    max_depth=rfc_max_depth,
                ),
            )
        )
    if use_knc:
        pipeline.append(
            (
                "knc",
                KNeighborsClassifier(
                    n_neighbors=knc_n_neighbors,
                ),
            )
        )
    if use_svc:
        pipeline.append(
            (
                "svc",
                SVC(
                    C=svc_c,
                    degree=svc_degree,
                ),
            )
        )

    return Pipeline(pipeline)

For example, let's define some pipelines:

In [3]:
get_pipeline(use_knc=True)

In [4]:
get_pipeline(pca_pos=1, normalizer_pos=0, use_rfc=True, rfc_n_estimators=10)

Once you have built your pipeline, you can train and test the estimator on a dataset. For example, let's use sklearn's iris dataset:

In [5]:
iris_ds = load_iris()
X, y = iris_ds["data"], iris_ds["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)

p2 = get_pipeline(pca_pos=1, normalizer_pos=0, use_rfc=True, rfc_n_estimators=10)
p2.fit(X_train, y_train)
p2.score(X_test, y_test)

0.9245283018867925

As you can see, this sequence of steps (pipeline) predicts the iris data very well. However, this is a small and very simple dataset, and the exact score changes between runs because both the train/test split and the random forest are randomized. You can find more information about the sklearn toy datasets [here](https://scikit-learn.org/stable/datasets/toy_dataset.html).
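Since a single train/test split gives a noisy score, comparing pipelines may be easier with an averaged estimate. The sketch below is one possible way to do that with sklearn's `cross_val_score`. The helper name `evaluate`, the choice of 5 folds, and `random_state=0` are assumptions for illustration, and the pipeline is built by hand here (mirroring the one above) only so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer


def evaluate(pipeline: Pipeline, X, y, folds: int = 5) -> float:
    """Mean accuracy over `folds` cross-validation folds (hypothetical helper)."""
    # For a classifier and an integer `cv`, sklearn uses stratified folds
    # and the estimator's default score, which is accuracy.
    scores = cross_val_score(pipeline, X, y, cv=folds)
    return float(scores.mean())


X, y = load_iris(return_X_y=True)

# Same steps as the pipeline above, with a fixed seed for the forest.
pipeline = Pipeline(
    [
        ("normalizer", Normalizer()),
        ("pca", PCA()),
        ("rdf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)

mean_accuracy = evaluate(pipeline, X, y)
print(mean_accuracy)
```

In the notebook, any pipeline returned by `get_pipeline` could be passed to `evaluate` in the same way.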

So, the question is: given a dataset, what is the best pipeline you can build for a given task (e.g. classification)?

Your task today is to build a heuristic that finds that pipeline using the tools of this notebook :)
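As a possible starting point, the sketch below shows a plain random search, which is often used as a baseline before trying a real metaheuristic such as local search or simulated annealing. It is not the intended solution, only an illustration. The names `random_search`, `toy_sample`, and `toy_score` are hypothetical, and the toy functions stand in for "sample valid `get_pipeline` arguments" and "train the pipeline and return its score" so that the block runs without a dataset.

```python
import random
from typing import Any, Callable, Dict, Tuple

Config = Dict[str, Any]


def random_search(
    sample: Callable[[random.Random], Config],
    score: Callable[[Config], float],
    iterations: int = 50,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Evaluate random configurations and keep the best one seen."""
    rng = random.Random(seed)
    best_config = sample(rng)
    best_score = score(best_config)
    for _ in range(iterations - 1):
        candidate = sample(rng)
        candidate_score = score(candidate)
        # Higher scores are better (e.g. accuracy).
        if candidate_score > best_score:
            best_config, best_score = candidate, candidate_score
    return best_config, best_score


# Toy stand-ins (assumptions): a one-parameter search space and an
# artificial score that peaks at knc_n_neighbors == 7.
def toy_sample(rng: random.Random) -> Config:
    return {"use_knc": True, "knc_n_neighbors": rng.randint(1, 15)}


def toy_score(config: Config) -> float:
    return -abs(config["knc_n_neighbors"] - 7)


best_config, best_score = random_search(toy_sample, toy_score, iterations=100)
print(best_config, best_score)
```

In the notebook, `score` could instead build the pipeline with `get_pipeline(**config)`, fit it on `X_train, y_train`, and return its score on held-out data. A metaheuristic would replace the independent sampling with moves that modify the current best configuration.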

In [6]:
# Write your code here