# **Customize your pipeline**

In this example, we'll look at how you can customize individual analysis steps by assigning 
a "type" to each column (`X_types`), using the `penguins` example dataset.

In [1]:
# Authors: Vincent Küppers <v.kueppers@fz-juelich.de>
#          Hanwen Bi <h.bi@fz-juelich.de>
#
# License: AGPL

from seaborn import load_dataset
from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation

Load the dataset and remove rows with missing values, then define the features and the target.

In [2]:
df_penguins = load_dataset("penguins")
df_penguins = df_penguins.dropna().reset_index(drop=True)
df_penguins = df_penguins.replace({"sex": {"Female": 1, "Male": 2}})

df_penguins.head()

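|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | 2 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | 1 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | 1 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | 1 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | 2 |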


In [3]:
X = df_penguins.iloc[:, 2:].columns.tolist()
y = "species"

Define custom types for the columns of the input features (`X`).  
We will use these types in the `PipelineCreator` to specify which columns each step receives.  
To address all features in the `PipelineCreator`, use `*`.  

**Important:** if you define `X_types`, you also need to be explicit with `apply_to`. In the `PipelineCreator`, `apply_to` sets which features a step or the model (here `svm`) is applied to. If `apply_to` is not given, every processing step (including the final model) is applied only to features without a defined type, which are treated as `continuous`. By default, the output of PCA is of type `continuous`.
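
To make the selection rule concrete, here is a minimal pure-Python sketch of how an `apply_to` value could map to column names. The helper `resolve_apply_to` is hypothetical and is not part of julearn; it only encodes what the text above states, namely that columns without an entry in `X_types` count as `continuous` and that `*` selects every column.

```python
def resolve_apply_to(X, X_types, apply_to="continuous"):
    """Return the columns of X selected by apply_to (illustrative only)."""
    # Map every typed column to its type name.
    column_type = {col: t for t, cols in X_types.items() for col in cols}
    # "*" addresses all features, whatever their type.
    if apply_to == "*":
        return list(X)
    wanted = [apply_to] if isinstance(apply_to, str) else list(apply_to)
    # Columns without an explicit type fall back to "continuous"
    # (assumption taken from the text above).
    return [col for col in X if column_type.get(col, "continuous") in wanted]


X = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]
X_types = {
    "bill": ["bill_length_mm", "bill_depth_mm"],
    "body": ["flipper_length_mm", "body_mass_g"],
    "our_confound": ["sex"],
}

bill_cols = resolve_apply_to(X, X_types, "bill")
all_cols = resolve_apply_to(X, X_types, "*")
default_cols = resolve_apply_to(X, X_types)
print(bill_cols)
print(all_cols)
print(default_cols)
```

In this sketch the default selection comes back empty because every column has been given a custom type, which illustrates why each step needs an explicit `apply_to` once `X_types` is defined.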

In [4]:
X_types = {
    "bill": ["bill_length_mm", "bill_depth_mm"],
    "body": ["flipper_length_mm", "body_mass_g"],
    "our_confound": ["sex"]
}

creator_1 = PipelineCreator(problem_type="classification", apply_to="*")

creator_1.add("zscore", apply_to="*")
creator_1.add("pca", apply_to=["bill"], n_components=1)
creator_1.add("svm")

scores_1, model_1 = run_cross_validation(
    X=X,
    y=y,
    data=df_penguins,
    X_types=X_types,
    model=creator_1,
    return_estimator="final",
)
print(scores_1['test_score'])

0    1.000000
1    0.940299
2    0.850746
3    0.969697
4    1.000000
Name: test_score, dtype: float64


We can also restrict z-scoring to one of the `X_types` defined above. Additionally, we will min-max scale the `body` columns (`scaler_minmax`).
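
As a reminder of what the two scalings do, the following sketch applies both to the first five flipper lengths from the table above, in plain Python. It assumes the z-score uses the population standard deviation, as scikit-learn's `StandardScaler` does. Because the statistics come from five rows only, the numbers will not match the pipeline output, which is fitted on the full data.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# First five flipper lengths (mm) from the table above.
flipper = [181.0, 186.0, 195.0, 193.0, 190.0]

flipper_z = zscore(flipper)
flipper_mm = minmax(flipper)
print([round(v, 3) for v in flipper_z])
print([round(v, 3) for v in flipper_mm])
```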

In [5]:
creator_2 = PipelineCreator(problem_type="classification", apply_to="*")

creator_2.add("zscore", apply_to="bill")
creator_2.add("scaler_minmax", apply_to="body")
creator_2.add("remove_confound", apply_to=["bill", "body"], confounds=["our_confound"])
creator_2.add("svm")

scores_2, model_2 = run_cross_validation(
    X=X,
    y=y,
    data=df_penguins,
    X_types=X_types,
    model=creator_2,
    return_estimator="final",
)
print(scores_2['test_score'])

0    1.000000
1    0.985075
2    0.910448
3    0.984848
4    1.000000
Name: test_score, dtype: float64
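
The `remove_confound` step in `creator_2` may be the least familiar of the three. Conceptually, confound removal regresses each feature on the confound and keeps only the residuals. The sketch below shows that idea for a single confound with ordinary least squares in plain Python. The function name `residualize` is hypothetical, and julearn's actual step may differ in details such as the regression model it uses.

```python
from statistics import fmean


def residualize(feature, confound):
    """Remove the linear effect of one confound from one feature."""
    mx, my = fmean(confound), fmean(feature)
    # Ordinary least-squares slope and intercept of feature ~ confound.
    sxx = sum((x - mx) ** 2 for x in confound)
    sxy = sum((x - mx) * (y - my) for x, y in zip(confound, feature))
    slope = sxy / sxx
    intercept = my - slope * mx
    # The residuals are the part of the feature the confound cannot explain.
    return [y - (intercept + slope * x) for x, y in zip(confound, feature)]


# First five rows from the table above: body mass (g) and coded sex.
body_mass = [3750.0, 3800.0, 3250.0, 3450.0, 3650.0]
sex = [2, 1, 1, 1, 2]

residuals = residualize(body_mass, sex)
print([round(r, 1) for r in residuals])
```

After this step the residuals have zero mean and no remaining linear relationship with the confound, so the model cannot use `sex` indirectly through the body features.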


Now, let's compare both preprocessing pipelines.

In [6]:
from julearn.inspect import preprocess
help(preprocess)

Help on function preprocess in module julearn.inspect.preprocess:

preprocess(pipeline, X, data, until=None, with_column_types=False)



By setting the parameter `until` to the name of a pipeline step, you can see how the variables have been transformed up to and including that step.
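
The "up to and including" behaviour can be pictured with a toy pipeline. This is only a sketch of the idea, not how `preprocess` is implemented: `apply_until` and the two step names are hypothetical stand-ins for real transformers.

```python
def apply_until(steps, value, until=None):
    """Apply named steps in order, stopping after the step called `until`."""
    names = [name for name, _ in steps]
    if until is not None and until not in names:
        raise ValueError(f"unknown step: {until!r}")
    for name, func in steps:
        value = func(value)
        if name == until:
            break  # the named step itself has been applied (inclusive)
    return value


# Two toy steps standing in for real transformers.
steps = [
    ("center", lambda xs: [x - sum(xs) / len(xs) for x in xs]),
    ("double", lambda xs: [2 * x for x in xs]),
]

after_center = apply_until(steps, [1.0, 2.0, 3.0], until="center")
after_all = apply_until(steps, [1.0, 2.0, 3.0])
print(after_center)  # [-1.0, 0.0, 1.0]
print(after_all)     # [-2.0, 0.0, 2.0]
```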

In [10]:
print('variables before pipeline \n', df_penguins[X].head())
print('variables after zscore \n', preprocess(model_1, X, df_penguins, until='zscore').head())
print('variables after pca \n', preprocess(model_1, X, df_penguins, until='pca').head())

variables before pipeline 
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0            39.1           18.7              181.0       3750.0    2
1            39.5           17.4              186.0       3800.0    1
2            40.3           18.0              195.0       3250.0    1
3            36.7           19.3              193.0       3450.0    1
4            39.3           20.6              190.0       3650.0    2
variables after zscore 
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g       sex
0       -0.896042       0.780732          -1.426752    -0.568475  0.991031
1       -0.822788       0.119584          -1.069474    -0.506286 -1.009050
2       -0.676280       0.424729          -0.426373    -1.190361 -1.009050
3       -1.335566       1.085877          -0.569284    -0.941606 -1.009050
4       -0.859415       1.747026          -0.783651    -0.692852  0.991031
variables after pca 
        pca0  flipper_length_mm  body_mass_g       sex
0

We can also look at the data after confound removal.

In [12]:
print('variables before pipeline \n', df_penguins[X].head())
print('variables after zscore \n', preprocess(model_2, X, df_penguins, until='zscore').head())
print('variables after remove_confound \n', preprocess(model_2, X, df_penguins, until='remove_confound').head())

variables before pipeline 
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0            39.1           18.7              181.0       3750.0    2
1            39.5           17.4              186.0       3800.0    1
2            40.3           18.0              195.0       3250.0    1
3            36.7           19.3              193.0       3450.0    1
4            39.3           20.6              190.0       3650.0    2
variables after zscore 
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0       -0.896042       0.780732              181.0       3750.0  2.0
1       -0.822788       0.119584              186.0       3800.0  1.0
2       -0.676280       0.424729              195.0       3250.0  1.0
3       -1.335566       1.085877              193.0       3450.0  1.0
4       -0.859415       1.747026              190.0       3650.0  2.0
variables after remove_confound 
    flipper_length_mm  body_mass_g  bill_length_mm  bill_depth_mm
0          -0.398406