# PandasPipeline & PandasTransformer

<div class="alert alert-info">
    
**This notebook explains how to use our pipeline and transformer wrappers.**

    

The [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) Pipeline is a convenient way to refactor preprocessing code and make it production-ready.
   
However, when you pass a pandas DataFrame to a Pipeline, it returns a numpy array.
PandasPipeline is a wrapper around the standard [scikit-learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that returns a pandas DataFrame instead.
This gives you more control over the preprocessing steps. For example, if one step creates a new column, a later step can preprocess that newly created column.
    
</div>
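
To make the motivation concrete, here is a minimal sketch of the behaviour described above, using only scikit-learn and pandas. The toy DataFrame and the choice of `StandardScaler` are assumptions made for illustration; they are not part of this package. It assumes scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for this illustration.
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

plain_pipeline = Pipeline(steps=[("scale", StandardScaler())])
result = plain_pipeline.fit_transform(toy_df)

# With the default output configuration, the result is a numpy array,
# so the column names "a" and "b" are no longer attached to the data.
print(type(result))
print(isinstance(result, np.ndarray))
```

Once the column names are gone, a later step can no longer refer to a column by name, which is the gap PandasPipeline is meant to fill.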

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#How-to-use-PandasPipeline" data-toc-modified-id="How-to-use-PandasPipeline-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>How to use PandasPipeline</a></span><ul class="toc-item"><li><span><a href="#SelectColumnsTransformer" data-toc-modified-id="SelectColumnsTransformer-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SelectColumnsTransformer</a></span></li><li><span><a href="#DropColumnsTransformer" data-toc-modified-id="DropColumnsTransformer-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>DropColumnsTransformer</a></span></li><li><span><a href="#EncoderTransformer" data-toc-modified-id="EncoderTransformer-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>EncoderTransformer</a></span></li><li><span><a href="#FunctionTransformer" data-toc-modified-id="FunctionTransformer-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>FunctionTransformer</a></span><ul class="toc-item"><li><span><a href="#apply_by_multiprocessing-mode" data-toc-modified-id="apply_by_multiprocessing-mode-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>apply_by_multiprocessing mode</a></span></li><li><span><a href="#apply" data-toc-modified-id="apply-2.4.2"><span class="toc-item-num">2.4.2&nbsp;&nbsp;</span>apply</a></span></li><li><span><a href="#vectorized-mode" data-toc-modified-id="vectorized-mode-2.4.3"><span class="toc-item-num">2.4.3&nbsp;&nbsp;</span>vectorized mode</a></span></li></ul></li></ul></li></ul></div>

---

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import sys
from sklearn.pipeline import Pipeline

sys.path.append("..")

In [None]:
from john_toolbox.preprocessing.pandas_transformers import (
    SelectColumnsTransformer,
    DebugTransformer,
    DropColumnsTransformer,
    EncoderTransformer,
    FunctionTransformer
)

from john_toolbox.preprocessing.pandas_pipeline import (
    PandasPipeline
)

<div class="alert alert-danger">
    
Set the level to `logging.DEBUG` if you want to see detailed logs.

</div>
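
As a rough illustration of what changing the level does, the sketch below uses only the standard `logging` module. The logger name `demo_logger` is hypothetical and unrelated to the loggers defined in `john_toolbox`.

```python
import logging

# Hypothetical logger name, used only for this illustration.
demo_logger = logging.getLogger("demo_logger")

# At INFO level, DEBUG messages are filtered out.
demo_logger.setLevel(logging.INFO)
info_allows_debug = demo_logger.isEnabledFor(logging.DEBUG)

# At DEBUG level, DEBUG messages are let through.
demo_logger.setLevel(logging.DEBUG)
debug_allows_debug = demo_logger.isEnabledFor(logging.DEBUG)

print(info_allows_debug, debug_allows_debug)
```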

In [None]:
from john_toolbox.utils.logger_config import loggers
import logging

for logger in loggers:
    loggers[logger].setLevel(logging.INFO)  # set to logging.DEBUG for debugging


## Load data

---

In [None]:
df = pd.read_csv("../tests/multi_class_dataset.csv")

In [None]:
df.shape

In [None]:
df.head()

## How to use PandasPipeline

<div class="alert alert-info">
    
**You need to define the steps using pandas transformers.**
<br>
    

The package implements the following pandas transformers:
* **SelectColumnsTransformer** : keeps only the listed columns
* **DropColumnsTransformer** : drops one or more columns
* **EncoderTransformer** : wraps a scikit-learn encoder such as LabelEncoder
* **FunctionTransformer** : applies a user-supplied function to transform a column

</div>
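
To give an idea of what a "pandas transformer" can look like, here is a minimal sketch of a column-selecting transformer that returns a DataFrame. The class name `KeepColumns` is hypothetical, and this is not the implementation used by `SelectColumnsTransformer`; it only illustrates the general pattern of following the scikit-learn `fit`/`transform` interface while keeping pandas objects.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class KeepColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keeps only the listed columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; present to satisfy the scikit-learn interface.
        return self

    def transform(self, X):
        # Indexing with a list returns a DataFrame, so column names survive.
        return X[self.columns].copy()


# Toy data, made up for this illustration.
toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
selected = KeepColumns(columns=["a", "c"]).fit_transform(toy_df)
print(selected.columns.tolist())
```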

---

### SelectColumnsTransformer

In [None]:
steps = [
    (
        "select_column", SelectColumnsTransformer(
            columns=["formation", "contenu"])
    ),
]

In [None]:
pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)

In [None]:
tmp_df.head()

### DropColumnsTransformer

In [None]:
steps = [
    (
        "drop_column", DropColumnsTransformer(
            columns_to_drop=["nature", "solution"])
    ),
]

In [None]:
pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)

In [None]:
tmp_df.head()

### EncoderTransformer

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
steps = [
    (
        'ohe', EncoderTransformer(
            encoder=OneHotEncoder,
            column="formation",
            new_cols_prefix="ohe",
            is_drop_input_col=False,
        )
    )
]

In [None]:
pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)

In [None]:
tmp_df.head()

### FunctionTransformer 

The FunctionTransformer class applies a function as a preprocessing step.
It supports three modes:
- apply_by_multiprocessing : applies a function using all CPU cores
- apply : applies a function to a single column
- vectorized : runs a vectorized pandas operation
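
The difference between the apply and vectorized styles can be sketched in plain pandas. This does not show how FunctionTransformer dispatches internally; the toy data and the `add_suffix` helper are assumptions made for illustration, and the multiprocessing variant is left out.

```python
import pandas as pd

# Toy data, made up for this illustration.
toy_df = pd.DataFrame({"formation": ["python", "sql", "docker"]})


def add_suffix(value, suffix):
    # Works on one scalar value at a time.
    return value + suffix


# "apply" style: the function is called once per value in the column.
applied = toy_df["formation"].apply(add_suffix, suffix="_x")

# "vectorized" style: one expression operates on the whole column at once.
vectorized = toy_df["formation"] + "_x"

print(applied.tolist() == vectorized.tolist())
```

Both styles give the same values here. The vectorized form is usually preferable when pandas can express the operation directly, while apply is the fallback for arbitrary Python functions.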
    

#### apply_by_multiprocessing mode

In [None]:
def add_prefix(x, prefix):
    # Note: despite its name, this appends `prefix` to the end of `x`.
    return x + prefix

steps = [
    (
        "lambda_func_by_multiprocessing",
        FunctionTransformer(
            column="formation",
            mode="apply_by_multiprocessing",
            func=add_prefix,
            dict_args={
                "prefix": "_prefix"
            },
        ),
    ),
]

pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)

#### apply

In [None]:
def add_prefix(x, prefix):
    # Note: despite its name, this appends `prefix` to the end of `x`.
    return x + prefix


steps = [
    (
        "lambda_func",
        FunctionTransformer(
            column="formation",
            mode="apply",
            func=add_prefix,
            dict_args={
                "prefix": "_prefix"
            },
        ),
    ),
]

pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)

#### vectorized mode

In [None]:
def add_prefix(df, prefix):
    # Operates on the whole "formation" column at once; appends `prefix` to each value.
    return df["formation"] + prefix


steps = [
    (
        "vectorized_func",
        FunctionTransformer(
            column=None,
            mode="vectorized",
            func=add_prefix,
            dict_args={
                "prefix": "_prefix"
            },
        ),
    ),
]


pipeline = PandasPipeline(
    steps=steps,
    target_name="formation",
    verbose=True
)

In [None]:
tmp_df = pipeline.fit_transform(df)