# 1 - Method chaining and pipelines with Pandas 

<b>Tutorial by:</b> Aleide Hoeijmakers  <b>Date:</b> 18-09-2019

In [1]:
import os
tmp = os.getcwd()
os.chdir(tmp.split("data-science-core")[0] + "data-science-core")

In [2]:
# import packages
import numpy   as np
import pandas  as pd
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.datasets import load_boston
from neuropy.anomaly_detection import univariate_outlier_removal
from neuropy.utils import chain_snap

%matplotlib inline

In [3]:
# Import boston dataset (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston = load_boston()
boston_df = pd.DataFrame(data=np.c_[boston['data'], boston['target']], columns=list(boston['feature_names']) + ['target'])
boston_df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


### Method Chaining

Data usually has to be preprocessed before we can plot it or train a model on it. Pandas lets us chain multiple methods, so that each step operates on the result of the previous one. As an example, we apply a series of preprocessing steps to the Boston dataset in a single chain.

In [4]:
prep_chain_df = (
    boston_df
    .rename(columns=str.lower)
    .drop('chas', axis=1)
    .assign(rad_2=lambda d: d['rad'] + 2,
            timestamp=lambda d: datetime.now())
    .set_index('timestamp')
    .dropna()
    .reset_index()
)

In [5]:
prep_chain_df.head()

,timestamp,crim,zn,indus,nox,rm,age,dis,rad,tax,ptratio,b,lstat,target,rad_2
0,2021-02-08 16:23:55.124507,0.00632,18.0,2.31,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,3.0
1,2021-02-08 16:23:55.124507,0.02731,0.0,7.07,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,4.0
2,2021-02-08 16:23:55.124507,0.02729,0.0,7.07,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,4.0
3,2021-02-08 16:23:55.124507,0.03237,0.0,2.18,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,5.0
4,2021-02-08 16:23:55.124507,0.06905,0.0,2.18,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,5.0
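
For contrast, here is a minimal sketch of the same kind of steps written with and without chaining. It uses a small made-up frame (`toy_df`, an assumption for illustration) rather than the Boston data, so it runs on its own.

```python
import pandas as pd

# Toy frame standing in for boston_df (values are made up for illustration)
toy_df = pd.DataFrame({"RAD": [1.0, 2.0, 3.0], "CHAS": [0.0, 1.0, 0.0]})

# Step-by-step version: every intermediate result needs its own name
step1 = toy_df.rename(columns=str.lower)
step2 = step1.drop("chas", axis=1)
stepwise_df = step2.assign(rad_2=lambda d: d["rad"] + 2)

# Chained version: the same three steps, read top to bottom
chained_df = (
    toy_df
    .rename(columns=str.lower)
    .drop("chas", axis=1)
    .assign(rad_2=lambda d: d["rad"] + 2)
)

# Both versions produce the same frame
print(chained_df.equals(stepwise_df))
```

The chained version avoids the throwaway names `step1` and `step2`, which is where stale intermediate variables tend to cause mistakes in notebooks.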


### Pipelines with chaining

Even better is to have separate functions for the different aspects of the data processing.

There are multiple reasons for this:
- We can give the functions descriptive names, which gives us a better grasp of what is happening and when.
- If there is ever a bug, the pipeline makes it easier to figure out where it is, since every step is merely a function.
- We can write tests for these small pipeline steps to check for expected behavior.
- We can automate logging a bit (more on this later).
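
As a rough illustration of the testing point, a small pipeline step can be checked against a tiny hand-made frame. The function `lowercase_column_names` and the test below are hypothetical names used only for this sketch.

```python
import pandas as pd


def lowercase_column_names(df):
    # Hypothetical pipeline step: lowercase every column name
    return df.rename(columns=str.lower)


def test_lowercase_column_names():
    raw = pd.DataFrame({"RAD": [1, 2], "Tax": [296, 242]})
    result = lowercase_column_names(raw)
    # The step should lowercase every column name...
    assert list(result.columns) == ["rad", "tax"]
    # ...keep the values intact...
    assert result["rad"].tolist() == [1, 2]
    # ...and leave the input frame untouched
    assert list(raw.columns) == ["RAD", "Tax"]


# Run the test directly; a test runner such as pytest would also pick it up
test_lowercase_column_names()
```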

In [6]:
def strip_column_names(df):
    # Despite the name, this step lowercases the column names
    return df.rename(columns=str.lower)

In [7]:
def assign_new_features(df):
    return df.assign(rad_2=lambda d: d['rad'] + 2,
                     timestamp=lambda d: datetime.now())

In [8]:
def remove_outliers(df):
    # univariate_outlier_removal returns three values; only the first is kept
    df, _, _ = univariate_outlier_removal(df)
    return df

In [9]:
prep_pipe_df = (
    boston_df
    .pipe(strip_column_names)
    .drop('chas', axis=1)
    .pipe(assign_new_features)
    .set_index('timestamp')
    .pipe(remove_outliers)
    .pipe(chain_snap)
    .dropna()
    .pipe(chain_snap)
    .reset_index()
)

(506, 14)
(506, 14)
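
`chain_snap` is imported from `neuropy`, and its implementation is not shown in this notebook. Judging from the two `(506, 14)` lines it printed, it presumably reports the shape of the frame and passes the frame on unchanged. A minimal stand-in under that assumption (the name `print_shape` is hypothetical) could look like this:

```python
import pandas as pd


def print_shape(df):
    # Hypothetical stand-in: print the current shape, return the frame unchanged
    print(df.shape)
    return df


# Made-up frame with one missing value
toy_df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})

clean_df = (
    toy_df
    .pipe(print_shape)   # prints (3, 2)
    .dropna()
    .pipe(print_shape)   # prints (2, 2)
)
```

Because the helper returns its input, it can be dropped in between any two steps without affecting the result.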


As you can see, all the methods are applied in one go. The chain acts as a pipeline that starts by renaming the columns and ends by resetting the index.

<b>Pros:</b>
- Method chaining enhances readability: instead of spreading these steps over separate cells, we need only one. (I normally package my functions so that they can be imported rather than defined in the notebook itself.)
- Because every step is visible in one place, we are more aware of what we are doing to the data, which lowers the chance of mistakes.
- We did not change the original dataset!

You might think this style of writing code is limiting, but nearly every common operation is available as a chainable method:

- add/overwrite columns `.assign()`
- filter rows `.loc[]`
- make a grouped object `.groupby()`
- shorthand aggregation for groupby `.agg()`
- general aggregation for groupby `.apply()`
- sorting rows `.sort_values()`
- reset the index `.reset_index()`
- select top/bottom rows `.head()/.tail()`
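
The sketch below strings several of the methods listed above into one chain. The data and column names (`sales_df`, `region`, `units`, `price`) are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only
sales_df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "units": [10, 3, 7, 8, 0],
    "price": [2.0, 5.0, 2.0, 5.0, 4.0],
})

summary_df = (
    sales_df
    .assign(revenue=lambda d: d["units"] * d["price"])   # add a column
    .loc[lambda d: d["units"] > 0]                        # filter rows
    .groupby("region")                                    # make a grouped object
    .agg(total_revenue=("revenue", "sum"))                # aggregate per group
    .sort_values("total_revenue", ascending=False)        # sort rows
    .reset_index()                                        # region back to a column
    .head(2)                                              # keep the top rows
)

print(summary_df)
```

Passing a callable to `.loc[]` is what keeps the filter chainable: the lambda receives the frame as it exists at that point in the chain, including the freshly assigned `revenue` column.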

Note that, by default, none of these methods modify the original dataset; each one returns a new object.
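
A quick way to convince yourself of this is to keep a copy of the input and compare it after running a chain. The frame below is made up for illustration.

```python
import pandas as pd

original_df = pd.DataFrame({"RAD": [1.0, 2.0], "TAX": [296.0, 242.0]})
snapshot = original_df.copy()   # copy taken before the chain runs

result_df = (
    original_df
    .rename(columns=str.lower)
    .assign(rad_2=lambda d: d["rad"] + 2)
    .sort_values("tax")
)

# The input still matches the snapshot: same labels, same values
print(original_df.equals(snapshot))
# The result carries the renamed and added columns
print(list(result_df.columns))
```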

<font color='red'>Con:</font>
- Debugging chains can become harder. The solution is to add decorators with logging functionality. More on this can be found in the next tutorial: 2_logger_decorator.ipynb
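
As a rough preview of the idea, and not necessarily the approach taken in the next tutorial, a logging decorator could be sketched as follows. The names `log_step` and `drop_missing` are hypothetical.

```python
import functools

import pandas as pd


def log_step(func):
    # Hypothetical decorator: report the step name and the resulting shape
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        print(f"{func.__name__}: shape={result.shape}")
        return result
    return wrapper


@log_step
def drop_missing(df):
    return df.dropna()


toy_df = pd.DataFrame({"a": [1.0, None, 3.0]})
clean_df = toy_df.pipe(drop_missing)   # prints "drop_missing: shape=(2, 1)"
```

With every step decorated this way, a chain reports its progress as it runs, which narrows down where an unexpected change in shape was introduced.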