In [None]:
import pandas as pd
import numpy as np

# Modern Pipelines

In the notebooks so far, we have focussed on what pandas has to offer, but we have not discussed how to use pandas in practice. In this notebook we will highlight a problem with a common workflow and demonstrate a workflow that could be adopted instead.

Let's pretend that we've read in a timeseries and that this is the new data.

In [None]:
def make_ts_df():
    dates = [str(_) for _ in pd.date_range("2018-01-01", "2019-01-01")]
    values = [np.nan if np.random.random() < 0.05 else _ for _ in np.random.normal(0, 1, 366)]
    return pd.DataFrame({"date": dates, "value": values})

date_df = make_ts_df()
date_df

Before we start analysing the data, let's imagine we want to do the following:

- Parse the `date` strings into proper datetimes, which gets rid of the redundant hours.
- Clean the `nan` values.
- Remove outliers. 

One way of doing it could be like so.

In [None]:
(
    date_df
    .assign(date=lambda d: pd.to_datetime(d.date))
    .dropna()
    .loc[lambda d: d.value > -2.0]
    .loc[lambda d: d.value < 2.0]
)

This is the way we've been doing it so far, but we can do better.

If you were to just look at the code above it could be a bit hard to understand what is going on.

Also, if we were to get a new date dataframe, we'd have to start all over again. 

Whilst this is not a big issue when we are only doing three processing steps, it becomes time-consuming as the number of steps increases.

## Pipeline abstraction

In [None]:
def parse_types(dataf):
    """Parse the date strings into datetimes (this drops the redundant hours)"""
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date'])))

def clean_nan(dataf):
    """Get rid of rows with missing values"""
    return (dataf.dropna())

def fill_nan(dataf):
    """Fill nan values with 0"""
    return (dataf.fillna(0))

def remove_outliers(dataf):
    """Remove values greater than 2 and less than -2"""
    return (dataf
            .loc[lambda d: d['value'] > -2.0]
            .loc[lambda d: d['value'] < 2.0])

prep_df = (date_df
           .pipe(parse_types)
           .pipe(clean_nan)
           .pipe(remove_outliers))
prep_df

The `.pipe()` method takes a function that accepts a dataframe as its first argument and calls it on the dataframe. This is a very nice flow.

- We can easily use this pipeline (or parts of this pipeline) for different datasets.

<img src="../images/lego.jpg" width="400" height="400" align="center"/>

- If there is ever a bug, this pipeline will make it easier for us to figure out where it is. Since every step is merely a function, we'll know exactly where the process is breaking.

- We can give the function a descriptive name and on a pipeline level this allows us to see "what" is happening "when". 
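
As a minimal sketch of the points above: calling a function through `.pipe()` gives the same result as calling it directly, which is why a single step can be run and inspected on its own. The `double_values` function and `toy_df` dataframe are hypothetical names used only for illustration.

```python
import pandas as pd


def double_values(dataf):
    """Hypothetical step: return a new dataframe with 'value' doubled."""
    return dataf.assign(value=lambda d: d["value"] * 2)


# Small illustrative dataframe (not the timeseries used elsewhere in this notebook)
toy_df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})

# Calling the step directly and calling it through .pipe() are equivalent
direct = double_values(toy_df)
piped = toy_df.pipe(double_values)

print(piped.equals(direct))
```

Because each step is just a function, any step can be called on its own like this to check an intermediate result when something looks wrong.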


In [None]:
# e.g. You may not have seen how the parse_types function works yet
parse_types?
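
The `?` syntax only works in IPython and Jupyter. Outside a notebook, the same docstring can be read with standard Python tools, roughly as sketched here. The `parse_types` definition is repeated so that the example is self-contained.

```python
import inspect

import pandas as pd


def parse_types(dataf):
    """Parse the date strings into datetimes"""
    return dataf.assign(date=lambda d: pd.to_datetime(d["date"]))


# The raw docstring is stored on the function object itself
print(parse_types.__doc__)

# inspect.getdoc returns the same text with indentation cleaned up
print(inspect.getdoc(parse_types))
```

The built-in `help(parse_types)` prints the signature together with the docstring.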

### Caveats 

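We should be careful when we are writing `.pipe`-lines. The function going into a `.pipe()` might not be stateless: it can modify the dataframe that is passed in. Here's an example where the original `date_df` ends up with renamed columns as well: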

In [None]:
def rename_columns(dataf):
    dataf.columns = ["a", "b"]
    return dataf 

date_df = make_ts_df()
date_df.pipe(rename_columns).columns, date_df.columns

In such a situation it is best to include a `.copy()` command, so that the function works on a copy and leaves the original dataframe untouched.

In [None]:
def rename_columns(dataf):
    dataf = dataf.copy()
    dataf.columns = ["a", "b"]
    return dataf 

date_df = make_ts_df()
date_df.pipe(rename_columns).columns, date_df.columns

Be careful with this. We want our functions to be stateless, otherwise we lose the benefits.
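
One way to guard against this is to check that a step leaves its input untouched. The sketch below is only an illustration: `leaves_input_untouched`, `rename_in_place`, `rename_with_new_frame` and `toy_df` are hypothetical names, not part of pandas.

```python
import pandas as pd


def leaves_input_untouched(step, dataf):
    """Hypothetical helper: True if calling `step` does not modify `dataf` in place."""
    before = dataf.copy(deep=True)
    step(dataf)
    # Compare both the column labels and the contents with the snapshot
    return list(dataf.columns) == list(before.columns) and dataf.equals(before)


def rename_in_place(dataf):
    # Assigning to .columns changes the dataframe that was passed in
    dataf.columns = ["a", "b"]
    return dataf


def rename_with_new_frame(dataf):
    # .rename() returns a new dataframe and leaves the input alone
    return dataf.rename(columns={"date": "a", "value": "b"})


toy_df = pd.DataFrame({"date": ["2018-01-01", "2018-01-02"], "value": [0.1, 0.2]})

safe = leaves_input_untouched(rename_with_new_frame, toy_df)
unsafe = leaves_input_untouched(rename_in_place, toy_df.copy())
print(safe, unsafe)
```

A check like this could live in a small test for each pipeline step, so that an accidental in-place change is caught early.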

Alternatively, you could use the `.rename()` method.

In [None]:
def rename_columns(dataf):
    return dataf.rename(columns={"date": "a", "value": "b"})

date_df = make_ts_df()
date_df.pipe(rename_columns).columns, date_df.columns

## Pipeline abstraction on higher levels

To fully appreciate what the pandas pipelines can do, let us rewrite one function.

In [None]:
def remove_outliers(dataf, min_value=-2.0, max_value=2.0):
    return (dataf
            .loc[lambda d: d['value'] > min_value]
            .loc[lambda d: d['value'] < max_value])

prep_df = (date_df
           .pipe(parse_types)
           .pipe(clean_nan)
           .pipe(remove_outliers, max_value=0.5))
prep_df

The `.pipe()` method also accepts keyword arguments and passes them on to the function. This allows you to change, say, threshold values at the pipeline level, without touching the original function. This is great because it encourages you to write functions that are general.
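
Taking this one step further, the whole pipeline can itself be wrapped in a function that forwards keyword arguments, so it can be reused on other datasets. This is a minimal sketch under stated assumptions: `prepare` and `toy_df` are hypothetical names, and the input is assumed to have `date` and `value` columns.

```python
import numpy as np
import pandas as pd


def remove_outliers(dataf, min_value=-2.0, max_value=2.0):
    """Keep rows whose value lies strictly between min_value and max_value."""
    return (dataf
            .loc[lambda d: d["value"] > min_value]
            .loc[lambda d: d["value"] < max_value])


def prepare(dataf, **outlier_kwargs):
    """Hypothetical wrapper: parse dates, drop missing rows, remove outliers."""
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d["date"]))
            .dropna()
            .pipe(remove_outliers, **outlier_kwargs))


# Small illustrative dataframe with one missing value and one outlier
toy_df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
    "value": [0.1, np.nan, 3.0, -0.4],
})

default_prep = toy_df.pipe(prepare)                 # default thresholds
strict_prep = toy_df.pipe(prepare, max_value=0.0)   # threshold set at the call site

print(default_prep)
print(strict_prep)
```

The thresholds are chosen where the pipeline is called, while `remove_outliers` itself stays unchanged.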

# Conclusion 

> **"Pipelines are the only correct way to write pandas."**

This is a bold statement, but some people feel very strongly about this.

Even if you take this statement with a grain of salt, it is important to write your code in such a way that your notebooks remain clear. If it takes a lot of effort to understand the code of your colleagues, then your team will be slower than you want it to be.

A notebook is a great scratchpad, but that is no excuse to write unclear code!