# DataFrames: Custom functions

This notebook uses `.apply()` and `.map_partitions()` to run custom functions on Dask DataFrames. It covers both common usage and best practices.

## Start Dask Client for Dashboard

Starting the Dask Client is optional. It provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create the client below. We recommend having it open on one side of your screen while using your notebook on the other side. It can take some effort to arrange your windows, but seeing both at the same time is very useful when learning.

In [None]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

## Artificial dataset

We create an artificial timeseries dataset to help us work with custom operations.

In [None]:
import dask
df = dask.datasets.timeseries()
df

This dataset is small enough to fit in memory, so we persist it now.

You would skip this step if your dataset were too large to fit into memory.

In [None]:
df = df.persist()

## Apply by row

There are times when you need custom functions that operate on Dask DataFrames. Here's a simple function that operates on one row at a time.

In [None]:
def custom_function(row):
    if row['x'] < row['y']:
        return row['x'] * row['y']
    else:
        return row['x'] + row['y']

This function is computed row-by-row through each partition using the `.apply()` method. It is best practice to use `meta=` to declare the datatype returned by the function.

In [None]:
df['result'] = df.apply(custom_function, axis='columns', meta=float)
df['result'].head()
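
Before settling on `meta=float`, it can help to run the function on a small in-memory Pandas sample and inspect the dtype it produces. The sketch below uses plain Pandas so it runs without a cluster; `sample` is a hypothetical frame standing in for a single partition, and its values are made up for illustration.

```python
import pandas as pd

def custom_function(row):
    # Same logic as above: multiply when x < y, otherwise add
    if row['x'] < row['y']:
        return row['x'] * row['y']
    else:
        return row['x'] + row['y']

# Hypothetical sample standing in for a single partition
sample = pd.DataFrame({'x': [0.5, -0.25, 0.75], 'y': [0.25, 0.5, 0.75]})

# Row-wise apply on the sample; the resulting dtype is what meta= should declare
result = sample.apply(custom_function, axis='columns')
print(result.dtype)
```

If the dtype printed here differs from the one passed to `meta=`, the declaration is probably worth revisiting.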

## Map by partition

Binning values along a column is easily achieved in Pandas using `pd.cut`. We'd like to efficiently apply this function to our Dask DataFrame.

`.map_partitions()` applies the function independently to each partition. Because we call it on a single column, each partition is read into memory as a Pandas Series.

In [None]:
import numpy as np
import pandas as pd

bins = np.linspace(-1, 1, 4)
labels = ['low', 'medium', 'high']

df['x_bin'] = df['x'].map_partitions(pd.cut, bins=bins, labels=labels, meta=object)

The bins are pre-computed and applied to each partition to create the new `'x_bin'` column with the object dtype.
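
As a rough mental model of the partition-wise step, the sketch below splits a Pandas Series into two chunks by hand, bins each chunk separately, and concatenates the pieces. It is an emulation in plain Pandas, not how Dask is implemented internally; the Series `x` and the split point are hypothetical.

```python
import numpy as np
import pandas as pd

bins = np.linspace(-1, 1, 4)
labels = ['low', 'medium', 'high']

# Hypothetical values standing in for the 'x' column
x = pd.Series([-0.9, -0.2, 0.1, 0.4, 0.8, -0.5])

# Split into two chunks as a stand-in for two partitions
partitions = [x.iloc[:3], x.iloc[3:]]

# Bin each chunk independently, then stitch the pieces back together
binned_parts = [pd.cut(part, bins=bins, labels=labels) for part in partitions]
binned = pd.concat(binned_parts)

# With shared bin edges, this matches binning the whole Series at once
whole = pd.cut(x, bins=bins, labels=labels)
print(binned.tolist() == whole.tolist())
```

Because every chunk sees the same `bins`, the per-chunk results agree with the result for the full Series.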

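For `pd.cut` it is important to provide explicit bin edges rather than a number of bins: given a bin count, `pd.cut` derives the edges from the minimum and maximum of the data it sees, so each partition would end up with different bins.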
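
To see why fixed bin edges matter for `pd.cut`, the sketch below bins two hypothetical partitions using an integer bin count instead. It uses the `retbins=True` option to return the edges that `pd.cut` computed for each input; the partition values are made up for illustration.

```python
import pandas as pd

# Two hypothetical partitions covering different value ranges
part_a = pd.Series([-1.0, -0.5, 0.0])
part_b = pd.Series([0.0, 0.5, 1.0])

# With an integer bin count, the edges come from each input's own min and max
_, edges_a = pd.cut(part_a, bins=2, retbins=True)
_, edges_b = pd.cut(part_b, bins=2, retbins=True)

print(edges_a)  # edges span roughly -1 to 0
print(edges_b)  # edges span roughly 0 to 1
```

The same value, `0.0`, falls into the upper bin of one partition and the lower bin of the other, so labels assigned this way would not be comparable across partitions.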

In [None]:
df['x_bin'].head()

Now that we have labels we can use them for further processing, like groupby.

In [None]:
avg = df.groupby('x_bin')['y'].mean()
avg.compute()
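
For reference, the same aggregation can be tried on a small Pandas frame to see what shape of result to expect. This is a sketch on a hypothetical sample with made-up values and plain string labels, not the output of the Dask computation above.

```python
import pandas as pd

# Hypothetical sample with string bin labels and made-up y values
sample = pd.DataFrame({
    'x_bin': ['low', 'high', 'low', 'medium'],
    'y': [0.5, -0.5, 0.25, 1.0],
})

# One mean per label; with plain strings the group keys are sorted alphabetically
avg_sample = sample.groupby('x_bin')['y'].mean()
print(avg_sample)
```

The result is a Series indexed by bin label, with one mean of `y` per group.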