# Customizing Buckaroo
Buckaroo consists of:
* The BuckarooWidget, which coordinates updates to the frontend, management of analysis, the low code UI, and auto_cleaning functionality.
* The frontend JS Table, which handles display of dataframes and is configured through interfaces.
* The pluggable analysis framework, which orders execution of customized analysis objects, catches errors, and reports them.
* User-supplied analysis objects, which operate on the dataframes to build the summary stats table and configure the frontend display.


In [None]:
import numpy as np
import pandas as pd
import buckaroo
from buckaroo.buckaroo_widget import BuckarooWidget

#df = pd.read_csv("https://s3.amazonaws.com/tripdata/201401-citibike-tripdata.zip")
df = pd.read_parquet("./citibike-trips-2016-04.parq")
df

**These docs need updating for 0.5.** Take a look at the [customizations](https://github.com/paddymul/buckaroo/tree/main/buckaroo/customizations) directory in the codebase and file bugs with your suggested improvements.  I expect to add many more examples during the 0.6 series.

# Adding a summary stat
Buckaroo is completely customizable.  In the next cells we will add `Variance` to an instance of the BuckarooWidget using the `Pluggable Analysis Framework`.

## Why was the Pluggable Analysis Framework built?
The `Pluggable Analysis Framework` is engineered to allow summary_stats to be built up piecemeal and incrementally.  Traditionally, when writing bits of analysis code, the tendency is to have large brittle functions that do a lot at once.  Adding extra stats then requires one of the following: copying and pasting the existing function with one small addition, writing each stat independently and possibly recomputing existing stats, maintaining a strictly ordered set of analysis functions, or building some complex ad hoc argument passing scheme.  I have written ad hoc versions in each of these patterns.  The problems are manifest, and the apparatus rarely survives even copy-pasting to the next notebook.

## How does the Pluggable Analysis Framework work?
The `Pluggable Analysis Framework` is built around a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of `ColAnalysis` nodes.  Each node provides one or more summary stats (`provides_summary`) and can depend (`requires_summary`) on stats provided by other nodes.  Nodes can be added to the DAG with `add_analysis`.  If a class with the same name is inserted into the DAG, the newly inserted node replaces the previous one.  This all facilitates interactive development of analysis functions.  During execution, errors are caught and execution proceeds.  This is important because breaking the default dataframe display mechanism is a show-stopping problem for users.
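
The ordering and error-catching behavior described above can be illustrated with a small standalone toy.  This is only a sketch of the idea, not Buckaroo's implementation: `ToyAnalysis`, `order_analyses`, and `run` are hypothetical names, the `summary` signature is simplified, and the ordering relies on the standard library's `graphlib` (Python 3.9+) rather than anything in Buckaroo.

```python
from graphlib import TopologicalSorter


class ToyAnalysis:
    """Hypothetical stand-in for an analysis node; not Buckaroo's ColAnalysis."""
    provides_summary = []
    requires_summary = []

    @staticmethod
    def summary(values, stats):
        return {}


class Mean(ToyAnalysis):
    provides_summary = ["mean"]

    @staticmethod
    def summary(values, stats):
        return {"mean": sum(values) / len(values)}


class Variance(ToyAnalysis):
    provides_summary = ["variance"]
    requires_summary = ["mean"]  # must run after whichever node provides "mean"

    @staticmethod
    def summary(values, stats):
        m = stats["mean"]
        return {"variance": sum((v - m) ** 2 for v in values) / len(values)}


class Broken(ToyAnalysis):
    provides_summary = ["broken"]

    @staticmethod
    def summary(values, stats):
        raise ValueError("simulated bug")


def order_analyses(analyses):
    # Keyed by class name, so a later class with the same name replaces an earlier one.
    by_name = {a.__name__: a for a in analyses}
    provider = {stat: name for name, a in by_name.items() for stat in a.provides_summary}
    # Map each node to the set of nodes it depends on.
    graph = {name: {provider[r] for r in a.requires_summary} for name, a in by_name.items()}
    return [by_name[name] for name in TopologicalSorter(graph).static_order()]


def run(analyses, values):
    stats, errors = {}, {}
    for analysis in order_analyses(analyses):
        try:
            stats.update(analysis.summary(values, stats))
        except Exception as exc:
            # Record the failure and keep going, so one bad node does not stop the rest.
            errors[analysis.__name__] = exc
    return stats, errors


# Variance is listed before Mean on purpose; the dependency order fixes that.
stats, errors = run([Variance, Broken, Mean], [1.0, 2.0, 3.0, 4.0])
print(stats)
print(errors)
```

Because the nodes declare what they need instead of being called in a hand-maintained order, a new stat can be added without touching the existing ones.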

In [None]:
w = BuckarooWidget(df[:500])
w

In [None]:
from buckaroo.pluggable_analysis_framework.pluggable_analysis_framework import (ColAnalysis)
from buckaroo.dataflow.dataflow_extras import StylingAnalysis

class Variance(StylingAnalysis):
    # Note: we also override pinned_rows so you can see this new stat; this is mainly used for development.
    # Normally you would extend "ColAnalysis" to just provide the stat.
    pinned_rows = [
        {'primary_key_val': 'variance', 'displayer_args': {'displayer': 'obj' }}]

    provides_summary = ["variance"]
    # a bit hacky: the newly added analysis needs to be the last in the dependency chain
    requires_summary = ["histogram"]

    @staticmethod
    def series_summary(sampled_ser, ser):
        if pd.api.types.is_numeric_dtype(ser):
            return dict(variance=ser.var())
        return dict(variance="NA")

w.add_analysis(Variance)

## Basic Unit testing is built in

Because there are so many corner cases with numerical code, every time a new summary stat is added, a variety of simple tests are run against it.  This lets you discover bugs earlier.
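
As a rough illustration of the idea, the sketch below runs a summary function against a handful of awkward inputs and collects whatever exceptions it raises.  It is not Buckaroo's test suite: `EDGE_CASES`, `check_summary`, `naive_mean`, and `guarded_mean` are hypothetical names, and plain lists stand in for pandas Series.

```python
import math

# Hypothetical edge cases; a real suite would cover many more.
EDGE_CASES = {
    "empty": [],
    "all_nan": [math.nan, math.nan],
    "single_value": [3.0],
    "mixed_types": [1, "a", None],
}


def check_summary(func, cases=EDGE_CASES):
    """Run func against every edge case and collect the exceptions it raises."""
    failures = {}
    for name, values in cases.items():
        try:
            func(values)
        except Exception as exc:
            failures[name] = exc
    return failures


def naive_mean(values):
    # Fails on empty input (division by zero) and on non-numeric values.
    return {"mean": sum(values) / len(values)}


def guarded_mean(values):
    # Keep only real numbers, and fall back to "NA" when nothing usable is left.
    nums = [v for v in values if isinstance(v, (int, float)) and not math.isnan(v)]
    if not nums:
        return {"mean": "NA"}
    return {"mean": sum(nums) / len(nums)}


naive_failures = check_summary(naive_mean)
guarded_failures = check_summary(guarded_mean)
print(sorted(naive_failures))
print(sorted(guarded_failures))
```

Running checks like these at registration time surfaces the failing case by name, before the function ever sees real data.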

In [None]:
# broken as of 0.6

small_df = df[:500][df.columns[:4]]
# we are going to create, but not display a BuckarooWidget here, we are looking at the error behavior
w = BuckarooWidget(small_df, debug=True)

class Variance(ColAnalysis):
    provides_summary = ["variance"]
    requires_summary = ["mean"]
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        mean = summary_ser.get('mean', False)
        arr = ser.to_numpy()
        # toggle SIMULATED_BUG to easily see behavior with and without a bug
        SIMULATED_BUG = False
        if SIMULATED_BUG:
            # `in` compares with ==, and the truth value of `pd.NA == x` is ambiguous
            if mean in [pd.NA, np.nan, False]:
                return dict(variance="NA")
        else:
            # pd.isna handles pd.NA, np.nan and None without relying on identity checks
            if mean is False or pd.isna(mean):
                return dict(variance="NA")
        # integer and float columns share the same population-variance formula;
        # a mean of exactly 0 is valid, so the mean is not tested for truthiness
        if pd.api.types.is_integer_dtype(ser) or pd.api.types.is_float_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        return dict(variance="NA")

w.add_analysis(Variance)

In [None]:
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

## Reproducing errors in the notebook
Buckaroo prints reproduction instructions like:
```
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

```

`PERVERSE_DF` is a DataFrame with all kinds of edge cases that normally trip up numerical code.  You can run the above two lines and quickly start iterating on your `ColAnalysis` class to fix the error.  Normally, ad hoc analysis code that iterates over a list of functions blows up in a stack trace referencing an anonymous function in the middle of a for loop, called with opaque variables.  Buckaroo gives you a single line that can reproduce the error, with easily inspectable variables.
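
The "boolean value of NA is ambiguous" error in the reproduction line comes from how `pd.NA` behaves under `==`, and it can be seen in isolation.  The sketch below uses only pandas; `is_missing_buggy` and `is_missing` are hypothetical helper names, and `pd.isna` is offered as one possible way to write the check, not necessarily the one Buckaroo uses.

```python
import pandas as pd


def is_missing_buggy(mean):
    # `in` falls back to ==, and `pd.NA == x` evaluates to pd.NA,
    # which refuses to be converted to True or False.
    return mean in [pd.NA, float("nan"), False]


def is_missing(mean):
    # pd.isna covers pd.NA, NaN and None without any == comparison.
    return mean is False or pd.isna(mean)


try:
    is_missing_buggy(float("nan"))
    buggy_error = None
except TypeError as exc:
    buggy_error = str(exc)

print(buggy_error)
print(is_missing(float("nan")), is_missing(pd.NA), is_missing(0.0), is_missing(False))
```

Note that a mean of `0.0` is reported as present, which a plain truthiness test (`if mean:`) would get wrong.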

In [None]:
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

## Quiet mode
Sometimes you just want to get on with it.  Buckaroo has a setting for that too: set `quiet=True`, and both unit test errors and regular processing errors will be silenced.  This is not recommended, but if I didn't add it, users would write their own ad hoc version.

In [None]:
w = buckaroo.BuckarooWidget(small_df)
#There are errors in the following functions, quiet = True will ignore them

def int_digits(n):
    if np.isnan(n):
        return 1
    if n == 0:
        return 1
    if np.sign(n) == -1:
        return int(np.floor(np.log10(np.abs(n)))) + 2
    return int(np.floor(np.log10(n)+1))
class MinDigits(ColAnalysis):
    
    requires_summary = ["min"]
    provides_summary = ["min_digits"]
    quiet = True
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        is_numeric = pd.api.types.is_numeric_dtype(sampled_ser.dtype)
        if is_numeric:
            return {
                'min_digits':int_digits(summary_ser.loc['min'])}
        else:
            return {
                'min_digits':0}
w.add_analysis(MinDigits)
w

# Making a new default dataframe display function

## Adding a Command to the Low Code UI
Previous versions of Buckaroo included a customizable low code UI.  It is temporarily deprecated as of 0.6.
For more information, see https://github.com/paddymul/buckaroo/blob/86df365278ac6933f7266c0b055a2ff90b072e9a/example-notebooks/Customizing-Buckaroo.ipynb and install Buckaroo 0.4.6 or 0.5.1.

In [None]:
from buckaroo.widget_utils import disable
from IPython.core.getipython import get_ipython
from IPython.display import display
import warnings

disable()
def my_display_as_buckaroo(df):
    w = BuckarooWidget(df, showCommands=False)
    # the analysis we added throws warnings, so suppress them only while it is
    # registered; catch_warnings restores the previous filters on exit
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        # register the Variance analysis defined earlier in this notebook
        w.add_analysis(Variance)
    return display(w)

def my_enable():
    """
    Automatically use buckaroo to display all DataFrames
    instances in the notebook.

    """
    ip = get_ipython()
    if ip is None:
        print("must be running inside ipython to enable default display via enable()")
        return
    ip_formatter = ip.display_formatter.ipython_display_formatter
    ip_formatter.for_type(pd.DataFrame, my_display_as_buckaroo)
my_enable()