# Customizing Buckaroo
Buckaroo consists of:
* The `BuckarooWidget`, which coordinates updates to the frontend, management of analysis, the low-code UI, and auto_cleaning functionality.
* The frontend JS table, which handles display of dataframes and is configured through interfaces.
* The pluggable analysis framework, which orders execution of customized analysis objects, catches errors, and reports them.
* User-supplied analysis objects, which operate on the dataframes to build the summary stats table and configure the frontend display.


## Adding a Command to the Low Code UI

In [None]:
import numpy as np
import pandas as pd
import buckaroo

In [None]:
df = pd.read_csv("https://s3.amazonaws.com/tripdata/201401-citibike-tripdata.zip")
w = buckaroo.BuckarooWidget(df[:500], showCommands=True, auto_clean=False) # auto_clean=False reduces clutter in the operations
w

In [None]:
from buckaroo.customizations.all_transforms import Command
from buckaroo.jlisp.lispy import s
#Here we start adding commands to the Buckaroo Widget.  Every call to add_command replaces a command with the same name
@w.add_command
class GroupBy3(Command):
    command_default = [s("groupby3"), s('df'), 'col', {}]
    command_pattern = [[3, 'colMap', 'colEnum', ['null', 'sum', 'mean', 'median', 'count']]]
    @staticmethod 
    def transform(df, col, col_spec):
        grps = df.groupby(col)
        
        df_contents = {}
        for k, v in col_spec.items():
            if v == "sum":
                df_contents[k] = grps[k].apply(lambda x: x.sum())
            elif v == "mean":
                df_contents[k] = grps[k].apply(lambda x: x.mean())
            elif v == "median":
                df_contents[k] = grps[k].apply(lambda x: x.median())
            elif v == "count":
                df_contents[k] = grps[k].apply(lambda x: x.count())
        return pd.DataFrame(df_contents)

    @staticmethod 
    def transform_to_py(df, col, col_spec):
        commands = [
            "    grps = df.groupby('%s')" % col,
            "    df_contents = {}"
        ]
        for k, v in col_spec.items():
            if v == "sum":
                commands.append("    df_contents['%s'] = grps['%s'].apply(lambda x: x.sum())" % (k, k))
            elif v == "mean":
                commands.append("    df_contents['%s'] = grps['%s'].apply(lambda x: x.mean())" % (k, k))
            elif v == "median":
                commands.append("    df_contents['%s'] = grps['%s'].apply(lambda x: x.median())" % (k, k))
            elif v == "count":
                commands.append("    df_contents['%s'] = grps['%s'].apply(lambda x: x.count())" % (k, k))
        commands.append("    df = pd.DataFrame(df_contents)")
        return "\n".join(commands)


Note that `groupby3` has been added to the commands.
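To make the `col_spec` argument concrete, here is a minimal sketch that runs the same aggregation logic outside the widget on a toy dataframe, then checks that the generated Python produces the same result. The helper names (`groupby_transform`, `groupby_to_py`), the toy columns, and the wrapping `generated` function are all made up for illustration. I am also assuming that the four-space indent in the emitted lines means they are meant to sit inside a function body.

```python
import pandas as pd

AGGS = ("sum", "mean", "median", "count")

def groupby_transform(df, col, col_spec):
    # col_spec maps a column name to the aggregation to apply to it
    grps = df.groupby(col)
    return pd.DataFrame(
        {k: grps[k].agg(v) for k, v in col_spec.items() if v in AGGS})

def groupby_to_py(col, col_spec):
    # Emit source lines that mirror groupby_transform
    lines = ["    grps = df.groupby('%s')" % col, "    df_contents = {}"]
    for k, v in col_spec.items():
        if v in AGGS:
            lines.append("    df_contents['%s'] = grps['%s'].%s()" % (k, k, v))
    lines.append("    df = pd.DataFrame(df_contents)")
    return "\n".join(lines)

toy = pd.DataFrame({
    "station": ["a", "a", "b", "b", "b"],
    "duration": [10, 20, 5, 15, 40],
    "riders": [1, 2, 3, 4, 5],
})
spec = {"duration": "mean", "riders": "sum"}

direct = groupby_transform(toy, "station", spec)

# Wrap the generated body in a function and execute it
source = "def generated(df):\n" + groupby_to_py("station", spec) + "\n    return df"
namespace = {"pd": pd}
exec(source, namespace)
from_generated = namespace["generated"](toy)

print(source)
print(direct)
```

Keeping `transform` and `transform_to_py` in lockstep like this is what lets the low-code UI show code that matches what it displays.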

**These docs need updating for 0.5.** Take a look at the [customizations](https://github.com/paddymul/buckaroo/tree/main/buckaroo/customizations) directory in the codebase and file some bugs asking for your suggested improvements.  I expect to add a lot more examples around the 0.6 series.

# Adding a summary stat
Buckaroo is completely customizable.  In the next cells we will add `Variance` to an instance of the BuckarooWidget with the `Pluggable Analysis Framework`.

## Why was the Pluggable Analysis Framework built?
The `Pluggable Analysis Framework` is engineered to allow summary_stats to be built up piecemeal and incrementally.  Traditionally when writing bits of analysis code, the tendency is to have large brittle functions that do a lot at once.  Adding extra stats then requires one of the following: copying and pasting the existing function with one small addition, writing each stat independently and possibly recomputing existing stats, maintaining a strictly ordered set of analysis functions, or building some complex ad hoc argument passing scheme.  I have written ad hoc versions in each of these patterns.  The problems are manifest, and the apparatus rarely survives even copy-pasting to the next notebook.

## How does the Pluggable Analysis Framework work?
The `Pluggable Analysis Framework` is built around a DAG of `ColAnalysis` nodes that can depend on other summary stats and provide one or more summary stats.  Nodes can be added to the DAG with `add_analysis`.  If a class with the same name is inserted into the DAG, the newly inserted node replaces the previous one.  This all facilitates interactive development of analysis functions.  During execution, errors are caught and execution proceeds.  This is important because breaking the default dataframe display mechanism is a show-stopping problem for users.
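The three ideas above (dependency ordering, replace-by-name, and catching errors so execution proceeds) can be sketched in plain Python. This is not Buckaroo's implementation. The `registry`, `run`, and `compute` names and the `Length`, `Mean`, and `Doubled` classes are hypothetical, and the ordering uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

registry = {}  # hypothetical stand-in for the framework's DAG of nodes

def add_analysis(node):
    # Keyed by class name, so adding a class with the same name replaces it
    registry[node.__name__] = node
    return node

def ordered_nodes():
    # Map each stat to the node that provides it, then sort so providers
    # run before the nodes that require their stats
    provider = {stat: name for name, node in registry.items()
                for stat in node.provides}
    graph = {name: {provider[req] for req in node.requires}
             for name, node in registry.items()}
    return [registry[name] for name in TopologicalSorter(graph).static_order()]

def run(values):
    stats, errors = {}, {}
    for node in ordered_nodes():
        try:
            stats.update(node.compute(values, stats))
        except Exception as exc:
            # Record the failure and keep going with the remaining nodes
            errors[node.__name__] = exc
    return stats, errors

@add_analysis
class Doubled:
    provides = ["doubled_mean"]
    requires = ["mean"]
    @staticmethod
    def compute(values, stats):
        raise ValueError("simulated bug")

@add_analysis
class Mean:
    provides = ["mean"]
    requires = ["length"]
    @staticmethod
    def compute(values, stats):
        return {"mean": sum(values) / stats["length"]}

@add_analysis
class Length:
    provides = ["length"]
    requires = []
    @staticmethod
    def compute(values, stats):
        return {"length": len(values)}

stats_before, errors_before = run([1, 2, 3, 4])

@add_analysis
class Doubled:  # same name, so this replaces the buggy node
    provides = ["doubled_mean"]
    requires = ["mean"]
    @staticmethod
    def compute(values, stats):
        return {"doubled_mean": stats["mean"] * 2}

stats_after, errors_after = run([1, 2, 3, 4])
print(stats_before, list(errors_before))
print(stats_after, list(errors_after))
```

Registration order does not matter here: `Doubled` is added first but runs last because it requires `mean`, which in turn requires `length`.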

In [None]:
w = buckaroo.BuckarooWidget(df, showCommands=False)
w

In [None]:
from buckaroo.pluggable_analysis_framework.pluggable_analysis_framework import (ColAnalysis)
class Variance(ColAnalysis):
    provides_summary = ["variance"]
    # a bit hacky: the newly added analysis needs to be the last in the dependency chain
    requires_summary = ["histogram"]

    @staticmethod
    def series_summary(sampled_ser, ser):
        if pd.api.types.is_numeric_dtype(ser):
            return dict(variance=ser.var())
        return dict(variance="NA")

    
    summary_stats_display = [
        'dtype', 'length', 'nan_count', 'distinct_count', 'empty_count',
        'empty_per', 'unique_per', 'nan_per', 
        'is_numeric', 'is_integer', 'is_datetime',
        'mode', 'min', #'max', 
        'mean', 
        # we must add variance to the list of summary_stats_display, otherwise our new stat won't be displayed
        'variance']
w.add_analysis(Variance)

The analysis is added interactively.  Toggle the summary stats view on the widget above and notice that `variance` has been added.
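One detail worth knowing when reading the `variance` row: pandas `Series.var()` computes the sample variance (`ddof=1`) by default, whereas a hand-rolled `np.mean((arr - mean)**2)` computes the population variance (`ddof=0`). A quick check on made-up data:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = ser.var()               # pandas default, ddof=1
population_var = ser.var(ddof=0)     # divide by n instead of n - 1
manual_var = np.mean((ser.to_numpy() - ser.mean()) ** 2)  # same as ddof=0

print(sample_var, population_var, manual_var)
```

The two conventions give different numbers for the same column, so pick one deliberately when writing your own stat.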

## Basic Unit testing is built in

Because there are so many corner cases with numerical code, every time a new summary stat is added, a variety of simple tests are run against it.  This lets you discover bugs earlier.
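The idea can be sketched as a small harness that runs a summary function against a handful of awkward series and records which ones raise. This is only an illustration of the approach, not Buckaroo's test suite. The `EDGE_CASES` dictionary, `check`, and both summary functions are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical edge cases; Buckaroo ships its own set
EDGE_CASES = {
    "all_nan": pd.Series([np.nan] * 3),
    "empty": pd.Series([], dtype="float64"),
    "strings": pd.Series(["a", "b", "c"]),
    "ints": pd.Series([1, 2, 3]),
}

def variance_summary(ser):
    # Guards on dtype, so non-numeric columns get a placeholder
    if pd.api.types.is_numeric_dtype(ser):
        return {"variance": ser.var()}
    return {"variance": "NA"}

def fragile_summary(ser):
    # No dtype guard: subtracting two strings raises TypeError
    return {"range": ser.max() - ser.min()}

def check(func):
    # Run func against every edge case and collect the failures
    failures = {}
    for name, ser in EDGE_CASES.items():
        try:
            func(ser)
        except Exception as exc:
            failures[name] = type(exc).__name__
    return failures

print(check(variance_summary))
print(check(fragile_summary))
```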

In [None]:
small_df = df[:500][df.columns[:4]]
# we are going to create, but not display a BuckarooWidget here, we are looking at the error behavior
w = buckaroo.BuckarooWidget(small_df, showCommands=False, debug=True)

class Variance(ColAnalysis):
    provides_summary = ["variance"]
    requires_summary = ["mean"]
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        mean = summary_ser.get('mean', False)
        arr = ser.to_numpy()
        #toggle SIMULATED_BUG to easily see behavior with and without a bug
        SIMULATED_BUG = True
        if SIMULATED_BUG:
            if mean in [pd.NA, np.nan, False]:
                return dict(variance="NA")
        else:
            if mean is pd.NA or mean is np.nan or mean is False:
                return dict(variance="NA")
        if mean and pd.api.types.is_integer_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        elif mean and pd.api.types.is_float_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        return dict(variance="NA")
    
    summary_stats_display = [
        'dtype', 'length', 'nan_count', 'distinct_count', 'empty_count',
        'empty_per', 'unique_per', 'nan_per', 
        'is_numeric', 'is_integer', 'is_datetime',
        'mode', 'min', 'max', 'mean', 
        # we must add variance to the list of summary_stats_display, otherwise our new stat won't be displayed
        'variance']

w.add_analysis(Variance)

In [None]:
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

## Reproducing errors in the notebook
Buckaroo printed reproduction instructions like
```
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

```

`PERVERSE_DF` is a DataFrame with all kinds of edge cases that normally trip up numerical code.  You can run the above two lines and quickly start iterating on your `ColAnalysis` class to fix the error.  Normally, ad hoc analysis code that iterates over a list of functions blows up in a stack trace referencing an anonymous function in the middle of a for loop called with opaque variables.  Buckaroo gives you a single line that can reproduce the error, with easily inspectable variables.
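Why does the simulated bug raise at all? List membership (`in`) compares with `==`, and comparing anything to `pd.NA` yields `pd.NA`, which refuses to be coerced to a bool. The identity checks in the non-buggy branch avoid that comparison. The sketch below isolates the behavior. The function names are made up, and the `pd.isna` variant is my suggestion rather than something the notebook uses.

```python
import numpy as np
import pandas as pd

def is_missing_buggy(value):
    # `in` falls back to ==, and `pd.NA == value` is pd.NA, not a bool
    return value in [pd.NA, np.nan, False]

def is_missing_identity(value):
    # Identity checks never call ==, so pd.NA is never coerced to a bool
    return value is pd.NA or value is np.nan or value is False

def is_missing_isna(value):
    # Suggested alternative: also catches NaN objects other than np.nan
    return value is False or bool(pd.isna(value))

try:
    is_missing_buggy(np.nan)
    buggy_error = None
except TypeError as exc:
    buggy_error = str(exc)

print(buggy_error)
print(is_missing_identity(np.nan), is_missing_identity(float("nan")))
print(is_missing_isna(np.nan), is_missing_isna(float("nan")))
```

The identity version is fragile in its own way: a NaN that is not the `np.nan` object itself slips through, which is why `pd.isna` is usually the safer check for scalars.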

In [None]:
from buckaroo.pluggable_analysis_framework.analysis_management import PERVERSE_DF
Variance.summary(PERVERSE_DF['all_nan'], pd.Series({'mean': np.nan, }), PERVERSE_DF['all_nan']) # boolean value of NA is ambiguous

## Quiet mode
Sometimes you just want to get on with it.  Buckaroo has a setting for that too: set `quiet=True`, and both unit test errors and regular processing errors will be silenced.  This is not recommended, but if I didn't add it, users would write their own ad hoc version.

In [None]:
w = buckaroo.BuckarooWidget(small_df, showCommands=False)
#There are errors in the following functions, quiet = True will ignore them

def int_digits(n):
    if np.isnan(n):
        return 1
    if n == 0:
        return 1
    if np.sign(n) == -1:
        return int(np.floor(np.log10(np.abs(n)))) + 2
    return int(np.floor(np.log10(n)+1))
class MinDigits(ColAnalysis):
    
    requires_summary = ["min"]
    provides_summary = ["min_digits"]
    quiet = True
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        is_numeric = pd.api.types.is_numeric_dtype(sampled_ser.dtype)
        if is_numeric:
            return {
                'min_digits':int_digits(summary_ser.loc['min'])}
        else:
            return {
                'min_digits':0}
w.add_analysis(MinDigits)
w

# Making a new default dataframe display function

In [None]:
from buckaroo.widget_utils import disable
from IPython.core.getipython import get_ipython
from IPython.display import display
import warnings

disable()
def my_display_as_buckaroo(df):
    w = buckaroo.BuckarooWidget(df, showCommands=False)
    # the analysis we added throws warnings, let's muffle that when used as the default display
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        # Skew is not defined in this notebook; substitute your own ColAnalysis subclass
        w.add_analysis(Skew)
    return display(w)

def my_enable():
    """
    Automatically use buckaroo to display all DataFrames
    instances in the notebook.

    """
    ip = get_ipython()
    if ip is None:
        print("must be running inside ipython to enable default display via enable()")
        return
    ip_formatter = ip.display_formatter.ipython_display_formatter
    ip_formatter.for_type(pd.DataFrame, my_display_as_buckaroo)
my_enable()