# Extending Buckaroo for pandas
Buckaroo is built for exploratory data analysis on unknown data.  Data in the wild is incredibly varied and so are the ways of visualizing it. Most table tools are built around allowing a single bespoke customization, with middle of the road defaults. Buckaroo takes a different approach. Buckaroo lets you build many highly specific configurations and then toggle between them quickly.  This makes it easier to build each configuration because you don't have to solve for every possibility.

This document walks you through how to add your own analysis to Buckaroo and let users toggle it.

The extension points are:
* [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html): adds summary stats and column metadata for use by other steps.
* [Styling](./styling-howto.ipynb): controls the visual display of the table.
* PostProcessing: transforms an entire dataframe.
* AutoCleaning: automates transformations such as dropping nulls, removing outliers, and other pre-processing steps. It cleans the dataframe and generates Python code. Not yet supported in 0.6.

Each extension point is composable, and they can be interactively mixed and matched.

In [5]:
import pandas as pd
import numpy as np
from buckaroo.dataflow.dataflow_extras import StylingAnalysis
from buckaroo.pluggable_analysis_framework.pluggable_analysis_framework import ColAnalysis
from buckaroo import BuckarooWidget

In [6]:
ROWS = 200
typed_df = pd.DataFrame({'int_col':np.random.randint(1,50, ROWS), 'float_col': np.random.randint(1,30, ROWS)/.7,
                         'timestamp':["2020-01-01 01:00Z", "2020-01-01 02:00Z", "2020-02-28 02:00Z", "2020-03-15 02:00Z", None] * 40,
                         "str_col": ["foobar", "Realllllly long string", "", None, "normal"]* 40})
typed_df['timestamp'] = pd.to_datetime(typed_df['timestamp'])

In [7]:
bw = BuckarooWidget(typed_df)
bw

BuckarooWidget(buckaroo_options={'sampled': ['random'], 'auto_clean': ['aggressive', 'conservative'], 'post_pr…

In [8]:
bw.df_display_args

{'main': {'data_key': 'main',
  'df_viewer_config': {'pinned_rows': [{'primary_key_val': 'dtype',
     'displayer_args': {'displayer': 'obj'}},
    {'primary_key_val': 'histogram',
     'displayer_args': {'displayer': 'histogram'}}],
   'column_config': [{'col_name': 'index',
     'displayer_args': {'displayer': 'float',
      'min_fraction_digits': 0,
      'max_fraction_digits': 0}},
    {'col_name': 'int_col',
     'displayer_args': {'displayer': 'float',
      'min_fraction_digits': 0,
      'max_fraction_digits': 0}},
    {'col_name': 'float_col',
     'displayer_args': {'displayer': 'float',
      'min_fraction_digits': 3,
      'max_fraction_digits': 3}},
    {'col_name': 'timestamp',
     'tooltip_config': {'tooltip_type': 'simple', 'val_column': 'timestamp'},
     'displayer_args': {'displayer': 'obj'}},
    {'col_name': 'str_col',
     'tooltip_config': {'tooltip_type': 'simple', 'val_column': 'str_col'},
     'displayer_args': {'displayer': 'string', 'max_length': 35}}],
   

# Using the Pluggable Analysis Framework

The PAF allows users to add summary analysis that runs for every dataframe, and exposes created measures to subsequent steps.
There are implementations for pandas and Polars.  Individual analysis classes can depend on other classes that provide measures; the framework ensures that they are executed in the correct order.

These measures form the column metadata used by styling, and the summary information used for pinned rows.

You can read more here

* https://github.com/paddymul/buckaroo/blob/main/tests/unit/analysis_management_test.py
* https://github.com/paddymul/buckaroo/blob/main/buckaroo/customizations/analysis.py
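
To make the ordering guarantee concrete, here is a minimal sketch of dependency-ordered execution using only the standard library. It is not Buckaroo's implementation: the three classes, their `provides` and `requires` attributes, and `order_analyses` are hypothetical stand-ins for the idea that a class which needs a measure runs after the class that provides it.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-ins for analysis classes: each declares the measures
# it provides and the measures it needs from other classes.
class Typing:
    provides = ["is_numeric"]
    requires = []

class Summary:
    provides = ["mean"]
    requires = ["is_numeric"]

class Styling:
    provides = []
    requires = ["mean", "is_numeric"]

def order_analyses(klasses):
    """Return klasses sorted so every provider runs before its consumers."""
    provider_of = {m: k for k in klasses for m in k.provides}
    graph = {}
    for k in klasses:
        missing = [m for m in k.requires if m not in provider_of]
        if missing:
            # Fail before anything runs, rather than midway through a dataframe
            raise KeyError(f"{k.__name__} requires unprovided measures: {missing}")
        # Map each class to the set of classes that must run before it
        graph[k] = {provider_of[m] for m in k.requires}
    return list(TopologicalSorter(graph).static_order())

# The input order does not matter; the declared requirements decide the order.
ordered = order_analyses([Styling, Summary, Typing])
print([k.__name__ for k in ordered])
```

Checking requirements before running anything is what lets a missing measure surface as one clear error instead of a failure deep inside a later step.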

The following cell adds a 99th percentile measure (the 0.99 quantile, named `quin99`) and displays it.

In [None]:
class Quin99Analysis(StylingAnalysis):
    provides_defaults = {'quin99': None}

    @staticmethod
    def series_summary(sampled_ser, ser):
        if pd.api.types.is_numeric_dtype(ser) and not pd.api.types.is_bool_dtype(ser):
            return dict(
                quin99=ser.quantile(.99))
        return {}
    
    pinned_rows = [{'primary_key_val': 'quin99', 'displayer_args': {'displayer': 'obj' }}]
    df_display_name = 'quin99'
    data_key = "empty"  # the non pinned rows will pull from the empty dataframe

pbw = BuckarooWidget(typed_df)
pbw.add_analysis(Quin99Analysis)
pbw

# Adding a styling analysis
The `StylingAnalysis` class is used to control the display of a column based on the column metadata.  


Overriding the per-column method computes the config for a single column given that column's metadata. The cells below override it as `style_column(col:str, column_metadata)`, which returns a column config dictionary.

This lets you customize the display based on metadata collected about a column.  It works with the [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html): you can list the summary fields your styling needs in `requires_summary`.  Declaring requirements like this means errors are spotted early.

The same StylingAnalysis class can generally work for both Polars and Pandas because it only receives a dictionary with simple python values.
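
Because the styling logic sees only a plain dictionary, it can be exercised without a DataFrame or a widget. The sketch below pulls the kind of branching used in the styling cells below into a standalone function and calls it with hand-written metadata. `pick_displayer` is a hypothetical helper name, and the metadata keys (`_type`, `is_integer`, `is_numeric`) are taken from those cells rather than from a documented schema.

```python
def pick_displayer(column_metadata, digits=3, max_length=35):
    """Choose displayer_args from a dict of simple Python values."""
    if column_metadata.get("is_integer"):
        return {"displayer": "float", "min_fraction_digits": 0, "max_fraction_digits": 0}
    if column_metadata.get("is_numeric"):
        return {"displayer": "float", "min_fraction_digits": digits, "max_fraction_digits": digits}
    if column_metadata.get("_type") == "string":
        return {"displayer": "string", "max_length": max_length}
    # Fall back to the generic object displayer for anything unrecognized
    return {"displayer": "obj"}

# Hand-written metadata: the function cannot tell whether these values
# were computed from a pandas or a Polars column.
int_meta = {"is_integer": True, "is_numeric": True}
float_meta = {"is_integer": False, "is_numeric": True}
str_meta = {"is_integer": False, "is_numeric": False, "_type": "string"}

for meta in (int_meta, float_meta, str_meta, {}):
    print(pick_displayer(meta))
```

Keeping the decision in a function of a dictionary is what makes the same styling reusable across dataframe libraries, and it makes the rules easy to unit test.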

The following cell defines two StylingAnalysis, one that shows great detail `everything` the other shows shortened versions `Abrev`

In [None]:
class EverythingStyling(StylingAnalysis):
    """
    This styling shows as much detail as possible
    """
    df_display_name = "Everything"
    requires_summary = ["histogram", "is_numeric", "is_integer", "dtype", "_type"]
    pinned_rows = [{'primary_key_val': 'dtype', 'displayer_args': {'displayer': 'obj' }}]

    #Styling analysis handles column iteration for us.
    @classmethod
    def style_column(kls, col:str, column_metadata):
        digits = 10
        t = column_metadata['_type']
        if column_metadata['is_integer']:
            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}
        elif column_metadata['is_numeric']:
            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}            
        elif t == 'temporal':
            disp = {'displayer': 'datetimeLocaleString','locale': 'en-US',  'args': {}}
        elif t == 'string':
            disp = {'displayer': 'string', 'max_length': 100}
        else:
            disp = {'displayer': 'obj'}
        return {'col_name':col, 'displayer_args': disp }

class AbrevStyling(StylingAnalysis):
    """This styling shows shortened versions of columns """
    requires_summary = ["histogram", "is_numeric", "is_integer", "dtype", "_type"]
    df_display_name = "Abrev"
    pinned_rows = []

    @classmethod
    def style_column(kls, col:str, column_metadata):
        digits = 3
        t = column_metadata['_type']
        if column_metadata['is_integer']:
            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}
        elif column_metadata['is_numeric']:
            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}
        elif t == 'temporal':
            disp = {'displayer': 'datetimeLocaleString','locale': 'en-US',  'args': {}}
        elif t == 'string':
            disp = {'displayer': 'string', 'max_length':10}
        else:
            disp = {'displayer': 'obj'}
        return {'col_name':col, 'displayer_args': disp }

sbw = BuckarooWidget(typed_df)
sbw.add_analysis(EverythingStyling)
sbw.add_analysis(AbrevStyling)
sbw

Let's look at pinned rows. They can also be modified by setting `pinned_rows` on Buckaroo instantiation.
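
A minimal sketch of that, assuming the constructor accepts a `pinned_rows` keyword as the sentence above describes. `make_pinned_rows` is a hypothetical helper; the entry shape mirrors the `pinned_rows` entries in the `df_display_args` output earlier in this document. The widget call is guarded so the cell still runs where Buckaroo is not installed.

```python
def make_pinned_rows(displayers):
    """Build pinned row configs from a {summary_key: displayer_name} mapping."""
    return [
        {"primary_key_val": key, "displayer_args": {"displayer": displayer}}
        for key, displayer in displayers.items()
    ]

pinned = make_pinned_rows({"dtype": "obj", "histogram": "histogram"})

try:
    import pandas as pd
    from buckaroo import BuckarooWidget
    # Assumption: pinned_rows is accepted at instantiation, per the text above.
    pinned_bw = BuckarooWidget(pd.DataFrame({"a": [1, 2, 3]}), pinned_rows=pinned)
except ImportError:
    pinned_bw = None  # Buckaroo not installed; the config list is still valid
```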

# Adding a post processing method

In [None]:
bw = BuckarooWidget(typed_df[:5])  #this throws a bunch of warnings
@bw.add_processing
def transpose(df):
    return df.transpose()
bw

In [None]:
class ValueCountPostProcessing(ColAnalysis):
    @classmethod
    def post_process_df(kls, df):
        dfs = []
        for c in df.columns:
            vc = df[c].value_counts()
            if len(vc) > 10:
                vc = vc.iloc[:10]
            tdf = pd.DataFrame({'%s_vals' %c:vc.index.values, '%s_counts'% c:vc.values})
            tdf = tdf.reindex(pd.RangeIndex(10))  # pad to 10 rows; reindex returns a new frame
            dfs.append(tdf)
        return [pd.concat(dfs, axis=1), {}]
    post_processing_method = "value_counts"

class ShowErrorsPostProcessing(ColAnalysis):
    @classmethod
    def post_process_df(kls, df):
        tdf = df.copy()
        tdf['errored_float'] = "some error"
        # keep the error message only where float_col < 20; use the df passed in, not the global
        tdf.loc[~tdf['float_col'].lt(20), 'errored_float'] = None
        extra_column_config = {
            'float_col': {'column_config_override': {
                'color_map_config': {
                    'color_rule': 'color_not_null',
                    'conditional_color': 'red',
                    'exist_column': 'errored_float'},
                'tooltip_config': { 'tooltip_type':'simple', 'val_column': 'errored_float'}}},
            'errored_float': {'column_config_override': {'merge_rule': 'hidden'}}}
        return (tdf, extra_column_config)
    post_processing_method = "show_errors"

# In this case we are going to extend BuckarooWidget so we can take this combination with us
base_a_klasses = BuckarooWidget.analysis_klasses.copy()
base_a_klasses.extend([ValueCountPostProcessing, 
                       ShowErrorsPostProcessing])
class VCBuckarooWidget(BuckarooWidget):
    analysis_klasses = base_a_klasses
vcb = VCBuckarooWidget(typed_df, debug=False)
vcb

## Where to use PostProcessing
Post processing functions are transformations that take no arguments other than the dataframe.  I can't think of many generic whole-dataframe operations.

`ValueCount` and `Transpose` are generic.  `ShowErrors` is tied to two specific columns, `float_col` and `errored_float`.

I expect Post processing to be very useful for small custom apps built on top of Buckaroo.  When you know the columns and you want a strict set of transforms, PostProcessing is a great fit.
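
As a sketch of that kind of fixed-schema transform, the function below follows the same contract as the `post_process_df` methods above: it takes only the dataframe and returns the new dataframe plus extra column config (empty here). `add_margin` and the `revenue` and `cost` columns are hypothetical names for illustration. To use it in Buckaroo it would be wrapped as a `post_process_df` classmethod on a `ColAnalysis` subclass, as in the cells above.

```python
import pandas as pd

def add_margin(df):
    """Whole-dataframe transform for a known schema.

    Returns (new_df, extra_column_config) and leaves the input untouched.
    """
    out = df.copy()
    out["margin"] = out["revenue"] - out["cost"]
    return out, {}

sales = pd.DataFrame({"revenue": [10.0, 20.0], "cost": [4.0, 25.0]})
result_df, extra_config = add_margin(sales)
print(result_df)
```

Copying before adding the column keeps the function pure, so toggling the post processing off again shows the original data unchanged.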



Post processing is also useful when combined with a preprocessing function to compare DataFrames.

Here is some pseudo code:
```python
# prefix_columns, split_columns and run_compare are placeholder helpers, not Buckaroo functions
class ComparePost(ColAnalysis):

    @classmethod
    def post_process_df(kls, df):
        # recover the two original frames from the prefixed column names
        df1, df2 = split_columns(df, "|")
        compare_df = run_compare(df1, df2)
        return [compare_df, {}]
    post_processing_method = 'compare'

class CompareWidget(BuckarooWidget):
    analysis_klasses = [ComparePost]

def compare(df1, df2):
    # join side by side so each row carries both versions of every column
    joined = pd.concat([prefix_columns(df1, 'df1|'), prefix_columns(df2, 'df2|')], axis=1)
    return CompareWidget(joined)

# run it like this
compare(sales_march_2022_df, sales_march_2023_df)
```
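
The helpers named in the pseudo code are left undefined. One possible pandas-only sketch of them follows. The bodies are assumptions made for illustration: in particular `run_compare` here just subtracts the shared numeric columns, and a real comparison could do something quite different. The small sales frames are made-up sample data.

```python
import pandas as pd

def prefix_columns(df, prefix):
    """Return a copy of df with prefix added to every column name."""
    return df.add_prefix(prefix)

def split_columns(df, sep="|"):
    """Split a joined frame into one frame per prefix, in order of first appearance.

    Assumes every column name has the form '<prefix><sep><name>'.
    """
    groups = {}
    for col in df.columns:
        prefix, name = col.split(sep, 1)
        groups.setdefault(prefix, {})[name] = df[col]
    return [pd.DataFrame(cols) for cols in groups.values()]

def run_compare(df1, df2):
    """Difference (df2 - df1) over the numeric columns both frames share."""
    shared = [c for c in df1.columns if c in df2.columns]
    numeric = [c for c in shared
               if pd.api.types.is_numeric_dtype(df1[c]) and pd.api.types.is_numeric_dtype(df2[c])]
    return df2[numeric] - df1[numeric]

march_2022 = pd.DataFrame({"units": [1, 2], "price": [3.0, 4.0]})
march_2023 = pd.DataFrame({"units": [2, 2], "price": [3.5, 3.0]})

# Join side by side, then round-trip through the split to recover both frames
joined = pd.concat([prefix_columns(march_2022, "df1|"),
                    prefix_columns(march_2023, "df2|")], axis=1)
df1, df2 = split_columns(joined, "|")
diff = run_compare(df1, df2)
print(diff)
```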

# Putting it all together

You can compose (combine) the PluggableAnalysisFramework, PostProcessing and Styling into a single widget.  And you can manipulate PostProcessing separately from Styling.

In [None]:
from buckaroo.customizations.analysis import (TypingStats, ComputedDefaultSummaryStats, DefaultSummaryStats)
from buckaroo.customizations.histogram import (Histogram)
from buckaroo.customizations.styling import DefaultSummaryStatsStyling, DefaultMainStyling

class KitchenSinkWidget(BuckarooWidget):
    #let's be explicit here and show all of the built in analysis klasses
    analysis_klasses = [
    TypingStats, DefaultSummaryStats,
    Histogram, ComputedDefaultSummaryStats,
    # default buckaroo styling
    DefaultSummaryStatsStyling, DefaultMainStyling,
    # our Quin99 analysis
    Quin99Analysis,  # adds a styling method
    #our PostProcessing classes
    ValueCountPostProcessing, ShowErrorsPostProcessing,
    #our styling methods
    EverythingStyling, AbrevStyling]
ksw = KitchenSinkWidget(typed_df)
ksw

# Why aren't there click handlers?

Buckaroo doesn't allow arbitrary click handlers, and this is by design.  When you allow arbitrary click handlers, you then have to manage state.  As you may have noticed, every method of extending Buckaroo is a pure function.  Managing application state is difficult and is the primary source of errors when building GUIs.

Buckaroo is designed purely around displaying DataFrames along with the most common operations that are performed on DataFrames.  If you want a more traditional app experience, right now you can use ipywidgets and embed Buckaroo in it.  Soon I will be releasing the DFViewer (the core component that shows the table) for Streamlit and Solara.

# What about auto cleaning and the low code UI?

Auto cleaning and the low code UI work together for more fine-grained editing of data.  The low code UI presents a GUI that works on columns and allows functions with arguments.

Auto cleaning suggests operations that are then loaded into the low code UI.  These operations can then be edited or removed.
Auto cleaning options can be cycled through to generate different cleanings.

## Why did this release remove auto cleaning and the low code UI?

Although auto cleaning and the low code UI are my favorite features of Buckaroo, and the first part I built, they don't seem to have gained traction with users.  Buckaroo, for that matter, hasn't gained a lot of traction.  For the time being I have decided to put more effort into refining and promoting the parts of Buckaroo that people do understand.