# Extending Buckaroo
Buckaroo is built for exploratory data analysis on unknown data. Data in the wild is incredibly varied, and so are the ways of visualizing it. Most table tools are built around a single bespoke customization with middle-of-the-road defaults. Buckaroo takes a different approach: it lets you build many highly specific configurations and then toggle between them quickly. This makes each configuration easier to build, because no single configuration has to solve for every possibility.

This document walks you through adding your own analysis to Buckaroo and letting users toggle it.

The extension points are:
* [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html): adds summary stats and column metadata for use by other steps
* [Styling](./styling-howto.ipynb): controls the visual display of the table
* PostProcessing: transforms an entire dataframe
* AutoCleaning: automates transformations such as dropping nulls, removing outliers and other pre-processing steps. It cleans the dataframe and generates Python code. Not yet supported in 0.6

Each extension point is composable and can be interactively mixed and matched.
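One way to picture the mixing and matching: each widget class carries a list of analysis classes, and a customized widget is a subclass with a longer list. The sketch below uses plain stand-ins to show why that list is copied before it is extended. `BaseWidget`, `CustomWidget` and the string entries are hypothetical; only the `analysis_klasses` attribute name comes from this notebook.

```python
# Stand-in for a widget class; real widgets hold analysis classes, not strings.
class BaseWidget:
    analysis_klasses = ['DefaultStats', 'DefaultStyling']

# Copy first so the base widget's class-level list is not mutated.
extended_klasses = BaseWidget.analysis_klasses.copy()
extended_klasses.extend(['EverythingStyling', 'AbrevStyling'])

class CustomWidget(BaseWidget):
    # The subclass sees the defaults plus the two extra entries.
    analysis_klasses = extended_klasses

print(BaseWidget.analysis_klasses)
print(CustomWidget.analysis_klasses)
```

Without the `.copy()`, `extend` would modify the list shared by every instance of the base widget, so the extra analyses would leak into widgets that never asked for them.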

In [None]:
import pandas as pd
import numpy as np
from buckaroo.dataflow_traditional import StylingAnalysis
from buckaroo.pluggable_analysis_framework.pluggable_analysis_framework import ColAnalysis
import polars as pl
from polars import functions as F
from buckaroo.polars_buckaroo import PolarsBuckarooWidget

In [None]:
ROWS = 200
typed_df = pl.DataFrame({'int_col':np.random.randint(1,50, ROWS), 'float_col': np.random.randint(1,30, ROWS)/.7,
                         'timestamp':["2020-01-01 01:00Z", "2020-01-01 02:00Z", "2020-02-28 02:00Z", "2020-03-15 02:00Z", None] * 40,
                         "str_col": ["foobar", "Realllllly long string", "", None, "normal"]* 40})
typed_df = typed_df.with_columns(timestamp=pl.col('timestamp').str.to_datetime() )

In [None]:
pbw = PolarsBuckarooWidget(typed_df)
pbw

# Adding a styling analysis
The `StylingAnalysis` class is used to control the display of a column based on the column metadata.  


Overriding `style_column(col:str, column_metadata:SingleColumnMetadata) -> ColumnConfig` lets you compute the config for a single column given that column's metadata.

This works with the [PluggableAnalysisFramework](https://buckaroo-data.readthedocs.io/en/latest/articles/pluggable.html): you can specify the summary fields that your styling requires. Declaring requirements this way guarantees that errors are spotted early.

StylingAnalysis works for both Polars and Pandas because it only receives a dictionary with simple Python values.
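Because the styling hook only sees a plain dictionary, its branching logic can be exercised without a dataframe or a widget. The following is a minimal sketch, assuming the metadata keys `is_integer` and `is_numeric` used elsewhere in this notebook. `sketch_style_column` is a hypothetical standalone function for illustration, not part of Buckaroo.

```python
from typing import Any, Dict


def sketch_style_column(col: str, column_metadata: Dict[str, Any],
                        digits: int = 3) -> Dict[str, Any]:
    """Map one column's metadata dict to a column config dict (illustrative only)."""
    if column_metadata.get('is_integer'):
        # Integers are shown with no fractional digits.
        disp = {'displayer': 'float', 'min_fraction_digits': 0, 'max_fraction_digits': 0}
    elif column_metadata.get('is_numeric'):
        # Other numeric columns get a fixed number of fractional digits.
        disp = {'displayer': 'float', 'min_fraction_digits': digits,
                'max_fraction_digits': digits}
    else:
        # Fall back to the generic object displayer.
        disp = {'displayer': 'obj'}
    return {'col_name': col, 'displayer_args': disp}


int_config = sketch_style_column('int_col', {'is_integer': True, 'is_numeric': True})
float_config = sketch_style_column('float_col', {'is_integer': False, 'is_numeric': True})
other_config = sketch_style_column('str_col', {'is_integer': False, 'is_numeric': False})
print(int_config, float_config, other_config, sep='\n')
```

The integer check comes first because an integer column is also numeric; reversing the branches would format integers with trailing decimals.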

In [None]:
from typing import Any

class EverythingStyling(StylingAnalysis):
    """
    This styling shows as much detail as possible
    """
    df_display_name = "everything"

    pinned_rows = [
        {'primary_key_val': 'dtype', 'displayer_args': {'displayer': 'obj' }}]


    #Styling analysis handles column iteration for us.
    
    #the type should be
    #def style_column(col:str, column_metadata: SingleColumnMetadata) -> ColumnConfig:
    @classmethod
    def style_column(kls, col:str, column_metadata: Any) -> Any:
        digits = 10
        if column_metadata['is_integer']:
            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}
        elif column_metadata['is_numeric']:
            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}
        # FIXME, because we don't have a DataFrame library agnostic way of saying "is_string" 
        # this styling analysis will only work with polars
        elif column_metadata['dtype'] == pl.String:
            disp = {'displayer': 'string'}
        elif column_metadata['dtype'] == pl.Datetime:
            disp =  {'displayer': 'datetimeDefault'}
        else:
            disp = {'displayer': 'obj'}
        return {'col_name':col, 'displayer_args': disp }

class AbrevStyling(StylingAnalysis):
    """
    This styling abbreviates values for a more compact display
    """
    df_display_name = "Abrev"

    pinned_rows = [
        {'primary_key_val': 'dtype', 'displayer_args': {'displayer': 'obj' }}]

    @classmethod
    def style_column(kls, col:str, column_metadata: Any) -> Any:
        digits = 3
        if column_metadata['is_integer']:
            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}
        elif column_metadata['is_numeric']:
            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}
        elif column_metadata['dtype'] == pl.Datetime:
            disp = {'displayer': 'datetimeLocaleString','locale': 'en-US',  'args': {}}
        elif column_metadata['dtype'] == pl.String:
            disp = {'displayer': 'string', 'max_length':15}
        else:
            disp = {'displayer': 'obj'}
        return {'col_name':col, 'displayer_args': disp }
base_a_klasses = PolarsBuckarooWidget.analysis_klasses.copy()
base_a_klasses.extend([EverythingStyling, AbrevStyling])
class EverythingAbrevWidget(PolarsBuckarooWidget):
    analysis_klasses = base_a_klasses
sbw = EverythingAbrevWidget(
    typed_df,
    #column_config_overrides={'timestamp':  {'displayer_args':  {  'displayer': 'datetimeDefault'}}}
)
sbw

In [None]:
bw_ = PolarsBuckarooWidget(
    typed_df, 
    column_config_overrides={
        'int_col': {'merge_rule': 'hidden'}})
bw_

Let's look at `pinned_rows`. They can be modified by setting `pinned_rows` when instantiating Buckaroo.
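As a rough illustration, a pinned row is a small dictionary naming a summary stat and the displayer used to render it. The sketch below builds the two pinned rows that appear in this notebook. `pinned_row` is a hypothetical helper, and passing the resulting list as a `pinned_rows` keyword argument to the widget is an assumption based on the sentence above rather than a confirmed signature.

```python
from typing import Any, Dict, List


def pinned_row(stat_name: str, displayer: str = 'obj') -> Dict[str, Any]:
    """Build one pinned-row config for the summary stat `stat_name` (hypothetical helper)."""
    return {'primary_key_val': stat_name, 'displayer_args': {'displayer': displayer}}


pinned_rows: List[Dict[str, Any]] = [
    pinned_row('dtype'),                             # shown with the generic 'obj' displayer
    pinned_row('histogram', displayer='histogram'),  # shown with the histogram displayer
]
print(pinned_rows)

# Assumed usage, not verified against the Buckaroo API:
# PolarsBuckarooWidget(typed_df, pinned_rows=pinned_rows)
```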

# Adding a post-processing method
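The post-processing classes in this notebook share one contract: `post_process_df` takes a dataframe and returns a two-element list holding the transformed dataframe and a dictionary of extra column config. The sketch below mirrors that contract in plain Python, using a dict of lists as a stand-in for a dataframe, to show the kind of table `ValueCountPostProcessing` aims for. `Frame`, `top_value_counts` and `sketch_post_process` are hypothetical names. This sketch orders values by frequency for readability, which the Polars expression does not necessarily do.

```python
from collections import Counter
from typing import Any, Dict, List

Frame = Dict[str, List[Any]]  # stand-in for a dataframe: column name -> values


def top_value_counts(frame: Frame, n: int = 10) -> Frame:
    """For each column, emit a val_<col> and a count_<col> column of length n."""
    result: Frame = {}
    for col, values in frame.items():
        most_common = Counter(values).most_common(n)
        # Pad with None so every output column has exactly n entries.
        padded = most_common + [(None, None)] * (n - len(most_common))
        result['val_' + col] = [value for value, _ in padded]
        result['count_' + col] = [count for _, count in padded]
    return result


def sketch_post_process(frame: Frame) -> List[Any]:
    """Same return shape as post_process_df: [transformed frame, extra column config]."""
    return [top_value_counts(frame, n=3), {}]


processed, extra_config = sketch_post_process({'str_col': ['a', 'b', 'a', 'a', 'b']})
print(processed)
print(extra_config)
```

Padding to a fixed length keeps every output column the same size, which is the role `null_on_oob=True` plays in the Polars version.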

In [None]:
from polars import functions as F
from buckaroo.pluggable_analysis_framework.polars_analysis_management import PolarsAnalysis

In [None]:

typed_df.select(F.all(),
                pl.col('float_col').lt(5).replace(True, "foo").replace(False, None).alias('errored_float'))

In [None]:
class ValueCountPostProcessing(PolarsAnalysis):
    @classmethod
    def post_process_df(kls, df):
        result_df = df.select(
            F.all().value_counts().implode().list.gather(pl.arange(0, 10), null_on_oob=True).explode().struct.rename_fields(['val', 'unused_count']).struct.field('val').prefix('val_'),
            F.all().value_counts().implode().list.gather(pl.arange(0, 10), null_on_oob=True).explode().struct.field('count').prefix('count_'))
        return [result_df, {}]
    post_processing_method = "value_counts"
    

class TransposeProcessing(ColAnalysis):
    @classmethod
    def post_process_df(kls, df):
        return [df.transpose(), {}]
    post_processing_method = "transpose"
class ShowErrorsPostProcessing(PolarsAnalysis):
    @classmethod
    def post_process_df(kls, df):
        # Keep every original column and add a marker column that is non-null
        # wherever float_col is below 5.
        result_df = df.select(
            F.all(),
            pl.col('float_col').lt(5).replace(True, "foo").replace(False, None).alias('errored_float'))
        # Config that would color float_col red wherever errored_float is not null.
        extra_column_config = {
            'index': {},
            'float_col': {'column_config_override': {
                'color_map_config': {
                    'color_rule': 'color_not_null',
                    'conditional_color': 'red',
                    'exist_column': 'errored_float'}}}}

        #return [result_df, extra_column_config]
        return [result_df, {}]

    post_processing_method = "show_errors"
    
    
base_a_klasses = PolarsBuckarooWidget.analysis_klasses.copy()
base_a_klasses.extend([#ValueCountPostProcessing, 
                       #TransposeProcessing, 
                       ShowErrorsPostProcessing])
class VCBuckarooWidget(PolarsBuckarooWidget):
    analysis_klasses = base_a_klasses
vcb = VCBuckarooWidget(typed_df, debug=False,
                      column_config_overrides={'float_col': {'color_map_config': {
                                'color_rule': 'color_not_null',
                                'conditional_color': 'red',
                                'exist_column': 'errored_float'}}}
                      )
vcb

In [None]:
class AdaptingStylingAnalysis(StylingAnalysis):
    # Summary stats that must be available before this styling can run.
    requires_summary = ["histogram", "is_numeric", "dtype", "is_integer"]
    pinned_rows = [
        {'primary_key_val': 'dtype', 'displayer_args': {'displayer': 'obj'}},
        {'primary_key_val': 'histogram', 'displayer_args': {'displayer': 'histogram'}}]

    @classmethod
    def style_column(kls, col, column_metadata):
        digits = 3
        if column_metadata['is_integer']:
            disp = {'displayer': 'float', 'min_fraction_digits':0, 'max_fraction_digits':0}
        elif column_metadata['is_numeric']:
            disp = {'displayer': 'float', 'min_fraction_digits':digits, 'max_fraction_digits':digits}
        else:
            disp = {'displayer': 'obj'}
        return {'col_name':col, 'displayer_args': disp }

base_a_klasses = PolarsBuckarooWidget.analysis_klasses.copy()
base_a_klasses.extend([AdaptingStylingAnalysis, ValueCountPostProcessing])
class ABuckarooWidget(PolarsBuckarooWidget):
    analysis_klasses = base_a_klasses
acb = ABuckarooWidget(typed_df)
acb