# How Buckaroo helps Data Scientists

We all hear terms like data science, AI, machine learning and Big Data. 

This presentation walks through some of the core tools practitioners use and how my tool, Buckaroo, helps their workflows.



# The tools that data scientists commonly use

This presentation is for my less technical friends and family.  Don't worry about the details of the code; this is meant to give a broad overview.

## Python

Python is the preferred language of data scientists. It is easy to learn because it uses simple keywords.

In [None]:
a = 5
def add_10(argument):
    return argument + 10
add_10(a)

## NumPy

Matrix math is at the heart of image recognition and machine learning.  The open source NumPy library extends Python with fast matrix math operations (25 to 100 times faster than raw Python).  [Travis Oliphant](https://en.wikipedia.org/wiki/Travis_Oliphant) wrote NumPy while earning a PhD in Electrical Engineering.  I worked for Travis from 2012 to 2014.
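To make the "matrix math" idea concrete, here is a small sketch that multiplies two tables of numbers twice: once with plain Python loops and once with NumPy. The helper name `matmul_plain` and the 60 by 60 size are made up for illustration, and the speedup you see will depend on your machine.

```python
import time
import numpy as np

def matmul_plain(a, b):
    """Multiply two matrices stored as lists of lists using plain Python loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for k in range(inner):
                total += a[i][k] * b[k][j]
            result[i][j] = total
    return result

# Two hypothetical 60x60 tables of random numbers
rng = np.random.default_rng(0)
a = rng.random((60, 60))
b = rng.random((60, 60))

start = time.perf_counter()
plain = matmul_plain(a.tolist(), b.tolist())
plain_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = a @ b  # NumPy's built-in matrix multiplication
numpy_seconds = time.perf_counter() - start

# Both approaches compute the same numbers; only the time taken differs
print(np.allclose(plain, fast))
print(f"plain Python: {plain_seconds:.4f}s, NumPy: {numpy_seconds:.6f}s")
```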

Here we see NumPy generating random numbers and performing a linear regression, then plotting the result with the open source Matplotlib library.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
rng = np.random.default_rng(1234)
# Generate data
x = rng.uniform(0, 10, size=100)
y = x + rng.normal(size=100)

# Initialize layout
fig, ax = plt.subplots(figsize = (9, 9))

# Add scatterplot
ax.scatter(x, y, s=60, alpha=0.7, edgecolors="k")

# Fit linear regression via least squares with numpy.polyfit
# It returns a slope (b) and intercept (a)
# deg=1 means linear fit (i.e. polynomial of degree 1)
b, a = np.polyfit(x, y, deg=1)

# Create sequence of 100 numbers from 0 to 10
xseq = np.linspace(0, 10, num=100)

# Plot regression line
ax.plot(xseq, a + b * xseq, color="k", lw=2.5);

## Pandas

In 2008 [Wes McKinney](https://wesmckinney.com/) started Pandas, built on top of NumPy, to help with his work as a quant researcher.
Pandas makes routine tasks with real-world data, such as cleaning dirty data and manipulating time series observations, much easier.
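As a rough illustration of what "cleaning dirty data" means, the sketch below tidies a tiny made-up table. The column names and values are hypothetical and are not taken from the Citi Bike file used later.

```python
import pandas as pd

# Hypothetical messy data: stray spaces, inconsistent capitalization,
# a missing station name, and a duration that is not a number
raw = pd.DataFrame({
    "station": [" Broadway ", "broadway", "Pier 40", None],
    "tripduration": ["300", "450", "not recorded", "120"],
})

clean = raw.copy()
# Normalize the text column so the same station is always spelled the same way
clean["station"] = clean["station"].str.strip().str.title()
# Convert durations to numbers; anything unparseable becomes NaN (missing)
clean["tripduration"] = pd.to_numeric(clean["tripduration"], errors="coerce")
# Keep only the rows where both values are present
clean = clean.dropna()
print(clean)
```

Only the two Broadway rows survive: the "Pier 40" row has no usable duration, and the last row has no station name.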

In [None]:
import pandas as pd
df = pd.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv")
df

In [None]:
# Once you have a dataset like this there are a lot of operations you might want to perform
df['tripduration'].mean()

In [None]:
df['start station name'].value_counts()

In [None]:
# Average trip duration for each start station
df.groupby('start station name')['tripduration'].mean()

In [None]:
df['tripduration'].hist()

In [None]:
# Trim the extreme 1% at each end so outliers don't dominate the histogram
trips = df['tripduration']
trips[(trips > trips.quantile(.01)) & (trips < trips.quantile(.99))].hist()
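The last cell keeps only the values between the 1st and 99th percentiles before drawing the histogram. Here is a minimal sketch of why that helps, using synthetic durations (the numbers are invented, not real trip data):

```python
import numpy as np
import pandas as pd

# Synthetic trip durations: mostly ordinary rides plus a few extreme outliers
rng = np.random.default_rng(0)
durations = pd.Series(np.concatenate([
    rng.uniform(200, 2000, size=1000),   # typical rides, in seconds
    [500_000, 750_000, 1_000_000],       # hypothetical extreme values
]))

# Find the cut-off points, then keep only what lies strictly between them
low, high = durations.quantile(.01), durations.quantile(.99)
trimmed = durations[(durations > low) & (durations < high)]

# The outliers stretch the full range but are gone after trimming
print(durations.max(), trimmed.max())
```

Without trimming, a histogram has to stretch its axis to fit the largest value, which squeezes all the ordinary rides into a single bar.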

## The Jupyter notebook

You are seeing this presentation in a Jupyter notebook.  Jupyter grew out of IPython, a project started by a grad student in 2001. Its notebook interface, first released in 2011, was designed to be used interactively and to leave an artifact that combines narrative text, analysis code, and graphics.  The interactivity is key to analyzing and manipulating data.

# Why I wrote Buckaroo

Thank you for bearing with me this far.

You have now seen the PyData ecosystem and a small sample of how it is used.
These are all powerful tools, but they are a bit cumbersome to use.  I look at several different datasets a day, and I want to understand them quickly.  I don't want to type a bunch of commands to get the overview I'm looking for, and I want to be able to look at the raw data.  Here is Buckaroo.

In [None]:
import pandas as pd
from buckaroo.buckaroo_widget import BuckarooWidget
df = pd.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv") 
df

## Adding color mapping by range in column


In [None]:
BuckarooWidget(
    df,
    column_config_overrides={
        'tripduration': {
            'color_map_config': {
                'color_rule': 'color_map',
                'map_name': 'BLUE_TO_YELLOW'}}})

## Adding a summary stat
Here we write a simple function to compute `is_monotonic`: does a column only go in a single direction (up or down) without changing direction?

In [None]:
from buckaroo.pluggable_analysis_framework.pluggable_analysis_framework import ColAnalysis
from buckaroo.dataflow.dataflow import StylingAnalysis
class MonotonicStat(StylingAnalysis):
    provides_defaults = dict(is_monotonic=False)
    
    @staticmethod
    def series_summary(sampled_ser, ser):
        return dict(is_monotonic=(ser.is_monotonic_increasing or ser.is_monotonic_decreasing))
    pinned_rows = [
        { 'primary_key_val': 'is_monotonic', 'displayer_args': {'displayer': 'obj'}}]
    df_display_name = "is_monotonic"
    data_key = "empty"
    
w = BuckarooWidget(df)
w.add_analysis(MonotonicStat)
w
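The check inside `series_summary` can be tried on its own with plain pandas, without Buckaroo. This is a small sketch on made-up columns; the standalone `is_monotonic` helper below is written for illustration and is not part of Buckaroo.

```python
import pandas as pd

def is_monotonic(ser):
    """True if the values never change direction (never fall, or never rise)."""
    return bool(ser.is_monotonic_increasing or ser.is_monotonic_decreasing)

rising = pd.Series([1, 2, 2, 5])     # never goes down (ties are allowed)
falling = pd.Series([9, 7, 4, 1])    # never goes up
zigzag = pd.Series([1, 3, 2, 4])     # changes direction

print(is_monotonic(rising), is_monotonic(falling), is_monotonic(zigzag))
# prints: True True False
```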

## Adding a post-processing function
We add a post-processing function that takes the diff of each numeric column (each row minus the previous row).  Summary stats are then run on this modified dataframe.

In [None]:
w = BuckarooWidget(df)
@w.add_processing
def numeric_diff(df):
    only_num_df = df.select_dtypes(include=np.number)
    return only_num_df.diff()
w
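To see what that processing step does to a table, here is the same pair of pandas calls applied to a tiny made-up dataframe, outside of Buckaroo. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed table: one text column and two numeric columns
trips = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "tripduration": [300, 450, 400, 900],
    "bikeid": [10, 11, 13, 13],
})

# Keep only the numeric columns; the text column cannot be subtracted
only_num = trips.select_dtypes(include=np.number)
# Each row becomes "this row minus the previous row"
diffed = only_num.diff()
print(diffed)
```

The first row of the result is empty (NaN) because it has no previous row to subtract.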

In [None]:
import polars as pl
pl.read_csv("/Users/paddy/code/example-notebooks/citibike-trips.csv") 

# That's Buckaroo

I designed it so that the base widget has sensible defaults.  Users can also easily plug in their own analysis code and toggle between views.

Because data in the wild is so diverse, there is not a single *correct* way to view it.  Instead of coming up with a single *best* default, Buckaroo allows multiple analyses to be quickly toggled through.

# Explaining Buckaroo to my non-technical friends

This presentation shows my side project, Buckaroo, and the tools it is built on top of.

## Background on Python and how it's used in data science

Before we dive into exciting visuals and interactive programs, let’s lay some groundwork.

Python is a popular programming language; you might also have heard of Java, C, and JavaScript.  Data scientists have come to rely on Python because it balances speed of execution (C is faster) with ease of use and learning.

### NumPy

Matrix math is at the heart of linear regression, image recognition, and AI like ChatGPT. NumPy was written by Travis Oliphant in 2006 to accelerate matrix operations in Python.  In many cases NumPy makes matrix operations 25 to 100 times faster than raw Python.

Here is NumPy working with the Matplotlib charting library.

### Pandas

In 2008 financial quant researcher Wes McKinney began developing Pandas, which made analysis of real-world data easier, time series data in particular.  Pandas was built on top of NumPy and allowed computations to be run on mixed datasets (you could have a set of temperature observations ordered by time of day, with a string column for location).

Pandas took what was already technically possible and increased the usability so more manipulations could be expressed in a natural way.
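The temperature example from the previous paragraph can be sketched in a few lines. The readings, locations, and column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical temperature readings: timestamps, text, and numbers in one table
readings = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 06:00", "2024-01-01 12:00",
        "2024-01-01 06:00", "2024-01-01 12:00",
    ]),
    "location": ["Central Park", "Central Park", "Brooklyn", "Brooklyn"],
    "temperature_f": [28.0, 39.0, 30.0, 41.0],
})

# Average temperature at each location
by_location = readings.groupby("location")["temperature_f"].mean()
# Warmest reading at each time of day
by_time = readings.groupby("time")["temperature_f"].max()

print(by_location)
print(by_time)
```

Each question takes one line, and the text column (`location`) and the time column work as grouping keys alongside the numeric data.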

The following code shows reading a 300,000-line CSV file about Citi Bike trips.

### The Jupyter notebook

I have been demonstrating the entire PyData ecosystem inside of the Jupyter notebook.  This is an interactive analysis and documentation environment built around Python.  In 2001, while a grad student, Fernando Perez wanted a better [interactive environment](https://en.wikipedia.org/wiki/IPython) for playing with data.  In 2011 the IPython project first released the [Jupyter notebook](https://en.wikipedia.org/wiki/Project_Jupyter) interface you see here.

Combining small snippets of analysis code with charts and narrative text allowed academics to write and share research in ways that were cumbersome before.
(Maybe show the traditional emacs/VS Code method of writing code.)  This is particularly important for data-intensive analysis.  You need to look at the data and play with it iteratively.  This interface works very well for the problems that data scientists and academics deal with every day.


This is data analysis.

A typical programmer (data scientist) would do these things.

We all hear terms like data science, big data, and AI.  Here is a brief view into what that actually looks like.

... I built Buckaroo to enhance those workflows; here it is.

In order for me to explain Buckaroo I have to give some background.

