# The column's the limit:
## Interactive laptop large data analysis with Buckaroo and Polars



By Paddy Mullen

paddy@paddymullen.com

https://github.com/paddymul/buckaroo

https://www.linkedin.com/in/paddymullen/

# The column's the limit:
## Interactive laptop large data analysis with Buckaroo and Polars

* Intro
* Quick demo of laptop large data
* Tour of Buckaroo and Why I built it
* Unique approach, unique challenge with engineering Buckaroo
* Explaining the common laptop large data problems in notebooks
* Walking through Buckaroo's approach to large data
* Demo time
* Adding your own analytics
* Polars/Lazy/pl_series hash brief intro
* Cool other things to do with the execution framework
* Conclusion

## Prior experience - Who has?
* run jupyter
* used pandas
* used polars
* seen the jupyter crash dialog
* used buckaroo
* left their notebook environment to view a dataframe in excel


# The dark ages of Pandas tables
Let's look at trip-by-trip NYC Citi Bike data from 2016.

In [None]:
import pandas as pd
pd.read_parquet("../citibike-trips-2016-04.parq")

In [None]:
df = pd.read_parquet("../citibike-trips-2016-04.parq")
df.sort_values("tripduration")

In [None]:
df.describe()

In [None]:
df['tripduration'].hist()

# Why I built Buckaroo

I knew how to manipulate dataframes and run summary stats; I just didn't want to have to type commands to do it every time.  I wanted a better tool and a better experience.

Look at the data, and have tools that make it easy to look at the data.  It's the most fundamental step of data analysis.


# Normal buckaroo

In [None]:
import pandas as pd
import polars as pl
import buckaroo
pd.read_parquet("../citibike-trips-2016-04.parq")

# Buckaroo has a different use context than most pandas/polars code
Most tutorials and example code are oriented around "Here's a dataset, here's a brief description of it; now, knowing that, let's operate on it."

Buckaroo lives in a different world.
Because Buckaroo is built to be the default dataframe display mechanism, it needs to just work.
It can get all manner of dataframes thrown at it: small, big, hundreds of columns, multi-indexes, dataframes with NaN/Infinity.
Ask me how I know that those can cause problems.

Even compared to most plotting libraries, Buckaroo operates in a different space.  Plotting libraries are built to be explicitly configurable.  Buckaroo is built to be opinionated, and to give you at least a quick understanding of data you have never seen before.

# Tradeoffs that dataframe display libraries make for big data
Where do they cut corners?

* `df.head()`
* sampling
* manual pagination
* no tradeoffs - send a huge payload and crash your browser
* require every common action to be a block of code so you can say "we only do what you ask"

# Why I decided to add Lazy processing to Buckaroo

* Got a call from my cousin
* Tell me about your use cases

# Clear the cache

In [None]:
from buckaroo.file_cache import cache_utils as cu
from buckaroo.read_utils import read_df, read
cu.clear_file_cache()

# Watch Buckaroo on a file too big to be processed in one go
This is a 450 MB parquet file of Uber trips in NYC in 2024. It has 19M rows.

In [None]:
import polars as pl
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
uber_trips_02_fname = "/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"  # a smaller data file
# scan_parquet returns a LazyFrame instead of loading the whole file
uber_trips_02_ldf = pl.scan_parquet(uber_trips_02_fname)
LazyInfinitePolarsBuckarooWidget(uber_trips_02_ldf, file_path=uber_trips_02_fname)

# How did Buckaroo deal with this large file?

## Lazy loading of data between the frontend and python
Buckaroo has been doing this for a year, with parquet data encoding
## Lazy loading of data off the disk
Show LazyInfinitePolarsBuckarooWidget with no summary stats
```python
pl.scan_parquet()
```
## Reliable execution in background processes with timeouts
show the background processing of summary stats
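
To make the idea concrete, here is a minimal sketch of running a computation in a separate process and giving up after a timeout. It is not Buckaroo's implementation; it only uses the standard library `subprocess` module, and the function name `run_stat_in_subprocess` and the result dictionary layout are made up for illustration. The point is that a slow or crashing stat cannot take the notebook kernel down with it.

```python
import json
import subprocess
import sys


def run_stat_in_subprocess(code, timeout_secs):
    """Run `code` in a fresh Python process (hypothetical helper).

    The code is expected to print a single JSON value to stdout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_secs,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return {"status": "timeout"}
    if proc.returncode != 0:
        # Keep only the last stderr line, usually the exception message
        lines = proc.stderr.strip().splitlines()
        return {"status": "error", "detail": lines[-1] if lines else ""}
    return {"status": "ok", "value": json.loads(proc.stdout)}


# A fast computation finishes normally
fast = run_stat_in_subprocess("import json; print(json.dumps(sum(range(10))))", 10)
# A slow computation is cut off, and the calling process carries on
slow = run_stat_in_subprocess("import time; time.sleep(5)", 0.2)
print(fast, slow)
```

A crash in the child (out of memory, a segfault in a native extension) shows up as an `error` result instead of a dead kernel.
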
## Caching of computed values
Polars extension for hashing series
Reload a large dataframe
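
A rough sketch of content-based caching, assuming a cache keyed by a hash of a column's values. Buckaroo uses a Polars extension for hashing series; the version below substitutes `hashlib` from the standard library, and `series_hash`, `cached_stat`, and the in-memory `_stats_cache` dict are hypothetical names for illustration only.

```python
import hashlib
import struct

# Hypothetical in-memory cache: (series_hash, stat_name) -> computed value
_stats_cache = {}


def series_hash(values):
    """Hash the contents of a numeric column (stand-in for a fast series hash)."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<d", float(v)))
    return h.hexdigest()


def cached_stat(values, stat_name, compute):
    """Return the cached stat for this column content, computing it at most once."""
    key = (series_hash(values), stat_name)
    if key not in _stats_cache:
        _stats_cache[key] = compute(values)
    return _stats_cache[key]


calls = []


def counting_sum(values):
    calls.append(1)  # record each real computation
    return sum(values)


first = cached_stat([1, 2, 3], "sum", counting_sum)
second = cached_stat([1, 2, 3], "sum", counting_sum)  # same content: cache hit
print(first, second, len(calls))
```

Because the key depends on the column's content and not on the variable name or file path, reloading the same data can reuse earlier results.
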
## The execution log
*Doc, it hurts when I do this*

In [None]:
from polars import functions as F
import polars.selectors as cs

from buckaroo.pluggable_analysis_framework.polars_analysis_management import PolarsAnalysis
from buckaroo.pluggable_analysis_framework.utils import json_postfix
from buckaroo.styling_helpers import obj_
from buckaroo.customizations.polars_analysis import NOT_STRUCTS

# Let's look again with some extra debugging tools


In [None]:
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
cu.clear_file_cache()
uber_trips_02_fname = "/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"
ldf = read_df(uber_trips_02_fname)
widget = LazyInfinitePolarsBuckarooWidget(ldf, file_path=uber_trips_02_fname, show_message_box=True)
widget

# Adding a stat at runtime
Buckaroo has the Pluggable Analysis Framework, which is built to allow easy construction of analytics that depend on each other.

In [None]:
class SumAnalysis(PolarsAnalysis):
    """
    Analysis that computes the sum of numeric columns.
    This uses a polars expression (select_clauses) that is executed.
    """
    provides_defaults = {'sum': 0}
    
    select_clauses = [
        cs.numeric().sum().name.map(json_postfix('sum')),
    ]
# Add SumAnalysis - this adds a polars expression
widget.add_analysis(
    SumAnalysis,
    pinned_row_configs=[obj_('sum')])

# Pluggable analysis framework
One of the early niceties I built into Buckaroo is the pluggable analysis framework.

How many times have you written some version of this code?
```python
from collections import defaultdict

funcs = {
    'mean': lambda x: x.mean(),
    'std': lambda x: x.std(),
}

# stats[column][stat_name] -> value
stats = defaultdict(dict)
for col in df.columns:
    for name, measure in funcs.items():
        stats[col][name] = measure(df[col])
```
It's clever for hacking around in a notebook; you can add stats easily.

Then it breaks in the middle of two nested for loops.

And you just move on because you have other work to do.
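
One way to keep a single bad column from killing the whole loop is to catch errors per (column, stat) pair and record them. The sketch below is plain Python with the standard library `statistics` module, not Buckaroo code; `safe_summary` and the sample columns are made up for illustration.

```python
import statistics
from collections import defaultdict


def safe_summary(columns, funcs):
    """Compute every stat for every column; record failures instead of raising."""
    stats = defaultdict(dict)
    errors = []
    for col_name, values in columns.items():
        for stat_name, func in funcs.items():
            try:
                stats[col_name][stat_name] = func(values)
            except Exception as exc:
                # Leave a placeholder and remember what went wrong
                stats[col_name][stat_name] = None
                errors.append((col_name, stat_name, repr(exc)))
    return dict(stats), errors


# Made-up sample data: one numeric column, one string column
columns = {
    "duration": [300, 620, 1250],
    "rider_type": ["member", "casual", "member"],
}
funcs = {"mean": statistics.mean, "std": statistics.stdev}

stats, errors = safe_summary(columns, funcs)
print(stats["duration"])
print(errors)
```

The numeric column still gets its stats, and the string column produces two logged errors instead of a traceback halfway through the loop.
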

# Show the PAF and explain how it creates a DAG


```python
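import numpy as np
import pandas as pd
# ColAnalysis comes from Buckaroo's pluggable analysis framework

class Variance(ColAnalysis):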
    provides_summary = ["variance"]
    requires_summary = ["mean"]
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        mean = summary_ser.get('mean', False)
        arr = ser.to_numpy()
        if mean is pd.NA or mean is np.nan or mean is False:
            return dict(variance="NA")
        if mean and pd.api.types.is_integer_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        elif mean and pd.api.types.is_float_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        return dict(variance="NA")
```
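
The provides/requires declarations above are enough to derive an execution order. Here is a minimal sketch of that idea using `graphlib.TopologicalSorter` from the standard library (Python 3.9+). It is an illustration, not the framework's actual code; the `analyses` dict, the `StdDev` entry, and `execution_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical analyses, described only by what they provide and require
analyses = {
    "StdDev": {"provides": ["std"], "requires": ["variance"]},
    "Variance": {"provides": ["variance"], "requires": ["mean"]},
    "Mean": {"provides": ["mean"], "requires": []},
}


def execution_order(analyses):
    """Order analyses so each one runs after the analyses it depends on."""
    # Map each stat name to the analysis that provides it
    provider_of = {
        stat: name
        for name, spec in analyses.items()
        for stat in spec["provides"]
    }
    # graph[node] = set of analyses that must run before node
    graph = {
        name: {provider_of[stat] for stat in spec["requires"]}
        for name, spec in analyses.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(analyses)
print(order)
```

In this sketch a requirement that nothing provides raises a `KeyError`, and a circular dependency raises `graphlib.CycleError`, so both kinds of mistake surface before any stat is computed.
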

# Execution Strategy
![execution_strategy_diagram.png](./execution_strategy_diagram.png)

In [None]:
!ls -alhstr ../2024-01-05_tripdata.parq

In [None]:
import os
os.getpid()

# Let's process the 2.5 GB file while I explain the implementation details
100M rows

In [None]:
from buckaroo.read_utils import read
read("../2024-01-05_tripdata.parq", show_message_box=True, timeout_secs=13.0)

# Conclusion

```
pip install buckaroo

import buckaroo
```

Give [buckaroo](https://github.com/paddymul/buckaroo) a star on GitHub, file a bug, get in touch, or tell me about the table problems you have.