# The column's the limit:
## Interactive laptop large data analysis with Buckaroo and Polars

* Describe the Buckaroo table widget and demonstrate it thoroughly
* Briefly explain why I built it
  * I want people to leave with the takeaway that my tools should do better for me, and that I should make better tools
* Cover some unique challenges I faced while building it because of the different use case (move more of this to the end of the presentation)
* explain the problems of dealing with laptop large data interactively
* show a high level example of the key pillars of Buckaroo's approach to laptop large data
* Technical dive into how buckaroo solves these large data problems
  * This is where we talk about a bit of rust and polars
* How you can use the execution framework
* Cool future areas of exploration that the execution framework enables
* conclusion - why hasn't this been built before


Ask about prior experience:
Jupyter, pandas, Polars, crashing Jupyter, opening Excel to see a tabular view of a dataframe


# The dark ages of Pandas tables
Let's look at trip-by-trip NYC Citi Bike data from 2016

In [None]:
import pandas as pd
pd.read_parquet("./citibike-trips-2016-04.parq")

In [None]:
df = pd.read_parquet("./citibike-trips-2016-04.parq")
df.sort_values("tripduration")

In [None]:
df.describe()

In [None]:
df['tripduration'].hist()

# Why I built Buckaroo

I knew how to manipulate dataframes and run summary stats, I just didn't want to have to type commands to do it every time.  I wanted a better tool and a better experience.

Look at the data, and have tools that make it easy to look at the data.  It's the most fundamental step of data analysis


# Normal buckaroo
show example that samples

In [None]:
import pandas as pd
import polars as pl
import buckaroo
pd.read_parquet("./citibike-trips-2016-04.parq")

# Buckaroo is a different use context than most pandas/polars code
Most tutorials and example code are oriented around "Here's a dataset, here's a slight description of it, now knowing that, let's operate on it"

Buckaroo lives in a different world.
Because Buckaroo is built to be the default dataframe display mechanism, it needs to just work.
It can get all manner of dataframes thrown at it: small, big, hundreds of columns, multi-indexes, dataframes with NaN/Infinity.
Ask me how I know that those can cause problems.

Even compared to most plotting libraries, Buckaroo operates in a different space.  Plotting libraries are built to be explicitly configurable.  Buckaroo is built to be opinionated, and to give you at least a quick understanding of data you have never seen before.

# Tradeoffs that dataframe display libraries make for big data
where do they cut corners

* `df.head()`
* sampling
* manual pagination
* no tradeoffs: send a huge payload and crash your browser
* require every common action to be a block of code so you can say "we only do what you ask"

# Clear the cache

In [None]:
from buckaroo.file_cache import cache_utils as cu
cu.clear_file_cache()

# Watch Buckaroo on a file too big to be processed in one go
This is a 450 MB parquet file of Uber trips in NYC in 2024; it has 19M rows

In [None]:
import logging
logging.getLogger("buckaroo").setLevel(logging.ERROR) 
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
import polars as pl
uber_trips_02_fname = "/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"
uber_trips_02_ldf = pl.scan_parquet(uber_trips_02_fname)  # a smaller data file
LazyInfinitePolarsBuckarooWidget(uber_trips_02_ldf, file_path=uber_trips_02_fname)  # optionally pass show_message_box=True

# How did buckaroo deal with this large file

## Lazy loading of data between the frontend and python
Buckaroo has been doing this for a year, with parquet data encoding
## Lazy loading of data off the disk
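show LazyInfinitePolarsBuckarooWidget with no summary stats
```python
ldf = pl.scan_parquet(path)   # builds a lazy query plan; nothing is read yet
ldf.sink_parquet(out_path)    # LazyFrame method: streams the result to disk
```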
## Reliable execution in background processes with timeouts
show the background processing of summary stats
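A rough sketch of the idea, using only the standard library's `subprocess` module rather than whatever Buckaroo uses internally. The helper name `run_isolated` and the result dictionary shape are assumptions for illustration. The point is that a computation that hangs or kills its process cannot take the notebook kernel down with it.

```python
import json
import subprocess
import sys


def run_isolated(code: str, timeout_secs: float) -> dict:
    """Run `code` in a fresh Python process and report what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_secs,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout"}
    if proc.returncode != 0:
        # The child died (exception, out of memory, hard exit); we survive.
        return {"status": "crashed", "returncode": proc.returncode}
    # The child prints its result as JSON on stdout.
    return {"status": "ok", "result": json.loads(proc.stdout)}


ok = run_isolated("import json; print(json.dumps({'mean': 2.0}))", timeout_secs=10)
slow = run_isolated("import time; time.sleep(30)", timeout_secs=0.5)
crashed = run_isolated("import os; os._exit(3)", timeout_secs=10)
```

All three calls return normally in the parent process, which is the property that matters for a default display widget.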
## Caching of computed values
reload a large dataframe
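One way to sketch file-based caching, assuming results are JSON-serializable and that a file's path, size, and modification time are a good enough fingerprint. The class `FileStatsCache` is hypothetical and is not the API of `buckaroo.file_cache`.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


class FileStatsCache:
    """Cache computed stats per file; a changed file gets a new cache key."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, path):
        st = os.stat(path)
        fingerprint = f"{os.path.abspath(path)}|{st.st_size}|{st.st_mtime_ns}"
        return hashlib.sha256(fingerprint.encode()).hexdigest()

    def get_or_compute(self, path, compute):
        """Return (result, was_cache_hit)."""
        entry = self.cache_dir / f"{self._key(path)}.json"
        if entry.exists():
            return json.loads(entry.read_text()), True
        result = compute(path)
        entry.write_text(json.dumps(result))
        return result, False


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "trips.bin"
    data_file.write_bytes(b"x" * 100)
    cache = FileStatsCache(Path(tmp) / "cache")

    def count_bytes(p):
        return {"size": os.path.getsize(p)}

    first, first_hit = cache.get_or_compute(data_file, count_bytes)
    second, second_hit = cache.get_or_compute(data_file, count_bytes)
    data_file.write_bytes(b"x" * 250)  # file changed, so the key changes
    third, third_hit = cache.get_or_compute(data_file, count_bytes)
```

The second call is served from the cache; the third recomputes because the file's size changed.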


# Let's look again with some extra debugging tools


In [None]:
from buckaroo.file_cache import cache_utils as cu
cu.clear_file_cache()
from buckaroo.read_utils import read, read_df
import polars as pl
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
uber_trips_02_fname = "/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"
ldf = read_df(uber_trips_02_fname)
widget = LazyInfinitePolarsBuckarooWidget(ldf, file_path=uber_trips_02_fname, show_message_box=True)
widget

# Pluggable analysis framework
One of the early niceties I built into Buckaroo is the pluggable analysis framework.

How many times have you written some version of this code?
```python
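from collections import defaultdict

# summary functions to apply to every column
funcs = {
    'mean': lambda x: x.mean(),
    'std': lambda x: x.std(),
}

# stats[column][stat_name] -> value
stats = defaultdict(dict)
for col in df.columns:
    for name, measure in funcs.items():
        stats[col][name] = measure(df[col])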
```
It's clever for hacking around in a notebook; you can add stats easily.

Then it breaks in the middle of two nested for loops.

And you just move on because you have other work to do.

# Show the PAF and explain how it creates a DAG


```python
class Variance(ColAnalysis):
    provides_summary = ["variance"]
    requires_summary = ["mean"]
    
    @staticmethod
    def summary(sampled_ser, summary_ser, ser):
        mean = summary_ser.get('mean', False)
        arr = ser.to_numpy()
        if mean is pd.NA or mean is np.nan or mean is False:
            return dict(variance="NA")
        if mean and pd.api.types.is_integer_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        elif mean and pd.api.types.is_float_dtype(ser):
            return dict(variance=np.mean((arr - mean)**2))
        return dict(variance="NA")
```
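To illustrate how `provides_summary` / `requires_summary` declarations can be turned into a DAG, here is a simplified sketch using the standard library's `graphlib`. The classes below are plain Python stand-ins, not real `ColAnalysis` subclasses, and `order_analyses` / `run_analyses` are hypothetical helpers rather than Buckaroo functions.

```python
from graphlib import TopologicalSorter


class Mean:
    provides_summary = ["mean"]
    requires_summary = []

    @staticmethod
    def summary(values, summary_so_far):
        return {"mean": sum(values) / len(values)}


class Variance:
    provides_summary = ["variance"]
    requires_summary = ["mean"]

    @staticmethod
    def summary(values, summary_so_far):
        mean = summary_so_far["mean"]
        return {"variance": sum((v - mean) ** 2 for v in values) / len(values)}


def order_analyses(analyses):
    """Order analyses so every requirement is computed before it is used."""
    provider = {name: a for a in analyses for name in a.provides_summary}
    # Map each analysis to the set of analyses it depends on.
    graph = {a: {provider[r] for r in a.requires_summary} for a in analyses}
    return list(TopologicalSorter(graph).static_order())


def run_analyses(analyses, values):
    summary = {}
    for analysis in order_analyses(analyses):
        summary.update(analysis.summary(values, summary))
    return summary


# Registration order does not matter; the DAG decides execution order.
stats = run_analyses([Variance, Mean], [1.0, 2.0, 3.0, 4.0])
```

`TopologicalSorter` also raises `CycleError` on circular requirements, which gives an early error instead of a failure deep inside nested loops.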

# Adding a stat at runtime
Buckaroo has the Pluggable Analysis Framework
which is built to allow easy construction of analytics that depend on each other.  

In [None]:
from polars import functions as F
import polars.selectors as cs

from buckaroo.pluggable_analysis_framework.polars_analysis_management import PolarsAnalysis
from buckaroo.pluggable_analysis_framework.utils import json_postfix
from buckaroo.styling_helpers import obj_
from buckaroo.customizations.polars_analysis import NOT_STRUCTS

class SumAnalysis(PolarsAnalysis):
    """
    Analysis that computes the sum of numeric columns.
    This uses a polars expression (select_clauses) that is executed.
    """
    provides_defaults = {'sum': 0}
    
    select_clauses = [
        cs.numeric().sum().name.map(json_postfix('sum')),
    ]
# Add SumAnalysis - this adds a polars expression
widget.add_analysis(
    SumAnalysis,
    pinned_row_configs=[obj_('sum')])


In [None]:
# TODO: unfinished sketch, the function body is not written yet
# @buckaroo.exec_expression
# def asdf(ldf): ...
#
# asdf(lazy_df)


In [None]:
import os
os.getpid()

In [None]:
!ls -alhstr 2024-01-05_tripdata.parq

# Execution Strategy
![execution_strategy_diagram.png](./execution_strategy_diagram.png)

# Let's process the 2.5 GB file while I explain the implementation details


In [None]:
from buckaroo.read_utils import read
read("2024-01-05_tripdata.parq")

# Other areas of exploration

Bisectors for finding bugs.


In [None]:
#multi process diagram, 
#screenshots of 

In [None]:
#talk about the polars plugin

In [None]:
# talk about the novelty of using multiprocessing not for parallelism but for reliability, and using multiprocessing one process at a time


In [None]:
# an example polars query that takes mean of foo and max of bar
df.select([pl.col('foo').mean(), 
           pl.col('bar').max()
          ])
# if the above failed, you could run fewer expressions
df.select([pl.col('foo').mean()])
df.select([pl.col('bar').max()])


# an example polars query that takes the mean of every numeric column
df.select([cs.numeric().mean()])
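The "run fewer expressions" idea can be automated by bisecting. Below is a minimal sketch, assuming a `run` callable that raises when a batch fails. Both `find_failing` and the stand-in `run` are hypothetical and use strings in place of real Polars expressions.

```python
def find_failing(exprs, run):
    """Return the expressions that fail on their own, by bisecting failing batches."""
    if not exprs:
        return []
    try:
        run(exprs)
        return []  # the whole batch succeeded
    except Exception:
        if len(exprs) == 1:
            return list(exprs)  # narrowed down to a single culprit
    mid = len(exprs) // 2
    return find_failing(exprs[:mid], run) + find_failing(exprs[mid:], run)


def run(batch):
    # Stand-in for something like df.select(batch); fails on "bad" expressions.
    for expr in batch:
        if "bad" in expr:
            raise ValueError(f"cannot evaluate {expr}")


culprits = find_failing(["foo.mean", "bar.max", "baz.bad", "qux.sum"], run)
```

This only finds expressions that fail in isolation. A batch that fails because of its combined cost, such as running out of memory, would need a different strategy, for example keeping the smaller batches that succeeded.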



When you're dealing with a large dataframe, instead of running

```python
df.select([pl.col('foo').mean(),
           pl.col('bar').max()
          ])
```
you can leverage Buckaroo's execution framework and run
```python
buckaroo.execute(df, [
           pl.col('foo').mean(),
           pl.col('bar').max()])
```

In [None]:
#pl.read_parquet("2024-01-05_tripdata.parq")  #this line will crash the ipython kernel eventually
read("2024-01-05_tripdata.parq")

In [None]:
combined = pl.concat([
    pl.scan_parquet("/Users/paddy/Downloads/fhvhv_tripdata_2024-01.parquet"),
    pl.scan_parquet("/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"),
    pl.scan_parquet("/Users/paddy/Downloads/fhvhv_tripdata_2024-03.parquet"),
    pl.scan_parquet("/Users/paddy/Downloads/fhvhv_tripdata_2024-04.parquet"),
    pl.scan_parquet("/Users/paddy/Downloads/fhvhv_tripdata_2024-05.parquet")
])
# Only executes when you write
#combined.sink_parquet("2024-01-05_tripdata.parq")

In [None]:
#pl.read_parquet("2024-01-05_tripdata.parq")
#crashes

In [None]:
from buckaroo.read_utils import read, read_df
import polars as pl
%time read("2024-01-05_tripdata.parq")

In [None]:
from buckaroo.read_utils import read, read_df
read("/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet")

In [None]:
import polars as pl
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
FNAME="/Users/paddy/Downloads/fhvhv_tripdata_2024-02.parquet"
ldf = read_df(FNAME)
bw1 = LazyInfinitePolarsBuckarooWidget(ldf, timeout_secs=10, file_path=FNAME)


In [None]:
ldf2 = ldf.select(pl.col(ldf.columns[3:10]))
bw = LazyInfinitePolarsBuckarooWidget(ldf2, timeout_secs=10, file_path=FNAME)
bw

In [None]:
# get people more up to speed with the problems I'm solving
# why Buckaroo is different


In [None]:
from buckaroo.lazy_infinite_polars_widget import LazyInfinitePolarsBuckarooWidget
from buckaroo.polars_buckaroo import PolarsBuckarooWidget
import polars as pl
# alternative: a 5000-row sample of the same file
# ldf = pl.scan_parquet("./citibike-trips-2016-04.parq").collect().sample(5000).lazy()
ldf = pl.scan_parquet("./citibike-trips-2016-04.parq")
bw = LazyInfinitePolarsBuckarooWidget(ldf)
bw

In [None]:
import pandas as pd
pd.read_parquet("./citibike-trips-2016-04.parq")

In [None]:
df = pl.read_parquet("./citibike-trips-2016-04.parq")
PolarsBuckarooWidget(df)

In [None]:
bw.df_data_dict['all_stats']

In [None]:
df

In [None]:
df = pl.read_parquet("./citibike-trips-2016-04.parq")
PolarsBuckarooWidget(df)

In [None]:
ldf = pl.scan_parquet("/Users/paddy/JULY_FULL2.parq")
bw = LazyInfinitePolarsBuckarooWidget(ldf, timeout_secs=60)
bw

In [None]:
ldf = pl.scan_parquet("/Users/paddy/3m_july.2.parq")
ldf = ldf.slice(0,350)
ldf = ldf.select([pl.col(n) for n in ldf.collect_schema().names()[:20]])

# ldf.collect()