# Benchmarking `Nested-Pandas` vs `Pandas`

This notebook offers timing comparisons between the `Nested-Pandas` MVP implementation and an equivalent `Pandas` workflow.


Tested on a dummy dataset of 1000 light curves with 1000 observations each.
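
The Parquet files themselves are not shown in this notebook. As a rough idea of the shape of the data, the sketch below builds a comparable pair of tables with plain `pandas` and `numpy`. The function name `make_dummy_data`, the value ranges, and the seed are assumptions loosely based on the outputs further down, not the generator actually used for these benchmarks.

```python
import numpy as np
import pandas as pd


def make_dummy_data(n_objects=1000, n_obs=1000, seed=0):
    """Build hypothetical object and source tables that share an integer index."""
    rng = np.random.default_rng(seed)

    # One row per object; the coordinate ranges are guesses.
    objects = pd.DataFrame(
        {
            "ra": rng.uniform(0.0, 20.0, n_objects),
            "dec": rng.uniform(0.0, 50.0, n_objects),
        }
    )

    # n_obs rows per object, linked back to the object table through the index.
    n_rows = n_objects * n_obs
    sources = pd.DataFrame(
        {
            "mjd": rng.uniform(0.0, 20.0, n_rows),
            "flux": rng.uniform(0.0, 100.0, n_rows),
            "band": rng.choice(["g", "r"], n_rows),
        },
        index=np.repeat(objects.index.to_numpy(), n_obs),
    )
    return objects, sources


# A small instance for illustration; the defaults match the sizes quoted above.
objects, sources = make_dummy_data(n_objects=10, n_obs=5)
print(objects.shape, sources.shape)
```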

In [1]:
from nested_pandas import NestedFrame, read_parquet
from nested_pandas.utils import count_nested
from nested_pandas.series import packer

import pandas as pd
import numpy as np

from light_curve import Amplitude
amplitude = Amplitude()

## Data Loading - Parquet Reading

### Pandas

In [2]:
%%timeit

object = pd.read_parquet("objects.parquet")
source = pd.read_parquet("ztf_sources.parquet").sort_index() # sorting the index is a more fair comparison

63.2 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Nested-Pandas

In [3]:
%%timeit
# Read in parquet data
nf = read_parquet(
    data="objects.parquet",
    to_pack={"ztf_sources": "ztf_sources.parquet"},  # auto-packs these source files
)

111 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [4]:
# actually load the data
object = pd.read_parquet("objects.parquet")
source = pd.read_parquet("ztf_sources.parquet")

source_sorted = pd.read_parquet("ztf_sources.parquet").sort_index()

nf = read_parquet(
    data="objects.parquet",
    to_pack={"ztf_sources": "ztf_sources.parquet"},  # auto-packs these source files
)

### Bottleneck: Index Sorting

Index sorting is a well-known slow operation in Pandas, and we currently do it implicitly in packing operations. Dask motivates sorting in almost all cases at scale (unless divisions are pre-calculated), so the one-time upfront cost in Nested-Pandas seems more reasonable in that context.
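
One way to avoid paying the sort cost more than once is to check the index before sorting. The following is a minimal pandas-only sketch of that idea; the helper name `ensure_sorted` and the synthetic data are hypothetical, and this is not a description of what the packing code does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical source table whose index (the object id) arrives shuffled.
ids = rng.permutation(np.repeat(np.arange(100), 10))
source = pd.DataFrame({"flux": rng.normal(size=ids.size)}, index=ids)


def ensure_sorted(df):
    """Return df unchanged if its index is already sorted, else a sorted copy."""
    if df.index.is_monotonic_increasing:
        return df
    # mergesort is stable, so rows keep their relative order within each id.
    return df.sort_index(kind="mergesort")


source_sorted = ensure_sorted(source)
print(source.index.is_monotonic_increasing, source_sorted.index.is_monotonic_increasing)
```

Calling `ensure_sorted` on an already sorted frame returns the same object, so the check costs a single pass over the index.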

In [5]:
%%timeit
source = pd.read_parquet("ztf_sources.parquet")

19.4 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%%time
source_sorted = pd.read_parquet("ztf_sources.parquet").sort_index() # sorting the index is a more fair comparison

CPU times: user 62.7 ms, sys: 10 ms, total: 72.8 ms
Wall time: 64.7 ms


#### Factor of ~3x slowdown in packing operation depending on sorted state

In [7]:
%%timeit
packer.pack_flat(source)

88.2 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
%%timeit
packer.pack_flat(source_sorted)

27.2 ms ± 52.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
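
The factor quoted in the heading above follows directly from the two packing timings. The short calculation below also compares the approximate cost of sorting with the time it saves in packing, using only numbers reported in this notebook. The read-and-sort figure comes from a single `%%time` run rather than `%%timeit`, so that part is only indicative.

```python
# Timings reported in this notebook, in milliseconds.
pack_unsorted_ms = 88.2
pack_sorted_ms = 27.2
read_only_ms = 19.4
read_and_sort_ms = 64.7  # single %%time wall-clock run, so only indicative

slowdown = pack_unsorted_ms / pack_sorted_ms
sort_cost_ms = read_and_sort_ms - read_only_ms
pack_saving_ms = pack_unsorted_ms - pack_sorted_ms

print(f"slowdown when packing unsorted input: {slowdown:.2f}x")
print(f"approximate cost of sorting: {sort_cost_ms:.1f} ms")
print(f"time saved by packing sorted input: {pack_saving_ms:.1f} ms")
```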


## Filtering

### Pandas

In [9]:
%%timeit
filtered_object = object.query("ra > 10.0")

#sync to source
filtered_source = filtered_object[[]].join(source, how="left")

22.5 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
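
The sync step in the cell above relies on joining an empty-column view of the filtered object table against the source table. A toy example with made-up values shows what that join keeps and drops:

```python
import pandas as pd

# Tiny hypothetical tables; the index plays the role of the object id.
objects = pd.DataFrame(
    {"ra": [5.0, 12.0, 17.5]},
    index=pd.Index([0, 1, 2], name="object_id"),
)
sources = pd.DataFrame(
    {"mjd": [1.0, 2.0, 1.5, 3.0, 0.5], "flux": [10.0, 11.0, 9.0, 9.5, 20.0]},
    index=pd.Index([0, 0, 1, 1, 2], name="object_id"),
)

# Filter the object table, then keep only the sources of surviving objects.
filtered_objects = objects.query("ra > 10.0")

# filtered_objects[[]] has the filtered index but no columns, so the left
# join returns just the source columns for the object ids that remain.
filtered_sources = filtered_objects[[]].join(sources, how="left")
print(filtered_sources)
```

Object `0` fails the `ra` cut, so its two source rows are absent from `filtered_sources`.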


### Nested-Pandas

In [10]:
%%timeit
nf.query("ra > 10.0")

5.19 ms ± 8.36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Utility Operations - Calculating Total Number of Observations

In [11]:
count_nested(nf, "ztf_sources")

| | ra | dec | ztf_sources | n_ztf_sources |
|---|---|---|---|---|
| 0 | 17.447868 | 35.547046 | mjd flux band 0 8.420511... | 1000 |
| 1 | 1.020437 | 4.353613 | mjd flux band 0 14.143429... | 1000 |
| 2 | 3.695975 | 31.130105 | mjd flux band 0 7.190259... | 1000 |
| 3 | 13.242558 | 6.099142 | mjd flux band 0 1.708140... | 1000 |
| 4 | 2.744142 | 48.444456 | mjd flux band 0 18.837824... | 1000 |
| ... | ... | ... | ... | ... |
| 995 | 6.547263 | 40.249140 | mjd flux band 0 4.055585... | 1000 |
| 996 | 18.391919 | 17.643616 | mjd flux band 0 10.358167... | 1000 |
| 997 | 18.587638 | 46.568135 | mjd flux band 0 3.871603... | 1000 |
| 998 | 10.871655 | 6.719466 | mjd flux band 0 0.886458... | 1000 |


### Pandas

In [12]:
%%timeit
nobs = source.groupby(level=0).apply(lambda x: len(x))

object.assign(nobs=nobs)

27 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
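
For reference, the per-object count computed with `apply(lambda x: len(x))` can also be written with the built-in `GroupBy.size()`, which gives the same counts without calling a Python function per group. The sketch below only checks that equivalence on toy data; this variant was not timed in this notebook, so no speed claim is made.

```python
import pandas as pd

# Tiny hypothetical tables; the index plays the role of the object id.
objects = pd.DataFrame({"ra": [5.0, 12.0]}, index=[0, 1])
sources = pd.DataFrame({"flux": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=[0, 0, 0, 1, 1])

# The form used in the benchmark cell above.
nobs_apply = sources.groupby(level=0).apply(lambda x: len(x))

# The built-in equivalent.
nobs_size = sources.groupby(level=0).size()

# assign aligns on the index, so each object gets its own count.
result = objects.assign(nobs=nobs_size)
print(result)
```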


### Nested-Pandas

In [13]:
%%timeit
count_nested(nf, "ztf_sources")

10.1 ms ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Utility Operations - Calculating Total Number of Observations By Band

In [14]:
count_nested(nf, "ztf_sources", by="band", join=True)

| | ra | dec | ztf_sources | n_ztf_sources_r | n_ztf_sources_g |
|---|---|---|---|---|---|
| 0 | 17.447868 | 35.547046 | mjd flux band 0 8.420511... | 507 | 493 |
| 1 | 1.020437 | 4.353613 | mjd flux band 0 14.143429... | 496 | 504 |
| 2 | 3.695975 | 31.130105 | mjd flux band 0 7.190259... | 496 | 504 |
| 3 | 13.242558 | 6.099142 | mjd flux band 0 1.708140... | 501 | 499 |
| 4 | 2.744142 | 48.444456 | mjd flux band 0 18.837824... | 501 | 499 |
| ... | ... | ... | ... | ... | ... |
| 995 | 6.547263 | 40.249140 | mjd flux band 0 4.055585... | 500 | 500 |
| 996 | 18.391919 | 17.643616 | mjd flux band 0 10.358167... | 497 | 503 |
| 997 | 18.587638 | 46.568135 | mjd flux band 0 3.871603... | 522 | 478 |
| 998 | 10.871655 | 6.719466 | mjd flux band 0 0.886458... | 453 | 547 |


### Pandas

In [15]:
%%timeit
band_counts = source.groupby(level=0).apply(lambda x: x[["band"]].value_counts().reset_index()).pivot_table(values="count", index="index", columns="band", aggfunc="sum")
object.join(band_counts[["g","r"]])

513 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
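
To make the shape of the by-band result easier to see, here is a small pandas-only sketch that produces one count column per band on toy data. It uses `value_counts` followed by `unstack` instead of the `apply` plus `pivot_table` chain benchmarked above; it is an alternative formulation for illustration, was not timed here, and the `n_` column prefix is an arbitrary choice.

```python
import pandas as pd

# Tiny hypothetical tables; the index plays the role of the object id.
objects = pd.DataFrame({"ra": [5.0, 12.0]}, index=[0, 1])
sources = pd.DataFrame(
    {"band": ["g", "r", "g", "r", "r"]},
    index=[0, 0, 0, 1, 1],
)

# Count rows per (object, band), then move the band level into columns.
band_counts = (
    sources.groupby(level=0)["band"]
    .value_counts()
    .unstack(fill_value=0)  # objects with no rows in a band get 0
)

# Attach the per-band counts to the object table.
result = objects.join(band_counts.add_prefix("n_"))
print(result)
```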


### Nested-Pandas

In [16]:
%%timeit
count_nested(nf, "ztf_sources", by="band", join=True)

211 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Applying Functions

In [17]:
from light_curve import Amplitude
amplitude = Amplitude()

### Pandas

In [18]:
%%timeit
source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))

63.7 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
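
The groupby-apply pattern above calls the feature function once per object with that object's time and flux arrays. The sketch below reproduces the pattern without the `light_curve` dependency. `half_amplitude` is a hypothetical stand-in that assumes the feature is half of the peak-to-peak range; it is not the `light_curve` implementation and returns a plain float.

```python
import numpy as np
import pandas as pd


def half_amplitude(t, m):
    """Stand-in feature: half of the peak-to-peak range of m (t is unused)."""
    return (np.max(m) - np.min(m)) / 2.0


# Tiny hypothetical source table; the index plays the role of the object id.
sources = pd.DataFrame(
    {"mjd": [1.0, 2.0, 3.0, 1.0, 2.0], "flux": [10.0, 14.0, 12.0, 5.0, 6.0]},
    index=[0, 0, 0, 1, 1],
)

# One function call per object, each receiving that object's arrays.
per_object = sources.groupby(level=0).apply(
    lambda x: half_amplitude(x["mjd"].to_numpy(), x["flux"].to_numpy())
)
print(per_object)
```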


### Nested-Pandas

In [19]:
%%timeit
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")

16 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Full Workflow

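Note: The computational cost in this example is dominated by the by-band observation count (nobs) calculation. In the Pandas benchmark above, that step alone took 513 ms, against 603 ms for the full Pandas workflow below.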

### Pandas

In [21]:
%%timeit

# Read data
object = pd.read_parquet("objects.parquet")
source = pd.read_parquet("ztf_sources.parquet").sort_index()

# Filter on object
filtered_object = object.query("ra > 10.0")
# Sync object to source: drops source rows whose index value is not in object
filtered_source = filtered_object[[]].join(source, how="left")

# Count nobs and add it to the object table
band_counts = source.groupby(level=0).apply(lambda x: x[["band"]].value_counts().reset_index()).pivot_table(values="count", index="index", columns="band", aggfunc="sum")
filtered_object = filtered_object.join(band_counts[["g","r"]])

# Filter on our nobs
filtered_object = filtered_object.query("g > 520")
filtered_source = filtered_object[[]].join(source, how="left")

# Calculate Amplitude
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))

603 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Nested-Pandas

In [22]:
%%timeit
# Read in parquet data
nf = read_parquet(
    data="objects.parquet",
    to_pack={"ztf_sources": "ztf_sources.parquet"},  # auto-packs these source files
)

# Filter on object
nf = nf.query("ra > 10.0")

# Count nobs and add it to the base layer
nf = count_nested(nf, "ztf_sources", by="band", join=True)

# Filter on our nobs
nf = nf.query("n_ztf_sources_g > 520")

# Calculate Amplitude
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")

223 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Full Workflow Without Counting By-Band

### Pandas

In [23]:
%%timeit

# Read data
object = pd.read_parquet("objects.parquet")
source = pd.read_parquet("ztf_sources.parquet").sort_index()

# Filter on object
filtered_object = object.query("ra > 10.0")
# Sync object to source: drops source rows whose index value is not in object
filtered_source = filtered_object[[]].join(source, how="left")

# Count total nobs
nobs = source.groupby(level=0).apply(lambda x: len(x))
filtered_object.assign(nobs=nobs)

# Calculate Amplitude
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))

139 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Nested-Pandas

In [24]:
%%timeit
# Read in parquet data
nf = read_parquet(
    data="objects.parquet",
    to_pack={"ztf_sources": "ztf_sources.parquet"},  # auto-packs these source files
)

# Filter on object
nf = nf.query("ra > 10.0")

# Count total nobs
nf = count_nested(nf, "ztf_sources", by=None)

# Calculate Amplitude
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")

134 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Nested-Pandas Operations

In [25]:
%%timeit
nf.ztf_sources.nest.to_flat()

15.4 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [26]:
%%timeit
nf.ztf_sources.nest.to_lists()

166 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
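
As a rough mental model for the two layouts timed above, the sketch below shows a flat table (one row per observation) and a list table (one row per object, with list-valued cells) in plain pandas, and converts between them. This is only an analogy on toy data. It says nothing about how Nested-Pandas stores nested columns or implements `to_flat` and `to_lists`.

```python
import pandas as pd

# Flat layout: one row per observation, object id repeated in the index.
flat = pd.DataFrame(
    {"mjd": [1.0, 2.0, 3.0], "flux": [10.0, 11.0, 12.0]},
    index=pd.Index([0, 0, 1], name="object_id"),
)

# List layout: one row per object, each cell holding a Python list.
lists = flat.groupby(level=0).agg(list)

# Back to flat: explode both list columns together, then restore the dtype.
round_trip = lists.explode(["mjd", "flux"]).astype(float)

print(lists)
print(round_trip)
```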


In [27]:
%%timeit
nf_object = NestedFrame(object)

nf_object.add_nested(source_sorted, "source") # Having to sort does dramatically increase the cost of this

29.3 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
