# `Nested-Pandas` MVP Demo

Goal: Present the motivation for the `nested-pandas` package and show a set of examples of how the current API works. Use this as a springboard to collect feedback from the team on:
* Clarity of API: Does the syntax intuitively reflect the actual data operation?
* Use-case coverage: Are there operations that you think this won't work well on?
* Alternatives: Any ideas for alternatives or improvements?

## Motivation

### Object and Source

With `TAPE`, we've been striving to build a package that tailors the typical usage paradigms of pandas towards time-domain use cases. The most fundamental aspect of this has been the relationship between "Object" and "Source":
* Object: Information describing properties of individual astrophysical objects in the sky
* Source: Time-stamped measurements/detections, which for our purposes we assume are associated with Objects in some way

In [1]:
import pandas as pd

object = pd.read_parquet("objects.parquet")
object

Unnamed: 0,ra,dec
0,17.447868,35.547046
1,1.020437,4.353613
2,3.695975,31.130105
3,13.242558,6.099142
4,2.744142,48.444456
...,...,...
995,6.547263,40.249140
996,18.391919,17.643616
997,18.587638,46.568135
998,10.871655,6.719466


In [2]:
source = pd.read_parquet("ztf_sources.parquet").sort_index()
source

Unnamed: 0_level_0,mjd,flux,band
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,8.420511,259.454128,r
0,14.442831,29.947062,g
0,17.276088,250.422340,r
0,11.874109,0.395589,g
0,15.418783,228.769717,g
...,...,...,...
999,1.983920,77.994248,g
999,18.761759,77.129061,g
999,14.686133,1.661199,g
999,7.810396,249.381225,r


### Traditional Pandas Solution, and Why It's Lacking

In Pandas, the recommended way to think about this dataset is as two separate tables. However, complete separation of these two leads to some friction:

#### 1. These two tables are more directly related in the context of user workflows/analysis

In [3]:
# When I filter `object`, I almost always want the sources of the
# dropped objects removed from `source` as well
object = object.query("ra > 5.0")

# I have to "sync" the two tables by hand to achieve this:
# an inner join on the (column-less) filtered object index
source = source.join(object[[]], how="inner")
source

Unnamed: 0,mjd,flux,band
0,8.420511,259.454128,r
0,14.442831,29.947062,g
0,17.276088,250.422340,r
0,11.874109,0.395589,g
0,15.418783,228.769717,g
...,...,...,...
999,1.983920,77.994248,g
999,18.761759,77.129061,g
999,14.686133,1.661199,g
999,7.810396,249.381225,r
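
The sync step is easy to forget, so here is a minimal sketch of it on toy data. The table contents and the names `objects`, `sources`, `synced_join`, and `synced_mask` are made up for illustration; the `index.isin` mask is shown only as one possible alternative to the empty-column join used above.

```python
import pandas as pd

# Toy object table: one row per object, indexed by object id (illustrative values)
objects = pd.DataFrame(
    {"ra": [17.4, 1.0, 13.2], "dec": [35.5, 4.4, 6.1]},
    index=[0, 1, 2],
)

# Toy source table: several rows per object, sharing the object id as index
sources = pd.DataFrame(
    {"mjd": [8.4, 14.4, 3.1, 9.9, 1.7], "flux": [259.5, 29.9, 12.0, 40.2, 172.7]},
    index=[0, 0, 1, 1, 2],
)

# Filtering the object table drops object 1 ...
filtered = objects.query("ra > 5.0")

# ... but the source table only follows if we sync it explicitly.
# Option A: inner join against the filtered index (no columns are added)
synced_join = sources.join(filtered[[]], how="inner")

# Option B: boolean mask on index membership
synced_mask = sources[sources.index.isin(filtered.index)]
```

Either way, the user has to remember to repeat this after every filter on the object table.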


#### 2. Sources in this case are frequently used as grouped timeseries

In [4]:
# We need to groupby any time we want to treat the sources as a set of timeseries.
# Recomputing the groups on every call can be costly.
source.groupby(level=0).apply(lambda x: max(x.flux))

0      299.924162
3      299.787405
6      299.945636
7      299.712976
8      299.743558
          ...    
995    299.784388
996    299.868455
997    299.908486
998    299.968485
999    299.105706
Length: 738, dtype: float64
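
As a small aside on the cost: for simple reductions like a maximum, pandas' built-in group aggregation avoids the Python-level lambda, although the grouping itself is still recomputed on each call. A sketch on made-up data (the names `sources`, `max_apply`, and `max_builtin` are illustrative):

```python
import pandas as pd

# Toy source table indexed by object id (illustrative values)
sources = pd.DataFrame(
    {"flux": [259.5, 29.9, 12.0, 40.2, 172.7]},
    index=[0, 0, 1, 1, 2],
)

# Same pattern as the cell above: a Python lambda applied per group
max_apply = sources.groupby(level=0).apply(lambda x: max(x.flux))

# Built-in aggregation on the selected column; no per-group Python call
max_builtin = sources.groupby(level=0)["flux"].max()
```

This helps for reductions pandas already provides, but not for arbitrary per-lightcurve functions.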

### What we've heard users want

In working with TAPE users, pretty much everyone (small n) has asked: "Can I just put my lightcurves in my object table?" The most straightforward idea is to have array columns like "ztf_flux" within the object table, but this is an anti-pattern within Pandas.

In [5]:
# It does work...
import numpy as np
array_df = pd.DataFrame({"a":[1,2], "b":[np.array([0,2,4]),np.array([5,10,15])]})
array_df

Unnamed: 0,a,b
0,1,"[0, 2, 4]"
1,2,"[5, 10, 15]"


In [6]:
# But what do you do with it...
# Most filtering operations are not going to work
# array_df.query("b > 3") # this fails

# apply is okay (but apply is also slow in Pandas)
array_df.apply(lambda x: x.b + 1, axis=1)
# array_df.apply(lambda x: len(x.b), axis=1)

0      [1, 3, 5]
1    [6, 11, 16]
dtype: object
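
One workaround in plain pandas is to explode the array column, filter the flat result, and re-pack it. The sketch below shows how much ceremony that takes for a single element-wise filter; the intermediate names `flat`, `kept`, and `repacked` are illustrative.

```python
import numpy as np
import pandas as pd

array_df = pd.DataFrame(
    {"a": [1, 2], "b": [np.array([0, 2, 4]), np.array([5, 10, 15])]}
)

# Explode the array column: one row per element, original index repeated
flat = array_df.explode("b").astype({"b": "int64"})

# Element-wise filtering now works on the flat table
kept = flat.query("b > 3")

# Re-pack the surviving elements into one list per original row
repacked = kept.groupby(level=0)["b"].agg(list)
```

Note that a row whose elements are all filtered out would vanish from `repacked` entirely, which is another thing the user has to handle by hand.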

So we want something that embeds nested data within a dataframe, but we don't want to lose our ability to actually interact with the data. This is the main motivator for `nested-pandas`.

## API Demo

In [7]:
from nested_pandas import NestedFrame, read_parquet
from nested_pandas.utils import count_nested

#### Load in Parquet Data

In [8]:
%%time
# Read in parquet data
nf = read_parquet(
    data="objects.parquet",
    # the source files listed here are automatically packed into nested columns
    to_pack={"ztf_sources": "ztf_sources.parquet", "ps1_sources": "ps1_sources.parquet"},
)
nf

CPU times: user 144 ms, sys: 21.4 ms, total: 166 ms
Wall time: 153 ms


Unnamed: 0,ra,dec,ztf_sources,ps1_sources
0,17.447868,35.547046,mjd flux band 0 8.420511...,mjd flux band 0 0.091356...
1,1.020437,4.353613,mjd flux band 0 14.143429...,mjd flux band 0 12.475696...
2,3.695975,31.130105,mjd flux band 0 7.190259...,mjd flux band 0 13.717712...
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...
4,2.744142,48.444456,mjd flux band 0 18.837824...,mjd flux band 0 18.139101...
...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 4.055585...,mjd flux band 0 5.474614...
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...


#### Selecting a Lightcurve

In [9]:
# select a lightcurve from ZTF
nf.loc[55]["ztf_sources"]

Unnamed: 0,mjd,flux,band
0,19.283298,3.190218,r
1,6.764094,164.424921,g
2,12.453877,77.067563,g
3,16.949276,48.557068,g
4,8.016122,275.400828,g
...,...,...,...
995,3.946276,201.948471,g
996,17.230772,41.307437,g
997,1.127104,294.265138,g
998,4.888028,7.452264,g


In [10]:
# Look at PS1 as well
nf.loc[55]["ps1_sources"]

Unnamed: 0,mjd,flux,band
0,8.628750,238.269158,g
1,6.200912,78.045097,g
2,15.564734,203.777988,g
3,17.045652,223.307389,i
4,12.279699,234.880814,g
...,...,...,...
295,19.857081,220.440118,i
296,9.867431,291.862953,i
297,18.930018,176.191457,g
298,16.147299,117.487382,g


#### Filtering

##### Normal Queries Work As Expected

In [11]:
%%time
# pre-filter on base columns
nf = nf.query("ra > 5.0")
nf

CPU times: user 11.9 ms, sys: 2.4 ms, total: 14.3 ms
Wall time: 13.5 ms


Unnamed: 0,ra,dec,ztf_sources,ps1_sources
0,17.447868,35.547046,mjd flux band 0 8.420511...,mjd flux band 0 0.091356...
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...
6,5.967232,45.613322,mjd flux band 0 0.781435...,mjd flux band 0 9.987625...
7,13.498501,11.653751,mjd flux band 0 10.141100...,mjd flux band 0 18.892070...
8,10.285046,49.291522,mjd flux band 0 0.438951...,mjd flux band 0 15.354670...
...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 4.055585...,mjd flux band 0 5.474614...
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...


##### Subset Queries with Hierarchical Column Names

In [12]:
%%time
# Filtering on a nested column only filters the nested rows; it does not filter the top-level dataframe!
nf_g = nf.query("ztf_sources.band == 'g'")
nf_g

CPU times: user 33.2 ms, sys: 6.08 ms, total: 39.3 ms
Wall time: 38.7 ms


Unnamed: 0,ra,dec,ztf_sources,ps1_sources
0,17.447868,35.547046,mjd flux band 0 9.252944...,mjd flux band 0 0.091356...
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...
6,5.967232,45.613322,mjd flux band 0 5.367633...,mjd flux band 0 9.987625...
7,13.498501,11.653751,mjd flux band 0 2.468269...,mjd flux band 0 18.892070...
8,10.285046,49.291522,mjd flux band 0 0.438951...,mjd flux band 0 15.354670...
...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 1.890532...,mjd flux band 0 5.474614...
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...


Top level dataframe rows are not filtered:

In [13]:
print(f" Length of nf: {len(nf)}")
print(f" Length of nf_g: {len(nf_g)}")

 Length of nf: 738
 Length of nf_g: 738


Nested ztf_sources are filtered:

In [14]:
print(f" Length of nf.ztf_sources: {nf.ztf_sources.nest.flat_length}")
print(f" Length of nf_g.ztf_sources: {nf_g.ztf_sources.nest.flat_length}")

 Length of nf.ztf_sources: 738000
 Length of nf_g.ztf_sources: 369052
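
To make these semantics concrete, here is a rough emulation with two plain pandas tables: the nested rows are filtered, while the top-level row count stays fixed. This is only an illustration of the behaviour shown above, not how `nested-pandas` implements it, and all data and names are made up. In particular, how `nested-pandas` represents an object whose nested rows are all removed is not shown in this demo; in the sketch such an object simply ends up with a count of zero.

```python
import pandas as pd

# Toy top-level table and flat source table (illustrative values)
objects = pd.DataFrame({"ra": [17.4, 13.2, 6.0]}, index=[0, 3, 6])
sources = pd.DataFrame(
    {"flux": [259.5, 29.9, 172.7, 98.4, 228.7], "band": ["r", "g", "g", "r", "r"]},
    index=[0, 0, 3, 3, 6],
)

# "Nested" filter: only source rows are removed, objects are left alone
g_sources = sources.query("band == 'g'")

# Per-object source counts before and after, aligned to the object index
n_before = sources.groupby(level=0).size().reindex(objects.index, fill_value=0)
n_after = g_sources.groupby(level=0).size().reindex(objects.index, fill_value=0)
```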


#### Other Filtering

In [15]:
%%time
# Drop NaNs from the timeseries struct;
# the dropna call is passed along to the ztf_sources struct
nf = nf.dropna(subset=["ztf_sources.flux", "ztf_sources.mjd"])
nf

CPU times: user 37.5 ms, sys: 7.23 ms, total: 44.7 ms
Wall time: 44.1 ms


Unnamed: 0,ra,dec,ztf_sources,ps1_sources
0,17.447868,35.547046,mjd flux band 0 8.420511...,mjd flux band 0 0.091356...
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...
6,5.967232,45.613322,mjd flux band 0 0.781435...,mjd flux band 0 9.987625...
7,13.498501,11.653751,mjd flux band 0 10.141100...,mjd flux band 0 18.892070...
8,10.285046,49.291522,mjd flux band 0 0.438951...,mjd flux band 0 15.354670...
...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 4.055585...,mjd flux band 0 5.474614...
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...


#### Custom Utility Functions - `count_nested`

In [16]:
from nested_pandas.utils import count_nested

count_nested(nf, "ztf_sources")

Unnamed: 0,ra,dec,ztf_sources,ps1_sources,n_ztf_sources
0,17.447868,35.547046,mjd flux band 0 8.420511...,mjd flux band 0 0.091356...,1000
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...,1000
6,5.967232,45.613322,mjd flux band 0 0.781435...,mjd flux band 0 9.987625...,1000
7,13.498501,11.653751,mjd flux band 0 10.141100...,mjd flux band 0 18.892070...,1000
8,10.285046,49.291522,mjd flux band 0 0.438951...,mjd flux band 0 15.354670...,1000
...,...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 4.055585...,mjd flux band 0 5.474614...,1000
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...,1000
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...,1000
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...,1000


In [17]:
%%time
# Calculate the number of observations per band for each nested column;
# these counts can then be used to filter objects
nf = count_nested(nf, "ztf_sources", by="band")
nf = count_nested(nf, "ps1_sources", by="band")
nf

CPU times: user 329 ms, sys: 15.1 ms, total: 345 ms
Wall time: 340 ms


Unnamed: 0,ra,dec,ztf_sources,ps1_sources,n_ztf_sources_r,n_ztf_sources_g,n_ps1_sources_g,n_ps1_sources_i
0,17.447868,35.547046,mjd flux band 0 8.420511...,mjd flux band 0 0.091356...,507,493,154,146
3,13.242558,6.099142,mjd flux band 0 1.708140...,mjd flux band 0 16.759764...,501,499,147,153
6,5.967232,45.613322,mjd flux band 0 0.781435...,mjd flux band 0 9.987625...,517,483,153,147
7,13.498501,11.653751,mjd flux band 0 10.141100...,mjd flux band 0 18.892070...,511,489,156,144
8,10.285046,49.291522,mjd flux band 0 0.438951...,mjd flux band 0 15.354670...,511,489,145,155
...,...,...,...,...,...,...,...,...
995,6.547263,40.249140,mjd flux band 0 4.055585...,mjd flux band 0 5.474614...,500,500,152,148
996,18.391919,17.643616,mjd flux band 0 10.358167...,mjd flux band 0 11.889307...,497,503,155,145
997,18.587638,46.568135,mjd flux band 0 3.871603...,mjd flux band 0 16.421570...,522,478,147,153
998,10.871655,6.719466,mjd flux band 0 0.886458...,mjd flux band 0 14.044775...,453,547,137,163
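
For comparison, a per-band count like this can be approximated on a flat source table with plain pandas. The sketch below is an emulation on made-up data, not the implementation of `count_nested`; the index name `object_id` is hypothetical, and the column naming simply mimics the `n_ztf_sources_<band>` pattern seen in the output above.

```python
import pandas as pd

# Toy flat source table; "object_id" is an illustrative index name
sources = pd.DataFrame(
    {"band": ["r", "g", "r", "g", "g"]},
    index=pd.Index([0, 0, 0, 3, 3], name="object_id"),
)

# Count rows per (object, band) and pivot the bands into columns
counts = (
    sources.reset_index()
    .groupby(["object_id", "band"])
    .size()
    .unstack(fill_value=0)
)

# Mimic the column naming seen in the count_nested output
counts.columns = [f"n_ztf_sources_{band}" for band in counts.columns]
```

The result then has to be joined back onto the object table, which is the step the nested version folds into one call.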


#### Applying Custom Functions with `reduce`

In [27]:
from light_curve import Amplitude
amplitude = Amplitude()

amplitude(np.array([1.,2.]), np.array([3.,4.]))

array([0.5])
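
For readers without `light_curve` installed, the feature can be approximated in NumPy. The sketch below assumes that `Amplitude` is half the difference between the maximum and minimum of the second argument, which is consistent with the `0.5` returned above; `half_amplitude` is a hypothetical stand-in, and the real extractor may differ in edge cases.

```python
import numpy as np


def half_amplitude(t, m):
    """Hypothetical stand-in: half the peak-to-peak range of m.

    t is accepted only to mirror the (time, value) call signature
    used above; it does not enter the calculation.
    """
    m = np.asarray(m, dtype=float)
    return np.array([0.5 * (m.max() - m.min())])


# Same inputs as the cell above
result = half_amplitude(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
```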

In [19]:
%%time
# Filter down to just r-band ztf_sources
nf_r = nf.query("ztf_sources.band == 'r'")

# Apply a function to the r-band ztf_sources
r_amp = nf_r.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")

r_amp

CPU times: user 43.1 ms, sys: 6.52 ms, total: 49.6 ms
Wall time: 48.8 ms


0      [149.78651880558448]
3       [149.4018127546554]
6      [149.05988824872153]
7       [149.4929899669729]
8       [149.3788120046465]
               ...         
995     [149.6365531697762]
996     [149.5770408104668]
997    [149.78161224181568]
998    [148.51112913708582]
999     [148.7918197517282]
Length: 738, dtype: object

In [40]:
def test(*args):
    return args

nf.reduce(test, "ra", "ztf_sources.mjd")

0      (17.447868083279175, [8.420511002293274, 5.327...
3      (13.24255766301299, [1.7081396490043477, 7.383...
6      (5.967231875394647, [0.7814348397143456, 5.367...
7      (13.49850119715583, [10.141100190324941, 2.468...
8      (10.285045920927857, [0.43895128669361405, 15....
                             ...                        
995    (6.547262910599201, [4.05558524554144, 10.5316...
996    (18.39191942589801, [10.358166660800896, 11.90...
997    (18.58763783178845, [3.871602591562535, 4.1855...
998    (10.871654686110258, [0.886458380208146, 3.835...
999    (15.46698176670599, [15.703350291127693, 10.93...
Length: 738, dtype: object
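
The output above suggests that `reduce` hands the function one scalar per base column and one array per nested column, once per object. A rough emulation of that calling convention with two plain pandas tables follows; `reduce_sketch` and `describe` are hypothetical helpers on made-up data, and this is not the library's implementation.

```python
import pandas as pd

# Toy top-level table and flat source table (illustrative values)
objects = pd.DataFrame({"ra": [17.4, 13.2]}, index=[0, 3])
sources = pd.DataFrame(
    {"mjd": [8.4, 5.3, 1.7], "flux": [259.5, 217.7, 172.7]},
    index=[0, 0, 3],
)


def reduce_sketch(func, objects, sources, base_col, nested_col):
    """Call func(scalar_from_base_col, array_from_nested_col) once per object."""
    results = {}
    for obj_id, group in sources.groupby(level=0):
        base_value = objects.loc[obj_id, base_col]
        nested_values = group[nested_col].to_numpy()
        results[obj_id] = func(base_value, nested_values)
    return pd.Series(results)


def describe(ra, mjd):
    # Receives one scalar and one array per object
    return f"ra={ra:.1f}, n_obs={len(mjd)}"


out = reduce_sketch(describe, objects, sources, "ra", "mjd")
```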

## Internals/Lower-Level API

### `nest` accessor

Extending Pandas with a custom access interface

In [20]:
# custom accessor for series objects
nf.ztf_sources.nest

<nested_pandas.series.accessor.NestSeriesAccessor at 0x1640ba7d0>

In [41]:
%%time
# The accessor lets us define a custom API for working with this series of data
nf["ztf_sources"].nest.to_flat()  # view the data as a "flat" source table

CPU times: user 16 ms, sys: 7.38 ms, total: 23.4 ms
Wall time: 23.4 ms


Unnamed: 0,mjd,flux,band
0,8.420511,259.454128,r
0,5.327694,217.656239,r
0,9.252944,223.816007,g
0,12.441739,208.315431,r
0,9.949815,211.904192,g
...,...,...,...
999,13.841622,154.950267,g
999,1.079974,50.438131,g
999,7.387044,257.837608,r
999,2.678137,181.629844,g


In [22]:
%%time
# View as lists
nf.ztf_sources.nest.to_lists() # this is particularly useful (and fast) for timeseries-style calculations

CPU times: user 474 µs, sys: 33 µs, total: 507 µs
Wall time: 487 µs


Unnamed: 0,mjd,flux,band
0,[ 8.420511 5.3276943 9.2529439 12.441739...,[2.59454128e+02 2.17656239e+02 2.23816007e+02 ...,['r' 'r' 'g' 'r' 'g' 'r' 'g' 'g' 'r' 'g' 'g' '...
3,[ 1.70813965 7.3833651 19.42070782 8.084487...,[172.69924296 284.60001822 114.41554808 98.44...,['g' 'r' 'r' 'g' 'r' 'r' 'r' 'g' 'g' 'g' 'r' '...
6,[7.81434840e-01 5.36763323e+00 7.86407256e+00 ...,[2.28707282e+02 1.90537067e+02 4.08044850e+01 ...,['r' 'g' 'r' 'g' 'r' 'g' 'g' 'g' 'r' 'g' 'r' '...
7,[10.14110019 2.46826862 12.4773163 7.294826...,[157.93930088 230.39931904 136.27273325 113.15...,['r' 'g' 'g' 'g' 'g' 'r' 'r' 'r' 'r' 'g' 'r' '...
8,[ 0.43895129 15.26494194 11.63733181 9.643635...,[1.92750977e+02 2.87939293e+02 2.81467531e+01 ...,['g' 'g' 'r' 'r' 'r' 'g' 'r' 'r' 'g' 'r' 'g' '...
...,...,...,...
995,[4.05558525e+00 1.05316419e+01 4.17767712e+00 ...,[ 26.40333204 20.47679516 93.36638403 223.92...,['r' 'r' 'r' 'g' 'r' 'r' 'g' 'r' 'r' 'g' 'r' '...
996,[1.03581667e+01 1.19045379e+01 1.30105173e+01 ...,[224.49828478 243.27675116 204.41242598 169.58...,['g' 'g' 'r' 'r' 'g' 'r' 'g' 'g' 'r' 'g' 'g' '...
997,[ 3.87160259 4.18551565 18.37020572 19.365806...,[ 69.54518768 57.59172843 28.25031788 146.36...,['g' 'r' 'r' 'r' 'g' 'r' 'g' 'r' 'g' 'g' 'g' '...
998,[8.86458380e-01 3.83515191e+00 9.48565195e-01 ...,[2.60842240e+02 6.61190294e+01 2.23921192e+02 ...,['g' 'g' 'g' 'r' 'g' 'r' 'r' 'g' 'r' 'g' 'r' '...
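
The relationship between the flat view and the list view can be illustrated with plain pandas on made-up data. This is only a sketch of the two shapes, not how the `nest` accessor produces them; multi-column `explode` requires pandas 1.3 or later.

```python
import pandas as pd

# Flat view: one row per source, object id repeated in the index
flat = pd.DataFrame(
    {"mjd": [8.4, 5.3, 1.7], "flux": [259.5, 217.7, 172.7]},
    index=[0, 0, 3],
)

# List view: one row per object, each cell holding that object's values
lists = flat.groupby(level=0).agg(list)

# And back again: explode all list columns together, then restore the dtype
back = lists.explode(["mjd", "flux"]).astype("float64")
```

In plain pandas each conversion is a full pass over the data, whereas the timing above shows the list view coming back in well under a millisecond.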


### NestedExtensionArray
An array datatype that provides a custom API to the array objects.

In [23]:
# Extension of ArrowExtensionArray
nf.ztf_sources.array[0:2]

<NestedExtensionArray>
[           mjd        flux band
0     8.420511  259.454128    r
1     5.327694  217.656239    r
2     9.252944  223.816007    g
3    12.441739  208.315431    r
4     9.949815  211.904192    g
..         ...         ...  ...
995   8.040372  256.644230    g
996   6.575870   98.900117    g
997  12.866500  173.803123    g
998  14.993332   23.661618    g
999  12.126280    5.207718    r

[1000 rows x 3 columns],            mjd        flux band
0     1.708140  172.699243    g
1     7.383365  284.600018    r
2    19.420708  114.415548    r
3     8.084488   98.443232    g
4    13.443821  206.188415    r
..         ...         ...  ...
995  12.652365  152.645766    r
996  13.859081  184.753689    g
997   5.237089   77.442657    r
998  17.233947   15.467742    g
999   9.221389  107.427127    g

[1000 rows x 3 columns]]
Length: 2, dtype: nested<mjd: [double], flux: [double], band: [string]>

## Filtering API Alternatives?

We're most worried about the filtering API. Are there better alternatives to shoot for?

In [None]:
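# Options for querying the nested dataframes
# (`df` stands for a NestedFrame with a nested column "dia")

# 1: hierarchical column name inside a top-level query (as used in this demo)
df = df.query("dia.flux > 2.0")
# 2.1: frame-level `nest` accessor taking the nested column and the query
df = df.nest.query("dia", "flux > 2.0")
# 2.2: frame-level `nest` accessor indexed by the nested column
df = df.nest["dia"].query("flux > 2.0")
# 3.1: per-lightcurve apply on the nested column
df["dia"] = df["dia"].apply(lambda lc: lc.query("flux > 2.0"))
# 3.2: series-level `nest` accessor query on the nested column
df["dia"] = df["dia"].nest.query("flux > 2.0")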
