# Introduction to Nested-Pandas

This notebook introduces the Nested-Pandas API. It covers the basics of nesting data and tours the main ways of working with nested data.

## Exploring the NestedFrame Interface

In [20]:
import nested_pandas as npd
import light_curve as licu
import pandas as pd
import numpy as np

from nested_pandas.utils import count_nested

In [None]:
# Load an example dataset
from nested_pandas.datasets import generate_data
ndf = generate_data(50, 100, seed=1)
ndf

|   | a | b | nested (first row: t, flux, band) | nested size |
|---|----------|----------|------------------------------|----------------------|
| 0 | 0.417022 | 0.038734 | 6.532898, 77.388964, r | 100 rows × 3 columns |
| 1 | 0.720324 | 1.357671 | 10.541162, 82.03493, r | 100 rows × 3 columns |
| 2 | 0.000114 | 0.423256 | 17.718842, 85.162961, r | 100 rows × 3 columns |
| 3 | 0.302333 | 0.531093 | 7.145395, 15.152488, r | 100 rows × 3 columns |
| 4 | 0.146756 | 0.983146 | 18.170703, 89.003664, r | 100 rows × 3 columns |
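
To make the layout of the `nested` column concrete, here is a minimal, library-free sketch of the idea: every top-level row carries its own small table with `t`, `flux`, and `band` columns. The `make_toy_nested` helper and its random values are hypothetical. They do not reproduce `generate_data` or the way Nested-Pandas actually stores nested data.

```python
import random


def make_toy_nested(n_base, n_layer, seed=None):
    """Hypothetical helper: build n_base rows, each holding an n_layer-row nested table."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_base):
        # One small column-oriented table per top-level row
        nested = {
            "t": [rng.uniform(0, 20) for _ in range(n_layer)],
            "flux": [rng.uniform(0, 100) for _ in range(n_layer)],
            "band": [rng.choice(["g", "r"]) for _ in range(n_layer)],
        }
        rows.append({"a": rng.random(), "b": rng.uniform(0, 2), "nested": nested})
    return rows


toy = make_toy_nested(5, 100, seed=1)

# (nested rows, nested columns) for each top-level row
shapes = [(len(row["nested"]["t"]), len(row["nested"])) for row in toy]
print(shapes)
```

Each top-level row reports 100 nested rows and 3 nested columns, mirroring the "100 rows × 3 columns" summary in the output above.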


## Approaches for Nesting Data

## Performance Comparisons to Pandas

### Taking Native Pandas Code

In [17]:
%%timeit

# Read data
object_df = pd.read_parquet("data/objects.parquet")
source_df = pd.read_parquet("data/ztf_sources.parquet")

# Filter on object
filtered_object = object_df.query("ra > 10.0")
# Sync object to source: drops source rows whose index is not found in object
filtered_source = filtered_object[[]].join(source_df, how="left")

# Count number of observations per photometric band and add it to the object table
band_counts = (
    source_df.groupby(level=0)
    .apply(lambda x: x[["band"]].value_counts().reset_index())
    .pivot_table(values="count", index="index", columns="band", aggfunc="sum")
)
filtered_object = filtered_object.join(band_counts[["g","r"]])

# Filter on our nobs
filtered_object = filtered_object.query("g > 520")
filtered_source = filtered_object[[]].join(source_df, how="left")

# Calculate Amplitude
amplitude = licu.Amplitude()
filtered_source.groupby(level=0).apply(lambda x: amplitude(np.array(x.mjd), np.array(x.flux)))

530 ms ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Simplified and Faster with Nested-Pandas

In [21]:
%%timeit

# Read in parquet data,
# nesting sources into objects
nf = npd.read_parquet(data="data/objects.parquet",
                      to_pack={"ztf_sources": "data/ztf_sources.parquet"})

# Filter on object
nf = nf.query("ra > 10.0")

# Count number of observations per photometric band and add it as a column
nf = count_nested(nf, "ztf_sources", by="band", join=True)

# Filter on our nobs
nf = nf.query("n_ztf_sources_g > 520")

# Calculate Amplitude
amplitude = licu.Amplitude()
nf.reduce(amplitude, "ztf_sources.mjd", "ztf_sources.flux")

251 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
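
As a rough illustration of what the counting and reduce steps in the cell above do per object, here is a library-free sketch on toy data. It assumes that the amplitude feature is half the peak-to-peak range of the flux values, which is our understanding of `licu.Amplitude` and is not verified here. The names `toy_sources`, `count_by_band`, `half_amplitude`, and `reduce_rows` are hypothetical and are not part of Nested-Pandas or `light_curve`.

```python
from collections import Counter

# Hypothetical toy data: object id -> nested source rows of (mjd, flux, band)
toy_sources = {
    1: [(58000.0, 10.0, "g"), (58001.0, 14.0, "g"), (58002.0, 12.0, "r")],
    2: [(58000.0, 5.0, "r"), (58003.0, 11.0, "r")],
}


def count_by_band(rows):
    """Count nested rows per photometric band for one object."""
    return Counter(band for _, _, band in rows)


def half_amplitude(mjd, flux):
    """Assumed amplitude definition: (max - min) / 2. mjd is unused but kept
    so the call pattern matches a (time, value) feature function."""
    return (max(flux) - min(flux)) / 2


def reduce_rows(nested, func):
    """Apply func(mjd, flux) once per object and collect the results."""
    out = {}
    for obj_id, rows in nested.items():
        mjd = [row[0] for row in rows]
        flux = [row[1] for row in rows]
        out[obj_id] = func(mjd, flux)
    return out


band_counts = {obj_id: count_by_band(rows) for obj_id, rows in toy_sources.items()}
amplitudes = reduce_rows(toy_sources, half_amplitude)
print(band_counts[1]["g"], amplitudes)
```

The point of the sketch is the shape of the computation: one function call per object over that object's own nested rows, with no explicit join back to an object table.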


## Hands-on Scientific Example: Variability Analysis

### Load and Nest ZTF Timeseries Data

### Perform Initial Filtering

### Calculate Periodograms for all Lightcurves

### Visualizing Results

### Using Results to Modify our NestedFrame