# GroupBy for NestedPandas

This notebook explores how pandas' built-in `groupby` interacts with nested-pandas structures.
<!-- highlight what works, what doesn’t, and why — with clear examples and explanations. -->

Because nested-pandas extends pandas, the native `pandas.DataFrame.groupby` works on a `NestedFrame` out of the box, though not every operation handles nested columns.

In [None]:
# This will be the nf example used in this doc
from nested_pandas.datasets import generate_data

nf = generate_data(5, 10, seed=1)
nf["c"] = [0, 0, 1, 1, 1]
nf

,a,b,nested (first row: t / flux / band),c
0,0.417022,0.184677,8.38389 / 10.233443 / g (+9 rows),0.0
1,0.720324,0.372520,13.70439 / 41.405599 / g (+9 rows),0.0
2,0.000114,0.691121,4.089045 / 69.440016 / g (+9 rows),1.0
3,...,...,17.562349 / 41.417927 / g (+9 rows),1.0
4,...,...,0.547752 / 4.995346 / r (+9 rows),1.0


`groupby` works on *non-nested* (base) columns and returns a pandas `DataFrameGroupBy` object.  
Grouping by a nested column does **not** work: nested values are mutable objects, so they are unhashable and cannot serve as group keys.

Use base columns as group keys, or extract scalar identifiers from the nested data first.
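
The hashability constraint is not specific to nested-pandas. As a rough illustration, the sketch below uses a plain pandas `DataFrame` whose `obj` column holds Python lists as a stand-in for nested values. The frame and the column names `obj`, `v`, and `n_items` are made up for this example. Grouping on the list column raises a `TypeError`, while grouping on a scalar derived from it works.

```python
import pandas as pd

# Stand-in data (hypothetical): the "obj" column holds lists,
# which are mutable and therefore unhashable.
df = pd.DataFrame({"obj": [[1, 2], [1, 2], [3]], "v": [10, 20, 30]})

try:
    df.groupby("obj")["v"].sum()
    list_key_error = None
except TypeError as err:
    # pandas cannot hash the list values to build group keys
    list_key_error = err

# Derive a scalar (hashable) identifier from each object and group on that instead.
df["n_items"] = df["obj"].map(len)
totals = df.groupby("n_items")["v"].sum()
print(totals)
```

The same idea applies to a `NestedFrame`: compute a scalar per row from the nested data, store it in a base column, and group on that column.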



In [55]:
nf.groupby("c") # returns a Pandas GroupBy object

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12d466c90>

## Basic Aggregations

- Some built-in methods, such as `count`, run but may not give the result you expect: each nested value is treated as a single object, so `count` reports the number of base rows rather than the number of nested rows.
- Others (`min`, `max`, `mean`) fail on nested columns.
- Interestingly, `describe` works as expected because it automatically flattens the nested column.

In [None]:
# count treats each nested value as a single object
nf.groupby("c").count()

c,a,b,nested
0,2,2,2
1,3,3,3


In [None]:
# min/max/mean fail on nested columns
nf.groupby("c").min()  # raises an error

In [58]:
# describe works as expected: the nested column is flattened automatically
nf.groupby("c").describe()

,a,a,a,a,a,a,a,a,b,b,b,b,b,b,b,b,nested.t,nested.t,nested.t,nested.t,nested.t,nested.t,nested.t,nested.t,nested.flux,nested.flux,nested.flux,nested.flux,nested.flux,nested.flux,nested.flux,nested.flux
c,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
0,2.0,0.568673,0.214467,0.417022,0.492848,0.568673,0.644499,0.720324,2.0,0.278599,0.132825,0.184677,0.231638,0.278599,0.32556,0.37252,20.0,10.881513,6.240902,0.387339,7.83715,12.445851,15.226208,19.777222,20.0,51.891513,32.136814,1.582124,21.910878,53.147725,88.645112,94.948926
1,3.0,0.149734,0.151131,0.000114,0.073435,0.146756,0.224544,0.302333,3.0,0.854097,0.200247,0.691121,0.742328,0.793535,0.935584,1.077633,30.0,8.700798,6.111402,0.365766,3.537964,6.070383,13.957988,19.157791,30.0,57.975918,27.028715,0.287033,40.029183,60.184998,75.090985,99.732285


## Type Preservation
Within each group, the data remains a `NestedFrame`, and its nested columns remain `NestedSeries`.

We can check this by applying a custom function on our 2-group `groupby` object:

In [59]:
# check the type
def type_check(df):
    print("Group DataFrame Type:", type(df))
    print("Nested Column Type:", type(df["nested"]))
    print()
    # return df

nf.groupby("c").apply(type_check, include_groups=False)

Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>

Group DataFrame Type: <class 'nested_pandas.nestedframe.core.NestedFrame'>
Nested Column Type: <class 'nested_pandas.series.nestedseries.NestedSeries'>



Note that when accessing a row of a group with `.iloc[]`, **integer row indexing** and **slice-based indexing** return different types.

For a `NestedFrame`, integer indexing (`.iloc[0]`) collapses the first row into a 1-D `pandas.Series`, with the nested value stored as a `DataFrame`. Slice-based indexing (`.iloc[0:1]`) preserves the nested structure: the row comes back as a `NestedFrame` whose nested column is still a `NestedSeries`.

In [60]:
# check the full row type
def row_type_check(df):
    print("df.iloc[0]: ", type(df.iloc[0]))
    print('df.iloc[0:1]:', type(df.iloc[0:1]))
    print("\n Accessing nested column for both ways:")
    print('df.iloc[0] nested column:', type(df.iloc[0]["nested"]))
    print('df.iloc[0:1] nested column:', type(df.iloc[0:1]["nested"]))
    print()
    # return df

nf.groupby("c").apply(row_type_check, include_groups=False)


df.iloc[0]:  <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>

 Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>

df.iloc[0]:  <class 'pandas.core.series.Series'>
df.iloc[0:1]: <class 'nested_pandas.nestedframe.core.NestedFrame'>

 Accessing nested column for both ways:
df.iloc[0] nested column: <class 'pandas.core.frame.DataFrame'>
df.iloc[0:1] nested column: <class 'nested_pandas.series.nestedseries.NestedSeries'>



For a nested column of type `NestedSeries`, accessing a single row of `df["nested"]` returns either a `pandas.DataFrame` (`.iloc[0]`) or a `pandas.Series` (`.iloc[0:1]`).

Note that outside `groupby`, `df["nested"].iloc[0]` is also a `pandas.DataFrame`, which is expected.

<!-- (NestedPandas stores the nested frames as serialized DataFrames?) -->

In [61]:
# check the nested row type
def nested_row_type_check(df):
    print('df["nested"].iloc[0]:', type(df["nested"].iloc[0]))
    print('df["nested"].iloc[0:1]:', type(df["nested"].iloc[0:1]))
    print()
    # return df

nf.groupby("c").apply(nested_row_type_check, include_groups=False)

df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>

df["nested"].iloc[0]: <class 'pandas.core.frame.DataFrame'>
df["nested"].iloc[0:1]: <class 'pandas.core.series.Series'>
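
These results follow the way `.iloc` behaves in plain pandas: an integer position drops one dimension, while a slice keeps the container type. A minimal sketch with an ordinary `DataFrame` shows the same pattern (the frame and its columns `a` and `b` are made up for illustration). In the nested case, the "element" of the nested column is the per-row `DataFrame`.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the indexing rule.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

row_by_position = df.iloc[0]   # integer position: the row is reduced to a Series
row_by_slice = df.iloc[0:1]    # slice: still a (one-row) DataFrame

value_by_position = df["a"].iloc[0]  # integer position on a Series: the element itself
value_by_slice = df["a"].iloc[0:1]   # slice on a Series: a one-element Series

print(type(row_by_position).__name__, type(row_by_slice).__name__)
print(type(value_by_position).__name__, type(value_by_slice).__name__)
```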



Other operations generally preserve the nested structure, but to work with the contents of a nested column directly you may need to flatten it first with `.nest.to_flat()`.

## Custom Functions with `apply`

`.apply()` works natively with nested data. It generally succeeds when the function either flattens the nested column or uses slice-based indexing so that the types stay consistent across operations.

Some examples:

In [62]:
# custom function to flatten nested column
def flatten_nested(df):
    return df["nested"].nest.to_flat()

nf.groupby("c").apply(flatten_nested, include_groups=False)


c,,t,flux,band
0,0,8.38389,10.233443,g
0,0,13.40935,53.589641,g
...,...,...,...,...
1,4,9.831463,90.853515,r
1,4,13.995167,99.732285,g


In [63]:
import pandas as pd

# custom function to perform aggregations on flattened nested column
def mean_flux(df):
    flat = df["nested"].nest.to_flat()
    return pd.Series({
        "mean_flux": flat["flux"].mean(),
        "mean_t": flat["t"].mean()
    })

nf.groupby("c").apply(mean_flux, include_groups=False)



c,mean_flux,mean_t
0,51.891513,10.881513
1,57.975918,8.700798
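
An alternative to calling `.nest.to_flat()` inside each group is to flatten once, attach the group key, and group the flat table. The sketch below uses hand-built pandas frames as stand-ins, assuming (as in the flattened output shown earlier) that the flat rows keep the index of their base row. The data values and the names `base`, `flat`, and `per_group` are invented for the example.

```python
import pandas as pd

# Stand-in for the base table: one row per object, with the group key "c".
base = pd.DataFrame({"c": [0, 0, 1]}, index=[0, 1, 2])

# Stand-in for nf["nested"].nest.to_flat(): the index repeats the base-row index.
flat = pd.DataFrame(
    {"t": [1.0, 2.0, 3.0, 4.0, 5.0], "flux": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=[0, 0, 1, 2, 2],
)

# Attach the group key by index, then aggregate the flat rows per group.
per_group = flat.join(base["c"]).groupby("c")[["flux", "t"]].mean()
per_group.columns = ["mean_flux", "mean_t"]
print(per_group)
```

With a real `NestedFrame`, `flat` would come from `nf["nested"].nest.to_flat()` and `base` would be `nf` itself. This avoids a Python-level function call per group, at the cost of materializing the full flat table.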


## Summary
- Always group by **base columns**, not nested columns.  
- Use **slice-based indexing** (`.iloc[0:1]`) to preserve nested types.
- Use `.nest.to_flat()` to flatten a nested column when needed for numerical or aggregating operations.

- Nested structures are designed to reduce the need for expensive `groupby` operations by keeping data organized hierarchically. When grouping is necessary, pandas' `groupby` still works with nested-pandas and maintains type consistency.