# GroupBy for NestedPandas

This notebook explores how pandas' built-in `groupby` interacts with nested-pandas structures.

Because nested-pandas extends the pandas library, the native ``pandas.DataFrame.groupby`` works on a ``NestedFrame`` out of the box, with some caveats described below.

In [None]:
# Example NestedFrame (nf) used throughout this notebook
from nested_pandas.datasets import generate_data

nf = generate_data(5, 10, seed=1)
nf["c"] = [0, 0, 1, 1, 1]
nf

`groupby` works on *non-nested* (base) columns and returns a pandas `GroupBy` object.  
Grouping by a nested column does **not** work: nested values are mutable objects and are therefore unhashable.

Use base columns as group keys or extract scalar identifiers from nested data first.



In [None]:
nf.groupby("c")  # returns a Pandas GroupBy object

## Basic Aggregations

- Some built-in methods such as `count` run, but not as you might expect: each nested column is treated as a single object.
- Others (`min`, `max`, `mean`) fail on nested columns.
- Interestingly, `describe` works as expected, with the nested column automatically flattened.

In [None]:
# count treats each nested column as a single object
nf.groupby("c").count()

In [None]:
# min/max/mean fail on nested columns
nf.groupby("c").min()  # will produce error

In [None]:
# describe works as expected, with the nested column automatically flattened
nf.groupby("c").describe()

## Type Preservation
Within each group, the data remains accessible as a ``NestedFrame``, and the nested columns remain ``NestedSeries``.

We can check this by applying a custom function on our 2-group `groupby` object:

In [None]:
# check the type
def type_check(df):
    print("Group DataFrame Type:", type(df))
    print("Nested Column Type:", type(df["nested"]))
    print()
    # return df


nf.groupby("c").apply(type_check, include_groups=False)

An important caveat: when accessing the rows of each group with `.iloc[]`, **integer indexing** and **slice-based indexing** return different types.

For a `NestedFrame`, integer indexing (`.iloc[0]`) collapses the first row into a 1-D `pandas.Series`, with the nested column stored as a `DataFrame`. Slice-based indexing (`.iloc[0:1]`) preserves the nested structure: the row is returned as a `NestedFrame` whose nested column is still a `NestedSeries`.

In [None]:
# check the full row type
def row_type_check(df):
    print("df.iloc[0]: ", type(df.iloc[0]))
    print("df.iloc[0:1]:", type(df.iloc[0:1]))
    print("\n Accessing nested column for both ways:")
    print("df.iloc[0] nested column:", type(df.iloc[0]["nested"]))
    print("df.iloc[0:1] nested column:", type(df.iloc[0:1]["nested"]))
    print()
    # return df


nf.groupby("c").apply(row_type_check, include_groups=False)

For a nested column of type `NestedSeries`, accessing a single row of `df["nested"]` returns either a `pandas.DataFrame` (`.iloc[0]`) or a `pandas.Series` (`.iloc[0:1]`).

Note that outside `groupby`, `df["nested"].iloc[0]` is also a `pandas.DataFrame`, which is expected.


In [None]:
# check the nested row type
def nested_row_type_check(df):
    print('df["nested"].iloc[0]:', type(df["nested"].iloc[0]))
    print('df["nested"].iloc[0:1]:', type(df["nested"].iloc[0:1]))
    print()
    # return df


nf.groupby("c").apply(nested_row_type_check, include_groups=False)

Most other operations preserve the nested structure. If you need to work with the contents of a nested column directly, you may need to flatten it first using `.nest.to_flat()`.

## Custom Functions with `apply`

`.apply()` is supported natively on grouped nested data. It generally works as long as the function either flattens the nested column or uses slice-based indexing, so that the operations receive the types they expect.

Some possible examples:

In [None]:
# custom function to flatten nested column
def flatten_nested(df):
    return df["nested"].nest.to_flat()


nf.groupby("c").apply(flatten_nested, include_groups=False)

In [None]:
import pandas as pd


# custom function to perform aggregations on flattened nested column
def mean_flux(df):
    flat = df["nested"].nest.to_flat()
    return pd.Series({"mean_flux": flat["flux"].mean(), "mean_t": flat["t"].mean()})


nf.groupby("c").apply(mean_flux, include_groups=False)

## Summary
- Always group by **base columns**, not nested columns.
- Use **slice-based indexing** (`.iloc[0:1]`) to preserve nested types.
- Use `.nest.to_flat()` to flatten a nested column when you need numerical or aggregation operations on its contents.

- Nested structures are designed to reduce the need for expensive groupby operations by keeping data organized hierarchically. When grouping is necessary, however, pandas' `groupby` still works with nested-pandas, and each group keeps its `NestedFrame` and `NestedSeries` types.