# Lower-level interface for performance and flexibility
## Reveal the hidden power of nested Series

This section is for users who want to optimize their workflows, both in computation time and in memory usage. It also details a broader suite of data representations usable within `nested-pandas`.
It shows how to work with individual nested columns: adding, removing, and modifying data using both "flat-array" and "list-array" representations.
It also demonstrates how to convert nested Series to and from different data types, such as `pd.ArrowDtype`d Series, flat dataframes, list-array dataframes, and collections of nested elements.

In [None]:
import numpy as np
import pandas as pd
import pyarrow as pa

from nested_pandas import NestedDtype
from nested_pandas.datasets import generate_data
from nested_pandas.series.packer import pack

## Generate some data and get a Series of `NestedDtype` type

We are going to use the built-in data generator to get a `NestedFrame` with a "nested" column, which is a `Series` of `NestedDtype` type.
This column represents [light curves](https://en.wikipedia.org/wiki/Light_curve) of some astronomical objects.

In [None]:
nested_df = generate_data(4, 3, seed=42)
nested_series = nested_df["nested"]
nested_series[2]

## Get access to different data views using `.nest` accessor

`pandas` provides an interface for accessing Series through custom "accessors": special attributes that act as a different view on the data.
You may already know the [`.str` accessor](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) for strings or [`.dt` for datetime-like](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-methods) data.
pandas 2.x also provides a few accessors for `ArrowDtype`d Series: `.list` for list-arrays and `.struct` for struct-arrays.
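
As a quick reminder of how accessors behave, here is a minimal sketch that uses only the long-standing `.str` and `.dt` accessors of plain pandas. The sample values are made up for illustration.

```python
import pandas as pd

# Built-in string accessor: vectorized string methods on a Series of strings
bands = pd.Series(["g", "r", "i"])
upper_bands = bands.str.upper()

# Built-in datetime accessor: component access on datetime-like data
times = pd.Series(pd.to_datetime(["2024-01-01", "2024-06-15"]))
years = times.dt.year

print(upper_bands.tolist())  # ['G', 'R', 'I']
print(years.tolist())  # [2024, 2024]
```

In both cases the accessor does not change the Series itself; it only exposes type-specific operations on it.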

`nested-pandas` extends this concept and provides the `.nest` accessor for `NestedDtype`d Series, which gives the user an object for working with nested data more efficiently and flexibly.

### `.nest` object is a mapping

The `.nest` accessor provides an object implementing the `Mapping` interface, so you can use it like a dictionary.
The keys of this mapping are the names of the nested columns (fields), and the values are "flat" Series representing the nested data.

In [None]:
list(nested_series.nest.keys())

You can also get a list of fields with the `.fields` attribute:

In [None]:
nested_series.nest.fields

The value of each key is a "flat" Series with a repeated index: the original index of `nested_series` is repeated for each element of the nested data.

In [None]:
nested_series.nest["t"]
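
Because a "flat" Series is an ordinary pandas Series, the repeated index can be used with standard pandas tools. The sketch below uses a hand-made Series with made-up values rather than the generated data, so it only illustrates the idea.

```python
import pandas as pd

# A hand-made "flat" Series: index 0 has three nested elements, index 1 has two
flat_t = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0], index=[0, 0, 0, 1, 1], name="t")

# Label-based selection returns every element that belongs to one outer row
row0 = flat_t.loc[0]

# Grouping by the index level gives one aggregate value per outer row
n_obs = flat_t.groupby(level=0).size()
t_max = flat_t.groupby(level=0).max()

print(n_obs.tolist())  # [3, 2]
print(t_max.tolist())  # [3.0, 20.0]
```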

You can also get a subset of nested columns as a new nested Series:

In [None]:
nested_series.nest[["t", "flux"]].dtype

You can add new columns, drop existing ones, or modify existing ones.
Modification is currently limited to replacing the whole "flat" Series with a new one of the same length.
When you modify nested data, only the column you are working with changes; the rest of the data is neither affected nor copied.

In [None]:
new_series = nested_series.copy()

# Change the data in-place
new_series.nest["flux"] = new_series.nest["flux"] - new_series.nest["flux"].mean()

# Add new column
new_series.nest["lsst_band"] = "lsst_" + new_series.nest["band"]

# Drop the column, .pop() method is also available
del new_series.nest["band"]

# Add a new column with a python list instead of a Series
new_series.nest["new_column"] = [1, 2] * (new_series.nest.flat_length // 2)

new_series.nest.to_flat()
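
The cell above subtracts a single global mean. If a per-row mean is wanted instead, one way to build a replacement of the same length is a grouped `transform` on the repeated index. This is a plain-pandas sketch on made-up values; the resulting Series has the shape required for assignment to a nested field as described above.

```python
import pandas as pd

flat_flux = pd.Series([1.0, 3.0, 10.0, 30.0], index=[0, 0, 1, 1], name="flux")

# transform() broadcasts each group's mean back to the group's elements,
# so the result has the same length and index as the input
per_row_mean = flat_flux.groupby(level=0).transform("mean")
centered = flat_flux - per_row_mean

print(centered.tolist())  # [-1.0, 1.0, -10.0, 10.0]
```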

### Different data views

The `.nest` accessor provides a few different views on the data:
- `.to_flat()` - get a "flat" pandas dataframe with a repeated index; it is essentially a concatenation of all nested elements along the nested axis
- `.to_lists()` - get a pandas dataframe of nested-array (list-array) Series, where each element is a list of nested elements. The data type is `pd.ArrowDtype` of a pyarrow list.

Both representations are copy-free, so they can be produced very efficiently. The only additional overhead of the "flat" representation is the creation of a new repeating index.

In [None]:
nested_series.nest.to_flat(["flux", "t"])

In [None]:
lists_df = nested_series.nest.to_lists()  # may also accept a list of fields (nested columns) to get
lists_df["t"].list.len()  # here we use pandas' built-in list accessor to get the length of each list

List-arrays may be assigned back to the nested Series:

In [None]:
new_series = nested_series.copy()

# Adjust each time to be relative to the first observation
dt = new_series.nest.to_lists()["t"].apply(lambda t: t - t.min())
new_series.nest.set_list_field("dt", dt)
new_series.nest.to_flat()

## Convert to and from nested Series

We have already seen how the `.nest` accessor can be used to get different views on the nested data: a "flat" dataframe and a list-array dataframe with columns of `pd.ArrowDtype`.

This section is about converting nested Series to and from other data types.
If you just need to add a nested column to a `NestedFrame`, you can do it with the `.add_nested()` method.

### To and from `pd.ArrowDtype`

We can convert nested Series to and from `pd.ArrowDtype`d Series.
`NestedDtype` is close to `pd.ArrowDtype` for arrow struct-arrays, but it is stricter about the nested data structure.
`nested-pandas` also uses `pyarrow` struct-arrays as its storage format, where the struct fields are list-arrays of the same length.
So the conversion is straightforward and doesn't require any data copying.

In [None]:
struct_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
struct_series.struct.field("flux")  # pandas built-in accessor for struct-arrays

In [None]:
nested_series.equals(pd.Series(struct_series, dtype=NestedDtype.from_pandas_arrow_dtype(struct_series.dtype)))

### `pack()` function for creating a new nested Series

`nested-pandas` provides a `pack()` function to create a new nested Series from either a sequence of nested elements or a single flat dataframe with a repeated index.

#### Using `pack()` to nest a flat dataframe

You can use `pack()` to create a nested Series from a flat dataframe with a repeated index, for example, one returned by the `.nest.to_flat()` method.

In [None]:
new_series = pack(nested_series.nest.to_flat())
new_series.equals(nested_series)

In [None]:
series_from_flat = pack(
    pd.DataFrame(
        {
            "t": [1, 2, 3, 4, 5, 6],
            "flux": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        },
        index=[0, 0, 0, 0, 1, 1],
    ),
    name="from_flat",  # optional
)
series_from_flat
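
Conceptually, nesting a flat dataframe means grouping rows that share an index value. The plain-pandas sketch below shows that grouping with Python lists. It only illustrates the logic and is not how `pack()` is implemented; the result holds object-dtype lists rather than Arrow-backed nested data.

```python
import pandas as pd

flat_df = pd.DataFrame(
    {"t": [1, 2, 3, 4, 5, 6], "flux": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=[0, 0, 0, 0, 1, 1],
)

# Collect the elements sharing an index value into one Python list per column
lists_per_row = flat_df.groupby(level=0).agg(list)

print(lists_per_row.loc[0, "t"])  # [1, 2, 3, 4]
print(lists_per_row.loc[1, "flux"])  # [0.5, 0.6]
```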

#### Using `pack()` to nest a collection of elements

You can use `pack()` to create a nested Series from a collection of elements, where each element represents a single row of the nested data.
Each element can be one of many supported types, and you can mix them in the same collection:
- `pd.DataFrame`
- `dict` with items representing the nested columns, all the same length
- `pyarrow.StructScalar` with elements being list-arrays of the same length
- `None` or `pd.NA` for missing data

All the elements must have the same columns (fields), but the length of the nested data may differ between elements.

In [None]:
series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    name="from_pack",  # optional
    index=[3, 4, 5],  # optional
)
series_from_pack
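
The rule that elements share columns but may differ in length can be spelled out in plain pandas. `element_to_frame` below is a hypothetical helper written for this illustration only; it is not part of `nested-pandas`, and it handles just the `pd.DataFrame`, `dict`, and missing-value cases.

```python
import pandas as pd


def element_to_frame(element, columns):
    """Hypothetical helper: normalize one element into a DataFrame (or None)."""
    if element is None or element is pd.NA:
        return None
    frame = pd.DataFrame(element)
    if list(frame.columns) != list(columns):
        raise ValueError(f"expected columns {list(columns)}, got {list(frame.columns)}")
    return frame


elements = [
    pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
    {"t": [4, 5], "flux": [0.4, 0.5]},
    None,
]
frames = [element_to_frame(e, ["t", "flux"]) for e in elements]
lengths = [None if f is None else len(f) for f in frames]
print(lengths)  # [3, 2, None]
```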

If we are not happy with the default dtype, we can specify it explicitly. The next section gives more details; here we just show an example.

In [None]:
series_from_pack = pack(
    [
        pd.DataFrame({"t": [1, 2, 3], "flux": [0.1, 0.2, 0.3]}),
        {"t": [4, 5], "flux": [0.4, 0.5]},
        None,
    ],
    dtype=NestedDtype.from_fields({"t": pa.float64(), "flux": pa.float32()}),
)
series_from_pack

### Using pd.Series(values, dtype=NestedDtype.from_fields({...}))

`nested-pandas` provides the `NestedDtype` class, which lets you create a new nested Series with a given dtype directly.
`NestedDtype` may be built from a list of fields, where each field is a pair of a name and a data type.

This approach lets you create a new nested Series from a variety of data types, but due to pandas interface limitations it requires you to specify a concrete dtype.

#### pd.Series from a sequence of elements

This is the same as using the `pack()` function, but you need to specify the dtype explicitly.

In [None]:
series_from_dtype = pd.Series(
    [
        pd.NA,
        pd.DataFrame({"t": [1, 2, 3], "band": ["g", "r", "r"]}),
        {"t": np.array([4, 5]), "band": [None, "r"]},
    ],
    dtype=NestedDtype.from_fields({"t": pa.float64(), "band": pa.string()}),
)
series_from_dtype

`pyarrow` native objects are also supported. Scalars:

In [None]:
series_pa_type = pa.struct({"t": pa.list_(pa.float64()), "band": pa.list_(pa.string())})
scalar_pa_type = pa.struct({"t": pa.list_(pa.int16()), "band": pa.list_(pa.string())})
series_from_pa_scalars = pd.Series(
    # Scalars will be cast to the given type
    [
        pa.scalar(None),
        pa.scalar({"t": [1, 2, 3], "band": ["g", None, "r"]}, type=scalar_pa_type),
    ],
    dtype=NestedDtype(series_pa_type),
    name="from_pa_scalars",
    index=[101, -2],
)
series_from_pa_scalars

#### pd.Series from an array

Construction from `pyarrow` struct arrays is the cheapest way to create a nested Series. It is very similar to the initialization of a `pd.Series` of `pd.ArrowDtype` type.

In [None]:
pa_struct_array = pa.StructArray.from_arrays(
    [
        [
            np.arange(10),
            np.arange(5),
        ],  # "a" field
        [
            np.linspace(0, 1, 10),
            np.linspace(0, 1, 5),
        ],  # "b" field
    ],
    names=["a", "b"],
)
series_from_pa_struct = pd.Series(
    pa_struct_array,
    dtype=NestedDtype(pa_struct_array.type),
    name="from_pa_struct_array",
    index=["I", "II"],
)
series_from_pa_struct

### Convert nested Series to different data types

We have already seen how to convert nested Series to `pd.ArrowDtype`d Series, to a flat dataframe, or to a list-array dataframe. Let's summarize it here one more time:

In [None]:
# Convert to pd.ArrowDtype Series of struct-arrays
arrow_dtyped_series = pd.Series(nested_series, dtype=nested_series.dtype.to_pandas_arrow_dtype())
# Convert to a flat dataframe
flat_df = nested_series.nest.to_flat()
# Convert to a list-array dataframe
list_df = nested_series.nest.to_lists()

#### Convert to a collection of nested elements

A single element of a nested Series is represented as a `pd.DataFrame`, so iterating over a nested Series yields `pd.DataFrame` objects.

In [None]:
for element in nested_series:
    print(element)

All collections built by iterating over the Series have `pd.DataFrame` elements:

In [None]:
nested_elements = list(nested_series)
nested_elements[-1]

Nested Series also support direct conversion to a NumPy array of object dtype:

In [None]:
nested_series_with_na = pack([None, pd.NA, {"t": [1, 2], "flux": [0.1, None]}])
# Would have None for top-level missing data
np_array1 = np.array(nested_series_with_na)
print(f"{np_array1[0] = }")

In [None]:
# Would have an empty pd.DataFrame for top-level missing data
np_array2 = nested_series_with_na.to_numpy(na_value=pd.DataFrame())
print(f"{np_array2[0] = }")
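
The `na_value` argument used above is the standard parameter of `Series.to_numpy` in pandas. For comparison, here is a minimal sketch of the same idea on a plain nullable-integer Series with made-up values.

```python
import numpy as np
import pandas as pd

nullable_ints = pd.Series([1, None, 3], dtype="Int64")

# na_value substitutes a placeholder for missing entries,
# which makes a plain float array possible
as_float = nullable_ints.to_numpy(dtype="float64", na_value=np.nan)

print(as_float.dtype)  # float64
```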