# Loading Data into Nested-Pandas

This notebook provides a brief introduction to loading data into nested-pandas or converting data into a nested structure. For an introduction to nested-pandas, see the quick start tutorial or the [readthedocs page](https://nested-pandas.readthedocs.io/en/latest/).


## Installation and Imports

With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. The following command can be used to install it:

In [None]:
# %pip install nested-pandas

In [None]:
import os
import tempfile

import pandas as pd

from nested_pandas import NestedFrame, read_parquet
from nested_pandas.datasets import generate_parquet_file

# Overview

Nested-pandas provides multiple mechanisms for loading data or converting data to the nested format.  Below we walk through some of the common approaches.

# Converting Flat Data

Existing data sets are commonly provided in “flat” data structures such as dictionaries or Pandas DataFrames. In these cases the data consists of a rectangular table where each row represents an instance or observation. Multiple instances of the same top-level item are linked together through an ID: all rows with the same ID correspond to the same object/item.

We define one such flat dataframe consisting of 10 rows for 3 distinct items.

In [None]:
flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)
flat_df

The index (the unlabeled leftmost column) provides the object ID. As we can see, there are three rows with ID=0, three rows with ID=1, and four rows with ID=2. Some of the values are constant for each item. For example, columns “a” and “b” each take a single value per object, so we waste space by repeating them in every row. Other values differ from row to row (columns “c” and “d”).

As a concrete example, consider patient records. Each patient is assigned a unique ID and has static data such as a date of birth. They also have measurements that are new with every trip to the doctor, such as blood pressure or temperature.
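When deciding which columns belong at the top level and which should be nested, it can help to check which columns hold a single value per ID. The following is a minimal sketch using plain pandas on the same values as `flat_df`; the names `unique_counts`, `constant_columns`, and `varying_columns` are illustrative and are not part of nested-pandas.

```python
import pandas as pd

flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

# Count the distinct values of each column within each ID (the index).
unique_counts = flat_df.groupby(level=0).nunique()

# A column is constant per ID if every ID has exactly one distinct value.
is_constant = (unique_counts == 1).all()
constant_columns = [col for col in flat_df.columns if is_constant[col]]
varying_columns = [col for col in flat_df.columns if not is_constant[col]]

print("constant per ID:", constant_columns)
print("varying per ID:", varying_columns)
```

Columns that are constant per ID are natural candidates for the top-level table, while the varying ones are candidates for nesting.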

## Converting from Flat Pandas

The easiest approach to converting the flat table above into a nested structure is to use `NestedFrame.from_flat()`. This function takes:
  * a list of columns that are not nested (`base_columns`)
  * a list of columns to nest (`nested_columns`)
  * the name of the nested column (`name`)

Rows are associated using the index by default, but a column name on which to join can also be provided.

In [None]:
nf = NestedFrame.from_flat(
    flat_df,
    base_columns=["a", "b"],  # the columns not to nest
    nested_columns=["c", "d"],  # the columns to nest
    name="nested",  # name of the nested column
)
nf
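To build intuition for what this conversion produces, the sketch below reproduces the same split with plain pandas: one row per ID for the base columns, and one sub-table per ID for the nested columns. This is only a conceptual illustration, not how nested-pandas implements `from_flat`, and the names `base` and `sub_tables` are hypothetical.

```python
import pandas as pd

flat_df = pd.DataFrame(
    data={
        "a": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "b": [2, 2, 2, 4, 4, 4, 6, 6, 6, 6],
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

# One row per ID, holding the values that are constant for that ID.
base = flat_df[["a", "b"]].groupby(level=0).first()

# One sub-table per ID, holding the per-observation values.
sub_tables = {
    obj_id: group.reset_index(drop=True)
    for obj_id, group in flat_df[["c", "d"]].groupby(level=0)
}

print(base)
print(sub_tables[2])  # the four observations that belong to ID=2
```

A `NestedFrame` stores both pieces in a single table, so each sub-table stays attached to its row.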

## Inserting Nested Rows

Alternatively, we can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns (as we would do with a normal pandas DataFrame). This defines the top-level objects and the values that are constant across rows ("a" and "b").

In [None]:
nf = NestedFrame(
    data={
        "a": [1, 2, 3],
        "b": [2, 4, 6],
    },
    index=[0, 1, 2],
)
nf

We can then create an additional pandas DataFrame for the nested columns and pack it into our `NestedFrame` with the `NestedFrame.add_nested()` function. By default, `add_nested` aligns the nest on the index (a column may be selected instead via the `on` kwarg). Note that the `nested` `DataFrame` has a repeated index whose values correspond to the index of the `nf` `NestedFrame`.

In [None]:
nested = pd.DataFrame(
    data={
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

nf = nf.add_nested(nested, "nested")
nf

The index is used to perform the association. All of the values for index=0 are bundled together into a sub-table and stored in row 0's "nested" column.

In [None]:
nf.loc[0]["nested"]

We could add other nested columns by creating new sub-tables and adding them with `add_nested()`. Note that while the table added with each `add_nested()` call must be rectangular, the tables do not need to have the same dimensions between calls. We could add another nested column with a different number of observations.

In [None]:
nested = pd.DataFrame(
    data={
        "c": [0, 1, 0, 1, 2, 0],
        "d": [5, 4, 5, 4, 3, 5],
    },
    index=[0, 0, 1, 1, 1, 2],
)

nf = nf.add_nested(nested, "nested2")
nf
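To see how the two nested columns differ in size, we can count the observations per index value in each of the flat tables that were packed. This is a small sketch with plain pandas that rebuilds the two tables under the hypothetical names `first` and `second`; it does not inspect the `NestedFrame` itself.

```python
import pandas as pd

# The table packed into the "nested" column.
first = pd.DataFrame(
    data={
        "c": [0, 2, 4, 1, 4, 3, 1, 4, 1, 1],
        "d": [5, 4, 7, 5, 3, 1, 9, 3, 4, 1],
    },
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
)

# The table packed into the "nested2" column.
second = pd.DataFrame(
    data={
        "c": [0, 1, 0, 1, 2, 0],
        "d": [5, 4, 5, 4, 3, 5],
    },
    index=[0, 0, 1, 1, 1, 2],
)

# Number of observations that each top-level row receives per nested column.
counts = pd.DataFrame(
    {
        "nested": first.groupby(level=0).size(),
        "nested2": second.groupby(level=0).size(),
    }
)
print(counts)
```

Each top-level row can therefore hold a different number of observations in each nested column.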

# Loading Data from Parquet Files

For larger datasets, we support loading data from parquet files. In the following cell, we generate a temporary parquet file with random data and ingest it with the `read_parquet` function:

In [None]:
# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    # Generate a parquet file with random data within our temporary directory
    generate_parquet_file(10, {"nested1": 100, "nested2": 10}, os.path.join(temp_path, "test.parquet"))

    # Read the parquet file to a NestedFrame
    nf = read_parquet(os.path.join(temp_path, "test.parquet"))

Nested-Pandas nested columns are compatible with the parquet format, meaning they can be written to and read from parquet natively.

In [None]:
nf  # nf contains nested columns

# Saving NestedFrames to Parquet Files

Additionally, we can save an existing `NestedFrame` as a parquet file using `NestedFrame.to_parquet`.

>Note: Nested-Pandas converts any nested columns to pyarrow datatypes when writing to parquet, meaning that parquet files with nested columns can be read by a parquet reader from other packages so long as they understand pyarrow dtypes.

In [None]:
# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    nf.to_parquet(
        os.path.join(temp_path, "output.parquet"),  # The output file path
    )

    # List the files in temp_path to ensure they were saved correctly.
    print("The NestedFrame was saved to the following parquet files:", os.listdir(temp_path))