# Loading Data into Nested-Pandas

With a valid Python environment, nested-pandas and its dependencies are easy to install using the `pip` package manager. The following command installs it:

In [None]:
# % pip install nested-pandas

In [None]:
from nested_pandas.datasets import generate_parquet_file
from nested_pandas import NestedFrame
from nested_pandas import read_parquet

import os
import pandas as pd
import tempfile

# Loading Data from Dictionaries
Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.

We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.

We can then create additional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`.

In [None]:
nf = NestedFrame(data={"a": [1, 2, 3], "b": [2, 4, 6]}, index=[0, 1, 2])

nested = pd.DataFrame(
    data={"c": [0, 2, 4, 1, 4, 3, 1, 4, 1], "d": [5, 4, 7, 5, 3, 1, 9, 3, 4]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)

nf = nf.add_nested(nested, "nested")
nf
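
The packing above is driven by index values: nested rows that share an index value end up under the base row with that same index. As a rough illustration only, the pandas-only sketch below groups the same toy data by index to show which nested rows belong to which base row. It does not use nested-pandas and is not how `add_nested` is implemented; the names `rows_per_base_row` and `c_per_base_row` are made up for this example.

```python
import pandas as pd

# Same toy data as the cell above: a base table and a longer nested table.
base = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]}, index=[0, 1, 2])
nested = pd.DataFrame(
    {"c": [0, 2, 4, 1, 4, 3, 1, 4, 1], "d": [5, 4, 7, 5, 3, 1, 9, 3, 4]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)

# Count how many nested rows share each index value.
rows_per_base_row = nested.groupby(level=0).size()

# Collect the nested "c" values that belong to each base row.
c_per_base_row = nested.groupby(level=0)["c"].apply(list)

print(rows_per_base_row.to_dict())
print(c_per_base_row.to_dict())
```

Each of the three base rows receives three nested rows here, which is why the packed frame keeps the three-row shape of the base table.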

# Loading Data from Parquet Files

For larger datasets, we support loading data from parquet files.

In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the `read_parquet` function.

First we load each file individually as its own data frame to be inspected. Then we use `read_parquet` to create the `NestedFrame` `nf`.

In [None]:
base_df, nested1, nested2 = None, None, None
nf = None

# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    # Generate parquet files with random data within our temporary directory.
    generate_parquet_file(10, {"nested1": 100, "nested2": 10}, temp_path, file_per_layer=True)

    # Read each individual parquet file into its own dataframe.
    base_df = read_parquet(os.path.join(temp_path, "base.parquet"))
    nested1 = read_parquet(os.path.join(temp_path, "nested1.parquet"))
    nested2 = read_parquet(os.path.join(temp_path, "nested2.parquet"))

    # Create a single NestedFrame packing multiple parquet files.
    nf = read_parquet(
        data=os.path.join(temp_path, "base.parquet"),
        to_pack={
            "nested1": os.path.join(temp_path, "nested1.parquet"),
            "nested2": os.path.join(temp_path, "nested2.parquet"),
        },
    )

When examining the individual tables for each of our parquet files we can see that:

a) they all have different dimensions
b) they have shared indices

In [None]:
# Print the dimensions of all of our underlying tables
print("Our base table 'base.parquet' has shape:", base_df.shape)
print("Our first nested table 'nested1.parquet' has shape:", nested1.shape)
print("Our second nested table 'nested2.parquet' has shape:", nested2.shape)

# Print the unique indices in each table:
print("The unique indices in our base table are:", base_df.index.values)
print("The unique indices in our first nested table are:", nested1.index.unique())
print("The unique indices in our second nested table are:", nested2.index.unique())
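
The two observations above can also be checked programmatically rather than by eye. The following is a minimal pandas-only sketch, assuming small hypothetical stand-in tables (`base` and `nested` here are illustrative and are not the randomly generated files); the same expressions could be applied to `base_df` and `nested1`.

```python
import pandas as pd

# Hypothetical stand-ins for base_df and nested1 (not the generated data).
base = pd.DataFrame({"a": [0.1, 0.2, 0.3]}, index=[0, 1, 2])
nested = pd.DataFrame({"t": [1.0, 2.0, 3.0, 4.0]}, index=[0, 0, 1, 2])

# a) The tables have different numbers of rows.
shapes_differ = base.shape[0] != nested.shape[0]

# b) Every nested index value also appears in the base index.
index_is_shared = bool(nested.index.isin(base.index).all())

# Nested index values with no matching base row (empty for this data).
orphans = nested.index.difference(base.index)

print(shapes_differ, index_is_shared, list(orphans))
```

A non-empty `orphans` would point to nested rows that have no base row to be packed under, which is worth knowing before packing.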

Inspecting `nf`, the `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we can see that the nested parquet files were packed by matching their index values against the index in `base.parquet`.

The resulting `NestedFrame` has the same number of rows as `base.parquet`, with `nested1.parquet` and `nested2.parquet` packed into the 'nested1' and 'nested2' columns respectively.

In [None]:
nf

Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`.

# Packing Together Existing Dataframes Into a NestedFrame

In [None]:
NestedFrame(base_df).add_nested(nested1, "nested1").add_nested(nested2, "nested2")

# Saving NestedFrames to Parquet Files

Additionally, we can save an existing `NestedFrame` as one or more parquet files using `NestedFrame.to_parquet`.

When `by_layer=True` we save each individual layer of the NestedFrame into its own parquet file in a specified output directory.

The base layer will be written to "base.parquet", and each nested layer will be written to a file named after its column. So the nested layer in column `nested1` will be written to "nested1.parquet".
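
The naming convention described above is simple enough to sketch in plain Python. The helper `layer_paths` below is hypothetical and is not part of nested-pandas; it only mirrors the "base.parquet" and "<column name>.parquet" pattern so that the expected file paths can be built in one place, for example to assemble a `to_pack` dictionary when reading the files back.

```python
import os


def layer_paths(output_dir, nested_columns):
    """Hypothetical helper: map each layer to the file name described above."""
    paths = {"base": os.path.join(output_dir, "base.parquet")}
    for column in nested_columns:
        paths[column] = os.path.join(output_dir, f"{column}.parquet")
    return paths


paths = layer_paths("out", ["nested1", "nested2"])

# Everything except the base layer is a candidate for a `to_pack` mapping.
to_pack = {name: path for name, path in paths.items() if name != "base"}

print(paths)
print(to_pack)
```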

In [None]:
restored_nf = None

# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    nf.to_parquet(
        temp_path,  # The directory to save our output parquet files.
        by_layer=True,  # Save each layer of the NestedFrame to its own parquet file.
    )

    # List the files in temp_path to ensure they were saved correctly.
    print("The NestedFrame was saved to the following parquet files:", os.listdir(temp_path))

    # Read the NestedFrame back in from our saved parquet files.
    restored_nf = read_parquet(
        data=os.path.join(temp_path, "base.parquet"),
        to_pack={
            "nested1": os.path.join(temp_path, "nested1.parquet"),
            "nested2": os.path.join(temp_path, "nested2.parquet"),
        },
    )

restored_nf  # our dataframe is restored from our saved parquet files

We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.

Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `by_layer=False`.

Our `read_parquet` function can load a `NestedFrame` saved in this single parquet file without requiring any additional arguments.

In [None]:
restored_nf_single_file = None

# Note that we use the `tempfile` module to create and then clean up a temporary directory.
# You can of course remove this and use your own directory and real files on your system.
with tempfile.TemporaryDirectory() as temp_path:
    output_path = os.path.join(temp_path, "output.parquet")
    nf.to_parquet(
        output_path,  # The filename to save our NestedFrame to.
        by_layer=False,  # Save the entire NestedFrame to a single parquet file.
    )

    # List the files within our temp_path to ensure that we only saved a single parquet file.
    print("The NestedFrame was saved to the following parquet files:", os.listdir(temp_path))

    # Read the NestedFrame back in from our saved single parquet file.
    restored_nf_single_file = read_parquet(output_path)

restored_nf_single_file  # our dataframe is restored from a single saved parquet file