# Nested-Dask Best Practices

## When to Use Nested-Dask vs Nested-Pandas

Like Dask, Nested-Dask is designed for working with large amounts of data. It matters most when the data exceeds the available memory of your machine or system, or when parallel computing is needed. In such cases, Nested-Dask provides built-in tooling for working with these datasets and is recommended over Nested-Pandas. These tools include (but are not limited to):

* **lazy computation**: Workflows can be constructed with more control over when computation actually begins.

* **partitioning**: Data is broken up into smaller partitions that fit into memory, enabling work on each chunk while keeping the overall memory footprint smaller than the full dataset size.

* **progress tracking**: The [Dask Dashboard](https://docs.dask.org/en/latest/dashboard.html) can be used to track the progress of complex workflows, assess memory usage, find bottlenecks, etc.

* **parallel processing**: Dask workers are able to work in parallel on the partitions of a dataset, both on a local machine and on a distributed cluster.

In [None]:
from nested_dask.datasets import generate_data

# A lazily-represented dataset split into 5 partitions
generate_data(10, 100, npartitions=5)

In [None]:
# Set up a Dask client, which enables parallel processing
from dask.distributed import Client

client = Client()
client  # provides a link to access the Dask Dashboard

### Avoiding Dask Inefficiency

By contrast, when working with smaller datasets that fit into memory, it's often better to work directly with Nested-Pandas. This is particularly relevant for workflows that start with a large amount of data, filter it down to a small dataset, and do not require computationally heavy processing of that small dataset. Because computation is lazy, these filtering operations are not applied until a result is requested, so you are effectively still working at scale. Let's walk through an example where we load a "large" dataset. In this case it fits into memory, but let's imagine that it is larger than memory.

In [None]:
# generate a "large" lazy dataset
ndf = generate_data(1000, 1000, npartitions=10)
ndf

Now let's apply a query that will filter the dataset down to a very small subset.

In [None]:
ndf = ndf.query("a > 0.99")
ndf.compute()  # returns a handful of rows from the original 1000

When `compute()` is called above, the Dask task graph is executed and the query is run. However, `ndf` is still a lazy Dask object, meaning that any subsequent call that triggers computation (e.g. `.compute()`, `.head()`, or `.to_parquet()`) will need to run the query all over again.

In [None]:
import numpy as np
import pandas as pd

# The result will be a dataframe with a single column with float values
meta = pd.DataFrame(columns=[0], dtype=float)

# Apply a mean operation on the "nested.flux" column
mean_flux = ndf.reduce(np.mean, "nested.flux", meta=meta)

# Dask has to reapply the query over `ndf` here, then apply the mean operation
mean_flux.compute()

In this case, it's better to work with the computed query result directly in Nested-Pandas.

In [None]:
import nested_pandas as npd

nf = ndf.compute()  # The query is computed and the result is brought into memory

# The computed result is a Nested-Pandas NestedFrame
isinstance(nf, npd.NestedFrame)

In [None]:
# Now we can apply the mean operation directly to the nested_pandas.NestedFrame
nf.reduce(np.mean, "nested.flux")

## Use Dask Divisions

Dask "divisions" are an optional component of Dask, but are highly recommended for Nested-Dask work. When the dataset is sorted by its index, divisions record the range of index values that resides in each partition. For example:

In [None]:
# Divisions are in the left-most column
ndf = generate_data(15, 10, npartitions=5)
ndf

In [None]:
# Divisions show which index ranges reside in each partition
ndf.divisions

Divisions are particularly important to the speed and stability of table joins, which Nested-Dask uses heavily in its nesting scheme. There are two main ways to set divisions. First, when loading from files on disk, some readers accept keyword arguments that calculate divisions automatically (`calculate_divisions=True` in the case of `read_parquet`). Alternatively, you can calculate them as part of a `set_index()` call.

In [None]:
# drop the index, no divisions set
ndf_no_index = ndf.reset_index()
ndf_no_index

In [None]:
# sorted=True tells Dask the column is already sorted, so divisions can be set directly
# use sort=True instead if the chosen index column is not sorted
ndf_no_index.set_index("index", sorted=True)