# Scaling to Large Data Volume

TAPE is built with survey-scale time-domain data in mind. This notebook offers information on how one would use TAPE to navigate large datasets.



## Useful `Dask` features of the `Ensemble`

Much of the scalability of TAPE comes from its `Dask` roots. `Dask` has a whole host of features for organizing, visualizing, and executing large-scale jobs. One example is the "lazy" execution discussed in the ["Working with the tape Ensemble object"](https://tape.readthedocs.io/en/latest/tutorials/working_with_the_ensemble.html) notebook, where specific lines of code are not run immediately, and are instead queued as tasks that await a signal to run the calculation. This gives the user greater control over when the bulk of the execution time is spent.
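As a minimal sketch of lazy execution in plain `Dask` (independent of TAPE; the `square` and `total` helpers are hypothetical names used only for illustration), the calls below merely build up a set of pending tasks, and nothing is calculated until `compute` is called:

```python
import dask


@dask.delayed
def square(x):
    # Not executed when called; a lazy task is recorded instead
    return x * x


@dask.delayed
def total(values):
    # Also lazy; depends on the results of the `square` tasks
    return sum(values)


# Building the workflow is cheap: no squares have been computed yet
lazy_total = total([square(i) for i in range(4)])

# The signal to run: everything queued above is executed now
result = lazy_total.compute()
print(result)
```

Until `compute` is called, `lazy_total` is only a description of the work to be done, which is what lets you decide when to pay the execution cost.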

### The `Dask` Client and Scheduler

An important aspect of `Dask` to understand when optimizing its performance for large datasets is the Distributed Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Distributed Client is the entry point for setting up a distributed system. The Distributed Client enables asynchronous computation, where the `Dask` `compute` and `persist` methods are able to run in the background and keep results in memory while we continue doing other work.

In the TAPE `Ensemble`, by default a Distributed Client is spun up in the background, which can be accessed using `Ensemble.client_info()`:



In [None]:
from tape import Ensemble

ens = Ensemble()

ens.client_info()

In [None]:
ens.client.close()  # tear down the client when we're done with it

By calling `Ensemble.client_info()`, we get an interactive output that tells us about the Client, the Cluster setup, and the available workers. In addition, an address is provided to access the `Dask` [Dashboard](https://docs.dask.org/en/stable/dashboard.html). This is an incredibly useful interactive tool for monitoring your in-progress workflows and diagnosing potential issues or slowdowns. We highly encourage you to check out the `Dask` documentation on this, and also to open the dashboard and play around with it yourself!

In many cases, the client that is built by default may not be ideal for your workflow. In these instances, we have a few ways of building a tailored client. The first is to pass client arguments along to the `Ensemble` itself.

In [None]:
ens = Ensemble(n_workers=3, threads_per_worker=2)

ens.client_info()

In this case, we only specify the number of workers and the number of threads per worker. We end up with a fairly lightweight cluster built underneath.

In [None]:
ens.client.close()

Another option is to simply create a client external to the `Ensemble` and pass it in.

In [None]:
from dask.distributed import Client

client = Client()

ens = Ensemble(client=client)

In [None]:
client

In [None]:
ens.client_info()  # We see that the two are equivalent

This may be preferable for those who want full control of the `Dask` client API, which may be beneficial when working on external machines/services or when a more complex setup is desired.
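As a rough sketch of what a more tailored external client might look like, the example below builds a small local cluster directly through the `Dask` `Client`. The specific values are illustrative assumptions, not recommendations: `processes=False` keeps the workers as threads inside the current process, and `dashboard_address=":0"` asks for any free port for the dashboard.

```python
from dask.distributed import Client

# Keyword arguments are forwarded to the local cluster that Client creates
client = Client(
    n_workers=2,             # illustrative value
    threads_per_worker=1,    # illustrative value
    processes=False,         # thread-based workers in this process
    dashboard_address=":0",  # let Dask choose a free dashboard port
)

# Confirm the cluster has the number of workers we asked for
n_workers = len(client.scheduler_info()["workers"])
print("workers:", n_workers)

# Tear down the client when finished
client.close()
```

A client built this way could then be handed to the `Ensemble` through the `client` argument, as shown above.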

Alternatively, there may be instances where you would prefer to not use the Distributed Client, particularly when working with smaller amounts of data. In these instances, we allow users to disable the creation of a Distributed Client by passing `client=False`, as follows:

In [None]:
ens = Ensemble(client=False)

### Data Partitioning and Parallelization

Partitioning is the subdivision of a single (large) dataset into many (small) datasets. It is a crucial concept to understand when working at scale, as the input data will almost always be too large to fit in memory all at once. Partitioning allows data to be read into memory in manageable chunks.

Partitioning is also crucial for the performance of TAPE parallelization. In general, workers are allocated to partitions, so having more workers than partitions may leave some workers sitting idle.
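To make the idle-worker point concrete, here is a deliberately simplified back-of-the-envelope sketch. It assumes every partition takes the same time to process and that each worker handles one partition at a time, which real `Dask` schedules will not follow exactly; `idle_worker_slots` is a hypothetical helper, not part of TAPE or `Dask`.

```python
import math


def idle_worker_slots(n_partitions, n_workers):
    """Count worker slots left unused under a simplified scheduling model.

    Work proceeds in rounds; in each round every worker can take one partition.
    """
    rounds = math.ceil(n_partitions / n_workers)
    return rounds * n_workers - n_partitions


# More workers than partitions: half of the workers never receive any work
print(idle_worker_slots(n_partitions=4, n_workers=8))

# Partition count is a multiple of the worker count: no slots are wasted
print(idle_worker_slots(n_partitions=32, n_workers=8))
```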

Ideally, your input data should be pre-partitioned, as any repartitioning of data is computationally expensive. However, TAPE is able to repartition input data if required.

In [None]:
ens = Ensemble(client=client)

# Read in data from a parquet file
ens.from_parquet("../../tests/tape_tests/data/source/test_source.parquet",
                id_col='ps1_objid',
                time_col='midPointTai',
                flux_col='psFlux',
                err_col='psFluxErr',
                band_col='filterName',
                partition_size='5KB')

ens.info()

Here we read in our example dataset, but this time specify the `partition_size` keyword, setting it to '5KB'. This lets the ensemble know that the input dataset, which consists of a single partition, should be repartitioned into 5KB-sized chunks. This results in a new dataframe with 32 partitions. Alternatively, we could have supplied `n_partitions` to specify a target number of output partitions.

However, **be careful**, as in this instance we've likely split data from each timeseries into multiple partitions. The consequence of this is that operations on a single lightcurve now need to access multiple partitions, which adds significant data shuffling costs to your workflow. If your timeseries is split across multiple partitions, you'll also want to explicitly set `use_map=False` in `Ensemble.batch` calls, as otherwise it will use `Dask` [map_partitions](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) which will only operate on the lightcurve data it sees in each partition!

In [None]:
from tape.analysis.stetsonj import calc_stetson_J
import numpy as np

mapres = ens.batch(calc_stetson_J, use_map=True)  # will not know to look at multiple partitions to get lightcurve data
groupres = ens.batch(calc_stetson_J, use_map=False)  # will know to look at multiple partitions, with shuffling costs

print("number of lightcurve results in mapres: ", len(mapres))
print("number of lightcurve results in groupres: ", len(groupres))
print("True number of lightcurves in the dataset:", len(np.unique(ens._source.index)))

As you can see, `map_partitions` finds chunks of a lightcurve in each partition and computes on each chunk separately, while the alternative (a groupby) does know to look across multiple partitions, at the cost of data shuffling. `map_partitions` is available as an option because it is a performant choice when the lightcurves are optimally partitioned, that is, when no lightcurve is split across partitions. This is part of why it's ideal for this partitioning to be done before you've touched the data, as there's a better chance that the input source has stored its lightcurves with this requirement met.
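The difference can be illustrated with a toy example in plain Python, with no `Dask` or TAPE involved. The data, the object IDs, and the `mean_flux_by_object` helper are all made up for illustration; the point is only that a per-partition calculation produces a separate, partial result for any object that straddles a partition boundary.

```python
from collections import defaultdict

# Toy (object_id, flux) rows in two partitions; object 2 straddles the boundary
partitions = [
    [(1, 10.0), (1, 12.0), (2, 5.0)],
    [(2, 7.0), (3, 1.0), (3, 3.0)],
]


def mean_flux_by_object(rows):
    """Compute the mean flux per object for the rows provided."""
    grouped = defaultdict(list)
    for object_id, flux in rows:
        grouped[object_id].append(flux)
    return {oid: sum(vals) / len(vals) for oid, vals in grouped.items()}


# map_partitions-style: each partition is processed in isolation
map_results = [
    (oid, value)
    for part in partitions
    for oid, value in mean_flux_by_object(part).items()
]

# groupby-style: rows are first gathered across partitions (the "shuffle")
all_rows = [row for part in partitions for row in part]
group_results = mean_flux_by_object(all_rows)

print("per-partition results:", len(map_results))  # object 2 appears twice
print("grouped results:", len(group_results))      # one result per object
```

Object 2 gets two partial answers in the per-partition approach (5.0 and 7.0), neither of which matches the mean over its full lightcurve (6.0).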

In [None]:
ens.client.close()

### No one size fits all

With regard to `Dask`, the client setup, and partitioning, there is no one solution that will apply to all use cases. We have largely provided the tools to tailor these as needed, but it does require some learning about what options are available and how they apply to your particular use case. We've gone over the general points above, but if you're interested in further learning, `Dask` has a [best practices](https://docs.dask.org/en/stable/best-practices.html) page that discusses some dos and don'ts, which we recommend.