# Scaling to Large Data Volume

lsstseries is built with survey-scale time-domain data in mind. This notebook offers information on how one would use lsstseries to navigate large datasets.



## Useful `Dask` features of the `Ensemble`

Much of the scalability of lsstseries comes from its `Dask` roots. `Dask` has a whole host of features for organizing, visualizing, and executing large-scale jobs. One example is the "lazy" execution discussed in the <notebooks/working_with_the_ensemble> notebook: specific lines of code are not run immediately when called, but are instead registered with a scheduler and await a signal to run the calculation. This gives the user greater control over when the bulk of the execution time is spent.
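As a minimal sketch of lazy execution using plain `Dask` (not lsstseries itself), the example below builds a small task graph with `dask.delayed` and only runs it when `compute()` is called. The `square` function and the `calls` list are hypothetical names introduced for illustration.

```python
import dask

calls = []  # records each time the function body actually runs


@dask.delayed
def square(x):
    calls.append(x)
    return x * x


# Building the task graph is cheap: no call to square() has run yet.
lazy_total = dask.delayed(sum)([square(i) for i in range(5)])
calls_before = len(calls)

# compute() is the signal to execute the graph and return a concrete value.
# The synchronous scheduler keeps this demo single-process and deterministic.
result = lazy_total.compute(scheduler="synchronous")
calls_after = len(calls)

print(calls_before, calls_after, result)  # prints: 0 5 30
```

Nothing is evaluated while the graph is being assembled, so expensive steps can be queued up and triggered together at a time of your choosing.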

### The `Dask` Client and Scheduler

An important aspect of `Dask` to understand when optimizing its performance for large datasets is the Client. `Dask` has [thorough documentation](https://distributed.dask.org/en/stable/client.html) on this, but the general idea is that the Client is an interface for sending instructions to a Scheduler, which in turn coordinates the worker nodes in executing the intended workflow. In essence, the Client gives you control over how to parallelize your workflow.

In the lsstseries `Ensemble`, a default client is created in the background, and its details can be viewed using `Ensemble.client_info()`:



In [10]:
from lsstseries import Ensemble

ens = Ensemble()

ens.client_info()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 59776 instead


Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:59776/status

Cluster
  Dashboard: http://127.0.0.1:59776/status
  Workers: 5
  Total threads: 10
  Total memory: 32.00 GiB
  Status: running
  Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:59798
  Workers: 5
  Dashboard: http://127.0.0.1:59776/status
  Total threads: 10
  Started: Just now
  Total memory: 32.00 GiB

Worker
  Comm: tcp://127.0.0.1:60367
  Total threads: 2
  Dashboard: http://127.0.0.1:60391/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:59822
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-yz3clzl5

Worker
  Comm: tcp://127.0.0.1:60384
  Total threads: 2
  Dashboard: http://127.0.0.1:60408/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:59823
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-a6kcvkic

Worker
  Comm: tcp://127.0.0.1:60378
  Total threads: 2
  Dashboard: http://127.0.0.1:60400/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:59824
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-2ikzghs3

Worker
  Comm: tcp://127.0.0.1:60424
  Total threads: 2
  Dashboard: http://127.0.0.1:60444/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:59825
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-kpqihktq

Worker
  Comm: tcp://127.0.0.1:60396
  Total threads: 2
  Dashboard: http://127.0.0.1:60431/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:59827
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-6cmgkvni


In [11]:
ens.client.close()  # tear down the client when we're done with it

By calling `Ensemble.client_info()`, we get an interactive output that tells us information about the Client, the Cluster setup, and available workers. In addition, an address is provided to access the `Dask` [Dashboard](https://docs.dask.org/en/stable/dashboard.html). This is an incredibly useful interactive tool for monitoring your in-progress workflows and diagnosing potential issues or slowdowns. We highly encourage you to read the `Dask` documentation on it, and also to open it up and explore it yourself!

In many cases, the client that is built by default may not be ideal for your workflow. In these instances, we have a few ways of building a tailored client. The first is to pass client arguments along to the `Ensemble` itself.

In [12]:
ens = Ensemble(n_workers=3, threads_per_worker=2)

ens.client_info()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 61199 instead


Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:61199/status

Cluster
  Dashboard: http://127.0.0.1:61199/status
  Workers: 3
  Total threads: 6
  Total memory: 32.00 GiB
  Status: running
  Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:61221
  Workers: 3
  Dashboard: http://127.0.0.1:61199/status
  Total threads: 6
  Started: Just now
  Total memory: 32.00 GiB

Worker
  Comm: tcp://127.0.0.1:61741
  Total threads: 2
  Dashboard: http://127.0.0.1:61761/status
  Memory: 10.67 GiB
  Nanny: tcp://127.0.0.1:61242
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-6us73gav

Worker
  Comm: tcp://127.0.0.1:61757
  Total threads: 2
  Dashboard: http://127.0.0.1:61773/status
  Memory: 10.67 GiB
  Nanny: tcp://127.0.0.1:61243
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-tde769lg

Worker
  Comm: tcp://127.0.0.1:61769
  Total threads: 2
  Dashboard: http://127.0.0.1:61785/status
  Memory: 10.67 GiB
  Nanny: tcp://127.0.0.1:61244
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-0cwfyncz


In this case, we only set the number of workers and the number of threads per worker, and end up with a fairly lightweight cluster underneath. Note that the same 32 GiB of total memory is now divided among 3 workers (10.67 GiB each) rather than 5 (6.40 GiB each).

In [13]:
ens.client.close()

Another option is to simply create a client external to the `Ensemble` and pass it in.

In [14]:
from dask.distributed import Client

client = Client()

ens = Ensemble(client=client)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 62514 instead


In [15]:
client

Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:62514/status

Cluster
  Dashboard: http://127.0.0.1:62514/status
  Workers: 5
  Total threads: 10
  Total memory: 32.00 GiB
  Status: running
  Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:62539
  Workers: 5
  Dashboard: http://127.0.0.1:62514/status
  Total threads: 10
  Started: Just now
  Total memory: 32.00 GiB

Worker
  Comm: tcp://127.0.0.1:63097
  Total threads: 2
  Dashboard: http://127.0.0.1:63129/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62561
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-5v73nnb_

Worker
  Comm: tcp://127.0.0.1:63086
  Total threads: 2
  Dashboard: http://127.0.0.1:63098/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62562
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-z0e6d_82

Worker
  Comm: tcp://127.0.0.1:63131
  Total threads: 2
  Dashboard: http://127.0.0.1:63158/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62563
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-7cq994ji

Worker
  Comm: tcp://127.0.0.1:63149
  Total threads: 2
  Dashboard: http://127.0.0.1:63181/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62565
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-ia92ax0t

Worker
  Comm: tcp://127.0.0.1:63174
  Total threads: 2
  Dashboard: http://127.0.0.1:63191/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62566
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-_4t6s18d


In [16]:
ens.client_info()  # We see that the two are equivalent

Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:62514/status

Cluster
  Dashboard: http://127.0.0.1:62514/status
  Workers: 5
  Total threads: 10
  Total memory: 32.00 GiB
  Status: running
  Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:62539
  Workers: 5
  Dashboard: http://127.0.0.1:62514/status
  Total threads: 10
  Started: Just now
  Total memory: 32.00 GiB

Worker
  Comm: tcp://127.0.0.1:63097
  Total threads: 2
  Dashboard: http://127.0.0.1:63129/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62561
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-5v73nnb_

Worker
  Comm: tcp://127.0.0.1:63086
  Total threads: 2
  Dashboard: http://127.0.0.1:63098/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62562
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-z0e6d_82

Worker
  Comm: tcp://127.0.0.1:63131
  Total threads: 2
  Dashboard: http://127.0.0.1:63158/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62563
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-7cq994ji

Worker
  Comm: tcp://127.0.0.1:63149
  Total threads: 2
  Dashboard: http://127.0.0.1:63181/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62565
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-ia92ax0t

Worker
  Comm: tcp://127.0.0.1:63174
  Total threads: 2
  Dashboard: http://127.0.0.1:63191/status
  Memory: 6.40 GiB
  Nanny: tcp://127.0.0.1:62566
  Local directory: /var/folders/lc/dws63_cs5gz5mf8s869hjpx40000gn/T/dask-worker-space/worker-_4t6s18d


This may be preferable for those who want full control of the `Dask` client API, which may be beneficial when working on external machines/services or when a more complex setup is desired.
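As a rough sketch of what "full control" might look like, the example below builds a `LocalCluster` explicitly and wraps it in a `Client` using the standard `dask.distributed` API. The sizing values are hypothetical and chosen only to keep the demo small. Setting `processes=False` keeps the workers in the current process, and `dashboard_address=":0"` asks for any free port rather than the default.

```python
from dask.distributed import Client, LocalCluster

# Hypothetical sizing for illustration; tune these to your machine.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,         # run workers as threads in this process
    dashboard_address=":0",  # pick any free port for the dashboard
)
client = Client(cluster)

# Inspect what was actually built.
n_workers = len(client.scheduler_info()["workers"])
total_threads = sum(client.nthreads().values())
print(n_workers, total_threads)

# The client could now be handed to the Ensemble, as in the cell above:
# ens = Ensemble(client=client)

# Tear everything down when finished.
client.close()
cluster.close()
```

Building the cluster object separately makes it easier to swap in a different cluster type later without changing the code that consumes the client.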

### Data Partitioning and Parallelization

Partitioning is the subdivision of a single (large) dataset into many (small) datasets. It is a crucial concept to understand when working at scale, as the input data will almost always be too large to fit in memory all at once. Partitioning allows data to be read into memory in manageable chunks.

Partitioning is also crucial for the performance of lsstseries parallelization. In general, workers are allocated to partitions, so having more workers than partitions can leave some workers idle.

Ideally, your input data should be pre-partitioned, as any repartitioning of data is computationally expensive. However, lsstseries is able to repartition input data if required.
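To make the idea of working in manageable chunks concrete, here is a pure-Python sketch (it uses neither `Dask` nor lsstseries). The `partitions` helper and the `flux` list are hypothetical stand-ins; in a real workflow each chunk would be read from storage rather than sliced from a list already in memory.

```python
from typing import Iterator, List


def partitions(rows: List[float], partition_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most `partition_size` rows."""
    for start in range(0, len(rows), partition_size):
        yield rows[start:start + partition_size]


flux = [float(i) for i in range(10)]  # stand-in for a large column

# Combine small per-partition summaries instead of touching every row at once.
count = 0
total = 0.0
for chunk in partitions(flux, partition_size=4):
    count += len(chunk)
    total += sum(chunk)

mean_flux = total / count
print(mean_flux)  # prints: 4.5
```

Only one chunk needs to be held at a time, and each chunk's partial result could equally be computed by a different worker.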

In [17]:
ens = Ensemble(client=client)

# Read in data from a parquet file
ens.from_parquet("../../tests/lsstseries_tests/data/test_subset.parquet",
                id_col='ps1_objid',
                time_col='midPointTai',
                flux_col='psFlux',
                err_col='psFluxErr',
                band_col='filterName',
                partition_size='5KB')

ens.info()

Object Table
<class 'dask.dataframe.core.DataFrame'>
Int64Index: 15 entries, 88472468910699998 to 88480001353815785
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   nobs_g      15 non-null      float64
 1   nobs_r      15 non-null      float64
 2   nobs_total  15 non-null      int64
dtypes: float64(2), int64(1)
memory usage: 480.0 bytes
Source Table
<class 'dask.dataframe.core.DataFrame'>
Int64Index: 2000 entries, 88472468910699998 to 88480001353815785
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   midPointTai  2000 non-null      float32
 1   psFlux       2000 non-null      float32
 2   psFluxErr    2000 non-null      float32
 3   filterName   2000 non-null      object
dtypes: object(1), float32(3)
memory usage: 54.7 KB


Here we read in our example dataset, but this time specify the `partition_size` keyword, setting it to '5KB'. This lets the ensemble know that the input dataset, which consists of a single partition, should be repartitioned into 5KB-sized chunks. This results in a new dataframe with 32 partitions. Alternatively, we could have supplied `n_partitions` to specify a target number of output partitions.

However, **be careful**, as in this instance we've likely split the data from each timeseries across multiple partitions. The consequence is that operations on a single lightcurve now need to access multiple partitions, which adds significant data shuffling costs to your workflow. If your timeseries are split across multiple partitions, you'll also want to explicitly set `use_map=False` in `Ensemble.batch` calls; otherwise, it will use `Dask` [map_partitions](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html), which only operates on the lightcurve data it sees within each partition!

In [18]:
from lsstseries.analysis.stetsonj import calc_stetson_J
import numpy as np

mapres = ens.batch(calc_stetson_J, use_map=True)  # will not know to look at multiple partitions to get lightcurve data
groupres = ens.batch(calc_stetson_J, use_map=False)  # will know to look at multiple partitions, with shuffling costs

print("number of lightcurve results in mapres: ", len(mapres))
print("number of lightcurve results in groupres: ", len(groupres))
print("True number of lightcurves in the dataset:", len(np.unique(ens._source.index)))

number of lightcurve results in mapres:  45
number of lightcurve results in groupres:  15
True number of lightcurves in the dataset: 15


As you can see, `map_partitions` finds the chunk of each lightcurve in each partition and computes a result for each chunk separately (45 results for 15 lightcurves), while the alternative (a groupby) does know to look in multiple partitions, at the cost of data shuffling. `map_partitions` is available as an option because it is a performant choice when the lightcurves are optimally partitioned, that is, when each lightcurve sits entirely within one partition. This is part of why it's ideal for the partitioning to be done before you've touched the data, as there's a better chance that the input source has stored its lightcurves with this requirement met.
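The difference can be reproduced in miniature with a pure-Python sketch (again, not `Dask` or lsstseries code). The toy rows, the chunk size of 5, and the `mean_per_object` helper are all assumptions made for illustration. Three lightcurves of four points each are split by row count, so two of them straddle a partition boundary.

```python
from collections import defaultdict

# (object_id, flux) rows sorted by object: three lightcurves, four points each.
rows = [(obj, float(i)) for obj in (1, 2, 3) for i in range(4)]

# Size-based partitioning ignores object boundaries: chunks of 5 rows.
parts = [rows[i:i + 5] for i in range(0, len(rows), 5)]


def mean_per_object(chunk):
    """Group one chunk by object_id and return {object_id: mean flux}."""
    grouped = defaultdict(list)
    for obj, flux in chunk:
        grouped[obj].append(flux)
    return {obj: sum(v) / len(v) for obj, v in grouped.items()}


# "map"-style: each partition is processed alone, so a split lightcurve
# produces one (partial) result per partition it appears in.
map_results = [
    (obj, mean) for chunk in parts for obj, mean in mean_per_object(chunk).items()
]

# "groupby"-style: all rows for an object are gathered before computing.
group_results = mean_per_object(rows)

print(len(map_results), len(group_results))  # prints: 5 3
```

Besides producing too many results, the per-partition values are themselves wrong for the split objects: object 2 yields partial means of 0.0 and 2.0 instead of the true 1.5.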

In [19]:
ens.client.close()

### No one size fits all

With regard to `Dask`, the client setup, and partitioning, there is no one solution that will apply to all use cases. We have provided the tools to tailor these as needed, but it does require learning what options are available and how they apply to your particular use case. We've gone over the general points above, but if you're interested in learning more, `Dask` has a [best practices](https://docs.dask.org/en/stable/best-practices.html) page that discusses some dos and don'ts, which we recommend.