# Working with the lsstseries `Ensemble` object

When working with many lightcurves, the lsstseries `Ensemble` object serves as a single interface for storing, filtering, and analyzing timeseries data.
Let's consider an example set of lightcurves:

In [1]:
from lsstseries.ensemble import Ensemble

ens = Ensemble()  # initialize an ensemble object

# Read in data from a parquet file
ens.from_parquet("../../tests/lsstseries_tests/data/test_subset.parquet",
                id_col='ps1_objid',
                time_col='midPointTai',
                flux_col='psFlux',
                err_col='psFluxErr',
                band_col='filterName')

<lsstseries.ensemble.Ensemble at 0x10699f610>

We now have an `Ensemble` object and have provided it with data from a parquet file. Within the call to `Ensemble.from_parquet`, we specified which columns of the input file map to the timeseries quantities that the `Ensemble` needs to understand. It's important to link these arguments correctly, because the `Ensemble` uses this mapping whenever an operation needs one of these quantities. For example, if an lsstseries analysis function requires the time column, the `Ensemble` will automatically supply that function with the 'midPointTai' column.
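To make the idea of column linking concrete, here is a minimal sketch in plain Python. It is not how lsstseries is implemented: `ColumnMap` and `get_column` are hypothetical names invented for this illustration, and the toy table simply reuses the column labels from the `from_parquet` call above.

```python
# Hypothetical illustration only: ColumnMap and get_column are not lsstseries names.
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMap:
    """Maps the quantities an analysis needs to the column labels in a file."""
    id_col: str
    time_col: str
    flux_col: str
    err_col: str
    band_col: str


def get_column(table, column_map, quantity):
    """Return the data for a logical quantity such as 'time' or 'flux'."""
    label = getattr(column_map, f"{quantity}_col")
    return table[label]


# Same column labels as in the from_parquet call above
column_map = ColumnMap(
    id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# A toy two-row table standing in for the parquet file
table = {
    "ps1_objid": [88472468910699998, 88472935274829959],
    "midPointTai": [58972.382812, 58246.460938],
    "psFlux": [21.035807, 18.149910],
    "psFluxErr": [0.199991, 0.026191],
    "filterName": ["r", "r"],
}

# A function that asks for "time" receives the 'midPointTai' column
times = get_column(table, column_map, "time")
print(times)
```

The point of the sketch is that analysis code only ever asks for a logical quantity ("time", "flux", "band"); the mapping decides which physical column answers the request.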

## Dask and "Lazy Evaluation"

Before going any further, note that the `Ensemble` is built on top of `Dask`, which brings with it a powerful framework for parallelization and scalability. However, `Dask` code works somewhat differently from ordinary Python, and if you're unfamiliar with it, those differences are worth establishing up front. The first is that `Dask` evaluates code "lazily": many operations are not executed when the line of code is run, but are instead added to a scheduler and executed only when the result is actually needed. See below for an example:

In [2]:
ens._data  # We have not actually loaded any data into memory

| | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| npartitions=1 | | | | |
| | float32 | float32 | float32 | object |
| | ... | ... | ... | ... |


Here we are accessing the underlying Dask DataFrame. Despite having run a command to read in our data, we see only an empty dataframe with some high-level information available. To explicitly bring the data into memory, we must run a `compute()` command.

In [3]:
ens.compute() # Compute lets dask know we're ready to bring the data into memory

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.149910 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |
| ... | ... | ... | ... | ... |
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.037670 | r |
| 88480001353815785 | 59134.300781 | 18.168980 | 0.037996 | r |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r |


After this compute, we see above that we now have a populated dataframe (a Pandas dataframe, in fact!). As a result, many workflows in Dask, and by extension lsstseries, look like a series of lazily evaluated commands that are chained together and then executed with a `.compute()` call at the end of the workflow.
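The chain-then-compute pattern can be illustrated without Dask at all. The following is a toy sketch of the idea only: `LazyValue` is a hypothetical class written for this example, not part of Dask or lsstseries, and it ignores everything Dask actually does with task graphs, partitions, and scheduling.

```python
# Toy illustration of lazy evaluation; LazyValue is hypothetical, not a Dask class.
class LazyValue:
    """Records a chain of operations and runs them only on compute()."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that produces the data
        self._steps = tuple(steps)   # operations queued so far

    def map(self, func):
        # Queue the operation; nothing is executed here
        return LazyValue(self._source, self._steps + (func,))

    def compute(self):
        # Only now is the data produced and every queued step applied in order
        value = self._source()
        for step in self._steps:
            value = step(value)
        return value


calls = []  # records when the "expensive" load actually happens


def load():
    calls.append("load")
    return [21.0, 18.1, float("nan"), 18.2]


lazy = (
    LazyValue(load)
    .map(lambda xs: [x for x in xs if x == x])  # drop NaNs (NaN != NaN)
    .map(lambda xs: sum(xs) / len(xs))          # mean of what remains
)

calls_before = list(calls)
print(calls_before)      # [] : load() has not run yet

result = lazy.compute()  # the whole chain executes here
print(calls)             # ['load']
print(round(result, 2))  # 19.1
```

Building the chain costs almost nothing; the data is touched only inside `compute()`. This is the same shape as the `Ensemble` workflow, where operations are queued and the work happens when a result is requested.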

## Inspection and Filtering

The `Ensemble` contains an assortment of functions for inspecting and filtering your data.

### Inspection

These functions provide views into the contents of your `Ensemble` dataframe, especially important when dealing with large data volumes that cannot be brought into memory all at once. The first is `Ensemble.info` which provides information on the columns, data types, and memory usage of the dataframe.

In [4]:
# Inspection

ens.info(verbose=True, memory_usage=True) # Grabs high level information about the dataframe


<class 'dask.dataframe.core.DataFrame'>
Int64Index: 2000 entries, 88472468910699998 to 88480001353815785
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   midPointTai  2000 non-null      float32
 1   psFlux       2000 non-null      float32
 2   psFluxErr    2000 non-null      float32
 3   filterName   2000 non-null      object
dtypes: object(1), float32(3)
memory usage: 54.7 KB


`Ensemble.info` shows that we have 2000 rows using 54.7 KB of memory, and lists the columns we've brought in with their respective data types. If you'd like to actually bring a few rows into memory to inspect, `Ensemble.head` and `Ensemble.tail` provide access to the first n and last n rows respectively.

In [5]:
ens.head(5) # Grabs the first 5 rows

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.14991 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |


In [6]:

ens.tail(5) # Grabs the last 5 rows

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.03767 | r |
| 88480001353815785 | 59134.300781 | 18.16898 | 0.037996 | r |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r |
| 88480001353815785 | 59138.277344 | 18.115007 | 0.036673 | r |


Additionally, when you are working with a small enough dataset, `Ensemble.compute` can be used to bring the whole dataframe into memory (as shown previously). 

In [7]:
ens.compute()

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.149910 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |
| ... | ... | ... | ... | ... |
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.037670 | r |
| 88480001353815785 | 59134.300781 | 18.168980 | 0.037996 | r |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r |


### Filtering

Filtering encompasses any action that removes rows from the dataframe based on some threshold or criterion. The `Ensemble` provides a general filtering function, as shown below [TO DO]

In [8]:
# TODO: GENERAL FILTER FUNCTION

Additionally, several more specific functions are available for common operations.

In [9]:
# Cleaning nans
ens.dropna(threshold=1) # threshold is the number of nans present in a row needed to drop the row

# Filtering on number of observations
ens.prune(threshold=50) # threshold is the number of observations needed to retain the object

ens.info(verbose=True)

<class 'dask.dataframe.core.DataFrame'>
Int64Index: 1928 entries, 88472935274829959 to 88480001353815785
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   midPointTai  1928 non-null      float32
 1   psFlux       1928 non-null      float32
 2   psFluxErr    1928 non-null      float32
 3   filterName   1928 non-null      object
 4   num_obs      1928 non-null      int64
dtypes: object(1), float32(3), int64(1)
memory usage: 67.8 KB


In the above operations, we first remove any rows that contain at least 1 NaN value, and then filter so that only lightcurves with at least 50 measurements are retained. `Ensemble.info` then shows that we've reduced the size of the dataset from 2000 rows to 1928 rows.
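The two-step logic described above (drop rows with NaNs, then keep only objects with enough observations) can be sketched on a toy dataset with plain NumPy. This is only an illustration of the filtering logic, not the lsstseries implementation; the arrays and the `min_obs` value are made up for the example.

```python
import numpy as np

# Toy data: three objects with 3, 1, and 2 observations; one flux is NaN
obj_id = np.array([1, 1, 1, 2, 3, 3])
flux = np.array([18.1, np.nan, 18.3, 20.0, 15.4, 15.5])

# Step 1 (analogous to dropping rows with at least one NaN)
keep = ~np.isnan(flux)
obj_id, flux = obj_id[keep], flux[keep]

# Step 2 (analogous to pruning): count the remaining observations per object
# and keep only objects that have at least min_obs of them
min_obs = 2  # hypothetical threshold for this toy example
ids, counts = np.unique(obj_id, return_counts=True)
good_ids = ids[counts >= min_obs]
keep = np.isin(obj_id, good_ids)
obj_id, flux = obj_id[keep], flux[keep]

print(obj_id)  # [1 1 3 3]
print(flux)    # [18.1 18.3 15.4 15.5]
```

Note that the order matters: object 1 starts with three rows but is counted as having two once its NaN row is removed, so a stricter threshold of 3 would have dropped it.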

## Batch Analysis

The `Ensemble` provides a powerful batching interface, `Ensemble.batch`, with built-in parallelization (provided the input data is in multiple partitions). In addition, lsstseries has a suite of analysis functions on hand for your use. Below, we show the application of `lsstseries.analysis.calc_stetson_J` on our dataset.

In [10]:
# using lsstseries analysis functions
from lsstseries.analysis import calc_stetson_J
res = ens.batch(calc_stetson_J, compute=True) # compute is set to true to execute immediately (non-lazily)
res

ps1_objid
88472935274829959      {'g': -0.04174282, 'r': 0.6075282}
88480000290704349    {'g': -0.15279391, 'r': -0.19460204}
88480000340043358       {'g': 0.49833068, 'r': 0.5957333}
88480000543134170    {'g': -0.23631111, 'r': -0.21187533}
88480001064201774       {'g': 0.13734582, 'r': 0.2295931}
88480001238615003    {'g': 0.011276966, 'r': 0.093267344}
88480001239454674    {'g': 0.011276966, 'r': 0.093267344}
88480001353815785                       {'r': 0.09490132}
Name: ps1_objid, dtype: object
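The result above has one entry per `ps1_objid`, which suggests that the function is evaluated once per lightcurve. As a rough mental model, and assuming that per-object grouping is indeed what happens, the following single-process sketch shows the pattern. `batch_sketch` is a hypothetical helper written for this illustration; it is not the lsstseries implementation and does no parallelization.

```python
import numpy as np


def batch_sketch(func, ids, *columns, **kwargs):
    """Call func once per unique id, passing only that object's rows."""
    results = {}
    for obj in np.unique(ids):
        rows = ids == obj  # boolean mask selecting one lightcurve
        results[int(obj)] = func(*(col[rows] for col in columns), **kwargs)
    return results


# Toy data: two objects with three and two flux measurements
ids = np.array([101, 101, 101, 202, 202])
flux = np.array([18.0, 18.2, 18.4, 20.0, 21.0])

res = batch_sketch(np.mean, ids, flux)
for obj, value in res.items():
    print(obj, round(float(value), 2))
```

Because each lightcurve is processed independently of the others, this kind of computation splits naturally across partitions, which is what makes parallel execution possible.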

## Using a Custom Analysis Function
The analysis functions contained in lsstseries are meant to provide a collection of performant, on-hand routines for common timeseries use cases. However, lsstseries is also equipped to handle externally defined functions. Let's walk through a short example of defining a simple custom function and applying it through `Ensemble.batch`.

Here we define a simple function that returns an average flux for each photometric band. It requires an array of fluxes and an array of band labels (one per measurement), and has a keyword argument for choosing the averaging strategy (mean or median).

In [11]:
import numpy as np

# Defining a simple function
def my_flux_average(flux_array, band_array, method='mean'):
    """Read in an array of fluxes, and return the average of the fluxes by band"""
    res = {}
    for band in np.unique(band_array):
        mask = band_array == band  # Boolean mask selecting this band's measurements
        band_fluxes = flux_array[mask]  # Fluxes measured in this band
        if method == "mean":
            res[band] = np.mean(band_fluxes)
        elif method == "median":
            res[band] = np.median(band_fluxes)
    return res
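Before handing a custom function to the `Ensemble`, it can be useful to sanity-check it on small arrays where the answer is easy to verify by hand. The sketch below repeats the function so that it runs on its own; the toy flux and band arrays are made up for the check.

```python
import numpy as np


def my_flux_average(flux_array, band_array, method="mean"):
    """Per-band average flux, repeated here so the snippet runs on its own."""
    res = {}
    for band in np.unique(band_array):
        band_fluxes = flux_array[band_array == band]
        if method == "mean":
            res[band] = np.mean(band_fluxes)
        elif method == "median":
            res[band] = np.median(band_fluxes)
    return res


# Two 'g' measurements and three 'r' measurements
flux = np.array([1.0, 3.0, 10.0, 20.0, 60.0])
band = np.array(["g", "g", "r", "r", "r"])

mean_res = my_flux_average(flux, band, method="mean")
median_res = my_flux_average(flux, band, method="median")

print(float(mean_res["g"]), float(mean_res["r"]))      # 2.0 30.0
print(float(median_res["g"]), float(median_res["r"]))  # 2.0 20.0
```

The 'r' band is chosen so that the mean (30.0) and median (20.0) differ, which confirms that the `method` keyword is actually being honored.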

With the function defined, we next supply it to `Ensemble.batch`. We must provide the labels of the `Ensemble` columns we want to use as arguments, as well as any keyword arguments. In this case, we pass along `"psFlux"` and `"filterName"`, so that the `Ensemble` will map those columns to `flux_array` and `band_array` respectively. We also pass `method='median'`, which will pass that kwarg along to `my_flux_average`.

In [12]:
# Applying the function to the ensemble
res = ens.batch(my_flux_average, "psFlux", "filterName", compute=True, meta=None, method='median')
res

ps1_objid
88472935274829959    {'g': 18.912304, 'r': 18.221779}
88480000290704349     {'g': 20.49425, 'r': 20.169392}
88480000340043358    {'g': 15.461195, 'r': 15.025036}
88480000543134170    {'g': 20.640196, 'r': 19.817234}
88480001064201774    {'g': 17.595337, 'r': 17.169535}
88480001238615003    {'g': 21.040936, 'r': 20.400923}
88480001239454674    {'g': 21.040936, 'r': 20.400923}
88480001353815785                    {'r': 18.165283}
Name: ps1_objid, dtype: object

We see that we now have a Pandas `Series` of `my_flux_average` results indexed by object ID (one per lightcurve). In many cases, this may not be the ideal output for your function. This output is controlled by the `Dask` `meta` parameter. For more information on this parameter, you can read the `Dask` [documentation](https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument). You may pass the `meta` parameter through `Ensemble.batch`, as shown above.

In [13]:
ens.__del__() # Tear down the ensemble client, TODO: we should make a wrapper function for this

<lsstseries.ensemble.Ensemble at 0x10699f610>