# Working with the TAPE `Ensemble` object

When working with many lightcurves, the TAPE `Ensemble` object serves as a single interface for storing, filtering, and analyzing timeseries data.
Let's consider an example set of lightcurves, generated as follows:

In [1]:
import numpy as np
np.random.seed(1)

# initialize a dictionary of empty arrays
source_dict = {"id": np.array([]),
                   "time": np.array([]),
                   "flux": np.array([]),
                   "error": np.array([]),
                   "band": np.array([])}

# Create 10 lightcurves with 100 measurements each
lc_len = 100
for i in range(10):
    source_dict["id"] = np.append(source_dict["id"], np.array([i]*lc_len)).astype(int)
    source_dict["time"] = np.append(source_dict["time"], np.linspace(1, lc_len, lc_len))
    source_dict["flux"] = np.append(source_dict["flux"], 100 + 50 * np.random.rand(lc_len))
    source_dict["error"] = np.append(source_dict["error"], 10 + 5 * np.random.rand(lc_len))
    source_dict["band"] = np.append(source_dict["band"], ["g"]*50+["r"]*50)
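
The loop above rebuilds every array on each pass through `np.append`. As a hedged alternative, the sketch below builds an equivalent dictionary in one step with `np.repeat` and `np.tile`. It assumes a `np.random.default_rng` generator instead of `np.random.seed`, so the random flux and error values will not match the ones shown in the outputs of this notebook; the variable `n_lc` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: a Generator rather than np.random.seed

n_lc = 10     # number of lightcurves (illustrative name)
lc_len = 100  # measurements per lightcurve

source_dict = {
    # each id repeated lc_len times: 0, 0, ..., 1, 1, ...
    "id": np.repeat(np.arange(n_lc), lc_len),
    # the same time grid 1..lc_len, tiled once per lightcurve
    "time": np.tile(np.linspace(1, lc_len, lc_len), n_lc),
    "flux": 100 + 50 * rng.random(n_lc * lc_len),
    "error": 10 + 5 * rng.random(n_lc * lc_len),
    # first half of each lightcurve in "g", second half in "r"
    "band": np.tile(np.array(["g"] * 50 + ["r"] * 50), n_lc),
}
```

The resulting dictionary has the same keys and array lengths as the loop-based version, so it can be passed to the loading call in the same way.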

We can load these into the `Ensemble` using `Ensemble.from_source_dict()`:

In [2]:
from tape.ensemble import Ensemble

ens = Ensemble()  # initialize an ensemble object

# Read in the generated lightcurve data
ens.from_source_dict(source_dict, 
                     id_col="id",
                     time_col="time",
                     flux_col="flux",
                     err_col="error",
                     band_col="band")

<tape.ensemble.Ensemble at 0x1033bf8e0>

We now have an `Ensemble` object, and have provided it with the constructed data in the source dictionary. Within the call to `Ensemble.from_source_dict`, we specified which columns of the input dictionary map to the timeseries quantities that the `Ensemble` needs to understand. It's important to link these arguments properly, as the `Ensemble` will use these columns when operations are requested on understood quantities. For example, if a TAPE analysis function requires the time column, the `Ensemble` will use this linking to automatically supply that function with the 'time' column.

## Column Mapping with the ColumnMapper

In the above example, we manually provided the column labels within the call to `Ensemble.from_source_dict`. Alternatively, the `tape.utils.ColumnMapper` class offers a means to assign the column mappings, either manually as shown before or populated from a known mapping scheme.

In [3]:
from tape.utils import ColumnMapper

# columns assigned manually
col_map = ColumnMapper().assign(id_col="id",
                                time_col="time",
                                flux_col="flux",
                                err_col="error",
                                band_col="band")

# Pass the ColumnMapper along to from_source_dict
ens.from_source_dict(source_dict, column_mapper=col_map)

<tape.ensemble.Ensemble at 0x1033bf8e0>

## The Object and Source Frames
The `Ensemble` maintains two dataframes under the hood, the "object dataframe" and the "source dataframe". This borrows from the Rubin Observatory object-source convention, where an object denotes a given astronomical object and the sources are the collection of measurements of that object. Essentially, the object frame stores one-off information about objects, and the source frame stores the available time-domain data. As a result, `Ensemble` functions that operate on the underlying dataframes need to be pointed at either object or source. In most cases, the default is the object table, as it's a more helpful interface for understanding the contents of the `Ensemble`, especially when dealing with large volumes of data.

We can also access the `Ensemble` frames individually with `Ensemble.source` and `Ensemble.object`.
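
To make the object-source split concrete, here is a minimal sketch in plain Pandas. It is only an illustration of the convention, not how TAPE builds its frames internally; the `mean_flux` column is a made-up summary chosen for the example.

```python
import pandas as pd

# Source-style table: one row per measurement, indexed by object id
source_df = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "time": [1.0, 2.0, 3.0, 1.0, 2.0],
        "flux": [120.0, 136.0, 100.0, 115.0, 107.0],
    }
).set_index("id")

# Object-style table: one row per object, holding per-object information
object_df = source_df.groupby("id").agg(
    nobs_total=("flux", "size"),  # number of measurements for the object
    mean_flux=("flux", "mean"),   # illustrative per-object summary
)
print(object_df)
```

The source table has five rows spread over two ids, while the object table has exactly one row per id.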

## Dask and "Lazy Evaluation"

Before going any further, note that the `Ensemble` is built on top of `Dask`, which brings with it a powerful framework for parallelization and scalability. However, `Dask` code works somewhat differently from what you may be used to, and if you're unfamiliar with it, those differences are worth establishing at the outset. The first is that `Dask` evaluates code "lazily", meaning that many operations are not executed when the line of code is run, but are instead added to a scheduler to be executed when the result is actually needed. See below for an example:

In [4]:
ens.source  # We have not actually loaded any data into memory

Unnamed: 0_level_0,time,flux,error,band
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,float64,float64,float64,string
,...,...,...,...


Here we are accessing the Dask dataframe and, despite having run a command to read in our source data, we only see an empty dataframe with some high-level information available. To explicitly bring the data into memory, we must run a `compute()` command on the data's frame.

In [5]:
ens.source.compute()  # Compute lets dask know we're ready to bring the data into memory

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1.0,120.851100,11.633225,g
0,2.0,136.016225,12.635291,g
0,3.0,100.005719,14.429710,g
0,4.0,115.116629,11.786349,g
0,5.0,107.337795,14.542676,g
...,...,...,...,...
9,96.0,138.371176,12.237541,r
9,97.0,104.060829,10.920638,r
9,98.0,149.920678,14.143664,r
9,99.0,119.480601,10.154990,r


With this compute, we see above that we have returned a populated dataframe (a Pandas dataframe in fact!). As a result, many workflows in Dask, and by extension TAPE, will look like a series of lazily evaluated commands that are chained together and then executed with a `.compute()` call at the end of the workflow.

Alternatively, we can use `ens.persist()` to execute the chained commands without bringing the result back as a Pandas dataframe; the result remains a Dask collection. This can speed up future `compute()` calls.

Note that `Ensemble.source` and `Ensemble.object` are instances of the `tape.SourceFrame` and `tape.ObjectFrame` classes respectively. These are subclasses of Dask dataframes that provide some additional utility for tracking by the ensemble while supporting most of the Dask dataframe API.  
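
The record-now, run-later pattern described in this section can be illustrated with a toy class in plain Python. This is a hedged sketch of the idea only: `LazyValue` is a hypothetical name invented for this example and does not reflect how Dask or TAPE are implemented.

```python
class LazyValue:
    """Toy stand-in for a lazily evaluated collection (not Dask's design)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # queued operations, not yet executed

    def map(self, func):
        # Record the operation and return a new lazy object; nothing runs here
        return LazyValue(self._data, self._steps + (func,))

    def compute(self):
        # Only now is every queued operation applied, in order
        result = list(self._data)
        for func in self._steps:
            result = [func(x) for x in result]
        return result


calls = []


def double(x):
    calls.append(x)  # record that the function actually ran
    return 2 * x


lazy = LazyValue([1, 2, 3]).map(double).map(lambda x: x + 1)
print(calls)           # [] because nothing has executed yet
print(lazy.compute())  # [3, 5, 7]
print(calls)           # [1, 2, 3]
```

Chaining `map` calls is cheap because each one only extends the list of pending steps; the work happens once, inside `compute`.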

## Inspection, Filtering, and Selecting

The `Ensemble` contains an assortment of functions for inspecting and filtering your data.

### Inspection

These functions provide views into the contents of your `Ensemble` dataframe, especially important when dealing with large data volumes that cannot be brought into memory all at once. The first is `Ensemble.info` which provides information on the columns, data types, and memory usage of the dataframe.

In [6]:
# Inspection

ens.info(verbose=True, memory_usage=True)  # Grabs high level information about the dataframes

Object Table
<class 'tape.ensemble_frame.ObjectFrame'>
Index: 0 entries
Empty ObjectFrame
Source Table
<class 'tape.ensemble_frame.SourceFrame'>
Index: 1000 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   time    1000 non-null      float64
 1   flux    1000 non-null      float64
 2   error   1000 non-null      float64
 3   band    1000 non-null      string
dtypes: float64(3), string(1)
memory usage: 36.1 KB


`Ensemble.info` shows that the source table has 1000 rows using 36.1 KB of memory, and shows the columns we've brought in with their respective data types. If you'd like to actually bring a few rows into memory to inspect, `EnsembleFrame.head` and `EnsembleFrame.tail` provide access to the first n and last n rows respectively.

In [7]:
ens.object.head(5)  # Grabs the first 5 rows of the object table

0
1
2
3
4


In [8]:
ens.source.tail(5)  # Grabs the last 5 rows of the source table

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9,96.0,138.371176,12.237541,r
9,97.0,104.060829,10.920638,r
9,98.0,149.920678,14.143664,r
9,99.0,119.480601,10.15499,r
9,100.0,145.260138,14.733641,r


Additionally, when you are working with a small enough dataset, `Ensemble.compute` can be used to bring the whole dataframe into memory (as shown previously). 

In [9]:
ens.source.compute()

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1.0,120.851100,11.633225,g
0,2.0,136.016225,12.635291,g
0,3.0,100.005719,14.429710,g
0,4.0,115.116629,11.786349,g
0,5.0,107.337795,14.542676,g
...,...,...,...,...
9,96.0,138.371176,12.237541,r
9,97.0,104.060829,10.920638,r
9,98.0,149.920678,14.143664,r
9,99.0,119.480601,10.154990,r


### Filtering

The `Ensemble` provides a general filtering function `query` that mirrors a Pandas or Dask `query` command. Specifically, the function takes a string that provides an expression indicating which rows to **keep**. As with other `Ensemble` functions, an optional `table` parameter allows you to filter on either the object or the source table.

For example, the following code filters the sources to only include rows with a flux value above 130.0. It uses `ens._flux_col` to retrieve the name of the column with that information.

In [10]:
ens.source.query(f"{ens._flux_col} > 130.0").compute()

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,2.0,136.016225,12.635291,g
0,12.0,134.260975,10.685679,g
0,14.0,143.905872,13.484091,g
0,16.0,133.523376,13.777315,g
0,21.0,140.037228,10.099401,g
...,...,...,...,...
9,91.0,140.368263,14.320720,r
9,92.0,148.476901,12.239495,r
9,96.0,138.371176,12.237541,r
9,98.0,149.920678,14.143664,r


Alternatively, we could use a Dask series of Booleans to indicate which rows to *keep*. We can often compute these series as the result of some operation on the underlying tables:

In [11]:
keep_rows = ens.source["error"] < 12.0
keep_rows.compute()

id
0     True
0    False
0    False
0     True
0    False
     ...  
9    False
9     True
9    False
9     True
9    False
Name: error, Length: 1000, dtype: bool
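
The two filtering styles shown so far, a query string and a Boolean series, select the same rows when they encode the same condition. The sketch below demonstrates this in plain Pandas on a tiny made-up table; the `flux_col` variable stands in for a stored column name such as `ens._flux_col`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "flux": [120.9, 136.0, 100.0, 143.9],
        "error": [11.6, 12.6, 14.4, 10.1],
    },
    index=pd.Index([0, 0, 1, 1], name="id"),
)

flux_col = "flux"  # stands in for a stored column name

# String expression: rows where the condition holds are kept
by_query = df.query(f"{flux_col} > 130.0")

# Equivalent Boolean mask: True marks a row to keep
keep_rows = df[flux_col] > 130.0
by_mask = df[keep_rows]

print(by_query.equals(by_mask))  # True
```

A query string is convenient for simple conditions, while a precomputed mask is handy when the condition comes out of an earlier calculation.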

We also provide filtering at the `Ensemble` level, so you can pass that series to the `Ensemble.filter_from_series` function:

In [12]:
ens.filter_from_series(keep_rows, table="source")
ens.compute("source")

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1.0,120.851100,11.633225,g
0,4.0,115.116629,11.786349,g
0,7.0,109.313011,10.079106,g
0,11.0,120.959726,10.861703,g
0,12.0,134.260975,10.685679,g
...,...,...,...,...
9,88.0,134.215481,11.202422,r
9,89.0,147.302751,11.271162,r
9,90.0,110.009303,10.420432,r
9,97.0,104.060829,10.920638,r


Additionally, several more specific functions are available for common operations.

In [13]:
# Cleaning nans
ens.source.dropna()  # clean nans from source table
ens.object.dropna()  # clean nans from object table

# Filtering on number of observations
ens.prune(threshold=10)  # threshold is the minimum number of observations needed to retain the object

ens.info(verbose=True)

Object Table
<class 'tape.ensemble_frame.ObjectFrame'>
Index: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   nobs_total  10 non-null      int64
dtypes: int64(1)
memory usage: 160.0 bytes
Source Table
<class 'tape.ensemble_frame.SourceFrame'>
Index: 387 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   time    387 non-null      float64
 1   flux    387 non-null      float64
 2   error   387 non-null      float64
 3   band    387 non-null      string
dtypes: float64(3), string(1)
memory usage: 14.0 KB


In the above operations, we remove any rows that have at least 1 NaN value present, and then filter such that only lightcurves with at least 10 measurements are retained.
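
The logic of these two steps can be sketched in plain Pandas on a small made-up table. This illustrates the idea only and is not TAPE's implementation; the threshold of 2 is chosen to suit the toy data.

```python
import numpy as np
import pandas as pd

source_df = pd.DataFrame(
    {
        "flux": [120.0, np.nan, 100.0, 115.0, 107.0, 99.0],
        "error": [11.0, 12.0, 14.0, 11.0, 14.0, 10.0],
    },
    index=pd.Index([0, 0, 0, 1, 1, 2], name="id"),
)

# Step 1: drop any row containing at least one NaN
cleaned = source_df.dropna()

# Step 2: count the remaining observations per object
nobs = cleaned.groupby("id").size()

# Step 3: keep only objects with at least `threshold` observations
threshold = 2
keep_ids = nobs[nobs >= threshold].index
pruned = cleaned[cleaned.index.isin(keep_ids)]

print(pruned.index.unique().tolist())  # [0, 1]
```

Object 2 is removed because it has a single measurement, and object 0 survives with two measurements after its NaN row is dropped.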

### Selecting

The `Ensemble` also provides a `select` function to filter down to a subset of columns.

In [14]:
# Add a new column so we can filter it out later.
ens._source = ens._source.assign(band2=ens._source["band"] + "2")
ens.compute("source")

Unnamed: 0_level_0,time,flux,error,band,band2
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,1.0,120.851100,11.633225,g,g2
0,4.0,115.116629,11.786349,g,g2
0,7.0,109.313011,10.079106,g,g2
0,11.0,120.959726,10.861703,g,g2
0,12.0,134.260975,10.685679,g,g2
...,...,...,...,...,...
9,88.0,134.215481,11.202422,r,r2
9,89.0,147.302751,11.271162,r,r2
9,90.0,110.009303,10.420432,r,r2
9,97.0,104.060829,10.920638,r,r2


In [15]:
ens.select(["time", "flux", "error", "band"], table="source")
ens.compute("source")

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1.0,120.851100,11.633225,g
0,4.0,115.116629,11.786349,g
0,7.0,109.313011,10.079106,g
0,11.0,120.959726,10.861703,g
0,12.0,134.260975,10.685679,g
...,...,...,...,...
9,88.0,134.215481,11.202422,r
9,89.0,147.302751,11.271162,r
9,90.0,110.009303,10.420432,r
9,97.0,104.060829,10.920638,r


## Updating an Ensemble's Frames

In the above examples, we demonstrated several methods that generate filtered views of the source table.

However, note that the underlying data remained unchanged, with no changes to the rows or columns of `Ensemble.source`.

In [18]:
queried_src = ens.source.query(f"{ens._flux_col} > 130.0")

print(len(queried_src))
print(len(ens.source))

169
387


When modifying the views of a dataframe tracked by the `Ensemble`, we can update the `Source` or `Object` frame to use the updated view by calling

`Ensemble.update_frame(view_frame)`

Or alternately:

`view_frame.update_ensemble()`

In [19]:
# Now apply the view's filter to the source frame.
queried_src.update_ensemble()

ens.source.compute()

Unnamed: 0_level_0,time,flux,error,band
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,12.0,134.260975,10.685679,g
0,21.0,140.037228,10.099401,g
0,22.0,148.413079,10.131055,g
0,24.0,134.616131,11.231055,g
0,30.0,143.907125,11.395918,g
...,...,...,...,...
9,81.0,149.016644,10.755373,r
9,85.0,130.071670,11.960329,r
9,86.0,136.297942,11.419338,r
9,88.0,134.215481,11.202422,r


Note that the above is still a series of lazy operations that will not be fully evaluated until an operation such as `compute`. So a call to `update_ensemble` will not yet alter or move any underlying data.

## Assignments and Column Manipulation

The ensemble object supports assignment through the Dask `assign` function. We can pass in either a callable or a series to assign to the new column. New column names are produced automatically from the argument name.

For example, if we want to compute the lower bound of an error range as the estimated flux minus twice the estimated error, we would use:

In [20]:
lower_bnd = ens.source.assign(lower_bnd=lambda x: x["flux"] - 2.0 * x["error"])
lower_bnd

Unnamed: 0_level_0,time,flux,error,band,lower_bnd
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,float64,float64,float64,string,float64
,...,...,...,...,...
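
The callable and series forms of `assign` produce the same column. Here is a minimal sketch in plain Pandas, whose `assign` has the same calling convention, using two made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"flux": [120.0, 136.0], "error": [11.0, 12.5]})

# Callable form: the function receives the dataframe and returns the new column
with_callable = df.assign(lower_bnd=lambda x: x["flux"] - 2.0 * x["error"])

# Series form: compute the column first, then pass it in
with_series = df.assign(lower_bnd=df["flux"] - 2.0 * df["error"])

print(with_callable["lower_bnd"].tolist())  # [98.0, 111.0]
print(with_callable.equals(with_series))    # True
```

In both forms `assign` returns a new dataframe, so the original `df` keeps only its `flux` and `error` columns.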


## Batch Analysis

The `Ensemble` provides a powerful batching interface, `Ensemble.batch`, with in-built parallelization (provided the input data is in multiple partitions). In addition, TAPE has a suite of analysis functions on-hand for your use. Below, we show the application of `tape.analysis.calc_stetson_J` on our dataset.

In [21]:
# using tape analysis functions
from tape.analysis import calc_stetson_J

res = ens.batch(calc_stetson_J, compute=True)  # compute is set to true to execute immediately (non-lazily)
res

Temporary columns dropped from Object Table: ['nobs_total']
Using generated label, result_1, for a batch result.


id
0    {'g': -0.8833723170736909, 'r': -0.81291313232...
1    {'g': -0.7866661902102343, 'r': -0.79927945599...
2    {'g': -0.8650811883274131, 'r': -0.87939085289...
3    {'g': -0.9140015912865537, 'r': -0.90284371456...
4    {'g': -0.8232578922439672, 'r': -0.81922455220...
5    {'g': -0.668795976899231, 'r': -0.784477243304...
6    {'g': -0.8115552290707235, 'r': -0.90666227394...
7    {'g': -0.6217573153267577, 'r': -0.60999974938...
8    {'g': -0.7001359525394822, 'r': -0.73620435205...
9    {'g': -0.7266040976469818, 'r': -0.68878460237...
Name: stetsonJ, dtype: object
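
The result above is a series holding one dictionary of per-band values per object. If one column per band is more convenient, a series of dictionaries can be expanded with plain Pandas. The sketch below uses a small made-up series with rounded values as a stand-in for the batch result.

```python
import pandas as pd

# Toy stand-in for a batch result: one dictionary of per-band values per object
res = pd.Series(
    [{"g": -0.88, "r": -0.81}, {"g": -0.79, "r": -0.80}],
    index=pd.Index([0, 1], name="id"),
    name="stetsonJ",
)

# Expand each dictionary into a row, with one column per band
per_band = pd.DataFrame(res.tolist(), index=res.index)
print(per_band)
```

The expanded frame keeps the `id` index, so it can be joined back onto other per-object tables.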

## Storing and Accessing Result Frames

Note for the above `batch` operation, we also printed:

`Using generated label, result_1, for a batch result.`

In addition to the source and object frames, the `Ensemble` may track other frames as well, accessed by either generated or user-provided labels.

We can access a saved frame with `Ensemble.select_frame(label)`.

In [22]:
ens.select_frame("result_1").compute()

id
0    {'g': -0.8833723170736909, 'r': -0.81291313232...
1    {'g': -0.7866661902102343, 'r': -0.79927945599...
2    {'g': -0.8650811883274131, 'r': -0.87939085289...
3    {'g': -0.9140015912865537, 'r': -0.90284371456...
4    {'g': -0.8232578922439672, 'r': -0.81922455220...
5    {'g': -0.668795976899231, 'r': -0.784477243304...
6    {'g': -0.8115552290707235, 'r': -0.90666227394...
7    {'g': -0.6217573153267577, 'r': -0.60999974938...
8    {'g': -0.7001359525394822, 'r': -0.73620435205...
9    {'g': -0.7266040976469818, 'r': -0.68878460237...
Name: stetsonJ, dtype: object

`Ensemble.batch` has an optional `label` argument that will store the result with a user-provided label.

Likewise we can rename a frame with its `label` field, and call `update_ensemble` to have it tracked under the new label.

In [24]:
res = ens.batch(calc_stetson_J, compute=True, label="stetson_j")

ens.select_frame("stetson_j").compute()

id
0    {'g': -0.8833723170736909, 'r': -0.81291313232...
1    {'g': -0.7866661902102343, 'r': -0.79927945599...
2    {'g': -0.8650811883274131, 'r': -0.87939085289...
3    {'g': -0.9140015912865537, 'r': -0.90284371456...
4    {'g': -0.8232578922439672, 'r': -0.81922455220...
5    {'g': -0.668795976899231, 'r': -0.784477243304...
6    {'g': -0.8115552290707235, 'r': -0.90666227394...
7    {'g': -0.6217573153267577, 'r': -0.60999974938...
8    {'g': -0.7001359525394822, 'r': -0.73620435205...
9    {'g': -0.7266040976469818, 'r': -0.68878460237...
Name: stetsonJ, dtype: object

We can also add our own frames with `Ensemble.add_frame(frame, label)`.

In [28]:
ens.add_frame(res.copy(), "new_res")
ens.select_frame("new_res")

ValueError: Unable to add frame: a frame with label 'new_res'is in the Ensemble.

Finally we can also drop frames we are no longer interested in having the `Ensemble` track.

In [None]:
ens.drop_frame("result_1")

ens.select_frame("result_1") # This should result in a KeyError since the frame has been dropped.

## Using light-curve package features

`Ensemble.batch` also supports feature extractors from the [light-curve](https://pypi.org/project/light-curve/) package:

In [None]:
import light_curve as licu

extractor = licu.Extractor(licu.Amplitude(), licu.AndersonDarlingNormal(), licu.StetsonK())
res = ens.batch(extractor, compute=True, band_to_calc="g")
res

## Using a Custom Analysis Function
The analysis functions contained in TAPE are meant to provide a collection of performant, on-hand routines for common timeseries use cases. However, TAPE is also equipped to handle externally defined functions. Let's walk through a short example of defining a simple custom function and applying it through `Ensemble.batch`.

Here we define a simple function that returns an average flux for each photometric band. It requires an array of fluxes and an array of band labels per measurement, and has a keyword argument for determining which averaging strategy to use (mean or median).

In [None]:
import numpy as np


# Defining a simple function
def my_flux_average(flux_array, band_array, method="mean"):
    """Read in an array of fluxes, and return the average of the fluxes by band"""
    res = {}
    for band in np.unique(band_array):
        mask = band_array == band  # Boolean mask selecting this band's measurements
        band_fluxes = flux_array[mask]  # Apply the mask to the flux array
        if method == "mean":
            res[band] = np.mean(band_fluxes)
        elif method == "median":
            res[band] = np.median(band_fluxes)
    return res
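
Before handing a custom function to the `Ensemble`, it can help to check it on a few hand-made NumPy arrays. The sketch below repeats the function (same logic as above) so that it runs on its own, and uses made-up fluxes for a single lightcurve.

```python
import numpy as np


def my_flux_average(flux_array, band_array, method="mean"):
    """Return the average flux per band (same logic as the function above)."""
    res = {}
    for band in np.unique(band_array):
        band_fluxes = flux_array[band_array == band]
        if method == "mean":
            res[band] = np.mean(band_fluxes)
        elif method == "median":
            res[band] = np.median(band_fluxes)
    return res


# One toy lightcurve: two "g" measurements and three "r" measurements
flux = np.array([1.0, 3.0, 10.0, 20.0, 60.0])
band = np.array(["g", "g", "r", "r", "r"])

mean_res = my_flux_average(flux, band)                     # g: 2.0, r: 30.0
median_res = my_flux_average(flux, band, method="median")  # g: 2.0, r: 20.0
print(mean_res, median_res)
```

The mean and median differ for the "r" band because of the outlying value of 60.0, which makes the effect of the `method` keyword easy to see.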

With the function defined, we next supply it to `Ensemble.batch`. The column labels of the `Ensemble` columns we want to use as arguments must be provided, as well as any keyword arguments. In this case, we pass along `"flux"` and `"band"`, so that the `Ensemble` will map those columns to `flux_array` and `band_array` respectively. We also pass `method="median"`, which will pass that kwarg along to `my_flux_average`.

In [None]:
# Applying the function to the ensemble
res = ens.batch(my_flux_average, "flux", "band", compute=True, meta=None, method="median")
res

In [None]:
ens.frames.keys()

We see that we now have a Pandas `Series` of `my_flux_average` results by object_id (lightcurve). In many cases, this may not be the ideal output for your function. This output is controlled by the `Dask` `meta` parameter. For more information on this parameter, you can read the `Dask` [documentation](https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument). You may pass the `meta` parameter through `Ensemble.batch`, as shown above.

In [None]:
ens.client.close()  # Tear down the ensemble client