# Working with the lsstseries `Ensemble` object

When working with many lightcurves, the lsstseries `Ensemble` object serves as a single interface for storing, filtering, and analyzing timeseries data.
Let's consider an example set of lightcurves:

In [1]:
from lsstseries.ensemble import Ensemble

ens = Ensemble()  # initialize an ensemble object

# Read in data from a parquet file
ens.from_parquet("../../tests/lsstseries_tests/data/test_subset.parquet",
                id_col='ps1_objid',
                time_col='midPointTai',
                flux_col='psFlux',
                err_col='psFluxErr',
                band_col='filterName')

<lsstseries.ensemble.Ensemble at 0x10619f640>

We now have an `Ensemble` object and have provided it with data from a parquet file. In the call to `Ensemble.from_parquet`, we specified which columns of the input file map to the timeseries quantities that the `Ensemble` needs to understand. It's important to link these arguments correctly, because the `Ensemble` uses these columns whenever an operation needs one of those quantities. For example, if an lsstseries analysis function requires the time column, the `Ensemble` will use this mapping to supply that function with the 'midPointTai' column automatically.
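
To make the idea of column linking concrete, here is a minimal pure-Python sketch. It is only an analogy: `ColumnMap` and `mean_time_gap` are hypothetical names invented for this illustration and are not part of lsstseries, and the rows are made-up values rather than data from the test file.

```python
# Toy illustration only: ColumnMap and mean_time_gap are hypothetical names,
# not part of lsstseries.
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMap:
    """Records which input column holds each timeseries quantity."""
    id_col: str
    time_col: str
    flux_col: str
    err_col: str
    band_col: str


def mean_time_gap(rows, colmap):
    """Average spacing between consecutive observations, read from the mapped time column."""
    times = sorted(row[colmap.time_col] for row in rows)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


colmap = ColumnMap(
    id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# Made-up rows for illustration (not values from the test file)
rows = [
    {"ps1_objid": 1, "midPointTai": 100.0, "psFlux": 18.2, "psFluxErr": 0.03, "filterName": "r"},
    {"ps1_objid": 1, "midPointTai": 103.0, "psFlux": 18.3, "psFluxErr": 0.03, "filterName": "r"},
    {"ps1_objid": 1, "midPointTai": 107.0, "psFlux": 18.1, "psFluxErr": 0.03, "filterName": "r"},
]

# The function never hard-codes 'midPointTai'; it asks the mapping for the time column
gap = mean_time_gap(rows, colmap)
```

The analysis function only knows that it needs "the time column"; the mapping decides which input column that is, which is why a wrong mapping would silently feed the wrong quantity to every downstream operation.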

## Dask and "Lazy Evaluation"

Before going any further, note that the `Ensemble` is built on top of `Dask`, which brings with it a powerful framework for parallelization and scalability. However, `Dask` code works somewhat differently from ordinary Python code, and if you're unfamiliar with it, those differences are worth establishing right away. The first is that `Dask` evaluates code "lazily": many operations are not executed when the line of code is run, but are instead added to a task graph to be executed when the result is actually needed. See below for an example:

In [2]:
ens._data  # We have not actually loaded any data into memory

| | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| npartitions=1 | | | | |
| | float32 | float32 | float32 | object |
| | ... | ... | ... | ... |


Here we are accessing the underlying Dask DataFrame. Despite running a command to read in our data, we see only an empty dataframe with some high-level information (column names, dtypes, and the number of partitions). To explicitly bring the data into memory, we must run a `compute()` command.

In [3]:
ens.compute() # Compute lets dask know we're ready to bring the data into memory

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.149910 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |
| ... | ... | ... | ... | ... |
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.037670 | r |
| 88480001353815785 | 59134.300781 | 18.168980 | 0.037996 | r |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r |


With this compute, we now have a populated dataframe (a Pandas dataframe, in fact!). As a result, many workflows in Dask, and by extension lsstseries, look like a series of lazily evaluated commands that are chained together and then executed with a `.compute()` call at the end of the workflow.
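
The chain-then-compute pattern can be illustrated with a small pure-Python analogy. This is a toy sketch, not how Dask is implemented: `LazyPipeline` is a hypothetical class invented here, and the flux values and cut are arbitrary. It only shows that queued steps do no work until `compute()` is called.

```python
# Toy analogy only: LazyPipeline is a hypothetical class, not Dask's implementation.
class LazyPipeline:
    """Queues operations and runs them only when compute() is called."""

    def __init__(self, data):
        self._data = list(data)
        self._steps = []  # pending operations, in the order they were requested

    def filter(self, predicate):
        self._steps.append(lambda rows: [r for r in rows if predicate(r)])
        return self  # returning self allows calls to be chained

    def map(self, func):
        self._steps.append(lambda rows: [func(r) for r in rows])
        return self

    def compute(self):
        rows = self._data
        for step in self._steps:
            rows = step(rows)
        return rows


calls = []  # records each time the user-supplied predicate actually runs


def below_cut(flux):
    calls.append(flux)
    return flux < 18.2


# Chain two operations; nothing is evaluated yet
pipeline = LazyPipeline([18.15, 18.27, 18.24, 18.20]).filter(below_cut).map(lambda f: f * 2)
calls_before_compute = len(calls)

# Only now do the queued steps run
result = pipeline.compute()
```

Before `compute()` the predicate has not been called at all; afterwards it has been called once per input value. Dask applies the same principle, with the added ability to optimize and parallelize the queued work.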

## Inspection and Filtering

The `Ensemble` contains an assortment of functions for inspecting and filtering your data.

### Inspection

In [4]:
# Inspection

ens.info() # Grabs some high level information about the dataframe


<class 'dask.dataframe.core.DataFrame'>
Columns: 4 entries, midPointTai to filterName
dtypes: object(1), float32(3)

In [5]:
ens.head(5) # Grabs the first 5 rows

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.14991 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |


In [6]:

ens.tail(5) # Grabs the last 5 rows

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName |
|---|---|---|---|---|
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.03767 | r |
| 88480001353815785 | 59134.300781 | 18.16898 | 0.037996 | r |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r |
| 88480001353815785 | 59138.277344 | 18.115007 | 0.036673 | r |


### Filtering

In [7]:
# TODO: GENERAL FILTER FUNCTION

# Cleaning NaNs
ens.dropna(threshold=1)  # threshold is the number of NaNs a row must contain to be dropped

# Filtering on number of observations
ens.prune(threshold=50)  # threshold is the number of observations an object needs to be retained

ens.compute()

| ps1_objid | midPointTai | psFlux | psFluxErr | filterName | num_obs |
|---|---|---|---|---|---|
| 88472935274829959 | 58246.460938 | 18.149910 | 0.026191 | r | 499 |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r | 499 |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r | 499 |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r | 499 |
| 88472935274829959 | 58262.378906 | 18.211143 | 0.027165 | r | 499 |
| ... | ... | ... | ... | ... | ... |
| 88480001353815785 | 59128.257812 | 18.197432 | 0.038719 | r | 146 |
| 88480001353815785 | 59130.257812 | 18.155918 | 0.037670 | r | 146 |
| 88480001353815785 | 59134.300781 | 18.168980 | 0.037996 | r | 146 |
| 88480001353815785 | 59136.300781 | 18.125595 | 0.036927 | r | 146 |
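
To show what the two thresholds mean, here is a pure-Python sketch of the same two filtering steps on a handful of made-up rows. The helpers `drop_nan_rows` and `prune_sparse` are hypothetical and are not the lsstseries implementation. Treating both thresholds as inclusive ("at least this many") is an assumption based on the comments in the cell above, as is counting observations after the NaN rows have been removed.

```python
# Toy sketch only: drop_nan_rows and prune_sparse are hypothetical helpers,
# not the lsstseries implementation.
import math
from collections import Counter


def drop_nan_rows(rows, threshold=1):
    """Drop rows containing at least `threshold` NaN values (inclusive: an assumption)."""
    def nan_count(row):
        return sum(1 for v in row.values() if isinstance(v, float) and math.isnan(v))
    return [row for row in rows if nan_count(row) < threshold]


def prune_sparse(rows, id_col, threshold=50):
    """Keep objects with at least `threshold` observations and tag each row with num_obs."""
    counts = Counter(row[id_col] for row in rows)
    return [
        {**row, "num_obs": counts[row[id_col]]}
        for row in rows
        if counts[row[id_col]] >= threshold
    ]


# Made-up rows: object 1 has three observations (one with a NaN flux), object 2 has one
rows = [
    {"ps1_objid": 1, "midPointTai": 100.0, "psFlux": 18.2},
    {"ps1_objid": 1, "midPointTai": 101.0, "psFlux": float("nan")},
    {"ps1_objid": 1, "midPointTai": 102.0, "psFlux": 18.3},
    {"ps1_objid": 2, "midPointTai": 100.0, "psFlux": 19.0},
]

cleaned = drop_nan_rows(rows, threshold=1)                     # removes the NaN row
kept = prune_sparse(cleaned, id_col="ps1_objid", threshold=2)  # removes object 2
```

As in the output above, the rows that survive carry a `num_obs` value giving the number of observations for their object.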


In [8]:
ens.__del__()

<lsstseries.ensemble.Ensemble at 0x10619f640>