In [None]:
import light_curve as licu
import dask.dataframe as dd
from tape import Ensemble, ColumnMapper
import numpy as np
from pathlib import Path

# LINCC Demo-Day: PLAsTiCC Eclipsing Binary Exploration with TAPE

This notebook demonstrates a TAPE analysis of the PLAsTiCC dataset (converted from CSV files to parquet files). The workflow was created by Kostya, who was interested in exploring eclipsing binaries within the dataset.

Dataset Details:
* Total Size: ~10 GB
* Number of Sources: 453,653,104
* Number of Objects: 3,492,890

## Setup and Loading

Begin by initializing an Ensemble. We can also grab the Dask Dashboard link to inspect the work of the Dask cluster as we run through the cells.

In [None]:
# Set some paths and variables
DATA_DIR = "/Users/dbranton/lincc/timeseries/data/plasticc/parquet" # You'll need to grab this data yourself
N_PROCESSORS = 4

# Initialize an Ensemble
ens = Ensemble(n_workers=N_PROCESSORS)
ens.client_info()

In [None]:
# Loading PLAsTiCC into the Ensemble

# ColumnMapper establishes which table columns map to timeseries quantities
colmap = ColumnMapper(
    id_col='object_id',
    time_col='mjd',
    flux_col='flux',
    err_col='flux_err',
    band_col='passband',
)

# We can read from parquet
ens.from_parquet(
    source_file=DATA_DIR+"/source/*.parquet",
    object_file=DATA_DIR+"/object/*.parquet",
    column_mapper=colmap,
    sync_tables=False, # Avoid doing an initial sync
    sorted=True, # If the input data is already sorted by the chosen index
    sort=False,
)

We've loaded the data with the `sorted` flag set to `True`, which populates divisions for the Ensemble dataframes. Below, we see the divisions populated (the numbers along the index) even though the data itself is still represented lazily.

_**Divisions**: Given a sorted index, the boundary values for each partition that indicate which index slices live in which partition. Used to search for data only in a single partition, rather than needing to search all partitions._
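
To make the idea concrete, here is a minimal pure-Python sketch of how division boundaries narrow a lookup to a single partition. It assumes the Dask convention that there is one more boundary than there are partitions and that the last partition includes its upper boundary. The `partition_for` helper and the boundary values are hypothetical and are not part of TAPE or Dask.

```python
from bisect import bisect_right


def partition_for(divisions, key):
    """Return the index of the partition that would hold `key`.

    Partition i covers [divisions[i], divisions[i + 1]); the last
    partition also includes its upper boundary.
    """
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(f"{key} is outside the indexed range")
    n_partitions = len(divisions) - 1
    # bisect_right returns the position of the first boundary greater than key
    return min(bisect_right(divisions, key) - 1, n_partitions - 1)


# Hypothetical object_id boundaries describing three partitions
divisions = (13, 1000, 5000, 9999)

print(partition_for(divisions, 13))    # 0
print(partition_for(divisions, 1000))  # 1
print(partition_for(divisions, 9999))  # 2
```

Because the boundaries are sorted, the lookup is a binary search over a handful of values, and only the matching partition has to be read.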

In [None]:
ens.source

In [None]:
ens.object

## Analysis

First, let's select only Galactic objects by cutting on `hostgal_photoz`. We use `query` to cut on a column of the object table; the cut propagates to the source table the next time that table is used.

In [None]:
ens = ens.query("hostgal_photoz < 1e-3", table="object")

Second, let's select persistent sources, by cutting on the duration of the light curve. In this case, we use the `batch` interface to apply a custom function to each light curve.

In [None]:
duration = ens.batch(
    lambda time, detected: np.ptp(time[np.asarray(detected, dtype=bool)]),
    ens._time_col, 'detected_bool',
    use_map=True,
)

# batch returns a lazy Dask result; we assign its 'result' column to the Object Table
ens = ens.assign(table="object", duration=duration['result'])
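
As a small standalone illustration of what the per-light-curve function computes, the sketch below applies the same `np.ptp` logic to one made-up light curve. The `detected_duration` name and the toy `mjd` and `detected` values are assumptions for illustration only.

```python
import numpy as np


def detected_duration(time, detected):
    """Span between the first and last detected observation, in days."""
    mask = np.asarray(detected, dtype=bool)
    # np.ptp is "peak to peak": max minus min of the selected times
    return np.ptp(np.asarray(time)[mask])


# Toy light curve: five epochs, only the middle three flagged as detections
mjd = np.array([59800.0, 59810.0, 59900.0, 60200.0, 60300.0])
detected = np.array([0, 1, 1, 1, 0])

print(detected_duration(mjd, detected))  # 390.0
```

Note that `np.ptp` raises a `ValueError` on an empty selection, so a light curve with no detections would need separate handling in a standalone setting like this one.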

In [None]:
ens.head("object", 5)

Now we can use our new duration column to further filter the dataset. Once again, we use `query`.

In [None]:
ens = ens.query("duration > 366", table="object")

Next, we use Otsu's method to split light curves into two groups:
* one with high flux
* one with low flux

For eclipsing binaries, the lower-flux group should be smaller than the higher-flux group but show larger variability. We use the light-curve package (https://github.com/light-curve/light-curve-python) to extract these features. For simplicity, we only calculate them for the i band (passband 3).

In [None]:
# Once again using batch to apply a custom function
otsu_features = ens.batch(licu.OtsuSplit(), band_to_calc=3, use_map=True)

# otsu_features is a dataframe with multiple columns, which we can assign to the Object Table
ens = ens.assign(
    table="object",
    otsu_lower_to_all_ratio=otsu_features['otsu_lower_to_all_ratio'],
    otsu_std_lower=otsu_features['otsu_std_lower'],
    otsu_std_upper=otsu_features['otsu_std_upper'],
)
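
For intuition about what these features measure, here is a rough NumPy sketch of Otsu's criterion: sort the values, then pick the split that maximizes the between-class variance. It is not the light-curve implementation and may differ from it in conventions (for example, it uses the population standard deviation). The `otsu_split` function, its dictionary keys, and the toy flux values are all hypothetical.

```python
import numpy as np


def otsu_split(values):
    """Split values into lower/upper groups using Otsu's criterion."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    if n < 2:
        raise ValueError("need at least two values to split")
    # Candidate split k puts the k smallest values in the lower group
    k = np.arange(1, n)
    csum = np.cumsum(x)
    mean_lower = csum[:-1] / k
    mean_upper = (csum[-1] - csum[:-1]) / (n - k)
    # Between-class variance, up to a constant factor of 1 / n**2
    between = k * (n - k) * (mean_upper - mean_lower) ** 2
    best = int(np.argmax(between)) + 1  # size of the lower group
    lower, upper = x[:best], x[best:]
    return {
        "lower_to_all_ratio": lower.size / n,
        "std_lower": float(np.std(lower)),
        "std_upper": float(np.std(upper)),
    }


# Toy light curve: 28 points near flux 100 plus two deep "eclipse" points
flux = np.array([99.0, 101.0] * 14 + [55.0, 75.0])
features = otsu_split(flux)

is_candidate = (
    features["lower_to_all_ratio"] < 0.1
    and features["std_lower"] > features["std_upper"]
)
print(is_candidate)  # True
```

In this toy case the two dip points form a small lower group with a large spread, which is the signature the cut on these columns looks for.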

Now we can query by these columns to filter down to our objects of interest.

In [None]:
ens = ens.query(
    "otsu_lower_to_all_ratio < 0.1 and otsu_std_lower > otsu_std_upper",
    table="object",
)

Thus far, almost everything has run lazily. We can kick off the analysis by bringing the resulting object table into memory.

In [None]:
df = ens.compute("object")
df