In [None]:
rel_path = "../../tests/tape_tests/data"

# Data Access with TAPE

This tutorial demonstrates the various ways in which data can be ingested into TAPE.

## Loading from Parquet Files

The [Apache Parquet](https://parquet.apache.org/docs/overview/) format is an efficient column-oriented data file format well-suited for bulk datasets. TAPE provides functionality to create an `Ensemble` object from input parquet files via the `Ensemble.from_parquet()` function. At minimum, a parquet file containing time series information needed to populate the `Ensemble` source table should be supplied, as shown below. The `Ensemble` object table is created dynamically from the source table in this instance.

In [None]:
from tape.ensemble import Ensemble
from tape.utils import ColumnMapper

ens = Ensemble()  # initialize an ensemble object

# A ColumnMapper is created to map columns of the parquet file to timeseries quantities, such as flux, time, etc.
col_map = ColumnMapper(id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(source_file=f"{rel_path}/source/test_source.parquet",
                 column_mapper=col_map
                )

ens.source.head(5) # View the first 5 entries of the source table

In [None]:
ens.client.close()  # Tear down the ensemble client

Alternatively, if object level information is available in a second parquet file, that may also be provided to populate the `Ensemble` object table, as follows:

In [None]:
ens = Ensemble()  # initialize an ensemble object

col_map = ColumnMapper(id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(source_file=f"{rel_path}/source/test_source.parquet",
                 object_file=f"{rel_path}/object/test_object.parquet",
                 column_mapper=col_map
                )

ens.object.head(5) # View the first 5 entries of the object table

In [None]:
ens.client.close()  # Tear down the ensemble client

In the above examples, we use the `ColumnMapper` helper class to map parquet file columns to a set of internally recognized quantities, such as flux, time, ids, errors, etc. TAPE uses these quantities to infer the correct columns when applying certain filtering operations, or when using `tape.analysis` functions. You may not know which columns are actually present in a given parquet file before attempting to ingest it into TAPE. In these instances, we recommend using the pyarrow package to preview the file's metadata, as shown below.

In [None]:
from pyarrow import parquet
parquet.read_schema(f"{rel_path}/source/test_source.parquet", memory_map=True)

Apache Parquet files have many advantages for the type of scalable workflows that TAPE seeks to enable. A key advantage is that the Parquet format supports in-format partitioning of large datasets. TAPE, by virtue of using `Dask`, inherits these partitions on load, avoiding any need to manually set a partitioning scheme for your data.

## TAPE Datasets

TAPE hosts a number of datasets that are retrievable by the user. These datasets have been added to demonstrate and test the various scientific workflows that TAPE has been developed to support. The `Ensemble.available_datasets()` function may be used to see which datasets are available to retrieve.

In [None]:
ens = Ensemble()

ens.available_datasets()

The function returns a dictionary of datasets, each with a brief description of its contents. To retrieve one, call the `Ensemble.from_dataset()` function with the dictionary key corresponding to that dataset. Column mapping information is automatically generated for these known datasets.

In [None]:
ens.from_dataset("s82_rrlyrae")  # Let's grab the Stripe 82 RR Lyrae

ens.object.head(5)

In [None]:
ens.client.close()  # Tear down the ensemble client

## Loading from Array Data

If your data is stored in arrays, `Ensemble.from_source_dict()` offers an interface to load these into an `Ensemble` object using a dictionary.

Let's start by creating some example data arrays in a dictionary:

In [None]:
import numpy as np
np.random.seed(1)

# initialize a dictionary of empty arrays
source_dict = {"id": np.array([]),
               "time": np.array([]),
               "flux": np.array([]),
               "error": np.array([]),
               "band": np.array([])}

# Create 10 lightcurves with 100 measurements each
lc_len = 100
for i in range(10):
    source_dict["id"] = np.append(source_dict["id"], np.array([i]*lc_len)).astype(int)
    source_dict["time"] = np.append(source_dict["time"], np.linspace(1, lc_len, lc_len))
    source_dict["flux"] = np.append(source_dict["flux"], 100 + 50 * np.random.rand(lc_len))
    source_dict["error"] = np.append(source_dict["error"], 10 + 5 * np.random.rand(lc_len))
    source_dict["band"] = np.append(source_dict["band"], ["g"]*50+["r"]*50)
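
The loop above grows each array with `np.append`, which copies the array on every call. For larger inputs, the same layout can be built in one step. The following is a sketch of one such vectorized construction, not the approach used elsewhere in this tutorial. It uses a seeded `np.random.default_rng` generator, so the random values differ from those produced by the loop, and `vectorized_dict` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator; values differ from the loop version

n_lc = 10     # number of lightcurves
lc_len = 100  # measurements per lightcurve
n_total = n_lc * lc_len

vectorized_dict = {
    # each id repeated lc_len times: 0, 0, ..., 1, 1, ...
    "id": np.repeat(np.arange(n_lc), lc_len),
    # the same time grid tiled once per lightcurve
    "time": np.tile(np.linspace(1, lc_len, lc_len), n_lc),
    # uniform random fluxes in [100, 150) and errors in [10, 15)
    "flux": 100 + 50 * rng.random(n_total),
    "error": 10 + 5 * rng.random(n_total),
    # first half of each lightcurve in "g", second half in "r"
    "band": np.tile(np.array(["g"] * (lc_len // 2) + ["r"] * (lc_len // 2)), n_lc),
}

# every column should have one entry per measurement
print({key: len(values) for key, values in vectorized_dict.items()})
```

A dictionary built this way has the same keys and array lengths as `source_dict`, so it can be used with the same column mapping.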

From here, we just need to pass the dictionary along to `Ensemble.from_source_dict()` and set the `ColumnMapper` appropriately.

In [None]:
colmap = ColumnMapper(id_col="id",
                      time_col="time",
                      flux_col="flux",
                      err_col="error",
                      band_col="band")
ens = Ensemble()
ens.from_source_dict(source_dict, column_mapper=colmap)

ens.info()



In [None]:
ens.client.close()  # Tear down the ensemble client