In [None]:
rel_path = "../../tests/tape_tests/data/small_sky_hipscat"

# Using TAPE with LSDB and HiPSCat Data

The [Hierarchical Progressive Survey Catalog (HiPSCat)](https://hipscat.readthedocs.io/en/latest/) format partitions objects on a sphere. It is designed for storing data from large astronomy surveys. Its main feature is that partitions are sized adaptively, based on the number of objects in a given region of the sky, using [HEALPix](https://healpix.jpl.nasa.gov/).
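
To make the idea of adaptive sizing concrete, here is a minimal sketch. It is not HiPSCat's actual implementation. It assumes each object has already been assigned a HEALPix pixel number in the NESTED scheme at some fine order, where pixel `p` at order `k` has the children `4p` through `4p + 3` at order `k + 1`. The function name `adaptive_partitions`, the `threshold` parameter, and the toy pixel numbers are all hypothetical.

```python
def adaptive_partitions(fine_pixels, fine_order, threshold):
    """Group objects into (order, pixel) partitions of at most `threshold` objects.

    `fine_pixels` holds NESTED-scheme HEALPix pixel numbers at `fine_order`.
    We start from the base pixels (order 0) and split any partition that is
    too full into its children, stopping once `fine_order` is reached.
    """
    # Bucket the objects by the order-0 pixel that contains them.
    by_base = {}
    for pix in fine_pixels:
        by_base.setdefault(pix >> (2 * fine_order), []).append(pix)

    # Work list of (order, pixel, member pixels) still to be examined.
    stack = [(0, base, members) for base, members in by_base.items()]
    partitions = {}
    while stack:
        order, pixel, members = stack.pop()
        if len(members) <= threshold or order == fine_order:
            partitions[(order, pixel)] = len(members)
            continue
        # Too many objects: redistribute them among the child pixels.
        children = {}
        for pix in members:
            child = pix >> (2 * (fine_order - order - 1))
            children.setdefault(child, []).append(pix)
        for child, child_members in children.items():
            stack.append((order + 1, child, child_members))
    return partitions


# Toy input (hypothetical): eight objects, most of them crowded together.
toy_pixels = [0, 0, 1, 2, 3, 5, 9, 160]
partitions = adaptive_partitions(toy_pixels, fine_order=2, threshold=3)
```

In this toy input the crowded region ends up split into small, high-order pixels, while the lone object at pixel 160 stays in a single large order-0 partition.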

The [Large Survey Database (LSDB)](https://lsdb.readthedocs.io/en/latest/) is a framework for spatial analysis of extremely large astronomical databases (e.g. querying and crossmatching O(1B) sources). It uses Dask to parallelize operations across multiple HiPSCat-partitioned surveys.

Both HiPSCat and LSDB are strong tools in the arsenal of a TAPE user. HiPSCat provides a scalable data format for working at the scale of LSST, while LSDB provides tooling to prepare more complex datasets for TAPE analysis, including operations such as cross-matching multiple surveys and cone searches that select data from specific regions of the sky. In this notebook, we'll walk through how both can be used with TAPE.

## Loading from HiPSCat data

TAPE offers a built-in HiPSCat loader function, which can be used to quickly load a dataset stored in the HiPSCat format. We'll use a small dummy dataset for this example. Before loading, let's peek at the data we'll be working with.

In [None]:
import pyarrow.parquet as pq
import os

object_path = os.path.join(rel_path, "small_sky_object_catalog")
source_path = os.path.join(rel_path, "small_sky_source_catalog")

# Object Schema
pq.read_metadata(os.path.join(object_path, "_common_metadata")).schema

In [None]:
# Source Schema
pq.read_metadata(os.path.join(source_path, "_common_metadata")).schema

The schema indicates which fields are available in each catalog. Notice the `_hipscat_index` in both: this is a specially constructed index that the data is sorted on, and it enables efficient use of the HiPSCat format. We recommend using it as the ID column in TAPE when loading from HiPSCat object and source catalogs. With this established, let's load the data into TAPE.
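
To illustrate why sorting on the index helps, the sketch below uses a plain sorted list as a stand-in for a sorted catalog index. It is not how HiPSCat or TAPE perform lookups internally, and the toy index values and the helper name `rows_for_id` are hypothetical. Because the index is sorted, all rows that share an ID sit in one contiguous block, which a binary search can locate without scanning every row.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for a catalog sorted on its index (all values are made up).
sorted_index = [10, 10, 10, 25, 25, 40, 40, 40, 40]
payload = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]


def rows_for_id(index_values, rows, target):
    """Return every row whose index equals `target`, found by binary search."""
    start = bisect_left(index_values, target)
    stop = bisect_right(index_values, target)
    return rows[start:stop]


matches = rows_for_id(sorted_index, payload, 25)  # rows stored under index 25
missing = rows_for_id(sorted_index, payload, 30)  # no rows carry index 30
```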

In [None]:
from tape import Ensemble, ColumnMapper

ens = Ensemble(client=False)

# Setup a ColumnMapper
colmap = ColumnMapper(
    id_col="_hipscat_index",  # using _hipscat_index is recommended
    time_col="mjd",  # pulling these from the source schema list above
    flux_col="mag",
    err_col="Norder",  # we don't have an error column, so an arbitrary column stands in for this toy example
    band_col="band",
)

ens.from_hipscat(source_path, object_path, column_mapper=colmap, object_index="id", source_index="object_id")

ens.object.head(5)

In the `from_hipscat` call, we also needed to specify `object_index` and `source_index`. These name a column in each table that holds the same object-level identifier. They are used to join object and source, and to convert the source `_hipscat_index` (which is unique per source) to the object `_hipscat_index` (unique per object). From here, the `_hipscat_index` serves as an object ID that ties sources together for TAPE operations.
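
The reindexing step can be pictured with a toy example. The sketch below uses plain Python dictionaries rather than TAPE's own code, and every value in it is made up. Only the column names `id`, `object_id`, `mjd`, and `_hipscat_index` come from the catalogs above.

```python
# Toy object table: one row per object, each with its own _hipscat_index.
objects = [
    {"_hipscat_index": 1000, "id": 7},
    {"_hipscat_index": 2000, "id": 8},
]

# Toy source table: one row per observation, each with a unique _hipscat_index.
sources = [
    {"_hipscat_index": 1001, "object_id": 7, "mjd": 59000.1},
    {"_hipscat_index": 1002, "object_id": 7, "mjd": 59001.2},
    {"_hipscat_index": 2003, "object_id": 8, "mjd": 59000.3},
]

# Map the shared object-level identifier to the object's _hipscat_index.
object_index_by_id = {row["id"]: row["_hipscat_index"] for row in objects}

# Replace each source's own index with the index of the object it belongs to.
reindexed_sources = [
    {**row, "_hipscat_index": object_index_by_id[row["object_id"]]}
    for row in sources
]

# Sources of the same object now share a single index value.
indices = [row["_hipscat_index"] for row in reindexed_sources]
```

After this step, grouping the source rows by `_hipscat_index` collects all observations of one object together.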

In [None]:
# We're now free to work with our TAPE Ensemble as normal
import matplotlib.pyplot as plt

ts = ens.to_timeseries(12751184493818150912)  # select a lightcurve using the _hipscat_index

# Let's plot this, though it's toy data so it won't look like anything...
plt.plot(ts.data["mjd"], ts.data["mag"], ".")
plt.title(ts.meta["id"])

## Loading from LSDB Catalogs

`Ensemble.from_hipscat` ingests HiPSCat data directly into TAPE. In many cases, you may prefer to run a few operations on your HiPSCat data with LSDB first. Let's walk through how this would look.

In [None]:
# Loading into LSDB
import lsdb

# Load the dataset into LSDB catalog objects
object_cat = lsdb.read_hipscat(object_path)
source_cat = lsdb.read_hipscat(source_path)

We've now loaded our catalogs into LSDB catalog objects. From here, we can do LSDB operations on the catalogs. For example, let's perform a cone search to narrow down our list of objects.
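
Conceptually, a cone search keeps the objects whose angular separation from a chosen centre is within a given radius. The following is a minimal sketch of that test using the haversine formula on toy coordinates. It is not LSDB's implementation, and the helper names `angular_separation_arcsec` and `in_cone` are hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds between two positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for small separations.
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding error near antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """True if (ra, dec) lies within `radius_arcsec` of the cone centre."""
    return angular_separation_arcsec(ra, dec, center_ra, center_dec) <= radius_arcsec


# Toy (ra, dec) positions in degrees (made up for illustration).
positions = [(315.0, -69.5), (315.0, -68.5), (100.0, 20.0)]
selected = [p for p in positions if in_cone(*p, 315.0, -69.5, 100000.0)]
```

A radius of 100000 arcseconds is roughly 27.8 degrees, so the first two toy positions fall inside the cone and the third does not.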

In [None]:
object_cat_cone = object_cat.cone_search(
    ra=315.0,
    dec=-69.5,
    radius_arcsec=100000.0,
)

# `_ddf` is the catalog's underlying Dask DataFrame; calling len() on it triggers a computation
print(f"Original Number of Objects: {len(object_cat._ddf)}")
print(f"New Number of Objects: {len(object_cat_cone._ddf)}")

With our cone search performed, we can now move into TAPE. We'll first need to create a new source catalog, `joined_source_cat`, which incorporates the result of the cone search and also reindexes onto the object `_hipscat_index`.

In [None]:
# We do this to get the source catalog indexed by the object's _hipscat_index
joined_source_cat = object_cat_cone.join(
    source_cat, left_on="id", right_on="object_id", suffixes=("_object", "")
)

colmap = ColumnMapper(
    id_col="_hipscat_index",
    time_col="mjd",
    flux_col="mag",
    err_col="Norder",  # no error column, so an arbitrary column stands in again
    band_col="band",
)

ens = Ensemble(client=False)

# We just pass in the catalog objects
ens.from_lsdb(joined_source_cat, object_cat_cone, column_mapper=colmap)

ens.object.compute()

And from here, we're once again able to work with our TAPE Ensemble as normal.