## Data Preview 1 mock data collections

This notebook presents two Data Preview 1 (DP1) mock data collections available in the HATS format. It walks through how to load, preview, and work with this data in preparation for the official data release by the Rubin Observatory.

#### How was this data generated?

The dummy data was generated with a simple [Python script](https://github.com/lsst-sitcom/linccf/blob/main/internal/LSSTCam_init/Mock_DP1_generation.ipynb) that randomizes each field according to the minimum and maximum values recorded at the partition level.
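
To make the idea concrete, here is a minimal sketch of drawing random values bounded by per-partition minimum and maximum values. It is not the linked script: the function `mock_column`, the column names, and the bounds in `partition_stats` are all hypothetical and chosen only for illustration.

```python
import random


def mock_column(rng, n_rows, min_value, max_value):
    """Draw n_rows uniform random floats between min_value and max_value."""
    return [rng.uniform(min_value, max_value) for _ in range(n_rows)]


# Hypothetical per-partition statistics: column name -> (min, max).
# Real values would come from the statistics of each partition.
partition_stats = {
    "coord_ra": (52.5, 54.0),
    "coord_dec": (-29.0, -27.5),
}

# A seeded generator keeps the mock data reproducible.
rng = random.Random(42)

# Build one mock partition with 5 rows per column.
mock_partition = {
    name: mock_column(rng, 5, low, high)
    for name, (low, high) in partition_stats.items()
}
```

Values generated this way keep the range of the original fields but carry no scientific meaning, which is why the collections are labelled as dummy data.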

### Table of contents

- Which observatory data products were imported?
- How to visualize the distribution of the data?
- How to visualize the catalog metadata and schema?
- How to load individual files with a parquet reader?
- How to work with the full catalog with LSDB?

In [None]:
%pip install lsdb --quiet

In [None]:
from pathlib import Path
# Base path to the mock DP1 data
base_path = Path("/sdf/data/rubin/shared/lsdb_commissioning/mock_dp1")

#### Which observatory data products were imported?

The Data Preview 1 mock data collections were generated based on [DRP v29_0_0_rc5](https://rubinobs.atlassian.net/browse/DM-49865). They contain **DUMMY** data in the same format and with the same data types as the upcoming Rubin Data Preview 1 HATS catalogs.

The available collections are `dia_object_collection` and `object_collection`. 

Each collection contains a main object catalog with time-domain data and two auxiliary catalogs: a margin cache catalog and an index catalog. The data of interest resides in the main catalogs, *dia_object_lc* and *object_lc*, which contain light curve information.

- `dia_object_lc` contains data obtained from difference imaging. To create this catalog we joined the data for each *dia_object* with the respective detections in *dia_source* and *dia_forced_source*.

- `object_lc` contains data obtained from science imaging. To create this catalog we joined the data for each *object* with the respective detections in *forced_source*. There is no association between *source* and *object*.

Powered by [**nested-pandas**](https://nested-pandas.readthedocs.io/en/stable/), the objects' light curve information can be loaded, previewed and processed within a single data structure.

In [None]:
!tree -L 2 $base_path

#### How to visualize the distribution of the data?

The metadata allows us to visualize the distribution of the data quickly and without any computation. Using the `hats` package we can plot the HEALPix distribution in a Mollweide view, as well as a higher-order Multi-Order Coverage (MOC) map showing where the data lies on the sky.

In [None]:
import hats
object_lc = hats.read_hats(base_path / "object_collection").main_catalog
object_lc.plot_pixels()

In [None]:
object_lc.plot_moc()
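
For intuition about what "order" means in these plots, the sketch below uses the standard HEALPix relation that the sky is divided into `12 * 4**order` equal-area pixels. The helper names `healpix_npix` and `healpix_pixel_area_deg2` are our own and are not part of the `hats` API.

```python
import math

# Area of the full sky in square degrees (4*pi steradians).
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def healpix_npix(order):
    """Number of HEALPix pixels covering the full sky at a given order."""
    return 12 * 4**order


def healpix_pixel_area_deg2(order):
    """Area of a single (equal-area) HEALPix pixel, in square degrees."""
    return FULL_SKY_DEG2 / healpix_npix(order)


# Each increment in order splits every pixel into four smaller ones.
for order in (0, 3, 6):
    print(order, healpix_npix(order), healpix_pixel_area_deg2(order))
```

A higher order therefore means smaller pixels and a finer description of the sky coverage.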

#### How to visualize the catalog metadata and schema?

Each catalog's metadata and schema (columns and their data types) are available from its HATS catalog object.

In [None]:
# The catalog's arrow schema
object_lc.original_schema

In [None]:
# Other provenance information
dict(object_lc.catalog_info)

#### How to load individual files with a parquet reader?

We can load individual data files with any parquet-compatible file reader (e.g. `pyarrow.parquet`).

In [None]:
# Grab a single file from the object catalog
single_parquet = base_path / "object_collection/object_lc/dataset/Norder=3/Dir=0/Npix=562.parquet"
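
The path above follows a regular pattern, so it can be built from a HEALPix order and pixel number. The sketch below is inferred from that single path and from our understanding that `Dir` groups pixels in blocks of 10,000. Treat both the grouping rule and the hypothetical helper `partition_path` as assumptions and not as the `hats` API.

```python
from pathlib import PurePosixPath


def partition_path(catalog_dir, order, pixel):
    """Build the relative path of a partition file for (order, pixel)."""
    # A pixel number must be valid for its order.
    if not 0 <= pixel < 12 * 4**order:
        raise ValueError(f"pixel {pixel} does not exist at order {order}")
    # Assumption: Dir is the pixel number rounded down to a multiple of 10,000.
    directory = (pixel // 10_000) * 10_000
    return (
        PurePosixPath(catalog_dir)
        / "dataset"
        / f"Norder={order}"
        / f"Dir={directory}"
        / f"Npix={pixel}.parquet"
    )


path = partition_path("object_collection/object_lc", order=3, pixel=562)
print(path)
```

In practice the catalog metadata lists which pixels actually exist, so it is safer to take paths from there than to construct them by hand.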

In [None]:
import pyarrow.parquet as pq
partition = pq.read_table(single_parquet)
partition.to_pandas().head()

There is a nested column with light curve information (*objectForcedSource*). We recommend **nested-pandas** for reading files in this format.

In [None]:
from nested_pandas import read_parquet
nested_df = read_parquet(single_parquet)
nested_df.head()

#### How to work with the full catalog with LSDB?

Loading, previewing and creating workflows with HATS data is much simpler with [LSDB](https://docs.lsdb.io/en/stable/).

In [None]:
%%time
import lsdb
# Read the catalog metadata and visualize it in the notebook
object_lc = lsdb.read_hats(base_path / "object_collection")
object_lc

In [None]:
%%time
# Look at the first 5 rows
object_lc.head()

A common use case is applying a user-defined function over each partition (pixel) of the catalog:

In [None]:
def run_per_partition(df, pixel):
    """This code runs once per partition (pixel)."""
    # Do some processing on the dataframe...
    # For example, let's add two new columns with the pixel order and number
    df["Norder"] = pixel.order
    df["Npix"] = pixel.pixel
    return df

# This function call is lazily evaluated
my_object_lc = object_lc.map_partitions(run_per_partition, include_pixel=True)

Calling `.compute()` triggers the computation. Here we use `.head()` instead, to get only the first 5 rows.

In [None]:
my_object_lc.head()

Keep in mind that `.compute()` loads the full result into memory.

If your catalog is too big to fit in memory, or you wish to reuse it later, call `to_hats` to save it to disk:

```python
my_object_lc.to_hats("path_to_my_catalog", catalog_name="name_for_my_catalog")
```