# Unit test data

This directory contains very small toy data sets that are used
for unit tests.

## Object catalog: small_sky

This "object catalog" is 131 randomly generated radec values.

- All radec positions fall in HEALPix order 0, pixel 11.
- IDs are integers from 700 to 830.

In [1]:
import hats
import lsdb
import tempfile

from dask.distributed import Client
from hats.io.file_io import remove_directory
from hats_import import (
    ImportArguments,
    pipeline_with_client,
    CollectionArguments,
    NestLightCurveArguments,
)
from lsdb.io import to_hats

tmp_path = tempfile.TemporaryDirectory()
tmp_dir = tmp_path.name

client = Client(n_workers=1, threads_per_worker=1, local_directory=tmp_dir)

### small_sky_order1

This catalog has the same data points as the other small sky catalogs,
but the points are forced into partitions at order 1 instead of order 0.

This means there are 4 leaf partition files instead of just 1, so the catalog
is useful for confirming reads/writes over multiple leaf partition files.

NB: Setting `constant_healpix_order` coerces the import pipeline to create
leaf partitions at order 1.
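The "4 leaf partition files" follow from how HEALPix subdivides pixels: each step up in order splits a pixel into 4. A minimal sketch of that arithmetic, assuming the NESTED numbering scheme (the helper name `child_pixels` is made up for illustration and is not part of hats):

```python
def child_pixels(pixel: int, order: int, target_order: int) -> list[int]:
    """Return the NESTED-scheme pixels at target_order contained in (order, pixel)."""
    if target_order < order:
        raise ValueError("target_order must be >= order")
    # Each extra order splits a pixel into 4, which appends 2 bits to its index.
    shift = 2 * (target_order - order)
    return list(range(pixel << shift, (pixel + 1) << shift))


# Order 0, pixel 11 is where all small_sky points live.
children = child_pixels(pixel=11, order=0, target_order=1)
print(children)  # [44, 45, 46, 47]
```

So with data spread across order 0, pixel 11, the order 1 catalog can have at most these four leaf partitions.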

In [2]:
remove_directory("./small_sky_order1_collection")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = (
        CollectionArguments(
            output_artifact_name="small_sky_order1_collection",
            output_path=".",
            tmp_dir=pipeline_tmp,
            addl_hats_properties={"obs_regime": "Optical"},
        )
        .catalog(
            input_file_list=["raw/small_sky/small_sky.csv"],
            file_reader="csv",
            output_artifact_name="small_sky_order1",
            constant_healpix_order=1,
        )
        .add_margin(
            margin_threshold=3600, output_artifact_name="small_sky_order1_margin_1deg", is_default=True
        )
        .add_margin(margin_threshold=7200, output_artifact_name="small_sky_order1_margin_2deg")
        .add_index(
            indexing_column="id",
            output_artifact_name="small_sky_order1_id_index",
            include_healpix_29=False,
            compute_partition_size=200_000,
        )
    )

    pipeline_with_client(args, client)

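The margin thresholds in the cell above are given in arcseconds, which is why the 3600 and 7200 margins are named `..._1deg` and `..._2deg`. A quick sanity check of that conversion (plain arithmetic, no hats-import calls; the helper name is illustrative):

```python
ARCSEC_PER_DEGREE = 3600


def arcsec_to_degrees(arcsec: float) -> float:
    """Convert an angle in arcseconds to degrees."""
    return arcsec / ARCSEC_PER_DEGREE


# Thresholds and artifact names copied from the collection arguments above.
margin_thresholds = {
    "small_sky_order1_margin_1deg": 3600,
    "small_sky_order1_margin_2deg": 7200,
}
margins_deg = {name: arcsec_to_degrees(t) for name, t in margin_thresholds.items()}
print(margins_deg)
```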

### small_sky

This "object catalog" is 131 randomly generated radec values.

- All radec positions fall in HEALPix order 0, pixel 11.
- IDs are integers from 700 to 830.

This catalog was generated with the following snippet:

In [3]:
remove_directory("./small_sky")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = ImportArguments(
        input_file_list=["raw/small_sky/small_sky.csv"],
        output_path=".",
        file_reader="csv",
        output_artifact_name="small_sky",
        tmp_dir=pipeline_tmp,
        highest_healpix_order=5,
    )
    pipeline_with_client(args, client)


## Source catalog: small_sky_source

This "source catalog" is 131 detections for each of the 131 objects
in the "small_sky" catalog (17,161 rows in total). Each detection has a random
magnitude, MJD, and band (selected from ugrizy). The full script that generated
the values can be found [here](https://github.com/delucchi-cmu/hipscripts/blob/main/twiddling/small_sky_source.py).
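For a feel of the data shape, here is a rough stand-in for that kind of generator. It is NOT the linked script: the value ranges, the starting `source_id`, and the omission of per-source positions are all assumptions chosen for illustration.

```python
import random

BANDS = "ugrizy"
N_OBJECTS = 131
N_DETECTIONS_PER_OBJECT = 131


def make_sources(seed: int = 0) -> list[dict]:
    """Build one row per detection, with random mjd, mag, and band."""
    rng = random.Random(seed)
    sources = []
    source_id = 70000  # assumed starting ID
    for object_id in range(700, 700 + N_OBJECTS):
        for _ in range(N_DETECTIONS_PER_OBJECT):
            sources.append(
                {
                    "source_id": source_id,
                    "object_id": object_id,
                    "mjd": rng.uniform(58363.0, 59563.0),  # assumed range
                    "mag": rng.uniform(15.0, 21.0),  # assumed range
                    "band": rng.choice(BANDS),
                }
            )
            source_id += 1
    return sources


sources = make_sources()
print(len(sources))  # 17161
```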

### small_sky_order1_source

In [4]:
remove_directory("./small_sky_order1_source_collection")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    args = (
        CollectionArguments(
            output_artifact_name="small_sky_order1_source_collection",
            output_path=".",
            tmp_dir=pipeline_tmp,
            addl_hats_properties={"obs_regime": "Optical"},
        )
        .catalog(
            input_file_list=["raw/small_sky_source/small_sky_source.csv"],
            file_reader="csv",
            ra_column="source_ra",
            dec_column="source_dec",
            catalog_type="source",
            output_artifact_name="small_sky_order1_source",
            constant_healpix_order=1,
        )
        .add_margin(
            margin_threshold=7200, output_artifact_name="small_sky_order1_source_margin", is_default=True
        )
        .add_index(
            indexing_column="object_id",
            output_artifact_name="small_sky_order1_source_object_id_index",
            include_healpix_29=False,
            compute_partition_size=200_000,
        )
        .add_index(
            indexing_column="band",
            output_artifact_name="small_sky_order1_source_band_index",
            include_healpix_29=False,
            compute_partition_size=200_000,
        )
    )

    pipeline_with_client(args, client)


## Nested catalogs

### small_sky_order1_nested

Current implementation - uses LSDB to run the `join_nested` calculation. 
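Conceptually, nesting attaches every source row to its object as a list-valued column. The sketch below shows that idea with plain Python dicts. It is not how LSDB implements `join_nested`, the sample values are made up, and dropping the join key from the nested rows and keeping objects with no sources are choices of this sketch.

```python
from collections import defaultdict


def nest_sources(objects, sources, left_on, right_on, nested_column_name):
    """Attach each object's matching source rows under nested_column_name."""
    grouped = defaultdict(list)
    for source in sources:
        # Drop the join key from the nested row; it duplicates the object's ID.
        nested_row = {k: v for k, v in source.items() if k != right_on}
        grouped[source[right_on]].append(nested_row)
    return [
        {**obj, nested_column_name: grouped.get(obj[left_on], [])}
        for obj in objects
    ]


# Made-up rows, shaped like the small_sky object and source tables.
objects = [{"id": 700, "ra": 282.5}, {"id": 701, "ra": 299.5}]
sources = [
    {"object_id": 700, "mjd": 58400.0, "mag": 16.2, "band": "g"},
    {"object_id": 700, "mjd": 58401.0, "mag": 16.4, "band": "r"},
    {"object_id": 701, "mjd": 58402.0, "mag": 18.1, "band": "i"},
]
nested = nest_sources(objects, sources, "id", "object_id", "photometry")
print([len(row["photometry"]) for row in nested])  # [2, 1]
```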


In [5]:
%%time
remove_directory("small_sky_order1_nested_sources")
small_sky_order1_catalog = lsdb.open_catalog("small_sky_order1_collection")
small_sky_order1_source_with_margin = lsdb.open_catalog("small_sky_order1_source_collection")
small_sky_order1_nested = small_sky_order1_catalog.join_nested(
    small_sky_order1_source_with_margin, left_on="id", right_on="object_id", nested_column_name="photometry"
)
to_hats(
    small_sky_order1_nested,
    base_catalog_path="small_sky_order1_nested_sources",
    catalog_name="small_sky_order1_nested_sources",
    histogram_order=5,
)

CPU times: user 2.09 s, sys: 107 ms, total: 2.19 s
Wall time: 2.38 s


### small_sky_light_curve

New implementation - uses the hats-import pipeline to nest light curves.

- More flexible: we can change the partition threshold, just like with catalog import
- Can partition according to object count, source count, or bytes
- Fault tolerant: has the same resume functionality as catalog import
- Progress reporting
- Does not require a margin catalog to be created!
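To illustrate what a partition threshold does, here is a toy threshold-based partitioner over a histogram of counts at the highest order. This is an assumption-laden sketch, not the hats-import algorithm: the function names are invented, the counts are made up, and the "count" could equally stand for objects, sources, or bytes.

```python
def descendant_count(counts, pixel, order, highest_order):
    """Sum the highest-order counts that fall inside (order, pixel), NESTED scheme."""
    shift = 2 * (highest_order - order)
    return sum(counts.get(p, 0) for p in range(pixel << shift, (pixel + 1) << shift))


def choose_partitions(counts, highest_order, threshold):
    """Pick the coarsest non-empty pixels whose total count fits the threshold."""
    partitions = []

    def visit(order, pixel):
        total = descendant_count(counts, pixel, order, highest_order)
        if total == 0:
            return  # empty pixels produce no partition
        if total <= threshold or order == highest_order:
            partitions.append((order, pixel))
            return
        for child in range(4 * pixel, 4 * pixel + 4):
            visit(order + 1, child)

    for base_pixel in range(12):  # the 12 order-0 HEALPix pixels
        visit(0, base_pixel)
    return partitions


# Made-up counts at order 2, all inside order 0, pixel 11.
counts = {176: 3000, 177: 3000, 180: 1000, 184: 500}
partitions = choose_partitions(counts, highest_order=2, threshold=5000)
print(partitions)  # [(2, 176), (2, 177), (1, 45), (1, 46)]
```

Dense regions end up at finer orders and sparse regions stay coarse, which is the behavior a threshold is meant to produce.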

In [6]:
%%time
remove_directory("small_sky_light_curve")
with tempfile.TemporaryDirectory() as pipeline_tmp:
    small_sky_ncl_args = NestLightCurveArguments(
        object_catalog_dir="small_sky_order1_collection",
        object_id_column="id",
        source_catalog_dir="small_sky_order1_source_collection",
        source_object_id_column="object_id",
        source_nested_columns=["mjd", "mag", "band"],
        nested_column_name="photometry",
        output_artifact_name="small_sky_light_curve",
        partition_strategy="source_count",
        partition_threshold=5000,
        highest_healpix_order=5,
        output_path=".",
        resume=False,
        tmp_dir=pipeline_tmp,
    )

    pipeline_with_client(small_sky_ncl_args, client)


CPU times: user 69.8 ms, sys: 24 ms, total: 93.8 ms
Wall time: 523 ms


In [7]:
small_sky_nested_lsdb = hats.read_hats("small_sky_order1_nested_sources")
small_sky_nested_lsdb.aggregate_column_statistics()

| column_names | min_value | max_value | null_count | row_count |
|---|---|---|---|---|
| id | 700 | 830 | 0 | 131 |
| ra | 280.5 | 350.5 | 0 | 131 |
| dec | -69.5 | -25.5 | 0 | 131 |
| ra_error | 0 | 0 | 0 | 131 |
| dec_error | 0 | 0 | 0 | 131 |
| photometry.source_id | 70000 | 87160 | 0 | 17161 |
| photometry.source_ra | 280.50167438094974 | 351.0955564811893 | 0 | 17161 |
| photometry.source_dec | -69.49946137698012 | -24.901268838630852 | 0 | 17161 |
| photometry.mjd | 58363.28635615028 | 59562.86235105604 | 0 | 17161 |
| photometry.mag | 15.000127559957456 | 20.99969128807222 | 0 | 17161 |


In [8]:
small_sky_nested_pipeline = hats.read_hats(small_sky_ncl_args.catalog_path)
assert small_sky_nested_pipeline.on_disk
assert small_sky_nested_pipeline.catalog_path == small_sky_ncl_args.catalog_path
assert len(small_sky_nested_pipeline.get_healpix_pixels()) == 10
assert small_sky_nested_pipeline.catalog_info.total_rows == 131

small_sky_nested_pipeline.aggregate_column_statistics()

| column_names | min_value | max_value | null_count | row_count |
|---|---|---|---|---|
| id | 700 | 830 | 0 | 131 |
| ra | 280.5 | 350.5 | 0 | 131 |
| dec | -69.5 | -25.5 | 0 | 131 |
| ra_error | 0 | 0 | 0 | 131 |
| dec_error | 0 | 0 | 0 | 131 |
| photometry.band | g | z | 0 | 17161 |
| photometry.mag | 15.000127559957456 | 20.99969128807222 | 0 | 17161 |
| photometry.mjd | 58363.28635615028 | 59562.86235105604 | 0 | 17161 |


## Comparison

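The outputs are largely the same. With the pipeline it's easy to add only the nested columns you want, so the pipeline version keeps just the columns that differ from row to row (`mjd`, `mag`, `band`).

There's a significant time difference, however: in this run the dedicated pipeline finished in about 0.5 s of wall time, against about 2.4 s for LSDB. The pipeline stays faster for larger datasets, up to a point, and I'm working on the cases where it stops being faster.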

LSDB: 
```
CPU times: user 2.09 s, sys: 107 ms, total: 2.19 s
Wall time: 2.38 s
```
Dedicated pipeline:
```
CPU times: user 69.8 ms, sys: 24 ms, total: 93.8 ms
Wall time: 523 ms
```
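Putting the two timings side by side (numbers copied from the `%%time` outputs above; a single run on a toy dataset, so treat the ratio as indicative only):

```python
# Wall-clock and CPU totals, in seconds, from the two %%time outputs.
lsdb_wall_s, lsdb_cpu_s = 2.38, 2.19
pipeline_wall_s, pipeline_cpu_s = 0.523, 0.0938

wall_speedup = lsdb_wall_s / pipeline_wall_s
cpu_ratio = lsdb_cpu_s / pipeline_cpu_s

print(f"wall-time speedup: {wall_speedup:.1f}x")  # wall-time speedup: 4.6x
print(f"client CPU ratio:  {cpu_ratio:.1f}x")
```

Note that `%%time` only measures the notebook process, so work done inside the Dask worker shows up in the wall time but not in the CPU totals.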

**TODO**:
- understand and correct for performance issues with larger datasets
- respond to code review comments
- update the `generate_data` notebooks to use the newer pipeline
- remove `lsdb` as sometimes-dependency for hats-import (to generate nested data)

In [9]:
# Close the Dask client first, so its workers can clean up their scratch
# space before the temporary directory that contains it is removed.
client.close()
tmp_path.cleanup()
