## Catalog collections

This notebook demonstrates HATS catalog collections and shows how to retrieve rows with an ID search.

**Note:** These changes have not been merged yet, but they are implemented and under review.

In [1]:
%pip install -q git+https://github.com/astronomy-commons/lsdb.git@issue/689/id-search

Note: you may need to restart the kernel to use updated packages.


In [2]:
import lsdb
import hats
from upath import UPath

In [3]:
small_sky_collection = lsdb.read_hats("small_sky_order1_collection")
small_sky_collection

npartitions=4,id,ra,dec,ra_error,dec_error
"Order: 1, Pixel: 44",int64[pyarrow],double[pyarrow],double[pyarrow],int64[pyarrow],int64[pyarrow]
"Order: 1, Pixel: 45",...,...,...,...,...
"Order: 1, Pixel: 46",...,...,...,...,...
"Order: 1, Pixel: 47",...,...,...,...,...


In [4]:
small_sky_collection.hc_collection?

Type:        CatalogCollection
String form: <hats.catalog.catalog_collection.CatalogCollection object at 0x16044dad0>
File:        ~/hats/src/hats/catalog/catalog_collection.py
Docstring:
A collection of HATS Catalog with data stored in a HEALPix Hive partitioned structure

Catalogs of this type are described by a `collection.properties` file which specifies
the underlying main catalog, margin catalog and index catalog paths. These catalogs are
stored at the root of the collection, each in its separate directory::

    catalog_collection/
    ├── main_catalog/
    ├── margin_catalog/
    ├── index_catalog/
    ├── collection.properties

Margin and index catalogs are optional but there could also be multiple of them. The
catalogs used by default are specified in the `collection.properties` file in the
`default_margin` and `default_index` keywords.
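To make the role of these keywords concrete, here is a minimal sketch of how a default or requested margin name could be resolved from the properties. The keys `default_margin` and `all_margins` appear in this notebook; the `key=value` file layout, the space-separated list format, and the helpers `parse_properties` and `resolve_margin` are assumptions made for illustration, not the HATS implementation.

```python
from __future__ import annotations

# Hypothetical file contents: the key=value layout and the space-separated
# list format are assumptions for illustration, not the confirmed HATS format.
EXAMPLE_PROPERTIES = """\
all_margins=small_sky_order1_margin_1deg small_sky_order1_margin_2deg
default_margin=small_sky_order1_margin_1deg
"""


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def resolve_margin(properties: dict[str, str], requested: str | None = None) -> str:
    """Return the requested margin name, or the default when none is given."""
    available = properties.get("all_margins", "").split()
    name = requested or properties["default_margin"]
    if name not in available:
        raise ValueError(f"Margin `{name}` is not listed in all_margins")
    return name


props = parse_properties(EXAMPLE_PROPERTIES)
print(resolve_margin(props))
print(resolve_margin(props, "small_sky_order1_margin_2deg"))
```

With no argument the sketch falls back to the default margin; an unknown name raises a `ValueError`.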

In [5]:
small_sky_collection.hc_structure.catalog_path

PosixUPath('small_sky_order1_collection/small_sky_order1')

The default margin was loaded automatically:

In [6]:
small_sky_collection.margin.hc_structure.catalog_path

PosixUPath('small_sky_order1_collection/small_sky_order1_margin_1deg')

In [7]:
small_sky_collection.hc_collection.default_margin

'small_sky_order1_margin_1deg'

We can also load a non-default margin using either:
- A single identifier, which must be listed under `all_margins` in *collection.properties*.
- An absolute path to the margin, hosted locally or remotely.

In [8]:
small_sky_collection = lsdb.read_hats("small_sky_order1_collection", margin_cache="small_sky_order1_margin_2deg")
small_sky_collection.margin.hc_structure.catalog_path

PosixUPath('small_sky_order1_collection/small_sky_order1_margin_2deg')

In [9]:
pwd = %pwd
margin_absolute_path = f"{pwd}/small_sky_order1_collection/small_sky_order1_margin_2deg"
small_sky_collection = lsdb.read_hats("small_sky_order1_collection", margin_cache=margin_absolute_path)
small_sky_collection.margin.hc_structure.catalog_path

PosixUPath('/Users/scampos/notebooks_lf/sprints/2025/04_17/catalog_collections/small_sky_order1_collection/small_sky_order1_margin_2deg')

When loading from a remote location, we can pass either a string or a `UPath`; the latter is useful when we need to specify credentials:

In [10]:
remote_margin = UPath("https://epyc.astro.washington.edu/~lincc-frameworks/other_degree_surveys/small_sky_order1_margin_2deg")
small_sky_collection = lsdb.read_hats("small_sky_order1_collection", margin_cache=remote_margin)
small_sky_collection.margin.hc_structure.catalog_path

HTTPPath('https://epyc.astro.washington.edu/~lincc-frameworks/other_degree_surveys/small_sky_order1_margin_2deg')

We still validate that the provided margin schema is compatible:

In [11]:
euclid_margin = "https://data.lsdb.io/hats/euclid_q1/euclid_q1_merFinalCatalog_10arcs"
small_sky_collection = lsdb.read_hats("small_sky_order1_collection", margin_cache=euclid_margin)

ValueError: The margin catalog and the main catalog must have the same schema.
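The check behind this error can be pictured with a small sketch. It assumes that "same schema" means identical column names and types; the real validation may be more lenient (for example about extra bookkeeping columns). The helper `validate_margin_schema` and the second schema are hypothetical; only the main catalog columns and the error message come from this notebook.

```python
# Column names and types of the main catalog, taken from the notebook output.
main_schema = {
    "id": "int64",
    "ra": "double",
    "dec": "double",
    "ra_error": "int64",
    "dec_error": "int64",
}

# Hypothetical schema of an unrelated margin catalog (not the real Euclid schema).
other_schema = {
    "object_id": "int64",
    "right_ascension": "double",
    "declination": "double",
}


def validate_margin_schema(main: dict, margin: dict) -> None:
    """Raise if the margin schema differs from the main schema (strict equality assumed)."""
    if main != margin:
        raise ValueError("The margin catalog and the main catalog must have the same schema.")


validate_margin_schema(main_schema, dict(main_schema))  # passes silently

try:
    validate_margin_schema(main_schema, other_schema)
except ValueError as error:
    print(error)
```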

### `id_search` 

(This is a preliminary implementation; the design is still under discussion [here](https://github.com/astronomy-commons/lsdb/issues/689#issuecomment-2813285589).)

```
def id_search(self, ids: list, id_column: str | None = None, fine: bool = True) -> Catalog
```

In [None]:
small_sky_collection.hc_collection.default_index_field

In [12]:
small_sky_collection.hc_collection.default_index_catalog_dir

PosixUPath('small_sky_order1_collection/small_sky_order1_id_index')

In [13]:
# Specifying a single value
small_sky_collection.id_search(700).compute()

_healpix_29,id,ra,dec,ra_error,dec_error
3318157971331423954,700,282.5,-58.5,0,0


In [14]:
# Or a list of values
small_sky_collection.id_search([700,702]).compute()

_healpix_29,id,ra,dec,ra_error,dec_error
3318157971331423954,700,282.5,-58.5,0,0
3399532867186255393,702,310.5,-27.5,0,0


We can also search on any other indexed field, as long as the collection properties list an index catalog for it:

In [15]:
small_sky_collection.id_search([310.5,282.5], id_column="ra").compute()

_healpix_29,id,ra,dec,ra_error,dec_error
3225326185519392324,766,310.5,-63.5,0,0
3318157971331423954,700,282.5,-58.5,0,0
3340634127750683939,827,310.5,-40.5,0,0
3399532867186255393,702,310.5,-27.5,0,0
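
The `_healpix_29` index values in these outputs can be related to the partitions listed earlier (order 1, pixels 44 to 47). The following is a minimal sketch, assuming `_healpix_29` is a nested-scheme HEALPix index at order 29, so that moving to a coarser order drops two bits per level. The helper `pixel_at_order` is hypothetical and not part of LSDB or HATS.

```python
HEALPIX_INDEX_ORDER = 29  # assumed order of the `_healpix_29` spatial index


def pixel_at_order(healpix_29: int, order: int) -> int:
    """Return the nested HEALPix pixel at `order` that contains the order-29 index."""
    # In the nested scheme each pixel splits into 4 children per order,
    # so every order step corresponds to 2 bits of the index.
    return healpix_29 >> (2 * (HEALPIX_INDEX_ORDER - order))


# (id, _healpix_29) pairs copied from the output above.
rows = [
    (766, 3225326185519392324),
    (700, 3318157971331423954),
    (827, 3340634127750683939),
    (702, 3399532867186255393),
]

for object_id, healpix_29 in rows:
    print(object_id, pixel_at_order(healpix_29, order=1))
# 766 44
# 700 46
# 827 46
# 702 47
```

Under this assumption every returned row falls in one of the four partitions of the catalog.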


In [16]:
# No indexing catalog in properties for field "dec"
small_sky_collection.id_search([310.5,282.5], id_column="dec").compute()

ValueError: Index for field `dec` is not specified in all_indexes

This method is only available on catalog collections:

In [17]:
# Calling `id_search` on the standalone small_sky_order1 catalog raises an error
lsdb.read_hats("small_sky_order1_collection/small_sky_order1").id_search([700])

NotImplementedError: `Catalog.id_search` is only available in the context of Catalog collections. Use `Catalog.index_search` with an instance of `HCIndexCatalog` instead.

Users can still call `index_search` with their own HATS index catalogs:

In [18]:
index_catalog = hats.read_hats("small_sky_order1_collection/small_sky_order1_id_index")
small_sky_collection.index_search([700,900], index_catalog).compute()

_healpix_29,id,ra,dec,ra_error,dec_error
3318157971331423954,700,282.5,-58.5,0,0
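
Only the row for 700 is returned here, presumably because no row has `id` 900. The toy lookup below sketches the general idea of an index-based search: the index maps each value to the partition that holds it, so only the matching partitions are read and unknown values are skipped. The names `toy_index`, `toy_partitions` and `toy_index_search`, as well as the pixel assignments, are illustrative assumptions and not the LSDB implementation.

```python
# Toy index: value of the indexed field -> order-1 pixel holding that row (assumed).
toy_index = {700: 46, 827: 46, 702: 47, 766: 44}

# Toy partitions: pixel -> rows stored in that partition (values from the outputs above).
toy_partitions = {
    44: [{"id": 766, "ra": 310.5, "dec": -63.5}],
    46: [
        {"id": 700, "ra": 282.5, "dec": -58.5},
        {"id": 827, "ra": 310.5, "dec": -40.5},
    ],
    47: [{"id": 702, "ra": 310.5, "dec": -27.5}],
}


def toy_index_search(ids: list) -> list:
    """Return rows whose id is in `ids`, reading only the partitions the index points to."""
    wanted = set(ids)
    # Values missing from the index do not select any partition.
    pixels = sorted({toy_index[value] for value in wanted if value in toy_index})
    return [row for pixel in pixels for row in toy_partitions[pixel] if row["id"] in wanted]


print(toy_index_search([700, 900]))
# [{'id': 700, 'ra': 282.5, 'dec': -58.5}]
```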


### Feedback needed

Some questions arose:

1. Should we allow queries for lists of values instead of a single value? 
  - This is the behavior on `index_search`, and it looks useful

2. How do we feel about the `default_index`? 
  - Is it too implicit?

3. Further discussion with Kostya and Sean:
  - Would `Catalog.match_values(col1=val1, col2=val2)` be of higher value to the users? 
  - We don't support composite indices. Should we keep one index catalog per field and add more retrieval logic?
  - Any other alternatives?