# DASH on a large instance

Execution for [2025_04](https://rubinobs.atlassian.net/browse/DM-48556)

This notebook uses the butler only to list the tracts/patches and to fetch the URIs of the backing parquet files. Those files are then read directly by the hats-import pipeline.

We take this approach because many `butler.get` results are too large to fit in the memory of a medium or large RSP notebook instance.

Beyond the butler issues, running the importer on a smaller instance caused additional problems. These can largely be avoided by running on the dev machines that are available outside notebooks, but I think they are a good lesson in how the Rubin data is structured and how we can import it more efficiently with our existing tools.

Useful material:
- LINCC notebooks: https://github.com/lsst-sitcom/linccf
- Stack Club notebooks: https://github.com/LSSTScienceCollaborations/StackClub/tree/master

In [1]:
### UPDATE THIS CELL
## Then run all cells.

from pathlib import Path

base_output_dir = Path("/sdf/data/rubin/shared/lsdb_commissioning/dm_48556")
collections = 'LSSTComCam/runs/DRP/DP1/w_2025_04/DM-48556'

In [2]:
# %pip install -q lsdb hats-import

In [3]:
# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler

# HATS/LSDB
import lsdb
import hats_import.pipeline as runner
from hats_import.catalog.arguments import ImportArguments

# os.path.join is used further down when reading the HATS catalogs back
import os

from tqdm import tqdm
import pandas as pd

### Configure Butler

In [4]:
config = '/repo/main'
butler = dafButler.Butler(config, collections=collections)

In [5]:
raw_dir = base_output_dir / "raw"
hats_dir = base_output_dir / "hats"

raw_dir.mkdir(parents=True, exist_ok=True)
hats_dir.mkdir(parents=True, exist_ok=True)

### Helper methods

In [6]:
def uris_from_butler(dataset_type, out_dir):
    """Write the backing-file paths and data IDs for every dataset of this type."""
    refs = butler.query_datasets(dataset_type)
    paths = []
    for ref in tqdm(refs):
        table_path = butler.getURI(dataset_type, dataId=ref.dataId)
        paths.append(table_path.path)

    print(f"Found {len(paths)} files for {dataset_type}")

    # The "paths" and "refs" subdirectories are not created anywhere else.
    (out_dir / "paths").mkdir(parents=True, exist_ok=True)
    (out_dir / "refs").mkdir(parents=True, exist_ok=True)

    file_pointer = out_dir / "paths" / f"{dataset_type}.txt"
    with file_pointer.open("w", encoding="utf8") as _file:
        for path in paths:
            _file.write(path + "\n")

    ref_ids = [ref.dataId.mapping for ref in refs]
    ref_frame = pd.DataFrame(ref_ids)
    ref_frame.to_csv(out_dir / "refs" / f"{dataset_type}.csv", index=False)
    
def download_visits(out_dir):
    """Downloads the visitTable for LSSTComCam"""
    visits = butler.get("visitTable", dataId={'instrument': 'LSSTComCam'})
    parquet_path = out_dir / "visits.parquet"
    visits.to_parquet(parquet_path)
    print(f"Saved {len(visits)} visits rows to {parquet_path}")
    return visits

## Fetch all URIs.

We write the file paths to a simple text file.

Example outputs, to give an idea of number of files and total runtime:

```
100%|██████████| 28/28 [00:01<00:00, 24.77it/s]
Found 28 files for diaObjectTable_tract
100%|██████████| 28/28 [00:01<00:00, 22.72it/s]
Found 28 files for diaSourceTable_tract
100%|██████████| 581/581 [00:26<00:00, 22.21it/s]
Found 581 files for forcedSourceOnDiaObjectTable
100%|██████████| 599/599 [00:27<00:00, 21.96it/s]
Found 599 files for objectTable
 38%|███▊      | 6320/16471 [05:00<08:49, 19.15it/s]
100%|██████████| 16471/16471 [12:49<00:00, 21.40it/s]
Found 16471 files for sourceTable
100%|██████████| 599/599 [00:27<00:00, 21.90it/s]
Found 599 files for forcedSourceTable
```

This took much longer than I expected (almost 13 minutes for `sourceTable` alone), so the invocations below are commented out. Uncomment them to regenerate the path lists.
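
Since regenerating the lists is slow, it can be worth sanity-checking an existing list before starting an import. The following is a minimal sketch, not part of the original workflow: `check_path_list` is a hypothetical helper, and the existence check assumes the paths point at a locally mounted filesystem.

```python
from pathlib import Path


def check_path_list(list_file):
    """Return (unique_paths, duplicates, missing) for a newline-delimited path list."""
    lines = Path(list_file).read_text(encoding="utf8").splitlines()
    paths = [line.strip() for line in lines if line.strip()]

    seen = set()
    unique_paths = []
    duplicates = []
    for path in paths:
        if path in seen:
            duplicates.append(path)
        else:
            seen.add(path)
            unique_paths.append(path)

    # Only meaningful when the paths are on a locally mounted filesystem.
    missing = [path for path in unique_paths if not Path(path).exists()]
    return unique_paths, duplicates, missing
```

For example, `check_path_list(raw_dir / "paths" / "sourceTable.txt")` would report duplicated or missing entries before the import spends time on them.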

### CONCERN TO DM

I'm concerned about the growth of the `sourceTable` in particular. This is already at `16_471` datasets. The result size of `butler.query_datasets("sourceTable")` will soon be too large to handle, and there doesn't appear to be a mechanism in the existing API for pagination.
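
Until something like pagination exists, one partial workaround is to bound the downstream work by handling the returned refs in fixed-size batches. The sketch below is an illustration only: `batched` is a hypothetical helper, the integers stand in for dataset refs, and it does not reduce the size of the query result itself.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the 16_471 sourceTable refs; plain integers keep the sketch runnable.
fake_refs = range(16_471)
batch_sizes = [len(batch) for batch in batched(fake_refs, 5_000)]
print(batch_sizes)  # [5000, 5000, 5000, 1471]
```

Each batch could then be resolved to URIs and appended to the path file separately, so a failure partway through does not lose the earlier batches.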

In [None]:
# uris_from_butler('diaObjectTable_tract', raw_dir)
# uris_from_butler('diaSourceTable_tract', raw_dir)
# uris_from_butler('forcedSourceOnDiaObjectTable', raw_dir)
# uris_from_butler('objectTable', raw_dir)
# uris_from_butler('sourceTable', raw_dir)
# uris_from_butler('forcedSourceTable', raw_dir)

### Import data to HATS

In [5]:
from rubin_reader import RubinParquetReader
from dask.distributed import Client
import tempfile
from hats_import.catalog.file_readers import ParquetPyarrowReader

tmp_path = tempfile.TemporaryDirectory()
tmp_dir = tmp_path.name

client = Client(n_workers=4, threads_per_worker=1, local_directory=tmp_dir)

In [6]:
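def get_paths(dataset_type, out_dir):
    """Read back the path list written by uris_from_butler."""
    file_pointer = out_dir / "paths" / f"{dataset_type}.txt"
    with file_pointer.open("r", encoding="utf8") as _text_file:
        paths = _text_file.readlines()

    # Strip newlines and skip any blank lines.
    paths = [path.strip() for path in paths if path.strip()]
    return paths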

#### DiaObject

In [None]:
dataset_type = "diaObjectTable_tract"

diaObj_default_columns = ["diaObjectId", "ra", "dec", "nDiaSources", "radecMjdTai"]

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="diaObject",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(column_names=diaObj_default_columns),
    ra_column="ra",
    dec_column="dec",
    catalog_type="object",
    resume=False,
    pixel_threshold=2_000_000,
)
runner.pipeline_with_client(args, client)

#### DiaSource

In [None]:
dataset_type = "diaSourceTable_tract"

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="diaSource",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(),
    ra_column="ra",
    dec_column="dec",
    catalog_type="source",
    resume=False,
    pixel_threshold=2_000_000,
)
runner.pipeline_with_client(args, client)

#### DiaForcedSource

In [None]:
dataset_type = "forcedSourceOnDiaObjectTable"

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="diaForcedSource",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(),
    ra_column="coord_ra",
    dec_column="coord_dec",
    catalog_type="source",
    pixel_threshold=5_000_000,
    highest_healpix_order=12,
)
runner.pipeline_with_client(args, client)

#### Object

In [None]:
cols_per_band = []
for band in list("ugrizy"):
    for flux_type in ["psf","kron"]:
        prefix = f"{band}_{flux_type}"
        cols_per_band.extend([f"{prefix}Flux", f"{prefix}FluxErr"])
    cols_per_band.append(f"{band}_kronRad")
    
obj_default_columns = [
    "objectId",
    "refFwhm",
    "shape_flag",
    "sky_object",
    "parentObjectId",
    "detect_isPrimary",
    "x",
    "y",
    "xErr",
    "yErr",
    "shape_yy", 
    "shape_xx", 
    "shape_xy", 
    "coord_ra",
    "coord_dec", 
    "coord_raErr", 
    "coord_decErr",
    "tract",
    "patch",
    "detect_isIsolated"
] + cols_per_band

obj_default_columns

In [None]:
dataset_type = "objectTable"

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="object",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(column_names=obj_default_columns),
    ra_column="coord_ra",
    dec_column="coord_dec",
    catalog_type="object",
    resume=False,
    pixel_threshold=300_000,
)
runner.pipeline_with_client(args, client)

#### Source

This is the one that's going to get much worse very quickly. The `sourceTable` datasets are split along the visit dimension, so each file is very small and there are LOTS of them. The mapping and splitting stages each have to touch all 16471 files:

```
Planning  : 100% 4/4 [00:00<00:00, 123.68it/s]
Mapping   : 100% 16471/16471 [04:25<00:00,  1.77it/s]
Binning   : 100% 2/2 [00:38<00:00, 17.09s/it]
Splitting : 100% 16471/16471 [28:41<00:00,  1.64s/it]
Reducing  : 100% 148/148 [04:30<00:00,  2.21s/it]
Finishing : 100% 5/5 [00:24<00:00,  8.99s/it]
```
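
To see where the time goes, the elapsed times in the progress output above can be summed per stage. This is a minimal sketch that assumes every elapsed time is formatted as MM:SS; tqdm switches to H:MM:SS for stages longer than an hour, which this pattern does not handle. The names `STAGE_PATTERN` and `stage_seconds` are mine.

```python
import re

# Progress output copied from the sourceTable import above.
progress_log = """\
Planning  : 100% 4/4 [00:00<00:00, 123.68it/s]
Mapping   : 100% 16471/16471 [04:25<00:00,  1.77it/s]
Binning   : 100% 2/2 [00:38<00:00, 17.09s/it]
Splitting : 100% 16471/16471 [28:41<00:00,  1.64s/it]
Reducing  : 100% 148/148 [04:30<00:00,  2.21s/it]
Finishing : 100% 5/5 [00:24<00:00,  8.99s/it]
"""

# Captures the stage name and the elapsed MM:SS (the part before "<").
STAGE_PATTERN = re.compile(r"^(\w+)\s*:.*\[(\d+):(\d+)<", re.MULTILINE)


def stage_seconds(log_text):
    """Map each stage name to its elapsed wall-clock seconds."""
    return {
        stage: int(minutes) * 60 + int(seconds)
        for stage, minutes, seconds in STAGE_PATTERN.findall(log_text)
    }


elapsed = stage_seconds(progress_log)
total = sum(elapsed.values())
slowest = max(elapsed, key=elapsed.get)
print(f"total: {total} s")  # total: 2318 s
print(f"slowest: {slowest} ({elapsed[slowest] / total:.0%})")  # slowest: Splitting (74%)
```

That is just under 39 minutes in total, with roughly three quarters of it spent splitting the many small input files.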

Solutions:

- Use the `IndexedParquetReader`. We can aggregate each index file by something like tract/patch of the visit, to reduce intermediate file usage.
- Escalate to DM. This is going to be ROUGH for everyone if there is no aggregation.
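
The first option could start from something like the sketch below, which writes index files that each list many parquet paths. It groups by a fixed file count rather than by tract/patch, because this notebook does not record how visits map to tracts. `write_index_files` is a hypothetical helper, and I am assuming `IndexedParquetReader` expects one parquet path per line in each index file; check that against the hats-import documentation before relying on it.

```python
from pathlib import Path


def write_index_files(paths, index_dir, files_per_index):
    """Write text files that each list up to files_per_index input paths."""
    if files_per_index < 1:
        raise ValueError("files_per_index must be at least 1")
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)

    index_files = []
    for start in range(0, len(paths), files_per_index):
        chunk = paths[start : start + files_per_index]
        index_file = index_dir / f"index_{len(index_files):05d}.txt"
        # One path per line (assumed format for the indexed reader).
        index_file.write_text("\n".join(chunk) + "\n", encoding="utf8")
        index_files.append(str(index_file))
    return index_files
```

With 100 paths per index file, the 16471 `sourceTable` files would collapse into 165 inputs. The returned list would then presumably replace `get_paths(dataset_type, raw_dir)` as the `input_file_list`, with the reader swapped accordingly.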

In [None]:
dataset_type = "sourceTable"

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="source",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(),
    ra_column="ra",
    dec_column="dec",
    catalog_type="source",
    resume=False,
    pixel_threshold=1_000_000,
)
runner.pipeline_with_client(args, client)

#### ForcedSource

In [7]:
visits = download_visits(raw_dir)
# Map each visit (the table index) to its exposure midpoint MJD.
visit_map = visits["expMidptMJD"].to_dict()

Saved 1857 visits rows to /sdf/data/rubin/shared/lsdb_commissioning/dm_48556/raw/visits.parquet


In [7]:
dataset_type = "forcedSourceTable"

args = ImportArguments(
    output_path=hats_dir,
    output_artifact_name="forcedSource",
    input_file_list=get_paths(dataset_type, raw_dir),
    file_reader=ParquetPyarrowReader(),
    ra_column="coord_ra",
    dec_column="coord_dec",
    catalog_type="source",
    resume=False,
    pixel_threshold=8_000_000,
)
runner.pipeline_with_client(args, client)

# Planning  : 100% 4/4 [00:00<00:00, 352.13it/s]
# Mapping   : 100% 605/605 [00:18<00:00, 26.07it/s]
# Binning   : 100% 2/2 [00:08<00:00,  4.85s/it]
# Splitting : 100% 605/605 [04:59<00:00,  1.87s/it]
# Reducing  : 100% 207/207 [03:15<00:00,  1.01s/it]


### Extra: nest sources in object catalogs

In [None]:
diaObject_cat = lsdb.read_hats(os.path.join(hats_dir, "diaObject"))
diaSource_cat = lsdb.read_hats(os.path.join(hats_dir, "diaSource"))
diaForcedSource_cat = lsdb.read_hats(os.path.join(hats_dir, "diaForcedSource"))

In [None]:
diaObject_cat_nested = diaObject_cat.join_nested(
    diaSource_cat, left_on="diaObjectId", right_on="diaObjectId", nested_column_name="diaSource"
).join_nested(
    diaForcedSource_cat, left_on="diaObjectId", right_on="diaObjectId", nested_column_name="diaForcedSource"
)
diaObject_cat_nested

In [None]:
# diaObject_cat_nested.to_hats(os.path.join(hats_dir, "diaObject_lc"))

In [None]:
object_cat = lsdb.read_hats(os.path.join(hats_dir, "object"))
source_cat = lsdb.read_hats(os.path.join(hats_dir, "source"))
forcedSource_cat = lsdb.read_hats(os.path.join(hats_dir, "forcedSource"))

In [None]:
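# The nested join with source_cat is commented out; only forcedSource is nested.
# To include it, chain another join_nested call with:
#     source_cat, left_on="objectId", right_on="objectId", nested_column_name="source"
object_cat_nested = object_cat.join_nested(
    forcedSource_cat, left_on="objectId", right_on="objectId", nested_column_name="forcedSource"
)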
object_cat_nested

In [None]:
# object_cat_nested.to_hats(os.path.join(hats_dir, "object_lc"))

In [8]:
client.close()
tmp_path.cleanup()