# Fetch data from butler

Execution for [2025_04](https://rubinobs.atlassian.net/browse/DM-48556)

This notebook uses the butler only to fetch the tracts/patches, and to fetch the URIs of backing parquet files. Those files are read into the hats-import pipeline directly.

This is done because many `butler.get` results are too large to fit in the memory of a medium or large RSP notebook instance.

Beyond the butler issues, running the importer on a smaller instance caused additional problems. These can largely be avoided by running on the dev machines available outside notebooks, but I think it's a good lesson in how the Rubin data is structured and how we can import more efficiently with our existing tools.

Useful material:
- LINCC notebooks: https://github.com/lsst-sitcom/linccf
- Stack Club notebooks: https://github.com/LSSTScienceCollaborations/StackClub/tree/master

In [1]:
# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler

import os
import pandas as pd

from tqdm import tqdm
from pathlib import Path

### Set DRP_VERSION and COLLECTION_TAG

1. Update `DRP_VERSION` and `COLLECTION_TAG` in *00-set_env.sh*.
2. Source the script: `source 00-set_env.sh`.
3. Run Jupyter from the same terminal.
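If Jupyter is started from a terminal where the script was not sourced, the `os.environ[...]` lookups in the next cell fail with a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper name, not something defined in this repository.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise with a hint.

    Hypothetical helper: treats an unset or blank variable as missing.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; source 00-set_env.sh in the terminal "
            "that launches Jupyter, then restart the kernel."
        )
    return value
```

With this in place, `DRP_VERSION = require_env("DRP_VERSION")` could stand in for the direct `os.environ` lookup.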

In [None]:
DRP_VERSION = os.environ["DRP_VERSION"]
COLLECTION_TAG = os.environ["COLLECTION_TAG"]
print(f"DRP_VERSION: {DRP_VERSION}")
print(f"COLLECTION_TAG: {COLLECTION_TAG}")
base_output_dir = Path(f"/sdf/data/rubin/shared/lsdb_commissioning/hats/{DRP_VERSION}")
collections = f"LSSTComCam/runs/DRP/DP1/{DRP_VERSION}/{COLLECTION_TAG}"

In [4]:
raw_dir = base_output_dir / "raw"

paths_dir = raw_dir / "paths"
refs_dir = raw_dir / "refs"
sizes_dir = raw_dir / "sizes"

paths_dir.mkdir(parents=True, exist_ok=True)
refs_dir.mkdir(parents=True, exist_ok=True)
sizes_dir.mkdir(parents=True, exist_ok=True)

### Configure Butler

In [5]:
config = '/repo/main'
butler = dafButler.Butler(config, collections=collections)

### Helper methods

In [6]:
def uris_from_butler(dataset_type, out_dir):
    """Write the backing-file paths and data IDs of every dataset of a type."""
    refs = butler.query_datasets(dataset_type)
    paths = []
    for ref in tqdm(refs):
        # Resolve the URI only; the dataset itself is never loaded into memory.
        table_path = butler.getURI(dataset_type, dataId=ref.dataId)
        paths.append(table_path.path)

    print(f"Found {len(paths)} files for {dataset_type}")

    # One path per line, in out_dir/paths/<dataset_type>.txt
    file_pointer = out_dir / "paths" / f"{dataset_type}.txt"
    with file_pointer.open("w", encoding="utf8") as _file:
        for path in paths:
            _file.write(path + "\n")

    # Data IDs of the same datasets, in out_dir/refs/<dataset_type>.csv
    ref_ids = [ref.dataId.mapping for ref in refs]
    ref_frame = pd.DataFrame(ref_ids)
    ref_frame.to_csv(out_dir / "refs" / f"{dataset_type}.csv", index=False)
    
def download_visits(out_dir):
    """Downloads the visitTable for LSSTComCam"""
    visits = butler.get("visitTable", dataId={'instrument': 'LSSTComCam'})
    parquet_path = out_dir / "visits.parquet"
    visits.to_parquet(parquet_path)
    print(f"Saved {len(visits)} visits rows to {parquet_path}")

## Fetch all URIs

We write the file paths to a simple text file.
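For the downstream import step, the text file can be read back into a plain list of paths. The sketch below assumes only the one-path-per-line layout written by `uris_from_butler`; `read_paths` is a hypothetical helper name, and how the list is then handed to the importer is left open.

```python
from pathlib import Path


def read_paths(paths_file: Path) -> list[str]:
    """Return the non-empty lines of a paths file, one path per line."""
    with Path(paths_file).open("r", encoding="utf8") as handle:
        # Strip newlines and skip blank lines so a trailing newline is harmless.
        return [line.strip() for line in handle if line.strip()]
```

For example, `read_paths(raw_dir / "paths" / "objectTable.txt")` would return the list of `objectTable` parquet paths once the corresponding invocation has run.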

Example outputs, to give an idea of number of files and total runtime:

```
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 20.95it/s]
Found 28 files for diaObjectTable_tract
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 22.46it/s]
Found 28 files for diaSourceTable_tract
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 586/586 [00:27<00:00, 21.54it/s]
Found 586 files for forcedSourceOnDiaObjectTable
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:27<00:00, 21.63it/s]
Found 605 files for objectTable
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 16471/16471 [13:36<00:00, 20.16it/s]
Found 16471 files for sourceTable
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:28<00:00, 21.34it/s]
Found 605 files for forcedSourceTable
```

This took much longer than I expected (over 13 minutes for `sourceTable` alone, at roughly 20 datasets per second), so I'll comment out the invocations.

### CONCERN TO DM

I'm concerned about the growth of the `sourceTable` in particular. This is already at `16_471` datasets. The result size of `butler.query_datasets("sourceTable")` will soon be too large to handle, and there doesn't appear to be a mechanism in the existing API for pagination.
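One possible workaround, sketched here rather than tested against a real repository, is to split the query into smaller pieces by a dimension value such as visit or tract. The helper below is hypothetical, and it assumes the query function accepts a `where` keyword holding an expression string, as I believe `butler.query_datasets` does. The values are formatted with `repr`, which is only appropriate for simple integers and strings.

```python
from typing import Callable, Iterable, Iterator


def query_in_partitions(
    query_fn: Callable[..., Iterable],
    dataset_type: str,
    key: str,
    values: Iterable,
) -> Iterator:
    """Yield query results one partition at a time instead of all at once.

    query_fn is a stand-in for a function like butler.query_datasets; it is
    assumed (not verified here) to accept a `where` keyword argument.
    """
    for value in values:
        # Build a constraint such as "visit = 123" for this partition.
        where = f"{key} = {value!r}"
        yield from query_fn(dataset_type, where=where)
```

A call might look like `query_in_partitions(butler.query_datasets, "sourceTable", "visit", visit_ids)`, where `visit_ids` is a hypothetical list of visit numbers, for instance taken from the visits table. Only one partition's results are held in memory at a time, at the cost of one query per partition.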

In [None]:
uris_from_butler('diaObjectTable_tract', raw_dir)
uris_from_butler('diaSourceTable_tract', raw_dir)
uris_from_butler('forcedSourceOnDiaObjectTable', raw_dir)
uris_from_butler('objectTable', raw_dir)
uris_from_butler('sourceTable', raw_dir)
uris_from_butler('forcedSourceTable', raw_dir)

## Download visits table

In [None]:
download_visits(raw_dir)