# Fetch data from butler

Execution for [DP1 v29_0_0](https://rubinobs.atlassian.net/browse/DM-50260)

This notebook uses the butler only to fetch the tracts/patches, and to fetch the URIs of backing parquet files. Those files are read into the hats-import pipeline directly.

This is done because many `butler.get` results are too large to fit in the memory of a medium or large RSP notebook instance.

Beyond the butler issues, running the importer on a smaller instance caused additional problems. These can largely be avoided by running on the dev machines that are available outside notebooks, but I think they are a good lesson in how the Rubin data is structured and how we can import it more efficiently with our existing tools.

Useful material:
- LINCC notebooks: https://github.com/lsst-sitcom/linccf
- StackClub notebooks: https://github.com/LSSTScienceCollaborations/StackClub/tree/master

In [None]:
# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from tqdm import tqdm
from pathlib import Path

In [None]:
INSTRUMENT = os.environ["INSTRUMENT"]
REPO = os.environ["REPO"]
RUN = os.environ["RUN"]
VERSION = os.environ["VERSION"]
COLLECTION = os.environ["COLLECTION"]
OUTPUT_DIR = Path(os.environ["OUTPUT_DIR"])

print(f"INSTRUMENT: {INSTRUMENT}")
print(f"REPO: {REPO}")
print(f"RUN: {RUN}")
print(f"VERSION: {VERSION}")
print(f"COLLECTION: {COLLECTION}")
print(f"OUTPUT_DIR: {OUTPUT_DIR}")

collections = f"{INSTRUMENT}/runs/DRP/{RUN}/{VERSION}/{COLLECTION}"

In [None]:
raw_dir = OUTPUT_DIR / "raw" / VERSION

paths_dir = raw_dir / "paths"
refs_dir = raw_dir / "refs"
sizes_dir = raw_dir / "sizes"

paths_dir.mkdir(parents=True, exist_ok=True)
refs_dir.mkdir(parents=True, exist_ok=True)
sizes_dir.mkdir(parents=True, exist_ok=True)

### Configure Butler

In [None]:
butler = dafButler.Butler(REPO, collections=collections)

### Helper methods

In [None]:
def get_uris_from_butler(dataset_type):
    """Fetch the parquet URIs for a dataset type; save the URIs and data IDs."""
    refs = butler.query_datasets(dataset_type)

    # Resolve each dataset reference to the URI of its backing file.
    paths = []
    for ref in tqdm(refs):
        table_path = butler.getURI(dataset_type, dataId=ref.dataId)
        paths.append(table_path.geturl())

    print(f"Found {len(paths)} files for {dataset_type}")

    # One URI per line, in the same order as the query results.
    file_pointer = paths_dir / f"{dataset_type}.txt"
    with file_pointer.open("w", encoding="utf8") as _file:
        for path in paths:
            _file.write(path + "\n")

    # Save the data IDs (e.g. tract and patch) alongside the paths.
    ref_ids = [ref.dataId.mapping for ref in refs]
    ref_frame = pd.DataFrame(ref_ids)
    ref_frame.to_csv(refs_dir / f"{dataset_type}.csv", index=False)


def get_visits_from_butler(visits_type):
    """Fetch the visit table for INSTRUMENT and save it as a parquet file."""
    visits = butler.get(visits_type, dataId={"instrument": INSTRUMENT})
    parquet_path = raw_dir / f"{visits_type}.parquet"
    visits_table = pa.Table.from_pandas(visits.to_pandas())
    pq.write_table(visits_table, parquet_path)
    print(f"Saved {len(visits)} visit rows to {parquet_path}")

## Fetch all URIs

For each dataset type, we write the file paths to a simple text file (one URI per line) and the matching data IDs to a CSV file.

Example outputs, to give an idea of the number of files and the total runtime:

```
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 20.95it/s]
Found 28 files for dia_object
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:01<00:00, 22.46it/s]
Found 28 files for dia_source
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 586/586 [00:27<00:00, 21.54it/s]
Found 586 files for dia_object_forced_source
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:27<00:00, 21.63it/s]
Found 605 files for object
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1787/1787 [01:42<00:00, 17.40it/s]
Found 1787 files for source
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:28<00:00, 21.34it/s]
Found 605 files for object_forced_source
```
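
A downstream step then needs to read those text files back. The following is a minimal sketch of that, assuming the datastore is backed by local files so that the saved URIs use the `file` scheme; `read_uri_list` and `uri_to_local_path` are hypothetical helper names, and the example URI is made up.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def read_uri_list(path_file):
    """Return the non-empty lines of a paths file, one URI per line."""
    with Path(path_file).open("r", encoding="utf8") as _file:
        return [line.strip() for line in _file if line.strip()]


def uri_to_local_path(uri):
    """Convert a file:// URI to a local Path; reject any other scheme."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"Not a local file URI: {uri}")
    # unquote turns percent-escapes such as %20 back into plain characters.
    return Path(unquote(parsed.path))


# Hypothetical URI, for illustration only.
example_path = uri_to_local_path("file:///data/repo/object/part%201.parquet")
```

Remote URIs (for example `s3://`) would need a filesystem-aware reader instead, which is why the helper raises rather than guessing.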

In [None]:
get_uris_from_butler("dia_object")
get_uris_from_butler("dia_source")
get_uris_from_butler("dia_object_forced_source")
get_uris_from_butler("object")
get_uris_from_butler("source")
get_uris_from_butler("object_forced_source")

## Fetch visits table

In [None]:
get_visits_from_butler("visit_table")