## HATS Data Preview 1 on RSP

This notebook tests access to Data Preview 1 (DP1) data in the HATS format. 

**Goal:** Load a random sample of the data for use in scale testing on the Rubin Science Platform (RSP).

In [13]:
# if not previously installed
# %pip install lsdb --quiet

In [2]:
import lsdb
import numpy as np
import pandas as pd
from upath import UPath

In [3]:
base_path = UPath("/rubin/lsdb_data")
object_collection = lsdb.open_catalog(base_path / "object_collection_lite")

In [4]:
pixel_statistics = object_collection.per_pixel_statistics()
counts = pd.to_numeric(pixel_statistics["objectId: row_count"], errors="coerce")
pixel_counts = counts.groupby(level=0).sum()
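
The coercion and grouping in the cell above can be tried on a toy table first. The sketch below is only an illustration under stated assumptions: `toy_statistics`, its pixel labels, and its values are made up, and the real output of `per_pixel_statistics()` may be indexed differently. It shows that `errors="coerce"` turns unparseable entries into `NaN` and that `groupby(level=0).sum()` then adds up the rows that share an index label.

```python
import pandas as pd

# Hypothetical stand-in for the per-pixel statistics table. The index labels
# and the values are invented; only the column name is taken from the notebook.
toy_statistics = pd.DataFrame(
    {"objectId: row_count": ["100", "50", "n/a", "7"]},
    index=pd.Index(["pixel_a", "pixel_a", "pixel_b", "pixel_c"], name="pixel"),
)

# Entries that cannot be parsed as numbers become NaN instead of raising.
toy_counts = pd.to_numeric(toy_statistics["objectId: row_count"], errors="coerce")

# Rows sharing the same index label are summed; NaN values are skipped.
toy_pixel_counts = toy_counts.groupby(level=0).sum()
print(toy_pixel_counts)
```

In this toy case a pixel whose only entry failed to parse ends up with a total of 0, so it may be worth checking `toy_counts.isna().sum()` (or the equivalent on the real data) before computing percentiles.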

In [5]:
partition_indices = []
for percentile in [10, 50, 90]:
    # Row count at this percentile of the per-pixel distribution
    q = np.percentile(pixel_counts, percentile)
    print(f"Percentile: {percentile}, value: {q}")
    # Position of the partition whose row count is closest to that value
    index = int(np.argmin(np.abs(pixel_counts - q)))
    closest_value = pixel_counts.iloc[index]
    print(f"Closest value: {closest_value}, partition index: {index}")
    partition_indices.append(index)

Percentile: 10, value: 1786.0
Closest value: 1789, partition index: 138
Percentile: 50, value: 5240.0
Closest value: 5240, partition index: 255
Percentile: 90, value: 11744.6
Closest value: 11738, partition index: 21
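
The selection step above (pick the partition whose row count is nearest to a given percentile) can be checked in isolation. The following is a minimal sketch on made-up counts; `toy_pixel_counts` and the helper `closest_position` are hypothetical names introduced here, not part of lsdb or of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up row counts for seven partitions (illustrative values only).
toy_pixel_counts = pd.Series([120, 900, 450, 3000, 75, 1600, 2200])


def closest_position(counts: pd.Series, percentile: float) -> int:
    """Return the position of the entry closest to the given percentile."""
    target = np.percentile(counts, percentile)
    # Work on the underlying array so the result is a plain position,
    # independent of the Series index labels.
    return int(np.argmin(np.abs(counts.to_numpy() - target)))


positions = [closest_position(toy_pixel_counts, p) for p in (10, 50, 90)]
for percentile, position in zip((10, 50, 90), positions):
    print(f"Percentile {percentile}: position {position}, "
          f"count {toy_pixel_counts.iloc[position]}")
```

The returned value is a position within the counts series. Using it as a partition identifier, as the notebook does, assumes the series is ordered the same way as the catalog partitions.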


In [6]:
for index in partition_indices:
    print(f"Sampling partition {index} of size {pixel_counts.iloc[index]}")
    %timeit object_collection.sample(index, n=100, seed=10)

Sampling partition 138 of size 1789
663 ms ± 8.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sampling partition 255 of size 5240
1.35 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sampling partition 21 of size 11738
722 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
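
`%timeit` is an IPython magic, so the cell above only runs inside a notebook or IPython session. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. In the sketch below, `sample_stand_in` is a hypothetical placeholder for a call such as `object_collection.sample(...)`, and using the population standard deviation to mirror the `%timeit` summary is an assumption.

```python
import statistics
import timeit


def sample_stand_in():
    # Hypothetical placeholder workload; swap in the real sampling call.
    return sum(range(10_000))


# Seven runs of one loop each, mirroring "7 runs, 1 loop each" above.
run_times = timeit.repeat(sample_stand_in, repeat=7, number=1)

mean_time = statistics.mean(run_times)
std_time = statistics.pstdev(run_times)  # population std. dev. (assumption)
print(f"{mean_time * 1e3:.3f} ms ± {std_time * 1e3:.3f} ms per loop "
      f"(mean ± std. dev. of {len(run_times)} runs, 1 loop each)")
```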
