# Download DIA OR4

Author: Melissa

Last updated: 2025-01-13 by Sandro

This notebook pulls the DIA object, source, and forced source tables out of the APDB. Once the full set of data has been dumped to Parquet, we will import it into HATS in another notebook.

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm
from lsst.analysis.ap import apdb

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
plt.set_loglevel('WARNING')

In [2]:
out_dir = "/sdf/data/rubin/shared/lsdb_commissioning/or4_dia/raw"

Based on columns and query structure from Neven's notebook:

https://github.com/lsst-sitcom/notebooks_dia/blob/main/OR4/N_obj_src_truth_and_det.ipynb

In [3]:
columns = [
    "diaSourceId", 
    "diaObjectId", 
    "ra", 
    "dec", 
    "raErr", 
    "decErr", 
    "midpointMjdTai", 
    "psfFlux", 
    "psfFluxErr", 
    "scienceFlux", 
    "scienceFluxErr", 
    "snr", 
    "band", 
    "visit",
    "detector",
    "x",
    "xErr",
    "y",
    "yErr",
    "time_processed", 
    "time_withdrawn",
    "isDipole",
    "centroid_flag",
    "apFlux_flag",
    "apFlux_flag_apertureTruncated",
    "psfFlux_flag",
    "psfFlux_flag_edge",
    "psfFlux_flag_noGoodPixels",
    "trail_flag_edge",
    "forced_PsfFlux_flag",
    "forced_PsfFlux_flag_edge",
    "forced_PsfFlux_flag_noGoodPixels",
    "shape_flag",
    "shape_flag_no_pixels",
    "shape_flag_not_contained",
    "shape_flag_parent_source",
    "pixelFlags",
    "pixelFlags_bad",
    "pixelFlags_cr",
    "pixelFlags_crCenter",
    "pixelFlags_edge",
    "pixelFlags_interpolated",
    "pixelFlags_interpolatedCenter",
    "pixelFlags_offimage",
    "pixelFlags_saturated",
    "pixelFlags_saturatedCenter",
    "pixelFlags_suspect",
    "pixelFlags_suspectCenter",
    "pixelFlags_streak",
    "pixelFlags_streakCenter",
    "pixelFlags_injected",
    "pixelFlags_injectedCenter",
    "pixelFlags_injected_template",
    "pixelFlags_injected_templateCenter",
    "reliability"
]

# Convert list of columns into a comma-separated string
columns_string = ', '.join(f'"{col}"' for col in columns)
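The `columns_string` built above is not used by the `SELECT *` queries later in this notebook. As a minimal sketch of how such a string could restrict a query to the listed columns, the snippet below uses a hypothetical helper `build_select`, a placeholder schema name, and a short stand-in column list. None of these come from the original notebook.

```python
def build_select(schema: str, table: str, columns: list[str]) -> str:
    """Return a SELECT statement with every identifier double-quoted."""
    # Double quotes preserve the mixed-case column names in PostgreSQL.
    columns_string = ", ".join(f'"{col}"' for col in columns)
    return f'SELECT {columns_string} FROM "{schema}"."{table}"'


# Hypothetical schema name and short column list, standing in for the real ones.
query = build_select("my_schema", "DiaSource", ["diaSourceId", "ra", "dec"])
print(query)
```

Whether every column in the list exists in a given APDB schema version would need to be checked before running such a query.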

Let's connect to APDB to get the LSST ComCam simulation data (OR4):

In [3]:
schema = 'jeremym_ppdb_replication_test_3'
instrument = 'LSSTComCamSim'
apdbQuery = apdb.ApdbPostgresQuery(instrument=instrument, namespace=schema)

Let's count the rows in the DiaObject table so that we can size the chunked queries that dump the data to Parquet:

In [5]:
with apdbQuery.connection as connection:
    src4_field = pd.read_sql_query(f'''
        SELECT count("diaObjectId")
        FROM "{schema}"."DiaObject"
    ''', connection)
src4_field

      count
0  12222552
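The row count determines how many `LIMIT`/`OFFSET` chunks are needed. A small sketch of that arithmetic, using the count reported above and a chunk size of 500,000 rows; `chunk_offsets` is a hypothetical helper, not part of the original notebook:

```python
import math


def chunk_offsets(n_rows: int, chunk_size: int) -> range:
    """Offsets that cover n_rows in chunks of chunk_size rows."""
    n_chunks = math.ceil(n_rows / chunk_size)
    return range(0, n_chunks * chunk_size, chunk_size)


offsets = chunk_offsets(12_222_552, 500_000)
print(len(offsets), offsets[-1])
```

With these numbers, 25 chunks are enough. A loop with a fixed upper bound of 13,000,000 runs 26 iterations, and the last one starts past the final row.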


In [6]:
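with apdbQuery.connection as connection:
    # Dump DiaObject in chunks of 500,000 rows, one Parquet file per chunk.
    # Caveat: without an ORDER BY clause, PostgreSQL does not guarantee a
    # stable row order, so LIMIT/OFFSET chunks may overlap or skip rows.
    for lower in tqdm(range(0, 13_000_000, 500_000)):
        src4_field = pd.read_sql_query(f'''
            SELECT *
            FROM "{schema}"."DiaObject"
            LIMIT 500000
            OFFSET {lower}
        ''', connection)
        src4_field.to_parquet(f"{out_dir}/object/object_{lower}.parquet")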

100%|██████████| 26/26 [15:30<00:00, 35.78s/it]
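Paging with `LIMIT`/`OFFSET` is only reproducible when the query has a deterministic order. The sketch below illustrates ordered paging on an in-memory SQLite table, which stands in for the APDB purely for illustration. The table `obj` and its columns `id` and `validity` are hypothetical, and the real DiaObject key columns should be checked against the schema before adding an `ORDER BY` to the queries in this notebook.

```python
import sqlite3

# In-memory SQLite database standing in for the APDB (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obj (id INTEGER, validity INTEGER, PRIMARY KEY (id, validity))"
)
conn.executemany(
    "INSERT INTO obj VALUES (?, ?)",
    [(i % 4, i) for i in range(10)],
)

chunk_size = 3
rows = []
offset = 0
while True:
    # ORDER BY on a unique key makes each LIMIT/OFFSET page deterministic.
    page = conn.execute(
        "SELECT id, validity FROM obj ORDER BY id, validity LIMIT ? OFFSET ?",
        (chunk_size, offset),
    ).fetchall()
    if not page:
        break  # stop once a page comes back empty
    rows.extend(page)
    offset += chunk_size

print(len(rows), len(set(rows)))
```

Stopping on the first empty page also removes the need to hard-code an upper bound for the offset range.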


In [10]:
with apdbQuery.connection as connection:
    # Dump DiaSource in chunks of 1,000,000 rows, one Parquet file per chunk.
    for lower in tqdm(range(0, 13_000_000, 1_000_000)):
        src4_field = pd.read_sql_query(f'''
            SELECT *
            FROM "{schema}"."DiaSource"
            LIMIT 1000000
            OFFSET {lower}
        ''', connection)
        src4_field.to_parquet(f"{out_dir}/source/source_{lower}.parquet")

100%|██████████| 13/13 [16:28<00:00, 76.05s/it]


In [12]:
with apdbQuery.connection as connection:
    # Dump DiaForcedSource in chunks of 1,000,000 rows, one Parquet file per chunk.
    for lower in tqdm(range(0, 70_000_000, 1_000_000)):
        src4_field = pd.read_sql_query(f'''
            SELECT *
            FROM "{schema}"."DiaForcedSource"
            LIMIT 1000000
            OFFSET {lower}
        ''', connection)
        src4_field.to_parquet(f"{out_dir}/forced/forced_{lower}.parquet")

100%|██████████| 70/70 [16:34<00:00, 14.21s/it]


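Only about 15% of the DiaObject rows have a distinct diaObjectId, so roughly 85% of the rows are repeated versions of objects already present. We will keep only the row with the latest validity for each object in the upcoming notebooks.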

In [9]:
with apdbQuery.connection as connection:
    src4_field = pd.read_sql_query(f'''
        SELECT cast(count(distinct("diaObjectId")) as float) / count("diaObjectId") as unique_id_ratio
        FROM "{schema}"."DiaObject"
    ''', connection)
src4_field

   unique_id_ratio
0         0.146611
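As a rough sketch of the latest-validity selection deferred to the upcoming notebooks, the snippet below keeps one row per `diaObjectId` in pandas. It assumes each row carries a validity-start column, called `validityStart` here; the actual column name depends on the APDB schema version. The data are a made-up toy table, not values from the APDB.

```python
import pandas as pd

# Toy stand-in for the dumped DiaObject rows; values are made up for illustration.
df = pd.DataFrame(
    {
        "diaObjectId": [1, 1, 2, 3, 3, 3],
        "validityStart": [10, 20, 5, 1, 2, 3],  # assumed column name
        "nDiaSources": [1, 2, 1, 1, 2, 3],
    }
)

# Sort by validity so that keep="last" retains the most recent version
# of each object.
latest = (
    df.sort_values("validityStart")
    .drop_duplicates(subset="diaObjectId", keep="last")
    .sort_values("diaObjectId")
    .reset_index(drop=True)
)

# Same ratio as the SQL query above: distinct IDs divided by total rows.
unique_id_ratio = df["diaObjectId"].nunique() / len(df)
print(latest)
print(unique_id_ratio)
```

On the real dump this would have to run across all Parquet chunks together, since versions of the same object can land in different files.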
