In [1]:
import os
import math
import numpy as np
import torch
import pandas as pd

In [2]:
import GCRCatalogs
# Command line: within virtual environment run 
# python -m pip install https://github.com/LSSTDESC/gcr-catalogs/archive/v1.4.0.tar.gz#egg=GCRCatalogs[full]

In [3]:
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier/')
GCRCatalogs.get_root_dir()
# The root directory must be set following the instructions at https://data.lsstdesc.org/doc/install_gcr

'/nfs/turbo/lsa-regier/'

In [4]:
# List of public catalog names
GCRCatalogs.get_public_catalog_names()

['desc_cosmodc2',
 'desc_dc2_run2.2i_dr6_object',
 'desc_dc2_run2.2i_dr6_object_with_truth_match',
 'desc_dc2_run2.2i_dr6_truth',
 'desc_dc2_run2.2i_truth_galaxy_summary',
 'desc_dc2_run2.2i_truth_sn_summary',
 'desc_dc2_run2.2i_truth_sn_variability',
 'desc_dc2_run2.2i_truth_star_summary',
 'desc_dc2_run2.2i_truth_star_variability']

We are mainly interested in the truth catalogs, which provide the redshift and photometry values we need. It is not yet clear whether we want `desc_dc2_run2.2i_dr6_truth` or `desc_dc2_run2.2i_dr6_object_with_truth_match`; the truth match catalog appears to contain information from the object catalog as well. We start by exploring the fields available in the truth catalog.

In [5]:
truth_cat = GCRCatalogs.load_catalog('desc_dc2_run2.2i_dr6_truth')
truth_cat.list_all_quantities()

['flux_r',
 'flux_z',
 'mag_y',
 'host_galaxy',
 'av',
 'match_sep',
 'mag_z',
 'flux_u',
 'is_unique_truth_entry',
 'ra',
 'match_objectId',
 'rv',
 'mag_i',
 'is_nearest_neighbor',
 'cosmodc2_hp',
 'flux_y',
 'cosmodc2_id',
 'id',
 'patch',
 'mag_r',
 'flux_i',
 'id_string',
 'truth_type',
 'dec',
 'redshift',
 'mag_g',
 'flux_g',
 'tract',
 'mag_u',
 'is_good_match']
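
To make the exploration above a little more concrete, here is a minimal sketch that groups the listed quantity names by role. The list is copied from the output above; the variable names (`flux_cols`, `mag_cols`, `match_cols`) and the choice of which fields count as "match bookkeeping" are illustrative assumptions, not part of GCRCatalogs.

```python
# Quantity names copied from the list_all_quantities() output above.
quantities = [
    "flux_r", "flux_z", "mag_y", "host_galaxy", "av", "match_sep", "mag_z",
    "flux_u", "is_unique_truth_entry", "ra", "match_objectId", "rv", "mag_i",
    "is_nearest_neighbor", "cosmodc2_hp", "flux_y", "cosmodc2_id", "id",
    "patch", "mag_r", "flux_i", "id_string", "truth_type", "dec", "redshift",
    "mag_g", "flux_g", "tract", "mag_u", "is_good_match",
]

bands = "ugrizy"  # the six bands that appear in the quantity names


def band_order(name):
    """Sort key: position of the trailing band letter in 'ugrizy'."""
    return bands.index(name[-1])


# Per-band photometry columns, sorted in band order.
flux_cols = sorted((q for q in quantities if q.startswith("flux_")), key=band_order)
mag_cols = sorted((q for q in quantities if q.startswith("mag_")), key=band_order)

# Fields that look like truth-to-object match bookkeeping (an assumption
# based only on their names).
match_cols = [q for q in quantities if q.startswith(("match_", "is_"))]

print(flux_cols)
print(mag_cols)
print(match_cols)
```

A list such as `flux_cols` could then be passed to `get_quantities` together with the identifier, position, and redshift fields.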

Per Xinyue's work, the match files contain the PSF and galaxy parameters that are unique to DC2. See the discussion at the top of this notebook: https://github.com/prob-ml/bliss/blob/dc2_script/case_studies/dc2/DC2_galaxy_psf_params.ipynb

These PSF and galaxy parameters may need to be incorporated into the BLISS forward model if the goal is inference on DC2 images specifically. For now we set that aside and focus on extracting the relevant fluxes and redshifts; the PSF and galaxy specifics can be added later.

In [6]:
# This cell may take a long time to run (about 30-50 minutes for me).
# We extract the relevant quantities and save them for later.
data = truth_cat.get_quantities([
    "id", "match_objectId", "cosmodc2_id", "ra", "dec", "truth_type", 
    "redshift", "flux_g", "flux_i", "flux_r", "flux_u", "flux_y", "flux_z"
])
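
The comment above mentions saving the extracted quantities, but the notebook does not show how. Below is a minimal sketch of one way to do it with `np.savez`, assuming every column is a plain numeric NumPy array. The dictionary `data_demo` is a tiny synthetic stand-in for `data`, and the file name `dc2_truth_subset.npz` is hypothetical.

```python
import os
import tempfile

import numpy as np

# Tiny synthetic stand-in for the dict of arrays returned by get_quantities().
data_demo = {
    "id": np.array([1, 2, 3], dtype=np.int64),
    "redshift": np.array([0.1, 0.5, 1.2]),
    "flux_i": np.array([1500.0, 800.0, 3200.0]),
}

# Hypothetical output location; replace with a path on your own storage.
out_path = os.path.join(tempfile.mkdtemp(), "dc2_truth_subset.npz")

# Store each column as a named array in a single .npz archive.
np.savez(out_path, **data_demo)

# Reload the archive into a plain dict so the long extraction need not be rerun.
with np.load(out_path) as saved:
    restored = {name: saved[name] for name in saved.files}

print(sorted(restored))
```

For the full catalog a columnar format such as Parquet may be a better fit, but that requires an extra dependency and is not shown here.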

In [7]:
data['id'].shape

(764026213,)

There are more than 764 million objects. We can get more metadata about the catalog with `get_catalog_info()`. See https://yymao.github.io/generic-catalog-reader/#GCR.BaseGenericCatalog.get_quantities for more documentation.

In [8]:
truth_cat.get_catalog_info()

{'subclass_name': 'dc2_truth_match.DC2TruthMatchCatalog',
 'base_dir': '/nfs/turbo/lsa-regier/lsstdesc-public/dc2/run2.2i-dr6-v4/truth_match',
 'filename_pattern': 'truth_tract\\d+.parquet$',
 'as_truth_table': True,
 'description': 'DESC DC2 (Run2.2i) DR6 Truth Table for the Public Release v4',
 'creators': 'Rubin LSST DESC DC2 Team',
 'public_release': ['v4']}
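
With roughly 764 million rows, it is worth estimating how much memory the extracted dictionary occupies. The back-of-the-envelope sketch below assumes that each of the 13 requested quantities is stored as an 8-byte value (int64 or float64); the actual dtypes were not checked here, so treat the result as an order-of-magnitude figure.

```python
n_rows = 764_026_213   # from data['id'].shape above
n_columns = 13         # number of quantities requested in get_quantities()
bytes_per_value = 8    # assumption: every column is int64 or float64

total_bytes = n_rows * n_columns * bytes_per_value

# Report in both decimal gigabytes and binary gibibytes.
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.1f} GiB)")
```

Under these assumptions the full extraction needs on the order of 80 GB of RAM, which is one practical reason to filter the data down.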

We may want to filter some of these objects out. For example, to start, we could consider only redshifts between 0 and 1 and aim to infer these to a reasonably high degree of accuracy. The queries probably just iterate through the data, so this cell may also take a long time. This follows the example here: https://github.com/LSSTDESC/gcr-catalogs/blob/master/examples/GCRCatalogs%20Demo.ipynb

In [9]:
from GCR import GCRQuery

# Choose a small RA and Dec range so that the example subset stays small.
ra_min, ra_max = 55.5, 56.0
dec_min, dec_max = -29.0, -28.5
redshift_min, redshift_max = 0.0, 1.0

coord_cut = GCRQuery(
    'ra >= {}'.format(ra_min),
    'ra < {}'.format(ra_max),
    'dec >= {}'.format(dec_min),
    'dec < {}'.format(dec_max),
)

redshift_cut = GCRQuery(
    'redshift >= {}'.format(redshift_min),
    'redshift < {}'.format(redshift_max),
)

# Despite the name, this cut is on i-band flux: keep finite values above 1e3.
magnitude_filters = GCRQuery(
    (np.isfinite, 'flux_i'),
    'flux_i > 1e3',
)

data_subset = (coord_cut & magnitude_filters & redshift_cut).filter(data)
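
For comparison, the same selection can be written as a plain NumPy boolean mask over a dictionary of arrays. This is a sketch on synthetic data (`data_demo` and `subset_demo` are illustrative names), intended only to make explicit what the combined query selects; it is not how GCR implements `filter` internally.

```python
import numpy as np

# Synthetic stand-in for `data`: five objects, only the first passes every cut.
data_demo = {
    "id": np.array([10, 11, 12, 13, 14]),
    "ra": np.array([55.6, 55.9, 56.2, 55.7, 55.8]),
    "dec": np.array([-28.9, -28.6, -28.7, -29.5, -28.8]),
    "redshift": np.array([0.3, 1.4, 0.5, 0.2, 0.9]),
    "flux_i": np.array([2000.0, 5000.0, 3000.0, 4000.0, np.nan]),
}

ra_min, ra_max = 55.5, 56.0
dec_min, dec_max = -29.0, -28.5
redshift_min, redshift_max = 0.0, 1.0

# Each comparison yields a boolean array; & combines them element-wise.
mask = (
    (data_demo["ra"] >= ra_min) & (data_demo["ra"] < ra_max)
    & (data_demo["dec"] >= dec_min) & (data_demo["dec"] < dec_max)
    & (data_demo["redshift"] >= redshift_min)
    & (data_demo["redshift"] < redshift_max)
    & np.isfinite(data_demo["flux_i"]) & (data_demo["flux_i"] > 1e3)
)

# Apply the same mask to every column so the rows stay aligned.
subset_demo = {name: values[mask] for name, values in data_demo.items()}

print(subset_demo["id"])
```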

In [10]:
data_subset['id'].shape

(13185,)

This cut is very restrictive because of the RA and Dec ranges specified; it serves only as an example. We will not need such positional cuts, but we may add cuts on flux or magnitude.
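
Since later cuts may be expressed in either flux or magnitude, the sketch below converts between the two. It assumes the catalog fluxes are in nanojansky (nJy), which this notebook does not verify; under that assumption the AB magnitude is `-2.5 * log10(flux) + 31.4`. The function names are illustrative.

```python
import math

# AB zero point for fluxes in nJy: 2.5 * log10(3631e9) is approximately 31.4.
AB_ZEROPOINT_NJY = 31.4


def flux_njy_to_ab_mag(flux_njy):
    """Convert a flux in nJy (assumed unit) to an AB magnitude."""
    if not flux_njy > 0:  # also catches NaN and negative fluxes
        return math.nan
    return -2.5 * math.log10(flux_njy) + AB_ZEROPOINT_NJY


def ab_mag_to_flux_njy(mag):
    """Convert an AB magnitude back to a flux in nJy."""
    return 10 ** ((AB_ZEROPOINT_NJY - mag) / 2.5)


# The flux threshold used in the example cut, expressed as a magnitude.
threshold_mag = flux_njy_to_ab_mag(1e3)
print(round(threshold_mag, 2))
```

If the nJy assumption holds, the `flux_i > 1e3` threshold used above corresponds to an i-band magnitude limit of about 23.9.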