## Investigating Star & Galaxy Redshifts

In [None]:
import os
import math
import numpy as np
import torch
import pandas as pd
import GCRCatalogs
from GCR import GCRQuery

In [None]:
# Command line: within virtual environment run 
# python -m pip install https://github.com/LSSTDESC/gcr-catalogs/archive/v1.4.0.tar.gz#egg=GCRCatalogs[full]

In [None]:
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier/')
GCRCatalogs.get_root_dir()
# need to do this in accordance with instructions at https://data.lsstdesc.org/doc/install_gcr

In [None]:
# List of public catalog names
GCRCatalogs.get_public_catalog_names()

We're mainly interested in the truth files, which contain the redshift and photometry values we need. Let's load a truth file and list the fields available to us.

In [None]:
truth_cat = GCRCatalogs.load_catalog('desc_dc2_run2.2i_dr6_truth')
truth_cat.list_all_quantities()

Jacky found the following link that gives details on some fields: https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md
        

In particular, we have that `truth_type==2` corresponds to stars, and `truth_type==1` for galaxies. However, Jacky also found that all stars had zero redshift, which was very strange. Let's investigate.

**The cell below will take a few minutes to run.**

In [None]:
data = truth_cat.get_quantities(["truth_type", "redshift"])

In [None]:
only_stars = GCRQuery('truth_type == 2')
data_only_stars = only_stars.filter(data)

In [None]:
data_only_stars.keys()

In [None]:
data_only_stars['truth_type'][:10]

In [None]:
data_only_stars['redshift'].shape

In [None]:
(data_only_stars['redshift'] != 0).sum()

Indeed, all stars have redshift zero. 

In [None]:
(data['redshift'] != 0).sum()

In [None]:
data['redshift'].shape

Comparing the shapes above, the truth catalog contains about 764 million objects. Only about 4 million of these are stars, and about 5 million objects in total have `redshift == 0`, so the stars account for most of the zero-redshift objects. Let's focus on galaxies instead and filter for them.
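
The same bookkeeping can be checked on toy data. The sketch below is a minimal NumPy-only illustration and is not part of the original analysis: the arrays are invented, and `truth_type == 3` is a hypothetical third class used only to show that zero-redshift objects need not all be stars. The only convention taken from the catalog is that `truth_type == 2` marks stars.

```python
import numpy as np

# Toy stand-in for the dict returned by get_quantities (all values invented).
toy = {
    "truth_type": np.array([1, 1, 2, 2, 1, 3]),
    "redshift": np.array([0.5, 1.2, 0.0, 0.0, 0.8, 0.0]),
}

is_star = toy["truth_type"] == 2
is_zero_z = toy["redshift"] == 0

n_stars = int(is_star.sum())                            # objects flagged as stars
n_zero_z = int(is_zero_z.sum())                         # objects with redshift == 0
n_zero_z_not_star = int((is_zero_z & ~is_star).sum())   # zero redshift, but not a star

print(n_stars, n_zero_z, n_zero_z_not_star)
```

Running the equivalent of the last count on the real `data` dict would show which non-star objects make up the remaining zero-redshift entries.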

In [None]:
only_galaxies = GCRQuery('truth_type == 1')
data_only_galaxies = only_galaxies.filter(data)

In [None]:
data_only_galaxies['redshift'].shape

In [None]:
(data_only_galaxies['redshift'] != 0).sum()

For galaxies, by contrast, every single redshift is nonzero, which is precisely what we want. Let's extract these galaxies to `.csv` for use in our data simulation process. We might as well extract galaxy and other parameters too, in case we want them later on. We will investigate the contents of the other catalogs and merge as required.

## Merge Catalog (IGNORE THESE CELLS)

In [None]:
# A bit wonky with filepaths and where stuff is located
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier/lsstdesc-public/dc2/')
GCRCatalogs.get_root_dir()

In [None]:
match_cat = GCRCatalogs.load_catalog('desc_dc2_run2.2i_dr6_object_with_truth_match')

In [None]:
quantities = match_cat.list_all_quantities()

In [None]:
quantities

In [None]:
list(filter(lambda k: 'redshift' in k, quantities))

I'm unsure what `redshift_truth` is and how it differs from the `redshift` field in the other catalog. I have reason to believe this catalog is deprecated because it's from v2 rather than v4 of the public GCRCatalogs releases: https://github.com/LSSTDESC/gcr-catalogs/releases

I'm going to ignore this catalog for now. It appears to only really have PSF stuff anyway, which we don't really need for our purposes?

## CosmoDC2 Catalog + Merging With Truth Catalog

In [1]:
import os
import math
import numpy as np
import torch
import pandas as pd
import GCRCatalogs
from GCR import GCRQuery

In [2]:
# Command line: within virtual environment run 
# python -m pip install https://github.com/LSSTDESC/gcr-catalogs/archive/v1.4.0.tar.gz#egg=GCRCatalogs[full]

In [3]:
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier/')
GCRCatalogs.get_root_dir()
# need to do this in accordance with instructions at https://data.lsstdesc.org/doc/install_gcr

'/nfs/turbo/lsa-regier/'

In [4]:
# List of public catalog names
GCRCatalogs.get_public_catalog_names()

['desc_cosmodc2',
 'desc_dc2_run2.2i_dr6_object',
 'desc_dc2_run2.2i_dr6_object_with_truth_match',
 'desc_dc2_run2.2i_dr6_truth',
 'desc_dc2_run2.2i_truth_galaxy_summary',
 'desc_dc2_run2.2i_truth_sn_summary',
 'desc_dc2_run2.2i_truth_sn_variability',
 'desc_dc2_run2.2i_truth_star_summary',
 'desc_dc2_run2.2i_truth_star_variability']

As before, we load the truth catalog and list the fields available to us.

In [5]:
truth_cat = GCRCatalogs.load_catalog('desc_dc2_run2.2i_dr6_truth')
truth_cat.list_all_quantities()

['flux_g',
 'cosmodc2_hp',
 'mag_i',
 'patch',
 'mag_y',
 'flux_u',
 'tract',
 'dec',
 'flux_r',
 'av',
 'id',
 'redshift',
 'rv',
 'cosmodc2_id',
 'ra',
 'match_objectId',
 'is_unique_truth_entry',
 'flux_y',
 'mag_g',
 'is_nearest_neighbor',
 'host_galaxy',
 'mag_z',
 'is_good_match',
 'flux_i',
 'truth_type',
 'match_sep',
 'flux_z',
 'mag_r',
 'mag_u',
 'id_string']

First let's reload all relevant quantities from the truth catalog we're familiar with.

**The cell below will take a long time to run.**

In [6]:
all_truth_data = {}
quantities = ["flux_u", "flux_g", "flux_r", "flux_i", "flux_z", "flux_y",
             "mag_u", "mag_g", "mag_r", "mag_i", "mag_z", "mag_y",
             "truth_type", "redshift",
             "id", "match_objectId", "cosmodc2_id", "id_string"]
for q in quantities:
    this_field = truth_cat.get_quantities([q])
    all_truth_data[q] = this_field[q]
    print('Finished {}'.format(q))
    

Finished flux_u
Finished flux_g
Finished flux_r
Finished flux_i
Finished flux_z
Finished flux_y
Finished mag_u
Finished mag_g
Finished mag_r
Finished mag_i
Finished mag_z
Finished mag_y
Finished truth_type
Finished redshift
Finished id
Finished match_objectId
Finished cosmodc2_id
Finished id_string


In [7]:
truth_data = pd.DataFrame(all_truth_data)

In [8]:
truth_data.shape

(764026213, 18)

In [9]:
truth_data.head(10)

Unnamed: 0,flux_u,flux_g,flux_r,flux_i,flux_z,flux_y,mag_u,mag_g,mag_r,mag_i,mag_z,mag_y,truth_type,redshift,id,match_objectId,cosmodc2_id,id_string
0,5678.708984,5577.517578,6334.502441,8848.515625,15267.947266,19116.740234,22.014378,22.033897,21.895718,21.532824,20.94055,20.696468,1,1.050468,10940305839,11975906419540343,10940305839,10940305839
1,146.518021,1341.131714,5984.994629,12850.581055,18818.283203,22972.25,25.985275,23.581322,21.95734,21.127697,20.713551,20.496994,1,0.474819,10937870093,11975906419541206,10937870093,10937870093
2,134.074341,272.126617,766.903015,2072.049561,2690.330322,3052.688965,26.081638,25.313074,24.188148,23.108999,22.825487,22.688293,1,0.759036,11563663598,11976043858493441,11563663598,11563663598
3,932.008118,1224.280762,2598.827148,6678.408691,9719.280273,11019.9375,23.976452,23.680298,22.863056,21.838318,21.430914,21.294554,1,0.808502,10938869183,11976043858493443,10938869183,10938869183
4,35.554699,104.535225,462.301849,1221.936035,2384.806396,2966.539062,27.522758,26.351845,24.737686,23.682381,22.956369,22.719376,1,0.849298,11564005688,11976043858493737,11564005688,11564005688
5,577.276917,648.449158,943.12561,2090.335938,2522.676758,2646.651123,24.49654,24.37031,23.963577,23.099461,22.895346,22.843258,1,0.822614,11563831110,11976043858493738,11563831110,11563831110
6,893.910278,838.389343,976.022888,1568.729858,2002.057739,2065.459961,24.021767,24.091387,23.926352,23.411131,23.146309,23.112459,1,0.92938,11564445231,11976043858493739,11564445231,11564445231
7,610.097229,828.885193,1807.437744,2516.555664,2857.029785,3111.177246,24.436504,24.103765,23.257341,22.897985,22.760212,22.66769,1,0.517864,11562943167,11976043858493740,11562943167,11562943167
8,607.974792,1396.32312,2972.82251,3748.65332,4305.629395,4791.282715,24.440289,23.537537,22.717077,22.465313,22.314909,22.19887,1,0.307164,10937620679,11976043858493741,10937620679,10937620679
9,287.352844,495.052368,1347.424683,1959.870117,2078.425049,2199.094727,25.253963,24.663374,23.576241,23.169432,23.105663,23.044392,1,0.572746,11563034476,11976043858493742,11563034476,11563034476


In [10]:
truth_data.memory_usage().sum()/1e9 # in GB

70.290411724
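
As a quick sanity check on this figure, the per-row footprint can be backed out from the two outputs above. This is only a sketch: the 128 bytes attributed to the index is an assumption about how pandas reports a default `RangeIndex`, and it may differ between pandas versions.

```python
# Numbers copied from the outputs above.
n_rows = 764_026_213
total_bytes = 70_290_411_724

# Assumption: the default RangeIndex contributes 128 bytes to memory_usage().
index_bytes = 128

# Average number of bytes stored per row across the 18 columns.
bytes_per_row = (total_bytes - index_bytes) / n_rows
print(bytes_per_row)
```

Under that assumption the table takes 92 bytes per row. Note that `memory_usage()` without `deep=True` counts each object-dtype entry as an 8-byte pointer, so if a column such as `id_string` is stored as Python strings, the true footprint is larger than 70 GB.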

In [11]:
truth_data.to_csv('/data/scratch/declan/dc2_truth.csv')

The `cosmo` catalog contains cosmological parameter values. It covers only galaxies, not stars. Again, more information can be found here: https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md

In [None]:
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier')
GCRCatalogs.get_root_dir()

In [None]:
cosmo_cat = GCRCatalogs.load_catalog('desc_cosmodc2')

In [None]:
cosmo_cat.list_all_quantities()

There are a lot of parameters here. Following @Xinyue's notebook, I will extract the galaxy shape parameters listed below (disk and bulge sizes, etc.). In addition, I'll extract the magnitudes in all six bands and every field whose name contains `redshift`; this picks up all the different redshift variants, which we can sort out later. We also extract `galaxy_id`, which should let us merge with our truth table.

In [None]:
gal_params = [
    "galaxy_id", "position_angle_true", "size_minor_disk_true", 
    "size_disk_true", "size_minor_bulge_true", 
    "size_bulge_true", "bulge_to_total_ratio_i"
]

In [None]:
#mag_fields = list(filter(lambda k: 'mag' in k, cosmo_cat.list_all_quantities()))
mag_fields = ['mag_u', 'mag_g', 'mag_r', 'mag_i', 'mag_z', 'mag_y']

In [None]:
redshift_fields = list(filter(lambda k: 'redshift' in k, cosmo_cat.list_all_quantities()))

In [None]:
to_extract = gal_params + mag_fields + redshift_fields
to_extract
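
The name-based selection above is just a substring match. The sketch below runs it on a short, hypothetical list of quantity names (chosen for illustration, not read from the catalog) and shows one plausible reason for spelling out `mag_fields` by hand: matching on `'mag'` also picks up every other name that contains those letters.

```python
# Hypothetical quantity names, for illustration only.
all_quantities = [
    "galaxy_id", "redshift", "redshift_true",
    "mag_r", "mag_true_r", "magnification", "size_disk_true",
]

# Same substring test as the filter/lambda cells above, as list comprehensions.
redshift_fields = [q for q in all_quantities if "redshift" in q]
mag_fields_loose = [q for q in all_quantities if "mag" in q]

print(redshift_fields)
print(mag_fields_loose)
```

Here the loose match returns `magnification` alongside the magnitude columns, so an explicit list is the safer choice when only the six band magnitudes are wanted.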

In [None]:
type(cosmo_cat)

In [None]:
type(cosmo_cat._file_list)

The CosmoDC2 catalog is enormous: loading everything would take about 4.6 TB. I'm not even sure how many objects it contains. If you inspect `_file_list`, you'll see all the files the catalog loads. Below, I manually edit this list to keep only the first `num_files` files, which results in a drastic speedup.

In [None]:
num_files = 7
to_keep = list(cosmo_cat._file_list.keys())[:num_files]
to_remove = list(cosmo_cat._file_list.keys())[num_files:]

In [None]:
for key in to_remove:
    if key in cosmo_cat._file_list:
        cosmo_cat._file_list.pop(key)
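
The trimming step can be tried on a plain dictionary first. This is a minimal sketch on a made-up mapping: `keep_first_n` and the file names are hypothetical, and it assumes only that the file list behaves like a dict that preserves insertion order, as the `.keys()` and `.pop()` calls above suggest.

```python
def keep_first_n(file_map, n):
    """Drop every entry of file_map except the first n, in insertion order."""
    # Materialize the keys first so the dict is not mutated while iterating.
    for key in list(file_map.keys())[n:]:
        file_map.pop(key)
    return file_map


# Hypothetical file names standing in for the catalog's real file list.
toy_files = {f"file_{i}.hdf5": None for i in range(10)}
keep_first_n(toy_files, 3)
print(list(toy_files))
```

Because `_file_list` is a private attribute, this trick may stop working in a different GCRCatalogs version.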

In [None]:
all_cosmo_data = {}
for q in to_extract:
    this_field = cosmo_cat.get_quantities([q])
    all_cosmo_data[q] = this_field[q]
    print('Finished {}'.format(q))

In [None]:
cosmo_data = pd.DataFrame(all_cosmo_data)

The number of objects below can be increased by reading more files above. Currently the file list is restricted to 7 files.

In [None]:
cosmo_data.shape

In [None]:
cosmo_data.head(10)

In [None]:
ids_in_both = np.intersect1d(np.array(cosmo_data['galaxy_id']), np.array(truth_data['cosmodc2_id']))

In [None]:
ids_in_both.shape

### Let's try to merge the two!

We merge the cosmological parameters *into* the truth catalog. The merge below uses `how="inner"`, so only objects present in both tables are kept. Following @Xinyue, the field we merge on is called `cosmodc2_id` in the truth catalog and just `galaxy_id` in the cosmoDC2 catalog.

First we filter each table to only the IDs in `ids_in_both`, which speeds up the merge. We could simply `set_index` on the two tables and merge directly, but I like seeing how many objects there are.
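
On toy data, the filter-then-merge steps described above look like the sketch below. The two small frames and their values are invented; only the column names `cosmodc2_id`, `galaxy_id`, `redshift`, `flux_r`, and `size_disk_true` come from this notebook.

```python
import numpy as np
import pandas as pd

# Invented miniature versions of the truth and cosmoDC2 tables.
truth = pd.DataFrame({
    "cosmodc2_id": [10, 20, 30],
    "redshift": [0.10, 0.20, 0.30],
    "flux_r": [1.0, 2.0, 3.0],
})
cosmo = pd.DataFrame({
    "galaxy_id": [20, 30, 40],
    "redshift": [0.21, 0.29, 0.40],
    "size_disk_true": [0.5, 0.6, 0.7],
})

# IDs present in both tables (sorted, unique).
ids_in_both = np.intersect1d(cosmo["galaxy_id"], truth["cosmodc2_id"])

# Index each table by its ID column and keep only the shared IDs.
truth_f = truth.set_index("cosmodc2_id").loc[ids_in_both]
cosmo_f = cosmo.set_index("galaxy_id").loc[ids_in_both]

# Merge on the ID index; overlapping column names get the _x / _y suffixes.
combined = truth_f.merge(cosmo_f, left_index=True, right_index=True, how="inner")
print(combined.columns.tolist())
```

Columns that exist in both tables come out with the pandas default suffixes, so `redshift_x` is the truth-catalog value and `redshift_y` is the cosmoDC2 value.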

In [None]:
filtered_cosmo_data = cosmo_data.set_index('galaxy_id').loc[ids_in_both]

In [None]:
filtered_truth_data = truth_data.set_index('cosmodc2_id').loc[ids_in_both]

In [None]:
filtered_cosmo_data.shape, filtered_truth_data.shape

In [None]:
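# Merge the ID-indexed, filtered tables. Merging the unfiltered frames on their
# default row index would pair up unrelated objects.
# Columns present in both tables (e.g. redshift) get the suffixes _x (truth) and _y (cosmoDC2).
combined_df = filtered_truth_data.merge(filtered_cosmo_data,
                                        left_index=True,
                                        right_index=True,
                                        how="inner")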

In [None]:
combined_df.shape

In [None]:
combined_df.head(10)

In [None]:
combined_df.filter(regex='redshift', axis=1).head(20)

In [None]:
import matplotlib.pyplot as plt
plt.hist(combined_df['redshift_y'].values)

Here's a link that doesn't seem very useful: https://irsa.ipac.caltech.edu/data/theory/Cosmosims/gator_docs/CosmoDC2_Mock_V1_Catalog.html

Another one: https://nbviewer.org/github/LSSTDESC/DC2-analysis/blob/rendered/tutorials/extragalactic_gcr_redshift_dist.nbconvert.ipynb

In [None]:
combined_df.to_csv('/data/scratch/declan/combined_truth.csv')

In [None]:
combined_df.shape

Per Xinyue's work, the match files contain information on the PSF and galaxy parameters that are unique to DC2. See the discussion at the top of the page here: https://github.com/prob-ml/bliss/blob/dc2_script/case_studies/dc2/DC2_galaxy_psf_params.ipynb

These PSF and galaxy parameters may need to be incorporated into the forward model for BLISS if the goal is inference on DC2 images specifically. We won't worry about that for now and will focus on extracting the relevant fluxes and redshifts. We can add in PSF and galaxy specifics later.

In [None]:
# This cell may take a long time to run, about 30-50 minutes for me
# We will extract relevant quantities and save them for later
data = truth_cat.get_quantities([
    "id", "match_objectId", "cosmodc2_id", "ra", "dec", "truth_type", 
    "redshift", "flux_g", "flux_i", "flux_r", "flux_u", "flux_y", "flux_z"
])

In [None]:
data['id'].shape

There are 764 million+ objects. We can get more meta-info by doing the below. See https://yymao.github.io/generic-catalog-reader/#GCR.BaseGenericCatalog.get_quantities for more documentation.

In [None]:
truth_cat.get_catalog_info()

We may want to filter some of these down. For example, to start, we could consider only redshifts between 0 and 1 and aim to infer these to a reasonably high degree of accuracy. The filters presumably iterate over the full data, so this cell may also take a long time. We follow the example here: https://github.com/LSSTDESC/gcr-catalogs/blob/master/examples/GCRCatalogs%20Demo.ipynb

In [None]:
from GCR import GCRQuery

# Let's choose a small RA and Dec range to do the matching so that it won't take too long!
ra_min, ra_max = 55.5, 56.0
dec_min, dec_max = -29.0, -28.5
redshift_min, redshift_max = 0.0, 1.0

coord_cut = GCRQuery(
    'ra >= {}'.format(ra_min),
    'ra < {}'.format(ra_max),
    'dec >= {}'.format(dec_min),
    'dec < {}'.format(dec_max),
)

redshift_cut = GCRQuery(
    'redshift >= {}'.format(redshift_min),
    'redshift < {}'.format(redshift_max),
)

magnitude_filters = GCRQuery(
    (np.isfinite, 'flux_i'),
    'flux_i > 1e3',
)

data_subset = (coord_cut & magnitude_filters & redshift_cut).filter(data)

In [None]:
data_subset['id'].shape

This cut is very restrictive because of the narrow RA and Dec ranges, which are used here only as an example. We will not need such coordinate cuts, but we may add cuts on flux or magnitude.
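
To make the logic of the combined cut concrete, the sketch below reproduces the same conditions with plain NumPy boolean masks on invented values. It is an analogy for what `(coord_cut & magnitude_filters & redshift_cut).filter(data)` selects, not a description of how GCR implements it.

```python
import numpy as np

# Invented values; one row passes every cut, the others each fail one.
toy = {
    "ra": np.array([55.6, 55.9, 57.0, 55.7]),
    "dec": np.array([-28.8, -28.6, -28.7, -28.9]),
    "redshift": np.array([0.4, 1.5, 0.3, 0.9]),
    "flux_i": np.array([2e3, 5e3, 4e3, np.nan]),
}

# Same bounds as the cell above.
coord_mask = (
    (toy["ra"] >= 55.5) & (toy["ra"] < 56.0)
    & (toy["dec"] >= -29.0) & (toy["dec"] < -28.5)
)
redshift_mask = (toy["redshift"] >= 0.0) & (toy["redshift"] < 1.0)
flux_mask = np.isfinite(toy["flux_i"]) & (toy["flux_i"] > 1e3)

# A row is kept only if it passes every cut.
keep = coord_mask & redshift_mask & flux_mask
subset = {name: values[keep] for name, values in toy.items()}
print(keep, subset["ra"])
```

Only the first row survives: the second fails the redshift cut, the third lies outside the RA range, and the fourth has a non-finite flux.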