# Prior Construction For Photo-z

As mentioned in the review document, we construct two priors on redshift and photometry based on SDSS and DC2 data.

### Data Gathering - SDSS

The SDSS data must first be downloaded. Running the script `get_redshift.py` does this and writes a file called `photometric_data.csv` containing object ids, flux magnitudes, redshifts, and object types.

This `photometric_data.csv` file was then cleaned (if necessary) and split into two files, `STAR.csv` and `GALAXY.csv`. To make the problem easier, the URPS project generated only galaxies, not stars. The corresponding prior class is `RedshiftCSVPriorSDSS` in `prior.py`, which reads in `GALAXY.csv` and samples redshifts and photometry by picking rows at random from that file.
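Row-based sampling of this kind can be sketched as follows. This is a minimal illustration, not the code in `prior.py`: the function name `sample_prior_rows`, the column names, and the choice to sample with replacement are all assumptions, and a small in-memory table stands in for `GALAXY.csv`.

```python
import numpy as np
import pandas as pd


def sample_prior_rows(table: pd.DataFrame, n_samples: int, seed: int = 0) -> pd.DataFrame:
    """Draw n_samples rows uniformly at random, with replacement (an assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(low=0, high=len(table), size=n_samples)
    return table.iloc[idx].reset_index(drop=True)


# Stand-in for pd.read_csv("GALAXY.csv"); these column names are hypothetical.
galaxies = pd.DataFrame(
    {
        "mag_r": [21.9, 22.0, 24.2, 22.9],
        "redshift": [1.05, 0.47, 0.76, 0.81],
    }
)
draws = sample_prior_rows(galaxies, n_samples=10, seed=42)
```

Because whole rows are drawn, each sampled redshift stays paired with the photometry it was observed with.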

**The current location for all files mentioned above is `/data/scratch/declan`**.

##### TODO: Write additional data wrangling scripts to generate `STAR.csv` and `GALAXY.csv`. Currently these don't exist. Organize all wrangling files including `get_redshift.py`.
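One possible shape for the missing split step is sketched below. It is only a sketch under stated assumptions: the helper `split_by_class`, the `class` column name, and the `STAR`/`GALAXY` labels are hypothetical and would need to match whatever `get_redshift.py` actually writes.

```python
import pandas as pd


def split_by_class(table: pd.DataFrame, class_column: str = "class") -> dict:
    """Return one sub-table per object class, e.g. {"STAR": ..., "GALAXY": ...}."""
    return {
        str(label): group.reset_index(drop=True)
        for label, group in table.groupby(class_column)
    }


# Stand-in for pd.read_csv("photometric_data.csv"); the column names and
# class labels below are hypothetical.
photometric = pd.DataFrame(
    {
        "objid": [1, 2, 3, 4],
        "mag_r": [18.2, 20.1, 19.5, 21.3],
        "redshift": [0.0, 0.31, 0.12, 0.0],
        "class": ["STAR", "GALAXY", "GALAXY", "STAR"],
    }
)
parts = split_by_class(photometric)
# To write the files: parts["GALAXY"].to_csv("GALAXY.csv", index=False),
# and likewise for parts["STAR"].
```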

### Data Gathering - DC2

Fortunately, the DC2 data are already stored on the server. The cells below illustrate how one can access the "truth tables" that contain the redshifts and photometry.

In [2]:
import os
import math
import numpy as np
import torch
import pandas as pd
import GCRCatalogs
from GCR import GCRQuery
from tqdm import tqdm

In [6]:
# Command line: within virtual environment run 
# python -m pip install https://github.com/LSSTDESC/gcr-catalogs/archive/v1.4.0.tar.gz#egg=GCRCatalogs[full]

In [7]:
GCRCatalogs.set_root_dir('/nfs/turbo/lsa-regier/')
GCRCatalogs.get_root_dir()
# need to do this in accordance with instructions at https://data.lsstdesc.org/doc/install_gcr

'/nfs/turbo/lsa-regier/'

In [8]:
# List of public catalog names
GCRCatalogs.get_public_catalog_names()

['desc_cosmodc2',
 'desc_dc2_run2.2i_dr6_object',
 'desc_dc2_run2.2i_dr6_object_with_truth_match',
 'desc_dc2_run2.2i_dr6_truth',
 'desc_dc2_run2.2i_truth_galaxy_summary',
 'desc_dc2_run2.2i_truth_sn_summary',
 'desc_dc2_run2.2i_truth_sn_variability',
 'desc_dc2_run2.2i_truth_star_summary',
 'desc_dc2_run2.2i_truth_star_variability']

We're mainly interested in the truth catalogs, which contain the redshift and photometry values we need. Let's load one and list the fields available to us.

In [9]:
truth_cat = GCRCatalogs.load_catalog('desc_dc2_run2.2i_dr6_truth')
truth_cat.list_all_quantities()

['is_good_match',
 'id',
 'is_unique_truth_entry',
 'ra',
 'is_nearest_neighbor',
 'mag_u',
 'flux_g',
 'av',
 'flux_i',
 'mag_g',
 'mag_y',
 'truth_type',
 'cosmodc2_hp',
 'tract',
 'patch',
 'rv',
 'redshift',
 'mag_i',
 'flux_r',
 'flux_y',
 'match_objectId',
 'match_sep',
 'id_string',
 'dec',
 'mag_z',
 'flux_u',
 'mag_r',
 'flux_z',
 'cosmodc2_id',
 'host_galaxy']

First, let's load all relevant quantities from the truth catalog.

**The cell below will take a long time to run.**

In [11]:
all_truth_data = {}
quantities = ["flux_u", "flux_g", "flux_r", "flux_i", "flux_z", "flux_y",
             "mag_u", "mag_g", "mag_r", "mag_i", "mag_z", "mag_y",
             "truth_type", "redshift",
             "id", "match_objectId", "cosmodc2_id", "id_string"]
for q in tqdm(quantities):
    this_field = truth_cat.get_quantities([q])
    all_truth_data[q] = this_field[q]
    print('Finished {}'.format(q))
    

In [6]:
truth_data = pd.DataFrame(all_truth_data)

In [1]:
truth_data.shape


In [8]:
truth_data.head(10)

Unnamed: 0,flux_u,flux_g,flux_r,flux_i,flux_z,flux_y,mag_u,mag_g,mag_r,mag_i,mag_z,mag_y,truth_type,redshift,id,match_objectId,cosmodc2_id,id_string
0,5678.708984,5577.517578,6334.502441,8848.515625,15267.947266,19116.740234,22.014378,22.033897,21.895718,21.532824,20.94055,20.696468,1,1.050468,10940305839,11975906419540343,10940305839,10940305839
1,146.518021,1341.131714,5984.994629,12850.581055,18818.283203,22972.25,25.985275,23.581322,21.95734,21.127697,20.713551,20.496994,1,0.474819,10937870093,11975906419541206,10937870093,10937870093
2,134.074341,272.126617,766.903015,2072.049561,2690.330322,3052.688965,26.081638,25.313074,24.188148,23.108999,22.825487,22.688293,1,0.759036,11563663598,11976043858493441,11563663598,11563663598
3,932.008118,1224.280762,2598.827148,6678.408691,9719.280273,11019.9375,23.976452,23.680298,22.863056,21.838318,21.430914,21.294554,1,0.808502,10938869183,11976043858493443,10938869183,10938869183
4,35.554699,104.535225,462.301849,1221.936035,2384.806396,2966.539062,27.522758,26.351845,24.737686,23.682381,22.956369,22.719376,1,0.849298,11564005688,11976043858493737,11564005688,11564005688
5,577.276917,648.449158,943.12561,2090.335938,2522.676758,2646.651123,24.49654,24.37031,23.963577,23.099461,22.895346,22.843258,1,0.822614,11563831110,11976043858493738,11563831110,11563831110
6,893.910278,838.389343,976.022888,1568.729858,2002.057739,2065.459961,24.021767,24.091387,23.926352,23.411131,23.146309,23.112459,1,0.92938,11564445231,11976043858493739,11564445231,11564445231
7,610.097229,828.885193,1807.437744,2516.555664,2857.029785,3111.177246,24.436504,24.103765,23.257341,22.897985,22.760212,22.66769,1,0.517864,11562943167,11976043858493740,11562943167,11562943167
8,607.974792,1396.32312,2972.82251,3748.65332,4305.629395,4791.282715,24.440289,23.537537,22.717077,22.465313,22.314909,22.19887,1,0.307164,10937620679,11976043858493741,10937620679,10937620679
9,287.352844,495.052368,1347.424683,1959.870117,2078.425049,2199.094727,25.253963,24.663374,23.576241,23.169432,23.105663,23.044392,1,0.572746,11563034476,11976043858493742,11563034476,11563034476


There is excess information above; we can focus on just the magnitudes, redshifts, and truth types. Per the schema linked below, `truth_type=1` indicates a galaxy, while `truth_type=2` indicates a star.

https://github.com/LSSTDESC/gcr-catalogs/blob/master/GCRCatalogs/SCHEMA.md
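Part of the reason the fluxes can be dropped is that they appear to duplicate the magnitudes. The sketch below assumes the fluxes are in nanojanskys and the magnitudes are AB magnitudes, in which case the zero point is 31.4; the sample rows above are consistent with that assumption. The helper name `njy_to_ab_mag` is illustrative only.

```python
import numpy as np

AB_ZEROPOINT_NJY = 31.4  # AB magnitude of a 1 nJy source


def njy_to_ab_mag(flux_njy):
    """Convert fluxes in nanojanskys to AB magnitudes."""
    return -2.5 * np.log10(np.asarray(flux_njy, dtype=float)) + AB_ZEROPOINT_NJY


# flux_u and mag_u copied from rows 0 and 1 of the table above.
flux_u = [5678.708984, 146.518021]
mag_u = [22.014378, 25.985275]
recovered = njy_to_ab_mag(flux_u)
```

The recovered values agree with the tabulated `mag_u` to within single-precision rounding.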

As with SDSS above, we filter to galaxies only for now to make the problem easier. We also have no choice but to reduce the size of the dataset: 764 million objects is far too many. The file `dc2_truth_galaxies_mini_clean.csv` is the result of applying the following operations to the dataframe above:

- Clean if necessary (remove rows with negative redshifts, magnitudes, etc.)
- Filter for only galaxies
- Keep only a few million observations
    
By investigating the class `RedshiftCSVPriorDC2` in `prior.py`, you can see how this file is used in the DC2 prior. Currently, there is just one copy of this file and no script to generate it automatically.
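The three cleaning steps could be automated along the following lines. This is a hedged sketch rather than the procedure actually used to build the file: the function name `make_mini_clean`, the exact validity checks, and the uniform random subsample are assumptions, and a tiny synthetic table stands in for the full truth dataframe.

```python
import numpy as np
import pandas as pd

MAG_COLUMNS = ["mag_u", "mag_g", "mag_r", "mag_i", "mag_z", "mag_y"]


def make_mini_clean(truth: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Keep valid galaxies only, then subsample at most n_rows of them."""
    keep = MAG_COLUMNS + ["redshift"]
    # truth_type == 1 marks galaxies in the DC2 truth schema.
    galaxies = truth.loc[truth["truth_type"] == 1, keep]
    # Drop rows with non-finite values, negative magnitudes, or negative redshift.
    finite = np.isfinite(galaxies.to_numpy(dtype=float)).all(axis=1)
    mags_ok = (galaxies[MAG_COLUMNS] >= 0).all(axis=1).to_numpy()
    z_ok = (galaxies["redshift"] >= 0).to_numpy()
    galaxies = galaxies[finite & mags_ok & z_ok]
    n_keep = min(n_rows, len(galaxies))
    return galaxies.sample(n=n_keep, random_state=seed).reset_index(drop=True)


# Synthetic stand-in: two valid galaxies, one negative redshift,
# one row of NaN magnitudes, and one star.
truth = pd.DataFrame({c: [22.0, 25.9, 26.0, np.nan, 24.0] for c in MAG_COLUMNS})
truth["redshift"] = [1.05, -0.1, 0.76, 0.8, 0.0]
truth["truth_type"] = [1, 1, 1, 1, 2]
mini = make_mini_clean(truth, n_rows=10)
```

For the real catalog, `n_rows` would be set to a few million and the result written out with `mini.to_csv(...)`.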

##### TODO: write additional wrangling scripts to automate the procedure above to produce the desired csv files. Organize all wrangling scripts. 

In [3]:
dataset_name = 'desc_dc2_run2.2i_dr6_truth'
dataset = pd.read_pickle(f'/home/qiaozhih/bliss/data/redshift/dc2/{dataset_name}.pkl')

In [4]:
dataset.head(10)

Unnamed: 0,mag_u,mag_g,mag_r,mag_i,mag_z,mag_y,redshift
0,22.014378,22.033897,21.895718,21.532824,20.94055,20.696468,1.050468
1,25.985275,23.581322,21.95734,21.127697,20.713551,20.496994,0.474819
2,26.081638,25.313074,24.188148,23.108999,22.825487,22.688293,0.759036
3,23.976452,23.680298,22.863056,21.838318,21.430914,21.294554,0.808502
4,27.522758,26.351845,24.737686,23.682381,22.956369,22.719376,0.849298
5,24.49654,24.37031,23.963577,23.099461,22.895346,22.843258,0.822614
6,24.021767,24.091387,23.926352,23.411131,23.146309,23.112459,0.92938
7,24.436504,24.103765,23.257341,22.897985,22.760212,22.66769,0.517864
8,24.440289,23.537537,22.717077,22.465313,22.314909,22.19887,0.307164
9,25.253963,24.663374,23.576241,23.169432,23.105663,23.044392,0.572746


In [19]:
dataset[dataset.redshift < 0]

Unnamed: 0,mag_u,mag_g,mag_r,mag_i,mag_z,mag_y,redshift


In [5]:
dataset_test = dataset[:100]

In [6]:
dataset_test

Unnamed: 0,mag_u,mag_g,mag_r,mag_i,mag_z,mag_y,redshift
0,22.014378,22.033897,21.895718,21.532824,20.940550,20.696468,1.050468
1,25.985275,23.581322,21.957340,21.127697,20.713551,20.496994,0.474819
2,26.081638,25.313074,24.188148,23.108999,22.825487,22.688293,0.759036
3,23.976452,23.680298,22.863056,21.838318,21.430914,21.294554,0.808502
4,27.522758,26.351845,24.737686,23.682381,22.956369,22.719376,0.849298
...,...,...,...,...,...,...,...
95,25.747742,25.446239,24.831236,23.958385,23.872292,23.831142,0.769845
96,25.492907,24.795624,23.362864,22.307549,21.623981,21.420773,0.910632
97,24.466492,23.877995,23.155602,22.910280,22.738886,22.609978,0.329320
98,26.374680,23.893337,22.645502,21.705147,21.280571,21.075308,0.000000


In [9]:
save_path = os.path.join('/home/qiaozhih/bliss/data/redshift/dc2', 'test.pkl')
dataset_test.to_pickle(save_path)

In [13]:
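# Features: every column except the last (the six magnitudes).
# Target: the last column (redshift).
photo_z = dataset_test
x = photo_z.values[:, :-1].astype(float)
y = photo_z.values[:, -1].astype(float)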

In [14]:
x.shape

(100, 6)

In [15]:
y.shape

(100,)
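The positional slicing above relies on `redshift` being the last column. A name-based variant avoids that dependency; the sketch below uses a hypothetical helper `to_features_and_target` and the first two rows of the table shown earlier as stand-in data.

```python
import numpy as np
import pandas as pd

MAG_COLUMNS = ["mag_u", "mag_g", "mag_r", "mag_i", "mag_z", "mag_y"]


def to_features_and_target(table: pd.DataFrame):
    """Return (magnitudes, redshift) as float arrays, selected by column name."""
    features = table[MAG_COLUMNS].to_numpy(dtype=float)
    target = table["redshift"].to_numpy(dtype=float)
    return features, target


# First two rows of the dataset shown earlier.
sample = pd.DataFrame(
    [
        [22.014378, 22.033897, 21.895718, 21.532824, 20.94055, 20.696468, 1.050468],
        [25.985275, 23.581322, 21.95734, 21.127697, 20.713551, 20.496994, 0.474819],
    ],
    columns=MAG_COLUMNS + ["redshift"],
)
features, target = to_features_and_target(sample)
```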