# Import ground truth OR4

Author: Sandro

Last updated: Jan 13, 2025

Let's pull the truth data for OR4 and create a HATS catalog. We're only interested in variable stars, since the goal is to compute Lomb-Scargle periodograms. Although truth data exists for both point sources and galaxies, only the point sources flagged as variable matter for this use case.

The pixel data is under `/sdf/data/rubin/shared/ops-rehearsals/ops-rehearsal-4/imSim_catalogs/skyCatalogs/`.


In [2]:
import lsdb
import os
import glob
import re
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

In [3]:
truth_data_dir = "/sdf/data/rubin/shared/ops-rehearsals/ops-rehearsal-4/imSim_catalogs/skyCatalogs"

### Analyze data

First, let's have a look at the data to try to understand it better.

In [4]:
files = glob.glob(f"{truth_data_dir}/pointsource_[0-9]*.parquet")
files.sort()
print(f"Found data for {len(files)} pixels")

Found data for 36 pixels


Let's check if the truth data has duplicate objects:

In [52]:
ps_ids = []
for ps_file in tqdm(files, bar_format='{l_bar}{bar:80}{r_bar}'):
    df_ps = pd.read_parquet(ps_file)
    variable_ps = df_ps[df_ps["is_variable"]]
    ps_ids.extend(variable_ps["id"].to_numpy(dtype=np.int64))


In [57]:
val, counts = np.unique(ps_ids, return_counts=True)

There are five variable stars with duplicate IDs:

In [58]:
assert any(counts > 1)
val[counts > 1]

array([1072245120, 1072251751, 1072257404, 1072258970, 1072281490])
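The `np.unique` check above tells us which IDs repeat, but not where the repeats come from. A minimal sketch on made-up data, assuming a hypothetical `pixel` column that records the source file of each row, shows how pandas can answer both questions at once:

```python
import pandas as pd

# Toy stand-ins for two per-pixel point-source tables (values are made up).
pixel_a = pd.DataFrame({"id": [101, 102, 103], "is_variable": [True, True, False]})
pixel_b = pd.DataFrame({"id": [102, 104], "is_variable": [True, True]})

# Tag each row with the (hypothetical) pixel it came from, then stack them.
stacked = pd.concat(
    [pixel_a.assign(pixel=1), pixel_b.assign(pixel=2)],
    ignore_index=True,
)
variable = stacked[stacked["is_variable"]]

# keep=False marks every occurrence of a repeated ID, not just the later ones.
dup_rows = variable[variable["id"].duplicated(keep=False)]
dup_ids = sorted(dup_rows["id"].unique().tolist())

# Number of distinct pixels in which each duplicated ID appears.
pixels_per_id = dup_rows.groupby("id")["pixel"].nunique()
```

If `pixels_per_id` is greater than 1 for an ID, the same object shows up in more than one pixel file; if it is 1, the repeat lives inside a single file. Which case applies to the OR4 truth data is not established here.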

When we load these objects we will need to merge their respective fluxes. Let's test these operations for pixel `9999`.

In [60]:
pixel = 9999

In [67]:
df_ps = pd.read_parquet(os.path.join(truth_data_dir, f"pointsource_{pixel}.parquet"))
df_variable_ps = df_ps[df_ps["is_variable"]]
df_variable_ps

Unnamed: 0,object_type,id,ra,dec,host_galaxy_id,magnorm,sed_filepath,MW_rv,MW_av,mura,mudec,radial_velocity,parallax,variability_model,salt2_params,is_variable,period,mag_amplitude,phase
16,star,31077623756,224.294413,-39.378476,0,19.874692,starSED/kurucz/km10_6000.fits_g25_6040.gz,3.1,0.315522,-4.53,-1.75,91.440002,0.090657,,,True,5.917516,0.765215,4.928983
31,star,31386906773,224.270445,-39.395871,0,28.479100,starSED/phoSimMLT/lte037-5.5-1.0a+0.4.BT-Settl...,3.1,0.309887,-0.59,-0.54,-109.430000,0.029813,,,True,7.139647,0.445995,3.127816
32,star,31386907962,224.279231,-39.396148,0,20.641500,starSED/kurucz/km15_5000.fits_g00_5120.gz,3.1,0.311014,3.00,-5.60,146.539990,0.187845,,,True,0.295224,0.516455,3.999048
40,star,30289057917,224.273973,-39.394200,0,27.390710,starSED/phoSimMLT/lte032-4.5-1.0a+0.4.BT-Settl...,3.1,0.309887,-4.21,-4.21,-78.940002,0.173540,,,True,6.014836,0.760917,4.559557
52,star,31077615752,224.277464,-39.389618,0,26.473024,starSED/phoSimMLT/lte033-4.5-1.0a+0.4.BT-Settl...,3.1,0.311014,-1.87,-0.60,-49.750000,0.159735,,,True,2.294495,0.398108,4.809991
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
654898,star,31387058409,224.310882,-38.113320,0,27.929097,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.269320,-0.55,-1.43,78.410004,0.042678,,,True,0.597262,0.822043,4.080590
654908,star,30289160254,224.302090,-38.111077,0,22.805223,starSED/phoSimMLT/lte033-4.5-1.0a+0.4.BT-Settl...,3.1,0.269320,-1.00,-11.78,2.910000,0.858618,,,True,1.424634,0.622993,4.000912
654910,star,31077907723,224.300680,-38.113520,0,23.794691,starSED/kurucz/km25_4250.fits_g00_4370.gz,3.1,0.270447,-3.34,-4.09,-33.279999,0.106316,,,True,4.131805,0.569348,1.420987
654916,star,31387057007,224.304456,-38.115031,0,26.020497,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.270447,-5.90,-3.65,97.989998,0.098946,,,True,0.316465,0.540497,5.486771


In [68]:
df_ps_flux = pd.read_parquet(os.path.join(truth_data_dir, f"pointsource_flux_{pixel}.parquet"))
df_ps_flux

Unnamed: 0,id,lsst_flux_u,lsst_flux_g,lsst_flux_r,lsst_flux_i,lsst_flux_z,lsst_flux_y
0,31497485885,5.082593e-07,0.000027,0.000081,0.000122,0.000113,0.000055
1,31386912175,6.279932e-08,0.000004,0.000011,0.000023,0.000025,0.000012
2,31077624681,1.490962e-07,0.000010,0.000029,0.000097,0.000120,0.000065
3,31386911436,9.795985e-04,0.013522,0.016184,0.015345,0.011090,0.004575
4,30289060999,8.578418e-08,0.000006,0.000017,0.000068,0.000089,0.000049
...,...,...,...,...,...,...,...
654942,31077909581,1.488944e-06,0.000090,0.000259,0.000624,0.000687,0.000354
654943,31077909400,1.119244e-06,0.000073,0.000220,0.000466,0.000490,0.000248
654944,31387057553,1.653541e-07,0.000011,0.000031,0.000104,0.000127,0.000069
654945,31296674179,1.056311e-06,0.000065,0.000196,0.000376,0.000381,0.000190


Join the flux ground truth to the respective objects. We have multiband "ugrizy" data.

In [70]:
joined_ps = df_variable_ps.merge(df_ps_flux, on="id")
joined_ps

Unnamed: 0,object_type,id,ra,dec,host_galaxy_id,magnorm,sed_filepath,MW_rv,MW_av,mura,...,is_variable,period,mag_amplitude,phase,lsst_flux_u,lsst_flux_g,lsst_flux_r,lsst_flux_i,lsst_flux_z,lsst_flux_y
0,star,31077623756,224.294413,-39.378476,0,19.874692,starSED/kurucz/km10_6000.fits_g25_6040.gz,3.1,0.315522,-4.53,...,True,5.917516,0.765215,4.928983,3.463510e-04,0.005581,0.007208,0.007021,0.005138,0.002147
1,star,31386906773,224.270445,-39.395871,0,28.479100,starSED/phoSimMLT/lte037-5.5-1.0a+0.4.BT-Settl...,3.1,0.309887,-0.59,...,True,7.139647,0.445995,3.127816,3.957781e-08,0.000002,0.000008,0.000014,0.000013,0.000007
2,star,31386907962,224.279231,-39.396148,0,20.641500,starSED/kurucz/km15_5000.fits_g00_5120.gz,3.1,0.311014,3.00,...,True,0.295224,0.516455,3.999048,9.628465e-05,0.002616,0.004482,0.004977,0.003917,0.001717
3,star,30289057917,224.273973,-39.394200,0,27.390710,starSED/phoSimMLT/lte032-4.5-1.0a+0.4.BT-Settl...,3.1,0.309887,-4.21,...,True,6.014836,0.760917,4.559557,1.361930e-07,0.000009,0.000026,0.000088,0.000109,0.000059
4,star,31077615752,224.277464,-39.389618,0,26.473024,starSED/phoSimMLT/lte033-4.5-1.0a+0.4.BT-Settl...,3.1,0.311014,-1.87,...,True,2.294495,0.398108,4.809991,3.064553e-07,0.000019,0.000056,0.000157,0.000182,0.000096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65371,star,31387058409,224.310882,-38.113320,0,27.929097,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.269320,-0.55,...,True,0.597262,0.822043,4.080590,7.022828e-08,0.000004,0.000013,0.000025,0.000025,0.000013
65372,star,30289160254,224.302090,-38.111077,0,22.805223,starSED/phoSimMLT/lte033-4.5-1.0a+0.4.BT-Settl...,3.1,0.269320,-1.00,...,True,1.424634,0.622993,4.000912,9.522000e-06,0.000596,0.001707,0.004706,0.005452,0.002864
65373,star,31077907723,224.300680,-38.113520,0,23.794691,starSED/kurucz/km25_4250.fits_g00_4370.gz,3.1,0.270447,-3.34,...,True,4.131805,0.569348,1.420987,2.686038e-06,0.000140,0.000366,0.000499,0.000435,0.000203
65374,star,31387057007,224.304456,-38.115031,0,26.020497,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.270447,-5.90,...,True,0.316465,0.540497,5.486771,4.066954e-07,0.000025,0.000075,0.000145,0.000147,0.000073


This will be the data for our pixel 9999.
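One thing to keep in mind with the join above: `merge` defaults to an inner join, and repeated keys multiply rows. The sketch below uses made-up tables to illustrate both effects. Dropping duplicates before merging is only one possible way to handle them, and whether it is appropriate for the OR4 truth data is an assumption, not something this notebook settles.

```python
import pandas as pd

# Made-up object and flux tables; ID 2 appears twice in both.
objects = pd.DataFrame({"id": [1, 2, 2, 3], "period": [0.5, 1.2, 1.2, 3.4]})
fluxes = pd.DataFrame({"id": [1, 2, 2, 4], "lsst_flux_r": [1e-4, 2e-4, 2e-4, 9e-4]})

# Default inner join: IDs present in only one table (3 and 4) are dropped,
# and each duplicated ID yields one row per matching pair (2 x 2 = 4).
joined = objects.merge(fluxes, on="id")

# Dropping duplicate IDs on both sides first restores one row per object;
# validate="one_to_one" makes pandas raise if any key is still repeated.
deduped = objects.drop_duplicates("id").merge(
    fluxes.drop_duplicates("id"), on="id", validate="one_to_one"
)
```

Passing `validate` is a cheap guard: the merge fails loudly instead of silently inflating the row count.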

In [71]:
joined_ps.dtypes

object_type           object
id                    object
ra                   float64
dec                  float64
host_galaxy_id         int64
magnorm              float64
sed_filepath          object
MW_rv                float32
MW_av                float32
mura                 float64
mudec                float64
radial_velocity      float64
parallax             float64
variability_model     object
salt2_params          object
is_variable             bool
period               float64
mag_amplitude        float64
phase                float64
lsst_flux_u          float32
lsst_flux_g          float32
lsst_flux_r          float32
lsst_flux_i          float32
lsst_flux_z          float32
lsst_flux_y          float32
dtype: object
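Note that `id` comes back as `object` rather than an integer type. A small sketch of one way to convert such a column, using made-up IDs stored as Python integers, is shown below; `pd.to_numeric` is used here so that any value that cannot be parsed raises an error instead of passing through unnoticed.

```python
import numpy as np
import pandas as pd

# Made-up IDs stored as Python objects, mimicking an `object` column.
ids = pd.Series([31077623756, 31386906773, 30289057917], dtype=object)

# errors="raise" makes unparseable values fail loudly during conversion.
ids_int = pd.to_numeric(ids, errors="raise").astype(np.int64)

# These IDs do not fit in 32 bits, so int64 is the narrowest safe choice.
needs_64_bits = bool(ids_int.max() > np.iinfo(np.int32).max)
```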

## Import all data

Variable stars are a small fraction of the point sources (roughly 10% in pixel 9999), so we can use `lsdb.from_dataframe` to import everything. It's also the most intuitive way of importing data into the HATS format and what users will be inclined to do.

In [74]:
dfs = []

for ps_file in tqdm(files, bar_format='{l_bar}{bar:80}{r_bar}'):
    
    # Extract the pixel number from the pointsource path
    match = re.search(r"pointsource_(\d+)\.parquet$", ps_file)
    pixel_index = match.group(1)
    print(pixel_index)

    print(f"Processing pixel {pixel_index}")
    
    df_ps = pd.read_parquet(ps_file)
    df_variable_ps = df_ps[df_ps["is_variable"]]
    df_ps_flux = pd.read_parquet(f"{truth_data_dir}/pointsource_flux_{pixel_index}.parquet")
    
    df_res = df_variable_ps.merge(df_ps_flux, on="id")
    dfs.append(df_res)


10000
Processing pixel 10000
10026
Processing pixel 10026
10128
Processing pixel 10128
10154
Processing pixel 10154
10155
Processing pixel 10155
10179
Processing pixel 10179
10255
Processing pixel 10255
10256
Processing pixel 10256
10282
Processing pixel 10282
10283
Processing pixel 10283
10306
Processing pixel 10306
10307
Processing pixel 10307
10430
Processing pixel 10430
10431
Processing pixel 10431
10638
Processing pixel 10638
10750
Processing pixel 10750
10751
Processing pixel 10751
10859
Processing pixel 10859
10860
Processing pixel 10860
5785
Processing pixel 5785
5912
Processing pixel 5912
5913
Processing pixel 5913
6041
Processing pixel 6041
7309
Processing pixel 7309
7436
Processing pixel 7436
7437
Processing pixel 7437
7564
Processing pixel 7564
7565
Processing pixel 7565
7692
Processing pixel 7692
7982
Processing pixel 7982
8110
Processing pixel 8110
8111
Processing pixel 8111
8237
Processing pixel 8237
8238
Processing pixel 8238
8366
Processing pixel 8366
9999
Processing p
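The pixel-number extraction used in the loop can be exercised in isolation. The sketch below uses an anchored pattern with an escaped dot, so that only names of the form `pointsource_<digits>.parquet` match; the helper name `pixel_from_path` and the example paths are hypothetical.

```python
import re

# Anchored pattern: digits only, a literal dot, and the end of the string.
PIXEL_RE = re.compile(r"pointsource_(\d+)\.parquet$")


def pixel_from_path(path):
    """Return the pixel number embedded in a point-source file name, or None."""
    match = PIXEL_RE.search(path)
    return int(match.group(1)) if match else None


# Hypothetical paths for illustration only.
paths = [
    "/some/dir/pointsource_9999.parquet",
    "/some/dir/pointsource_10000.parquet",
    "/some/dir/pointsource_flux_9999.parquet",
]
pixels = [pixel_from_path(p) for p in paths]

# Sorting the integers gives numeric order, unlike the lexicographic order
# of the file names in the log above (10000 listed before 5785).
numeric_order = sorted(p for p in pixels if p is not None)
```

Returning `None` for non-matching names lets the caller skip the flux files explicitly rather than hitting an `AttributeError` on a failed match.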

In [122]:
variables_df = pd.concat(dfs, ignore_index=True)
variables_df

Unnamed: 0,object_type,id,ra,dec,host_galaxy_id,magnorm,sed_filepath,MW_rv,MW_av,mura,...,is_variable,period,mag_amplitude,phase,lsst_flux_u,lsst_flux_g,lsst_flux_r,lsst_flux_i,lsst_flux_z,lsst_flux_y
0,star,31387144925,225.705279,-39.368454,0,19.981522,starSED/kurucz/km30_6250.fits_g35_6330.gz,3.1,0.246783,1.44,...,True,7.135628,0.822984,3.288269,4.163741e-04,0.005429,0.006479,0.006129,0.004413,0.001808
1,star,31296729958,225.689056,-39.368112,0,19.121722,starSED/kurucz/km10_5750.fits_g20_5850.gz,3.1,0.245656,-5.55,...,True,0.621520,0.517424,5.187645,7.091290e-04,0.011986,0.015789,0.015478,0.011375,0.004763
2,star,31387141856,225.693407,-39.374447,0,26.131393,starSED/phoSimMLT/lte035-4.5-1.0a+0.4.BT-Settl...,3.1,0.246783,-5.24,...,True,0.385058,0.548586,2.421399,4.267017e-07,0.000024,0.000070,0.000149,0.000157,0.000079
3,star,31070031235,225.671926,-39.371443,0,28.846105,starSED/phoSimMLT/lte027-2.0-0.0a+0.0.BT-Settl...,3.1,0.245656,-4.81,...,True,4.143379,0.284235,2.064777,6.142869e-08,0.000003,0.000011,0.000047,0.000073,0.000052
4,star,31296727461,225.671929,-39.372528,0,23.511648,starSED/kurucz/km40_4250.fits_g00_4250.gz,3.1,0.245656,-8.43,...,True,3.257895,0.270389,1.772266,3.513396e-06,0.000186,0.000525,0.000756,0.000679,0.000320
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1476352,star,31387058409,224.310882,-38.113320,0,27.929097,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.269320,-0.55,...,True,0.597262,0.822043,4.080590,7.022828e-08,0.000004,0.000013,0.000025,0.000025,0.000013
1476353,star,30289160254,224.302090,-38.111077,0,22.805223,starSED/phoSimMLT/lte033-4.5-1.0a+0.4.BT-Settl...,3.1,0.269320,-1.00,...,True,1.424634,0.622993,4.000912,9.522000e-06,0.000596,0.001707,0.004706,0.005452,0.002864
1476354,star,31077907723,224.300680,-38.113520,0,23.794691,starSED/kurucz/km25_4250.fits_g00_4370.gz,3.1,0.270447,-3.34,...,True,4.131805,0.569348,1.420987,2.686038e-06,0.000140,0.000366,0.000499,0.000435,0.000203
1476355,star,31387057007,224.304456,-38.115031,0,26.020497,starSED/phoSimMLT/lte036-5.0-1.0a+0.4.BT-Settl...,3.1,0.270447,-5.90,...,True,0.316465,0.540497,5.486771,4.066954e-07,0.000025,0.000075,0.000145,0.000147,0.000073


Let's clean up this DataFrame a bit more:

- `object_type` is always "star", so it can be removed;
- `is_variable` is always True, so it can be removed;
- `salt2_params` is all nulls, so it can be removed;
- `variability_model` is always an empty string, so it can be removed;
- `id` should be of type int64.
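Columns like the ones listed above can also be found programmatically. The following is a minimal sketch on made-up rows; `constant_columns` is a hypothetical helper, not part of pandas or lsdb.

```python
import pandas as pd


def constant_columns(df):
    """Return names of columns that hold a single distinct value (NaN included)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


# Made-up rows mimicking the shape of the merged truth table.
toy = pd.DataFrame({
    "object_type": ["star", "star", "star"],
    "is_variable": [True, True, True],
    "salt2_params": [None, None, None],
    "period": [0.3, 5.9, 7.1],
})

to_drop = constant_columns(toy)
slim = toy.drop(columns=to_drop)
```

Using `dropna=False` counts an all-null column as having one distinct value, so it is flagged together with the truly constant ones.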

In [None]:
assert all(variables_df["salt2_params"].isna())
assert all(variables_df["object_type"] == "star")
assert all(variables_df["variability_model"] == '')
variables_df.drop(columns=["object_type","is_variable","salt2_params","variability_model"], inplace=True, errors="ignore")
variables_df = variables_df.astype({"id": np.int64})

In [134]:
variables_hats = lsdb.from_dataframe(variables_df)
variables_hats



npartitions=4,id,ra,dec,host_galaxy_id,magnorm,sed_filepath,MW_rv,MW_av,mura,mudec,radial_velocity,parallax,period,mag_amplitude,phase,lsst_flux_u,lsst_flux_g,lsst_flux_r,lsst_flux_i,lsst_flux_z,lsst_flux_y,Norder,Dir,Npix
"Order: 0, Pixel: 7",int64[pyarrow],double[pyarrow],double[pyarrow],int64[pyarrow],double[pyarrow],string[pyarrow],float[pyarrow],float[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],double[pyarrow],uint8[pyarrow],uint64[pyarrow],uint64[pyarrow]
"Order: 0, Pixel: 8",...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Order: 0, Pixel: 10",...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Order: 0, Pixel: 11",...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


That was quick! Now let's look at the data types to make sure they are reasonable.

In [135]:
variables_hats.dtypes

id                  int64[pyarrow]
ra                 double[pyarrow]
dec                double[pyarrow]
host_galaxy_id      int64[pyarrow]
magnorm            double[pyarrow]
sed_filepath       string[pyarrow]
MW_rv               float[pyarrow]
MW_av               float[pyarrow]
mura               double[pyarrow]
mudec              double[pyarrow]
radial_velocity    double[pyarrow]
parallax           double[pyarrow]
period             double[pyarrow]
mag_amplitude      double[pyarrow]
phase              double[pyarrow]
lsst_flux_u        double[pyarrow]
lsst_flux_g        double[pyarrow]
lsst_flux_r        double[pyarrow]
lsst_flux_i        double[pyarrow]
lsst_flux_z        double[pyarrow]
lsst_flux_y        double[pyarrow]
Norder              uint8[pyarrow]
Dir                uint64[pyarrow]
Npix               uint64[pyarrow]
dtype: object

Let's store this catalog to disk with `to_hats`. It only occupies 170 MB.

In [138]:
out_dir = "/sdf/data/rubin/shared/lsdb_commissioning/or4_truth"
variables_hats.to_hats(out_dir, catalog_name="or4_truth")

In [141]:
%ls -l $out_dir

total 6160
drwxr-sr-x 1 stavar rubin_users       0 Jan 13 12:46 dataset/
-rw-r--r-- 1 stavar rubin_users      30 Jan 13 12:46 partition_info.csv
-rw-r--r-- 1 stavar rubin_users 6298560 Jan 13 12:46 point_map.fits
-rw-r--r-- 1 stavar rubin_users     246 Jan 13 12:46 properties
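The listing above only covers the top-level entries, so it does not include the contents of `dataset/`. To check the on-disk footprint of the whole catalog, one option is to sum file sizes recursively. The sketch below demonstrates the idea on a throwaway directory rather than the real output path; `dir_size_bytes` is a hypothetical helper.

```python
import os
import tempfile


def dir_size_bytes(root):
    """Sum the sizes of all regular files under `root`, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Self-contained demo on a temporary directory (not the real catalog).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "dataset"))
    with open(os.path.join(tmp, "properties"), "wb") as fh:
        fh.write(b"x" * 246)
    with open(os.path.join(tmp, "dataset", "part.parquet"), "wb") as fh:
        fh.write(b"y" * 1000)
    demo_size = dir_size_bytes(tmp)
```

In the notebook, calling `dir_size_bytes(out_dir) / 1e6` would give the catalog size in megabytes.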
