<img align='left' src = '../images/linea.png' width=150 style='padding: 20px'> 

# DP02 duplicates analysis
## Part 2 - Analysis of two subsamples, one with even tracts and the other with odd tracts

Analysis of duplicates found in the DP02 catalog.

Contacts: Luigi Silva ([luigi.silva@linea.org.br](mailto:luigi.silva@linea.org.br)); Julia Gschwend ([julia@linea.org.br](mailto:julia@linea.org.br)).

Last check: 03/10/2024

#### Acknowledgments

'_This notebook used computational resources from the Associação Laboratório Interinstitucional de e-Astronomia (LIneA) with financial support from the INCT of e-Universe (Process No. 465376/2014-2)._'

'_This notebook uses libraries from the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) Frameworks project, such as the hipscat, hipscat_import, and lsdb libraries. The LINCC Frameworks project is supported by Schmidt Sciences. It is also based on work supported by the National Science Foundation under Grant No. AST-2003196. Additionally, it receives support from the DIRAC Institute at the Department of Astronomy of the University of Washington. The DIRAC Institute is supported by gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences and the Washington Research Foundation._'

# Imports and Configs

Let us import the packages that we will need.

In [None]:
########################### GENERAL ##########################
import os
import re
import glob
import getpass
import warnings
import tables_io
import numpy as np
import pandas as pd
from pathlib import Path
############################ DASK ############################
from dask import dataframe as dd
from dask import delayed
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster
########################## HIPSCAT ###########################
import hipscat
from hipscat.catalog import Catalog
from hipscat.inspection import plot_pixels
from hipscat_import.catalog.file_readers import ParquetReader
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import ImportArguments, pipeline_with_client
############################ LSDB ############################
import lsdb
######################## VISUALIZATION #######################
### BOKEH
import bokeh
from bokeh.io import output_notebook, show
from bokeh.models import ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256

### HOLOVIEWS
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import rasterize, dynspread

### GEOVIEWS
import geoviews as gv
import geoviews.feature as gf
from cartopy import crs

### DATASHADER
import datashader as ds

### MATPLOTLIB
import matplotlib.pyplot as plt
########################## ASTRONOMY #########################
from astropy import units as u
from astropy.coordinates import SkyCoord
from astropy.units.quantity import Quantity

Now, let us configure the plots to be inline.

In [None]:
hv.extension('bokeh')
gv.extension('bokeh')
output_notebook()
%matplotlib inline

Now, let us define the paths to save the logs and outputs.

In [None]:
user = getpass.getuser()
base_path = f'/lustre/t0/scratch/users/{user}/report_hipscat/'

In [None]:
output_dir = os.path.join(base_path, 'output')
logs_dir = os.path.join(base_path, 'logs')
os.makedirs(output_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

Then, let us define the parameters for the cluster.

In [None]:
# Configuring the SLURMCluster.
cluster = SLURMCluster(
    interface="ib0",    # Network interface (InfiniBand)
    queue='cpu_small',  # Name of the queue
    cores=52,           # Number of logical cores per node
    processes=13,       # Number of dask processes per node
    memory='120GB',     # Memory per node
    walltime='06:00:00',  # Maximum execution time
    job_extra_directives=[
        '--propagate',
        f'--output={output_dir}/dask_job_%j.out',  
        f'--error={output_dir}/dask_job_%j.err'
    ],
)

# Scaling the cluster to 6 SLURM jobs
cluster.scale(jobs=6)

# Defining the dask client
client = Client(cluster)
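
As a quick sanity check on the configuration above, the sketch below works out the resources each dask worker would receive. It assumes that the cores and memory of a job are divided evenly among its processes; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the SLURMCluster configuration above.
cores_per_job = 52
processes_per_job = 13
memory_per_job_gb = 120
jobs = 6

# Assumption: each job's cores and memory are split evenly among its processes.
threads_per_worker = cores_per_job // processes_per_job
memory_per_worker_gb = memory_per_job_gb / processes_per_job
total_workers = jobs * processes_per_job

print(f"threads per worker: {threads_per_worker}")
print(f"memory per worker: {memory_per_worker_gb:.2f} GB")
print(f"total workers: {total_workers}")
```

Under that assumption, each worker has 4 threads and roughly 9 GB of memory, which is worth keeping in mind when a single partition is large.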

# Getting the paths of the files corresponding to the subsamples

We want to define two subsamples from the DP02 files, one containing just even tracts and the other containing just odd tracts.

Before getting the paths of the files corresponding to each subsample, let us check how many parquet files we have in total.

In [None]:
path = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/*.parq'

In [None]:
total_files = glob.glob(path)

In [None]:
total_files[0:5]

In [None]:
len(total_files)

## First subsample - Paths of even tracts

Now, let us get only the paths of even tracts.

In [None]:
files_even = [os.path.basename(f) for f in glob.glob(path) if (m := re.search(r'tract_(\d+)', f)) and int(m.group(1)) % 2 == 0]

In [None]:
files_even[0:5]

In [None]:
len(files_even)

## Second subsample - Paths of odd tracts

Here, we get only the paths of odd tracts.

In [None]:
files_odd = [os.path.basename(f) for f in glob.glob(path) if (m := re.search(r'tract_(\d+)', f)) and int(m.group(1)) % 2 != 0]

In [None]:
files_odd[0:5]

In [None]:
len(files_odd)
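
The parity split used for both subsamples can be sketched as a single pass over the file names. This is a minimal illustration, not the notebook's own code: the helper `split_by_tract_parity` and the example file names are hypothetical, and only the `tract_(\d+)` pattern comes from the cells above.

```python
import re

TRACT_PATTERN = re.compile(r"tract_(\d+)")

def split_by_tract_parity(file_names):
    """Return (even, odd) lists of names, skipping names without a tract number."""
    even, odd = [], []
    for name in file_names:
        match = TRACT_PATTERN.search(name)
        if match is None:
            continue  # No tract number in the name, so it belongs to neither subsample.
        if int(match.group(1)) % 2 == 0:
            even.append(name)
        else:
            odd.append(name)
    return even, odd

# Hypothetical file names, used only to illustrate the split.
example_files = [
    "objectTable_tract_2897.parq",
    "objectTable_tract_2898.parq",
    "objectTable_tract_3074.parq",
    "README.txt",
]
even, odd = split_by_tract_parity(example_files)
print(even)
print(odd)
```

A useful check on the real lists is that `len(files_even) + len(files_odd)` equals `len(total_files)` whenever every file name contains a tract number.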

# Converting the subsamples to HiPSCat format

## Converting the first subsample (even tracts)

Generating the HiPSCat catalog.

In [None]:
# DO YOU WANT TO RUN THE PIPELINE? SET FALSE IF YOU ALREADY GENERATED THE HIPSCAT CATALOG.
run_the_pipeline = True

In [None]:
################################## INPUT CONFIGS #################################
### Directory and name of the input files. The name can be a list or contain a wildcard, ex: files_*.parquet.
CATALOG_DIR = Path('/lustre/t1/cl/lsst/dp02/primary/catalogs/object/')
CATALOG_FILES = files_even
### Columns to be selected in the input files. The id, ra, and dec columns are essential.
CATALOG_SELECTED_COLUMNS = ['objectId', 'coord_ra', 'coord_dec', 'u_cModelFlux', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 
                            'z_cModelFlux', 'y_cModelFlux', 'u_cModelFluxErr', 'g_cModelFluxErr', 'r_cModelFluxErr', 'i_cModelFluxErr', 
                            'z_cModelFluxErr', 'y_cModelFluxErr', 'detect_isPrimary']
CATALOG_SORT_COLUMN = 'objectId'
CATALOG_RA_COLUMN = 'coord_ra'
CATALOG_DEC_COLUMN = 'coord_dec'
### Type of the files we will read.
FILE_TYPE = 'parquet'
### Name of the HiPSCat catalog to be saved.
CATALOG_HIPSCAT_NAME = 'DP02_object_even_tracts'
###########################################################################################

################################# OUTPUT CONFIGS #################################
### Output directory for the catalogs.
OUTPUT_DIR = Path(output_dir)
HIPSCAT_DIR_NAME = 'hipscat'
HIPSCAT_DIR = OUTPUT_DIR / HIPSCAT_DIR_NAME

CATALOG_HIPSCAT_DIR = HIPSCAT_DIR / CATALOG_HIPSCAT_NAME

### Path to dask performance report.
LOGS_DIR = Path(logs_dir) 

PERFORMANCE_REPORT_NAME = 'performance_report_make_hipscat_DP02_object_even.html'
PERFORMANCE_DIR = LOGS_DIR / PERFORMANCE_REPORT_NAME
###########################################################################################

############################### EXECUTING THE PIPELINE ######################################
if run_the_pipeline:
    with performance_report(filename=PERFORMANCE_DIR):
        if isinstance(CATALOG_FILES, list):
            CATALOG_PATHS = [CATALOG_DIR / file for file in CATALOG_FILES]
        elif isinstance(CATALOG_FILES, str):
            CATALOG_PATHS = list(CATALOG_DIR.glob(CATALOG_FILES))
        else:
            raise Exception('The type of the catalog file names (CATALOG_FILES) is not supported. Supported types are list and str.')
    
        if FILE_TYPE=='parquet':
            catalog_args = ImportArguments(
                sort_columns=CATALOG_SORT_COLUMN,
                ra_column=CATALOG_RA_COLUMN,
                dec_column=CATALOG_DEC_COLUMN,
                input_file_list=CATALOG_PATHS,
                file_reader=ParquetReader(column_names=CATALOG_SELECTED_COLUMNS),
                output_artifact_name=CATALOG_HIPSCAT_NAME,
                output_path=HIPSCAT_DIR,
            )
            pipeline_with_client(catalog_args, client)
        else:
            raise Exception('Input catalog type not supported yet.')
else:
    print('You selected not to run the pipeline.') 
###########################################################################################
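
The cell above accepts `CATALOG_FILES` either as a list of file names or as a wildcard string. The sketch below isolates that logic in a small helper so it can be tried without running the import pipeline. The helper name `resolve_catalog_paths` and the file names are hypothetical, and sorting the wildcard results is an addition made here only to keep the output deterministic.

```python
import tempfile
from pathlib import Path

def resolve_catalog_paths(catalog_dir, catalog_files):
    """Hypothetical helper: expand a list of names or a wildcard into paths."""
    catalog_dir = Path(catalog_dir)
    if isinstance(catalog_files, list):
        return [catalog_dir / name for name in catalog_files]
    if isinstance(catalog_files, str):
        # Sorted for reproducibility; the notebook uses the unsorted glob result.
        return sorted(catalog_dir.glob(catalog_files))
    raise TypeError("catalog_files must be a list of names or a wildcard string.")

# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a_tract_2.parq", "b_tract_4.parq", "notes.txt"):
        (Path(tmp) / name).touch()
    from_list = resolve_catalog_paths(tmp, ["a_tract_2.parq"])
    from_glob = resolve_catalog_paths(tmp, "*.parq")
    print([p.name for p in from_list])
    print([p.name for p in from_glob])
```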

Plotting the pixels.

In [None]:
# Read the HiPSCat catalog metadata. This does not load any data, only the HEALPix pixels and other metadata.
DP02_even_tracts_hipscat_catalog = Catalog.read_from_hipscat(CATALOG_HIPSCAT_DIR)
plot_pixels(DP02_even_tracts_hipscat_catalog)

Reading the catalog into a dask dataframe.

In [None]:
DP02_even_tracts_from_disk = lsdb.read_hipscat(CATALOG_HIPSCAT_DIR)
DP02_even_tracts_from_disk_delayed = DP02_even_tracts_from_disk.to_delayed()
DP02_even_tracts_from_disk_ddf = dd.from_delayed(DP02_even_tracts_from_disk_delayed)

print(CATALOG_HIPSCAT_DIR)

Showing the first rows of the dataframe and the basic statistics.

In [None]:
DP02_even_tracts_from_disk_ddf.head()

In [None]:
DP02_even_tracts_from_disk_ddf.describe().compute()

## Converting the second subsample (odd tracts)

Generating the HiPSCat catalog.

In [None]:
# DO YOU WANT TO RUN THE PIPELINE? SET FALSE IF YOU ALREADY GENERATED THE HIPSCAT CATALOG.
run_the_pipeline = True

In [None]:
################################## INPUT CONFIGS #################################
### Directory and name of the input files. The name can be a list or contain a wildcard, ex: files_*.parquet.
CATALOG_DIR = Path('/lustre/t1/cl/lsst/dp02/primary/catalogs/object/')
CATALOG_FILES = files_odd
### Columns to be selected in the input files. The id, ra, and dec columns are essential.
CATALOG_SELECTED_COLUMNS = ['objectId', 'coord_ra', 'coord_dec', 'u_cModelFlux', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 
                            'z_cModelFlux', 'y_cModelFlux', 'u_cModelFluxErr', 'g_cModelFluxErr', 'r_cModelFluxErr', 'i_cModelFluxErr', 
                            'z_cModelFluxErr', 'y_cModelFluxErr', 'detect_isPrimary']
CATALOG_SORT_COLUMN = 'objectId'
CATALOG_RA_COLUMN = 'coord_ra'
CATALOG_DEC_COLUMN = 'coord_dec'
### Type of the files we will read.
FILE_TYPE = 'parquet'
### Name of the HiPSCat catalog to be saved.
CATALOG_HIPSCAT_NAME = 'DP02_object_odd_tracts'
###########################################################################################

################################# OUTPUT CONFIGS #################################
### Output directory for the catalogs.
OUTPUT_DIR = Path(output_dir)
HIPSCAT_DIR_NAME = 'hipscat'
HIPSCAT_DIR = OUTPUT_DIR / HIPSCAT_DIR_NAME

CATALOG_HIPSCAT_DIR = HIPSCAT_DIR / CATALOG_HIPSCAT_NAME

### Path to dask performance report.
LOGS_DIR = Path(logs_dir) 

PERFORMANCE_REPORT_NAME = 'performance_report_make_hipscat_DP02_object_odd.html'
PERFORMANCE_DIR = LOGS_DIR / PERFORMANCE_REPORT_NAME
###########################################################################################

############################### EXECUTING THE PIPELINE ######################################
if run_the_pipeline:
    with performance_report(filename=PERFORMANCE_DIR):
        if isinstance(CATALOG_FILES, list):
            CATALOG_PATHS = [CATALOG_DIR / file for file in CATALOG_FILES]
        elif isinstance(CATALOG_FILES, str):
            CATALOG_PATHS = list(CATALOG_DIR.glob(CATALOG_FILES))
        else:
            raise Exception('The type of the catalog file names (CATALOG_FILES) is not supported. Supported types are list and str.')
    
        if FILE_TYPE=='parquet':
            catalog_args = ImportArguments(
                sort_columns=CATALOG_SORT_COLUMN,
                ra_column=CATALOG_RA_COLUMN,
                dec_column=CATALOG_DEC_COLUMN,
                input_file_list=CATALOG_PATHS,
                file_reader=ParquetReader(column_names=CATALOG_SELECTED_COLUMNS),
                output_artifact_name=CATALOG_HIPSCAT_NAME,
                output_path=HIPSCAT_DIR,
            )
            pipeline_with_client(catalog_args, client)
        else:
            raise Exception('Input catalog type not supported yet.')
else:
    print('You selected not to run the pipeline.') 
###########################################################################################

Plotting the pixels.

In [None]:
# Read the HiPSCat catalog metadata. This does not load any data, only the HEALPix pixels and other metadata.
DP02_odd_tracts_hipscat_catalog = Catalog.read_from_hipscat(CATALOG_HIPSCAT_DIR)
plot_pixels(DP02_odd_tracts_hipscat_catalog)

Reading the catalog into a dask dataframe.

In [None]:
DP02_odd_tracts_from_disk = lsdb.read_hipscat(CATALOG_HIPSCAT_DIR)
DP02_odd_tracts_from_disk_delayed = DP02_odd_tracts_from_disk.to_delayed()
DP02_odd_tracts_from_disk_ddf = dd.from_delayed(DP02_odd_tracts_from_disk_delayed)

print(CATALOG_HIPSCAT_DIR)

Showing the first rows of the dataframe and the basic statistics.

In [None]:
DP02_odd_tracts_from_disk_ddf.head()

In [None]:
DP02_odd_tracts_from_disk_ddf.describe().compute()

# Generating the margin cache for the second subsample

Generating the margin cache for the odd tracts.

In [None]:
# DO YOU WANT TO RUN THE PIPELINE? SET FALSE IF YOU ALREADY GENERATED THE MARGIN CACHE.
run_the_pipeline = True

In [None]:
################################## INPUT CONFIGS #################################
### Path of the input HiPSCat catalog.
CATALOG_HIPSCAT_DIR = Path(output_dir) / 'hipscat' / 'DP02_object_odd_tracts'
MARGIN_CACHE_THRESHOLD = 1.0 #arcsec
CATALOG_MARGIN_CACHE_NAME = "DP02_object_odd_tracts_margin_cache"
###########################################################################################

################################# OUTPUT CONFIGS #################################
### Output path for the catalogs.
OUTPUT_DIR = Path(output_dir)
HIPSCAT_DIR_NAME = "hipscat"
HIPSCAT_DIR = OUTPUT_DIR / HIPSCAT_DIR_NAME

CATALOG_MARGIN_CACHE_DIR = HIPSCAT_DIR / CATALOG_MARGIN_CACHE_NAME

### Path to dask performance report.
LOGS_DIR = Path(logs_dir)

PERFORMANCE_REPORT_NAME = 'performance_report_make_margin_cache_DP02_object_odd.html'
PERFORMANCE_DIR = LOGS_DIR / PERFORMANCE_REPORT_NAME
###########################################################################################

############################### EXECUTING THE PIPELINE ######################################
if run_the_pipeline:
    with performance_report(filename=PERFORMANCE_DIR):
        ### Getting information from the catalog.
        catalog = hipscat.read_from_hipscat(CATALOG_HIPSCAT_DIR)

        info_frame = catalog.partition_info.as_dataframe()

        for index, partition in info_frame.iterrows():
            file_name = hipscat.io.paths.pixel_catalog_file(
                CATALOG_HIPSCAT_DIR, partition["Norder"], partition["Npix"]
            )
            info_frame.loc[index, "size_on_disk"] = os.path.getsize(file_name)

        info_frame = info_frame.astype(int)
        info_frame["gbs"] = info_frame["size_on_disk"] / (1024 * 1024 * 1024)
        
        ### Computing the margin cache, if it is possible.
        number_of_pixels = len(info_frame["Npix"])
        if number_of_pixels <= 1:
            warnings.warn(f"Number of pixels is equal to {number_of_pixels}. Impossible to compute margin cache.")
        else:
            margin_cache_args = MarginCacheArguments(
                input_catalog_path=CATALOG_HIPSCAT_DIR,
                output_path=HIPSCAT_DIR,
                margin_threshold=MARGIN_CACHE_THRESHOLD,  # arcsec
                output_artifact_name=CATALOG_MARGIN_CACHE_NAME,
            )
            pipeline_with_client(margin_cache_args, client)
else:
    print('You selected not to run the pipeline.')

# Doing the crossmatching between the two subsamples

Now, let us crossmatch the two subsample catalogs.

In [None]:
# DO YOU WANT TO RUN THE PIPELINE? SET FALSE IF YOU ALREADY GENERATED THE CROSSMATCHED CATALOG.
run_the_pipeline = True

In [None]:
################################## INPUT CONFIGS #################################
LEFT_CATALOG_HIPSCAT_NAME = 'DP02_object_even_tracts'
LEFT_HIPSCAT_DIR = Path(output_dir) / 'hipscat' / LEFT_CATALOG_HIPSCAT_NAME
RIGHT_CATALOG_HIPSCAT_NAME = 'DP02_object_odd_tracts'
RIGHT_HIPSCAT_DIR = Path(output_dir) / 'hipscat' / RIGHT_CATALOG_HIPSCAT_NAME
RIGHT_MARGIN_CACHE_DIR = Path(output_dir) / 'hipscat' / 'DP02_object_odd_tracts_margin_cache'

CROSS_MATCHING_RADIUS = 1.0 # Up to 1 arcsec distance, it is the default
NEIGHBORS_NUMBER = 1 # Single closest object, it is the default
###########################################################################################

################################# OUTPUT CONFIGS #################################
OUTPUT_DIR = Path(output_dir)
HIPSCAT_DIR_NAME = 'hipscat'
HIPSCAT_DIR = OUTPUT_DIR / HIPSCAT_DIR_NAME

XMATCH_NAME = LEFT_CATALOG_HIPSCAT_NAME+'_x_'+RIGHT_CATALOG_HIPSCAT_NAME
OUTPUT_HIPSCAT_DIR = HIPSCAT_DIR / XMATCH_NAME

LOGS_DIR = Path(logs_dir)

PERFORMANCE_REPORT_NAME = 'performance_report_make_xmatching.html'
PERFORMANCE_DIR = LOGS_DIR / PERFORMANCE_REPORT_NAME
###########################################################################################

############################### EXECUTING THE PIPELINE ######################################
if run_the_pipeline:
    with performance_report(filename=PERFORMANCE_DIR):
        left_catalog = lsdb.read_hipscat(LEFT_HIPSCAT_DIR)
        right_margin_cache_catalog = lsdb.read_hipscat(RIGHT_MARGIN_CACHE_DIR)
        right_catalog = lsdb.read_hipscat(RIGHT_HIPSCAT_DIR, margin_cache=right_margin_cache_catalog)
    
        xmatched = left_catalog.crossmatch(
            right_catalog,
            radius_arcsec=CROSS_MATCHING_RADIUS,
            n_neighbors=NEIGHBORS_NUMBER,
            suffixes=(LEFT_CATALOG_HIPSCAT_NAME, RIGHT_CATALOG_HIPSCAT_NAME),
        )
        xmatched.to_hipscat(OUTPUT_HIPSCAT_DIR)
else:
    print('You selected not to run the pipeline.')
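
To make the meaning of the 1 arcsec crossmatching radius concrete, the sketch below computes the great-circle separation between two positions with the haversine formula and compares it with the radius. It is only an illustration of the criterion, not necessarily how lsdb evaluates it internally; the function name and the positions are made up.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    # Haversine formula, numerically stable for very small separations.
    a = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600

radius_arcsec = 1.0  # Same value as CROSS_MATCHING_RADIUS above.

# Made-up positions: 0.5 arcsec apart in DEC, and 1 arcsec apart in RA at DEC = -35 deg.
sep_dec = angular_separation_arcsec(60.0, -35.0, 60.0, -35.0 + 0.5 / 3600)
sep_ra = angular_separation_arcsec(60.0, -35.0, 60.0 + 1.0 / 3600, -35.0)

print(sep_dec <= radius_arcsec)
print(sep_ra <= radius_arcsec)
```

Note that an offset of 1 arcsec in RA corresponds to a smaller separation on the sky (scaled by the cosine of the declination), so both made-up pairs fall within the radius.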

Reading the catalog into a dask dataframe.

In [None]:
DP02_xmatched_from_disk = lsdb.read_hipscat(OUTPUT_HIPSCAT_DIR)
DP02_xmatched_from_disk_delayed = DP02_xmatched_from_disk.to_delayed()
DP02_xmatched_from_disk_ddf = dd.from_delayed(DP02_xmatched_from_disk_delayed)

print(OUTPUT_HIPSCAT_DIR)

Showing the first rows of the dataframe and the basic statistics.

In [None]:
DP02_xmatched_from_disk_ddf.head()

In [None]:
DP02_xmatched_from_disk_ddf.describe().compute()

# Making the scatter plot

First of all, let us define the geoviews Points element.

In [None]:
ra = DP02_xmatched_from_disk_ddf['coord_raDP02_object_even_tracts']
dec = DP02_xmatched_from_disk_ddf['coord_decDP02_object_even_tracts']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])
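
Plotting `-ra` instead of `ra` makes right ascension increase to the left, as in sky charts. The sketch below shows the mapping on a few made-up values. The wrap into the ±180° range is an assumption added here for right ascensions above 180°; for values below 180° the result is the plain sign flip used in the cell above.

```python
def flip_ra(ra_deg):
    """Map RA in [0, 360) deg to a plotting longitude with RA increasing leftwards."""
    # Wrap into [-180, 180) first (an addition of this sketch), then flip the sign.
    wrapped = ((ra_deg + 180.0) % 360.0) - 180.0
    return -wrapped

# Made-up right ascensions, in degrees.
for ra_value in (60.0, 75.0, 350.0):
    print(ra_value, "->", flip_ra(ra_value))
```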

#### Note

In what follows, if cartopy tries to download a file from Natural Earth, check the path of the cartopy data directory with
```python
import cartopy
print(cartopy.config['data_dir'])
```
Then, download the file manually to the ```shapefiles/natural_earth/physical``` folder inside this directory and unzip it.

### Plot using the Plate Carrée projection

Defining the title, the plot size, and the padding.

In [None]:
title = 'Spatial distribution - X-matched objects of even and odd tracts - Plate Carrée projection'
height = 500
width = 1000
padding = 0.05

Making the plot with geoviews and datashader.

In [None]:
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

### Plot using the Mollweide projection

Defining the title, the plot size, and the padding.

In [None]:
title = 'Spatial distribution - X-matched objects of even and odd tracts - Mollweide projection'
height = 500
width = 1000
padding = 0.05

Defining the RA and DEC ticks for the Mollweide projection.

In [None]:
longitudes = np.arange(30, 360, 30)
latitudes = np.arange(-75, 76, 15)

lon_labels = [f"{lon}°" for lon in longitudes]
lat_labels = [f"{lat}°" for lat in latitudes]

labels_data = {
    "lon": list(np.flip(longitudes)) + [-180] * len(latitudes),
    "lat": [0] * len(longitudes) + list(latitudes),
    "label": lon_labels + lat_labels,
}

df_labels = pd.DataFrame(labels_data)

labels_plot = gv.Labels(df_labels, kdims=["lon", "lat"], vdims=["label"]).opts(
    text_font_size="12pt",
    text_color="black",
    text_align='right',
    text_baseline='bottom',
    projection=crs.Mollweide()
)
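
Because the plotted longitude is `-RA`, the longitude labels are attached to the flipped list of positions. The sketch below reproduces the pairing with plain Python ranges (equivalent to the `np.arange` calls above) to show that each label lands where the plotted longitude equals minus the labelled right ascension, modulo 360°.

```python
# Same tick values as the np.arange calls above, built with plain ranges.
longitudes = list(range(30, 360, 30))
latitudes = list(range(-75, 76, 15))

lon_positions = longitudes[::-1]  # Same ordering as np.flip(longitudes).
lon_labels = [f"{lon}°" for lon in longitudes]

# Each label is drawn at a position congruent to minus its RA (mod 360).
pairs = list(zip(lon_positions, lon_labels))
print(pairs[:3])
print(len(longitudes), len(latitudes))
```

For example, the label "30°" is placed at longitude 330°, which is the same meridian as -30°.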

Making the plot with geoviews and datashader.

In [None]:
projected = gv.operation.project(ra_dec_points_minusRA, projection=crs.Mollweide())

Mollweide_rasterized_points = rasterize(projected, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Mollweide_spread_points = dynspread(Mollweide_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'])

grid = gf.grid()

Mollweide_spread_points * grid * labels_plot

# Printing and saving just the objectIds

In [None]:
df_ids = DP02_xmatched_from_disk[['objectIdDP02_object_even_tracts', 'objectIdDP02_object_odd_tracts']].compute()

In [None]:
number_to_save = 30
df_ids.head(number_to_save).to_csv(f'duplicates_even_tracts_vs_odd_tracts_sample_of_{number_to_save}_objects.csv')

# Closing the client and cluster

In [None]:
# Closing the client and the cluster
client.close()
cluster.close()