<img align='left' src = '../images/linea.png' width=150 style='padding: 20px'> 

# DP02 duplicates analysis
## Part 3 - Analysis of the spatial distribution

Analysis of duplicates found in the DP02 catalog.

Contacts: Luigi Silva ([luigi.silva@linea.org.br](mailto:luigi.silva@linea.org.br)); Julia Gschwend ([julia@linea.org.br](mailto:julia@linea.org.br)).

Last check: 04/10/2024

#### Acknowledgments

'_This notebook used computational resources from the Associação Laboratório Interinstitucional de e-Astronomia (LIneA) with financial support from the INCT of e-Universe (Process No. 465376/2014-2)._'

'_This notebook uses libraries from the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) Frameworks project, such as the hipscat, hipscat_import, and lsdb libraries. The LINCC Frameworks project is supported by Schmidt Sciences. It is also based on work supported by the National Science Foundation under Grant No. AST-2003196. Additionally, it receives support from the DIRAC Institute at the Department of Astronomy of the University of Washington. The DIRAC Institute is supported by gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences and the Washington Research Foundation._'

# Imports and Configs

Let us import the packages that we will need.

In [None]:
########################### GENERAL ##########################
import os
import re
import glob
import getpass
import warnings
import tables_io
import numpy as np
import pandas as pd
from pathlib import Path
############################ DASK ############################
from dask import dataframe as dd
from dask import delayed
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster
########################## HIPSCAT ###########################
import hipscat
from hipscat.catalog import Catalog
from hipscat.inspection import plot_pixels
from hipscat_import.catalog.file_readers import ParquetReader
from hipscat_import.margin_cache.margin_cache_arguments import MarginCacheArguments
from hipscat_import.pipeline import ImportArguments, pipeline_with_client
############################ LSDB ############################
import lsdb
######################## VISUALIZATION #######################
### BOKEH
import bokeh
from bokeh.io import output_notebook, show
from bokeh.models import ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256

### HOLOVIEWS
import holoviews as hv
from holoviews import opts
import holoviews.operation.datashader as hd
from holoviews.operation.datashader import rasterize, dynspread, datashade

### GEOVIEWS
import geoviews as gv
import geoviews.feature as gf
from cartopy import crs

### DATASHADER
import datashader as ds

### MATPLOTLIB
import matplotlib.pyplot as plt
########################## ASTRONOMY #########################
from astropy import units as u
from astropy.coordinates import SkyCoord
from astropy.units.quantity import Quantity

Now, let us configure the plots to be inline.

In [None]:
hv.extension('bokeh')
gv.extension('bokeh')
output_notebook()
%matplotlib inline

Now, let us define the paths to save the logs and outputs.

In [None]:
user = getpass.getuser()
base_path = f'/lustre/t0/scratch/users/{user}/report_hipscat/'

In [None]:
output_dir = os.path.join(base_path, 'output')
logs_dir = os.path.join(base_path, 'logs')
os.makedirs(output_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

Then, let us define the parameters for the cluster.

In [None]:
# Configuring the SLURMCluster.
cluster = SLURMCluster(
    interface="ib0",    # Network interface (InfiniBand)
    queue='cpu_small',  # Name of the queue
    cores=28,           # Total number of cores per job
    processes=7,        # Number of dask worker processes per job
    memory='30GB',      # Total memory per job
    walltime='06:00:00',  # Maximum execution time
    job_extra_directives=[
        '--propagate',
        f'--output={output_dir}/dask_job_%j.out',  
        f'--error={output_dir}/dask_job_%j.err'
    ],
)

# Scaling the cluster to 10 jobs
cluster.scale(jobs=10)

# Defining the dask client
client = Client(cluster)
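As a rough guide to what the configuration above implies, the sketch below works out the resources each dask worker receives. It assumes the usual dask-jobqueue behavior, in which the `cores` and `memory` of a job are divided among its `processes`; the helper name `worker_resources` is hypothetical and not part of any library.

```python
def worker_resources(cores, processes, memory_gb, jobs):
    """Estimate per-worker resources for a dask-jobqueue style configuration.

    Assumption: each job is split into `processes` workers that share
    the job's cores and memory evenly.
    """
    return {
        "threads_per_worker": cores // processes,
        "memory_per_worker_gb": memory_gb / processes,
        "total_workers": jobs * processes,
    }


# Values taken from the SLURMCluster configuration in this notebook.
resources = worker_resources(cores=28, processes=7, memory_gb=30, jobs=10)
print(resources)
```

With the values used here, each worker runs 4 threads with a little over 4 GB of memory, and the 10 jobs provide 70 workers in total. If workers run out of memory while persisting the dataframe, lowering `processes` is one way to give each worker a larger share.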

# Getting the paths of the object catalog files

Let us get a list of the paths to the catalog's Parquet files.

In [None]:
path = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/*.parq'

In [None]:
total_files = glob.glob(path)

In [None]:
total_files[0:5]

In [None]:
len(total_files)

## Reading the catalog into a dask dataframe

Now, let us read the files into a dask dataframe.

In [None]:
ddf = dd.read_parquet(total_files)

selected_columns = ['coord_ra', 'coord_dec', 'u_cModelFlux', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 
                    'z_cModelFlux', 'y_cModelFlux', 'u_cModelFluxErr', 'g_cModelFluxErr', 'r_cModelFluxErr', 'i_cModelFluxErr', 
                    'z_cModelFluxErr', 'y_cModelFluxErr', 'detect_isPrimary']

ddf_small = ddf[selected_columns]

ddf_small = ddf_small.persist()

# Spatial distributions of all objects, without any filter

----------------------------------------------------------------------------------------------
#### Note

In what follows, if cartopy tries to download some file from natural earth, check the path of the cartopy data directory with
```python
import cartopy
print(cartopy.config['data_dir'])
```
Then, download the file manually to the ```shapefiles/natural_earth/physical``` folder inside this directory and unzip it.
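The manual step can be scripted. The following is only a sketch: the helper `install_natural_earth_zip` is hypothetical, and it assumes the archive was already downloaded by hand and that its files can be extracted directly into the target folder.

```python
import zipfile
from pathlib import Path


def install_natural_earth_zip(zip_path, data_dir, category="physical"):
    """Unzip a manually downloaded Natural Earth archive into the folder
    shapefiles/natural_earth/<category> inside the cartopy data directory.

    `data_dir` is expected to be the value of cartopy.config['data_dir'].
    """
    target = Path(data_dir) / "shapefiles" / "natural_earth" / category
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call would look like `install_natural_earth_zip(zip_path, cartopy.config['data_dir'])`, where `zip_path` points to the downloaded archive.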

----------------------------------------------------------------------------------------------

First of all, let us define the geoviews Points element.

In [None]:
points = gv.Points(ddf_small, kdims=['coord_ra', 'coord_dec'])

## Plot using the Plate Carrée projection

Defining the title, the plot size and the padding.

In [None]:
title = 'Spatial Distribution - All Objects - Plate Carrée Projection'
height = 500
width = 1000
padding = 0.05

Making the plot with geoviews and datashader.

In [None]:
Plate_Carree_rasterized_points = rasterize(points, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(
    width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True,
    tools=['box_select'], show_grid=True, 
    invert_xaxis=True  # Invert the x-axis
)

Plate_Carree_spread_points
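To make the rasterization step more concrete, the sketch below reproduces the same idea with plain NumPy: points are counted per pixel and the counts are shown on a logarithmic scale. It is an analogy to `rasterize` with `ds.count()` and `cnorm='log'`, not the datashader implementation, and the random coordinates and bin numbers are made up for illustration.

```python
import numpy as np

# Hypothetical coordinates standing in for coord_ra and coord_dec.
rng = np.random.default_rng(0)
ra = rng.uniform(50.0, 75.0, size=10_000)
dec = rng.uniform(-45.0, -25.0, size=10_000)

# Count the number of points falling in each pixel of a 100 x 50 grid.
counts, ra_edges, dec_edges = np.histogram2d(ra, dec, bins=(100, 50))

# Logarithmic scaling; empty pixels become NaN so they are left blank.
log_counts = np.log10(np.where(counts > 0, counts, np.nan))

print(counts.shape, int(counts.sum()))
```

The logarithmic scale keeps sparsely populated pixels visible next to very dense ones, which is why it is used for the density maps in this notebook.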

## Plot using the Mollweide projection

Defining the title, the plot size and the padding.

In [None]:
title = 'Spatial Distribution - All Objects - Mollweide Projection'
height = 500
width = 1000
padding = 0.05

Defining the RA and DEC ticks for the Mollweide projection.

In [None]:
longitudes = np.arange(30, 360, 30)
latitudes = np.arange(-75, 76, 15)

lon_labels = [f"{lon}°" for lon in longitudes]
lon_labels.reverse()
lat_labels = [f"{lat}°" for lat in latitudes]

labels_data = {
    "lon": list(np.flip(longitudes)) + [180] * len(latitudes),
    "lat": [0] * len(longitudes) + list(latitudes),
    "label": lon_labels + lat_labels,
}

df_labels = pd.DataFrame(labels_data)

labels_plot = gv.Labels(df_labels, kdims=["lon", "lat"], vdims=["label"]).opts(
    text_font_size="12pt",
    text_color="black",
    text_align='right',
    text_baseline='bottom',
    projection=crs.Mollweide()
)

Making the plot with geoviews and datashader.

In [None]:
Mollweide_rasterized_points = rasterize(points, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Mollweide_spread_points = dynspread(Mollweide_rasterized_points).opts(
    width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True,
    invert_xaxis=True
)

grid = gf.grid()

(Mollweide_spread_points).options(opts.Points(projection=crs.Mollweide())) * grid * labels_plot

# Spatial distributions of objects with ```detect_isPrimary==True```

The ```detect_isPrimary``` flag is true when all of the following conditions hold:

1) The source is located in the interior of a patch and tract (```detect_isPatchInner & detect_isTractInner```)

2) The source is not a sky object (```~merge_peak_sky``` for coadds or ```~sky_source``` for single visits)

3) The source is either an isolated parent that is un-modeled or deblended from a parent with multiple children (```isDeblendedSource```)

Source: https://pipelines.lsst.io/modules/lsst.pipe.tasks/deblending-flags-overview.html
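The way the three conditions combine can be sketched on a toy table. The flag names follow the list above, the values are invented, and the actual column names in the catalog may differ, so this is an illustration rather than the pipeline's own code.

```python
import pandas as pd

# Toy flags for four hypothetical sources (coadd case).
flags = pd.DataFrame({
    "detect_isPatchInner": [True, True, False, True],
    "detect_isTractInner": [True, True, True, True],
    "merge_peak_sky": [False, True, False, False],
    "isDeblendedSource": [True, True, True, False],
})

# A source is primary only if all three conditions hold.
is_primary = (
    flags["detect_isPatchInner"]
    & flags["detect_isTractInner"]
    & ~flags["merge_peak_sky"]
    & flags["isDeblendedSource"]
)

print(is_primary.tolist())
```

Only the first toy source passes: the second is a sky object, the third lies outside the patch interior, and the fourth is not a deblended source.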

First of all, let us define the filtered dask dataframe.

In [None]:
ddf_small_filtered = ddf_small[ddf_small['detect_isPrimary']==True]

Next, let us define the geoviews Points element.

In [None]:
points_filtered = gv.Points(ddf_small_filtered, kdims=['coord_ra', 'coord_dec'])

## Plot using the Plate Carrée projection

Defining the title, the plot size and the padding.

In [None]:
title = 'Spatial Distribution - detect_isPrimary==True - Plate Carrée Projection'
height = 500
width = 1000
padding = 0.05

Making the plot with geoviews and datashader.

In [None]:
Plate_Carree_rasterized_points_filtered = rasterize(points_filtered, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points_filtered = dynspread(Plate_Carree_rasterized_points_filtered).opts(
    width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True,
    tools=['box_select'], show_grid=True, 
    invert_xaxis=True  # Invert the x-axis
)

Plate_Carree_spread_points_filtered

## Plot using the Mollweide projection

Defining the title, the plot size and the padding.

In [None]:
title = 'Spatial Distribution - detect_isPrimary==True - Mollweide Projection'
height = 500
width = 1000
padding = 0.05

Defining the RA and DEC ticks for the Mollweide projection.

In [None]:
longitudes = np.arange(30, 360, 30)
latitudes = np.arange(-75, 76, 15)

lon_labels = [f"{lon}°" for lon in longitudes]
lon_labels.reverse()
lat_labels = [f"{lat}°" for lat in latitudes]

labels_data = {
    "lon": list(np.flip(longitudes)) + [180] * len(latitudes),
    "lat": [0] * len(longitudes) + list(latitudes),
    "label": lon_labels + lat_labels,
}

df_labels = pd.DataFrame(labels_data)

labels_plot = gv.Labels(df_labels, kdims=["lon", "lat"], vdims=["label"]).opts(
    text_font_size="12pt",
    text_color="black",
    text_align='right',
    text_baseline='bottom',
    projection=crs.Mollweide()
)

Making the plot with geoviews and datashader.

In [None]:
Mollweide_rasterized_points_filtered = rasterize(points_filtered, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Mollweide_spread_points_filtered = dynspread(Mollweide_rasterized_points_filtered).opts(
    width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True,
    invert_xaxis=True
)

grid = gf.grid()

(Mollweide_spread_points_filtered).options(opts.Points(projection=crs.Mollweide())) * grid * labels_plot

# Closing the client and cluster

In [None]:
# Closing the client and the cluster
client.close()
cluster.close()