<img align='left' src = '../images/linea.png' width=150 style='padding: 20px'> 

# DP02 duplicates analysis
## Part 1 - Analysis of two tracts

Analysis of duplicates found in the DP02 catalog.

Contacts: Luigi Silva ([luigi.silva@linea.org.br](mailto:luigi.silva@linea.org.br)); Julia Gschwend ([julia@linea.org.br](mailto:julia@linea.org.br)).

Last check: 03/10/2024

#### Acknowledgments

'_This notebook used computational resources from the Associação Laboratório Interinstitucional de e-Astronomia (LIneA) with financial support from the INCT of e-Universe (Process No. 465376/2014-2)._'

'_This notebook uses libraries from the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) Frameworks project, such as the hipscat, hipscat_import, and lsdb libraries. The LINCC Frameworks project is supported by Schmidt Sciences. It is also based on work supported by the National Science Foundation under Grant No. AST-2003196. Additionally, it receives support from the DIRAC Institute at the Department of Astronomy of the University of Washington. The DIRAC Institute is supported by gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences and the Washington Research Foundation._'

# Imports and Configs

Let us import the packages that we will need.

In [None]:
import os
import dask
from dask import dataframe as dd
from dask import delayed
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster
import tables_io
import pandas as pd
import getpass

Now, let us define the paths to save the logs and outputs.

In [None]:
user = getpass.getuser()
base_path = f'/lustre/t0/scratch/users/{user}/report_hipscat/'

In [None]:
output_dir = os.path.join(base_path, 'output')
logs_dir = os.path.join(base_path, 'logs')
os.makedirs(output_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

Then, let us define the parameters for the cluster.

In [None]:
# Configuring the SLURMCluster.
cluster = SLURMCluster(
    interface="ib0",      # Network interface (InfiniBand)
    queue='cpu_small',    # Name of the queue
    cores=30,             # Total number of cores per job
    processes=15,         # Number of dask worker processes per job
    memory='20GB',        # Total memory per job
    walltime='06:00:00',  # Maximum execution time
    job_extra_directives=[
        '--propagate',
        f'--output={output_dir}/dask_job_%j.out',  
        f'--error={output_dir}/dask_job_%j.err'
    ],
)

# Scaling the cluster to 6 jobs
cluster.scale(jobs=6)

# Defining the dask client
client = Client(cluster)

# Analyzing two tracts of the DP02 object table

First, let us define the paths to the parquets of the considered tracts.

In [None]:
path_tract4029 = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/objectTable_tract_4029_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_18_20220220T153612Z.parq'
path_tract4030 = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/objectTable_tract_4030_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_18_20220220T153612Z.parq'

Now, let us read the parquet files with dask.

In [None]:
ddf_tract4029 = dd.read_parquet(path_tract4029)
ddf_tract4030 = dd.read_parquet(path_tract4030)

Here, we use ```.compute()``` to generate pandas dataframes from the dask dataframes. The pandas dataframes must be small; otherwise, they may exhaust the memory available to the Jupyter kernel. So, we select only a few columns.

In [None]:
#selected_columns = ['coord_ra', 'coord_dec', 'u_cModelFlux', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 
#                    'z_cModelFlux', 'y_cModelFlux', 'u_cModelFluxErr', 'g_cModelFluxErr', 'r_cModelFluxErr', 'i_cModelFluxErr', 
#                    'z_cModelFluxErr', 'y_cModelFluxErr', 'detect_isPrimary']

selected_columns = ['coord_ra', 'coord_dec', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 
                    'g_cModelFluxErr', 'r_cModelFluxErr', 'i_cModelFluxErr', 'detect_isPrimary']

df_tract4029_small = ddf_tract4029[selected_columns].compute()
df_tract4030_small = ddf_tract4030[selected_columns].compute()
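
The effect of selecting only a few columns can be estimated with ```DataFrame.memory_usage```. The sketch below uses a hypothetical toy frame (```toy```, with made-up column names) rather than the DP0.2 table, so the numbers it prints are only illustrative.

```python
import pandas as pd

# Hypothetical toy frame standing in for a wide catalog: 9 float columns, 1000 rows.
toy = pd.DataFrame({f"col_{i}": [0.0] * 1000 for i in range(9)})

# Bytes held in memory by the full frame and by a 3-column selection.
full_bytes = int(toy.memory_usage(deep=True).sum())
small_bytes = int(toy[["col_0", "col_1", "col_2"]].memory_usage(deep=True).sum())

print(f"full: {full_bytes} bytes, selection: {small_bytes} bytes")
```

On the real tables, the same call on ```df_tract4029_small``` would report how much memory the selection occupies after ```.compute()```.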

## Checking for duplicates in tract 4029, considering the R.A. and DEC coordinates

Now, we check for duplicates in tract 4029 based on the R.A. and DEC coordinates, keeping every row of each duplicated pair (```keep=False```), and sort the result by the ```coord_ra``` column.

In [None]:
df_tract4029_duplicates = df_tract4029_small[df_tract4029_small[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')
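
To make the role of ```keep=False``` explicit, here is a minimal sketch on a hypothetical toy frame (the values and ```objectId``` numbers are invented for illustration). With ```keep=False```, every row of a repeated coordinate pair is flagged, while the default ```keep='first'``` leaves the first occurrence unflagged.

```python
import pandas as pd

# Hypothetical toy catalog: the first two rows share the same (coord_ra, coord_dec).
toy = pd.DataFrame(
    {
        "coord_ra": [10.0, 10.0, 11.0, 12.0],
        "coord_dec": [-30.0, -30.0, -30.0, -31.0],
        "g_cModelFlux": [1.5, 2.5, 3.0, 4.0],
    },
    index=pd.Index([101, 102, 103, 104], name="objectId"),
)

# keep=False flags every member of a duplicated pair.
mask_all = toy[["coord_ra", "coord_dec"]].duplicated(keep=False)

# The default (keep='first') flags only the later occurrences.
mask_first = toy[["coord_ra", "coord_dec"]].duplicated()

# Same pattern as in the notebook: select flagged rows and sort by coord_ra.
duplicates = toy[mask_all].sort_values("coord_ra")
print(duplicates)
```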

In [None]:
df_tract4029_duplicates.head(10)

In [None]:
df_tract4029_duplicates.describe()

Now, let us filter to see only the objects that have non-NaN values in the ```g_cModelFlux``` column.

In [None]:
df_tract4029_duplicates_g_not_nan = df_tract4029_duplicates[df_tract4029_duplicates['g_cModelFlux'].notna()]
df_tract4029_duplicates_g_not_nan.head(10)

In [None]:
df_tract4029_duplicates_g_not_nan.describe()

As we can see, there are objects with different ```objectId``` values that share the same R.A. and DEC coordinates. However, their fluxes differ.
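
One possible way to check this statement programmatically, rather than by inspecting the first rows, is to count the distinct flux values per coordinate pair. The sketch below uses a hypothetical toy frame (```toy_duplicates```, with invented values). It is not part of the original analysis.

```python
import pandas as pd

# Hypothetical duplicated rows: two positions, each appearing twice.
toy_duplicates = pd.DataFrame(
    {
        "coord_ra": [10.0, 10.0, 11.0, 11.0],
        "coord_dec": [-30.0, -30.0, -31.0, -31.0],
        "g_cModelFlux": [1.5, 2.5, 3.0, 3.0],
    },
    index=pd.Index([101, 102, 103, 104], name="objectId"),
)

# Number of distinct (non-NaN) g-band fluxes at each position.
flux_values_per_position = toy_duplicates.groupby(
    ["coord_ra", "coord_dec"]
)["g_cModelFlux"].nunique()

# Positions where the rows sharing the coordinates have different fluxes.
positions_with_distinct_flux = flux_values_per_position[flux_values_per_position > 1]
print(positions_with_distinct_flux)
```

Applied to ```df_tract4029_duplicates_g_not_nan```, the same grouping would show how many duplicated positions carry more than one flux value.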

## Checking for duplicates in tract 4030, considering the R.A. and DEC coordinates

Now, we check for duplicates in tract 4030 based on the R.A. and DEC coordinates, keeping every row of each duplicated pair (```keep=False```), and sort the result by the ```coord_ra``` column.

In [None]:
df_tract4030_duplicates = df_tract4030_small[df_tract4030_small[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')

In [None]:
df_tract4030_duplicates.head(10)

In [None]:
df_tract4030_duplicates.describe()

Now, let us filter to see only the objects that have non-NaN values in the ```g_cModelFlux``` column.

In [None]:
df_tract4030_duplicates_g_not_nan = df_tract4030_duplicates[df_tract4030_duplicates['g_cModelFlux'].notna()]

In [None]:
df_tract4030_duplicates_g_not_nan.head(10)

In [None]:
df_tract4030_duplicates_g_not_nan.describe()

Again, there are objects with different ```objectId``` values that share the same R.A. and DEC coordinates. However, their fluxes differ.

## Checking for duplicates in both tracts concatenated, considering the R.A. and DEC coordinates

Now, we will check for duplicates in both tracts concatenated, considering the R.A. and DEC coordinates, and we sort the values based on the ```coord_ra``` column.

First, let us concatenate the pandas dataframes of the two tracts.

In [None]:
df_concat_4029_4030 = pd.concat([df_tract4029_small, df_tract4030_small])

Now, let us search for duplicates.

In [None]:
df_concat_4029_4030[df_concat_4029_4030[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')

As we can see, the number of duplicates is exactly the sum of the number of duplicates of both tracts individually (4029 and 4030), that is, $40199\text{ rows} + 37817 \text{ rows} = 78016 \text{ rows}$. So, it seems that the duplicates come from within each individual tract, and not from the combination of both tracts.
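
The comparison of row counts can be sketched as follows. The helper ```count_coord_duplicates``` and the toy frames are hypothetical, introduced only to illustrate the arithmetic. Note that a difference of zero is consistent with no cross-tract duplicates but does not strictly prove it: a position duplicated inside both tracts would also leave the sum unchanged.

```python
import pandas as pd


def count_coord_duplicates(df):
    """Count rows whose (coord_ra, coord_dec) pair appears more than once."""
    return int(df[["coord_ra", "coord_dec"]].duplicated(keep=False).sum())


# Hypothetical toy tracts: each has one internally duplicated position,
# and no position is shared between the two.
toy_a = pd.DataFrame({"coord_ra": [1.0, 1.0, 2.0], "coord_dec": [5.0, 5.0, 6.0]})
toy_b = pd.DataFrame({"coord_ra": [3.0, 3.0, 4.0], "coord_dec": [7.0, 7.0, 8.0]})

n_a = count_coord_duplicates(toy_a)
n_b = count_coord_duplicates(toy_b)
n_concat = count_coord_duplicates(pd.concat([toy_a, toy_b]))

# Rows flagged only after concatenation, i.e. candidates for cross-tract duplicates.
extra_rows = n_concat - (n_a + n_b)
print(n_a, n_b, n_concat, extra_rows)
```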

## Checking for duplicates in both tracts concatenated, considering the objectId column (index)

Now, we will check for duplicates in both tracts concatenated, considering just the ```objectId``` column, and we sort the values based on the ```coord_ra``` column.

In [None]:
df_concat_4029_4030[df_concat_4029_4030.index.duplicated(keep=False)].sort_values('coord_ra')

As we can see, there are no duplicated ```objectId```, although, as we saw before, there are objects that have the same R.A. and DEC, but with different ```objectId``` values.
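
As a compact alternative to inspecting an empty dataframe, the uniqueness of the index can be tested directly. The sketch below uses hypothetical toy frames with invented ```objectId``` values. ```Index.is_unique``` and the ```verify_integrity``` option of ```pd.concat``` are standard pandas features, but their use here is a suggestion, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy tracts indexed by objectId, with no identifier in common.
toy_a = pd.DataFrame({"coord_ra": [1.0, 2.0]}, index=pd.Index([101, 102], name="objectId"))
toy_b = pd.DataFrame({"coord_ra": [3.0, 4.0]}, index=pd.Index([201, 202], name="objectId"))

toy_concat = pd.concat([toy_a, toy_b])

# True when no objectId is repeated in the concatenated index.
ids_are_unique = toy_concat.index.is_unique

# Same information expressed as a count of rows with a repeated objectId.
n_duplicated_ids = int(toy_concat.index.duplicated(keep=False).sum())

# verify_integrity=True makes pd.concat raise ValueError on overlapping indexes.
try:
    pd.concat([toy_a, toy_a], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(ids_are_unique, n_duplicated_ids, overlap_detected)
```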

# Closing the client and cluster

In [None]:
# Closing the client and the cluster
client.close()
cluster.close()