<img align='left' src = '../images/linea.png' width=150 style='padding: 20px'> 

# DP02 duplicates analysis
## Part 1 - Analysis of two tracts

Analysis of duplicates found in the DP02 catalog.

Contacts: Luigi Silva ([luigi.silva@linea.org.br](mailto:luigi.silva@linea.org.br)); Julia Gschwend ([julia@linea.org.br](mailto:julia@linea.org.br)).

Last check: 10/10/2024

#### Acknowledgments

'_This notebook used computational resources from the Associação Laboratório Interinstitucional de e-Astronomia (LIneA) with financial support from the INCT of e-Universe (Process No. 465376/2014-2)._'

'_This notebook uses libraries from the LSST Interdisciplinary Network for Collaboration and Computing (LINCC) Frameworks project, such as the hipscat, hipscat_import, and lsdb libraries. The LINCC Frameworks project is supported by Schmidt Sciences. It is also based on work supported by the National Science Foundation under Grant No. AST-2003196. Additionally, it receives support from the DIRAC Institute at the Department of Astronomy of the University of Washington. The DIRAC Institute is supported by gifts from the Charles and Lisa Simonyi Fund for Arts and Sciences and the Washington Research Foundation._'

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
# Imports and Configs

Let us import the packages that we will need.

In [None]:
########################### GENERAL ##########################
import os
import tables_io
import pandas as pd
import getpass
############################ DASK ############################
import dask
from dask import dataframe as dd
from dask import delayed
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster
######################## VISUALIZATION #######################
### BOKEH
import bokeh
from bokeh.io import output_notebook, show
from bokeh.models import ColorBar, LinearColorMapper
from bokeh.palettes import Viridis256
### HOLOVIEWS
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import rasterize, dynspread
### GEOVIEWS
import geoviews as gv
import geoviews.feature as gf
from cartopy import crs
### DATASHADER
import datashader as ds
### MATPLOTLIB
import matplotlib.pyplot as plt

Now, let us configure the plots to be inline.

In [None]:
hv.extension('bokeh')
gv.extension('bokeh')
output_notebook()
%matplotlib inline

Now, let us define the paths to save the logs and outputs.

In [None]:
user = getpass.getuser()
base_path = f'/lustre/t0/scratch/users/{user}/report_hipscat/'

In [None]:
output_dir = os.path.join(base_path, 'output')
logs_dir = os.path.join(base_path, 'logs')
os.makedirs(output_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

Then, let us define the parameters for the cluster.

In [None]:
# Configuring the SLURMCluster.
cluster = SLURMCluster(
    interface="ib0",      # Network interface (ib0 is usually the InfiniBand interface)
    queue='cpu_small',    # Name of the queue
    cores=28,             # Number of logical cores per node
    processes=7,          # Number of dask processes per node
    memory='30GB',        # Memory per node
    walltime='06:00:00',  # Maximum execution time
    job_extra_directives=[
        '--propagate',
        f'--output={output_dir}/dask_job_%j.out',  
        f'--error={output_dir}/dask_job_%j.err'
    ],
)

# Scaling the cluster to 10 SLURM jobs
cluster.scale(jobs=10)

# Defining the dask client
client = Client(cluster)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
# Analyzing two tracts of the DP02 object table

First, let us define the paths to the parquets of the considered tracts.

In [None]:
path_tract4029 = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/objectTable_tract_4029_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_18_20220220T153612Z.parq'
path_tract4030 = '/lustre/t1/cl/lsst/dp02/primary/catalogs/object/objectTable_tract_4030_DC2_2_2i_runs_DP0_2_v23_0_1_PREOPS-905_step3_18_20220220T153612Z.parq'

Now, let us read the parquet files with dask.

In [None]:
ddf_tract4029 = dd.read_parquet(path_tract4029)
ddf_tract4030 = dd.read_parquet(path_tract4030)

Here, we use ```.compute()``` to generate pandas dataframes from the dask dataframes. The resulting pandas dataframes are held entirely in the memory of the Jupyter session, so they must be small. For this reason, we select only a few columns.

In [None]:
selected_columns = ['coord_ra', 'coord_dec', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 'g_cModelFluxErr', 
                    'r_cModelFluxErr', 'i_cModelFluxErr', 'detect_isDeblendedSource', 'detect_isPatchInner', 'detect_isTractInner', 'merge_peak_sky', 'detect_isPrimary']

df_tract4029_small = ddf_tract4029[selected_columns].compute()
df_tract4030_small = ddf_tract4030[selected_columns].compute()
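
To get a feeling for how much memory the column selection saves, the sketch below compares the in-memory size of a toy table with and without an extra column. The table size, the column ```some_other_column```, and the variable names are hypothetical and only serve as an illustration; the same ```memory_usage``` call could be applied to the dataframes above.

```python
import numpy as np
import pandas as pd

n_rows = 1000  # hypothetical table size, for illustration only
toy = pd.DataFrame(
    {
        "coord_ra": np.zeros(n_rows),
        "coord_dec": np.zeros(n_rows),
        "g_cModelFlux": np.zeros(n_rows),
        "detect_isPrimary": np.ones(n_rows, dtype=bool),
        "some_other_column": np.zeros(n_rows),  # hypothetical column we do not need
    }
)
toy_selected_columns = ["coord_ra", "coord_dec", "g_cModelFlux", "detect_isPrimary"]

# memory_usage returns the number of bytes per column; we sum over the columns.
bytes_full = int(toy.memory_usage(index=False).sum())
bytes_selected = int(toy[toy_selected_columns].memory_usage(index=False).sum())

print(f"full table: {bytes_full} bytes, selected columns: {bytes_selected} bytes")
```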

---------------------------------------------------------------------------------------------------------
## Checking for duplicates in tract 4029, considering the exact match of the R.A. and DEC coordinates

Now, we check for duplicates in tract 4029, **requiring an exact match of the R.A. and DEC coordinates**, and sort the result by the ```coord_ra``` column. With ```keep=False```, every row of a duplicated group is kept, not only the repeated ones.

In [None]:
df_tract4029_duplicates = df_tract4029_small[df_tract4029_small[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')
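
As a small illustration of what this selection does, the sketch below applies the same ```duplicated``` call to a toy table. The coordinates and the ```objectId``` values are made up and are not DP0.2 data.

```python
import pandas as pd

# Toy catalog with hypothetical values (not DP0.2 data).
toy = pd.DataFrame(
    {
        "coord_ra": [10.0, 10.0, 10.5, 11.0, 11.0, 11.0],
        "coord_dec": [-30.0, -30.0, -30.0, -31.0, -31.0, -31.5],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106], name="objectId"),
)

# keep=False flags every member of a duplicated (ra, dec) group.
mask_all = toy[["coord_ra", "coord_dec"]].duplicated(keep=False)
# keep="first" (the pandas default) leaves the first occurrence unflagged.
mask_repeats = toy[["coord_ra", "coord_dec"]].duplicated(keep="first")

toy_duplicates = toy[mask_all].sort_values("coord_ra")
n_all = int(mask_all.sum())
n_repeats = int(mask_repeats.sum())
print(toy_duplicates)
print(n_all, n_repeats)
```

With ```keep=False```, the row counts quoted in this notebook include both the "original" and its repeats, which is why they are described as "counting the original and the duplicate".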

### Duplicates (exact R.A and DEC) in tract 4029  - General information

First lines of the full duplicates table.

In [None]:
df_tract4029_duplicates.head(10)

Basic statistics.

In [None]:
df_tract4029_duplicates.describe()

Making the plot of all duplicates.

In [None]:
### Defining the points
dataframe = df_tract4029_duplicates
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4029 - All duplicates'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

### Duplicates (exact R.A and DEC) in tract 4029 - Analysis of the flux

Now, let us compare the fluxes of these duplicated objects.

Below, we have the first lines of the table containing the objects that have **non-NaN values in the ```i_cModelFlux``` column.**

In [None]:
df_tract4029_duplicates_i_not_nan = df_tract4029_duplicates[df_tract4029_duplicates['i_cModelFlux'].notna()]
df_tract4029_duplicates_i_not_nan.head(10)

Basic statistics.

In [None]:
df_tract4029_duplicates_i_not_nan.describe()

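Now, let us see if any of these objects have exactly the same fluxes. The checks below look, band by band, for any flux value that is repeated anywhere in this table, discarding zeros and NaN values.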

In [None]:
df_check = df_tract4029_duplicates_i_not_nan[df_tract4029_duplicates_i_not_nan[['g_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check = df_check[df_check['g_cModelFlux'] != 0]
df_check = df_check[df_check['g_cModelFlux'].notna()]
df_check

In [None]:
df_check = df_tract4029_duplicates_i_not_nan[df_tract4029_duplicates_i_not_nan[['r_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check = df_check[df_check['r_cModelFlux'] != 0]
df_check = df_check[df_check['r_cModelFlux'].notna()]
df_check

In [None]:
df_check = df_tract4029_duplicates_i_not_nan[df_tract4029_duplicates_i_not_nan[['i_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check = df_check[df_check['i_cModelFlux'] != 0]
df_check = df_check[df_check['i_cModelFlux'].notna()]
df_check

**So, considering only objects with non-NaN values in the ```i_cModelFlux``` column, the duplicated objects share the same R.A. and DEC coordinates but have different ```objectId``` values and different fluxes. No flux value is repeated exactly.**
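
A complementary way to look at this is to compare the fluxes within each group of identical coordinates, instead of searching for repeated values over the whole table. The following is a minimal sketch on a toy table with hypothetical values; it assumes that rows with NaN fluxes have already been removed, because ```nunique``` ignores NaN.

```python
import pandas as pd

# Toy duplicates table with hypothetical values (not DP0.2 data).
toy = pd.DataFrame(
    {
        "coord_ra": [10.0, 10.0, 11.0, 11.0, 12.0, 12.0],
        "coord_dec": [-30.0, -30.0, -31.0, -31.0, -32.0, -32.0],
        "g_cModelFlux": [5.0, 7.5, 3.0, 3.0, 1.0, 2.0],
        "r_cModelFlux": [6.0, 8.0, 4.0, 4.5, 1.5, 2.5],
        "i_cModelFlux": [6.5, 9.0, 4.2, 4.2, 1.7, 2.9],
    }
)
keys = ["coord_ra", "coord_dec"]
bands = ["g_cModelFlux", "r_cModelFlux", "i_cModelFlux"]

# Number of distinct flux values per coordinate group and band.
n_distinct = toy.groupby(keys)[bands].nunique()
# Number of rows in each coordinate group.
group_size = toy.groupby(keys).size()

# A band has a repeated flux in a group when it has fewer distinct values than rows.
same_flux = n_distinct.lt(group_size, axis=0)
groups_with_same_flux = same_flux[same_flux.any(axis=1)]
print(groups_with_same_flux)
```

In this toy table, only the group at (11.0, -31.0) shares a flux value, in the g and i bands. An empty ```groups_with_same_flux``` would correspond to the conclusion stated above.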

### Duplicates (exact R.A and DEC) in tract 4029 - ```detect_isDeblendedSource``` flag

Let us see what happens if we use the flag ```detect_isDeblendedSource``` in the duplicates table. This flag is ```True``` when:

1) The source is a top level parent and it is isolated (detect_isIsolated & parent==0)

2) The source was deblended from a parent with multiple children and has no children of its own (detect_fromBlend & deblend_nPeaks == 1)

So, by selecting ```False```, we are essentially selecting the sources as they were before the deblending (parents).

#### Flag ```detect_isDeblendedSource == False```
Defining the dataframe. We also require ```detect_isPatchInner==True``` and ```detect_isTractInner==True```, and exclude sky objects with ```merge_peak_sky==False```, to isolate the duplicates that come only from deblending.

In [None]:
dataframe_parent = df_tract4029_duplicates[df_tract4029_duplicates['detect_isDeblendedSource'] == False]
dataframe_parent = dataframe_parent[dataframe_parent['merge_peak_sky'] == False]
dataframe_parent = dataframe_parent[dataframe_parent['detect_isPatchInner'] == True]
dataframe_parent = dataframe_parent[dataframe_parent['detect_isTractInner'] == True]
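
The four successive filters above can also be written as a single boolean mask. The sketch below shows this on a toy table with hypothetical values; ```select_inner``` is a hypothetical helper name, and the sketch assumes that the flag columns have boolean dtype.

```python
import pandas as pd

# Toy flags table with hypothetical values (not DP0.2 data).
toy = pd.DataFrame(
    {
        "detect_isDeblendedSource": [False, False, True, False, True],
        "merge_peak_sky": [False, True, False, False, False],
        "detect_isPatchInner": [True, True, True, False, True],
        "detect_isTractInner": [True, True, True, True, True],
    }
)


def select_inner(df, deblended):
    """Hypothetical helper: keep non-sky rows in the inner patch and tract regions
    whose detect_isDeblendedSource flag equals `deblended`."""
    mask = (
        (df["detect_isDeblendedSource"] == deblended)
        & ~df["merge_peak_sky"]
        & df["detect_isPatchInner"]
        & df["detect_isTractInner"]
    )
    return df[mask]


toy_parents = select_inner(toy, deblended=False)
toy_children = select_inner(toy, deblended=True)
print(toy_parents.index.tolist(), toy_children.index.tolist())
```

The same helper, called with ```deblended=True```, would reproduce the selection of children used later in this section.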

First lines.

In [None]:
dataframe_parent.head(10)

Basic statistics.

In [None]:
dataframe_parent.describe()

Making the plot.

In [None]:
### Defining the points
dataframe = dataframe_parent
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4029 - detect_isDeblendedSource==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, in our duplicates table, we have 87 parent objects, that is, objects as they were before the deblending.**

A parent may also be the parent of another parent (an object that will itself be deblended again). Let us check whether this occurs in our duplicates table by searching for duplicates within the parents table, considering the R.A. and DEC. coordinates.

In [None]:
dataframe_parent_of_parent = dataframe_parent[dataframe_parent[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')
dataframe_parent_of_parent

**So, no parent has another parent as a child, considering exact correspondence of the R.A. and DEC. coordinates. Note that parents may still have other parents as children with slightly different R.A. and DEC. coordinates, which this check cannot detect.**

#### Flag ```detect_isDeblendedSource == False``` - Who are the children?

Now, let us see which children in our duplicates table are associated with these parent objects, considering exact correspondence of R.A. and DEC.

Defining the children dataframe. Again, we require ```detect_isPatchInner==True``` and ```detect_isTractInner==True``` (and ```merge_peak_sky==False```) to isolate the duplicates that come only from deblending.

In [None]:
dataframe_child = df_tract4029_duplicates[df_tract4029_duplicates['detect_isDeblendedSource'] == True]
dataframe_child = dataframe_child[dataframe_child['merge_peak_sky'] == False]
dataframe_child = dataframe_child[dataframe_child['detect_isPatchInner'] == True]
dataframe_child = dataframe_child[dataframe_child['detect_isTractInner'] == True]

Associating the children to the parents.

In [None]:
intersect = pd.merge(dataframe_parent, dataframe_child, on=['coord_ra', 'coord_dec'], how='inner')

filtered_parent = dataframe_parent[dataframe_parent[['coord_ra', 'coord_dec']].apply(tuple, axis=1).isin(intersect[['coord_ra', 'coord_dec']].apply(tuple, axis=1))]
filtered_child = dataframe_child[dataframe_child[['coord_ra', 'coord_dec']].apply(tuple, axis=1).isin(intersect[['coord_ra', 'coord_dec']].apply(tuple, axis=1))]

dataframe_merged = pd.concat([filtered_parent, filtered_child]).sort_values(by=['coord_ra', 'coord_dec']).reset_index(drop=True)
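
The association above keeps the parents whose coordinates also appear among the children, and vice versa. As a sketch of the same idea without the row-wise ```apply(tuple, axis=1)```, the coordinate pairs can be compared through a ```pd.MultiIndex```. The toy tables, their values, and the ```role``` column are hypothetical.

```python
import pandas as pd

keys = ["coord_ra", "coord_dec"]

# Toy parent and child tables with hypothetical values (not DP0.2 data).
toy_parent = pd.DataFrame(
    {"coord_ra": [10.0, 11.0, 12.0], "coord_dec": [-30.0, -31.0, -32.0], "role": "parent"},
    index=[1, 2, 3],
)
toy_child = pd.DataFrame(
    {"coord_ra": [10.0, 12.0, 13.0], "coord_dec": [-30.0, -32.0, -33.0], "role": "child"},
    index=[4, 5, 6],
)

# Build one (ra, dec) key per row and test membership in the other table.
parent_keys = pd.MultiIndex.from_frame(toy_parent[keys])
child_keys = pd.MultiIndex.from_frame(toy_child[keys])

matched_parent = toy_parent[parent_keys.isin(child_keys)]
matched_child = toy_child[child_keys.isin(parent_keys)]

toy_merged = (
    pd.concat([matched_parent, matched_child])
    .sort_values(by=keys)
    .reset_index(drop=True)
)
print(toy_merged)
```

On large tables, this avoids building one Python tuple per row, which is usually the slow part of the ```apply``` approach.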

In [None]:
filtered_child.describe()

In [None]:
dataframe_merged.head(10)

**So, only 29 children have exact correspondence of R.A. and DEC. with the parents listed in the previous dataframe of parents. The parents may have other children, but we do not capture them here, because we are considering only exact correspondence of R.A. and DEC.**

**As we can see, one source of duplicates in the R.A. and DEC. coordinates is the deblending process, because parents and their children are counted together.**

### Duplicates in tract 4029 - ```detect_isPatchInner``` flag

Let us see what happens if we select ```detect_isPatchInner==False``` in the duplicates table. This flag is ```True``` when:
1) The source is in the inner region of a patch.

So, by selecting ```False```, we are essentially selecting sources at the edge of the patch, in the region that overlaps with other patches. These objects are therefore expected to be measured more than once.

#### Flag ```detect_isPatchInner == False```

In [None]:
dataframe_patchinner = df_tract4029_duplicates[df_tract4029_duplicates['detect_isPatchInner'] == False]

First lines.

In [None]:
dataframe_patchinner.head(10)

Basic statistics.

In [None]:
dataframe_patchinner.describe()

Making the plot of duplicates with ```detect_isPatchInner==False```.

In [None]:
### Defining the points
dataframe = dataframe_patchinner
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4029 - detect_isPatchInner==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, 34636 duplicate rows (counting both the original and the duplicate) are objects at the edges of the patches, that is, in the overlap region between different patches, where multiple measurements of the same object are expected.**

### Duplicates in tract 4029 - ```detect_isTractInner``` flag

Let us see what happens if we select ```detect_isTractInner==False``` in the duplicates table. This flag is ```True``` when:
1) The source is in the inner region of a tract.

So, by selecting ```False```, we are essentially selecting sources at the edge of the tract, in the region that overlaps with other tracts. These objects are therefore expected to be measured more than once.

#### Flag ```detect_isTractInner == False```

In [None]:
dataframe_tractinner = df_tract4029_duplicates[df_tract4029_duplicates['detect_isTractInner'] == False]

First lines.

In [None]:
dataframe_tractinner.head(10)

Basic statistics.

In [None]:
dataframe_tractinner.describe()

Making the plot of duplicates with ```detect_isTractInner==False```.

In [None]:
### Defining the points
dataframe = dataframe_tractinner
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4029 - detect_isTractInner==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, 7046 duplicate rows (counting both the original and the duplicate) are objects at the edges of the tract, that is, in the overlap region between different tracts, where multiple measurements of the same object are expected.**
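
The counts obtained flag by flag can overlap, since the same row may be outside both the patch inner region and the tract inner region. A minimal sketch of how the duplicates could be broken down by flag combination is shown below, on a toy table with hypothetical values.

```python
import pandas as pd

# Toy duplicates table with hypothetical flag values (not DP0.2 data).
toy_duplicates = pd.DataFrame(
    {
        "detect_isDeblendedSource": [False, True, True, True, True, False],
        "detect_isPatchInner": [True, True, False, False, False, True],
        "detect_isTractInner": [True, True, True, False, True, True],
    }
)
flags = ["detect_isDeblendedSource", "detect_isPatchInner", "detect_isTractInner"]

# One row per observed combination of flags, with the number of rows in it.
counts = toy_duplicates.groupby(flags).size().rename("n_rows").reset_index()

# Per-flag count, as done in the sections above.
n_patch_outer = int((~toy_duplicates["detect_isPatchInner"]).sum())
print(counts)
print(n_patch_outer)
```

Because each row falls in exactly one combination, the ```n_rows``` column sums to the total number of rows, which makes it easy to see how much of the duplicates table each cause accounts for.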

---------------------------------------------------------------------------------------------------------
## Checking for duplicates in tract 4030, considering the exact match of the R.A. and DEC coordinates

Now, we check for duplicates in tract 4030, **requiring an exact match of the R.A. and DEC coordinates**, and sort the result by the ```coord_ra``` column.

In [None]:
df_tract4030_duplicates = df_tract4030_small[df_tract4030_small[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')

### Duplicates (exact R.A and DEC) in tract 4030  - General information

First lines of the full duplicates table.

In [None]:
df_tract4030_duplicates.head(10)

Basic statistics.

In [None]:
df_tract4030_duplicates.describe()

Making the plot for all duplicates.

In [None]:
### Defining the points
dataframe = df_tract4030_duplicates
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4030 - All duplicates'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

### Duplicates (exact R.A and DEC) in tract 4030 - Analysis of the flux

Now, let us compare the fluxes of these duplicated objects.

Below, we have the first lines of the table containing the objects that have **non-NaN values in the ```i_cModelFlux``` column.**

In [None]:
df_tract4030_duplicates_i_not_nan = df_tract4030_duplicates[df_tract4030_duplicates['i_cModelFlux'].notna()]
df_tract4030_duplicates_i_not_nan.head(10)

Basic statistics.

In [None]:
df_tract4030_duplicates_i_not_nan.describe()

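Now, let us see if any of these objects have exactly the same fluxes. The checks below look, band by band, for any flux value that is repeated anywhere in this table, discarding zeros and NaN values.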

In [None]:
df_check_4030 = df_tract4030_duplicates_i_not_nan[df_tract4030_duplicates_i_not_nan[['g_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check_4030 = df_check_4030[df_check_4030['g_cModelFlux'] != 0]
df_check_4030 = df_check_4030[df_check_4030['g_cModelFlux'].notna()]
df_check_4030

In [None]:
df_check_4030 = df_tract4030_duplicates_i_not_nan[df_tract4030_duplicates_i_not_nan[['r_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check_4030 = df_check_4030[df_check_4030['r_cModelFlux'] != 0]
df_check_4030 = df_check_4030[df_check_4030['r_cModelFlux'].notna()]
df_check_4030

In [None]:
df_check_4030 = df_tract4030_duplicates_i_not_nan[df_tract4030_duplicates_i_not_nan[['i_cModelFlux']].duplicated(keep=False)].sort_values('coord_ra')
df_check_4030 = df_check_4030[df_check_4030['i_cModelFlux'] != 0]
df_check_4030 = df_check_4030[df_check_4030['i_cModelFlux'].notna()]
df_check_4030

**So, considering only objects with non-NaN values in the ```i_cModelFlux``` column, the duplicated objects share the same R.A. and DEC coordinates but have different ```objectId``` values and different fluxes. No flux value is repeated exactly.**

### Duplicates (exact R.A and DEC) in tract 4030 - ```detect_isDeblendedSource``` flag

Let us see what happens if we use the flag ```detect_isDeblendedSource``` in the duplicates table. This flag is ```True``` when:

1) The source is a top level parent and it is isolated (detect_isIsolated & parent==0)

2) The source was deblended from a parent with multiple children and has no children of its own (detect_fromBlend & deblend_nPeaks == 1)

So, by selecting ```False```, we are essentially selecting the sources as they were before the deblending (parents).

#### Flag ```detect_isDeblendedSource == False```
Defining the dataframe. We also require ```detect_isPatchInner==True``` and ```detect_isTractInner==True```, and exclude sky objects with ```merge_peak_sky==False```, to isolate the duplicates that come only from deblending.

In [None]:
dataframe_parent_4030 = df_tract4030_duplicates[df_tract4030_duplicates['detect_isDeblendedSource'] == False]
dataframe_parent_4030 = dataframe_parent_4030[dataframe_parent_4030['merge_peak_sky'] == False]
dataframe_parent_4030 = dataframe_parent_4030[dataframe_parent_4030['detect_isPatchInner'] == True]
dataframe_parent_4030 = dataframe_parent_4030[dataframe_parent_4030['detect_isTractInner'] == True]

First lines.

In [None]:
dataframe_parent_4030.head(10)

Basic statistics.

In [None]:
dataframe_parent_4030.describe()

Making the plot.

In [None]:
### Defining the points
dataframe = dataframe_parent_4030
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4030 - detect_isDeblendedSource==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, in our duplicates table, we have 87 parent objects, that is, objects as they were before the deblending. This is exactly the number of parents found in tract 4029. Could it be some pattern of the simulation?**

A parent may also be the parent of another parent (an object that will itself be deblended again). Let us check whether this occurs in our duplicates table by searching for duplicates within the parents table, considering the R.A. and DEC. coordinates.

In [None]:
dataframe_parent_of_parent_4030 = dataframe_parent_4030[dataframe_parent_4030[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')
dataframe_parent_of_parent_4030

**So, no parent has another parent as a child, considering exact correspondence of the R.A. and DEC. coordinates. Note that parents may still have other parents as children with slightly different R.A. and DEC. coordinates, which this check cannot detect.**

#### Flag ```detect_isDeblendedSource == False``` - Who are the children?

Now, let us see which children in our duplicates table are associated with these parent objects, considering exact correspondence of R.A. and DEC.

Defining the children dataframe. Again, we require ```detect_isPatchInner==True``` and ```detect_isTractInner==True``` (and ```merge_peak_sky==False```) to isolate the duplicates that come only from deblending.

In [None]:
dataframe_child_4030 = df_tract4030_duplicates[df_tract4030_duplicates['detect_isDeblendedSource'] == True]
dataframe_child_4030 = dataframe_child_4030[dataframe_child_4030['merge_peak_sky'] == False]
dataframe_child_4030 = dataframe_child_4030[dataframe_child_4030['detect_isPatchInner'] == True]
dataframe_child_4030 = dataframe_child_4030[dataframe_child_4030['detect_isTractInner'] == True]

Associating the children to the parents.

In [None]:
intersect_4030 = pd.merge(dataframe_parent_4030, dataframe_child_4030, on=['coord_ra', 'coord_dec'], how='inner')

filtered_parent_4030 = dataframe_parent_4030[dataframe_parent_4030[['coord_ra', 'coord_dec']].apply(tuple, axis=1).isin(intersect_4030[['coord_ra', 'coord_dec']].apply(tuple, axis=1))]
filtered_child_4030 = dataframe_child_4030[dataframe_child_4030[['coord_ra', 'coord_dec']].apply(tuple, axis=1).isin(intersect_4030[['coord_ra', 'coord_dec']].apply(tuple, axis=1))]

dataframe_merged_4030 = pd.concat([filtered_parent_4030, filtered_child_4030]).sort_values(by=['coord_ra', 'coord_dec']).reset_index(drop=True)

In [None]:
filtered_child_4030.describe()

In [None]:
dataframe_merged_4030.head(10)

**So, only 28 children have exact correspondence of R.A. and DEC. with the parents listed in the previous dataframe of parents. The parents may have other children, but we do not capture them here, because we are considering only exact correspondence of R.A. and DEC.**

**As we can see, one source of duplicates in the R.A. and DEC. coordinates is the deblending process, because parents and their children are counted together.**

### Duplicates in tract 4030 - ```detect_isPatchInner``` flag

Let us see what happens if we select ```detect_isPatchInner==False``` in the duplicates table. This flag is ```True``` when:
1) The source is in the inner region of a patch.

So, by selecting ```False```, we are essentially selecting sources at the edge of the patch, in the region that overlaps with other patches. These objects are therefore expected to be measured more than once.

#### Flag ```detect_isPatchInner == False```

In [None]:
dataframe_patchinner_4030 = df_tract4030_duplicates[df_tract4030_duplicates['detect_isPatchInner'] == False]

First lines.

In [None]:
dataframe_patchinner_4030.head(10)

Basic statistics.

In [None]:
dataframe_patchinner_4030.describe()

Making the plot of duplicates with ```detect_isPatchInner==False```.

In [None]:
### Defining the points
dataframe = dataframe_patchinner_4030
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4030 - detect_isPatchInner==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, 32410 duplicate rows (counting both the original and the duplicate) are objects at the edges of the patches, that is, in the overlap region between different patches, where multiple measurements of the same object are expected.**

### Duplicates in tract 4030 - ```detect_isTractInner``` flag

Let us see what happens if we select ```detect_isTractInner==False``` in the duplicates table. This flag is ```True``` when:
1) The source is in the inner region of a tract.

So, by selecting ```False```, we are essentially selecting sources at the edge of the tract, in the region that overlaps with other tracts. These objects are therefore expected to be measured more than once.

#### Flag ```detect_isTractInner == False```

In [None]:
dataframe_tractinner_4030 = df_tract4030_duplicates[df_tract4030_duplicates['detect_isTractInner'] == False]

First lines.

In [None]:
dataframe_tractinner_4030.head(10)

Basic statistics.

In [None]:
dataframe_tractinner_4030.describe()

Making the plot of duplicates with ```detect_isTractInner==False```.

In [None]:
### Defining the points
dataframe = dataframe_tractinner_4030
ra = dataframe['coord_ra']
dec = dataframe['coord_dec']

ra_dec_points_minusRA = gv.Points((-ra, dec), kdims=['-R.A. (deg)', 'DEC (deg)'])

### Defining some plot parameters.
title = 'Spatial distribution - Duplicates (exact R.A and DEC) in tract 4030 - detect_isTractInner==False'
height = 500
width = 1000
padding = 0.05

### Making the plot.
Plate_Carree_rasterized_points = rasterize(ra_dec_points_minusRA, aggregator=ds.count()).opts(cmap="Viridis", cnorm='log')

Plate_Carree_spread_points = dynspread(Plate_Carree_rasterized_points).opts(width=width, height=height, padding=padding, title=title, toolbar='above', colorbar=True, 
                                                                            tools=['box_select'], show_grid=True)

Plate_Carree_spread_points

**So, 6873 duplicate rows (counting both the original and the duplicate) are objects at the edges of the tract, that is, in the overlap region between different tracts, where multiple measurements of the same object are expected.**

---------------------------------------------------------------------------------------------------
## Checking for duplicates in both tracts concatenated, considering the R.A. and DEC coordinates

Now, we will check for duplicates in both tracts concatenated, considering the R.A. and DEC coordinates, and we sort the values based on the ```coord_ra``` column.

First, let us define the concatenated dask dataframe.

In [None]:
ddf_concat_4029_4030 = dd.concat([ddf_tract4029,ddf_tract4030])

selected_columns = ['coord_ra', 'coord_dec', 'g_cModelFlux', 'r_cModelFlux', 'i_cModelFlux', 'g_cModelFluxErr', 
                    'r_cModelFluxErr', 'i_cModelFluxErr', 'detect_isDeblendedSource', 'detect_isPatchInner', 'detect_isTractInner', 'merge_peak_sky', 'detect_isPrimary']

df_concat_4029_4030 = ddf_concat_4029_4030[selected_columns].compute()

Now, let us search for duplicates.

In [None]:
df_concat_4029_4030[df_concat_4029_4030[['coord_ra', 'coord_dec']].duplicated(keep=False)].sort_values('coord_ra')

**As we can see, the number of duplicates is exactly the sum of the numbers of duplicates found in each tract individually (4029 and 4030), that is, $40199\text{ rows} + 37817 \text{ rows} = 78016 \text{ rows}$. So, when exact correspondence of R.A. and DEC. is the criterion for finding duplicates, the duplicates appear to come from within each individual tract, and not from the combination of both tracts.**
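
The reasoning behind this comparison can be sketched on toy tables: if concatenating two tables flags more rows than the two tables flag individually, the excess comes from matches across tables. The tables, values, and the helper ```count_duplicates``` below are hypothetical.

```python
import pandas as pd

keys = ["coord_ra", "coord_dec"]


def count_duplicates(df):
    """Hypothetical helper: number of rows whose (ra, dec) pair occurs more than once."""
    return int(df[keys].duplicated(keep=False).sum())


# Toy tables with hypothetical values (not DP0.2 data).
toy_a = pd.DataFrame({"coord_ra": [10.0, 10.0, 10.5], "coord_dec": [-30.0, -30.0, -30.2]})
toy_b = pd.DataFrame({"coord_ra": [20.0, 20.0, 20.5, 20.5], "coord_dec": [-30.0, -30.0, -30.1, -30.1]})
toy_c = pd.DataFrame({"coord_ra": [10.5], "coord_dec": [-30.2]})  # matches a row of toy_a

# No shared coordinates between toy_a and toy_b: the excess is zero.
excess_ab = count_duplicates(pd.concat([toy_a, toy_b])) - (
    count_duplicates(toy_a) + count_duplicates(toy_b)
)
# toy_c repeats a coordinate pair that is unique within toy_a: the excess is positive.
excess_ac = count_duplicates(pd.concat([toy_a, toy_c])) - (
    count_duplicates(toy_a) + count_duplicates(toy_c)
)
print(excess_ab, excess_ac)
```

One caveat of this test: the excess only counts rows that are unique within their own table. A cross-table match between rows that are already duplicated inside their tables would not change the total, so a zero excess suggests, but does not strictly prove, the absence of cross-tract matches.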


### Checking for duplicates in both tracts concatenated, considering the objectId column (index)

Now, we will check for duplicates in both tracts concatenated, considering just the ```objectId``` column (the index), and we sort the values based on the ```coord_ra``` column.

In [None]:
df_concat_4029_4030[df_concat_4029_4030.index.duplicated(keep=False)].sort_values('coord_ra')

As we can see, there are no duplicated ```objectId``` values, although, as we saw before, there are objects with the same R.A. and DEC but different ```objectId``` values.

------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------
# Closing the client and cluster

In [None]:
# Closing the client and the cluster
client.close()
cluster.close()