# Analysis of duplicates in `ztf_axs/ztf_dr14`

I use the HiPSCat version of the dataset here to show that it contains
duplicated sky positions and duplicated PS1 object IDs.

In [None]:
# !pip install polars

In [1]:
from pathlib import Path

import polars as pl
from polars import col

In [2]:
# PSC location
DIR = '/jet/home/malanche/shared/hipscat/catalogs/ztf_axs/ztf_dr14/'

In [3]:
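def hist(*group_by_cols, select=None):
    """Histogram of duplicate multiplicities over all parquet files in DIR.

    Rows are grouped by `group_by_cols`; for every group size N_dups > 1
    the result gives the number of groups of that size.
    `select` is what to read from each file (defaults to the grouping columns).
    """
    if select is None:
        select = group_by_cols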
    
    dfs = []
    for file in Path(DIR).glob('**/*.parquet'):
        df = (
            pl.scan_parquet(file, hive_partitioning=False)
            .select(select)
        )
        dfs.append(df)
    
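    result = (
        pl.concat(dfs)
        # First pass: number of rows in each group
        .group_by(*group_by_cols)
        .count()
        .select(col('count').alias('N_dups'))
        # Keep only groups with more than one row, i.e. duplicates
        .filter(col('N_dups') > 1)
        # Second pass: number of groups for each multiplicity
        .group_by('N_dups')
        .count()
        .sort('N_dups')
    )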
    return result.collect()
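
To make the output of `hist` easier to read, here is a minimal sketch of the same two-stage grouping in plain Python, applied to a made-up list of keys. The name `dup_hist` and the toy values are illustrative only and are not part of the pipeline above.

```python
from collections import Counter


def dup_hist(keys):
    """Return sorted (N_dups, count) pairs for keys that occur more than once."""
    # First pass: how many rows share each key
    group_sizes = Counter(keys)
    # Second pass: how many keys have each multiplicity, duplicates only
    multiplicities = Counter(n for n in group_sizes.values() if n > 1)
    return sorted(multiplicities.items())


# Toy data: 102 and 103 occur twice, 104 occurs three times, 101 once
toy_keys = [101, 102, 102, 103, 103, 104, 104, 104]
print(dup_hist(toy_keys))
```

Each pair is `(N_dups, count)`: the number of distinct keys that occur exactly `N_dups` times. Keys that occur once are dropped, as in `hist`.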

### PS1 object ID duplicates
Probably due to issues in the AXS or HiPSCat import

In [4]:
%%time

hist('ps1_objid')

CPU times: user 26min 11s, sys: 6min 11s, total: 32min 23s
Wall time: 1min 58s


N_dups,count
u32,u32
2,214561


### Exact position duplicates

Probably due to overlapping PS1 skycells

In [5]:
%%time

hist('ra', 'dec')

CPU times: user 26min 26s, sys: 9min 20s, total: 35min 46s
Wall time: 1min 20s


N_dups,count
u32,u32
2,27342697
3,694450
4,18570
5,248
6,111
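
One way to read this table: a group of `N_dups` identical positions contains `N_dups - 1` rows beyond the first. The sketch below sums these surplus rows from the numbers above. The helper name `surplus_rows` is hypothetical, and treating every group as one object plus copies is an assumption that this notebook does not verify.

```python
def surplus_rows(hist_rows):
    """Total number of rows beyond the first in every duplicated group."""
    return sum((n_dups - 1) * count for n_dups, count in hist_rows)


# (N_dups, count) pairs copied from the output above
exact_position = [(2, 27342697), (3, 694450), (4, 18570), (5, 248), (6, 111)]
print(surplus_rows(exact_position))
```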


### Approximate position duplicates
#### Objects sharing the same HEALPix `Norder=19` tile
Probably the same effect as the exact duplicates above, but with positions that differ by floating-point rounding errors
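
The tile number can be recovered from `_hipscat_index` with integer arithmetic alone. A HEALPix pixel number at order `k` is below `12 * 4**k`, so it fits in `4 + 2 * k` bits. The sketch below assumes that the index keeps the order-19 pixel number in its highest 42 bits and uses the remaining 22 bits to tell apart rows within a pixel. The helper name and the sample values are made up for illustration.

```python
NORDER = 19
SHIFT_BY = 64 - (4 + 2 * NORDER)  # 22 low bits left below the pixel number


def healpix19_from_index(hipscat_index):
    """Drop the low bits; for non-negative ints this equals a right shift."""
    return hipscat_index // (1 << SHIFT_BY)


# Hypothetical pixel number and two hypothetical low-bit values
pixel = 123_456_789
index_a = (pixel << SHIFT_BY) | 5
index_b = (pixel << SHIFT_BY) | 1_000

# Both rows map back to the same order-19 tile
print(healpix19_from_index(index_a) == healpix19_from_index(index_b) == pixel)
# Integer division by a power of two matches the bit shift
print(healpix19_from_index(index_a) == index_a >> SHIFT_BY)
```

Under that assumption, two rows with different index values still count as sharing a tile, which is what the grouping in the next cell measures.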

In [6]:
%%time

# In the hipscat code, 19 is the Norder used for the index
shift_by = (64 - (4 + 2 * 19))
# polars has no native bit-shift support, so divide by a power of two instead
hist('healpix19', select=(pl.col('_hipscat_index') // (1 << shift_by)).alias('healpix19'))

CPU times: user 21min 25s, sys: 16min 59s, total: 38min 25s
Wall time: 1min 6s


N_dups,count
u32,u32
2,31025043
3,941139
4,29881
5,550
6,153
