# Performance of LSDB's cross-matching algorithm

We compare the performance of LSDB's default cross-matching algorithm, `KdTreeCrossmatch`, with two `astropy` functions: `match_coordinates_sky` and `search_around_sky`.
`KdTreeCrossmatch` lets you specify both a search radius and the number of neighbors to return, whereas `match_coordinates_sky` returns only the n-th nearest neighbor and `search_around_sky` returns all neighbors within a given search radius.
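
The three query styles above can be illustrated with a brute-force sketch in plain Python. This is not how LSDB or `astropy` implement their searches (both rely on k-d trees); the function names here are hypothetical and only show what each kind of query returns for a single target.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(h)))


def sorted_neighbors(target, catalog):
    """(separation, catalog index) pairs, sorted by increasing separation."""
    return sorted((angular_sep_deg(*target, *obj), i) for i, obj in enumerate(catalog))


def nth_neighbor(target, catalog, n):
    """Only the n-th nearest neighbor (1-based), regardless of distance."""
    return sorted_neighbors(target, catalog)[n - 1]


def within_radius(target, catalog, radius_deg):
    """Every neighbor inside the radius, however many there are."""
    return [(d, i) for d, i in sorted_neighbors(target, catalog) if d <= radius_deg]


def up_to_n_within_radius(target, catalog, n, radius_deg):
    """At most n nearest neighbors, all inside the radius."""
    return within_radius(target, catalog, radius_deg)[:n]
```

The last function combines both limits in one query, which is why the notebook has to loop over `nthneighbor` or post-filter the radius search to reproduce it with `astropy`.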

This notebook shows that, for the simplest case of a small radius and a single cross-match pair, LSDB's algorithm runs slightly slower than `match_coordinates_sky` when the latter receives `SkyCoord` objects as input.
My hypothesis is that this happens because `SkyCoord` converts spherical coordinates to Cartesian at construction time, while `KdTreeCrossmatch` performs this conversion itself. This makes the comparison unfair.
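
To make the spherical-to-Cartesian conversion concrete, here is a minimal sketch in plain Python. It is neither library's actual code; the helper names are hypothetical. It shows the kind of per-object work involved: mapping (RA, Dec) to a unit vector, and translating an angular radius into the straight-line (chord) distance a Euclidean k-d tree query would use.

```python
import math


def radec_to_unit_xyz(ra_deg, dec_deg):
    """Map spherical coordinates in degrees to a point on the unit sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (
        math.cos(dec) * math.cos(ra),
        math.cos(dec) * math.sin(ra),
        math.sin(dec),
    )


def arcsec_to_chord(radius_arcsec):
    """Chord length on the unit sphere subtended by an angle in arcseconds."""
    theta = math.radians(radius_arcsec / 3600.0)
    return 2.0 * math.sin(theta / 2.0)


def chord_to_arcsec(chord):
    """Inverse of arcsec_to_chord: angular separation for a given chord length."""
    return math.degrees(2.0 * math.asin(chord / 2.0)) * 3600.0
```

Two points separated by an angle θ on the unit sphere are a Euclidean distance 2·sin(θ/2) apart, so a radius search in Cartesian space is equivalent to an angular radius search once the radius is converted this way.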

On the other hand, `KdTreeCrossmatch` cross-matches the coordinate catalogs two orders of magnitude faster than the `SkyCoord` objects can be constructed, so if we included the initialisation of the `SkyCoord`s in the astropy benchmark, it would run much slower than LSDB's algorithm.

### Import packages

In [1]:
import pandas as pd

import lsdb
from astropy.coordinates import Angle, SkyCoord, match_coordinates_sky, search_around_sky
from lsdb.core.crossmatch.kdtree_match import KdTreeCrossmatch

### Set cross-match parameters

In [2]:
N_NEIGHBORS = 10
RADIUS_ARCSEC = 60.0

### Obtain LINCC Frameworks' "half-degree" catalogs: Gaia DR3 and ZTF DR14

In [None]:
%%time

gaia_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/gaia/'
ztf_path = 'https://epyc.astro.washington.edu/~lincc-frameworks/half_degree_surveys/ztf/ztf_object/'

gaia_catalog = lsdb.read_hipscat(gaia_path, columns=["ra", "dec"])
ztf_catalog = lsdb.read_hipscat(ztf_path, columns=["ra", "dec"])

gaia_df = gaia_catalog.compute().reset_index(drop=True)
ztf_df = ztf_catalog.compute().reset_index(drop=True)

### Run LSDB's cross-matching algorithm 

In [None]:
def lsdb_crossmatch():
    # LSDB's cross-matching algorithms are not designed to be used outside of
    # lsdb.catalog.Catalog.crossmatch(), so we construct the object with some
    # mock properties.
    kd_tree_crossmatch = KdTreeCrossmatch(
        left=gaia_df,
        right=ztf_df,
        left_order=-1,
        left_pixel=-1,
        right_order=-1,
        right_pixel=-1,
        left_metadata=gaia_catalog.hc_structure,
        right_metadata=ztf_catalog.hc_structure,
        right_margin_hc_structure=None,
        suffixes=("", ""),
    )

    return kd_tree_crossmatch.crossmatch(n_neighbors=N_NEIGHBORS, radius_arcsec=RADIUS_ARCSEC)

%timeit lsdb_crossmatch()
lsdb_result = lsdb_crossmatch().reset_index(drop=True)

### Create `SkyCoord`s for astropy - note that it takes a while

In [None]:
%%time

gaia_skycoord = SkyCoord(ra=gaia_df["ra"], dec=gaia_df["dec"], unit="deg")
ztf_skycoord = SkyCoord(ra=ztf_df["ra"], dec=ztf_df["dec"], unit="deg")

### Run `astropy`'s `match_coordinates_sky`

In [None]:
def astropy_match_nthneighbor(nthneighbor: int) -> pd.DataFrame:
    idx_ztf, d2, _d3 = match_coordinates_sky(gaia_skycoord, ztf_skycoord, nthneighbor=nthneighbor)
    result = pd.concat(
        [
            gaia_df,
            ztf_df.iloc[idx_ztf].reset_index(drop=True),
            pd.DataFrame({"_dist_arcsec": d2.to_value('arcsec')}),
        ],
        axis=1,
    )
    return result[d2 < Angle(RADIUS_ARCSEC, 'arcsec')]

def astropy_match():
    result = pd.concat(
        [astropy_match_nthneighbor(nth) for nth in range(1, N_NEIGHBORS + 1)],
        axis=0,
    )
    return result

%timeit astropy_match()
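# A stable sort on the original Gaia row index groups each object's matches
# together while keeping them ordered by neighbor rank (nearest first).
astropy_match_result = astropy_match().sort_index(kind='stable').reset_index(drop=True)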

### Run `astropy`'s `search_around_sky`

In [None]:
def astropy_search():
    idx_gaia, idx_ztf, d2, _d3 = search_around_sky(
        gaia_skycoord,
        ztf_skycoord,
        Angle(RADIUS_ARCSEC, 'arcsec'),
    )
    idx_df = pd.DataFrame(
        {"idx_gaia": idx_gaia, "idx_ztf": idx_ztf, "_dist_arcsec": d2.to_value("arcsec")},
    )
    idx_df = idx_df.groupby("idx_gaia").apply(lambda x: x.nsmallest(N_NEIGHBORS, "_dist_arcsec"))
    idx_df = idx_df.reset_index(drop=True)
    return pd.concat(
        [
            gaia_df.iloc[idx_df["idx_gaia"]].reset_index(drop=True),
            ztf_df.iloc[idx_df["idx_ztf"]].reset_index(drop=True),
            idx_df["_dist_arcsec"].to_frame(),
        ],
        axis=1,
    )

%timeit astropy_search()
astropy_search_result = astropy_search().reset_index(drop=True)

### Compare the results

In [None]:
pd.testing.assert_frame_equal(
    lsdb_result,
    astropy_match_result,
)
pd.testing.assert_frame_equal(
    lsdb_result,
    astropy_search_result,
)
print('All dataframes are equal')