# Using per pixel statistics in a search

Author: Melissa

This is a small proof-of-concept notebook that explores using per-pixel statistics to reduce the search space of what would otherwise be a full-table scan.

We likely get this kind of pruning from dask expressions already, but if we want to move away from dask's more exotic features, this is something we could consider as a substitute.

## Introduction

So, what are we going to do?

We're going to find all stars in the Gaia EDR3 Bailer-Jones distance catalog that are closer than 3 pc, using the median photogeometric distance estimates (`r_med_photogeo`). Someone else can tell me this is a silly question, but I'm doing it anyway.

For numeric extrema, we can use the `per_pixel_statistics` values, derived from the parquet `_metadata` file, to perform some pretty strong filtering: if we want all rows with a value less than `k`, then only partitions whose minimum value is less than `k` can contain a match. Every other partition can be skipped without reading it.
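
The same reasoning works in the other direction with the maximum value. Here is a minimal sketch of the pruning test in plain Python. The helper `partition_may_match` and the toy numbers are made up for illustration and are not part of lsdb or the catalog.

```python
def partition_may_match(min_value, max_value, less_than=None, greater_than=None):
    """Return True if a partition with these extrema could hold a matching row."""
    if less_than is not None:
        # Some row is below the threshold only if the smallest one is.
        return min_value < less_than
    if greater_than is not None:
        # Some row is above the threshold only if the largest one is.
        return max_value > greater_than
    raise ValueError("need less_than or greater_than")


# Made-up (min, max) pairs, for illustration only; not taken from the catalog.
toy_stats = {
    "A": (1.3, 900.0),
    "B": (7.2, 30000.0),
    "C": (2.9, 45000.0),
}

candidates = [
    name
    for name, (lo, hi) in toy_stats.items()
    if partition_may_match(lo, hi, less_than=3)
]
print(candidates)  # ['A', 'C']
```

Note that passing the test only means a partition might contain matches. The rows inside it still need to be filtered.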

In [1]:
import lsdb

In [32]:
gaia_d = lsdb.open_catalog("/epyc/data3/hats/catalogs/gaia_edr3_distances/gaia_edr3_distances")
per_pixel = gaia_d.per_pixel_statistics(include_columns=["r_med_photogeo"])
per_pixel

,r_med_photogeo: min_value,r_med_photogeo: max_value,r_med_photogeo: null_count,r_med_photogeo: row_count
"Order: 2, Pixel: 0",7.220708,30164.0723,4060,613898
"Order: 3, Pixel: 4",9.662493,24033.4453,1527,234484
...,...,...,...,...
"Order: 3, Pixel: 766",18.323370,44377.1016,9376,960532
"Order: 3, Pixel: 767",14.478624,44927.0469,5648,704797


## What pixels are we dealing with?

We can query these global stats to find just seven partitions whose minimum value falls in this extreme range. That's less than half a percent of all data partitions.

In [3]:
nearby = per_pixel.query("`r_med_photogeo: min_value` < 3")
limit_pixels = nearby.index.tolist()
limit_pixels

[Order: 2, Pixel: 21,
 Order: 4, Pixel: 1308,
 Order: 2, Pixel: 107,
 Order: 5, Pixel: 7238,
 Order: 4, Pixel: 1986,
 Order: 2, Pixel: 142,
 Order: 6, Pixel: 41591]

In [4]:
%%time
nearby_results_df = gaia_d.pixel_search(pixels = limit_pixels).query('r_med_photogeo < 3').compute()
nearby_results_df

CPU times: user 4.3 s, sys: 4.74 s, total: 9.04 s
Wall time: 3 s


_healpix_29,source_id,r_med_geo,r_lo_geo,r_hi_geo,r_med_photogeo,r_lo_photogeo,r_hi_photogeo,flag,ra,dec,Norder,Dir,Npix
381407739701909405,762815470562110464,2.545952,2.545706,2.546205,2.545851,2.545692,2.546106,10033,165.83096,35.948653,2,0,21
1473525239409412345,2947050466531873024,2.670147,2.668469,2.671663,2.670632,2.668852,2.672157,10012,101.286626,-16.720933,4,0,1308
1932486467174911786,3864972938605115520,2.408358,2.407998,2.408747,2.408285,2.407773,2.40883,10022,164.10319,7.002727,2,0,107
2037570889377880190,4075141768785646848,2.975431,2.975131,2.975705,2.975421,2.975177,2.975699,10033,282.458789,-23.837097,5,0,7238
2236416425020061922,4472832130942575872,1.828106,1.827975,1.828225,1.8281,1.82797,1.828225,10033,269.448503,4.73942,4,0,1986
2570346799484574619,5140693571158739840,2.718584,2.713234,2.724433,2.718944,2.714043,2.724843,10022,24.771554,-17.9483,2,0,142
2570346800245147123,5140693571158946048,2.67459,2.67066,2.678184,2.675949,2.672405,2.679119,10022,24.771674,-17.947683,2,0,142
2926749359394994434,5853498713190525696,1.301911,1.301814,1.302005,1.301935,1.301849,1.302006,10033,217.392321,-62.676075,6,40000,41591
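
The two stages used here (prune partitions using their statistics, then filter rows only inside the survivors) can be sketched without lsdb. This is a toy model under stated assumptions: the partition names and values are invented, and the minima are computed in memory rather than read from a `_metadata` file.

```python
# Toy partitions with made-up values; stand-ins for the catalog's parquet files.
partitions = {
    "p0": [7.2, 15.0, 301.4],
    "p1": [1.3, 2.9, 88.0],
    "p2": [12.5, 40.1],
    "p3": [2.4, 5.0, 9.9, 1200.0],
}
threshold = 3

# Stage 1: statistics only. In the notebook these come from `_metadata`;
# here they are computed up front to play the same role.
min_stats = {name: min(rows) for name, rows in partitions.items()}
surviving = [name for name, lo in min_stats.items() if lo < threshold]

# Stage 2: row-level filter, but only inside the surviving partitions.
pruned_result = sorted(
    v for name in surviving for v in partitions[name] if v < threshold
)

# Reference answer from scanning every partition.
full_scan = sorted(v for rows in partitions.values() for v in rows if v < threshold)

rows_read = sum(len(partitions[name]) for name in surviving)
rows_total = sum(len(rows) for rows in partitions.values())
print(surviving, pruned_result == full_scan, f"{rows_read}/{rows_total} rows read")
```

The pruned search gives the same answer as the full scan, because a skipped partition has a minimum at or above the threshold and so cannot hold a matching row.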


## Putting it together into a custom search method:

I'm not sure if we should just go ahead and add this to our code, or use it as an example of how you could create a custom search subclass. 

This notebook started as a "how could I make a custom search subclass" exercise, but the result might be too useful to leave as just documentation?

In [61]:
from lsdb.core.search.abstract_search import AbstractSearch
from lsdb.types import HCCatalogTypeVar
import nested_pandas as npd

class ExtremaSearch(AbstractSearch):
    
    def __init__(self, field: str = None, less_than_value=None, greater_than_value=None, inclusive: bool = False, fine: bool = True):
        super().__init__(fine)
        if field is None:
            raise ValueError("field is required")
        # Parenthesize each test: unparenthesized, Python chains the comparisons
        # and the check would only fire when both values are None.
        if (less_than_value is None) == (greater_than_value is None):
            raise ValueError("exactly one of less_than_value or greater_than_value is required")
        self.field = field
        self.less_than_value = less_than_value
        self.greater_than_value = greater_than_value
        # A "less than" search prunes on partition minima; "greater than" on maxima.
        self.stat = "min_value" if less_than_value is not None else "max_value"
        # Compare against None explicitly so that a threshold of 0 is not skipped.
        stat_value = less_than_value if less_than_value is not None else greater_than_value
        operator = "<" if less_than_value is not None else ">"
        self.stat_clause = f' {operator}{"=" if inclusive else ""} {stat_value}'
    
    def perform_hc_catalog_filter(self, hc_structure: HCCatalogTypeVar) -> HCCatalogTypeVar:
        """Determine the pixels for which there is a result in each field"""
        stats = hc_structure.per_pixel_statistics(include_columns=[self.field], include_stats = [self.stat])
        if len(stats) == 0:
            raise ValueError("Error with extrema search - column not found")
        
        within_extrema = stats.query(f"`{self.field}: {self.stat}` {self.stat_clause}")
        limit_pixels = within_extrema.index.tolist()
        if len(limit_pixels) == 0:
            raise ValueError("Error with extrema search - there's nothing that extreme!!")
        return hc_structure.filter_from_pixel_list(limit_pixels) 
    
    def search_points(self, frame: npd.NestedFrame, _) -> npd.NestedFrame:
        """Determine the search results within a data frame"""
        return frame.query(f'{self.field}{self.stat_clause}')
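
The argument handling in the constructor is the fiddly part, and it can be exercised on its own. Here is a minimal sketch using a hypothetical standalone helper, `extrema_clause`, which is not part of lsdb. It picks the statistic and builds the comparison clause for a one-sided threshold.

```python
def extrema_clause(less_than_value=None, greater_than_value=None, inclusive=False):
    """Return (statistic name, comparison clause) for a one-sided threshold."""
    # Parenthesize each `is None` test: without the parentheses Python chains
    # the comparisons and the check only fires when both arguments are None.
    if (less_than_value is None) == (greater_than_value is None):
        raise ValueError("exactly one of less_than_value or greater_than_value is required")
    if less_than_value is not None:
        stat, operator, value = "min_value", "<", less_than_value
    else:
        stat, operator, value = "max_value", ">", greater_than_value
    if inclusive:
        operator += "="
    return stat, f"{operator} {value}"


print(extrema_clause(less_than_value=3))  # ('min_value', '< 3')
print(extrema_clause(greater_than_value=70_000, inclusive=True))  # ('max_value', '>= 70000')
print(extrema_clause(less_than_value=0))  # ('min_value', '< 0')
```

The explicit `is not None` checks matter for a threshold of `0`, which is falsy and would be silently dropped by an `a or b` style fallback.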

In [62]:
%%time
search_object = ExtremaSearch(field="r_med_photogeo", less_than_value=3)
gaia_d.search(search_object).compute()

CPU times: user 5.33 s, sys: 1.72 s, total: 7.04 s
Wall time: 3.04 s


_healpix_29,source_id,r_med_geo,r_lo_geo,r_hi_geo,r_med_photogeo,r_lo_photogeo,r_hi_photogeo,flag,ra,dec,Norder,Dir,Npix
381407739701909405,762815470562110464,2.545952,2.545706,2.546205,2.545851,2.545692,2.546106,10033,165.83096,35.948653,2,0,21
1473525239409412345,2947050466531873024,2.670147,2.668469,2.671663,2.670632,2.668852,2.672157,10012,101.286626,-16.720933,4,0,1308
1932486467174911786,3864972938605115520,2.408358,2.407998,2.408747,2.408285,2.407773,2.40883,10022,164.10319,7.002727,2,0,107
2037570889377880190,4075141768785646848,2.975431,2.975131,2.975705,2.975421,2.975177,2.975699,10033,282.458789,-23.837097,5,0,7238
2236416425020061922,4472832130942575872,1.828106,1.827975,1.828225,1.8281,1.82797,1.828225,10033,269.448503,4.73942,4,0,1986
2570346799484574619,5140693571158739840,2.718584,2.713234,2.724433,2.718944,2.714043,2.724843,10022,24.771554,-17.9483,2,0,142
2570346800245147123,5140693571158946048,2.67459,2.67066,2.678184,2.675949,2.672405,2.679119,10022,24.771674,-17.947683,2,0,142
2926749359394994434,5853498713190525696,1.301911,1.301814,1.302005,1.301935,1.301849,1.302006,10033,217.392321,-62.676075,6,40000,41591


In [63]:
%%time
search_object = ExtremaSearch(field="r_med_photogeo", greater_than_value=70_000)
gaia_d.search(search_object).compute()

CPU times: user 4.09 s, sys: 844 ms, total: 4.94 s
Wall time: 3.01 s


_healpix_29,source_id,r_med_geo,r_lo_geo,r_hi_geo,r_med_photogeo,r_lo_photogeo,r_hi_photogeo,flag,ra,dec,Norder,Dir,Npix
2330742456168955475,4661484892918173568,34315.3008,26725.9688,44993.543,70077.9297,58305.1562,79006.0,10022,76.399752,-67.517804,6,30000,33121
2342977020792708032,4685954013164072064,48638.1562,38473.0664,61766.7617,79661.0078,37747.668,102706.75,10133,13.201436,-72.968982,6,30000,33295
2343823008975960372,4687645989689635200,36967.332,28247.9082,45206.5391,79365.6797,44929.6758,91542.6641,10022,18.859864,-71.455139,4,0,2081
3075946613201331246,6151893215265946496,1478.46069,801.458435,3332.05005,71093.4609,52224.9414,75026.6719,20012,185.882392,-36.097714,4,0,2731
