# Estimating catalog size

This notebook estimates an upper bound on the number of rows in a catalog, along with its expected size in memory and on disk.

In [70]:
import lsdb
import hats as hc
import human_readable
import nested_pandas as npd
import pandas as pd
import sys

from hats.pixel_math import HealpixPixel

We cannot provide an accurate estimate until we compute the result...

- But we can provide a "lazy" estimate from the catalog's original metadata.

- Especially for cases of **column selection** and **spatial filtering**.

- `_metadata` sizes do **NOT** include Parquet headers/footers, just the data pages!

- In practice there will also be in-memory overhead from readers.

These should be good estimates for raw data, not full in-memory footprint!

In [71]:
from dask.distributed import Client
client = Client(n_workers=3)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36717 instead


### Testing on 2MASS

In [72]:
two_mass = lsdb.open_catalog('https://data.lsdb.io/hats/two_mass')

Here is the example for a catalog filtered with pixels and columns:

In [73]:
pixels = [HealpixPixel(2,187),HealpixPixel(4,2654)]
columns = ["ra", "decl"]
filtered_two_mass = two_mass[columns].pixel_search(pixels)

Let's grab the per-pixel statistics:

In [74]:
stats = filtered_two_mass.per_pixel_statistics(
    use_default_columns=False,
    exclude_hats_columns=False,
    include_columns=list(filtered_two_mass.columns) + ["_healpix_29"],
    include_stats=["row_count", "memory_bytes", "disk_bytes"],
    multi_index=True,
)
stats



pixel,column,row_count,memory_bytes,disk_bytes
"Order: 4, Pixel: 2654",_healpix_29,1198865,9283543.0,11264710.0
"Order: 4, Pixel: 2654",ra,1198865,6908816.0,11225892.0
"Order: 4, Pixel: 2654",decl,1198865,5347272.0,10236096.0
"Order: 2, Pixel: 187",_healpix_29,3317984,27710228.0,31295725.0
"Order: 2, Pixel: 187",ra,3317984,19560792.0,31350383.0
"Order: 2, Pixel: 187",decl,3317984,16961158.0,30759849.0


#### Max row count

Summing the file lengths for each pixel gives an upper bound on the number of rows.

In [75]:
# Reading it from the statistics
row_counts = stats.groupby(level=0).first()["row_count"]
row_counts

pixel
Order: 2, Pixel: 187     3317984
Order: 4, Pixel: 2654    1198865
Name: row_count, dtype: int64

In [76]:
row_counts.sum()

np.int64(4516849)

In [77]:
# Checking by loading each individual parquet file
row_count = 0
for pixel in pixels:
    path = hc.io.pixel_catalog_file(two_mass.hc_structure.catalog_path, pixel)
    row_count += len(npd.read_parquet(path))
row_count

4516849
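
The detail that matters when reading from the statistics is that `row_count` repeats once per column, so it has to be taken once per pixel before summing. Below is a minimal pure-Python sketch of that step, using the values from the table above. The dictionary layout is only illustrative and is not the structure `per_pixel_statistics` returns.

```python
# (pixel, column) -> row_count, copied from the statistics table above.
row_count_by_pixel_column = {
    ("Order: 4, Pixel: 2654", "_healpix_29"): 1_198_865,
    ("Order: 4, Pixel: 2654", "ra"): 1_198_865,
    ("Order: 4, Pixel: 2654", "decl"): 1_198_865,
    ("Order: 2, Pixel: 187", "_healpix_29"): 3_317_984,
    ("Order: 2, Pixel: 187", "ra"): 3_317_984,
    ("Order: 2, Pixel: 187", "decl"): 3_317_984,
}

# Keep one row count per pixel (the first one seen), then sum across pixels.
rows_per_pixel = {}
for (pixel, _column), n_rows in row_count_by_pixel_column.items():
    rows_per_pixel.setdefault(pixel, n_rows)
max_rows = sum(rows_per_pixel.values())

# Summing every entry instead would count each pixel once per column.
naive_sum = sum(row_count_by_pixel_column.values())

print(max_rows, naive_sum)
```

With three columns per pixel, the naive sum is three times the per-pixel total, which is why the notebook groups by pixel first.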

#### Size in memory (`total_uncompressed`)

Size in memory is trickier, because it will vary between machines/readers.

In [78]:
mem_sizes = pd.to_numeric(stats["memory_bytes"], errors="coerce")
mem_sizes

pixel                  column     
Order: 4, Pixel: 2654  _healpix_29     9283543.0
                       ra              6908816.0
                       decl            5347272.0
Order: 2, Pixel: 187   _healpix_29    27710228.0
                       ra             19560792.0
                       decl           16961158.0
Name: memory_bytes, dtype: float64

In [79]:
human_readable.file_size(mem_sizes.sum())

'85.8 MB'

Let's check this estimate against the size pandas reports:

In [80]:
filtered_df = filtered_two_mass.compute()

In [81]:
human_readable.file_size(sys.getsizeof(filtered_df))

'108.4 MB'

In [82]:
# Similar result using pandas memory_usage
human_readable.file_size(filtered_df.memory_usage(deep=True).sum())

'108.4 MB'

These values look reasonable, even if a bit off: the metadata estimate (85.8 MB) is about 21% below the 108.4 MB that pandas reports.
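
The 108.4 MB figure is what a dense layout of three 8-byte values per row would give (`ra`, `decl`, and the `_healpix_29` index). The arithmetic check below assumes both columns are float64 and the index is int64, which this notebook does not verify.

```python
rows = 4_516_849          # row count obtained above
bytes_per_value = 8       # assumption: float64 columns and an int64 index
n_arrays = 3              # ra, decl, and the _healpix_29 index

dense_bytes = rows * bytes_per_value * n_arrays

# Sum of the six memory_bytes values in the statistics table.
metadata_estimate = 85_771_809
shortfall = 1 - metadata_estimate / dense_bytes

print(f"{dense_bytes / 1e6:.1f} MB dense, metadata estimate is {shortfall:.0%} lower")
```

If the assumption holds, the gap between the two numbers comes from how the metadata counts bytes rather than from reader overhead in pandas.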

Something weird happens when we compare to `total_compressed_size` though:

#### Size on disk (`total_compressed`)

In [83]:
disk_sizes = pd.to_numeric(stats["disk_bytes"], errors="coerce")
disk_sizes

pixel                  column     
Order: 4, Pixel: 2654  _healpix_29    11264710.0
                       ra             11225892.0
                       decl           10236096.0
Order: 2, Pixel: 187   _healpix_29    31295725.0
                       ra             31350383.0
                       decl           30759849.0
Name: disk_bytes, dtype: float64

In [84]:
human_readable.file_size(disk_sizes.sum())

'126.1 MB'

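The on-disk estimate (126.1 MB) is about 47% larger than the in-memory estimate (85.8 MB), which is surprising because compressed data is usually smaller than uncompressed data.  
One guess is that headers and footers add overhead across the ~4.5M rows, although the `_metadata` sizes are not supposed to include them. This has not been verified.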
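
To see where the gap comes from, it may help to look at the disk-to-memory ratio per column instead of only the totals. The sketch below does this in plain Python with the values printed above. The dictionary is an illustrative stand-in for the statistics frame.

```python
# (pixel, column) -> (memory_bytes, disk_bytes), copied from the outputs above.
sizes = {
    ("Order: 4, Pixel: 2654", "_healpix_29"): (9_283_543, 11_264_710),
    ("Order: 4, Pixel: 2654", "ra"): (6_908_816, 11_225_892),
    ("Order: 4, Pixel: 2654", "decl"): (5_347_272, 10_236_096),
    ("Order: 2, Pixel: 187", "_healpix_29"): (27_710_228, 31_295_725),
    ("Order: 2, Pixel: 187", "ra"): (19_560_792, 31_350_383),
    ("Order: 2, Pixel: 187", "decl"): (16_961_158, 30_759_849),
}

# Ratio above 1 means the "disk" figure exceeds the "memory" figure.
ratios = {key: disk / mem for key, (mem, disk) in sizes.items()}
for (pixel, column), ratio in ratios.items():
    print(f"{pixel:<22} {column:<12} disk/memory = {ratio:.2f}")

total_memory = sum(mem for mem, _ in sizes.values())
total_disk = sum(disk for _, disk in sizes.values())
print(f"overall disk/memory = {total_disk / total_memory:.2f}")
```

Every column has a ratio above 1, so the effect is not limited to a single column or pixel.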

This size may also vary depending on the compression used:

In [85]:
# Reuse the DataFrame computed above instead of recomputing it for each file
filtered_df.to_parquet("zstd.parquet", compression="ZSTD")
filtered_df.to_parquet("snappy.parquet", compression="snappy")

In [87]:
!du -sh *

40K	estimate_size.ipynb
106M	snappy.parquet
91M	zstd.parquet


### Looking at the full catalog...

In [88]:
full_stats = two_mass.per_pixel_statistics(
    use_default_columns=False,
    exclude_hats_columns=False,
    include_columns=list(two_mass.all_columns) + ["_healpix_29"],
    include_stats=["row_count", "memory_bytes", "disk_bytes"],
    multi_index=True,
)
full_stats

pixel,column,row_count,memory_bytes,disk_bytes
"Order: 1, Pixel: 0",_healpix_29,2498051,21173325,23540650
"Order: 1, Pixel: 0",ra,2498051,14618877,23681196
...,...,...,...,...
"Order: 1, Pixel: 47",coadd_key,4449556,1567626,3603565
"Order: 1, Pixel: 47",coadd,4449556,799008,1538556


In [89]:
human_readable.file_size(pd.to_numeric(full_stats["memory_bytes"], errors="coerce").sum())

'36.6 GB'

In [90]:
human_readable.file_size(pd.to_numeric(full_stats["disk_bytes"], errors="coerce").sum())

'57.6 GB'

In [91]:
# Our `hats_estsize` is suspiciously close to the memory_bytes amount (?).
human_readable.file_size(int(two_mass.hc_structure.catalog_info.hats_estsize) * 1024)

'36.7 GB'

```python
from pathlib import Path

from upath import UPath

from hats.io import file_io


def estimate_dir_size(path: str | Path | UPath | None = None, *, divisor=1):
    """Estimate the disk usage of a directory, and recursive contents.

    When divisor == 1, returns size in bytes."""
    path = file_io.get_upath(path)
    if path is None:
        return 0

    def _estimate_dir_size(target_dir):
        # Walk the tree recursively, summing the size of every file found.
        total_size = 0
        for item in target_dir.iterdir():
            if item.is_dir():
                total_size += _estimate_dir_size(item)
            else:
                total_size += item.stat().st_size
        return total_size

    est_size = _estimate_dir_size(path)
    if divisor > 1:
        return int(est_size / divisor)
    return est_size
```
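
The function above sums file sizes on disk, which would explain why `hats_estsize` tracks a disk measurement. Here is a small standalone sketch of the same idea using only `pathlib`, run on a throwaway directory. The helper name `dir_size_bytes` and the file names are hypothetical, and the use of `divisor=1024` to produce `hats_estsize` is an assumption inferred from the `* 1024` in the cell above.

```python
import tempfile
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset").mkdir()
    (root / "dataset" / "part0.bin").write_bytes(b"\0" * 3000)
    (root / "properties").write_bytes(b"\0" * 1096)

    total = dir_size_bytes(root)
    # Same truncating division as the divisor branch above.
    size_kib = int(total / 1024)

print(total, size_kib, size_kib * 1024)
```

Multiplying the stored value by 1024 recovers the byte count only up to the truncation, so a difference of less than 1 KiB is expected.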

In [92]:
!du -sh /epyc/data3/hats/catalogs/two_mass/two_mass/*

35G	/epyc/data3/hats/catalogs/two_mass/two_mass/dataset
4.0K	/epyc/data3/hats/catalogs/two_mass/two_mass/hats.properties
4.0K	/epyc/data3/hats/catalogs/two_mass/two_mass/partition_info.csv
6.1M	/epyc/data3/hats/catalogs/two_mass/two_mass/point_map.fits
4.0K	/epyc/data3/hats/catalogs/two_mass/two_mass/properties
12K	/epyc/data3/hats/catalogs/two_mass/two_mass/skymap.2.fits
32K	/epyc/data3/hats/catalogs/two_mass/two_mass/skymap.4.fits
392K	/epyc/data3/hats/catalogs/two_mass/two_mass/skymap.6.fits
6.1M	/epyc/data3/hats/catalogs/two_mass/two_mass/skymap.fits


In [93]:
# The metadata is small
!du -sh /epyc/data3/hats/catalogs/two_mass/two_mass/dataset/_metadata

21M	/epyc/data3/hats/catalogs/two_mass/two_mass/dataset/_metadata
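
The `du` output and the `human_readable` values use different units, so they cannot be compared directly. The sketch below converts the `35G` reported for `dataset` into a decimal-GB range. It assumes GNU `du -h` (powers of 1024, rounded up) and that `human_readable.file_size` uses decimal units by default. The function name is hypothetical.

```python
GIB = 1024**3  # bytes in one binary gigabyte (what `du -h` prints as "G")


def du_range_in_gb(du_value_gib: int) -> tuple[float, float]:
    """Decimal-GB range consistent with a rounded-up `du -h` value in GiB."""
    low = (du_value_gib - 1) * GIB / 1e9
    high = du_value_gib * GIB / 1e9
    return low, high


low_gb, high_gb = du_range_in_gb(35)  # `du` reported 35G for two_mass/dataset

# Values printed earlier in this notebook, in decimal GB.
candidates_gb = {"memory_bytes": 36.6, "disk_bytes": 57.6, "hats_estsize": 36.7}
consistent = {
    name: low_gb < value <= high_gb for name, value in candidates_gb.items()
}
print(f"{low_gb:.1f} to {high_gb:.1f} GB", consistent)
```

Under these assumptions, the `memory_bytes` total and `hats_estsize` both fall inside the range that `du` implies, while the `disk_bytes` total does not.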


### Another example with Gaia 


In [94]:
gaia = lsdb.open_catalog('s3://stpubdata/gaia/gaia_dr3/public/hats')

In [95]:
stats = gaia.per_pixel_statistics(
    use_default_columns=False,
    exclude_hats_columns=False,
    include_columns=list(gaia.all_columns) + ["_healpix_29"],
    include_stats=["row_count", "memory_bytes", "disk_bytes"],
    multi_index=True,
)
stats

pixel,column,row_count,memory_bytes,disk_bytes
"Order: 2, Pixel: 0",_healpix_29,726621,6256447,6919275
"Order: 2, Pixel: 0",solution_id,726621,464,392
...,...,...,...,...
"Order: 3, Pixel: 767",ebpminrp_gspphot_upper,820930,640619,682846
"Order: 3, Pixel: 767",libname_gspphot,820930,152771,184938


In [96]:
mem_sizes = pd.to_numeric(stats["memory_bytes"], errors="coerce").sum()
human_readable.file_size(mem_sizes)

'764.7 GB'

In [97]:
disk_sizes = pd.to_numeric(stats["disk_bytes"], errors="coerce").sum()
human_readable.file_size(disk_sizes)

'908.9 GB'

In [98]:
# Same thing here with hats_estsize
human_readable.file_size(int(gaia.hc_structure.catalog_info.hats_estsize) * 1024)

'765.3 GB'

In [99]:
!du -sh /epyc/data3/hats/catalogs/gaia_dr3/gaia/*

713G	/epyc/data3/hats/catalogs/gaia_dr3/gaia/dataset
4.0K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/hats.properties
16K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/partition_info.csv
6.1M	/epyc/data3/hats/catalogs/gaia_dr3/gaia/point_map.fits
4.0K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/properties
12K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/skymap.2.fits
32K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/skymap.4.fits
392K	/epyc/data3/hats/catalogs/gaia_dr3/gaia/skymap.6.fits
6.1M	/epyc/data3/hats/catalogs/gaia_dr3/gaia/skymap.fits


In [100]:
!du -sh /epyc/data3/hats/catalogs/gaia_dr3/gaia/dataset/_metadata

221M	/epyc/data3/hats/catalogs/gaia_dr3/gaia/dataset/_metadata


With a small cone:

In [101]:
cone = gaia[["ra","dec"]].cone_search(ra=270,dec=-20,radius_arcsec=3600)
cone_stats = cone.per_pixel_statistics(
    use_default_columns=False,
    exclude_hats_columns=False,
    include_columns=list(cone.columns) + ["_healpix_29"],
    include_stats=["row_count", "memory_bytes", "disk_bytes"],
    multi_index=True,
)
cone_stats



pixel,column,row_count,memory_bytes,disk_bytes
"Order: 5, Pixel: 7231",_healpix_29,1566976,9019886.0,14767496.0
"Order: 5, Pixel: 7231",ra,1566976,12651247.0,14767496.0
"Order: 5, Pixel: 7231",dec,1566976,13149159.0,14767496.0
"Order: 5, Pixel: 7274",_healpix_29,719847,4454137.0,6850670.0
"Order: 5, Pixel: 7274",ra,719847,5934404.0,6850670.0
"Order: 5, Pixel: 7274",dec,719847,6112627.0,6850670.0
"Order: 6, Pixel: 29268",_healpix_29,583654,3307034.0,5506115.0
"Order: 6, Pixel: 29268",ra,583654,4700694.0,5506115.0
"Order: 6, Pixel: 29268",dec,583654,4899475.0,5506115.0
"Order: 6, Pixel: 29269",_healpix_29,391817,2271673.0,3692458.0


In [102]:
mem_sizes = pd.to_numeric(cone_stats["memory_bytes"], errors="coerce").sum()
human_readable.file_size(mem_sizes)

'109.9 MB'

In [103]:
disk_sizes = pd.to_numeric(cone_stats["disk_bytes"], errors="coerce").sum()
human_readable.file_size(disk_sizes)

'139.6 MB'

In [104]:
filtered_gaia = cone.compute()

In [105]:
human_readable.file_size(sys.getsizeof(filtered_gaia))

'31.2 MB'

In [106]:
human_readable.file_size(filtered_gaia.memory_usage(deep=True).sum())

'31.2 MB'
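
For the cone search, the computed size is well below the metadata estimate. One plausible reading, not verified in this notebook, is that the statistics describe whole partitions while the cone keeps only the rows inside the radius. The sketch below quantifies the gap, assuming three dense 8-byte arrays per row (`ra`, `dec`, and the `_healpix_29` index).

```python
estimated_mb = 109.9   # sum of memory_bytes for the cone's partitions
computed_mb = 31.2     # pandas memory_usage(deep=True) after compute()

# Assumption: 3 dense 8-byte arrays per row (ra, dec, and the index).
bytes_per_row = 3 * 8
approx_rows = computed_mb * 1e6 / bytes_per_row
fraction_of_estimate = computed_mb / estimated_mb

print(f"~{approx_rows / 1e6:.1f}M rows, {fraction_of_estimate:.0%} of the estimate")
```

For comparison, the first partition listed in `cone_stats` alone holds 1,566,976 rows, more than the approximate row count of the computed result.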

In [107]:
filtered_gaia.to_parquet("zstd_gaia.parquet", compression="ZSTD")
filtered_gaia.to_parquet("snappy_gaia.parquet", compression="snappy")

In [108]:
!du -sh *

32K	estimate_size.ipynb
106M	snappy.parquet
30M	snappy_gaia.parquet
91M	zstd.parquet
27M	zstd_gaia.parquet
