# Speed up reading partition info

Let's start with an existing catalog with about 20,000 partitions in it. I've got a catwise metadata file locally, so I'm using it.

In [1]:
from hipscat.catalog import PartitionInfo
import os

catalog_path = "/data3/epyc/data3/hipscat/catalogs/neowise_yr8"
metadata_path = os.path.join(catalog_path, "_metadata")
csv_path = os.path.join(catalog_path, "partition_info.csv")

Start by reading from `_metadata`. This is pretty slow. We'll run each variant once, then 20 times in a loop, just to smooth out the variance a little.

In [2]:
%%time

info = PartitionInfo.read_from_file(metadata_path)
len(info.get_healpix_pixels())

CPU times: user 7.16 s, sys: 1.75 s, total: 8.91 s
Wall time: 8.92 s


20010

In [3]:
%%time

for i in range(0,20):
    info = PartitionInfo.read_from_file(metadata_path)


CPU times: user 2min 11s, sys: 12.5 s, total: 2min 23s
Wall time: 2min 23s
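The 20-run loop could also be written with the standard library's `timeit` module, which returns one timing per run instead of a single total. This is a minimal sketch, not what the notebook ran: `time_reader` is a hypothetical helper, and the lambda is a stand-in workload so the block runs without a catalog on disk.

```python
import statistics
import timeit


def time_reader(read_fn, repeats=20):
    """Call read_fn `repeats` times; return the per-call wall times in seconds."""
    # number=1 means each entry in the result is the duration of a single call.
    return timeit.repeat(read_fn, number=1, repeat=repeats)


# Stand-in workload (assumption). In the notebook this would be something like
# lambda: PartitionInfo.read_from_file(metadata_path)
timings = time_reader(lambda: sum(range(10_000)), repeats=20)

print(
    f"median {statistics.median(timings) * 1e3:.3f} ms, "
    f"total {sum(timings):.3f} s over {len(timings)} runs"
)
```

Looking at the median as well as the total makes it easier to see whether one slow first read (cold cache) is skewing the sum.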


In [4]:
import numpy as np
from hipscat.io import FilePointer, file_io
from hipscat.pixel_math import HealpixPixel

def write_to_npy(partition_info, catalog_path: FilePointer):
    npy_file = file_io.append_paths_to_pointer(catalog_path, "partition_info")
    data_frame = partition_info.as_dataframe()

    npy_array = np.array(
        [data_frame[partition_info.METADATA_ORDER_COLUMN_NAME], data_frame[partition_info.METADATA_PIXEL_COLUMN_NAME]]
    )
    np.save(npy_file, npy_array)

def read_from_npy(catalog_path: FilePointer, storage_options: dict = None) -> PartitionInfo:
    npy_file = file_io.append_paths_to_pointer(catalog_path, "partition_info.npy")
    if not file_io.does_file_or_directory_exist(npy_file, storage_options=storage_options):
        raise FileNotFoundError(f"No partition info found where expected: {str(npy_file)}")

    npy_array = np.load(npy_file, allow_pickle=True)
    orders = npy_array[0]
    pixels = npy_array[1]

    pixel_list = [HealpixPixel(order, pixel) for order, pixel in zip(orders, pixels)]

    return PartitionInfo(pixel_list)

Using that same partition info, let's write it out to an uncompressed npy file. This should be wicked fast.

In [5]:
%%time
write_to_npy(info, catalog_path)

CPU times: user 26.9 ms, sys: 1.78 ms, total: 28.6 ms
Wall time: 27.9 ms


In [6]:
%%time

info = read_from_npy(catalog_path)
len(info.get_healpix_pixels())

CPU times: user 20.6 ms, sys: 2.75 ms, total: 23.4 ms
Wall time: 22 ms


20010

In [7]:
%%time

for i in range(0,20):
    info = read_from_npy(catalog_path)

CPU times: user 830 ms, sys: 6.64 ms, total: 836 ms
Wall time: 835 ms
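For reference, here is the same two-row layout round-tripped without hipscat, as a minimal sketch. The `(order, pixel)` pairs are made up, and plain tuples stand in for `HealpixPixel`. Note that `np.save` appends `.npy` when the target name lacks it, which is why the writer above passes `"partition_info"` while the reader looks for `"partition_info.npy"`.

```python
import os
import tempfile

import numpy as np

# Made-up (order, pixel) pairs standing in for a real partition list.
pairs = [(0, 4), (1, 20), (1, 21), (2, 88)]

orders = [order for order, _ in pairs]
pixels = [pixel for _, pixel in pairs]

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "partition_info.npy")
    # Row 0 holds the orders and row 1 the pixels, as in write_to_npy.
    np.save(npy_path, np.array([orders, pixels]))

    # A plain integer array is stored as raw data, not pickled,
    # so allow_pickle is not needed to read it back.
    loaded = np.load(npy_path)

# Convert numpy integers back to Python ints while rebuilding the pairs.
restored = [(int(order), int(pixel)) for order, pixel in zip(loaded[0], loaded[1])]
print(restored)
```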


Let's also look at what this looks like if we use a CSV and pandas parsing. Is it substantially worse than the npy?
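As a rough picture of what the CSV path involves, here is a minimal pandas round trip through an in-memory buffer. It is only a sketch: the column names `Norder` and `Npix` and the sample values are assumptions for illustration, and the real `write_to_file` / `read_from_csv` methods may do more than this.

```python
import io

import pandas as pd

# "Norder" and "Npix" are assumed column names; the values are made up.
frame = pd.DataFrame({"Norder": [0, 1, 1, 2], "Npix": [4, 20, 21, 88]})

# Serialize to CSV text without the index column.
csv_text = frame.to_csv(index=False)

# Parse it back and rebuild the (order, pixel) pairs as Python ints.
parsed = pd.read_csv(io.StringIO(csv_text))
csv_pairs = list(zip(parsed["Norder"].tolist(), parsed["Npix"].tolist()))
print(csv_pairs)
```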

In [8]:
%%time
info.write_to_file(csv_path)

CPU times: user 260 ms, sys: 6.21 ms, total: 266 ms
Wall time: 289 ms


In [9]:
%%time

info = PartitionInfo.read_from_csv(csv_path)
len(info.get_healpix_pixels())

CPU times: user 51.6 ms, sys: 1.12 ms, total: 52.7 ms
Wall time: 50.8 ms


20010

In [10]:
%%time

for i in range(0,20):
    info = PartitionInfo.read_from_csv(csv_path)

CPU times: user 960 ms, sys: 3.4 ms, total: 963 ms
Wall time: 962 ms


Well. That's embarrassing. The npy file is really not much better than the original CSV stuff we had a few months ago. What if we pickle the whole list of pixels, so we don't have to do any additional data marshalling?

In [11]:
def write_to_npy(partition_info, catalog_path: FilePointer):
    npy_file = file_io.append_paths_to_pointer(catalog_path, "partition_info")
    data_frame = partition_info.as_dataframe()  # left over from the first version; unused below

    npy_array = np.array(
        partition_info.pixel_list
    )
    np.save(npy_file, npy_array)

def read_from_npy(catalog_path: FilePointer, storage_options: dict = None) -> PartitionInfo:
    npy_file = file_io.append_paths_to_pointer(catalog_path, "partition_info.npy")
    if not file_io.does_file_or_directory_exist(npy_file, storage_options=storage_options):
        raise FileNotFoundError(f"No partition info found where expected: {str(npy_file)}")

    npy_array = np.load(npy_file, allow_pickle=True)
    return PartitionInfo(npy_array)

In [12]:
%%time
write_to_npy(info, catalog_path)

CPU times: user 78.6 ms, sys: 323 µs, total: 78.9 ms
Wall time: 77.1 ms


In [13]:
%%time

info = read_from_npy(catalog_path)
len(info.get_healpix_pixels())

CPU times: user 30.4 ms, sys: 5.28 ms, total: 35.7 ms
Wall time: 33 ms


20010

In [14]:
%%time

for i in range(0,20):
    info = read_from_npy(catalog_path)

CPU times: user 738 ms, sys: 7.11 ms, total: 745 ms
Wall time: 740 ms


That's a little better, but we're not talking about a 10x speedup here. It's less than a 2x speedup over the CSV (740 ms vs. 962 ms for 20 reads, so about 1.3x), and it comes with a lot of complexity and adds a new file type to the mix.
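To put numbers on that, the ratios can be worked out directly from the 20-read wall times reported in the cells above. The dictionary labels are just descriptive names for this sketch.

```python
# Wall times for 20 consecutive reads, copied from the cell outputs (seconds).
wall_times = {
    "_metadata": 2 * 60 + 23,  # 2min 23s
    "npy (two int rows)": 0.835,
    "csv": 0.962,
    "npy (pickled pixels)": 0.740,
}

# Speedup of each format relative to the CSV reader (values > 1 beat CSV).
csv_time = wall_times["csv"]
speedups = {name: csv_time / seconds for name, seconds in wall_times.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x relative to csv")
```

Both npy variants land well under 2x relative to CSV, while every one of these formats is over a hundred times faster than reading `_metadata`.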

So I think it's just not worth it. Using the `partition_info.csv` that we already generate and know how to parse is pretty good.