# Catalog Import Verification

Author: Melissa

In this notebook, we demonstrate basic usage of the newly added Verification pipeline, authored by Troy Raen.

The pipeline exhaustively checks that the catalog contains the expected number of rows and that all of its metadata is self-consistent. It can take a while to run; the total number of rows and the total number of columns in the catalog are the largest factors in the run time.

In [1]:
import hats_import.verification.run_verification as runner
import pandas as pd
from hats_import.verification.arguments import VerificationArguments
from time import perf_counter
from pathlib import Path
import hats

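## Smaller catalogs, verbose=False

This cell runs the pipeline on several smaller catalogs with `verbose=False`, so the pipeline itself prints nothing. The only expected cell output is the per-catalog timing line printed by the loop, plus a `FAILED` line for any catalog that does not pass.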

Note that the pipeline returns a `verifier` object that gives detailed access to the individual test results. Here we use it only to determine whether all of the tests passed.

In [2]:
SMALL_CATALOGS = [
    "/epyc/data3/hats/catalogs/alerce/alerce_nested/",  # 3.16
    "/epyc/data3/hats/catalogs/erosita/erosita_dr1_erass1",  # 0.86
    "/epyc/data3/hats/catalogs/sdss_dr18_specphotoall/",  # 2.52
    "/epyc/data3/hats/catalogs/ztf_dr14/ztf_object",  # 49.08
    "/epyc/data3/hats/catalogs/gaia_dr3/gaia_edr3_distances",  # 69.59
    "/epyc/data3/hats/catalogs/two_mass",  # 83.11
]

for catalog_path in SMALL_CATALOGS:
    t1_start = perf_counter()
    short_name = Path(catalog_path).stem
    output_path = "./results/" + short_name
    
    args = VerificationArguments(input_catalog_path=catalog_path, 
                                 output_path=output_path, 
                                 verbose=False)
    verifier = runner.run(args, write_mode="w")
    if not verifier.all_tests_passed:
        print(f"FAILED Catalog {short_name}")
    
    print(f"Processed {short_name} in: {perf_counter()-t1_start:.2f} (seconds)")

Processed alerce_nested in: 0.52 (seconds)
Processed erosita_dr1_erass1 in: 0.50 (seconds)
Processed sdss_dr18_specphotoall in: 1.93 (seconds)
Processed ztf_object in: 47.90 (seconds)
Processed gaia_edr3_distances in: 7.86 (seconds)
Processed two_mass in: 82.16 (seconds)
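
The loop above derives each catalog's short name with `Path(...).stem`. That works for the paths listed because none of the directory names contains a dot. A minimal sketch of how `.stem` and `.name` differ, using one path from the list and one hypothetical dotted directory name (`catalog_v1.2` is made up for illustration, not a real catalog):

```python
from pathlib import Path

# A path from the list above: pathlib ignores the trailing slash.
real = Path("/epyc/data3/hats/catalogs/alerce/alerce_nested/")
print(real.stem)  # alerce_nested
print(real.name)  # alerce_nested

# Hypothetical directory name containing a dot (not a real catalog).
dotted = Path("/tmp/catalogs/catalog_v1.2")
print(dotted.stem)  # catalog_v1   (".2" is treated as a file suffix)
print(dotted.name)  # catalog_v1.2
```

If catalog directory names may contain dots, `.name` is probably the safer choice for building `output_path`.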


## Large catalog, verbose

In the following cell, we run the pipeline on a larger catalog. Larger catalogs can take longer, so we set `verbose=True` to follow the progress reporting as each test runs.

In [4]:
catalog_path = "/epyc/data3/hats/catalogs/ztf_dr22/ztf_lc"
short_name = "ztf_dr22_lc"
output_path = "./results/" + short_name

args = VerificationArguments(input_catalog_path=catalog_path, 
                             output_path=output_path, 
                             verbose=True)
verifier = runner.run(args, write_mode="w")

Loading dataset and schema.

Starting: Test hats.io.validation.is_valid_catalog (hats version 0.4.6.dev6+gfeedc15).
Validating catalog at path /epyc/data3/hats/catalogs/ztf_dr22/ztf_lc ... 
Found 10839 partitions.
Approximate coverage is 78.13 % of the sky.
Result: PASSED

Starting: Test that files in _metadata match the data files on disk.
Result: PASSED

Starting: Test that number of rows are equal.
	file footers vs catalog properties
	file footers vs _metadata
Result: PASSED

Starting: Test that schemas are equal, excluding metadata.
	_common_metadata vs truth
	_metadata vs truth
	file footers vs truth
Result: PASSED

Verifier results written to results/ztf_dr22_lc/verifier_results.csv
 Elapsed time (seconds) :26.33
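
The row-count test in the log above compares the Parquet file footers against two other sources: the catalog properties and the `_metadata` file. The sketch below illustrates that kind of comparison on plain integers. It is not the pipeline's implementation: the function name, the argument names, the per-file granularity of the `_metadata` comparison, and all of the counts are assumptions made for illustration.

```python
def compare_row_counts(footer_rows, metadata_rows, properties_total_rows):
    """Return a dict mapping each comparison name to True (match) or False."""
    footer_total = sum(footer_rows.values())
    return {
        # The sum over all file footers should equal the total in the properties.
        "file footers vs catalog properties": footer_total == properties_total_rows,
        # Assumed here: each file's footer count should equal its _metadata entry.
        "file footers vs _metadata": footer_rows == metadata_rows,
    }


# Hypothetical per-file row counts (file names and numbers are made up).
footer_rows = {"Npix=0.parquet": 120, "Npix=1.parquet": 80}
metadata_rows = {"Npix=0.parquet": 120, "Npix=1.parquet": 80}

checks = compare_row_counts(footer_rows, metadata_rows, properties_total_rows=200)
passed = all(checks.values())
print(checks)
print("Result:", "PASSED" if passed else "FAILED")
```

Keeping the two comparisons separate, as the log does, makes it easier to tell which source disagrees when a test fails.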


## Additional manual verification

The following cells are based on [this tutorial notebook](https://docs.lsdb.io/en/stable/tutorials/manual_verification.html).

In [5]:
catalog_object = hats.read_hats(catalog_path)
catalog_object.schema

_healpix_29: int64
objectid: int64
filterid: int8
fieldid: int16
rcid: int8
objra: float
objdec: float
nepochs: int64
hmjd: list<element: double>
  child 0, element: double
mag: list<element: float>
  child 0, element: float
magerr: list<element: float>
  child 0, element: float
clrcoeff: list<element: float>
  child 0, element: float
catflags: list<element: int32>
  child 0, element: int32
Norder: uint8
Dir: uint64
Npix: uint64

In [6]:
pd.set_option('display.float_format', '{:.2f}'.format)

catalog_object.aggregate_column_statistics()

| column_names | min_value | max_value | null_count |
|---|---|---|---|
| objectid | 202110100000000.0 | 1896211400028220.8 | 0.0 |
| filterid | 1.0 | 3.0 | 0.0 |
| fieldid | 202.0 | 1896.0 | 0.0 |
| rcid | 0.0 | 63.0 | 0.0 |
| objra | 0.0 | 360.0 | 0.0 |
| objdec | -30.7 | 89.21 | 0.0 |
| nepochs | 1.0 | 1884.0 | 0.0 |
| hmjd.list.element | 58197.12 | 60491.42 | 0.0 |
| mag.list.element | -2.52 | 32.44 | 0.0 |
| magerr.list.element | -1.46 | 1.14 | 0.0 |
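
Statistics like these lend themselves to simple range checks. The sketch below copies three min/max pairs from the statistics above and compares them with expected bounds. The bounds are assumptions made for illustration (right ascension in [0, 360] degrees, declination in [-90, 90] degrees, non-negative magnitude errors), and `out_of_range` is a hypothetical helper, not part of `hats`.

```python
# (min_value, max_value) pairs copied from the statistics above.
stats = {
    "objra": (0.0, 360.0),
    "objdec": (-30.7, 89.21),
    "magerr.list.element": (-1.46, 1.14),
}

# Assumed plausible bounds for each column; adjust to the survey's conventions.
expected = {
    "objra": (0.0, 360.0),
    "objdec": (-90.0, 90.0),
    "magerr.list.element": (0.0, float("inf")),
}


def out_of_range(stats, expected):
    """Return the columns whose observed min or max falls outside the bounds."""
    flagged = []
    for column, (low, high) in expected.items():
        observed_min, observed_max = stats[column]
        if observed_min < low or observed_max > high:
            flagged.append(column)
    return flagged


flagged = out_of_range(stats, expected)
print(flagged)  # ['magerr.list.element']
```

With these assumed bounds, only `magerr.list.element` is flagged, because its minimum is negative. Whether that indicates a problem depends on the conventions of the source data.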
