# Importing macauff results into HATS / LSDB

The macauff pipeline runs outside of LSDB and creates a series of yaml and csv
files that represent the counterpart assignments and match likelihoods between
objects in two catalogs.

To convert into an LSDB-friendly association table, you will need:
- schema file for the match output (either yaml or parquet schema)
- match output CSVs
- HATS left catalog
- HATS right catalog

## Schema

If you have a yaml schema, you will first need to convert it into a parquet
schema that the import pipeline can read.

The `from_yaml` utility method reads the yaml file, parses the column names,
types, and other key-value metadata, converts the types into parquet types, and
writes the result to an empty parquet file.

In [1]:
from hats.io import file_io
from lsdb_macauff.import_pipeline.convert_metadata import from_yaml
from pathlib import Path

## Location of the data on UW internal servers
epyc_input_path = Path("/epyc/data3/hipscat/raw/macauff_match")
yaml_input_file = epyc_input_path / "macauff_metadata.yml"
from_yaml(yaml_input_file, epyc_input_path)

We can inspect the parquet/pyarrow schema that's generated:

In [2]:
matches_schema_file = epyc_input_path / "macauff_GaiaDR3xCatWISE2020_matches.parquet"
single_metadata = file_io.read_parquet_metadata(matches_schema_file)
schema = single_metadata.schema.to_arrow_schema()
schema

gaia_source_id: int64
  -- field metadata --
  name: 'gaia_source_id'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.gaia_source_id'
  datatype: 'long'
  description: 'The Gaia DR3 object ID.'
gaia_ra: double
  -- field metadata --
  name: 'gaia_ra'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.gaia_ra'
  datatype: 'double'
  description: 'Right Ascension of the Gaia DR3 source.'
gaia_dec: double
  -- field metadata --
  name: 'gaia_dec'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.gaia_dec'
  datatype: 'double'
  description: 'The Gaia DR3 declination.'
BP: double
  -- field metadata --
  name: 'BP'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.BP'
  datatype: 'double'
  description: 'The BP magnitude, from Gaia DR3.'
G: double
  -- field metadata --
  name: 'G'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.G'
  datatype: 'double'
  description: 'The Gaia DR3 G magnitude.'
RP: double
  -- field metadata --
  name: 'RP'
  @id: '#macauff_GaiaDR3xCatWISE2020_matches.RP'
  datatype: 'double'
 

## Association table

To create the association between two catalogs, use the purpose-built Macauff
association table import map-reduce pipeline.

For general pipeline argument configuration, see the documentation for the
[hipscat-import pipeline](https://hipscat-import.readthedocs.io/en/latest/catalogs/arguments.html#pipeline-setup).
The macauff pipeline uses that pipeline construction directly.

In [4]:
import lsdb_macauff.import_pipeline.run_import as runner
from hats_import.catalog.file_readers import CsvReader
from lsdb_macauff.import_pipeline.arguments import MacauffArguments
from dask.distributed import Client
import glob

matches_schema_file = epyc_input_path / "macauff_GaiaDR3xCatWISE2020_matches.parquet"

## Find all of the CSV files under the macauff output directory.
macauff_data_dir = "/epyc/data3/hipscat/raw/macauff_results/rds/project/iris_vol3/rds-iris-ip005/tjw/dr3_catwise_allskytest/output_csvs/"
files = glob.glob(f"{macauff_data_dir}/**/*.csv")
files.sort()

args = MacauffArguments(
    ## This will create an association catalog at the path:
    ##    /epyc/data3/hats/tmp/macauff_association/
    output_path="/epyc/data3/hats/tmp",
    output_artifact_name="macauff_association",
    ## Make sure you use a directory with enough space!
    tmp_dir="/epyc/data3/hats/tmp/macauff",
    ## Read the CSV files and use the generated schema file for types and
    ## other key-value metadata
    input_file_list=files,
    input_format="csv",
    file_reader=CsvReader(schema_file=matches_schema_file, header=None),
    metadata_file_path=matches_schema_file,
    ## For the left catalog, specify the existing HATS catalog location,
    ## the ra/dec columns, and the id/association columns
    left_catalog_dir="/epyc/data3/hats/catalogs/gaia_dr3/gaia",
    left_ra_column="gaia_ra",
    left_dec_column="gaia_dec",
    left_id_column="source_id",
    left_assn_column="gaia_source_id",
    ## For the right catalog, specify the existing HATS catalog location,
    ## the ra/dec columns, and the id/association columns
    right_catalog_dir="/epyc/data3/hats/tmp/catwise/catwise2020_test",
    right_ra_column="catwise_ra",
    right_dec_column="catwise_dec",
    right_id_column="source_name",
    right_assn_column="catwise_name",
)

with Client(
    n_workers=64,
    threads_per_worker=1,
    local_directory="/epyc/data3/hats/tmp/macauff",
) as client:
    runner.run(args, client)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 42607 instead


Planning:   0%|          | 0/4 [00:00<?, ?it/s]

Splitting :   0%|          | 0/1558 [00:00<?, ?it/s]



Reducing  :   0%|          | 0/3933 [00:00<?, ?it/s]

Finishing:   0%|          | 0/4 [00:00<?, ?it/s]
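
One detail in the file-discovery step of the cell above is worth checking against your own directory layout. `glob.glob` is called with a `**` pattern but without `recursive=True`, and in that mode Python treats `**` like a single `*`, so only CSV files exactly one directory below `macauff_data_dir` are matched. The sketch below shows the difference on a hypothetical temporary directory tree (the file names are made up for illustration).

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: CSV files at three different depths.
    for rel in ["top.csv", "a/one.csv", "a/b/two.csv"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    pattern = f"{root}/**/*.csv"
    # Without recursive=True, "**" behaves like "*": exactly one directory level.
    one_level = sorted(Path(p).name for p in glob.glob(pattern))
    # With recursive=True, "**" matches zero or more directory levels.
    any_depth = sorted(Path(p).name for p in glob.glob(pattern, recursive=True))

print(one_level)
print(any_depth)
```

If the macauff output places all CSVs one level down, the two calls find the same files. Otherwise, passing `recursive=True` picks up files at every depth.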

The import can take a while. Once it finishes, check that the output can be opened as a valid association catalog.

In [None]:
import lsdb

assoc = lsdb.open_catalog(args.catalog_path)
assoc

Unnamed: 0,gaia_source_id,gaia_ra,gaia_dec,BP,G,RP,catwise_name,catwise_ra,catwise_dec,W1,W2,match_p,separation,eta,xi,gaia_avg_cont,catwise_avg_cont,gaia_cont_f1,gaia_cont_f10,catwise_cont_f1,catwise_cont_f10,catwise_fit_sig
0,5724779088834304,40.328126,4.256346,18.7025,18.3868,17.9721,J024118.73+041522.4,40.328074,4.256249,17.229,16.959,0.998938,0.395863,-0.255275,3.22872,6e-06,0.159815,1.7e-05,8.8e-05,0.559919,0.955755,0.195821
1,5725225766071680,40.201305,4.23729,,20.2303,19.0433,J024048.32+041414.2,40.201354,4.237288,16.668,16.599,0.999983,0.175508,0.881262,3.893636,7e-06,0.121164,3.7e-05,0.00014,0.34503,0.88752,0.130057
2,5725157046338432,40.201799,4.226506,20.0679,19.6351,19.0454,J024048.45+041335.0,40.201909,4.226394,16.83,17.56,0.999576,0.566629,0.690721,2.681283,7e-06,0.141987,3.7e-05,0.00014,0.978493,0.998583,0.143704
3,5725126981583744,40.292701,4.281578,12.4559,12.1779,11.7331,J024110.24+041653.7,40.292698,4.281597,10.88,10.897,1.0,0.068846,1.646148,4.841319,0.0,0.004005,0.0,3e-06,0.004571,0.04182,0.036315
4,5725126981081216,40.289341,4.28365,,20.2949,19.3369,J024109.44+041659.9,40.289365,4.283327,16.554,16.127,0.996093,1.166542,0.521858,1.884581,1e-05,0.147713,4.2e-05,0.000182,0.998986,0.999952,0.179148
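
As a quick sanity check on the imported rows, the reported `separation` can be compared against a separation recomputed from the two sets of coordinates. This is a minimal sketch, assuming `separation` is an on-sky distance in arcseconds (the table does not state its units) and using a hypothetical helper, `angular_separation_arcsec`. The coordinates are copied from the sample rows above, where they are rounded to six decimals, so small residuals are expected.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    half_ddec = (dec2 - dec1) / 2.0
    half_dra = (ra2 - ra1) / 2.0
    h = math.sin(half_ddec) ** 2 + math.cos(dec1) * math.cos(dec2) * math.sin(half_dra) ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(h))) * 3600.0


# (gaia_ra, gaia_dec, catwise_ra, catwise_dec, separation) from sample rows 0, 3 and 4.
sample_rows = [
    (40.328126, 4.256346, 40.328074, 4.256249, 0.395863),
    (40.292701, 4.281578, 40.292698, 4.281597, 0.068846),
    (40.289341, 4.28365, 40.289365, 4.283327, 1.166542),
]

# Absolute difference between the recomputed and the reported separation.
residuals = [
    abs(angular_separation_arcsec(ra1, dec1, ra2, dec2) - reported)
    for ra1, dec1, ra2, dec2, reported in sample_rows
]

for (*_, reported), residual in zip(sample_rows, residuals):
    print(f"reported={reported:.6f}  |recomputed - reported|={residual:.6f}")
```

Residuals much larger than the coordinate rounding would suggest that the ra/dec columns or the association columns were mapped incorrectly in `MacauffArguments`.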
