DM-36199: Add optional Parquet outputs to diaPipe #160
Conversation
Force-pushed from e2b9f67 to 3936d3b
The two new parquet tables are nearly consistent enough with the APDB tables to be able to reconstruct the latter from the former, but not quite. In order for these parquet outputs to be useful (e.g., for making analysis_tools style plots), this condition needs to be met.
```python
    name="{fakesType}{coaddName}Diff_assocDiaSrc",
    storageClass="DataFrame",
    dimensions=("instrument", "visit", "detector"),
)
diaForcedSources = connTypes.Output(
```
Please investigate how to make the `flags` column behave in a more sensible way. It is being cast as a float for some reason, and the column only exists at all for the not-first dataId processed. All the values are either NaN or 0.0 (both `np.float64`).
Is the `flags` column always a float, or is it only a float after you use the pandas `concat` function?

There is no way that I know of to enforce a schema on pandas tables (hence why many of us want to push the project away from using them), so I don't think what you're asking for is possible either way.
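The float cast is easy to reproduce outside the pipeline. A minimal sketch with made-up toy data (not the actual pipeline tables), showing what `pd.concat` does when one frame is missing the column:

```python
import numpy as np
import pandas as pd

# First "visit": no flags column was loaded.
df1 = pd.DataFrame({"diaSourceId": [1, 2]})
# Second "visit": flags loaded as int64.
df2 = pd.DataFrame({"diaSourceId": [3, 4],
                    "flags": np.array([0, 0], dtype=np.int64)})

# Concatenation fills the missing rows with NaN, which forces the
# whole column to float64.
combined = pd.concat([df1, df2])
print(combined["flags"].dtype)  # float64
```

This is pandas' standard "missing values promote integers to float" behavior, not something the pipeline code asks for.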
The `flags` column is always a float when it exists; it does not exist for the first visit+detector processed, but it does for the second, and it is a float then.
`diaForcedSources['flags']` is added when `self.diaCatalogLoader` returns >1 entries (shouldn't that be >0?) in `loaderResult.diaForcedSources` (thus, only when there are previous sources in this field to do forced photometry on, I believe). The `append` call in `doPackageAlerts` results in that field being converted to `float64` from the originally-loaded `int64`: that's definitely a bug, and @kfindeisen's changes on #162 won't fix it, either. This is, once again, the danger of using pandas.

We could conceivably have `diaForcedSource.run` always add a zero-filled `flags` column, so that pandas doesn't try to use its terribly stupid "null means convert numbers to float" approach.

I don't know what goes into the `DiaForcedSource['flags']` field, and apparently neither does the APDB baseline schema in `sdm_schemas`, since the docstring for all our `flags` columns is "bitfield, tbd", so I don't know how necessary this particular field is in this particular schema, or what it even means.
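For illustration, here's a minimal sketch of the zero-filled-column idea (the `ensure_flags` helper name and the toy frames are made up, not the actual `diaForcedSource.run` code): if every frame carries an int64 `flags` column before concatenation, pandas never introduces NaNs and never upcasts to float.

```python
import numpy as np
import pandas as pd

def ensure_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Add a zero-filled int64 flags column if it is absent."""
    if "flags" not in df.columns:
        df = df.assign(flags=np.zeros(len(df), dtype=np.int64))
    return df

frames = [
    pd.DataFrame({"diaSourceId": [1]}),  # no flags loaded
    pd.DataFrame({"diaSourceId": [2],
                  "flags": np.array([4], dtype=np.int64)}),
]
combined = pd.concat(ensure_flags(f) for f in frames)
print(combined["flags"].dtype)  # int64: no NaNs, so no float promotion
```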
```python
    storageClass="DataFrame",
    dimensions=("instrument", "visit", "detector"),
)
diaObjects = connTypes.Output(
```
Please investigate how to make the `validityStart` column behave in a more useful way. Ideally it would be updated to the given exposure's timestamp before this table is persisted. Right now that doesn't seem to be happening; instead, `validityStart` times are only pulled in via the APDB DiaObject history. This makes it next to impossible to construct an APDB-equivalent DiaObject table from this.
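As a sketch of what "update `validityStart` before persisting" could look like, here is a hypothetical helper (the function name and the plain-datetime stand-in for the exposure's visit time are assumptions, not the actual diaPipe API):

```python
from datetime import datetime, timezone

import pandas as pd

def stamp_validity_start(diaObjects: pd.DataFrame,
                         exposure_time: datetime) -> pd.DataFrame:
    """Overwrite validityStart with the exposure's timestamp so the
    persisted parquet table matches what the APDB would record."""
    out = diaObjects.copy()
    out["validityStart"] = pd.Timestamp(exposure_time)
    return out

diaObjects = pd.DataFrame({"diaObjectId": [1, 2]})
stamped = stamp_validity_start(
    diaObjects, datetime(2022, 9, 1, tzinfo=timezone.utc))
print(stamped["validityStart"].iloc[0])
```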
Can you give me a minimum working example on how to demonstrate this difference? I think that digging into how the diaObject table is created is well outside the scope of this ticket, but if it's an easy change to which thing I'm persisting, I can give it a try.
Here's a MWE showing the difference:

```python
#!/usr/bin/env python
# Script from Meredith to compare validityStart values between the APDB
# and the new parquet output.
import os
import sys

import pandas as pd
from IPython.display import display

import lsst.daf.butler as dafButler

# A hack to use the DM-34627 branch of analysis_ap, to be able to use apdbUtils.
sys.path.append('/sdf/data/rubin/u/mrawls/scipipe/analysis_ap/python/lsst/analysis/ap/')
import apdbUtils as pla

repo = '/sdf/data/rubin/u/parejko/cosmos-DM-36199/repo'
collections = 'ap_verify-output'
butler = dafButler.Butler(repo, collections=collections)
instrument = 'HSC'
skymap = 'hsc_rings_v1'
dataIds = [{'visit': 59150, 'detector': 50, 'instrument': 'HSC'},
           {'visit': 59160, 'detector': 51, 'instrument': 'HSC'}]

# Load the APDB DiaObject table.
objTableApdb = pla.loadAllApdbObjects(dbName=os.path.join(repo, '../association.db'), allCol=True)
objTableApdb.set_index('diaObjectId', inplace=True)

# Load the parquet DiaObject tables and concatenate them, which is
# necessary to make the desired "per-run" plots.
def loadTable(dataId, dataset):
    return butler.get(dataset, dataId=dataId)

objFrames = [loadTable(dataId, 'fakes_deepDiff_diaObject') for dataId in dataIds]
objTableParquet = pd.concat(objFrames)
objTableParquet.drop(columns='diaObjectId', inplace=True)

# Compare the two tables' validityStart and validityEnd columns.
display(objTableApdb.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])
display(objTableParquet.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])
```
Having dug into this a bit more, all of the validity information is computed inside dax_apdb, in the database itself. We don't have a facility right now to move that information back into the diaObject structure inside `diaPipe.py`, and I think figuring out how to do that is out of scope for this ticket. I'd advocate merging this as-is, so that we at least have some output written, and filing another ticket to have the updated values filled in inside the dax_apdb code, possibly optionally, since it might not be a fast operation given that it requires reading back from the APDB. The relevant method is `_storeDiaObjects`, but the implementations are quite different in `apdbSql.py` vs. `apdbCassandra.py`, and there's no abstractmethod in the base class.
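To illustrate the missing-abstractmethod point, here is a toy sketch of a base-class hook that both backends would be forced to implement (class and method signatures are hypothetical; the real dax_apdb interface differs):

```python
from abc import ABC, abstractmethod

import pandas as pd

class Apdb(ABC):
    """Toy stand-in for a dax_apdb base class."""

    @abstractmethod
    def _storeDiaObjects(self, objs: pd.DataFrame, visit_time) -> pd.DataFrame:
        """Store objs and return them with validityStart filled in."""

class SqlApdb(Apdb):
    """Toy SQL backend: stamps validityStart before returning."""

    def _storeDiaObjects(self, objs, visit_time):
        out = objs.copy()
        out["validityStart"] = visit_time
        return out

# The abstractmethod forces every backend to provide the hook, so
# callers can rely on validity info coming back with the objects.
stored = SqlApdb()._storeDiaObjects(pd.DataFrame({"diaObjectId": [1]}),
                                    "2022-09-01")
print(stored["validityStart"].iloc[0])
```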
Force-pushed from 3936d3b to 74f680c
After chatting with John and digging more into my two issues (`flags` for diaForcedSources and `validityStart` for diaObjects), it seems like merging this as-is will do no harm, and we should open a new ticket for following up.
It should be noted that naively concatenating parquet diaObject tables is not sufficient to reproduce an APDB DiaObject table, due to the lack of `validityStart` info. Whatever ticket is made to handle this should clearly be noted as blocking any analysis_tools-style plots of DiaObjects.
Since this seems to be the main discussion on this, note that we already have a to-do on Confluence for getting rid of the concatenation operations on performance grounds. That might affect how you want to approach the type and validity problems. |
Force-pushed from 74f680c to d41e3a2