
DM-36199: Add optional Parquet outputs to diaPipe #160

Merged: 2 commits merged into main from tickets/DM-36199 on Oct 28, 2022

Conversation

parejkoj (Contributor):

No description provided.

mrawls (Contributor) left a comment:

The two new parquet tables are nearly consistent enough with the APDB tables to be able to reconstruct the latter from the former, but not quite. In order for these parquet outputs to be useful (e.g., for making analysis_tools style plots), this condition needs to be met.

name="{fakesType}{coaddName}Diff_assocDiaSrc",
storageClass="DataFrame",
dimensions=("instrument", "visit", "detector"),
)
diaForcedSources = connTypes.Output(
mrawls (Contributor):

Please investigate how to make the flags column behave in a more sensible way. It is being cast as a float for some reason, and the column only exists for dataIds after the first one processed. All the values are either NaN or 0.0 (both np.float64).

parejkoj (Contributor Author):

Is the flags column always a float, or is it only a float after you use the pandas concat function?

There is no way that I know of to enforce a schema on pandas tables (which is why many of us want to push the project away from using them), so I don't think what you're asking for is possible either way.
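(An illustrative aside, not from the thread: this is the pandas behavior under discussion, in a minimal sketch with made-up frames. It assumes the first dataId's output lacks the column and the second's carries it as int64.)

import numpy as np
import pandas as pd

# First dataId's output: no flags column was produced.
first = pd.DataFrame({"diaForcedSourceId": [1, 2]})
# Second dataId's output: flags present, loaded as int64.
second = pd.DataFrame({"diaForcedSourceId": [3, 4],
                       "flags": np.array([0, 0], dtype=np.int64)})

combined = pd.concat([first, second])
# Rows missing from `first` become NaN, which silently upcasts the whole
# column to float64 -- the dtype reported above.
print(combined["flags"].dtype)     # float64
print(combined["flags"].tolist())  # [nan, nan, 0.0, 0.0]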

mrawls (Contributor):

The flags column is always a float, when it exists; it does not exist for the first visit+detector processed, but it does for the second, and it is a float then.

parejkoj (Contributor Author):

diaForcedSources['flags'] is added when self.diaCatalogLoader returns >1 entries (shouldn't that be >0?) in loaderResult.diaForcedSources (thus, only when there are previous sources in this field to do forced photometry on, I believe). The append call in doPackageAlerts results in that field being converted to float64 from the originally-loaded int64: that's definitely a bug, and @kfindeisen's changes on #162 won't fix it, either. This is, once again, the danger of using pandas.

We could conceivably have diaForcedSource.run always add a zero-filled flags column, so that pandas doesn't try to use its terribly stupid "null means convert numbers to float" approach.

I don't know what goes into the DiaForcedSource['flags'] field, and apparently neither does the APDB baseline schema in sdm_schemas, since the docstrings for all our flags columns are "bitfield, tbd", so I don't know how necessary this particular field is in this particular schema, or what it even means.
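(An illustrative sketch, not from the PR: one way the zero-filled-column idea above could look, assuming diaForcedSources is a pandas DataFrame. ensure_flags_column is a hypothetical helper name, not pipeline code.)

import numpy as np
import pandas as pd

def ensure_flags_column(diaForcedSources: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: guarantee an int64 flags column exists before any
    # concat/append, so pandas never has to fill missing rows with NaN (which
    # would silently upcast the column to float64).
    if "flags" not in diaForcedSources.columns:
        diaForcedSources["flags"] = np.zeros(len(diaForcedSources), dtype=np.int64)
    return diaForcedSources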

storageClass="DataFrame",
dimensions=("instrument", "visit", "detector"),
)
diaObjects = connTypes.Output(
mrawls (Contributor):

Please investigate how to make the validityStart column behave in a more useful way. Ideally it would be updated to the given exposure's timestamp before this table is persisted. Right now that doesn't seem to be happening, and instead validityStart times are only pulled in via the APDB DiaObject history. This makes it next to impossible to construct an APDB-equivalent DiaObject table from this.
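(An illustrative sketch, not from the PR: what "updated before this table is persisted" could look like, assuming diaObjects is a pandas DataFrame and the exposure timestamp is available as an astropy Time. None of these names come from the actual pipeline.)

import pandas as pd
from astropy.time import Time

def stamp_validity_start(diaObjects: pd.DataFrame, exposure_time: Time) -> pd.DataFrame:
    # Hypothetical: overwrite validityStart with this exposure's timestamp so
    # the persisted per-detector table matches what the APDB records.
    diaObjects["validityStart"] = pd.Timestamp(exposure_time.datetime)
    return diaObjects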

parejkoj (Contributor Author):

Can you give me a minimal working example demonstrating this difference? I think that digging into how the diaObject table is created is well outside the scope of this ticket, but if it's an easy change to which thing I'm persisting, I can give it a try.

parejkoj (Contributor Author), Oct 21, 2022:

Here's an MWE showing the difference:

#!/usr/bin/env python
# Script from Meredith to compare validityStart values between the APDB and the new parquet output.
import os
import sys

import pandas as pd
from IPython.display import display
import lsst.daf.butler as dafButler
# A hack to use the DM-34627 branch of analysis_ap, to be able to use apdbUtils.
sys.path.append('/sdf/data/rubin/u/mrawls/scipipe/analysis_ap/python/lsst/analysis/ap/')
import apdbUtils as pla

repo = '/sdf/data/rubin/u/parejko/cosmos-DM-36199/repo'
collections = 'ap_verify-output'
butler = dafButler.Butler(repo, collections=collections)
instrument = 'HSC'
skymap = 'hsc_rings_v1'
dataIds = [{'visit': 59150, 'detector': 50, 'instrument': 'HSC'}, {'visit': 59160, 'detector': 51, 'instrument': 'HSC'}]

# Load APDB DiaObject table
objTableApdb = pla.loadAllApdbObjects(dbName=os.path.join(repo, '../association.db'), allCol=True)
objTableApdb.set_index('diaObjectId', inplace=True)

# Load parquet DiaObject tables and concatenate them, which is necessary to make desired "per-run" plots
def loadTable(dataId, dataset):
    oneTable = butler.get(dataset, dataId=dataId)
    return oneTable

objFrames = [loadTable(dataId, 'fakes_deepDiff_diaObject') for dataId in dataIds]
objTableParquet = pd.concat(objFrames)
objTableParquet.drop(columns='diaObjectId', inplace=True)

# Compare the two tables' validityStart and validityEnd columns
display(objTableApdb.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])
display(objTableParquet.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])

parejkoj (Contributor Author), Oct 25, 2022:

Having dug into this a bit more: all of the validity information is computed inside dax_apdb and done in the database itself. We don't have a facility right now to move that information back into the diaObject structure inside diaPipe.py, and I think figuring out how to do that is out of scope for this ticket. I'd advocate merging this as-is, so that we at least have some output written, and filing another ticket to have the updated values filled in inside the dax_apdb code (perhaps optionally, given that it might not be a fast operation, since it requires reading back from the APDB). The relevant method is _storeDiaObjects, but those implementations seem quite different in apdbSql.py vs. apdbCassandra.py, and there's no abstractmethod in the base class.
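(An illustrative sketch, not from the PR: in the sqlite case, "reading back from the APDB" could amount to a query like the one below. The table and column names follow the APDB baseline schema; the database path is the one resolved in the MWE above.)

import sqlite3
import pandas as pd

# Hypothetical: pull the validity columns straight back out of the sqlite APDB
# after store, so they could be merged into the per-detector diaObject output.
with sqlite3.connect("/sdf/data/rubin/u/parejko/cosmos-DM-36199/association.db") as conn:
    validity = pd.read_sql_query(
        'SELECT "diaObjectId", "validityStart", "validityEnd" FROM "DiaObject"',
        conn,
    )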

mrawls (Contributor) left a comment:

After chatting with John and digging more into my two issues (flags for diaForcedSources and validityStart for diaObjects), it seems like merging this as-is will do no harm, and we should open a new ticket to follow up.

It should be noted that naively concatenating parquet diaObject tables is not sufficient to reproduce an APDB DiaObject table, due to the lack of validityStart info. Whatever ticket is made to handle this should be clearly noted as blocking any analysis_tools-style plots of DiaObjects.

kfindeisen (Member):

Since this seems to be the main discussion on this, note that we already have a to-do on Confluence for getting rid of the concatenation operations on performance grounds. That might affect how you want to approach the type and validity problems.

@parejkoj parejkoj merged commit aac2bdf into main Oct 28, 2022
@parejkoj parejkoj deleted the tickets/DM-36199 branch October 28, 2022 17:14