DM-36199: Add optional Parquet outputs to diaPipe #160
Conversation
Force-pushed from e2b9f67 to 3936d3b
The two new parquet tables are nearly consistent enough with the APDB tables to be able to reconstruct the latter from the former, but not quite. In order for these parquet outputs to be useful (e.g., for making analysis_tools style plots), this condition needs to be met.
```python
    name="{fakesType}{coaddName}Diff_assocDiaSrc",
    storageClass="DataFrame",
    dimensions=("instrument", "visit", "detector"),
)
diaForcedSources = connTypes.Output(
```
Please investigate how to make the `flags` column behave in a more sensible way. It is being cast as a float for some reason, and the column only exists at all for the not-first dataId processed. All the values are either NaN or 0.0 (both `np.float64`).
Is the `flags` column always a float, or is it only a float after you use the pandas `concat` function?

There is no way that I know of to enforce a schema on pandas tables (hence why many of us want to push the project away from using them), so I don't think what you're asking for is possible either way.
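The float cast is easy to reproduce outside the pipeline. A minimal sketch with made-up toy data (not the actual pipeline tables), showing what `pd.concat` does when one frame is missing the column:

```python
import numpy as np
import pandas as pd

# First "visit": no flags column was loaded.
df1 = pd.DataFrame({"diaSourceId": [1, 2]})
# Second "visit": flags loaded as int64.
df2 = pd.DataFrame({"diaSourceId": [3, 4],
                    "flags": np.array([0, 0], dtype=np.int64)})

# Concatenation fills the missing rows with NaN, which forces the
# whole column to float64.
combined = pd.concat([df1, df2])
print(combined["flags"].dtype)  # float64
```

This is pandas' standard "missing values promote integers to float" behavior, not something the pipeline code asks for.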
The `flags` column is always a float when it exists; it does not exist for the first visit+detector processed, but it does for the second, and it is a float then.
`diaForcedSources['flags']` is added when `self.diaCatalogLoader` returns >1 entries (shouldn't that be >0?) in `loaderResult.diaForcedSources` (thus, only when there are previous sources in this field to do forced photometry on, I believe). The `append` call in `doPackageAlerts` results in that field being converted to `float64` from the originally-loaded `int64`: that's definitely a bug, and @kfindeisen's changes on #162 won't fix it, either. This is, once again, the danger of using pandas.

We could conceivably have `diaForcedSource.run` always add a zero-filled `flags` column, so that pandas doesn't try to use its terribly stupid "null means convert numbers to float" approach.

I don't know what goes into the `DiaForcedSource['flags']` field, and apparently neither does the APDB baseline schema in `sdm_schemas`, since the docstring for all our `flags` columns is "bitfield, tbd", so I don't know how necessary this particular field is in this particular schema, or what it even means.
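For illustration, here's a minimal sketch of the zero-filled-column idea (the `ensure_flags` helper name and the toy frames are made up, not the actual `diaForcedSource.run` code): if every frame carries an int64 `flags` column before concatenation, pandas never introduces NaNs and never upcasts to float.

```python
import numpy as np
import pandas as pd

def ensure_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Add a zero-filled int64 flags column if it is absent."""
    if "flags" not in df.columns:
        df = df.assign(flags=np.zeros(len(df), dtype=np.int64))
    return df

frames = [
    pd.DataFrame({"diaSourceId": [1]}),  # no flags loaded
    pd.DataFrame({"diaSourceId": [2],
                  "flags": np.array([4], dtype=np.int64)}),
]
combined = pd.concat(ensure_flags(f) for f in frames)
print(combined["flags"].dtype)  # int64: no NaNs, so no float promotion
```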
```python
    storageClass="DataFrame",
    dimensions=("instrument", "visit", "detector"),
)
diaObjects = connTypes.Output(
```
Please investigate how to make the `validityStart` column behave in a more useful way. Ideally it would be updated to the given exposure's timestamp before this table is persisted. Right now that doesn't seem to be happening; instead, `validityStart` times are only pulled in via the APDB DiaObject history. This makes it next to impossible to construct an APDB-equivalent DiaObject table from this.
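As a sketch of what "update `validityStart` before persisting" could look like, here is a hypothetical helper (the function name and the plain-datetime stand-in for the exposure's visit time are assumptions, not the actual diaPipe API):

```python
from datetime import datetime, timezone

import pandas as pd

def stamp_validity_start(diaObjects: pd.DataFrame,
                         exposure_time: datetime) -> pd.DataFrame:
    """Overwrite validityStart with the exposure's timestamp so the
    persisted parquet table matches what the APDB would record."""
    out = diaObjects.copy()
    out["validityStart"] = pd.Timestamp(exposure_time)
    return out

diaObjects = pd.DataFrame({"diaObjectId": [1, 2]})
stamped = stamp_validity_start(
    diaObjects, datetime(2022, 9, 1, tzinfo=timezone.utc))
print(stamped["validityStart"].iloc[0])
```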
Can you give me a minimum working example on how to demonstrate this difference? I think that digging into how the diaObject table is created is well outside the scope of this ticket, but if it's an easy change to which thing I'm persisting, I can give it a try.
Here's a MWE showing the difference:

```python
#!/usr/bin/env python
# Script from Meredith to compare validityStart values between the APDB
# and the new parquet output.
import os
import sys

import pandas as pd
from IPython.display import display

import lsst.daf.butler as dafButler

# A hack to use the DM-34627 branch of analysis_ap, to be able to use apdbUtils.
sys.path.append('/sdf/data/rubin/u/mrawls/scipipe/analysis_ap/python/lsst/analysis/ap/')
import apdbUtils as pla

repo = '/sdf/data/rubin/u/parejko/cosmos-DM-36199/repo'
collections = 'ap_verify-output'
butler = dafButler.Butler(repo, collections=collections)
instrument = 'HSC'
skymap = 'hsc_rings_v1'
dataIds = [{'visit': 59150, 'detector': 50, 'instrument': 'HSC'},
           {'visit': 59160, 'detector': 51, 'instrument': 'HSC'}]

# Load the APDB DiaObject table.
objTableApdb = pla.loadAllApdbObjects(dbName=os.path.join(repo, '../association.db'), allCol=True)
objTableApdb.set_index('diaObjectId', inplace=True)

# Load the parquet DiaObject tables and concatenate them, which is
# necessary to make the desired "per-run" plots.
def loadTable(dataId, dataset):
    return butler.get(dataset, dataId=dataId)

objFrames = [loadTable(dataId, 'fakes_deepDiff_diaObject') for dataId in dataIds]
objTableParquet = pd.concat(objFrames)
objTableParquet.drop(columns='diaObjectId', inplace=True)

# Compare the two tables' validityStart and validityEnd columns.
display(objTableApdb.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])
display(objTableParquet.sort_index().sort_index(axis=1)[['nDiaSources', 'validityStart', 'validityEnd']])
```
Having dug into this a bit more, all of the validity information is computed inside dax_apdb, in the database itself. We don't have a facility right now to move that information back into the diaObject structure inside `diaPipe.py`, and I think figuring out how to do that is out of scope for this ticket. I'd advocate merging this as-is, so that we at least have some output written, and filing another ticket to have the updated values filled in inside the dax_apdb code, possibly optionally, since it might not be a fast operation given that it requires reading back from the APDB. The relevant method is `_storeDiaObjects`, but the implementations are quite different in `apdbSql.py` vs. `apdbCassandra.py`, and there's no abstractmethod in the base class.
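To illustrate the missing-abstractmethod point, here is a toy sketch of a base-class hook that both backends would be forced to implement (class and method signatures are hypothetical; the real dax_apdb interface differs):

```python
from abc import ABC, abstractmethod

import pandas as pd

class Apdb(ABC):
    """Toy stand-in for a dax_apdb base class."""

    @abstractmethod
    def _storeDiaObjects(self, objs: pd.DataFrame, visit_time) -> pd.DataFrame:
        """Store objs and return them with validityStart filled in."""

class SqlApdb(Apdb):
    """Toy SQL backend: stamps validityStart before returning."""

    def _storeDiaObjects(self, objs, visit_time):
        out = objs.copy()
        out["validityStart"] = visit_time
        return out

# The abstractmethod forces every backend to provide the hook, so
# callers can rely on validity info coming back with the objects.
stored = SqlApdb()._storeDiaObjects(pd.DataFrame({"diaObjectId": [1]}),
                                    "2022-09-01")
print(stored["validityStart"].iloc[0])
```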
Force-pushed from 3936d3b to 74f680c
After chatting with John and digging more into my two issues (`flags` for diaForcedSources and `validityStart` for diaObjects), it seems like merging this as-is will do no harm, and we should open a new ticket for following up.
It should be noted that naively concatenating parquet diaObject tables is not sufficient to reproduce an APDB DiaObject table, due to the lack of `validityStart` info. Whatever ticket is made to handle this should clearly be noted as blocking any analysis_tools-style plots of DiaObjects.
Since this seems to be the main discussion on this, note that we already have a to-do on Confluence for getting rid of the concatenation operations on performance grounds. That might affect how you want to approach the type and validity problems. |
Force-pushed from 74f680c to d41e3a2