New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-18736: Convert ap_association to use Pandas data frames (rather than afw::table) as an interface #46
Conversation
Complete initial pandas conversion in AssociationTask. Implement first unittest.
Debug first unittest. Convert association test to pandas. Fix data production method for pandas. Modify test_update to use pandas. Patch up AssociationTask for use in tests. Partially convert final unittests. Fix pandas warning and unittest.
Add default values for non-Nullable DiaObject columns. Fixup initial DiaObject creation. Fix pandas type bugs. Simplify DiaObject locating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Several suggestions to make more use of the dataframe indexing.
from lsst.daf.base import DateTime | ||
from lsst.meas.algorithms.indexerRegistry import IndexerRegistry | ||
import lsst.pex.config as pexConfig | ||
import lsst.pipe.base as pipeBase | ||
|
||
from .afwUtils import make_dia_object_schema | ||
|
||
pandas.options.mode.chained_assignment = 'raise' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave a comment here explaining why we are setting this.
currentFluxMask = dia_sources.get("filterId") == filter_id | ||
fluxes = dia_sources.get("psFlux")[currentFluxMask] | ||
fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask] | ||
currentFluxMask = dia_sources["filterId"].array == filter_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the dataframe indices around here: don't use array
, but instead do:
currentFluxMask = dia_sources["filterId"] == filter_id
fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan)
fluxes = fluxes[currentFluxMask]
fluxErrors = dia_sources.loc[:, "psFluxErr"].replace([None], np.nan)
fluxErrors = fluxErrors[currentFluxMask]
|
||
noNanMask = np.logical_and(np.isfinite(fluxes), np.isfinite(fluxErrors)) | ||
fluxes = fluxes[noNanMask] | ||
fluxErrors = fluxErrors[noNanMask] | ||
midpointTais = dia_sources.get("midPointTai")[currentFluxMask][noNanMask] | ||
midpointTais = dia_sources["midPointTai"].array[currentFluxMask][noNanMask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
midpointTais = dia_sources.loc[currentFluxMask & noNanMask, "midPointTai"]
fluxes = dia_sources.get("psFlux")[currentFluxMask] | ||
fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask] | ||
currentFluxMask = dia_sources["filterId"].array == filter_id | ||
fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, you might try if dia_sources.loc[:, "psFlux"].fillna(np.nan)
works here and throughout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the docs, fillna
is designed to fill in NA/NaN values so it seems a bit weird to use it to fill NaN values to me.
totFluxes = dia_sources.get("totFlux")[currentFluxMask] | ||
totFluxErrors = dia_sources.get("totFluxErr")[currentFluxMask] | ||
totFluxes = dia_sources.loc[:, "totFlux"].replace([None], np.nan) | ||
totFluxes = totFluxes.array[currentFluxMask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, remove the array
covering_dia_objects = ppdb.getDiaObjects(index_ranges) | ||
covering_dia_objects = ppdb.getDiaObjects(index_ranges, | ||
return_pandas=True) | ||
ccd_mask = np.zeros(len(covering_dia_objects), dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccd_mask = pd.Series(False,index=covering_dia_objects.index)
output_dia_objects = afwTable.SourceCatalog( | ||
covering_dia_objects.getSchema()) | ||
for cov_dia_object in covering_dia_objects: | ||
for idx, (df_index, cov_dia_object) in enumerate(covering_dia_objects.iterrows()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with ccd_mask
defined as above, don't need the enumerate
idx
.
if self._check_dia_object_position(cov_dia_object, bbox, wcs): | ||
output_dia_objects.append(cov_dia_object) | ||
ccd_mask[idx] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccd_mask[df_index] = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also shouldn't this be ccd_mask.loc[idx] = True
to avoid the error from before?
filter_name = exposure.getFilter().getName() | ||
filter_id = exposure.getFilter().getId() | ||
|
||
updated_obj_ids.sort() | ||
dia_object_used = pandas.DataFrame( | ||
np.zeros(len(dia_objects), dtype=bool), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can replace the vector np.zeros
with False
(scalar).
"nearbyObj1": 0, | ||
"nearbyObj2": 0, | ||
"nearbyObj3": 0} | ||
for f in ["u", "g", "r", "i", "z", "y"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are filter names configurable? Should we be getting the allowed values from somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter names are not currently configurable as the columns in the Ppdb are fixed elsewhere. The best place to do this I think would be somewhere in the ppdb so that the columns of the DB match up with those here.
Use diaObjectId and diaSourceId for DataFrame indexing. Enforce all columns when creating DiaObject DataFrame DataFrame columns not specified using to_sql are written as "0.0" rather than NaN/Null. Creating all columns prevents this. Clean up ap_association docs. Clean up mapApData docs. Fix flaking. Add comment explaining pandas setting. Keep dataframes in flux computation. Simplify pandas usage in retrieve_dia_objects. Simplify pandas usage in update_dia_objects. Simplify pandas indexing in set_flux
8d25d4b
to
0e0c624
Compare
No description provided.