Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-18736: Convert ap_association to use Pandas data frames (rather than afw::table) as an interface #46

Merged
merged 4 commits into from Jun 11, 2019

Conversation

morriscb
Copy link
Contributor

@morriscb morriscb commented Jun 5, 2019

No description provided.

Complete initial pandas conversion in AssociationTask.

Implement first unittest.
Debug first unittest.

Convert association test to pandas.

Fix data production method for pandas.

Modify test_update to use pandas.

Patch up AssociationTask for use in tests.

Partially convert final unittests.

Fix pandas warning and unittest.
Add default values for non-Nullable DiaObject columns.

Fixup initial DiaObject creation.

Fix pandas type bugs.

Simplify DiaObject locating.
Copy link
Contributor

@ebellm ebellm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Several suggestions to make more use of the dataframe indexing.

from lsst.daf.base import DateTime
from lsst.meas.algorithms.indexerRegistry import IndexerRegistry
import lsst.pex.config as pexConfig
import lsst.pipe.base as pipeBase

from .afwUtils import make_dia_object_schema

pandas.options.mode.chained_assignment = 'raise'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave a comment here explaining why we are setting this.

currentFluxMask = dia_sources.get("filterId") == filter_id
fluxes = dia_sources.get("psFlux")[currentFluxMask]
fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask]
currentFluxMask = dia_sources["filterId"].array == filter_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the dataframe indices around here: don't use array, but instead do:

currentFluxMask = dia_sources["filterId"] == filter_id
fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan)
fluxes = fluxes[currentFluxMask]
fluxErrors = dia_sources.loc[:, "psFluxErr"].replace([None], np.nan)
fluxErrors = fluxErrors[currentFluxMask]


noNanMask = np.logical_and(np.isfinite(fluxes), np.isfinite(fluxErrors))
fluxes = fluxes[noNanMask]
fluxErrors = fluxErrors[noNanMask]
midpointTais = dia_sources.get("midPointTai")[currentFluxMask][noNanMask]
midpointTais = dia_sources["midPointTai"].array[currentFluxMask][noNanMask]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

midpointTais = dia_sources.loc[currentFluxMask & noNanMask, "midPointTai"]

fluxes = dia_sources.get("psFlux")[currentFluxMask]
fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask]
currentFluxMask = dia_sources["filterId"].array == filter_id
fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you might try if dia_sources.loc[:, "psFlux"].fillna(np.nan) works here and throughout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the docs, fillna is designed to fill in NA/NaN values so it seems a bit weird to use it to fill NaN values to me.

totFluxes = dia_sources.get("totFlux")[currentFluxMask]
totFluxErrors = dia_sources.get("totFluxErr")[currentFluxMask]
totFluxes = dia_sources.loc[:, "totFlux"].replace([None], np.nan)
totFluxes = totFluxes.array[currentFluxMask]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, remove the array

covering_dia_objects = ppdb.getDiaObjects(index_ranges)
covering_dia_objects = ppdb.getDiaObjects(index_ranges,
return_pandas=True)
ccd_mask = np.zeros(len(covering_dia_objects), dtype=bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ccd_mask = pd.Series(False,index=covering_dia_objects.index)

output_dia_objects = afwTable.SourceCatalog(
covering_dia_objects.getSchema())
for cov_dia_object in covering_dia_objects:
for idx, (df_index, cov_dia_object) in enumerate(covering_dia_objects.iterrows()):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with ccd_mask defined as above, don't need the enumerate idx.

if self._check_dia_object_position(cov_dia_object, bbox, wcs):
output_dia_objects.append(cov_dia_object)
ccd_mask[idx] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ccd_mask[df_index] = True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also shouldn't this be ccd_mask.loc[idx] = True to avoid the error from before?

filter_name = exposure.getFilter().getName()
filter_id = exposure.getFilter().getId()

updated_obj_ids.sort()
dia_object_used = pandas.DataFrame(
np.zeros(len(dia_objects), dtype=bool),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace the vector np.zeros with False (scalar).

"nearbyObj1": 0,
"nearbyObj2": 0,
"nearbyObj3": 0}
for f in ["u", "g", "r", "i", "z", "y"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are filter names configurable? Should we be getting the allowed values from somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter names are not currently configurable as the columns in the Ppdb are fixed elsewhere. The best place to do this I think would be somewhere in the ppdb so that the columns of the DB match up with those here.

@ebellm ebellm self-requested a review June 11, 2019 18:20
Use diaObjectId and diaSourceId for DataFrame indexing.

Enforce all columns when creating DiaObject DataFrame

DataFrame columns not specified using to_sql are written as "0.0"
rather than NaN/Null. Creating all columns prevents this.

Clean up ap_association docs.

Clean up mapApData docs.

Fix flaking.

Add comment explaining pandas setting.

Keep dataframes in flux computation.

Simplify pandas usage in retrieve_dia_objects.

Simplify pandas usage in update_dia_objects.

Simplify pandas indexing in set_flux
@morriscb morriscb merged commit 0e0c624 into master Jun 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants