DM-18736: Convert ap_association to use Pandas data frames (rather than afw::table) as an interface #46

Merged

morriscb merged 4 commits into master from tickets/DM-18736

Jun 11, 2019

Contributor

morriscb commented Jun 5, 2019

No description provided.

morriscb added 3 commits

June 3, 2019 19:08

Convert AssociationTask to use pandas.

e4a0522

Complete initial pandas conversion in AssociationTask.

Implement first unittest.

Convert unittests.

89958e7

Debug first unittest.

Convert association test to pandas.

Fix data production method for pandas.

Modify test_update to use pandas.

Patch up AssociationTask for use in tests.

Partially convert final unittests.

Fix pandas warning and unittest.

Add pandas return option to MapApData.

Add default values for non-Nullable DiaObject columns.

Fixup initial DiaObject creation.

Fix pandas type bugs.

Simplify DiaObject locating.

ebellm requested changes

Contributor

ebellm left a comment

Looks good overall. Several suggestions to make more use of the dataframe indexing.

python/lsst/ap/association/association.py

               from lsst.daf.base import DateTime
               from lsst.meas.algorithms.indexerRegistry import IndexerRegistry
               import lsst.pex.config as pexConfig
               import lsst.pipe.base as pipeBase
               from .afwUtils import make_dia_object_schema
+              pandas.options.mode.chained_assignment = 'raise'

Contributor

ebellm Jun 7, 2019

I would leave a comment here explaining why we are setting this.

python/lsst/ap/association/association.py Outdated

-                  currentFluxMask = dia_sources.get("filterId") == filter_id
-                  fluxes = dia_sources.get("psFlux")[currentFluxMask]
-                  fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask]
+                  currentFluxMask = dia_sources["filterId"].array == filter_id

Contributor

ebellm Jun 7, 2019

Keep the dataframe indices around here: don't use array, but instead do:

currentFluxMask = dia_sources["filterId"] == filter_id
fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan)
fluxes = fluxes[currentFluxMask]
fluxErrors = dia_sources.loc[:, "psFluxErr"].replace([None], np.nan)
fluxErrors = fluxErrors[currentFluxMask]
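
A minimal sketch of why keeping the mask as a Series helps: the selected values stay labelled by the original index, so later steps can still align on it. The frame contents and the diaSourceId index labels below are invented for illustration, and the fluxes are assumed to already be floats with NaN for missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dia_sources; values and index labels are invented.
dia_sources = pd.DataFrame(
    {
        "filterId": [1, 2, 1, 1],
        "psFlux": [10.0, 20.0, np.nan, 40.0],
        "psFluxErr": [1.0, 2.0, 3.0, 4.0],
    },
    index=pd.Index([101, 102, 103, 104], name="diaSourceId"),
)
filter_id = 1

# Comparing the Series (not .array) gives a boolean Series on the same index.
currentFluxMask = dia_sources["filterId"] == filter_id

# Boolean indexing with that Series keeps the original row labels.
fluxes = dia_sources.loc[:, "psFlux"][currentFluxMask]
fluxErrors = dia_sources.loc[:, "psFluxErr"][currentFluxMask]

print(list(fluxes.index))
```

With .array the mask becomes a bare array, and the link between each value and its source row is only positional.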

python/lsst/ap/association/association.py Outdated

                   noNanMask = np.logical_and(np.isfinite(fluxes), np.isfinite(fluxErrors))
                   fluxes = fluxes[noNanMask]
                   fluxErrors = fluxErrors[noNanMask]
-                  midpointTais = dia_sources.get("midPointTai")[currentFluxMask][noNanMask]
+                  midpointTais = dia_sources["midPointTai"].array[currentFluxMask][noNanMask]

Contributor

ebellm Jun 7, 2019

midpointTais = dia_sources.loc[currentFluxMask & noNanMask, "midPointTai"]
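
A sketch of the single .loc selection suggested here, on invented data. It assumes both masks are computed over the full frame so that they share its index; in the PR code noNanMask is derived from the already-filtered fluxes, so the combination would rely on pandas index alignment instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dia_sources; values and index labels are invented.
dia_sources = pd.DataFrame(
    {
        "filterId": [1, 2, 1, 1],
        "psFlux": [10.0, 20.0, np.nan, 40.0],
        "psFluxErr": [1.0, 2.0, 3.0, 4.0],
        "midPointTai": [58000.1, 58000.2, 58000.3, 58000.4],
    },
    index=pd.Index([101, 102, 103, 104], name="diaSourceId"),
)
filter_id = 1

# Both masks are boolean Series on the full dia_sources index.
currentFluxMask = dia_sources["filterId"] == filter_id
noNanMask = np.isfinite(dia_sources["psFlux"]) & np.isfinite(dia_sources["psFluxErr"])

# One label-based selection: rows from the combined mask, one named column.
midpointTais = dia_sources.loc[currentFluxMask & noNanMask, "midPointTai"]
print(midpointTais)
```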

python/lsst/ap/association/association.py Outdated

-                  fluxes = dia_sources.get("psFlux")[currentFluxMask]
-                  fluxErrors = dia_sources.get("psFluxErr")[currentFluxMask]
+                  currentFluxMask = dia_sources["filterId"].array == filter_id
+                  fluxes = dia_sources.loc[:, "psFlux"].replace([None], np.nan)

Contributor

ebellm Jun 7, 2019

Also, you might check whether dia_sources.loc[:, "psFlux"].fillna(np.nan) works here and throughout.

Contributor Author

morriscb Jun 7, 2019

Looking at the docs, fillna is designed to fill in NA/NaN values, so using it to fill in NaN values seems a bit odd to me.
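
For context, pandas counts None as missing alongside NaN, which is why fillna would act on it at all. A small sketch with invented values; whether the real psFlux column behaves the same way depends on its dtype and is an assumption to check.

```python
import numpy as np
import pandas as pd

# Object-dtype column holding a real value, a None, and a NaN (invented data).
raw = pd.Series([1.5, None, np.nan], dtype=object)

# isna() flags both None and NaN as missing.
missing = raw.isna()

# fillna(np.nan) therefore touches the None entry too; casting to float
# afterwards gives a numeric column that np.isfinite can handle.
filled = raw.fillna(np.nan).astype(float)
finite = np.isfinite(filled)

print(missing.tolist(), finite.tolist())
```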

python/lsst/ap/association/association.py Outdated

-                  totFluxes = dia_sources.get("totFlux")[currentFluxMask]
-                  totFluxErrors = dia_sources.get("totFluxErr")[currentFluxMask]
+                  totFluxes = dia_sources.loc[:, "totFlux"].replace([None], np.nan)
+                  totFluxes = totFluxes.array[currentFluxMask]

Contributor

ebellm Jun 7, 2019

Again, remove the array.

python/lsst/ap/association/association.py Outdated

-                      covering_dia_objects = ppdb.getDiaObjects(index_ranges)
+                      covering_dia_objects = ppdb.getDiaObjects(index_ranges,
+                                                                return_pandas=True)
+                      ccd_mask = np.zeros(len(covering_dia_objects), dtype=bool)

Contributor

ebellm Jun 7, 2019

ccd_mask = pd.Series(False, index=covering_dia_objects.index)

python/lsst/ap/association/association.py Outdated

-                      output_dia_objects = afwTable.SourceCatalog(
-                          covering_dia_objects.getSchema())
-                      for cov_dia_object in covering_dia_objects:
+                      for idx, (df_index, cov_dia_object) in enumerate(covering_dia_objects.iterrows()):

Contributor

ebellm Jun 7, 2019

With ccd_mask defined as above, you don't need the enumerate idx.

python/lsst/ap/association/association.py Outdated

                           if self._check_dia_object_position(cov_dia_object, bbox, wcs):
-                              output_dia_objects.append(cov_dia_object)
+                              ccd_mask[idx] = True

Contributor

ebellm Jun 7, 2019

ccd_mask[df_index] = True

Contributor Author

morriscb Jun 7, 2019

Also, shouldn't this be ccd_mask.loc[idx] = True to avoid the error from before?
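
Putting the suggestions in this thread together, a sketch of the index-labelled mask, using .loc with the row label for the assignment. The data and the in_region helper are hypothetical; in_region stands in for self._check_dia_object_position, which needs a bbox and wcs not available here.

```python
import pandas as pd

# Toy stand-in for covering_dia_objects; values and labels are invented.
covering_dia_objects = pd.DataFrame(
    {"ra": [10.0, 10.5, 50.0], "decl": [-5.0, -5.2, 20.0]},
    index=pd.Index([7001, 7002, 7003], name="diaObjectId"),
)


def in_region(row):
    """Hypothetical position check, not the real bbox/wcs test."""
    return 9.0 <= row["ra"] <= 11.0 and -6.0 <= row["decl"] <= -4.0


# Scalar False broadcast over the frame's own index, so no enumerate counter.
ccd_mask = pd.Series(False, index=covering_dia_objects.index)
for df_index, cov_dia_object in covering_dia_objects.iterrows():
    if in_region(cov_dia_object):
        # Label-based assignment with the label that iterrows() yields.
        ccd_mask.loc[df_index] = True

output_dia_objects = covering_dia_objects[ccd_mask]
print(list(output_dia_objects.index))
```

Because the mask carries the same index as the frame, it can be used directly for boolean selection afterwards.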

python/lsst/ap/association/association.py Outdated

                       filter_name = exposure.getFilter().getName()
                       filter_id = exposure.getFilter().getId()
+                      updated_obj_ids.sort()
+                      dia_object_used = pandas.DataFrame(
+                          np.zeros(len(dia_objects), dtype=bool),

Contributor

ebellm Jun 7, 2019

You can replace the vector np.zeros with False (scalar).
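
A short sketch of the scalar form, on invented data. The column name "used" is hypothetical since the diff is cut off before the columns argument. Note that pandas requires both index and columns when the data is a scalar.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dia_objects; labels are invented.
dia_objects = pd.DataFrame(
    {"ra": [10.0, 10.5, 50.0]},
    index=pd.Index([7001, 7002, 7003], name="diaObjectId"),
)

# Vector form, as in the diff under review.
from_zeros = pd.DataFrame(
    np.zeros(len(dia_objects), dtype=bool),
    index=dia_objects.index,
    columns=["used"],  # hypothetical column name
)

# Scalar form: False is broadcast to every cell.
from_scalar = pd.DataFrame(False, index=dia_objects.index, columns=["used"])

print(from_scalar.equals(from_zeros))
```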

python/lsst/ap/association/association.py

+                                        "nearbyObj1": 0,
+                                        "nearbyObj2": 0,
+                                        "nearbyObj3": 0}
+                      for f in ["u", "g", "r", "i", "z", "y"]:

Contributor

ebellm Jun 7, 2019

Are filter names configurable? Should we be getting the allowed values from somewhere?

Contributor Author

morriscb Jun 7, 2019

Filter names are not currently configurable, as the columns in the Ppdb are fixed elsewhere. I think the best place to do this would be somewhere in the Ppdb, so that the columns of the DB match up with those here.

ebellm self-requested a review

June 11, 2019 18:20

ebellm approved these changes

Simplify pandas.DataFrame indexing and usage.

0e0c624

Use diaObjectId and diaSourceId for DataFrame indexing.

Enforce all columns when creating DiaObject DataFrame

DataFrame columns not specified using to_sql are written as "0.0"
rather than NaN/Null. Creating all columns prevents this.

Clean up ap_association docs.

Clean up mapApData docs.

Fix flaking.

Add comment explaining pandas setting.

Keep dataframes in flux computation.

Simplify pandas usage in retrieve_dia_objects.

Simplify pandas usage in update_dia_objects.

Simplify pandas indexing in set_flux
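
One way to read the "Enforce all columns" change in the commit above: give the frame every expected column before writing it, so that unset ones are explicit NaN. This is a sketch using reindex, not necessarily how the commit does it, and the column list is a hypothetical subset of the DiaObject schema.

```python
import pandas as pd

# Hypothetical subset of the DiaObject columns.
all_columns = ["ra", "decl", "gPSFluxMean", "rPSFluxMean"]

# A newly created object with only its position filled in (invented values).
partial = pd.DataFrame(
    {"ra": [10.0], "decl": [-5.0]},
    index=pd.Index([7001], name="diaObjectId"),
)

# reindex adds the missing columns and fills them with NaN.
full = partial.reindex(columns=all_columns)
print(full)
```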

morriscb force-pushed the tickets/DM-18736 branch from 8d25d4b to 0e0c624 Compare

June 11, 2019 18:26

morriscb merged commit 0e0c624 into master