Add two new functions for matching and reindexing catalogs #553

fred3m · 2020-12-09T16:09:04Z

This PR adds the function reIndexCatalog adopted from code by @PaulPrice to apply a numpy array as an index to rows in an afw Catalog and the matchCatalogsExact function to use the unique combination of (parentId, peakId, patch) to exactly match two catalogs derived from the same mergeDet catalog. The main use case is to test changes in deblending and measurement algorithms on the same exact set of sources.

TallJimbo · 2020-12-09T16:30:02Z

python/lsst/afw/table/catalogMatches.py

+    -------
+    result: tuple of `lsst.afw.table.BaseCatalog`
+        Both catalogs reordered such that all rows are
+        matched to the same source.


Looks like this actually returns a list of SourceMatch instances.

Given that, I think you should document catalog1 and catalog2 as being specifically SourceCatalog, unless you're sure this actually does work on other catalog types.

Yup, I forgot to update the docs after I realized what pipe_analysis expected.

TallJimbo · 2020-12-09T16:31:05Z

python/lsst/afw/table/catalogMatches.py

+    Parameters
+    ----------
+    catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog`
+        The two catalogs to merge


I haven't seen this style in numpydoc parameters before. Are you sure it's allowed?

Also, these aren't actually subclasses; they're instances (subclasses would be types). But (see below), I think this should just be SourceCatalog anyway, because I'm skeptical the return type is compatible with any other catalog type.

Good point on the subclasses, I'll update this.

What do you mean about the style for the numpydoc parameters? You mean listing two parameters on the same line? From https://numpydoc.readthedocs.io/en/latest/format.html :

When two or more input parameters have exactly the same type, shape and description, they can be combined:

x1, x2 : array_like Input arrays, description of `x1`, `x2`.

Note that our docs are not exactly numpydoc format as I found out, so you need to consult our style guide

TallJimbo · 2020-12-09T16:33:07Z

python/lsst/afw/table/catalogMatches.py

+        Reindexed catalog. Records are shallow copies of those in ``catalog``.
+    """
+    new = type(catalog)(catalog.schema)
+    new.reserve(len(indices))


Probably don't want this reserve line, because you're making shallow copies and hence don't need to allocate new memory (there will be memory allocated for the pointers, but I don't think you need to worry about allocating that up-front; std::vector will be smart enough to do it plenty efficiently without any help).

I think the default is to make deep copies

Yeah, I added the deep argument to new.extend because otherwise it returned non-contiguous catalog that I couldn't do anything with. But left it as a parameter in case it is useful to a user for speed.

Are you constructing the catalogs and then accessing them through a numpy interface?

Ah, I didn't even notice that you were actually providing the option of doing deep copies below (I just saw "shallow" in the return docs above).

If you want to optionally do deep copies, do this:

new = SourceCatalog(catalog.table.clone() if deep else catalog.table) records = [catalog[int(ii)] for ii in indices] new.extend(records, deep=deep)

(and then also document the deep parameter)

@natelust yes. This was originally intended to be used by the matchCatalogsExact function until I learned that pipe_analysis requires a Match object. But I still think it's useful to be able to go afwCatalog[indices] and get back a table that is matched in some way.

Jim, does that need to be a list comp or could a generator exp work fine?

Probably better for it to be a list or a tuple, because it'll be iterated over in pybind11, so considerations like being able to use PySequence_Fast in the C API are probably more relevant than the usual pure-Python wisdom of preferring lazy iterators.

TallJimbo · 2020-12-09T16:35:32Z

python/lsst/afw/table/catalogMatches.py

+    new : subclass of `lsst.afw.table.BaseCatalog`
+        Reindexed catalog. Records are shallow copies of those in ``catalog``.
+    """
+    new = type(catalog)(catalog.schema)


I am not confident this is a reliable way to construct a new catalog that contains everything you might want from the previous one.

Or, rather, this one definitely doesn't, and it's not appropriate for shallow-copies - type(catalog)(catalog.table) would be closer, but I think the safe thing to do is to just say SourceCatalog(catalog.table), which is definitely safe, and document this as something that doesn't work for other catalog types.

That's a shame, but I'm fine with changing it to SourceCatalog. If someone does need to reindex a different type of catalog at least this will provide a starting point.

PaulPrice

I'll defer to @TallJimbo on reindexCatalog, but I think matchCatalogsExact needs to be made generic or moved to a different package.

PaulPrice · 2020-12-09T17:04:49Z

python/lsst/afw/table/catalogMatches.py

+
+    Parameters
+    ----------
+    catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog`


I think you need a space either side of the :.

PaulPrice · 2020-12-09T17:07:03Z

python/lsst/afw/table/catalogMatches.py

+
+    # Create the keys used to merge the catalogs
+    parents1 = np.array(catalog1["parent"][sidx1])
+    peaks1 = np.array(catalog1["deblend_peakId"][sidx1])


You're using column names specific to certain processing, which I think is counter to the concept of afw as a low-level library. Can you make the column names to be function parameters? If not, I think this needs to get hoisted up to a more suitable package.

I think that this particular function is general in the sense that this is the only way to exactly match two source catalogs built from the same mergeDet catalog. So it seems to fit in this module to me. I could make it more general and just allow the user to specify the keys to match on, but I'm not sure that would be useful.

The point is that mergeDet catalogs are generated in pipe_tasks. afw shouldn't know about details from pipe_tasks or require a specific catalog schema: it is a general-purpose toolkit.

Hmm. The mergeDet concept is defined only far downstream in pipe_tasks, so I think I'm with Paul on this not really belonging in afw, but putting it in the same module or package as MergeDetectionsTask seems more appropriate than trying to invent something more general/hypothetical for this to claim to do.

PaulPrice · 2020-12-09T17:08:02Z

python/lsst/afw/table/catalogMatches.py

+    if patch1 is not None:
+        if patch2 is None:
+            msg = "If the catalogs are from different patches then patch1 and patch2 must be specified"
+            msg += ", got {} and {}".format(patch1, patch2)


I think people generally prefer f-strings to format. It reads much easier.

It depends on the people. I find format statements much more readable and frequently use them in stack code.

a separate comment though, dont use += for multiline strings do:

msg = ("this is a really long string " "and this is the rest of it")

PaulPrice · 2020-12-09T17:10:01Z

python/lsst/afw/table/catalogMatches.py

+
+        key1 = np.rec.array((parents1, peaks1, patch1, index1),
+                            dtype=[('parent', np.int64), ('peakId', np.int32),
+                                   ("patch", patch1.dtype), ("index", np.int32)])


Not at all important, but why do you mix single quotes and double quotes in the same code?

Laziness. I must have copied that code from somewhere because I rarely use single quotes, then replaced the first two column names and added more columns myself with double quotes.

fred3m · 2020-12-11T16:35:27Z

I closed this PR in order to move these functions, with the requested changes, to pipe_tasks. I'll mention the reviewers from this channel once I create the PR and wait until late Monday before I merge in order to give them a chance to review the code if they so desire.

fred3m · 2020-12-11T17:04:46Z

I lied. Sorry about the extra noise, but I realized that the reindexCatalog function really does below in afw. So I'll push an update with the changes requested by Jim.

fred3m · 2020-12-11T18:52:14Z

Changes pushed here, https://github.com/lsst/pipe_tasks/pull/441/files, and lsst-dm/pipe_analysis#66.

@PaulPrice

Modified code from @PaulPrice and @TallJimbo to apply a numpy array to index an afw catalog in the same mannor one would with either astropy tables or pandas.

fred3m force-pushed the tickets/DM-27800 branch from 3d69486 to c31a9bc Compare December 9, 2020 16:12

TallJimbo requested changes Dec 9, 2020

View reviewed changes

PaulPrice requested changes Dec 9, 2020

View reviewed changes

fred3m closed this Dec 11, 2020

fred3m reopened this Dec 11, 2020

fred3m force-pushed the tickets/DM-27800 branch from c31a9bc to c14fb27 Compare December 11, 2020 18:49

Create reindexCatalog to use an array to index a SourceCatalog

b2cca1e

Modified code from @PaulPrice and @TallJimbo to apply a numpy array to index an afw catalog in the same mannor one would with either astropy tables or pandas.

fred3m force-pushed the tickets/DM-27800 branch from c14fb27 to b2cca1e Compare June 3, 2021 13:30

fred3m merged commit ade869a into master Jun 4, 2021

fred3m deleted the tickets/DM-27800 branch June 4, 2021 17:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add two new functions for matching and reindexing catalogs #553

Add two new functions for matching and reindexing catalogs #553

fred3m commented Dec 9, 2020

TallJimbo Dec 9, 2020

fred3m Dec 9, 2020

TallJimbo Dec 9, 2020

fred3m Dec 9, 2020

natelust Dec 9, 2020

TallJimbo Dec 9, 2020

natelust Dec 9, 2020

fred3m Dec 9, 2020

natelust Dec 9, 2020

TallJimbo Dec 9, 2020

TallJimbo Dec 9, 2020

fred3m Dec 9, 2020

natelust Dec 9, 2020

TallJimbo Dec 9, 2020

TallJimbo Dec 9, 2020

fred3m Dec 9, 2020

PaulPrice left a comment

PaulPrice Dec 9, 2020

PaulPrice Dec 9, 2020

fred3m Dec 9, 2020

PaulPrice Dec 9, 2020

TallJimbo Dec 9, 2020

PaulPrice Dec 9, 2020

fred3m Dec 9, 2020

natelust Dec 9, 2020

PaulPrice Dec 9, 2020

fred3m Dec 9, 2020

fred3m commented Dec 11, 2020

fred3m commented Dec 11, 2020

fred3m commented Dec 11, 2020

Add two new functions for matching and reindexing catalogs #553

Add two new functions for matching and reindexing catalogs #553

Conversation

fred3m commented Dec 9, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PaulPrice left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fred3m commented Dec 11, 2020

fred3m commented Dec 11, 2020

fred3m commented Dec 11, 2020