Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add two new functions for matching and reindexing catalogs #553

Merged
merged 1 commit into from Jun 4, 2021

Conversation

fred3m
Copy link
Contributor

@fred3m fred3m commented Dec 9, 2020

This PR adds the function reIndexCatalog adopted from code by @PaulPrice to apply a numpy array as an index to rows in an afw Catalog and the matchCatalogsExact function to use the unique combination of (parentId, peakId, patch) to exactly match two catalogs derived from the same mergeDet catalog. The main use case is to test changes in deblending and measurement algorithms on the same exact set of sources.

-------
result: tuple of `lsst.afw.table.BaseCatalog`
Both catalogs reordered such that all rows are
matched to the same source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this actually returns a list of SourceMatch instances.

Given that, I think you should document catalog1 and catalog2 as being specifically SourceCatalog, unless you're sure this actually does work on other catalog types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I forgot to update the docs after I realized what pipe_analysis expected.

Parameters
----------
catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog`
The two catalogs to merge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen this style in numpydoc parameters before. Are you sure it's allowed?

Also, these aren't actually subclasses; they're instances (subclasses would be types). But (see below), I think this should just be SourceCatalog anyway, because I'm skeptical the return type is compatible with any other catalog type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the subclasses, I'll update this.

What do you mean about the style for the numpydoc parameters? You mean listing two parameters on the same line? From https://numpydoc.readthedocs.io/en/latest/format.html :

When two or more input parameters have exactly the same type, shape and description, they can be combined:

x1, x2 : array_like
    Input arrays, description of `x1`, `x2`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that our docs are not exactly numpydoc format as I found out, so you need to consult our style guide

Reindexed catalog. Records are shallow copies of those in ``catalog``.
"""
new = type(catalog)(catalog.schema)
new.reserve(len(indices))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't want this reserve line, because you're making shallow copies and hence don't need to allocate new memory (there will be memory allocated for the pointers, but I don't think you need to worry about allocating that up-front; std::vector will be smart enough to do it plenty efficiently without any help).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default is to make deep copies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I added the deep argument to new.extend because otherwise it returned non-contiguous catalog that I couldn't do anything with. But left it as a parameter in case it is useful to a user for speed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you constructing the catalogs and then accessing them through a numpy interface?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't even notice that you were actually providing the option of doing deep copies below (I just saw "shallow" in the return docs above).

If you want to optionally do deep copies, do this:

new = SourceCatalog(catalog.table.clone() if deep else catalog.table)
records = [catalog[int(ii)] for ii in indices]
new.extend(records, deep=deep)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and then also document the deep parameter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natelust yes. This was originally intended to be used by the matchCatalogsExact function until I learned that pipe_analysis requires a Match object. But I still think it's useful to be able to go afwCatalog[indices] and get back a table that is matched in some way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jim, does that need to be a list comp or could a generator exp work fine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better for it to be a list or a tuple, because it'll be iterated over in pybind11, so considerations like being able to use PySequence_Fast in the C API are probably more relevant than the usual pure-Python wisdom of preferring lazy iterators.

new : subclass of `lsst.afw.table.BaseCatalog`
Reindexed catalog. Records are shallow copies of those in ``catalog``.
"""
new = type(catalog)(catalog.schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not confident this is a reliable way to construct a new catalog that contains everything you might want from the previous one.

Or, rather, this one definitely doesn't, and it's not appropriate for shallow-copies - type(catalog)(catalog.table) would be closer, but I think the safe thing to do is to just say SourceCatalog(catalog.table), which is definitely safe, and document this as something that doesn't work for other catalog types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a shame, but I'm fine with changing it to SourceCatalog. If someone does need to reindex a different type of catalog at least this will provide a starting point.

Copy link
Contributor

@PaulPrice PaulPrice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @TallJimbo on reindexCatalog, but I think matchCatalogsExact needs to be made generic or moved to a different package.


Parameters
----------
catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a space either side of the :.


# Create the keys used to merge the catalogs
parents1 = np.array(catalog1["parent"][sidx1])
peaks1 = np.array(catalog1["deblend_peakId"][sidx1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using column names specific to certain processing, which I think is counter to the concept of afw as a low-level library. Can you make the column names to be function parameters? If not, I think this needs to get hoisted up to a more suitable package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this particular function is general in the sense that this is the only way to exactly match two source catalogs built from the same mergeDet catalog. So it seems to fit in this module to me. I could make it more general and just allow the user to specify the keys to match on, but I'm not sure that would be useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that mergeDet catalogs are generated in pipe_tasks. afw shouldn't know about details from pipe_tasks or require a specific catalog schema: it is a general-purpose toolkit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The mergeDet concept is defined only far downstream in pipe_tasks, so I think I'm with Paul on this not really belonging in afw, but putting it in the same module or package as MergeDetectionsTask seems more appropriate than trying to invent something more general/hypothetical for this to claim to do.

if patch1 is not None:
if patch2 is None:
msg = "If the catalogs are from different patches then patch1 and patch2 must be specified"
msg += ", got {} and {}".format(patch1, patch2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think people generally prefer f-strings to format. It reads much easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the people. I find format statements much more readable and frequently use them in stack code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a separate comment though, dont use += for multiline strings do:

msg = ("this is  a really long string "
            "and this is the rest of it")


key1 = np.rec.array((parents1, peaks1, patch1, index1),
dtype=[('parent', np.int64), ('peakId', np.int32),
("patch", patch1.dtype), ("index", np.int32)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all important, but why do you mix single quotes and double quotes in the same code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Laziness. I must have copied that code from somewhere because I rarely use single quotes, then replaced the first two column names and added more columns myself with double quotes.

@fred3m fred3m closed this Dec 11, 2020
@fred3m
Copy link
Contributor Author

fred3m commented Dec 11, 2020

I closed this PR in order to move these functions, with the requested changes, to pipe_tasks. I'll mention the reviewers from this channel once I create the PR and wait until late Monday before I merge in order to give them a chance to review the code if they so desire.

@fred3m fred3m reopened this Dec 11, 2020
@fred3m
Copy link
Contributor Author

fred3m commented Dec 11, 2020

I lied. Sorry about the extra noise, but I realized that the reindexCatalog function really does below in afw. So I'll push an update with the changes requested by Jim.

@fred3m
Copy link
Contributor Author

fred3m commented Dec 11, 2020

Modified code from @PaulPrice and @TallJimbo to apply a numpy array
to index an afw catalog in the same mannor one would with either
astropy tables or pandas.
@fred3m fred3m merged commit ade869a into master Jun 4, 2021
@fred3m fred3m deleted the tickets/DM-27800 branch June 4, 2021 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants