New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add two new functions for matching and reindexing catalogs #553
Conversation
3d69486
to
c31a9bc
Compare
------- | ||
result: tuple of `lsst.afw.table.BaseCatalog` | ||
Both catalogs reordered such that all rows are | ||
matched to the same source. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this actually returns a list
of SourceMatch
instances.
Given that, I think you should document catalog1
and catalog2
as being specifically SourceCatalog
, unless you're sure this actually does work on other catalog types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I forgot to update the docs after I realized what pipe_analysis expected.
Parameters | ||
---------- | ||
catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog` | ||
The two catalogs to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen this style in numpydoc parameters before. Are you sure it's allowed?
Also, these aren't actually subclasses; they're instances (subclasses would be types). But (see below), I think this should just be SourceCatalog
anyway, because I'm skeptical the return type is compatible with any other catalog type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the subclasses, I'll update this.
What do you mean about the style for the numpydoc parameters? You mean listing two parameters on the same line? From https://numpydoc.readthedocs.io/en/latest/format.html :
When two or more input parameters have exactly the same type, shape and description, they can be combined:
x1, x2 : array_like
Input arrays, description of `x1`, `x2`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that our docs are not exactly numpydoc format as I found out, so you need to consult our style guide
Reindexed catalog. Records are shallow copies of those in ``catalog``. | ||
""" | ||
new = type(catalog)(catalog.schema) | ||
new.reserve(len(indices)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't want this reserve
line, because you're making shallow copies and hence don't need to allocate new memory (there will be memory allocated for the pointers, but I don't think you need to worry about allocating that up-front; std::vector
will be smart enough to do it plenty efficiently without any help).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the default is to make deep copies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I added the deep
argument to new.extend
because otherwise it returned non-contiguous catalog that I couldn't do anything with. But left it as a parameter in case it is useful to a user for speed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you constructing the catalogs and then accessing them through a numpy interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't even notice that you were actually providing the option of doing deep copies below (I just saw "shallow" in the return docs above).
If you want to optionally do deep copies, do this:
new = SourceCatalog(catalog.table.clone() if deep else catalog.table)
records = [catalog[int(ii)] for ii in indices]
new.extend(records, deep=deep)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and then also document the deep
parameter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@natelust yes. This was originally intended to be used by the matchCatalogsExact
function until I learned that pipe_analysis
requires a Match
object. But I still think it's useful to be able to go afwCatalog[indices]
and get back a table that is matched in some way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jim, does that need to be a list comp or could a generator exp work fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better for it to be a list or a tuple, because it'll be iterated over in pybind11, so considerations like being able to use PySequence_Fast
in the C API are probably more relevant than the usual pure-Python wisdom of preferring lazy iterators.
new : subclass of `lsst.afw.table.BaseCatalog` | ||
Reindexed catalog. Records are shallow copies of those in ``catalog``. | ||
""" | ||
new = type(catalog)(catalog.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not confident this is a reliable way to construct a new catalog that contains everything you might want from the previous one.
Or, rather, this one definitely doesn't, and it's not appropriate for shallow-copies - type(catalog)(catalog.table)
would be closer, but I think the safe thing to do is to just say SourceCatalog(catalog.table)
, which is definitely safe, and document this as something that doesn't work for other catalog types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a shame, but I'm fine with changing it to SourceCatalog
. If someone does need to reindex a different type of catalog at least this will provide a starting point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to @TallJimbo on reindexCatalog
, but I think matchCatalogsExact
needs to be made generic or moved to a different package.
|
||
Parameters | ||
---------- | ||
catalog1, catalog2: subclass of `lsst.afw.table.BaseCatalog` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need a space either side of the :
.
|
||
# Create the keys used to merge the catalogs | ||
parents1 = np.array(catalog1["parent"][sidx1]) | ||
peaks1 = np.array(catalog1["deblend_peakId"][sidx1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're using column names specific to certain processing, which I think is counter to the concept of afw as a low-level library. Can you make the column names to be function parameters? If not, I think this needs to get hoisted up to a more suitable package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this particular function is general in the sense that this is the only way to exactly match two source catalogs built from the same mergeDet
catalog. So it seems to fit in this module to me. I could make it more general and just allow the user to specify the keys to match on, but I'm not sure that would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that mergeDet
catalogs are generated in pipe_tasks. afw shouldn't know about details from pipe_tasks or require a specific catalog schema: it is a general-purpose toolkit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. The mergeDet
concept is defined only far downstream in pipe_tasks
, so I think I'm with Paul on this not really belonging in afw, but putting it in the same module or package as MergeDetectionsTask
seems more appropriate than trying to invent something more general/hypothetical for this to claim to do.
if patch1 is not None: | ||
if patch2 is None: | ||
msg = "If the catalogs are from different patches then patch1 and patch2 must be specified" | ||
msg += ", got {} and {}".format(patch1, patch2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think people generally prefer f
-strings to format
. It reads much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the people. I find format
statements much more readable and frequently use them in stack code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a separate comment though, dont use +=
for multiline strings do:
msg = ("this is a really long string "
"and this is the rest of it")
|
||
key1 = np.rec.array((parents1, peaks1, patch1, index1), | ||
dtype=[('parent', np.int64), ('peakId', np.int32), | ||
("patch", patch1.dtype), ("index", np.int32)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all important, but why do you mix single quotes and double quotes in the same code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Laziness. I must have copied that code from somewhere because I rarely use single quotes, then replaced the first two column names and added more columns myself with double quotes.
I closed this PR in order to move these functions, with the requested changes, to pipe_tasks. I'll mention the reviewers from this channel once I create the PR and wait until late Monday before I merge in order to give them a chance to review the code if they so desire. |
I lied. Sorry about the extra noise, but I realized that the |
c31a9bc
to
c14fb27
Compare
Changes pushed here, https://github.com/lsst/pipe_tasks/pull/441/files, and lsst-dm/pipe_analysis#66. |
Modified code from @PaulPrice and @TallJimbo to apply a numpy array to index an afw catalog in the same mannor one would with either astropy tables or pandas.
This PR adds the function
reIndexCatalog
adopted from code by @PaulPrice to apply a numpy array as an index to rows in an afw Catalog and thematchCatalogsExact
function to use the unique combination of(parentId, peakId, patch)
to exactly match two catalogs derived from the samemergeDet
catalog. The main use case is to test changes in deblending and measurement algorithms on the same exact set of sources.