New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-23728: Cleanup ci_hsc_gen2 to use new convert script instead of custom one #213
Conversation
ce232a4
to
358f8ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay. You can ignore my comment if you wish.
@@ -156,6 +158,9 @@ def convert(gen2root, gen3root, instrumentClass, calibFilterType, | |||
calibs : `str`, optional | |||
Path to the gen2 calibration repository to be converted. | |||
If a relative path, it is assumed to be relative to ``gen2root``. | |||
reruns : `list` [`str`], optional | |||
List of reruns to convert. They will be placed in the | |||
``instrumentClass.getName()`` collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is it shared/instrument/converted
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each rerun should go into a different collection, not the same one, and their names should be derived from the Gen2 repo names somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TallJimbo Is there a way we can codify that with some unittests in obs_base, before rolling it out to ci_hsc? If we want to enforce it, we should do so at the deepest level, probably on its own ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem will become apparent as soon as reruns contain the same output datasets. You can only have one dataset with a given dataId per collection and reruns tend to have many duplicate datasets by definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the script generates the chainName and runName passed to the Rerun from the path, unless explicitly overridden, that should cover it pretty well.
Tests in obs_base are of course possible, but making a useful mock repo to convert is probably at least a week of work on it's own, and with that ratio I'm willing to live with just coverage in ci_hsc.
85efb9c
to
73534ea
Compare
@@ -176,6 +181,9 @@ def convert(gen2root, gen3root, instrumentClass, calibFilterType, | |||
|
|||
configure_translators(instrument, calibFilterType, convertRepoConfig.ccdKey) | |||
|
|||
rerunsArg = [Rerun(rerun, runName=f"shared/{instrument.getName()}/converted", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should the runName be something like rerun/{instrument.getName()}{rerun}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please do this as part of another ticket? If we want to enforce this properly, we should do it with actual unittests in obs_base. The break/test/fix cycle with ci_hsc is way too long and cumbersome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the script already let you pass in the Rerun arguments on the command line? It eventually needs to, even if that isn't the default someday, and then we can just use that to hard-code reasonable values (like the same ones we had before? Why are we changing them on this ticket anyhow?) in ci_hsc_gen2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous to this ticket, the convert script did nothing with rerun (none of my testdata had reruns). ci_hsc has one rerun, and I added a commandline option (--reruns=[]
) to the script.
Being smarter about the reruns I think warrants some specific unittests (probably built off of a list of gen2 reruns pulled from lsst-dev:/datasets
), to ensure it is handling special characters correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to get done with this ticket and move on to something else, I'm fine with deferring that testing to another ticket. I'm not really okay with changing the behavior of this conversion (in terms of the collections it writes) because the command-line script doesn't yet have the flexibility that this and other non-unit-test cases need; that would make this ticket a regression, not a cleanup. If you need to be done with it and don't want to defer the testing to another ticket, I think we just put this ticket on hold until someone else can work on the general-purpose script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting a little confused now. This ticket adds a --reruns
command line option, which is what we want, and I thought the debate we were having was solely over what collection those reruns should go into. We know the collection has to be different for each rerun and I don't understand why we can't do something like I suggested at the top of this thread. I don't understand why choosing a rerun name has to be blocked on unit tests since we know that the script will fail as currently written if there are multiple reruns because by definition we know that the butler won't let identical dataIds be used in the same run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TallJimbo: I'm confused again: the existing ci_hsc_gen2 convert code uses runName="shared/ci_hsc/direct", chainName="shared/ci_hsc"
, neither of which follow your recommendations in RFC-663, which doesn't say anything about shared/
. So the name of the output collection written by this conversion is already going to change to shared/HSC/something
.
Are there any daf_butler restrictions on the characters in collection names? I know that dataset names have a rather hidden restriction based on VALID_NAME_REGEX
(which I keep having to root around to find because I never remember what it is called). Does that also apply to collection names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the RFC doesn't try to specify what to do when usernames and ticket numbers aren't relevant, but I think "shared" instead of username and "ci_hsc" instead of a ticket number is a reasonable extrapolation, and it's what the name has always been. I'm not super attached to it, but I don't want it to flop around every time a new developer has a different idea on how to extrapolate the convention. Either we should leave it the same and avoid churn or extend the convention, and the latter should be discussed more broadly than a PR.
In any case, I hadn't intended for output collection names to include the instrument name. It's an interesting idea that may be worth revisiting, but it's not what was RFC'd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's what the name has always been
: that's only true for ci_hsc (it is undefined anywhere else), and it's currently "shared/ci_hsc/direct"
on ci_hsc_gen2 master.
Are you sure we don't have different instruments with the same rerun names right now? I certainly wouldn't bet on that, which is why I think including instrument in the converted collection is important.
Remember: everything we're discussing here only applies to converting gen2 repos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we don't have different instruments with the same rerun names right now? I certainly wouldn't bet on that, which is why I think including instrument in the converted collection is important.
That's a good point. I wouldn't be surprised either way, and being safe by including the instrument name in converted collection names is reasonable. And now that we've got a solid reason to change this at all, I care much less about exactly what we change it to (i.e. just that we get the "ci_hsc" from the Gen2 rerun directory included).
Sorry for being such a pain on this - I may have been reacting more to having been burned on the last collection change (to just instrument name, just before I merged DM-21849) more than this one - but I think we've now converged to a space of names that are all broadly acceptable and a good reason not to just keep the one we have.
73534ea
to
542483d
Compare
The update I just pushed results in this collection name for ci_hsc_gen2: |
I can guess where the "rerun" comes from, and I'd be inclined to strip it. Where does the "HSC" come from? Just the instrument name?
I don't think we do, though avoiding shell special characters would save quoting on command-lines. |
There are existing reruns on |
542483d
to
c1b5d8f
Compare
No description provided.