Skip to content

Conversation

kfindeisen
Copy link
Member

@kfindeisen kfindeisen commented Jan 10, 2023

This PR replaces the per-visit raw collection and per-processing output collection with a single raw collection and a per-pipeline output collection, avoiding some scaling and concurrency issues in the central repository. However, the latter change required some reworking of MiddlewareInterface, which had assumed that each output collection was new and unique.

The old test tried to conflate two different problems ("do datasets
exist?" "what IDs do they have?"), making it inflexible. Breaking it up
into two separate assertions makes the code more readable, more
flexible, and more naturally extensible to cases with multiple
data IDs.
Giving the input exposures (and output exposures) distinct IDs is more
realistic, and will be necessary once we put different visits in the
same run.
Previously, there was a single raw collection for each group, and
raw/all was a chained collection that linked them together. This
architecture not only deviated from repository conventions, it had the
same scaling problems as timestamped output runs. Now that raws are
guaranteed to have unique IDs in simulated data, we can safely put
them all in a single standard run.
The new docs provide the exact location for Butler commands, and update
the contents to reflect DM-37072 and DM-37751.
prompt_prototype is an LSST-DM package, and follows the usual process
of needing to be setup once per session.
@kfindeisen kfindeisen marked this pull request as ready for review February 23, 2023 21:18
@kfindeisen kfindeisen requested a review from hsinfang February 23, 2023 21:54
Copy link
Collaborator

@hsinfang hsinfang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great improvement to the collection management!

This change allows the pipeline file to be looked up by other code, and
will make it easier to make the pipeline configurable later.
Reducing the number of possible runs allows the merging of collections
that share the same configuration and provenance. Merging any further
would result in Butler conflicts.
This commit is merely a string substitution; no attempt has been made
to change the structure of MiddlewareInterface. However, a unit test
that assumed pipelines are never unique had to be tweaked.
The old variable potentially conflicted with the `time` package.
A large output chain is not desired in the central repo, and the
chaining process was prone to races between different workers.
The cleanup keeps the local repo from growing unnecessarily large,
reducing the resource load on the prompt processing cluster. This
cleanup also supersedes the init-output removal hack added in DM-37068.
@kfindeisen kfindeisen merged commit 2f0745e into main Feb 24, 2023
@kfindeisen kfindeisen deleted the tickets/DM-36586 branch February 24, 2023 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants