DM-38454: butler mode for ModelPackages #21
Conversation
Force-pushed from 61618c4 to dc72b2b.
the Butler repo.
with a custom formatter.
E.g. the collection "pretrained_models/dummy" will be where the pretrained weights for the model package "dummy" will be stored.
This storage mode allows the pretrained weights for a model package to be loaded from the butler and passed down to the task, just like in other typical LSST tasks.
Force-pushed from 5e483af to cc2c627.
Force-pushed from dde8a98 to ff074fc.
My only big-picture concern is that I'm not sure you're getting much from the `StorageAdapterBase` class and its subclasses, unless it's all just backwards compatibility and this just isn't the right ticket to remove them. Ideally I'd like to get this down to two modes of operation:

- If you call `RBTransiNetTask.run` directly from Python and pass the payload to use as an argument, you can load it however you'd like (and this is what I'd expect all or most unit tests to do, and possibly a lot of development work, to the extent that's done in interactive Python sessions).
- If you run that task as a `PipelineTask`, the execution framework loads the payload from the butler and passes it to `run` for you.
I do think there's probably a need for a concrete/final storage utility class as a place to put methods like `ingest` and low-level I/O code, but I don't see a need for an abstraction layer here (again, except as a way to do backwards compatibility from the `rbClassifier_data` git-package approach, which would be totally legitimate).

I also suspect that if you weren't trying to fit this into the `StorageAdapterBase` scheme, a more natural `NNModelPackagePayload` might actually hold the architecture/weights/metadata instead of a raw `BytesIO` - but I think you've got the necessary levels of indirection here to change that in the future without breaking anything, so it's not something we need to resolve on this ticket.
I'm hitting "Approve" now because I think all the little things that should get addressed before merge are straightforward and hopefully uncontroversial, and I'm happy to defer any big-picture issues to other tickets.
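The component-bearing payload suggested above could look roughly like the following sketch. The class and field names here are illustrative assumptions, not part of the actual `meas_transiNet` API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a payload that holds the unpacked components
# (architecture, weights, metadata) instead of a raw BytesIO blob.
# All names are illustrative, not the real meas_transiNet API.
@dataclass
class ComponentPayload:
    architecture: Optional[bytes] = None
    weights: Optional[bytes] = None
    metadata: Optional[dict] = None
```

With a shape like this, callers could access the pieces directly instead of re-parsing an opaque blob.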
```python
payload = NNModelPackagePayload()
with open(self.fileDescriptor.location.path, "rb") as f:
    payload.bytes = BytesIO(f.read())
return payload
```
Since this formatter can read from and write to bytes, you should probably implement `can_read_bytes`, `toBytes`, and `fromBytes`, too. These will be preferred (when `can_read_bytes` is `True`) when reading from object stores, for efficiency.
Just added very simple implementations of `fromBytes` and `toBytes` -- the base implementation of `can_read_bytes` should suffice? I don't know a simple/standard way of testing them, though.
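One simple option for testing them is a round-trip unit test. The sketch below uses stand-in stubs for the payload and the two hooks, so every name in it is an assumption rather than the real API:

```python
import unittest
from io import BytesIO

# Stand-ins for the real payload and formatter hooks, so the test
# sketch is self-contained.
class PayloadStub:
    def __init__(self, data: bytes = b"") -> None:
        self.bytes = BytesIO(data)

def toBytes(payload: PayloadStub) -> bytes:
    return payload.bytes.getvalue()

def fromBytes(data: bytes) -> PayloadStub:
    return PayloadStub(data)

class RoundTripTestCase(unittest.TestCase):
    """A round trip through toBytes/fromBytes should preserve the blob."""

    def test_round_trip(self):
        original = PayloadStub(b"model package blob")
        restored = fromBytes(toBytes(original))
        self.assertEqual(restored.bytes.getvalue(), b"model package blob")
```

Against the real formatter, the same pattern applies with the actual classes substituted in.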
python/lsst/meas/transiNet/modelPackages/storageAdapterButler.py (outdated; resolved)
python/lsst/meas/transiNet/modelPackages/storageAdapterButlerHybrid.py (outdated; resolved)
```python
# needed (e.g. in butler mode).
self.model_package_name = task.config.modelPackageName or 'N/A'

self.package_storage_mode = task.config.modelPackageStorageMode
```
Better to pass these (and anything else you get from `task`) directly to `__init__` rather than passing the task instance, to avoid introducing a circular interface dependency between the task and this class.
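The suggested refactor might look roughly like this sketch; the class name and call site below are illustrative, not the actual code:

```python
# Sketch: the adapter takes the config values it needs directly,
# instead of holding on to the whole task instance.  Names are
# illustrative.
class StorageAdapterSketch:
    def __init__(self, model_package_name, package_storage_mode):
        self.model_package_name = model_package_name or 'N/A'
        self.package_storage_mode = package_storage_mode

# Hypothetical call site inside the task:
# adapter = StorageAdapterSketch(self.config.modelPackageName,
#                                self.config.modelPackageStorageMode)
```

This keeps the adapter testable on its own, without constructing a task at all.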
```python
results = self.butler.registry.queryDatasets(
    StorageAdapterButler.dataset_type_name,
    collections=f'{StorageAdapterButler.packages_parent_collection}/{self.model_package_name}')
payload = self.butler.get(list(results)[0])
self.from_payload(payload)
```
I don't understand why this isn't done inside the formatter. How does `from_payload` know that what it gets back is a zip file? This is still subverting the role of the butler. Why can't the unzip go inside the formatter, with `NNModelPackagePayload` holding the three components in it?
Recapping two separate pairwise discussions: I agree, but as long as there's only one file extension (i.e. `zip`) in play, we can make that change on a separate ticket, and I could imagine that working better in terms of transitioning from the old way to load these.
I'd have some comments mainly re. how this matches (or not) the initial design, and how much we want to add to/remove from abstraction layers, etc. But yes, let's keep it for another ticket.
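A hedged sketch of the direction discussed in this thread: the formatter unpacks the zip archive itself, so the payload can carry the individual components rather than an opaque blob. The function name and component layout are assumptions for illustration:

```python
import io
import zipfile

# Illustrative helper: unpack a zip archive (as the formatter might do)
# into a {member name: bytes} mapping, so downstream code never sees
# the raw archive.
def read_components(blob: bytes) -> dict:
    """Unpack a zip archive into a name -> bytes mapping."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```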
Force-pushed from 4361fbc to 9388f53.
- This way we can also load a model/module which is already loaded as a file-like object.
- from a butler repository.
- for ingesting new model packages.
- using a temporary (but real) butler repository.
- Plus some minor fixes.
- to be passed back and forth between the formatter and storageAdapterButler.
- (only for _real_ runs of pipelines)
Force-pushed from 987aa95 to b429e04.
Co-authored-by: Jim Bosch <talljimbo@gmail.com>
Force-pushed from b429e04 to f3f6e60.
The commits covered by this PR span a large time range. Therefore, my first guess is that looking at the changes as a whole is easier than checking the commits one by one. But that might be just my taste.