Import no pixels checksum #3610
In the case of large datasets like SPIM, the parseData operation is highly computationally intensive for very little (if any) benefit. Assuming the --no_stats_info option is passed, this checksum calculation is turned off. Testing this option locally with an in-place import of a 200GB SPIM dataset resulted in a 7 min client-side + 5 min server-side import. In a subsequent cleanup step we may want to simply get rid of this checksum calculation, which no longer makes sense with 5.0.
I like the idea of getting rid of the |
Re-ran Travis due to HDF5 error.
Conflicting PR. Removed from build OME-5.1-merge-push#846. See the console output for more details.
Conflicting PR. Removed from build OME-5.1-merge-push#847. See the console output for more details.
The code changes do look reasonable. |
out = args.file
err = args.errs

if out:
close() these at some point? Defensively include a try / finally?
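A minimal sketch of the defensive pattern suggested in this comment, assuming out and err are optional file paths as in the diff context above. The helper name run_with_redirects and its action callback are hypothetical and not part of this PR:

```python
import sys


def run_with_redirects(action, out=None, err=None):
    """Call action(stdout, stderr) and close any files opened here."""
    out_file = err_file = None
    try:
        # Only open a file when a path was supplied. The process-wide
        # streams are used as a fallback and are never closed by us.
        if out:
            out_file = open(out, "w")
        if err:
            err_file = open(err, "w")
        stdout = out_file if out_file is not None else sys.stdout
        stderr = err_file if err_file is not None else sys.stderr
        return action(stdout, stderr)
    finally:
        # Runs on success and on failure, so handles cannot leak.
        for handle in (out_file, err_file):
            if handle is not None:
                handle.close()
```

Because the handles are closed in the finally block, an exception raised by the action (or by the second open) still releases whatever was already opened.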
Apart from some silly stylistic things mentioned above and discussed with @sbesson this all looks completely reasonable. 👍 |
- Modify ITest.import_image() to support --skip argument
- Refactor ITest.missing_pyramid() to use import with --skip=all
- Fix missing pyramid tests and remove long_running markers
Last commits should further refactor |
Based on https://ci.openmicroscopy.org/view/Failing/job/OMERO-5.1-breaking-integration-python/232/testReport/, removing |
Everything looks good. Although I didn't have a local 200GB image to play with a 200MB one was enough to show improvements, especially combined with in-place imports. The previously long running tests sped by! Regarding the help. The |
@ximenesuk: agreed the high-level exposure of the speed options requires additional work and scoping. My main rationale is that With regard to checksum vs skip options, I would still be inclined to expose a binary choice at the high-level. The advanced help would then allow the user to have more granularity on the choice of algorithm rather than an on/off choice ( Finally, the most unsatisfying piece of code in my PR is definitely the following line:
Rather than binding this check to a particular algorithm, we might consider migrating this to the mapper class with a method like |
It's certainly a meritorious idea, but fastness is a sliding scale; people might think it means "sub-cryptographic" or something. A clearer distinction might be a name that distinguishes algorithms that actually look at the (whole?) file contents from ones that don't. (I suspect that the scarce resource will typically be I/O, not CPU.) (Calling File-Size-64 a "checksum" algorithm is generous in the extreme!) |
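To illustrate the distinction drawn here between algorithms that read the file contents and ones that do not, a rough sketch follows. It assumes File-Size-64 can be modelled as the byte length packed into a 64-bit integer; the function names are hypothetical and this is not the OMERO implementation:

```python
import hashlib
import os
import struct


def size_only_digest(path):
    """Digest derived from the file length alone; the contents are never read."""
    # Assumption: the length is encoded as a big-endian unsigned
    # 64-bit integer and rendered as hex.
    return struct.pack(">Q", os.path.getsize(path)).hex()


def content_digest(path, chunk_size=1024 * 1024):
    """SHA-1 over the whole file, read in chunks; I/O cost grows with size."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```

Two files of equal length but different bytes produce the same size-only digest and different content digests, which is why the former is cheap on I/O but says nothing about integrity.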
Definitely! A third (more thorough) option would be to properly define at the Java level a new |
- Only expose this option at the low-level Java option
- Map this checksum option to the high-level --skip={all,checksum} argument
Last set of commits implement a low-level |
self.command_args.append("--no_pixels_checksum")
self.command_args.append("--checksum_algorithm=File-Size-64")
if args.skip in ['all', 'thumbnails']:
    self.command_args.append("--no_thumbnails")
Is the --no_thumbnails option not exposed in the help? Or am I just missing it?
It is in the regular Java help. Arguably this could be migrated to the Import speed section.
I would be in favour of that for consistency.
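For readers following the thread, the mapping from the high-level --skip value to the low-level flags can be sketched roughly as below. It only models the choices and flags quoted in this conversation, assumes the checksum choice appends both checksum flags shown in the diff, and uses hypothetical names (SKIP_FLAGS, build_parser, low_level_args) rather than the plugin's actual code:

```python
import argparse

# Assumption: only the steps quoted in this conversation are modelled;
# the real plugin may accept more values.
SKIP_FLAGS = {
    "checksum": ["--no_pixels_checksum", "--checksum_algorithm=File-Size-64"],
    "thumbnails": ["--no_thumbnails"],
}


def build_parser():
    """Parser exposing a single high-level --skip choice."""
    parser = argparse.ArgumentParser(prog="import-sketch")
    parser.add_argument(
        "--skip", choices=["all"] + sorted(SKIP_FLAGS), default=None,
        help="Optional import step to skip")
    return parser


def low_level_args(skip):
    """Translate the --skip value into the low-level importer flags."""
    command_args = []
    for step, flags in sorted(SKIP_FLAGS.items()):
        # "all" enables every step; otherwise only the named one.
        if skip in ("all", step):
            command_args.extend(flags)
    return command_args
```

Keeping the mapping in one table means a new skippable step only needs one new entry, and the help text for --skip stays in sync with the flags it expands to.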
I notice that underscores are used in the new options. I guess this is for consistency but might this be a good point to move to hyphens in |
This all looks okay. Will the Mac-only |
@ximenesuk: if the changes above make sense, I am tempted to implement your suggestion in separate PRs. All comments have been captured in https://trello.com/c/l5vppgeI/325-cli-import-features.
@sbesson that sounds reasonable. I don't know what the timescale is but having to move to hyphens in 5.1 would make some sense given it is breaking. But please go ahead and merge this PR (if only so I can rebase a local branch!) |
Ok. Leaving you guys to sort args & such in follow-on PRs. |
Follow-up of #3580. In the case of large multi-dimensional datasets, the plane checksum calculation performed in the pixelData step of import can be computationally intensive. This PR re-uses the checksumAlgorithm set in the ImportConfig to skip this checksum calculation if File-Size-64 is selected, further reducing the in-place import time of large datasets when all optional steps are skipped.
In terms of client exposure, a new --skip argument is added to the Python CLI import plugin and serves as a shortcut to the various advanced import options that speed up import. The only performance argument not included in this option is --transfer, which may vary depending on the infrastructure (ln vs ln_s).
In terms of testing, from this branch, the following import command was run on a local server against a 200GB SPIM dataset:
Once import was completed, images were browsed via client and thumbnails were generated in < 1min (with no other process running).
Using these new semantics, the ITest.missing_pyramid() logic is modified to use a large fake file import with --skip=all set instead of the previous implementations. This dramatically speeds up all tests depending on missing pyramids (test_rawpixelsstore.py and test_thumbs.py) so that their long_running markers can be removed. These tests should be run as part of the daily integration job and are expected to be green.
Comments/questions:
- --transfer in the regular help of bin/omero import?
- --no-rebase