Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import no pixels checksum #3610

Merged
merged 16 commits into from Mar 17, 2015
Merged

Conversation

sbesson
Copy link
Member

@sbesson sbesson commented Mar 12, 2015

Follow-up of #3580. In the case of large multi-dimensional datasets, the plane checksum calculation performed in the pixelData step of import can be computationally intensive. This PR re-uses the the checksumAlgorithm set in the ImportConfig to skip this checksum calculation if File-Size-64 is selected further reducing the in-place import of large datasets while skipping all optional steps.

In terms of client exposition, a new --skip argument is added to the Python CLI import plugin and serves as a shortcut to the various advanced imported options allowing to speed up import. The only performance argument not included in this option is --transfer which may vary depending on the infrastructure (ln vs ln_s). In terms of testing, from this branch, the following import command was run on a local server against a 200GB SPIM dataset:

bin/omero import --skip=all /Volumes/ome/data_repo/from_CarlZeissImaging/ -- --transfer=ln_s
...
2015-03-12 12:04:58,894 1352476    [l.Client-0] INFO      ome.formats.importer.cli.ErrorHandler - Number of errors: 0

==> Summary
6 files uploaded, 1 fileset created, 4 images imported, 0 errors in 0:13:20.211

Once import was completed, images were browsed via client and thumbnails were generated in < 1min (with no other process running)

Using this new semantics, the ITest.missing_pyramid() logic is modified to use a large fake file import with --skip=all set instead of the previous implementations. This dramatically speeds up all tests depending on missing pyramids (test_rawpixelsstore.py and test_thumbs.py) so that their long_running markers can be removed. These tests should be run as part of the daily integration job and are expected to be green.

Comments/questions:

  • is the naming scheme sensible for the CLI?
  • do we also want to expose --transfer in the regular help of bin/omero import?
  • longer term, we may want to deprecate/remove the plane checksum computation at import time as this step is obsolete since 5.0.

--no-rebase

In the case of large datasets like SPIM, the parseData operation is highly
computational intensive for very few (if not none) benefits. Assuming the
--no_stats_info option is passed, this checksum calculation is turned off.
Testing of this option locally with an in-place import of a 200GB SPIM dataset
resulted in a 7 min client-side + 5 min server-side import.

In a subsequent cleanup step we may want to simply get rid of this checksum
calculation which does not make sense anymore with 5.0.
@mtbc
Copy link
Member

mtbc commented Mar 12, 2015

I like the idea of getting rid of the pixels.sha1 column altogether.

@joshmoore
Copy link
Member

Re-ran travis due HDF5 error.

@snoopycrimecop
Copy link
Member

Conflicting PR. Removed from build OME-5.1-merge-push#846. See the console output for more details.
Possible conflicts:

  • PR Float big image #3525 jburel 'Float big image'
    • components/blitz/src/ome/services/blitz/repo/ManagedImportRequestI.java

@snoopycrimecop
Copy link
Member

Conflicting PR. Removed from build OME-5.1-merge-push#847. See the console output for more details.
Possible conflicts:

  • PR Float big image #3525 jburel 'Float big image'
    • components/blitz/src/ome/services/blitz/repo/ManagedImportRequestI.java

@mtbc
Copy link
Member

mtbc commented Mar 13, 2015

The code changes do look reasonable.

out = args.file
err = args.errs

if out:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close() these at some point? Defensively include a try / finally?

@chris-allan
Copy link
Member

Apart from some silly stylistic things mentioned above and discussed with @sbesson this all looks completely reasonable. 👍

- Modify ITest.import_image() to support --skip argument
- Refactor ITest.missing_pyramid() to use import with --skip=all
- Fix missing pyramid tests and remove long_running markers
@sbesson
Copy link
Member Author

sbesson commented Mar 13, 2015

Last commits should further refactor import.py to close file handles within try/finally block (plus additional stylistic changes).
Additionally, ITest.import_images() is modified to support a skip optional argument, ITest.missing_pyramid() is rewritten to use import_image(..., skip='all') and the long_running markers are dropped from all Python tests using missing_pyramid() /cc @joshmoore @ximenesuk

@sbesson
Copy link
Member Author

sbesson commented Mar 14, 2015

@sbesson sbesson removed the breaking label Mar 14, 2015
@ximenesuk
Copy link
Contributor

Everything looks good. Although I didn't have a local 200GB image to play with a 200MB one was enough to show improvements, especially combined with in-place imports. The previously long running tests sped by!

Regarding the help. The skip option relies on the underlying checksum amongst other things. That means either exposing the checksum option in the main help - which would mean exposing other options too in all likelihood. Or alternatively, skip could be pushed back to the advanced help where it would sit naturally under the Import speed section. As it stands skip seems to be the only effectively advanced option exposed at the top level.

@sbesson
Copy link
Member Author

sbesson commented Mar 16, 2015

@ximenesuk: agreed the high-level exposure of the speed options requires additional work and scoping. My main rationale is that advanced help is really hidden for most of the users and complex to use. Here I am in favor of migrating a subset of these advanced options to a top-level Import speed/performance Python argument group which could minimally include --skip and --transfer arguments.

With regard to checksum vs skip options, I would still be inclined to expose a binary choice at the high-level. The advanced help would then allow the user to have more granularity on the choice of algorithm rather than an on/off choice (SHA1/File-Size-64).

Finally, the most unsatisfying piece of code in my PR is definitely the following line:

 if checksumAlgorithm.equals("File-Size-64")

Rather than binding this check to a particular algorithm, we might consider migrating this to the mapper class with a method like ChecksumAlgorithmMapper.isFast(checksumAlgorithm). Thoughts @mtbc?

@mtbc
Copy link
Member

mtbc commented Mar 16, 2015

It's certainly a meritorious idea, but fastness is a sliding scale; people might think it means "sub-cryptographic" or something. A clearer distinction might be a name that distinguishes algorithms that actually look at the (whole?) file contents from ones that don't. (I suspect that the scarce resource will typically be I/O, not CPU.) (Calling File-Size-64 a "checksum" algorithm is generous in the extreme!)

@sbesson
Copy link
Member Author

sbesson commented Mar 16, 2015

Definitely! isFast() was rather for the example than a real suggestion. In the context of this PR, I concurr we want to classify computationally expensive algorithms. An issue here is that we are partly misusing the file-level checksum_algorithm option.

A third (more thorough) option would be to properly define at the Java level a new --no-plane-checksum advanced option, which would be used by the PixelData step and map this flag as well as the --checksum_algorithm choice (for the file upload only) to the top-level --skip={all,checksum}.

- Only expose this option at the low-level Java option
- Map this checksum option to the high-level --skip={all,checksum} argument
@sbesson
Copy link
Member Author

sbesson commented Mar 16, 2015

Last set of commits implement a low-level --no_pixels_checksum option for disabling the pixels checksum computation as well as documentation of new Java options in the importer advanced-help. The high-level --skip=checksum behavior is unchanged and disables both file and pixels checksum computation.

self.command_args.append("--no_pixels_checksum")
self.command_args.append("--checksum_algorithm=File-Size-64")
if args.skip in ['all', 'thumbnails']:
self.command_args.append("--no_thumbnails")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the --no_thumbnails option not exposed in the help? Or am I just missing it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is in the regular Java help. Arguably this could be migrated to the Import speed section

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favour of that for consistency.

@ximenesuk
Copy link
Contributor

I notice that underscores are used in the new options. I guess this is for consistency but might this be a good point to move to hyphens in import.py especially with new options.

@ximenesuk
Copy link
Contributor

This all looks okay. Will the Mac-only test_import.pybe fixed in this PR?

@sbesson
Copy link
Member Author

sbesson commented Mar 17, 2015

@ximenesuk: if the changes above make sense, I am tempted to implement your suggestion in separate PRs. All comments have been captured in see https://trello.com/c/l5vppgeI/325-cli-import-features.

@ximenesuk
Copy link
Contributor

@sbesson that sounds reasonable. I don't know what the timescale is but having to move to hyphens in 5.1 would make some sense given it is breaking. But please go ahead and merge this PR (if only so I can rebase a local branch!)

@joshmoore
Copy link
Member

Ok. Leaving you guys to sort args & such in follow-on PRs.

joshmoore added a commit that referenced this pull request Mar 17, 2015
@joshmoore joshmoore merged commit bfb8d32 into ome:develop Mar 17, 2015
@sbesson sbesson deleted the import_no_pixels_checksum branch March 17, 2015 12:02
@sbesson sbesson added this to the 5.1.0-m5 milestone Mar 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants