Import no pixels checksum #3610

sbesson · 2015-03-12T12:27:41Z

Follow-up of #3580. In the case of large multi-dimensional datasets, the plane checksum calculation performed in the pixelData step of import can be computationally intensive. This PR re-uses the checksumAlgorithm set in the ImportConfig to skip this checksum calculation if File-Size-64 is selected, further reducing the in-place import time of large datasets when all optional steps are skipped.

In terms of client exposure, a new --skip argument is added to the Python CLI import plugin and serves as a shortcut to the various advanced import options that speed up import. The only performance argument not included in this option is --transfer, which may vary depending on the infrastructure (ln vs ln_s). In terms of testing, from this branch, the following import command was run on a local server against a 200GB SPIM dataset:

bin/omero import --skip=all /Volumes/ome/data_repo/from_CarlZeissImaging/ -- --transfer=ln_s
...
2015-03-12 12:04:58,894 1352476    [l.Client-0] INFO      ome.formats.importer.cli.ErrorHandler - Number of errors: 0

==> Summary
6 files uploaded, 1 fileset created, 4 images imported, 0 errors in 0:13:20.211

Once import was completed, images were browsed via the client and thumbnails were generated in < 1min (with no other process running).

Using these new semantics, the ITest.missing_pyramid() logic is modified to use a large fake file import with --skip=all set instead of the previous implementations. This dramatically speeds up all tests depending on missing pyramids (test_rawpixelsstore.py and test_thumbs.py) so that their long_running markers can be removed. These tests should be run as part of the daily integration job and are expected to be green.

Comments/questions:

is the naming scheme sensible for the CLI?
do we also want to expose --transfer in the regular help of bin/omero import?
longer term, we may want to deprecate/remove the plane checksum computation at import time as this step is obsolete since 5.0.

--no-rebase

In the case of large datasets like SPIM, the parseData operation is highly computationally intensive for very little (if any) benefit. Assuming the --no_stats_info option is passed, this checksum calculation is turned off. Testing of this option locally with an in-place import of a 200GB SPIM dataset resulted in a 7 min client-side + 5 min server-side import. In a subsequent cleanup step we may want to simply get rid of this checksum calculation, which does not make sense anymore with 5.0.
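As a rough illustration of why the step described above is costly and what skipping it amounts to, here is a minimal Python sketch. It assumes the pixels checksum is a single SHA-1 digest updated plane by plane; the function name pixels_sha1 and its no_pixels_checksum parameter are hypothetical and do not mirror the actual Java implementation in this PR.

```python
import hashlib


def pixels_sha1(planes, no_pixels_checksum=False):
    """Return a SHA-1 hex digest over all planes, or None when skipped.

    Hypothetical helper for illustration only; the real computation
    lives server-side in Java.
    """
    if no_pixels_checksum:
        # Skipping means no plane has to be read or hashed at all.
        return None
    digest = hashlib.sha1()
    for plane in planes:
        # Every plane is read in full, so cost grows with dataset size.
        digest.update(plane)
    return digest.hexdigest()


planes = [b"plane-0", b"plane-1"]
computed = pixels_sha1(planes)
skipped = pixels_sha1(planes, no_pixels_checksum=True)
```

With the flag set, the loop over planes never runs, which is where the saving on a large multi-dimensional dataset would come from.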

mtbc · 2015-03-12T12:33:39Z

I like the idea of getting rid of the pixels.sha1 column altogether.

joshmoore · 2015-03-12T17:54:50Z

Re-ran Travis due to HDF5 error.

snoopycrimecop · 2015-03-13T03:47:46Z

Conflicting PR. Removed from build OME-5.1-merge-push#846. See the console output for more details.
Possible conflicts:

PR Float big image #3525 jburel 'Float big image'
- components/blitz/src/ome/services/blitz/repo/ManagedImportRequestI.java

snoopycrimecop · 2015-03-13T06:47:12Z

Conflicting PR. Removed from build OME-5.1-merge-push#847. See the console output for more details.
Possible conflicts:

PR Float big image #3525 jburel 'Float big image'
- components/blitz/src/ome/services/blitz/repo/ManagedImportRequestI.java

mtbc · 2015-03-13T10:00:02Z

The code changes do look reasonable.

chris-allan · 2015-03-13T10:22:38Z

components/tools/OmeroPy/src/omero/plugins/import.py

+        out = args.file
+        err = args.errs
+
+        if out:


close() these at some point? Defensively include a try / finally?
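A minimal sketch of the defensive pattern suggested here, written for Python 3. The function run_with_redirects and its parameters are hypothetical stand-ins and are not the code in import.py; the print calls only take the place of launching the importer.

```python
import sys


def run_with_redirects(out_path=None, err_path=None):
    """Open optional output/error files and always close them."""
    out = err = None
    try:
        if out_path:
            out = open(out_path, "w")
        if err_path:
            err = open(err_path, "w")
        # Stand-in for running the importer with redirected streams.
        print("import output", file=out or sys.stdout)
        print("import errors", file=err or sys.stderr)
    finally:
        # Close only the handles that were actually opened, even if
        # opening the second file or the import itself raised.
        for handle in (out, err):
            if handle is not None:
                handle.close()
    return out, err
```

Initialising both handles to None before the try block is what makes the finally clause safe when the first open() fails.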

chris-allan · 2015-03-13T10:26:59Z

Apart from some silly stylistic things mentioned above and discussed with @sbesson this all looks completely reasonable. 👍

sbesson · 2015-03-13T11:56:00Z

Last commits should further refactor import.py to close file handles within a try/finally block (plus additional stylistic changes).
Additionally, ITest.import_image() is modified to support an optional skip argument, ITest.missing_pyramid() is rewritten to use import_image(..., skip='all') and the long_running markers are dropped from all Python tests using missing_pyramid() /cc @joshmoore @ximenesuk

sbesson · 2015-03-14T06:53:51Z

Based on https://ci.openmicroscopy.org/view/Failing/job/OMERO-5.1-breaking-integration-python/232/testReport/, removing breaking label.

ximenesuk · 2015-03-16T12:07:53Z

Everything looks good. Although I didn't have a local 200GB image to play with, a 200MB one was enough to show improvements, especially combined with in-place imports. The previously long running tests sped by!

Regarding the help: the skip option relies on the underlying checksum option amongst other things. That means either exposing the checksum option in the main help, which in all likelihood would mean exposing other options too, or alternatively pushing skip back to the advanced help, where it would sit naturally under the Import speed section. As it stands, skip seems to be the only effectively advanced option exposed at the top level.

sbesson · 2015-03-16T12:45:46Z

@ximenesuk: agreed the high-level exposure of the speed options requires additional work and scoping. My main rationale is that advanced help is really hidden for most of the users and complex to use. Here I am in favor of migrating a subset of these advanced options to a top-level Import speed/performance Python argument group which could minimally include --skip and --transfer arguments.

With regard to checksum vs skip options, I would still be inclined to expose a binary choice at the high-level. The advanced help would then allow the user to have more granularity on the choice of algorithm rather than an on/off choice (SHA1/File-Size-64).

Finally, the most unsatisfying piece of code in my PR is definitely the following line:

 if checksumAlgorithm.equals("File-Size-64")

Rather than binding this check to a particular algorithm, we might consider migrating this to the mapper class with a method like ChecksumAlgorithmMapper.isFast(checksumAlgorithm). Thoughts @mtbc?

mtbc · 2015-03-16T12:51:09Z

It's certainly a meritorious idea, but fastness is a sliding scale; people might think it means "sub-cryptographic" or something. A clearer distinction might be a name that distinguishes algorithms that actually look at the (whole?) file contents from ones that don't. (I suspect that the scarce resource will typically be I/O, not CPU.) (Calling File-Size-64 a "checksum" algorithm is generous in the extreme!)
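One way to picture the distinction drawn here, as a hedged Python sketch rather than a proposal for the Java mapper class: the table and the function reads_file_contents are hypothetical, and the "SHA1-160" spelling is an assumption (the thread only says SHA1 and File-Size-64).

```python
# Hypothetical classification of checksum algorithms by whether they
# need to read the file contents. The "SHA1-160" key is an assumed name.
READS_FILE_CONTENTS = {
    "SHA1-160": True,       # hashes every byte of the file
    "File-Size-64": False,  # only looks at the file size
}


def reads_file_contents(checksum_algorithm):
    """Return True if the named algorithm must read the whole file."""
    try:
        return READS_FILE_CONTENTS[checksum_algorithm]
    except KeyError:
        raise ValueError(
            "Unknown checksum algorithm: %s" % checksum_algorithm)
```

Naming the predicate after what the algorithm does (reads the contents or not) avoids the sliding scale implied by a name like isFast().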

sbesson · 2015-03-16T13:00:27Z

Definitely! isFast() was more of an example than a real suggestion. In the context of this PR, I concur we want to classify computationally expensive algorithms. An issue here is that we are partly misusing the file-level checksum_algorithm option.

A third (more thorough) option would be to properly define at the Java level a new --no-plane-checksum advanced option, which would be used by the PixelData step and map this flag as well as the --checksum_algorithm choice (for the file upload only) to the top-level --skip={all,checksum}.

sbesson · 2015-03-16T15:58:49Z

The last set of commits implements a low-level --no_pixels_checksum option for disabling the pixels checksum computation and documents the new Java options in the importer advanced-help. The high-level --skip=checksum behavior is unchanged and disables both file and pixels checksum computation.

ximenesuk · 2015-03-17T09:24:08Z

components/tools/OmeroPy/src/omero/plugins/import.py

+            self.command_args.append("--no_pixels_checksum")
+            self.command_args.append("--checksum_algorithm=File-Size-64")
+        if args.skip in ['all', 'thumbnails']:
+            self.command_args.append("--no_thumbnails")


Is the --no_thumbnails option not exposed in the help? Or am I just missing it?

It is in the regular Java help. Arguably this could be migrated to the Import speed section

I would be in favour of that for consistency.
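For readers following the diff fragment in this review thread, here is a self-contained sketch of how a high-level --skip value could expand into the low-level importer options. The appended option strings and the ['all', 'thumbnails'] test come from the diff; the ['all', 'checksum'] condition is inferred from the --skip={all,checksum} discussion, and the function names, the parser, and the restriction to three choices are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the choices named in this thread."""
    parser = argparse.ArgumentParser(prog="import-sketch")
    parser.add_argument(
        "--skip", choices=["all", "checksum", "thumbnails"],
        help="Optional import step(s) to skip")
    return parser


def java_args_for_skip(skip):
    """Expand a --skip value into low-level importer options."""
    command_args = []
    if skip in ["all", "checksum"]:  # inferred condition, not in the diff
        command_args.append("--no_pixels_checksum")
        command_args.append("--checksum_algorithm=File-Size-64")
    if skip in ["all", "thumbnails"]:
        command_args.append("--no_thumbnails")
    return command_args


args = build_parser().parse_args(["--skip", "checksum"])
expanded = java_args_for_skip(args.skip)
```

Keeping the expansion in one function makes it easy to see that --skip=checksum turns off both the file-level and the pixels-level checksum, as described in the preceding comments.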

ximenesuk · 2015-03-17T09:26:09Z

I notice that underscores are used in the new options. I guess this is for consistency, but might this be a good point to move to hyphens in import.py, especially with new options?
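If a move to hyphens were made, one possible transition (purely a sketch, not something this PR implements) is to register both spellings for the same destination with argparse. The parser below is hypothetical and only borrows the option name from this thread.

```python
import argparse

# Hypothetical parser accepting both the hyphenated and the
# underscored spelling of the same flag.
parser = argparse.ArgumentParser(prog="import-sketch")
parser.add_argument(
    "--no-pixels-checksum", "--no_pixels_checksum",
    dest="no_pixels_checksum", action="store_true",
    help="Disable the pixels checksum computation")

hyphenated = parser.parse_args(["--no-pixels-checksum"])
underscored = parser.parse_args(["--no_pixels_checksum"])
```

Both namespaces end up with the same no_pixels_checksum attribute, so downstream code would not need to know which spelling was used.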

ximenesuk · 2015-03-17T10:59:21Z

This all looks okay. Will the Mac-only test_import.py be fixed in this PR?

sbesson · 2015-03-17T11:10:54Z

@ximenesuk: if the changes above make sense, I am tempted to implement your suggestions in separate PRs. All comments have been captured in https://trello.com/c/l5vppgeI/325-cli-import-features.

ximenesuk · 2015-03-17T11:25:45Z

@sbesson that sounds reasonable. I don't know what the timescale is but having to move to hyphens in 5.1 would make some sense given it is breaking. But please go ahead and merge this PR (if only so I can rebase a local branch!)

joshmoore · 2015-03-17T12:00:25Z

Ok. Leaving you guys to sort args & such in follow-on PRs.

Import no pixels checksum

sbesson added 6 commits March 12, 2015 09:46

Skip plane checksum computation based on checksum_algorithm value

02879fa

Add extra logging information both client and server side during import

31d8666

Expose import improvements via a --skip arguments to the CLI import

f62e42a

Split import plugin argument setup using set_xxx_arguments() methods

fc9b6f9

Add unit test for set_login_arguments

bb0556d

sbesson added develop breaking labels Mar 12, 2015

Fix choices for skip argument

2afa288

joshmoore removed the breaking label Mar 12, 2015

chris-allan reviewed Mar 13, 2015

sbesson added 2 commits March 13, 2015 11:20

Fix long_running Python integration tests

8c1c989

- Modify ITest.import_image() to support --skip argument
- Refactor ITest.missing_pyramid() to use import with --skip=all
- Fix missing pyramid tests and remove long_running markers

Close out/err files if applicable in import.py

2462a39

sbesson added the breaking label Mar 13, 2015

Add defensive try/finally block

da1cd51

sbesson added 3 commits March 13, 2015 11:58

Fix unit tests

2465f51

Merge remote-tracking branch 'origin/develop' into import_no_pixels_checksum

3e7ad54

Fix import argument order in import_image

2450a68

sbesson removed the breaking label Mar 14, 2015

sbesson added 3 commits March 16, 2015 14:45

Add --no_pixels_checksum option disabling pixels checksum computation

0e131cc

- Only expose this option at the low-level Java option
- Map this checksum option to the high-level --skip={all,checksum} argument

Add --no_pixels_checksum and --no_stats_info to importer-cli advanced help

bab2dc4

Fix Javadoc markers

f56c77a

ximenesuk mentioned this pull request Mar 16, 2015

Python tests library improvements #3618

Merged

ximenesuk reviewed Mar 17, 2015

joshmoore added a commit that referenced this pull request Mar 17, 2015

Merge pull request #3610 from sbesson/import_no_pixels_checksum

bfb8d32

Import no pixels checksum

joshmoore merged commit bfb8d32 into ome:develop Mar 17, 2015

sbesson deleted the import_no_pixels_checksum branch March 17, 2015 12:02

sbesson added this to the 5.1.0-m5 milestone Mar 19, 2015
