
Add --split option #80

Merged: 9 commits into glencoesoftware:master (May 30, 2023)

Conversation

@melissalinkert (Member)

See #79.

raw2ometiff --split input.zarr output should write a set of OME-TIFF files named output_s0.ome.tiff, output_s1.ome.tiff, etc., with one file per input series. showinf -noflat output_s0.ome.tiff should detect all of the output_s*.ome.tiff files together, and display the same dimensions as the original data that was converted by bioformats2raw.

raw2ometiff input.zarr output.ome.tiff should behave as before; in addition to correctness, it's worth checking for any noticeable performance differences. I didn't see a real difference when testing with some small fake plates and CMU-1.ndpi, but it's worth double-checking.

Given how much this modifies TIFF writing internals, this will need careful review and testing with both HCS and WSI data, and with and without the new --split option.
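
For reference, a minimal sketch (not part of this PR) of the check described above, using the Bio-Formats reader API that showinf wraps; output_s0.ome.tiff is the hypothetical file name from the example, and Bio-Formats is assumed to be on the classpath:

```java
import loci.formats.ImageReader;

public class CheckSplitOutput {
  public static void main(String[] args) throws Exception {
    ImageReader reader = new ImageReader();
    reader.setFlattenedResolutions(false);      // the equivalent of showinf -noflat
    reader.setId("output_s0.ome.tiff");         // hypothetical name from the example above
    System.out.println("series: " + reader.getSeriesCount());
    for (String f : reader.getUsedFiles()) {    // should list every output_s*.ome.tiff
      System.out.println("uses: " + f);
    }
    for (int s = 0; s < reader.getSeriesCount(); s++) {
      reader.setSeries(s);
      System.out.println("series " + s + ": " +
        reader.getSizeX() + " x " + reader.getSizeY() +
        ", Z=" + reader.getSizeZ() + " C=" + reader.getSizeC() + " T=" + reader.getSizeT());
    }
    reader.close();
  }
}
```

getUsedFiles() should report every output_s*.ome.tiff in the set, and the per-series dimensions should match the data originally converted by bioformats2raw.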

@melissalinkert (Member Author)

Not sure why the build is failing here; all tests pass locally. Will investigate in the morning.

@melissalinkert (Member Author)

Build fixed and debugging commits force-pushed away, so ready for review now.

@DavidStirling (Member)

I've done some testing with this and it's good stuff! In general it seems to work well and the individual series files are correctly sized. Non-split mode appears unchanged.

After splitting, opening an individual series .tif in FIJI (and, I presume, any other Bio-Formats-based viewer) 'sees' all of the exported series rather than just the opened file. The GUI lists them all as being within the dropped .tif, but the actual data is not there. This wouldn't be a major problem; however, if any of the other series files are missing (e.g. we just copied Series 2 to another location and tried to open it), the importer fails to open the file with an indexing error:

java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
	at java.util.ArrayList.rangeCheck(ArrayList.java:657)
	at java.util.ArrayList.get(ArrayList.java:433)
	at loci.formats.in.OMETiffReader.initFile(OMETiffReader.java:691)
	at loci.formats.FormatReader.setId(FormatReader.java:1392)
	at loci.formats.ImageReader.setId(ImageReader.java:849)
	at io.scif.bf.BioFormatsFormat$Parser.typedParse(BioFormatsFormat.java:472)
	at io.scif.bf.BioFormatsFormat$Parser.typedParse(BioFormatsFormat.java:450)
	at io.scif.AbstractParser.parse(AbstractParser.java:244)
	at io.scif.AbstractParser.parse(AbstractParser.java:314)
	at io.scif.AbstractParser.parse(AbstractParser.java:53)
	at io.scif.AbstractReader.setSource(AbstractReader.java:271)
	at io.scif.services.DefaultInitializeService.initializeReader(DefaultInitializeService.java:91)
	at io.scif.img.ImgOpener.createReader(ImgOpener.java:483)
	at io.scif.img.ImgOpener.openImgs(ImgOpener.java:242)
	at io.scif.services.DefaultDatasetIOService.open(DefaultDatasetIOService.java:152)
	at io.scif.services.DefaultDatasetIOService.open(DefaultDatasetIOService.java:133)
	at io.scif.io.DatasetIOPlugin.open(DatasetIOPlugin.java:80)
	at io.scif.io.DatasetIOPlugin.open(DatasetIOPlugin.java:52)
	at org.scijava.io.DefaultIOService.open(DefaultIOService.java:69)
	at HandleExtraFileTypes.openImage(HandleExtraFileTypes.java:528)
	at HandleExtraFileTypes.run(HandleExtraFileTypes.java:76)
	at ij.IJ.runUserPlugIn(IJ.java:235)
	at ij.IJ.runPlugIn(IJ.java:198)
	at ij.IJ.runPlugIn(IJ.java:187)
	at ij.io.Opener.openWithHandleExtraFileTypes(Opener.java:512)
	at ij.io.Opener.openUsingHandleExtraFileTypes(Opener.java:369)
	at ij.io.Opener.openImage(Opener.java:359)
	at ij.io.Opener.openImage(Opener.java:243)
	at ij.io.Opener.open(Opener.java:109)
	at ij.io.Opener.openAndAddToRecent(Opener.java:292)
	at ij.plugin.DragAndDrop.openFile(DragAndDrop.java:194)
	at ij.plugin.DragAndDrop.run(DragAndDrop.java:160)
	at java.lang.Thread.run(Thread.java:748)

I anticipate that users will want to isolate and work on the individual series files, so perhaps it's better if they're fully decoupled from each other? In a perfect world perhaps the reader would check for access to the sub-files before listing them as options for display.

As another smaller point, could we have the name generator strip .ome.tif(f) from the output filename when in split mode? Currently you end up with output.ome.tif_s0.ome.tiff. This would hopefully be more intuitive since non-split mode requires the extension.
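
On the missing-sibling failure above, a hypothetical user-side pre-flight check might look like the sketch below; the _s<N> naming follows the convention proposed in this PR, and none of this is raw2ometiff or Bio-Formats code:

```java
import java.io.File;

public class CheckSiblingSeries {
  public static void main(String[] args) {
    File opened = new File(args[0]);   // e.g. /data/output_s2.ome.tiff
    // strip the hypothetical _s<N>.ome.tif(f) suffix to recover the shared prefix
    String prefix = opened.getName().replaceAll("_s\\d+\\.ome\\.tiff?$", "");
    File[] siblings = opened.getParentFile().listFiles((dir, name) ->
      name.startsWith(prefix + "_s") &&
      (name.endsWith(".ome.tiff") || name.endsWith(".ome.tif")));
    int found = siblings == null ? 0 : siblings.length;
    System.out.println(found + " series file(s) found alongside " + opened.getName());
  }
}
```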

@melissalinkert (Member Author)

Conflicts fixed - this is now ready for review again.

@chris-allan (Member)

@sbesson / @melissalinkert: Let's make sure we select a good cross-section of data to run some representative conversions against, with and without this PR, to check that:

  • Performance is not adversely affected
  • Conversion without --split is byte identical

@sbesson (Member) commented Feb 23, 2023

Similarly to the testing of #82 and similar recent efforts, I would propose the following samples as a starting point for the assessment:

This should cover a broad range of multi-image filesets across several acquisition modalities. Anything else obvious I might be missing, @melissalinkert?

@melissalinkert (Member Author)

This should cover a broad range of multi-image filesets across several acquisition modalities. Anything else obvious I might be missing, @melissalinkert?

I think that covers all of the multi-Image cases. It might also be interesting to test an even larger plate - something like 1536 wells x 20 fields, or the largest we've encountered so far. I don't think we have a public example of that, but a fake plate might be OK in that case since the idea is to see what happens when tens of thousands of files are created.
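
As a sketch of how such a synthetic plate could be sized up without any real data, assuming the Bio-Formats fake-file syntax (the plates/plateRows/plateCols/fields keys below are my recollection of that syntax, not anything defined by this PR); 32 x 48 wells x 20 fields gives 30720 series:

```java
import java.io.File;
import loci.formats.ImageReader;

public class FakePlateSize {
  public static void main(String[] args) throws Exception {
    // a fake file only needs to exist by name; its dimensions are encoded in the name
    String name = "plate&plates=1&plateRows=32&plateCols=48&fields=20.fake";
    new File(name).createNewFile();
    ImageReader reader = new ImageReader();
    reader.setId(name);
    System.out.println("fields of view (series): " + reader.getSeriesCount());
    reader.close();
  }
}
```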

@sbesson (Member) commented Mar 9, 2023

Capturing a few preliminary results from the testing of this functionality:

  • without the --split option, the NGFF -> OME-TIFF writing times with this PR included are consistent with the writing times of raw2ometiff 0.4.1 across all samples

  • with the --split option, there might be an increase in writing time depending on the samples and especially the number of series/files

  • possibly the biggest issue encountered is that the conversion of the large_plate NGFF (1536 wells x 20 fields) failed with the default settings with:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
         at java.util.Arrays.copyOfRange(Arrays.java:3664)
         at java.lang.StringBuffer.toString(StringBuffer.java:669)
         at loci.common.xml.XMLTools.getXML(XMLTools.java:261)
         at loci.formats.services.OMEXMLServiceImpl.getOMEXML(OMEXMLServiceImpl.java:494)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.makeIFD(PyramidFromDirectoryWriter.java:1187)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.convertPyramid(PyramidFromDirectoryWriter.java:893)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.convertToPyramid(PyramidFromDirectoryWriter.java:879)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:253)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.call(PyramidFromDirectoryWriter.java:101)
         at picocli.CommandLine.executeUserObject(CommandLine.java:1783)
         at picocli.CommandLine.access$900(CommandLine.java:145)
         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2150)
         at picocli.CommandLine$RunLast.handle(CommandLine.java:2144)
         at picocli.CommandLine$RunLast.handle(CommandLine.java:2108)
         at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:1968)
         at picocli.CommandLine.parseWithHandlers(CommandLine.java:2349)
         at picocli.CommandLine.parseWithHandler(CommandLine.java:2284)
         at picocli.CommandLine.call(CommandLine.java:2560)
         at com.glencoesoftware.pyramid.PyramidFromDirectoryWriter.main(PyramidFromDirectoryWriter.java:223)

Discussing with @melissalinkert, a related issue is that the current implementation writes the same metadata in the header of each TIFF file. For large HCS data, this can become a significant overhead in terms of both performance and disk space.

As a potential compromise, a proposal would be to update the current implementation so that data generated with the --split option is stored according to the partial OME-XML metadata specification, using a companion file containing the OME-XML metadata - see here for a real-life example of this specification in use.
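
For illustration, a sketch of the small metadata block each split file could carry under that proposal; the companion file name and the UUIDs below are placeholders, and this is only my reading of the partial OME-XML (BinaryOnly) layout rather than the code in this PR:

```java
import java.util.UUID;

public class BinaryOnlySketch {
  public static void main(String[] args) {
    // UUID identifying this particular split TIFF (goes on the OME element)
    String fileUuid = "urn:uuid:" + UUID.randomUUID();
    // UUID identifying the companion metadata file (goes on BinaryOnly)
    String companionUuid = "urn:uuid:" + UUID.randomUUID();
    // Each split OME-TIFF would carry only this small block in its
    // ImageDescription; the full OME-XML lives in the companion file.
    String binaryOnly =
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
      "<OME xmlns=\"http://www.openmicroscopy.org/Schemas/OME/2016-06\"\n" +
      "     UUID=\"" + fileUuid + "\">\n" +
      "  <BinaryOnly MetadataFile=\"output.companion.ome\"\n" +
      "              UUID=\"" + companionUuid + "\"/>\n" +
      "</OME>";
    System.out.println(binaryOnly);
  }
}
```

The full OME-XML would then be written once to the companion file instead of being repeated in every TIFF header.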

As noted in the comments, this is primarily to save disk space
(and maybe a bit of write time) for large plates which might have
thousands of files when converted with --split.
@melissalinkert (Member Author)

3167d51 should now correctly create the companion file/BinaryOnly OME-TIFFs, and the --split test has been updated accordingly. This is ready for re-testing, @sbesson.

@sbesson (Member) left a comment

Restarted the conversion tests using the last commit against the series of samples. While the timings are largely identical for datasets with a few series, there seems to be some performance regression as the number of series increases.

Using 3 consecutive measurements, the NIRHTa+001 NGFF plate takes 86min, 118min and 85min to convert into a single OME-TIFF and 207min, 205min and 187min to convert into a multi-file OME-TIFF.

For the plate with 30K fields of view, the conversion to a single file OME-TIFF takes 1557min and the conversion to a multi-file OME-TIFF takes 2125min (but completes without exceptions).

@melissalinkert (Member Author)

I think 9cb9421 will help, but I'd still expect --split to be slower and for the time difference to increase with the number of series. ~10 hours across 1536x20 series is obviously out of line, but no matter what --split needs to write an extra header and generate/write a small OME-XML block for every series. Even a small consistent difference of 50ms per series would add up to ~26 minutes for the 1536x20 field plate.
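
For reference, a back-of-envelope version of that estimate; the 50 ms figure is just the assumed fixed per-series cost mentioned above:

```java
public class SplitOverheadEstimate {
  public static void main(String[] args) {
    int series = 1536 * 20;       // 30720 output files with --split
    double perSeriesMs = 50.0;    // assumed extra header + OME-XML cost per file
    double minutes = series * perSeriesMs / 1000.0 / 60.0;
    System.out.printf("~%.1f extra minutes%n", minutes);   // prints ~25.6
  }
}
```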

@sbesson (Member) commented Mar 28, 2023

With the last commit, I am still seeing a significant overhead due to the metadata writing. The full logs at debug level:

with the relevant part being the last lines:

% tail -n 5 *.log
==> NIRHTa+001_default.log <==
2023-03-24 21:13:54,675 [pool-2304-thread-6] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 81797 compressed bytes to NIRHTa+001.ome.tiff at 2830557771
2023-03-24 21:13:54,675 [pool-2304-thread-6] INFO  org.perf4j.TimingLogger - start[1679692434668] time[6] tag[writeTile]
2023-03-24 21:13:54,677 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 325464 compressed bytes to NIRHTa+001.ome.tiff at 2830639568
2023-03-24 21:13:54,678 [pool-2304-thread-3] INFO  org.perf4j.TimingLogger - start[1679692434666] time[11] tag[writeTile]
2023-03-24 22:26:42,624 [main] INFO  org.perf4j.TimingLogger - start[1679692383152] time[4419472] tag[convertToPyramid]

==> NIRHTa+001_split.log <==
2023-03-24 17:24:27,294 [pool-2304-thread-3] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 325464 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 575996
2023-03-24 17:24:27,295 [pool-2304-thread-3] INFO  org.perf4j.TimingLogger - start[1679678667283] time[11] tag[writeTile]
2023-03-24 17:24:27,295 [pool-2304-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 269584 compressed bytes to NIRHTa+001/NIRHTa+001_s2303.ome.tiff at 901460
2023-03-24 17:24:27,295 [pool-2304-thread-2] INFO  org.perf4j.TimingLogger - start[1679678667280] time[15] tag[writeTile]
2023-03-24 20:51:02,428 [main] INFO  org.perf4j.TimingLogger - start[1679678611862] time[12450565] tag[convertToPyramid]

==> large_plate_split.log <==
2023-03-26 05:04:50,153 [pool-30720-thread-2] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 7010 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 16
2023-03-26 05:04:50,153 [pool-30720-thread-2] INFO  org.perf4j.TimingLogger - start[1679807090152] time[1] tag[writeTile]
2023-03-26 05:04:50,155 [pool-30720-thread-1] DEBUG c.g.p.PyramidFromDirectoryWriter -     writing 47140 compressed bytes to large_plate/large_plate_s30719.ome.tiff at 7026
2023-03-26 05:04:50,155 [pool-30720-thread-1] INFO  org.perf4j.TimingLogger - start[1679807090151] time[3] tag[writeTile]
2023-03-26 19:13:06,140 [main] INFO  org.perf4j.TimingLogger - start[1679806913217] time[51072923] tag[convertToPyramid]

That's still an overhead of a few seconds per output file, which will lead to huge time increases for datasets in the 1K-100K files range. Would it be useful to regenerate these logs with additional debug statements, e.g. in the new writeTIFFHeader logic, to identify the bottleneck?
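
For what it's worth, a sketch of what such extra timing statements could look like, following the perf4j pattern that the existing writeTile timings in the logs appear to use; this is not the actual raw2ometiff code, and only the writeTIFFHeader name comes from the comment above:

```java
import org.perf4j.StopWatch;
import org.perf4j.slf4j.Slf4JStopWatch;

public class HeaderTimingSketch {

  void writeTIFFHeader() {
    StopWatch timer = new Slf4JStopWatch("writeTIFFHeader");
    try {
      // ... existing header/IFD writing would happen here ...
    }
    finally {
      // emits a TimingLogger line like the writeTile entries above
      timer.stop();
    }
  }

  public static void main(String[] args) {
    new HeaderTimingSketch().writeTIFFHeader();
  }
}
```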

@melissalinkert (Member Author)

Based on the logs, it looks like the main difference is in writeIFDs, after all of the pixel data is written. For NIRHTa+001, writeIFDs alone is ~1h13m vs ~3h27m. I haven't yet been able to reproduce times anywhere near that, using 9cb9421 with https://downloads.openmicroscopy.org/images/OME-TIFF/2016-06/BBBC017/single-file/ (which was converted with bioformats2raw 0.6.1 and default options). I see total raw2ometiff times under 25 minutes in this case, with or without --split.

a617ff8 adds some extra stopwatches in and around writeIFDs, and the corresponding --split log when I run this locally is nirht-split.log.gz.

@sbesson (Member) commented May 4, 2023

Coming back finally to this. I managed to run a performance assessment of this PR using the sample files mentioned in #80 (comment).

The conversions were executed without this PR using 0.4.1, with this PR using the default (single file mode) and with this PR using the split option. Each file was converted three times to account for file system caching except for the large_plate which takes a very long time.

Additionally, tests were performed on two OpenStack VMs, one using SSD storage and the other using non-SSD storage.

With SSD storage, the metrics were as follows:

raw2ometiff 0.4.1

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 1337m28.407s | | |
| Leica-1.ome.zarr | 0m55.755s | 0m31.082s | 0m31.415s |
| Leica-2.ome.zarr | 1m45.399s | 1m16.079s | 1m25.103s |
| Leica-3.ome.zarr | 2m37.039s | 1m53.134s | 1m48.959s |
| LuCa-7color_Scan1.ome.zarr | 0m52.752s | 0m31.807s | 0m32.470s |
| NIRHTa+001.ome.zarr | 48m10.809s | 47m34.109s | 47m13.373s |

raw2ometiff 0.5.0-SNAPSHOT

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 1349m26.828s | | |
| Leica-1.ome.zarr | 0m50.247s | 0m29.958s | 0m29.982s |
| Leica-2.ome.zarr | 1m35.226s | 1m7.496s | 1m19.366s |
| Leica-3.ome.zarr | 2m36.889s | 1m36.601s | 1m39.665s |
| LuCa-7color_Scan1.ome.zarr | 0m50.233s | 0m35.132s | 0m36.562s |
| NIRHTa+001.ome.zarr | 45m20.226s | 45m20.226s | 46m25.964s |

raw2ometiff 0.5.0-SNAPSHOT --split

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 1386m29.097s | | |
| Leica-1.ome.zarr | 0m51.300s | 0m29.229s | 0m29.918s |
| Leica-2.ome.zarr | 1m45.831s | 1m14.594s | 1m14.296s |
| Leica-3.ome.zarr | 2m52.555s | 1m43.229s | 1m39.658s |
| LuCa-7color_Scan1.ome.zarr | 1m1.343s | 0m32.355s | 0m31.770s |
| NIRHTa+001.ome.zarr | 48m20.691s | 47m37.830s | 48m32.093s |

Without SSD storage, the metrics were as follows:

raw2ometiff 0.4.1

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 1548m54.536s | | |
| Leica-1.ome.zarr | 3m39.723s | 0m48.079s | 0m56.267s |
| Leica-2.ome.zarr | 7m14.477s | 1m57.519s | 1m50.336s |
| Leica-3.ome.zarr | 14m20.850s | 4m38.970s | 4m37.677s |
| LuCa-7color_Scan1.ome.zarr | 4m47.735s | 2m45.600s | 0m54.907s |
| NIRHTa+001.ome.zarr | 94m8.806s | 101m7.020s | 95m51.124s |

raw2ometiff 0.5.0-SNAPSHOT

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 1625m44.363s | | |
| Leica-1.ome.zarr | 3m50.656s | 1m2.817s | 1m0.950s |
| Leica-2.ome.zarr | 10m24.327s | 2m36.257s | 2m25.496s |
| Leica-3.ome.zarr | 39m25.295s | 6m26.659s | 2m47.872s |
| LuCa-7color_Scan1.ome.zarr | 3m30.630s | 0m59.561s | 0m59.529s |
| NIRHTa+001.ome.zarr | 92m4.332s | 91m8.011s | 98m32.255s |

raw2ometiff 0.5.0-SNAPSHOT --split

| Sample | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| large_plate.ome.zarr | 2086m32.042s | | |
| Leica-1.ome.zarr | 3m23.873s | 1m0.789s | 0m57.108s |
| Leica-2.ome.zarr | 10m22.659s | 6m11.889s | 5m29.050s |
| Leica-3.ome.zarr | 17m1.730s | 3m28.244s | 3m6.595s |
| LuCa-7color_Scan1.ome.zarr | 4m2.707s | 0m51.477s | 1m35.233s |
| NIRHTa+001.ome.zarr | 208m35.649s | 177m36.592s | 186m12.939s |

The main outcome of this testing is that, for performant storage, this PR does not add any significant overhead in either single-file or multi-file mode.

However, for less performant storage there is definitely a degradation of the conversion times, which likely scales with the number of files to be written and can increase the conversion time by as much as 100% in some cases.
At minimum, we might want to add a note/warning about the potential impact of using the --split option.

Next: I will run some tests on the converted files to check that the output is byte identical. Initial thought is to make use of upstream's https://github.com/ome/bioformats/blob/v6.13.0/components/test-suite/src/loci/tests/testng/EquivalentPixelsTest.java. @melissalinkert how easily do you think this could be converted into a test utility similarly to ome/bioformats#3876 ?

melissalinkert added a commit to melissalinkert/bioformats that referenced this pull request May 5, 2023
@melissalinkert (Member Author)

Initial thought is to make use of upstream's https://github.com/ome/bioformats/blob/v6.13.0/components/test-suite/src/loci/tests/testng/EquivalentPixelsTest.java. @melissalinkert how easily do you think this could be converted into a test utility similarly to ome/bioformats#3876 ?

The main issue is that it depends on TestNG at the moment, so we either need some build changes or to split it into separate classes if we want both test and command-line functionality. https://github.com/melissalinkert/bioformats/commits/equivalent-pixels-tool is a place to start, but that's on top of the work already in ome/bioformats#3876, so I will hold off on a PR until that's merged.
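
For context, a rough sketch of the kind of standalone pixel-equivalence check being discussed, using only the Bio-Formats reader API (independent of the TestNG-based EquivalentPixelsTest); the two paths are whatever converted outputs are being compared:

```java
import java.util.Arrays;
import loci.formats.ImageReader;

public class ComparePixels {
  public static void main(String[] args) throws Exception {
    try (ImageReader a = new ImageReader(); ImageReader b = new ImageReader()) {
      a.setId(args[0]);   // e.g. the single-file output.ome.tiff
      b.setId(args[1]);   // e.g. one file from the --split output set
      if (a.getSeriesCount() != b.getSeriesCount()) {
        throw new IllegalStateException("series count differs");
      }
      for (int s = 0; s < a.getSeriesCount(); s++) {
        a.setSeries(s);
        b.setSeries(s);
        for (int p = 0; p < a.getImageCount(); p++) {
          // compare each plane byte-for-byte
          if (!Arrays.equals(a.openBytes(p), b.openBytes(p))) {
            throw new IllegalStateException("series " + s + " plane " + p + " differs");
          }
        }
      }
      System.out.println("all planes match");
    }
  }
}
```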

@sbesson (Member) left a comment

Finally managed to do the pixel tests using the upstream test-equivalent target. The only exception was the large_plate sample, as the command failed with out-of-memory exceptions even after raising the memory limit considerably.

For all the other targets, the pixel tests completed successfully for all series/resolutions. I'll perform a final functional test by importing 2 samples into OMERO for visual inspection. Afterwards, we should be able to get this into the upcoming raw2ometiff release.

@melissalinkert (Member Author)

Final call for comments, @DavidStirling / @chris-allan. Plan to merge this on Tuesday (May 30) unless any objections.

@melissalinkert merged commit 9cc8db5 into glencoesoftware:master on May 30, 2023
3 checks passed