New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fileset indexing: address performance issues and add configurability #156
Conversation
This commit introduces some logic to define a cut-off for indexing filesets associated with images. A new `FullTextBridge` constructor is introduced as well as a new server property that is used in the service definition file. The primary motivation for introducing this mechanism is to handle the case of multi-image filesets where set_fileset will scale as nImages x nFilesetEntries and effectively prevent the indexer to catch-up in the most advanced scenarios like HCS data.
👍 but personally I'd go higher than 1. Doing a little counting on IDR of filesets with fewer than a thousand entries:
and less than 50:
query
|
Thanks for the feedback. 66f22be increases the default value to 10 which should minimally capture several scenarios where a fileset is constituted of a master file associated with a few ancillary files - see https://bio-formats.readthedocs.io/en/latest/formats/dataset-table.html. |
Seems like an ok compromise. One more question: was adding the first |
Not really. Quite possibly, the current default might be sufficient avoid the performance degradation but I would need to re-run some reindexing measurements to be sure It might also be useful to find examples and discuss the expectation of this feature. Thinking of the HCS case again, my feeling is that there is a lot of redundancy in the fileset entries and indexing the first fileset entry might be sufficient. |
nods certainly the logic "if an insane number of files, only index the first" feels better than "if an insane number of files, index nothing". |
@sbesson the log is unfortunately full of timestamps, would you maybe advise a foolproof log text which indicates the start of reindexing ? Such as, if I run
What exactly should appear in the log as a result of this command ? |
aha, maybe
which appeared after I have now a local server and am watching the Indexer log in another terminal with |
OK, so I will take the time between the
So, without your PR on a fresh DB on the multi-file I have |
With this PR included, using the artifact on merge-ci https://merge-ci.openmicroscopy.org/jenkins/job/OMERO-build/lastSuccessfulBuild/artifact/src/target/OMERO.server-5.6.3-313-290ff28-ice36-b1472.zip multi-file reindexing report:
So this is 15:00:48 minus 5:01:43, which is 55 secs. This is improvement against the release server, albeit not as dramatic as reported in the header. Possibly I am counting some events which might not be taken as "pure reindexing" ? |
@pwalczysko I think the first entries in the indexer logs are really about the service initialization so I have been ignoring them for my measurements. I used the first occurrence of |
@sbesson Could you advise on a case
It seems to me that as I did Ad 2. above, I am having approx. double indexing times than in case Ad 1. above.
Hence, is Ad 2. a valid test ? If no, this seems interesting - is there something |
The indexer logic uses the eventlog table and goes chronologically through the series of DB events (inserts,deletes,updates). In the Indexer logs, you can entries indicated that objects are being inserted, then purged, then inserted into the index. This corresponds to your sequence of events in terms of import. The increase of the indexing time is definitely expecting. I haven't evaluated how the indexing time of a deleted plate correlates with a non-deleted plate and the two-fold factor might be largely a coincidence. |
With a fresh DB, with this PR, with the
This is 12:15:16 - 12:14:58 which makes 18 secs. This matches the reindexing time for the |
On release server, without this PR, on
This is 15:21:26 minus 15:20:47 which is 39 sec. |
Based on the benchmarking summary above, lgtm. |
this.files = files; | ||
this.parsers = parsers; | ||
this.classes = bridgeClasses == null ? new Class[] {} : bridgeClasses; | ||
this.maxFilesetSize = maxFilesetSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default in the constructor above is set to 1
.
This could allow to pass any value including negative one. Maybe add a check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you and/or anyone else who was involved in the initial review of the feature foresee a situation where someone would like to preserve the current behavior i.e. index all files in a fileset irrespectively of its size?
If so, -1
(or any negative value?) could be used as a mechanism to skip the fileset size check, with all associated warnings re performance implications. Otherwise, I can make this default to 1 if it's negative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't involved in the initial review or discussion around the feature back in late 2019. However, purely from a technical perspective I can see justification for leaving this open ended. If we add range checks here in the constructor they will apply to all subclasses and that seems like a bad design decision. There are other ways of doing this with Spring such as using properties rather than constructor arguments but I don't know if that's needed here.
Sorry for coming in late.
I don't feel particularly strongly about it but is this actually better from a user perspective? The decision to index Maybe @pwalczysko, @erindiel, @emilroz, and @muhanadz can share their experience in communicating these concepts to users to help us make a good decision here. |
Co-authored-by: jean-marie burel <j.burel@dundee.ac.uk>
Thank you @chris-allan for the heads up. My initial assumption here was that the search results will not change and that for example searching for a But reading @chris-allan comment, and re-reading the header of the PR
I seem to lean to an understanding that the fileset with the Hence my questions: Could you please clarify how the example above is affected (is expected to be affected) by the present PR ? What entries in the fileset would a hypothetical user be interested in ? The motivation of this comment is:
Thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address #156 (comment) questions. Happy to have a call if necessary.
@pwalczysko this change only affects the fileset indexing. The example from the original PR (#57 (comment)) was to import With this PR, if you were to have a fileset composed of a dv + log file under |
No, thank you, I don't think this necessary. I was able to repeat the scenario you kindly suggested by importing a plate https://merge-ci.openmicroscopy.org/web/webclient/?show=plate-66055 from a local path containing the term In total, thank you for these examples and explanations, I am not worried anymore about the user experience, the performance win far outweights a confusion (if any). The main question users ask is Suggested actions:
|
Thanks for the work on this. Weighing in, from my experience in a few sessions with non technical users, it is fairly challenging to explain unexpected search results, such as matches from key-value pair or OMERO tables, and training them to use appropriate filters to narrow down those results. |
chris-allan commented last week:
I think your question, @chris-allan, is whether or not to do this at all which matches @muhanadz' sentiments. But if we are doing it anywhere, then I think it is more consistent to do it for some of the large filesets rather than none which is hard to explain. I'd suggest we move the larger question elsewhere. |
@joshmoore anything more from your side? Okay to merge and release? |
It seems #156 (comment) is still unresolved. |
In the absence of requested changes, I was working under the assumption that the state of this PR was a good first step to address the outstanding issue of the broken indexer for HCS deployments and that the discussion was expected to be moved elsewhere. If there is still an outstanding action that prevents this from getting merged, let's be more specific. At the moment, this introduces a hard cut-off based on the fileset size. In terms of the behavior, I see three possibilities:
Writing these down, I certainly feel that 2. is the hardest to communicate, so i would propose discard it. @jburel @joshmoore @pwalczysko are you proposing we implement option 3? If so, I can push a commit today and this should be ready for final review tomorrow. Note that in the use case where having the fileset metadata in the search index is not desirable (as described by @muhanadz) above, setting |
My order of preference would be: 3 > 2 > 1, since I find the current implementation harder to communicate. 👍 for having an option to disable. |
No, I was in favor of option 1. Afai understand it, only option 1 means not to do any more changes here, and this is what I am for. Option 3. is I think attractive because it does feel somewhat more thourough than option 1, but, option 3 is imho not easier to explain than 1, rather the opposite, if a mental state of a UI-bound user is considered - imho, such user will drop the effort to understand (any) indexing concepts much earlier than anticipated in this discussion. On a practical level, please consider that in case a retest of the benchmarking #156 (comment) is asked for, either a long wait or some splitting of the burden between the reviewers should be considered. The #156 (comment) was neither easy nor a short task. |
As discussed at the weekly meeting, pushed additional changes to implement behavior 3 as described in #156 (comment). 6d8d975 also modifies the injection to use a property rather than an additional constructor argument and simplify the code diff. Tested the last commit by importing the multi-file BBBC017 plate and re-indexing the database with various values of
|
With default config, and whilst searches for ""field_N"" where N = 0,1,...8 returned results, I have
Edit: This is so also with the new commit db7a913 included. |
Reindexing speed with the last commit included, on the multi-plate
This is 21:19:41 minus 21:19:19 which makes 22 secs. |
So, on the speeds, I think I have the two comparable numbers (both on multi-plate and both with this PR and both with default settings) as I think that the tiny loss of 3 secs is okay, @sbesson @joshmoore |
With setting the parameter |
Repeated also all three tests (the bullet points) in #156 (comment) - all as expected. In summary, the new benchmarking shows that the speedup is acceptable imho and behaviour of the features is as expected. |
Context
As part of the OMERO 5.5.0 release, the indexing logic was fully rewritten to use a new
FullTextIndexer2
(initial work in #10). While performing with a background reindexing as described in the documentation, we noticed the reindexing progress was extremely slow roughly with a rate of ~1%/day, meaning the process would take months to complete. Additionally, this trend suggests that even on completion, the indexer would be unlikely to catch-up with new imports.The system where the issue was identified is almost exclusively composed of HCS data. After some investigation, the source of the performance issue was traced down to the features introduced in #57, more specifically the indexing of filesets associated with images. The review suggests that most of the testing was carried out either with single-file filesets or filesets with a few entries. Under these conditions, the new operation will add a manageable overhead for every image indexed. However, when dealing with multi-image filesets, this operation will typically scale as n_images x m_entries. This becomes particularly critical in the high-content screening domain where plates are routinely composed of 1-10K images and 10-100K fileset entries. There are three related problems:
FullText
under the binary repositoryr01c01f01p01-ch1sk1fk1fl1.tiff
which will be associated to all images in each plateProposal
To mitigate the HCS issue without fully disabling the functionality introduced previously, this PR introduces a server side property
omero.search.max_fileset_size
which defines a cut-off above which fileset entries are not indexed. The default value is arbitrarily set to 1 i.e. only single-file filesets will be indexed but is up for discussion.Testing
Data-wise, the BBBC017 plate is a good representative sample for reproducing this issue and assessing the proposed solution. This is a fairly standard plate (384 wells, 6 fields of views/well) that has been converted into two variants of OME-TIFF: single file and multi-file with companion - see https://downloads.openmicroscopy.org/images/OME-TIFF/2016-06/BBBC017/. Both datasets are strictly identical in terms of metadata and only differ by their number of constituent files (1 vs 2304).
Testing should ideally be carried out from an empty database or from a populated one after recording the base reindexing time. After importing each variant of the plate mentioned above, the database should be reindexing following the instructions of the documentation and the reindexing time should be measured using the timestamps in the
Indexer-0.log
.My own measurements are given below for reference but the reindexing performance might vary depending on the system. The absolute reindexing values are less important than the ratio between the reindexing time between a single-file plate and a multi-file plate.
Without this PR, the reindexing time were found to be 15s for the single-file OME-TIFF plate versus 180s for the multi-file OME-TIFF plate.
With this PR included, the reindexing time should be identical and equal to 15s for both plates.
/cc @chris-allan @stick