Matching is too slow for 100K+ documents #317

anthony-gray · 2019-05-22T16:48:29Z

Matching just 1,000 documents (excluding merging) takes ~1 min 11 s, whereas it should take ~111 ms. Matching needs to scale to 1 million+ documents. Adding threads does not solve the problem; this is a design issue.

The code should use the results of cts:value-tuples to determine the cts:uris corresponding to matches, instead of running cts:search calls on the provided URIs. For example, pass cts:path-reference("memberId") and cts:and-not-query(cts:collection-query("mdm-content"), cts:collection-query("deleted")) to cts:value-tuples, then use the result to produce cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "1"))), cts:collection-query("deleted")) and cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "2"))), cts:collection-query("deleted")), each of which gets passed to cts:uris to find the matching URIs.

I tested matching performance by calling proc:process-match-and-merge-with-options($uris, $mergeOptions, cts:true-query()) and stopping before the merge-impl:save-merge-models-by-uri call in that function by simply returning the consolidated matches.
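A minimal sketch of the lexicon-driven approach described above, for illustration only. It is not the code from Smart Mastering or from the commenter's own implementation. It assumes a path range index on memberId already exists, reuses the collection names from the example, and adds one assumption that is not in the original: a value counts as a match only when two or more documents share it.

```xquery
xquery version "1.0-ml";

(: Scope from the example above: mdm-content documents that are not deleted. :)
let $scope :=
  cts:and-not-query(
    cts:collection-query("mdm-content"),
    cts:collection-query("deleted")
  )
(: Read distinct memberId values straight from the range index.
   Assumes a path range index on memberId is configured. :)
for $tuple in cts:value-tuples(cts:path-reference("memberId"), (), $scope)
let $member-id := json:array-values($tuple)[1]
(: Build the per-value query shown in the example above. :)
let $match-query :=
  cts:and-not-query(
    cts:and-query((
      cts:collection-query("mdm-content"),
      cts:path-range-query("memberId", "=", $member-id)
    )),
    cts:collection-query("deleted")
  )
(: Resolve URIs from the URI lexicon; no documents are fetched. :)
let $uris := cts:uris((), (), $match-query)
(: Assumption: only values shared by two or more documents are matches. :)
where fn:count($uris) gt 1
return
  map:entry(fn:string($member-id), $uris)
```

Each returned map pairs one memberId value with the URIs that share it. The point of the sketch is that both steps are answered from indexes (the range index and the URI lexicon), so no cts:search call per input URI is needed.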

ryanjdew · 2019-07-11T16:52:14Z

The 1.3.1 release will allow index references to be associated with properties in the match options to allow range indexes to be used with queries.

Also, some additional minor optimizations were made, but significant optimization will require outside orchestration. Data Hub 5 provides orchestration for batching the process and we have it on our roadmap to adjust the logic to both avoid duplicate work and avoid deadlocks for best performance.

anthony-gray · 2019-07-12T00:15:37Z

I will have to test this. I am very skeptical that you can achieve fast, memory-efficient matching performance for 1 million+ documents without using cts:value-tuples. You say "significant optimization will require outside orchestration." I can't post our code, but I achieved the ~111 ms I mentioned using a single-threaded, configurable method of a little more than 100 lines of XQuery. This method handles configurable fuzzy matching and any combination of and/or conditions for matching, and it doesn't require one to pass the actual URIs to be matched, only the conditions and collections for matching. All of this is to say that outside orchestration is not required, and that highly configurable, highly performant, scalable matching for millions of documents is attainable.

ryanjdew added this to the 1.3.1 milestone Jun 17, 2019

ryanjdew added a commit that referenced this issue Jun 20, 2019

Adds range index support for matching and helps GH issues #317 #320 #321

69b42ca

ryanjdew closed this as completed Jul 11, 2019

ryanjdew mentioned this issue Jul 11, 2019

Smart Mastering v1.3.1 release #325

Merged

debashissinha mentioned this issue Aug 12, 2019

Smart mastering issue for mutplie document matches taking long time marklogic/marklogic-data-hub#2628

Closed
