This repository has been archived by the owner on Nov 9, 2022. It is now read-only.
Matching just 1,000 documents (excluding merging) takes ~1 min 11 s, whereas it should take ~111 ms. Matching needs to scale to 1 million+ documents. Adding threads does not solve the problem, because the problem is in the design. The code should use the results of cts:value-tuples to determine the cts:uris corresponding to matches, instead of running cts:search on each provided URI.
For example, pass cts:path-reference("memberId") and cts:and-not-query(cts:collection-query("mdm-content"), cts:collection-query("deleted")) to cts:value-tuples. Then use each returned value to build one query per value, such as cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "1"))), cts:collection-query("deleted")) and cts:and-not-query(cts:and-query((cts:collection-query("mdm-content"), cts:path-range-query("memberId", "=", "2"))), cts:collection-query("deleted")), and pass each of those queries to cts:uris to find the matching URIs.
I tested matching performance by calling proc:process-match-and-merge-with-options($uris, $mergeOptions, cts:true-query()) and stopping before the merge-impl:save-merge-models-by-uri call in that function, simply returning the consolidated matches.
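As a rough illustration of the lexicon-driven approach described above, here is a minimal sketch. It is not the poster's code and not the library's implementation. It assumes a path range index exists on memberId, reuses the collection names from the example, and adds one assumption of its own: a cts:frequency filter that skips values occurring in only one fragment, since those cannot have a match partner.

```xquery
xquery version "1.0-ml";

(: Sketch only. Assumes a path range index on "memberId". :)
(: Scope: live content, excluding logically deleted documents. :)
let $scope :=
  cts:and-not-query(
    cts:collection-query("mdm-content"),
    cts:collection-query("deleted"))

(: Read the distinct memberId values from the index, not from documents. :)
for $tuple in cts:value-tuples(cts:path-reference("memberId"), (), $scope)
let $member-id := json:array-values($tuple)[1]
(: Assumption: a value seen in a single fragment has nothing to match. :)
where cts:frequency($tuple) gt 1
return
  (: One map per candidate group: memberId -> URIs that share it. :)
  map:entry(
    xs:string($member-id),
    cts:uris((), (),
      cts:and-not-query(
        cts:and-query((
          cts:collection-query("mdm-content"),
          cts:path-range-query("memberId", "=", $member-id))),
        cts:collection-query("deleted"))))
```

Each returned map holds one candidate match group. Both the value enumeration and the URI lookup are resolved from indexes, so no document is loaded during matching. Fuzzy matching or multi-property conditions would need additional references and queries that this sketch does not cover.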
The 1.3.1 release will allow index references to be associated with properties in the match options, so that range indexes can be used with queries.
Some additional minor optimizations were also made, but significant optimization will require outside orchestration. Data Hub 5 provides orchestration for batching the process, and it is on our roadmap to adjust the logic to avoid both duplicate work and deadlocks for best performance.
I will have to test this. I am very skeptical that you can achieve fast, memory-efficient matching performance for 1 million+ documents without using cts:value-tuples. You say "significant optimization will require outside orchestration." I can't post our code, but I achieved the ~111 ms I mentioned using a single-threaded, configurable method of a little more than 100 lines of XQuery. This method handles configurable fuzzy matching and any combination of and/or conditions for matching, and it doesn’t require passing the actual URIs to be matched, only the conditions and collections for matching. In other words, outside orchestration is not required: highly configurable, highly performant, scalable matching for millions of documents is attainable.