
DOMHandle reuse fields like LSParser #1054

@maffelbaffel


While running performance tests in our codebase, I noticed that DOMHandle#receiveContent takes up about 50% of the time of a request to MarkLogic. In particular, the calls to createLSParser and newDocumentBuilder took most of that time.

I attached two flamegraphs, which you can open in a browser to navigate (hover over a flame bar to see the percentage of time spent in that function call).

  • unpatched.svg shows ~53% of the time in receiveContent; of that, ~14% is createLSParser, ~20% is newDocumentBuilder, and the rest is a legitimate DOMParser#parse call (which I of course cannot get rid of).
  • patched.svg shows ~20% of the time in receiveContent, all of it parsing.

The patched version was run with the code of this PR.

Tested with this code:

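import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.DOMHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.SearchHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.io.marker.StructureWriteHandle;
import com.marklogic.client.query.ExtractedItem;
import com.marklogic.client.query.ExtractedResult;
import com.marklogic.client.query.MatchDocumentSummary;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.RawCombinedQueryDefinition;
import com.marklogic.client.query.StructuredQueryBuilder;

public class Main {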

    public static void main(String[] args) {
        final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8010, new DatabaseClientFactory.DigestAuthContext("admin", "admin"));

        final QueryManager qM = client.newQueryManager();
        qM.setPageLength(Integer.MAX_VALUE);
        final StructuredQueryBuilder sqb = qM.newStructuredQueryBuilder();
        final StructureWriteHandle structureWriteHandle = new StringHandle(
            "" +
                "<search:search xmlns:search=\"http://marklogic.com/appservices/search\">" +
                sqb.and(sqb.directory(1, "/sfwordings/")).serialize() +
                "   <search:options>" +
                "       <search:extract-document-data selected=\"all\"/>" +
                "   </search:options>" +
                "</search:search>"
        ).withFormat(Format.XML);
        final RawCombinedQueryDefinition def = qM.newRawCombinedQueryDefinition(structureWriteHandle);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 20; i++) {
            doSearch(qM, def);
        }
        System.out.println(System.currentTimeMillis() - start);
    }

    private static void doSearch(QueryManager qM, RawCombinedQueryDefinition def) {
        final SearchHandle search = qM.search(def, new SearchHandle(), 1);

        final DOMHandle handle = new DOMHandle();
        final MatchDocumentSummary[] results = search.getMatchResults();
        for (MatchDocumentSummary summary : results) {
            final ExtractedResult extracted = summary.getExtracted();

            if (extracted == null || extracted.isEmpty()) {
                continue;
            }

            for (ExtractedItem item : extracted) {
                // this is the crucial call -> invokes DOMHandle#receiveContent
                item.get(handle).get();
            }
        }
    }
}

Problem
If you have a large result set, item.get(handle).get(); may be called a lot of times, in my case more than 1000 times. Every call to item.get(handle).get(); invokes DOMHandle#receiveContent, and every call to DOMHandle#receiveContent creates a new LSParser.

LSParser seems to be cacheable: the constructed LSParser is always configured the same way as long as no custom resolver or factory is configured. In the PR I tried to reuse a default factory, document builder, and LSParser.
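
To illustrate the idea, here is a minimal sketch of lazily creating the parser once and reusing it for every parse. It is not the code of the PR and not DOMHandle itself: the class CachedDomParser and its methods are hypothetical names, and it uses only the standard JAXP and DOM Load/Save APIs. It assumes the default JDK XML implementation, whose DOMImplementation also implements DOMImplementationLS, and that no custom resolver or factory is configured.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper for illustration only; not part of the MarkLogic client API.
final class CachedDomParser {
    private DOMImplementationLS domImpl; // created once, reused for every LSInput
    private LSParser parser;             // created once, reused for every parse
    private int parsersCreated;          // only here to make the reuse observable

    // Build the factory, document builder, and LSParser on first use only.
    private void init() throws ParserConfigurationException {
        if (parser != null) {
            return;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Assumption: the default implementation supports DOM Load/Save.
        domImpl = (DOMImplementationLS) builder.getDOMImplementation();
        parser = domImpl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        parsersCreated++;
    }

    // Parse one document with the cached parser; only the LSInput is new per call.
    Document parse(InputStream content) throws ParserConfigurationException {
        init();
        LSInput input = domImpl.createLSInput();
        input.setEncoding("UTF-8");
        input.setByteStream(content);
        return parser.parse(input);
    }

    int parsersCreated() {
        return parsersCreated;
    }
}
```

An instance like this must not be shared between threads without synchronization, since a synchronous LSParser can only run one parse at a time. A per-thread or per-handle instance would be one way to handle that.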

This leads to a performance gain of about 10% in my environment (for queries that use a DOMHandle).

flamegraph.zip
