Select | Performance issue with large file #97

Open
markb-trustifi opened this issue Jul 2, 2020 · 11 comments

Comments

@markb-trustifi

XPath versions: 23, 24, 27.
Selecting from a large file (~70000 records) takes about 3 minutes. The thread is blocked for the whole time.
bigxmlfile.xml.zip

document = new Dom().parseFromString(strfile);
let ts = xpath.select("//*[local-name()='t' or local-name()='tab' or local-name()='br' or local-name()='p' or local-name()='si']", document);

Most of the time is spent in the loop in this function:

XNodeSet.prototype.buildTree = function () {
    if (!this.tree && this.nodes.length) {
        this.tree = new AVLTree(this.nodes[0]);
        for (var i = 1; i < this.nodes.length; i += 1) {
            this.tree.add(this.nodes[i]);
        }
    }

    return this.tree;
};
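To check where the time goes before profiling inside the library, the parse step and the select step can be timed separately. The following is only a minimal sketch: `timeIt` is a hypothetical helper written for this illustration and is not part of the xpath package.

```javascript
// Hypothetical timing helper (an assumption for illustration, not library code).
// Runs fn synchronously and reports the wall-clock time it took.
function timeIt(label, fn) {
    const start = process.hrtime.bigint();
    const result = fn();
    // hrtime.bigint() returns nanoseconds; convert to milliseconds.
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
    return { result, elapsedMs };
}

// Usage sketch, assuming `Dom`, `xpath` and `strfile` are set up as in the
// snippet above:
//
// const parsed = timeIt('parse', () => new Dom().parseFromString(strfile));
// const selected = timeIt('select', () =>
//     xpath.select("//*[local-name()='t']", parsed.result));
```

If the 'select' figure dominates, that would be consistent with the time being spent in the node-set handling rather than in the DOM parser.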
@JLRishe
Collaborator

JLRishe commented Jul 3, 2020

Thank you for looking into the location of the performance bottleneck. We already have an open issue #87 about this, so I am going to close this one.

@JLRishe JLRishe closed this as completed Jul 3, 2020
@markb-trustifi
Author

@JLRishe FYI, this isn't specific to version 27. It took just as long to complete the parsing in versions 23 and 24.

@JLRishe
Collaborator

JLRishe commented Aug 21, 2020

@markb-trustifi Thank you for clarifying. I will reopen this issue for now.

@cleydyr

cleydyr commented Oct 9, 2021

Hi, @markb-trustifi, I've tested your use case with the file you've attached. It didn't take 3 minutes on my machine initially, but it took 30 seconds, which I think is still far too long. Now, with the changes I'm proposing in #108, it's taking 1.5 seconds on my machine. You may want to give the modified code a try.

@tremby

tremby commented Sep 16, 2023

I am seeing a similar severe performance issue with a large file. My file is relatively simple -- a few nodes deep are a very large number of self-closing child nodes with a few attributes. I'm selecting them quite directly. Both of these queries:

  • /rootNode/ns:otherNode/ns:childNode
  • /rootNode/ns:otherNode/*[name()='childNode']

seemingly hang the code. I don't know if it's stuck in an endless busy loop or if it'll eventually exit but as I write this the process has been running 12 minutes on a somewhat fast machine and hasn't finished.

Meanwhile, xmllint --xpath "/rootNode/*[name()='otherNode']/*[name()='childNode']" myfile.xml (I didn't try to figure out the syntax for namespaces in xmllint) completes in less than a second. Something is definitely wrong!

@nick-hunter

I'm also experiencing some serious performance issues. My application used to take 30 seconds to load XML files on startup, and now that the source files have grown, it's taking about 30 minutes. Almost all of that time is spent in xpath queries. My XML files are relatively flat, and my two largest files are 6MB and 32MB. I could be doing something wrong, but I've been seeing worrying and inconsistent performance in my benchmarks.

I tested with the flatter 6MB file, and //* returned 57600 nodes in 7 minutes 44 seconds. Selecting one node usually takes about 250ms. This isn't scalable for my app. I'm now planning to refactor my application to use fast-xml-parser and phase out XPath. fast-xml-parser is able to parse the same document in 650ms. xmllint is also fast.

@simon-20

simon-20 commented Mar 7, 2024

We also have an app that is experiencing performance issues to do with xpath selects using this library. We're dealing with ~10-60 Mb files, with a fairly complex node structure.

Over repeated runs on the most complicated files we process (~25 Mb, but with tens of thousands of repeated nodes, so although not the largest file we handle, it takes the longest), our app takes on average ~800 seconds to process an 18 Mb file using the current version of this library.

We are currently using a modified version of this library that incorporates the changes that are in the unmerged PR #107 (PR #107 has merge conflicts, but the same change has been redone as PR #120, and that has no conflicts), and using this fork of the library reduces the processing time to between 200-250 seconds.

So PR #120 cuts processing time by roughly 70-75% (from ~800 seconds to 200-250 seconds) in at least one real-world use case.

For us, 200-250 seconds is still far too long, and we're hitting timeout issues, so we're considering our options.

Are there any updates on whether the unmerged but mergeable performance fix that drops unshift() is going to be merged? Are there other plans for improving performance?

We would rather not start making further modifications of our own to one of the forks of this library that already incorporates the performance fix gained by dropping unshift().
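For readers wondering why dropping unshift() can matter this much: prepending to a JavaScript array typically has to move every existing element, so doing it in a loop tends toward quadratic work, while appending and reversing once stays roughly linear. The sketch below only illustrates that general pattern. It is not the code from PR #107 or PR #120, and both function names are made up for this example.

```javascript
// Builds [n-1, ..., 1, 0] by prepending. Each unshift() generally shifts
// all elements already in the array, so the loop does O(n^2) work overall.
function buildWithUnshift(n) {
    const out = [];
    for (let i = 0; i < n; i += 1) {
        out.unshift(i);
    }
    return out;
}

// Builds the same array by appending and reversing once at the end.
function buildWithPushReverse(n) {
    const out = [];
    for (let i = 0; i < n; i += 1) {
        out.push(i);
    }
    return out.reverse();
}
```

Both functions return the same array, so a swap of this kind changes only the cost, not the result. How much it helps in practice depends on how large the arrays get.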

@JLRishe
Collaborator

JLRishe commented Mar 8, 2024

@nick-hunter @simon-20 Sorry to hear that you are both experiencing performance issues.

I can try to get the unshift change merged and published in the next week or so.

One question - what are you using for your XML DOM? If it's @xmldom/xmldom, please note that a change has been made to that package that should offer significant performance benefits when querying it from this package, but it looks like those changes are still in the next branch of the package 10 months after they were merged and I don't know when they will be included in a release. Looks like the last release was 0.8.10 7 months ago, and this change is planned for inclusion in version 0.9.

So if you are using @xmldom/xmldom, I would suggest trying the latest beta version of that package to see if it makes a difference.

@nick-hunter

@JLRishe thanks for the info! I am using @xmldom/xmldom. I just tried using version 0.9.0-beta.11 and it made my app slower. I don't have time to do proper benchmarks today, but in my dev environment my app went from loading in 12 seconds to taking 113 seconds. Hopefully will have more time to investigate next week.

@JLRishe
Collaborator

JLRishe commented Mar 8, 2024

@nick-hunter Thank you for checking on that. I guess I had assumed that the newly added implementation of compareDocumentPosition in xmldom would be a fast operation, but after looking at the implementation, it looks like it's actually a rather expensive operation, which would explain why it made your app even slower. In any case, I will work on getting those unshift changes added and look to see what else can be done to improve performance.
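As a point of comparison for the cost of pairwise position checks, one general technique is to number every node once in a single document-order traversal and then sort by that number. This is only a sketch of that idea under stated assumptions: it is not what xpath or xmldom currently do, `indexDocumentOrder` and `sortInDocumentOrder` are hypothetical names, nodes are assumed to expose an array-like `childNodes`, and attribute nodes are ignored.

```javascript
// Assigns each node its position in document order (pre-order traversal).
// Iterative, so very deep or very wide trees do not overflow the call stack.
function indexDocumentOrder(root) {
    const order = new Map();
    const stack = [root];
    while (stack.length > 0) {
        const node = stack.pop();
        order.set(node, order.size);
        const children = node.childNodes || [];
        // Push children in reverse so the first child is visited first.
        for (let i = children.length - 1; i >= 0; i -= 1) {
            stack.push(children[i]);
        }
    }
    return order;
}

// Returns a copy of `nodes` sorted in document order using the index,
// so each comparison is a Map lookup rather than a tree walk.
function sortInDocumentOrder(nodes, order) {
    return nodes.slice().sort((a, b) => order.get(a) - order.get(b));
}
```

The index has to be rebuilt if the document is modified, which is one reason this is a trade-off rather than a drop-in fix.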

@simon-20

thanks @JLRishe, we are using @xmldom/xmldom, so I will bear that in mind, though perhaps won't rush to try it after @nick-hunter's experience.


6 participants