Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jsoup 1.15 performance problems with :has(), :matches(), ... #1872

Closed
ogolovanov opened this issue Dec 25, 2022 · 4 comments
Closed

jsoup 1.15 performance problems with :has(), :matches(), ... #1872

ogolovanov opened this issue Dec 25, 2022 · 4 comments

Comments

@ogolovanov
Copy link

ogolovanov commented Dec 25, 2022

Hi.

I am intensively using jsoup in my app.
I tried to update from jsoup 1.12.2 to jsoup 1.15.3 -> and see huge performance degradation.
After doing some tests, i see that there is a problem with pseudo selectors like :has(), :matches(), and may be other.

Document doc = Jsoup.connect("https://www.amazon.com/").get();
String title = doc.title();
System.out.println("Title: " + title);

// warm up
for(int i=0; i<1000; ++i) doc.select("*");

long s = System.currentTimeMillis();
for(int i=0; i<1000; ++i) doc.select("NoSuchElement");
System.out.println("Test #1: " + (System.currentTimeMillis() - s) + "ms");

s = System.currentTimeMillis();
for(int i=0; i<1000; ++i) doc.select("NoSuchElement:has(> div)");
System.out.println("Test #2: " + (System.currentTimeMillis() - s) + "ms");

s = System.currentTimeMillis();
for(int i=0; i<1000; ++i) doc.select("NoSuchElement:has(> div:has(> a))");
System.out.println("Test #3: " + (System.currentTimeMillis() - s) + "ms");

s = System.currentTimeMillis();
for(int i=0; i<1000; ++i) doc.select("NoSuchElement:has(> div > a)");
System.out.println("Test #4: " + (System.currentTimeMillis() - s) + "ms");

for(int i=0; i<1000; ++i) doc.select("NoSuchElement:matches(^abc$)");
System.out.println("Test #5: " + (System.currentTimeMillis() - s) + "ms");

Jsoup 1.12.2 result:

Title: Amazon.com. Spend less. Smile more.
Test #1: 32ms
Test #2: 37ms
Test #3: 36ms
Test #4: 39ms
Test #5: 74ms

Here you can see, that:

  1. :has() pseudo selector works as expected = no extra time consumed
  2. :matches() pseudo selector consumes some extra time = unexpected behaviour

Jsoup 1.15.3 result:

Title: Amazon.com. Spend less. Smile more.
Test #1: 14ms
Test #2: 30ms
Test #3: 85ms
Test #4: 20ms
Test #5: 174ms

Here you can see, that

  1. :has() preudo selector consumes extra time, especially when its nested
  2. :has(> div > a) works much faster than :has(> div:has(> a)) = unexpected behaviour
  3. :matches() preudo selector consumes extra time, and thats time is significant ( 2.5x times slower that in jsoup 1.12.2 )

Resume:

  1. [jsoup 1.15.3] there are performance problems while using :has() and :matches() pseudo selectors when context element is absent
  2. [jsoup 1.12.2] there are no problems with :has() pseudo selector, but there are problems with :matches() pseudo selector
  3. [jsoup 1.15.3] :has(> div > a) and :has(div a) pseudo selectors work much faster than :has(> div:has(> a)) pseudo selector in both situations - when context element is absent and when its not
  4. [jsoup 1.12.2] same problem, but only when context element is not absent
  5. [jsoup 1.15.3] in general :has() and :matches() preudo selectors work slower than in jsoup 1.12.2 when context element is not absent
  6. i showed some simple selectors, but my app use hunders of them and a lot of them are complex -> real performance degradation is 3x
@jhy
Copy link
Owner

jhy commented Jan 5, 2023

:has is always going to be more expensive than a simple > descender check, because it needs to create a sub-evaluation list. If you can use >, do. Given your metrics, I don't consider this an issue.

I reviewed the matches and the calls it makes into text() and don't immediately see significant changes between the versions, or immediate optimization opportunities. text() needs to do the normalizations it does, so we can't change that directly.

Can you try matchesWholeText or matchesWholeOwnText and see if you get better perf with those? They will be doing less work.

@jhy jhy added the needs-more-info More information is needed from the reporter to progress the issue label Jan 6, 2023
@jhy
Copy link
Owner

jhy commented Jan 24, 2023

(Closing, as not directly actionable. Happy to take a look at any specific performance suggestions.)

@jhy jhy closed this as not planned Won't fix, can't repro, duplicate, stale Jan 24, 2023
@jhy jhy removed the needs-more-info More information is needed from the reporter to progress the issue label Jun 1, 2023
@jhy
Copy link
Owner

jhy commented Jun 1, 2023

I have added a cost-based query planner which should improve the performance of these selectors. See commit: 32d247d

@ogolovanov
Copy link
Author

I have added a cost-based query planner which should improve the performance of these selectors. See commit: 32d247d

Yes, now it works much better, thanks

P.s. In general performance of v1.17.2 is very similar to v1.12.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants