java.lang.IndexOutOfBoundsException for 27 pages - tested 862_674 pages. #768

marekhudyma · 2020-07-13T20:38:48Z

Current behavior

I got java.lang.IndexOutOfBoundsException while parsing 27 pages!

I use:

    <dependency>
      <groupId>org.jodd</groupId>
      <artifactId>jodd-lagarto</artifactId>
      <version>5.1.5</version>   
    </dependency>

Exception that I have:

java.lang.IndexOutOfBoundsException

Expected behavior

No exception.

Steps to Reproduce the Problem

My code:

    @Override
    public List<CharSequence> parse(CharSequence html) {
        LagartoParser lagartoParser = new LagartoParser(html.toString());
        LagartoParserConfig config = new LagartoParserConfig();
        config.setEnableConditionalComments(false);  // <----- 
        config.setEnableRawTextModes(false);             // <----- 
        lagartoParser.setConfig(config);
        TagVisitorImpl tagVisitor = new TagVisitorImpl();
        lagartoParser.parse(tagVisitor);
        return tagVisitor.getLinks();
    }

        class TagVisitorImpl implements TagVisitor {
        @Override
        public void tag(Tag tag) {
            href = tag.getAttributeValue("href");
            if (href != null) {
                // ... 
            }
        }

Pages that I parsed:

https://www.dropbox.com/sh/279el3cheql3esc/AAAc-btAJgyWUF89fTJ_1cTPa?dl=0

How I found it ?
I found my old zip file with 1 million downloaded pages (ok, is it 862_674) and I parsed it.
I will try to push it to my github.. the zip file is 13GB big, it should be possible (?)
I will let you know later. But you have all failing pages, so you can start fixing it.

The text was updated successfully, but these errors were encountered:

igr · 2020-07-13T20:41:56Z

No problem, I didnt forget this. In fact, I am moving Lagarto to a separate, individual project, as it deserves special attention - it is one of the two most used libraries of Jodd.

Thank you for helping with this:) Stay tuned, the plan is to migrate this week

slandelle · 2020-07-13T21:00:04Z

Here's a simple reproducer for the first page:

<html>
<body>
<!--->
-->
</body>
</html>

Please note the comment block is not closed in <!---> as there's only 3 dashes (and not 4).

igr · 2020-07-13T21:00:51Z

These comments are going to kill me :)))))

igr · 2020-07-13T21:01:11Z

I believe this one is fixed in last PR, but not released yet.

slandelle · 2020-07-13T21:03:01Z

Are snapshots published on Sonatype or should I build locally?

igr · 2020-07-13T21:05:27Z

Locally, sorry.

slandelle · 2020-07-13T21:09:18Z

Will give it a try tomorrow then :)

slandelle · 2020-07-13T21:12:15Z

Note: that's the same <!---> pattern in all of the reported files.

igr · 2020-07-13T21:15:17Z

Yap, just tested here, the above HTML snippet works on master

slandelle · 2020-07-13T21:17:35Z

So hopefully you're fine now 🤞

igr · 2020-07-13T21:17:58Z

Haha, this was close :)))

marekhudyma · 2020-07-16T19:27:22Z

I want to upload all html pages that I used in the test. Unfortunately, it looks like Github has limit of 5GB per repository - I have 13GB. I will upload it to S3, but I need to return from travel - around 1 week.

I downloaded all the html and in http-client set by default UTF-8 encoding. In the specification I saw that finding a proper encoding is tricky. Can you recommend some library to do it?

igr · 2020-07-16T20:38:15Z

Hey @marekhudyma awesome! You can use https://wetransfer.com as well.
In a week - I promise - I will migrate to new repo and have the 6.0.0-snapshot ready!

Yeah, the encoding is tricky, but isn't it set by the response? Are you using Jodd http library? One idea is to download everything with same encoding and then resolve encoding from e.g. meta tags, and then reload string. Just na idea, thinking out loud here.

marekhudyma · 2020-07-17T21:44:55Z

I found this definition of encodings and it is not trivial:
https://w3c.github.io/html-reference/syntax.html#character-encoding
That's why I ask for any automated way to find it. I will google more ;-)

I use Async Http Client, but I am still evaluating which async http client gives biggest configuration (for example stop downloading if file is bigger than x bytes).

I wrote some performance test that compare Jsoup, Jericho, HtmlCleaner, Lagarto with default configuration and Lagarto with tuned configuration. It looks Lagarto is the fastest ;-)
But I wanted to compare Jerry parser as well. Unfortunately documentation is not clear for me how to use it. I see for examples like:

Jerry doc = Jerry.jerry("some_html");
doc.$("div");

But my Intellij cannot resolve method "$". Is documentation not updated or I do something wrong?

igr · 2020-07-17T21:46:01Z

I had to remove $ as in e.g. Graal it is not allowed. Try select.

igr · 2020-07-17T21:47:07Z

btw regarding speed ... its Lagarto < LagartoDOM < Jerry

igr · 2020-07-17T21:48:26Z

Try also just s() instead of $()

igr · 2020-07-20T14:20:41Z

Here, new repo created: https://github.com/oblac/jodd-lagarto, migration is in progress :)

igr · 2020-07-20T14:27:17Z

@marekhudyma see oblac/jodd-lagarto#3

I am thinking of way how to add a bunch of files for the tests. I guess I can put them in the Docker image so not to increase the repo size?

marekhudyma · 2020-07-20T16:05:00Z

Thanks for this hint.
I need to admit the documentation is very outdated. Please fix it. Do you need an issue for it? ;-)

The LagartoDOM is so badly documented, I am not sure how to use it.
https://jodd.org/lagarto/dom-builder.html
It simply doesn't have this methods:

LagartoDOMBuilder domBuilder = new LagartoDOMBuilder() {
        @Override
        protected DOMBuilderTagVisitor createDOMDomBuilderTagVisitor() {
            return new MyDOMBuilderTagVisitor();
        }
    }

igr · 2020-07-20T16:08:05Z

I know @marekhudyma. I will put my efforts now in Lagarto. Even a separate web site :)

https://github.com/oblac/jodd/blob/master/jodd-lagarto/src/test/java/jodd/lagarto/dom/LagartoHtmlRendererTest.java#L56

marekhudyma · 2020-07-21T12:04:07Z

As I promised,
Here is a simple project with test of your parser. Test just check if there is no exception.. nothing complicated, but still provide some value as we discussed.
https://github.com/marekhudyma/html-parser-correctness

To make it work, you need to download 13 GB of downloaded pages. It is around 1 million (862_542) pages. Go to folder IN and execute aws S3 bucket synchronization:

 aws s3 sync s3://html-top1m .

Then run: JoddLagartoCorrectnessTest
Parsing ~1m pages was not so long.

marekhudyma · 2020-07-21T12:18:18Z

I got a problem to commit all files to GitHub, but maybe commiting 10 of them would be an option to have automated test. All of them you can run from time to time..

marekhudyma · 2020-07-21T12:18:44Z

(I think there was a limit of 5GB for whole repository.. )

igr · 2020-07-21T17:04:27Z

Just download the files :) I believe Docker image with all the files will make sense...

igr · 2020-07-26T20:57:27Z

@marekhudyma what are your memory settings for the test? I have an issue with 27.zip.

igr · 2020-07-29T21:10:01Z

Soon... www.joddlagarto.com :)

igr · 2020-07-31T11:24:48Z

@slandelle @marekhudyma @moh-sushi

Here is the snapshot that is a beta:

https://oss.sonatype.org/content/repositories/snapshots/org/jodd/jodd-lagarto/

6.0.0.20200731131821-SNAPSHOT

I made some minor API improvements, that is basically it.

The website is ready: www.joddlagarto.com :)

I will now remove Lagarto from here and do the big test, that is all whats left.

please, if you have some time, you can try this snapshot before its released :)

slandelle · 2020-07-31T12:24:58Z

Would be great if LagartoDomBuilderConfig#setErrorLogger and LagartoDomBuilderConfig#setDebugLogger could return this instead of void so configuration could be chained.

Wondering about the perf impact of passing a Consumer<Supplier> though.

igr · 2020-07-31T13:55:06Z

Ah true.

Regarding performance, I was thinking in skipping the construction of debug strings... but I will do that better now.
Thanx!

slandelle · 2020-07-31T13:57:12Z

https://github.com/oblac/jodd-lagarto/blob/a377b7686461aa658aebe3d5f5f6817519d233ec/src/main/java/jodd/lagarto/dom/LagartoDOMBuilderTagVisitor.java#L504

Nope :P

igr · 2020-07-31T14:52:44Z

Yeah, I got carried away :)

marekhudyma · 2020-07-31T14:59:51Z

@igr Nice website it comming.
I will run correctness test against snapshot using this project: https://github.com/marekhudyma/html-parser-correctness
I will be realistic: probably I will not build docker image from these files.. please handle it by yourself :)
I also write comparison test (between libraries). Probably it will takes around 2 weeks for me to return to this topic. The main concept is to compare with java-benchmarking - to show difference with 1 parsing, parsing 1million pages, to show who is faster in long run AND I want to compare DOM vs no-DOM parsers to show how many times you need to parse 1 document so building DOM makes sense.

igr · 2020-07-31T15:03:57Z

Awesome @marekhudyma - you dont have to do anything, you already did enough :)

I am handling this locally, Docker images are also out of option, so it's going to be all locally.

Just please use the latest version, that is all that I ask :)

marekhudyma · 2020-07-31T19:04:51Z

@igr I run test aginst all these pages and there were no exceptions. I run the newest snapshot version

    <repositories>
        <repository>
            <id>maven-snapshots</id>
            <url>http://oss.sonatype.org/content/repositories/snapshots</url>
            <layout>default</layout>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

        <dependency>
            <groupId>org.jodd</groupId>
            <artifactId>jodd-lagarto</artifactId>
            <version>6.0.0.20200731131821-SNAPSHOT</version>
            <scope>test</scope>
        </dependency>

Congratulations, you've fixed issue ;-)

neroux · 2020-08-01T19:18:30Z

Soon... www.joddlagarto.com :)

Why not lagarto.jodd.org?

Also, why Godaddy? 😲 ;)

igr · 2020-08-02T06:40:07Z

Why not lagarto.jodd.org?

Good question, @neroux... to be honest, I was looking for the quickest way to make documentation, and it seems that GitBook works so far. The new doc is totally separated, has a separate repository etc. - hence the new URL

I believe I can use a subdomain on jodd.org, but the Lagarto pages would look different from the rest of the existing website. Would that make sense?
I would probably then need to migrate the rest of the documentation too, to separate subdomains :)

Also, why Godaddy? 😲 ;)

10+ years ago it seemed like a good idea;)))

neroux · 2020-08-02T07:03:12Z

I believe I can use a subdomain on jodd.org, but the Lagarto pages would look different from the rest of the existing website. Would that make sense?
I would probably then need to migrate the rest of the documentation too, to separate subdomains :)

Please, no worries. I was just curious. Either approach will work just fine. Using hostnames of jodd.org might just be cheaper in the end :)

10+ years ago it seemed like a good idea;)))

I know, I also was with that company once until their CEO went on hunting safaris for elephants and leopards.

igr · 2020-08-02T22:47:18Z

I know, I also was with that company once until their CEO went on hunting safaris for elephants and leopards.

Damn, didn't know that. Do you have an alternative @neroux to recommend? I am bit tired of their non-working website anyway

slandelle · 2020-08-03T06:03:23Z

Been using https://www.gandi.net for ages.

neroux · 2020-08-03T06:16:33Z

Recently I have been using mostly www.porkbun.com for gTLD domains.

igr · 2020-08-04T08:14:37Z

I will migrate from Godday on the next payment cycle, too much for me atm...

@neroux makes sense, https://lagarto.jodd.org it is.

I will try to add Reader as an input, and then I will release lagarto to 6.

For now, I will update this repo and current docs.

igr added a commit that referenced this issue Jul 13, 2020

Add new test for #768

4e79b26

igr self-assigned this Aug 3, 2020

igr added jodd-lagarto bug labels Aug 3, 2020

igr closed this as completed Aug 4, 2020

java.lang.IndexOutOfBoundsException for 27 pages - tested 862_674 pages. #768

java.lang.IndexOutOfBoundsException for 27 pages - tested 862_674 pages. #768

Comments

marekhudyma commented Jul 13, 2020

Current behavior

Expected behavior

Steps to Reproduce the Problem

igr commented Jul 13, 2020

slandelle commented Jul 13, 2020

igr commented Jul 13, 2020

igr commented Jul 13, 2020

slandelle commented Jul 13, 2020

igr commented Jul 13, 2020

slandelle commented Jul 13, 2020

slandelle commented Jul 13, 2020 • edited Loading

igr commented Jul 13, 2020

slandelle commented Jul 13, 2020

igr commented Jul 13, 2020

marekhudyma commented Jul 16, 2020

igr commented Jul 16, 2020

marekhudyma commented Jul 17, 2020

igr commented Jul 17, 2020

igr commented Jul 17, 2020

igr commented Jul 17, 2020

igr commented Jul 20, 2020

igr commented Jul 20, 2020

marekhudyma commented Jul 20, 2020

igr commented Jul 20, 2020

marekhudyma commented Jul 21, 2020

marekhudyma commented Jul 21, 2020

marekhudyma commented Jul 21, 2020

igr commented Jul 21, 2020

igr commented Jul 26, 2020 • edited Loading

igr commented Jul 29, 2020

igr commented Jul 31, 2020

slandelle commented Jul 31, 2020

igr commented Jul 31, 2020

slandelle commented Jul 31, 2020

igr commented Jul 31, 2020

marekhudyma commented Jul 31, 2020

igr commented Jul 31, 2020

marekhudyma commented Jul 31, 2020

neroux commented Aug 1, 2020

igr commented Aug 2, 2020 • edited Loading

neroux commented Aug 2, 2020

igr commented Aug 2, 2020

slandelle commented Aug 3, 2020

neroux commented Aug 3, 2020

igr commented Aug 4, 2020

slandelle commented Jul 13, 2020 •

edited

Loading

igr commented Jul 26, 2020 •

edited

Loading

igr commented Aug 2, 2020 •

edited

Loading