Different output when parsing HTML #790

Closed
longcdf opened this issue Dec 31, 2018 · 3 comments


longcdf commented Dec 31, 2018

Hi,
I'm seeing some strange behavior when parsing an HTML file with LibTidy.
The HTML file contains something like this:

<li class="ikonpunkt span8"><a href="http://www.vegvesen.no/jobb/Ledige+stillinger" class="menuitem"><img alt="Ledige stillinger hvit" src="http://www.vegvesen.no/jobb/For+studenter+og+nyutdannede/nyheter/_attachment/872915/binary/1031072?_ts=14d1da193f0"/><h3>Ledige stillinger</h3></a></li>

Parsing this file with LibTidy randomly generates one of 2 outputs:
1 -

<li class="ikonpunkt span8"><a href="http://www.vegvesen.no/om+statens+vegvesen/kontakt+oss" class="menuitem"><img alt="Kontakt oss hvit" src="http://www.vegvesen.no/jobb/For+studenter+og+nyutdannede/nyheter/_attachment/873039/binary/1031133?_ts=14d1dd66378">
<h3>Kontakt oss</h3>
</a></li>

2 -

<li class="ikonpunkt span8"><a href="http://www.vegvesen.no/om+statens+vegvesen/kontakt+oss" class="menuitem"><img alt="Kontakt oss hvit" src="http://www.vegvesen.no/jobb/For+studenter+og+nyutdannede/nyheter/_attachment/873039/binary/1031133?_ts=14d1dd66378"></a>
<h3><a href="http://www.vegvesen.no/om+statens+vegvesen/kontakt+oss" class="menuitem">Kontakt oss</a></h3>
</li>

It happens seemingly at random.
Could you please help explain it and suggest a way to work around this issue?
Thanks

@geoffmcl
Contributor

@longcdf thank you for the issue... but at this time I do not understand...

Using your input, and a config of -w 0, on current 5.7.17, I reproduce output 1, apart from the obviously different name <h3>Ledige stillinger</h3>... but that error aside...

What version of libTidy? Hopefully https://github.com/htacg/tidy-html5 next source...

How are you using libTidy? In what container, app, lib, whatever... src...

A random happening can only be explained as due to configuration, at the moment libTidy runs... care with multiple threads...

A library, like libtidy.a, has to produce the same output, given a config, and an input, a gzillion times over... forever... no random change... not possible... code paths can not change, without change...

So, at this moment, I really do not see how libTidy could very randomly output 2...

Even the sense is changed - the <h3> header is also a link to the URL... another anchor <a ...>name</a> added... the input must have changed...

At this time I do not understand the problem... more information needed... thanks...
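
The determinism argument above can be turned into a quick check. The following is only a sketch under stated assumptions: `parse_fn`, `count_mismatches` and `echo_parse` are hypothetical names, not part of libTidy, and `echo_parse` is a toy stand-in for whatever function wraps the real parse-and-save calls. It runs the same input on 20 threads at once and counts how many outputs differ from the first one.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NUM_THREADS 20   /* roughly the thread count mentioned in this issue */
#define OUT_CAP     256  /* illustrative output buffer size */

/* Hypothetical stand-in for "parse the input and serialize the result".
 * In a real check this would wrap the library calls on a per-thread document. */
typedef void (*parse_fn)(const char *input, char *out, size_t cap);

typedef struct {
    parse_fn    fn;
    const char *input;
    char        out[OUT_CAP];
} job_t;

/* Thread entry point: run the parser once and keep its output in the job. */
static void *run_job(void *arg)
{
    job_t *job = arg;
    job->fn(job->input, job->out, sizeof job->out);
    return NULL;
}

/* Runs fn on the same input from NUM_THREADS threads concurrently.
 * Returns how many outputs differ from thread 0's output,
 * or -1 if a thread could not be started. */
int count_mismatches(parse_fn fn, const char *input)
{
    pthread_t tid[NUM_THREADS];
    job_t     jobs[NUM_THREADS];
    int       started = 0;
    int       mismatches = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        jobs[i].fn = fn;
        jobs[i].input = input;
        jobs[i].out[0] = '\0';
        if (pthread_create(&tid[i], NULL, run_job, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    if (started < NUM_THREADS)
        return -1;

    for (int i = 1; i < NUM_THREADS; i++)
        if (strcmp(jobs[0].out, jobs[i].out) != 0)
            mismatches++;
    return mismatches;
}

/* Toy parser with no shared state: copies the input unchanged. */
void echo_parse(const char *input, char *out, size_t cap)
{
    snprintf(out, cap, "%s", input);
}
```

Calling `count_mismatches(echo_parse, "<h3>Ledige stillinger</h3>")` should report no mismatches, since the toy parser touches no shared state. A nonzero count for a function wrapping the real library would suggest the threads are interfering with each other; a zero count over a few runs does not prove thread safety.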

@longcdf
Author

longcdf commented Jan 1, 2019

Hi @geoffmcl, thank you for your response.
I'm using tidys.lib 5.6.0 downloaded from http://binaries.html-tidy.org/
This issue happens when I'm using multi-threading.
I have about 20 threads, and each thread parses a different HTML file.
I tested by calling tidyParseFile followed by tidySaveFile to make sure I didn't touch the TidyDoc.

Original file: https://drive.google.com/open?id=1Ch8QarwXZfpz_KUBSdoQk2cORWBItmGb
Correct output file: https://drive.google.com/open?id=1DxW27av3umJYZjQhFBycQPvgpXm_WGNi
Randomly wrong output file: https://drive.google.com/open?id=1pwEJMpPlhEG3vZ5g6uS0T_NihV5b3wT4

One more note: it doesn't always fail exactly like the second output file. There may be fewer or more differences, but the problem is always at the <h3> tag.

So I wonder whether the multithreading causes the issue, because with a single thread it is fine.

Thanks.

@balthisar
Member

Looks like this has been addressed. Please feel free to reopen if I'm wrong.
