Can't parse UTF16 html string #744

ekalchev opened this issue Jul 8, 2018 · 4 comments

ekalchev commented Jul 8, 2018

I can't get this code to work. The output is empty HTML without the h1 and p tags. It works for UTF-8 and ASCII but not for UTF-16.

Am I doing something wrong, or is this a defect?

    int rc = 0;
    TidyDoc tdoc = tidyCreate();
    TidyBuffer output = { 0 };
    TidyBuffer errbuf = { 0 };
    char* test = (char*)u"<html><head><meta name = 'author' content = 'John Doe'></head><body><h1>My First Heading  𠜱 𠝹 𠱓 𠱸</h1><p>My first paragraph.</p></body></html>";
    rc = tidySetInCharEncoding(tdoc, "utf16le");
    rc = tidySetOutCharEncoding(tdoc, "utf16le");
    rc = tidyParseString(tdoc, test);
    rc = tidySaveBuffer(tdoc, &output);

cmb69 commented Oct 20, 2018

Calling tidyParseString() with UTF-16 encoded input does not seem to be supported (under the hood it uses TY_(tmbstrlen) which does not report the proper byte length for UTF-16). See, for instance, PHP's wrapper on how to set up an appropriate input buffer.
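To illustrate the length problem, here is a small sketch that uses the standard strlen() as a stand-in for a byte-oriented length function (the real TY_(tmbstrlen) is not reproduced here, so treat the comparison as an assumption about its behaviour). The helper names utf16_byte_len and naive_byte_len are made up for this example. In UTF-16, every ASCII character carries a zero byte, so a byte-wise scan stops almost immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: byte length of a NUL-terminated UTF-16 string,
 * not counting the 16-bit terminator. */
size_t utf16_byte_len(const char16_t *s)
{
    size_t units = 0;
    assert(s != NULL);
    while (s[units] != 0)   /* scan 16-bit code units, not bytes */
        units++;
    return units * sizeof(char16_t);
}

/* Hypothetical helper: what a byte-oriented strlen() reports when it is
 * handed the same memory. It stops at the first zero byte. */
size_t naive_byte_len(const char16_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For u"<html>" the first helper counts all six code units, while the second stops at the zero byte that belongs to the very first character '<', so a parser relying on it would see next to no input.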

This works. Thanks!

@cmb69, thank you for pointing this out. Yes, the internal strlen service stops at the first 0 byte, so it can fail with a UTF-16 string.

The signature tidyParseString(TidyDoc, ctmbstr) perhaps hints at this, since ctmbstr is a string type based on char, but it could certainly be documented better.

@ekalchev, as pointed out, PHP uses tidyParseBuffer() with a TidyBuffer, which supplies the length. Provided the desired character encoding is specified, this should work as expected with all supported character encodings.

I certainly look forward to feedback on documentation improvements, that is, on the doxygen comments in tidy.h. Suggestions, patches, PRs, and comments are very welcome. Thanks.

Documentation change pushed to reflect this. Closing.
