Can't parse UTF16 html string #744

Closed
ekalchev opened this issue Jul 8, 2018 · 4 comments

@ekalchev

ekalchev commented Jul 8, 2018

I can't get this code to work. The output is an empty HTML document without the h1 and p tags. It works for UTF-8 and ASCII input, but not for UTF-16.

Is this a defect, or am I doing something wrong?

    int rc = 0;
    TidyDoc tdoc = tidyCreate();
    TidyBuffer output = { 0 };
    TidyBuffer errbuf = { 0 };
    char* test = (char*)u"<html><head><meta name = 'author' content = 'John Doe'></head><body><h1>My First Heading  𠜱 𠝹 𠱓 𠱸</h1><p>My first paragraph.</p></body></html>";
    rc = tidySetInCharEncoding(tdoc, "utf16le");
    rc = tidySetOutCharEncoding(tdoc, "utf16le");
    rc = tidyParseString(tdoc, test);
    rc = tidySaveBuffer(tdoc, &output);

@cmb69
Contributor

cmb69 commented Oct 20, 2018

Calling tidyParseString() with UTF-16 encoded input does not seem to be supported (under the hood it uses TY_(tmbstrlen) which does not report the proper byte length for UTF-16). See, for instance, PHP's wrapper on how to set up an appropriate input buffer.

@ekalchev
Author

This works. Thanks!

@geoffmcl
Contributor

@cmb69, thank you for pointing this out... yes the internal strlen service will stop at the first 0 byte, so can fail with a utf16 string...
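To illustrate the point about the 0 byte, here is a minimal sketch that does not use libtidy at all. It uses the standard strlen() as a stand-in for the internal strlen service (an assumption; the internal routine is not shown in this thread), and the names UTF16LE_SAMPLE and utf16_byte_length are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "<h1>" encoded as UTF-16LE: every ASCII character is followed by a
 * 0 byte, and the string ends with a two-byte (0, 0) terminator. */
const unsigned char UTF16LE_SAMPLE[] = {
    '<', 0, 'h', 0, '1', 0, '>', 0,
    0, 0
};

/* Hypothetical helper: count the bytes of a UTF-16 string up to, but not
 * including, the two-byte terminator. Works on raw bytes, so it behaves
 * the same for little-endian and big-endian input. */
size_t utf16_byte_length(const unsigned char *bytes)
{
    size_t n = 0;

    assert(bytes != NULL);
    while (bytes[n] != 0 || bytes[n + 1] != 0) {
        n += 2; /* advance one 16-bit code unit at a time */
    }
    return n;
}

/* What a byte-oriented strlen sees: it stops at the first 0 byte, which
 * for this sample is the high byte of the very first character. */
size_t byte_strlen_view(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}
```

For this sample, byte_strlen_view(UTF16LE_SAMPLE) sees only the first byte, while utf16_byte_length(UTF16LE_SAMPLE) counts all 8 bytes of the four characters. That mismatch is consistent with the empty output reported above.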

Maybe this is hinted at by the fact that tidyParseString(TidyDoc,ctmbstr) takes a string of type ctmbstr, which is based on char, but it could certainly be documented better...

@ekalchev as pointed out, PHP uses tidyParseBuffer() with a TidyBuffer, which supplies the length, and provided the desired character encoding is specified, that should work as expected with all supported character encodings...

I certainly look forward to feedback on documentation improvements, that is, in the tidy.h doxygen comments... suggestions, patches, PRs, and comments are very welcome... thanks...

@balthisar
Member

Documentation change pushed to reflect this. Closing.
