No Unicode support? #7

Closed
CrimsonVex opened this issue May 1, 2015 · 6 comments

Comments

@CrimsonVex

It seems like there's no Unicode support, because CDocument::parse only accepts std::string, which doesn't seem Unicode friendly (at least under Windows).

@TechnikEmpire

http://stackoverflow.com/questions/3257263/how-do-i-get-stl-stdstring-to-work-with-unicode-on-windows

http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring

Have you run into a specific issue where you actually see/note a failure given the current implementation, or are you just assuming that there will be an issue?

A std::string is basically a sloppy wrapper around raw bytes, and the lib only uses std::strings internally to compare against other std::strings. That means you're just doing byte-to-byte comparisons, so I'm not seeing where anything is going to go wrong, regardless of platform. Again though, if you have a specific bug and a reproducible error, please do share.
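To illustrate the byte-comparison point, here is a minimal sketch in standard C++. It assumes the text is UTF-8 encoded; the helper names `containsBytes` and `countCodePoints` are hypothetical and are not part of gumbo-query.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `haystack` contains `needle`.
// std::string::find compares raw bytes, so it works unchanged on UTF-8 text.
bool containsBytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, "café" encoded as UTF-8 is the five bytes `63 61 66 C3 A9`: `size()` reports 5 while `countCodePoints` reports 4, and a search for the two-byte sequence `C3 A9` still matches. Nothing is lost as long as the bytes that go in are valid UTF-8 to begin with.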

@CrimsonVex
Author

I'm using the .NET WebRequest library, and the HTML responses are of type System::String.

To make these usable by gumbo-query, they must be converted to std::string. Below is my function for doing so:

std::string SystemToStdString(String^ s)
{
    msclr::interop::marshal_context context;
    return context.marshal_as<std::string>(s);
}

Now, when I print the HTML output of an HttpRequest made in .NET as a System::String, the Unicode characters are there. However, after converting that HTML output to a std::string, all Unicode characters become '?'.

EDIT: Perhaps there's a bug in my conversion function - I'll look into it and get back to you.

I've also come across this, which is why I assumed (perhaps wrongly) that gumbo-query doesn't support Unicode, at least for its .text() function:
http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring

@TechnikEmpire

tbh I could be wrong too, I'm no expert when it comes to character encoding. I've never had an issue where I've needed to learn, so I haven't bothered to. I do know the .NET string object is a proper string object that does concern itself with encoding (where the C++ string is really just a wrapper around an array of bytes), so I can definitely see where you're not going to get a 1:1 conversion. If possible, try marshaling a raw byte array from the .NET side to a C-style array of char on the C++ side, and then construct a std::string around that array of bytes. Just look up the appropriate std::string constructor. If that won't work, maybe investigate using some of the encoding functions in System.Text.Encoding on the .NET side to convert to a more appropriately encoded string before marshaling over to the native side.
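The "construct a std::string around that array of bytes" step could look roughly like the sketch below. It is standard C++ only and assumes the bytes were already produced as UTF-8 on the managed side (for instance with System.Text.Encoding.UTF8.GetBytes); the helper name `fromUtf8Bytes` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies a raw UTF-8 byte buffer into a std::string.
// The (pointer, length) constructor copies exactly `length` bytes, so the
// buffer does not need a terminating NUL and may contain embedded zeros.
std::string fromUtf8Bytes(const unsigned char* bytes, std::size_t length) {
    if (bytes == nullptr || length == 0) {
        return std::string();
    }
    return std::string(reinterpret_cast<const char*>(bytes), length);
}
```

Passing the length explicitly matters here: the single-argument `const char*` constructor stops at the first zero byte, which would silently truncate a buffer that is not NUL-terminated text.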

https://msdn.microsoft.com/en-us/library/kdcak6ye%28v=vs.110%29.aspx

@CrimsonVex
Author

I've spent the afternoon trying all sorts of things, especially converting between System::String and std::string. It seems as though std::string simply can't handle Unicode characters reliably, especially on Windows, and the universal solution I've found almost everywhere is to use std::wstring instead.
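For comparison, a std::string can carry any Unicode text losslessly if the UTF-16 code units are encoded as UTF-8 rather than squeezed into a narrow code page. The sketch below is a hand-written encoder in standard C++, given only to show the idea; `utf16ToUtf8` is a hypothetical name, and replacing unpaired surrogates with U+FFFD is an assumption about how malformed input should be handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes UTF-16 text as UTF-8 bytes in a std::string.
// Unpaired surrogates are replaced with U+FFFD (an assumed policy).
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this encoding, U+00E9 becomes the bytes `C3 A9` and U+20AC becomes `E2 82 AC`, so no character has to degrade to '?'. In a real C++/CLI project the platform's own conversion routines would normally be used instead of a hand-rolled loop.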

I only need std::wstring for the .text() function of any given CSelection, but I'm not too sure where to start in modifying the gumbo-query library to achieve this, as I noticed that gumbo-parser seems to use std::string.

I'll keep trying.

@TechnikEmpire

If you're already using .NET, don't even bother with gumbo-query. That's my 2 cents. There's an excellent library called CsQuery that I was using in the C# version of my code before I did a full port to C++. https://github.com/jamietre/CsQuery - It uses the Validator.Nu HTML parsing engine, which is what is used in Gecko/Firefox, and has full-blown selector support. It's basically a port of the entire jQuery lib to C#. It's available from NuGet. Does that solve your problem?

@CrimsonVex
Author

I'm too loyal - it took me ages to get gumbo-query working, so I'm with it for life. Considering I'm using C++ .NET, I may as well stick with gumbo.
I've fixed the encoding problem using the System::String to std::string function from:

http://blog.nuclex-games.com/mono-dotnet/cxx-cli-string-marshaling/

And for std::string to System::String: (gcnew String(s.c_str(), 0, s.length(), Encoding::UTF8))

Thank you for your assistance :)

I see no need to switch to CsQuery - gumbo-query does everything I need very well.
