Faulty wide char conversion on Windows #733

Closed
TBProAudio opened this issue Apr 18, 2021 · 22 comments

@TBProAudio
Contributor

Windows only!

The functions UTF8ToUTF16 and UTF16ToUTF8 seem to produce garbage if language-specific characters like "ä", "ö", "ü" etc. are involved.
Currently the Windows functions MultiByteToWideChar and WideCharToMultiByte are called with the code page parameter CP_UTF8, which seems to be the problem. Using CP_ACP instead seems to fix the issue.

All Windows based formats are affected (vst2, vst3, aax).

To Reproduce:
Make a folder named "Öffnen" (English: "open") and create a file in this folder. Call PromptForFile and open this file. The full file name shows garbage.

@AlexHarker
Collaborator

The functions are supposed to convert to and from UTF8, and I think I would have tested them when I wrote them for at least some non-standard characters. iPlug2 strings in general should be UTF8.

Can you say more about how you are inspecting the file name, as well as the setup of the system you are on? The fix suggested doesn't seem like the correct approach to me, but it would be good to know more and see if we can figure out what is going on here.

@TBProAudio
Contributor Author

Hmm, maybe I missed something here, sorry.
Maybe I have overlooked the fact that IPLUG2 is now fully UTF8, which means that old libraries/code using fopen/fread/write need more attention. I need to look into this in much more detail ...
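
A minimal sketch of why a UTF8 string looks like garbage to code that expects the ANSI code page. It assumes a Western system whose ANSI code page is close to ISO-8859-1 (Latin-1), which is a simplification. `Latin1ToUtf8` is a hypothetical helper, not part of iPlug2 or WDL. It decodes every input byte as one Latin-1 character, which is roughly what an ANSI-based API does with a narrow string:

```cpp
#include <string>

// Hypothetical helper: treat each input byte as one ISO-8859-1 (Latin-1)
// character and re-encode the result as UTF-8.
// Feeding it text that is ALREADY UTF-8 reproduces the kind of garbage an
// ANSI-based API shows for a UTF-8 string.
inline std::string Latin1ToUtf8(const std::string& latin1)
{
  std::string out;
  for (unsigned char c : latin1)
  {
    if (c < 0x80)
    {
      out += static_cast<char>(c); // ASCII is identical in both encodings
    }
    else
    {
      // U+0080..U+00FF always needs exactly two UTF-8 bytes
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```

A real Latin-1 "ä" (byte E4) converts to the UTF8 bytes C3 A4. Passing those two UTF8 bytes through the helper again yields the two characters "Ã¤", which is the typical shape of the garbage described above.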

@TBProAudio
Contributor Author

First, there is nothing wrong with the current UTF8ToUTF16/UTF16ToUTF8 implementation!

But currently we cannot use it this way, as there are some things to consider, which you may want to comment on:

  1. Currently we use fopen on Win/Mac to open files. _wfopen seems to be missing on Mac.
  2. WDL_String seems to lack wchar_t interfaces (e.g. for SetFormatted).

So I think we first need to switch our file i/o from char to wchar_t and then force UTF8 instead of the ANSI code page.

@AlexHarker
Collaborator

For Mac you can use fopen without issue. You have a couple of options here. One is simply to do your own conversion from UTF8 to the ANSI code page on Windows before you open. Another is to wrap your file reading routines for each platform to handle the different routes more generally. I have a library for plugins that does this, and I use std::ifstream and std::ofstream. Files are opened with wide (16-bit) strings on Windows after conversion, and simply with the UTF8 path on Mac.

Obviously, UTF8ToUTF16 can be used for the conversion when you need to do it.
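
A minimal sketch of the wrapping approach, not the library mentioned above. The names `Utf8ToWide`, `OpenInputUtf8` and `OpenOutputUtf8` are hypothetical. The Windows branch assumes C++17 for `std::filesystem::path` and calls `MultiByteToWideChar` directly, so that the block is self-contained; inside iPlug2, UTF8ToUTF16 could do that step instead.

```cpp
#include <fstream>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <filesystem>

// Convert UTF-8 to UTF-16 using CP_UTF8 (not CP_ACP).
static std::wstring Utf8ToWide(const std::string& utf8)
{
  if (utf8.empty())
    return std::wstring();

  const int srcLen = static_cast<int>(utf8.size());
  // First call: ask how many UTF-16 code units are required.
  const int dstLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, nullptr, 0);
  if (dstLen <= 0)
    return std::wstring();

  std::wstring wide(static_cast<size_t>(dstLen), L'\0');
  // Second call: do the conversion into the sized buffer.
  MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, &wide[0], dstLen);
  return wide;
}
#endif

// Open a file for binary reading, given a UTF-8 encoded path.
inline std::ifstream OpenInputUtf8(const std::string& utf8Path)
{
#ifdef _WIN32
  // Windows: go through a wide path so no ANSI code page is involved.
  return std::ifstream(std::filesystem::path(Utf8ToWide(utf8Path)), std::ios::binary);
#else
  // macOS/Linux: the narrow path is used as-is.
  return std::ifstream(utf8Path, std::ios::binary);
#endif
}

// Open a file for binary writing, given a UTF-8 encoded path.
inline std::ofstream OpenOutputUtf8(const std::string& utf8Path)
{
#ifdef _WIN32
  return std::ofstream(std::filesystem::path(Utf8ToWide(utf8Path)), std::ios::binary);
#else
  return std::ofstream(utf8Path, std::ios::binary);
#endif
}
```

Calling code then passes the UTF8 path it got from iPlug2 unchanged on every platform, e.g. `std::ifstream in = OpenInputUtf8(path);`.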

@AlexHarker
Collaborator

BTW - WDL_String is part of WDL, rather than iPlug2, so we aren't likely to add comments there - not sure if the UTF8 thing is documented anywhere - @olilarkin?

@TBProAudio
Contributor Author

Thank you Alex.
So as a first step we enabled wchar_t for all Windows-based file i/o, but still with CP_ACP. As soon as all file i/o supports wchar_t we can switch to UTF8.

BTW: As WDL_String lacks wchar_t support we created a small class to do the conversion in a smart way:

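// Owns a temporary UTF-16 copy of a UTF-8 WDL_String (Windows only).
class cwchar_t
{
public:
	cwchar_t(const WDL_String& str) : m_wc(NULL)
	{
		// A UTF-8 string never needs more UTF-16 code units than it has bytes,
		// so GetLength() + 1 (for the terminator) elements are always enough.
		int str_len = str.GetLength() + 1;
		m_wc = new wchar_t[str_len];
		UTF8ToUTF16(m_wc, str.Get(), str_len);
	}

	cwchar_t(const WDL_String* str) : m_wc(NULL)
	{
		int str_len = str->GetLength() + 1;
		m_wc = new wchar_t[str_len];
		UTF8ToUTF16(m_wc, str->Get(), str_len);
	}

	~cwchar_t()
	{
		delete[] m_wc; // delete[] on NULL is a no-op
	}

	// Implicit conversion, so an instance can be passed to wide-char APIs.
	operator const wchar_t* () const { return m_wc; }

private:
	// Not copyable: a copy would delete the same buffer twice.
	cwchar_t(const cwchar_t&);
	cwchar_t& operator=(const cwchar_t&);

	wchar_t* m_wc;
};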

Maybe someone likes it.

@TBProAudio
Contributor Author

TBProAudio commented Apr 20, 2021

Seems to be my fault, sorry. Works as expected :-)

One more question:

What is the correct way to show "ä", "ö", "ü" with g.DrawText(...)?

BTW: The IPLUG2 popup menu seems to have problems handling "ä", "ö", "ü"; the system popup shows them correctly ...

@TBProAudio
Contributor Author

Hi Alex,

After some tests on Mac I found a curious issue (for me): in some strings (char *) a special character is encoded with 2 bytes, in others with 3 bytes. In the debugger both look the same, but they have different lengths. Do you know how to detect the 2-byte or 3-byte scenario? And how to convert properly?
Thank you

@AlexHarker
Collaborator

It is usual for UTF8 to use a variable number of bytes for encoding characters (between 1 and 4) - this should happen on all platforms.

https://en.wikipedia.org/wiki/UTF-8
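
A minimal sketch of how the variable length can be inspected. `Utf8SequenceLength` and `Utf8CodePointCount` are hypothetical helpers, not iPlug2 functions. They only look at bit patterns and do not validate the string.

```cpp
#include <cstddef>
#include <string>

// Number of bytes announced by a UTF-8 lead byte, or 0 for a continuation
// byte (10xxxxxx) or a byte that cannot start a sequence.
// Only the bit pattern is checked; overlong or out-of-range leads are not rejected.
inline int Utf8SequenceLength(unsigned char lead)
{
  if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;
}

// Count code points by counting every byte that is not a continuation byte.
inline std::size_t Utf8CodePointCount(const std::string& s)
{
  std::size_t count = 0;
  for (unsigned char c : s)
  {
    if ((c & 0xC0) != 0x80)
      ++count;
  }
  return count;
}
```

Comparing `s.size()` with `Utf8CodePointCount(s)` shows whether two strings that look the same in the debugger really contain the same number of code points.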

@TBProAudio
Contributor Author

Thanks, complicated stuff :-)
In any case it seems that a 3-byte special character shows garbage with g.DrawText(), while 2-byte ones seem to work ...
Any idea?

@AlexHarker
Collaborator

Which backends and platforms?

@TBProAudio
Contributor Author

TBProAudio commented Apr 23, 2021

Sorry, I forgot: Mac, nanovg, e.g. APP (but I guess the other plug formats as well)

@TBProAudio
Contributor Author

one more note:
pGraphics->AttachPopupMenuControl(DEFAULT_LABEL_TEXT) is enabled.
If disabled (i.e. the Mac platform popup menu is used), strings with 3-byte special characters are displayed properly.

@AlexHarker
Collaborator

Are you able to test with skia, and (just for sanity) with a different control? One that just draws a problem string would be fine.

@AlexHarker
Collaborator

BTW - I've just tried putting the letters into the text in iPlugEffect and everything draws correctly with NanoVG, so there's some step we've not got the same - this could be control-specific or to do with how the string is being generated.

FWIW as far as I can tell ä should be encoded as 2 bytes in UTF8.

@TBProAudio
Contributor Author

yes, I will try with skia.

Just to sum up (all for Mac, nanovg):
IPLUG2 popup: does not work
system popup: works
g.DrawText(): does not work (any control)

@TBProAudio
Contributor Author

yes, UTF8 2 bytes works
UTF8 3 bytes does not. But should it or not?
I think it is a 3 byte UTF8 problem ...

@AlexHarker
Collaborator

Yes - I've managed to confirm that here now - this seems like a potential bug, although I guess we should also check that the font you have supports the character in question - which might involve looking in a font editor.

@AlexHarker
Collaborator

I get a space for a 3-byte char BTW, rather than "garbage" - hence my question about missing characters.

@TBProAudio
Contributor Author

correct, a space is drawn.
So, "böse" becomes "bo se".
OK I see, could be a font thing ...

This is why I asked if there is a method available on Mac to convert 3 bytes special characters to 2 bytes ...

@AlexHarker
Collaborator

ö should not be a three byte encoding - it should have 2 bytes only:

https://www.compart.com/en/unicode/U+00F6

Each Unicode encoding should (as far as I know) be unique.

@AlexHarker
Collaborator

It looks possible that the umlaut might have been separately encoded in your example as a combining diaeresis (2 bytes):

https://www.compart.com/en/unicode/U+0308

and a lowercase o (1 byte). I tried that here and I also get a space, but it's intriguing that it is encoded that way, rather than as the Unicode code point I linked above.
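
A minimal sketch of converting the decomposed form (base letter followed by U+0308, 3 bytes in total) into the precomposed form (2 bytes). `ComposeGermanUmlauts` is a hypothetical helper that covers only the six German umlauts and is not a general Unicode normalization. On Mac, a complete conversion to precomposed form is available from the system, for example via `CFStringNormalize` with `kCFStringNormalizationFormC`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace "base letter + combining diaeresis (U+0308,
// UTF-8 bytes CC 88)" with the precomposed umlaut, for a/o/u/A/O/U only.
// Everything else is copied through unchanged.
inline std::string ComposeGermanUmlauts(const std::string& in)
{
  std::string out;
  out.reserve(in.size());

  for (std::size_t i = 0; i < in.size(); ++i)
  {
    // Is the current byte followed by the two bytes of U+0308?
    const bool diaeresisNext = i + 2 < in.size() &&
      static_cast<unsigned char>(in[i + 1]) == 0xCC &&
      static_cast<unsigned char>(in[i + 2]) == 0x88;

    const char* composed = nullptr;
    if (diaeresisNext)
    {
      switch (in[i])
      {
        case 'a': composed = "\xC3\xA4"; break; // U+00E4
        case 'o': composed = "\xC3\xB6"; break; // U+00F6
        case 'u': composed = "\xC3\xBC"; break; // U+00FC
        case 'A': composed = "\xC3\x84"; break; // U+00C4
        case 'O': composed = "\xC3\x96"; break; // U+00D6
        case 'U': composed = "\xC3\x9C"; break; // U+00DC
        default: break;
      }
    }

    if (composed)
    {
      out += composed;
      i += 2; // skip the two bytes of the combining diaeresis
    }
    else
    {
      out += in[i];
    }
  }
  return out;
}
```

With this, the 6-byte decomposed "böse" (62 6F CC 88 73 65) becomes the 5-byte precomposed "böse" (62 C3 B6 73 65), which is the form that drew correctly in the tests above.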
