
Input codepoints bigger than 2 bytes (e.g. emojis) get cast into a single 2-byte char #7227

Open
1 task done
LobbyDivinus opened this issue Sep 19, 2023 · 7 comments

Comments

@LobbyDivinus

LobbyDivinus commented Sep 19, 2023

Issue details

I noticed in my game that certain characters (e.g. emojis like 🏗️) are not reported correctly to the application via InputProcessor.keyTyped.

What happens in detail:

  1. Let's type the emoji 🏗️, which consists of the codepoint 0x1F3D7 (that is, more than two bytes, so it has to be represented using multiple chars).
  2. The GLFWCharCallback in https://github.com/libgdx/libgdx/blob/bb799a9cd0a92cadb0425acd6064654cd7a4d6c8/backends/gdx-backend-lwjgl3/src/com/badlogic/gdx/backends/lwjgl3/DefaultLwjgl3Input.java#L51C14-L59C4 receives the correct (4-byte) integer codepoint value.
  3. However, the callback casts the codepoint to a (2-byte) char and only puts that into the eventQueue for the application to consume. In my example that is 0xF3D7, which is not a valid character.

A solution may be to also put the upper 2 bytes of the integer codepoint into the eventQueue if they are non-zero.
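For illustration, here is a minimal plain-Java sketch of the narrowing cast described in step 3 (no libGDX involved; the hex values follow from the codepoint 0x1F3D7 above):

    public class CastDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7;           // what GLFW reports for 🏗️
            char forwarded = (char) codepoint; // what the backend currently enqueues
            // Prints F3D7: the narrowing cast keeps only the low 16 bits,
            // so the emoji's identity is lost before keyTyped is ever called.
            System.out.printf("%X%n", (int) forwarded);
        }
    }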

Reproduction steps/code

See the description above. All you need to do is check the values that the keyTyped(char character) method of a registered InputProcessor gets called with when typing 🏗️ or similar codepoints.

As this issue relies on actual key input, I don't know how tests can be written for it.
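A minimal listener for checking this manually, assuming a standard LWJGL3 launcher around it (the class name and log format are just placeholders):

    import com.badlogic.gdx.ApplicationAdapter;
    import com.badlogic.gdx.Gdx;
    import com.badlogic.gdx.InputAdapter;

    public class KeyTypedLogger extends ApplicationAdapter {
        @Override
        public void create() {
            Gdx.input.setInputProcessor(new InputAdapter() {
                @Override
                public boolean keyTyped(char character) {
                    // With libGDX 1.12.0 this logs F3D7 when typing 🏗️, rather than
                    // the surrogate pair D83C followed by DFD7 discussed below.
                    System.out.printf("keyTyped: %X%n", (int) character);
                    return true;
                }
            });
        }
    }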

Version of libGDX and/or relevant dependencies

libGDX 1.12.0 with the LWJGL3 backend.

Please select the affected platforms

  • Windows
    Probably all platforms, but I tested on Windows only.
@LobbyDivinus
Author

LobbyDivinus commented Sep 19, 2023

Apparently, Character.highSurrogate and Character.lowSurrogate can be used to convert an integer codepoint into a char surrogate pair that is compatible with String's codepoint methods.

To detect whether a codepoint is supplementary in the first place, Character.isSupplementaryCodePoint can be used.
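A small plain-Java check of what these JDK helpers do for the codepoint from the issue:

    public class SurrogateDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7; // 🏗️
            if (Character.isSupplementaryCodePoint(codepoint)) {
                char high = Character.highSurrogate(codepoint); // '\uD83C'
                char low = Character.lowSurrogate(codepoint);   // '\uDFD7'
                String s = new String(new char[] {high, low});
                // true: the surrogate pair round-trips back to the original codepoint
                System.out.println(s.codePointAt(0) == codepoint);
            }
        }
    }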

@tommyettinger
Member

Yeah... you're just scratching the surface of what makes emoji so incredibly challenging to get right. For example, this emoji "👨🏽‍❤️‍👨🏿" is made of the 12 chars {'\uD83D', '\uDC68', '\uD83C', '\uDFFD', '\u200D', '\u2764', '\uFE0F', '\u200D', '\uD83D', '\uDC68', '\uD83C', '\uDFFF'}. It actually involves multiple distinct emoji characters joined into one with a ZWJ (Zero-Width Joiner) character (which is of course zero-width and invisible). There are also skin tone modifiers applied to the two people. I'm not sure if Windows 11 can enter some of the most complex multi-part emoji into a form; Windows 10 can't to my knowledge. Windows 10 can specify a skin tone for one person, but not two people in one emoji char.
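To see that structure yourself, here is a plain-Java sketch that rebuilds the string from the 12 chars quoted above and dumps its codepoints:

    public class ZwjDump {
        public static void main(String[] args) {
            // The 12 UTF-16 chars listed in the comment above
            String emoji = new String(new char[] {
                '\uD83D', '\uDC68', '\uD83C', '\uDFFD', '\u200D', '\u2764',
                '\uFE0F', '\u200D', '\uD83D', '\uDC68', '\uD83C', '\uDFFF'});
            System.out.println(emoji.length());                          // 12 chars
            System.out.println(emoji.codePointCount(0, emoji.length())); // 8 codepoints
            // Prints U+1F468 U+1F3FD U+200D U+2764 U+FE0F U+200D U+1F468 U+1F3FF:
            // man, skin tone modifier, ZWJ, heart, variation selector, ZWJ, man, skin tone modifier
            emoji.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        }
    }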

You might want to look at how TextraTypist handles this. It has a separate TextureAtlas that allows looking up the emoji bitmaps given their human-enterable names, which are Strings and not single chars (because that emoji I just showed is 12 chars, it really needs a String).

We can't change this easily without making keyTyped() return a String, which would be quite strange if each keystroke were its own String. That many String objects would also make Android's underpowered GC contemplate catching fire to end its suffering. There is probably some partial solution that gets words as they are entered, or something? I don't really know how native Android and iOS apps handle their autocorrect and emoji entry.

I think this should be closed as not planned, but I'd like to hear what other people think.

@LobbyDivinus
Author

Yes, emojis can be hard. My argument would be that, given that GLFW sends a 32-bit code point, it may be beneficial to forward that exact code point to the application as a surrogate pair and let the application handle it (assuming the application builds up an input string by appending the characters it receives via keyTyped). From my testing, the GLFW callback gets called multiple times when multiple code points make up a single emoji. Issuing keyTyped multiple times, depending on the code point entered, should therefore not be a big deal: it wouldn't require changes to existing applications, and the interface would not change.

On the contrary, I would argue that the current approach is kind of broken, as it removes information and by doing so effectively sends invalid characters to keyTyped. If issuing keyTyped potentially multiple times to support surrogate pairs is not an option, I'd propose not sending keyTyped at all in this case, as the application cannot know whether a character was part of a supplementary code point.
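A plain-Java sketch of why two keyTyped calls per supplementary code point are enough for applications that simply append the received characters:

    public class AppendDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7; // 🏗️, as delivered by GLFW
            StringBuilder text = new StringBuilder();
            // The two keyTyped(char) deliveries proposed above:
            text.append(Character.highSurrogate(codepoint));
            text.append(Character.lowSurrogate(codepoint));
            // true: the appended surrogate pair reassembles into the original emoji
            System.out.println(text.codePointAt(0) == codepoint);
        }
    }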

@LobbyDivinus
Author

LobbyDivinus commented Sep 25, 2023

In case someone else stumbles across this issue: I was able to fix it without modifying libGDX by using inheritance and reflection.

Basically, I wrap the default char callback with a custom one that handles 32-bit codepoints in the following way:

        // defaultCharCallback is the backend's original GLFWCharCallback (obtained
        // via reflection, see below); charCallback is the wrapper installed instead.
        charCallback = new GLFWCharCallback() {
            @Override
            public void invoke(long window, int codepoint) {
                if (Character.isSupplementaryCodePoint(codepoint)) {
                    // A supplementary codepoint doesn't fit into one char:
                    // forward it as a UTF-16 surrogate pair, i.e. two keyTyped events.
                    defaultCharCallback.invoke(window, Character.highSurrogate(codepoint));
                    defaultCharCallback.invoke(window, Character.lowSurrogate(codepoint));
                } else {
                    defaultCharCallback.invoke(window, codepoint);
                }
            }
        };

The entry point is to override createInput() of Lwjgl3Application so that it returns a custom DefaultLwjgl3Input subclass, which installs its own callback in windowHandleChanged() using GLFW.glfwSetCharCallback(windowHandle, charCallback), as sketched below.
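For anyone replicating this, a rough sketch of that wiring follows. Treat it as an outline under stated assumptions rather than drop-in code: the field name "charCallback", the DefaultLwjgl3Input(Lwjgl3Window) constructor, and the exact signatures of windowHandleChanged() and createInput() are taken from this thread and the libGDX 1.12.0 sources and may differ in other versions.

    import java.lang.reflect.Field;

    import org.lwjgl.glfw.GLFW;
    import org.lwjgl.glfw.GLFWCharCallback;

    import com.badlogic.gdx.backends.lwjgl3.DefaultLwjgl3Input;
    import com.badlogic.gdx.backends.lwjgl3.Lwjgl3Window;

    public class SurrogatePairInput extends DefaultLwjgl3Input {

        public SurrogatePairInput(Lwjgl3Window window) {
            super(window);
        }

        @Override
        public void windowHandleChanged(long windowHandle) {
            super.windowHandleChanged(windowHandle); // installs the stock callbacks
            final GLFWCharCallback defaultCharCallback = lookUpDefaultCharCallback();
            // Replace the char callback with one that splits supplementary codepoints
            // into a surrogate pair before forwarding them to the default handler.
            // (For brevity this sketch does not free the previously installed callback.)
            GLFW.glfwSetCharCallback(windowHandle, new GLFWCharCallback() {
                @Override
                public void invoke(long window, int codepoint) {
                    if (Character.isSupplementaryCodePoint(codepoint)) {
                        defaultCharCallback.invoke(window, Character.highSurrogate(codepoint));
                        defaultCharCallback.invoke(window, Character.lowSurrogate(codepoint));
                    } else {
                        defaultCharCallback.invoke(window, codepoint);
                    }
                }
            });
        }

        // The backend's own callback is not visible to subclasses, hence reflection
        // ("charCallback" is the assumed field name inside DefaultLwjgl3Input).
        private GLFWCharCallback lookUpDefaultCharCallback() {
            try {
                Field field = DefaultLwjgl3Input.class.getDeclaredField("charCallback");
                field.setAccessible(true);
                return (GLFWCharCallback) field.get(this);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("charCallback not found; check your libGDX version", e);
            }
        }
    }

Hooked up from the desktop launcher roughly like this (MyGame and config stand in for your own listener and Lwjgl3ApplicationConfiguration):

    new Lwjgl3Application(new MyGame(), config) {
        @Override
        protected Lwjgl3Input createInput(Lwjgl3Window window) {
            return new SurrogatePairInput(window);
        }
    };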

Et voilà, we can input emojis:
[image]

@tommyettinger
Member

This is a clever solution! I don't see where it needs reflection, unless I'm missing something. That's not a bad thing! Graal Native Image has a hard time with reflection even in LWJGL3 projects. I think this can be done with just some additional anonymous class extensions for now... How does this handle emoji with skin color specifiers, such as 👨🏾‍💻? I'd guess it would be multiple sequential high-low pairs.

@LobbyDivinus
Author

Thanks! I needed reflection to get the old callback, which I called defaultCharCallback in my code, as it is not accessible in my subclass (it is package-private).

Yes, the specifiers get reported to the application as multiple high-low pairs. The game does not handle them correctly yet, but compared to before it now receives all the data needed to handle them in the future 😁 My main concern was that most emojis could not be entered at all and resulted in broken characters, which is solved now.

@tommyettinger
Member

This might be appropriate to add to libGDX's input handling, since emoji are not going away anytime soon... I think we need to make sure something can render the emoji in libGDX, and I don't know if BitmapFont can.
