
Input codepoints bigger than 2 bytes (e.g. emojis) get cast into a single 2-byte char #7227

Open
1 task done
LobbyDivinus opened this issue Sep 19, 2023 · 7 comments

Comments

@LobbyDivinus

LobbyDivinus commented Sep 19, 2023

Issue details

I noticed in my game that certain characters (e.g. emojis like 🏗️) are not reported correctly to the application via InputProcessor.keyTyped.

What happens in detail:

  1. Let's type the emoji 🏗️, which consists of the codepoint 0x1F3D7 (that is, more than two bytes, so it has to be represented using multiple chars).
  2. The GLFWCharCallback in https://github.com/libgdx/libgdx/blob/bb799a9cd0a92cadb0425acd6064654cd7a4d6c8/backends/gdx-backend-lwjgl3/src/com/badlogic/gdx/backends/lwjgl3/DefaultLwjgl3Input.java#L51C14-L59C4 receives the correct (4-byte) integer codepoint value.
  3. However, the callback casts the codepoint to a (2-byte) char and only puts that into the eventQueue for the application to consume. In my example that is 0xF3D7, which is not a valid character.

A solution may be to also put the upper 2 bytes of the integer codepoint into the eventQueue if they are non-zero.
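For illustration, here is a minimal plain-Java sketch of the narrowing cast described in step 3 (no libGDX involved; the hex values follow from the codepoint 0x1F3D7 above):

    public class CastDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7;           // what GLFW reports for 🏗️
            char forwarded = (char) codepoint; // what the backend currently enqueues
            // Prints F3D7: the narrowing cast keeps only the low 16 bits,
            // so the emoji's identity is lost before keyTyped is ever called.
            System.out.printf("%X%n", (int) forwarded);
        }
    }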

Reproduction steps/code

See the description above. All you need to do is check the values that the keyTyped(char character) method of a registered InputProcessor gets called with when typing 🏗️ or similar codepoints.

As this issue relies on actual key input, I don't know how tests can be written for it.
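A minimal listener for checking this manually, assuming a standard LWJGL3 launcher around it (the class name and log format are just placeholders):

    import com.badlogic.gdx.ApplicationAdapter;
    import com.badlogic.gdx.Gdx;
    import com.badlogic.gdx.InputAdapter;

    public class KeyTypedLogger extends ApplicationAdapter {
        @Override
        public void create() {
            Gdx.input.setInputProcessor(new InputAdapter() {
                @Override
                public boolean keyTyped(char character) {
                    // With libGDX 1.12.0 this logs F3D7 when typing 🏗️, rather than
                    // the surrogate pair D83C followed by DFD7 discussed below.
                    System.out.printf("keyTyped: %X%n", (int) character);
                    return true;
                }
            });
        }
    }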

Version of libGDX and/or relevant dependencies

libGDX 1.12.0 with the LWJGL3 backend.

Please select the affected platforms

  • Windows
    Probably all platforms, but I tested on Windows only.
@LobbyDivinus
Author

LobbyDivinus commented Sep 19, 2023

Apparently, Character.highSurrogate and Character.lowSurrogate can be used to convert an integer codepoint into a char surrogate pair that is compatible with String's codepoint methods.

To detect whether a codepoint is supplementary in the first place, Character.isSupplementaryCodePoint can be used.
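A small plain-Java check of what these JDK helpers do for the codepoint from the issue:

    public class SurrogateDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7; // 🏗️
            if (Character.isSupplementaryCodePoint(codepoint)) {
                char high = Character.highSurrogate(codepoint); // '\uD83C'
                char low = Character.lowSurrogate(codepoint);   // '\uDFD7'
                String s = new String(new char[] {high, low});
                // true: the surrogate pair round-trips back to the original codepoint
                System.out.println(s.codePointAt(0) == codepoint);
            }
        }
    }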

@tommyettinger
Member

Yeah... you're just scratching the surface of what makes emoji so incredibly challenging to get right. For example, this emoji "👨🏽‍❤️‍👨🏿" is made of the 12 chars {'\uD83D', '\uDC68', '\uD83C', '\uDFFD', '\u200D', '\u2764', '\uFE0F', '\u200D', '\uD83D', '\uDC68', '\uD83C', '\uDFFF'}. It actually involves multiple distinct emoji characters joined into one with a ZWJ (Zero-Width Joiner) character (which is of course zero-width and invisible). There are also skin tone modifiers applied to the two people. I'm not sure if Windows 11 can enter some of the most complex multi-part emoji into a form; Windows 10 can't to my knowledge. Windows 10 can specify a skin tone for one person, but not two people in one emoji char.
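To see that structure yourself, here is a plain-Java sketch that rebuilds the string from the 12 chars quoted above and dumps its codepoints:

    public class ZwjDump {
        public static void main(String[] args) {
            // The 12 UTF-16 chars listed in the comment above
            String emoji = new String(new char[] {
                '\uD83D', '\uDC68', '\uD83C', '\uDFFD', '\u200D', '\u2764',
                '\uFE0F', '\u200D', '\uD83D', '\uDC68', '\uD83C', '\uDFFF'});
            System.out.println(emoji.length());                          // 12 chars
            System.out.println(emoji.codePointCount(0, emoji.length())); // 8 codepoints
            // Prints U+1F468 U+1F3FD U+200D U+2764 U+FE0F U+200D U+1F468 U+1F3FF:
            // man, skin tone modifier, ZWJ, heart, variation selector, ZWJ, man, skin tone modifier
            emoji.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        }
    }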

You might want to look at how TextraTypist handles this. It has a separate TextureAtlas that allows looking up the emoji bitmaps given their human-enterable names, which are Strings and not single chars (because that emoji I just showed is 12 chars, it really needs a String).

We can't change this easily without making keyTyped() return a String, which would be quite strange if each keystroke were its own String. That many String objects would also make Android's underpowered GC contemplate catching fire to end its suffering. There is probably some partial solution that gets words as they are entered, or something? I don't really know how native Android and iOS apps handle their autocorrect and emoji entry.

I think this should be closed as not planned, but I'd like to hear what other people think.

@LobbyDivinus
Author

Yes, emojis can be hard. My argument would be that, given that GLFW sends a 32-bit code point, it may be beneficial to forward that exact code point to the application as a surrogate pair and let the application handle it (assuming the application builds up an input string by appending the characters it receives via keyTyped). From my testing, the GLFW callback gets called multiple times when multiple code points make up a single emoji. Issuing keyTyped multiple times, depending on the code point entered, should therefore not be a big deal: it wouldn't require changes to existing applications, and the interface would not change.

On the contrary, I would argue that the current approach is kind of broken, as it removes information and by doing so effectively sends invalid characters to keyTyped. If issuing keyTyped potentially multiple times to support surrogate pairs is not an option, I'd propose not sending keyTyped at all in this case, as the application cannot know whether a character was part of a supplementary code point.
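A plain-Java sketch of why two keyTyped calls per supplementary code point are enough for applications that simply append the received characters:

    public class AppendDemo {
        public static void main(String[] args) {
            int codepoint = 0x1F3D7; // 🏗️, as delivered by GLFW
            StringBuilder text = new StringBuilder();
            // The two keyTyped(char) deliveries proposed above:
            text.append(Character.highSurrogate(codepoint));
            text.append(Character.lowSurrogate(codepoint));
            // true: the appended surrogate pair reassembles into the original emoji
            System.out.println(text.codePointAt(0) == codepoint);
        }
    }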

@LobbyDivinus
Author

LobbyDivinus commented Sep 25, 2023

In case someone else stumbles across this issue: I was able to fix it without modifying libGDX by using inheritance and reflection.

Basically, I wrap the default char callback with a custom one that handles 32-bit codepoints in the following way:

        // defaultCharCallback is the backend's original GLFWCharCallback (obtained
        // via reflection, see below); charCallback is the wrapper installed instead.
        charCallback = new GLFWCharCallback() {
            @Override
            public void invoke(long window, int codepoint) {
                if (Character.isSupplementaryCodePoint(codepoint)) {
                    // A supplementary codepoint doesn't fit into one char:
                    // forward it as a UTF-16 surrogate pair, i.e. two keyTyped events.
                    defaultCharCallback.invoke(window, Character.highSurrogate(codepoint));
                    defaultCharCallback.invoke(window, Character.lowSurrogate(codepoint));
                } else {
                    defaultCharCallback.invoke(window, codepoint);
                }
            }
        };

The entry point is to override createInput() of Lwjgl3Application so that it returns a custom DefaultLwjgl3Input subclass, which installs its own callback in windowHandleChanged() using GLFW.glfwSetCharCallback(windowHandle, charCallback), as sketched below.
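For anyone replicating this, a rough sketch of that wiring follows. Treat it as an outline under stated assumptions rather than drop-in code: the field name "charCallback", the DefaultLwjgl3Input(Lwjgl3Window) constructor, and the exact signatures of windowHandleChanged() and createInput() are taken from this thread and the libGDX 1.12.0 sources and may differ in other versions.

    import java.lang.reflect.Field;

    import org.lwjgl.glfw.GLFW;
    import org.lwjgl.glfw.GLFWCharCallback;

    import com.badlogic.gdx.backends.lwjgl3.DefaultLwjgl3Input;
    import com.badlogic.gdx.backends.lwjgl3.Lwjgl3Window;

    public class SurrogatePairInput extends DefaultLwjgl3Input {

        public SurrogatePairInput(Lwjgl3Window window) {
            super(window);
        }

        @Override
        public void windowHandleChanged(long windowHandle) {
            super.windowHandleChanged(windowHandle); // installs the stock callbacks
            final GLFWCharCallback defaultCharCallback = lookUpDefaultCharCallback();
            // Replace the char callback with one that splits supplementary codepoints
            // into a surrogate pair before forwarding them to the default handler.
            // (For brevity this sketch does not free the previously installed callback.)
            GLFW.glfwSetCharCallback(windowHandle, new GLFWCharCallback() {
                @Override
                public void invoke(long window, int codepoint) {
                    if (Character.isSupplementaryCodePoint(codepoint)) {
                        defaultCharCallback.invoke(window, Character.highSurrogate(codepoint));
                        defaultCharCallback.invoke(window, Character.lowSurrogate(codepoint));
                    } else {
                        defaultCharCallback.invoke(window, codepoint);
                    }
                }
            });
        }

        // The backend's own callback is not visible to subclasses, hence reflection
        // ("charCallback" is the assumed field name inside DefaultLwjgl3Input).
        private GLFWCharCallback lookUpDefaultCharCallback() {
            try {
                Field field = DefaultLwjgl3Input.class.getDeclaredField("charCallback");
                field.setAccessible(true);
                return (GLFWCharCallback) field.get(this);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("charCallback not found; check your libGDX version", e);
            }
        }
    }

Hooked up from the desktop launcher roughly like this (MyGame and config stand in for your own listener and Lwjgl3ApplicationConfiguration):

    new Lwjgl3Application(new MyGame(), config) {
        @Override
        protected Lwjgl3Input createInput(Lwjgl3Window window) {
            return new SurrogatePairInput(window);
        }
    };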

Et voilà, we can input emojis:
[image]

@tommyettinger
Member

This is a clever solution! I don't see where it needs reflection, unless I'm missing something. That's not a bad thing! Graal Native Image has a hard time with reflection even in LWJGL3 projects. I think this can be done with just some additional anonymous class extensions for now... How does this handle emoji with skin color specifiers, such as 👨🏾‍💻? I'd guess it would be multiple sequential high-low pairs.

@LobbyDivinus
Author

Thanks! I needed reflection to get the old callback, which I called defaultCharCallback in my code, as it is not accessible in my subclass (it is package-private).

Yes, the specifiers get reported to the application as multiple high-low pairs. The game does not handle them correctly yet, but compared to before it now receives all the data needed to handle them in the future 😁 My main concern was that most emojis could not be entered at all and resulted in broken characters, which is solved now.

@tommyettinger
Member

This might be appropriate to add to libGDX's input handling, since emoji are not going away anytime soon... I think we need to make sure something can render the emoji in libGDX, and I don't know if BitmapFont can.
