Allow Unicode characters in GDScript identifiers #916

Closed
Zireael07 opened this issue May 27, 2020 · 46 comments · Fixed by godotengine/godot#71676

@Zireael07

Describe the project you are working on:
2d space game

Describe the problem or limitation you are having in your project:
I can't use scientific symbols or accented letters in variable names. My native language has accented letters, which often form minimal pairs with unaccented ones, and scientific symbols would massively shorten some of the variable names I use.

Another example use case: godotengine/godot#24785 (comment)

Describe the feature / enhancement and how it helps to overcome the problem or limitation:
Allow Unicode characters in GDScript identifiers.

Describe how your proposal will work, with code, pseudocode, mockups, and/or diagrams:

If this enhancement will not be used often, can it be worked around with a few lines of script?:
Nope, requires core changes (parser)

Is there a reason why this should be core and not an add-on in the asset library?:
Not possible to do via add-on due to parser changes.

Original issue: godotengine/godot#24785

IIRC this is not covered in @vnen's GDScript rework.

@Calinou
Member

Calinou commented May 27, 2020

I think @vnen was working on adding this a few days ago.

@vnen
Member

vnen commented May 27, 2020

Well, this works. Question is if we really want it:

[Image: gdscript-emoji]

@vnen
Member

vnen commented May 27, 2020

BTW, I did this very naively in this example, accepting anything that's beyond basic ASCII range. This would accept symbols and things that look like space to be part of identifiers.

Doing this properly would require following the Unicode Standard Annex 31: https://unicode.org/reports/tr31

Or maybe we can expect users to use this responsibly and any report of those characters being allowed will be closed as not-a-bug.
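
To make the concern concrete, here is a minimal Python sketch of the naive rule described above (accept anything beyond the basic ASCII range). It is not the actual GDScript tokenizer; the function names `is_naive_identifier_char` and `scan_identifier` are hypothetical, and the sketch ignores the rule that an identifier cannot start with a digit.

```python
def is_naive_identifier_char(ch: str) -> bool:
    """Naive rule: ASCII letters, digits, underscore, or anything above U+007F."""
    return ch == "_" or (ch.isascii() and ch.isalnum()) or ord(ch) > 0x7F


def scan_identifier(source: str, start: int = 0) -> str:
    """Consume characters from `start` for as long as the naive rule accepts them."""
    end = start
    while end < len(source) and is_naive_identifier_char(source[end]):
        end += 1
    return source[start:end]


# A no-break space (U+00A0) does not end the identifier under the naive rule.
print(repr(scan_identifier("speed\u00a0limit = 3")))
# Curly quotes (U+201C, U+201D) are swallowed too, so this is not read as a string.
print(repr(scan_identifier("\u201cstring\u201d")))
# A regular ASCII space still terminates the token.
print(repr(scan_identifier("speed limit = 3")))
```

Under this rule, the first two inputs are each consumed whole as a single identifier, which is the kind of hard-to-see mistake the comment above worries about.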

@Calinou
Member

Calinou commented May 27, 2020

@vnen We could probably disallow having irregular whitespace characters anywhere else than in strings and comments.

@txj-mssl

I would like to support the use of Unicode for identifiers.
What is the final decision on this: will it be supported or not?

@txj-mssl

The popular C# and Python both support Unicode, and if Godot wants to have more users in the non-English-speaking world, it must support Unicode.

@Calinou changed the title from "Unicode characters in GDScript identifiers" to "Allow Unicode characters in GDScript identifiers" on May 28, 2020
@vnen
Member

vnen commented May 28, 2020

> I would like to support the use of Unicode for identifiers.
> What is the final decision on this: will it be supported or not?

This is an open discussion, there's no conclusion yet.

There are a few challenges to overcome:

  1. The code editor font has no fallback. So you need a font with all glyphs, otherwise you might type a character that doesn't show in the editor. This needs to be fixed.
  2. As I mentioned, we should probably follow UAX#31 (like Python does) instead of allowing any character in the identifier. Otherwise space characters could be inserted in an identifier and be tricky to see. Maybe it's not a problem and we can let users misuse it, but I want some conclusion in this regard.

> We could probably disallow having irregular whitespace characters anywhere else than in strings and comments.

That's already the case, I think. But to forbid those inside identifiers would need a blacklist of sorts.
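
Since Python is cited above as following UAX#31, its built-in `str.isidentifier()` gives a quick way to see what such a rule admits. This is only a sketch of Python's behaviour on a few sample strings chosen for this thread; it says nothing about what GDScript would end up accepting.

```python
samples = {
    "rho": "ρ",                           # Greek small letter rho
    "chinese": "经络",                    # Jing Luo
    "accented": "été",                    # accented Latin letters
    "emoji": "\U0001F4A9",                # pile of poo emoji
    "sun": "\u2609",                      # astronomical symbol for the Sun
    "nbsp_inside": "a\u00a0b",            # no-break space between two letters
    "curly_quotes": "\u201cstring\u201d",  # stylized quotation marks
}

# str.isidentifier() applies Python's own UAX#31-based identifier rule.
results = {label: text.isidentifier() for label, text in samples.items()}

for label, accepted in results.items():
    print(f"{label:13} {accepted}")
```

Letters from any script pass, while symbols, emoji, unusual spaces and stylized quotes are rejected.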

@mischiefaaron

Yes, I very much want it. But perhaps it might cause trouble to implement something like that in declarations. For strings & comments, however, I am fully on board.

@vnen
Member

vnen commented Jun 15, 2020

> Yes, I very much want it. But perhaps it might cause trouble to implement something like that in declarations. For strings & comments, however, I am fully on board.

It already works in strings and comments. The problem, as I mentioned, is that the code editor font doesn't have full Unicode coverage and the editor doesn't allow fallback fonts. So if you want emoji or something, you have to change to a font which supports those (like I did in the example image), and I couldn't find a monospace font that worked.

@mischiefaaron

So what we would need to do is find a monospace font that would look good in Godot's code editor, has full unicode support and the proper licensing... then we can make it a proposal so that emoji support could be implemented, correct?

@txj-mssl

That's probably it, I don't know.

@Calinou
Member

Calinou commented Jun 16, 2020

@agameraaron I don't know of any open source monospace font that includes good emoji support.

Hack and its parent DejaVu Sans Mono have a very extensive character set, but they don't support colored emoji. (Monochrome emoji can be tough to understand, so I wouldn't recommend settling for them.)

Also, why wouldn't the code editor font allow fallbacks? It uses a DynamicFont just like everything else in the Godot editor.

@vnen
Member

vnen commented Jun 16, 2020

@Calinou

> Also, why wouldn't the code editor font allow fallbacks? It uses a DynamicFont just like everything else in the Godot editor.

The problem is that the settings only ask for a font path, not a DynamicFont. It doesn't have any fallback option. That's probably easy to solve but right now it's an issue.

@vnen
Member

vnen commented Jun 16, 2020

@agameraaron

> So what we would need to do is find a monospace font that would look good in Godot's code editor, has full unicode support and the proper licensing... then we can make it a proposal so that emoji support could be implemented, correct?

No, it is already supported (in comments and strings that is). It's just that the editor default font doesn't have emojis. So if you have emojis in there they won't be shown, which can be confusing (but nothing is really stopping one from doing it). If you use an external editor you can see those characters.

What we need is a fallback font setting to show all characters by default. It doesn't matter much if emojis are monospaced IMO. But the regular characters should be.

The proposal here is for identifiers, which currently don't allow anything other than basic ASCII letters and numbers (and underscore). But that requires following the standard, at least in my view, which is not trivial.

@Calinou
Member

Calinou commented Jun 17, 2020

@vnen Right, that makes sense. We should probably find a way to load the system emoji font as a fallback, as emoji fonts are notoriously large in terms of file size (bundling them in the binary would likely enlarge it significantly). However, the exact paths for these fonts are OS-specific and often require guesswork.

@aaronfranke
Member

aaronfranke commented Jul 24, 2020

Even if this was supported, isn't it usually considered best practice to write code in English?

Also, I have cross-language portability concerns with this proposal, since 💩 is not a valid identifier in C#.

EDIT: But yes let's use ID_Start and ID_Continue for identifiers.
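
As a rough illustration of what ID_Start and ID_Continue mean, the sketch below approximates them with Unicode general categories from Python's `unicodedata` module. It is an approximation under stated assumptions: it ignores the Other_ID_Start and Other_ID_Continue exceptions and any normalization step, and it allows a leading underscore the way most languages do. The helper names are hypothetical.

```python
import unicodedata

# General categories that approximate ID_Start: letters and letter numbers.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Extra categories allowed after the first character: combining marks,
# decimal digits and connector punctuation.
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch: str) -> bool:
    """Approximate ID_Start, plus the underscore most languages allow."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch: str) -> bool:
    """Approximate ID_Continue."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_identifier_approx(text: str) -> bool:
    """True if the first character is a start character and the rest continue it."""
    return (
        bool(text)
        and is_id_start(text[0])
        and all(is_id_continue(ch) for ch in text[1:])
    )


for candidate in ["ρ", "x2", "2x", "_名前", "e\u0301", "\U0001F4A9", "a\u00a0b"]:
    print(repr(candidate), is_identifier_approx(candidate))
```

A real implementation would read the ID_Start and ID_Continue properties from Unicode data rather than rebuilding them from categories.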

@bojidar-bg

bojidar-bg commented Jul 25, 2020

> Also, I have cross-language portability concerns with this proposal, since 💩 is not a valid identifier in C#.

It is already possible to name a GDScript function `if`, `new`, or a myriad of other things. Likewise, you can name a C# function `or`. But, since both languages can access the call/get/set API, they can always call such functions through it.

(And GDNative can already assign arbitrary character arrays for function names, likely including empty strings. Restricting those so that all languages are "happy" is not going to be elegant.)

@vnen
Member

vnen commented Jul 25, 2020

> since 💩 is not a valid identifier in C#.

It's not valid in Python either. Hence my concern with following the proper UAX#31 standard, which would correctly disallow some weird stuff in identifiers in general, while still allowing multi-language support.

@txj-mssl

What I'm interested in is that variable names can use Chinese characters or French characters. I'm not interested in using emoticons for variable names.

@Zireael07
Author

Unless 'weird stuff' covers scientific symbols like https://en.wikipedia.org/wiki/Astronomical_symbols (mostly interested in Sun and Earth) or ρ (the lower case Greek letter rho) which is used for density, or other such stuff, I don't care. Emoji aren't the reason I posted the proposal, after all.

@vnen
Member

vnen commented Jul 29, 2020

@Zireael07 BTW those don't seem to be allowed in Python. Not sure about other languages.

For me, "weird" means things that get confusing, like other types of space or a quote character that makes an identifier look like a string (my main concern is editors/keyboards that might insert them, or a copy-paste with formatting). But again, maybe we can just not care at all and let users use whatever they want in identifiers.

@akien-mga
Member

I think we shouldn't focus discussion too much on esoteric "joke" uses like using emojis as identifiers, but indeed as @txj-mssl expresses it, being able to write identifiers in say Chinese is an important part of making the engine accessible.

Some of us might have a strong bias as native speakers of languages using Latin script, and comfortable with English, that identifiers/code should be in English as much as possible, but this is not true of everyone and some users might have valid reasons for wanting to write their code with e.g. Chinese characters.

So if it's easy to enable and doesn't impact performance, I think we should do it.

There might be additional complexity if some users want to use e.g. RTL scripts in their code (like Arabic identifiers), but I believe @bruvzg's work on Complex Text Layouts might already be capable of handling it (or close to it).

@akien-mga
Member

> For me, "weird" means things that get confusing, like other types of space or a quote character that makes an identifier look like a string (my main concern is editors/keyboards that might insert them, or a copy-paste with formatting). But again, maybe we can just not care at all and let users use whatever they want in identifiers.

If we're concerned about users adding non-printable Unicode characters by mistake, we could maybe add a warning for that (disabled by default, to allow Unicode out of the box). So users who suspect something can enable the warning and be notified of non-ASCII characters used in identifiers.
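
A minimal sketch of what such an opt-in warning could report, written in Python rather than inside the GDScript analyzer. The function name `non_ascii_report` and the output format are assumptions made for illustration; for simplicity it scans a whole line instead of only identifiers.

```python
import unicodedata


def non_ascii_report(line: str) -> list:
    """Return (column, code point, Unicode name) for every non-ASCII character."""
    return [
        (column, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for column, ch in enumerate(line, start=1)
        if ord(ch) > 0x7F
    ]


# The rho is intended; the no-break space after it was inserted by mistake.
for entry in non_ascii_report("var ρ\u00a0= 1.0"):
    print(entry)
```

Note that the report lists the intended ρ alongside the stray no-break space, so a user who writes identifiers in their own language would get one entry per such character.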

@vnen
Member

vnen commented Jan 24, 2021

I agree 100% with this. The question is mostly how (or rather, whether or not we should filter the list of allowed characters).

It's quite easy to just allow everything beyond the ASCII range, as those characters are not used in any token. Just like I did in my small sample (#916 (comment)). It would also not be hard to add a warning if any such characters are used, but if the user is already using identifiers like this, they won't be able to tell if there's some invisible character somewhere (which includes characters not supported by the editor font), as they would be flooded by this warning if it were turned on.

I believe we should follow the UAX#31 standard, which is meant for this purpose (it defines what is a valid identifier). This requires much more work (including reading the standard itself), but makes everything safer. However, I don't think it allows some special symbols that some people want to use (e.g. godotengine/godot#24785 (comment)). We could potentially break the standard to allow some extra symbols that people want to use, but I would rather avoid it if possible. Following the standard would still cover the case of people writing identifiers in their preferred language, which is the important part IMO.

@vnen
Member

vnen commented Jan 24, 2021

Also, before I had some concern with Unicode support across platforms, but now with the changes by @bruvzg this is not an issue anymore.

@astrale-sharp

It might be a good idea or workaround to trigger a warning when:

  • the font in use doesn't support all Unicode characters, AND
  • such a character is used in the current script.

Using Unicode would no longer be risky, and Godot wouldn't need to ship a big font (we all love Godot being a small binary, so that's <3).
BUT having no default option to see these characters out of the box makes GDScript less universal (you can't copy-paste any GDScript, you can't understand an add-on you are dealing with if it contains Unicode, etc.), since seeing the whole script requires an additional step that might not be trivial for beginners.

@raytopianprojects

I'm a little confused on why we just don't let someone use whatever characters they want? I mean it's their choice if they use this emoji 💩 or letter in their native language. No matter how silly anyone thinks it is.

I understand that this might add more complexity to the language and editor but I think it's worth it in the end because accessibility is worth it.

Also, if we need a font that supports this, we could just use JetBrains Mono. It's relatively small at 2.8 MB, it's free and open source, it supports 145 languages, and it looks quite pretty, as it was designed with code in mind.

@Calinou
Member

Calinou commented Mar 5, 2021

> Also, if we need a font that supports this, we could just use JetBrains Mono. It's relatively small at 2.8 MB, it's free and open source, it supports 145 languages, and it looks quite pretty, as it was designed with code in mind.

See godotengine/godot#36198. Note that the character set of Hack (the current script editor font) is actually more extensive than JetBrains Mono's.

@vnen
Member

vnen commented Mar 10, 2021

> I'm a little confused on why we just don't let someone use whatever characters they want? I mean it's their choice if they use this emoji 💩 or letter in their native language. No matter how silly anyone thinks it is.

It's not about silliness, it's about usability. If, say, someone copy-pastes code from another source and it contains stylized quotation marks like “string”, that will be interpreted as an identifier, not a string, which probably will cause confusion.

I don't really want to make an executive decision, so we need a consensus on what to do. The options I see are:

  • Don't really care about anything: just let any character above the ASCII range be considered part of an identifier. This includes stylized and non-printable characters. Whether the ScriptEditor should warn is another matter (though related, it's not a concern of the language implementation, as GDScript doesn't know what font you're using).
  • Make a tentative list of forbidden characters in identifiers, like the quotes I mentioned above and non-printable characters such as the different types of space. I'm not sure exactly how to compile this efficiently (the Unicode range is quite big), but we could try. Note that Unicode also changes over time, so we would need to keep this up-to-date. My main concern is that if we later find another character to forbid, forbidding it essentially breaks compatibility. So the initial list should also be mostly final, and on every Unicode update we would need to check the new characters to see whether something else must be forbidden.
  • Follow the UAX#31 standard which is the Unicode recommendation for this exact usage (identifiers in programming languages). This is backwards compatible with new Unicode versions so we wouldn't need to keep it so up-to-date (probably only per Godot release and likely won't need any changes, unless new characters are added).

IMO the middle option is not really viable. Either we remove any limitation (and let users shoot themselves in the foot) or we follow the proper standard.
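
The three options can be compared on a few single characters with a short Python sketch. The denylist used for the middle option is purely hypothetical (separators, control/format characters and curly quotation marks, selected by general category), and the third option is approximated by asking whether Python accepts the character on its own as an identifier, which means characters valid only in a non-initial position are reported as rejected.

```python
import unicodedata

# Hypothetical denylist for option 2, chosen only for illustration:
# all separators (Z*), all "other" characters such as control and format (C*),
# and the initial/final quotation marks (Pi, Pf).
DENIED_PREFIXES = ("Z", "C")
DENIED_CATEGORIES = {"Pi", "Pf"}


def allow_naive(ch: str) -> bool:
    """Option 1: anything above the ASCII range is accepted."""
    return ord(ch) > 0x7F


def allow_denylist(ch: str) -> bool:
    """Option 2: option 1 minus a hand-maintained denylist."""
    category = unicodedata.category(ch)
    return (
        allow_naive(ch)
        and not category.startswith(DENIED_PREFIXES)
        and category not in DENIED_CATEGORIES
    )


def allow_uax31_like(ch: str) -> bool:
    """Option 3, approximated by Python's UAX#31-based identifier check."""
    return ch.isidentifier()


def verdicts(ch: str) -> tuple:
    """Return the (option 1, option 2, option 3) decisions for one character."""
    return (allow_naive(ch), allow_denylist(ch), allow_uax31_like(ch))


samples = [
    ("rho", "\u03c1"),
    ("sun symbol", "\u2609"),
    ("left curly quote", "\u201c"),
    ("no-break space", "\u00a0"),
    ("zero width space", "\u200b"),
]
for label, ch in samples:
    print(f"{label:18} {verdicts(ch)}")
```

In this sketch, the scientific symbol is the only sample where the denylist and the standard-based rule disagree.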

@Calinou
Member

Calinou commented Jun 21, 2021

Rust 1.53.0 added support for Unicode identifiers, with some limitations (e.g. emoji are not allowed).

rustc also warns you about potentially confusing identifier names:

warning: identifier pair considered confusable between `s` and `s`

@adabru

adabru commented Jul 17, 2021

Add an option "Allow any Unicode" which is turned off by default in the next version?
Having UAX#31 as default as soon as its implementation is finished, keeping the previous flag for people who like emojis or scientific symbols and don't distribute their code?

@Calinou
Member

Calinou commented Jul 17, 2021

> Add an option "Allow any Unicode" which is turned off by default in the next version?

This should not be implemented as an option, as GDScript behavior should not depend on project settings to be more predictable across projects. We either go all in, or don't do it 🙂

@timothyqiu
Member

Did the introduction of ICU (via complex text layout) make following UAX#31 easier?

@Calinou
Member

Calinou commented Jul 17, 2021

> Did the introduction of ICU (via complex text layout) make following UAX#31 easier?

The complex text layout server may be disabled at build-time, so GDScript behavior shouldn't rely on it either. Godot still has a lightweight fallback text server to use when you want to optimize the binary size. You can select this Fallback text server by building with module_text_server_fb_enabled=yes and running with the --text-driver Fallback command line argument.

@akien-mga
Member

IMO we should follow UAX #31 as suggested by some here. If there's a standard for this exact use case, we should aim to follow it.

@timothyqiu
Member

> The complex text layout server may be disabled at build-time, so GDScript behavior shouldn't rely on it either.

I mean, if we were to make GDScript support Unicode Identifiers, does it make sense to make GDScript rely on ICU?

@vnen
Member

vnen commented Sep 20, 2021

We could potentially make this only available if ICU is present. I assume anyone using it in identifiers is also using it for other things and thus already needs it anyway.

I took a look and the ICU library does make this much easier to implement.

@Calinou
Member

Calinou commented Sep 20, 2021

> We could potentially make this only available if ICU is present. I assume anyone using it in identifiers is also using it for other things and thus already needs it anyway.

This might be an issue for headless usage, since ICU would become a dependency for your project on the server side if any of your scripts (or add-ons!) happen to use Unicode identifiers. In other words, I don't think this behavior should be made an option: #916 (comment)

@akien-mga
Member

> We could potentially make this only available if ICU is present. I assume anyone using it in identifiers is also using it for other things and thus already needs it anyway.
>
> I took a look and the ICU library does make this much easier to implement.

CC @bruvzg

@bruvzg
Member

bruvzg commented Oct 18, 2021

> I took a look and the ICU library does make this much easier to implement.

ICU does not have a full UAX#31 implementation, but it provides enough character properties to implement it. Here's a draft for UAX#31: godotengine/godot#53956

@bruvzg
Member

bruvzg commented Oct 19, 2021

> rustc also warns you about potentially confusing identifier names:

UAX#31 itself does not include any features to check for confusing elements. ICU has this implemented in the USpoofChecker, but that is part of the ICU i18n library, and we only include ICU common.

If similar functionality is desired, we'll need to include ICU i18n library (2x ICU code size + up to 20 MB of ICU data).
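
To show why a confusable check needs extra data, here is a small Python sketch that groups identifiers by their NFKC-normalized form. This is a much weaker technique than what rustc or ICU's USpoofChecker do: it only catches compatibility variants such as fullwidth letters and misses look-alikes from other scripts, for which dedicated confusables data would be required. The function name `nfkc_collisions` is hypothetical.

```python
import unicodedata
from collections import defaultdict


def nfkc_collisions(identifiers) -> dict:
    """Group identifiers that become identical after NFKC normalization."""
    groups = defaultdict(set)
    for name in identifiers:
        groups[unicodedata.normalize("NFKC", name)].add(name)
    # Keep only the groups where two or more distinct spellings collide.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


names = [
    "s",        # ASCII s
    "\uff53",   # fullwidth s, folds to ASCII s under NFKC
    "\u0455",   # Cyrillic dze, looks like s but is left unchanged by NFKC
    "speed",
]
print(nfkc_collisions(names))
```

The Cyrillic letter is not reported here even though it looks the same as the ASCII one, which is the gap that a full confusable checker fills.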

@txj-mssl

Many thanks to Rémi Verschelde and everybody. I see that 3.4 RC 1 supports Chinese documentation in the editor, and my heart is leaping. I will try to promote the use of Godot in China.

@me2beats

me2beats commented Jul 30, 2022

> > Add an option "Allow any Unicode" which is turned off by default in the next version?
>
> This should not be implemented as an option, as GDScript behavior should not depend on project settings to be more predictable across projects. We either go all in, or don't do it 🙂

Could this be allowed by parsing an annotation like @support_unicode?

@Calinou
Member

Calinou commented Jul 30, 2022

> Could this be allowed by parsing an annotation like @support_unicode?

While better than project settings, I'm still not fond of keywords/annotations that allow changing language behavior. It'll break code if you copy-paste it between scripts without also copying the keyword/annotation.

@Hapenia-Lans

> Even if this was supported, isn't it usually considered best practice to write code in English?
>
> Also, I have cross-language portability concerns with this proposal, since 💩 is not a valid identifier in C#.
>
> EDIT: But yes let's use ID_Start and ID_Continue for identifiers.

Maybe it's late to reply :o, but I would like to point out that the "best practice to write code in English" applies to common domains. The words involved there are often common words and technical terms, both of which are easy to translate and easy to understand for the people who use them.

But in game development, many things are hard or impossible to translate into meaningful English words. Traditional folklore and myths contain many unique words and concepts that are hard to translate, and they are wonderful material for game making. We cannot ask programmers to become linguists or translators. What's more, many of these words are difficult even for linguists to translate, and quite a few have not been translated at all.

Let's take an example. In Chinese mythology, there are things called "Jing Luo" (经络), which are the paths through which certain mystical substances (generally translated as Qi, though that is not accurate) flow within the human body. This word does have a translation, "meridian", which usually means "one of the lines that is drawn from the North Pole to the South Pole on a map of the world". The two meanings are unrelated, which makes things worse. And what about "小周天" (Xiao Zhou Tian, a certain way of Qi circulating in Jing Luo)? How can this be translated into a single word? So we commonly use pinyin (a method of recording the pronunciation of Chinese characters, similar to romaji in Japanese) to name these identifiers.

Here is a horrible code snippet:

# Xian Ren -> 仙人, similar to a "god" in Roman mythology, but not quite the same
class_name XianRen extends CharacterBody2D


## Zhen Qi Capacity -> 真气容量, a concept similar to "Qi" or ether, but not quite the same.
## This is the character's capacity for Zhen Qi.
@export var zhen_qi_capacity: int


# Zhen Qi Yun Zhuan Fang Shi -> 真气运转方式, how Zhen Qi circulates in the body.
enum ZhenQiYunZhuanFangShi {

    # Xiao Zhou Tian -> 小周天, a certain way of Qi circulating in the body
    XiaoZhouTian,

    # Da Zhou Tian -> 大周天, another way of Qi circulating in the body ...
    DaZhouTian,
    ...
}
...

# more terrible functions and variables below ...

Even Chinese programmers find the code above painful to understand, and it is completely unreadable for people in English-speaking countries. The idea behind this "best practice" is to make communication easier, but in this case it brings nothing good for anyone. And I think this is a common problem in non-English-speaking countries.

@Zireael07
Author

woot woot! It's taken so long I've forgotten I was the person who originally requested this :P

M☉, here I come!

Projects
Status: Implemented