Allow special characters in usernames #6830

Closed
2 tasks done
tigre-bleu opened this issue Mar 19, 2018 · 11 comments

Comments

@tigre-bleu

Currently, the username can only be letters, numbers and underscores.

The users in my LDAP directory have uids like "firstname.lastname" with their mail "firstname.lastname@example.com".

With the current implementation, mastodon does not create a profile for the users as there is a "dot" in the username.

I think that my situation can be quite common, especially with LDAP integration as sysadmins will not want to create a new ID just for mastodon.

Why doesn't mastodon support extended usernames? Am I missing some obvious reason?

As another solution, it should be possible to authenticate to Mastodon with the email address from the LDAP directory. With the current solution (setting LDAP_UID=mail), user creation fails with the "letters/numbers only" error.


  • I searched or browsed the repo’s other issues to ensure this is not a duplicate.
  • This bug happens on a tagged release and not on master (If you're a user, don't worry about this).
@Gargron
Member

Gargron commented Mar 19, 2018

> Why doesn't mastodon support extended usernames? Am I missing some obvious reason?

The regex for detecting mentions is widespread between different systems and it does not expect a dot in the first part of the username. So this really cannot be changed.

I suggest maybe stripping out the dot, so it becomes firstnamelastname, or firstname_lastname
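The suggestion above could be sketched roughly as follows. This is only an illustration: `sanitize_uid` and its `replacement` parameter are hypothetical names, not part of Mastodon or of any LDAP library, and the allowed alphabet is assumed to be the letters, digits and underscores described in this thread.

```ruby
# Hypothetical helper, not Mastodon code: maps a directory uid onto the
# letters/digits/underscore alphabet discussed in this thread.
def sanitize_uid(uid, replacement: "_")
  local_part = uid.split("@").first.to_s       # drop a mail domain if present
  local_part.gsub(/[^a-z0-9_]/i, replacement)  # replace every other character
end

puts sanitize_uid("firstname.lastname")                   # firstname_lastname
puts sanitize_uid("firstname.lastname", replacement: "")  # firstnamelastname
puts sanitize_uid("firstname.lastname@example.com")       # firstname_lastname
```

One caveat with any such mapping: distinct uids can collide after sanitizing (for example `a.b` and `a_b`), so a real integration would still need a uniqueness check.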

@tigre-bleu
Author

I understand the difficulty, but it is quite annoying if Mastodon is the only service that needs its own specific username.

For regular accounts, it is possible to log in via email. In this case I see no limitation on the email address. What about the following workflow for LDAP connection:

  • Admin configures LDAP_UID=mail
  • User connects the first time with his email address
  • Mastodon searches LDAP for the user based on the email and authenticates
  • Mastodon asks the user for a username
  • Mastodon creates the user in the instance with this username and associates the email address
  • In the future, the user connects with his email address
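The workflow above can be sketched in a few lines. Everything here is an assumption made for illustration: `DIRECTORY` stands in for an LDAP server, `ACCOUNTS` for the instance's account table, and `sign_in` is a hypothetical method, not something Mastodon provides.

```ruby
# Illustrative stand-ins only; none of this is real Mastodon or LDAP code.
DIRECTORY = { "firstname.lastname@example.com" => "secret" }.freeze
ACCOUNTS  = {} # email => username chosen on first login

def ldap_authenticate(email, password)
  DIRECTORY.key?(email) && DIRECTORY[email] == password
end

# Returns the username for this login, or nil if authentication fails.
# The block is called only on the first login, to ask the user for a username.
def sign_in(email, password)
  return nil unless ldap_authenticate(email, password)

  ACCOUNTS[email] ||= begin
    username = yield
    raise ArgumentError, "invalid username" unless username.match?(/\A[a-z0-9_]+\z/i)
    username
  end
end

# First login: the user is asked for a username.
puts sign_in("firstname.lastname@example.com", "secret") { "firstname_lastname" }
# Later logins: the stored username is reused and the block is never called.
puts sign_in("firstname.lastname@example.com", "secret") { raise "should not be asked again" }
```

The point of the sketch is that the email address is only the login key, so it never has to satisfy the username rules.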

@tigre-bleu
Author

As far as I understand, this issue is not really fixed, so it should not be closed.

@Gargron, What do you think of the workflow above?

@SaadAK96

this definitely needs to be fixed!

@ettmetal

Is there any chance of opening this back up for discussion?

I found this issue after trying to set up a username that used a character I don't think of as particularly 'special': à.

An inclusive web needs better representation than 'only Latin script without diacritics'. We should be past the days of technological limitations preventing use of characters outside of an anglo-centric set of letters.

@RokeJulianLockhart

Indeed. Not supporting the entirety of UTF-8 by default is genuinely insane, when surely all that is necessary is sanitization by quotation of the username string when it is stored and processed.

@julian-a-avar-c

julian-a-avar-c commented Aug 1, 2023

Should a new issue with this name flagged as "suggestion" be created? Because I agree with @ettmetal and @RokeJulianLockhart.

This is supposed to be the federated social network. If this were Twitter (or X now lol), I would be told that the company resides in the USA, so English characters are good enough, and if I don't like it I can go away. This is Mastodon though.

Is there a technical reason for closing this? If mountains have to be moved, it might not be worth it. I'm not a collaborator, but I ask that this be open so that it can get more exposure and get worked on eventually. I think UTF8 usernames would be incredible for a lot of people outside of English speaking countries (That's a lot of them).

I retract this, I just finished reading #8503. There are security concerns.

Summarizing for future non-dev readers: Unicode contains many invisible and other tricky characters that would make phishing easier (We are already seeing the consequences of .zip domains, so you can imagine how this could be a problem). If you want your username to contain "special" characters such as "à", you can add those in your display name.
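To make the concern concrete, here is a small sketch of two classic tricks: an invisible character appended to a name, and a look-alike letter from another script. The helper names are hypothetical, and the checks are deliberately narrow examples rather than a complete defence.

```ruby
plain     = "admin"
spoofed   = "admin\u200B"  # trailing U+200B ZERO WIDTH SPACE (category Cf)
lookalike = "\u0430dmin"   # starts with U+0430 CYRILLIC SMALL LETTER A

# Hypothetical check: does the name contain invisible "format" characters?
def invisible_chars?(name)
  name.match?(/\p{Cf}/)
end

# Hypothetical check: does the name mix Latin and Cyrillic letters?
def mixes_latin_and_cyrillic?(name)
  name.match?(/\p{Latin}/) && name.match?(/\p{Cyrillic}/)
end

puts plain == spoofed                     # false, although both render alike
puts invisible_chars?(spoofed)            # true
puts plain == lookalike                   # false
puts mixes_latin_and_cyrillic?(lookalike) # true
```

Both spoofed strings look like "admin" in most fonts, yet neither is equal to it.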

Note: It is not specified why only the "problematic" characters aren't removed, instead of limiting usernames to just [a-z0-9_], but to me the current solution is good enough, so I won't argue.

@RokeJulianLockhart

RokeJulianLockhart commented Aug 3, 2023

> Unicode contains many invisible and other tricky characters that would make phishing easier (We are already seeing the consequences of .zip domains, so you can imagine how this could be a problem). If you want your username to contain "special" characters such as "à", you can add those in your display name.

@jaacko-torus, surely the solution then is to use a blacklist rather than a whitelist (by default)? Or, if an instance's administrators are seriously concerned about security, allow them to switch the rules to a whitelist instead. Consider a personal server for a Chinese or Arabian family. Why would we restrict them using their own names? To force them to be Latinized seems genuinely dismissive of their needs simply because we are not them.

@ettmetal

ettmetal commented Aug 3, 2023

@jaacko-torus, I think you referenced the wrong issue? The one you linked is about the UX of the follow button on a profile that has been migrated. If that's the right issue, I'll need more context on how you believe it demonstrates the issues you're talking about, if you can 😅


A user handle is a user's identity on a service. It's not about having characters in a username, it's about being able to genuinely represent identity.

In most definitions, 'à' is not even a special character. Special characters are usually defined as something along the lines of 'non-alphanumeric characters', i.e. anything other than letters and numbers. 'À' is a letter.

I understand the security issues inherent in having all of Unicode available. I suppose that by referencing the .zip issues you mean the strong phishing attacks that it can be used for (here's an article explaining it for anyone interested: https://medium.com/@bobbyrsec/the-dangers-of-googles-zip-tld-5e1e675e59a5).

Crucially, the character which makes that phishing tactic so strong, "∕" (U+2215), is not an alphanumeric character. It is indeed a special character. It's not commonly used to form words, and isn't commonly used as a way to define a person's identity. There are literal names that use the character 'à'.
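The distinction can be checked with Unicode general categories, which Ruby regular expressions expose as `\p{...}` properties. A minimal sketch, where `alphanumeric?` is a hypothetical helper name:

```ruby
A_GRAVE        = "\u00E0" # à, LATIN SMALL LETTER A WITH GRAVE
DIVISION_SLASH = "\u2215" # ∕, DIVISION SLASH

# Hypothetical helper: true when the single character is a letter (L*)
# or a number (N*) in any script.
def alphanumeric?(char)
  char.match?(/\A[\p{L}\p{N}]\z/)
end

puts alphanumeric?(A_GRAVE)             # true
puts alphanumeric?(DIVISION_SLASH)      # false
puts DIVISION_SLASH.match?(/\p{Sm}/)    # true: it is a math symbol
```

So a rule based on "letters and numbers in any script" would admit à while still rejecting the slash look-alike.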

I agree with @RokeJulianLockhart: settling on "it's good enough" because you can meet your needs for expressing your identity within the limits seems dismissive. When we talk about what is "good enough", I think it's always important to consider "for whom". Is the character set being limited to the set of Latin characters used in English and some other languages good enough for all users?

@julian-a-avar-c

julian-a-avar-c commented Sep 28, 2023

  • @ettmetal Whoops, I meant to reference "Can't use non latin characters in username" (misskey-dev/misskey#8503)
    • I just assumed that it was a Mastodon issue since it was referenced on the thread (new to GitHub workflows, sorry :P)
  • As for the id, I have the same perspective (as was written in my original post)
    • The issue being fixed would be good 👍
    • A blacklist would be preferable to a whitelist
    • It would be more culturally inclusive (I, personally, want Mastodon to be the nice, friendly, and global alternative, and this would help)
    • I'm against this issue being closed, I have no idea why it was decided to be completed

But: I do not have any security background, and it's quite the serious and heavy topic. As such, I believe non developers can be satisfied. The reason is simple, the id is this internal thing, users don't see the id except when they copy-paste. My id can be literally anything, a UUID if I'd like, it really doesn't matter, it is meant to be copy-pasted. If you are looking for someone in the search bar, you have two options: (1) copy-paste their id, (2) copy-paste their display name. The display name can have any character, so, imo inclusive enough.

Why are dots . prioritized over Greek and Arabic? No clue. But I will not argue about things that are already done, and that's my message to non-devs. There are reasons, they might not be the best ones, but there are always reasons.


We can talk about it, or we can fix it. Let's go.

First of all, I would like us to define "special characters" as anything that is NOT in the regex please, otherwise, give me an alternative, I don't want to say "anything that is NOT in the regex" every time.

The regex for user names is the following:

USERNAME_RE = /[a-z0-9_]+([a-z0-9_\.]+[a-z0-9_]+)?/i
# EDIT: It also appears that `"jaacko.torus" == "jaackotorus"` for `"."`.
# We might have to make similar considerations, when adding some of these characters.
# For later: https://github.com/kibousoft/mastodon/blob/72d8d3160a63e61ba2af59bf9e90006048bce835/app/validators/unique_username_validator.rb

It can be found in this file, which was modified by @Gargron (remember to give the man some love, he added .! Thank you very much you gentle sir ❤️ 💯 🔥 🐳 🎉) when the issue was closed.
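For readers who want to see what that pattern accepts, here is a small sketch. It assumes the pattern is anchored to the whole string when used for validation (`ANCHORED` is a name made up for this example); the thread does not show how Mastodon actually applies it.

```ruby
USERNAME_RE = /[a-z0-9_]+([a-z0-9_\.]+[a-z0-9_]+)?/i

# Assumption for this sketch: validation matches the whole string.
ANCHORED = /\A#{USERNAME_RE}\z/

names = [
  "firstname_lastname",  # letters and underscore only
  "firstname.lastname",  # dot in the middle
  ".leading",            # dot at the start
  "trailing.",           # dot at the end
  "\u00E0lice"           # starts with à
]

names.each do |name|
  puts "#{name.inspect}: #{name.match?(ANCHORED)}"
end
```

Under that assumption the first two names match and the last three do not: a dot is only accepted between other allowed characters, and à is outside `[a-z0-9_]`.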

Once more, I am no security expert, and as such, any recommendations as to what the actual characters that we are blacklisting are, should be thoroughly inspected by you guys.


Let's define scope. I believe all "word" characters should be included, that is, everything that can be rendered by, say Noto Sans, minus symbols and characters that can be used for phishing.

I recommend: Have a whitelist, AND a blacklist, with the blacklist running after the whitelist.

Looking through the character classes, I would say the whitelist should include "by default": Ll, Lm, Lo, Lt, Lu, Mn, Nd, Nl, No, Pc, Pd, and Sk. As a little side note: I agree with a previous comment by @RokeJulianLockhart that this stuff should be configurable. As I am no Mastodon admin, idk where that code lies, but I think that should be a separate issue. For now though, it might be good to come up with a set of good defaults.
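As a rough sketch of that proposal, the whitelist below uses exactly the categories listed above, written as Ruby `\p{...}` properties. The blacklist entries and the names `WHITELIST`, `BLACKLIST` and `valid_username?` are assumptions picked for illustration, not a vetted security list.

```ruby
# Whitelist: the general categories proposed above.
WHITELIST = /\A[\p{Ll}\p{Lm}\p{Lo}\p{Lt}\p{Lu}\p{Mn}\p{Nd}\p{Nl}\p{No}\p{Pc}\p{Pd}\p{Sk}]+\z/

# Hypothetical blacklist, run after the whitelist: characters that pass the
# categories above but seem undesirable in a handle. Here: grave accent and
# circumflex (both Sk) and U+FF0D FULLWIDTH HYPHEN-MINUS (Pd).
BLACKLIST = /[`\^\uFF0D]/

def valid_username?(name)
  name.match?(WHITELIST) && !name.match?(BLACKLIST)
end

puts valid_username?("\u00E0lice")            # true:  à is a lowercase letter (Ll)
puts valid_username?("\u674E\u5C0F\u9F99")    # true:  Han characters are Lo
puts valid_username?("firstname.lastname")    # false: "." is Po, not whitelisted
puts valid_username?("a\u2215b")              # false: U+2215 is Sm
puts valid_username?("admin\u200B")           # false: U+200B is Cf
puts valid_username?("a`b")                   # false: whitelisted (Sk) but blacklisted
```

Note that this whitelist would reject the dot that the current regex allows, so a real default would have to reconcile the two.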

Does that sound good?

@julian-a-avar-c

julian-a-avar-c commented Sep 28, 2023

Ah, almost forgot, the database that collects this information would also need to be modified, since most databases don't allow willy nilly UTF characters in strings without some more custom setup 🙄 (This is another one of those, idk why, but it's that way problems). Here's a Stack Overflow QA that goes over the issue. (I believe Mastodon only uses PostgreSQL)
