Add more options to language selection #18538

Open
rwmpelstilzchen opened this issue May 27, 2022 · 36 comments
Labels
suggestion Feature suggestion

Comments

@rwmpelstilzchen

Pitch

Currently the new language selection dropdown tool (introduced in #18420) supports a limited number of languages. I suggest:

  • Adding more languages. A more comprehensive list of languages can be obtained from the Wikimedia equivalent or from ISO 639-3
  • Adding an ‘Other’ option, for languages which are not listed.
  • Adding a ‘non-linguistic’ option, for paintings, music, source code, numeric data, etc.

¹ How does language tagging work across the Fediverse? Is there a standard for language codes? This should be taken into consideration.

Motivation

Mainly two groups of users will benefit from the suggestion:

  • Speakers of minority languages.
  • Artists, who will be able to post non-verbal works without tying them to any specific language.
@rwmpelstilzchen rwmpelstilzchen added the suggestion Feature suggestion label May 27, 2022
@Gargron
Member

Gargron commented May 27, 2022

We support all ISO-639-1 languages, plus a few ISO-639-3 languages added on demand. Bear in mind that the fediverse does not have enough people for granular language subdivisions to make sense, since it would cut your audience to minuscule proportions. If there is an active community of a specific language that is not represented in the list, then I will add it.

@rwmpelstilzchen
Author

rwmpelstilzchen commented May 27, 2022

Thanks for the reply :-)

Bear in mind that the fediverse does not have enough people for granular language subdivisions to make sense, since it would cut your audience to minuscule proportions.

By ‘granular language subdivisions’, do you mean things like subdividing similar and mutually intelligible dialects? That is truly counter-productive.

What I was talking about is adding distinct languages which are missing. For example, according to this, Nigerian Pidgin has 120 million speakers but no ISO-639-1 code; the same is true of Cantonese (over 80 million speakers), Bhojpuri (52.5 million speakers) and other languages.

Limiting one’s feed to a set of languages is one use of language tagging. Others include proper settings for screen readers, searching for toots in a certain language (for example, in order to find other speakers of said language), language-dependent typographical settings, machine translation, etc. These depend on having the correct language tagged.

@Wuzzy2

Wuzzy2 commented May 28, 2022

Adding a ‘non-linguistic’ option

I'm just dropping by here to remind everyone that there is an ISO-639-3 code for that: zxx.

Adding an ‘Other’ option, for languages which are not listed.

This also has an ISO-639-3 code: mis. This stands for "uncoded language", which means it is a language for which no other official ISO code fits.

https://en.wikipedia.org/wiki/ISO_639-2#Special_situations

@onmyouji

onmyouji commented May 29, 2022

Probably related to this: #8933

It would be great if there's a distinction between traditional (zh-hant) and simplified (zh-hans) characters.

@rwmpelstilzchen
Author

A related toot: https://skeletons.gay/notice/AJs5T1hEp1Pr6oWlbk

@rwmpelstilzchen
Author

@onmyouji

It would be great if there's a distinction between traditional (zh-hant) and simplified (zh-hans) characters.

Isn’t that on a different level, namely that of writing systems, as opposed to languages?

@onmyouji

idk, there're different ways to classify them. Personally I just want to be able to filter my timeline, so that it can distinguish between the two.

In Ubuntu, I think they have Chinese (Traditional), Chinese (Simplified), Chinese (Hong Kong).

In Microsoft .NET they have:

aa

zh	Chinese (zh)
Chinese (Simplified) (zh-Hans)
Chinese (Simplified, China) (zh-Hans-CN)
Chinese (Simplified, Hong Kong SAR China) (zh-Hans-HK)
Chinese (Simplified, Macau SAR China) (zh-Hans-MO)
Chinese (Simplified, Singapore) (zh-Hans-SG)
Chinese (Traditional) (zh-Hant)
Chinese (Traditional, Hong Kong SAR China) (zh-Hant-HK)
Chinese (Traditional, Macau SAR China) (zh-Hant-MO)
Chinese (Traditional, Taiwan) (zh-Hant-TW)

@Yoxem

Yoxem commented Jun 3, 2022

Some languages have local varieties (for example, en-US and en-GB), and some languages use two or more writing systems (e.g. Mongolian, Hokkien/Taiwanese/Min Nan, Mandarin, Aramaic languages).
If all the varieties are listed, the table will be less readable. However, they may help some users.

@Yoxem

Yoxem commented Jun 6, 2022

Recently, I thought it might help to let the admin add or delete customized language codes manually. For example:
For a British instance, it might be:

{"language_option": [
# All the languages in Britain: Scots, English, Gaelic, Welsh, Cornish, other languages
"sco", "en", "gd", "cy", "cor", "other"]
}

For a Taiwanese instance:

{"language_option": [
# Taiwanese Mandarin, Taiwanese (Hokkien), Taiwanese Hakka, Austronesian, Min Dong, other languages
"cmn-tw", "nan-tw", "hak-tw", "map", "cdo", "other"]
}

The filter logic can be:

# Any code the instance does not list falls back to the "other" bucket.
if iso639code not in language_option:
    iso639code = "other"
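
The fallback described above could be sketched as a small function. This is only an illustration: the `language_option` key follows the comment above, while `BRITAIN_INSTANCE` and `normalize_language` are hypothetical names, not part of Mastodon.

```python
# Hypothetical per-instance setting, mirroring the British example above.
BRITAIN_INSTANCE = {"language_option": ["sco", "en", "gd", "cy", "cor", "other"]}


def normalize_language(code: str, config: dict) -> str:
    """Return the code if the instance lists it, otherwise the "other" bucket."""
    allowed = config["language_option"]
    code = code.lower()  # treat "CY" and "cy" as the same code
    return code if code in allowed else "other"
```

With this sketch, a Welsh post keeps its `cy` code on the British instance, while a post tagged with an unlisted code such as `fr` is filed under `other`.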

@poga

poga commented Jun 7, 2022

It would be great to have separate options at least for zh-Hant, zh-Hans, and zh-Hant-HK. We (g0v.social) are an active instance with primarily zh-Hant users, plus some zh-Hant-HK, Taiwanese (Hokkien), and Taiwanese Hakka users.

The culture of these users is very different from that of zh-Hans users. We suffered a heavy moderation burden when joinmastodon.org put us all under the "Chinese" banner.

@Gargron
Member

Gargron commented Jun 7, 2022

We had been using automated language detection using cld3 on posts for many years, which was often wildly inaccurate (the main reason it has now been removed), but even in the best case, you could not expect it to distinguish correctly between regional variants of languages (pt-PT vs pt-BR and so on). This limitation has influenced the design of language filtering features to act on (essentially) language families instead of individual, exact languages (e.g. pt instead of either pt-PT or pt-BR).

Even though that was the main reason for the design, I think it still makes sense. I perfectly understand that not every variant of a language is mutually intelligible, but I maintain that:

  • There are not enough representatives of every language on Mastodon
  • Filtering results to the point of returning nothing is not a good thing
  • Filtering results to related, even though not always mutually intelligible, languages is better than not filtering languages at all

That is to say, I think that sticking to zh and having people across Chinese languages see each others' posts is better than them also seeing English, German, Arabic, Persian, Finnish and so on, and is better than using zh-Hant-HK and then seeing no content at all / not being seen by anybody.
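
The design described in this comment (filtering on `pt` rather than `pt-PT` or `pt-BR`) can be illustrated with a short sketch. It is not Mastodon's actual code: the post dictionaries, the `language` key, and both function names are assumptions made for the example, as is the choice that an empty selection disables filtering.

```python
def primary_subtag(tag: str) -> str:
    """Return the primary language subtag, e.g. "pt" for "pt-BR" or "zh" for "zh_Hant_HK"."""
    return tag.replace("_", "-").split("-", 1)[0].lower()


def visible_posts(posts, selected_languages):
    """Keep posts whose primary subtag is selected.

    Assumption: an empty selection means "no language filtering".
    """
    if not selected_languages:
        return list(posts)
    wanted = {primary_subtag(tag) for tag in selected_languages}
    return [post for post in posts if primary_subtag(post["language"]) in wanted]
```

Under this scheme a reader who selects `pt` sees both `pt-BR` and `pt-PT` posts, which is exactly the trade-off the rest of the thread argues about.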

@nemobis
Contributor

nemobis commented Jun 9, 2022

CLDR is a more sensible source for locales to support. https://cldr.unicode.org/

At a minimum you need to have a name for the language code, if you want to offer it in an interface. And you need to know at least some basics like script and directionality if you need to use the language code to define your HTML output.
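
As a rough illustration of that minimum, the sketch below keeps a name, script, and direction per code and uses them for the `lang` and `dir` attributes of the HTML output. The three entries are hand-written for the example (a real list would be generated from CLDR), and `LanguageInfo`, `LANGUAGES`, and `render_post` are hypothetical names.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class LanguageInfo:
    code: str       # language tag used in the lang attribute
    name: str       # display name for the dropdown
    script: str     # ISO 15924 script code
    direction: str  # "ltr" or "rtl"


# Illustrative hand-written entries; a real list would be generated from CLDR.
LANGUAGES = {
    "en": LanguageInfo("en", "English", "Latn", "ltr"),
    "ar": LanguageInfo("ar", "Arabic", "Arab", "rtl"),
    "he": LanguageInfo("he", "Hebrew", "Hebr", "rtl"),
}


def render_post(text: str, code: str) -> str:
    """Wrap post text in a paragraph carrying lang and dir attributes."""
    info = LANGUAGES.get(code)
    if info is None:
        # Unknown code: emit no lang/dir attributes rather than guess.
        return f"<p>{escape(text)}</p>"
    return f'<p lang="{info.code}" dir="{info.direction}">{escape(text)}</p>'
```

A code without an entry in the table cannot be offered in the interface at all, which is the practical constraint behind adding languages one by one.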

@kuanyui

kuanyui commented Sep 7, 2022

Filtering results to the point of returning nothing is not a good thing

No. Even following your reasoning, I still don't understand why users should be denied the choice to do what they want to do.

And there's already a language filter in the preferences which perfectly avoids the problem you are concerned about, so what on earth are you talking about?

Screenshot_20220907_211607-1

That is to say, I think that sticking to zh and having people across Chinese languages see each others' posts is better than them also seeing English, German, Arabic, Persian, Finnish and so on, and is better than using zh-Hant-HK and then seeing no content at all / not being seen by anybody.

Following your logic, I strongly recommend that Mastodon also remove Ukrainian from the language list and merge it into Russian, because that would give Ukrainian and Russian speakers more filtered results.

I even wrote a user script to batch-add ~600 filters (because Mastodon's filters still don't support RegExp) to filter out annoying zh_CN messages. Looks so delicate, doesn't it?

Screenshot_20220907_210208

@onmyouji

onmyouji commented Sep 7, 2022

Filtering results to the point of returning nothing is not a good thing

If users filter something, to the point that nothing gets shown on their timeline, then that's the filter working as intended. It's their choice, that's why the users use filter in the first place.

Maybe it's not a good thing for mastodon.social, which is trying to attract as many new users as it can. I understand that new users will tend to stay if the timeline has a lot of content and a lot of engagement.

But not all instances are trying to become mastodon.social.

@nemobis
Contributor

nemobis commented Sep 7, 2022 via email

@onmyouji

onmyouji commented Sep 9, 2022

So it cannot be assumed that the user knows why they're seeing the amount of posts they're seeing.

Sorry, can you explain more about that statement?

Let's say a user joins an instance and thinks the local timeline is too crowded with posts in languages that they don't want to see.

They go to their language filter settings, and tick Spanish and Korean. When they go back to the local timeline, it shows a lot fewer posts (or maybe nothing).

So you're telling me the user won't be able to tell why the instance they joined suddenly became a dead town, after they specifically chose to filter by language?

I mean, that's what the filter is for right? To cut the amount of content the users see, to choose specifically what content (language) should appear in the timeline.

@nemobis
Contributor

nemobis commented Sep 9, 2022 via email

@gjvnq

gjvnq commented Nov 5, 2022

I think it's very easy and important to add the special ISO 639-3 language codes:

  • mis - Other language.
  • mul - Multiple languages - For when a toot contains multiple languages simultaneously without any one being the "main one", e.g. Hello! Hola! Bonjour! Olá in a single toot.
  • und - Undetermined language - For when the author doesn't know the language of the toot's content (e.g. a picture of some ancient writing).
  • zxx - No language - For things like equations, emoji and other stuff without accompanying words.

As for how it interacts with the filter language option, they would all be a separate subsetting so these codes will be enabled by default even for users that already have a filter language setting.
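
A minimal sketch of that interaction, assuming the filter is a plain set of selected codes: the four special codes pass by default, and a hypothetical `allow_special` switch stands in for the separate sub-setting. Neither the function name nor the parameter exists in Mastodon.

```python
# The four special codes listed above.
SPECIAL_CODES = frozenset({"mis", "mul", "und", "zxx"})


def passes_language_filter(post_language, selected, allow_special=True):
    """Decide whether a post is shown under the user's language filter."""
    if not selected:
        # Assumption: an empty selection means no filtering at all.
        return True
    if allow_special and post_language in SPECIAL_CODES:
        # Special codes stay visible even when a filter is already set.
        return True
    return post_language in selected
```

A user who only selected `pt` would therefore still see a `zxx` post (for example an untitled painting) unless they turn the sub-setting off.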

This sounds like a good first issue. Can I try to implement it with some hope of it being merged?

@gjvnq

gjvnq commented Nov 5, 2022

As for some of the sublanguage issues brought up by @Gargron @poga @Yoxem @onmyouji and others, I think that the ideal way forward is to treat languages as "hierarchical" things, with tri-state checkboxes for all non-leaf settings.

For Portuguese (which I do speak) it would be like:

[ ] - Portuguese (pt)
  [ ] - Country or Region
    [ ] - Brazil (BR)
    [ ] - Angola (AO)
    [ ] - Portugal (PT)
    [ ] - Mozambique (MZ)
    [ ] - São Tomé e Príncipe (ST)
    [ ] - Macao (MO)
    [ ] - Equatorial Guinea (GQ)
    [ ] - Cape Verde (CV)
    [ ] - East Timor (TL)
    [ ] - Other places

Note: this list came from the member states/regions of the CPLP - Comunidade de Países de Língua Portuguesa (Lusophone Commonwealth) and is roughly in order of speakers.

Note: I only included the ISO codes here to make myself clear but I don't think that they should be exposed to the end user, at least by default

So a setting for Chinese wouldn't be a simple checkbox but a tree of (mostly tri-state) checkboxes like:

[ ] - Chinese languages (zho)
  [ ] - Mandarin (cmn)
    [ ] - Writing systems
      [ ] - Traditional (Hant)
      [ ] - Simplified (Hans)
      [ ] - Other writing systems
    [ ] - Country or Region
      [ ] - Mainland China (CN)
      [ ] - Chinese Taipei/Taiwan (TW)
      [ ] - Singapore (SG)
      [ ] - Other places
  [ ] - Cantonese (yue)
    [ ] - Writing systems
      [ ] - Traditional (Hant)
      [ ] - Simplified (Hans)
      [ ] - Other writing systems
    [ ] - Country or Region
      [ ] - Mainland China (CN)
      [ ] - Chinese Taipei/Taiwan (TW)
      [ ] - Other places
  [ ] - Southern Min (nan)
    [ ] - Writing systems
      [ ] - Traditional (Hant)
      [ ] - Simplified (Hans)
      [ ] - Other writing systems
    [ ] - Country or Region
      [ ] - People's Republic of China (CN)
      [ ] - Republic of China (TW)
      [ ] - Other places

Note: I don't speak any Chinese language(s), so I'm surely overstating and understating some of these options.

To avoid making the UI too intimidating, one option would be to let the user manually add countries (with a dropdown selector) or codes to their filter languages setting. Something like:

[ ] - Portuguese (pt)
  [ ] - Country or Region
    [ ] - Brazil (BR)
    [ ] - Angola (AO)
    [ ] - Portugal (PT)
    [ ] - Other places
        - Add place: [ ____________ ⌵ ] [ + ]
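
The tri-state behaviour of the non-leaf checkboxes proposed in this comment can be sketched as follows. It is a standalone illustration, not a proposal for Mastodon's actual data model; `Node` and `state` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    checked: bool = False  # only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def state(node: Node) -> str:
    """Return "checked", "unchecked", or "mixed" for a node of the tree."""
    if not node.children:
        return "checked" if node.checked else "unchecked"
    child_states = {state(child) for child in node.children}
    if child_states == {"checked"}:
        return "checked"
    if child_states == {"unchecked"}:
        return "unchecked"
    # Some children are checked (or mixed) and some are not.
    return "mixed"


# Example: only Brazil is ticked under Portuguese, so the parent is "mixed".
portuguese = Node("Portuguese (pt)", children=[
    Node("Brazil (BR)", checked=True),
    Node("Angola (AO)"),
    Node("Portugal (PT)"),
])
```

The state of a parent is derived from its leaves, so only leaf selections would need to be stored in the user's settings.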

@onmyouji

onmyouji commented Nov 6, 2022

Can I try to implement it with some hope of it being merged?

Don't bother, just save your time. If you've been following mastodon since the beginning, you'll know that it won't get merged.

Everyone and every major organization on the planet (e.g. Linux, Steam, Twitter, Microsoft etc) knows to make a distinction between Traditional and Simplified Chinese, it's just common sense.

But the fact that in year 2022, users still have to beg him for it should explain to you how reasonable he is.

@kuanyui

kuanyui commented Nov 6, 2022

Can I try to implement it with some hope of it being merged?

Don't bother, just save your time. If you've been following mastodon since the beginning, you'll know that it won't get merged.

Everyone and every major organization on the planet (e.g. Linux, Steam, Twitter, Microsoft etc) knows to make a distinction between Traditional and Simplified Chinese, it's just common sense.

But the fact that in year 2022, users still have to beg him for it should explain to you how reasonable he is.

Yeah, I couldn't agree with you more.

I find this topic so ridiculous and tiring (as Taiwanese, we are forced to face similar situations every day); it feels as if I keep explaining something to pro-China people who always give the runaround and just mean to force Taiwanese to use their language.

@falconshark

As a Hongkonger, I agree with @kuanyui too.
Just as English (UK) and English (US) are not the same thing, Traditional Chinese (Hong Kong), Traditional Chinese (Taiwan), and Simplified Chinese are not the same thing. Please don't try to force Chinese users from different regions to use the same language.

@xatier

xatier commented Nov 6, 2022

As a multilingual Taiwanese, I want to avoid seeing any toots from China. I would need to set up this long filter to get rid of those characters (while being careful not to block Japanese kanji accidentally). I read toots in English, Japanese, Traditional Chinese (TW/HK).

Please do not force users from different countries to use the same language, as @dollars0427 said.

image

@kuanyui

kuanyui commented Nov 6, 2022

BTW, for non-Chinese users: even though zh_TW and zh_HK use nearly the same characters (Traditional Chinese characters), I can NOT understand toots written in zh_HK at all.

Just like Italian and English both use the Latin alphabet, but their speakers cannot understand each other.

@kuanyui

kuanyui commented Nov 6, 2022

There seems to be no sign of an official solution to this topic in the foreseeable future, so here is the user script I wrote earlier, which can batch-install lots of filters, as a workaround for those who are tired of this world.

https://gist.github.com/kuanyui/b8afc959cb4c17d17b45c6ae0669a2ec

@sorz

sorz commented Nov 6, 2022

After reading all the comments, I found that the language options actually serve two distinct functions that (some) users want.

  1. Filtering out content which is unintelligible to the user
  2. Using language as a proxy to filter out a group of ppl (or toots) the user doesn't like

Many of the debates around zh-CN, zh-TW, zh-Hant, zh-Hans, etc. fall into the second category. For example, one commenter said:

I'm just keeping explaining something to some Pro-China who always give runaround and just mean to force Taiwanese to use Simplified Chinese.

They can read them, but they want to use the language option to avoid those "Pro-China" ppl.

Another example:

As a multilingual Taiwanese, I want to avoid seeing any toots from China.

Again, they can understand those toots, but want to use language as a proxy to filter out toots from China.

In fact, Mandarin Chinese written in simplified/traditional script IS inter-intelligible (with context and a little guessing if the reader has never seen the other variant). The need to distinguish between them likely falls into category 2.

However, the category 1 problem does exist around zh. Hong Kongers regularly use Cantonese rather than Mandarin. Cantonese and Mandarin ARE different languages and they are NOT inter-intelligible, but both of them are categorized as zh. And this does not depend entirely on a specific region (HK), nor on script (traditional script): ppl in Guangzhou also use Cantonese written in Simplified script.
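
To illustrate the category 1 point, the sketch below lets a filter match either an individual language code or the `zh` umbrella it belongs to. The five codes shown are members of the ISO 639-3 macrolanguage `zho` that come up in this thread; the table is deliberately incomplete, and `MACROLANGUAGE` and `matches` are hypothetical names.

```python
# Individual languages mentioned in this thread, mapped to the "zh" umbrella code.
# Deliberately incomplete: the macrolanguage has more members than these five.
MACROLANGUAGE = {"cmn": "zh", "yue": "zh", "nan": "zh", "hak": "zh", "cdo": "zh"}


def matches(post_code: str, selected: set) -> bool:
    """True if the post's code is selected directly or via its umbrella code."""
    post_code = post_code.lower()
    return post_code in selected or MACROLANGUAGE.get(post_code) in selected
```

With such a mapping, a reader who selects `zh` still sees Cantonese posts tagged `yue`, while a reader who selects only `cmn` does not.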

@xatier

xatier commented Nov 6, 2022

@sorz, correct.

Regardless of whether it is 1 or 2, it should be understandable (and should be respected) that one would like to avoid content they do not enjoy when browsing their timelines on Mastodon. This is undoubtedly the point of having filters around.

A language always carries cultural differences and implicit references beyond itself; the (true) meanings of a particular language represent the culture and values of a particular social group. Forcing different social groups to be recognized under the same zh flag is not the best experience one would appreciate. Not to mention the cultural/political conflicts between Chinese users.

@guyemerson

guyemerson commented Nov 7, 2022

"Chinese" is a language family, on the same level of granularity as "Romance" or "Germanic" or "Slavic". I would like to be able to select Hokkien (ISO 639-3: nan).

This is regardless of geographical region or writing system. Some people write romanised Hokkien (the most common systems being POJ and Tailo), and some people write Chinese characters (several systems exist, usually with traditional characters). For more information, see: https://en.wikipedia.org/wiki/Written_Hokkien

Insisting on "zh" for Hokkien and other non-Mandarin Chinese languages is extremely offputting, and perpetuates the marginalisation of these languages.

If there is an active community of a specific language that is not represented in the list, then I will add it.

This does not seem like a welcoming attitude towards minority languages. But taking this at face value, I would say that the most vibrant online Hokkien community is probably the Taiwanese Hokkien community (where the language is often referred to as simply "Taiwanese"). For some examples of Hokkien toots, see: https://g0v.social/tags/Taigi -- I hope this is evidence enough that this language should be supported.

@guyemerson

It seems that very small communities can indeed get their languages included in the dropdown list:
#20168

@akerbeltz

Yes, that seems really strange. I made the leap over from Twitter today and when I migrated my English/Cantonese account (I have several which I separate by languages) to my dismay I saw that the only way I can tag my Cantonese posts is either to mis-label them English or mis-label them 中文, never mind not being able to filter out Mainland stuff in 中文 I have no interest in.

It's not a zero sum game, I think it's great people can choose Navajo, Inuktitut and tokipona (had to Google that one) but it seems like a list that is grossly skewed (though I couldn't say what to, perhaps languages the original devs were interested in?)

I like the idea of hierarchical nesting, it makes sense as there are a lot of languages. Surely Mastodon can't possibly be looking to be more linguistically restrictive than Twitter?

@ChasBelov

I've closed #23700 as a duplicate of #18538. Copying the following from that issue:

For Cantonese, WebAIM reports suggest that three-letter language codes such as "yue" are not well supported by screen readers, so "zh-YUE" is likely the better choice. That said, I have not tested this.

@yheuhtozr

yheuhtozr commented Nov 11, 2023

That is to say, I think that sticking to zh and having people across Chinese languages see each others' posts is better than them also seeing English, German, Arabic, Persian, Finnish and so on, and is better than using zh-Hant-HK and then seeing no content at all / not being seen by anybody.

@Gargron The new ISO 639 is just out with substantial enlargement, so I can back up my opinion below with the official evidence. While the full text is proprietary, please allow me to cite its section 6.2.1:

Where spoken intelligibility between language varieties is marginal, the existence of a common
literature or of a common ethnolinguistic identity with a central language variety that both
speaker communities understand is a strong indicator that they should nevertheless be considered
language varieties of the same individual language.

which essentially means that they publicly admit to grouping several practically mutually unintelligible "dialects" under a single ISO 639 code (the situation has always existed, but it is now explicitly ratified). From the backstage perspective, this assumes the existence of other methods to specify subdivisions, such as IETF language tags or the upcoming ISO 21636 framework, so in such cases we will need the help of combined language tags.
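
As a rough illustration of what a combined tag carries, the sketch below splits simple tags of the form language[-Script][-REGION], such as `zh-Hant-HK`. It is a simplification and not a full IETF language tag parser: extended language subtags, variants, and extensions are ignored, and `LanguageTag` and `parse_tag` are hypothetical names.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language[-Script][-REGION] tag; other subtags are ignored."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha() and script is None and region is None:
            # Four letters before any region: a script subtag such as "Hant".
            script = part.title()
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            # Two letters or three digits: a region subtag such as "HK" or "419".
            region = part.upper()
    return LanguageTag(language, script, region)
```

A filter could then compare only the fields the user cares about, for instance matching on script while ignoring region.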

Bear in mind that the fediverse does not have enough people for granular language subdivisions to make sense, since it would cut your audience to minuscule proportions.

In my observation, this problem mostly affects users of some of the "big languages", including Chinese in this thread, or Arabic, whose speakers nominally believe they all still speak "Latin" but actually do not.

@ChasBelov

With regard to zxx, while an image might not have a language, its alternative text would. I know not all images have alternative text, but they are supposed to have it. Similarly, emojis tend to have English Unicode names (which I suppose might have translations, but I don't know that).

At work, my colleague who was responsible for setting standards for Chinese translation told me that while people who normally use Simplified Chinese (zh-HANS) can usually also read Traditional Chinese (zh-HANT), the reverse is not true. So we always use zh-HANT.

@xatier

xatier commented Nov 11, 2023

@ChasBelov no, this is not true. Many Simplified Chinese users don't read Traditional Chinese; they may not be familiar with the characters, phrases, etc. Your colleagues may be able to recognize both, but I believe this is not common among Chinese language family users.

Again, please try to be respectful of the culture/political differences to all social groups. We are all different. :)

@yheuhtozr

@ChasBelov It depends (mostly on their educational background). A more accurate metaphor for the Simplified/Traditional Chinese situation is this: Trad. characters look to Simp. users like 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯, while Simp. characters look to Trad. users something like Unifon. So it is probably easier for Simp. users to recognize Trad. text (the more so the more Classical Chinese they learn) than the other way around, but for most ordinary people, it is just guesswork either way.

@xatier

xatier commented Jun 19, 2024

Recently updated my Simplified Chinese filter, feel free to apply to your timeline 😃

This should cover a good percentage of Simplified Chinese toots, without filtering out Japanese 漢字.

业为习书买亚产亲们关兴击创务卖卫发员喷头妈婴实审对导层带庆应开忆战户择换时术权样梦欢汉爱现种笔经给络维网罗见觉计认记识试该诱说谓败费边达过运还这进远选递键长问间陆难韩页题风飞饭鸡鹅
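
For readers who want to see what such a character list does, here is a standalone sketch that flags text containing any of the characters above. It only approximates the idea and says nothing about how Mastodon applies filters internally; `SIMPLIFIED_MARKERS` and `looks_simplified` are hypothetical names.

```python
# The characters from the filter above, used as markers of Simplified Chinese text.
SIMPLIFIED_MARKERS = frozenset(
    "业为习书买亚产亲们关兴击创务卖卫发员喷头妈婴实审对导层带庆应开忆战户择换时术权样梦欢汉爱现种笔"
    "经给络维网罗见觉计认记识试该诱说谓败费边达过运还这进远选递键长问间陆难韩页题风飞饭鸡鹅"
)


def looks_simplified(text: str) -> bool:
    """True if the text contains at least one marker character."""
    return any(ch in SIMPLIFIED_MARKERS for ch in text)
```

A heuristic like this will miss Simplified Chinese text that happens to use none of the listed characters, which is why the comment above speaks of "a good percentage" rather than all toots.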
