Re-introduce automatic language detection for posts #25568

Open
gkaklas opened this issue Jun 24, 2023 · 8 comments
Labels
suggestion Feature suggestion

Comments

@gkaklas

gkaklas commented Jun 24, 2023

Pitch

Many users (even bots!) don't correctly set the language tag when posting. This can result in timelines containing very frequent unreadable posts for users not speaking the language, which is inconvenient.

I imagine it may cause even more inconvenience for users of screen readers, since they might be configured for one language, but try to start narrating posts written in another.

Personally, I have started muting accounts that set the language tag incorrectly, in the hope that this might at least catch frequent posters for my followed hashtags, but obviously this isn't a solution, and I still get a lot of noise in my timeline.

Relevant issues: #21400 #19893 #21631 #22598
Language detection seems to have been removed in #17478

Feel free to close as a duplicate, but I felt we could have a consolidated and maybe more technical issue. Searching is:issue is:open language in:title gives 68 results, and in general it's difficult to get a sense of the status for this feature and all our options.

Motivation

Here is a summary on some of the issues I found and some more thoughts:

(Obviously client-side solutions are inexpensive for the server, but would require implementing the feature in every client i.e. web UI, mobile and desktop apps)

  • Add option "Ask for each post" to the "Posting language" preference setting (and have the post editor behave accordingly) #21631 suggests asking the user the language for each post they make
    • should be easy to implement
    • requires user cooperation, user might find the option annoying if enabled by default
  • Language detection based on first characters #21400 suggests doing a rough estimation based on characters in post
    • (probably better if done server-side)
    • probably cheaper than more advanced heuristics
    • Super easy way to tag some languages like English, German, Greek, Arabic, and Russian.
      • Not saying I'm pro this, but if in general we face technical limitations with other (or this) techniques, with this method we can at least remove the English tag from non-English posts, which would filter out noise for I suspect a huge userbase (couldn't find statistics on this)
    • Might sometimes give false results for:
      • posts containing the same content in multiple languages
      • posts containing words, names, or references in other languages (probably easily solvable with something like, calculating the percentage of words detected for each language etc)
  • This comment mentions they experimented with lingua-py
  • Would we like this to happen on the Mastodon server? If we opt for a more accurate but computationally expensive solution that doesn't scale well with the number of posts we might prefer to:
    • Run a custom server with an API: this way smaller instances (even non-Mastodon ones!) with fewer resources can be served
    • Write a local daemon to run on the server of the Mastodon instance, but in a compiled language
    • Using LibreTranslate's detection (LibreTranslate autodetect limitations and issues #22598) might be the simplest solution, but may require coordination with the LibreTranslate project first, because of the increased load on their public instances. Maybe there could be an instance specifically hosted for the Fediverse by either project?
    • (The first two might be overkill or over-engineering, but I just wanted to list all possibilities out there, since I just want to help out the project members who will decide how Mastodon, as a whole, would like to handle it)
  • https://github.com/topics/language-detection
  • Regardless of the preferred solution, we might also want the detection to only be triggered on-demand by other users:
    • A user could "report" a post as being tagged with the incorrect language. This would mean that the server would have to change the language of the post after it has been posted, which we may not want, or it may cause issues with federation?
    • More beneficial might be for a user to report another user: This would run language detection for every post of that user in the future. (In cases of a "malicious" or accidental report, the detection could stop running after posting a few times with the correct language tag, or analyzing a few of the user's random past posts)
  • Another idea is that we could not care about it, and leave it to the clients to handle it. However I don't like this idea because:
    • of duplication of code and effort
    • in case of e.g. an external detection API, like LibreTranslate, there would be multiplied load on the API
    • it might increase latency for the user, use more network data, etc
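
To make the character-based option from #21400 a bit more concrete, here is a minimal sketch of how the share of each script in a post could be computed and used to flag an implausible tag (e.g. an English tag on a Greek post). Everything in it is an assumption for illustration: the names `script_shares`, `tag_is_implausible`, the `EXPECTED_SCRIPT` table and the 0.5 threshold are hypothetical, and it is written in Python rather than Ruby, so it is not how Mastodon does or would implement this.

```python
from __future__ import annotations

import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a coarse script label. Only a handful of scripts are covered.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "CJK": "Han",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "HANGUL": "Hangul",
}

# Hypothetical table: which script a language tag is expected to use.
# Tags that are missing here are never flagged.
EXPECTED_SCRIPT = {
    "en": "Latin",
    "de": "Latin",
    "el": "Greek",
    "ru": "Cyrillic",
    "ar": "Arabic",
}


def script_shares(text: str) -> dict[str, float]:
    """Return the fraction of recognised letters belonging to each script."""
    counts: Counter[str] = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, emoji, whitespace
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = SCRIPT_PREFIXES.get(prefix)
        if script is not None:
            counts[script] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


def tag_is_implausible(text: str, tag: str, min_share: float = 0.5) -> bool:
    """True if the tag's expected script makes up less than min_share of the letters."""
    expected = EXPECTED_SCRIPT.get(tag)
    shares = script_shares(text)
    if expected is None or not shares:
        return False  # nothing to judge: unknown tag or no letters
    return shares.get(expected, 0.0) < min_share
```

Working with shares rather than the first few characters is one way to tolerate posts that contain a few names or words in another script. It still cannot tell English from German, since both use Latin letters.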

Unfortunately I probably won't be able to help with the code a lot because of limited time, inexperience with the codebase, and not having used Ruby in a while... But I might be able to test some functionality on a test instance!

Has this been discussed before? Have I missed an important issue on the tracker? Is there an option I have missed? (I have obviously set my preferred languages in settings, tried multiple clients, and have examined a few random posts with the API to check their language tag) Is there some work currently under way on a separate fork or branch but not deployed yet?

@gkaklas gkaklas added the suggestion Feature suggestion label Jun 24, 2023
@Gargron
Member

Gargron commented Jun 24, 2023

It's not possible to detect language accurately for short posts. I'd be open to improving the UX in a way that motivates people to select their language correctly, but I don't think we should go back to misclassifying languages.

@gkaklas
Author

gkaklas commented Jun 24, 2023

Of course, ideally the UX would encourage users to tag with the correct language and we would prefer them to be mindful of the option (like with people spending some time to write image captions). But many users may not care, not all clients may implement the feature, or the user might find the option annoying and disable it altogether.

I haven't done extensive research on this, but the description in lingua-py's repo says "suitable for long and short text alike", and adds: "For very short text snippets such as Twitter messages, [CLD3 and others] do not provide adequate results. [...] Lingua aims at eliminating these problems." This user says that the results looked encouraging 🤔.

I could propose that we could test it with existing toots first to see the accuracy, but someone knowing more than me on the subject might have already done it, and some results in lingua-py's repo might be relevant for this use case.
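
A test like that could be as simple as the sketch below: take posts whose language is already known, run a detector over them, and report accuracy separately for short and long posts, since short posts are where the doubt is. The function name `accuracy_by_length`, the five-word cut-off and the toy detector are all hypothetical; a real library such as lingua-py would have to be wrapped in a function that takes a string and returns a language code.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, Optional

# Any detector is modelled as: text -> language code (or None if unsure).
Detector = Callable[[str], Optional[str]]


def accuracy_by_length(
    samples: Iterable[tuple[str, str]],
    detect: Detector,
    short_max_words: int = 5,
) -> dict[str, float]:
    """Compute detection accuracy for 'short' and 'long' posts separately.

    samples yields (text, true_language_code) pairs. Posts with at most
    short_max_words whitespace-separated words count as 'short'.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for text, truth in samples:
        bucket = "short" if len(text.split()) <= short_max_words else "long"
        totals[bucket] += 1
        if detect(text) == truth:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def toy_detect(text: str) -> str:
    """Deliberately naive stand-in detector, only here to exercise the harness."""
    return "de" if "ich" in text.lower().split() else "en"
```

The labelled samples would need to come from posts whose language tag has been checked by hand, otherwise the harness only measures agreement with the existing (possibly wrong) tags.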

There is also the option of running the detection when a user (or many users) report another user as generally tagging with the wrong language. This way, automatic detection can't be worse since the posts would already be tagged with the wrong language. The reports could also ask the reporting user what language they think it is, and factor that in with the automatic detection. And automatic detection could also be disabled for really small (e.g. 2-word) posts.
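
Roughly, the report-triggered idea could look like the following sketch. It assumes a detector that returns a confidence per language; the function name `resolve_reported_language`, the bonus given to the reporter's guess, the confidence threshold and the minimum word count are all made-up values for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional

# Assumed detector shape: text -> {language_code: confidence in [0, 1]}.
ConfidenceDetector = Callable[[str], dict[str, float]]


def resolve_reported_language(
    text: str,
    tagged: str,
    reporter_guess: Optional[str],
    detect: ConfidenceDetector,
    min_words: int = 3,
    reporter_bonus: float = 0.2,
    threshold: float = 0.7,
) -> str:
    """Return the language to use for a reported post.

    Falls back to the author's tag when the post is too short to judge
    or when no candidate reaches the threshold.
    """
    if len(text.split()) < min_words:
        return tagged  # too short: do not run detection at all

    scores = dict(detect(text))  # copy so the detector's result is not mutated
    if reporter_guess is not None:
        # Factor in what the reporting user thinks the language is.
        scores[reporter_guess] = scores.get(reporter_guess, 0.0) + reporter_bonus
    if not scores:
        return tagged

    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else tagged
```

In this sketch the reporter's guess only nudges the score, so a single report cannot override a confident detection pointing elsewhere.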

@mxamber

mxamber commented Jun 26, 2023

How about some sort of more prominent language selector? As it is now, I imagine many users glance past the language button or may not even know it's there; visually highlighting it might help. If automatically classifying languages entirely isn't on the menu, some sort of indication if a language different from the set one has been possibly identified might also help, like a warning background colour for the button if e.g. it's set to English but the software has identified a reasonable probability the post is actually in German.

(The best thing I can think of for visually highlighting the button is displaying a flag next to the language, but that doesn't really work with colonial languages like French and English that are the official languages of several countries.)

@Deuchnord

The Ice Cubes client for iOS has a good proposal IMO for this kind of issue: let's say I'm writing a new post in English, but the language is set to French (because I usually post in French, and I forgot to change it). Then, when I tap the Publish button, the application asks me to confirm the language first. Mastodon's Web UI could reproduce this behavior, couldn't it?

Screenshot

Here, I have written a post in English ("This post is definitely not in French"), and I have tapped the "Post" button on top right.
The application has then displayed two options:

  • Post in English (detected language)
  • Post in French (selected language)
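
The decision behind such a prompt is small. Here is a sketch of it, not Ice Cubes' actual code: the function name `publish_options`, the confidence value and the 0.8 cut-off are assumptions, and how the language is detected is left out entirely.

```python
from __future__ import annotations

from typing import Optional


def publish_options(
    selected: str,
    detected: Optional[str],
    confidence: float,
    min_confidence: float = 0.8,
) -> list[tuple[str, str]]:
    """Return the choices to show when the user hits Publish.

    A single entry means: post directly with the selected language.
    Two entries mean: ask the user to confirm, detected language first.
    """
    mismatch = detected is not None and detected != selected
    if mismatch and confidence >= min_confidence:
        return [
            (detected, "detected language"),
            (selected, "selected language"),
        ]
    return [(selected, "selected language")]
```

Because the user always makes the final choice, a wrong detection costs one extra tap instead of a mistagged post.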

@gunchleoc
Contributor

I like the Ice Cubes approach.

There's also new technology out there that might improve the quality of detection - article published today: https://diff.wikimedia.org/2023/10/24/open-language-identification-api-for-200-languages/

@Haui1112

Haui1112 commented Dec 9, 2023

Instance owner here. Same issue, and the problem is bigger than most of you think.

If every 10th person who sees Cyrillic, Chinese or other foreign characters spammed in their feed gives up and goes back to xitter, that is 10% of all new users gone. This could derail both Mastodon and the fediverse, just because we thought it's "not that bad".

I'm an admin myself and understand you get used to bugs but this is preventable and therefore a nonissue in my book.

Take out the Cyrillic and Chinese alphabets (posts that contain them) for those not using them first.

Next thing is custom phrase blacklists (not manual but whole lists of phrases). Like "ich gehe" detects German, throws out the post.

This should work pretty easily without massive traffic or cpu load.
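
A rough sketch of the phrase-list part, to show what I mean. The phrase lists, the names `matched_languages` and `should_hide`, and the rule "hide only if none of the matched languages is one the reader uses" are all assumptions for the example.

```python
from __future__ import annotations

import re

# Hypothetical phrase lists; real ones would be much longer.
PHRASES = {
    "de": ("ich gehe", "ich bin"),
    "fr": ("je suis", "je vais"),
}

# One case-insensitive pattern per language, matching whole phrases only.
_PATTERNS = {
    lang: re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    for lang, phrases in PHRASES.items()
}


def matched_languages(text: str) -> set[str]:
    """Languages for which at least one listed phrase occurs in the text."""
    return {lang for lang, pattern in _PATTERNS.items() if pattern.search(text)}


def should_hide(text: str, reader_languages: set[str]) -> bool:
    """Hide a post if it matches phrase lists, but none for a language the reader uses."""
    matched = matched_languages(text)
    return bool(matched) and matched.isdisjoint(reader_languages)
```

One regex search per language per post is cheap, but a post that merely quotes a listed phrase would also be hidden.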

@mcgrew

mcgrew commented Jan 8, 2024

I would like to see some sort of language auto-detection myself.

Personally I see this as a client-side solution. Have the user select what languages to auto-detect, look up the words in dictionaries for each of those languages at post time, and if more of the words match a different language than the one selected, prompt the user for confirmation before posting. Of course a different detection might be needed for languages that don't use spaces, such as Japanese.

By default this could be disabled since I assume most users only post in one language (American here, so I may be wrong on this).

This is similar to the Ice Cubes solution. I would think this would have minimal CPU impact, and then only client side. I only post in 2 languages, English and Japanese, but I sometimes forget to tag the language correctly, so it would be nice to be prompted by the UI if it thinks I've made a mistake.
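
Something along these lines is what I have in mind, shown in Python only to keep it short (a real client would do this in its own language). The tiny word lists and the names `count_matches` and `suggest_language` are placeholders, and it only handles languages that separate words with spaces.

```python
from __future__ import annotations

import re
from typing import Optional

# Placeholder dictionaries; a real client would ship or download full word lists.
WORDLISTS = {
    "en": {"the", "is", "this", "and", "not", "post"},
    "de": {"der", "ist", "das", "und", "nicht", "ich"},
}


def count_matches(text: str, enabled: list[str]) -> dict[str, int]:
    """Count how many words of the text appear in each enabled language's word list."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return {
        lang: sum(word in WORDLISTS.get(lang, set()) for word in words)
        for lang in enabled
    }


def suggest_language(text: str, selected: str, enabled: list[str]) -> Optional[str]:
    """Return a language to ask the user about, or None if the selection looks fine."""
    counts = count_matches(text, enabled)
    if not counts:
        return None  # user has not enabled any language for auto-detection
    best = max(counts, key=counts.get)
    if best != selected and counts[best] > counts.get(selected, 0):
        return best
    return None
```

When `suggest_language` returns a language, the UI would prompt before posting; when it returns `None`, the post goes out with the selected tag as usual.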

@brendanjones

I suspect people posting with the wrong language selected is less of a problem than you might think, because language filters still allow posts from followed hashtags or boosts into your home feed. So even if the right language is selected, those posts still make it into your feed. See #20937 and #20241.

That said, I do notice the occasional person I follow doing it (quite rarely) so it'd be nice to stop it happening the few times that it does. The Ice Cubes approach shown above is perfect.

Projects
None yet
Development

No branches or pull requests

8 participants