Re-introduce automatic language detection for posts #25568
Comments
It's not possible to detect language accurately for short posts. I'd be open to improving the UX in a way that motivates people to select their language correctly, but I don't think we should go back to misclassifying languages.
Of course, ideally the UX would encourage users to tag with the correct language and we would prefer them to be mindful of the option (like with people spending some time to write image captions). But many users may not care, not all clients may implement the feature, or the user might find the option annoying and disable it altogether. I haven't done extensive research on this, but the description in lingua-py's repo says "suitable for long and short text alike", and also: "For very short text snippets such as Twitter messages, [CLD3 and others] do not provide adequate results. [...] Lingua aims at eliminating these problems." This user says that the results looked encouraging 🤔. We could test it on existing toots first to measure the accuracy, but someone who knows more than me on the subject might have already done it, and some results in lingua-py's repo might be relevant for this use case. There is also the option of running the detection only when a user (or many users) report another user as generally tagging with the wrong language. This way, automatic detection can't make things worse, since the posts would already be tagged with the wrong language. The reports could also ask the reporting user what language they think it is, and factor that in with the automatic detection. And automatic detection could also be disabled for really small (e.g. 2-word) posts.
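A minimal sketch of the gating described in the comment above, assuming a detector that returns a language code and a confidence score. The `Detector` interface, the function name `resolve_language`, and the thresholds are all hypothetical; this is not lingua-py's API and not Mastodon code.

```python
from typing import Callable, Tuple

# Hypothetical detector interface: takes text, returns (language_code, confidence in 0..1).
Detector = Callable[[str], Tuple[str, float]]


def resolve_language(
    text: str,
    declared: str,
    detect: Detector,
    author_reported: bool = False,
    min_words: int = 3,            # assumed cutoff; skips e.g. 2-word posts
    min_confidence: float = 0.9,   # assumed threshold
) -> str:
    """Return the language tag to use for a post.

    The declared tag is kept unless the author has been reported for
    mis-tagging, the post is long enough, and the detector is confident.
    """
    if not author_reported:
        return declared
    if len(text.split()) < min_words:
        return declared
    guess, confidence = detect(text)
    return guess if confidence >= min_confidence else declared
```

Counting words with `split()` only makes sense for languages that separate words with spaces, so a real implementation would need a different length measure for Japanese or Chinese.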
How about some sort of more prominent language selector? As it is now, I imagine many users glance past the language button or may not even know it's there; visually highlighting it might help. If automatically classifying languages entirely isn't on the menu, some sort of indication that a language different from the set one has possibly been identified might also help, like a warning background colour for the button if e.g. it's set to English but the software has identified a reasonable probability the post is actually in German. (The best thing I can think of for visually highlighting the button is displaying a flag next to the language, but that doesn't really work with colonial languages like French and English that are the official languages of several countries.)
The Ice Cubes client for iOS has a good proposal IMO for this kind of issue: let's say I'm writing a new post in English, but the language is set to French (because I usually post in French, and I forgot to change it). Then, when I tap the Publish button, the application asks me to confirm the language first. Mastodon's Web UI could reproduce this behavior, couldn't it? Here, I have written a post in English ("This post is definitely not in French"), and I have tapped the "Post" button at the top right.
I like the Ice Cubes approach. There's also new technology out there that might improve the quality of detection - article published today: https://diff.wikimedia.org/2023/10/24/open-language-identification-api-for-200-languages/
Instance owner here. Same issue, and the problem is bigger than most of you realize. If every 10th person who sees Cyrillic, Chinese or other foreign characters spammed in their feed gives up and goes back to xitter, that is 10% of all new users gone. This could derail both Mastodon and the fediverse, just because we thought it's "not that bad". I'm an admin myself and understand you get used to bugs, but this is preventable and therefore a nonissue in my book. First, take out posts containing Cyrillic and Chinese characters for those who don't use those scripts. The next thing is custom phrase blacklists (not entered manually, but whole lists of phrases): for example, "ich gehe" marks a post as German and the post is thrown out. This should work pretty easily without massive traffic or CPU load.
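The script and phrase filter proposed above could look roughly like the following. It uses only Python's standard `unicodedata` module; the function names `scripts_in` and `should_hide`, the default script list, and the phrase list are hypothetical choices for illustration.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Return the first word of the Unicode name of every alphabetic character,
    e.g. "LATIN", "CYRILLIC", "CJK". This is a rough stand-in for script detection."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def should_hide(text, blocked_scripts=("CYRILLIC", "CJK"), blocked_phrases=()):
    """Hide a post if it contains a blocked script or any blocked phrase."""
    if scripts_in(text) & set(blocked_scripts):
        return True
    lowered = text.casefold()
    # Plain substring matching: cheap, but prone to false positives.
    return any(phrase.casefold() in lowered for phrase in blocked_phrases)


print(should_hide("Hello world"))                                      # False
print(should_hide("Heute ich gehe", blocked_phrases=["ich gehe"]))     # True
```

Note that this hides any post containing even a single character of a blocked script, so an English post that quotes one Cyrillic word would also disappear; a real filter would probably want a ratio threshold.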
I would like to see some sort of language auto-detection myself. Personally I see this as a client-side solution. Have the user select what languages to auto-detect, look up the words in dictionaries for each of those languages at post time, and if more of the words match a different language than the one selected, prompt the user for confirmation before posting. Of course a different detection might be needed for languages that don't use spaces, such as Japanese. By default this could be disabled since I assume most users only post in one language (American here, so I may be wrong on this). This is similar to the Ice Cubes solution. I would think this would have minimal CPU impact, and then only client side. I only post in 2 languages, English and Japanese, but I sometimes forget to tag the language correctly, so it would be nice to be prompted by the UI if it thinks I've made a mistake.
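The dictionary lookup described above can be sketched in a few lines. The word lists here are tiny, hypothetical placeholders (a real client would ship proper dictionaries for the languages the user selected), and `suggest_language` is an invented name.

```python
import re

# Hypothetical, tiny word lists standing in for real per-language dictionaries.
WORDLISTS = {
    "en": {"the", "this", "is", "not", "in", "post", "and", "definitely"},
    "fr": {"le", "la", "ce", "est", "pas", "en", "et", "dans"},
}


def count_matches(text, wordlists):
    """Count how many words of the text appear in each language's word list."""
    words = re.findall(r"[^\W\d_]+", text.casefold())  # runs of letters only
    return {lang: sum(w in vocab for w in words) for lang, vocab in wordlists.items()}


def suggest_language(text, selected, wordlists=WORDLISTS):
    """Return a language to suggest instead of `selected`, or None to post as-is."""
    counts = count_matches(text, wordlists)
    best = max(counts, key=counts.get)
    if best != selected and counts[best] > counts.get(selected, 0):
        return best
    return None


# The client would show a confirmation prompt whenever this is not None.
print(suggest_language("This post is definitely not in French", selected="fr"))  # en
```

Splitting on letter runs assumes space-separated words, so, as the comment says, languages such as Japanese would need a different check (for instance, one based on the characters used).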
I suspect people posting with the wrong language selected is less of a problem than you might think, because language filters still allow posts from followed hashtags or boosts into your home feed. So even if the right language is selected, those posts still make it into your feed. See #20937 and #20241. That said, I do notice the occasional person I follow doing it (quite rarely) so it'd be nice to stop it happening the few times that it does. The Ice Cubes approach shown above is perfect.
Pitch
Many users (even bots!) don't correctly set the language tag when posting. As a result, timelines frequently contain posts that are unreadable for users who don't speak the language, which is inconvenient.
I imagine it may cause even more inconvenience for users of screen readers, since they might be configured for one language, but try to start narrating posts written in another.
Personally, I have started muting accounts that set the language tag incorrectly, in the hope that this might at least catch frequent posters for my followed hashtags, but obviously this isn't a solution, and I still get a lot of noise in my timeline.
Relevant issues: #21400 #19893 #21631 #22598
Language detection seems to have been removed in #17478
Feel free to close as a duplicate, but I felt we could have a consolidated and maybe more technical issue. Searching
is:issue is:open language in:title
gives 68 results, and in general it's difficult to get a sense of the status for this feature and all our options.
Motivation
Here is a summary of some of the issues I found and some more thoughts:
(Obviously client-side solutions are inexpensive for the server, but would require implementing the feature in every client i.e. web UI, mobile and desktop apps)
English tag from non-English posts, which would filter out noise for, I suspect, a huge userbase (couldn't find statistics on this)
Unfortunately I probably won't be able to help with the code a lot because of limited time, inexperience with the codebase, and I haven't used Ruby in a while... But I might be able to test some functionality on a test instance!
Has this been discussed before? Have I missed an important issue on the tracker? Is there an option I have missed? (I have obviously set my preferred languages in settings, tried multiple clients, and have examined a few random posts with the API to check their language tag) Is there some work currently on the way on a separate fork or branch but not deployed yet?