Certain Spanish twitter pages are identified as English #10

Closed
xiaochengh opened this issue Mar 6, 2018 · 1 comment

Comments

@xiaochengh

This was found in crbug.com/809243.

Repro twitter page: https://twitter.com/paurubio

Almost all of the tweets are in Spanish, but the page is still identified as English when Chrome detects the page language using CLD.

I'm also attaching Chrome's text dump that's passed to CLD for language detection: dump.txt

@jasonriesa
Collaborator

Language ID for multilingual spans on multilingual pages is not supported by Chrome right now. CLD3 will return the most prevalent language it finds. Looking at your dump, there is mostly English text with some short Spanish text segments, so the model returns English. We don't do any special processing for Twitter.
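
To make the "most prevalent language" point concrete, here is a toy sketch in Python. It is not CLD3's model or API: the `most_prevalent_language` helper, the per-segment language labels, and the sample strings are all hypothetical, and the real text in dump.txt is not reproduced. It only shows how a page whose navigation and UI text is English can outweigh a few short Spanish tweets when a single page-level label is chosen by share of text.

```python
from collections import Counter


def most_prevalent_language(segments):
    """Pick the label that covers the most characters across all segments.

    `segments` is a list of (language_label, text) pairs. The labels are
    assumed to be given; this toy does no actual language detection.
    Returns (winning_label, {label: share_of_characters}).
    """
    totals = Counter()
    for lang, text in segments:
        totals[lang] += len(text)
    if not totals:
        raise ValueError("no segments to score")
    total_chars = sum(totals.values())
    shares = {lang: count / total_chars for lang, count in totals.items()}
    return max(totals, key=totals.get), shares


# Hypothetical page dump: English UI text surrounding short Spanish tweets.
segments = [
    ("en", "Home Moments Notifications Messages Search Twitter Have an account? Log in"),
    ("es", "Buenos días a todos"),
    ("en", "Tweets Following Followers Likes Who to follow Trends for you"),
    ("es", "Gracias por venir"),
    ("en", "New to Twitter? Sign up now to get your own personalized timeline!"),
]

winner, shares = most_prevalent_language(segments)
print(winner, shares)
```

In this sketch the Spanish segments are real content, but they account for a small share of the characters, so the single page-level answer is English. Under that assumption, a caller that wants a better result for a page like this would have to pass in only the tweet text rather than the whole page dump.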
