This repository has been archived by the owner on Sep 4, 2023. It is now read-only.

Improved language detection #30

Closed
andrenatal opened this issue Jan 13, 2022 · 2 comments · Fixed by #152

Comments


andrenatal commented Jan 13, 2022

We should broaden the language detection heuristic: extract and analyze more of the page's content to determine its language, instead of traversing only the divs.
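One way to picture a broader extraction step is sketched below. It is only an illustration, not the project's code: it uses Python's standard `html.parser`, and the names `VisibleTextExtractor`, `extract_page_text` and `SKIPPED_TAGS` are hypothetical. It collects text from every element rather than only divs, skips tags whose content is not shown to the reader, and also records the `lang` attribute declared on `<html>` as an extra hint.

```python
from html.parser import HTMLParser

# Tags whose text is not shown to the reader (an assumption of this sketch).
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextExtractor(HTMLParser):
    """Collects text from all elements and the lang declared on <html>."""

    def __init__(self):
        super().__init__()
        self.declared_lang = None  # value of <html lang="...">, if present
        self.chunks = []           # non-empty text fragments in page order
        self._skip_depth = 0       # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.declared_lang = dict(attrs).get("lang")
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_page_text(html):
    """Returns (declared_lang, text) for an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.declared_lang, " ".join(parser.chunks)
```

The returned text can then be handed to whichever classifier is chosen, and the declared language can serve as a tie-breaker when the classifier is unsure.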

@andrenatal andrenatal added the enhancement New feature or request label Jan 26, 2022
andrenatal commented

Maybe we could use fastText's language identification models: https://fasttext.cc/docs/en/language-identification.html
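For reference, a rough sketch of what calling fastText for language identification could look like from Python. It assumes the `fasttext` Python package is installed and that one of the pre-trained models from the page above (`lid.176.ftz` or `lid.176.bin`) has been downloaded. The helper names and the 0.5 confidence threshold are hypothetical choices for illustration, not part of fastText.

```python
LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def parse_label(label):
    """Turns '__label__en' into 'en'; other strings pass through unchanged."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def detect_language(model, text, min_confidence=0.5):
    """Returns the top language code, or None if the text is empty or the
    prediction is below min_confidence (a hypothetical threshold)."""
    # predict() processes one line at a time, so collapse newlines and
    # other whitespace runs into single spaces first.
    cleaned = " ".join(text.split())
    if not cleaned:
        return None
    labels, scores = model.predict(cleaned, k=1)
    if not labels or float(scores[0]) < min_confidence:
        return None
    return parse_label(labels[0])


def load_lid_model(path="lid.176.ftz"):
    """Loads a downloaded language identification model (path is an assumption)."""
    import fasttext  # imported lazily so the helpers above work without it
    return fasttext.load_model(path)
```

Usage would be along the lines of `detect_language(load_lid_model(), page_text)`. Returning `None` on low confidence leaves room to fall back to another signal, such as the page's declared language.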

@andrenatal andrenatal self-assigned this Feb 4, 2022
@andrenatal andrenatal added this to the W4 milestone Feb 10, 2022
@andrenatal andrenatal changed the title Improved language detecion Improved language detection Feb 10, 2022
@andrenatal andrenatal modified the milestones: W4, W5 Feb 10, 2022

kpu commented Feb 14, 2022

According to #103 (comment), CLD2 is returning the code "no" instead of "nb" and "nn". This is a tell that the small version of the CLD2 classifier is being used. The full version covers more languages and is more accurate (not only for Norwegian). See bitextor/warc2text@e2543e4 for an example of using the full CLD2 classifier.
Or use fasttext.
