Option to keep documents that can't be identified #88

Open
Uinelj opened this issue Feb 3, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@Uinelj
Member

Uinelj commented Feb 3, 2023

We could add an option that enables keeping documents that are not identifiable (where the classifier can't infer a document language), for further inspection.
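
A minimal sketch of what such an option could look like, written in Python for brevity. The names `route_documents`, `keep_unidentified`, `toy_identify`, and the `"und"` bucket label are hypothetical and not part of Ungoliant's interface; the classifier is represented as a plain callable that returns a language label, or `None` when no language can be inferred.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def route_documents(
    docs: Iterable[str],
    identify: Callable[[str], Optional[str]],
    keep_unidentified: bool = False,
) -> Dict[str, List[str]]:
    """Group documents by identified language.

    Documents for which `identify` returns None are dropped, unless
    `keep_unidentified` is set, in which case they are collected under
    the hypothetical "und" key for later inspection.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        lang = identify(doc)
        if lang is None:
            if keep_unidentified:
                buckets["und"].append(doc)
            continue
        buckets[lang].append(doc)
    return dict(buckets)


def toy_identify(doc: str) -> Optional[str]:
    """Stand-in classifier for illustration only (not a real langID model)."""
    words = doc.lower().split()
    if "the" in words:
        return "en"
    if "le" in words:
        return "fr"
    return None


docs = ["the cat", "le chat", "12345"]
default_run = route_documents(docs, toy_identify)
keeping_run = route_documents(docs, toy_identify, keep_unidentified=True)
```

With the flag off, `"12345"` is discarded as before; with it on, the same document ends up in the `"und"` bucket and nothing else changes.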

@Uinelj Uinelj added the enhancement New feature or request label Feb 3, 2023
@Uinelj Uinelj self-assigned this Feb 3, 2023
@chris-ha458
In the case of mC4 (also called c4/multilingual), the undetermined portion ('und') of mC4 3.1 consists of documents for which the highest language confidence reported by their langID tool, cld3, is below 0.95. Since Ungoliant works differently and uses different langID tools and models (fasttext with lid176.bin, though I hope to petition to change this to lid218), the specific processes and cutoffs might have to be different.
Since Ungoliant records a per-sentence confidence score, many options could be explored.
The current average confidence, weighted by byte count, seems a very good compromise, especially compared to a simple mean.
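
To illustrate the difference between the two averages, here is a rough sketch in Python. It assumes each sentence is a `(text, confidence)` pair and that weights are UTF-8 byte lengths; the function names are hypothetical, and the 0.95 default is simply the mC4/cld3 cutoff mentioned above, borrowed for illustration rather than a value Ungoliant uses.

```python
from typing import List, Tuple

Sentence = Tuple[str, float]  # (sentence text, langID confidence)


def simple_mean(sentences: List[Sentence]) -> float:
    """Unweighted mean of per-sentence confidences."""
    if not sentences:
        return 0.0
    return sum(conf for _, conf in sentences) / len(sentences)


def byte_weighted_mean(sentences: List[Sentence]) -> float:
    """Mean confidence where each sentence counts by its UTF-8 byte length."""
    total_bytes = sum(len(text.encode("utf-8")) for text, _ in sentences)
    if total_bytes == 0:
        return 0.0
    weighted = sum(len(text.encode("utf-8")) * conf for text, conf in sentences)
    return weighted / total_bytes


def is_undetermined(sentences: List[Sentence], threshold: float = 0.95) -> bool:
    """Flag a document whose byte-weighted confidence is below the cutoff."""
    return byte_weighted_mean(sentences) < threshold


# One short low-confidence line followed by a long high-confidence one.
doc = [("ok", 0.2), ("a" * 98, 1.0)]
plain = simple_mean(doc)            # (0.2 + 1.0) / 2
weighted = byte_weighted_mean(doc)  # (2 * 0.2 + 98 * 1.0) / 100
```

Here the simple mean is about 0.6, while the byte-weighted mean is about 0.984: a short noisy line barely moves the weighted score, so the document would not be flagged as undetermined at a 0.95 cutoff.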

In any case, this would be very useful. The 'und' portion of mC4 is second only to English in byte size, and it is ripe with opportunities for humans to get involved, whether to salvage data or to understand langID behavior.

Some others and I are actively doing such salvaging, and here is an example of these efforts.
