False positives on the English-only subset. #7
Comments
Much better results using the de/en/es/fr/it/nl subset, but I'm not excited about the additional memory usage of larger ngram subsets, and I only need English language detection.
$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';
$eld = new LanguageDetector('ngramsM60-6.5ijqhj4oecso0kwcok4k4kgoscwg80o.php');
$eld->cleanText(true);
$eld->detect($content);
The isReliable() function can definitely be improved. The main problem is that when ELD finds only one language in a string, it scores it very high; some accommodation for very small subsets, or for subsets with only one language, should be added. I am currently finishing ver. 3.0.0, which has a new scoring system that will most likely not solve this issue, but I will leave this issue open and tackle the problem next, at ver. 3.0.
P.S. Since I cannot provide you with a quick fix: if you still want to try ELD for this English-only scenario, you would need to create your own benchmark of English and non-English strings, and modify the
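The benchmark suggested above could be sketched roughly as follows. This is only an illustration: `acceptRate()` and `toyEnglishScore()` are hypothetical names, not part of ELD, and the toy scorer is a stand-in for whatever English confidence value you extract from the detector. The idea is to sweep a cutoff and see how many English strings are kept versus how many non-English strings slip through.

```php
<?php
// Hypothetical helper (not part of ELD): fraction of $texts whose score
// from $scoreEnglish reaches $threshold. Run it on English strings to get
// recall, and on non-English strings to get the false-positive rate.
function acceptRate(callable $scoreEnglish, array $texts, float $threshold): float
{
    if ($texts === []) {
        return 0.0;
    }
    $accepted = 0;
    foreach ($texts as $text) {
        if ($scoreEnglish($text) >= $threshold) {
            $accepted++;
        }
    }
    return $accepted / count($texts);
}

// Toy stand-in scorer for demonstration only: share of words that appear
// in a tiny list of common English words. Replace it with a wrapper around
// your detector's English score.
function toyEnglishScore(string $text): float
{
    $common = ['the', 'is', 'a', 'of', 'and', 'to', 'in', 'this'];
    $words = preg_split('/\W+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return 0.0;
    }
    $hits = count(array_filter($words, fn($w) => in_array($w, $common, true)));
    return $hits / count($words);
}

// Small labeled sets; a real benchmark would use many more strings.
$english = [
    'This is a test of the detector.',
    'The cat is in the garden.',
];
$nonEnglish = [
    'Nostrum et sapiente in ipsam amet quas ut.',
    'Modi recusandae ad ut incidunt minima molestiae.',
];

// Sweep candidate cutoffs and report both rates for each one.
foreach ([0.1, 0.3, 0.5] as $threshold) {
    printf(
        "threshold %.1f: English kept %.2f, non-English accepted %.2f\n",
        $threshold,
        acceptRate('toyEnglishScore', $english, $threshold),
        acceptRate('toyEnglishScore', $nonEnglish, $threshold)
    );
}
```

With a benchmark like this you can pick the lowest cutoff that rejects your non-English samples while still keeping the English ones, instead of relying on a single built-in reliability check.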
Keep us posted on your journey to version 3.0 @nitotm!
It is taking me a bit longer, since I decided to integrate all the changes that were long planned, plus new suggestions, and I keep finding things to improve. The improvements in accuracy are great, and so are those in efficiency, although since I have added more steps and a bigger database, I'm not sure it will end up being faster. The additions are worth it, though.
The English-only ngrams subset doesn't work too well. Is it trained on the same dataset as the others?