False positives on the English-only subset. #7

Open
pryley opened this issue Nov 21, 2023 · 4 comments

pryley commented Nov 21, 2023

The English-only ngrams subset doesn't work too well: the Latin filler text below is detected as English, and isReliable() returns true. Is it trained on the same dataset as the others?

use Nitotm\Eld\LanguageDetector;

// Latin filler text, not English.
$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

// Load the English-only ngrams subset.
$eld = new LanguageDetector('ngramsM60-1.2rrx014rx6yos0gkkogws8ksc0okcwk.php');
$eld->cleanText(true);
$eld->detect($content);
Nitotm\Eld\LanguageResult {
  +language: "en",
  +scores: [
    "en" => 0.52298237476809,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "en",
  scores: [
    "en" => 0.52298237476809,
  ],
  isReliable(): true,
}

pryley commented Nov 21, 2023

Much better results using the de/en/es/fr/it/nl subset, but I'm not excited about the additional memory usage of larger ngram subsets and I only need English language detection.

use Nitotm\Eld\LanguageDetector;

// Same Latin filler text as in the first comment.
$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

// Load the de/en/es/fr/it/nl ngrams subset.
$eld = new LanguageDetector('ngramsM60-6.5ijqhj4oecso0kwcok4k4kgoscwg80o.php');
$eld->cleanText(true);
$eld->detect($content);
Nitotm\Eld\LanguageResult {#8965
  +language: "it",
  +scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "it",
  scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  isReliable(): false,
}
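
The six-language result above hints at one possible workaround, sketched here rather than taken from ELD: accept a text as English only when "en" is the top score and leads the runner-up by some margin. The function name `looksEnglish` and the default margin of 0.05 are assumptions made up for illustration; the scores are copied from the two dumps in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of ELD: treat a text as English only when
 * "en" has the highest score AND beats the runner-up by at least $minMargin.
 *
 * @param array<string, float> $scores language code => score, shaped like
 *                                     the "scores" array in the dumps above
 * @param float $minMargin assumed cutoff; it would need tuning on real data
 */
function looksEnglish(array $scores, float $minMargin = 0.05): bool
{
    if ($scores === []) {
        return false;
    }

    arsort($scores); // highest score first, keys preserved
    $languages = array_keys($scores);
    $values = array_values($scores);

    if ($languages[0] !== 'en') {
        return false;
    }

    // With a single candidate there is no runner-up, so the margin is
    // measured against 0.0 and the check cannot reject anything useful.
    $runnerUp = $values[1] ?? 0.0;

    return ($values[0] - $runnerUp) >= $minMargin;
}

// Scores copied from the de/en/es/fr/it/nl result above.
$sixLanguageScores = [
    'it' => 0.30525471552257,
    'en' => 0.29610196351268,
    'fr' => 0.28801600185529,
    'es' => 0.24844426406926,
    'nl' => 0.17167980828695,
    'de' => 0.15610892084106,
];

var_dump(looksEnglish($sixLanguageScores));           // bool(false): "it" is on top
var_dump(looksEnglish(['en' => 0.52298237476809]));  // bool(true): no runner-up to compare
```

The second call shows why this only helps with a multi-language subset: with the English-only subset there is nothing to compare against, so the memory cost of the larger subset remains.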

nitotm (Owner) commented Nov 21, 2023

The isReliable() function can definitely be improved.
As it stands, ELD is not well suited to deciding whether a string is in a specific language when it is used with a single-language subset.

The main problem is that when ELD finds only one language in a string, it scores it very high. Some adjustment for this case should be added for very small subsets, or subsets with only one language.

I am currently finishing version 3.0.0, which has a new scoring system. It will most likely not solve this issue, so I will leave this "issue" open and tackle the problem next, building on version 3.0.

PS: Since I cannot provide you with a quick fix, if you still want to try ELD for this English-only scenario, you would need to create your own benchmark of English and non-English strings, then increase the 'en' => 0.0378 value in resources/avgScore.php and either calculate or guess the optimal value. Before that, it might also help in this scenario to decrease $relevancy = 27; in src/LanguageDetector.php to a value between 8 and 20, but I'm just guessing here.
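
The benchmark idea above could be organized along these lines. This is only a sketch under assumptions: `bestThreshold` is a made-up helper, not part of ELD, and the sample scores are placeholders, not real ELD output. In practice each score would come from your own labelled strings, for example from the public scores array seen in the dumps earlier in this thread (`$eld->detect($text)->scores['en'] ?? 0.0`). It does not edit resources/avgScore.php; it only shows how a cutoff might be picked from labelled data.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of ELD: try every observed score as a cutoff
 * and keep the one that classifies the most labelled samples correctly.
 * A sample counts as "predicted English" when its score >= cutoff.
 *
 * @param list<array{score: float, isEnglish: bool}> $samples
 * @return array{threshold: float, accuracy: float}
 */
function bestThreshold(array $samples): array
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one labelled sample is required.');
    }

    // Candidate cutoffs: each distinct score seen in the benchmark, ascending.
    $candidates = array_unique(array_column($samples, 'score'));
    sort($candidates);

    $total = (float) count($samples);
    $best = ['threshold' => 0.0, 'accuracy' => -1.0];

    foreach ($candidates as $threshold) {
        $correct = 0;
        foreach ($samples as $sample) {
            $predictedEnglish = $sample['score'] >= $threshold;
            if ($predictedEnglish === $sample['isEnglish']) {
                $correct++;
            }
        }

        $accuracy = $correct / $total;
        // Strict ">" keeps the lowest cutoff when several tie.
        if ($accuracy > $best['accuracy']) {
            $best = ['threshold' => $threshold, 'accuracy' => $accuracy];
        }
    }

    return $best;
}

// Placeholder benchmark data, invented for illustration only.
$samples = [
    ['score' => 0.9, 'isEnglish' => true],
    ['score' => 0.8, 'isEnglish' => true],
    ['score' => 0.7, 'isEnglish' => true],
    ['score' => 0.6, 'isEnglish' => false],
    ['score' => 0.5, 'isEnglish' => false],
];

$result = bestThreshold($samples);
printf("threshold=%.2f accuracy=%.2f\n", $result['threshold'], $result['accuracy']);
// prints "threshold=0.70 accuracy=1.00"
```

Real data will rarely separate this cleanly, so it is probably worth looking at false positives and false negatives separately instead of a single accuracy figure.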

flexchar commented Feb 1, 2024

Keep us posted on your journey to version 3.0 @nitotm!

nitotm (Owner) commented Feb 2, 2024

It is taking me a bit longer, since I decided to integrate all the changes that were long planned, as well as new suggestions, and I keep finding things to improve.

The improvements in accuracy are great, and in efficiency too. However, since I have added more steps and a bigger database, I'm not sure it will end up being faster, but the additions are worth it.
