False positives on the English-only subset. #7

pryley · 2023-11-21T18:01:55Z

The English-only ngrams subset doesn't work too well. Is it trained on the same dataset as the others?

$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

$eld = new LanguageDetector('ngramsM60-1.2rrx014rx6yos0gkkogws8ksc0okcwk.php');
$eld->cleanText(true);
$eld->detect($content);

Nitotm\Eld\LanguageResult {
  +language: "en",
  +scores: [
    "en" => 0.52298237476809,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "en",
  scores: [
    "en" => 0.52298237476809,
  ],
  isReliable(): true,
}

pryley · 2023-11-21T18:06:08Z

Much better results using the de/en/es/fr/it/nl subset, but I'm not excited about the additional memory usage of larger ngram subsets and I only need English language detection.

$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

$eld = new LanguageDetector('ngramsM60-6.5ijqhj4oecso0kwcok4k4kgoscwg80o.php');
$eld->cleanText(true);
$eld->detect($content);

Nitotm\Eld\LanguageResult {#8965
  +language: "it",
  +scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "it",
  scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  isReliable(): false,
}

nitotm · 2023-11-21T18:56:04Z

The isReliable() function can definitely be improved.
I would say that, as it stands, ELD is not a good tool for deciding whether a string is in a specific language when it is used with a single-language subset.

The main problem is that when ELD finds only one language in a string, it scores it very high. Some adjustment for this should be added for very small subsets or subsets with only one language.

I am currently finishing ver. 3.0.0, with a new scoring system, which will most likely not solve this issue, but I will leave this "issue" open to tackle the problem next, on top of ver. 3.0.

P.S. Since I cannot provide you with a quick fix: if you still want to try ELD for this English-only scenario, you would need to create your own benchmark of English and non-English strings, then increase the 'en' => 0.0378 value in resources/avgScore.php and either calculate or guess the optimal value. Before that, it might also help in this scenario to decrease $relevancy = 27; in src/LanguageDetector.php to somewhere in 8–20, but I'm just guessing here.
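
A minimal sketch of the benchmark idea, under stated assumptions: findBestCutoff() is a hypothetical helper, not part of ELD, and it sweeps a plain score cutoff rather than editing resources/avgScore.php directly. It only shows how a labelled set of English and non-English strings could be used to compare candidate values. The scorer is passed in as a callable so the block runs without ELD installed.

```php
<?php
// Hypothetical helper, not part of ELD: sweep candidate cutoffs over a labelled
// benchmark and return the one that classifies the most samples correctly.
// $samples is a list of [text, isEnglish] pairs.
// $scoreEnglish maps a text to its English score.
function findBestCutoff(array $samples, callable $scoreEnglish, array $candidates): array
{
    // Score every sample once so the detector is not re-run per candidate.
    $scored = [];
    foreach ($samples as [$text, $isEnglish]) {
        $scored[] = [(float) $scoreEnglish($text), (bool) $isEnglish];
    }

    $best = ['cutoff' => null, 'accuracy' => -1.0];
    foreach ($candidates as $cutoff) {
        $correct = 0;
        foreach ($scored as [$score, $isEnglish]) {
            // A sample is treated as English when its score reaches the cutoff.
            if (($score >= $cutoff) === $isEnglish) {
                $correct++;
            }
        }
        // Cast to float so an exact division does not yield an int.
        $accuracy = $scored ? $correct / (float) count($scored) : 0.0;
        if ($accuracy > $best['accuracy']) {
            $best = ['cutoff' => $cutoff, 'accuracy' => $accuracy];
        }
    }
    return $best;
}

// Assumption: with ELD loaded, the scorer could read the public scores array
// shown in the dumps above, for example:
// $scoreEnglish = fn(string $t): float => $eld->detect($t)->scores['en'] ?? 0.0;
```

The returned cutoff is only as good as the benchmark: the Latin filler text from the first comment is exactly the kind of non-English sample it should contain.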

flexchar · 2024-02-01T12:19:16Z

Keep us posted on your journey to version 3.0 @nitotm!

nitotm · 2024-02-02T01:12:56Z

It is taking me a bit longer, since I decided to integrate all the changes that were long planned, plus new suggestions, and I keep finding things to improve.

The improvements in accuracy are great, and so are those in efficiency. However, since I have added more steps and a bigger database, I'm not sure whether it will end up being faster, but the additions are worth it.
