Automatic Language Identification

Jos Denys edited this page Sep 7, 2021 · 8 revisions

As of v1.2, the iKnow engine supports Automatic Language Identification (ALI). Detection can be automated for all iKnow languages except Japanese. It works at document level (a "source" in IRIS terms) and at sentence level, which allows multilingual documents to be processed correctly.

ALI in indexing documents

specifying languages with the "|" separator.

Multiple languages can be specified using the pipe symbol, as the following snippet demonstrates:

import iknowpy

engine = iknowpy.iKnowEngine()
text_source = "Ceci n'est pas une pipe. This is not a paper plane."
# Restrict language detection to English and French.
engine.index(text_source, "en|fr", traces=True)
switch_count = 0
for trace in engine.m_traces:
    # Each trace has the form "key:value".
    key, value = trace.split(':', 1)
    if key == 'SwitchKnowledgebase':
        switch_count += 1

if switch_count != 2:
    raise ValueError('2 language switches expected: en:fr and fr:en!')

Python C++
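The trace-counting loop above can be generalized into a small helper that tallies every trace key at once. This is only a sketch: the helper name count_trace_keys and the sample trace lines are hypothetical and not part of iKnow. The only assumption taken from the snippet above is that each trace is a "key:value" string.

```python
from collections import Counter


def count_trace_keys(traces):
    """Tally how often each key occurs in a list of "key:value" trace strings."""
    counts = Counter()
    for trace in traces:
        # partition() splits on the first ':' only and never raises,
        # even when a trace has no ':' at all.
        key, _, _value = trace.partition(':')
        counts[key] += 1
    return counts


# Hypothetical trace lines, for illustration only (not real engine output).
sample_traces = [
    "SwitchKnowledgebase:example-value-1",
    "SomeOtherTrace:example-value-2",
    "SwitchKnowledgebase:example-value-3",
]
print(count_trace_keys(sample_traces)["SwitchKnowledgebase"])  # prints 2
```

With a real engine, the same helper could be called as count_trace_keys(engine.m_traces) after indexing with traces=True.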

specifying all languages.

All languages can be selected using the asterisk symbol. Beware: language detection is not cheap, especially when "sentence level" detection is active, because every sentence is then matched against all loaded languages to find the best fit.

engine.index(text_source, "*", traces=True)

Python
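Both forms of the language argument (a pipe-separated list and the asterisk) can be produced by one small helper. The function build_language_spec below is a hypothetical convenience wrapper, not part of iKnow. It only builds the string that is then passed as the second argument to engine.index.

```python
def build_language_spec(languages=None):
    """Build the language argument for engine.index().

    Returns "*" (all languages) when no languages are given, otherwise the
    language codes joined with the "|" separator.
    """
    if not languages:
        return "*"
    # dict.fromkeys() removes duplicates while keeping the original order.
    return "|".join(dict.fromkeys(languages))


print(build_language_spec(["en", "fr"]))  # prints en|fr
print(build_language_spec())              # prints *
```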

document versus sentence level.

As in IRIS, language detection defaults to sentence level. Standalone iKnow also supports detection at document level, where only the first sentence is examined to find the best-matching language. This is a significant performance gain. To switch to document level, pass the fourth parameter detect_language_at="document"; if it is omitted, the default detect_language_at="sentence" applies:

text_source = "Ceci n'est pas une pipe. This is not a paper plane."
# Detect the language once per document, from the first sentence only.
engine.index(text_source, "en|fr", traces=True, detect_language_at="document")
switch_count = 0
for trace in engine.m_traces:
    # Each trace has the form "key:value".
    key, value = trace.split(':', 1)
    if key == 'SwitchKnowledgebase':
        switch_count += 1

if switch_count != 1:
    raise ValueError('Only ONE language switch expected!')

Python C++
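To give a feel for the performance difference, the sketch below counts language match operations under a deliberately simple model derived from the description above: sentence-level detection matches every sentence against every loaded language, while document-level detection matches only the first sentence. The function name and the model itself are assumptions for illustration. This is not a measurement of the engine.

```python
def estimated_language_matches(sentence_count, language_count,
                               detect_language_at="sentence"):
    """Rough count of sentence-versus-language matches (simplified model)."""
    if detect_language_at == "sentence":
        examined_sentences = sentence_count
    elif detect_language_at == "document":
        # Only the first sentence is examined (if there is one).
        examined_sentences = min(sentence_count, 1)
    else:
        raise ValueError('detect_language_at must be "sentence" or "document"')
    return examined_sentences * language_count


# A 200-sentence document with 10 candidate languages:
print(estimated_language_matches(200, 10))              # prints 2000
print(estimated_language_matches(200, 10, "document"))  # prints 10
```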

ALI for document inspection

ALI can also be used without indexing, to detect the language of a piece of text. The method returns a tuple:

  • language_identified = engine.IdentifyLanguage(text_source)
  • language_identified[0] : the language code.
  • language_identified[1] : the certainty (on a scale of 0 to 1).
text_source = "Be the change you want to see in life."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "en":
   raise ValueError(f'Language "en" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Oder die Erkundung der Natur - und zwar ohne Anleitung."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "de":
   raise ValueError(f'Language "de" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Микротерминатор может развивать скорость до 30 сантиметров за секунду, пишут калининградские СМИ."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "ru":
   raise ValueError(f'Language "ru" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "En Argentina no hay estudios previos reportados en cuanto a la elaboración de vinos cítricos ni de «vino de naranja»."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "es":
   raise ValueError(f'Language "es" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "En pratique comment le faire ?"
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "fr":
   raise ValueError(f'Language "fr" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

# Japanese is excluded from ALI, so an empty language code is expected here.
text_source = "こんな台本でプロットされては困る、と先生は言った。"
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "":
   raise ValueError(f'Unexpected language {language_identified[0]!r} identified for Japanese text.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Op basis van de afzonderlijke evaluatieverslagen stelt de Commissie een synthese op communautair niveau op."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "nl":
   raise ValueError(f'Language "nl" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Distingue-se o mercado de um produto ou serviço dos mercados de fatores de produção, capital e trabalho."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "pt":
   raise ValueError(f'Language "pt" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Jag är bäst i klassen. Ingen gör efter mig, kan jag inte lämna. Var försiktig, är gräset alltid grönare på andra sidan."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "sv":
   raise ValueError(f'Language "sv" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "грошових зобов'язань, прийнятих на себе згідно з умов цього договору."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "uk":
   raise ValueError(f'Language "uk" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

text_source = "Létající jaguár je novela spisovatele Josefa Formánka z roku 2004."
language_identified = engine.IdentifyLanguage(text_source)
if language_identified[0] != "cs":
   raise ValueError(f'Language "cs" is not recognised.')
else:
   print(f'Language {language_identified[0]!r} detected with certainty {language_identified[1]!r}.')

Python C++
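The repeated checks above all follow the same pattern, so they could be condensed into a table-driven loop. The sketch below is one possible way to do that. The names find_mismatches, SAMPLES and stub_identify are hypothetical, and the stub (with made-up certainty values) only stands in for engine.IdentifyLanguage so that the example runs without the engine installed.

```python
# (expected language code, sample text) pairs taken from the examples above.
# Japanese is expected to yield an empty code.
SAMPLES = [
    ("en", "Be the change you want to see in life."),
    ("fr", "En pratique comment le faire ?"),
    ("", "こんな台本でプロットされては困る、と先生は言った。"),
]


def find_mismatches(identify, samples):
    """Return (expected, detected, certainty) for every sample that
    `identify` labels differently than expected."""
    mismatches = []
    for expected, text in samples:
        detected, certainty = identify(text)
        if detected != expected:
            mismatches.append((expected, detected, certainty))
    return mismatches


def stub_identify(text):
    """Hypothetical stand-in for engine.IdentifyLanguage.

    Looks the text up in SAMPLES; the certainty values are made up.
    """
    canned = {sample: (code, 1.0 if code else 0.0) for code, sample in SAMPLES}
    return canned[text]


print(find_mismatches(stub_identify, SAMPLES))  # prints []
```

With a real engine, the stub would be replaced by the identification method itself, for example find_mismatches(engine.IdentifyLanguage, SAMPLES), assuming it returns the (language code, certainty) tuple described above.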