Language detection default for 'unknown' language #86

martinreynaert · 2022-03-29T16:30:34Z

Hi,

We have thousands of article abstracts. Many are in mixed languages, the actual languages present are unknown, and there are no delimiters between the segments of an abstract that are in different languages.

So we want to recognize the languages at the sentence level. This requires us to use Ucto for prior sentence splitting.

So we use Ucto's 'detectlanguages' parameter with a limited set of the most likely languages:

--detectlanguages=<lang1,lang2,..langn> - try to assign a language to each line of text input. Default = 'lang1'

As we primarily want to recognize and retain the sentences in English, we are not helped by Ucto's behaviour of labeling everything it does not know as our first language, i.e. 'eng'. That defeats our purpose. We would be much obliged if you could also equip Ucto with the FoLiA-langcat parameter:

--lang=<lan> use 'lan' for unindentified text. (default 'nld')

In fact, I would not at all mind if the default were set to 'unk' (for: unknown).

After all, it is rather disconcerting to see a sentence written in a non-Latin script, such as the Thai one here, labeled 'eng'.

  `<s xml:id="JSTOR.music.01446.p.1.s.6">
    <t>บทความนี้ใช้ภาพสเปกโทรกราฟ (spectrographic images) เพื่อพินิจ พิเคราะห์คุณภาพของเสียง (timbre) ในการขับร้องเพลงไทยเดิม โดยมุ่ง ศึกษาการเอื้อน ซึ่งเป็นการเปล่งเสียงดําเนินทํานองโดยใช้เสียงสระเเละพยัญชนะ เฉพาะกลุ่ม สเปกโทรกราฟคือภาพจากคอมพิวเตอร์ทิ่เเสดงฟันดาเมนท้ล (fundamental) โอเวอร์โทน (overtone) เเละเสียงต่างๆ ขคงการเเสดง ดนตรี เสียงเอื้อนเเบ่ง ได้โดยรวมเป็นห้าเสียงพื้นฐาน ผู้วิจัยได้วิเคราะห์เสียงเอื้อน พื้นฐานดังกล่าว ทั้งโดยเเยกศึกษาเฉพาะเสียง เเละศึกษาเสียงเอื้อนในบริบทดนตรี อีกทังยังศึกษาเปรียบเทียบคุณภาพของเสียงจากการขับร้องเพลงไทยหลายทางต่าง ยุคสมัย ผุ้วิจัยเสนอเเนะว่า ความสอดคล้องในคุณภาพของห้าเสียงเอื้อนพื้นฐานอัน เป็นมรดกตกทอดมาหลายรุ่น ผนวกกับลักษณะเฉพาะของการเเสดงขับร้องนีเอง ที่ เป็นตัวกําหนดเสียงที่โดดเด่นเป็นเคกลักษณ์ของการขับร้องเพลงไทยเดิม</t>
    <lang class="eng"/>`
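
As an aside, mislabelled sentences like this one can be spotted after the fact with a rough script check. The following is only a minimal Python sketch, not part of Ucto or FoLiA-langcat: the function names are hypothetical, and the script is approximated by the first word of each letter's Unicode character name, which is a heuristic rather than a proper Unicode Script property lookup.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters of `text`.

    The script is approximated by the first word of the Unicode name of
    each alphabetic character (e.g. "LATIN", "THAI"). Combining marks and
    punctuation are ignored. Returns "NONE" if there are no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return "NONE"
    return counts.most_common(1)[0][0]


def plausible_for_latin_language(text: str) -> bool:
    """True if most letters in `text` are Latin-script letters."""
    return dominant_script(text) == "LATIN"
```

A sentence labeled 'eng' for which `plausible_for_latin_language` returns False would be a candidate for manual inspection.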

Thank you!


kosloot · 2022-03-30T12:12:37Z

Ok, this sounds like a feasible request.
I will have to dive into this to find a solution.

proycon · 2022-04-04T11:07:38Z

unk is an existing ISO 639-3 code (https://iso639-3.sil.org/code/unk), so we shouldn't abuse it for something else. If a language can't be identified (or not with enough confidence), it would be better simply not to output the <lang> element at all.

kosloot · 2022-04-04T11:12:36Z

I am aware of the unk code, but fortunately there is also an und code we could use, which is exactly what I am heading towards.

proycon · 2022-04-04T11:14:32Z

Ha, right, I was already wondering if there was something like that. Good idea.

martinreynaert · 2022-04-04T12:33:13Z

Sounds good!

kosloot · 2022-04-04T14:42:23Z

Ok, I added code to handle 'und' languages.
When 'und' is added to the --detectlanguages option, the 'default' language will be 'undefined'; the sentences labeled that way will remain untokenized and are added 'as is' to the FoLiA output.
@martinreynaert please test and comment.
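
To illustrate the behaviour described here, a minimal Python sketch follows. It is not Ucto's actual code: `label_sentences` and `toy_detect` are hypothetical names, the detector is a stand-in for a real language guesser, and `str.split` stands in for real tokenization. The assumption is that the first listed language is the default unless 'und' is among the candidates.

```python
from typing import Callable, List, Optional, Tuple

UNDETERMINED = "und"  # ISO 639-3 code for an undetermined language


def toy_detect(text: str) -> Optional[str]:
    """Stand-in detector (assumption): pure ASCII text counts as English."""
    return "eng" if text.isascii() else None


def label_sentences(
    sentences: List[str],
    candidates: List[str],
    detect: Callable[[str], Optional[str]],
    tokenize: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Label each sentence and tokenize it only if a real language was found."""
    # Without 'und' in the list, anything unrecognized falls back to the
    # first candidate; with 'und', it falls back to 'und' instead.
    default = UNDETERMINED if UNDETERMINED in candidates else candidates[0]
    labelled = []
    for sentence in sentences:
        guess = detect(sentence)
        known = guess in candidates and guess != UNDETERMINED
        lang = guess if known else default
        if lang == UNDETERMINED:
            labelled.append((lang, [sentence]))  # kept 'as is', untokenized
        else:
            labelled.append((lang, tokenize(sentence)))
    return labelled
```

With `["eng", "nld", "und"]` as candidates, a Thai sentence ends up labeled 'und' and is passed through unchanged; with `["eng", "nld"]` it would be labeled 'eng', which is the behaviour this issue set out to avoid.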

martinreynaert · 2022-04-08T10:37:34Z

Thank you!

I have tested 'und' in Ucto's language detection mode. Results appear much more reliable than before! See this remarkable example: JSTOR.music.00656.p.1.s.7

I had 1477 input files for testing. However, 25 of these gave empty output and a message on stderr. I attach them for your convenience.
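
Such empty results can be listed with a few lines of Python. This is just a sketch of one way to do it: the function name and the `*.xml` pattern are assumptions, and "empty" is taken to mean a zero-byte file.

```python
from pathlib import Path
from typing import List


def empty_outputs(output_dir: Path, pattern: str = "*.xml") -> List[Path]:
    """Return the files in `output_dir` matching `pattern` that are zero bytes long."""
    return sorted(
        p for p in output_dir.glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )
```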

I saw at least one where there's only non-Latin script (file: JSTOR.music.01437). That text I think should nevertheless be incorporated in FoLiA. I have yet to check what happens when there is mixed Latin and non-Latin text.

Another case is less clear: the text seems to be just plain English to me, but Ucto complains "ucto: ucto: conflicting language(s) assigned" and returns an empty file (see: JSTOR.music.00072 and 17 more files).

UCTO.FailLangDetect.20220407.MRE.tar.gz

Again: thanks! Ucto has already been greatly improved for our purposes!

kosloot · 2022-04-11T10:46:42Z

I added some code to avoid the ucto: ucto: conflicting language(s) message.
Beware that this stems from an unsolvable "chicken and egg" problem:
to detect languages, we need to detect sentence bounds, which requires tokenization, but to tokenize, we need to know the language.

At the moment we guess some sentence bounds, use the detected fragments to detect the language, and then tokenize the longest utterance within the same language. This works quite well, but not always, as libtextcat sometimes makes strange decisions.

example:
Educated at the famous monastery in St. Gallen, he went as a wandering student in search of learning as'

This is first split at the '.' in St. The first part, "Educated at the famous monastery in St.", is then detected as English,
but the second part, "Gallen, he went as a wandering student in search of learning as'", is somehow detected as Dutch.
As a consequence, this utterance will be tokenized as 2 sentences.

Had both parts been detected as English, it would correctly have been treated as 1 sentence.

This problem is NOT resolvable by Ucto.
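
The effect can be reproduced with a small Python sketch of the guess, detect and merge steps. This is only one reading of the procedure described above, not Ucto's implementation: the naive splitter, the merge function and both stub detectors are hypothetical, and libtextcat is not called at all.

```python
from typing import Callable, List, Tuple


def rough_split(text: str) -> List[str]:
    """Guess sentence bounds: cut after '.', '!' or '?' followed by whitespace."""
    fragments, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and (i + 1 == len(text) or text[i + 1].isspace()):
            fragments.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        fragments.append(tail)
    return fragments


def merge_same_language(
    fragments: List[str], detect: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Join adjacent fragments that the detector assigns to the same language."""
    merged: List[Tuple[str, str]] = []
    for fragment in fragments:
        lang = detect(fragment)
        if merged and merged[-1][0] == lang:
            merged[-1] = (lang, merged[-1][1] + " " + fragment)
        else:
            merged.append((lang, fragment))
    return merged


def always_english(fragment: str) -> str:
    """Stub detector that gets the example right."""
    return "eng"


def flaky(fragment: str) -> str:
    """Stub detector that mimics the misdetection in the example."""
    return "nld" if fragment.startswith("Gallen") else "eng"


TEXT = ("Educated at the famous monastery in St. Gallen, he went as a "
        "wandering student in search of learning as")
```

With `flaky`, `merge_same_language(rough_split(TEXT), flaky)` yields two utterances (one 'eng', one 'nld'), so the text ends up as 2 sentences. With `always_english`, the two fragments are joined back into a single utterance, which a tokenizer that knows the abbreviation "St." could then treat as 1 sentence.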

martinreynaert · 2022-04-25T19:21:34Z

Thank you kosloot!

Also for the further explanations about how this works.

We have now run this on several thousand files: not a single one failed.

I consider this matter closed.

martinreynaert assigned proycon and kosloot Mar 29, 2022

kosloot added a commit that referenced this issue Apr 1, 2022

start working on issue #86

697d0fc

martinreynaert closed this as completed Apr 25, 2022
