NCI: words sometimes split into small pieces #17

jowagner · 2020-11-12T11:47:59Z

Issue #4 reports: There are cases of words split into small pieces, e.g. T UA R A SC Á I L B H L I A N TÚ I L A N O M B UD S MA N 1 9 9 7.

How frequent is this issue? Are there any tools we could use to automatically detect and fix such cases?

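An idea for detecting the errors may be to scan a sliding window of, say, 5 tokens for a surge in the out-of-vocabulary (OOV) rate that does not go hand-in-hand with a high rate of unknown character n-grams once all spaces in the window are removed. A window of genuinely unknown words should fail both checks, whereas a window of split-up known words should fail only the first.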
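A rough sketch of this detection idea, not tested on the NCI data. The function names, the thresholds `min_oov` and `max_unknown`, and the choice of character trigrams are all assumptions made for illustration; `vocab` and the clean reference text would have to come from some trusted Irish word list or corpus.

```python
from typing import List, Sequence, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return all character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def known_ngrams_from(clean_text: str, n: int = 3) -> Set[str]:
    """Collect n-grams from clean text with all spaces removed (assumed reference data)."""
    return set(char_ngrams("".join(clean_text.lower().split()), n))


def find_split_windows(
    tokens: Sequence[str],
    vocab: Set[str],
    known: Set[str],
    window: int = 5,
    n: int = 3,
    min_oov: float = 0.8,      # assumed threshold, needs tuning
    max_unknown: float = 0.2,  # assumed threshold, needs tuning
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans with a high OOV rate but familiar n-grams."""
    hits = []
    for start in range(len(tokens) - window + 1):
        chunk = [t.lower() for t in tokens[start:start + window]]
        oov_rate = sum(t not in vocab for t in chunk) / window
        grams = char_ngrams("".join(chunk), n)
        if not grams:
            continue  # window too short to judge
        unknown_rate = sum(g not in known for g in grams) / len(grams)
        if oov_rate >= min_oov and unknown_rate <= max_unknown:
            hits.append((start, start + window))
    return hits


clean = "tuarascáil bhliantúil an ombudsman"
vocab = set(clean.split())
known = known_ngrams_from(clean)
tokens = "Tuarascáil Bhliantúil an O M B UD S MA N".split()
hits = find_split_windows(tokens, vocab, known)
```

Overlapping hits would still need to be merged into one suspicious region, and a binary known/unknown n-gram test is probably too crude for real data; a character language model score may work better.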

An idea for fixing the errors may be to synthesise a parallel corpus, pairing the original text with a copy in which this error has been inserted automatically, and then train:

- an MT system to translate from a mixture of split words and normal words to normal words
- a sequence tagger to tag, for each space, whether it should be removed
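As a minimal sketch of how such training data could be synthesised: the code below splits words at random and records, for each resulting piece, whether the space before it was inserted. The `KEEP`/`JOIN` label names, the `p_split` and `max_piece` parameters, and the uniform random splitting are assumptions for illustration; the real errors in the NCI may follow a different distribution.

```python
import random
from typing import List, Sequence, Tuple


def split_word(word: str, rng: random.Random, max_piece: int = 2) -> List[str]:
    """Cut a word into consecutive pieces of 1..max_piece characters."""
    pieces = []
    i = 0
    while i < len(word):
        size = rng.randint(1, max_piece)
        pieces.append(word[i:i + size])
        i += size
    return pieces


def corrupt_sentence(
    tokens: Sequence[str], rng: random.Random, p_split: float = 0.3
) -> Tuple[List[str], List[str]]:
    """Return (noisy_tokens, labels).

    labels[i] is "JOIN" if the space before noisy token i was inserted
    (and should be removed), otherwise "KEEP".
    """
    noisy, labels = [], []
    for tok in tokens:
        pieces = split_word(tok, rng) if rng.random() < p_split else [tok]
        for j, piece in enumerate(pieces):
            noisy.append(piece)
            labels.append("KEEP" if j == 0 else "JOIN")
    return noisy, labels


def restore(noisy: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Undo the corruption by removing every space labelled JOIN."""
    out: List[str] = []
    for piece, label in zip(noisy, labels):
        if label == "JOIN" and out:
            out[-1] += piece
        else:
            out.append(piece)
    return out


rng = random.Random(0)
tokens = "Tuarascáil Bhliantúil an Ombudsman 1997".split()
noisy, labels = corrupt_sentence(tokens, rng, p_split=1.0)
mt_pair = (" ".join(noisy), " ".join(tokens))  # (source, target) for an MT system
```

The same output serves both options: `mt_pair` is one source/target pair for the MT system, and `noisy` with `labels` is one training sequence for the tagger. At prediction time, `restore` applied to the tagger's output would give the repaired text.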

jowagner added the labels enhancement (New feature or request), idea (Future work idea), NCI (Processing the New Corpus of Ireland) and question (Further information is requested) Nov 12, 2020

jowagner mentioned this issue Nov 12, 2020

NCI: character replaced by space #18

Open
