NCI: words sometimes split into small pieces #17
Labels: enhancement (new feature or request), idea (future work), NCI (processing the New Corpus of Ireland), question (further information is requested)
Issue #4 reports cases of words split into small pieces, e.g. `T UA R A SC Á I L B H L I A N TÚ I L A N O M B UD S MA N 1 9 9 7` (presumably "Tuarascáil Bhliantúil an Ombudsman 1997").
How frequent is this issue? Are there any tools we could use to automatically detect and fix such cases?
One idea for detecting such errors: scan a window of, say, 5 tokens for a surge in the OOV rate that does not coincide with a high rate of unknown character n-grams once all spaces in the window are removed. A run of short OOV tokens that joins into familiar character sequences is a strong hint that a single word was split apart.
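A minimal sketch of that detector, assuming we already have a word vocabulary and a set of known character n-grams from clean text (all names, thresholds, and the trigram order are illustrative, not from the issue):

```python
def find_split_spans(tokens, vocab, char_ngrams, window=5,
                     oov_thresh=0.8, ngram_thresh=0.5, n=3):
    """Flag windows where most tokens are OOV but, with spaces removed,
    the character n-grams still look familiar -- suggesting a split word."""
    spans = []
    for i in range(len(tokens) - window + 1):
        win = tokens[i:i + window]
        # share of window tokens absent from the word vocabulary
        oov_rate = sum(t.lower() not in vocab for t in win) / window
        # join the window and measure how alien its character n-grams are
        joined = "".join(win).lower()
        grams = [joined[j:j + n] for j in range(len(joined) - n + 1)]
        if not grams:
            continue
        unknown_rate = sum(g not in char_ngrams for g in grams) / len(grams)
        if oov_rate >= oov_thresh and unknown_rate <= ngram_thresh:
            spans.append((i, i + window))
    return spans
```

For example, the pieces of "Tuarascáil" are all OOV, but once joined they produce only trigrams seen in clean Irish text, so the window is flagged; overlapping flagged windows could then be merged into candidate repair spans.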
One idea for fixing the errors: synthesise a parallel corpus by automatically inserting this error into clean text, pairing each corrupted sentence with its original, and then train a model on these pairs to reverse the corruption.
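The corruption step of that synthesis could be sketched as follows (the split probabilities and function name are illustrative assumptions; the issue does not specify them):

```python
import random

def corrupt(sentence, p_split=0.3, seed=None):
    """Randomly break words into pieces by inserting spaces,
    mimicking the splitting error, to build (corrupt, clean) pairs."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if len(word) > 3 and rng.random() < p_split:
            # break after each character with 50% probability
            pieces, cur = [], ""
            for ch in word:
                cur += ch
                if rng.random() < 0.5:
                    pieces.append(cur)
                    cur = ""
            if cur:
                pieces.append(cur)
            out.append(" ".join(pieces))
        else:
            out.append(word)
    return " ".join(out)
```

Running `corrupt` over clean sentences yields training pairs such as `(corrupt(s, seed=i), s)`; by construction the corruption only inserts spaces, so the clean side is always recoverable by the model's target.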