NCI: words sometimes split into small pieces #17

Open
jowagner opened this issue Nov 12, 2020 · 0 comments
Labels: enhancement (New feature or request), idea (Future work idea), NCI (Processing the New Corpus of Ireland), question (Further information is requested)

jowagner commented Nov 12, 2020

Issue #4 reports: There are cases of words split into small pieces, e.g. T UA R A SC Á I L B H L I A N TÚ I L A N O M B UD S MA N 1 9 9 7.

How frequent is this issue? Are there any tools we could use to automatically detect and fix such cases?

One idea for detecting the errors: slide a window of, say, 5 tokens over the text and look for a surge in the out-of-vocabulary (OOV) rate that is not accompanied by a high rate of unknown character n-grams once all spaces in the window are removed.
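
A minimal sketch of this detection idea, under several assumptions: the vocabulary and the inventory of known character n-grams are built from clean text, n-grams are collected across word boundaries, and the thresholds (`min_oov`, `max_unknown_ngrams`) are arbitrary placeholders that would need tuning on NCI data. All function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_set(words: Iterable[str], n: int = 3) -> Set[str]:
    """Collect character n-grams from clean text with all spaces removed."""
    joined = "".join(w.lower() for w in words)
    return set(char_ngrams(joined, n))


def find_split_windows(
    tokens: List[str],
    vocab: Set[str],
    known_ngrams: Set[str],
    window: int = 5,
    n: int = 3,
    min_oov: float = 0.8,             # assumed threshold, not tuned
    max_unknown_ngrams: float = 0.2,  # assumed threshold, not tuned
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that look like split words.

    A window is flagged when most tokens are OOV but the text with
    spaces removed consists mostly of known character n-grams.
    """
    hits = []
    for start in range(len(tokens) - window + 1):
        chunk = [t.lower() for t in tokens[start:start + window]]
        oov_rate = sum(t not in vocab for t in chunk) / window
        grams = char_ngrams("".join(chunk), n)
        if not grams:
            continue  # window too short to form a single n-gram
        unknown_rate = sum(g not in known_ngrams for g in grams) / len(grams)
        if oov_rate >= min_oov and unknown_rate <= max_unknown_ngrams:
            hits.append((start, start + window))
    return hits


# Toy usage; a real run would use a large clean corpus for both resources.
clean = ["tuarascáil", "bhliantúil", "an", "ombudsman"]
vocab = set(clean)
known = build_ngram_set(clean)
suspect = "T UA R A SC Á I L B H L I A N TÚ I L".split()
spans = find_split_windows(suspect, vocab, known)
```

Overlapping flagged spans could be merged afterwards to get the full extent of each damaged region; counting them would also give a rough answer to the frequency question.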

One idea for fixing the errors: synthesise a parallel corpus that pairs the original text with a copy into which this error has been inserted automatically, and then train

  • an MT system to translate from a mixture of split words and normal words to normal words
  • a sequence tagger to tag each space with whether it should be removed
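
The corpus synthesis could be sketched as follows. This is only an illustration: it assumes clean sentences with tokens separated by single spaces, splits words into random pieces of 1 or 2 characters (the real error distribution in the NCI is unknown), and uses hypothetical function and label names. Each (noisy, clean) pair could serve as MT training data, and the per-space labels as training data for the tagger.

```python
import random
from typing import List, Tuple


def split_word(word: str, rng: random.Random, max_piece: int = 2) -> List[str]:
    """Cut `word` into consecutive pieces of random length 1..max_piece."""
    pieces = []
    i = 0
    while i < len(word):
        size = rng.randint(1, max_piece)
        pieces.append(word[i:i + size])
        i += size
    return pieces


def corrupt(
    sentence: str, rng: random.Random, p_split: float = 0.5
) -> Tuple[str, List[str]]:
    """Insert the split-word error; return noisy text and one label per space."""
    noisy_tokens: List[str] = []
    labels: List[str] = []
    for word in sentence.split():
        pieces = split_word(word, rng) if rng.random() < p_split else [word]
        if noisy_tokens:
            labels.append("KEEP")  # space at an original word boundary
        labels.extend(["REMOVE"] * (len(pieces) - 1))  # inserted spaces
        noisy_tokens.extend(pieces)
    return " ".join(noisy_tokens), labels


def apply_labels(noisy: str, labels: List[str]) -> str:
    """Undo the corruption given a KEEP/REMOVE decision for each space."""
    tokens = noisy.split(" ")
    out = tokens[0]
    for tok, lab in zip(tokens[1:], labels):
        out += (" " if lab == "KEEP" else "") + tok
    return out


clean = "Tuarascáil Bhliantúil an Ombudsman 1997"
noisy, labels = corrupt(clean, random.Random(0), p_split=1.0)
restored = apply_labels(noisy, labels)
```

With gold labels, `apply_labels` restores the clean sentence exactly, so the same function could be reused to apply a trained tagger's predictions.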

jowagner added the enhancement, idea, NCI and question labels on Nov 12, 2020