Question: possible to retrieve untokenized sentences? #10

Open
pirolen opened this issue Oct 14, 2021 · 1 comment

pirolen commented Oct 14, 2021

This may sound silly, but would it be possible to add a method that retrieves sentences from the tokenizer in their untokenized form, i.e. without the whitespace inserted around punctuation marks? For example, it could return a tuple holding two versions of each sentence: the tokenized one and the original.

Keeping the untokenized sentence is practical in some scenarios (e.g. showing it to end users), and reconstructing it with a script would, I guess, be rather hacky and imprecise.

proycon commented Oct 14, 2021

Not a bad idea at all. The information is available inside ucto after all (and propagated to the FoLiA output), so we could do something similar for the Python binding.
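To make the idea concrete, here is a rough sketch of how a (tokenized, untokenized) pair per sentence might be built once each token records whether whitespace followed it in the input. The `Token` class, its `nospace` and `endofsentence` fields, and the `sentence_pairs` helper are all hypothetical stand-ins for illustration; this is not the ucto or python-ucto API, just one way such a feature could behave.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a tokenizer token (NOT the ucto API)."""
    text: str
    nospace: bool = False        # True if no whitespace followed this token in the input
    endofsentence: bool = False  # True if this token closes a sentence


def _render(sentence: List[Token]) -> Tuple[str, str]:
    """Build the tokenized and the untokenized string for one sentence."""
    # Tokenized form: every token separated by a single space.
    tokenized = " ".join(t.text for t in sentence)
    # Untokenized form: add a space only where the input had whitespace.
    parts: List[str] = []
    for i, t in enumerate(sentence):
        parts.append(t.text)
        if not t.nospace and i < len(sentence) - 1:
            parts.append(" ")
    return tokenized, "".join(parts)


def sentence_pairs(tokens: Iterable[Token]) -> Iterator[Tuple[str, str]]:
    """Yield one (tokenized, untokenized) pair per sentence."""
    current: List[Token] = []
    for token in tokens:
        current.append(token)
        if token.endofsentence:
            yield _render(current)
            current = []
    if current:  # trailing tokens without an explicit sentence end
        yield _render(current)


if __name__ == "__main__":
    tokens = [
        Token("Hello", nospace=True),
        Token(","),
        Token("world", nospace=True),
        Token("!", endofsentence=True),
    ]
    for tokenized, original in sentence_pairs(tokens):
        print(tokenized)
        print(original)
```

Note that this sketch collapses any run of whitespace to a single space, so it only approximates the original text; an exact reconstruction would need the original character offsets.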

proycon self-assigned this on Oct 14, 2021