Question: possible to retrieve untokenized sentences? #10

pirolen · 2021-10-14T20:08:45Z

May sound silly, but would it be possible to add a method for retrieving sentences from the tokenizer without the inserted whitespace around punctuation marks (i.e. untokenized)? For example, it could return a tuple holding two versions of a sentence: the tokenized one and the original.

It is practical to keep the untokenized sentence in some scenarios (e.g. showing them to end users), and reconstructing it by script would be rather hacky and imprecise I guess.

proycon · 2021-10-14T20:39:02Z

Not a bad idea at all. The information is available inside ucto after all (and propagated to the FoLiA output), so we could do something similar for the Python binding.
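
To make the request concrete, here is a minimal sketch of the idea, not the binding's actual API. It assumes each token carries a flag saying whether whitespace followed it in the input; the `Token` type, its `nospace` and `end_of_sentence` fields, and the `sentence_pairs` helper are hypothetical names used only for illustration. Note that this rebuilds spacing with single spaces, so it approximates the original text rather than reproducing it byte for byte.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical token record, not a python-ucto class."""
    text: str
    nospace: bool                  # True if no whitespace followed this token in the input
    end_of_sentence: bool = False  # True if this token closes a sentence


def _render(sentence: List[Token]) -> Tuple[str, str]:
    # Tokenized form: every token separated by a single space.
    tokenized = " ".join(t.text for t in sentence)
    # Untokenized form: only add a space where the input had whitespace.
    parts: List[str] = []
    for t in sentence:
        parts.append(t.text)
        if not t.nospace:
            parts.append(" ")
    return tokenized, "".join(parts).rstrip()


def sentence_pairs(tokens: Iterable[Token]) -> Iterator[Tuple[str, str]]:
    """Yield (tokenized, untokenized) pairs, one per sentence."""
    current: List[Token] = []
    for tok in tokens:
        current.append(tok)
        if tok.end_of_sentence:
            yield _render(current)
            current = []
    if current:  # trailing tokens without an explicit sentence end
        yield _render(current)


example = [
    Token("Hello", True), Token(",", False),
    Token("world", True), Token("!", False, True),
    Token("Bye", True), Token(".", True, True),
]
pairs = list(sentence_pairs(example))
```

Returning both strings together keeps the display version next to the tokenized one, so callers do not have to re-join tokens with ad hoc punctuation rules.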

proycon self-assigned this Oct 14, 2021

proycon added the enhancement label Oct 14, 2021
