Encoding of soft word breaks / hyphenation #66

Closed
proycon opened this issue Feb 7, 2019 · 3 comments
Labels: enhancement, question, ready (Implemented but not released yet)

@proycon
Owner

proycon commented Feb 7, 2019

FoLiA currently cannot properly encode soft word breaks, i.e. situations where a word is visually broken across lines and hyphenated in the original text. At present we see FoLiA's <br/> element (a text markup element inside <t>) used in these situations, often preceded by a hyphen or another symbol (perhaps an artifact of an OCR system?):

<t> ... ook genoemd Tos-<br/>kaansche </t>
<t>...verzoek van den werk¬<br/>nemer...</t>

This, however, represents an explicit break and effectively splits a word into two tokens, which is semantically wrong when a soft break is intended and leads to all kinds of problems in further linguistic processing. FoLiA is first and foremost concerned with accurate representation of the text and accurate linguistic units; presentational representation is secondary. In practice the situation gets worse: sometimes a word is even split across paragraphs, which is wrong in all cases.

We may want to introduce a new element (<t-hbr/>?) to explicitly encode a hyphenated break (without a preceding hyphen symbol; the hyphen would be implied), which most linguistic processing tools, especially tokenisers, can then simply ignore. Example:

<t>...verzoek van den werk<t-hbr/>nemer...</t>
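To illustrate why this matters for tokenisers, here is a minimal Python sketch, not how foliapy or any FoLiA tool actually handles it. It assumes namespace-free snippets like the ones above (real FoLiA documents are namespaced), and the function name text_for_tokeniser is made up for this example. It flattens a <t> element to plain text, treating the proposed <t-hbr/> as zero-width and <br/> as a real line break:

```python
import xml.etree.ElementTree as ET


def text_for_tokeniser(t_xml: str) -> str:
    """Flatten a <t> snippet to plain text (hypothetical helper).

    Assumption: no XML namespace, and only top-level markup children.
    """
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "br":
            # Explicit break: becomes whitespace, so it separates tokens.
            parts.append("\n")
        elif child.tag != "t-hbr":
            # Any other markup element: keep its text content.
            parts.append("".join(child.itertext()))
        # <t-hbr/> itself contributes nothing; the word stays whole.
        parts.append(child.tail or "")
    return "".join(parts)


soft = text_for_tokeniser("<t>verzoek van den werk<t-hbr/>nemer</t>")
hard = text_for_tokeniser("<t>verzoek van den werk¬<br/>nemer</t>")

print(soft.split())  # ['verzoek', 'van', 'den', 'werknemer']
print(hard.split())  # ['verzoek', 'van', 'den', 'werk¬', 'nemer']
```

With the soft break the word survives as one token; with <br/> plus a marker symbol it ends up as two tokens, one of them carrying the stray symbol.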

Note that this is different from HTML's <wbr> and LaTeX's \hyp, which represent an opportunity for a word break (there is probably also a Unicode code point for this) rather than the fact that a word break/hyphenation actually occurred. We're not so interested in representing those in FoLiA.

Related:

Also tagging @JessedeDoes and @kdepuydt, as this is especially prevalent in INT material.

@kosloot
Collaborator

kosloot commented Feb 7, 2019

Well, I would prefer ditching soft hyphens altogether, as they have no 'real meaning'.
If you really want to keep ALL the formatting from the original text, then a solution like this is acceptable, and preferable to adding symbols like ¬ or <br/>, which become part of the text.
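For existing data that already uses the symbol-plus-<br/> pattern, a conversion could look roughly like the sketch below. This is an assumption-laden illustration, not an existing FoLiA tool: the function name to_soft_breaks and the regular expression are made up here, and it works on the serialised text rather than on a parsed document. It also cannot tell a genuine hyphen at a line end (as in a hyphenated compound) from a hyphenation artifact, so its output would need review.

```python
import re

# Hypothetical pattern: a hyphen or "¬" immediately followed by <br/>.
MARKED_BREAK = re.compile(r"[-¬]<br\s*/>")


def to_soft_breaks(t_xml: str) -> str:
    """Replace 'symbol + <br/>' with the proposed <t-hbr/> element."""
    return MARKED_BREAK.sub("<t-hbr/>", t_xml)


print(to_soft_breaks("<t>ook genoemd Tos-<br/>kaansche</t>"))
# <t>ook genoemd Tos<t-hbr/>kaansche</t>
print(to_soft_breaks("<t>verzoek van den werk¬<br/>nemer</t>"))
# <t>verzoek van den werk<t-hbr/>nemer</t>
```

A <br/> without a preceding marker symbol is left untouched, since it may well be an intended explicit break.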

@proycon
Owner Author

proycon commented Feb 7, 2019

Yes, this is only for scenarios where one really wants to encode soft breaks; nobody is required to do so, of course.

proycon added this to the v2.0 milestone Feb 13, 2019
proycon added the "to do" (staged to be worked on) label Feb 13, 2019
proycon added the "in progress" label and removed the "to do" (staged to be worked on) label Feb 20, 2019
proycon added a commit to proycon/foliapy that referenced this issue Feb 20, 2019
proycon added the "ready" (Implemented but not released yet) label and removed the "in progress" label Feb 20, 2019
@proycon
Owner Author

proycon commented Feb 20, 2019

This is now implemented (for the upcoming FoLiA 2.0), documentation: https://folia.readthedocs.io/en/latest/hyphenation_annotation.html
