Encoding of soft word breaks / hyphenation #66

proycon · 2019-02-07T13:16:33Z

FoLiA currently cannot properly encode soft word breaks, i.e. situations where a word is visually broken apart and hyphenated in the original text. Currently we see FoLiA's <br/> element (as a text markup element inside <t>) used in these situations, often with a hyphen or another symbol (perhaps an artifact of an OCR system):

<t> ... ook genoemd Tos-<br/>kaansche </t>

<t>...verzoek van den werk¬<br/>nemer...</t>

This, however, represents an explicit break and effectively splits a word into two tokens, which is semantically wrong when a soft break is intended and leads to all kinds of problems in further linguistic processing. FoLiA is first and foremost concerned with accurate representation of the text and of linguistic units; presentational representation comes second. We see this situation deteriorate in practice: sometimes a word even gets split across paragraphs, which is wrong in all cases.

We may want to introduce a new element (<t-hbr/>?) to explicitly encode a hyphenation break (without a preceding hyphen symbol; it would be implied!), which most linguistic processing tools, especially tokenisers, can then simply ignore. Example:

<t>...verzoek van den werk<t-hbr/>nemer...</t>

Note that this is different from HTML's <wbr> and LaTeX's \hyp, which represent an opportunity for a word break (there is probably also a Unicode code point for this) rather than the fact that there actually was a word break/hyphenation. We're not so interested in representing those in FoLiA.
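To illustrate why an ignorable element helps tokenisers, here is a minimal sketch in plain Python. It is not the FoLiA library API: `flatten_text` is a hypothetical helper, the fragments omit the FoLiA XML namespace, and the handling of `<br/>` as a newline is an assumption made for the comparison.

```python
import xml.etree.ElementTree as ET


def flatten_text(t_elem: ET.Element) -> str:
    """Flatten a <t> element to a string (hypothetical helper, not FoLiA API).

    <t-hbr/> contributes nothing, so the two halves of a word are rejoined.
    <br/> becomes a newline, i.e. an explicit break.
    Any other markup element is flattened recursively.
    """
    parts = [t_elem.text or ""]
    for child in t_elem:
        if child.tag == "t-hbr":
            pass  # soft break: ignore it, the word continues
        elif child.tag == "br":
            parts.append("\n")  # explicit break: acts as whitespace
        else:
            parts.append(flatten_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Proposed encoding: the hyphen is implied by <t-hbr/>.
soft = ET.fromstring("<t>verzoek van den werk<t-hbr/>nemer</t>")
# Encoding seen in practice: a symbol plus an explicit <br/>.
hard = ET.fromstring("<t>verzoek van den werk¬<br/>nemer</t>")

# A naive whitespace tokeniser stands in for a real one here.
print(flatten_text(soft).split())  # ['verzoek', 'van', 'den', 'werknemer']
print(flatten_text(hard).split())  # ['verzoek', 'van', 'den', 'werk¬', 'nemer']
```

With `<t-hbr/>` the word survives as one token; with `¬<br/>` it is split in two and the artifact symbol ends up in the text.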


kosloot · 2019-02-07T13:44:51Z

Well, I would prefer ditching soft hyphens altogether, as they have no 'real meaning'.
If you really want to keep ALL the formatting from the original text, then a solution like this is acceptable, and preferable to adding symbols like ¬ or <br/>, which become part of the text.

proycon · 2019-02-07T14:12:12Z

Yes, this is of course only for scenarios where one really wants to encode soft breaks; nobody is required to do so.

proycon · 2019-02-20T11:48:37Z

This is now implemented (for the upcoming FoLiA 2.0), documentation: https://folia.readthedocs.io/en/latest/hyphenation_annotation.html

proycon added enhancement question labels Feb 7, 2019

proycon assigned proycon, martinreynaert and kosloot Feb 7, 2019

proycon added this to the v2.0 milestone Feb 13, 2019

proycon added the to do staged to be worked on label Feb 13, 2019

proycon added a commit that referenced this issue Feb 20, 2019

Added hyphenation annotation (t-hbr) #66

5e0a419

proycon added in progress and removed to do staged to be worked on labels Feb 20, 2019

proycon added a commit to proycon/foliapy that referenced this issue Feb 20, 2019

Adding Hyphbreak class (proycon/folia#66)

c90e952

proycon added ready Implemented but not released yet and removed in progress labels Feb 20, 2019

proycon closed this as completed Mar 13, 2019

proycon mentioned this issue Mar 18, 2019

Implement Hyphbreak LanguageMachines/libfolia#30

Closed
