Problems with leading/trailing whitespace in text content #88

Closed
proycon opened this issue Dec 8, 2020 · 32 comments
Labels: bug, ready (Implemented but not released yet)

@proycon
Owner

proycon commented Dec 8, 2020

We had an extensive earlier discussion on this in #34, but an issue popped up.

foliatextcontent produces FoLiA like the following:

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t>
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <t class="OCR">
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

folialint stumbles on this with a text consistency problem:

ticcl_output/OllevierGeets.ticcl.folia.xml failed: Unresolvable text: Text for str(ID=FH-OllevierGeets-1.tif.text.par_1_21.word_1_233, textclass='current'), has incorrect offset 0
        original msg=Unresolvable text: Reference (ID FH-OllevierGeets-1.tif.text.par_1_21,class='current') found, but no text match at offset=0 Expected 'INTRODUCTION' but got '
        INT'

Because of the newline and indentation, the offset is considered wrong, as the text is assumed to be "\n\s\s\s\s\s\s\s\sINTRODUCTION".

foliavalidator stumbles over something identical but later on (different order of evaluation perhaps?):

TEXT VALIDATION ERROR: Text for String, ID FH-OllevierGeets-4.tif.text.par_1_36.word_1_708, textclass OCR, has incorrect offset 0 or invalid reference
VALIDATION ERROR on full parse by library (stage 2/3), in OllevierGeets.ticcl.folia.xml
UnresolvableTextContent: Reference (ID FH-OllevierGeets-4.tif.text.par_1_36, class=OCR) found but no text match at specified offset (0)! Expected 'DISCUSSION', got '
        D'

Offsets do not involve any kind of whitespace normalization by default, as addressed in #34. Take a text like:

<s>
    <t>This is
         a sentence</t>
</s>

This really means "This is\n\s\s\s\s\s\s\s\s\sa sentence" and not "This is a sentence".
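As a minimal sketch of why that is, here is the same fragment read with Python's standard xml.etree.ElementTree (a generic XML parser, not foliapy or libfolia; FoLiA namespaces are left out for brevity). The parser hands back the character data of <t> verbatim, newline and indentation included:

```python
import xml.etree.ElementTree as ET

# The <s>/<t> fragment from above; the indentation width is illustrative.
FRAGMENT = "<s>\n    <t>This is\n" + " " * 9 + "a sentence</t>\n</s>"

t = ET.fromstring(FRAGMENT).find("t")
raw = t.text  # character data exactly as it appears in the file

print(repr(raw))
print(raw == "This is a sentence")  # False: the line break and indent are part of the text
```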

But I think we should be able to strip leading and trailing whitespace from the text as a whole. The fragment below should be semantically identical to the first fragment; the fact that it turned into the fragment above is probably due to standard XML pretty-printing.

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <t class="OCR"><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

Just like we don't allow empty texts, I think we can probably strip leading and trailing spaces (=emptiness) when doing text validation and offset computation (this does not affect any intermediate spaces, not even in multiline content!).
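A minimal sketch of the proposed check, in plain Python. The helper name `resolves` and its `strip` parameter are made up for illustration and are not the foliapy or libfolia API; the point is only that stripping happens on the text as a whole, before the offset is applied:

```python
def resolves(parent_text, expected, offset, strip=False):
    """Return True if `expected` occurs in `parent_text` at `offset`.

    With strip=True, leading/trailing whitespace of the parent text is
    removed before the offset is applied; inner whitespace is untouched.
    """
    if strip:
        parent_text = parent_text.strip()
    return parent_text.startswith(expected, offset)

# Text of a pretty-printed <t> (indentation width is illustrative).
pretty = "\n" + " " * 9 + "INTRODUCTION\n" + " " * 7
multiline = "This is\n" + " " * 9 + "a sentence"

print(resolves(pretty, "INTRODUCTION", 0))              # False
print(resolves(pretty, "INTRODUCTION", 0, strip=True))  # True
print(multiline.strip() == multiline)                   # True: inner whitespace survives
```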

@martinreynaert
Collaborator

I agree. I think it is far better to normalize these. Thanks!

@pirolen

pirolen commented Dec 8, 2020 via email

proycon added a commit that referenced this issue Dec 8, 2020
@proycon
Owner Author

proycon commented Dec 8, 2020

@pirolen If you think it's a tokenisation issue then it's best to put it in https://github.com/LanguageMachines/ucto/issues . If you're referring to insertion/deletion corrections in FLAT then best to put it in https://github.com/proycon/flat/issues

@proycon proycon added this to the v2.4.1 milestone Dec 8, 2020
proycon added a commit that referenced this issue Dec 8, 2020
@kosloot
Collaborator

kosloot commented Dec 10, 2020

I tried to reproduce this problem, but folialint failed to fail.

Are you sure this wasn't already fixed on Nov 17:

commit 64218577550c6f3763dbbc75f668252fd4f3f03d
Author: Ko van der Sloot K.vanderSloot@let.ru.nl
Date: Tue Nov 17 15:38:41 2020 +0100

Fixed problem with text-conststency errors for within

Or maybe it is very related?

UPDATE:
Sorry. :{
I was able to get an error using your example: issue88.2.4.1.folia.xml

proycon added a commit to LanguageMachines/libfolia that referenced this issue Dec 10, 2020
proycon added a commit to LanguageMachines/foliatest that referenced this issue Dec 10, 2020
@proycon
Owner Author

proycon commented Dec 10, 2020

I think I have tackled this in libfolia as well now. I'll continue by testing it in the PICCL context where the issue emerged.

@proycon
Owner Author

proycon commented Mar 9, 2021

I'm afraid our problems with whitespace are not over yet. I take the example @kosloot gave in LanguageMachines/foliautils#56.

This output has been formatted this way by libxml2 itself, but this formatting is not compatible with the FoLiA assumptions we held until now:

       <t>
        <t-str xml:id="text.p.1.t-str.1">
          <t-style>deel<t-hbr/></t-style>
        </t-str>
        <t-str xml:id="text.p.1.t-str.2">
          <t-style>woord</t-style>
        </t-str>
        <t-str>extra</t-str>
      </t>

With the current rules we applied, the text representation that both foliapy and libfolia give is:

deel
        woord
        extra

Also if we simplify the example to:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style>
          <t-str>extra</t-str>
      </t>

We get that same result.

An extra bonus is that as soon as we add a space before the word extra, libxml2 serializes the whole <t> block on a single line!! That is far more in line with what we intend for FoLiA (except for the fact that the leading space would be stripped).

I don't think the text representations are good as they are, with all the indentation, and I think what we're getting now is at odds with how XML sees things. I think what we want in this case is one of two options:

  1. we want the text "deelwoordextra" (without any intermediate spaces), so stripping ALL the initial and trailing spaces outside the markup elements.
  2. The alternative interpretation is to go for the text "deel woord extra", with a single space between all the parts. This would be in line with what HTML does:
<span>
    <span>deel</span>
    <span>woord</span>
    <span>extra</span>
</span>

(see https://download.anaproy.nl/deelwoordextra.html)
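The two interpretations can be sketched in a few lines of Python with the standard xml.etree.ElementTree. This is only an illustration under assumptions: the function names `option1` and `option2` are made up, `<t-hbr/>` is given no text representation, and option 1 is read here as "drop whitespace-only runs that contain a newline", which is one possible way to pin down "stripping the spaces outside the markup elements":

```python
import re
import xml.etree.ElementTree as ET

PRETTY = """<t>
  <t-style>deel<t-hbr/></t-style>
  <t-style>woord</t-style>
  <t-str>extra</t-str>
</t>"""

def chunks(elem):
    """Yield every piece of character data below elem, in document order."""
    if elem.text:
        yield elem.text
    for child in elem:
        yield from chunks(child)
        if child.tail:
            yield child.tail

def option1(elem):
    # Drop whitespace-only chunks that contain a newline (pure indentation);
    # keep every other chunk verbatim.
    return "".join(c for c in chunks(elem) if c.strip() or "\n" not in c)

def option2(elem):
    # HTML-like: collapse every whitespace run to one space, then trim.
    return re.sub(r"\s+", " ", "".join(chunks(elem))).strip()

root = ET.fromstring(PRETTY)
print(repr(option1(root)))  # 'deelwoordextra'
print(repr(option2(root)))  # 'deel woord extra'
```

Under this reading of option 1, a plain space between two elements on the same line contains no newline and therefore survives.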

If we go for option 1, this raises the question of how we would represent a space if we do want it, say for example between woord and extra. I think the solution to that would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style> <t-str>extra</t-str>
      </t>

If we go for option 2, it raises the question of how we would represent the non-spaced scenario. The solution would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style><t-str>extra</t-str>
      </t>

I think we're currently closer to option 1 in our interpretations, but I need to investigate whether option 2 isn't the more natural XML interpretation (after all, it's what HTML does too). Whatever we choose, we have to take into account that we didn't impose this strictness before, and therefore be lenient so as not to break older files, as addressed in issue #92.

Of course, the one-line solution avoids all these problems in all cases and is the simplest, but it's apparently not what libxml2 prefers to output (pretty formatting), nor something we can expect users to adhere to:

       <t><t-style>deel<t-hbr/></t-style><t-style>woord</t-style> <t-str>extra</t-str></t>

It would be good if we had a way to normalize our FoLiA documents to force this one-line representation (as an extra tool), because it would be a valuable preprocessing step that can solve issues like proycon/foliatools#29 and make things easier for parsers that can't deal with all these complexities.
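A rough sketch of what such a normalization step could look like, using only Python's standard xml.etree.ElementTree. No such tool is implied to exist; the names `is_layout` and `flatten_text_content` are hypothetical, namespaces are ignored, and "layout" is assumed to mean whitespace-only character data containing a newline:

```python
import xml.etree.ElementTree as ET

def is_layout(s):
    """True for whitespace-only character data that contains a newline."""
    return s is not None and s.strip() == "" and "\n" in s

def flatten_text_content(t):
    """Drop indentation-only whitespace inside one <t> element, in place."""
    for node in t.iter():
        if is_layout(node.text):
            node.text = None
        # The tail of <t> itself lies outside the text content; leave it alone.
        if node is not t and is_layout(node.tail):
            node.tail = None

DOC = """<p>
  <t>
    <t-style>deel<t-hbr/></t-style>
    <t-style>woord</t-style> <t-str>extra</t-str>
  </t>
</p>"""

p = ET.fromstring(DOC)
t = p.find("t")
flatten_text_content(t)
one_line = ET.tostring(t, encoding="unicode").strip()
print(one_line)
```

The single space between woord and extra is kept because it contains no newline. A real tool would also have to decide what to do with whitespace-only content inside a leaf element, which this sketch simply drops when it contains a newline.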

@proycon proycon modified the milestones: v2.4.1, v2.5.0 Mar 9, 2021
@kosloot
Collaborator

kosloot commented Mar 9, 2021

Hmm, it truly is complex. I ponder about the <t-hbr/> in your example. Shouldn't that yield

deelwoord extra
deelwoord
extra

or

deel-
woord
extra

or such? Anyway not just a space after 'deel' I assume, but some representation of the <t-hbr/>.

@proycon
Owner Author

proycon commented Mar 9, 2021

Ah yes, possibly, I didn't consider any representation of t-hbr . I don't think we currently represent it even, do we? Let's save that for another issue :)

@kosloot
Collaborator

kosloot commented Mar 9, 2021

Well, it was the source for LanguageMachines/foliautils#56
One of the heads of this dragon

@pirolen

pirolen commented Mar 9, 2021

After tokenization with ucto, the t-hbr is gone/turned into a token boundary. In my ideal workflow, the soft break would stay recoverable (and propagatable to FLAT and folia2html), if possible at all.

proycon added a commit to LanguageMachines/libfolia that referenced this issue Mar 24, 2021
proycon added a commit to LanguageMachines/foliatest that referenced this issue Mar 24, 2021
@proycon
Owner Author

proycon commented Mar 24, 2021

A remaining issue, raised by @kosloot, is whether we should actively normalize the more exotic Unicode spaces (see https://en.wikipedia.org/wiki/Whitespace_character#Unicode) to a normal space. This is probably a good idea, but we may need to introduce an explicit <t-hspace> element in case people want to explicitly specify things like space width.
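As a sketch of what such a normalization could look like in Python: the function below maps every character in Unicode category Zs (space separators) to U+0020. Choosing Zs as the criterion is an assumption made here, not a decision from this thread; it includes the no-break space and excludes zero-width characters such as U+200B, which are in category Cf.

```python
import unicodedata

def normalize_spaces(text):
    """Replace every Unicode space separator (category Zs) by a plain space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )

sample = "a\u00a0b\u2003c\u3000d"  # NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE
print(repr(normalize_spaces(sample)))  # 'a b c d'
```

The string length is unchanged by this mapping, so offsets computed before and after normalization stay the same.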

@pirolen

pirolen commented Mar 24, 2021

Thanks!
Some more test examples from me would include superscript styling, where the superscripted characters would ideally be adjacent without whitespace to their context on the left and sometimes right, in examples 2 and 3:

<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.5">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>im wirtschaftlichen Interessenkampf gegen die Agrarpartei verwert<t-hbr/></t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.6">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>baren Schauergemälde bieten</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>6</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>, oder welche die Agrarverhältnisse</t-style>
          </t-str>
        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.1">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>Um nicht gewisse Bemerkungen über die Arbeitsverfassung im</t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.2">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>ganzen</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>1</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>) bei jedem einzelnen Bezirk wiederholen zu müssen, habe</t-style>
          </t-str>
        </t>
        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p4.t-str.1">
            <t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>1</t-style>
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>) Grundlage bleibt nach wie vor in dieser Beziehung die Schrift von v.d. Goltz,</t-style>
          </t-str>
        </t>

@proycon
Owner Author

proycon commented Mar 24, 2021

@pirolen To accomplish that in the new situation, there cannot be a newline between the two elements (so they must be on the same line). I think this is generated by FoLiA-abby, right? We'll have to make sure it produces proper FoLiA in such cases.

@pirolen

pirolen commented Mar 24, 2021

Yes, the examples come from FoLiA-abby.

@kosloot
Collaborator

kosloot commented Mar 24, 2021

We have to look into this as soon as all text issues have been resolved. At the moment it is a moving target.

proycon added a commit that referenced this issue Mar 24, 2021
proycon added a commit to proycon/foliapy that referenced this issue Mar 24, 2021
proycon added a commit to LanguageMachines/libfolia that referenced this issue Mar 24, 2021
proycon added a commit to proycon/foliapy that referenced this issue Mar 25, 2021
proycon added a commit to proycon/foliapy that referenced this issue Mar 25, 2021
@kosloot
Collaborator

kosloot commented Mar 29, 2021

Still I think we are getting into trouble anyway.
To illustrate the dilemma a simplified example:

Original text:
item1²
Possible FoLiA text: (as @pirolen would like to see it, I suppose)

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface"/>2</t-style></t-str>
</t>

When using ucto, a string will be extracted like this: item12
IMHO this is quite useless.
For further processing, we need a way to "know" that the 2 isn't part of the word item1.
Any ideas HOW to accomplish this?
Inserting a space (or newline or such) in the FoLiA is a bit harsh, but I would still prefer item1 2 over item12.

In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined, and are open, so any translation is possible.

I'm stuck here

@proycon
Owner Author

proycon commented Mar 29, 2021

I see the problem yes. Technically, following all the rules, the text serialisation item12 is correct. Inserting a space would be too harsh indeed. But I agree that from a tokenisation perspective you would indeed prefer to have item1 and 2 as different tokens. This would then indeed be a problem for the tokeniser (ucto) to tackle, but it is hard to get right and would make all kinds of assumptions we can't really make, so whatever we do would have to be an opt-in parameter I think.

> In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined, and are open, so any translation is possible.

Indeed, and in general styles don't transfer to plain text. You'd need a markup language for that (like Markdown). Properly interpreting styles in custom sets can only be done by the user. We certainly don't want text serialisation in FoLiA libraries to even attempt that.

@pirolen

pirolen commented Mar 29, 2021

Superscript and subscript are the t-style classes that would imply a token boundary, the others don't (e.g. italic, bold).
Maybe these two could be treated somewhat differently from the rest, so that they always encode a non-breaking boundary (which is not a whitespace boundary)?
I guess t-hbr does not apply here, but perhaps something like https://en.wikipedia.org/wiki/Zero-width_joiner ?

@kosloot
Collaborator

kosloot commented Mar 29, 2021

@pirolen:
Maarten and I were thinking in the same direction. Another candidate would be the
Zero-Width-space
It's up to ucto and such to interpret that as a token separator.

@proycon To make this more generic: could we extend the t-style with an attribute like separator="true", which would make text extraction insert that joiner or zero-width space?

BUT: There is also another issue, text like: ²footnote text
here the joiner/separator has to come AFTER the ². So maybe the only feasible way to do it is surrounding the text with a special symbol.
It is really tricky.

@pirolen

pirolen commented Mar 29, 2021

Would be nice if adding the special symbol around the t-style text element would solve it.

The whole phenomenon reminds me a bit of the choice of tags in sequence labeling, where one can use the prefixes I-O, B-I-O, etc. in combination with the applicable tag (like for a named entity), or simply use the name of the tag as the label. Each of the choices implicitly encodes a specific logic for the tools that ingest the labeled data (and for the humans who interpret them).

@kosloot
Collaborator

kosloot commented Mar 30, 2021

More pondering on this:
One problem with 'hidden' characters is their size. Do they count for offsets and string length?
For instance, assuming the separated attribute is implemented:

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface" separated="yes"/>2</t-style></t-str><t-str>something</t-str>
</t>

(The original text was: item1² something)

What should we do here? I assume there is no need to insert 'hidden' characters here, but rather to implement the str() extraction function so that it does 'the right thing'.
But for the fragment above, should str() render:
item1 2 something
OR
item1*2* something, where * is a ZERO-WIDTH character, as we were suggesting.

This might raise a lot of problems later on.
What is the offset of '2' in this string? 5 or 6?
And 9 or 7 for 'something'?
Same problems with the string length.
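To make the arithmetic concrete, here is a small Python check of the candidate renderings. It assumes offsets count Unicode code points (which is what Python string indices do) and uses U+200B ZERO WIDTH SPACE as the hidden character; the "plain" rendering is the superscript simply concatenated, keeping the space of the original text:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, standing in for the '*' above

renderings = {
    "plain": "item12 something",
    "spaces": "item1 2 something",
    "zero-width": f"item1{ZWSP}2{ZWSP} something",
}

# For each rendering: offset of '2', offset of 'something', total length.
offsets = {
    name: (text.index("2"), text.index("something"), len(text))
    for name, text in renderings.items()
}

for name, values in offsets.items():
    print(name, values)
# plain (5, 7, 16)
# spaces (6, 8, 17)
# zero-width (6, 9, 18)
```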

Maybe the clearest solution is to implement the 'separated' attribute, with the following semantics:
when extracting text, insert a space before AND after the styled token
(while avoiding multiple spaces).

In this way we do not break any old behavior, and don't introduce fuzzy and surprising characters.
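A sketch of those semantics in Python, using the standard xml.etree.ElementTree rather than foliapy or libfolia. The `separated` attribute is the hypothetical one from the fragment above (placed on `<feat>`, as in that example), and the function names are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = (
    '<t><t-str>item1<t-style><feat class="superscript" subset="font_typeface" '
    'separated="yes"/>2</t-style></t-str><t-str>something</t-str></t>'
)

def is_separated(elem):
    """True for a <t-style> carrying a <feat> with the hypothetical separated="yes"."""
    return elem.tag == "t-style" and any(
        feat.get("separated") == "yes" for feat in elem.findall("feat")
    )

def extract(elem):
    """Concatenate character data, padding separated styles with spaces."""
    out = elem.text or ""
    for child in elem:
        inner = extract(child)
        if is_separated(child):
            inner = " " + inner + " "
        out += inner + (child.tail or "")
    return out

def text(elem):
    # Avoid multiple spaces and strip the text as a whole.
    return re.sub(r" {2,}", " ", extract(elem)).strip()

print(text(ET.fromstring(SAMPLE)))  # item1 2 something
```

Note that collapsing runs of spaces this way would also touch double spaces that were already in the text, so a real implementation would need to be more careful there.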

@pirolen

pirolen commented Mar 30, 2021

Gut feeling:
to render the separator as space would be confusing for humans (e.g. evaluators of OCR extraction), because there is visually no space before/after the sub-/superscripted text (so rather also no hidden character to add to the offset and string length count).

Would it be feasible to regard/treat sub-/superscripted text as a specific type of punctuation? :-o Semantically it seems related to it (=it aids and directs the reading of the text). But just like the soft break, could its behavior be configurable?

@proycon
Owner Author

proycon commented Mar 30, 2021

> Maarten and I were thinking in the same direction. Another candidate would be the Zero-Width-space. It's up to ucto and such to interpret that as a token separator.

Just to prevent confusion: I definitely don't think there should be zero-width spaces in the FoLiA itself. At most the text extraction function could output one where a token boundary must occur and no space happens, but that would have to be an opt-in feature. And as you said, I foresee issues with the offsets then. So I see where you are going with the separated attribute.

Fundamentally, the issue we're discussing now is a tokeniser issue rather than a FoLiA representation problem (so I see it as distinct from the original issue in this thread). The question is how the tokeniser decides what to tokenise and what not:

  • What you're essentially suggesting with the separated attribute is to encode extra information in the FoLiA that gives the tokeniser extra information.
  • An alternative would be to provide the information directly to the tokeniser as a parameter, something like: treat all t-style's with class superscript as separate tokens. (an FQL query might work here but libfolia doesn't implement that and that'd be too much work)

Text content on higher levels is by definition untokenised (so I'm a bit skeptical about adding tokenisation details in there), text content on the word/token level is by definition tokenised. The issue is of course getting from A to B here (which is the task of the tokeniser).

I'm following the line of the extra attribute Ko suggested. But I'm trying to think in a generic way if we expand FoLiA for this: we're essentially encoding some extra 'cue' in the FoLiA to help another tool do its job, and such a cue is needed because the information is not present in the FoLiA yet, or is too complexly encoded. This might be useful for other use cases than the one we are considering now.

What if we introduce a generic tag attribute that allows people to tag any FoLiA element, the value being a space-delimited list from some user-defined vocabulary (because it is tool-specific)? We could then use a value like token or separate for the tokenisation cues:

<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>

It's essentially what Ko suggested but stretched to be more generic, it gives some processor-specific flexibility. You can envision tool A setting particular tags, and tool B acting on them.
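To illustrate the "tool B acting on them" side, here is a hedged Python sketch of a consumer that treats elements tagged `token` as token boundaries while leaving the text itself untouched. It uses the standard xml.etree.ElementTree; the `tag` attribute is the proposal above, and the helper names and the whitespace-splitting "tokeniser" are stand-ins, not how ucto works:

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<t><t-str>item1<t-style tag="token"><feat class="superscript" '
    'subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str></t>'
)

def has_tag(elem, name):
    """True if `name` is in the element's space-delimited tag attribute."""
    return name in elem.get("tag", "").split()

def pieces(elem):
    """Yield character data in document order; None marks a token boundary."""
    if elem.text:
        yield elem.text
    for child in elem:
        tagged = has_tag(child, "token")
        if tagged:
            yield None
        yield from pieces(child)
        if tagged:
            yield None
        if child.tail:
            yield child.tail

def plain_text(elem):
    return "".join(p for p in pieces(elem) if p is not None)

def tokens(elem):
    # Boundaries act like whitespace for the tokeniser only.
    return "".join(" " if p is None else p for p in pieces(elem)).split()

root = ET.fromstring(SAMPLE)
print(plain_text(root))  # item12something
print(tokens(root))      # ['item1', '2', 'something']
```

The text serialisation and its offsets stay as they were; only the tool that understands the `token` cue behaves differently.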

Note: I opened a new issue for this proposal, see below
