Implement text validation by default, even for shallow validation #24

proycon · 2016-11-08T14:03:07Z

Fail on invalid constructions like:

<p xml:id="test.p.2">
  <head><t>Dit is paragraaf 2</t></head>
  <t>Dit is de rest</t>
</p>

and check offsets.

kosloot · 2016-11-15T15:38:43Z

Please clarify what exactly is 'invalid' in this case:
a paragraph may have a head, and may have a textcontent.

proycon · 2016-12-09T15:44:15Z

The head is part of the paragraph, so the text of the paragraph is assumed to also cover the head.

kosloot · 2016-12-14T08:33:39Z

It still isn't clear what you mean here:
Should the paragraph text in this case be:
Dit is paragraaf 2 Dit is de rest

On the fly testing is quite hard then...

kosloot · 2017-04-10T14:35:02Z

So the conclusion is:
The textcontent of a structure element is composed of the textcontents of the underlying structure (if present, and all in the same class of course).
It is up to the library implementation to enforce this.
To play it safe, you could add textcontent only to the lowest elements (e.g. Words), but that makes human readability awkward.
Adding checks to the library is preferable.
An important issue in that case is whether the <t> content should reflect the untokenized variant of the text from the underlying layers. I think it should.

proycon · 2017-04-13T09:21:34Z

Agreed.

Checks are needed for the validators, outputting warnings if text is inconsistent.
<t> on higher levels indeed represents the untokenised variants.

proycon · 2017-04-13T14:04:12Z

Should still be expanded with offset/ref checks as well

kosloot · 2017-06-26T10:33:10Z

Proposal for Text Validation

FoLiA allows for text on multiple structural levels. Text on levels higher than the word level (e.g. sentence level, paragraph level) by definition reflects untokenised text. When text is specified on multiple levels, there is a level of redundancy, and the text on one level may conflict with the text on another level. This is a problem that needs to be detected at validation time. FoLiA v1.5 intends to enforce this.

Text validation has two aspects:

The text of a deeper element must occur in the higher element's text too (if present), and must occur in the right place.
Offset information (if present) must match. Offsets start at 0 and each Unicode code point counts as 1.

The main issue in text checking is how to deal with whitespaces:

We can go for the very strict approach, which seems to be the road we've been on until now, where whitespace has to be matched exactly (<t>x</t> <t>y</t> requires exactly <t>x y</t> for the encompassing structure and nothing else). Or we can go for a more relaxed approach, where we only care whether there is whitespace and don't care how much whitespace that is (<t>x</t> <t>y</t> would also allow <t>x y</t> with any number of spaces between x and y, and even <t>x\ny</t>). I'm leaning towards this second, more lenient scenario, so that's what this proposal will be about. A sub-issue here is how to deal with literal tabs in the input text (LanguageMachines/ucto#25). In the more lenient scenario, a tab would just be treated as a space. In a strict scenario it would be more problematic.

A linebreak can be considered a special kind of whitespace. For the more lenient proposal on text validation, I think we can consider newlines (and carriage returns) again the same as spaces.

We're obviously not the first ones to run into this issue. The W3C has thought about it too, and there is a normalize-space() function in XPath which would produce exactly the string I'm proposing we use for text validation. Similarly, HTML relies on this same principle (even for visualisation; multiple spaces, tabs, and linebreaks are meaningless and all just count as a space).
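As a minimal sketch of that normalisation in Python: the function name `normalize_space` is only illustrative and is not claimed to be the library's API. It follows the XPath definition, in which only space, tab, carriage return and newline count as whitespace, so a non-breaking space is left untouched.

```python
import re

# XPath's normalize-space() treats only these four characters as whitespace.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse each run of XML whitespace to one space and trim both ends."""
    return _XML_WHITESPACE.sub(" ", text).strip(" ")
```

Under this normalisation, "x y", "x    y", "x\ty" and "x\ny" all compare equal, which is the lenient behaviour described above.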

When it comes to XML parsing and serialisation, I think the output should match the input, so retaining all tabs/spaces/newlines as-is. If we don't do that, we may be forced to solve issue #12 (CDATA).

My proposal for FoLiA v1.5 would be:

For text validation: use the normalized-space approach. I realize this is a step backwards from the strict approach, as things like text delimiters and explicit linebreaks (<br/>) will not be part of the actual text validation anymore.
For parsing and serialisation of XML: retain all spaces, tabs and newlines exactly as in the input.
For printing (string serialisation in the library, text()/str()): print the content as is (if it has newlines/multiple spaces/tabs, so be it, and let the underlying tools deal with it as they wish). Introduce a normalize parameter for text() that does produce the normalised variant (and which in turn can be used by the text validation checker).
For text offset information and validation: use the normalized-space variant again. This implies that whitespace can be manipulated without changing offsets, and that TEXTDELIMITER attributes in the library can be tweaked without affecting FoLiA documents. (Scrapped: offset information is precise, each Unicode code point counts as 1.)
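The lenient check in the first point could look roughly like the sketch below. It is an illustration under assumptions, not the implementation in either library: `text_is_consistent` and `_norm` are hypothetical names, and the check only verifies that each deeper text occurs in the higher text, in document order, after whitespace normalisation.

```python
import re
from typing import Sequence


def _norm(text: str) -> str:
    # Same idea as XPath normalize-space(): collapse whitespace runs, trim ends.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def text_is_consistent(parent_text: str, child_texts: Sequence[str]) -> bool:
    """True if every child text occurs in the parent text, in the given order."""
    haystack = _norm(parent_text)
    pos = 0
    for child in child_texts:
        needle = _norm(child)
        found = haystack.find(needle, pos)
        if found == -1:
            return False
        # Continue searching after this match so the order is enforced.
        pos = found + len(needle)
    return True


# The construction from the first comment: the paragraph text omits the head.
print(text_is_consistent("Dit is de rest",
                         ["Dit is paragraaf 2", "Dit is de rest"]))
```

With the paragraph text from the opening example, the head text is not found, so the check reports an inconsistency.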

Implementation-wise, @kosloot identifies an important issue in LanguageMachines/libfolia#9:

Immediate top-down validation does not work when appending a lower-level structural element to a higher-level structural element with text as we're dealing with a partial text until all lower elements have been added. I guess a moment for text validation is just prior to serialisation to file, or on explicit call.
I think we must also take care, especially in libfolia, that text validation does not come at too high a performance cost.
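One way to picture the deferred approach is the sketch below. The `Element` class and its `text_errors` method are hypothetical stand-ins, not the pynlpl or libfolia classes: appending performs no text check, and the check runs only when called explicitly, for instance just before serialisation.

```python
import re


def _norm(text):
    # Collapse whitespace runs and trim, as in XPath normalize-space().
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


class Element:
    """Hypothetical stand-in for a FoLiA structure element."""

    def __init__(self, text=None):
        self.text = text
        self.children = []

    def append(self, child):
        # Deliberately no text check here: the parent's text may cover
        # children that have not been appended yet.
        self.children.append(child)
        return child

    def text_errors(self):
        """Collect inconsistencies for this element and everything below it."""
        errors = []
        if self.text is not None:
            haystack, pos = _norm(self.text), 0
            for child in self.children:
                if child.text is None:
                    continue
                needle = _norm(child.text)
                found = haystack.find(needle, pos)
                if found == -1:
                    errors.append(f"{needle!r} not found in {haystack!r}")
                else:
                    pos = found + len(needle)
        for child in self.children:
            errors.extend(child.text_errors())
        return errors


sentence = Element("This is a test")
for word in ["This", "is", "a", "test"]:
    sentence.append(Element(word))
problems = sentence.text_errors()  # explicit call, once the element is complete
```

Each element is visited once per call, so the cost is paid when validation is requested and not on every append.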

Another thing that complicates text validation is corrections: the authoritative route through new/current is followed by default, and text content on higher-level elements tends to be a reflection of the corrected content.

<s><t>This is a test</t></s>

<w>
<correction>
 <new><t>test</t></new>
 <original auth="no"><t>test</t></original>
</correction>
</w>

This does, however, raise some issues which still need to be addressed and are not covered here yet. I'll write them up in a separate comment.

Thoughts and comments?

kosloot · 2017-07-19T15:03:18Z

Most of this makes sense.
I'm a bit concerned about the offset remark:

<s>
  <t>This             is    odd</t>
  <w id="1" offset="0">
    <t>This</t>
  </w>
  <w id="2" offset="6">
    <t>is</t>
  </w>
  <w id="13" offset="8">
    <t>odd</t>
  </w>
</s>

I would think odd has an offset of 24, not 8.

kosloot · 2017-08-14T07:33:23Z

Remark: it is doable to have NEWLINES explicit in FoLiA text using <br/> tags, but multiple spaces and tabs will still be a problem. So we may just forget about the <br/>.

proycon · 2017-09-28T18:02:20Z

Text validation should not descend into morphology, as complex morphemes may not correspond entirely with higher levels. Example:

          <w xml:id="WR-P-E-J-0000000001.p.1.s.2.w.16">
            <t>genealogie</t>
            <pos class="N(soort,ev,basis,zijd,stan)" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"/>
            <lemma class="genealogie"/>
            <morphology>
              <morpheme class="complex">
                <t>genealogie</t>
                <feat class="[[genealogisch]adjective[ie]]noun/singular" subset="structure"/>
                <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
                <morpheme class="complex">
                  <feat class="N_A*" subset="applied_rule"/>
                  <feat class="[[genealogisch]adjective[ie]]noun" subset="structure"/>
                  <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
                  <morpheme class="stem">
                    <t>genealogisch</t>
                    <pos class="A" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
                  </morpheme>
                  <morpheme class="affix">
                    <t>ie</t>
                    <feat class="[ie]" subset="structure"/>
                  </morpheme>
                </morpheme>
                <morpheme class="inflection">
                  <feat class="singular" subset="inflection"/>
                </morpheme>
              </morpheme>
            </morphology>
          </w>

proycon · 2017-09-28T19:20:04Z

@kosloot , I hadn't replied to your offset concern yet. I understand the concern and think I agree with you. It would be a bit too odd indeed and the offset would even lose information.

So let's revert this idea and just count each unicode point as 1 (as it was) and not normalize the offsets in any way. This would imply the check is very strict and an error is easily made (but perhaps a nice counterweight to the offset-less text validation which is now more flexible).
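A minimal sketch of such a strict offset check, assuming the offset indexes the higher element's text exactly as written, with no normalisation. The function name `offset_matches` is hypothetical. Python strings are indexed by code point, which matches the counting rule stated here.

```python
def offset_matches(parent_text: str, child_text: str, offset: int) -> bool:
    """True if child_text starts at exactly this code point offset in parent_text."""
    if offset < 0:
        # A negative value would silently index from the end in Python.
        return False
    return parent_text[offset:offset + len(child_text)] == child_text


# Extra whitespace in the higher-level text shifts every later offset.
single_spaced = "This is odd"
double_spaced = "This  is odd"
print(offset_matches(single_spaced, "odd", 8))
print(offset_matches(double_spaced, "odd", 8))
```

Because nothing is normalised, a single added space in the higher-level text invalidates every offset after it, which is the strictness referred to above.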

proycon · 2017-09-29T10:24:43Z

A small remark on the implementation of text validation: I consider it a validation step and not a parse step; so in the Python library text checking is only done when explicitly asked (not on every document load, to save performance). The validation methods and by extension the validator tool of course always explicitly turn text checking on (for FoLiA v1.5+).

kosloot · 2017-10-02T10:45:30Z

I don't agree (fully).
This would allow invalid documents to be generated and to occur 'in the wild'.
Most users never or scarcely run validators.
Parsing is already expensive. Checking the text validity doesn't impose a severe burden IMHO.
We check setnames and such too. Would you skip that also?

…iven a textclass, what kind of correction handling is appropriate. This is used by text validation (proycon/folia#24). The method is limited and can not deal with complex situations (nested corrections, inconsistencies), in such cases text validation will be skipped alltogether for that element.

proycon added the enhancement label Nov 8, 2016

proycon added this to the v1.4 milestone Nov 8, 2016

kosloot mentioned this issue Nov 8, 2016

Implement better on-the-fly validation for Text LanguageMachines/libfolia#9

Closed

proycon removed this from the v1.4 milestone Dec 9, 2016

proycon self-assigned this Apr 10, 2017

proycon added this to the v1.4.2 milestone Apr 10, 2017

kosloot mentioned this issue Apr 10, 2017

Frogging other FoLiA tags then <s> LanguageMachines/frog#30

Closed

proycon added a commit to proycon/pynlpl that referenced this issue Apr 13, 2017

Implemented text validation (enable with textvalidation=True on Docum…

0871415

…ent instantiation) (proycon/folia#24)

proycon added a commit that referenced this issue Apr 13, 2017

Implemented text validation (currently warnings only, to be made stri…

b65c803

…cter at some point in the future) #24

proycon modified the milestones: v1.5, v1.4.2 Jun 23, 2017

proycon mentioned this issue Jul 19, 2017

Ucto fails to tokenise certain folia input? LanguageMachines/ucto#25

Closed

proycon mentioned this issue Sep 26, 2017

Handling text redundancy in upcoming FoLiA v1.5 proycon/foliadocserve#10

Closed

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017

Reimplemented more relaxed form of text validation (proycon/folia#24)…

d7bdf72

… , using normalize_spaces(), added a normalize_spaces attribute on text() methods.

proycon added a commit that referenced this issue Sep 28, 2017

Added textvalidation example (folia v1.5), copy of foliatest/examplev…

7f4695f

…1.5.xml #24

proycon added the PRIORITY label Sep 28, 2017

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017

Added two invalid text tests (proycon/folia#24)

0d3be5a

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017

Added two invalid text tests (proycon/folia#24)

4edf720

proycon added a commit that referenced this issue Sep 28, 2017

ref attribute was missing on text content, because a level was skippe…

7427e72

…d (part) [the new text validator discovered it] #24

proycon added a commit that referenced this issue Sep 28, 2017

Corrected more erroneous offsets in example.xml #24

4ffcaa5

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017

Implemented offset checking for text validation (proycon/folia#24), T…

20757fb

…extContent.ref attribute is now a string (ID), getreference() method resolves the actual reference and does the final validation

proycon closed this as completed Oct 8, 2017

proycon mentioned this issue Oct 10, 2017

[documentation] newlines and whitespace in FoLiA text content (<t>) #34

Closed

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Implemented text validation (enable with textvalidation=True on Docum…

aa4cc24

…ent instantiation) (proycon/folia#24)

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Reimplemented more relaxed form of text validation (proycon/folia#24)…

2b92850

… , using normalize_spaces(), added a normalize_spaces attribute on text() methods.

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Added two invalid text tests (proycon/folia#24)

73351df

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Added two invalid text tests (proycon/folia#24)

31aa5d8

proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018

Implemented offset checking for text validation (proycon/folia#24), T…

4577c34

…extContent.ref attribute is now a string (ID), getreference() method resolves the actual reference and does the final validation
