Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement text validation by default, even for shallow validation #24

Closed
proycon opened this issue Nov 8, 2016 · 14 comments
Closed

Implement text validation by default, even for shallow validation #24

proycon opened this issue Nov 8, 2016 · 14 comments
Assignees
Milestone

Comments

@proycon
Copy link
Owner

proycon commented Nov 8, 2016

Fail on invalid constructions like:

<p xml:id="test.p.2">
  <head><t>Dit is paragraaf 2</t></head>
  <t>Dit is de rest</t>
</p>

and check offsets.

@kosloot
Copy link
Collaborator

kosloot commented Nov 15, 2016

Please clarify what is exactly 'invalid' in this case:
a paragraph may have a head, and may have a textcontent

@proycon proycon removed this from the v1.4 milestone Dec 9, 2016
@proycon
Copy link
Owner Author

proycon commented Dec 9, 2016

The head is part of the paragraph, so the text of the paragraph is assumed to also cover the head.

@kosloot
Copy link
Collaborator

kosloot commented Dec 14, 2016

It still isn't clear what you mean here:
Should the paragraph text in this case be:
Dit is paragraaf 2 Dit is de rest

On the fly testing is quite hard then...

@kosloot
Copy link
Collaborator

kosloot commented Apr 10, 2017

So the conclusion is:
The textcontent of a structure element is composed of the textcontents of the underlying structure. (if present, and all in the same class of course)
It is up to the library implementation to enforce this.
To play safe, you could only add textcontent to the lowest elements (e.g. Words) but that makes human readability awkward.
adding checks to the library is preferable.
An important issue is in that case whether the <t> content should reflect the untokenized variant of the text from underlying layers. I think it should.

@proycon
Copy link
Owner Author

proycon commented Apr 13, 2017

Agreed..

  • Checks are needed for the validators, outputting warnings if text is inconsistent
  • <t> on higher levels indeed represents the untokenised variants.

proycon added a commit to proycon/pynlpl that referenced this issue Apr 13, 2017
proycon added a commit that referenced this issue Apr 13, 2017
@proycon
Copy link
Owner Author

proycon commented Apr 13, 2017

Should still be expanded with offset/ref checks as well

@proycon proycon modified the milestones: v1.5, v1.4.2 Jun 23, 2017
@kosloot
Copy link
Collaborator

kosloot commented Jun 26, 2017

see also LanguageMachines/libfolia#9

@proycon
Copy link
Owner Author

proycon commented Jul 19, 2017

Proposal for Text Validation

FoLiA allows for text on multiple structural levels. Text on words higher than the word level (e.g sentence level, paragraph level) by definition reflect untokenised text. When text is specified on multiple levels, there is a level of redundancy and it may occur that the text on one level conflicts with the text on another level, this is a problem that needs to be detected at validation time. FoLiA v1.5 intends to enforce this.

Text validation has two aspects:

  • The text of a deeper element must occur in the higher element's text too (if present), and must occur in te right place:
  • Offset information (if present) must match. Offsets start at 0 and each unicode point counts as 1.

The main issue in text checking is how to deal with whitespaces:

We can go for the very strict approach, which seems to be the road we've been on until now, where whitespace has to be exactly matched (<t>x</t> <t>y</t> requires exactly <t>x y</t> for the encompassing structure and nothing else) or for a more relaxed approach, where we only care whether there is whitespace and don't care how much whitespace that is (<t>x</t> <t>y</t> would also allow <t>x y</t> and even <t>x\ny</t>). I'm leaning towards favouring this second more lenient scenario, so that's what this proposal will be about. A sub-issue here is how to deal with literal tabs in the input text (LanguageMachines/ucto#25). In the more lenient scenario, a tab would just be treated as a space. In a strict scenario it would be more problematic.

A linebreak can be considered a special kind of whitespace. For the more lenient proposal on text validation, I think we can consider newlines (and carriage returns) again the same as spaces.

We're obviously not the first ones to run into this issue, the W3C has thought about it too and there is a normalize-spaces() function in XPath which would produce exactly the string I'm proposing we use for text validation. Similarly, HTML relies on this same principle (even for visualisation; multiple spaces, tabs, and linebreaks are meaningless and all just count as a space).

When it comes to XML parsing and serialisation, I think the output should match the input, so retaining all tabs/spaces/newlines as-is. If we don't do that may be forced to solve issue #12 (CDATA).

My proposal for FoLiA v1.5 would be:

  • For text validation: Use the normalized-spaces approach. I realize this is a step backwards from the strict approach, as things like text delimiters, explicit linebreaks (<br/>) will not be part of the actual text validation anymore.
  • For parsing and serialisation of XML: retain all spaces, tabs and newlines exactly as in the input.
  • For printing (string serialisation in the library) (text()/ str()), print the content as is (if it has newlines/multiple spaces/tabs, so be it and let the underlying tools deal with it as they wish). Introduce a normalize parameter for text() that does produce the normalised variant (and which in turn can be used by the text validation checker).
  • For text offset information and validation, use the normalized-spaces variant again. Implies that whitespace can be manipulated without changing offsets, and that TEXTDELIMITER attributes in the library can be tweaked without affecting FoLiA documents. (scrapped, offset information is precise, each unicode points counts as 1)

Implementation-wise, @kosloot identifies an important issue in LanguageMachines/libfolia#9:

  • Immediate top-down validation does not work when appending a lower-level structural element to a higher-level structural element with text as we're dealing with a partial text until all lower elements have been added. I guess a moment for text validation is just prior to serialisation to file, or on explicit call.
  • I think we must also take care, especially in libfolia, that text validation does not come at too high a performance cost.

Another thing that complicates text validation is corrections, the authoritative route through new/current is followed by default and text content on higher level elements tends to be a reflection of the corrected content.

<s><t>This is a test</t></s>

<w>
<correction>
 <new><t>test</t></new>
 <original auth="no"><t>test</t></original>
</correction>
</w>

This however does raise some issues which need to be addressed still and are not covered here yet, I'll write up on it in a separate comment.

Thoughts and comments?

@kosloot
Copy link
Collaborator

kosloot commented Jul 19, 2017

Most of this make sense.
Bit concerned about the offset remark:

<s>
   <t>This             is    odd</t>
  <w id=1 offset=0>
     <t>This</t>
  </w>
  <w id=2 offset=6>
     <t>is</t>
  </w>
  <w id=13 offset=8>
     <t>odd</t>
  </w>
</s>`

I would think odd has an offset of 24. not 8

@kosloot
Copy link
Collaborator

kosloot commented Aug 14, 2017

Remark: it is doable to have NEWLINES explicit in FoLiA text using <br/> tags but multiple spaces and tabs will still be a problem. So may just forget about the <br/>

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017
… , using normalize_spaces(), added a normalize_spaces attribute on text() methods.
@proycon
Copy link
Owner Author

proycon commented Sep 28, 2017

Text validation should not descend into morphology, as complex morphemes may not correspond entirely with higher levels. Example:

          <w xml:id="WR-P-E-J-0000000001.p.1.s.2.w.16">
            <t>genealogie</t>
            <pos class="N(soort,ev,basis,zijd,stan)" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"/>
            <lemma class="genealogie"/>
	    <morphology>
	      <morpheme class="complex">
		<t>genealogie</t>
		<feat class="[[genealogisch]adjective[ie]]noun/singular" subset="structure"/>
		<pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
		<morpheme class="complex">
                  <feat class="N_A*" subset="applied_rule"/>
                  <feat class="[[genealogisch]adjective[ie]]noun" subset="structure"/>
                  <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
                  <morpheme class="stem">
                    <t>genealogisch</t>
                    <pos class="A" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
                  </morpheme>
                  <morpheme class="affix">
                    <t>ie</t>
                    <feat class="[ie]" subset="structure"/>
                  </morpheme>
		</morpheme>
		<morpheme class="inflection">
                  <feat class="singular" subset="inflection"/>
		</morpheme>
              </morpheme>
            </morphology>
          </w>

proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017
proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017
@proycon
Copy link
Owner Author

proycon commented Sep 28, 2017

@kosloot , I hadn't replied to your offset concern yet. I understand the concern and think I agree with you. It would be a bit too odd indeed and the offset would even lose information.

So let's revert this idea and just count each unicode point as 1 (as it was) and not normalize the offsets in any way. This would imply the check is very strict and an error is easily made (but perhaps a nice counterweight to the offset-less text validation which is now more flexible).

proycon added a commit that referenced this issue Sep 28, 2017
…d (part) [the new text validator discovered it] #24
proycon added a commit to proycon/pynlpl that referenced this issue Sep 28, 2017
…extContent.ref attribute is now a string (ID), getreference() method resolves the actual reference and does the final validation
@proycon
Copy link
Owner Author

proycon commented Sep 29, 2017

A small remark on the implementation of text validation: I consider it a validation step and not a parse step; so in the Python library text checking is only done when explicitly asked (not on every document load, to save performance). The validation methods and by extension the validator tool of course always explicitly turn text checking on (for FoLiA v1.5+).

@kosloot
Copy link
Collaborator

kosloot commented Oct 2, 2017

I don't agree (fully).
This would allow for invalid documents to be generated and occur 'in the wild'
Most users don't or scarcely run validators.
Parsing is already expensive. checking the text validity doesn't impose a severe burden IMHO
We check setnames and such too. Would you skip that also?

proycon added a commit to proycon/pynlpl that referenced this issue Oct 5, 2017
…iven a textclass, what kind of correction handling is appropriate. This is used by text validation (proycon/folia#24). The method is limited and can not deal with complex situations (nested corrections, inconsistencies), in such cases text validation will be skipped alltogether for that element.
@proycon proycon closed this as completed Oct 8, 2017
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
… , using normalize_spaces(), added a normalize_spaces attribute on text() methods.
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
…extContent.ref attribute is now a string (ID), getreference() method resolves the actual reference and does the final validation
proycon added a commit to proycon/foliapy that referenced this issue Sep 6, 2018
…iven a textclass, what kind of correction handling is appropriate. This is used by text validation (proycon/folia#24). The method is limited and can not deal with complex situations (nested corrections, inconsistencies), in such cases text validation will be skipped alltogether for that element.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants