offset information for segments (from, to) -- what's the best move? #16

Open
bansp opened this issue Mar 17, 2017 · 28 comments
Comments

@bansp
Contributor

bansp commented Mar 17, 2017

This is a request for advice.
I need to provide offset information for <seg>, ISO-LAF-style (numerical, starting from 0).
I reject @corresp with the quasi-XPointer that we are never going to see implemented, I'm afraid, and which is just scary otherwise. I want @from and @to so that the markup is understandable not only for high-end parsing tools. There are two options for this, minimally:

  • define them locally in the ODD (@mode="add"), as <dataRef name="nonNegativeInteger"/>
  • add <seg> to the class att.citing, and fix the @unit attribute to the value "character" (and document the convention of starting from 0, but that needs to be done either way)

Question: which of these moves feels more in line with our overall goal here, and which is more likely to be accepted by the Council when we submit the relevant tickets? The att.citing strategy feels more universal, while the former creates yet another {@from, @to} pair, which the Council may be unhappy with.
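
For illustration, the first option (local definitions in the ODD) might look roughly like the fragment below. This is a minimal sketch, not the text of any actual proposal: the wording of the descriptions and the choice of usage="opt" are assumptions.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="seg" module="linking" mode="change">
  <attList>
    <!-- assumption: both attributes are optional and count from 0 -->
    <attDef ident="from" mode="add" usage="opt">
      <desc>offset at which the segment starts, counting from 0</desc>
      <datatype>
        <dataRef name="nonNegativeInteger"/>
      </datatype>
    </attDef>
    <attDef ident="to" mode="add" usage="opt">
      <desc>offset at which the segment ends, counting from 0</desc>
      <datatype>
        <dataRef name="nonNegativeInteger"/>
      </datatype>
    </attDef>
  </attList>
</elementSpec>
```

With such a customization, an instance like <seg from="0" to="5"/> would validate, while from="#sect1" would be rejected by the datatype alone.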

@laurentromary
Owner

You're giving a lot of elements that may help shape an answer. My feeling is that we need a specific class (not abusing att.citing, for reasons we could list in this thread), say att.referring, with the following properties:

  • @from, @to, @referringMode
  • used in span immediately to replace existing @from and @to
  • consequently, @from and @to get a double type: teidata.word and teidata.pointer

You can then use it for whatever other similar purpose.
Makes sense?

@bansp
Contributor Author

bansp commented Mar 17, 2017

Thanks. Do we have precedents with attributes having double (multiple) data types? I kind of assumed up till now that it is a matter of principle to have a single data type defined for attributes. I'm just wondering about how revolutionary such a thing would be for the Council.

@bansp
Contributor Author

bansp commented Mar 17, 2017

I can imagine an objection: "what if someone does @from="#sect1" @to="5"?"
Obviously, there are many ways to mess up even with a single data type, but I have a feeling that this would be seen as a cause for rejecting the ticket/proposal.
Or is your @referringMode supposed to handle that? Two values, "pointer" and "numeric", plus Schematron?

@laurentromary
Owner

As to your objection, this is indeed the role of @referringMode which should trigger the appropriate Schematron rule.
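
A rough sketch of what such a mode-driven Schematron check could look like, assuming the two values "pointer" and "numeric" floated above and an XPath 2.0 query binding (needed for matches()). The ident is hypothetical, and "not a bare number" is only a crude stand-in for "is a pointer":

```xml
<constraintSpec xmlns="http://www.tei-c.org/ns/1.0"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                ident="referringMode_types" scheme="schematron">
  <constraint>
    <!-- numeric mode: both offsets must be plain digit strings -->
    <sch:rule context="*[@referringMode = 'numeric'][@from and @to]">
      <sch:assert test="matches(@from, '^[0-9]+$') and matches(@to, '^[0-9]+$')">
        With referringMode="numeric", @from and @to must both be non-negative integers.
      </sch:assert>
    </sch:rule>
    <!-- pointer mode: neither value may be a bare number -->
    <sch:rule context="*[@referringMode = 'pointer'][@from and @to]">
      <sch:assert test="not(matches(@from, '^[0-9]+$')) and not(matches(@to, '^[0-9]+$'))">
        With referringMode="pointer", @from and @to must both be URI references, not bare numbers.
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```

Under these rules, from="#sect1" to="5" fails in either mode, which is the case the objection worries about.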

@bansp
Contributor Author

bansp commented Mar 17, 2017

Super, thanks for this info.
So this is a logically separate initiative, because the class should go into a separate spec file att.referring.xml and it affects <span> and <seg> for general purposes (although they will definitely shine in the standoff module).
Sounds like a ticket for practically now, correct?

@laurentromary
Owner

Yes, go!

@bansp
Contributor Author

bansp commented Mar 17, 2017

Started by preparing a separate project for that in the LingSIG space: https://github.com/orgs/LingSIG/projects/2
I'm not sure I will be able to do more today (last day before flying).
A link to the list of projects is a better reference, at least the description is visible there: https://github.com/orgs/LingSIG/projects

@laurentromary
Owner

The links do not work!

@bansp
Contributor Author

bansp commented Mar 17, 2017

Oh, that's bad news. They work for me, which may mean that they are accessible only to members of the "organization" (I've sent you an invite). I didn't realise that within public organizations at GitHub, access may be restricted. That's a bit worrying.

@bansp
Contributor Author

bansp commented Mar 21, 2017

Ahh, teidata.probCert is not exactly a precedent: it's a single data type that has two internal variants. Still, let's experiment and see.

@bansp
Contributor Author

bansp commented Mar 21, 2017

Similarly with data.numeric (one datatype to bind them...):

    <alternate>
        <dataRef name="double"/>
        <dataRef name="token" restriction="(\-?[\d]+/\-?[\d]+)"/>
        <dataRef name="decimal"/>
    </alternate>

@bansp
Contributor Author

bansp commented Mar 21, 2017

I have created the relevant specs now but got stuck at the Schematron. I already use a rule which forces both attributes to be either uniformly integer or uniformly URI. At this point, @referringMode becomes an extra static-typing kind of check, and since it has to be optional (so that <span> remains compatible), it's a, well, optional static-typing check, which is very close to a spurious static-typing check... So I'm thinking of dropping it altogether.

I would like to open this for discussion. Will reference the relevant diff in the next note.

bansp added a commit to LingSIG/TEI that referenced this issue Mar 21, 2017
@bansp
Contributor Author

bansp commented Mar 21, 2017

Here goes: LingSIG/TEI@c30ab72

Two specs added: att.referring.xml and teidata.pointerInt.xml , one file changed (span.xml).
Please note that the Schematron binding @referringMode is lacking, because I can no longer see the rationale for using this attribute.
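
The actual specs are in the linked commit. Purely as a guess at the general shape, a combined integer-or-pointer datatype could be declared along the lines of the data.numeric alternation quoted earlier; the description text and the two XSD datatypes chosen here are assumptions, not the committed content:

```xml
<dataSpec xmlns="http://www.tei-c.org/ns/1.0"
          ident="teidata.pointerInt" module="tei">
  <desc>(hypothetical wording) a value that is either a non-negative
    integer offset or a pointer expressed as a URI reference</desc>
  <content>
    <alternate>
      <dataRef name="nonNegativeInteger"/>
      <dataRef name="anyURI"/>
    </alternate>
  </content>
</dataSpec>
```

One caveat with this kind of alternation: a bare number such as "5" is also a legal relative URI reference, so the datatype alone cannot tell the two readings apart, and any stricter distinction has to come from an extra rule.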

@bansp
Contributor Author

bansp commented Mar 22, 2017

This is a better link, to a pull request which combines the commits (needed to fix a bit):
https://github.com/LingSIG/TEI/pull/1/files

@bansp
Contributor Author

bansp commented Mar 22, 2017

(just a note: it works as expected; the Schematron is indeed too tight and @referringMode seems indeed spurious)

@laurentromary
Owner

Shouldn't we drop the technical check and see @referringMode as a documentation mechanism (like @unit, in a metaphoric way)?

@bansp
Contributor Author

bansp commented Mar 23, 2017

And I was so happy with this simplistic view that there's just the integer offset and ID-based pointing... sigh. OK then, more power to patterns, less mess in the Schematron layer: let's define patterns of
< referringMode, from, to >, with potential conflation of the first member

  • character, NNinteger, NNinteger (NN = nonNegative)
  • inter-character point (icp), NNinteger, NNinteger
  • byte, NNinteger, NNinteger
  • pointer, pointer, pointer
  • second, integer, integer
  • int(erval), integer, integer

(the last one is meant as a catch-all, for other uses)

I would model that as an alternation of patterns, driven by the value of @referringMode.

There should be a way to fix one value of the referringMode, to minimize verbosity in the actual markup. Maybe simply in the particular ODD, so it's outside of the proposal. Another thing that has to be documented in the encodingDesc (I guess there?) is the initial value of character-based indexes (it can be 0 or 1).

Does the above make sense?

@bansp
Contributor Author

bansp commented Mar 23, 2017

PS. I can do that in RNG, but I'm afraid I see no way of implementing this in ODD, where I can't see a way to express a pattern of attributes...
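
For readers unfamiliar with the RELAX NG side, an alternation of attribute patterns can be sketched as a choice between attribute groups. The grammar below is only an illustration under stated assumptions: the define name is hypothetical, <seg> is given plain text content, the mode values are taken from the list above, and the "second" pattern is left out for brevity.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.tei-c.org/ns/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="seg">
      <ref name="att.referring.attributes"/>
      <text/>
    </element>
  </start>
  <define name="att.referring.attributes">
    <choice>
      <!-- offset-like modes share the non-negative integer pattern -->
      <group>
        <attribute name="referringMode">
          <choice>
            <value>character</value>
            <value>icp</value>
            <value>byte</value>
          </choice>
        </attribute>
        <attribute name="from"><data type="nonNegativeInteger"/></attribute>
        <attribute name="to"><data type="nonNegativeInteger"/></attribute>
      </group>
      <!-- pointer mode: both values are URI references -->
      <group>
        <attribute name="referringMode"><value>pointer</value></attribute>
        <attribute name="from"><data type="anyURI"/></attribute>
        <attribute name="to"><data type="anyURI"/></attribute>
      </group>
      <!-- catch-all interval mode: any integers -->
      <group>
        <attribute name="referringMode"><value>int</value></attribute>
        <attribute name="from"><data type="integer"/></attribute>
        <attribute name="to"><data type="integer"/></attribute>
      </group>
    </choice>
  </define>
</grammar>
```

Because each branch of the choice fixes the mode together with the types of both values, a mixed pair such as from="4" to="#gizmo" cannot satisfy the offset or interval branches.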

@laurentromary
Owner

All in all this is exactly the way I was seeing this. We need to find a clever way to implement.

@bansp
Contributor Author

bansp commented Mar 23, 2017

Yep. The way I see now is not very clever, but I've asked the gurus:

http://tei-l.970651.n3.nabble.com/can-pure-ODD-define-an-alternation-of-attribute-patterns-td4029511.html

@bansp
Contributor Author

bansp commented Mar 23, 2017

Ah, there is a theoretical way to express this in pure ODD, but depending on one's view, the fact that it doesn't work is either a feature or a bug: TEIC/Stylesheets#144

@bansp
Contributor Author

bansp commented Mar 24, 2017

Implemented Syd's awesome suggestion. The result is a bit too lax now, in the sense that it allows from="4" to="#gizmo" in the absence of @referringMode, but that's an easy fix that I will implement tomorrow. Gosh, it looks like we're looking at a workable solution!
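
One possible shape for such a fix, sketched under the assumption of an XPath 2.0 query binding: when @referringMode is absent, require that @from and @to are either both bare numbers or both something else. The ident is hypothetical and this is not necessarily the fix that was implemented.

```xml
<constraintSpec xmlns="http://www.tei-c.org/ns/1.0"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                ident="from_to_uniform" scheme="schematron">
  <constraint>
    <sch:rule context="*[not(@referringMode)][@from and @to]">
      <!-- the two boolean results must agree: both numeric or both non-numeric -->
      <sch:assert test="matches(@from, '^[0-9]+$') = matches(@to, '^[0-9]+$')">
        @from and @to must be of the same kind: both integers or both pointers.
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```

With this rule, from="4" to="#gizmo" is flagged, while from="4" to="9" and from="#a" to="#b" both pass.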

One more thing: I put in "second" as a tip of the hat to our speech-transcription colleagues, but there surely should be more. I'll have a look at Thomas Schmidt's article to see what other values referringMode should have. I am not sure if the anonymous "interval" should be there (?)

@laurentromary
Owner

This is really interesting! Concerning "second", the format is determined by ISO 8601 (or so). I guess we should have "temporal" there.

@bansp
Contributor Author

bansp commented Mar 31, 2017

I have left "seconds" out after all, because speech transcription assumes an indexed timeline. Because URI is so loose (it is hard to even create an ill-formed URI), I have added the value "id" to @referringMode, to mimic the behaviour of IDREF (it alerts if there is no suitable target in the document being edited). I find it super handy when editing by hand.
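
An IDREF-like check of this kind can be approximated in Schematron. The sketch below assumes that values in "id" mode are written as "#"-prefixed local references and that targets carry xml:id; both are assumptions, as is the ident, and the real implementation may differ.

```xml
<constraintSpec xmlns="http://www.tei-c.org/ns/1.0"
                xmlns:sch="http://purl.oclc.org/dsdl/schematron"
                ident="referringMode_id" scheme="schematron">
  <constraint>
    <sch:rule context="*[@referringMode = 'id']">
      <!-- each reference must resolve to an element in the same document -->
      <sch:assert test="every $ref in (@from, @to) satisfies
                        (starts-with($ref, '#') and
                         exists(//*[@xml:id = substring($ref, 2)]))">
        With referringMode="id", @from and @to must each point to an xml:id present in this document.
      </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```

The every/satisfies expression requires XPath 2.0, so the Schematron would need an XSLT 2 query binding.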

One burning question is: should I add <seg> to the att.referring class right away? Or rather leave it as an example of potential customization? The thing is, we want it available under <standoff>, it embeds the whole point, after all.

@bansp
Contributor Author

bansp commented Apr 3, 2017

I think that the code-related portion is done, what remains is to weave some narration into the IA chapter (not easy, because it has a really nice flow exactly where we should somehow interrupt it), and to decide whether this ticket should only modify <span> or also <seg>. The example I've given in the documentation is one of <seg> but that shouldn't matter much.

This is what I put into my ODD to make this work for <seg>:

    <elementSpec ident="seg" module="linking" mode="change">
        <classes mode="change">
            <memberOf key="att.referring"/>
        </classes>
    </elementSpec>

    <classSpec ident="att.referring" mode="change" type="atts" module="tei">
        <constraintSpec scheme="schematron" ident="default_mode" mode="replace">
            <constraint>
                <sch:rule
                    context="*[local-name() = ('span','seg')][not(@referringMode) and @from and @to]">
                    <sch:assert test="@from castable as xsd:nonNegativeInteger">The default form of @from is a non-negative integer</sch:assert>
                    <sch:assert test="@to castable as xsd:nonNegativeInteger">The default form of @to is a non-negative integer</sch:assert>
                </sch:rule>
            </constraint>
        </constraintSpec>
        <attList>
            <attDef ident="referringMode" usage="opt" mode="change">
                <defaultVal>icp</defaultVal>
            </attDef>
        </attList>
    </classSpec>

This is only complex because there appears to be no way to communicate the current state of ODD to Schematron. Otherwise, the constraint could be part of att.referring and we would only need to modify the default value in the ODD.

@bansp
Contributor Author

bansp commented Apr 4, 2017

Instead of adding <seg> directly to att.referring, we might add att.segLike to att.referring.
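
In ODD terms, that alternative might be sketched as follows (a minimal fragment, assuming att.referring already exists as an attribute class):

```xml
<classSpec xmlns="http://www.tei-c.org/ns/1.0"
           ident="att.segLike" type="atts" module="tei" mode="change">
  <classes mode="change">
    <!-- every member of att.segLike inherits @from, @to and @referringMode -->
    <memberOf key="att.referring"/>
  </classes>
</classSpec>
```

The trade-off is reach: all members of att.segLike, not only <seg>, would pick up the attributes, so a Schematron context that names only 'span' and 'seg' would have to be widened to match.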

@laurentromary
Owner

This is probably a good move!
