Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ontotrace nexml validation #8

Open
uyedaj opened this issue Sep 28, 2018 · 12 comments
Open

ontotrace nexml validation #8

uyedaj opened this issue Sep 28, 2018 · 12 comments
Assignees
Labels

Comments

@uyedaj
Copy link
Contributor

uyedaj commented Sep 28, 2018

@balhoff When I run the nexml files that Wasila sent me (https://github.com/phenoscape/scate/tree/master/data)
through ontotrace through the validator (http://www.nexml.org/phylows/validator) I get the following error (here Malabarba-1998_Ontotrace.xml). Looks like the issue is the polymorphism again?

going to read file 'Malabarba-1998_Ontotrace.xml'
reading from handle
created temporary file '/tmp/o3Bz65Cow4'
copied uploaded data to '/tmp/o3Bz65Cow4'
read file 'Malabarba-1998_Ontotrace.xml', copied contents to '/tmp/o3Bz65Cow4'
created java validator invocation
executing java validator: java -classpath /var/www/html/phylows/../downloads/validator.jar -Dxml=/tmp/o3Bz65Cow4 -Dxsd=/var/www/html/phylows/../xsd/nexml.xsd -Dns=http://www.nexml.org/2009 validator.XmlValidator
executing perl validator
initializing Bio::Phylo::Parsers::Nexml=HASH(0x1c155a0)
going to parse xml
processed <otus id="tfc794139-a39f-4f2a-aa2e-b611ac9f9166"/> (line 71)
processed <otu id="VTO_0065575"/> (line 71)
processed <otu id="VTO_0038118"/> (line 71)
processed <otu id="VTO_0038133"/> (line 71)
processed <otu id="VTO_0038106"/> (line 71)
processed <otu id="VTO_0038138"/> (line 71)
processed <otu id="VTO_0065893"/> (line 71)
processed <otu id="VTO_0065636"/> (line 71)
processed <otu id="VTO_0065788"/> (line 71)
processed <otu id="VTO_0038109"/> (line 71)
processed <otu id="VTO_0065571"/> (line 71)
processed <otu id="VTO_0065877"/> (line 71)
processed <otu id="VTO_0038108"/> (line 71)
processed <otu id="VTO_0065585"/> (line 71)
processed <otu id="VTO_0065747"/> (line 71)
processed <otu id="VTO_0038128"/> (line 71)
processed <otu id="VTO_0065095"/> (line 71)
processed <otu id="VTO_0038140"/> (line 71)
processed <otu id="VTO_0038135"/> (line 71)
processed <otu id="VTO_0038120"/> (line 71)
processed <otu id="VTO_0038111"/> (line 71)
processed <otu id="VTO_0038124"/> (line 71)
Processed block id: tfc794139-a39f-4f2a-aa2e-b611ac9f9166 (line 71)
going to parse characters element (line 236)
processed <characters id="cc4104afe-da7a-43e7-a1c5-cb4a40fc9161"/> (line 236)
processed <row id="r57d9b9d3-75bc-498e-a165-60330c7ecabd"/> (line 236)
set char: '1 1 and 0 1 0' (line 236)
Invalid data for row Datum49 (type Standard: 1 1 and 0 1 0)
Invalid data for row Datum49 (type Standard: 1 1 and 0 1 0)
@balhoff
Copy link
Member

balhoff commented Sep 28, 2018

I think the issue is the polymorphism. It probably doesn't like that the state symbol for the polymorphic condition is 1 and 0. But the schema says that is okay.

@balhoff
Copy link
Member

balhoff commented Sep 28, 2018

I’ll try to figure out if I can find where this is getting checked in the Perl.

@balhoff
Copy link
Member

balhoff commented Sep 30, 2018

@rvosa does the Bio::Phylo NeXML validator handle polymorphic states? It seems to be complaining about a string character symbol.

@balhoff
Copy link
Member

balhoff commented Oct 3, 2018

@uyedaj if you replace the polymorphism symbols with a number, e.g. "10", it validates. I believe this is an error in the validator. For now you can replace

symbol="1 and 0" or symbol="0 and 1"

with

symbol="2"

Will this work? Additionally you will want to make sure RNeXML is handling polymorphisms, if you are using that.

@hlapp
Copy link
Member

hlapp commented Oct 3, 2018

As mentioned at last meeting, based on ropensci/RNeXML#171 I think right now polymorphic characters whose symbols aren't numeric are all going to run into the same issue.

What would be good to have while looking at possible solutions is the format that polymorphic character states normally take in matrices. Is it an extra numeric symbol (e.g., 2), or some concatenation of possible symbols (e.g., 1 and 0, or 1, 0, or 10?). I.e., what would most comparative methods code in R expect such as symbol to look like?

@uyedaj
Copy link
Contributor Author

uyedaj commented Oct 3, 2018

Most methods can't really handle polymorphic characters... DNA sequence data has dedicated characters for ambiguous bases, while I think the standard for binary characters is generally just to put a '?'. I think a lot of the nexus parsers will have issues if it's anything longer than a single character, so for now it seems like an extra numeric symbol would be a good option.

@balhoff
Copy link
Member

balhoff commented Oct 3, 2018

@hlapp do you think I should update OntoTrace to choose a random unused digit (for presence/absence, of course this can be 2) to just make this work? As you might expect I would much prefer for tools that read NeXML to just handle it as designed. :-)

The "1 and 0" symbol works really well for display in Phenex, which is one reason I'm reluctant to change.

NEXUS has its own scheme to represent polymorphism, and I would expect libraries that read/write both to interconvert between them.

@hlapp
Copy link
Member

hlapp commented Oct 3, 2018

@hlapp do you think I should update OntoTrace to choose a random unused digit (for presence/absence, of course this can be 2) to just make this work?

Based on @uyedaj's response I was indeed going to suggest that. However, I am also going to argue that making this change is in fact the right thing to do, and not just a kludge to make this work.

Technically, as I verified earlier against the NeXML schema, the type of the symbol attribute for polymorphic_state_set is indeed xs:anySimpleType, so strings are allowed. That the NeXML validator complains about this is thus a problem with the validator, not the data file at hand. However, for StandardState and StandardStateSet the type of the symbol attribute is integer. I would argue that allowing this much room for confusion in implementations is a mistake in the NeXML schema itself. Furthermore, as per @uyedaj's answer, it's also rather inconsistent with common practice in comparative method tools (which I would say runs counter to the original design goals of NeXML, as reflected in the prefix of its name, another reason why arguably this should be regarded as a mistake in the schema).

One of our goals for the API is to facilitate more widespread adoption of our data and capabilities. One aspect of this should be to return data in lossless formats that can be as straightforwardly as possible converted into the input formats needed by existing tool ecosystems. In most cases, the latter means NEXUS. Obviously, NEXUS can't encode well all the information that constitutes our data value propositions, so isn't lossless and hence we don't output NEXUS but instead do NeXML or JSON-LD. But one of the beauties of the NeXML format is that changing a PolymorphicStateSet's symbol from "1 and 0" (a multi-character string) to "2" (a single digit integer) incurs no loss of information, because the states in the set are explicitly defined and not meant to be only (or even at all) parseable from the symbol.

So instead of asking every single consumer implementation to have to perform this substitution, we can make it for them, without loss of information in the data that we return.

The "1 and 0" symbol works really well for display in Phenex, which is one reason I'm reluctant to change.

But to quote from your own response:

As you might expect I would much prefer for tools that read NeXML to just handle it as designed. :-)

Arguably, it is Phenex that doesn't treat NeXML data as it is designed. Namely, it's Phenex that gives semantic significance to the value of the symbol of polymorphic state set (namely, expecting it to show the states in the set), rather than, if and where needed, extracting this information from the data as NeXML was designed to support it.

@rvosa
Copy link

rvosa commented Oct 4, 2018 via email

@balhoff
Copy link
Member

balhoff commented Oct 4, 2018

@hlapp:

Arguably, it is Phenex that doesn't treat NeXML data as it is designed. Namely, it's Phenex that gives semantic significance to the value of the symbol of polymorphic state set (namely, expecting it to show the states in the set), rather than, if and where needed, extracting this information from the data as NeXML was designed to support it.

Hey, now I think this is getting a bit twisted. 😆 I shouldn't be dinged for going the extra mile and generating a friendly label while staying within the spec.

I still think it's quite bad for readers of NeXML not to recognize polymorphic states. If that's not represented in their model they should do something like treat it as unknown instead of charging ahead. In NEXUS there is no additional state symbol generated anyway; a polymorphism would be represented as (0 1), which must be handled by the parser.

If we return polymorphisms as a third state symbol 2, I worry this opens the door to even more incorrect interpretations by incomplete parsers.

@hlapp
Copy link
Member

hlapp commented Oct 4, 2018

I shouldn't be dinged for going the extra mile and generating a friendly label while staying within the spec.

I'm not dinging anyone here 😄 However, I think it's important to realize that the symbol attribute isn't a label attribute. Labels are indeed mostly meant for human readers, and thus can be discarded without impacting utility or readability for tools. (For example, labels in NEXUS are comments for a reason.) The symbol for a state isn't a label though, it's the piece of information a tool will use to compute with the data. So the symbol first and foremost needs to be tool-friendly, not human-friendly.

I still think it's quite bad for readers of NeXML not to recognize polymorphic states.

You mean human readers of a NeXML file, machine readers of a NeXML file, or users of a tool that lets them inspect the data in a NeXML file?

I think quite firmly that human readers of NeXML files is not, or at least not an important use case. XML is not for human readability, and pretending otherwise just leads down bad paths. A machine reader of a NeXML file will need a single-character symbol. For a user of a tool for inspecting a NeXML file, the tool does have the information needed to construct a user-friendly label for a polymorphic state (do you disagree with that?).

If that's not represented in their model they should do something like treat it as unknown instead of charging ahead.

I think that's what @uyedaj was saying is what is typically done, right? I.e., most comparative tools can't distinguish (in the sense of computing differently) between polymorphic and uncertain states.

In NEXUS there is no additional state symbol generated anyway; a polymorphism would be represented as (0 1), which must be handled by the parser.

That's a property of the NeXML format, and I agree that tools converting NeXML to NEXUS should probably write out a polymorphic state that way. But what to do depends on the target of the conversion.

If we return polymorphisms as a third state symbol 2, I worry this opens the door to even more incorrect interpretations by incomplete parsers.

Possibly, so maybe it should be a question mark instead, as recommended by @uyedaj and as is the common symbol used for this?

@balhoff
Copy link
Member

balhoff commented Oct 23, 2018

Update—@hlapp has updated RNeXML to handle polymorphic states. I will update Phenoscape NeXML generation to use NEXUS-style polymorphism symbols (e.g. (0 1) for increased compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants