XML that is well-formed according to xmllint reported as "Not well-formed" by JHOVE #310

Open
bitsgalore opened this issue Feb 22, 2018 · 12 comments
Assignees
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P2 Medium priority issues to be scheduled in a future release

Comments

@bitsgalore
Member

bitsgalore commented Feb 22, 2018

Dev Effort

1D - investigation

Description

Look at the following METS file:

mets-test.xml

According to xmllint (v. 20904) it contains well-formed XML:

xmllint --noout mets-test.xml

Result: no error messages (which means it is well-formed)

Next I try to validate the same file with JHOVE (v. 1.18.1):

jhove -m XML-hul mets-test.xml

Result:

Jhove (Rel. 1.18.1, 2017-11-30)
 Date: 2018-02-22 12:53:55 CET
 RepresentationInformation: mets-test.xml
  ReportingModule: XML-hul, Rel. 1.4 (2007-01-08)
  LastModified: 2018-02-21 17:31:20 CET
  Size: 106387
  Format: XML
  Status: Not well-formed
  SignatureMatches:
   XML-hul
  ErrorMessage: XML document structures must start and end within the same entity.: Line = 1, Column = 97
  MIMEtype: text/xml

So according to JHOVE the file is not well-formed! The error message is even more puzzling, as the position corresponds to a namespace definition.

The strangest thing of all is that some XML documents that JHOVE reported as well-formed only yesterday suddenly give me the above error today! I initially suspected something weird going on in my JHOVE configuration, but after uninstalling and reinstalling JHOVE and checking on 2 different machines (one Windows, one Linux) I keep getting the above error for several XML documents that passed well-formedness checks earlier on. Or am I overlooking something obvious here?

@anjackson
Member

Just to make things more confusing, I downloaded it and it works fine for me!

Jhove (Rel. 1.14.0, 2016-10-06)
 Date: 2018-02-22 12:49:27 GMT
 RepresentationInformation: /Users/andy/Downloads/mets-test.xml
  ReportingModule: UTF8-hul, Rel. 1.6 (2014-07-18)
  LastModified: 2018-02-22 12:49:17 GMT
  Size: 106387
  Format: UTF-8
  Status: Well-Formed and valid
  MIMEtype: text/plain; charset=UTF-8
  UTF8Metadata: 
   Characters: 106386
   UnicodeBlocks: Basic Latin, Latin-1 Supplement
   LineEndings: LF

But note that's JHOVE 1.14.6 (from Homebrew).

Is the real problem the fact that it's failing to download the XSDs? e.g. because it's not picked up the proxy?

@bitsgalore
Member Author

@anjackson I think you're on to something: I just re-ran JHOVE on that file (and some other ones that were giving me this problem) and it's now working for me as well (same JHOVE versions, same files, on both the Windows and Linux machines)! So yes, the cause might well be JHOVE failing to download the XSDs. However, if this is so, I'd really expect JHOVE to tell me this instead of marking the file as "Not well-formed" (for one thing, the XSDs are not needed to check for well-formedness).
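To illustrate the point that no XSD is needed for a well-formedness check, here is a minimal sketch using only Python's standard library. It is not how JHOVE's XML-hul module works; `check_well_formed` is a hypothetical helper name, and the parser used here never reads a schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(data: bytes):
    """Return (True, None) if data parses as XML, else (False, message).

    Only the document itself is parsed; no schema is consulted.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        # str(exc) includes the parser's line and column information.
        return False, str(exc)
    return True, None


# A complete document passes; a truncated one does not.
print(check_well_formed(b'<mets xmlns="http://www.loc.gov/METS/"></mets>'))
print(check_well_formed(b'<mets xmlns="http://www.loc.gov/METS/">'))
```

A check like this should give the same answer whether or not the schema server is reachable, which is the behaviour expected from a "well-formed" status.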

@anjackson
Member

@bitsgalore I can't remember the details, but I've hit problems before with JHOVE giving really weird errors when remote XSDs have not been available. One of those times when I wondered what the advantage of running JHOVE is, compared to xmllint/etc.

@bitsgalore
Member Author

Update to the above: as an additional test I re-ran JHOVE after disabling my network connection. I expected that this would reproduce my original error. Instead, JHOVE parsed the document correctly and reported it as "Well-formed, but not valid", indicating in an InfoMessage that the schema could not be read. Which only makes things even more puzzling ...

@anjackson As for the added value of JHOVE over xmllint: xmllint doesn't automatically fetch the XSDs, so you have to specify an XSD on the command-line (I think you even need to download a local copy of the XSD, but I'm not 100% sure; also I don't remember how/if xmllint handles multiple XSD definitions). This makes xmllint a massive pain in the ass with things like METS files, and JHOVE makes handling these a lot easier. That is, until you end up running into weird problems like this one!

@anjackson
Member

@bitsgalore in that case, xmlstarlet val FTW!

@marhop
Member

marhop commented Feb 22, 2018

@anjackson may I point out that you accidentally used the UTF8-hul, not XML-hul? So nothing about XML validity here.

But anyway, I also validated the file with JHOVE 1.18.1 without web access, i.e. with no schema files available. It works as expected: JHOVE reports the file to be well-formed, but not valid.

@anjackson
Member

anjackson commented Feb 22, 2018

Hah! Thanks @marhop - JHOVE's behaviour is still confusing me after all these years. You'd think I'd know by now.

EDIT: Note that the times I've had trouble with JHOVE downloading XSDs were not when they were simply unavailable (that's fine), but when the server returned something that is not XML.
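The "server returns not-XML" case can be told apart from the "schema unavailable" case before the response is handed to a validator. The sketch below is an assumption about how one might guard against it in Python, not something JHOVE does; `is_parseable_xsd` is a hypothetical helper that takes the already-downloaded response body as bytes.

```python
import xml.parsers.expat

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def is_parseable_xsd(payload: bytes) -> bool:
    """True if payload is well-formed XML whose root element is xs:schema."""
    roots = []

    def on_start(name, attrs):
        # Remember only the first (root) element name.
        if not roots:
            roots.append(name)

    # With a namespace separator, names are reported as "<uri> <localname>".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    try:
        parser.Parse(payload, True)
    except xml.parsers.expat.ExpatError:
        return False
    return bool(roots) and roots[0] == f"{XSD_NS} schema"


# An HTML error page served in place of the schema is rejected.
error_page = b"<html><body><h1>500 Internal Server Error</h1><br></body></html>"
print(is_parseable_xsd(error_page))
```

Reporting such a response as "schema could not be used" keeps a broken download from being mistaken for a problem in the document being validated.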

@bitsgalore
Member Author

@marhop good call, I had overlooked that in @anjackson's answer as well!

@anjackson just had a look at xmlstarlet, but just like xmllint it needs a reference to the schema as a command line arg. Also it's not clear to me how it handles multiple schemas (if at all?).

@anjackson
Member

anjackson commented Feb 22, 2018

@bitsgalore Really!? Sorry, I thought I'd checked that, although it was a long time ago. My apologies for misremembering.

@bitsgalore
Member Author

@anjackson no prob. Incidentally it does handle remote XSDs (and so does xmllint, I now see); looks like both tools are really similar.

@tledoux
Contributor

tledoux commented Feb 22, 2018

Some more comments.
Indeed, from my experience, xmllint and xmlstarlet don't cope very well with multiple schemas. One way to make them work is to create a wrapper.xsd that imports all the namespaces you need and then call xmllint with the schema option.
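A wrapper schema of this kind can be generated instead of written by hand. The following is a rough sketch under stated assumptions: `build_wrapper_xsd` is a hypothetical helper, and the local paths under `schemas/` are placeholders for wherever the downloaded XSD copies actually live.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def build_wrapper_xsd(imports: dict) -> str:
    """Build a schema containing one xs:import per namespace.

    imports maps a namespace URI to the location of its XSD.
    """
    ET.register_namespace("xs", XSD_NS)
    root = ET.Element(f"{{{XSD_NS}}}schema")
    for namespace, location in imports.items():
        ET.SubElement(
            root,
            f"{{{XSD_NS}}}import",
            {"namespace": namespace, "schemaLocation": location},
        )
    return ET.tostring(root, encoding="unicode")


# Placeholder local paths; adjust to the actual schema copies.
wrapper = build_wrapper_xsd({
    "http://www.loc.gov/METS/": "schemas/mets.xsd",
    "http://www.w3.org/1999/xlink": "schemas/xlink.xsd",
})
print(wrapper)
```

Saved as wrapper.xsd, the result could then be passed to xmllint, for example `xmllint --noout --schema wrapper.xsd mets-test.xml`.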

Moreover, relying on external URLs to validate XML files is not very safe: currently, the loc site is having problems (I get error 500 when asking for http://www.loc.gov/standards/mets/mets.xsd).
The best way to handle schemas is to have a local copy of every XSD and implement a catalog.
In JHOVE, you can configure that in jhove.conf:

 <module>
  <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
  <param>withTextMD=true</param>
  <param>schema=http://www.example.com/schema;/home/schemas/exampleschema.xsd</param>
 </module>
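When many schemas are mirrored locally, the `<param>` lines can be generated from a mapping rather than typed by hand. This is only a sketch: `schema_params` is a hypothetical helper, and it assumes nothing beyond the `schema=<URI>;<local path>` form shown in the snippet above.

```python
from xml.sax.saxutils import escape


def schema_params(local_copies: dict) -> list:
    """Return one <param> line per URI-to-local-path mapping."""
    return [
        # escape() protects characters such as & that may occur in a URI.
        f"<param>schema={escape(uri)};{escape(path)}</param>"
        for uri, path in local_copies.items()
    ]


for line in schema_params({
    "http://www.example.com/schema": "/home/schemas/exampleschema.xsd",
}):
    print(line)
```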

This is roughly documented here and probably more should be done here.

FWIW, for XML files, we use JHOVE to extract information (the textMD structure) and then we validate the XML with Xerces coupled with a catalog resolver, fed with the schemas we have decided to import locally. An XML file using an unknown schema is just checked for well-formedness.
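The decision described here (validate against local copies when the schema is known, otherwise check well-formedness only) can be sketched roughly as follows. This is not the Xerces and catalog resolver setup itself; `plan_validation` and `LOCAL_CATALOG` are hypothetical names, the local path is a placeholder, and only the `xsi:schemaLocation` attribute on the root element is inspected.

```python
import xml.etree.ElementTree as ET

XSI_SCHEMA_LOCATION = (
    "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"
)

# Hypothetical catalog: namespace URI -> local copy of the XSD.
LOCAL_CATALOG = {"http://www.loc.gov/METS/": "/home/schemas/mets.xsd"}


def plan_validation(data: bytes, catalog: dict):
    """Split the namespaces a document declares into known and unknown.

    Raises ET.ParseError if the document is not well-formed.
    """
    root = ET.fromstring(data)
    # schemaLocation holds whitespace-separated (namespace, location) pairs.
    tokens = root.get(XSI_SCHEMA_LOCATION, "").split()
    namespaces = tokens[0::2]
    known = {ns: catalog[ns] for ns in namespaces if ns in catalog}
    unknown = [ns for ns in namespaces if ns not in catalog]
    return known, unknown
```

If `unknown` is non-empty, the file would be reported as well-formed only; otherwise the paths in `known` would be handed to the validator in place of the remote URLs.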

@ghost ghost added the bug A product defect that needs fixing label Mar 9, 2018
@ghost ghost added this to the Release v1.20 milestone Mar 9, 2018
@carlwilson carlwilson removed this from the Release v1.20 milestone Feb 28, 2019
@ghost ghost added the P2 Medium priority issues to be scheduled in a future release label Feb 28, 2019
@ghost ghost assigned ghost and carlwilson and unassigned ghost Feb 28, 2019
@ghost ghost added this to the Hack week initiation milestone Feb 28, 2019
@ghost ghost unassigned carlwilson Feb 28, 2019
@carlwilson carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020
@carlwilson carlwilson self-assigned this Jun 27, 2022
@carlwilson carlwilson modified the milestones: Hackathon tasks , OPF Hackathon 2023 Tasks Jun 21, 2023
@carlwilson carlwilson removed this from the OPF Hackathon 2023 Tasks milestone Mar 6, 2024

6 participants