Don't allow multiple incomplete manifests (was: Make explicit that incomplete manifest are permitted) #6
Comments
Checking all the manifests present is important since the characteristics of checksum algorithms are not the same. If someone goes through the effort to provide more than one, I think it's correct to require checking them all before saying the bag is valid. |
Why should it be correct to list one payload file in one incomplete manifest, and another file listed in a second incomplete manifest? So there is not a single manifest that lists the payload? What if I just want to know all the payload files? If I receive a bagit with |
I'm a little bit confused by your example. What would said implementation do if a bag only contained a

One reason for providing multiple manifests is that hashing algorithms have different characteristics: speed, collisions, etc. Perhaps a piece of software is doing a quick audit of a bag (randomly picking payload files to check) and md5 is good enough. But when it comes to validating a bag it seems fair to require a BagIt implementation check all the manifests.

I honestly don't remember why creating a union of the manifests is required in order to know what payload files are present in the bag. I wonder if one or more of {@Ardvaark, @justinlittman, @andyboyko, @jkunze, @dbrunton} can remember? Perhaps Postel's Law was invoked...

I agree that it would simplify clients that aren't doing validation, and just want a list of the payload files, if they could expect a given manifest to contain a full listing. But at this stage I don't think this is a strong enough reason to potentially invalidate existing bags that don't do this. |
One of our early uses of BagIt was for National Digital Newspaper Program content. In our environment we were extracting existing fixities from METS which used one algorithm and creating fixities for additional files in another algorithm (our default). Hence the union. |
That is to say, we were constructing a bag using metadata from multiple inputs, each with different fixity algorithms, and we wanted to take a use-what-you-have approach that didn't require us to recompute the missing fixities, very much in keeping both with BagIt's minimalist approach and our own inherent laziness as programmers. |
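For readers who want to see what the "union" means in practice, here is a minimal sketch, assuming manifests are plain text with one `<checksum> <path>` pair per line. The function names and the sample file names are hypothetical and not taken from any BagIt library.

```python
def parse_manifest(text):
    """Parse '<checksum> <path>' lines into a {path: checksum} dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum.lower()
    return entries


def listed_payload(manifests):
    """Union of payload paths over all manifests ({algorithm: manifest text})."""
    listed = set()
    for text in manifests.values():
        listed.update(parse_manifest(text))
    return listed


# Hypothetical "split" bag: each file has a fixity in only one algorithm.
manifests = {
    "md5": "d41d8cd98f00b204e9800998ecf8427e data/from-mets.xml\n",
    "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709 data/added-later.txt\n",
}
print(sorted(listed_payload(manifests)))
```

With split manifests like these, neither file alone gives the payload listing; only the union does.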
And with that said, I vote to close this issue as NTBF, both because it was an intentional design choice and because it would retroactively invalidate otherwise valid and validatable bags, both imagined and real. |
👍 |
OK, so if the consensus is that this kind of "split" manifest, with some files listed only in some manifests, is deliberately allowed, then I will close this. I assume a given BagIt profile can still be stricter about this :) |
I think I'll Reopen this - I think if split manifests are deliberately allowed, then this should be explicitly mentioned in the specification. That is, something like:
...I suggest byte-value comparison rather than Unicode normalization of filenames. This is possible because of the bag encoding that sets a uniform encoding for all tagfiles. However with bag encodings like utf-8 it is possible to have multiple "similar" filenames, like 👍
.. but this should be addressed in a separate issue - some filesystems like OS X will always normalize filenames, therefore a bag like the above would immediately become invalid on unpacking. Note: This comment does itself suffer from broken Unicode normalization at GitHub, as |
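To illustrate the difference between byte-value comparison and Unicode normalization, here is a small sketch. The two file names are made-up examples, and UTF-8 is assumed as the bag encoding.

```python
import unicodedata

# Two visually identical names (hypothetical examples):
composed = "data/caf\u00e9.txt"      # 'é' as a single code point (NFC form)
decomposed = "data/cafe\u0301.txt"   # 'e' followed by a combining acute accent (NFD form)


def same_bytes(a, b, encoding="utf-8"):
    """Byte-value comparison under the bag's declared encoding."""
    return a.encode(encoding) == b.encode(encoding)


def same_normalized(a, b, form="NFC"):
    """Comparison after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_bytes(composed, decomposed))
print(same_normalized(composed, decomposed))
```

Under byte comparison these are two distinct payload paths; a filesystem that normalizes names on unpacking would collapse them into one.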
See jkunze/bagitspec#6 - I think a profile should be able to restrict this so that a consumer can rely on a shortlist of manifests as complete manifests.
I thought it was explicitly mentioned in the specification. That's how you learned about it! |
It is just implicit:
You have to be something of a logician to see from this that "Aha, so that means I am allowed NOT to list a file in SOME of the manifests." If you simply want to consume or parse BagIt, you could quickly misread this as: all payload files must be listed, there should be at least one manifest (misread as "oh, so they MUST be listed there then"), and then gloss over the fact that this implicitly permits any or all of the manifests to be individually incomplete. The last sentence, "Payload files MAY be listed in more than one payload manifest", can be read in two ways:
This makes the incomplete manifests explicit (issue jkunze#6).
Hey @johnscancella - Are you sure the NDNP bags won't be broken by requiring ALL payload files be listed on ALL manifests? |
@justinlittman yes, because they are an old version and will follow that specification. But going forward you will have to list all files in all manifests. |
I think my original issue will be solved by #19 having "Every payload manifest MUST list every payload file exactly once." which I am quite pleased with. As for backwards compatibility this will presumably be I'll close this when merged. |
@stain That's my advice to implementers and the test case in our compliance suite has it as invalid only for 1.0: https://github.com/LibraryOfCongress/bagit-conformance-suite/tree/master/v1.0/invalid/notAllManifestsListAllFiles |
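A rough sketch of how an implementation might check the stricter rule quoted above ("Every payload manifest MUST list every payload file exactly once"). The function names and the report layout are assumptions for illustration, not part of the conformance suite or any BagIt library.

```python
from collections import Counter


def manifest_paths(text):
    """Return the path column of a '<checksum> <path>' manifest, in order."""
    return [line.split(None, 1)[1].strip()
            for line in text.splitlines() if line.strip()]


def check_complete(manifests, payload_files):
    """Return {algorithm: issues} for every manifest that breaks the rule."""
    payload = set(payload_files)
    problems = {}
    for algorithm, text in manifests.items():
        counts = Counter(manifest_paths(text))
        listed = set(counts)
        issues = {
            "missing": sorted(payload - listed),     # payload files not listed
            "extra": sorted(listed - payload),       # listed but not in payload
            "duplicated": sorted(p for p, n in counts.items() if n > 1),
        }
        if any(issues.values()):
            problems[algorithm] = issues
    return problems


# Hypothetical bag where the md5 manifest omits one payload file.
example = {
    "md5": "aaa data/a.txt\n",
    "sha1": "bbb data/a.txt\nccc data/b.txt\n",
}
print(check_complete(example, ["data/a.txt", "data/b.txt"]))
```

An empty result means every manifest is complete; under the older "union" reading, the example bag would still have been acceptable.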
Be more specific than "SHA-2 family"
Closing as implemented in https://tools.ietf.org/html/rfc8493#section-2.1.3
3.4 says
so does that mean it is perfectly valid to have payload files mentioned in only some manifests? E.g.
manifest-md5.txt
manifest-sha1.txt
That means it is impossible to validate a bag without going through ALL the
manifest-*
files, if only to check that no manifest lists a spurious extra file that the consumer cannot checksum against (at least the file format of the manifest is given :)

What is the purpose of allowing multiple manifest files? I would assume it is to allow consumers to pick one they support and validate the content of the bag, rather than to let producers update an existing bag without caring about existing manifest files. If this is the case, then I would hope for all manifests to be complete and list the same payload files (but not necessarily in the same order).
This applies as well to the tag manifests, if present.
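The "pick one you support" use case could look roughly like this. This is a sketch only: the preference order, the function names, and the `read_bytes` callback are assumptions, and it only makes sense if the chosen manifest is complete.

```python
import hashlib


def pick_manifest(manifest_names, preferred=("sha512", "sha256", "sha1", "md5")):
    """Return (algorithm, filename) for the first preferred, supported manifest."""
    available = {}
    for name in manifest_names:
        if name.startswith("manifest-") and name.endswith(".txt"):
            algorithm = name[len("manifest-"):-len(".txt")]
            available[algorithm] = name
    for algorithm in preferred:
        if algorithm in available and algorithm in hashlib.algorithms_available:
            return algorithm, available[algorithm]
    return None


def verify(entries, read_bytes, algorithm):
    """entries: {path: expected hex digest}; read_bytes: callable path -> bytes."""
    return {
        path: hashlib.new(algorithm, read_bytes(path)).hexdigest() == expected.lower()
        for path, expected in entries.items()
    }


choice = pick_manifest(["manifest-md5.txt", "manifest-sha1.txt", "bagit.txt"])
print(choice)
```

If manifests may be incomplete, a consumer cannot stop at the chosen manifest and has to read the others as well, which is the concern raised in this issue.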