Don't allow multiple incomplete manifests (was: Make explicit that incomplete manifest are permitted) #6

Closed
stain opened this issue Jul 29, 2015 · 17 comments


@stain

stain commented Jul 29, 2015

3.4 says

  1. Every file in every payload manifest MUST be present.
  2. Every payload file MUST be listed in at least one manifest.
    Payload files MAY be listed in more than one payload manifest.

so does that mean it is perfectly valid to have payload files mentioned in only some manifests? E.g.

manifest-md5.txt

e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt

manifest-sha1.txt

24f5c8113eade58f5d5fa1b38fbbe1076ad3408a data/file2.txt

That means it is impossible to validate a bag without going through ALL the manifest-* files, if for nothing else than to check that there is no spurious extra file listed in a manifest that the consumer is not able to checksum against. (At least the file format of the manifest files is given :)

What is the purpose of allowing multiple manifest files? I would assume it would be to allow consumers to pick one they support and validate the content of the bag - rather than for producers that update an existing bag without caring about existing manifest files. If this is the case, then I would hope for all manifests to be complete and list the same payload files (but not necessarily in the same order.)

This applies as well to the tag manifests, if present.
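
To make the concern concrete, here is a minimal Python sketch of the two-manifest example above. The helper name `parse_manifest` is hypothetical, and it assumes each manifest line is a checksum, some whitespace, then a path.

```python
def parse_manifest(text):
    """Map each listed path to its checksum.

    Assumes 'checksum<whitespace>path' lines; blank lines are skipped.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, path = line.split(None, 1)
        entries[path.strip()] = checksum
    return entries


# The two "split" manifests from the example above.
md5_manifest = "e2b25c29051b9d6f0ae7ef5b12a3251b data/file1.txt\n"
sha1_manifest = "24f5c8113eade58f5d5fa1b38fbbe1076ad3408a data/file2.txt\n"

md5_files = set(parse_manifest(md5_manifest))
sha1_files = set(parse_manifest(sha1_manifest))

# Neither manifest alone names every payload file; only the union does.
all_files = md5_files | sha1_files
```

A consumer that reads only manifest-md5.txt would see one file here; the full listing needs both manifests.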

@edsu
Contributor

edsu commented Jul 29, 2015

Checking all the manifests present is important since the characteristics of checksum algorithms are not the same. If someone goes through the effort to provide more than one, I think it's correct to require checking them all before saying the bag is valid.

@stain
Author

stain commented Jul 29, 2015

Why should it be correct to list one payload file in one incomplete manifest, and another file listed in a second incomplete manifest? So there is not a single manifest that lists the payload? What if I just want to know all the payload files?

If I receive a bag with manifest-md5.txt and manifest-sha512.txt, but say I'm on Windows 10 and thought it was an achievement just to be able to check MD5s (actually, Get-FileHash can do pretty much anything), I would still have to parse manifest-sha512.txt as well to see whether it lists any files not mentioned in manifest-md5.txt (files whose checksums I don't have the implementation to check), just to know whether the bag is complete and to build a complete file listing that I can present to the user.
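
A rough sketch of what such a consumer ends up doing, in Python. It assumes manifest names follow `manifest-<algorithm>.txt` and that the algorithm token maps directly onto a name known to `hashlib`, which may not hold for every algorithm. The function names are hypothetical.

```python
import hashlib
import re


def algorithm_of(manifest_name):
    """Extract the algorithm token from a name like 'manifest-sha512.txt'."""
    match = re.fullmatch(r"manifest-([A-Za-z0-9]+)\.txt", manifest_name)
    return match.group(1).lower() if match else None


def split_by_support(manifest_names):
    """Separate manifests we can verify from those we can only read for paths."""
    verifiable, listing_only = [], []
    for name in manifest_names:
        if algorithm_of(name) in hashlib.algorithms_available:
            verifiable.append(name)
        else:
            listing_only.append(name)
    return verifiable, listing_only


def checksum(data, algorithm):
    """Hex digest of a bytes payload using the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()
```

Even the manifests that land in `listing_only` still have to be parsed for their paths if split manifests are allowed, which is the cost being described.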

@edsu
Contributor

edsu commented Jul 30, 2015

I'm a little bit confused by your example. What would said implementation do if a bag only contained a manifest-md5.txt?

One reason for providing multiple manifests is that hashing algorithms have different characteristics: speed, collisions, etc. Perhaps a piece of software is doing a quick audit of a bag (randomly picking payload files to check) and md5 is good enough. But when it comes to validating a bag it seems fair to require a BagIt implementation check all the manifests.

I honestly don't remember why creating a union of the manifests is required in order to know what payload files are present in the bag. I wonder if one or more of {@Ardvaark, @justinlittman, @andyboyko, @jkunze, @dbrunton} can remember? Perhaps Postel's Law was invoked...

I agree that it would simplify clients that aren't doing validation, and just want a list of the payload files, if they could expect a given manifest to contain a full listing. But at this stage I don't think this is a strong enough reason to potentially invalidate existing bags that don't do this.

@justinlittman

One of our early uses of BagIt was for National Digital Newspaper Program content. In our environment we were extracting existing fixities from METS which used one algorithm and creating fixities for additional files in another algorithm (our default). Hence the union.

@Ardvaark

Ardvaark commented Aug 3, 2015

That is to say, we were constructing a bag using metadata from multiple inputs, each with different fixity algorithms, and we wanted to take a use-what-you-have approach that didn't require us to recompute the missing fixities, very much in keeping both with BagIt's minimalist approach and our own inherent laziness as programmers.

@Ardvaark

Ardvaark commented Aug 3, 2015

And with that said, I vote to close this issue as NTBF, both because it was an intentional design choice and because changing it would retroactively invalidate otherwise valid and validatable bags, both imagined and real.

@edsu
Contributor

edsu commented Aug 3, 2015

👍

@stain
Author

stain commented Aug 3, 2015

Ok, so if the consensus is that it is deliberately allowed to have that kind of "split" manifests, with some files listed only in some manifests, then I will close this.

I assume a given BagIt profile can still be stricter about this :)

@stain stain closed this as completed Aug 3, 2015
@stain
Author

stain commented Aug 6, 2015

I think I'll reopen this. If split manifests are deliberately allowed, then this should be explicitly mentioned in the specification. That is, something like:

Note: Individual payload manifest files do not necessarily list every payload file, even if the bag is considered complete, which only requires that every file in the payload is mentioned in at least one manifest file. This means that the complete list of payload files can only be found by inspecting all manifest files.

The complete list of payload files of a complete bag can be found by concatenating all manifest-*.txt files and extracting the file path, skipping any duplicates by comparing the byte value of the file paths.

Note that as it is not a requirement for every tag file to be mentioned in the tag manifests, a similar complete list cannot be produced for tag files.
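
The concatenate-and-deduplicate procedure suggested above could look roughly like this in Python. This is only a sketch: `payload_paths` is a hypothetical name, and it assumes manifest lines are a checksum, ASCII whitespace, then the path. Paths are kept as raw bytes so that duplicates are detected by byte value, with no Unicode normalization.

```python
from pathlib import Path


def payload_paths(bag_dir):
    """Union of payload paths across manifest-*.txt, compared as raw bytes."""
    seen = set()
    ordered = []
    # The pattern matches payload manifests only, not tagmanifest-*.txt.
    for manifest in sorted(Path(bag_dir).glob("manifest-*.txt")):
        for line in manifest.read_bytes().splitlines():
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            path = parts[1].strip()
            if path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered
```

The result is a list of `bytes` values in first-seen order; decoding them with the bag's declared encoding is left to the caller.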

...I suggest byte-value comparison rather than Unicode normalization of file names. This is possible because the bag encoding sets a uniform encoding for all tag files. However, with bag encodings like UTF-8 it is possible to have multiple "similar" file names, like:

e2b25c29051b9d6f0ae7ef5b12a3251b data/première.txt
9051b9d6f0ae7ef5b12a3251be2b25c2 data/première.txt

...but this should be addressed in a separate issue. Some file systems, such as the default one on OS X, always normalize file names, so a bag like the above would immediately become invalid on unpacking.
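
As an illustration of why the two names differ at the byte level, here is a small Python sketch using the standard `unicodedata` module. The two strings are my own spelling-out of the composed and decomposed forms, written with escapes so they survive copy and paste.

```python
import unicodedata

# 'è' as a single code point (the NFC, or composed, form).
composed = "data/premi\u00e8re.txt"
# 'e' followed by a combining grave accent (the NFD, or decomposed, form).
decomposed = "data/premie\u0300re.txt"

# Byte-value comparison treats these as two different file names.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# After normalizing to NFC they become the same string.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

Here `same_bytes` is false and `same_after_nfc` is true, which is exactly the situation where a normalizing file system would collapse two manifest entries onto one file.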

Note: This comment does itself suffer from broken Unicode normalization at GitHub, as première.txt above might be rendered with ' over the r rather than the e.

@stain stain reopened this Aug 6, 2015
@stain stain changed the title Don't allow multiple incomplete manifests Make explicit that incomplete manifest are permitted (was: Don't allow multiple incomplete manifests) Aug 6, 2015
stain added a commit to stain/bagit-profiles that referenced this issue Aug 6, 2015
See jkunze/bagitspec#6 - I think a profile should be able to restrict this so that a consumer can rely on a shortlist of manifests as complete manifests.
@edsu
Contributor

edsu commented Aug 6, 2015

I thought it was explicitly mentioned in the specification. That's how you learned about it!

@stain
Author

stain commented Aug 6, 2015

It is just implicit:

complete: A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, all payload files present listed in at least one manifest. See Section 3.

  1. Every file in every payload manifest MUST be present.
  2. Every payload file MUST be listed in at least one manifest.
    Payload files MAY be listed in more than one payload manifest.

You have to be more of a logician to see from this that "Aha, so that means I am allowed to NOT list a file in SOME of the manifests."

If you simply want to consume or parse BagIt, then you could quickly misread this as saying that all payload files must be listed and there should be at least one manifest (misread as "oh, so they MUST be listed there then"), and then simply gloss over the fact that this implicitly permits any or all of the manifests to be individually incomplete.

The last sentence, "Payload files MAY be listed in more than one payload manifest" can be read in two ways:

  • There MAY be more than one payload manifest, each of which lists (all) payload files
  • For each payload file, that file MAY be listed in more than one payload manifest file.
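
Spelled out as set operations, the second reading of the rules looks something like the Python sketch below. The function names and the dict-of-sets input shape are hypothetical, chosen only to mirror rules 1 and 2 as quoted.

```python
def is_complete(payload_files, manifests):
    """Check the two quoted rules against the union of all manifests.

    payload_files: set of paths actually present in the payload
    manifests: dict mapping manifest name -> set of paths it lists
    """
    listed = set().union(*manifests.values())
    every_listed_file_present = listed <= payload_files  # rule 1
    every_payload_file_listed = payload_files <= listed  # rule 2
    return every_listed_file_present and every_payload_file_listed


def incomplete_manifests(payload_files, manifests):
    """Names of manifests that, on their own, omit some payload file."""
    return sorted(
        name for name, paths in manifests.items()
        if not payload_files <= paths
    )
```

With the split example from the top of this issue, `is_complete` returns true while `incomplete_manifests` names both manifests, which is the implicit permission being discussed.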

stain added a commit to stain/bagit-profiles that referenced this issue Aug 6, 2015
See jkunze/bagitspec#6 - I think a profile should be able to restrict this so that a consumer can rely on a shortlist of manifests as complete manifests.
stain added a commit to stain/bagitspec that referenced this issue Aug 6, 2015
This makes the incomplete manifests explicit (issue jkunze#6).
stain added a commit to stain/bagitspec that referenced this issue Aug 6, 2015
This makes the incomplete manifests explicit (issue jkunze#6).

Also partially fixes jkunze#8 as fetch.txt content should be in the manifest
@johnscancella

@stain with the new proposal it does indeed require that ALL payload files be listed in ALL manifests
#17

@justinlittman

Hey @johnscancella - Are you sure the NDNP bags won't be broken by requiring ALL payload files be listed on ALL manifests?

@johnscancella

@justinlittman yes, because they are an old version and will follow that specification. But going forward you will have to list all files in all manifests.

@stain
Author

stain commented Mar 29, 2018

I think my original issue will be solved by #19 having "Every payload manifest MUST list every payload file exactly once." which I am quite pleased with. As for backwards compatibility this will presumably be BagIt-Version: 1.0 or so?

I'll close this when merged.

@stain stain changed the title Make explicit that incomplete manifest are permitted (was: Don't allow multiple incomplete manifests) Don't allow multiple incomplete manifests (was: Make explicit that incomplete manifest are permitted) Mar 29, 2018
@acdha

acdha commented Mar 29, 2018

@stain That's my advice to implementers and the test case in our compliance suite has it as invalid only for 1.0: https://github.com/LibraryOfCongress/bagit-conformance-suite/tree/master/v1.0/invalid/notAllManifestsListAllFiles

stain pushed a commit to stain/bagitspec that referenced this issue Apr 20, 2018
Be more specific than "SHA-2 family"
@stain
Author

stain commented Mar 8, 2019

Closing as implemented in https://tools.ietf.org/html/rfc8493#section-2.1.3

  • Every payload manifest MUST list every payload file name exactly once
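
For implementers, a per-manifest check of this rule might be sketched as follows in Python. `manifest_violations` is a hypothetical helper, not part of any BagIt library, and its `not_in_payload` entry covers the separate requirement that listed files be present.

```python
from collections import Counter


def manifest_violations(payload_files, listed_paths):
    """Compare one manifest's listed paths with the payload.

    payload_files: iterable of paths present in the payload
    listed_paths: list of paths from a single manifest, duplicates kept
    """
    payload = set(payload_files)
    counts = Counter(listed_paths)
    return {
        # Payload files this manifest fails to list.
        "missing": sorted(payload - set(counts)),
        # Paths listed more than once in this manifest.
        "duplicated": sorted(p for p, n in counts.items() if n > 1),
        # Listed paths with no corresponding payload file.
        "not_in_payload": sorted(set(counts) - payload),
    }
```

A manifest satisfies the rule when both `missing` and `duplicated` come back empty; under the older drafts only the union of manifests had to cover the payload.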

@stain stain closed this as completed Mar 8, 2019