Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest filename escaping #46

Open
pwinckles opened this issue Feb 13, 2022 · 6 comments
Open

Manifest filename escaping #46

pwinckles opened this issue Feb 13, 2022 · 6 comments

Comments

@pwinckles
Copy link

pwinckles commented Feb 13, 2022

In regards to writing file paths in manifest files, the spec states the following:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

My reading of the intent of the spec is for the manifest files to be usable by unix checksum utilities. However, this percent-encoding requirement breaks compatibility. While CR and LF are rare to find in a file path, this encoding requirement becomes a problem because it necessitates the encoding of % too. It is fairly common to percent-encode a file name if you're worried about special characters. Per spec, these percent-encoded file names would then be double-encoded when written to the manifest, making the file unusable by checksum utilities.

I have browsed a large number of the existing BagIt implementations on GitHub, and I have yet to find a single implementation that implements this requirement correctly. Implementations either 1) do no encoding or 2) only encode CR and LF and do not encode %. The first behavior is broken for file names that contain an LF or CR and the second behavior is broken for file names that are naturally percent-encoded. And they're both broken for an actual implementation of the spec.

I am currently working on yet another implementation and it's hard to decide what to do here. If I implement the spec as written, my bags will be unusable any other implementation. This seems to suggest that I should ignore the spec and not encode anything, which is the more prevalent and less broken than doing the partial encoding.

Unix checksum utilities use a entirely different mechanism to handle newlines within file names. When there is a newline in a file name, the newline is represented as \n and a \ is added to the beginning of the line. Additionally, literal \ characters are escaped with another \ and a \ is also added to the beginning of the line.

For example, let's say that we have the file named new\nline (important, this must be an actual newline and not the characters \ and n) and one named back\slash, and then executed the following:

# On linux
$ sha256sum *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

# On mac
$ shasum -a 256 *
\e6f805fa5fc041ab4bb7aa119641f77ac3e9f42106bc9f92354080692736c8de  back\\slash
\7ba826f0c347f6adc4686c8d1f61aeb2e2e98322749cd4f82204c926f4022cee  new\nline

This seems like a much more reasonable encoding to support, though it is a shame that the output of these utilities is not codified.

@acdha
Copy link
Member

acdha commented Feb 16, 2022

Thanks for the report — this is a pretty busy week for me but I had a couple of initial thoughts. Clearly we need to add some additional tests to https://github.com/LibraryOfCongress/bagit-conformance-suite/!

One thing I was wondering is how many of these tools promise support for BagIt 1.0, and thus whether this might simply be treated as part of bringing an existing (likely 0.97-focused) tool into full support for the spec. I definitely would like to make a 1.0 update to bagit-python.

That language was introduced between 0.97 and the first 1.0 drafts, and I'm not sure whether which came first: this portion of the payload manifest and the same portion of the fetch.txt description which obviously is very URL-focused. I'd have to check the old discussion but I believe the intention was to avoid ambiguity by making it always be safe to run filenames through a URL decoding function even in cases where the original filename was itself URL-encoded (which is not uncommon in certain communities). Since most of the common escaping conventions are valid characters in filenames on at least some operating system and filesystem combinations (e.g. new\nline is a valid filename), I'm not sure there's a better option than working with each of the implementers to cover this case.

@pwinckles
Copy link
Author

pwinckles commented Feb 16, 2022

It appears the encoding was introduced here as a response to newline characters appearing in some file names, a case the 0.97 spec did not support. However, the draft language did not also include encoding %, which I think is where the confusion originated. Encoding the % is essential otherwise you would not be able to distinguish between the distinct files new\nline.txt and new%0Aline.txt.

All of the implementations that I opened issues for claim to support 1.0. With the exception of bagit-python, I did not open issues against 0.97 implementations.

If there ever is a next version of the BagIt spec, I think it would be nice if the escaping was handled the same way as checksum utilities. Compatibility with them is enormously useful.

@pwinckles
Copy link
Author

I might also point out that if implementations percent-decode by only decoding the CR, LF, and characters then they will remain mostly compatible with incorrect 1.0 implementations. Normally, when you percent-decode you decode any encoded character as described here. However, by not decoding every encoded character a correct 1.0 implementation could validate an unescaped path like test℅201.txt from a current implementation.

I don't think I have ever used a percent-encoding library that allows you to control the characters that are decoded, so doing this will likely require a custom implementation or a series of string search and replaces.

@richardrodgers
Copy link

See issue in my repo for response, in line with these remarks...

@acdha
Copy link
Member

acdha commented Feb 16, 2022

Link to that comment: richardrodgers/bagit#33 (comment)

@pwinckles
Copy link
Author

After spending some time discussing this with some coworkers, I see that I did a poor job succinctly stating a desired change to the spec.

The existing language:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a percent sign (%), those characters (and only those) MUST be percent-encoded following [RFC3986].

Should be changed to something like:

If a filepath includes a Line Feed (LF), a Carriage Return (CR), a Carriage-Return Line Feed (CRLF), or a backslash (\), then those characters MUST be replaced with the literal strings \n, \r, \r\n, and \\ respectively. Additionally, if any characters are replaced in a path, then the manifest entry line MUST be prefixed with a backslash (\). For example: \d8e8fca2dc0f896fd7cb4cb0031ba249 file-with\nnewline

This would make the BagIt format compatible with unix checksum utilities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants