This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

Handling of non-utf-8 posix filenames #3

Closed
kevina opened this issue Oct 30, 2017 · 43 comments

Comments

@kevina
Contributor

kevina commented Oct 30, 2017

Based on what I gather from #1, one of the requirements is to be able to back up a Unix filesystem onto IPFS. To do this it is important to be able to handle any valid Unix/POSIX filename. Unix filenames may contain any byte except null (0x00) and /; there is no guarantee that they are valid UTF-8 strings, even if that is now the best practice.

So the question is how to handle them in the CBOR encoding. The simplest option would be to (1) just make them CBOR byte strings and be done with it. This could present a problem when converting to JSON.

A slightly more complicated option is to (2) make use of the text/byte distinction in CBOR: if a string is valid UTF-8 then it MUST be encoded as a text string; if not, it is encoded as a byte string. I use the word MUST (in the RFC sense) so that there is only one way to encode a given filename. Given this option a UTF-8 string can be encoded in JSON as is, but a byte string still needs special treatment.

So the question is how to handle non-valid UTF-8 bytes in JSON. As I see it there are several options:

  (j1) Encode the bytes using the non-standard JSON escape sequence \0x##, where ## is the hex value of the byte, for example 0x77.
  (j2) Somehow encode the bytes in Unicode itself. One option would be for / to act as a marker that the next character is a literal byte and not a UTF-8 character; for example the byte 0x77 could be encoded as /\u0077.
  (j3) If we elect to make use of the CBOR byte/text distinction, then any filename that is bytes can just be encoded as a BASE64 string.
  (j4) A slightly more compact version of (j3) is to assume the non-UTF-8 string is ISO-8859-1 and convert it to Unicode; when decoding it will be converted back from UTF-8 to ISO-8859-1 with no loss of information.

For (j3) or (j4) there will need to be a way to signal that the string should not be interpreted as UTF-8 in JSON. One idea I have is to start the string with a '/', as that cannot appear in any filename.
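As a sketch of what option (2) combined with (j3) could look like in practice (Python; the function names and the leading-'/' marker convention are illustrative, not part of any spec):

```python
import base64

def filename_to_json(name: bytes) -> str:
    """Valid UTF-8 names pass through unchanged; anything else is
    base64-encoded and prefixed with '/', which is safe as a marker
    because '/' can never occur inside a single path component."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return "/" + base64.b64encode(name).decode("ascii")

def filename_from_json(s: str) -> bytes:
    """Inverse mapping: a leading '/' signals base64-encoded raw bytes."""
    if s.startswith("/"):
        return base64.b64decode(s[1:])
    return s.encode("utf-8")
```

Round-tripping is lossless in both directions, at the cost of JSON consumers needing to know the marker convention.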

Finally, there is always the option to just force UTF-8 (and maybe even a more restrictive set, as I think @mib-kd743naq wants), but then there will be filenames that cannot be represented in an IPLD unixfs, and some other format will likely be needed for backing up the filesystem.

@whyrusleeping @Stebalien others, thoughts?

[Edited to include option (j4) for JSON encoding.]

@Stebalien

See ipfs/kubo#1710 for a lot of discussion on this. What we should support in paths is an open question. Note: I don't endorse every suggestion in this thread.

Also, to be a bit pedantic, names in maps (and link names) are unicode, not necessarily utf-8 (cbor just encodes them using utf-8 but that's not a hard requirement). Our DagJSON format doesn't currently support binary at all so I wouldn't try optimizing for that. However...

There are a few ways to go:

  1. Don't support non-unicode paths.
  2. Use a format like {name: "fname" /* possibly binary */}. However, this doesn't fix paths (which we do generally want to be unicode).
  3. Have rules for importing/exporting.
  4. Have rules for importing only (don't convert back when exporting) and only turn this on with a flag.

Personally, I like 4 (and I'd like to be even more restrictive and forbid, e.g., newlines and control codes). While there could technically be files that have such characters in their names, they're unlikely to exist outside of either buggy or evil programs.


Note: I don't intend to lose any sleep over breaking existing paths if users have decided to, e.g., put newlines in them.

@Stebalien

Just to make it clear, DagJSON doesn't currently support binary (although the go daemon will take a best-effort approach and spit out a base64 encoded string). We've discussed adding a special {'/': 'multibaseValue', 'type': 'binary'} syntax but haven't agreed on doing so.

@kevina
Contributor Author

kevina commented Oct 31, 2017

Note: I thought one of the eventual goals is to remove the name attribute from links, as they are not always meaningful; I apologize if I was mistaken. Map keys in CBOR can be anything, including byte strings (even though the RFC recommends using text strings).

(1) I do not support, but see the next paragraph. (2) could be a problem if filenames are keys in a map. I do not fully understand what you are trying to say with (3) or (4).

Above all, if we intend for ipld-unixfs to be used as a replacement for tar (or, to a lesser extent, actually be used as a unixfs, since a filesystem that insists on utf-8 paths could create compatibility problems), then we have to somehow support any filename that is valid in Unix. If we do not, then we need to be honest and up-front about that (for example, say that ipld-unixfs can faithfully represent a traditional unixfs provided that all filenames are valid utf-8). If someone wants to use ipfs in this way (as a replacement for tar, or as a unixfs where for whatever reason filenames are not utf-8 or contain any of the forbidden characters) then a different format needs to be used---even if that format is identical to ipld-unixfs but with the name component interpreted as a byte string rather than a utf-8 string.

@mib-kd743naq

and maybe even a more restrictive set as I think @mib-kd743naq wants

@kevina actually I do not want a restrictive set. My use case / interest in IPFS is strictly and literally "store random .tar files faithfully within IPFS". The reason I mentioned "maybe we just want to limit things" was specifically because of how... unprepared the tooling currently is. Consider ~$ ipfs ls QmZcYTLQNo1hJtGxQD8d9sHtmnBBamhDHydtqJ2zxzLSx7 ( from ipfs/kubo#4292 (comment) )

In other words: I would love to be able to store everything sans \x00 and \x2F in IPFS, but I am realistic about what is (not) achievable given the realities and focuses of the project...

@kevina kevina mentioned this issue Nov 1, 2017
@Kubuxu

Kubuxu commented Nov 1, 2017

Consolidating my comments on this topic:



In response to @mib-kd743naq:

It seems clearer from that standpoint that IPFS can't be both "fully POSIX compliant" and "an upgraded protocol with less cruft".

I disagree with this statement: what is stored (and can be accessed through the API) versus what is presented to the user are two things that can be different. As most shells are not able to handle binary strings (as variables), it makes sense to apply an encoding when presenting them.

In response to @kevina proposal:

The key type can either be a byte or text string

I think it being either a string or a byte string (as CBOR calls it) will create more problems than we expect. It is still problematic with JSON, as we can't properly represent byte arrays in JSON. (We are thinking about something that would work around this problem by allowing you to ask for many formats in order of preference; it would return the first format that is able to fully represent the data.)

It would mean that there is no minimal schema for this format. Depending on what we decide to do with filenames (UTF-8 only, arbitrary, safe characters only) we should make it either a string or bytestring.

@Kubuxu

Kubuxu commented Nov 1, 2017

Also we need to think what IPLD resolver will be able to handle as keys. If it doesn't handle / in the key (as it is currently, I think) we have no POSIX.

@kevina
Contributor Author

kevina commented Nov 1, 2017

Another possibility:

(3) Use UTF-8 or a restrictive set, but provide a field for storing the original name. A basic implementation is allowed to reject invalid filenames and required to preserve this field when copying files, but otherwise is not required to make use of it.

When the field is used, the translation from an invalid name to something representable is outside the scope of the spec, and the best translation may in fact depend on the filenames themselves. For example, if the filenames are in a non-utf-8 encoding the best behavior may be to translate them to UTF-8 but preserve the original name. A very basic implementation could replace the invalid characters with something else and then append numbers if two filenames in a directory would otherwise be the same.
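A minimal sketch of such a basic implementation (Python; the replacement character and numbering scheme are only one possible choice):

```python
def displayable_name(raw: bytes, taken: set) -> str:
    """Derive a valid UTF-8 display name from raw filename bytes:
    invalid bytes become U+FFFD, and a counter is appended if the
    result collides with a name already present in the directory.
    The original bytes would be preserved in a separate field."""
    base = raw.decode("utf-8", errors="replace")
    candidate, n = base, 1
    while candidate in taken:
        n += 1
        candidate = "%s (%d)" % (base, n)
    taken.add(candidate)
    return candidate
```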

@Stebalien

@Kubuxu

as we can't properly represent byte arrays in JSON; we are thinking about something that would work around this problem by allowing you to ask for many formats in order of preference, and it would return the first format that is able to fully represent the data

unixfs can't use DagJSON anyways (at the moment) because files are raw binary. Honestly, we only use DagJSON in the HTTP API so we should just avoid it as much as possible.

Also we need to think what IPLD resolver will be able to handle as keys. If it doesn't handle / in the key (as it is currently, I think) we have no POSIX.

I'm not sure I understand. POSIX doesn't allow for / in filenames either. However, the current thinking is to allow / in IPLD keys but say that the current pathing scheme can't path over them. Basically, the current path scheme is a language for traversing IPLD objects, not necessarily the only one (although we could specify an escaping scheme up-front).

Note: IPLD paths != unixfs paths.

@kevina

(5) Use UTF-8 or a restrictive set, but provide a field for storing the original name. A basic implementation is allowed to reject invalid filenames and required to preserve this field when copying files, but otherwise is not required to make use of it.

That's probably the best option (that's what I was getting at with option 3 but simply keeping the original filename in a separate, binary field is probably better). However, I'd add two constraints:

  1. By default, we should reject such files (enabling import with a flag). Otherwise, imported files will have different names in IPFS.
  2. By default, we definitely shouldn't use such names when exporting from unixfs to a local filesystem (enabled with a flag) for the same reason.

That is, provide some form of unixfs --preserve-weird-names=true flag.


Again, just to be pedantic, UTF-8 != Unicode (I don't want to tie ourselves to an encoding in the spec).

@Kubuxu

Kubuxu commented Nov 1, 2017

I'm not sure I understand. POSIX doesn't allow for / in filenames either

I forgot about that for some reason.

@kevina
Contributor Author

kevina commented Nov 1, 2017

I don't want to tie ourselves to an encoding in the spec

So you are saying we should say filenames should be Unicode but allow them to be encoded in UTF-8, UCS-2, UTF-16, or UTF-32?

I quite literally mean UTF-8 here.

@Stebalien

So you are saying we should say filenames should be Unicode but allow then be encoded in UTF-8 or UCS-2, UTF-16 and UTF-32?

Unicode is a standard for text. It's a character set along with a set of rules governing valid strings and how to interpret them; it, in itself, doesn't dictate how these characters map to bytes. UTF-8 is one mapping between these conceptual character strings and bytes.

The precise encoding used in any given situation will be up to the IPLD format (e.g., CBOR), the language (e.g., go), and the OS/Filesystem.

For example, Windows uses UTF-16 for filenames. When importing files on Windows, we'd treat filenames as Unicode strings (decoding the UTF-16 filenames into whatever Unicode encoding our language uses) and then re-encode them in the IPLD format's unicode encoding (UTF-8 for CBOR).

On, e.g., Linux, we'd have to assume that paths (on import) are UTF-8 encoded as Linux doesn't actually specify how to interpret paths as Unicode (although we could probably use the $LANG or $LC_* variables but that's probably not worth it).

@Kubuxu

Kubuxu commented Nov 1, 2017

Is JSON not supporting bytestrings the main (or the only) reason behind limiting to Unicode?

@kevina
Contributor Author

kevina commented Nov 1, 2017

Is JSON not supporting bytestrings the main (or the only) reason behind limiting to Unicode?

Yes, the primary one (at least for me).

But there is also the issue of how to handle non-UTF-8 client side.

@kevina
Contributor Author

kevina commented Nov 1, 2017

although we could probably use the $LANG or $LC_* variables but that's probably not worth it

Using LANG is not a good idea. It is not an intrinsic part of the filesystem (like say codepages on FAT) and can be set incorrectly. It could be used as a hint to translate the encoding into Unicode and store the original untranslated name in a field as I suggested in (3).

@Stebalien

Is JSON not supporting bytestrings the main (or the only) reason behind limiting to Unicode?

For me, no. Paths are a user interface and should therefore be user friendly (as much as possible). Furthermore, for security, allowing identical-looking path names is generally not a good idea (we may not be able to do much about this but we should at least try, IMO). That's why I'd go for a restricted subset of Unicode (excluding control characters and newlines) as well. It may also be worth applying Unicode normalization to filenames/paths (to avoid duplicate names).

Using LANG is not a good idea. It is not an intrinsic part of the filesystem (like say codepages on FAT) and can be set incorrectly. It could be used as a hint to translate the encoding into Unicode and store the original untranslated name in a field as I suggested in (3).

I agree with the first part. Honestly, I don't think we should even use it as a hint. If the filename can be decoded as utf-8, we'd use that (it will get turned back into utf-8 when exported unle


Also, there are actually two ways to optimize this. We've been considering optimizing for import (support importing from the greatest number of filesystems) but one can also optimize for export (support exporting onto the greatest number of filesystems). Now, I wouldn't go so far as to ban ASCII characters such as * to please Windows, but this is something we should consider. For example, I believe Plan 9 (and I know Redox) requires all filenames to be utf-8 byte strings (Unicode).

@kevina
Contributor Author

kevina commented Nov 1, 2017

@kevina wrote:
Using LANG is not a good idea. It is not an intrinsic part of the filesystem (like say codepages on FAT) and can be set incorrectly. It could be used as a hint to translate the encoding into Unicode and store the original untranslated name in a field as I suggested in (3).

@Stebalien wrote:
I agree with the first part. Honestly, I don't think we should even use it as a hint.

I think this decision should be outside the spec and left up to the implementation. The ipld-unixfs spec should not define how non-Unicode names are handled other than that (1) it is permissible to simply reject them, or (2) they should be translated to a unique name within the directory and the original name should be stored in a field. In my view, how this translation is done is outside the scope of ipld-unixfs.

@Kubuxu

Kubuxu commented Nov 1, 2017

I think, we should not limit the storage format because of user interface problems. Those problems should be solved on user interface level. For example tar and ls handle problematic entries using console escape sequences:

/dev/shm$ ls fooo/
?[97m?[5m?[41m?[5mRED ALERT?[5m?[0m?[5m?
/dev/shm$ tar -tf fooo.tar 
fooo/
fooo/\033[97m\033[5m\033[41m\033[5mRED ALERT\033[5m\033[0m\033[5m\n

Canonization and downgrading "unique character meanings" opens many more issues. At its core it is very similar to case-insensitivity but can be really overlooked as it doesn't affect english, or most Latin alphabet languages. I also don't feel it is feasible to protect against this fully: new letters are added to Unicode every year (or so); those letters would not be downgraded properly and could be used in the same way.


My opinion is: if JSON compatibility is deemed too high a cost for losing the full support that POSIX currently provides, and thus full capability for archiving data, we should forbid only / and 0x00. The user interface problem can be solved at the user interface level, in multiple ways.

There exist much better ideas, I'm sure:

  • escaping
  • warning
  • stopping the operation, warning users and requesting special flag (in case of console) or confirmation (normal apps).

optimize for export (support exporting onto the greatest number of filesystems). Now, I wouldn't go so far as to ban ASCII characters such as * to please Windows

At the time Windows made all of those trade-offs, they made sense. Compare them with the trade-offs that Unix took, and compare the pain points. I haven't heard much about phishing attempts or any major exploits caused by not limiting the character space, but I frequently hear about people not being able to create an aux.doc file (I got bitten by it multiple times while working with obfuscated Java code) or not being able to call a paper Packet routing using A* search algorithm.pdf. The decision made complete sense back then, but now, 20 years later, it doesn't. I don't want to be in this spot in 20 years when we are not able to support something because Unicode decided to change something or a new standard arose.

@Stebalien

I think, we should not limit the storage format because of user interface problems. Those problems should be solved on user interface level. For example tar and ls handle problematic entries using console escape sequences:

Unfortunately, everything does this differently. That means I can't, e.g., parse tar output and then use it in another tool. If we could just get all tools to agree on a particular UI representation, we would have exactly what I'm proposing. I don't care about the binary representation, I just want a consistent user (and cli) representation as defined by the spec.

Canonization and downgrading "unique character meanings" opens many more issues. At its core it is very similar to case-insensitivity but can be really overlooked as it doesn't affect english, or most Latin alphabet languages.

Not quite. I'm specifically talking about cases where there are two equivalent ways to represent the same visual character in Unicode. The problem is that:

  1. We don't want a user typing a path and a user copying a path to end up with different files.
  2. UIs will often canonicalize Unicode strings anyways.

I'm talking about NFC. I think you're confusing this with NFKD and friends.
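The NFC case is easy to demonstrate with Python's standard library; both strings below render as the same visual characters:

```python
import unicodedata

nfc = "caf\u00e9"     # 'café' with a precomposed é (single code point)
nfd = "cafe\u0301"    # 'cafe' followed by a combining acute accent
assert nfc != nfd                                  # distinct strings...
assert unicodedata.normalize("NFC", nfd) == nfc    # ...equal after NFC
```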

I also don't feel it is feasible to protect against this fully: new letters are added to Unicode every year (or so); those letters would not be downgraded properly and could be used in the same way.

There's actually a stability guarantee. To be fully compliant, we'd have to avoid unassigned character blocks (which we might want to do anyways), but it's unlikely that the Unicode spec will change this algorithm much, if at all (they say so on that page). It exists primarily to make UTF-8 backwards compatible with many existing character encoding schemes.

My opinion is: if JSON compatibility is deemed too high a cost for losing the full support that POSIX currently provides, and thus full capability for archiving data, we should forbid only / and 0x00. The user interface problem can be solved at the user interface level, in multiple ways.

JSON compatibility is not the issue. Please read the issue I linked to in my first post (#1710).

I haven't heard much about phishing attempts or any major exploits because of not limiting the character space

Bad scripts that assume, e.g., that filenames can't contain newlines/spaces are written all the time; usually by accident, not malice. This is why the latest version of ls (the GNU version, at least) now quotes all filenames with spaces and escapes all newlines. Unfortunately, this also breaks scripts that try to parse the output of ls (they shouldn't have been doing this, but it's convenient, so they did). I'd like to specify a user-friendly set of characters such that we're unlikely to end up in situations like this.

Also, existing filesystems are mostly concerned with offline/single user use-cases. Here, we're dealing with online/multi-user systems (e.g., the web).

@kevina
Contributor Author

kevina commented Nov 2, 2017

Note: @Stebalien

JSON compatibility is not the issue.

It is part of this issue.

Part of this issue is determining whether the filenames should be UTF-8 and stored as CBOR text strings, or whether any character allowed in Unix should be permitted and the names stored as byte strings. If we store them as byte strings then we have a problem converting them to JSON.

@Stebalien

If we store them as byte strings then we have a problem converting them to JSON.

We'll have a problem regardless given that files will be binary and it would be nice to, e.g., support binary extended attributes etc. That's why I'm not concerned about JSON.

Part of this issue is determining whether the filenames should be UTF-8 and stored as CBOR text strings, or whether any character allowed in Unix should be permitted and the names stored as byte strings.

That is, if we choose to allow arbitrary byte strings, we'll use the byte string type; I wouldn't let the choice of byte string versus text string inform the decision about valid file names. The one place where this does matter is map key names. Currently, IPLD specifies that all map/object key names must be Unicode text strings. If we allow arbitrary byte strings, we wouldn't be able to use file names as keys in objects.

@ehmry

ehmry commented Nov 2, 2017

@kevina do you have any examples of non-textual filenames that are non-malicious?

@kevina
Contributor Author

kevina commented Nov 2, 2017

@kevina do you have any examples of non-textual filenames that are non-malicious?

Older harddrives (or just folders) are very likely to contain non-UTF-8 characters. Here is one tar file that contains a non-utf-8 encoded file: http://ftp.gnu.org/gnu/aspell/dict/es/aspell-es-0.50-2.tar.bz2. I am sure I can find many others.

Offhand I cannot think of any non-textual filenames that are not malicious, but I also see that as beside the point. I am concerned about the ability to faithfully store any POSIX file and restore it back; this includes filenames that may very well be malicious. I mostly agree with @Kubuxu, but am okay if this filename is stored in a separate field, and also okay if we reject certain filenames by default when creating a directory. I am also okay if we flag directories that could contain problematic entries.

@Stebalien

Older harddrives (or just folders) are very likely to contain non-UTF-8 characters. Here is one tar file that contains a non-utf-8 encoded file: http://ftp.gnu.org/gnu/aspell/dict/es/aspell-es-0.50-2.tar.bz2. I am sure I can find many others.

Hm. Would it be possible to semi-reliably detect that and choose the right encoding (I cringe as I say this as, historically, content "sniffing" has been the source of many security bugs...).

@kevina
Contributor Author

kevina commented Nov 2, 2017

Hm. Would it be possible to semi-reliably detect that and choose the right encoding (I cringe as I say this as, historically, content "sniffing" has been the source of many security bugs...).

Without additional context (such as the LANG or LANGUAGE environment variables), not really, unless very sophisticated algorithms are used. There is just too much similarity between the 8-bit character sets. For an overview see http://czyborra.com/charsets/.

The best we can do is: if the name contains bytes in the range A0-FF that are not part of a valid UTF-8 multibyte sequence, assume it is ISO-8859-1; if it also contains bytes in the range 80-9F, assume it is CP1252 (a superset of ISO-8859-1). This is likely to be correct maybe 50% of the time and at worst will display strange characters. This conversion is unlikely to create any security problems.
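That heuristic is straightforward to sketch (Python; the fallback order follows the description above and is, of course, only a guess):

```python
def sniff_decode(raw: bytes) -> str:
    """Prefer UTF-8; on failure, guess CP1252 if any byte falls in
    0x80-0x9F (the range where CP1252 extends ISO-8859-1), otherwise
    ISO-8859-1. ISO-8859-1 maps all 256 byte values, so the final
    fallback never raises."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9F for b in raw):
        # a few CP1252 slots (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned
        return raw.decode("cp1252", errors="replace")
    return raw.decode("iso-8859-1")
```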

This is all the more reason that, if we decide not to just use byte strings for filenames, the handling of the visible path component of non-Unicode filenames should be left to the application and kept outside the scope of the spec.

@kevina
Contributor Author

kevina commented Nov 3, 2017

Everyone: I took into account the feedback I got from this thread and rewrote the spec in #2.

@mib-kd743naq

Wasn't sure where the best place to comment is, so will do it here: could we sidestep the entire "we want one string for addressing, but it can't be both unicode and bytes" problem by using a cleverer encoding like this one used by a unicode-aware VM runtime?

I have not dug deeply into this, but it seems the implementer knew what they were doing...

@kevina
Contributor Author

kevina commented Jan 31, 2018

Apparently Python also has a hack to preserve non-utf8 sequences in unix filenames. I have not looked into it, but I believe it makes use of the private use area.

@warpfork

warpfork commented May 9, 2018

I came across this blog post from a project which handles many odd filenames in practice, written in Python: http://beets.io/blog/paths.html

Highlights:

  • bytes, bytes everywhere.
  • they express some worries about Py3k's "Surrogate Escapes" and do not use them.
  • they do indeed have a displayable_path function which makes a UI-printable string but is explicitly called out as allowed to be lossy.

@kevina
Contributor Author

kevina commented May 10, 2018

@warpfork thanks for the link; it prompted me to look up how bytes that are not part of a valid utf-8 sequence are handled in Python: they are escaped using Unicode surrogate characters. From https://docs.python.org/3/library/codecs.html#codec-base-classes:

On decoding, replace [non utf-8] byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

Surrogate characters are special characters in Unicode that are always meant to be paired, so a single surrogate is invalid by itself. Hence, this is a clever hack that works well for Python's internal encoding but is not useful to us, as an unmatched surrogate character is invalid in Unicode.
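Python's round-trip behavior is easy to demonstrate, along with the reason the intermediate string is unusable as ordinary Unicode:

```python
raw = b"caf\xe9"  # ISO-8859-1 'café'; the 0xE9 byte is not valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "caf\udce9"  # the bad byte becomes a lone surrogate
assert s.encode("utf-8", errors="surrogateescape") == raw  # lossless

# But strict encoding rejects the unpaired surrogate:
try:
    s.encode("utf-8")
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass
```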

@mib-kd743naq The solution used by the VM runtime could work, but I don't recommend it. It is a variant of the backslash escape sequence except that 0x10FFFD is used instead of the backslash. Like any form of backslash escaping, the escape character must itself be escaped. The VM runtime does not seem to provide support for escaping 0x10FFFD (based on the comment), probably because they don't expect the escape character to ever be used, but this is not guaranteed, as 0x10FFFD, like any private-use code point, is free to be used for some other purpose by someone else.

Thus, I stand by my original recommendation to use a special field fname to contain the original filename when it cannot be represented in Unicode. This solution also allows us to enforce additional restrictions on the name, such as disallowing control characters or requiring Unicode normalization, while still being able to keep the original filename when it is needed.

@mib-kd743naq

@kevina so essentially you are proposing the solution considered-and-rejected here.

What do you think about the subsequent proposal? Except in the case of IPFS a null/absent "charset" field means utf8 ( perhaps mandated in NFC form? ) and any other present "charset" is respected, perhaps with a multi-table style approach to id=>encoding mappings...

/cc @whyrusleeping

@kevina
Contributor Author

kevina commented May 10, 2018

@mib-kd743naq I am not sure how that is going to work for us. The core of the problem is that IPLD paths must be Unicode, while unix filenames do not have an encoding; if a filename is not utf-8 it will need to be converted to utf-8 somehow in the library for use in the IPLD path. I don't want to add this complexity to the standard. It would also require the names to be bytes and not strings, which was basically rejected in the thread above. Also note that the current proposal (see #2) has the name as the map key and not a field, so I am not sure how a tagged name is going to work as a map key when the keys themselves are not all in the same encoding.

Nowadays, 99(.9)% of filenames on a modern POSIX system will be UTF-8. The fname proposal is to cover the edge cases when they are not. fname will only be used when the Unicode filename, encoded as utf-8, differs from the original filename. How non-unicode filenames are converted to Unicode will be outside the scope of the standard, other than to require that the converted name be unique within the directory.

@mib-kd743naq

@kevina sure thing; "IPLD paths must be encoded as utf8, and any non-utf8 information must be handled via a side channel" is sufficiently clear.

What still needs to be codified however ( and this likely needs a new issue ): which utf8 encoding ( which normalization if any )

@warpfork

(Speaking with @Stebalien IRL right now; thus with a "we":)

Summary:

  • We are going to faithfully keep bytes in a value fname (iff it differs from a normalized path). We acknowledge that this is necessary for faithful archival usage.
  • We will use normalized paths in IPLD link keys. It is desirable to do this because in practice, we expect users to be able to copy-paste paths and treat them as IPLD selectors in e.g. URLs in the gateway in the most common use (acknowledging that this will not work for esoteric paths which require fname treatment).
  • It is implicit from this that our normalization function will not be bijective, e.g. it will be lossy, and it may cause "conflicts".
  • We will defer specifying the behavior in case of conflicting normalized paths when adding a directory to IPFS until later. This is a rare, rare case in practice. (It may actually be acceptable to halt and demand user action, even. But again, deferring: we will have a ConflictBehavior func(...).)
  • We probably should gate the fname support behind a flag. It causes sufficiently interesting behavior (e.g. IPLD selectors won't work in the obvious way for all their files) that users should be aware of what they're getting into.
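Put together, the import-side logic in the summary might be sketched like this (Python; the 'name'/'fname' field layout is illustrative, following the thread's proposal rather than a finalized spec):

```python
import unicodedata

def link_entry(raw: bytes) -> dict:
    """Build a directory entry from raw filename bytes: the link key
    is the NFC-normalized decoding of the name, and the original bytes
    are kept in 'fname' iff they differ from the key's UTF-8 encoding
    (the faithful-archival case)."""
    text = raw.decode("utf-8", errors="replace")
    key = unicodedata.normalize("NFC", text)
    entry = {"name": key}
    if key.encode("utf-8") != raw:
        entry["fname"] = raw  # decoding or normalization was lossy
    return entry
```

For ordinary UTF-8 names already in NFC the entry carries no fname at all; only esoteric paths pay the extra cost.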

Fwiw, I understand the desire to avoid adding the complexity of specific encoding choices to the standard, but feel it's inevitable. If we have a normalization function, it's important to define and use it consistently: even when operating in cautious mode and halting rather than employing the fname support, it's important to be clear and consistent about when that halt happens.

@mib-kd743naq

mib-kd743naq commented Jul 12, 2018

We will use normalized paths in IPLD link keys.

One thing to note: in the discussions so far I did not see an explicit note regarding unicode normalization. I.e. this is still largely unanswered. NFC vs NFD vs whatever else matters because of many fun, fun reasons

It is implicit from this that our normalization function will not be bijective, e.g. it will be lossy, and it may cause "conflicts"

The core of ipfs/kubo#4292 says "hi!"

As a meta-thought for @warpfork: perhaps it would be worthwhile starting a github-wiki page or somesuch with the express idea of no discussion, but rather a table/tasklist enumerating various competing concerns perhaps with links to discussion and with indicators of priority/difficulty/conflicts-with/etc

I loathe being the one who calls for more "bureaucracy", but at the same time the concern space is vast ( as I amply demonstrated yesterday ), and we won't get many opportunities to "get the basics right" in the future ( UnixFS V2 may very well be the very last one )

/cc @b5

@mib-kd743naq

@warpfork a good starting set of questions on the naming sub-can-of-worms is contained within ipfs/kubo#4292 (comment)

@kevina
Contributor Author

kevina commented Jul 13, 2018

In response to "I understand the desire to avoid adding the complexity of specific encoding choices to the standard, but feel it's inevitable": I think we are making this more complicated than it needs to be. The only encoding we should be specifying is Unicode (and perhaps UTF-8 for the binary representation); support for anything else should be outside the scope of the standard. The reason fname was suggested was to handle the corner cases.

If we do choose a normalization I would suggest NFC. This is just based on my experience with Unicode in my spellchecker Aspell (http://aspell.net). I do not think NFD is really used in practice, but I could be wrong. When the normalization to NFC changes something, the original string can be encoded in fname.

My apologies if I am missing anything.

@warpfork

I'm having a read of https://www.unicode.org/reports/tr15/#Norm_Forms right now. I think I also will stand in support of NFC. And 100% with "When the normalization to NFC changes something the original string can be encoded in fname".

@warpfork

Go seems to have ready-made implementations of the four common normalizers, including NFC, happily: https://godoc.org/golang.org/x/text/unicode/norm#Form.Bytes should do precisely what we need.

@mib-kd743naq

I think I also will stand in support of NFC.

I should have been clearer when I made my comments, apologies. The question is not "what makes sense in the context of UnixFS". Remember - the naming of "links" (i.e. directory names) "leaks" into IPLD-proper. This is a desired outcome due to "IPLD selectors" and "Path traversal" being logically the same thing. The (likely a bit outdated) IPLD spec itself has in its opening paragraph: merkle-paths: unix-style paths for traversing merkle-dags

The downside is of course that at this point you are no longer working on the UnixFS layer at all. It is straight IPLD. Which currently does not define naming normalization at all. Thus my alarm of "just calling it" within the context of UnixFS.

@warpfork

Yup, agree. The comment about NFC was purely to that scope -- preference of NFC over NFD, and particularly over NFKD and NFKC -- wherever that normalization comes to rest.


Yup. We're definitely quantum-entangled with the IPLD spec here -- whatever we do needs to be coherent between both IPLD and unixfsv2.

I was looking at some TODOs in that very same spec doc (e.g. "TODO: list path resolving restrictions") and agree it might be high time to ensure we have some clarity up there too.


So how to go about that clarity...

My general feeling is that we should push as much of the standardization as possible into IPLD, since the issue is general to string keys and normalization thereof; and "nice" (non-fname-requiring) unixfs names may be a further subset (but most certainly a subset) of allowable IPLD keys. So for example, the choice of NFC as a required normalization should, if at all possible, be pushed up to IPLD (unless we've got some very rigorous reasoning about why normalization doesn't belong there).

I think this may leave one further question: what's the boundary between allowable IPLD keys vs valid merkle path segments.

@vmx commented in irc earlier:

i would say merkle paths are a kind of IPLD selector

IPLD selectors are expanded to be anything. it could be merkle paths like, css selectors, graphql, something encoded as pb, anything

So... 👍 IPLD selectors are a general concept. Simultaneously, referring back to the spec:

IPLD paths MUST layer cleanly over UNIX and The Web (use /, have deterministic transforms for ASCII systems).

So, all together (and perhaps this is review, forgive me), it seems that merklepaths (or some sort of "path", anyway) are a critical part of our desired outcome, even as IPLD selectors may be a superset of merklepath expressivity. And thus we need to think very hard about how we feel about allowing IPLD link keys that are not expressible as a merklepath segment.
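(One hypothetical way to draw that boundary — a sketch only, none of these rules are settled in any spec: a merkle-path segment is an IPLD key that additionally survives the restrictions `/`-delimited traversal imposes.)

```python
def is_merklepath_segment(key: str) -> bool:
    """Hypothetical validity check for a single merkle-path segment.

    Assumes `key` is already an allowable (unicode) IPLD key; rejects
    anything that cannot appear between two '/' in a UNIX-style path.
    """
    if key in ("", ".", ".."):  # empty or relative-traversal segments
        return False
    if "/" in key:              # '/' is the path separator itself
        return False
    if "\x00" in key:           # NUL is forbidden in POSIX names anyway
        return False
    return True

print(is_merklepath_segment("photos"))  # True
print(is_merklepath_segment("a/b"))     # False
```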

Aaaand... yep, here we are, back to the path restrictions part of the spec needs significant elucidation.

@whyrusleeping

pushing that into ipld makes sense to me, though we need to be careful to handle malformed objects (on parsing bytes, we need to validate all path names, which could be $$$)

@kevina
Contributor Author

kevina commented Jul 16, 2018

I am strongly in favor of UnixFSv2 file names being valid IPLD path components. When archiving files that do not meet the IPLD path component spec, they should be transformed somehow, or possibly rejected. This transformation should be outside of the UnixFSv2 spec, as there is no single best way to handle the conversion. Implementations could be allowed to just return an error (although this is not user-friendly). When an implementation does not return an error, the original byte string should be stored in fname, which can be used when extracting the archive onto the filesystem.
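(A sketch of that flow — the function name and the exact fname shape are assumptions for illustration, not spec: normalize the incoming name to NFC, and keep the original bytes alongside only when anything changed.)

```python
import unicodedata

def make_link(raw: bytes) -> dict:
    """Sketch of the proposed behavior: the link key is the NFC-normalized
    UTF-8 name; the original bytes are kept in 'fname' only when they
    differ. Non-UTF-8 names are rejected here -- a real implementation
    could instead apply some out-of-spec transformation and set fname.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("not valid UTF-8: transform or reject, per policy")
    key = unicodedata.normalize("NFC", text)
    link = {"name": key}
    if key.encode("utf-8") != raw:
        link["fname"] = raw  # lossless round-trip back to the filesystem
    return link

print(make_link(b"plain.txt"))            # {'name': 'plain.txt'}
print(make_link("e\u0301.txt".encode()))  # NFD input: fname keeps the original
```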

@rvagg
Member

rvagg commented Dec 6, 2022

closing for archival

@rvagg rvagg closed this as completed Dec 6, 2022