Handling of non-utf-8 posix filenames #3
Comments
See ipfs/kubo#1710 for a lot of discussion on this. What we should support in paths is an open question. Note: I don't endorse every suggestion in this thread. Also, to be a bit pedantic, names in maps (and link names) are unicode, not necessarily utf-8 (cbor just encodes them using utf-8 but that's not a hard requirement). Our DagJSON format doesn't currently support binary at all so I wouldn't try optimizing for that. However... There are a few ways to go:
Personally, I like 4 (and I'd like to be even more restrictive and forbid, e.g., newlines and control codes). While there could technically be files that have such characters in their names, they're unlikely to exist outside of either buggy or evil programs. Note: I don't intend to lose any sleep over breaking existing paths if users have decided to, e.g., put newlines in them.
Just to make it clear,
Note: I thought one of the eventual goals is to remove the name attribute from links as they are not always meaningful; I apologize if I was mistaken. Map keys in CBOR can be anything, including byte strings (even though the RFC recommends using text strings). (1) I do not support, but see the next paragraph. (2) could be a problem if filenames are keys in a map. I do not fully understand what you are trying to say with (3) or (4). Above all, if we intend for ipld-unixfs to be used as a replacement for tar (or, to a lesser extent, actually be used as a unixfs, since a filesystem that insists on utf-8 paths could create compatibility problems) then we have to somehow support any filename that is valid in Unix. If we do not, then we need to be honest and up-front about that (for example, say that ipld-unixfs can be used to faithfully represent a traditional unixfs, provided that all filenames are valid utf-8). If someone wants to use ipfs in this way (as a replacement for tar, or as a unixfs where for whatever reason filenames are not utf-8 or contain any of the forbidden characters) then a different format needs to be used---even if that format is identical to ipld-unixfs but with the name component interpreted as a byte string rather than a utf-8 string.
@kevina actually I do not want a restrictive set. My use case / interest in IPFS is strictly and literally "store random .tar files faithfully within IPFS". The reason I mentioned "maybe we just want to limit things" was specifically because of how... unprepared the tooling currently is. Consider In other words: I would love to be able to store everything sans
Consolidating my comments on this topic: Copied from In response to @mib-kd743naq:
I disagree with this statement: what is stored (and can be accessed through the API) versus what is presented to the user are two things that can be different. As most shells are not able to handle binary strings (as variables), it makes sense to apply an encoding when presenting them. In response to @kevina's proposal:
I think it being either String or bytestring (as CBOR calls it) will create more problems than we expect. It is still problematic with JSON (as we can't represent byte arrays well in JSON; we are thinking about something that would work around this problem by allowing you to ask for many formats in order of preference, returning the first format that is able to fully represent the data). It would also mean that there is no minimal schema for this format. Depending on what we decide to do with filenames (UTF-8 only, arbitrary, safe characters only) we should make it either a string or a bytestring.
Also we need to think about what the IPLD resolver will be able to handle as keys. If it doesn't handle
Another possibility: (3) Use UTF-8 or a restrictive set, but provide a field for storing the original name. A basic implementation is allowed to reject invalid filenames and is required to preserve this field when copying files, but otherwise is not required to make use of it. When the field is used, the translation from an invalid name to something that is representable is outside the scope of the spec, and the best translation may in fact depend on the filenames themselves. For example, if the filenames are in a non-utf-8 encoding the best behavior may be to translate them to UTF-8 but preserve the original name. A very basic implementation could replace the invalid characters with something else and then append numbers if two filenames end up the same in a directory.
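For illustration only, a minimal Go sketch of what such an entry could look like, assuming a hypothetical `originalName` field (the field name and layout are not part of any spec):

```go
// DirEntry is a hypothetical directory-entry shape for option (3).
type DirEntry struct {
	// Name is the translated, spec-conforming (e.g. valid UTF-8) name that
	// basic implementations use for display and path resolution.
	Name string `json:"name"`
	// OriginalName carries the raw on-disk bytes whenever Name had to be
	// altered; implementations must preserve it when copying files but are
	// not otherwise required to interpret it.
	OriginalName []byte `json:"originalName,omitempty"`
	// Link/CID and metadata fields would follow here.
}
```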
unixfs can't use DagJSON anyways (at the moment) because files are raw binary. Honestly, we only use DagJSON in the HTTP API so we should just avoid it as much as possible.
I'm not sure I understand. POSIX doesn't allow for Note: IPLD paths != unixfs paths.
That's probably the best option (that's what I was getting at with option 3 but simply keeping the original filename in a separate, binary field is probably better). However, I'd add two constraints:
That is, provide some form of Again, just to be pedantic, UTF-8 != Unicode (I don't want to tie ourselves to an encoding in the spec).
I forgot about that for some reason.
So you are saying we should say filenames should be Unicode but allow them to be encoded in UTF-8 or UCS-2, UTF-16 and UTF-32? I quite literally mean UTF-8 here.
Unicode is a standard for text. It's a character set along with a set of rules governing valid strings and how to interpret them; it, in itself, doesn't dictate how these characters map to bytes. UTF-8 is a (one) mapping between these conceptual character strings and bytes. The precise encoding used in any given situation will be up to the IPLD format (e.g., CBOR), the language (e.g., go), and the OS/Filesystem. For example, Windows uses UTF-16 for filenames. When importing files on Windows, we'd treat filenames as Unicode strings (decoding the UTF-16 filenames into whatever Unicode encoding our language uses) and then re-encode them in the IPLD format's unicode encoding (UTF-8 for CBOR). On, e.g., Linux, we'd have to assume that paths (on import) are UTF-8 encoded as Linux doesn't actually specify how to interpret paths as Unicode (although we could probably use the LANG environment variable as a hint).
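For concreteness, a rough Go sketch of that Windows import step (the function name is illustrative, not an existing API):

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16NameToUTF8 decodes a filename given as UTF-16 code units into Unicode
// code points and re-encodes it as UTF-8, which is what a CBOR text string
// ultimately stores.
func utf16NameToUTF8(name []uint16) string {
	return string(utf16.Decode(name)) // Decode yields runes; string() encodes them as UTF-8
}

func main() {
	// "héllo" as UTF-16 code units (0x00E9 is 'é').
	fmt.Println(utf16NameToUTF8([]uint16{'h', 0x00E9, 'l', 'l', 'o'}))
}
```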
Is JSON not supporting bytestrings the main (or the only) reason behind limiting to Unicode?
Yes, the primary one (at least for me). But there is also the issue of how to handle non-UTF-8 on the client side.
Using LANG is not a good idea. It is not an intrinsic part of the filesystem (like, say, codepages on FAT) and can be set incorrectly. It could be used as a hint to translate the encoding into Unicode and store the original untranslated name in a field as I suggested in (3).
For me, no. Paths are a user interface and should therefore be user friendly (as much as possible). Furthermore, for security, allowing identical-looking path names is generally not a good idea (we may not be able to do much about this but we should, at least, try, IMO). That's why I'd go for a restricted subset of Unicode (excluding control characters and newlines) as well. It may also be worth it to apply unicode normalization to filenames/paths (to avoid duplicate names).
I agree with the first part. Honestly, I don't think we should even use it as a hint. If the filename can be decoded as utf-8, we'd use that (it will get turned back into utf-8 when exported unle Also, there are actually two ways to optimize this. We've been considering optimizing for import (support importing from the greatest number of filesystems) but one can also optimize for export (support exporting onto the greatest number of filesystems). Now, I wouldn't go so far as to ban ASCII characters such as
I think this decision should be outside the spec and left up to the implementation. The ipld-unixfs spec should not define how non-Unicode names are handled other than that (1) it is permissible to simply reject them, or (2) they should be translated to a unique name within the directory and the original name should be stored in a field. In my view how this translation is done is outside the scope of ipld-unixfs.
I think we should not limit the storage format because of user interface problems. Those problems should be solved at the user interface level. For example, tar and ls handle problematic entries using console escape sequences:
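As a rough illustration of that kind of display-level escaping (not what tar or ls actually do, just the general idea, in Go):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// A filename containing a newline and a byte that is not valid UTF-8.
	name := "evil\nname\xff"
	// strconv.Quote escapes control characters and invalid bytes so the entry
	// can be printed safely on a terminal, while the stored bytes stay untouched.
	fmt.Println(strconv.Quote(name))
}
```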
Canonization and downgrading "unique character meanings" opens many more issues. At its core it is very similar to case-insensitivity but can be really overlooked as it doesn't affect English, or most Latin-alphabet languages. I also don't feel like it is feasible to protect against it fully; new letters are added to Unicode every year (or so), and those letters would not be downgraded properly and could be used in the same way. My opinion is: if JSON compatibility is deemed too high a cost for losing the full support that POSIX provides currently, and thus full capability for archiving data, we should go with only There exist much better ideas, I'm sure:
At the time Windows made all of those trade-offs, they all made sense. Compare it with trade-offs that Unix took and compare pain points. I haven't heard much about phishing attempts or any major exploits because of not limiting the character space but I frequently hear about people not being able to create
Unfortunately, everything does this differently. That means I can't, e.g., parse tar output and then use it in another tool. We could just get all tools to agree on a particular UI representation, but then we'd have exactly what I'm proposing. I don't care about the binary representation, I just want a consistent user (and cli) representation as defined by the spec.
Not quite. I'm specifically talking about cases where there are two equivalent ways to represent the same visual character in Unicode. The problem is that:
I'm talking about NFC. I think you're confusing this with NFKD and friends.
There's actually a stability guarantee. To be fully compliant, we'd have to avoid unassigned character blocks (which we might want to do anyways) but it's unlikely that the Unicode spec will change this algorithm much, if at all (they say so on that page). It exists primarily to make UTF-8 backwards compatible with many existing character encoding schemes.
JSON compatibility is not the issue. Please read the issue I linked to in my first post (#1710).
Bad scripts that assume that, e.g., filenames can't contain newlines/spaces are written all the time; usually by accident, not malice. This is why the latest version of Also, existing filesystems are mostly concerned with offline/single-user use-cases. Here, we're dealing with online/multi-user systems (e.g., the web).
Note: @Stebalien
It is part of this issue. Part of this issue is determining whether the filenames should be UTF-8 and stored as CBOR text strings, or whether any character allowed in Unix should be permitted and the names stored as byte strings. If we store them as byte strings then we have a problem converting them to JSON.
We'll have a problem regardless given that files will be binary and it would be nice to, e.g., support binary extended attributes etc. That's why I'm not concerned about JSON.
That is, if we choose to allow arbitrary byte strings, we'll use the byte string type; I wouldn't let the choice of byte string versus text string inform the decision about valid file names. The one place where this does matter is map key names. Currently, IPLD specifies that all map/object key names must be Unicode text strings. If we allow arbitrary byte strings, we wouldn't be able to use file names as keys in objects.
@kevina do you have any examples of non-textual filenames that are non-malicious?
Older hard drives (or just folders) are very likely to contain non-UTF-8 characters. Here is one tar file that contains a non-utf-8 encoded file: http://ftp.gnu.org/gnu/aspell/dict/es/aspell-es-0.50-2.tar.bz2. I am sure I can find many others. Offhand I cannot think of any non-textual filenames that are not malicious, but I also see that as beside the point. I am concerned about the ability to faithfully store any POSIX file and restore it back, and this includes filenames that may very well be malicious. I mostly agree with @Kubuxu, but am okay if this filename is stored in a separate field, and also okay if we reject certain filenames by default when creating a directory. I am also okay if we flag directories that could contain problematic entries.
Hm. Would it be possible to semi-reliably detect that and choose the right encoding (I cringe as I say this as, historically, content "sniffing" has been the source of many security bugs...).
Without additional context (such as using the LANG or LANGUAGE environment variable), not really, unless very sophisticated algorithms are used. There is just too much similarity between the 8-bit character sets. For an overview see http://czyborra.com/charsets/. The best we can do is: if the name contains characters in the range A0-FF that are not part of a valid UTF-8 multibyte sequence, then assume it is ISO-8859-1; if it also contains characters in the range 80-9F, then assume it is CP1252 (CP1252 is a superset of ISO-8859-1). This is likely to be correct maybe 50% of the time and at worst will display strange characters. This conversion is unlikely to create any security problems. This is all the more reason that, if we decide not to just use byte strings for filenames, the handling of the visible path component of non-Unicode filenames should be left to the application and outside the scope of the spec.
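A minimal Go sketch of that heuristic, purely as an illustration (the function name and return values are made up):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// guessEncoding: valid UTF-8 wins; otherwise the presence of C1 bytes
// (0x80-0x9F) suggests CP1252, and anything else falls back to ISO-8859-1.
// This is only ever a coarse guess, as described above.
func guessEncoding(name []byte) string {
	if utf8.Valid(name) {
		return "utf-8"
	}
	for _, b := range name {
		if b >= 0x80 && b <= 0x9F {
			return "cp1252"
		}
	}
	return "iso-8859-1"
}

func main() {
	fmt.Println(guessEncoding([]byte("espa\xf1ol"))) // iso-8859-1 ('ñ' as the single byte 0xF1)
}
```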
Everyone: I took into account the feedback I got from this thread and rewrote the spec in #2.
Wasn't sure where is the best place to comment so will do it here: could we sidestep the entire "we want one string for addressing, but it can't be both unicode and bytes" problem by using a clever-er encoding like this one used by a unicode-aware VM runtime? I have not dug deeply into this, but it seems the implementer knew what they were doing...
Apparently python also has a hack to preserve non-utf8 sequences in unix filenames. I have not looked into it, but I believe it makes use of the private-use-area.
I came across this blog post from a project which handles many odd filenames in practice, written in Python: http://beets.io/blog/paths.html Highlights:
|
@warpfork thanks for the link, it prompted me to look up how characters that are not part of a valid utf-8 sequence are handled in python: they are escaped using the Unicode surrogate characters. From https://docs.python.org/3/library/codecs.html#codec-base-classes:
Surrogate characters are special characters in Unicode that are always meant to be paired, so a single surrogate is invalid by itself in unicode. Hence, this is a clever hack that works well for python's internal encoding but is not useful to us, as an unmatched surrogate character is invalid in Unicode. @mib-kd743naq The solution used by the vm runtime can work but I don't recommend it. It is a variant of the backslash escape sequence except that Thus, I stand by my original recommendation to use a special field
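A quick Go check of that point about surrogates:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// U+DC80 is a low-surrogate code point; surrogates only have meaning in
	// pairs in UTF-16, so a lone one is not a valid Unicode scalar value and
	// cannot legally appear in UTF-8 text.
	fmt.Println(utf8.ValidRune(0xDC80)) // false
	fmt.Println(utf8.ValidRune('é'))    // true
}
```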
@kevina so essentially you are proposing the solution considered-and-rejected here. What do you think about the subsequent proposal? Except in the case of IPFS a null/absent "charset" field means /cc @whyrusleeping
@mib-kd743naq I am not sure how that is going to work for us. The core of the problem is that IPLD paths must be Unicode, while unix file names do not have an encoding; if the filename is not utf-8 it will need to be converted in the library to utf-8 somehow for use in the IPLD path. I don't want to add this complexity to the standard. Doing this will require the names to be bytes and not strings, which was basically rejected in the thread above. Also note that the current proposal (see #2) has the name as the map key and not a field, so I am not sure how a tagged name is going to work as map keys when the keys themselves are not all in the same encoding. 99(.9)% of file names nowadays will be in UTF-8 on a modern POSIX system. The
@kevina sure thing; "IPLD paths must be encoded as utf8, any non-utf8 information must be handled via a side-channel" is sufficiently clear. What still needs to be codified however (and this likely needs a new issue): which utf8 encoding (which normalization, if any).
(Speaking with @Stebalien IRL right now; thus with a "we":) Summary:
Fwiw, I understand the desire to avoid adding the complexity of specific encoding choices to the standard, but feel it's inevitable. If we have a normalization function, it's important to define and use it consistently: even when operating in cautious mode and halting rather than employing the
One thing to note - in the discussions so far I did not see an explicit note regarding unicode normalization. I.e. this is still largely unanswered. NFC vs NFD vs whatever else matters because of many fun fun reasons
The core of ipfs/kubo#4292 says "hi!" As a meta-thought for @warpfork: perhaps it would be worthwhile starting a github-wiki page or somesuch with the express idea of no discussion, but rather a table/tasklist enumerating various competing concerns, perhaps with links to discussion and with indicators of priority/difficulty/conflicts-with/etc. I loathe being the one who calls for more "bureaucracy", but at the same time the concern space is vast (as I amply demonstrated yesterday), and we won't get many opportunities to "get the basics right" in the future (UnixFS V2 may very well be the very last one). /cc @b5
@warpfork a good starting set of questions on the naming sub-can-of-worms is contained within ipfs/kubo#4292 (comment)
In response to "I understand the desire to avoid adding the complexity of specific encoding choices to the standard, but feel it's inevitable." I think we are making this more complicated then needs be. The only encoding we should be specifying is Unicode (and perhaps UTF-8 for the binary representation) , support for anything else should be outside the scope of the standard. The reason why If we do chose a normalization I would suggest NFC. This is just based on my experience with Unicode in my spellchecker Aspelll (http://aspell.net). I do not think NFD is really used in practice, but I could be wrong. When the normalization to My apologizes if I am missing anything. |
I'm having a read of https://www.unicode.org/reports/tr15/#Norm_Forms right now. I think I also will stand in support of NFC. And 100% with "When the normalization to
Go seems to have ready-made implementations of the four common normalizers, including NFC, happily: https://godoc.org/golang.org/x/text/unicode/norm#Form.Bytes should do precisely what we need.
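For example, recomposing a decomposed "é" with that package:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	decomposed := "e\u0301" // 'e' followed by COMBINING ACUTE ACCENT (the NFD form of "é")
	composed := norm.NFC.String(decomposed)
	// Both render identically, but only after NFC normalization do the byte
	// sequences (and therefore link names / map keys) compare equal.
	fmt.Printf("%q -> %q, same bytes: %v\n", decomposed, composed, decomposed == composed)
}
```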
I should have been clearer when I made my comments, apologies. The question is not "what makes sense in the context of UnixFS". Remember - the naming of "links" (i.e. directory names) "leaks" into IPLD-proper. This is a desired outcome due to "IPLD selectors" and "Path traversal" being logically the same thing. The (likely a bit outdated) IPLD spec itself has in its opening paragraph: "merkle-paths: unix-style paths for traversing merkle-dags". The downside is of course that at this point you are no longer working on the UnixFS layer at all. It is straight IPLD. Which currently does not define naming normalization at all. Thus my alarm of "just calling it" within the context of UnixFS.
Yup, agree. The comment about NFC was purely to that scope -- preference of NFC over NFD, and particularly over NFKD and NFKC -- wherever that normalization comes to rest. Yup. We're definitely quantum-entangled with the IPLD spec here -- whatever we do needs to be coherent between both IPLD and unixfsv2. I was looking at some TODOs in that very same spec doc (e.g. "TODO: list path resolving restrictions") and agree it might be high time to ensure we have some clarity up there too. So how to go about that clarity... My general feeling is that we should push as much of the standardization as possible into IPLD, since the issue is general to string keys and normalization thereof; and "nice" (non-fname-requiring) unixfs names may be a further subset (but most certainly a subset) of allowable IPLD keys. So for example, the choice of NFC as a required normalization should, if at all possible, be pushed up to IPLD (unless we've got some very rigorous reasoning about why normalization doesn't belong there). I think this may leave one further question: what's the boundary between allowable IPLD keys vs valid merkle path segments. @vmx commented in irc earlier:
So... 👍 IPLD selectors are a general concept. Simultaneously, referring back to the spec:
So, all together (and perhaps this is review, forgive me), it seems that merklepaths (or some sort of "path", anyway) are a critical part of our desired outcome, even as IPLD selectors may be a superset of merklepath expressivity. And thus we need to think very hard about how we feel about allowing IPLD link keys that are not expressible as a merklepath segment. Aaaand... yep, here we are, back to: the path-restrictions part of the spec needs significant elucidation.
pushing that into ipld makes sense to me, though we need to be careful to handle malformed objects (on parsing bytes, we need to validate all path names, could be $$$)
I am strongly in favor that UnixfsV2 file names should be valid IPLD path components. When archiving files that do not meet the IPLD path component spec, they should be transformed somehow, or implementations should possibly be allowed to reject them. This transformation should be outside of the UnixFSv2 spec, as there is no best way to handle the conversion. Implementations could be allowed to just return an error (although this is not user friendly). When an implementation does not return an error, the original byte string should be stored in
closing for archival
Based on what I gather from #1, one of the requirements is to be able to back up a Unix filesystem onto ipfs. In order to do this it is important to be able to handle any valid Unix/Posix filename. Unix filenames may contain any character except null (`0x00`) and `/`; there is no guarantee that they are valid UTF-8 strings, even if this is now the best practice.

So the question is how to handle them in the CBOR encoding. The simplest option would be just to (1) make them CBOR byte strings and be done with it. This could present a problem when converting it to JSON.
A slightly more complicated option is to (2) make use of the text/byte distinction in CBOR. If a string is a valid UTF-8 string then it MUST be encoded as a text string; if not, then it is encoded as a byte string. I use the word MUST (in the RFC sense) so that there is only one way to encode a given filename. Given this option a UTF-8 string can be encoded in JSON as is, but a byte string still needs special treatment.
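A minimal sketch of that rule in Go (most CBOR encoders map a `string` to a text string and a `[]byte` to a byte string, so the decision reduces to a UTF-8 validity check; the function name is illustrative):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// nameForCBOR applies option (2): a filename that is valid UTF-8 MUST become
// a CBOR text string; anything else stays raw bytes, i.e. a byte string.
func nameForCBOR(name []byte) interface{} {
	if utf8.Valid(name) {
		return string(name)
	}
	return name
}

func main() {
	fmt.Printf("%T\n", nameForCBOR([]byte("touché")))      // string  -> text string
	fmt.Printf("%T\n", nameForCBOR([]byte("caf\xe9.txt"))) // []uint8 -> byte string
}
```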
So the question is how to handle non-valid utf-8 bytes in JSON. As I see it there are several options: (j1) Encode the bytes using the non-standard JSON escape sequence `\0x##` where `##` is the hex value of the byte, for example `0x77`. (j2) Somehow encode the bytes in Unicode itself; one option would be for the `/` to act as a marker that the next character is a literal byte and not a utf-8 character, for example the byte `0x77` could be encoded as `/\u0077`. (j3) If we elect to make use of the CBOR byte/string distinction then any filenames that are byte strings can just be encoded as a BASE64 string. (j4) A slightly more compact version of (j3) is to assume the non-utf-8 string is ISO-8859-1 and convert it to Unicode; when decoding, it will be converted back from UTF-8 to ISO-8859-1 with no loss of information.

For (j3) or (j4) there will need to be a way to signal that the string should not be interpreted as UTF-8 in JSON. One idea I have is to start the string with a '/', as that can not be included in any filename.
Finally, there is always the option to just force UTF-8 (and maybe even a more restrictive set, as I think @mib-kd743naq wants), but then there will be filenames that can not be represented in an IPLD Unixfs and some other format will likely be needed for backing up the filesystem.
@whyrusleeping @Stebalien others, thoughts?
[Edited to include option (4) for JSON encoding.]