Implement path validations #1740

willglynn · 2015-09-22T01:07:20Z

Per my proposal in #1710:

- Paths must be valid UTF-8 per RFC 3629.
- Paths may not contain ASCII/Unicode C0 control characters (U+0000-U+001F).
- Paths may not contain ASCII DEL (U+007F).
- Paths are delimited by `/` (U+002F), and therefore path segments may not contain it.
- Path segments may contain up to 255 Unicode codepoints. Total path length remains unbounded.
- Path segments may not be empty, so that `foo//bar` can mean `foo/bar`, as in POSIX.
- Path segments must not be `.` or `..`, so that these can mean what they do in POSIX.

Paths may contain any sequence of Unicode codepoints that are not otherwise prohibited. This includes many things that could prove problematic; see path/validation_test.go +121 for some examples.
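
For readers who want to see the rules above in executable form, here is a minimal sketch of a per-segment check. It is not the code from this PR: `validateSegment` and `maxSegmentRunes` are hypothetical names, and the sketch assumes the path has already been split on `/`.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// maxSegmentRunes is the per-segment limit from the rule list (hypothetical name).
const maxSegmentRunes = 255

// validateSegment is a hypothetical helper, not the PR's implementation.
// It checks one already-split path segment against the listed rules.
func validateSegment(seg string) error {
	switch {
	case seg == "":
		return errors.New("empty segment")
	case seg == "." || seg == "..":
		return errors.New("reserved segment")
	case !utf8.ValidString(seg):
		return errors.New("invalid UTF-8")
	case utf8.RuneCountInString(seg) > maxSegmentRunes:
		return errors.New("segment longer than 255 codepoints")
	}
	for _, r := range seg {
		switch {
		case r <= 0x1F:
			return fmt.Errorf("C0 control character U+%04X", r)
		case r == 0x7F:
			return errors.New("ASCII DEL (U+007F)")
		case r == '/':
			return errors.New("segment contains the delimiter")
		}
	}
	return nil
}

func main() {
	for _, seg := range []string{"foo", "", "..", "a\x00b", "caf\u00e9"} {
		fmt.Printf("%q: %v\n", seg, validateSegment(seg))
	}
}
```

The length check counts codepoints rather than bytes, matching the wording of the rule. Everything not explicitly rejected is accepted, including the problematic cases mentioned above.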

License: MIT

whyrusleeping · 2015-09-22T01:09:40Z

path/path_test.go

+
+func TestPathCleaning(t *testing.T) {
+	cases := map[string]string{
+		// .. should strip the preceding path segment


i'm not in favor of special casing the names of links like this.

This is the same logic as POSIX pathname resolution and relative URI resolution.

User-specified objects named . and .. would be un-addressable in most contexts already, and even if they were addressable in some contexts, it would be very confusing if they had a different meaning than in POSIX and in URIs.

Maybe this should be pushed to ipfs/go-ipld?

I can see similar code being useful in go-ipld, but I think github.com/ipfs/go-ipfs/path needs to resolve . and .. references on parse.

This is required for URIs – even absolute URIs – so web servers including Go's net/http perform dot-related normalization automatically. This means that go-ipfs will generally not see paths including . and .. when referenced via a URI, whether http: or fs: or file:, rendering these paths equivalent for most users already. POSIX expects . and .. to have specific meanings which can be satisfied using this logic. Rob Pike's Lexical File Names in Plan 9 or Getting Dot-Dot Right is also applicable to IPFS, describing detailed rationale for treating Plan 9's non-hierarchical paths in this manner, stating in part:

The ability to construct union directories and other intricate naming structures introduces some thorny problems: as with symbolic links, the name space is no longer hierarchical, files and directories can have multiple names, and the meaning of .., the parent directory, can be ambiguous.

The meaning of .. is straightforward if the directory is in a locally hierarchical part of the name space, but if we ask what .. should identify when the current directory is a mount point or union directory or multiply symlinked spot (which we will henceforth call just a mount point, for brevity), there is no obvious answer. Name spaces have been part of Plan 9 from the beginning, but the definition of .. has changed several times as we grappled with this issue.
…
Frustrated by this situation, and eager to have better-defined names for some of the applications described later in this paper, we recently proposed the following definition for ..:
The parent of a directory X, X/.., is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element of X.

Why should string /ipfs/base58hash/foo/../bar not be immediately resolved to Path /ipfs/base58hash/bar? Are we entertaining the idea that it could have any other interpretation?
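
As an illustration of the lexical rule quoted above, Go's standard `path.Clean` already behaves this way (its documentation cites the same Pike paper). The sketch below is not the PR's parser; `cleanPath` is a hypothetical wrapper used only for the example.

```go
package main

import (
	"fmt"
	"path"
)

// cleanPath is a hypothetical wrapper around path.Clean, which resolves
// "." and ".." lexically and collapses repeated slashes.
func cleanPath(p string) string {
	return path.Clean(p)
}

func main() {
	fmt.Println(cleanPath("/ipfs/base58hash/foo/../bar")) // /ipfs/base58hash/bar
	fmt.Println(cleanPath("/ipfs/base58hash/./foo//bar")) // /ipfs/base58hash/foo/bar
}
```

One thing a real parser would still have to decide is what to do when `..` climbs past the hash: `path.Clean` turns `/ipfs/base58hash/../other` into `/ipfs/other`, which may or may not be the desired behavior.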

agreed @willglynn -- and agreed with plan9 folks:

The parent of a directory X, X/.., is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element of X.

So this basically means that .. is relative to the access path, correct? (this makes perfect sense to me)

@whyrusleeping what reservations do you have re . and .. matching unix/posix and the web expectations?

jbenet · 2015-09-23T11:32:32Z

path/path.go

+				return err
+			}
+		}
+	*/


maybe remove this commented out stuff, as it is part of ParsePath ?

jbenet · 2015-09-27T06:43:16Z

@willglynn what about these:

- what about the C1 set? should we remove it too?
- perhaps we should trim whitespace from both ends?
- what about these https://github.com/willglynn/go-ipfs/blob/path_validation/path/validation_test.go#L128-L142 ? should we really allow them?

wonder if there's some plan9 code for this that we can leverage. i imagine they ran through the same problems. (i emailed them btw. rsc still strongly agrees with our UTF-8 convictions.)

willglynn · 2015-09-29T16:51:36Z

I think excluding C0 (U+0000-U+001F, U+007F) is critical because of the volume of (mostly C) code that treats UTF-8 as if it were ASCII. NUL, ESC, BEL, BS, DEL etc. all get interpreted in undesirable ways by lots and lots of software.

Excluding C1 (U+0080-U+009F) seems less critical. They're Unicode codepoints that most things permit without exploding. (See UTF-EBCDIC for the major exception.) There's a case for excluding them – they are control characters, and control characters probably don't belong in paths – but each thing we exclude is another rule, and… well, rules get heavy.

I made this list and pushed early to prompt discussion with executable code. They're all things that maybe shouldn't be permitted, but that would be permitted by the rules in the accompanying commit message.

Combining marks tend to be paired, but they are legal codepoints in isolation. I think we should permit that, however weird it might be, mostly because rules to prevent that case seem like they'd have to be too smart for our own good.

Whitespace-related rules are a slippery slope. Prohibiting leading and trailing whitespace seems reasonable, but then why not interior whitespace too? Should multiple spaces be collapsed? Does this apply to all visually-blank characters, or just space? Which codepoints are considered whitespace, and is that context-dependent? (After all, the zero-width joiner exists because it actually does make a difference in certain languages.) It might be possible to formulate precise rules that accomplish a certain goal relating to whitespace, but let's start by defining that goal.

The zero-width joiner is fun because foo and fo\u200Do will be rendered identically even though they are different strings. This character has the advantage of being able to turn almost any string into another visually-identical string, but realistically we have the exact same problem in other contexts as long as we don't have some kind of Unicode normalization. I think we have to accept that paths will have this problem, which means accepting things like paths that contain the zero-width joiner.

Bidirectional text is a can of worms. I favor permitting bidi-related codepoints, largely because anything that understands them already has to deal with loads of complexity, much of which is described in annex 9. There are many other characters that affect text flow besides the one in the test case, and copy/pasting fragments of bidi text already causes most of these problems, so I don't think IPFS needs to be smart here.

Private use characters are legal for interchange, and they tend not to be prohibited by data formats. They don't mean anything, but… use them if you like. That seems like a reasonable position. Prohibiting them would allow downstream applications to use private use characters for internal purposes, which would also be a reasonable position, but this would require another rule.

Non-characters maybe should be excluded. There's no rule excluding them because I wasn't sure it's necessary, and keeping the rules short seems valuable. However, some of them used to be illegal for interchange, and some of them can be confusing when mapped to UTF-16, so this is probably a bigger deal than the C1 control codes causing problems in UTF-EBCDIC. If we're going to exclude one more class of things, I'd nominate excluding non-characters first.
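
If C1 controls or non-characters were excluded, the checks themselves would be short. The sketch below uses hypothetical helper names (`isC1`, `isNoncharacter`) and relies on the Unicode definition of the 66 noncharacters: U+FDD0-U+FDEF plus the last two codepoints of each of the 17 planes.

```go
package main

import "fmt"

// isC1 reports whether r is a C1 control character (U+0080-U+009F).
func isC1(r rune) bool {
	return r >= 0x80 && r <= 0x9F
}

// isNoncharacter reports whether r is one of the 66 Unicode noncharacters:
// U+FDD0-U+FDEF, or a codepoint whose low 16 bits are FFFE or FFFF.
func isNoncharacter(r rune) bool {
	if r >= 0xFDD0 && r <= 0xFDEF {
		return true
	}
	return r >= 0 && r <= 0x10FFFF && r&0xFFFE == 0xFFFE
}

func main() {
	for _, r := range []rune{0x85, 0xFDD0, 0xFFFE, 0x1FFFF, 0xFFFD, 'a'} {
		fmt.Printf("U+%04X c1=%v nonchar=%v\n", r, isC1(r), isNoncharacter(r))
	}
}
```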

jbenet · 2015-10-02T21:56:16Z

Excluding C1 (U+0080-U+009F) seems less critical...There's a case for excluding them – they are control characters, and control characters probably don't belong in paths

Yeah, i'd vote for excluding them still.

Combining marks tend to be paired, but they are legal codepoints in isolation. I think we should permit that, however weird it might be, mostly because rules to prevent that case seem like they'd have to be too smart for our own good.

No comment-- do not understand all the implications well enough.

Whitespace-related rules are a slippery slope. Prohibiting leading and trailing whitespace seems reasonable, but then why not interior whitespace too?

I think plenty of things do this though so it's not rare/unexpected.

Should multiple spaces be collapsed? Does this apply to all visually-blank characters, or just space? Which codepoints are considered whitespace, and is that context-dependent? (After all, the zero-width joiner exists because it actually does make a difference in certain languages.) It might be possible to formulate precise rules that accomplish a certain goal relating to whitespace, but let's start by defining that goal.

Oooooof. no idea here. What i do know is that paths represent critical information to people and being able to recognize URLs matters. Imagine UTF-8 domain names where the security depends on recognition of a name... This may be fraught, but i'd love to see how others have tackled the problem first.

The zero-width joiner is fun because foo and fo\u200Do will be rendered identically even though they are different strings. This character has the advantage of being able to turn almost any string into another visually-identical string, but realistically we have the exact same problem in other contexts as long as we don't have some kind of Unicode normalization. I think we have to accept that paths will have this problem, which means accepting things like paths that contain the zero-width joiner.

I don't think this is tenable, and worry about the implications. I don't think it's tenable because paths will continue to represent security implications. Today, name recognition is still responsible for stopping sophisticated phishing attacks.

Bidirectional text is a can of worms. I favor permitting bidi-related codepoints, largely because anything that understands them already has to deal with loads of complexity, much of which is described in annex 9. There are many other characters that affect text flow besides the one in the test case, and copy/pasting fragments of bidi text already causes most of these problems, so I don't think IPFS needs to be smart here.

No comment-- do not understand all the implications well enough.

Private use characters are legal for interchange, and they tend not to be prohibited by data formats. They don't mean anything, but… use them if you like. That seems like a reasonable position. Prohibiting them would allow downstream applications to use private use characters for internal purposes, which would also be a reasonable position, but this would require another rule.

I think it's fine to use them.

Non-characters maybe should be excluded. There's no rule excluding them because I wasn't sure it's necessary, and keeping the rules short seems valuable. However, some of them used to be illegal for interchange, and some of them can be confusing when mapped to UTF-16, so this is probably a bigger deal than the C1 control codes causing problems in UTF-EBCDIC. If we're going to exclude one more class of things, I'd nominate excluding non-characters first.

Sounds good to me. let's exclude them.

I think our best bet is to look at what plan9 did. I'm sure they had something very good and sane.

The source is available here: https://github.com/wangeguo/plan9 but i haven't found the path parsing/validating.

willglynn · 2015-10-02T22:19:27Z

IRIs are defined by RFC 3987 and address many of the concerns in this problem space. IRIs require normalization and specify special handling for bidi text. This processing is more complex than the Unicode awareness in either HFS+ or NTFS, but might be workable in this context.

IDNs have spoofing problems despite careful attention being paid to that topic. Current references are RFC 5890, 5891, 5892, 5893. Also, just like HFS+ had to rewrite names with a fsck to update their Unicode rules, IDNA2003 and IDNA2008 treat names differently, causing the same set of codepoints to map to different strings under the hood. This is an outcome that I believe must be avoided in "the permanent web".

Another reference that might be useful is Unicode Technical Report 36, Unicode Security Considerations.

More generally, there's tremendous conflict between "try not to get in the way" and "try to keep users from getting harmed by names that look similar but are different under the hood". Given the focus on content-addressability and general acceptance of names that aren't useful except by following links or copy/pasting, I think trying not to get in the way is the better tradeoff.

jbenet · 2015-10-04T08:16:15Z

I don't think we're close at all to security not mattering in visually-confirmed {paths, names, logos}. Even when a PKI is employed, users still have to visually recognize {paths, names, logos}. 😞

willglynn · 2015-10-04T22:42:56Z

Hmm.

IPFS paths begin with a multihash, which is both base58 (thus shrugging off all Unicode concerns) and difficult to spoof (i.e. to create a multihash that appears similar to an existing one). An individual publisher has full control and authority to publish anything, malicious or benign, within the /ipfs/base58/ corresponding to their published content. Does it matter if paths within that /ipfs/base58/ appear similar?

For the case of /ipns/domain.name (or /dns/domain.name), there are additional concerns, but I think that's a separate topic from the global namespace, and thus out of scope here. DNS-related paths need smarts specific to DNS. For example, some component needs to immediately downcase "paypaI.com" to "paypai.com" in a way that the user sees, just like browsers' address bars do, or it needs to reject mixed-case domain references outright. (HTTP gateways should maybe 301, while FUSE needs to ENOENT to avoid silent aliasing.) DNS is a context where I agree that spoof prevention is important, but it also has plenty of precedent to follow. Best case, security issues related to DNS will affect both the web and IPFS in equal measure.
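
The two options mentioned above (downcase visibly, or reject mixed case outright) can be sketched as follows. Both helpers are hypothetical and cover only the case-folding step; IDNA processing is out of scope for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// normalizeDomain is a hypothetical helper: it lowercases a domain name so
// the caller can show the canonical form to the user.
func normalizeDomain(name string) string {
	return strings.ToLower(name)
}

// rejectMixedCase is the stricter alternative: refuse anything that is not
// already in lowercase form.
func rejectMixedCase(name string) error {
	if name != strings.ToLower(name) {
		return errors.New("domain must be lowercase: " + name)
	}
	return nil
}

func main() {
	fmt.Println(normalizeDomain("paypaI.com")) // paypai.com
	fmt.Println(rejectMixedCase("paypaI.com"))
	fmt.Println(rejectMixedCase("paypal.com"))
}
```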

I don't understand the threat you have in mind. Are you suggesting that the entire IPFS namespace needs the same kind of protection as domain names, even in user-controlled (i.e. attacker-controlled) parts of the hierarchy?

jbenet · 2015-10-09T19:06:16Z

👍 to will's points above.

For example, some component needs to immediately downcase "paypaI.com" to
"paypai.com" in a way that the user sees, just like browsers' address
bars do, or it needs to reject mixed-case domain references outright. (HTTP
gateways should maybe 301, while FUSE needs to ENOENT to avoid silent
aliasing

yep!

I don't understand the threat you have in mind. Are you suggesting that
the entire IPFS namespace needs the same kind of protection as domain
names, even in user-controlled (i.e. attacker-controlled) parts of the
hierarchy?

In this case I'm worried about names like

real: /ipns/some.app/bankofamerica
phishing: /ipns/some.app/bankofamerìca

where there's some application you're accessing, and it gives users
ability to add names into a namespace, and that namespace is represented as
paths on IPFS so they just delegate to IPFS, and the user can make path
components that look almost or exactly the same with UTF-8.

willglynn · 2015-10-09T19:42:59Z

where there's some application you're accessing, and it gives users ability to add names into a namespace, and that namespace is represented as paths on IPFS so they just delegate to IPFS, and the user can make path components that look almost or exactly the same with UTF-8.

Perhaps the path-related protections should be made pluggable, then. We agree that /dns and /ipns need DNS-specific handling – path components must be all lowercase, transformed and constrained by whatever the current IDNA rules are. Maybe the node at /ipns/some.app should be able to specify that its direct descendants are also domains, or constrain them to being printable ASCII, or constrain them to the base58 character set.

Still, I don't see that this concern belongs in the global namespace or in the definition of an IPFS path. bankofamerìca is a legal path component in essentially every other filesystem, and it should be representable in IPFS. What's more, I don't see any obvious way to prohibit that without also prohibiting most accented words in most other languages sharing Latin roots – and if that's on the table, we should pick 7-bit ASCII and be done.
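
One way to picture the pluggable idea is as a per-namespace predicate on child segment names. This is purely an illustrative sketch: `SegmentPolicy`, `printableASCII`, and `base58Only` are hypothetical names, and nothing like this exists in the PR. The alphabet constant is the Bitcoin-style base58 alphabet, assumed here to be the one meant by "the base58 character set".

```go
package main

import (
	"fmt"
	"strings"
)

// SegmentPolicy is a hypothetical per-namespace rule applied to the names
// of a node's direct descendants.
type SegmentPolicy func(seg string) bool

// base58Alphabet omits 0, O, I and l (Bitcoin-style alphabet, assumed).
const base58Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// printableASCII accepts non-empty segments made only of U+0020-U+007E.
func printableASCII(seg string) bool {
	for i := 0; i < len(seg); i++ {
		if seg[i] < 0x20 || seg[i] > 0x7E {
			return false
		}
	}
	return seg != ""
}

// base58Only accepts non-empty segments drawn from base58Alphabet.
func base58Only(seg string) bool {
	for _, r := range seg {
		if !strings.ContainsRune(base58Alphabet, r) {
			return false
		}
	}
	return seg != ""
}

func main() {
	policies := map[string]SegmentPolicy{
		"printable ASCII": printableASCII,
		"base58":          base58Only,
	}
	for name, ok := range policies {
		fmt.Println(name, ok("bankofamerica"), ok("bankofamer\u00ecca"))
	}
}
```

Under either policy the accented name is rejected while the plain one passes, but only within the namespace that opted in; the global path definition stays permissive.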

jbenet · 2015-10-11T15:48:36Z

fair point

whyrusleeping · 2016-01-29T17:27:41Z

closing due to inactivity, please reopen as needed.

willglynn · 2016-01-29T17:42:01Z

I'm unclear on what needs to happen to get this (or something like it) merged.

dohues · 2017-07-27T15:46:03Z

Is @willglynn's proposal the current link name standard of IPFS, or has something changed since? This is quite important information to have when building a project based on IPFS.

jbenet added the backlog label Sep 22, 2015

willglynn force-pushed the path_validation branch from 29a6760 to 4b1c6d0 Compare September 22, 2015 01:07

willglynn force-pushed the path_validation branch from 4b1c6d0 to b8b0870 Compare September 22, 2015 01:08

whyrusleeping reviewed Sep 22, 2015

willglynn mentioned this pull request Sep 22, 2015

IPFS permits undesirable paths #1710

Open

jbenet reviewed Sep 23, 2015

rht mentioned this pull request Nov 28, 2015

ipfs get cross platform paths #2013

Open

ghost added need_signoff need/community-input Needs input from the wider community labels Dec 22, 2015

daviddias removed the backlog label Jan 2, 2016

whyrusleeping closed this Jan 29, 2016

djdv mentioned this pull request May 18, 2016

'Path' type for commands lib Arguments #2721

Open
