Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement path validations #1740

Closed
wants to merge 1 commit into from
Closed

Conversation

willglynn
Copy link

Per my proposal in #1710:

  • Paths must be valid UTF-8 per RFC 3629.
  • Paths may not contain ASCII/Unicode C0 control characters
    (U+0000-U+001F).
  • Paths may not contain ASCII DEL (U+007F).
  • Paths are delimited by / (U+002F), and therefore path segments may not
    contain it.
  • Path segments may contain up to 255 Unicode codepoints. Total path
    length remains unbounded.
  • Path segments may not be empty, so that foo//bar can mean foo/bar, as
    in POSIX.
  • Path segments must not be . and .., so that these can mean what they
    do in POSIX.

Paths may contain any sequence of Unicode codepoints that are not otherwise prohibited. This includes many things that could prove problematic; see path/validation_test.go +121 for some examples.

Per my proposal in ipfs#1710:
  - Paths must be valid UTF-8 per RFC 3629.
  - Paths may not contain ASCII/Unicode C0 control characters
    (U+0000-U+001F).
  - Paths may not contain ASCII DEL (U+007F).
  - Paths are delimited by `/` (U+002F), and therefore path segments may not
    contain it.
  - Path segments may contain up to 255 Unicode codepoints. Total path
    length remains unbounded.
  - Path segments may not be empty, so that `foo//bar` can mean `foo/bar`, as
    in POSIX.
  - Path segments must not be `.` and `..`, so that these can mean what they
    do in POSIX.

Paths may contain any sequence of Unicode codepoints that are not otherwise
prohibited. This includes many things that could prove problematic; see
path/validation_test.go +121 for some examples.

License: MIT

func TestPathCleaning(t *testing.T) {
cases := map[string]string{
// .. should strip the preceding path segment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not in favor of special casing the names of links like this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic as POSIX pathname resolution and relative URI resolution.

User-specified objects named . and .. would be un-addressable in most contexts already, and even if they were addressable in some contexts, it would be very confusing if they had a different meaning than in POSIX and in URIs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be pushed to ipfs/go-ipld?


Sent from Mailbox

On Mon, Sep 21, 2015 at 6:16 PM, Will Glynn notifications@github.com
wrote:

@@ -28,3 +28,27 @@ func TestPathParsing(t *testing.T) {
}
}
}
+
+func TestPathCleaning(t *testing.T) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see similar code being useful in go-ipld, but I think github.com/ipfs/go-ipfs/path needs to resolve . and .. references on parse.

This is required for URIs – even absolute URIs – so web servers including Go's net/http perform dot-related normalization automatically. This means that go-ipfs will generally not see paths including . and .. when referenced via a URI, whether http: or fs: or file:, rendering these paths equivalent for most users already. POSIX expects . and .. to have specific meanings which can be satisfied using this logic. Rob Pike's Lexical File Names in Plan 9 or Getting Dot-Dot Right is also applicable to IPFS, describing detailed rationale for treating Plan 9's non-hierarchical paths in this manner, stating in part:

The ability to construct union directories and other intricate naming structures introduces some thorny problems: as with symbolic links, the name space is no longer hierarchical, files and directories can have multiple names, and the meaning of .., the parent directory, can be ambiguous.

The meaning of .. is straightforward if the directory is in a locally hierarchical part of the name space, but if we ask what .. should identify when the current directory is a mount point or union directory or multiply symlinked spot (which we will henceforth call just a mount point, for brevity), there is no obvious answer. Name spaces have been part of Plan 9 from the beginning, but the definition of .. has changed several times as we grappled with this issue.

Frustrated by this situation, and eager to have better-defined names for some of the applications described later in this paper, we recently proposed the following definition for ..:
The parent of a directory X, X/.., is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element of X.

Why should string /ipfs/base58hash/foo/../bar not be immediately resolved to Path /ipfs/base58hash/bar? Are we entertaining the idea that it could have any other interpretation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed @willglynn -- and agreed with plan9 folks:

The parent of a directory X, X/.., is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element of X.

So this basically means that .. is relative to the access path, correct? (this makes perfect sense to me)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping what reservations do you have re . and .. matching unix/posix and the web expectations?

return err
}
}
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove this commented out stuff, as it is part of ParsePath ?

@jbenet
Copy link
Member

jbenet commented Sep 27, 2015

@willglynn what about these:

wonder if there's some plan9 code for this that we can leverage. i imagine they ran through the same problems. (i emailed them btw. rsc still strongly agrees with our UTF-8 convictions.)

@willglynn
Copy link
Author

I think excluding C0 (U+0000-U+001F, U+007F) is critical because of the volume of (mostly C) code that treats UTF-8 as if it were ASCII. NUL, ESC, BEL, BS, DEL etc. all get interpreted in undesirable ways by lots and lots of software.

Excluding C1 (U+0080-U+009F) seems less critical. They're Unicode codepoints that most things permit without exploding. (See UTF-EBCDIC for the major exception.) There's a case for excluding them – they are control characters, and control characters probably don't belong in paths – but each thing we exclude is another rule, and… well, rules get heavy.

I made this list and pushed early to prompt discussion with executable code. They're all things that maybe shouldn't be permitted, but that would be permitted by the rules in the accompanying commit message.

Combining marks tend to be paired, but they are legal codepoints in isolation. I think we should permit that, however weird it might be, mostly because rules to prevent that case seem like they'd have to be too smart for our own good.

Whitespace-related rules are a slippery slope. Prohibiting leading and trailing whitespace seems reasonable, but then why not interior whitespace too? Should multiple spaces be collapsed? Does this apply to all visually-blank characters, or just space? Which codepoints are considered whitespace, and is that context-dependent? (After all, the zero-width joiner exists because it actually does make a difference in certain languages.) It might be possible to formulate precise rules that accomplish a certain goal relating to whitespace, but let's start by defining that goal.

The zero-width joiner is fun because foo and fo\u200Do will be rendered identically even though they are different strings. This character has the advantage of being able to turn almost any string into another visually-identical string, but realistically we have the exact same problem in other contexts as long as we don't have some kind of Unicode normalization. I think we have to accept that paths will have this problem, which means accepting things like paths that contain the zero-width joiner.

Bidirectional text is a can of worms. I favor permitting bidi-related codepoints, largely because anything that understands them already has to deal with loads of complexity, much of which is described in annex 9. There are many other characters that affect text flow besides the one in the test case, and copy/pasting fragments of bidi text already causes most of these problems, so I don't think IPFS needs to be smart here.

Private use characters are legal for interchange, and they tend not to be prohibited by data formats. They don't mean anything, but… use them if you like. That seems like a reasonable position. Prohibiting them would allow downstream applications to use private use characters for internal purposes, which would also be a reasonable position, but this would require another rule.

Non-characters maybe should be excluded. There's no rule excluding them because I wasn't sure it's necessary, and keeping the rules short seems valuable. However, some of them used to be illegal for interchange, and some of them can be confusing when mapped to UTF-16, so this is probably a bigger deal than the C1 control codes causing problems in UTF-EBCDIC. If we're going to exclude one more class of things, I'd nominate excluding non-characters first.

@jbenet
Copy link
Member

jbenet commented Oct 2, 2015

Excluding C1 (U+0080-U+009F) seems less critical...There's a case for excluding them – they are control characters, and control characters probably don't belong in paths

Yeah, i'd vote for excluding them still.

Combining marks tend to be paired, but they are legal codepoints in isolation. I think we should permit that, however weird it might be, mostly because rules to prevent that case seem like they'd have to be too smart for our own good.

No comment-- do not understand all the implications well enough.

Whitespace-related rules are a slippery slope. Prohibiting leading and trailing whitespace seems reasonable, but then why not interior whitespace too?

I think plenty of things do this though so it's not rare/unexpected.

Should multiple spaces be collapsed? Does this apply to all visually-blank characters, or just space? Which codepoints are considered whitespace, and is that context-dependent? (After all, the zero-width joiner exists because it actually does make a difference in certain languages.) It might be possible to formulate precise rules that accomplish a certain goal relating to whitespace, but let's start by defining that goal.

Oooooof. no idea here. What i do know is that paths represent critical information to people and being able to recognize URLs matters. Imagine UTF-8 domain names where the security depends on recognition of a name... This may be fraught, but i'd love to see how others have tackled the problem first.

The zero-width joiner is fun because foo and fo\u200Do will be rendered identically even though they are different strings. This character has the advantage of being able to turn almost any string into another visually-identical string, but realistically we have the exact same problem in other contexts as long as we don't have some kind of Unicode normalization. I think we have to accept that paths will have this problem, which means accepting things like paths that contain the zero-width joiner.

I don't think this is tenable, and worry about the implications. I don't think it's tenable because paths will continue to represent security implications. Today, name recognition is still responsible for stopping sophisticated phishing attacks.

Bidirectional text is a can of worms. I favor permitting bidi-related codepoints, largely because anything that understands them already has to deal with loads of complexity, much of which is described in annex 9. There are many other characters that affect text flow besides the one in the test case, and copy/pasting fragments of bidi text already causes most of these problems, so I don't think IPFS needs to be smart here.

No comment-- do not understand all the implications well enough.

Private use characters are legal for interchange, and they tend not to be prohibited by data formats. They don't mean anything, but… use them if you like. That seems like a reasonable position. Prohibiting them would allow downstream applications to use private use characters for internal purposes, which would also be a reasonable position, but this would require another rule.

I think it's fine to use them.

Non-characters maybe should be excluded. There's no rule excluding them because I wasn't sure it's necessary, and keeping the rules short seems valuable. However, some of them used to be illegal for interchange, and some of them can be confusing when mapped to UTF-16, so this is probably a bigger deal than the C1 control codes causing problems in UTF-EBCDIC. If we're going to exclude one more class of things, I'd nominate excluding non-characters first.

Sounds good to me. let's exclude them.


I think our best bet is to look at what plan9 did. I'm sure they had something very good and sane.

The source is available here: https://github.com/wangeguo/plan9 but i haven't found the path parsing/validating.

@willglynn
Copy link
Author

IRIs are defined by RFC 3987 and addresses many of the concerns in this problem space. IRIs require normalization and specifies special handling for bidi text. This processing is more complex than the Unicode awareness in either HFS+ and NTFS, but might be workable in this context.

IDNs have spoofing problems despite careful attention being paid to that topic. Current references are RFC 5890, 5891, 5892, 5893. Also, just like HFS+ had to rewrite names with a fsck to update their Unicode rules, IDNA2003 and IDNA2008 treat names differently, causing the same set of codepoints to map to different strings under the hood. This is an outcome that I believe must be avoided in "the permanent web".

Another reference that might be useful is Unicode Technical Report 36, Unicode Security Considerations.

More generally, there's tremendous conflict between "try not to get in the way" and "try to keep users from getting harmed by names that look similar but are different under the hood". Given the focus on content-addressability and general acceptance of names that aren't useful except by following links or copy/pasting, I think trying not to get in the way is the better tradeoff.

@jbenet
Copy link
Member

jbenet commented Oct 4, 2015

I don't think we're close at all to security not mattering in visually-confirmed {paths, names, logos}. even when a PKI is employed, still have to visually recognize {paths, names, logos}. 😞

@willglynn
Copy link
Author

Hmm.

IPFS paths begin with a multihash, which is both base58 (thus shrugging off all Unicode concerns) and difficult to spoof (i.e. to create a multihash that appears similar to an existing one). An individual publisher has full control and authority to publish anything, malicious or benign, within the /ipfs/base58/ corresponding to their published content. Does it matter if paths within that /ipfs/base58/ appear similar?

For the case of /ipns/domain.name (or /dns/domain.name), there are additional concerns, but I think that's a separate topic from the global namespace, and thus out of scope here. DNS-related paths need smarts specific to DNS. For example, some component needs to immediately downcase "paypaI.com" to "paypai.com" in a way that the user sees, just like browsers' address bars do, or it needs to reject mixed-case domain references outright. (HTTP gateways should maybe 301, while FUSE needs to ENOENT to avoid silent aliasing.) DNS is a context where I agree that spoof prevention is important, but it also has plenty of precedent to follow. Best case, security issues related to DNS will affect both the web and IPFS in equal measure.

I don't understand the threat you have in mind. Are you suggesting that the entire IPFS namespace needs the same kind of protection as domain names, even in user-controlled (i.e. attacker-controlled) parts of the hierarchy?

@jbenet
Copy link
Member

jbenet commented Oct 9, 2015

👍 to will's points above.

For example, some component needs to immediately downcase "paypaI.com" to
"paypai.com" in a way that the user sees, just like browsers' address
bars do, or it needs to reject mixed-case domain references outright. (HTTP
gateways should maybe 301, while FUSE needs to ENOENT to avoid silent
aliasing

yep!

I don't understand the threat you have in mind. Are you suggesting that
the entire IPFS namespace needs the same kind of protection as domain
names, even in user-controlled (i.e. attacker-controlled) parts of the
hierarchy?

In this case I'm worried about names like

  • real: /ipns/some.app/bankofamerica
  • phishing: /ipns/some.app/bankofamerìca

where there's some application you're accessing, and it gives users
ability to add names into a namespace, and that namespace is represented as
paths on IPFS so they just delegate to IPFS, and the user can make path
components that look almost or exactly the same with UTF-8.

On Sunday, October 4, 2015, Will Glynn notifications@github.com wrote:

Hmm.

IPFS paths begin with a multihash, which is both base58 (thus shrugging
off all Unicode concerns) and difficult to spoof (i.e. to create a
multihash that appears similar to an existing one). An individual publisher
has full control and authority to publish anything, malicious or benign,
within the /ipfs/base58/ corresponding to their published content. Does
it matter if paths within that /ipfs/base58/ appear similar?

For the case of /ipns/domain.name (or /dns/domain.name), there are
additional concerns, but I think that's a separate topic from the global
namespace, and thus out of scope here. DNS-related paths need smarts
specific to DNS. For example, some component needs to immediately downcase
"paypaI.com" to "paypai.com" in a way that the user sees, just like
browsers' address bars do, or it needs to reject mixed-case domain
references outright. (HTTP gateways should maybe 301, while FUSE needs to
ENOENT to avoid silent aliasing.) DNS is a context where I agree that
spoof prevention is important, but it also has plenty of precedent to
follow. Best case, security issues related to DNS will affect both the web
and IPFS in equal measure.

I don't understand the threat you have in mind. Are you suggesting that
the entire IPFS namespace needs the same kind of protection as domain
names, even in user-controlled (i.e. attacker-controlled) parts of the
hierarchy?


Reply to this email directly or view it on GitHub
#1740 (comment).

@willglynn
Copy link
Author

where there's some application you're accessing, and it gives users ability to add names into a namespace, and that namespace is represented as paths on IPFS so they just delegate to IPFS, and the user can make path components that look almost or exactly the same with UTF-8.

Perhaps the path-related protections should be made pluggable, then. We agree that /dns and /ipns need DNS-specific handling – path components must be all lowercase, transformed and constrained by whatever the current IDNA rules are. Maybe the node at /ipns/some.app should be able to specify that its direct descendants are also domains, or constrain them to being printable ASCII, or constrain them to the base58 character set.

Still, I don't see that this concern belongs in the global namespace or in the definition of an IPFS path. bankofamerìca is a legal path component in essentially every other filesystem, and it should be representable in IPFS. What's more, I don't see any obvious way to prohibit that without also prohibiting most accented words in most other languages sharing Latin roots – and if that's on the table, we should pick 7-bit ASCII and be done.

@jbenet
Copy link
Member

jbenet commented Oct 11, 2015

fair point

@ghost ghost added need_signoff need/community-input Needs input from the wider community labels Dec 22, 2015
@daviddias daviddias removed the backlog label Jan 2, 2016
@whyrusleeping
Copy link
Member

closing due to inactivity, please reopen as needed.

@willglynn
Copy link
Author

I'm unclear on what needs to happen to get this (or something like it) merged.

@dohues
Copy link

dohues commented Jul 27, 2017

Is @willglynn 's proposal the current link name standard of IPFS, now, or has something changed later? It is a quit important information to have during the construction of a project is based on IPFS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/community-input Needs input from the wider community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants