
Ban %00 from Serialized DisplayString #2615

Closed
bsdphk opened this issue Aug 17, 2023 · 14 comments · Fixed by #2650

Comments

@bsdphk
Contributor

bsdphk commented Aug 17, 2023

I propose we ban %00 from serialized DisplayString, in order to avoid a large class of security issues (and to simplify implementations) in C/POSIX environments where 0x00 is the string terminator.

I can see three alternatives here.

Ban %00 outright

In the serialization of Display String we add:

if byte is %x00, fail serializing.

In the parsing of Display String we add:

if octet is zero, fail parsing.

If either end uses a strict UTF-8 encoder or decoder, which does not allow "overlong byte sequences" (see below), this prevents U+0000 NULL from being transported in a Display String.
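As a sketch of this first alternative: Display String serialization percent-encodes the UTF-8 bytes, so the ban amounts to one extra failure case. The function name and exact encoding rules below are illustrative, loosely following the draft's serialization algorithm, not a normative implementation:

```python
def serialize_display_string(s: str) -> str:
    # Illustrative sketch of alternative 1: percent-encode the UTF-8
    # bytes of the string, failing outright on any 0x00 byte.
    out = ['%"']
    for byte in s.encode("utf-8"):
        if byte == 0x00:
            raise ValueError("0x00 byte not permitted in Display String")
        if byte in (0x25, 0x22) or not (0x20 <= byte <= 0x7E):
            out.append("%%%02x" % byte)  # lowercase hex, e.g. %c3
        else:
            out.append(chr(byte))
    out.append('"')
    return "".join(out)
```

For example, `serialize_display_string("füü")` yields `%"f%c3%bc%c3%bc"`, while any string containing U+0000 raises instead of serializing.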

Specify loose UTF-8 encoding & decoding

Java pioneered a workaround for U+0000 NULL by specifying that it must be UTF-8 serialized as 0xc0 0x80, a so-called "over-long UTF-8 byte sequence".

We have not presently specified whether the UTF-8 encoding of Display String is "strict", where such over-long byte sequences are illegal, or "non-strict", where all or some over-long byte sequences are accepted.

Over-long byte sequences in general are frowned upon, because they can be used to "obfuscate" UTF-8 strings, for instance by encoding a period as 0xc0 0xae to try to escape directory traversal checks.

We could change the spec to say that the UTF-8 encoding/decoding must be loose (enough) to handle U+0000 NULL the Java way, but I failed to identify a suitable normative reference which didn't bring in a lot of other Unicode/UTF-8 baggage we may not want.
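For concreteness, a strict UTF-8 decoder (Python's default codec is used here purely as an example of one) rejects both the over-long NUL and the over-long period mentioned above, while accepting the canonical one-byte encodings:

```python
# A strict decoder rejects over-long sequences: b"\xc0\x80" (over-long
# U+0000) and b"\xc0\xae" (over-long "."), even though a loose decoder
# might map them to U+0000 and U+002E respectively.
for overlong in (b"\xc0\x80", b"\xc0\xae"):
    try:
        overlong.decode("utf-8")
        print("accepted:", overlong)
    except UnicodeDecodeError:
        print("rejected:", overlong)

# The canonical encodings decode fine.
assert b"\x2e".decode("utf-8") == "."
```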

Optimistically serialize 0x00 as %c0%80

We can optimistically serialize any 0x00 bytes we encounter in the encoded UTF-8 byte array using the "Java trick" and leave it to the UTF-8 decoder to either reject or accept them as it sees fit.

In the serialization of DisplayString we add:

If byte is %x00, append "%c0%80" to encoded_string.

In the parsing of DisplayString we add:

if octet is zero, fail parsing.
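A sketch of this third alternative (again illustrative, with a made-up function name): the serializer substitutes the two-byte over-long pair for each 0x00, and everything else is encoded as usual:

```python
def serialize_with_java_trick(s: str) -> str:
    # Illustrative sketch of alternative 3: 0x00 bytes become "%c0%80"
    # on the wire; all other bytes follow ordinary Display String
    # percent-encoding.
    out = ['%"']
    for byte in s.encode("utf-8"):
        if byte == 0x00:
            out.append("%c0%80")  # the "Java trick"
        elif byte in (0x25, 0x22) or not (0x20 <= byte <= 0x7E):
            out.append("%%%02x" % byte)
        else:
            out.append(chr(byte))
    out.append('"')
    return "".join(out)
```

Whether the receiver then accepts or rejects the decoded 0xc0 0x80 pair depends on how strict its UTF-8 decoder is.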

Discussion

I'm a big fan of clear text and simple solutions, and I cannot imagine why anybody would ever try to send U+0000 NULL through a DisplayString for non-nefarious purposes, so I am 100% on board with simply banning 0x00 bytes in the encoded UTF-8 byte-array.

But it is a (tiny) loss of generality, and if we want to avoid that, we either make life difficult for implementers in C/POSIX based environments by rejecting this ticket, or we throw in our lot with Java's handling of U+0000 NULL to a greater or lesser degree.

Optimistically serializing 0x00, as in the third alternative, is a way to do that without tying ourselves to Java's mast, but it introduces some wiggle room which, depending on one's point of view, is either desirable (allowing the receiver to reject U+0000 NULL by using a strict UTF-8 decoder) or undesirable (making it anyone's guess whether U+0000 NULL can be transported in a Display String or not).

But Java's handling is a bit of a hack, and allowing (some) over-long UTF-8 byte sequences introduces a class of security issues similar to the one this ticket tries to prevent, so all in all, I think we should just ban %00 in serialized DisplayString with the first alternative above.

@kazuho
Contributor

kazuho commented Aug 17, 2023

IIUC, Display Strings aren't meant for sending byte arrays, therefore I do not think we need to define a special way for conveying NUL characters.

Based on that, IMO the question boils down to whether we should ban encoders from sending NUL.

I do not have a strong opinion on whether that should be forbidden, considering that, if the receiving application is susceptible to NUL characters, decoders have to be careful in their handling of NUL regardless of the encoding requirement.

Note that decoders are already expected to treat NUL characters specially if necessary, see Security Considerations.

Maybe we can just add a reference from the section describing the decoding logic to the Security Considerations, and call it a day.

@wtarreau

One possible point in favor of banning it: we know that many configurations act on regex pattern searches, and they might find a searched-for string after a %00 and conclude they found what they were looking for, while the implementation stops earlier, at the zero byte.

I think we should simply say that 0x00 must never be serialized, hence "%00" must never appear in a Display String. This implicitly allows end users to write pretty simple protection rules (even regex-based) consisting of blocking the three-character sequence "%00" in such fields. Encouraging blocking on the receiver side can be efficient enough to discourage implementations from even trying to purposely send it.
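A receiver-side rule of the kind suggested here could be as simple as a literal match on the three-character sequence (a sketch; the names are made up for illustration):

```python
import re

# Block any field value containing the literal three-character
# sequence "%00" -- under the proposed ban, the only way a zero
# byte could appear in a serialized Display String.
NUL_ESCAPE = re.compile(r"%00")

def contains_encoded_nul(field_value: str) -> bool:
    return NUL_ESCAPE.search(field_value) is not None
```

For example, `contains_encoded_nul('%"abc%00def"')` is true, while a clean field value matches nothing.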

@davidben
Contributor

davidben commented Aug 17, 2023

There is a cost to banning it, which is that the Unicode string to DisplayString encoding function is no longer complete. While C strings can't represent embedded NULs, all other languages' string types can. Those languages would then have a harder time writing a DisplayString API, because now encoding can fail. So we're trading off a nuisance for C vs. a nuisance for everyone else.

It is unclear to me what the second option buys us. The second option just means there are two ways to spell U+0000. A C decoder that uses a naive percent-decoded in-memory representation would still need to account for %00 somehow. The third option would avoid that, but ultimately it just means the on-the-wire representation of the same U+0000 codepoint becomes more verbose.

This, however, suggests a fourth option: keep the abstract data format and wire representation as-is. (All Unicode strings are allowed and %00 is how you say U+0000.) Instead, non-normatively suggest that decoders that use a NUL-terminated in-memory representation MAY account for U+0000 by representing it as \xc0\x80 in memory.
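A sketch of that fourth option, assuming a decoder that hands decoded strings to NUL-terminated C APIs (the helper name is hypothetical): the wire form stays %00, and only the in-memory copy uses the substitution:

```python
def to_nul_terminated(decoded: str) -> bytes:
    # The wire representation of U+0000 remains %00; only this
    # in-memory, NUL-terminated copy substitutes the modified-UTF-8
    # pair \xc0\x80, so no interior 0x00 byte collides with the
    # terminator.
    return decoded.encode("utf-8").replace(b"\x00", b"\xc0\x80") + b"\x00"
```

Converting back for re-serialization would reverse the substitution, so the wire form round-trips unchanged.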

@wtarreau

I'm naively suspecting this could be used to attack some elements: fill such a field with as many %00 as you can and it gets doubled in its final representation (e.g. when passed to a next hop) :-/

@davidben
Contributor

davidben commented Aug 17, 2023

By "this", do you mean the \xc0\x80 in-memory representation? I don't think that's a concern.

If passing to a next hop is still over HTTP, the representation you send will continue to be %00 because (under that proposal), that is the one and only representation of U+0000. It would continue to be incorrect to encode it as %c0%80.

If you're worried about a large in-memory representation, I will note that it costs the attacker three bytes ('%' '0' '0') of bandwidth to get you to represent something in two bytes ('\xc0' '\x80'). The attacker would get a better ratio by sending you a bunch of 'a's.
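Quantifying that ratio: N copies of the three-byte wire escape shrink to 2N bytes in this in-memory form, so there is no amplification:

```python
# N copies of the wire escape "%00" versus the \xc0\x80 in-memory form.
n = 1000
wire = "%00" * n           # 3 bytes per U+0000 on the wire
memory = b"\xc0\x80" * n   # 2 bytes per U+0000 in memory
assert len(memory) < len(wire)
```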

But ultimately all this is just a question of what in-memory representation your program uses for U+0000. It just needs to unambiguously represent the possible state space. Indeed I might suggest, rather than doing these \xc0\x80 hacks, don't use NUL-terminated strings! 😄 Then this is moot. I certainly would not go with \xc0\x80 in anything I write.

@nharper

nharper commented Aug 17, 2023

I see no reason to single out U+0000 from other control characters. Either allow U+0000, or disallow all control characters.

The overlong encoding hack will bring more issues than it will solve.

@bsdphk
Contributor Author

bsdphk commented Aug 20, 2023

I see no reason to single out U+0000 from other control characters.

The only reason to single out U+0000 is that it is the string termination character in C and POSIX environments, and that is still a major cause of CVEs.

I would not object to banning all the control characters, but that takes us into Unicode land, which is full of dragons...

This, however, suggests a fourth option: keep the abstract data format and wire representation as-is. (All Unicode strings are allowed and %00 is how you say U+0000.) Instead, non-normatively suggest that decoders that use a NUL-terminated in-memory representation MAY account for U+0000 by representing it as \xc0\x80 in memory.

That is a third way to get the indeterminacy of the "Java hack", but it does have the slight benefit that the implementor of the sf-bis parser may know whether that over-long UTF-8 sequence will work or not. However, if it does not, then nothing is gained with respect to predictability.

@martinthomson
Contributor

I'd prefer to stick with a caution here, rather than a prohibition. That means that endpoints (yes, whatever that means) can make decisions about what they tolerate, but we aren't constraining use. That is, the field can carry anything, but a resource might decide not to accept some values.

@bsdphk
Contributor Author

bsdphk commented Aug 21, 2023

I'd prefer to stick with a caution here, rather than a prohibition.

How about simply adding: "Characters in Unicode category 'Cc' (Control) are not permitted, unless the field definition (unwisely) explicitly permits them."

@mnot
Member

mnot commented Aug 23, 2023

That's a field-specific parsing rule, which doesn't seem great from an API perspective.

@bsdphk
Contributor Author

bsdphk commented Aug 23, 2023

That's a field-specific parsing rule, which doesn't seem great from an API perspective.

Not if it goes in the "Defining New Structured Fields" section ?

I think it is perfectly fair game to insist that fields which accept Control Characters are forced to specify that, and ban them by default.

My hope is that adding such text will mean that no field will ever be defined to allow them.

@mnot
Member

mnot commented Aug 23, 2023

Right, but the software handling that ban isn't a SF processor, it's the field-specific code. I think the most we could do would be something like how we handle field-specific constraint failures in 'defining new...', eg:

Field specifications that allow Display Strings should specify how control characters [ref?] should be handled; absent specific instructions, they are not allowed.

But that creates a situation whereby the default is that there's nothing in the field's spec about control characters, the SF generic implementation passes them through, and the field-specific code is expected to know that it should reject them. Not great.

So a better approach might be:

Field specifications that allow Display Strings are required to specify how control characters [ref?] should be handled; treating them as errors [ref to error handling] is encouraged.

@bsdphk
Contributor Author

bsdphk commented Aug 25, 2023

This is where we realize we have waded into the "no way to be a little bit pregnant with Unicode" sump, isn't it?

I still think that %00 is a valid security concern in C/POSIX environments, but I really do not want to open the Unicode can of worms.

Maybe we should just drop this ticket and add a stern security warning that there is no such thing as "safe unicode strings" ?

@mnot
Member

mnot commented Aug 25, 2023

I'd be good with a general unicode 'there be dragons' security considerations warning.


7 participants