Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

field-value production rules out CTL characters, but these are common in the wild #683

Closed
njsmith opened this issue Jan 21, 2021 · 17 comments · Fixed by #783
Closed

field-value production rules out CTL characters, but these are common in the wild #683

njsmith opened this issue Jan 21, 2021 · 17 comments · Fixed by #783

Comments

@njsmith
Copy link

njsmith commented Jan 21, 2021

draft-ietf-httpbis-semantics-latest currently says:

Field values containing control (<x:ref>CTL</x:ref>) characters such as CR or LF are invalid; recipients &MUST; either reject a field value containing control characters, or convert them to SP before processing or forwarding the message.

However, when I implemented an HTTP/1.1 parsing library that followed this rule, it turned out to broken in the real world. Specifically, users reported that sites using Google Analytics were setting cookies with values containing the ASCII character \x01, which is a control character (python-hyper/h11#57). And in my investigations at the time, browsers and popular clients like curl all supported this just fine.

It would be great if the next round of RFCs could look into this in more detail and define the field-content production in a way that would be acceptable to e.g. the browsers.

In my library we currently reject NUL and exotic whitespace (\r, \v, \f, \n), on the grounds that those are high-risk for interop bugs and splitting exploits, but allow all other bytes to pass through, including control characters. I don't know if that's the best solution, but in our experience it's at least closer to being real-world compatible than what the RFC draft currently says.

@annevk
Copy link
Contributor

annevk commented Jan 22, 2021

See #215. https://fetch.spec.whatwg.org/#concept-header-value is what browsers implement, which is slightly more lenient than what you have there. web-platform-tests has tests for that as well.

@reschke
Copy link
Contributor

reschke commented Jan 22, 2021

Was this bug reported to Google Analytics?

@annevk
Copy link
Contributor

annevk commented Jan 22, 2021

To be clear, at Mozilla we found the same thing on routers and such. It's not an isolated incident.

@Lukasa
Copy link

Lukasa commented Jan 22, 2021

As with the discussion we’re having on-thread about #31, is it worth us revisiting the language in this specification when we know that non-conformance is widespread?

@royfielding
Copy link
Member

As with the discussion we’re having on-thread about #31, is it worth us revisiting the language in this specification when we know that non-conformance is widespread?

No, control characters as a set being disallowed in header fields is a relatively new requirement, whereas HTTP has explicitly forbidden CR-only line breaks since 1993 because that is a well-known security hole. Allowing some of the control characters to pass through is not going to decrease the security of valid implementations, whereas allowing CR will only increase the number of insecure browsers subjecting their own users to cache poisoning attacks.

@reschke
Copy link
Contributor

reschke commented Jan 24, 2021

No, control characters as a set being disallowed in header fields is a relatively new requirement (...)

Hmmm - unless I'm missing something they weren't allowed in RFC 2616 either. What's new are the instructions for recipients how to deal with malformed values.

@royfielding
Copy link
Member

Right, that makes sense, though more strict handling of field values in 7230 was part of the move away from lenient handling for robustness in 2616.

@njsmith
Copy link
Author

njsmith commented Feb 3, 2021

As an HTTP implementer, I'd like to respectfully plead for you to reconsider here, and align the RFC text with the WHAT-WG text for better real-world interoperability.

YOLOing together a HTTP/1.1 implementation that's interoperable but insecure is easy. Writing a secure but non-interoperable implementation is pointless, because no-one will use it. Writing a secure AND interoperable implementation is extremely difficult, because it requires that you find the narrowest possible rules that allow only the things that are required to interoperate, and reject everything else, and that's essentially impossible without browser-maker-level visibility into exactly what weird HTTP implementations are out there. The RFCs are an incredible resource for anyone who's trying to build a high-quality implementation, because they contain so much distilled wisdom on where this line is... except for bits like the field-value production, where if you follow the spec you'll get burned and end up having to YOLO something anyway.

It's very frustrating to carefully follow the specs, and then be punished for it. And it makes it hard to justify following the spec versus using some YOLO implementation.

@mnot mnot added the semantics label Feb 3, 2021
@royfielding
Copy link
Member

We can't align the text. HTTP has always forbidden bare CR as a line break. Almost all deployed implementations rely on that while parsing CR into field values and (CR)LF as line breaks. If we changed that now, all deployed implementations with a cache would have a security hole, instead of just the two evergreen browsers that broke their own parsers and can easily change them back.

What servers have been doing (and slowly deploying) is active removal of bare CR in fields when they do character-by-character parsing, just to fix the new browser brokenness. This is a workaround.

@royfielding
Copy link
Member

With regard to other CTL characters, many are found in the wild, but it would be a stretch to say that they are interoperable when sent in HTTP. Some might be. I don't think we have time to validate such a change.

@njsmith
Copy link
Author

njsmith commented Feb 4, 2021

The WHAT-WG spec text that @annevk linked above forbids NUL, CR, and LF inside field values, while allowing everything else. So I'm not suggesting allowing CR.

Re: the other control characters: if you can't validate interoperability, then that's a tough position to be in for sure. But surely there's something that would be better than the current text? Currently, all general-usage http clients MUST actively violate the spec. Even if all you can say is "you should definitely disallow NUL/CR/LF and for everything else we're not sure, use your best judgement", then that seems like it would at least be more accurate and helpful than what we have now.

@reschke
Copy link
Contributor

reschke commented Feb 4, 2021

It really would be interesting to see what the effect of converting-to-SP instead of preserving the CTL characters would be. That is, are those characters sent intentionally (I see that this probably is ghe case for G Analytics). Unfortunately, even if we ran an experiment right now, it wouldn't help in this LC.

@reschke
Copy link
Contributor

reschke commented Feb 4, 2021

FWIW

python-hyper/h11#57 (comment)

suggests that the HTTP/2 has a "MUST reject" rule. When we achieve consensus here, the H2bis probably should align.

@njsmith
Copy link
Author

njsmith commented Feb 4, 2021

It really would be interesting to see what the effect of converting-to-SP instead of preserving the CTL characters would be. That is, are those characters sent intentionally (I see that this probably is ghe case for G Analytics).

Yeah, specifically the CTL characters from G Analytics occur in Set-Cookie/Cookie headers, so it's an opaque value that the server expects to get back. I don't know for sure, but my impression is that someone chose \x01 as a "clever" separator character that they knew wouldn't occur inside the values being separated, so converting to SP would probably break it.

@reschke
Copy link
Contributor

reschke commented Feb 4, 2021

I see. Another related question would be: is this something that's specific to cookies?

@royfielding
Copy link
Member

Note also that the G analytics cookies refer to an old version of custom variables in a library that has been supplanted at least twice (ga.js --> analytics.js --> gtag.js). It's possible that they were removed because of lacking interop.

https://developers.google.com/analytics/devguides/collection/analyticsjs/cookie-usage

@mnot
Copy link
Member

mnot commented Feb 11, 2021

Discussed in Feb 21 Interim; align with Fetch for hard refusal of characters, strongly caution against CTL (minus NUL) for senders / field definitions. Don't require CTL (minus NUL) to be replaced with space. CR+NUL MUST be converted to space or rejected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

6 participants