
Stream stacking occurs when h2 processes HTTP/2 RST_STREAM frames, causing high memory and CPU usage #2877

Closed
qinyushun opened this issue May 27, 2022 · 32 comments · Fixed by hyperium/h2#668
Labels
A-http2 Area: HTTP/2 specific. C-bug Category: bug. Something is wrong. This is bad!

Comments

@qinyushun

Version
hyper-0.13.7
h2-0.2.4

Platform
WSL

Description

Summary:

Stream stacking occurs when hyper processes HTTP/2 RST_STREAM frames. As a result, memory and CPU usage are high.

Steps:

  1. Send a HEADERS frame to open a stream, followed by an RST_STREAM frame to request cancellation of that stream:

    def reset_flood(h2_conn):
        # Client-initiated stream IDs must be odd, so skip even IDs.
        # headers() and rst_stream() are helpers (defined elsewhere in the
        # PoC) that send raw HEADERS and RST_STREAM frames on the connection.
        for i in range(1, 20000):
            if i % 2 == 0:
                continue
            headers(h2_conn, i)
            rst_stream(h2_conn, i)
    
  2. Create multiple threads to send the flood concurrently:

    import _thread

    if __name__ == '__main__':
        # send() (defined elsewhere in the PoC) opens a connection and runs
        # reset_flood() against the target server.
        for i in range(0, 400):
            try:
                _thread.start_new_thread(send, ("Thread-" + str(i),))
            except:
                print("Error: cannot start thread")
        # Keep the main thread alive while the attack threads run.
        while 1:
            pass
    

Result:

The CPU usage of the hyper server keeps increasing, and eventually the VM's memory is used up.

Vulnerability analysis:

When receiving a HEADERS frame, h2 stores the corresponding stream content in the slab and records the stream's index in ids. When receiving an RST_STREAM frame, h2 sets the stream to Closed and resets the related counters. However, the stream's memory is released only when stream.is_released() is true, and receiving the RST_STREAM frame does not trigger the release immediately. As a result, the slab grows continuously.
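
To make the mechanics concrete, below is a minimal sketch of the release semantics described above. The Store, slab, and ids names follow the report's terminology, but the bodies are illustrative only, not h2's actual code (a plain Vec of Options stands in for the real slab allocator):

    // Illustrative sketch only; not h2's real implementation.
    use std::collections::HashMap;

    struct Stream {
        closed: bool,
        released: bool, // set only once the user has observed the stream's events
    }

    struct Store {
        slab: Vec<Option<Stream>>, // owns all stream memory
        ids: HashMap<u32, usize>,  // stream id -> slab index; the "active" streams
    }

    impl Store {
        fn recv_headers(&mut self, id: u32) {
            self.slab.push(Some(Stream { closed: false, released: false }));
            self.ids.insert(id, self.slab.len() - 1);
        }

        fn recv_reset(&mut self, id: u32) {
            if let Some(&key) = self.ids.get(&id) {
                if let Some(stream) = self.slab[key].as_mut() {
                    // The stream becomes Closed, but its slab slot is NOT freed
                    // here; that happens only later, once `released` is true.
                    stream.closed = true;
                }
            }
        }
    }

In this model, recv_reset() marks the stream Closed but never frees its slot, which is exactly the growth the test below measures.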

Test procedure:

Add slab_len() and ids_len() methods to Store that return the number of all streams and of active streams, respectively.

Log the two lengths above in the recv_reset() method.

In the test, when HEADERS frames are repeatedly sent to open streams and RST_STREAM frames are sent to cancel them, the length of ids never exceeds max_recv_streams, but the slab keeps growing. Streams in the Closed state in the slab are released only after all frames on the connection have been sent.

The max_concurrent_streams configuration can limit max_recv_streams, but in this scenario the slab grows far beyond the max_concurrent_streams value and keeps stacking up.

To prevent such attacks, I think it is necessary to limit the size of the slab, or to ensure that streams in the slab are released promptly after an RST_STREAM frame is received.

@qinyushun qinyushun added the C-bug Category: bug. Something is wrong. This is bad! label May 27, 2022
@seanmonstar seanmonstar added the A-http2 Area: HTTP/2 specific. label May 27, 2022
@lidong14

@seanmonstar What do you think of this issue? I think hyper has some problems in dealing with this scenario.

@seanmonstar
Member

Sounds like something we could try to fix. What is making the slab hold onto the memory? My guess is that it needs to hold the stream info while there is an Event that can be polled, so that a user accepting requests will see it was reset, is that it?

@lidong14

@silence-coding thanks for your code.
@seanmonstar please check, thanks


@shirshak55

CVE-2023-26964

@jqnatividad

GitHub Dependabot is now also raising a security alert:

GHSA-f8vr-r385-rh5r

@SamTV12345

Is there anything we could do as reqwest library users to mitigate the attack surface?

@seanmonstar
Member

Um. Ok. I'm sure you just wanted to see this fixed. But this is NOT how to do things. Filing a drive-by CVE like this is like seeing a lighter in a school and pulling the fire alarm.

There is a SECURITY policy if you ever feel something is a vulnerability. Then we can better coordinate.

Be Kind. That includes considering we're all humans contributing, and respecting time is part of being kind.

I'm also extremely disappointed that @github facilitated this.

That said, fire alarm time.


If you're here from a dependabot warning, or similar: I'll be working on this today. I will update this issue when there is a version you can upgrade to. Some things you can do to determine if you can just safely ignore the dependabot warning:

  • If you're not using HTTP/2, ignore.
  • If you're using a client, with HTTP/2, but you're talking to your own servers, or you at least control the URLs to talk to servers you trust to not attack your client (most businesses), ignore.

The fix will likely be an h2 version; hyper will not need to be changed.

@olix0r
Contributor

olix0r commented Apr 12, 2023

This CVE is going to cause a flurry of unnecessary emergency response work across the ecosystem, even for use cases that are totally unimpacted by the bug. Every application that depends on hyper will now have to ship out-of-cycle fixes to address the CVE, even if the application has nothing to do with HTTP/2.

Going through the documented SECURITY process isn't a feel-good exercise: it allows stakeholders to coordinate on the severity and breadth of an issue's impact.

@jqnatividad

Not to pile on @github's security team, but the actual CVE entry is "better" insofar as it says "awaiting analysis"...
https://nvd.nist.gov/vuln/detail/CVE-2023-26964

The GH Security Advisory, in contrast, is unnecessarily alarming with its High Severity label...
GHSA-f8vr-r385-rh5r

@seanmonstar
Member

seanmonstar commented Apr 12, 2023

An update: from what I can tell, the report is that the slab store of streams grows because it's holding two events, essentially [HEADERS, RESET], so that the user can notice them when they call conn.accept(). I have the skeleton of a unit test; it's not great, but it lets me adjust the conditions a bit: https://github.com/hyperium/h2/compare/sean/reset-streams-growing.

It appears that as long as the server is accepting requests, the slab is actually cleared out. Looking at the original report, I can't find what the server is doing (it's target/release/server2?).

Also, here's the chat thread for this: https://discord.com/channels/500028886025895936/1095696565689122967
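
To make the "as long as the server is accepting" observation concrete, here is a minimal h2 server sketch, assuming the h2 0.3, tokio, and http crates, with error handling elided. Each accept() call observes one pending stream, reset or not, which is what lets the store release its slot:

    use tokio::net::TcpListener;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let listener = TcpListener::bind("127.0.0.1:8080").await?;
        loop {
            let (socket, _) = listener.accept().await?;
            tokio::spawn(async move {
                let mut conn = match h2::server::handshake(socket).await {
                    Ok(conn) => conn,
                    Err(_) => return,
                };
                // The "tight" accept loop: pull the next pending stream and
                // immediately hand it off to its own task.
                while let Some(Ok((request, mut respond))) = conn.accept().await {
                    tokio::spawn(async move {
                        let _ = request; // a real handler would read the request here
                        let response = http::Response::new(());
                        let _ = respond.send_response(response, true);
                    });
                }
            });
        }
    }

If this loop stalls, or an attacker outpaces it, the pending-accept queue (and the slab behind it) grows, which is the behavior under investigation.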

@Aloso

Aloso commented Apr 12, 2023

> This CVE is going to cause a flurry of unnecessary emergency response work across the ecosystem

What do you mean by "unnecessary"?

Finding out whether or not you are affected isn't unnecessary. Updating to an unaffected version is not possible, because there is no unaffected version as of now; every version of hyper is potentially vulnerable. The only thing people can do is to disable the "h2" feature if they have it enabled, which would prevent this attack.
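
Along those lines, a hyper 0.14 server can also refuse HTTP/2 at runtime rather than at compile time. http1_only is a real hyper 0.14 server builder option; the rest of this setup is a minimal illustration (assuming hyper 0.14 with its server features, plus tokio):

    use std::convert::Infallible;
    use hyper::{Body, Request, Response, Server};
    use hyper::service::{make_service_fn, service_fn};

    #[tokio::main]
    async fn main() {
        let make_svc = make_service_fn(|_conn| async {
            Ok::<_, Infallible>(service_fn(|_req: Request<Body>| async {
                Ok::<_, Infallible>(Response::new(Body::from("ok")))
            }))
        });
        // Only speak HTTP/1, so RST_STREAM floods never reach h2.
        Server::bind(&([127, 0, 0, 1], 3000).into())
            .http1_only(true)
            .serve(make_svc)
            .await
            .unwrap();
    }

Note that this only helps servers; a client talking to untrusted servers would instead avoid negotiating HTTP/2 at all.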

@sigaloid

> What do you mean by "unnecessary"?

The whole point of a SECURITY policy is to respond in a reasonable way: reproduce, identify scope, write tests, prepare a fix, etc. Should we really report a CVE before confirming with the maintainers that it's 1. reproducible and 2. something that has a fix ready to deploy? It would have been better to follow the requested SECURITY policy so that the response could be more coordinated. It's better to be sure first before declaring a vulnerability.

@carllerche

Exactly; additionally, there is no world in which this is a "high severity" CVE. You can use this tool to calculate the severity.


@carllerche

> This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.

It is still very unclear whether there is a vulnerability or just an issue. The submitter claims there is one, but has yet to provide a reproduction showing how they achieved that result. The maintainers have yet to reproduce it under normal conditions. The only ways to reproduce seem to be either a) misusing h2, or b) an application that lags behind and explicitly asks for the state to be held (though it may be that the implication of the API is unclear).

@sigaloid

> This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.

We don't even have an official repro case. That's the entire point of the SECURITY policy: ensure the problem exists, gauge its severity, etc. It's not about avoiding blame for a vulnerability, just about coordinating the response better.

@dbrenot-pelmorex

> > This issue was opened almost a year ago. The maintainers had more than enough time to reproduce the bug and prepare a fix.
>
> We don't even have an official repro case. That's the entire point of the SECURITY policy: ensure the problem exists, gauge its severity, etc. It's not about avoiding blame for a vulnerability, just about coordinating the response better.

Does the issue not show how to reproduce this? I'm sorry, but I don't exactly blame the person who filed the CVE here. This was reported and marked as moderate on the CVE website.

I can't imagine it's that rare a situation for people to use HTTP/2 streams here, so I don't know why it's being minimized so much.

"Be kind" should also apply to not scolding your community members. I'm sure this wasn't meant out of malice and was simply meant to bring awareness to the issue at hand.

That being said, I'm seeing people refer to it being marked as high severity, but both GitHub and NIST are reporting it as moderate, which I think is fair.

@CosmicHorrorDev

> That being said, I'm seeing people refer to it being marked as high severity, but both GitHub and NIST are reporting it as moderate, which I think is fair.

GitHub initially had it marked as high. It was updated an hour ago:

github/advisory-database@aa9e5d5

@sigaloid

> Does the issue not show how to reproduce this? I'm sorry, but I don't exactly blame the person who filed the CVE here. This was reported and marked as moderate on the CVE website.

No, it looks like the maintainers/contributors are having a hard time figuring out exactly how to reproduce it, where the vulnerability is, or whether it's just an API misuse. As I understand from the fire-alarm Discord thread, the large number of requests needed to reproduce in the OP's case also confuses things. In general, it's not very clear how or where the vulnerability lies, and this is emergency work that the maintainers now have to perform, without even a concise repro case, instead of the issue being disclosed properly.

It's generally good practice to at least mention to the maintainers that you'll be filing a CVE, even if just to give advance warning, but also to check whether you have misunderstood something. Even if it turns out to be severe, it's still best practice to communicate with the maintainers about it, for all of the reasons above.

@seanmonstar
Member

seanmonstar commented Apr 12, 2023

> Does the issue not show how to reproduce this?

It unfortunately does not. About 4 people have been working on this all day, trying to figure out what the actual reproduction is. This makes it quite stressful to just guess at a fix, when we can't confirm that's what the reporter meant in the first place.

> I'm sure this wasn't meant out of malice and was simply meant to bring awareness to the issue at hand.

I acknowledged in my first comment this morning that I assume the person just wanted it fixed, not that they were being malicious. But this is not the right way to get attention; there were several other steps that could have been taken.

@carllerche

Another update: this should be classified as low severity. The attack requires sufficient bandwidth to overload hyper's accept loop. The original submitter ran 400 threads over localhost against hyper's single thread. This would not be easy to trigger over a network (though possible). Additionally, systems return to normal once the attack completes.

In practice, I don't expect many deployments to be vulnerable to such an attack, given the difficulty of triggering it over a remote network; in terms of severity, a botnet triggering a DoS by overloading the pipe would be a greater risk.

That said, the Hyper maintainers are currently investigating a patch that could mitigate this specific attack.


@artificial-intelligence

> Another update: this should be classified as low severity. The attack requires sufficient bandwidth to overload hyper's accept loop. The original submitter ran 400 threads over localhost against hyper's single thread. This would not be easy to trigger over a network (though possible). Additionally, systems return to normal once the attack completes.
>
> In practice, I don't expect many deployments to be vulnerable to such an attack, given the difficulty of triggering it over a remote network; in terms of severity, a botnet triggering a DoS by overloading the pipe would be a greater risk.
>
> That said, the Hyper maintainers are currently investigating a patch that could mitigate this specific attack.

This is all very good to know, nice work.

Just one question: why is it not easy to run 400 client threads over the network against 1 server thread? It seems dead simple and is what every DDoS framework can do, no? Am I misinterpreting your answer, maybe?

Thanks for all the work from everyone involved again! It's appreciated!

@Noah-Kennedy
Contributor

It isn't a matter of threading, but of bandwidth. You can only send as many bytes as the pipe allows, and that has to be enough to saturate the hyper accept loop.

This is easy enough to do over localhost, where bandwidth is arbitrarily high, but not so much over the public internet.

@seanmonstar
Member

> why is it not easy to trigger over the network against 1 server thread?

This is our current theory (we can't confirm it's what the original reporter meant): if you can flood the network faster than the server is able to "accept" requests, the queue of pending-to-accept-but-reset requests can keep growing. This is because the accept method is basically read_from_socket(); yield_pending_request(). The task in hyper that is "accepting" requests is very tight: it simply does accept(); spawn_task_to_handle_request(). If you can refill the TCP read buffer with a bunch more requests before that simple loop runs, over and over, the internal queue keeps growing.

So, if it is over the network, not local, then latency, backpressure, and TCP flow control would make it harder to fill up the server's buffer faster than it can spawn a task to handle the request.

But again, this is just our best guess.

@seanmonstar
Member

If that is indeed the issue, this PR means to fix it: hyperium/h2#668

@seanmonstar
Member

The PR seems close. But, given that I've been at this for 14 hours, I'm going to go knock out. Weighing the severity against the possibility of breaking something right before I become unresponsive, it doesn't seem worth publishing at this moment.

seanmonstar added a commit to hyperium/h2 that referenced this issue Apr 13, 2023
Streams that have been received by the peer, but not accepted by the
user, can also receive a RST_STREAM. This is a legitimate pattern: one
could send a request and then shortly after, realize it is not needed,
sending a CANCEL.

However, since those streams are now "closed", they don't count towards
the max concurrent streams. So, they will sit in the accept queue, using
memory.

In most cases, the user is calling `accept` in a loop, and they can
accept requests that have been reset fast enough that this isn't an
issue in practice.

But if the peer is able to flood the network faster than the server
accept loop can run (simply accepting, not processing requests; that
tends to happen in a separate task), the memory could grow.

So, this introduces a maximum count for streams in the pending-accept
but remotely-reset state. If the maximum is reached, a GOAWAY frame with
the error code of ENHANCE_YOUR_CALM is sent, and the connection marks
itself as errored.

ref CVE-2023-26964
ref GHSA-f8vr-r385-rh5r

Closes hyperium/hyper#2877
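
A minimal, self-contained sketch of the counter logic this commit message describes; the names here are illustrative, not h2's actual internals:

    // Illustrative sketch of the mitigation: count remotely-reset streams that
    // the user has not accepted yet, and error the connection over a limit.
    #[derive(Debug)]
    enum Reason {
        EnhanceYourCalm, // sent to the peer in a GOAWAY frame
    }

    struct Counts {
        pending_accept_reset: usize,
        max_pending_accept_reset: usize, // the new configurable limit
    }

    impl Counts {
        /// Called when RST_STREAM arrives for a stream the user has not accepted yet.
        fn on_remote_reset_pending_accept(&mut self) -> Result<(), Reason> {
            self.pending_accept_reset += 1;
            if self.pending_accept_reset > self.max_pending_accept_reset {
                // Over the limit: send GOAWAY(ENHANCE_YOUR_CALM) and mark the
                // connection as errored.
                return Err(Reason::EnhanceYourCalm);
            }
            Ok(())
        }

        /// Called when the user finally accepts (and releases) such a stream.
        fn on_pending_accept_released(&mut self) {
            self.pending_accept_reset -= 1;
        }
    }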
@seanmonstar
Member

Release PR is in CI right now: hyperium/h2#670

@seanmonstar
Member

Release is out: https://github.com/hyperium/h2/releases/tag/v0.3.17

You can run `cargo update -p h2` to get this newest version.
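
The fix also exposes the limit on the server builder. This is a hedged sketch assuming the h2, tokio, http, and bytes crates; the method name max_pending_accept_reset_streams comes from the fix PR (hyperium/h2#668), so verify it against your h2 version's docs before relying on it:

    use tokio::net::TcpStream;

    // Sketch: cap how many remotely-reset, not-yet-accepted streams a peer
    // may accumulate before the connection answers with GOAWAY.
    async fn serve(socket: TcpStream) -> Result<(), h2::Error> {
        let mut conn = h2::server::Builder::new()
            .max_pending_accept_reset_streams(20)
            .handshake::<_, bytes::Bytes>(socket)
            .await?;
        while let Some(stream) = conn.accept().await {
            let (_request, mut respond) = stream?;
            respond.send_response(http::Response::new(()), true)?;
        }
        Ok(())
    }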

@KisaragiEffective

Submitted rustsec/advisory-db#1684.
