-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
oohelperd: excessive CPU usage under moderate load #2413
Comments
The first problem to solve is how to stress test the oohelperd in a testing environment. To this end, @FedericoCeratto proposed using the siege tool to benchmark the oohelperd binary. This is the script we have been using to perform this task: #!/bin/bash
set -euxo pipefail
siege -c 10 -t30S \
--content-type "application/json" \
'http://127.0.0.1:8080 POST {"http_request": "https://www.youtube.com/", "http_request_headers": {}, "tcp_connect": ["216.58.209.46:443", "142.250.184.78:443", "142.250.184.110:443", "142.250.180.142:443", "142.250.180.174:443", "142.251.209.14:443", "142.251.209.46:443"]}' The general idea here is to issue the same POST request with the given body for 30 seconds using ten concurrent users to create load on the system. I am personally testing this code in my development computer (which does not have lower hardware resources than the VM we are currently using for the test helpers). |
This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.
This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.
See ooni/probe#2413 for context.
This diff is slightly more complex than the naive diff that creates a private static pool and uses it for quic and tls. The reason is that we also need to support a mechanism allowing integration tests to transparently override the default cert pool. See ooni/probe#2413 for context.
This diff is slightly more complex than the naive diff that creates a private static pool and uses it for quic and tls. The reason is that we also need to support a mechanism allowing integration tests to transparently override the default cert pool. See ooni/probe#2413 for context.
This commit backports 8476f11. This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.
With ooni/probe-cli#1096 applied, I ran the above script and got as output:
The pprof trace for oohelperd looked as follows: If you download the SVG and zoom, you can see how So, I implemented ooni/probe-cli#1097 and ooni/probe-cli#1098 to create a static cached cert pool that we can use for both With those two changes in place, the script output become:
While there's more to optimize, we have clearly removed a significant bottleneck. |
I backported ooni/probe-cli#1097 to There was no need to backport ooni/probe-cli#1098. After resolving conflicts, the remaining diff boiled down to: diff --git a/internal/netxlite/quic.go b/internal/netxlite/quic.go
index 682c976e..c4641c88 100644
--- a/internal/netxlite/quic.go
+++ b/internal/netxlite/quic.go
@@ -166,7 +166,8 @@ func (d *quicDialerQUICGo) dialEarlyContext(ctx context.Context,
func (d *quicDialerQUICGo) maybeApplyTLSDefaults(config *tls.Config, port int) *tls.Config {
config = config.Clone()
if config.RootCAs == nil {
- config.RootCAs = defaultCertPool
+ // See https://github.com/ooni/probe/issues/2413 for context
+ config.RootCAs = tproxyDefaultCertPool
}
if len(config.NextProtos) <= 0 {
switch port {
diff --git a/internal/netxlite/tls.go b/internal/netxlite/tls.go
index dc87743a..8f7e781b 100644
--- a/internal/netxlite/tls.go
+++ b/internal/netxlite/tls.go
@@ -213,7 +213,8 @@ func (h *tlsHandshakerConfigurable) Handshake(
conn.SetDeadline(time.Now().Add(timeout))
if config.RootCAs == nil {
config = config.Clone()
- config.RootCAs = defaultCertPool
+ // See https://github.com/ooni/probe/issues/2413 for context
+ config.RootCAs = tproxyDefaultCertPool
}
tlsconn, err := h.newConn(conn, config)
if err != nil {
diff --git a/internal/netxlite/tproxy.go b/internal/netxlite/tproxy.go
index c56b853e..0043ed12 100644
--- a/internal/netxlite/tproxy.go
+++ b/internal/netxlite/tproxy.go
@@ -15,6 +15,16 @@ var TProxy model.UnderlyingNetwork = &DefaultTProxy{}
// defaultTProxy is the default UnderlyingNetwork implementation.
type DefaultTProxy struct{}
+// tproxyDefaultCertPool is a static copy of the default cert pool. You
+// MUST NOT access this variable directly. You SHOULD use the
+// tproxySingleton().DefaultCertPool() factory instead. By doing
+// that, you would allow integration tests to override the pool
+// we're using. Hence, we can run tests with fake servers.
+//
+// See https://github.com/ooni/probe/issues/2413 to understand why we
+// need a private static default pool.
+var tproxyDefaultCertPool = NewDefaultCertPool()
+
// DialContext implements UnderlyingNetwork.
func (tp *DefaultTProxy) DialContext(ctx context.Context, timeout time.Duration, network, address string) (net.Conn, error) {
d := &net.Dialer{
From this diff it's clear that Hence, with ooni/probe-cli@ebb4ecc merged into |
In the meanwhile, we have updated our test helpers architecture in ooni/api@3bf3bef. The general idea is to use 0.th as the place where we do A/B testing after code has been tested on the testing infrastructure. This test helper will receive a small portion of the traffic. The other THs will always run stable code and will round robin the rest of the traffic. |
To make the developed diff benefit all experiments, we should make sure we are not calling |
We should also expose pprof metrics to localhost and think about which additional prometheus metrics we should expose in dashboards. We have already added the number of currently allocated bytes and the GC duration. |
This diff attempts to rewrite and steer handler_test.go such that it mostly contains unit tests. As part of this work, we create a single factory function for constructing the handler. While there, we also rename the indexLogger to prefixLogger (a more accurate name) and we deploy more comments and whitespaces. Part of ooni/probe#2413
This diff attempts to rewrite and steer handler_test.go such that it mostly contains unit tests. As part of this work, we create a single factory function for constructing the handler. While there, we also rename the indexLogger to prefixLogger (a more accurate name) and we deploy more comments and whitespaces. Part of ooni/probe#2413
This diff modifies the tree to rely on netxlite's property that it will use the Mozilla cert pool when no cert pool has been set. After this diff has been applied, these are the places where we still call `NewDefaultCertPool`: ``` internal/experiment/echcheck/handshake.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/ndt7/dial.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/riseupvpn/riseupvpn.go: certPool := netxlite.NewDefaultCertPool() internal/experiment/signal/signal.go: certPool := netxlite.NewDefaultCertPool() internal/model/netx.go: // race. Use [netxlite.NewDefaultCertPool] if you wish to get internal/netxlite/filtering/tls_test.go: RootCAs: netxlite.NewDefaultCertPool(), internal/netxlite/tls.go:// NewDefaultCertPool returns the default x509 certificate pool internal/netxlite/tls.go:func NewDefaultCertPool() *x509.CertPool { internal/netxlite/tls_test.go:func TestNewDefaultCertPoolWorks(t *testing.T) { internal/netxlite/tls_test.go: pool := NewDefaultCertPool() internal/netxlite/tproxy.go:var tproxyDefaultCertPool = NewDefaultCertPool() ``` The references in `internal/model` and `internal/netxlite` look good. In `ndt7` and `echcheck`, we're creating a cached copy, so we're all good. In `riseupvpn` and `signal`, we need to extend the cert pool, so those are the places where calling `NewDefaultCertPool` makes sense. These are the places where we modify cert pools: ``` internal/experiment/riseupvpn/riseupvpn.go: if ok := certPool.AppendCertsFromPEM([]byte(tk.HTTPResponseBody)); !ok { internal/experiment/signal/signal.go: if !certPool.AppendCertsFromPEM(caBytes) { internal/experiment/signal/signal.go: return errors.New("AppendCertsFromPEM failed") internal/experiment/signal/signal_test.go: if err.Error() != "AppendCertsFromPEM failed" { internal/netxlite/certifi_test.go: if !pool.AppendCertsFromPEM([]byte(pemcerts)) { internal/netxlite/internal/gencertifi/main.go: if !pool.AppendCertsFromPEM(bundle) { internal/netxlite/tls.go: // Assumption: AppendCertsFromPEM cannot fail because we internal/netxlite/tls.go: ok := pool.AppendCertsFromPEM([]byte(pemcerts)) internal/netxlite/tls.go: runtimex.Assert(ok, "pool.AppendCertsFromPEM failed") internal/experiment/urlgetter/multi_test.go: certpool.AddCert(cert) internal/netxlite/filtering/http.go: o.AddCert(p.cert) internal/netxlite/filtering/tls.go: o.AddCert(p.cert) internal/netxlite/tproxy_test.go: pool.AddCert(srvr.Certificate()) ``` Excluding `_test.go` files, this is basically: * in `riseupvpn` and `signal`, where we need to do that; * when we generate the bundled certs; * in `netxlite/tls.go`, where we implement `NewDefaultCertPool`. Part of ooni/probe#2413.
This diff modifies the tree to rely on netxlite's property that it will use the Mozilla cert pool when no cert pool has been set. After this diff has been applied, these are the places where we still call `NewDefaultCertPool`: ``` internal/experiment/echcheck/handshake.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/ndt7/dial.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/riseupvpn/riseupvpn.go: certPool := netxlite.NewDefaultCertPool() internal/experiment/signal/signal.go: certPool := netxlite.NewDefaultCertPool() internal/model/netx.go: // race. Use [netxlite.NewDefaultCertPool] if you wish to get internal/netxlite/filtering/tls_test.go: RootCAs: netxlite.NewDefaultCertPool(), internal/netxlite/tls.go:// NewDefaultCertPool returns the default x509 certificate pool internal/netxlite/tls.go:func NewDefaultCertPool() *x509.CertPool { internal/netxlite/tls_test.go:func TestNewDefaultCertPoolWorks(t *testing.T) { internal/netxlite/tls_test.go: pool := NewDefaultCertPool() internal/netxlite/tproxy.go:var tproxyDefaultCertPool = NewDefaultCertPool() ``` The references in `internal/model` and `internal/netxlite` look good. In `ndt7` and `echcheck`, we're creating a cached copy, so we're all good. In `riseupvpn` and `signal`, we need to extend the cert pool, so those are the places where calling `NewDefaultCertPool` makes sense. These are the places where we modify cert pools: ``` internal/experiment/riseupvpn/riseupvpn.go: if ok := certPool.AppendCertsFromPEM([]byte(tk.HTTPResponseBody)); !ok { internal/experiment/signal/signal.go: if !certPool.AppendCertsFromPEM(caBytes) { internal/experiment/signal/signal.go: return errors.New("AppendCertsFromPEM failed") internal/experiment/signal/signal_test.go: if err.Error() != "AppendCertsFromPEM failed" { internal/netxlite/certifi_test.go: if !pool.AppendCertsFromPEM([]byte(pemcerts)) { internal/netxlite/internal/gencertifi/main.go: if !pool.AppendCertsFromPEM(bundle) { internal/netxlite/tls.go: // Assumption: AppendCertsFromPEM cannot fail because we internal/netxlite/tls.go: ok := pool.AppendCertsFromPEM([]byte(pemcerts)) internal/netxlite/tls.go: runtimex.Assert(ok, "pool.AppendCertsFromPEM failed") internal/experiment/urlgetter/multi_test.go: certpool.AddCert(cert) internal/netxlite/filtering/http.go: o.AddCert(p.cert) internal/netxlite/filtering/tls.go: o.AddCert(p.cert) internal/netxlite/tproxy_test.go: pool.AddCert(srvr.Certificate()) ``` Excluding `_test.go` files, this is basically: * in `riseupvpn` and `signal`, where we need to do that; * when we generate the bundled certs; * in `netxlite/tls.go`, where we implement `NewDefaultCertPool`. Part of ooni/probe#2413.
This diff increases the timeout used to await for pending requests and parallelizes the shutdown of servers. While there, let's move all the flags at toplevel for clarity. While there, recognize that we can use the channel's signal for shutting down the server in tests. These changes are preliminary changes before making the oohelperd more capable of serving interesting metrics and debug info. See ooni/probe#2413
This diff increases the timeout used to await for pending requests and parallelizes the shutdown of servers. While there, let's move all the flags at toplevel for clarity. While there, recognize that we can use the channel's signal for shutting down the server in tests. These changes are preliminary changes before making the oohelperd more capable of serving interesting metrics and debug info. See ooni/probe#2413
I also added an endpoint for obtaining exection traces. See ooni/probe#2413
I also added an endpoint for obtaining execution traces. See ooni/probe#2413
* buildtool: add rules to build oohelperd * .github/workflows: use buildtool to build oohelperd * oohelperd: add `-replace` flag to replace an already running instance of the oohelperd * oohelperd: add `-version` flag to get version Part of ooni/probe#2413
* buildtool: add rules to build oohelperd * .github/workflows: use buildtool to build oohelperd * oohelperd: add `-replace` flag to replace an already running instance of the oohelperd * oohelperd: add `-version` flag to get version Part of ooni/probe#2413
In ooni/probe-cli#1104, I have added support for obtaining the profile of a running instance. In ooni/probe-cli#1105, I made oohelperd deployment easier. We should now write a short runbook explaining how to extract performance metrics from a running oohelperd daemon. |
We have put work in to mitigate the CPU usage and we now have support for collecting metrics. Additionally, we are now using four test helpers and we disabled using HTTP/3 in the TH for now. I will close the issue right now, because otherwise this will become a long-running issue with no ending condition in sight. Should we see new performance issues in the future, we will reopen this issue or create a new issue. |
TL;DR We need to investigate high oohelperd CPU usage under load, which we noticed during an A/B testing session, but which seems to have been a lingering problem for the TH also before the A/B testing itself.
With my team members (and @FedericoCeratto and @sloncocs in particular), we have been investigating excessive CPU load in the oohelperd, which surfaced when A/B testing code we introduced in ooni/probe-cli#985.
We started testing this code on 2023-02-07T08:06:00Z by deploying a ~3.17.0-beta build on 0.th.ooni.org. On February 9, I said on our internal channels that I was a bit worried by the idle CPU after the deployment:
Further zooming out, though, I noticed there were previous moments in time where we had little CPU to spare, so I was not convinced that we could easily say the new code was the culprit (spoiler: it was not):
As you could see, in fact, in December we had lower available CPU that we had in early February. Because this fact was worrying me, I called for discussing this topic in the next backend meeting (February 14).
However, when I was back to the office after the weekend, I noticed that the CPU usage was now 100%:
While I was investigating, my colleague @sloncocs warned about a surge of failed tests over the weekend:
I quickly realized that, because the 0.th.ooni.org test helper was on its knees, it could not service many users. Therefore, several measurements that happened using such a test helper did not include the control measurement. Hence, the probe could not determine the status of the measurement (all good, website down, anomaly, etc).
Fifteen minutes after noticing this issue, I restarted the oohelperd. Since then, I have reverted back to running the previous code, thus interrupting the A/B testing session until it became more clear to us what was going on.
This issue is here to share what I have learned (so far) when investigating this issue.
The text was updated successfully, but these errors were encountered: