oohelperd: excessive CPU usage under moderate load #2413

bassosimone · 2023-02-15T09:07:34Z

TL;DR We need to investigate high oohelperd CPU usage under load, which we noticed during an A/B testing session, but which seems to have been a lingering problem for the TH also before the A/B testing itself.

With my team members (and @FedericoCeratto and @sloncocs in particular), we have been investigating excessive CPU load in the oohelperd, which surfaced when A/B testing code we introduced in ooni/probe-cli#985.

We started testing this code on 2023-02-07T08:06:00Z by deploying a ~3.17.0-beta build on 0.th.ooni.org. On February 9, I said on our internal channels that I was a bit worried by the idle CPU after the deployment:

Further zooming out, though, I noticed there were previous moments in time where we had little CPU to spare, so I was not convinced that we could easily say the new code was the culprit (spoiler: it was not):

As you could see, in fact, in December we had lower available CPU that we had in early February. Because this fact was worrying me, I called for discussing this topic in the next backend meeting (February 14).

However, when I was back to the office after the weekend, I noticed that the CPU usage was now 100%:

While I was investigating, my colleague @sloncocs warned about a surge of failed tests over the weekend:

I quickly realized that, because the 0.th.ooni.org test helper was on its knees, it could not service many users. Therefore, several measurements that happened using such a test helper did not include the control measurement. Hence, the probe could not determine the status of the measurement (all good, website down, anomaly, etc).

Fifteen minutes after noticing this issue, I restarted the oohelperd. Since then, I have reverted back to running the previous code, thus interrupting the A/B testing session until it became more clear to us what was going on.

This issue is here to share what I have learned (so far) when investigating this issue.

bassosimone · 2023-02-15T09:11:25Z

The first problem to solve is how to stress test the oohelperd in a testing environment. To this end, @FedericoCeratto proposed using the siege tool to benchmark the oohelperd binary.

This is the script we have been using to perform this task:

#!/bin/bash
set -euxo pipefail
siege -c 10 -t30S \
  --content-type "application/json" \
  'http://127.0.0.1:8080 POST {"http_request": "https://www.youtube.com/", "http_request_headers": {}, "tcp_connect": ["216.58.209.46:443", "142.250.184.78:443", "142.250.184.110:443", "142.250.180.142:443", "142.250.180.174:443", "142.251.209.14:443", "142.251.209.46:443"]}'

The general idea here is to issue the same POST request with the given body for 30 seconds using ten concurrent users to create load on the system. I am personally testing this code in my development computer (which does not have lower hardware resources than the VM we are currently using for the test helpers).

See ooni/probe#2413

This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.

See ooni/probe#2413 for context.

This diff is slightly more complex than the naive diff that creates a private static pool and uses it for quic and tls. The reason is that we also need to support a mechanism allowing integration tests to transparently override the default cert pool. See ooni/probe#2413 for context.

This commit backports 8476f11. This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.

bassosimone · 2023-02-15T10:59:21Z

With ooni/probe-cli#1096 applied, I ran the above script and got as output:

Lifting the server siege...
Transactions:		          45 hits
Availability:		      100.00 %
Elapsed time:		       30.08 secs
Data transferred:	        0.24 MB
Response time:		        4.79 secs
Transaction rate:	        1.50 trans/sec
Throughput:		        0.01 MB/sec
Concurrency:		        7.16
Successful transactions:          45
Failed transactions:	           0
Longest transaction:	       12.69
Shortest transaction:	        1.34

The pprof trace for oohelperd looked as follows:

If you download the SVG and zoom, you can see how netxlite.NewDefaultCertPool is eating lots of CPU.

So, I implemented ooni/probe-cli#1097 and ooni/probe-cli#1098 to create a static cached cert pool that we can use for both oohelperd and netxlite. The latter change is required because the oohelperd uses a netxlite HTTP client with default configuration, so it was always calling netxlite.NewDefaultCertPool when fetching webpages using HTTP.

With those two changes in place, the script output become:

Lifting the server siege...
Transactions:		         154 hits
Availability:		      100.00 %
Elapsed time:		       30.82 secs
Data transferred:	        0.82 MB
Response time:		        1.88 secs
Transaction rate:	        5.00 trans/sec
Throughput:		        0.03 MB/sec
Concurrency:		        9.41
Successful transactions:         154
Failed transactions:	           0
Longest transaction:	        4.07
Shortest transaction:	        1.24

While there's more to optimize, we have clearly removed a significant bottleneck.

bassosimone · 2023-02-15T11:04:10Z

I backported ooni/probe-cli#1097 to release/3.17 with ooni/probe-cli@ebb4ecc.

There was no need to backport ooni/probe-cli#1098. After resolving conflicts, the remaining diff boiled down to:

diff --git a/internal/netxlite/quic.go b/internal/netxlite/quic.go
index 682c976e..c4641c88 100644
--- a/internal/netxlite/quic.go
+++ b/internal/netxlite/quic.go
@@ -166,7 +166,8 @@ func (d *quicDialerQUICGo) dialEarlyContext(ctx context.Context,
 func (d *quicDialerQUICGo) maybeApplyTLSDefaults(config *tls.Config, port int) *tls.Config {
        config = config.Clone()
        if config.RootCAs == nil {
-               config.RootCAs = defaultCertPool
+               // See https://github.com/ooni/probe/issues/2413 for context
+               config.RootCAs = tproxyDefaultCertPool
        }
        if len(config.NextProtos) <= 0 {
                switch port {
diff --git a/internal/netxlite/tls.go b/internal/netxlite/tls.go
index dc87743a..8f7e781b 100644
--- a/internal/netxlite/tls.go
+++ b/internal/netxlite/tls.go
@@ -213,7 +213,8 @@ func (h *tlsHandshakerConfigurable) Handshake(
        conn.SetDeadline(time.Now().Add(timeout))
        if config.RootCAs == nil {
                config = config.Clone()
-               config.RootCAs = defaultCertPool
+               // See https://github.com/ooni/probe/issues/2413 for context
+               config.RootCAs = tproxyDefaultCertPool
        }
        tlsconn, err := h.newConn(conn, config)
        if err != nil {
diff --git a/internal/netxlite/tproxy.go b/internal/netxlite/tproxy.go
index c56b853e..0043ed12 100644
--- a/internal/netxlite/tproxy.go
+++ b/internal/netxlite/tproxy.go
@@ -15,6 +15,16 @@ var TProxy model.UnderlyingNetwork = &DefaultTProxy{}
 // defaultTProxy is the default UnderlyingNetwork implementation.
 type DefaultTProxy struct{}
 
+// tproxyDefaultCertPool is a static copy of the default cert pool. You
+// MUST NOT access this variable directly. You SHOULD use the
+// tproxySingleton().DefaultCertPool() factory instead. By doing
+// that, you would allow integration tests to override the pool
+// we're using. Hence, we can run tests with fake servers.
+//
+// See https://github.com/ooni/probe/issues/2413 to understand why we
+// need a private static default pool.
+var tproxyDefaultCertPool = NewDefaultCertPool()
+
 // DialContext implements UnderlyingNetwork.
 func (tp *DefaultTProxy) DialContext(ctx context.Context, timeout time.Duration, network, address string) (net.Conn, error) {
        d := &net.Dialer{

From this diff it's clear that release/3.17 already used a private cert pool. This means that ooni/probe-cli#1068 and ooni/probe-cli#1069 were introducing a performance regression in that they always called NewDefaultCertPool when using a private static pool would have been better.

Hence, with ooni/probe-cli@ebb4ecc merged into release/3.17, we are now in a good place to evaluate whether we improved the situation for release/3.17.

bassosimone · 2023-02-15T15:18:41Z

In the meanwhile, we have updated our test helpers architecture in ooni/api@3bf3bef. The general idea is to use 0.th as the place where we do A/B testing after code has been tested on the testing infrastructure. This test helper will receive a small portion of the traffic. The other THs will always run stable code and will round robin the rest of the traffic.

bassosimone · 2023-02-15T15:46:59Z

To make the developed diff benefit all experiments, we should make sure we are not calling NewDefaultCertPool every time an experiment runs. We did this in ooni/probe-cli#1102.

bassosimone · 2023-02-15T15:59:00Z

We should also expose pprof metrics to localhost and think about which additional prometheus metrics we should expose in dashboards. We have already added the number of currently allocated bytes and the GC duration.

Part of ooni/probe#2413

This diff attempts to rewrite and steer handler_test.go such that it mostly contains unit tests. As part of this work, we create a single factory function for constructing the handler. While there, we also rename the indexLogger to prefixLogger (a more accurate name) and we deploy more comments and whitespaces. Part of ooni/probe#2413

This diff modifies the tree to rely on netxlite's property that it will use the Mozilla cert pool when no cert pool has been set. After this diff has been applied, these are the places where we still call `NewDefaultCertPool`: ``` internal/experiment/echcheck/handshake.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/ndt7/dial.go:var certpool = netxlite.NewDefaultCertPool() internal/experiment/riseupvpn/riseupvpn.go: certPool := netxlite.NewDefaultCertPool() internal/experiment/signal/signal.go: certPool := netxlite.NewDefaultCertPool() internal/model/netx.go: // race. Use [netxlite.NewDefaultCertPool] if you wish to get internal/netxlite/filtering/tls_test.go: RootCAs: netxlite.NewDefaultCertPool(), internal/netxlite/tls.go:// NewDefaultCertPool returns the default x509 certificate pool internal/netxlite/tls.go:func NewDefaultCertPool() *x509.CertPool { internal/netxlite/tls_test.go:func TestNewDefaultCertPoolWorks(t *testing.T) { internal/netxlite/tls_test.go: pool := NewDefaultCertPool() internal/netxlite/tproxy.go:var tproxyDefaultCertPool = NewDefaultCertPool() ``` The references in `internal/model` and `internal/netxlite` look good. In `ndt7` and `echcheck`, we're creating a cached copy, so we're all good. In `riseupvpn` and `signal`, we need to extend the cert pool, so those are the places where calling `NewDefaultCertPool` makes sense. These are the places where we modify cert pools: ``` internal/experiment/riseupvpn/riseupvpn.go: if ok := certPool.AppendCertsFromPEM([]byte(tk.HTTPResponseBody)); !ok { internal/experiment/signal/signal.go: if !certPool.AppendCertsFromPEM(caBytes) { internal/experiment/signal/signal.go: return errors.New("AppendCertsFromPEM failed") internal/experiment/signal/signal_test.go: if err.Error() != "AppendCertsFromPEM failed" { internal/netxlite/certifi_test.go: if !pool.AppendCertsFromPEM([]byte(pemcerts)) { internal/netxlite/internal/gencertifi/main.go: if !pool.AppendCertsFromPEM(bundle) { internal/netxlite/tls.go: // Assumption: AppendCertsFromPEM cannot fail because we internal/netxlite/tls.go: ok := pool.AppendCertsFromPEM([]byte(pemcerts)) internal/netxlite/tls.go: runtimex.Assert(ok, "pool.AppendCertsFromPEM failed") internal/experiment/urlgetter/multi_test.go: certpool.AddCert(cert) internal/netxlite/filtering/http.go: o.AddCert(p.cert) internal/netxlite/filtering/tls.go: o.AddCert(p.cert) internal/netxlite/tproxy_test.go: pool.AddCert(srvr.Certificate()) ``` Excluding `_test.go` files, this is basically: * in `riseupvpn` and `signal`, where we need to do that; * when we generate the bundled certs; * in `netxlite/tls.go`, where we implement `NewDefaultCertPool`. Part of ooni/probe#2413.

This diff increases the timeout used to await for pending requests and parallelizes the shutdown of servers. While there, let's move all the flags at toplevel for clarity. While there, recognize that we can use the channel's signal for shutting down the server in tests. These changes are preliminary changes before making the oohelperd more capable of serving interesting metrics and debug info. See ooni/probe#2413

I also added an endpoint for obtaining exection traces. See ooni/probe#2413

I also added an endpoint for obtaining execution traces. See ooni/probe#2413

* buildtool: add rules to build oohelperd * .github/workflows: use buildtool to build oohelperd * oohelperd: add `-replace` flag to replace an already running instance of the oohelperd * oohelperd: add `-version` flag to get version Part of ooni/probe#2413

bassosimone · 2023-02-16T10:54:45Z

In ooni/probe-cli#1104, I have added support for obtaining the profile of a running instance. In ooni/probe-cli#1105, I made oohelperd deployment easier. We should now write a short runbook explaining how to extract performance metrics from a running oohelperd daemon.

bassosimone · 2023-06-12T12:11:25Z

We have put work in to mitigate the CPU usage and we now have support for collecting metrics. Additionally, we are now using four test helpers and we disabled using HTTP/3 in the TH for now. I will close the issue right now, because otherwise this will become a long-running issue with no ending condition in sight. Should we see new performance issues in the future, we will reopen this issue or create a new issue.

bassosimone added bug Something isn't working testing priority/high technical task technical tasks e.g. deployment platform/linux data quality needs investigation This issue needs extra data and investigation ooni/probe-engine labels Feb 15, 2023

bassosimone self-assigned this Feb 15, 2023

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

feat(oohelperd): allow collecting CPU profiles

7d79b5b

See ooni/probe#2413

bassosimone mentioned this issue Feb 15, 2023

feat(oohelperd): allow collecting CPU profiles ooni/probe-cli#1096

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

feat(oohelperd): allow collecting CPU profiles (#1096)

26fd3e2

See ooni/probe#2413

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

fix(oohelperd): use cached cert pool for quic and tls

c11c37a

This commit creates a private static cert pool to use in the oohelperd when issuing QUIC and TLS connections. See ooni/probe#2413 for context.

bassosimone mentioned this issue Feb 15, 2023

fix(oohelperd): use cached cert pool for quic and tls ooni/probe-cli#1097

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

fix(netxlite): use private static default cert pool

ff46aa3

See ooni/probe#2413 for context.

bassosimone mentioned this issue Feb 15, 2023

fix(netxlite): use private static default cert pool ooni/probe-cli#1098

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

fix(oohelperd): better unit tests for dnsDo

53d66df

Part of ooni/probe#2413

bassosimone mentioned this issue Feb 15, 2023

fix(oohelperd): better unit tests for dnsDo ooni/probe-cli#1099

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023

fix(oohelperd): better unit tests for dnsDo (#1099)

8bd0c10

Part of ooni/probe#2413

bassosimone mentioned this issue Feb 16, 2023

refactor(oohelperd): improve handler tests ooni/probe-cli#1101

Merged

bassosimone mentioned this issue Feb 16, 2023

fix: avoid NewDefaultCertPool when not needed ooni/probe-cli#1102

Merged

4 tasks

bassosimone mentioned this issue Feb 16, 2023

fix(oohelperd): allow pending requests to complete ooni/probe-cli#1103

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023

feat(oohelperd): serve profiles using HTTP

4020124

I also added an endpoint for obtaining exection traces. See ooni/probe#2413

bassosimone mentioned this issue Feb 16, 2023

feat(oohelperd): serve CPU profiles using HTTP ooni/probe-cli#1104

Merged

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023

feat(oohelperd): serve CPU profiles using HTTP (#1104)

b4cbefc

I also added an endpoint for obtaining execution traces. See ooni/probe#2413

bassosimone mentioned this issue Feb 16, 2023

feat(oohelperd): make 0.th.ooni.org deployment easier ooni/probe-cli#1105

Merged

kelmenhorst mentioned this issue Mar 22, 2023

engine: NewDefaultCertPool should not be used by experiment code #2444

Closed

bassosimone closed this as completed Jun 12, 2023

bassosimone mentioned this issue Oct 20, 2023

dslx: support running w/o measurexlite #2580

Closed

bassosimone mentioned this issue Dec 21, 2023

oohelperd: protect against overload #2649

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

oohelperd: excessive CPU usage under moderate load #2413

oohelperd: excessive CPU usage under moderate load #2413

bassosimone commented Feb 15, 2023

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 16, 2023 •

edited

Loading

bassosimone commented Jun 12, 2023

oohelperd: excessive CPU usage under moderate load #2413

oohelperd: excessive CPU usage under moderate load #2413

Comments

bassosimone commented Feb 15, 2023

bassosimone commented Feb 15, 2023 • edited Loading

bassosimone commented Feb 15, 2023 • edited Loading

bassosimone commented Feb 15, 2023

bassosimone commented Feb 15, 2023 • edited Loading

bassosimone commented Feb 15, 2023 • edited Loading

bassosimone commented Feb 15, 2023 • edited Loading

bassosimone commented Feb 16, 2023 • edited Loading

bassosimone commented Jun 12, 2023

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 15, 2023 •

edited

Loading

bassosimone commented Feb 16, 2023 •

edited

Loading