Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oohelperd: excessive CPU usage under moderate load #2413

Closed
bassosimone opened this issue Feb 15, 2023 · 8 comments
Closed

oohelperd: excessive CPU usage under moderate load #2413

bassosimone opened this issue Feb 15, 2023 · 8 comments
Assignees
Labels
bug Something isn't working data quality needs investigation This issue needs extra data and investigation ooni/probe-engine platform/linux priority/high technical task technical tasks e.g. deployment testing

Comments

@bassosimone
Copy link
Contributor

TL;DR We need to investigate high oohelperd CPU usage under load, which we noticed during an A/B testing session, but which seems to have been a lingering problem for the TH also before the A/B testing itself.

With my team members (and @FedericoCeratto and @sloncocs in particular), we have been investigating excessive CPU load in the oohelperd, which surfaced when A/B testing code we introduced in ooni/probe-cli#985.

We started testing this code on 2023-02-07T08:06:00Z by deploying a ~3.17.0-beta build on 0.th.ooni.org. On February 9, I said on our internal channels that I was a bit worried by the idle CPU after the deployment:

Screenshot 2023-02-09 at 10 44 17

Further zooming out, though, I noticed there were previous moments in time where we had little CPU to spare, so I was not convinced that we could easily say the new code was the culprit (spoiler: it was not):

Screenshot 2023-02-10 at 10 41 14

As you could see, in fact, in December we had lower available CPU that we had in early February. Because this fact was worrying me, I called for discussing this topic in the next backend meeting (February 14).

However, when I was back to the office after the weekend, I noticed that the CPU usage was now 100%:

Screenshot 2023-02-13 at 07 54 31

While I was investigating, my colleague @sloncocs warned about a surge of failed tests over the weekend:

Screenshot 2023-02-15 at 10 00 24

I quickly realized that, because the 0.th.ooni.org test helper was on its knees, it could not service many users. Therefore, several measurements that happened using such a test helper did not include the control measurement. Hence, the probe could not determine the status of the measurement (all good, website down, anomaly, etc).

Fifteen minutes after noticing this issue, I restarted the oohelperd. Since then, I have reverted back to running the previous code, thus interrupting the A/B testing session until it became more clear to us what was going on.

This issue is here to share what I have learned (so far) when investigating this issue.

@bassosimone bassosimone added bug Something isn't working testing priority/high technical task technical tasks e.g. deployment platform/linux data quality needs investigation This issue needs extra data and investigation ooni/probe-engine labels Feb 15, 2023
@bassosimone bassosimone self-assigned this Feb 15, 2023
@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 15, 2023

The first problem to solve is how to stress test the oohelperd in a testing environment. To this end, @FedericoCeratto proposed using the siege tool to benchmark the oohelperd binary.

This is the script we have been using to perform this task:

#!/bin/bash
set -euxo pipefail
siege -c 10 -t30S \
  --content-type "application/json" \
  'http://127.0.0.1:8080 POST {"http_request": "https://www.youtube.com/", "http_request_headers": {}, "tcp_connect": ["216.58.209.46:443", "142.250.184.78:443", "142.250.184.110:443", "142.250.180.142:443", "142.250.180.174:443", "142.251.209.14:443", "142.251.209.46:443"]}'

The general idea here is to issue the same POST request with the given body for 30 seconds using ten concurrent users to create load on the system. I am personally testing this code in my development computer (which does not have lower hardware resources than the VM we are currently using for the test helpers).

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
This commit creates a private static cert pool to use in the
oohelperd when issuing QUIC and TLS connections.

See ooni/probe#2413 for context.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
This commit creates a private static cert pool to use in the
oohelperd when issuing QUIC and TLS connections.

See ooni/probe#2413 for context.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
This diff is slightly more complex than the naive diff that creates
a private static pool and uses it for quic and tls. The reason is
that we also need to support a mechanism allowing integration tests
to transparently override the default cert pool.

See ooni/probe#2413 for context.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
This diff is slightly more complex than the naive diff that creates
a private static pool and uses it for quic and tls. The reason is
that we also need to support a mechanism allowing integration tests
to transparently override the default cert pool.

See ooni/probe#2413 for context.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
This commit backports 8476f11.

This commit creates a private static cert pool to use in the
oohelperd when issuing QUIC and TLS connections.

See ooni/probe#2413 for context.
@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 15, 2023

With ooni/probe-cli#1096 applied, I ran the above script and got as output:

Lifting the server siege...
Transactions:		          45 hits
Availability:		      100.00 %
Elapsed time:		       30.08 secs
Data transferred:	        0.24 MB
Response time:		        4.79 secs
Transaction rate:	        1.50 trans/sec
Throughput:		        0.01 MB/sec
Concurrency:		        7.16
Successful transactions:          45
Failed transactions:	           0
Longest transaction:	       12.69
Shortest transaction:	        1.34

The pprof trace for oohelperd looked as follows:

pprof001

If you download the SVG and zoom, you can see how netxlite.NewDefaultCertPool is eating lots of CPU.

So, I implemented ooni/probe-cli#1097 and ooni/probe-cli#1098 to create a static cached cert pool that we can use for both oohelperd and netxlite. The latter change is required because the oohelperd uses a netxlite HTTP client with default configuration, so it was always calling netxlite.NewDefaultCertPool when fetching webpages using HTTP.

With those two changes in place, the script output become:

Lifting the server siege...
Transactions:		         154 hits
Availability:		      100.00 %
Elapsed time:		       30.82 secs
Data transferred:	        0.82 MB
Response time:		        1.88 secs
Transaction rate:	        5.00 trans/sec
Throughput:		        0.03 MB/sec
Concurrency:		        9.41
Successful transactions:         154
Failed transactions:	           0
Longest transaction:	        4.07
Shortest transaction:	        1.24

While there's more to optimize, we have clearly removed a significant bottleneck.

@bassosimone
Copy link
Contributor Author

I backported ooni/probe-cli#1097 to release/3.17 with ooni/probe-cli@ebb4ecc.

There was no need to backport ooni/probe-cli#1098. After resolving conflicts, the remaining diff boiled down to:

diff --git a/internal/netxlite/quic.go b/internal/netxlite/quic.go
index 682c976e..c4641c88 100644
--- a/internal/netxlite/quic.go
+++ b/internal/netxlite/quic.go
@@ -166,7 +166,8 @@ func (d *quicDialerQUICGo) dialEarlyContext(ctx context.Context,
 func (d *quicDialerQUICGo) maybeApplyTLSDefaults(config *tls.Config, port int) *tls.Config {
        config = config.Clone()
        if config.RootCAs == nil {
-               config.RootCAs = defaultCertPool
+               // See https://github.com/ooni/probe/issues/2413 for context
+               config.RootCAs = tproxyDefaultCertPool
        }
        if len(config.NextProtos) <= 0 {
                switch port {
diff --git a/internal/netxlite/tls.go b/internal/netxlite/tls.go
index dc87743a..8f7e781b 100644
--- a/internal/netxlite/tls.go
+++ b/internal/netxlite/tls.go
@@ -213,7 +213,8 @@ func (h *tlsHandshakerConfigurable) Handshake(
        conn.SetDeadline(time.Now().Add(timeout))
        if config.RootCAs == nil {
                config = config.Clone()
-               config.RootCAs = defaultCertPool
+               // See https://github.com/ooni/probe/issues/2413 for context
+               config.RootCAs = tproxyDefaultCertPool
        }
        tlsconn, err := h.newConn(conn, config)
        if err != nil {
diff --git a/internal/netxlite/tproxy.go b/internal/netxlite/tproxy.go
index c56b853e..0043ed12 100644
--- a/internal/netxlite/tproxy.go
+++ b/internal/netxlite/tproxy.go
@@ -15,6 +15,16 @@ var TProxy model.UnderlyingNetwork = &DefaultTProxy{}
 // defaultTProxy is the default UnderlyingNetwork implementation.
 type DefaultTProxy struct{}
 
+// tproxyDefaultCertPool is a static copy of the default cert pool. You
+// MUST NOT access this variable directly. You SHOULD use the
+// tproxySingleton().DefaultCertPool() factory instead. By doing
+// that, you would allow integration tests to override the pool
+// we're using. Hence, we can run tests with fake servers.
+//
+// See https://github.com/ooni/probe/issues/2413 to understand why we
+// need a private static default pool.
+var tproxyDefaultCertPool = NewDefaultCertPool()
+
 // DialContext implements UnderlyingNetwork.
 func (tp *DefaultTProxy) DialContext(ctx context.Context, timeout time.Duration, network, address string) (net.Conn, error) {
        d := &net.Dialer{

From this diff it's clear that release/3.17 already used a private cert pool. This means that ooni/probe-cli#1068 and ooni/probe-cli#1069 were introducing a performance regression in that they always called NewDefaultCertPool when using a private static pool would have been better.

Hence, with ooni/probe-cli@ebb4ecc merged into release/3.17, we are now in a good place to evaluate whether we improved the situation for release/3.17.

@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 15, 2023

In the meanwhile, we have updated our test helpers architecture in ooni/api@3bf3bef. The general idea is to use 0.th as the place where we do A/B testing after code has been tested on the testing infrastructure. This test helper will receive a small portion of the traffic. The other THs will always run stable code and will round robin the rest of the traffic.

@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 15, 2023

To make the developed diff benefit all experiments, we should make sure we are not calling NewDefaultCertPool every time an experiment runs. We did this in ooni/probe-cli#1102.

@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 15, 2023

We should also expose pprof metrics to localhost and think about which additional prometheus metrics we should expose in dashboards. We have already added the number of currently allocated bytes and the GC duration.

bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 15, 2023
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff attempts to rewrite and steer handler_test.go such that
it mostly contains unit tests. As part of this work, we create a
single factory function for constructing the handler. While there,
we also rename the indexLogger to prefixLogger (a more accurate name)
and we deploy more comments and whitespaces.

Part of ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff attempts to rewrite and steer handler_test.go such that
it mostly contains unit tests. As part of this work, we create a
single factory function for constructing the handler. While there,
we also rename the indexLogger to prefixLogger (a more accurate name)
and we deploy more comments and whitespaces.

Part of ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff modifies the tree to rely on netxlite's property that it will
use the Mozilla cert pool when no cert pool has been set.

After this diff has been applied, these are the places where we still
call `NewDefaultCertPool`:

```
internal/experiment/echcheck/handshake.go:var certpool = netxlite.NewDefaultCertPool()
internal/experiment/ndt7/dial.go:var certpool = netxlite.NewDefaultCertPool()
internal/experiment/riseupvpn/riseupvpn.go:	certPool := netxlite.NewDefaultCertPool()
internal/experiment/signal/signal.go:	certPool := netxlite.NewDefaultCertPool()
internal/model/netx.go:	// race. Use [netxlite.NewDefaultCertPool] if you wish to get
internal/netxlite/filtering/tls_test.go:				RootCAs:    netxlite.NewDefaultCertPool(),
internal/netxlite/tls.go:// NewDefaultCertPool returns the default x509 certificate pool
internal/netxlite/tls.go:func NewDefaultCertPool() *x509.CertPool {
internal/netxlite/tls_test.go:func TestNewDefaultCertPoolWorks(t *testing.T) {
internal/netxlite/tls_test.go:	pool := NewDefaultCertPool()
internal/netxlite/tproxy.go:var tproxyDefaultCertPool = NewDefaultCertPool()
```

The references in `internal/model` and `internal/netxlite` look good.

In `ndt7` and `echcheck`, we're creating a cached copy, so we're all good.

In `riseupvpn` and `signal`, we need to extend the cert pool, so those
are the places where calling `NewDefaultCertPool` makes sense.

These are the places where we modify cert pools:

```
internal/experiment/riseupvpn/riseupvpn.go:		if ok := certPool.AppendCertsFromPEM([]byte(tk.HTTPResponseBody)); !ok {
internal/experiment/signal/signal.go:		if !certPool.AppendCertsFromPEM(caBytes) {
internal/experiment/signal/signal.go:			return errors.New("AppendCertsFromPEM failed")
internal/experiment/signal/signal_test.go:	if err.Error() != "AppendCertsFromPEM failed" {
internal/netxlite/certifi_test.go:		if !pool.AppendCertsFromPEM([]byte(pemcerts)) {
internal/netxlite/internal/gencertifi/main.go:	if !pool.AppendCertsFromPEM(bundle) {
internal/netxlite/tls.go:	// Assumption: AppendCertsFromPEM cannot fail because we
internal/netxlite/tls.go:	ok := pool.AppendCertsFromPEM([]byte(pemcerts))
internal/netxlite/tls.go:	runtimex.Assert(ok, "pool.AppendCertsFromPEM failed")
internal/experiment/urlgetter/multi_test.go:	certpool.AddCert(cert)
internal/netxlite/filtering/http.go:		o.AddCert(p.cert)
internal/netxlite/filtering/tls.go:	o.AddCert(p.cert)
internal/netxlite/tproxy_test.go:				pool.AddCert(srvr.Certificate())
```

Excluding `_test.go` files, this is basically:

* in `riseupvpn` and `signal`, where we need to do that;

* when we generate the bundled certs;

* in `netxlite/tls.go`, where we implement `NewDefaultCertPool`.

Part of ooni/probe#2413.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff modifies the tree to rely on netxlite's property that it will
use the Mozilla cert pool when no cert pool has been set.

After this diff has been applied, these are the places where we still
call `NewDefaultCertPool`:

```
internal/experiment/echcheck/handshake.go:var certpool = netxlite.NewDefaultCertPool()
internal/experiment/ndt7/dial.go:var certpool = netxlite.NewDefaultCertPool()
internal/experiment/riseupvpn/riseupvpn.go:	certPool := netxlite.NewDefaultCertPool()
internal/experiment/signal/signal.go:	certPool := netxlite.NewDefaultCertPool()
internal/model/netx.go:	// race. Use [netxlite.NewDefaultCertPool] if you wish to get
internal/netxlite/filtering/tls_test.go:				RootCAs:    netxlite.NewDefaultCertPool(),
internal/netxlite/tls.go:// NewDefaultCertPool returns the default x509 certificate pool
internal/netxlite/tls.go:func NewDefaultCertPool() *x509.CertPool {
internal/netxlite/tls_test.go:func TestNewDefaultCertPoolWorks(t *testing.T) {
internal/netxlite/tls_test.go:	pool := NewDefaultCertPool()
internal/netxlite/tproxy.go:var tproxyDefaultCertPool = NewDefaultCertPool()
```

The references in `internal/model` and `internal/netxlite` look good.

In `ndt7` and `echcheck`, we're creating a cached copy, so we're all good.

In `riseupvpn` and `signal`, we need to extend the cert pool, so those
are the places where calling `NewDefaultCertPool` makes sense.

These are the places where we modify cert pools:

```
internal/experiment/riseupvpn/riseupvpn.go:		if ok := certPool.AppendCertsFromPEM([]byte(tk.HTTPResponseBody)); !ok {
internal/experiment/signal/signal.go:		if !certPool.AppendCertsFromPEM(caBytes) {
internal/experiment/signal/signal.go:			return errors.New("AppendCertsFromPEM failed")
internal/experiment/signal/signal_test.go:	if err.Error() != "AppendCertsFromPEM failed" {
internal/netxlite/certifi_test.go:		if !pool.AppendCertsFromPEM([]byte(pemcerts)) {
internal/netxlite/internal/gencertifi/main.go:	if !pool.AppendCertsFromPEM(bundle) {
internal/netxlite/tls.go:	// Assumption: AppendCertsFromPEM cannot fail because we
internal/netxlite/tls.go:	ok := pool.AppendCertsFromPEM([]byte(pemcerts))
internal/netxlite/tls.go:	runtimex.Assert(ok, "pool.AppendCertsFromPEM failed")
internal/experiment/urlgetter/multi_test.go:	certpool.AddCert(cert)
internal/netxlite/filtering/http.go:		o.AddCert(p.cert)
internal/netxlite/filtering/tls.go:	o.AddCert(p.cert)
internal/netxlite/tproxy_test.go:				pool.AddCert(srvr.Certificate())
```

Excluding `_test.go` files, this is basically:

* in `riseupvpn` and `signal`, where we need to do that;

* when we generate the bundled certs;

* in `netxlite/tls.go`, where we implement `NewDefaultCertPool`.

Part of ooni/probe#2413.
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff increases the timeout used to await for pending
requests and parallelizes the shutdown of servers.

While there, let's move all the flags at toplevel for clarity.

While there, recognize that we can use the channel's signal
for shutting down the server in tests.

These changes are preliminary changes before making the
oohelperd more capable of serving interesting metrics and
debug info.

See ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
This diff increases the timeout used to await for pending
requests and parallelizes the shutdown of servers.

While there, let's move all the flags at toplevel for clarity.

While there, recognize that we can use the channel's signal
for shutting down the server in tests.

These changes are preliminary changes before making the
oohelperd more capable of serving interesting metrics and
debug info.

See ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
I also added an endpoint for obtaining exection traces.

See ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
I also added an endpoint for obtaining execution traces.

See ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
* buildtool: add rules to build oohelperd

* .github/workflows: use buildtool to build oohelperd

* oohelperd: add `-replace` flag to replace an already running
instance of the oohelperd

* oohelperd: add `-version` flag to get version

Part of ooni/probe#2413
bassosimone added a commit to ooni/probe-cli that referenced this issue Feb 16, 2023
* buildtool: add rules to build oohelperd

* .github/workflows: use buildtool to build oohelperd

* oohelperd: add `-replace` flag to replace an already running
instance of the oohelperd

* oohelperd: add `-version` flag to get version

Part of ooni/probe#2413
@bassosimone
Copy link
Contributor Author

bassosimone commented Feb 16, 2023

In ooni/probe-cli#1104, I have added support for obtaining the profile of a running instance. In ooni/probe-cli#1105, I made oohelperd deployment easier. We should now write a short runbook explaining how to extract performance metrics from a running oohelperd daemon.

@bassosimone
Copy link
Contributor Author

We have put work in to mitigate the CPU usage and we now have support for collecting metrics. Additionally, we are now using four test helpers and we disabled using HTTP/3 in the TH for now. I will close the issue right now, because otherwise this will become a long-running issue with no ending condition in sight. Should we see new performance issues in the future, we will reopen this issue or create a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data quality needs investigation This issue needs extra data and investigation ooni/probe-engine platform/linux priority/high technical task technical tasks e.g. deployment testing
Projects
None yet
Development

No branches or pull requests

1 participant