Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qaenv_test.go:169: Get "https://www.example.com/": generic_timeout_error #2527

Closed
bassosimone opened this issue Sep 4, 2023 · 2 comments · Fixed by ooni/probe-cli#1228
Closed
Assignees
Labels
bug Something isn't working ooni/probe-engine

Comments

@bassosimone
Copy link
Member

In ooni/probe-cli#1224, there is a test (TestQAEnv/we_can_hijack_HTTP3_requests) that fails with generic_timeout_error when executed along with other tests, but works when run alone. I have read the diff a couple of times and I do not fully understand why this error happens only when run along with other tests. Ideally, one would say that there is some state across calls that create issues. However, I do not see such a state in our code. So, I think probably the error we have here is something subtle. Because having this test WAI is not a priority now, I am going to document the bug and mark the test as one that should be skipped. I will eventually figure out what is wrong here.

@bassosimone bassosimone added bug Something isn't working ooni/probe-engine labels Sep 4, 2023
@bassosimone bassosimone self-assigned this Sep 4, 2023
@bassosimone
Copy link
Member Author

bassosimone commented Sep 4, 2023

Uh, I did not notice it initially, but the logs also contain this error:

--- FAIL: TestHTTP3ServerFactory (7.68s)
    --- FAIL: TestHTTP3ServerFactory/when_using_an_incompatible_TLS_config (6.67s)
        http3_test.go:73: unexpected error generic_timeout_error
2023/09/04 19:37:58 http: TLS handshake error from 130.192.91.211:27865: EOF
2023/09/04 19:37:59  info TH_HANDLER      <#1> DNSLookup www.example.com... ok
2023/09/04 19:37:59  info TH_HANDLER      <#1> TCPConnect 93.184.216.34:443 EnableTLS=true SNI=www.example.com... ok
2023/09/04 19:37:59  info TH_HANDLER      <#1> GET https://www.example.com/... ok
2023/09/04 19:37:59  info TH_HANDLER      <#1> QUICConnect 93.184.216.34:443 SNI=www.example.com... in progress
2023/09/04 19:37:59  info TH_HANDLER      <#1> GET https://www.example.com:443... in progress
2023/09/04 19:38:04  info TH_HANDLER      <#1> QUICConnect 93.184.216.34:443 SNI=www.example.com... generic_timeout_error
2023/09/04 19:38:04  info TH_HANDLER      <#1> GET https://www.example.com:443... Get "[https://www.example.com:443](https://www.example.com/)": generic_timeout_error
--- FAIL: TestOOHelperDHandler (5.83s)
    oohelperd_test.go:123:   &model.THResponse{
          	TCPConnect:   {"93.184.216.34:443": {Status: true}},
          	TLSHandshake: {"93.184.216.34:443": {ServerName: "www.example.com", Status: true}},
          	QUICHandshake: map[string]model.THTLSHandshakeResult{
          		"93.184.216.34:443": {
          			ServerName: "www.example.com",
        - 			Status:     true,
        + 			Status:     false,
        - 			Failure:    nil,
        + 			Failure:    &"generic_timeout_error",
          		},
          	},
          	HTTPRequest: {BodyLength: 194, DiscoveredH3Endpoint: "www.example.com:443", Title: "Default Web Page", Headers: {"Alt-Svc": `h3=":443"`, "Content-Length": "194", "Content-Type": "text/html; charset=utf-8", "Date": "Thu, 24 Aug 2023 14:35:29 GMT"}, ...},
          	HTTP3Request: &model.THHTTPRequestResult{
        - 		BodyLength:           194,
        + 		BodyLength:           -1,
          		DiscoveredH3Endpoint: "",
        - 		Failure:              nil,
        + 		Failure:              &"generic_timeout_error",
        - 		Title:                "Default Web Page",
        + 		Title:                "",
          		Headers: map[string]string{
        - 			"Alt-Svc":      `h3=":443"`,
        - 			"Content-Type": "text/html; charset=utf-8",
        - 			"Date":         "Thu, 24 Aug 2023 14:35:29 GMT",
          		},
        - 		StatusCode: 200,
        + 		StatusCode: -1,
          	},
          	DNS:    {Addrs: {"93.184.216.34"}},
          	IPInfo: {"93.184.216.34": &{ASN: 15133, Flags: 10}},
          }
        
2023/09/04 19:38:12 http: TLS handshake error from 130.192.91.211:52167: use of closed network connection
--- FAIL: TestQAEnv (9.25s)
    --- FAIL: TestQAEnv/we_can_hijack_HTTP3_requests (5.57s)
        qaenv_test.go:169: Get "https://www.example.com/": generic_timeout_error

So, another QUIC based test (TestOOHelperDHandler, checking oohelperd for HTTP/3) fails with timeout just because I have added another test using the same endpoint. This is certainly quite strange and unexpected 😅.

@bassosimone
Copy link
Member Author

So, apparently I need to disable all the new tests I have added. One strange "thing" is that I can also make tests work if I am using different IP addresses for different tests (see ooni/probe-cli@3145e55). I think I should discuss this issue with @kelmenhorst. For now, I think it's ~fine to merge the code with the ASCII ART "WTF" and tests working as intended notwithstanding the need to use different IP addresses to make it work. The issue should still be open, though.

bassosimone added a commit to ooni/probe-cli that referenced this issue Sep 4, 2023
This diff continues improving and refactoring netemx with the objective
of unifying how we create all kind of servers.

Here, specifically, we modify the HTTPS server implementing
NetStackServerFactory implemented in the previous commit and obtain an
HTTP3 server honouring NetStackServerFactory.

Crucially, this diff also adds support for overriding the TLS config
passed to the server, which enables us to test for expired certificates,
self-signed certificates, and so forth.

While working on this diff, I noticed a weird behavior with HTTP/3 tests
using the same address, which is documented at
ooni/probe#2527. I modified the tests to make
them pass. To this end, I changed the IP addresses used by HTTP/3 tests
to avoid reusing www.example.com's IP address. It seems fine, for now,
to merge this code, because HTTP/3 is not a cornerstone of how we
measure, for now. But we should investigate further in the future!

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#1803
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
bassosimone added a commit to ooni/probe-cli that referenced this issue Sep 4, 2023
This diff converts the ScenarioRoleWebServer case to using
QAEnvOptionNetStack. While there, recognize that
ooni/probe#2527 is really
making all QUIC tests fragile, and scale them down a bit.

Part of ooni/probe#1803
bassosimone added a commit to ooni/probe-cli that referenced this issue Sep 4, 2023
This diff converts the ScenarioRoleWebServer case to using
QAEnvOptionNetStack. While there, recognize that
ooni/probe#2527 is really making all QUIC
tests fragile, and scale them down a bit.

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#1803
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
bassosimone added a commit to ooni/probe-cli that referenced this issue Sep 4, 2023
This diff fixes ooni/probe#2527.

The comment I originally wrote in the code to explain the issue read:

> I am wondering whether I am not fully understanding how quic-go/quic-go works.

It turns out this comment was true. I was assuming http3.Server was closing
the listener, however, that was not the case.

I noticed this issue when reading again netemx/qaenv.go.

So, it all boiled down to PEBCAK 😬.
bassosimone added a commit to ooni/probe-cli that referenced this issue Sep 4, 2023
This diff fixes ooni/probe#2527.

The comment I originally wrote in the code to explain the issue read:

> I am wondering whether I am not fully understanding how
quic-go/quic-go works.

It turns out this comment was true. I was assuming http3.Server was
closing the listener, however, that was not the case.

I noticed this issue when reading again netemx/qaenv.go.

So, it all boiled down to PEBCAK 😬.

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#2527
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
bassosimone added a commit to ooni/probe-cli that referenced this issue Dec 13, 2023
Upgrade to quic-go@v0.40.1 and adapt closing policy after [changes in
quic-go@v0.40.0](https://github.com/quic-go/quic-go/releases/tag/v0.40.0)
and quic-go/quic-go#4072 in particular. We're
creating an UDP conn and [passing it to
server.Serve](https://github.com/ooni/probe-cli/pull/1428/files#diff-efda3daa51e9aed0b3444a327e64b7e5c412938a1fe894a3c850d533179c2425R105),
which [calls
serveConn](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L242),
which [calls
quicListen](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L316)
and then
[ServeListener](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L321).
In turn, ServeListener [is interrupted by
ErrServerClosed](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L268),
which seems to be generated by [server.Close calling Close for each
listener](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L657).
The following is what happened before updating the shutdown protocol:

```
goroutine 247 [sync.Mutex.Lock]:

[...]

github.com/quic-go/quic-go.(*Transport).closeServer(0x1400054a000?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:298 +0x90

github.com/quic-go/quic-go.(*baseServer).close.func1()
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:344 +0x84

[...]

github.com/quic-go/quic-go.(*baseServer).close(0x14000cc1c10?, {0x102faa7a0?, 0x140002557b0?}, 0xec?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:338 +0x64

github.com/quic-go/quic-go.(*baseServer).Close(...)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:333

github.com/quic-go/quic-go.(*EarlyListener).Close(0x14000120700?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:165 +0x34

github.com/quic-go/quic-go/http3.(*Server).Close(0x140007344d0)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/http3/server.go:657 +0xe8

github.com/ooni/probe-cli/v3/internal/netemx.(*http3Server).Close(0x140001345a0)
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/http3.go:66 +0xe4

github.com/ooni/probe-cli/v3/internal/netemx.(*QAEnv).Close.func1()
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/qaenv.go:291 +0x4c

[...]

github.com/ooni/probe-cli/v3/internal/netemx.(*QAEnv).Close(0x140001e5570?)
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/qaenv.go:288 +0x48

goroutine 136 [sync.Mutex.Lock]:

[...]

github.com/quic-go/quic-go.(*baseServer).close(0x14000127200?, {0x102faa7a0?, 0x140001920f0?}, 0xd8?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:338 +0x64

github.com/quic-go/quic-go.(*Transport).close(0x1400054a000, {0x102faa7a0, 0x140001920f0})
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:325 +0x110

github.com/quic-go/quic-go.(*Transport).listen(0x1400054a000, {0x102fbd3a8, 0x14000526108})
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:358 +0x364

created by github.com/quic-go/quic-go.(*Transport).init.func1
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:242 +0x45c
```

Looking at the above traces, both end up at server.go:338 locking the
same sync.Once. It seems the structures are such that attempting to
close both the server and the listener leads to self locking in
v0.40.{0,1}. The original expectation was that the server closed the
connection used for listening anyway, and
ooni/probe#2527 documented how that was not
the case. It seems that now this is the case, so we can comment out the
original ooni/probe#2527 fix without any test
hangs. Also, if the original bug was indeed that the server did not own
the listener, and considering that now it seems the server owns the
listener, it makes sense that the fix for v0.40.1 is to revert the
ooni/probe#2527 fix.

See ooni/probe#2556
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
)

This diff continues improving and refactoring netemx with the objective
of unifying how we create all kind of servers.

Here, specifically, we modify the HTTPS server implementing
NetStackServerFactory implemented in the previous commit and obtain an
HTTP3 server honouring NetStackServerFactory.

Crucially, this diff also adds support for overriding the TLS config
passed to the server, which enables us to test for expired certificates,
self-signed certificates, and so forth.

While working on this diff, I noticed a weird behavior with HTTP/3 tests
using the same address, which is documented at
ooni/probe#2527. I modified the tests to make
them pass. To this end, I changed the IP addresses used by HTTP/3 tests
to avoid reusing www.example.com's IP address. It seems fine, for now,
to merge this code, because HTTP/3 is not a cornerstone of how we
measure, for now. But we should investigate further in the future!

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#1803
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
)

This diff converts the ScenarioRoleWebServer case to using
QAEnvOptionNetStack. While there, recognize that
ooni/probe#2527 is really making all QUIC
tests fragile, and scale them down a bit.

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#1803
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
This diff fixes ooni/probe#2527.

The comment I originally wrote in the code to explain the issue read:

> I am wondering whether I am not fully understanding how
quic-go/quic-go works.

It turns out this comment was true. I was assuming http3.Server was
closing the listener, however, that was not the case.

I noticed this issue when reading again netemx/qaenv.go.

So, it all boiled down to PEBCAK 😬.

## Checklist

- [x] I have read the [contribution
guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md)
- [x] reference issue for this pull request:
ooni/probe#2527
- [x] if you changed anything related to how experiments work and you
need to reflect these changes in the ooni/spec repository, please link
to the related ooni/spec pull request: N/A
- [x] if you changed code inside an experiment, make sure you bump its
version number: N/A
Murphy-OrangeMud pushed a commit to Murphy-OrangeMud/probe-cli that referenced this issue Feb 13, 2024
Upgrade to quic-go@v0.40.1 and adapt closing policy after [changes in
quic-go@v0.40.0](https://github.com/quic-go/quic-go/releases/tag/v0.40.0)
and quic-go/quic-go#4072 in particular. We're
creating an UDP conn and [passing it to
server.Serve](https://github.com/ooni/probe-cli/pull/1428/files#diff-efda3daa51e9aed0b3444a327e64b7e5c412938a1fe894a3c850d533179c2425R105),
which [calls
serveConn](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L242),
which [calls
quicListen](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L316)
and then
[ServeListener](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L321).
In turn, ServeListener [is interrupted by
ErrServerClosed](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L268),
which seems to be generated by [server.Close calling Close for each
listener](https://github.com/quic-go/quic-go/blob/v0.40.1/http3/server.go#L657).
The following is what happened before updating the shutdown protocol:

```
goroutine 247 [sync.Mutex.Lock]:

[...]

github.com/quic-go/quic-go.(*Transport).closeServer(0x1400054a000?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:298 +0x90

github.com/quic-go/quic-go.(*baseServer).close.func1()
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:344 +0x84

[...]

github.com/quic-go/quic-go.(*baseServer).close(0x14000cc1c10?, {0x102faa7a0?, 0x140002557b0?}, 0xec?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:338 +0x64

github.com/quic-go/quic-go.(*baseServer).Close(...)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:333

github.com/quic-go/quic-go.(*EarlyListener).Close(0x14000120700?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:165 +0x34

github.com/quic-go/quic-go/http3.(*Server).Close(0x140007344d0)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/http3/server.go:657 +0xe8

github.com/ooni/probe-cli/v3/internal/netemx.(*http3Server).Close(0x140001345a0)
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/http3.go:66 +0xe4

github.com/ooni/probe-cli/v3/internal/netemx.(*QAEnv).Close.func1()
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/qaenv.go:291 +0x4c

[...]

github.com/ooni/probe-cli/v3/internal/netemx.(*QAEnv).Close(0x140001e5570?)
	/Users/sbs/src/github.com/ooni/probe-cli/internal/netemx/qaenv.go:288 +0x48

goroutine 136 [sync.Mutex.Lock]:

[...]

github.com/quic-go/quic-go.(*baseServer).close(0x14000127200?, {0x102faa7a0?, 0x140001920f0?}, 0xd8?)
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/server.go:338 +0x64

github.com/quic-go/quic-go.(*Transport).close(0x1400054a000, {0x102faa7a0, 0x140001920f0})
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:325 +0x110

github.com/quic-go/quic-go.(*Transport).listen(0x1400054a000, {0x102fbd3a8, 0x14000526108})
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:358 +0x364

created by github.com/quic-go/quic-go.(*Transport).init.func1
	/Users/sbs/go/pkg/mod/github.com/quic-go/quic-go@v0.40.0/transport.go:242 +0x45c
```

Looking at the above traces, both end up at server.go:338 locking the
same sync.Once. It seems the structures are such that attempting to
close both the server and the listener leads to self locking in
v0.40.{0,1}. The original expectation was that the server closed the
connection used for listening anyway, and
ooni/probe#2527 documented how that was not
the case. It seems that now this is the case, so we can comment out the
original ooni/probe#2527 fix without any test
hangs. Also, if the original bug was indeed that the server did not own
the listener, and considering that now it seems the server owns the
listener, it makes sense that the fix for v0.40.1 is to revert the
ooni/probe#2527 fix.

See ooni/probe#2556
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ooni/probe-engine
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant