Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestMultiConnectToSingleBrowser is flaky #861

Closed
inancgumus opened this issue Apr 26, 2023 · 6 comments
Closed

TestMultiConnectToSingleBrowser is flaky #861

inancgumus opened this issue Apr 26, 2023 · 6 comments
Assignees
Labels
bug Something isn't working next Might be eligible for the next planning (not guaranteed!) remote remote browser related tests

Comments

@inancgumus
Copy link
Member

inancgumus commented Apr 26, 2023

Investigate why the test fails to create the second page.

time="2023-04-26T06:02:15Z" level=error msg="communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF" category=cdp elapsed="0 ms" goroutine=983
time="2023-04-26T06:02:15Z" level=error msg="communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF" category=cdp elapsed="0 ms" goroutine=1165

--- FAIL: TestMultiConnectToSingleBrowser (0.19s)
    browser_test.go:249: 
        	Error Trace:	/home/runner/work/xk6-browser/xk6-browser/tests/browser_test.go:249
        	Error:      	Received unexpected error:
        	            	creating new page in browser context: creating a new blank page: Session with given id not found. (-32001)
        	Test:       	TestMultiConnectToSingleBrowser
        	Messages:   	failed to create page #2

time="2023-04-26T06:02:15Z" level=error msg="communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF" category=cdp elapsed="0 ms" goroutine=1299
time="2023-04-26T06:02:30Z" level=error msg="Failed to load resource: the server responded with a status of 401 (Unauthorized)" browser_source=network line_number=0 source=browser stacktrace="<nil>" url="http://127.0.0.1:34171/basic-auth/validuser/validpass"
time="2023-04-26T06:02:30Z" level=warning msg="url:http://127.0.0.1:34171/basic-auth/validuser/validpass method:GET err:fetching response body: context canceled" category="Response:bodySize:fetchBody" elapsed="0 ms" goroutine=8069
time="2023-04-26T06:02:30Z" level=warning msg="url:http://127.0.0.1:35603/static/empty.html method:GET err:fetching response body: context canceled" category="Response:bodySize:fetchBody" elapsed="0 ms" goroutine=8365
time="2023-04-26T06:02:32Z" level=warning msg="The specified value \"invalid date\" does not conform to the required format, \"yyyy-MM-dd\"." browser_source=rendering line_number=457 source=browser url=__xk6_browser_evaluation_script__
time="2023-04-26T06:02:34Z" level=error msg="communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF" category=cdp elapsed="0 ms" goroutine=10587
time="2023-04-26T06:02:34Z" level=error msg="process with PID 9495 unexpectedly ended: signal: killed" category=browser elapsed="18 ms" goroutine=10605
time="2023-04-26T06:02:36Z" level=error msg="process with PID 9918 unexpectedly ended: exit status 1" category=browser elapsed="0 ms" goroutine=11781
time="2023-04-26T06:02:37Z" level=error msg="communicating with browser: websocket: close 1006 (abnormal closure): unexpected EOF" category=cdp elapsed="0 ms" goroutine=12075
time="2023-04-26T06:02:37Z" level=error msg="process with PID 10081 unexpectedly ended: signal: killed" category=browser elapsed="3 ms" goroutine=1[197](https://github.com/grafana/xk6-browser/actions/runs/4805402538/jobs/8551776054#step:4:198)4
time="[202](https://github.com/grafana/xk6-browser/actions/runs/4805402538/jobs/8551776054#step:4:203)3-04-26T06:02:38Z" level=info msg="\"calling setTimeout...\"" source=browser-console-api
time="2023-04-26T06:02:39Z" level=info msg="\"creating element...\"" source=browser-console-api
time="2023-04-26T06:02:42Z" level=info msg="\"hello\"" source=browser-console-api

There is a data race:

==================
WARNING: DATA RACE
Write at 0x00c004f7e400 by goroutine 163:
  github.com/grafana/xk6-browser/common.(*Browser).NewContext()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:515 +0xa76
  github.com/grafana/xk6-browser/tests.TestMultiConnectToSingleBrowser()
      /home/runner/work/xk6-browser/xk6-browser/tests/browser_test.go:334 +0x2fe
  testing.tRunner()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:14[93](https://github.com/grafana/xk6-browser/actions/runs/6038330531/job/16384504375#step:5:94) +0x47

Previous read at 0x00c004f7e400 by goroutine 8963:
  github.com/grafana/xk6-browser/common.(*Browser).getDefaultBrowserContextOrMatchedID()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:142 +0x236
  github.com/grafana/xk6-browser/common.(*Browser).onAttachedToTarget()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:220 +0x1f0
  github.com/grafana/xk6-browser/common.(*Browser).initEvents.func1()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:185 +0x424

Goroutine 163 (running) created at:
  testing.(*T).Run()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1493 +0x75d
  testing.runTests.func1()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1846 +0x99
  testing.tRunner()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
  testing.runTests()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1844 +0x7ec
  testing.(*M).Run()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1726 +0xa84
  main.main()
      _testmain.go:219 +0x2e9

Goroutine 8963 (running) created at:
  github.com/grafana/xk6-browser/common.(*Browser).initEvents()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:170 +0x345
  github.com/grafana/xk6-browser/common.(*Browser).connect()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:123 +0x7e4
  github.com/grafana/xk6-browser/common.NewBrowser()
      /home/runner/work/xk6-browser/xk6-browser/common/browser.go:81 +0x74
  github.com/grafana/xk6-browser/chromium.(*BrowserType).connect()
      /home/runner/work/xk6-browser/xk6-browser/chromium/browser_type.go:131 +0x1d8
  github.com/grafana/xk6-browser/chromium.(*BrowserType).Connect()
      /home/runner/work/xk6-browser/xk6-browser/chromium/browser_type.go:107 +0x15c
  github.com/grafana/xk6-browser/tests.TestMultiConnectToSingleBrowser()
      /home/runner/work/xk6-browser/xk6-browser/tests/browser_test.go:332 +0x2b6
  testing.tRunner()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /opt/hostedtoolcache/go/1.19.12/x64/src/testing/testing.go:1493 +0x47
==================
@inancgumus inancgumus added bug Something isn't working tests labels Apr 26, 2023
@ankur22
Copy link
Collaborator

ankur22 commented Jun 13, 2023

While working on a PR, I found that this test was flakey. Initially i thought it was due to the changes in the PR, but it turns out that this is a known flakey test. I found that not closing the bctx1 caused it to fail more often:

diff --git a/tests/browser_test.go b/tests/browser_test.go
index 9ec03f8..400f7f6 100644
--- a/tests/browser_test.go
+++ b/tests/browser_test.go
@@ -251,7 +251,7 @@ func TestMultiConnectToSingleBrowser(t *testing.T) {
 
        err = p1.Close(nil)
        require.NoError(t, err, "failed to close page #1")
-       bctx1.Close()
+       // bctx1.Close()
 
        p2, err := bctx2.NewPage()
        require.NoError(t, err, "failed to create page #2")

@inancgumus did some debugging and found that adding a sleep helped prevent the issue, so it's likely to be a race condition:

"This works all the time (-count=100). This might mean that there is a race condition issue while creating pages."

// bctx1.Close()
time.Sleep(time.Second)    // <---
p2, err := bctx2.NewPage()

"I made more checks. It's probably because of the browser or we killed the connection before attaching the blank page. And we're late updating/removing our internal contexts."

making new context BB3319CCBC996F491E7F5E51F35D31E2
<- {"id":2,"result":{"windowId":1,"bounds":{"left":0,"top":397,"width":1280,"height":720,"windowState":"normal"}},"sessionId":"CF427B2EAF26043A97E976F6A0401CA7"}
<- {"id":23,"result":{},"sessionId":"09B020BCC266AC93E95C31D1FBAFA760"}
<- {"id":3,"result":{},"sessionId":"CF427B2EAF26043A97E976F6A0401CA7"}
<- {"method":"Target.detachedFromTarget","params":{"sessionId":"8EB506E362714513E689D093C0559848","targetId":"75D0B2C31C28330092C746AB234788AF"}}
<- {"method":"Target.detachedFromTarget","params":{"sessionId":"CF427B2EAF26043A97E976F6A0401CA7","targetId":"75D0B2C31C28330092C746AB234788AF"}}
<- {"method":"Target.detachedFromTarget","params":{"sessionId":"09B020BCC266AC93E95C31D1FBAFA760","targetId":"75D0B2C31C28330092C746AB234788AF"}}
newPageInContext BB3319CCBC996F491E7F5E51F35D31E2
<- {"id":4,"error":{"code":-32001,"message":"Session with given id not found."}}
  1. Session CF427B2EAF26043A97E976F6A0401CA7 gets closed before attaching to the page.
  2. Browser:initEvents:onDetachedFromTarget happens before newPageInContext.

"I guess the solution is we may want to include some mechanism to ensure the session with the target has not been detached before sending the get frame tree command (or others). Or, alternatively, we could handle the error and create a new session if necessary."

@inancgumus
Copy link
Member Author

A workaround is here in case we need it: #930

@ankur22 ankur22 self-assigned this Jun 30, 2023
@ankur22
Copy link
Collaborator

ankur22 commented Jul 11, 2023

Update

I have an inkling that the issue resides in chrome.

The error we get is from chrome which reads "Session with given id not found.". When creating a new page we work with the Target.createTarget CDP command, which only takes an optional browserContext id, and there is no reference of a session anywhere.

Sessions are created by chrome on Target.createTarget. The browser module doesn't execute any CDP commands to manage the sessions. Chrome asynchronously informs the browser module when to create or delete a session, so the browser module is beholden to chrome. If a sessions doesn't exist, it seems as though chrome has the race issue whereby it's unable to create the required session before it's needed.

It's also plausible that the page.close action causes side affects on all open browserContexts, but again, that's not something we have any control over.

Finally, maybe we have missed an important step when we're opening multiple websocket connections against chrome, maybe a flag of some kind. Having looked at puppeteer and playwright, I'm unable to see an obvious setup that would help prevent such an issue. In fact it seems that both playwright hasn't solved it and puppeteer have encountered similar problems with unclear solutions.

I've opened an issue with the chromium team to see if they can help us out: https://bugs.chromium.org/p/chromium/issues/detail?id=1463459.

Possible workaround

When the browser module sees such an error, it retries up to n times before giving up.

diff --git a/common/browser_context.go b/common/browser_context.go
index 4322d5a..f09c39b 100644
--- a/common/browser_context.go
+++ b/common/browser_context.go
@@ -4,6 +4,7 @@ import (
        "context"
        "fmt"
        "reflect"
+       "strings"
        "time"
 
        "github.com/grafana/xk6-browser/api"
@@ -247,9 +248,20 @@ func (b *BrowserContext) NewCDPSession() api.CDPSession {
 func (b *BrowserContext) NewPage() (api.Page, error) {
        b.logger.Infof("BrowserContext:NewPage", "bctxid:%v", b.id)
 
-       p, err := b.browser.newPageInContext(b.id)
+       var p *Page
+       var err error
+       p, err = b.browser.newPageInContext(b.id)
        if err != nil {
-               return nil, fmt.Errorf("creating new page in browser context %s: %w", b.id, err)
+               if !strings.Contains(err.Error(), "Session with given id not found") {
+                       return nil, fmt.Errorf("creating new page in browser context %s: %w", b.id, err)
+               }
+
+               b.logger.Infof("BrowserContext:NewPage", "retrying page creation bctxid:%v", b.id)
+
+               p, err = b.browser.newPageInContext(b.id)
+               if err != nil {
+                       return nil, fmt.Errorf("creating new page in browser context %s: %w", b.id, err)
+               }
        }
 
        var (

Branch

https://github.com/grafana/xk6-browser/tree/investigate/TestMultiConnectToSingleBrowser

Error in chromium
https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:content/browser/devtools/devtools_session.cc;l=322;drc=9d4eb7ed25296abba8fd525a6bdd0fdbf4bcdd9f;bpv=0;bpt=0

@ankur22 ankur22 added the remote remote browser related label Jul 12, 2023
@ankur22
Copy link
Collaborator

ankur22 commented Jul 27, 2023

After receiving a response from the chromium team with the issue I raised a few weeks ago, they have come back with a response.

TL;DR

The Session with given id not found. (-32001) error is in fact for another request, an earlier request from a session which just happens to have the same message id as the request to createTarget. It seems that some error responses (at least this one) do not contain the sessionId that was in the original request:

msg='-> {"id":4,"sessionId":"169957DD1CBB5704DBC6798D891A6099","method":"Page.getFrameTree"}'
msg='<- {"method":"Target.detachedFromTarget","params":{"sessionId":"A10EDFD7D1B1CD6CFC1C8BBFCCFA2160","targetId":"6A497BD7D0C068592869FEEDE3235124"}}'
msg='<- {"method":"Target.detachedFromTarget","params":{"sessionId":"169957DD1CBB5704DBC6798D891A6099","targetId":"6A497BD7D0C068592869FEEDE3235124"}}'
msg='<- {"method":"Target.detachedFromTarget","params":{"sessionId":"60BE8201E7C7B6136367BB624399D03E","targetId":"6A497BD7D0C068592869FEEDE3235124"}}'
msg='-> {"id":4,"method":"Target.createTarget","params":{"url":"about:blank","browserContextId":"A42B57D4FDDF0A5E3928DC583BAD8A9E"}}'
msg='<- {"id":4,"error":{"code":-32001,"message":"Session with given id not found."}}'

There's the request for Page.getFrameTree, and we have a request for Target.createTarget, which both have the same id. The error response is actually for the Page.getFrameTree request, but we redirect it to any handlers listening on the connection event loop (which is what Target.createTarget was sent on) because the response doesn't contain the sessionId.

The next steps are:

  1. Use unique Ids.
  2. The chromium team suggest that it's ok to ignore "code":-32001 errors, so we should do that.
  3. When we receive an error message without a sessionId, it feels like we should be sending that response to all open sessions and the connection event loop too. The reason behind this is explained below:

At the moment when we make a CDP request (i'm going to work with NewPage as the example here):

  1. We make the request
  2. Since the request was made on the connection (b.conn), the Execute in connection is called.
  3. We create an event handler to listen for all CDP messages that are received. Any CDP requests with a sessionId will not be redirected to this handler. It blocks until a CDP message with the matching msgId is received.
  4. We send the request to a channel, which is read and sent to chrome over the websocket.
  5. In send, we also wait for the response which is proxied from the handler created in step 3, this time we're waiting for the one matching response.
  6. The recvLoop will receive all responses from the websocket and redirect them to the session or the connection event loop (1 or 2).
  7. When the handler from step 3 receives the correct response, it passes it to the select statement that was created in step 5. This will parse the message and return back upstream all the way to the initial caller that initiated the CDP request.

Now, when we get Session with given id not found. (-32001) from chrome, it doesn't contain the sessionId , so currently we redirect it to the connection eventloop. What we should be doing is redirecting it to the handler that is blocked and waiting for a response, otherwise it might be waiting indefinitely. The simplest way to achieve this would be to send all errors without a sessionId to all the sessions and the connection eventloop, that way it should unblock the waiting handler.

As a side, I'm also hoping that this will help resolve the deadlock issue since it's a deadlock in a handler waiting for a response from chrome which is never received, so is that due to an error that is being redirected to the wrong place?

@ankur22
Copy link
Collaborator

ankur22 commented Aug 8, 2023

Resolved in #984.

@inancgumus inancgumus added the next Might be eligible for the next planning (not guaranteed!) label Nov 30, 2023
@inancgumus
Copy link
Member Author

Closed by #1219.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working next Might be eligible for the next planning (not guaranteed!) remote remote browser related tests
Projects
None yet
Development

No branches or pull requests

2 participants