Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock: Multi-VU many iterations #971

Closed
7 tasks done
Tracked by #1138
ankur22 opened this issue Jul 13, 2023 · 9 comments
Closed
7 tasks done
Tracked by #1138

Deadlock: Multi-VU many iterations #971

ankur22 opened this issue Jul 13, 2023 · 9 comments
Assignees
Labels
bug Something isn't working remote remote browser related team/k6browser To distinguish the issue on project boards.

Comments

@ankur22
Copy link
Collaborator

ankur22 commented Jul 13, 2023

While running the following test script:

Test script
import { check } from 'k6';
import { browser } from 'k6/experimental/browser';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      vus: 5,
      iterations: 1000,
      maxDuration: '40m',
      options: {
        browser: {
          type: 'chromium',
        },
      },
    },
  },
  thresholds: {
    checks: ["rate==1.0"]
  }
}

export default async function () {
  const context = browser.newContext();
  const page = context.newPage();

  try {
    // Goto front page, find login link and click it
    await page.goto('https://test.k6.io/', { waitUntil: 'networkidle' });
    await Promise.all([
      page.waitForNavigation(),
      page.locator('a[href="/my_messages.php"]').click(),
    ]);
    // Enter login credentials and login
    page.locator('input[name="login"]').type('admin');
    page.locator('input[name="password"]').type('123');
    // We expect the form submission to trigger a navigation, so to prevent a
    // race condition, setup a waiter concurrently while waiting for the click
    // to resolve.
    await Promise.all([
      page.waitForNavigation(),
      page.locator('input[type="submit"]').click(),
    ]);
    check(page, {
      'header': page.locator('h2').textContent() == 'Welcome, admin!',
    });
  } finally {
    page.close();
  }
}

The test has to be aborted since it goes over the 40m limit. Investigate why this happens and resolve it.

Image

This is the screenshot of the dashboard of the test run. It would seem that all but one of the VUs completed all the allotted iterations. One vu was stuck in a deadlock, which meant the test run was waiting for it to complete the iteration, which it never did, that eventually resulted in an abort.


Tasks

  1. browser context bug internal optimization performance remote
    inancgumus
@ankur22 ankur22 added bug Something isn't working remote remote browser related labels Jul 13, 2023
@ankur22 ankur22 added this to the v0.11.0 milestone Jul 13, 2023
@ankur22 ankur22 self-assigned this Jul 13, 2023
@ankur22
Copy link
Collaborator Author

ankur22 commented Jul 13, 2023

Investigation

After profiling we found that the issue was related to evaluating the page hide when page.close is called. Looking at the stack trace of the live goroutines, it became evident that it wasn't a traditional deadlock, but rather a stuck asynchronous process waiting for a response from the chrome browser.

My theory was that at a certain point when chrome is under a lot of load, it stops responding to CDP requests and that is why we get timeouts in navigations and new page creations. I tested this theory by setting the timeout to one that is longer than the max test duration. Unfortunately, chrome does respond to these events but it just takes longer for it to do so.

As of now, we're unable to to identify the root cause, and we can only guess that:

  1. We're not receiving the response from chrome for an unknown reason;
  2. In some cases a response it's valid for chrome not to respond to evaluate CDP requests (especially ones where there is no expected return value);
  3. We receive the response but it slips through the filters possibly due to it being malformed or structured differently to what we expect.

The current workaround for this "deadlock" issue is to have a timeout, but it doesn't help solve the root problem.

Unrelated find #1

Unrelated to this current issue, we also found that logging the goroutine ids were taking a lot of time (pprof), and so we removed this step completely in #977.

Unrelated find #2

While working on this theory, I was also able to determine that the compute resources given to the remote chrome instance affects the test (see full investigation results here). We need to ensure that the compute resources for all components are appropriately set so that non of them over utilise the compute resources, which will affect the outcome and make results inconclusive.

@ankur22 ankur22 changed the title Tests are always aborted due to time out set by scenario VU deadlocks when running against a remote chrome instance Jul 13, 2023
ankur22 added a commit that referenced this issue Jul 25, 2023
This new EvaluateWithContext will allow us to pass in a context with a
timeout. For issue #971, we're seeing that often the evaluate to hide
the page on page.close ends up in a deadlock (waiting for the response
from the chrome browser after executing the CDP evaluate command). We
are not sure of the root cause of the issue, but this fix will help
prevent this deadlock allow the test run to complete in a timely
manner without it causing a time out error.
@ankur22
Copy link
Collaborator Author

ankur22 commented Jul 25, 2023

A temporary fix can be found in #979.

@inancgumus
Copy link
Member

inancgumus commented Jul 26, 2023

Could there be a race in handling iteration events? 🤔

From the stack trace, it seems like both page.Close and closing/disconnecting from of the browser happens concurrently.

goroutine 56 [select]:
github.com/grafana/xk6-browser/common.(*Browser).Close(0xc00067bc80)
	github.com/grafana/xk6-browser@v1.0.0-rc0.0.20230711104523-a8ebe8cc24f6/common/browser.go:464 +0x313
github.com/grafana/xk6-browser/browser.(*browserRegistry).deleteBrowser(0xc001cb2c80, 0xc001cb2840?)
	github.com/grafana/xk6-browser@v1.0.0-rc0.0.20230711104523-a8ebe8cc24f6/browser/registry.go:310 +0x9c
github.com/grafana/xk6-browser/browser.(*browserRegistry).handleIterEvents(0xc001cb2c80, 0x1a0fea0?, 0xc001480d80)
	github.com/grafana/xk6-browser@v1.0.0-rc0.0.20230711104523-a8ebe8cc24f6/browser/registry.go:269 +0x1c5
created by github.com/grafana/xk6-browser/browser.newBrowserRegistry
	github.com/grafana/xk6-browser@v1.0.0-rc0.0.20230711104523-a8ebe8cc24f6/browser/registry.go:219 +0x311

I haven't looked into this deeply, just guessing. So please take it with a grain of salt.

cc: @ankur22 @ka3de

@ankur22
Copy link
Collaborator Author

ankur22 commented Jul 26, 2023

From the stack trace, it seems like both page.Close and closing of the browser happens concurrently.

When running this test, it was running 5 VUs, so it's likely an iteration finishing on one VU, and a page.close occurring on a different VU.

@inancgumus
Copy link
Member

inancgumus commented Jul 26, 2023

Right. But, in the meantime, might we be closing/(update: or disconnecting from) the browser and this might be racing with page.Close? It's surprising to me that it occurs only on pagehide. I think it should happen elsewhere too. Why does it only happen there? :)

@ankur22
Copy link
Collaborator Author

ankur22 commented Jul 26, 2023

Right. But, in the meantime, might we be closing the browser and this might be racing with page.Close? It's weird that it occurs only on pagehide. It should happen elsewhere too. Why only there? :)

I don't have the answer for why it's occurring in this scenario, I was hoping that it was due to the overloaded chrome instance, however other APIs (such as page.goto) they eventually receive a response when the default timeout is set very high, which would mean that chrome does indeed respond to CDP requests when it is under heavy load, and doesn't drop messages.

When the evaluate on pagehide event blocks waiting on the response from chrome, that VU is stuck indefinitely, and it's stuck on the "main thread" of that VU, putting a pause on the whole underlying JS event loop. At this point even if k6 were to signal an iterEnd, the stuck VU wouldn't be able to handle it. So I can't see how there could be a race condition between the browser closing and page closing 🤔

If the browser for that stuck VU were to close, wouldn't we get an error suggesting that the remote host closed the connection?

Actually, the browser shouldn't be closed since this was running against a remote browser 🤔 Actually, it's expected. It won't actually close the remote chrome instance.

ankur22 added a commit that referenced this issue Jul 26, 2023
This new EvaluateWithContext will allow us to pass in a context with a
timeout. For issue #971, we're seeing that often the evaluate to hide
the page on page.close ends up in a deadlock (waiting for the response
from the chrome browser after executing the CDP evaluate command). We
are not sure of the root cause of the issue, but this fix will help
prevent this deadlock allow the test run to complete in a timely
manner without it causing a time out error.
@inancgumus
Copy link
Member

It seems we need to look deeper into this issue since you have very good points and I don't have the best answer at this moment :(

@ankur22 ankur22 removed their assignment Jul 26, 2023
@ankur22
Copy link
Collaborator Author

ankur22 commented Jul 26, 2023

After working with this fix against a remote chrome instance, we've now been able to reproduce the issue with page.goto as well as with the evaluate on page.close. We did this by setting the page's defaultTimeout to a very high value (30mins), which confirmed the VU deadlocks waiting for a response from chrome. When the timeouts are left at 30secs, the timeouts occur as expected and we do not see deadlocks. As far as I can tell this still leaves us with the these possibilities for the root cause of the problem:

  1. Chrome isn't returning a response for some of these CDP requests.
  2. Chrome is returning a malformed request which we're not propagating to all listening handlers. I've looked at the code and it seems to be set up correctly here to catch malformed responses and seems to forward all other msgs correctly to all listening handlers.
  3. We have a deadlock somewhere in xk6-browser.

One more point to make is that the remote chrome instances were not under a lot of load.

@ankur22 ankur22 self-assigned this Jul 26, 2023
@ankur22 ankur22 removed their assignment Aug 10, 2023
@inancgumus inancgumus self-assigned this Oct 13, 2023
@inancgumus inancgumus changed the title VU deadlocks when running against a remote chrome instance Deadlock: Navigation sometimes stops Oct 23, 2023
@inancgumus inancgumus removed this from the v1.0.0 milestone Nov 1, 2023
@inancgumus inancgumus changed the title Deadlock: Navigation sometimes stops Deadlock: Multi-VU many iterations Nov 10, 2023
@inancgumus inancgumus removed their assignment Dec 4, 2023
@inancgumus inancgumus self-assigned this Dec 13, 2023
@inancgumus inancgumus added the team/k6browser To distinguish the issue on project boards. label Dec 20, 2023
@inancgumus
Copy link
Member

inancgumus commented Feb 22, 2024

Fixed by #1219.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working remote remote browser related team/k6browser To distinguish the issue on project boards.
Projects
None yet
Development

No branches or pull requests

2 participants