Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify scaling behaviour for selenium grid to remove effective cap on node creation #4858

Closed
Noel-Jones opened this issue Aug 3, 2023 · 7 comments · Fixed by #4922
Closed
Labels
feature-request All issues for new features that have not been committed to needs-discussion

Comments

@Noel-Jones
Copy link

Noel-Jones commented Aug 3, 2023

Proposal

The proposal is that the metric for scaling should be calculated as

 (**browserQueueSize + browserSessionCount) / browserNodeCount**

Use-Case

In our testing we often see something like 30 nodes, 30 sessions and a queue of 18. While this reasonable, having what amounts to a queue of less than one 1 per session, it limits us because our tests take a while to execute and as one session completes another is queued. In this case we have 48 tasks creating sessions so we are likely to consistently have 48 sessions at any one time. Once the initial 30 nodes are created no further nodes tend to be created.

Looking at the logic (pkg/scalers/selenium_grid_scaler.go) there are two return cases, but lets start with a definition that is not clearly stated in the documentation:

  • maxSession is the number of sessions currently running as returned by Selenium (may be less than nodeCount)
  • nodeCount is the number of nodes currently registered

It is worth noting that this is across the grid and not split by browser. Let's come back to that. So, if either of the above are zero then the metric for scaling is the size of the queue (providing it is greater than the activation count). This is effectively the starting point. So if I am ramping up to 48 sessions the query might take place when only 30 have been queued. Thus the scale target will be 30. There are of course other subtleties with the HPA but we will keep it relatively simple for now and assume that we now have 30 nodes and 30 running sessions.

Soon we will have 18 on the queue and can expect this to be the case for some time. The metric for scaling is now calculated as
queueSize / maxSessions / nodeCount (rounded up). 18 / 30 / 30 => 1 means that the grid will not be scaled.

With the above caclulation and the same numbers this would be (18+30) / 30 => 2 which I think should cause further scaling according to the HPA configuation. I don't think it should be rounded so that it also scales down as the queue size reduces.

As noted above, the queueSize is specific to the browser while maxSessions and nodeCount are not. So in a multi browser scenario with the current calculation, if we start a set of chrome sessions and nodes are started, then when we start a set of firefox nodes the firefox nodes will not be started until all of the chrome sessions have finished. Thinking about it this would class as a bug, though I have not actually tested the scenario.

Is this a feature you are interested in implementing yourself?

Maybe

Anything else?

Hopefully someone can sanity check my logic and understanding. Probably also worth referencing issue #3453 which might be alleviated by a change to the calculation.

@Noel-Jones Noel-Jones added feature-request All issues for new features that have not been committed to needs-discussion labels Aug 3, 2023
@Wolfe1
Copy link
Contributor

Wolfe1 commented Aug 29, 2023

Running into this same issue now that the previous version I was on (2.8.0) seems to have stopped working. In 2.8.0 the scaling worked as expected. If I had 30 sessions running on 30 nodes (with 1 max session per node) and 18 in the queue, then we would scale to 48 (provided we do not go over our max count from KEDA).

Fast-forward to current and the behavior I am seeing is that we will not scale unless the queue size is larger than the node count. So 30 sessions on 30 nodes with 18 in the queue will not scale up but will actually scale down to 18 (queue size). It seems as though the running sessions are being disregarded in the count entirely.

I haven't looked at this code in awhile, having trouble figuring out what code change since 2.8.0 is causing this issue.

@Noel-Jones :
"The metric for scaling is now calculated as
queueSize / maxSessions / nodeCount (rounded up). 18 / 30 / 30 => 1 means that the grid will not be scaled."

So this is not entirely true, its calculated as:
floatCount = float64(count) / (float64(gridMaxSession) / float64(gridNodeCount))
With count being all queued and active sessions for that specific browser type and version. So in the case of your example it should be

48 / (30 / 30) = 48

Which should then tell KEDA that the scale should be 48....but that's not happening.

Based on what I am seeing my assumption is that some change in either the KEDA core code, selenium grid API, or KEDA Scaler code has caused the active sessions to not be counted. Still investigating...

@Noel-Jones
Copy link
Author

Thanks @Wolfe1. Annoying that I must have missed the brackets but good to have confirmation that something does not appear to be right. I am also wondering how I concluded that it is calculated as queue size (without active sessions). The code certainly looks to account for both. It make me wonder what I was looking at!! :-( When I get a chance I will fire up some more tests and try to work through it again.

@Noel-Jones
Copy link
Author

@Wolfe1 I think I can explain what you are seeing, regarding it scaling down to the queue size.

This is what is returned from Selenium 4.10

{
  "data": {
    "grid": {
      "maxSession": 30,
      "nodeCount": 30
    },
    "sessionsInfo": {
      "sessionQueueRequests": [
        "{\n  \"SameSite:None\": {\n    \"Secure\": true\n  },\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"chrome\",\n  \"goog:chromeOptions\": {\n    \"args\": [\n      \"--ignore-certificate-errors\",\n      \"--start-maximized\",\n      \"--no-sandbox\",\n      \"--disable-gpu\",\n      \"--silent\",\n      \"--disable-popup-blocking\",\n      \"window-size=1920,1080\",\n      \"--headless\"\n    ],\n    \"extensions\": [\n    ]\n  },\n  \"platformName\": \"linux\"\n}",
...
      ],
      "sessions": [
        {
          "id": "d2aa61ace92a8f5eb9da2f40e4b7f748",
          "capabilities": "{\n  \"SameSite:None\": {\n    \"Secure\": true\n  },\n  \"acceptInsecureCerts\": true,\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"114.0.5735.106\",\n  \"chrome\": {\n    \"chromedriverVersion\": \"114.0.5735.90 (386bc09e8f4f2e025eddae123f36f6263096ae49-refs\\u002fbranch-heads\\u002f5735@{#1052})\",\n    \"userDataDir\": \"\\u002ftmp\\u002f.com.google.Chrome.OK2hfv\"\n  },\n  \"goog:chromeOptions\": {\n    \"debuggerAddress\": \"localhost:41167\"\n  },\n  \"networkConnectionEnabled\": false,\n  \"pageLoadStrategy\": \"normal\",\n  \"platformName\": \"linux\",\n  \"proxy\": {\n  },\n  \"se:bidiEnabled\": false,\n  \"se:cdp\": \"ws:\\u002f\\u002fselenium-hub.selenium4:4444\\u002fsession\\u002fd2aa61ace92a8f5eb9da2f40e4b7f748\\u002fse\\u002fcdp\",\n  \"se:cdpVersion\": \"114.0.5735.106\",\n  \"se:vnc\": \"ws:\\u002f\\u002fselenium-hub.selenium4:4444\\u002fsession\\u002fd2aa61ace92a8f5eb9da2f40e4b7f748\\u002fse\\u002fvnc\",\n  \"se:vncEnabled\": true,\n  \"se:vncLocalAddress\": \"ws:\\u002f\\u002f10.177.4.235:7900\",\n  \"setWindowRect\": true,\n  \"strictFileInteractability\": false,\n  \"timeouts\": {\n    \"implicit\": 0,\n    \"pageLoad\": 300000,\n    \"script\": 30000\n  },\n  \"unhandledPromptBehavior\": \"dismiss and notify\",\n  \"webauthn:extension:credBlob\": true,\n  \"webauthn:extension:largeBlob\": true,\n  \"webauthn:extension:minPinLength\": true,\n  \"webauthn:extension:prf\": true,\n  \"webauthn:virtualAuthenticators\": true\n}",
          "nodeId": "5c869b42-421b-4cd7-8608-db4e577896d7"
        },
 

Leaving out the platform name the count of queue and session entries is the same (keda release 2.11):

capability.BrowserVersion starts with the desired browser version

OR (ELSE)

capability.BrowserVersion is not set AND desired browser version is "latest"

For the queue the capability.BrowserVersion is not set. Assume the desired browser version is "latest" then it will be counted.

For the session the capability.BrowserVersion is set. Assuming the desired version is "latest" then it cannot match.

This was changed post 2.9.3 by #4348

@Wolfe1
Copy link
Contributor

Wolfe1 commented Aug 30, 2023

@Noel-Jones Forgot to update before I logged out for the day yesterday. I came to a similar conclusion as I use “latest” in my work.

I have a possible solution that is working in Unit tests but I was having trouble building the images yesterday afternoon to test it in my cluster. Will try to get it working today and hopefully have a pull request.

@Noel-Jones
Copy link
Author

No worries. I think part of the problem is using the term "latest". I assume that Selenium will match a queued request with no version against any session with that browser. Unless it choses the latest version it knows about? If the latter then we would need to know what the latest version is. In either case maybe the solution to 4348 should have been to say do not mix "latest" and an actual version in trigger specs - if your grid supports multiple versions of a browser then the version should always be specified in the tests and in the trigger specs. Anything else will lead to inconsistencies. Interested to see what your potential solution is.

@Wolfe1
Copy link
Contributor

Wolfe1 commented Aug 30, 2023

You are correct. I believe its the former, incoming requests without a version callout will be considered "latest" and the selenium grid doesn't have any idea on what "latest" is per browser.

I agree that if we are specifying any versions then you won't really want to use latest. My potential solution is nothing special though, if we are using "latest" then I ignore the browserVersion so that it gets added to the count. This doesn't break any existing tests so it shouldn't break the use case it was built for. Adding a couple tests to cover the default use case as well.

Testing it now on our cluster now, a lot of the morning tests runs are currently going so I should get some good observations of how it scales now.

@Noel-Jones
Copy link
Author

Thanks Wolfe1 for taking on the fix. Much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request All issues for new features that have not been committed to needs-discussion
Projects
Status: Ready To Ship
Development

Successfully merging a pull request may close this issue.

2 participants