Timeout in Openshift 4 due to autoscaler #903

Closed
braisvq1996 opened this issue May 6, 2022 · 7 comments · Fixed by #904
Labels
bug Something isn't working

Comments

@braisvq1996
Contributor

Describe the bug
When you deploy a container in OCP 4, the current nodes may not be enough to do it, and the cluster autoscaler kicks in and scales up a new node.
The problem is that the current timeout for the deployment stage is set to 5 min, which is not enough to scale up a new node and deploy the container, resulting in a failure in the pipeline.
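
For illustration, the failure mode looks roughly like this: the deployment stage wraps the rollout in a fixed 5-minute timeout, so if the autoscaler needs longer than that to provision a node, the stage aborts before the pod is ever scheduled. A minimal sketch using the standard Jenkins `timeout` step (the stage layout and the `oc` invocation are illustrative, not the actual ODS internals):

```groovy
// Illustrative sketch only -- not the actual ODS deployment stage.
// If the cluster autoscaler needs ~10 min to bring up a node, this
// 5-minute window expires before the rollout finishes, failing the build.
stage('Deploy') {
    timeout(time: 5, unit: 'MINUTES') { // current default for the deployment stage
        // hypothetical rollout check; 'my-component' is a placeholder
        sh 'oc rollout status dc/my-component --watch'
    }
}
```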

To Reproduce
Steps to reproduce the behavior:

  1. Run component pipeline with deployment stage enabled
  2. If there are not enough nodes, check the OpenShift events for the node scale-up being triggered.
  3. See the error in the pipeline

Expected behavior
The pipeline should end successfully.

Affected version:

  • OpenShift: 4.9
  • OpenDevStack 4.x
braisvq1996 added the bug label May 6, 2022
@braisvq1996
Contributor Author

I have seen deployments that triggered a cluster scale-up take around 10 min to complete successfully.
I would increase the default to 15 min to be on the safe side.
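
For anyone hitting this before a new default ships, the timeout can also be raised per component in the Jenkinsfile. A sketch assuming the rollout stage accepts a `deployTimeoutMinutes` option (please verify the option name against the ODS version in use):

```groovy
// Per-component override sketch. Assumption: the rollout stage accepts
// a deployTimeoutMinutes option; check the stage docs for your ODS version.
odsComponentPipeline(
    imageStreamTag: 'ods/jenkins-agent-base:latest', // placeholder agent image
) { context ->
    odsComponentStageBuildOpenShiftImage(context)
    odsComponentStageRolloutOpenShiftDeployment(
        context,
        [deployTimeoutMinutes: 15] // allow time for an autoscaler node scale-up
    )
}
```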

braisvq1996 added this to To Do in ODS Maintenance via automation May 11, 2022
@metmajer
Member

@michaelsauter hi Michael, this has become a very regular issue. While we're looking into other issues, would you and your team be able to take a look at this one?

@michaelsauter
Member

@metmajer What would you expect to be done here? The cluster autoscaler is outside the ODS project, and if the scaling takes a very long time, then increasing the timeout as Brais suggested is the only option, I believe. However, waiting ~10 minutes for a deployment is not great if it is a regular occurrence. Maybe the scaler should not kick in so regularly?

@braisvq1996
Contributor Author

braisvq1996 commented May 11, 2022

It does not happen regularly, but it cannot be foreseen by the users; at some point they may face it, or they may not.
I suggest increasing the timeout to avoid this -> here.

@metmajer
Member

@michaelsauter the idea was to get your team's feedback and ensure we're not turning the wrong knobs.

@michaelsauter
Member

@metmajer Got it. As I said, the suggestion makes sense to me.

@metmajer
Member

@braisvq1996 then please provide a PR. We want to push this change next week together with #899 and #900.

metmajer moved this from To Do to In Progress in ODS Maintenance May 13, 2022
ODS Maintenance automation moved this from In Progress to Done May 13, 2022