Timeout in Openshift 4 due to autoscaler #903

Closed
braisvq1996 opened this issue May 6, 2022 · 7 comments · Fixed by #904
Labels
bug Something isn't working

Comments

@braisvq1996
Contributor

Describe the bug
When you deploy a container in OCP 4, the current nodes may not be enough to do it, and the cluster autoscaler kicks in and scales up a new node.
The problem is that the current timeout for the deployment stage is set to 5 min, which is not enough to scale up a new node and deploy the container, resulting in a failure in the pipeline.
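
For illustration, the failure mode looks roughly like this: the deployment stage wraps the rollout in a fixed 5-minute timeout, so if the autoscaler needs longer than that to provision a node, the stage aborts before the pod is ever scheduled. A minimal sketch using the standard Jenkins `timeout` step (the stage layout and the `oc` invocation are illustrative, not the actual ODS internals):

```groovy
// Illustrative sketch only -- not the actual ODS deployment stage.
// If the cluster autoscaler needs ~10 min to bring up a node, this
// 5-minute window expires before the rollout finishes, failing the build.
stage('Deploy') {
    timeout(time: 5, unit: 'MINUTES') { // current default for the deployment stage
        // hypothetical rollout check; 'my-component' is a placeholder
        sh 'oc rollout status dc/my-component --watch'
    }
}
```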

To Reproduce
Steps to reproduce the behavior:

  1. Run component pipeline with deployment stage enabled
  2. If there are not enough nodes, check the OpenShift events for the node scale-up being triggered.
  3. See the error in the pipeline

Expected behavior
The pipeline should end successfully.

Affected version:

  • OpenShift: 4.9
  • OpenDevStack 4.x
braisvq1996 added the bug label May 6, 2022
@braisvq1996
Contributor Author

I have seen deployments that triggered a cluster scale-up take around 10 min to complete successfully.
I would increase the default to 15 min to be on the safe side.
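
For anyone hitting this before a new default ships, the timeout can also be raised per component in the Jenkinsfile. A sketch assuming the rollout stage accepts a `deployTimeoutMinutes` option (please verify the option name against the ODS version in use):

```groovy
// Per-component override sketch. Assumption: the rollout stage accepts
// a deployTimeoutMinutes option; check the stage docs for your ODS version.
odsComponentPipeline(
    imageStreamTag: 'ods/jenkins-agent-base:latest', // placeholder agent image
) { context ->
    odsComponentStageBuildOpenShiftImage(context)
    odsComponentStageRolloutOpenShiftDeployment(
        context,
        [deployTimeoutMinutes: 15] // allow time for an autoscaler node scale-up
    )
}
```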

braisvq1996 added this to To Do in ODS Maintenance via automation May 11, 2022
@metmajer
Member

@michaelsauter hi Michael, this has become a very regular issue. While we're looking into other issues, would you and your team be able to take a look at this one?

@michaelsauter
Member

@metmajer What would you expect to be done here? The cluster autoscaler is outside the ODS project, and if the scaling takes a very long time, then increasing the timeout as Brais suggested is the only option, I believe. However, waiting ~10 minutes for a deployment is not great if it is a regular occurrence. Maybe the scaler should not kick in so regularly?

@braisvq1996
Contributor Author

braisvq1996 commented May 11, 2022

It does not happen regularly, but it cannot be foreseen by the users; at some point they may face it, or they may not.
I suggest increasing the timeout to avoid this -> here.

@metmajer
Member

@michaelsauter the idea was to get your team's feedback and ensure we're not turning the wrong knobs.

@michaelsauter
Member

@metmajer Got it. As I said, the suggestion makes sense to me.

@metmajer
Member

@braisvq1996 then please provide a PR. We want to push this change next week together with #899 and #900.

metmajer moved this from To Do to In Progress in ODS Maintenance May 13, 2022
ODS Maintenance automation moved this from In Progress to Done May 13, 2022