Upgrading instance through webapp fails for the first time. #8502

Closed
1yuv opened this issue Aug 30, 2023 · 14 comments

Comments

@1yuv
Member

1yuv commented Aug 30, 2023

Describe the bug
When an upgrade is staged and installed from the webapp interface, the first installation attempt fails with the following message:

Error triggering update

To Reproduce
Steps to reproduce the behavior:

  1. Have a 4.3.0 instance.
  2. From the webapp's Upgrades page, select to upgrade the instance to 4.3.1.
  3. Once staging is completed, hit the Install button.
  4. An error is thrown.

Expected behavior
Installation should succeed on the first attempt.

Screen recording

Screen.Recording.2023-08-30.at.10.02.14.PM.mov
1yuv added the Type: Bug (fix something that isn't working as intended) and Affects: 4.3.0 labels on Aug 30, 2023
@henokgetachew
Contributor

Does this mean the second attempt works, or is it just broken?

@1yuv
Member Author

1yuv commented Aug 30, 2023

Does this mean the second attempt works, or is it just broken?

The second attempt works just fine, without any error.

@henokgetachew
Contributor

It's weird that the second attempt works but not the first one. I will have a look at this.

@mrjones-plip
Contributor

cc @nydr and @garethbowen - I'm seeing a lot of HTML being returned where JSON is expected in the video above. I suspect that #8179 will help a lot here, but I'm not sure it addresses the root cause.

@garethbowen
Member

There's also a curious http2 error which doesn't make sense to me. The http2 change doesn't exist in the tag, so I'm assuming it's something spurious. We need to dig deeper into the actual logs to see what happened.

@dianabarsan
Member

Hah, I think there's a bit of a race condition here: for a brief moment, some of the old containers are still running while others are down, and unexpected errors happen.
When an upgrade is triggered, we just do a docker-compose up on the updated docker-compose files. Docker recreates the containers concurrently, with no guarantees about which containers come up or go down first, so there can be a short moment where the old API is up while CouchDB is down and throwing that JSON error (or some other combination).

I think the incoming change that updates how haproxy and nginx respond when services are down may fix this.
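For illustration, here is roughly the shape of the upgrade trigger described above. This is a minimal sketch, not the actual cht-core code; the compose file paths and helper names are assumptions.

```ts
// Minimal sketch of the upgrade trigger (illustrative only; paths and names
// are not taken from cht-core).
import { promisify } from 'util';
import { execFile } from 'child_process';

const run = promisify(execFile);

// Hypothetical location of the staged, already-updated compose files.
const COMPOSE_FILES = ['/srv/compose/cht-core.yml', '/srv/compose/cht-couchdb.yml'];

const composeArgs = (command: string[]): string[] =>
  COMPOSE_FILES.flatMap(file => ['-f', file]).concat(command);

const triggerUpgrade = async (): Promise<void> => {
  // Pull the new images first so the restart window is as short as possible.
  await run('docker-compose', composeArgs(['pull']));
  // Recreate the containers. Docker recreates them concurrently, so there is a
  // short window where e.g. the old API is still up while CouchDB is already down.
  await run('docker-compose', composeArgs(['up', '-d', '--remove-orphans']));
};

triggerUpgrade().catch(err => console.error('Error triggering update', err));
```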

@1yuv
Member Author

1yuv commented Sep 11, 2023

the incoming change that updates how haproxy and nginx respond when services are down may fix this.

Hi @dianabarsan, is there a PR or issue for this? Can you link it here?

@dianabarsan
Member

dianabarsan commented Sep 11, 2023

@yrimal

#8179

@garethbowen
Member

garethbowen commented Oct 10, 2023

Confirming that this still happens with 4.4.0 -> 4.4.1, so it's not fixed by the issue Diana cited, though if I'm patient it does just work eventually. I think the yellow warning is shown unnecessarily and the upgrade trigger happens correctly.

@dianabarsan
Member

The warning is displayed "optimistically" after 1 minute, because we believed that 1 minute was enough time for the containers to stop and restart. This is clearly not true.
We can increase this interval.
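As a rough sketch of that optimistic timer (the constant names and the new interval are assumptions, not values from cht-core):

```ts
// Rough sketch of the "optimistic" warning timer; names and intervals are
// assumptions, not taken from cht-core.
const UPGRADE_WARNING_DELAY_MS = 5 * 60 * 1000; // previously ~1 minute

let warningTimer: ReturnType<typeof setTimeout> | undefined;

const onUpgradeTriggered = (showWarning: () => void): void => {
  // Only warn the user if the upgrade has not been confirmed after the delay.
  warningTimer = setTimeout(showWarning, UPGRADE_WARNING_DELAY_MS);
};

const onUpgradeConfirmed = (): void => {
  // The new version responded as expected; cancel the pending warning.
  if (warningTimer) {
    clearTimeout(warningTimer);
  }
};
```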

dianabarsan self-assigned this on Oct 11, 2023
dianabarsan added this to the 4.5.0 milestone on Oct 11, 2023
@dianabarsan
Member

A nice solution would be to check the health of the containers somehow and validate that we are indeed on the right versions, but right now I don't have a solution for this: since deployments are either in Docker or k8s, we'd need to add an endpoint to some external service that the API can reach.

@garethbowen
Member

@dianabarsan I see the warning immediately, not after 1 minute, just like in @1yuv's video. It's displayed behind the dialog, but if you close the dialog it's there.

@mrjones-plip
Contributor

A nice solution would be to check the health of the containers somehow and validate that we are indeed on the right versions, but right now I don't have a solution for this: since deployments are either in Docker or k8s, we'd need to add an endpoint to some external service that the API can reach.

For docker compose deployments, we do have access to the docker Unix socket, so we could tell container state as well as new-version image download progress. But yeah, as you already said, it'd only work for docker compose and not k*s. Since we're trying to migrate away from docker compose hosting, it's likely not worth pursuing.
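For what it's worth, a sketch of what that could look like using the Docker Engine API's `GET /containers/json` endpoint over the Unix socket; the container-name filter and the fields we'd actually need are assumptions:

```ts
// Sketch of reading container state over the Docker Unix socket (docker
// compose deployments only). Error handling kept minimal for brevity.
import * as http from 'http';

interface ContainerSummary {
  Names: string[];
  Image: string;
  State: string;  // e.g. "running"
  Status: string; // e.g. "Up 2 minutes"
}

const listContainers = (): Promise<ContainerSummary[]> =>
  new Promise((resolve, reject) => {
    const req = http.request(
      { socketPath: '/var/run/docker.sock', path: '/containers/json?all=true', method: 'GET' },
      res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(JSON.parse(body)));
      }
    );
    req.on('error', reject);
    req.end();
  });

// Example: report the image and state of every cht container.
listContainers().then(containers => {
  containers
    .filter(c => c.Names.some(name => name.includes('cht')))
    .forEach(c => console.log(c.Names[0], c.Image, c.State));
});
```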

@dianabarsan
Member

dianabarsan commented Oct 12, 2023

Yeah, after further testing I think I know what is happening:

  • we update the docker-compose files
  • we run docker-compose pull
  • we run docker-compose up
  • this makes docker download and restart the containers in parallel
  • haproxy (or another db container) is updated (downloaded and restarted) before API. API restarts because it rage quits when it doesn't have a DB, and momentarily believes that the upgrade has gone wrong. The upgrade continues, the old API is eventually killed, the new API comes up, and everything is fine.

The warning is displayed because the previous version of API goes down and comes back up while another container (likely haproxy or healthcheck) is updated.
I'm experimenting with not killing API when the database goes down.
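As a sketch of that experiment (the function name, retry interval, and retry count are assumptions, not what ended up in cht-core):

```ts
// Sketch of "not killing API when the database goes down": retry the CouchDB
// connection with a delay instead of exiting immediately. Names and retry
// policy are assumptions.
const RETRY_DELAY_MS = 5000;
const MAX_RETRIES = 60; // give up after roughly 5 minutes

const delay = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// checkCouchUp is assumed to resolve once CouchDB answers on its root URL.
const waitForCouchDb = async (checkCouchUp: () => Promise<void>): Promise<void> => {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      await checkCouchUp();
      return; // the database is back; keep the API process alive
    } catch (err) {
      console.warn(`CouchDB not reachable (attempt ${attempt}/${MAX_RETRIES}), retrying...`);
      await delay(RETRY_DELAY_MS);
    }
  }
  // Only after exhausting the retries fall back to the old "rage quit".
  process.exit(1);
};
```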

dianabarsan added a commit that referenced this issue Oct 14, 2023
- increases interval before warning the user that the upgrade might have issues
- handles case where nginx sends a 502
- prevents upgrade interruption when state is completing

#8502
Benmuiruri pushed a commit that referenced this issue Oct 26, 2023
- increases interval before warning the user that the upgrade might have issues
- handles case where nginx sends a 502
- prevents upgrade interruption when state is completing

#8502