This repository has been archived by the owner on Mar 2, 2024. It is now read-only.

Research upgrade failing due to SecurityGroup used by old node #1038

Closed
yngvark opened this issue Sep 16, 2022 · 1 comment
Labels
bug Something isn't working

Comments

yngvark (Contributor) commented Sep 16, 2022

Description

When attempting to upgrade EKS to 1.21 recently, the following happened:

% ./upgrade.sh cluster-dev.yaml eu-west-1 1.21 | tee "logs/eks-upgrade-1-21-$(date +"%Y-%m-%dx%H-%M-%S").log"


------------------------------------------------------------------------------------------------------------------------
Verify AWS account
------------------------------------------------------------------------------------------------------------------------

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:25]: aws sts get-caller-identity
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

------------------------------------------------------------------------------------------------------------------------
Download dependencies to /tmp/eks-upgrade/1-21
------------------------------------------------------------------------------------------------------------------------
Running: curl --location  https://github.com/weaveworks/eksctl/releases/download/v0.104.0/eksctl_Darwin_amd64.tar.gz | tar xz -C  /tmp/eks-upgrade/1-21
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 29.0M  100 29.0M    0     0  15.0M      0  0:00:01  0:00:01 --:--:-- 18.0M
Running: curl --location  https://dl.k8s.io/release/v1.21.14/bin/darwin/amd64/kubectl  -o  /tmp/eks-upgrade/1-21/kubectl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   138  100   138    0     0    775      0 --:--:-- --:--:-- --:--:--   797
100 50.4M  100 50.4M    0     0  14.6M      0  0:00:03  0:00:03 --:--:-- 17.1M

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:31]: chmod +x /tmp/eks-upgrade/1-21/eksctl
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:31]: chmod +x /tmp/eks-upgrade/1-21/kubectl
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:31]: /tmp/eks-upgrade/1-21/eksctl version -o json
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:32]: /tmp/eks-upgrade/1-21/kubectl version --client=true --output=yaml
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

------------------------------------------------------------------------------------------------------------------------
Verify cluster name
------------------------------------------------------------------------------------------------------------------------

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:32]: /tmp/eks-upgrade/1-21/eksctl get cluster xxxxxxxxxxxxxx-dev
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
NAME        VERSION STATUS  CREATED         VPC         SUBNETS                                         SECURITYGROUPS      PROVIDER
xxxxxxxxxxxxxx-dev 1.20    ACTIVE  2021-05-27T07:10:47Z    vpc-0ba89374891c80fb8   subnet-00b77c17380b708e2,subnet-042d70465122a5da6,subnet-04b69ed7310e36114,subnet-06359b07b96a5ba3a,subnet-06a4cf61b0f9c9daa,subnet-06a9a2365a2566328   sg-0d45570581c5492c9    EKS

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


------------------------------------------------------------------------------------------------------------------------
Do these variables look okay?
------------------------------------------------------------------------------------------------------------------------
Upgrading EKS to version: 1.21
Cluster manifest: cluster-dev.yaml
Cluster name: xxxxxxxxx-dev
AWS account: xxxxxxxxxxxxxxxx
AWS region: eu-west-1
Dry run: true

Do these variables look okay? (Y/n) YY


------------------------------------------------------------------------------------------------------------------------
Run upgrade of EKS control plane. Estimated time: 10-15 min.
------------------------------------------------------------------------------------------------------------------------
💡 Tip: You can go to EKS in AWS console to see the status is set to 'Updating'.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Running command [2022-09-15 10:01:35]: /tmp/eks-upgrade/1-21/eksctl upgrade cluster --name xxxxxxxxxxxxxx-dev --version 1.21
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
2022-09-15 10:01:38 [ℹ]  (plan) would upgrade cluster "xxxxxxxxxxxxxx-dev" control plane from current version "1.20" to "1.21"
2022-09-15 10:01:41 [ℹ]  re-building cluster stack "eksctl-xxxxxxxxxxxxxx-dev-cluster"
2022-09-15 10:01:41 [✔]  all resources in cluster stack "eksctl-xxxxxxxxxxxxxx-dev-cluster" are up-to-date
2022-09-15 10:01:44 [!]  stack's status of nodegroup named eksctl-xxxxxxxxxxxxxx-dev-nodegroup-ng-generic is DELETE_FAILED
2022-09-15 10:01:44 [ℹ]  checking security group configuration for all nodegroups
2022-09-15 10:01:44 [ℹ]  all nodegroups have up-to-date cloudformation templates
2022-09-15 10:01:44 [!]  no changes were applied, run again with '--approve' to apply the changes

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


------------------------------------------------------------------------------------------------------------------------
Replacing node groups, step 1 of 4: Create configuration for new node groups.
------------------------------------------------------------------------------------------------------------------------
2022-09-15 10:01:47 [!]  stack's status of nodegroup named eksctl-xxxxxxxxxxxxxx-dev-nodegroup-ng-generic is DELETE_FAILED

Cause

eksctl was unable to delete the stack used to create the node groups for EKS 1.20, because the stack
refers to security groups that are still in use. The security groups in question are RDSPostgresIncoming and
RDSPostgresOutgoing.
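The blocking dependency can usually be located with a few AWS CLI queries. A minimal sketch, assuming placeholder stack and security-group IDs (substitute the values from your own log output):

```shell
# Placeholders: take the real stack name and SG id from the upgrade log.
STACK="eksctl-CLUSTER-nodegroup-ng-generic"
SG_ID="sg-0d45570581c5492c9"   # node group SG reported by 'eksctl get cluster'

if command -v aws >/dev/null 2>&1; then
  # Why exactly did the stack deletion fail?
  aws cloudformation describe-stack-events --stack-name "$STACK" \
    --query "StackEvents[?ResourceStatus=='DELETE_FAILED'].ResourceStatusReason" \
    --output text || true

  # Which network interfaces still use the security group?
  aws ec2 describe-network-interfaces \
    --filters "Name=group-id,Values=$SG_ID" \
    --query "NetworkInterfaces[].[NetworkInterfaceId,Description]" \
    --output table || true

  # Which other security groups have rules referencing it?
  aws ec2 describe-security-groups \
    --filters "Name=ip-permission.group-id,Values=$SG_ID" \
    --query "SecurityGroups[].GroupId" --output text || true
else
  echo "aws CLI not available; the commands above are illustrative only"
fi
```

The last query is the interesting one here: rules in other groups (such as the RDS groups) that reference the node group SG will keep CloudFormation from deleting it.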

Comments

My first thought was that either we must detach the security groups before attempting the upgrade and reattach
them afterwards, or we should research whether we should attach the Postgres security groups to something
more stable in EKS than the node groups, so they do not need detaching when upgrading.
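The detach-before-upgrade idea could be sketched with `aws ec2 modify-instance-attribute`, which replaces an instance's entire security-group set. This is untested speculation, not a chosen fix, and every ID below is a placeholder:

```shell
# Speculative sketch: detach the RDS security groups from a node instance
# before upgrading, reattach afterwards. All IDs are placeholders.
INSTANCE_ID="i-0123456789abcdef0"
NODE_SG="sg-0d45570581c5492c9"      # node group SG from 'eksctl get cluster'
RDS_SGS="sg-aaaa1111 sg-bbbb2222"   # RDSPostgresIncoming / RDSPostgresOutgoing (placeholders)

if command -v aws >/dev/null 2>&1; then
  # --groups REPLACES the instance's whole security-group set, so listing
  # only NODE_SG drops the RDS groups from the instance.
  aws ec2 modify-instance-attribute \
    --instance-id "$INSTANCE_ID" --groups "$NODE_SG" || true

  # ...run the upgrade here, then reattach the RDS groups:
  aws ec2 modify-instance-attribute \
    --instance-id "$INSTANCE_ID" --groups "$NODE_SG" $RDS_SGS || true
else
  echo "aws CLI not available; commands are illustrative only"
fi
```

Doing this per instance would not survive node replacement by the Auto Scaling group, which is one argument for attaching the Postgres groups to something more stable instead.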

However, there are still some things I don't understand about this issue:

  • Why did the upgrade script exit with an error when run for this team, but succeed with the same message for another team?
    • It probably doesn't matter, though. The other team also still has the same CloudFormation stack in state DELETE_FAILED, yet everything seems to work fine for them.
  • Why was this not a problem when upgrading from 1.19 to 1.20? The node groups should have been deleted then as well.

To do

  • Research what to do about this.
    • Do we need to change how security groups are set up? If so, that means adjusting the okctl code and creating an okctl-upgrade to fix existing setups - or perhaps cleaning up manually is faster and just as safe.
deifyed (Member) commented Oct 3, 2022

This happened because manual changes introduced undeclared dependencies on the node group security group defined in the node group CloudFormation template, preventing eksctl from deleting it.

I've presented a guide on how to clean up the manual changes to the node group SG that were making the stack deletion fail. We'll support if needed.
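A hedged sketch of that kind of cleanup, assuming the blocker is a manually added rule in another group that references the node group SG (all IDs and the port are placeholders):

```shell
# Placeholder: the node group SG from 'eksctl get cluster'; yours will differ.
NODE_SG="sg-0d45570581c5492c9"

if command -v aws >/dev/null 2>&1; then
  # List the security groups whose rules reference the node group SG.
  REFS=$(aws ec2 describe-security-groups \
    --filters "Name=ip-permission.group-id,Values=$NODE_SG" \
    --query "SecurityGroups[].GroupId" --output text) || true
  echo "groups referencing $NODE_SG: ${REFS:-none}"

  # Then revoke each offending rule, e.g. a Postgres ingress rule
  # (group id, protocol, and port below are placeholders):
  # aws ec2 revoke-security-group-ingress --group-id sg-aaaa1111 \
  #   --protocol tcp --port 5432 --source-group "$NODE_SG"
else
  echo "aws CLI not available; commands are illustrative only"
fi
```

Once no rules reference the SG, retrying the stack deletion (or the upgrade) should let CloudFormation remove it.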
