[bug] Qhub fails to destroy on AWS instance #1081

Closed
viniciusdc opened this issue Feb 19, 2022 · 1 comment
Labels
area: terraform 💾 needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug provider: AWS type: bug 🐛 Something isn't working


viniciusdc commented Feb 19, 2022

Describe the bug

A fresh install of qhub (version 0.4.0.dev86+g9fb62c8, based on main:9fb62c812e) failed to destroy its AWS deployment:

[terraform]: module.network.aws_subnet.main[1]: Still destroying... [id=subnet-0e67818d0783e1d0c, 20m1s elapsed]
[terraform]: module.network.aws_subnet.main[0]: Still destroying... [id=subnet-057575eebaf199725, 20m1s elapsed]
[terraform]: 2022-02-18T20:24:16.347-0300 [INFO]  provider.terraform-provider-aws_v3.73.0_x5: 2022/02/18 20:24:16 [WARN] WaitForState timeout after 20m0s: timestamp=2022-02-18T20:24:16.347-0300
[terraform]: 2022-02-18T20:24:16.347-0300 [INFO]  provider.terraform-provider-aws_v3.73.0_x5: 2022/02/18 20:24:16 [WARN] WaitForState starting 30s refresh grace period: timestamp=2022-02-18T20:24:16.347-0300
[terraform]: 2022-02-18T20:24:16.371-0300 [INFO]  provider.terraform-provider-aws_v3.73.0_x5: 2022/02/18 20:24:16 [WARN] WaitForState timeout after 20m0s: timestamp=2022-02-18T20:24:16.371-0300
[terraform]: 2022-02-18T20:24:16.371-0300 [INFO]  provider.terraform-provider-aws_v3.73.0_x5: 2022/02/18 20:24:16 [WARN] WaitForState starting 30s refresh grace period: timestamp=2022-02-18T20:24:16.371-0300
[terraform]: ╷
[terraform]: │ Error: error deleting EC2 Subnet (subnet-0e67818d0783e1d0c): DependencyViolation: The subnet 'subnet-0e67818d0783e1d0c' has dependencies and cannot be deleted.
[terraform]: │ 	status code: 400, request id: a936bad1-4605-4b70-9fc4-25c0dae06131
[terraform]: │ 
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: error detaching EC2 Internet Gateway (igw-00f5928d9219a02fa) from VPC (vpc-027cf7b4d3f134cc1): DependencyViolation: Network vpc-027cf7b4d3f134cc1 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.
[terraform]: │ 	status code: 400, request id: ae39e552-b365-45e6-a900-0304447bd733
[terraform]: │ 
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: error deleting EC2 Subnet (subnet-057575eebaf199725): DependencyViolation: The subnet 'subnet-057575eebaf199725' has dependencies and cannot be deleted.
[terraform]: │ 	status code: 400, request id: 88945487-86a3-4f29-8ebd-8ab98ed4d638
[terraform]: │ 
[terraform]: │ 
[terraform]: ╵
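
To see at a glance which resources blocked the destroy, the DependencyViolation errors can be pulled out of the log. This is a minimal sketch that assumes the log lines have the shape shown above; the function name and the regex are illustrative and not part of qhub or Terraform.

```python
import re
from typing import List, Tuple

# Assumed line shape (taken from the log above):
#   Error: error <action> EC2 <resource type> (<resource id>) ...: DependencyViolation: ...
_VIOLATION = re.compile(
    r"Error: error (\w+) EC2 ([\w ]+?) \(([\w-]+)\).*?DependencyViolation"
)


def blocked_resources(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (action, resource type, resource id) for each DependencyViolation."""
    found = []
    for line in log_text.splitlines():
        match = _VIOLATION.search(line)
        if match:
            found.append(match.groups())
    return found
```

For the log above this would report the two subnets and the internet gateway, which are the resources that could not be removed while something else in the VPC still referenced them.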

I have seen this error before and was able to fix it by manually deleting the ELB (Elastic Load Balancer) attached to the VPC and then deleting the VPC itself (for some reason qhub has no step that deletes these resources).
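
The manual cleanup described above can be sketched with boto3. This is a hedged sketch, not what qhub does: it assumes the load balancer is a classic ELB (a network or application load balancer would need the elbv2 client instead), the function names are hypothetical, pagination is ignored, and the client is passed in as an argument so the logic can be tried against a stub before touching a real account.

```python
from typing import Any, List


def find_elbs_in_vpc(elb_client: Any, vpc_id: str) -> List[str]:
    """Return the names of classic load balancers that live in vpc_id.

    elb_client is expected to behave like boto3.client("elb").
    Only the first page of results is inspected (assumption: few ELBs).
    """
    response = elb_client.describe_load_balancers()
    return [
        lb["LoadBalancerName"]
        for lb in response.get("LoadBalancerDescriptions", [])
        if lb.get("VPCId") == vpc_id
    ]


def delete_elbs_in_vpc(elb_client: Any, vpc_id: str, dry_run: bool = True) -> List[str]:
    """Delete every classic ELB in vpc_id; with dry_run=True only list them."""
    names = find_elbs_in_vpc(elb_client, vpc_id)
    if not dry_run:
        for name in names:
            elb_client.delete_load_balancer(LoadBalancerName=name)
    return names
```

With real credentials this would be called as delete_elbs_in_vpc(boto3.client("elb"), "vpc-027cf7b4d3f134cc1", dry_run=False). Leaving dry_run at its default only lists what would be deleted, which is a safer first step on a shared account.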

How to reproduce

Steps to recreate:

  • qhub init aws --project=qhubstages --domain awsqhubstages.qhub.dev --auth-provider=password --terraform-state=local --ci-provider=github-actions
  • qhub deploy -c qhub-config.yaml --disable-prompt --dns-provider cloudflare --dns-auto-provision (omit --dns-auto-provision if you are redeploying)
  • then qhub destroy

Expected behavior

  • Successful execution of the destroy command, with no qhub resources left behind in the AWS console

Observations

I am not sure why this happens. Maybe the ELB is created by the provider, outside of Terraform's control? Or maybe we just need to change the order of deletion (only initial thoughts here).


viniciusdc commented Feb 19, 2022

This seems to be related to the ELB and to some internal AWS behavior when removing resources; see this comment for reference.
There are several related points:

  • The ELB is auto-created during deployment and Terraform has no information about it, so removing it needs more robust handling.
  • The security group is also auto-generated, so the same applies.
  • The AWS cluster might need to depend on its security roles; see the example in the reference.
  • During destroy, Terraform attempts to remove both the internet_gateway and the VPC it is attached to, which might create a dependency loop (the gateway depends on the VPC, and to destroy a VPC you must first remove all of its internet attachments).
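
The ordering problem in the points above can be illustrated with a toy dependency graph. This is only a sketch under stated assumptions: the resource names and edges are hypothetical and are not read from qhub's Terraform state, and Terraform computes its own graph internally. It uses graphlib from the Python standard library (3.9+) to show that a valid destroy order is the reverse of the creation order, so the ELB has to go before the subnets and the gateway.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical graph: each resource maps to the resources it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "vpc": set(),
    "internet_gateway": {"vpc"},
    "subnet": {"vpc"},
    # The ELB is created outside Terraform, but AWS still treats it as a
    # dependent of the subnets and of the gateway's public addresses.
    "elb": {"subnet", "internet_gateway"},
}


def destroy_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return resources in an order that is safe to destroy them in."""
    # static_order() yields dependencies first, i.e. the creation order.
    create_order = list(TopologicalSorter(graph).static_order())
    return list(reversed(create_order))
```

If the "elb" entry is missing from the graph, as it effectively is for Terraform here, the computed order starts with the subnets and the gateway, and AWS rejects those deletions with DependencyViolation.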

Possible solutions?

  • Add an explicit dependency between the gateway and the VPC (via depends_on)
  • Import data resources for the security groups and the load balancer. The second is a bit more difficult, because the LB is created by the Ingress, which would mean adding provider-specific code to stage 06...
