Deployment on AWS fails waiting for persistent volume #17

Closed · aktech opened this issue Jun 24, 2020 · 10 comments

Comments

@aktech
Member

aktech commented Jun 24, 2020

Steps to reproduce:

  • Render the aws project
  • Run terraform init, then terraform apply in the infrastructure directory (see the sketch below)
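
A minimal sketch of the reproduction, assuming the rendered Terraform lives in an infrastructure/ directory:

# after rendering the aws project, run terraform from the generated directory
cd infrastructure
terraform init
terraform apply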

First I got this:

Error: namespaces "dev" not found

  on .terraform/modules/kubernetes-conda-store-mount/modules/kubernetes/nfs-mount/main.tf line 38, in resource "kubernetes_persistent_volume_claim" "main":
  38: resource "kubernetes_persistent_volume_claim" "main" {

Then I ran terraform apply again, and every time I got the following:

Error: timeout while waiting for state to become 'Bound' (last state: 'Pending', timeout: 5m0s)

  on .terraform/modules/kubernetes-conda-store-server/modules/kubernetes/services/conda-store/main.tf line 1, in resource "kubernetes_persistent_volume_claim" "main":
   1: resource "kubernetes_persistent_volume_claim" "main" {
@aktech
Member Author

aktech commented Jun 24, 2020

On terraform destroy:

Error: error deleting subnet (subnet-0513a3d069c573aa4): timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)
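
A subnet usually cannot be deleted while network interfaces (for example from load balancers or worker nodes) are still attached to it. A quick way to check, assuming the AWS CLI is configured for the same account and region:

# list network interfaces still attached to the stuck subnet
aws ec2 describe-network-interfaces \
  --filters Name=subnet-id,Values=subnet-0513a3d069c573aa4 \
  --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Desc:Description,Status:Status}'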

@costrouc
Member

I ran into this same issue, and here is the solution/workaround I found. I would like qhub to handle this case automatically, but at the same time I "like" that this is difficult, since resizing a PVC will delete the old one and thus delete all user data. Eventually there needs to be a better way to do this... GCP does not support disk resizing, but other storage providers do, e.g. Rook/Ceph.

Here is the workaround:

# remove the conda-store and NFS deployments along with their claims and volumes (data on them is lost)
kubectl delete -n dev deployments conda-store-conda-store nfs-server-nfs
kubectl delete -n dev pvc conda-store-dev-share nfs-mount-dev-share
kubectl delete -n dev pv conda-store-dev-share nfs-mount-dev-share

Then reapply the terraform deployment and it should create the needed volumes... What I sort of like about this is that we are forcing users to delete the shared filesystems themselves, which makes losing data harder (kind of a "this bug is a feature :)" situation).
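
For example (a sketch, run from the same Terraform directory as the original deployment):

# re-create the deleted volumes and redeploy conda-store and the NFS server
terraform apply
# the claims should now reach the Bound state
kubectl get -n dev pvc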

@aktech
Member Author

aktech commented Jul 1, 2020

Interesting, but the problem I faced with AWS was on a fresh deployment, which didn't have any cluster already present, so this problem shouldn't happen there, right?

@aktech
Member Author

aktech commented Jul 1, 2020

They seem to still be there after running the above commands. The following hack seems to work:

@costrouc transferred this issue from Quansight/qhub-ops on Aug 18, 2020
@tylerpotts
Contributor

Document the solution and how to delete the persistent volume. We will leave the solution manual so that user data is not automatically deleted.

@filippo82
Contributor

Hi all, is there any update on this issue?

@tylerpotts
Contributor

@filippo82 There is. With the newer terraform/kubernetes update, the persistent volume size can be increased without it being deleted/destroyed. I've verified this as of this past week.
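
For reference, a sketch of what the resize looks like at the Kubernetes level, assuming the storage class has allowVolumeExpansion enabled and reusing the PVC name from the workaround above (200Gi is only an example value):

# bump the requested size in place; the PVC is not deleted or recreated
kubectl patch -n dev pvc conda-store-dev-share \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl get -n dev pvc conda-store-dev-share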

@filippo82
Contributor

Hi @tylerpotts, I believe the issue @aktech was experiencing in June (and which I was also experiencing yesterday) was somehow related to the order in which Terraform creates resources. This should now be fixed by #129: first build the Kubernetes cluster with terraform apply -auto-approve -target=module.kubernetes -target=module.kubernetes-initialization -target=module.kubernetes-ingress, and then everything else with a general terraform apply.
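
Formatted as a sketch, the two-stage apply looks like this:

# stage 1: build only the cluster, its initialization, and the ingress
terraform apply -auto-approve \
  -target=module.kubernetes \
  -target=module.kubernetes-initialization \
  -target=module.kubernetes-ingress

# stage 2: create everything else, including the persistent volume claims
terraform apply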

So this issue is now fixed for me and, I believe, it is fully taken care of by qhub deploy, so it can probably be closed.

The issue I am still having is with terraform destroy of the QHub deployment. It has been a nightmare so far :/

I've opened this issue to discuss that: #144

Best,
-Filippo

@tylerpotts
Contributor

@filippo82 Thanks for the clarification. I'll close out this issue

@prasunanand
Contributor

prasunanand commented Nov 6, 2020

Solution: the issue is with the conda environment syntax, so the conda environment is not ready and hence the PV fails. If the environment takes too long to build, this error may still appear.
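
A sketch of how to confirm this, reusing the dev namespace and resource names from the workaround above:

# check why the claim is still Pending and whether conda-store is still building environments
kubectl describe -n dev pvc conda-store-dev-share
kubectl logs -n dev deployment/conda-store-conda-store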
