
Problem with Traefik Proxy Deployment #563

Closed
jhamman opened this issue Mar 12, 2020 · 6 comments
@jhamman
Member

jhamman commented Mar 12, 2020

possibly related to #560

In an effort to debug #560, I tore down staging.hydro.pangeo.io and redeployed it in a fresh environment. This went as expected:

helm upgrade --wait --install --namespace hydro-staging hydro-staging pangeo-deploy -f deployments/hydro/config/common.yaml -f deployments/hydro/config/staging.yaml -f deployments/hydro/secrets/staging.yaml
Release "hydro-staging" does not exist. Installing it now.
NAME: hydro-staging
LAST DEPLOYED: Thu Mar 12 13:59:32 2020
NAMESPACE: hydro-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None

But there seems to be a problem with the proxy:

$ kubectl logs autohttps-779cc6866d-p4wxk traefik -n hydro-staging
time="2020-03-12T20:59:41Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-12T20:59:41Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-12T20:59:41Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *acme.Provider {\"email\":\"jhamman@ucar.edu\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-12T20:59:41Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-12T20:59:48Z" level=info msg=Register... providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme

cc @consideRatio
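The error above is Let's Encrypt's HTTP-01 challenge timing out: the CA must be able to fetch a token over plain HTTP on port 80 of the public hostname. A minimal sketch of probing that URL from outside the cluster (the helper function name is made up for illustration; the token is the one from the log above):

```shell
#!/bin/sh
# Hedged sketch, not a command from this thread: build the ACME HTTP-01
# challenge URL that Let's Encrypt tries to fetch, so it can be probed
# from outside the cluster.
acme_challenge_url() {
  # $1 = domain, $2 = challenge token
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

url=$(acme_challenge_url "staging.hydro.pangeo.io" "gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns")
echo "$url"
# To probe reachability (10 s cap so it fails fast):
#   curl -sv --max-time 10 "$url"
```

If that curl also times out from a machine outside the cluster, the problem is in the network path (firewall, load balancer, DNS) rather than in Traefik itself.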

@snickell

snickell commented Mar 13, 2020

related: jupyterhub/zero-to-jupyterhub-k8s#1594

Worth noting that I hit this same problem with the Traefik in z2jh (linked above), and a slightly newer release of the JupyterHub chart fixed it for me (0.9.0-beta.4 has the Traefik problem; 0.9.0-beta.4.n008.hb20ad22 does not). If somebody is having trouble with Pangeo, comparing the Traefik config between those two chart versions might point to a solution.

@jhamman
Member Author

jhamman commented Mar 13, 2020

Thanks @snickell - unfortunately, we're already running with 0.9.0-beta.4.n008.hb20ad22 so I don't think that is the problem.

@jhamman
Member Author

jhamman commented Mar 13, 2020

I think I have this working now. What seems to be happening is that the autohttps pod's traefik container starts up before the rest of the deployment/network is ready to go. The solution I've found is simply to delete the autohttps pod and let it come back to life on its own:

$ helm upgrade --wait --install --namespace ocean-staging ocean-staging pangeo-deploy -f deployments/ocean/config/common.yaml -f deployments/ocean/config/staging.yaml -f deployments/ocean/secrets/staging.yaml --cleanup-on-fail
Release "ocean-staging" does not exist. Installing it now.
NAME: ocean-staging
LAST DEPLOYED: Fri Mar 13 14:21:15 2020
NAMESPACE: ocean-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl logs autohttps-d44df9478-5j7p4 traefik -n ocean-staging -f
time="2020-03-13T21:21:21Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:21:21Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:21:21Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:21:21Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-13T21:21:28Z" level=info msg=Register... providerName=le.acme
time="2020-03-13T21:21:40Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-13T21:21:41Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
^C
$ kubectl delete pod -n ocean-staging autohttps-d44df9478-5j7p4
pod "autohttps-d44df9478-5j7p4" deleted
$ kubectl logs autohttps-d44df9478-rjtjh traefik -n ocean-staging -f
time="2020-03-13T21:23:40Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:23:40Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:23:40Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:23:40Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *traefik.Provider {}"

And the hub comes online.
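The delete-and-recreate workaround above can be sketched as a small script. The `component=autohttps` label selector is an assumption about the chart's pod labels (it avoids hard-coding the hashed pod name), and `kubectl` is stubbed here so the sketch runs anywhere; drop the stub to use it against a real cluster:

```shell
#!/bin/sh
# Hedged sketch of the workaround: delete the autohttps pod so the
# Deployment recreates it once the rest of the release's networking is up.
kubectl() { echo "kubectl $*"; }   # stub: prints the command it would run

ns="ocean-staging"
bounce_autohttps() {
  # Select by label rather than hard-coding the pod's hashed name
  # (the component=autohttps label is an assumption about the chart).
  kubectl delete pod -n "$ns" -l component=autohttps
  # Wait until the replacement pod is Ready before declaring success.
  kubectl rollout status -n "$ns" deployment/autohttps --timeout=120s
}

bounce_autohttps
```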

@jhamman jhamman closed this as completed Mar 13, 2020
@scottyhq
Member

@jhamman @tjcrone In order to get this update to work on AWS I had to run the following commands locally:
$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.14"}

cd pangeo-cloud-federation
git pull upstream staging
cd pangeo-deploy
helm repo add pangeo https://pangeo-data.github.io/helm-chart/
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add dask-gateway https://dask.org/dask-gateway-helm-repo/
helm repo update
helm dependency update
cd ../
kubectl delete deployment autohttps -n icesat2-staging
kubectl delete rolebinding -n icesat2-staging autohttps
kubectl delete role -n icesat2-staging autohttps
helm upgrade --wait --install --cleanup-on-fail --namespace icesat2-staging icesat2-staging pangeo-deploy -f deployments/icesat2/config/common.yaml -f deployments/icesat2/config/staging.yaml -f deployments/icesat2/secrets/staging.yaml

The update succeeds, but there is still a problem with autohttps: going to the login page results in ERR_SSL_PROTOCOL_ERROR.

After kubectl delete pod -n icesat2-staging autohttps-7cb6845966-fvjhs, we're back in business.
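A quick way to check whether the proxy is actually serving a certificate when the browser reports ERR_SSL_PROTOCOL_ERROR is an openssl handshake. This is a hedged sketch, not a command from this thread, and the hostname is illustrative:

```shell
#!/bin/sh
# Hedged sketch: build the openssl s_client invocation for inspecting the
# certificate the proxy presents. Hostname is an illustrative assumption.
host="staging.icesat2.pangeo.io"
cmd="openssl s_client -connect ${host}:443 -servername ${host}"
echo "$cmd"
# Run it and print the leaf certificate's subject/issuer; a handshake
# error (or no certificate) means Traefik never completed the ACME flow:
#   echo | $cmd 2>/dev/null | openssl x509 -noout -subject -issuer
```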

@tjcrone
Contributor

tjcrone commented Mar 13, 2020

@scottyhq!! Thank you! This was very helpful. I was hitting the same ERR_SSL_PROTOCOL_ERROR that you had, but I ran through all of the steps you provided here and we are back up on staging; the CI looks to be working fine for staging as well. Awesome, thank you very much for providing these steps.
