
Problem with Traefik Proxy Deployment #563

Closed
jhamman opened this issue Mar 12, 2020 · 6 comments
@jhamman
Member

jhamman commented Mar 12, 2020

possibly related to #560

In an effort to debug #560, I tore down staging.hydro.pangeo.io and redeployed it in a fresh environment. This went as expected:

helm upgrade --wait --install --namespace hydro-staging hydro-staging pangeo-deploy -f deployments/hydro/config/common.yaml -f deployments/hydro/config/staging.yaml -f deployments/hydro/secrets/staging.yaml
Release "hydro-staging" does not exist. Installing it now.
NAME: hydro-staging
LAST DEPLOYED: Thu Mar 12 13:59:32 2020
NAMESPACE: hydro-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None

But there seems to be a problem with the proxy:

$ kubectl logs autohttps-779cc6866d-p4wxk traefik -n hydro-staging
time="2020-03-12T20:59:41Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-12T20:59:41Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-12T20:59:41Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *acme.Provider {\"email\":\"jhamman@ucar.edu\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-12T20:59:41Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-12T20:59:48Z" level=info msg=Register... providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme

cc @consideRatio
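The error above is Let's Encrypt's HTTP-01 challenge timing out: the CA must be able to fetch a token over plain HTTP on port 80 of the public hostname. A minimal sketch of probing that URL from outside the cluster (the helper function name is made up for illustration; the token is the one from the log above):

```shell
#!/bin/sh
# Hedged sketch, not a command from this thread: build the ACME HTTP-01
# challenge URL that Let's Encrypt tries to fetch, so it can be probed
# from outside the cluster.
acme_challenge_url() {
  # $1 = domain, $2 = challenge token
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

url=$(acme_challenge_url "staging.hydro.pangeo.io" "gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns")
echo "$url"
# To probe reachability (10 s cap so it fails fast):
#   curl -sv --max-time 10 "$url"
```

If that curl also times out from a machine outside the cluster, the problem is in the network path (firewall, load balancer, DNS) rather than in Traefik itself.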

@snickell

snickell commented Mar 13, 2020

related: jupyterhub/zero-to-jupyterhub-k8s#1594

Worth noting that I hit this same problem with the Traefik in z2jh (linked above), and a slightly newer release of the JupyterHub chart fixed it for me (0.9.0-beta.4 has the Traefik problem; 0.9.0-beta.4.n008.hb20ad22 does not). If somebody is having trouble with Pangeo, comparing the Traefik config between those two chart versions might point to a solution.

@jhamman
Member Author

jhamman commented Mar 13, 2020

Thanks @snickell - unfortunately, we're already running with 0.9.0-beta.4.n008.hb20ad22 so I don't think that is the problem.

@jhamman
Member Author

jhamman commented Mar 13, 2020

I think I have this working now. What seems to be happening is that the autohttps pod's traefik container starts up before the rest of the deployment/network is ready to go. The solution I've found is simply to delete the autohttps pod and let it come back to life on its own:

$ helm upgrade --wait --install --namespace ocean-staging ocean-staging pangeo-deploy -f deployments/ocean/config/common.yaml -f deployments/ocean/config/staging.yaml -f deployments/ocean/secrets/staging.yaml --cleanup-on-fail
Release "ocean-staging" does not exist. Installing it now.
NAME: ocean-staging
LAST DEPLOYED: Fri Mar 13 14:21:15 2020
NAMESPACE: ocean-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl logs autohttps-d44df9478-5j7p4 traefik -n ocean-staging -f
time="2020-03-13T21:21:21Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:21:21Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:21:21Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:21:21Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-13T21:21:28Z" level=info msg=Register... providerName=le.acme
time="2020-03-13T21:21:40Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-13T21:21:41Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
^C
$ kubectl delete pod -n ocean-staging autohttps-d44df9478-5j7p4
pod "autohttps-d44df9478-5j7p4" deleted
$ kubectl logs autohttps-d44df9478-rjtjh traefik -n ocean-staging -f
time="2020-03-13T21:23:40Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:23:40Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:23:40Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:23:40Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *traefik.Provider {}"

And the hub comes online.
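The delete-and-recreate workaround above can be sketched as a small script. The `component=autohttps` label selector is an assumption about the chart's pod labels (it avoids hard-coding the hashed pod name), and `kubectl` is stubbed here so the sketch runs anywhere; drop the stub to use it against a real cluster:

```shell
#!/bin/sh
# Hedged sketch of the workaround: delete the autohttps pod so the
# Deployment recreates it once the rest of the release's networking is up.
kubectl() { echo "kubectl $*"; }   # stub: prints the command it would run

ns="ocean-staging"
bounce_autohttps() {
  # Select by label rather than hard-coding the pod's hashed name
  # (the component=autohttps label is an assumption about the chart).
  kubectl delete pod -n "$ns" -l component=autohttps
  # Wait until the replacement pod is Ready before declaring success.
  kubectl rollout status -n "$ns" deployment/autohttps --timeout=120s
}

bounce_autohttps
```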

@jhamman jhamman closed this as completed Mar 13, 2020
@scottyhq
Member

@jhamman @tjcrone In order to get this update to work on AWS I had to run the following commands locally:
$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.14"}

cd pangeo-cloud-federation
git pull upstream staging
cd pangeo-deploy
helm repo add pangeo https://pangeo-data.github.io/helm-chart/
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add dask-gateway https://dask.org/dask-gateway-helm-repo/
helm repo update
helm dependency update
cd ../
kubectl delete deployment autohttps -n icesat2-staging
kubectl delete rolebinding -n icesat2-staging autohttps
kubectl delete role -n icesat2-staging autohttps
helm upgrade --wait --install --cleanup-on-fail --namespace icesat2-staging icesat2-staging pangeo-deploy -f deployments/icesat2/config/common.yaml -f deployments/icesat2/config/staging.yaml -f deployments/icesat2/secrets/staging.yaml

The update succeeds, but there is still a problem with autohttps: going to the login page results in ERR_SSL_PROTOCOL_ERROR.

After kubectl delete pod -n icesat2-staging autohttps-7cb6845966-fvjhs, we're back in business.
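A quick way to check whether the proxy is actually serving a certificate when the browser reports ERR_SSL_PROTOCOL_ERROR is an openssl handshake. This is a hedged sketch, not a command from this thread, and the hostname is illustrative:

```shell
#!/bin/sh
# Hedged sketch: build the openssl s_client invocation for inspecting the
# certificate the proxy presents. Hostname is an illustrative assumption.
host="staging.icesat2.pangeo.io"
cmd="openssl s_client -connect ${host}:443 -servername ${host}"
echo "$cmd"
# Run it and print the leaf certificate's subject/issuer; a handshake
# error (or no certificate) means Traefik never completed the ACME flow:
#   echo | $cmd 2>/dev/null | openssl x509 -noout -subject -issuer
```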

@tjcrone
Contributor

tjcrone commented Mar 13, 2020

@scottyhq!! Thank you! This was very helpful. I was hitting the same ERR_SSL_PROTOCOL_ERROR that you had, but I ran through all of the steps you provided here and we are back up on staging; the CI looks to be working fine for staging as well. Awesome, thank you very much for providing these steps.
