+++
title = "Monitor IAP Setup"
description = "Instructions for monitoring and troubleshooting IAP"
weight = 5
+++

Using Identity-Aware Proxy (IAP) is the recommended solution for accessing your Kubeflow deployment from outside the cluster.

This is a step-by-step guide to ensuring your IAP-secured endpoint comes up and to debugging problems when it doesn't.

While it requires some effort, the end result is well worth it:

  • Users can easily log in using their GCP accounts
  • You rely on Google's security expertise to protect your sensitive workloads
  1. The first step is to ensure the ingress and GCP load balancer are created

    kubectl -n kubeflow describe ingress
    
    Name:             envoy-ingress
    Namespace:        kubeflow
    Address:          35.244.132.160
    Default backend:  default-http-backend:80 (10.20.0.10:8080)
    Events:
       Type     Reason     Age                 From                     Message
       ----     ------     ----                ----                     -------
       Normal   ADD        12m                 loadbalancer-controller  kubeflow/envoy-ingress
       Warning  Translate  12m (x10 over 12m)  loadbalancer-controller  error while evaluating the ingress spec: could not find service "kubeflow/envoy"
       Warning  Translate  12m (x2 over 12m)   loadbalancer-controller  error while evaluating the ingress spec: error getting BackendConfig for port "8080" on service "kubeflow/envoy", err: no BackendConfig for service port exists.
       Warning  Sync       12m                 loadbalancer-controller  Error during sync: Error running backend syncing routine: received errors when updating backend service: googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
     googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
       Normal  CREATE  11m  loadbalancer-controller  ip: 35.244.132.160
    ...
    
    • If the address isn't set, there was a problem creating the load balancer

      • The CREATE event indicates the load balancer was successfully created on the specified IP address
    • Any problems creating the load balancer are reported as Kubernetes events, which show up when you run kubectl describe

    • The most common error is running out of GCP quota

    • If you run out of GCP quota, either increase the quota on your project for that resource or delete some existing resources
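
The address check in this step can be scripted. The following is a minimal sketch, assuming the ingress is named envoy-ingress in the kubeflow namespace as in the example output. The function names ingress_ip and wait_for_ingress_ip are hypothetical helpers and are not part of Kubeflow or kubectl.

```shell
#!/bin/sh
# Hypothetical helpers; the defaults (kubeflow, envoy-ingress) come from the example output.

# Print the external IP recorded in the ingress status, or nothing if it is not set yet.
# $1 = namespace (default kubeflow), $2 = ingress name (default envoy-ingress).
ingress_ip() {
  kubectl -n "${1:-kubeflow}" get ingress "${2:-envoy-ingress}" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Poll until the ingress reports an address.
# $1 = maximum number of attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_ingress_ip() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    ip="$(ingress_ip)" || ip=""
    if [ -n "${ip}" ]; then
      echo "${ip}"
      return 0
    fi
    sleep "${delay}"
    i=$((i + 1))
  done
  echo "no ingress address after ${attempts} attempts; inspect the events with kubectl describe" >&2
  return 1
}
```

With the defaults, calling wait_for_ingress_ip polls for roughly five minutes before giving up. If it fails, the events printed by kubectl -n kubeflow describe ingress remain the place to look for the cause.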

  2. Verify that a signed SSL certificate could be generated using Let's Encrypt

    kubectl -n kubeflow get certificate envoy-ingress-tls  -o yaml
    
    apiVersion: certmanager.k8s.io/v1alpha1
    kind: Certificate
    metadata:
      annotations:
        ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/6yRsW7zMAyE9/8xONv+463w2qlLhg5dggyMRDuCJVIQ6RSB4XcvlDQdCnRqN0EHfjzerYA5vFHRIAwDOCqWkHGi0s1P2gX5f+kx5jP20MAc2MMAz1QsjMGhETSQyNCjIQwrRDxR1PqaVZjJKsBJysLEBgMEzG3gqZAqbA0wJoIBiC9yffy3FhXukmZ0VZ+XE41R3uuIZnJ1Abo6uoITHsMEw2EFLwkDKwwHmMf2klCNSsu7viP2WQKbdg9U60LrKUe5JmLrXJTFd5PIBMcGzmZ511f6w+s3j7Btx60BJykJ7+9H/GJlA561Yv7Ae1BdqLzSeGvhs7C4VNzLTYKv2COZErtyzdbmIv4WL7lCtv+pl2379wEAAP//AQAA///uHVhQMgIAAA=="}'
        kubecfg.ksonnet.io/garbage-collect-tag: gc-tag
      creationTimestamp: 2019-04-02T22:49:43Z
      generation: 1
      labels:
        app.kubernetes.io/deploy-manager: ksonnet
        ksonnet.io/component: iap-ingress
      name: envoy-ingress-tls
      namespace: kubeflow
      resourceVersion: "4803"
      selfLink: /apis/certmanager.k8s.io/v1alpha1/namespaces/kubeflow/certificates/envoy-ingress-tls
      uid: 9b137b29-5599-11e9-a223-42010a8e020c
    spec:
      acme:
        config:
        - domains:
          - mykubeflow.endpoints.myproject.cloud.goog
          http01:
            ingress: envoy-ingress
      commonName: kf-vmaster-n01.endpoints.kubeflow-ci-deployment.cloud.goog
      dnsNames:
      - mykubeflow.endpoints.myproject.cloud.goog
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod
      secretName: envoy-ingress-tls
    status:
      acme:
        order:
          url: https://acme-v02.api.letsencrypt.org/acme/order/54483154/382580193
      conditions:
      - lastTransitionTime: 2019-04-02T23:00:28Z
        message: Certificate issued successfully
        reason: CertIssued
        status: "True"
        type: Ready
      - lastTransitionTime: null
        message: Order validated
        reason: OrderValidated
        status: "False"
        type: ValidateFailed
    
    • The most recent condition should be Certificate issued successfully

    • It can take around 10 minutes to provision a certificate after the GCP loadbalancer is created

    • The most common error is hitting Let's Encrypt quota issues

      • Let's Encrypt enforces a quota of 5 duplicate certificates per week

      • The easiest fix to quota issues is to pick a different hostname by recreating and redeploying Kubeflow with a different name

      • For example, if you originally ran

        kfctl init myapp --project=myproject --platform=gcp
        
      • Rerun kfctl with a different name that you have not used before

        kfctl init myapp-unique --project=myproject --platform=gcp
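
The certificate check in this step can also be done without reading the full YAML. This is a minimal sketch, assuming the certificate is named envoy-ingress-tls in the kubeflow namespace and exposes a Ready condition as in the output above. The function names certificate_ready and certificate_messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; namespace and certificate name default to the ones used in this guide.

# Succeed only if the certificate's Ready condition has status "True".
# $1 = namespace (default kubeflow), $2 = certificate name (default envoy-ingress-tls).
certificate_ready() {
  cert_status="$(kubectl -n "${1:-kubeflow}" get certificate "${2:-envoy-ingress-tls}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" || return 1
  [ "${cert_status}" = "True" ]
}

# Print every condition as "type: message", one per line, to help diagnose failures.
certificate_messages() {
  kubectl -n "${1:-kubeflow}" get certificate "${2:-envoy-ingress-tls}" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}
```

A typical use would be `certificate_ready || certificate_messages`, which prints the condition messages only when the certificate is not yet ready.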
        
  3. Wait for the load balancer to report the backends as healthy

    # Assumes NAMESPACE (kubeflow in this guide) and PROJECT (your GCP project ID) are already set
    NODE_PORT=$(kubectl --namespace="${NAMESPACE}" get svc envoy -o jsonpath='{.spec.ports[0].nodePort}')
    BACKEND_NAME=$(gcloud compute --project="${PROJECT}" backend-services list --filter="name~k8s-be-${NODE_PORT}-" --format='value(name)')
    gcloud compute --project="${PROJECT}" backend-services get-health --global "${BACKEND_NAME}"
    
    https://www.googleapis.com/compute/v1/projects/kubeflow-ci-deployment/zones/us-east1-b/instanceGroups/k8s-ig--686aad7559e1cf0e
    status:
       healthStatus:
       - healthState: HEALTHY
         instance: https://www.googleapis.com/compute/v1/projects/kubeflow-ci-deployment/zones/us-east1-b/instances/gke-kf-vmaster-n01-kf-vmaster-n01-cpu-66360615-xjrc
         ipAddress: 10.142.0.8
         port: 32694
       - healthState: HEALTHY
         instance: https://www.googleapis.com/compute/v1/projects/kubeflow-ci-deployment/zones/us-east1-b/instances/gke-kf-vmaster-n01-kf-vmaster-n01-cpu-66360615-gmmx
         ipAddress: 10.142.0.13
         port: 32694
       kind: compute#backendServiceGroupHealth
    
    • Both backends should be reported as healthy

    • It can take several minutes for the load balancer to consider the backend healthy

    • The service with port ${NODE_PORT} is the one we care about most since that is the one handling Kubeflow traffic

    • If the backend is unhealthy, check the status of the envoy pods

      kubectl -n kubeflow get pods -l service=envoy
      NAME                     READY     STATUS    RESTARTS   AGE
      envoy-69bf97959c-29dnw   2/2       Running   2          1d
      envoy-69bf97959c-5w5rl   2/2       Running   3          1d
      envoy-69bf97959c-9cjtg   2/2       Running   3          1d
      
      • The pods should have status Running

      • A small number of restarts is expected since the envoy containers need to be restarted as part of their configuration process

    • If the pods are crash looping, look at the logs to figure out why

      kubectl -n kubeflow logs ${POD}
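
The health check in this step can be reduced to a pass/fail result. This is a minimal sketch that reads the get-health output on standard input and assumes it has the YAML-like shape shown above, with one healthState line per instance. The function name all_backends_healthy is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "gcloud compute backend-services get-health" output on stdin.
# Succeeds only if at least one healthState line is present and every one is HEALTHY.
all_backends_healthy() {
  awk '
    /healthState:/ { total++; if ($NF != "HEALTHY") bad++ }
    END { exit (total > 0 && bad == 0) ? 0 : 1 }
  '
}
```

For example, piping the get-health command from this step into all_backends_healthy yields an exit status that a script can poll until the backends come up.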
      
  4. Now that the certificate exists, the ingress should report that it is serving on HTTPS as well

    kubectl -n kubeflow get ingress
    NAME            HOSTS                                       ADDRESS          PORTS     AGE
    envoy-ingress   mykubeflow.endpoints.myproject.cloud.goog   35.244.132.159   80, 443   1d
    
    • If you don't see 443, look at the ingress events using kubectl describe to see if there are any errors
  5. Try accessing IAP at the fully qualified domain name in your web browser

    https://${FQDN}     
    
    • If you get SSL errors, this typically means your SSL certificate is still propagating; wait a bit and try again

      • SSL propagation can take up to 10 minutes
    • If you are not asked to log in and you get a 404 error, IAP is still being configured

      • Keep retrying for up to 10 minutes
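
The two failure modes in this step can be told apart from the command line as well as in a browser. This is a minimal sketch, assuming curl is installed. The function names classify_probe and probe_iap and the diagnosis strings they print are hypothetical. Any HTTP status other than 404 is reported as-is, without interpretation.

```shell
#!/bin/sh
# Hypothetical helpers for interpreting a probe of the IAP endpoint.

# Map a curl exit status ($1) and an HTTP status code ($2) to a short diagnosis.
classify_probe() {
  if [ "$1" -ne 0 ]; then
    echo "connection-or-tls-error"   # for example, the certificate is still propagating
  elif [ "$2" = "404" ]; then
    echo "iap-not-ready"             # IAP is still being configured
  else
    echo "http-$2"                   # reported as-is
  fi
}

# Probe https://<hostname> once and print the diagnosis. $1 = fully qualified domain name.
probe_iap() {
  probe_code="$(curl -s -o /dev/null -w '%{http_code}' "https://$1")" && probe_rc=0 || probe_rc=$?
  classify_probe "${probe_rc}" "${probe_code}"
}
```

Running `probe_iap "${FQDN}"` every minute or so for the 10 minutes mentioned above gives a rough picture of whether the endpoint is still settling.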
  6. If, after logging in, you get the error Error: redirect_uri_mismatch, the OAuth authorized redirect URIs do not include your domain

    • The full error message will look like the following and include the relevant links

      The redirect URI in the request, https://mykubeflow.endpoints.myproject.cloud.goog/_gcp_gatekeeper/authenticate, does not match the ones authorized for the OAuth client. 
      To update the authorized redirect URIs, visit: https://console.developers.google.com/apis/credentials/oauthclient/22222222222-7meeee7a9a76jvg54j0g2lv8lrsb4l8g.apps.googleusercontent.com?project=22222222222
      
    • Follow the link in the error message to navigate to the OAuth credential being used and add the redirect URI listed in the error message to the list of authorized URIs
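
To know in advance which URI needs to be authorized, the value can be derived from the hostname. This is a minimal sketch that follows the URI pattern shown in the error message above. The function name iap_redirect_uri is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the redirect URI for a given hostname,
# following the pattern that appears in the redirect_uri_mismatch error message.
# $1 = fully qualified domain name of the Kubeflow endpoint.
iap_redirect_uri() {
  printf 'https://%s/_gcp_gatekeeper/authenticate\n' "$1"
}
```

For example, `iap_redirect_uri mykubeflow.endpoints.myproject.cloud.goog` prints the same URI that appears in the sample error message, which is the value to add to the OAuth client's authorized redirect URIs.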