Commit: mash: sops/hubploy/chart-config

consideRatio committed Jun 27, 2020
1 parent 18c43da commit c8e957c
Showing 11 changed files with 310 additions and 20 deletions.
7 changes: 7 additions & 0 deletions .sops.yaml
@@ -0,0 +1,7 @@
# TODO: Write notes
# gcloud auth login
# gcloud auth application-default login
# sops
creation_rules:
  - path_regex: .*/secrets/.*
    gcp_kms: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
75 changes: 75 additions & 0 deletions book/work-log.md
@@ -211,3 +211,78 @@ I looked through all the quotas, and given the plan to use m1-ultramem-40 nodes
with ~80 users each on them, I concluded we would fit 2400 users in 30 nodes.
30*40 is 1200 CPUs and our current CPU quota is 500. So, due to that, it felt
sensible to request an increase. I requested a quota of 1500 CPUs.

### GKE

I created a GKE cluster; this was the equivalent gcloud command. It failed.

```shell
gcloud beta container --project "neurohackademy" clusters create "nh-2020" \
  --region "us-east1" --no-enable-basic-auth --cluster-version "1.16.9-gke.6" \
  --machine-type "n1-standard-4" --image-type "COS" --disk-type "pd-standard" --disk-size "100" \
  --node-labels hub.jupyter.org/node-purpose=core --metadata disable-legacy-endpoints=true \
  --service-account "gke-node-core@neurohackademy.iam.gserviceaccount.com" --num-nodes "1" \
  --enable-stackdriver-kubernetes --enable-private-nodes --master-ipv4-cidr "10.60.0.0/28" \
  --enable-ip-alias --network "projects/neurohackademy/global/networks/neurohackademy" \
  --subnetwork "projects/neurohackademy/regions/us-east1/subnetworks/us-east1" \
  --cluster-secondary-range-name "pods" --services-secondary-range-name "services" \
  --default-max-pods-per-node "110" --enable-network-policy \
  --enable-master-authorized-networks --master-authorized-networks 0.0.0.0/0 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
  --node-locations "us-east1-b" \
&& gcloud beta container --project "neurohackademy" node-pools create "user" \
  --cluster "nh-2020" --region "us-east1" --node-version "1.16.9-gke.6" \
  --machine-type "m1-ultramem-40" --image-type "COS" --disk-type "pd-standard" --disk-size "100" \
  --node-labels hub.jupyter.org/node-purpose=user --metadata disable-legacy-endpoints=true \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
  --service-account "gke-node-user@neurohackademy.iam.gserviceaccount.com" --num-nodes "0" \
  --enable-autoscaling --min-nodes "0" --max-nodes "25" \
  --no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
  --node-locations "us-east1-b"
```

Apparently only two m1-ultramem-40 nodes were available, which was unexpected.
I received the following error:

> Google Compute Engine: Not all instances running in IGM after 56.02936237s. Expect 3. Current errors: [GCE_STOCKOUT]: Instance 'gke-nha-2020-user-8565ecfe-phhk' creation failed: The zone 'projects/neurohackademy/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.

I learned that there was no easy way to inspect availability, but I
successfully scaled to 25 nodes in us-central1-a for a brief ~5 minutes and
decided to go with that zone over us-central1-c. I concluded that the test cost
me about 13 USD.
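
For reference, the scale test itself was just a node-pool resize; a sketch of
the kind of commands involved, assuming the cluster and node pool names used
above:

```shell
# temporarily scale the user pool up to probe real capacity in a zone
gcloud container clusters resize nh-2020 --node-pool user --num-nodes 25 --region us-east1 --quiet
# count how many nodes actually came up
kubectl get nodes -l hub.jupyter.org/node-purpose=user
# scale back down to stop paying for the test
gcloud container clusters resize nh-2020 --node-pool user --num-nodes 0 --region us-east1 --quiet
```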

---

I got a response from Google support; they recommended using us-east1. From the
[GCP docs about zones and their
resources](https://cloud.google.com/compute/docs/regions-zones#available) I
concluded that us-east1-(b,c,d) were allowed zones, but only b and d had
m1-ultramem-40 nodes. I tried starting up 25 nodes in us-east1-d first, but got
stuck at 2 and hit the GCE_STOCKOUT issue on the rest.

On us-east1-b I managed to start up 25 nodes though, so now I'm going to assume
the preparation is as good as it gets.
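
For the record, which zones offer the machine type at all can be checked from
the CLI; a sketch (this reflects the catalog only and, as seen above, says
nothing about live stock):

```shell
# zones where the m1-ultramem-40 machine type exists in the catalog
gcloud compute machine-types list --filter="name=m1-ultramem-40" --format="value(zone)"
```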

### SOPS

Seeing that Yuvi Panda advocated for a transition to SOPS and, among other
things, put in work to make hubploy use it, it made sense to set that up
instead of staying with git-crypt. See for example [this open
PR](https://github.com/yuvipanda/hubploy/pull/81).

I used [these steps from the SOPS
documentation](https://github.com/mozilla/sops#encrypting-using-gcp-kms) to
set up a Google Cloud KMS keyring. Here is [a link to the GCP web
console](https://console.cloud.google.com/security/kms?project=neurohackademy).

```shell
# create a keyring
gcloud kms keyrings create nh-2020 --location global
gcloud kms keyrings list --location global
# resulting keyring: projects/neurohackademy/locations/global/keyRings/nh-2020

# create a key
gcloud kms keys create main --location global --keyring nh-2020 --purpose encryption
gcloud kms keys list --location global --keyring nh-2020
# resulting key: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
```
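
For other team members to use the key, they also need KMS permissions on it; a
sketch (the member and role below are illustrative):

```shell
# grant a collaborator encrypt/decrypt rights on the key
gcloud kms keys add-iam-policy-binding main \
  --location global --keyring nh-2020 \
  --member user:collaborator@example.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```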

```yaml
# content of .sops.yaml
creation_rules:
  - path_regex: .*/secrets/.*
    gcp_kms: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
```

```shell
# login to a google cloud account
gcloud auth login

# request a credentials file for use
gcloud auth application-default login

# encrypt a new file
sops --encrypt --in-place deployments/hub.neurohackademy.org/secrets/prod.yaml

# edit the encrypted file (decrypted into your $EDITOR, re-encrypted on save)
sops deployments/hub.neurohackademy.org/secrets/prod.yaml
```
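
Decryption is symmetric; for completeness, a sketch:

```shell
# decrypt to stdout (requires access to the KMS key via the
# application-default credentials requested above)
sops --decrypt deployments/hub.neurohackademy.org/secrets/prod.yaml
```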
40 changes: 20 additions & 20 deletions chart/Chart.yaml
@@ -10,26 +10,26 @@ dependencies:
   version: 0.9.0-n078.ha6fb810
   repository: https://jupyterhub.github.io/helm-chart/

-  # Nginx-Ingress for highly available ingress routing
-  # https://hub.helm.sh/charts/stable/nginx-ingress
-  - name: nginx-ingress
-    version: 1.39.1
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Nginx-Ingress for highly available ingress routing
+  # # https://hub.helm.sh/charts/stable/nginx-ingress
+  # - name: nginx-ingress
+  #   version: 1.39.1
+  #   repository: https://kubernetes-charts.storage.googleapis.com/

-  # Cert-Manager for automatic cert acquisition from Let's Encrypt
-  # https://hub.helm.sh/charts/jetstack/cert-manager
-  - name: cert-manager
-    version: v0.15.1
-    repository: https://charts.jetstack.io
+  # # Cert-Manager for automatic cert acquisition from Let's Encrypt
+  # # https://hub.helm.sh/charts/jetstack/cert-manager
+  # - name: cert-manager
+  #   version: v0.15.1
+  #   repository: https://charts.jetstack.io

-  # Prometheus for collection of metrics
-  # https://hub.helm.sh/charts/stable/prometheus
-  - name: prometheus
-    version: 11.4.0
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Prometheus for collection of metrics
+  # # https://hub.helm.sh/charts/stable/prometheus
+  # - name: prometheus
+  #   version: 11.4.0
+  #   repository: https://kubernetes-charts.storage.googleapis.com/

-  # Grafana for dashboarding of metrics
-  # https://hub.helm.sh/charts/stable/grafana
-  - name: grafana
-    version: 5.1.4
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Grafana for dashboarding of metrics
+  # # https://hub.helm.sh/charts/stable/grafana
+  # - name: grafana
+  #   version: 5.1.4
+  #   repository: https://kubernetes-charts.storage.googleapis.com/
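
After a dependency change like this, the chart's vendored dependencies need to
be refreshed before the next deploy; a sketch, assuming Helm 3:

```shell
# re-resolve and re-vendor chart dependencies after editing Chart.yaml
helm dependency update ./chart
```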
16 changes: 16 additions & 0 deletions chart/templates/configmap.yaml
@@ -0,0 +1,16 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: hub-etc-jupyterhub-templates
data:
  {{- (.Files.Glob "files/etc/jupyterhub/templates/*").AsConfig | nindent 2 }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hub-usr-local-share-jupyterhub-static-external
binaryData:
  {{- $root := . }}
  {{- range $path, $bytes := .Files.Glob "files/static/external/*" }}
  {{ base $path }}: '{{ $root.Files.Get $path | b64enc }}'
  {{- end }}
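
To sanity-check that the globbed files actually land in these ConfigMaps, the
chart can be rendered locally; a sketch, assuming Helm 3:

```shell
# render just this template and inspect the generated data keys
helm template ./chart --show-only templates/configmap.yaml
```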
12 changes: 12 additions & 0 deletions chart/templates/pv.yaml
@@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: {{ .Values.nfs.serverIP | quote }}
    path: "/{{ .Values.nfs.serverName }}"
13 changes: 13 additions & 0 deletions chart/templates/pvc.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  # Match name of PV
  volumeName: nfs-pv
  storageClassName: ""
  resources:
    requests:
      storage: 1Mi
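
Nothing mounts this PVC yet; a hypothetical sketch of how user pods could
mount it through the underlying z2jh chart's singleuser storage options (the
volume name and mount path are illustrative, not part of this commit):

```yaml
# hypothetical config addition: mount the NFS share read-only in user pods
jupyterhub:
  singleuser:
    storage:
      extraVolumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs-pvc # matches chart/templates/pvc.yaml above
      extraVolumeMounts:
        - name: nfs
          mountPath: /nfs
          readOnly: true
```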
14 changes: 14 additions & 0 deletions chart/values.yaml
@@ -0,0 +1,14 @@
jupyterhub:
  hub:
    extraVolumes:
      - name: hub-etc-jupyterhub-templates
        configMap:
          name: hub-etc-jupyterhub-templates
      - name: hub-usr-local-share-jupyterhub-static-external
        configMap:
          name: hub-usr-local-share-jupyterhub-static-external
    extraVolumeMounts:
      - mountPath: /etc/jupyterhub/templates
        name: hub-etc-jupyterhub-templates
      - mountPath: /usr/local/share/jupyterhub/static/external
        name: hub-usr-local-share-jupyterhub-static-external
106 changes: 106 additions & 0 deletions deployments/hub.neurohackademy.org/config/prod.yaml
@@ -0,0 +1,106 @@
nfs:
  # Output from:
  # gcloud beta filestore instances describe nh-2020 --location=us-east1-b
  serverIP: <todo>
  serverName: nh

jupyterhub:
  ## ingress: should be enabled if we transition to use nginx-ingress +
  ## cert-manager.
  ##
  # ingress:
  #   enabled: true
  #   annotations:
  #     kubernetes.io/tls-acme: "true"
  #     kubernetes.io/ingress.class: nginx
  #   hosts:
  #     - hub.neurohackademy.org
  #   tls:
  #     - secretName: jupyterhub-tls
  #       hosts:
  #         - hub.neurohackademy.org

  prePuller:
    hook:
      enabled: true
    continuous:
      enabled: true

  scheduling:
    userScheduler:
      enabled: true
      replicas: 2
    podPriority:
      enabled: true
    userPlaceholder:
      enabled: true
      replicas: 0
    corePods:
      nodeAffinity:
        matchNodePurpose: require
    userPods:
      nodeAffinity:
        matchNodePurpose: require

  singleuser:
    ## initContainers:
    ## We may want this to ensure whatever dataset is mounted through NFS is
    ## readable for jovyan.
    ##
    # initContainers:
    #   - name: volume-mount-hack
    #     image: busybox
    #     command:
    #       - "sh"
    #       - "-c"
    #       - "id && chown 1000:1000 /home/jovyan && ls -lhd /home/jovyan"
    #     securityContext:
    #       runAsUser: 0
    #     volumeMounts:
    #       - name: home
    #         mountPath: /home/jovyan
    #         subPath: "home/{username}"
    ## image:
    ## hubploy is supposed to override this!
    image:
      name: gcr.io/neurohackademy/nh-2020-env
      tag: latest
    ## cpu/memory requests:
    ## We want to fit as many users as possible on an m1-ultramem-40 node while
    ## still ensuring they get up to 24 GB of RAM. At this point during setup,
    ## we also want to allow a user to start on the n1-standard-4 node to save
    ## money.
    cpu:
      guarantee: 0.975
      limit: 40
    memory:
      guarantee: 0.5G
      limit: 24G
    defaultUrl: /lab
    startTimeout: 900

  hub:
    extraConfig:
      # announcements: |
      #   c.JupyterHub.template_vars.update({
      #     'announcement': 'Any message we want to pass to instructors?',
      #   })
      templates: |
        c.JupyterHub.template_paths.insert(0, "/etc/jupyterhub/templates")
      metrics: |
        # With this set to False, the /hub/metrics endpoint will be publicly
        # accessible, just like hub.mybinder.org/hub/metrics is.
        c.JupyterHub.authenticate_prometheus = False
  proxy:
    https:
      enabled: true
      hosts: [hub.neurohackademy.org]
    service:
      type: LoadBalancer
      loadBalancerIP: 34.75.11.207

  cull:
    enabled: true
    timeout: 7200 # 2 hours in seconds
    maxAge: 0 # Allow pods to run forever

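The `<todo>` for `nfs.serverIP` above comes from the Filestore instance; a
sketch of extracting it directly (the format expression is an assumption about
the describe output):

```shell
# print the Filestore instance IP to fill in nfs.serverIP
gcloud beta filestore instances describe nh-2020 --location=us-east1-b \
  --format="value(networks[0].ipAddresses[0])"
```
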
15 changes: 15 additions & 0 deletions deployments/hub.neurohackademy.org/hubploy.yaml
@@ -0,0 +1,15 @@
images:
  image_name: gcr.io/neurohackademy/nh-2020-env
  registry:
    provider: gcloud
    gcloud:
      project: neurohackademy
      service_key: gcr-key.json

cluster:
  provider: gcloud
  gcloud:
    project: neurohackademy
    service_key: gke-key.json
    cluster: nh-2020
    zone: us-east1-b
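
With this file in place, image builds and deploys go through hubploy; a sketch
of the usual invocations, assuming the CLI from the hubploy project linked
above:

```shell
# build and push the user image defined under deployments/hub.neurohackademy.org/image/
hubploy build hub.neurohackademy.org --push
# deploy the chart with config and sops-encrypted secrets for prod
hubploy deploy hub.neurohackademy.org chart prod
```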
1 change: 1 addition & 0 deletions deployments/hub.neurohackademy.org/image/requirements.txt
@@ -0,0 +1 @@
pandas
31 changes: 31 additions & 0 deletions deployments/hub.neurohackademy.org/secrets/prod.yaml
@@ -0,0 +1,31 @@
jupyterhub:
  hub:
    cookieSecret: ENC[AES256_GCM,data:lABDE8UGK886aMd1tg3n94tAdu8ggttVLA9Bedq47NziKZsVJLJbHcWWSHuCyG3jZDK8O5BWKDvJ5b9Q/tuMaQ==,iv:+KLoMbXqs7E+Jq4T0wqp4yRtru6cAM8rgs+yzvieXzY=,tag:+/tYisKSHmJLABhXNiaFwg==,type:str]
  proxy:
    secretToken: ENC[AES256_GCM,data:rshEx6b2qQfQYxZDnmFSVqE35ZcpFrZoCyIKBeI+e0sApresGev3XxzFdVCSmt2LpAf82eZGTnw6vxfExXOWMg==,iv:Nbl5SszrCm72x9momvGsPsrNGpUjn7pKB1gwzm4dXsM=,tag:tcHm4JoFmyeQjPogqfAxGA==,type:str]
  auth:
    type: ENC[AES256_GCM,data:J8fMY/vG,iv:6bmTJ82ZhbJj9391tx9QLr4n/Li/sLHhYdYrQnHznRs=,tag:zHS8V7G6nE1+W4TR4oNdNg==,type:str]
    github:
      clientId: ENC[AES256_GCM,data:jccwgxIXluR+rFHzwxHje3vARdQ=,iv:4ZW4KKDIVqltzuVoaDbncI31MzNBttoj2GpEWuyFvBo=,tag:TFrbtQbtPrs6sntOEhefsA==,type:str]
      clientSecret: ENC[AES256_GCM,data:F++GRNfMdKM8wMBn85arzxal5Eb6n+eETNvFHnB8pSotEdSb3kgZyQ==,iv:ApSvnNBby4pJvFrpin9jU5IhwS98IUrfmFeJaDDcrrM=,tag:jl+CjDTYHwlgcJYYsoHfpQ==,type:str]
      callbackUrl: ENC[AES256_GCM,data:L4pE1qscmn7IMpC5J5fxUHxGqNhmaThQYlP23hx218KX6XgbgntPPPotS52AkYWIPA==,iv:k+uFgfiyEAJ8SFUS+9OzLH8QI8CZbLbgC1Xk8mxi5yc=,tag:ry+NhBXuQyLrlzHq8Spnjw==,type:str]
    admin:
      access: ENC[AES256_GCM,data:o6LtOjU=,iv:UPYdEK61S67NUWzwAdok1nduf3nvNHGZ1mTdNKq5uJ8=,tag:ANxZ036bPwVcTaUUvVzFdA==,type:bool]
      users:
        - ENC[AES256_GCM,data:472dZxlb,iv:FWVDAfD5fzHZqI4uuQxJWvxL9nu9HFqjUIWjVSPab+Y=,tag:HS6uOR3RHv/1rL2ndhASiA==,type:str]
        - ENC[AES256_GCM,data:QLPS7gnPSZx1HS6R,iv:nFWne/f0EQA8aDznsJ1sLj1AqC2RnMS17XMZTi+sb7k=,tag:oDQCdiUbbGBdyosB9Rgpkg==,type:str]
    whitelist:
      users:
        - ENC[AES256_GCM,data:Wy0TQz3ZyXqmJc+Q,iv:U4UgN+oe38Ww2Tp0tmMq5tCcWGGbgvTrkm0D/ORm5xE=,tag:s+siGVLHFii6jPN4JdIlKw==,type:str]
sops:
  kms: []
  gcp_kms:
    - resource_id: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
      created_at: '2020-06-27T10:26:01Z'
      enc: CiQAPiFjYZSksYpBVrkC/69hHvVRPCp53JrETZ/Gh7jmlcwuTxMSSQBLOy9h6SbGThVLMaXfKm26EPKVFXbB8Mrsdhw8nk/W6m17FQE2FL/1DrQKsLbJkcpIken3pDtUMvdsMhnXJWsem4KTZLDQvvw=
  azure_kv: []
  lastmodified: '2020-06-27T10:30:18Z'
  mac: ENC[AES256_GCM,data:zDf9e6OcUAjZ0W9ehvgr82rEeGVgftlClkjWxedMRjzt3UhkYNocqLT0S4ABooxJI2zEpcJsPLqHPZgg9JHWKIUf+ncj99CfDiTjjs/+iEIDh37oUg59wyfdkhsevUU9U+Pige5JAE2E6lP8/UphySp2CRX441tbjUcEULGnO30=,iv:25TCxURoMNGukMWuKAYnug4IMSb4ee8XjIUIHDZZclI=,tag:ngH+mMaFD/dl1kU8qy0I1A==,type:str]
  pgp: []
  unencrypted_suffix: _unencrypted
  version: 3.5.0
