Commit: mash: sops/hubploy/chart-config

consideRatio committed Jun 27, 2020
1 parent 18c43da commit c8e957c
Showing 11 changed files with 310 additions and 20 deletions.
7 changes: 7 additions & 0 deletions .sops.yaml
@@ -0,0 +1,7 @@
# TODO: Write notes
# gcloud auth login
# gcloud auth application-default login
# sops
creation_rules:
  - path_regex: .*/secrets/.*
    gcp_kms: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
75 changes: 75 additions & 0 deletions book/work-log.md
@@ -211,3 +211,78 @@ I looked through all the quotas, and given the plan to use m1-ultramem-40 nodes
with ~80 users each on them, I concluded we would fit 2400 users in 30 nodes.
30*40 is 1200 CPUs and our current CPU quota is 500. So, due to that, it felt
sensible to request an increase. I requested a quota of 1500 CPUs.

### GKE

I created a GKE cluster; this was the equivalent gcloud command. It failed.

```shell
gcloud beta container --project "neurohackademy" clusters create "nh-2020" \
  --region "us-east1" --no-enable-basic-auth --cluster-version "1.16.9-gke.6" \
  --machine-type "n1-standard-4" --image-type "COS" --disk-type "pd-standard" --disk-size "100" \
  --node-labels hub.jupyter.org/node-purpose=core --metadata disable-legacy-endpoints=true \
  --service-account "gke-node-core@neurohackademy.iam.gserviceaccount.com" --num-nodes "1" \
  --enable-stackdriver-kubernetes --enable-private-nodes --master-ipv4-cidr "10.60.0.0/28" \
  --enable-ip-alias --network "projects/neurohackademy/global/networks/neurohackademy" \
  --subnetwork "projects/neurohackademy/regions/us-east1/subnetworks/us-east1" \
  --cluster-secondary-range-name "pods" --services-secondary-range-name "services" \
  --default-max-pods-per-node "110" --enable-network-policy \
  --enable-master-authorized-networks --master-authorized-networks 0.0.0.0/0 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
  --node-locations "us-east1-b" \
&& gcloud beta container --project "neurohackademy" node-pools create "user" \
  --cluster "nh-2020" --region "us-east1" --node-version "1.16.9-gke.6" \
  --machine-type "m1-ultramem-40" --image-type "COS" --disk-type "pd-standard" --disk-size "100" \
  --node-labels hub.jupyter.org/node-purpose=user --metadata disable-legacy-endpoints=true \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
  --service-account "gke-node-user@neurohackademy.iam.gserviceaccount.com" --num-nodes "0" \
  --enable-autoscaling --min-nodes "0" --max-nodes "25" \
  --no-enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
  --node-locations "us-east1-b"
```

Apparently only two m1-ultramem-40 nodes were available, which was unexpected.
I received the following error:

> Google Compute Engine: Not all instances running in IGM after 56.02936237s. Expect 3. Current errors: [GCE_STOCKOUT]: Instance 'gke-nha-2020-user-8565ecfe-phhk' creation failed: The zone 'projects/neurohackademy/zones/us-east1-c' does not have enough resources available to fulfill the request. '(resource type:compute)'.

I learned that there was no easy way to inspect availability, but I
successfully scaled to 25 nodes in us-central1-a for a brief ~5 minutes and
decided to go with that zone over us-central1-c. I concluded that the test cost
me about 13 USD.
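
For reference, the scale test itself was just a node-pool resize; a sketch of
the kind of commands involved, assuming the cluster and node pool names used
above:

```shell
# temporarily scale the user pool up to probe real capacity in a zone
gcloud container clusters resize nh-2020 --node-pool user --num-nodes 25 --region us-east1 --quiet
# count how many nodes actually came up
kubectl get nodes -l hub.jupyter.org/node-purpose=user
# scale back down to stop paying for the test
gcloud container clusters resize nh-2020 --node-pool user --num-nodes 0 --region us-east1 --quiet
```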

---

I got a response from Google support; they recommended using us-east1. From the
[GCP docs about zones and their
resources](https://cloud.google.com/compute/docs/regions-zones#available) I
concluded that us-east1-(b,c,d) were allowed zones, but only b and d had
m1-ultramem-40 nodes. I tried starting up 25 nodes in us-east1-d first, but got
stuck at 2 and hit the GCE_STOCKOUT issue on the rest.

On us-east1-b I managed to start up 25 nodes though, so now I'm going to assume
the preparation is as good as it gets.
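
For the record, which zones offer the machine type at all can be checked from
the CLI; a sketch (this reflects the catalog only and, as seen above, says
nothing about live stock):

```shell
# zones where the m1-ultramem-40 machine type exists in the catalog
gcloud compute machine-types list --filter="name=m1-ultramem-40" --format="value(zone)"
```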

### SOPS

Seeing that Yuvi Panda advocated for a transition to SOPS and, among other
things, put in work to make hubploy use it, it made sense to set that up
instead of staying with git-crypt. See for example [this open
PR](https://github.com/yuvipanda/hubploy/pull/81).

I used [these steps from the SOPS
documentation](https://github.com/mozilla/sops#encrypting-using-gcp-kms) to
set up a Google Cloud KMS keyring. Here is [a link to the GCP web
console](https://console.cloud.google.com/security/kms?project=neurohackademy).

```shell
# create a keyring
gcloud kms keyrings create nh-2020 --location global
gcloud kms keyrings list --location global
# resulting keyring: projects/neurohackademy/locations/global/keyRings/nh-2020

# create a key
gcloud kms keys create main --location global --keyring nh-2020 --purpose encryption
gcloud kms keys list --location global --keyring nh-2020
# resulting key: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
```
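
For other team members to use the key, they also need KMS permissions on it; a
sketch (the member and role below are illustrative):

```shell
# grant a collaborator encrypt/decrypt rights on the key
gcloud kms keys add-iam-policy-binding main \
  --location global --keyring nh-2020 \
  --member user:collaborator@example.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```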

```yaml
# content of .sops.yaml
creation_rules:
  - path_regex: .*/secrets/.*
    gcp_kms: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
```

```shell
# login to a google cloud account
gcloud auth login

# request a credentials file for use
gcloud auth application-default login

# encrypt a new file
sops --encrypt --in-place deployments/hub.neurohackademy.org/secrets/prod.yaml

# edit the encrypted file (decrypted into your $EDITOR, re-encrypted on save)
sops deployments/hub.neurohackademy.org/secrets/prod.yaml
```
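
Decryption is symmetric; for completeness, a sketch:

```shell
# decrypt to stdout (requires access to the KMS key via the
# application-default credentials requested above)
sops --decrypt deployments/hub.neurohackademy.org/secrets/prod.yaml
```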
40 changes: 20 additions & 20 deletions chart/Chart.yaml
@@ -10,26 +10,26 @@ dependencies:
   version: 0.9.0-n078.ha6fb810
   repository: https://jupyterhub.github.io/helm-chart/

-  # Nginx-Ingress for highly available ingress routing
-  # https://hub.helm.sh/charts/stable/nginx-ingress
-  - name: nginx-ingress
-    version: 1.39.1
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Nginx-Ingress for highly available ingress routing
+  # # https://hub.helm.sh/charts/stable/nginx-ingress
+  # - name: nginx-ingress
+  #   version: 1.39.1
+  #   repository: https://kubernetes-charts.storage.googleapis.com/

-  # Cert-Manager for automatic cert acquisition from Let's Encrypt
-  # https://hub.helm.sh/charts/jetstack/cert-manager
-  - name: cert-manager
-    version: v0.15.1
-    repository: https://charts.jetstack.io
+  # # Cert-Manager for automatic cert acquisition from Let's Encrypt
+  # # https://hub.helm.sh/charts/jetstack/cert-manager
+  # - name: cert-manager
+  #   version: v0.15.1
+  #   repository: https://charts.jetstack.io

-  # Prometheus for collection of metrics
-  # https://hub.helm.sh/charts/stable/prometheus
-  - name: prometheus
-    version: 11.4.0
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Prometheus for collection of metrics
+  # # https://hub.helm.sh/charts/stable/prometheus
+  # - name: prometheus
+  #   version: 11.4.0
+  #   repository: https://kubernetes-charts.storage.googleapis.com/

-  # Grafana for dashboarding of metrics
-  # https://hub.helm.sh/charts/stable/grafana
-  - name: grafana
-    version: 5.1.4
-    repository: https://kubernetes-charts.storage.googleapis.com/
+  # # Grafana for dashboarding of metrics
+  # # https://hub.helm.sh/charts/stable/grafana
+  # - name: grafana
+  #   version: 5.1.4
+  #   repository: https://kubernetes-charts.storage.googleapis.com/
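
After a dependency change like this, the chart's vendored dependencies need to
be refreshed before the next deploy; a sketch, assuming Helm 3:

```shell
# re-resolve and re-vendor chart dependencies after editing Chart.yaml
helm dependency update ./chart
```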
16 changes: 16 additions & 0 deletions chart/templates/configmap.yaml
@@ -0,0 +1,16 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: hub-etc-jupyterhub-templates
data:
  {{- (.Files.Glob "files/etc/jupyterhub/templates/*").AsConfig | nindent 2 }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hub-usr-local-share-jupyterhub-static-external
binaryData:
  {{- $root := . }}
  {{- range $path, $bytes := .Files.Glob "files/static/external/*" }}
  {{ base $path }}: '{{ $root.Files.Get $path | b64enc }}'
  {{- end }}
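
To sanity-check that the globbed files actually land in these ConfigMaps, the
chart can be rendered locally; a sketch, assuming Helm 3:

```shell
# render just this template and inspect the generated data keys
helm template ./chart --show-only templates/configmap.yaml
```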
12 changes: 12 additions & 0 deletions chart/templates/pv.yaml
@@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: {{ .Values.nfs.serverIP | quote }}
    path: "/{{ .Values.nfs.serverName }}"
13 changes: 13 additions & 0 deletions chart/templates/pvc.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  # Match name of PV
  volumeName: nfs-pv
  storageClassName: ""
  resources:
    requests:
      storage: 1Mi
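
Nothing mounts this PVC yet; a hypothetical sketch of how user pods could
mount it through the underlying z2jh chart's singleuser storage options (the
volume name and mount path are illustrative, not part of this commit):

```yaml
# hypothetical config addition: mount the NFS share read-only in user pods
jupyterhub:
  singleuser:
    storage:
      extraVolumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs-pvc # matches chart/templates/pvc.yaml above
      extraVolumeMounts:
        - name: nfs
          mountPath: /nfs
          readOnly: true
```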
14 changes: 14 additions & 0 deletions chart/values.yaml
@@ -0,0 +1,14 @@
jupyterhub:
  hub:
    extraVolumes:
      - name: hub-etc-jupyterhub-templates
        configMap:
          name: hub-etc-jupyterhub-templates
      - name: hub-usr-local-share-jupyterhub-static-external
        configMap:
          name: hub-usr-local-share-jupyterhub-static-external
    extraVolumeMounts:
      - mountPath: /etc/jupyterhub/templates
        name: hub-etc-jupyterhub-templates
      - mountPath: /usr/local/share/jupyterhub/static/external
        name: hub-usr-local-share-jupyterhub-static-external
106 changes: 106 additions & 0 deletions deployments/hub.neurohackademy.org/config/prod.yaml
@@ -0,0 +1,106 @@
nfs:
  # Output from:
  # gcloud beta filestore instances describe nh-2020 --location=us-east1-b
  serverIP: <todo>
  serverName: nh

jupyterhub:
  ## ingress: should be enabled if we transition to use nginx-ingress +
  ## cert-manager.
  ##
  # ingress:
  #   enabled: true
  #   annotations:
  #     kubernetes.io/tls-acme: "true"
  #     kubernetes.io/ingress.class: nginx
  #   hosts:
  #     - hub.neurohackademy.org
  #   tls:
  #     - secretName: jupyterhub-tls
  #       hosts:
  #         - hub.neurohackademy.org

  prePuller:
    hook:
      enabled: true
    continuous:
      enabled: true

  scheduling:
    userScheduler:
      enabled: true
      replicas: 2
    podPriority:
      enabled: true
    userPlaceholder:
      enabled: true
      replicas: 0
    corePods:
      nodeAffinity:
        matchNodePurpose: require
    userPods:
      nodeAffinity:
        matchNodePurpose: require

  singleuser:
    ## initContainers:
    ## We may want this to ensure whatever dataset is mounted through NFS is
    ## readable for jovyan.
    ##
    # initContainers:
    #   - name: volume-mount-hack
    #     image: busybox
    #     command:
    #       - "sh"
    #       - "-c"
    #       - "id && chown 1000:1000 /home/jovyan && ls -lhd /home/jovyan"
    #     securityContext:
    #       runAsUser: 0
    #     volumeMounts:
    #       - name: home
    #         mountPath: /home/jovyan
    #         subPath: "home/{username}"
    ## image:
    ## hubploy is supposed to override this!
    image:
      name: gcr.io/neurohackademy/nh-2020-env
      tag: latest
    ## cpu/memory requests:
    ## We want to fit as many users as possible on an m1-ultramem-40 node while
    ## still ensuring they get up to 24 GB of RAM. At this point during setup,
    ## we also want to allow a user to start on the n1-standard-4 node to save
    ## money.
    cpu:
      guarantee: 0.975
      limit: 40
    memory:
      guarantee: 0.5G
      limit: 24G
    defaultUrl: /lab
    startTimeout: 900

  hub:
    extraConfig:
      # announcements: |
      #   c.JupyterHub.template_vars.update({
      #     'announcement': 'Any message we want to pass to instructors?',
      #   })
      templates: |
        c.JupyterHub.template_paths.insert(0, "/etc/jupyterhub/templates")
      metrics: |
        # With this set to False, the /hub/metrics endpoint will be publicly
        # accessible, just like hub.mybinder.org/hub/metrics is.
        c.JupyterHub.authenticate_prometheus = False
  proxy:
    https:
      enabled: true
      hosts: [hub.neurohackademy.org]
    service:
      type: LoadBalancer
      loadBalancerIP: 34.75.11.207

  cull:
    enabled: true
    timeout: 7200 # 2 hours in seconds
    maxAge: 0 # Allow pods to run forever

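The `<todo>` for `nfs.serverIP` above comes from the Filestore instance; a
sketch of extracting it directly (the format expression is an assumption about
the describe output):

```shell
# print the Filestore instance IP to fill in nfs.serverIP
gcloud beta filestore instances describe nh-2020 --location=us-east1-b \
  --format="value(networks[0].ipAddresses[0])"
```
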
15 changes: 15 additions & 0 deletions deployments/hub.neurohackademy.org/hubploy.yaml
@@ -0,0 +1,15 @@
images:
  image_name: gcr.io/neurohackademy/nh-2020-env
  registry:
    provider: gcloud
    gcloud:
      project: neurohackademy
      service_key: gcr-key.json

cluster:
  provider: gcloud
  gcloud:
    project: neurohackademy
    service_key: gke-key.json
    cluster: nh-2020
    zone: us-east1-b
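
With this file in place, image builds and deploys go through hubploy; a sketch
of the usual invocations, assuming the CLI from the hubploy project linked
above:

```shell
# build and push the user image defined under deployments/hub.neurohackademy.org/image/
hubploy build hub.neurohackademy.org --push
# deploy the chart with config and sops-encrypted secrets for prod
hubploy deploy hub.neurohackademy.org chart prod
```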
1 change: 1 addition & 0 deletions deployments/hub.neurohackademy.org/image/requirements.txt
@@ -0,0 +1 @@
pandas
31 changes: 31 additions & 0 deletions deployments/hub.neurohackademy.org/secrets/prod.yaml
@@ -0,0 +1,31 @@
jupyterhub:
  hub:
    cookieSecret: ENC[AES256_GCM,data:lABDE8UGK886aMd1tg3n94tAdu8ggttVLA9Bedq47NziKZsVJLJbHcWWSHuCyG3jZDK8O5BWKDvJ5b9Q/tuMaQ==,iv:+KLoMbXqs7E+Jq4T0wqp4yRtru6cAM8rgs+yzvieXzY=,tag:+/tYisKSHmJLABhXNiaFwg==,type:str]
  proxy:
    secretToken: ENC[AES256_GCM,data:rshEx6b2qQfQYxZDnmFSVqE35ZcpFrZoCyIKBeI+e0sApresGev3XxzFdVCSmt2LpAf82eZGTnw6vxfExXOWMg==,iv:Nbl5SszrCm72x9momvGsPsrNGpUjn7pKB1gwzm4dXsM=,tag:tcHm4JoFmyeQjPogqfAxGA==,type:str]
  auth:
    type: ENC[AES256_GCM,data:J8fMY/vG,iv:6bmTJ82ZhbJj9391tx9QLr4n/Li/sLHhYdYrQnHznRs=,tag:zHS8V7G6nE1+W4TR4oNdNg==,type:str]
    github:
      clientId: ENC[AES256_GCM,data:jccwgxIXluR+rFHzwxHje3vARdQ=,iv:4ZW4KKDIVqltzuVoaDbncI31MzNBttoj2GpEWuyFvBo=,tag:TFrbtQbtPrs6sntOEhefsA==,type:str]
      clientSecret: ENC[AES256_GCM,data:F++GRNfMdKM8wMBn85arzxal5Eb6n+eETNvFHnB8pSotEdSb3kgZyQ==,iv:ApSvnNBby4pJvFrpin9jU5IhwS98IUrfmFeJaDDcrrM=,tag:jl+CjDTYHwlgcJYYsoHfpQ==,type:str]
      callbackUrl: ENC[AES256_GCM,data:L4pE1qscmn7IMpC5J5fxUHxGqNhmaThQYlP23hx218KX6XgbgntPPPotS52AkYWIPA==,iv:k+uFgfiyEAJ8SFUS+9OzLH8QI8CZbLbgC1Xk8mxi5yc=,tag:ry+NhBXuQyLrlzHq8Spnjw==,type:str]
    admin:
      access: ENC[AES256_GCM,data:o6LtOjU=,iv:UPYdEK61S67NUWzwAdok1nduf3nvNHGZ1mTdNKq5uJ8=,tag:ANxZ036bPwVcTaUUvVzFdA==,type:bool]
      users:
        - ENC[AES256_GCM,data:472dZxlb,iv:FWVDAfD5fzHZqI4uuQxJWvxL9nu9HFqjUIWjVSPab+Y=,tag:HS6uOR3RHv/1rL2ndhASiA==,type:str]
        - ENC[AES256_GCM,data:QLPS7gnPSZx1HS6R,iv:nFWne/f0EQA8aDznsJ1sLj1AqC2RnMS17XMZTi+sb7k=,tag:oDQCdiUbbGBdyosB9Rgpkg==,type:str]
    whitelist:
      users:
        - ENC[AES256_GCM,data:Wy0TQz3ZyXqmJc+Q,iv:U4UgN+oe38Ww2Tp0tmMq5tCcWGGbgvTrkm0D/ORm5xE=,tag:s+siGVLHFii6jPN4JdIlKw==,type:str]
sops:
  kms: []
  gcp_kms:
    - resource_id: projects/neurohackademy/locations/global/keyRings/nh-2020/cryptoKeys/main
      created_at: '2020-06-27T10:26:01Z'
      enc: CiQAPiFjYZSksYpBVrkC/69hHvVRPCp53JrETZ/Gh7jmlcwuTxMSSQBLOy9h6SbGThVLMaXfKm26EPKVFXbB8Mrsdhw8nk/W6m17FQE2FL/1DrQKsLbJkcpIken3pDtUMvdsMhnXJWsem4KTZLDQvvw=
  azure_kv: []
  lastmodified: '2020-06-27T10:30:18Z'
  mac: ENC[AES256_GCM,data:zDf9e6OcUAjZ0W9ehvgr82rEeGVgftlClkjWxedMRjzt3UhkYNocqLT0S4ABooxJI2zEpcJsPLqHPZgg9JHWKIUf+ncj99CfDiTjjs/+iEIDh37oUg59wyfdkhsevUU9U+Pige5JAE2E6lP8/UphySp2CRX441tbjUcEULGnO30=,iv:25TCxURoMNGukMWuKAYnug4IMSb4ee8XjIUIHDZZclI=,tag:ngH+mMaFD/dl1kU8qy0I1A==,type:str]
  pgp: []
  unencrypted_suffix: _unencrypted
  version: 3.5.0
