This repository contains a setup for running a distributed deployment of the Gadgetron in a Kubernetes cluster. The setup has been developed for and tested with Azure Kubernetes Service (AKS) but should work on other Kubernetes deployments too.
The setup uses Horizontal Pod Autoscaling to adjust the number of Gadgetron instances (pods) running in the cluster in response to Gadgetron activity, and it relies on cluster autoscaling to adjust the number of nodes. Specifically, an increase in reconstruction activity leads to the deployment of more Gadgetron instances, and when the resources on existing nodes are exhausted, more nodes are added. Idle nodes are removed from the cluster after some idle time.
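As a rough illustration of the scaling arithmetic, the Kubernetes Horizontal Pod Autoscaler computes the desired replica count as ceil(currentReplicas * currentMetric / targetMetric). The sketch below reproduces that calculation in shell using milli-units. It assumes that the chart's targetInstanceUtilization setting acts as the per-pod average target for the gadgetron_activity metric, and it ignores the autoscaler's tolerance band and stabilization windows. The function name desired_replicas is made up for this example.

```shell
# Sketch of the HPA replica calculation (assumption: averaged per-pod metric).
# Arguments: current replicas, current average metric (milli-units),
#            target metric (milli-units), minReplicas, maxReplicas.
desired_replicas() {
  local current="$1" metric_milli="$2" target_milli="$3" min="$4" max="$5"
  local desired
  # Integer ceiling of current * metric / target.
  desired=$(( (current * metric_milli + target_milli - 1) / target_milli ))
  # Clamp to the configured replica bounds.
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  printf '%s\n' "$desired"
}
```

For example, with 4 replicas averaging 900m activity against a 500m target and bounds of 1 to 20, `desired_replicas 4 900 500 1 20` prints 8.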
Shared files (dependencies and exported data) are stored in persistent volumes, which could be backed by Azure Files.
The Gadgetron uses a script to discover remote worker nodes. The script is specified in the GADGETRON_REMOTE_WORKER_COMMAND environment variable, which references a script added in a ConfigMap. There is also a PreStop lifecycle hook script, which is used to ensure that Gadgetron instances with active connections are not abruptly disconnected.
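To make the idea behind the PreStop hook concrete, here is a minimal sketch of a drain loop that waits until no established TCP connections remain on the Gadgetron port before the pod is allowed to stop. This is not the script shipped in this repository. The function names, the use of `ss -tn`, and the polling interval are all assumptions chosen for illustration.

```shell
# Hypothetical drain helper; NOT the repository's actual PreStop script.

# Count ESTABLISHED TCP connections whose local port equals $1.
# Reads `ss -tn` style lines on stdin:
#   State Recv-Q Send-Q Local:Port Peer:Port
count_established() {
  local port="$1"
  awk -v port="$port" '
    $1 == "ESTAB" {
      n = split($4, parts, ":")      # last field after ":" is the port
      if (parts[n] == port) count++
    }
    END { print count + 0 }
  '
}

# Poll once per second until no connections remain on the port,
# or give up after the timeout (in seconds).
wait_for_drain() {
  local port="$1" timeout="$2" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(ss -tn | count_established "$port")" -eq 0 ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}
```

A hook built on this sketch might call `wait_for_drain 9002 600`. The timeout would need to stay below the pod's termination grace period, otherwise Kubernetes kills the container before the loop finishes.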
The repository contains a script for setting up an Azure Kubernetes Service (AKS) cluster. You can use this to deploy a test cluster for the Gadgetron:
./scripts/deploy-aks.sh -n <NAME OF CLUSTER>
Use ./scripts/deploy-aks.sh --help to see configuration options.
You can also manually deploy the cluster and necessary components:
- Set up a Kubernetes cluster. Please see the instructions for Azure Kubernetes Service (AKS).
- Deploy Prometheus Operator to allow metrics collection from the Gadgetron:
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring \
    --create-namespace \
    --set commonLabels.prometheus=monitor \
    --set prometheus.prometheusSpec.serviceMonitorSelector.matchLabels.prometheus=monitor
  This will install the operator, Prometheus server, Grafana, etc.
- Deploy the Prometheus Adapter:
  helm install --namespace monitoring prometheus-adapter prometheus-community/prometheus-adapter -f custom-metrics/custom-metrics.yaml
  The Prometheus Adapter is responsible for aggregating metrics from Prometheus and exposing them as custom metrics that we can use for scaling the Gadgetron. Be sure to pay attention to the prometheus.url parameter in the custom-metrics.yaml file. It has to point to the Prometheus operator (find it with kubectl get svc -n monitoring).
- Deploy a storage account for the mrd-storage-server:
  storageServerSa="${cluster_name}sa"
  storageServerSa="$(echo "$storageServerSa" | tr '[:upper:]' '[:lower:]' | tr -d '-')"
  az storage account create -n "$storageServerSa" -g "$rg_name" -l "$cluster_location"
  kubectl create secret generic storageserversa \
    --from-literal=connectionString="$(az storage account show-connection-string --name "$storageServerSa" | jq -r .connectionString)" \
    --dry-run=client -o yaml | kubectl apply -f -
  If you choose a different secret name, make a note of it and supply it when deploying the helm chart below.
- Deploy Gadgetron with helm chart:
  helm install <nameofgadgetroninstance> helm/gadgetron/ --set storage.storageClass=azurefile
  Note: Select an appropriate storageClass for your Kubernetes setup.
  To use a specific node pool:
  helm install --set nodeSelector.agentpool=userpool <nameofgadgetroninstance> helm/gadgetron/
  For details on targeting a GPU node pool, see below or test_values.yml.
- Check that metrics are flowing. After deploying the Gadgetron, it should start emitting metrics, and they should be exposed as custom metrics. You can check that you can read them with:
  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gadgetron_activity" | jq .
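The storage account step above derives the account name from the cluster name by lowercasing it and removing hyphens. Azure storage account names must be 3 to 24 characters long and may contain only lowercase letters and digits, so a long cluster name can produce an invalid account name. The sketch below wraps the same transformation in a function that also validates the result. The function name storage_account_name is hypothetical and not part of the repository's scripts.

```shell
# Hypothetical helper: derive and validate a storage account name.
storage_account_name() {
  local cluster_name="$1"
  local name
  # Same transformation as in the step above: append "sa", lowercase, drop hyphens.
  name="$(printf '%s' "${cluster_name}sa" | tr '[:upper:]' '[:lower:]' | tr -d '-')"
  # Azure requires 3-24 characters, lowercase letters and digits only.
  if ! printf '%s' "$name" | grep -Eq '^[a-z0-9]{3,24}$'; then
    echo "invalid storage account name: $name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

It could be used as `storageServerSa="$(storage_account_name "$cluster_name")" || exit 1` before calling az storage account create.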
The Gadgetron helm chart has a number of settings that you will probably need to adjust for your scenario. The defaults will get you started, but a more realistic deployment might have a configuration like this:
# Freeze the image version
image:
  repository: ghcr.io/gadgetron/gadgetron/gadgetron_ubuntu_rt_cuda@sha256:7f55c1fd1cdd45c898884265649dedfe85ac297384ec5f096e3da9298fe61a16
hpa:
  maxReplicas: 20
  minReplicas: 1
  targetInstanceUtilization: 500m
  # Between the hours of 7am and 6pm (UTC) on weekdays keep a minimum of 5 replicas
  schedule:
    up:
      schedule: "0 7 * * 1-5"
      minReplicas: 5
    down:
      schedule: "0 18 * * 1-5"
      minReplicas: 1
# I want to use my GPU nodes
nodeSelector:
  agentpool: userpool
tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
storage:
  dependenciesVolumeSize: 20Gi
  # I need a TB to store data
  dataVolumeSize: 1000Gi
  storageClass: azurefile
resources:
  # Make sure you get a GPU allocated
  limits:
    nvidia.com/gpu: 1
  # Each replica must have 16 cores and 32 GB of RAM, the cluster auto scaler will add nodes if needed
  requests:
    cpu: 16000m
    memory: 32Gi
Once the Gadgetron deployment is live, you can find the cluster IP address for the Gadgetron with something like:
kubectl get svc
And you can easily open a tunnel from your desktop to the Gadgetron with something like:
kubectl port-forward svc/<helm release>-gadgetron 9002:9002
And then connect directly to localhost:9002.
This repo also contains a helm chart and other artifacts for deploying an SSH jump server in the cluster, which you can use to establish an SSH tunnel. Maintaining these tunnels can be cumbersome, and stunnel (see below) is an easier approach. That said, an SSH jump server may be the only way to get access from, say, an imaging device.
Approach adopted from https://github.com/kubernetes-contrib/jumpserver.
First generate some keys for the SSH server and store them in a Kubernetes secret. There is a script for doing this:
./generate_ssh_keys.sh
By default, this script generates RSA, ECDSA, and ED25519 keys. To restrict or expand the set of generated keys, pass the desired algorithms as arguments, e.g.:
./generate_ssh_keys.sh rsa ed25519
Then store the public key for the user to connect, e.g.:
kubectl create secret generic sshkey --from-file=authorizedkeys=/home/<myuser>/.ssh/id_rsa.pub
Replace the public key path with the specific key that you would like to use. Before deploying the SSH jump server you should have two secrets (check with kubectl get secrets) in your cluster: sshkey and ssh-server-keys.
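The check for the two secrets can be scripted. The sketch below reads the output of `kubectl get secrets -o name` (lines of the form secret/NAME) on stdin and prints any required secret that is missing. The function name missing_secrets is hypothetical and not part of this repository.

```shell
# Hypothetical helper: print each required secret name that is absent.
# stdin: output of `kubectl get secrets -o name`, e.g. "secret/sshkey".
# args:  the secret names that must exist.
missing_secrets() {
  local existing name
  existing="$(cat)"
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string.
    if ! printf '%s\n' "$existing" | grep -qxF "secret/${name}"; then
      printf '%s\n' "$name"
    fi
  done
}
```

For example, `kubectl get secrets -o name | missing_secrets sshkey ssh-server-keys` prints nothing when both secrets are present.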
Then deploy the jump server:
helm install sshjump helm/sshjump
Or, if you have stored the keys in secrets with other names:
helm install --set sshKeySecret=alternative-ssh-secret-name --set sshServerKeysSecret=alternative-keys-secret-name sshjump2 helm/sshjump/
The jump server enables the "standard" Gadgetron connection paradigm through an SSH tunnel. The Gadgetron instances themselves are not directly accessible. Discover the relevant IPs and open a tunnel with:
#Public (external) IP:
EXTERNALIP=$(kubectl get svc <sshd-jumpserver-svc> --output=json | jq -r '.status.loadBalancer.ingress[0].ip')
#Internal (cluster) IP:
GTCLUSTERIP=$(kubectl get svc <gadgetron-frontend> --output=json | jq -r '.spec.clusterIP')
#Open tunnel:
ssh -L 9022:${GTCLUSTERIP}:9002 root@${EXTERNALIP}
A simpler approach that should also work with most Kubernetes DNS schemes would be:
ssh -L 9022:<gadgetron-service-name>:9002 root@${EXTERNALIP}
The repo contains a helm chart for deploying stunnel for secure access to the Gadgetron in the cluster. To deploy the stunnel server, you must first have some secrets (pre-shared keys). You can generate those and store them as a secret in the cluster with:
cd stunnel
./create-stunnel-secrets.sh
This will create an stunnel.secrets file containing 10 client secrets. You can also supply your own file prior to running the script. After running the script, confirm that you have an stunnel secret in the cluster with kubectl get secrets.
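If you prefer to supply your own stunnel.secrets file, the sketch below shows one way to generate entries in the IDENTITY:KEY format that stunnel expects for pre-shared keys. It is not the repository's create-stunnel-secrets.sh. The function name, the clientN identities, and the choice of 24 random bytes per key are assumptions for illustration. stunnel requires each key to be at least 16 bytes long.

```shell
# Hypothetical generator for a PSK file; not the repository's script.
# Prints $1 lines of the form "clientN:<base64 key>".
generate_psk_entries() {
  local count="$1" i=1
  while [ "$i" -le "$count" ]; do
    # 24 random bytes encode to exactly 32 base64 characters (no padding).
    printf 'client%d:%s\n' "$i" "$(head -c 24 /dev/urandom | base64)"
    i=$((i + 1))
  done
}
```

Running `generate_psk_entries 10 > stunnel.secrets` produces a file with ten client secrets. Keep the file private, since anyone holding one of the keys can connect through the tunnel.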
If you have deployed the Gadgetron in the cluster with a helm release name of, say, gt1, the DNS name of the Gadgetron service would be gt1-gadgetron. You can deploy an stunnel to interface with that Gadgetron deployment by creating a values.yaml file:
stunnel:
  secretName: stunnel
  connections:
    gadgetron:
      listenPort: 9002
      connectHost: gt1-gadgetron
      connectPort: 9002
service:
  dnsPrefix: mytunneldnsname
The service.dnsPrefix is optional, but when deployed in AKS, it will assign a DNS name to the public IP address of the LoadBalancer so that you can reference it with something like mytunneldnsname.westus2.cloudapp.azure.com instead of the IP address, which may change as you redeploy.
Then deploy the stunnel with:
helm upgrade --install -f values.yaml
On some other host on your (on-prem) network, you can then install stunnel and create an stunnel.conf file:
[gadgetron]
client = yes
accept = 9002
connect = <ip or dns name of tunnel service>:9002
PSKsecrets = stunnel.secrets
The stunnel.secrets file must contain one of the pre-shared secrets created with create-stunnel-secrets.sh. For example, it could look like:
client1:4ZYJY+HoIX1xxZM563VpppejPJNQ4S4Z
Start the stunnel with:
stunnel stunnel.conf
And you should now be able to connect to the Gadgetron from port 9002 on the host where you are running the stunnel client.
You can also use a VPN to connect to the Gadgetron in the Kubernetes cluster. A point-to-site (P2S) VPN connection is recommended. Please consult the Azure P2S VPN guide. The basic steps are:
- Create a gateway subnet in your AKS cluster network.
- Create a VPN Gateway in the subnet.
- Obtain/generate keys and install client software.
- Connect securely using VPN connection.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.