Kuberay integration (#2383)
* ray integration

* init securityContext

* reduce cpu request

* UPGRADE.md

* replace Jupyter notebook with JupyterLab

* update

* add KFP to the figure

* Disable istio sidecars for ray head in raycluster_example.yaml

* Update README.md

* Never ever use the default namespace

* Update README.md

* Disable the ray worker sidecar

* Update kustomization.yaml

* Create namespace.yaml

* Update test.sh to use the right namespace

* Update README.md

* Update README.md for Kubeflow 1.7

---------

Co-authored-by: kaihsun <kaihsun@anyscale.com>
juliusvonkohout and kevin85421 committed May 12, 2023
1 parent 182e81d commit afc6a0e
Showing 12 changed files with 36,055 additions and 0 deletions.
26 changes: 26 additions & 0 deletions .github/workflows/ray_kind_test.yaml
@@ -0,0 +1,26 @@
name: Build & Apply Ray manifest in KinD
on:
  pull_request:
    paths:
      - contrib/ray/**

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Install KinD
        run: ./tests/gh-actions/install_kind.sh

      - name: Create KinD Cluster
        run: kind create cluster --image=kindest/node:v1.23.0

      - name: Install kustomize
        run: ./tests/gh-actions/install_kustomize.sh

      - name: Build & Apply manifests
        run: |
          cd contrib/ray/
          make test
12 changes: 12 additions & 0 deletions contrib/ray/Makefile
@@ -0,0 +1,12 @@
KUBERAY_RELEASE_VERSION ?= 0.4.0
KUBERAY_HELM_CHART_REPO ?= https://ray-project.github.io/kuberay-helm/

.PHONY: kuberay-operator/base
kuberay-operator/base:
	mkdir -p kuberay-operator/base
	cd kuberay-operator/base && helm template --include-crds kuberay-operator kuberay-operator --version ${KUBERAY_RELEASE_VERSION} --repo ${KUBERAY_HELM_CHART_REPO} > resources.yaml

.PHONY: test
test:
	./test.sh

5 changes: 5 additions & 0 deletions contrib/ray/OWNERS
@@ -0,0 +1,5 @@
approvers:
  - juliusvonkohout
reviewers:
  - juliusvonkohout
  - kimwnasptd
135 changes: 135 additions & 0 deletions contrib/ray/README.md
@@ -0,0 +1,135 @@
TODO
- The Ray dashboard, worker and head must only be reachable from inside your Kubeflow user namespace.
- Re-enable the Istio sidecar for the Ray head and worker in the user namespace and provide the corresponding Istio AuthorizationPolicies (a sketch follows below). We can keep the Istio sidecar for the kuberay-operator Deployment in the kubeflow namespace, since it does NOT use a webhook but another mechanism to reconcile RayClusters. This also means we do not need a NetworkPolicy for the KubeRay operator.
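
A minimal sketch of such an AuthorizationPolicy, assuming the head Pod carries the KubeRay label `ray.io/node-type: head` and that `development` is the user namespace (both are assumptions to verify against the Pods the operator actually creates):

```yaml
# Hedged sketch for the TODO above; the label selector and namespaces are assumptions.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ray-head-same-namespace-only
  namespace: development
spec:
  selector:
    matchLabels:
      ray.io/node-type: head
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["development"]
```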


> Credit: This manifest draws heavily on the engineering blog ["Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine"](https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke) from Google Cloud.
# Ray
[Ray](https://github.com/ray-project/ray) is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute.

<figure>
<img
src="assets/map-of-ray.png"
alt="Ray">
<figcaption>Stack of Ray libraries - unified toolkit for ML workloads. (ref: https://docs.ray.io/en/latest/ray-overview/index.html)</figcaption>
</figure>

# KubeRay
[KubeRay](https://github.com/ray-project/kuberay) is an open-source Kubernetes operator for Ray. It provides several CRDs to simplify managing Ray clusters on Kubernetes. We will integrate Kubeflow and KubeRay in this document.

# Requirements
* Dependencies
  * `kustomize`: v3.2.0 (The Kubeflow manifests are sensitive to the `kustomize` version.)
  * `Kubernetes`: v1.23

* Computing resources:
  * 16GB RAM
  * 8 CPUs

# Example
<figure>
<img
src="assets/architecture.svg"
alt="ray/kubeflow integration">
<figcaption>Note: (1) Kubeflow Central Dashboard will be renamed to workbench in the future. (2) Kubeflow Pipelines (KFP) is an important component of Kubeflow, but it is not included in this example.</figcaption>
</figure>

## Step 1: Install Kubeflow v1.7-branch
* This example installs Kubeflow with the [v1.7-branch](https://github.com/kubeflow/manifests/tree/v1.7-branch).

* Install all Kubeflow official components and all common services using [one command](https://github.com/kubeflow/manifests/tree/v1.7-branch#install-with-a-single-command).
  * If you do not want to install all components, you can comment out **Knative**, **Katib**, **Tensorboards Controller**, **Tensorboard Web App**, **Training Operator**, and **KServe** from [example/kustomization.yaml](https://github.com/kubeflow/manifests/blob/v1.7-branch/example/kustomization.yaml).

## Step 2: Install KubeRay operator

In keeping with Kubernetes standards, we never use the "default" namespace; the KubeRay operator is deployed to a proper namespace instead, in our case "kubeflow".

```sh
# Install a KubeRay operator and custom resource definitions.
kustomize build kuberay-operator/base | kubectl apply --server-side -f -

# Check KubeRay operator
kubectl get pod -l app.kubernetes.io/component=kuberay-operator
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-5b8cd69758-rkpvh 1/1 Running 0 6m23s
```
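
You can also confirm that the Ray CRDs were registered, e.g.:

```sh
# Expect rayclusters.ray.io (and related KubeRay CRDs) in the output.
kubectl get crd | grep ray.io
```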

## Step 3: Install RayCluster
```sh
# Create a RayCluster CR, and the KubeRay operator will reconcile a Ray cluster
# with 1 head Pod and 1 worker Pod.
# $MY_KUBEFLOW_USER_NAMESPACE must be a proper Kubeflow user namespace with
# Istio sidecar injection enabled -- never the "default" namespace.
export MY_KUBEFLOW_USER_NAMESPACE=development
kubectl apply -f raycluster_example.yaml -n $MY_KUBEFLOW_USER_NAMESPACE

# Check RayCluster
kubectl get pod -l ray.io/cluster=kubeflow-raycluster -n $MY_KUBEFLOW_USER_NAMESPACE
# NAME READY STATUS RESTARTS AGE
# kubeflow-raycluster-head-p6dpk 1/1 Running 0 70s
# kubeflow-raycluster-worker-small-group-l7j6c 1/1 Running 0 70s
```
* `raycluster_example.yaml` uses `rayproject/ray:2.2.0-py38-cpu` as its OCI image. Ray requires closely matching Python and Ray versions between the server (RayCluster) and the client (JupyterLab); see the excerpt below. This image ships:
  * Python 3.8.13
  * Ray 2.2.0
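
An illustrative excerpt of such a RayCluster CR, showing where the image is pinned and the Istio sidecar is disabled (field values other than the image are assumptions; `raycluster_example.yaml` is the authoritative definition):

```yaml
# Illustrative fragment only -- see raycluster_example.yaml for the real file.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: kubeflow-raycluster
spec:
  headGroupSpec:
    template:
      metadata:
        annotations:
          sidecar.istio.io/inject: "false"  # sidecar disabled for now, see the TODO above
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.2.0-py38-cpu
```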

## Step 4: Forward the port of Istio's Ingress-Gateway
* Follow the [instructions](https://github.com/kubeflow/manifests/tree/v1.7-branch#port-forward) to forward the port of Istio's Ingress-Gateway and log in to Kubeflow Central Dashboard.
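
The port-forward command from those instructions is essentially the following (local port 8080 is the manifests' default):

```sh
# Forward the Istio ingress gateway; the dashboard is then reachable at http://localhost:8080.
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
```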

## Step 5: Create a JupyterLab via Kubeflow Central Dashboard
* Click "Notebooks" icon in the left panel.
* Click "New Notebook"
* Select `kubeflownotebookswg/jupyter-scipy:v1.7.0` as OCI image.
* Click "Launch"
* Click "CONNECT" to connect into the JupyterLab instance.

## Step 6: Use the Ray client in JupyterLab to connect to the RayCluster
* As mentioned in Step 3, Ray requires matching Python and Ray versions between the server (RayCluster) and the client (JupyterLab).
```sh
# Check the Python version. Its major and minor versions must match the
# RayCluster's (i.e. Python 3.8).
python --version
# Python 3.8.10

# Install Ray 2.2.0. Quoting avoids shell globbing of the brackets.
pip install -U "ray[default]==2.2.0"
```
* Connect to the RayCluster via the Ray client.
```python
# Open a new .ipynb page.

import ray

# From another namespace, use ray://${RAYCLUSTER_HEAD_SVC}.${NAMESPACE}.svc.cluster.local:${RAY_CLIENT_PORT}.
# Here we connect to the Ray cluster in our own namespace for multi-tenancy,
# and we never use "default" as the namespace, since that would violate Kubernetes standards.
ray.init(address="ray://kubeflow-raycluster-head-svc:10001")
print(ray.cluster_resources())
# {'node:10.244.0.41': 1.0, 'memory': 3000000000.0, 'node:10.244.0.40': 1.0, 'object_store_memory': 805386239.0, 'CPU': 2.0}

# Try a Ray task.
@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# Try a Ray actor.
@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1

    def read(self):
        return self.n

counters = [Counter.remote() for _ in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures))  # [1, 1, 1, 1]
```

# Upgrading
See [UPGRADE.md](UPGRADE.md) for more details.
6 changes: 6 additions & 0 deletions contrib/ray/UPGRADE.md
@@ -0,0 +1,6 @@
# Upgrading
```sh
# Step 1: Update KUBERAY_RELEASE_VERSION in Makefile
# Step 2: Create new KubeRay operator manifest
make kuberay-operator/base
```
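
Because the Makefile assigns the version with `?=`, you can also override it on the command line without editing the file (the target version below is hypothetical):

```sh
# Command-line variables override the Makefile's ?= default.
make kuberay-operator/base KUBERAY_RELEASE_VERSION=0.5.0
```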
1 change: 1 addition & 0 deletions contrib/ray/assets/architecture.svg
Binary file added contrib/ray/assets/map-of-ray.png
20 changes: 20 additions & 0 deletions contrib/ray/kuberay-operator/base/kustomization.yaml
@@ -0,0 +1,20 @@
patches:
# Add a securityContext to the KubeRay operator Pod.
- target:
    kind: Deployment
    labelSelector: "app.kubernetes.io/name=kuberay-operator"
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/securityContext
      value:
        runAsUser: 1000
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
namespace: kubeflow
resources:
- namespace.yaml
- resources.yaml
4 changes: 4 additions & 0 deletions contrib/ray/kuberay-operator/base/namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
