Running e2e tests locally

Pre-requisites

  • We recommend using Python 3.9, along with Poetry.
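    A minimal sketch for installing Poetry and pointing it at Python 3.9 (assumes a python3.9 interpreter is already available on your PATH):
    pip install poetry
    poetry env use python3.9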

On KinD clusters

Pre-requisite for KinD clusters: add the entry `127.0.0.1 kind` to your local /etc/hosts file. This maps your localhost IP address to the KinD cluster's hostname. This step is already performed on GitHub Actions.
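As a quick sketch, on Linux or macOS the entry can be appended with:

  echo "127.0.0.1 kind" | sudo tee -a /etc/hosts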

If the system you run on has an NVIDIA GPU, you can enable GPU support in KinD, which also allows you to run the GPU tests. To enable GPU support on KinD, follow these instructions.

  • Setup Phase:

    make kind-e2e
    export CLUSTER_HOSTNAME=kind
    make setup-e2e
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
    
    To run tests locally on a KinD cluster, disable `rayDashboardOAuthEnabled` in the `codeflare-operator-config` ConfigMap and then restart the CodeFlare Operator, as sketched below.
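    One possible way to do this (a sketch only — the operator namespace and Deployment name depend on how the operator was deployed, so substitute your own values for the placeholders):
    # open the ConfigMap for editing and set rayDashboardOAuthEnabled to false
    kubectl edit configmap codeflare-operator-config -n <operator-namespace>
    # restart the operator Deployment so it picks up the new configuration
    kubectl rollout restart deployment <codeflare-operator-deployment> -n <operator-namespace>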
    
    • (Optional) Create an sdk-user with limited permissions and add it to the cluster to run through the e2e tests:
      # Get KinD certificates
      docker cp kind-control-plane:/etc/kubernetes/pki/ca.crt .
      docker cp kind-control-plane:/etc/kubernetes/pki/ca.key .
    
      # Generate certificates for new user
      openssl genrsa -out user.key 2048
      openssl req -new -key user.key -out user.csr -subj '/CN=sdk-user/O=tenant'
      openssl x509 -req -in user.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out user.crt -days 360
    
      # Add generated certificates to the KinD context
      user_crt=$(base64 --wrap=0 user.crt)
      user_key=$(base64 --wrap=0 user.key)
      yq eval -i ".contexts += {\"context\": {\"cluster\": \"kind-kind\", \"user\": \"sdk-user\"}, \"name\": \"sdk-user\"}" $HOME/.kube/config
      yq eval -i ".users += {\"name\": \"sdk-user\", \"user\": {\"client-certificate-data\": \"$user_crt\", \"client-key-data\": \"$user_key\"}}" $HOME/.kube/config
      cat $HOME/.kube/config
    
      # Cleanup
      rm ca.crt
      rm ca.srl
      rm ca.key
      rm user.crt
      rm user.key
      rm user.csr
    
      # Add RBAC permissions to sdk-user
      kubectl create clusterrole list-ingresses --verb=get,list --resource=ingresses
      kubectl create clusterrolebinding sdk-user-list-ingresses --clusterrole=list-ingresses --user=sdk-user
      kubectl create clusterrole appwrapper-creator --verb=get,list,create,delete,patch --resource=appwrappers
      kubectl create clusterrolebinding sdk-user-appwrapper-creator --clusterrole=appwrapper-creator --user=sdk-user
      kubectl create clusterrole namespace-creator --verb=get,list,create,delete,patch --resource=namespaces
      kubectl create clusterrolebinding sdk-user-namespace-creator --clusterrole=namespace-creator --user=sdk-user
      kubectl create clusterrole list-rayclusters --verb=get,list --resource=rayclusters
      kubectl create clusterrolebinding sdk-user-list-rayclusters --clusterrole=list-rayclusters --user=sdk-user
      kubectl config use-context sdk-user
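      # When finished running tests as sdk-user, you can switch back to the admin context
      # (assumes the default KinD context name kind-kind, as referenced in the yq command above):
      #   kubectl config use-context kind-kind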
    
    
    • Install the latest development version of kueue
    kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
    
  • Test Phase:

    • Once the codeflare-operator, kuberay-operator, and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py
    
    • If the cluster doesn't have NVIDIA GPU support, disable the NVIDIA GPU tests by providing the proper marker:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py -m 'kind and not nvidia_gpu'
    

On OpenShift clusters

  • Setup Phase:

    
    make setup-e2e
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
    
    • Install the latest development version of kueue
    kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
    

If the system you run on has an NVIDIA GPU, you can enable GPU support on OpenShift, which also allows you to run the GPU tests. To enable GPU support on OpenShift, follow these instructions. Currently the SDK doesn't support tolerations, so e2e tests can't be executed on tainted nodes (e.g. nodes with a GPU taint).
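As a quick sketch, you can check whether any nodes carry taints before running the tests:

  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints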

  • Test Phase:
    • Once the codeflare-operator, kuberay-operator, and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_test.py
    
    • To run multiple tests based on the cluster environment, run the e2e tests with the -m option set to the cluster environment marker (kind or openshift):
    poetry run pytest -v -s ./tests/e2e -m openshift
    
    • By default, tests are configured with a timeout of 15 minutes. If necessary, the timeout can be overridden using the --timeout option (in seconds):
    poetry run pytest -v -s ./tests/e2e -m openshift --timeout=1200
    

On OpenShift Disconnected clusters

  • In addition to the OpenShift setup phase described above, a disconnected environment requires the following pre-requisites:
    • Mirror image registry:
      • A mirror image registry hosts the set of container images required locally by the applications and services. This allows images to be pulled without an external network connection and ensures continuous operation and deployment capabilities in a network-isolated environment.
    • PyPI mirror index:
      • When installing Python packages in a disconnected environment, the pip command might fail because it cannot reach packages at external URLs. This can be resolved by setting up a PyPI mirror index on a separate endpoint in the same environment.
    • S3-compatible storage:
      • Some of our distributed training examples require an external storage solution so that all nodes can access the same data in a disconnected environment (for example, common datasets and model files).
      • A MinIO S3-compatible storage instance can be deployed in a disconnected environment using /tests/e2e/minio_deployment.yaml or using support methods in the e2e test suite.
      • The following environment variables configure the pip index URL for accessing the required Python packages and the S3 or MinIO storage for your Ray Train script or interactive session:
    export RAY_IMAGE=quay.io/project-codeflare/ray@sha256:<image-digest>   # prefer an image digest over an image tag in a disconnected environment
    export PIP_INDEX_URL=https://<bastion-node-endpoint-url>/root/pypi/+simple/
    export PIP_TRUSTED_HOST=<bastion-node-endpoint-url>
    export AWS_DEFAULT_ENDPOINT=<s3-compatible-storage-endpoint-url>
    export AWS_ACCESS_KEY_ID=<s3-compatible-storage-access-key>
    export AWS_SECRET_ACCESS_KEY=<s3-compatible-storage-secret-key>
    export AWS_STORAGE_BUCKET=<storage-bucket-name>
    export AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>
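    With these variables exported in the same shell, the e2e tests can then be run as in the OpenShift test phase above, for example:
    poetry run pytest -v -s ./tests/e2e -m openshift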