Running e2e tests locally

Pre-requisites

  • We recommend using Python 3.9, along with Poetry.
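    A minimal sketch for installing Poetry and pointing it at Python 3.9 (assumes a python3.9 interpreter is already available on your PATH):
    pip install poetry
    poetry env use python3.9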

On KinD clusters

Pre-requisite for KinD clusters: add the entry `127.0.0.1 kind` to your local /etc/hosts file. This maps your localhost IP address to the KinD cluster's hostname. This step is already performed on GitHub Actions.
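As a quick sketch, on Linux or macOS the entry can be appended with:

  echo "127.0.0.1 kind" | sudo tee -a /etc/hosts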

If the system you run on has an NVIDIA GPU, you can enable GPU support in KinD, which also allows you to run the GPU tests. To enable GPU support on KinD, follow these instructions.

  • Setup Phase:

    make kind-e2e
    export CLUSTER_HOSTNAME=kind
    make setup-e2e
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
    
    To run tests locally on a KinD cluster, disable `rayDashboardOAuthEnabled` in the `codeflare-operator-config` ConfigMap and then restart the CodeFlare Operator, as sketched below.
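    One possible way to do this (a sketch only — the operator namespace and Deployment name depend on how the operator was deployed, so substitute your own values for the placeholders):
    # open the ConfigMap for editing and set rayDashboardOAuthEnabled to false
    kubectl edit configmap codeflare-operator-config -n <operator-namespace>
    # restart the operator Deployment so it picks up the new configuration
    kubectl rollout restart deployment <codeflare-operator-deployment> -n <operator-namespace>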
    
    • (Optional) Create an sdk-user with limited permissions and add it to the cluster to run through the e2e tests:
      # Get KinD certificates
      docker cp kind-control-plane:/etc/kubernetes/pki/ca.crt .
      docker cp kind-control-plane:/etc/kubernetes/pki/ca.key .
    
      # Generate certificates for new user
      openssl genrsa -out user.key 2048
      openssl req -new -key user.key -out user.csr -subj '/CN=sdk-user/O=tenant'
      openssl x509 -req -in user.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out user.crt -days 360
    
      # Add generated certificates to the KinD context
      user_crt=$(base64 --wrap=0 user.crt)
      user_key=$(base64 --wrap=0 user.key)
      yq eval -i ".contexts += {\"context\": {\"cluster\": \"kind-kind\", \"user\": \"sdk-user\"}, \"name\": \"sdk-user\"}" $HOME/.kube/config
      yq eval -i ".users += {\"name\": \"sdk-user\", \"user\": {\"client-certificate-data\": \"$user_crt\", \"client-key-data\": \"$user_key\"}}" $HOME/.kube/config
      cat $HOME/.kube/config
    
      # Cleanup
      rm ca.crt
      rm ca.srl
      rm ca.key
      rm user.crt
      rm user.key
      rm user.csr
    
      # Add RBAC permissions to sdk-user
      kubectl create clusterrole list-ingresses --verb=get,list --resource=ingresses
      kubectl create clusterrolebinding sdk-user-list-ingresses --clusterrole=list-ingresses --user=sdk-user
      kubectl create clusterrole appwrapper-creator --verb=get,list,create,delete,patch --resource=appwrappers
      kubectl create clusterrolebinding sdk-user-appwrapper-creator --clusterrole=appwrapper-creator --user=sdk-user
      kubectl create clusterrole namespace-creator --verb=get,list,create,delete,patch --resource=namespaces
      kubectl create clusterrolebinding sdk-user-namespace-creator --clusterrole=namespace-creator --user=sdk-user
      kubectl create clusterrole list-rayclusters --verb=get,list --resource=rayclusters
      kubectl create clusterrolebinding sdk-user-list-rayclusters --clusterrole=list-rayclusters --user=sdk-user
      kubectl config use-context sdk-user
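      # When finished running tests as sdk-user, you can switch back to the admin context
      # (assumes the default KinD context name kind-kind, as referenced in the yq command above):
      #   kubectl config use-context kind-kind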
    
    
    • Install the latest development version of kueue
    kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
    
  • Test Phase:

    • Once the codeflare-operator, kuberay-operator, and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py
    
    • If the cluster doesn't have NVIDIA GPU support, disable the NVIDIA GPU tests by providing the proper marker:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_kind_test.py -m 'kind and not nvidia_gpu'
    

On OpenShift clusters

  • Setup Phase:

    
    make setup-e2e
    make deploy -e IMG=quay.io/project-codeflare/codeflare-operator:v1.3.0
    
    • Install the latest development version of kueue
    kubectl apply --server-side -k "github.com/opendatahub-io/kueue/config/rhoai?ref=dev"
    

If the system you run on has an NVIDIA GPU, you can enable GPU support on OpenShift, which also allows you to run the GPU tests. To enable GPU support on OpenShift, follow these instructions. Currently the SDK doesn't support tolerations, so e2e tests can't be executed on tainted nodes (e.g. nodes with a GPU taint).
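As a quick sketch, you can check whether any nodes carry taints before running the tests:

  kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints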

  • Test Phase:
    • Once the codeflare-operator, kuberay-operator, and kueue are running and ready, we can run the e2e tests from the codeflare-sdk repository:
    poetry install --with test,docs
    poetry run pytest -v -s ./tests/e2e/mnist_raycluster_sdk_test.py
    
    • To run multiple tests based on the cluster environment, run the e2e tests with the -m option set to the cluster environment marker (kind or openshift):
    poetry run pytest -v -s ./tests/e2e -m openshift
    
    • By default, tests are configured with a timeout of 15 minutes. If necessary, the timeout can be overridden using the --timeout option (in seconds):
    poetry run pytest -v -s ./tests/e2e -m openshift --timeout=1200
    

On OpenShift Disconnected clusters

  • In addition to the OpenShift setup phase described above, a disconnected environment requires the following pre-requisites:
    • Mirror image registry:
      • A mirror image registry hosts the set of container images required locally by the applications and services. This allows images to be pulled without an external network connection and ensures continuous operation and deployment capabilities in a network-isolated environment.
    • PyPI mirror index:
      • When installing Python packages in a disconnected environment, the pip command might fail because it cannot reach packages at external URLs. This can be resolved by setting up a PyPI mirror index on a separate endpoint in the same environment.
    • S3-compatible storage:
      • Some of our distributed training examples require an external storage solution so that all nodes can access the same data in a disconnected environment (for example, common datasets and model files).
      • A MinIO S3-compatible storage instance can be deployed in a disconnected environment using /tests/e2e/minio_deployment.yaml or using support methods in the e2e test suite.
      • The following environment variables configure the pip index URL for accessing the required Python packages and the S3 or MinIO storage for your Ray Train script or interactive session:
    export RAY_IMAGE=quay.io/project-codeflare/ray@sha256:<image-digest>   # prefer an image digest over an image tag in a disconnected environment
    export PIP_INDEX_URL=https://<bastion-node-endpoint-url>/root/pypi/+simple/
    export PIP_TRUSTED_HOST=<bastion-node-endpoint-url>
    export AWS_DEFAULT_ENDPOINT=<s3-compatible-storage-endpoint-url>
    export AWS_ACCESS_KEY_ID=<s3-compatible-storage-access-key>
    export AWS_SECRET_ACCESS_KEY=<s3-compatible-storage-secret-key>
    export AWS_STORAGE_BUCKET=<storage-bucket-name>
    export AWS_STORAGE_BUCKET_MNIST_DIR=<storage-bucket-MNIST-datasets-directory>
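    With these variables exported in the same shell, the e2e tests can then be run as in the OpenShift test phase above, for example:
    poetry run pytest -v -s ./tests/e2e -m openshift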