forked from bigscience-workshop/Megatron-DeepSpeed
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request bigscience-workshop#18 from OpenGPTX/feature/add_k…
…8s_action_runner Feature/add k8s action runner
Showing 6 changed files with 130 additions and 223 deletions.
This file was deleted.
@@ -0,0 +1,21 @@
# Setup and configuration

Runners are spawned and managed exclusively by the [Actions-Runner-Controller](https://github.com/actions-runner-controller/actions-runner-controller) (ARC), so they do not show up in GitHub's Runners settings while idle.

The main backend software can only be installed by the cluster's admins.
However, users in the `project-ns-opengptx` namespace can configure the controller with ordinary k8s deployment YAML, as described in the section below.

Runners authenticate through a GitHub App, as instructed in the ARC repo.


# Deployment files for running GitHub Actions on a k8s cluster

`arc_runner_deployment.yaml` deploys runners managed by the [Actions-Runner-Controller](https://github.com/actions-runner-controller/actions-runner-controller) (ARC). These runners are created only when needed, so they do not permanently block resources on the cluster.

`unmanaged_runner_deployment.yaml` is the simplest way to deploy a runner.
However, it is not recommended for runners with GPU access, because such a runner permanently occupies a GPU on the cluster.

To deploy a runner:
```bash
kubectl apply -f arc_runner_deployment.yaml
```
@@ -0,0 +1,55 @@
# runnerdeployment.yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: action-runner-deploymment
  namespace: project-ns-opengptx
spec:
  replicas: 0
  template:
    spec:
      repository: OpenGPTX/bigscience_megatron_deepspeed
      ephemeral: true
      dockerEnabled: false

      env:
        - name: RUNNER_ASSETS_DIR
          value: "/actions-runner"

      image: hub.cc-asp.fraunhofer.de/dockerhub_proxy_cache/malteos/obmd:22.08-py3-runner
      imagePullPolicy: Always
      resources:
        requests:
          cpu: 3 #<-- same value for requests and limits
          memory: "10Gi"
          nvidia.com/gpu: 1 #Assign the same values for GPU requests and limits.
        limits:
          cpu: 3 #<-- same value for requests and limits
          memory: "10Gi"
          nvidia.com/gpu: 1 #Assign integer values. GPU has no fraction values.

      tolerations:
        - key: "nvidia.com"
          operator: "Equal"
          value: "a100"
          effect: "NoSchedule"

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-autoscaler
  namespace: project-ns-opengptx
spec:
  minReplicas: 0
  maxReplicas: 1

  scaleDownDelaySecondsAfterScaleOut: 120
  scaleTargetRef:
    kind: RunnerDeployment
    name: action-runner-deploymment

  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - OpenGPTX/bigscience_megatron_deepspeed
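For context, a workflow in the repository could target these autoscaled runners roughly as follows. This is a minimal sketch, not part of the commit: the file path, workflow name, job name, and the `nvidia-smi` smoke test are hypothetical, and it assumes the runners carry the default `self-hosted` label and that the runner image exposes `nvidia-smi`.

```yaml
# .github/workflows/gpu-smoke-test.yml (hypothetical path and names)
name: gpu-smoke-test
on: push

jobs:
  smoke-test:
    # Assumption: the ARC-managed runners register with the default
    # "self-hosted" label. A queued run is what the
    # TotalNumberOfQueuedAndInProgressWorkflowRuns metric counts.
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      # Assumption: the runner image makes nvidia-smi available once the
      # pod has been granted its nvidia.com/gpu resource.
      - run: nvidia-smi
```

With `minReplicas: 0`, no runner exists until such a run is queued, so the first job may wait for the image pull and pod start-up.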
@@ -0,0 +1,39 @@
apiVersion: v1
kind: Pod
metadata:
  name: unmanaged-github-runner
  namespace: project-ns-opengptx
spec:
  containers:
    - name: runner-container
      image: hub.cc-asp.fraunhofer.de/dockerhub_proxy_cache/library/nginx:latest #<-- Your docker image
      env:
        - name: REPO_URL
          value: "https://github.com/OpenGPTX/bigscience_megatron_deepspeed"
        - name: REG_TOKEN
          value: "ADMDER2UAJZ57H57SOUUXXXXXXXXX" # Settings > Actions > Runners > New self-hosted runner

      command: ["/bin/sh", "-c"]
      args: #<-- Command to run in the container. Overrides the image's default entry point
        - useradd -m runner && cd /home/runner;
          mkdir actions-runner && cd actions-runner;
          curl -o actions-runner-linux-x64-2.295.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.295.0/actions-runner-linux-x64-2.295.0.tar.gz;
          echo "a80c1ab58be3cd4920ac2e51948723af33c2248b434a8a20bd9b3891ca4000b6  actions-runner-linux-x64-2.295.0.tar.gz" | shasum -a 256 -c;
          tar xzf ./actions-runner-linux-x64-2.295.0.tar.gz;
          su runner -c "cd /home/runner/actions-runner && ./config.sh --unattended --ephemeral --url $(REPO_URL) --token $(REG_TOKEN); ./run.sh"

      resources:
        requests:
          cpu: 3 #<-- same value for requests and limits
          memory: "5Gi"
          nvidia.com/gpu: 1 #Assign the same values for GPU requests and limits.
        limits:
          cpu: 3 #<-- same value for requests and limits
          memory: "5Gi"
          nvidia.com/gpu: 1 #Assign integer values. GPU has no fraction values.

  tolerations:
    - key: "nvidia.com"
      operator: "Equal"
      value: "a100"
      effect: "NoSchedule"
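The manifest above places the registration token directly in the Pod spec. One possible alternative, sketched here under assumptions and untested against this cluster, is to keep the token in a Kubernetes Secret and reference it from the container's environment. The Secret name `github-runner-reg-token` and the placeholder token value are hypothetical; the second document is only a fragment of the Pod spec, not a complete manifest.

```yaml
# Hypothetical Secret holding the runner registration token.
apiVersion: v1
kind: Secret
metadata:
  name: github-runner-reg-token   # hypothetical name
  namespace: project-ns-opengptx
type: Opaque
stringData:
  REG_TOKEN: "<registration-token>"   # placeholder, fill in from GitHub
---
# Fragment of the Pod spec: only the env entry for REG_TOKEN changes.
spec:
  containers:
    - name: runner-container
      env:
        - name: REG_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-runner-reg-token   # must match the Secret above
              key: REG_TOKEN
```

This keeps the token out of the manifest file that is checked into the repository; the container still receives it as the `REG_TOKEN` environment variable.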
Empty file.