Commit 41d9e11

Elide generating annotations with default values (#62)
1. Always generate an annotation with the version of the helm chart.
2. Do not generate fault tolerance annotations unless the user explicitly sets them.
3. Bump chart version to 1.1.0.
4. Add unit tests for annotation generation.
1 parent e9db365 commit 41d9e11

7 files changed: +106 -88 lines changed


tools/pytorchjob-generator/chart/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@ apiVersion: v2
 name: pytorchjob-generator
 description: An AppWrapper generator for PyTorchJobs
 type: application
-version: 1.0.0
+version: 1.1.0
 appVersion: "v1beta2"

tools/pytorchjob-generator/chart/README.md

Lines changed: 8 additions & 8 deletions
@@ -2,7 +2,7 @@

 An AppWrapper generator for PyTorchJobs

-![Version: 1.0.0](https://img.shields.io/badge/Version-1.0.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v1beta2](https://img.shields.io/badge/AppVersion-v1beta2-informational?style=flat-square)
+![Version: 1.1.0](https://img.shields.io/badge/Version-1.1.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v1beta2](https://img.shields.io/badge/AppVersion-v1beta2-informational?style=flat-square)

 ## Overview

@@ -66,12 +66,12 @@ customize the Jobs generated by the tool.

 | Key | Type | Default | Description |
 |-----|------|---------|-------------|
-| admissionGracePeriodDuration | string | `"60s"` | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| warmupGracePeriodDuration | string | `"300s"` | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| failureGracePeriodDuration | string | `"60s"` | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| retryPausePeriodDuration | string | `"90s"` | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| retryLimit | integer | `3` | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| forcefulDeletionGracePeriodDuration | string | `"600s"` | Customize the forcefulDelectionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
-| deletionOnFailureGracePeriodDuration | string | `"0s"` | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| admissionGracePeriodDuration | string | The AppWrapper defaults will be used | Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| warmupGracePeriodDuration | string | The AppWrapper defaults will be used | Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| failureGracePeriodDuration | string | The AppWrapper defaults will be used | Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| retryPausePeriodDuration | string | The AppWrapper defaults will be used | Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| retryLimit | integer | The AppWrapper defaults will be used | Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| forcefulDeletionGracePeriodDuration | string | The AppWrapper defaults will be used | Customize the forcefulDelectionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
+| deletionOnFailureGracePeriodDuration | string | The AppWrapper defaults will be used | Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/ |
 | restartPolicy | string | `"Never"` | Set Kubernertes policy for restarting failed containers "in place" (without restarting the Pod). |
 | terminationGracePeriodSeconds | integer | Kubernetes's default value is used | Set a non-default pod termination grace period (in seconds). |

tools/pytorchjob-generator/chart/templates/appwrapper.yaml

Lines changed: 15 additions & 0 deletions
@@ -54,13 +54,28 @@ metadata:
 name: {{ .Values.jobName }}
 namespace: {{ required "Please specify a 'namespace' in the user file" .Values.namespace }}
 annotations:
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: "{{ .Chart.Version }}"
+{{- if .Values.admissionGracePeriodDuration }}
 workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: "{{ .Values.admissionGracePeriodDuration }}"
+{{- end }}
+{{- if .Values.warmupGracePeriodDuration }}
 workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: "{{ .Values.warmupGracePeriodDuration }}"
+{{- end }}
+{{- if .Values.failureGracePeriodDuration }}
 workload.codeflare.dev.appwrapper/failureGracePeriodDuration: "{{ .Values.failureGracePeriodDuration }}"
+{{- end }}
+{{- if .Values.retryPausePeriodDuration }}
 workload.codeflare.dev.appwrapper/retryPausePeriodDuration: "{{ .Values.retryPausePeriodDuration }}"
+{{- end }}
+{{- if .Values.retryLimit }}
 workload.codeflare.dev.appwrapper/retryLimit: "{{ .Values.retryLimit }}"
+{{- end }}
+{{- if .Values.forcefulDeletionGracePeriodDuration }}
 workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: "{{ .Values.forcefulDeletionGracePeriodDuration }}"
+{{- end }}
+{{- if .Values.deletionOnFailureGracePeriodDuration }}
 workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: "{{ .Values.deletionOnFailureGracePeriodDuration }}"
+{{- end }}
 labels:
 kueue.x-k8s.io/queue-name: {{ .Values.queueName }}
 {{- include "mlbatch.customLabels" . | indent 8 }}
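
To make the effect of the new guards concrete, here is a sketch of what the annotations block would be expected to render to for a hypothetical user values file that sets only `retryLimit: 5`. The value `5` is an illustrative assumption, and the other metadata fields are left out.

```yaml
# Hypothetical rendered output for a values file that sets only `retryLimit: 5`.
# name, namespace, and labels are omitted for brevity.
metadata:
  annotations:
    # Always emitted; the value comes from .Chart.Version (1.1.0 as of this commit).
    workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: "1.1.0"
    # Emitted because retryLimit was set to a non-empty value.
    workload.codeflare.dev.appwrapper/retryLimit: "5"
    # No other fault tolerance annotations appear, so the AppWrapper
    # controller's own defaults apply to the remaining settings.
```

One consequence that may be worth checking: Go templates treat `0`, the empty string, and nil as false in an `if`, so `retryLimit: 0` would likely be skipped by `{{- if .Values.retryLimit }}` even though the schema allows a minimum of 0. A string such as `"0s"` is non-empty and would still be emitted.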

tools/pytorchjob-generator/chart/tests/__snapshot__/helloworld_test.yaml.snap

Lines changed: 8 additions & 56 deletions
@@ -4,13 +4,7 @@ Adding Volume Mounts:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -149,13 +143,7 @@ Adding initContainers:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -300,13 +288,7 @@ AppWrapper metadata should match snapshot:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -425,13 +407,7 @@ AppWrapper spec should match snapshot:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -550,13 +526,7 @@ Enabling NVMe:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -705,13 +675,7 @@ Enabling RoCE GDR:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -862,13 +826,7 @@ Enabling all advanced features at once:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job
@@ -1123,13 +1081,7 @@ Enabling sshGitConfig injects the envvars, volumes, and volumeMounts:
 kind: AppWrapper
 metadata:
 annotations:
-workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: 0s
-workload.codeflare.dev.appwrapper/failureGracePeriodDuration: 60s
-workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: 600s
-workload.codeflare.dev.appwrapper/retryLimit: "3"
-workload.codeflare.dev.appwrapper/retryPausePeriodDuration: 90s
-workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: 300s
+workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: 1.1.0
 labels:
 kueue.x-k8s.io/queue-name: default-queue
 name: my-job

tools/pytorchjob-generator/chart/tests/helloworld_test.yaml

Lines changed: 31 additions & 1 deletion
@@ -134,7 +134,37 @@ tests:
 command: ["sh", "-c", "echo hello world!"]
 asserts:
 - matchSnapshot:
-path: spec.components[0].template
+path: spec.components[0].template
+
+- it: Setting fault tolerance annotations
+set:
+admissionGracePeriodDuration: "10s"
+warmupGracePeriodDuration: "11s"
+failureGracePeriodDuration: "22s"
+retryPausePeriodDuration: "17s"
+retryLimit: 42
+forcefulDeletionGracePeriodDuration: "19s"
+deletionOnFailureGracePeriodDuration: "2s"
+asserts:
+- isSubset:
+path: metadata.annotations
+content:
+workload.codeflare.dev.appwrapper/admissionGracePeriodDuration: "10s"
+workload.codeflare.dev.appwrapper/warmupGracePeriodDuration: "11s"
+workload.codeflare.dev.appwrapper/failureGracePeriodDuration: "22s"
+workload.codeflare.dev.appwrapper/retryPausePeriodDuration: "17s"
+workload.codeflare.dev.appwrapper/retryLimit: "42"
+workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration: "19s"
+workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: "2s"
+
+- it: Setting just one tolerance annotation
+set:
+deletionOnFailureGracePeriodDuration: "6h"
+asserts:
+- isSubset:
+path: metadata.annotations
+content:
+workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: "6h"

 - it: Enabling all advanced features at once
 set:
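
The two new tests cover the case where fault tolerance values are set. A complementary check for the default case could look like the sketch below. It is not part of this commit, the test name is made up for illustration, and it assumes that the suite supplies the required values (`namespace`, `jobName`, `containerImage`) the same way the neighbouring tests do.

```yaml
# Hypothetical additional helm-unittest case; not part of this commit.
- it: Default values emit only the generator version annotation
  asserts:
    # `equal` compares the whole annotations map, so any fault tolerance
    # annotation that leaked through with default values would fail the test.
    - equal:
        path: metadata.annotations
        value:
          workload.codeflare.dev.mlbatch/pytorchGeneratorVersion: "1.1.0"
```

The hard-coded `"1.1.0"` would need updating on every chart version bump, which is the same maintenance cost the snapshot file already carries.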

tools/pytorchjob-generator/chart/values.schema.json

Lines changed: 29 additions & 15 deletions
@@ -4,14 +4,7 @@
 "required": [
 "namespace",
 "jobName",
-"containerImage",
-"admissionGracePeriodDuration",
-"warmupGracePeriodDuration",
-"failureGracePeriodDuration",
-"retryPausePeriodDuration",
-"retryLimit",
-"forcefulDeletionGracePeriodDuration",
-"deletionOnFailureGracePeriodDuration"
+"containerImage"
 ],
 "additionalProperties": false,
 "properties": {
@@ -125,13 +118,34 @@
 { "type": "null" },
 { "type": "integer", "minimum": 0 }
 ]},
-"admissionGracePeriodDuration": { "$ref": "#/$defs/duration" },
-"warmupGracePeriodDuration": { "$ref": "#/$defs/duration" },
-"failureGracePeriodDuration": { "$ref": "#/$defs/duration" },
-"retryPausePeriodDuration": { "$ref": "#/$defs/duration" },
-"retryLimit": { "type": "integer", "minimum": 0, "maximum": 100 },
-"forcefulDeletionGracePeriodDuration": { "$ref": "#/$defs/duration" },
-"deletionOnFailureGracePeriodDuration" : { "$ref": "#/$defs/duration" }
+"admissionGracePeriodDuration": { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]},
+"warmupGracePeriodDuration": { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]},
+"failureGracePeriodDuration": { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]},
+"retryPausePeriodDuration": { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]},
+"retryLimit": { "oneOf" : [
+{ "type": "null" },
+{ "type": "integer", "minimum": 0, "maximum": 100 }
+]},
+"forcefulDeletionGracePeriodDuration": { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]},
+"deletionOnFailureGracePeriodDuration" : { "oneOf" : [
+{ "type": "null" },
+{ "$ref": "#/$defs/duration" }
+]}
 },

 "if": {
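
With the seven fault tolerance keys removed from `required` and each one wrapped in a `oneOf` that accepts null, a user values file may set any subset of them. The excerpt below is a sketch of what should validate under the revised schema. The specific values are illustrative, the duration strings reuse forms that appear in this commit's tests, and the exact pattern behind `#/$defs/duration` is not shown in this diff.

```yaml
# Hypothetical excerpt of a user values file; concrete values are illustrative.
# The required keys (namespace, jobName, containerImage) are assumed to be
# set elsewhere in the same file.

# Explicit overrides: each of these produces an annotation on the AppWrapper.
retryLimit: 5                                  # integer, 0 to 100 per the schema
warmupGracePeriodDuration: "10s"
deletionOnFailureGracePeriodDuration: "6h"

# An empty value parses as null, which the schema now accepts; no annotation
# is generated and the AppWrapper default applies.
admissionGracePeriodDuration:

# Keys that are left out entirely (for example failureGracePeriodDuration)
# are also valid now that they are no longer listed under "required".
```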

tools/pytorchjob-generator/chart/values.yaml

Lines changed: 14 additions & 7 deletions
@@ -229,31 +229,38 @@ serviceAccountName: # service account name

 # -- (string) Customize the admissionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-admissionGracePeriodDuration: "60s"
+# @default -- The AppWrapper defaults will be used
+admissionGracePeriodDuration:

 # -- (string) Customize the warmupGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-warmupGracePeriodDuration: "300s"
+# @default -- The AppWrapper defaults will be used
+warmupGracePeriodDuration:

 # -- (string) Customize the failureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-failureGracePeriodDuration: "60s"
+# @default -- The AppWrapper defaults will be used
+failureGracePeriodDuration:

 # -- (string) Customize the retryPausePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-retryPausePeriodDuration: "90s"
+# @default -- The AppWrapper defaults will be used
+retryPausePeriodDuration:

 # -- (integer) Customize the retryLimit; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-retryLimit: 3
+# @default -- The AppWrapper defaults will be used
+retryLimit:

 # -- (string) Customize the forcefulDelectionGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-forcefulDeletionGracePeriodDuration: "600s"
+# @default -- The AppWrapper defaults will be used
+forcefulDeletionGracePeriodDuration:

 # -- (string) Customize the deletionOnFailureGracePeriod; see https://project-codeflare.github.io/appwrapper/arch-fault-tolerance/
 # @section -- Fault Tolerance
-deletionOnFailureGracePeriodDuration: "0s"
+# @default -- The AppWrapper defaults will be used
+deletionOnFailureGracePeriodDuration:

 # -- (string) Set Kubernertes policy for restarting failed containers "in place" (without restarting the Pod).
 # @section -- Fault Tolerance
