Merged
248 commits
a059c8c
Rename change team request query to "current-team" from "team"
Gerhut Jul 3, 2019
7bc9d1c
Set authorized clusters in session when using password auth
Gerhut Jul 3, 2019
ea4882b
clean useless code and format
hao1939 Jun 26, 2019
7583fcb
add Job JobSchema and test
hao1939 Jun 26, 2019
b5a2f3c
refine validation message
hao1939 Jun 27, 2019
e032ed8
fix job_id schema
hao1939 Jun 27, 2019
07491ea
use Job object in job_manager
hao1939 Jun 27, 2019
d415dac
Job: add field "email"
hao1939 Jun 27, 2019
a1b54a7
Job: add attr "cluster", a dict of cluster config
hao1939 Jun 27, 2019
86147e3
refine mountpoints
hao1939 Jun 27, 2019
6e4814b
refine job_manager
hao1939 Jun 27, 2019
32cbac5
extract the launch-script generation to a method
hao1939 Jun 27, 2019
52c75e9
cleanup useless code
hao1939 Jun 27, 2019
8a19a38
refine: instance job template
hao1939 Jun 27, 2019
cf6ae05
user's mountpoints first, but should after 'jobPath'
hao1939 Jun 28, 2019
8de64e2
cleanup job_manager
hao1939 Jun 28, 2019
d7464b6
extract class PodTemplate
hao1939 Jun 28, 2019
d2c4404
cleanup job_manager.py
hao1939 Jun 28, 2019
b6c6728
refactoring
hao1939 Jun 28, 2019
4a77fb3
minor refactoring
hao1939 Jun 28, 2019
ff4df26
refactoring and fix typo
hao1939 Jun 28, 2019
ba441e1
generate job description in PodTemplate
hao1939 Jun 28, 2019
0eb7a30
return job_description list and fix typo
hao1939 Jun 29, 2019
ea3ce4e
extract job_deployer for create/delete pod
hao1939 Jul 1, 2019
6e97c77
better funciton name for PodTemlate
hao1939 Jul 2, 2019
14587ee
better log format
hao1939 Jul 2, 2019
579e304
refine log
hao1939 Jul 2, 2019
dcbcf33
refactoring dist job
hao1939 Jul 2, 2019
1d4dccd
add default value: enable_custom_scheduler=False
hao1939 Jul 3, 2019
d7e2eb5
fix comments
hao1939 Jul 4, 2019
be6b449
refine SubmitPSDistJob()
hao1939 Jul 4, 2019
412e3e3
a better func name
hao1939 Jul 4, 2019
e4b0c20
dist job: add label sshPort
hao1939 Jul 4, 2019
1631366
extract DistPodTemplate.generate_pod()
hao1939 Jul 4, 2019
0801c07
extract DistPodTemplate.generate_pods()
hao1939 Jul 4, 2019
a7fd729
refactoring jobmanager
hao1939 Jul 4, 2019
eba3160
cleanup
hao1939 Jul 4, 2019
b67486e
update field lastUpdated
hao1939 Jul 4, 2019
7ac1a5d
refactoring PodTemplate
hao1939 Jul 4, 2019
827037d
refine
hao1939 Jul 4, 2019
e0bde4d
refine PodTemplate
hao1939 Jul 4, 2019
2c29a20
refactoring DistPodTemplate
hao1939 Jul 4, 2019
211142f
using yaml.sfull_load for safety
hao1939 Jul 4, 2019
0371695
Add password validation and team checking
Gerhut Jul 4, 2019
a3d8f14
refine job template file
hao1939 Jul 4, 2019
f207da8
refine job template again
hao1939 Jul 4, 2019
441b8d8
merge Regular/Dist job template
hao1939 Jul 4, 2019
c3018ef
fix pod["envs"] not properly generated
hao1939 Jul 4, 2019
5bdd37f
add config.yaml for unittest
hao1939 Jul 5, 2019
80af8df
refine
hao1939 Jul 5, 2019
5c6c0b4
mount /pod for Regular job
hao1939 Jul 5, 2019
72da3de
mount /pod for dist job
hao1939 Jul 5, 2019
e4bb0ef
mount /pod for each pod, put launcher.sh under it
hao1939 Jul 5, 2019
04c3441
Merge pull request #406 from hao1939/dltsdev
hongzhili Jul 7, 2019
e786a12
Merge pull request #405 from Gerhut/dltsdev
hongzhili Jul 7, 2019
15e1e9e
add missing cluster prefix in alert manager
xudifsd Jul 10, 2019
e31e93e
add cluster gpu statistic dashboard
xudifsd Jul 11, 2019
a7341ac
fix issue of not fetching through restfulapi without login
Jul 12, 2019
2472a4f
Merge pull request #414 from LeoHongyi/dltsdev
hongzhili Jul 12, 2019
d980716
consider used gpus while calculating reserved on unschedulable nodes
deepak-ms Jul 14, 2019
35193a8
unify log usage
xudifsd Jul 15, 2019
4207b9f
Merge pull request #408 from xudifsd/dixu/add-cluster-in-subject
hongzhili Jul 16, 2019
a755314
Merge pull request #410 from xudifsd/dixu/cluster-gpu-statistci
hongzhili Jul 16, 2019
ea0b8b4
Merge pull request #416 from xudifsd/dixu/log-refactor
hongzhili Jul 16, 2019
91e17e6
Disable distrubed job when using low priority cluster job in job typ…
LeoHongyi Jul 16, 2019
e2a3c1a
Fix issue of changing job template based on low priority
Jul 17, 2019
6548dbf
Fix issue of changing job template based on low priority
Jul 17, 2019
531f7ba
implement notify
xudifsd Jul 9, 2019
399cc3a
call notifier in jobmanager
xudifsd Jul 9, 2019
ef61d40
fix according to review
xudifsd Jul 17, 2019
cb11da1
refine kill job
hao1939 Jul 9, 2019
80318d9
update job.lastUpdated
hao1939 Jul 10, 2019
4f7f3e4
add func pod_exec
hao1939 Jul 10, 2019
ffe8f73
add timeout
hao1939 Jul 11, 2019
786337c
fix typo
hao1939 Jul 12, 2019
7241664
refine
hao1939 Jul 15, 2019
1508b8a
add JobRole
hao1939 Jul 16, 2019
da04f50
install requirements
hao1939 Jul 16, 2019
1b4ac62
refine job bootup script
hao1939 Jul 16, 2019
2f4e019
refine dist job bootup script
hao1939 Jul 16, 2019
be314ec
generate /job/hostfile
hao1939 Jul 16, 2019
c0e9c24
cleanup
hao1939 Jul 16, 2019
36aeb98
add env "DLWS_ROLE_NAME" for regular job
hao1939 Jul 16, 2019
f15bc28
cleanup
hao1939 Jul 16, 2019
7fc3446
ignore "delete" error when pod not existing
hao1939 Jul 16, 2019
756551b
fix role status
hao1939 Jul 16, 2019
30d5b7d
when job is "running", user should be ready
hao1939 Jul 16, 2019
dba21fe
create configmap dlws-scripts
hao1939 Jul 16, 2019
9458f69
keep the orignal order of output
hao1939 Jul 17, 2019
87c34ff
notify on success
xudifsd Jul 17, 2019
05ad217
container return the same exit code as the job command
hao1939 Jul 17, 2019
6184b54
unify job status for dist/regular jobs
hao1939 Jul 17, 2019
849ab7d
cleanup
hao1939 Jul 17, 2019
38319d9
Fix issue of changing job template based on low priority
Jul 17, 2019
0bfd4c8
Add master key support (#421)
Gerhut Jul 17, 2019
a6f3ce7
tolerate master node in job/node-exporter (#420)
xudifsd Jul 18, 2019
0921e2d
put manager into dedicated process
xudifsd Jul 17, 2019
e2deff4
Merge pull request #3 from xudifsd/dixu/manager-per-process
hao1939 Jul 18, 2019
b8ed68c
add missing time import
xudifsd Jul 18, 2019
8b69df1
Merge pull request #4 from xudifsd/dixu/manager-per-process
hao1939 Jul 18, 2019
a559c47
reformat
hao1939 Jul 18, 2019
24f0a06
Merge pull request #409 from hao1939/dltsdev
hongzhili Jul 18, 2019
0fb48cd
Merge branch 'jobmanager' into dixu/state-email
hongzhili Jul 18, 2019
4e2d2fc
Merge pull request #407 from xudifsd/dixu/state-email
hongzhili Jul 18, 2019
dae92b6
Merge pull request #419 from LeoHongyi/dltsdev
hongzhili Jul 18, 2019
3d240c1
fix missing imports
hao1939 Jul 18, 2019
649986f
add logging
hao1939 Jul 18, 2019
259ff92
set disable_existing_loggers to False
xudifsd Jul 18, 2019
267048b
Merge pull request #5 from xudifsd/dixu/manager-per-process
hao1939 Jul 18, 2019
fb161d0
output file:lineno in log
hao1939 Jul 18, 2019
b11cc96
catch exception
hao1939 Jul 18, 2019
765cc9e
fix pod "Unknown" status
hao1939 Jul 18, 2019
3459a89
Redirect to Wiki page for unauthorized login user (#423)
LeoHongyi Jul 18, 2019
4c0774d
Fix the job template dropdown issue (#424)
LeoHongyi Jul 19, 2019
d421b66
no longer using "hostport", remove useness code
hao1939 Jul 19, 2019
54f7492
job state transition graph
hao1939 Jul 20, 2019
eb2981f
resubmit "Unknown" job
hao1939 Jul 20, 2019
8ccdbeb
refactoring
hao1939 Jul 20, 2019
02a6d43
no need anymore, all job pod "restartPolicy: Never"
hao1939 Jul 20, 2019
0e8cf0a
fix typo
hao1939 Jul 20, 2019
e9d1389
two potential problems
hao1939 Jul 20, 2019
93b75ba
profile data handler in jobmanager
xudifsd Jul 22, 2019
0acdf54
Merge pull request #6 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
29357c3
fix some bugs
xudifsd Jul 22, 2019
4042459
config exporter port
xudifsd Jul 22, 2019
91be5b1
Merge pull request #7 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
3c558a7
shorter port name
xudifsd Jul 22, 2019
93a8b8a
Merge pull request #8 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
fb3a9d7
fix some bug
xudifsd Jul 22, 2019
3eb2250
Merge pull request #9 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
b6540b0
expose in restfulapi
xudifsd Jul 22, 2019
928fb70
Merge pull request #10 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
7fa12c4
fix bug
xudifsd Jul 22, 2019
5218c61
Merge pull request #11 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
d4e2239
Change title of every table and adjust the orders (#427)
LeoHongyi Jul 23, 2019
ee72f95
fix typo
hao1939 Jul 23, 2019
7dc0556
When node back after "lost", the pod may turn into "NotFound".
hao1939 Jul 23, 2019
2d8fa8e
add useful link
hao1939 Jul 23, 2019
aeb2744
Merge pull request #422 from hao1939/dltsdev
Anbang-Hu Jul 24, 2019
42d8032
add performance dashboard
xudifsd Jul 23, 2019
3678e77
add manager histogram
xudifsd Jul 23, 2019
9e877ba
perf job deployer
xudifsd Jul 23, 2019
4c35dfa
Merge pull request #12 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
f79ad2f
add missing import
xudifsd Jul 24, 2019
47c97e5
Merge pull request #13 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
ad40629
fix conflict error
xudifsd Jul 24, 2019
000913f
Merge pull request #14 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
b21eca7
Merge pull request #429 from hao1939/jobmanager
Anbang-Hu Jul 24, 2019
99f57b4
add reaper
xudifsd Jul 24, 2019
8ba0ca0
Merge pull request #15 from xudifsd/dixu/reaper
hao1939 Jul 24, 2019
d0b7cd8
split jobmanager's metrics with restfulapi's
xudifsd Jul 24, 2019
11f3be0
Merge pull request #16 from xudifsd/dixu/refine-perf-dashboard
hao1939 Jul 24, 2019
4c45f52
share k8s client: to avoid connection leak
hao1939 Jul 24, 2019
fed8c67
add option for "force" delete pod
hao1939 Jul 25, 2019
ebfd771
forcing cleanup job before submit
hao1939 Jul 25, 2019
0d5f87e
Enable distribute job in low priority job template (#432)
LeoHongyi Jul 25, 2019
5b1df00
Clean up advance option & save job template issue
Jul 25, 2019
fbd2454
default using localhost as prometheus ip in grafana
xudifsd Jul 26, 2019
6618ecb
Merge pull request #434 from xudifsd/dixu/default-prometheus-ip
Anbang-Hu Jul 26, 2019
6e0b34b
force delete pod when killing
hao1939 Jul 26, 2019
8fd7693
no need to retry on submit
hao1939 Jul 26, 2019
3a3a14c
refine execption handle
hao1939 Jul 26, 2019
9263d66
reset endpoint when resubmit job
hao1939 Jul 26, 2019
5ecfa44
waiting 30s before resubmit (was 300s)
hao1939 Jul 26, 2019
de777d8
won't change 'pending' endpionts to 'stoped' or other status
hao1939 Jul 26, 2019
30ce59c
narrow "dead endpoint", execlude the endpoints for job in status pend…
hao1939 Jul 26, 2019
e08880e
fix func call parameter
hao1939 Jul 26, 2019
110eda7
correctly use the return value
hao1939 Jul 26, 2019
7c5bf14
Merge pull request #415 from deepak-ms/d8
Anbang-Hu Jul 26, 2019
538ba0a
fix restapi bugs
hao1939 Jul 29, 2019
2da3bb3
Merge pull request #433 from LeoHongyi/dltsdev
Anbang-Hu Jul 30, 2019
95cebe0
Hidden Priviledge docker & keep the logic when job type change
Jul 30, 2019
92b4192
Merge pull request #430 from hao1939/jobmanager
Anbang-Hu Jul 30, 2019
570cc9a
Merge pull request #437 from LeoHongyi/dltsdev
Anbang-Hu Jul 30, 2019
8349cfb
Rename change team request query to "current-team" from "team"
Gerhut Jul 3, 2019
c1ba1d2
Set authorized clusters in session when using password auth
Gerhut Jul 3, 2019
0328e3d
Add password validation and team checking
Gerhut Jul 4, 2019
375199e
add missing cluster prefix in alert manager
xudifsd Jul 10, 2019
ae4671c
add cluster gpu statistic dashboard
xudifsd Jul 11, 2019
d51a6bb
fix issue of not fetching through restfulapi without login
Jul 12, 2019
a76551e
consider used gpus while calculating reserved on unschedulable nodes
deepak-ms Jul 14, 2019
584cf85
Disable distrubed job when using low priority cluster job in job typ…
LeoHongyi Jul 16, 2019
796f942
Fix issue of changing job template based on low priority
Jul 17, 2019
878d7af
Fix issue of changing job template based on low priority
Jul 17, 2019
8abbc21
Fix issue of changing job template based on low priority
Jul 17, 2019
334fb1e
Add master key support (#421)
Gerhut Jul 17, 2019
704a672
tolerate master node in job/node-exporter (#420)
xudifsd Jul 18, 2019
199c8ca
Redirect to Wiki page for unauthorized login user (#423)
LeoHongyi Jul 18, 2019
4fbd4e5
Fix the job template dropdown issue (#424)
LeoHongyi Jul 19, 2019
bf5feb9
Change title of every table and adjust the orders (#427)
LeoHongyi Jul 23, 2019
214f278
Enable distribute job in low priority job template (#432)
LeoHongyi Jul 25, 2019
981482f
Clean up advance option & save job template issue
Jul 25, 2019
622d0cc
default using localhost as prometheus ip in grafana
xudifsd Jul 26, 2019
c375e6b
Hidden Priviledge docker & keep the logic when job type change
Jul 30, 2019
d0c86bd
Merge pull request #438 from hao1939/dltsdev
Anbang-Hu Jul 30, 2019
24ec7da
fix generate ssh config
hao1939 Jul 30, 2019
4c697c4
exec "sleep infinity" on workers
hao1939 Jul 30, 2019
792eb74
fix setup ssh script
hao1939 Jul 30, 2019
67391fb
Merge pull request #439 from hao1939/dltsdev
Anbang-Hu Jul 31, 2019
84b98bf
Enable the preemptible job and disable when low priority job
Jul 31, 2019
d2a9918
fix perf dashboard
xudifsd Jul 31, 2019
beaf0d5
Merge pull request #441 from xudifsd/dixu/fix
Anbang-Hu Jul 31, 2019
0f15860
Merge pull request #440 from LeoHongyi/dltsdev
Anbang-Hu Jul 31, 2019
b2268e0
fix dist job pod_name
hao1939 Jul 31, 2019
26f658c
fix endpoint
hao1939 Jul 31, 2019
d631160
robust
hao1939 Jul 31, 2019
e54c602
fix dist job path
hao1939 Jul 31, 2019
9b59bdd
Merge pull request #442 from hao1939/dltsdev
Anbang-Hu Jul 31, 2019
9cd49fa
use inter-pod affinity to achieve less fragmentation
xudifsd Aug 1, 2019
f5b5e94
fix bootstrap script
hao1939 Aug 1, 2019
a92f92e
add missing label
xudifsd Aug 2, 2019
c1dd2c9
Merge pull request #444 from hao1939/dltsdev
Anbang-Hu Aug 2, 2019
929e5d2
Merge pull request #443 from xudifsd/dixu/less-fragmentation
Anbang-Hu Aug 2, 2019
e33978a
reset endpoint before resubmit
hao1939 Aug 2, 2019
8f5f0a3
persist prometheus data into host path
xudifsd Aug 2, 2019
8201c1b
Expose environment variable DLWS_NUM_GPU_PER_WORKER for regular job
Anbang-Hu Aug 2, 2019
ceedfdd
Merge pull request #447 from Anbang-Hu/dltsdev
Anbang-Hu Aug 2, 2019
b25d9f9
setup ssh and hostfile for regular job
hao1939 Aug 4, 2019
b80a90d
Merge pull request #445 from hao1939/dltsdev
Anbang-Hu Aug 5, 2019
6992f64
Merge pull request #446 from xudifsd/dixu/persist-prometheus
Anbang-Hu Aug 5, 2019
7d5bb04
send out email while killing
xudifsd Aug 5, 2019
fb31b0b
same role anti affinity
xudifsd Aug 5, 2019
5e4efb4
Merge pull request #448 from xudifsd/dixu/kill-email
Anbang-Hu Aug 5, 2019
5cc0e07
Revert "persist prometheus data into host path"
Anbang-Hu Aug 5, 2019
d77ed4b
Merge pull request #450 from microsoft/revert-446-dixu/persist-promet…
Anbang-Hu Aug 5, 2019
18637fa
persist prometheus data into host path
xudifsd Aug 2, 2019
1ff0b27
fix permission of /prometheus-data
xudifsd Aug 6, 2019
e38da86
remove anti affinity
xudifsd Aug 6, 2019
b52b73d
Merge pull request #449 from xudifsd/dixu/anti-affinity
Anbang-Hu Aug 6, 2019
7c072d9
Merge pull request #451 from xudifsd/dixu/persist-prometheus
Anbang-Hu Aug 6, 2019
bd5f61f
change to required affinity
xudifsd Aug 6, 2019
d07c909
change order of cmd in restful api to speed up build & deployment
xudifsd Aug 6, 2019
2b80f3f
Merge pull request #452 from xudifsd/dixu/required-affinity
Anbang-Hu Aug 6, 2019
342d5d3
Merge pull request #453 from xudifsd/dixu/reorder-install
Anbang-Hu Aug 6, 2019
c395a40
fix create database
xudifsd Aug 6, 2019
4cb996a
Merge pull request #454 from xudifsd/dixu/fix-create-db
Anbang-Hu Aug 6, 2019
b0f9173
fix template error
xudifsd Aug 7, 2019
0b272db
Merge pull request #455 from xudifsd/dixu/fix
Anbang-Hu Aug 7, 2019
3c29c98
install missing pkg "openssl"
hao1939 Aug 7, 2019
61ec5f2
Merge pull request #456 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
3f3040e
fix apt-get install "-y"
hao1939 Aug 7, 2019
a4457d4
Merge pull request #457 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
7e28c47
fix apt-get hang
hao1939 Aug 7, 2019
4cd9b34
Merge pull request #458 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
5eca31e
Enable submit distrbuted job under low priority cluster
Aug 7, 2019
1375359
Merge pull request #459 from LeoHongyi/dltsdev
Anbang-Hu Aug 7, 2019
5d8a107
check if *_manager is hanging and restart accordingly
xudifsd Aug 9, 2019
9daf206
Merge pull request #469 from xudifsd/dixu/heads/v1.1.0
Anbang-Hu Aug 27, 2019
4 changes: 2 additions & 2 deletions src/ClusterBootstrap/deploy.py
@@ -2831,8 +2831,8 @@ def start_one_kube_service(fname):
pass

if fname == "./deploy/services/jobmanager/jobmanager.yaml":
- # recreate the configmap init-user-script
- run_kubectl( ["create configmap init-user-script --from-file=../Jobs_Templete/init_user.sh -o yaml --dry-run | ./deploy/bin/kubectl apply -f -"] )
+ # recreate the configmap dlws-scripts
+ run_kubectl( ["create configmap dlws-scripts --from-file=../Jobs_Templete/ -o yaml --dry-run | ./deploy/bin/kubectl apply -f -"] )

run_kubectl( ["create", "-f", fname ] )
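The hunk above swaps a single-file ConfigMap for one built from the whole ../Jobs_Templete/ directory, using the "create --dry-run -o yaml | apply" pattern so the call can be repeated safely. A minimal sketch of how that pipeline could be assembled is below; the helper name `configmap_apply_pipeline` is hypothetical and not part of deploy.py, and newer kubectl releases expect `--dry-run=client` rather than the bare flag used here.

```python
import shlex


def configmap_apply_pipeline(name, source, kubectl="./deploy/bin/kubectl"):
    """Build the shell pipeline that recreates a ConfigMap idempotently.

    Hypothetical helper for illustration; deploy.py passes a similar
    string to run_kubectl() instead of building it like this.
    """
    # Render the ConfigMap as YAML without sending it to the API server.
    create = [kubectl, "create", "configmap", name,
              "--from-file=" + source, "-o", "yaml", "--dry-run"]
    # "apply" creates the object if missing and updates it otherwise.
    apply = [kubectl, "apply", "-f", "-"]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in cmd) for cmd in (create, apply)
    )


print(configmap_apply_pipeline("dlws-scripts", "../Jobs_Templete/"))
```

Pointing `--from-file` at a directory adds one key per regular file in it, which is presumably why the ConfigMap was renamed from init-user-script to dlws-scripts.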

7 changes: 6 additions & 1 deletion src/ClusterBootstrap/params.py
@@ -23,14 +23,19 @@
"job-exporter": { "port": 9102 },
"node-exporter": { "port": 9100 },
"watchdog": { "port": 9101 },
"grafana": { "port": 3000 },
"grafana": { "port": 3000, "prometheus-ip": "localhost" },
"alert-manager": {
"port": 9093,
"configured": False,
"alert_users": False,
# If want to deploy with alert-manager, should config
# configured with True, and fill appropriate value to:
# smtp_url, smtp_from, smtp_auth_username, smtp_auth_password and receiver
"reaper": {
"dry-run": True,
"port": "9500",
"restful-url": "http://localhost:5000",
}
},

"mysql_port": "3306",
38 changes: 37 additions & 1 deletion src/ClusterBootstrap/services/jobmanager/jobmanager.yaml
@@ -13,6 +13,9 @@ spec:
labels:
jobmanager-node: pod
app: jobmanager
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/metrics"
spec:
{% if cnf["dnsPolicy"] %}
dnsPolicy: {{cnf["dnsPolicy"]}}
@@ -39,7 +42,40 @@ spec:
- mountPath: {{cnf["storage-mount-path"]}}/jobfiles
name: dlwsdatajobfiles
- mountPath: /var/log/dlworkspace
name: log
name: log
ports:
- containerPort: 9200
hostPort: 9200
name: job-mgr
protocol: TCP
- containerPort: 9201
hostPort: 9201
name: user-mgr
protocol: TCP
- containerPort: 9202
hostPort: 9202
name: node-mgr
protocol: TCP
- containerPort: 9203
hostPort: 9203
name: joblog-mgr
protocol: TCP
- containerPort: 9204
hostPort: 9204
name: cmd-mgr
protocol: TCP
- containerPort: 9205
hostPort: 9205
name: endpoint-mgr
protocol: TCP
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 3
periodSeconds: 30
successThreshold: 1
tcpSocket:
port: 9200
timeoutSeconds: 10
volumes:
- name: certs
hostPath:
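The readinessProbe added in this hunk uses a tcpSocket check on port 9200 (the job-mgr port) with a 10 second timeout. Roughly speaking, such a probe succeeds when a TCP connection can be opened. The sketch below mimics that check from Python for local debugging; `tcp_ready` is a hypothetical helper, not something in this PR.

```python
import socket


def tcp_ready(host, port, timeout_seconds=10.0):
    """Return True if a TCP connection to host:port can be opened in time.

    Hypothetical stand-in for a tcpSocket readiness probe.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 9200 is the job-mgr port declared in jobmanager.yaml.
    print(tcp_ready("localhost", 9200, timeout_seconds=10.0))
```

Note that the probe only watches port 9200, so a hang in one of the other managers (ports 9201 to 9205) would not by itself mark the pod as unready.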
79 changes: 74 additions & 5 deletions src/ClusterBootstrap/services/monitor/alert-manager.yaml
@@ -24,7 +24,7 @@ spec:
hostNetwork: true
containers:
- name: alert-manager
image: prom/alertmanager:v0.15.1
image: prom/alertmanager:v0.18.0
args:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
@@ -40,6 +40,23 @@ spec:
mountPath: /alertmanager
- name: templates-volume
mountPath: /etc/alertmanager/template
{% if cnf["alert-manager"]["reaper"] %}
- name: reaper
image: {{cnf["worker-dockerregistry"]}}{{cnf["dockerprefix"]}}reaper:{{cnf["dockertag"]}}
command:
- 'python'
- '/reaper/main.py'
- '--port'
- '{{ cnf["alert-manager"]["reaper"]["port"] }}'
- '--restful_url'
- '{{ cnf["alert-manager"]["reaper"]["restful-url"] }}'
{% if cnf["alert-manager"]["reaper"]["dry-run"] %}
- '--dry_run'
{% endif %}
ports:
- name: alert-manager
containerPort: {{ cnf["alert-manager"]["reaper"]["port"] }}
{% endif %}
volumes:
- name: config-volume
configMap:
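The reaper container above is started as `python /reaper/main.py --port ... --restful_url ...`, with `--dry_run` appended only when the config sets "dry-run". The actual /reaper/main.py is not shown in this part of the diff, so the following is only a sketch of a command-line parser that would accept those three flags; the defaults are taken from the "reaper" block in params.py.

```python
import argparse


def build_parser():
    """Hypothetical argument parser matching the flags used in alert-manager.yaml."""
    parser = argparse.ArgumentParser(description="reaper (illustrative sketch)")
    # Port the reaper listens on for Alertmanager webhook calls.
    parser.add_argument("--port", type=int, default=9500)
    # Base URL of the RESTful API used to kill jobs.
    parser.add_argument("--restful_url", default="http://localhost:5000")
    # Present only when cnf["alert-manager"]["reaper"]["dry-run"] is true.
    parser.add_argument("--dry_run", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.port, args.restful_url, args.dry_run)
```

Because params.py defaults "dry-run" to True, a deployment that does not override it would presumably run the reaper without actually killing jobs.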
@@ -80,22 +97,38 @@ data:
receiver: alert-email
group_wait: 30s
group_interval: 5m
group_by: [alertname]
group_by: [alertname, cluster]
routes:
- receiver: task_user
- receiver: idle_gpu_receiver
repeat_interval: 4h
group_by: [alertname, user_email, cluster]
match_re:
type: user_alert
type: idle_gpu
alertname: "zero-gpu-usage"
- receiver: job_state_change_receiver
group_by: [alertname, user_email, cluster, subject]
match_re:
type: user_alert
alertname: "job-state-changed"
- receiver: reaper
group_by: [alertname, user_email, job_name]
group_wait: 0s
match_re:
type: reaper
- receiver: kill_idle_job_email
group_by: [alertname, user_email, cluster]
group_wait: 0s
match_re:
type: kill_idle_job_email
alertname: "kill-idle-jobs-email"
receivers:
- name: "alert-email"
email_configs:
- to: {{ alert_info["receiver"] }}
html: '{{ "{{" }} template "email.html" . {{ "}}" }}'
headers:
subject: '{{ "{{" }} .GroupLabels.cluster {{ "}}" }}: {{ "{{" }} template "__subject" . {{ "}}" }}'
- name: "task_user"
- name: "idle_gpu_receiver"
email_configs:
{% if cnf["alert-manager"]["alert_users"] %}
- to: '{{ "{{" }} .GroupLabels.user_email {{ "}}" }},{{ alert_info["receiver"] }}'
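The routes in this hunk pick a receiver from the alert's labels. Alertmanager tries child routes in order, stops at the first one whose matchers all hold (unless `continue` is set), and falls back to the parent receiver otherwise; `match_re` patterns are anchored against the whole label value. The sketch below restates the new routing table in Python to make that selection explicit. It is a simplification for illustration, not Alertmanager's code.

```python
import re

# Child routes as configured in the hunk above, in order.
ROUTES = [
    {"receiver": "idle_gpu_receiver",
     "match_re": {"type": "idle_gpu", "alertname": "zero-gpu-usage"}},
    {"receiver": "job_state_change_receiver",
     "match_re": {"type": "user_alert", "alertname": "job-state-changed"}},
    {"receiver": "reaper",
     "match_re": {"type": "reaper"}},
    {"receiver": "kill_idle_job_email",
     "match_re": {"type": "kill_idle_job_email",
                  "alertname": "kill-idle-jobs-email"}},
]
DEFAULT_RECEIVER = "alert-email"  # receiver of the top-level route


def pick_receiver(labels):
    """Return the receiver of the first route whose patterns all match."""
    for route in ROUTES:
        # A missing label is treated as an empty string.
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in route["match_re"].items()):
            return route["receiver"]
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "kill-idle-jobs", "type": "reaper"}))
```

With this table, the two 8h rules in jobs.rules land on different receivers: type reaper goes to the webhook and type kill_idle_job_email goes to the email receiver.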
@@ -109,4 +142,40 @@ data:
CC: '{{ alert_info["receiver"] }}'
{% endif %}
subject: '{{ "{{" }} .GroupLabels.cluster {{ "}}" }}: {{ "{{" }} template "__subject" . {{ "}}" }}'
- name: "job_state_change_receiver"
email_configs:
{% if cnf["alert-manager"]["alert_users"] %}
- to: '{{ "{{" }} .GroupLabels.user_email {{ "}}" }},{{ alert_info["receiver"] }}'
{% else %}
- to: '{{ alert_info["receiver"] }}'
{% endif %}
html: '{{ "{{" }} template "job_state.html" . {{ "}}" }}'
headers:
{% if cnf["alert-manager"]["alert_users"] %}
To: '{{ "{{" }} .GroupLabels.user_email {{ "}}" }}'
CC: '{{ alert_info["receiver"] }}'
{% endif %}
subject: '{{ "{{" }} .GroupLabels.cluster {{ "}}" }}: {{ "{{" }} template "__subject" . {{ "}}" }}'
- name: "reaper"
{% if cnf["alert-manager"]["reaper"] %}
webhook_configs:
- send_resolved: False
url: 'http://localhost:{{ cnf["alert-manager"]["reaper"]["port"] }}/kill'
http_config:
bearer_token: 'shinigami'
- name: "kill_idle_job_email"
email_configs:
{% if cnf["alert-manager"]["alert_users"] %}
- to: '{{ "{{" }} .GroupLabels.user_email {{ "}}" }},{{ alert_info["receiver"] }}'
{% else %}
- to: '{{ alert_info["receiver"] }}'
{% endif %}
html: '{{ "{{" }} template "kill_idle.html" . {{ "}}" }}'
headers:
{% if cnf["alert-manager"]["alert_users"] %}
To: '{{ "{{" }} .GroupLabels.user_email {{ "}}" }}'
CC: '{{ alert_info["receiver"] }}'
{% endif %}
subject: '{{ "{{" }} .GroupLabels.cluster {{ "}}" }}: {{ "{{" }} template "__subject" . {{ "}}" }}'
{% endif %}
{% endif %}
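The "reaper" receiver posts to http://localhost:<port>/kill with `bearer_token: 'shinigami'`, which Alertmanager sends as an `Authorization: Bearer ...` header. The receiving side is not part of this file, so the sketch below only illustrates how a handler might check that header and pull job names out of the webhook body. Both function names are hypothetical, and the payload fields used (`alerts`, `status`, `labels`) follow Alertmanager's documented webhook format.

```python
import hmac
import json


def is_authorized(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    scheme, _, token = (authorization_header or "").partition(" ")
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode(), expected_token.encode())


def firing_job_names(body):
    """Return job_name labels of the firing alerts in a webhook payload."""
    payload = json.loads(body)
    return [
        alert["labels"]["job_name"]
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
        and "job_name" in alert.get("labels", {})
    ]


example = json.dumps({"alerts": [
    {"status": "firing", "labels": {"job_name": "job-a", "type": "reaper"}},
    {"status": "resolved", "labels": {"job_name": "job-b", "type": "reaper"}},
]})
print(is_authorized("Bearer shinigami", "shinigami"), firing_job_names(example))
```

Since the token is a literal in the template, it mainly guards against accidental calls. Reading it from configuration would be one option if stronger protection is wanted.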
@@ -0,0 +1,71 @@
{{ define "job_state.html" }}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Style and HTML derived from https://github.com/mailgun/transactional-email-templates


The MIT License (MIT)

Copyright (c) 2014 Mailgun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<head>
<meta name="viewport" content="width=device-width"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>{{ template "__subject" . }}</title>

</head>

<body itemscope="" itemtype="http://schema.org/EmailMessage" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; height: 100%; line-height: 1.6em; width: 100% !important; background-color: #f6f6f6; margin: 0; padding: 0;" bgcolor="#f6f6f6">

<table style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; background-color: #f6f6f6; margin: 0;" bgcolor="#f6f6f6">
<tr>
<td valign="top"></td>
<td width="600" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; display: block !important; max-width: 600px !important; clear: both !important; width: 100% !important; margin: 0 auto; padding: 0;" valign="top">
<div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; max-width: 600px; display: block; margin: 0 auto; padding: 0;">
<table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; border-radius: 3px; background-color: #fff; margin: 0; border: 1px solid #e9e9e9;" bgcolor="#fff">
<tr>
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 10px;" valign="top">
<table width="100%" cellpadding="0" cellspacing="0">
{{ range .Alerts.Firing }}
<tr>
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
Your job
<a href="http://dltshub.redmond.corp.microsoft.com/Home/JobDetail/?cluster={{.Labels.cluster}}&jobId={{.Labels.job_name}}">
<strong>{{.Labels.job_name}}</strong>
</a> from cluster '{{.Labels.cluster}}' has changed state to {{.Labels.job_state}}.
</td>
</tr>
{{ end }}
</table>
</td>
</tr>
</table>

</div>
</td>
<td valign="top"></td>
</tr>
</table>

</body>
</html>
{{ end }}
@@ -0,0 +1,71 @@
{{ define "kill_idle.html" }}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Style and HTML derived from https://github.com/mailgun/transactional-email-templates


The MIT License (MIT)

Copyright (c) 2014 Mailgun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0;">
<head>
<meta name="viewport" content="width=device-width"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>{{ template "__subject" . }}</title>

</head>

<body itemscope="" itemtype="http://schema.org/EmailMessage" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; -webkit-font-smoothing: antialiased; -webkit-text-size-adjust: none; height: 100%; line-height: 1.6em; width: 100% !important; background-color: #f6f6f6; margin: 0; padding: 0;" bgcolor="#f6f6f6">

<table style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; width: 100%; background-color: #f6f6f6; margin: 0;" bgcolor="#f6f6f6">
<tr>
<td valign="top"></td>
<td width="600" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; display: block !important; max-width: 600px !important; clear: both !important; width: 100% !important; margin: 0 auto; padding: 0;" valign="top">
<div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; max-width: 600px; display: block; margin: 0 auto; padding: 0;">
<table width="100%" cellpadding="0" cellspacing="0" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; border-radius: 3px; background-color: #fff; margin: 0; border: 1px solid #e9e9e9;" bgcolor="#fff">
<tr>
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 10px;" valign="top">
<table width="100%" cellpadding="0" cellspacing="0">
{{ range .Alerts.Firing }}
<tr>
<td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; vertical-align: top; margin: 0; padding: 0 0 20px;" valign="top">
Your job
<a href="http://dltshub.redmond.corp.microsoft.com/Home/JobDetail/?cluster={{.Labels.cluster}}&jobId={{.Labels.job_name}}">
<strong>{{.Labels.job_name}}</strong>
</a> from cluster '{{.Labels.cluster}}' VC '{{.Labels.vc_name}}' was killed because it has been idle for too long.
</td>
</tr>
{{ end }}
</table>
</td>
</tr>
</table>

</div>
</td>
<td valign="top"></td>
</tr>
</table>

</body>
</html>
{{ end }}
12 changes: 11 additions & 1 deletion src/ClusterBootstrap/services/monitor/alerting/jobs.rules
@@ -5,4 +5,14 @@ groups:
expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
for: 4h
labels:
type: user_alert
type: idle_gpu
- alert: kill-idle-jobs-email
expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
for: 8h
labels:
type: kill_idle_job_email
- alert: kill-idle-jobs
expr: avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0
for: 8h
labels:
type: reaper
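All three rules share the expression `avg(task_gpu_percent) by (user_email, job_name, vc_name) == 0` and differ only in the `for:` duration (4h for the warning, 8h for the kill email and the reaper). A `for:` clause keeps an alert pending until the expression has held continuously for that long. The sketch below models that behaviour over a list of samples for one job; it is a simplification under that assumption, not Prometheus's evaluator, and `alert_state` is a hypothetical name.

```python
def alert_state(samples, hold_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last sample.

    samples: list of (timestamp_seconds, avg_gpu_percent), oldest first.
    """
    active_since = None
    state = "inactive"
    for timestamp, value in samples:
        if value == 0:
            # Remember when the idle streak started.
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            state = "firing" if held >= hold_seconds else "pending"
        else:
            # Any GPU activity resets the streak.
            active_since = None
            state = "inactive"
    return state


HOUR = 3600
idle_for_8h = [(hour * HOUR, 0) for hour in range(9)]
print(alert_state(idle_for_8h, 4 * HOUR), alert_state(idle_for_8h, 8 * HOUR))
```

Under this model a job that stays idle gets the zero-gpu-usage email at the 4h mark and, if nothing changes, the kill email and the reaper webhook at the 8h mark.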