DATAUP-262 Create ADR For Split/Aggregate Jobs #422
Conversation
> * `+` Simplest solution, quickest turnaround, fixes deadlock issue
> * `-` UI still broken for batch analysis

> ## DEADLOCK: Increase number of slots or Separate Queue for kbparallels apps without 10 job limit
I'm starting to think a separate queue with barely any resources for the monitoring job might work. Basically, the only responsibility of the jobs in that queue is to start other jobs and monitor them. So the monitor jobs have basically no resources, and all they do is:
- Kick off a FO job that starts the PIP jobs. If that's not computationally intensive it could be done in the monitor job. The FO job returns the job IDs to the monitor job.
- When the PIP jobs are done, the monitor job kicks off the FI job.
- When the FI job is done, the monitor job returns the results.
That seems like it'd be a lot less work than modifying EE2 (which IMO is the "best" solution, but also the biggest lift)
Although it has all the issues around cancellation, UI tracking, etc.
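A minimal, runnable sketch of that monitor-job flow, with a stubbed stand-in for the EE2 client (the real client and its method names would differ; `FakeEE2` is purely illustrative, and every fake job completes instantly so the control flow can be exercised):

```python
import time

class FakeEE2:
    """Illustrative stand-in for an EE2-like client; the real API will differ."""

    def __init__(self):
        self._jobs = {}
        self._n = 0

    def _new(self, result):
        self._n += 1
        job_id = f"job-{self._n}"
        self._jobs[job_id] = {"status": "completed", "result": result}
        return job_id

    def run_job(self, app, params=None):
        if app == "fan_out":
            # The FO job starts one PIP job per input; its own result
            # is the list of PIP job IDs, as described above.
            return self._new([self._new(p) for p in params])
        return self._new(params)

    def check_job(self, job_id):
        return self._jobs[job_id]

def monitor(ee2, inputs, poll=0.0):
    # 1) Kick off the fan-out (FO) job; its result is the PIP job IDs.
    fo = ee2.run_job("fan_out", inputs)
    while ee2.check_job(fo)["status"] != "completed":
        time.sleep(poll)
    pip_ids = ee2.check_job(fo)["result"]

    # 2) Poll until every PIP job is done.
    while not all(ee2.check_job(j)["status"] == "completed" for j in pip_ids):
        time.sleep(poll)

    # 3) Kick off the fan-in (FI) job over the PIP results; return its result.
    fi = ee2.run_job("fan_in", [ee2.check_job(j)["result"] for j in pip_ids])
    while ee2.check_job(fi)["status"] != "completed":
        time.sleep(poll)
    return ee2.check_job(fi)["result"]
```

The monitor itself only polls job states and chains FO → PIP → FI, which is why it should need almost no resources.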
Is it possible to kill jobs if they take up a certain amount of compute?
Also why is the header prefixed with DEADLOCK?
No, it is not possible to kill them based on compute or disk, but it is possible based on memory. One con would be a maximum number of management jobs tied to the number of cores, as most jobs require at least 1 core. It is possible to make the machine advertise a higher number of cores, but I'm not sure how that would affect performance.
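For reference, the memory-based kill can be expressed as an HTCondor policy expression; a sketch (the exact policy and thresholds are site-specific):

```
# In the schedd's condor_config: remove any job whose measured memory
# use (MB) exceeds what it requested.
SYSTEM_PERIODIC_REMOVE = MemoryUsage > RequestMemory

# Or per-job, in the submit file:
# periodic_remove = (MemoryUsage > RequestMemory)
```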
Because that solution only addresses the deadlock, and not any UI issues. I changed it a bit.
MrCreosote
left a comment
This certainly seems good enough for a brain dump, given that dealing with this problem is currently not in scope for the project. Just a few relatively minor changes and I think this is done.
> * `+` Simple solution, quick turnaround, fixes deadlock issue
> * `-` Addresses only the deadlocking issue; UI still broken for regular runs and batch runs
> * `-` A small number of users can take over the entire system
> * `-` The calculations done by the apps will interfere with other apps and cause crashes/failures
I think these two are true if we increase the number of slots, but not if KBP apps have a separate queue. Maybe those two solutions should be separate with individual pro/con lists?
> ### Separate Queue for kbparallels apps that may or may not have its own limit on running jobs
> * `+` Simple solution, quick turnaround, fixes deadlock issue
> * `+` Requires minimal changes to ee2 and condor, if condor supports this feature
What changes would it require?
- ee2: make a list of apps that get submitted to that queue
- condor: somehow make the 2nd queue a consumable resource
> Make a list of apps get submitted to that queue

Can't you just specify that in the catalog?

> somehow make the 2nd queue a consumable resource

I don't understand what this means - why would this queue be any different than any other queue?
You could specify it in the catalog, and then make ee2 add a KBP_LIMIT if it has that QUEUE clientgroup. But if we want those KBP jobs to run in the NJS queue and limit the max of those jobs, EE2 would need to keep a list of the apps it needs to add the KBP_LIMIT to.

As for making it a consumable resource: that's because you don't want one user to take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we just limit the number of jobs sent to the KBP queue, we might as well not make a KBP queue at all, because we can limit KBP jobs without them needing their own queue.
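If the goal is to stop jobs from monopolizing KBP capacity, HTCondor's concurrency-limits mechanism is one way to treat it as a consumable resource; a sketch with placeholder names and numbers:

```
# Negotiator/schedd config: at most 10 jobs holding a KBP token run at once.
KBP_LIMIT = 10

# Submit file for a kbparallels job: consume one unit of the KBP limit.
concurrency_limits = KBP
```

As written this caps KBP jobs cluster-wide; a per-user cap would still need EE2 (or a submit transform) to tag jobs per user.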
> As for why making it a consumable resource, is because you don't want to have one user take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we limit the amount of jobs sent to the KBP queue, we might as well just not make a KBP queue, because we can limit KBP jobs without them needing their own queue.
I'm still not quite grokking this - if we have a combined queue and a regular limit of 10 and a KBP limit of, say, 5, that means that a user running a lot of parallel jobs will be running at 50% capacity.
If instead we have a separate queue with a KBP limit of, say, 20, then the user can run at 100% capacity and run more KBP jobs at the same time. Theoretically we could also assign very low resources per KBP job (especially if we restructured the jobs to do most of the work on the regular queue).
- Modifying code inside of apps that use KBP is not in scope of this solution
- What is a combined queue?

We have 60 cs nodes, which means we can run 60 jobs, one per node, right now.

If we convert 20 nodes to KBP nodes and we had a limit of 20 jobs per user, then 1 user would submit 20 KBP jobs and no one else could run a KBP job. All 20 KBP jobs would be set to running, but of those 20, only up to 10 could have a computation job running. The other 10 would be doing nothing, as the user's 10 slots are all taken up.
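The arithmetic in that scenario, as a quick runnable sketch (numbers taken from the example above):

```python
def kbp_utilization(running_kbp_jobs, regular_slot_limit):
    """Of a user's running KBP manager jobs, how many can actually have a
    child computation job running, given the regular per-user slot limit."""
    active = min(running_kbp_jobs, regular_slot_limit)
    return active, running_kbp_jobs - active

# 20 KBP jobs running, but the user's 10 regular slots cap the children:
# 10 KBP jobs get to compute, 10 sit idle while still holding KBP nodes.
active, idle = kbp_utilization(20, 10)
```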
Ok, gotcha, so it makes sense to make KBP_limit <= regular limit.

> If we convert 20 nodes to KBP nodes and we had a limit of 20 jobs per user, then 1 user would submit 20 KBP jobs and no one else could run a KBP job.

This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.
> Modifying code inside of apps that use KBP is not in scope of this solution

Why not?

> What is a combined queue?

I'm just talking about the case where KBP jobs and regular jobs run in the same queue.
Also FYI, having more than 1 job doing something like combining stuff with DFU, or uploading things, on the same cs node might cause too much disk pressure on these older machines, so that's why we changed to a max of 1 job per machine.
> This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.

You sure can, but then you need to consider how many CPUs and how much memory/disk each KBP job is using, and how many can fit on a node. If we do it wrong, the jobs are all going to affect each other, causing more user tickets complaining about jobs mysteriously crashing.

> Modifying code inside of apps that use KBP is not in scope of this solution

Well, you have to cut off the solution at some point before you decide it's time to enumerate a new solution. It seems very complex to come up with 7 plans for modifying KBP apps so that they don't do any computation on the same machines and don't affect other jobs, and to weigh that against just ripping KBP out of those apps within this solution. It could be done in a new solution, though, and once that is complete, compare this solution to a solution that has those plans, with or without job limits and new queues.

> I'm just talking about the case where KBP jobs and regular jobs run in the same queue.

KBP and regular jobs already run in the same queue; see the proposal that allows the jobs to run on their own nodes and also fixes deadlocking: "LIMIT KBP jobs to a maximum of 5 active jobs per user".
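If the 1 node : 1 job rule were relaxed, the CPU/memory/disk fit question above becomes explicit packing math; a sketch with entirely made-up numbers:

```python
def jobs_per_node(node, job):
    """How many identical jobs fit on one node: the tightest of the
    CPU, memory, and disk constraints wins."""
    return min(node[k] // job[k] for k in ("cpus", "mem_gb", "disk_gb"))

# Hypothetical hardware and job footprint: here disk is the binding
# constraint, so only 10 jobs fit despite 32 cores being available.
node = {"cpus": 32, "mem_gb": 256, "disk_gb": 1000}
job = {"cpus": 1, "mem_gb": 4, "disk_gb": 100}
```

Get any one of these numbers wrong and co-located jobs start interfering, which is exactly the crash/ticket risk mentioned above.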
Maybe we should have a quick call and talk this out rather than going back and forth any more.

I thought of another way of doing it, so changed it around.
MrCreosote
left a comment
Overall structure and contents look good; I still have some questions / comments re the details.
MrCreosote
left a comment
Did a final read through and found a couple of minor things
> ## Alternatives Considered
>
> * Ignore most issues and just make apps that run kbparallels limited to N instances of kbparallels per user to avoid deadlocks
> * Writing new ee2 endpoints to entirely handle batch execution and possibly use a DAG
I think this line should be deleted, right?
> 8) The *Job Manager* returns the reference to the results of the *Report Job*
>
> Pros/Cons
> * `+` On an as needed basis, would have to rewrite apps that use KBP to use this new paradigm
I think this is a con, but less of a con than having to rewrite all apps that use KBP at once