DATAUP-262 Create ADR For Split/Aggregate Jobs#422

Merged
bio-boris merged 28 commits into develop from DATAUP-262
Nov 23, 2021

Conversation

@bio-boris
Collaborator

@bio-boris bio-boris commented Sep 22, 2021

  • ADR

@bio-boris bio-boris changed the title Create 004-SplitAndAggregate.md DATAUP-262 Create ADR For Split/Aggregate Jobs Sep 22, 2021
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
* `+` Simplest solution, quickest turnaround, fixes deadlock issue
* `-` UI still broken for batch analysis

## DEADLOCK: Increase number of slots or Separate Queue for kbparallels apps without 10 job limit
Member

@MrCreosote MrCreosote Nov 5, 2021

I'm starting to think a separate queue with barely any resources for the monitoring job might work. Basically the only responsibility of the jobs in that queue is to start other jobs and monitor them. So the monitor jobs have basically no resources, and all they do is:

  1. Kick off an FO job that starts the PIP jobs. If that's not computationally intensive, it could be done in the monitor job. The FO job returns the job IDs to the monitor job.
  2. When the PIP jobs are done, the monitor job kicks off the FI job.
  3. When the FI job is done, the monitor job returns the results.
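The monitor-job loop described in the steps above could be sketched roughly like this. Note that `ee2_client` and its `run_job`/`check_job`/`get_job_results` methods are hypothetical stand-ins, not a real EE2 API:

```python
import time

def run_monitor_job(ee2_client, fo_app, fi_app, params, poll_secs=60):
    """Sketch of a low-resource monitor job; ee2_client is hypothetical."""
    # 1. Kick off the FO job, which starts the PIP jobs and
    #    reports their job IDs back to the monitor job.
    fo_job = ee2_client.run_job(fo_app, params)
    while not ee2_client.check_job(fo_job)["finished"]:
        time.sleep(poll_secs)
    pip_job_ids = ee2_client.get_job_results(fo_job)["pip_job_ids"]

    # 2. Wait for every PIP job to finish, then kick off the FI job.
    while not all(ee2_client.check_job(j)["finished"] for j in pip_job_ids):
        time.sleep(poll_secs)
    fi_job = ee2_client.run_job(fi_app, {"pip_job_ids": pip_job_ids})

    # 3. When the FI job is done, return its results.
    while not ee2_client.check_job(fi_job)["finished"]:
        time.sleep(poll_secs)
    return ee2_client.get_job_results(fi_job)
```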

That seems like it'd be a lot less work than modifying EE2 (which IMO is the "best" solution, but also the biggest lift).

Although it has all the issues around cancellation, UI tracking, etc.

Is it possible to kill jobs if they take up a certain amount of compute?

Member

Also why is the header prefixed with DEADLOCK?

Collaborator Author

No, it is not possible to kill them based on compute or disk, but it is possible based on memory. One con is that there would be a maximum number of management jobs based on the number of cores, as most jobs require at least 1 core. It is possible to make the machine advertise a higher number of cores, but I'm not sure how that would affect performance.

Because that solution only addresses the deadlock and not any UI issues. I changed it a bit.

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Member

@MrCreosote MrCreosote left a comment

This certainly seems good enough for a brain dump, given that dealing with this problem is currently not in scope for the project. Just a few relatively minor changes and I think this is done.

@bio-boris bio-boris requested a review from MrCreosote November 15, 2021 15:27
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
* `+` Simple solution, quick turnaround, fixes deadlock issue
* `-` Addresses only the deadlocking issue; UI still broken for regular runs and batch runs
* `-` A small number of users can take over the entire system
* `-` The calculations done by the apps will interfere with other apps and cause crashes/failures
Member

@MrCreosote MrCreosote Nov 15, 2021

I think these two are true if we increase the number of slots, but not if KBP apps have a separate queue. Maybe those two solutions should be separate with individual pro/con lists?

Collaborator Author

@MrCreosote Which two?

Collaborator Author

changed

@bio-boris bio-boris requested a review from MrCreosote November 15, 2021 21:06
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated

### Separate Queue for kbparallels apps that may or may not have its own limit on running jobs
* `+` Simple solution, quick turnaround, fixes deadlock issue
* `+` Requires minimal changes to ee2 and condor, if condor supports this feature
Member

What changes would it require?

Collaborator Author

* ee2: keep a list of apps that get submitted to that queue
* condor: somehow make the 2nd queue a consumable resource
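On the condor side, one existing mechanism that behaves like a consumable resource is HTCondor's concurrency limits. A sketch, where the limit name and cap are purely illustrative, not a decision:

```
# Central-manager config: cap the total kbparallels jobs cluster-wide.
KBPARALLEL_LIMIT = 20

# Per-job submit file: each such job consumes one unit of the limit
# while it runs.
concurrency_limits = KBPARALLEL
```

A per-user cap could presumably be approximated by having ee2 include the username in the limit name at submit time, though whether that fits here is an open question.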

Member

keep a list of apps that get submitted to that queue

can't you just specify that in the catalog?

somehow make the 2nd queue a consumable resource

I don't understand what this means - why would this queue be any different than any other queue?

Collaborator Author

You could specify it in the catalog and then make ee2 add a KBP_LIMIT if it has that QUEUE clientgroup, but if we want those KBP jobs to run in the NJS queue and limit the maximum number of those jobs, EE2 would need to keep a list of the apps it needs to add the KBP_LIMIT to.

As for making it a consumable resource: you don't want one user to take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we limit the number of jobs sent to the KBP queue, we might as well not make a KBP queue at all, because we can limit KBP jobs without them needing their own queue.

Member

As for making it a consumable resource: you don't want one user to take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we limit the number of jobs sent to the KBP queue, we might as well not make a KBP queue at all, because we can limit KBP jobs without them needing their own queue.

I'm still not quite grokking this - if we have a combined queue and a regular limit of 10 and a KBP limit of, say, 5, that means that a user running a lot of parallel jobs will be running at 50% capacity.

If instead we have a separate queue with a KBP limit of, say, 20, then the user can run at 100% capacity and run more KBP jobs at the same time. Theoretically we could also assign very low resources per KBP job (especially if we restructured the jobs to do most of the work on the regular queue).

Collaborator Author

@bio-boris bio-boris Nov 16, 2021

  • Modifying code inside of apps that use KBP is not in scope of this solution
  • What is a combined queue?

We have 60 cs nodes, which means we can run 60 jobs on 60 nodes right now.
If we converted 20 nodes to KBP nodes and had a limit of 20 KBP jobs per user, then 1 user could submit 20 KBP jobs and no one else could run a KBP job. All 20 KBP jobs would be set to running, but of those 20, only up to 10 could have a computation job running. The other 10 would be doing nothing, as the user's 10 slots are all taken up.
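Writing that arithmetic out, with the numbers taken from the scenario above:

```python
cs_nodes = 60                 # current CS nodes, 1 job per node
kbp_nodes = 20                # nodes converted to KBP nodes
kbp_limit_per_user = 20       # proposed per-user KBP job limit
regular_slots_per_user = 10   # existing per-user job slots

# One user submits 20 KBP jobs: every KBP node is now occupied,
# so nobody else can run a KBP job.
running_kbp = min(kbp_nodes, kbp_limit_per_user)            # 20

# Each running KBP job needs one of the user's regular slots for its
# computation job, so only 10 can actually make progress.
progressing = min(running_kbp, regular_slots_per_user)      # 10
idle_kbp = running_kbp - progressing                        # 10 doing nothing
```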

Member

Ok, gotcha, so it makes sense to make KBP_limit <= regular limit.

If we converted 20 nodes to KBP nodes and had a limit of 20 KBP jobs per user, then 1 user could submit 20 KBP jobs and no one else could run a KBP job.

This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.

Member

Modifying code inside of apps that use KBP is not in scope of this solution

Why not?

What is a combined queue?

I'm just talking about the case where KBP jobs and regular jobs run in the same queue.

Collaborator Author

Also FYI, having more than 1 job do something like combine stuff with DFU or upload things on the same cs node might cause too much disk pressure on these older machines, so that's why we changed to a max of 1 job per machine.

Collaborator Author

@bio-boris bio-boris Nov 16, 2021

This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.

  • You sure can, but then you need to consider how many CPUs, how much memory, and how much disk each KBP job uses, and how many can fit on a node. If we do it wrong, the jobs will all affect each other, causing more user tickets complaining about jobs mysteriously crashing.
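For reference, the usual HTCondor mechanism for relaxing the 1 node : 1 job rule is a partitionable slot, where each job carves off only what it requests. A sketch of the startd config, purely illustrative, and exactly the sizing that has to be got right:

```
# Advertise the whole machine as one partitionable slot; jobs then
# claim resources via request_cpus / request_memory / request_disk.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Each KBP job's submit file would then need accurate resource requests, which is the hard part described above.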

Modifying code inside of apps that use KBP is not in scope of this solution

Well, you have to cut off the solution at some point before you decide it's time to enumerate a new solution.
It seems very complex to come up with 7 plans for modifying KBP apps so that they don't do any computation on the same machines and don't affect other jobs, and to weigh that against just ripping KBP out of those apps within this solution. It could be done in a new solution, though, and once that is complete, we could compare this solution to one that has these plans, with or without job limits and new queues.

I'm just talking about the case where KBP jobs and regular jobs run in the same queue.

KBP and regular jobs already run in the same queue; see the proposal that allows the jobs to run on their own nodes and also fixes deadlocking: "LIMIT KBP jobs to a maximum of 5 active jobs per user".

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
@bio-boris bio-boris requested a review from MrCreosote November 16, 2021 00:38
@MrCreosote
Member

Maybe we should have a quick call and talk this out rather than going back and forth any more.

@bio-boris bio-boris requested review from MrCreosote and removed request for MrCreosote November 16, 2021 16:27
@bio-boris
Collaborator Author

I thought of another way of doing it, so I changed it around.

@bio-boris bio-boris requested review from MrCreosote and removed request for MrCreosote November 18, 2021 21:13
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Member

@MrCreosote MrCreosote left a comment

Overall structure and contents look good; I still have some questions/comments re the details.

Comment thread docs/adrs/004-SplitAndAggregate.md
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
@bio-boris bio-boris requested a review from MrCreosote November 23, 2021 19:34
Member

@MrCreosote MrCreosote left a comment

Did a final read through and found a couple of minor things

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
## Alternatives Considered

* Ignore most issues and just limit apps that run kbparallels to N instances of kbparallels per user to avoid deadlocks
* Writing new ee2 endpoints to entirely handle batch execution and possibly use a DAG
Member

I think this line should be deleted, right?

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
8) The *Job Manager* returns the reference to the results of the *Report Job*

Pros/Cons
* `+` On an as needed basis, would have to rewrite apps that use KBP to use this new paradigm
Member

I think this is a con, but less of a con than having to rewrite all apps that use KBP at once

Member

@MrCreosote MrCreosote left a comment

LGTM

@bio-boris bio-boris merged commit 77f1416 into develop Nov 23, 2021
@bio-boris bio-boris deleted the DATAUP-262 branch January 19, 2022 20:59