DATAUP-262 Create ADR For Split/Aggregate Jobs#422

Merged
bio-boris merged 28 commits into develop from DATAUP-262
Nov 23, 2021

Conversation

@bio-boris
Collaborator

@bio-boris bio-boris commented Sep 22, 2021

  • ADR

@bio-boris bio-boris changed the title Create 004-SplitAndAggregate.md DATAUP-262 Create ADR For Split/Aggregate Jobs Sep 22, 2021
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
* `+` Simplest solution, quickest turnaround, fixes deadlock issue
* `-` UI still broken for batch analysis

## DEADLOCK: Increase number of slots or Separate Queue for kbparallels apps without 10 job limit
Member

@MrCreosote MrCreosote Nov 5, 2021

I'm starting to think a separate queue with barely any resources for the monitoring job might work. Basically the only responsibility of the jobs in that queue is to start other jobs and monitor them. So the monitor jobs have basically no resources, and all they do is:

  1. Kick off an FO job that starts the PIP jobs. If that's not computationally intensive, it could be done in the monitor job. The FO job returns the job IDs to the monitor job.
  2. When the PIP jobs are done, the monitor job kicks off the FI job.
  3. When the FI job is done, the monitor job returns the results.
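The monitor-job loop described in the steps above could be sketched roughly like this. Note that `ee2_client` and its `run_job`/`check_job`/`get_job_results` methods are hypothetical stand-ins, not a real EE2 API:

```python
import time

def run_monitor_job(ee2_client, fo_app, fi_app, params, poll_secs=60):
    """Sketch of a low-resource monitor job; ee2_client is hypothetical."""
    # 1. Kick off the FO job, which starts the PIP jobs and
    #    reports their job IDs back to the monitor job.
    fo_job = ee2_client.run_job(fo_app, params)
    while not ee2_client.check_job(fo_job)["finished"]:
        time.sleep(poll_secs)
    pip_job_ids = ee2_client.get_job_results(fo_job)["pip_job_ids"]

    # 2. Wait for every PIP job to finish, then kick off the FI job.
    while not all(ee2_client.check_job(j)["finished"] for j in pip_job_ids):
        time.sleep(poll_secs)
    fi_job = ee2_client.run_job(fi_app, {"pip_job_ids": pip_job_ids})

    # 3. When the FI job is done, return its results.
    while not ee2_client.check_job(fi_job)["finished"]:
        time.sleep(poll_secs)
    return ee2_client.get_job_results(fi_job)
```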

That seems like it'd be a lot less work than modifying EE2 (which IMO is the "best" solution, but also the biggest lift).

Although it has all the issues around cancellation, UI tracking, etc.

Is it possible to kill jobs if they take up a certain amount of compute?

Member

Also why is the header prefixed with DEADLOCK?

Collaborator Author

No, it is not possible to kill them based on compute or disk, but it is possible based on memory. One con is that there would be a maximum number of management jobs based on the number of cores, as most jobs require at least 1 core. It is possible to make the machine advertise a higher number of cores, but I'm not sure how that would affect performance.

Because that solution only addresses the deadlock and not any UI issues. I changed it a bit.

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Member

@MrCreosote MrCreosote left a comment

This certainly seems good enough for a brain dump, given that dealing with this problem is currently not in scope for the project. Just a few relatively minor changes and I think this is done.

@bio-boris bio-boris requested a review from MrCreosote November 15, 2021 15:27
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
* `+` Simple solution, quick turnaround, fixes deadlock issue
* `-` Addresses only the deadlocking issue; UI still broken for regular runs and batch runs
* `-` A small number of users can take over the entire system
* `-` The calculations done by the apps will interfere with other apps and cause crashes/failures
Member

@MrCreosote MrCreosote Nov 15, 2021

I think these two are true if we increase the number of slots, but not if KBP apps have a separate queue. Maybe those two solutions should be separate with individual pro/con lists?

Collaborator Author

@MrCreosote Which two?

Collaborator Author

changed

@bio-boris bio-boris requested a review from MrCreosote November 15, 2021 21:06
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated

### Separate Queue for kbparallels apps that may or may not have its own limit on running jobs
* `+` Simple solution, quick turnaround, fixes deadlock issue
* `+` Requires minimal changes to ee2 and condor, if condor supports this feature
Member

What changes would it require?

Collaborator Author

* ee2: keep a list of apps that get submitted to that queue
* condor: somehow make the 2nd queue a consumable resource
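On the condor side, one existing mechanism that behaves like a consumable resource is HTCondor's concurrency limits. A sketch, where the limit name and cap are purely illustrative, not a decision:

```
# Central-manager config: cap the total kbparallels jobs cluster-wide.
KBPARALLEL_LIMIT = 20

# Per-job submit file: each such job consumes one unit of the limit
# while it runs.
concurrency_limits = KBPARALLEL
```

A per-user cap could presumably be approximated by having ee2 include the username in the limit name at submit time, though whether that fits here is an open question.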

Member

keep a list of apps that get submitted to that queue

can't you just specify that in the catalog?

somehow make the 2nd queue a consumable resource

I don't understand what this means - why would this queue be any different than any other queue?

Collaborator Author

You could specify it in the catalog and then make ee2 add a KBP_LIMIT if it has that QUEUE clientgroup, but if we want those KBP jobs to run in the NJS queue and limit the maximum number of those jobs, EE2 would need to keep a list of the apps it needs to add the KBP_LIMIT to.

As for making it a consumable resource: you don't want one user to take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we limit the number of jobs sent to the KBP queue, we might as well not make a KBP queue at all, because we can limit KBP jobs without them needing their own queue.

Member

As for making it a consumable resource: you don't want one user to take over the entire KBP queue, since those jobs wouldn't take up slots anymore. If we limit the number of jobs sent to the KBP queue, we might as well not make a KBP queue at all, because we can limit KBP jobs without them needing their own queue.

I'm still not quite grokking this - if we have a combined queue and a regular limit of 10 and a KBP limit of, say, 5, that means that a user running a lot of parallel jobs will be running at 50% capacity.

If instead we have a separate queue with a KBP limit of, say, 20, then the user can run at 100% capacity and run more KBP jobs at the same time. Theoretically we could also assign very low resources per KBP job (especially if we restructured the jobs to do most of the work on the regular queue).

Collaborator Author

@bio-boris bio-boris Nov 16, 2021

  • Modifying code inside of apps that use KBP is not in scope of this solution
  • What is a combined queue?

We have 60 cs nodes, which means we can run 60 jobs on 60 nodes right now.
If we converted 20 nodes to KBP nodes and had a limit of 20 KBP jobs per user, then 1 user could submit 20 KBP jobs and no one else could run a KBP job. All 20 KBP jobs would be set to running, but of those 20, only up to 10 could have a computation job running. The other 10 would be doing nothing, as the user's 10 slots are all taken up.
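Writing that arithmetic out, with the numbers taken from the scenario above:

```python
cs_nodes = 60                 # current CS nodes, 1 job per node
kbp_nodes = 20                # nodes converted to KBP nodes
kbp_limit_per_user = 20       # proposed per-user KBP job limit
regular_slots_per_user = 10   # existing per-user job slots

# One user submits 20 KBP jobs: every KBP node is now occupied,
# so nobody else can run a KBP job.
running_kbp = min(kbp_nodes, kbp_limit_per_user)            # 20

# Each running KBP job needs one of the user's regular slots for its
# computation job, so only 10 can actually make progress.
progressing = min(running_kbp, regular_slots_per_user)      # 10
idle_kbp = running_kbp - progressing                        # 10 doing nothing
```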

Member

Ok, gotcha, so it makes sense to make KBP_limit <= regular limit.

If we converted 20 nodes to KBP nodes and had a limit of 20 KBP jobs per user, then 1 user could submit 20 KBP jobs and no one else could run a KBP job.

This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.

Member

Modifying code inside of apps that use KBP is not in scope of this solution

Why not?

What is a combined queue?

I'm just talking about the case where KBP jobs and regular jobs run in the same queue.

Collaborator Author

Also FYI, having more than 1 job do something like combine stuff with DFU or upload things on the same cs node might cause too much disk pressure on these older machines, so that's why we changed to a max of 1 job per machine.

Collaborator Author

@bio-boris bio-boris Nov 16, 2021

This is where I'm wondering if we can break the 1 node : 1 job rule to allow more KBP jobs to run at the same time.

  • You sure can, but then you need to consider how many CPUs, how much memory, and how much disk each KBP job uses, and how many can fit on a node. If we do it wrong, the jobs will all affect each other, causing more user tickets complaining about jobs mysteriously crashing.
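For reference, the usual HTCondor mechanism for relaxing the 1 node : 1 job rule is a partitionable slot, where each job carves off only what it requests. A sketch of the startd config, purely illustrative, and exactly the sizing that has to be got right:

```
# Advertise the whole machine as one partitionable slot; jobs then
# claim resources via request_cpus / request_memory / request_disk.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Each KBP job's submit file would then need accurate resource requests, which is the hard part described above.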

Modifying code inside of apps that use KBP is not in scope of this solution

Well, you have to cut off the solution at some point before you decide it's time to enumerate a new solution.
It seems very complex to come up with 7 plans for modifying KBP apps so that they don't do any computation on the same machines and don't affect other jobs, and to weigh that against just ripping KBP out of those apps within this solution. It could be done in a new solution, though, and once that is complete, we could compare this solution to one that has these plans, with or without job limits and new queues.

I'm just talking about the case where KBP jobs and regular jobs run in the same queue.

KBP and regular jobs already run in the same queue; see the proposal that allows the jobs to run on their own nodes and also fixes deadlocking: "LIMIT KBP jobs to a maximum of 5 active jobs per user".

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
@bio-boris bio-boris requested a review from MrCreosote November 16, 2021 00:38
@MrCreosote
Member

Maybe we should have a quick call and talk this out rather than going back and forth any more.

@bio-boris bio-boris requested review from MrCreosote and removed request for MrCreosote November 16, 2021 16:27
@bio-boris
Collaborator Author

I thought of another way of doing it, so I changed it around.

@bio-boris bio-boris requested review from MrCreosote and removed request for MrCreosote November 18, 2021 21:13
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Member

@MrCreosote MrCreosote left a comment

Overall structure and contents look good; I still have some questions/comments re the details.

Comment thread docs/adrs/004-SplitAndAggregate.md
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
@bio-boris bio-boris requested a review from MrCreosote November 23, 2021 19:34
Member

@MrCreosote MrCreosote left a comment

Did a final read through and found a couple of minor things

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
## Alternatives Considered

* Ignore most issues and just limit apps that run kbparallels to N instances of kbparallels per user to avoid deadlocks
* Writing new ee2 endpoints to entirely handle batch execution and possibly use a DAG
Member

I think this line should be deleted, right?

Comment thread docs/adrs/004-SplitAndAggregate.md Outdated
8) The *Job Manager* returns the reference to the results of the *Report Job*

Pros/Cons
* `+` On an as needed basis, would have to rewrite apps that use KBP to use this new paradigm
Member

I think this is a con, but less of a con than having to rewrite all apps that use KBP at once

Member

@MrCreosote MrCreosote left a comment

LGTM

@bio-boris bio-boris merged commit 77f1416 into develop Nov 23, 2021
@bio-boris bio-boris deleted the DATAUP-262 branch January 19, 2022 20:59