[batch] Add Job Groups to Batch #14282

Merged
merged 149 commits into from
Feb 26, 2024

@jigold jigold commented Feb 11, 2024

This PR adds the job groups functionality described in this RFC to the Batch backend and hailtop.batch_client, including support for nested job groups up to a maximum depth of 5. Note that none of these changes are user-facing yet (hence no change log entry here).

The PRs that came before this one:

Subsequent PRs will need to implement the following:

  • Querying job groups with the flexible query language (v2)
  • Implementing job groups in the Scala Client for QoB
  • Using job groups in QoB with cancel_after_n_failures=1 for all new stages of worker jobs
  • UI functionality to page and sort through job groups
  • A new hailtop.batch interface for users to define and work with Job Groups

A couple of nuances in the implementation came up that I also tried to articulate in the RFC:

  1. The root job group (ID = 0) does not belong to an update ("update_id" IS NULL). Any check for "committed" job groups therefore needs the condition (batch_updates.committed OR job_groups.job_group_id = %s), where "%s" is the ROOT_JOB_GROUP_ID.
  2. When a job group is cancelled, only that specific job group is inserted into job_groups_cancelled. The table does NOT contain the descendant job groups that were cancelled indirectly as a result. The reason is that we cannot guarantee a user won't have millions of job groups, and we can't insert millions of records inside a single SQL stored procedure. As a consequence, any query on the driver / front_end must look up the tree and check whether any ancestor has been cancelled. This code looks similar to the code below [1].
  3. There used to be DELETE FROM statements in commit_batch_update and commit_batch that cleaned up records that were no longer used in job_group_inst_coll_cancellable_resources and job_groups_inst_coll_staging. This cleanup now occurs in a periodic loop on the driver.
  4. The job_group_inst_coll_cancellable_resources and job_groups_inst_coll_staging tables hold values that represent the sum over a job group and all of its descendant job groups. For example, if a job group has 1 job and its child job group has 2 jobs, then the staging table has n_jobs = 3 for the parent job group and n_jobs = 2 for the child job group. Likewise, all of the billing triggers and MJC have to use the job_group_self_and_ancestors table to modify the job group the job belongs to as well as its ancestor job groups.
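
To make the arithmetic in item 4 concrete, here is a minimal in-memory sketch in Python. It is not the actual trigger or stored-procedure code: the `parents` mapping and both function names are hypothetical and only stand in for what the job_group_self_and_ancestors table provides.

```python
from collections import defaultdict


def self_and_ancestors(parents: dict[int, int], job_group_id: int) -> list[int]:
    """Return job_group_id followed by each of its ancestors.

    `parents` maps a job group ID to its parent's ID (hypothetical stand-in
    for the job_group_self_and_ancestors table).
    """
    chain = [job_group_id]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def aggregate_n_jobs(parents: dict[int, int], own_jobs: dict[int, int]) -> dict[int, int]:
    """Credit each job group's own jobs to itself and to every ancestor."""
    totals: dict[int, int] = defaultdict(int)
    for job_group_id, n_jobs in own_jobs.items():
        for group in self_and_ancestors(parents, job_group_id):
            totals[group] += n_jobs
    return dict(totals)


# Job group 1 has 1 job of its own; its child, job group 2, has 2 jobs.
parents = {2: 1}
own_jobs = {1: 1, 2: 2}
totals = aggregate_n_jobs(parents, own_jobs)
print(totals)  # {1: 3, 2: 2}
```

The parent ends up with n_jobs = 3 and the child with n_jobs = 2, matching the example in item 4.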

[1] Code to check whether a job group has been cancelled.

SELECT job_groups.*,
  cancelled_t.cancelled IS NOT NULL AS cancelled
FROM job_groups
LEFT JOIN LATERAL (
  SELECT 1 AS cancelled
  FROM job_group_self_and_ancestors
  INNER JOIN job_groups_cancelled
    ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
      job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
  WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
    job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
) AS cancelled_t ON TRUE
WHERE ...
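
The same lookup can be sketched in plain Python to show why storing only the directly cancelled job group is enough. This is an illustration under assumptions, not code from the PR: `parents` and `directly_cancelled` are hypothetical in-memory stand-ins for job_group_self_and_ancestors and job_groups_cancelled.

```python
def is_cancelled(job_group_id: int, parents: dict[int, int], directly_cancelled: set[int]) -> bool:
    """A job group counts as cancelled if it, or any ancestor, was cancelled directly."""
    current = job_group_id
    while True:
        if current in directly_cancelled:
            return True
        if current not in parents:
            # Reached the top of the tree without finding a cancelled group.
            return False
        current = parents[current]


# Hypothetical tree: 0 is the root, 1 and 3 are its children, 2 is a child of 1.
parents = {1: 0, 2: 1, 3: 0}
# Only job group 1 was cancelled explicitly; nothing is recorded for job group 2.
directly_cancelled = {1}

print(is_cancelled(2, parents, directly_cancelled))  # True, via its parent 1
print(is_cancelled(3, parents, directly_cancelled))  # False
```

The cost moves from write time (one row per cancellation) to read time (one walk up the tree per query), which stays bounded because nesting depth is capped.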

jigold commented Feb 15, 2024

Tests are all passing again.

@danking danking left a comment

Finished at last! I think these are pretty minor changes. Let's try to merge this afternoon?

@@ -23,6 +25,8 @@

deploy_config = get_deploy_config()

MAX_JOB_GROUP_NESTING_DEPTH = 2
Contributor

Can we import this from constants.py? I strongly prefer one source of truth.


assert len(debug_info['jobs']) == 1, str(debug_info)
assert len(list(jg.jobs())) == 1, str(debug_info)
assert jg.attributes()['name'] == 'foo', str(debug_info)
Contributor

Testing debug_info distracts from the important assertions of this test. Debug info is not a building block of any public API. It's meant only as a standard (possibly expensive to obtain) collection of information useful when a test fails.
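
One way to read this suggestion, sketched with a hypothetical stand-in class (`FakeJobGroup` is not the hailtop.batch_client API): assert on the public accessors and pass debug info only as the assertion message. Python evaluates the message expression only when the assertion fails, so the possibly expensive call is skipped on the passing path.

```python
class FakeJobGroup:
    """Hypothetical stand-in for a job group handle, for illustration only."""

    def __init__(self, jobs, attributes):
        self._jobs = jobs
        self._attributes = attributes
        self.debug_calls = 0  # counts how often debug_info() is invoked

    def jobs(self):
        return iter(self._jobs)

    def attributes(self):
        return self._attributes

    def debug_info(self):
        self.debug_calls += 1
        return {'jobs': list(self._jobs), 'attributes': self._attributes}


jg = FakeJobGroup(jobs=['job-1'], attributes={'name': 'foo'})

# The assertions exercise the public accessors; debug_info() appears only
# in the failure message.
assert len(list(jg.jobs())) == 1, str(jg.debug_info())
assert jg.attributes()['name'] == 'foo', str(jg.debug_info())

print(jg.debug_calls)  # 0: both assertions passed, so debug_info() never ran
```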

assert len(job_groups) == 1, str(job_groups)
assert job_groups[0].attributes()['name'] == 'foo', str(job_groups)
assert len(jobs) == 1, str(jobs)
b.cancel()
Contributor

What's the function of this? We don't have any assertions after it (so we're not testing it as an operation) but we also don't try-finally it (so it's not meant as a cleanup step).
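
The two options named here can be sketched as follows. `FakeBatch` is a hypothetical stand-in with a made-up `state` attribute, not the real client; the point is only the shape of each test.

```python
class FakeBatch:
    """Hypothetical stand-in for a batch handle, for illustration only."""

    def __init__(self):
        self.state = 'running'

    def cancel(self):
        self.state = 'cancelled'


def check_cancel_as_operation() -> FakeBatch:
    """Option 1: cancel is the operation under test, so assert on its effect."""
    b = FakeBatch()
    b.cancel()
    assert b.state == 'cancelled', b.state
    return b


def check_with_cleanup() -> FakeBatch:
    """Option 2: cancel is cleanup, so it runs in finally even if an assertion fails."""
    b = FakeBatch()
    try:
        assert b.state == 'running', b.state
    finally:
        b.cancel()
    return b


print(check_cancel_as_operation().state)  # cancelled
print(check_with_cleanup().state)  # cancelled
```

A bare cancel() at the end of a test does neither: it is skipped when an earlier assertion fails, and its effect is never checked when the test passes.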

b.submit()
job_groups = list(b.job_groups())
# need to include the initial job group created
assert len(job_groups) == max_bunch_size + 2, str(job_groups)
Contributor

The presence of a comment suggests to me that we haven't made the test code clear enough. Maybe a blank line after 1894 is enough?

There's also the extremely explicit option:

n_groups = 0

# ...
b.create_job_group(...)
n_groups += 1

for i in range(..):
    # ...
    n_groups += 1
# ...
assert n_groups == max_bunch_size + 2
assert len(job_groups) == n_groups

job_groups = list(b.job_groups())
assert len(job_groups) == 1, str(job_groups)
jobs = list(b.jobs())
assert len(jobs) == 4, str(jobs)
Contributor

It seems worthwhile to assert that JobGroup.jobs is empty. A separate test for that also seems fine! We only seem to test JobGroup.jobs in the first test here, and only in the case where all the jobs in the batch are in the group.

Contributor Author

I'm losing track, but I think this test addresses it:

def test_job_group_creation_with_no_jobs_but_batch_is_not_empty(client: BatchClient):
    b = create_batch(client)
    jg = b.create_job_group(attributes={'name': 'foo'})
    for _ in range(4):
        b.create_job(DOCKER_ROOT_IMAGE, ['true'])
    b.submit()

    job_groups = list(b.job_groups())
    assert len(job_groups) == 1, str(job_groups)

    jobs = list(b.jobs())
    assert len(jobs) == 4, str(jobs)

    assert len(list(jg.jobs())) == 0, str(jg.debug_info())
    assert len(list(jg.job_groups())) == 0, str(jg.debug_info())

assert len(jobs) == 1, str(jg.debug_info())
assert len(job_groups) == 0, str(jg.debug_info())

await jg.cancel()
Contributor

Needs to be in a finally or needs an assertion after it

cancel_after_n_failures=cancel_after_n_failures,
)

# FIXME Error if this is called while in a job within the same job group
Contributor

Resolve the FIXME or create an issue.

Contributor Author

if i < 64:
    i = i + 1

# FIXME Error if this is called while in a job within the same job group
Contributor

Resolve the FIXME or create an issue.

Contributor Author

@@ -464,7 +699,7 @@ async def _wait(
if i < 64:
    i = i + 1

- # FIXME Error if this is called while within a job of the same Batch
+ # FIXME Error if this is called while in a job within the same Batch
Contributor

These FIXMEs should never have been in here in the first place. Let's either create an issue or address them in this PR.

Contributor Author

Contributor Author

Do you want the link to the issue in the comment or no comment at all?

@@ -1 +1,3 @@
ROOT_JOB_GROUP_ID = 0
Contributor

ROOT_JOB_GROUP_ID should be defined in exactly one place. Since we need it both client-side and server-side, that place has to be hailtop.

jigold commented Feb 22, 2024

On Azure, one of the tests timed out with 500 responses from the server. I'll need to debug in GCP, but the PR queue is long right now.

@jigold jigold force-pushed the the-job-groups-branch-fast-cancel branch from 5f4d439 to d5574c1 Compare February 22, 2024 21:41
danking commented Feb 23, 2024

@jigold everything passing, shall we merge now?

jigold commented Feb 23, 2024

I removed the transaction from the migration because we need those triggers to be committed even if the migration fails partway through changing the table primary keys.

@danking danking left a comment

lgtm

@danking danking merged commit 13de4e6 into hail-is:main Feb 26, 2024
3 checks passed