Skip to content

[qob] Add ability to attach to existing batch#14829

Merged
hail-ci-robot merged 3 commits into
hail-is:mainfrom
jigold:init-service-existing-batch
Apr 30, 2025
Merged

[qob] Add ability to attach to existing batch#14829
hail-ci-robot merged 3 commits into
hail-is:mainfrom
jigold:init-service-existing-batch

Conversation

@jigold

@jigold jigold commented Mar 4, 2025

Copy link
Copy Markdown
Contributor

The goal of this change is to be able to group QoB queries together to allow for
easier visibility of how much a pipeline costs as well as make the batches page
less cluttered so it's easier to tell what is running.

Adds batch_id parameter to hl.init that, when used with the 'batch' backend,
allows users to add query-on-batch jobs to an existing batch. Note that racing two
or more QoB jobs concurrently in the same batch will likely fail. Instead, users
should add dependences between query jobs so that they're scheduled consecutively.

This change has low security impact as it only modifies client code: users cannot
attach to batches that they do not already have access to.

if batch_client is None:
batch_client = await BatchClient.create(billing_project, _token=credentials_token)
async_exit_stack.push_async_callback(batch_client.close)
batch_attributes: Dict[str, str] = dict()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was unused.

@ehigham ehigham left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await self._batch.submit(disable_progress_bar=True)
self._batch_was_submitted = True
with timings.step("wait driver"):
try:
await asyncio.sleep(0.6) # it is not possible for the batch to be finished in less than 600ms
await self._batch.wait(
description=name,
disable_progress_bar=self.disable_progress_bar,
progress=progress,
starting_job=j.job_id,
)

^ these line makes me scared. What happens if this is submitted from an existing batch that waits for this job. Is this effectively a deadlock? I'm guessing yes and the solution is job groups in the batch client...

@jigold

jigold commented Mar 4, 2025

Copy link
Copy Markdown
Contributor Author

To my understanding and a quick glance at the code, each new query is in its own job group and the driver waits on jobs within it's child job group.

job_group=self._batch.create_job_group(attributes={'name': name}),

I could be wrong though and it's a legitimate concern.

@jigold

jigold commented Mar 17, 2025

Copy link
Copy Markdown
Contributor Author

I pushed the changes I made for my GLIMPSE pipeline. Feel free to either adopt this or close it.

@jigold

jigold commented Mar 20, 2025

Copy link
Copy Markdown
Contributor Author

This code works fine if you're only attaching one query at a time. Otherwise, when I tried to submit new queries for each chromosome simultaneously, it fails because the job group specs weren't submitted in order.

Relevant code in front_end.py. I cannot remember why we had this requirement here. The batch updates table has the job group ID reservations with the start_job_group_id. There might have been some other reason I can't remember. This isn't a blocker for me, but is a bit disappointing.

        next_job_group_id = start_job_group_id + job_group_specs[0]['job_group_id'] - 1
        if next_job_group_id != last_inserted_job_group_id['job_group_id'] + 1:
            raise web.HTTPBadRequest(reason='job group specs were not submitted in order')

@jigold

jigold commented Mar 31, 2025

Copy link
Copy Markdown
Contributor Author

@ehigham See my comment about the job groups above.

@ehigham ehigham force-pushed the init-service-existing-batch branch from c40c084 to 646b639 Compare March 31, 2025 21:26
@ehigham

ehigham commented Apr 1, 2025

Copy link
Copy Markdown
Member

@ehigham See my comment about the job groups above.

Thanks Jackie. I remember glancing over the front-end code while re-writing how the query driver executes job groups. There I just used the start_job_group_id the batch service gave me when creating an update with one job group. I wonder if the python code here should do the same and use the inner workings of the batch client.

@jigold

jigold commented Apr 1, 2025

Copy link
Copy Markdown
Contributor Author

I vaguely remember the reason for that assertion was because we didn't want a parent job group to be submitted after its child because of how a SQL procedure was written when committing the batch. I'm not sure if the assertion was still needed with the final code implementation for job groups as the original design was more complicated.

@ehigham ehigham force-pushed the init-service-existing-batch branch 2 times, most recently from 900bd64 to dadd137 Compare April 2, 2025 15:22
@ehigham ehigham force-pushed the init-service-existing-batch branch from dadd137 to 81c8557 Compare April 16, 2025 15:51
@ehigham ehigham requested a review from patrick-schultz April 16, 2025 15:53
@ehigham ehigham force-pushed the init-service-existing-batch branch from 81c8557 to 004d13b Compare April 16, 2025 17:31
@jigold

jigold commented Apr 16, 2025

Copy link
Copy Markdown
Contributor Author

Did you decide to go ahead with this despite the potential problems if submitting jobs to the same batch simultaneously?

@ehigham ehigham force-pushed the init-service-existing-batch branch from 004d13b to afeabc5 Compare April 16, 2025 21:13
@ehigham

ehigham commented Apr 16, 2025

Copy link
Copy Markdown
Member

Did you decide to go ahead with this despite the potential problems if submitting jobs to the same batch simultaneously?

Yes. I'll add a release note that explains this limitation. I think it would be a larger lift getting batch clients to use relative job group ids instead of absolute ids.

@ehigham ehigham force-pushed the init-service-existing-batch branch 3 times, most recently from a1060da to d837b75 Compare April 17, 2025 16:07
@ehigham

ehigham commented Apr 18, 2025

Copy link
Copy Markdown
Member

This code works fine if you're only attaching one query at a time. Otherwise, when I tried to submit new queries for each chromosome simultaneously, it fails because the job group specs weren't submitted in order.

Relevant code in front_end.py. I cannot remember why we had this requirement here. The batch updates table has the job group ID reservations with the start_job_group_id. There might have been some other reason I can't remember. This isn't a blocker for me, but is a bit disappointing.

        next_job_group_id = start_job_group_id + job_group_specs[0]['job_group_id'] - 1
        if next_job_group_id != last_inserted_job_group_id['job_group_id'] + 1:
            raise web.HTTPBadRequest(reason='job group specs were not submitted in order')

I'm not following what the limitation is here - I created this test based on what I thought you meant but it's passing so I've clearly misunderstood. Can you help me understand the problem?

def new_query_in_batch_job(b: Batch, name: str) -> Job:
# creates a query-on-batch job in the current batch
backend = b._backend
assert isinstance(backend, ServiceBackend)
run_query_pipeline = textwrap.dedent(
f"""
hailctl config set batch/remote_tmpdir {backend.remote_tmpdir}
hailctl config set batch/billing_project {backend._billing_project}
cat << EOF | python3
from os import getenv
import hail as hl
hl.init(backend='batch', batch_id=int(getenv('HAIL_BATCH_ID')))
hl.utils.range_table(2356)._force_count()
EOF
""",
)
j = b.new_bash_job(name=name)
j.command(run_query_pipeline)
return j
def test_submit_query_on_batch_pipeline(request, service_backend: ServiceBackend):
b = Batch(request.node.nodeid, service_backend, default_image=HAIL_GENETICS_HAIL_IMAGE)
ja = new_query_in_batch_job(b, 'Query Pipeline A')
jb = new_query_in_batch_job(b, 'Query Pipeline B')
jc = new_query_in_batch_job(b, 'Query Pipeline C')
jc.depends_on(ja, jb)
r = b.run()
assert r is not None
status = r.status()
assert status['state'] == 'success', str((status, r.debug_info()))

@jigold

jigold commented Apr 18, 2025

Copy link
Copy Markdown
Contributor Author

Your test is serial. I bet if you did a lot of batches (you don't need more than one job per batch) using async, it would probably trigger it. This is roughly what needs to happen.

def test_submit_query_on_batch_pipeline(request, service_backend: ServiceBackend):
    b = Batch(request.node.nodeid, service_backend, default_image=HAIL_GENETICS_HAIL_IMAGE)

    # Not sure how you'd get asyncio code working here
    # new query_in_batch_job should instantiate a new batch object, submit a true job each and then call b.submit(). It's the same as you calling hl.init() in separate threads or processes and then submitting a query.

    queries = await asyncio.gather([new_query_in_batch_job(b, 'Query Pipeline A'), new_query_in_batch_job(b, 'Query Pipeline B'), new_query_in_batch_job(b, 'Query Pipeline C')])

@ehigham

ehigham commented Apr 18, 2025

Copy link
Copy Markdown
Member

Your test is serial. I bet if you did a lot of batches (you don't need more than one job per batch) using async, it would probably trigger it. This is roughly what needs to happen.

def test_submit_query_on_batch_pipeline(request, service_backend: ServiceBackend):
    b = Batch(request.node.nodeid, service_backend, default_image=HAIL_GENETICS_HAIL_IMAGE)

    # Not sure how you'd get asyncio code working here
    # new query_in_batch_job should instantiate a new batch object, submit a true job each and then call b.submit(). It's the same as you calling hl.init() in separate threads or processes and then submitting a query.

    queries = await asyncio.gather([new_query_in_batch_job(b, 'Query Pipeline A'), new_query_in_batch_job(b, 'Query Pipeline B'), new_query_in_batch_job(b, 'Query Pipeline C')])

I'm defining the batch sequentially yes, but wont what I've called Query Pipeline A and B race each other in creating new job groups?

@ehigham ehigham force-pushed the init-service-existing-batch branch from d837b75 to 43a65b8 Compare April 18, 2025 17:38
@ehigham

ehigham commented Apr 18, 2025

Copy link
Copy Markdown
Member

This code works fine if you're only attaching one query at a time. Otherwise, when I tried to submit new queries for each chromosome simultaneously, it fails because the job group specs weren't submitted in order.
Relevant code in front_end.py. I cannot remember why we had this requirement here. The batch updates table has the job group ID reservations with the start_job_group_id. There might have been some other reason I can't remember. This isn't a blocker for me, but is a bit disappointing.

        next_job_group_id = start_job_group_id + job_group_specs[0]['job_group_id'] - 1
        if next_job_group_id != last_inserted_job_group_id['job_group_id'] + 1:
            raise web.HTTPBadRequest(reason='job group specs were not submitted in order')

I'm not following what the limitation is here - I created this test based on what I thought you meant but it's passing so I've clearly misunderstood. Can you help me understand the problem?

def new_query_in_batch_job(b: Batch, name: str) -> Job:
# creates a query-on-batch job in the current batch
backend = b._backend
assert isinstance(backend, ServiceBackend)
run_query_pipeline = textwrap.dedent(
f"""
hailctl config set batch/remote_tmpdir {backend.remote_tmpdir}
hailctl config set batch/billing_project {backend._billing_project}
cat << EOF | python3
from os import getenv
import hail as hl
hl.init(backend='batch', batch_id=int(getenv('HAIL_BATCH_ID')))
hl.utils.range_table(2356)._force_count()
EOF
""",
)
j = b.new_bash_job(name=name)
j.command(run_query_pipeline)
return j
def test_submit_query_on_batch_pipeline(request, service_backend: ServiceBackend):
b = Batch(request.node.nodeid, service_backend, default_image=HAIL_GENETICS_HAIL_IMAGE)
ja = new_query_in_batch_job(b, 'Query Pipeline A')
jb = new_query_in_batch_job(b, 'Query Pipeline B')
jc = new_query_in_batch_job(b, 'Query Pipeline C')
jc.depends_on(ja, jb)
r = b.run()
assert r is not None
status = r.status()
assert status['state'] == 'success', str((status, r.debug_info()))

That test does indeed capture the problem

@ehigham ehigham force-pushed the init-service-existing-batch branch 2 times, most recently from 20dd655 to ad0c859 Compare April 18, 2025 20:39
@ehigham ehigham force-pushed the init-service-existing-batch branch from a705f96 to 9c93b06 Compare April 28, 2025 15:51
@ehigham ehigham self-assigned this Apr 28, 2025
@ehigham ehigham requested review from chrisvittal and removed request for patrick-schultz April 30, 2025 15:30

@chrisvittal chrisvittal left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the enhancement!

@hail-ci-robot hail-ci-robot merged commit 8997444 into hail-is:main Apr 30, 2025
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants