[qob] Add ability to attach to existing batch#14829
Conversation
| if batch_client is None: | ||
| batch_client = await BatchClient.create(billing_project, _token=credentials_token) | ||
| async_exit_stack.push_async_callback(batch_client.close) | ||
| batch_attributes: Dict[str, str] = dict() |
There was a problem hiding this comment.
This code was unused.
There was a problem hiding this comment.
hail/hail/python/hail/backend/service_backend.py
Lines 404 to 415 in 2937ed8
^ these line makes me scared. What happens if this is submitted from an existing batch that waits for this job. Is this effectively a deadlock? I'm guessing yes and the solution is job groups in the batch client...
|
To my understanding and a quick glance at the code, each new query is in its own job group and the driver waits on jobs within it's child job group. I could be wrong though and it's a legitimate concern. |
|
I pushed the changes I made for my GLIMPSE pipeline. Feel free to either adopt this or close it. |
|
This code works fine if you're only attaching one query at a time. Otherwise, when I tried to submit new queries for each chromosome simultaneously, it fails because the job group specs weren't submitted in order. Relevant code in front_end.py. I cannot remember why we had this requirement here. The batch updates table has the job group ID reservations with the start_job_group_id. There might have been some other reason I can't remember. This isn't a blocker for me, but is a bit disappointing. next_job_group_id = start_job_group_id + job_group_specs[0]['job_group_id'] - 1
if next_job_group_id != last_inserted_job_group_id['job_group_id'] + 1:
raise web.HTTPBadRequest(reason='job group specs were not submitted in order') |
|
@ehigham See my comment about the job groups above. |
c40c084 to
646b639
Compare
Thanks Jackie. I remember glancing over the front-end code while re-writing how the query driver executes job groups. There I just used the start_job_group_id the batch service gave me when creating an update with one job group. I wonder if the python code here should do the same and use the inner workings of the batch client. |
|
I vaguely remember the reason for that assertion was because we didn't want a parent job group to be submitted after its child because of how a SQL procedure was written when committing the batch. I'm not sure if the assertion was still needed with the final code implementation for job groups as the original design was more complicated. |
900bd64 to
dadd137
Compare
dadd137 to
81c8557
Compare
81c8557 to
004d13b
Compare
|
Did you decide to go ahead with this despite the potential problems if submitting jobs to the same batch simultaneously? |
004d13b to
afeabc5
Compare
Yes. I'll add a release note that explains this limitation. I think it would be a larger lift getting batch clients to use relative job group ids instead of absolute ids. |
a1060da to
d837b75
Compare
I'm not following what the limitation is here - I created this test based on what I thought you meant but it's passing so I've clearly misunderstood. Can you help me understand the problem? hail/hail/python/test/hailtop/batch/test_batch_service_backend.py Lines 809 to 845 in d837b75 |
|
Your test is serial. I bet if you did a lot of batches (you don't need more than one job per batch) using async, it would probably trigger it. This is roughly what needs to happen. def test_submit_query_on_batch_pipeline(request, service_backend: ServiceBackend):
b = Batch(request.node.nodeid, service_backend, default_image=HAIL_GENETICS_HAIL_IMAGE)
# Not sure how you'd get asyncio code working here
# new query_in_batch_job should instantiate a new batch object, submit a true job each and then call b.submit(). It's the same as you calling hl.init() in separate threads or processes and then submitting a query.
queries = await asyncio.gather([new_query_in_batch_job(b, 'Query Pipeline A'), new_query_in_batch_job(b, 'Query Pipeline B'), new_query_in_batch_job(b, 'Query Pipeline C')]) |
I'm defining the batch sequentially yes, but wont what I've called |
d837b75 to
43a65b8
Compare
That test does indeed capture the problem |
20dd655 to
ad0c859
Compare
a705f96 to
9c93b06
Compare
chrisvittal
left a comment
There was a problem hiding this comment.
Thanks for the enhancement!
The goal of this change is to be able to group QoB queries together to allow for
easier visibility of how much a pipeline costs as well as make the batches page
less cluttered so it's easier to tell what is running.
Adds
batch_idparameter tohl.initthat, when used with the'batch'backend,allows users to add query-on-batch jobs to an existing batch. Note that racing two
or more QoB jobs concurrently in the same batch will likely fail. Instead, users
should add dependences between query jobs so that they're scheduled consecutively.
This change has low security impact as it only modifies client code: users cannot
attach to batches that they do not already have access to.