[batch] Job groups transient error causing a 400 to the user #14413

Open
jigold opened this issue Mar 14, 2024 · 0 comments

jigold commented Mar 14, 2024

What happened?

____________ test_submit_new_job_groups_after_a_group_was_cancelled ____________

client = <hailtop.batch_client.client.BatchClient object at 0x7f0ab0311790>

    def test_submit_new_job_groups_after_a_group_was_cancelled(client: BatchClient):
        b = create_batch(client)
        g1 = b.create_job_group()
        g1.create_job(DOCKER_ROOT_IMAGE, ['true'])
        b.submit()
        g1.cancel()
        g2 = b.create_job_group()
        g2.create_job(DOCKER_ROOT_IMAGE, ['true'])
>       b.submit()

io/test/test_batch.py:1989:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
usr/local/lib/python3.9/dist-packages/hailtop/batch_client/client.py:347: in submit
    async_to_blocking(self._async_batch.submit(*args, **kwargs))
usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py:186: in async_to_blocking
    raise exc
usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py:181: in async_to_blocking
    return loop.run_until_complete(task)
usr/local/lib/python3.9/dist-packages/nest_asyncio.py:98: in run_until_complete
    return f.result()
usr/lib/python3.9/asyncio/futures.py:201: in result
    raise self._exception
usr/lib/python3.9/asyncio/tasks.py:256: in __step
    result = coro.send(None)
usr/local/lib/python3.9/dist-packages/hailtop/batch_client/aioclient.py:1214: in submit
    start_job_group_id, start_job_id = await self._submit(
usr/local/lib/python3.9/dist-packages/hailtop/batch_client/aioclient.py:1180: in _submit
    start_job_group_id, start_job_id = await self._update_fast(
usr/local/lib/python3.9/dist-packages/hailtop/batch_client/aioclient.py:988: in _update_fast
    resp = await self._client._post(
usr/local/lib/python3.9/dist-packages/hailtop/batch_client/aioclient.py:1290: in _post
    return await self._session.post(self.url + path, data=data, json=json, headers=self._headers)
usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/common/session.py:28: in post
    return await self.request('POST', url, **kwargs)
usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/common/session.py:105: in request
    return await retry_transient_errors(self._request_with_valid_authn, method, url, **kwargs)
usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py:780: in retry_transient_errors
    return await retry_transient_errors_with_debug_string('', 0, f, *args, **kwargs)
usr/local/lib/python3.9/dist-packages/hailtop/utils/utils.py:796: in retry_transient_errors_with_debug_string
    return await f(*args, **kwargs)
usr/local/lib/python3.9/dist-packages/hailtop/aiocloud/common/session.py:117: in _request_with_valid_authn
    return await self._http_session.request(method, url, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    async def request_and_raise_for_status():
        json_data = kwargs.pop('json', None)
        if json_data is not None:
            if kwargs.get('data') is not None:
                raise ValueError('data and json parameters cannot be used at the same time')
            kwargs['data'] = aiohttp.BytesPayload(
                value=orjson.dumps(json_data),
                # https://github.com/ijl/orjson#serialize
                #
                # "The output is a bytes object containing UTF-8"
                encoding="utf-8",
                content_type="application/json",
            )
        resp = await self.client_session._request(method, url, **kwargs)
        if raise_for_status:
            if resp.status >= 400:
                # reason should always be not None for a started response
                assert resp.reason is not None
                body = (await resp.read()).decode()
                await resp.release()
>               raise ClientResponseError(
                    resp.request_info,
                    resp.history,
                    status=resp.status,
                    message=resp.reason,
                    headers=resp.headers,
                    body=body,
                )
E               hailtop.httpx.ClientResponseError: 400, message='Bad Request', url=URL('http://internal.hail/pr-14351-default-yojxd4mck4io/batch/api/v1alpha/batches/321/update-fast') body="400: error while inserting job group 1 into batch 321: (1213, 'Deadlock found when trying to get lock; try restarting transaction')"
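
MySQL error 1213 in the response body above asks the caller to restart the transaction, so one possible direction is to retry the insert on the server instead of returning a 400. The following is only a minimal sketch of that idea, not the Batch service's actual code: `run_with_deadlock_retry`, `is_retryable`, and the backoff parameters are hypothetical names introduced for illustration, and it assumes the driver raises an exception whose first argument is the MySQL error code, as the `(1213, '...')` tuple in the message suggests.

```python
import random
import time

# MySQL error codes for which restarting the transaction may succeed:
# 1213 = deadlock (the code in the error body above), 1205 = lock wait timeout.
RETRYABLE_MYSQL_CODES = (1213, 1205)


def is_retryable(exc):
    """Assumption: the driver puts the MySQL error code in exc.args[0]."""
    return bool(exc.args) and exc.args[0] in RETRYABLE_MYSQL_CODES


def run_with_deadlock_retry(txn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn() and re-run it from the start if it fails with a retryable code.

    txn is a hypothetical zero-argument callable that runs one whole
    transaction, so a retry never resumes a half-finished one.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            # Re-raise anything that is not a deadlock-style error, and give up
            # once the attempt budget is spent.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing writers spread out.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

With a wrapper along these lines around the job group insert, a deadlock would only reach the user after several attempts have failed, at which point a 5xx status may describe it better than a 400.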

Version

0.2.128

Relevant log output

No response

@jigold added the needs-triage and batch labels Mar 14, 2024
@daniel-goldstein added the bug label and removed the needs-triage label Mar 26, 2024