
A batch2 job could fail to transition to FAILED state when using MS SQL Server #5230


Closed
mrdnctrk opened this issue Aug 23, 2023 · 0 comments · Fixed by #5233
mrdnctrk (Collaborator) commented Aug 23, 2023

Describe the bug
Updating the workchunk status on a failure for a batch2 job fails when using MS SQL Server, causing the job to remain in the IN_PROGRESS state. This prevents the job from transitioning to the FAILED state after the maximum number of retries is exhausted.

This happens on SQL Server; it works fine on H2 and Postgres.

To Reproduce
We will use a delete expunge operation (which is executed as a batch2 job) to reproduce the issue.

  1. Create a bunch of Patient resources.
  2. Connect to your MS SQL Server using a tool such as DBeaver, but set the connection's Transaction Isolation Level to Repeatable Read as shown in the screenshot.
  3. Execute the following script in DBeaver. It reads Patient resources in a transaction but deliberately does not commit or roll back. After you execute the script, leave the DB connection open in DBeaver (don't close or disconnect). Because the isolation level was set to Repeatable Read in the previous step, no other connection will be able to modify Patient resources until this transaction terminates. This will fail the delete expunge operation we invoke next, because the job won't be able to delete the Patient resources. (A cleanup sketch for releasing this lock afterwards follows this list.)
BEGIN TRANSACTION;
SELECT * FROM HFJ_RESOURCE hr WHERE RES_TYPE = 'Patient';
  4. Send a delete expunge request for Patients, e.g. DELETE http://{your-server-address}/Patient/?_expunge=true
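
When you are finished reproducing the issue, the blocking transaction can be released from the same DBeaver session. A minimal cleanup sketch (either COMMIT or ROLLBACK works here, since the script only reads):

-- Run this in the same DBeaver connection that started the open transaction,
-- so the Repeatable Read locks on HFJ_RESOURCE are released.
ROLLBACK TRANSACTION;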

The delete expunge job will get stuck in the IN_PROGRESS state and will not transition to FAILED after the maximum number of retries is reached. The job will keep retrying the work in the background. You can see that the work is being retried by looking at the logs, which will contain errors similar to the following for each retry:

ERROR M: R: o.h.e.jdbc.spi.SqlExceptionHelper - The query has timed out.
INFO  M: R: c.uhn.fhir.log.batch_troubleshooting - Temporary problem executing job DELETE_EXPUNGE step expunge, marking chunk 0bce69c5-9844-4403-ad54-e9ba1b1f3079 as retriable ERRORED
WARN  M: R: o.h.e.jdbc.spi.SqlExceptionHelper - SQL Error: 245, SQLState: S0001
ERROR M: R: o.h.e.jdbc.spi.SqlExceptionHelper - Conversion failed when converting the varchar value 'Too many errors: ' to data type int.
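
The last log line points at how SQL Server handles the + operator when one operand is a varchar and the other is an int: because int has higher data type precedence, SQL Server tries to convert the varchar to int rather than concatenating. As a rough illustration of the mechanism only (this is not the actual statement HAPI FHIR generates), the following reproduces the same error directly:

-- Illustrative only: '+' between a varchar literal and an int makes SQL Server
-- attempt an int conversion of the varchar instead of string concatenation.
SELECT 'Too many errors: ' + 5;
-- Msg 245, Level 16: Conversion failed when converting the varchar value
-- 'Too many errors: ' to data type int.

Whether this is the exact mechanism behind the failing update is an assumption on my part; see the linked Hibernate issue under Additional context for related background.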

Expected behavior
The job should transition to the FAILED state after the maximum number of retries, instead of getting stuck in IN_PROGRESS.

Screenshots
[Screenshot 2023-08-23 at 11.27.24 AM: DBeaver connection settings with Transaction Isolation Level set to Repeatable Read]

Environment (please complete the following information):

  • HAPI FHIR Version: 6.8.0
  • DB: MS SQL Server

Additional context

Related issue on Hibernate: https://hibernate.atlassian.net/jira/software/c/projects/HHH/issues/HHH-3627
