SqlException Transaction (Process ID 66) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. #193

Open
yehiyam opened this issue Oct 1, 2023 · 4 comments
Labels
P2 Priority 2

Comments

yehiyam commented Oct 1, 2023

Hi all,
We are using Durable Functions in a standalone process (not Azure Functions) in a Kubernetes cluster, with version 1.1.1 of the Microsoft.DurableTask.SqlServer.AzureFunctions NuGet package.

Sometimes when calling ReadEntityStateAsync we get an exception:

Transaction (Process ID 63) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

It happens roughly once every few thousand orchestrations.
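
For context, the read itself is nothing special; it looks roughly like this (simplified sketch, the entity name and state type are placeholders):

// Simplified sketch of how we read the entity state (names are placeholders).
var entityId = new EntityId("MyEntity", entityKey);
EntityStateResponse<MyState> response = await entityClient.ReadEntityStateAsync<MyState>(entityId);
if (response.EntityExists)
{
    MyState state = response.EntityState;
}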

The full call stack:

Microsoft.VideoIndexer.Telemetry.ViException: Transaction (Process ID 63) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
 ---> Microsoft.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 63) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at Microsoft.Data.SqlClient.SqlDataReader.Read()
   at DurableTask.SqlServer.SqlOrchestrationService.GetOrchestrationStateAsync(String instanceId, String executionId, CancellationToken cancellationToken) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 614
   at DurableTask.SqlServer.AzureFunctions.SqlDurabilityProvider.RetrieveSerializedEntityState(EntityId entityId, JsonSerializerSettings serializierSettings) in /_/src/DurableTask.SqlServer.AzureFunctions/SqlDurabilityProvider.cs:line 103
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.ReadEntityStateAsync[T](DurabilityProvider provider, EntityId entityId) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 566
   at Microsoft.VideoIndexer.IndexingService.Common.ActivityRunner.RunActivity[TContext](Func`3 function, TContext context, IDurableEntityClient entityClient, Boolean throwExceptions, String callingMethod) in C:\__w\1\s\Src\Microsoft.VideoIndexer.IndexingService.Common\ActivityRunner.cs:line 121
ClientConnectionId:dba86c5b-4d3b-4f29-89b9-d169ac0d9362
Error Number:1205,State:51,Class:13
   --- End of inner exception stack trace ---

yehiyam commented Oct 1, 2023

[screenshot]

@bachuv bachuv added P1 Priority 1 and removed Needs: Triage 🔍 labels Oct 3, 2023
@cgillum cgillum added P2 Priority 2 and removed P1 Priority 1 labels Oct 3, 2023

yehiyam commented Oct 9, 2023

Kind ping
@cgillum

cgillum commented Oct 9, 2023

We should already be retrying deadlocks automatically, so I wonder whether those retries are being exhausted. Unfortunately, it's hard to fix issues like this without a reliable repro that can be used for testing.
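
In the meantime, a caller-side retry that specifically targets SQL error 1205 can act as a safety net while you investigate (a rough sketch, not something built into the extension; the helper name, attempt count, and delays are arbitrary):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Data.SqlClient;

static class EntityReadRetry
{
    // Rough sketch: retry ReadEntityStateAsync when the underlying SqlException
    // is a deadlock (error 1205). The SqlException may be wrapped, so check
    // InnerException as well.
    public static async Task<EntityStateResponse<T>> ReadEntityStateWithRetryAsync<T>(
        IDurableEntityClient client, EntityId entityId, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await client.ReadEntityStateAsync<T>(entityId);
            }
            catch (Exception ex) when (
                attempt < maxAttempts &&
                ((ex as SqlException) ?? (ex.InnerException as SqlException))?.Number == 1205)
            {
                // Arbitrary linear backoff before retrying the read.
                await Task.Delay(TimeSpan.FromMilliseconds(250 * attempt));
            }
        }
    }
}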

Are you comfortable making changes directly to your database? One thing you can try is changing the deadlock priority of the dt.QuerySingleOrchestration stored procedure by adding SET DEADLOCK_PRIORITY HIGH at the top of the procedure, which makes it more likely to survive deadlocks.

For example, you'd run an ALTER PROCEDURE statement similar to the following:

ALTER PROCEDURE [dt].[QuerySingleOrchestration]
    @InstanceID varchar(100),
    @ExecutionID varchar(50) = NULL,
    @FetchInput bit = 1,
    @FetchOutput bit = 1
AS
BEGIN
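    -- Prefer this session in deadlock resolution: with HIGH priority, the other
    -- (normal-priority) participant is chosen as the victim and rolled back instead.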
    SET DEADLOCK_PRIORITY HIGH

    DECLARE @TaskHub varchar(50) = dt.CurrentTaskHub()

    SELECT TOP 1
        I.[InstanceID],
        I.[ExecutionID],
        I.[Name],
        I.[Version],
        I.[CreatedTime],
        I.[LastUpdatedTime],
        I.[CompletedTime],
        I.[RuntimeStatus],
        I.[ParentInstanceID],
        (SELECT TOP 1 [Text] FROM Payloads P WHERE
            P.[TaskHub] = @TaskHub AND
            P.[InstanceID] = I.[InstanceID] AND
            P.[PayloadID] = I.[CustomStatusPayloadID]) AS [CustomStatusText],
        CASE WHEN @FetchInput = 1 THEN (SELECT TOP 1 [Text] FROM Payloads P WHERE
            P.[TaskHub] = @TaskHub AND
            P.[InstanceID] = I.[InstanceID] AND
            P.[PayloadID] = I.[InputPayloadID]) ELSE NULL END AS [InputText],
        CASE WHEN @FetchOutput = 1 THEN (SELECT TOP 1 [Text] FROM Payloads P WHERE
            P.[TaskHub] = @TaskHub AND
            P.[InstanceID] = I.[InstanceID] AND
            P.[PayloadID] = I.[OutputPayloadID]) ELSE NULL END AS [OutputText],
        I.[TraceContext]
    FROM Instances I
    WHERE
        I.[TaskHub] = @TaskHub AND
        I.[InstanceID] = @InstanceID AND
        (@ExecutionID IS NULL OR @ExecutionID = I.ExecutionID)
END

If you're comfortable doing so, can you try this and see if it helps?

This doesn't fix the deadlock, but it makes it less likely to impact the ReadEntityStateAsync call; instead, the other deadlock participant will be rolled back. Assuming that other participant is less important, or is a background task that can be retried automatically, this could improve the overall availability of your app. However, it could also just move the problem elsewhere.

If you're able to find a way to reproduce this consistently, I could use that reproducer to find a more reliable/permanent fix.


yehiyam commented Nov 9, 2023

Hi @cgillum
I manually applied your change to the stored procedure and it seems to help; since the change I haven't been able to reproduce the error.
Is there a way to apply it programmatically when the process starts?
Currently we let the SDK create the database with the flag task.StorageProvider["createDatabaseIfNotExists"] = true;
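
For now I'm considering applying the script ourselves once at startup with a plain SqlCommand, roughly like this (a sketch; the file name is a placeholder and would contain just the ALTER PROCEDURE statement above, and the connection string comes from our configuration):

using System.IO;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class DeadlockPriorityWorkaround
{
    // Sketch: run the ALTER PROCEDURE workaround once during process startup.
    // "AlterQuerySingleOrchestration.sql" would hold only the ALTER PROCEDURE
    // statement from the comment above (ALTER PROCEDURE must start its own batch).
    public static async Task ApplyAsync(string connectionString)
    {
        string script = await File.ReadAllTextAsync("AlterQuerySingleOrchestration.sql");

        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand(script, connection);
        await command.ExecuteNonQueryAsync();
    }
}

It feels a bit fragile to patch a procedure that the SDK owns, though, so a setting or a built-in fix would still be preferable.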
