
ScheduleNewOrchestrationInstanceAsync fails when scale controller scaling in #398

Open
jason-daly opened this issue May 14, 2024 · 10 comments

jason-daly commented May 14, 2024

I've recently started using Netherite (Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite, version 1.5.1) and have been seeing a few errors when starting orchestrations.

Orchestrations start fine most of the time, but occasionally one fails to start.

When starting an orchestration as follows:

            var ingestionOrchestrationSpec = new IngestionOrchestrationSpec(etc...);
            var orchestrationInstanceId = $"ingestions.{tenant.ProjectId}.{pointcloud.Id}";
            await client.ScheduleNewOrchestrationInstanceAsync(nameof(IngestionsCreateAsyncOrchestrator), ingestionOrchestrationSpec, new StartOrchestrationOptions { InstanceId = orchestrationInstanceId });

This exception is thrown:

Exception: Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Exception was thrown by handler.")
   at Microsoft.DurableTask.Client.Grpc.GrpcDurableTaskClient.ScheduleNewOrchestrationInstanceAsync(TaskName orchestratorName, Object input, StartOrchestrationOptions options, CancellationToken cancellation)
   at TrimbleGeospatial.Seurat.ApiApp.Functions.Ingestions_Create.Run(HttpRequestData req, DurableTaskClient client, FunctionContext functionContext, String projectId, String pointcloudId)
jason-daly (Author) commented May 14, 2024

The relevant host.json settings are:

    "durableTask": {
      "hubName": "SeuratAPIDurableTasks",
      "UseGracefulShutdown": "true",
      "storageProvider": {
        "type": "Netherite",
        "PartitionCount": "12",
        "StorageConnectionName": "AzureWebJobsStorage",
        "EventHubsConnectionName": "EventHubsConnection"
      },
      "tracing": {
        "distributedTracingEnabled": true,
        "Version": "V2"
      },

      "TraceToBlob": true
    }

where "AzureWebJobsStorage" is a connection string to a Azure storage account:
DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=srtfuncdevaue;AccountKey=this_part_is_secret

and "EventHubsConnection" is a connection string to an EventHub, as follows:
Endpoint=sb://evhns-srt-netherite-dev-aue.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=secret_goes_here

jason-daly changed the title from "ScheduleNewOrchestrationInstanceAsync fails when multiple orchestrations are running" to "ScheduleNewOrchestrationInstanceAsync fails when scale controller scaling in" on May 15, 2024
jason-daly (Author) commented:

more info...

Logs show that the error is raised in response to the host instance being shut down by the Scale Controller, but it appears that the orchestration actually started ok.

15/05/2024, 5:05:56.497 pm Scheduling new IngestionsCreateAsyncOrchestrator orchestration with instance ID 'ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873' and 3140 bytes of input data.
15/05/2024, 5:05:56.498 pm {"durabletask.task.execution_id":"b9e17ec1fb2a4464ae5ec25fbb3265cc","durabletask.task.name":"IngestionsCreateAsyncOrchestrator","durabletask.type":"orchestration","otel.status_description":"Client request was cancelled because host is shutting down.","durabletask.task.instance_id":"ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873"}

15/05/2024, 5:05:56.504 pm ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' scheduled. Reason: NewInstance. IsReplay: False. State: Scheduled. RuntimeStatus: Pending. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 21.

15/05/2024, 5:05:56.550 pm An instance was removed because all functions are either idle or seeing a steady decrease in load.
15/05/2024, 5:05:56.550 pm Instance count changed

15/05/2024, 5:05:56.551 pm ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' started. IsReplay: False. Input: (12560 bytes). State: Started. RuntimeStatus: Running. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 110. TaskEventId: -1

15/05/2024, 5:05:56.551 pm Executing 'Functions.IngestionsCreateAsyncOrchestrator' (Reason='(null)', Id=21415206-bfc1-446e-8e0c-287674404a50)
15/05/2024, 5:05:56.557 pm Executed 'Functions.IngestionsCreateAsyncOrchestrator' (Succeeded, Id=21415206-bfc1-446e-8e0c-287674404a50, Duration=5ms)

15/05/2024, 5:05:56.572 pm DrainMode mode enabled
15/05/2024, 5:05:56.572 pm Calling StopAsync on the registered listeners
15/05/2024, 5:05:56.590 pm Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableTaskListener' for function 'IngestionsCreateAsyncOrchestrator'
15/05/2024, 5:05:56.590 pm Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableTaskListener' for function 'IngestionsCreateAsyncOrchestrator'
... (I can supply more logs if needed)

15/05/2024, 5:05:56.888 pm Exception: Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Exception was thrown by handler.")
at Microsoft.DurableTask.Client.Grpc.GrpcDurableTaskClient.ScheduleNewOrchestrationInstanceAsync(TaskName orchestratorName, Object input, StartOrchestrationOptions options, CancellationToken cancellation)

15/05/2024, 5:05:56.888 pm Failed to start IngestionsCreateAsyncOrchestrator with orchestrationInstanceId=ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873
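Given that the instance does appear to get created despite the client-side failure, I'm considering a defensive workaround along these lines. This is just an untested sketch: it reuses the variables from the snippet in my first comment, and 'logger' stands in for whatever ILogger is in scope.

    // Sketch of an untested workaround: treat the RpcException as "outcome unknown"
    // and re-query the instance before reporting a failure. Reuses client,
    // ingestionOrchestrationSpec and orchestrationInstanceId from the snippet above;
    // 'logger' is a placeholder for whatever ILogger is available here.
    try
    {
        await client.ScheduleNewOrchestrationInstanceAsync(
            nameof(IngestionsCreateAsyncOrchestrator),
            ingestionOrchestrationSpec,
            new StartOrchestrationOptions { InstanceId = orchestrationInstanceId });
    }
    catch (Grpc.Core.RpcException ex)
    {
        // The schedule request may have been accepted even though the gRPC call
        // failed while the host was shutting down, so check whether the instance exists.
        var metadata = await client.GetInstanceAsync(orchestrationInstanceId);
        if (metadata is null)
        {
            throw; // the instance really wasn't created; surface the original error
        }

        logger.LogWarning(ex,
            "Schedule call failed but instance {InstanceId} exists with status {Status}.",
            orchestrationInstanceId, metadata.RuntimeStatus);
    }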

jason-daly (Author) commented:

This seems to me to be similar to issue #2454, as explained in this comment.

bachuv added the P2 (Priority 2) label and removed the Needs: Triage 🔍 label on May 22, 2024
davidmrdavid (Collaborator) commented:

Hi @jason-daly: can you share what Durable Functions packages you're using? As suggested in the thread you linked, that interaction between the scale controller and DF should have been fixed.

You also mentioned that you encountered this error since moving to Netherite 1.5.1. What were you using before?

davidmrdavid (Collaborator) commented May 22, 2024

Also - I think the thread you linked mostly discussed orchestrators being marked as failed when the scale controller removes a VM from the app. In this case, it appears to me that the client function is failing, not the orchestrator. Do I have that right?

If so - does your client function have a cancellation token that you can inspect during these shutdowns? I suspect the cancellation token has been "set"/fired.
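To illustrate what I mean, here's a minimal sketch of an isolated-worker client function that surfaces that token. All names are illustrative rather than taken from your app, and context.CancellationToken assumes a recent Microsoft.Azure.Functions.Worker version.

    // Minimal illustrative sketch (not the app's actual code): an isolated-worker
    // client function that inspects the host cancellation token and passes it to
    // the scheduling call.
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.Functions.Worker;
    using Microsoft.Azure.Functions.Worker.Http;
    using Microsoft.DurableTask;
    using Microsoft.DurableTask.Client;
    using Microsoft.Extensions.Logging;

    public static class StartOrchestrationSample
    {
        [Function("StartOrchestrationSample")]
        public static async Task<HttpResponseData> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
            [DurableClient] DurableTaskClient client,
            FunctionContext context)
        {
            ILogger logger = context.GetLogger("StartOrchestrationSample");

            // Signaled when the host starts draining (e.g. the scale controller removing this instance).
            CancellationToken token = context.CancellationToken;
            logger.LogInformation("Cancellation already requested: {IsCancelled}", token.IsCancellationRequested);

            string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
                "IngestionsCreateAsyncOrchestrator",  // orchestrator name
                null,                                 // input omitted in this sketch
                new StartOrchestrationOptions(),      // default options
                token);                               // observe host shutdown

            var response = req.CreateResponse(HttpStatusCode.Accepted);
            await response.WriteStringAsync(instanceId);
            return response;
        }
    }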

jason-daly (Author) commented May 22, 2024

> Also - I think the thread you linked mostly discussed orchestrators being marked as failed when the scale controller removes a VM from the app. In this case, it appears to me that the client function is failing, not the orchestrator. Do I have that right?
>
> If so - does your client function have a cancellation token that you can inspect during these shutdowns? I suspect the cancellation token has been "set"/fired.

Ah, yes, you are correct. The exception occurs in the call to ScheduleNewOrchestrationInstanceAsync on the DurableTaskClient instance.

In the latest edit of this code, a CancellationToken was supplied, and I still encountered the RpcException, although the underlying error logged by the 'framework' changed to:
[screenshot of the new underlying error]

> Hi @jason-daly: can you share what Durable Functions packages you're using? As suggested in the thread you linked, that interaction between the scale controller and DF should have been fixed.
>
> You also mentioned that you encountered this error since moving to Netherite 1.5.1. What were you using before?

Prior to Netherite, we had been using the default storage provider (Azure Storage) and had not encountered these failures.
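For comparison, the equivalent host.json with the default Azure Storage provider looks roughly like this (illustrative, not the exact settings we used; omitting storageProvider.type falls back to the default Azure Storage provider):

    "durableTask": {
      "hubName": "SeuratAPIDurableTasks",
      "storageProvider": {
        "connectionStringName": "AzureWebJobsStorage"
      }
    }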

Here's the packages list:

    <PackageReference Include="AutoMapper" Version="13.0.1" />
    <PackageReference Include="Azure.Identity" Version="1.11.3" />
    <PackageReference Include="Azure.Messaging.EventGrid" Version="4.24.0" />
    <PackageReference Include="Azure.Storage.Files.DataLake" Version="12.18.0" />
    <PackageReference Include="Azure.Storage.Queues" Version="12.18.0" />
    <PackageReference Include="Gridify" Version="2.14.1" />
    <PackageReference Include="Microsoft.Azure.Core.NewtonsoftJson" Version="2.0.0" />
    <PackageReference Include="Microsoft.Azure.Cosmos" Version="3.40.0-preview.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.8.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.3" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite" Version="1.5.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.EventGrid" Version="3.4.1" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.4.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.2" OutputItemType="Analyzer" />
    <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
    <PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks.Abstractions" Version="8.0.5" />
    <PackageReference Include="Microsoft.IdentityModel.Tokens" Version="7.5.2" />
    <PackageReference Include="Microsoft.IO.RecyclableMemoryStream" Version="3.0.0" />
    <PackageReference Include="System.Linq.Async" Version="6.0.1" />
    <PackageReference Include="System.Net.Http" Version="4.3.4" />
    <PackageReference Include="System.Text.RegularExpressions" Version="4.3.1" />

jason-daly (Author) commented:

BTW, I have backed out Netherite now, and gone back to Azure Storage. If you are able to figure out the issue and get it resolved, I can give Netherite another try.

davidmrdavid (Collaborator) commented:

@jason-daly: are you able to provide me with your app name and the time in UTC when this issue occurred in Azure? I'd need that to be able to debug this further.

jason-daly (Author) commented:

> @jason-daly: are you able to provide me with your app name and the time in UTC when this issue occurred in Azure? I'd need that to be able to debug this further.

Sure. The function app name is func-srt-dev-aue (Australia East region); the logs are in the Application Insights resource 'appi-srt-dev-aue'.

The most recent occurrence was at 2024-05-16T02:58:06.6515245Z.
See operation_id 27ab279dcd3d61ccdd43934ec76717fe
See orchestration instanceId 'FeaturesServiceIntegration.nE0Z_35yWYM.82da1a0b-2812-4fee-9ca0-314369d386af.38d7f15b-99a5-4553-8a02-17557f50c1c8'.

To be honest, I kind of gave up on this already and reverted Netherite from our code. When I did this I also cleaned up the resources that it used, so this might hamper your troubleshooting efforts a bit...
If you are unable to figure out what was going wrong from the logs, I can temporarily reinstate Netherite in this app, so we can repeat the behavior. Let me know if you need this.

davidmrdavid (Collaborator) commented:

Thanks. No worries, we don't need your app to be active for us to review our telemetry; it's historical, not live :)
