
sub orchestrator retries #84

Merged

cgillum merged 4 commits into microsoft:main from famarting:sub-orchestrator-retries on Nov 27, 2024

Conversation

@famarting
Contributor

Second part for dapr/go-sdk#541.

Adds support for calling a sub-orchestrator with retries, based on the existing implementation for activities.

@famarting
Contributor Author

ping @cgillum

Signed-off-by: Fabian Martinez <46371672+famarting@users.noreply.github.com>
@famarting
Contributor Author

ping

@cgillum
Member

cgillum commented Nov 20, 2024

@famarting looks like your new tests are not passing. The CI is reporting timeouts.

@famarting
Contributor Author

@cgillum the tests are failing because of this error:

ERROR: 2024/11/20 19:09:57 orchestration-processor: failed to complete work item: orchestration instance already exists

It makes me think that the strategy I followed to implement sub-orchestrator retries is not valid. Do you have any idea or suggestion for how to address this?

First I implemented activity retries using a recursive algorithm that schedules the activity with timers, introducing delays between tries. However, this doesn't seem to work out of the box for sub-orchestrations due to the "already exists" error. Additionally, looking at the Python SDK, it seems retries there are not implemented with a recursive algorithm; instead, extra logic has been added to the event processing to account for delays and retries. Am I forced to follow the same strategy as Python, or what other options do we have?

@cgillum
Member

cgillum commented Nov 22, 2024

My first guess is that the sqlite backend implementation doesn't allow reusing the existing sub-orchestration ID, so it's returning ErrDuplicateInstance from the call to CompleteOrchestrationWorkItem.

Your test might pass if you forgo providing a specific instance ID and allow the sub-orchestration to use a random instance ID. However, if you want the scenario of reusing a user-specific instance ID to work, then we'll need to change sqlite to handle the case where an existing instance already exists such that we delete the old (failed) instance and create the new instance. Note that the Dapr Actors backend might require similar changes.

…tor on retry

Signed-off-by: Fabian Martinez <46371672+famarting@users.noreply.github.com>
Signed-off-by: Fabian Martinez <46371672+famarting@users.noreply.github.com>
@famarting famarting force-pushed the sub-orchestrator-retries branch from 404799e to 7572160 Compare November 26, 2024 09:42
@famarting
Contributor Author

My first guess is that the sqlite backend implementation doesn't allow reusing the existing sub-orchestration ID, so it's returning ErrDuplicateInstance from the call to CompleteOrchestrationWorkItem.

Your test might pass if you forgo providing a specific instance ID and allow the sub-orchestration to use a random instance ID. However, if you want the scenario of reusing a user-specific instance ID to work, then we'll need to change sqlite to handle the case where an existing instance already exists such that we delete the old (failed) instance and create the new instance. Note that the Dapr Actors backend might require similar changes.

ok, it was much simpler than I thought, thank you

@cgillum
Member

cgillum commented Nov 27, 2024

@famarting looks like the new test is still failing due to a mismatch in expected OTel span outputs:

 Error Trace:	/home/runner/work/durabletask-go/durabletask-go/tests/tracing_test.go:172
        	        /home/runner/work/durabletask-go/durabletask-go/tests/tracing_test.go:144
        	        /home/runner/work/durabletask-go/durabletask-go/tests/tracing_test.go:125
        	        /home/runner/work/durabletask-go/durabletask-go/tests/tracing_test.go:30
        	        /home/runner/work/durabletask-go/durabletask-go/tests/orchestrations_test.go:466
 Error:      	[]attribute.KeyValue{attribute.KeyValue{Key:"durabletask.type", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"orchestration", slice:interface {}(nil)}}, attribute.KeyValue{Key:"durabletask.task.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"Child", slice:interface {}(nil)}}, attribute.KeyValue{Key:"durabletask.task.instance_id", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"8e5d728a-cce1-4b74-8928-d5df2a648de3_child", slice:interface {}(nil)}}, attribute.KeyValue{Key:"durabletask.runtime_status", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"FAILED", slice:interface {}(nil)}}} does not contain attribute.KeyValue{Key:"durabletask.task.task_id", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}
 Test:       	Test_SingleSubOrchestrator_Failed_Retries

It looks like you're attempting to assert a particular task ID for the sub-orchestrations, but there aren't any such task ID attributes getting generated for sub-orchestrations. I don't see similar asserts for task IDs in the other sub-orchestration test (suggesting that they're not expected to exist), so I wonder if you should just remove the task ID asserts you added for the sub-orchestrations.

Signed-off-by: Fabian Martinez <46371672+famarting@users.noreply.github.com>
@famarting
Contributor Author

Followed your feedback, thank you.

Member

@cgillum cgillum left a comment


Looks good! Everything's passing now.

@cgillum cgillum merged commit 6ab94e1 into microsoft:main Nov 27, 2024
JoshVanL referenced this pull request in JoshVanL/durabletask-go Apr 4, 2025
Signed-off-by: Fabian Martinez <46371672+famarting@users.noreply.github.com>