
Conversation

@PhongChuong
Collaborator

Fixes: #4015

@product-auto-label bot added the api: pubsub (Issues related to the Pub/Sub API) label on Jan 22, 2026

codecov bot commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.87%. Comparing base (5f5904e) to head (00d8054).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4338   +/-   ##
=======================================
  Coverage   94.87%   94.87%           
=======================================
  Files         188      188           
  Lines        7237     7241    +4     
=======================================
+ Hits         6866     6870    +4     
  Misses        371      371           


/// # Ok(())
/// # }
/// ```
pub async fn resume_publish<T: std::convert::Into<std::string::String>>(
Collaborator

This should be a synchronous operation. We don't need to wait for the resume to complete. Any messages that are put into the channel afterward will reach the unpaused worker.

Collaborator Author

While messages that are put into the channel afterward will reach the unpaused worker, I think there are cases where the application would want a signal that the worker is now unpaused and publishing is back to normal behavior.

This is achievable with an async call, or potentially a long blocking sync call. My preference is for it to be async and to let the application choose to .await on it if needed.

Collaborator

I'm not sure I understand the use case. Could you explain it a bit more? If you call a synchronous (fast) resume_publish(), all publishes that occur after it will be batched and sent (until the next error, of course). Why do we need to wait for the signal to reach the background worker? What additional benefit is there?

Collaborator Author

Imagine that generating a message is expensive and the message has a short TTL. In this case, the application may choose to await until the worker is ready again instead of having the messages potentially expire due to some other ongoing process (e.g., a flush).

Of course, this use case is completely imaginary, but I also think that giving the application the option is beneficial. Is there a reason why a fast synchronous resume is preferred over an async version?

Collaborator

> In this case, the application may choose to await until the worker is ready again instead of having the messages potentially expire due to some other ongoing process (e.g., a flush).

This is interesting, but I think maybe they should be awaiting on the flush, not the resume_publish, in this case.

One reason is to keep resume_publish cheap: it's basically flipping a bit, and we immediately know that future publishes will get batched, so there is no need to delay in saying that it's done.

Also, from discussing this with Alex early in the design process, he thought this should be a synchronous operation (if I am remembering correctly).

Collaborator Author

As discussed offline, the issue with sync is that the worker may be blocked on a long flush operation, causing a large delay before the resume_publish is processed because it is awaiting on other ordering keys. The ideal case would be for resume_publish to be sync and for flush not to block other operations from being processed.
I'll update resume_publish to be sync, and we can update flush in a later PR.
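
A minimal sketch of the resulting usage pattern, assuming the publisher API shown in this PR (publish, a flush operation, and resume_publish taking an ordering key) with resume_publish made synchronous as agreed above; the exact types and signatures here are assumptions, not the final API, and imports are omitted:

```rust
// Sketch only: assumes the publisher API in this PR, with resume_publish
// made synchronous as agreed in this thread.
async fn recover_after_error(publisher: &Publisher) {
    // Cheap and synchronous: unpauses the ordering key. Messages published
    // on this key afterward are batched again.
    publisher.resume_publish("ordering key with error");

    // If the application needs a completion signal, it should await the
    // flush of outstanding messages rather than the resume itself.
    publisher.flush().await;

    // Publishing on the key now follows the normal batched path.
    let _ = publisher
        .publish(
            PubsubMessage::new()
                .set_ordering_key("ordering key with error")
                .set_data("msg after resume"),
        )
        .await;
}
```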

let got = publisher
    .publish(
        PubsubMessage::new()
            .set_ordering_key("ordering key with error")
Collaborator

Should this be the same as line 1318, without the error? I'm not sure the name is that much more descriptive than just using key1 and key2 if we want a second one.

Collaborator Author

Good catch.
Updated to use a variable instead.
I'll follow this PR up with a cleanup of the tests.

Comment on lines +1582 to +1599
got_err = publisher
    .publish(
        PubsubMessage::new()
            .set_ordering_key("ordering key with error 0")
            .set_data("msg 2"),
    )
    .await
    .unwrap_err();
let source = got_err
    .source()
    .and_then(|e| e.downcast_ref::<crate::error::PublishError>());
assert!(
    matches!(
        source,
        Some(crate::error::PublishError::OrderingKeyPaused(()))
    ),
    "{got_err:?}"
);
Collaborator

nit: Maybe this could be in a helper; it's kind of verbose, and these tests become hard to skim.

Collaborator Author

Acknowledged.
I'll follow this PR up with a cleanup of the tests.
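
For illustration, the repeated assertion from the snippet above could be folded into a small helper along these lines (a sketch of a possible follow-up, not code in this PR; the helper name is made up, and it reuses the error types already referenced in the test):

```rust
// Hypothetical test helper, not part of this PR.
fn assert_ordering_key_paused<E: std::error::Error>(got_err: &E) {
    // Walk to the underlying PublishError and check that it reports the
    // ordering key as paused.
    let source = got_err
        .source()
        .and_then(|e| e.downcast_ref::<crate::error::PublishError>());
    assert!(
        matches!(
            source,
            Some(crate::error::PublishError::OrderingKeyPaused(()))
        ),
        "{got_err:?}"
    );
}
```

The test above would then shrink to a single line: assert_ordering_key_paused(&got_err);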

@PhongChuong marked this pull request as ready for review on January 23, 2026 03:22
@PhongChuong requested a review from a team as a code owner on January 23, 2026 03:22
@PhongChuong requested a review from dbolduc on January 23, 2026 03:24
Comment on lines 32 to 39
@@ -35,8 +35,8 @@ pub(crate) enum ToBatchWorker {
     Publish(BundledMessage),
     /// A request to flush all outstanding messages.
     Flush(oneshot::Sender<()>),
-    // TODO(#4015): Add a resume function to allow resume Publishing on a ordering key after a
-    // failure.
+    /// A request to resume publishing.
+    ResumePublish(),
Collaborator

note: I'm only now noticing that ToWorker and ToBatchWorker are basically the same thing. I don't know if we can save anything performance-wise by using the same type, but it could be a place to consider refactoring in the future.
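
Purely as a sketch of that refactoring idea (ToWorker is not shown in this diff, so this assumes its variants end up mirroring ToBatchWorker's), one option would be a single message enum parameterized over the publish payload:

```rust
use tokio::sync::oneshot;

// Hypothetical shared worker message type, not part of this PR. It assumes
// both workers need the same three operations and differ only in the
// publish payload.
pub(crate) enum ToWorkerMsg<P> {
    /// A request to publish a payload.
    Publish(P),
    /// A request to flush all outstanding messages.
    Flush(oneshot::Sender<()>),
    /// A request to resume publishing.
    ResumePublish(),
}
```

The batch worker would then use ToWorkerMsg<BundledMessage>, and the other worker would plug in its own payload type.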

@PhongChuong merged commit 6aaad0e into googleapis:main on Jan 23, 2026
30 checks passed
@PhongChuong deleted the orderingResume branch on January 24, 2026 03:05



Development

Successfully merging this pull request may close these issues.

Pubsub: Support Publisher resume on ordering key when a failure occurs
