Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not receiving async done tag for failed incoming call requests #10136

Closed
jeady opened this issue Mar 13, 2017 · 23 comments
Closed

Not receiving async done tag for failed incoming call requests #10136

jeady opened this issue Mar 13, 2017 · 23 comments

Comments

@jeady
Copy link
Contributor

jeady commented Mar 13, 2017

The comment on "AsyncNotifyWhenDone" states "Has to be called before the rpc starts" but it seems that if the request tag is returned with ok=false (i.e. because the CQ is shutting down) then the async done tag is never received. Instead, I expect the async done tag to be received regardless of whether or not an incoming call request was successfully received.

Worth noting is I have also seen cases where I received both the call request with ok=false and and done tag when there was some error (such as deadline exceeded) and not when the CQ is shutting down.

@ctiller
Copy link
Member

ctiller commented Mar 13, 2017

Hey @lyuxuan - can you take a first peek at this.

I think we need a test case that:

  1. creates a ServerContext and calls AsyncNotifyWhenDone
  2. requests a call (using the async server API) using that ServerContext
  3. shuts down the server

And I think the expectation is that right now we'll see the tag for (2) come back but not for (1).
We should also see (1).

@lyuxuan
Copy link
Contributor

lyuxuan commented Mar 14, 2017

I add a test case to the async_end2end_test, and indeed (1) does not come back.
#10143

@lyuxuan
Copy link
Contributor

lyuxuan commented Mar 23, 2017

It's blocking internal user

@jeady
Copy link
Contributor Author

jeady commented Mar 28, 2017

As a side note, the reason we're calling AsyncNotifyWhenDone is because we want to be able to call ServerContext::IsCancelled, which has this comment:

// When using async API, it is only safe to call IsCancelled after
// the AsyncNotifyWhenDone tag has been delivered

We would probably be able to infer the "done-ness" of the connection without this tag otherwise.

@jeady
Copy link
Contributor Author

jeady commented May 8, 2017

Any status update here?

@lyuxuan
Copy link
Contributor

lyuxuan commented May 8, 2017

We have a fix plan, and should be able to update it by the end of this week. Thanks!

@abaldeva
Copy link

This issue is still present in 1.4.2. Any updates?

@lyuxuan
Copy link
Contributor

lyuxuan commented Aug 19, 2017

Fix is in progress. #11955

@vjpai
Copy link
Member

vjpai commented Oct 11, 2017

Is this issue still alive or is it closed?

@ctiller
Copy link
Member

ctiller commented Oct 11, 2017 via email

@muxi
Copy link
Member

muxi commented Dec 11, 2017

ping?

@abaldeva
Copy link

abaldeva commented Aug 1, 2018

Are there any plans for fixing this issue? For us, this results in lot of memory leaks at shutdown which screws up our memory leak validation tool chain.

Thanks.

@sreecha
Copy link
Contributor

sreecha commented Aug 1, 2018

Hi @abaldeva , I looked at it a few months ago and realized that the fix is very invasive and requires a lot of rewiring the code. It is currently in the back-burner. I am going through all our open issues this week; I'll give you a more concrete answer mid to late week.

@AspirinSJL
Copy link
Member

I also saw this issue when working on #15853. I will add some comment to that API so that users are warned about this issue before using that API.

@mehrdada
Copy link
Member

Past the two year mark and still an open issue. Pretty annoying as it seems to be the only exception to "whatever you feed the CQ, you'll get back" invariant.

@stale
Copy link

stale bot commented Nov 27, 2019

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 180 days. It will be closed automatically if no further update occurs in 1 day. Thank you for your contributions!

@stale stale bot closed this as completed Dec 4, 2019
@mehrdada mehrdada reopened this Feb 21, 2020
@stale stale bot removed the disposition/stale label Feb 21, 2020
@mehrdada
Copy link
Member

Issue definitely persists -- reopening

@stale
Copy link

stale bot commented May 6, 2020

This issue/PR has been automatically marked as stale because it has not had any update (including commits, comments, labels, milestones, etc) for 30 days. It will be closed automatically if no further update occurs in 7 day. Thank you for your contributions!

@xiedeacc
Copy link

is this issue still exists?

@adrianimboden
Copy link
Contributor

Yes, it is still an issue. I also have to disable leak checking for most tests because of this now.

Sadly, as I understand it, from a user point it is unavoidable: You have to call context.AsyncNotifyWhenDone before the request starts. If it never starts it is also not safe to delete the tag myself as it sometimes seems to get delivered anyway?

Disabling leak checking in those tests seems to be easiest.

I assume the reason for not fixing this is that most applications don't start/stop the servers all the time on production.

@yoonseok-kim
Copy link

yoonseok-kim commented Jan 13, 2023

I come up with a workaround to solve this problem at the application level. Depending on whether rpc initiate is in progress, it is determined whether to free the (dynamically allocated) tag to be delivered to AsyncNotifyWhenDone.

I wonder if this method is safe?

class CallDataInterface {
public:
    virtual ~CallDataInterface() = default;
    virtual void Proceed() = 0;
};

class DoneCallData : public CallDataInterface {
public:
    DoneCallData() = default;
    ~DoneCallData() = default;

    void Proceed() override { delete this; }
};

class SayHelloCallData : public CallDataInterface {
   public:
    SayHelloCallData(Greeter::AsyncService* service, ServerCompletionQueue* cq)
        : service_(service), cq_(cq), responder_(&ctx_), status_(CREATE) {
      initiate_ = false;
      done_ = std::make_unique<DoneCallData>();
      Proceed();
    }

    ~SayHelloCallData() {
        if (initiate_) {
            done_.release();
        }
    }

    void Proceed() override {
      if (status_ == CREATE) {
        status_ = PROCESS;
        ctx_.AsyncNotifyWhenDone(done_.get());
        service_->RequestSayHello(&ctx_, &request_, &responder_, cq_, cq_,
                                  this);
      } else if (status_ == PROCESS) {
        initiate_ = true;
        new SayHelloCallData(service_, cq_);

        std::string prefix("Hello ");
        reply_.set_message(prefix + request_.name());

        status_ = FINISH;
        responder_.Finish(reply_, Status::OK, this);
      } else {
        GPR_ASSERT(status_ == FINISH);
        delete this;
      }
    }

   private:
    bool initiate_;
    std::unique_ptr<DoneCallData> done_;
    Greeter::AsyncService* service_;
    ServerCompletionQueue* cq_;
    ServerContext ctx_;

    HelloRequest request_;
    HelloReply reply_;

    ServerAsyncResponseWriter<HelloReply> responder_;

    enum CallStatus { CREATE, PROCESS, FINISH };
    CallStatus status_;
  };

(written with 'grpc helloworld async server example'.)

mehrdada added a commit to mehrdada/grpc that referenced this issue May 22, 2023
As the issue[1] documents, the behavior of AsyncNotifyWhenDone is
documented as:

"The comment on `AsyncNotifyWhenDone` states "Has to be
called before the rpc starts" but it seems that if the
request tag is returned with ok=false (i.e. because
the CQ is shutting down) then the async done tag is
never received. Instead, I expect the async done tag to
be received regardless of whether or not an incoming
call request was successfully received."

The TODO item is marked closed as stale, and it seems
unlikely this will be resolved, without breaking
existing users whose code is written under the assumption
that the tag is not seen if the call never starts,
so it may be time to documented the idiosyncratic
corner case and make it the expected behavior.

[1]: grpc#10136
markdroth pushed a commit that referenced this issue May 23, 2023
As the [issue](#10136) documents, the
behavior of AsyncNotifyWhenDone is documented as:

"The comment on `AsyncNotifyWhenDone` states "Has to be called before
the rpc starts" but it seems that if the request tag is returned with
ok=false (i.e. because the CQ is shutting down) then the async done tag
is never received. Instead, I expect the async done tag to be received
regardless of whether or not an incoming call request was successfully
received."

The TODO item is marked closed as stale, and it seems unlikely this will
be resolved, without breaking
existing users whose code is written under the assumption that the tag
is not seen if the call never starts, so it may be time to documented
the idiosyncratic corner case and make it the expected behavior.
eugeneo pushed a commit to eugeneo/grpc that referenced this issue Jun 1, 2023
As the [issue](grpc#10136) documents, the
behavior of AsyncNotifyWhenDone is documented as:

"The comment on `AsyncNotifyWhenDone` states "Has to be called before
the rpc starts" but it seems that if the request tag is returned with
ok=false (i.e. because the CQ is shutting down) then the async done tag
is never received. Instead, I expect the async done tag to be received
regardless of whether or not an incoming call request was successfully
received."

The TODO item is marked closed as stale, and it seems unlikely this will
be resolved, without breaking
existing users whose code is written under the assumption that the tag
is not seen if the call never starts, so it may be time to documented
the idiosyncratic corner case and make it the expected behavior.
@brunexgeek
Copy link

Sadly, as I understand it, from a user point it is unavoidable: You have to call context.AsyncNotifyWhenDone before the request starts. If it never starts it is also not safe to delete the tag myself as it sometimes seems to get delivered anyway?

In my implementation, I used a memory pool for managing call data objects. Rather than allocating memory for each individual object, I used a preallocated buffer from the memory pool and instantiate with placement new (e.g. new (buffer) CallData()). If the notification never arrives, the buffer remains reserved in the pool, although "wasted". However, I have the flexibility to deallocate these buffers either upon stopping the server or dynamically during execution (e.g. employing a timeout mechanism).

@markdroth
Copy link
Member

This is a known deficiency of the CQ-based async API. At this point, we would recommend users migrate to the newer callback-style API instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests