Fix refcounting in non blocking send path. #227

Gaurav-Gangalwar · 2021-04-08T09:22:34Z

We were seeing a connection fd leak when a connection goes into the
destroyed state, which caused clients to hang: the fd stayed open but was not
rearmed for epoll, so the client gets a TCP zero window and may or
may not reset the connection to recover, depending on the kernel version.

If the connection xprt is in the destroyed state, we may not
be able to process the POLLOUT event for the send task, and the refs taken will
not go away.
We don't need to take a ref for the send queue when registering for the
POLLOUT event, as we already take a ref for the send task.
We don't need a ref for each response we queue; we can take a single ref
for the send queue and release it when we start processing.

We need to close xp_fd_send on connection destroy.
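
The single-ref idea described above can be sketched with a toy model. This is not ntirpc code: `toy_xprt`, `toy_enqueue`, and `toy_send_task` are hypothetical names, the model ignores locking and epoll entirely, and it assumes the caller already holds a base reference on the transport. It only illustrates one reference held on behalf of the whole send queue and released when the send task starts draining it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport; not the ntirpc SVCXPRT. */
struct toy_xprt {
	int refs;       /* reference count */
	int queued;     /* responses waiting in the send queue */
	bool queue_ref; /* true while the send queue holds its single ref */
};

static void toy_ref(struct toy_xprt *x)
{
	x->refs++;
}

static void toy_unref(struct toy_xprt *x)
{
	assert(x->refs > 0);
	x->refs--;
}

/* Queue a response. A ref is taken only if the queue does not hold one yet,
 * so the count does not grow with the number of queued responses. */
static void toy_enqueue(struct toy_xprt *x)
{
	x->queued++;
	if (!x->queue_ref) {
		toy_ref(x);
		x->queue_ref = true;
	}
}

/* Send task: release the queue's single ref when processing starts,
 * then drain everything that was queued. */
static void toy_send_task(struct toy_xprt *x)
{
	if (x->queue_ref) {
		x->queue_ref = false;
		toy_unref(x);
	}
	x->queued = 0;
}
```

With a base count of 1, queuing three responses raises the count to 2 rather than 4, and running the send task brings it back to 1.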

Gaurav-Gangalwar · 2021-04-09T08:26:07Z

This is the code path that caused the issue:
In svc_ioq_write, svc_rqst_evchan_write rearms with refs on EWOULDBLOCK.
In svc_rqst_epoll_event, svc_xprt_lookup gets the xprt in the destroyed state (it was destroyed in some other path, possibly due to an error or an idle cleanup happening at the same time).
So svc_rqst_xprt_task_send doesn't get a chance to execute and clean up the refs taken for the responses.
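
The sequence above can be modelled in a few lines. This is a simplified sketch, not the actual ntirpc functions: `leak_xprt`, `leak_rearm_on_wouldblock`, and `leak_pollout_event` are hypothetical names, and the model assumes one ref per pending response, as the description of the old behaviour suggests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport; not the ntirpc SVCXPRT. */
struct leak_xprt {
	int refs;       /* reference count */
	int pending;    /* responses waiting for POLLOUT */
	bool destroyed; /* models the destroyed state of the xprt */
};

/* Old behaviour as described: each response that hits EWOULDBLOCK
 * takes its own ref before rearming for POLLOUT. */
static void leak_rearm_on_wouldblock(struct leak_xprt *x)
{
	x->refs++;
	x->pending++;
}

/* Event handler: a destroyed transport is skipped, so the send task
 * never runs and the per-response refs are never released.
 * Returns true if the send task ran. */
static bool leak_pollout_event(struct leak_xprt *x)
{
	if (x->destroyed)
		return false;
	assert(x->refs >= x->pending);
	x->refs -= x->pending;
	x->pending = 0;
	return true;
}
```

If the transport is marked destroyed between the rearm and the POLLOUT event, the count stays above its base value, which in this model is the leaked state that keeps the fd open.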

dang · 2021-04-12T15:51:57Z

I'm not sure about this change. I'll take some time to think about it and revisit.

Gaurav-Gangalwar · 2021-04-13T16:31:12Z

Sure @dang
The customer is facing this hang only with Oracle Linux clients.
To add more on the issue:
We are sending TCP zero window since the fd is not closed but we destroyed the xprt, so we are no longer polling on it and the recv queue gets exhausted.
We tried a mix of CentOS 8 and Oracle Linux 7.9 clients. On zero window, the CentOS clients reset the connection and start a new one to recover, so IO hangs for some time but then recovers. The Oracle Linux clients don't reset the connection and remain hung forever.
To reproduce it easily, I ran IO with this patch, which simulates a connection destroy during IO:

@@ -1308,9 +1308,6 @@ svc_rqst_clean_func(SVCXPRT *xprt, void *arg)
        if (xprt->xp_flags & (SVC_XPRT_FLAG_DESTROYED | SVC_XPRT_FLAG_UREG))
                return (false);
 
-       if ((acc->ts.tv_sec - REC_XPRT(xprt)->recv.ts.tv_sec) < acc->timeout)
-               return (false);
-
        SVC_DESTROY(xprt);
        acc->cleaned++;
        return (true);

Gaurav-Gangalwar · 2021-04-26T11:49:53Z

@dang @ffilz
can we merge this?

Cleanup writeq if svc_ioq_flushv return -1.

9feb368

In svc_ioq_write, we do not clean up the writeq when rc < 0. We can just continue instead of break when rc < 0, similar to how we handle errno != EWOULDBLOCK. Change-Id: I280b5bffe2a2fc5db36c0cc6af3d36951979aa62
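
The difference between break and continue in a drain loop can be illustrated with a toy queue. This is only a sketch under stated assumptions: `toy_entry` and `toy_drain` are hypothetical, `flush_rc` stands in for the result of a flush call, and the model assumes each entry is dequeued and released before its flush result is checked. The real loop in svc_ioq_write may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue entry; flush_rc models the flush return value. */
struct toy_entry {
	int flush_rc;  /* < 0 means the flush failed */
	bool released; /* true once the entry has been cleaned up */
};

/* Drain the queue and return how many entries were left uncleaned.
 * With continue_on_error == false, the loop stops at the first failure
 * and everything behind it stays queued. With true, the loop keeps
 * going so every entry is released. */
static size_t toy_drain(struct toy_entry *q, size_t n, bool continue_on_error)
{
	size_t i, left = 0;

	for (i = 0; i < n; i++) {
		q[i].released = true;
		if (q[i].flush_rc < 0 && !continue_on_error)
			break;
	}
	for (i = 0; i < n; i++) {
		if (!q[i].released)
			left++;
	}
	return left;
}
```

For a three-entry queue whose first flush fails, the break variant leaves two entries behind, while the continue variant leaves none.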

Gaurav-Gangalwar force-pushed the push branch from a0c38aa to 9feb368 Compare August 6, 2021 06:33

Gaurav-Gangalwar closed this Aug 6, 2021
