Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix refcounting in non blocking send path. #227

Closed
wants to merge 1 commit into from

Conversation

Gaurav-Gangalwar
Copy link

We were facing connection fd leak when connection goes in
destroyed state, which causing clients hung, as fd was open and not
rearmed for epoll, client will get tcp zero window and may or
mat not do connection reset to recover based on kernel version.

If connection xprt is in destroyed state, then we may not
be able process POLLOUT event for send task and refs taken will
not go away.
We don't need to register for POLLOUT event with ref taken for send
queue, as we take ref for send task.
We don't need ref for each response we queue, we can just single ref
for send queue and release it when we start processing.

We need to close xp_fd_send on connection destroy.

@Gaurav-Gangalwar
Copy link
Author

This is code path which caused issue
In svc_ioq_write, svc_rqst_evchan_write rearm with refs on EWOULDBLOCK
In svc_rqst_epoll_event, svc_xprt_lookup got xprt in destroyed state(it got destroyed in some other path, could be due to some error or idle cleanup happened at same time).
So svc_rqst_xprt_task_send won't get chance to execute and cleanup the refs taken for responses.

@dang
Copy link

dang commented Apr 12, 2021

I'm not sure about this change. I'll take some time to think about it and revisit.

@Gaurav-Gangalwar
Copy link
Author

Gaurav-Gangalwar commented Apr 13, 2021

Sure @dang
Customer facing this hung only with oracle linux clients.
To add more the issue,
We are sending two zero window since fd is not closed but we destroyed the xprt, so we will not be polling on it and recv queue exhausted.
We tried with mix of centos 8 and oracle linux 7.9 clients, so on zero window centos clients reset the connection and start new connection to recover, so IO hung for sometime but it recovers. But oracle linux don't reset the connection and remain in hung state forever.
To reproduce it easily I ran IO with this patch to simulate connection destroy while doing IO

@@ -1308,9 +1308,6 @@ svc_rqst_clean_func(SVCXPRT *xprt, void *arg)
        if (xprt->xp_flags & (SVC_XPRT_FLAG_DESTROYED | SVC_XPRT_FLAG_UREG))
                return (false);
 
-       if ((acc->ts.tv_sec - REC_XPRT(xprt)->recv.ts.tv_sec) < acc->timeout)
-               return (false);
-
        SVC_DESTROY(xprt);
        acc->cleaned++;
        return (true);

@Gaurav-Gangalwar
Copy link
Author

@dang @ffilz
can we merge this?

In svc_ioq_write for (rc < 0), we are not cleaning up writeq.
We can just continue instead of break if (rc < 0).
Its similar to how we handle errno != EWOULDBLOCK.

Change-Id: I280b5bffe2a2fc5db36c0cc6af3d36951979aa62
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants