UCP/CORE/JAVA: Fix doing force-close without error handling #6418
473ad6d to 8187d28
@brminich @hoopoepg @evgeny-leksikov could you review pls?
src/ucp/core/ucp_worker.c
Outdated
@@ -2949,6 +2947,10 @@ void ucp_worker_discard_uct_ep(ucp_ep_h ucp_ep, uct_ep_h uct_ep,
    ucs_assert(uct_ep != NULL);
    ucs_assert(purge_cb != NULL);

    if (uct_ep == &ucp_failed_tl_ep) {
why is the special check needed?
i'd expect flush and pending purge to be no-ops on a failed_ep
ucp_failed_tl_ep can be added to the hash and then added again, so it will fail
src/ucp/core/ucp_ep.c
Outdated
    ucp_ep_destroy_base(ep);
}

void ucp_ep_release_id(ucp_ep_h ep)
{
    ucs_status_t status;

    ucs_assert(!(ep->flags & UCP_EP_FLAG_FAILED));
    if (ucp_ep_ext_control(ep)->local_ep_id == UCP_EP_ID_INVALID) {
why needed?
now we set UCP_EP_FLAG_FAILED before discarding
why would id be invalid?
because it was already released in the UCP error-handling flow
Line 509 in 9efeadc
ucp_ep_release_id(ucp_ep);
so issue is we release ep id twice?
can we release the id same place we set ep to failed? so FAILED flag check will prevent duplicate id release
> so issue is we release ep id twice?
yes, we try to release it twice
> can we release the id same place we set ep to failed? so FAILED flag check will prevent duplicate id release
ok, done
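A minimal sketch of the idea agreed on above: release the id in the same place the endpoint is marked failed, so the FAILED flag check also guards against a second release. All names here (sketch_ep_t, EP_FLAG_FAILED, EP_ID_INVALID, release_count) are hypothetical stand-ins for illustration and are not the UCX definitions or the code merged in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real UCX types or flags. */
#define EP_FLAG_FAILED (1u << 0)
#define EP_ID_INVALID  UINT64_MAX

typedef struct {
    unsigned flags;
    uint64_t local_ep_id;
    int      release_count; /* counts real releases, for demonstration only */
} sketch_ep_t;

/* Releases the id; must be called at most once per endpoint. */
static void sketch_ep_release_id(sketch_ep_t *ep)
{
    assert(ep->local_ep_id != EP_ID_INVALID);
    ep->local_ep_id = EP_ID_INVALID;
    ep->release_count++;
}

/* Marks the endpoint failed and releases its id in one place.
 * A repeated call sees the FAILED flag and returns early, so the
 * id cannot be released twice. */
static void sketch_ep_set_failed(sketch_ep_t *ep)
{
    if (ep->flags & EP_FLAG_FAILED) {
        return;
    }

    ep->flags |= EP_FLAG_FAILED;
    sketch_ep_release_id(ep);
}
```

With this shape, both the error-handling flow and the force-close flow can call sketch_ep_set_failed() without coordinating on who releases the id.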
src/ucp/core/ucp_ep.h
Outdated
@@ -476,6 +476,9 @@ typedef struct ucp_conn_request {
} ucp_conn_request_t;


extern uct_ep_t ucp_failed_tl_ep;
maybe add a function like ucp_lane_is_failed instead of exposing global var?
done
src/ucp/core/ucp_ep.c
Outdated
@@ -97,6 +97,11 @@ static uct_ep_t ucp_failed_tl_ep = {
};


int ucp_ep_is_uct_ep_failed(uct_ep_h uct_ep)
then just ucp_is_uct_ep_failed, since actually there is no ucp ep at all
fixed
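For readers following the thread, the pattern requested here is a file-local sentinel object exposed only through a predicate. The sketch below uses hypothetical names (sketch_uct_ep_t, sketch_failed_tl_ep, sketch_is_uct_ep_failed, sketch_mark_lane_failed) and is not the UCX implementation; it only shows why callers never need to see the global.

```c
#include <stddef.h>

/* Hypothetical stand-in for a transport endpoint. */
typedef struct sketch_uct_ep {
    int unused;
} sketch_uct_ep_t;

/* Sentinel shared by every failed lane; 'static' keeps it private
 * to this translation unit. */
static sketch_uct_ep_t sketch_failed_tl_ep = { 0 };

/* Predicate exposed to other files instead of the variable itself. */
int sketch_is_uct_ep_failed(const sketch_uct_ep_t *uct_ep)
{
    return uct_ep == &sketch_failed_tl_ep;
}

/* Replaces a lane's endpoint with the sentinel. */
void sketch_mark_lane_failed(sketch_uct_ep_t **lane)
{
    *lane = &sketch_failed_tl_ep;
}
```

The predicate compares addresses only, so it takes a transport endpoint and no UCP endpoint, which matches the naming remark above.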
@yosefe @evgeny-leksikov do the changes look ok now?
src/ucp/core/ucp_ep.c
Outdated
if (!(ep->flags & UCP_EP_FLAG_FAILED) &&
    !(ep->flags & UCP_EP_FLAG_INTERNAL)) {
can be simplified
!(ep->flags & (UCP_EP_FLAG_FAILED | UCP_EP_FLAG_INTERNAL))
fixed
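The suggested simplification is an equivalent rewrite: "neither bit is set" can be tested with one mask. A small self-contained check, using hypothetical flag values (FLAG_FAILED and FLAG_INTERNAL are illustrative, not the UCX bit assignments):

```c
/* Hypothetical flag bits for illustration only. */
#define FLAG_FAILED   (1u << 0)
#define FLAG_INTERNAL (1u << 1)

/* Original form: two separate tests joined with &&. */
static int neither_set_long(unsigned flags)
{
    return !(flags & FLAG_FAILED) && !(flags & FLAG_INTERNAL);
}

/* Simplified form: one test against the combined mask. */
static int neither_set_short(unsigned flags)
{
    return !(flags & (FLAG_FAILED | FLAG_INTERNAL));
}
```

Both functions return 1 only when both bits are clear, regardless of any other bits in flags.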
src/ucp/core/ucp_worker.c
Outdated
if (ucp_is_uct_ep_failed(uct_ep)) {
    return;
}
IMO better to move this check to ucp_worker_discard_tl_uct_ep, it could be clearer why needed
fixed
@yosefe could you review pls?
a63830b to c03a2c1
@brminich could you review pls?
What
Fix doing force-close without error handling.
Why ?
flush(CANCEL) isn't allowed for transport without error handling support.
How ?
Set UCP_EP_FLAG_FAILED for UCP EP before discarding lanes.
Do flush(CANCEL) only if UCP_ERR_HANDLING_MODE_PEER is requested when discarding lanes.
Don't discard ucp_failed_tl_ep.
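A rough sketch of how the pieces in this description could fit together when discarding lanes: flush(CANCEL) is chosen only when peer error handling was requested, and lanes already pointing at the failed sentinel are skipped. Every name below is a hypothetical stand-in, and the fallback to a local flush when error handling is not requested is an assumption, not something stated in this PR.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the UCX definitions. */
typedef enum { SKETCH_ERR_MODE_NONE, SKETCH_ERR_MODE_PEER } sketch_err_mode_t;
typedef enum { SKETCH_FLUSH_LOCAL, SKETCH_FLUSH_CANCEL } sketch_flush_t;
typedef struct { int unused; } sketch_uct_ep_t;

/* Sentinel standing in for a failed transport endpoint. */
static sketch_uct_ep_t sketch_failed_tl_ep;

/* CANCEL is only valid when the transport supports error handling.
 * Falling back to a local flush is an assumption of this sketch. */
sketch_flush_t sketch_discard_flush_mode(sketch_err_mode_t err_mode)
{
    return (err_mode == SKETCH_ERR_MODE_PEER) ? SKETCH_FLUSH_CANCEL
                                              : SKETCH_FLUSH_LOCAL;
}

/* Counts the lanes that would actually be discarded: empty lanes and
 * lanes already replaced by the failed sentinel are skipped. */
size_t sketch_discard_lanes(sketch_uct_ep_t **lanes, size_t num_lanes,
                            sketch_err_mode_t err_mode)
{
    sketch_flush_t mode = sketch_discard_flush_mode(err_mode);
    size_t discarded    = 0;
    size_t i;

    for (i = 0; i < num_lanes; i++) {
        if ((lanes[i] == NULL) || (lanes[i] == &sketch_failed_tl_ep)) {
            continue;
        }

        (void)mode; /* a real implementation would flush with 'mode' here */
        discarded++;
    }

    return discarded;
}
```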