Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/CORE/JAVA: Fix doing force-close without error handling #6418

Merged
merged 1 commit into from
Mar 3, 2021

Conversation

dmitrygx
Copy link
Member

@dmitrygx dmitrygx commented Mar 1, 2021

What

Fix doing force-close without error handling.

Why ?

flush(CANCEL) isn't allowed for transport without error handling support.

How ?

  1. Fix JUCX to create UCP EPs with error handling mode.
  2. Set UCP_EP_FLAG_FAILED for UCP EP before discarding lanes.
  3. Add check that UCP_ERR_HANDLING_MODE_PEER is requested when discarding lanes.
  4. Don't do discarding for ucp_failed_tl_ep.

@dmitrygx
Copy link
Member Author

dmitrygx commented Mar 2, 2021

@brminich @hoopoepg @evgeny-leksikov could you review pls?

@@ -2949,6 +2947,10 @@ void ucp_worker_discard_uct_ep(ucp_ep_h ucp_ep, uct_ep_h uct_ep,
ucs_assert(uct_ep != NULL);
ucs_assert(purge_cb != NULL);

if (uct_ep == &ucp_failed_tl_ep) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need special check?
i'd expect that flush and pending purge will be no-op on a failed_ep

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_failed_tl_ep can be added to hash and then added again - so it will fail

ucp_ep_destroy_base(ep);
}

void ucp_ep_release_id(ucp_ep_h ep)
{
ucs_status_t status;

ucs_assert(!(ep->flags & UCP_EP_FLAG_FAILED));
if (ucp_ep_ext_control(ep)->local_ep_id == UCP_EP_ID_INVALID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we set UCP_EP_FLAG_FAILED before discarding

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would id be invalid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it already released in UCP error-handling flow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucp_ep_release_id(ucp_ep);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so issue is we release ep id twice?
can we release the id same place we set ep to failed? so FAILED flag check will prevent duplicate id release

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so issue is we release ep id twice?

yes, we tries to release it twice

can we release the id same place we set ep to failed? so FAILED flag check will prevent duplicate id release

ok, done

@@ -476,6 +476,9 @@ typedef struct ucp_conn_request {
} ucp_conn_request_t;


extern uct_ep_t ucp_failed_tl_ep;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a function like ucp_lane_is_failed instead of exposing global var?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -97,6 +97,11 @@ static uct_ep_t ucp_failed_tl_ep = {
};


int ucp_ep_is_uct_ep_failed(uct_ep_h uct_ep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then just ucp_is_uct_ep_failed, since actually there is no ucp ep at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@dmitrygx
Copy link
Member Author

dmitrygx commented Mar 2, 2021

@yosefe @evgeny-leksikov do the changes look ok now?

Comment on lines 1008 to 1009
if (!(ep->flags & UCP_EP_FLAG_FAILED) &&
!(ep->flags & UCP_EP_FLAG_INTERNAL)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be simplified

!(ep->flags & (UCP_EP_FLAG_FAILED | UCP_EP_FLAG_INTERNAL))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 2952 to 2954
if (ucp_is_uct_ep_failed(uct_ep)) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO better to move this check to ucp_worker_discard_tl_uct_ep, it could be clearer why needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@dmitrygx
Copy link
Member Author

dmitrygx commented Mar 2, 2021

@yosefe could you review pls?

@dmitrygx
Copy link
Member Author

dmitrygx commented Mar 3, 2021

@brminich could you review pls?

@dmitrygx
Copy link
Member Author

dmitrygx commented Mar 3, 2021

@brminich @yosefe ok to merge?

@brminich brminich merged commit d7a8113 into openucx:master Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants