-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PTL: Fix connection establishment protocol (with refactoring) #1240
Conversation
@jjhursey this issue I mentioned on the call. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid this isn't the correct fix - it will break the scenario where we execute a validation handshake operation or have no available validation mechanism. Let me poke at it a bit to see how we resolve it.
@rhc54 I spent time debugging it and want to work on the fix further. |
I will address the cases you mentioned and will update PR tomorrow |
Btw for handshake it should work fine I think. |
No problem - happy to let you resolve it if you like. The problem is that the handshake and "no-module" paths both require that the final resolution be returned to the client. The correct patch is probably to simply remove the code that returned status in the macro for the "non-handshake" path, letting the code in the main function return the result for all three paths. |
If you feel it would help, you are more than welcome to replace the security handshake macro with the actual code - nothing sacred about the macro. The intent is that the server always makes final determination of the outcome and communicates it to the client, regardless of the security mechanism in use. |
e7baebd
to
fc37cbf
Compare
@@ -155,22 +155,32 @@ PMIX_EXPORT pmix_psec_module_t* pmix_psec_base_assign_module(const char *options | |||
pmix_output_verbose(2, pmix_globals.debug_output, \ | |||
"credential validated"); \ | |||
} \ | |||
/* send them the result */ \ | |||
if (PMIX_SUCCESS != (_r = pmix_ptl_base_send_blocking((p)->sd, (char*)&(_r), sizeof(int)))) { \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhc54 note, that here a htonl conversion was missing.
@@ -1249,17 +1249,24 @@ static pmix_status_t recv_connect_ack(int sd, uint8_t myflag) | |||
return rc; | |||
} | |||
reply = ntohl(u32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while here ntohl conversion is present.
It was working because the representation of 0x0 is the same in both LE and BE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also because I'm unaware of anyone actually using mixed endian clusters any more. Still, we try to maintain it just as good practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But here if one side performs htonl, but another side doesn't it will lead to an issue even on homogeneous cluster
@@ -1746,68 +1754,9 @@ static void process_cbfunc(int sd, short args, void *cbdata) | |||
/* acquire the object */ | |||
PMIX_ACQUIRE_OBJECT(cd); | |||
|
|||
/* send this status so they don't hang */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was moved down to the place where a peer
is fully initialized.
@@ -1315,21 +1318,6 @@ static pmix_status_t recv_connect_ack(int sd, uint8_t myflag) | |||
pmix_globals.myid.nspace, pmix_globals.myid.rank, | |||
pmix_client_globals.myserver->info->pname.nspace, | |||
pmix_client_globals.myserver->info->pname.rank); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC we don't have psec modules with handshake so far.
So this change should be fine with respect to cross-version compatibility.
One clarification:
It seems to me that this code is not quite correct:
- Status message is received above (generic code for both client and server)
- but here you do another receive before and not after client's handshake.
If I understand the right way would be to perform the handshake and only afterward exchange with messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The handshake itself exchanges messages as part of its procedure. I'd have to look at the broader code to understand the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let me rephrase:
Does it make sense to send the message before server has completed handshake?
IMO it doesn't, but it seems to be here in this sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bringing the whole piece in question here:
pmix_ptl_base_recv_blocking(sd, (char*)&reply, sizeof(pmix_status_t));
if (PMIX_SUCCESS != reply) {
/* see if they want us to do the handshake */
if (PMIX_ERR_READY_FOR_HANDSHAKE == reply) {
PMIX_PSEC_CLIENT_HANDSHAKE(reply, pmix_client_globals.myserver, sd);
if (PMIX_SUCCESS != reply) {
return reply;
}
/* if the handshake succeeded, then fall thru to the next step */
} else {
return reply;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original status was already received above.
And after this code piece I don't see any additional exchange to confirm that handshake was successful
PMIX_RELEASE(peer); | ||
pmix_list_remove_item(&pmix_server_globals.nspaces, &nptr->super); | ||
PMIX_RELEASE(nptr); // will release the info object | ||
PMIX_PSEC_VALIDATE_CONNECTION(reply, peer, NULL, 0, NULL, NULL, &cred); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I haven't found any psec component that supports handshake, I think it is ok to move this above the subsequent sends.
And it is more logical to have authentification performed as the very first step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of us have components that implement handshakes (e.g., openssl) outside of the core code due to security requirements of specific customers. Let's make sure we don't break that support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
How can we do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to have an openssl component but we removed it awhile ago and now provide it only upon customer request. It will take me some time to get it released again to the public repo. Meantime, I can check it when I get back next week.
} | ||
|
||
/* if we got an identifier, send it back to the tool */ | ||
if (pnd->need_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, here
https://github.com/pmix/pmix/pull/1240/files#diff-2f3e46bf48df10ec8101693fba191ef3R1283
I don't see that this case is expected in any way.
Only 2 receives are performed instead of 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'd have to look at the broader context to understand the scenario. Will do so upon return from vacation.
@rhc54, what would be a good way to check that tools communication is not broken? |
ac21cdd
to
12964c4
Compare
Easiest would be to switch to the |
|
Looks like working. |
Sure, I can do that easily enough. Good idea. |
I was thinking about the handshake codepath and I don't think it would work at all in the current code base for the following reasons.
Bottom line this sort of contributes to my statement that we are not breaking backward compatibility with tools as it was simply not working with the handshake.
This would be helpful because my original perception of the handshake logic was that server only needs to communicate that it is waiting for the handshake to begin and that afterward client-server communication (handshake) will happen over a separate channel and no additional confirmation would be required. So I was surprised by the following:
As I interpret that as in the handshake sequence:
|
@rhc54, as you know we have some urgency with 3.1.3 release. |
Sorry - I returned from vacation last night and am only just beginning to find the bottom of my work inbox, let alone finding time to deal with this issue. I should be able to make time tomorrow.
Somehow, I don't recall "email it to your competitor" being one of the options in my corporate software release manual 😄 |
Ok, thanks. |
Thanks, @karasevb ! |
@@ -566,7 +566,7 @@ void pmix_usock_send_handler(int sd, short flags, void *cbdata) | |||
pmix_peer_t *peer = (pmix_peer_t*)cbdata; | |||
pmix_ptl_send_t *msg = peer->send_msg; | |||
pmix_status_t rc; | |||
uint32_t nbytes; | |||
size_t nbytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean not this one, but the one that caused it in #1239
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at this PR now - I am not convinced we want this last commit as we cannot backport it to the release branches to make them work. Will have to ponder and look at alternatives.
Adding the tool check to the Cross-version CI might be a good thing to do. Would it be as simple as all permutations of versions
If so I can work on adding that to the testing setup. |
Yep, that would do it! Don't include v1.2.5, obviously. |
This PR breaks tool compatibility in the way I described in #1240 (comment). I can restore the tools protocol if absolutely needed, but believe its not that important for the reasons above. |
Also @rhc54, could you confirm that #1240 (comment) is making sense to you. |
@rhc54 any progress on this? |
I'm working on it as time permits - we can discuss it on today's call. |
Per teleconf:
|
bot:retest |
The IBM CI (Cross-version) build failed! Please review the log, linked below. Gist: https://gist.github.com/97adb2838d0127ac4fabf8f03c6a5c2f |
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com>
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09) (cherry picked from commit 6795331)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09) (cherry picked from commit 6795331)
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09)
@rhc54, note that this PR is mentioned in the release notes here: But I'm actually going to close it as we went ahead with an alternative solution: So last two lines of the release notes should both reference #1280. |
Avoid sending extra message in the client connection case. Is an alternative to: openpmix#1240 Signed-off-by: Artem Polyakov <artpol84@gmail.com> (cherry picked from commit 2cfed09) (cherry picked from commit 6795331)
On the server side the folowing messages are sent to the client during connection:
client only receives 2 messgaes:
For unknown reason in most of the time extra 4 bytes were consumed by the client.
But sometimes these extra bytes were prepending a valid message shifting tag and
size away and PMIx_Init hangs were observed.
Signed-off-by: Artem Polyakov artpol84@gmail.com