Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PTL: Fix connection establishment protocol (with refactoring) #1240

Closed
wants to merge 3 commits into from

Conversation

artpol84
Copy link
Contributor

On the server side the folowing messages are sent to the client during connection:

  • (4B) Security verification
  • (4B) Status
  • (4B) Server-local index of the peer

client only receives 2 messgaes:

  • (4B) Status
  • (4B) Server-local index of the peer

For unknown reason in most of the time extra 4 bytes were consumed by the client.
But sometimes these extra bytes were prepending a valid message shifting tag and
size away and PMIx_Init hangs were observed.

Signed-off-by: Artem Polyakov artpol84@gmail.com

@artpol84
Copy link
Contributor Author

@jjhursey this issue I mentioned on the call.
I still see another issue but it is less frequent and happens in finalize.

Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this isn't the correct fix - it will break the scenario where we execute a validation handshake operation or have no available validation mechanism. Let me poke at it a bit to see how we resolve it.

@rhc54 rhc54 closed this May 10, 2019
@artpol84 artpol84 reopened this May 10, 2019
@artpol84
Copy link
Contributor Author

@rhc54 I spent time debugging it and want to work on the fix further.
Please do not close my PRs.
You can provide your thoughts but I’d like to have a fix

@artpol84
Copy link
Contributor Author

I will address the cases you mentioned and will update PR tomorrow

@artpol84
Copy link
Contributor Author

Btw for handshake it should work fine I think.
Only absence of security module might be an issue

@rhc54
Copy link
Contributor

rhc54 commented May 10, 2019

No problem - happy to let you resolve it if you like. The problem is that the handshake and "no-module" paths both require that the final resolution be returned to the client. The correct patch is probably to simply remove the code that returned status in the macro for the "non-handshake" path, letting the code in the main function return the result for all three paths.

@rhc54
Copy link
Contributor

rhc54 commented May 10, 2019

If you feel it would help, you are more than welcome to replace the security handshake macro with the actual code - nothing sacred about the macro. The intent is that the server always makes final determination of the outcome and communicates it to the client, regardless of the security mechanism in use.

@artpol84 artpol84 force-pushed the fix/cli_connect branch 2 times, most recently from e7baebd to fc37cbf Compare May 10, 2019 18:25
@@ -155,22 +155,32 @@ PMIX_EXPORT pmix_psec_module_t* pmix_psec_base_assign_module(const char *options
pmix_output_verbose(2, pmix_globals.debug_output, \
"credential validated"); \
} \
/* send them the result */ \
if (PMIX_SUCCESS != (_r = pmix_ptl_base_send_blocking((p)->sd, (char*)&(_r), sizeof(int)))) { \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhc54 note, that here a htonl conversion was missing.

@@ -1249,17 +1249,24 @@ static pmix_status_t recv_connect_ack(int sd, uint8_t myflag)
return rc;
}
reply = ntohl(u32);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while here ntohl conversion is present.
It was working because the representation of 0x0 is the same in both LE and BE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also because I'm unaware of anyone actually using mixed endian clusters any more. Still, we try to maintain it just as good practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here if one side performs htonl, but another side doesn't it will lead to an issue even on homogeneous cluster

@@ -1746,68 +1754,9 @@ static void process_cbfunc(int sd, short args, void *cbdata)
/* acquire the object */
PMIX_ACQUIRE_OBJECT(cd);

/* send this status so they don't hang */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved down to the place where a peer is fully initialized.

@@ -1315,21 +1318,6 @@ static pmix_status_t recv_connect_ack(int sd, uint8_t myflag)
pmix_globals.myid.nspace, pmix_globals.myid.rank,
pmix_client_globals.myserver->info->pname.nspace,
pmix_client_globals.myserver->info->pname.rank);

Copy link
Contributor Author

@artpol84 artpol84 May 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC we don't have psec modules with handshake so far.
So this change should be fine with respect to cross-version compatibility.

One clarification:
It seems to me that this code is not quite correct:

  • Status message is received above (generic code for both client and server)
  • but here you do another receive before and not after client's handshake.
    If I understand the right way would be to perform the handshake and only afterward exchange with messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handshake itself exchanges messages as part of its procedure. I'd have to look at the broader code to understand the situation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let me rephrase:
Does it make sense to send the message before server has completed handshake?
IMO it doesn't, but it seems to be here in this sequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bringing the whole piece in question here:

        pmix_ptl_base_recv_blocking(sd, (char*)&reply, sizeof(pmix_status_t));	
        if (PMIX_SUCCESS != reply) {	
            /* see if they want us to do the handshake */	
            if (PMIX_ERR_READY_FOR_HANDSHAKE == reply) {	
                PMIX_PSEC_CLIENT_HANDSHAKE(reply, pmix_client_globals.myserver, sd);	
                if (PMIX_SUCCESS != reply) {	
                    return reply;	
                }	
                /* if the handshake succeeded, then fall thru to the next step */	
            } else {	
                return reply;	
            }	
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original status was already received above.
And after this code piece I don't see any additional exchange to confirm that handshake was successful

PMIX_RELEASE(peer);
pmix_list_remove_item(&pmix_server_globals.nspaces, &nptr->super);
PMIX_RELEASE(nptr); // will release the info object
PMIX_PSEC_VALIDATE_CONNECTION(reply, peer, NULL, 0, NULL, NULL, &cred);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I haven't found any psec component that supports handshake, I think it is ok to move this above the subsequent sends.
And it is more logical to have authentification performed as the very first step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of us have components that implement handshakes (e.g., openssl) outside of the core code due to security requirements of specific customers. Let's make sure we don't break that support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
How can we do that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to have an openssl component but we removed it awhile ago and now provide it only upon customer request. It will take me some time to get it released again to the public repo. Meantime, I can check it when I get back next week.

}

/* if we got an identifier, send it back to the tool */
if (pnd->need_id) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, here
https://github.com/pmix/pmix/pull/1240/files#diff-2f3e46bf48df10ec8101693fba191ef3R1283
I don't see that this case is expected in any way.
Only 2 receives are performed instead of 4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'd have to look at the broader context to understand the scenario. Will do so upon return from vacation.

@artpol84
Copy link
Contributor Author

@rhc54, what would be a good way to check that tools communication is not broken?

@artpol84 artpol84 force-pushed the fix/cli_connect branch 3 times, most recently from ac21cdd to 12964c4 Compare May 10, 2019 18:54
@rhc54
Copy link
Contributor

rhc54 commented May 10, 2019

@rhc54, what would be a good way to check that tools communication is not broken?

Easiest would be to switch to the test/simple directory and run ./simptest -e ./simptool.

@artpol84
Copy link
Contributor Author

$ ./simptest -e ./simptool
Testing version 4.0.0a1
[artpol-VirtualBox:03736] SERVER: ERRHANDLER REGISTRATION CALLBACK CALLED WITH STATUS 0, ref=0
Collecting inventory
[artpol-VirtualBox:03736] INVENTORY RECEIVED
Inventory collected: 0
[artpol-VirtualBox:03736] INVENTORY READY FOR DELIVERY
[artpol-VirtualBox:03736] SERVER: TOOL CONNECT
[artpol-VirtualBox:03752] Tool ns foobar rank 0: Running
[artpol-VirtualBox:03736] SERVER: QUERY
[artpol-VirtualBox:03736] 	Key: foobar
[artpol-VirtualBox:03736] 	Key: spastic
[artpol-VirtualBox:03752] Client ns foobar rank 0: Finalizing
[artpol-VirtualBox:03736] SERVER: FINALIZED foobar:0 WAKEUP 1
Client ns foobar rank 0:PMIx_Finalize successfully completed
SIMPTEST: Model event handler called with status -147(PMIX MODEL DECLARED)
	pmix.pgm.model:	PMIX
	pmix.mdl.name:	test
	pmix.evname:	SIMPTEST-MODEL
Test finished OK!

@artpol84
Copy link
Contributor Author

Looks like working.
We should implement the same thing for the test suite so we will cover tools as well.

@rhc54
Copy link
Contributor

rhc54 commented May 10, 2019

Looks like working.
We should implement the same thing for the test suite so we will cover tools as well.

Sure, I can do that easily enough. Good idea.

@artpol84
Copy link
Contributor Author

artpol84 commented May 11, 2019

I was thinking about the handshake codepath and I don't think it would work at all in the current code base for the following reasons.

  • Server would send a non-converted code for PMIX_ERR_READY_FOR_HANDSHAKE here. This will be value -14 = 0xFFFFFFF1
  • Client would receive it with conversion here. This would result in ntohl(0xFFFFFFF1) = 0x1FFFFFFF which is positive and not equal to original -14.

Bottom line this sort of contributes to my statement that we are not breaking backward compatibility with tools as it was simply not working with the handshake.

We used to have an openssl component but we removed it a while ago and now provide it only upon customer request. It will take me some time to get it released again to the public repo. Meantime, I can check it when I get back next week.

This would be helpful because my original perception of the handshake logic was that server only needs to communicate that it is waiting for the handshake to begin and that afterward client-server communication (handshake) will happen over a separate channel and no additional confirmation would be required. So I was surprised by the following:

The problem is that the handshake and "no-module" paths both require that the final resolution be returned to the client.

As I interpret that as in the handshake sequence:

  • Server sends PMIX_ERR_READY_FOR_HANDSHAKE
  • They do handshake
  • Another message is required to confirm that server has accepted the connection.

@artpol84
Copy link
Contributor Author

@rhc54, as you know we have some urgency with 3.1.3 release.
If you can propose any way to speed up verification of this PR would be great.
Maybe you can send me OpenSSL component and I’ll test it myself today.

@artpol84
Copy link
Contributor Author

@rhc54?

@rhc54
Copy link
Contributor

rhc54 commented May 15, 2019

Sorry - I returned from vacation last night and am only just beginning to find the bottom of my work inbox, let alone finding time to deal with this issue. I should be able to make time tomorrow.

Maybe you can send me OpenSSL component and I’ll test it myself today.

Somehow, I don't recall "email it to your competitor" being one of the options in my corporate software release manual 😄

@artpol84
Copy link
Contributor Author

Ok, thanks.
Regarding OpenSSL - it wasn’t clear to me that it is a private component.

@artpol84
Copy link
Contributor Author

Thanks, @karasevb !

@@ -566,7 +566,7 @@ void pmix_usock_send_handler(int sd, short flags, void *cbdata)
pmix_peer_t *peer = (pmix_peer_t*)cbdata;
pmix_ptl_send_t *msg = peer->send_msg;
pmix_status_t rc;
uint32_t nbytes;
size_t nbytes;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhc54 note, that effectively this change is breaking backward compatibility with previous versions as we discussed in #1239.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean not this one, but the one that caused it in #1239

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at this PR now - I am not convinced we want this last commit as we cannot backport it to the release branches to make them work. Will have to ponder and look at alternatives.

@jjhursey
Copy link
Member

Adding the tool check to the Cross-version CI might be a good thing to do. Would it be as simple as all permutations of versions A and B:

cd $HOME/pmix-A/test/simple
./simptest -e $HOME/pmix-B/test/simple/simptool

If so I can work on adding that to the testing setup.

@rhc54
Copy link
Contributor

rhc54 commented May 15, 2019

Adding the tool check to the Cross-version CI might be a good thing to do. Would it be as simple as all permutations of versions A and B:

cd $HOME/pmix-A/test/simple
./simptest -e $HOME/pmix-B/test/simple/simptool

If so I can work on adding that to the testing setup.

Yep, that would do it! Don't include v1.2.5, obviously.

@artpol84
Copy link
Contributor Author

This PR breaks tool compatibility in the way I described in #1240 (comment).
The rationale was the following:
We need to fix each branch anyway because of (a) the bug in connection protocol and (b) the fact that server sends unconverted handshake status and clients are using ntohl to interpret it.
Given that we had to touch every branch and recommend everyone to update I believe that it makes sense to refactor the protocol that is being fixed anyway.

I can restore the tools protocol if absolutely needed, but believe its not that important for the reasons above.

@artpol84
Copy link
Contributor Author

Also @rhc54, could you confirm that #1240 (comment) is making sense to you.
I extracted all my concerns/rationale here. So you can only consider this comment out of the whole discussion.

@artpol84
Copy link
Contributor Author

@rhc54 any progress on this?

@rhc54
Copy link
Contributor

rhc54 commented May 16, 2019

I'm working on it as time permits - we can discuss it on today's call.

@jjhursey
Copy link
Member

Per teleconf:

  • Ralph working on fixing the Tool connection portion of this PR (will post a PR when ready)
  • Prefer to not touch the ptl/usock component since it's only there to support the v1.2 series. We should avoid making changes to ptl/usock if at all possible.
  • Needs more investigation and review

@artpol84
Copy link
Contributor Author

bot:retest

@ibm-ompi
Copy link

The IBM CI (Cross-version) build failed! Please review the log, linked below.

Gist: https://gist.github.com/97adb2838d0127ac4fabf8f03c6a5c2f

artpol84 added a commit to artpol84/pmix that referenced this pull request May 17, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
artpol84 added a commit to artpol84/pmix that referenced this pull request May 17, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
@artpol84 artpol84 changed the title PTL: Fix connection establishment protocol PTL: Fix connection establishment protocol (with refactoring) May 17, 2019
artpol84 added a commit to artpol84/pmix that referenced this pull request May 17, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
karasevb pushed a commit to karasevb/pmix that referenced this pull request May 20, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
karasevb pushed a commit to karasevb/pmix that referenced this pull request May 20, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
karasevb pushed a commit to karasevb/pmix that referenced this pull request May 20, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 20, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
(cherry picked from commit 6795331)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
(cherry picked from commit 6795331)
artpol84 added a commit to artpol84/pmix that referenced this pull request May 23, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
@artpol84
Copy link
Contributor Author

@rhc54, note that this PR is mentioned in the release notes here:
Changes since v3.1.2rc2
PR #1237: Avoid double-free of collective tracker
PR #1237: Ensure all participants are notified of fence complete
PR #1237: Ensure all participants are notified of connect and disconnect complete
PR #1250: Fix PMIx_server_finalize hang (rare)
PR #1271: PTL/usock doesn't support tools
PR #1240: Fix the PTL connection establishment protocol <---- !!!!!
PR #1280: Fix tool connection in psec/handshake mode

But I'm actually going to close it as we went ahead with an alternative solution: So last two lines of the release notes should both reference #1280.

@artpol84 artpol84 closed this May 29, 2019
karasevb pushed a commit to karasevb/pmix that referenced this pull request Jun 5, 2019
Avoid sending extra message in the client connection case.
Is an alternative to:
openpmix#1240

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
(cherry picked from commit 2cfed09)
(cherry picked from commit 6795331)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants