@rhc54
Contributor

@rhc54 rhc54 commented Apr 30, 2015

Summary

WHAT: Add new RML channel APIs (open, send, and close) and a new QoS framework with an ACK QoS component in ORTE.
WHY: To provide a way to specify end-to-end QoS requirements and to use them when sending RML messages.
WHERE: Various files in RML, OOB, and QoS. Please refer to the list of changed files.
IMPACT: No impact on code using existing RML APIs. The QoS framework is only active when the new RML channel APIs are used.

More Details

Intended Use:

  1. rml_open_channel: The sender specifies the desired QoS requirements for the messages it wants to send to a peer by opening a channel to that peer. The desired QoS is specified using an attribute list. A no-op QoS channel can also be opened if there are no QoS requirements. On success, a channel number reference is returned to the sender in the completion callback.
  2. rml_send_channel_nb: The sender sends messages to the peer by providing the channel number corresponding to the desired QoS.
  3. rml_close_channel: When the sender is done sending all messages on the channel, it should close the channel to release the resources held for bookkeeping.
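The open, send, close life cycle listed above can be sketched with a small in-process mock. This is only an illustration under stated assumptions: the `sketch_*` names, the integer peer ID, and the fixed-size channel table are all hypothetical and are not the ORTE signatures added by this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_CHANNELS 8

/* Hypothetical QoS types, mirroring the no-op and ACK components named in the PR. */
typedef enum { SKETCH_QOS_NOOP, SKETCH_QOS_ACK } sketch_qos_type_t;

typedef struct {
    bool in_use;
    int peer;                  /* destination process, simplified to an int */
    sketch_qos_type_t qos;
    unsigned msgs_sent;        /* bookkeeping held until the channel is closed */
} sketch_channel_t;

static sketch_channel_t sketch_channels[SKETCH_MAX_CHANNELS];

/* Completion callback: receives the local channel number on success. */
typedef void (*sketch_open_cb_t)(int status, int channel_num, void *cbdata);

/* Open: reserve a channel slot and report its number through the callback. */
static int sketch_open_channel(int peer, sketch_qos_type_t qos,
                               sketch_open_cb_t cb, void *cbdata)
{
    assert(cb != NULL);
    for (int i = 0; i < SKETCH_MAX_CHANNELS; i++) {
        if (!sketch_channels[i].in_use) {
            sketch_channels[i] = (sketch_channel_t){
                .in_use = true, .peer = peer, .qos = qos, .msgs_sent = 0 };
            cb(0, i, cbdata);
            return 0;
        }
    }
    cb(-1, -1, cbdata);
    return -1;
}

/* Send: addressed by channel number rather than by destination name. */
static int sketch_send_channel(int channel_num, const void *buf, size_t len)
{
    if (channel_num < 0 || channel_num >= SKETCH_MAX_CHANNELS ||
        !sketch_channels[channel_num].in_use) {
        return -1;
    }
    (void)buf;   /* a real transport would transmit the payload here */
    (void)len;
    sketch_channels[channel_num].msgs_sent++;
    return 0;
}

/* Close: release the bookkeeping state held for the channel. */
static int sketch_close_channel(int channel_num)
{
    if (channel_num < 0 || channel_num >= SKETCH_MAX_CHANNELS ||
        !sketch_channels[channel_num].in_use) {
        return -1;
    }
    sketch_channels[channel_num].in_use = false;
    return 0;
}

/* Example callback that stores the returned channel number in *cbdata. */
static void sketch_store_channel(int status, int channel_num, void *cbdata)
{
    if (status == 0) {
        *(int *)cbdata = channel_num;
    }
}
```

A caller would pass `sketch_store_channel` and the address of an `int` to `sketch_open_channel`, use the stored number for each `sketch_send_channel` call, and finish with `sketch_close_channel`; sending on a closed channel fails.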

Theory of Operation:

Open Channel: The sender creates a channel to the destination by calling open_channel with a list of QoS attributes describing the desired QoS. The RML layer creates an RML channel object and calls the QoS framework to create a QoS channel object matching the specified QoS. The QoS framework calls the QoS component matching the requested QoS type. The selected component creates a QoS channel object with the requested attributes and returns it to the RML. The RML associates the QoS channel object with the RML channel object and sends an open_channel request, carrying the requested QoS attributes, to the destination. The destination processes the open channel request, creates an RML channel object and the corresponding QoS channel object at its end, and replies to the sender with its RML channel number (reference). The sender processes the response from the peer, stores the peer channel number, and calls the completion callback with the local channel number.
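The open handshake described above can be modeled with two endpoints in one process. In this sketch the request and reply messages are replaced by a direct function call, a single integer stands in for the QoS attribute list, and every `hs_*` name is hypothetical rather than taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HS_MAX 4

typedef struct {
    bool in_use;
    int  peer_num;     /* channel number at the other end, -1 until known */
    int  qos_window;   /* stand-in for the QoS attribute list (assumption) */
} hs_channel_t;

/* One process's channel table; the array index is the channel number. */
typedef struct {
    hs_channel_t table[HS_MAX];
} hs_endpoint_t;

/* Create a channel object with the requested attribute; returns its number. */
static int hs_alloc(hs_endpoint_t *ep, int qos_window)
{
    assert(ep != NULL);
    for (int i = 0; i < HS_MAX; i++) {
        if (!ep->table[i].in_use) {
            ep->table[i] = (hs_channel_t){
                .in_use = true, .peer_num = -1, .qos_window = qos_window };
            return i;
        }
    }
    return -1;
}

/* Destination side: create a matching channel and reply with its number. */
static int hs_handle_open_request(hs_endpoint_t *dest, int qos_window)
{
    return hs_alloc(dest, qos_window);
}

/* Sender side: create the local channel, "send" the request, store the reply. */
static int hs_open(hs_endpoint_t *sender, hs_endpoint_t *dest, int qos_window)
{
    int local = hs_alloc(sender, qos_window);
    if (local < 0) {
        return -1;
    }
    /* Direct call stands in for the open_channel request/reply exchange. */
    int remote = hs_handle_open_request(dest, qos_window);
    if (remote < 0) {
        sender->table[local].in_use = false;   /* roll back on failure */
        return -1;
    }
    sender->table[local].peer_num = remote;
    return local;   /* the number handed to the completion callback */
}
```

Note that the local and peer channel numbers are independent: the sender keeps the peer's number only so that later messages can name the right channel object at the destination.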

Send Channel: rml_send_channel is called with the channel number (instead of the destination process name used in rml_send_nb); the rest of the send parameters are similar to the existing rml_send API. The RML retrieves the RML channel object corresponding to the channel number and associates it with the send request object. The QoS framework is called to prepare for the send: the respective QoS component performs the required bookkeeping and stores that information in the QoS channel object associated with the send request. The required channel information is added to the send message, which is then forwarded to the OOB for further send processing. The send completion path is also intercepted by the QoS framework: the QoS component determines whether the send request can be completed immediately or must wait until some QoS-specific action has occurred. In the case of the ACK QoS, the send request is completed only after an ACK is received from the destination.
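The intercepted completion path can be illustrated with a minimal sketch: a no-op channel completes a send as soon as the transport is done, while an ACK channel parks the request until the matching ACK arrives. The `dc_*` names, the per-message sequence number, and the fixed pending table are assumptions made for this example, not the PR's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DC_MAX_PENDING 4

typedef enum { DC_QOS_NOOP, DC_QOS_ACK } dc_qos_t;

typedef struct {
    unsigned seq;      /* assigned during send preparation */
    bool completed;    /* true once the send request is complete */
} dc_send_t;

typedef struct {
    dc_qos_t qos;
    unsigned next_seq;
    dc_send_t *pending[DC_MAX_PENDING];   /* sends waiting for an ACK */
} dc_channel_t;

/* Prep for send: stamp the request with the channel's next sequence number. */
static void dc_prep_send(dc_channel_t *ch, dc_send_t *req)
{
    assert(ch != NULL && req != NULL);
    req->seq = ch->next_seq++;
    req->completed = false;
}

/* Transport finished: complete now (no-op QoS) or defer (ACK QoS). */
static int dc_on_transport_done(dc_channel_t *ch, dc_send_t *req)
{
    if (ch->qos == DC_QOS_NOOP) {
        req->completed = true;
        return 0;
    }
    for (int i = 0; i < DC_MAX_PENDING; i++) {
        if (ch->pending[i] == NULL) {
            ch->pending[i] = req;
            return 0;
        }
    }
    return -1;   /* too many unacknowledged sends */
}

/* ACK received: complete the matching pending send, if there is one. */
static bool dc_on_ack(dc_channel_t *ch, unsigned seq)
{
    for (int i = 0; i < DC_MAX_PENDING; i++) {
        if (ch->pending[i] != NULL && ch->pending[i]->seq == seq) {
            ch->pending[i]->completed = true;
            ch->pending[i] = NULL;
            return true;
        }
    }
    return false;
}
```

In a real implementation the `completed` flag would be replaced by invoking the caller's send callback; the flag keeps the sketch free of callback plumbing.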

Recv Channel: There is no rml_recv_channel API because the receiving process cannot enforce QoS from its end. However, the RML and QoS components on the receiving process must perform the required post-processing for a message received on a channel. The RML retrieves the channel object using the channel number received in the request header and then calls the QoS framework to process the received request. The QoS component updates its bookkeeping information and performs any required operation, such as sending an ACK back to the sender. The recv request is then returned to the RML for further processing.
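The receive-side processing can be sketched as a lookup by the channel number carried in the header, followed by a bookkeeping update that decides whether an ACK is due. The policy shown here (one ACK after every `window` messages) is an assumed reading of "windowed ACK", and the `rx_*` names and header layout are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_MAX_CHANNELS 4

/* Channel-specific data assumed to travel in the message header. */
typedef struct {
    int channel_num;
    unsigned seq;
} rx_header_t;

typedef struct {
    bool in_use;
    unsigned window;      /* send an ACK after this many messages (assumed) */
    unsigned received;    /* total messages seen on this channel */
    unsigned since_ack;   /* messages received since the last ACK */
} rx_channel_t;

/* What the QoS step asks the RML to do after processing one message. */
typedef struct {
    bool send_ack;
    unsigned ack_seq;     /* sequence number being acknowledged */
} rx_result_t;

static rx_channel_t rx_channels[RX_MAX_CHANNELS];

/* Look up the channel named in the header and update its bookkeeping. */
static int rx_process(const rx_header_t *hdr, rx_result_t *out)
{
    assert(hdr != NULL && out != NULL);
    if (hdr->channel_num < 0 || hdr->channel_num >= RX_MAX_CHANNELS ||
        !rx_channels[hdr->channel_num].in_use) {
        return -1;   /* unknown channel */
    }
    rx_channel_t *ch = &rx_channels[hdr->channel_num];
    ch->received++;
    ch->since_ack++;

    out->send_ack = false;
    out->ack_seq = 0;
    if (ch->since_ack >= ch->window) {
        out->send_ack = true;
        out->ack_seq = hdr->seq;
        ch->since_ack = 0;
    }
    return 0;
}
```

After `rx_process` returns, the message would continue through the normal receive path; the result only tells the caller whether an ACK should go back to the sender.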

Close Channel: The sending process is expected to close a channel to the peer after sending all its messages. In response, a close request is sent to the destination process, and the RML and QoS channel objects are released at both the sender's and the receiver's ends following the handshake.

Major Code Changes:

  1. QoS Framework: A new MCA framework for the QoS feature was added. The framework is called by the RML to process the rml_channel_xxx requests.
  2. ACK QoS Component: The ACK QoS component provides windowed ACK functionality. It also supports retry of lost messages (detected through an out-of-order ACK or an ACK timeout).
  3. Noop QoS Component: This is a no-op QoS component intended for bookkeeping and as a placeholder.
  4. RML: Three new RML channel APIs, plus minor modifications to the send processing path, send completion, and recv handling. A new orte_rml_channel object was added, along with additional fields in the send and recv objects to carry channel information.
  5. OOB: Minor modifications to send completion, and the addition of channel-specific data to message headers.
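The retry behavior of the ACK component can be illustrated with a sender-side sketch that covers only the ACK-timeout case. It assumes a fixed window of in-flight messages, a caller-supplied clock in arbitrary ticks, and cumulative ACKs (an ACK for sequence number n covers everything up to n); none of these details, nor the `rt_*` names, come from the PR itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RT_WINDOW 4   /* maximum unacknowledged messages (assumed) */

typedef struct {
    bool in_flight;
    unsigned seq;
    unsigned sent_at;    /* time of the most recent transmission */
    unsigned attempts;   /* how many times this message has been sent */
} rt_slot_t;

typedef struct {
    rt_slot_t slots[RT_WINDOW];
    unsigned timeout;    /* ticks to wait for an ACK before resending */
} rt_sender_t;

/* Record a new message; returns false when the window is full. */
static bool rt_send(rt_sender_t *s, unsigned seq, unsigned now)
{
    assert(s != NULL);
    for (int i = 0; i < RT_WINDOW; i++) {
        if (!s->slots[i].in_flight) {
            s->slots[i] = (rt_slot_t){
                .in_flight = true, .seq = seq, .sent_at = now, .attempts = 1 };
            return true;
        }
    }
    return false;
}

/* Cumulative ACK: release every in-flight message with seq <= acked_seq. */
static unsigned rt_ack(rt_sender_t *s, unsigned acked_seq)
{
    unsigned released = 0;
    for (int i = 0; i < RT_WINDOW; i++) {
        if (s->slots[i].in_flight && s->slots[i].seq <= acked_seq) {
            s->slots[i].in_flight = false;
            released++;
        }
    }
    return released;
}

/* Timer tick: resend every message whose ACK is overdue; returns the count. */
static unsigned rt_tick(rt_sender_t *s, unsigned now)
{
    unsigned resent = 0;
    for (int i = 0; i < RT_WINDOW; i++) {
        rt_slot_t *slot = &s->slots[i];
        if (slot->in_flight && now - slot->sent_at >= s->timeout) {
            slot->sent_at = now;   /* a real sender would retransmit here */
            slot->attempts++;
            resent++;
        }
    }
    return resent;
}
```

A full window makes `rt_send` fail, which is how the sketch applies back-pressure; a real component would more likely queue the message or defer the send.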

@rhc54 rhc54 added the RFC label Apr 30, 2015
@rhc54 rhc54 added this to the Open MPI 1.9.0 milestone Apr 30, 2015
@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/496/

Build Log
last 50 lines

[...truncated 8467 lines...]
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_req.c: In function 'mca_oob_ud_req_complete':
oob_ud_req.c:292: error: expected '}' before 'else'
oob_ud_req.c:286: warning: enumeration value 'MCA_OOB_UD_REQ_RECV' not handled in switch
oob_ud_req.c:295: error: break statement not within loop or switch
oob_ud_req.c:296: error: case label not within a switch statement
oob_ud_req.c:326: error: 'mca_oob_ud_req_t' has no member named 'req_seqnum'
oob_ud_req.c:347: error: break statement not within loop or switch
oob_ud_req.c:348: error: 'default' label not within a switch statement
oob_ud_req.c:349: error: break statement not within loop or switch
oob_ud_req.c: At top level:
oob_ud_req.c:352: warning: data definition has no type or storage class
oob_ud_req.c:352: warning: type defaults to 'int' in declaration of 'mca_oob_ud_req_return'
oob_ud_req.c:352: warning: parameter names (without types) in function declaration
oob_ud_req.c:352: error: conflicting types for 'mca_oob_ud_req_return'
oob_ud_req.c:245: note: previous definition of 'mca_oob_ud_req_return' was here
oob_ud_req.c:353: error: expected identifier or '(' before '}' token
make[2]: *** [oob_ud_send.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [oob_ud_req.lo] Error 1
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte/mca/oob/ud'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Did not find any matching files.
Anchor chain: could not read file with links: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/jenkins_sidelinks.txt (No such file or directory)
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/8addc6f857ce5c0dcd1b06a023fee2df31686ab6
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of 2850db579fb38780e0b9274182af4e44e1486b2f to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/496/ and message: Merged build finished.

Test FAILed.

@rhc54
Contributor Author

rhc54 commented Apr 30, 2015

bot:retest

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/503/
Test PASSed.

@jladd-mlnx
Member

I would like for @nkogteva to review this patch and for her to run her data collection scripts on this commit on Orion before it is merged. She returns from May Day vacation (Russian holidays) on Wednesday. Please hold the merge until @nkogteva completes this task.

Please correct the misprint: mca_base_var_rtegister => mca_base_var_register. The code does not compile with --enable-timing.

@rhc54
Contributor Author

rhc54 commented May 7, 2015

@nkogteva I think I have this cleaned up now. I see some significant work still to be done in the code, but the default code path appears to be untouched, as promised.

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/508/
Test PASSed.

@nkogteva

nkogteva commented May 7, 2015

@rhc54 yes, thanks.

@jladd-mlnx The old functionality works as promised in the description. Light tests (like hello world or the pmix test) work fine. I tried to run the new oob_stress_channel test with oob ud, and it hangs. But I'm sure the problem is in oob ud, not in this PR, because the old oob_stress test also hangs with oob ud. So I think this PR can be merged. I will look at the stress tests with ud and handle the problem in a separate PR.

rhc54 pushed a commit that referenced this pull request May 7, 2015
Consolidate all the QOS changes into one clean commit
@rhc54 rhc54 merged commit d09927f into open-mpi:master May 7, 2015
@rhc54 rhc54 deleted the qos branch May 7, 2015 15:17
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016