@annu13
Contributor

@annu13 annu13 commented Apr 29, 2015

WHAT: Add new RML channel APIs (open, send, and close) and a new QoS framework with an ACK QoS component in ORTE.
WHY: To provide a way to specify end-to-end QoS requirements and to apply them when sending RML messages.
WHERE: Various files in RML, OOB, and QoS. Please refer to the changed-file list.
IMPACT: No impact on code using the existing RML APIs. The QoS framework is only active when the new RML channel APIs are used.

More Details:
Intended Use:
rml_open_channel: The sender specifies the QoS requirements for the messages it wants to send to a peer by opening a channel to that peer. The desired QoS is specified using an attribute list. A no-op QoS channel can also be opened if there are no QoS requirements. On success, a channel number is returned to the sender in the completion callback.

rml_send_channel_nb: The sender sends messages to the peer by providing the channel number corresponding to the desired QoS.

rml_close_channel: When the sender has finished sending all messages on the channel, it should close the channel to release the resources held for bookkeeping.
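
The open, send, close lifecycle described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `chan_open`, `chan_send_nb`, `chan_close`, `qos_attr_t`, and the `window` attribute are hypothetical names invented for this sketch, not the signatures added by this PR.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they are not the ORTE API. */
enum { MAX_CHANNELS = 8 };
typedef enum { QOS_NOOP, QOS_ACK } qos_type_t;

typedef struct {
    qos_type_t type;   /* which QoS component is requested */
    int window;        /* example attribute, assumed for illustration */
} qos_attr_t;

typedef struct {
    int in_use;
    int peer;
    qos_attr_t attr;
    int sent;          /* messages sent on this channel */
} channel_t;

static channel_t channels[MAX_CHANNELS];

/* Completion callback: receives a status and the local channel number. */
typedef void (*open_cb_t)(int status, int channel_num, void *cbdata);

/* Open a channel to a peer; the channel number arrives via the callback. */
static void chan_open(int peer, const qos_attr_t *attr, open_cb_t cb, void *cbdata)
{
    assert(cb != NULL);
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (!channels[i].in_use) {
            channels[i].in_use = 1;
            channels[i].peer = peer;
            /* No attribute list means a no-op QoS channel. */
            channels[i].attr = attr ? *attr : (qos_attr_t){ QOS_NOOP, 0 };
            channels[i].sent = 0;
            cb(0, i, cbdata);
            return;
        }
    }
    cb(-1, -1, cbdata);  /* no free slot */
}

/* Send addresses a channel number rather than a peer name. */
static int chan_send_nb(int channel_num, const void *buf, size_t len)
{
    if (channel_num < 0 || channel_num >= MAX_CHANNELS || !channels[channel_num].in_use)
        return -1;
    (void)buf; (void)len;  /* a real implementation would hand these to the transport */
    channels[channel_num].sent++;
    return 0;
}

/* Close releases the bookkeeping held for the channel. */
static int chan_close(int channel_num)
{
    if (channel_num < 0 || channel_num >= MAX_CHANNELS || !channels[channel_num].in_use)
        return -1;
    channels[channel_num].in_use = 0;
    return 0;
}

static void store_channel(int status, int channel_num, void *cbdata)
{
    *(int *)cbdata = (status == 0) ? channel_num : -1;
}

/* Walk through open -> send -> close; returns 0 on success. */
int channel_lifecycle_demo(void)
{
    int chan = -1;
    qos_attr_t attr = { QOS_ACK, 4 };
    chan_open(1, &attr, store_channel, &chan);
    if (chan < 0) return -1;
    if (chan_send_nb(chan, "hello", 5) != 0) return -1;
    if (chan_close(chan) != 0) return -1;
    /* Sending on a closed channel must fail. */
    return chan_send_nb(chan, "late", 4) == -1 ? 0 : -1;
}
```

Calling `channel_lifecycle_demo()` exercises the whole sequence. The point of the sketch is that the sender holds only a channel number after the open completes, and every later send or close refers to that number.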

Theory of Operation:
Open Channel: The sender creates a channel to the destination by calling open_channel with a list of QoS attributes describing the desired QoS. The RML layer creates an RML channel object and calls the QoS framework to create a QoS channel object matching the specified QoS. The QoS framework calls the QoS component that matches the requested QoS type. The selected component creates a QoS channel object with the requested attributes and returns it to the RML. The RML associates the QoS channel object with the RML channel object and sends an open_channel request, carrying the requested QoS attributes, to the destination. The destination processes the request, creates an RML channel object and the corresponding QoS channel object at its end, and replies to the sender with its RML channel number (reference). The sender processes the response from the peer, stores the peer channel number, and calls the completion callback with the local channel number.
Send Channel: rml_send_channel_nb is called with the channel number (instead of the destination process name used by rml_send_nb); the remaining send parameters are similar to those of the existing RML send API. The RML retrieves the RML channel object corresponding to the channel number and associates it with the send request object. The QoS framework is called to prepare the send; the respective QoS component performs the required bookkeeping and stores that information in the QoS channel object associated with the send request. The required channel information is added to the message, which is then forwarded to the OOB for further send processing. The send completion path is also intercepted by the QoS framework: the QoS component determines whether the send request can be completed immediately or must wait until some QoS-specific action has occurred. In the case of the ACK QoS, the send request is completed only after an ACK is received from the destination.
Recv Channel: There is no rml_recv_channel API because the receiving process cannot enforce QoS from its end. However, the RML and QoS components on the receiving process must perform the required post-processing for a message received on a channel. The RML retrieves the channel object using the channel number received in the request header and then calls the QoS framework to process the received request. The QoS component updates its bookkeeping information and performs any required operation, such as sending an ACK back to the sender. The receive request is then returned to the RML for further processing.
Close Channel: The sending process is expected to close a channel to the peer after sending all of its messages. A close request is then sent to the destination process, and the RML and QoS channel objects are released at both the sender's and the receiver's end following the handshake.
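
The open and close handshakes above can be modelled as a small sketch in which each process keeps its own channel table and the two ends exchange channel numbers. Everything here is hypothetical: `endpoint_t`, `open_channel`, `handle_open_request`, and `close_channel` are illustrative names, message delivery is replaced by a direct function call, and the sketch assumes the open request also carries the sender's channel number, which the description does not state.

```c
#include <assert.h>

/* Hypothetical model; names are illustrative, not ORTE identifiers. */
enum { MAX_CHAN = 4 };

typedef struct {
    int in_use;
    int peer_chan;   /* channel number on the other side, -1 until known */
    int qos_type;    /* copied from the open request */
} chan_t;

typedef struct {
    chan_t chans[MAX_CHAN];   /* per-process channel table */
} endpoint_t;

static int alloc_chan(endpoint_t *ep, int qos_type)
{
    for (int i = 0; i < MAX_CHAN; i++) {
        if (!ep->chans[i].in_use) {
            ep->chans[i].in_use = 1;
            ep->chans[i].peer_chan = -1;
            ep->chans[i].qos_type = qos_type;
            return i;
        }
    }
    return -1;
}

/* Destination side: create a matching channel and reply with its number. */
static int handle_open_request(endpoint_t *dest, int sender_chan, int qos_type)
{
    int c = alloc_chan(dest, qos_type);
    if (c >= 0) dest->chans[c].peer_chan = sender_chan;  /* assumption, see lead-in */
    return c;
}

/* Sender side: create the local channel, then do the request/reply exchange. */
static int open_channel(endpoint_t *sender, endpoint_t *dest, int qos_type)
{
    int local = alloc_chan(sender, qos_type);
    if (local < 0) return -1;
    int remote = handle_open_request(dest, local, qos_type);
    if (remote < 0) {
        sender->chans[local].in_use = 0;   /* undo on failure */
        return -1;
    }
    sender->chans[local].peer_chan = remote;  /* store the peer channel number */
    return local;                             /* what the callback would receive */
}

/* Close releases the channel objects on both ends. */
static int close_channel(endpoint_t *sender, endpoint_t *dest, int local)
{
    if (local < 0 || local >= MAX_CHAN || !sender->chans[local].in_use) return -1;
    int remote = sender->chans[local].peer_chan;
    assert(remote >= 0 && remote < MAX_CHAN);
    dest->chans[remote].in_use = 0;
    sender->chans[local].in_use = 0;
    return 0;
}

/* Returns 0 when the handshake behaves as described. */
int handshake_demo(void)
{
    endpoint_t a = {0}, b = {0};
    (void)alloc_chan(&b, 0);   /* occupy slot 0 on b so the numbers differ */
    int local = open_channel(&a, &b, 1);
    if (local != 0) return -1;
    if (a.chans[0].peer_chan != 1 || b.chans[1].peer_chan != 0) return -2;
    if (close_channel(&a, &b, local) != 0) return -3;
    return (a.chans[0].in_use == 0 && b.chans[1].in_use == 0) ? 0 : -4;
}
```

The demo deliberately makes the two ends allocate different channel numbers, which shows why the sender has to store the peer's number separately from the local number it hands to the completion callback.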

Major Code Changes:
QoS Framework: A new MCA framework for the QoS feature was added. The framework is called by the RML to process the rml_channel_xxx requests.
ACK QoS Component: The ACK QoS component provides windowed ACK functionality. It also supports retrying lost messages (detected through an out-of-order ACK or an ACK timeout).
Noop QoS Component: A no-op QoS component intended for bookkeeping and as a placeholder.
RML: Three new RML channel APIs, plus minor modifications to the send processing path, send completion, and receive handling. A new orte_rml_channel object was added, along with additional fields in the send and recv objects to carry channel information.
OOB: Minor modifications to send completion, and the addition of channel-specific data to message headers.
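
As a rough illustration of what windowed ACK with retry could look like on the sender side, here is a minimal sketch. It is not the component's code: the window size, the per-message ACK semantics, and the names `ack_sender_t`, `ack_send`, `ack_receive`, and `ack_timeout` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sender-side bookkeeping; not the ACK QoS component itself. */
enum { WINDOW = 4 };   /* assumed window size */

typedef struct {
    unsigned seq;   /* sequence number of the message */
    int sends;      /* transmissions so far, including retries */
} pending_t;

typedef struct {
    pending_t slots[WINDOW];
    int count;          /* un-ACKed messages currently in flight */
    unsigned next_seq;
    int completed;      /* sends whose completion would have fired */
} ack_sender_t;

/* Queue a send; returns its sequence number, or -1 if the window is full. */
static long ack_send(ack_sender_t *s)
{
    if (s->count == WINDOW) return -1;
    pending_t *p = &s->slots[s->count++];
    p->seq = s->next_seq++;
    p->sends = 1;
    return (long)p->seq;
}

/* An ACK completes the matching send and frees its slot.
 * ACKs may arrive in any order; unknown or duplicate ACKs are ignored. */
static bool ack_receive(ack_sender_t *s, unsigned seq)
{
    assert(s->count >= 0 && s->count <= WINDOW);
    for (int i = 0; i < s->count; i++) {
        if (s->slots[i].seq == seq) {
            s->count--;
            s->slots[i] = s->slots[s->count];  /* move last entry into the gap */
            s->completed++;                    /* completion deferred until now */
            return true;
        }
    }
    return false;
}

/* On ACK timeout, retransmit everything still un-ACKed; returns how many. */
static int ack_timeout(ack_sender_t *s)
{
    for (int i = 0; i < s->count; i++) s->slots[i].sends++;
    return s->count;
}

/* Returns 0 when the window behaves as described. */
int ack_window_demo(void)
{
    ack_sender_t s = {0};
    for (int i = 0; i < WINDOW; i++)
        if (ack_send(&s) != i) return -1;
    if (ack_send(&s) != -1) return -2;     /* window is full */
    if (!ack_receive(&s, 2)) return -3;    /* out-of-order ACK */
    if (s.completed != 1) return -4;
    if (ack_timeout(&s) != 3) return -5;   /* messages 0, 1, 3 are resent */
    if (ack_send(&s) != 4) return -6;      /* the freed slot is reusable */
    if (ack_receive(&s, 2)) return -7;     /* duplicate ACK is ignored */
    return 0;
}
```

In this model a send only counts as completed inside `ack_receive`, mirroring the statement that the send request completes after the ACK arrives, and the window bounds how many sends can be outstanding at once.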

ggouaillardet and others added 20 commits February 20, 2015 09:44
… degree

of the topology is higher than the communicator size

It is possible to have a topology degree higher than the size of the communicator.
For example, a periodic cartesian communicator on MPI_COMM_SELF. This will leave
the neighborhood collectives with a request buffer that is too small.

This commit introduces a semantic change:
from now on, c_topo must be set before invoking coll_select
…number of bytes remaining to be output or else we will output duplicate bytes when next we are able to write.
@rhc54
Contributor

rhc54 commented Apr 29, 2015

@annu13 could you please rebase this PR, and then squash your commits into a single one? Makes the history cleaner.

@rhc54
Contributor

rhc54 commented Apr 29, 2015

@annu13 also, I see that there are a fair number of files being touched here that just have whitespace deletions and/or some odd changes that have nothing to do with this PR. Can you please revert those so we only get the changes relative to this PR here? You can submit the whitespace changes separately.

Member

This all looks like whitespace changes in ompi, even though this PR is supposed to be for orte. I'd prefer to see this type of change moved to a separate PR.

@rhc54 rhc54 added the RFC label Apr 29, 2015
@rhc54 rhc54 added this to the Open MPI 1.9.0 milestone Apr 29, 2015
@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/494/

Build Log
last 50 lines

[...truncated 8464 lines...]
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_req.c: In function 'mca_oob_ud_req_complete':
oob_ud_req.c:292: error: expected '}' before 'else'
oob_ud_req.c:286: warning: enumeration value 'MCA_OOB_UD_REQ_RECV' not handled in switch
oob_ud_req.c:295: error: break statement not within loop or switch
oob_ud_req.c:296: error: case label not within a switch statement
oob_ud_req.c:326: error: 'mca_oob_ud_req_t' has no member named 'req_seqnum'
oob_ud_req.c:347: error: break statement not within loop or switch
oob_ud_req.c:348: error: 'default' label not within a switch statement
oob_ud_req.c:349: error: break statement not within loop or switch
oob_ud_req.c: At top level:
oob_ud_req.c:352: warning: data definition has no type or storage class
oob_ud_req.c:352: warning: type defaults to 'int' in declaration of 'mca_oob_ud_req_return'
oob_ud_req.c:352: warning: parameter names (without types) in function declaration
oob_ud_req.c:352: error: conflicting types for 'mca_oob_ud_req_return'
oob_ud_req.c:245: note: previous definition of 'mca_oob_ud_req_return' was here
oob_ud_req.c:353: error: expected identifier or '(' before '}' token
make[2]: *** [oob_ud_req.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [oob_ud_send.lo] Error 1
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte/mca/oob/ud'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Did not find any matching files.
Anchor chain: could not read file with links: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/jenkins_sidelinks.txt (No such file or directory)
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/56d3bb9a6b6a03c202166727dd2c4d17e15ade74
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of 13adb9cdc01d71883e65013e2c5af8cd503bf2ca to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/494/ and message: Merged build finished.

Test FAILed.

undoing whitespace auto insertions
@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/495/

Build Log
last 50 lines

[...truncated 8465 lines...]
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_send.c:114: error: 'orte_rml_send_t' has no member named 'msg'
oob_ud_req.c: In function 'mca_oob_ud_req_complete':
oob_ud_req.c:292: error: expected '}' before 'else'
oob_ud_req.c:286: warning: enumeration value 'MCA_OOB_UD_REQ_RECV' not handled in switch
oob_ud_req.c:295: error: break statement not within loop or switch
oob_ud_req.c:296: error: case label not within a switch statement
oob_ud_req.c:326: error: 'mca_oob_ud_req_t' has no member named 'req_seqnum'
oob_ud_req.c:347: error: break statement not within loop or switch
oob_ud_req.c:348: error: 'default' label not within a switch statement
oob_ud_req.c:349: error: break statement not within loop or switch
oob_ud_req.c: At top level:
oob_ud_req.c:352: warning: data definition has no type or storage class
oob_ud_req.c:352: warning: type defaults to 'int' in declaration of 'mca_oob_ud_req_return'
oob_ud_req.c:352: warning: parameter names (without types) in function declaration
oob_ud_req.c:352: error: conflicting types for 'mca_oob_ud_req_return'
oob_ud_req.c:245: note: previous definition of 'mca_oob_ud_req_return' was here
oob_ud_req.c:353: error: expected identifier or '(' before '}' token
make[2]: *** [oob_ud_req.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [oob_ud_send.lo] Error 1
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte/mca/oob/ud'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/orte'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
TAP Reports Processing: START
Looking for TAP results report in workspace using pattern: **/*.tap
Did not find any matching files.
Anchor chain: could not read file with links: /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/jenkins_sidelinks.txt (No such file or directory)
[copy-to-slave] The build is taking place on the master node, no copy back to the master will take place.
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/64711d3f3104ea0c164a9d1dd5150dab71a3e9a6
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of e75963eee9fe574355f24f0a60ba638c4e4d0079 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/495/ and message: Merged build finished.

Test FAILed.

@rhc54
Contributor

rhc54 commented Apr 30, 2015

Replaced by PR #564

@rhc54 rhc54 closed this Apr 30, 2015
jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
Remove the orte/qos framework and associated changes that should not …
markalle pushed a commit to markalle/ompi that referenced this pull request Sep 12, 2020
Defect 231781 - Pull in 3 orte commits, seems to fix orted setup race condition.
5 participants