TL collective plugin interface #156

Merged
merged 1 commit into from
Dec 13, 2021

Collaborator

@vspetrov vspetrov commented Apr 9, 2021

What

This PR adds an interface for custom/closed/vendor plugins inside a TL, i.e. plugins at the level of the collective algorithm implementation.

Why?

The use case: given an existing open-source TL (TL/UCP is the most obvious example), provide a way for a 3rd party to implement an algorithm that reuses TL/UCP resources and functionality, and to distribute that algorithm as a closed plugin.

How?

A TL-level interface (iface) is provided, and the build process gains the logic needed to search for plugins. Building plugins is enabled with "--with-tlcp", where tlcp stands for "tl collective plugin".
An example reference implementation of a TL/UCP plugin is added in components/tl/ucp/coll_plugins/example.
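
As a rough illustration of the build step, the sketch below assembles a configure invocation with plugin builds enabled and prints it rather than running it. Only "--with-tlcp" comes from this PR; the install prefix and the autogen/configure/make sequence in the trailing comment are assumptions about a typical autotools build.

```shell
#!/bin/sh
# Hypothetical install prefix (assumption, not from this PR).
PREFIX="${PREFIX:-$HOME/ucc-install}"

# --with-tlcp enables building TL collective plugins (flag named in this PR).
CONFIGURE_ARGS="--prefix=$PREFIX --with-tlcp"

# Dry run: print the command so it can be reviewed before use.
echo "./configure $CONFIGURE_ARGS"

# Assumed typical sequence inside a UCC checkout:
#   ./autogen.sh && ./configure $CONFIGURE_ARGS && make && make install
```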

The development flow for a 3rd party: fork the UCC repo and add a "git submodule" containing the plugin code, so the plugin code base can be stored separately. Syncing with the main UCC repo is then a simple git pull: since the plugin code lives in a separate folder (the submodule), there are no merge issues.
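
The flow above could look roughly like the following dry run, which only prints the git commands. The plugin URL, the plugin directory name, and the "upstream" remote are hypothetical; placing the submodule next to the example plugin is an assumption based on the example's location.

```shell
#!/bin/sh
# Dry-run helper: prints each command instead of executing it.
run() { echo "+ $*"; }

# Hypothetical names (assumptions for illustration only).
PLUGIN_URL="git@example.com:vendor/my_tlcp.git"
PLUGIN_DIR="src/components/tl/ucp/coll_plugins/my_tlcp"

# One-time setup in the vendor's fork of UCC.
run git submodule add "$PLUGIN_URL" "$PLUGIN_DIR"
run git commit -m "Add my_tlcp plugin as a submodule"

# Later: sync the fork with the main UCC repo ("upstream" is an assumed remote name).
run git pull upstream master
run git submodule update --init --recursive
```

Replacing the body of run with "$@" would execute the commands instead of printing them.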

The plugin algorithm integrates naturally into the score-based selection logic. The plugin provides a "get_scores" interface and thereby reports its scores to the TL during team creation. This lets the vendor define which collectives, message ranges, and memory types their plugin will run on. This can also be altered at runtime with the same "SCORE" parameter. For the example plugin above it is: UCC_TL_UCP_TLCP_EXAMPLE_SCORE.

A plugin may implement several algorithms for different collectives if needed, so a single vendor needs no more than one plugin.

@gvallee
Contributor

gvallee commented Nov 17, 2021

@vspetrov I successfully created a new plugin, no major issue there. But I am not sure I understand the mpirun parameters to use to activate the plugin during the execution of a job. My understanding is the following; please let me know if I got something wrong:

  • The plugin correctly compiles and is installed.
  • Activate UCC for the collective by adding the following mpirun parameters: --mca coll_ucc_enable 1 --mca coll_ucc_priority 100.
  • Add the environment variable associated with the plugin to the mpirun command; in the context of an example plugin targeting alltoallv, something like: -x UCC_TL_UCP_TLCP_EXAMPLE_TUNE="UCC_COLL_TYPE_ALLTOALLV:UCC_MEMORY_TYPE_HOST:0,inf:100".

Unfortunately, my plugin is not being picked up at runtime and I do not see what I am doing wrong. Could you please help? Thanks.

@vspetrov
Collaborator Author

@gvallee your TUNE syntax looks incorrect. See the wiki: https://github.com/openucx/ucc/wiki/FAQ#6-what-is-tl-scoring-and-how-to-select-a-certain-tl

Do you see any output from UCC regarding wrong syntax?
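
A sketch of how a TUNE string might be assembled, assuming the token layout described in the linked FAQ is coll_type:msg_range:mem_type:team_size:score:alg, with unused qualifiers omitted. The lowercase names "alltoallv" and "host" are assumptions and should be checked against the FAQ before use.

```shell
#!/bin/sh
# Assumed qualifier values (verify against the FAQ; not confirmed in this PR).
COLL="alltoallv"   # collective type
MEM="host"         # memory type
SCORE="inf"        # score: an integer or "inf"

# Qualifiers are joined with ':'; msg_range, team_size, and alg are left out here.
TUNE="${COLL}:${MEM}:${SCORE}"

# Variable name as used in this conversation for the example plugin.
export UCC_TL_UCP_TLCP_EXAMPLE_TUNE="$TUNE"
echo "UCC_TL_UCP_TLCP_EXAMPLE_TUNE=$UCC_TL_UCP_TLCP_EXAMPLE_TUNE"
```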

@gvallee
Contributor

gvallee commented Nov 17, 2021

No warning whatsoever, and I do not understand the documentation your link points to. There is no list of possible values, so I am sure the documentation works well for someone who has some understanding of what needs to be done, but I personally do not get it at all. I will dig into the code.

@vspetrov
Collaborator Author

@gvallee let's try with just UCC_TL_UCP_TLCP_EXAMPLE_TUNE=inf
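
Put together with the mpirun parameters mentioned earlier in this thread, the full command might look like the dry run below. The process count and the application path are hypothetical; the --mca options and the TUNE variable are the ones quoted in this conversation.

```shell
#!/bin/sh
# Hypothetical job parameters (assumptions for illustration only).
NP=2
APP="./my_mpi_app"

# Enable UCC collectives in Open MPI and force the example plugin's score to inf.
CMD="mpirun -np $NP --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TLCP_EXAMPLE_TUNE=inf $APP"

# Dry run: print the command instead of launching a job.
echo "$CMD"
```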

@gvallee
Contributor

gvallee commented Nov 17, 2021

It works for the example plugin but not for my plugin, so something is wrong with my plugin. I will investigate. Thanks for your help!

@gvallee
Contributor

gvallee commented Nov 17, 2021

@vspetrov All done and it works. No suggestions for changes at this point, other than that it might be useful to add a version to guarantee compatibility with the UCC build into which the plugin may be dropped. I think this could wait; it is not really required at the moment.

It would be nice if this PR could be merged so that I can really start working on the plugin and keep the code as close as possible to the main branch (to avoid API change issues and so on).

Thanks for your work!

src/coll_score/ucc_coll_score.c (outdated, resolved)
config/m4/tl_coll_plugins.m4 (outdated, resolved)
src/coll_score/ucc_coll_score.c (outdated, resolved)
src/components/tl/ucp/tl_ucp_team.c (outdated, resolved)
src/components/tl/ucp/tl_ucp_lib.c (outdated, resolved)
src/components/tl/ucp/coll_plugins/example/example.c (outdated, resolved)
src/components/tl/ucp/coll_plugins/example/example.c (outdated, resolved)
src/components/tl/ucp/coll_plugins/example/example.c (outdated, resolved)
@vspetrov
Collaborator Author

vspetrov commented Dec 9, 2021

@Sergei-Lebedev addressed

config/m4/tl_coll_plugins.m4 (outdated, resolved)
src/components/tl/ucp/coll_plugins/example/Makefile.am (outdated, resolved)
@vspetrov vspetrov merged commit c972a58 into openucx:master Dec 13, 2021
@vspetrov vspetrov deleted the topic/tl_coll_plugin branch December 13, 2021 07:55
kingchc pushed a commit to facebookresearch/ucc that referenced this pull request Jul 20, 2022
Labels: Ready-for-Review, Target: Master