[FEA]: Improve pipeline cpuset logic #551

Open
2 tasks done
pdmack opened this issue Dec 14, 2022 · 0 comments
Assignees
Labels
feature request New feature or request improvement Improvement to existing functionality

Comments

Contributor

pdmack commented Dec 14, 2022

Is this a new feature, an improvement, or a change to existing functionality?

Change

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

Currently, Morpheus assumes that the usable cpuset is the hardcoded range from 0 up to the number of threads (processors) minus one when it sets the options for an MRC/SRF Executor in a pipeline. However, in some environments (e.g., Slurm), cpusets are created for users that limit which CPUs their processes may use, so that range may not be available.

https://github.com/nv-morpheus/Morpheus/blob/branch-23.01/morpheus/pipeline/pipeline.py#L70
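To illustrate the mismatch, here is a minimal sketch that compares the assumed range with the CPUs the process is actually allowed to use. The helper names `assumed_cpuset` and `visible_cpus` are hypothetical and are not part of Morpheus or MRC; `os.sched_getaffinity` is a standard-library call that is only available on Linux.

```python
import os


def assumed_cpuset(num_threads: int) -> str:
    # Hypothetical helper: the hardcoded assumption described above,
    # i.e. CPUs 0 through num_threads - 1 as a range string.
    return f"0-{num_threads - 1}"


def visible_cpus() -> set:
    # CPUs this process is allowed to run on. os.sched_getaffinity is
    # Linux-only, so fall back to a 0..N-1 guess on other platforms.
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


if __name__ == "__main__":
    print("assumed:", assumed_cpuset(os.cpu_count() or 1))
    print("visible:", sorted(visible_cpus()))
```

Inside a restricted cpuset (for example a Slurm allocation), the two printed values can differ, and the assumed range may contain no allowed CPU at all.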

MRC does not make this assumption. Instead, it evaluates the hwloc topology of the visible CPUs and compares it to the user_cpuset that has been configured.

https://github.com/nv-morpheus/MRC/blob/branch-23.01/cpp/mrc/src/internal/system/topology.cpp#L141

If the intersection of the two sets is null, MRC raises an error and the pipeline fails with a stack trace.
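The check described above can be modeled in a few lines of Python. This is only a sketch of the behavior as the issue describes it, not MRC's C++ implementation; `parse_cpuset` and `usable_cpus` are hypothetical names, and the range-string syntax ("0-3,8") is an assumption based on the common cpuset list format.

```python
def parse_cpuset(spec: str) -> set:
    # Expand a cpuset list string such as "0-3,8" into a set of CPU ids.
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def usable_cpus(user_cpuset: str, topo_cpus: set) -> set:
    # Intersect the requested cpuset with the CPUs visible in the topology
    # and fail when nothing overlaps, mirroring the error in the traceback.
    common = parse_cpuset(user_cpuset) & topo_cpus
    if not common:
        raise RuntimeError(
            "intersection between user_cpuset and topo_cpuset is null")
    return common
```

For example, a user_cpuset of "0-3" checked against visible CPUs 16 and 17 raises the error, while the same request checked against visible CPUs 2, 3, and 4 yields CPUs 2 and 3.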

Describe your ideal solution

Not sure which is ideal but:

  • Provide an option for Morpheus to pass in a usable cpuset (user responsibility).
  • Morpheus doesn't do any cpuset configuration and instead defers to MRC to make a decision, possibly guided by a handful of configurable algorithms.
  • MRC exposes an interface for the topology queries it already performs before an Executor is built, so that Morpheus can fail more gracefully and inform the user that they must choose a usable cpuset from the topology query.
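As a rough sketch of how the first and third options might look on the Morpheus side, the code below picks a cpuset from the CPUs the process is allowed to use and raises a readable error when a user-supplied cpuset does not overlap with them. Everything here is hypothetical: `format_cpuset` and `choose_cpuset` are illustrative names, the allowed set is assumed to come from something like `os.sched_getaffinity(0)` or a future MRC topology query, and the returned string is assumed to be what gets assigned to the executor options' user_cpuset.

```python
from typing import Iterable, Optional


def format_cpuset(cpus: Iterable[int]) -> str:
    # Collapse CPU ids into a range string such as "4-7,12".
    ordered = sorted(set(cpus))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu != prev + 1:
            ranges.append((start, prev))
            start = cpu
        prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def choose_cpuset(allowed: Iterable[int],
                  requested: Optional[Iterable[int]] = None) -> str:
    # With no request, use every allowed CPU. With a request, keep only the
    # overlap and fail early with an actionable message if there is none.
    allowed_set = set(allowed)
    if requested is None:
        return format_cpuset(allowed_set)
    requested_set = set(requested)
    usable = requested_set & allowed_set
    if not usable:
        raise ValueError(
            f"Requested cpuset '{format_cpuset(requested_set)}' has no CPUs "
            f"in common with the allowed cpuset "
            f"'{format_cpuset(allowed_set)}'; choose CPUs from the latter.")
    return format_cpuset(usable)
```

Failing before the Executor is constructed would also avoid the secondary AttributeError seen in the traceback under Additional context, since the pipeline would never reach the join step with an unbuilt executor.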

Describe any alternatives you have considered

No response

Additional context

====Registering Pipeline====
Error occurred during Pipeline.build(). Exiting.
Traceback (most recent call last):
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 277, in build_and_start
    self.build()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 175, in build
    self._srf_executor = srf.Executor(self._exec_options)
RuntimeError: intersection between user_cpuset and topo_cpuset is null
Traceback (most recent call last):
  File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 40, in <module>
Exception occurred in pipeline. Rethrowing
Traceback (most recent call last):
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
    await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'
====Pipeline Complete====
    run_pipeline()
  File "/data/sdp/cybersecurity_ai/files/pass_thru/run_passthru.py", line 37, in run_pipeline
    pipeline.run()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 517, in run
    asyncio.run(self._do_run())
  File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/envs/morpheus/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 495, in _do_run
    await self.join()
  File "/opt/conda/envs/morpheus/lib/python3.8/site-packages/morpheus/pipeline/pipeline.py", line 251, in join
    await self._srf_executor.join_async()
AttributeError: 'NoneType' object has no attribute 'join_async'

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@pdmack pdmack added feature request New feature or request improvement Improvement to existing functionality Needs Triage Need team to review and classify labels Dec 14, 2022
@jarmak-nv jarmak-nv removed the Needs Triage Need team to review and classify label Feb 27, 2023
Projects
Status: Todo
Development

No branches or pull requests

4 participants