Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pyproject.toml with build system dependencies for PEP-517 installs #3758

Closed
wants to merge 4 commits into from
Closed

Add pyproject.toml with build system dependencies for PEP-517 installs #3758

wants to merge 4 commits into from

Conversation

itsdani
Copy link

@itsdani itsdani commented Oct 27, 2022

This should fix missing modules errors when installed with pip install --use-pep517 horovod

Description

Fixes #3697.

This should fix missing modules when installed with
`pip install --use-pep517 horovod`
@itsdani
Copy link
Author

itsdani commented Oct 27, 2022

Now this won't work, because the build defaults to using PEP-517, and it also misses the tensorflow dependency, which was the original problem in #3697

@itsdani
Copy link
Author

itsdani commented Oct 27, 2022

I think the build fails because it tries to build the wheel using PEP-517 (because it finds the pyproject.toml) with HOROVOD_WITH_TENSORFLOW=1, but the build-requirements in pyproject.toml doesn't include tensorflow as it's not always needed.

I think the solution is to define a custom wrapper for the setuptools build system (see this in the setuptools docs), which can specify additional dynamic build-dependencies by providing a get_requires_for_build_wheel function. This could check which additional packages are needed during the build based on build parameters (I'm not sure if env vars or required extras), and return them.

Unfortunately I was unable to set up the build locally to reproduce the error in CI, and I don't have more time to work on this for a while.

@itsdani itsdani marked this pull request as draft October 27, 2022 12:38
@maxhgerlach
Copy link
Collaborator

maxhgerlach commented Oct 27, 2022

Yes, some dynamic identification of build dependencies would be necessary.

I did some research in #3483 (comment) and follow up comments and then it looked like it wouldn't work well to have differing build requirements depending on extras, but of course there may be a way.


If none of the frameworks TensorFlow, PyTorch, or MXNet are available at build time, Horovod is installed in a minimal configuration without its C++ core and cannot be used for distributed training. I suspect that's what you built locally.

HOROVOD_WITH_TENSORFLOW=1 enforces that TF must be available instead of (silently) skipping support.

@itsdani
Copy link
Author

itsdani commented Nov 7, 2022

Based on this POC the dynamic build-dependencies might work this way, but there are a few problems:

  • In setup.py there is a from horovod import __version__ line I had to remove, because this makes the build depend on horovod itself, and the build fails. The baked-in version in my POC is just a hack to test the dependency installs...
  • The build tries to install the tensorflow dependency from the dynamic list, but it fails for some reason:
Installing backend dependencies: started
Running command /opt/anaconda3/envs/wmlce/bin/python /tmp/pip-standalone-pip-30bvicml/__env_pip__.zip/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-vxb6p8od/normal --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- tensorflow wheel
ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)
ERROR: No matching distribution found for tensorflow
Installing backend dependencies: finished with status 'error'

I failed to set up the local build for the project and this makes it hard to test anything. I also feel inadequate to write the correct dynamic dependency list code, because I'm mostly clueless about the project's build process.

@maxhgerlach
Copy link
Collaborator

Thanks for giving this another shot, @itsdani!

The ppc64le build in the CI (which runs without explicit approval) is a bit particular in how the environment is set up, see https://github.com/horovod/horovod/blob/master/Jenkinsfile.ppc64le

But I think this highlights that it's actually pretty important to use precisely the tensorflow (and/or pytorch, mxnet) package that's installed in the environment which Horovod is added to: Horovod needs to be compiled accordingly.

@stale
Copy link

stale bot commented Jan 10, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@joeyearsley
Copy link

joeyearsley commented Oct 4, 2023

I've just rerun into this issue; locally, I've managed to get it working by changing some of the environment variables.
E.g.
HOROVOD_WITH_TENSORFLOW=1 to HOROVOD_WITH_TENSORFLOW=2.8.0

Then, you can use this in the build backend to correctly build with the necessary libs. This is less nice than pulling from the lock files but enables PEP517, which enforces this isolation. I'm curious if there is a way to pull this from a lock file due to the many varieties; we'd be solving a problem excessively.

Another potential way would be changing the build_ext from on install to on the first run, i.e. check the extension is available; if not, build it.

This should only impact PyTorch and TF, as MxNet is at the end of its life.
Do either of you have a preference?

@itsdani
Copy link
Author

itsdani commented Oct 5, 2023

I've just rerun into this issue; locally, I've managed to get it working by changing some of the environment variables.
E.g.
HOROVOD_WITH_TENSORFLOW=1 to HOROVOD_WITH_TENSORFLOW=2.8.0

I think you just built it without tensorflow, HOROVOD_WITH_TENSORFLOW seems to be a flag that only does something if it's value is 1.

# Find TF
set(TF_REQUIRED "")
if ("$ENV{HOROVOD_WITH_TENSORFLOW}" STREQUAL "1")
set(TF_REQUIRED "REQUIRED")
endif ()
find_package(Tensorflow "1.15.0" ${TF_REQUIRED})

The solution to this issue would be declaring build dependencies explicitly instead of using them from the existing environment (a.k.a. isolation). The problem with that is that these dependencies might not be trivial to build, and requirements are currently governed by env-vars like HOROVOD_WITH_TENSORFLOW.
I'm still not sure if creating a custom build-backend (like the one in this draft) is the only viable option for this, maybe optional dependencies declared as extras could also work. On the other hand this would probably require changes in the current build, so this can govern dependencies instead of the env-vars.

@joeyearsley
Copy link

Locally, I've altered the CMake to ensure it matches a version string.

I don't think extras would work; in my opinion, they are only for Python optional extras intended to say if you want this functionality, install this lib in isolation. Not for defining build extras.

Since we already require people to say explicitly they want tensorflow or pytorch links, I don't think it's much more to ask them to define the versions instead of a binary flag.

I'll open my MR soon so you can see the changes.

@itsdani
Copy link
Author

itsdani commented Oct 5, 2023

Ah, I see, extras might be only run-time dependencies and we need build-time deps 😞

I'm still unsure about this "versions in the env-var" thing, I feel like this still contradicts having an isolated build. Although I have to admit I'm interested in your solution 🤞

@joeyearsley
Copy link

The PR: #3991

Having tried many ways, this was the only pragmatic approach I could make work.

It still fulfils an isolated build but tells the build venv which versions of the necessary DL libraries to install, ensuring we don't get linking issues when the finished build is moved from the isolated build venv into its end site-packages location (i.e. the none isolated poetry venv site-packages).

@itsdani
Copy link
Author

itsdani commented Oct 6, 2023

Ah very nice, I understand now, the custom build backend reports the required dependencies based on the env-var values, and then it can be used in the isolated environment. I think I missed this part in your original comment.

Maybe we should move the conversation back to the issue instead of this closed PR-draft, if you want feedback from the others 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Horovod fails to install via Poetry
3 participants