Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try pytorch again #367

Merged
merged 15 commits into from
Aug 24, 2022
Merged

try pytorch again #367

merged 15 commits into from
Aug 24, 2022

Conversation

ngam
Copy link
Contributor

@ngam ngam commented Aug 16, 2022

fixes #358

I will also add a crucial test for jaxlib here, forgot to add it in #364

@github-actions
Copy link
Contributor

Binder 👈 Try on Mybinder.org!
Binder 👈 Try on Pangeo GCP Binder!
Binder 👈 Try on Pangeo AWS Binder!

@pangeo-bot
Copy link
Collaborator

/condalock
Automatically locking new conda environment, building, and testing images...

tests/test_ml-notebook.py Outdated Show resolved Hide resolved
@ngam ngam mentioned this pull request Aug 16, 2022
@ngam
Copy link
Contributor Author

ngam commented Aug 17, 2022

/condalock again

Copy link
Member

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this. LGTM, but I don't fully understand the details.

@ngam
Copy link
Contributor Author

ngam commented Aug 17, 2022

Thanks so much for working on this. LGTM, but I don't fully understand the details.

We actually have two potential silent errors and I am trying to make sure we don't face them here in the containers. Our conda-forge tensorflow build is usually solid, and is designed to take precedence, so we usually run into problems with jaxlib (especially when in the same env as tensorflow) along the lines:

  • a too strict, misplaced dependency on abseil that slips through, e.g. tracking here relax pin on jaxlib when tf--jax issues are resolved #368, should be fixed in a few hours (nothing is wrong per se, it just needs a specific abseil that tensorflow doesn't accept yet, so it ends up silently breaking things)
  • both tensorflow and jaxlib utilize protobuf, grpc, and abseil, but they reshare some definitions, and so if they both use the shared library, they end up conflicting, e.g. see this issue about protobuf jaxlib 0.3.0 conda-forge package is broken conda-forge/jaxlib-feedstock#89 but similar issues arise due to abseil, etc. --- this issue often only manifests after one imports both in the same session, but stays silent otherwise

For use, people will likely need a separate dataloader/datapipeline to use with jax anyway, so it makes sense to make jax and tensorflow live together in harmony --- we likely could and should add pytorch to the mix soon! I am not certain about bandwidth/storage limitation but having a super ML container (containing all platforms a la kaggle or colab) may be a good idea, albeit heavy to store and heavy to pull, etc.

@ngam
Copy link
Contributor Author

ngam commented Aug 18, 2022

/condalock

@ngam
Copy link
Contributor Author

ngam commented Aug 23, 2022

@scottyhq is this stuck now or can you resolve the conflicts without going file by file? Maybe best to open a new PR? (Edit: trying to see if the smart bots --- condalock --- can unlock it!)

At any rate, you can safely unpin pytorch like I did here and things should be good to go. I will close this PR this weekend if no further activity (I like to keep open only issues and PRs that are not moot, actually following @rabernat's blog post about using github issues/PRs as a calendar/todo list!)

@ngam
Copy link
Contributor Author

ngam commented Aug 23, 2022

/condalock

@scottyhq
Copy link
Member

/condalock

@scottyhq
Copy link
Member

Edit: trying to see if the smart bots --- condalock --- can unlock it!

Unfortunately the bots are not currently smart enough to resolve lockfiles when one active PR is merged before another. It's really a one PR at a time setup right now, which is fine most of the time :)

Thanks again @ngam for all your hard work on the ML library dependencies and packaging

@scottyhq scottyhq merged commit 5f5673d into pangeo-data:master Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unpin pytorch dependency
4 participants