
[MRG] switch pip packages to conda-forge in conda buildpack #728

Merged
merged 4 commits into jupyter:master from scottyhq:nopip on Aug 5, 2019

Conversation

scottyhq (Contributor) commented Jul 3, 2019

This pull request modifies the conda buildpack to install jupyterhub and nteract_on_jupyter from conda-forge instead of pip.

According to @ocefpaf, "this can lead to dependencies installed with conda and cause problems. (Sometimes things just work but it is a 50-50 chance.)" And we are currently seeing this mixing in Pangeo environments built with repo2docker (pangeo-data/pangeo-cloud-federation#305).

Described in detail #714 (comment)

Because we aren't changing versions here I suspect there won't be issues, but let me know if I should add tests or make other modifications.
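In sketch form, the change moves the two packages out of the pip: subsection of the buildpack's environment.yml and into the conda dependency list (a simplified excerpt, not the actual diff; the real file pins exact versions):

```yaml
# Before (simplified): installed with pip inside the conda env
dependencies:
  - pip:
    - jupyterhub
    - nteract_on_jupyter

# After (simplified): resolved by conda from conda-forge
channels:
  - conda-forge
dependencies:
  - jupyterhub
  - nteract_on_jupyter
```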

@betatim betatim changed the title switch pip packages to conda-forge in conda buildpack [MRG] switch pip packages to conda-forge in conda buildpack Jul 3, 2019

betatim (Member) commented Jul 3, 2019

Thanks for this. Let's wait a bit to see if @yuvipanda or others remember why we used the pip packages, otherwise we can merge this.

I'd prefer not to merge this while travis is red. I've created #729 to investigate why this happens. All help and ideas welcome as this blocks all PRs. I don't have a good idea why we run into this.

@yuvipanda yuvipanda requested a review from minrk Jul 3, 2019

minrk (Member) commented Jul 9, 2019

The main reason to get jupyterhub from pip instead of conda is that the conda package (rightly) depends on nodejs, but we don't need that in our case (and indeed, we are getting nodejs from nodesource instead of conda). So that's just an image size question.

The other packages it was just a convenience since they weren't on conda yet, and I agree we should get packages from conda where available in the base env.

So I'm +1 on this change, and we can investigate the jupyterhub size optimization another way (especially since CHP is no longer the only proxy for jupyterhub).

minrk approved these changes Jul 9, 2019

yuvipanda (Collaborator) commented Jul 9, 2019

Does this mean conda will bring in nodejs as well now?

betatim (Member) commented Jul 9, 2019

I'm unsure. Right now I think if we switch jupyterhub to be a conda package then we will get nodejs from conda as well. There doesn't seem to be anything concrete that is broken by us doing this.

If both of those are correct I'd vote we stick with using the pip version of jupyterhub to keep the image smaller. Smaller images mean faster launches and right now it should all be about faster launches (I think).

minrk (Member) commented Jul 16, 2019

Does this mean conda will bring in nodejs as well now?

Yes, it does. If size optimization is primary, keeping jupyterhub from pip is the way to go, at least until jupyterhub splits jupyterhub-singleuser into a standalone package (requires at least 3 packages for shared infrastructure/auth utils, singleuser, and the hub itself).

scottyhq (Contributor, Author) commented Jul 24, 2019

Just added a bit to the docs to clarify where the definition of the base environment lives. I suspect there might be a lot of binder repos out there that specify a different jupyterhub version (for example https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml), resulting in even larger images due to nodejs brought in with conda on top of the system package.

I think it can be confusing for these pip packages to show up for someone who is expecting just the conda packages from their repo's environment.yml.

betatim (Member) commented Jul 25, 2019

I like the docs update that clarifies things. If you move jupyterhub back to being a pip package we can merge this.

scottyhq (Contributor, Author) commented Jul 29, 2019

@betatim - I'm actually still in favor of sticking exclusively with conda :) So I'm going to try to articulate the rationale in more detail:

  1. We're using repo2docker to generate our default environment images that run in jupyterhubs. It's common for users to update packages in the default environment or create new conda environments, and we've been running into some 'inconsistent environment' conda solver issues with the current setup (pangeo-data/pangeo-cloud-federation#254)

  2. I think the image size change will be small (I ran a couple of local tests installing a basic environment.yml with just dask, and the net difference in image size is less than 10 MB).

  3. Even with the docs improvement I still find it odd to encounter many pypi packages in the resulting 'conda' environment. For example: https://github.com/scottyhq/pangeo-binder-test/tree/jhub-conda

  4. Environments mixing conda and pip might not work. This is hard to test because the environment installs fine, but errors can surface later when importing packages or running code. This article describes some of the issues: https://www.anaconda.com/using-pip-in-a-conda-environment/ and the current conda docs still caution that "The differences between pip and conda packages cause certain unavoidable limits in compatibility but conda works hard to be as compatible with pip as possible."

  5. Users could still specify pip packages if they want in their repo’s environment.yml

If the decision is between slightly smaller images and faster build times on one hand, and better assurance of functioning environments on the other, I'd vote for the latter! If there is still interest, would the next step be to push new conda 'frozen' files to re-run the tests?
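To illustrate point 5, a repo's own environment.yml could still mix in pip packages explicitly (the package name here is hypothetical, standing in for anything not available on conda-forge):

```yaml
name: example
channels:
  - conda-forge
dependencies:
  - dask=2
  - pip:
    - some-pypi-only-package  # hypothetical pip-only dependency
```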

betatim (Member) commented Jul 29, 2019

If the size difference is only 10MB then that does seem like a trade-off worth making. Can you push one image built with master and one with this branch to docker hub for your test repo? That makes it easy to compare layers and sizes.

Is there a repo we can look at that doesn't build with master but does build with this PR? I always thought that there is a difference between conda install foo; pip install bar; conda install blah and listing foo, bar and blah as packages in an environment.yml. Namely that if a pip package is listed in the environment.yml conda can "see" it. Which I thought avoids most (all?) of the problems that the blog posts are hinting at. It would be interesting to have an example.
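In sketch form, with hypothetical package names (foo, bar, blah):

```yaml
# Sequential installs: conda solves twice without knowing about bar:
#   conda install foo
#   pip install bar
#   conda install blah   # may overwrite files that bar depends on
#
# Declarative environment.yml: conda installs foo and blah in one
# solve, then invokes pip for bar; `conda list` afterwards shows bar
# tagged with a "pypi" channel, so conda at least knows it is present.
name: example
dependencies:
  - foo
  - blah
  - pip:
    - bar
```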

For my education why does https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml list jupyterhub as a dependency (without specifying a version) even though jupyterhub is already installed by repo2docker?

If we start installing node via conda we should do some archaeology to find out why we started installing node as a system package in the first place. This would (maybe) let us stop installing it to avoid the problem that two (different) versions of node end up being installed which will invariably lead to weird problems, this time not about pip and conda but the two node versions.

scottyhq (Contributor, Author) commented Jul 29, 2019

If the size difference is only 10MB then that does seem like a trade-off worth making. Can you push one image built with master and one with this branch to docker hub for your test repo? That makes it easy to compare layers and sizes.

Binder repos that link to Docker Images:
https://github.com/scottyhq/pangeo-binder-test/tree/dask-only-master (1.76GB)
https://github.com/scottyhq/pangeo-binder-test/tree/dask-only-pr (1.83GB)

Is there a repo we can look at that doesn't build with master but does build with this PR? I always thought that there is a difference between conda install foo; pip install bar; conda install blah and listing foo, bar and blah as packages in an environment.yml. Namely that if a pip package is listed in the environment.yml conda can "see" it. Which I thought avoids most (all?) of the problems that the blog posts are hinting at. It would be interesting to have an example.

Good point, I don't know if there is a difference in these approaches. Maybe @ocefpaf has a minimal example contrasting behavior?

For my education why does https://github.com/pangeo-data/pangeo-stacks/blob/master/base-notebook/binder/environment.yml list jupyterhub as a dependency (without specifying a version) even though jupyterhub is already installed by repo2docker?

Probably should have an explicit 1.0. But this is left over from trying out the latest available packages by not pinning versions, and an attempt to get everything from the conda-forge channel.

If we start installing node via conda we should do some archaeology to find out why we started installing node as a system package in the first place. This would (maybe) let us stop installing it to avoid the problem that two (different) versions of node end up being installed which will invariably lead to weird problems, this time not about pip and conda but the two node versions.

Yes. Was node made a system package to keep the version consistent across pip, conda, and other builds? I'm only familiar with needing it for jupyterlab extensions (https://jupyterlab.readthedocs.io/en/stable/user/extensions.html#extensions).

In any case, with this PR, conda brings in these versions:

jovyan@jupyter-scottyhq-2dpangeo-2dbinder-2dtest-2dxitd0vxz:~$ node --version
v11.14.0
jovyan@jupyter-scottyhq-2dpangeo-2dbinder-2dtest-2dxitd0vxz:~$ npm --version
6.7.0

In addition to the ones in /usr/bin/:

jovyan@jupyter-scottyhq-2dpangeo-2dbinder-2dtest-2dzgb5nkin:~$ node --version
v10.16.0
jovyan@jupyter-scottyhq-2dpangeo-2dbinder-2dtest-2dzgb5nkin:~$ npm --version
6.9.0

A related issue: #307 (comment)

@scottyhq scottyhq force-pushed the scottyhq:nopip branch from 0d9b20f to 955e422 Jul 29, 2019

betatim (Member) commented Jul 30, 2019

@GladysNalvarte or @minrk do you recall why we started installing npm as a system package?

@scottyhq thanks for the bump and refreeze. Do you have a repo which is broken with master and then fixed by this PR, so we can confirm that this change actually fixes the problem? :)

I'd extend the currently failing npm test to check for the two npms that get installed, modulo removing the system npm depending on the answer from Min or Gladys.

minrk (Member) commented Jul 30, 2019

do you recall why we started installing npm as a system package?

I'm pretty sure nodejs from nodesource is more widely used and kept better up-to-date (I maintain nodejs on conda-forge, and don't always keep up).

That said, as nodejs is increasingly likely to be pulled in as a conda dependency (e.g. by jupyterlab things), I think it probably makes sense to do this here.

scottyhq (Contributor, Author) commented Jul 31, 2019

@scottyhq thanks for the bump and refreeze. Do you have a repo which is broken with master and then fixed with this PR? So we can confirm that this change actually fixes the problem :)

@betatim here is a simple example environment.yml that results in a surprising situation on master, but not with this PR:

name: test
dependencies:
  - requests=2.21

On master we end up with requests 2.22.0 pypi_0 pypi (which comes from pip install jupyter), but on this PR we end up with requests 2.21.0 py37_1000 conda-forge. I haven't really dug into why this occurs, but I'm guessing the logic is that we're doing an 'upgrade' of the existing environment and we already have a higher version?!

Here is another example you could run locally (conda 4.7.10) illustrating that solving for mixed conda and pypi packages doesn't always work:

name: broken
channels:
  - conda-forge
dependencies:
  - dask=2
  - pip:
    - sagemaker

Gives this error: ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

But this works:

name: fixed
channels:
  - conda-forge
dependencies:
  - dask=2
  - pyyaml=3
  - pip:
    - sagemaker

Finally, I haven't encountered errors of this type via repo2docker, but I've been bitten a number of times by environments that mix pypi and conda packages for geospatial work. So personally, when using conda, I only go to pip as a last resort. See for example: mapbox/rasterio#1244

scottyhq (Contributor, Author) commented Jul 31, 2019

Updated the test, but we might consider pinning the node version in repo2docker/buildpacks/conda/environment.yml to one of these conda-forge builds (and re-freezing):

nodejs                       11.10.0      hf484d3e_0  conda-forge         
nodejs                       11.11.0      hf484d3e_0  conda-forge         
nodejs                       11.14.0      he1b5a44_0  conda-forge         
nodejs                       11.14.0      he1b5a44_1  conda-forge         
nodejs                        12.1.0      he1b5a44_1  conda-forge         
nodejs                        12.3.0      he1b5a44_0  conda-forge

or removing the system package install (and associated tests) if that is the preferred approach?

betatim (Member) commented Aug 5, 2019

This looks good to me. @scottyhq do you want to add a test for this environment:

name: test
dependencies:
  - requests=2.21

I'd consider what currently happens on master as a bug, so having a test to prevent this from coming back would be good. If you don't have time to add the test could you make a ticket instead so it doesn't get forgotten?

Why pin the version of node in the "source" environment.yml? It isn't a package we directly install, so the pinning happens when we freeze the environment. For me it is OK if this means a test fails when we change node versions, as it is probably something we should briefly think about when it happens.

I will merge this now; we can address the pinning and the test in a new PR.

@betatim betatim merged commit d72c601 into jupyter:master Aug 5, 2019
5 checks passed:
ci/circleci: build_docs passed
ci/dockercloud passed
codecov/patch: 100% of diff hit (target 20%)
codecov/project: 90.89% (+0%) compared to bf29f66
continuous-integration/travis-ci/pr passed