
Add a guide to pinning dependencies #161

Open

choldgraf opened this issue Mar 25, 2019 · 9 comments

Comments

choldgraf commented Mar 25, 2019

In a recent debugging session, @minrk pointed out that if you only partially pin your repository (e.g. pin numpy versions but don't pin Python), you are likely to break future reproducibility, because an unpinned package or runtime may eventually drop support for the version you pinned.

We have a short guide to reproducibility here, and this would make a nice addition!

Edit from Tim: Want to help out with this? Suggested steps on what needs doing are here.

betatim commented Mar 25, 2019

One issue to find and link from here is the one about adding a repo2docker freeze command that produces a "well-pinned" environment.yml for you.

minrk commented Mar 27, 2019

An illustrative example that is coming up for folks right now:

# requirements.txt
notebook==5.7.4

Notebook 5.7.4 requires tornado>=4.3, but tornado 6 has since been released with changes that break notebook 5.7.4. Notebook 5.7.5 was released with the fix for tornado 6 compatibility. By pinning notebook but not tornado, you are guaranteeing future breakage: your env allows a package's dependencies to be upgraded while preventing the package itself from receiving the updates it needs to stay compatible with them.

Two general approaches:

  1. freeze everything, e.g. with conda env export, or pip freeze into requirements.txt ("reproducible environment"); see the sketch after this list,
  2. use the latest of everything and trust package maintainers to keep up ("living environment"). This may require you to update your repo once in a while, but it also keeps your repo relevant.
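
A minimal sketch of the "freeze everything" approach (the commands assume they are run inside the working environment; the file names are the ones repo2docker already recognizes):

$ conda env export > environment.yml           # records every package, including python itself
$ python -m pip freeze > requirements.txt      # records exact versions of all installed pip packages

Note that pip freeze does not record the Python interpreter version itself; with repo2docker you would pin that separately, e.g. with a runtime.txt containing python-3.7.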

Specific things that should generally be avoided:

  1. exact pinning of just a few direct dependencies
  2. exact pinning of anything that has dependencies without also pinning its dependencies
  3. exact pinning of anything without also pinning the runtime (e.g. Python); this matters especially for compiled packages like numpy, etc. (a concrete sketch follows this list)
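
To make the first two pitfalls concrete, here is how the notebook example above could be pinned safely (tornado 5.1.1 stands in for whatever version was actually installed; a real freeze would list every package in the env):

# requirements.txt
notebook==5.7.4
tornado==5.1.1    # pinning the dependency too prevents the breaking tornado 6 upgrade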

choldgraf commented:

so....it sounds like we should recommend an "all or none" approach, no?

betatim commented Mar 29, 2019

+1 on Chris' suggestion and I like Min's example.

Action to take for someone who wants to help out on this issue: take Min's comment and include it in the guide to reproducibility, the source of which is https://github.com/jupyterhub/binder/blob/master/doc/tutorials/reproducibility.rst

mdeff commented Mar 29, 2019

I like the "all or none" recommendation from @minrk.

  • "pin all" is great for code that supports research papers: you don't want to maintain this code, and you don't want results to change because a dependency updated something. That's best for reproducibility.
  • "pin none" is great for packages / libraries / frameworks. Those are living, and should be kept up to date with their dependencies. In an ideal world, they'd be continuously tested with the oldest and newest supported versions of those, and that would be acknowledged (e.g., requires numpy >= 1.14 and <= 1.16).
  • Teaching code is somewhere in the middle. The author may be willing to maintain it and keep it relevant (hence recommend "pin none"), or it may have been a one-shot (hence recommend "pin all").
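
A sketch of how such a tested range could be declared for a "pin none" library (the bounds are hypothetical, mirroring the numpy example above; "<= 1.16" becomes "<1.17" in pip syntax):

# requirements.txt excerpt for a "living" library:
# declare the range you actually test against, nothing tighter
numpy>=1.14,<1.17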

I think what's missing is "best practices" on how to achieve "pin none" and "pin all", and when to choose which. (I faced the future-reproducibility issue myself by forgetting to pin the python version.)

Things to keep in mind:

  • conda env export is not cross-platform, i.e., you cannot create the env on Linux, export it, and recreate it on Windows or macOS. You have to export manually on all three platforms and pin only the lowest common denominator, then hope that packages have pinned their own dependencies tightly enough that nothing breaks too soon... (see the sketch after this list for a partial mitigation). I hit this issue when teaching a university course where students could be on any platform and I wanted Binder as a backup.
  • I've seen Binder override pinned versions of jupyter. Does it still do that? What should we do if the version required by Binder is different from the one required by the GitHub repo? We could maybe recommend not pinning jupyter (or even not listing it as a dependency; it's just an editor after all), but what about its dependencies then?
  • When using pip, what about non-Python implicit dependencies?
  • I wonder if repo2docker could be used to create a future-proof Dockerfile. That would pin the dependency chain down to the kernel API, which is absolutely stable. The downside is that binder would need to guarantee to support those Dockerfiles long-term. (A deprecation warning could be automatically raised when an old Dockerfile version is built, even automatically creating an issue in the github repo.)
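
A partial mitigation for the cross-platform export problem is conda's --no-builds flag, which omits the platform-specific build strings while still recording exact versions (a sketch; the result is still not guaranteed to resolve on every platform):

$ conda env export --no-builds > environment.yml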

Dependency management and reproducibility are really hard. Surely people have thought about these issues before. But where?

minrk commented Mar 29, 2019

I think this is the "repo2docker freeze" command that's been discussed a few times. Essentially, it would run repo2docker to install everything and then run conda env export and/or pip freeze to generate the "frozen" version of the env within the repo2docker environment. There are several versions of freeze, depending on the kind of environment.

A first version of this is to use conda env export as the command passed to repo2docker, i.e.

$ jupyter-repo2docker https://github.com/binder-examples/conda -- conda env export -n root
Picked Git content provider.
Cloning into '/var/folders/9p/clj0fc754y35m01btd46043c0000gn/T/repo2dockerdc0h_3ne'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.
Reusing existing image (r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda4373085), not building.
name: root
channels:
  - conda-forge
  - defaults
dependencies:
  - attrs=19.1.0=py_0
  - backcall=0.1.0=py_0
  - bleach=3.1.0=py_0
  - bokeh=1.0.4=py37_1000
  - bzip2=1.0.6=h14c3975_1002
  - ca-certificates=2019.3.9=hecc5488_0
  - certifi=2019.3.9=py37_0
  - click=7.0=py_0
  - cloudpickle=0.8.1=py_0
  - conda=4.5.11=py37_1000
  - cryptography=2.6.1=py37h72c5cf5_0
  - cycler=0.10.0=py_1
  - cytoolz=0.9.0.1=py37h14c3975_1001
  - dask=1.1.4=py_0
  - dask-core=1.1.4=py_0
  - decorator=4.4.0=py_0
  - defusedxml=0.5.0=py_1
  - dill=0.2.9=py37_0
  - distributed=1.26.0=py37_1
  - entrypoints=0.3=py37_1000
  - expat=2.2.5=hf484d3e_1002
  - fontconfig=2.13.1=he4413a7_1000
  - freetype=2.10.0=he983fc9_0
  - gettext=0.19.8.1=hc5be6a0_1002
  - glib=2.56.2=had28632_1001
  - heapdict=1.0.0=py37_1000
  - icu=58.2=hf484d3e_1000
  - ipykernel=5.1.0=py37h24bf2e0_1002
  - ipython=7.4.0=py37h24bf2e0_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.4.2=py_0
  - jedi=0.13.3=py37_0
  - jinja2=2.10=py_1
  - jpeg=9c=h14c3975_1001
  - jsonschema=3.0.1=py37_0
  - jupyter_client=5.2.4=py_3
  - jupyter_core=4.4.0=py_0
  - jupyterlab=0.35.4=py37_0
  - jupyterlab_server=0.2.0=py_0
  - kiwisolver=1.0.1=py37h6bb024c_1002
  - libblas=3.8.0=4_openblas
  - libcblas=3.8.0=4_openblas
  - libffi=3.2.1=he1b5a44_1006
  - libgfortran=3.0.0=1
  - libiconv=1.15=h516909a_1005
  - liblapack=3.8.0=4_openblas
  - libpng=1.6.36=h84994c4_1000
  - libsodium=1.0.16=h14c3975_1001
  - libtiff=4.0.10=h648cc4a_1001
  - libuuid=2.32.1=h14c3975_1000
  - libxcb=1.13=h14c3975_1002
  - libxml2=2.9.8=h143f9aa_1005
  - locket=0.2.0=py_2
  - markupsafe=1.1.1=py37h14c3975_0
  - matplotlib=3.0.3=py37_0
  - matplotlib-base=3.0.3=py37h167e16e_0
  - mistune=0.8.4=py37h14c3975_1000
  - msgpack-python=0.6.1=py37h6bb024c_0
  - nbconvert=5.4.1=py_2
  - nbformat=4.4.0=py_1
  - ncurses=6.1=hf484d3e_1002
  - notebook=5.7.6=py37_0
  - numpy=1.16.2=py37h8b7e671_1
  - olefile=0.46=py_0
  - openblas=0.3.5=ha44fe06_0
  - openssl=1.1.1b=h14c3975_1
  - packaging=19.0=py_0
  - pandas=0.24.2=py37hf484d3e_0
  - pandoc=2.7.1=0
  - pandocfilters=1.4.2=py_1
  - parso=0.3.4=py_0
  - partd=0.3.9=py_0
  - pexpect=4.6.0=py37_1000
  - pickleshare=0.7.5=py37_1000
  - pillow=5.4.1=py37h00a061d_1000
  - pip=19.0.3=py37_0
  - prometheus_client=0.6.0=py_0
  - prompt_toolkit=2.0.9=py_0
  - psutil=5.6.1=py37h14c3975_0
  - pthread-stubs=0.4=h14c3975_1001
  - ptyprocess=0.6.0=py37_1000
  - pygments=2.3.1=py_0
  - pyparsing=2.3.1=py_0
  - pyqt=5.6.0=py37h13b7fb3_1008
  - pyrsistent=0.14.11=py37h14c3975_0
  - python=3.7.2=h381d211_0
  - python-dateutil=2.8.0=py_0
  - pytz=2018.9=py_0
  - pyyaml=5.1=py37h14c3975_0
  - pyzmq=18.0.1=py37h0e1adb2_0
  - readline=7.0=hf8c457e_1001
  - send2trash=1.5.0=py_0
  - setuptools=40.8.0=py37_0
  - sip=4.18.1=py37hf484d3e_1000
  - six=1.12.0=py37_1000
  - sortedcontainers=2.1.0=py_0
  - sqlite=3.26.0=h67949de_1001
  - tblib=1.3.2=py_1
  - terminado=0.8.1=py37_1001
  - testpath=0.4.2=py_1001
  - tk=8.6.9=h84994c4_1000
  - toolz=0.9.0=py_1
  - tornado=6.0.1=py37h14c3975_0
  - traitlets=4.3.2=py37_1000
  - wcwidth=0.1.7=py_1
  - webencodings=0.5.1=py_1
  - wheel=0.33.1=py37_0
  - widgetsnbextension=3.4.2=py37_1000
  - xorg-libxau=1.0.9=h14c3975_0
  - xorg-libxdmcp=1.1.3=h516909a_0
  - xz=5.2.4=h14c3975_1001
  - zeromq=4.2.5=hf484d3e_1006
  - zict=0.1.4=py_0
  - zlib=1.2.11=h14c3975_1004
  - asn1crypto=0.24.0=py37_0
  - cffi=1.11.5=py37he75722e_1
  - chardet=3.0.4=py37_1
  - conda-env=2.6.0=1
  - dbus=1.13.2=h714fa37_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - idna=2.7=py37_0
  - libedit=3.1.20170329=h6b74fdf_2
  - libgcc-ng=8.2.0=hdf63c60_1
  - libstdcxx-ng=8.2.0=hdf63c60_1
  - pcre=8.43=he6710b0_0
  - pycosat=0.6.3=py37h14c3975_0
  - pycparser=2.18=py37_1
  - pyopenssl=18.0.0=py37_0
  - pysocks=1.6.8=py37_0
  - qt=5.6.3=h8bf5577_3
  - requests=2.19.1=py37_0
  - ruamel_yaml=0.15.46=py37h14c3975_0
  - urllib3=1.23=py37_0
  - yaml=0.1.7=had09818_2
  - pip:
    - alembic==1.0.8
    - async-generator==1.10
    - jupyterhub==0.9.4
    - mako==1.0.8
    - msgpack==0.6.1
    - nteract-on-jupyter==2.0.0
    - pamela==1.0.0
    - python-editor==1.0.4
    - python-oauth2==1.1.0
    - sqlalchemy==1.3.1
prefix: /srv/conda

We'll then want to figure out what to do about "lockfiles", since this freeze pattern generally means there are two files: one that specifies the loose requirements and one that records an actual working installation (Pipfile.lock, etc.). To use this right now, you would have to clobber the environment.yml, or use a top-level environment.yml for the loose spec and binder/environment.yml for the frozen one, or something similar.
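
A sketch of that two-file layout (repo2docker prefers files under binder/ to top-level ones, so the frozen file would win at build time; treat the split as a suggestion rather than an established convention):

environment.yml           # loose, hand-edited spec of direct dependencies
binder/environment.yml    # frozen `conda env export` of a known-good build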

betatim commented Mar 30, 2019

I just learnt about pip install --constraint constraints.txt combined with pip freeze > constraints.txt via https://twitter.com/ChristianHeimes/status/1111228403250876417
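
A minimal sketch of that workflow (requirements.txt lists the loose direct dependencies; constraints.txt records the exact versions of a known-good install):

$ pip install -r requirements.txt                        # initial, unconstrained install
$ pip freeze > constraints.txt                           # record the working set
$ pip install -r requirements.txt -c constraints.txt     # later installs reproduce it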

pganssle commented:

One thing to note here is that because binder apparently uses a specific conda env with a bunch of packages already pinned as the base Python environment, pip freeze > requirements.txt is not a very reliable way to create a reproducible environment.

This is because pip and conda are not perfectly compatible: apparently, if pip finds a conflicting requirement that was installed by conda, it will fail (presumably because conda does not record the same install-time metadata, so pip doesn't know what to remove). Here's a minimal repo that reproduces the issue; you can see that the Binder for this repo fails to build.

I think for now the documentation should be updated to mention that the pip freeze mechanism is unreliable, and you are better off using a conda env. In the long run, maybe binder can be updated to use a virtualenv if the repo has requirements.txt (and maybe runtime.txt) but not environment.yaml.
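
A minimal environment.yml along the lines suggested (versions are illustrative, from the era of this thread; numpy 1.16.2 matches the export above):

name: example
channels:
  - conda-forge
dependencies:
  - python=3.7       # pin the runtime too, per the earlier comments
  - notebook=5.7.8
  - numpy=1.16.2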

wragge commented Jun 20, 2019

Thanks for the useful discussion! I haven't been including pinned versions of all dependencies, so I need to rethink what I'm doing.

@mdeff asks above:

I've seen Binder override pinned versions of jupyter. Does it still do that? What should we do if the version required by Binder is different from the one required by the GitHub repo? We could maybe recommend not pinning jupyter (or even not listing it as a dependency; it's just an editor after all), but what about its dependencies then?

I'm wondering the same thing. When I include a pinned version of jupyter I've found that the images build ok, but don't launch, so I've been leaving it out.

More generally, are there packages that shouldn't be pinned? For example, I just tried generating a new requirements.txt via pip freeze for a repo and found that the Binder build dies, complaining that pip can't uninstall certifi because it was installed by distutils.
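
One pragmatic sketch for that situation, not discussed further in this thread: filter the problem packages out of the freeze so they stay unpinned instead of fighting the base environment (certifi here stands in for whichever packages the build complains about):

$ pip freeze | grep -v '^certifi==' > requirements.txt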
