
Discussion - Rule 10. Use the container daily, rebuild the image weekly #17

Open · psychemedia opened this issue Oct 24, 2019 · 5 comments

@psychemedia (Collaborator) commented Oct 24, 2019

So... this touches on comments on Rule 2 and Rule 4 - what is the stable thing that can be reused with confidence? A particular image instance, or the thing produced by building against a particular Dockerfile?

For testing, I often test against MyBinder, although that can be subject to dependencies introduced by the repo2docker process that are outside my control.

I also use Docker Hub to auto-build on commits to particular repos. I have one repo that builds and tags an image against each branch, so I can call a tagged image that represents the latest build for each branch. I think other build rules let you build against particular paths in a specific branch.
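For illustration, calling those branch builds looks something like this (the image and tag names here are hypothetical):

```
# Docker Hub automated builds tag an image per branch (hypothetical names)
docker pull psychemedia/demo-env:master   # latest build of the master branch
docker pull psychemedia/demo-env:dev      # latest build of the dev branch

# launch the environment from a specific branch's latest build
docker run --rm -it psychemedia/demo-env:dev
```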

you cannot expect to take a year-old Docker image from the shelf and extend it; it will likely "run", but just as-is

But if you want to access the computational environment you were using a year ago, isn't that old image exactly the image you want to be using? I have a few images running legacy applications that I would probably struggle to ever build again; if I lose the image, I've lost the application / environment.

As an archiving strategy, I should probably be running those images every so often and saving a new image from them, so that the external packaging (whatever data structures Docker uses to create/define the image wrapper around my environment) presumably gets updated in the image.
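Something like the following, sketched with hypothetical names - whether a docker commit / docker save round trip actually refreshes the relevant packaging metadata is exactly the bit I'm unsure about:

```
# re-run the old image and snapshot it as a fresh image (names hypothetical;
# assumes the image runs a long-lived process)
docker run -d --name legacy-app old-app-image:2018
docker commit legacy-app old-app-image:2018-refreshed
docker save -o old-app-image-2018-refreshed.tar old-app-image:2018-refreshed
docker stop legacy-app && docker rm legacy-app
```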

To maintain the image, I could probably also try updating some of the packages inside the image, or use it as the base layer of a new image that applies particular updates that don't seem to break the original application.
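For example, that second approach might look like this (tags and package choices are hypothetical, and it assumes a Debian/Ubuntu-based image):

```
# new image layering selected updates on top of the legacy one (hypothetical tag)
FROM old-app-image:2018

# upgrade only packages that don't appear to break the original application
RUN apt-get update && \
    apt-get install -y --only-upgrade openssl ca-certificates && \
    rm -rf /var/lib/apt/lists/*
```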

@psychemedia (Collaborator, Author) commented Oct 24, 2019

The scope suggests not referencing docker run, but isn't there an assumption here that folk will be in a position to test their Dockerfile by launching a container from an image built from that Dockerfile?
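That is, something along these lines (image name hypothetical):

```
# build an image from the Dockerfile in the current directory, then test it interactively
docker build -t rule10-test .
docker run --rm -it rule10-test
```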

nuest added a commit that referenced this issue Dec 10, 2019
- following up on #21
- trying to address comments in #17
@nuest (Owner) commented Dec 10, 2019

I think all things will eventually break; that's why I introduced this rule - so that researchers notice breakage sooner rather than later.

@psychemedia I think you make some interesting points regarding preservation, and I've tried to put that angle into the latest revision of Rule 10 here: 4721b9e

@betatim (Collaborator) commented Dec 10, 2019

I think it depends a lot on what you want to do with the "computational environment". If all you want to do is observe the artefact handed down from our ancestors (where observing includes running it), then the image is the thing to store and worry about.

If you want to use the thing handed down from our ancestors to build on, modify, and remix, then you are (very) likely to need to rebuild it. The sooner you attempt to rebuild it, the less painful it will be. A bit like a building that you want to use over a period of hundreds of years: you need to replace and upgrade things in it to keep it habitable. Doing that frequently and in small steps is likely much easier than doing nothing for 50 years and then dealing with all the accumulated problems in one go.

@vsoch (Collaborator) commented Dec 10, 2019

Well stated.

@psychemedia (Collaborator, Author) commented Dec 10, 2019

At the risk of going off-topic to try to beat the bounds and then bring the focus back again, I think we could distinguish:

  • keeping the inside of the container working;
  • keeping the outside of the container working.

Inside the container is the computational environment we're working with. If you want to keep things working with current packages, it makes sense to keep rebuilding the inside regularly because if something does break, it's more likely down to one thing and you're more likely to spot it.

Tests come into play here: they allow you to upgrade code packages whilst preserving functionality. If your code relies on a bug in a particular package, you need a test that checks that the effects of that bug are maintained if you update that particular package. (I'm not saying this is good practice! I was trying to think of a limiting case...)
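As a minimal sketch, assuming a test runner and the tests are baked into the image (names hypothetical), the rebuild-then-test loop is just:

```
# rebuild against current package versions, then run the test suite inside the container
docker build --pull -t course-env:latest .
docker run --rm course-env:latest pytest /opt/course/tests
```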

By outside the container I mean the container package and how eg the Docker machinery runs the container. If you build your container under docker version 1 and then try to run it under docker version 25, will it still work? Just as physical media (floppies, CD-ROMs) can deteriorate, so too can digital document formats rot (will a container built under docker version 1 open under docker version 25?). I think this is more an archiving question, which was deemed out of scope in #9. If we are talking about content submitted to a journal in 2019, what would someone in 2029 need to do to get it working? There are two senses in which they might want to get it working: 1) as something "habitable", as @betatim puts it, even if everything that was originally around in 2019 has since been replaced, making sure at each replacement that the original tests still pass; 2) as the code would have been experienced in 2019 ("the code that was actually run").

In the latter case, I think one approach that archivists take is to use emulation to provide eg a version 1 docker wrapper inside a version 25 machine (so eg things like http://eaas.uni-freiburg.de/). In which case, we could say it's someone else's problem.

In terms of best practice, I guess the question is: do you want to write something that pins everything and explicitly declares a very particular environment from a particular period in time (something that could recreate the internals of an archived image), or do you want to write something that is maintainable and updateable (perhaps automatically so, via weekly scheduled rebuilds)?
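For the latter, a scheduled rebuild can be as simple as a cron entry (paths and names hypothetical; CI services offer equivalent scheduled builds):

```
# rebuild (and push) the image every Monday at 03:00
0 3 * * 1  cd /home/user/project && docker build --pull --no-cache -t myuser/course-env:weekly . && docker push myuser/course-env:weekly
```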

At some point, there is a tension when you talk about pinning version numbers. What do you pin, and why?

In a course context, I've started to make a distinction between packages which can presumably be updated to keep the operating environment working (updates to ssh etc), and packages used for the computational / course environment (eg what version of pandas we're running) which we might not want to update because an update might break our teaching materials. There are also some packages that might be upgradeable in the computational environment up to a point (eg numpy), but we may have to pin those at some point if an update includes a breaking change (eg a numpy update that breaks things as far as our pinned pandas package is concerned). In this case, I probably pin pandas (for example), regularly rebuild until pandas breaks, then pin whatever it was whose most recent update broke pandas back to its last working version. (Updating the course materials to use a newer version of pandas is not an option!)
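A sketch of what that ends up looking like inside the Dockerfile, with hypothetical version numbers:

```
# - pandas pinned exactly: the teaching materials depend on its behaviour
# - numpy capped only once a release broke the pinned pandas, otherwise free to float
# - operating-environment packages (jupyter etc) left unpinned so they pick up updates
RUN pip install "pandas==0.25.3" "numpy>=1.16,<1.18" jupyter
```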
