Discussion - Rule 10. Use the container daily, rebuild the image weekly #17
So... this touches on comments to Rule 2 and Rule 4 - what is the stable thing that can be reused with confidence? A particular image instance, or the thing produced by building against a particular Dockerfile?
For testing, I often test against MyBinder, although that can be subject to dependencies introduced by the repo2docker process that are outside my control.
I also use Docker Hub to auto-build on commits to particular repos. I have one repo that builds and tags against each branch, so I can pull a tagged image that represents the latest build of each branch. I think other build rules let you build against particular paths in a specific branch.
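As a sketch of that workflow (the image name, tag, and mount path are all hypothetical, not from any real repo):

```shell
# Pull the image Docker Hub auto-built from the latest commit
# to the "dev" branch (image and tag names are made up here).
docker pull myuser/myrepo:dev

# Run it interactively, mounting the current directory so work
# survives the (disposable) container.
docker run -it --rm -v "$PWD":/home/jovyan/work myuser/myrepo:dev
```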
But if you want to access a computational environment that you were using a year ago, isn't that old image exactly the image you want to be using? I have a few images running legacy applications that I would probably struggle to ever build again; if I lose the image, I've lost the application / environment.
As an archiving strategy, I should probably be running those images every so often and saving a new image from them, so that the external packaging (whatever data structures docker uses to define the image wrapper around my environment) presumably gets updated in the saved image.
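A minimal sketch of that refresh step, assuming a hypothetical image name:

```shell
# Create a container from the old archived image
# (all names here are hypothetical).
docker create --name legacy-app myuser/legacy-app:2018

# Snapshot the container back into a fresh image, so the image
# packaging is rewritten by the current docker version.
docker commit legacy-app myuser/legacy-app:2018-refreshed
docker rm legacy-app
```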
To maintain the image, I could probably also try updating some of the packages inside it, or using it as a base layer in a new image that applies particular updates that don't seem to break the original application.
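That second option could be sketched as a Dockerfile (the image name and upgraded package are purely illustrative):

```dockerfile
# Use the legacy image as the base layer...
FROM myuser/legacy-app:2018

# ...and apply only the updates that testing suggests don't break
# the original application (package choice is illustrative).
RUN apt-get update \
    && apt-get install -y --only-upgrade openssh-client \
    && rm -rf /var/lib/apt/lists/*
```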
I think it depends a lot on what you want to do with the "computational environment". If all you want to do is observe (which includes run) the artefact handed down from our ancestors, then the image is the thing to store and worry about.
If you want to use the thing handed down from our ancestors to build on, modify, or remix, then you are (very) likely to need to rebuild it. The sooner you attempt the rebuild, the less painful it will be. A bit like a building that you want to use over a period of hundreds of years: you need to replace and upgrade stuff in it to keep it habitable. Doing that frequently and in small steps is likely much easier than doing nothing for 50 years and then dealing with all the accumulated problems in one go.
At the risk of going off-topic to try to beat the bounds and then bring the focus back again, I think we could distinguish:
Inside the container is the computational environment we're working with. If you want to keep things working with current packages, it makes sense to keep rebuilding the inside regularly because if something does break, it's more likely down to one thing and you're more likely to spot it.
Tests come into play here: they allow you to upgrade code packages whilst preserving functionality. If your code relies on a bug in a particular package, you need a test that checks the effects of that bug are maintained if you update that particular package. (I'm not saying this is good practice! I was trying to think of a limiting case...)
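As a toy illustration of such a behaviour-pinning test (the dependence on rounding is invented for the example): suppose the analysis silently relies on Python's round-half-to-even behaviour; a regression test makes that dependency explicit, so an upgrade that changed it would fail loudly rather than silently altering results:

```python
# Hypothetical regression test: the analysis code (not shown) is assumed
# to depend on Python's banker's rounding, so we pin that behaviour.
def test_half_even_rounding_is_preserved():
    # round() rounds halves to the nearest even integer; if an upgraded
    # component changed this, the test would catch it before the
    # downstream materials broke.
    assert round(0.5) == 0
    assert round(1.5) == 2
    assert round(2.5) == 2

test_half_even_rounding_is_preserved()
```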
By outside the container I mean the container packaging and how eg the Docker machinery runs the container. If you build your container under docker version 1 and then try to run it under docker version 25, will it still work? Just as physical media (floppies, CD-ROMs) can deteriorate, so too can digital document formats rot (will a version 1 container open under docker version 25?). I think this is more an archiving question, which was deemed out of scope in #9. If we are talking about content submitted to a journal in 2019, what would someone in 2029 need to do to get it working? There are two senses in which they might want to get it working: 1) as something "habitable", as @betatim puts it, even if everything that was originally around in 2019 has since been replaced, making sure at each replacement that the original tests still pass; 2) as the code would have been experienced in 2019 ("the code that was actually run").
In the latter case, I think one approach archivists take is to use emulation to provide eg a version 1 docker wrapper inside a version 25 machine (so eg things like http://eaas.uni-freiburg.de/ ). In which case, we could say it's someone else's problem.
In terms of best practice, I guess the question is: do you want to write something that pins everything and explicitly declares a very particular environment from a particular period in time (something that could recreate the internals of an archived image); or do you want to write something that is maintainable and updateable (perhaps automatically so via weekly scheduled rebuilds).
At some point, there is a tension when you talk about pinning version numbers. What do you pin, and why?
In a course context, I've started to make a distinction between packages which can presumably be updated to keep the operating environment working (updates to ssh etc), and packages used for the computational / course environment (eg what version of pandas we're running), which we might not want to update because it might break our teaching materials. There are also some packages that might be upgradeable in the computational environment up to a point (eg numpy), but we may have to pin those at some stage if an update includes a breaking change (eg a numpy update that breaks things as far as our pinned pandas package is concerned). In that case, I probably pin pandas (for example), regularly rebuild until pandas breaks, then pin back whichever package's most recent update broke pandas. (Updating the course materials to use a newer version of pandas is not an option!)
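That split might look something like this in a requirements file (the version numbers here are purely illustrative, not a recommendation):

```
# Course-critical: pinned, because the teaching materials depend on it.
pandas==0.24.2

# Upgradeable up to a point: capped after a hypothetical release broke pandas.
numpy>=1.15,<1.17

# Operating-environment packages: left unpinned so weekly rebuilds pick up fixes.
jupyter
```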