
Create a capsule from a DESCRIPTION file #28

Open
gadenbuie opened this issue Aug 3, 2022 · 7 comments
@gadenbuie
Contributor

I'd love to see an alternative capsule::create() workflow that uses the DESCRIPTION file to set up the capsule rather than an R script with library() calls.

This is typically handled via renv::install() and in my experience it works really well as a lightweight and user-friendly entry point for dependency declaration. Basically, DESCRIPTION is to renv.lock as package.json is to package-lock.json.

@gadenbuie gadenbuie changed the title Create a capsule from a description Create a capsule from a DESCRIPTION file Aug 3, 2022
@MilesMcBain
Owner

MilesMcBain commented Aug 4, 2022

I'm resistant to using DESCRIPTION files for R projects. I also get chills every time I look at the number of json/yaml files I need to get a JS project off the ground!

For projects, DESCRIPTION creates a second place to declare dependencies that can conflict with the renv.lock if they're not dutifully maintained. Package dependencies aside, I also don't know that they add a lot of new relevant information to a version-controlled project. A user with access to a project can likely determine the git remote and project authors etc. easily enough.

I like the idea of packages.R, or whatever you choose to call it, because it's executable config. You don't need a special load_all(); it's just source('packages.R'). You also already have tools to extract dependencies, e.g. renv::dependencies(), capsule::detect_dependencies() (which I just exported). It's also nice to have the dependencies shown alongside the conflicted::conflict_prefer() calls that together describe the dependency environment more rigorously than DESCRIPTION.

If you're looking for something like renv::install(), there's capsule::dev_mirror_lockfile() - which is similar.

If you're looking for a way to specify loose version constraints (e.g. > x_version), you might enjoy using::pkg(), which again can ensure constraints are met, and do installations etc, using the standard source('packages.R') approach. It's also compatible with dependency detection in capsule.
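A minimal packages.R along these lines might look like the sketch below. using::pkg() is the function named above, but the min_version argument shown is an assumption about its interface rather than confirmed API, so check the using documentation before relying on it:

```r
## packages.R: executable dependency declaration (sketch)
## NOTE: the min_version argument of using::pkg() is assumed here
using::pkg("dplyr")
using::pkg("rlang", min_version = "1.0.0")  # loose constraint, >= 1.0.0

# the project-level conflict policy can live right next to the dependencies
conflicted::conflict_prefer("filter", "dplyr")
```

source("packages.R") then both attaches the packages and enforces the constraints, and tools like capsule::detect_dependencies() can still pick the dependencies up from it.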

If there's a compelling case for DESCRIPTION, I'm happy to hear it though. I feel like I'd be more likely to accept automatically generated DESCRIPTION files, since they can't get out of sync. But I'd have to understand what the payoff is.

@MilesMcBain
Owner

The best argument for DESCRIPTION I could think of for projects was: It's a place to specify non-R dependencies.

But as you point out with the JS example, those likely have their own formats that plug into other dependency management machinery. So I don't know if it's that useful for this.

@gadenbuie
Contributor Author

I get the resistance to DESCRIPTION and I think it often gets conflated with the "everything as a package" style of project management, which I'm also resistant to.

I think the benefits of using DESCRIPTION as the user-facing place to declare dependencies become more apparent when the size of the project and team working on the project are much larger. There are two main features of this size of project that are important:

  1. Packages are not used by all scripts or documents in the project. You wouldn't run load_all() and you wouldn't want to source("packages.R") to create an environment.

  2. The renv target environment (in particular the operating system) isn't the same as the user's environment. In other words, local development happens on macOS or Windows (or wherever else), but the "production" environment is an Ubuntu GitHub Actions runner (or a rocker/r-ver Docker container).

In light of the above, the renv.lock file is better managed automatically in the target environment — for example the renv.lock is written by GHA and points to RSPM as necessary. In our experience, renv works best when the lock file is built to match the target environment. Editing the renv.lock directly or using snapshot() locally often leads to large diffs and adds a bunch of friction in other places.

The DESCRIPTION file works nicely here in the exact same way you're using packages.R, with a few additional features. First of all, we can use Imports and Suggests to cleanly differentiate between "prod" and "local dev" dependencies, and we can use Remotes to use non-standard package repositories. Because DESCRIPTION is a standard format, it works out of the box with existing tooling, like install.packages(), remotes, pak, etc. It also supports loose versioning out of the box, e.g. rlang (>= 1.0.0), in a nice concise format. OTOH renv::install() is the only function I'm aware of that will include all of the version constraints in the DESCRIPTION as part of the dependency resolution. That said, pak::local_install_deps() and similar functions can still use the DESCRIPTION to get the right packages installed, and the capsule would exist when you need to replicate the production environment.
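For illustration, a project-level DESCRIPTION using those features might look like this (the package and repository names are placeholders):

```
Package: myproject
Version: 0.1.0
Imports:
    dplyr,
    rlang (>= 1.0.0)
Suggests:
    usethis,
    teamReportTemplate
Remotes:
    our-org/teamReportTemplate
```

Imports holds the prod dependencies, Suggests holds the local dev ones, and Remotes points install tooling at a non-CRAN source.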

Speaking of renv::install(), its usage is a little different from what you described. When used with a DESCRIPTION file, renv::install() installs the packages listed in the DESCRIPTION, so the workflow becomes:

  1. Define hard package constraints in DESCRIPTION
  2. Use renv::install() to install those packages, plus their dependencies
  3. Use renv::snapshot() to snapshot the complete environment
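In code, the workflow above is just two calls after editing DESCRIPTION by hand. renv::install() and renv::snapshot() are real renv functions; treat this as a sketch of the happy path:

```r
# 1. Declare hard constraints in DESCRIPTION (edited by hand), then:
renv::install()   # installs the DESCRIPTION packages plus their dependencies

# 2. Record the resulting library in renv.lock:
renv::snapshot()
```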

At this point, it seems like your usage of packages.R and using::pkg() replicates renv's built-in DESCRIPTION workflow (and both circumvent problems that arise with renv's inferred dependency resolution). When you want to be able to source("packages.R") to load all the packages, your approach is great! When you don't want to, or want to rely on existing R tooling in more places, or want to communicate but not snapshot local dev requirements, the DESCRIPTION file is a good fit. As you point out, it's also a great place to declare system dependencies.

I'm obviously biased, but I'm convinced that you're on to something with capsule and that supporting this workflow would make capsule more broadly accessible to more people and in more scenarios.

I also think you're overthinking (or at least over-extrapolating from) my package.json example. There are two nice things about package.json: you can differentiate between production and dev dependencies, and you don't need to record every single library in your dependency graph. Ultimately, package-lock.json is a record of how those dependencies were resolved on a particular system at a particular moment in time. In that way, the DESCRIPTION workflow is quite similar (not to say anything else about JS dependency management).
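For reference, the package.json split being described looks roughly like this (names and versions are placeholders):

```json
{
  "dependencies":    { "express": "^4.18.0" },
  "devDependencies": { "eslint": "^8.0.0" }
}
```

npm resolves these loose ranges at install time and records the exact resolution in package-lock.json, much as renv::snapshot() records resolved versions in renv.lock.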

@gadenbuie
Contributor Author

To reply directly to some of your points...

For projects, DESCRIPTION creates a second place to declare dependencies that can conflict with the renv.lock if they're not dutifully maintained.

The renv.lock flows from DESCRIPTION and the goal is that machines write renv.lock and humans modify DESCRIPTION.

Packages dependencies aside, I also don't know that they add a lot of new relevant information to a version-controlled project.

They consolidate and surface information, and helpfully in a human/machine readable format. The information contained by DESCRIPTION is much closer to human intention (I only use SHAs when I absolutely mean it).

A user with access to a project can likely easily determine the git remote and project authors etc.

Yeah but with a DESCRIPTION file they don't even have to since the DESCRIPTION file is a critical component used by tons of R tooling, e.g. usethis.

You also already have tools to extract dependencies, e.g. renv::dependencies(), capsule::detect_dependencies() (which I just exported).

Yeah, I don't like the inferred dependency approach other than as a check that you haven't missed declaring a direct dependency.

It's also nice to have the dependencies shown alongside the conflicted::conflict_prefer() calls that together describe the dependency environment more rigorously than DESCRIPTION.

This is a good point but in the projects I'm thinking about, you wouldn't want to use conflict_prefer() at the project level, it'd be something decided more locally in an Rmd or an R script.

If you're looking for a way to specify loose version constraints (e.g. > x_version), you might enjoy using::pkg(), which again can ensure constraints are met, and do installations etc, using the standard source('packages.R') approach. It's also compatible with dependency detection in capsule.

This is what renv::install() does, so it's really about the format of the file containing dependency declarations. using looks great and I would definitely use it (pun intended) but not everywhere and not in the places that caused me to open this issue.

@MilesMcBain
Owner

MilesMcBain commented Aug 4, 2022

Okay! You've convinced me this is worth exploring further. In particular, the idea that it's an overarching thing that might sit above many versions of packages.R that support different parts of the project.

This one surprised me though:

First of all, we can use Imports and Suggests to cleanly differentiate between "prod" and "local dev" dependencies

Just because one of the reasons I created capsule was I wanted something that didn't have to care about people's dev dependencies. I have my own universe of stuff, as I'm sure you do too, that other people shouldn't have to see in project config files.

I suspect we might be talking about some kind of optional part of the project - e.g. like machinery for generating a documentation website or something? You don't need it to get the outputs, but it might help you context switch back to the project? Do you have an example of a dev dependency for a project?

I'm still not clear on the DESCRIPTION workflow. Is it:

  1. Human A starts the project and adds dependencies to DESCRIPTION during the dev process.
  2. Human A calls capsule::create("DESCRIPTION"), which generates the capsule library and lockfile.
  • They do test or actual prod runs in here etc. with capsule::run() or the desired run variant.
  3. When the project is committed and pushed to the remote, the CI bot does something to get its own capsule - don't really understand this bit. It creates a new lockfile in the process?
  4. Human B clones the project. Do they call:
  • reproduce_lib() to generate the capsule library from the lockfile?
  • reproduce_lib("DESCRIPTION") to 'reproduce' the capsule library from DESCRIPTION? - although it will not be the same as the lockfile?
  • renv::install() to install the dependencies in the DESCRIPTION in their local library?
  5. At some point Human B is ready to create an updated capsule. They call capsule::create("DESCRIPTION") to create a new library and lockfile.
  • Human B does test or actual prod runs in the capsule.
  6. The project is committed and pushed to the remote, and the CI bot again does something I don't understand to get its own lockfile.

Help understanding this would be appreciated.

@gadenbuie
Contributor Author

gadenbuie commented Aug 5, 2022

Okay! You've convinced me this is worth exploring further.

Awesome 😃

Just because one of the reasons I created capsule was I wanted something that didn't have to care about people's dev dependencies. I have my own universe of stuff, as I'm sure you do too, that other people shouldn't have to see in project config files.

Totally! But there is still a middle ground:

  1. Some dependencies are absolutely required for the final product to work.
  2. Some dependencies aren't needed by the final product but are used by the team to get work done inside the project.
  3. Some dependencies are just things I use when I'm working.

It's really helpful to be able to include those team dependencies (point 2) in some way that helps people install them, and I've found Suggests is a good place for them. These are the packages that would appear in on-boarding docs; instead of saying "you'll want to install usethis, teamPackage1 and teamPackage2", they can be suggested dependencies.

A more concrete example would be a project containing reports and a package that provides the report template. If you're working on the project you'll want that report template package because it makes life a whole lot easier, but you don't want the final product to depend on that reporting package. Suggests gives a place to say "if you're working in here, you'll want this package" and also a method for installing all the things, without actually requiring it in the prod environment.

I'm still not clear on the DESCRIPTION workflow. Is it:

Yeah that's basically it! The whole CI bot thing adds an additional player but it's still a fairly similar workflow to the current capsule setup. The point of the CI bot is that as a team we've agreed that the target environment is the one where the bot lives and so the bot's lockfile has precedence over the lockfiles created by humans A or B. I imagine you end up in a similar situation if you didn't have the bot involved but A and B were on different operating systems, etc.

If no bot is involved and if humans A and B are using close-enough machines, you can just take out the steps involving the bot from your workflow.

But if you know you have a specific target environment and somebody (or a bot) who can generate a lockfile in that env, you'd probably just resolve lockfile conflicts by accepting whatever version is generated in that target env.

It would look something like this:

  1. Human A starts project and adds dependencies to DESCRIPTION during dev process.

  2. Human A calls capsule::create("DESCRIPTION") which generates the capsule library and lockfile. Or they don't and just wait for the CI bot to create the lockfile for them.

  3. When project is committed and pushed to remote CI Bot does capsule::create("DESCRIPTION") (when DESCRIPTION changes) to create a new lockfile, which it pushes back to the repo. Under the hood, capsule::create("DESCRIPTION") is basically renv::install(); renv::snapshot().

  4. Human A pulls the updates back down or Human B clones project. They call reproduce_lib() to create the capsule library from the lockfile. They can now enter the "reproducible environment" when they need to.

    • Human B could also use renv::install() to install packages from the DESCRIPTION to get a reasonably close environment + dev dependencies, or they could run reproduce_lib() plus remotes::install_deps(dependencies = "Suggests") to install the dev deps that aren't tracked by the lockfile.
  5. At some point Human B wants to add a new package to project. They modify DESCRIPTION, push the changes and wait for the CI bot to update the official lockfile.
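Human B's options in step 4, spelled out as a sketch using the functions named in this thread (remotes::install_deps() is standard remotes API; the others are as described above):

```r
# Exactly reproduce the locked prod library:
capsule::reproduce_lib()

# ...then add the dev-only Suggests packages, which the lockfile
# doesn't track:
remotes::install_deps(dependencies = "Suggests")

# Or skip the lockfile and approximate the environment straight
# from DESCRIPTION:
renv::install()
```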

The key realization for me is that the lockfile isn't a universal guarantee of reproducibility, even though it kind of tries to be. Really it's a description of the packages installed in a target environment. Lockfiles work best when they're tied to where they're expected to run.

So you could replace "CI bot" with any process that creates a trustworthy lockfile for the target environment. For example, recently I worked on a project locally, but then called capsule::create() inside an Ubuntu Docker container to generate the lockfile. That got me a lockfile I could expect to work in the target runtime environment. I could use run() outside of the Docker container (on my Mac) to test with appropriate package versions, but know that my code would most likely work as expected when I ran it in the production environment (an Ubuntu GHA runner).
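That Docker trick can be sketched as a one-liner. The image tag and mount path are illustrative, and it assumes capsule is already installed in the image (or installed by a prior Rscript call):

```sh
# Generate renv.lock inside the target OS rather than on the dev machine
docker run --rm -v "$PWD":/project -w /project rocker/r-ver:4.2 \
  Rscript -e 'capsule::create()'
```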

@MilesMcBain
Owner

Thanks for the clarification!

The key realization for me is that the lockfile isn't a universal guarantee of reproducibility, even though it kind of tries to be. Really it's a description of the packages installed in a target environment. Lockfiles work best when they're tied to where they're expected to run.

Yes, I hear this; I've learned this lesson the hard way myself. My team's solution was to move development to the cloud, such that we're full-time developing on a prod-like environment.

For example, recently I worked on a project locally, but then called capsule::create() inside an Ubuntu Docker container to generate the lockfile. That got me a lockfile I could expect to work in the target runtime environment.

This is a cool idea. I have been thinking about some kind of integration of capsule with Docker for a little while now. But I was thinking from the perspective of some kind of 'lockfile rescue' function, where you try to recreate the capsule on a container image from around the time the lockfile was created.

There's still a bit more sketching to do on the workflow I think. Having lockfiles created locally that you don't want feels kind of lame.
