
Allow to cache dependencies #147

Closed
michaelsauter opened this issue Aug 16, 2021 · 17 comments · Fixed by #460

@michaelsauter (Member)

It would be great if ods-build-go and friends were able to cache third-party dependencies between builds to speed up build/test execution time. Given we already use a PVC as a workspace, it should be possible to cache. One issue is that the mount point currently gets wiped completely in ods-start.

michaelsauter added the labels enhancement (New feature or request) and good first issue (Good for newcomers) on Aug 16, 2021
michaelsauter added this to the 0.3.0 milestone on Aug 26, 2021
@michaelsauter (Member, Author)

See tektoncd/pipeline#3097 as it contains some background on current limitations and potential other approaches to the caching issue.

@stitakis (Member)

@michaelsauter very interesting discussion! Unfortunately, it doesn't mention why PVC seems not to be the preferred option, though it works.

And honestly, for the build environment we aren't even really interested in using PVCs; we're forced to use them because emptyDir can't be shared between Tasks in Pipelines.

Do you see any blocker regarding this approach? Have you explored using a PVC?

@michaelsauter (Member, Author)

> Unfortunately, it doesn't mention why PVC seems not to be the preferred option, though it works.

The main argument, I think, is that PVCs are local to nodes, or at least to clusters. If your builders can run anywhere, then you need a place to upload caches to and download them from. You see this pattern with e.g. GitHub Actions as well.

That said, this is not the case for us (all our builders are in the same cluster), so I think using a PVC is a viable option. But maybe it is not the only option for us, which is why I linked this discussion :)

@stitakis (Member) commented Sep 20, 2021

I see some interesting advantages in using a PVC: it can be mounted in a task step, e.g. the build-gradle-binary step of the cluster task ods-build-gradle. And the PVC name can be parameterized in the task. This could be used to assign a project-specific PVC volume.

@henrjk (Member) commented Jan 7, 2022

With https://tekton.dev/vault/pipelines-v0.24.3/workspaces/ and docs/design/relationship-shared-library.adoc in mind, I believe the situation is as follows:

  • There is exactly one PVC used by all pipelines in a project.
  • As a consequence, only one pipeline (or task?) can run at a time per project until issue Support repo-specific PVCs #160 is addressed
  • The ods-pipeline cluster tasks all have their working directory at $(workspaces.source.path) of the same PVC

Is it correct that the working directory is actually the root directory of the ods-pipeline PVC?

One way to enable caching would be to have ods-start no longer wipe the entire PVC. For example, it could spare ./.ods-cache.
Then, to avoid having different pipelines and tasks interfere with each other, they could use
./.ods-cache/<pipeline-name>/<task-name>/ as the cache for dependencies.

Instead of hard-coding this location, an environment variable PIPELINE_CACHE could be defined so that build scripts can use it, knowing that the directory will persist over multiple pipeline runs of the same pipeline.
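To make the proposed contract concrete, here is a minimal Go sketch of how a build task could resolve its cache directory. Both PIPELINE_CACHE and the fallback layout are the proposals from this comment, not existing ods-pipeline behaviour, and the function name is illustrative:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cacheDir resolves the dependency cache location for a task run.
// It prefers the proposed PIPELINE_CACHE variable and falls back to
// the ./.ods-cache/<pipeline-name>/<task-name>/ layout suggested
// above. Both are proposals, nothing here exists yet.
func cacheDir(pipelineName, taskName string) string {
	if dir := os.Getenv("PIPELINE_CACHE"); dir != "" {
		return dir
	}
	return filepath.Join(".ods-cache", pipelineName, taskName)
}

func main() {
	fmt.Println(cacheDir("feature-x", "ods-build-go"))
}
```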

When would the pipeline cache directory be cleaned up?
There could be a cleanup pipeline which reacts to Merged, Deleted or Declined events, and/or potentially a cleanup collector running periodically.

Another way could be to provide a PVC per pipeline.
With this, ods-start would still need to spare .ods-cache, but PIPELINE_CACHE could be ./.ods-cache/<task-name>/.

@michaelsauter @stitakis What do you think?

@michaelsauter (Member, Author)

@henrjk Yes, your description of the current situation is spot on.

> One way to enable caching would be to have ods-start no longer wipe the entire PVC. For example, it could spare ./.ods-cache.
> Then, to avoid having different pipelines and tasks interfere with each other, they could use
> ./.ods-cache/<pipeline-name>/<task-name>/ as the cache for dependencies.

Caching underneath the pipeline name would cause the cache to be used only in rare cases, I believe. Given the current architecture of one branch = one pipeline, the first push in any branch would run without cache. Assuming most work happens in branches, we wouldn't see a big speed-up. Maybe caching per repository would work better? (As a side note: I am not fully convinced one branch = one pipeline is really what we want.)

> When would the pipeline cache directory be cleaned up?
> There could be a cleanup pipeline which reacts to Merged, Deleted or Declined events, and/or potentially a cleanup collector running periodically.

Right now, webhook events of type Merged, Deleted or Declined do not trigger a pipeline run, therefore they also do not have access to the workspace.

Maybe cleanup could be quite simple (at least to begin with)? The strategy could be to attempt to keep disk usage under e.g. 80%. Assuming a cache location of ./.ods-cache/<repo-name>/<task-name>/, we'd delete the oldest cache directory until we are below 80% or run out of cache dirs to delete. This cleanup could happen during ods-start.
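A minimal Go sketch of that strategy, assuming a Linux node and the golang.org/x/sys/unix package (the 80% threshold and the layout are the values proposed above; error handling is simplified):

```go
package cleanup

import (
	"os"
	"path/filepath"
	"sort"

	"golang.org/x/sys/unix"
)

// usageAbove reports whether the filesystem containing path is more
// than maxPercent full.
func usageAbove(path string, maxPercent float64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	usedPercent := 100 * (1 - float64(st.Bavail)/float64(st.Blocks))
	return usedPercent > maxPercent, nil
}

// CleanCache deletes the oldest entries under cacheRoot (e.g.
// ./.ods-cache) until disk usage drops below maxPercent or no
// entries remain. It could run during ods-start.
func CleanCache(cacheRoot string, maxPercent float64) error {
	entries, err := os.ReadDir(cacheRoot)
	if err != nil {
		return err
	}
	// Oldest first, by modification time (errors ignored in this sketch).
	sort.Slice(entries, func(i, j int) bool {
		fi, _ := entries[i].Info()
		fj, _ := entries[j].Info()
		return fi.ModTime().Before(fj.ModTime())
	})
	for _, e := range entries {
		above, err := usageAbove(cacheRoot, maxPercent)
		if err != nil || !above {
			return err
		}
		if err := os.RemoveAll(filepath.Join(cacheRoot, e.Name())); err != nil {
			return err
		}
	}
	return nil
}
```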

However, if we had one PVC per repo (which is something I'd say we likely want), then the outlined strategy would either clean up all the time (because the PVC is "too small") or almost never clean up, only protecting the cache from growing forever.

@henrjk (Member) commented Jan 10, 2022

@gerardcl this may interest you as well!

@stitakis (Member)

@henrjk I like the idea of one PVC per repo. It could work, even if the builds will not be able to run in parallel. I think that is a limitation we can accept.

My initial approach was to mount a PVC per cluster task. E.g. the ods-build-gradle task would define a parameter that could be used to specify the PVC for caching dependencies of a given technology. This assumes the PVC could be defined as ReadWriteMany (RWX). The advantage is that it would make it possible to reuse the PVC between builds of different repos. However, I'm not sure if this is technically feasible.

@henrjk (Member) commented Jan 10, 2022

@stitakis and @michaelsauter I thought that one PVC per pipeline (= branch) makes sense for the following reasons:

  • I'd be hesitant to have caching in develop as I'd rather catch caching bugs at an early stage.
  • In a prior caching implementation for Python, it did not pay off to simply cache the dependencies, as reinstalling them from the cache was not much faster than reinstalling them via Nexus. To make it worthwhile we had to cache the entire venv. However, at least for Python, I'd rather not share a venv across branches. @gerardcl correct me if this has not been our experience.
  • There would be a very easy cleanup strategy.

Of course this would not help with develop pipelines. For branch pipelines, it appears that most people create a branch via Jira, and the first build triggered on that branch would then take the hit of initial cache population.

@michaelsauter (Member, Author)

I thought a bit more about this and propose the following:

Instead of implementing a certain caching strategy, we could add a cache-key param to each build task. This cache key can be set by pipeline authors in ods.yaml according to their needs. This follows the approach taken e.g. by GitHub Actions. With this in place, users have the following options:

  • empty cache key (or a value like none, TBD) = no caching
  • pipeline name (available through Tekton via $(context.pipeline.name)) = caching per pipeline (= caching per branch)
  • repo name + task name = caching across all branches of one repo
  • project name + task name = caching across all branches AND repos of one project (see caveat at the end for this option)

I am not sure what the default should be (but lean towards "no caching"). Regarding cleanup, I would still start with my proposal above to reduce cache dirs until disk usage drops below e.g. 80%.

I believe that, based on our discussion above regarding different preferences, and taking into account that different build tools take different approaches to dependency management, it is best not to make this decision at the level of ODS pipeline as a whole but to delegate it to pipeline authors.

The implementation in ODS pipeline is then simply to configure the build tool to use the location given by the cache key (the corresponding cache directory may either be empty and get filled, or already be filled and get modified).
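As a sketch of what such a build task implementation could look like (the function name and the "none" sentinel are illustrative, nothing here is decided):

```go
package build

import "path/filepath"

// resolveCacheDir maps the proposed cache-key task param to a
// directory inside the shared workspace. An empty key (or the
// possible "none" value, TBD above) disables caching.
func resolveCacheDir(workspacePath, cacheKey string) (dir string, enabled bool) {
	if cacheKey == "" || cacheKey == "none" {
		return "", false
	}
	return filepath.Join(workspacePath, ".ods-cache", cacheKey), true
}
```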

Note that this "cache key" approach works within the current "one PVC across all pipelines" situation, but will also work with "one PVC per repo". Once we switch to "one PVC per repo" though, the ability to cache across repos would no longer work (I think this is acceptable). However, we could also think about using a separate PVC just for caching, as @stitakis suggested. But I think we need to be very careful about what effect this has on parallel execution (knowing ReadWriteMany doesn't work on e.g. AWS). In any case, all of that should not affect the user interface, which is simply a cache-key param on the task.

Thoughts @henrjk @stitakis @oalyman @kuebler @gerardcl?

@gerardcl (Member)

Hi! Right, trying to reuse dependencies/envs is not straightforward:

  • if using different folder paths (based on name combinations, for example pipeline ID + branch), most build tools (npm, pip, ...) will fail to reuse such folders/caches, hence one might need a fresh initial cache folder when starting a new (branch) pipeline. (see ****)

  • if the same folder path (mount) can be used, the chances of reuse are better, but we need to ensure the pipelines are queued (one at a time) <- in any case, this could be the option of having a PVC only for caching purposes (see ****)

  • I would not have caching on QA, maybe only in DEV and dev branches

  • in a previous project we had the same approach of cleanup based on a certain % of capacity used and deletion of the oldest entries, and it worked nicely (we were also protecting the currently targeted branch pipeline)

I see the cache-key options proposed above, applied per repo, as a good starting point.

In any case, keep in mind that PVC mounting, unmounting, etc. also consumes time, which might sometimes amount to as much as not caching at all. The same applies to mv, cp or rsync commands.

(**** note that this might already be a known issue: if I push to a branch and then create a PR, I have two pipelines for the same branch... fail!)

@michaelsauter (Member, Author)

> I would not have caching on QA, maybe only in DEV and dev branches

Note that for promotion you'd typically not do any build, so I think caching wouldn't play a role there.

> In any case, keep in mind that PVC mounting, unmounting, etc. also consumes time, which might sometimes amount to as much as not caching at all. The same applies to mv, cp or rsync commands.

Therefore I would start by using the same PVC that is mounted anyway for the workspace.

> (**** note that this might already be a known issue: if I push to a branch and then create a PR, I have two pipelines for the same branch... fail!)

Please open a separate issue. This needs more thought. I think we do want pr:opened to trigger a new pipeline run, but of course without failing the pipeline.

@henrjk (Member) commented Jan 12, 2022

I agree with:

  • having ods-pipeline not impose a particular way of caching
  • having non-caching as a default
  • a PVC per repo

How would one decouple caching from the build scripts' implementation?
Should there be a way to map a key to a path in the workspace by having an ods-pipeline cache task, somewhat similar to (but perhaps simpler than) the cache action of GitHub Actions?
See https://docs.github.com/en/actions/advanced-guides/caching-dependencies-to-speed-up-workflow for background info.

@michaelsauter (Member, Author)

> How would one decouple caching from the build scripts' implementation?

Why would you like to decouple this?

I think we could start with this coupled. A separate cache task like GitHub Actions' would have the disadvantage that it requires launching a new pod (until Tekton supports grouping multiple tasks in one pod).

@henrjk (Member) commented Jan 12, 2022

I meant decouple in the sense that the build script does not itself impose the caching strategy.
For example, for Python I am not sure that venv sharing is something that should go in a general task. I guess it depends on the details. But this is where I was coming from. So it could also be supported by all build tasks supporting caching.

@michaelsauter (Member, Author)

> I meant decouple in the sense that the build script does not itself impose the caching strategy.

Oh ok, now I get it.

My proposal would be to delegate the decision of how to deal with the cache key to the build tasks. I think what to do with it will depend on the technology used. I do not know how Python should handle it, and maybe there isn't a one-size-fits-all approach for Python. In that case we just have to make a call about what we support, I guess?

I was approaching this with Go caching in mind. The build task should simply cache the Go module cache. A short description of how that works is at https://go.dev/ref/mod#module-cache. The implementation in the build task would simply set GOMODCACHE to the directory referenced by the cache key. Knowing how the Go cache works, it would be beneficial for users to use a static cache key that allows reusing the module cache across the whole repo, as there shouldn't be any negative consequences. Of course, if people want to be more conservative (I would not know why, but anyway), one could opt for a module cache per branch by using a dynamic cache key.
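A minimal sketch of that wiring, assuming the build task has already resolved the cache key to a directory (the gomodcache subdirectory name is illustrative):

```go
package build

import (
	"os"
	"os/exec"
	"path/filepath"
)

// buildWithModCache runs `go build` with GOMODCACHE pointed into the
// cache-key directory, so downloaded modules persist across pipeline
// runs sharing that key.
func buildWithModCache(cacheKeyDir string) error {
	// GOMODCACHE must be an absolute path; "gomodcache" is an illustrative subdir.
	modCache, err := filepath.Abs(filepath.Join(cacheKeyDir, "gomodcache"))
	if err != nil {
		return err
	}
	cmd := exec.Command("go", "build", "./...")
	cmd.Env = append(os.Environ(), "GOMODCACHE="+modCache)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```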

@stitakis (Member)

@michaelsauter your proposal sounds good to me
