
cmd/go: provide cache support for code rewriting tools #29430

Open
josharian opened this Issue Dec 26, 2018 · 28 comments

10 participants
@josharian (Contributor) commented Dec 26, 2018

This issue could easily have been a comment on #19109, but opening a new issue seemed better for everyone.

Some tools such as go-fuzz do a source-to-source (s2s) transformation followed by a build. This doesn't play nicely with cmd/go's cache. The problem is where to put the translated files. The natural option is to write them all to a tmp dir, but that busts cmd/go's cache, making 'go fuzz' compilation slow. You could instead write them to the user's GOPATH somewhere, but that's mildly hostile, and you end up needing to fight import path rewrites. You could write them into the vendor directory, except that doesn't work when a package has already been vendored, plus it is definitely hostile (since many people check in their vendor directory).

Note that the one s2s transformation supported in the main Go toolchain, test coverage, gets explicit support from cmd/go, including caching instrumented source code in the pkg dir. (At least if I read the -x output correctly.)

I don't know what the right interface looks like here. It could be to tell cmd/go "use this transformation tool", but it's tricky to get cmd/go to call the transformation tool with the right set of information (see e.g. #15677). It could be to ask cmd/go for a work dir and then ask it to treat those work dir entries as replacements for the GOPATH/module packages. Or...?

cc @bcmills @dvyukov

@dvyukov (Member) commented Dec 27, 2018

This doesn't play nicely with cmd/go's cache.

This also does not play nicely with modules/vendor/internal/cgo/some_other_new_feature/etc.

The most reasonable interface from a tool's perspective would be that it's invoked by the go command and given a single package to transform (either in-place or to a new given location). But this also needs to work with cgo and allow parsing the package with go/types, including importing other packages.

But reflecting on the go-fuzz experience, it's still an open question whether s2s is the right approach. It may be the wrong level for such tools. The current go coverage instrumentation is simpler. But for go-fuzz we have an infinitely long tail of corner cases where it miscompiles, crashes, produces invalid code, etc. These corner cases are related to specific constant values used in specific contexts, specific nested expression constructs, specific uses of types, imports, shadowing, etc. Here are just a few examples:
dvyukov/go-fuzz@f15634d
https://github.com/dvyukov/go-fuzz/blob/31e2128c9be91445df05c3bbfb34e0dc3f1e4809/go-fuzz-build/cover.go#L299-L319
On SSA level (e.g. how we do it in llvm) all of that complexity simply does not exist.
And we are talking about just 1 tool that tries to do moderately complex instrumentation.

@mvdan (Member) commented Dec 27, 2018

I think this could fall under tooling, so I'll CC @ianthehat for his thoughts.

The most reasonable interface from a tool's perspective would be that it's invoked by the go command and given a single package to transform (either in-place or to a new given location). But this also needs to work with cgo and allow parsing the package with go/types, including importing other packages.

Reminds me of go/packages, see ParseFile and Overlay in here:
https://godoc.org/golang.org/x/tools/go/packages#Config

Not saying that that package should solve this problem, but the source-to-source usecase seems quite similar to what go/packages does.

@josharian (Contributor, Author) commented Dec 27, 2018

Another spot-fix idea: Just as we use //line comments to tell the compiler that the following line is really from somewhere else, we could use a //file comment at the top of a file to tell the compiler and cmd/go that the file is really from somewhere else. It seems like that might help with some of the compilation caching failures.

it's still an open question whether s2s is the right approach. It may be the wrong level for such tools.

I was pondering that myself yesterday, for obvious reasons. But cmd/compile doesn't have a stable IR like llvm, and I don't think we are anywhere near ready to introduce one. Source is the only lingua franca, which makes it almost the only game in town.

Here's another (half-baked) idea. Write a single s2s preprocessor (compiler) that rewrites source code into a regular, minimal SSA form, making it simpler to deal with in all other tools. All vars would have explicit types when declared and systematic names (vNNN or phiNNN). Extra information (original var name?) could perhaps be conveyed in structured comments. All dot imports would be rewritten to be regular imports, and all imports would be given explicit unique names for easier syntactic resolution. And so on. This is a bit vague, but I think it'd take us a lot of the way from N*M to N+M.

@dvyukov (Member) commented Dec 27, 2018

I was pondering that myself yesterday, for obvious reasons. But cmd/compile doesn't have a stable IR like llvm, and I don't think we are anywhere near ready to introduce one. Source is the only lingua franca, which makes it almost the only game in town.

Well, there is a bunch of SSA passes in the compiler that somehow exist ;)

@josharian (Contributor, Author) commented Dec 27, 2018

Well, there is a bunch of SSA passes in the compiler that somehow exist ;)

Right, but the set of ops, their semantics, etc. change from release to release. There is no serialization format, much less a stable one. And ssa.Values refer (via Aux) to other compiler structures like gc.Node, obj.LSym, which opens a giant can of worms.

@dvyukov (Member) commented Dec 28, 2018

Yet these passes achieve their goal. I mean it may not be a generic solution that allows users to write arbitrary s2s transformations, but it may still let us solve the problem at hand (fuzzing coverage). Just as the race detector is an SSA pass.

@josharian (Contributor, Author) commented Dec 29, 2018

Yes, we could declare fuzzing special enough to build into the toolchain. That's #19109. But we obviously can't do that for all tools.

Over in #19109, Russ asked repeatedly for go-fuzz to be prototyped out of tree to match how it would ultimately be integrated. I was experimenting with that. (Before anyone gets excited, it is enough work that I'm unlikely to follow through on it unless I find sponsorship, and I'm not planning to search for that at the moment.) Many of the stumbling blocks struck me as addressable (using go/packages, fixing corner cases, etc.) or merely requiring extra design (corpus location and migration), but this caching problem did not, which means that it is of independent interest for similar tools, and thus worth discussing independently.

Incidentally, cc @alandonovan re: my s2s SSA form pipe dream idea above, since that is very much his wheelhouse.

@dvyukov (Member) commented Dec 29, 2018

All of these problems (go/packages, corner cases, caching) simply don't exist if we go with compiler instrumentation. I think a "go-fuzz prototype" needs to include only the user-facing interface part. So I would start with the interface part; the rest is implementation details.

@josharian (Contributor, Author) commented Dec 29, 2018

But as someone who works on the compiler, I don't want more special things in it, I want fewer. They increase the testing surface area, make other work harder, etc.

And as someone who writes tools, I want the toolchain to be more flexible and welcoming to tools, not to wonder "how does test coverage do it?" and discover the answer is that it is special.

I want go-fuzz to be built-in to the toolchain. (Although I have to say, writing a good fuzz function is often surprisingly subtle.) But I also think it is worth trying to face some of the challenges it raises head on, in a general way. If it turns out there are no good answers available, sure, let's make it special. But let's at least try first.

@dvyukov (Member) commented Dec 29, 2018

Fair point.
But it still leaves open the question of whether this is really the right level for fuzzing coverage. It will surely solve the problem of not making the compiler any more complex. But do we want to also move all of the optimization passes, escape analysis, and codegen out of the compiler to make it simpler and to allow external users to write their own replacements? Or do we want to write and maintain two mostly identical compilers/ssa-engines, one without a stable API that will be used for compilation and another with a stable API that will be used for coverage? Note that the second one needs to be as elaborate as the first one, because, say, if we don't want to trace trivial inlined functions, then the compiler used for coverage will need to be able to do inlinability analysis and inlining while preserving all original source:line info for the output. Or, if we want to transform y := x + 3; if y < 10 into if x < 7 for comparison tracing, we will need to be able to do pretty complex analysis and transformations, potentially across functions too. But if the second implementation is as good as the first one, why do we want the first one at all? Or will our strategy be "no, all the smart cool stuff is only for us in the compiler; you are limited to inserting full statements in a dumb way"?

@FiloSottile (Member) commented Dec 30, 2018

Last time we discussed fuzzing with @rsc, I think he argued it should not get special compiler support, but should just use s2s transformations and live outside of the go tool, at least initially. As I understood it, it was a sustainability argument: as the project grows we can't add and maintain support in the compiler for everything, but we can add reliable support for s2s.

I agree with that view, and I think it’s consistent with the direction set by go generate. We should fix this.

(I even have a pipe dream of extracting cgo and maybe coverage into external, forkable tools, where they can be adapted to more niche use cases, instead of having to live with the tradeoffs we selected for the go tool.)

@dvyukov (Member) commented Dec 30, 2018

@FiloSottile what types of tools/transformations have you considered as a use case?

I even have a pipe dream of extracting cgo and maybe coverage into external, forkable tools

Well, if the current source coverage is not trivially expressible in this framework, then it's not working and we have failed. I would convert source coverage as a proof of concept, because any such framework badly needs realistic use cases; it would happen as a side effect anyway.

@FiloSottile (Member) commented Dec 30, 2018

Some ideas are a C FFI that trades safety for performance, coverage for main(), and prototyping FuzzX functions in test files. I don't think there are any major blockers, just UX and documentation bits like this issue.

@dvyukov (Member) commented Dec 30, 2018

Some ideas are...

Does this potentially include things like inlining, alias analysis, etc (e.g. y := x + 3; if y < 10 to if x < 7)?

prototyping FuzzX functions in test files

Does this require s2s transformations? Or do you mean fuzzing coverage?

@FiloSottile (Member) commented Jan 4, 2019

Does this potentially include things like inlining, alias analysis, etc?

Not the ones I considered. I suppose there's a tradeoff between power and exposed surface: as you say if we push this too far we reach a point where we are exposing too much, so some things do belong in the compiler and not outside.

But some things can easily live outside of it with s2s, so we should fix issues like this one to enable that.

Does this require s2s transformations? Or you mean fuzzing coverage?

IIRC testing uses s2s to call Test* functions, and we would have to do the same to call Fuzz* functions.

@eandre commented Jan 29, 2019

This issue started out with a very actionable description: providing the ability to use the build cache with code rewriting tools. While this discussion sounds valuable, I don't think the desire to create such tools will go away anytime soon, if ever. Therefore it would be nice to treat the s2s-or-not discussion separately from how we might enable build caching for these use cases.

As @josharian notes, the main hurdle is that the cache is keyed by the full path to the build directory, which is usually a temporary directory. I'm happy to take a stab at a fix, but it would be good to reach consensus on what the change should look like. Any suggestions, @josharian et al?

@josharian (Contributor, Author) commented Jan 29, 2019

@eandre thanks for refocusing the conversation on the specific, actionable issue. I had one suggestion in #29430 (comment) (look for “spot-fix”). The design decisions here ultimately rest with @bcmills, I believe.

@bcmills bcmills added the ToolSpeed label Jan 29, 2019

@bcmills (Member) commented Jan 29, 2019

I expect that we will not get to this for 1.13. (We can revisit after that.)

@jayconrod and/or @ianthehat may have some insight into how much impact this might have for the broader tooling ecosystem.

@jayconrod (Contributor) commented Jan 30, 2019

There's a reasonably good chance there will be changes to caching in the 1.13-1.14 time frame. There are some optimizations I'd like to look at, and I think we'll want to do some streamlining when GOPATH is eventually removed (hopefully that doesn't affect caching much, but a lot of things are surprisingly coupled). I don't think we should expose a public interface in 1.13 because it would be hard to avoid breaking changes.

For now, I'd rather focus on fixing cases where something is not cached unnecessarily. For any action, if the inputs and outputs are identical, we should be able to use the same cache key. For example, we might not need to include absolute paths in the cache key if those paths don't affect debug or stack info in the compiled archive.

@eandre commented Jan 30, 2019

To my knowledge the absolute paths are included in the compiled archive as a debugging aid, and that will likely continue to be desirable. Likewise, doing source-to-source rewriting in a temporary directory will likely continue to be how these tools work. So even if we want to make changes to caching, as long as we're thoughtful about the interface we provide for this case, these goals don't necessarily conflict with your work.

@josharian (Contributor, Author) commented Jan 30, 2019

To my knowledge the absolute paths are included in the compiled archive as a debugging aid.

Indeed. Thus my suggestion to provide a way to tell both cmd/compile and cmd/go that a file’s true path is elsewhere.

@dvyukov (Member) commented Jan 31, 2019

This issue started out with a very actionable description: providing the ability to use the build cache with code rewriting tools. While this discussion sounds valuable, I don't think the desire to create such tools will go away anytime soon, or ever. Therefore it would be nice to treat the s2s versus not discussion separately from how we might enable build caching for these use cases.

Uses of this feature directly affect its priority. Will it enable new capabilities? Which ones? Are they more important than other capabilities? While there are lots of generally useful things, there are only a few things that we can actually do.
The issue description uses fuzzing as the main motivation. If this won't be used for fuzzing, do we still want to do it ahead of fuzzing? Or do we do fuzzing support first?

@eandre commented Jan 31, 2019

The priority mostly matters when it comes to asking somebody to do the work. Like I said, I'm happy to work on this if we can reach consensus on how it should be implemented. Based on the current fuzzing implementation, it seems like it would be worth supporting. Between fuzzing, go test, cgo(?), and the go/packages.(*Config).Overlay functionality, the use case seems quite well validated to me.

@ianthehat commented Jan 31, 2019

If a source code transformation tool wants to play nicely with the build cache, can't it just generate a stable path?
Temp directories should not inherently cause problems, only ones with unique randomized names, and if that is required to prevent runs of the tool from colliding, then you already have a problem the build cache is not going to fix for you.
One approach I have used in the past was to pick a temp directory named for the tool, and write all files with a filename of the hash of the file contents.

@jimmyfrasche (Member) commented Jan 31, 2019

It's been a bit since I wrote an S2S tool (possibly even pre-Go1!).

I imagine the benefits of an ideal integration with the build cache would be

  • I could say "I'm S2S'ing $pkg with $tool/$tool_version" and go would either say "that's up to date, here's $dir with the transformed artifacts" or "I don't have that cached, you need to do that and can store the results in $dir (or provide a temporary directory that you can write to and copy everything into the build cache on success)"
  • The directory with the transformed artifacts would be reaped whenever there's a change to $pkg or go clean is run
  • It could handle unchanged packages local to the module (including internal/ packages), local vendor, external dependencies, and any external dependencies injected into the code during the transform, as if they were in the correct place, without having to copy the entire module root during the S2S transform
  • It just handles stuff like build tags
  • Does the right thing when switching between different go commands
  • The mechanism could also be used for caching arbitrary metadata like ctags-esque autocomplete info, lists of local reverse dependencies, analysis results, etc.
  • It would keep up with changes in the go tool, if there are any that need to be taken into account

A stable directory would need to look something like $USER_CACHE_DIR/$tool/$tool_version/$go_version/$build_tags/$mod_and_pkg_path/src. You need to create it if it doesn't exist and recreate it if any files differ from the last time it was created. When creating/recreating, you need to start by copying the module containing the package in there before doing the relevant transform. (If it's pre-modules code then you're going to need to prepend that directory to $GOPATH before invoking go when building.) I wouldn't be surprised if I missed a step or an edge case in there somewhere.

If you do all that, I think it gets most of the points, albeit with effort from the tool writer. It seems like the major losses feature-wise are:

  • it doesn't get automatically obliterated by go clean so you have to handle garbage collection and your tool needs to provide its own version of clean
  • a change to the go tool could mean having to update a lot of fiddly code
  • You need multiple copies of a module: one per transformed package, per set of build tags, etc.

Perhaps a library could take care of all or at least most of that heavy lifting? Even so, in an ideal world, having the go command somehow handle all of this would certainly be appreciated and provide benefit.

@jayconrod (Contributor) commented Jan 31, 2019

I agree with Ian's comment. If a source file is moved to a different path (or regenerated there), the output should not be cached in cases where the output depends on the path.

If we are judicious with flags like -trimpath, maybe we can ensure that parts of the path that don't affect the output aren't included in the ActionID computation. It may be preferable to write files to deterministic locations though.

@eandre commented Feb 1, 2019

The downside of using a stable working directory is that it puts the burden on tool authors to worry about concurrent runs of the tool and gracefully handling that without blowing up. That's both annoying, and hard to get right due to the difficulty of file locking in a cross-platform way. From the perspective of a tool author it would be nice to outsource that complexity to the build cache.

@thepudds commented Mar 10, 2019

In #16860 (comment), @josharian said there is some chance a fix for #16860 could also address this issue here.

In #16860 (comment), Bryan said he thought Russ is most likely planning on addressing #16860 for 1.13.

If that happens for 1.13, would that also effectively address this issue here, or would follow-up work also be required to close out this issue?
