proposal: build: define standard way to recognize machine-generated files #13560
Edit in 2018: This is a large issue with many comments (so many that GitHub hides most of them by default). Here is a summary of the latest status.
This proposal has been accepted and implemented by Rob Pike. The final format can be seen here (it links to a comment in this thread by Rob Pike, with the final format that was chosen):
By now, most of the generated Go code uses a comment that matches the format that's described there.
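For readers who want to check a file against that convention, here is a minimal sketch. It assumes the rule as documented in `go help generate`: a line matching `^// Code generated .* DO NOT EDIT\.$` must appear before the first non-comment, non-blank text in the file. The function name `isGenerated` is hypothetical, and the scan is simplified: it only skips blank lines and `//` line comments, so a leading `/* */` block ends the search early.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// generatedRx is the pattern documented by `go help generate`.
var generatedRx = regexp.MustCompile(`^// Code generated .* DO NOT EDIT\.$`)

// isGenerated (hypothetical helper) reports whether src has a line matching
// generatedRx before the first non-comment, non-blank text.
// Simplification: only blank lines and // comments are skipped.
func isGenerated(src string) bool {
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimRight(sc.Text(), "\r")
		if generatedRx.MatchString(line) {
			return true
		}
		trimmed := strings.TrimSpace(line)
		if trimmed != "" && !strings.HasPrefix(trimmed, "//") {
			return false // reached code (or a block comment) without a match
		}
	}
	return false
}

func main() {
	fmt.Println(isGenerated("// Code generated by vfsgen; DO NOT EDIT.\n\npackage main\n"))
	fmt.Println(isGenerated("// generated by mknacl.sh - do not edit\n\npackage main\n"))
}
```

In Go 1.21 and later, the standard library's `ast.IsGenerated` performs this check on a parsed file, so a hand-rolled scan like the one above is mainly useful for illustration.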
The original proposal is below.
I propose that Go create a standardized format that enables code-generating tools to reliably communicate to humans and other machine tools that their output is in fact a generated file. Additionally, Go should add a recommended style for a simple "code generated" disclaimer (one that satisfies the above criteria).
A file is considered to be "generated" if and only if the maintainer(s) of the project consider it a non-canonical source. In order to make long-term changes to such files, another source must be modified, and the file in question is then fully (re)generated by a reproducible machine tool.
A distinguishable property of generated files is that they can be deleted and re-generated with a zero diff.
One of the strong values that Go brings is its conventions and best practices, which reduce bikeshedding and increase consistency and readability across diverse teams of Go programmers. Having a well-defined convention, format, or standard for things that are unimportant to the key task, but still need to have some value, saves time.
To expand on that, there are many examples of things that have a recommended format/style in Go, which lets you simply reuse it rather than forcing you (and other people) to invent your own style.
There is one type of comment that is commonly used but has no well-defined, officially recommended style in Go.
It is a comment that most tools that generate Go code tend to write somewhere at the top of the code.
There are currently many variations of such disclaimer headers in the wild, and they often vary insignificantly (in spacing, punctuation, etc.). New variations come to be when authors look at how other tools do this, see a large variance, end up picking their favorite and tweaking it.
Consider the following examples in the wild:
// generated by stringer -type Pill pill.go; DO NOT EDIT

// Code generated by "stringer -type Pill pill.go"; DO NOT EDIT

// Code generated by vfsgen; DO NOT EDIT

// Created by cgo -godefs - DO NOT EDIT

/* Created by cgo - DO NOT EDIT. */

// Generated by stringer -i a.out.go -o anames.go -p ppc64
// Do not edit.

// DO NOT EDIT
// generated by: x86map -fmt=decoder ../x86.csv

// DO NOT EDIT.
// Generate with: go run gen.go -full -output md5block.go

// generated by "go run gen.go". DO NOT EDIT.

// DO NOT EDIT. This file is generated by mksyntaxgo from the RE2 distribution.

// GENERATED BY make_perl_groups.pl; DO NOT EDIT.

// generated by mknacl.sh - do not edit

// DO NOT EDIT ** This file was generated with the bake tool ** DO NOT EDIT //

// Generated by running
// maketables --tables=all --data=http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt --casefolding=http://www.unicode.org/Public/8.0.0/ucd/CaseFolding.txt
// DO NOT EDIT

/*
* CODE GENERATED AUTOMATICALLY WITH github.com/ernesto-jimenez/gogen/unmarshalmap
* THIS FILE SHOULD NOT BE EDITED BY HAND
*/
This creates 2 problems.
This leads to circular arguments and PRs/CLs. For example, see the discussion and the change itself at https://go-review.googlesource.com/#/c/15073/. It started from https://github.com/github/linguist/blob/473282d/lib/linguist/generated.rb#L241, which led to CL 15073. That led to shurcooL/vfsgen@b2aab1c and shurcooL/go@43b2166. But the initial GitHub behavior came from protobuf disclaimers.
I've created the following func to try to answer the question of whether a file is generated. At this time, it relies on heuristics and best effort to tell if a file is generated. https://github.com/shurcooL/go/blob/master/analysis/generated_detection.go If there were a well-defined requirement for tools to follow, this code could be made simpler and more reliable. Ideally, that helper should be moved into an external library for people to reuse, so that generator tools that wish to be compliant can use it for verification.
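To illustrate why heuristics of this kind are fragile, here is a hypothetical sketch. It is not the code from the linked helper; the function name `looksGenerated`, the ten-line window, and the substring it searches for are all assumptions made for the example. It catches some of the headers listed above, misses others, and can flag hand-written files.

```go
package main

import (
	"fmt"
	"strings"
)

// looksGenerated is a hypothetical best-effort heuristic: it reports whether
// any of the first ten lines contains "do not edit", ignoring case.
func looksGenerated(src string) bool {
	lines := strings.Split(src, "\n")
	if len(lines) > 10 {
		lines = lines[:10]
	}
	for _, line := range lines {
		if strings.Contains(strings.ToLower(line), "do not edit") {
			return true
		}
	}
	return false
}

func main() {
	// Detected: the phrase appears, in lower case.
	fmt.Println(looksGenerated("// generated by mknacl.sh - do not edit\npackage x\n"))

	// Missed: this header says "SHOULD NOT BE EDITED" instead.
	fmt.Println(looksGenerated("/*\n* CODE GENERATED AUTOMATICALLY WITH github.com/ernesto-jimenez/gogen/unmarshalmap\n* THIS FILE SHOULD NOT BE EDITED BY HAND\n*/\npackage x\n"))

	// False positive: a hand-written comment that happens to use the phrase.
	fmt.Println(looksGenerated("// Do not edit the table below without updating the docs.\npackage x\n"))
}
```

Each fix for a miss or a false positive adds another special case, which is the maintenance burden a single agreed-upon format would remove.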
The goals of this proposal are twofold.
Primarily, to resolve the current impossibility of reliable communication between code generator tool output, and tools that try to determine if a file is code generated.
There should be a way for code generator tool authors to be able to express in their generated output that the file is generated, such that it's possible to reliably detect if a file is generated by other tools.
Secondarily, for code generator authors who simply don't care what their disclaimer header looks like, to provide a recommended template (one that satisfies the first condition) to use.
The implementation details should be defined in a design doc.
It is a non-goal to figure out how existing tools should use, or not use, the fact that a given file is generated.
There is some fear that if it's possible to determine if a file is generated reliably, then tools that display code differences will hide generated code differences. That is absolutely the choice of the tool, and in my opinion it should not enforce any behavior that users are unhappy with.
Having additional information (whether a file is generated or not) should enable tools to offer better user experiences; it should not cause tools to offer worse experiences than they do currently.
This proposal focuses solely on enabling code generator tool authors who wish to use a standard disclaimer header to do so without forcing them to invent their own format, and on enabling tool authors who wish to make use of the information about whether a file is generated to use it as they wish. The details of how they do that are outside the scope.
Not standardizing a way for those two types of tools to communicate leads to sub-optimal ad-hoc solutions emerging, as can be seen above. Go has an opportunity to de-fragment this space and create a recommended standard format that will resolve the needs above and allow people to migrate existing tools to the specified format.
Once there's a standard, it's easy to begin updating existing tools towards it over time, and new generator/other tools can start relying on it.
I expect that coming up with a recommended style will likely cause a lot of bikeshedding. However, I think it's a cost worth incurring: going through this process once lets us avoid continuously suffering it while there's no standard at all. I personally don't care too much about what the actual format is (as much as I do about resolving the higher level problems described); I'm okay with whatever the Go authors come up with. Any standard is better than no standard at all.
changed the title
proposal: Standardize format for simple code generator disclaimer headers; enable a reliable machine-readable way to determine if a file is generated.
Dec 10, 2015
I think a good and simple way to approach it would be to add somewhere (Go style guide, or a more internal document) a section that says something like this:
That way, people who don't want to be creative can just copy/paste that example disclaimer template and use it.
However, if someone wants to add more details or make their disclaimer different, they can still do that. They just need to follow rules 1 and 2 so that it can still be recognized by tools as a generated disclaimer header.
To detect if a given file is generated, you'd only need to check if those described rules are satisfied - no need for any other heuristics. (If there are some false negatives, then the generator can be updated to satisfy the rules, instead of muddying the detection algorithm.)
I think putting it that way makes this workable. I don't think it's possible to enforce a single header format that will satisfy everyone's needs/wants and have people be okay with it. But providing simple rules that the header will follow and giving an example would work splendidly.
And what about people writing in languages that are not, er, English? :P
What actual machine-readable text should be used to denote a generated file is a wider issue. This quickly explodes in scope and potential bloat in the future -- e.g. Rust's function decorations.
That's a valid point, and I think it should be considered.
The outcome may happen to be a string that is recognized by GitHub with their current code in linguist, but I respectfully disagree with this being a high priority criterion (if that's what was implied, I'm not sure).
As I mentioned in the proposal, what GitHub came to recognize as a Go generated file had a high degree of luck and variance in it (and circular arguments):
Specifically, see these two PRs that have shaped what GitHub recognizes today:
Notice the motivation of those PRs (first one was to detect protobuf generated files, and the second was modeled after the first to detect go-bindata output). Also notice how easy it was to get them merged in.
Now, maybe we got lucky with the above sequence of events and what GitHub recognizes today is a great format. But if not, I think it's a relatively low-effort follow-up to submit a PR to GitHub's linguist, similar to those 2 PRs above, to make any necessary corrections after this proposal is resolved and Go has an officially recommended and recognized way to indicate that a file is generated.
Compare that with the alternative of Go using a potentially suboptimal mechanism for many, many years to come.
I am very happy with that outcome and I trust whatever Rob Pike comes up with is going to be a great resolution for this issue.
I don't see this as a big issue. Autogenerated files are usually not meant for being read by people, and I don't know how someone would write Go code without being able to read English anyway.
How about using the file extension ".gen.go" (or suffix _gen.go) for generated files, instead of standardising comments? This has two advantages:
Having standard suffixes makes attaching build tags (GOOS and GOARCH) to file names impossible. Also, it will make generating test files impossible. If we must use file names to label generated files, I prefer the standard library's way of prefixing the filename with "z", e.g. as used in the syscall and runtime packages.
Actually, I was just thinking today of suggesting that generated files could be zipped, i.e. ".gogz".