cmd/go: allow extraction of urls used to download dependencies #35922

Open
williamh opened this issue Dec 1, 2019 · 32 comments

Comments

@williamh williamh commented Dec 1, 2019

Hello,

I am the go package maintainer on Gentoo Linux, and I maintain several packages written in Go as well.

Our package manager does not allow network access during the build process after downloading the source for a package, so it needs to be able to download the .zip files for the modules a package needs in advance.

I believe I can download the .zip files to a path, which I will call DISTDIR, then during the build, set GOPROXY="file://${DISTDIR}" and avoid network access.

To do that, I need a way to extract all of the URLs for the .zip files for the dependencies of a package so I can put them in a list for the package manager to download.

Is there a way to do this?

Thanks much,

William

I am going to tag @robbat2 on this report also to include him since he was part of my discussion on our IRC channel.

Member

@mvdan mvdan commented Dec 1, 2019

A starting point could be go mod download -json, though note that it first downloads the modules, and also that it shows the location of the zip on the local cache once downloaded.

A better approach might be go list -m -json all to get information about all the modules involved in the current module, and constructing the URLs to download the go.mod files or zip source archives from https://proxy.golang.org/. You can use go help goproxy to see what the REST interface looks like.
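As a rough illustration of the URL construction suggested above, the sketch below builds the .info, .mod and .zip URLs for one module version, following the layout described in go help goproxy. The helper names escape and proxyURLs are hypothetical, the module used in main is only an example, and the escaping is a simplified stand-in for golang.org/x/mod/module.EscapePath (it performs no validation).

```go
package main

import (
	"fmt"
	"strings"
)

// escape applies the case encoding used by the proxy protocol:
// every uppercase ASCII letter becomes '!' followed by its lowercase form.
// Simplified assumption: no validation of the input is performed.
func escape(s string) string {
	var b strings.Builder
	for _, r := range s {
		if 'A' <= r && r <= 'Z' {
			b.WriteByte('!')
			b.WriteRune(r + 'a' - 'A')
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// proxyURLs returns the .info, .mod and .zip URLs for one module version
// on the given proxy (hypothetical helper, not part of the go command).
func proxyURLs(proxy, modPath, version string) []string {
	base := strings.TrimSuffix(proxy, "/") + "/" + escape(modPath) + "/@v/" + escape(version)
	return []string{base + ".info", base + ".mod", base + ".zip"}
}

func main() {
	for _, u := range proxyURLs("https://proxy.golang.org", "github.com/BurntSushi/toml", "v0.3.1") {
		fmt.Println(u)
	}
}
```

The module paths and versions would come from go list -m -json all; any other server implementing the proxy protocol could be substituted for the proxy argument.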

I'm sure there could be better ways to handle this, though. For example, if you just want to build a subset of the module, you probably don't need to download all of the modules required directly or indirectly by the main module.

Author

@williamh williamh commented Dec 2, 2019

@mvdan My thought is to create a cache, e.g. file://${DISTDIR}/go-cache which could be pointed to by GOPROXY so that when the package manager attempts to build the main module it will not need to download from the network. Is this the best way to handle this? Also, how are the paths in the cache created?

@robbat2 robbat2 commented Dec 2, 2019

@mvdan neither of those commands (go mod download -json, go list -m -json all) print the locations of the upstream URLs for zipfiles.

Given a go.mod and go.sum, produce a listing of the URLs and stable filenames to map them to.
Using _ as a sample replacement for / here. Not set on that character yet.

$PROXY/k8s.io/minikube/@v/v1.5.2.info => goproxy-k8s.io_minikube_@v_v1.5.2.info
$PROXY/k8s.io/minikube/@v/v1.5.2.mod => goproxy-k8s.io_minikube_@v_v1.5.2.mod
$PROXY/k8s.io/minikube/@v/v1.5.2.zip  => goproxy-k8s.io_minikube_@v_v1.5.2.zip

The package manager tracks those RHS filenames, and repopulates the expected directory structure for the GOPROXY=file:///... to use.
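A minimal sketch of the mapping above, assuming "/" is replaced by "_" and the result is prefixed with "goproxy-". The names flatName and manifest are hypothetical. Because "_" may itself appear in a module path, the sketch keeps an explicit table from flat name back to cache-relative path rather than trying to reverse the substitution.

```go
package main

import (
	"fmt"
	"strings"
)

// flatName maps a proxy-relative path to a single flat distfile name.
func flatName(rel string) string {
	return "goproxy-" + strings.ReplaceAll(rel, "/", "_")
}

// manifest records, for every flat name, the relative path it must be
// restored to when the GOPROXY directory layout is rebuilt.
func manifest(rels []string) map[string]string {
	m := make(map[string]string, len(rels))
	for _, rel := range rels {
		m[flatName(rel)] = rel
	}
	return m
}

func main() {
	rels := []string{
		"k8s.io/minikube/@v/v1.5.2.info",
		"k8s.io/minikube/@v/v1.5.2.mod",
		"k8s.io/minikube/@v/v1.5.2.zip",
	}
	for flat, rel := range manifest(rels) {
		fmt.Printf("$PROXY/%s => %s\n", rel, flat)
	}
}
```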

FYI The package manager also captures & verifies checksums on the URLs.

The trivial case for well-versioned modules I can see producing per the goproxy REST API, but it's the corner cases that I don't follow.

E.g, this line from minikube-1.5.2 go.mod:

github.com/olekukonko/tablewriter v0.0.0-20160923125401-bdcc175572fd

That version doesn't appear in the list endpoint.

@dmitshur dmitshur changed the title allow extraction of urls used to download dependencies cmd/go: allow extraction of urls used to download dependencies Dec 2, 2019
@dmitshur dmitshur added this to the Backlog milestone Dec 2, 2019

@robbat2 robbat2 commented Dec 2, 2019

Are there any specific ASCII characters that are NOT permitted to occur in module strings or version strings?

Member

@mvdan mvdan commented Dec 2, 2019

neither of those commands (go mod download -json, go list -m -json all) print the locations of the upstream URLs for zipfiles.

Yes, I realise that. Please read the rest of my comment above. I meant these as examples to point you in the right direction, not as your perfect solution.

Contributor

@hyangah hyangah commented Dec 2, 2019

@robbat2 Is it not possible to use the cache in $GOPATH/pkg/mod/cache/download (the module cache) after running go list -m -json all? The directory structure reflects the requests sent to the proxy, except for .zip files. The zip files needed for the actual build will have the same base name and path but with the .zip extension.

(I wonder if there is any magic flag in list or build that downloads required .zip files as well but skips actual builds)

The details of the proxy protocol, including the encoding, are described at https://golang.org/cmd/go/#hdr-Module_proxy_protocol. The currently accepted characters and encoding rules are described at https://godoc.org/golang.org/x/mod/module#hdr-Unicode_Restrictions

Contributor

@jayconrod jayconrod commented Dec 2, 2019

Just to confirm what @mvdan and @hyangah have said:

Running go mod download without arguments within a module will download all the files a module needs to build. After that, it should be possible to build only from the module cache by setting GOPROXY=off.

You can control the location of the module cache by setting GOPATH: it will be in $GOPATH/pkg/mod. Downloaded files are in $GOPATH/pkg/mod/cache/download. It's possible to use the module cache as a proxy by setting GOPROXY=file://$GOPATH/pkg/mod/cache/download.


@williamh @robbat2 One thing I was a little unclear on: is there a restriction against using go mod download to populate the module cache? It sounds like you want to create the cache only using package manager infrastructure without running the go command.

To make a list of URLs for that, you could run go mod download manually once in an empty cache, then convert the file names to URLs. You only need .info, .mod, and .zip files. Something like this might work?

cd go/src/golang.org/x/tools/gopls   # or any other module
export GOPATH=$(mktemp -d)
go mod download
find $GOPATH/pkg/mod/cache/download -type f | \
    grep '\.\(mod\|info\|zip\)$' | \
    sed -e "s,$GOPATH/pkg/mod/cache/download,https://proxy.golang.org,"

(https://proxy.golang.org/ can also be replaced with any other server that implements the proxy protocol).

Contributor

@jayconrod jayconrod commented Dec 2, 2019

The trivial case for well-versioned modules I can see producing per the goproxy REST API, but it's the corner cases that I don't follow.

E.g, this line from minikube-1.5.2 go.mod:

github.com/olekukonko/tablewriter v0.0.0-20160923125401-bdcc175572fd
That version doesn't appear in the list endpoint.

You shouldn't need to cache the list or latest endpoints. Those are needed to find new versions of modules, but if go.mod is not missing any requirements (i.e., go mod tidy does not change it), then building packages within the module will not cause the go command to hit those endpoints.

Contributor

@jayconrod jayconrod commented Dec 2, 2019

Are there any specific ASCII characters that are NOT permitted to occur in module strings or version strings?

golang.org/x/mod/module.CheckPath documents the restrictions on module paths.

Additionally, in the proxy protocol and within the module cache, module paths are case encoded so that the cache can be stored on a case-insensitive file system without conflict. go help goproxy explains that.

Sorry the documentation is not in great shape right now. I'm working on a module reference specification that will include all this for Go 1.14 (#33637).

Contributor

@jayconrod jayconrod commented Dec 2, 2019

(I wonder if there is any magic flag in list or build that downloads required .zip files as well but skips actual builds)

go list all and go build -n all will do something very similar to go mod download. But note that they will apply build constraints (not following imports in excluded source files), and they won't catch test imports unless you ask for them specifically.

@robbat2 robbat2 commented Dec 3, 2019

Just to confirm what @mvdan and @hyangah have said:

Running go mod download without arguments within a module will download all the files a module needs to build. After that, it should be possible to build only from the module cache by setting GOPROXY=off.

Yes, I understood that much already, please see further below.

The package manager tooling will re-create the layout of the cache for all modules specifically declared to the package manager (generated by the package maintainer based on go.mod).

@williamh @robbat2 One thing I was a little unclear on: is there a restriction against using go mod download to populate the module cache? It sounds like you want to create the cache only using package manager infrastructure without running the go command.

Correct, the cache would be pre-populated by the package manager, in the correct layout.

To make a list of URLs for that, you could run go mod download manually once in an empty cache, then convert the file names to URLs. You only need .info, .mod, and .zip files. Something like this might work?

(omit example)

Yes, that example works, but still requires network connectivity. My request was for a trivial modification of go mod download that emits the (absolute or relative) URLs without actually doing the download at that phase.

Package maintainer steps:

  1. (human) Get new go package from upstream that they want to package, verifies the initial download if possible & meaningful (HTTPS, GPG etc)
  2. (tooling) Run maintainer-specific tooling get-ego-vender (or successor) that converts go.mod to package manager directives (URLs etc)
  3. (human) Maintainer makes further edits to the directives for gentoo-specific things (init scripts, documentation, config files).
  4. (tooling) maintainer-specific tooling captures & stores traditional checksums for all files (Gentoo Manifest files)
  5. (human/tooling) maintainer commits ebuild & Manifest.

User steps

# emerge somegopackage
.. input files are somegopackage-version.ebuild, Manifest
.. package manager fetches the declared URLs to local package manager cache
.. package manager starts network sandbox/container
.. package manager arranges the files from its cache into the expected goproxy cache layout (symlinks or hardlinks to the real locations)
.. package manager calls build process

(https://proxy.golang.org/ can also be replaced with any other server that implements the proxy protocol).

Related question here. I was reviewing the h1: hash mechanism, and it seems that it would be stable for the content & relative paths of files, but it would not capture any file metadata (mtime, permissions, ownership). As such the h1: hash should be stable between any server that implements the proxy protocol, but it's not clear if conventional checksums will be identical (this matters to the package manager).

@robbat2 robbat2 commented Dec 3, 2019

Are there any specific ASCII characters that are NOT permitted to occur in module strings or version strings?

Thanks.

golang.org/x/mod/module.CheckPath documents the restrictions on module paths.

Thanks, as a tidbit there: it tries to describe part of the rules:
the leading path element (up to the first slash, if any), by convention a domain name,
But then it has an incomplete test described for domain names: specifically . and - should not appear adjacent in any domain name, or at the start & end. One . also cannot be adjacent to another ..

Additionally, in the proxy protocol and within the module cache, module paths are case encoded so that the cache can be stored on a case-insensitive file system without conflict. go help goproxy explains that.

Yes, I caught that part already.

Contributor

@jayconrod jayconrod commented Dec 3, 2019

Yes, that example works, but still requires network connectivity. My request was for a trivial modification of go mod download that emits the (absolute or relative) URLs without actually doing the download at that phase.

We can't provide a general solution for this. If there are multiple sources in the GOPROXY list, the go command will attempt to download from each one, falling back to later sources if an earlier source returns a "not found" error (either 404 or 410). If one of the sources is direct, there's another process for locating the origin repository, cloning all or part of it, and extracting a zip file from the repository. That can't really be represented with a URL field in the JSON output.

Also, go mod download won't go out to the network at all for modules that are already in the cache. So we couldn't report anything for cached modules unless we also saved where they came from.
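To make the fallback behaviour concrete, here is a minimal sketch of walking a comma-separated GOPROXY list and moving on only after a 404 or 410. It is not the go command's implementation: resolve, fetchFunc and errDirect are hypothetical names, the HTTP request is abstracted behind a function so the logic can be exercised offline, and the hosts in main are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fetchFunc abstracts an HTTP GET and reports the response status code.
type fetchFunc func(url string) (status int, body []byte, err error)

// errDirect signals that the list reached "direct", where the source is a
// version-control checkout and no single download URL exists.
var errDirect = errors.New("reached direct: no URL can represent the source")

// resolve returns the first URL in the GOPROXY list that serves rel.
func resolve(goproxy, rel string, fetch fetchFunc) (string, error) {
	for _, p := range strings.Split(goproxy, ",") {
		p = strings.TrimSpace(p)
		switch p {
		case "":
			continue
		case "off":
			return "", errors.New("GOPROXY=off: downloads are disabled")
		case "direct":
			return "", errDirect
		}
		url := strings.TrimSuffix(p, "/") + "/" + rel
		status, _, err := fetch(url)
		if err != nil {
			return "", err
		}
		switch status {
		case 200:
			return url, nil
		case 404, 410:
			continue // "not found": fall back to the next source
		default:
			return "", fmt.Errorf("%s: unexpected status %d", url, status)
		}
	}
	return "", errors.New("no source in the list served " + rel)
}

func main() {
	// Fake fetcher: only the second placeholder proxy has the file.
	fake := func(url string) (int, []byte, error) {
		if strings.HasPrefix(url, "https://b.example/") {
			return 200, nil, nil
		}
		return 404, nil, nil
	}
	fmt.Println(resolve("https://a.example,https://b.example,direct",
		"golang.org/x/text/@v/v0.3.0.zip", fake))
}
```

The sketch shows why the answer depends on live responses: the URL that finally served a file is only known after earlier sources have been tried.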

Related question here. I was reviewing the h1: hash mechanism, and it seems that it would be stable for the content & relative paths of files, but it would not capture any file metadata (mtime, permissions, ownership). As such the h1: hash should be stable between any server that implements the proxy protocol, but it's not clear if conventional checksums will be identical (this matters to the package manager).

That's true: we only hash module contents, not the archives themselves. There's no promise that module zip files have stable hashes over time; for example file order or compression could change. We ignore metadata when creating and extracting zip files.

(IMO, it would have been better to hash the zip files themselves, but that ship has sailed).
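The point about hashing contents rather than archives can be sketched as follows. This mirrors my understanding of the h1: scheme implemented by golang.org/x/mod/sumdb/dirhash.Hash1 (a SHA-256 over a sorted list of per-file SHA-256 sums and names, base64-encoded); the function name hash1 and the in-memory file map are assumptions made for illustration, and the real function should be used for anything that matters.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"sort"
)

// hash1 hashes file names and contents only. Nothing about the zip
// container (mtime, permissions, compression, entry order) is an input.
func hash1(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // order in the archive does not matter

	summary := sha256.New()
	for _, name := range names {
		// One summary line per file: hex digest, two spaces, file name.
		fmt.Fprintf(summary, "%x  %s\n", sha256.Sum256(files[name]), name)
	}
	return "h1:" + base64.StdEncoding.EncodeToString(summary.Sum(nil))
}

func main() {
	files := map[string][]byte{
		"example.com/m@v1.0.0/go.mod":  []byte("module example.com/m\n"),
		"example.com/m@v1.0.0/main.go": []byte("package main\n"),
	}
	fmt.Println(hash1(files))
}
```

Two proxies serving different zip bytes for the same file tree would therefore agree on the h1: value while disagreeing on a conventional checksum of the archive.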

Thanks, as a tidbit there: it tries to describe part of the rules:
the leading path element (up to the first slash, if any), by convention a domain name,
But then it has an incomplete test described for domain names: specifically . and - should not appear adjacent in any domain name, or at the start & end. One . also cannot be adjacent to another ..

Maybe we can tighten that up without breaking anyone. It's technically possible to have a module path that isn't a domain name if it's only served from a proxy server (i.e., there's no need to look up the origin repository). There is code that checks that dots are not allowed at the beginning or end of a path element or together. I don't think a.- is rejected though.

Member

@bcmills bcmills commented Dec 4, 2019

I think it would be helpful to step up a level so that we can understand the higher-level problem that you want to solve.

Specifically, I would like to understand the need to download .zip files using the Gentoo package manager tooling, rather than downloading the .zip files using go mod download or source files using go mod vendor on the maintainer side of the workflow.

Downloading on the maintainer side of the workflow also seems like it would provide the required checksum stability: if the maintainer, rather than the user, downloads the files, then the maintainer can compute the package manager's checksum based on that specific instance of those files rather than relying on a specific Go proxy to serve a zipfile with exactly the same bytes.

Author

@williamh williamh commented Dec 7, 2019

@bcmills Consider this situation.

  • package foo-1.0 has 100 dependencies which are packaged up in a tarball that can be extracted to ${GOPATH}/pkg/mod/cache/download.
  • one of those dependencies has a security vulnerability, so foo-1.0.1 is released and the only change is that dependency is updated.
  • At this point, the maintainer would have to regenerate the tarball of dependencies, so now we have two big tarballs with only one dependency different between them.

Some see this as a big maintenance cost.

@robbat2 robbat2 commented Dec 8, 2019

@bcmills sure, looking at a higher level is good.

Problem Statement:
Make it easier to package Go-based packages in Gentoo, building from source (NOT using prebuilt binaries).

Constraints
All the constraints that are implicit in Gentoo Package manager requirements (that's a huge hand-waving set)

That's probably too high-level ;-).

In other languages, e.g. Perl, Python, C, the common route is to have all of the (build or runtime) dependencies installed on the host system, and then the package just uses them at runtime (interpreted or compiled against dynamic libraries) and/or build-time (compiled against static binaries).

For Go, the closest representation here is the build-time model. Go has the additional complexity that packages may use differing versions of dependencies.

This needs to include sharing Go module source between packages, and taking advantage of the Gentoo mirroring system (if two different Go-based packages in Gentoo both require the same version of a Go module, the files for that module should be shared).

For content not in any public goproxy, the Gentoo package maintainer needs to generate the module files (.zip, .mod, .info). I'll have to figure out that process in the meantime, to create tooling to make it easier.

I think I have enough figured out to do the rest on the Gentoo side here.

Is there easy tooling that can at least convert the full go.mod file other than this hack I have so far:

# go list -m -json all |jq -r 'if .Replace then .Replace|.GoMod else .GoMod end' |sed -r -e 's,.*pkg/mod/cache/download/,,g'
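For comparison, a minimal Go sketch of the same idea: read the stream of JSON objects printed by go list -m -json all and keep the module version that is actually used, preferring Replace when present. The type and function names are hypothetical, only a few of the output fields are decoded, and the result is printed as path/version pairs without the proxy case encoding.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// module holds the subset of `go list -m -json` fields used here.
type module struct {
	Path    string
	Version string
	Main    bool
	Replace *module
}

// selected decodes concatenated JSON objects and returns the modules
// that would have to be fetched from a proxy.
func selected(data []byte) ([]module, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	var out []module
	for {
		var m module
		err := dec.Decode(&m)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		if m.Main {
			continue // the main module is the package source itself
		}
		if m.Replace != nil {
			m = *m.Replace
		}
		if m.Version == "" {
			continue // replaced by a local directory: nothing to download
		}
		out = append(out, m)
	}
}

func main() {
	// Intended use: go list -m -json all | thisprogram
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	mods, err := selected(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range mods {
		fmt.Println(m.Path, m.Version)
	}
}
```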

Member

@bcmills bcmills commented Dec 11, 2019

@williamh, if the maintainer has to regenerate the list of dependencies to fetch anyway, it seems like the only significant difference is the need to re-download the resulting tarball. But that seems like a detail for the packaging system: it's also quite common for a bugfix in a project to change only one source file within the project, and won't the user have to re-download all of the source files for the project anyway?

Member

@bcmills bcmills commented Dec 11, 2019

@robbat2, if you intend to share module source between packages, then it seems like you fundamentally need one of a handful of approaches.

The key decision, I think, is whether you want to use the upstream go.mod file as the canonical list of dependencies, or to try to map the module dependencies into the native package manager's dependency resolution, or some combination of the two.

Specifically, you could consider:

  1. Preserve the target's module dependencies as package-manager dependencies, with a separate system package for each version of each module (so that different versions can coexist). Install the sources to a shared module cache somewhere as part of the build step, and list those sources as explicit dependencies in the package manager. (Perhaps you could install them to /home/root/go/pkg/mod? See also 34527.)

  2. Map the module requirements to system package-manager constraints, and install sources only for the selected versions of modules, perhaps in something resembling GOPATH/src. Then have either the package maintainer or the build system add replace directives to target the install directories. To make this approach work, you may need to add trivial go.mod files for module dependencies that currently lack them. (The least invasive approach is probably to list all of the installed modules — including the module containing the target package — in a replace block for some synthetic, empty module, and then do the go build step within that synthetic module.)

  3. Decide not to share source code after all, and use go mod vendor to instead produce a minimal set of source dependencies that you can scoop up in one big tarball to distribute for each module to be installed.
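Option 2 mentions a synthetic, empty module whose replace block points at installed sources. The sketch below only generates such a go.mod text; the function name, the module name example.invalid/synthetic and the install directories are all hypothetical, and whether additional require lines are needed for a given go version is left open.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// syntheticGoMod renders a go.mod whose replace block maps each module
// path to the directory where the package manager installed its source.
func syntheticGoMod(name string, dirs map[string]string) string {
	paths := make([]string, 0, len(dirs))
	for p := range dirs {
		paths = append(paths, p)
	}
	sort.Strings(paths) // stable output regardless of map order

	var b strings.Builder
	fmt.Fprintf(&b, "module %s\n\nreplace (\n", name)
	for _, p := range paths {
		fmt.Fprintf(&b, "\t%s => %s\n", p, dirs[p])
	}
	b.WriteString(")\n")
	return b.String()
}

func main() {
	// Hypothetical install locations chosen for illustration only.
	fmt.Print(syntheticGoMod("example.invalid/synthetic", map[string]string{
		"golang.org/x/text": "/usr/lib/go-src/golang.org/x/text",
		"golang.org/x/net":  "/usr/lib/go-src/golang.org/x/net",
	}))
}
```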

Author

@williamh williamh commented Jan 1, 2020

@bcmills @robbat2 First off, I hope your holidays went well.
I definitely do not think we should try to convert go.mod to package manager dependencies. I think that would create a very large number of packages that we would be maintaining only to have source code installed on everyone's system. There would also possibly have to be several versions of this source code installed on systems, so that would lead to a lot of disk space being occupied by source code.

I looked around and came up with a script which can make a tarball that can be unpacked and pointed to by having the package manager set GOPATH or whatever variable comes out of #33637 during the build. The difference between my script and your option 3 above is my script uses "go mod download" instead of "go mod vendor".

@robbat2 What is your status?

Member

@bcmills bcmills commented Jan 10, 2020

Thinking about this some more... you're going to need some way to inject the downloaded URL contents back into the go command anyway, and that probably looks like a GOPROXY implementation.

And if you're implementing a GOPROXY anyway, you could pretty easily add a “record mode”, that simply tracks all of the URLs requested from your GOPROXY and replaces those with the same paths at some other proxy, such as https://proxy.golang.org.
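A minimal sketch of such a "record mode", assuming a wrapper around any http.Handler that notes every requested path and can later print the same paths against another proxy. The type and method names are hypothetical, and main only simulates two requests instead of starting a server.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
	"sync"
)

// recorder remembers every path requested from the wrapped handler.
type recorder struct {
	next http.Handler
	mu   sync.Mutex
	seen map[string]bool
}

func newRecorder(next http.Handler) *recorder {
	return &recorder{next: next, seen: make(map[string]bool)}
}

// record stores one request path; duplicates collapse into one entry.
func (r *recorder) record(path string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.seen[path] = true
}

// ServeHTTP records the path, then lets the wrapped handler answer.
func (r *recorder) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	r.record(req.URL.Path)
	r.next.ServeHTTP(w, req)
}

// urls rewrites the recorded paths onto another proxy, sorted.
func (r *recorder) urls(upstream string) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]string, 0, len(r.seen))
	for p := range r.seen {
		out = append(out, strings.TrimSuffix(upstream, "/")+p)
	}
	sort.Strings(out)
	return out
}

func main() {
	rec := newRecorder(http.NotFoundHandler())
	// Simulated requests; a real run would serve rec over HTTP instead.
	rec.record("/golang.org/x/text/@v/v0.3.0.mod")
	rec.record("/golang.org/x/text/@v/v0.3.0.zip")
	for _, u := range rec.urls("https://proxy.golang.org") {
		fmt.Println(u)
	}
}
```

One possible wiring, untested here, is to serve rec with http.ListenAndServe on a local port and point GOPROXY at that address on the maintainer's machine while running the build once.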

@robbat2 robbat2 commented Feb 8, 2020

@bcmills Hi! We're making implementation progress on this, but have a few followup questions:

  • Re: go.sum lines for {package} {version}/go.mod WITHOUT a matching {package} {version} line.

    • If they are removed from the go.sum file, every package I tried seems to build successfully still.
    • Q: Is this a fluke or guaranteed to be true?
    • This drastically trims the number of files needed to fetch for building a package.
  • Given a .zip from the goproxy mirror ecosystem:

    • Q: Is the .mod file inside the .zip always going to match the .mod file on the goproxy mirror?
    • Q: Is it possible to byte-for-byte synthesize the matching .info and .mod mirror files (such that the h1: hash matches).
    • Q: How can the .info be generated?

Member

@bcmills bcmills commented Feb 10, 2020

  • If they are removed from the go.sum file, every package I tried seems to build successfully still.
  • Q: Is this a fluke or guaranteed to be true?

After go mod tidy, the go.sum file contains a /go.mod line for every module in the transitive dependencies of the main module.

You should find that every go build or go test command fetches all of those go.mod files. If those files do not have corresponding entries in the go.sum file, then the checksums will (by default) be fetched from sum.golang.org, so removing those entries should result in strictly more network traffic.

(However, if the module author has not run go mod tidy recently, it is possible that the go.sum file contains a large number of irrelevant entries in addition to the relevant ones.)

Member

@bcmills bcmills commented Feb 10, 2020

Q: Is the .mod file inside the .zip always going to match the .mod file on the goproxy mirror?

Yes, with the caveat that if the .zip file does not contain a go.mod file, the mirror may synthesize an empty or nearly-empty one.

Member

@bcmills bcmills commented Feb 10, 2020

Q: Is it possible to byte-for-byte synthesize the matching .info and .mod mirror files (such that the h1: hash matches).

Any invocation of the go command that downloads modules (such as go mod download or go list) should fetch a go.mod file with the same checksum.

The .info file is not needed for reproducible builds, and therefore is not checksummed. We do not guarantee that its contents will remain stable.

Member

@bcmills bcmills commented Feb 10, 2020

Q: How can the .info be generated?

Run go mod download -json and read the file indicated in the Info field.
(https://tip.golang.org/cmd/go/#hdr-Download_modules_to_local_cache)

@robbat2 robbat2 commented Feb 11, 2020

I'll clarify the point of my recent questions: I'm trying to identify the minimal possible set of files to pre-download for any given Go package, such that it can be built offline. Ideally down to ONE file per dependency package.

I'm hoping I can get away with this logic:

  • If the .zip is required, store the .zip and synthesize .mod & .info from it.
  • If the .zip is NOT required, store the .mod and synthesize .info from it.

Given this as an example:
golang/tour@eb9b2d8#diff-f949e2d81c8076ebbf8af38fcbb72c1f

The minimal set of files to provide to the offline environment:

file:///...goproxy/golang.org/x/crypto/@v/v0.0.0-20190308221718-c2843e01d9a2.mod
file:///...goproxy/golang.org/x/net/@v/v0.0.0-20190311183353-d8887717615a.zip
file:///...goproxy/golang.org/x/sys/@v/v0.0.0-20190215142949-d0b11bdaac8a.mod
file:///...goproxy/golang.org/x/text/@v/v0.3.0.mod
file:///...goproxy/golang.org/x/tools/@v/v0.0.0-20190312164927-7b79afddac43.zip

Generate the following files based on the above files:

# mod files where the zip was downloaded (unpack from zip or synthesize)
file:///...goproxy/golang.org/x/net/@v/v0.0.0-20190311183353-d8887717615a.mod
file:///...goproxy/golang.org/x/tools/@v/v0.0.0-20190312164927-7b79afddac43.mod

# plus all the info files
file:///...goproxy/golang.org/x/crypto/@v/v0.0.0-20190308221718-c2843e01d9a2.info
file:///...goproxy/golang.org/x/net/@v/v0.0.0-20190311183353-d8887717615a.info
file:///...goproxy/golang.org/x/sys/@v/v0.0.0-20190215142949-d0b11bdaac8a.info
file:///...goproxy/golang.org/x/text/@v/v0.3.0.info
file:///...goproxy/golang.org/x/tools/@v/v0.0.0-20190312164927-7b79afddac43.info
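The store-or-synthesize rule above can be sketched as a small planner. This is only an illustration of the proposed logic, not tooling from this thread: dep and plan are hypothetical names, whether a zip is needed is taken as an input, and the proxy case encoding is omitted because the example paths are all lowercase.

```go
package main

import "fmt"

// dep is one dependency and whether its source zip is needed to build.
type dep struct {
	Path    string
	Version string
	NeedZip bool
}

// plan splits the proxy-relative files into those to pre-download and
// those to generate locally from the downloaded ones.
func plan(deps []dep) (fetch, synth []string) {
	for _, d := range deps {
		base := d.Path + "/@v/" + d.Version
		if d.NeedZip {
			fetch = append(fetch, base+".zip")
			synth = append(synth, base+".mod") // taken from the zip
		} else {
			fetch = append(fetch, base+".mod")
		}
		synth = append(synth, base+".info") // always generated
	}
	return fetch, synth
}

func main() {
	fetch, synth := plan([]dep{
		{"golang.org/x/crypto", "v0.0.0-20190308221718-c2843e01d9a2", false},
		{"golang.org/x/net", "v0.0.0-20190311183353-d8887717615a", true},
		{"golang.org/x/sys", "v0.0.0-20190215142949-d0b11bdaac8a", false},
		{"golang.org/x/text", "v0.3.0", false},
		{"golang.org/x/tools", "v0.0.0-20190312164927-7b79afddac43", true},
	})
	fmt.Println("fetch:")
	for _, f := range fetch {
		fmt.Println("  " + f)
	}
	fmt.Println("generate:")
	for _, s := range synth {
		fmt.Println("  " + s)
	}
}
```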

@robbat2 robbat2 commented Feb 11, 2020

Here are the proposed Gentoo eclass and sample ebuilds for building Go packages, based on downloading at least .mod and .info, and possibly .zip as well.

eclass:
https://archives.gentoo.org/gentoo-dev/message/84bb8585311c5fd03781f873d662860a
example builds:

Member

@bcmills bcmills commented Feb 12, 2020

I'm hoping I can get away with this logic:

I think that should be sufficient, but of course it is possible that I have missed something.

@robbat2 robbat2 commented Feb 12, 2020

Q: Is the .mod file inside the .zip always going to match the .mod file on the goproxy mirror?

Yes, with the caveat that if the .zip file does not contain a go.mod file, the mirror may synthesize an empty or nearly-empty one.

Can you point to this existing Golang code? I tried to find it, but came up blank, wondering if it's in some other codebase.

Member

@bcmills bcmills commented Feb 12, 2020

func (r *codeRepo) GoMod(version string) (data []byte, err error) {
	if version != module.CanonicalVersion(version) {
		return nil, fmt.Errorf("version %s is not canonical", version)
	}
	if IsPseudoVersion(version) {
		// findDir ignores the metadata encoded in a pseudo-version,
		// only using the revision at the end.
		// Invoke Stat to verify the metadata explicitly so we don't return
		// a bogus file for an invalid version.
		_, err := r.Stat(version)
		if err != nil {
			return nil, err
		}
	}
	rev, dir, gomod, err := r.findDir(version)
	if err != nil {
		return nil, err
	}
	if gomod != nil {
		return gomod, nil
	}
	data, err = r.code.ReadFile(rev, path.Join(dir, "go.mod"), codehost.MaxGoMod)
	if err != nil {
		if os.IsNotExist(err) {
			return r.legacyGoMod(rev, dir), nil
		}
		return nil, err
	}
	return data, nil
}

func (r *codeRepo) legacyGoMod(rev, dir string) []byte {
	// We used to try to build a go.mod reflecting pre-existing
	// package management metadata files, but the conversion
	// was inherently imperfect (because those files don't have
	// exactly the same semantics as go.mod) and, when done
	// for dependencies in the middle of a build, impossible to
	// correct. So we stopped.
	// Return a fake go.mod that simply declares the module path.
	return []byte(fmt.Sprintf("module %s\n", modfile.AutoQuote(r.modPath)))
}
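On the consumer side, the same fallback could be approximated when starting from an already-built module zip. The sketch below assumes the zip stores files under an unescaped module@version/ prefix and that the module path needs no quoting (the real code uses modfile.AutoQuote); goModFromZip and buildZip are hypothetical helpers, and this is not the logic the go command uses to locate go.mod inside a repository.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// goModFromZip returns the go.mod stored in a module zip, or a minimal
// synthesized one ("module <path>\n") when the zip does not contain it.
func goModFromZip(data []byte, modPath, version string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	want := modPath + "@" + version + "/go.mod"
	for _, f := range zr.File {
		if f.Name != want {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	// Assumption: the path needs no quoting in go.mod syntax.
	return []byte("module " + modPath + "\n"), nil
}

// buildZip creates an in-memory zip for the demonstration below.
func buildZip(files map[string]string) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range files {
		w, err := zw.Create(name)
		if err != nil {
			panic(err)
		}
		if _, err := io.WriteString(w, content); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	legacy := buildZip(map[string]string{
		"example.com/legacy@v0.1.0/main.go": "package main\n",
	})
	gomod, err := goModFromZip(legacy, "example.com/legacy", "v0.1.0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", gomod)
}
```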

Member

@bcmills bcmills commented Feb 12, 2020

Note that the logic for locating the go.mod file is not trivial. (You're probably better off taking the factor-of-two increase in the number of files and letting the go command figure it out.)
