Availability vs. correctness in list API #1532

Closed
linzhp opened this issue Jan 28, 2020 · 17 comments · Fixed by #1717
Comments

@linzhp
Contributor

linzhp commented Jan 28, 2020

Is your feature request related to a problem? Please describe.
When Athens serves the list API, it issues requests to both storage and VCS and merges the two lists. It fails if there is a storage error. I understand this is to guarantee that tags deleted from VCS but cached locally can still be returned.

However, this approach compromises the availability of the Athens server: if either VCS or storage is down, the Athens list API is down too.

Describe the solution you'd like
Specifically for the list API, I think availability is more important than correctness. Given that the Go toolchain relies on the list API to get the latest version of a module, we can think about two scenarios:

  • storage is down, but VCS is up. If the deleted but cached tag is the latest, then people won't be able to use go get -u or go get <something>@latest to get that version, but they can still use go get <something>@<cached tag>. If the deleted tag is not the latest, no impact.
  • storage is up, but VCS is down. People will only get the latest from storage instead of from VCS, but Go commands will not break with 500 errors.

Describe alternatives you've considered
With the current Athens implementation, Go commands break with 500 errors in the above scenarios. I agree that neither situation is ideal, but when that happens, instead of leaving people unable to run go mod or go get at all, I would prefer limiting the ability to get the latest version.

@marwan-at-work @arschles thoughts?

@marwan-at-work
Contributor

marwan-at-work commented Jan 28, 2020

@linzhp

storage is down, but VCS is up. If the deleted but cached tag is the latest, then people won't be able to use go get -u or go get <something>@latest to get that version, but they can still use go get <something>@<cached tag>.

If storage is down, pretty much nothing will work. Athens always serves from storage, never directly from go mod download. Therefore, go get <something>@<cached tag> will fail.

storage is up, but VCS is down. People will only get the latest from storage instead of from VCS, but Go commands will not break with 500 errors.

  1. If VCS is down, and you have a resolved go.mod file, then everything should still work. This is because the list endpoint never gets hit.

  2. If VCS is up, but the repo got deleted, we actually continue to work by just "listing" the versions from the storage.

  3. If VCS is down, and you don't have a resolved go.mod file, or maybe you're doing a go get -u etc., then this is the only scenario where things break.

If you're okay with all of the above, I think the only thing we should focus on is issue number 3 right above^.

For this, Athens chose "consistency" over "availability".
Say you have module x with 2 versions in storage: v0.1.0 and v0.2.0
And say module x has 2 extra versions in VCS: v0.3.0 and v0.4.0

With availability > consistency:

  • VCS is down for a period of time, users will only get v0.1.0 and v0.2.0

  • But if VCS is up, then users will get v0.1.0, v0.2.0, v0.3.0, and v0.4.0

  • Pro: things stay working

  • Con: users can get confused as to why they get inconsistent results.

  • Con: users might never know something is wrong in their system

With consistency > availability:

  • VCS is down: users won't get anything and they are aware of something being wrong in the system.

  • VCS is up: users will get v0.1.0, v0.2.0, v0.3.0, and v0.4.0

  • Pro: Athens is always deterministic

  • Pro: when users are confused, it's a good thing, because they should dig into what is wrong at that moment.

  • Con: Athens (well, only the list part of Athens) is fully reliant on VCS being up all the time.
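To make the trade-off above concrete, here is a minimal Go sketch of the two policies applied to the module x example. It is not Athens' actual code: the `lister` type, `errSourceDown`, `mergeVersions`, and `listVersions` are hypothetical names, and plain lexical sorting is assumed to be good enough for these sample versions.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errSourceDown is a hypothetical error used to simulate an outage.
var errSourceDown = errors.New("source unavailable")

// lister is a hypothetical stand-in for either storage or the upstream VCS.
type lister func(module string) ([]string, error)

// mergeVersions returns the sorted, de-duplicated union of two version lists.
func mergeVersions(a, b []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, v := range append(append([]string{}, a...), b...) {
		if !seen[v] {
			seen[v] = true
			out = append(out, v)
		}
	}
	sort.Strings(out) // lexical order only; fine for the sample versions here
	return out
}

// listVersions queries both sources.
// bestEffort=false (consistency > availability): any error fails the request.
// bestEffort=true (availability > consistency): fail only if both sources fail.
func listVersions(module string, storage, vcs lister, bestEffort bool) ([]string, error) {
	sv, sErr := storage(module)
	vv, vErr := vcs(module)
	if sErr != nil && vErr != nil {
		return nil, fmt.Errorf("storage: %v; vcs: %v", sErr, vErr)
	}
	if !bestEffort {
		if sErr != nil {
			return nil, sErr
		}
		if vErr != nil {
			return nil, vErr
		}
	}
	// A failed source contributes a nil slice, so the merge still works.
	return mergeVersions(sv, vv), nil
}

func main() {
	storage := func(string) ([]string, error) { return []string{"v0.1.0", "v0.2.0"}, nil }
	vcsDown := func(string) ([]string, error) { return nil, errSourceDown }

	_, err := listVersions("x", storage, vcsDown, false)
	fmt.Println("consistency > availability:", err)

	got, _ := listVersions("x", storage, vcsDown, true)
	fmt.Println("availability > consistency:", got)
}
```

With VCS down, the strict variant surfaces the outage as an error, while the best-effort variant silently returns only what storage knows about, which is exactly the pro/con pair listed above.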

Solutions:

  1. Keep things the way they are
  2. Switch to the new and proposed behavior (availability > consistency)
  3. Make behavior configurable

For 2, is there a good case for this? Is your VCS down so frequently that you want to just ignore it a lot?

For 3, Athens already has a lot of configuration options, and it makes Athens a bit difficult and scary to approach because users see a lot of knobs and are not sure what to do with them. We do our best documenting the config file but it's pretty big and scary to a lot of people I would imagine.

@linzhp I'm curious to hear your thoughts on 2 and 3. I don't feel strongly enough either way 👍

@twexler
Collaborator

twexler commented Jan 29, 2020

It sounds like this could be solved by offering a configuration option that allows the operator to determine which they prefer. It will increase code complexity a tad bit, but allow the operator to specify the behavior they prefer.

Thoughts?

@marwan-at-work
Contributor

@twexler re-iterating my point 3 above:

For 3, Athens already has a lot of configuration options, and it makes Athens a bit difficult and scary to approach because users see a lot of knobs and are not sure what to do with them. We do our best documenting the config file but it's pretty big and scary to a lot of people I would imagine.

@marwan-at-work
Contributor

I hesitate towards adding another configuration since our config file is already gigantic. But if people think it's definitely nice to have, then I'm okay with that.

Also the original issue makes an assumption that if Storage is down then Athens would continue to work which is not true.

Therefore, I'd like to get a second feel on whether the current behavior still makes sense or the configuration is necessary.

Thanks ✌️

@twexler
Collaborator

twexler commented Jan 29, 2020

Oops, sorry @marwan-at-work. I was speed reading while having my first coffee of the day and missed that.

I can see both sides of the argument from my previous experience (having dealt with VCS outages partially breaking my builds and having internal caches hide breakage from me). I think there may be a reasonable middle ground.

I see a few approaches to realize that middle ground:

  1. Recommend GOPROXY be set to Athens' URL and direct (e.g. GOPROXY=https://athens.internal,direct), which would surface errors with upstream VCS endpoints during a VCS outage but also allow users to resolve modules when Athens can't reach its storage.
  2. Recommend that users run two Athens instances, one pointing at remote storage that proxies to an instance that has sufficient local storage which proxies upstream (seems bad?)
  3. Implement @linzhp's recommendation and ignore storage failures (basically the redirect download mode on storage failures)
  4. Let the user decide via configuration

@linzhp
Contributor Author

linzhp commented Jan 31, 2020

I realize my original description of "storage down" is very confusing, and I didn't provide our context. My apologies.

We experienced a partially down storage earlier this week. Most of the storage APIs worked, except storage.List. So even when Athens got a pretty good list of versions from VCS, users still got 500 errors. Let me rephrase the two scenarios for the list API:

  1. storage list API is down, but VCS list API is up. If the deleted but cached tag is the latest, then people won't be able to use go get -u or go get <something>@latest to get that version, but they can still use go get <something>@<cached tag>. If the deleted tag is not the latest, no impact.
  2. storage list API is up, but VCS list API is down. People will only get the latest from storage instead of from VCS, but Go commands will not break with 500 errors.

Scenario 1 can be a bit confusing for users who run go get -u, as the version may be downgraded, but they can always revert to the previous working version.
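The downgrade in scenario 1 can be illustrated with a small Go sketch. It assumes the client simply picks the highest version from whatever list it receives, and it only understands plain vMAJOR.MINOR.PATCH tags; `parse`, `less`, and `latest` are hypothetical helpers, and the version numbers are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a plain "vMAJOR.MINOR.PATCH" tag into three numbers.
// Pre-release and build suffixes are deliberately not handled in this sketch.
func parse(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unsupported version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unsupported version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// less reports whether a sorts before b, comparing major, minor, then patch.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// latest returns the highest parsable version, or "" if there is none.
func latest(versions []string) string {
	best, bestNum := "", [3]int{-1, -1, -1}
	for _, v := range versions {
		n, err := parse(v)
		if err != nil {
			continue // skip tags this sketch cannot parse
		}
		if less(bestNum, n) {
			best, bestNum = v, n
		}
	}
	return best
}

func main() {
	// Hypothetical data: v0.3.0 was deleted upstream but is still cached.
	vcsOnly := []string{"v0.1.0", "v0.2.0"}
	merged := []string{"v0.1.0", "v0.2.0", "v0.3.0"}

	fmt.Println("latest with storage list up:  ", latest(merged))
	fmt.Println("latest with storage list down:", latest(vcsOnly))
}
```

The cached tag only matters when it is the highest version; if the deleted tag sits below the latest, both lists resolve to the same result.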

Say you have module x with 2 versions in storage: v0.1.0 and v0.2.0
And say module x has 2 extra versions in VCS: v0.3.0 and v0.4.0

Scenario 2 means users cannot get the latest version using go get -u. But if the VCS list is down, they may not know there are newer versions anyway. If some tool uses go list -versions to get a list of versions without downloading them into Athens' cache, the result will be inconsistent, as the tool may have seen v0.3.0 and v0.4.0 previously but not see them during the downtime. I am not aware of any of our tools calling go list -versions without downloading the latest. go get -u and go mod tidy won't see any inconsistencies.

In either scenario, whether users see a 500 error or a 200, they are not able to use go get -u to upgrade a package. However, the big difference comes when users run go mod tidy after they import x in a new module that previously did not depend on x. If consistency > availability, then go mod tidy fails; if availability > consistency, it succeeds.

users might never know something is wrong in their system

At bigger companies like Uber, we have centralized developer experience teams to discover and fix issues in dev infra. We can easily monitor and alert on these types of events from the Athens logs. These types of failures are often outside the scope of a developer experience team. While working with other teams to fix those issues, we can keep the business running as much as possible in the meantime with the availability > consistency approach.

when users are confused, it's a good thing, because they should dig into what is wrong at that moment.

Dev infra is too intimidating for most users to dig into. Instead, they paged us in the middle of the night, and we had to patch Athens so it ignores storage.List errors...

@praseodym

Currently Athens is a little hard to use in offline environments (such as the one described in the 'pre-filling disk storage' docs scenario) because there's no way to find out what module versions Athens has available, aside from looking at the actual files in storage. A list API that works without connecting to the upstream VCS would be a great help in that situation.

Furthermore, in this case having a storage list API isn't a discussion of availability vs. correctness, because storage is the only truth Athens will ever see. I do agree that this should be a new configuration option to be able to ensure correctness in online environments.

@arschles
Member

because there's no way to find out what module versions Athens has available, aside from looking at the actual files in storage

@praseodym would the /catalog endpoint work for you? (example)

@arschles
Member

I think this is a really important issue along with the better offline support in #1506 and #1532 (comment) above. I also think these two issues are related.

I don't think an Athens running with default configuration should work if there's a storage outage, even if some of the APIs still work. That behavior would violate the primary and original goal of deterministic builds. I'm always open to being convinced otherwise though 😄.

At the moment, we've applied this determinism goal to projects with a complete go.mod file. By design, go get -u and go get <module>@latest calls are not deterministic, so I think that Athens can & should provide some simple configuration to manage the nondeterminism.

@marwan-at-work I think we can accomplish the additional configuration with a single new variable in config.toml or environment variable, which I'll call MODE in this example. MODE can have these values:

  • MODE=online means that Athens will behave in the same way it does now, across all API endpoints. This is the default value
  • MODE=offline means that Athens will never try to fetch any data from the internet. Users of this mode will need to pre-fill storage to make it work. I think this mode fulfills part of Use Athens as repository in private (offline) network #1506 and does ensure correctness in offline environments
  • MODE=offline-list means that Athens will only fetch version lists from storage. This would enable users to fill storage with go get module@new-version, but all nondeterminism would be taken out of /list and @latest
  • MODE=best-effort-list means that Athens will try to fetch from VCS and storage, succeeding if it gets a response from at least one source, and merging if it gets a response from both sources. In the latter case, the merge algorithm would be the same as the one we currently use
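As a rough illustration of how the four proposed MODE values could drive the list endpoint, here is a Go sketch. MODE itself is only a placeholder name from this comment, and the `Mode` constants, `listPolicy` struct, and `policyFor` function are hypothetical; none of this exists in Athens as written here.

```go
package main

import "fmt"

// Mode mirrors the proposed MODE values; the type and constants are hypothetical.
type Mode string

const (
	Online         Mode = "online"
	Offline        Mode = "offline"
	OfflineList    Mode = "offline-list"
	BestEffortList Mode = "best-effort-list"
)

// listPolicy describes which sources a list request consults and whether
// every consulted source has to succeed for the request to succeed.
type listPolicy struct {
	UseStorage bool
	UseVCS     bool
	RequireAll bool
}

// policyFor maps a mode to its list behavior. An empty value falls back to
// online, which the proposal names as the default.
func policyFor(m Mode) (listPolicy, error) {
	switch m {
	case Online, "":
		// Current behavior: query both and fail if either fails.
		return listPolicy{UseStorage: true, UseVCS: true, RequireAll: true}, nil
	case Offline, OfflineList:
		// Both modes list from storage only; they differ in whether
		// downloads may still reach the internet, which is not modeled here.
		return listPolicy{UseStorage: true, RequireAll: true}, nil
	case BestEffortList:
		// Query both, succeed if at least one source responds.
		return listPolicy{UseStorage: true, UseVCS: true}, nil
	default:
		return listPolicy{}, fmt.Errorf("unknown MODE %q", m)
	}
}

func main() {
	for _, m := range []Mode{Online, Offline, OfflineList, BestEffortList} {
		p, err := policyFor(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-17s storage=%t vcs=%t requireAll=%t\n", m, p.UseStorage, p.UseVCS, p.RequireAll)
	}
}
```

Keeping the policy as data rather than branching inside the handler would leave the merge algorithm itself untouched, in line with the note that best-effort-list reuses the current merge.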

@linzhp @marwan-at-work @praseodym I'm trying to solve all the problems at once here, which comes with the danger of solving none of them. Tell me what you think?

@linzhp
Contributor Author

linzhp commented Feb 14, 2020

I like the idea of MODE=offline-list. It's an alternative solution to #1511. And the MODE=best-effort-list is basically what I need for this thread.

@arschles
Member

glad to hear it @linzhp. I'll try to solicit some more opinions and hopefully we can come up with a good solution and build it

@arschles
Member

I tweeted a request for people to come comment here

@arschles
Member

also note that the Go team settled on a similar behavior for their public proxy as what MODE=offline or MODE=offline-list (depending on the state of the upstream repo and connectedness to the internet) would do.

golang/go#37079 (comment)

@nathanhack

I'm late to this topic and my use case may not help. But I'm currently using Athens in an offline network and it is painful (of course it's better than the alternative). I ended up setting up an FTP server just so people could look up which versions were available and manually update their go.mod files.

Originally I thought that setting DownloadMode = "none" and GoBinaryEnvVars = ["GOPROXY=off"] would have been enough to have Athens use its own storage as its only source of truth.

To be fair, I don't know the details behind the magic that makes go mod tidy work. But from a go mod tidy user's point of view, I don't know that I care. If I've set the proxy, then I would assume it would do its best.

That said, here's my two cents to muddy the water. It seems to me that using the already existing configs with additional DownloadMode values could handle these cases, maybe?

MODE=online

  • DownloadMode = "sync"
  • GoBinaryEnvVars = ["GOPROXY=direct"]

MODE=offline

  • DownloadMode = "none"
  • GoBinaryEnvVars = ["GOPROXY=off"]

MODE=offline-list

  • DownloadMode = "sync-nonlist"
  • GoBinaryEnvVars = ["GOPROXY=direct"]

MODE=best-effort-list

  • DownloadMode = "sync-best-effort"
  • GoBinaryEnvVars = ["GOPROXY=direct"]

@nathanhack

But to be clear, I don't really care what the implementation is as long as go mod tidy works without the real internet ;-)

@arschles
Member

arschles commented Apr 7, 2020

Thanks @nathanhack! We have two major features to do - external storage and offline mode. We just finished the former so we're going to start tackling this as soon as we can.

@NateDreier

Just checking in on the MODE=offline, looking forward to this feature :)


7 participants