
Proxy Cache Fill From Olympus #571

Closed
marwan-at-work opened this issue Aug 26, 2018 · 19 comments
Labels
on hold Issues and pull requests that shouldn't be closed yet, but shouldn't be worked on yet either

Comments

marwan-at-work commented Aug 26, 2018

Issue #495 made it so that we can redirect to Olympus. We should also have the option to save an Olympus module into the proxy storage backend.

  1. This should be configurable.
  2. The design should weigh synchronous vs. asynchronous filling, balancing clean code against performance.
  3. This should be done either through a backend passed to the middleware, or through an Olympus implementation of download.Protocol.
marpio commented Aug 26, 2018

Ref #474

fedepaol commented Aug 28, 2018

Currently, on a storage miss the handlers fetch from the VCS synchronously and fill the storage.

Let's put down some possible approaches:

We let the middleware check the storage and do the redirect in case of miss

In case of a hit we let the handlers serve the data from the storage.
I don't like this approach because the logic is split between the middleware and the handlers.
It relies on the handlers always finding the storage filled, because the middleware checked it for them.

We let the handlers check the storage, redirect and trigger the asynchronous cache filler

This has the advantage of keeping all the logic in the handlers, which is more readable, and it's probably faster than the worst-case scenario of a double miss (if the Olympus storage is empty as well, the whole operation would take twice as long: filling the storage and then serving the data).

We let the handlers check the storage, download & fill the cache from olympus synchronously

Again, all the code is in the handler. We'd need another download protocol (different from the one that serves from Olympus), but the worst-case scenario would probably be slower. We should take into account that filling the cache from the proxy would generally be faster, since the proxy is intended to run in proximity. Another advantage of this solution is that Olympus acts as the source of truth even for the first fetch.

I'm torn between the last two. In any case, the handlers will need to be specialized between the proxy and Olympus and/or the public/private flag.

Would love to hear your thoughts before I start messing with the code.

ghost commented Aug 28, 2018

I'm torn between the same two. My first instinct is to go for the second option:

We let the handlers check the storage, redirect and trigger the asynchronous cache filler

But I also really like the idea of Olympus being the source of truth from the first fetch.

marwan-at-work commented Aug 29, 2018

Haven't fully read the response yet, but my initial thought is this:

  1. Have the download.Protocol be able to cache-fill just like before. It currently does it via go get, but it could also trivially do it from Olympus.
  2. Have the handlers check whether err is KindX (KindNotFound or KindRedirect), and then redirect to Olympus.

My currently open PR (573) makes a few changes so that we can pass a "Stasher" interface to the download.Protocol, which might make it even easier to have an Olympus Stasher (a one-method implementation) instead of an Olympus download.Protocol (a five-method implementation).

Will read more about your suggestions and update on my thoughts when I get the time :)

@michalpristas
I'm on the fence about whether this is a dup of #474.
I would go with 3 until we find out that it doesn't perform well; then we know how to fix it immediately by switching to 2.

@fedepaol
Ok then, I'll go for 3 and try to leverage @marwan-at-work's PR for the Stasher interface.

marpio commented Aug 29, 2018

@fedepaol sorry for the late response.
I also think we should go with the third option. If performance turns out to be a problem, we can switch to option two or apply an optimization we already considered: writing to storage and to the response at the same time.
Also, I think we don't have to implement a whole new download protocol, just swap out the module.Fetcher in the existing one:

mf, err := module.NewGoGetFetcher(goBin, fs)

We could use this:
https://github.com/gomods/athens/blob/master/pkg/module/download.go
Let me know what you think.

@fedepaol
Thanks @marpio for your opinion! I will try to collect all the pieces and start working.

@marwan-at-work
@marpio If I understood you correctly, I don't think we can swap the GoGetFetcher out for an Olympus one, because then how would the download.Protocol get private modules? See my comment for a suggestion where the Stasher can stash from an upstream (Olympus) or locally (go get).

marpio commented Aug 29, 2018

@marwan-at-work my idea was to make it possible to provide a module.Fetcher to the download protocol, so we could pass the GoGetFetcher to fetch private modules and, say, an OlympusFetcher for public mods.

@fedepaol
The download protocol would need to know whether the module is public or private and have both fetchers / stasher + fetcher.
@marwan-at-work you mentioned a redirect, but given that we fetch from Olympus it's not needed, right?

marpio commented Aug 29, 2018

@fedepaol I don't intend to put both module.Fetchers into the same protocol. More like letting the caller of goget.New (where goget wouldn't be the right name anymore) pass an appropriate one. So something like this:
func New(mf module.Fetcher) (download.Protocol, error)
so we can do
gogetProtocol := New(module.NewGoGetFetcher(goBin, fs))
and
olympusProtocol := New(module.NewOlympusFetcher(goBin, fs))
The rest of the code inside pkg/download/goget/goget.go would stay the same.
Implementing an Olympus Stasher doesn't seem consistent to me; we don't have a GoGet Stasher. Why would we treat getting the .mod, .zip and .info bits from Olympus differently than getting them from disk using go mod download? Olympus is just another source, just like the files on disk.

@michalpristas
I would like to avoid using go to download bits from Olympus (in the proxy). It is not an efficient way to download modules; if we could use the API of each VCS, we would.

We have both of them under our control, and Olympus provides a deterministic API, so we can parallelize downloads of the 3 pieces without any additional go magic. Let's treat it as a network disk, which it is.
We could treat the zip as a stream and pipe it to disk, or maybe even directly to the DB, to lower memory consumption. But this is an implementation detail.

marpio commented Aug 30, 2018

I agree 💯 percent with you @michalpristas; we shouldn't use go to download bits from Olympus. This is also not what I had in mind.
If you look at https://github.com/gomods/athens/blob/master/pkg/download/goget/goget.go, the only place where it interacts with the go binary is here:

ref, err := gg.fetcher.Fetch(mod, ver)
if err != nil {
    return nil, errors.E(op, err)
}
v, err := ref.Read()
if err != nil {
    return nil, errors.E(op, err)
}
return v, nil

It does it through the module.Fetcher.
The implementation of the module.Fetcher used is here:
https://github.com/gomods/athens/blob/master/pkg/module/go_get_fetcher.go
This is the code which executes the go binary and provides module.Reference to the files on disk.
What if we provided another module.Fetcher, say pkg/module/olympus_fetcher.go, which would return a module.olympusRef instead of a module.diskRef (https://github.com/gomods/athens/blob/master/pkg/module/disk_ref.go)?
The implementation of the Read() function of module.olympusRef would return a module.Version using https://github.com/gomods/athens/blob/master/pkg/module/download.go

@michalpristas
Ok, I get it now: an Olympus implementation of the Ref interface.
This actually might work.
Ref says Read reads modules into memory, so this is a limitation and avoids using streams for saving Olympus -> DB. Which might not even work.
But as a first step, I think it's good enough: an olympusGoget passed into the filtering middleware which currently redirects to Olympus in case of Include.
But with the stasher in place we will need to figure out how to share workers between the two; Olympus and VCS pulls will be equally resource heavy.

marpio commented Aug 30, 2018

Ref says Read reads modules into memory, so this is a limitation and avoids using streams for saving Olympus -> DB. Which might not even work.

The only thing that Ref.Read() loads into memory is the .mod and .info. The .zip is passed as a stream from disk/Olympus:

sourceFile, err := d.fs.Open(filepath.Join(packagePath, fmt.Sprintf("%s.zip", d.version)))
if err != nil {
    return nil, errors.E(op, err)
}
// note: don't close sourceFile here so that the caller can read directly from disk.
//
// if we close, then the caller will panic, and the alternative to make this work is
// that we read into memory and return an io.ReadCloser that reads out of memory
ver.Zip = &zipReadCloser{sourceFile, d}

@michalpristas
Discussed on Slack.
We can pass resp.Body as ver.Zip.
Watch out for Close.
Since work is limited by memory, it won't allow handling so many jobs that sockets get exhausted.
Once we improve on memory, we'll need to keep an eye on sockets.

fedepaol commented Sep 2, 2018

Discussed on Slack; leaving this open in order to focus the work on the proxy.

@ghost added the on hold label Sep 2, 2018
@arschles
As of #772, we're not going to try to build a registry for the time being, so I'm closing this issue.
