Questions about mirroring + caching #504

Open
jonashaag opened this issue Mar 15, 2022 · 12 comments

@jonashaag

If I understand the mirroring code correctly, downloading and caching work as follows (assuming you're using a GCP backend):

  • If file not in pkgstore, download from upstream (eg. conda-forge) and place in pkgstore (eg. GCS).
  • Send pre-signed redirect URL to client.

Special case repodata.json:

  • Invalidate the placement in pkgstore after some time.
  • Try to read repodata.json.gz from pkgstore and stream that to client.
  • If no .gz exists, stream repodata.json from pkgstore to client.
  • Never use redirects, always stream from Quetz instance.
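
The "stream from the Quetz instance" step boils down to chunked reads from the pkgstore file object; a minimal sketch of that pattern (the helper name and chunk size are illustrative, not Quetz's actual API):

```python
import io

def iter_chunks(fileobj, chunk_size=64 * 1024):
    """Yield successive chunks from a pkgstore file object until EOF."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Usage: stream a (fake) repodata.json body chunk by chunk.
body = io.BytesIO(b'{"packages": {}}')
streamed = b"".join(iter_chunks(body, chunk_size=4))
```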

Questions:

  • Where are the repodata.json.gz files coming from?
  • Why don't we use redirects from repodata.json{,.gz}?
  • Does it make sense to implement another Quetz-local cache for some packages? E.g., for very small and/or highly frequented packages it could be quicker to stream them directly from the Quetz instance.
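
To make the third question concrete: such a Quetz-local cache could be as simple as a byte-bounded LRU held in front of the remote pkgstore. A hypothetical sketch (the class name and size limits are made up for illustration; this is not Quetz code):

```python
from collections import OrderedDict

class SmallFileCache:
    """Byte-bounded LRU cache for small, hot files (e.g. tiny packages)."""

    def __init__(self, max_total_bytes=64 * 1024 * 1024, max_file_bytes=512 * 1024):
        self.max_total_bytes = max_total_bytes
        self.max_file_bytes = max_file_bytes
        self.total = 0
        self._data = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return None

    def put(self, key, blob):
        if len(blob) > self.max_file_bytes:
            return  # too big: let the pkgstore redirect handle it
        if key in self._data:
            self.total -= len(self._data.pop(key))
        self._data[key] = blob
        self.total += len(blob)
        while self.total > self.max_total_bytes:
            _, evicted = self._data.popitem(last=False)  # evict LRU entry
            self.total -= len(evicted)
```

On a cache hit Quetz would stream the bytes directly; on a miss it would fall back to the usual pre-signed redirect.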
@wolfv

wolfv commented Mar 15, 2022

There are two mirroring modes: full mirror and proxy.

In proxy mode, we forward requests for repodata.json to the upstream channel, but we download the requested packages and cache them on the server.

quetz/quetz/main.py

Lines 1639 to 1648 in 0c91da2

```python
if channel.mirror_channel_url and channel.mirror_mode == "proxy":
    repository = RemoteRepository(channel.mirror_channel_url, session)
    if not pkgstore.file_exists(channel.name, path):
        download_remote_file(repository, pkgstore, channel.name, path)
    elif path.endswith(".json"):
        # repodata.json and current_repodata.json are cached locally
        # for channel.ttl seconds
        _, fmtime, _ = pkgstore.get_filemetadata(channel.name, path)
        if time.time() - fmtime >= channel.ttl:
            download_remote_file(repository, pkgstore, channel.name, path)
```

Streaming the raw repodata.json is almost always a bad idea; it's much bigger than the gzip-compressed version.
We pre-compute the .gz file but you can also configure nginx to do it on the fly.
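
For reference, a hedged sketch of what the nginx side of that might look like: `gzip_static` serves a pre-computed `.gz` sitting next to the file, while `gzip` compresses on the fly. The location path is an assumption about your deployment, and `gzip_static` requires nginx to be built with `ngx_http_gzip_static_module`:

```nginx
location /channels/ {
    # Prefer a pre-computed repodata.json.gz when the client accepts gzip
    gzip_static on;
    # Otherwise compress JSON responses on the fly
    gzip on;
    gzip_types application/json;
}
```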

@wolfv

wolfv commented Mar 15, 2022

This is where the .gz and .bz2 files are created:

quetz/quetz/utils.py

Lines 35 to 51 in 0c91da2

```python
def add_static_file(contents, channel_name, subdir, fname, pkgstore, file_index=None):
    if type(contents) is not bytes:
        raw_file = contents.encode("utf-8")
    else:
        raw_file = contents
    bz2_file = bz2.compress(raw_file)
    gzp_file = gzip.compress(raw_file)
    path = f"{subdir}/{fname}" if subdir else fname
    pkgstore.add_file(bz2_file, channel_name, f"{path}.bz2")
    pkgstore.add_file(gzp_file, channel_name, f"{path}.gz")
    pkgstore.add_file(raw_file, channel_name, f"{path}")
    if file_index:
        add_entry_for_index(file_index, subdir, fname, raw_file)
        add_entry_for_index(file_index, subdir, f"{fname}.bz2", bz2_file)
        add_entry_for_index(file_index, subdir, f"{fname}.gz", gzp_file)
```
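
The motivation for pre-computing these variants is the compression ratio on repodata-style JSON, which can be checked with the same stdlib calls the function uses (the payload below is synthetic, not real conda-forge repodata):

```python
import bz2
import gzip
import json

# Synthetic repodata-like payload: repetitive JSON compresses very well.
packages = {
    f"pkg-{i}-1.0-py39_0.tar.bz2": {
        "name": f"pkg-{i}",
        "version": "1.0",
        "build": "py39_0",
        "depends": ["python >=3.9"],
    }
    for i in range(500)
}
raw = json.dumps({"packages": packages}).encode("utf-8")

gz = gzip.compress(raw)
bz = bz2.compress(raw)
# Both compressed variants are far smaller than the raw JSON.
```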

@jonashaag

Thanks!

cache them on the server.

server = pkgstore? What do you think about an additional layer of caching/storage for remote pkgstores like S3?

This is where the .gz and .bz2 files are created:

It doesn't seem to be used for repodata.json though.
(screenshot from 2022-03-15 omitted)

@wolfv

wolfv commented Mar 15, 2022

What do you think about an additional layer of caching/storage for remote pkgstores like S3?

I don't see the point. Do you think for very small files that the redirect is the bottleneck?

@jonashaag

Do you think for very small files that the redirect is the bottleneck?

Yes. Will do some benchmarking on this.

@wolfv

wolfv commented Mar 15, 2022

Just FYI, the pre-signed URL is usually computed entirely on the Quetz side (it's typically a cryptographic signature over the request metadata, made with the storage credentials).
So no extra round-trip is needed.
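
To illustrate why no round-trip is needed: pre-signing is just an HMAC computed locally from the request metadata and the secret key. A sketch in the style of the classic S3 (v2) query-string scheme; the real AWS v4 and GCS algorithms are more involved, and the credentials below are fake:

```python
import base64
import hashlib
import hmac
import time

def presign_get(bucket, key, access_key, secret_key, expires_in=3600):
    """Build an S3-v2-style pre-signed GET URL with pure local computation."""
    expires = int(time.time()) + expires_in
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return (
        f"https://{bucket}.s3.amazonaws.com/{key}"
        f"?AWSAccessKeyId={access_key}&Expires={expires}&Signature={signature}"
    )

url = presign_get("my-channel", "noarch/repodata.json", "AKIAFAKE", "fakesecret")
```

No network I/O happens here; the storage backend later verifies the same HMAC on its side.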

I am interested to see a benchmark. In general, I think I'd like to try to avoid as much as possible that Python "touches" static files. Static files should be routed through nginx or S3 / GCS etc.

@jonashaag

jonashaag commented Mar 15, 2022

Here are a bunch of timings with GCS pkgstore and from a WiFi home internet connection (each fastest of ~10 tries):

  1. Generate pre-signed URL in Quetz (1 HTTP request): 80 ms
  2. Download tiny package from GCS redirect URL (1 HTTPS request): 270 ms
  3. Same as 2. but with HTTP: 200 ms
  4. 1 and 3 combined (curl -L, 1 HTTP + 1 HTTPS request): 460 ms
  5. Download tiny repodata.json from Quetz (180 K uncompressed, streams from GCS): 600 ms
  6. Download tiny package from conda-forge (1 HTTPS request): 130 ms
  7. Ping GCS: 20 ms

So, GCS overhead seems to be ~ 250 ms, and the overhead from the roundtrip seems to be another 110 ms. (Not sure how curl needs 110 ms to process the redirect?!) So there is a "budget" of 380 ms for Quetz to serve packages directly without redirect. (270 ms if you ignore the 110 ms spent in curl.)

Also, the way repodata.json is served from GCS (Quetz streaming it to the client) seems really slow.

Edit: GCS bucket was in US while I'm in EU. I did another test with GCS in EU, this removes ~ 150 ms. I also tried adding Cloud CDN in front, which shaves off another 50 ms. So the budget shrinks to 230/180 ms. (Or 120/70 ms with the curl overhead removed.)
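
The "fastest of ~10 tries" methodology above can be reproduced with a small helper; in practice the measured callable would be an HTTP GET against Quetz or GCS, but the helper works for any function:

```python
import time

def best_of(fn, tries=10):
    """Return the fastest wall-clock time (in seconds) over several runs of fn()."""
    best = float("inf")
    for _ in range(tries):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

fastest = best_of(lambda: sum(range(10_000)))
```

Taking the minimum rather than the mean filters out one-off network jitter, which is why it suits latency comparisons like the ones above.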

@jonashaag

It doesn't seem to be used for repodata.json though.

Never mind, it's done with add_temp_static_file. Looks like we aren't running the task correctly.

@wolfv

wolfv commented Mar 15, 2022

it's quite possible that there are issues with the proxy mode... I would have to look into it more deeply.

Interesting findings re. timing.

Tbh I am less concerned about small files than about large files, and large files are what really should not be served through Python.
One could argue that small files could be served through Python and everything else redirected, but that seems like a complication.

If you have 5 (or more!) parallel downloads, this should only give a tiny hit on the overall picture ...

@wolfv

wolfv commented Mar 15, 2022

For S3 we already have support in powerloader btw (to natively pre-sign URLs on the client side): https://github.com/mamba-org/powerloader/blob/effe2b7e1f555616e4e4c877648658d1e6c89ded/src/mirrors/s3.cpp#L239-L244

For GCS the signing algorithm looks extremely similar, so we could also add support for gcs:// mirrors.

https://cloud.google.com/storage/docs/access-control/signing-urls-manually

That would remove the initial redirect roundtrip -- but force you to distribute S3 / GCS credentials to users.

@jonashaag

  • Why don't we use redirects from repodata.json{,.gz}?

See #506.

@jonashaag

If you have 5 (or more!) parallel downloads, this should only give a tiny hit on the overall picture ...

Likely. Will do some testing on this as well :)
