
Why does mamba often redownload conda-forge/osx-64 channel index rather than checking/using cached version? #2021

Closed
corneliusroemer opened this issue Oct 13, 2022 · 38 comments


@corneliusroemer
Contributor

I'm using mamba 0.27.0 on an M1 mac.

I'm confused why mamba often seems to redownload the entire package index (~25MB) instead of simply checking whether there have been any changes, e.g. using etag or time of last modification.

Where can I read more about the package index cache, and how to configure it?

There seem to be 3 different modes:

  1. Use cache
  2. Download index and see whether there are changes
  3. Redownload the entire index

For some reason the index seems to be downloaded for certain channels but not for others. This is odd.

What's the difference between "Using cache" and "No change"? The operation leading to "No change" is slower; does this mean the whole index is redownloaded and compared, instead of using some faster check like a hash? How does "Using cache" work? Is there a time-to-live on the cache, or do you check some hash every time?

[screenshot]

It seems that some indexes are updated every few seconds; is that realistic? Or is it a bug in mamba and/or the index server?

@jonashaag
Collaborator

jonashaag commented Oct 13, 2022

I'm on mobile, so pardon the short response:

IIUC there are 3 scenarios:

  1. Local cache is up to date (has etag/whatever that’s recent enough) — no download
  2. Local cache is not up to date, will check upstream and receive HTTP 304 — no download but we made an HTTP request
  3. Same as 2, but we actually have a newer package index upstream. We need to download the entire index; unfortunately, partial downloads are not yet implemented (although very much WIP)

If an index is downloaded every time it probably means it doesn’t send some HTTP headers relevant for caching.

There is an option that I don’t remember right now that controls the maximum frequency for steps 2/3. I have it set to a couple of hours.
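
The three scenarios above can be sketched as plain decision logic. This is a minimal, hypothetical sketch (function and parameter names are mine, not mamba's; the real implementation is C++ around libcurl):

```python
def repodata_action(cache_age_s, max_age_s, etag_matches_upstream):
    """Decide what to do with a cached repodata file.

    Hypothetical helper illustrating the three scenarios:
    1. cache still fresh  -> no HTTP request at all
    2. cache stale, upstream unchanged -> request made, server answers 304
    3. cache stale, upstream changed   -> full index download
    """
    if cache_age_s < max_age_s:
        return "use-cache"            # scenario 1: no request
    if etag_matches_upstream:
        return "revalidated-304"      # scenario 2: request, but no download
    return "full-download"            # scenario 3: download the whole index

# Scenario 1: cache is 10 min old, TTL is 20 min -> no request
assert repodata_action(600, 1200, etag_matches_upstream=True) == "use-cache"
# Scenario 2: cache expired but upstream unchanged -> HTTP 304
assert repodata_action(3600, 1200, etag_matches_upstream=True) == "revalidated-304"
# Scenario 3: cache expired and upstream changed -> full download
assert repodata_action(3600, 1200, etag_matches_upstream=False) == "full-download"
```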

@corneliusroemer
Contributor Author

Thanks! What's the difference between 1. and 2.? In 1 you don't even make a request? How do you know that the local cache is up to date then?

Could it be that the package index is faulty and updated too often despite no fundamental changes?

@jonashaag
Collaborator

jonashaag commented Oct 16, 2022

Yes, essentially the HTTP cache headers say: don't check for changes in the next X seconds. E.g. for conda-forge that would be something like 20 minutes, because that's the frequency of their repodata updates.

Then when you check for updates it might still be the case that there are none. You send the server your local cache's timestamp and it will respond with HTTP 304 "Not Modified". In that case we don't download or update anything, but we still made the request.

If the index has bad or no cache headers, then Mamba will have to check every time you install anything. And/or if the server does not support responding with 304, then whenever Mamba makes a request it will have to perform a full download.
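
That "don't check for X seconds" window comes straight from the Cache-Control header. A minimal sketch of the freshness check (simplified; real HTTP caching also involves Age, Expires, and other validators):

```python
import re

def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control header value, or None."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None

def is_fresh(age_seconds, cache_control):
    """True if a cached response is still fresh, i.e. no request is needed."""
    max_age = max_age_seconds(cache_control)
    return max_age is not None and age_seconds < max_age

# conda-forge serves `cache-control: public, max-age=1200`, i.e. 20 minutes
assert max_age_seconds("public, max-age=1200") == 1200
assert is_fresh(600, "public, max-age=1200")       # 10 min old: skip the request
assert not is_fresh(1800, "public, max-age=1200")  # 30 min old: revalidate upstream
```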

@jonashaag
Collaborator

xref #1504

@corneliusroemer
Contributor Author

Actually, I think what was really confusing me is how slow the download is.

I have a 20-50 MB/s connection, yet the indices only download at 1 MB/s...

Instead of 8 seconds, this should take 1 second; then this wouldn't matter.

[screenshot]

Is this throttling from Anaconda infrastructure? Quite surprising that this is so slow - modern infrastructure should be faster...

@jonashaag
Collaborator

Try this as a benchmark: curl https://conda.anaconda.org/bioconda/noarch/repodata.json >/dev/null

@corneliusroemer
Contributor Author

Thanks, good idea!
[screenshot]

Maybe mamba is a tiny bit slower, but that could also be due to parallel requests through one connection?

The above was what mamba update --all did, the below is the curl request a few seconds after that.

[screenshot]

@corneliusroemer
Contributor Author

Some more tests; it looks like it's mamba-related?

[screenshot]

@jonashaag
Collaborator

Hm, interesting. I think Mamba doesn't download the bz2 but the "uncompressed" one (which uses HTTP gzip compression). And the 3.6 MiB file is too small for benchmarking. But the 180 MiB file seems large enough and indeed Mamba download performance is bad.

What's your download_threads setting?

What version of Mamba are you on? Can you try a newer/older one?

@corneliusroemer
Contributor Author

Aha, that may be it! It could be that the gzip compression takes time on the server!

Well, the conda-forge/osx-64 index is 23.8 MB ;) that should be large enough.

mamba 0.27.0

Happy to try out lower/higher version. Can you reproduce? I'm happy to try out stuff but would be good to know whether this is just me.

How do I find download_threads? Couldn't find anything in docs and info etc.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

Here's osx-64, why don't you use .bz2?

[screenshot]

[screenshot]

Even the uncompressed is much faster:
[screenshot]

How can I force a redownload of the cache? Do I need to clean the index?

@jonashaag
Collaborator

It could be that the gzip compression takes time on the server!

I don't think so, it's very likely pre-compressed

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

This is mamba 0.24, same thing; it's always been like this. I always had to wait a few seconds, up to 10, but never really looked closely at how slow it was (2.8 MB/s).
[screenshot]

Is it just me? Can you reproduce with mamba clean -i; mamba update --all?

@jonashaag
Collaborator

Even the uncompressed is much faster:

I'm not sure you can compare this. Does curl report the speed relative to the compressed or uncompressed file size?

@jonashaag
Collaborator

why don't you use .bz2?

I guess we just rely on whatever compression curl negotiates with the server.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

I'm not sure you can compare this. Does curl report the speed relative to the compressed or uncompressed file size?

Probably the raw size. Even if not, it is definitely much faster: it takes only 3 seconds even when downloading uncompressed, vs. 10 seconds for mamba.

Verbose curl logs
>    curl https://conda.anaconda.org/conda-forge/osx-64/repodata.json >/dev/null --verbose
>     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                    Dload  Upload   Total   Spent    Left  Speed
>     0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 104.17.93.24:443...
>   * Connected to conda.anaconda.org (104.17.93.24) port 443 (#0)
>   * ALPN, offering h2
>   * ALPN, offering http/1.1
>   * successfully set certificate verify locations:
>   *  CAfile: /etc/ssl/cert.pem
>   *  CApath: none
>   * (304) (OUT), TLS handshake, Client hello (1):
>   } [323 bytes data]
>   * (304) (IN), TLS handshake, Server hello (2):
>   { [100 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Certificate (11):
>   { [2318 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
>   { [115 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Server finished (14):
>   { [4 bytes data]
>   * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
>   } [37 bytes data]
>   * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
>   } [1 bytes data]
>   * TLSv1.2 (OUT), TLS handshake, Finished (20):
>   } [16 bytes data]
>   * TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
>   { [1 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Finished (20):
>   { [16 bytes data]
>   * SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
>   * ALPN, server accepted to use h2
>   * Server certificate:
>   *  subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=anaconda.org
>   *  start date: May  5 00:00:00 2022 GMT
>   *  expire date: May  5 23:59:59 2023 GMT
>   *  subjectAltName: host "conda.anaconda.org" matched cert's "*.anaconda.org"
>   *  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
>   *  SSL certificate verify ok.
>   * Using HTTP2, server supports multiplexing
>   * Connection state changed (HTTP/2 confirmed)
>   * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
>   * Using Stream ID: 1 (easy handle 0x14d011a00)
>   > GET /conda-forge/osx-64/repodata.json HTTP/2
>   > Host: conda.anaconda.org
>   > user-agent: curl/7.79.1
>   > accept: */*
>   > 
>   * Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
>   < HTTP/2 200 
>   < date: Mon, 17 Oct 2022 16:18:40 GMT
>   < content-type: application/json
>   < content-length: 189263653
>   < cf-ray: 75ba59f8a88401f0-ZRH
>   < accept-ranges: bytes
>   < age: 609
>   < cache-control: public, max-age=1200
>   < etag: "df4e3a670908e75ebd002ddfa96e52b3"
>   < expires: Mon, 17 Oct 2022 16:38:40 GMT
>   < last-modified: Mon, 17 Oct 2022 16:06:46 GMT
>   < vary: Accept-Encoding
>   < cf-cache-status: HIT
>   < x-amz-id-2: /+0hNuEudAzBNgPC31kFWvkuduaZ9UgL/L1GMIYaD/0G2BXTHcd9L03E4H0Rd1DDRws11/1V5CE=
>   < x-amz-request-id: SDV9HY2M1QQSWDE3
>   < x-amz-version-id: null
>   < set-cookie: __cf_bm=Y9WyPETpWXhQQxYgD06bKCAnEmt.z.njvUGbU9zfwhs-1666023520-0-Afq0u1zNwgXRBiVFT/pUUkNumbzbygodJOkLpL8hGiJ7r3bfau8P1fldtHZMC/Ukzwi9VG1JBrtXb6pMKOJ/XohSFcAUED2H0FmjzT2fnnr6; path=/; expires=Mon, 17-Oct-22 16:48:40 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
>   < server: cloudflare
>   < 
>   { [807 bytes data]
>   100  180M  100  180M    0     0  42.0M      0  0:00:04  0:00:04 --:--:-- 44.0M
>   * Connection #0 to host conda.anaconda.org left intact

@jonashaag
Collaborator

For me it's roughly the same duration of download with curl and Mamba:

conda-forge/noarch                                   9.9MB @   3.7MB/s  3.0s
time curl https://conda.anaconda.org/conda-forge/noarch/repodata.json >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67.8M  100 67.8M    0     0  27.0M      0  0:00:02  0:00:02 --:--:-- 27.1M

________________________________________________________
Executed in    2.52 secs      fish           external

Baseline:

curl https://speed.hetzner.de/100MB.bin > /dev/null
100  100M  100  100M    0     0  29.4M      0  0:00:03  0:00:03 --:--:-- 29.5M

The .bz2 one seems to be MUCH faster at 0.5 s. Even if you add | bunzip2 it's still only 1 s.

@jonashaag
Collaborator

For osx-64, curl reports the same download speed, and the download takes 6 s. With bz2 it takes a little over 1 s via curl.

@jonashaag
Collaborator

jonashaag commented Oct 17, 2022

I guess it would be good to change to the bz2 variant if it's available for a channel. cc @wolfv tl;dr: .json.bz2 download seems to be 5x faster than HTTP compressed .json download.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

Right, so you can reproduce that mamba is very slow at downloading? Or at least, it shows the download speed of the compressed file yet downloads at the speed of the uncompressed one. Maybe that's where the confusion comes from.

How exciting that we may have figured out a way to make mamba updates much faster 😄

@jonashaag
Collaborator

How do I find download_threads? Couldn't find anything in docs and info etc.

Yeah unfortunately not everything is documented yet. You can put it into your .condarc:

local_repodata_ttl: 10000  # don't update repodata if younger than ~3 h
download_threads: 5  # default

@corneliusroemer
Contributor Author

So upping the threads to at least the number of channels is another good idea. :)

@jonashaag
Collaborator

it shows you download speed of compressed yet downloads at speed of uncompressed

Yeah that seems to be the key point here. Mamba and curl report different numbers. But the duration that's being reported seems accurate and similar to what curl reports.

@corneliusroemer
Contributor Author

Upping threads to 20 was already an improvement for envs with 8 channels! Thanks for the tip!

@jonashaag
Collaborator

Yeah we might want to consider increasing that number. It isn't actually the number of threads (xref #1963) and the curl default is 50 if you use curl --parallel.

@wolfv
Member

wolfv commented Oct 18, 2022

Hmm, I always thought that Anaconda was somehow limiting the total download speed to ~3 Mb/s for repodata.json but maybe that isn't the case :)
Currently we're passing the supported compression schemes manually into libcurl. I wonder if this change could help:

    curl_easy_setopt(m_handle, CURLOPT_ACCEPT_ENCODING, "gzip, deflate, compress, identity");

to

    curl_easy_setopt(m_handle, CURLOPT_ACCEPT_ENCODING, "");

which will add all libcurl-supported encodings automatically. Don't really think it will change much though.

I am on not-great hotel wifi right now -- could someone compare:

curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null
# vs.
curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null

I am not a big fan of the added complexity of using bz2 encoded files and I also don't like bz2 anymore (zstd would be cool!).

Also, the upcoming jlap format is going to make this much better for interactive / local use (incremental repodata patch updates). Implementation has been started here: #2029

The .jlap files are already available on the anaconda.org website: https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap so hopefully we can try it soon :)
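
The size difference between the compression formats is easy to probe locally with the standard library. This uses synthetic repodata-shaped JSON, so the ratios are only indicative of the real files:

```python
import bz2
import gzip
import json

# Build a repodata-like blob: thousands of near-identical package entries.
entries = {
    f"pkg-{i}-1.0-0.tar.bz2": {
        "name": f"pkg-{i}", "version": "1.0",
        "build": "0", "depends": ["python >=3.8"],
    }
    for i in range(5000)
}
raw = json.dumps({"packages": entries}).encode()

gz = gzip.compress(raw, compresslevel=9)  # roughly what an HTTP gzip transfer sends
bz = bz2.compress(raw, compresslevel=9)   # what repodata.json.bz2 stores

assert len(gz) < len(raw) and len(bz) < len(raw)
# On repetitive repodata-style JSON, bz2's much larger block size usually wins:
assert len(bz) < len(gz)
```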

@jonashaag
Collaborator

$ hyperfine -M 5 'curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null'
Benchmark 1: curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null
  Time (mean ± σ):     742.2 ms ± 226.6 ms    [User: 108.3 ms, System: 26.6 ms]
  Range (min … max):   496.5 ms … 1097.8 ms    5 runs

$ hyperfine -M 5 'curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null'
Benchmark 1: curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null
  Time (mean ± σ):      3.117 s ±  0.414 s    [User: 0.481 s, System: 0.109 s]
  Range (min … max):    2.523 s …  3.561 s    5 runs

@wolfv
Member

wolfv commented Oct 18, 2022

So at least it looks like it's not "mamba"'s fault :) It could be that the Anaconda CDN servers aren't caching the gzipped response and it's slowly encoded on the fly?! Or Anaconda could be limiting the bandwidth/speed on purpose for this file?! Idk

@jonashaag
Collaborator

jonashaag commented Oct 18, 2022

The bz2 file is 10% smaller than the gzip compressed download, so that doesn't explain the 5x slowdown.

Also bandwidth doesn't seem to be limited:

Download of uncompressed json
100 67.9M  100 67.9M    0     0  22.8M      0  0:00:02  0:00:02 --:--:-- 22.8M
                                 ^^^^^ bandwidth
Download with --compressed
100 9682k    0 9682k    0     0  3285k      0 --:--:--  0:00:02 --:--:-- 3292k
                                 ^^^^^ bandwidth

So this actually seems to be a server problem. Where can we raise this?

@jonashaag
Collaborator

It could be that the gzip compression takes time on the server!

I don't think so, it's very likely pre-compressed

This didn't age well @corneliusroemer. Another lesson in: Never make assumptions, and if you do, verify them.

@wolfv
Member

wolfv commented Oct 18, 2022

The conda slack or the infra repo under conda or conda-incubator

@corneliusroemer
Contributor Author

Sometimes not being a pro has advantages @jonashaag

This does indeed look like a server issue - but why don't you just move to bz2?

@corneliusroemer
Contributor Author

@wolfv @jonashaag , I also opened a parallel issue in the conda-forge repo, where @jakirkham has now asked to open in conda/infra: conda-forge/conda-forge.github.io#1835

@wolfv
Member

wolfv commented Oct 18, 2022

Yes, I think the conda/infra place would be ideal! :)

@corneliusroemer
Contributor Author

I've opened something here: conda/infrastructure#637

@wolfv
Member

wolfv commented Jan 10, 2023

I am going to close this for the moment. The repodata.zst support should land with the next release (optionally enable it with repodata_use_zst: true in the ~/.condarc file).
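
For reference, that would be a one-line addition to the config file (assuming the option name above; it only takes effect once a release with zst support is installed):

```yaml
# ~/.condarc
repodata_use_zst: true
```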

@wolfv wolfv closed this as completed Jan 10, 2023
@corneliusroemer
Contributor Author

Great news :) Very happy to see that zst speedup landing in production soon. Well done @jonashaag @wolfv & co

@corneliusroemer
Contributor Author

As far as I'm aware, repodata_use_zst only works with micromamba for now - not with mamba. Is that correct?

Do you plan on including it in mamba at some point? @jonashaag @wolfv
