
Why does mamba often redownload conda-forge/osx-64 channel index rather than checking/using cached version? #2021

Closed
corneliusroemer opened this issue Oct 13, 2022 · 38 comments


@corneliusroemer
Contributor

I'm using mamba 0.27.0 on an M1 mac.

I'm confused why mamba often seems to redownload the entire package index (~25MB) instead of simply checking whether there have been any changes, e.g. using etag or time of last modification.

Where can I read more about the package index cache, and how to configure it?

There seem to be 3 different modes:

  1. Use cache
  2. Download index and see whether there are changes
  3. Redownload the entire index

For some reason the index seems to be downloaded for certain channels but not for others. This is odd.

What's the difference between "Using cache" and "No change"? The operation leading to "No change" is slower; does this mean the whole index is redownloaded and compared, instead of using some faster check like a hash? How does "Using cache" work? Is there a time-to-live on the cache, or do you check some hash every time?

[screenshot]

It seems that some indexes are updated every few seconds; is that realistic? Or is it a bug in mamba and/or the index server?

@jonashaag
Collaborator

jonashaag commented Oct 13, 2022

I'm on mobile, so pardon the short response:

IIUC there are 3 scenarios:

  1. Local cache is up to date (has etag/whatever that’s recent enough) — no download
  2. Local cache is not up to date, will check upstream and receive HTTP 304 — no download but we made an HTTP request
  3. Same as 2, but we actually have a newer package index upstream. We need to download the entire index; unfortunately, partial downloads are not yet implemented (although very much WIP)

If an index is downloaded every time it probably means it doesn’t send some HTTP headers relevant for caching.

There is an option that I don’t remember right now that controls the maximum frequency for steps 2/3. I have it set to a couple of hours.
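
The three scenarios above can be sketched as plain decision logic. This is a minimal, hypothetical sketch (function and parameter names are mine, not mamba's; the real implementation is C++ around libcurl):

```python
def repodata_action(cache_age_s, max_age_s, etag_matches_upstream):
    """Decide what to do with a cached repodata file.

    Hypothetical helper illustrating the three scenarios:
    1. cache still fresh  -> no HTTP request at all
    2. cache stale, upstream unchanged -> request made, server answers 304
    3. cache stale, upstream changed   -> full index download
    """
    if cache_age_s < max_age_s:
        return "use-cache"            # scenario 1: no request
    if etag_matches_upstream:
        return "revalidated-304"      # scenario 2: request, but no download
    return "full-download"            # scenario 3: download the whole index

# Scenario 1: cache is 10 min old, TTL is 20 min -> no request
assert repodata_action(600, 1200, etag_matches_upstream=True) == "use-cache"
# Scenario 2: cache expired but upstream unchanged -> HTTP 304
assert repodata_action(3600, 1200, etag_matches_upstream=True) == "revalidated-304"
# Scenario 3: cache expired and upstream changed -> full download
assert repodata_action(3600, 1200, etag_matches_upstream=False) == "full-download"
```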

@corneliusroemer
Contributor Author

Thanks! What's the difference between 1. and 2.? In 1 you don't even make a request? How do you know that the local cache is up to date then?

Could it be that the package index is faulty and updated too often despite no fundamental changes?

@jonashaag
Collaborator

jonashaag commented Oct 16, 2022

Yes, essentially the HTTP cache headers say: don't check for changes in the next X seconds. E.g. for conda-forge that would be something like 20 minutes, because that's the frequency of their repodata updates.

Then when you check for updates it might still be the case that there are none. You send the server your local cache's timestamp and it will respond with HTTP 304 "Not Modified". In that case we don't download or update anything, but we still made the request.

If the index has bad or no cache headers, then Mamba will have to check every time you install anything. And/or if the server does not support responding with 304, then whenever Mamba makes a request it will have to perform a full download.
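
That "don't check for X seconds" window comes straight from the Cache-Control header. A minimal sketch of the freshness check (simplified; real HTTP caching also involves Age, Expires, and other validators):

```python
import re

def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control header value, or None."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None

def is_fresh(age_seconds, cache_control):
    """True if a cached response is still fresh, i.e. no request is needed."""
    max_age = max_age_seconds(cache_control)
    return max_age is not None and age_seconds < max_age

# conda-forge serves `cache-control: public, max-age=1200`, i.e. 20 minutes
assert max_age_seconds("public, max-age=1200") == 1200
assert is_fresh(600, "public, max-age=1200")       # 10 min old: skip the request
assert not is_fresh(1800, "public, max-age=1200")  # 30 min old: revalidate upstream
```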

@jonashaag
Collaborator

xref #1504

@corneliusroemer
Contributor Author

Actually, I think what was really confusing me is how slow the download is.

I have a 20-50 MB/s connection, yet the indices only download at 1 MB/s...

Instead of 8 seconds, this should take 1 second; then this wouldn't matter.

[screenshot]

Is this throttling from Anaconda infrastructure? Quite surprising that this is so slow - modern infrastructure should be faster...

@jonashaag
Collaborator

Try this as a benchmark: curl https://conda.anaconda.org/bioconda/noarch/repodata.json >/dev/null

@corneliusroemer
Contributor Author

Thanks, good idea!
[screenshot]

Maybe mamba is a tiny bit slower, but that could also be due to parallel requests through one connection?

The above was what mamba update --all did, the below is the curl request a few seconds after that.

[screenshot]

@corneliusroemer
Contributor Author

Some more tests; it looks like it's mamba-related?

[screenshot]

@jonashaag
Collaborator

Hm, interesting. I think Mamba doesn't download the bz2 but the "uncompressed" one (which uses HTTP gzip compression). And the 3.6 MiB file is too small for benchmarking. But the 180 MiB file seems large enough and indeed Mamba download performance is bad.

What's your download_threads setting?

What version of Mamba are you on? Can you try a newer/older one?

@corneliusroemer
Contributor Author

Aha, that may be it! It could be that the gzip compression takes time on the server!

Well, the conda-forge/osx-64 index is 23.8 MB ;) that should be large enough.

mamba 0.27.0

Happy to try out lower/higher version. Can you reproduce? I'm happy to try out stuff but would be good to know whether this is just me.

How do I find download_threads? Couldn't find anything in docs and info etc.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

Here's osx-64, why don't you use .bz2?

[screenshot]

[screenshot]

Even the uncompressed is much faster:
[screenshot]

How can I force a redownload of the cache? Do I need to clean the index?

@jonashaag
Collaborator

It could be that the gzip compression takes time on the server!

I don't think so, it's very likely pre-compressed

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

This is mamba 0.24, same thing; it's always been like this. I always had to wait a few seconds, up to 10, but never really looked closely at how slow it was (2.8 MB/s).
[screenshot]

Is it just me? Can you reproduce with mamba clean -i; mamba update --all?

@jonashaag
Collaborator

Even the uncompressed is much faster:

I'm not sure you can compare this. Does curl report the speed relative to the compressed or uncompressed file size?

@jonashaag
Collaborator

why don't you use .bz2?

I guess we just rely on whatever compression curl negotiates with the server.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

I'm not sure you can compare this. Does curl report the speed relative to the compressed or uncompressed file size?

Probably the raw size. Even if not, it is definitely much faster: it takes only 3 seconds even when downloading uncompressed, vs. 10 seconds for mamba.

Verbose curl logs
>    curl https://conda.anaconda.org/conda-forge/osx-64/repodata.json >/dev/null --verbose
>     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                    Dload  Upload   Total   Spent    Left  Speed
>     0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 104.17.93.24:443...
>   * Connected to conda.anaconda.org (104.17.93.24) port 443 (#0)
>   * ALPN, offering h2
>   * ALPN, offering http/1.1
>   * successfully set certificate verify locations:
>   *  CAfile: /etc/ssl/cert.pem
>   *  CApath: none
>   * (304) (OUT), TLS handshake, Client hello (1):
>   } [323 bytes data]
>   * (304) (IN), TLS handshake, Server hello (2):
>   { [100 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Certificate (11):
>   { [2318 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
>   { [115 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Server finished (14):
>   { [4 bytes data]
>   * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
>   } [37 bytes data]
>   * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
>   } [1 bytes data]
>   * TLSv1.2 (OUT), TLS handshake, Finished (20):
>   } [16 bytes data]
>   * TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
>   { [1 bytes data]
>   * TLSv1.2 (IN), TLS handshake, Finished (20):
>   { [16 bytes data]
>   * SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
>   * ALPN, server accepted to use h2
>   * Server certificate:
>   *  subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=anaconda.org
>   *  start date: May  5 00:00:00 2022 GMT
>   *  expire date: May  5 23:59:59 2023 GMT
>   *  subjectAltName: host "conda.anaconda.org" matched cert's "*.anaconda.org"
>   *  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
>   *  SSL certificate verify ok.
>   * Using HTTP2, server supports multiplexing
>   * Connection state changed (HTTP/2 confirmed)
>   * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
>   * Using Stream ID: 1 (easy handle 0x14d011a00)
>   > GET /conda-forge/osx-64/repodata.json HTTP/2
>   > Host: conda.anaconda.org
>   > user-agent: curl/7.79.1
>   > accept: */*
>   > 
>   * Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
>   < HTTP/2 200 
>   < date: Mon, 17 Oct 2022 16:18:40 GMT
>   < content-type: application/json
>   < content-length: 189263653
>   < cf-ray: 75ba59f8a88401f0-ZRH
>   < accept-ranges: bytes
>   < age: 609
>   < cache-control: public, max-age=1200
>   < etag: "df4e3a670908e75ebd002ddfa96e52b3"
>   < expires: Mon, 17 Oct 2022 16:38:40 GMT
>   < last-modified: Mon, 17 Oct 2022 16:06:46 GMT
>   < vary: Accept-Encoding
>   < cf-cache-status: HIT
>   < x-amz-id-2: /+0hNuEudAzBNgPC31kFWvkuduaZ9UgL/L1GMIYaD/0G2BXTHcd9L03E4H0Rd1DDRws11/1V5CE=
>   < x-amz-request-id: SDV9HY2M1QQSWDE3
>   < x-amz-version-id: null
>   < set-cookie: __cf_bm=Y9WyPETpWXhQQxYgD06bKCAnEmt.z.njvUGbU9zfwhs-1666023520-0-Afq0u1zNwgXRBiVFT/pUUkNumbzbygodJOkLpL8hGiJ7r3bfau8P1fldtHZMC/Ukzwi9VG1JBrtXb6pMKOJ/XohSFcAUED2H0FmjzT2fnnr6; path=/; expires=Mon, 17-Oct-22 16:48:40 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
>   < server: cloudflare
>   < 
>   { [807 bytes data]
>   100  180M  100  180M    0     0  42.0M      0  0:00:04  0:00:04 --:--:-- 44.0M
>   * Connection #0 to host conda.anaconda.org left intact

@jonashaag
Collaborator

For me it's roughly the same duration of download with curl and Mamba:

conda-forge/noarch                                   9.9MB @   3.7MB/s  3.0s
time curl https://conda.anaconda.org/conda-forge/noarch/repodata.json >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67.8M  100 67.8M    0     0  27.0M      0  0:00:02  0:00:02 --:--:-- 27.1M

________________________________________________________
Executed in    2.52 secs      fish           external

Baseline:

curl https://speed.hetzner.de/100MB.bin > /dev/null
100  100M  100  100M    0     0  29.4M      0  0:00:03  0:00:03 --:--:-- 29.5M

The .bz2 one seems to be MUCH faster at 0.5 s. Even if you add | bunzip2 it's still only 1 s.

@jonashaag
Collaborator

For osx-64, curl reports the same download speed, and the download takes 6 s. With bz2 it takes a little over 1 s via curl.

@jonashaag
Collaborator

jonashaag commented Oct 17, 2022

I guess it would be good to change to the bz2 variant if it's available for a channel. cc @wolfv tl;dr: .json.bz2 download seems to be 5x faster than HTTP compressed .json download.

@corneliusroemer
Contributor Author

corneliusroemer commented Oct 17, 2022

Right, so you can reproduce that mamba is very slow at downloading? Or at least, it shows the download speed of the compressed file yet downloads at the speed of the uncompressed one. Maybe that's where the confusion comes from.

How exciting that we may have figured out a way to make mamba updates much faster 😄

@jonashaag
Collaborator

How do I find download_threads? Couldn't find anything in docs and info etc.

Yeah unfortunately not everything is documented yet. You can put it into your .condarc:

local_repodata_ttl: 10000  # don't update repodata if younger than ~3 h
download_threads: 5  # default

@corneliusroemer
Contributor Author

So upping the threads to at least the number of channels is another good idea. :)

@jonashaag
Collaborator

it shows you download speed of compressed yet downloads at speed of uncompressed

Yeah that seems to be the key point here. Mamba and curl report different numbers. But the duration that's being reported seems accurate and similar to what curl reports.

@corneliusroemer
Contributor Author

Upping threads to 20 was already an improvement for envs with 8 channels! Thanks for the tip!

@jonashaag
Collaborator

Yeah we might want to consider increasing that number. It isn't actually the number of threads (xref #1963) and the curl default is 50 if you use curl --parallel.

@wolfv
Member

wolfv commented Oct 18, 2022

Hmm, I always thought that Anaconda was somehow limiting the total download speed to ~3 Mb/s for repodata.json but maybe that isn't the case :)
Currently we're passing the supported compression schemes manually into libcurl. I wonder if this change could help:

    curl_easy_setopt(m_handle, CURLOPT_ACCEPT_ENCODING, "gzip, deflate, compress, identity");

to

    curl_easy_setopt(m_handle, CURLOPT_ACCEPT_ENCODING, "");

which will add all libcurl-supported encodings automatically. Don't really think it will change much though.

I am on not-great hotel wifi right now -- could someone compare:

curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null
# vs.
curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null

I am not a big fan of the added complexity of using bz2 encoded files and I also don't like bz2 anymore (zstd would be cool!).

Also, the upcoming jlap format is going to make this much better for interactive / local use (incremental repodata patch updates). Implementation has been started here: #2029

The .jlap files are already available on the anaconda.org website: https://conda.anaconda.org/conda-forge/linux-64/repodata.jlap so hopefully we can try it soon :)
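
The size difference between the compression formats is easy to probe locally with the standard library. This uses synthetic repodata-shaped JSON, so the ratios are only indicative of the real files:

```python
import bz2
import gzip
import json

# Build a repodata-like blob: thousands of near-identical package entries.
entries = {
    f"pkg-{i}-1.0-0.tar.bz2": {
        "name": f"pkg-{i}", "version": "1.0",
        "build": "0", "depends": ["python >=3.8"],
    }
    for i in range(5000)
}
raw = json.dumps({"packages": entries}).encode()

gz = gzip.compress(raw, compresslevel=9)  # roughly what an HTTP gzip transfer sends
bz = bz2.compress(raw, compresslevel=9)   # what repodata.json.bz2 stores

assert len(gz) < len(raw) and len(bz) < len(raw)
# On repetitive repodata-style JSON, bz2's much larger block size usually wins:
assert len(bz) < len(gz)
```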

@jonashaag
Collaborator

$ hyperfine -M 5 'curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null'
Benchmark 1: curl https://conda.anaconda.org/conda-forge/noarch/repodata.json.bz2 > /dev/null
  Time (mean ± σ):     742.2 ms ± 226.6 ms    [User: 108.3 ms, System: 26.6 ms]
  Range (min … max):   496.5 ms … 1097.8 ms    5 runs

$ hyperfine -M 5 'curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null'
Benchmark 1: curl --compressed https://conda.anaconda.org/conda-forge/noarch/repodata.json > /dev/null
  Time (mean ± σ):      3.117 s ±  0.414 s    [User: 0.481 s, System: 0.109 s]
  Range (min … max):    2.523 s …  3.561 s    5 runs

@wolfv
Member

wolfv commented Oct 18, 2022

So at least it looks like it's not "mamba"'s fault :) It could be that the Anaconda CDN servers aren't caching the gzipped response and it's slowly encoded on the fly?! Or Anaconda could be limiting the bandwidth/speed on purpose for this file?! Idk

@jonashaag
Collaborator

jonashaag commented Oct 18, 2022

The bz2 file is 10% smaller than the gzip compressed download, so that doesn't explain the 5x slowdown.

Also bandwidth doesn't seem to be limited:

Download of uncompressed json
100 67.9M  100 67.9M    0     0  22.8M      0  0:00:02  0:00:02 --:--:-- 22.8M
                                 ^^^^^ bandwidth
Download with --compressed
100 9682k    0 9682k    0     0  3285k      0 --:--:--  0:00:02 --:--:-- 3292k
                                 ^^^^^ bandwidth

So this actually seems to be a server problem. Where can we raise this?

@jonashaag
Collaborator

It could be that the gzip compression takes time on the server!

I don't think so, it's very likely pre-compressed

This didn't age well @corneliusroemer. Another lesson in: Never make assumptions, and if you do, verify them.

@wolfv
Member

wolfv commented Oct 18, 2022

The conda slack or the infra repo under conda or conda-incubator

@corneliusroemer
Contributor Author

Sometimes not being a pro has advantages @jonashaag

This does indeed look like a server issue - but why don't you just move to bz2?

@corneliusroemer
Contributor Author

@wolfv @jonashaag , I also opened a parallel issue in the conda-forge repo, where @jakirkham has now asked to open in conda/infra: conda-forge/conda-forge.github.io#1835

@wolfv
Member

wolfv commented Oct 18, 2022

Yes, I think the conda/infra place would be ideal! :)

@corneliusroemer
Contributor Author

I've opened something here: conda/infrastructure#637

@wolfv
Member

wolfv commented Jan 10, 2023

I am going to close this for the moment. The repodata.zst support should land with the next release (optionally enable it with repodata_use_zst: true in the ~/.condarc file).
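
For reference, that would be a one-line addition to the config file (assuming the option name above; it only takes effect once a release with zst support is installed):

```yaml
# ~/.condarc
repodata_use_zst: true
```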

@wolfv wolfv closed this as completed Jan 10, 2023
@corneliusroemer
Contributor Author

Great news :) Very happy to see that zst speedup landing in production soon. Well done @jonashaag @wolfv & co

@corneliusroemer
Contributor Author

As far as I'm aware, repodata_use_zst only works with micromamba for now - not with mamba. Is that correct?

Do you plan on including it in mamba at some point? @jonashaag @wolfv
