Noarch repodata.json is empty for proxy channels #527

Closed
janjagusch opened this issue May 10, 2022 · 11 comments
@janjagusch
Collaborator

In our quetz server, we proxy conda-forge into a channel of the same name:

  {
    "name": "conda-forge",
    "description": null,
    "private": true,
    "size_limit": null,
    "ttl": 36000,
    "mirror_channel_url": "https://conda.anaconda.org/conda-forge",
    "mirror_mode": "proxy",
    "members_count": 1,
    "packages_count": 0
  },

Installation often fails because certain packages cannot be found. We narrowed the issue down to the noarch repodata.json being empty. Navigating to /get/conda-forge/noarch/repodata.json yields:

{
  "info": {
    "subdir": "noarch"
  },
  "packages": {},
  "packages.conda": {},
  "repodata_version": 1
}
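
For reference, a minimal way to reproduce the check (the server hostname is a placeholder for our deployment):

    import requests

    # fetch the proxied noarch repodata from the quetz server
    resp = requests.get("https://quetz.example.com/get/conda-forge/noarch/repodata.json")
    repodata = resp.json()
    # both package maps come back empty when the bug hits
    print(len(repodata["packages"]), len(repodata["packages.conda"]))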

This problem only occurs for the noarch platform. Also, our package backend (GCS) holds a cached version of the repodata.json, which is not empty (it is 55 MB).

Any idea what might be causing this?

@wolfv
Member

wolfv commented May 10, 2022

hmmm, this is the code in question:

quetz/quetz/main.py

Lines 1645 to 1654 in 05dbfc6

    if channel.mirror_channel_url and channel.mirror_mode == "proxy":
        repository = RemoteRepository(channel.mirror_channel_url, session)
        if not pkgstore.file_exists(channel.name, path):
            download_remote_file(repository, pkgstore, channel.name, path)
        elif path.endswith(".json"):
            # repodata.json and current_repodata.json are cached locally
            # for channel.ttl seconds
            _, fmtime, _ = pkgstore.get_filemetadata(channel.name, path)
            if time.time() - fmtime >= channel.ttl:
                download_remote_file(repository, pkgstore, channel.name, path)

I wonder if the gzip magic does something bad here:

quetz/quetz/main.py

Lines 1673 to 1683 in 05dbfc6

    if accept_encoding and 'gzip' in accept_encoding and path.endswith('.json'):
        # return gzipped response
        try:
            package_content_iter = iter_chunks(
                pkgstore.serve_path(channel.name, path + '.gz')
            )
            path += '.gz'
            headers['Content-Encoding'] = 'gzip'
            headers['Content-Type'] = 'application/json'
        except FileNotFoundError:
            pass

Since we might not have the proper repodata.json.gz file on the package store ...
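
That would also explain why the failure is silent: an empty repodata stub gzips to perfectly valid gzip, so neither the server nor the client raises an error and the solver just sees zero packages. A quick illustration:

    import gzip, json

    # the empty stub observed at /get/conda-forge/noarch/repodata.json
    stub = {
        "info": {"subdir": "noarch"},
        "packages": {},
        "packages.conda": {},
        "repodata_version": 1,
    }

    # its .gz sidecar is valid gzip, so serving it produces no error anywhere
    blob = gzip.compress(json.dumps(stub).encode())
    print(json.loads(gzip.decompress(blob))["packages"])  # {} -> nothing to install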

@janjagusch
Collaborator Author

> I wonder if the gzip magic does something bad here [...] Since we might not have the proper repodata.json.gz file on the package store ...

I just checked the repodata.json.gz and it's empty (see the file size):

[screenshot: file listing showing an empty noarch/repodata.json.gz]

Interestingly, the other platforms don't contain a repodata.json.gz, only a repodata.json.

@janjagusch
Collaborator Author

janjagusch commented May 10, 2022

One more thing: repodata.json.gz also doesn't seem to exist on the upstream channel, see: https://conda.anaconda.org/conda-forge/noarch/repodata.json.gz

Deleting repodata.json.bz2 and repodata.json.gz seems to solve the issue for me. The question remains where these files come from, though.
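
For reference, roughly what that cleanup looks like against a GCS-backed store (bucket name and object path are illustrative, not quetz's actual layout):

    from google.cloud import storage

    bucket = storage.Client().bucket("my-quetz-pkgstore")
    for name in ("repodata.json.gz", "repodata.json.bz2"):
        blob = bucket.get_blob(f"conda-forge/noarch/{name}")
        if blob is not None:
            print(name, blob.size)  # the stale sidecars show up suspiciously small
            blob.delete()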

@wolfv
Member

wolfv commented May 10, 2022

OK, so the bug goes as follows:

  • we added a repodata.json.gz file to support returning / streaming gzip-compressed files from S3 buckets and similar
  • we initialize every channel with an empty noarch repodata.json, because the existence of the noarch/repodata.json is what marks a channel as "existing"
  • apparently we initialize even a proxy-mirror channel with the static noarch/repodata.json files (including the .gz / .bz2 ones); we should not do that
  • the repodata.json.gz is a quetz-specific extension (and a bit of a workaround for OVH, because they don't properly support setting the Content-Encoding header for a given file)

@wolfv
Member

wolfv commented May 10, 2022

This is where we call update_indexes for all kinds of channels:

indexing.update_indexes(dao, pkgstore, new_channel.name)

That creates the empty noarch/repodata.json ...

Should we just not call that for a proxy-mirror channel?
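
A minimal sketch of that guard, reusing the proxy check from the excerpt above (hypothetical, just to illustrate the idea):

    # only build local indexes for channels that actually host packages;
    # proxy mirrors serve the upstream repodata instead
    if not (new_channel.mirror_channel_url and new_channel.mirror_mode == "proxy"):
        indexing.update_indexes(dao, pkgstore, new_channel.name)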

@janjagusch
Collaborator Author

> Should we just not call that for a proxy-mirror channel?

Sounds reasonable to me. 👍

@wolfv
Member

wolfv commented May 11, 2022

On the other hand, a better solution might be to create the .gz files so that we can serve gzipped repodata, which saves a lot: the gzipped file is 20% or so of the full repodata. E.g. instead of downloading 120 MB, you only need 20 or so.
Did you experience long downloads or are you gzipping the responses through nginx?
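
Creating the sidecar at download time would be cheap; a minimal sketch (file paths illustrative):

    import gzip
    import shutil

    # write repodata.json.gz next to the freshly fetched repodata.json so the
    # gzip branch in main.py has a real, non-empty file to serve
    with open("repodata.json", "rb") as src, gzip.open("repodata.json.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)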

@janjagusch
Collaborator Author

> Did you experience long downloads or are you gzipping the responses through nginx?

So far I don't think long download times have been an issue for us. But if we could build it in a way that sends a lot less data over the network, I would be all in favour of that.

@wolfv
Member

wolfv commented May 12, 2022

quetz is released, with this bug fixed.

@wolfv wolfv closed this as completed May 12, 2022
@janjagusch
Collaborator Author

thank you!
