
CDI download fails due to nbdkit: curl error #2561

Closed
k8scoder192 opened this issue Jan 27, 2023 · 13 comments

k8scoder192 commented Jan 27, 2023

What happened:
CDI download of a large image file (380M) fails with this error:

E0127 02:55:26.347182       1 prlimit.go:174] qemu-img failed output is:
E0127 02:55:26.347210       1 prlimit.go:175]     (0.00/100%)
    (1.02/100%)
    (2.09/100%)
    (3.16/100%)
    (4.22/100%)
    (5.22/100%)
    (6.26/100%)
    (7.32/100%)
    (8.33/100%)
    (9.33/100%)
    (10.39/100%)
    (11.39/100%)
    (12.39/100%)
    (13.39/100%)
    (14.40/100%)
    (15.40/100%)


E0127 02:55:26.347218       1 prlimit.go:176] qemu-img: error while reading at byte 379387904: Input/output error

Output of kubectl describe on the importer pod:

      Reason:       Error
      Message:      Unable to process data: Unable to convert source data to target format: Conversion to Raw failed: could not convert image to raw Log line from nbdkit: nbdkit: curl[4]: error: pread: curl_easy_perform: Failed sending data to the peer: Connection died, tried 5
times before giving up: qemu-img execution failed: exit status 1
      Exit Code:    1
      Started:      Fri, 27 Jan 2023 15:40:40 -0500
      Finished:     Fri, 27 Jan 2023 15:45:06 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     4
      memory:  10Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Environment:
      IMPORTER_SOURCE:               http
      IMPORTER_ENDPOINT:             https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
      IMPORTER_CONTENTTYPE:          kubevirt
      IMPORTER_IMAGE_SIZE:           10Gi
      OWNER_UID:                     ffccd373-9db1-4712-9d4a-4ab7ff23e006
      FILESYSTEM_OVERHEAD:           0
      INSECURE_TLS:                  false
      IMPORTER_DISK_ID:
      IMPORTER_UUID:
      IMPORTER_READY_FILE:
      IMPORTER_DONE_FILE:
      IMPORTER_BACKING_FILE:
      IMPORTER_THUMBPRINT:
      http_proxy:
      https_proxy:
      no_proxy:
      IMPORTER_CURRENT_CHECKPOINT:
      IMPORTER_PREVIOUS_CHECKPOINT:
      IMPORTER_FINAL_CHECKPOINT:
      PREALLOCATION:                 false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wwxgd (ro)
    Devices:
      /dev/cdi-block-volume from cdi-data-vol

What you expected to happen:
CDI should be able to download and convert image file to raw without connection issues.

wget of the source https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img works flawlessly, even when run multiple times.

How to reproduce it (as minimally and precisely as possible):
Ceph block storage (PVC in block mode)
CDI v1.54.0

apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: ubuntu1804-dv
spec:
  source:
    http:
      url: "https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img"
  pvc:
    storageClassName: rook-ceph-block
    volumeMode: Block
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
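
To drive the reproduction, the DataVolume above can be applied and the importer followed like this (a sketch; the pod name follows CDI's usual importer-&lt;dv-name&gt; pattern, so verify the actual name on your cluster):

```shell
# Apply the DataVolume and follow the importer pod's logs.
kubectl apply -f ubuntu1804-dv.yaml
kubectl get pods -w
# "importer-ubuntu1804-dv" assumes CDI's importer-<dv-name> naming;
# check "kubectl get pods" for the real name before running this.
kubectl logs -f importer-ubuntu1804-dv
```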

The CDI CR pod resource requirements were also increased to rule out resource limits as the cause:

        podResourceRequirements:
          limits:
            cpu: "4"
            memory: 10Gi
          requests:
            cpu: "1"
            memory: 8Gi

Additional context:
It appears that using nbdkit to curl the file could be the cause.

Support for this stems from the fact that when CDI downloads .xz files, nbdkit is not used, and no connection issues were seen (even when run multiple times).

In the .xz case, a Go HTTP client downloads the file to scratch space and, when the download completes, qemu-img converts it to raw. See PR 2351.
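
A rough sketch of that scratch-space approach (URL and paths are placeholders, not CDI's actual code):

```shell
# Download fully to scratch space first, then convert locally;
# any transient network error surfaces in the download step,
# not mid-conversion inside qemu-img.
curl --fail --location --retry 5 \
    -o /scratch/disk.img.xz "https://example.com/disk.img.xz"
xz --decompress /scratch/disk.img.xz       # yields /scratch/disk.img
qemu-img convert -p -O raw /scratch/disk.img /dev/cdi-block-volume
```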

Others have seen nbdkit curl errors
See issue 1737, and step 5 (log output) in issue 1980.

Suggestion: don't use nbdkit to curl; use the same method the code currently uses for .xz files.

Environment:

  • CDI version (use kubectl get deployments cdi-deployment -o yaml): v1.54.0
  • Kubernetes version (use kubectl version): v1.24.0 and/or v1.21.0
  • DV specification: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): Ubuntu 18.04
  • Kernel (e.g. uname -a): 4.15
  • Install tools: N/A
  • Others: ceph:v16.2.7, rook/ceph:v1.9.2
k8scoder192 changed the title from “CDI download fails due to nbdkit: curl[4]: error” to “CDI download fails due to nbdkit: curl error” on Jan 27, 2023
rwmjones commented Jan 30, 2023

This is how we arrange the filters in virt-v2v, on top of nbdkit-curl-plugin:

nbdkit --filter=cache --filter=cacheextents --filter=retry curl \
    timeout=2000 \
    cache-on-read=true \
    cache-max-size=1G

A few notes about this:

  • timeout=2000 was added for https://bugzilla.redhat.com/show_bug.cgi?id=1146007#c10
  • cache-on-read=true is important, so that reads are actually cached (otherwise only the special NBD prefetch operation is cached).
  • The cacheextents filter is needed because of the way that qemu-img makes inefficient requests for extent data.
  • I've suggested cache-max-size=1G to limit the cache to 1G, but of course you may wish to adjust this. Note by default the cache will use all available /var/tmp unless you limit it.
  • We were using the readahead filter and then dropped it. However in the meantime I rewrote this filter upstream (nbdkit >= 1.31.2). If you know that you're using the newer nbdkit then you might consider adding --filter=readahead first in the list, but we don't have much experience yet of whether it's really going to solve the problem.
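
Putting that filter stack together with the conversion step, the whole pipeline might look roughly like this (a sketch, not CDI's actual invocation; the URL and socket path are placeholders):

```shell
# Serve the remote image over a Unix socket with the filters above,
# then let qemu-img's NBD client read from it and convert to raw.
nbdkit --unix /tmp/nbd.sock \
    --filter=cache --filter=cacheextents --filter=retry curl \
    url="https://example.com/image.qcow2" \
    timeout=2000 cache-on-read=true cache-max-size=1G &

qemu-img convert -p -O raw \
    "nbd+unix:///?socket=/tmp/nbd.sock" /dev/cdi-block-volume
```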

Do you have an example of a source where the current nbdkit command is especially slow?

rwmjones commented:
Thanks - I'll play with it tomorrow & see if I can reproduce it and come up with suggestions.


rwmjones commented Feb 1, 2023

Thanks again for the example, it was useful to have something concrete to experiment with.

Firstly the answer to this bug is to add --filter=retry. If you are using other filters already, add it last in the list (which means closest to the plugin). That will transparently retry the connection if there is a transient network failure. There are various knobs for that filter, but tbh the defaults will be fine in most cases. We use this filter extensively already and it seems robust.
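
Concretely, that looks like the following (the retries/retry-delay knobs are optional and, as noted, the defaults are usually fine; double-check the option names against the nbdkit-retry-filter documentation):

```shell
# --filter=retry goes last in the list, so it wraps the curl plugin
# directly and transparently reopens the connection on transient failures.
nbdkit --filter=cache --filter=cacheextents --filter=retry curl \
    url="https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img" \
    retries=5 retry-delay=2
```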

The answer to the performance issues was more interesting and we're still discussing it on IRC. A summary is in this email: https://listman.redhat.com/archives/libguestfs/2023-February/030581.html

The nbdkit curl plugin has some overhead. I measured it at about 25% slower than wget. However, that overhead completely disappears when NBD multi-conn is enabled - it's basically the same speed as wget.

But there are three bugs which get in the way:

(1) The plugin doesn't enable multi-conn. (There is a simple workaround for this involving nbdkit-multi-conn-filter, see email above, but also we will fix the plugin.)
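
The workaround mentioned would look roughly like this (the details are in the linked email; the exact option name and value below are my assumption from the nbdkit-multi-conn-filter documentation, so verify before using):

```shell
# Assumption: layer the multi-conn filter on top of curl so the server
# advertises/emulates multi-conn and NBD clients can open parallel
# connections.
nbdkit --filter=multi-conn --filter=retry curl \
    url="https://example.com/image.qcow2" \
    multi-conn-mode=emulate
```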

(2) In this particular case because the file is qcow2, you want to run qemu-img convert in the pipeline. This uses the NBD client in qemu:

web server -- (HTTPS) --> nbdkit-curl-plugin -- (NBD) --> qemu-img convert

Unfortunately the NBD client in qemu doesn't support multi-conn. I'm in discussion with Eric Blake about whether we can fix this.

(3) Also it looks like there may be another problem in qemu-img convert because performance is terrible. I even tried using the internal qemu curl client, which completely bypasses NBD, nbdkit etc, and performance was still terrible. It was taking 4 or 5 minutes to download and convert the image, even with all the caching and readahead features turned up. If fixing (1) and (2) still shows any performance drop versus wget I'll have a look at this later.


mhenriks commented Feb 1, 2023

@rwmjones thanks for digging into this! Good to know the retry filter will address the issue here.

I think it would be best to track your performance-related work/observations in #2358, where we would like to see whether qemu-img + nbdkit can perform as well as wget to scratch space followed by a local qemu-img convert.

alromeros (Collaborator) commented:

Since #2584 is merged, @k8scoder192 have you been able to replicate your issue again with the retry filter? If the download failure has been addressed I think we can consider closing the issue, unless we want to continue with the performance conversation here.

k8scoder192 (Author) commented:

@alromeros I can't test this unless the fix is backported to CDI v1.54. Can someone do that? 1.55 introduced issues (pod doesn't run as root, which fails with my current cluster config)

alromeros (Collaborator) commented:

> @alromeros I can't test this unless the fix is backported to CDI v1.54. Can someone do that? 1.55 introduced issues (pod doesn't run as root, which fails with my current cluster config)

@k8scoder192 sure, we can backport the fix to that version. I'll let you know once we release the version with the fix.


aglitke commented Mar 27, 2023

Backport PR is #2666


awels commented Mar 28, 2023

Hi, I just released v1.54.1 which contains the retry filter.

k8scoder192 (Author) commented:

@awels @aglitke @mhenriks
v1.54.1 seems to work, so I'd say the issue appears to be resolved.


awels commented Mar 31, 2023

Excellent, feel free to close the issue then.
