Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdi-importer: import virtualbox ova or tar archive in oneshot #2748

Closed

Conversation

lxs137
Copy link

@lxs137 lxs137 commented Jun 12, 2023

What this PR does / why we need it:

This PR try to make CDI support import Virtualbox OVA file (tar archive is also supported) in oneshot.

  • with nbdkit-tar-filter, we can extract vmdk file in remote OVA endpoint, which make the conversion don't need scratch disk space.

Without the PR, we need to download OVA file, and extract vmdk file in it, then convert it to QCOW2 file, upload the QCOW2 file into the PVC.

something need help

  • if there are multiple vmdk in one ova file, should find a way to let user specify the target vmdk file.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

cdi-importer: http-datasource can import (compressed) tar archive (or virtualbox OVA) streamly

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 12, 2023
@kubevirt-bot
Copy link
Contributor

Hi @lxs137. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@awels
Copy link
Member

awels commented Jun 12, 2023

/test all

@awels
Copy link
Member

awels commented Jun 12, 2023

Hi, thanks for the PR, looks like there are some unit tests failures. You can run the unit tests by calling make test-unit

@awels
Copy link
Member

awels commented Jun 12, 2023

/test pull-containerized-data-importer-e2e-hpp-latest

@@ -46,6 +46,7 @@ nbdkit-basic-filters
nbdkit-curl-plugin
nbdkit-xz-filter
nbdkit-gzip-filter
nbdkit-tar-filter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to understand if/how nbdkit does caching with this filter. I bet some caching is going on as it would be pretty expensive to always open/seek to the appropriate offset in the archive. If there is any significant on disk/memory caching I'd prefer we use simply the golang tar reader to seek to the file and download it to scratch. Then convert from scratch to target. Thus not getting nbdkit involved at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I can't get the point, why nbdkit cache will be the concern.
I believe nbdkit-tar-filter will only retrieve tar header block, if find the header name match "tar-entry", then get the target file data offset. (I think golang tar reader will do the same thing)

Copy link
Member

@mhenriks mhenriks Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to me like tar plugin writes to tmp which is something we have to avoid: https://github.com/libguestfs/nbdkit/blob/45b72f5bd8fc1b475fa130d06c86cd877bf595d5/filters/tar/tar.c#L146

But I'm not an expert on nbdkit.

@rwmjones care to comment on how curl + tar filter uses tmp space?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tar filter will not use tmp space (or more accurately, only a few bytes are used for a temporary file used to parse offsets printed by tar). Here's how it works:

https://gitlab.com/nbdkit/nbdkit/-/blob/462eb6cbf605c2969f6e851bb9076f8e3961cccb/filters/tar/tar.c#L138

Copy link

@rwmjones rwmjones Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise the code isn't very easy to follow. What it does is this:

(1) When we connect first time, we start the command tar --no-auto-compress -t --block-number -v -f - filename > /tmp/output

(2) We read over the first blocks of the tar file and feed them to stdin of tar. This does mean that if the file inside the tar is not early on inside the tar then we end up reading through the earlier files. But we do not write them to disk.

(3) As we go along we look at the output of tar which will eventually print the offset of the file we want in the tar to the randomly named /tmp/output file.

(4) We are then able to read at the right offset/size within the tar directly when the NBD client reads the disk image.

@mhenriks
Copy link
Member

/ok-to-test

@kubevirt-bot kubevirt-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 13, 2023
@mhenriks
Copy link
Member

I'd like to think about this PR more as "tar file support" then "ova support" so a couple suggestions. If the tar archive only has one file, use that regardless of extension. Support for other extensions like ".iso" ".raw" etc

What do you think @awels?

@rwmjones
Copy link

It should be possible to tell cheaply if a tar file contains either just one file or (more realistic and still usable) the large file is contained early on. Tar files are basically:

GLOBAL HEADER
FILE1 HEADER
FILE1 CONTENT
FILE2 HEADER
FILE2 CONTENT
&c

(There is no central index unlike zip or any modern format.) You could read the first NN bytes (eg. 10M or whatever you consider small enough), feed that into tar -t --block-number -f - filename and see if it produces a result.

In fact this would be a simple modification to nbdkit-tar-file, adding some sort of "read limit". (@ebblake)

@mhenriks
Copy link
Member

FYI: single file in a tar came from a discussion with @rmohr regarding how GCE imports disk images as described here: https://cloud.google.com/compute/docs/import/import-existing-image#requirements_for_the_image_file

@rwmjones
Copy link

Patch for above posted: https://listman.redhat.com/archives/libguestfs/2023-June/031837.html

@mhenriks
Copy link
Member

mhenriks commented Jun 13, 2023

FYI: single file in a tar came from a discussion with @rmohr regarding how GCE imports disk images as described here: https://cloud.google.com/compute/docs/import/import-existing-image#requirements_for_the_image_file

Supporting "tar.gz" is probably beyond the scope of this PR though.

Edit: for now (this pr), I think supporting the additional file extensions may be good enough

@lxs137 lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from a907711 to e158570 Compare June 14, 2023 07:21
@lxs137 lxs137 changed the title cdi-importer: import virtualbox ova file in oneshot cdi-importer: import virtualbox ova or other tar archive in oneshot Jun 14, 2023
@lxs137
Copy link
Author

lxs137 commented Jun 14, 2023

Update:

  • support extract file extensions: "qcow2", "vmdk", "vdi", "vhd", "vhdx", "iso", "raw"
  • if the tar archive only has one file, use that regardless of extension
  • fix e2e functional tests

@lxs137 lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from e158570 to 546fa75 Compare June 14, 2023 09:23
@lxs137 lxs137 changed the title cdi-importer: import virtualbox ova or other tar archive in oneshot cdi-importer: import virtualbox ova or tar archive in oneshot Jun 14, 2023
@lxs137
Copy link
Author

lxs137 commented Jun 14, 2023

/test pull-containerized-data-importer-e2e-ceph

1 similar comment
@lxs137
Copy link
Author

lxs137 commented Jun 15, 2023

/test pull-containerized-data-importer-e2e-ceph

@lxs137
Copy link
Author

lxs137 commented Jun 15, 2023

I am trying to use nbdkit serve a remote tar.gz archive with curl plugin, tar filter and gzip filter, but I find it will be super slow even execute qemu-img info over it.

fast:

nbdkit --foreground --readonly --exit-with-parent -U /tmp/nbdkit.sock --pidfile /tmp/nbdkit.pid --filter=tar -r curl url=http://xxx.tar tar-entry=xx.img &

// will cost near 0.1s
qemu-img info nbd+unix:///?socket=/tmp/nbdkit.sock

slow (60x):

nbdkit --foreground --readonly --exit-with-parent -U /tmp/nbdkit.sock --pidfile /tmp/nbdkit.pid --filter=tar --filter=gzip -r curl url=http://xxx.tar.gz tar-entry=xx.img &

// will cost near 6s
qemu-img info nbd+unix:///?socket=/tmp/nbdkit.sock

I find it's because when combined tar filter and gzip filter, nbdkit will try to download whole remote file.
If I use tar filter only, nbdkit just peek serval bytes in remote file.

@rwmjones will you take a look on it, thank you :-)

@rwmjones
Copy link

gzip format isn't seekableᵃ. Use xz format as described here: https://libguestfs.org/nbdkit-xz-filter.1.html#Getting-best-random-access-performance-from-xz

ᵃWell, it's a complicated story. It may be possible to make a modifiable seekable gzip: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-gzipped-disk-image/

@lxs137
Copy link
Author

lxs137 commented Jun 15, 2023

gzip format isn't seekableᵃ. Use xz format as described here: https://libguestfs.org/nbdkit-xz-filter.1.html#Getting-best-random-access-performance-from-xz

ᵃWell, it's a complicated story. It may be possible to make a modifiable seekable gzip: https://rwmj.wordpress.com/2022/12/01/creating-a-modifiable-gzipped-disk-image/

OK, I test a tar.xz file with small block size, the qemu-img command run more fast (3x slow than run qemu-img over tar file).

If user provide an unseekable compressed tar archive (gzip, xz with large block size), run qemu-img info command over nbdkit will be super slow.

@mhenriks What do you think?

asomers pushed a commit to asomers/nbdkit that referenced this pull request Jun 15, 2023
This can be used to ensure that the tar filter does not read
indefinite amounts of input when opening the tar file.

See: kubevirt/containerized-data-importer#2748 (comment)
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign awels for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lxs137 lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from 7a86f10 to 293f7fe Compare July 22, 2023 08:13
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jul 22, 2023
@lxs137
Copy link
Author

lxs137 commented Jul 22, 2023

@mhenriks Sorry for my late reply, I have done all the work. :-)

  • tar (inside file has known file extension): use nbdkit-tar-filter to extract inner file content
  • compressed tar: FormatReaders will seek to inner file content
  • tar (inside file has unknown file extension): can't handle

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2023
Signed-off-by: menyakun <lxs137@hotmail.com>
* if find a compressed file, it will look inner file format
* for archive format, user can read archive inner file content directly

Signed-off-by: menyakun <lxs137@hotmail.com>
* use nbdkit-tar-filter to extract archive inner file content streamly
* for unseekable compressed archive (gzip, xz with large block size), qemu-img over nbdkit will be super slow, so use FormatReaders to download inner file to scratch space (fallback to previous method)

Signed-off-by: menyakun <lxs137@hotmail.com>
Signed-off-by: menyakun <lxs137@hotmail.com>
@lxs137 lxs137 force-pushed the extract-ova-and-tar-archive-streamly branch from f122355 to d8b9611 Compare July 27, 2023 07:05
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 27, 2023
@lxs137
Copy link
Author

lxs137 commented Jul 27, 2023

/test pull-cdi-unit-test

@alromeros
Copy link
Collaborator

/test pull-containerized-data-importer-non-csi-hpp

@@ -52,6 +52,8 @@ func main() {
[]string{".qcow2", ".gz"},
[]string{".qcow2", ".xz"},
[]string{".qcow2", ".zst"},
[]string{".tar", ".gz"},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondering, if it can deal with file like xxx.raw.tar.gz ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a180285 Sure, actually we match file by magic number instead of extension.

@lxs137
Copy link
Author

lxs137 commented Aug 9, 2023

@mhenriks Sorry for my late reply, I have done all the work. :-)

  • tar (inside file has known file extension): use nbdkit-tar-filter to extract inner file content
  • compressed tar: FormatReaders will seek to inner file content
  • tar (inside file has unknown file extension): can't handle

@mhenriks Hi, do you have some time to review on this PR ~

@kubevirt-bot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 12, 2023
@awels
Copy link
Member

awels commented Aug 24, 2023

Sorry this needs a rebase before we can review it.

@lxs137
Copy link
Author

lxs137 commented Aug 25, 2023

I found we disable the nbdkit streamly importing in #2832 .
But in this PR, I try to extract the tar inner content with nbdkit filter (except compressed tar, use golang reader instead).
What do you think?
@awels @mhenriks

@lxs137
Copy link
Author

lxs137 commented Aug 26, 2023

I found we disable the nbdkit streamly importing in #2832 . But in this PR, I try to extract the tar inner content with nbdkit filter (except compressed tar, use golang reader instead). What do you think? @awels @mhenriks

Read some comments from #2809.

Maybe this PR should hold unitl ndkit-1.35.8 is availavle in centos 9 stream,
or split this PR into golang importing part and nbdkit importing part, we just focus on the golang importing part.

@kubevirt-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubevirt-bot kubevirt-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2023
@kubevirt-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubevirt-bot kubevirt-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 24, 2023
@kubevirt-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@kubevirt-bot
Copy link
Contributor

@kubevirt-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants