HTTP Datasource can't handle s3 presigned URL for ISO file #2737

Closed
lxs137 opened this issue Jun 2, 2023 · 23 comments

@lxs137

lxs137 commented Jun 2, 2023

What happened:
I created a DataVolume with an HTTP datasource. The HTTP URL is an S3 presigned URL, signed only for the HTTP method 'GET' (which means any HTTP method other than 'GET' is forbidden by the S3 policy).

I got this error from cdi-importer; it looks like nbdkit tries to do an HTTP 'HEAD' request on the S3 presigned URL, which is forbidden.

E0601 12:44:17.844208       1 importer.go:177] qemu-img: Could not open 'nbd+unix:///?socket=/tmp/nbdkit.sock': Requested export not available
, qemu-img execution failed: exit status 1 Log line from nbdkit: nbdkit: curl[1]: error: problem doing HEAD request to fetch size of URL [http://xxx/]: HTTP response code said error: The requested URL returned error: 403
Unable to convert source data to target format

kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).initDefaultPhases.func6
	pkg/importer/data-processor.go:242
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause
	pkg/importer/data-processor.go:275
kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData
	pkg/importer/data-processor.go:184
main.handleImport
	cmd/cdi-importer/importer.go:174
main.main
	cmd/cdi-importer/importer.go:140
runtime.main
	GOROOT/src/runtime/proc.go:250
runtime.goexit
	GOROOT/src/runtime/asm_amd64.s:1571

What you expected to happen:
The import should succeed even though the presigned URL only permits GET requests.

How to reproduce it (as minimally and precisely as possible):
Create a DataVolume with an HTTP source pointing to an S3 presigned URL (signed for GET only) of an ISO file.

Additional context:
I can't use the S3 datasource, because I don't think it's safe to use S3 AK/SK credentials just to download something.
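
For reference, a minimal sketch (my own illustration, using the AWS SDK for Go v2 with a hypothetical bucket and key) of how such a GET-only presigned URL is typically produced; S3 signs the request method into the URL, so a URL presigned for GetObject returns 403 for a HEAD request:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}

	// Presign a GetObject request; the signature covers the GET method only,
	// so a HEAD request against the resulting URL is rejected with 403 by S3.
	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))
	req, err := presigner.PresignGetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String("my-bucket"),       // hypothetical bucket
		Key:    aws.String("images/test.iso"), // hypothetical key
	}, s3.WithPresignExpires(12*time.Hour))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(req.URL)
}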

Environment:

  • CDI version: 1.56.0
  • Others: nbdkit-1.30.1-2.el9
@lxs137 lxs137 added the kind/bug label Jun 2, 2023
@lxs137
Author

lxs137 commented Jun 2, 2023

I tested different file formats behind the S3 presigned URL:

  • qcow2, vmdk: work
  • iso: does not work

If we handle the ISO file format correctly, the error will not occur.

@lxs137 lxs137 changed the title HTTP Datasource can't handle s3 presigned URL HTTP Datasource can't handle s3 presigned URL for ISO file Jun 2, 2023
@awels
Member

awels commented Jun 2, 2023

Thanks for the report. So I am a little confused, because we do detect if an image is RAW (RAW is the same as ISO) and if we detect that, nbdkit should not be involved at all since there is no conversion needed. We need to investigate why the detection is not working, as we do have some functional tests that are downloading iso files and it seems to be working there.

@lxs137
Author

lxs137 commented Jun 3, 2023

@awels I dug into the http-processor: for an S3 presigned URL of an ISO image, hs.readers.Convert is false and hs.brokenForQemuImg is true, so it falls through to the nbdkit path:

if hs.readers.Convert {
	if hs.brokenForQemuImg || hs.readers.Archived || hs.customCA != "" {
		return ProcessingPhaseTransferScratch, nil
	}
} else {
	if hs.readers.Archived || hs.customCA != "" {
		return ProcessingPhaseTransferDataFile, nil
	}
}

@rwmjones

rwmjones commented Jun 5, 2023

The error is:

nbdkit: curl[1]: error: problem doing HEAD request to fetch size of URL [http://xxx/]: HTTP response code said error: The requested URL returned error: 403

This comes from the following code:

https://gitlab.com/nbdkit/nbdkit/-/blob/be9bf539aa2298cf0e86ddf5f4a52cc320ae3764/plugins/curl/pool.c#L319-L336

What we are doing here is a HEAD request when initially opening the URL in order to find out the size of the remote file. This is fairly fundamental to the way that the plugin (and NBD) works since it always needs to know the size of the disk image. The HEAD request is also needed for detecting servers which don't support range requests (and IIRC AWS S3 is one of those too).

So I guess I have a few questions ...

  • Do you have an example URL we can use for testing?
  • Can you change the S3 datasource to permit HEAD requests as well as GET?
  • Does this source allow range requests?
  • Do you know the size of the ISO through some other means?

@awels
Member

awels commented Jun 5, 2023

> @awels I dug into the http-processor: for an S3 presigned URL of an ISO image, hs.readers.Convert is false and hs.brokenForQemuImg is true, so it falls through to the nbdkit path:
>
> if hs.readers.Convert {
> 	if hs.brokenForQemuImg || hs.readers.Archived || hs.customCA != "" {
> 		return ProcessingPhaseTransferScratch, nil
> 	}
> } else {
> 	if hs.readers.Archived || hs.customCA != "" {
> 		return ProcessingPhaseTransferDataFile, nil
> 	}
> }

So I think that is the real bug: if no conversion is needed, we should be able to download directly with the Go client. We shouldn't start nbdkit at that point.
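
As a rough illustration of the direct-download path being suggested (a minimal sketch, not CDI's actual importer code; the URL and target path are hypothetical), a plain Go HTTP client needs only a single GET and never issues a HEAD request:

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// A single GET is enough; no HEAD request is involved, so a presigned URL
	// that only allows GET works fine.
	resp, err := http.Get("https://example-bucket.s3.amazonaws.com/test.iso?X-Amz-Signature=...") // hypothetical presigned URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status: %s", resp.Status)
	}

	out, err := os.Create("/data/disk.img") // hypothetical target path
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Stream the ISO straight to the target file; no conversion, no scratch space.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}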

@mhenriks
Member

mhenriks commented Jun 6, 2023

> > @awels I dug into the http-processor: for an S3 presigned URL of an ISO image, hs.readers.Convert is false and hs.brokenForQemuImg is true, so it falls through to the nbdkit path:
> >
> > if hs.readers.Convert {
> > 	if hs.brokenForQemuImg || hs.readers.Archived || hs.customCA != "" {
> > 		return ProcessingPhaseTransferScratch, nil
> > 	}
> > } else {
> > 	if hs.readers.Archived || hs.customCA != "" {
> > 		return ProcessingPhaseTransferDataFile, nil
> > 	}
> > }
>
> So I think that is the real bug: if no conversion is needed, we should be able to download directly with the Go client. We shouldn't start nbdkit at that point.

Yeah, I wonder if we can always return ProcessingPhaseTransferDataFile in that else branch.

@lxs137
Author

lxs137 commented Jun 6, 2023

> The error is:
>
> nbdkit: curl[1]: error: problem doing HEAD request to fetch size of URL [http://xxx/]: HTTP response code said error: The requested URL returned error: 403
>
> This comes from the following code:
>
> https://gitlab.com/nbdkit/nbdkit/-/blob/be9bf539aa2298cf0e86ddf5f4a52cc320ae3764/plugins/curl/pool.c#L319-L336
>
> What we are doing here is a HEAD request when initially opening the URL in order to find out the size of the remote file. This is fairly fundamental to the way that the plugin (and NBD) works since it always needs to know the size of the disk image. The HEAD request is also needed for detecting servers which don't support range requests (and IIRC AWS S3 is one of those too).
>
> So I guess I have a few questions ...
>
>   • Do you have an example URL we can use for testing?
>   • Can you change the S3 datasource to permit HEAD requests as well as GET?
>   • Does this source allow range requests?
>   • Do you know the size of the ISO through some other means?

@rwmjones

  • For an S3 pre-signed URL, we can only presign the URL for either GET or HEAD, not both (that's an S3 rule).
  • The source allows range requests ("Accept-Ranges").
  • I think there is no way to get the size of the ISO without actually downloading it (and reading the "Content-Length" response header).

@lxs137
Author

lxs137 commented Jun 6, 2023

@awels @mhenriks

I think we introduced nbdkit to save scratch space when processing images.

There are currently 3 exception cases in the HTTP datasource:

  1. brokenForQemuImg=true, which covers the S3 pre-signed URL case and HTTP servers that don't support range requests ("Accept-Ranges").
  2. readers.Archived=true, which means the file is xz or gz compressed.
  3. customCA is set, which means we should do the TLS handshake with a user-provided cert.

This bug hits case 1, so we should download the file to scratch space (or teach the nbdkit-curl-plugin to handle S3 presigned URLs; I found no neat way to do this).

For case 2, I think we can improve the code: we could use the nbdkit-tar-filter to read an image inside a remote (compressed) tar file.

For case 3, I think the current code has a bug, since the nbdkit-curl-plugin can already handle a custom TLS cert:

if certDir != "" {
	pluginArgs = append(pluginArgs, fmt.Sprintf("cainfo=%s/%s", certDir, "tls.crt"))
}

I think it would also be good to add some e2e tests to cover these exception cases.

@rwmjones

rwmjones commented Jun 6, 2023

> For s3 pre-signed URL, we can only presign the url for either GET or HEAD and not both (boto/boto3#1350 (comment)).

Crazy .. Anyway it should be possible to fall back to using GET, grabbing the content-length header and closing the connection. Let me see how possible that is.
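
For illustration, here is a minimal Go sketch of that fallback idea (my own sketch, not the actual nbdkit patch; the URL is hypothetical): try HEAD first, and if that fails, issue a GET, take the size from the Content-Length header, and close the connection without reading the body:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// remoteSize tries HEAD first and falls back to GET, abandoning the GET
// transfer as soon as the headers (and therefore Content-Length) are available.
func remoteSize(url string) (int64, error) {
	if resp, err := http.Head(url); err == nil {
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK && resp.ContentLength >= 0 {
			return resp.ContentLength, nil
		}
	}

	// Fallback: GET, take Content-Length from the response headers, then
	// close the body immediately so the file is not actually downloaded.
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	if resp.ContentLength < 0 {
		return 0, fmt.Errorf("server did not report Content-Length")
	}
	return resp.ContentLength, nil
}

func main() {
	size, err := remoteSize("https://example-bucket.s3.amazonaws.com/test.iso?X-Amz-Signature=...") // hypothetical presigned URL
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("size:", size)
}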

@lxs137
Author

lxs137 commented Jun 6, 2023

> > For s3 pre-signed URL, we can only presign the url for either GET or HEAD and not both (boto/boto3#1350 (comment)).
>
> Crazy .. Anyway it should be possible to fall back to using GET, grabbing the content-length header and closing the connection. Let me see how possible that is.

Yes, we only use the HEAD request to find the size of the HTTP file; a GET request with a "Range: bytes=0-0" header can also "peek" at the size of the file without downloading it.

Anyway, this is nbdkit territory; I am not sure whether it is appropriate to handle this case in the nbdkit-curl-plugin.
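
A minimal sketch of that "peek" idea (my own illustration, hypothetical URL): with Range: bytes=0-0 the body is a single byte, so the total size has to be read from the Content-Range header ("bytes 0-0/<total>") rather than from Content-Length, assuming the server honors range requests:

package main

import (
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
)

// peekSize asks for the first byte only and parses the total size out of the
// Content-Range header, e.g. "bytes 0-0/51380224".
func peekSize(url string) (int64, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Range", "bytes=0-0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return 0, fmt.Errorf("server ignored the range request: %s", resp.Status)
	}

	cr := resp.Header.Get("Content-Range") // "bytes 0-0/<total>"
	idx := strings.LastIndex(cr, "/")
	if idx < 0 {
		return 0, fmt.Errorf("unexpected Content-Range: %q", cr)
	}
	return strconv.ParseInt(cr[idx+1:], 10, 64)
}

func main() {
	size, err := peekSize("https://example-bucket.s3.amazonaws.com/test.iso?X-Amz-Signature=...") // hypothetical presigned URL
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("size:", size)
}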

@rwmjones

rwmjones commented Jun 6, 2023

Potential patch for curl plugin posted: https://listman.redhat.com/archives/libguestfs/2023-June/031723.html

> Yes, we only use the HEAD request to find the size of the HTTP file; a GET request with a "Range: bytes=0-0" header can also "peek" at the size of the file without downloading it.

This is an idea too, but I wonder if all servers really support zero-length ranges. Do you have an AWS URL we can test against?

@lxs137
Author

lxs137 commented Jun 6, 2023

@rwmjones

Nice patch 😄

I generated an S3 presigned URL of an ISO file for testing (it will expire in 12 hours).

@rwmjones

rwmjones commented Jun 6, 2023

Thanks for the URL, I tested it here and it works with my posted patches.

Unfortunately setting Range: bytes=0-0 doesn't work as it causes S3 to return Content-Length: 1 (which is correct for the request, but not the size of the actual file which is what we want).

@rwmjones

rwmjones commented Jun 6, 2023

My test:

$ ./nbdkit curl '[VERY LONG URL]' --run 'nbdinfo $uri'
nbdkit: curl[1]: error: problem doing HEAD request to fetch size of URL [VERY LONG URL]: HTTP response code said error: The requested URL returned error: 403
protocol: newstyle-fixed without TLS, using structured packets
export="":
	export-size: 51380224 (49M)
	content: DOS/MBR boot sector; partition 2 : ID=0xef, start-CHS (0x3ff,254,63), end-CHS (0x3ff,254,63), startsector 300, 2880 sectors
	uri: nbd://localhost:10809/
	contexts:
		base:allocation
	is_rotational: false
	is_read_only: false
	can_cache: false
	can_df: true
	can_fast_zero: true
	can_flush: false
	can_fua: false
	can_multi_conn: false
	can_trim: false
	can_zero: true

So it still prints the error on the fallback path, but at least it doesn't fail.

@lxs137
Author

lxs137 commented Jun 6, 2023

> Potential patch for curl plugin posted: https://listman.redhat.com/archives/libguestfs/2023-June/031723.html
>
> > Yes, we only use the HEAD request to find the size of the HTTP file; a GET request with a "Range: bytes=0-0" header can also "peek" at the size of the file without downloading it.
>
> This is an idea too, but I wonder if all servers really support zero-length ranges. Do you have an AWS URL we can test against?

@mhenriks Do we still need a quick fix in the CDI project, or should we just wait for the nbdkit patch to be merged and then upgrade our nbdkit RPM version?

@mhenriks
Member

mhenriks commented Jun 6, 2023

> Thanks for the URL, I tested it here and it works with my posted patches.
>
> Unfortunately setting Range: bytes=0-0 doesn't work as it causes S3 to return Content-Length: 1 (which is correct for the request, but not the size of the actual file which is what we want).

Not familiar with nbdkit internals, but am curious if the patch works with chunked content encoding as well?

@mhenriks
Member

mhenriks commented Jun 6, 2023

> > Potential patch for curl plugin posted: https://listman.redhat.com/archives/libguestfs/2023-June/031723.html
> >
> > > Yes, we only use the HEAD request to find the size of the HTTP file; a GET request with a "Range: bytes=0-0" header can also "peek" at the size of the file without downloading it.
> >
> > This is an idea too, but I wonder if all servers really support zero-length ranges. Do you have an AWS URL we can test against?
>
> @mhenriks Do we still need a quick fix in the CDI project, or should we just wait for the nbdkit patch to be merged and then upgrade our nbdkit RPM version?

I just posted #2743. We can add the nbdkit fix later and maybe eliminate the brokenForQemuImg flag, allowing us to not use scratch space for S3-hosted qcow2.

@rwmjones

rwmjones commented Jun 6, 2023

We never send Transfer-encoding: chunked so web servers would not use it. In any case I don't think chunked encoding has much of a future since HTTP/2 and above implement multiplexing over TCP (which we will use if available).

@mhenriks
Member

mhenriks commented Jun 6, 2023

> We never send Transfer-encoding: chunked so web servers would not use it. In any case I don't think chunked encoding has much of a future since HTTP/2 and above implement multiplexing over TCP (which we will use if available).

> All HTTP/1.1 applications that receive entities MUST accept the
> "chunked" transfer-coding (section 3.6), thus allowing this mechanism
> to be used for messages when the message length cannot be determined
> in advance.

https://www.rfc-editor.org/rfc/rfc2616#section-4.4

Edit: my point is that you may want to do an HTTP 1.0 connection for the fallback GET request in your patch.

@rwmjones

rwmjones commented Jun 6, 2023

Huh the more you know ...

Looks like curl does deal with this case correctly, so we're all good:
https://github.com/curl/curl/blob/6e4fedeef7ff2aefaa4841a4710d6b8a37c6ae7c/lib/http.c#L3562

Edit: So the fallback GET request is an issue if the server doesn't send a valid Content-Length header (for example if it is generating the content on the fly). That's a case for forcing HTTP 1.0 or otherwise disabling chunked encoding. Not sure how well servers would handle that.

Edit 2: Added to discussion upstream.
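
As a small illustration of the edge case being discussed (my own sketch, hypothetical URL): when a server streams a chunked response without a Content-Length header, Go's HTTP client reports the size as unknown, which is exactly the situation where a fallback GET cannot determine the export size:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("https://example.com/generated.iso") // hypothetical dynamically generated content
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// With "Transfer-Encoding: chunked" there is no Content-Length header,
	// so the reported length is -1 (unknown).
	fmt.Println("transfer encoding:", resp.TransferEncoding)
	if resp.ContentLength < 0 {
		fmt.Println("size unknown: server did not send Content-Length")
	} else {
		fmt.Println("size:", resp.ContentLength)
	}
}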

asomers pushed a commit to asomers/nbdkit that referenced this issue Jun 15, 2023
For a future commit we will need to be able to simulate an Amazon AWS
S3 server, which in some circumstances will fail HEAD requests (403
Forbidden error) but allow GET requests to the same URL.  Add this
capability to our test web server code.

Currently unused, so all existing callers pass an extra ", false"
parameter.

Related: kubevirt/containerized-data-importer#2737
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
asomers pushed a commit to asomers/nbdkit that referenced this issue Jun 15, 2023
Some servers do not support HEAD for requesting the headers.  If the
HEAD request fails, fallback to using the GET method, abandoning the
transfer as soon as possible after the headers have been received.

Fixes: kubevirt/containerized-data-importer#2737
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
@akalenyu
Collaborator

Summing up SIG storage comments:

So this is now merged in development branches: https://gitlab.com/nbdkit/nbdkit/-/commit/4b111535c803e896e6bc4cd020651db861d1d8b1

We should decide whether we want this backported to any stable branch in the short term, or just let it make its way to us and follow up by bumping RPM deps similarly to #2741.

Note that if we want this in, we need to open backport bugs specifying the versions.

@alromeros
Collaborator

alromeros commented Aug 14, 2023

Hey @lxs137, this should've been temporarily fixed with #2841. We plan to release this version today, so feel free to try it when ready and confirm if your issue persists.

Edit: v1.57 is already out https://github.com/kubevirt/containerized-data-importer/releases/tag/v1.57.0

@lxs137 lxs137 closed this as completed Aug 19, 2023