
GPU / CPU Transfers #45

Closed
grorg opened this issue Jan 31, 2018 · 3 comments

grorg (Contributor) commented Jan 31, 2018

This is Apple's proposal for GPU to CPU transfers, and vice versa.

We believe that for a first version (MVP), we can stick to an extremely simple model. If we later discover we need something more complicated for efficiency, we can add to the API.

partial interface HostAccessPass {
    Promise<ArrayBuffer> downloadData(GPUBuffer buffer, UnsignedLong offset, UnsignedLong length);
    void uploadData(GPUBuffer buffer, ArrayBuffer input, UnsignedLong offset);
};

Benefits

  • Asynchronous: It is impossible to synchronously read from a buffer, and therefore cause a GPU flush.
  • Portable: There is no ambiguity about when the site's JavaScript can request data to be downloaded or uploaded. (And it's not stateful.)
  • Well-defined: It is impossible to use this API to cause a data race between the CPU and GPU. Transfers will only ever occur when both the CPU and GPU are ready for them to occur.
  • Secure: ArrayBuffer automatically handles the situation of reading out of bounds.
  • Simple: Downloading and uploading are each a single easily-understandable call.
  • Implementable: Implementations which don't support mapping work naturally.
  • Optimizable: Web content doesn't need a special path for UMA vs. discrete-GPU scenarios, or to know that some buffers are CPU-accessible but slow on the GPU while others are fast on the GPU but not CPU-accessible. The implementation is more likely than the web app to handle all the cases in the most optimized way possible. (Write once, run anywhere.)
  • Easy to use: It's likely that any website code using this API will be correct. It's difficult (impossible?) to use this API wrong.
  • Style: The rest of the Web platform uses Promises and ArrayBuffers, and this API is no exception.

Drawbacks

All transfers require at least one copy.
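
To make the drawback concrete, here is a sketch of why at least one copy is unavoidable under this model: the ArrayBuffer handed to uploadData remains usable by JavaScript, so an implementation must snapshot it at call time to rule out races with the eventual GPU write. Everything below is a hypothetical mock, not part of the proposal.

```javascript
// Hypothetical sketch of what an implementation might do internally;
// none of these names are part of the proposed API.
class MockHostAccessPass {
    constructor() { this.pendingUploads = []; }

    uploadData(buffer, input, offset) {
        // Snapshot the caller's ArrayBuffer now (the unavoidable copy),
        // so later mutations by JavaScript can't race the GPU write.
        this.pendingUploads.push({ buffer, offset, staging: input.slice(0) });
    }

    // Called when the queue actually executes the pass.
    execute() {
        for (const { buffer, offset, staging } of this.pendingUploads) {
            new Uint8Array(buffer.backingStore).set(new Uint8Array(staging), offset);
        }
    }
}

// Usage with a mock buffer:
const gpuBuffer = { backingStore: new ArrayBuffer(8) };
const pass = new MockHostAccessPass();
const input = new Float32Array([1.5, 2.5]);
pass.uploadData(gpuBuffer, input.buffer, 0);
input[0] = 99; // mutating after uploadData has no effect: data was copied
pass.execute();
console.log(new Float32Array(gpuBuffer.backingStore)); // [1.5, 2.5]
```

The snapshot is what buys the "Well-defined" bullet above: the GPU only ever sees the bytes as they were at the uploadData call.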

Example

function performAsynchronousMath(queue, buffer, inputBuffer) {
    let uploadPass = queue.createHostAccessPass();
    uploadPass.uploadData(buffer, inputBuffer, 0);

    let computePass = queue.createComputePass();
    computePass.setState(...);
    computePass.setBuffer(buffer, ...);
    computePass.dispatch(...);

    let downloadPass = queue.createHostAccessPass();
    downloadPass.downloadData(buffer, 0, buffer.getLength()).then(function(arrayBuffer) {
        let typedArray = new Float32Array(arrayBuffer);
        for (let i = 0; i < typedArray.length; ++i) {
            console.log(String(typedArray[i]));
        }
    });

    queue.enqueue(uploadPass);
    queue.enqueue(computePass);
    queue.enqueue(downloadPass);
}
dmikis commented Jan 31, 2018

During the meeting the discussion went a bit into organisational matters, so I'll put my question here.

It may be a bit stupid, but here it goes: with this API, how will we protect a buffer we're trying to download from write-after-read (IIUC) hazards caused by commands further down the queue that may write into it? Is there such a problem at all?

For example:

queue.enqueue(computePass);
queue.enqueue(downloadPass);
queue.enqueue(anotherComputePassOverSameBuffer);

My understanding is that under the hood of a WebGPU implementation something like this will be happening:

enqueueStuffFromTheComputePass();
insertFence();
enqueueStuffFromTheOtherComputePass();

// concurrently:
waitForFence();
mapBufferAndCopyFromIt();

There's a way out of it: between the compute passes, insert a copy command that copies the contents of the buffer into some staging area, which can then be safely read from. But that's +1 copy, which may be undesirable on UMA GPUs (on dGPUs, AFAIK, staging is needed anyway).
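
The staging-copy way out could be sketched like this, with a mock queue that runs commands in submission order; all names here are hypothetical, not real WebGPU API.

```javascript
// Mock illustration of the staging-copy workaround. The snapshot is taken
// at the download's position in GPU order, so a later pass that writes the
// same buffer cannot be observed by the pending download (no WAR hazard).
function enqueueDownload(queue, buffer, offset, length) {
    return new Promise((resolve) => {
        queue.commands.push(() => {
            // GPU-timeline snapshot (the "+1 copy" mentioned above).
            const staging = buffer.backingStore.slice(offset, offset + length);
            resolve(staging); // CPU may read staging at any later time.
        });
    });
}

// Minimal mock queue that runs commands in submission order.
const queue = { commands: [], flush() { for (const c of this.commands) c(); } };
const buffer = { backingStore: new Float32Array([1, 2, 3, 4]).buffer };

const result = enqueueDownload(queue, buffer, 0, 16);
// A later pass overwrites the buffer...
queue.commands.push(() => new Float32Array(buffer.backingStore).fill(0));
queue.flush();
// ...yet the download still observes the pre-overwrite contents.
result.then((ab) => console.log(new Float32Array(ab))); // [1, 2, 3, 4]
```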

UPD. I think I've got a possible answer :) (moral: don't ask questions at 2 a.m.). It seems that at least 2 of the target APIs have a way to make the device wait for an event or fence signalled from the CPU. In Vulkan it's vkCmdWaitEvents; in D3D12 it's Wait on ID3D12CommandQueue. IDK about Metal, however.
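
That ordering (device stalls on a CPU-signalled event until the CPU has finished mapping and reading, only then runs the later pass) can be mocked in JavaScript with promises; every name below is hypothetical, a sketch of the vkCmdWaitEvents pattern rather than any real API.

```javascript
// Hypothetical mock: commands run in submission order, but a wait command
// stalls the queue until the CPU signals the event.
function makeEvent() {
    let signal;
    const waited = new Promise((resolve) => { signal = resolve; });
    return { waited, signal };
}

async function runQueue(commands) {
    for (const cmd of commands) await cmd(); // strict submission order
}

const log = [];
const readDone = makeEvent();

const done = runQueue([
    async () => log.push('computePass'),
    async () => {
        // CPU-side: map the buffer, copy out, then signal the device.
        log.push('cpu mapBufferAndCopyFromIt');
        readDone.signal();
    },
    // Device-side wait: the next pass may not write the buffer until the
    // CPU has finished reading it.
    async () => { await readDone.waited; log.push('anotherComputePass'); },
]);

done.then(() => console.log(log));
// ['computePass', 'cpu mapBufferAndCopyFromIt', 'anotherComputePass']
```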

devshgraphicsprogramming commented:

The way we do it in our engine is that we have an address (virtual-memory) allocator sitting on top of default upstream/downstream (staging) buffers that are persistently mapped (yes, you can have that in all the APIs).

Data is first written to these buffers, then copied to the actual device-native immutable buffer. If the data is too large to fit in the streaming buffer, it gets uploaded in parts.

This has many performance benefits: you do not want your actual GPU-side buffer to be mappable, to sit in some special DMA memory, or to be updateable; all of that causes serious performance drawbacks.

I don't think you should worry about the +1 copy on UMA devices, as the security considerations will require you to examine the contents being written/read anyway, so you might as well do that during the copy.
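
The "uploaded in parts" scheme described above could be sketched as follows: a fixed-size staging window is reused across chunks, each chunk being copied into staging and then to its offset in the device-local buffer. All names here are hypothetical, not the engine's or WebGPU's actual API.

```javascript
// Hypothetical sketch of chunked uploads through a fixed-size staging
// buffer. In a real implementation, writeChunk would copy into the
// persistently mapped staging buffer and record a GPU copy from staging
// into the device-local buffer at `offset`.
function uploadInChunks(stagingSize, data, writeChunk) {
    const src = new Uint8Array(data);
    for (let offset = 0; offset < src.length; offset += stagingSize) {
        const chunk = src.subarray(offset, Math.min(offset + stagingSize, src.length));
        writeChunk(offset, chunk);
    }
}

// Usage with a mock device-local buffer and a 4-byte staging window:
const deviceLocal = new Uint8Array(10);
const payload = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).buffer;
uploadInChunks(4, payload, (offset, chunk) => deviceLocal.set(chunk, offset));
console.log(deviceLocal); // 1..10, uploaded in 4-, 4-, and 2-byte parts
```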

litherum (Contributor) commented:

I'm going to retract this proposal because the required extra internal copy is distasteful to the WebGPU CG. Instead, we are debating the merits of the proposals here.
