
GPU / CPU Transfers #45

Closed
grorg opened this issue Jan 31, 2018 · 3 comments

grorg (Contributor) commented Jan 31, 2018

This is Apple's proposal for GPU to CPU transfers, and vice versa.

We believe that for a first version (MVP), we can stick to an extremely simple model. If we later discover we need something more complicated for efficiency, we can add to the API.

partial interface HostAccessPass {
    Promise<ArrayBuffer> downloadData(GPUBuffer buffer, UnsignedLong offset, UnsignedLong length);
    void uploadData(GPUBuffer buffer, ArrayBuffer input, UnsignedLong offset);
};

Benefits

  • Asynchronous: It is impossible to synchronously read from a buffer, and therefore cause a GPU flush.
  • Portable: There is no ambiguity about when the site's JavaScript can request data to be downloaded or uploaded. (And it's not stateful.)
  • Well-defined: It is impossible to use this API to cause a data race between the CPU and GPU. Transfers will only ever occur when both the CPU and GPU are ready for them to occur.
  • Secure: ArrayBuffer automatically handles the situation of reading out of bounds.
  • Simple: Downloading and uploading are each a single easily-understandable call.
  • Implementable: Implementations which don't support mapping work naturally.
  • Optimizable: Web content doesn't need a special path for UMA vs. discrete-GPU scenarios, or to know that some buffers are CPU-accessible but slow on the GPU while others are fast on the GPU but not CPU-accessible. The implementation is more likely than the web app to handle all the cases in the most optimized way possible. (Write once, run anywhere.)
  • Easy to use: It's likely that any website code using this API will be correct. It's difficult (impossible?) to use this API wrong.
  • Style: The rest of the Web platform uses Promises and ArrayBuffers, and this API is no exception.

Drawbacks

All transfers require at least one copy.
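
To make the drawback concrete, here is a sketch of why at least one copy is unavoidable under this model: the ArrayBuffer handed to uploadData remains usable by JavaScript, so an implementation must snapshot it at call time to rule out races with the eventual GPU write. Everything below is a hypothetical mock, not part of the proposal.

```javascript
// Hypothetical sketch of what an implementation might do internally;
// none of these names are part of the proposed API.
class MockHostAccessPass {
    constructor() { this.pendingUploads = []; }

    uploadData(buffer, input, offset) {
        // Snapshot the caller's ArrayBuffer now (the unavoidable copy),
        // so later mutations by JavaScript can't race the GPU write.
        this.pendingUploads.push({ buffer, offset, staging: input.slice(0) });
    }

    // Called when the queue actually executes the pass.
    execute() {
        for (const { buffer, offset, staging } of this.pendingUploads) {
            new Uint8Array(buffer.backingStore).set(new Uint8Array(staging), offset);
        }
    }
}

// Usage with a mock buffer:
const gpuBuffer = { backingStore: new ArrayBuffer(8) };
const pass = new MockHostAccessPass();
const input = new Float32Array([1.5, 2.5]);
pass.uploadData(gpuBuffer, input.buffer, 0);
input[0] = 99; // mutating after uploadData has no effect: data was copied
pass.execute();
console.log(new Float32Array(gpuBuffer.backingStore)); // [1.5, 2.5]
```

The snapshot is what buys the "Well-defined" bullet above: the GPU only ever sees the bytes as they were at the uploadData call.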

Example

function performAsynchronousMath(queue, buffer, inputBuffer) {
    let uploadPass = queue.createHostAccessPass();
    uploadPass.uploadData(buffer, inputBuffer, 0);

    let computePass = queue.createComputePass();
    computePass.setState(...);
    computePass.setBuffer(buffer, ...);
    computePass.dispatch(...);

    let downloadPass = queue.createHostAccessPass();
    downloadPass.downloadData(buffer, 0, buffer.getLength()).then(function(arrayBuffer) {
        let typedArray = new Float32Array(arrayBuffer);
        for (let i = 0; i < typedArray.length; ++i) {
            console.log(String(typedArray[i]));
        }
    });

    queue.enqueue(uploadPass);
    queue.enqueue(computePass);
    queue.enqueue(downloadPass);
}
dmikis commented Jan 31, 2018

During the meeting the discussion went a bit into organisational matters, so I'll put my question here.

It may be a bit stupid, but here it goes: with this API, how will we protect a buffer we're trying to download from write-after-read (IIUC) hazards caused by commands further down the queue that may write into it? Is there such a problem at all?

For example:

queue.enqueue(computePass);
queue.enqueue(downloadPass);
queue.enqueue(anotherComputePassOverSameBuffer);

My understanding is that under the hood of a WebGPU implementation something like this will be happening:

enqueueStuffFromTheComputePass();
insertFence();
enqueueStuffFromTheOtherComputePass();

// concurrently:
waitForFence();
mapBufferAndCopyFromIt();

There's a way out of it: between the compute passes, insert a copy command that copies the contents of the buffer into some staging area, which can then be safely read from. But that's +1 copy, which may be undesirable on UMA GPUs (on dGPUs, AFAIK, staging is needed anyway).
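
The staging-copy way out could be sketched like this, with a mock queue that runs commands in submission order; all names here are hypothetical, not real WebGPU API.

```javascript
// Mock illustration of the staging-copy workaround. The snapshot is taken
// at the download's position in GPU order, so a later pass that writes the
// same buffer cannot be observed by the pending download (no WAR hazard).
function enqueueDownload(queue, buffer, offset, length) {
    return new Promise((resolve) => {
        queue.commands.push(() => {
            // GPU-timeline snapshot (the "+1 copy" mentioned above).
            const staging = buffer.backingStore.slice(offset, offset + length);
            resolve(staging); // CPU may read staging at any later time.
        });
    });
}

// Minimal mock queue that runs commands in submission order.
const queue = { commands: [], flush() { for (const c of this.commands) c(); } };
const buffer = { backingStore: new Float32Array([1, 2, 3, 4]).buffer };

const result = enqueueDownload(queue, buffer, 0, 16);
// A later pass overwrites the buffer...
queue.commands.push(() => new Float32Array(buffer.backingStore).fill(0));
queue.flush();
// ...yet the download still observes the pre-overwrite contents.
result.then((ab) => console.log(new Float32Array(ab))); // [1, 2, 3, 4]
```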

UPD. I think I've got a possible answer :) (moral: don't ask questions at 2 a.m.). It seems that at least 2 of the target APIs have a way to make the device wait for an event or fence signalled from the CPU. In Vulkan it's vkCmdWaitEvents; in D3D12 it's Wait on ID3D12CommandQueue. IDK about Metal, however.
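
That ordering (device stalls on a CPU-signalled event until the CPU has finished mapping and reading, only then runs the later pass) can be mocked in JavaScript with promises; every name below is hypothetical, a sketch of the vkCmdWaitEvents pattern rather than any real API.

```javascript
// Hypothetical mock: commands run in submission order, but a wait command
// stalls the queue until the CPU signals the event.
function makeEvent() {
    let signal;
    const waited = new Promise((resolve) => { signal = resolve; });
    return { waited, signal };
}

async function runQueue(commands) {
    for (const cmd of commands) await cmd(); // strict submission order
}

const log = [];
const readDone = makeEvent();

const done = runQueue([
    async () => log.push('computePass'),
    async () => {
        // CPU-side: map the buffer, copy out, then signal the device.
        log.push('cpu mapBufferAndCopyFromIt');
        readDone.signal();
    },
    // Device-side wait: the next pass may not write the buffer until the
    // CPU has finished reading it.
    async () => { await readDone.waited; log.push('anotherComputePass'); },
]);

done.then(() => console.log(log));
// ['computePass', 'cpu mapBufferAndCopyFromIt', 'anotherComputePass']
```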

devshgraphicsprogramming commented:

The way we do it in our engine is that we have an address (virtual-memory) allocator sitting on top of default upstream/downstream (staging) buffers that are persistently mapped (yes, you can have that in all the APIs).

Data is first written to these buffers, then copied to the actual device-native immutable buffer. If the data is too large to fit in the streaming buffer, it gets uploaded in parts.

This has many performance benefits: you do not want your actual GPU-side buffer to be mappable, to sit in some special DMA memory, or to be updateable; all of that causes serious performance drawbacks.

I don't think you should worry about the +1 copy on UMA devices, as the security considerations will require you to examine the contents being written/read anyway, so you might as well do that during the copy.
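
The "uploaded in parts" scheme described above could be sketched as follows: a fixed-size staging window is reused across chunks, each chunk being copied into staging and then to its offset in the device-local buffer. All names here are hypothetical, not the engine's or WebGPU's actual API.

```javascript
// Hypothetical sketch of chunked uploads through a fixed-size staging
// buffer. In a real implementation, writeChunk would copy into the
// persistently mapped staging buffer and record a GPU copy from staging
// into the device-local buffer at `offset`.
function uploadInChunks(stagingSize, data, writeChunk) {
    const src = new Uint8Array(data);
    for (let offset = 0; offset < src.length; offset += stagingSize) {
        const chunk = src.subarray(offset, Math.min(offset + stagingSize, src.length));
        writeChunk(offset, chunk);
    }
}

// Usage with a mock device-local buffer and a 4-byte staging window:
const deviceLocal = new Uint8Array(10);
const payload = new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).buffer;
uploadInChunks(4, payload, (offset, chunk) => deviceLocal.set(chunk, offset));
console.log(deviceLocal); // 1..10, uploaded in 4-, 4-, and 2-byte parts
```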

litherum (Contributor) commented:

I'm going to retract this proposal because the required extra internal copy is distasteful to the WebGPU CG. Instead, we are debating the merits of the proposals here.
