tegra: Do not cache command buffer BO's #13
Conversation
Just an idea: maybe we can prevent the need for this on a higher level? Something like always double-buffering command lists, and using fences to make sure they completed?
As a matter of fact, isn't it a bug to release an in-use BO anyway? This could also happen with any other BO, no?
The more I think about this, the more I feel like this is just hiding bugs. Buffers need to be kept around until the hardware is done working with them. This means, for instance, that surfaces being used as blit sources need to stay around until the command buffer containing the blit has been executed, and this patch does nothing to help with that. I think we should instead do something like creating a garbage list of BOs for each drm_tegra_job that gets freed once we know the hardware is done.
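The per-job garbage-list idea could be sketched roughly as below. This is a hypothetical illustration, not real libdrm-tegra code: all names (`job_track_bo`, `job_retire`, the struct layouts) are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mock of a buffer object: a refcount plus a link so it can sit
 * on a job's garbage list. */
struct bo {
    int refcount;
    struct bo *next;        /* link in a job's garbage list */
};

struct job {
    struct bo *garbage;     /* BOs that must outlive this job */
    int fence_signaled;     /* set once we know the HW is done */
};

/* Take a reference on every BO a job uses, so freeing it from the cache
 * while the HW still reads it becomes impossible. */
static void job_track_bo(struct job *job, struct bo *bo)
{
    bo->refcount++;
    bo->next = job->garbage;
    job->garbage = bo;
}

/* Called once the job's fence has signaled: drop the job's references.
 * Returns how many BOs reached refcount zero (real code would free or
 * recycle those into the cache here). */
static int job_retire(struct job *job)
{
    int released = 0;
    while (job->garbage) {
        struct bo *bo = job->garbage;
        job->garbage = bo->next;
        if (--bo->refcount == 0)
            released++;
    }
    return released;
}
```

The point of the sketch: the cache never sees a BO again until `job_retire()` has run for every job that touched it.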
Yeah, that's something to consider :) Your garbage-list idea sounds interesting. Maybe we could spawn a new thread (or have a persistent garbage-queue handler thread spawned on drm_tegra_new() invocation) on a successful job submission and wait for fence completion there, so we'd bump the BOs' refcounts on submission and unref them on fence completion. However, that means we would waste quite a lot of memory when the Host1x firewall is enabled, so we'd need some way to learn the firewall state to avoid that waste.
And yes, we'd probably need to clone the job's fence and attach it to the BOs. That could be done via a libdrm-tegra API extension, something like drm_tegra_bo_attach_fence(fence). A quick draft:
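The draft itself did not survive the page extraction. A hypothetical sketch of what such an attach-fence helper could look like (the struct layouts and names are illustrative, not the real libdrm-tegra API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Mock fence: in real code this would wrap a syncpoint threshold. */
struct drm_tegra_fence {
    bool signaled;                 /* set once the job completed */
};

struct drm_tegra_bo {
    struct drm_tegra_fence *fence; /* last job using this BO, or NULL */
};

/* Remember which job last used the BO. A real implementation would take
 * a reference on the fence here and drop it once it signals. */
static int drm_tegra_bo_attach_fence(struct drm_tegra_bo *bo,
                                     struct drm_tegra_fence *fence)
{
    bo->fence = fence;
    return 0;
}

/* A BO is only safe to recycle (or hand back to the cache) once its
 * attached fence, if any, has signaled. */
static bool drm_tegra_bo_idle(const struct drm_tegra_bo *bo)
{
    return bo->fence == NULL || bo->fence->signaled;
}
```

The cache lookup path would then skip (or wait on) any bucket entry for which `drm_tegra_bo_idle()` is false.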
We could just make some periodic calls from the client API to see if we can free up some memory. For instance, we might want to do this when we flush, and when we fail to allocate (and then retry).
@digetx: I suspect that it would be desirable to be allowed to replace the fence with a newer one. That way, we can keep track of the latest-use fence for a resource and wait for that when we want to either modify or delete it. Hmm, but maybe that becomes a bit tricky with usage from multiple queues (GR2D / GR3D)? What about imported BOs?
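The replace-with-a-newer-fence semantics could look like this. Again a hypothetical sketch with made-up names: attaching a fence supersedes the previous one, so the BO always tracks only its latest use.

```c
#include <assert.h>
#include <stddef.h>

struct fence {
    int refcount;
};

struct bo {
    struct fence *last_use;   /* fence of the most recent job touching us */
};

static void fence_ref(struct fence *f)   { f->refcount++; }
static void fence_unref(struct fence *f) { f->refcount--; /* free at 0 */ }

/* Replace the BO's tracked fence: reference the new one first, then drop
 * the old one, so the BO never points at a freed fence. */
static void bo_set_last_use(struct bo *bo, struct fence *f)
{
    fence_ref(f);
    if (bo->last_use)
        fence_unref(bo->last_use);
    bo->last_use = f;
}
```

As the comment above notes, a single slot breaks down with concurrent use from multiple queues (GR2D / GR3D): there one would likely need a fence per queue/context, which is essentially what dma-buf reservation objects provide.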
@kusma Sounds fun already :P Maybe DRM sync objects would be useful for us: https://www.phoronix.com/scan.php?page=news_item&px=DRM-Sync-Objects-4.13; we'd have to add fence<->syncobj wrappers, I guess. That also means we'd be incompatible with older kernels, but that's probably fine.
Actually, we should implement a drm_tegra_bo_cpu_prep() check in find_in_bucket() in tegra_bo_cache.c
So, I have looked at it today a bit more. The dmabuf provides all the necessary synchronization facilities (fences, reservation); we just need to support them in the kernel driver, I think. https://01.org/linuxgraphics/gfx-docs/drm/driver-api/dma-buf.html
Would this be helpful?
https://github.com/cyndis/linux/tree/host1x-fence-1
Back when it was posted we lacked a user in userspace and hence couldn't merge, but it sounds like you have a need.
Thierry
@thierryreding thank you very much for pointing it out! I'll take a detailed look at it very soon, but at a quick glance it looks like it would be useful at least to some degree. @cyndis implemented a dmabuf fence for the Host1x job, so a job's fence can be shared with another process. But right now, for the BO caching issue, we need to implement a dmabuf reservation that is attached to a BO. @cyndis, do you have anything else useful in your stash? :)
@cyndis's implementation looks interesting, however there is a nit. We can't use a raw syncpoint threshold value for fence signaling, because the syncpoint can wrap around; the syncpoint should be backed by a uint64_t seqno that is incremented on job completion. I'd prefer
Sorry, I don't have any other series ready :p Good point about the wraparound, though looking at struct dma_fence, seqno is only an unsigned int, so it doesn't have any more range than host1x's hardware syncpoints. I don't think we can do anything here unless that field is changed to a uint64_t, not that it probably matters in practice. If it is changed, we should make the seqno track the syncpt value while ignoring wraparound, because we can have syncpts that are incremented mid-submit, or potentially syncpts that are not related to any host1x job (e.g. used by nouveau, if support is implemented). I can continue developing the series; if you have some kind of userspace that would satisfy the maintainers, that would be great, so we could get this in :) The other option I have been thinking of is adding prefence support to nouveau in mesa.
@cyndis Yes, you are right, seqno is an unsigned int. It should be okay, because we can detect the wraparound like freedreno / etnaviv do. Right now, I don't think that we (grate-driver) have a real use for a sync file, i.e. explicit syncing, but we need the implicit syncing for the reservations. Implicit and explicit syncing are almost the same thing: for explicit, we wait for whatever fence userspace decides should complete before the job starts executing; for implicit, we wait for the fences attached to the BOs (all reads and writes must complete before a write, and all writes must complete before a read). So explicit / implicit fences would use the same 'host1x fence' implementation; at least your "Add support for DMA fences" patch looks very useful, and I'm going to try it today or tomorrow. I'm thinking about one potential use case for explicit syncing, like awaiting a VBLANK before writing to a displayed surface. I don't think core DRM maintainers' approval is really needed for the explicit syncing if it goes in together with the implicit; we need the implicit now and potentially the explicit later. BTW, https://www.kernel.org/doc/Documentation/sync_file.txt operates with the terms
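The freedreno / etnaviv style wraparound detection mentioned above boils down to comparing 32-bit seqnos via a signed difference, so a value that has just wrapped past zero still compares as "after" an older value near UINT32_MAX. A minimal sketch (the helper name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Wraparound-safe comparison of 32-bit sequence numbers: the cast to a
 * signed type makes values up to 2^31 apart compare correctly even
 * across the 0xffffffff -> 0 wrap. */
static bool seqno_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

For example, `seqno_after(1, 0xfffffffe)` is true, because the unsigned difference 3 is a small positive signed number, even though 1 < 0xfffffffe numerically.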
However, I think we actually need the explicit fencing now. It is needed to serialize jobs (3d / 2d); implicit fencing is concurrent, i.e. the job scheduling order would be arbitrary without explicit fencing. @cyndis, explicit fencing isn't as straightforward as your implementation makes it look: the problem is that if you submit a job to the HW with an in-fence (or prefence, as you called it), then jobs submitted later to the same channel that do not require explicit syncing would be blocked by that explicitly synced job. Probably explicit fences should always be CPU-waited? I'm also wondering what would happen if different channels simultaneously requested a host1x unit to wait for a syncpoint: would one channel be blocked on the Host1x bus until the other channel releases the host1x unit, or is there a host1x unit per channel? We need to fix the Tegra DRM IOCTL locking, because right now the whole submission path would be blocked by a CPU-waiting fence, and that's not good.
Well, implicit fencing is affected by the same blocking issues. Fencing in HW probably isn't useful at all without software channel arbitration: we'd have to track the used channels, pick a free one, and MLOCK the units. But that's way too much work to implement. Also, we don't need to wait for a fence if it has the same context as the job's client, because it is already serialized.
Both explicit and implicit fencing are now staged in https://github.com/grate-driver/linux/. I'm now wondering whether we should do anything about older kernels that do not support the required features. Maybe we should try to add some sort of lame-and-slow compatibility mode to libdrm? Or maybe it isn't worth the effort...
When the Host1x firewall is disabled, command buffers won't be cloned by
the kernel driver, leaving the possibility of recycling these buffers while
they are being executed by the HW; as a result, userspace may overwrite
their data, and that would end in tears. This is easily reproducible with
x11perf. Unfortunately this results in a very noticeable performance
degradation; we'll have to do something about it.
Signed-off-by: Dmitry Osipenko digetx@gmail.com