
tegra: Do not cache command buffer BO's #13

Closed

Conversation

@digetx (Member) commented Jun 26, 2017

When the Host1x firewall is disabled, command buffers are not cloned by the
kernel driver, which leaves the possibility of these buffers being recycled
while the HW is still executing them; as a result userspace may overwrite
their data, and that ends in tears. This is easily reproducible with x11perf.
Unfortunately, not caching the command buffer BOs results in a very noticeable
performance degradation, so we will have to do something about it.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>

@kusma (Member) commented Jun 26, 2017

Just an idea: maybe we can prevent the need for this at a higher level? Something like always double-buffering command lists, and using fences to make sure they have completed?
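
A rough sketch of the double-buffering idea; the cmdlist/slot types and the fence wait/free helpers are assumptions, not the actual grate-driver API:

struct cmdlist_slot {
	struct drm_tegra_pushbuf *pushbuf;
	struct drm_tegra_fence *fence;	/* fence of the last submission using this slot */
};

struct cmdlist_ring {
	struct cmdlist_slot slots[2];
	unsigned int index;
};

/* Flip to the other command list; block only if the HW still owns it. */
static struct cmdlist_slot *cmdlist_next(struct cmdlist_ring *ring)
{
	struct cmdlist_slot *slot;

	ring->index ^= 1;
	slot = &ring->slots[ring->index];

	if (slot->fence) {
		drm_tegra_fence_wait_timeout(slot->fence, ~0UL /* no timeout */);
		drm_tegra_fence_free(slot->fence);
		slot->fence = NULL;
	}

	return slot;
}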

@kusma (Member) commented Jun 26, 2017

As a matter of fact, isn't it a bug to release an in-use BO anyway? This could also happen with any other BO, no?

@kusma (Member) commented Jun 26, 2017

The more I think about this, the more I feel like this is just hiding bugs. Buffers need to be kept around until the hardware is done working with them. This means, for instance, that surfaces being used as blit sources need to stay around until the command buffer containing the blit has been executed. This patch does nothing to help with that.

I think we should instead do something like creating a garbage-list of BOs for each drm_tegra_job that gets freed once we know the hardware is done.
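
A minimal sketch of what such a per-job garbage list could look like; the drm_tegra_job_gc type and its fields are assumptions:

#include <stdlib.h>

struct drm_tegra_job_gc {
	struct drm_tegra_bo **bos;	/* BOs referenced by the job */
	unsigned int num_bos;
	struct drm_tegra_fence *fence;	/* signals when the job has completed */
};

/* Release the job's BOs only once the hardware is known to be done. */
static void drm_tegra_job_gc_free(struct drm_tegra_job_gc *gc)
{
	unsigned int i;

	if (drm_tegra_fence_wait_timeout(gc->fence, ~0UL /* no timeout */) < 0)
		return;

	for (i = 0; i < gc->num_bos; i++)
		drm_tegra_bo_unref(gc->bos[i]);

	drm_tegra_fence_free(gc->fence);
	free(gc->bos);
	free(gc);
}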

@digetx (Member, Author) commented Jun 26, 2017

Yeah, that's something to consider :) Your garbage-list idea sounds interesting: maybe we could spawn a new thread on a successful job submission (or have a persistent garbage-queue handler thread spawned on drm_tegra_new() invocation) and wait for fence completion there, bumping the BOs' refcounts on submission and unreffing them on fence completion. However, that means we would waste quite a lot of memory when the Host1x firewall is enabled, so we would need some way to know the firewall state to avoid that waste.
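
A sketch of the thread variant, building on the gc structure sketched above; gc_queue_pop() is hypothetical and would need its own locking:

/* Persistent reaper spawned from drm_tegra_new(): consume entries that were
 * queued at job submission (with the BOs' refcounts bumped) and release
 * them once their fence has signaled. */
static void *drm_tegra_gc_thread(void *data)
{
	struct drm_tegra *drm = data;
	struct drm_tegra_job_gc *gc;

	while ((gc = gc_queue_pop(drm)))	/* blocks until an entry is queued */
		drm_tegra_job_gc_free(gc);	/* waits for the fence, unrefs the BOs */

	return NULL;
}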

@digetx (Member, Author) commented Jun 26, 2017

And yes, we'd probably need to clone the job's fence and attach it to the BOs. That could be done via a libdrm-tegra API extension, something like drm_tegra_bo_attach_fence(fence).

A quick draft:

int drm_tegra_bo_attach_fence(struct drm_tegra_bo *bo,
			      struct drm_tegra_fence *fence)
{
	int err = 0;

	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/* Only one fence may be attached at a time. */
	if (bo->fence) {
		err = -EBUSY;
		goto unlock;
	}

	bo->fence = fence;

unlock:
	pthread_mutex_unlock(&bo->lock);

	return err;
}

int drm_tegra_bo_fence_wait(struct drm_tegra_bo *bo,
			    unsigned long timeout)
{
	int err = 0;

	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/* No fence attached means the BO is idle. */
	if (!bo->fence)
		goto unlock;

	err = drm_tegra_fence_wait_timeout(bo->fence, timeout);
	if (err < 0)
		goto unlock;

	drm_tegra_fence_free(bo->fence);
	bo->fence = NULL;

unlock:
	pthread_mutex_unlock(&bo->lock);

	return err;
}
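
Usage at submit time might then look roughly like this; drm_tegra_job_submit(), drm_tegra_fence_clone() and the BO bookkeeping are assumptions:

static int submit_and_track(struct drm_tegra_job *job,
			    struct drm_tegra_bo **bos, unsigned int num_bos)
{
	struct drm_tegra_fence *fence;
	unsigned int i;
	int err;

	err = drm_tegra_job_submit(job, &fence);
	if (err < 0)
		return err;

	/* Attach a clone of the job's fence to every BO the job references. */
	for (i = 0; i < num_bos; i++)
		drm_tegra_bo_attach_fence(bos[i], drm_tegra_fence_clone(fence));

	return 0;
}

The BO cache would then call drm_tegra_bo_fence_wait() before handing a cached BO back out.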

@kusma (Member) commented Jun 26, 2017

We could just make some periodic calls from the client API to see if we can free up some memory. For instance, we might want to do this when we flush, and when an allocation fails (and then retry).

@kusma (Member) commented Jun 26, 2017

@digetx: I suspect it would be desirable to allow replacing the fence with a newer one. That way, we can keep track of the latest-use fence for a resource and wait on it when we want to either modify or delete the resource.

Hmm, but maybe that becomes a bit tricky with usage from multiple queues (GR2D / GR3D)? What about imported BOs?
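
A variant of the draft above that replaces the fence instead of returning -EBUSY could look like the sketch below; note the comment on why this is only safe when the new fence signals after the old one:

int drm_tegra_bo_replace_fence(struct drm_tegra_bo *bo,
			       struct drm_tegra_fence *fence)
{
	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/*
	 * Only safe if the new fence is guaranteed to signal after the old
	 * one (e.g. same channel) -- this is exactly what gets tricky with
	 * multiple queues such as GR2D / GR3D.
	 */
	if (bo->fence)
		drm_tegra_fence_free(bo->fence);

	bo->fence = fence;

	pthread_mutex_unlock(&bo->lock);

	return 0;
}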

@digetx (Member, Author) commented Jun 26, 2017

@kusma Sounds fun already :P Maybe this would be useful for us: https://www.phoronix.com/scan.php?page=news_item&px=DRM-Sync-Objects-4.13 -- we'd have to add fence<->syncobj wrappers, I guess. That also means we'd be incompatible with older kernels, but that's probably fine.

@kusma (Member) commented Jun 26, 2017

[image]

@digetx (Member, Author) commented Jun 26, 2017

Actually, we should implement drm_tegra_bo_cpu_prep() and use it in find_in_bucket() in tegra_bo_cache.c.
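
In the spirit of freedreno's BO cache, that could look roughly like the sketch below; drm_tegra_bo_cpu_prep(), the NOSYNC flag and the bucket helpers are all hypothetical:

static struct drm_tegra_bo *find_in_bucket(struct drm_tegra_bo_bucket *bucket)
{
	/* Buckets are FIFO: the head is the oldest entry, so if even that
	 * one is still busy there is no point in checking the rest. */
	struct drm_tegra_bo *bo = bucket_first_bo(bucket);	/* hypothetical helper */

	if (!bo)
		return NULL;

	/* A non-blocking cpu_prep tells us whether the HW is done with the BO. */
	if (drm_tegra_bo_cpu_prep(bo, DRM_TEGRA_CPU_PREP_NOSYNC) != 0)
		return NULL;

	bucket_remove_bo(bucket, bo);	/* hypothetical helper */

	return bo;
}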

@digetx (Member, Author) commented Jun 27, 2017

So, I have looked at this a bit more today. dma-buf provides all the necessary synchronization facilities (fences, reservation objects); we just need to support them in the kernel driver, I think. https://01.org/linuxgraphics/gfx-docs/drm/driver-api/dma-buf.html
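
On the kernel side, with the reservation_object API as of roughly v4.12, that could boil down to something like this; the bo->resv field and where exactly this hooks into the submit path are assumptions:

static int tegra_bo_implicit_sync(struct tegra_bo *bo, struct dma_fence *fence,
				  bool write)
{
	long err;

	/* Writers wait for everything, readers only for the exclusive fence. */
	err = reservation_object_wait_timeout_rcu(bo->resv, write, true,
						  MAX_SCHEDULE_TIMEOUT);
	if (err < 0)
		return err;

	/* Publish this job's fence so later users wait on it in turn. */
	if (write) {
		reservation_object_add_excl_fence(bo->resv, fence);
	} else {
		err = reservation_object_reserve_shared(bo->resv);
		if (err < 0)
			return err;

		reservation_object_add_shared_fence(bo->resv, fence);
	}

	return 0;
}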

@thierryreding (Member) commented Jun 28, 2017 via email

@digetx (Member, Author) commented Jun 28, 2017

@thierryreding thank you very much for pointing that out! I'll take a detailed look at it very soon, but at a quick glance it looks like it would be useful to at least some degree. @cyndis implemented a dma-buf fence for the Host1x job, so that a job's fence could be shared with another process. But right now, for the BO caching issue, we need to implement a dma-buf reservation object that is attached to the BO.

@cyndis, do you have anything else useful in your stash? :)

@digetx (Member, Author) commented Jun 30, 2017

@cyndis's implementation looks interesting, however there is a nit: we can't use a raw syncpoint threshold value for fence signaling, because the syncpoint can wrap around; the syncpoint should be backed by a uint64_t seqno that is incremented on job completion. I'd also prefer the in_fence / out_fence naming rather than prefence / postfence, following freedreno and etnaviv. @cyndis, are you going to continue working on the host1x-fence patches, or may I take them over?

@cyndis commented Jul 1, 2017

Sorry, I don't have any other series ready :p

Good point about the wraparound, though looking at struct dma_fence, seqno is only an unsigned int, so it doesn't have any more range than host1x's hardware syncpoints; I don't think we can do anything here unless that field is changed to a uint64_t. Not that it probably matters in practice. If it is changed, we should make the seqno simply track the syncpoint value while ignoring wraparound, because we can have syncpoints that are incremented mid-submit, or potentially syncpoints that are not related to any host1x job (e.g. used by nouveau, if support is implemented).

I can continue developing the series. If you have some kind of userspace that would satisfy the maintainers, that would be great, so we could get this in :) The other option I have been thinking of is adding prefence support to nouveau in Mesa.

@digetx (Member, Author) commented Jul 1, 2017

@cyndis Yes, you are right, seqno is an unsigned int. It should still be okay, because we can detect the wraparound the way freedreno / etnaviv do: (s32)(current_ctx_seqno - fence_seqno) >= 0, assuming we won't have a fence stalled for months / years.
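
For reference, a minimal helper expressing that comparison (wraparound-safe as long as the two values are less than 2^31 apart):

static inline bool host1x_seqno_passed(u32 current_ctx_seqno, u32 fence_seqno)
{
	/* Works across 32-bit wraparound, freedreno / etnaviv style. */
	return (s32)(current_ctx_seqno - fence_seqno) >= 0;
}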

Right now, I don't think that we (grate-driver) have a real use for a sync file, i.e. explicit syncing. But we need implicit syncing for the reservations. Implicit and explicit syncing are almost the same thing: for explicit syncing we wait for whatever fence userspace decides must complete before the job starts executing; for implicit syncing we wait for the fences attached to the BOs (all reads and writes must complete before a write, and all writes must complete before a read). So explicit and implicit fences would use the same 'host1x fence' implementation; at least your "Add support for DMA fences" patch looks very useful, I'm going to try it today or tomorrow.

I'm thinking about one potential use-case for explicit syncing, such as awaiting a VBLANK before writing to a displayed surface. I don't think there is a real need for the core DRM maintainers' approval of explicit syncing if it goes in together with implicit syncing; we need implicit now and potentially explicit later.

BTW, https://www.kernel.org/doc/Documentation/sync_file.txt uses the terms in-fence / out-fence, so I think we should stick with these names.

@digetx (Member, Author) commented Jul 1, 2017

However, I think we actually do need explicit fencing now. It is needed to serialize jobs (3d / 2d): implicit fencing is concurrent, i.e. the job scheduling order would be arbitrary without explicit fencing.

@cyndis, explicit fencing isn't as straightforward as your implementation makes it: the problem is that if you submit a job to the HW with an in-fence (or prefence, as you called it), then jobs submitted afterwards to the same channel that do not require explicit syncing would be blocked behind that explicitly synced job. Should explicit fences perhaps always be CPU-waited?
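
The CPU-wait option could be as simple as resolving the in-fence before the job is pushed to the channel, so it only stalls the submitting process; the job->in_fence field is an assumption:

static long host1x_job_wait_in_fence(struct host1x_job *job)
{
	if (!job->in_fence)
		return 0;

	/* Blocks only this submission, not other jobs on the same channel. */
	return dma_fence_wait(job->in_fence, true);
}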

I'm also wondering what would happen if different channels simultaneously asked a host1x wait unit to wait for a syncpoint: would one channel be blocked on the Host1x bus until the other channel releases the wait unit, or is there a wait unit per channel?

We also need to fix the Tegra DRM IOCTL locking, because right now the whole submission path would be blocked by a CPU-waited fence, and that's not good.

@digetx (Member, Author) commented Jul 1, 2017

Well, implicit fencing is affected by the same blocking issues. Fencing in HW probably isn't useful at all without software channel arbitration: we would have to track the channels in use, pick a free one, and the units would have to be MLOCKed. But that's way too much work to implement.

Also, we don't need to wait for a fence if it has the same context as the job's client, because it is already serialized.
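
A small check along these lines (comparing dma_fence contexts) would cover that; where exactly it plugs into the submit path is an assumption:

static bool tegra_fence_needs_wait(struct dma_fence *fence, u64 our_context)
{
	/* Fences from our own context are already ordered by the channel. */
	if (fence->context == our_context)
		return false;

	return !dma_fence_is_signaled(fence);
}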

@digetx (Member, Author) commented Jul 4, 2017

Both explicit and implicit fencing are now staged in https://github.com/grate-driver/linux/. I'm now wondering whether we should do anything about older kernels that do not support the required features; maybe we should try to add some sort of lame-and-slow compatibility mode to libdrm? Or maybe it isn't worth the effort...
