
tegra: Do not cache command buffer BO's #13

Closed

Conversation

@digetx (Member) commented Jun 26, 2017

When the Host1x firewall is disabled, command buffers are not cloned by the
kernel driver, which leaves the possibility of these buffers being recycled
while the HW is still executing them; as a result userspace may overwrite
their data, and that ends in tears. This is easily reproducible with x11perf.
Unfortunately, not caching the command buffer BOs results in a very noticeable
performance degradation, so we will have to do something about it.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>

@kusma (Member) commented Jun 26, 2017

Just an idea: maybe we can prevent the need for this at a higher level? Something like always double-buffering command lists, and using fences to make sure they have completed?
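
A rough sketch of the double-buffering idea; the cmdlist/slot types and the fence wait/free helpers are assumptions, not the actual grate-driver API:

struct cmdlist_slot {
	struct drm_tegra_pushbuf *pushbuf;
	struct drm_tegra_fence *fence;	/* fence of the last submission using this slot */
};

struct cmdlist_ring {
	struct cmdlist_slot slots[2];
	unsigned int index;
};

/* Flip to the other command list; block only if the HW still owns it. */
static struct cmdlist_slot *cmdlist_next(struct cmdlist_ring *ring)
{
	struct cmdlist_slot *slot;

	ring->index ^= 1;
	slot = &ring->slots[ring->index];

	if (slot->fence) {
		drm_tegra_fence_wait_timeout(slot->fence, ~0UL /* no timeout */);
		drm_tegra_fence_free(slot->fence);
		slot->fence = NULL;
	}

	return slot;
}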

@kusma (Member) commented Jun 26, 2017

As a matter of fact, isn't it a bug to release an in-use BO anyway? This could also happen with any other BO, no?

@kusma (Member) commented Jun 26, 2017

The more I think about this, the more I feel like this is just hiding bugs. Buffers need to be kept around until the hardware is done working with them. This means, for instance, that surfaces being used as blit sources need to stay around until the command buffer containing the blit has been executed. This patch does nothing to help with that.

I think we should instead do something like creating a garbage-list of BOs for each drm_tegra_job that gets freed once we know the hardware is done.
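
A minimal sketch of what such a per-job garbage list could look like; the drm_tegra_job_gc type and its fields are assumptions:

#include <stdlib.h>

struct drm_tegra_job_gc {
	struct drm_tegra_bo **bos;	/* BOs referenced by the job */
	unsigned int num_bos;
	struct drm_tegra_fence *fence;	/* signals when the job has completed */
};

/* Release the job's BOs only once the hardware is known to be done. */
static void drm_tegra_job_gc_free(struct drm_tegra_job_gc *gc)
{
	unsigned int i;

	if (drm_tegra_fence_wait_timeout(gc->fence, ~0UL /* no timeout */) < 0)
		return;

	for (i = 0; i < gc->num_bos; i++)
		drm_tegra_bo_unref(gc->bos[i]);

	drm_tegra_fence_free(gc->fence);
	free(gc->bos);
	free(gc);
}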

@digetx (Member, Author) commented Jun 26, 2017

Yeah, that's something to consider :) Your garbage-list idea sounds interesting: maybe we could spawn a new thread on a successful job submission (or have a persistent garbage-queue handler thread spawned on drm_tegra_new() invocation) and wait for fence completion there, bumping the BOs' refcounts on submission and unreffing them on fence completion. However, that means we would waste quite a lot of memory when the Host1x firewall is enabled, so we would need some way to know the firewall state to avoid that waste.
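
A sketch of the thread variant, building on the gc structure sketched above; gc_queue_pop() is hypothetical and would need its own locking:

/* Persistent reaper spawned from drm_tegra_new(): consume entries that were
 * queued at job submission (with the BOs' refcounts bumped) and release
 * them once their fence has signaled. */
static void *drm_tegra_gc_thread(void *data)
{
	struct drm_tegra *drm = data;
	struct drm_tegra_job_gc *gc;

	while ((gc = gc_queue_pop(drm)))	/* blocks until an entry is queued */
		drm_tegra_job_gc_free(gc);	/* waits for the fence, unrefs the BOs */

	return NULL;
}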

@digetx (Member, Author) commented Jun 26, 2017

And yes, we'd probably need to clone the job's fence and attach it to the BOs. That could be done via a libdrm-tegra API extension, something like drm_tegra_bo_attach_fence(fence).

A quick draft:

int drm_tegra_bo_attach_fence(struct drm_tegra_bo *bo,
			      struct drm_tegra_fence *fence)
{
	int err = 0;

	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/* Only one fence may be attached at a time. */
	if (bo->fence) {
		err = -EBUSY;
		goto unlock;
	}

	bo->fence = fence;

unlock:
	pthread_mutex_unlock(&bo->lock);

	return err;
}

int drm_tegra_bo_fence_wait(struct drm_tegra_bo *bo,
			    unsigned long timeout)
{
	int err = 0;

	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/* No fence attached means the BO is idle. */
	if (!bo->fence)
		goto unlock;

	err = drm_tegra_fence_wait_timeout(bo->fence, timeout);
	if (err < 0)
		goto unlock;

	drm_tegra_fence_free(bo->fence);
	bo->fence = NULL;

unlock:
	pthread_mutex_unlock(&bo->lock);

	return err;
}
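
Usage at submit time might then look roughly like this; drm_tegra_job_submit(), drm_tegra_fence_clone() and the BO bookkeeping are assumptions:

static int submit_and_track(struct drm_tegra_job *job,
			    struct drm_tegra_bo **bos, unsigned int num_bos)
{
	struct drm_tegra_fence *fence;
	unsigned int i;
	int err;

	err = drm_tegra_job_submit(job, &fence);
	if (err < 0)
		return err;

	/* Attach a clone of the job's fence to every BO the job references. */
	for (i = 0; i < num_bos; i++)
		drm_tegra_bo_attach_fence(bos[i], drm_tegra_fence_clone(fence));

	return 0;
}

The BO cache would then call drm_tegra_bo_fence_wait() before handing a cached BO back out.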

@kusma (Member) commented Jun 26, 2017

We could just make some periodic calls from the client API to see if we can free up some memory. For instance, we might want to do this when we flush, and when an allocation fails (and then retry).

@kusma (Member) commented Jun 26, 2017

@digetx: I suspect it would be desirable to allow replacing the fence with a newer one. That way, we can keep track of the latest-use fence for a resource and wait on it when we want to either modify or delete the resource.

Hmm, but maybe that becomes a bit tricky with usage from multiple queues (GR2D / GR3D)? What about imported BOs?
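
A variant of the draft above that replaces the fence instead of returning -EBUSY could look like the sketch below; note the comment on why this is only safe when the new fence signals after the old one:

int drm_tegra_bo_replace_fence(struct drm_tegra_bo *bo,
			       struct drm_tegra_fence *fence)
{
	if (!bo)
		return -EINVAL;

	pthread_mutex_lock(&bo->lock);

	/*
	 * Only safe if the new fence is guaranteed to signal after the old
	 * one (e.g. same channel) -- this is exactly what gets tricky with
	 * multiple queues such as GR2D / GR3D.
	 */
	if (bo->fence)
		drm_tegra_fence_free(bo->fence);

	bo->fence = fence;

	pthread_mutex_unlock(&bo->lock);

	return 0;
}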

@digetx (Member, Author) commented Jun 26, 2017

@kusma Sounds fun already :P Maybe this would be useful for us: https://www.phoronix.com/scan.php?page=news_item&px=DRM-Sync-Objects-4.13 -- we'd have to add fence<->syncobj wrappers, I guess. That also means we'd be incompatible with older kernels, but that's probably fine.

@kusma (Member) commented Jun 26, 2017

[image]

@digetx (Member, Author) commented Jun 26, 2017

Actually, we should implement drm_tegra_bo_cpu_prep() and use it in find_in_bucket() in tegra_bo_cache.c.
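
In the spirit of freedreno's BO cache, that could look roughly like the sketch below; drm_tegra_bo_cpu_prep(), the NOSYNC flag and the bucket helpers are all hypothetical:

static struct drm_tegra_bo *find_in_bucket(struct drm_tegra_bo_bucket *bucket)
{
	/* Buckets are FIFO: the head is the oldest entry, so if even that
	 * one is still busy there is no point in checking the rest. */
	struct drm_tegra_bo *bo = bucket_first_bo(bucket);	/* hypothetical helper */

	if (!bo)
		return NULL;

	/* A non-blocking cpu_prep tells us whether the HW is done with the BO. */
	if (drm_tegra_bo_cpu_prep(bo, DRM_TEGRA_CPU_PREP_NOSYNC) != 0)
		return NULL;

	bucket_remove_bo(bucket, bo);	/* hypothetical helper */

	return bo;
}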

@digetx (Member, Author) commented Jun 27, 2017

So, I have looked at this a bit more today. dma-buf provides all the necessary synchronization facilities (fences, reservation objects); we just need to support them in the kernel driver, I think. https://01.org/linuxgraphics/gfx-docs/drm/driver-api/dma-buf.html
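
On the kernel side, with the reservation_object API as of roughly v4.12, that could boil down to something like this; the bo->resv field and where exactly this hooks into the submit path are assumptions:

static int tegra_bo_implicit_sync(struct tegra_bo *bo, struct dma_fence *fence,
				  bool write)
{
	long err;

	/* Writers wait for everything, readers only for the exclusive fence. */
	err = reservation_object_wait_timeout_rcu(bo->resv, write, true,
						  MAX_SCHEDULE_TIMEOUT);
	if (err < 0)
		return err;

	/* Publish this job's fence so later users wait on it in turn. */
	if (write) {
		reservation_object_add_excl_fence(bo->resv, fence);
	} else {
		err = reservation_object_reserve_shared(bo->resv);
		if (err < 0)
			return err;

		reservation_object_add_shared_fence(bo->resv, fence);
	}

	return 0;
}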

@thierryreding (Member) commented Jun 28, 2017 via email

@digetx (Member, Author) commented Jun 28, 2017

@thierryreding thank you very much for pointing that out! I'll take a detailed look at it very soon, but at a quick glance it looks like it would be useful to at least some degree. @cyndis implemented a dma-buf fence for the Host1x job, so that a job's fence could be shared with another process. But right now, for the BO caching issue, we need to implement a dma-buf reservation object that is attached to the BO.

@cyndis, do you have anything else useful in your stash? :)

@digetx (Member, Author) commented Jun 30, 2017

@cyndis's implementation looks interesting, however there is a nit: we can't use a raw syncpoint threshold value for fence signaling, because the syncpoint can wrap around; the syncpoint should be backed by a uint64_t seqno that is incremented on job completion. I'd also prefer the in_fence / out_fence naming rather than prefence / postfence, following freedreno and etnaviv. @cyndis, are you going to continue working on the host1x-fence patches, or may I take them over?

@cyndis commented Jul 1, 2017

Sorry, I don't have any other series ready :p

Good point about the wraparound, though looking at struct dma_fence, seqno is only an unsigned int, so it doesn't have any more range than host1x's hardware syncpoints; I don't think we can do anything here unless that field is changed to a uint64_t. Not that it probably matters in practice. If it is changed, we should make the seqno simply track the syncpoint value while ignoring wraparound, because we can have syncpoints that are incremented mid-submit, or potentially syncpoints that are not related to any host1x job (e.g. used by nouveau, if support is implemented).

I can continue developing the series. If you have some kind of userspace that would satisfy the maintainers, that would be great, so we could get this in :) The other option I have been thinking of is adding prefence support to nouveau in Mesa.

@digetx (Member, Author) commented Jul 1, 2017

@cyndis Yes, you are right, seqno is an unsigned int. It should still be okay, because we can detect the wraparound the way freedreno / etnaviv do: (s32)(current_ctx_seqno - fence_seqno) >= 0, assuming we won't have a fence stalled for months / years.
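
For reference, a minimal helper expressing that comparison (wraparound-safe as long as the two values are less than 2^31 apart):

static inline bool host1x_seqno_passed(u32 current_ctx_seqno, u32 fence_seqno)
{
	/* Works across 32-bit wraparound, freedreno / etnaviv style. */
	return (s32)(current_ctx_seqno - fence_seqno) >= 0;
}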

Right now, I don't think that we (grate-driver) have a real use for a sync file, i.e. explicit syncing. But we need implicit syncing for the reservations. Implicit and explicit syncing are almost the same thing: for explicit syncing we wait for whatever fence userspace decides must complete before the job starts executing; for implicit syncing we wait for the fences attached to the BOs (all reads and writes must complete before a write, and all writes must complete before a read). So explicit and implicit fences would use the same 'host1x fence' implementation; at least your "Add support for DMA fences" patch looks very useful, I'm going to try it today or tomorrow.

I'm thinking about one potential use-case for explicit syncing, such as awaiting a VBLANK before writing to a displayed surface. I don't think there is a real need for the core DRM maintainers' approval of explicit syncing if it goes in together with implicit syncing; we need implicit now and potentially explicit later.

BTW, https://www.kernel.org/doc/Documentation/sync_file.txt uses the terms in-fence / out-fence, so I think we should stick with these names.

@digetx (Member, Author) commented Jul 1, 2017

However, I think we actually do need explicit fencing now. It is needed to serialize jobs (3d / 2d): implicit fencing is concurrent, i.e. the job scheduling order would be arbitrary without explicit fencing.

@cyndis, explicit fencing isn't as straightforward as your implementation makes it: the problem is that if you submit a job to the HW with an in-fence (or prefence, as you called it), then jobs submitted afterwards to the same channel that do not require explicit syncing would be blocked behind that explicitly synced job. Should explicit fences perhaps always be CPU-waited?
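
The CPU-wait option could be as simple as resolving the in-fence before the job is pushed to the channel, so it only stalls the submitting process; the job->in_fence field is an assumption:

static long host1x_job_wait_in_fence(struct host1x_job *job)
{
	if (!job->in_fence)
		return 0;

	/* Blocks only this submission, not other jobs on the same channel. */
	return dma_fence_wait(job->in_fence, true);
}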

I'm also wondering what would happen if different channels simultaneously asked a host1x wait unit to wait for a syncpoint: would one channel be blocked on the Host1x bus until the other channel releases the wait unit, or is there a wait unit per channel?

We also need to fix the Tegra DRM IOCTL locking, because right now the whole submission path would be blocked by a CPU-waited fence, and that's not good.

@digetx (Member, Author) commented Jul 1, 2017

Well, implicit fencing is affected by the same blocking issues. Fencing in HW probably isn't useful at all without software channel arbitration: we would have to track the channels in use, pick a free one, and the units would have to be MLOCKed. But that's way too much work to implement.

Also, we don't need to wait for a fence if it has the same context as the job's client, because it is already serialized.
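
A small check along these lines (comparing dma_fence contexts) would cover that; where exactly it plugs into the submit path is an assumption:

static bool tegra_fence_needs_wait(struct dma_fence *fence, u64 our_context)
{
	/* Fences from our own context are already ordered by the channel. */
	if (fence->context == our_context)
		return false;

	return !dma_fence_is_signaled(fence);
}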

@digetx (Member, Author) commented Jul 4, 2017

Both explicit and implicit fencing are now staged in https://github.com/grate-driver/linux/. I'm now wondering whether we should do anything about older kernels that do not support the required features; maybe we should try to add some sort of lame-and-slow compatibility mode to libdrm? Or maybe it isn't worth the effort...
