drm:drm_vblank_event doesn't represent actual HW vblank #30

hfink-daqri · 2019-05-14T10:18:08Z

By default, GPUVis uses drm:drm_vblank_event trace points to identify vblank timings. While working on an Intel i915 trace, I learned that for this driver, the drm_vblank_event trace point does not represent the actual HW vblank timestamp, but only the CPU time the trace-point was hit during some irq handler (briefly discussed at #intel-gfx). In my tests, this timestamp was 200 - 400 us before the actual vblank. Supposedly, for a system under heavy load, this divergence becomes larger.

This can be a problem when analyzing tight timings with a GPUVis trace. For instance, we schedule a CPU wake-up time relative to the last vblank time we get through the user-space KMS API (which is HW corrected and accurate). During that wake-up, we commit display plane state as late as possible while still making the next vblank. In GPUVis, this wake-up appears to be inaccurately scheduled, since the visualized vblank lines are themselves based on inaccurate timings. That led to a few false conclusions on my side.
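The kind of wake-up scheduling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed refresh period, and the commit margin are made up for the example and are not taken from the setup described in this issue. All timestamps are assumed to be on the same clock.

```python
def next_wakeup_ns(last_vblank_ns, refresh_period_ns, commit_margin_ns, now_ns):
    """Return the time (ns) to wake up so a commit lands shortly before a vblank.

    Hypothetical helper: assumes a fixed refresh period and that all
    timestamps share one clock (e.g. CLOCK_MONOTONIC).
    """
    # Whole refresh periods that have passed since the last known vblank.
    elapsed_periods = max(now_ns - last_vblank_ns, 0) // refresh_period_ns
    # Extrapolate the next vblank from the last accurate vblank timestamp.
    next_vblank_ns = last_vblank_ns + (elapsed_periods + 1) * refresh_period_ns
    wakeup_ns = next_vblank_ns - commit_margin_ns
    # If that wake-up point is already in the past, target the following frame.
    if wakeup_ns <= now_ns:
        wakeup_ns += refresh_period_ns
    return wakeup_ns


if __name__ == "__main__":
    period = 16_666_667   # assumed 60 Hz display
    margin = 1_000_000    # assumed 1 ms commit margin
    print(next_wakeup_ns(1_000_000_000, period, margin, 1_005_000_000))
```

If the vblank lines shown in the trace are a few hundred microseconds early, a wake-up computed this way looks misplaced relative to them even though it is correct relative to the real vblank.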

Unfortunately I don't have a good fix for this, and neither do I know if this problem also applies to AMD drivers. I just wanted to write this up here in case anyone else runs into this issue, and maybe to discuss alternative solutions.

danvet · 2019-05-14T10:23:55Z

Correction: This holds for all drivers, not just i915. The difference between when we handle the vblank interrupt and the actual vblank timestamp is entirely driver dependent (but in no case actually matches).

Please also note that the latest time at which you can still successfully submit a page flip is also driver dependent. The only cross-driver guarantee is that after you get the vblank event, a subsequent flip will schedule on the next frame, even if the vblank hasn't actually happened yet. But it is entirely undefined how far ahead of that event you need to submit a page flip to hit the current vblank.

lostgoat · 2019-05-14T16:50:24Z

Can I trouble you for a link to a log of the #intel-gfx discussion? I don't want to make you repeat yourself here.

I believe that we currently don't have any information source that offers higher precision than drm_vblank_event. So we are going to need a new event to collect this information. E.g. we'd have a set of:

drm_vblank_event cputs=%lld crtc=%d seq=%d
- Represent the point in time when the libdrm event is fired (could still be useful)
drm_vblank_hw cputs=%lld hwts=%lld crtc=%d seq=%d
- Represent the point in time when the HW vblank happened
- Timestamp needs to come from HW, since on the CPU side we'll always have some sort of delay
  - I haven't worked with VRR much before, but I'm guessing very precise TS will be fairly useful here.
- Needs a better event name :)
- Gpuvis can optionally use this event if available

Could an event like the above be generated by the i915 driver? If not, do we have any alternative strategies for mitigating the delay? (For example, we could try to find some standard deviation and show vblank as a range.)
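The "show vblank as a range" fallback could look roughly like the sketch below. It is only an illustration: the function name is made up, and the delay samples would have to come from some separate measurement of how far the real vblank trails the tracepoint (the first comment reports 200 to 400 us on one i915 system).

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def vblank_range_us(tracepoint_ts_us: float,
                    observed_delays_us: Sequence[float]) -> Tuple[float, float]:
    """Estimate a (low, high) window for the real vblank.

    tracepoint_ts_us:   CPU timestamp of drm_vblank_event, in microseconds.
    observed_delays_us: previously measured delays between the tracepoint
                        and the actual vblank (hypothetical input data).
    """
    center = mean(observed_delays_us)
    # stdev() needs at least two samples; fall back to a zero-width window.
    spread = stdev(observed_delays_us) if len(observed_delays_us) > 1 else 0.0
    return (tracepoint_ts_us + center - spread,
            tracepoint_ts_us + center + spread)


if __name__ == "__main__":
    low, high = vblank_range_us(1000.0, [200.0, 300.0, 400.0])
    print(low, high)
```

A range like this only reflects past measurements on one machine, so it would be a visual hint rather than a replacement for a real hardware timestamp.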

lostgoat · 2019-05-14T17:33:15Z

Alternatively, we could:

- Capture the timestamp at the top of the IRQ handler routine, while still in the IRQ context, then pass that as a parameter to the IRQ handler routine.
- Or, adjust the scheduling policy so that the IRQ context and the IRQ handler. This is probably not a good approach because we might end up where we started.

Terminology note: by "IRQ handler" I mean the routine that executes outside the IRQ context, once interrupts are re-enabled. I can't recall the proper name at the moment.

Plagman · 2019-05-14T22:50:49Z

For a compositor to be able to achieve maximum quality of service (eg. lowest possible latency while not missing frames), it'd be nice for drivers to provide a reasonable estimate on what that value might be on the current hardware that's driving the head, though. At the bare minimum feedback on what vblank a flip actually landed on would let us tune the compositor frame timer guard window timing at runtime, so hopefully the driver is able to provide that, but a hint to get us started would be ideal.

hfink-daqri · 2019-05-15T08:04:15Z

> Can I trouble you for a link to a log of the #intel-gfx discussion? Don't want to make you have to repeat yourself here.

I didn't keep a log of that discussion, sorry. As far as I remember, no additional information was discussed beyond what is already described here.

> drm_vblank_hw cputs=%lld hwts=%lld crtc=%d seq=%d
> Represent the point in time when the HW vblank happened

I can't really comment on the trace points and i915 architecture, but would it be possible for a user-space process to insert events into the trace_marker stream, where the visualized CPU time is independent of the time it was inserted into the stream, i.e. we set the CPU time to a specific value? That way, as a workaround for now, we could insert the actual vblank time into a custom event, and configure GPUVis to use that as the frame boundaries.
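A minimal sketch of such a workaround is shown below. As far as I know, entries written to trace_marker are stamped with the time of the write, so this sketch carries the vblank time in the payload instead of trying to override the event time. The marker text layout (`hw_vblank crtc=... seq=... ts_ns=...`) and both function names are invented for the example; a viewer would need matching support to use them.

```python
# Default tracefs location; older setups may expose it under
# /sys/kernel/debug/tracing instead.
TRACE_MARKER = "/sys/kernel/tracing/trace_marker"


def format_vblank_marker(crtc: int, seq: int, vblank_ns: int) -> str:
    """Build the marker payload (hypothetical layout, not an existing format)."""
    return f"hw_vblank crtc={crtc} seq={seq} ts_ns={vblank_ns}\n"


def write_vblank_marker(crtc: int, seq: int, vblank_ns: int,
                        path: str = TRACE_MARKER) -> None:
    """Write one marker into the trace stream (usually needs elevated privileges)."""
    with open(path, "w") as marker:
        marker.write(format_vblank_marker(crtc, seq, vblank_ns))


if __name__ == "__main__":
    # Only print the payload here; writing requires access to tracefs.
    print(format_vblank_marker(0, 42, 123_456_789), end="")
```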

> At the bare minimum feedback on what vblank a flip actually landed on would let us tune the compositor frame timer guard window timing at runtime

If I'm not mistaken, you can already track this information through the page_flip_handler2 callback of a drmEventContext, if you manage swap-chains and scan-out buffers manually. See kms-quads for an example of such a handler.

Could it also be that kernel static trace points are simply not well suited for collecting this data? In the long run, it might be worth adding an alternative data collection technique (i.e. something other than ftrace) on Linux to GPUVis, as was apparently done recently with Windows ETL support.

danvet · 2019-05-15T09:10:47Z

@lostgoat Trying to answer your questions:

- In most drivers drm_handle_vblank is called from the irq handler, so we're already catching the timestamp you have from the tracepoint at pretty much the right time.
- The trouble is that the irq handler fires whenever the hw team felt like (most of them fire at the start of vblank, but not all, and most not exactly at the start of vblank).
- When we hit the tracepoint we already have the corrected timestamp (corrected to start-of-next-frame that is, and only for drivers which support high precision vblank timestamps).

I think the simplest solution would be to extend the existing tracepoint and add the correct hw timestamp in there, maybe only optionally for drivers which have high precision timestamps (otherwise it's kinda pointless).

@Plagman agreed, but given scheduling heuristics and all that stuff it's tricky. Best rule of thumb for the deadline right now is "a bit before you get the drm_event/hit that tracepoint". The really annoying thing is that for some hw you can squeeze in a frame update even after the vblank has passed already (those supporting vrr usually), and we're still shuffling around implementation details to make sure we have consistent timestamps and all that for these cases. The way things are progressing we'll do the following:

- Delay the drm_event until point of no return, so that we can make sure the vblank timestamp and the page_flip timestamp for a given frame always match.
- This means a lot more funny interrupt handling code, since vblank irq handling is now two stage.
- But the end result is that for userspace the same heuristics should still be good enough, i.e. a few hundred us before you get the event is the deadline.

It might be good to put that into some formal vblank/page_flip timestamp uapi documentation.

lostgoat · 2019-05-15T15:53:54Z

> I think the simplest solution would be to extend the existing tracepoint and add the correct hw timestamp in there, maybe only optionally for drivers which have high precision timestamps (otherwise it's kinda pointless).

That sounds perfect.

We can use the adjusted timestamp if the field is available, otherwise we'll use the ftrace timestamp.
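A rough sketch of that fallback on the viewer side is below. It is not GPUVis code: the function name is made up, and the field name `hwts` is only borrowed from the event format proposed earlier in this thread. It also assumes that both timestamps are on the same clock.

```python
from typing import Mapping


def vblank_timestamp_ns(ftrace_ts_ns: int,
                        fields: Mapping[str, str],
                        hw_field: str = "hwts") -> int:
    """Pick the timestamp to draw a vblank line at.

    ftrace_ts_ns: timestamp ftrace attached to the event.
    fields:       key/value pairs parsed from the event payload.
    hw_field:     name of the optional hardware timestamp field
                  (hypothetical; depends on what the kernel ends up exposing).
    """
    raw = fields.get(hw_field)
    if raw is None:
        # Older kernel or driver without the extra field.
        return ftrace_ts_ns
    try:
        return int(raw)
    except ValueError:
        # Malformed value: fall back rather than drop the event.
        return ftrace_ts_ns


if __name__ == "__main__":
    print(vblank_timestamp_ns(1_000, {"crtc": "0", "seq": "7", "hwts": "1350"}))
    print(vblank_timestamp_ns(1_000, {"crtc": "0", "seq": "7"}))
```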

hfink-daqri · 2019-05-16T07:39:52Z

sounds good, I didn't know that we can extend the tracepoint that easily. I would be more than happy to test this on our side and share the results. @danvet what do you think is the best way to move forward, and how can I help?

danvet · 2019-05-16T10:09:33Z

tbh I just think we can extend tracepoints like that, I'm not sure.

next step would be to type the kernel patch and maybe adjust gpuvis and see whether it all works. should be a tiny patch, at least on the kernel side.

mikesart · 2019-05-25T03:35:56Z

I think @hfink-daqri is already building his own kernel to enable the Intel tracepoints, so if anyone has a rough patch I'm sure either he or I could get it working and give it a try on the user side. Thanks everyone.

subdiff · 2019-05-25T21:42:38Z

I have some comprehension questions.

@danvet

> delay the drm_event until point of no return, so that we can make sure the vblank timestamp and the page_flip timestamp for a given frame always match

By "point of no return", do you mean the point in time when the actual hw vblank starts?
According to drm docs the drm_vblank_event timestamp must agree with the one returned from page flip events already now. Do you mean some other vblank timestamp?

> The really annoying thing is that for some hw you can squeeze in a frame update even after the vblank has passed already (those supporting vrr usually)

According to the drm docs, the vertical front porch is extended in this case until a timeout (VRR minimum) or a page flip / atomic commit comes in. The next vblank just begins later. Or had the drm_vblank_event already triggered in this case, and is that what you meant?

I created the following diagram to visualize what's happening around a vblank:

                                    P-3                  P-4
                                     |                    |
-------------- frame x ---------------------------------->|<---- frame x+1 ----
                                     |<----- vblank ----->|
                                     |                    |
                 irq_handler +       |....driver flips....|
                drm_vblank_event     |...scanout-buffer...|
       P-1              |            |                    |
--------|---------------|------------|--------------------|--------------------
        |               |            |                    |
    last point         P-2        start of             end of v-blank,
    in time when                  "actual"             end of frame x,
    flip submit                   hw-vblank            start of frame x+1
    possible for                                       being scanned out
    frame x+1

Please tell me if that's not correct. Some explanations why intervals are defined and points of interest P-1 to P-4 are placed like above:

P-1 is first because @danvet said about P-2:

> The only cross-driver guarantee is that after you get the vblank event, a subsequent flip will schedule on the next frame.

That means if there is another point in time when the property of being able to submit another page flip to hit next frame x+1 changes from true to false, this point in time must be before P-2 (because from P-2 on it is guaranteed to be false).
At P-2 we have both the irq_handler and the drm_vblank_event, because @danvet said:

> in most drivers drm_handle_vblank is called from the irq handler

and the drm docs say about drm_handle_vblank:

> Drivers should call this routine in their vblank interrupt handlers to update the vblank counter and send any signals that may be pending.

I assume the drm_vblank_event is such a pending signal. But this contradicts the drm docs:

> An application can request to be notified when the page flip has completed. The drm core will supply a struct drm_event in the event parameter in this case. This can be handled by the drm_crtc_send_vblank_event() function, which the driver should call on the provided event upon completion of the flip.

That should mean the drm_vblank_event may only be sent after the page flip has completed, i.e. at some point between P-3 and P-4.
P-2 is before P-3 because @hfink-daqri measured it to be 200-400 us earlier.
Frame x extends till P-4 because until P-4 we have not yet changed the content on the screen. The scanout of frame x+1 only begins at P-4. Also in kms-quads the timestamp received through the page_flip_handler2 hook is described as:

> This time is usually close to the start of the vblank period of the previous frame [...]

In the end, userland receives some timestamp value in the page_flip_handler2 callback. Which of the points P-1 to P-4 is this timestamp? Or is it the point in time the flip completed, somewhere between P-3 and P-4?

danvet · 2019-06-04T15:16:50Z

@subdiff a few thoughts:

- I assume that P4 is what you get in the drm_event for vblank events/page_flip completion. I guess that answers your question at the very end. Note this only applies for drivers with so-called high precision timestamp support. These are nouveau, i915, radeon, amdgpu, vc4 (and maybe some more, not sure). For drivers without high-precision timestamps there's no way for you to measure P4, since the timestamp you get from the flip event is usually P2. Emphasis on "usually". Sadly you can't check from userspace whether you do have high precision timestamp support (but it would be easy to add).
- Your scenario is correct for some drivers. It's not correct for others. A sequence like P3, P1, P2, P4 is possible, and is actually the one you observe when VRR is enabled.
- The P2-P3 that @hfink-daqri measured holds for current implementations on current intel hw. Can't generalize that. Also I thought @hfink-daqri measured two different samplings of P2 (once through the tracepoint, the other through the dma-fence), but not sure.
- You cannot measure P3, the driver doesn't tell you that. The only thing you can do is take P4 and add the time the buffer will take to scan out (since that part isn't changed by VRR) to get P3 of the next frame, but only if you have a high-precision timestamp driver.
- The vblank irq handler is an implementation detail. The "should" in the kernel doc will work for the 90% cases of drivers with simple needs/simple hw. VRR doesn't work like that though, and amdgpu.ko has some pretty interesting code to make it all fit (there's essentially a start and an end vblank irq and some code to handle all that). We should probably update the docs a bit to make this clearer. Even on simple hw the vblank interrupt can essentially happen any time around the vblank interval. I think even a sequence of P3, P1, P4, P2 is possible.
- P1 is a lie for backwards compatibility reasons. If you request a specific frame then for some drivers you can still schedule an update for that frame after P2. Again, this only applies to some drivers.
- There are drivers without real vblank support (emulated/virtual hw). Those usually complete flips right away, and the timestamp you get back for page_flip events is just something sampled when the modeset code ran. You need to rate limit yourself, otherwise you just burn down all the cpu flipping as fast as you can draw new frames.
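The fourth point above (estimating P3 of the next frame from P4) amounts to simple mode-timing arithmetic. A minimal sketch, assuming a high-precision timestamp driver and a fixed mode; the function name is made up, and the 1920x1080 CEA timings used in the example call (htotal 2200, pixel clock 148.5 MHz) are just an example mode, not one discussed in this thread:

```python
def next_vblank_start_ns(p4_ns: int, vdisplay: int, htotal: int,
                         pixel_clock_khz: int) -> int:
    """Estimate P3 of the next frame from P4 of the current one.

    p4_ns:           timestamp from the flip/vblank event (start of scanout).
    vdisplay:        number of active (visible) lines in the mode.
    htotal:          total pixels per line, including horizontal blanking.
    pixel_clock_khz: pixel clock in kHz, as in a DRM mode description.
    """
    # Time to scan out all active lines: lines * pixels per line / pixel rate.
    scanout_ns = vdisplay * htotal * 1_000_000 / pixel_clock_khz
    return p4_ns + round(scanout_ns)


if __name__ == "__main__":
    # 1080 active lines at 2200 pixels per line and 148.5 MHz.
    print(next_vblank_start_ns(1_000, 1080, 2200, 148_500))
```

The estimate only covers the active scanout; it says nothing about where the interrupt or the tracepoint will fire relative to that point.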

tldr; all bets are off :-)

Also sorry for the late reply, somehow missed the notification.

danvet · 2019-06-04T15:17:52Z

Update: There's a few more drivers with high-precision timestamp support: stm, and msm (but only for mdp5, not for mdp4).

hfink-daqri · 2019-06-05T14:55:51Z

> the P2-P3 that @hfink-daqri measured holds for current implementations on current intel hw. Can't generalize that. Also I thought @hfink-daqri measured two different samplings of P2 (once through tracepoint, the other through the dma-fence), but not sure.

@danvet I measured the diff between ~CPU time for irq_handler and timestamp provided by page_flip_handler2, so I guess that's P2 - P4.

re. possible patches, I'd be happy to assist in testing eventual patches, even if it's just work-in-progress prototypes.

danvet · 2019-06-05T15:57:17Z

Ok, I've checked my math and on my 1980x1200 screen the vblank (so P3-P4) is a bit less than 500 usec. I thought it was longer, but I guess that was just back in the old days of actual CRT screens. So P2-P4 of 200-400 usec is somewhat plausible I think.
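The arithmetic behind such an estimate can be sketched as follows. The mode timings are an assumption: the comment above does not give the mode line, so the example uses the CVT reduced-blanking timings for 1920x1200 at 60 Hz (htotal 2080, vtotal 1235, pixel clock 154 MHz). The function name is made up.

```python
def vblank_duration_us(vtotal: int, vdisplay: int, htotal: int,
                       pixel_clock_khz: int) -> float:
    """Length of the vertical blanking interval (P3 to P4) in microseconds."""
    blank_lines = vtotal - vdisplay
    # One line takes htotal pixels at the pixel clock rate.
    line_time_us = htotal * 1000 / pixel_clock_khz
    return blank_lines * line_time_us


if __name__ == "__main__":
    # Assumed CVT-RB 1920x1200@60 timings: 35 blanking lines of about 13.5 us each.
    print(round(vblank_duration_us(1235, 1200, 2080, 154_000), 1))
```

With these assumed timings the result is a bit under 500 us, which is in line with the figure quoted above.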

hfink-daqri · 2019-08-08T16:10:54Z

I have been experimenting with extending the trace-point as discussed above. The kernel part seems straightforward and I got it working, but I am struggling to align the passed timestamp (which is based on the monotonic clock) with the timestamps used in gpuvis. Is there maybe a similar case in gpuvis where a monotonic-clock timestamp is aligned to the trace-event timeline that I could follow? Or maybe documentation on how trace-event timestamps are parsed?

hfink-daqri · 2019-08-09T08:33:51Z

answering my question above: ftrace needs to be configured to use monotonic clock for time-stamping. It'll be slower than the default "local" CPU clock, but reasonable to do for our use-case, I guess: echo mono > /sys/kernel/tracing/trace_clock
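Before comparing timestamps, it can help to verify which clock ftrace is actually using. Reading the trace_clock file lists the available clocks with the active one in square brackets; the sketch below parses that text. The function names are made up, and the sample string is only an example of the layout, since the set of clocks varies between kernels.

```python
def selected_trace_clock(trace_clock_text: str) -> str:
    """Return the clock name that is marked with [brackets]."""
    for name in trace_clock_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError("no clock is marked as selected")


def read_selected_trace_clock(
        path: str = "/sys/kernel/tracing/trace_clock") -> str:
    """Read the tracefs file and return the active clock (needs tracefs access)."""
    with open(path) as f:
        return selected_trace_clock(f.read())


if __name__ == "__main__":
    sample = "local global counter uptime perf [mono] mono_raw boot\n"
    print(selected_trace_clock(sample))
```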

hfink-daqri · 2019-08-09T16:37:21Z

I got high-precision vblank timestamps working so far, see linked PR above. It works nicely and you can toggle back and forth between using high-precision and IRQ-based vblank timings in GPUVis.

@danvet I sent a draft of the kernel patch to the mailing list. It's the first time I am sending a kernel patch, so it's likely there's something wrong/missing in that patch and/or me messing up my git send-email config :)

Feedback on either the kernel or the GPUVis side is very welcome. I am off for 10 days now, so I might be late responding to your comments and won't be able to submit a new revision before I am back.

Store the timestamp of the current vblank in the new field 'time' of the vblank trace event. If the timestamp is calculated by a driver that supports high-precision vblank timing, set the field 'high-prec' to 'true'.

User space can now access actual hardware vblank times via the tracing infrastructure. Tracing applications (such as GPUVis, see [0] for related discussion), can use the newly added information to conduct a more accurate analysis of display timing.

[0] mikesart/gpuvis#30

Signed-off-by: Heinrich <heinrich.fink@daqri.com>
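A consumer of the extended event could read the two new fields roughly as sketched below. The field names 'time' and 'high-prec' come from the commit message above; the exact text layout of the payload (separators, field order) and the nanosecond unit are assumptions of this sketch, and the type and function names are made up.

```python
import re
from typing import NamedTuple, Optional


class VblankEvent(NamedTuple):
    crtc: int
    seq: int
    time: int        # vblank timestamp; unit assumed to be nanoseconds
    high_prec: bool  # True if the driver computed a high-precision timestamp


# Assumed payload layout, e.g. "crtc=0, seq=1234, time=5678, high-prec=true".
_PAYLOAD = re.compile(
    r"crtc=(\d+),?\s+seq=(\d+),?\s+time=(\d+),?\s+high-prec=(true|false)")


def parse_vblank_payload(payload: str) -> Optional[VblankEvent]:
    """Parse the extended payload; return None for the old format."""
    m = _PAYLOAD.search(payload)
    if m is None:
        return None
    return VblankEvent(int(m.group(1)), int(m.group(2)),
                       int(m.group(3)), m.group(4) == "true")


if __name__ == "__main__":
    print(parse_vblank_payload("crtc=0, seq=1234, time=5678, high-prec=true"))
    print(parse_vblank_payload("crtc=0, seq=1234"))
```

Returning None for payloads without the new fields lets a viewer keep using the ftrace timestamp on kernels that predate the change.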

hfink-daqri · 2019-09-25T09:52:37Z

kernel patch has been applied to drm-misc-next and GPUVis MR was merged. Closing this issue. Thanks everyone for this interesting discussion and your input!

Store the timestamp of the current vblank in the new field 'time' of the vblank trace event. If the timestamp is calculated by a driver that supports high-precision vblank timing, set the field 'high-prec' to 'true'.

User space can now access actual hardware vblank times via the tracing infrastructure. Tracing applications (such as GPUVis, see [0] for related discussion), can use the newly added information to conduct a more accurate analysis of display timing.

v2 Fix author name (missing last name)

[0] mikesart/gpuvis#30

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Heinrich Fink <heinrich.fink@daqri.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20190902142412.27846-2-heinrich.fink@daqri.com

hfink-daqri mentioned this issue Aug 9, 2019

Use high-precision vblank timings, if available #33

Merged

hfink-daqri closed this as completed Sep 25, 2019
