
Add Beam Racing/Scanline Sync to RetroArch (aka Lagless VSYNC) #6984

Open
blurbusters opened this issue Jul 13, 2018 · 60 comments
Labels: bounty, feature request (New enhancement to RetroArch)

Comments

blurbusters commented Jul 13, 2018


Feature Request Description

A new lagless VSYNC technique has been developed that is already implemented in some emulators. This should be added to RetroArch too.

Bounty available

There is currently a BountySource of about $500 to add the beam racing API to RetroArch plus support at least 2 emulator modules (scroll below for bounty trigger conditions). RetroArch is a C / C++ project.

Synchronize emu raster with real world raster to reduce input lag

It is achieved via synchronizing the emulator's raster to the real world's raster. It is successfully implemented in some emulators, and uses less processing power than RunAhead, and is more forgiving than expected thanks to a "jitter margin" technique that has been invented by a group of us (myself and a few emulator authors).

For lurkers/readers: Don't know what a "raster" or "beam racing" is? Read WIRED Magazine's Racing the Beam article. Many 8-bit and 16-bit computers, consoles and arcade machines used similar techniques for many tricks, and emulators typically implement them.

Already Proven, Already Working

There is currently discussion between other willing emulator authors behind the scenes for adding lagless VSYNC (real-world beam racing support).

Preservationist Friendly. Preserves original input lag accurately.

Beam racing preserves all original latencies including mid-screen input reads.

Less horsepower needed than RunAhead.

RunAhead is amazing! That said, there are other lag-reducing tools that we should also make available too.

Android and Pi GPUs (too slow for RunAhead in many emulators) even work with this lag-reducing technique.

Beam racing works on Pi/Android and allows slower cycle-exact emulators to achieve dramatic lag reductions.
We have found it scales in both directions, including Android and Pi. Powerful computers can gain ultra-tight beam racing margins (sync between emuraster and realraster can be sub-millisecond on a GTX 1080 Ti), while slower computers can use very forgiving beam racing margins. The beam racing margin is adjustable -- it can be up to 1 refresh cycle in size.

In other words, graphics are essentially raster-streamed to the display practically real-time (through a creative tearingless VSYNC OFF trick that works with standard Direct3D/OpenGL/Metal/etc), while the emulator is merrily executing at 1:1 original speed.

Diagrammatic Concept

[Diagram: Lagless VSYNC]

[Diagram: Lagless VSYNC jitter margin]

Just like duplicate refresh cycles never have tearlines even in VSYNC OFF, duplicate frameslices never have tearlines either. We're simply subdividing frames into subframes, and then using VSYNC OFF instead.

We don't even need a raster register (it can help, but we've come up with a different method), since rasters can be a time-based offset from VSYNC, and that can still be accurate enough for flawless sub-millisecond latency difference between emulator and original machine.

Emulators can merrily run at original machine speed. Essentially streaming pixels darn-near-raster-realtime (submillisecond difference). What many people don't realize is 1080p and 4K signals still top-to-bottom scan like an old 60Hz CRT in default monitor orientation -- we're simply synchronizing to cable scanout; the scanout method of serializing 2D images to a 1D cable is fundamentally unchanged. Achieving real raster sync between the emulator raster and real raster!

Many emulators already render 1 scanline at a time to an offscreen framebuffer. So 99% of the beam racing work is already done.

Simple Pre-Requisites

Distilling down to minimum requirements makes rasters cross-platform:

  • Platform supports a VSYNC OFF mode
  • Platform is able to provide VSYNC timestamps
  • Platform supports high-precision counters (sub-millisecond-accuracy counters)
    Such as RDTSC or QueryPerformanceCounter or std::chrono::high_resolution_clock
  • PC, Mac, Android, Pi, Radeon, GeForce, Intel -- all support the beamraced frame slice technique

We use beam racing to hide tearlines in the jitter margin, creating a tearingless VSYNC OFF (lagless VSYNC ON) with a very tight (but forgiving) synchronization between emulator raster and real raster.

The simplified retro_set_raster_poll API Proposal

Proposing to add an API -- retro_set_raster_poll -- to allow this data to be relayed to an optional centralized beamracing module for RetroArch to implement realworld sync between emuraster and realraster via whatever means possible (including frameslice beam racing & front buffer beam racing, and/or other future beam racing sync techniques).

The goal of this API is simply to allow the centralized beamracing module to take an early peek at the incomplete emulator refresh cycle framebuffer every time a new emulator scan line has been plotted to it.

This minimizes modifications to emulators, allowing centralization of beam racing code.

The central code handles its own refresh cycle scanout synchronization (busylooping to pace correctly to the real world's raster scan line number, which can be extrapolated in a cross-platform manner as seen below!) without the emulator worrying about any other beam racing specifics.

Further Detail

Basically it's a beam-raced VSYNC OFF mode that looks exactly like VSYNC ON (perfect tearingless VSYNC OFF). The emulator can merrily render at 1:1 speed while realtime streaming graphics to the display, without surge-execution needed. This requires far less horsepower on the CPU, works with "cycle-exact" emulators (unlike RunAhead) and allows ultra low lag on Raspberry PI and Android processors. Frame-slice beam racing is already used for Android Virtual Reality too, but works successfully for emulators.

Which emulators does this benefit?

This lag reduction technique will benefit any emulator that already does internal beam racing (e.g. to support original raster interrupts). Nearly all retro platforms -- most 8-bit and 16-bit platforms -- can benefit.

This lag-reduction technique does not benefit high level emulation.

Related Raster Work on GPUs

Doing actual "raster interrupts" style work on Radeon/GeForces/Intels is actually surprisingly easy: tearlines are just rasters -- see YouTube video.

This provides the groundwork for lagless VSYNC operation, synchronization of realraster and emuraster. With the emulator method, the tearlines are hidden via the jittermargin approach.

Common Developer Misconceptions

First, to clear up common developer misconceptions of assumed "showstoppers"...

  • Yes, it can work with 60Hz, 120Hz, 180Hz, 240Hz (simply beam racing cherrypicked refresh cycles -- requires surge execution for beam racing "fast" refresh cycles), works in WinUAE
  • Yes, it's more forgiving than expected of computer performance fluctuations (jitter margin technique)
  • Yes, it can work simultaneously with RunAhead (if need be, though not necessary). Simply beam race the final/visible frame.
  • Yes, it works simultaneously with variable refresh rate (see this post), works in WinUAE
  • Yes, you can easily enter/exit beamracing mode on the fly (e.g. screen rotation to incompatible scan direction, switch to windowed operation)
  • Yes, it works with scaled and HLSL/shaders/fuzzylines, as it already works in WinUAE. It does slow things down, and requires optimizations to speed up again (but this can be solved as a separate optimization). Any distortions (e.g. curves, or line fuzz) can be hidden in the jitter margin height technique, to be 100% artifactless
  • Yes, it can be used in conjunction with black frame insertion (including for the 31KHz 240p compatibility mode for MAME arcade machines; though that will require 2x surge-execute during a fast 1/120sec scanout of the visible refresh cycle).

Proposal

Recommended Hook

  1. Add the per-raster callback function called "retro_set_raster_poll"
  2. The arguments are identical to "retro_set_video_refresh"
  3. Do it to one emulator module at a time (begin with the easiest one).

It calls the raster poll every emulator scan line plotted. The incomplete contents of the emulator framebuffer (complete up to the most recently plotted emulator scanline) are provided. This allows centralization of frameslice beamracing in the quickest and simplest way.
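Below is a hedged illustration of what such a hook could look like in C, mirroring the existing retro_set_video_refresh() signature. Nothing here exists in libretro.h yet; the callback typedef, the core-side helper names and the framebuffer constants are placeholders for illustration only.

```c
/* Hypothetical API sketch -- not part of libretro.h.
 * Same argument list as retro_video_refresh_t: data points at the emulator's
 * offscreen framebuffer, which is complete only up to the most recently
 * plotted scanline. */
typedef void (*retro_raster_poll_t)(const void *data, unsigned width,
                                    unsigned height, size_t pitch);

void retro_set_raster_poll(retro_raster_poll_t cb);

/* Inside a core's scanline renderer (all names assumed), the per-module change
 * is roughly one call per emulated scanline; the frontend counts these calls
 * to know how many rows are complete. */
static retro_raster_poll_t raster_poll_cb; /* set via retro_set_raster_poll() */

static void render_scanline(unsigned line)
{
   plot_scanline_to_framebuffer(line);      /* existing core code */
   if (raster_poll_cb)
      raster_poll_cb(framebuffer, FB_WIDTH, FB_HEIGHT, FB_PITCH);
}
```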

Cross-Platform Method: Getting VSYNC timestamps

You don't need a raster register if you can do this! You can extrapolate approximate scan line numbers simply as a time offset from a VSYNC timestamp. You don't need line-exact accuracy for flawless emulator frameslice beamracing.

For the cross-platform route -- the register-less method -- you need to listen for VSYNC timestamps while in VSYNC OFF mode.

These ideally should become your only #ifdefs -- everything else about GPU beam racing is cross platform.

PC Version

  1. Get your primary display adaptor device name, such as \\.\DISPLAY1 .... For me in C#, I use Screen.PrimaryScreen.DeviceName to get this, but in C/C++ you can use EnumDisplayDevices() ...
  2. Next, call D3DKMTOpenAdapterFromHdc() with this info to open the hAdaptor handle
  3. For listening to VSYNC timestamps, run a thread with D3DKMTWaitForVerticalBlankEvent() on this hAdaptor handle. Then immediately record the timestamp. This timestamp represents the end of a refresh cycle and beginning of VBI.
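To make the Windows steps above concrete, here is a minimal, hedged sketch (not RetroArch code) of opening the adapter behind the primary display and timestamping each VSYNC from a dedicated thread. It assumes d3dkmthk.h is available and gdi32 is linked; the display device name is hardcoded for brevity where a real implementation would enumerate it via EnumDisplayDevices().

```c
#include <windows.h>
#include <d3dkmthk.h>   /* D3DKMT* kernel-mode-thunk declarations (gdi32.lib) */

static D3DKMT_HANDLE                  g_adapter;
static D3DDDI_VIDEO_PRESENT_SOURCE_ID g_source;

static int open_primary_adapter(void)
{
   D3DKMT_OPENADAPTERFROMHDC open = {0};
   /* "\\.\DISPLAY1" assumed to be the primary display; enumerate it properly
    * with EnumDisplayDevices() in real code. */
   open.hDc = CreateDCA("\\\\.\\DISPLAY1", "\\\\.\\DISPLAY1", NULL, NULL);
   if (!open.hDc || D3DKMTOpenAdapterFromHdc(&open) != 0 /* STATUS_SUCCESS */)
      return 0;
   g_adapter = open.hAdapter;
   g_source  = open.VidPnSourceId;
   return 1;
}

/* Run in its own thread: blocks until each vertical blank begins, then records
 * a QueryPerformanceCounter timestamp (end of refresh / start of VBI). */
static DWORD WINAPI vsync_listener_thread(LPVOID unused)
{
   for (;;)
   {
      D3DKMT_WAITFORVERTICALBLANKEVENT wait = {0};
      wait.hAdapter      = g_adapter;
      wait.VidPnSourceId = g_source;
      if (D3DKMTWaitForVerticalBlankEvent(&wait) != 0)
         break;

      LARGE_INTEGER now;
      QueryPerformanceCounter(&now);
      /* hand 'now' off to the beam racing module here (not shown) */
   }
   return 0;
}
```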

Mac Version

Other platforms have various methods of getting a VSYNC event hook (e.g. the Mac's CVDisplayLinkOutputCallback, which roughly corresponds to the Mac's blanking interval). If you are using the registerless method and generic precision clocks (e.g. RDTSC wrappers), these can potentially be your only #ifdefs in your cross-platform beam racing -- simply the various methods of getting VSYNC timestamps. The rest has no platform-specific code.
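As a rough illustration of the Mac path, a hedged CVDisplayLink sketch follows (CoreVideo's C API; again, not RetroArch code, and treating the callback's host time as the approximate VSYNC timestamp is an assumption of this sketch):

```c
#include <CoreVideo/CoreVideo.h>

/* Fires roughly once per refresh; record a host timestamp for the beam racing module. */
static CVReturn display_link_cb(CVDisplayLinkRef link,
                                const CVTimeStamp *in_now,
                                const CVTimeStamp *in_output_time,
                                CVOptionFlags flags_in,
                                CVOptionFlags *flags_out,
                                void *user_data)
{
    uint64_t host_time = in_now->hostTime;  /* mach_absolute_time units */
    /* publish host_time as the approximate VSYNC timestamp (not shown) */
    return kCVReturnSuccess;
}

static void start_vsync_listener(void)
{
    CVDisplayLinkRef link = NULL;
    CVDisplayLinkCreateWithActiveCGDisplays(&link);
    CVDisplayLinkSetOutputCallback(link, display_link_cb, NULL);
    CVDisplayLinkStart(link);
}
```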

Linux Version

See GPU Driver Documentation. There is a get_vblank_timestamp() available, and sometimes a get_scanout_position() (raster register equivalent). Personally, I'd only focus on obtaining VSYNC timestamps -- much simpler and more guaranteed on all platforms.

Getting the current raster scan line number

For raster calculation you can do one of the two:

(A) Raster-register-less method: Use RDTSC or QueryPerformanceCounter or std::chrono::high_resolution_clock to profile the times between refresh cycles. On Windows, you can use the known fractional refresh rate (from QueryDisplayConfig) to bootstrap this "best-estimate" refresh rate calculation, and refine this in realtime. Calculating raster position is simply a relative time between two VSYNC timestamps, allowing 5% for VBI (meaning 95% of 1/60sec for 60Hz would be the display scanning out). A minimal sketch follows this list. NOTE: Optionally, to improve accuracy, you can dejitter. Use a trailing 1-second interval average to dejitter any inaccuracies (they calm to 1-scanline-or-less raster jitter), and ignore all outliers (e.g. missed VSYNC timestamps caused by computer freezes). Alternatively, just use the jittermargin technique to hide VSYNC timestamp inaccuracies.

(B) Raster-register-method: Use D3DKMTGetScanLine to get your GPU's current scanline on the graphics output. Wait at least 1 scanline between polls (e.g. sleep 10 microseconds between polls), since this is an expensive API call that can stress a GPU if busylooping on this register.

NOTE: If you need to retrieve the "hAdaptor" parameter for D3DKMTGetScanLine -- then get your adaptor device name such as \\.\DISPLAY1 via EnumDisplayDevices() ... Then call D3DKMTOpenAdapterFromHdc() with this adaptor name in order to open the hAdaptor handle, which you can then finally pass to D3DKMTGetScanLine. This works with Vulkan/OpenGL/D3D9/10/11/12+ .... D3DKMT is simply a hook into the hAdaptor that is being used for your Windows desktop, which exists as a D3D surface regardless of what API your game is using, and all you need is to know the scanline number. So who gives a hoot about the "D3DKMT" prefix -- it works fine with beamracing with OpenGL or Vulkan API calls. (KMT stands for Kernel Mode Thunk, but you don't need Admin privileges to do this specific API call from userspace.)
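Here is a minimal sketch of method (A) above -- estimating the current real-world scanline purely from the latest VSYNC timestamp, an estimated refresh period, and an assumed VBI fraction. All names and numbers are illustrative.

```c
typedef struct
{
   double last_vsync;       /* timestamp of most recent VSYNC, in seconds      */
   double refresh_period;   /* e.g. 1.0 / 59.94, refined from measured VSYNCs  */
   int    active_scanlines; /* visible scanlines, e.g. 1080                    */
   double vbi_fraction;     /* assumed portion of the period in VBI, e.g. 0.05 */
} scanout_model;

/* Returns the estimated real-world scanline being scanned out right now,
 * or -1 if we are (probably) inside the blanking interval. */
static int estimate_scanline(const scanout_model *m, double now)
{
   double elapsed = now - m->last_vsync;
   double phase   = elapsed / m->refresh_period;  /* position within refresh    */
   phase -= (long)phase;                          /* wrap if a VSYNC was missed */

   double active = 1.0 - m->vbi_fraction;
   if (phase >= active)
      return -1;                                  /* in VBI                     */
   return (int)((phase / active) * m->active_scanlines);
}
```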

Improved VBI size monitoring

You don't need raster-exact precision for basic frameslice beamracing, but knowing the VBI size makes frameslice beamracing more accurate, since VBI size varies so much from platform to platform and resolution to resolution. Often it varies only a few percent, and most sub-millisecond inaccuracies are easily hidden within the jittermargin technique.

But, if you've programmed with retro platforms, you are probably familiar with the VBI (blanking interval) -- essentially the overscan space between refresh cycles. This can vary from 1% to 5% of a refresh cycle, though extreme timing tweaking can make the VBI more than 300% the size of the active image (e.g. Quick Frame Transport tricks -- fast-scan refresh cycles with long VBIs in between). For cross-platform frameslice beamracing it's OK to assume ~5% being the VBI, but there are many tricks to know the VBI size.

  1. QueryDisplayConfig() on Windows will tell you the Vertical Total. (easiest -- see the sketch after this list)
  2. Or monitoring the ratio of .INVBlank = true versus .INVBlank = false ... (via D3DKMTGetScanLine) by monitoring the flag changes (wait a few microseconds between polls, or 1 scanline delay -- D3DKMTGetScanLine is an 'expensive' API call)
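A hedged sketch of option 1, reading the vertical total and active height from QueryDisplayConfig() on Windows (the field names come from DISPLAYCONFIG_VIDEO_SIGNAL_INFO; error handling is trimmed for brevity and this is illustrative only):

```c
#include <windows.h>
#include <stdlib.h>

/* Returns the VBI fraction (e.g. ~0.05) of the first active target,
 * or a fallback of 0.05 if the query fails. */
static double query_vbi_fraction(void)
{
   UINT32 n_paths = 0, n_modes = 0;
   if (GetDisplayConfigBufferSizes(QDC_ONLY_ACTIVE_PATHS, &n_paths, &n_modes) != ERROR_SUCCESS)
      return 0.05;

   DISPLAYCONFIG_PATH_INFO *paths = calloc(n_paths, sizeof(*paths));
   DISPLAYCONFIG_MODE_INFO *modes = calloc(n_modes, sizeof(*modes));
   double fraction = 0.05;

   if (QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS, &n_paths, paths,
                          &n_modes, modes, NULL) == ERROR_SUCCESS)
   {
      for (UINT32 i = 0; i < n_modes; i++)
      {
         if (modes[i].infoType != DISPLAYCONFIG_MODE_INFO_TYPE_TARGET)
            continue;
         const DISPLAYCONFIG_VIDEO_SIGNAL_INFO *sig =
            &modes[i].targetMode.targetVideoSignalInfo;
         /* totalSize.cy = vertical total, activeSize.cy = active scanlines */
         fraction = 1.0 - (double)sig->activeSize.cy / (double)sig->totalSize.cy;
         break;
      }
   }
   free(paths);
   free(modes);
   return fraction;
}
```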

Turning The Above Data into Real Frameslice Beamracing

For simplicity, begin with emu Hz = real Hz (e.g. 60Hz)

  1. Have a configuration parameter of number of frameslices (e.g. 10 frameslices per refresh cycle)
  2. Let's assume 10 frameslices for this exercise.
  3. Actual screen 1080p means 108 real pixel rows per frameslice.
  4. Emulator screen 240p means 24 emulator pixel rows per frameslice.
  5. Your emulator module calls the centralized raster poll (retro_set_raster_poll) right after every emulator scan line. The centralized code (retro_set_raster_poll) counts the number of emulator pixel rows completed to fill a frameslice. The central code will do either (5a) or (5b):
    (5a) Returns immediately to emulator module if not yet a full new framesliceful have been appended to the existing offscreen emulator framebuffer (don't do anything to the partially completed framebuffer). Update a counter, do nothing else, return immediately.
    (5b) However, once you've got a full frameslice worth built up since the last frameslice presented, it's now time to present the next frameslice (see the sketch after this list). Don't return right away. Instead, immediately do an intentional CPU busyloop until the realraster reaches roughly 2 frameslice-heights above your emulator raster (relative screen-height wise). So if your emulator framebuffer is filled up to the bottom edge of where frameslice 4 is, then busyloop until the realraster hits the top edge* of frameslice 3. Then immediately Present() or glutSwapBuffers() upon completing the busyloop. Then Flush() right away.
    NOTE: The tearline (invisible if the graphics at the raster are unchanged) will sometimes be a few pixels below the scan line number (the amount of time for a memory blit -- memory-bandwidth dependent -- you can compensate for it, or you can just hide any inaccuracy in the jittermargin)
    NOTE2: This is simply the recommended beamrace margin to begin experimenting with: A 2 frameslice beamracing margin is very jitter-margin friendly.
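The following is an illustrative sketch of steps 5a/5b under the assumptions above (10 frameslices, 2-frameslice beamrace margin). The helpers current_real_scanline_fraction(), present_frame() and flush_gpu() are assumed wrappers around the raster estimate and the platform's Present()/Flush() calls, not existing RetroArch functions.

```c
/* Called from the centralized raster poll once per emulated scanline. */
static void maybe_present_frameslice(int emu_lines_done, int emu_height,
                                     int frameslices)
{
   static int next_slice      = 1;
   int        lines_per_slice = emu_height / frameslices;   /* e.g. 240/10 = 24 */

   /* (5a) Not a full new frameslice yet: return immediately, do nothing. */
   if (emu_lines_done < next_slice * lines_per_slice)
      return;

   /* (5b) Busyloop until the realraster is ~2 frameslices above the emuraster
    * (relative screen-height-wise), then present and flush. */
   double emu_fraction    = (double)emu_lines_done / emu_height;
   double target_fraction = emu_fraction - 2.0 / frameslices;

   while (current_real_scanline_fraction() < target_fraction)
      ;  /* busyloop (or sleep-then-spin, per the Best Practices below) */

   present_frame();   /* Present() / glutSwapBuffers() */
   flush_gpu();       /* Flush() right away           */

   next_slice++;
   if (emu_lines_done >= emu_height)
      next_slice = 1; /* start of the next emulator refresh cycle */
}
```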

Example

Note: This 120Hz scanout diagram is from a different post of mine. Mentally replace it with the emulator refresh rate matching the real refresh rate, i.e. a monitor set to 60Hz instead. This diagram simply helps raster veterans conceptualize how modern-day tearlines relate to raster position as a time-based offset from VBI.

[Diagram: Lagless VSYNC jitter margin]

Bottom line: As long as you keep repeatedly Present()-ing your incompletely-rasterplotted (but progressively more complete) emulator framebuffer ahead of the realraster, the incompleteness of the emulator framebuffer never shows glitches or tearlines. The display never has a chance to display the incompleteness of your emulator framebuffer, because the display's realraster is showing only the latest completed portions of your emulator's framebuffer. You're simply appending new emulator scanlines to the existing emulator framebuffer, and presenting that incomplete emulator framebuffer always ahead of real raster. No tearlines show up because the already-refreshed-part is duplicate (unchanged) where the realraster is. It thusly looks identical to VSYNC ON.

Precision Assumptions:

  • Scaling doesn't have to be exact.
  • The two frameslice offset gives you a one-frameslice-ahead jitter margin
  • You can vary the height of consecutive frameslices if you want, slightly, or lots, or for rounding errors.
  • No artifacts show because the frameslice seams are well into the jitter margin.

Special Note On HLSL-Style Filters: You can use HLSL/fuzzyline style shaders with frameslices. WinUAE just does a full-screen redo on the incomplete emu framebuffer, but one could do it selectively (from just above the realraster all the way to just below the emuraster) as a GPU performance-efficiency optimization.

Adverse Conditions To Detect To Automatically disable beamracing

Optional, but for user-friendly ease of use, you can automatically enter/exit beamracing on the fly if desired. You can verify common conditions, such as making sure all of the following are true:

  • Rotation matches (scan direction same) = true
  • Supported refresh rate = true
  • Module has a supported raster hook = true
  • Emulator performance is sufficient = true

Exiting beamracing can be simply switching to "racing the VBI" (doing a Present() between refresh cycles), so you're just simulating traditional VSYNC ON via VSYNC OFF via that manual VSYNC'ing. This is like 1-frameslice beamracing (next frame response). This provides a quick way to enter/exit beamracing on the fly when conditions change dynamically. A Surface Tablet gets rotated, a module gets switched, refresh rate gets changed mid-game, etc.

Questions?

I'd be happy to answer questions.


blurbusters commented Jul 13, 2018

Additional timesaver notes:

General Best Practices

Debugging raster problems can be frustrating, so here's knowledge by myself/Calamity/Toni Wilen/Unwinder/etc. These are big timesaver tips:

  1. Raster error manifests itself as tearline jitter.
  2. If jitter is within raster jittermargin technique, no tearing or artifacts shows up.
  3. It's an amazing performance profiling tool; tearline jitter makes your performance fluctuations very visible. In debug mode, use color-coded tints for your frameslices, to help make normally-hidden raster jitter more visible (WinUAE uses this technique).
  4. Raster error is more severe at the top edge than the bottom edge. This is because the GPU is more busy during this region (e.g. scheduled Windows compositing thread, stuff that runs every VSYNC event in the Windows Kernel, etc). It's minor, but it means you need to make sure your beam racing margin accommodates this.
  5. GPU power management. If your emulator is very light on a powerful GPU, your GPU's fluctuating power management will amplify raster error, which may mean having too few frameslices will have amplified tearline jitter. Fixes include (A) configure more frameslices (B) simply detect when the GPU is too lightly loaded and make it busy one way or another (e.g. automatically use more frameslices). The rule of thumb is don't let the GPU idle for more than a millisecond if you want scanline-exact rasters. Or you can just use a bigger jittermargin to hide raster jitter.
  6. If you're using D3DKMTGetScanLine... do not busyloop on it because it stresses the GPU. Do a CPU busyloop of a few microseconds before polling the raster register again.
  7. Do a Flush() before your busyloop before your precision-timed Present(). This massively increases accuracy of frameslice beamracing. But it can decrease performance.
  8. Thread-switching on some older CPUs can cause RDTSC or QueryPerformanceCounter to unexpectedly tick backwards. So keep QueryPerformanceCounter polls to the same CPU thread with a BeginThreadAffinity. You probably already know this from elsewhere in the emulator, but this is mentioned here as being relevant to beamracing.
  9. Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the emulator's previous refresh cycle's framebuffer. That way, there's no blank/black area underneath the emulator raster. This will greatly reduce visibility of glitches during beamrace fails (falling outside of jitter margin -- too far behind / too far ahead) -- no tearing will appear unless within 1 frameslice of realraster, or 1 refresh cycle behind. A humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical -- e.g. 2-frameslice beamracing practical (e.g. bottom-half screen Present() while still scanning out top half, and top-half screen Present() while scanning out bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless, looking identical to VSYNC ON for many minutes, before you might see a brief instantaneous tearline from a momentary computer freeze, which instantly disappears when the beamrace gets back in sync.
  10. Some platforms support microsecond-accurate sleeping, which you can use instead of busylooping. Some platforms can also set the granularity of the sleep (there's an undocumented Windows API call for this). As a compromise, some of us just do a normal thread sleep until a millisecond prior, then do a busyloop to align to the raster (see the sketch after this list).
  11. Don't worry about mid-scanline splits (e.g. HSYNC timings). We don't have to worry about such sheer accuracy. The GPU transceiver reads full pixel rows at a time. Being late for a HSYNC simply means the tearline moves down by 1 pixel. Still within your raster jitter margin. We can jitter quite badly when using a forgiving jitter margin -- (e.g. 100 pixels amplitude raster jitter will never look different from VSYNC ON). Precision requirement is horizontal scanrate (e.g. 67KHz means 1/67000sec precision needed for scanline-exact tearlines -- which is way overkill for 10-frameslice beamracing which only needs 1/600sec precision at 60Hz).
  12. Use multimonitor. Debugging is way easier with 2 monitors. Use your primary in exclusive full screen mode, with the IDE on a 2nd monitor. (Not all 3D frameworks behave well with that, but if you're already debugging emulators, you've probably made this debugging workflow compatible already anyway). You can do things like write debug data to a console window (e.g. raster scanline numbers) when debugging pesky raster issues.
  13. Some digital display outputs exhibit micropacketization behavior (DisplayPort at lower resolutions especially, where multiple rows of pixels seem to squeeze into the same packet -- my suspicion). So your raster jitter might vibrate in 2 or 4 scan line multiples rather than single-scanline multiples. This may or may not happen more often with interleaved data (DisplayPort cable handling 2 displays or other PCI-X data), but they are still pretty raster-accurate otherwise; the raster inaccuracies are sub-millisecond, and fall far within the jitter margin. Advanced algorithms such as DSC (Display Stream Compression of new DisplayPort implementations) can amplify raster jitter a bit. But don't worry; all known micro-packetization inaccuracies fall well within the jittermargin technique, so no problem. I only mention this in case you find raster-jitter differences between different video outputs.
  14. Become more familiar with how the jitter-margin technique saves your ass. If you follow Best Practice #9, you gain a full wraparound jittermargin (you see, step #9 allows you to Present() the previous refresh cycle on the bottom half of the screen, while still rendering the top half...). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, your jitter margin is a full wraparound to the previous refresh cycle. The earliest bound is a pageflip too late (more than 1 refresh cycle ago) or a pageflip too soon (into the same frameslice still not completed scanning-out onto display). Between these two bounds is one full refresh cycle minus one frameslice! So don't worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where the margin between realraster and emuraster can randomly vary) in this case... It still looks perfectly like VSYNC ON until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep the beam racing margin tight (you could make the beamrace margin adjustable as a config value, if desired -- though I just recommend "aim the Present() at 2 frameslice margin" for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines into the jittermargin makes frameslice beam-racing very forgiving of transient background activity.
  15. Get familiar with how it scales up/down well to powerful and underpowered platforms. Yes, it works on Raspberry PI. Yes, it works on Android. While high-frameslice-rate beamracing requires a powerful GPU, especially with HLSL filters, low-frameslice beamracing makes it easier to run cycle-exact emulation at a very low latency on less powerful hardware - the emulator can merrily emulate at 1:1 speed (no surge execution needed) spending more time on power-consuming cycle-exactness or ability to run on slower mobile GPUs. You're simply early-presenting your existing incomplete offscreen emulator framebuffer (as it gets progressively-more-complete). Just adjust your frameslice count to an equilibrium for your specific platform. 4 is super easy on the latest Androids and Raspberry PI (Basically 4 frameslice beam racing for 1/4th frame subrefresh input lag -- still damn impressive for a PI or Android) while only adding about 10% overhead to the emulator.
  16. If you are on a platform with front buffer rendering (single buffer rendering), count yourself lucky. You can simply rasterplot new pixel rows directly into the front buffer instead of keeping the buffer offscreen (as you already are)! And plot on top of existing graphics (overwrite the previous refresh cycle) for a jitter margin of a full refresh cycle minus 1-2 pixel rows! Just provide a config parameter for beamrace margin (vertical screen-height percentage difference between emuraster + realraster), to adjust the tightness of beamracing. You can support the frameslicing VSYNC OFF technique & frontbuffer technique with the same suggested API, the retro_set_raster_poll suggestion -- it makes it futureproof to future beamracing workflows.
  17. Yes, it works with curved scanlines in HLSL/filter type algorithms. Simply adjust your beamracing margin to prevent the horizontally straight realraster from touching the top parts of curved emurasters. Usually a few pixel rows will do the job. You can add a scanlines-offset-adjustment parameter or a frameslice-count-offset adjustment parameter.
  18. You may have to sometimes temporarily turn off debug output when programming/debugging real-world beam racing. Some environments have too many raster glitches when a console window is running -- the IDE's console is surprisingly slow/inefficient. So, when running in debug mode, it may be better to create your own built-in graphics console overlay instead of a separate console window -- don't use debug console-writing to the IDE or a separate shell window during beam racing. It can glitch massively if you generate lots of debug output to a console window. Instead, display debug text directly in the 3D framebuffer, and try to buffer your debug-text-writing until your blanking interval, then display it as a block of text at the top of the screen (like a graphics console overlay). Even doing the 3D API calls to draw a thousand letters of text on screen will cause far fewer glitches than trying to run a 2nd separate window of text -- the IDE debug overheads & shell window overheads can cause massive beam-racing glitches if you try to output debug text, and some debug output commands can cause >16ms stalls. I suspect that some IDEs are programmed in a garbage-collected language and sometimes the act of writing console output causes a garbage-collect event to occur, or some other really nasty operating-system / IDE environment overhead. So if you're running in debug mode while debugging raster glitches, temporarily turn off the usual debug output mechanism, and output instead as a graphics-text overlay on your existing 3D framebuffer -- even if it means redundantly re-drawing a line of debugging text at the top edge of the screen every frame.
    NOTE: Debug mode seems okay (a good test of amplified raster jitter sometimes) on fast machines if debug output is temporarily disabled (or only used very sparingly -- rasters glitch severely when using the Visual Studio debug console unless you've got a massively multithreaded CPU). If you're only using a 2-core or 4-core CPU and need to debug raster-exactness problems, it is preferable to redraw onscreen characters (e.g. SpriteFonts) every frameslice instead, as your onscreen graphical debug console -- that actually is less disruptive. Hopefully you don't need to do this, but be prepared to.
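As a hedged illustration of tip 10 (sleep until roughly a millisecond before the target, then spin), something along these lines works on Windows; the function name and the 2ms spin threshold are arbitrary choices, not existing RetroArch code:

```c
#include <windows.h>

/* Wait until an absolute QueryPerformanceCounter target: coarse Sleep() for
 * the bulk of the wait, then busy-spin for the final couple of milliseconds. */
static void wait_until_qpc(LONGLONG target_qpc)
{
   LARGE_INTEGER freq, now;
   QueryPerformanceFrequency(&freq);
   LONGLONG one_ms = freq.QuadPart / 1000;

   for (;;)
   {
      QueryPerformanceCounter(&now);
      LONGLONG remaining = target_qpc - now.QuadPart;
      if (remaining <= 0)
         return;
      if (remaining > 2 * one_ms)
         Sleep(1);           /* coarse; timeBeginPeriod(1) improves granularity */
      else
         YieldProcessor();   /* final stretch: spin on the precision counter */
   }
}
```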

Hopefully these best practices reduce the amount of hairpulling during frameslice beamracing.

Special Notes

  • Special Note about Rotation: Emulator devices should already report their screen orientation (portrait, landscape), which generally also defines scan direction. QueryDisplayConfig() will tell you the real screen orientation. Default orientation is always top-to-bottom scan on all PC/Mac GPUs. A 90-degree counterclockwise display rotation changes the scan direction into left-to-right. If emulating Galaxian, this is quite fine if you're rotating your monitor (left-right scan) and emulating Galaxian (left-right scan) -- then beamracing works.

  • Special Note about Unsupported Refresh Rates: Begin KISS and worry about 50Hz/60Hz only first. Start easy. Then iterate in adding support for other refresh rates like multiples. 120Hz is simply cherrypicking every other refresh cycle to beam race. For the in-between refresh cycles, just leave the existing frame up (the already-completed frame) until the refresh cycle that you want to beamrace is about to begin. In reality, there are very few unbeamraceable refresh rates -- even beamracing 60fps onto 75Hz is simply beamracing cherrypicked refresh cycles (it'll still stutter like 60fps@75Hz VSYNC ON though).

  • Advanced Note about VRR Beam Racing: Before beam racing variable refresh rate modes (e.g. enabling GSYNC or FreeSync and then beamracing that) -- wait until you've mastered all the above before you begin to add VRR compatibility to your beamracing. So for now, disable VRR when implementing frameslice beamracing for the first time. Add this as a last step once you've gotten everything else working reasonably well. It's easy to do once you understand it, but the conceptual thought of VRR beamracing is a bit tricky to grasp at first. VRR+VSYNC OFF supports beamracing on VRR refresh cycles. The main considerations are: the first Present() begins the manually-triggered refresh cycle (.INVBlank becomes false and ScanLine starts incrementing), and you can then frameslice-beamrace that normally, like an individual fixed-Hz refresh cycle. Now, one additional very special, unusual consideration is the uncontrolled VRR repeat-refresh. One will need to do emergency catchup beamraces on VRR displays if a display decides to do an uncommanded refresh cycle (e.g. when a display+GPU decides to do a repeat-refresh cycle -- this often happens when a display's framerate goes below VRR range). These uncommanded refresh cycles also automatically occur below VRR range (e.g. under 30fps on a 30Hz-144Hz VRR display). Most VRR displays will repeat-refresh automatically until they have fully displayed an untorn refresh cycle. If this happens and you've already begun emulating a new emulator refresh cycle, you have to immediately start your beamrace early (rather than at the wanted precise time). So if you do a frameslice beamrace of a VRR refresh cycle, the GPU will immediately and automatically send a repeat-refresh to the display. There might be an API call to suppress this behavior, but we haven't found one; this unwanted behavior makes beamraced 60fps onto a 75Hz FreeSync display difficult to do stutter-free. But it works fine for 144Hz VRR displays -- we find it's easy to be stutterfree when the VRR max is at least twice the emulator Hz, since we don't care about those automatic repeat-refresh cycles that aren't colliding with the timing of the next beamrace.

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [$100 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch Jul 13, 2018
@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $120 BountySource] Jul 13, 2018
@blurbusters
Copy link
Author

blurbusters commented Jul 13, 2018

$120 Funds Now Added to BountySource

Added $120 BountySource -- permanent funds -- no expiry.

https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch

Trigger for BountySource completion:

  1. Add optional retro_set_raster_poll API and centralized beam racing code
    (or mutually agreed easier compromise)
  2. Make any two emulator modules successfully work with it
    (on 3 platforms: on PC, on Mac, on Linux. See above for a list of API calls available on all 3 platforms)

Minimum refresh rate required: Native refresh rate.

Emulator Support Preferences: Preferably including either NES and/or VICE Commodore 64, but you can choose any two emulators that are easiest to add beam-racing to.

Notes: GSYNC/FreeSync compatible beam racing is nice (works in WinUAE) but not required for this BountySource award; can be a stretch goal later. Must support at least native refresh rate (e.g. 50Hz, 60Hz) but would be a bonus to also support multiple thereof (e.g. 100Hz or 120Hz) -- as explained, this is done via automatically cherrypicking which refresh cycles to beamrace (WinUAE style algorithm or another mutually agreed algorithm).

Effort Assessment

Assessment is that item 1 will probably require about a thousand-ish lines of code, while item 2 (modification to individual emulator modules) can be as little as 10 lines or thereabouts. 99% of the beam racing is already implemented by most 8-bit and 16-bit emulators and emulator modules; it's simply the missing 1% (sync between emuraster and realraster) that is somewhat 'complicated' to grasp.

The goal is to simplify and centralize as much of the beam racing complexity as possible, and minimize emulator-module work as much as possible -- and achieve original-machine latencies (e.g. a software emulator with virtually identical latencies to an original machine), which has already been successfully achieved with this technique.

Most of the complexity is probably testing/debugging the multiple platforms.

It's Easier Than Expected. Learning/Debugging Is The Hard Part

Toni of WinUAE said it was easier than expected. It's simply the learning that's hard. 90% of your work will be learning how to realtime-beamrace a modern GPU, 10% of your time coding. Few people (except Blur Busters) understand the "black box" between Present() and photons hitting eyes. But follow the Best Practices and you'll realize it is as easy as an E=mc^2 Eureka Moment if you've ever programmed an Amiga Copper or Commodore 64 raster interrupt -- modern GPUs are surprisingly crossplatform-beamraceable now, via the "VSYNC OFF tearlines are simply rasters. All tearlines created in humankind are simply rasters" technical understanding.

BountySource Donation Dollar Match Thru the $360 Level

DOLLAR MATCH CHALLENGE -- Until End of September 2018 I will match dollar-for-dollar all additional donations by other users up to another $120. Growing my original donation to $240 in addition to $120 other people's donations = $360 BountySource!

EDIT: Dollar match maxed out 2018/07/17 -- I've donated $360


ghost commented Jul 14, 2018

How could this possibly be done reliably on desktop OSes (non-hard-realtime) where scheduling latency is random?


blurbusters commented Jul 14, 2018

How could this possibly be done reliably on desktop OSes (non-hard-realtime) where scheduling latency is random?

See above. It's already in an emulator. It's already successfully achieved.

That's not a problem thanks to the jittermargin technique.

[Diagram: Lagless VSYNC jitter margin]

Look closely at the labels in Frame 3.
[Image: close-up of Frame 3]

As long as the Present() occurs with a tearline inside that region, there is NO TEARING, because it's a duplicate frameslice at the screen region that's currently scanning-out onto the video cable. (As people already know, a static image never has a tearline -- tearlines only occurs with images in motion). The jitter margin technique works, is proven, is already implemented, and is already in a downloadable emulator, if you wish to see for yourself. In addition, I also have video proof below:

Remember, I am the inventor of TestUFO.com and founder of BlurBusters.com

If you've seen the UFO or similar tests on any website (RTings, TFTCentral, PCMonitors, etc.), they are likely using one of my display-testing inventions. I've got a peer-reviewed conference paper with NIST.gov, NOKIA, Keltek. So my reputation precedes me, and now with that out of the way:

As a result, I know what I am talking about.

You can adjust the jittermargin to give as much as 16.7ms of error margin (Item 9 of Best Practices above). Error margin with zero artifacts is achieved via jittermargin (duplicate frameslice = no tearline).
Testing shows we can go ~1/4 refresh cycle errormargin on PI/Android and sub-millisecond errormargin on GTX 1080Ti + i7 systems.

Some videos I've created of my Tearline Jedi Demo --

Here's YouTube video proof of stable rasters on GeForces/Radeons:
THREAD: https://www.pouet.net/topic.php?which=11422&page=1

[Embedded videos: stable raster / Tearline Jedi demos]

And the world's first real-raster cross-platform Kefrens Bars demo:

[Embedded video: Kefrens Bars demo]

(8000 frameslices per second -- 8000 tearlines per second -- way overkill for an emulator -- 100 tearlines per refresh cycle with 1-pixel-row framebuffers stretched vertically between tearlines. I also intentionally glitch it at the end by moving around a window; demonstrating GPU-processing scheduling interference).

Now, it's much more forgiving for emulators because the tearlines (that you see in this) are all hidden via the jittermargin technique. Duplicate refresh cycles (and duplicate frameslices / scanline areas) have no tearlines. You just make sure that the emulator raster stays ahead of the real raster, and frameslice new slices onto the screen in between the emuraster & realraster.

As long as you keep adding frameslices ahead of realraster -- no artifacts or tearing shows up. Common beam racing margins with WinUAE successfully is approximately 20-25 scanlines during 10 frameslice operation in WinUAE emulator. So the margin can safely jitter (computer performance problems) without artifacts.

[Diagram: Lagless VSYNC jitter margin]

If you use 10 frameslices (1/10th screen height) -- at 60Hz for 240p, that's approximately a 1.67ms jitter margin -- most newer computers can handle that just fine. You can easily increase jitter margin to almost a full refresh cycle by adding distance between realraster & emuraster -- to give you more time to add new frameslices in between.

And even if there was a 1-frame mis-performance, (e.g. computer freeze), the only artifact is a brief sudden reappearance of tearing before it disappears.

Also, Check the 360-degree jittermargin technique as part of Step 9 and 14 of Best Practices, that can massively expand the jitter margin to a full wraparound refresh cycle's worth:

  1. Instead of rasterplotting emulator scanlines into a blank framebuffer, rasterplot on top of a copy of the emulator's previous refresh cycle's framebuffer. That way, there's no blank/black area underneath the emulator raster. This will greatly reduce visibility of glitches during beamrace fails (falling outside of jitter margin -- too far behind / too far ahead) -- no tearing will appear unless within 1 frameslice of realraster, or 1 refresh cycle behind. A humongous jitter margin of almost one full refresh cycle. And this plot-on-old-refresh technique makes coarser frameslices practical -- e.g. 2-frameslice beamracing practical (e.g. bottom-half screen Present() while still scanning out top half, and top-half screen Present() while scanning out bottom half). When out-of-bounds happens, the artifact is simply brief instantaneous tearing only for that specific refresh cycle. Typically, on most systems, the emulator can run artifactless, looking identical to VSYNC ON for many minutes, before you might see a brief instantaneous tearline from a momentary computer freeze, which instantly disappears when the beamrace gets back in sync.

AND

  1. Become more familiar with how the jitter-margin technique saves your ass. If you do Best-Practice 9, you gain a full wraparound jittermargin (you see, step 9 allows you to Present() the previous refresh cycle on the bottom half of the screen, while still rendering the top half...). If you use 10 frameslices at 1080p, your jitter safety margin becomes (1080 - 108) = 972 scanlines before any tearing artifacts show up! No matter where the real raster is, your jitter margin is a full wraparound to the previous refresh cycle. The earliest bound is a pageflip too late (more than 1 refresh cycle ago) or a pageflip too soon (into the same frameslice still not completed scanning-out onto display). Between these two bounds is one full refresh cycle minus one frameslice! So don't worry about even a 25 or 50 scanline jitter inaccuracy (erratic beamracing where the margin between realraster and emuraster can randomly vary) in this case... It still looks perfectly like VSYNC ON until it goes out of that 972-scanline full-wraparound jitter margin. For minimum lag, you do want to keep the beam racing margin tight (you could make the beamrace margin adjustable as a config value, if desired -- though I just recommend "aim the Present() at 2 frameslice margin" for simplicity), but you can fortunately surge ahead slightly or fall behind lots, and still recover with zero artifacts. The clever jittermargin technique that permanently hides tearlines into the jittermargin makes frameslice beam-racing very forgiving of transient background activity.

And single-refresh-cycle beam racing mis-sync artifacts are not really objectionable (an instantaneous one-refresh-cycle reappearance of a tearline that disappears when the beam racing "catches up" and goes back to the jitter margin tolerances.)

240p scaled onto 1080p is roughly 4.5 real scanlines per 1 emulator scanline. Obviously, the real raster "register" will increment scan line numbers roughly 4.5 times faster. But as you have seen, Tearline Jedi successfully beam-races a Radeon/GeForce on both PC/Mac without a raster register, simply by using existing precision counter offsets. Sure, there's 1-scanline jittering as seen in the YouTube video. But tearing never shows in emulators because that's 100% fully hidden in the jittermargin technique, making it 100% artifactless even if it is 1ms ahead or 1ms behind (if you've configured those beam racing tolerances, for example -- it can be made an adjustable slider -- tighter for super fast more-realtime systems, looser for slower/older systems).

But we're only worried about the screen-height distance between the two. We merely need to make sure the emuraster is at least 1 frameslice (or more) below the realraster, relative-screen-height-wise -- and we can continue adding VSYNC OFF frameslices in between the emu raster and real raster -- creating a tearingless VSYNC OFF mode, because the framebuffer swap (Present() or glutSwapBuffers()) is a duplicate screen area, no pixels changed, so no tearline is visible. It's conceptually easy to understand once you have the "Eureka" moment.

[Diagram: Lagless VSYNC jitter margin]

There's already high speed video proof of sub-frame latencies (same-frame-response) achieved with this technique. e.g. mid-screen input reads for bottom-of-screen reactions are possible, replicating original's machine latency (to an error margin of one frameslice).

As you can see, the (intentionally-visible) rasters in the earlier videos are so stable that they fall within common jittermargin sizes (for intentionally-invisible tearlines). With this, you create a (16.7ms - 1.67ms = 15ms) jitter margin. That means with 10 frameslices and the refresh-cycle-wraparound jitter margin technique, your beamracing can go too fast or too slow within a much wider and much safer 15ms range. Today, Windows scheduling is sub-1ms and Pi scheduling is sub-4ms, so it's not a problem.

The necessary accuracy to do realworld beamracing happened 8-to-10 years ago already.

Yes, nobody really did it for emulators because it took someone to apply all the techniques together (1) Understanding how to beamrace a GPU, (2) Initially understanding the low level black box of Present()-to-Photons at least to the video output port signal level. (3) Understanding the techniques to make it very forgiving, and (4) Experience with 8-bit era raster interrupts.

In tests, WinUAE beam racing actually worked on a year-2010 desktop with an older GPU, at lower frameslice granularities -- someone also posted screenshots of an older Intel 4000-series GPU laptop in the WinUAE beamracing thread. Zero artifacts, looked perfectly like VSYNC ON but virtually lagless (well -- one frameslice's worth of lag).

Your question is understandable, but the fantastic new knowledge we all now have compensates totally for it -- a desktop with a GeForce GTX Titan has about ~100x the accuracy margin needed for sub-refresh-latency frameslice beam racing.

So as a reminder, the accuracy requirements necessary to pull off this technical feat already occurred 8-to-10 years ago, and the WinUAE emulator is successfully beamracing on an 8-year-old computer today in tests. I implore you to reread our research (especially the 18-point Best Practices), watch the videos, and view the links, to understand that it is actually quite forgiving thanks to the jittermargin technique.

(Bet you are surprised to learn that we are already so far past the rubicon necessary for this reliable accuracy, as long as the Best Practices are followed.)


blurbusters commented Jul 14, 2018

BountySource now $140

Someone added $10, so I also added $10.

NOTE: I am currently dollar-matching donations (thru the $360 level) until end of September. Contribute to the pot: https://www.bountysource.com/issues/60853960-lagless-vsync-support-add-beam-racing-to-retroarch

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $120 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $140 BountySource] Jul 14, 2018
@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $140 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $200 BountySource] Jul 16, 2018
@blurbusters

BountySource now $200

Twinphalex added $30, so I also added $30.

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $200 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $850 BountySource] Jul 17, 2018

blurbusters commented Jul 17, 2018

$850 on BountySource

Wow! bparker06 just generously donated $650 to turn this into an $850 bounty

(bparker06, if you're reading this, reach out to me, will you? -- mark@blurbusters.com -- And to reconfirm you were previously aware that I'm currently dollar-matching only up to the BountySource $360 commitment -- Thanks!)

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $850 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1050 BountySource] Jul 17, 2018

blurbusters commented Jul 17, 2018

Now $1050 BountySource

I've topped up and have now donated $360 in total
-- the dollar-for-dollar matching limit I promised earlier.

This is now number 32 biggest pot on BountySource.com at the moment!


blurbusters commented Jul 17, 2018

So..... since this is getting to be serious territory, I might as well post multiple references that may be of interest, to help jumpstart any developers who may want to begin working on this:

Useful Links

Videos of GroovyMAME lagless VSYNC experiment by Calamity:
https://forums.blurbusters.com/viewtopic.php?f=22&t=3972&start=10#p31851
(You can see the color filters added in debug mode, to highlight separate frameslices)

Screenshots of WinUAE lagless VSYNC running on a laptop with Intel GPU:
http://eab.abime.net/showthread.php?p=1231359#post1231359
(OK: approx 1/6th frame lag, due to coarse 6 frameslice granularity.)

Corresponding (older) Blur Busters Forums thread:
https://forums.blurbusters.com/viewtopic.php?f=22&t=3972

Corresponding LibRetro lag investigation thread (Beginning at post #628 onwards):
https://forums.libretro.com/t/an-input-lag-investigation/4407/628

The color filtered frame slice debug mode (found in WinUAE, plus the GroovyMAME patch) is a good validation method of realtimeness -- visually seeing how close your realraster is to emuraster -- I recommend adding this debugging technique to the RetroArch beam racing module to assist in debugging beam racing.

Minimum Pre-Requisites for Cross-Platform Beam Racing

As a reminder, our research has successfully simplified the minimum system requirements for cross-platform beam racing to just simply the following three items:

  1. Platform supports VSYNC OFF (aka "vblank disabled" mode)
  2. Platform supports getting VSYNC timestamps
  3. Platform supports high-precision counters (e.g. RDTSC or std::chrono::high_resolution_clock or QueryPerformanceCounter etc)

If you can meet (1) and (2) and (3) then no raster register is required. VSYNC OFF tearlines are just rasters, and can be "reliably-enough" controlled (when following 18-point Best Practices list above) simply as precision timed Present() or glutSwapBuffers() as precision-time-offsets from a VSYNC timestamp, corresponding to predicted scanout position.

Quick Reference Of Available VSYNC timestamping APIs

While mentioned earlier, I'll resummarize compactly: These "VSYNC timestamp" APIs have suitable accuracies for the "raster-register-less" cross platform beam racing technique. Make sure to filter any timestamp errors and freezes (missed vsyncs) -- see Best Practices above.

  • Windows: D3DKMTWaitForVerticalBlankEvent() (Works with OpenGL/Metal too)
  • MacOS: CVDisplayLinkOutputCallback()
  • Linux GPU driver API: get_vblank_timestamp()

If you 100% strictly focus on the VSYNC timestamp technique, these may be among the only #ifdefs that you need.

Other workarounds for getting VSYNC timestamps in VSYNC OFF mode

As tearlines are just rasters, it's good to know all the relevant APIs if need be. These are optional, but may serve as useful fallbacks, if need be (be sure to read Best Practices, e.g. expensiveness of API calls that we've discovered, and some mitigation techniques that we've discovered).

Note that it is necessary to use VSYNC OFF to do beam raced frameslicing. All known platforms (PC, Mac, Linux, Android) have methods that can access VSYNC OFF. On some platforms, this may interfere with your ability to get VSYNC timestamps. As a workaround, you may have to instead poll the "In VBlank" flag (or busyloop in a separate thread waiting for the bit-state change, and timestamp immediately after) -- in order to get VSYNC timestamps while in VSYNC OFF mode. Here are alternative APIs that help you work around this, if absolutely necessary.

  • Windows
    D3DKMTGetScanLine() -- Windows equivalent of a raster register. Can also be used to poll the "In VBLANK" status. However, we found it unnecessary to do this, due to the existence of D3DKMTWaitForVerticalBlankEvent(), which still works in VSYNC OFF mode. On the other hand, it may reduce the need for tying up a CPU core in precision-busylooping.

  • Linux
    get_scanout_position() -- Linux equivalent of raster register. Also can be used to poll the "In VBLANK" status too.
    drm_calc_vbltimestamp_from_scanoutpos() -- Linux Direct Rendering Manager (DRM) function for calculating VSYNC timestamps from the scanout position. This can be quite handy to also do on Windows, as a low-CPU (no busylooping needed) method of generating VSYNC timestamps.

Currently, it seems implementations of get_vblank_timestamp() tend to call drm_calc_vbltimestamp_from_scanoutpos() so you may not need to do this. However, this additional information is provided to help speed up your research when developing for this bounty.
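For completeness, here is a hedged sketch of the Windows fallback described above: deriving VSYNC timestamps in VSYNC OFF mode by watching the InVerticalBlank flag flip via D3DKMTGetScanLine(). The adapter handle/source id are assumed to have been opened as in the earlier D3DKMT sketch, and the ~15 microsecond poll spacing is an illustrative "at least one scanline" delay.

```c
#include <windows.h>
#include <d3dkmthk.h>

extern D3DKMT_HANDLE                  g_adapter;  /* from D3DKMTOpenAdapterFromHdc() */
extern D3DDDI_VIDEO_PRESENT_SOURCE_ID g_source;

static DWORD WINAPI vblank_flag_poller(LPVOID unused)
{
   LARGE_INTEGER freq;
   QueryPerformanceFrequency(&freq);
   BOOLEAN was_in_vblank = FALSE;

   for (;;)
   {
      D3DKMT_GETSCANLINE scan = {0};
      scan.hAdapter      = g_adapter;
      scan.VidPnSourceId = g_source;
      if (D3DKMTGetScanLine(&scan) != 0)
         break;

      if (scan.InVerticalBlank && !was_in_vblank)
      {
         LARGE_INTEGER now;
         QueryPerformanceCounter(&now);   /* false->true edge = VBI has just begun */
         /* publish 'now' as the VSYNC timestamp (not shown) */
      }
      was_in_vblank = scan.InVerticalBlank;

      /* Wait at least one scanline (~15 us) before polling again -- this register
       * is an expensive call, so don't hammer it in a tight loop. */
      LARGE_INTEGER t0, t1;
      QueryPerformanceCounter(&t0);
      do { QueryPerformanceCounter(&t1); }
      while (((t1.QuadPart - t0.QuadPart) * 1000000) / freq.QuadPart < 15);
   }
   return 0;
}
```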

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1050 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1065 BountySource] Jul 18, 2018
@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1065 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1082 BountySource] Jul 19, 2018

blurbusters commented Jul 20, 2018

  • Bounty now $1132

  • Any platform or module with an already-implemented beam racing technique is allowed to be rolled into the bounty, as long as it helps meet the bounty conditions (e.g. ported to the retro_set_raster_poll technique)

  • Bounty may be split between multiple programmers (if all stakeholders mutually agree). I understand not everyone can program all platforms.

As you remember, retro_set_raster_poll is supposed to be called every time after an emulator module plots a scanline to its internal framebuffer.

retro_set_raster_poll API proposal

As written earlier, basically retro_set_raster_poll (if added) simply allows the central RetroArch screen rendering code to optionally take an "early peek" at the incompletely-rendered offscreen emulator buffer, every time the emulator module plots a new scanline.

That allows the central code to beam-race scanlines (whether tightly or loosely, coarsely or ultra-zero-latency realtimeness, etc) onto the screen. It is not limited to frameslice beamracing.

By centralizing it into a generic API, the central code (future implementations) can decide how its wants to realtime-stream scanlines onto the screens (bypassing preframebuffering). This maximizes future flexibility.

  • VSYNC OFF frameslicing (add frameslices on the fly in the margin between emuraster and realraster)
  • Front buffer rendering (write emurasters directly to onscreen buffer, ahead of realraster)
  • Generous-jittermargins (see this post) versus single-scanline tightness (for virtually perfect original lag).
  • Single-scanline streaming (advanced RTOS techniques)
  • Mobile half-screen rendering (e.g. display bottom while rendering top, display top while rendering bottom) with almost zero extra CPU/GPU via the 2-frameslice tearingless VSYNC OFF technique.
  • Etc! Any other beam racing / beam chasing workflows -- including those not dreamed yet.

The bounty doesn't even ask you to implement all of this. Just 1 technique on each of 3 platforms (one for PC, one for Mac, one for Linux). The API simply provides flexibility to add other beamracing workflows later. VSYNC OFF frameslicing (essentially tearingless VSYNC OFF / lagless VSYNC ON) is the easiest way to achieve this.

Each approach has its pros/cons. Some are very forgiving, some are very platform specific, some are ultra-low-lag, and some work on really old machines. I simply suggest VSYNC OFF frameslice beamracing because it can be implemented in exactly the same way on Windows+Mac+Linux, so it is the easiest. But as you can see, there's a lot of flexibility.

The proposed retro_set_raster_poll API call would be called at roughly the horizontal scanrate (excluding VBI scanlines). That means for 480p, the API would be called nearly ~31,500 times per second; for 240p, nearly ~15,000 times per second.

While high -- the good news is that this isn't a problem, because most API calls would be an immediate return for coarse frameslicing. For example, WinUAE defaults to 10 frameslices per refresh cycle, i.e. 600 frameslices per second. So retro_set_raster_poll would simply do nothing (return immediately) until 1/10th of a screen height's worth of emulator scanlines has built up, and only then execute the expensive path.

So out of all those tens of thousands of retro_set_raster_poll calls, only 600 would be 'expensive' if RetroArch is globally configured to be limited to 10-frameslice-per-refresh beam racing (1/10th screen lag due to beam chase distance between emuraster + realraster). The rest of the calls would simply be immediate returns (e.g. not a framesliceful built up yet).
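
To make that concrete, here is a minimal hedged sketch (all names are hypothetical, not existing RetroArch symbols) of what the central handler behind a retro_set_raster_poll callback could look like for frameslice beam racing:

// Hedged sketch -- hypothetical central handler invoked once per emulated scanline.
static unsigned g_lines_since_last_slice = 0;

void central_raster_poll(const void *emu_framebuffer, unsigned emu_line,
                         unsigned emu_height, unsigned frameslices_per_refresh)
{
    if (!beamracing_supported_and_enabled())   // hypothetical check
        return;                                // e.g. rotated screen: do nothing

    g_lines_since_last_slice++;

    // Cheap early-return for the vast majority of the ~15,000-31,500 calls/sec:
    // do nothing until a full frameslice's worth of scanlines has accumulated.
    unsigned slice_height = emu_height / frameslices_per_refresh;
    if (g_lines_since_last_slice < slice_height)
        return;

    g_lines_since_last_slice = 0;

    // Expensive path (e.g. only 600 times/sec at 10 slices per 60Hz refresh):
    // upload the new slice and Present()+Flush() ahead of the real-world raster.
    present_frameslice(emu_framebuffer, emu_line, slice_height);  // hypothetical
}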

Some emulator modules only need roughly 10 lines of modification

The complexity is centralized.

The emulator module is simply modified (hopefully as little as a 10-line modification for the easiest emulator modules, such as NES) to call retro_set_raster_poll on all platforms. The beam racing complexity is all hidden centrally.

Nearly all 8-bit and 16-bit emulator modules already beamrace into their own internal framebuffers. Those are the 'easy' ones to add the retro_set_raster_poll API. So those would be easy. The bounty only needs 2 emulators to be implemented.

The central code would decide how to beam race (frameslice beam racing would be the most cross-platform method, but it doesn't have to be the only one). Platform doesn't support it yet? Automatically disable beamracing (return immediately from retro_set_raster_poll). Screen rotation doesn't match emulator scan direction? Ditto, return immediately too. Whatever code a platform has implemented for beam racing synchronization (emuraster to realraster), it can be hidden centrally.

That's what part of bounty also pays for: Add the generic crossplatform API call so the rest of us can have fun adding various kinds of beam-racing possibilities that are appropriate for specific platforms. Obviously, the initial 3 platforms need to be supported (One for Windows, one for Mac, and one for Linux) but the fact that an API gets added, means additional platforms can be later supported.

The emulators aren't responsible for handling that complexity at all -- from a quick glance, it is only a ~10 line change to NES, for example. No #ifdefs needed in emulator modules! Instead, most of the beam racing sync complexity is centralized.
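
For a sense of scale, the module-side change could be as small as the following hedged sketch (hypothetical names; the exact callback signature would be whatever #10758 settles on):

/* Hedged sketch of the ~10-line module-side change, inside a line-based core. */
static retro_raster_poll_t raster_poll_cb;   /* hypothetical callback type set by frontend */

static void render_one_scanline(unsigned y)
{
   plot_line_to_internal_framebuffer(y);     /* existing per-line rendering, unchanged */

   /* New: give the frontend an early peek at the partially-rendered framebuffer. */
   if (raster_poll_cb)
      raster_poll_cb(internal_framebuffer, y);
}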

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1082 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1132 BountySource] Jul 20, 2018
@asimonf
Copy link
Contributor

asimonf commented Jul 28, 2018

Would the behavior need to be adjusted for emulators that output interlaced content momentarily?

The SNES can switch from interlaced output to progressive during a vblank. Both NTSC and PAL are actually interlaced signals, and the console is usually just rendering even lines (or is it odd lines? I don't recall now) using a technique commonly referred to as double-strike.

@aliaspider
Copy link
Contributor

I don't see why that would matter, the only requirement here is that the core can be run on a per scanline basis, and that the vertical refresh rate is constant and close to the monitor rate.

@asimonf
Copy link
Contributor

asimonf commented Jul 28, 2018

I'm still wrapping my head around it, but yeah, now I see it. Interlaced content would be handled internally by the emulator as it already does.

@blurbusters
Copy link
Author

blurbusters commented Jul 28, 2018

About Interlaced Modes

No, behaviour doesn't need to be adjusted for interlaced.

Interlaced is still 60 temporal images per second, basically half-fields spaced 1/60 sec apart.

Conceptually, it's like a frame that contains only odd scanlines, then a frame containing only even scanlines.

Conceptually, you can think of interlaced 480i as the following:

T+0/60sec = the 240 odd scanlines
T+1/60sec = the 240 even scanlines
T+2/60sec = the 240 odd scanlines
T+3/60sec = the 240 even scanlines
T+4/60sec = the 240 odd scanlines
T+5/60sec = the 240 even scanlines

Etc.

Since interlacing was designed in the analog era, where scanlines could be positioned at arbitrary vertical locations on a CRT tube, 8-bit-era computer/console makers found a creative way to simply overlap the even/odd scanlines instead of offsetting them (via a minor TV signal timing modification), creating a 240p mode out of 480i. But 240p and 480i both still contain exactly 60 temporal images of 240 scanlines apiece, regardless.

Note: With VBI, it is sometimes called "525i" instead of "480i"

Terminologically, 480i was often called "30 frames per second", but NTSC temporal resolution was always 60 full-screen scanouts per second (50 for PAL), regardless of interlaced or progressive. "Frame" terminology is for when one cycle of full (static-image) resolution is built up. However, motion resolution was always 60, since you can display a completely different image in the second field of 480i -- and sports/soap operas always did that (60 temporal images per second since the ~1930s).

Deinterlacers may use historical information (the past few fields) to "enhance" the current field (i.e. converting 480i into 480p). "Bob" deinterlacing is often beam-racing friendly. For advanced deinterlacing algorithms, what may be displayed is an input-lagged result (e.g. a lookforward deinterlacer that displays the intermediate combined result of a 3-frame or 5-frame history -- adding 1 or 2 frames of lag). Beam racing this will still have a lagged result like any good deinterlacer may have, albeit with slightly less lag (up to 1 frame less).

Now, if there's no deinterlacing done (e.g. original interlacing preserved to output) then deinterlacing lag (for lookforward+lookbackward deinterlacers) isn't applicable here.

Emulators generally handle 480i as 60 framebuffers per second. That's the proper way to do it, anyway -- whether you do simple bob deinterlace, or any advanced deinterlace algorithms.

I used to work in the home theater industry, being the moderator of the AVSFORUM Home Theater Computers forums, and have worked with vendors (including working for RUNCO as a consultant) on their video processor & scaler products. So I understand my "i" and "p" stuff...

If all these concepts are too complicated, just add it as an additional condition to automatically disable beam racing ("If in interlaced mode instead of progressive mode, disable the laggy deinterlacer or disable beam racing").

Most retro consoles used 240p instead of 480i. Even NTSC 480i (real interlacing) is often handled as 60 framebuffers per second in an emulator, even if some sources used to call it "480i/30" (two temporal fields per frame, offset 1/60sec apart).

Note: One can seamlessly enter/exit beamracing on the fly (in real time) -- there might be one tiny microstutter during the transition (a 1/60sec lag increase/decrease), but that's an acceptable penalty during, say, a screen rotation or a video mode change (most screens take time to catch up to mode changes anyway). This is accomplished by temporarily using one VBI-synchronized full-buffer Present() per refresh (software-based VBI synchronization) instead of mid-frame Present()s (true beam racing) -- e.g. during a screen rotation when scanout directions diverge (realworld vs emu scanout). It could also cover entering/exiting interlaced mode in an SNES module, if an SNES module is chosen as one of the first two modules to support beam racing as part of the bounty requirements. Remember, you only need to support two emulator modules to claim the bounty. If you choose an SNES module as part of the bounty, it would still count towards the bounty even if beamracing is automatically disabled during interlaced mode (if that is too complex to wrap your head around).

For simplicity, supporting beam racing during interlaced modes is not a mandatory requirement for claiming this bounty -- however it is easy to support or to add later (by a programmer who understands interlacing & deinterlacing).

@blurbusters
Copy link
Author

blurbusters commented Oct 3, 2018

Formerly someone (Burnsedia) started working on this BountySource issue until they realized this was a C/C++ project. I'm updating the original post to be clear that this is a C/C++ skilled project.

@m4xw
Copy link
Contributor

m4xw commented Oct 6, 2018

@Burnsedia Your past track record on bountysource came to my attention, you marked 5 bounties as "solving", yet all of them are still open.
Since I expect you to have a solid understanding of C and the required knowledge of how graphics APIs work internally, could you please elaborate on how you would implement this feature?
If you can't answer this, I will need you to refrain from taking on our bounties, as I fear you could lock up high value bounties for no reason - effectively stalling progress on this or other bounties.

@TheDeadFish
Copy link

Has anyone tried this on Nvidia 700 or 900 series cards? I have had major issues with these cards and inconsistent timing of the frame-buffer. The time at which the frame-buffer is actually sampled can vary by as much as half a frame, making racing the beam completely impossible.

The problem stems from an over-sized video output buffer and also memory compression of some kind. As soon as the active scan starts, the output buffer is filled at an unlimited rate (really fast), which causes the read position in the frame-buffer to pull way ahead of the real beam position. The output buffer seems to store compressed pixels: for a screen of mostly solid color, about half a frame can fit in the output buffer; for a screen of incompressible noise, only a small number of lines fit, and timing is therefore much more normal.

This issue has plagued my mind for several years (I gave my 960 away because it bothered me so much), but I have yet to see any other mentions of this issue. I only post this here now because it's relevant.

@blurbusters
Copy link
Author

Bountysource increased to $1142.

@blurbusters blurbusters changed the title Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1132 BountySource] Lagless VSYNC Support: Add Beam Racing to RetroArch [Now $1142 BountySource] Oct 20, 2018
@ghost
Copy link

ghost commented Aug 13, 2019

Someone should close this issue and apologize to backers.

@mdrejhon
Copy link

mdrejhon commented Jun 1, 2020

Here you go:

#10757 (BFIv3) Emulate a CRT Electron Gun Via Rolling-Scan BFI

Theoretically, both this GitHub item (#6984) and the BFIv3 (#10757) can essentially become an identical task.

This may be helpful for people who find BFIv3 conceptually easier to program than this GitHub item, though it would need a rolling full-persistence option too.

If programming from that angle, this GitHub item could later be added as a subset of #10757, perhaps using a @TomHarte-derived flywheel sync algorithm that triggers only when emuHz and realHz are close enough to each other.

@mdrejhon
Copy link

mdrejhon commented Jun 1, 2020

Task Breakdown Simplification:

GitHub item #10758 is a pre-requisite for this.

I broke out the retro_set_raster_poll pre-requisite separately, because it's a universal requirement for all possible beamraceable output techniques.

@jayare5
Copy link

jayare5 commented May 15, 2021

Hey, I'd love to contribute some money to the bounty! But I see that it hasn't had anything added since 2018 and I'm feeling hesitant. Is it worth doing? Also it would be cool to promote it in some way; I'm surprised I don't hear more people talking about it!

@mdrejhon
Copy link

mdrejhon commented May 15, 2021

Hey, I'd love to contribute some money to the bounty! But I see that it hasn't had anything added since 2018 and I'm feeling hesitant. Is it worth doing? Also it would be cool to promote it in some way; I'm surprised I don't hear more people talking about it!

It's still a valid bounty. Most of the funds are mine -- and this bounty will be honored.

There was a bit of talk about it in 2018, but currently quiet on these fronts at the moment.

The buzz can be restarted at pretty much any time, if a small group of us with similar interests starts a buzz campaign about this. Some of us have jobs, though, or had work affected by the pandemic and have to work harder to compensate, etc. But I'm pretty much 100% behind seeing this happen.

BTW, the new "240 Hz IPS" monitors are spectacular for RetroArch (even for 60Hz operation).

@johnnovak
Copy link

johnnovak commented Nov 3, 2022

I find it so weird that there aren't dozens of devs jumping at the opportunity to implement this... More than 4 years have passed since this ticket was created and still no working implementation?! Huh?!

Input lag is one of THE most pressing issues that needs addressing in emulators, and WinUAE has proven that this technique works extremely well in practice. With the "lagless vsync" feature enabled in WinUAE with a frame-slice of 4, I really see zero reason to bother with real hardware anymore. The best of all — it works flawlessly with complex shaders! It's a huge game-changer, and I'm quite disappointed that developers of other emulators are so incredibly slow at adopting this brilliant technique.

For the record, I don't care about RetroArch at all, otherwise I'd be doing this myself. But I started nagging the VICE devs about it; their C64 emulator badly needs it (some C64 games are virtually unplayable with the current 60-100ms lag). Might follow my own advice and implement it myself, eventually...

@LibretroAdmin
Copy link
Contributor

This bounty is solely for a RetroArch implementation.

We also regret that nobody has picked this up yet. We have tried funding it with money; clearly that is not enough. It has to come from the heart, from someone passionate enough and capable of doing it.

@mdrejhon
Copy link

mdrejhon commented Nov 4, 2022

Yes. WinUAE has led the way, having already implemented this.

Someone needs to add retro_set_raster_poll placeholders (see #10758).
Then this task becomes much simpler.

As a reminder to all -- this technique is truly the only way to organically get original-machine latency in an emulator, universally (native-machine / FPGA-league latency originality). VSYNC OFF frameslice beam racing is the closest you can get to raster-plotting directly to the front buffer, one row at a time, in real time, in sync with the real-world raster.

Same latency as original machine, to the error margin of 1-2 frameslices (subrefresh segments). Some of the faster GPUs can now exceed 10,000 frameslices per second.

We are rapidly approaching an era where we may be able to do full fine granularity NTSC scanrate too! (1-pixel tall VSYNC OFF frameslices -- e.g. each pixel row is its separate tearline)

@johnnovak
Copy link

Yes. WinUAE has led the way, having already implemented this.

Someone needs to add retro_set_raster_poll placeholders (see #10758). Then this task becomes much simpler.

Talked to the VICE people today about it. They're considering it, but some large scale refactorings will come first, which might take years.

@LibretroAdmin
Copy link
Contributor

I'd like to at least start implementing some of the auxiliary things which would be needed to get the whole thing going.

Thankfully blurbusters provided a lot of documentation and I feel like it should be possible to maybe break up all that has to be done into chunks. If we get some of these chunks done, even without a working implementation the entire thing might not seem so daunting to do.

@hizzlekizzle
Copy link
Contributor

As I've mentioned elsewhere, I believe one of the major hurdles for RetroArch/libretro in implementing this is that we typically work in full-frame chunks. That is, the core runs long enough to generate a frame's worth of audio and video, then passes it to the frontend. For this, we'll need to pass along much smaller chunks and sync them more often.

I suspect the cores that already use libco to hop between the core and libretro threads are probably going to be the lowest-hanging fruit. IIRC someone (maybe RealNC?) tinkered with this awhile back unsuccessfully, but I don't recall what exactly fell short.

@mdrejhon
Copy link

mdrejhon commented Nov 4, 2022

As I've mentioned elsewhere, I believe one of the major hurdles for RetroArch/libretro in implementing this is that we typically work in full-frame chunks. That is, the core runs long enough to generate a frame's worth of audio and video, then passes it to the frontend. For this, we'll need to pass along much smaller chunks and sync them more often.

I suspect the cores that already use libco to hop between the core and libretro threads are probably going to be the lowest-hanging fruit. IIRC someone (maybe RealNC?) tinkered with this awhile back unsuccessfully, but I don't recall what exactly fell short.

That's exactly what the retro_set_raster_poll is designed to do. Please look at #10758. I've already addressed this.

Several emulators (e.g. NES) already render line-based.

We simply need to add callbacks there, and it will be highly configurable for the future, with any one or more of the following:

  • Optional frameslice beam racing (very easy, already done by WinUAE)
  • Optional line based beam racing
  • Optional long-term future CRT simulators (e.g. shaders that simulate a CRT electron beam in real time, e.g. using 8 refresh cycles on a 500Hz monitor to simulate one 60Hz CRT refresh cycle in a rolling-scan style BFI with simulated phosphor fadebehind) -- [Feature Request] (BFIv3) Emulate a CRT Electron Beam Via Rolling-Scan BFI #10757

I actually already spent dozens of hours researching RetroArch's source code. It's simpler than you think. The first step is adding the raster scan line callback to the existing RetroArch callback APIs -- header it out to template it in, even if no module is "activated" yet.

Then it is a simple matter of activating one module at a time (on modules that already render line-based)

The flow is

  1. Add the line-based callback placeholders, according to instructions at [Feature Request] Add retro_set_raster_poll API Proposal (To Incubate Beam Raced Outputs) #10758 which does precisely what you just described.
    It's merely a modified header file, plus dummy empty functions added to all the modules. That's it.

  2. Add VSYNC OFF frameslice beamracing (any graphics API capable of tearlines can do it).

  3. Then implement it on ONE module (one that already renders line-based, like the NES module).

Step 1 is easier than you think, if you have ANY raster interrupt experience at all. Step 2 simply needs to gain some

@johnnovak
Copy link

johnnovak commented Nov 5, 2022

☝🏻 I'm 99% sure the answer is similarly simple with VICE. The problem there is more the infrastructure side of things; now it's tightly coupled with GTK and it uses some vsync mechanism provided by GTK (well, the GTK3 version, at least; the SDL one would be easier to hack I assume).

Raster interrupts are common on the C64, so it's either already rendering by rasterline internally, or it would be trivial to add (haven't read the source yet).

People are vastly overestimating the difficulty of implementing this technique, I think... Okay, maybe in a very generic framework like RA it could be a little bit trickier.

@mdrejhon
Copy link

mdrejhon commented Nov 6, 2022

Here's a TL;DR resummarization of the technique and why it's easier than expected to implement in RetroArch, if done in a staged manner.

Any graphics API capable of VSYNC OFF tearlines, should technically be beamraceable. The key important discovery is:

VSYNC OFF frameslice beam racing discovered to be relatively platform-independent

Since framebuffer tearing (a raster artifact) is a platform-independent artifact, we piggyback on tearing as our beamracing method, and make tearing invisible with a trick, while achieving subrefresh lag.

A Present() followed by a Flush() -- as well as the GL equivalent (glFlush()) -- returns at the raster-exact time a VSYNC OFF tearline occurs.

My tests have shown that frameslice beam racing is pretty much graphics API independent, provided you:

The Four Pre-Requisites

(A) Can get tearing to appear (aka VSYNC OFF)
Most graphics APIs can, OpenGL and DirectX can.

(B) Can flush all queued graphics calls (aka Flush() in DX or glFlush() in OpenGL)
DirectX and OpenGL both have a flush API. Return from flush = raster exact tearline

(C) Has a high precision tick API call such as the RDTSC instruction, a micros() call, or std::chrono::high_resolution_clock(), or an equivalent such as QueryPerformanceCounter()
PC, Mac, Linux, Android do. Even Raspberry Pi does

(D) Has any crude method of estimating raster as an approximate offset between blanking intervals (e.g. as a time offset between blocking VSYNC ON calls)

Exact Scan Line Number Not Necessary (Zero Artifacts!)

You don't even need to know the exact scanline number -- just estimate it as an offset between refresh cycles. You use (C) to timestamp the blanking events from (D), and estimate a raster scanline number from that.

The exact moment Flush() returns is the raster-moment the tearline appears in realtime in the display's scanout (approximately, as a relative time offset between VSYNCs), as per this diagram from the first post:

Precisely Raster-Timing a VSYNC OFF Tearline


(D) can be achieved via several methods: some platforms have an .InVBlank flag, others release a blocking VSYNC ON call, or you can monitor the Windows compositor. One method is to run two concurrent framebuffers -- one visible VSYNC OFF framebuffer (for frameslice beam racing), and one invisible VSYNC ON framebuffer (for monitoring blanking intervals to estimate raster offsets from).

Item (C) allows you to precisely time your Present() + Flush() call (or OpenGL equivalent); one can use an ultra-precise busywait to flush at an exact sub-millisecond moment (sometimes achieving accuracy of less than 10 microseconds).

You can use a timer right up to ~0.5ms-1ms prior, then busywait the rest of the way, to save a bit of resources. Timers aren't accurate enough, but busywaits are super-accurate even on a Raspberry Pi.
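
A minimal hedged sketch of that timing pattern (coarse sleep, then busywait the remainder), using std::chrono as the high-precision clock from (C); names are illustrative:

// Hedged sketch: wait until 'target' with microsecond-level precision.
#include <chrono>
#include <thread>

static void precise_wait_until(std::chrono::steady_clock::time_point target)
{
    using namespace std::chrono;
    auto coarse = target - milliseconds(1);
    if (steady_clock::now() < coarse)
        std::this_thread::sleep_until(coarse);   // cheap, but only ~1ms accurate
    while (steady_clock::now() < target)
        ;                                        // busywait the final stretch
    // ...then immediately Present() + Flush() to place the tearline where estimated
}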

Exact-Scanline-Number APIs are Optional

Some platforms even let you query an exact current-scanline number (e.g. D3DKMTGetScanLine() ...) which you can use, but you should design cross-platform genericness as a fallback for platforms that don't have a raster scan line number poll.

For computing an estimated raster scan line -- Windows, Mac, Linux, Android already have some indirect methods of doing (D), so that's the only platform-specific part; but tearlines are platform-independent, making cross-platform frameslice beam racing possible via OpenGL.

Estimating methods are perfectly fine, thanks to the jitter-hiding technique described.
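
As a concrete illustration of combining (C) and (D), a raster estimate can be derived purely from dejittered VSYNC timestamps; a hedged sketch (total_scanlines would include the VBI lines, and all names are illustrative):

// Hedged sketch: estimate the real display's current scanline as a time offset
// between VSYNC timestamps -- no raster-register API required.
#include <chrono>

static double estimate_real_raster(std::chrono::steady_clock::time_point last_vsync,
                                   double refresh_period_sec,  // e.g. 1.0 / 59.94
                                   int total_scanlines)        // visible + VBI lines
{
    using namespace std::chrono;
    double elapsed = duration<double>(steady_clock::now() - last_vsync).count();
    double phase   = elapsed / refresh_period_sec;   // position within refresh cycle
    phase -= (long)phase;                            // wrap, in case a VSYNC was missed
    return phase * total_scanlines;                  // approximate current raster line
}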

Jitter Safety Margin That Keeps Artifacts 100% Invisible

Obviously, computer performance will jitter. This is solved by frameslicing ahead of the real world raster, within the error margin to hide raster-jitter in the VSYNC OFF tearline.


As long as the performance jitter stays within the temporal duration of one frameslice (e.g. 4 frameslices per 60Hz refresh cycle = ~4ms safety margin), no artifacts appear.

  • High-bandwidth discrete PC CPU/GPUs can easily achieve a sub-0.5ms error margin (2000 frameslices/sec and up), which is virtually FPGA-league, original-machine latency identicalness. The RTX 4090 can exceed 10,000 frameslices/sec and may be able to match the NTSC scanrate of ~15,000 frameslices/sec (imagine... scanline-level latency is within reach, given sufficiently RTOS-league priority for RetroArch)
  • Integrated Intel can do about 600 frameslices/sec in my experience. 2016-era Intel Iris-league laptop GPU, give or take. That gives you as little as 1/600sec latency.
  • Raspberry Pi could probably achieve at least 240 frameslices/sec (~4ms-8ms latency, still sub-refresh latency), creating the world's lowest-lag Arduino emulator without needing the performance overhead of RunAhead (great tech too, but still performance demanding).

Essentially at 4 frame slices, you're simply Present()+Flush() (or OpenGL equivalent, glFlush() ...) four times per 60Hz refresh cycle.
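
Putting those pieces together, a per-refresh loop for the 4-frameslice example could be sketched roughly as follows; it assumes a VSYNC OFF swapchain is already set up, and reuses the hypothetical estimate_real_raster() helper sketched earlier (all other names are illustrative too):

// Hedged sketch: 4 frameslices per 60Hz refresh, keeping the emulator roughly
// one frameslice ahead of the real-world raster (the jitter safety margin).
// emu_height, last_vsync, refresh_period, total_lines: hypothetical state.
const int slices       = 4;
const int slice_height = emu_height / slices;

for (int slice = 0; slice < slices; slice++)
{
    // Surge-emulate the next slice's worth of scanlines (driven via
    // retro_set_raster_poll callbacks in practice).
    emulate_scanlines(slice_height);

    // Wait until the real raster is inside the *previous* slice, so the tearline
    // lands in already-presented (identical) pixels and stays invisible, while the
    // freshly emulated slice is in place before the real raster reaches it.
    double present_at_line = (double)(slice - 1) * slice_height;
    while (estimate_real_raster(last_vsync, refresh_period, total_lines) < present_at_line)
        ;   // busywait (or sleep coarsely first, as in the earlier sketch)

    present_and_flush();   // VSYNC OFF Present() + Flush(): raster-exact tearline
}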

Display Scaling Independent

Scaling-wise, a 60Hz digital refresh cycle (even 4K HDMI/DisplayPort) scans top to bottom at the GPU output level (left to right, top to bottom digital pixel delivery) at roughly 1:1 physical surface-area sync relative to a VGA output (one 60Hz refresh) / NTSC signal (one 60Hz field), ever since the 1940s NTSC through 2020s DisplayPort, within less than a ~1ms error margin. Same for 50Hz PAL, with monitor set to 50Hz.

So this is scaling-independent for a non-QFT ordinary 50Hz or 60Hz HDMI / DisplayPort.

You can output 1080p, and pretend the 540th scanline (middle of screen) roughly maps to the raster of the 120th scanline of a 240p signal, or 240th scanline of a 480i signal. So you can still emuraster-realraster sync, within the error margin quite easily.
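
In code, that mapping is just a proportional scale; a trivial hedged sketch:

// Hedged sketch: map an emulator scanline to the corresponding output scanline.
// e.g. emu_height = 240, out_height = 1080: emulator line 120 -> output line 540.
static int emu_to_output_line(int emu_line, int emu_height, int out_height)
{
    return emu_line * out_height / emu_height;
}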

Different-Hz Compensation / QFT / VRR Signal Compensation

Yes, it works: WinUAE successfully does this already.

(E.g. beam racing every other refresh cycle of a 120Hz output signal)

It becomes slightly trickier with QFT or VRR signals, or different-Hz signals, but the key is you can fast-beamrace random refresh cycles, and idle the emulator module until the targeted refresh cycle, then beamrace that specific refresh cycle. So you can do 120Hz, 180Hz, 240Hz.

You simply fast-beamrace every 2nd refresh cycle (surge-execute 1/60sec of emulator time in 1/120sec bursts, by knowing the refresh rate, knowing the time interval between refresh cycles is 1/120sec, and fast-syncing emuraster with realraster at 2x realtime). To keep the emulator running in "realtime", you only emuraster-realraster beamrace specific output refresh cycles.
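
A rough hedged sketch of that 120Hz case (surge-execute during the beam-raced refresh cycle, idle/repeat during the in-between one); all helper names are illustrative:

// Hedged sketch: beam race every 2nd refresh cycle of a 120Hz output (60Hz emulator).
static bool race_this_cycle = true;

void on_real_refresh_cycle()     // hypothetical: called once per 120Hz refresh cycle
{
    if (race_this_cycle)
    {
        // Surge-execute one full 1/60sec emulator refresh within this 1/120sec
        // scanout, frameslice beam racing it against the real raster at 2x speed.
        beam_race_one_emulator_refresh(/* scanout_speed_factor = */ 2.0);
    }
    else
    {
        // Repeat the completed frame for the in-between refresh cycle,
        // so the emulator still averages exactly its original 60fps.
        present_repeated_frame();
    }
    race_this_cycle = !race_this_cycle;
}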

Many Specific RetroArch Modules Already Rendering One Scanline At A Time

Several RetroArch modules already are rendering line-based (Most of the pre-GPU-era modules do -- including many MAME modules, the NES module, the SNES module, etc)

For GPU-era modules (e.g. Nintendo 64), just keep their own retro_set_raster_poll intentionally blank. You only need to worry about implementing it in the modules that already (repeat: I use the word "ALREADY") render line-based. Those are the easiest to beam race to a real raster.

Questions?

I'm happy to share hundreds of hours of due diligence of my helping the WinUAE author (as well as Calamity's GroovyMAME prototype, as well as Tom Harte's CLK, which also implements variants of this algorithm).

The WinUAE implementation is the most mature, being almost any-Hz, any-VRR, any-QFT compatible, as long as display scanout direction is same as emulator scanout direction.

@mdrejhon
Copy link

mdrejhon commented Nov 6, 2022

Even Shorter TL;DR Education For Anybody Remotely Familiar With Raster Interrupts

VSYNC OFF tearlines are just simply rasters, no matter what platform. This is the KEY to cross-platform beam racing.

The key "TL;DR" educational image is that VSYNC OFF tearlines are almost raster exact at the exit of the flush API call (RTDSC timestamped).


You're simply presenting your framebuffer repeatedly (at precise subrefresh time intervals) while the EXISTING emulator module is already rasterplotting lines to its own framebuffer.

As long as you Present()+Flush() AHEAD of the guesstimated real-world raster of the GPU output jack, the tearline stays INVISIBLE! No artifacts. Even with performance jitter. And, don't worry about the display type LCD/CRT/OLED, that's irrelevant to implementing this algorithm.

You want 1/180sec input lag relative to original machine?
And 1/180sec jitter safety margin (zero artifacts)?
...Then use 3 frameslices per NTSC Hz (180 frameslices per second).

You want 1/1000sec input lag relative to original machine?
And 1/1000sec jitter safety margin (zero artifacts)?
...Then use 16 frameslices per NTSC Hz (1000 frameslices per second)

Etc.

  • Current Raspberry Pi can present about 200-400+ frameslices per second.
  • Intel GPU from 2018 can present about 600-1000 frameslices per second.
  • GTX 1080 can Present() between 6000-8000+ frameslices per second.
  • RTX 4090 can Present() over 15,000+ frameslices per second.

Tested in a C# loop, a slower language than C++

What is the Input Lag?

Input lag is always subrefresh (less than 1/60sec lagged relative to original machine!!) as long as you have at least 2 frameslices per refresh cycle (120 Present()+Flush() per second).

Lag was measured to be between 1 and 2 frameslice durations relative to the original machine, when tested on a VGA output jack (e.g. GTX 760). Measurements with digital outputs are similar, though the digital transceivers impart a slight tape-delay lag (of a few scanlines, easily hidden in the raster-jitter safety margin).

Remember, a 2020s DisplayPort still has a horizontal scan rate. It raster-outputs one pixel row at a time, at metered intervals, just like a 1920s Baird/Farnsworth TV signal! Raster workflow has been unchanged for a century, as 2D serialization to a 1D signal.

That's why tearlines still exist even on DisplayPort -- tearlines are simply raster interruptions at the GPU port.

Most platforms in the last 10 years have been found to beam race accurately if not in battery saver mode (turn that off, btw). Frameslice count can be either dynamic or preset (e.g. configurable option, like WinUAE).

Note: Performance is indeed higher (more frameslices per second) without Flush() but raster jitter increases due to asynchronous GPU pipelining behaviours. You can use Flush() by default, but add a hidden toggle to enable/disable the Flush(), or as a multiple-choice selection "High Accuracy Lagless VSYNC" versus "Fast Lagless VSYNC".

Now you're an Einstein.

Scroll back up and re-read my bigger walls of text with this newfound knowledge.

@johnnovak
Copy link

johnnovak commented Nov 6, 2022

💯 🥇 👍🏻 🚀 Thanks for this @mdrejhon! Yeah, I always thought it's not such a difficult-to-grasp technique; kind of baffled why people think it's some voodoo black magic or hard to implement... I'd like to add that in WinUAE there's a shortcut key (forgot what) that lets you visualise the jitter/error; pretty cool and illustrative. I use a frameslice of 4 in general, that's pretty reliable, and the jitter varies a lot, but it doesn't matter, as explained above!

Yes, it works with scaling and HLSL/shaders/fuzzy scanlines, as it always has in WinUAE. It does slow things down, and requires optimizations to speed up again (but this can be solved as a separate optimization). Any distortions (e.g. curves, or line fuzz) can be hidden via the jitter-margin-height technique, to be 100% artifactless.

For the record, I'm using a frameslice of 4 with quite heavy CRT shaders oversampled to 4x vertical resolution at 1080p; works like a charm! (vertical oversampling helps a lot with scanline uniformity & getting rid of vertical moire-patterns)

@johnnovak
Copy link

Btw, at this point might be easier to implement it yourself and show people how it's done 😎

@mdrejhon
Copy link

mdrejhon commented Nov 6, 2022

in WinUAE there's a shortcut key (forgot what) that lets you visualise the jitter/error; pretty cool and illustrative.

Yes, it's a rather neat feature.
The GroovyMAME lagless VSYNC experiment does it slightly differently, but the concept of debugging is similar.

One method RetroArch can use for debugging is changing the color-tint of the previous already-rendered emulator framebuffer, but not the new framesliceful of scanlines being rasterplotted by the emulator module.

(even separate individual color tints for each frameslice)

This makes tearing visible again (as a tint-difference tearing artifact), to watch the realtime raster jitter of VSYNC OFF tearlines for debugging purposes.

It's fun to uncloak the raster jitter (vibrating VSYNC OFF tearlines) during debugging to see how the safety margin is performing.
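
A hedged sketch of that debug visualization (hypothetical helpers): tint everything except the newest frameslice, so the normally-invisible tearlines become visible for inspection:

// Hedged sketch: make the racing tearlines visible for debugging by color-tinting
// all previously presented content differently from the newest frameslice.
void present_frameslice_debug(int slice, int slice_height)
{
    if (debug_visualize_beamracing)                        // hypothetical toggle
        tint_region(/* top    = */ 0,
                    /* bottom = */ slice * slice_height,
                    tint_for_slice(slice));                // old content gets a tint

    // The newest slice (slice*slice_height .. (slice+1)*slice_height) stays untinted,
    // so the tearline shows up as a visible tint boundary that jitters in realtime.
    present_and_flush();
}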

Btw, at this point might be easier to implement it yourself and show people how it's done 😎

In theory I could do it, but it might not be till 2025 that I might be convinced to do so. I have too many paid projects to focus on to put food on the table -- and some are even superficially related to beam racing techniques (but for far more specialized purposes than emulators)

Monitor processing electronics are often beamraced internally -- a rolling window of the incoming pixel rows from the video cable is buffered before being scanned out rapidly (in a subrefresh-latency manner). LCDs and OLEDs already scan like a CRT, as seen in high speed videos (www.blurbusters.com/scanout), and high-Hz esports LCDs already do internally beamraced processing in the display scaler nowadays.

Raster has always been a very convenient 2D serialization into 1D for frame delivery purposes, and it persists to this date even in a high-Hz VRR 10-bit HDR DSC digital signal. That still uses raster scanout too!

The thing that would push me (if nobody else) is a 480Hz OLED capable of accurate #10757 -- and this is not yet milked by anyone.

With 480 Hz+, the CRT electron beam can be much more accurately simulated in software as a shader (adding a temporal equivalent of a spatial CRT filter). Instead of VSYNC OFF frameslice beamracing Hz=Hz, you could literally use 8 digital refresh cycles during VSYNC ON to simulate 1 CRT Hz via rolling-scan (software based rolling BFI with a phosphor fadebehind). Then beam race that by relaying raster data from retro_set_raster_poll to the electron beam simulator in real time (whether a line at a time, or chunks at a time).

There'd be only 1/480sec latency between emulator raster and real-world raster (aka full frame refresh cycle containing a rolling bar segment of 1/480sec worth of CRT electron beam simulation), the emulator continuing to run in real time for original pixel-for-pixel latency (same scanout latency, same scanout velocity).

BTW, I'd guesstimate 2025-ish for a 480Hz OLED. Hoping!

Anyway, first things first, keep it simple. If I suddenly have free time, I might do #10758 next year because it's dirt-easy, without even touching my BountySource donation. #10758 is just boring internal prep work as header-file templating with no visible/feature change to software. But I beg someone else to beat me to the punch, so I can keep doing more industry-impacting work first.

I'd rather someone else trailblaze much sooner, at least pave the groundwork.

@mdrejhon
Copy link

mdrejhon commented Jul 13, 2023

New Apache 2.0 Source Code for a VSYNC estimator

  • Suitable for better crossplatform sync between emuHz-vs-realHz (non-beamraced)
  • Suitable for beam racing too (raster estimates as time offset between VSYNCs)

We just released an open source cross-platform VSYNC estimator accurate enough for cross-platform synchronization between emulator Hz and real display Hz in high level languages (even JavaScript). Accurate enough for beam racing applications!

Or simply slowly flywheeling a CPU-calculated emulator refresh rate (via RDTSC) towards more accurately aligning with the real-world refresh cycles, to prevent latency drift. Useful for input delay algorithms too (locking to a VSYNC phase offset).

https://github.com/blurbusters/RefreshRateCalculator

It's the refresh rate estimator engine used by both www.vsynctester.com and www.testufo.com/refreshrate

Here's the README.md:

RefreshRateCalculator CLASS

PURPOSE: Accurate cross-platform display refresh rate estimator / dejittered VSYNC timestamp estimator.

  • Input: Series of frame timestamps during framerate=Hz (Jittery/lossy)
  • Output: Accurate filtered and dejittered floating-point Hz estimate & refresh cycle timestamps.
  • Algorithm: Combination of frame counting, jitter filtering, ignoring missed frames, and averaging.
  1. This is also a way to measure a GPU clock source indirectly, since the GPU generates the refresh rate during fixed Hz.
  2. IMPORTANT VRR NOTE: This algorithm does not generate a GPU clock source when running this on a variable refresh rate display
    (e.g. GSYNC/FreeSync), but can still measure the foreground software application's fixed-framerate operation during
    windowed-VRR-enabled operation, such as desktop compositor (e.g. DWM). This can allow a background application
    to match the frame rate of the desktop compositor or foreground application (e.g. 60fps capped app on VRR display).
    This algorithm currently degrades severely during varying-framerate operation on a VRR display.

LICENSE - Apache-2.0

Copyright 2014-2023 by Jerry Jongerius of DuckWare (https://www.duckware.com) - original code and algorithm
Copyright 2017-2023 by Mark Rejhon of Blur Busters / TestUFO (https://www.testufo.com) - refactoring and improvements

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

*** First publicly released July 2023 under mutual agreement
*** between Rejhon Technologies Inc. (Blur Busters) and Jongerius LLC (DuckWare)
*** PLEASE DO NOT DELETE THIS COPYRIGHT NOTICE

JAVASCRIPT VSYNC API / REFRESH CYCLE TIME STAMPS

CODE PORTING

  • This algorithm is very portable to most languages, on most platforms, via high level and low level graphics frameworks.
  • Generic VSYNC timestamps is usually immediately after exit of almost any frame presentation API during VSYNC ON framerate=Hz
  • APIs for timestamps include RTDSC / QueryPerformanceCounter() / std::chrono::high_resolution_clock::now()
  • APIs for low level frame presentation include DirectX Present(), OpenGL glFinish(), Vulkan vkQueuePresentKHR()
  • APIs for high level frame presentation include XBox/MonoGame Draw(), Unity3D Update(), etc.
  • APIs for zero-graphics timestamps (e.g. independent/separate thread) include Windows D3DKMTWaitForVerticalBlankEvent()
  • While not normally used for beam racing, this algorithm is sufficiently accurate enough for cross-platform raster estimates for beam racing applications, based on a time offset between refresh cycle timestamps! (~1% error vs vertical resolution is possible on modern AMD/NVIDIA GPUs).

SIMPLE CODE EXAMPLE

var hertz = new RefreshRateCalculator();

[...]

  // Call this inside your full frame rate VSYNC ON frame presentation or your VSYNC listener.
  // It will automatically filter-out the jitter and dropped frames.
  // For JavaScript, most accurate timestamp occurs if called at very top of your requestAnimationFrame() callback.

hertz.countCycle(performance.now());

[...]

  // This data becomes accurate after a few seconds

var accurateRefreshRate = hertz.getCurrentFrequency();
var accurateRefreshCycleTimestamp = hertz.getFilteredCycleTimestamp();

  // See code for more good helper functions

@mdrejhon
Copy link

mdrejhon commented Dec 5, 2023

BountySource Requirements Reduction Announcement

December 2023

Did you know Retrotink 4K is a Blur Busters Approved product? I worked with them to add fully adjustable 240Hz BFI to a composite/S-Video/component signal, for output to any 240Hz LCD or OLED. I recommend the new 240Hz OLEDs, since you can reduce 60Hz motion blur by 75% with the 240:60 ratio combined with near-zero GtG. Perfect Sonic the Hedgehog with BFI and CRT filters simultaneously...

Retrotink 4K can do everything TestUFO can, including TestUFO Variable Persistence Demo For 240Hz Monitors and even brighten using a HDR nits booster, brighter than LG firmware TV BFI! So I'm way ahead of RetroArch, in an already released BFI product.

So RetroArch, please catch up!

I want to see open source versions.

Also, crosspost:

I recommend #10758 -- "retro_set_raster_poll" API should be added.

By encouraging emulator modules to relay one rendered scanline at a time to RetroArch, this will improve reliability of RetroArch even without frameslice beamracing, because metering out GPU rendering in tiny time slices (e.g. every 1ms) prevents the GPU from going to sleep between emulator refresh cycles. So this refactoring by pre-requisite #10758 has some MAJOR spinoff benefits, even without frameslice beam racing.

Even emulators that don't beamrace, but execute their frame rendering in realtime (e.g. run in 1ms increments and render directly to the GPU framebuffer without letting GPU power management put the GPU to sleep), also tend to framepace better on 240Hz monitors without needing VRR help, making other things like monolithic BFI more practical.

Obviously, RunAhead is different (not realtime), but it can still benefit; it just runs a big batch of frames continually (fast raster scanout to many frames). And you can disable processing in retro_set_raster_poll (dummy hooks) for most modules, and just iterate enhancements into it later. And when 1000Hz OLEDs come, we can forget frameslice beam racing and simply display 16 incrementally-scanned-out full-screen digital refresh cycles per 60Hz emulator refresh cycle (much easier to implement low-lag beam racing this way, since full-screen refresh cycles are only 1ms). I'm going to see the 480Hz OLED in the invite-only demo room at CES 2024, so ultra-high-Hz displays are among us; one can't buy a desktop 27" OLED with a refresh rate less than 240Hz -- it merely starts there, even for productivity.

I'd like to see retro_set_raster_poll (#10758) implemented.

Bounty Requirements Reduction! ($500)

Special offer: I'll even water-down my existing bounty (#6984) to be paid out only on the mere completion of #10758 if it helps reduce workload. It's such an important germane improvement that enables a lot of spinoff benefits:

Benefits Of Just Merely Programming Only #10758

  1. Even without beam racing the destination display, it allows slow, continuous metering of work to the GPU, preventing power-management stutters
  2. Frameslice beam racing (Add Beam Racing/Scanline Sync to RetroArch (aka Lagless VSYNC) #6984)
  3. Ultra-high-Hz beam racing (480Hz OLEDs easily allow displaying 8 partially-scanned-out emulator frames per 60Hz emulator refresh cycle, and upcoming 1000Hz OLEDs easily allow displaying 16 per 60Hz emulator refresh cycle)
  4. Etc.

cc: @hizzlekizzle

@QmwJlHuSg9pa
Copy link

BountySource Requirements Reduction Announcement

December 2023

Uh, you may want to check this out: bountysource/core#1586

@mdrejhon
Copy link

mdrejhon commented Dec 6, 2023

Oh wow. Appreciate it. I missed that memo. I haven't been paying attention to that. My bounty was long before the PayPal refund window, so I'll just have to swallow the loss.

OK, I offer a $500 code bounty directly from Blur Busters, staked on my reputation.

$500 Bounty -- directly from Blur Busters / TestUFO

BountySource bypass

I just found out BountySource is insolvent, so ignore that. I'll put up the money anyway.

$500 Bounty -- directly from Blur Busters / TestUFO

Inquire within, send to squad [at] blurbusters with subject line "Code Bounty: Retroarch", directly referring to this.
[Important: No spam/phishers. No file attachments from unverified individuals.]

Bounty Conditions

The rule is very simple: Payout will occur via Wise, Interac or PayPal when

  1. RetroArch github leads (@hizzlekizzle et al) closes this issue [Feature Request] Add retro_set_raster_poll API Proposal (To Incubate Beam Raced Outputs) #10758
  2. At least one emulator module is now calling raster API every raster, even if dummy call
  3. I authenticate github UserID who made the pull request for this issue
  4. Expiry date set December 31st, 2024
  5. Questions are welcome here or at above email

@Cwpute
Copy link

Cwpute commented Dec 21, 2023

@blurbusters or some maintainer might want to edit your original post to remove the bountysource link 👀
