Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libobs/graphics, libobs-opengl, libobs-d3d11, docs/sphinx: Introduce effect results #4351

Closed
wants to merge 1 commit into from

Conversation

HoneyHazard
Copy link

@HoneyHazard HoneyHazard commented Mar 16, 2021

Description

  • Introduce results into OBS effect system; they are tied to respective params of the same name
  • Result variables are passed in like other params before draw, but they are also fetched back as results after draw
  • Only one type of result is introduced, which is atomic_uint. This type will be used with atomicCounterIncrement(...) effect statements to allow atomic increments of a counter, which enabled us to build the Pixel Match Switcher plugin.
  • D3D11: change from ps_4_0 to ps_5_0 shader model to support UAV variables
  • OpenGL: change #version 330 to #version 460 to support atomic counters

In-depth Look at the Changes

Click to Expand

Outline of the changes

  1. Introducing the atomic counter results
  2. Effects
  3. Parsing the intermediate shaders
  4. Graphics API
  5. libobs-opengl
  6. libobs-d3d11

Introducing the atomic counter results

  • Atomic counters are special variables that allow safe increments (or decrements) of a counter in the parallel shader environment. The Pixel Match Switcher plugin for OBS uses this feature of the graphics systems to count matching pixel as video data passes though the plugin's filter. Because the counting is done in the shader, the atomic counters work with great performance on any reasonably modern video card (must support ps_4_0 for D3D11 or version 460 for OpenGL)
  • Atomic counters require special treatment by D3D11 and special treatment by OpenGL. They need to reside in specially configured blocks of graphics memory and require highly specialized API to be initialized and used.
  • We have selected atomic_uint keyword to represent atomic counters being introduced into the OBS effect system. This keyword is also how the counters are represented in GLSL. Like other global variables of the effect, it must be preceded by uniform.
  • We have selected atomicCounterIncrement(...) function, also borrowed from GLSL, to perform an atomic increment on a counter.

Effects

Effects provide a cross platform wrappers for GLSL and HLSL shader use by OBS. In the OBS effect system you specify both vertex and pixel shader behavior in the same effect file.

Roughly speaking, the effect initialization is as follows:

  1. Effect file is parsed by the effect-parser module, generating data structures for all building blocks of the shaders it needs to generate.
  2. These building blocks are then used to write strings that represent intermediate versions of the vertex and pixel shaders for the effect.
  3. The intermediate versions of shaders are then parsed by shader-parser module to generate data structures representing the shader functionality.
  4. Finally, the data structures generated by the shader-parser are passed to down to either gl-shaderparser or d3d11-shaderprocessor modules to write actual shaders that will do the effect's work.
Additions to the effect language

We introduce two new syntax elements to support use of the atomic counters in the OBS effect language. The two are borrowed from GLSL, which, arguably, has a more user-friendly syntax for implementing atomic counters than HLSL.

  • atomic_uint is the variable type that will be used for the atomic counters. Like other global variables of the effect, it must be preceded by uniform keyword.
    • Example: uniform atomic_uint myCounter;
  • atomicCounterIncrement(...) statement increments an atomic counter.
    • Example: atomicCounterIncrement(myCounter);
  • Note: unlike actual GLSL, will not require and will not support the layout(...) qualifier in the effect code, but the OpenGL graphics subsystem will have to generate the qualifier in the event it is used.
Effect data structures

The data structures built for each effect are used to dynamically generate intermediate shader code, and provide mapping of high-level abstractions for effect variables to lower-level abstractions of shader variables. While modifying the effect data structures, we mirror existing abstractions for effect parameters and their pathways to the Graphics API, as we introduce new abstractions and new pathways for interfacing with program and shader results.

gs_effect [new members]

This struct provides a high level interface for interfacing an effect parameter.
It has an array of parameter data structures called params of type gs_effect_param. So we also add a new member results, which is an array of gs_effect_result.

gs_effect_result [new struct]

High level interface for working with a result. Members are:

  • name result name
  • type variable type; currently only GS_SHADER_PARAM_ATOMIC_UINT is supported
  • cur_val value that was last retrieved from result.

Analogous to gs_effect_param.

pass_shaderresult [new struct]

Serves as mapping of higher level effect result interface gs_effect_result to a lower-level shader result interface handle gs_sresult_t.

Analogous to pass_shaderparam.

gs_effect_pass [added members]

Has handle pointers for vertex and pixel shaders, and an array for mapping effect params to shader params. These are used later to make passing in of parameter values possible.

So we add program_results, which is an array of pass_shaderresult, to also do the mapping of effect results to shader results, and make fetching of the results possible.

Parsing the effect code

effect-parser module converts the effect code into data structures and uses them to generate intermediate shader code. Input effect string is tokenized by whitespace, and is then processed, token by token, to build data structures representing individual behavior units of the effect. These are then used to generate intermediate versions of vertex and pixel shaders, so they can be passed down to shader-parser.

Here the goal was to introduce support for results and atomic_uint type.

ep_param [new members]

This data structure maintains information about individual effect parser parameters as they are being parsed from the effect code. Since some parameters are now also results, we add a new field to be used with those results:

  • is_result gets set to true for any parameter that is also result. This will later lead to gs_effect_result being instantiated for every param marked with this flag.
ep_param_init(...) [modified function]

Assigns fields to ep_param. Modified to receive and assign the new field is_result.

ep_parse_param(...) [modified function]

Calls ep_param_init(...) with values received from ep_parse_other(...) and makes sure no erroneous symbols follow the param declaration.

Our modifications here are limited to propagating is_result value to ep_param_init(...).

ep_parse_other(...) [modified function]

This function is for parsing anything in the effect code that is not whitespace, struct, technique, or sampler_state. The function has variables for reacting to property, const, and type keywords, and activates functions for parsing functionss and params. Our modifications are for special handling of the atomic_uint results.

  • is_result variable is added, and is set to true whenever atomic_uint type token is encountered.
  • call to ep_parse_param(...) also passes is_result variable to the function
Writing intermediate shaders

After the effect code is parsed into data structures, these data structures are used to generate the intermediate shader code (so it can be processed later by the shader-parser). During this building of the intermediate shader code for a vertex/pixel shader, a mapping is constructed so that higher level effect params can be linked with corresponding lower-level shader params.

The additions here are for constructing analogous mapping between effect results and shader results.

ep_write_param(...) [modified function]

This function receives a pointer to ep_param as input and writes the param's declarations in the intermediate shader code. It also appends a name of every used param to the array used_params.

We teach the function to also append a param name, that is also a result, to the list of results. It will now receive a pointer used_results, which is an array of strings representing names of the results used in the effect. If the ep_param, passed into the function, is marked with the is_result flag, it is appended to used_results.

ep_write_func_param_deps(...), ep_write_func_func_deps(...), ep_write_func(...), ep_makeshaderstring(...) [modified functions]

These functions call each other and other lower level functions as the intermediate shader is being built from the data structures representing it. They maintain used_params - an array of strings representing params that were encountered, to be eventually passed down to ep_write_param(...), where a name is appended to the array for every used param.

They are all modified so that used_results, an array of strings representing names of the results used in the effect, can also make its way to ep_write_param(...) function, where it can be updated for every used result.

ep_compile_result(...) [new function]
  • Takes as input a name for a result and a pointer to an allocated gs_effect_result so its fields can be initialized.
  • Finds a corresponding gs_effect_param of the same name in the effect's array of params, and retains the pointer in gs_effect_result being initialized. Because this is all called (and has to be called) after the params are "compiled", the array of params is stable and a pointer into that array is safe.
  • A string of param/result name is also duplicated into gs_effect_result.
ep_compilepass_shaderresults(...) [new function]

This function is responsible for building mapping between higher level handles of effect results and lower level handles of shader results, and also invokes mapping between effect params and effect results.

For every result name in the input array of strings used_results:

  • Fetches a pointer to gs_effect_result from the effect_parser that corresponds to the result name.
  • Calls ep_compile_result(...) so gs_effect_result receives a pointer to corresponding gs_effect_param.
  • Finds a pointer/handle of type gs_sresult_t from gs_shader_t that correspots to the result name.
  • The mapped pair of result handles, represented by pass_shaderresult, is appended to the pass_results array, which is passed in by pointer.

Analogous to ep_compile_pass_shaderparams(...).

ep_compile_pass_shader [modified function]

This function invokes generation of shader strings for either vertex or pixel shader. While doing so, it also generates mapping of effect parameters to shader parameters, which is used later to make setting of params possible.

We need to provide a mapping of effect results to shader results, very similar to how it is done for params, so the result retrieval works. Mapping between effect results and corresponding effect params will also be indirectly invoked.

  • Add variable used_results, which is an array of strings representing names of results used in the shader. used_results is updated during the calls to ep_makeshaderstring(...).
    • This is analogous to how used_params is worked with.
  • Obtain a pointer variable pass_results of type pass_shaderresult. It is pointing to the data member program_results of ep_pass pointer, and will be modified in-place to retain the results mapping in ep_pass data structure
    • Similar to vertshader_params and pixelshader_params members of ep_pass.
  • Add a call to ep_compilepass_shaderresults(...). used_results and pass_results variables are passed in. pass_results will be updated.
    • Analogous to how ep_compile_pass_shaderparams(...) is called.
    • Also will result in effect results receiping pointers to corresponding effect params.

Interacting with the effects

In a typical scenario the effect user obtains handles to an effect's parameters, and uses those handles to pass the parameters to the effect, so the values can be passed down to the lower level layers and eventually end up in the actual graphics system and shader machinery.

We have to expand the usage to include the results. Similarly to params, the user will be able to lookup result handles by name. The user will then be able to retrieve result values from the result handles.

The result variables are also connected to param variables of the same name. atomic_uint will require a new type of param that is not int, and also requires some special care by either graphics system when using it as a parameter.

gs_effect_get_result_by_name(...) [new function]

Obtains a gs_eresult_t pointer by name. This handle can then be used by gs_effect_get_atomic_uint_result(...) to obtain an effect result.

Analogous to gs_effect_get_param_by_name(...).

gs_effect_set_atomic_uint(...) [new function]

This allows you pass a value to an unsigned integer atomic counter in the shader like you would set any other param. Nothing new here, except the new parameter type that will be used with the atomic_uint variable in the effect code. All lower-level details are in other functions.

Analogous to gs_effect_set_int(...).

gs_effect_get_atomic_uint_result(...) [new function]

Given a gs_eresult_t pointer/handle, retrieves the value of the atomic_uint result after drawing with the effect has finished.

Again, no super low-level stuff here; just some memory copies. Very similar to gs_effect_set_xyz(...) functions (where xyz is a data type) except we get instead of set since the results are coming back after the draw, instead of being passed in before the draw.

effect_setval_inline(...) [modified function]

This is a wrapper for updating effect parameters with new values. In addition to some error-checking, the functions prevents updating uniform data in the shaders when the values given are no different from the previously assigned value. (so, no update should be necessary)

We customize the behavior to always force updates anytime the parameter type is GS_SHADER_PARAM_ATOMIC_UINT. Our atomic counters are both a param and a result, and are treated a little different from other uniforms by the graphics systems. So, we must ensure the value is always passed in to the shaders before every draw, even if we keep sending the same value.

effect_pass_free(...) [modified function]

This is a cleanup function for gs_effect_pass. We modify it to also cleanup the newly introduced program_results array.

effect_free(...) [modified function]

This is a cleanup function for gs_effect. We modify it to also cleanup the newly introduced results array.

Parsing the intermediate shaders

shader-parser processes the intermediate vertex and pixel shader code generated by the effect-parser. Data structures are generated to represent the shader behavior. This allows reformatting the intermediate shader into GLSL or HLSL code, depending on which graphics subsystem is used.

Our changes here are for supporting results, and for assigning a unique index for each atomic counter variable, so either graphics system will be ready to allocate resources for its specialized handling of the counters.

shader_parser [new members]

This structure holds an instance of a cf_parser which is a c-style parser, and has arrays of data structures for params, structs, samplers, and funcs.

We need an incrementing counter to assign unique, increasing index for each atomic counter we encounter in the intermediate shader code. This struct seems to be a fine place to keep the next index to be assigned, so we add atomic_counter_next_index integer field.

shader_var [new members]

This struct represents a variable in a shader code. We want results to be special kind of variables, so we add is_result flag to the fields. We also add atomic_counter_index so that each atomic_uint variable can be uniquely identified in preparation to be handled by either graphics system. Increasing values will be assigned to indices of counters in the order of the counter's declaration in the intermediate shader.

shader_parser_init(...) [modified function]

This initializes members of shader_var. We modify it to also initialize atomic_counter_next_index to 0, so the counter enumeration index will begin with 0.

shader_var_init_param(...) [modified function]

This function initializes a shader_var that was previous allocated, assigning its fields based on function parameters.

New function parameters are added to support results and atomic counters; specifically:

  • is_result set to true when the variable is also a result
  • atomic_counter_next_index is an integer that is passed by pointer. Whenever the variable is of type atomic_uint this integer is copied to the field atomic_counter_index in shader_var, and then the next index to be assigned is incremented.
sp_parse_param(...) [modified function]

Very similarly to ep_parse_param(...), this calls shader_var_init_param(...) to initialize a shader_var and do some error-checking.

We modify the call to shader_var_init_param(...) so the flag is_result is passed down to it, and also pass the pointer to atomic_counter_next_index of gs_shader as the other new argument.

sp_parse_other(...) [modified function]

Very similar to ep_parse_other(...) and is responsible for parsing anything in the intermediate shader code that is not whitespace, struct or sampler_state. We add special handling for variables of type atomic_uint.

  • Local boolean is_result is added, and is set to true when a variable of atomic_uint type is encountered.
  • is_result is also passed down to sp_parse_param(...).

Graphics API

graphics is "an API-independent graphics subsystem wrapper". Many data types are defined. Among other things, it has function pointers that need be assigned to functions specific to OpenGL vs Direct3D11 operation.

Here we integrate some new functionality needed to make results work.

New data types
gs_shader_result [new struct]

This is actually defined by either D3D11 or OpenGL subsystems. It will contain data that either system will need to interact with an actual shader variable associated with the result.
However, the graphics code will pass this data around using the [gs_eresult_t](#gs_eresult_t-new typedef) typedef wrapper.

See OpenGL and D3D11 implementations of gs_shader_result.

gs_eresult_t [new typedef]

This is a typedef for gs_effect_result so a pointer handle to an effect result can be used by modules that don't need knowledge of the graphics internals.

Analogous to gs_eparam_t.

gs_sresult_t [new typedef]

This is a typedef for gs_shader_result so a pointer handle to a shader result can be used by modules that don't need knowledge of the shader internals.

Analogous to gs_sparam_t.

gs_shader_param_type [extended enum]

This enum for shader param data types is extended to include GS_SHADER_PARAM_ATOMIC_UINT.

New function signatures

The following new function signatures will be declared so they can be defined by either OpenGL or D3D11 subsystems. This will allow platform-agnostic effect abstractions to interact with the actual shader machinery of either subsystem.

gs_shader_get_result_by_name(...) [new function signature]
gs_sresult_t *(*gs_shader_get_result_by_name)(gs_shader_t *program, const char *name);

Fetches a pointer/handle gs_sresult_t by name from a shader program.

See OpenGL and D3D11 implementations.

Analogous to gs_shader_get_param_by_name(...).

gs_shader_set_atomic_uint(...) [new function signature]
void (*gs_shader_set_atomic_uint)(gs_sparam_t *param, unsigned int val);

Passes an unsigned integer value to the atomic counter variable in the shader represented by the gs_sresult_t pointer/handle.

Expands on the existing gs_shader_set_xyz(...) function declarations, where xyz is a data type.

See OpenGL and D3D11 implementations.

gs_shader_get_result(...) [new function signature]
void (*gs_shader_get_result)(gs_sresult_t *result, struct darray *dst);

Copies new result data from the shader into the dst, provided a gs_sresult_t pointer/handle.

See OpenGL and D3D11 implementations.

Naming is analogous to gs_shader_set_val(...) declaration.

New and changed functions
download_results(...) [new function]

This uses the array of mapping pairs pass_shaderresult
to transfer data from shader result handles gs_sresult_t (where new data is available after a technique draw is finished) into the associated gs_effect_result, which makes the data available to effect users. It calls gs_shader_get_result(...).

Naming is analogous to upload_parameters(...).

gs_technique_end_pass(...) [modified function]

This function is doing some cleanup after a technique draw has finished, and it was a convenient place for us to insert a call to download_results(...).

libobs-opengl

This module provides implementation of the Graphics API for the OpenGL graphics system.

Roughly speaking, most of the changes fall into the categories of writing low-level code to implement the atomic counters
with features available in OpenGL, and adding the structure and logic necessary to make the results work.

gl-subsystem

gl-subsystem.h/.cpp does many GL-specific implementations of the Graphics API, including GL-specific implementations of the data structure for shader and program params.

We will modify and add new data structures here to add support for results, with some additions being specific to support atomic_uint variables.

gs_shader_param [new members in libobs-opengl implementation]

This is the GL-specific implementation for the shader_param, that has some low-level details for interacting with the params. In platform-agnostic sections of the code the pointers to gs_shader_param are passed around using gs_sparam_t pointer handles.

Our concept of a result implies being linked with a param of the same name, and using atomic_uint variable as a shader param requires some special handling too. So, we choose this struct to contain low-level data necessary for interacting with the atomic counter params AND results, as well as flags that were similarly added to other structures to support results.

New members are:

  • is_result when true indicates this param also has an associated result
  • buffer_id is the ID of the OpenGL Buffer Object that will be used to interact with the atomic counter. This will be received by calling init_atomic_buffer(...)
  • layout_binding in the GL subsystem this represents the index into the indexed target GL_ATOMIC_COUNTER_BUFFER​ that represents graphics data for our counters.
  • layout_offset in the GL subsystem we can have multiple atomic counters share the same layout binding but have different offsets.
gs_shader_result [libobs-opengl implementation]

This will be our implementation of the shader result for the GL subsystem.

Members are:

  • name: name of the result
  • param: pointer to gs_shader_param, which has fields with low-level details for interacting with atomic counter params/results.
  • cur_value data array where retrieved result data will be stored
gs_shader [new members in libobs-opengl implementation]

This is the GL-specific definition for representing an active shader. One of the members is params, which is an array of gs_shader_param.

We add results, which is an array of gs_shader_result.

program_result [new struct]

This just has a pointer gs_shader_result, so a program result can be associated with a shader result.

Analogous to program_param.

gs_program [new members]

This represents an OpenGL shader program. It has gs_shader pointers to a vertex and pixel shader, as well as an array of program_params.

We add results, which is an array of program_results.

gl-shaderparser

After shader-parser has parsed the intermediate shader code into data structures, gl-shaderparser code will generate the final GLSL shader code from these data structures.

Our changes here are for generating code that utilizes the atomic counters feature of OpenGL/GLSL. atomic_uint type is native to GLSL, but we need to generate the layout(...) qualifier block that is required for atomic_uint variable declarations, which we have omitted from the effect language.

Notably, atomicCounterIncrement(...) (or decrement) effect statements, that we aim to support, are borrowed from GLSL, and require no special translation when regenerated from intermediate shader code into GLSL.

gl_write_var [modified function]

This function writes variable declarations into GLSL, taking into account various qualifiers.

To support atomic_uints in GLSL, the declaration must be preceded by a layout(...) qualifier block. So, our declarations for atomic counters will need to look like this:

(layout binding = 1, offset = 0) uniform atomic_uint myCounter;

We modify the function to insert the layout qualifiers for variable declarations that are of type atomic_uint. We will use atomic_counter_index of shader_var as the source for the binding index.

gl_shader_buildstring [modified function]

Dispatches lower-level functions for putting together the final GLSL code.

We change #version 330 preprocessor declaration to #version 460 to support atomic counters.

gl-shader

Defines many functions for activating and interacting with a GLSL shader and an OpenGL shader program.

Our changes are for integrating the flow of data for results, and introducing the low-level code for activation and interaction specific to atomic counters (atomic_uint).

gl_add_result(...) [new function]

Instantiates and stores a new result in the results array of gs_shader. New instance of gs_shader_result gets a copy of the param/result name, so the corresponding param and the result can be linked once the intermediate shader parsing has finished and the array of params is stable.

gl_add_param(...) [modified function]

Initializes a gs_shader_param given a shader_var as input. Textures get some special treatment here. Once initialized, new instance of gs_shader_param is pushed back to the params array of gs_shader.
We modify the function as follows:

  • When param is of type GS_SHADER_PARAM_ATOMIC_UINT its layout_binding gets set to the value of atomic_counter_index of shader_var, and its layout_offset is 0.
  • is_result field of shader_var propagates to the new gs_shader_param.
  • in the event is_result is true, gl_add_result(...) is called to instantiate and store a gs_shader_result corresponding to the param's name.
init_atomic_buffer(...) [new function]

Generates a new OpenGL Buffer Object of type GL_ATOMIC_COUNTER_BUFFER, which will be used for writing to and reading from an atomic counter variable. Buffer ID is retained to be stored in gs_shader_param.

We use glGenBuffers(...), glBindBuffer(...), glBufferData(...), glBindBuffer(...), and glBindBufferBase(...) to get an atomic counter buffer initialized.

gs_shader_set_atomic_uint(...) [libobs-opengl implementation]

Implements gs_shader_set_atomic_uint(...) function signature for the OpenGL subsystem.

Very similar to most of the other gs_shader_set_xyz(...) implementations (where xyz is a data type) that just copy data into cur_value data array of gs_shader_param.

gs_shader_get_result_by_name(...) [libobs-opengl implementation]

Implements gl_shader_get_result_by_name(...) function signature for the OpenGL subsystem.

Analogous to gs_shader_get_param_by_name(...) and just searches for the right result with name field that matches.

gs_shader_get_result(...) [libobs-opengl implementation]

Implements gs_shader_get_result(...) function signature for the OpenGL subsystem.

Just copies data into destination pointer from the cur_value data array of gs_shader_result.

gs_shader_destroy(...) [modified function]

This is a cleanup function for gs_shader and we modify it to also cleanup the results array of gs_shader_result.

program_set_param_data(...) [modified function]

This function works on getting param values/data into uniform variables of an active GLSL shader. For most supported data types this means calling glUniformXyz(...) function, with some specialized work needed for textures.

The atomic counter variables are also special and we cannot use glUniformXyz(...) style functions to set values before draw. When param type is GS_SHADER_PARAM_ATOMIC_UINT we add another specialization to call glBindBuffer(...), glBindBufferBase(...) and glBufferSubData(...) functions so the value can make its way to atomic_uint variable of interest in the shader, before the draw.

assign_program_param(...) [modified function]

This function finds the uniform locations of a given gs_shader_param by calling glGetUniformLocation(...) with the param's name, and then pushes the given shader param to the params array of gs_program.

Because the mechanisms used for initializing and using the atomic counters are different from regular uniforms, we modify the function to skip finding and assigning a uniform location any time a param is of type is GS_SHADER_PARAM_ATOMIC_UINT.

assign_program_shader_results(...) [new function]

This function works on connecting each constructed gs_shader_result with the respective gs_shader_param of the same name. It also constructs a new program_result linked with the shader result, and pushes it to results array of gs_program.

Naming is analogous to assign_program_shader_params(...).

assign_program_results(...) [new function]

This function calls assign_program_shader_results(...) for both vertex and pixel shaders of gs_program.

Analogous to assign_program_params(...).

gs_program_create(...) [modified function]

After GLSL shaders were compiled this function creates a new shader program, attaches the shaders to the program, and links it. After linking it calls assign_program_params(...) and assign_program_attribs(...), so we also add a call to assign_program_results(...).

gs_program_destroy(...) [modified function]

This is a cleanup function for gs_program and we modify it to also destroy the results array containing program_results.

libobs-d3d11

This module provides implementation of the Graphics API for the Direct3D11 graphics system.

Roughly speaking, most of the changes fall into the categories of implementing atomic counters using Direct3D's UAV variables system, and adding the structure and logic necessary to make the results work.

d3d11-subsystem

d3d11-subsystem.h/.cpp does many D3D11-specific implementations implementations of the Graphics API, including D3D11-specific implementations of the data structure for shader and program params.

We will modify and add new data structures here to add support for results, with some additions being specific to implement atomic_uint variables using the UAV system.

gs_shader_param [new members in libobs-d3d11 implementation]

This is the D3D11-specific implementation for the shader_param, that has some low-level details for interacting with the params. In platform-agnostic sections of the code the pointers to gs_shader_param are passed around using gs_sparam_t handle.

Our concept of a result implies being linked with a param of the same name, and using a UAV variable as a shader param requires some special handling too. So, we choose this struct to contain low-level data necessary for interacting with the atomic counter params AND results, as well as flags that were similarly added to other structures to support results.

New members are:

  • is_result when true indicates this param also has an associated result
  • atomicCounterIndex in the D3D11 subsystem will represent index into the buffer of UAV memory of unsigned integers where variable resides. The value will be copied from atomic_counter_index of shader_var.

Situational repurposing of an existing member:

  • The pos member is being used to store the byte address of the variable in the memory chunk used to set const/uniform variables. For UAV counter variables we will reuse the same variable as the byte address into the UAV memory chunks that we send and receive from the shader. Since each counter variable will be 4 bytes, anything that is a counter will have its pos assigned to atomicCounterIndex * 4.
gs_shader_result [libobs-d3d11 implementation]

This will be our implementation of the shader result for the D3D11 subsystem.

Members are:

  • name: name of the result
  • param: pointer to gs_shader_param, which has fields with low-level details for interacting with UAV counter params/results.
  • curValue data array where retrieved result data will be stored
gs_shader [new members in libobs-d3d11 implementation]

This is the D3D11-specific definition for representing an active shader. Among other things, it has params member, which is a vector of gs_shader_param, which partakes in passing the param values to the shader before the draw. So, we add results, which is a vector of gs_shader_result, and will partake in fetching the results after the draw.

The class also holds const data size and descriptor used for initializing the const/uniform buffer. So, we introduce UAV data size and several descriptor variables used to initialize the UAV buffer and its use. The descriptors are initialized in the gs_shader::BuildUavBuffer(...) (called from the vertex and pixel shader constructors) and are reused in gs_vertex/pixel_shader::Rebuild(...).

  • uavBd is a D3D11_BUFFER_DESC structure for describing the UAV buffer. Gets passed down to ID3D11Device::CreateBuffer(...)] to create the UAV buffer in graphics memory.
  • uavTxfrBd is a D3D11_BUFFER_DESC structure for describing the UAV transfer buffer. Gets passed down to ID3D11Device::CreateBuffer(...)] to create a transfer buffer for transferring data back and forth between the system memory and the UAV graphics memory.
    • analogous to bd which is a D3D11_BUFFER_DESC used in initializing the constants buffer that transfers uniforms' data to the graphics memory.
  • uavViewDesc is a D3D11_UNORDERED_ACCESS_VIEW_DESC structure for describing the UAV view. Gets passed down to ID3D11Device::CreateUnorderedAccessView(...) to create the UAV view for sending the counter data into the shader.

Finally, the class holds pointer to a D3D11 interface for the buffer used in setting const/uniform data. So, we add pointers to the two buffers and a UAV view used in setting and receiving UAV data. These are created in the gs_shader::BuildUavBuffer(...) (called from the vertex and pixel shader constructors) and have to be reinitialized in the event of gs_vertex/pixel_shader::Rebuild(...).

  • uavBuffer is a pointer to ID3D11Buffer representing the UAV buffer in graphics memory. Obtained by calling D3D11Device::CreateBuffer(...)] with uavBd as one of parameters.
  • uavTxfrBuffer is a pointer to ID3D11Buffer for sending and receiving the UAV data. Obtained by calling D3D11Device::CreateBuffer(...)] with uavTxfrBd as one of parameters.
    • analogous to constants buffer pointer for setting uniforms
  • uavView is a pointer to ID3D11UnorderedAccessView and is used to sending data to the UAV region in graphics memory. Obtained by calling ID3D11Device::CreateUnorderedAccessView(...) with uavBuffer and uavViewDesc as arguments.
device_draw(...) [modified function in libobs-d3d11 implementation]

This function encapsulates much of a typical draw activity, including loading vertex buffer, updating blend, raster, Z-stencil states, view+proj matrix, invoking UploadParams(...) on shader parameters to vertex and pixel shaders, and finally drawing primitives. It is called from gs_draw(...).

As we are introducing the concept of results that become available after the draw, we add a call to gs_shader::DownloadResults(...) on the vertex and pixel shaders, after the primitive draw has finished.

d3d11-shaderprocessor

After shader-parser has parsed the intermediate shader code into data structures, d3d11-shaderprocessor code will generate the final HLSL shader code from these data structures.

Our changes here are for generating code that parses the added effect syntax for atomic counters, and utilizes the UAV variables of D3D to implement atomic counters. Unfortunately, in HLSL there is no direct analogue to atomic_uint type and no atomicCounterIncrement(...) function, so we translate our intermediate shader code to become other things in HLSL that needs to be generated:

  • RWStructuredBuffer<uint> __uavBuffer : register(u1); is used to declare a UAV memory chunk in the shader that we will use for storing the atomic counter variables. Such a buffer will be added when UAV counters were encountered in the effect code, and the uavBuffer handle of gs_shader will be connected to this buffer when [gs_shader::BuildUavBuffer(...)](gs_shaderBuildUavBuffer-new-function) or gs_vertex/pixel_shader::Rebuild(...)` is called.
  • We cannot assign and access variables inside our __uavBuffer by name, but we can index into it like an array of unsigned 32-bit integers. We will use the atomicCounterIndex member of gs_shader_param as an index into the buffer (which was inherited from atomic_counter_index of shader_var).
  • InterlockedAdd(...) will be used to replace atomicCounterIncrement(...) added to the effect language.
  • So, an effect statement atomicCounterIncrement(varName) will need to be translated into InterlockedAdd(__uavBuffer[0], 1), where 0 happened to be the atomic counter index for varName, and 1 is because "increment" is equivalent to "add 1 and assign";
ShaderProcessor::SeekUntil(...) [new function]

This protected utility function is added for iterating through cf_tokens of ShaderParser until a token matching a string is found. Used by ShaderProcessor::PeekAndSkipAtomicUint(...) and ShaderProcessor::ReplaceAtomicIncrement(...) functions.

ShaderProcessor::SeekWhile(...) [new function]

This protected utility function is added for iterating through cf_tokens of ShaderParser for as long as tokens match a string. Used by ShaderProcessor::PeekAndSkipAtomicUint(...) and ShaderProcessor::ReplaceAtomicIncrement(...) functions.

ShaderProcessor::PeekAndSkipAtomicUint(...) [new function]

This function as added to eat up all tokens that are part of uniform atomic_uint myVar; declarations. As mentioned before, we will be using numeric index into the __uavBuffer of the shader code instead of counter variable names, so all tokens that are part of declarations for atomic_uints will be completely ignored - no output will be produced. Returns true when one such declaration was encountered and swallowed up.

ShaderProcessor::ReplaceAtomicIncrement(...) [new function]

This works on translating the atomicCounterIncrement(...) statement added to the effect language into InterlockedAdd(...) statements of HLSL. In order to translate variable name of intermediate shader code into a numeric index that can be used with __uavBuffer - the params array of shader_parser is scanned for shader_var with the matching name and the atomic_counter_index of that shader variable is used.

ShaderProcessor::BuildString(...) [modified function]

This one obtains tokens from the parser of the intermediate shader code, and replaces keywords of the effect language into things that actually exist in HLSL, so the final HLSL string is constructed. Our additions of the atomic counter syntax are no exception to these needs, and even require some additional handling. Changes are:
-ShaderProcessor::ReplaceAtomicIncrement(...) is called whenever atomicCounterIncrement keyword is encountered, and will navigate tokens to replace the entire increment statement.

  • We also add a call to ShaderProcessor::PeekAndSkipAtomicUint(...), which is called after all other keywords are checked for a need of conversion. If the function returns true - this means atomic_uint declaration was encountered, and all tokens of the declaration statement will be be skipped in the HLSL.
  • As mentioned before, we need to add RWStructuredBuffer<uint> __uavBuffer : register(u1); into output to declare the UAV memory block for the counters, but only for the shaders that have atomic counter variables. So, instead of stringstream output function variable there are now string streams tempOutput and finalOutput. tempOutput is being written to as tokens are being processed, just like output was before. During processing of the tokens, we will learn if the UAV block will be needed or not. And if it is needed, in the finalOutput we will insert RWStructuredBuffer<uint> __uavBuffer : register(u1); statement after static const bool obs_glsl_compile = false but before the rest of the code, which has been constructed in tempOutput. Now finalOutput stream has the final HLSL code to be copied into outputString, which is the function argument that is passed by reference. Phew.
d3d11-shader

Defines much of the functionality for initializing and interacting with an active HLSL shader.

Changes here will be for adding the needed structure and logic for results, initializing all the D3D11 device and context handles needed for interacting with the UAV buffer containing atomic counter variables, and using them.

gs_shader_get_result_by_name(...) [libobs-d3d11 implementation]

Implements gl_shader_get_result_by_name(...) function signature for the GL subsystem.

Analogous to gs_shader_get_param_by_name(...) and just searches for the right result with name field that matches.

gs_shader_set_atomic_uint(...) [libobs-d3d11 implementation]

Implements gs_shader_set_atomic_uint(...) function signature for the D3D11 subsystem.

Copies data from a const void* data pointer into curValue vector of gs_shader_param, resizing the vector when necessary.

gs_shader_get_result(...) [libobs-d3d11 implementation]

Implements gs_shader_get_result(...) function signature for the D3D11 subsystem.

Just copies data into destination pointer from the curValue data array of gs_shader_result.

gs_shader::BuildUavBuffer(...) [new function]

Iterates through the vector of gs_shader_param. For each param that is also a result, uses pos member to determine the largest value of the mapping index, so the required size of the UAV block is known. Once this size, uavSize, is known, and is not zero, initializes the uavBd, uavTxfrBd, and uavViewDesc descriptors, and uses them to create the uavBuffer and uavTxfrBuffer, and uavView of the gs_shader, which are used for sending data to and from the UAV graphics memory.

Analogous to the behavior of gs_shader::BuildConstantBuffer(...), except there is some additional complexity due to UAV use and support for bi-directional data flow.

gs_vertex/pixel_shader::Rebuild(...) [modified functions]

These functions reconstruct a vertex or pixel shader, maintaining the descriptors for const/uniform buffer but reinstantiating the const buffer and making sure all params work in the new instance of the shader.

Similarly to how const data is handled, we maintain the uavBd, uavTxfrBd, and uavViewDesc descriptors, but we create new instances of the uavBuffer and uavTxfrBuffer, and uavView of gs_shader, which are required for sending data to and from the UAV graphics memory.

gs_shader::UpdateParam(...) [modified function]

This function adjusts constData data array, passed by reference, in response to an input gs_shader_param. The pos member of each param dictates where in the const data (uniform) memory chunk the param's data will go. When new or modified param data is due to be inserted into the chunk of const data, a pass-by-reference boolean uploadConst is set to true, to flag that the const data will need to be uploaded to refresh the uniform variable(s).

We mirror this arrangement as we add support for sending UAV memory chunks to the shaders, triggered by the counter params+results that need it:

  • We add uavData data array that is passed into the function by reference. It will be adjusted in response to the input param when it is of type GS_SHADER_PARAM_ATOMIC_UINT. Once again, pos member of each param is used as a mapping index, except this time it's indexing into the UAV data instead of the const/uniform data.
  • We add uploadUav, a boolean passed by reference, that will need to be set to true whenever an input gs_shader_param is of type GS_SHADER_PARAM_ATOMIC_UINT. We always force UAV data refresh anytime a UAV variable is encountered - even if the UAV memory chunk appears unchanged. We want UAV counters to be set to predictable values as we begin drawing with the effect.
gs_shader::UploadParams(...) [modified function]

This function initializes a local variable constData, which is data vector, and then calls gs_shader::UpdateParam(...) on every param of gs_shader until constData contains all the values to be assigned to the uniforms. If new or modified param data was encountered during the updates - the uploadConst flag gets set, and the function uses ID3D11DeviceContext::Map(...) to map the constants buffer, so the constructed data block in constData is copied and ends up in this const/uniform buffer.

We mirror this structure as we introduce UAV counter results. We add uavData data array and uploadUav boolean, and we also process these by gs_shader::UpdateParam(...). If any of the params were atomic counter results, the uploadUav flag is set and:

gs_shader::DownloadResults(...) [new function]

After the draw we need to download the UAV data back:

  • Add a local veriable, data vector resultsData to receive the UAV memory chunk.
  • As we work in the opposite direction, we call ID3D11DeviceContext::CopyResource(...) first to copy data from uavBuffer in the graphics memory to the uavTxfrBuffer in the system memory.
  • Then use ID3D11DeviceContext::Map(...) to copy data from uavTxfrBuffer to the resultsData vector.
  • Finally, copy data from the resultsData vector to individual gs_shader_result, using the pos member of each result as a mapping index.
gs_pixel_shader::gs_pixel_shader(...) [modified constructor]

This a constructor for a class that represents an active pixel shader, and is derived from gs_shader base class. It instantiates and makes calls to d3d11-shaderprocessor, so the previously generated intermediate shader code can be parsed again into useful data structures for interacting with the shader. The final HLSL code is assembled and ID3D11Device::CreatePixelShader(...) is called to create the shader and keep and handle to it inside gs_pixel_shader.

Here, the only modification was changing shader model from ps_4_0 to ps_5_0 in order to support UAV buffer.

Motivation and Context

Pixel Match Switcher is being developed to provide users with ability to switch scenes (or toggle sources and filters) when a template image is being matched in a video source. In order for this to be fast we needed to introduce results into OBS effect system.

How Has This Been Tested?

Tested on Windows 10 and Debian 10.

Verified that existing functionality of OBS is not broken, and that results work when using the Pixel Match Switcher plugin.

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Documentation (a change to documentation pages)

Checklist:

  • My code has been run through clang-format.
  • I have read the contributing document.
  • My code is not on the master branch.
  • The code has been tested.
  • All commit messages are properly formatted and commits squashed where appropriate.
  • I have included updates to all appropriate documentation.

@WizardCM WizardCM added Documentation Enhancement Improvement to existing functionality Request for Comments More feedback & discussion is requested labels Mar 16, 2021
@HoneyHazard
Copy link
Author

I understand there may be a lot going here, and apologize in advance for any heavy-handedness! We really hope to make image template matching easily accessible to users, and I'm hoping to get the ball rolling with this PR. Looking forward to your feedback.

@Fenrirthviti
Copy link
Member

For future reference, something this substantial should likely have been submitted and discussed as an RFC to avoid unnecessary work on your end.

As an additional FYI, we're gearing up for v27 release very soon, so please be patient with us until we have time to do a full review. Thanks!

@HoneyHazard
Copy link
Author

For future reference, something this substantial should likely have been submitted and discussed as an RFC to avoid unnecessary work on your end.

As an additional FYI, we're gearing up for v27 release very soon, so please be patient with us until we have time to do a full review. Thanks!

I was not aware of the RFCs. I understand that the cumulative sum of the changes is substantial and see that an RFC could be more appropriate.

There is a standalone page with the in-depth overview of the changes, which may be easier to read than the expandable section I've pasted into the PR description:
https://github.com/HoneyHazard/obs-studio-atomic-effects/wiki/Overview-of-the-changes-introduced-in-the-atomic-effects-fork-of-obs-studio

Please, do let me know if there is anything I can do to help people review, test and consider these changes.

libobs-d3d11/d3d11-shaderprocessor.cpp Outdated Show resolved Hide resolved
libobs-opengl/gl-shader.c Outdated Show resolved Hide resolved
@@ -741,7 +752,7 @@ static bool gl_shader_buildstring(struct gl_shader_parser *glsp)
return false;
}

dstr_copy(&glsp->gl_string, "#version 330\n\n");
dstr_copy(&glsp->gl_string, "#version 460\n\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No associated changes with our required openGL version? I presume it takes an OpenGL 4.6 context to compile glsl 4.6?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No associated changes with our required openGL version? I presume it takes an OpenGL 4.6 context to compile glsl 4.6?

I know that it works fine when testing on Windows and Linux, and I see no warnings about shader version and context mismatch. Looking more into what is appropriate...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have exact documentation confirming this precisely, but it seems possible for an older GL context to compile a shader with a newer version: https://stackoverflow.com/questions/35354496/why-does-my-version-330-shader-run-on-older-opengl

It could be more appropriate to force the required GL version to be 4.6. It appears a newer version of the GLAD library would be needed, because the one currently in the build tree seems to only have awareness of 4.3.

I think ideally a 4.6 version will be requested for GLSL programs that need results, and an older 3.3 could be used on everything else. This would be nice because users with older hardware can continue using OBS, and will only run into problems when newer features, such as results, are requested by plugins.

I have tried, in the past, to implement atomic counters using an ARB extension to 3.30 instead of upping the version to 4.6. This, unfortunately, has somehow ended up breaking shaders for the existing effects, and so has created more problems than it has fixed.

I really hope that upping the GL version will not be a deal breaker. It could be good to observe the behavior on older hardware that only supports 3.30, but I do not exactly have access to that. The closest I have is a linux machine that runs mesa drivers and they only support version 4.3. I will have to confirm this to be sure, but what I remember happens is this: When running OBS (with our changes) everything that does not require 4.6 works (even when shaders request version 460). But then when features needed for the atomic counters are actually used in the shader the effect will break, and only that effect. Other effects and the rest of OBS would continue working fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macOS deprecated OpenGL long ago and supports 4.10 max. So without a Metal renderer this would've been a non-starter for us.

…effect results.

Introduce results; they are tied to respective params. Result variables
are passed in a lot like other params before draw; they are also fetched
back as results after draw. Only one type of result is introduced, which
is atomic_uint. This type will be used with atomicCounterIncrement()
effect statements to allow atomic increments of a counter.

Change HLSL shader model from ps_4_0 to ps_5_0 to support UAV variables.
Change GLSL shader version from #version 330 to #version 460 to support
atomic counters.
@jpark37
Copy link
Collaborator

jpark37 commented Mar 16, 2021

I suspect the shader version bumps are problematic. The last time this was discussed, we didn't want to lose GPU support, or increase the maintenance burden of more rendering paths.

The way to do this without shader atomics is to render results to a render target and downsample against another render target, ping-ponging until you get down to 1x1. Then copy to a circular buffer of 1x1 staging textures to avoid having to block on the CPU waiting for the GPU to complete, and map for read. It's not going to be as efficient, but it'll be compatible.

@jpark37
Copy link
Collaborator

jpark37 commented Mar 16, 2021

I skimmed the code, and it looks like you're reading data off of the GPU immediately after making the draw call, which is going to stall the graphics thread like I mentioned in the previous comment. It's probably better to do this sort of management external to the graphics abstraction layer rather than introducing a "result" type.

@jpark37
Copy link
Collaborator

jpark37 commented Mar 16, 2021

Actually, the ping-pong version could be faster despite the extra memory traffic and render target switches. Shader atomics are supposed to be slow.

In an ideal world, you'd do the parallel reduction with compute shaders and groupshared memory rather than atomics, but we don't support compute shaders either since we only support Shader Model 4.

Also, don't allow more CPU access on a resource than you need because that tends to have an adverse affect on GPU performance. Maybe not so important if the buffer is tiny, but it's a good rule to follow in general.

@HoneyHazard
Copy link
Author

I skimmed the code, and it looks like you're reading data off of the GPU immediately after making the draw call, which is going to stall the graphics thread like I mentioned in the previous comment.

Allow me to try to defend our implementation of results. Here are some things to consider:

  • No readback ever occurs for effects that do not have results. Only effects that have results will do readback after a draw. Thus, users that do not require atomic counters should never experience any slowdowns due to existence of the results mechanism
  • For individual atomic counter variables, the amount of data to transfer is very small. To match images in a video source with Pixel Match Switcher we only need to read back two counter integers.
  • I have not explicitly profiled our image matching performance, but on a system with 5+ year old hardware I am noticing no observable slowdown, even when running in debug mode.
  • There is no observable slowdown even given the fact that the plugin is not very efficient right now. Even as it is doing a separate draw pass for every image to be matched in the scene, and even with 10 or so images matches in every frame I see no observable slowdown. The draws seems fast, shader counters seem fast, and whatever slowdown is being caused by blocking due to readbacks cannot be observed easily.

It's probably better to do this sort of management external to the graphics abstraction layer rather than introducing a "result" type.

My understanding is that doing it externally would imply dragging a lot of platform-specific stuff outside of the graphics layer. Either that, or it will be slow. We tasked ourselves with image matching that would be fast :)

Also, atomic counter is currently the only result type, but the result system can, in theory, also be extended to do useful work with other types of data that can be read back.

The way to do this without shader atomics is to render results to a render target and downsample against another render target, ping-ponging until you get down to 1x1. Then copy to a circular buffer of 1x1 staging textures to avoid having to block on the CPU waiting for the GPU to complete, and map for read. It's not going to be as efficient, but it'll be compatible.

Actually, the ping-pong version could be faster despite the extra memory traffic and render target switches. Shader atomics are supposed to be slow.

I have looked into the progressive upsampling and matching before, and see great potential in it, especially since it will work well for matching an image anywhere in video and not just specific location.

It is not, however, clear to me about how it can be faster. As you've mentioned, the shader pipeline in OBS is kept simple. There are a lot of details to consider, but it seems that multiple rounds of draw and reading back large chunks of texture would be slower than just reading back a few counter variables. So the only way that progressive upsampling will be faster is in the event that atomics are really really slow, and I am not sure about how much that is the case.

Again, an effort has been made to introduce the results and the atomic counters in ways that do not disrupt the existing project structure too much, and should have no performance effect whatsoever for users that do not invoke the use of results or counters. It is a feature that is available, and we have a proof of concept plugin that shows that results, atomic counters, and a pretty simple matching algorithm can all work in the framework of OBS to do useful work of switching scenes and toggling scene items when template images are matched in video sources.

In the context of image matching, a (slightly lazy) argument against progressive upsampling is that is significantly more complicated that using atomic counters. Whether you can do multiple resolution rounds within a single draw or not, it is complicated. This means more work to implement it and more work to maintain it and keep it from breaking. We already have a proof of concept for image matching that can already so useful work, should be plenty fast on a system with a dedicated GPU, and we want to share it!

And of course, progressive upsampling could one day be combined with atomic counters, to provide the fastest image matching ever, anywhere in the video :)

Our question to the community is whether the concept of results, and atomic counters specifically, can live in the OBS upstream, and thus grant users an easy access to an image matching system we have created. (Even if the plugin is not super polished or super optimized, yet)

Yes, I see that upping the shader version will be a big concern. If it will be a dealbraker, I can experiment more with ways of how everyone but users that need counters will continue using older shaders. So, newer shader versions will be requested only by effects that need counters. And this way, users of older hardware will stray away from our plugin and would continue using OBS unaffected.

@HoneyHazard
Copy link
Author

I suspect the shader version bumps are problematic. The last time this was discussed, we didn't want to lose GPU support, or increase the maintenance burden of more rendering paths.

This change below will make it so newer versions of shaders are only requested when atomic counters were encountered in the effect code. When they were not, #version 330 for GLSL and ps_4_0 for HLSL will continue to be used, so users of older hardware will not suffer. (Except suffering from envy of not having fast image matching :)

HoneyHazard@6868587

I will merge it into this PR, unless somebody objects

@jp9000
Copy link
Member

jp9000 commented Mar 17, 2021

I appreciate the submission, and please don't kill me for this, but this really should have started as an RFC, or at the very least discussion of some sort with us somewhere. I feel like the last time somebody brought up pixel matching was at least two years ago, and even then it wasn't a deep or serious discussion. I want to help everyone, but at the same time we have our own things we have to deal with. This adds a significant amount of code to our graphics backend that we'd have to be responsible for. And make no mistake: when we merge code, we are responsible for that code. We are responsible for making sure it works, we are responsible for making sure there are no bugs, and in this case, we are responsible for adding it to any other graphics API implementations we might add in the future (Metal/Vulkan).

There is no doubt that this PR helps you, but when I look at a something like this, I have to ask myself: why would I want to merge it? We're going to have to deal with Metal/Vulkan support in the future, and that means we'd have to add it for those too. This is harsh for me to say, but I can't find a reasonable benefit for merging this. Users aside, this code exists solely for your plugin. This would be code we would be reviewing, testing, maintaining, and adding to other graphics implementations solely for the sake of your plugin. That's not to say that I don't care about you or your plugin, but what I am saying is that it's asking a lot. When you're in the maintainer position and you get these sort of PRs all the time, it's absolutely asking a lot.

Compute shaders and SM5 sound more like something I'd be interested in, but even in that case I'd want to measure what sort of compatibility hit it would cause to devices. I think compute/SM5 is worthy of discussion though, because it would have a much wider variety of uses.

I understand that a lot of people want to try to make OBS a platform, but I feel like as the maintainer, I have a responsibility to say no to some things in order to keep the project healthy and focused. I understand that you want to implement this for your plugin, and it's possible you might have a lot riding on this pull request, but this is something where you need to work with us beforehand to make sure that we can fit it into our scope and priorities. I understand the desire to be defensive. I know that it's frustrating considering you spent a lot of time writing this. I'm sorry. Please try to see it from our perspective.

Besides, even if I didn't close it, introducing something that changes the backend this much would be unlikely to be reviewed within any reasonable timeframe. You have to understand that it's very difficult for us to be motivated to review and test code that only affects a very specific use case like this (arguably niche). When we have so many things to do, it is so incredibly easy to just keep putting lower priority things on the backburner. We have our own priorities and our own goals, and that's why it's so important to get involved with the community beforehand and make sure that we're receptive to it before you start pouring your heart into it.

For now, I just have to say no to this particular implementation. I would rather have a compute/SM5 pipeline or something. But even then, that's something that would need discussion, and I'm not sure if I'd be interested in it any time soon. If you can, I'd recommend using the 1x1 downsample (ping-pong) method for the time being, I know it's annoying but it definitely should be something you can do for the time being.

@jp9000 jp9000 closed this Mar 17, 2021
@HoneyHazard
Copy link
Author

I understand that a lot of people want to try to make OBS a platform, but I feel like as the maintainer, I have a responsibility to say no to some things in order to keep the project healthy and focused. I understand that you want to implement this for your plugin, and it's possible you might have a lot riding on this pull request, but this is something where you need to work with us beforehand to make sure that we can fit it into our scope and priorities. I understand the desire to be defensive. I know that it's frustrating considering you spent a lot of time writing this. I'm sorry. Please try to see it from our perspective.

I completely understand where you are coming from and I apologize for not being proper in approaching developers with something like this. No hard feelings are taken. This PR was made under impression that it was the right thing to do to get folks thinking about the functionality and considering it, and not under any expectation of immediate acceptance. I did not know about RFCs and I can see that it would be more appropriate. I understand the demands of maintaining a complex project, with so many aspects of hardware and software involved.

The fork and the plugin were started in response to what many folks have been asking for, for some time. We know that there is a strong demand for this functionality, and several folks who had the opportunity of playing with the plugin are excited. I sincerely hope we find the right path to sharing our work with the community.

@jpark37
Copy link
Collaborator

jpark37 commented Mar 18, 2021

I measured performance of matching 1920x1080 on an RTX 2080 Ti with this atomic accesses and CPU-GPU sync point, and the results are not good.

GPU draw: about 10 ms, measured with Intel GPA.
CPU Map read stall: about 10 ms, measured with QueryPerformanceCounter.

For comparison, I tricked the shader to skip over the atomic accesses without stripping the UAVs by adding impossible branches:
if (per_pixel_err_thresh > 2000000.0).

GPU draw: 0.046 ms
CPU Map read stall: about 3 ms (this should be 0 if buffered and using previous frame's results)

I haven't used shader atomics before, so I wasn't sure what to expect, but it makes sense that performance would tank if you essentially mutex a counter, and send 2 million GPU threads at it at the same time. I can almost guarantee that render target ping-ponging will not perform worse than this.

It also didn't look like the code uses actual UAV counters, which may perform better than regular UAV buffer storage, but I would be surprised if that would be good enough.

@jeske
Copy link

jeske commented Mar 18, 2021

I want to briefly introduce myself. I'm the catalyst for this plugin. I designed the functionality to meet a need I have, I co-architected it with @HoneyHazard, and I funded him to work on the development.

I just want to clarify the purpose of this PR, which perhaps was mis-communciated by @HoneyHazard.

This code was developed to meet a streaming need I had, and so I paid-for it's development for myself and others streamers like me. That need is to automatically and reliable "hide" or coverup sensitive data from streams. At the time development started, the target was hiding data from EVE Online and Albion Online, both open-world PVP games that are notoriously stream-ghosted. Our pixel-match fork works well to address this use case, and will continue to do so with no action by any OBS devs. We very much appreciate that OBS is open source, and we could build on it. Thank you!

In addition, because I very much believe in open source, our code is also open source. I asked @HoneyHazard to open this Pull Request simply to make OBS Devs aware of the code's existence, should it be something the wider OBS community was interested in now or in the future. And so you are now aware of it, mission accomplished!

If you are interested in a future version of OBS where a plugin like this can be built without modifying OBS itself, whether that be via atomics, or ping-pong, or compute shaders, or whatnot, both @HoneyHazard and myself are more than happy to participate in an RFC discussion, and I'm willing to fund that development as well. If not, that's great too, and we'll keep doing what we're doing, and scratching our own itch for our users. Thanks again for OBS!

@jeske
Copy link

jeske commented Mar 18, 2021

CPU Map read stall: about 3 ms (this should be 0 if buffered and using previous frame's results)

We considered using the previous frame's result, but it did not meet our design goal.. Which is to automatically and reliably make sure that sensitive information does not appear in the stream. (aka to minimize the benefit of stream-ghosting)

While casual viewers might not notice a single frame of sensitive information, savvy users would still have access to that frame of information and could extract it. (and in fact, they could use pixel-matcher to easily do so in real time) Certainly this could be a toggle to use previous frame (less secure, faster) or current frame (more secure, slower) for matching.

I'm interested to know if you have ideas about how to minimize the sync stall while still using the current frame to act on the current frame.

The bigger question for us, however, is what the sync stall affects. If it merely delays when OBS receives a frame, that's not a big concern. If it hangs up the game 3d render target, that is a bigger issue.

@jpark37
Copy link
Collaborator

jpark37 commented Mar 18, 2021

You can use the 1x1 texture as an SRV input to a draw that can screen out the information to cover the one or two frames it will take to cover the gap.

return passes(result.Load(int3(0, 0, 0)) ? black : texture.Load(int3(x, y, 0));

To cull the draw completely, issuing an indirect draw and filling out the instance count as 0 or 1 is an option, but you'd usually write to the indirect argument buffer from a compute shader. It might also be possible to use predication to cull the draw, but I don't have experience with that API.

@HoneyHazard
Copy link
Author

I measured performance of matching 1920x1080 on an RTX 2080 Ti with this atomic accesses and CPU-GPU sync point, and the results are not good.

GPU draw: about 10 ms, measured with Intel GPA.
CPU Map read stall: about 10 ms, measured with QueryPerformanceCounter.

For comparison, I tricked the shader to skip over the atomic accesses without stripping the UAVs by adding impossible branches:
if (per_pixel_err_thresh > 2000000.0).

GPU draw: 0.046 ms
CPU Map read stall: about 3 ms (this should be 0 if buffered and using previous frame's results)

I haven't used shader atomics before, so I wasn't sure what to expect, but it makes sense that performance would tank if you essentially mutex a counter, and send 2 million GPU threads at it at the same time. I can almost guarantee that render target ping-ponging will not perform worse than this.

Were you matching an image that is literally 1920x1080, or what was the size of your image? A typical use case will involve small images, and there will be a lot less blocking when an image mask is used.

@jpark37
Copy link
Collaborator

jpark37 commented Mar 18, 2021

Yes, I was matching the whole image. The time will be shorter at smaller sizes, but I would be surprised if this approach wasn't still magnitudes slower than non-atomic approaches. OBS often needs to live in the leftover scraps of another PC game, so GPU time is at a premium.

As for the CPU, the readback stall can take much longer if a significant chunk of work is still processing before the GPU can get around to the pixel match draw, and not just work from OBS. Three slots (2 frame delay) is generally a safe number for a ring buffer; that's what OBS uses for reading back image data for CPU encoding. You can cover up the sensitive frames using the GPU if the GPU result says it crossed the threshold.

I'm with NVIDIA regarding CPU-GPU sync points, "Not even one."
https://www.slideshare.net/basisspace/avoiding-catastrophic-performance-loss

@HoneyHazard
Copy link
Author

HoneyHazard commented Mar 19, 2021

I'm with NVIDIA regarding CPU-GPU sync points, "Not even one."
https://www.slideshare.net/basisspace/avoiding-catastrophic-performance-loss

Again, not to assert that we absolutely expect OBS to have this feature, but it has been framed as a feature that users could activate when they need it, and never worry about when they don't. There is a plugin that uses the feature. While things are not super optimized yet, the plugin already works. And performance impact with properly designed, practical examples and gaming hardware is nowhere near catastrophic.

People are already requesting extending the functionality of what happens when images are matched. To make a scene switching or scene item toggle decision, or any other custom action in OBS, you have to read information back after a draw to know what happened. You could read back the whole frame, then scan it in CPU, which would be slower and less inefficient, but could be doable without changing OBS. Or, let shaders do the work that is better suited for GPU. Yes, some blocking will occur in the shader due to atomics, but should be nowhere near catastropic for smaller images and is reduced further by masking. (There could be strategies to reduce impact of this blocking, too) And after the draw, you are reading back as little data as possible during the sync point.

I also did not want folks to have an impression that this would be something that is dropped on folks and completely abandoned. Given the opportunity I would be glad to help with whatever maintenance woes may arise due to having results. And again, there were no expectations of this being instantly approved. We wanted to get it on your radar. And it would have been fantastic if (with some work, I'm sure) this could be part of OBS, because this would make it much easier for users to find and use this feature. But we still have ways of sharing it with them.

@Fenrirthviti
Copy link
Member

Hey, just wanted to jump in to avoid any more wheel spinning. The proposal hasn't been rejected outright, we are basically requesting that it be submitted as an RFC so that all the discussion can take place there, and when it's in a state that we are prepared to accept, it can be then submitted as a PR. However, please keep in mind that I am also not guaranteeing that we will accept it. There is enough disagreement on how it should be properly implemented to warrant a proper RFC (which, honestly, you basically already have in the body of this PR).

Next steps to keeping this discussion going would be to submit the RFC.

@ItsCubeTime
Copy link

Yes, I was matching the whole image. The time will be shorter at smaller sizes, but I would be surprised if this approach wasn't still magnitudes slower than non-atomic approaches. OBS often needs to live in the leftover scraps of another PC game, so GPU time is at a premium.

As for the CPU, the readback stall can take much longer if a significant chunk of work is still processing before the GPU can get around to the pixel match draw, and not just work from OBS. Three slots (2 frame delay) is generally a safe number for a ring buffer; that's what OBS uses for reading back image data for CPU encoding. You can cover up the sensitive frames using the GPU if the GPU result says it crossed the threshold.

I'm with NVIDIA regarding CPU-GPU sync points, "Not even one." https://www.slideshare.net/basisspace/avoiding-catastrophic-performance-loss

Another question worth asking here is what GPU you were testing on. Nummber of ms taken will depend entirely on the hardware. "Old" hardware today would be something like a GTX 1060, a 750 Ti or something alike is getting ancient. I also agree with jeske here that its important to differentiate if it stalls other apps that might need to make use of the GPU.

I think being able to automatically mask out sensitive data from a live stream is a necessary feature. I see no reason why you wouldnt throw your hats up in the sky for such a generous PR 🙂Ive had a lot of complaints from people about there Discord messages ending up in live streams, currently OBS doesnt offer any functionality to mask out specific applications for that matter.

@dodgepong
Copy link
Member

Another question worth asking here is what GPU you were testing on.

As he stated earlier, this was tested on a 2080 Ti.

I think being able to automatically mask out sensitive data from a live stream is a necessary feature.

I don't know how feasible it is to accomplish this using image recognition, practically speaking, and discussion of such a feature would be best suited for somewhere else (such as ideas.obsproject.com). However, at least on Windows 10 2004 and later, processes can mark themselves as hidden from display capture if they want. If you want to be able to hide Discord from capture, I recommend to send that feature request to them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Enhancement Improvement to existing functionality Request for Comments More feedback & discussion is requested
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet