
New element processing pipeline #119

Closed
raphlinus opened this issue Nov 1, 2021 · 8 comments
Labels
architecture Changes to the way it's put together

Comments

@raphlinus
Contributor

raphlinus commented Nov 1, 2021

It's time to redo the element processing pipeline. The proposal in this issue fixes the following problems (some, but not all, of the clip-related ones are addressed in #88):

  • Accounting of the clip bbox must be done on the CPU, and it is in global coordinates. This seriously impacts composability of encoded scene subgraphs.
  • The clip bbox is accounted as the union of the enclosed drawing commands, rather than the intersection of clips in the stack.
  • The clip mask is computed at BeginClip but is not needed for alpha compositing until EndClip, requiring more memory bandwidth in fine rasterization.
  • Bounding boxes are not tight in the presence of non-axis-aligned transforms.
  • Element processing is slower than it might be.
  • Adding blends makes many of the above issues worse.

The main theme of this proposal is to divide the scene buffer into streams, rather than having a single flattened encoding of the scene tree. Later streams can reference earlier streams, and use a prefix sum of the count to track the reference. This idea is a continuation of the transform work done by Elias (#62, #71), but takes it farther.

In this proposal, there are 3 streams encoded by the CPU: transform, path, and draw object. From the draw object stream a clip stream is derived, so there is a pipeline of 4 stream processing shaders, each implemented as a single compute dispatch implementing a monoid scan. It should be emphasized that the size of each of these streams is easily determined during encoding; it is used to allocate buffer space and to determine the number of workgroups in each such dispatch. We're already doing a lot of that.

Thus, an encoded scene (or subgraph of a scene) is a tuple of the 3 streams, as well as counts of the number of transform and clip elements. The idea is that it's still straightforward and efficient to stitch together encoded subgraphs, which is done by concatenating each of the streams and adding the counts.
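
As a sketch (hypothetical names, not the actual piet-gpu types), stitching subgraphs reduces to concatenating the streams and adding the counts:

// Hypothetical sketch of an encoded scene fragment; illustrative only.
struct EncodedScene {
    transform_stream: Vec<u8>,
    path_stream: Vec<u8>,
    draw_stream: Vec<u8>,
    n_transforms: u32,
    n_clips: u32,
}

impl EncodedScene {
    /// Stitching a subgraph is just stream concatenation plus adding counts.
    fn append(&mut self, other: &EncodedScene) {
        self.transform_stream.extend_from_slice(&other.transform_stream);
        self.path_stream.extend_from_slice(&other.path_stream);
        self.draw_stream.extend_from_slice(&other.draw_stream);
        self.n_transforms += other.n_transforms;
        self.n_clips += other.n_clips;
    }
}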

Transform stream

The transform stream is very simple, each element is just an affine transform, and the monoid is affine matrix multiplication.

As a potential later refinement, we might encode push and pop elements and use a stack monoid, rather than the current approach of representing a pop by pushing the inverse transform. That would allow degenerate transforms and reduce roundoff error (because of floating point, we might not end up with the identical transform), but it's not clear it's worth the effort at this point.

Path stream

The path stream has the following elements:

  • Transform()
  • SetLineWidth(line width)
  • LineSegment(p0, p1)
  • QuadSegment(p0, p1, p2)
  • CubicSegment(p0, p1, p2, p3)
  • EndPath()

(Alternatively, we might keep SetLineWidth as is done now. I don't have a strong feeling about this, but encoding the line width in the EndPath element seems like it might be slightly more efficient.)

There is one transform element in this stream for each transform in the transform stream. Thus, a prefix sum of the count is a reference to the corresponding transform. In this stream the transform element has no payload.

The shader for this stage does two things: apply the transform to the points, and accumulate a bounding box for each path. The output goes into two vectors, one for path segments, one for paths. Overall this shader is very similar to the current element processing stage; the main difference is that it does apply the transform (as used to be the case before that work moved to coarse path rendering in #71), but it relies on the multiplication of transforms having been done in the previous stage.
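
As a small illustration of the bbox-accumulation half of this (a sketch only; the struct and names are not the actual shader types), the per-path bounding box is a running min/max union that is reset at each EndPath:

// Sketch of per-path bounding box accumulation: min/max union of
// transformed segment points, reset at each EndPath.
#[derive(Clone, Copy)]
struct Bbox { x0: f32, y0: f32, x1: f32, y1: f32 }

impl Bbox {
    // The empty bbox is the identity for union.
    const EMPTY: Bbox = Bbox {
        x0: f32::INFINITY, y0: f32::INFINITY,
        x1: f32::NEG_INFINITY, y1: f32::NEG_INFINITY,
    };

    fn union(self, other: Bbox) -> Bbox {
        Bbox {
            x0: self.x0.min(other.x0),
            y0: self.y0.min(other.y0),
            x1: self.x1.max(other.x1),
            y1: self.y1.max(other.y1),
        }
    }

    fn add_point(self, x: f32, y: f32) -> Bbox {
        self.union(Bbox { x0: x, y0: y, x1: x, y1: y })
    }
}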

Draw object stream

The draw object stream has the following elements:

  • BeginClip()
  • EndClip()
  • FillColor(color)
  • FillGradient(gradient params)
  • FillImage(image params)

The draw object stream processing shader has two outputs. One is a vector with one element per draw object, containing a clip count and a path count (these counts are straightforward prefix sums). The other is a compacted vector containing an element only for BeginClip and EndClip draw objects, each element of which is a tag indicating begin/end and a path index.
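
A sketch of the per-element counts and the monoid combine (names are illustrative; the assumption here, per the proposal, is that every draw object except EndClip consumes one path):

// Sketch of the draw object monoid. The prefix sum of these counts gives
// each draw object its clip index and path index.
enum DrawTag { BeginClip, EndClip, FillColor, FillGradient, FillImage }

#[derive(Clone, Copy, Default)]
struct DrawMonoid {
    clip_ix: u32, // number of BeginClip/EndClip elements so far
    path_ix: u32, // number of paths consumed so far
}

impl DrawMonoid {
    // Per-element contribution (assumption: EndClip carries no path).
    fn new(tag: DrawTag) -> DrawMonoid {
        match tag {
            DrawTag::BeginClip => DrawMonoid { clip_ix: 1, path_ix: 1 },
            DrawTag::EndClip => DrawMonoid { clip_ix: 1, path_ix: 0 },
            _ => DrawMonoid { clip_ix: 0, path_ix: 1 },
        }
    }

    // Monoid combine is componentwise addition.
    fn combine(self, other: DrawMonoid) -> DrawMonoid {
        DrawMonoid {
            clip_ix: self.clip_ix + other.clip_ix,
            path_ix: self.path_ix + other.path_ix,
        }
    }
}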

(An alternative to counting paths is to assume that each draw object is one path, and have empty elements in the path stream when that's not the case - currently just EndClip - but that's less great if and when we allow rectangles or other geometries in addition to Bézier paths)

Clip stream

The clip stream (the second vector produced by the draw object shader above) has the following elements:

  • BeginClip(path_ix)
  • EndClip()

The shader implements the stack monoid computing the intersection of the bboxes. The output contains both a bbox and a reference (index) to the clip path (this is the same as the one on input for BeginClip, and is derived for EndClip). Note that begin and end both have the same bbox and reference the same path.
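
A sequential sketch of the stack semantics this scan implements (bboxes as [x0, y0, x1, y1]; in the shader this is a parallel stack monoid, not a loop):

// Sequential model of the clip stream: intersection of clip bboxes on a
// stack. Begin and end both get the same bbox and the same path index.
enum ClipEl { BeginClip(usize), EndClip }

fn process_clips(clips: &[ClipEl], path_bboxes: &[[f32; 4]]) -> Vec<([f32; 4], usize)> {
    fn intersect(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        [a[0].max(b[0]), a[1].max(b[1]), a[2].min(b[2]), a[3].min(b[3])]
    }
    // Identity for intersection: the unbounded bbox.
    const ALL: [f32; 4] = [f32::NEG_INFINITY, f32::NEG_INFINITY, f32::INFINITY, f32::INFINITY];
    let mut stack: Vec<(usize, [f32; 4])> = Vec::new();
    let mut current = ALL;
    let mut out = Vec::new();
    for el in clips {
        match el {
            ClipEl::BeginClip(path_ix) => {
                stack.push((*path_ix, current));
                current = intersect(current, path_bboxes[*path_ix]);
                out.push((current, *path_ix));
            }
            ClipEl::EndClip => {
                let (path_ix, parent) = stack.pop().expect("unbalanced EndClip");
                // Same bbox and path reference as the matching BeginClip.
                out.push((current, path_ix));
                current = parent;
            }
        }
    }
    out
}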

Changes to rasterization

A major benefit of this proposal is that it makes more and better information about clips available to subsequent stages. Here's how those stages are affected:

Binning

Binning uses the bounding box of an element to decide whether to place it into a bin for coarse rasterization. Currently, for clip operations it uses the bounding box encoded by the CPU. In the new design, it instead uses the bounding box computed by the clip stream processing shader.

Coarse rasterization

For every element in the draw object stream, look up the clip index (output of the draw object stream shader). With that, look up the clip bbox (output of the clip shader). That bbox is intersected with the bbox of the draw object (which, in the case of paths, is now resolved from the output of the path stream processing shader).

To process BeginClip, resolve the clip path, then look up the tile. That is either all-0, all-1, or mixed. If all-0, increment clip_zero_count (similar to clip_zero_depth in the current logic; all drawing ops are quick-rejected when this is nonzero). If all-1, it's basically a no-op. If mixed, encode a BeginClip to the per-tile command list.

To process EndClip, resolve the clip path, then look up the tile. Importantly, this will always match the corresponding BeginClip. If all-0, decrement clip_zero_count. If mixed, encode the clip path tile, then encode EndClip. Note that the clip_one_mask and clip_depth variables and their associated logic can go away; their function is subsumed by the stack monoid in the clip stream processing.
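
A sequential sketch of the BeginClip/EndClip tile handling described above (type and counter names are illustrative, not the actual coarse rasterization code):

// Per-tile clip handling in coarse rasterization, sequential sketch.
enum TileState { AllZero, AllOne, Mixed }
enum Cmd { BeginClip, ClipPathTile, EndClip }

fn begin_clip(tile: TileState, clip_zero_count: &mut u32, cmds: &mut Vec<Cmd>) {
    match tile {
        // Clip excludes the whole tile: quick-reject everything until the matching EndClip.
        TileState::AllZero => *clip_zero_count += 1,
        // Clip covers the whole tile: no effect here.
        TileState::AllOne => {}
        TileState::Mixed => cmds.push(Cmd::BeginClip),
    }
}

fn end_clip(tile: TileState, clip_zero_count: &mut u32, cmds: &mut Vec<Cmd>) {
    match tile {
        TileState::AllZero => *clip_zero_count -= 1,
        TileState::AllOne => {}
        TileState::Mixed => {
            cmds.push(Cmd::ClipPathTile); // encode the clip path tile
            cmds.push(Cmd::EndClip);
        }
    }
}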

Fine rasterization

Changes to fine rasterization are the same as described in #88. Basically, the encoding of the clip path is moved from BeginClip to EndClip, which is where the derived alpha mask values are actually used for composition. This eliminates the need to store the alpha values in the interim. Another small change is that it eliminates the possibility that a fill will be encoded with zero path segments (this could happen before on overflow of the 32 bit clip_one_mask stack window).

Blending

Blending is similar in many ways to clipping, but with some important differences.

A major design decision regarding blending is whether to require CPU encoding of a blend bounding box. Higher level vector representations such as SVG and the COLRv1 proposal do not include them, so they must either be inferred from the contents of the blend group, or assumed to be the drawing area. For font rendering, the latter may be reasonable (the drawing area being considered the glyph bounds, even if many glyphs are being drawn to an atlas).

Many drawing APIs do require bounding boxes for blend groups so the size of any needed temporary buffer can be known before drawing into that buffer. For example, in Direct2D there is a contentBounds field in the parameters of the PushLayer method.

If blends are relatively rare, then it makes sense to require some bounding box on encoding (not necessarily accurate), and optimize them on a per-tile basis in coarse rasterization. If, on pop of a blend group, no drawing operations have been encoded since the corresponding push, delete the corresponding push. This requires a bit of state in the coarse rasterization process, but not dramatically more than we already have.

Another possibility is to infer blend bounding boxes from the encoded scene. This is a sketch of how that might be done.

Run a monoid scan over the draw object stream that implements this pseudocode:

raw_blends = []
bbox = empty
for obj in draw_objects:
    match obj:
        case BeginBlend:
            raw_blends.push(('begin', bbox))
            bbox = empty
        case EndBlend:
            raw_blends.push(('end', bbox))
            bbox = empty
        case _:
            # look up clip_bbox for the object, and obj_bbox
            # (for example, the path bbox if it's a path)
            bbox = union(bbox, intersect(clip_bbox, obj_bbox))

Run a second stack monoid scan over the result of this first one:

accum_bbox = empty
stack = []
blends = [undefined] * len(raw_blends)
for i in 0..len(raw_blends):
    match raw_blends[i]:
        case ('begin', bbox):
            stack.push((i, union(accum_bbox, bbox)))
            accum_bbox = empty
        case ('end', bbox):
            bbox = union(accum_bbox, bbox)
            (begin_ix, accum_bbox) = stack.pop()
            blends[begin_ix] = bbox
            blends[i] = bbox
            accum_bbox = union(accum_bbox, bbox)

The result is to compute the union of (clipped) bounding boxes of all draw objects that occur between each begin/end pair, and to store that bounding box at both the beginning and end.

Note, there are opportunities to fuse these scans, but it's unclear whether they are actually more efficient. Also note that these scans can be skipped altogether when there are no blends in the scene.

SOA vs AOS

Right now, piet-gpu uses an "array of structures" approach for most of its processing. I think in some cases a "structure of arrays" might be more efficient, as some operations might only use some fields.

raphlinus added a commit that referenced this issue Nov 24, 2021
There's a bit of reorganizing as well. Shader stages are made available
from piet-gpu to the test rig, config is now a proper structure
(marshaled with bytemuck).

This commit just has the transform stage, which is a simple monoid scan
of affine transforms.

Progress toward #119
@raphlinus
Contributor Author

So I'm digging into the implementation now. One update is that I will be doing the monoid scans using tree reduction rather than decoupled look-back, for compatibility. A consequence is that there are potentially quite a few more dispatches, which in part leads me to consider ways that number can be reduced.

While it is conceptually easy to do the path stream processing in monoids, following the existing elements shader, it promises to be pretty tedious, and likely slow as well. Thus, I'm considering the following design.

The path stream is not represented as Rust-style enums (as the current scene buffer is, and as piet-gpu-derive strongly encourages). Rather, there is one stream of tag bytes, and one or more streams that contain the payloads. This has a few advantages, one of which is that the scan of just the tags can be much faster. Another is that the payloads can be quite a bit more packed.

The path stream processing is split into two substages. The first is a scan over the tag bytes, computing a prefix sum of counts. I think that's 5 counts: transforms, line widths, path segments, paths, and offsets into the path segment payload stream (the latter would be redundant if path segments had a fixed size encoding, but we want them more packed, so I think it's worth it). The transform stream has been previously computed, and the line width stream can be encoded CPU-side.
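
A sketch of the count tuple this tag scan accumulates (field names are illustrative; the monoid is just componentwise addition):

// Sketch of the tag scan monoid for the first substage. The prefix sum of
// these counts gives each element its transform index, line width index,
// output slots, and offset into the packed path segment payload stream.
#[derive(Clone, Copy, Default)]
struct TagMonoid {
    trans_ix: u32,       // count of Transform tags
    linewidth_ix: u32,   // count of SetLineWidth tags
    pathseg_ix: u32,     // count of path segment tags
    path_ix: u32,        // count of EndPath tags
    pathseg_offset: u32, // byte offset into the path segment payload stream
}

impl TagMonoid {
    fn combine(self, o: TagMonoid) -> TagMonoid {
        TagMonoid {
            trans_ix: self.trans_ix + o.trans_ix,
            linewidth_ix: self.linewidth_ix + o.linewidth_ix,
            pathseg_ix: self.pathseg_ix + o.pathseg_ix,
            path_ix: self.path_ix + o.path_ix,
            pathseg_offset: self.pathseg_offset + o.pathseg_offset,
        }
    }
}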

The second substage is not a full monoid scan, but works with a partition's worth of data, so it can be a single dispatch. Each thread loops over a small, fixed number (n_seq) of path segments, performing the following operations:

  • Load the tag. If it's not a path segment, skip
  • Unpack the path segment from the path segment payload stream
  • Apply the transform (and process line width, if applicable)
  • Write a PathSeg into the path segment output buffer
  • Compute a bounding box

Next, compute the bounding box + reset monoid scan (this is very similar to the existing elements.comp), but only locally within the partition. For paths that live entirely within the partition, the resulting bbox is the correct bbox for the path. Write that bbox into the path output buffer.

For the partial paths at the beginning and end of the partition, the monoid computes partial bboxes. Use atomic min/max to apply the bbox to the path output buffer (which has previously been cleared to the bbox monoid identity).

A bit of further optimization. As written above, the first substage outputs 20 bytes (5 counts, 4 bytes each) for each byte of input. That's not necessarily a showstopper, but it's possible to do better. It could sample the stream, outputting a count tuple only every 4 bytes (so for every u32 of input). It's then quite straightforward to reconstruct the full scan by adding a local scan derived from the tag stream word. This is especially fast (and convenient) if n_seq is 4, but can be made to work with other combinations.
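
A sketch of that reconstruction for one of the five counts (the tag values and the predicate are hypothetical): the sampled scan gives the exclusive prefix at each u32 boundary, and the remaining tag bytes within the word are added locally:

// Reconstructing a full per-byte prefix sum of path segment counts from a
// scan sampled once per u32 tag word.
fn pathseg_ix(sampled_counts: &[u32], tag_words: &[u32], byte_ix: usize) -> u32 {
    fn is_pathseg(tag: u8) -> bool {
        // hypothetical tag values for Line/Quad/Cubic segments
        matches!(tag, 1 | 2 | 3)
    }
    let word_ix = byte_ix / 4;
    // Exclusive prefix at the start of this word, from the sampled scan.
    let mut count = sampled_counts[word_ix];
    let word = tag_words[word_ix];
    // Add the contributions of the lower tag bytes within the same word.
    for i in 0..(byte_ix % 4) {
        let tag = ((word >> (i * 8)) & 0xff) as u8;
        count += is_pathseg(tag) as u32;
    }
    count
}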

Another potential optimization is an optional 16 bit packing for path segments; this would be especially useful for fonts.

@raphlinus
Contributor Author

I didn't address fill mode in the above discussion. This can be done one of several ways; I'm not sure yet which is fastest and cleanest (which might not be the same).

Way 1 is to have special values of line width. For example, we could ensure on CPU encoding that the value is always positive. Then, perhaps, -1.0 could represent non-zero winding fill, and -2.0 could represent even/odd. This has the advantage of requiring no special processing in the tag scan. Other things are kinda nice too, for example, max(linewidth, 0.0) represents the delta to the bounding box to accommodate the stroke width; no if is needed.

Way 2 is to have additional tag bytes for setting the fill mode, and the fill mode is tracked in the monoid in the tag scan. This has the advantage of being a more compact encoding, especially if there is a lot of switching between stroke and fill.

There are various other hybrids, but these seem to be the main approaches. I'm leaning toward 1.
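
A sketch of what way 1 could look like on the decoding side (the -1.0/-2.0 values are the ones suggested above; everything else is illustrative):

// Sketch of way 1: fill mode encoded in the sign of the line width value.
#[derive(Debug, PartialEq)]
enum FillMode { Stroke(f32), NonZero, EvenOdd }

fn decode_linewidth(encoded: f32) -> FillMode {
    if encoded >= 0.0 {
        FillMode::Stroke(encoded)
    } else if encoded == -1.0 {
        FillMode::NonZero
    } else {
        FillMode::EvenOdd
    }
}

// The bbox delta to accommodate the stroke width needs no branch.
fn bbox_delta(encoded: f32) -> f32 {
    encoded.max(0.0)
}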

raphlinus added a commit that referenced this issue Dec 1, 2021
This patch contains the core of the path stream processing, though some
integration bits are missing. The core logic is tested, though
combinations of path types, transforms, and line widths are not (yet).

Progress towards #119
raphlinus added a commit that referenced this issue Dec 3, 2021
This successfully renders the tiger; fills and strokes are supported.
Other parts of the imaging model, not yet.

Progress toward #119
@raphlinus
Contributor Author

The above commits are good progress towards this goal. I should say a little something about how I expect this to land.

In the current state, only filling and stroking work. Re-adding gradients, images, and transforms is fairly straightforward, and I'll probably do that.

Clips are a different matter. It would certainly be possible to get them working again with the current architecture (bbox in absolute units, computed by CPU), but I'm not feeling the motivation, as I plan to do the clip stack, so a bunch of that work would need to be redone. However, clipping is not the top priority - getting glyph rendering to a texture atlas working on Mac is the main goal, and Mac compatibility was a motivation for pushing this through.

I'm imagining some other changes before this can be considered fully landed. I'd like to move elements to a variable size encoding (much like path segments). That's basically the end of the Rust side of piet-gpu-derive/types, and I'd like to delete that as well. Much as I'd like to work at a higher level of abstraction, we need flexibility, and the most productive way forward is to just write GLSL.

There's some scope for discussion here, but basically this is a notice that the architecture will continue to be fluid for a while.

@ec1oud

ec1oud commented Jan 9, 2022

It's interesting that you put SetLineWidth into the path stream. I'd like to see support for interpolating the width along a stroke, as in http://schepers.cc/differentstrokes.html / https://www.w3.org/Graphics/SVG/WG/wiki/Proposals/Variable_width_stroke (plain linear interpolation along the length would be a good start, although the w3c proposal has other complications). I think it would be useful for several things: 1) combining multiple paths into one stream as I think you are doing; 2) ink rendering of course (translating tablet pressure into width); and 3) stroke-based fonts as in http://cajun.cs.nott.ac.uk/compsci/epo/papers/volume6/issue3/klassen.pdf / http://ronaldperry.org/SaffronTechDocs/SSF_2006_SIGRRAPH.pdf

@raphlinus
Contributor Author

SetLineWidth is in the path stream because the line width affects the computed bounding box. There's actually quite a bit more complexity here even to fully achieve the PostScript 1 stroke imaging model - notably, the bounding box is based on a round or bevel join; a miter join can extend quite far from the control polygon, depending on the setting of miter limit. Currently only round joins and caps are supported.

I haven't figured out the whole story yet, but I strongly suspect that fancier strokes will be handled by a separate compute dispatch upstream of the main path processing, as essentially a conversion of the stroke to a fill. Bounding boxes wouldn't be a special problem because they'd be computed simply from the fill. This model is actually good for your proposal of variable stroke width, as the relatively simple PostScript model could be swapped out for a fancier shader.

I should have a separate issue for stroking, but for reference I'll put here that I'd like it to include the wisdom of the polar stroking and Nehab stroking papers.

For the time being, the canonical way to get fancier stroking will be to compute it on the CPU and upload the outline. I think that's reasonable while still iterating on defining what the semantics should be.

@ec1oud

ec1oud commented Jan 9, 2022

Thanks for the links. Nehab's method looks to be more readily available, whereas polar stroking is covered by patent 11164372, so hmm, I guess we can use it a couple of decades from now.

I wrote a naive wgpu compute shader to render variable-width stroked line segments into a full-frame storage texture, but didn't do anything about the joins, so it's not terribly useful yet. (I was kind of surprised that overlapping strokes don't cause any visual problem or crash; but also not surprised that shader-local memory would be much faster than what I'm doing, especially if I'm triggering some sort of mutex; tiling makes a lot of sense.) But maybe my time is better spent getting acquainted with this project of yours.

I agree with your overall vision: let's put the GPU in charge of rendering as much as possible, put the whole SG into graphics memory, and have the freedom to render pixels in any way you like at the same time as you traverse it.

raphlinus added a commit that referenced this issue Feb 18, 2022
This PR reworks the clip implementation. The highlight is that clip bounding box accounting is now done on GPU rather than CPU. The clip mask is also rasterized on EndClip rather than BeginClip, which decreases memory traffic needed for the clip stack.

This is a pretty good working state, but not all cleanup has been applied. An important next step is to remove the CPU clip accounting (it is computed and encoded, but that result is not used). Another step is to remove the Annotated structure entirely.

Fixes #88. Also relevant to #119
@raphlinus raphlinus added the architecture Changes to the way it's put together label Feb 21, 2022
@raphlinus
Contributor Author

I'm going to close this as done, as basically all of the work outlined in this issue has been done. We continue to refine it, but that should be captured in new issues.

@DJMcNab
Collaborator

DJMcNab commented Feb 11, 2023

Closing as mentioned in the above comment.

@DJMcNab DJMcNab closed this as completed Feb 11, 2023