Stroke rework #303

Closed
raphlinus opened this issue Apr 6, 2023 · 8 comments

Comments

@raphlinus
Contributor

We have quite a bit of work planned on redoing strokes to make them richer and more correct, while retaining performance. There's a considerable amount of discussion in the Zulip thread, but I also wanted to snapshot the current state of the plans here.

At a very high level, the plan is to compute the offset curve and flattening in a compute shader, expanding the stroke into a fill, then use the same mechanisms downstream to render the fills. This will replace the current distance-field approach. The latter has some nice properties, but also limitations, including not handling stroke styles other than round-cap/round-join and not handling general (non-angle-preserving) affine transformations.

The following capabilities are in scope for the compute shader:

  • Handling butt/square/round line caps
  • Handling miter/bevel/round line joins
  • Handling arbitrary affine transformations
  • Being correct and robust even for challenging curves (cusps etc)

The following capabilities are, for the moment, out of scope:

  • Dashing in the compute shader
  • Evolutes as described in Nehab 2020

Dashing will be computed during encoding, and will be handled by the inv_arclen method.
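As a rough illustration of the encode-time approach, here is a minimal sketch (not the actual encoder) of dashing a single cubic segment with kurbo's arclength methods; real dashing must also carry the dash phase across segments and handle caps at dash ends. It assumes a non-empty pattern of positive dash lengths.

use kurbo::{CubicBez, ParamCurve, ParamCurveArclen};

// Split one cubic into its "on" dashes. `pattern` alternates on/off lengths.
fn dash_segment(c: CubicBez, pattern: &[f64], accuracy: f64) -> Vec<CubicBez> {
    debug_assert!(pattern.iter().all(|&d| d > 0.0));
    let total = c.arclen(accuracy);
    let mut dashes = Vec::new();
    let (mut start, mut i, mut on) = (0.0, 0, true);
    while start < total {
        let end = (start + pattern[i % pattern.len()]).min(total);
        if on {
            // inv_arclen maps an arclength back to a curve parameter t.
            let t0 = c.inv_arclen(start, accuracy);
            let t1 = c.inv_arclen(end, accuracy);
            dashes.push(c.subsegment(t0..t1));
        }
        start = end;
        i += 1;
        on = !on;
    }
    dashes
}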

Evolutes are part of an extremely rigorous approach to spec compliance, where the marked area is defined as the sweep of a line segment along the source curve, held normal to the curve. Very few renderers implement them correctly, and they come at a complexity and performance cost. A serious attempt will be made to ensure results equivalent to a Minkowski sum of the path and the disc (or ellipse, under affine transform) of the stroking pen when the stroke style is round-cap/round-join, but this is not an absolute guarantee for all adversarial inputs.

The main flattening method will be a variant of the fractional subdivision method currently used, but with Euler spirals as the base curve rather than quadratic Béziers. Offsetting will be done Euler spiral to Euler spiral, using the algorithms in Cleaner parallel curves with Euler spirals. The main motivation for Euler spirals is to make cusp detection and subdivision simple, as this is a tricky recursive subdivision with Bézier-based representations. Another advantage of Euler spirals is that round caps and joins have an exact representation. Cubic-to-cubic offsetting may also bring challenging issues with the numerical stability of the quartic solver; it has only been tested with f64 arithmetic.
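For a flavor of the primitive involved, here is a small sketch (not the production algorithm, which would use a series or Fresnel-integral based evaluation) of evaluating a point on an Euler spiral segment, whose curvature varies linearly in arclength:

fn euler_spiral_point(k0: f64, k1: f64, s: f64, n: usize) -> (f64, f64) {
    // Integrate the unit tangent: theta(s) = k0*s + k1*s^2/2, since
    // curvature k(s) = k0 + k1*s. Midpoint rule for simplicity.
    let ds = s / n as f64;
    let (mut x, mut y) = (0.0, 0.0);
    for i in 0..n {
        let sm = (i as f64 + 0.5) * ds;
        let theta = k0 * sm + 0.5 * k1 * sm * sm;
        x += theta.cos() * ds;
        y += theta.sin() * ds;
    }
    (x, y)
}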

The proposed pipeline is:

  • Pathtag scan, as now. Output is TagMonoid (scan of 4x reduction).
  • Mega stage which decodes path segments and performs offsetting, flattening, and generation of caps and joins. Output is: (a) line soup, (b) bounding boxes written for the intra-workgroup bbox, (c) wg-reduced bbox monoid.
  • Bbox fixup: scans the reduced bbox monoid and fixes up path bboxes.
  • Draw, clip, and binning stages unchanged from now (require path bbox as input).
  • Path segment count: takes line soup as input, does an atomic bump of the segment count per tile, and also computes backdrop deltas.
  • Backdrop prefix sum (same as now, but needs improvement).
  • Coarse: allocates space inline in a segment buffer, based on per-tile segment counts, and writes the offset into Tile structs.
  • Path segment tiling: atomically increments the offset into the segment buffer and writes path segments into that buffer.
  • Fine: similar to now, but reads the segment buffer contiguously rather than pointer-chasing linked lists.

See #259 for the monoid-based bounding box computation, which might land separately before this. "Line soup" refers to an unsorted buffer of path id plus line segment, in f32 render-target coordinates. The tiled segments written by path segment tiling will most likely be f16, with coordinates relative to the tile. Additionally, numerical robustness work in multisampled path rendering will likely lead to replacing y_edge (currently an additional f32) with a test for x0 or x1 = 0. All this (including removal of the linked-list reference) should reduce the bandwidth and memory consumption of the line segments from 24 to 8 bytes per segment.
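To make the intended savings concrete, here is a sketch of the two representations (field names illustrative; the exact layout is not settled, and the f32-to-f16 conversion uses the `half` crate):

struct LineSoup {
    path_ix: u32, // which path this line belongs to
    p0: [f32; 2], // endpoints in render-target coordinates
    p1: [f32; 2],
}

// 8-byte tiled segment: four f16 coordinates relative to the tile origin.
fn pack_segment(p0: [f32; 2], p1: [f32; 2], tile_origin: [f32; 2]) -> [u32; 2] {
    let pack = |x: f32, y: f32| {
        let lo = half::f16::from_f32(x).to_bits() as u32;
        let hi = half::f16::from_f32(y).to_bits() as u32;
        lo | (hi << 16)
    };
    [
        pack(p0[0] - tile_origin[0], p0[1] - tile_origin[1]),
        pack(p1[0] - tile_origin[0], p1[1] - tile_origin[1]),
    ]
}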

Generating the path segments contiguously rather than in a linked list also helps with multisampled path rendering (#270). Depending on the details, that work may involve other dispatches, especially computing partition-wide prefix sums of the pixel counts for each line segment. That's the reason the above pipeline has been revised slightly to output line segments into their own buffer rather than inline in the ptcl. (Having subgroups would probably also solve the performance issue, but again details depend on empirical measurement and tuning).

A potentially major improvement to the backdrop computation is doing a partition-wide prefix sum (oblivious to which path the backdrops belong to), then having a bit of logic in coarse to fix those up when a row crosses partition boundaries. In the non-crossing case, the resolved backdrop is the backdrop of the tile minus the backdrop at the start of the row. The crossing case is similar, but adds the backdrop at the end of the partition.
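A minimal CPU sketch of that scheme for a single tile row, assuming for simplicity that the row starts at a partition boundary:

fn resolve_row(deltas: &[i32], part: usize) -> Vec<i32> {
    // Step 1: partition-oblivious inclusive prefix sum within each partition.
    let mut sums = deltas.to_vec();
    for chunk in sums.chunks_mut(part) {
        for i in 1..chunk.len() {
            chunk[i] += chunk[i - 1];
        }
    }
    // Step 2: fixup (done in coarse). `carry` accumulates the backdrop at
    // the end of each preceding partition; the result for a tile is the sum
    // of all deltas strictly to its left in the row.
    let mut out = vec![0i32; deltas.len()];
    let mut carry = 0;
    for (p, chunk) in sums.chunks(part).enumerate() {
        for i in 0..chunk.len() {
            out[p * part + i] = carry + if i == 0 { 0 } else { chunk[i - 1] };
        }
        carry += *chunk.last().unwrap();
    }
    out
}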

A possibility to explore is that the path segment count phase also outputs partition-wide prefix sums of the counts, for consumption by the path segment tiling phase.

There's a bit more computational geometry to do, notably detecting when the affine transform is not angle preserving (a Singular Value Decomposition is probably the gold-plated version of this, though I think a quick check is possible) and computing an error metric on transformations.
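One possible quick check (a sketch; the exact predicate and tolerance policy are open questions): a 2x2 linear map [[a, b], [c, d]] preserves angles iff it is a rotation or reflection times a uniform scale.

fn is_angle_preserving(a: f64, b: f64, c: f64, d: f64, tol: f64) -> bool {
    // Direct similarities satisfy a = d, b = -c; reflecting ones a = -d, b = c.
    let direct = (a - d).abs().max((b + c).abs());
    let reflect = (a + d).abs().max((b - c).abs());
    direct.min(reflect) <= tol * (a.abs() + b.abs() + c.abs() + d.abs())
}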

Perspective transforms may be in scope, as they are required by CSS. Additionally, rational Beziers may also be in scope, as they're essentially what you get when you apply perspective transforms to Beziers. Care must be taken not to significantly slow the common case though.

The work will probably involve building CPU-side implementations of the main primitives. That will be quite useful in figuring out numerical robustness, and probably for other reasons.

@raphlinus
Contributor Author

I have a potential refinement to the design after having written this issue. The goal is to reuse some of the work from path segment count so it doesn't need to be redone in path segment tiling. That work is atomic counts (with a scattered memory access pattern), plus prefix sums of counts followed by binary search based load balancing, which is potentially expensive.

In this refinement, there is one more buffer, sized exactly the same as the segment buffer, written by path segment count and read by path segment tiling. Each element has three fields:

  • A reference to a line in line soup (u32)
  • An index of the segment within the line (0..number of tiles touched by line)
  • The result of the atomic increment counting the number of segments in a tile

The latter two can reasonably be packed into a u16 each. (Note: an adversarial input might overflow 64k path segments per tile; we should probably be robust in the sense of not crashing, but can accept incorrect rendering for that tile.)
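A sketch of one such element (field names and exact bit packing illustrative):

struct SegmentCount {
    line_ix: u32, // reference into the line soup
    packed: u32,  // low 16 bits: segment index within the line;
                  // high 16 bits: result of the per-tile atomic increment
}

fn pack_counts(index_in_line: u32, count_in_tile: u32) -> u32 {
    // Masking keeps an adversarial >64k count in bounds: rendering for that
    // tile would be wrong, but nothing crashes.
    (index_in_line & 0xffff) | ((count_in_tile & 0xffff) << 16)
}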

Then, path segment tiling is an indirect dispatch, with a thread count equal to the number of segments. Each thread processes exactly one segment (one line/tile crossing). That processing is:

  • Read the line from line soup indexed by that reference.
  • Compute the line segment (clipped to tile boundaries) based on the index of the segment.
  • Use the path index (read from line soup) and the (x, y) tile coordinates (from previous step) to find the slice in the segment buffer (which was allocated by coarse).
  • Write the segment into that slot (add the atomic count result to the base of the slice).

Thus, the memory access patterns are: contiguous reads of the new count struct, gathered reads of line soup and the tile allocation data, completed by a scattered write. There's no shared memory (or inter-thread communication of any kind) or concern about utilization. On the other hand, there's the additional read/write traffic to the count struct compared with the original plan. I expect the refinement to be considerably faster, but of course it's hard to say for sure without experimentation.

@raphlinus
Contributor Author

Another potential optimization (as I'm implementing this): the "segment count" data structure might represent a small number (2 or possibly 4) of tiles. The indices of the segments within the line would be sequential, so they wouldn't need to be stored (though there would need to be accounting for partially filled slices). The atomic increments would need to be stored, at a cost of one u16 each.

Whether this is a win or a loss depends greatly on the distribution of the number of tiles per line segment. It's a loss if most lines touch only one tile, and I suspect that is likely in many cases (especially font rendering). But it is potentially worth considering.
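A sketch of what such a packed element might look like for up to four tiles (field names illustrative):

struct MultiSegmentCount {
    line_ix: u32,     // reference into the line soup
    first_index: u16, // segment index within the line of the first slot;
                      // subsequent slots are sequential, so not stored
    len: u16,         // number of valid slots (partially filled at line ends)
    counts: [u16; 4], // per-tile atomic increment results, one u16 each
}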

@raphlinus
Contributor Author

I'm obviously overthinking the optimization thing. A bit of a brain dump.

You know the count (number of tiles covered) during flattening, almost. The "almost" is that you know when the line is clipped to the viewport, but not when its bounding box is intersected by clipping. That's unlikely to be a significant factor in real life. Also keep in mind that retaining lines to the left of the viewport (i.e., when the y range overlaps but the segment's xmax is less than the viewport's xmin) is important for the coarse winding computation.

When the count is 0, the line should be discarded during flattening, so it has zero cost downstream. This will be important when culling large parts of the drawing area (e.g., zoomed-in SVGs), but not when the application is optimizing while encoding.

The following discussion pretty much only makes sense when n is large (when n is small, the extra dispatches will probably dominate any savings).

It's expected that a significant fraction of line segments will touch only one tile. Segregate the "line soup" structure into two buffers: one for single-tile lines and one for the rest. Flattening will scatter into the two using two atomic bump allocators.

Then there are two copies of the path count and path tiling stages, specialized for count = 1. The single-tile path count stage is considerably simpler, as there's no iteration over the number of tiles, thus no tradeoff between the simple-iteration and load-balanced strategies; you don't get occupancy problems with the former, and you don't have to pay for the partition-wide prefix sum or binary search of the latter. In addition, you don't have to set up the math for conservative line rasterization (at least one divide). The single-tile path count stage also doesn't bump an atomic for coarse winding (backdrop), nor is there any clipping to tile boundaries.

The path tiling stage is also specialized. The single-tile version can avoid some of the line rasterization logic (basically, z is always 0) and also doesn't need to clip to tile boundaries (this requires a bit of care: we can't handle lines that are clipped by the viewport but have one tile inside the clip region). It's basically just the scatter.

However, the real benefit is to enable packing more than one segment into the segment count structure, as described in the previous comment. Then you don't pay any cost for reduced utilization in the 1-tile case.

It's completely unclear how much of a win the above is, or even whether the path tiling will be enough of the total runtime to be worth optimizing so carefully. As usual, the only real thing to be done is implement and measure.

raphlinus added a commit that referenced this issue Oct 4, 2023
Don't use GPU-side signed distance field rendering for strokes, but rather expand them on the CPU (using kurbo) during encoding, then render using fill.

This is a significant performance regression, but is on the critical path for multisampled path rendering and stroke rework (#303).
@raphlinus
Contributor Author

One step in moving stroke expansion to the GPU is encoding the stroke style and providing enough context to do joins and caps. The old signed-distance-field renderer did not need context for joins, as it always modeled each stroked line segment as a stadium. There are two parts: encoding the style (a fragment of what is now the Stroke struct in kurbo), and encoding paths for stroking.

For style encoding, we plan to continue to do dashing CPU-side. One reason is that dashing is invariant to transforms, while efficient stroke expansion is sensitive to the transform to set the tolerance correctly. Thus, what needs to be encoded is a flag to indicate stroke vs. fill, additional flags for the join and cap types, the miter limit, and the line width (which is currently included in the encoding). Additionally, it makes sense for even-odd vs. nonzero to be encoded as a flag (these are currently "magic" negative values of line width). I propose:

struct Stroke {
    // Bit 0: stroke vs fill; further bits: join/cap types and fill rule
    // (exact layout to be determined). Upper 16 bits: miter limit as f16.
    flags_and_miter: u32,
    line_width: f32,
}

And that the miter limit be encoded as f16 within the first word, to be extracted using unpack2x16float. I believe not a great deal of precision is needed for it.
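A sketch of how the encode side might assemble that word (bit assignments illustrative, not a settled format; assumes the `half` crate for the f16 conversion):

fn encode_style_word(stroke: bool, even_odd: bool, join: u32, cap: u32, miter_limit: f32) -> u32 {
    let mut w = 0u32;
    if stroke { w |= 1 << 0; }   // stroke vs fill
    if even_odd { w |= 1 << 1; } // fill rule flag, replacing the magic
                                 // negative line widths
    w |= (join & 0x3) << 2;      // miter / bevel / round
    w |= (cap & 0x3) << 4;       // butt / square / round
    // Miter limit as f16 in the upper half, matching unpack2x16float's
    // second component on the WGSL side.
    w | ((half::f16::from_f32(miter_limit).to_bits() as u32) << 16)
}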

Encoding paths for strokes is nontrivial, as rendering a stroke segment requires knowledge of whether it is the first and/or last segment (to conditionally render caps) and of one adjacent segment (to render joins). Fortunately, I think a relatively simple extension to the existing encoding can handle these cases, at the modest cost of one additional encoded path segment per subpath: conceptually a repetition of the first segment in the subpath, but to save space it is sufficient to encode a lineto with the same tangent vector, as only the tangent is needed to render caps and joins.

This final segment has the SUBPATH_END_BIT set. There is an additional bit (bit 7 is free in the existing encoding) indicating whether the subpath is open or closed. Processing is as follows:

Fills are unchanged (other than minor changes in detecting stroke vs fill and the fill rule). In particular, the planned optimization of eliding the subpath end bit when a subsequent subpath has an implicit moveto is still viable. This optimization is not available for strokes, however.

For strokes, the two parallel curves are rendered when the subpath end bit is false. In addition, the path tag of the following path segment is decoded. If it has the subpath end bit set and the subpath is open, then an end cap is drawn at the end of this path segment. Otherwise, the subsequent path segment is decoded, its start tangent is calculated, and a join is drawn based on the end tangent of this segment and the start tangent of the next.

When the subpath end bit is true and the subpath is open, then a start cap is drawn. If the subpath is closed, nothing is drawn (the segment must still be encoded to provide a join tangent for the previous segment).
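A sketch of the resulting per-segment decision logic (names illustrative; the real decoder works on packed path tags):

struct SegTag {
    subpath_end: bool, // SUBPATH_END_BIT
    open: bool,        // proposed bit 7: subpath is open
}

enum StrokeAction {
    CurvesAndEndCap, // parallel curves, then cap at the segment's end
    CurvesAndJoin,   // parallel curves, then join to the next segment
    StartCapOnly,    // marker segment of an open subpath
    Nothing,         // marker segment of a closed subpath
}

fn stroke_action(this: SegTag, next: SegTag) -> StrokeAction {
    if this.subpath_end {
        // Marker segment: contributes only a start cap (open) or nothing
        // (closed); its tangent was already consumed by the previous join.
        if this.open { StrokeAction::StartCapOnly } else { StrokeAction::Nothing }
    } else if next.subpath_end && next.open {
        StrokeAction::CurvesAndEndCap
    } else {
        StrokeAction::CurvesAndJoin
    }
}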

I believe this handles the needed cases and can be evaluated fully in parallel. As such, I think it represents a middle point between the concepts of 'local' and 'global' in Nehab 2020. By that paper's classification it is technically a 'global' technique, yet each segment can be processed independently and there is no need to maintain the ordering of the output segments (as they become LineSoup after subsequent flattening).

It might be worth testing this with a placeholder algorithm, leveraging the existing flattening logic, to test it working end-to-end, as the anticipated Euler spiral based stroke expansion algorithm depends on some fancy computational geometry.

raphlinus added a commit that referenced this issue Oct 6, 2023
This is part of the larger multisampled path rendering work, under stroke rework (#303). It refactors the GPU pipeline so that the path segments available to fine rasterization are stored as a contiguous slice rather than a linked list as before.

Numerous parts of the pipeline are refactored. In the old pipeline, path segment decoding generated cubic segments and also estimated a (somewhat imprecise) bounding box, and the combination of flattening those cubics and tiling happened in a separate stage (path_coarse) quite a bit later in the pipeline. In the new pipeline, path decoding is fused with flattening, generating a `LineSoup` structure (line segments associated with paths, otherwise unordered) with the bbox as a side effect, and tiling is spread over multiple stages later in the pipeline.

The first tiling stage (path_count) counts the number of tiles that will be generated. Then coarse rasterization allocates contiguous slices based on those counts. The second stage does a scattered write of the resulting tiles. Both of these stages rely on indirect dispatch, as the number of lines and the number of segments (respectively) are not known at encode time.

These changes only make sense for filled paths, so they rely on stroke expansion being done earlier, currently on the CPU.
armansito added a commit that referenced this issue Oct 27, 2023
* Pipelines downstream of flatten (draw_leaf, coarse) now extract the
  fill rule using a "draw flags" field. Flatten still writes this to the
  path bounding-box structure, which gets propagated to the draw info
  list, and eventually translates to the fill rule written to the PTCL.

* draw_leaf no longer deals with transforming the linewidth for strokes.
  This was a leftover from the previous architecture and the logic is no
  longer needed.

* bbox_clear used to contain a duplicate PathBbox data structure. It now
  uses the one from shader/shared/bbox.wgsl

This continues the work outlined in #303
@caxieyou

caxieyou commented Nov 7, 2023

A small question: you moved part of the work from the GPU to the CPU. Is this fast enough?

Sorry, I haven't looked into kurbo carefully, but if there are countless dashed lines with some style, does this process do the clipping beforehand? Or does it have to calculate all of them even if they are not inside the camera or window?

Also, I ran into a problem with line width. We have a very awkward requirement of keeping the line width constant relative to the screen, which means that in the old way I just needed to change the line width once according to the zoom-in/out scale, but now I have to update all the strokes on the CPU. This makes the process a very hot path. -_-!!!!

My first thought is to just move this work to the GPU and do it in parallel there.

My concern is that the CPU side may slow down the whole process.

Any ideas?

Thanks a lot

armansito added a commit that referenced this issue Nov 8, 2023
- Implemented the stroke cap marker segment encoding scheme outlined in
  #303. The proposed scheme has been slightly modified such that instead
  of an open/closed bit, the segment is encoded as a lineto or quadto
  to distinguish between the different states.

  Encoding the marker segment as a quadto for open segments also
  handles a limitation with only using linetos: for an open path, the
  marker must encode both the start tangent and the coordinates of the
  first control point to correctly position a start cap. Encoding these
  as 4 f32s requires an implicit moveto before the lineto which must
  terminate the preceding subpath.

  Encoding this as a quadto preserves the invariant that every stroked
  subpath only contains one segment with the SUBPATH_END_BIT set to 1,
  which now serves as the stroke cap marker. For a closed path, a lineto
  is sufficient since the preceding coordinate (from the lineto encoded
  for a "closepath" verb) must already equal the subpath's first control
  point.

- Shitty stroking continues to work by detecting and ignoring the
  stroke cap marker.
@armansito
Collaborator

armansito commented Nov 9, 2023

@caxieyou

A small question: you moved part of the work from the GPU to the CPU. Is this fast enough?

Following the recent pipeline refactor, all stroke expansion is currently done on the CPU. This is of course very slow if the scene contains a lot of stroked path segments. This is temporary, as the end goal is to do all stroke expansion on the GPU.

There has been steady progress towards building out the GPU stroker (see #408, which is the next step towards this). This is already showing promising performance numbers on the heaviest stroke test scenes (mmark and longpathdash in particular). The short-term plan is to keep dashing on the CPU, convert a dashed stroke into a sequence of stroked path segments, and expand those on the GPU. It's not perfect, but it's also not the end state.

In my local prototype, the longpathdash scene takes about 21ms to render on my M1 Pro (with CPU dashing); with CPU-side stroke expansion it takes about 165ms, so this is already a big speedup.

Sorry, I haven't looked into kurbo carefully, but if there are countless dashed lines with some style, does this process do the clipping beforehand? Or does it have to calculate all of them even if they are not inside the camera or window?

That's an interesting point about clipping. I don't believe the kurbo dash iterator is aware of the viewport or does any type of culling. This could be a good avenue to explore if we end up living in the CPU dashing world long enough.

Also, I ran into a problem with line width. We have a very awkward requirement of keeping the line width constant relative to the screen, which means that in the old way I just needed to change the line width once according to the zoom-in/out scale, but now I have to update all the strokes on the CPU. This makes the process a very hot path. -_-!!!!

Others can correct me but I believe this is generally the case with kurbo shapes and Vello currently doesn't support setting a device-space line width. If you want the stroke width to stay fixed relative to the viewport scale, you need to re-encode the stroke styles.
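A sketch of that workaround (this assumes the app knows its current uniform zoom factor and uses kurbo's Stroke type; it is not a Vello API). Encoding the stroke with the width divided by the zoom keeps the on-screen width constant after the view transform is applied:

use kurbo::Stroke;

// Re-encode the stroke style whenever the zoom changes.
fn screen_constant_stroke(screen_width_px: f64, zoom: f64) -> Stroke {
    Stroke::new(screen_width_px / zoom)
}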

My first thought is to just move this work to the GPU and do it in parallel there.

My concern is that the CPU side may slow down the whole process.

Any ideas?

Thanks a lot

Would you clarify what you're asking? Are you looking for a way to process your stroke encoding in parallel to be consumed by Vello, or are you asking generally?

@caxieyou

caxieyou commented Nov 9, 2023

Thanks so much for your answer; this helps a lot.

Regarding "my first thought is to just move this work to the GPU and do it in parallel there":

I just mean that there are some kinds of work, such as dashing or "filling a rect with some pattern", that I think are too time consuming to do on the CPU. This kind of work should be "filled", "generated", or "split" automatically on the GPU, in parallel.

Thanks

@raphlinus
Contributor Author

I'm going to close this issue, as all of the pieces have now landed in main. No doubt there's more to do, but I think we can consider strokes to be reworked.
