[Feature Request] DrawingAreas should be threadsafe. #56

Closed
ralfbiedert opened this issue Oct 15, 2019 · 12 comments

@ralfbiedert
Contributor

Background

Similar to #55. In addition, it would be nice to make use of multiple cores when rendering real time sub-graphs.

  • What is the feature?

When doing:

let root = BitMapBackend::with_buffer(&mut texture.data, texture.dimension).into_drawing_area();
let (upper, lower) = root.split_vertically(300);

It would be nice if upper and lower could be used from separate threads to render both sub-graphs in parallel, since they should not share any overlapping area.

  • Why do you need the feature?
  • What is the use case?

See above, to improve multi-core utilization when doing software rendering of real time graphs.

Additional Information

Plotters should not do the multithreading itself and should remain lightweight. Instead, I as the API user want to be in charge of which threading library I use, e.g., when rendering sub-graphs.

@ralfbiedert
Contributor Author

ralfbiedert commented Oct 15, 2019

Ideally this feature should work with #54.

Now that I think about it, this might be really hard to implement efficiently, since it would probably require either:

  • splitting a mutable borrow of the backing array between multiple threads (which might work for vertical stacking, but probably not for horizontal splits, as the sub-areas are no longer contiguous in memory)
  • synchronizing access to the backing array, which costs performance.

...

Alright, with both issues in mind, it might be worthwhile to do a performance analysis of whether having multiple owned backing arrays (and copying / blitting when merging) is "on average better" than avoiding multi-threaded access altogether (but saving the allocations & copying).

@38
Member

38 commented Oct 15, 2019

Hi,

That's a really nice suggestion, and I think it would be really good to have.

In particular, I think splitting the buffer based on drawing region has the following issues:

  1. As you mentioned, the memory region may not be contiguous. For example, take a region that covers from (100, 100) to (200, 200): since the image buffer is stored as a 2D array, it's not easy to get a single reference to that region.
  2. The drawing areas may overlap, so there's no thread-safety guarantee.

Also, if we synchronize access per pixel, that will be very expensive.

So what I am thinking is: we certainly need to synchronize the threads that manipulate the image buffer. What we can improve is to use a buffer that caches the modifications for a drawing area; later we acquire the lock once and put all the pixels into the shared buffer in one go.

In fact, I am wondering whether we could have a bitmap element.
Basically, each thread would render a different in-memory image, and afterwards one thread converts all the in-memory images into bitmap elements and pastes them onto the main buffer.

As long as it reduces the number of synchronized operations, it might have OK performance.

Ideally, we would have pixel-granularity locking, but that seems hard to build on top of the image crate.

So any thoughts?

@serzhiio
Contributor

serzhiio commented Oct 15, 2019 via email

@38
Member

38 commented Oct 15, 2019

I believe the Rayon crate has some of the needed functionality.

It really depends on what we want to do. If we just do the bitmap element, which allows background rendering and then pasting to the foreground, then not much needs to be changed.

However, if we want to make a real thread-safety guarantee, there are many things to think about.

@ralfbiedert
Contributor Author

I think how to implement this depends a bit on the overarching values we want to address. In this scope, mine would be, in order of priority:

  • control (I can precisely draw what I want, when and how I want it)
  • lightweight (few dependencies, low compile time, does not interfere with my own dependencies)
  • single-threaded speed (more common use case than multi-threaded)
  • multi-threaded speed

From that angle I would want:

  • generally speed, but not at the expense of being lightweight (e.g., I'd prefer not to have a default Rayon integration, as that might interfere with being lightweight, and even with control, e.g., if there is a version conflict with rayon-core)
  • multi-threaded speed, but not at the expense of making single-threaded rendering significantly slower. I'd be willing to trade a bit, but not very much.

Basically, each thread would render a different in-memory image, and afterwards one thread converts all the in-memory images into bitmap elements and pastes them onto the main buffer.

As long as it reduces the number of synchronized operations, it might have OK performance.

This sounds like an interesting idea to hack on with different image sizes / graphs, to see how much time plotting usually takes today and how long image assembly would take, to get a ballpark figure.

@ralfbiedert
Contributor Author

Alright, I did a small experiment:

  • Rendering the x^2 curve in 640x480 to a BitMapBackend
  • Splitting the canvas into two horizontal subgraphs and rendering the curves individually
  • Simulating blitting a horizontal half-buffer line-by-line into a larger buffer.

With grid

test sine_640_480                                  ... bench:   4,505,700 ns/iter (+/- 200,020)
test sine_640_480_2_horizontal_subgraphs_draw_both ... bench:   5,087,750 ns/iter (+/- 184,736)
test sine_640_480_2_horizontal_subgraphs_draw_one  ... bench:   2,454,890 ns/iter (+/- 85,392)
test sine_640_480_2_horizontal_subgraphs_fake_blit ... bench:     331,725 ns/iter (+/- 6,994)

Thoughts:

  • These are ballpark numbers; esp. the blitting has allocation cost mixed in.
  • Drawing two subgraphs at the given resolution takes 5 ms on one thread. Parallel drawing and blitting would rather be in the 2.4 + 0.3 ms = 2.7 ms range.
  • For graphs as complex as these (I'd guess drawing the background grid is the most expensive part), there would be a clear benefit to parallelizing and then copying.
  • Also, I just realize the name sine_ is wrong, should be called x_square_ :)

Same experiment without the grid

test sine_640_480                                  ... bench:   2,973,837 ns/iter (+/- 134,352)
test sine_640_480_2_horizontal_subgraphs_draw_both ... bench:   2,998,700 ns/iter (+/- 475,909)
test sine_640_480_2_horizontal_subgraphs_draw_one  ... bench:   1,436,110 ns/iter (+/- 69,206)
test sine_640_480_2_horizontal_subgraphs_fake_blit ... bench:     333,900 ns/iter (+/- 29,314)

Without the grid, 320x240

test sine_320_240                                  ... bench:     925,335 ns/iter (+/- 8,680)
test sine_320_240_2_horizontal_subgraphs_draw_both ... bench:   1,063,720 ns/iter (+/- 85,561)
test sine_320_240_2_horizontal_subgraphs_draw_one  ... bench:     904,630 ns/iter (+/- 17,451)
test sine_320_240_2_horizontal_subgraphs_fake_blit ... bench:      36,906 ns/iter (+/- 1,035)

Even then, drawing into separate buffers and blitting once done seems totally worth it. However, it would be nice not to pay the extra cost of blitting + allocation unless I actually want to, using the threading library of my choice.

@38
Member

38 commented Oct 17, 2019

Your benchmark seems very interesting, so I opened a new issue to track that: #58.

Do you know any details about this?

==

And yes, we can actually do something for parallel rendering just for BitMapBackend.

We would probably have an API for BitMapBackend allowing us to split a backend object into a few different ones, each of which can draw independently. (But only vertical splits?)

Thoughts?

@38
Member

38 commented Oct 17, 2019

Hi @ralfbiedert ,

I just implemented the BitMapElement and tried your benchmark quickly; I see a speedup, but lower than estimated. My guess is the thread-synchronization overhead.

test sine_640_480                                      ... bench:   4,069,858 ns/iter (+/- 176,661)
test sine_640_480_2_horizontal_subgraphs_draw_and_blit ... bench:   3,227,221 ns/iter (+/- 536,819)
test sine_640_480_2_horizontal_subgraphs_draw_both     ... bench:   4,561,579 ns/iter (+/- 276,693)
test sine_640_480_2_horizontal_subgraphs_draw_one      ... bench:   2,291,869 ns/iter (+/- 94,406)
test sine_640_480_2_horizontal_subgraphs_fake_blit     ... bench:     260,334 ns/iter (+/- 22,001)
test sine_640_480_2_vertical_subgraphs_draw_and_blit   ... bench:   3,217,974 ns/iter (+/- 924,244)

Check the dev branch if you want to try this.

And this is one of the new benchmarks I added:

#[bench]
fn sine_640_480_2_horizontal_subgraphs_draw_and_blit(b: &mut Bencher) {
    let dim = (640 as u32, 360 as u32);
    let mut buffer = vec![0; (dim.0 * dim.1 * 3) as usize];

    let mut root = BitMapBackend::with_buffer(&mut buffer, dim).into_drawing_area();
    let (mut left, mut right) = root.split_horizontally(320);

    b.iter(|| {
        let plots: Vec<_> = [left.dim_in_pixel(), right.dim_in_pixel()]
            .par_iter()
            .map(|(w,h)| {
                let mut element = plotters::element::BitMapElement::new((0,0), (*w,*h));
                {
                    let left = element.as_bitmap_backend().into_drawing_area();
                    left.fill(&WHITE).unwrap();
                    let mut chart = ChartBuilder::on(&left)
                        .caption("y=x^2", ("Arial", 50).into_font())
                        .margin(5)
                        .x_label_area_size(30)
                        .y_label_area_size(30)
                        .build_ranged(-1f32..1f32, -0.1f32..1f32)
                        .unwrap();

                    chart.configure_mesh().draw().unwrap();

                    chart
                        .draw_series(LineSeries::new(
                            (-50..=50).map(|x| x as f32 / 50.0).map(|x| (x, x * x)),
                            &RED,
                        ))
                        .unwrap()
                        .label("y = x^2")
                        .legend(|(x, y)| Path::new(vec![(x, y), (x + 20, y)], &RED));

                    chart
                        .configure_series_labels()
                        .background_style(&WHITE.mix(0.8))
                        .border_style(&BLACK)
                        .draw()
                        .unwrap();

                }
                element
            })
        .collect();

        left.draw(&plots[0]).unwrap();
        right.draw(&plots[1]).unwrap();

        root.get_base_pixel()
    }); // `black_box` prevents `f` from being optimized away.
}

@38
Member

38 commented Oct 18, 2019

Hello @ralfbiedert ,

After working for a while on this, I just want to let you know some updates:

  1. As I previously mentioned, draw-and-blit is now supported.
  2. Worked on the benchmark for a while and finally made the bitmap backend 8 times faster.
  3. After thinking for a while, I don't think I am going to add in-place parallel rendering to the crate. But it can be done this way: https://github.com/38/plotters/blob/3a17f8b7a3e6d1c2ae1d2c40f584313f491251c6/benches/benches/parallel.rs#L106

For the last update, I think the major problem is that partially borrowing a vector mutably seems to require unsafe code anyway. As a crate maintainer, I think it's important not to leak unsafe behavior into safe Rust code.
The image crate requires the container to be anything that impls DerefMut, so we have to hand it a partial mutable borrow, but the problem is: what if the thread code later moves one part of the buffer out?
Plus, given the limits of the trait bound, a mutable reference is the only thing we can feed into the image crate.

So what I am thinking is like the code in my benchmark:

            let mut buffer = vec![0u8; (W * H * 3) as usize];
            let (upper, lower) = unsafe {
                let upper_addr = &mut buffer[0] as *mut u8;
                let lower_addr = &mut buffer[(W * H * 3 / 2) as usize] as *mut u8;
                (
                    std::slice::from_raw_parts_mut(upper_addr, (W * H * 3 / 2) as usize),
                    std::slice::from_raw_parts_mut(lower_addr, (W * H * 3 / 2) as usize),
                )
            };

            [upper, lower]
                .par_iter_mut()
                .for_each(|b| draw_plot(&BitMapBackend::with_buffer(*b, (W, H / 2)).into_drawing_area(), 2.0));

So would you mind sharing your thoughts on this?

Thanks!

@ralfbiedert
Contributor Author

ralfbiedert commented Oct 20, 2019

Worked on the benchmark for a while and finally made the bitmap backend 8 times faster.

I just measured with my own benchmarks again, and can confirm drawing the graphs went from:

test sine_640_480                                  ... bench:   4,505,700 ns/iter (+/- 200,020)
test sine_640_480_2_horizontal_subgraphs_draw_both ... bench:   5,087,750 ns/iter (+/- 184,736)
test sine_640_480_2_horizontal_subgraphs_draw_one  ... bench:   2,454,890 ns/iter (+/- 85,392)

to

test sine_640_480                                  ... bench:     887,485 ns/iter (+/- 17,703)
test sine_640_480_2_horizontal_subgraphs_draw_both ... bench:   1,513,180 ns/iter (+/- 85,506)
test sine_640_480_2_horizontal_subgraphs_draw_one  ... bench:     773,900 ns/iter (+/- 22,090)

My actual app went from 3% to 1% CPU usage! This is awesome!

After thinking for a while, I don't think I am going to add in-place parallel rendering to the crate.

Although I think an "any layout" parallel story would be nice, I can see the overhead this might bring. Thanks for investigating!

So what I am thinking is like the code in my benchmark:

I submitted a PR; I think this can be simplified and made safe as:

let (upper, lower) = buffer.split_at_mut((W * H * 3 / 2) as usize);

@38
Member

38 commented Oct 20, 2019

Glad you confirmed the performance improvement.

Although I think an "any layout" parallel story would be nice, I can see the overhead this might bring. Thanks for investigating!

Thanks for the suggestion. Your PR just reminded me: I hadn't realized there's a split_at_mut function, which is safe. I also realize I was previously wrong that a mutable reference could be dropped by the worker thread. As long as the guest thread holds just a mutable reference, the worst it can do is swap the contents; it cannot move the contents out and invalidate the buffer. So it should still be safe.

So my argument about introducing unsafe behavior into safe code doesn't apply here.

If this is the case, I am really happy to have the feature.


Update: I worked on this. And if you would like to try, just pull the latest dev branch. Code looks like this:

            let mut back = BitMapBackend::with_buffer(&mut buffer, (W, H));
            back.split(&[H / 2])
                .into_par_iter()
                .for_each(|b| draw_plot(&b.into_drawing_area(), 2.0));

@38
Member

38 commented Oct 24, 2019

Seems we have investigated and implemented the things suggested under this one. Closing the issue for now. Feel free to reopen it if there's anything else. Thanks!

@38 38 closed this as completed Oct 24, 2019