New example: Mandelbrot #58

wimrijnders · 2018-07-18T08:55:19Z

This adds the example program Mandelbrot, at least an initial version.

Structure taken from Rot3d
Scalar kernel only
The scalar kernel is structured in such a way that I hope makes sense for QPU kernels
Outputs a PGM bitmap with the result

Running times:

# Intel:
> obj-debug/bin/Mandelbrot 
0.869301s

# Pi 2:
> obj/bin/Mandelbrot 
13.470295s

The output PGM bitmap looks like this:

This adds the example program Mandelbrot, at least an initial version. - Structure taken from `Rot3d` - Scalar kernel only - The scalar kernel is structured in such a way that I hope makes sense for QPU kernels - Outputs a PGM bitmap with the result

wimrijnders · 2018-07-18T08:55:51Z

I really really couldn't resist. Never mind my actual work or other important tasks!

wimrijnders · 2018-07-18T09:00:55Z

I'm actually quite impressed with the running time of the Pi 2: it's only about 14 times slower than my 3GHz i7. I find that amazing for such a dinky processor.

wimrijnders · 2018-07-18T09:02:12Z

The goal as far as I'm concerned, is to have the VideoCore beat this i7 value. I want to see a $40 computer make mincemeat of my intel laptop.

wimrijnders · 2018-07-18T12:11:12Z

Added first version of a QPU kernel. This works with the emulator, not tested yet with hardware.

I must honestly say that the conversion from scalar to QPU was straightforward, congrats on that. Sincere feedback on my first kernel attempt is appreciated.

Also some code cleanup.

wimrijnders · 2018-07-18T12:26:34Z

I used the following construct on a hunch:

   BoolExpr condition = (radius < 4 && count < numiterations);

    While (any(condition))
      Where (condition)
   ...

I'm a bit suprised actually that it worked, it appears to be recalculated on every iteration as well. So BoolExpr works as a kind of lambda apparently.

Is this correct? Would you expect it to work propely like this? Or am I stretching the definitiion here? Keep in mind that I've only run it on an emulator. Perhaps there are some devious differences with QPU.

wimrijnders · 2018-07-18T14:25:41Z

Working on QPU! Execution with 0.215066s with 1 QPU, 192x192 points.

Pardon my language, but I'm fucking impressed. WORKING! This DSL thing you crafted actually delivers.

drinks are on me if ever we meet.

I had to reduce the resolution to 192x192 (was 1024x1024), because otherwise you get a heap alloc error.

Time comparison:

Platform	Kernel	Time (s)
i7	scalar	0.032004
i7 emulator	1	1.303475
p2	scalar	0.470066
p2 emulator	1	26.990710
p2	1	0.215066

i7 scalar is still 7 times faster than p2 kernel 1. However, this is 1 QPU and completely unoptimized code.

mn416 · 2018-07-18T17:43:03Z

Woohoo! Really nice!!! I haven't looked at the code yet, but will do and offer suggestions if any come to mind. We should put this as one of the introductory examples in the README :)

BoolExpr condition = (radius < 4 && count < numiterations);

Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.

I'm looking forward to the 12 QPU version :)

mn416 · 2018-07-18T17:46:48Z

First suggestion: instead of

result[i] = count;

try

store(count, result+i);

The latter is non-blocking -- it doesn't wait until the store is complete before continuing execution.

mn416 · 2018-07-18T17:52:49Z

Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.

By the way, does Mandlebrot require the loads? Can the fractal be produced without reading any input arrays?

Despite these suggestions, I'm glad you implemented the non-optimised version first. It looks very neat.

wimrijnders · 2018-07-18T17:58:45Z

Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.

Okay, but that's for kernel 2. Right now, I'm more interested in getting the current code optimized. Deluge me with suggestions!

By the way, does Mandlebrot require the loads? Can the fractal be produced without reading any input arrays?

Well, no to first and yes to second question. It's entirely possible to initialize everything with the given parameters. But I haven't figured out how to that yet. All I've got is what I understand from GCD and Rot3D. Suggestions are welcome, otherwise have patience.

EDIT: Well, ....

wimrijnders · 2018-07-19T06:09:02Z

Here's the second iteration of the Mandelbrot kernel. It does away with the input arrays. The trick here was to understand the usage of index().
Also some code 'optimization' - between quotes because it didn't make one bit of difference.

Calculation with 192x192 points; 1.0 previous kernel, 1.1 new kernel

Platform	Kernel	Time (s)
i7	scalar	0.032004
Pi2	scalar	0.470066
Pi2	1.0	0.215066
Pi2	1.1	0.209823

Well, maybe a tiny bit. I think it's fair to say that this kernel is computation-bound, because removing the data transport does not make one bit of difference (do you agree?).

The nice thing about not having the input arrays is that the points can be scaled up again to 1024x1024:

Calculation with 1024x1024 points

Platform	Kernel	Time (s)
i7	scalar	0.869301
Pi2	scalar	13.712409
Pi2	1.0	couldn't run
Pi2	1.1	5.005170

Insight:

Expressions are code generators

Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.

I understand now. The generated code is inlined and therefore it's as if it's called as a lambda. I actually really like this serendipitous capability, you should use it as a selling point and formalize it in the documentation.

I used this construct for a further optimization. This double use of condition bothered me:

   BoolExpr condition = (radius < 4 && count < numiterations);

    While (any(condition))
      Where (condition)
   ...

... because with my new insight it's obvious that the condition got executed twice - overhead. So I used the same principle to tweak the condition to something that can be stored in a variable so that only that variable needs to be checked:

     FloatExpr condition = (4.0f - (reSquare + imSquare))*toFloat(numiterations - count);
     Float checkvar = condition;

     While (any(checkvar > 0))
       Where (checkvar > 0)

And there is a slight improvement:

Calculation with 1024x1024 points

Platform	Kernel	Time (s)
i7	scalar	0.869301
Pi2	scalar	13.712409
Pi2	1.1	5.005170
Pi2	1.1 tweaked	4.370737

wimrijnders · 2018-07-19T06:29:11Z

Observations and feature requests

Given:

Int a;
Float b;
Float c;

... the following don't work:

c = a+b;       // No operator for Int, Float combination
c = a*b;       // idem
c += b;        // operator doesn't exist
a = (b < 0);   // Can't assign result BoolExpr to Int

There are alternatives to the first three of course:

c = toFloat(a)+b;
c = toFloat(a)*b;
c = c + b;

But I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still.

I hereby place a feature request for the given operators. Also the conversion of a BoolExpr result to Int.

wimrijnders · 2018-07-19T06:43:00Z

Also, a minor point, following does not work:

  store(count, result[index]);

I had to do it like this instead:

  store(count, result + index);

But TBH this is a small thing I can live with.

wimrijnders · 2018-07-19T06:44:54Z

I hope it goes without saying that any optimizations you can think of are appreciated. I want to embarrass the i7, but we're not close yet!

wimrijnders · 2018-07-19T06:46:57Z

And I'll repeat, I'm impressed with your efforts at making this work. I just starred your project, great work! Hope I can help to make it even better.

mn416 · 2018-07-19T07:18:14Z

I hope it goes without saying that any optimizations you can think of are appreciated.

I see you got rid of the loads, excellent. I agree, the kernel is now compute bound so should scale up to 12 QPUs without much hassle. I'm not saying we're getting optimal performance from a single QPU, but that is surely more the compiler's fault than the program's, in this case.

wimrijnders · 2018-07-19T08:16:35Z

Update, initial 12 QPU version:

Calculation with 1024x1024 points, 2 is multi-QPU kernel

Platform	Kernel	Time (s)
i7	scalar	0.869301
Pi2	scalar	13.712409
Pi2	1.1	4.370737
Pi2	2	2.187408

:-( I'm just so intensely disappointed right now. I'll see if I can tweak it further, then I'll commit for your insights.

wimrijnders · 2018-07-19T08:39:07Z

Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU

Num QPU's	Time (s)
1	4.427383
2	3.272469
3	2.755362
6	2.492874
12	2.192445

Not linear with num QPU's as I was expecting.....

mn416 · 2018-07-19T08:44:20Z

I can only imagine that there is a bottleneck created by the store function. What if you remove all calls to store, how well does the scaling work then?

I'm starting to think that the VideoCore doesn't like the way I am using the DMA unit -- lots of single-vector DMA requests. If so, this is a good thing to learn because it is probably also the bottleneck in other QPULib examples, and it is fixable.

wimrijnders · 2018-07-19T08:54:16Z

Questions/Requests

Is there any difference between If() and Where()? They appear to do the same thing.
Would a Continue and/or Break statement be possible? E.g.

While (condition)                // Also For(); any loops
  If (condition2) Continue; End  // Do next iteration loop
  If (condition3) Break;  End    // Exit the loop
End

Same question for Return. Also, to break of a sub-generator, eg:

void func_1() {
...
  Return;  // Get out of current generator
...
}

void func_2() {
...
  func_1();
...
}

An Exit would also be nice, but I think that's what kernelFinish() does.

wimrijnders · 2018-07-19T08:56:06Z

What if you remove all calls to store, how well does the scaling work then?

No difference.

mn416 · 2018-07-19T08:59:39Z

Is there any difference between If() and Where()? They appear to do the same thing

Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.

If executes different code depending on the condition.

wimrijnders · 2018-07-19T09:00:45Z

I commited the last changes:

Added a multi-QPU core
DRY'd common code for the kernels in mandelbrotCore()
raised the heap size, otherwise issues with running the emulator

Right now I'm hoping for a duh-moment where you point out some obvious error to me.

wimrijnders · 2018-07-19T09:03:43Z

Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.

Sorry, don't get it. Example to point out difference?

I'm stopping now, wasted[1] too much time on this already. I should be working right now!

[1] 'wasted' being a relative term. I'm having loads of fun doing this.

mn416 · 2018-07-19T09:15:01Z

Sorry, don't get it. Example to point out difference?

Where (x > 10) x++; End

Above increments the elements of vector x that are larger than 10.

If (any(x > 10)) x++; End

Above increments every element of x only when at least one of its elements is larger than 10.

wimrijnders · 2018-07-19T09:23:32Z

OK so far. What would this do?

  If (x > 10) x++; End

mn416 · 2018-07-19T11:05:23Z

Ahh, that makes more sense. Cool. I am guessing that store is the bottleneck now, can you verify?

wimrijnders · 2018-07-19T11:11:17Z

12 QPU's.

store(count, result + resultIndex: 0.218935s
result[resultIndex] = count: 0.220662s
No write to result at all: 0.199739s

Not really much difference

mn416 · 2018-07-19T11:12:28Z

The results you are seeing are linear if you plot num QPUs versus speedup factor.

wimrijnders · 2018-07-19T11:18:39Z

The results you are seeing are linear if you plot num QPUs versus speedup factor.

And this is how that looks like:

Something to be satisfied about I think.

mn416 · 2018-07-19T11:51:50Z

Something to be satisfied about I think.

Definitely. The first QPULib example to show strong scaling :)

mn416 · 2018-07-19T11:59:34Z

void mandelbrotCore(
  Float reC, Float imC,
  Int resultIndex,
  Int numiterations,
  Ptr<Int> &result)

You might try taking resultIndex and numIterations as references, to avoid unnecessary copying. As I've said before, QPULib doesn't do many optimisations.

wimrijnders · 2018-07-19T15:58:15Z

That was the last commit, some minor cleanup. Right now, I don't have any more bright ideas on how to make it better. Please final review?

mn416 · 2018-07-19T16:43:27Z

Ideally mandlebrot_2 would look as close as possible to the scalar version, and then the two could be placed side-by-side in the tutorial. One way to do this would be to inline mandlebrotCore and try to get rid of the strange looking dummy loop.

These are only suggestions, happy to accept the PR as it is too.

wimrijnders · 2018-07-19T17:02:50Z

The mandbrodCore is a DRY on the code....would hesitate to remove it again. A case can be made to put it as such in the tutorial, you can describe it separately.

As for the If (dummy..., this is the best I could think of. Open for suggestions. Can this be done better?

wimrijnders · 2018-07-19T17:32:23Z

I do agree that the dummy if is stupid. Ideally, a When should be there but semantics forbid it.

wimrijnders · 2018-07-20T04:53:44Z

I found a solution for If(dummy.... Since a line in the mandelbrot is done completely by a single QPU, it's sufficient to test yIndex only. See the code diff.

Further Changes:

code cleanup
init value for count in output_pgm()
aligned kernel 1 and kernel 2 code - kernel 1 showed differences in output, probably due to incremental differences in calculation

I tested this both kernel 1 and 2 and with different numbers of QPU's (especially odd numbers), the output bitmap is now always the same.

wimrijnders · 2018-07-20T05:54:01Z

@mn416 Heh. There is a competitor.

I wonder how our implementation compares to that one. I'll check in a spare moment.

Tested on Pi 2, kernel 2, 12 QPU's, same parameters for mandelbrot generation as link above.

Competitor's time: 33.781s

Run	Time(s)	comment
1	9.997932	Message: 'Failed to invoke kernel on QPUs'
2	31.794590	this is good!
3	0.000137	This can't be right at all. Something went wrong with the scheduling?
4	48.802330	:-(
5	0.000137
6	51.746103
7	0.000139
8	36.522121

There's a pattern here. The first call fails in some way. The second call succeeds, but the times are highly variable.

Do you have any idea what can cause this?

Also, see the output bitmap:

The output is 1920x1080, 5MB. I couldn't load it into GIMP so I made a screenshot and scaled it down.

Not sure where this comes from. I'm not really expecting the calculation itself to be in error (however, see error message above). It's probably more likely to do with the pgm generation.

mn416 · 2018-07-20T06:44:25Z

Yeah the first error is probably a timeout. Grep for TIMEOUT in the repo. It is 10s by default. And the error message should be improved to include the cause of the failure.

Not sure how the 2nd run finished though. Might be best to restart the pi and avoid the timeout to make sure of a correct result.

mn416 · 2018-07-20T06:47:06Z

BTW, I would be surprised if QPULib were to outperform hand written assembler.

wimrijnders · 2018-07-20T06:51:08Z

@mn416 ping. Edited the previous comment with more tests and insights.

wimrijnders · 2018-07-20T06:53:26Z

Yeah the first error is probably a timeout.

That can't be it, it fails right away. There's no waiting for ten seconds.
Ah, wait, you're responding to the first value. OK. But the rest of the 'quick' iterations are different.

the error message should be improved to include the cause of the failure.

Yes, please.

BTW, I would be surprised if QPULib were to outperform hand written assembler.

Me too. I was actually quite happy with the first value. The subsequent runs are at least in the same ball-park.

wimrijnders · 2018-07-20T06:57:13Z

I fixed a bug in the pgm output. There are limits to be considered.

Thinking about what other problems can occur.

mn416 · 2018-07-20T06:58:49Z

If the timeout occurs the QPUs could be in a bad state. I’d fix the timeout and do a hard reset.

wimrijnders · 2018-07-20T07:35:04Z

The Pi 2 hung during testing, restarted and now I'm consistently getting 'Failed to invoke kernel on QPUs'. Why did this not happen previously?

Will see if I can disable the timeout.

wimrijnders · 2018-07-20T08:14:40Z

Insight. OK, I got it now.

QPU_TIMEOUT is used qpu_execute() to start a kernel. The call waits till the kernel has executed, it times out with the given value if not executed within that time.

Which begs the question: Is it possible to schedule kernel executions? Also without a timeout running?

The scenario I'm thinking of is that the CPU program starts the jobs, does something else and regularly polls if the QPU work is done.

wimrijnders · 2018-07-20T08:20:50Z

Raised the timeout to 100s, working perfectly now. I think all the previous problems had to do with some QPU's timing out and others not.

The run values are up by the way, several runs:

62.494617
62.494638
62.494692

Again, comparing to competitor: 33.781s

That the run times are consistent is a good sign I think. Also, the output bitmap is perfect.

wimrijnders · 2018-07-20T08:23:06Z

Updated final fixes. Please review.

mn416 · 2018-07-20T08:33:42Z

Nice. I think the code looks much neater now without that dummy loop, thanks for getting rid of it.

Slightly uneasy about increasing the timeout to 100s. A lowish timeout is actually very useful when debugging, which is the common case. Ideally, the timeout would be taken at run-time, e.g. kernel.setTimeout(t).

Anyway, I'll merge this PR now as the timeout issue is somewhat separate.

Great work BTW, I'm very pleased to have another example of QPULib in there.

wimrijnders · 2018-07-20T08:38:33Z

I think the code looks much neater now without that dummy loop, thanks for getting rid of it.

I agree, it was a desperation move to get it to work properly. I didn't want it in either.

Slightly uneasy about increasing the timeout to 100s.

OK. Actually, I just updated it to get the competitor sample running. I'll put it back again!

Ideally, the timeout would be taken at run-time, e.g. kernel.setTimeout(t).

That's a good idea.

Great work BTW

My pleasure. I am truly enjoying this work. Wish I had more time for it - actually more time for the other things I need to do in my life. I'm bingeing right now, won't be able to keep it up.

New example: Mandelbrot

e4a74e5

This adds the example program Mandelbrot, at least an initial version. - Structure taken from `Rot3d` - Scalar kernel only - The scalar kernel is structured in such a way that I hope makes sense for QPU kernels - Outputs a PGM bitmap with the result

Added initial version of QPU kernel, works with emulator; ode cleanup

8e9c340

Mandelbrot: Remove the need for the input arrays in the kernel

a6590eb

Add multi-QPU kernel; DRY'd common kernel code

5d3e4fa

Some essential fixes

7228cf8

Final fixes

1e8e52b

Resolution of , code cleanup and small fixes

189a6bc

raised qpu execute timeout, enhanced pgm output

fc4fc09

mn416 merged commit 5d4ad88 into mn416:development Jul 20, 2018

wimrijnders deleted the mandelbrot branch July 21, 2018 04:27

New example: Mandelbrot #58

New example: Mandelbrot #58

Conversation

wimrijnders commented Jul 18, 2018

wimrijnders commented Jul 18, 2018

wimrijnders commented Jul 18, 2018

wimrijnders commented Jul 18, 2018

wimrijnders commented Jul 18, 2018

wimrijnders commented Jul 18, 2018 • edited

wimrijnders commented Jul 18, 2018 • edited

mn416 commented Jul 18, 2018 • edited

mn416 commented Jul 18, 2018 • edited

mn416 commented Jul 18, 2018 • edited

wimrijnders commented Jul 18, 2018 • edited

wimrijnders commented Jul 19, 2018 • edited

Insight:

Expressions are code generators

wimrijnders commented Jul 19, 2018

Observations and feature requests

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018 • edited

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 19, 2018 • edited

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

Questions/Requests

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 19, 2018 • edited

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

mn416 commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 19, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 20, 2018 • edited

mn416 commented Jul 20, 2018

mn416 commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 20, 2018 • edited

wimrijnders commented Jul 20, 2018

mn416 commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

mn416 commented Jul 20, 2018

wimrijnders commented Jul 20, 2018

wimrijnders commented Jul 18, 2018 •

edited

wimrijnders commented Jul 18, 2018 •

edited

mn416 commented Jul 18, 2018 •

edited

mn416 commented Jul 18, 2018 •

edited

mn416 commented Jul 18, 2018 •

edited

wimrijnders commented Jul 18, 2018 •

edited

wimrijnders commented Jul 19, 2018 •

edited

mn416 commented Jul 19, 2018 •

edited

wimrijnders commented Jul 19, 2018 •

edited

wimrijnders commented Jul 19, 2018 •

edited

wimrijnders commented Jul 20, 2018 •

edited

wimrijnders commented Jul 20, 2018 •

edited