Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New example: Mandelbrot #58

Merged
merged 8 commits into from Jul 20, 2018
Merged

New example: Mandelbrot #58

merged 8 commits into from Jul 20, 2018

Conversation

wimrijnders
Copy link
Contributor

This adds the example program Mandelbrot, at least an initial version.

  • Structure taken from Rot3d
  • Scalar kernel only
  • The scalar kernel is structured in such a way that I hope makes sense for QPU kernels
  • Outputs a PGM bitmap with the result

Running times:

# Intel:
> obj-debug/bin/Mandelbrot 
0.869301s

# Pi 2:
> obj/bin/Mandelbrot 
13.470295s

The output PGM bitmap looks like this:

mandelbrot

This adds the example program Mandelbrot, at least an initial version.

- Structure taken from `Rot3d`
- Scalar kernel only
- The scalar kernel is structured in such a way that I hope makes sense for QPU kernels
- Outputs a PGM bitmap with the result
@wimrijnders
Copy link
Contributor Author

I really really couldn't resist. Never mind my actual work or other important tasks!

@wimrijnders
Copy link
Contributor Author

I'm actually quite impressed with the running time of the Pi 2: it's only about 14 times slower than my 3GHz i7. I find that amazing for such a dinky processor.

@wimrijnders
Copy link
Contributor Author

The goal as far as I'm concerned, is to have the VideoCore beat this i7 value. I want to see a $40 computer make mincemeat of my intel laptop.

@wimrijnders
Copy link
Contributor Author

Added first version of a QPU kernel. This works with the emulator, not tested yet with hardware.

I must honestly say that the conversion from scalar to QPU was straightforward, congrats on that. Sincere feedback on my first kernel attempt is appreciated.

Also some code cleanup.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 18, 2018

I used the following construct on a hunch:

   BoolExpr condition = (radius < 4 && count < numiterations);

    While (any(condition))
      Where (condition)
   ...

I'm a bit suprised actually that it worked, it appears to be recalculated on every iteration as well. So BoolExpr works as a kind of lambda apparently.

Is this correct? Would you expect it to work propely like this? Or am I stretching the definitiion here? Keep in mind that I've only run it on an emulator. Perhaps there are some devious differences with QPU.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 18, 2018

Working on QPU! Execution with 0.215066s with 1 QPU, 192x192 points.

mandelbrot

Pardon my language, but I'm fucking impressed. WORKING! This DSL thing you crafted actually delivers.

drinks are on me if ever we meet.


I had to reduce the resolution to 192x192 (was 1024x1024), because otherwise you get a heap alloc error.

Time comparison:

Platform Kernel Time (s)
i7 scalar 0.032004
i7 emulator 1 1.303475
p2 scalar 0.470066
p2 emulator 1 26.990710
p2 1 0.215066

i7 scalar is still 7 times faster than p2 kernel 1. However, this is 1 QPU and completely unoptimized code.

@mn416
Copy link
Owner

mn416 commented Jul 18, 2018

Woohoo! Really nice!!! I haven't looked at the code yet, but will do and offer suggestions if any come to mind. We should put this as one of the introductory examples in the README :)

BoolExpr condition = (radius < 4 && count < numiterations);

Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.

I'm looking forward to the 12 QPU version :)

@mn416
Copy link
Owner

mn416 commented Jul 18, 2018

First suggestion: instead of

result[i] = count;

try

store(count, result+i);

The latter is non-blocking -- it doesn't wait until the store is complete before continuing execution.

@mn416
Copy link
Owner

mn416 commented Jul 18, 2018

Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.

By the way, does Mandlebrot require the loads? Can the fractal be produced without reading any input arrays?

Despite these suggestions, I'm glad you implemented the non-optimised version first. It looks very neat.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 18, 2018

Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.

Okay, but that's for kernel 2. Right now, I'm more interested in getting the current code optimized. Deluge me with suggestions!

By the way, does Mandlebrot require the loads? Can the fractal be produced without reading any input arrays?

Well, no to first and yes to second question. It's entirely possible to initialize everything with the given parameters. But I haven't figured out how to that yet. All I've got is what I understand from GCD and Rot3D. Suggestions are welcome, otherwise have patience.


EDIT: Well, ....

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 19, 2018

Here's the second iteration of the Mandelbrot kernel. It does away with the input arrays. The trick here was to understand the usage of index().
Also some code 'optimization' - between quotes because it didn't make one bit of difference.

Calculation with 192x192 points; 1.0 previous kernel, 1.1 new kernel

Platform Kernel Time (s)
i7 scalar 0.032004
Pi2 scalar 0.470066
Pi2 1.0 0.215066
Pi2 1.1 0.209823

Well, maybe a tiny bit. I think it's fair to say that this kernel is computation-bound, because removing the data transport does not make one bit of difference (do you agree?).

The nice thing about not having the input arrays is that the points can be scaled up again to 1024x1024:

Calculation with 1024x1024 points

Platform Kernel Time (s)
i7 scalar 0.869301
Pi2 scalar 13.712409
Pi2 1.0 couldn't run
Pi2 1.1 5.005170

Insight:

Expressions are code generators

Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.

I understand now. The generated code is inlined and therefore it's as if it's called as a lambda. I actually really like this serendipitous capability, you should use it as a selling point and formalize it in the documentation.

I used this construct for a further optimization. This double use of condition bothered me:

   BoolExpr condition = (radius < 4 && count < numiterations);

    While (any(condition))
      Where (condition)
   ...

... because with my new insight it's obvious that the condition got executed twice - overhead. So I used the same principle to tweak the condition to something that can be stored in a variable so that only that variable needs to be checked:

     FloatExpr condition = (4.0f - (reSquare + imSquare))*toFloat(numiterations - count);
     Float checkvar = condition;

     While (any(checkvar > 0))
       Where (checkvar > 0)

And there is a slight improvement:

Calculation with 1024x1024 points

Platform Kernel Time (s)
i7 scalar 0.869301
Pi2 scalar 13.712409
Pi2 1.1 5.005170
Pi2 1.1 tweaked 4.370737

@wimrijnders
Copy link
Contributor Author

Observations and feature requests

Given:

Int a;
Float b;
Float c;

... the following don't work:

c = a+b;       // No operator for Int, Float combination
c = a*b;       // idem
c += b;        // operator doesn't exist
a = (b < 0);   // Can't assign result BoolExpr to Int

There are alternatives to the first three of course:

c = toFloat(a)+b;
c = toFloat(a)*b;
c = c + b;

But I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still.

I hereby place a feature request for the given operators. Also the conversion of a BoolExpr result to Int.

@wimrijnders
Copy link
Contributor Author

Also, a minor point, following does not work:

  store(count, result[index]);

I had to do it like this instead:

  store(count, result + index);

But TBH this is a small thing I can live with.

@wimrijnders
Copy link
Contributor Author

I hope it goes without saying that any optimizations you can think of are appreciated. I want to embarrass the i7, but we're not close yet!

@wimrijnders
Copy link
Contributor Author

And I'll repeat, I'm impressed with your efforts at making this work. I just starred your project, great work! Hope I can help to make it even better.

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

I hope it goes without saying that any optimizations you can think of are appreciated.

I see you got rid of the loads, excellent. I agree, the kernel is now compute bound so should scale up to 12 QPUs without much hassle. I'm not saying we're getting optimal performance from a single QPU, but that is surely more the compiler's fault than the program's, in this case.

@wimrijnders
Copy link
Contributor Author

Update, initial 12 QPU version:

Calculation with 1024x1024 points, 2 is multi-QPU kernel

Platform Kernel Time (s)
i7 scalar 0.869301
Pi2 scalar 13.712409
Pi2 1.1 4.370737
Pi2 2 2.187408

:-( I'm just so intensely disappointed right now. I'll see if I can tweak it further, then I'll commit for your insights.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 19, 2018

Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU

Num QPU's Time (s)
1 4.427383
2 3.272469
3 2.755362
6 2.492874
12 2.192445

screenshot from 2018-07-19 10-39-46

Not linear with num QPU's as I was expecting.....

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

I can only imagine that there is a bottleneck created by the store function. What if you remove all calls to store, how well does the scaling work then?

I'm starting to think that the VideoCore doesn't like the way I am using the DMA unit -- lots of single-vector DMA requests. If so, this is a good thing to learn because it is probably also the bottleneck in other QPULib examples, and it is fixable.

@wimrijnders
Copy link
Contributor Author

Questions/Requests

  • Is there any difference between If() and Where()? They appear to do the same thing.
  • Would a Continue and/or Break statement be possible? E.g.
While (condition)                // Also For(); any loops
  If (condition2) Continue; End  // Do next iteration loop
  If (condition3) Break;  End    // Exit the loop
End
  • Same question for Return. Also, to break of a sub-generator, eg:
void func_1() {
...
  Return;  // Get out of current generator
...
}

void func_2() {
...
  func_1();
...
}
  • An Exit would also be nice, but I think that's what kernelFinish() does.

@wimrijnders
Copy link
Contributor Author

What if you remove all calls to store, how well does the scaling work then?

No difference.

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

Is there any difference between If() and Where()? They appear to do the same thing

Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.

If executes different code depending on the condition.

@wimrijnders
Copy link
Contributor Author

I commited the last changes:

  • Added a multi-QPU core
  • DRY'd common code for the kernels in mandelbrotCore()
  • raised the heap size, otherwise issues with running the emulator

Right now I'm hoping for a duh-moment where you point out some obvious error to me.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 19, 2018

Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.

Sorry, don't get it. Example to point out difference?

I'm stopping now, wasted[1] too much time on this already. I should be working right now!


[1] 'wasted' being a relative term. I'm having loads of fun doing this.

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

Sorry, don't get it. Example to point out difference?

Where (x > 10) x++; End

Above increments the elements of vector x that are larger than 10.

If (any(x > 10)) x++; End

Above increments every element of x only when at least one of its elements is larger than 10.

@wimrijnders
Copy link
Contributor Author

OK so far. What would this do?

  If (x > 10) x++; End

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

Ahh, that makes more sense. Cool. I am guessing that store is the bottleneck now, can you verify?

@wimrijnders
Copy link
Contributor Author

12 QPU's.

  • store(count, result + resultIndex: 0.218935s
  • result[resultIndex] = count: 0.220662s
  • No write to result at all: 0.199739s

Not really much difference

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

The results you are seeing are linear if you plot num QPUs versus speedup factor.

@wimrijnders
Copy link
Contributor Author

The results you are seeing are linear if you plot num QPUs versus speedup factor.

And this is how that looks like:

screenshot from 2018-07-19 13-16-16

Something to be satisfied about I think.

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

Something to be satisfied about I think.

Definitely. The first QPULib example to show strong scaling :)

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

void mandelbrotCore(
  Float reC, Float imC,
  Int resultIndex,
  Int numiterations,
  Ptr<Int> &result)

You might try taking resultIndex and numIterations as references, to avoid unnecessary copying. As I've said before, QPULib doesn't do many optimisations.

@wimrijnders
Copy link
Contributor Author

That was the last commit, some minor cleanup. Right now, I don't have any more bright ideas on how to make it better. Please final review?

@mn416
Copy link
Owner

mn416 commented Jul 19, 2018

Ideally mandlebrot_2 would look as close as possible to the scalar version, and then the two could be placed side-by-side in the tutorial. One way to do this would be to inline mandlebrotCore and try to get rid of the strange looking dummy loop.

These are only suggestions, happy to accept the PR as it is too.

@wimrijnders
Copy link
Contributor Author

The mandbrodCore is a DRY on the code....would hesitate to remove it again. A case can be made to put it as such in the tutorial, you can describe it separately.

As for the If (dummy..., this is the best I could think of. Open for suggestions. Can this be done better?

@wimrijnders
Copy link
Contributor Author

I do agree that the dummy if is stupid. Ideally, a When should be there but semantics forbid it.

@wimrijnders
Copy link
Contributor Author

I found a solution for If(dummy.... Since a line in the mandelbrot is done completely by a single QPU, it's sufficient to test yIndex only. See the code diff.

Further Changes:

  • code cleanup
  • init value for count in output_pgm()
  • aligned kernel 1 and kernel 2 code - kernel 1 showed differences in output, probably due to incremental differences in calculation

I tested this both kernel 1 and 2 and with different numbers of QPU's (especially odd numbers), the output bitmap is now always the same.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 20, 2018

@mn416 Heh. There is a competitor.

I wonder how our implementation compares to that one. I'll check in a spare moment.


Tested on Pi 2, kernel 2, 12 QPU's, same parameters for mandelbrot generation as link above.

Competitor's time: 33.781s

Run Time(s) comment
1 9.997932 Message: 'Failed to invoke kernel on QPUs'
2 31.794590 this is good!
3 0.000137 This can't be right at all. Something went wrong with the scheduling?
4 48.802330 :-(
5 0.000137
6 51.746103
7 0.000139
8 36.522121

There's a pattern here. The first call fails in some way. The second call succeeds, but the times are highly variable.

Do you have any idea what can cause this?

Also, see the output bitmap:

mandelbrot_big

The output is 1920x1080, 5MB. I couldn't load it into GIMP so I made a screenshot and scaled it down.

Not sure where this comes from. I'm not really expecting the calculation itself to be in error (however, see error message above). It's probably more likely to do with the pgm generation.

@mn416
Copy link
Owner

mn416 commented Jul 20, 2018

Yeah the first error is probably a timeout. Grep for TIMEOUT in the repo. It is 10s by default. And the error message should be improved to include the cause of the failure.

Not sure how the 2nd run finished though. Might be best to restart the pi and avoid the timeout to make sure of a correct result.

@mn416
Copy link
Owner

mn416 commented Jul 20, 2018

BTW, I would be surprised if QPULib were to outperform hand written assembler.

@wimrijnders
Copy link
Contributor Author

@mn416 ping. Edited the previous comment with more tests and insights.

@wimrijnders
Copy link
Contributor Author

wimrijnders commented Jul 20, 2018

Yeah the first error is probably a timeout.

That can't be it, it fails right away. There's no waiting for ten seconds.
Ah, wait, you're responding to the first value. OK. But the rest of the 'quick' iterations are different.

the error message should be improved to include the cause of the failure.

Yes, please.

BTW, I would be surprised if QPULib were to outperform hand written assembler.

Me too. I was actually quite happy with the first value. The subsequent runs are at least in the same ball-park.

@wimrijnders
Copy link
Contributor Author

I fixed a bug in the pgm output. There are limits to be considered.

Thinking about what other problems can occur.

@mn416
Copy link
Owner

mn416 commented Jul 20, 2018

If the timeout occurs the QPUs could be in a bad state. I’d fix the timeout and do a hard reset.

@wimrijnders
Copy link
Contributor Author

The Pi 2 hung during testing, restarted and now I'm consistently getting 'Failed to invoke kernel on QPUs'. Why did this not happen previously?

Will see if I can disable the timeout.

@wimrijnders
Copy link
Contributor Author

Insight. OK, I got it now.

QPU_TIMEOUT is used qpu_execute() to start a kernel. The call waits till the kernel has executed, it times out with the given value if not executed within that time.

Which begs the question: Is it possible to schedule kernel executions? Also without a timeout running?

The scenario I'm thinking of is that the CPU program starts the jobs, does something else and regularly polls if the QPU work is done.

@wimrijnders
Copy link
Contributor Author

Raised the timeout to 100s, working perfectly now. I think all the previous problems had to do with some QPU's timing out and others not.

The run values are up by the way, several runs:

  • 62.494617
  • 62.494638
  • 62.494692

Again, comparing to competitor: 33.781s

That the run times are consistent is a good sign I think. Also, the output bitmap is perfect.

@wimrijnders
Copy link
Contributor Author

Updated final fixes. Please review.

@mn416
Copy link
Owner

mn416 commented Jul 20, 2018

Nice. I think the code looks much neater now without that dummy loop, thanks for getting rid of it.

Slightly uneasy about increasing the timeout to 100s. A lowish timeout is actually very useful when debugging, which is the common case. Ideally, the timeout would be taken at run-time, e.g. kernel.setTimeout(t).

Anyway, I'll merge this PR now as the timeout issue is somewhat separate.

Great work BTW, I'm very pleased to have another example of QPULib in there.

@mn416 mn416 merged commit 5d4ad88 into mn416:development Jul 20, 2018
@wimrijnders
Copy link
Contributor Author

I think the code looks much neater now without that dummy loop, thanks for getting rid of it.

I agree, it was a desperation move to get it to work properly. I didn't want it in either.

Slightly uneasy about increasing the timeout to 100s.

OK. Actually, I just updated it to get the competitor sample running. I'll put it back again!

Ideally, the timeout would be taken at run-time, e.g. kernel.setTimeout(t).

That's a good idea.

Great work BTW

My pleasure. I am truly enjoying this work. Wish I had more time for it - actually more time for the other things I need to do in my life. I'm bingeing right now, won't be able to keep it up.

@wimrijnders wimrijnders deleted the mandelbrot branch July 21, 2018 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants