New example: Mandelbrot #58
This adds the example program Mandelbrot, at least an initial version.

- Structure taken from `Rot3d`
- Scalar kernel only
- The scalar kernel is structured in such a way that I hope makes sense for QPU kernels
- Outputs a PGM bitmap with the result
I really, really couldn't resist. Never mind my actual work or other important tasks!
I'm actually quite impressed with the running time on the Pi 2: it's only about 14 times slower than my 3GHz i7. I find that amazing for such a dinky processor.
The goal, as far as I'm concerned, is to have the VideoCore beat this i7 value. I want to see a $40 computer make mincemeat of my Intel laptop.
Added first version of a QPU kernel. This works with the emulator; not tested yet with hardware. I must honestly say that the conversion from scalar to QPU was straightforward -- congrats on that. Sincere feedback on my first kernel attempt is appreciated. Also some code cleanup.
I used the following construct on a hunch:

```cpp
BoolExpr condition = (radius < 4 && count < numiterations);

While (any(condition))
  Where (condition)
    ...
```

I'm a bit surprised actually that it worked; it appears to be recalculated on every iteration as well. So is this correct? Would you expect it to work properly like this? Or am I stretching the definition here? Keep in mind that I've only run it on an emulator. Perhaps there are some devious differences with the QPU.
Working on QPU! Execution time 0.215066s with 1 QPU, 192x192 points. Pardon my language, but I'm fucking impressed. WORKING! This DSL thing you crafted actually delivers. Drinks are on me if ever we meet. I had to reduce the resolution to 192x192 (was 1024x1024), because otherwise you get a heap alloc error. Time comparison: the i7 scalar is still 7 times faster than the Pi 2 kernel 1. However, this is 1 QPU and completely unoptimized code.
Woohoo! Really nice!!! I haven't looked at the code yet, but will do and offer suggestions if any come to mind. We should put this as one of the introductory examples in the README :)
Yes, `BoolExpr` is an expression, not a value -- it isn't actually evaluated at this point. I'm looking forward to the 12 QPU version :)
First suggestion: instead of
try
The latter is non-blocking -- it doesn't wait until the store is complete before continuing execution.
Second suggestion: try using ... By the way, does Mandelbrot require the loads? Can the fractal be produced without reading any input arrays? Despite these suggestions, I'm glad you implemented the non-optimised version first. It looks very neat.
Okay, but that's for kernel 2. Right now, I'm more interested in getting the current code optimized. Deluge me with suggestions!
Well, no to the first and yes to the second question. It's entirely possible to initialize everything with the given parameters, but I haven't figured out how to do that yet. All I've got is what I understand from ... EDIT: Well, ....
Here's the second iteration of the Mandelbrot kernel. It does away with the input arrays. The trick here was to understand the usage of .... Calculation with 192x192 points; 1.0 is the previous kernel, 1.1 the new kernel:
Well, maybe a tiny bit. I think it's fair to say that this kernel is computation-bound, because removing the data transport does not make one bit of difference (do you agree?). The nice thing about not having the input arrays is that the points can be scaled up again to 1024x1024. Calculation with 1024x1024 points:
Insight: expressions are code generators
I understand now. The generated code is inlined, and therefore it's as if it's called as a lambda. I actually really like this serendipitous capability; you should use it as a selling point and formalize it in the documentation. I used this construct for a further optimization. This double use of condition bothered me:

```cpp
BoolExpr condition = (radius < 4 && count < numiterations);

While (any(condition))
  Where (condition)
    ...
```

... because with my new insight it's obvious that the condition gets executed twice -- overhead. So I used the same principle to tweak the condition into something that can be stored in a variable, so that only that variable needs to be checked:

```cpp
FloatExpr condition = (4.0f - (reSquare + imSquare)) * toFloat(numiterations - count);
Float checkvar = condition;

While (any(checkvar > 0))
  Where (checkvar > 0)
```

And there is a slight improvement. Calculation with 1024x1024 points:
Observations and feature requests

Given:

```cpp
Int a;
Float b;
Float c;
```

... the following don't work:

```cpp
c = a + b;   // No operator for Int, Float combination
c = a * b;   // idem
c += b;      // operator doesn't exist
a = (b < 0); // Can't assign a BoolExpr result to Int
```

There are alternatives to the first three, of course:

```cpp
c = toFloat(a) + b;
c = toFloat(a) * b;
c = c + b;
```

But I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still. I hereby place a feature request for the given operators. Also the conversion of a ...
Also, a minor point: the following does not work: `store(count, result[index]);` I had to do it like this instead: `store(count, result + index);` But TBH this is a small thing I can live with.
I hope it goes without saying that any optimizations you can think of are appreciated. I want to embarrass the i7, but we're not close yet!
And I'll repeat: I'm impressed with your efforts at making this work. I just starred your project -- great work! Hope I can help to make it even better.
I see you got rid of the loads -- excellent. I agree, the kernel is now compute-bound, so it should scale up to 12 QPUs without much hassle. I'm not saying we're getting optimal performance from a single QPU, but that is surely more the compiler's fault than the program's, in this case.
Update, initial 12 QPU version. Calculation with 1024x1024 points; 2 is the multi-QPU kernel:

:-( I'm just so intensely disappointed right now. I'll see if I can tweak it further, then I'll commit for your insights.
I can only imagine that there is a bottleneck created by the .... I'm starting to think that the VideoCore doesn't like the way I am using the DMA unit -- lots of single-vector DMA requests. If so, this is a good thing to learn, because it is probably also the bottleneck in the other QPULib examples, and it is fixable.
Questions/Requests

```cpp
void func_1() {
  ...
  Return;  // Get out of current generator
  ...
}

void func_2() {
  ...
  func_1();
  ...
}
```
No difference.
I committed the last changes:
Right now I'm hoping for a duh-moment where you point out some obvious error to me. |
Sorry, don't get it. Example to point out the difference? I'm stopping now, wasted[1] too much time on this already. I should be working right now! [1] 'wasted' being a relative term. I'm having loads of fun doing this.
The above increments the elements of vector ...
The above increments every element of ...
OK so far. What would this do?

```cpp
If (x > 10)
  x++;
End
```
Ahh, that makes more sense. Cool. I am guessing that ...
12 QPUs.
Not really much difference.
The results you are seeing are linear if you plot num QPUs versus speedup factor.
Definitely. The first QPULib example to show strong scaling :) |
You might try taking `resultIndex` and `numIterations` as references, to avoid unnecessary copying. As I've said before, QPULib doesn't do many optimisations.
That was the last commit -- some minor cleanup. Right now, I don't have any more bright ideas on how to make it better. Final review, please?
Ideally, .... These are only suggestions; happy to accept the PR as it is, too.
The .... As for the ...
I do agree that the dummy `if` is stupid. Ideally, a ...
I found a solution for .... Further changes:
I tested this with both kernel 1 and kernel 2, and with different numbers of QPUs (especially odd numbers); the output bitmap is now always the same.
@mn416 Heh. There is a competitor. I wonder how our implementation compares to that one. I'll check in a spare moment. Tested on the Pi 2, kernel 2, 12 QPUs, same parameters for Mandelbrot generation as the link above. Competitor's time: 33.781s
There's a pattern here. The first call fails in some way. The second call succeeds, but the times are highly variable. Do you have any idea what could cause this? Also, see the output bitmap: the output is 1920x1080, 5MB. I couldn't load it into GIMP, so I made a screenshot and scaled it down. Not sure where this comes from. I'm not really expecting the calculation itself to be in error (however, see the error message above). It's probably more likely to do with the PGM generation.
Yeah, the first error is probably a timeout. Grep for TIMEOUT in the repo; it is 10s by default. And the error message should be improved to include the cause of the failure. Not sure how the 2nd run finished, though. Might be best to restart the Pi and avoid the timeout to make sure of a correct result.
BTW, I would be surprised if QPULib were to outperform hand-written assembler.
@mn416 ping. Edited the previous comment with more tests and insights. |
That can't be it, it fails right away. There's no waiting for ten seconds.
Yes, please.
Me too. I was actually quite happy with the first value. The subsequent runs are at least in the same ballpark.
I fixed a bug in the PGM output. There are limits to be considered. Thinking about what other problems can occur.
If the timeout occurs, the QPUs could be in a bad state. I'd fix the timeout and do a hard reset.
The Pi 2 hung during testing; restarted, and now I'm consistently getting 'Failed to invoke kernel on QPUs'. Why did this not happen previously? Will see if I can disable the timeout.
Insight. OK, I got it now.
Which begs the question: is it possible to schedule kernel executions? Also without a timeout running? The scenario I'm thinking of is that the CPU program starts the jobs, does something else, and regularly polls whether the QPU work is done.
Raised the timeout to 100s; working perfectly now. I think all the previous problems had to do with some QPUs timing out and others not. The run values are up, by the way; several runs:

Again, comparing to the competitor: 33.781s. That the run times are consistent is a good sign, I think. Also, the output bitmap is perfect.
Updated final fixes. Please review.
Nice. I think the code looks much neater now without that ....

Slightly uneasy about increasing the timeout to 100s. A lowish timeout is actually very useful when debugging, which is the common case. Ideally, the timeout would be taken at run-time, e.g. ....

Anyway, I'll merge this PR now, as the timeout issue is somewhat separate. Great work BTW, I'm very pleased to have another example of QPULib in there.
I agree, it was a desperation move to get it to work properly. I didn't want it in either.
OK. Actually, I just updated it to get the competitor sample running. I'll put it back again!
That's a good idea.
My pleasure. I am truly enjoying this work. Wish I had more time for it -- actually, more time for the other things I need to do in my life. I'm bingeing right now; won't be able to keep it up.