VideoCore IV Performance Considerations
The VPU is much faster than anything I can manage with SIMD on the ARM CPU.
I've timed some instructions, just for the lols. The data is not cached in the VPU L1, but it is L2 cached. The clock speed is a wild guess on my part: 250 MHz.
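For reading the cycle counts in these notes as wall-clock time (or turning a timed loop back into cycles per iteration), here is a minimal sketch. The 250 MHz figure is only the guess stated above, and the helper names `cycles_to_ns` and `ns_to_cycles` are made up for illustration.

```python
# Assumption: 250 MHz is the guessed VPU clock from the notes, not a measured value.
ASSUMED_CLOCK_HZ = 250_000_000


def cycles_to_ns(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a cycle count into nanoseconds at the given clock."""
    return cycles * 1e9 / clock_hz


def ns_to_cycles(elapsed_ns, iterations, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert the elapsed time of a timed loop into cycles per iteration."""
    return elapsed_ns * clock_hz / (1e9 * iterations)


if __name__ == "__main__":
    # One cycle at the assumed clock, then a hypothetical 1000-iteration loop.
    print(cycles_to_ns(1))
    print(ns_to_cycles(56_000, 1000))
```

If the real clock turns out to be different, every cycle figure derived this way scales by the same factor, so the relative costs still hold.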
Here are my findings:
- the scalar unit in the VPU can dual issue, i.e. it can issue two independent instructions per clock.
- the average scalar instruction takes one cycle to execute, and dependent instructions can go back-to-back with no trouble.
- I have a feeling that when a bcc directly follows a flag-setting instruction, the bcc takes two cycles; otherwise the bcc overlaps with other scalar instructions and effectively takes just one cycle (this seems too good to be true... branch prediction?)
- vector instructions cannot dual issue with other vector instructions
- you can dual issue vector instructions with scalar instructions (vector+scalar, scalar+scalar, but not vector+vector)
- the average vector instruction takes one cycle, at 250 MHz
- vector multiplies always take 2 cycles
- instructions which do a shift by 8 (>> 8), a sat or a clamp seem to take 2 cycles
- back-to-back vector loads to the same address take 14 cycles. This appears to be the same regardless of the target register and regardless of the size, although you seem to save a cycle or two going from a 64-byte load to 32 bytes or smaller
- setting the increment flags on a non-incrementing instruction does not cost you
- using a 2x repeat is the same as issuing the same instruction twice
- oddly, using a 2x repeat on a mul (2-cycle) is barely slower than a single mul. Yet issuing the mul twice (2 x mul) is significantly slower!
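The issue rules above can be turned into a rough cycle estimator. This is only a sketch under loud assumptions: pairing is greedy and in program order, a pair costs as much as its slower member, and repeats, loads and branches are not modelled at all. The `Instr` type and the function names are hypothetical and not part of any VideoCore tool.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    kind: str                 # "scalar" or "vector"
    cycles: int = 1           # 1 for most; 2 for vector mul, >> 8, sat, clamp (per the notes)
    independent: bool = True  # False if it needs the previous instruction's result


def can_pair(first, second):
    """Notes: scalar+scalar and vector+scalar can dual issue, vector+vector cannot.

    Assumption: the second instruction must not depend on the first.
    """
    both_vector = first.kind == "vector" and second.kind == "vector"
    return second.independent and not both_vector


def estimate_cycles(program):
    """Greedy in-order estimate: pair adjacent instructions whenever allowed."""
    total = 0
    i = 0
    while i < len(program):
        cost = program[i].cycles
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            # Assumption: a dual-issued pair costs as much as its slower member.
            cost = max(cost, program[i + 1].cycles)
            i += 2
        else:
            i += 1
        total += cost
    return total


if __name__ == "__main__":
    kernel = [Instr("vector", cycles=2), Instr("scalar"), Instr("vector"), Instr("vector")]
    print(estimate_cycles(kernel))
```

The useful takeaway from a model like this is mostly about scheduling: interleaving scalar bookkeeping (address arithmetic, loop counters) between vector instructions should be close to free, while runs of vector instructions serialise.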
From one of the patents:
 Each of the dual issue ALU 334 and 344 may comprise suitable logic, circuitry, code, and/or interfaces that may be operable to perform superscalar execution, to issue two integer operations, and to issue an integer operation and a floating-point operation concurrently. Integer operations may be able to execute in a single cycle and a forwarding path may be provided such that the result can be used by the following instruction without incurring any stalls. Complex integer operations may be pipelined over two cycles, for example. In such instances, a single pipeline stall may be inserted if the following instruction references the result. Floating-point operations may be able to execute over three clock cycles, for example. These operations may be pipelined such that a floating-point operation may be issued at each clock cycle. However, a pipeline stall may be inserted if either of the two following instructions references the result.
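The latencies in the patent text can be played with in a tiny stall counter. This is a sketch of the patent's description only, not of measured hardware: it assumes a single in-order issue slot (dual issue is ignored), result latencies of 1, 2 and 3 cycles for simple integer, complex integer and floating-point operations, and made-up register names. `count_stalls` and the tuple layout are hypothetical.

```python
# Assumed result latencies in cycles, taken from the patent's "for example" figures.
LATENCY = {"int": 1, "complex_int": 2, "float": 3}


def count_stalls(program):
    """Count stall cycles for a list of (op_class, dest, sources) tuples.

    Assumptions: one instruction issues per cycle, in order, and an
    instruction waits until every source register it reads is ready.
    """
    ready = {}       # register name -> first cycle its value can be consumed
    next_issue = 0   # earliest cycle the next instruction could issue
    stalls = 0
    for op_class, dest, sources in program:
        earliest = max([next_issue] + [ready.get(reg, 0) for reg in sources])
        stalls += earliest - next_issue
        ready[dest] = earliest + LATENCY[op_class]
        next_issue = earliest + 1
    return stalls


if __name__ == "__main__":
    # A floating-point result consumed two instructions later.
    program = [
        ("float", "r0", []),
        ("int", "r2", []),
        ("int", "r1", ["r0"]),
    ]
    print(count_stalls(program))
```

Under these assumptions a simple integer result can feed the next instruction with no stall, a complex integer result costs one stall if used immediately, and a floating-point result stalls either of the two following instructions that reads it, which matches the wording of the quote.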