On my machine, I don't see a significant improvement except for arrays of 64 bytes or fewer.
On further testing (I increased the number of arrays from 128 to 256), the results are also sensitive to the number of distinct arrays being copied. The main factors in whether this provides a significant speedup are:
1. Your machine's block size (64 bytes for my machine)
2. The L1 cache size for each processor (256 KB / 4 processors = 64 KB for me)
Once the arrays being copied exceed my machine's block size, I don't see a significant improvement.
Eventually, once the total set of memory I'm copying from/to exceeds the L1 cache size, I start to see slowdowns (number of arrays * size of array > 64 KB).
The slowdowns become significant once the individual arrays being copied exceed the size of the L1 cache.
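
To make the thresholds above concrete, here is a rough Python sketch of the arithmetic. It is only an illustration of the rule of thumb I observed, not a model of the hardware: the function name `copy_regime` and its parameters are made up for this example, the defaults (64-byte block, 64 KB of L1) are just my machine's numbers, and the working set is estimated as number of arrays * size of array, as above.

```python
def copy_regime(array_bytes, num_arrays, block_bytes=64, l1_bytes=64 * 1024):
    """Classify a copy workload using the rule of thumb described above.

    Hypothetical helper: the defaults are the figures from my machine,
    and the returned labels only summarize what I observed.
    """
    # Total memory touched, estimated as number of arrays * size of array.
    working_set = array_bytes * num_arrays

    if array_bytes > l1_bytes:
        # A single array no longer fits in L1.
        return "significant slowdown"
    if working_set > l1_bytes:
        # Each array fits in L1, but all of them together do not.
        return "slowdown"
    if array_bytes > block_bytes:
        # Fits in L1, but each array spans more than one block.
        return "no significant improvement"
    # Each array fits within a single block.
    return "improvement"


if __name__ == "__main__":
    for size, count in [(64, 256), (128, 256), (512, 256), (128 * 1024, 1)]:
        print(size, count, copy_regime(size, count))
```

Plug in your own block size and L1 size to get a first guess at where your machine's thresholds fall, then confirm with an actual benchmark.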