Commits on Jun 14, 2011
Commits on May 27, 2011
  1. Fix ia64 speculative loads

    authored
    Speculative loads do not generate any faults, as intended. Instead
    you should check the valid bit of your speculative load.
    
    But speculative loads are great to schedule your loop. If you have
    another loop iteration, your load of the next iteration data was
    "good", if not, who cares. No need to check any bits.
    
    The problem is: Speculative loads do not generate _any_ faults
    (Wait, we had that...). So they also do not generate any valid
    page faults to signal the OS action is required because a page
    is swapped out or a mmaped file page needs to be fetched.
    
    So fetch the buffer end of this outer loop iteration at each
    outer loop start.
    
    And maybe we even have to be more aggressive about that...
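
A minimal C sketch of the idea in this commit, not the actual IA64 assembly: before each outer-loop iteration, one ordinary (non-speculative) load reads the last byte that iteration will consume, so a swapped-out or not-yet-fetched page is faulted in the normal way. The names `touch_byte`, `adler32_touch_ahead` and `CHUNK` are hypothetical, and the plain byte-wise inner loop only stands in for the speculatively scheduled one.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime below 2^16 */
#define CHUNK      4096u   /* hypothetical outer-loop step; small enough that s2 fits in 32 bits */

/* Ordinary load through a volatile pointer: it cannot be optimized away,
 * so a missing page raises a regular page fault here. */
static void touch_byte(const unsigned char *p)
{
    (void)*(const volatile unsigned char *)p;
}

uint32_t adler32_touch_ahead(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    while (len > 0) {
        size_t n = len < CHUNK ? len : CHUNK;

        /* Fetch the end of this outer iteration's data up front. */
        touch_byte(buf + n - 1);

        /* Stand-in for the inner loop that would use speculative loads. */
        for (size_t i = 0; i < n; i++) {
            s1 += buf[i];
            s2 += s1;
        }
        s1 %= ADLER_BASE;
        s2 %= ADLER_BASE;
        buf += n;
        len -= n;
    }
    return (s2 << 16) | s1;
}
```

Touching only the last byte matches the commit text; whether more bytes need touching is exactly the open question the commit message ends on.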
Commits on May 16, 2011
  1. fix IA64 adler32

    authored
    Endianness, i haz it...
    
    Time for a brown paper bag.
    
    While at it raise MIN_WORK, the startup cost is higher.
Commits on May 12, 2011
  1. fix PPC Altivec version

    authored
    The PIM says that after a vec_lde the rest of the vector is undefined.
    On a G4 you get the old content of the register.
    On a G5 vec_lde behaves like a simple load, the complete vector is
    filled with the data surrounding the element you want to load.
    
    So add code to load the one element we really want.
Commits on May 11, 2011
  1. fix v6 DSP condition

    authored
  2. fix typo

    authored
Commits on May 9, 2011
Commits on May 4, 2011
  1. add SPARC VIS version

    authored
    This is the SPARC VIS version of adler32.
    on a TI UltraSparc II
             -------- orig ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 9080 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 9590 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 9580 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 10600 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 12090 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 12100 ms
             -------- vec ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 5250 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 5250 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 5250 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 5250 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5250 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 5260 ms
             speedup: 1.729524
    seen 1.8 on an UltraSparc IIIi
    
    The code is not automatically enabled, you have to specify HAVE_VIS,
    for several reasons:
    
    - Do not use it with Niagara or other CPUs like it. They have a
      shared FPU (T1: 1 FPU for all up to 8 cores, T2: 1 FPU for 8
      threads)
      and to make matters worse the code does not seem to work there
      (binary which creates correct results on other SPARC creates wrong
      result on T1)
    - There is no clear preprocessor define which tells us if we have VIS.
      Often the tool chain even disguises a sparcv9 as a sparcv8
      (pre-UltraSPARC) to not confuse the userspace.
    - The code has a high startup cost, so do not use with NO_DIVIDE &&
      NO_SHIFT
    - The code only handles big endian
    
    We cannot easily provide a dynamic runtime switch. The CPU has make
    and model encoded in the Processor Status Word (PSW), but reading
    the PSW is a privileged instruction (same as PowerPC...)
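
The opt-in described above could be gated at compile time roughly as follows. This is only a sketch: `HAVE_VIS`, `NO_DIVIDE` and `NO_SHIFT` are the macros named in the commit message, `__sparc__` and `__BYTE_ORDER__` are predefined by GCC-compatible compilers, and `ADLER32_USE_VIS` and `adler32_variant` are hypothetical names. The Niagara exclusion cannot be expressed here, since it depends on the CPU the binary runs on.

```c
/* Use the VIS code only when the builder asked for it, the target is a
 * big-endian SPARC, and the slow-reduce configuration is not selected. */
#if defined(HAVE_VIS) && defined(__sparc__) && \
    defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__) && \
    !(defined(NO_DIVIDE) && defined(NO_SHIFT))
#  define ADLER32_USE_VIS 1
#else
#  define ADLER32_USE_VIS 0
#endif

/* Hypothetical helper that reports which variant this build selected. */
const char *adler32_variant(void)
{
    return ADLER32_USE_VIS ? "vis" : "generic";
}
```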
Commits on May 1, 2011
  1. finish ARMv6 DSP version, and disable it

    authored
    make inner loop more tight, with fewer stalls
    raise VNMAX
    
    still, no cookie...
    on an i.MX515@800MHz (armv7l) with ARMv6 DSP
             -------- orig ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 4040 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 3880 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 4070 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 4060 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4050 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 4060 ms
             -------- vec2 ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 4240 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 4250 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 3300 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 4240 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4240 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 4250 ms
             speedup: 0.952830
  2. finish ARM NEON version

    authored
    restrict to little endian
    fix order constant
    make inner loop more tight
    raise VNMAX
    
    This gives the following result on an i.MX515@800MHz with NEON (armv7l):
             -------- orig ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 4010 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 2990 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 4060 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 4050 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4060 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 4060 ms
             -------- vec ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 1450 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 1450 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 1460 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 1450 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1460 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 1450 ms
             speedup: 2.765517
  3. finish loongson version

    authored
    fix order constant, it was the wrong way round
    reduce VNMAX, we were overflowing vs1_r
    raise MIN_WORK, it takes some bytes to get to speed
    make end swizzle more swizzle, we want a complete swap
    
    This gives us:
             -------- orig ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 5180 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 5240 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 5310 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 5300 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5640 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 5590 ms
             -------- vec ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 2940 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 2940 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 2940 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 2940 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 2940 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 2940 ms
             speedup: 1.761905
  4. change mips loongson loop constants

    authored
    pmaddhw is signed, so to not overflow into the sign bit, we have
    to lower the most inner loop count
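
A worked bound for why a signed multiply-add forces a shorter inner loop. It rests on an assumption the commit does not spell out: that the inner loop sums unsigned bytes into 16-bit lanes which are then fed to the signed instruction. The helper name `max_bytes_per_lane` is hypothetical.

```c
#include <stdint.h>

/* How many bytes (each at most 255) can be added into a lane that starts
 * at zero before the lane could exceed lane_max. */
unsigned max_bytes_per_lane(unsigned lane_max)
{
    return lane_max / 255u;
}
```

Under that assumption an unsigned 16-bit lane would take `max_bytes_per_lane(UINT16_MAX)` = 257 bytes, while a lane that must stay out of the sign bit only takes `max_bytes_per_lane(INT16_MAX)` = 128, so the loop count roughly halves.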
Commits on Apr 30, 2011
  1. fix compilation with clang

    authored
    since clang has its own integrated assembler, it does not support
    .subsection
    http://llvm.org/bugs/show_bug.cgi?id=8717
    
    Thanks to Edwin Török for the patch!
Commits on Apr 24, 2011
  1. use reduce_x on function exit

    authored
    make local labels single digit
    add a dynamic dispatch version which can detect memory corruption
  2. add iWMMXt version

    authored
  3. fix reduce_full

    authored
    use reduce_x in function exit
    prepare NO_SHIFT reduce for non-power-of-2 MIN_WORK
Commits on Apr 8, 2011
  1. Mike Frysinger

    ia64 implementation

    authored vapier committed
    Thanks to Mike Frysinger i could torture some real IA64 HW (and myself too...).
    
    Throw away the first post of this patch, it's broken, in several ways.
    (Note to self: Don't try to be a cool kid and save on parentheses)
    And even if fixed, it's only half as fast as the generic code.
    Counting instructions is a bad move on IA64.
    
    So here it is, new and improved, with 400% more unrolling:
    
    on an IA64 (McKinley)
            -------- orig ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 1912 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 1916 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 1912 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 1916 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1916 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 1912 ms
            -------- vec ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 760 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 764 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 760 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 760 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 808 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 760 ms
            speedup: 2.515789
    
    next stop, blackfin, then working on the ARM iWMMXt version for XScale
    (N.B.: does someone have a link handy to the instruction reference?),
    and when some time has passed a complete repost, there are little
    changes here and there.
  2. Mike Frysinger

    blackfin implementation

    authored vapier committed
    What is it with me, others read a good book, i read processor handbooks...
    
    Only compile tested, i don't know if this works (and i have my doubts,
    this processor and the asm is ... funky...), but then it only took me
    5 hours.
    
    Maybe someone has a Blackfin to test it on.
    /me whistles an innocent tune and looks in the direction of Mike Frysinger
    
    ----------------- followup -----------------
    
    Already solved, see patch 1, this freed up r4 for devious use ;)
    
    Now i freed up r4, I was thinking about scheduling it a bit more, but
    i need 1 more reg ;(
    Maybe ... wait a sec., Ahhhh.
    See second patch.
    
    ----------------- followup -----------------
    
    And again a big thanks to Mike Frysinger for testing grounds.
    
    The first shot was not that broken, mainly the trailer handling.
    Since fudging everything in place to make trailer handling happen in
    vector mode takes lots of instructions, do the trailer handling
    sequentially, it's only 1-4 bytes.
    
    I'm a little bit unhappy with the performance, but ENOREGISTER.
    Every clever trick i can think of only makes it slower, because it
    needs more registers/accumulators, and even if stealing from the
    non-DRegs, it means more moves, and there go some cycles (move out of
    DRegs and at some point back again) and probably a stall every time.
    
    The numbers:
    On a Blackfin BF537@500MHz
             -------- orig ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 19200 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 19200 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 19200 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 19190 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 24150 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 24160 ms
             -------- vec ------
                    a: 0x0CB4B676, 10000 * 160000 bytes     t: 10690 ms
                    a: 0x25BEB273, 10000 * 159999 bytes     t: 10690 ms
                    a: 0x733CB174, 10000 * 159998 bytes     t: 10700 ms
                    a: 0x1144AF76, 10000 * 159996 bytes     t: 10680 ms
                    a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 10690 ms
                    a: 0x1902A382, 10000 * 159984 bytes     t: 10670 ms
             speedup: 1.796071
    
    Putting vorder_o/e into different L1 brings that to 1.84, but since
    that eats a precious resource for little gain, i left it out.
  3. Mike Frysinger

    mips implementation

    authored vapier committed
    And here a version for >= loongson2f and the DSP-ASE.
    
    Since i only had the cheat-sheet^W^W quick-reference-card for the
    DSP-ASE and the ST Micro sheet i read for the loongson isn't very
    detailed, this code really needs some testing and probably fixes.
  4. Mike Frysinger

    alpha MAX implementation

    authored vapier committed
    Alpha also had a little SIMD extension called MAX, available in the
    21164pc aka pca56 and from the 21264 aka ev6 onwards.
    We are missing a nice packed mul, but we have a pixel error
    instruction (sum of absolute difference), and byte-halfword unpack
    instructions so we can build something.
    
    Only compile tested, and needs gcc due to inline asm/compiler builtins.
    
    ----------------- followup -----------------
    
    Thanks to Mike Frysinger, i could test the code on an Alpha.
    Result:
    I had a silly typo, one should decrement the right loop counter...
    Also the start cost is a little higher than expected, so raise it.
    Fixed with the attached patch
    
    And for those who like pretty numbers:
    
    on an Alpha EV68 when no Byte-Word eXtension (BWX) is used
           -------- orig ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 5008.384 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 4202.496 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 4769.792 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 4804.608 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5287.936 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 5074.944 ms
            -------- risc2 ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 3879.936 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 3879.936 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 3878.912 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 3880.960 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 3876.864 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 3879.936 ms
            speedup: 1.290842
    
    on an Alpha EV68 with MAX used (which means BWX is used for the original code)
            -------- orig ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 3646 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 3294 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 3335 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 3400 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 3351 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 3369 ms
            -------- MAX ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 1953 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 1948 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 1963 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 1962 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1958 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 1964 ms
            speedup: 1.866871
  5. Mike Frysinger

    x86 implementation

    authored vapier committed
    And finally an x86 version of Adler32.
    
    it covers:
    * Plain
    * MMX
    * SSE
    * SSE2
    * SSSE3
    * 32 & 64 Bit
    * PIC and non PIC
    
    And features a runtime cpu detection + dispatch.
    
    It heavily uses inline ASM, so it's restricted to GCC (or compatible
    compiler, like clang).
    This gives us help from the compiler for boilerplate code, all the
    fiddling with calling conventions, PIC and bit-ness of the code.
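
A minimal sketch of runtime CPU detection plus dispatch, assuming a GCC-compatible compiler. It uses the `__builtin_cpu_supports` builtin as a stand-in for whatever detection the patch actually performs, and every function name here (`adler32_plain`, `adler32_ssse3_stub`, `adler32_resolve`, `adler32_dispatch`) is hypothetical. The "SSSE3" body is only a placeholder that calls the plain code.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*adler32_fn)(uint32_t, const unsigned char *, size_t);

/* Portable byte-wise reference implementation. */
static uint32_t adler32_plain(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu, s2 = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % 65521u;
        s2 = (s2 + s1) % 65521u;
    }
    return (s2 << 16) | s1;
}

/* Placeholder: a real build would put the SSSE3 inline asm here. */
static uint32_t adler32_ssse3_stub(uint32_t adler, const unsigned char *buf, size_t len)
{
    return adler32_plain(adler, buf, len);
}

static uint32_t adler32_resolve(uint32_t, const unsigned char *, size_t);

/* Starts at the resolver; the first call swaps in the chosen variant. */
static adler32_fn adler32_impl = adler32_resolve;

static uint32_t adler32_resolve(uint32_t adler, const unsigned char *buf, size_t len)
{
    adler32_fn chosen = adler32_plain;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3"))
        chosen = adler32_ssse3_stub;
#endif
    adler32_impl = chosen;
    return chosen(adler, buf, len);
}

uint32_t adler32_dispatch(uint32_t adler, const unsigned char *buf, size_t len)
{
    return adler32_impl(adler, buf, len);
}
```

The pointer swap means detection runs once and later calls pay only an indirect call. The unsynchronized write to `adler32_impl` is a simplification; a threaded program would want an atomic store or a one-time initializer.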
    
    Here are some numbers:
    An old AMD Athlon64 X2 whose SSE unit is only 64 bits wide
           -------- orig ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 12100 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 12100 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 12400 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 12700 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 12600 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 12600 ms
           -------- MMX ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 6700 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 6800 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 6900 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 6800 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 6900 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 6900 ms
           speedup: 1.805970
           -------- SSE ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 6800 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 6800 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 6800 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 6900 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 6800 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 6900 ms
           speedup: 1.779412
           -------- SSE2 ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 5600 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 5700 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 5600 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 5700 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5600 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 5700 ms
           speedup: 2.160714
    
    An Intel Core2
           -------- orig ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 15200 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 14900 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 14900 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 15100 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 14800 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 14900 ms
           -------- MMX ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 5500 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 5400 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 5500 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 5400 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5400 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 5500 ms
           speedup: 2.763636
           -------- SSE ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 5400 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 5300 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 5400 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 5400 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5400 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 5300 ms
           speedup: 2.814815
           -------- SSE2 ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 3400 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 3300 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 3400 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 3300 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 3400 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 3300 ms
           speedup: 4.470588
           -------- SSSE3 ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 2800 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 2900 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 2800 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 2800 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 2800 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 2900 ms
           speedup: 5.428571
    
    An AMD Sempron 140 (K10 Architecture)
           -------- orig ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 7500 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 7500 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 7400 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 7400 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 7800 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 7800 ms
           -------- MMX ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 4600 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 4600 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 4600 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 4600 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4600 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 4600 ms
           speedup: 1.630435
           -------- SSE ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 4100 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 4100 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 4100 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 4200 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4100 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 4100 ms
           speedup: 1.829268
           -------- SSE2 ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 1800 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 1800 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 1800 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 1700 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1800 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 1800 ms
           speedup: 4.166667
    
    An Intel P4 based Xeon (Nocona, in 64 bit mode)
           -------- orig ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 21800 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 20900 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 21000 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 21000 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 21900 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 20900 ms
           -------- SSE2 ------
                  a: 0x0CB4B676, 10000 * 160000 bytes     t: 5900 ms
                  a: 0x25BEB273, 10000 * 159999 bytes     t: 5300 ms
                  a: 0x733CB174, 10000 * 159998 bytes     t: 4900 ms
                  a: 0x1144AF76, 10000 * 159996 bytes     t: 5300 ms
                  a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 5600 ms
                  a: 0x1902A382, 10000 * 159984 bytes     t: 5400 ms
           speedup: 3.694915
  6. Mike Frysinger

    arm implementation

    authored vapier committed
    This adds a NEON version and a version for ARMv6 DSP instructions of Adler32.
    
    The NEON version is coded with intrinsics, but there are some details
    left with the endianness. If i understand it right it should work with
    any ARM compiler which understands NEON.
    The ARMv6 DSP version uses inline ASM, so i restricted it to GCC or a
    compatible compiler like clang. Clang should find the way in by itself,
    it disguises itself as GCC 4.2.
    
    Unfortunately i don't have any ARM, so the code is untested, and there
    are some TODOs left.
    Also some benchmarking would be needed.
  7. Mike Frysinger

    ppc altivec implementation

    authored vapier committed
    This adds a PPC Altivec version of Adler32 to zlib.
    
    It is coded with intrinsics as in the Altivec PIM, so it should work with
    any PPC compiler which knows altivec.
    But since i have not tested that, i restrict it for the moment to GCC.
    
    This is not only interesting for PowerPC G4 & G5, but also the IBM POWER6
    and POWER7 which again feature an Altivec unit, only now it's called
    VMX (POWER7 adds VSX on top of it).
    
    Here are some numbers from a PPC G4 Powerbook:
    -------- orig ------
         a: 0x0CB4B676, 10000 * 160000 bytes   t: 22200 ms
         a: 0x25BEB273, 10000 * 159999 bytes   t: 20000 ms
         a: 0x733CB174, 10000 * 159998 bytes   t: 20800 ms
         a: 0x1144AF76, 10000 * 159996 bytes   t: 21200 ms
         a: 0x3F4ECB8A, 10000 * 159992 bytes  t: 21200 ms
         a: 0x1902A382, 10000 * 159984 bytes  t: 21100 ms
    -------- altivec ------
         a: 0x0CB4B676, 10000 * 160000 bytes  t: 3400 ms
         a: 0x25BEB273, 10000 * 159999 bytes  t: 3400 ms
         a: 0x733CB174, 10000 * 159998 bytes  t: 3300 ms
         a: 0x1144AF76, 10000 * 159996 bytes  t: 3400 ms
         a: 0x3F4ECB8A, 10000 * 159992 bytes  t: 3400 ms
         a: 0x1902A382, 10000 * 159984 bytes  t: 3300 ms
    speedup: 6.52941
    
    The raw engine speed is insane.
    Unfortunately the FSB/Memory can't keep up:
    -------- orig ------
         a: 0x01A71FA6, 100 * 16000000 bytes  t: 47600 ms
         a: 0x2DEB1BA3, 100 * 15999999 bytes  t: 46100 ms
         a: 0x12481AA4, 100 * 15999998 bytes  t: 47500 ms
         a: 0xDDF018A6, 100 * 15999996 bytes  t: 47600 ms
         a: 0xC43634BA, 100 * 15999992 bytes  t: 47500 ms
         a: 0xF7D70CB2, 100 * 15999984 bytes  t: 47500 ms
    -------- altivec ------
         a: 0x01A71FA6, 100 * 16000000 bytes  t: 36000 ms
         a: 0x2DEB1BA3, 100 * 15999999 bytes  t: 36000 ms
         a: 0x12481AA4, 100 * 15999998 bytes  t: 36000 ms
         a: 0xDDF018A6, 100 * 15999996 bytes  t: 36000 ms
         a: 0xC43634BA, 100 * 15999992 bytes  t: 35900 ms
         a: 0xF7D70CB2, 100 * 15999984 bytes  t: 36000 ms
    speedup: 1.3222
    
    Still we can squeeze some cycles from it.
  8. Mike Frysinger

    Prepare Adler32

    authored vapier committed
    This is a patch to prepare adler32.c for the things to come.
    
    * add another variant of modulus function for Archs without divide
      (or wide mul).
    * rename MOD & MOD4 to reduce_full & reduce_x
    * add a "simpler" reduce
    * split the adler32 function into sub functions, now we can hook in
      other functions for the large size adler32
    
    * add a 64-Bit pseudo SIMD version
    This code is for all the mips64, powerpc64 (without altivec), sparc64
    and other 64-bit processors.
    But i would like to dedicate this code to Alpha, whose early versions
    do not have instructions for byte-wise memory access.
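
One way a modulus for machines without divide can look, as a hedged sketch rather than the patch's actual `reduce_full`/`reduce_x`: since 65521 = 2^16 - 15, the high half of a value can be folded back in as `high * 15`, and `* 15` can be written as a shift and a subtract. The name `reduce_no_divide` is hypothetical.

```c
#include <stdint.h>

#define ADLER_BASE 65521u  /* 2^16 - 15, so 2^16 is congruent to 15 mod BASE */

/* Full reduction of a 32-bit value modulo 65521 using only shifts,
 * adds and one compare. */
uint32_t reduce_no_divide(uint32_t x)
{
    /* First fold: result is below 2^20. */
    x = (x & 0xffffu) + ((x >> 16) << 4) - (x >> 16);
    /* Second fold: high part is at most 15, so result is at most 65760. */
    x = (x & 0xffffu) + ((x >> 16) << 4) - (x >> 16);
    /* At most one subtraction is left. */
    if (x >= ADLER_BASE)
        x -= ADLER_BASE;
    return x;
}
```

Stopping after the first fold would give a cheaper partial reduce that keeps the sums small without bringing them fully below the modulus.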
    Some results:
    
    Intel Core-i5-750
           -------- orig ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 9000 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 9100 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 9400 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 9300 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 9200 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 9800 ms
            -------- risc2 ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 4000 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 4000 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 4300 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 4200 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4300 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 4200 ms
            speedup: 2.250000
    
    Intel Xeon, Nocona
            -------- orig ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 20000 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 19100 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 21300 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 25100 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 19100 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 20000 ms
            -------- risc2 ------
                   a: 0x0CB4B676, 10000 * 160000 bytes     t: 9900 ms
                   a: 0x25BEB273, 10000 * 159999 bytes     t: 9900 ms
                   a: 0x733CB174, 10000 * 159998 bytes     t: 10600 ms
                   a: 0x1144AF76, 10000 * 159996 bytes     t: 9700 ms
                   a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 10500 ms
                   a: 0x1902A382, 10000 * 159984 bytes     t: 11000 ms
            speedup: 2.020202
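
A sketch of what a 64-bit pseudo-SIMD Adler32 can look like. This is my own illustration of the general technique, not the code from this commit. Eight bytes are split into two words of four 16-bit lanes each, per-lane sums and sums of sums are accumulated, and the lanes are combined once per block. `load_le64`, `adler32_swar64`, `LANE_MASK` and `WORDS_PER_BLOCK` are hypothetical names. The byte-wise load keeps the sketch endian-neutral; real code would use a native 64-bit load and adjust the lane weights for byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE      65521u
#define LANE_MASK       0x00ff00ff00ff00ffULL
#define WORDS_PER_BLOCK 16u  /* 255 * 16 * 17 / 2 = 34680 keeps each 16-bit lane below 65536 */

/* Assemble 8 bytes little-endian so lane l always holds a known byte. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | p[i];
    return w;
}

uint32_t adler32_swar64(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    while (len >= 8) {
        size_t words = len / 8;
        if (words > WORDS_PER_BLOCK)
            words = WORDS_PER_BLOCK;

        uint64_t a_even = 0, a_odd = 0;  /* per-lane byte sums */
        uint64_t b_even = 0, b_odd = 0;  /* per-lane sums of the sums */
        for (size_t k = 0; k < words; k++) {
            uint64_t w = load_le64(buf + 8 * k);
            a_even += w & LANE_MASK;         /* bytes 0, 2, 4, 6 */
            a_odd  += (w >> 8) & LANE_MASK;  /* bytes 1, 3, 5, 7 */
            b_even += a_even;
            b_odd  += a_odd;
        }

        /* A byte at position j of word k weighs 8*(words-k) - j in s2. */
        uint32_t sum = 0, weighted = 0;
        for (unsigned l = 0; l < 4; l++) {
            uint32_t ae = (uint32_t)(a_even >> (16 * l)) & 0xffffu;
            uint32_t ao = (uint32_t)(a_odd  >> (16 * l)) & 0xffffu;
            uint32_t be = (uint32_t)(b_even >> (16 * l)) & 0xffffu;
            uint32_t bo = (uint32_t)(b_odd  >> (16 * l)) & 0xffffu;
            sum      += ae + ao;
            weighted += 8u * (be + bo) - (2u * l) * ae - (2u * l + 1u) * ao;
        }
        s2 = (s2 + (uint32_t)(8 * words) * s1 + weighted) % ADLER_BASE;
        s1 = (s1 + sum) % ADLER_BASE;
        buf += 8 * words;
        len -= 8 * words;
    }

    /* Trailer of at most 7 bytes, handled byte by byte. */
    while (len > 0) {
        s1 += *buf++;
        s2 += s1;
        len--;
    }
    return ((s2 % ADLER_BASE) << 16) | (s1 % ADLER_BASE);
}
```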
  9. Mike Frysinger

    gitignore

    vapier authored
Commits on Apr 7, 2011
  1. Mike Frysinger

    zlib-1.2.5

    vapier authored