patch-AuthenticAMD
====================

Utility to patch binaries generated by the Intel C++ Compiler so that they get maximum performance on AMD CPUs.

The Intel C++ Compiler adds a CPUID test to the binaries it generates. The test checks whether the binary is running on an Intel CPU, so the binaries do not run with full optimizations on non-Intel CPUs. This utility patches those CPUID tests so that the binaries run on an AMD CPU as if it were an Intel CPU.

**Tested on Linux with Intel C++ Compiler 10.x/11.x (it might work with future releases of ICC). It may also work with the Fortran compiler if it has the same CPUID test, but this is not confirmed.**

*It seems that ICC 11.x no longer imposes a performance penalty when the compiled binaries run on AMD. However, the CPUID tests are still present in those binaries, and this program can remove them.*

*Some GNU libraries also contain CPUID tests, so a static binary that includes that code could be affected. In the tests performed, however, those comparisons used a different instruction, so they were left intact. In any case, those tests are not evil like the Intel ones.*

How to compile
----------------

You need the libelf library. On Ubuntu 8.04, just install the package libelfg0-dev. A version around 0.8.6 should work well.

Now you can compile with the command:

    make

Benchmark
-----------

The source code tarball includes a file called benchmark-partial-sums.c (taken from *The Computer Language Shootout*, http://shootout.alioth.debian.org). The Intel compiler can optimize this code with SSE2.
Compile this code with:

    icc -O3 -xW -o benchmark-partial-sums benchmark-partial-sums.c

To run the benchmark use:

    time ./benchmark-partial-sums 100000000

These were the average results on my AMD64 CPU:

- GCC compiled executable --> 45.5s (compiled with -O3 -msse2)
- ICC original executable --> 31.5s (probably not taking the SSE2 optimized path in the binary)
- ICC patched executable --> 25.5s

How to patch a binary generated by Intel C++ Compiler
-------------------------------------------------------

Just run:

    patch-AuthenticAMD <executable_name>

How to patch the Intel C++ Compiler
-------------------------------------

The shared libraries used by the compiler are in /path/to/icc/lib. It seems that if you patch all of them, the binaries generated by ICC will no longer contain the CPUID test, so they run perfectly on AMD. Probably only one of the shared libraries is responsible for adding the test. I can't confirm this, because I didn't try it.

**Be warned that modifying, disassembling or reverse engineering the Intel C++ Compiler goes against the Intel EULA (End User License Agreement). Do it at your own risk.**

If you want to try, run this command in /path/to/icc/lib:

    for i in *; do patch-AuthenticAMD -ev "$i"; done

Report results
----------------

This tool seems to work well, but it has not been tested much. Please send me an email with your results. Questions about the code, suggestions, or anything else are also welcome: firstname.lastname@example.org

The content of the doc directory
------------------------------------

- libelf by Example.mht: http://people.freebsd.org/~jkoshy/download/libelf/article.html, a tutorial for libelf on FreeBSD. Almost everything it says is valid for Linux.
- naughty-intel.html: the author of this article explains everything one needs to know about the subject.


How it works
--------------

Here is a disassembled binary compiled by ICC 10.1:

    0000000000402c5c <__intel_cpu_indicator_init>:
    ...
    # Get CPU vendor string (EAX = 0)
      402c84:   48 33 c0                xor    %rax,%rax
      402c87:   0f a2                   cpuid
      402c89:   89 45 f8                mov    %eax,-0x8(%rbp)
      402c8c:   89 5d fc                mov    %ebx,-0x4(%rbp)
      402c8f:   89 4d ec                mov    %ecx,-0x14(%rbp)
      402c92:   89 55 f4                mov    %edx,-0xc(%rbp)
      402c95:   48 c7 c0 01 00 00 00    mov    $0x1,%rax
    # Get CPU capabilities (EAX = 1)
      402c9c:   0f a2                   cpuid
      402c9e:   89 45 f0                mov    %eax,-0x10(%rbp)
      402ca1:   89 5d e0                mov    %ebx,-0x20(%rbp)
      402ca4:   89 4d e8                mov    %ecx,-0x18(%rbp)
      402ca7:   89 55 e4                mov    %edx,-0x1c(%rbp)
    ...
      402cca:   8b 45 fc                mov    -0x4(%rbp),%eax
    # Compare the first four bytes of your vendor string with "Genu"
      402ccd:   3d 47 65 6e 75          cmp    $0x756e6547,%eax
      402cd2:   bb 01 00 00 00          mov    $0x1,%ebx
      402cd7:   75 1b                   jne    402cf4 <__intel_cpu_indicator_init+0x98>
      402cd9:   8b 45 f4                mov    -0xc(%rbp),%eax
    # Compare the next four bytes of your vendor string with "ineI"
      402cdc:   3d 69 6e 65 49          cmp    $0x49656e69,%eax
      402ce1:   75 11                   jne    402cf4 <__intel_cpu_indicator_init+0x98>
      402ce3:   8b 45 ec                mov    -0x14(%rbp),%eax
    # Compare the last four bytes of your vendor string with "ntel"
      402ce6:   3d 6e 74 65 6c          cmp    $0x6c65746e,%eax
      402ceb:   75 07                   jne    402cf4 <__intel_cpu_indicator_init+0x98>
      402ced:   ba 01 00 00 00          mov    $0x1,%edx
      402cf2:   eb 02                   jmp    402cf6 <__intel_cpu_indicator_init+0x9a>
      402cf4:   33 d2                   xor    %edx,%edx
    # If your CPU reports "GenuineIntel", everything goes OK. Later there are
    # more tests for the capabilities of your CPU, and they are taken into
    # account.
    ...
    # Here it loads in RAX the address of a global variable (_DYNAMIC+0x1d8)
    # where a value representing the capabilities of your CPU is stored.
    # This value also says whether your CPU is non-Intel, which means that
    # the true capabilities of your CPU (e.g. SSE) are not fully used.
      402d7e:   48 8b 05 a3 56 20 00    mov    0x2056a3(%rip),%rax        # 608428 <_DYNAMIC+0x1d8>
    # In EBX the value of this global variable is ready to be copied to
    # memory. An Intel CPU with SSE and SSE2 has EBX = 0x800. An AMD CPU
    # with SSE and SSE2 has EBX = 0x1, which means that the SSE and SSE2
    # capabilities are not recognized.
      402d85:   89 18                   mov    %ebx,(%rax)
    ...

The patch-AuthenticAMD utility replaces those three CMP instructions with three other CMPs that look for the vendor string AuthenticAMD.

The libelf library is used to analyze the structure of the ELF binary to be patched, so that we can find the executable sections and make the replacements only in those sections. That way we can guarantee that what we replace is a machine instruction and not something else. It is also possible to bypass libelf and make replacements in the whole binary.

The binaries generated with the Intel C++ Compiler usually have several execution branches: some are for maximum compatibility with x86 processors and others are for maximum speed with SSE optimizations. With this utility, the executable will take the fastest path your CPU supports.
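
The replacement of the three CMP instructions can be illustrated with a short C sketch. This is not the code from patch-AuthenticAMD.c: the function name `patch_vendor_cmps` and the `cmp_patch` table are hypothetical, and the sketch assumes the comparisons always use the 5-byte form seen in the disassembly above (opcode `3d` followed by four ASCII bytes). It scans a buffer that is assumed to hold the bytes of one executable section and rewrites "Genu", "ineI" and "ntel" to "Auth", "enti" and "cAMD", which is how CPUID splits the AuthenticAMD vendor string across EBX, EDX and ECX.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One search/replace pair: opcode 0x3d (cmp $imm32,%eax) plus 4 ASCII bytes.
 * Hypothetical helper type, not taken from patch-AuthenticAMD.c. */
struct cmp_patch {
    unsigned char from[5];
    unsigned char to[5];
};

static const struct cmp_patch patches[] = {
    { {0x3d, 'G', 'e', 'n', 'u'}, {0x3d, 'A', 'u', 't', 'h'} },  /* EBX part */
    { {0x3d, 'i', 'n', 'e', 'I'}, {0x3d, 'e', 'n', 't', 'i'} },  /* EDX part */
    { {0x3d, 'n', 't', 'e', 'l'}, {0x3d, 'c', 'A', 'M', 'D'} },  /* ECX part */
};

/* Rewrite every matching CMP in buf[0..len) in place.
 * Returns the number of instructions that were replaced.
 * This is a plain byte scan, not an instruction decoder, so it should
 * only be given bytes that are known to be machine code. */
size_t patch_vendor_cmps(unsigned char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL);
    for (size_t i = 0; i + 5 <= len; i++) {
        for (size_t p = 0; p < sizeof patches / sizeof patches[0]; p++) {
            if (memcmp(buf + i, patches[p].from, 5) == 0) {
                memcpy(buf + i, patches[p].to, 5);
                count++;
                i += 4;  /* skip the rest of the patched instruction */
                break;
            }
        }
    }
    return count;
}
```

Calling `patch_vendor_cmps` on the bytes from address 402cca to 402cec in the listing above would report three replacements, and a second call on the same buffer would report none. Because the scan matches raw bytes, restricting it to executable sections (as the utility does with libelf) reduces the chance of rewriting data that merely looks like one of these instructions.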