Utility to patch binaries generated by the Intel C++ Compiler to get the maximum performance on AMD CPUs

Utility to patch binaries generated by the Intel C++ Compiler to get the maximum performance on AMD

The Intel C++ Compiler inserts a CPUID test into the binaries it generates to check whether they are
running on an Intel CPU, so the binaries don't run with full optimizations on non-Intel CPUs. This
utility patches those CPUID tests so that the binaries run on an AMD CPU as if it were an Intel CPU.
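
To make the idea of a CPUID vendor test concrete, here is a minimal sketch in C (not code taken from
ICC or from this utility). CPUID leaf 0 returns the vendor string in the registers EBX, EDX and ECX,
in that order. The function names `vendor_from_regs` and `is_genuine_intel` are made up for this
illustration.

```c
#include <stdint.h>
#include <string.h>

/* Unpack the three registers returned by CPUID leaf 0 into the
 * 12-character vendor string. The register order is EBX, EDX, ECX,
 * and each register holds four characters, lowest byte first. */
void vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
    out[12] = '\0';
}

/* Hypothetical vendor check: returns 1 only when the registers
 * spell "GenuineIntel", 0 for any other vendor string. */
int is_genuine_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    vendor_from_regs(ebx, edx, ecx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

With GCC or Clang on x86, the register values can be obtained with `__get_cpuid(0, &eax, &ebx, &ecx, &edx)`
from `<cpuid.h>`. An AMD CPU reports "AuthenticAMD", so a check like `is_genuine_intel` fails there
regardless of which instruction sets the CPU supports.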

**Tested on Linux with Intel C++ Compiler 10.x/11.x (it might work with future releases of ICC).
It may also work with the Fortran compiler if it has the same CPUID test, but this is not

*It seems that ICC 11.x no longer imposes a performance penalty when the compiled binaries run on
AMD. However, the CPUID tests are still present in those binaries, and this program can remove
them.*

*Some GNU libraries also contain CPUID tests, so a static binary that includes that code could be
affected. In the tests performed, however, their comparisons used a different instruction, so they
were left intact. Anyway, those tests are not evil like the Intel

 How to compile

You need the libelf library. On Ubuntu 8.04, just install the libelfg0-dev package. Any version
around 0.8.6 should work well. You can then compile with the command:



The source code tarball includes a file called benchmark-partial-sums.c (taken from
*The Computer Language Shootout*, http://shootout.alioth.debian.org). The Intel compiler can
optimize this code with SSE2.

Compile this code with:

	icc -O3 -xW -o benchmark-partial-sums benchmark-partial-sums.c

To run the benchmark use:

	time ./benchmark-partial-sums 100000000

These were the average results on my AMD64 CPU:

- GCC compiled executable --> 45.5s (compiled with -O3 -msse2)
- ICC original executable --> 31.5s (probably not taking the SSE2 optimized path in the binary)
- ICC patched executable  --> 25.5s
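
To put these timings in relative terms, the small sketch below computes the speedup and the share of
run time saved. The helper names `speedup` and `time_saved_percent` are made up for this
illustration; the inputs are the averages listed above.

```c
/* Speedup of a run that takes t_new seconds over one that takes t_old seconds. */
double speedup(double t_old, double t_new)
{
    return t_old / t_new;
}

/* Percentage of the original run time that the faster run saves. */
double time_saved_percent(double t_old, double t_new)
{
    return 100.0 * (t_old - t_new) / t_old;
}
```

With the numbers above, `speedup(31.5, 25.5)` is roughly 1.24 (patched ICC versus original ICC),
`speedup(45.5, 25.5)` is roughly 1.78 (patched ICC versus GCC), and `time_saved_percent(31.5, 25.5)`
is roughly 19.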

 How to patch a binary generated by Intel C++ Compiler

Just run:

	patch-AuthenticAMD <executable_name>

 How to patch the Intel C++ Compiler

The shared libraries used by the compiler are in /path/to/icc/lib. It seems that if you patch all of
them, the binaries generated by ICC will no longer contain the CPUID test, so they will run
perfectly on AMD. Probably only one of the shared libraries is responsible for adding the test.
However, I can't confirm any of this because I haven't tried it.

**Be warned that modifying, disassembling, or reverse engineering the Intel C++ Compiler goes
against the Intel EULA (End User License Agreement), so do this at your own risk.**

If you want to try, run this command in /path/to/icc/lib:

	for i in *; do patch-AuthenticAMD -ev "$i"; done

 Report results

This tool seems to work well, but it has not been tested much. Please send me an email with your
results. Questions, suggestions, or anything else are also welcome, including questions about the
code:


 The content of the doc directory

- libelf by Example.mht: http://people.freebsd.org/~jkoshy/download/libelf/article.html
	A tutorial for libelf on FreeBSD. Almost everything it says is also valid for Linux.
- naughty-intel.html: the author of this article explains everything one needs to know about
	the subject.

 How it works

Here is the disassembly of a binary compiled by ICC 10.1:

0000000000402c5c <__intel_cpu_indicator_init>:
						# Get CPU vendor string (EAX = 0)
  402c84:	48 33 c0             	xor    %rax,%rax
  402c87:	0f a2                	cpuid
  402c89:	89 45 f8             	mov    %eax,-0x8(%rbp)
  402c8c:	89 5d fc             	mov    %ebx,-0x4(%rbp)
  402c8f:	89 4d ec             	mov    %ecx,-0x14(%rbp)
  402c92:	89 55 f4             	mov    %edx,-0xc(%rbp)
  402c95:	48 c7 c0 01 00 00 00 	mov    $0x1,%rax
						# Get CPU capabilities (EAX = 1)
  402c9c:	0f a2                	cpuid
  402c9e:	89 45 f0             	mov    %eax,-0x10(%rbp)
  402ca1:	89 5d e0             	mov    %ebx,-0x20(%rbp)
  402ca4:	89 4d e8             	mov    %ecx,-0x18(%rbp)
  402ca7:	89 55 e4             	mov    %edx,-0x1c(%rbp)
  402cca:	8b 45 fc             	mov    -0x4(%rbp),%eax
						# Compare the first four bytes of your vendor string with "Genu"
  402ccd:	3d 47 65 6e 75       	cmp    $0x756e6547,%eax
  402cd2:	bb 01 00 00 00       	mov    $0x1,%ebx
  402cd7:	75 1b                	jne    402cf4 <__intel_cpu_indicator_init+0x98>
  402cd9:	8b 45 f4             	mov    -0xc(%rbp),%eax
						# Compare the next four bytes of your vendor string with "ineI"
  402cdc:	3d 69 6e 65 49       	cmp    $0x49656e69,%eax
  402ce1:	75 11                	jne    402cf4 <__intel_cpu_indicator_init+0x98>
  402ce3:	8b 45 ec             	mov    -0x14(%rbp),%eax
						# Compare the last four bytes of your vendor string with "ntel"
  402ce6:	3d 6e 74 65 6c       	cmp    $0x6c65746e,%eax
  402ceb:	75 07                	jne    402cf4 <__intel_cpu_indicator_init+0x98>
  402ced:	ba 01 00 00 00       	mov    $0x1,%edx
  402cf2:	eb 02                	jmp    402cf6 <__intel_cpu_indicator_init+0x9a>
  402cf4:	33 d2                	xor    %edx,%edx
						# If you have "GenuineIntel" everything goes OK. Later there are more tests
						# to check the capabilities of your CPU, and they are taken into account.
						# Here it loads into RAX the address of a global variable (_DYNAMIC+0x1d8)
						# where a value representing the capabilities of your CPU is stored.
						# This value also says if your CPU is non-Intel, which means that the
						# true capabilities of your CPU are not fully used (i.e. SSE).
  402d7e:	48 8b 05 a3 56 20 00 	mov    0x2056a3(%rip),%rax        # 608428 <_DYNAMIC+0x1d8>
						# In EBX the value of this global variable is ready to be copied to
						# memory. An INTEL CPU with SSE and SSE2 has EBX = 0x800. An AMD CPU
						# with SSE and SSE2 has EBX = 0x1 which means that the SSE and SSE2 
						# capabilities are not recognized.
  402d85:	89 18                	mov    %ebx,(%rax)

The patch-AuthenticAMD utility replaces those three CMP instructions with three other CMPs that look
for the vendor string AuthenticAMD. The libelf library is used to analyze the structure of the
ELF binary being patched, so that we can find the executable sections and make the replacements only
in those sections. This guarantees that what we replace is a machine instruction and not something
else. It is also possible to bypass libelf and make the replacements throughout the whole binary.
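
The replacement step can be sketched as follows. This is a minimal illustration, not the actual
source of patch-AuthenticAMD: it assumes the three comparisons always use the `cmp $imm32,%eax`
encoding (opcode 0x3d) seen in the disassembly above, and it scans a plain byte buffer, as in the
mode that bypasses libelf. The names `cmp_patch`, `PATCHES` and `patch_vendor_cmps` are made up for
this example.

```c
#include <stddef.h>
#include <string.h>

/* One "cmp $imm32,%eax" from the Intel check and its AMD counterpart.
 * Byte 0 is the opcode 0x3d; bytes 1..4 are four vendor string characters. */
struct cmp_patch {
    unsigned char from[5];
    unsigned char to[5];
};

static const struct cmp_patch PATCHES[3] = {
    { {0x3d, 0x47, 0x65, 0x6e, 0x75}, {0x3d, 0x41, 0x75, 0x74, 0x68} }, /* "Genu" -> "Auth" */
    { {0x3d, 0x69, 0x6e, 0x65, 0x49}, {0x3d, 0x65, 0x6e, 0x74, 0x69} }, /* "ineI" -> "enti" */
    { {0x3d, 0x6e, 0x74, 0x65, 0x6c}, {0x3d, 0x63, 0x41, 0x4d, 0x44} }, /* "ntel" -> "cAMD" */
};

/* Rewrite every matching CMP in buf in place and return how many were
 * replaced. The caller decides which bytes to pass in: the whole file,
 * or only the contents of one executable section. */
size_t patch_vendor_cmps(unsigned char *buf, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i + 5 <= len) {
        int matched = 0;
        for (size_t p = 0; p < 3; p++) {
            if (memcmp(buf + i, PATCHES[p].from, 5) == 0) {
                memcpy(buf + i, PATCHES[p].to, 5);
                count++;
                matched = 1;
                break;
            }
        }
        i += matched ? 5 : 1; /* skip past a patched instruction */
    }
    return count;
}
```

A raw scan like this can also hit data bytes that happen to match one of the patterns, which is the
reason for restricting the search to executable sections when libelf is used.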

The binaries generated with the Intel C++ Compiler usually have several execution branches. Some of
them are for maximum compatibility with x86 processors and others are for maximum speed with SSE
optimizations. With this utility, the executable will take the fastest path your CPU supports.
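
The idea of several execution branches selected at run time can be sketched like this. Everything
here is hypothetical: the function names, the plain C stand-in for an SSE2 path, and in particular
the rule "indicator >= 0x800 selects the fast path" are assumptions made for illustration. Only the
two indicator values, 0x800 and 0x1, come from the disassembly comments above; the real dispatch
logic in ICC binaries is not documented here.

```c
#include <stddef.h>

/* Indicator values quoted for the ICC 10.1 binary above. */
#define INDICATOR_SSE2    0x800u /* Intel CPU with SSE and SSE2 recognized */
#define INDICATOR_GENERIC 0x1u   /* vendor check failed, nothing recognized */

typedef double (*sum_fn)(const double *, size_t);

/* Compatibility branch: a straightforward loop. */
double sum_generic(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for an optimized branch. A real binary would use SSE2
 * instructions here; this version only unrolls the loop in plain C. */
double sum_fast_path(const double *v, size_t n)
{
    double a = 0.0, b = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n)
        a += v[i];
    return a + b;
}

/* Assumed dispatch rule: pick the fast branch only when the indicator
 * reports at least SSE2 support. */
sum_fn select_sum(unsigned indicator)
{
    return indicator >= INDICATOR_SSE2 ? sum_fast_path : sum_generic;
}
```

Under these assumptions, an unpatched binary on AMD would call `select_sum(0x1)` and get the
compatibility branch, while a patched one would see the same indicator as an Intel CPU with the same
capabilities and get the fast branch.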