Avx2optimizations #122

rrwinterton · 2018-02-01T20:30:52Z

Added AVX2 Optimizations

Add files via upload

bjacob · 2018-02-02T15:56:30Z

This looks great overall. Some comments:

Can you remove the Makefile.rrw at top-level? If you have an interest in having that checked-in, we can discuss what sub-directory would make sense... I guess contrib/ .

Can you edit test/test_fixedpoint.cc, at the bottom, to ensure that your new fixedpoint_avx.h code is exercised? See the bottom of this file, with the paths for GEMMLOWP_SSE, GEMMLOWP_NEON etc. You want to call RunTests<__m256i>, or whatever types you specialized the fixedpoint templates for. This is new code, by the way, which explains why you haven't seen it. Another new development here is the support for 16-bit fixed-point arithmetic, which used to be 32-bit only. No need to add 16-bit fixed-point in this pull request, that can wait, but FYI we are seriously thinking about relying on it for LSTM models.
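
A minimal sketch of what such a per-type test registration could look like. It is not the actual contents of test/test_fixedpoint.cc: `RunTests` here is a hypothetical stand-in that only records which types were exercised, `TestedTypes` and `RunAllTests` are invented for illustration, and the plain `__AVX2__` guard is an assumption about how the AVX2 path would be selected.

```cpp
#include <cstdint>
#include <string>
#include <vector>
#ifdef __AVX2__
#include <immintrin.h>  // provides __m256i when AVX2 is enabled
#endif

// Hypothetical registry, used here only to observe which types were tested.
std::vector<std::string>& TestedTypes() {
  static std::vector<std::string> types;
  return types;
}

// Hypothetical stand-in for the real test driver template: a real version
// would run the fixed-point checks for type T instead of recording a name.
template <typename T>
void RunTests(const char* type_name) {
  TestedTypes().push_back(type_name);
}

// One call per raw type that the fixed-point templates are specialized for.
void RunAllTests() {
  RunTests<std::int32_t>("int32");
#ifdef __AVX2__
  RunTests<__m256i>("__m256i");  // only compiled when AVX2 is available
#endif
}
```

The point of the pattern is that a SIMD specialization which is never named in the test file is never instantiated, so it goes untested even though it compiles.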

rrwinterton · 2018-02-02T16:27:36Z

Yes, I can remove the Makefile.rrw at the top level. I just wanted to make it easy to compare SSE and AVX2 in the builds using the old benchmarks. Yes, I will test the fixed-point code; it currently isn't being called in this drop, but it has been tested. We were thinking about promoting the output to AVX2 as well in a new pull request, but will clean up the fixed-point code first. We will also do a 16-bit fixed-point if you like and do another pull request. Thanks for the comments.

Removing simple example of Makefile for avx2

bjacob · 2018-02-07T15:41:48Z

(The below is FYI, no action required here)

Side note: there is a difficult issue about ODR violations that we will need to resolve as soon as possible. I don't want it to block this pull request, and I'd rather handle it myself, but FYI I will have no choice but to resolve it before we can pull this into Google's production copies of gemmlowp.

The issue is that as we #ifdef code based on microarchitecture (AVX vs SSE, etc), since this is all inline code (in headers), if a user links together two object files containing the same gemmlowp symbols, one with AVX and the other without, then this is effectively a violation of the one-definition-rule (ODR).
https://en.wikipedia.org/wiki/One_Definition_Rule
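
To make the failure mode concrete, here is a minimal sketch of the pattern being described. The function `KernelPathName` and the idea that it lives in a shared header are hypothetical, chosen only for illustration; this is not gemmlowp code.

```cpp
#include <string>

// Imagine this function lives in a header included by several .cc files.
// Its body depends on the compiler flags of whichever file includes it.
inline std::string KernelPathName() {
#ifdef __AVX2__
  return "avx2";      // body seen by files compiled with AVX2 enabled
#else
  return "baseline";  // body seen by files compiled without AVX2
#endif
}
```

If one object file is built with AVX2 enabled and another without, both contain a definition of the same symbol with different bodies. The linker is allowed to keep only one of them, so a caller built for the baseline path can end up running the AVX2 body, or the other way around.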

This is a very hard problem that we've hit in all header-only libraries with micro-architecture-dependent paths. We already hit it with gemmlowp with scalar vs. SSE4 paths. What made it not matter in the end, so far, was that we already have entirely switched to requiring at least SSE4. Likewise on ARM, we already require NEON.

With AVX2, we won't be able to require it everywhere, so we will have AVX2 and SSE4 gemmlowp symbols coexisting --- and again, under the /same/ symbol name, whence the ODR issues.

There have been countless discussions internally about how to handle this problem (it's a lot more general than gemmlowp --- the same issue in Eigen has been a plague).

Ultimately, it's a mistake in gemmlowp's design that the same symbols may have different implementations. Ideally, different micro-architecture code-paths should correspond to different symbols. That would mean that it becomes the responsibility of the application (non-inline) code instantiating gemmlowp inline code, to choose between micro-architecture code paths. The critical point at which this needs to happen, is where a non-inline function calls gemmlowp inline functions.

It's too late to fix this for existing micro-architecture paths (SSE4, NEON) without regressing existing users, and again the issue so far has been mild because SSE4 and NEON are near-ubiquitous in their respective architectures.

But for AVX2, we need a more proper solution.

I am thinking of allowing the user to #define GEMMLOWP_ENABLE_AVX2, which will result in a different namespace being created by gemmlowp: instead of "namespace gemmlowp" it will be "namespace gemmlowp_AVX2". User code will need to adjust accordingly, explicitly.

In the past a variant of this technique has often been employed: in order to make code automatically benefit from the fast path, people did: #define gemmlowp gemmlowp_FASTPATH (where FASTPATH might be AVX2 here). This tended to fix the actual ODR violation in practice, but is brittle: it would allow a user to inadvertently do this in their own inline code calling gemmlowp, only moving the problem down the inline call chain. That's why we currently think it better to require explicit choice of code path by the gemmlowp caller.
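
A minimal sketch of the separate-namespace idea, assuming a header that picks its namespace name from the opt-in macro. The names `sketch_gemm`, `sketch_gemm_AVX2`, `SKETCH_NAMESPACE`, `RegisterBits` and `CallerRegisterBits` are all hypothetical; only GEMMLOWP_ENABLE_AVX2 comes from the discussion above.

```cpp
// Header part: the namespace name depends on the explicit opt-in, so the two
// code paths produce differently named symbols and can coexist in one binary.
#ifdef GEMMLOWP_ENABLE_AVX2
#define SKETCH_NAMESPACE sketch_gemm_AVX2
#define SKETCH_REGISTER_BITS 256
#else
#define SKETCH_NAMESPACE sketch_gemm
#define SKETCH_REGISTER_BITS 128
#endif

namespace SKETCH_NAMESPACE {
inline int RegisterBits() { return SKETCH_REGISTER_BITS; }
}  // namespace SKETCH_NAMESPACE

// Caller part: non-inline application code names the path it wants,
// rather than relying on a "#define gemmlowp gemmlowp_FASTPATH" alias.
int CallerRegisterBits() {
#ifdef GEMMLOWP_ENABLE_AVX2
  return sketch_gemm_AVX2::RegisterBits();
#else
  return sketch_gemm::RegisterBits();
#endif
}
```

Because the caller spells out the namespace, a mismatch between the caller's expectation and the header's configuration shows up as a compile error instead of a silent ODR violation.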

rrwinterton · 2018-03-15T19:32:20Z

It looks like the decision is to have the AVX2 optimizations auto-detect CPU capability and add a fast path, instead of compiling for and selecting a specific ISA. If this is the case, do you want me to make those changes in a branch of gemmlowp and make a new pull request?

bjacob · 2018-03-15T20:27:38Z

No, I don't want to make CPU capability auto-detection, and dispatching to a specific code path, part of gemmlowp's scope. That is best done in non-inline code, and gemmlowp is header-only, all inline code. So I see the scope of gemmlowp as strictly providing these code paths, not dispatching between them.
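
For illustration only, this is roughly what dispatching in application code, outside the library, could look like. The namespaces `sketch_baseline` and `sketch_avx2` and the functions `Run`, `CpuHasAvx2` and `RunBestPath` are hypothetical. `__builtin_cpu_supports` is a GCC/Clang builtin for x86, so the sketch falls back to the baseline path elsewhere. In a real build the AVX2 path would be compiled in its own translation unit with AVX2 enabled.

```cpp
#include <string>

// Stand-ins for two code paths that a header-only library might provide
// under distinct names.
namespace sketch_baseline {
inline std::string Run() { return "baseline"; }
}  // namespace sketch_baseline
namespace sketch_avx2 {
inline std::string Run() { return "avx2"; }
}  // namespace sketch_avx2

// Runtime capability check, owned by the application rather than the library.
bool CpuHasAvx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;  // unknown compiler or architecture: assume no AVX2
#endif
}

// Non-inline function that decides which inline code path to instantiate.
std::string RunBestPath() {
  return CpuHasAvx2() ? sketch_avx2::Run() : sketch_baseline::Run();
}
```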

My above comments can be summarized by saying that unfortunately, at the moment, because of how gemmlowp is structured, having multiple code paths compiled may lead to ODR violations (in practice: crashes).

It is not very easy to fix this, and that would require some breaking change to gemmlowp's API.

So for now, I think the best route to taking your code is if you made it explicitly opt-in:

instead of having a #ifdef just detecting if AVX2 extensions are enabled,

#ifdef __AVX2__

have that #ifdef also check that the user explicitly opted in to compiling AVX2 code.

#if defined(__AVX2__) && defined (GEMMLOWP_ENABLE_AVX2)

That will help prevent a situation where users inadvertently mix AVX2 code into non-AVX2 code and get ODR violations. Please write a comment to that effect so that people understand the risks of GEMMLOWP_ENABLE_AVX2.
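
A minimal sketch of such an opt-in guard together with the kind of warning comment being requested. The macro `SKETCH_AVX2_PATH`, the constant `kAvx2PathEnabled` and the function `ActivePathName` are hypothetical names used only to make the fragment self-contained.

```cpp
#include <string>

// WARNING: only define GEMMLOWP_ENABLE_AVX2 if every object file that
// includes these headers and is linked into the same binary is compiled with
// the same setting and the same AVX2 compiler flags. Mixing AVX2 and non-AVX2
// builds of the same inline code violates the one-definition rule and can
// crash at runtime.
#if defined(__AVX2__) && defined(GEMMLOWP_ENABLE_AVX2)
#define SKETCH_AVX2_PATH 1
#else
#define SKETCH_AVX2_PATH 0
#endif

constexpr bool kAvx2PathEnabled = (SKETCH_AVX2_PATH == 1);

// Reports which path this translation unit was compiled for.
inline std::string ActivePathName() {
  return kAvx2PathEnabled ? "avx2" : "default";
}
```

With GCC or Clang, the AVX2 path in this sketch would only be selected by a build that passes both `-mavx2` and `-DGEMMLOWP_ENABLE_AVX2`. Enabling AVX2 alone leaves the default path in place.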

rrwinterton · 2018-03-29T21:38:58Z

Added a user compiler option to be used when compiling for IA SIMD optimizations. The results of these are shown below:

//compiler define for AVX2 -D GEMMLOWP_ENABLE_AVX2
//compiler define for SSE4 -D GEMMLOWP_ENABLE_SSE4
//no options for scalar

| Test Run | Scalar | SSE4 | AVX2 |
|---|---|---|---|
| Graph Latency (ms) | 32 | 12 | 9 |
| GoogLeNet Latency (ms) | 26 | 96 | 68 |
| Multi-Thread 10x10x10 (Gflops/s) | 1.64 | 2.961 | 2.334 |
| Multi-Thread 1000x1000x1000 (Gflops/s) | 29.49 | 75.85 | 107.8 |
| Single-Thread 1000x1000x1000 (Gflops/s) | 13.36 | 38.21 | 56.31 |
| Single-Thread 10x10x10 (Gflops/s) | 1.63 | 2.975 | 2.337 |
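
As a reading aid, the relative speedups implied by the throughput rows can be computed with a small helper. `Speedup` is a hypothetical function; the inputs are the figures reported above.

```cpp
// Ratio of candidate throughput to baseline throughput (both in Gflops/s).
// Values above 1.0 mean the candidate path is faster.
double Speedup(double baseline_gflops, double candidate_gflops) {
  return candidate_gflops / baseline_gflops;
}
```

With the reported numbers, AVX2 over SSE4 comes out at about 1.47x for the single-thread 1000x1000x1000 case (56.31 / 38.21) and about 1.42x for the multi-thread case (107.8 / 75.85). For the 10x10x10 cases the ratio is about 0.79, meaning AVX2 is slower than SSE4 at that size.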

bjacob · 2018-03-29T21:45:57Z

Thanks! Just one nit: let's not change the behavior for current users who are enjoying SSE4 --- let's keep SSE4 auto-enabled without requiring GEMMLOWP_ENABLE_SSE4. The ODR issues are already a problem there, but it would be another problem to deal with hordes of angry users whom we took SSE4 acceleration away from.

Please just keep GEMMLOWP_ENABLE_AVX2.

rrwinterton · 2018-03-29T23:59:09Z

Makes sense. Just was trying to be consistent but you are right. Removed SSE compiler option. AVX2 only compiler option in change.

bjacob

I took a look at this pull request now, but can't see the new changes:

can't see GEMMLOWP_ENABLE_AVX2
Makefile.rrw is still here

Looking at the Commits tab here,
https://github.com/google/gemmlowp/pull/122/commits
I don't see new commits since Feb. 1.

Please update this pull request by pushing your new commits to the branch that it is based on, rrwinterton:avx2optimizations.

bjacob · 2018-03-30T14:37:26Z

Makefile.rrw

@@ -0,0 +1,27 @@
+UNITTESTS_COMMON=test.cc benchmark.cc test_allocator.cc test_blocking_counter.cc test_fixedpoint.cc test_math_helpers.cc


As discussed earlier, please don't add top-level files, but this can go in contrib/ .

rrwinterton · 2018-04-04T22:12:40Z

I think the Makefile should be removed now and the -D compiler directive is checked in, so I think we are good with this. I must not have gotten the push and rm to complete last time. Let me know if you would like me to help. Thanks,

bjacob · 2018-04-05T15:07:01Z

Thanks! This looks good, just needs rebasing to the current state of the master branch --- I could 'rebase and merge' myself, but it's best if you do the rebasing yourself, so you can double check that your rebased branch looks like what you expect.

This should be roughly a matter of:

git checkout master
git pull origin master         # get the latest from the master branch
git checkout avx2optimizations
git rebase master        # rebase against master
git push....        # update this pull request.

rrwinterton added 14 commits February 1, 2018 11:34

Add files via upload

c425579

Merge pull request #1 from rrwinterton/avx2-opt

ff5e0cd

Add files via upload

Delete pack.h

a4c5644

Delete pack_avx.h

2e5ff53

Delete output_avx.h

e98ee52

Delete kernel_default.h

07ae2f1

Delete kernel_avx.h

458a0d9

Delete fixedpoint_avx.h

2a08cf4

Delete fixedpoint.h

33c109a

Delete common.h

575ed94

avx2opt

0c9ef89

Update fixedpoint.h

2215f02

Delete fixedpoint_avx.h

c7bd4e5

avx2 optimizations

da4f8e7

Delete Makefile.rrw

0973d43

Removing simple example of Makefile for avx2

add user compiler options for simd

a301cac

removed compiler option for SSE left for AVX2 optimization

0cb911c

bjacob reviewed Mar 30, 2018

rrwinterton added 2 commits April 3, 2018 15:28

remove Makefile.rrw

06f0787

Added AVX2 compiler gemmlowp user option

04ae3ff

rrwinterton added 7 commits April 6, 2018 17:16

rebased to google master avx2 optimizations

cd66ee0

rebased avx2optimizations to the base of google gemmlowp master

988d35e

avx2 optimizations

560c2c5

Added AVX2 compiler gemmlowp user option

9e4772f

rebased avx2optimizations to the base of google gemmlowp master

e16ccc6

fix duplicated change in kregistersize

c03d2ae

Merge branch 'avx2optimizations' of https://github.com/rrwinterton/ge…

1b60279

…mmlowp into avx2optimizations

bjacob merged commit d74760e into google:master Apr 10, 2018

Qoboty mentioned this pull request Jul 4, 2018

Does avx2 feature of gemmlowp support gcc4.8.5? #141

Closed

SuperFluffy mentioned this pull request Dec 11, 2018

CellFormat in AVX2 kernel incorrect? Question for clarification #159

Closed
