Fix a bug in multi_thread_gemm.h which could produce wrong results #105

bjacob · 2017-10-11T13:13:00Z

...under unusual circumstances triggered by using non-default cache
size settings, but actually a serious implementation bug.

This multi-threaded GEMM has a top-level
loop on RHS columns, in increments of l2_cols, distributing work
to each worker. That loop is at multi_thread_gemm.h:667.

The intent was for each worker to treat that block of columns
as a single L2 block of columns, i.e. to use the same l2_cols
value. Yet each worker was recomputing its own block params,
and doing so incorrectly (not accounting for the total number of
workers, thus possibly overshooting the cache size). This defeated
an assumption that the worker loops were making, based on the
original intent to use the same global l2_cols value.

This fixes that. In addition to fixing incorrect results with
nondefault (l1=32k, l2=128k) cache sizes, this should help performance:

- block params are now determined correctly, based on the number of
  workers;
- the determination of block params involves some divisions
  and was performed dozens of times (once per worker task), so
  it might even have added up to significant overhead. It is now
  done only once, globally.
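
To illustrate the idea (this is not the actual gemmlowp code), here is a minimal sketch of computing block params once, with the worker count taken into account, and handing the same value to every worker task. All names below (BlockParamsSketch, ComputeBlockParams, WorkerTaskSketch, MakeTasks) and the sizing rule itself are hypothetical simplifications chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for block params: only l2_cols is modeled.
struct BlockParamsSketch {
  int l2_cols;  // number of RHS columns in one L2 block
};

// Assumed sizing rule for illustration only: split the L2 byte budget evenly
// among workers, then fit as many columns of `depth` bytes as possible.
inline BlockParamsSketch ComputeBlockParams(int depth, int l2_bytes,
                                            int num_workers) {
  const int per_worker_bytes = l2_bytes / std::max(1, num_workers);
  return BlockParamsSketch{std::max(1, per_worker_bytes / std::max(1, depth))};
}

// A task carries the column range it covers plus the *global* block params,
// so the worker never recomputes its own (possibly different) l2_cols.
struct WorkerTaskSketch {
  int start_col;
  int num_cols;
  BlockParamsSketch params;
};

// Top-level loop on RHS columns, in increments of the single global l2_cols.
inline std::vector<WorkerTaskSketch> MakeTasks(int cols,
                                               const BlockParamsSketch& params) {
  assert(params.l2_cols > 0);
  std::vector<WorkerTaskSketch> tasks;
  for (int c = 0; c < cols; c += params.l2_cols) {
    const int n = std::min(params.l2_cols, cols - c);
    tasks.push_back(WorkerTaskSketch{c, n, params});
  }
  return tasks;
}
```

Under this assumed rule, calling ComputeBlockParams with the real worker count and with a worker count of 1 gives different l2_cols values, which is the kind of mismatch described above: a worker that recomputes params on its own no longer agrees with the l2_cols used by the top-level loop.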

(Issue reported by Andreas Gal).

bjacob force-pushed the cachesizebug branch from 858fb01 to 05c903e Compare October 11, 2017 13:14

bjacob force-pushed the cachesizebug branch from 05c903e to 98fca4c Compare October 11, 2017 13:29

bjacob merged commit d6799a4 into google:master Oct 11, 2017

andreasgal mentioned this pull request Oct 22, 2017

Recent fix causes 2.5x performance regression #108

Closed
