Fix a bug in multi_thread_gemm.h which could produce wrong results #105
...under unusual circumstances triggered by using non-default cache
size settings, but actually a serious implementation bug.
This multi-threaded GEMM has a top-level
loop on RHS columns, in increments of l2_cols, distributing work
to each worker. That loop is at multi_thread_gemm.h:667.
The intent was for each worker to treat its block of columns
as a single L2 block of columns, i.e. to use the same l2_cols
value. Yet each worker was recomputing its own block params,
incorrectly (not accounting for the total number of workers, thus
possibly overshooting the cache size), and defeating the assumptions
that the worker loops were making, based on the original intent
to use the same global l2_cols value.
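
To illustrate why such a mismatch can give wrong results, here is a
minimal sketch. It is not the actual multi_thread_gemm.h code: the
BlockParams struct, RunWorkerTask, Dispatch and AllColumnsCoveredOnce
are hypothetical names, and the worker is assumed to process at most
one L2 block of its own l2_cols columns. Under that assumption, a
worker whose l2_cols is smaller than the global one silently skips
columns.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the block parameters; only l2_cols matters here.
struct BlockParams {
  int l2_cols;
};

// Simulates one worker task. It treats its assigned column range as a
// single L2 block, so it handles at most params.l2_cols columns.
void RunWorkerTask(int col_start, int assigned_cols, const BlockParams& params,
                   std::vector<int>* times_computed) {
  const int cols_in_block = std::min(assigned_cols, params.l2_cols);
  for (int c = col_start; c < col_start + cols_in_block; ++c) {
    ++(*times_computed)[c];
  }
}

// Top-level loop on columns, in increments of global_params.l2_cols.
// worker_params is what each worker task actually uses: pass the same
// object to model the intended behavior, or a different one to model a
// worker that recomputed its own block params.
std::vector<int> Dispatch(int total_cols, const BlockParams& global_params,
                          const BlockParams& worker_params) {
  std::vector<int> times_computed(total_cols, 0);
  for (int c = 0; c < total_cols; c += global_params.l2_cols) {
    const int cs = std::min(global_params.l2_cols, total_cols - c);
    RunWorkerTask(c, cs, worker_params, &times_computed);
  }
  return times_computed;
}

// Returns true if every column was computed exactly once.
bool AllColumnsCoveredOnce(const std::vector<int>& times_computed) {
  return std::all_of(times_computed.begin(), times_computed.end(),
                     [](int n) { return n == 1; });
}
```

With the same l2_cols on both sides, every column is computed exactly
once. If the worker value is smaller (for example 16 against a global
32), part of each assigned block is never computed.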
This fixes that. In addition to fixing incorrect results with
nondefault (l1=32k, l2=128k) cache sizes, this should help performance:
the block params now account for the total number of workers;
and their computation was previously performed dozens of times
(once per worker task), so it might even have added up to
significant overhead. It is now done only once globally.
(Issue reported by Andreas Gal).