Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in multi_thread_gemm.h which could produce wrong results #105

Merged
merged 1 commit into from
Oct 11, 2017

Conversation

bjacob
Copy link
Contributor

@bjacob bjacob commented Oct 11, 2017

...under unusual circumstances triggered by using non-default cache
size settings, but actually a serious implementation bug.

This multi-threaded GEMM has a top-level
loop on RHS columns, by increment of l2_cols, distributing work
to each worker. That loop is at multi_thread_gemm.h:667.

The intent was for each worker to work with that block of column
as a single L2 block of columns, i.e. to use the same l2_cols
value. Yet, each worker was recomputing its own block params,
incorrectly (not accounting for the total number of workers, thus
possibly over-shooting the cache size), and defeating assumption
that the worker loops were making, based on the original intent
to use the same global l2_cols value.

This fixes that; in addition to fixing incorrect results with
nondefault (l1=32k, l2=128k) cache sizes, this should help performance:

  • correct determination of block params based on the number of
    workers;
  • the determination of block params involves some divisions
    and was performed dozens of times (one per worker task) so
    it might even add up to significant overhead. Now done only
    once globally.

(Issue reported by Andreas Gal).

under unusual circumstances triggered by using non-default cache
size settings, but actually a serious implementation bug.

This multi-threaded GEMM has a top-level
loop on RHS columns, by increment of l2_cols, distributing work
to each worker. That loop is at multi_thread_gemm.h:667.

The intent was for each worker to work with that block of column
as a single L2 block of columns, i.e. to use the same l2_cols
value. Yet, each worker was recomputing its own block params,
incorrectly (not accounting for the total number of workers, thus
possibly over-shooting the cache size), and defeating assumption
that the worker loops were making, based on the original intent
to use the same global l2_cols value.

This fixes that; in addition to fixing incorrect results with
nondefault (l1=32k, l2=128k) cache sizes, this should help performance:
 - correct determination of block params based on the number of
   workers;
 - the determination of block params involves some divisions
   and was performed dozens of times (one per worker task) so
   it might even add up to significant overhead. Now done only
   once globally.

(Issue reported by Andreas Gal).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant