Speed up CuMatrix<Real>::Transpose() and transposed copy from matrix #790

Merged: 4 commits into kaldi-asr:master, May 19, 2016


kangshiyin

This PR adds a new matrix transpose kernel that is roughly 2.2x to 4.6x faster at dim = 1024, depending on the case. Benchmark results are in the commit log.
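For readers unfamiliar with the technique: fast transpose kernels commonly move data in small square tiles, so that reads and writes both touch contiguous memory. The kernel added by this PR is not shown in the conversation, so the following is only a CPU-side C++ sketch of the tiling idea, not the merged code. The names `TransposeTiled` and `kTile` are hypothetical, and the tile edge of 32 is an assumption taken from the review comment that mentions template values of 16 and 32.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge (assumption; the review mentions 16 or 32).
constexpr std::size_t kTile = 32;

// Transposes the row-major rows x cols matrix `in` into the row-major
// cols x rows matrix `out`, one kTile x kTile block at a time.
template <typename Real>
void TransposeTiled(const std::vector<Real>& in, std::vector<Real>& out,
                    std::size_t rows, std::size_t cols) {
  out.assign(rows * cols, Real(0));
  for (std::size_t r0 = 0; r0 < rows; r0 += kTile) {
    for (std::size_t c0 = 0; c0 < cols; c0 += kTile) {
      // Clamp the tile at the matrix edge so non-multiples of kTile work.
      const std::size_t r1 = std::min(r0 + kTile, rows);
      const std::size_t c1 = std::min(c0 + kTile, cols);
      for (std::size_t r = r0; r < r1; ++r) {
        for (std::size_t c = c0; c < c1; ++c) {
          out[c * rows + r] = in[r * cols + c];
        }
      }
    }
  }
}
```

On a GPU the inner tile would typically be staged through shared memory by one thread block; this host version only shows the indexing and the edge handling.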

Add barrier for correct timing.

Original performance:
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<float>, for dim = 1024, speed was 4.26727 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<float>, for dim = 1024, speed was 5.97203 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<float>, for dim = 1024, speed was 3.0816 gigaflops.
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<double>, for dim = 1024, speed was 3.95059 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<double>, for dim = 1024, speed was 4.36189 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<double>, for dim = 1024, speed was 2.39275 gigaflops.
New performance:
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<float>, for dim = 1024, speed was 14.0498 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<float>, for dim = 1024, speed was 16.845 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<float>, for dim = 1024, speed was 14.2464 gigaflops.
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<double>, for dim = 1024, speed was 10.4523 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<double>, for dim = 1024, speed was 9.65529 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<double>, for dim = 1024, speed was 8.52148 gigaflops.
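
The "barrier" in the commit message matters because GPU kernel launches return before the kernel has finished, so a timer stopped without waiting measures little more than launch overhead. The actual change to the speed test is not shown in this conversation. The sketch below is a host-only analogy in standard C++: `std::async` stands in for an asynchronous kernel launch and `std::future::wait()` stands in for the device barrier (in CUDA that role is played by a call such as `cudaDeviceSynchronize()`). The function name `TimeAsyncWorkMs` is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Launches work that takes about `work_ms` milliseconds on another thread
// and returns the elapsed time seen by the caller, in milliseconds.
// With use_barrier == false the timer stops right after the launch, which
// mimics timing an asynchronous kernel without synchronizing first.
double TimeAsyncWorkMs(int work_ms, bool use_barrier) {
  using Clock = std::chrono::steady_clock;
  const auto start = Clock::now();
  std::future<void> done = std::async(std::launch::async, [work_ms] {
    std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
  });
  if (use_barrier) {
    done.wait();  // Barrier: block until the asynchronous work has finished.
  }
  const auto stop = Clock::now();
  done.wait();  // Always let the work finish before returning.
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the barrier the measured time is far shorter than the work itself, which would inflate any gigaflops figure derived from it.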
@danpovey

Thanks!
Please show us the speed difference here in the conversation so we don't have to look at the log.
I notice you have a template that can take 16 or 32 but you're only using 32. I recommend simplifying the code if you don't have immediate plans to use that functionality.
Dan

@kangshiyin

OK, will change. Here's the timing.

Original:
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<float>, for dim = 1024, speed was 4.26727 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<float>, for dim = 1024, speed was 5.97203 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<float>, for dim = 1024, speed was 3.0816 gigaflops.
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<double>, for dim = 1024, speed was 3.95059 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<double>, for dim = 1024, speed was 4.36189 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<double>, for dim = 1024, speed was 2.39275 gigaflops.
c765ba6

New:
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<float>, for dim = 1024, speed was 14.0498 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<float>, for dim = 1024, speed was 16.845 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<float>, for dim = 1024, speed was 14.2464 gigaflops.
LOG (TestCuMatrixTransposeCross():cu-matrix-speed-test.cc:91) For CuMatrix::TransposeCross<double>, for dim = 1024, speed was 10.4523 gigaflops.
LOG (TestCuMatrixTransposeS():cu-matrix-speed-test.cc:72) For CuMatrix::TransposeS<double>, for dim = 1024, speed was 9.65529 gigaflops.
LOG (TestCuMatrixTransposeNS():cu-matrix-speed-test.cc:56) For CuMatrix::TransposeNS<double>, for dim = 1024, speed was 8.52148 gigaflops.
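
To make the comparison easier to read, the speedup for each case is simply the new figure divided by the original one. The small C++ sketch below does that arithmetic with the numbers copied from the logs above. The struct and function names are hypothetical, and the float/double labels assume the first three lines of each log are `float` and the last three are `double`, as in the commit log.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark case: gigaflops before and after the change (dim = 1024).
struct TransposeTiming {
  const char* name;
  double old_gflops;
  double new_gflops;
};

// Figures copied from the "Original" and "New" logs in this conversation.
static const TransposeTiming kTimings[] = {
    {"TransposeCross<float>", 4.26727, 14.0498},
    {"TransposeS<float>", 5.97203, 16.845},
    {"TransposeNS<float>", 3.0816, 14.2464},
    {"TransposeCross<double>", 3.95059, 10.4523},
    {"TransposeS<double>", 4.36189, 9.65529},
    {"TransposeNS<double>", 2.39275, 8.52148},
};

// Speedup factor: new throughput divided by original throughput.
inline double Speedup(const TransposeTiming& t) {
  return t.new_gflops / t.old_gflops;
}

// Prints one line per case with the speedup rounded to two decimals.
inline void PrintSpeedups() {
  for (const TransposeTiming& t : kTimings) {
    std::printf("%-24s %.2fx\n", t.name, Speedup(t));
  }
}
```

Calling `PrintSpeedups()` shows factors between about 2.2x (`TransposeS<double>`) and about 4.6x (`TransposeNS<float>`).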

@danpovey danpovey merged commit 60a106e into kaldi-asr:master May 19, 2016