Odin Implementation of General Matrix Multiply (gemm)

Ported from UT Austin's BLIS project and the ULAFF course.

Results

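The Odin implementation performs substantially worse than the C implementation. The optimized mmult reaches only 15.5 GFLOPS, barely ahead of the naive mmult_jpi (15.1 GFLOPS) and far below the ~200 GFLOPS target. I suspect LLVM is optimizing the code poorly, but I still need to run profiling and direct comparisons to pin down the root cause.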

Test Computer:

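AMD Ryzen 9 3950X (16 cores): L1: 1 MB, L2: 8 MB, L3: 64 MB (shared); cache sizes are totals across all cores (64 KB L1 and 512 KB L2 per core). GFLOPS/core: 225, clock: 3.5 GHz (4.7 GHz turbo)
Expected performance: ~90% of peak, i.e. ~200 GFLOPS per core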
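To check a run against that target, the derived figures can be recomputed from the raw counters the benchmark prints: GFLOPS = n_flops / (time_ms * 1e6), clocks/flop = clocks / n_flops, and fraction of peak = GFLOPS / 225. The sketch below is only an illustration; the names bench_metrics, Bench_Metrics and gemm_flops are hypothetical and not taken from gemm.odin, and the 2*m*n*k flop count is the usual gemm convention rather than something this README states.

```odin
package main

// Hypothetical container for the derived benchmark figures.
Bench_Metrics :: struct {
	gflops:          f64, // billions of floating-point operations per second
	clocks_per_flop: f64, // raw clock count divided by flop count
	frac_of_peak:    f64, // gflops / peak_gflops
}

// gemm_flops returns the flop count of one m x n x k multiply:
// each of the m*n outputs needs k multiplies and k adds.
gemm_flops :: proc(m, n, k: i64) -> i64 {
	return 2 * m * n * k
}

// bench_metrics derives throughput figures from raw counters.
// time_ms is wall time in milliseconds, so flops / (time_ms * 1e6) is GFLOPS.
bench_metrics :: proc(n_flops, clocks: i64, time_ms, peak_gflops: f64) -> Bench_Metrics {
	gflops := f64(n_flops) / (time_ms * 1e6)
	return Bench_Metrics{
		gflops          = gflops,
		clocks_per_flop = f64(clocks) / f64(n_flops),
		frac_of_peak    = gflops / peak_gflops,
	}
}
```

With the naive run's counters, bench_metrics(1769472000, 409855116, 117.101, 225) gives about 15.11 GFLOPS and 0.2316 clocks/flop, which is roughly 6.7% of the 225 GFLOPS per-core peak.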

Both runs were executed with: odin test . -o:aggressive -disable-assert -microarch:native -no-bounds-check (LLVM 17)

Naive Implementation (mmult_jpi):

Per-Matrix Size: 0.879 mb
Clocks: 409855116, n_flops: 1769472000, time(ms):117.101, GFS:15.111
clocks/flop 0.2316256
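The jpi suffix presumably follows the ULAFF/LAFF-On naming for loop order: j outermost, then p, then i, computing C(i, j) += A(i, p) * B(p, j). A minimal sketch of that loop order, assuming column-major storage and f32 elements; mmult_jpi_sketch is a hypothetical name and is not the repo's mmult_jpi.

```odin
package main

// mmult_jpi_sketch computes C += A * B using the j-p-i loop order.
// Assumes column-major storage: element (i, j) of a matrix with
// leading dimension ld is stored at index i + j*ld.
// A is m x k, B is k x n, C is m x n.
mmult_jpi_sketch :: proc(m, n, k: int, A: []f32, lda: int, B: []f32, ldb: int, C: []f32, ldc: int) {
	for j in 0..<n {
		for p in 0..<k {
			// B(p, j) is reused for the whole inner loop.
			b := B[p + j * ldb]
			// The inner loop walks down column p of A and column j of C,
			// both contiguous in memory: C(:, j) += B(p, j) * A(:, p).
			for i in 0..<m {
				C[i + j * ldc] += A[i + p * lda] * b
			}
		}
	}
}
```

With column-major storage the innermost loop touches A and C with unit stride, which is why this ordering is usually the fastest of the six naive loop orders.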

Optimized version (mmult):

Per-Matrix Size: 0.879 mb
starting mmult
A&B Cache-Packing Temp Allocs (kb): 72 72
Clocks: 399528885, n_flops: 1769472000, time(ms):114.151, GFS:15.501
clocks/flop 0.22579
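BLIS-style gemm copies blocks of A and B into contiguous temporary buffers before the inner kernel runs; the "Cache-Packing Temp Allocs" line presumably reports the size of those buffers. Below is a minimal sketch of packing a block of A into MR-row micro-panels in the spirit of BLIS. MR = 4, f32 elements, column-major input, and the names pack_a and packed_a_len are all assumptions for illustration, not values or procedures taken from gemm.odin.

```odin
package main

// Hypothetical micro-tile height; the repo's real MR is not given here.
MR :: 4

// pack_a copies an mc x kc block of column-major A (leading dimension lda)
// into dst as consecutive MR x kc micro-panels. Within a panel, the MR
// entries of each column p are contiguous, so a micro-kernel can stream
// through dst with unit stride. Rows past mc are zero-padded.
pack_a :: proc(mc, kc: int, A: []f32, lda: int, dst: []f32) {
	idx := 0
	for ir := 0; ir < mc; ir += MR {
		for p in 0..<kc {
			for i in 0..<MR {
				row := ir + i
				// Pad the last, partial panel with zeros.
				dst[idx] = row < mc ? A[row + p * lda] : 0
				idx += 1
			}
		}
	}
}

// packed_a_len returns the buffer size pack_a needs: mc rounded up to a
// multiple of MR, times kc.
packed_a_len :: proc(mc, kc: int) -> int {
	return (mc + MR - 1) / MR * MR * kc
}
```

Zero-padding lets the micro-kernel always process full MR-row panels instead of branching on edge cases; B is packed the same way, with panels of NR columns.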

Repository: jon-lipstate/odin-gemm

Folders and files

Latest commit

History

Repository files navigation

Odin Implementation of General Matrix Multiply (gemm)

Results

About

Resources

Stars

Watchers

Forks

Languages