Support Nvidia Hopper GPUs #27
base: main
Conversation
if Symbol(dtype) == :Float16
    # matrix dimensions 8x8x4, factor 2 for nflops in A*B+C
    # see e.g. https://peerj.com/articles/cs-330.pdf
Note: I replaced this link with the DOI because the link is now broken.
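For readers following along, here is a minimal sketch of how a multiplier like the one in this branch is typically derived under the A100-style model described in the linked paper. The variable names and device figures are illustrative (A100 values), not the package's actual code:

```julia
# One Float16 tensor-core MMA works on an 8x8x4 tile, and each fused
# multiply-add in A*B+C counts as 2 flops, so per tensor core and per cycle:
flops_per_tc_cycle = 2 * 8 * 8 * 4   # = 512 flops / cycle / tensor core

# The theoretical peak then scales with clock and tensor-core count
# (illustrative A100 figures, not queried from a device):
clock_hz        = 1.410e9            # A100 boost clock
num_tensorcores = 432                # A100: 108 SMs * 4 tensor cores per SM

max_peakflops = clock_hz * num_tensorcores * flops_per_tc_cycle  # ≈ 312 TFLOP/s
```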
Trying to measure peakflops, I get:
These results are quite far from the theoretical peaks, about 50% less. Is there anything to tweak in the kernels for a new architecture?
Interestingly enough, all theoretical tensor-core peakflops for Float32, Float16, and Int8 are wrong by about 8%:
julia> 535.3 / 494.7
1.0820699413786132
julia> 1070.5 / 989.4
1.0819688700222356
julia> (4282.1 / 2) / 1978.9
1.0819394613168933
but I have no clue where this factor comes from.
elseif Symbol(dtype) == :Float64
    max_peakflops *= 2 * 4 * 4 * 2
elseif Symbol(dtype) == :Int8
    max_peakflops *= 2 * 2 * 32 * 8 * 4 # XXX: Wrong result!
Maybe there's an extra factor of 2 in this formula, but I based this on the `Int8` calculation below.
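To make the suspicion concrete, this is the plain arithmetic behind the multipliers in the diff (not part of the PR itself); halving the `Int8` multiplier would match the `4282.1 / 2` used in the comparison above, up to the same ~8% factor:

```julia
2 * 4 * 4 * 2            # Float64 branch: 64
2 * 2 * 32 * 8 * 4       # Int8 branch ("XXX: Wrong result!"): 4096
(2 * 2 * 32 * 8 * 4) ÷ 2 # with one factor of 2 removed: 2048
```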
This is an initial attempt to support Nvidia Hopper GPUs, opening as a draft because lots of things still don't work. For example, the theoretical peakflops for tensor cores are wrong; it looks like the formula used for the A100 doesn't apply to Hopper. I tried to adapt it based on figures 10-11 of https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper but with GH200 I get:
Values for `Float64` (with and without tensor cores) and `Float32` (without tensor cores) are good, but all the other tensor-core peakflops are wrong according to column "H100 SXM5" of table 2 of the document above: they should be 494.7 TFLOP/s for `Float32`, 989.4 for `Float16`, and 1978.9 for `Int8` (https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip, which is specific to GH200, also agrees with those numbers, but it has fewer significant digits; they rounded to integers).
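As a small cross-check while fixing the multipliers: the reference values quoted above keep a simple 2x relationship between precisions, so whatever per-dtype factors end up in the code should preserve these ratios. A sketch using just the whitepaper numbers (not code from this PR):

```julia
# Reference tensor-core peaks from table 2 of the Hopper whitepaper,
# "H100 SXM5" column, in TFLOP/s:
reference = Dict(:Float32 => 494.7, :Float16 => 989.4, :Int8 => 1978.9)

# Each step down in precision doubles the peak (up to rounding in the table):
@assert isapprox(reference[:Float16], 2 * reference[:Float32]; rtol = 1e-3)
@assert isapprox(reference[:Int8],    2 * reference[:Float16]; rtol = 1e-3)
```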