Values between in GPU hierarchy images and output files #4

Smeegol · 2022-04-29T07:20:09Z

The two images below were captured from the paper "Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs":

(1) The maximum number of registers for each work item
(2) The size, cacheline size, and bandwidth of memory hierarchy including all levels of unified and texture caches, local, constant and global memory
(3) The number of threads in a warp
(4) The number of ALUs in a shader core

After generating the .json and .csv files, I cannot map all of the reported property values to the GPU hierarchy properties above. Could you please explain how they correspond, especially (2) and (4)?

I have also found some of the connections and listed them below, taking the Adreno 640 GPU as an example. Could you please check whether they are right?

Shader cores count: Value in ArchProbeReport.json - [Device] - SmCount?
Execution engines count: Where is this value?
ALUs count: Where is this value?
Warp size: Value in ArchProbeReport.json - [WarpSizeMethod{A|B}] - WarpThreadCount?
What is 384, and where is this value?
Registers count: Value (is 181?) in ArchProbeReport.json - [RegCount] - RegCount?
Register type (Pooled / Dedicated): Value in ArchProbeReport.json - [RegCount] - RegType?
Register bits: Where is this value (4B)?
Texture L1 cache bandwidth: Value in ArchProbeReport.json - [ImageBandwidth] - MinBandwidth / MaxBandwidth?
Local memory bandwidth: Value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
Unified L2 cache bandwidth: Value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
Constant memory bandwidth: Value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
Texture L1 cache size: Where is this value?
Local memory size: Where is this value?
Unified L2 cache size: Where is this value?
Constant memory size: Where is this value?
Global memory bandwidth: Where is this value?


PENGUINLIONG · 2022-05-08T07:07:37Z

Sorry about the late reply. I just noticed that I hadn't attached the probing code for local and constant memories; #5 should have brought it back.

As for the confusion in interpreting the ArchProbe results: instead of answering the entire list of entangled questions, I'd like to give you some general guidelines.

For the terminology you might have been confused by:

SmCount = Streaming Multiprocessor (NVIDIA) Count = Execution Engine (Arm) Count = Shader Core (Qualcomm) Count = Number of SIMT units where several threads are executed together, i.e., the number of bunches of ALUs.
AluCount = Max Warp Size (NVIDIA) = Maximum number of threads that can be executed in an SM.
RegCount = Number of registers. In our work we assumed that a single register holds one 32-bit floating-point number, both because people barely use double-precision computation on mobile devices and to keep the probing code simple.
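
The terms above can be read out of a report roughly as follows. This is a minimal Python sketch, assuming the section and key names quoted earlier in this thread (`Device.SmCount`, `WarpSizeMethodA.WarpThreadCount`, `RegCount.RegCount`, `RegCount.RegType`) and a nested-dictionary layout. The `summarize` helper is hypothetical, and every sample value except the 181 from the question is a placeholder, not a measurement. The real ArchProbeReport.json may be laid out differently.

```python
import json

# Hypothetical report excerpt. The section/key names are the ones quoted in
# this thread; the nesting and all values except RegCount=181 are placeholders.
SAMPLE_REPORT = """
{
  "Device": {"SmCount": 2},
  "WarpSizeMethodA": {"WarpThreadCount": 64},
  "RegCount": {"RegCount": 181, "RegType": "Pooled"}
}
"""

# One 32-bit float per register, following the assumption stated above.
BYTES_PER_REGISTER = 4


def summarize(report: dict) -> dict:
    """Collect the fields discussed above; missing entries become None."""

    def get(section, key):
        return report.get(section, {}).get(key)

    reg_count = get("RegCount", "RegCount")
    return {
        "sm_count": get("Device", "SmCount"),
        "warp_thread_count": get("WarpSizeMethodA", "WarpThreadCount"),
        "reg_count": reg_count,
        "reg_type": get("RegCount", "RegType"),
        # Register storage in bytes implied by the 4-byte-per-register assumption.
        "reg_bytes": None if reg_count is None else reg_count * BYTES_PER_REGISTER,
    }


if __name__ == "__main__":
    for name, value in summarize(json.loads(SAMPLE_REPORT)).items():
        print(f"{name}: {value}")
```

Returning None for missing keys keeps the sketch usable when a probe was skipped or failed on a given device.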

For cache sizes and memory bandwidths, you will want to look up the records in {Image|Buffer}Bandwidth and {Image|Buffer}CacheHierarchyPChase. Be aware that you CANNOT rely solely on the JSON report for cache-level bandwidths. The algorithm might fail to identify large cache boundaries because of noisy timing. The JSON report is merely an auto-generated brief of its discoveries, meant to be digested by, in our case, TVM. Unlike the fixed number of registers and memory size, aspects like memory bandwidth vary with the range of access, and ArchProbe uses this property to detect the cache sizes. So you should plot the results in the .csv files to visualize how the bandwidth changes as the size of memory accessed increases. The techniques for interpreting the plots are illustrated in detail in our paper.
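
The idea behind reading those plots can be sketched in a few lines of Python: scan (access size, bandwidth) samples and flag the sizes after which bandwidth falls sharply. The two-column layout, the `find_bandwidth_drops` helper, the 0.8 threshold, and the sample numbers are all assumptions for illustration. This is neither the actual ArchProbe .csv format nor the algorithm ArchProbe itself uses.

```python
import csv
import io

# Made-up samples: access range in bytes, bandwidth in arbitrary units.
# The real ArchProbe .csv columns may differ; adjust the parsing accordingly.
SAMPLE_CSV = """\
1024,100.0
2048,101.0
4096,99.0
8192,60.0
16384,61.0
32768,59.0
65536,30.0
"""


def read_samples(text):
    """Parse 'size,bandwidth' rows into (int, float) pairs sorted by size."""
    rows = [(int(r[0]), float(r[1])) for r in csv.reader(io.StringIO(text)) if r]
    return sorted(rows)


def find_bandwidth_drops(samples, ratio=0.8):
    """Return the last access size before each sharp bandwidth drop.

    A drop is flagged when the next sample's bandwidth is below `ratio`
    times the current one. Each returned size is a candidate capacity
    for one level of the memory hierarchy.
    """
    drops = []
    for (size, bw), (_, next_bw) in zip(samples, samples[1:]):
        if next_bw < ratio * bw:
            drops.append(size)
    return drops


if __name__ == "__main__":
    print(find_bandwidth_drops(read_samples(SAMPLE_CSV)))
```

Noisy timings can produce false or missing drops with a fixed threshold like this, so a plot of the same data remains the more reliable check.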

PENGUINLIONG · 2022-05-14T08:35:32Z

I'm closing this issue but please feel free to reopen it if you have further concerns interpreting ArchProbe outputs.

Smeegol · 2022-05-15T13:50:19Z

Thanks for your detailed explanation. I will try to digest it.

PENGUINLIONG mentioned this issue May 8, 2022

Constant and local memory probing #5

Merged

PENGUINLIONG added the question Further information is requested label May 8, 2022

PENGUINLIONG closed this as completed May 14, 2022
