Values between in GPU hierarchy images and output files #4

Closed
Smeegol opened this issue Apr 29, 2022 · 3 comments
Labels
question Further information is requested

Comments

@Smeegol

Smeegol commented Apr 29, 2022

Below are two images captured from the paper "Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs":
image
image

(1) The maximum number of registers for each work item
(2) The size, cacheline size, and bandwidth of memory hierarchy including all levels of unified and texture caches, local, constant and global memory
(3) The number of threads in a warp
(4) The number of ALUs in a shader core

After generating the .json and .csv files, however, I cannot map all of the reported property values to the GPU hierarchy properties listed above. Could you please explain how they map, especially (2) and (4)?

I have also worked out some of the mappings myself and listed them below, taking the Adreno 640 GPU as an example. Could you please check whether they are right?

  • Shader core count: the value in ArchProbeReport.json - [Device] - SmCount?
  • Execution engine count: where is this value?
  • ALU count: where is this value?
  • Warp size: the value in ArchProbeReport.json - [WarpSizeMethod{A|B}] - WarpThreadCount?
  • What is 384, and where does this value come from?
  • Register count: the value (is it 181?) in ArchProbeReport.json - [RegCount] - RegCount?
  • Register type (Pooled / Dedicated): the value in ArchProbeReport.json - [RegCount] - RegType?
  • Register width: where is this value (4B)?
  • Texture L1 cache bandwidth: the value in ArchProbeReport.json - [ImageBandwidth] - MinBandwidth / MaxBandwidth?
  • Local memory bandwidth: the value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
  • Unified L2 cache bandwidth: the value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
  • Constant memory bandwidth: the value in ArchProbeReport.json - [BufferBandwidth] - MinBandwidth / MaxBandwidth?
  • Texture L1 cache size: where is this value?
  • Local memory size: where is this value?
  • Unified L2 cache size: where is this value?
  • Constant memory size: where is this value?
  • Global memory bandwidth: where is this value?
@PENGUINLIONG
Collaborator

Sorry about the late reply. I just noticed that I didn't attach the probing code for local and constant memories and #5 should have brought them back.

As for the confusion about interpreting the ArchProbe results: instead of answering the entire list of entangled questions, I'd like to give you some general guidelines.

For the terminology that may have been confusing:

  • SmCount = Streaming Multiprocessor (NVIDIA) Count = Execution Engine (Arm) Count = Shader Core (Qualcomm) Count = Number of SIMT units in which several threads are executed together, i.e., the number of bunches of ALUs.
  • AluCount = Max Warp Size (NVIDIA) = Maximal number of threads that can be executed in an SM.
  • RegCount = Number of registers. In our work we assumed, for simplicity, that a single register holds one 32-bit floating-point number, because people barely use double-precision computation on mobile devices and because it keeps the probing code simple.
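
To make the mapping above concrete, here is a minimal Python sketch that pulls these fields out of a report. It assumes the report is nested JSON with the section and key names quoted earlier in this thread (Device/SmCount, WarpSizeMethodA/WarpThreadCount, RegCount/RegCount, RegCount/RegType). The sample values are placeholders, not measurements, and `summarize` is a hypothetical helper, not part of ArchProbe.

```python
import json

# Placeholder report. The section and key names are the ones quoted in this
# thread; the nesting is an assumption and the values are not measurements.
SAMPLE_REPORT = """
{
  "Device": {"SmCount": 2},
  "RegCount": {"RegCount": 181, "RegType": "Pooled"},
  "WarpSizeMethodA": {"WarpThreadCount": 64}
}
"""

BYTES_PER_REGISTER = 4  # one 32-bit float per register, as assumed above


def summarize(report: dict) -> dict:
    """Collect the fields discussed above; missing entries become None."""

    def lookup(section: str, key: str):
        return report.get(section, {}).get(key)

    reg_count = lookup("RegCount", "RegCount")
    return {
        "sm_count": lookup("Device", "SmCount"),
        "warp_size": lookup("WarpSizeMethodA", "WarpThreadCount"),
        "reg_count": reg_count,
        "reg_type": lookup("RegCount", "RegType"),
        # Register storage in bytes under the 32-bit-per-register assumption.
        "reg_bytes": None if reg_count is None else reg_count * BYTES_PER_REGISTER,
    }


summary = summarize(json.loads(SAMPLE_REPORT))
print(summary)
```

With a real report you would replace `SAMPLE_REPORT` with the contents of ArchProbeReport.json. A field that comes back as `None` means the key was not found under the assumed layout.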

For cache sizes and memory bandwidths, look up the records in {Image|Buffer}Bandwidth and {Image|Buffer}CacheHierarchyPChase. Be aware that you CANNOT rely solely on the JSON report for cache-level bandwidths. The algorithm might fail to identify large cache boundaries because of noisy timing. The JSON report is merely an auto-generated summary of ArchProbe's discoveries, meant to be consumed by, in our case, TVM. Unlike the fixed number of registers and the memory size, properties like memory bandwidth vary with the range of access, and ArchProbe uses exactly this behavior to detect cache sizes. So you should plot the results in the .csv files to visualize how the bandwidth changes as the size of the memory accessed increases. The techniques for interpreting the plots are illustrated in detail in our paper.
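
As a rough numeric companion to reading such a plot by eye, the sketch below flags the access sizes after which bandwidth falls sharply. It is not the algorithm ArchProbe uses. The two-column layout (`size_bytes`, `bandwidth`), the sample numbers, and the 20% threshold are all assumptions made for illustration, so check the actual .csv header before adapting it.

```python
import csv
import io

# Hypothetical two-column layout (bytes accessed, measured bandwidth).
# The real ArchProbe .csv columns and units may differ; values are invented.
SAMPLE_CSV = """size_bytes,bandwidth
1024,100.0
2048,99.0
4096,98.0
8192,60.0
16384,59.0
32768,58.5
65536,30.0
"""


def load_points(text: str) -> list:
    """Parse the CSV text into (size, bandwidth) pairs sorted by size."""
    reader = csv.DictReader(io.StringIO(text))
    return sorted((int(row["size_bytes"]), float(row["bandwidth"])) for row in reader)


def find_drops(points: list, min_drop: float = 0.2) -> list:
    """Return the sizes after which bandwidth falls by more than min_drop.

    min_drop is a fraction of the current bandwidth; each returned size is the
    last access range that still ran at the higher bandwidth, which makes it a
    candidate cache boundary.
    """
    boundaries = []
    for (size, bw), (_, next_bw) in zip(points, points[1:]):
        if bw > 0 and (bw - next_bw) / bw > min_drop:
            boundaries.append(size)
    return boundaries


boundaries = find_drops(load_points(SAMPLE_CSV))
print(boundaries)  # [4096, 32768]
```

On noisy measurements a fixed threshold will produce false positives or miss gradual transitions, which is why inspecting the plotted curve remains the more reliable check.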

@PENGUINLIONG PENGUINLIONG added the question Further information is requested label May 8, 2022
@PENGUINLIONG
Collaborator

I'm closing this issue but please feel free to reopen it if you have further concerns interpreting ArchProbe outputs.

@Smeegol
Author

Smeegol commented May 15, 2022

Thanks for your detailed explanation. I will try to digest it.
