# Managing memory for accelerated applications with CUDA C/C++ Unified Memory and nsys
[CUDA C++ Best Practices Guide, Memory Optimizations (docs.nvidia)](http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#memory-optimizations)

****
## Optimizing a program with the nsys profiler

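`nsys profile` generates a qdrep report; adding `--stats=true` also prints summary statistics to standard output.
The printed output includes:
- Collection configuration
- Report file generation details
- **CUDA API statistics**
- **CUDA kernel statistics**
- **CUDA memory operation statistics**
- Operating system runtime API statistics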

Compile and run the program, then profile it (`file` and `file.cu` are placeholder names):

!nvcc -o file file.cu -run
!nsys profile --stats=true ./file

/**

// Collection configuration
**** collection configuration ****
	force-overwrite = false
	stop-on-exit = true
	export_sqlite = true
	stats = true
	capture-range = none
	stop-on-range-end = false
	Beta: ftrace events:
	ftrace-keep-user-config = false
	trace-GPU-context-switch = false
	delay = 0 seconds
	duration = 0 seconds
	kill = signal number 15
	inherit-environment = true
	show-output = true
	trace-fork-before-exec = false
	sample_cpu = true
	backtrace_method = LBR
	wait = all
	trace_cublas = false
	trace_cuda = true
	trace_cudnn = false
	trace_nvtx = true
	trace_mpi = false
	trace_openacc = false
	trace_vulkan = false
	trace_opengl = true
	trace_osrt = true
	osrt-threshold = 0 nanoseconds
	cudabacktrace = false
	cudabacktrace-threshold = 0 nanoseconds
	profile_processes = tree
	application command = ./single-thread-vector-add
	application arguments = 
	application working directory = /dli/task
	NVTX profiler range trigger = 
	NVTX profiler domain trigger = 
	environment variables:
	Collecting data...

// Report generation
Success! All values calculated correctly.
	Generating the /dli/task/report1.qdstrm file.
	Capturing raw events...
	4570 total events collected.
	Saving diagnostics...
	Saving qdstrm file to disk...
	Finished saving file.


Importing the qdstrm file using /opt/nvidia/nsight-systems/2019.5.2/host-linux-x64/QdstrmImporter.

Importing...

Importing [==================================================100%]
Saving report to file "/dli/task/report1.qdrep"
Report file saved.
Please discard the qdstrm file and use the qdrep file instead.

Removed /dli/task/report1.qdstrm as it was successfully imported.
Please use the qdrep file instead.

Exporting the qdrep file to SQLite database using /opt/nvidia/nsight-systems/2019.5.2/host-linux-x64/nsys-exporter.

Exporting 4531 events:

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

Exported successfully to
/dli/task/report1.sqlite

// CUDA API statistics
Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   89.9      2340979588           1    2340979588.0      2340979588      2340979588  cudaDeviceSynchronize                                                           
    9.4       243712846           3      81237615.3           32322       243591834  cudaMallocManaged                                                               
    0.7        18003501           3       6001167.0         5385469         7058338  cudaFree                                                                        
    0.0           52769           1         52769.0           52769           52769  cudaLaunchKernel                                                                



// CUDA kernel statistics
Generating CUDA Kernel Statistics...

Generating CUDA Memory Operation Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
  100.0      2340966967           1    2340966967.0      2340966967      2340966967  addVectorsInto                                                                  


// CUDA memory operation statistics
CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   76.7        68592896        2304         29771.2            1856          182400  [CUDA Unified Memory memcpy HtoD]                                               
   23.3        20873792         768         27179.4            1120          162528  [CUDA Unified Memory memcpy DtoH]                                               


CUDA Memory Operation Statistics (KiB)

            Total      Operations            Average            Minimum            Maximum  Name                                                                            
-----------------  --------------  -----------------  -----------------  -----------------  --------------------------------------------------------------------------------
         393216.0            2304              170.7              4.000             1020.0  [CUDA Unified Memory memcpy HtoD]                                               
         131072.0             768              170.7              4.000             1020.0  [CUDA Unified Memory memcpy DtoH]                                               



// Operating system runtime API statistics
Generating Operating System Runtime API Statistics...
Operating System Runtime API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   59.0      5356960711         275      19479857.1            1880       100123061  poll                                                                            
   39.7      3597806394         275      13082932.3           11515       100070833  sem_timedwait                                                                   
    1.1        96807561         591        163803.0            1059        18260461  ioctl                                                                           
    0.2        19965243          90        221836.0            1110         7001018  mmap                                                                            
    0.0          650772          77          8451.6            2165           29721  open64                                                                          
    0.0          110239           4         27559.8           22783           32333  pthread_create                                                                  
    0.0           93710          23          4074.3            1144           16278  fopen                                                                           
    0.0           81870          11          7442.7            3840           12373  write                                                                           
    0.0           79309           3         26436.3           19334           37494  fgets                                                                           
    0.0           48080          14          3434.3            1709            7583  munmap                                                                          
    0.0           28011           5          5602.2            2484            9028  open                                                                            
    0.0           26373          16          1648.3            1017            3255  fclose                                                                          
    0.0           23202          10          2320.2            1116            3582  read                                                                            
    0.0           11441           3          3813.7            3608            4173  pipe2                                                                           
    0.0           10093           2          5046.5            4604            5489  socket                                                                          
    0.0            6277           3          2092.3            1274            3694  fcntl                                                                           
    0.0            6240           4          1560.0            1251            1727  mprotect                                                                        
    0.0            6000           1          6000.0            6000            6000  connect                                                                         
    0.0            5791           2          2895.5            2299            3492  fread                                                                           
    0.0            1950           1          1950.0            1950            1950  bind                                                                            
    0.0            1266           1          1266.0            1266            1266  listen                                                                          




Generating NVTX Push-Pop Range Statistics...
NVTX Push-Pop Range Statistics (nanoseconds)

**/
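For orientation, the API calls in the statistics above (three `cudaMallocManaged`, one `cudaLaunchKernel`, one `cudaDeviceSynchronize`, three `cudaFree`) are what a single-thread vector add on Unified Memory would produce. The following is a minimal sketch of such a program, not the source that was actually profiled. The kernel name `addVectorsInto` and the success message come from the report; the vector length `N = 2 << 24`, the helpers `initWith` and `allEqual`, and the initial values are assumptions. The length was picked because three float vectors of that size occupy 3 × 128 MiB = 393216 KiB, which matches the HtoD total, and one vector read back on the host is 131072 KiB, which matches the DtoH total.

```cudac++
#include <cstdio>
#include <cuda_runtime.h>

// Fill a vector on the host. Touching the managed pages on the CPU first
// means the kernel's later accesses have to migrate them to the GPU (HtoD).
void initWith(float value, float *v, int n) {
  for (int i = 0; i < n; ++i) v[i] = value;
}

// Intended to be launched as <<<1, 1>>>: one GPU thread walks the whole vector.
__global__ void addVectorsInto(float *result, const float *a, const float *b, int n) {
  for (int i = 0; i < n; ++i) result[i] = a[i] + b[i];
}

// Host-side check; reading the result on the CPU migrates it back (DtoH).
bool allEqual(float target, const float *v, int n) {
  for (int i = 0; i < n; ++i)
    if (v[i] != target) return false;
  return true;
}

int main() {
  const int N = 2 << 24;                  // assumed length: 33,554,432 floats
  const size_t size = N * sizeof(float);  // 128 MiB per vector

  float *a, *b, *c;
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  initWith(3.0f, a, N);
  initWith(4.0f, b, N);
  initWith(0.0f, c, N);

  addVectorsInto<<<1, 1>>>(c, a, b, N);

  // Kernel launches are asynchronous; wait here and surface any error.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  if (allEqual(7.0f, c, N)) printf("Success! All values calculated correctly.\n");

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}
```

Read this way, the 89.9% of API time spent in `cudaDeviceSynchronize` is simply the host waiting for the kernel: its 2340979588 ns is almost identical to the kernel's 2340966967 ns.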

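## Performance analysis
`nsys profile --stats=true ./file`
Profiling the kernel with the execution configuration `<<<n, m>>>` (n is the number of thread blocks, m is the number of threads per block) showed:
- Within the configurations tried, larger n and m give shorter run times: `<<<10, 10>>>` finishes in roughly 200 million ns, versus about 2.34 billion ns for the single-thread run profiled above.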
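- The two parameters do not contribute equally: `<<<1, 10>>>` took about 600 million ns and `<<<10, 1>>>` about 800 million ns, so in these runs raising the threads per block (m) helped more than raising the block count (n).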
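
A comparison like the one above needs a kernel that stays correct for any `<<<n, m>>>`. A grid-stride loop is one common way to get that. The sketch below assumes this pattern, since the notes do not show the kernel that was actually measured; the command-line handling, the helper `parsePositive`, the default of `10 10`, and the vector length are all illustrative.

```cudac++
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid, so any <<<n, m>>> covers all elements.
__global__ void addVectorsInto(float *result, const float *a, const float *b, int count) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  for (int i = index; i < count; i += stride) result[i] = a[i] + b[i];
}

// Hypothetical helper: parse a positive integer, or return `fallback`.
// Values are capped at 32768 so that n * m always fits in an int.
int parsePositive(const char *text, int fallback) {
  if (text == nullptr) return fallback;
  long value = strtol(text, nullptr, 10);
  return (value > 0 && value <= 32768) ? static_cast<int>(value) : fallback;
}

int main(int argc, char **argv) {
  int n = parsePositive(argc > 1 ? argv[1] : nullptr, 10);  // thread blocks
  int m = parsePositive(argc > 2 ? argv[2] : nullptr, 10);  // threads per block

  const int N = 2 << 24;  // assumed vector length
  const size_t size = N * sizeof(float);
  float *a, *b, *c;
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);
  for (int i = 0; i < N; ++i) { a[i] = 3.0f; b[i] = 4.0f; c[i] = 0.0f; }

  addVectorsInto<<<n, m>>>(c, a, b, N);
  cudaError_t launchErr = cudaGetLastError();     // e.g. m exceeds the device limit
  cudaError_t syncErr = cudaDeviceSynchronize();  // errors raised while running
  if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n",
            cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
    return 1;
  }

  // Verify on the host that every element was written exactly as expected.
  bool ok = true;
  for (int i = 0; i < N && ok; ++i) ok = (c[i] == 7.0f);
  printf("<<<%d, %d>>>: %s\n", n, m, ok ? "correct" : "WRONG");

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return ok ? 0 : 1;
}
```

Assuming the binary is built as `vector-add` (a placeholder name), running `nsys profile --stats=true ./vector-add 1 10` and then `./vector-add 10 1` under the same command lets the kernel's Total Time in the CUDA Kernel Statistics table be compared directly.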


## Streaming multiprocessors (SMs) and querying the GPU configuration

### Querying GPU information
```cudac++
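int deviceId;
cudaGetDevice(&deviceId);
// deviceId is the index of the currently active GPU

cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceId);
// cudaDeviceProp is a built-in struct (https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp) whose fields describe the GPU

// Examples of fields:
int smCount = props.multiProcessorCount;  // number of streaming multiprocessors
int ccMajor = props.major;                // compute capability, major version
int ccMinor = props.minor;                // compute capability, minor version
int threadsPerWarp = props.warpSize;      // threads per warp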
```
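
One common use of these properties is to size the execution configuration to the hardware. The sketch below assumes a simple heuristic that these notes do not state: a block size that is a multiple of the warp size and a block count that is a multiple of the SM count. `LaunchConfig`, `chooseConfig`, and the multipliers 32 and 8 are hypothetical names and values chosen for illustration, not tuned recommendations.

```cudac++
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for an execution configuration <<<blocks, threadsPerBlock>>>.
struct LaunchConfig {
  int blocks;
  int threadsPerBlock;
};

// Assumed heuristic: a whole number of warps per block, and a whole number
// of blocks per SM, so no warp is partially filled and no SM is left idle.
LaunchConfig chooseConfig(int smCount, int warpSize,
                          int blocksPerSm = 32, int warpsPerBlock = 8) {
  return { smCount * blocksPerSm, warpSize * warpsPerBlock };
}

int main() {
  int deviceId;
  if (cudaGetDevice(&deviceId) != cudaSuccess) {
    fprintf(stderr, "No usable CUDA device found.\n");
    return 1;
  }

  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, deviceId);

  printf("Device %d: %s\n", deviceId, props.name);
  printf("SMs: %d, compute capability: %d.%d, warp size: %d\n",
         props.multiProcessorCount, props.major, props.minor, props.warpSize);

  LaunchConfig cfg = chooseConfig(props.multiProcessorCount, props.warpSize);
  printf("Suggested launch: <<<%d, %d>>>\n", cfg.blocks, cfg.threadsPerBlock);
  return 0;
}
```

Whether such a configuration is actually faster is something to confirm with `nsys profile --stats=true`, in the same way as the measurements earlier in these notes.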