# CUDA Programming Model: Atomic Operations

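#### Atomic Operations
An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. Some functions, such as atomicAdd() and atomicCAS() below, also accept 16-bit operands.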

1. atomicAdd()    
Reads the 16-bit, 32-bit, or 64-bit word `old` located at the address `address` in global or shared memory, computes `(old + val)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`.


2. atomicSub()   
Reads the 32-bit word `old` located at the address `address` in global or shared memory, computes `(old - val)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`.

3. atomicExch()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory and stores `val` back to memory at the same address. These two operations are performed in one atomic transaction. The function returns `old`.

4. atomicMin()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory, computes the minimum of `old` and `val`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. The 64-bit version of atomicMin() is only supported by devices of compute capability 3.5 and higher.

5. atomicMax()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory, computes the maximum of `old` and `val`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. The 64-bit version of atomicMax() is only supported by devices of compute capability 3.5 and higher.

6. atomicInc()  
Reads the 32-bit word `old` located at the address `address` in global or shared memory, computes `((old >= val) ? 0 : (old+1))`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`.

7. atomicDec()    
Reads the 32-bit word `old` located at the address `address` in global or shared memory, computes `(((old == 0) || (old > val)) ? val : (old-1))`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`.

8. atomicCAS()  
Reads the 16-bit, 32-bit, or 64-bit word `old` located at the address `address` in global or shared memory, computes `(old == compare ? val : old)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. CAS stands for Compare And Swap.


9. atomicAnd()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory, computes `(old & val)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. The 64-bit version of atomicAnd() is only supported by devices of compute capability 3.5 and higher.

10. atomicOr()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory, computes `(old | val)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. The 64-bit version of atomicOr() is only supported by devices of compute capability 3.5 and higher.

11. atomicXor()  
Reads the 32-bit or 64-bit word `old` located at the address `address` in global or shared memory, computes `(old ^ val)`, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns `old`. The 64-bit version of atomicXor() is only supported by devices of compute capability 3.5 and higher.
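
Every function in the list above returns the value that was stored before the update, which is easy to overlook. The following is a minimal sketch that makes this visible: a single thread applies each function in turn to one `unsigned int` word and records what each call returns. The names `atomicSemantics` and `runAtomicSemantics`, the starting value 10, and the operands are assumptions chosen for illustration only.

```cuda
#include <cuda_runtime.h>

// Single-thread kernel: applies each atomic function to the same word in turn
// and records the value each call returns (the value before the update).
__global__ void atomicSemantics(unsigned int *word, unsigned int *returned)
{
    returned[0]  = atomicAdd(word, 5u);       // 10 -> 15
    returned[1]  = atomicSub(word, 3u);       // 15 -> 12
    returned[2]  = atomicExch(word, 7u);      // 12 -> 7
    returned[3]  = atomicMin(word, 4u);       // 7  -> 4
    returned[4]  = atomicMax(word, 9u);       // 4  -> 9
    returned[5]  = atomicInc(word, 9u);       // old >= val, so wraps to 0
    returned[6]  = atomicDec(word, 6u);       // old == 0, so becomes val = 6
    returned[7]  = atomicCAS(word, 6u, 12u);  // old == compare, so becomes 12
    returned[8]  = atomicAnd(word, 10u);      // 12 & 10 = 8
    returned[9]  = atomicOr(word, 3u);        // 8 | 3 = 11
    returned[10] = atomicXor(word, 1u);       // 11 ^ 1 = 10
}

// Hypothetical host helper: starts the word at 10, runs the kernel, fills
// returned[0..10], and gives back the final word (or -1 on a CUDA error).
int runAtomicSemantics(unsigned int *returned)
{
    const int count = 11;
    unsigned int hostWord = 10u;
    unsigned int *devWord = nullptr;
    unsigned int *devReturned = nullptr;

    if (cudaMalloc(&devWord, sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&devReturned, count * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(devWord);
        return -1;
    }

    cudaMemcpy(devWord, &hostWord, sizeof(unsigned int), cudaMemcpyHostToDevice);
    atomicSemantics<<<1, 1>>>(devWord, devReturned);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(returned, devReturned,
                                 count * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(&hostWord, devWord, sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);

    cudaFree(devWord);
    cudaFree(devReturned);
    return (err == cudaSuccess) ? static_cast<int>(hostWord) : -1;
}
```

Called from `main` with an 11-element array, the helper should fill it with 10, 15, 12, 7, 4, 9, 0, 6, 12, 8, 11 and return 10, following the formulas in the list. With many threads the order of updates is not fixed, so only the final result of commutative operations such as addition is predictable.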

![atomic+](atomic.png)
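
atomicCAS() is commonly used as a building block for atomic operations that have no built-in function. As a hedged sketch of that pattern, the code below builds a maximum for `float` values out of a compare-and-swap retry loop. The names `atomicMaxFloat`, `maxKernel`, and `maxWithCas` are hypothetical, the launch configuration is arbitrary, and the code assumes `n > 0` and that the input contains no NaN values.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical helper: atomic maximum for float, built on a 32-bit atomicCAS.
// Returns the last value observed at the address before this call's update.
__device__ float atomicMaxFloat(float *address, float val)
{
    int *addressAsInt = reinterpret_cast<int *>(address);
    int old = *addressAsInt;
    int assumed;
    do {
        assumed = old;
        // Nothing to do if the stored value is already at least val.
        if (__int_as_float(assumed) >= val) break;
        // Swap in val only if the word still holds the value we read.
        old = atomicCAS(addressAsInt, assumed, __float_as_int(val));
    } while (assumed != old);  // another thread changed the word: retry
    return __int_as_float(old);
}

// Grid-stride loop: every element is folded into *result atomically.
__global__ void maxKernel(const float *data, int n, float *result)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicMaxFloat(result, data[i]);
}

// Hypothetical host helper: returns true on success and writes the maximum
// of hostData[0..n-1] to *hostMax.
bool maxWithCas(const float *hostData, int n, float *hostMax)
{
    float init = -FLT_MAX;  // smaller than any finite input
    float *devData = nullptr;
    float *devResult = nullptr;

    if (cudaMalloc(&devData, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&devResult, sizeof(float)) != cudaSuccess) {
        cudaFree(devData);
        return false;
    }

    cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devResult, &init, sizeof(float), cudaMemcpyHostToDevice);

    maxKernel<<<64, 256>>>(devData, n, devResult);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(hostMax, devResult, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devData);
    cudaFree(devResult);
    return err == cudaSuccess;
}
```

The comparison in the loop condition is done on the integer bit patterns, so the loop ends as soon as one compare-and-swap succeeds. Calling the atomic once per element is simple but creates heavy contention on a single address; it is meant to show the pattern, not to be fast.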

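Next, let's work through the following example:  
Given an array A containing 1,000,000 elements of type int, compute the sum of all its elements:  
Input: A[1000000]  
Output: output (the sum of all elements in A)  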

Complete the example above in [sum.cu](sum.cu). If you get stuck, refer to [result_sum.cu](result_sum.cu).
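
For orientation, here is one possible shape of a solution, given as a sketch only. It is not the contents of result_sum.cu, which is not shown here. Each thread sums a strided slice of the array, the threads of a block combine their partial sums with a shared-memory atomicAdd(), and one thread per block adds the block total to the global result. The names `sumKernel` and `sumOnGpu` and the launch configuration are assumptions, and the total is assumed to fit in an `int`.

```cuda
#include <cuda_runtime.h>

__global__ void sumKernel(const int *input, int n, int *output)
{
    __shared__ int blockSum;
    if (threadIdx.x == 0) blockSum = 0;
    __syncthreads();

    // Each thread accumulates a private partial sum over a grid-stride loop.
    int local = 0;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += input[i];

    // Combine the partial sums of this block in shared memory.
    atomicAdd(&blockSum, local);
    __syncthreads();

    // One global atomic per block instead of one per element.
    if (threadIdx.x == 0) atomicAdd(output, blockSum);
}

// Hypothetical host helper: returns true on success and writes the sum of
// hostInput[0..n-1] to *hostOutput.
bool sumOnGpu(const int *hostInput, int n, int *hostOutput)
{
    int *devInput = nullptr;
    int *devOutput = nullptr;

    if (cudaMalloc(&devInput, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&devOutput, sizeof(int)) != cudaSuccess) {
        cudaFree(devInput);
        return false;
    }

    cudaMemcpy(devInput, hostInput, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(devOutput, 0, sizeof(int));  // the accumulator must start at 0

    sumKernel<<<128, 256>>>(devInput, n, devOutput);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(hostOutput, devOutput, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devInput);
    cudaFree(devOutput);
    return err == cudaSuccess;
}
```

A simpler variant calls `atomicAdd(output, input[i])` once per element. That is also correct, but all 1,000,000 updates then contend for the same address, which is why this sketch reduces within the thread and the block first.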


Compile and run the program:

In [None]:
!make

In [None]:
!./sum

Use nvprof to profile the program:

In [None]:
!sudo /usr/local/cuda/bin/nvprof ./sum

Homework:
1. Given an array A[1000000], find its maximum and minimum values.
2. Given an array A[1000000], find its ten largest values.