<a href="https://colab.research.google.com/github/nile649/CUDA_Tutorials/blob/master/cuda_chp_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#3 
## Getting Started with PyCUDA


---
We will start
by learning how to use PyCUDA for some basic and fundamental operations. We will first
see how to query our GPU—that is, we will start by writing a small Python program that
will tell us what the characteristics of our GPU are, such as the core count, architecture, and
memory. We will then spend some time getting acquainted with how to transfer memory
between Python and the GPU with PyCUDA's gpuarray class and how to use this class for
basic computations. The remainder of this chapter will be spent showing how to write some
basic functions (which we will refer to as CUDA Kernels) that we can directly launch onto
the GPU.

The learning outcomes for this chapter are as follows:
1. Determining GPU characteristics, such as memory capacity or core count, using
PyCUDA
2. Understanding the difference between host (CPU) and device (GPU) memory
and how to use PyCUDA's gpuarray class to transfer data between the host and
device
3. How to do basic calculations using only gpuarray objects
4. How to perform basic element-wise operations on the GPU with
PyCUDA's ElementwiseKernel class
5. Understanding the functional programming concept of reduce/scan operations
and how to make a basic reduction or scan CUDA kernel








In [2]:
!lscpu


Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2300.000
BogoMIPS:            4600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs 

Check free memory : !free -g

In [3]:
!free -g

              total        used        free      shared  buff/cache   available
Mem:             12           0          10           0           1          11
Swap:             0           0           0


Check the GPU card

In [2]:
!nvidia-smi

Mon Oct 12 21:06:35 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Querying your GPU

# Installing PyCUDA (Linux)


---
!pip install PyCUDA



In [3]:
!pip install PyCUDA

Collecting PyCUDA
  Downloading https://files.pythonhosted.org/packages/46/61/47d3235a4c13eec5a5f03594ddb268f4858734e02980afbcd806e6242fa5/pycuda-2020.1.tar.gz (1.6MB)
     |████████████████████████████████| 1.6MB 8.8MB/s 
Collecting pytools>=2011.2
  Downloading https://files.pythonhosted.org/packages/73/d5/989a1d2bba90f5c085e4929a4b703bbd8cc6b4a4218f1671fadab2abe966/pytools-2020.4.tar.gz (67kB)
     |████████████████████████████████| 71kB 10.0MB/s 
Collecting appdirs>=1.4.0
  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Collecting mako
  Downloading https://files.pythonhosted.org/packages/a6/37/0e706200d22172eb8fa17d68a7ae22dec7631a0a92266634fb518a88a5b2/Mako-1.1.3-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 9.2MB/s 
Building wheels for collected packages: PyCUDA, pytools
  Building wheel for PyCUDA (setup.py) ... 

In [4]:
!sudo apt update
!sudo add-apt-repository ppa:graphics-drivers
!sudo apt-key adv --fetch-keys  http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
!sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
!sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'

Get:1 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:4 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu b

In [5]:
!sudo apt install cuda-10-1
!sudo apt install libcudnn7


Reading package lists... Done
Building dependency tree       
Reading state information... Done
cuda-10-1 is already the newest version (10.1.243-1).
0 upgraded, 0 newly installed, 0 to remove and 17 not upgraded.
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/cuda_learn.list:1 and /etc/apt/sources.list.d/nvidia-ml.list:1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libcudnn7-dev
The following held packages will be changed:
  libcudnn7
The following packages will be upgraded:
  libcudnn7 libcudnn7-dev
2 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
W: Target Packages (Packages) is configured multiple times in /etc/apt/sources.list.d/cuda_learn.list:1 and /etc/apt/sources.list.d/nvidia-ml.list:1
E: Held packages were changed and -y was used without --allow-change-held-packages.


Code to check the GPU specification

In [6]:
import pycuda
import pycuda.driver as drv
drv.init()

print('CUDA device query (PyCUDA version) \n')

print('Detected {} CUDA Capable device(s) \n'.format(drv.Device.count()))

for i in range(drv.Device.count()):
    
    gpu_device = drv.Device(i)
    print('Device {}: {}'.format( i, gpu_device.name() ))
    compute_capability = float( '%d.%d' % gpu_device.compute_capability() )
    print('\t Compute Capability: {}'.format(compute_capability))
    print('\t Total Memory: {} megabytes'.format(gpu_device.total_memory()//(1024**2)))
    
    # The following will give us all remaining device attributes as seen 
    # in the original deviceQuery.
    # We set up a dictionary as such so that we can easily index
    # the values using a string descriptor.
    
    device_attributes_tuples = gpu_device.get_attributes().items() 
    device_attributes = {}
    
    for k, v in device_attributes_tuples:
        device_attributes[str(k)] = v
    
    num_mp = device_attributes['MULTIPROCESSOR_COUNT']
    
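    # Cores per multiprocessor is not reported by the GPU!  
    # We must use a lookup table based on compute capability.
    # See the following:
    # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
    # Note: the table below stops at compute capability 6.2, so a newer
    # device (such as the 7.5 Tesla T4 in the output) raises KeyError here.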
    
    cuda_cores_per_mp = { 5.0 : 128, 5.1 : 128, 5.2 : 128, 6.0 : 64, 6.1 : 128, 6.2 : 128}[compute_capability]
    
    print('\t ({}) Multiprocessors, ({}) CUDA Cores / Multiprocessor: {} CUDA Cores'.format(num_mp, cuda_cores_per_mp, num_mp*cuda_cores_per_mp))
    
    device_attributes.pop('MULTIPROCESSOR_COUNT')
    
    for k in device_attributes.keys():
        print('\t {}: {}'.format(k, device_attributes[k]))

CUDA device query (PyCUDA version) 

Detected 1 CUDA Capable device(s) 

Device 0: Tesla T4
	 Compute Capability: 7.5
	 Total Memory: 15079 megabytes


KeyError: ignored
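
The KeyError in the output above comes from the lookup table, which has no entry for compute capability 7.5. A minimal sketch of a more tolerant lookup follows. The names `CORES_PER_SM`, `cuda_cores_per_sm`, and `total_cuda_cores` are hypothetical helpers, not part of PyCUDA, and the table is keyed by the `(major, minor)` tuple rather than a float, with values taken from the compute capability table in NVIDIA's programming guide.

```python
# Cores per SM by (major, minor) compute capability.
# Assumption: values follow the table in NVIDIA's CUDA C Programming Guide;
# extend the dictionary for newer architectures as needed.
CORES_PER_SM = {
    (5, 0): 128, (5, 2): 128, (5, 3): 128,
    (6, 0): 64, (6, 1): 128, (6, 2): 128,
    (7, 0): 64, (7, 2): 64, (7, 5): 64,
}


def cuda_cores_per_sm(compute_capability):
    """Return cores per SM, or None when the capability is not in the table."""
    return CORES_PER_SM.get(tuple(compute_capability))


def total_cuda_cores(num_sm, compute_capability):
    """Return num_sm * cores-per-SM, or None for an unknown capability."""
    cores = cuda_cores_per_sm(compute_capability)
    return None if cores is None else num_sm * cores


print(cuda_cores_per_sm((7, 5)))     # 64
print(total_cuda_cores(56, (6, 0)))  # 3584
print(cuda_cores_per_sm((99, 0)))    # None (unknown capability, no exception)
```

With PyCUDA, the tuple returned by `gpu_device.compute_capability()` could be passed to these helpers directly, so the device query would print a placeholder instead of stopping on an unknown architecture.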

In [7]:
# Number of GPUs that support CUDA
drv.Device.count()

1

# Compute Capability


---


Compute capability describes the features supported by a piece of CUDA hardware. The first CUDA-capable hardware, such as the GeForce 8800 GTX, has a compute capability (CC) of 1.0, and later GeForce cards such as the GTX 480 have a CC of 2.0. Knowing the CC can be useful for understanding why a CUDA-based demo can't start on your system.

CUDA SDK 10.0 to 10.2 supports compute capabilities 3.0 to 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). The source describes this as the last SDK series with support for compute capability 3.x (Kepler).

Source: Wikipedia

In [8]:
# Compute Capability:
i=0
gpu_device = drv.Device(i)
print('Device {}: {}'.format( i, gpu_device.name() ))
compute_capability = float( '%d.%d' % gpu_device.compute_capability() )
print('\t Compute Capability: {}'.format(compute_capability))
print('\t Total Memory: {} megabytes'.format(gpu_device.total_memory()//(1024**2)))

Device 0: Tesla T4
	 Compute Capability: 7.5
	 Total Memory: 15079 megabytes


In [9]:
x = gpu_device.total_memory()//1024 # total_memory() returns bytes; convert to KiB
x = x/1024 # MiB
x/1024 # GiB

14.726318359375
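
The conversion above uses binary units, so the "megabytes" printed earlier are really MiB (1024**2 bytes). As a small sketch of the same arithmetic, the hypothetical helper `bytes_to_units` below converts a byte count in one step. The byte count is an assumed value, chosen only because it is consistent with the 15079 MiB and 14.726 GiB figures shown above.

```python
def bytes_to_units(num_bytes):
    """Return (MiB, GiB) for a byte count, using binary (1024-based) units."""
    mib = num_bytes / 1024**2
    gib = num_bytes / 1024**3
    return mib, gib


# Assumed byte count, consistent with the figures printed above.
total_bytes = 15_812_263_936
mib, gib = bytes_to_units(total_bytes)
print(f"{mib:.2f} MiB = {gib:.2f} GiB")  # 15079.75 MiB = 14.73 GiB
```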

Each streaming multiprocessor (SM) has a fixed number of CUDA cores.

For a device with 56 SMs:

Each SM has 64 cores:

This gives 56*64 = 3584 cores.

A higher core count does not by itself indicate better performance across different architectures.

Please refer to the following links:


---

https://www.extremetech.com/extreme/213519-asynchronous-shading-amd-nvidia-and-dx12-what-we-know-so-far

https://www.youtube.com/watch?v=JFhG9UntZs4&ab_channel=GregSalazar


---



In [None]:
56*64

3584

A GPU divides its individual cores up into larger units known as
Streaming Multiprocessors (SMs); a GPU device will have several SMs, each of
which has a particular number of CUDA cores, depending on the compute
capability of the device.

**To be clear: the number of cores per multiprocessor is not indicated
directly by the GPU; this is given to us implicitly by the compute capability.**

CUDA cores are not the same thing as CPU cores, and the number of CUDA cores
per SM depends on the compute capability (CC).






# Using PyCUDA's gpuarray class

Much like how NumPy's array class is the cornerstone of numerical programming within
the NumPy environment, PyCUDA's gpuarray class plays an analogously prominent role
within GPU programming in Python. This has all of the features you know and love from
NumPy—multidimensional vector/matrix/tensor shape structuring, array-slicing, array
unraveling, and overloaded operators for point-wise computations (for example, +, -, *, /,
and **).
gpuarray is really an indispensable tool for any budding GPU programmer. We will spend
this section going over this particular data structure and gaining a strong grasp of it before
we move on.

## Transferring data to and from the GPU with gpuarray



---
GPU memory is called global device memory, whereas CPU memory is called host memory. A gpuarray is essentially the NumPy array data structure for CUDA.

For the most part, we treat (global) device memory on the GPU as we do
dynamically allocated heap memory in C (with the malloc and free functions) or C++ (as
with the new and delete operators); in CUDA C, this is complicated further with the
additional task of transferring data back and forth between the CPU and the GPU (with
cudaMemcpy and direction flags such as cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost), all while keeping track of multiple pointers in both the CPU
and GPU space and performing proper memory allocations (cudaMalloc) and
deallocations (cudaFree).
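
To make the allocate/copy/free pattern concrete, here is a minimal sketch of the same steps through PyCUDA's driver module, which mirrors the CUDA C calls named above. The function name `manual_roundtrip` is a hypothetical helper; it assumes a CUDA-capable GPU and a C-contiguous NumPy array, and it imports PyCUDA lazily so that the definition itself can be loaded on a machine without a GPU.

```python
import numpy as np


def manual_roundtrip(host_data):
    """Copy a NumPy array to the GPU and back with explicit allocation."""
    # Lazy imports: importing pycuda.autoinit creates a CUDA context,
    # which needs a working GPU and driver.
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as drv

    device_ptr = drv.mem_alloc(host_data.nbytes)  # analogous to cudaMalloc
    drv.memcpy_htod(device_ptr, host_data)        # host -> device copy
    result = np.empty_like(host_data)
    drv.memcpy_dtoh(result, device_ptr)           # device -> host copy
    device_ptr.free()                             # analogous to cudaFree
    return result
```

On a machine with a GPU, `manual_roundtrip(np.arange(5, dtype=np.float32))` should return an array equal to its input. Every step here (allocation, both copies, and the free) is bookkeeping we would have to repeat for each buffer.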


---
Fortunately, PyCUDA covers all of the overhead of memory allocation, deallocation, and
data transfers with the gpuarray class. As stated, this class acts similarly to NumPy arrays,
using vector/ matrix/tensor shape structure information for the data. gpuarray objects
even perform automatic cleanup based on the lifetime, so we do not have to worry about
freeing any GPU memory stored in a gpuarray object when we are done with it.

---
How exactly do we use this to transfer data from the host to the GPU? First, we must
contain our host data in some form of NumPy array (let's call it host_data), and then use
the **gpuarray.to_gpu(host_data)** command to transfer this over to the GPU and create
a new GPU array.



---


Example


In [11]:
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray

host_data = np.array([1,2,3,4,5],dtype=np.float32)
device_data = gpuarray.to_gpu(host_data)
device_data_ = 19274*device_data
device_data_ = device_data_.get()
print(device_data_)

[19274. 38548. 57822. 77096. 96370.]


One thing to note is that we explicitly set the type of the host array to NumPy's
float32 with the dtype option when we set up our NumPy array; this corresponds
directly with the float type in C/C++. Generally speaking, it's a good idea to
specifically set data types with NumPy when we are sending data to the GPU. 

The reason for this is twofold: 

---



1. first, since we are using a GPU for increasing the
performance of our application, we don't want any unnecessary overhead of using an
unnecessary type that will possibly take up more computational time or memory, and

2. second, since we will soon be writing portions of code in inline CUDA C, we will have to be
very specific with types or our code won't work correctly, keeping in mind that C is a
statically-typed language.


---

**Remember to specifically set data types for NumPy arrays that will be
transferred to the GPU. This can be done with the dtype option in the
constructor of the numpy.array class.**
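
As a quick illustration of why this matters, the sketch below compares the type NumPy infers for floating-point literals with an explicitly requested float32. The variable names are illustrative only.

```python
import numpy as np

# Without dtype, NumPy infers double precision for floating-point literals.
default_floats = np.array([1.0, 2.0, 3.0])
# With dtype, we get the single-precision type that matches C's float.
single_floats = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print(default_floats.dtype, default_floats.itemsize)  # float64 8
print(single_floats.dtype, single_floats.itemsize)    # float32 4

# An existing array can be converted before it is sent to the GPU.
converted = default_floats.astype(np.float32)
print(converted.dtype)                                # float32
```

The float32 array takes half the memory of the float64 one, which also halves the amount of data that has to be copied to the device.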



---

# Basic pointwise arithmetic operations with gpuarray

---
Note that a pointwise operation is intrinsically parallelizable, and so when we use this
operation on a gpuarray object PyCUDA is able to offload each multiplication operation
onto its own thread, rather than computing each multiplication in serial, one after the other
(in fairness, some versions of NumPy can use the advanced SSE instructions found in
modern x86 chips for these computations, so in some cases the performance will be
comparable to a GPU). To be clear: these pointwise operations performed on the GPU are in
parallel since the computation of one element is not dependent on the computation of any
other element.


---
# Speed Test



In [12]:
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from time import time
host_data = np.float32( np.random.random(50000000) )

t1 = time()
host_data_2x = host_data * np.float32(2)
t2 = time()
print('total time to compute on CPU: %f' % (t2 - t1))
device_data = gpuarray.to_gpu(host_data)

total time to compute on CPU: 0.035629


In [13]:
t1 = time()
device_data_2x = device_data * np.float32( 2 )
t2 = time()
from_device = device_data_2x.get()
print('total time to compute on GPU: %f' % (t2 - t1))
print('Is the host computation the same as the GPU computation? :\
{}'.format(np.allclose(from_device, host_data_2x) ))

total time to compute on GPU: 0.001246
Is the host computation the same as the GPU computation? :True


In [14]:
def func():
  host_data = np.float32( np.random.random(50000000) )

  t1 = time()
  host_data_2x = host_data * np.float32(2)
  t2 = time()
  print('total time to compute on CPU: %f' % (t2 - t1))
  device_data = gpuarray.to_gpu(host_data)
  t1 = time()
  device_data_2x = device_data * np.float32( 2 )
  t2 = time()
  from_device = device_data_2x.get()
  print('total time to compute on GPU: %f' % (t2 - t1))
  print('Is the host computation the same as the GPU computation? :\
  {}'.format(np.allclose(from_device, host_data_2x) ))  

In [21]:
%load_ext line_profiler


In [23]:
%lprun -f func func()


total time to compute on CPU: 0.037419
total time to compute on GPU: 0.000987
Is the host computation the same as the GPU computation? :  True


**In PyCUDA, GPU code is often compiled at runtime with the
NVIDIA nvcc compiler and then subsequently called from PyCUDA. This
can lead to an unexpected slowdown, usually the first time a program or
GPU operation is run in a given Python session.**
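
One way to see this effect is to time the first call separately from later calls. The sketch below is a generic helper under stated assumptions: `time_first_and_rest` and `gpu_demo` are hypothetical names, and the optional `sync` callable is there because GPU kernel launches can return before the work has finished, so a fair measurement should wait for the device before reading the clock.

```python
from time import perf_counter


def time_first_and_rest(fn, repeats=3, sync=None):
    """Time fn() once 'cold', then `repeats` more times 'warm'.

    sync is an optional callable invoked before each clock read; on a GPU
    it would typically block until pending work has finished.
    """
    def timed_call():
        if sync is not None:
            sync()
        start = perf_counter()
        fn()
        if sync is not None:
            sync()
        return perf_counter() - start

    first = timed_call()
    rest = [timed_call() for _ in range(repeats)]
    return first, rest


def gpu_demo():
    """Hypothetical usage with PyCUDA; requires a CUDA-capable GPU."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv
    from pycuda import gpuarray

    device_data = gpuarray.to_gpu(np.float32(np.random.random(1_000_000)))
    return time_first_and_rest(
        lambda: device_data * np.float32(2),
        sync=drv.Context.synchronize,
    )
```

If the first time returned by `gpu_demo()` is much larger than the rest, the difference is likely the one-off compilation cost rather than the computation itself.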

# Using PyCUDA's ElementwiseKernel for performing pointwise computations