<a href="https://colab.research.google.com/github/nile649/CUDA_Tutorials/blob/master/cuda_chp_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Check the CPU features: !lscpu

In [1]:
!lscpu


Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:            3
CPU MHz:             2000.172
BogoMIPS:            4000.34
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_si

Check free memory (in GiB): !free -g

In [2]:
!free -g

              total        used        free      shared  buff/cache   available
Mem:             12           0          10           0           1          11
Swap:             0           0           0
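
`free -g` reports whole gibibytes, which is why small values show up as 0 above. As a rough sketch of how to get finer numbers from Python on Linux, the helper below parses text in the `/proc/meminfo` layout. The function names `parse_meminfo` and `kib_to_gib` are made up for this example, and the sample text contains illustrative numbers, not real output from this machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are in kB (1024-byte units); a few, such as
    HugePages_Total, are plain counts with no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Key:   <integer> [unit]" lines.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def kib_to_gib(kib):
    """Convert a value in KiB to GiB."""
    return kib / 1024**2


# Illustrative input in the /proc/meminfo layout (made-up numbers).
sample = """MemTotal:       13333596 kB
MemFree:        10824520 kB
MemAvailable:   12513436 kB
"""

info = parse_meminfo(sample)
print("Total:     {:.2f} GiB".format(kib_to_gib(info["MemTotal"])))
print("Available: {:.2f} GiB".format(kib_to_gib(info["MemAvailable"])))
```

On a real Linux machine, the same function could be fed the contents of `/proc/meminfo`, for example `parse_meminfo(open("/proc/meminfo").read())`.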


Check the GPU card: !nvidia-smi

In [4]:
!nvidia-smi

Sun Oct 11 20:45:21 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setting up a C++ programming environment
In the case of Ubuntu Linux users, the
standard repository compilers and IDEs generally work and integrate perfectly with the
CUDA Toolkit, while Windows users might have to exercise a little more caution.



---



# Setting up GCC, Eclipse IDE, and graphical dependencies (Linux)
Open up a Terminal from the Ubuntu desktop (Ctrl + Alt + T). We first update the
apt repository as follows:

    sudo apt-get update

Now we can install everything we need for CUDA with one additional line:

    sudo apt-get install build-essential binutils gdb eclipse-cdt

Here, build-essential is the package with the gcc and g++ compilers, and other utilities
such as make; binutils has some generally useful utilities, such as the ld linker; gdb is
the debugger; and Eclipse is the IDE that we will be using.

Let's also install a few additional dependencies that will allow us to run some of the
graphical (OpenGL) demos included with the CUDA Toolkit with this line:

    sudo apt-get install freeglut3 freeglut3-dev libxi-dev libxmu-dev

Now you should be ready to install the CUDA Toolkit.
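
As a quick sanity check after the installation, the sketch below looks for the main command-line tools on the `PATH`. The tool list is an assumption based on the package descriptions above, and `find_tools` is a helper name made up for this example.

```python
import shutil

# Tools the packages above are expected to provide. This list is an
# assumption for the sketch, derived from the package descriptions.
REQUIRED_TOOLS = ["gcc", "g++", "make", "ld", "gdb"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print("{:<6} {}".format(tool, path if path else "NOT FOUND"))
```

Any tool reported as NOT FOUND suggests that the corresponding package did not install correctly or is not on the `PATH`.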

# Installing PyCUDA (Linux)


---
!pip install PyCUDA



In [5]:
!pip install PyCUDA

Collecting PyCUDA
  Downloading https://files.pythonhosted.org/packages/46/61/47d3235a4c13eec5a5f03594ddb268f4858734e02980afbcd806e6242fa5/pycuda-2020.1.tar.gz (1.6MB)
     |████████████████████████████████| 1.6MB 7.3MB/s 
Collecting pytools>=2011.2
  Downloading https://files.pythonhosted.org/packages/73/d5/989a1d2bba90f5c085e4929a4b703bbd8cc6b4a4218f1671fadab2abe966/pytools-2020.4.tar.gz (67kB)
     |████████████████████████████████| 71kB 10.0MB/s 
Collecting appdirs>=1.4.0
  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Collecting mako
  Downloading https://files.pythonhosted.org/packages/a6/37/0e706200d22172eb8fa17d68a7ae22dec7631a0a92266634fb518a88a5b2/Mako-1.1.3-py2.py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 12.2MB/s 
Building wheels for collected packages: PyCUDA, pytools
  Building wheel for PyCUDA (setup.py) ... 

Code to check the GPU specification:

In [7]:
import pycuda
import pycuda.driver as drv
drv.init()

print('CUDA device query (PyCUDA version) \n')

print('Detected {} CUDA Capable device(s) \n'.format(drv.Device.count()))

for i in range(drv.Device.count()):
    
    gpu_device = drv.Device(i)
    print('Device {}: {}'.format( i, gpu_device.name() ))
    compute_capability = float( '%d.%d' % gpu_device.compute_capability() )
    print('\t Compute Capability: {}'.format(compute_capability))
    print('\t Total Memory: {} megabytes'.format(gpu_device.total_memory()//(1024**2)))
    
    # The following will give us all remaining device attributes as seen 
    # in the original deviceQuery.
    # We set up a dictionary as such so that we can easily index
    # the values using a string descriptor.
    
    device_attributes_tuples = gpu_device.get_attributes().items() 
    device_attributes = {}
    
    for k, v in device_attributes_tuples:
        device_attributes[str(k)] = v
    
    num_mp = device_attributes['MULTIPROCESSOR_COUNT']
    
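    # Cores per multiprocessor is not reported by the GPU!
    # We must use a lookup table based on compute capability.
    # See the following:
    # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
    # The table only covers compute capabilities 5.0 to 6.2, so we use .get(),
    # which returns None for other devices instead of raising a KeyError.
    
    cuda_cores_per_mp = { 5.0 : 128, 5.1 : 128, 5.2 : 128, 6.0 : 64, 6.1 : 128, 6.2 : 128}.get(compute_capability)
    
    if cuda_cores_per_mp is None:
        print('\t ({}) Multiprocessors, CUDA Cores / Multiprocessor unknown for compute capability {}'.format(num_mp, compute_capability))
    else:
        print('\t ({}) Multiprocessors, ({}) CUDA Cores / Multiprocessor: {} CUDA Cores'.format(num_mp, cuda_cores_per_mp, num_mp*cuda_cores_per_mp))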
    
    device_attributes.pop('MULTIPROCESSOR_COUNT')
    
    for k in device_attributes.keys():
        print('\t {}: {}'.format(k, device_attributes[k]))

CUDA device query (PyCUDA version) 

Detected 1 CUDA Capable device(s) 

Device 0: Tesla P100-PCIE-16GB
	 Compute Capability: 6.0
	 Total Memory: 16280 megabytes
	 (56) Multiprocessors, (64) CUDA Cores / Multiprocessor: 3584 CUDA Cores
	 ASYNC_ENGINE_COUNT: 2
	 CAN_MAP_HOST_MEMORY: 1
	 CLOCK_RATE: 1328500
	 COMPUTE_CAPABILITY_MAJOR: 6
	 COMPUTE_CAPABILITY_MINOR: 0
	 COMPUTE_MODE: DEFAULT
	 CONCURRENT_KERNELS: 1
	 ECC_ENABLED: 1
	 GLOBAL_L1_CACHE_SUPPORTED: 1
	 GLOBAL_MEMORY_BUS_WIDTH: 4096
	 GPU_OVERLAP: 1
	 INTEGRATED: 0
	 KERNEL_EXEC_TIMEOUT: 0
	 L2_CACHE_SIZE: 4194304
	 LOCAL_L1_CACHE_SUPPORTED: 1
	 MANAGED_MEMORY: 1
	 MAXIMUM_SURFACE1D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE1D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE1D_WIDTH: 32768
	 MAXIMUM_SURFACE2D_HEIGHT: 65536
	 MAXIMUM_SURFACE2D_LAYERED_HEIGHT: 32768
	 MAXIMUM_SURFACE2D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE2D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE2D_WIDTH: 131072
	 MAXIMUM_SURFACE3D_DEPTH: 16384
	 MAXIMUM_SURFACE3D_HEIGHT: 1638

In [9]:
# Returns the number of CUDA-capable GPUs
drv.Device.count()

1

# Compute Capability


---


The compute capability (CC) describes the features supported by a piece of CUDA hardware. The first CUDA-capable hardware, such as the GeForce 8800 GTX, has a CC of 1.0, and later GeForce cards such as the GTX 480 have a CC of 2.0. Knowing the CC can be useful for understanding why a CUDA-based demo can't start on your system.

CUDA SDK 10.0 to 10.2 support compute capability 3.0 to 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). They are the last versions with support for compute capability 3.0 and 3.2 (part of Kepler).

Source: Wikipedia
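
To make the mapping concrete, here is a minimal sketch that translates a compute capability into an architecture name and checks it against the 3.0 to 7.5 range quoted above. The function names are made up for this example, and the table deliberately stops at the architectures named in the text.

```python
def architecture_name(major, minor):
    """Return the architecture name for compute capabilities 3.0 to 7.5.

    Only the architectures named in the text are covered; anything else
    is reported as "unknown".
    """
    if major == 3:
        return "Kepler"
    if major == 5:
        return "Maxwell"
    if major == 6:
        return "Pascal"
    if major == 7:
        # 7.5 is Turing; the other 7.x capabilities are Volta.
        return "Turing" if minor == 5 else "Volta"
    return "unknown"


def in_cuda_10_range(major, minor):
    """Check whether a compute capability lies in the 3.0 to 7.5 range."""
    return (3, 0) <= (major, minor) <= (7, 5)


# The Tesla P100 used in this notebook reports compute capability 6.0.
print(architecture_name(6, 0), in_cuda_10_range(6, 0))
```

With PyCUDA, the tuple returned by `gpu_device.compute_capability()` could be unpacked directly, for example `architecture_name(*gpu_device.compute_capability())`.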

In [12]:
# Compute Capability:
i=0
gpu_device = drv.Device(i)
print('Device {}: {}'.format( i, gpu_device.name() ))
compute_capability = float( '%d.%d' % gpu_device.compute_capability() )
print('\t Compute Capability: {}'.format(compute_capability))
print('\t Total Memory: {} megabytes'.format(gpu_device.total_memory()//(1024**2)))

Device 0: Tesla P100-PCIE-16GB
	 Compute Capability: 6.0
	 Total Memory: 16280 megabytes


In [19]:
x = gpu_device.total_memory()//1024 # total_memory() returns bytes; bytes -> KiB
x = x/1024 # KiB -> MiB
x/1024 # MiB -> GiB

15.8992919921875

Each streaming multiprocessor (SM) has a fixed number of CUDA cores.

The number of streaming multiprocessors is 56:

Each SM has 64 cores:

Which gives 56*64 = 3584 cores.

A higher core count does not by itself indicate better performance across different architectures.

Please refer to the following links:


---

https://www.extremetech.com/extreme/213519-asynchronous-shading-amd-nvidia-and-dx12-what-we-know-so-far

https://www.youtube.com/watch?v=JFhG9UntZs4&ab_channel=GregSalazar


---



In [20]:
56*64

3584
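
The same arithmetic can be wrapped in a small helper. This is only a sketch: `CORES_PER_SM` and `total_cuda_cores` are names made up for this example, and the table reuses the cores-per-multiprocessor values from the device-query cell earlier in the notebook, so it covers compute capabilities 5.0 to 6.2 only.

```python
# CUDA cores per streaming multiprocessor, keyed by (major, minor) compute
# capability. Values are copied from the lookup table in the device-query
# cell; other compute capabilities are intentionally not covered.
CORES_PER_SM = {
    (5, 0): 128, (5, 1): 128, (5, 2): 128,
    (6, 0): 64, (6, 1): 128, (6, 2): 128,
}


def total_cuda_cores(compute_capability, sm_count):
    """Return SM count times cores per SM, or None if the CC is not in the table."""
    cores_per_sm = CORES_PER_SM.get(tuple(compute_capability))
    if cores_per_sm is None:
        return None
    return cores_per_sm * sm_count


# Tesla P100: compute capability 6.0 with 56 multiprocessors.
print(total_cuda_cores((6, 0), 56))  # 3584
```

Tuple keys avoid comparing floating-point compute capabilities. With PyCUDA, the SM count could be read with `gpu_device.get_attribute(drv.device_attribute.MULTIPROCESSOR_COUNT)` and the capability with `gpu_device.compute_capability()`.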