# Introduction to GPGPU computing

# GPU Fanatics - Gamers

![gta_vc.png](attachment:gta_vc.png)


# What is a GPU?

* ## A microprocessor, just like a CPU

* ## Performs fast math calculations to render images on a computer screen

* ## Composed of memory, capacitors, power supply, etc.

### A GPU has everything it needs to be programmable. What makes it different from a CPU?

# CPU vs GPU

![cpu_vs_gpu.png](attachment:cpu_vs_gpu.png)

# GPU Types

* ## Manufacturers
  * ### Nvidia
  * ### AMD
  * ### Intel
  
* ## Integration into computer
  * ### Integrated GPU or shared graphics
  * ### Discrete GPU or dedicated graphics


# Applications

* ## Deep learning
* ## Signal processing
* ## Cryptography
* ## Scientific computing
* ## etc.

# Parallel Computing Platforms

* ## CUDA (Nvidia) - Most popular and easiest to learn
* ## OpenCL - Cross-platform
* ## OpenACC - Youngest



# Nvidia GPUs

* ## GeForce - Gaming
* ## Quadro - Professional graphics
* ## Nvidia Titan - Design and research
* ## Tesla - HPC and data centers

# Multithreading on GPU

* ## Threads on a GPU are grouped into thread blocks
  * ### Threads within the same block can communicate easily
  * ### Limited to 1024 threads per block on GPUs with compute capability 2.0 and later
  * ### Older GPUs (compute capability 1.x) allow only 512 threads per block

* ## Each thread block is organised as a 1D, 2D or 3D array of threads
  * ### $x \times y \times z \leq$ maximum number of threads per block
  * ### 1D indexing: ```i = blockIdx.x * blockDim.x + threadIdx.x```

* ## Thread blocks are organised into grids
  * ### Thread blocks can be organised into one-, two- or three-dimensional grids
  * ### The number of blocks per grid can be up to $65{,}535$ per dimension (the x dimension allows up to $2^{31}-1$ on compute capability 3.0 and later)
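
The 1D indexing formula and the block count needed to cover an array can be checked on the CPU before writing any kernel. The following is a minimal sketch in plain Python; the helper names `blocks_needed` and `global_indices` are hypothetical and only simulate what each GPU thread would compute.

```python
def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: the smallest number of blocks that covers every element."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_indices(n_elements, threads_per_block):
    """List the global index computed by each (block, thread) pair, in launch order."""
    covered = []
    for block_idx in range(blocks_needed(n_elements, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Same formula as i = blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n_elements:  # threads past the end of the array do nothing
                covered.append(i)
    return covered


print(blocks_needed(1000, 512))
print(global_indices(10, 4))
```

The last block is usually only partly used, which is why a kernel needs the bounds check `i < n_elements`.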

# GPU Programming

* ## Code sequence
  * ### Load data from CPU RAM into GPU RAM
  * ### Perform computation
  * ### Unload data to CPU RAM
  
* ## Loading and unloading take time, so the computation on the GPU must be worth it

* ## Avoid repeated memory allocations on the GPU. Pre-allocate if possible
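
The load, compute, unload sequence can be sketched with Numba's explicit transfer calls `cuda.to_device` and `copy_to_host`. This is one possible arrangement, not the only one: the names `add_one_kernel`, `add_one_repeated` and `GPU_AVAILABLE`, the default of 256 threads per block, and the CPU fallback are all assumptions made for illustration. The device buffer is allocated once and reused for every kernel launch, so the data crosses the bus only twice.

```python
import numpy as np

try:
    from numba import cuda
    GPU_AVAILABLE = cuda.is_available()
except ImportError:  # Numba is not installed
    GPU_AVAILABLE = False

if GPU_AVAILABLE:
    @cuda.jit
    def add_one_kernel(arr):
        idx = cuda.grid(1)      # global 1D thread index
        if idx < arr.size:      # skip threads past the end of the array
            arr[idx] += 1


def add_one_repeated(host_array, repeats, threads_per_block=256):
    """Add 1 to every element `repeats` times with a single round trip."""
    if not GPU_AVAILABLE:
        # CPU fallback (an addition for this sketch) so it runs without a GPU.
        return host_array + host_array.dtype.type(repeats)

    blocks_per_grid = (host_array.size + threads_per_block - 1) // threads_per_block
    d_array = cuda.to_device(host_array)   # load: allocate once, copy CPU -> GPU
    for _ in range(repeats):               # compute: reuse the same device buffer
        add_one_kernel[blocks_per_grid, threads_per_block](d_array)
    return d_array.copy_to_host()          # unload: copy GPU -> CPU


A = np.ones(1_000_000, dtype=np.float32)
result = add_one_repeated(A, repeats=10)
print(result[:3])
```

Every element of `result` should end up as 11, while `A` on the host is left unchanged because the kernel only modifies the device copy.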

# GPU Programming with Python

* ## Download and install [CUDA](https://developer.nvidia.com/cuda-downloads) 
* ## Download and install [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
* ## Install Numba: ``` conda install numba ```


# GPU vs CPU Example

In [65]:
# Let's do some imports
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

# N = 1000000000

# This is the CPU version: a plain Python loop that adds 1 to each element in place.
def add_cpu(A):
    for i in range(len(A)):
        A[i] += 1
    return A

# This is the GPU version as a vectorized ufunc: the body runs once per element.
@vectorize(['float32(float32)'], target="cuda")
def add_gpu(a):
    return a + 1

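# This is the GPU version as an explicit CUDA kernel: one thread per element.
@cuda.jit
def add_gpu2(A, N):
    idx = cuda.grid(1)  # global 1D thread index
    if idx < N:         # guard threads that fall past the end of the array
        A[idx] += 1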
        
def main(N):
  A = np.ones(N, dtype=np.float32)
  
  average_cpu_time = 0
  average_gpu_time = 0
  average_gpu2_time = 0

  for i in range(20):
      # Time the CPU function
      start = timer()
      add_cpu(A)
      cpu_time = timer() - start
      average_cpu_time += cpu_time
    
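      # Time the explicit CUDA kernel
      threads_per_block = 512
      # Round up so that every element is covered by exactly one thread
      blocks_per_grid = (A.size + (threads_per_block - 1)) // threads_per_block
      start = timer()
      add_gpu2[blocks_per_grid, threads_per_block](A, N)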
      gpu_time2 = timer() - start
      average_gpu2_time += gpu_time2
    
      # Time the GPU function
      start = timer()
      add_gpu(A)
      gpu_time = timer() - start
      average_gpu_time += gpu_time

      
 
  average_cpu_time = average_cpu_time/20.
  average_gpu_time = average_gpu_time/20.
  average_gpu2_time = average_gpu2_time/20.
    
  # Report times
  print("CPU function took %f seconds." % average_cpu_time)
  print("GPU function took %f seconds." % average_gpu_time)
  print("GPU function 2 took %f seconds." % average_gpu2_time)

In [67]:
N = 100
main(N)

CPU function took 0.000285 seconds.
GPU function took 0.001125 seconds.
GPU function 2 took 0.000943 seconds.


In [69]:
N = 10000
main(N)

CPU function took 0.025211 seconds.
GPU function took 0.001086 seconds.
GPU function 2 took 0.001229 seconds.


In [72]:
N = 100000
main(N)

CPU function took 0.265259 seconds.
GPU function took 0.001941 seconds.
GPU function 2 took 0.003161 seconds.
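
To make the trend in the three runs easier to read, the reported averages can be turned into speedup factors (CPU time divided by GPU time). The sketch below only reuses the timings printed above; the helper name `speedups` is hypothetical, and the numbers will differ on other hardware.

```python
# Average timings in seconds, copied from the runs above:
# N -> (CPU, GPU vectorize, GPU kernel)
timings = {
    100: (0.000285, 0.001125, 0.000943),
    10_000: (0.025211, 0.001086, 0.001229),
    100_000: (0.265259, 0.001941, 0.003161),
}


def speedups(timings):
    """Return N -> (speedup of vectorize version, speedup of kernel version)."""
    return {
        n: (cpu / gpu_vec, cpu / gpu_kernel)
        for n, (cpu, gpu_vec, gpu_kernel) in timings.items()
    }


for n, (s_vec, s_kernel) in speedups(timings).items():
    print(f"N={n:>7}: vectorize {s_vec:7.2f}x, kernel {s_kernel:7.2f}x")
```

A factor below 1 means the GPU version was slower. In these runs that happens only for N = 100, where transfer and launch overhead outweigh the work, while for the larger arrays the GPU versions are faster by a widening margin.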
