# Using a GPU

A quick reminder: `!` lets you run a terminal command in Jupyter

## Why GPUs

* GPUs are fast! Especially when running code in parallel
* But...
    * They do have limitations; an important one is memory
    
# Some things to consider

* TensorFlow has a CPU and a GPU version
* This is also true of several other packages
* We generally take care of that for you on Talapas
    * You'll have to install the correct one when using your own hardware
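
If you want to check from Python which build you ended up with, something like the sketch below should work. It assumes `tf.test.is_built_with_cuda()` is available in your TensorFlow version; the helper name `tensorflow_built_with_cuda` is made up for this example, and it returns `None` when TensorFlow isn't installed at all.

```python
def tensorflow_built_with_cuda():
    """Return True for a GPU (CUDA) build, False for a CPU-only build,
    or None if TensorFlow is not installed in this environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return bool(tf.test.is_built_with_cuda())


print("TensorFlow built with CUDA:", tensorflow_built_with_cuda())
```

A CUDA build alone doesn't guarantee a working GPU; you still need the driver and a visible device.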
    

A quick way to tell if you have the right GPU/Software installed is

## nvidia-smi
This is a super useful command to see what and how many GPUs are on a system

**Note:** This only works for NVIDIA GPUs, but most code you'll encounter is based on CUDA (which requires an NVIDIA GPU)
* There are some alternatives that we won't cover
    * OpenCL
    * ROCm


In [1]:
!nvidia-smi

Thu Sep 19 16:30:41 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   48C    P0    82W / 149W |      0MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

# Memory

All GPUs have a fixed amount of memory. If your GPU runs out of memory, your code will crash with an out-of-memory error
* Larger Batch Sizes = More Memory

On Talapas you should see
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   54C    P8    32W / 149W |      0MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```

So far we've used 0 of 12 GB of memory on our K80 GPU, and it isn't processing anything (Volatile GPU-Util is 0%)
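
If you'd rather read those two numbers from Python than from the table, `nvidia-smi` can print them as CSV with its `--query-gpu` and `--format=csv,noheader,nounits` options. The sketch below is one way to do that; the helper names (`parse_gpu_query`, `query_gpus`) are made up here, and the sample line just mirrors the idle K80 above.

```python
import shutil
import subprocess

QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_query(csv_text):
    """Parse one CSV line per GPU: used MiB, total MiB, utilization %.
    Assumes every field is a plain integer (no "[N/A]" entries)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used_mib, total_mib, util_pct = (int(v) for v in line.split(","))
        gpus.append({"used_mib": used_mib,
                     "total_mib": total_mib,
                     "util_pct": util_pct})
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output; returns [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


# Sample line matching the idle K80 shown above
print(parse_gpu_query("0, 12206, 0"))
print(query_gpus())
```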

In [2]:
import tensorflow as tf
import numpy as np

be_nice=False

if be_nice:
    # Ask TensorFlow to grow its GPU memory allocation as needed
    # instead of grabbing (almost) all of it up front
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    session = tf.Session(config=config)
    tf.keras.backend.set_session(session)



def build_silly_model():

    big_input=tf.keras.layers.Input((1000,1000,3)) # ~4 MB per channel * 3 channels = 12 MB image (fp32)
    big_cnn=tf.keras.layers.Conv2D(500,(1),padding='same')(big_input) # 500 output channels * ~4 MB each = ~2 GB per image
    average=tf.keras.layers.GlobalAveragePooling2D()(big_cnn) # 500 numbers, tiny in memory

    silly_model=tf.keras.models.Model([big_input],[average])
    silly_model.compile(loss='mse',optimizer='adam')
    return silly_model

silly_model=build_silly_model()
print(silly_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000, 1000, 3)     0         
_________________________________________________________________
conv2d (Conv2D)              (None, 1000, 1000, 500)   2000      
_________________________________________________________________
global_average_pooling2d (Gl (None, 500)               0         
Total params: 2,000
Trainable params: 2,000
Non-trainable params: 0
_________________________________________________________________
None


In [3]:
!nvidia-smi

Thu Sep 19 16:30:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   48C    P0    82W / 149W |      0MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

Keras hasn't put anything on the GPU yet

# Make a prediction with our silly model

In [4]:
input_image=np.ones((1,1000,1000,3))
# 1000*1000*3 values * 8 bytes each (fp64)
print("Image Size in fp64",input_image.size*input_image.itemsize/1e6)

input_image=input_image.astype('float32')
# 1000*1000*3 values * 4 bytes each (fp32)
print("Image Size in fp32",input_image.size*input_image.itemsize/1e6)


output_image=silly_model.predict(input_image,batch_size=1)
print(output_image.shape)
print("Output Size in fp32",output_image.size*output_image.itemsize/1e6)




Image Size in fp64 24.0
Image Size in fp32 12.0
(1, 500)
Output Size in fp32 0.002
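
The same arithmetic works for any tensor if you know its shape and the bytes per value. Here is a small sketch (the helper name `tensor_megabytes` is made up for this example) that reproduces the sizes above and also estimates the output of the `Conv2D` layer, which is where most of the memory goes.

```python
def tensor_megabytes(shape, bytes_per_value=4):
    """Size of a dense tensor in MB (1 MB = 1e6 bytes); fp32 by default."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 1e6


print("input, fp64  :", tensor_megabytes((1, 1000, 1000, 3), bytes_per_value=8))
print("input, fp32  :", tensor_megabytes((1, 1000, 1000, 3)))
print("conv output  :", tensor_megabytes((1, 1000, 1000, 500)))
print("pooled output:", tensor_megabytes((1, 500)))
```

This only counts the tensors themselves; the real GPU footprint is larger because the framework allocates extra working memory.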


In [5]:
!nvidia-smi

Thu Sep 19 16:30:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   48C    P0   149W / 149W |  11664MiB / 12206MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

# Memory-Usage 

Once Keras has to use the model, it creates a TensorFlow session.

One process (this one) ends up using nearly all of the available GPU memory (11664 of 12206 MiB)

* This is a feature of TensorFlow: by default it grabs all the memory it can on all the GPUs it can see
    * Even if your code doesn't use more than one GPU
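
On a machine with several GPUs, one common way to keep a process off the GPUs it doesn't need is the `CUDA_VISIBLE_DEVICES` environment variable, which CUDA reads when it starts up. A minimal sketch, assuming you set it before TensorFlow initializes CUDA (simplest is before `import tensorflow`); the helper name `limit_visible_gpus` is made up here.

```python
import os


def limit_visible_gpus(gpu_indices):
    """Expose only the listed GPU indices to CUDA in this process."""
    value = ",".join(str(i) for i in gpu_indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Only GPU 0 will be visible to CUDA libraries loaded after this point
limit_visible_gpus([0])
print(os.environ["CUDA_VISIBLE_DEVICES"])
```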
    

    
## Be a little bit nicer

Restart the kernel and run again with `be_nice=True`.
You should see

```                                                                        
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    113328      C   ...envs/anaconda-tensorflow-gpu/bin/python  8387MiB |
+-----------------------------------------------------------------------------+
```

8 GB? Well, a Keras model is ready to fit your data, and it creates tensors that use more memory than the layer outputs alone

In [6]:
# Print the names of all the graph nodes the Keras model defined
[n.name for n in tf.get_default_graph().as_graph_def().node]

['input_1',
 'conv2d/kernel/Initializer/random_uniform/shape',
 'conv2d/kernel/Initializer/random_uniform/min',
 'conv2d/kernel/Initializer/random_uniform/max',
 'conv2d/kernel/Initializer/random_uniform/RandomUniform',
 'conv2d/kernel/Initializer/random_uniform/sub',
 'conv2d/kernel/Initializer/random_uniform/mul',
 'conv2d/kernel/Initializer/random_uniform',
 'conv2d/kernel',
 'conv2d/kernel/IsInitialized/VarIsInitializedOp',
 'conv2d/kernel/Assign',
 'conv2d/kernel/Read/ReadVariableOp',
 'conv2d/bias/Initializer/zeros',
 'conv2d/bias',
 'conv2d/bias/IsInitialized/VarIsInitializedOp',
 'conv2d/bias/Assign',
 'conv2d/bias/Read/ReadVariableOp',
 'conv2d/dilation_rate',
 'conv2d/Conv2D/ReadVariableOp',
 'conv2d/Conv2D',
 'conv2d/BiasAdd/ReadVariableOp',
 'conv2d/BiasAdd',
 'global_average_pooling2d/Mean/reduction_indices',
 'global_average_pooling2d/Mean',
 'Adam/iterations/Initializer/initial_value',
 'Adam/iterations',
 'Adam/iterations/IsInitialized/VarIsInitializedOp',
 'Adam/iterat

In [7]:
# nvidia-smi is still a great way to check total memory usage

In [8]:
!nvidia-smi

Thu Sep 19 16:30:45 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   48C    P0    93W / 149W |  11664MiB / 12206MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  

# What Happens When You Run Out of Memory

In [9]:


input_image=np.ones((20,1000,1000,3))
# 20*1000*1000*3 values * 8 bytes each (fp64)
print("Image Size in fp64",input_image.size*input_image.itemsize/1e6)

input_image=input_image.astype('float32')
# 20*1000*1000*3 values * 4 bytes each (fp32)
print("Image Size in fp32",input_image.size*input_image.itemsize/1e6)


output_image=silly_model.predict(input_image,batch_size=20)
print(output_image.shape)
print("Output Size in fp32",output_image.size*output_image.itemsize/1e6)


Image Size in fp64 480.0
Image Size in fp32 240.0


ResourceExhaustedError: OOM when allocating tensor with shape[20,500,1000,1000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node conv2d/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv2d/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d/Conv2D/ReadVariableOp)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


### Unless you're running on a giant future GPU you'll see
```
ResourceExhaustedError: OOM when allocating tensor with shape[20,500,1000,1000]
```
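
The shape in that error message tells you why it failed. A quick back-of-the-envelope check (a sketch; the helper name `float32_tensor_mib` is made up, and 12206 MiB is the K80 total reported by `nvidia-smi`):

```python
K80_TOTAL_MIB = 12206  # total memory reported by nvidia-smi for the K80


def float32_tensor_mib(shape):
    """Size of a float32 tensor in MiB (4 bytes per value)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * 4 / 2**20


requested = float32_tensor_mib((20, 500, 1000, 1000))  # batch of 20
per_image = float32_tensor_mib((1, 500, 1000, 1000))   # batch of 1

print(f"batch of 20: {requested:.0f} MiB, fits: {requested <= K80_TOTAL_MIB}")
print(f"batch of 1 : {per_image:.0f} MiB, fits: {per_image <= K80_TOTAL_MIB}")
```

The convolution output for 20 images is roughly three times the whole card, before counting anything else the model needs.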


In [10]:
# This works fine since the batch size is small
output_image=silly_model.predict(input_image,batch_size=1)


Out of memory? Use a smaller batch size
* Still out of memory? 
    * Try using fp16 (doesn't always work)
    * Change your model
    * Try gradient checkpointing https://github.com/cybertronai/gradient-checkpointing
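
The "use a smaller batch size" advice can be automated by halving the batch size until the call succeeds. The sketch below shows the idea with a stand-in function instead of a real model, so it runs anywhere; `predict_with_backoff` and `fake_predict` are made-up names. With TensorFlow you would presumably pass `oom_error=tf.errors.ResourceExhaustedError` and wrap `silly_model.predict`, but whether the session recovers cleanly after an OOM is not something this sketch guarantees.

```python
def predict_with_backoff(predict_fn, data, batch_size, oom_error=MemoryError):
    """Call predict_fn(data, batch_size), halving batch_size on each
    out-of-memory error. Returns (result, batch_size_that_worked)."""
    while True:
        try:
            return predict_fn(data, batch_size), batch_size
        except oom_error:
            if batch_size == 1:
                raise  # nothing smaller to try
            batch_size = max(1, batch_size // 2)


def fake_predict(data, batch_size):
    """Stand-in for a model: pretends anything above batch size 4 is too big."""
    if batch_size > 4:
        raise MemoryError("pretend OOM")
    return [x * 2 for x in data]


result, used_batch_size = predict_with_backoff(fake_predict, [1, 2, 3], batch_size=20)
print("worked with batch size", used_batch_size)
```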

# Speed
Let's try a new, smaller model



In [11]:

def build_small_model():
    big_input=tf.keras.layers.Input((100,100,3)) # ~40 KB per channel * 3 channels = 120 KB image (fp32)
    big_cnn=tf.keras.layers.Conv2D(500,(2,2),padding='same')(big_input) # 500 output channels * ~40 KB each = ~20 MB per image
    average=tf.keras.layers.GlobalAveragePooling2D()(big_cnn) # 500 numbers, tiny in memory

    silly_model=tf.keras.models.Model([big_input],[average])
    silly_model.compile(loss='mse',optimizer='adam')
    return silly_model


gpu_model=build_small_model() # GPU is the default device when one is available

with tf.device("/cpu:0"): # pin this copy of the model to the CPU
    cpu_model=build_small_model()

In [14]:
from time import time
input_image=np.ones((500,100,100,3)).astype('float32')

# With CPU

itime=time()
output_image=cpu_model.predict(input_image,batch_size=1)
print("Batch Size 1 CPU  time",time()-itime,"Seconds")

itime=time()
output_image=cpu_model.predict(input_image,batch_size=64)
print("Batch Size 64 CPU  time",time()-itime,"Seconds")


Batch Size 1 CPU  time 8.492326498031616 Seconds
Batch Size 64 CPU  time 12.380130767822266 Seconds


In [15]:
# With GPU
itime=time()
output_image=gpu_model.predict(input_image,batch_size=1)
print("Batch Size 1 GPU  time",time()-itime,"Seconds")

itime=time()
output_image=gpu_model.predict(input_image,batch_size=64)
print("Batch Size 64 GPU  time",time()-itime,"Seconds")

Batch Size 1 GPU  time 1.003133773803711 Seconds
Batch Size 64 GPU  time 1.099949598312378 Seconds
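
Each of the timings above is a single run, and the first call to a model may include one-time setup cost. If you want steadier numbers, one option is a small helper that does an untimed warm-up call and keeps the best of a few repeats. This is only a sketch: `time_call` is a made-up name, and the lambda is a stand-in workload so the example runs without a GPU.

```python
from time import perf_counter


def time_call(fn, repeats=3):
    """Run fn once untimed (warm-up), then return the fastest of
    `repeats` timed runs, in seconds."""
    fn()
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return min(timings)


# Stand-in workload; with the models above you could time something like
# lambda: gpu_model.predict(input_image, batch_size=64)
best = time_call(lambda: sum(range(100_000)))
print("best of 3:", best, "Seconds")
```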



# In a separate terminal try running

```watch -n 0.5 nvidia-smi```

Run the code below (the same GPU code again with more data) and watch the terminal output. You should see something like
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                  Off |
| N/A   54C    P8   148W / 149W |  11664MiB / 12206MiB |      98%     Default |
+-------------------------------+----------------------+----------------------+

```
The Volatile GPU-Util column shows 98%, which tells you that the GPU is being (almost) fully utilized.

You might notice that with a batch size of 1 the Volatile GPU-Util is lower, which means the GPU is not being fully utilized.
In this case you'll get somewhat better performance with a larger batch size (if it fits in memory).


In [17]:
# Again with more data so you can watch

input_image=np.ones((5000,100,100,3)).astype('float32')

itime=time()
output_image=gpu_model.predict(input_image,batch_size=1)
print("Batch Size 1 GPU  time",time()-itime,"Seconds")

itime=time()
output_image=gpu_model.predict(input_image,batch_size=64)
print("Batch Size 64 GPU  time",time()-itime,"Seconds")

Batch Size 1 GPU  time 8.12438678741455 Seconds
Batch Size 64 GPU  time 5.824122905731201 Seconds
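
To compare the two runs it helps to turn the times into images per second. The numbers below are copied from the output above (5000 images; the second run used `batch_size=64`); the variable names are just for this sketch.

```python
n_images = 5000
# Seconds per run, keyed by the batch_size passed to predict()
seconds_by_batch_size = {1: 8.12438678741455, 64: 5.824122905731201}

for batch_size, seconds in seconds_by_batch_size.items():
    print(f"batch_size={batch_size}: {n_images / seconds:.0f} images/second")

gain = seconds_by_batch_size[1] / seconds_by_batch_size[64]
print(f"larger batch is {gain:.2f}x faster")
```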


# Summary
* GPUs are very powerful for deep learning, but come with memory constraints
* nvidia-smi is your friend for understanding what's going on