https://rc-docs.northeastern.edu/en/latest/using-discovery/workingwithgpu.html#using-gpus-with-pytorch

In [1]:
!which python

/work/bootcamp/gpu_training/pytorch_env/bin/python


In [2]:
import torch

In [3]:
torch.manual_seed(42);

In [4]:
torch.__version__

'1.7.1'

### CPU

In [5]:
# Tensors are created on the CPU by default
tensor1 = torch.randn(5000, 10000)
tensor2 = torch.randn(10000, 5000)
result = torch.matmul(tensor1, tensor2)  # (5000 x 10000) @ (10000 x 5000)
print(result.size())
print(f"Result is on GPU : {result.is_cuda}")

torch.Size([5000, 5000])
Result is on GPU : False


In [6]:
%%timeit -r 7 -n 1
torch.matmul(tensor1, tensor2)

1.37 s ± 32.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


CPU run time ~ 1.4 s

As a sanity check, monitor GPU activity with ```nvidia-smi```:
![image](https://drive.google.com/uc?export=view&id=1BT_VCeS3jj-Os5nYeEW86sZREH_vxT0i)


Useful alias : ```alias wsmi='watch -n0.1 nvidia-smi'```


### GPU

In [7]:
cuda = torch.device('cuda')
# Allocate both operands directly on the GPU to avoid a host-to-device copy
tensor1_gpu = torch.randn(5000, 10000, device=cuda)
tensor2_gpu = torch.randn(10000, 5000, device=cuda)
result_gpu = torch.matmul(tensor1_gpu, tensor2_gpu)
print(result_gpu.size())
print(f"Result is on GPU : {result_gpu.is_cuda}")

torch.Size([5000, 5000])
Result is on GPU : True
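
The cell above assumes a CUDA device is present. A minimal sketch of a device-agnostic variant follows, assuming you want the same code to run on a node without a GPU. The smaller shapes and the names `device`, `a`, `b`, and `out` are illustrative choices, not part of the original notebook.

```python
import torch

# Fall back to the CPU when no CUDA device is visible (for example, on a login node).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Smaller shapes than the cells above, so the sketch also finishes quickly on a CPU.
a = torch.randn(500, 1000, device=device)
b = torch.randn(1000, 500, device=device)
out = torch.matmul(a, b)

print(out.size())
print(f"Result is on GPU : {out.is_cuda}")
```

Because both operands are created with the same `device`, the result lands on that device too, so `out.is_cuda` reports where the computation actually ran.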


In [8]:
%%timeit -r 6 -n 100
torch.matmul(tensor1_gpu, tensor2_gpu)
torch.cuda.synchronize()

64 ms ± 252 µs per loop (mean ± std. dev. of 6 runs, 100 loops each)


GPU run time ~ 64 ms, roughly 21x faster than the CPU!
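
The `torch.cuda.synchronize()` call in the timed cell matters because CUDA kernels are launched asynchronously: without it, the timer can stop before the GPU has finished. The sketch below shows the same idea outside of `%%timeit`. The helper `time_op` and its parameters are hypothetical names introduced here, and it uses only the standard library so that it also runs without a GPU. Unlike the cell above, it synchronizes once per repeat rather than once per loop.

```python
import time
from statistics import mean, stdev

def time_op(fn, sync=None, repeats=7, loops=1):
    """Return per-loop wall-clock times in seconds, one entry per repeat.

    `sync` is an optional zero-argument callable that is run before the timer
    starts and again before it stops, so queued asynchronous work is counted.
    """
    per_loop = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain earlier work so it is not charged to this run
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        if sync is not None:
            sync()  # wait for everything launched inside the loop
        per_loop.append((time.perf_counter() - start) / loops)
    return per_loop

# Stand-in CPU workload so the sketch runs anywhere.
times = time_op(lambda: sum(range(100_000)), repeats=5, loops=10)
print(f"{mean(times) * 1e3:.3f} ms ± {stdev(times) * 1e3:.3f} ms per loop")
```

To time the GPU matmul from this notebook, one would pass `lambda: torch.matmul(tensor1_gpu, tensor2_gpu)` as `fn` and `torch.cuda.synchronize` as `sync`.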

In [9]:
!nvidia-smi  # or run nvidia-smi directly in a terminal

Wed Jun  9 09:34:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA Tesla P1...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   50C    P0    33W / 250W |   1371MiB / 12198MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Matrix multiplication is a compute-bound operation, so it shows high GPU utilization:

![image](https://drive.google.com/uc?export=view&id=1OiNvtZBT-wp1i2yq2kSrZxCJgh4B3dIm)
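
To put the compute-bound claim in numbers, here is a rough back-of-the-envelope sketch using the shapes and the two `%%timeit` means reported above (1.37 s on the CPU, 64 ms on the GPU). It assumes float32 tensors (the `torch.randn` default) and counts only the minimum memory traffic of reading each input and writing the output once, so the intensity figure is an idealized upper bound rather than a measurement.

```python
# Shapes from the cells above: (m x k) @ (k x n)
m, k, n = 5000, 10000, 5000

# Each of the m*n outputs needs k multiplies and about k adds.
flops = 2 * m * k * n

# Idealized traffic: read both inputs once, write the result once (4 bytes per float32).
bytes_moved = 4 * (m * k + k * n + m * n)

cpu_seconds = 1.37   # mean from the CPU %%timeit output
gpu_seconds = 64e-3  # mean from the GPU %%timeit output

print(f"Work per matmul      : {flops / 1e9:.0f} GFLOP")
print(f"Arithmetic intensity : {flops / bytes_moved:.0f} FLOP/byte")
print(f"CPU throughput       : {flops / cpu_seconds / 1e12:.2f} TFLOP/s")
print(f"GPU throughput       : {flops / gpu_seconds / 1e12:.2f} TFLOP/s")
print(f"Speedup              : {cpu_seconds / gpu_seconds:.1f}x")
```

With about 500 GFLOP of work against roughly 0.5 GB of idealized traffic, each byte moved supports on the order of a thousand floating-point operations, which is why the arithmetic units rather than memory bandwidth set the run time.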

