# High Performance Python: GPUs
## Henry Schreiner

10-20-2021

Survey: TBD

Useful links:
* [High Performance Python: CPUs](https://github.com/henryiii/python-performance-minicourse)
* [Compiled Python](https://github.com/henryiii/python-compiled-minicourse)
* [iscinumpy.gitlab.io](https://iscinumpy.gitlab.io)
* [CompClass](https://github.com/henryiii/compclass)
* [Level Up Your Python](https://henryiii.github.io/level-up-your-python)

## Intro to GPUs

GPUs are "graphics processing units", designed to compute pixels on a screen. Their massively parallel design is also useful for general purpose computing, so
GPU companies started providing ways to use GPUs as "GPGPU"s, general purpose GPUs.

## Computing platform

We will be using Conda, mostly through the conda-forge channel, which recently gained support for proper CUDA libraries. We are getting PyTorch from the `pytorch` channel, and TensorFlow 2.0 snuck in as well. Pip support for ML libraries is not too bad, either (both are rapidly improving).

We will be using Python 3.8. Numba does not have a 3.9 compatible release yet, since the bytecode changed and that affected libraries that inspect bytecode, as Numba does. A 3.9 compatible release is due this month. PyTorch and TensorFlow also do not support Python 3.9 yet. So far, of the major libraries discussed, CuPy will likely be the first to support Python 3.9.

## Languages/platforms

For differences in terminology, the ROCm page is quite good: <https://rocm.github.io/languages.html>.

![Language interest](images/LanguageInterest.png)

#### CUDA

The leader of the pack is easily NVidia; they were first into the fray with the CUDA language, and they easily lead for scientific computation.


* Wildly popular
* NVidia only
* A C++-like language, single source (with JIT option)

#### OpenCL

AMD was late to the game, and tried to support an open standard, OpenCL - but poor support from other players caused it to be almost AMD exclusive.
Apple released Metal as a replacement for OpenGL & OpenCL; they have worked with Intel & AMD on it. They are dropping their (almost non-existent) support for OpenCL.
The Khronos Group (which works on OpenGL/CL) has released a successor, Vulkan, but it mostly focuses on graphics (OpenGL) at the moment.
Intel is planning to drop OpenCL in 2-3 years, too.

* Works on most platforms
* Most platforms except AMD have buggy, older support
* Only JIT-like option
* Also supports other compute backends, like FPGAs and CPUs



#### ROCm

* AMD only
* Open, interacts with others at various levels

ROCm is AMD's own GPU computing platform.

#### SYCL

SYCL was a CUDA-like single-source language built on OpenCL, and is now part of Intel's oneAPI plan.


#### OpenMP

OpenMP now has tools to target GPUs, but it can be tricky to program (especially if you expect to write the same code to run in multiple places). There's also OpenACC.

Today we will focus on CUDA, since it has good Python support and is the current lingua franca for scientific computing. OpenCL is not as popular, but has some Python libraries. ROCm support has recently been showing up in Numba and CuPy.

# Libraries

In the CPU class, we covered several libraries, but Numba was a clear standout in terms of high performance and ease of use. In GPU computing, the landscape is still quite varied. It's much harder to select a clear winner; each has features and drawbacks.

![Library interest](images/LibraryInterest.png)

## CuPy

This was designed for the ML framework Chainer, but it is becoming quite popular on its own.

* *Very* close to NumPy (often a drop-in replacement)
* Custom kernel support, including element-wise and reduction kernels
     * Written in CUDA
* Fusion support (although limited)
* Rapid development
* Experimental ROCm support
* Supports Numba's GPU array interface

## Numba

This comes up again, since it has a GPU mode too!

* Powerful but limited vectorize (elementwise ufunc)
* Full kernel mode, but hand launched
    * Written in Python subset
* Device function support
* New ROCm mode, but different terms
* Developed the GPU array interface
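
"Hand launched" means you pick the launch configuration yourself: how many threads go in a block, how many blocks to run, and how each thread works out which element it owns. The sketch below replays that index arithmetic in plain Python, so it runs without a GPU. `launch_config` and `simulate_launch` are hypothetical helper names made up for illustration; they are not part of Numba.

```python
import math


def launch_config(n_items, threads_per_block=128):
    """Return (blocks_per_grid, threads_per_block) covering n_items elements."""
    blocks_per_grid = math.ceil(n_items / threads_per_block)
    return blocks_per_grid, threads_per_block


def simulate_launch(data, threads_per_block=128):
    """Plain-Python stand-in for a kernel that doubles each element."""
    out = list(data)
    blocks_per_grid, tpb = launch_config(len(data), threads_per_block)
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(tpb):
            i = block_idx * tpb + thread_idx  # global thread index
            if i < len(data):  # bounds guard: the last block may overshoot
                out[i] = 2 * data[i]
    return out


blocks, threads = launch_config(1000)
doubled = simulate_launch(list(range(1000)))
```

On a real GPU the two loops disappear: every `(block_idx, thread_idx)` pair runs as its own thread, and in Numba the launch itself is written `kernel[blocks_per_grid, threads_per_block](...)`. The bounds guard stays, since the grid usually covers slightly more threads than there are elements.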

## PyTorch

This is Facebook's ML library.

* NumPy-like
* Has tape-based gradient support
* Has fusion mode (TorchScript), can support multiple languages
* Hard to make custom kernels
* Supports Numba's GPU array interface
* *Great* tutorials

## TensorFlow

This is Google's ML library.

* New API is similar to PyTorch, but might be a bit slow currently
* Fusion mode builds graph, but still slower than API 1
* API 1 was very fast, but hard to *set up* (computations easy, though)
* Hard to make custom kernels
* Lucky to support NumPy's array interface; no GPU interface (yet?)
* Multiple language backends, including Swift

# GPU basics

GPU programming has several characteristics:

### Memory

GPU memory is separate from main memory, and the transfer cost is high. You will constantly be thinking about *where* your memory is, and how to reduce the transfer of that memory between host and device.

Note that there are techniques, like pinned or unified memory, that can hide this from the programmer to an extent.

Also there are several types of memory, going from local to global, along with specialized memories like constant and texture.
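
To get a feel for why the host-to-device transfer matters so much, a back-of-the-envelope estimate helps. The bandwidths below are round-number assumptions for illustration, not measurements (roughly a PCIe 3.0 x16 link versus on-device global memory), and `transfer_seconds` is a hypothetical helper that ignores latency entirely.

```python
def transfer_seconds(n_bytes, bandwidth_gb_per_s):
    """Idealized time to move n_bytes at the given bandwidth (no latency)."""
    return n_bytes / (bandwidth_gb_per_s * 1e9)


# Assumed, round-number bandwidths for illustration only.
HOST_TO_DEVICE_GB_S = 16.0  # roughly a PCIe 3.0 x16 link
DEVICE_MEMORY_GB_S = 500.0  # order of magnitude for GPU global memory

n_bytes = 100_000_000 * 8  # 100 million float64 values

over_bus = transfer_seconds(n_bytes, HOST_TO_DEVICE_GB_S)
on_device = transfer_seconds(n_bytes, DEVICE_MEMORY_GB_S)

print(f"host -> device: {over_bus * 1e3:.1f} ms")
print(f"read on device: {on_device * 1e3:.1f} ms")
print(f"ratio: {over_bus / on_device:.0f}x")
```

Under these assumptions, copying an array to the GPU costs about as much as reading it a few dozen times once it is there, which is why you want to move data once and then do as much work on it as possible.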

### Parallel computation

It is tempting to think of GPU threads as CPU threads, but they are much more like the lanes of a vector register. A GPU computes a "warp" at a time (32 threads); each thread does the *same* computation. So, for example, how many times will the following code run?

```python
if x < 0:
    x = 0
else:
    x = x
```

This will first compute `x < 0` and create a mask. It will then run `x = 0` with some threads masked, then `x = x` with the other threads masked! So if the threads in a warp disagree, both branches run.
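
The masking can be replayed on the CPU. This is a minimal pure-Python sketch of a single warp, not real GPU code; `clamp_warp` is a hypothetical name, and the sketch assumes a branch body is skipped entirely when no lane in the warp needs it.

```python
WARP_SIZE = 32


def clamp_warp(xs):
    """Simulate one warp running `if x < 0: x = 0 else: x = x` in lockstep.

    Returns the new values and the number of branch bodies the warp executed.
    """
    if len(xs) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} lanes, got {len(xs)}")
    xs = list(xs)
    mask = [x < 0 for x in xs]  # every lane evaluates the condition
    passes = 0

    if any(mask):  # "if" body: lanes where mask is False sit idle
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:
                xs[lane] = 0

    if not all(mask):  # "else" body: lanes where mask is True sit idle
        passes += 1
        for lane in range(WARP_SIZE):
            if not mask[lane]:
                xs[lane] = xs[lane]

    return xs, passes


mixed, mixed_passes = clamp_warp(list(range(-16, 16)))  # lanes disagree
uniform, uniform_passes = clamp_warp(list(range(32)))  # lanes agree
```

Here `mixed_passes` is 2 and `uniform_passes` is 1: a warp only pays for both branches when its lanes disagree, so arranging data so that neighboring threads take the same path can be worth the effort.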

### Synchronization

GPUs can operate on streams (somewhat like threads in CPU programming). You can give one stream commands to work on while the other stream is loading data. However, this means a lot of commands are asynchronous, that is, they return immediately and just schedule work to be done, rather than waiting till after the work is done to return. If you are using the results, this is fine (things wait properly), but if you are timing runs, you should have a "synchronize" step to make sure work is done.
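
The timing pitfall can be shown without a GPU. The sketch below is only a CPU analogy: a single-worker thread pool stands in for a stream, `submit` stands in for an asynchronous kernel launch, and `future.result()` stands in for the synchronize step. `slow_work` is a made-up placeholder for a kernel.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_work():
    """Stand-in for a GPU kernel that takes about 0.2 s."""
    time.sleep(0.2)
    return 42


with ThreadPoolExecutor(max_workers=1) as stream:  # plays the role of a stream
    start = time.perf_counter()
    future = stream.submit(slow_work)  # returns immediately, like a launch
    launch_time = time.perf_counter() - start  # misleading: scheduling cost only

    result = future.result()  # blocks until done, like a "synchronize" call
    total_time = time.perf_counter() - start  # the number you actually want

print(f"without waiting: {launch_time * 1e3:.2f} ms")
print(f"after waiting:   {total_time * 1e3:.2f} ms")
```

On a real GPU the waiting step is a library call such as `cupy.cuda.Stream.null.synchronize()`, `numba.cuda.synchronize()`, or `torch.cuda.synchronize()`, placed just before you stop the timer.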

### Caching (and other smart CPU things)

GPUs are not as smart as CPUs, and cannot do as much branch prediction and caching as a CPU can.

Other parallel concepts, like atomics, still apply for GPUs as well.
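
As a reminder of what atomics protect against, here is a deterministic pure-Python sketch of a histogram. It is not GPU code: the "racy" version models lanes that all read the old count before any of them writes, and both function names are hypothetical.

```python
def histogram_racy(values, n_bins):
    """Lockstep read-modify-write: every lane reads, then every lane writes."""
    bins = [0] * n_bins
    seen = [bins[v] for v in values]  # all lanes read the old counts
    for v, old in zip(values, seen):  # all lanes write old + 1
        bins[v] = old + 1
    return bins


def histogram_atomic(values, n_bins):
    """Each increment is an indivisible read-modify-write, like an atomic add."""
    bins = [0] * n_bins
    for v in values:
        bins[v] += 1
    return bins


values = [0, 0, 0, 1, 1, 2, 3, 3]
racy = histogram_racy(values, 4)
atomic = histogram_atomic(values, 4)
```

The racy version loses every update that collides on a bin, so its counts never get above 1, while the atomic version counts all eight values. In a Numba kernel the equivalent increment would be written with `cuda.atomic.add(bins, v, 1)`.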

## Multiple GPUs

You may have multiple GPUs connected to a single CPU system. Most GPU libraries have a context system that lets you switch between the GPUs, but it's usually another thing you have to program for.