# <center> An introduction to parallel programming and distributed training </center>

Our computational resources have grown: we now have multiple machines, each with multiple GPUs. Training a neural network on only one GPU does not make full use of this hardware. In this note we will look at the mechanism of training on multiple GPUs. The information available on the internet is often incomplete. Instead of just copying and pasting code to make things work, it is more important to understand what is happening behind the scenes. Along the way we will also get in touch with multiprocessing, which may help to speed up scientific research as well.

We will begin with the basic concepts of parallel programming, then talk about different setups for training neural networks on multiple GPUs. The main topics are listed as follows:
- Threads and processes.
- Problems that might happen in parallel programming.
    - Race condition
    - Random number generators
- Some example use cases of parallel programming.


### 1. Threads and Processes

We will first get into parallel programming. What we want to achieve is to run multiple tasks at the same time, in the hope of getting a performance gain. The following pictures show the basic idea.

Eight people buying tickets from one machine:
<p align="center">
<img src="resources/1/one_queue.png" alt="drawing" width="500" >
</p>

But if we have two machines, we can have four people queueing at each machine.

<p align="center">
<img src="resources/1/two_queue.png" alt="drawing" width="500" >
</p>

Modern computers generally have multiple cores, and each core can run tasks independently. The basic idea of parallel programming is to distribute tasks across the cores so that we make full use of the available computing power. This is actually an extremely complicated topic and many weird things might happen. The goal of this note is to provide the basic concepts and terminology; for a deeper understanding you can search online or take a look at the book __**Modern Operating Systems**__.

You may have already come across the two words **thread** and **process**. In Python they are managed with the `threading` and `multiprocessing` modules. You can find many abstract definitions of them, but I prefer to explain them using a figure.

<p align="center">
<img src="resources/1/processes_thread.png" alt="drawing" width="1000" >
</p>

- Each block represents a process. Each process has a unique PID (process ID) assigned by the operating system.
- Processes can form a hierarchical structure.
- Each process has exactly one parent process and may have zero or more child processes.
- Several processes can be combined to form a process group.
- Each process has its own resources, for example variables and files. A programme in one process can access another process's resources through inter-process communication (IPC).
- Each process contains one or more threads. You can think of a thread as a piece of code that is running or about to run.
- Threads in a process can access the resources of that process directly.
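
The point about IPC can be illustrated with a small sketch. It is not taken from the demo files: the names `child` and `exchange_message` are made up for illustration, and `multiprocessing.Queue` is just one possible IPC channel.

```python
import multiprocessing as mp
import os


def child(queue):
    # Runs in the child process: send a message back to the parent.
    queue.put(f"hello from child, pid={os.getpid()}")


def exchange_message():
    queue = mp.Queue()               # IPC channel used by parent and child
    p = mp.Process(target=child, args=(queue,))
    p.start()
    message = queue.get()            # blocks until the child has sent something
    p.join()
    return message, p.pid


if __name__ == "__main__":
    message, child_pid = exchange_message()
    print(f"parent pid={os.getpid()} received: {message!r}")
```

The parent cannot read the child's variables directly, so the message has to travel through the queue.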

Next let's examine some of the above points through some demonstrations.

Let's run this [file](./demos/pid_gid_ppid.py) and then take a look at the system monitor. The programme launches a Python process with one thread inside it, and we can also check the corresponding IDs.
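
For readers without the demo file at hand, here is a minimal sketch of how such IDs can be read from inside Python. It is an assumption about what a script like this might do, not the content of the demo itself; the function name `describe_current_process` is hypothetical.

```python
import os
import threading


def describe_current_process():
    info = {
        "pid": os.getpid(),                    # ID of this process
        "ppid": os.getppid(),                  # ID of the parent process
        "threads": threading.active_count(),   # live Python threads in this process
    }
    # Process groups are a Unix concept, so os.getpgid is not available everywhere.
    if hasattr(os, "getpgid"):
        info["pgid"] = os.getpgid(0)           # 0 means "the current process"
    return info


if __name__ == "__main__":
    for name, value in describe_current_process().items():
        print(f"{name}: {value}")
```

The printed PID can be compared with the one shown in the system monitor while the programme is running.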

This [file](demos/create_process.py) shows how we can start a new process programmatically. The main process is the entry point of the programme; inside the main process we start a new process `p1`, which runs the function `f`.
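
A minimal sketch of this pattern, assuming the standard `multiprocessing.Process` API, could look as follows. The body of `f` and the helper `start_child` are illustrative and may differ from the demo file.

```python
import multiprocessing as mp
import os


def f(name):
    # Runs inside the new process.
    print(f"{name}: pid={os.getpid()}, parent pid={os.getppid()}")


def start_child():
    p1 = mp.Process(target=f, args=("p1",))
    p1.start()    # the operating system creates the new process here
    p1.join()     # wait until it has finished
    return p1.pid, p1.exitcode


if __name__ == "__main__":
    print(f"main: pid={os.getpid()}")
    child_pid, exitcode = start_child()
    print(f"child {child_pid} exited with code {exitcode}")
```

The parent PID printed by `p1` should match the PID printed by the main process.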

Run this [file](demos/mp_resources.py) to see that resources are not shared between processes. When we start a new process from the main process, all the variables are copied. Thus, changes made in the subprocess do not affect the original process.
<p align="center">
<img src="resources/1/sub_process_resources.png" alt="drawing" width="200" >
</p>
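
The copy behaviour can be sketched in a few lines. This is an illustrative example rather than the demo file; the names `data`, `modify` and `run_child_and_read` are made up.

```python
import multiprocessing as mp

data = {"value": 0}   # every process gets its own copy of this dict


def modify():
    # Runs in the child: changes only the child's own copy of `data`.
    data["value"] = 100
    print("child sees:", data["value"])


def run_child_and_read():
    p = mp.Process(target=modify)
    p.start()
    p.join()
    return data["value"]   # the parent's copy, untouched by the child


if __name__ == "__main__":
    print("parent sees:", run_child_and_read())
```

The child reports 100, while the parent still sees 0 after the child has finished.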

This [file](demos/create_threads.py) shows how we can start a new thread. In the system monitor we can confirm that we have created only one process, which has two threads. [This file](demos/thread_resources.py) shows that threads in a process can modify its resources directly.
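
As a rough sketch of the same idea with threads (again not the demo file; `data`, `modify` and `run_thread_and_read` are hypothetical names), a second thread can change a variable that the main thread then reads:

```python
import threading

data = {"value": 0}


def modify():
    # Runs in a second thread of the same process, so it touches the same dict.
    data["value"] = 100


def run_thread_and_read():
    t = threading.Thread(target=modify)
    t.start()
    t.join()
    return data["value"]


if __name__ == "__main__":
    print("main thread sees:", run_thread_and_read())   # prints 100
```

Unlike a child process, the thread works on the very same object, so the change is visible in the main thread.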


Next let's talk about the difference between multi-processing and multi-threading. First, we need to know that creating threads is much faster than creating processes. [This folder](demos/speed_test_mp_thread) contains two test files that create 5000 threads or 5000 processes. Creating the threads takes about 0.3s, but creating the processes takes about 13s, which is a huge difference. For the purpose of parallel programming we could always create processes instead of threads, since every process contains threads. However, considering the cost of creation and the memory usage, we should know exactly what the difference between them is and use the right one at the right time.
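
A comparison of this kind can be sketched as below. The helper `time_creation` and the worker count are assumptions for illustration; the test files in the folder may be organised differently, and the timings depend heavily on the machine and operating system.

```python
import multiprocessing as mp
import threading
import time


def do_nothing():
    pass


def time_creation(worker_cls, n):
    # Start and join n workers one after another and return the elapsed seconds.
    start = time.perf_counter()
    for _ in range(n):
        w = worker_cls(target=do_nothing)
        w.start()
        w.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 200   # assumed value, much smaller than the 5000 used in the demo folder
    print(f"threads:   {time_creation(threading.Thread, n):.3f}s")
    print(f"processes: {time_creation(mp.Process, n):.3f}s")
```

Both `threading.Thread` and `multiprocessing.Process` accept a `target` argument and offer `start` and `join`, which is why one helper can time both.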

The short answer: for scientific computing, as far as I can think of, we should use multi-processing.

The long answer: multi-threading is used for IO-bound operations and multi-processing is used for CPU-bound operations (at least in Python, because of the GIL, the global interpreter lock). What does this mean?

IO-bound operations: operations whose speed is limited by input and output, for example waiting for user input, waiting for an HTTP response, calling `time.sleep`, or waiting for the GPU to return results.
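
To make the IO-bound case concrete, here is a small sketch in which `time.sleep` stands in for a slow IO call. The function names and the chosen durations are illustrative assumptions.

```python
import threading
import time


def fake_io(seconds):
    # Stands in for an IO-bound call such as waiting for an HTTP response.
    time.sleep(seconds)


def run_sequential(n_tasks, seconds):
    start = time.perf_counter()
    for _ in range(n_tasks):
        fake_io(seconds)
    return time.perf_counter() - start


def run_threaded(n_tasks, seconds):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(seconds,)) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sequential: {run_sequential(4, 0.5):.2f}s")   # roughly 4 * 0.5s
    print(f"threaded:   {run_threaded(4, 0.5):.2f}s")     # roughly 0.5s, the waits overlap
```

The threaded version is faster because the threads spend their time waiting, and waiting can overlap.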

CPU-bound operations: operations that have to be computed by the CPU, for example the math operations `+ - * /` and matrix operations.

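Multi-threading cannot speed up CPU-bound operations. In CPython, the GIL allows only one thread to execute Python bytecode at a time, so two threads doing pure computation take turns instead of running in parallel. The two snippets below run the same function, first in a single thread and then in two threads.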

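```python
import random

def f(n):
    # CPU-bound work: draw n random numbers and discard them.
    for _ in range(n):
        random.random()

NUM = 100_000_000

# Baseline: all the work in a single thread.
f(NUM)
```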


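```python
import random
import threading

def f(n):
    # CPU-bound work: draw n random numbers and discard them.
    for _ in range(n):
        random.random()

NUM = 100_000_000

# Each thread does half of the work, so the total matches the single-threaded run.
t1 = threading.Thread(target=f, args=(NUM // 2,))
t2 = threading.Thread(target=f, args=(NUM // 2,))

t1.start()
t2.start()

# Wait for both threads to finish.
t1.join()
t2.join()
```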