# Multiprocessing

- The company uses Network-Attached Storage (NAS) to store all data generated daily (e.g., videos, photos).

- We need to back up the data on the production NAS (mounted at `/data/prod` on the server) to the backup NAS (mounted at `/data/prod_backup` on the server).

- A former member of the team developed a Python script (full path `/scripts/dailysync.py`) that backs up the data daily. Recently, however, so much data has been generated that the script can no longer keep up. The backup now takes more than 20 hours to finish, which is too slow for a daily backup.

We will:
- Identify what limits the system performance: I/O, network, CPU, or memory
- Use the `rsync` command instead of `cp` to transfer data
- Capture the standard output of a system command and manipulate it
- Find the differences between threading and multiprocessing

# CPU bound
**CPU bound** means the program is bottlenecked by the CPU (Central Processing Unit). If the program spends most of its time executing instructions rather than waiting for I/O, it is CPU bound, and a faster CPU will make it run faster. By contrast, an **I/O bound** program spends most of its time waiting for I/O (e.g., disk read/write, network read/write); while it waits, the CPU is free to do other work. The speed of such a program depends mostly on how fast the I/O can happen, so speeding it up means speeding up the I/O.

In either case, the key to speeding up the program might not be faster hardware but optimizing the program to reduce the amount of I/O or CPU it needs. *Or we can have it do I/O while it also does CPU-intensive work.* For a CPU-bound program, upgrading the CPU or optimizing the code improves overall performance.
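One rough way to tell the two cases apart is to compare the CPU time a piece of work consumes with the wall-clock time it takes. The sketch below is only an illustration: `cpu_fraction`, `busy_loop`, and `wait_for_io` are hypothetical helpers, and `time.sleep` merely stands in for waiting on a disk or network.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


def busy_loop(n):
    """CPU-bound stand-in: pure arithmetic, no waiting."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def wait_for_io(seconds):
    """I/O-bound stand-in: sleeping mimics waiting on a disk or network."""
    time.sleep(seconds)


if __name__ == "__main__":
    # A fraction near 1 suggests CPU-bound work; near 0 suggests waiting.
    print("busy loop:", round(cpu_fraction(busy_loop, 2_000_000), 2))
    print("waiting:  ", round(cpu_fraction(wait_for_io, 0.2), 2))
```

A fraction close to 1 means the work kept a core busy the whole time, while a fraction close to 0 means it mostly waited.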



- `psutil` (process and system utilities) is a cross-platform Python library for retrieving information on running processes and system utilization (CPU, memory, disks, network, sensors). It's mainly useful for system monitoring, profiling, limiting process resources, and managing running processes. We can use it to check what limits the script.

```python
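import psutil

# Measure over a 1-second window; a first call without an interval returns a meaningless 0.0
print(psutil.cpu_percent(interval=1))  # system-wide CPU utilization in percent, e.g. 2.9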
```

This shows that CPU utilization is low. The server's CPU has multiple cores, and `cpu_percent()` reports utilization averaged across all of them, so here one fully loaded CPU thread/virtual core corresponds to roughly 2.9% of the total load. In other words, the script uses only one core no matter how many are available, and the machine as a whole is nowhere near its limit.

Checking the CPU usage therefore suggests that the script runs on a single core. **But the server has many more cores sitting idle, which means the task is CPU-bound in the sense that it is limited by what one core can do.**
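To see the single-core pattern directly, it can help to look at utilization per core rather than the system-wide average. The following is a minimal sketch: `busy_cores`, `describe`, and `sample_and_describe` are hypothetical helper names, and the 90% threshold is an arbitrary assumption. The per-core figures come from `psutil.cpu_percent(interval=..., percpu=True)`.

```python
def busy_cores(per_core_percent, threshold=90.0):
    """Return the indexes of cores whose utilization is at or above threshold."""
    return [i for i, pct in enumerate(per_core_percent) if pct >= threshold]


def describe(per_core_percent, threshold=90.0):
    """Summarize a list of per-core utilization percentages in one line."""
    busy = busy_cores(per_core_percent, threshold)
    total = len(per_core_percent)
    average = sum(per_core_percent) / total
    return "{} of {} cores busy, average load {:.1f}%".format(len(busy), total, average)


def sample_and_describe(interval=1.0):
    """Sample every core for `interval` seconds and summarize the result."""
    # Imported here so the helpers above can be used without psutil installed.
    import psutil
    return describe(psutil.cpu_percent(interval=interval, percpu=True))
```

Calling `sample_and_describe()` while the backup runs would print something like "1 of N cores busy" if the script really is confined to one core.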

Next, `psutil.disk_io_counters()` and `psutil.net_io_counters()` report the bytes read and written for disk I/O and the bytes sent and received for network I/O. Both return cumulative counters, so bandwidth is the difference between two samples divided by the time between them.

#### For checking disk I/O, we can use the following command:
`psutil.disk_io_counters()`

#### For checking the network I/O bandwidth:
`psutil.net_io_counters()`
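Because these counters are cumulative, a throughput figure needs two samples. The sketch below shows one way to do that; `rate`, `io_rates`, and `sample_io_rates` are hypothetical helpers, while the field names (`read_bytes`, `write_bytes`, `bytes_sent`, `bytes_recv`) are the ones psutil exposes on the returned tuples.

```python
import time


def rate(before, after, field, seconds):
    """Bytes per second for one counter field between two samples."""
    return (getattr(after, field) - getattr(before, field)) / seconds


def io_rates(before, after, fields, seconds):
    """Map each counter field to its bytes-per-second rate."""
    return {field: rate(before, after, field, seconds) for field in fields}


def sample_io_rates(seconds=1.0):
    """Sample disk and network counters twice and return the rates."""
    # Imported here so the helpers above can be used without psutil installed.
    import psutil
    disk_before = psutil.disk_io_counters()
    net_before = psutil.net_io_counters()
    time.sleep(seconds)
    disk_after = psutil.disk_io_counters()
    net_after = psutil.net_io_counters()
    rates = io_rates(disk_before, disk_after, ["read_bytes", "write_bytes"], seconds)
    rates.update(io_rates(net_before, net_after, ["bytes_sent", "bytes_recv"], seconds))
    return rates
```

Comparing the rates from `sample_io_rates()` with what the disks and the network link can sustain indicates whether either of them is close to its limit.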

### Basic rsync command
`rsync` (remote sync) is a utility for efficiently transferring and synchronizing files between a computer and an external hard drive, and across networked computers, by comparing the modification times and sizes of files. One important feature of `rsync` is its **delta transfer algorithm**: *it only syncs or copies the changes from the source to the destination instead of copying the whole file*. This reduces the amount of data sent over the network.

`rsync [Options] [Source-Files-Dir] [Destination]`

Options (common):

- -v = Verbose output

- -q = Suppress message output

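- -a = Archive mode: sync recursively and preserve permissions, timestamps, symbolic links, and (where possible) ownership

- -r = Sync files and directories recursively (already implied by -a)

- -b = Keep a backup copy of each destination file that gets replaced during synchronization

- -z = Compress file data during the transfer

- -h = Print numbers in a human-readable format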

For example:
- Copy or sync files locally: `rsync -zvh [Source-Files-Dir] [Destination]`
- Copy or sync a directory locally: `rsync -zavh [Source-Files-Dir] [Destination]`
- Copy files and directories recursively locally: `rsync -zrvh [Source-Files-Dir] [Destination]`

```python
import subprocess

src = "data/prod"          # source directory
dest = "data/prod_backup"  # destination directory
# -a already implies -r; -q suppresses non-error messages
returncode = subprocess.call(["rsync", "-arq", src, dest])  # 0 means rsync reported no errors
```
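One of the goals listed at the top is to get a command's standard output and manipulate it. `subprocess.call` only returns the exit code, so a sketch along the following lines could be used instead. `build_rsync_command` and `run_and_capture` are hypothetical helpers, and the demo runs a tiny Python child process rather than `rsync` so that it works on any machine.

```python
import subprocess
import sys


def build_rsync_command(src, dest, options="-arq"):
    """Build the argument list for an rsync call (nothing is executed here)."""
    return ["rsync", options, src, dest]


def run_and_capture(command):
    """Run command and return (exit code, list of stdout lines)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in command; swap in build_rsync_command(src, dest, "-av") on a real system.
    code, lines = run_and_capture([sys.executable, "-c", "print('file_a'); print('file_b')"])
    print(code, [line.upper() for line in lines])
```

With a verbose option such as `-v`, the captured lines would include the names of the transferred files, which can then be filtered or counted like any other list of strings.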

# Multiprocessing

When we walk through the subfolders of `/data/prod`, we find data from different projects (e.g., beta, gamma, kappa) that are independent of each other. To back them up in parallel, we can use multiprocessing to take advantage of the idle CPU cores. Because the original script runs on a single core, the backup takes **more than 20 hours to finish**, which is too slow for a daily backup. With multiprocessing, each project folder can be synced from the source to the destination in its own process, spreading the work across the CPU's cores.
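The list of project folders does not have to be typed by hand. As a minimal sketch, assuming every immediate subdirectory of the source root is an independent project, a hypothetical helper `project_folders` could build the task list with `os.scandir`:

```python
import os
import tempfile


def project_folders(root):
    """Return the sorted paths of the immediate subdirectories of root."""
    with os.scandir(root) as entries:
        return sorted(entry.path for entry in entries if entry.is_dir())


if __name__ == "__main__":
    # Demonstrate on a throwaway directory tree rather than the real NAS mount.
    with tempfile.TemporaryDirectory() as root:
        for name in ("alpha", "beta", "gamma"):
            os.mkdir(os.path.join(root, name))
        for path in project_folders(root):
            print(os.path.basename(path))
```

On the server, `project_folders("/data/prod")` would return one path per project, and regular files sitting directly in the root would be skipped.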

To understand how multiprocessing works, we first practice with a small script, `multisync.py`, which uses the `Pool` class from Python's `multiprocessing` module. We define a `run` function that performs one task. Next, we create a `Pool` object, passing the number of worker processes to start; here that is the number of tasks, and it should ideally not exceed the number of CPUs on the system. Finally, we start the tasks by calling the pool's `map` method with the `run` function and the list of tasks as arguments.

`multisync.py`
```python
#!/usr/bin/env python3
from multiprocessing import Pool


def run(task):
    # Do something with task here
    print("Handling {}".format(task))


if __name__ == "__main__":
    tasks = ['task1', 'task2', 'task3']
    # Create a pool with one worker process per task
    with Pool(len(tasks)) as p:
        # Run every task in the pool and wait until all of them finish
        p.map(run, tasks)
```
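Using `len(tasks)` as the pool size works for three tasks, but with many tasks it would start more processes than there are cores. One possible refinement, sketched here with a hypothetical helper `pool_size`, is to cap the number of workers at `os.cpu_count()`. Whether that cap is the best choice for a job that mostly waits on `rsync` is an assumption that would need measuring.

```python
import os


def pool_size(num_tasks, max_workers=None):
    """Pick a worker count: one per task, capped at max_workers (default: CPU count)."""
    if max_workers is None:
        # os.cpu_count() can return None on unusual platforms, so fall back to 1.
        max_workers = os.cpu_count() or 1
    return max(1, min(num_tasks, max_workers))


if __name__ == "__main__":
    for n in (3, 7, 500):
        print(n, "tasks ->", pool_size(n), "workers")
```

The pool would then be created with `Pool(pool_size(len(tasks)))`; `map` still hands every task to a worker, just not all at once.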

`dailysync.py`

```python
#!/usr/bin/env python
import subprocess
src = "/data/prod/"
dest = "/data/prod_backup/"
subprocess.call(["rsync", "-arq", src, dest])
```

**Let's fix the CPU-bound bottleneck so that the backup no longer takes more than 20 hours to finish.** We apply multiprocessing, which takes advantage of the idle CPU cores to sync folders in parallel.

### The result of multiprocessing over main folders `['alpha', 'sigma', 'gamma', 'omega', 'beta', 'kappa', 'delta']`:

```python
#!/usr/bin/env python3
import os
import subprocess
from multiprocessing import Pool

import tqdm

SRC_ROOT = 'data/prod'      # relative to the current working directory
DEST = 'data/prod_backup/'


def run(task):
    """Sync one project folder to the backup location with rsync."""
    # No trailing slash on the source, so rsync recreates the folder itself under DEST
    returncode = subprocess.call(["rsync", "-zavh", task, DEST])
    return task, returncode


if __name__ == "__main__":
    print('current dir', os.getcwd())
    folders = ['alpha', 'sigma', 'gamma', 'omega', 'beta', 'kappa', 'delta']
    tasks = [os.path.join(SRC_ROOT, folder) for folder in folders]

    # Create a pool with one worker process per project folder
    with Pool(len(tasks)) as p:
        # imap_unordered yields each result as soon as its rsync finishes, so the
        # progress bar advances per folder and every folder is synced only once
        for task, returncode in tqdm.tqdm(p.imap_unordered(run, tasks), total=len(tasks)):
            if returncode != 0:
                print("rsync failed for {} (exit code {})".format(task, returncode))
```
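On the difference between threading and multiprocessing: threads share one Python interpreter, so CPU-heavy Python code in several threads does not run in parallel, whereas each process in a `Pool` has its own interpreter. Here, though, the heavy lifting happens inside the separate `rsync` processes, and the Python side only waits for them, so a thread pool can in principle drive the same parallel transfers. The sketch below shows that thread-based alternative, not the approach used in the script above; `run_command` and `run_all` are hypothetical helpers, and the demo launches small Python child processes in place of `rsync`.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one external command and return its exit code."""
    return subprocess.call(command)


def run_all(commands, max_workers=4):
    """Run the commands concurrently from threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_command, commands))


if __name__ == "__main__":
    # Stand-in commands; on the server each entry would be an rsync argument list.
    demo = [[sys.executable, "-c", "import sys; sys.exit(0)"] for _ in range(3)]
    print(run_all(demo))
```

If `run` did substantial computation in Python itself (for example, hashing file contents), the process pool would be the safer choice.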