# 1. Process

There are several ways to create a process in Python:
- fork() (Linux/Unix only)
- the multiprocessing module (cross-platform)
- the subprocess module (to run external programs)

## 1.1 Use fork() to create process

Unix/Linux operating systems provide a fork() system call. A call to fork() returns twice: the OS **duplicates the current process**, treating the current process as the **parent process** and the copy as the **child process**, and the call then returns once in each of them.

Note: **in the child process, fork() always returns 0; in the parent process, it returns the ID of the child process.** A parent can fork many children, so the returned IDs let it keep track of all of them. **The child process can get the ID of its parent process by calling getppid()**.

A real-world example of fork is the Apache web server in its prefork mode: the main (parent) process listens on port 80 and forks child processes, and each incoming HTTP request is handled by a child process rather than by the parent.

Python's os module exposes fork(), so we can use it to create a process, as in the example below. Note that this only works on Linux/Unix and macOS (which is Unix-based), because Windows does not provide a fork() call.

In [1]:
from multiprocessing import Process,Pool, Queue
import os, time, random
import subprocess

In [None]:
print('Process (%s) start...' % os.getpid())
# Only works on Unix/Linux/Mac:
pid = os.fork()
if pid == 0:
    print('I am child process (%s) and my parent is %s.' % (os.getpid(), os.getppid()))
else:
    print('I (%s) just created a child process (%s).' % (os.getpid(), pid))
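
In practice the parent usually also waits for the child, so that the child's exit status is collected. The sketch below is one possible way to do this with os.waitpid(); the helper name fork_and_wait and the exit code 7 are illustrative assumptions, not part of the example above. Like os.fork() itself, it only runs on Unix-like systems.

```python
import os


def fork_and_wait(exit_code: int = 7) -> int:
    """Fork a child that exits with exit_code and return the code the parent observes."""
    pid = os.fork()
    if pid == 0:
        # Child branch: exit right away, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent branch: block until this particular child has terminated.
    _, status = os.waitpid(pid, 0)
    # Extract the exit code from the raw wait status.
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("Child exit code:", fork_and_wait())
```

os._exit() is used in the child so that it terminates immediately instead of continuing to run the rest of the parent's program (or notebook cell).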

## 1.2 Use multiprocessing

Another way to create a process is to use the **multiprocessing module**. Unlike fork(), the **multiprocessing module works on all operating systems (including Windows)**.

The multiprocessing module provides an API very similar to that of the threading module. It provides ways to **share data across the processes it creates** and makes managing multiple processes that run Python code much easier. In other words, multiprocessing lets you get your tasks done faster by executing code in parallel in several processes.

The module provides a class called **Process** that represents a process. Below is a simple example.

In [2]:

# the function which the child process will run
def run_proc(name):
    print('Run child process %s (%s)...' % (name, os.getpid()))

In [3]:

# main (parent) process that creates the child process
print('Parent process %s.' % os.getpid())
# Create the child process. Two important parameters: target is the function that the child process will run,
# and args is the tuple of arguments passed to that function.
p = Process(target=run_proc, args=('test',))
print('Child process will start.')
# Start the child process
p.start()
# Make the main process wait until the child process terminates before running the rest of its code.
p.join()
print('Child process end.')

Parent process 1672231.
Child process will start.
Run child process test (1673253)...
Child process end.


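Try removing p.join() and see what happens. Hint: without join(), the parent no longer waits at that point, so 'Child process end.' can be printed before the child's output. When the parent exits, multiprocessing still waits for non-daemonic children to finish; only children created with daemon=True are terminated at the moment the parent process ends.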
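
The cells above run at the top level of a notebook. When the same code is moved into a script, it is safer to create processes under an `if __name__ == "__main__":` guard: with the spawn start method (the default on Windows and macOS), the child re-imports the main module, and unguarded code would run again in the child. A minimal sketch, where describe and worker are hypothetical names:

```python
import os
from multiprocessing import Process


def describe(name: str) -> str:
    """Build the message a child prints about itself."""
    return f"child {name} running in process {os.getpid()}"


def worker(name: str) -> None:
    print(describe(name))


if __name__ == "__main__":
    # Code here runs only when the file is executed as the main program,
    # not when a child process re-imports it.
    p = Process(target=worker, args=("test",))
    p.start()
    p.join()
    # exitcode is 0 when the target function returned normally.
    print("child exit code:", p.exitcode)
```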

### 1.2.1 Communication between process

Sometimes processes need to communicate with each other. The multiprocessing module provides **Queue and Pipe** for this; the example below uses a Queue shared by a writer process and a reader process.

In [15]:

# the code for write process
def write(q):
    print('Process to write: %s' % os.getpid())
    for value in ['A', 'B', 'C', 'end']:
        print('Put %s to queue...' % value)
        q.put(value)
        time.sleep(random.random())


# the code for read process
def read(q):
    print('Process to read: %s' % os.getpid())
    flag = True
    while flag:
        value = q.get(True)
        print('Get %s from queue.' % value)
        if value == "end":
            flag = False


In [16]:
# main process
# create a queue for communication
q = Queue()
# create the first child process for writing; q is passed to it as an argument
pw = Process(target=write, args=(q,))
# create the second child process for reading
pr = Process(target=read, args=(q,))
# start the first process
pw.start()
# start the second process
pr.start()

# wait for the two child processes to finish
pw.join()
pr.join()

# you can also force a process to stop:
# pr.terminate()

Process to write: 12048
Process to read: 12051Put A to queue...

Get A from queue.
Put B to queue...
Get B from queue.
Put C to queue...
Get C from queue.
Put end to queue...
Get end from queue.
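
The text mentions pipes as well as queues. For two processes that talk to each other directly, multiprocessing.Pipe() returns a pair of connection objects. The sketch below is only an illustration (the shout function and the use of None as an end marker are assumptions made for this example): the child upper-cases every message it receives and sends it back.

```python
from multiprocessing import Pipe, Process


def shout(conn) -> None:
    """Child side: reply to each message in upper case until None arrives."""
    while True:
        msg = conn.recv()
        if msg is None:
            break
        conn.send(msg.upper())
    conn.close()


if __name__ == "__main__":
    # Pipe() is duplex by default: both ends can send and receive.
    parent_conn, child_conn = Pipe()
    p = Process(target=shout, args=(child_conn,))
    p.start()
    for word in ["a", "b", "c"]:
        parent_conn.send(word)
        print("Reply:", parent_conn.recv())
    # Tell the child to stop, then wait for it.
    parent_conn.send(None)
    p.join()
```

A Pipe connects exactly two endpoints, while a Queue can be shared by several producers and consumers.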


### 1.2.2 Process pool

As with a thread pool, when you have many processes (or threads), the overhead of creating and terminating them is no longer negligible. A process pool avoids this overhead by reusing a fixed set of worker processes.

The Pool class is often a better way to use multiprocessing, because it hands tasks to the available worker processes in first-in, first-out order. It loosely resembles a map-reduce architecture: in essence, it maps the inputs onto different workers and collects the outputs from all workers as a list. Only the tasks currently running occupy a worker process; the remaining tasks wait in a queue until a worker becomes free.

In [17]:

# code for a task running in the pool
def long_time_task(name):
    print(f'Run task {name} ({os.getpid()})...')
    start = time.time()
    time.sleep(random.random() * 3)
    end = time.time()
    print('Task %s runs %0.2f seconds.' % (name, (end - start)))

In [18]:
# the main process creates a pool of two processes that runs 5 tasks
print('Parent process %s.' % os.getpid())
# create a process pool
p = Pool(2)
for i in range(5):
    # submit a task to the pool
    p.apply_async(long_time_task, args=(i,))
print('Waiting for all subprocesses done...')
# close the pool; it will not accept new tasks after close() is called
p.close()
# make the main process wait for all the pool processes to finish
p.join()
print('All subprocesses done.')

Parent process 10878.
Run task 0 (12239)...Run task 1 (12240)...

Waiting for all subprocesses done...
Task 0 runs 0.63 seconds.
Run task 2 (12239)...
Task 1 runs 1.94 seconds.
Run task 3 (12240)...
Task 2 runs 2.35 seconds.
Run task 4 (12239)...
Task 4 runs 0.30 seconds.
Task 3 runs 1.87 seconds.
All subprocesses done.


You can see from the process IDs that the two processes in the pool are freed and reused after each task is done. So instead of creating and terminating 5 processes, we only need two.
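
The tasks above only print inside the workers. To get return values back into the parent, map() returns a list of results, and the object returned by apply_async() has a get() method. A minimal sketch with a hypothetical square function:

```python
from multiprocessing import Pool


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    with Pool(2) as pool:
        # map() blocks and returns the results in input order.
        print(pool.map(square, range(5)))
        # apply_async() returns immediately; get() waits for the result
        # and re-raises any exception that was raised in the worker.
        async_result = pool.apply_async(square, args=(7,))
        print(async_result.get(timeout=10))
```

Calling get() is also how errors surface: if the result of apply_async() is never fetched, an exception raised in the worker goes unnoticed.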

## 1.3 Subprocess

In the examples above, we have seen that the **multiprocessing module** can divide tasks written in Python over multiple processes to improve performance. But what if the task is not written in Python?

Python provides a **subprocess module** that lets you run and control other programs (Python or not). Anything you can start from the command line can be run and controlled with this module. Use it to integrate external programs into your Python code.

The example below runs the command **nslookup www.python.org** by using subprocess.

Notice that the command is stored as a List[str]. subprocess.call() runs the program with these arguments directly (no shell is involved unless shell=True is passed), waits for it to finish, and returns its exit code (0 means success, any non-zero value means an error).

In [12]:

print("$ nslookup www.python.org")
process_cmd1 = ["nslookup", "www.python.org"]
r1 = subprocess.call(process_cmd1)
print(f"return code: {r1}")

$ nslookup www.python.org
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
www.python.org	canonical name = dualstack.python.map.fastly.net.
Name:	dualstack.python.map.fastly.net
Address: 151.101.120.223
Name:	dualstack.python.map.fastly.net
Address: 2a04:4e42:1d::223

return code: 0
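
subprocess.call() lets the child write straight to the terminal. When the output is needed as a Python string, subprocess.run() with capture_output=True (Python 3.7+) is one option. The sketch below runs a tiny Python child via sys.executable rather than nslookup, so that it does not depend on the network; run_and_capture is a hypothetical helper.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd (a list of strings) and return (exit code, captured stdout)."""
    completed = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=False,          # do not raise on a non-zero exit code
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    code, out = run_and_capture([sys.executable, "-c", "print('hello from child')"])
    print(f"return code: {code}, output: {out.strip()}")
```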


In [13]:
print("$ ls -lah /tmp")
process_cmd2 = ["ls", "-lah", "/tmp"]
r2 = subprocess.call(process_cmd2)
print(f"return code: {r2}")

$ ls -lah /tmp
total 88K
drwxrwxrwt 20 root root 4.0K Apr 19 13:36 .
drwxr-xr-x 20 root root 4.0K Dec  5 20:10 ..
-rw-------  1 pliu pliu    0 Apr 19 08:16 config-err-2sAey0
drwxrwxrwt  2 root root 4.0K Apr 19 08:07 .font-unix
drwxr-xr-x  2 pliu pliu 4.0K Apr 19 08:18 hsperfdata_pliu
drwxrwxrwt  2 root root 4.0K Apr 19 08:16 .ICE-unix
drwx------  2 pliu pliu 4.0K Apr 19 08:16 ssh-BELEbbXWeWDq
drwx------  3 root root 4.0K Apr 19 08:09 systemd-private-71929981a8634f3f88ccfc6eac88c7cd-colord.service-qc9g7g
drwx------  3 root root 4.0K Apr 19 08:08 systemd-private-71929981a8634f3f88ccfc6eac88c7cd-ModemManager.service-7gOp6i
drwx------  3 root root 4.0K Apr 19 08:07 systemd-private-71929981a8634f3f88ccfc6eac88c7cd-switcheroo-control.service-Tvf0vh
drwx------  3 root root 4.0K Apr 19 08:07 systemd-private-71929981a8634f3f88ccfc6eac88c7cd-systemd-logind.service-4Ms7Fg
drwx------  3 root root 4.0K Apr 19 08:07 systemd-private-71929981a8634f3f88ccfc6eac88c7cd-systemd-resolved.service-jmnDbg
drw

### 1.3.1 Pass parameters into subprocess

Some commands open an interactive prompt. Subprocess allows us to send input to that prompt after the command has started.

For example, the command **nslookup** opens an interactive prompt, in which we enter three commands:

```text
set q=mx
python.org
exit
```

Below is a full example
```shell
pliu@ubuntu:~$ nslookup
> set q=mx
> python.org
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
python.org	mail exchanger = 50 mail.python.org.

Authoritative answers can be found from:
> exit

```
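To do the same with subprocess, we need:
- Popen: starts the command as a child process, here with pipes attached to its stdin, stdout and stderr
- communicate(): sends input to the child's stdin, waits for it to finish, and returns its stdout and stderr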

Below is a full example

In [14]:
import subprocess

print('$ nslookup')
p = subprocess.Popen(['nslookup'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# note: communicate() only accepts bytes here (the pipes are in binary mode), so we pass a bytes literal
output, err = p.communicate(b'set q=mx\npython.org\nexit\n')
# the output of communicate() is bytes too, so we decode it back to a string
print(output.decode('utf-8'))
print('Exit code:', p.returncode)

$ nslookup
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
python.org	mail exchanger = 50 mail.python.org.

Authoritative answers can be found from:


Exit code: 0
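
For a simple one-shot exchange like this, subprocess.run() with the input argument can replace the Popen/communicate pair, and text=True avoids the manual bytes conversion. In the sketch below the child is a small Python one-liner that echoes its stdin in upper case (an assumption made for portability; feed_stdin is a hypothetical helper).

```python
import subprocess
import sys


def feed_stdin(text: str) -> str:
    """Send text to a child's stdin and return what the child wrote to stdout."""
    child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        input=text,           # written to the child's stdin, which is then closed
        capture_output=True,
        text=True,            # str in, str out
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    print(feed_stdin("set q=mx\npython.org\nexit\n"))
```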



## 1.4 Multiprocess for data science

In data science, multiprocessing is especially useful, because we often deal with big data and long-running tasks. We can use multiple processes to improve the performance of our applications.

In the example below, we test the performance of an application in three modes:
- mono-process
- multiprocess
- multiprocess with pool

### 1.4.1 Mono process

In [4]:

# code to run in process
def run_task1(task_id: int):
    print(f"Starting task: {task_id} in process {os.getpid()}")
    time.sleep(1)
    print(f"Ending task: {task_id} in process {os.getpid()}")


def run_task2(n):
    sum_factors = 0
    for i in range(1, n):
        if n % i == 0:
            sum_factors = sum_factors + i
    if sum_factors == n:
        print('{} is a Perfect number'.format(n))

In [29]:
# Test for task 1
start_time = time.time()
for i in range(10):
    run_task1(i)
end_time = time.time()
print("Mono process test in {:.4f} seconds".format(end_time - start_time))

Starting task: 0 in process 10878
Ending task: 0 in process 10878
Starting task: 1 in process 10878
Ending task: 1 in process 10878
Starting task: 2 in process 10878
Ending task: 2 in process 10878
Starting task: 3 in process 10878
Ending task: 3 in process 10878
Starting task: 4 in process 10878
Ending task: 4 in process 10878
Starting task: 5 in process 10878
Ending task: 5 in process 10878
Starting task: 6 in process 10878
Ending task: 6 in process 10878
Starting task: 7 in process 10878
Ending task: 7 in process 10878
Starting task: 8 in process 10878
Ending task: 8 in process 10878
Starting task: 9 in process 10878
Ending task: 9 in process 10878
Mono process test in 10.0126 seconds


In [27]:
# Test for task 2
start_time = time.time()
for i in range(1, 100000):
    run_task2(i)
end_time = time.time()
print("Mono process test in {:.4f} seconds".format(end_time - start_time))

6 is a Perfect number
28 is a Perfect number
496 is a Perfect number
8128 is a Perfect number
Mono process test in 203.4457 seconds


### 1.4.2 Multi process

In [5]:

# Test for task1
process_list = []
start_time = time.time()
for i in range(10):
    p = Process(target=run_task1, args=(i,))
    process_list.append(p)
    p.start()

# wait for all child processes to finish
for p in process_list:
    p.join()
end_time = time.time()

print("Multi process test in {:.4f} seconds".format(end_time - start_time))

Starting task: 1 in process 107642Starting task: 2 in process 107644Starting task: 3 in process 107650
Starting task: 4 in process 107651
Starting task: 0 in process 107641

Starting task: 5 in process 107662Starting task: 6 in process 107667

Starting task: 7 in process 107670Starting task: 8 in process 107675


Starting task: 9 in process 107678
Ending task: 3 in process 107650Ending task: 1 in process 107642

Ending task: 2 in process 107644Ending task: 0 in process 107641

Ending task: 4 in process 107651
Ending task: 5 in process 107662
Ending task: 6 in process 107667Ending task: 8 in process 107675

Ending task: 7 in process 107670
Ending task: 9 in process 107678
Multi process test in 1.0934 seconds


In [None]:
# Test for task2
# Warning: this starts one process per number (99999 processes) at once
# and ran out of memory in our test.
process_list = []
start_time = time.time()
for i in range(1, 100000):
    p = Process(target=run_task2, args=(i,))
    process_list.append(p)
    p.start()

# wait for all child processes to finish
for p in process_list:
    p.join()
end_time = time.time()

print("Multi process test in {:.4f} seconds".format(end_time - start_time))

6 is a Perfect number
28 is a Perfect number
496 is a Perfect number
8128 is a Perfect number


### 1.4.3 Multi process with pool

In [8]:
# test for task 1
start_time = time.time()
pool = Pool(4)
# map the tasks onto the pool; map() blocks until all of them are done
pool.map(run_task1, range(0, 10))

# close pool
pool.close()
# make the main process wait for all processes in the pool to finish
pool.join()
end_time = time.time()
print("Multi process with pool test in {:.4f} seconds".format(end_time - start_time))

Starting task: 1 in process 107782Starting task: 0 in process 107781Starting task: 2 in process 107783Starting task: 3 in process 107784



Ending task: 1 in process 107782Ending task: 0 in process 107781
Ending task: 3 in process 107784Ending task: 2 in process 107783Starting task: 4 in process 107782


Starting task: 5 in process 107781

Starting task: 6 in process 107783Starting task: 7 in process 107784

Ending task: 4 in process 107782Ending task: 5 in process 107781
Starting task: 8 in process 107782

Starting task: 9 in process 107781Ending task: 6 in process 107783

Ending task: 7 in process 107784
Ending task: 8 in process 107782
Ending task: 9 in process 107781
Multi process with pool test in 3.0626 seconds


In [9]:
# test for task 2
start_time = time.time()
pool = Pool(4)
for i in range(1, 100000):
    # add tasks to pool
    pool.apply_async(run_task2, args=(i,))

# close pool
pool.close()
# make the main process wait for all processes in the pool to finish
pool.join()
end_time = time.time()
print("Multi process with pool test in {:.4f} seconds".format(end_time - start_time))

6 is a Perfect number
28 is a Perfect number
496 is a Perfect number
8128 is a Perfect number
Multi process with pool test in 86.9723 seconds
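
A variation worth trying (a sketch, not the setup measured above): let the task return a value instead of printing, and submit the whole range with pool.map() and a chunksize, so that the numbers are sent to the workers in batches rather than one by one. The name is_perfect, the limit of 10000 and the chunk size of 500 are illustrative choices.

```python
from multiprocessing import Pool


def is_perfect(n: int) -> bool:
    """Return True if n equals the sum of its proper divisors."""
    return n > 1 and sum(i for i in range(1, n) if n % i == 0) == n


if __name__ == "__main__":
    limit = 10_000  # smaller than the 100000 used above, to keep the run short
    numbers = range(1, limit)
    with Pool(4) as pool:
        # chunksize sends the numbers to the workers in batches of 500,
        # which reduces the per-task communication overhead.
        flags = pool.map(is_perfect, numbers, chunksize=500)
    print([n for n, ok in zip(numbers, flags) if ok])
```

Returning values keeps the output in one place (the parent) and avoids interleaved prints from several workers.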


### 1.4.4 Conclusion

We have measured the execution time in three modes.
For task 1:
- mono-process: 10.0126 seconds
- multiprocess: 1.0934 seconds
- multiprocess with pool: 3.0626 seconds

For task 2:
- mono-process: 203.4457 seconds
- multiprocess: failed, out of memory
- multiprocess with pool: 86.9723 seconds


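Even though the multiprocess mode without a pool is faster than the pool for task 1, we still recommend using a pool, because it also saves memory. Task 1 only sleeps for one second, so 10 separate processes finish in about 1 second, while a pool of 4 workers needs three rounds (4 + 4 + 2 tasks) and takes about 3 seconds. With a pool, only the tasks currently running in the pool consume memory; the tasks in the waiting queue do not.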
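
The task 1 timings are consistent with a simple "waves" estimate: when every task takes about the same time, the tasks run in groups of at most `workers` at a time. The helper below is a rough back-of-the-envelope sketch (estimated_wall_time is a hypothetical name, and process start-up overhead is ignored).

```python
import math


def estimated_wall_time(num_tasks: int, workers: int, seconds_per_task: float) -> float:
    """Rough wall time when equal-length tasks run in waves of `workers` at a time."""
    waves = math.ceil(num_tasks / workers)
    return waves * seconds_per_task


if __name__ == "__main__":
    # 10 one-second tasks, as in task 1
    print("mono-process:", estimated_wall_time(10, 1, 1.0))
    print("10 processes:", estimated_wall_time(10, 10, 1.0))
    print("pool of 4:   ", estimated_wall_time(10, 4, 1.0))
```

The estimates of 10, 1 and 3 seconds are close to the measured 10.01, 1.09 and 3.06 seconds. For a CPU-bound task such as task 2, the number of CPU cores also limits how many tasks truly run at once, so this estimate does not carry over directly.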

In task 2, the multiprocess mode failed because we started about 100000 processes simultaneously, and they all consume memory. So, **use a pool when you would otherwise need many processes**.

Thus, by choosing a suitable method from the multiprocessing library, we can achieve a significant reduction in computation time.