<a href="https://colab.research.google.com/github/goteguru/kmooc_python/blob/main/notebooks/en/kmooc_10_1_automatizalas_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automation and processes

If we want to automate processes (for example to start or stop some software on the machine), it's unavoidable to go beyond just Python, and sometimes we need to work in parallel.

Let's see how we can start other programs from Python or run (even Python) programs simultaneously.

## External programs

I mentioned earlier that Colab runs a Linux container, so here we can run Linux commands. The same approach works on every operating system; you just need to adjust program names or paths as appropriate.

For this kind of work, our best friend is the `subprocess` module!

In [None]:
import subprocess

result = subprocess.run(
    ["ls", "-l"],       # or on Windows: ["cmd", "/c", "dir"]
    capture_output=True, # capture the output
    text=True,
    check=False
)

print("Return code:", result.returncode)
print("STDOUT:")
print(result.stdout) # standard output
print("STDERR:")
print(result.stderr) # standard error


Return code: 0
STDOUT:
total 4
drwxr-xr-x 1 root root 4096 Nov 20 14:30 sample_data

STDERR:



As you can see, running a program is not hard: you simply pass the command and its arguments as a list (every parameter after the first was optional). If the program needs no arguments, just give its name.

The return value probably needs some explanation:


**Return code**: Every program returns an integer when it finishes that indicates whether it ran successfully.  
  * 0 → success
  * non-0 → some error or abnormal condition occurred

**Standard output (STDOUT)**: These are the program's normal messages, e.g. what it "prints to the screen". This is where Python's `print()` writes by default.

**Standard error (STDERR)**: The program's error messages and warnings arrive on a separate channel. A program may write here even if it otherwise runs fine.
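
The three parts of the result can be seen side by side in a small sketch. The child program here is made up for illustration: it is a one-off Python snippet (the `child_code` string is my own, not part of the course material) that writes one line to each channel and then exits with return code 3. `sys.executable` is the path of the Python interpreter that is currently running, so the example should not depend on the operating system.

```python
import subprocess
import sys

# Hypothetical child program: one line to STDOUT, one to STDERR, then exit code 3.
child_code = (
    "import sys\n"
    "print('normal message')\n"
    "print('error message', file=sys.stderr)\n"
    "sys.exit(3)\n"
)

result = subprocess.run(
    [sys.executable, "-c", child_code],  # run the snippet with the current interpreter
    capture_output=True,                 # capture both channels separately
    text=True,                           # decode the captured bytes to str
)

print("Return code:", result.returncode)
print("STDOUT:", result.stdout.strip())
print("STDERR:", result.stderr.strip())
```

The two messages arrive in different attributes of `result`, and the non-zero return code signals that the child considered its run unsuccessful.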


## Running multiple programs one after another



In [None]:
import subprocess

commands = [
    ["python", "--version"],
    ["echo", "Hello"],
]

for cmd in commands:
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"Command: {cmd}", end=' ')
    print("RC:", result.returncode, end=' ')
    print("OUT:", result.stdout.strip())

Command: ['python', '--version'] RC: 0 OUT: Python 3.12.12
Command: ['echo', 'Hello'] RC: 0 OUT: Hello
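
When the commands depend on each other, it often makes sense to stop at the first failure. One possible sketch uses `check=True`, which makes `subprocess.run` raise `subprocess.CalledProcessError` for a non-zero return code. The three commands below are invented for the demonstration (the second one fails on purpose), and the variable names `completed` and `failed_code` are my own.

```python
import subprocess
import sys

# Hypothetical pipeline: the second step deliberately exits with code 2.
commands = [
    [sys.executable, "-c", "print('step 1 ok')"],
    [sys.executable, "-c", "import sys; sys.exit(2)"],
    [sys.executable, "-c", "print('never reached')"],
]

completed = []      # outputs of the steps that succeeded
failed_code = None  # return code of the failing step, if any

for cmd in commands:
    try:
        # check=True turns a non-zero return code into an exception
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as err:
        failed_code = err.returncode
        print(f"Command failed with return code {failed_code}, stopping.")
        break
    completed.append(result.stdout.strip())
    print("OK:", result.stdout.strip())
```

With `check=False` (the default) the loop would simply continue, and it would be up to us to inspect `result.returncode` after each step.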


## Multiple programs, but in parallel
If you don't want to wait for each program to finish before starting the next one (for example because they take a long time), you can start them "at the same time". (Note: this code cell takes several seconds to run.)


In [None]:
import subprocess

commands = [
    ["sleep", "5"], # this command does nothing for 5 seconds
    ["sleep", "7"],
    ["sleep", "3"],
]

processes = []

# start them
for cmd in commands:
    p = subprocess.Popen(cmd)
    processes.append(p)

# here we can do something else...
# ...
# ...

# wait for all of them
for p in processes:
    p.communicate()  # blocks until the given process finishes
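
To check that the programs really overlap, the same pattern can be timed. This is only a sketch under a few assumptions: instead of the Linux `sleep` command it starts small Python one-liners through `sys.executable` so that it is not tied to one operating system, and the durations are arbitrary values chosen for the demonstration.

```python
import subprocess
import sys
import time

durations = [1.0, 1.5, 0.5]  # seconds; arbitrary demo values

start = time.perf_counter()

# Popen returns immediately, so all three children are started back to back.
processes = [
    subprocess.Popen([sys.executable, "-c", f"import time; time.sleep({d})"])
    for d in durations
]

# poll() does not block: it returns None while the process is still running.
still_running = sum(p.poll() is None for p in processes)
print("Still running right after start:", still_running)

# wait() blocks until the given process finishes and returns its return code.
return_codes = [p.wait() for p in processes]
elapsed = time.perf_counter() - start

print("Return codes:", return_codes)
print(f"Elapsed: {elapsed:.1f} s, sequentially it would take about {sum(durations):.1f} s")
```

The total time should be close to the longest single duration rather than the sum of all of them.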



In [None]:
!ps xa | grep -E 'python|sleep|grep'

## Multiple Python processes for parallel computation

We already mentioned that a Python program normally runs on a single CPU core. In a world of modern processors with 8, 16 or more cores this can seem wasteful, especially if we're working on a task that can be parallelized well (for example processing separate, independent files).

The `multiprocessing` package helps with this. With it we can start a large number of computations and wait until they have all finished!

Each computation works on a different input (in parallel)! In the example below the inputs are simple numbers, but they could be anything.

In [None]:
from multiprocessing import Pool
import math

def heavy_computation(x: int) -> float:
    # something a bit more CPU-intensive
    s = 0.0
    for i in range(100_000):
        s += math.sqrt(x * i + 1)
    return s

if __name__ == "__main__":  # On Windows this is REQUIRED
    inputs = [1, 2, 3, 4, 5, 6, 7, 8]

    with Pool(processes=4) as pool:  # 4 parallel processes
        results = pool.map(heavy_computation, inputs)

    print(results)


Well, I guess you didn't have to wait too long 😀 These machines today compute a few hundred thousand square roots in no time. But there are tasks that will strain them too. Now you can start your computations in parallel, across multiple cores at once!
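
As a small illustration of the remark that the inputs could be anything, here is a sketch where each input is a (label, text) pair instead of a number. The `count_words` function and the sample documents are invented for this example; in a real task the pairs might be replaced by file paths that each worker opens itself.

```python
from multiprocessing import Pool

def count_words(item: tuple[str, str]) -> tuple[str, int]:
    """Return (label, number of words) for one (label, text) pair."""
    label, text = item
    return label, len(text.split())

if __name__ == "__main__":
    # Hypothetical in-memory "documents" standing in for real files.
    documents = [
        ("a.txt", "one two three"),
        ("b.txt", "four five"),
        ("c.txt", ""),
    ]

    with Pool(processes=2) as pool:
        # map() hands one pair to each call and keeps the input order
        counts = pool.map(count_words, documents)

    print(counts)
```

Whatever the inputs are, they have to be picklable, because they are sent to the worker processes.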




## Multiple processes manually

In the previous example the `with` context manager nicely simplified things. Sometimes, however, you might want to start and stop processes manually. That's not particularly hard either, but it takes a bit more typing:


In [None]:
from multiprocessing import Process
import time # not needed except for sleeping

def worker(name: str, n: int) -> None:
    # here we could do all sorts of complicated things....
    print(f"{name} starting")
    time.sleep(n)  # but for now we just sleep a bit.
    print(f"{name} done")

if __name__ == "__main__":
    # create two processes
    p1 = Process(target=worker, args=("Process A", 2))
    p2 = Process(target=worker, args=("Process B", 3))

    # start them...
    p1.start()
    p2.start()

    # wait for both
    p1.join()
    p2.join()

    # inside the guard, so child processes never print this
    print("All processes have finished.")


## When to use which:

**subprocess** is useful if:
- you want to run a non-Python program (e.g. ffmpeg video encoder, git, static analysis tools, etc.),
- you want to start separate `python script.py` processes and process their output.

**multiprocessing** is much more convenient when:
- you want to parallelize the same Python code,
- you want to run functions on many inputs,
- you don't want to deal with manual process management/output reading.


## Other possibilities for parallel processing

Here we covered only the simplest options, which are sufficient for the problems we encounter most often. Depending on the goal, there are other approaches to running tasks in parallel.
In user interface programming, async solutions are often used: they create a parallel-like effect on a single CPU core and thread, and provide a more interactive user experience. Threads are another option: they are less wasteful than separate processes and allow shared memory usage, although in standard CPython the global interpreter lock means they mainly help with tasks that wait on I/O rather than with heavy computation.

Of course Python also has (multiple) packages for these, even though we didn't cover them in detail. The built-in ones are:

* `asyncio`: https://docs.python.org/3/library/asyncio.html
* `threading`: https://docs.python.org/3/library/threading.html
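
To give a first impression of the "parallel-like effect on a single thread", here is a minimal `asyncio` sketch. The `fake_download` coroutine and its delays are made up for the example; the waiting is simulated with `asyncio.sleep`, which hands control back to the event loop so the other tasks can proceed. Note that `asyncio.run()` is meant for ordinary scripts; inside a notebook, where an event loop is already running, you would write `results = await main()` instead.

```python
import asyncio
import time

async def fake_download(name: str, seconds: float) -> str:
    # While this task "waits", the event loop is free to run the others.
    await asyncio.sleep(seconds)
    return f"{name} done"

async def main() -> list[str]:
    # gather() runs the coroutines concurrently; results keep the input order.
    return await asyncio.gather(
        fake_download("A", 0.3),
        fake_download("B", 0.2),
        fake_download("C", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(main())  # in a notebook: results = await main()
elapsed = time.perf_counter() - start

print(results)
print(f"Elapsed: {elapsed:.2f} s")
```

All three tasks wait at the same time, so the total should be close to the longest delay, even though only one thread is involved.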