## Memory while spawning subprocesses
```
The classical way Unix starts a subprocess is fork + exec.
fork() duplicates the parent's address space (copy-on-write on
modern systems), so the parent's size affects the cost of a spawn.
Below I want to test and compare memory usage while
using different methods of invoking a subprocess.
Updated April 2020

Plan:
1. create an external Python script which internally
   creates a big DataFrame (2 GB) and then sleeps/prints
2. start this script using subprocess.call() - get its PID
    - measure memory usage by PID
    - measure total available memory
3. try several variations:
    - big parent, big child
    - big parent, small child
4. try fork/exec vs posix_spawn
5. try/compare: subprocess.call(), subprocess.check_output(), subprocess.Popen()
6. try os.popen()
7. try multiprocessing
8. try celery

# ---------------------------------
 - https://www.unix.com/unix-for-advanced-and-expert-users/178644-spawn-vs-fork.html

"""
1. all modern UNIX systems also support posix_spawn(), 
   which does not copy all of memory. Its function is 
   to do a "lightweight" version of fork().

2. Linux has something like posix_spawn() called clone().
"""
# ---------------------------------
"multiprocessing" defaults to fork. To switch to "spawn":
    mp.set_start_method('spawn')
"spawn" will launch each child process as a fresh Python
interpreter that only inherits resources as necessary.
# ---------------------------------
```
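The notes above mention `posix_spawn()` as a lighter alternative to fork + exec. Python exposes it as `os.posix_spawn` (Unix only, Python 3.8+). The following is a minimal sketch of launching a child that way; the helper name `run_with_posix_spawn` and the inline `python -c` child are hypothetical stand-ins for the external script described in the plan.

```python
import os
import sys


def run_with_posix_spawn(code):
    """Run `python -c <code>` through os.posix_spawn.

    Returns (pid, exit_code); exit_code is -1 if the child did not
    exit normally (for example, it was killed by a signal).
    """
    argv = [sys.executable, "-c", code]
    # posix_spawn starts the new program directly and returns the child's PID.
    pid = os.posix_spawn(sys.executable, argv, os.environ)
    # Block until the child finishes and collect its status.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return pid, exit_code


if __name__ == "__main__":
    child_pid, rc = run_with_posix_spawn("print('hello from the child')")
    print("child pid:", child_pid, "exit code:", rc)
```

Because the PID is available immediately after `os.posix_spawn` returns, the same memory measurements planned for the `subprocess` variants could be taken before calling `os.waitpid`.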

In [None]:
import sys, os, datetime, subprocess
import pandas as pd
import numpy as np
# import our utilities
from util_jupyter import *
from util_models import *
from mybag import *

In [None]:
def print_dt():
    dt_str = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(dt_str)

In [None]:
def ddd(nrows=10):
    """
    Return a simple pandas DataFrame, useful for quick tests.

    nrows is the number of rows (rounded down to a multiple of 10,
    minimum 10), for example:
        df = ddd()
        df = ddd(100)
        df = ddd(10**6)   # million rows
    """
    n_aa = 10
    nn = int(nrows/n_aa)
    if nn < 1:
        nn = 1
    aa = pd.DataFrame({
          'ii':nn*[0,1,2,3,4,5,np.nan,7,8,9],
          'i1':nn*[6,5,4,3,2,1,0,-1,-2,-3],
          'i2':nn*[6,5,4,4,1,1,0,-1,-2,-3],
          'ff':nn*[0.0,1.0,2.0,np.nan,4.0,5.0,6.0,7.0,8.0,9.0],
          'f1':nn*[0.0,1.01,2.002,3.0003,4.00004,5.000005,6.0000006,7.0,8.0,9.0],
          'f2':nn*[1.11,2.22,3.33,4.44,5.55,7.77,9.99,0.01,-0.01,-1.11],
          'ss':nn*['s0','s1','狗','汽车',np.nan,'s5','s6','s7','s8','s9'],
          's1':nn*list(np.array(['s0','s1','s2','s2',np.nan,'s5','s6','s7','s8','s9'],dtype=str)),
          's2':nn*['1.11','2.22','3.33','4.44','5.55','7.77','9.99','0.01','-0.01','-1.11'],
          'bb':nn*[True, False, True, False, np.nan, False, True,np.nan, False, True],
          'b1':nn*[True, False, True, False, True, False, True, True, False, True],
          'xx':nn*list(range(n_aa)),
          'yy':nn*[x*50 + 60 + np.random.randn() for x in range(n_aa)]
    })
    aa = aa[['ii','i1','i2','ff','f1','f2','ss','s1','s2','bb','b1','xx','yy']].copy()
    aa.index = range(len(aa))

    return aa

In [None]:
%%time
# ------------------------------------------------
# Create a big DataFrame
# ------------------------------------------------
print_dt()
bag = MyBunch()
bag.df = ddd(10**7)
print(bag)

In [None]:
gid = os.getgid()
pid = os.getpid()
print(pid, gid)
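With a PID in hand, step 2 of the plan (measure memory usage by PID and total available memory) might look like the sketch below. It assumes Linux, since it reads `/proc`; on other platforms the helpers return `None`. The names `rss_kb`, `mem_available_kb`, `measure_child`, and the inline child snippet are hypothetical and not part of the original notebook. The child here allocates only about 50 MB rather than the 2 GB DataFrame.

```python
import os
import subprocess
import sys
import time


def _read_kb(path, key):
    """Return the kB value stored under `key` in a /proc file, or None."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.startswith(key + ":"):
                    return int(line.split()[1])
    except OSError:
        pass  # no /proc (not Linux) or the process is already gone
    return None


def rss_kb(pid):
    """Resident set size of `pid` in kB (Linux only), else None."""
    return _read_kb(f"/proc/{pid}/status", "VmRSS")


def mem_available_kb():
    """System-wide MemAvailable in kB (Linux only), else None."""
    return _read_kb("/proc/meminfo", "MemAvailable")


# Hypothetical child: allocate ~50 MB, then sleep so it can be inspected.
CHILD_CODE = "import time; buf = b'x' * (50 * 1024 * 1024); time.sleep(2)"


def measure_child(delay=1.0):
    """Start the child with Popen, wait `delay` seconds, then sample memory.

    Returns (child_pid, child_rss_kb, mem_available_kb).
    """
    child = subprocess.Popen([sys.executable, "-c", CHILD_CODE])
    try:
        time.sleep(delay)  # give the child time to allocate
        return child.pid, rss_kb(child.pid), mem_available_kb()
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("parent pid:", os.getpid(), "rss kB:", rss_kb(os.getpid()))
    print("child (pid, rss kB, available kB):", measure_child())
```

`subprocess.Popen` is used here instead of `subprocess.call()` because `call()` blocks until the child exits and returns only the exit code, so the PID is not available while the child is still running.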