# Agenda

- Assignment feedback
- Generators
- Requests
- Multiprocessing

## Multiprocessing
  * A bit of functional programming: maps
  * Multiprocessing in Python
  * Multiprocessing in the OS

## Functional programming: map

* Given a function $f$ and a list
  * Mapping means applying $f$ to *every* element in the list and collecting the results

In [None]:
def add_two(x):
    return x + 2

In [None]:
my_list = [1, 5, 7, 9]

In [None]:
map(add_two, my_list)

In [None]:
list(map(add_two, my_list))
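In Python 3, `map` returns a lazy iterator rather than a list, which is why the cell above wraps it in `list(...)`. A minimal sketch of that behaviour, reusing `add_two` and the sample values from above; the list comprehension is shown only as an equivalent spelling:

```python
def add_two(x):
    return x + 2


my_list = [1, 5, 7, 9]

lazy = map(add_two, my_list)    # nothing is computed yet
mapped = list(lazy)             # consuming the iterator applies add_two
comprehension = [add_two(x) for x in my_list]

print(mapped)                   # [3, 7, 9, 11]
print(mapped == comprehension)  # True
print(list(lazy))               # [] because the iterator is already exhausted
```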

## Mapping in pandas

In [None]:
import pandas as pd

df = pd.read_csv('befkbhalderstatkode.csv')
df.head()

In [None]:
df['BYDEL'].map(add_two)

In [None]:
df['AAR'].map(add_two)

# An intro to `multiprocessing`

* Concurrency vs. parallelism

* Concurrency = tasks may make progress out of order (interleaved)
* Parallelism = tasks run at the same *time*
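The "out-of-order" part can be seen with a small sketch. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library, which this lecture does not otherwise cover, and a made-up `nap` function that only sleeps. The tasks overlap while they wait, so they finish in order of duration rather than in submission order:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def nap(seconds):
    # Hypothetical task: waiting stands in for I/O such as a network request.
    time.sleep(seconds)
    return seconds


durations = [0.3, 0.1, 0.2]  # submission order

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(nap, d) for d in durations]
    # as_completed yields each future as soon as it finishes.
    finished = [future.result() for future in as_completed(futures)]

print(durations)  # order in which the tasks were submitted
print(finished)   # order in which the tasks completed
```

Whether such tasks also run in parallel depends on the workload and the hardware; here they mostly wait, so the sketch shows concurrency rather than parallel computation.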

## Python multiprocessing library

https://docs.python.org/3/library/multiprocessing.html

In [None]:
from multiprocessing import Pool

In [None]:
with Pool(5) as p:
    print(p.map(add_two, [1, 2, 3]))

In [None]:
contributor_urls = ['https://api.github.com/repositories/596892/contributors?page=' + str(x) for x in range(1, 15)]
contributor_urls

In [1]:
import os
import sys
import time
import logging
import requests
import api_keys
from multiprocessing import Pool, cpu_count


HEADER = {'Authorization': f'token {api_keys.GITHUB_API_KEY}'}
contributor_urls = [f'https://api.github.com/repositories/596892/contributors?page={idx}' for idx in range(1, 15)]


def hard_work(a_url):
    print(f'{__name__}/{os.getppid()}/{os.getpid()} gets data from {a_url}')
    r = requests.get(a_url, headers=HEADER)
    time.sleep(6)
    print('Done')
    return [(contrib['login'], contrib['contributions'],
             contrib['html_url']) for contrib in r.json()]

In [None]:
def run_sequential_download():
    contributors = []
    print('Running the sequential program.')
    start = time.time()
    for contributor_url in contributor_urls:
        contributors += hard_work(contributor_url)
    print(f'It took {time.time() - start}s in total.')

    return contributors

In [None]:
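def run_parallel_processes():
    workers = cpu_count()

    # The context manager shuts the worker processes down when the block exits.
    with Pool(processes=workers) as pool:
        print('Running the concurrent program.')
        start = time.time()
        # Blocks until every URL has been processed; results keep input order.
        result = pool.map(hard_work, contributor_urls)
        print(f'It took {time.time() - start}s in total.')

    return result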
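Note that `pool.map` returns one list per input URL, so `run_parallel_processes` yields a list of lists, while `run_sequential_download` builds one flat list. A minimal sketch of flattening such a result with `itertools.chain`; the `pages` data below is made up and only mimics the shape of the real return value:

```python
from itertools import chain

# Made-up stand-in for the result of pool.map(hard_work, contributor_urls):
# one list of (login, contributions, html_url) tuples per page.
pages = [
    [('alice', 10, 'https://github.com/alice')],
    [('bob', 7, 'https://github.com/bob'),
     ('carol', 3, 'https://github.com/carol')],
]

# Concatenate the per-page lists into one flat list of contributors.
contributors = list(chain.from_iterable(pages))

print(len(contributors))                   # 3
print([login for login, _, _ in contributors])  # ['alice', 'bob', 'carol']
```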

In [None]:
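if __name__ == '__main__':
    # Expect one flag: -s (sequential) or -p (parallel).
    mode = sys.argv[1] if len(sys.argv) > 1 else ''
    if mode == '-s':
        run_sequential_download()
    elif mode == '-p':
        run_parallel_processes()
    else:
        print('Usage: python multiprocessing_example.py [-s|-p]')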

Run `$ top` in one terminal and `$ python multiprocessing_example.py -p` in another.

~~~bash
Running the concurrent program.
__main__/21782/21783 gets data from https://api.github.com/repositories/596892/contributors?page=1
__main__/21782/21784 gets data from https://api.github.com/repositories/596892/contributors?page=2
__main__/21782/21785 gets data from https://api.github.com/repositories/596892/contributors?page=3
__main__/21782/21786 gets data from https://api.github.com/repositories/596892/contributors?page=4
__main__/21782/21787 gets data from https://api.github.com/repositories/596892/contributors?page=5
__main__/21782/21788 gets data from https://api.github.com/repositories/596892/contributors?page=6
__main__/21782/21789 gets data from https://api.github.com/repositories/596892/contributors?page=7
__main__/21782/21790 gets data from https://api.github.com/repositories/596892/contributors?page=8
Done
__main__/21782/21790 gets data from https://api.github.com/repositories/596892/contributors?page=9
Done
__main__/21782/21788 gets data from https://api.github.com/repositories/596892/contributors?page=10
Done
Done
__main__/21782/21784 gets data from https://api.github.com/repositories/596892/contributors?page=11
__main__/21782/21785 gets data from https://api.github.com/repositories/596892/contributors?page=12
Done
Done
__main__/21782/21789 gets data from https://api.github.com/repositories/596892/contributors?page=13
__main__/21782/21787 gets data from https://api.github.com/repositories/596892/contributors?page=14
Done
Done
Done
Done
Done
Done
Done
Done
It took 19.567875146865845s in total.
~~~


~~~bash
PID    COMMAND      %CPU  TIME     #TH    #WQ   #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE    BOOSTS           %CPU_ME %CPU_OTHRS UID  FAULTS    COW      MSGSENT    MSGRECV
21790  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5340      1819     81         34
21789  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5353      1730     81         34
21788  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5323      1725     81         34
21787  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5350      1762     81         34
21786  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5342      1746     81         34
21785  python3.6    0.0   00:00.04 4      2     31    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5358      1794     81         34
21784  python3.6    0.0   00:00.04 5      3     33    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5328      1799     81         34
21783  python3.6    0.0   00:00.04 5      3     32    12M    0B     0B     21782 21782 sleeping *0[2]            0.00000 0.00000    501  5344      1889     81         34
21782  python3.6    0.1   00:00.26 4      0     17    19M    0B     0B     21782 21556 sleeping *0[1]            0.00000 0.00000    501  8136      2015     59         26
~~~

## Let the OS do this.



```python
import os
import sys
import time
import requests
import api_keys


HEADER = {'Authorization': f'token {api_keys.GITHUB_API_KEY}'}


def hard_work(a_url):
    sys.stdout.write(f'{__name__}/{os.getppid()}/{os.getpid()} gets data from {a_url}\n')
    r = requests.get(a_url, headers=HEADER)
    time.sleep(3)
    sys.stdout.write('Done\n')
    return [(contrib['login'], contrib['contributions'],
             contrib['html_url']) for contrib in r.json()]


if __name__ == '__main__':
    sys.stdout.write(str(hard_work(sys.argv[1])))
```

~~~bash
#!/bin/bash
for url in 'https://api.github.com/repositories/596892/contributors?page=1' 'https://api.github.com/repositories/596892/contributors?page=2' 'https://api.github.com/repositories/596892/contributors?page=3' 'https://api.github.com/repositories/596892/contributors?page=4' 'https://api.github.com/repositories/596892/contributors?page=5' 'https://api.github.com/repositories/596892/contributors?page=6' 'https://api.github.com/repositories/596892/contributors?page=7' 'https://api.github.com/repositories/596892/contributors?page=8' 'https://api.github.com/repositories/596892/contributors?page=9' 'https://api.github.com/repositories/596892/contributors?page=10' 'https://api.github.com/repositories/596892/contributors?page=11' 'https://api.github.com/repositories/596892/contributors?page=12' 'https://api.github.com/repositories/596892/contributors?page=13' 'https://api.github.com/repositories/596892/contributors?page=14'

do
    echo "Started python ./hard_work.py ${url}"
    nohup python ./hard_work.py "${url}" </dev/null >> output.log 2>&1 &
done
~~~
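The same idea, starting one operating-system process per task, can also be sketched from Python with `subprocess.Popen`. This is only an illustration under assumptions: the child is a hypothetical one-liner that echoes its argument instead of the real `hard_work.py`, and the `pages` values are placeholders rather than full URLs.

```python
import subprocess
import sys

# Hypothetical stand-in for hard_work.py: prints its parent PID and argument.
child_code = 'import os, sys; print(os.getppid(), sys.argv[1])'
pages = ['page=1', 'page=2', 'page=3']

# Start all children first so they run side by side ...
children = [
    subprocess.Popen([sys.executable, '-c', child_code, page],
                     stdout=subprocess.PIPE, text=True)
    for page in pages
]

# ... then wait for each one and collect what it printed.
outputs = [child.communicate()[0].strip() for child in children]
print(outputs)
```

Unlike the shell script, which detaches the children with `nohup ... &` and logs to a file, this sketch waits for every child and reads its output through a pipe.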

Read the official docs for more information, for example on how to share data between processes: https://docs.python.org/3.6/library/multiprocessing.html

## Exercise

* Find the ID of one of your repositories on GitHub
* Write a function that creates a URL by appending an index to this URL:
    `https://api.github.com/repositories/33015583/contributors?page=`
* Create a list of 100 URLs by calling the above function 100 times with indices from 0 to 99 inclusive
* Using `%%timeit`, measure how long it takes to request all 100 URLs sequentially
* Create a thread pool
* Using `%%timeit`, measure how long it takes to request all 100 URLs *in parallel*
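The cells above only show a process pool. One possible way to get a thread pool with the same `map` interface is `multiprocessing.pool.ThreadPool`; whether that is the pool intended here is an assumption. The sketch below uses a hypothetical `fake_fetch` that sleeps instead of touching the network, so the real request and the timing are still left to you:

```python
import time
from multiprocessing.pool import ThreadPool

BASE_URL = 'https://api.github.com/repositories/33015583/contributors?page='


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url): waits briefly, no network.
    time.sleep(0.05)
    return url.rsplit('=', 1)[1]  # the page index at the end of the URL


urls = [BASE_URL + str(idx) for idx in range(4)]

# ThreadPool.map keeps the results in the order of the inputs.
with ThreadPool(4) as pool:
    pages = pool.map(fake_fetch, urls)

print(pages)  # ['0', '1', '2', '3']
```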