# Multiprocessing for map data

In this notebook we demonstrate how to use multiprocessing to process images that contain map data.

First we will demonstrate the transformation we want to apply, and then we will run it on multiple map sections at the same time with multiprocessing.

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import multiprocessing
import pickle
import PIL
from PIL import Image
from pathlib import Path

In [16]:
data_folder = Path("../data/")

What we want to do is load a TIFF image with the data for a map layer, process it and then save the results.

## Example process

### Loading the image

In [11]:
Image.MAX_IMAGE_PIXELS = 2000000000  # Increase PIL image load limit

In [46]:
filename="Hansen_GFC-2018-v1.6_treecover2000_50N_010W.tif"
granule = Image.open(data_folder/"hansen"/filename)
granule.size

(40000, 40000)
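As a quick sanity check on the limit set above, the granule's pixel count can be compared with it. This is a minimal sketch using only the numbers shown in this notebook. The helper names `pixel_count` and `fits_pixel_limit` are hypothetical, and the default limit is Pillow's documented value, taken here as an assumption rather than read from the installed library.

```python
# Numbers taken from the cells above.
GRANULE_SIZE = (40000, 40000)       # granule.size
NOTEBOOK_LIMIT = 2_000_000_000      # value assigned to Image.MAX_IMAGE_PIXELS

# Assumption: Pillow's documented default limit (1024 * 1024 * 1024 // 4 // 3).
ASSUMED_DEFAULT_LIMIT = 1024 * 1024 * 1024 // 4 // 3


def pixel_count(size):
    """Total number of pixels for a (width, height) pair."""
    width, height = size
    return width * height


def fits_pixel_limit(size, limit):
    """Return True if an image of this size stays within the pixel limit."""
    return pixel_count(size) <= limit


print(pixel_count(GRANULE_SIZE))
print(fits_pixel_limit(GRANULE_SIZE, ASSUMED_DEFAULT_LIMIT))
print(fits_pixel_limit(GRANULE_SIZE, NOTEBOOK_LIMIT))
```

Under these assumptions the granule has 1.6 billion pixels, which is well above the assumed default and below the raised limit, so the increase is what lets the file load.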

### Processing the image

In [40]:
granule_resized = granule.resize((1000, 1000), PIL.Image.BILINEAR)  # LINEAR is an old alias that recent Pillow versions no longer provide
granule_array = np.array(granule_resized)

### Saving the image to file

In [56]:
processed_filename = filename[:-3]+"pickle"
with open(data_folder/"processed/hansen"/processed_filename, "wb") as file:
    pickle.dump(granule_array, file)

We can check that it was stored properly by loading it again.

In [59]:
with open(data_folder/"processed/hansen"/processed_filename, "rb") as file:
    array_check = pickle.load(file)
print(np.all(array_check == granule_array))
del array_check

True


### As a function

We can turn the previous process into a function for easier use later on.

In [110]:
def process_image(image):
    # Just a simple example of a process, but non-trivial in terms of time
    return np.array(image.resize((1000, 1000), PIL.Image.BILINEAR))

def image_pipeline(origin, destination=None):
    origin = Path(origin)
    if destination is None:
        # Default: <data folder>/processed/<source folder name>/<stem>.pickle
        filename = origin.stem + ".pickle"
        destination = origin.parent.parent/"processed"/origin.parts[-2]/filename
    destination = Path(destination)  # Accept plain strings as well
    with Image.open(origin) as loaded:  # Close the source file once processed
        processed = process_image(loaded)
    with destination.open("wb") as file:
        pickle.dump(processed, file)
    return destination  # Just to return something

And we can check that it worked.

In [111]:
%%time

origin = data_folder/"hansen"/"Hansen_GFC-2018-v1.6_treecover2000_50N_000E.tif"
image_pipeline(origin)

destination = data_folder/"processed/hansen/Hansen_GFC-2018-v1.6_treecover2000_50N_000E.pickle"
Path(destination).is_file()  # This checks that the file now exists

CPU times: user 7.08 s, sys: 624 ms, total: 7.7 s
Wall time: 7.71 s


True
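The default destination used by `image_pipeline` can also be checked on its own, without loading any image. The sketch below repeats the same `pathlib` logic in a hypothetical helper, `default_destination`, which is not part of the notebook's pipeline and is shown for illustration only.

```python
from pathlib import Path


def default_destination(origin):
    """Mirror the default path logic of image_pipeline (hypothetical helper)."""
    origin = Path(origin)
    filename = origin.stem + ".pickle"   # swap the .tif suffix for .pickle
    source_folder = origin.parts[-2]     # e.g. "hansen"
    return origin.parent.parent / "processed" / source_folder / filename


example = Path("../data/hansen/Hansen_GFC-2018-v1.6_treecover2000_50N_000E.tif")
print(default_destination(example))
```

Note that this logic expects the origin path to include at least its parent folder; a bare file name such as `"image.tif"` has no `parts[-2]` and would raise an `IndexError`.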

## Multiprocessing

Now we use our humble pipeline with `multiprocessing`.

First it would be useful to know how many CPUs this machine has. This is easy to do with `multiprocessing.cpu_count()`.

In [112]:
print(f"There are {multiprocessing.cpu_count()} CPUs available.")

There are 16 CPUs available.


There are various ways to run work in multiple processes, but we are going to use a `Pool`. We need to gather all the inputs we want to pass to our pipeline in a list.

In [113]:
input_list = list(origin.parent.glob("*.tif"))

Then, we define a `Pool` manager with a set number of maximum concurrent processes.

In [114]:
max_processes = multiprocessing.cpu_count() // 2  # Let's use at most half of our processors here
pool = multiprocessing.Pool(max_processes)
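Two details may be worth keeping in mind when sizing and creating a pool. On a machine with a single CPU, `cpu_count() // 2` evaluates to 0, and `Pool` requires at least one process. A pool also holds on to its worker processes until it is closed or terminated. The sketch below shows one possible way to handle both; the helper `half_the_cpus` and the stand-in worker `square` are hypothetical names used only for illustration.

```python
import multiprocessing


def half_the_cpus(cpu_count):
    """Use half of the CPUs, but never fewer than one process."""
    return max(1, cpu_count // 2)


def square(x):
    """Stand-in for image_pipeline: any picklable top-level function works."""
    return x * x


if __name__ == "__main__":
    n_processes = half_the_cpus(multiprocessing.cpu_count())
    # Leaving the with-block terminates the worker processes.
    with multiprocessing.Pool(n_processes) as demo_pool:
        print(demo_pool.map(square, [1, 2, 3, 4]))
```

Using the pool as a context manager is a design choice rather than a requirement; calling `pool.close()` followed by `pool.join()` once the work is done is an alternative.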

Finally, we execute it over all our inputs with the `.map` method.

In [116]:
%%time
results = pool.map(image_pipeline, input_list)

CPU times: user 20.2 ms, sys: 6.99 ms, total: 27.2 ms
Wall time: 9.67 s


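And done! With enough processes in the pool, processing all the images takes only slightly more wall time (9.67 s) than processing a single one (7.71 s). Note that the CPU times reported above only cover the parent process; the actual work happens in the worker processes.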
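To get a feel for how this scales, a rough estimate is that the tasks run in rounds of at most one task per process. The sketch below is a simplified model under that assumption: it ignores disk contention, process start-up cost and differences between images, and the task counts are hypothetical, since the number of images in `input_list` is not shown here. The function name `estimated_wall_time` is also hypothetical.

```python
import math


def estimated_wall_time(n_tasks, n_processes, seconds_per_task):
    """Rough estimate: tasks run in ceil(n_tasks / n_processes) rounds."""
    rounds = math.ceil(n_tasks / n_processes)
    return rounds * seconds_per_task


# Hypothetical workloads with 8 processes and about 7.7 s per image.
for n_tasks in (1, 8, 9, 16, 17):
    print(n_tasks, estimated_wall_time(n_tasks, 8, 7.7))
```

In this model the wall time stays close to the single-image time as long as there are no more images than processes, and grows by roughly one single-image time for each additional round.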