# Session 11: Data Processing

In [1]:
import os

# For parallel processing
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.channels import LocalChannel
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# helpers
from grouputils import initialize_rasterizer

## Background

![](https://raw.githubusercontent.com/PermafrostDiscoveryGateway/viz-raster/develop/docs/images/raster_tldr.png)

Today, we will take the regularly gridded staged data from yesterday and create rasters from the vector data (left side of the diagram above). Each pixel of the raster will contain a statistic calculated from the underlying vector data for that pixel.

The two statistics we calculate are:

- number of IWP per pixel
- proportion of pixel covered by IWP

Similar to the `TileStager` from the first step of the group project, in this step we will use a `RasterTiler` class, which we initialize using `initialize_rasterizer`.

In [None]:
iwp_rasterizer = initialize_rasterizer("/home/shares/example-pdg-data")


Let's have a look at the data we need to rasterize first.

In [None]:

staged_paths = iwp_rasterizer.tiles.get_filenames_from_dir('staged')
print(len(staged_paths), "files to rasterize.")

To rasterize just one file, we can call the `rasterize_vector` method on our `RasterTiler` object and pass it the first file in our list of staged files.

In [None]:
iwp_rasterizer.rasterize_vector(staged_paths[0])

This went pretty quickly, but with thousands of files, it would still be beneficial to parallelize. Estimate how long it would take to process all of the files in series below:

In [None]:
# estimate duration of rasterization process
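
One way to approach this estimate is to time a single call and multiply by the number of files. The sketch below is only an illustration: `time_call`, `estimate_serial_seconds`, and `stand_in_task` are hypothetical names, and the stand-in task simply sleeps. In the notebook you would time `iwp_rasterizer.rasterize_vector(staged_paths[0])` and use `len(staged_paths)` instead. It also assumes every file takes about as long as the one you timed, which may not hold for tiles with very different numbers of polygons.

```python
import time

def time_call(func, *args):
    """Return the wall-clock seconds taken by one call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

def estimate_serial_seconds(seconds_per_file, n_files):
    """Extrapolate a single-file timing to n_files processed one after another."""
    return seconds_per_file * n_files

def stand_in_task(path):
    # Hypothetical placeholder for iwp_rasterizer.rasterize_vector(path).
    time.sleep(0.01)

seconds_per_file = time_call(stand_in_task, "example.gpkg")
estimate = estimate_serial_seconds(seconds_per_file, 2000)
print(f"Estimated serial duration: {estimate / 60:.1f} minutes")
```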

## Rasterize in Parallel

Given what you know about the limits of parallelization, discuss in your group whether you think this problem is:

- CPU-bound
- memory-bound
- I/O-bound
- network-bound

How does it compare to the parallelization task from yesterday?

Instead of submitting each of the 2000 files as its own parallel task, we will process the data in batches of files. Below is a function we can use to make batches of files.

In [None]:
# Because rasterization is relatively quick, we want each parsl "task" to process a batch of tiles.
def make_batch(items, batch_size):
    # Create batches of a given size from a list of items.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
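
To see what `make_batch` returns, here is a small example on a short list. The file names are made up for illustration, and the function is repeated so the example runs on its own. Note that the last batch is shorter when the list length is not a multiple of the batch size.

```python
def make_batch(items, batch_size):
    # Same function as above, repeated so this example is self-contained.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical file names, standing in for staged_paths.
example_paths = [f"tile_{n}.gpkg" for n in range(7)]

batches = make_batch(example_paths, 3)
for batch in batches:
    print(len(batch), batch)
```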

Use the `make_batch` function to split our staged data (`staged_paths`) into batches of 10 files each.

In [None]:
# make batches of 10 files

Now, set up your Parsl executor again, with `max_workers` set to 32.

In [None]:
# Set up Parsl executor


Next, set up your Parsl app to run the `rasterize_vector` method in parallel. Remember that Parsl apps cannot rely on global variables or on imports made outside the app function, so import what you need inside the function and pass the app all of the variables it needs.

In [None]:
# Make a Parsl app that uses the rasterize_vector method

Now, execute the app in parallel over all of the batches of files you created previously. Don't forget to add a line to shut down the executor.

In [None]:
# Rasterize the batches in parallel

If you set your executor up correctly, this should have finished in around 18 minutes. Check that you have the same number of GeoTIFF files as staged vector tiles.

In [None]:
geotiff_paths = iwp_rasterizer.tiles.get_filenames_from_dir('geotiff')
len(geotiff_paths) == len(staged_paths)


## Bonus

Thinking back to the group project yesterday, how would you test whether this problem is CPU-bound or I/O-bound? One option would be to test whether adding more CPU workers decreases your computation time. Set up a test on a subset of the staged files (the first 500 is fine) and time the process at different values of `max_workers`. To make sure you don't overload the server, do not exceed 32 workers in your tests.
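
The shape of such a timing test might look like the sketch below. To keep it runnable anywhere, it swaps in a thread pool from Python's standard `concurrent.futures` module in place of Parsl, and `stand_in_batch_task` is a hypothetical placeholder that only sleeps. In your own test you would instead load a fresh Parsl config with each `max_workers` value and submit your real rasterization app. If the elapsed time stops dropping as workers are added, that suggests something other than CPU is the bottleneck.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stand_in_batch_task(batch):
    # Hypothetical placeholder for rasterizing one batch of files.
    time.sleep(0.01 * len(batch))
    return len(batch)

def time_with_workers(batches, n_workers):
    """Run all batches with n_workers and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces every task to finish before the clock stops.
        list(pool.map(stand_in_batch_task, batches))
    return time.perf_counter() - start

# 16 made-up batches of 5 items each.
batches = [list(range(5)) for _ in range(16)]

timings = {n: time_with_workers(batches, n) for n in (1, 2, 4, 8)}
for n, seconds in timings.items():
    print(f"{n:>2} workers: {seconds:.2f} s")
```

Because the placeholder task only sleeps, it behaves like an idealized task that scales with worker count; the real rasterization step may level off much earlier.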