# Parallelism for Image Pre-processing

In [2]:
import cv2
import os
import time
from multiprocessing import Pool

We test how multiprocessing affects the time needed to pre-process images. The processing function reads an image, converts it to grayscale, and resizes it to 224x224 pixels.

The cv2 module (OpenCV) contains many functions for image pre-processing; a reference can be found at https://www.kaggle.com/code/khotijahs1/cv-image-preprocessing.

First, we run the operation sequentially and then in parallel on a folder containing 1591 images, and compare the elapsed times.

The following code performs the sequential operation:

In [3]:
start_time = time.time()

path = "../Data/PlantVillage/Tomato_healthy"
save_path = "../sequentially_processed_images"
os.makedirs(save_path, exist_ok=True)

def image_process(image_path):
    # read the images from given path
    img = cv2.imread(os.path.join(path, image_path))
    # convert image to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # resize image to 224*224 pixels
    resized = cv2.resize(gray, (224, 224), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(save_path, image_path), resized)


list_image = os.listdir(path)
for image_path in list_image:
    image_process(image_path)

print('Processing time: {0} [sec]'.format(time.time() - start_time))


Processing time: 3.2909960746765137 [sec]
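
The loop above passes every entry returned by `os.listdir` to `image_process`. If the folder contains anything that is not an image (a hidden system file, for example), `cv2.imread` returns `None` and the grayscale conversion fails. Below is a minimal sketch of one way to guard against this. The helper name `list_images` and the set of extensions are assumptions for illustration and are not part of the original notebook:

```python
import os

# Assumed set of extensions; adjust to match the files in the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return the sorted names of regular files in `folder` that have an image extension."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    )
```

With this helper, `list_image = list_images(path)` could stand in for `os.listdir(path)` in either the sequential or the parallel version.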


## Multi-processing

In [5]:
start_time = time.time()


path = "../Data/PlantVillage/Tomato_healthy"
save_path = "../parallelly_processed_images"
os.makedirs(save_path, exist_ok=True)
def image_process(image_path):
    img = cv2.imread(os.path.join(path, image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (224, 224), interpolation = cv2.INTER_AREA)
    cv2.imwrite(os.path.join(save_path, image_path), resized)
    return
def main():
    # list every file in the input folder
    list_image = os.listdir(path)
    # number of logical CPUs reported by the operating system
    workers = os.cpu_count()
    # start one worker process per CPU
    with Pool(workers) as p:
        # distribute the images across the worker processes
        p.map(image_process, list_image)

    print('Processing time: {0} [sec]'.format(time.time() - start_time))


if __name__ == '__main__':
    main()

Processing time: 1.2821319103240967 [sec]


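If the multiprocessing cell does not run correctly in the Jupyter notebook (depending on the platform's process start method, worker processes may fail to find functions defined interactively), the same code can be run from a Python script.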

In [1]:
%pycat parallel.py

import cv2
import os
import time
from multiprocessing import Pool


start_time = time.time()

path = "/Users/xin/Library/CloudStorage/OneDrive-UniversityofBristol/DST/DST-assessment-3/Data/PlantVillage/Tomato_healthy"
save_path = "./parallelly_processed_images1"
os.makedirs(save_path, exist_ok=True)

def image_process(image_path):

In [2]:
%run -i parallel.py

Processing time: 1.2050859928131104 [sec]


Processing the 1591 images in this folder in parallel with 8 CPUs reduced the processing time by a factor of about 2.6 (from roughly 3.3 s to 1.3 s), which shows that multiprocessing pays off when there are many images to process.
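
The speedup can be worked out directly from the two timings printed above. The short calculation below also reports parallel efficiency (speedup divided by the number of workers), a quantity the notebook does not compute itself. The variable names are illustrative:

```python
# Timings printed by the sequential and parallel notebook cells above.
sequential_s = 3.2909960746765137
parallel_s = 1.2821319103240967
workers = 8  # number of CPUs stated in the text

speedup = sequential_s / parallel_s   # how many times faster the parallel run is
efficiency = speedup / workers        # fraction of the ideal 8x speedup achieved

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 2.57x, efficiency: 32%"
```

An efficiency well below 100% suggests that part of the run time goes to work that does not scale with the number of workers, such as starting processes and reading and writing files.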

## Another folder of images

Now we test multiprocessing on a smaller dataset of only 145 images, which is less than a tenth of the number of images in the previous folder.

In [20]:
start_time = time.time()

path = "../Data/PlantVillage/Potato___healthy"
save_path = "../sequentially_processed_images1"
os.makedirs(save_path, exist_ok=True)

def image_process(image_path):
    img = cv2.imread(os.path.join(path, image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (224, 224), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(save_path, image_path), resized)


list_image = os.listdir(path)
for image_path in list_image:
    image_process(image_path)

print('Processing time: {0} [sec]'.format(time.time() - start_time))

Processing time: 0.2946169376373291 [sec]
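
A single sub-second measurement like the one above is sensitive to disk caching and background load. One way to make such timings more robust is to repeat the run and report the best and the median time, using `time.perf_counter`, which is intended for measuring short durations. Below is a minimal sketch. The helper `time_repeated` is hypothetical, and the workload is a stand-in for the image loop:

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return (best, median) elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.median(durations)


# Stand-in workload; in the notebook this would be the loop over list_image.
best, median = time_repeated(lambda: sum(range(100_000)))
print(f"best: {best:.4f} s, median: {median:.4f} s")
```
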


In [24]:
start_time = time.time()


path = "../Data/PlantVillage/Potato___healthy"
save_path = "../parallelly_processed_images1"
os.makedirs(save_path, exist_ok=True)
def image_process(image_path):
    img = cv2.imread(os.path.join(path, image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    resized = cv2.resize(gray, (224, 224), interpolation = cv2.INTER_AREA)
    cv2.imwrite(os.path.join(save_path, image_path), resized)
    return
def main():
    list_image = os.listdir(path)
    workers = os.cpu_count()
    with Pool(workers) as p:
        p.map(image_process, list_image)

    print('Processing time: {0} [sec]'.format(time.time() - start_time))


if __name__ == '__main__':
    main()

Processing time: 0.6227138042449951 [sec]


In [21]:
%pycat parallel1.py

import cv2
import os
import time
from multiprocessing import Pool


start_time = time.time()

path = "/Users/xin/Library/CloudStorage/OneDrive-UniversityofBristol/DST/DST-assessment-3/Data/PlantVillage/Potato___healthy"
save_path = "./parallelly_processed_images1"
os.makedirs(save_path, exist_ok=True)

def image_process(image_path):

In [16]:
%run -i parallel1.py

Processing time: 0.5803048610687256 [sec]


For this much smaller dataset, multiprocessing takes about twice as long as sequential processing (roughly 0.6 s versus 0.3 s).
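
One practical response to this result is to start worker processes only when the batch is large enough to repay their start-up cost. Below is a minimal sketch of that idea. The helper `run_batch` and the default threshold of 500 items are assumptions for illustration, and a suitable threshold would have to be measured for each workload and machine:

```python
import os
from multiprocessing import Pool


def run_batch(func, items, min_parallel_items=500, workers=None):
    """Apply func to every item, using a process pool only for large batches.

    min_parallel_items is an assumed cut-off, not a measured value.
    """
    items = list(items)
    if len(items) < min_parallel_items:
        # Small batch: a plain loop avoids process start-up overhead.
        return [func(item) for item in items]
    # Large batch: func must be picklable (defined at module level).
    with Pool(workers or os.cpu_count()) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    # Three items is below the threshold, so no processes are started.
    print(run_batch(abs, [-3, 1, -2]))
```

In the notebook's terms, `run_batch(image_process, list_image)` would process the 145-image folder sequentially and the 1591-image folder in parallel under this assumed threshold.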

## Conclusion
Our experiments show that multiprocessing can substantially reduce processing time for large batches of images: with 1591 images, spreading the work across 8 CPUs cut the run time from about 3.3 s to about 1.3 s. For the 145-image folder, however, the parallel version took about twice as long as the sequential one, most likely because the overhead of starting and coordinating worker processes outweighed the gains. Whether multiprocessing pays off therefore depends on the size of the dataset and the cost of processing each item, so it is worth measuring both approaches before committing to one.