## 1. Preparation

Download and preprocess the BDD100k dataset. The steps are the same as those in the [bdd100k.ipynb](https://github.com/nyikovicsmate/thesis/blob/dev/utils/datasets/bdd100k.ipynb) dataset preparation notebook.

In [3]:
# get the dynamic download link
!curl -s "https://2x5kv9t5uf.execute-api.us-west-2.amazonaws.com/production?func=create_download_challenge_link&filename=bdd100k"%"2Fbdd100k_images.zip" -H "Accept: */*" -o uri.txt
# download the dataset (approx 6.5G)
!xargs -n 1 curl -o "bdd100k_images.zip" < uri.txt
# extract
!unzip -q bdd100k_images.zip -d bdd100k_images

# if unzipping fails, the most likely cause is a failed download
# this can happen when Colab is too slow to start the download and the dynamic download link expires
# if that happens, run the cell again

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6619M  100 6619M    0     0  36.1M      0  0:03:03  0:03:03 --:--:-- 36.5M
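
A failed download usually leaves a file that is not a valid archive, so it can help to check the file before unzipping. The following is a minimal sketch using only the Python standard library; the helper name `looks_like_valid_zip` is hypothetical and not part of the original notebook. Note that the CRC check reads the whole archive, which takes a while for a file of this size.

```python
import zipfile
from pathlib import Path


def looks_like_valid_zip(path: Path) -> bool:
    """Return True if `path` exists, is a zip archive, and passes a CRC check."""
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None
        return archive.testzip() is None


# Prints False when the archive is missing or damaged; in that case
# the download cell should be run again before unzipping.
print(looks_like_valid_zip(Path("bdd100k_images.zip")))
```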


In [0]:
# move things around
!mv ./bdd100k_images/bdd100k/images/100k ./images 

In [0]:
# cleanup
!rm uri.txt
!rm bdd100k_images.zip
!rm -rf ./bdd100k_images
!rm -rf sample_data

In [0]:
# download the preprocessing script
!curl -s -O https://raw.githubusercontent.com/nyikovicsmate/thesis/dev/utils/preprocess.py
# download requirements.txt
!curl -s -O https://raw.githubusercontent.com/nyikovicsmate/thesis/dev/utils/requirements.txt

In [7]:
!pip3 install -r requirements.txt

Collecting numpy>=1.18
  Downloading https://files.pythonhosted.org/packages/62/20/4d43e141b5bc426ba38274933ef8e76e85c7adea2c321ecf9ebf7421cedf/numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl (20.1MB)
     |████████████████████████████████| 20.2MB 232kB/s 
Collecting tqdm>=4.43
  Downloading https://files.pythonhosted.org/packages/47/55/fd9170ba08a1a64a18a7f8a18f088037316f2a41be04d2fe6ece5a653e8f/tqdm-4.43.0-py2.py3-none-any.whl (59kB)
     |████████████████████████████████| 61kB 8.5MB/s 
Collecting h5py>=2.10
  Downloading https://files.pythonhosted.org/packages/60/06/cafdd44889200e5438b897388f3075b52a8ef01f28a17366d91de0fa2d05/h5py-2.10.0-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 42.2MB/s 
Collecting rawpy>=0.14.0
  Downloading https://files.pythonhosted.org/packages/53/50/13dd9863a3e30b10f15e5abe7c1545db24a78cfe820c342978ae5d87e8c3/rawpy-0.14.0-cp36-cp36m-manylinux2010_x86_64.whl (1.6MB)

In [8]:
!python3 preprocess.py -h

usage: preprocess.py [-h] [-a AUGMENT_VALUE] [-f {png,hdf,lmdb}] [-g]
                     [-m {clip,clip_rnd,scale,scale_rnd}] [-n NAME]
                     [-s SIZE SIZE]
                     [root]

positional arguments:
  root                  The root directory from where the search for images
                        starts. (default: '.)'

optional arguments:
  -h, --help            show this help message and exit
  -a AUGMENT_VALUE, --augment AUGMENT_VALUE
                        Besides preprocessed images, store augmented ones as
                        well. Augmented image is a processed image with every
                        2nd pixel (in a checkerboard pattern) set to
                        augment_value [0-255].
  -f {png,hdf,lmdb}, --format {png,hdf,lmdb}
                        Output format to use. Supported: png, hdf, lmdb.
                        (default: png)
  -g, --grayscale       Grayscale images.
  -m {clip,clip_rnd,scale,scale_rnd}, --method {clip,clip_rnd
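
The help text for `-a` describes the augmentation as setting every second pixel, in a checkerboard pattern, to `augment_value`. As a rough illustration only (this is not the code from `preprocess.py`; the function name and the choice of which checkerboard parity gets overwritten are assumptions), the idea can be sketched in plain Python on a nested list:

```python
from typing import List


def checkerboard_augment(image: List[List[int]], augment_value: int) -> List[List[int]]:
    """Return a copy of `image` with every second pixel set to `augment_value`.

    Assumption: pixels where (row + column) is odd are overwritten; the real
    script may use the opposite parity.
    """
    if not 0 <= augment_value <= 255:
        raise ValueError("augment_value must be in [0, 255]")
    return [
        [augment_value if (r + c) % 2 == 1 else pixel for c, pixel in enumerate(row)]
        for r, row in enumerate(image)
    ]


tiny = [[10, 20, 30],
        [40, 50, 60],
        [70, 80, 90]]
augmented = checkerboard_augment(tiny, 0)
for row in augmented:
    print(row)
```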

In [15]:
# preprocess the dataset
!python3 preprocess.py -n "bdd100k_hdf" -f "hdf" -m "clip" -g -s 225 225 
!python3 preprocess.py -n "bdd100k_lmdb" -f "lmdb" -m "clip" -g -s 225 225 
!python3 preprocess.py -n "bdd100k_png" -f "png" -m "clip" -g -s 225 225 

Looking for images under /content
Found 100000 images.
Processing images.
100% 100000/100000 [23:02<00:00, 72.34it/s]
Done.
Looking for images under /content
Found 100000 images.
Processing images.
100% 100000/100000 [32:14<00:00, 51.69it/s]
Done.


## 2. Benchmarking

At this point, all 3 formats contain the same 100000 grayscale images of 225x225 px. The benchmarks focus on the following 4 properties of each format:

1.   the **disk size** of the dataset
2.   random **single image** read speed
3.   **sequential batch** read speed 
4.   **random batch** read speed


1.   **disk size benchmark**: calculate the disk space the dataset occupies in the given format.
2.   **single image read benchmark**: for 100 random indexes drawn from the range [0, 100000), measure how long it takes on average to read one image.
3. **sequential batch benchmark**: with a batch size of 1000 and sequentially ordered indexes (e.g. 1st batch 0-999, 2nd batch 1000-1999, and so on), read the whole dataset into memory once and measure the average time per batch.
4. **random batch benchmark**: with a batch size of 1000 and the indexes in each batch in random order, read the whole dataset into memory once and measure the average time per batch.
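
To make the difference between the two batch layouts concrete, here is a toy version with 10 indexes and a batch size of 5. It is a pure-Python stand-in for the NumPy helpers used in the notebook; `make_batches` is a hypothetical name, and the fixed seed only keeps the example repeatable.

```python
import random
from typing import List


def make_batches(n_items: int, batch_size: int, shuffle: bool, seed: int = 0) -> List[List[int]]:
    """Split the indexes 0..n_items-1 into equally sized batches."""
    if n_items % batch_size != 0:
        raise ValueError("n_items must be divisible by batch_size")
    idxs = list(range(n_items))
    if shuffle:
        # seeded only so that the toy example is repeatable
        random.Random(seed).shuffle(idxs)
    return [idxs[i:i + batch_size] for i in range(0, n_items, batch_size)]


sequential = make_batches(10, 5, shuffle=False)
shuffled = make_batches(10, 5, shuffle=True)
print(sequential)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(shuffled)    # the same ten indexes, each exactly once, in shuffled order
```

In both layouts every image is read exactly once; only the order of access changes.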


In [0]:
import pathlib
import h5py
import lmdb
import cv2
from google.colab.patches import cv2_imshow
import pickle
from typing import List, Tuple
import numpy as np
import time


### helper functions


def get_size(path: pathlib.Path):
    """
    Returns the size of a file/directory in MB.
    """
    if path.is_file():
        return path.stat().st_size / (1024**2) # st_size returns size in bytes
    elif path.is_dir():
        return sum(f.stat().st_size / (1024**2) for f in path.rglob('*') if f.is_file())
    else:
        raise FileNotFoundError(f"{path} is neither a file nor a directory")

def get_random_batch() -> np.ndarray:
    """
    Returns 1000 random indexes from the range [0, 100000); duplicates are possible.
    """
    return np.random.randint(0, 100000, 1000)

def get_sequential_batches() -> np.ndarray:
    """
    Returns 100 batches of 1000 sequential indexes covering the range [0, 100000).
    """
    idxs = np.arange(0, 100000)
    batches = np.reshape(idxs, (100, 1000))
    return batches

def get_random_batches() -> np.ndarray:
    """
    Returns 100 batches of 1000 shuffled indexes covering the range [0, 100000), each index exactly once.
    """
    idxs = np.arange(0, 100000)
    np.random.shuffle(idxs)
    batches = np.reshape(idxs, (100, 1000))
    return batches


### single image read functions 


def read_single_png(idx: int):
    """
    Utility function for reading an image back into memory.
    """
    image_path = pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_png", "images", f"{idx}.png")
    return np.array(cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE))

def read_single_hdf(idx: int):
    """
    Utility function for reading an image back into memory from a .h5 file.
    """
    with h5py.File(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_hdf.h5"), "r") as file:
        image = np.array(file["images"][idx], dtype=np.uint8)
    return image

def read_single_lmdb(idx: int):
    """
    Utility function for reading an image back into memory from a lmdb database.
    """
    lmdb_dir = pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_lmdb")
    env = lmdb.open(str(lmdb_dir), readonly=True, max_dbs=2, readahead=False)
    db = env.open_db(key="images".encode("utf8"))
    with env.begin(db=db) as txn:
        data = txn.get(f"{idx}".encode("utf8"))
        image = pickle.loads(data)
    env.close()
    return image


### sequential batch read functions


def read_sequential_png(idxs: List[int]):
    return [read_single_png(idx) for idx in idxs]

def read_sequential_hdf(idxs: List[int]):
    with h5py.File(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_hdf.h5"), "r") as file:
        # h5py supports index ranges, read times are significantly faster 
        # than reading each image separately 
        images = np.array(file["images"][idxs], dtype=np.uint8)
    return images

def read_sequential_lmdb(idxs: List[int]):
    lmdb_dir = pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_lmdb")
    images = []
    # set readahead to True to fully utilize underlying os capabilities  
    env = lmdb.open(str(lmdb_dir), readonly=True, max_dbs=2, readahead=True) 
    db = env.open_db(key="images".encode("utf8"))
    with env.begin(db=db) as txn:
        for idx in idxs:
            data = txn.get(f"{idx}".encode("utf8"))
            image = pickle.loads(data)
            images.append(image)
    env.close()
    return images

### random batch read functions 

def read_random_png(idxs: List[int]):
    return read_sequential_png(idxs)

def read_random_hdf(idxs: List[int]):
    # h5py requires index lists to be in ascending order, so sort first;
    # the images are therefore returned in sorted-index order
    idxs = sorted(idxs)
    with h5py.File(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_hdf.h5"), "r") as file:
        # reading the whole index list in one call is significantly faster
        # than reading each image separately
        images = np.array(file["images"][idxs], dtype=np.uint8)
    return images

def read_random_lmdb(idxs: List[int]):
    lmdb_dir = pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_lmdb")
    images = []
    # set readahead to False to speed up random reads  
    env = lmdb.open(str(lmdb_dir), readonly=True, max_dbs=2, readahead=False) 
    db = env.open_db(key="images".encode("utf8"))
    with env.begin(db=db) as txn:
        for idx in idxs:
            data = txn.get(f"{idx}".encode("utf8"))
            image = pickle.loads(data)
            images.append(image)
    env.close()
    return images
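
`read_random_hdf` sorts the indexes before reading, so the images come back in ascending index order rather than in the order the batch requested. If the original order matters (for example to keep images aligned with labels), the sorted result can be permuted back. A minimal pure-Python sketch of that step; `read_in_request_order` and `fake_read` are hypothetical names, and `fake_read` merely stands in for a read that only accepts ascending indexes.

```python
from typing import Callable, List, Sequence


def read_in_request_order(idxs: Sequence[int],
                          read_sorted: Callable[[List[int]], list]) -> list:
    """Read with ascending indexes, then restore the requested order."""
    # positions of idxs, ordered by the index value they hold
    order = sorted(range(len(idxs)), key=lambda pos: idxs[pos])
    ascending = [idxs[pos] for pos in order]
    fetched = read_sorted(ascending)
    restored = [None] * len(idxs)
    for pos, item in zip(order, fetched):
        restored[pos] = item  # put each item back where it was requested
    return restored


def fake_read(ascending: List[int]) -> list:
    # Stand-in for the real read: "image" i is just the string f"img{i}"
    assert ascending == sorted(ascending)
    return [f"img{i}" for i in ascending]


result = read_in_request_order([7, 2, 9, 4], fake_read)
print(result)  # ['img7', 'img2', 'img9', 'img4']
```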


In [0]:
def size_benchmark():
    results = {"png": 0, "hdf": 0, "lmdb": 0}
    results["png"] = get_size(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_png", "images"))
    results["hdf"] = get_size(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_hdf.h5"))
    results["lmdb"] = get_size(pathlib.Path.joinpath(pathlib.Path.cwd(), "bdd100k_lmdb"))
    return results

def single_benchmark():
    results = {"png": 0, "hdf": 0, "lmdb": 0}
    idxs = get_random_batch()[0:100] # 100 random indexes
    for idx in idxs:
        start = time.time()
        read_single_png(idx)
        end = time.time()
        results["png"] += (end-start) * 1000 # time.time is in sec
    results["png"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_single_hdf(idx)
        end = time.time()
        results["hdf"] += (end-start) * 1000 # time.time is in sec
    results["hdf"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_single_lmdb(idx)
        end = time.time()
        results["lmdb"] += (end-start) * 1000 # time.time is in sec
    results["lmdb"] /= 100 # take the average of 100 runs

    return results

def sequential_batch_benchmark():
    results = {"png": 0, "hdf": 0, "lmdb": 0}
    idxs = get_sequential_batches()

    for idx in idxs:
        start = time.time()
        read_sequential_png(idx)
        end = time.time()
        results["png"] += (end-start) * 1000 # time.time is in sec
    results["png"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_sequential_hdf(idx)
        end = time.time()
        results["hdf"] += (end-start) * 1000 # time.time is in sec
    results["hdf"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_sequential_lmdb(idx)
        end = time.time()
        results["lmdb"] += (end-start) * 1000 # time.time is in sec
    results["lmdb"] /= 100 # take the average of 100 runs

    return results


def random__batch_benchmark():
    results = {"png": 0, "hdf": 0, "lmdb": 0}
    idxs = get_random_batches() # 100 batches of 1000 shuffled indexes

    for idx in idxs:
        start = time.time()
        read_random_png(idx)
        end = time.time()
        results["png"] += (end-start) * 1000 # time.time is in sec
    results["png"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_random_hdf(idx)
        end = time.time()
        results["hdf"] += (end-start) * 1000 # time.time is in sec
    results["hdf"] /= 100 # take the average of 100 runs

    for idx in idxs:
        start = time.time()
        read_random_lmdb(idx)
        end = time.time()
        results["lmdb"] += (end-start) * 1000 # time.time is in sec
    results["lmdb"] /= 100 # take the average of 100 runs

    return results
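
The three benchmark functions repeat the same timing loop for every format. As a possible simplification (a sketch, not the notebook's code), the loop could be factored into one helper. `average_read_ms` is a hypothetical name, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic, high-resolution clock intended for measuring short durations.

```python
import time
from typing import Any, Callable, Iterable


def average_read_ms(read_fn: Callable[[Any], Any], inputs: Iterable[Any]) -> float:
    """Call `read_fn` once per input and return the mean wall-clock time in ms."""
    total_ms = 0.0
    calls = 0
    for item in inputs:
        start = time.perf_counter()
        read_fn(item)
        total_ms += (time.perf_counter() - start) * 1000  # seconds to milliseconds
        calls += 1
    if calls == 0:
        raise ValueError("inputs must not be empty")
    return total_ms / calls


# Toy usage with a stand-in reader; in the notebook this could be e.g.
# average_read_ms(read_sequential_png, get_sequential_batches()).
seen = []
mean_ms = average_read_ms(seen.append, range(5))
print(f"{mean_ms:.6f} ms per call")
```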

## 3. Results

### 3.1 Size benchmark

In [24]:
print("Results of size benchmark [MB]")
print(size_benchmark())

Results of size benchmark [MB]
{'png': 1693.4731349945068, 'hdf': 4827.977561950684, 'lmdb': 5082.4375}
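
The HDF5 figure can be sanity-checked with a little arithmetic: 100000 grayscale images of 225x225 pixels at one byte per pixel occupy 100000 · 225 · 225 = 5,062,500,000 bytes, or about 4828 MiB, which is almost exactly the measured HDF5 size. This suggests (an inference from the numbers, not something the notebook states) that the HDF5 file stores the pixels uncompressed, while PNG compression brings the set down to roughly 35% of the raw size.

```python
# Measured sizes in MiB, copied from the size benchmark output above
measured = {"png": 1693.4731349945068, "hdf": 4827.977561950684, "lmdb": 5082.4375}

n_images, height, width = 100_000, 225, 225
bytes_per_pixel = 1  # 8-bit grayscale

raw_bytes = n_images * height * width * bytes_per_pixel
raw_mib = raw_bytes / 1024**2  # same unit that get_size() reports

print(f"raw pixel data: {raw_bytes} bytes = {raw_mib:.2f} MiB")
for name, size in measured.items():
    print(f"{name}: {size / raw_mib:.3f} x raw size")
```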


In [21]:
!du -h bdd100k_png/

1.9G	bdd100k_png/images
1.9G	bdd100k_png/
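
`du` reports 1.9G for the PNG directory while `get_size` sums to about 1693 MB. One likely reason (stated here as an assumption, not verified in the notebook) is that `du` counts the disk blocks allocated to each file, whereas `st_size` is the logical file length; with 100000 small files, rounding each file up to whole blocks adds up. The sketch below reports both numbers on POSIX systems, where `st_blocks` is given in 512-byte units. The function name is hypothetical.

```python
from pathlib import Path
from typing import Tuple


def apparent_and_allocated_mib(root: Path) -> Tuple[float, float]:
    """Return (sum of st_size, sum of allocated blocks) in MiB for all files under root."""
    apparent = 0
    allocated = 0
    for f in root.rglob("*"):
        if f.is_file():
            st = f.stat()
            apparent += st.st_size          # logical length in bytes
            allocated += st.st_blocks * 512  # POSIX: blocks are 512-byte units
    return apparent / 1024**2, allocated / 1024**2


png_dir = Path("bdd100k_png")
if png_dir.is_dir():
    apparent, allocated = apparent_and_allocated_mib(png_dir)
    print(f"apparent: {apparent:.1f} MiB, allocated: {allocated:.1f} MiB")
```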


In [22]:
!du -h bdd100k_hdf.h5

4.8G	bdd100k_hdf.h5


In [23]:
!du -h bdd100k_lmdb

5.0G	bdd100k_lmdb


### 3.2 Single read benchmark

In [25]:
print("Results of single read benchmark [ms]")
print(single_benchmark())

Results of single read benchmark [ms]
{'png': 0.7945394515991211, 'hdf': 4.191272258758545, 'lmdb': 4.199466705322266}


### 3.3 Sequential batch read benchmark

In [29]:
print("Results of sequential batch read benchmark [ms]")
print(sequential_batch_benchmark())

Results of sequential batch read benchmark [ms]
{'png': 690.3145956993103, 'hdf': 705.4850220680237, 'lmdb': 966.8847441673279}


### 3.4 Random batch read benchmark

In [30]:
print("Results of random batch read benchmark [ms]")
print(random__batch_benchmark())

Results of random batch read benchmark [ms]
{'png': 1365.8288359642029, 'hdf': 742.2910022735596, 'lmdb': 3947.6974749565125}
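
Because each figure is the average time for one batch of 1000 images, the results convert directly to throughput, and the ratio of the random-batch time to the sequential-batch time gives a rough slowdown factor for random access. The snippet below only redoes arithmetic on the numbers printed in sections 3.3 and 3.4; nothing is measured again, and the helper name is hypothetical.

```python
BATCH_SIZE = 1000  # images per batch, as in the benchmark helpers

# Average ms per batch, copied from the benchmark outputs
sequential_ms = {"png": 690.3145956993103, "hdf": 705.4850220680237, "lmdb": 966.8847441673279}
random_ms = {"png": 1365.8288359642029, "hdf": 742.2910022735596, "lmdb": 3947.6974749565125}


def images_per_second(batch_ms: float) -> float:
    """Convert a per-batch time in milliseconds to images per second."""
    return BATCH_SIZE / (batch_ms / 1000)


slowdown = {name: random_ms[name] / sequential_ms[name] for name in sequential_ms}

for name in sequential_ms:
    print(f"{name}: sequential {images_per_second(sequential_ms[name]):.0f} img/s, "
          f"random {images_per_second(random_ms[name]):.0f} img/s, "
          f"slowdown x{slowdown[name]:.2f}")
```

On these numbers HDF is barely affected by random access, PNG takes about twice as long, and LMDB about four times as long.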


## Benchmarking

In [0]:
import time
result_single = {"png": 0, "hdf5": 0, "lmdb": 0}
result_seq = {"png": 0, "hdf5": 0, "lmdb": 0}
result_rand = {"png": 0, "hdf5": 0, "lmdb": 0}

In [0]:
before = time.time()
read_single_png(0)
after = time.time()
result_single["png"] = (after - before) * 1000

before = time.time()
read_single_hdf5(0)
after = time.time()
result_single["hdf5"] = (after - before) * 1000

before = time.time()
read_single_lmdb(0)
after = time.time()
result_single["lmdb"] = (after - before) * 1000

In [0]:
before = time.time()
read_many_sequentially_png()
after = time.time()
result_seq["png"] = (after - before) * 1000

before = time.time()
read_many_sequentially_hdf5()
after = time.time()
result_seq["hdf5"] = (after - before) * 1000

before = time.time()
read_many_sequentially_lmdb()
after = time.time()
result_seq["lmdb"] = (after - before) * 1000

In [0]:
before = time.time()
read_many_randomly_png(random_idxs)
after = time.time()
result_rand["png"] = (after - before) * 1000

idxs = sorted(list(random_idxs))
before = time.time()
read_many_randomly_hdf5(idxs)
after = time.time()
result_rand["hdf5"] = (after - before) * 1000

before = time.time()
read_many_randomly_lmdb(random_idxs)
after = time.time()
result_rand["lmdb"] = (after - before) * 1000

In [0]:
print(f"Single read results [ms]: {result_single}")
print(f"Sequential read results [ms]: {result_seq}")
print(f"Random read results [ms]: {result_rand}")

Single read results [ms]: {'png': 0.41174888610839844, 'hdf5': 1.4827251434326172, 'lmdb': 0.3237724304199219}
Sequential read results [ms]: {'png': 11.526823043823242, 'hdf5': 1.3456344604492188, 'lmdb': 0.9093284606933594}
Random read results [ms]: {'png': 12.545347213745117, 'hdf5': 3.5293102264404297, 'lmdb': 1.1408329010009766}
