# Why is the vectorized dataloader slower?

This notebook documents my attempt to speed up the data generation process (selecting events, projecting them to 3 randomized views, pixelizing, and slicing out patches) through vectorization. Specifically, the patchification step seems to take longer when it is fed the output of the vectorized projection step, which confuses me.

In [1]:
import h5py as h5

import sys
import math
import random  # used below for random.choices
import numpy as np
import scipy
from scipy.spatial.transform import Rotation

import io
import imageio
import matplotlib.pyplot as plt
import matplotlib

from dataloader.projection import *
from dataloader.dataset import *

In [2]:
# Load our target dataset
path = "/sdf/home/y/youngsam/data/dune/larnet/h5/DataAccessExamples/train/generic_v2_51800_v1.h5"
dataset = dataset_from_file(path)

print(f"Loaded {len(dataset)} events")

Loaded 51800 events


In [3]:
# Set batch size (N) and number of views per event (S)
N = 100
S = 3

# Original

We iteratively apply projection (3D point cloud to aggregated 2D sparse pixels) and patchification (2D sparse pixels to 2D patches) to every view of every event. This takes about 3.2 seconds for $N=100$, most of which (about 2.6 s) is spent in the projection step.
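To make the projection stage concrete, here is a minimal single-view sketch of what "3D point cloud to aggregated 2D sparse pixels" could look like. `project_2d_sketch` is a hypothetical stand-in, not the `project_2d` from `dataloader.projection`. In particular, dropping out-of-range points and summing values that land in the same pixel are assumptions.

```python
import numpy as np

def project_2d_sketch(points, values, rotation,
                      x_min, x_max, x_pix, y_min, y_max, y_pix):
    """Hypothetical stand-in for project_2d: rotate, pixelize, aggregate."""
    rotated = points @ rotation.T  # (P, 3) rotated coordinates
    px = np.floor(x_pix * (rotated[:, 0] - x_min) / (x_max - x_min)).astype(np.int64)
    py = np.floor(y_pix * (rotated[:, 1] - y_min) / (y_max - y_min)).astype(np.int64)

    # Assumption: points that fall outside the image are dropped
    keep = (px >= 0) & (px < x_pix) & (py >= 0) & (py < y_pix)
    px, py, values = px[keep], py[keep], values[keep]

    # Assumption: values that land in the same pixel are summed
    flat = px * y_pix + py
    unique_flat, inverse = np.unique(flat, return_inverse=True)
    summed = np.bincount(inverse, weights=values, minlength=len(unique_flat))
    coords = np.stack([unique_flat // y_pix, unique_flat % y_pix], axis=1)
    return coords, summed

rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(1000, 3))
vals = np.ones(1000)
coords, summed = project_2d_sketch(pts, vals, np.eye(3), -1.0, 1.0, 8, -1.0, 1.0, 8)
print(coords.shape, summed.sum())
```

Under these assumptions the output has at most one row per occupied pixel, so it can be much smaller than the input point cloud.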

In [4]:
# Choose random events and rotations
%time chosen_events = random.choices(dataset.images, k=N)
%time chosen_rotations = [[Rotation.random() for _ in range(S)] for _ in range(N)]

CPU times: user 66 μs, sys: 14 μs, total: 80 μs
Wall time: 84.6 μs
CPU times: user 5.78 ms, sys: 84 μs, total: 5.87 ms
Wall time: 5.76 ms


In [5]:
%%time
# Apply the project_2d method to all images
projections = [[project_2d(
    event, rotation.as_matrix(),
    x_min=dataset.x_min,
    x_max=dataset.x_max,
    x_pix=dataset.x_pix,
    y_min=dataset.y_min,
    y_max=dataset.y_max,
    y_pix=dataset.y_pix
) for rotation in rotations] for event, rotations in zip(chosen_events, chosen_rotations)]

CPU times: user 2.55 s, sys: 6.43 ms, total: 2.56 s
Wall time: 2.57 s


In [6]:
%%time
# Apply the patchify method to all images
patches = [[patchify(
    coords, values,
    x_pix=dataset.x_pix, y_pix=dataset.y_pix,
    x_patch=dataset.x_patch, y_patch=dataset.y_patch
) for coords, values in p] for p in projections]

CPU times: user 463 ms, sys: 7.05 ms, total: 470 ms
Wall time: 472 ms


In [7]:
# The total time to choose events is roughly the sum of the above
%time dataset.choose_events(N, S)
pass

CPU times: user 3.2 s, sys: 23.5 ms, total: 3.22 s
Wall time: 3.23 s


# Vectorized version

Vectorizing the projection step gives a roughly 9x speedup (2.57 s -> 0.29 s). However, for some reason the unmodified patchification step then takes about 3.4 s instead of 0.5 s, so the whole pipeline ends up running at the same speed or slightly slower.
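A quick back-of-the-envelope check on the patchify timings recorded in this notebook. It assumes the cells ran top to bottom, so the vectorized patchify cells ran with `S = 4` (set in cell 9) while the original ran with `S = 3`. The call counts below depend on that assumption.

```python
# Wall times recorded in this notebook (seconds)
orig_patchify, orig_calls = 0.472, 100 * 3  # cell 6: N=100 events, S=3 views
vec_patchify, vec_calls = 3.37, 100 * 4     # cell 12: N=100 events, S=4 views (assumed)

orig_per_call = orig_patchify / orig_calls
vec_per_call = vec_patchify / vec_calls

print(f"original:   {1e3 * orig_per_call:.2f} ms per patchify call")
print(f"vectorized: {1e3 * vec_per_call:.2f} ms per patchify call")
print(f"per-call slowdown: {vec_per_call / orig_per_call:.1f}x")
```

The extra view adds only a third more calls, so it cannot account for the gap on its own: each individual `patchify` call is still roughly 5x slower.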

In [8]:
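%time
# Caveat: a bare `%time` line times an empty statement, so the timing below does not
# measure this cell; `%%time` on the first line would time the whole cell.
# Stack all the images together, compute a list of point counts per image
selected_images = dataset.images[:100]
all_images = np.concatenate(selected_images)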
point_counts = np.array([x.shape[0] for x in selected_images])
all_images.shape, point_counts.shape

CPU times: user 4 μs, sys: 1 μs, total: 5 μs
Wall time: 8.82 μs


((421938, 8), (100,))

In [9]:
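%time
# As in the previous cell, a bare `%time` does not time the code that follows
# Generate some rotations, the fast way
# Note: S is 4 here, not the 3 used in the original run above
N, S = point_counts.shape[0], 4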
rotations_flat = Rotation.random(num=N*S)
rotation_matrices = np.reshape(rotations_flat.as_matrix(), (N, S, 3, 3))

CPU times: user 4 μs, sys: 1 μs, total: 5 μs
Wall time: 9.54 μs


In [10]:
# Set parameters to the same as the dataset
x_min, y_min, x_max, y_max, x_pix, y_pix = dataset.x_min, dataset.y_min, dataset.x_max, dataset.y_max, dataset.x_pix, dataset.y_pix
x_patch, y_patch = dataset.x_patch, dataset.y_patch

In [11]:
%%time
# Do the projection step with vectorization

# Assign matrices to each point
ids = np.repeat(np.arange(len(point_counts)), point_counts)
corresponding_matrices = rotation_matrices[ids]

# Add S axis and convert to 3x1 vectors to matrix multiply; remove extraneous axis after rotation
rotated_coordinates = (corresponding_matrices @ all_images[:, np.newaxis, :3, np.newaxis])[..., 0]

# Pixelate x and y coordinates, discard z
projected_x = np.trunc(x_pix * (rotated_coordinates[:,:,0] - x_min) / (x_max - x_min)).astype(int)
projected_y = np.trunc(y_pix * (rotated_coordinates[:,:,1] - y_min) / (y_max - y_min)).astype(int)
projected_points = np.stack([projected_x, projected_y], axis=2)

# Compute the start and end of each image, separate into one pair of points/values for each event, with view number as axis 1
point_ends = np.cumsum(point_counts)
point_starts = point_ends - point_counts
image_blocks = [(projected_points[a:b], all_images[a:b,3]) for a, b in zip(point_starts, point_ends)]

CPU times: user 236 ms, sys: 52.6 ms, total: 288 ms
Wall time: 288 ms
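The broadcasting in the rotation step is easy to get subtly wrong, so here is a small self-contained check on synthetic data that the batched matrix product matches an explicit loop over events and views. The `random_rotations` helper is a NumPy-only stand-in for `Rotation.random` (it produces orthonormal matrices, which may include reflections), and the point counts are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotations(n):
    """Random orthonormal 3x3 matrices via QR (stand-in for Rotation.random)."""
    q, _ = np.linalg.qr(rng.normal(size=(n, 3, 3)))
    return q

point_counts = np.array([5, 2, 7])  # points per event (made up)
N, S = len(point_counts), 3
all_points = rng.normal(size=(point_counts.sum(), 3))
rotation_matrices = random_rotations(N * S).reshape(N, S, 3, 3)

# Vectorized: one (S, 3, 3) stack of matrices per point, broadcast against (P, 1, 3, 1)
ids = np.repeat(np.arange(N), point_counts)
rotated = (rotation_matrices[ids] @ all_points[:, np.newaxis, :, np.newaxis])[..., 0]

# Reference: explicit loops over events and views
ends = np.cumsum(point_counts)
starts = ends - point_counts
for n, (a, b) in enumerate(zip(starts, ends)):
    for s in range(S):
        expected = all_points[a:b] @ rotation_matrices[n, s].T
        assert np.allclose(rotated[a:b, s], expected)

print(rotated.shape)  # (14, 3, 3): points, views, xyz
```

Note that the result keeps the view index on axis 1, so slicing out a single view with `rotated[:, s]` yields a strided view of the larger array rather than a freshly allocated one.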


In [12]:
%%time
# Apply the patchify method to all images, same as in the original method, except somehow it's slower
patches2 = [[patchify(
    coords[:,i,:], values,
    x_pix=x_pix, y_pix=y_pix,
    x_patch=x_patch, y_patch=y_patch
) for i in range(S)] for (coords, values) in image_blocks]

CPU times: user 3.34 s, sys: 15.6 ms, total: 3.35 s
Wall time: 3.37 s


Just to check that minor differences in the calling code aren't causing this performance loss, I can rearrange `image_blocks` into the same format as `projections` in the original method and observe that the exact same code still runs slower.

In [13]:
%%time
# Rearrange image_blocks into the same format as before
projections = list()
for n in range(N):
    subres = list()
    point_sets, values = image_blocks[n]
    for s in range(S):
        points = point_sets[:, s, :]
        subres.append((points, values))
    projections.append(subres)

CPU times: user 2.62 ms, sys: 0 ns, total: 2.62 ms
Wall time: 2.64 ms


In [14]:
%%time
# Apply the patchify method to all images, the exact same code as above
patches2 = [[patchify(
    coords, values,
    x_pix=dataset.x_pix, y_pix=dataset.y_pix,
    x_patch=dataset.x_patch, y_patch=dataset.y_patch
) for coords, values in p] for p in projections]

CPU times: user 3.33 s, sys: 24.4 ms, total: 3.36 s
Wall time: 3.37 s
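One way to narrow this down is to compare the arrays that `patchify` receives in the two cases. The sketch below uses a hypothetical helper, `describe_coords`, and synthetic data. It only illustrates two differences that are visible in the code above: `point_sets[:, s, :]` is a strided view rather than a contiguous array, and the vectorized path passes every point without merging duplicates. Whether either one explains the slowdown depends on the internals of `patchify`, which are not shown here.

```python
import numpy as np

def describe_coords(coords):
    """Summarize properties of a (P, 2) pixel-coordinate array that can affect speed."""
    return {
        "rows": coords.shape[0],
        "unique_rows": np.unique(coords, axis=0).shape[0],
        "dtype": str(coords.dtype),
        "c_contiguous": bool(coords.flags["C_CONTIGUOUS"]),
    }

rng = np.random.default_rng(0)
S = 4
# Synthetic stand-in for one entry of image_blocks: (points, views, xy) on a 16x16 grid
point_sets = rng.integers(0, 16, size=(5000, S, 2))

# What the vectorized path passes: a strided slice with duplicate pixels
view = point_sets[:, 0, :]
# A contiguous copy with duplicate pixels merged, for comparison
packed = np.ascontiguousarray(np.unique(view, axis=0))

print(describe_coords(view))
print(describe_coords(packed))
```

Running `describe_coords` on the `coords` arrays from both versions of `projections` would show whether the row counts, dtypes, or memory layouts differ. If they do, wrapping the slices in `np.ascontiguousarray` or aggregating duplicate pixels before calling `patchify` are two things to try.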
