# Lecture 17 : GPU Sobel Edge Detection

## Clone the materials repo to access datafiles.

In [1]:
!git clone https://code.vt.edu/jasonwil/cmda3634_materials.git

Cloning into 'cmda3634_materials'...
remote: Enumerating objects: 279, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (235/235), done.
remote: Total 279 (delta 88), reused 9 (delta 2), pack-reused 37 (from 1)
Receiving objects: 100% (279/279), 43.40 MiB | 17.40 MiB/s, done.
Resolving deltas: 100% (93/93), done.


In [2]:
# copy the lecture 17 files to our working directory
!cp cmda3634_materials/L17/* .

# Part 1 : Converting a Color Video to Grayscale Video in Python

## In Python, cv2 is the module used to access the OpenCV (Open Source Computer Vision) library.

## OpenCV is a library for image processing, computer vision, and machine learning tasks.

In [3]:
%%writefile grayscale.py
import cv2 # for video processing
import sys # for command line arguments
import numpy as np # for matrix processing

# make sure command line arguments are provided
if (len(sys.argv) < 3):
    print ('command usage :',sys.argv[0],'infile','outfile')
    exit(1)
infile = sys.argv[1]
outfile = sys.argv[2]

# open video
input = cv2.VideoCapture(infile)

# Check if the video opened successfully
if not input.isOpened():
    print("Error opening video input file")
    sys.exit(1)

# get information about the video
# (num_frames avoids shadowing the built-in len)
num_frames = int(input.get(cv2.CAP_PROP_FRAME_COUNT))
cols = int(input.get(cv2.CAP_PROP_FRAME_WIDTH))
rows = int(input.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = input.get(cv2.CAP_PROP_FPS)

# print number of frames
print ('number of frames =',num_frames)

# open the output video
output = cv2.VideoWriter(outfile, cv2.VideoWriter_fourcc(*'mp4v'),
                        fps, (cols,rows), False)

# read video frame by frame
# convert each frame to grayscale and write to outfile
while input.isOpened():
    ret, frame = input.read()
    if not ret:
        break
    # Convert frame to grayscale and write to output
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    output.write(frame)
input.release()
output.release()

Writing grayscale.py


## Convert a 1 frame per second video to grayscale.

In [4]:
!time python3 grayscale.py shuttle1.mp4 gray1.mp4

number of frames = 13

real	0m0.908s
user	0m1.111s
sys	0m0.096s


## Convert a 20 frame per second video to grayscale.

In [5]:
!time python3 grayscale.py shuttle20.mp4 gray20.mp4

number of frames = 269

real	0m8.178s
user	0m11.503s
sys	0m0.153s


# Part 2 : Sobel Edge Detection in Python

Sobel edge detection applies the following two $3 \times 3$ kernels to the neighborhood of each pixel in the image to detect vertical and horizontal edges.

$$\begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad
\begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

Example : Apply each kernel to the center pixel of the following image matrix.

$$\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ 2 & 2 & 2 \end{bmatrix}$$

We apply the first kernel which detects vertical edges:

$$G_x = (-1)(1) + 0(1) + 1(1) + (-2)(0) + 0(0) + 2(0) + (-1)(2) + 0(2) + 1(2) = 0$$

We apply the second kernel which detects horizontal edges:

$$G_y = (1)(1) + 2(1) + 1(1) + 0(0) + 0(0) + 0(0) + (-1)(2) + (-2)(2) + (-1)(2) = -4$$

To combine the results of the detectors we use the formula
$$|G| = |G_x| + |G_y| = 0 + 4 = 4$$
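
The worked example above can be checked with a few lines of plain Python. This is only a minimal sketch: the names `KX`, `KY`, `IMG`, and `apply_kernel` are chosen here for illustration and do not appear in the lecture files.

```python
# Sobel kernels and example image copied from the text above
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # detects vertical edges
KY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]   # detects horizontal edges
IMG = [[1, 1, 1], [0, 0, 0], [2, 2, 2]]

def apply_kernel(kernel, patch):
    # sum of element-wise products of a 3x3 kernel and a 3x3 image patch
    return sum(kernel[r][c] * patch[r][c] for r in range(3) for c in range(3))

Gx = apply_kernel(KX, IMG)
Gy = apply_kernel(KY, IMG)
G = abs(Gx) + abs(Gy)
print('Gx =', Gx, 'Gy =', Gy, '|G| =', G)   # Gx = 0 Gy = -4 |G| = 4
```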

Given a $3 \times 3$ piece of an image matrix
$$\begin{bmatrix} P_1 & P_2 & P_3 \\ P_4 & P_5 & P_6 \\ P_7 & P_8 & P_9 \end{bmatrix}$$

we can compute the value of $|G|$ at the pixel $P_5$ using the formula
$$|G| = |P_3 + 2P_6 + P_9 - P_1 - 2P_4 - P_7| + |P_1+2P_2+P_3-P_7-2P_8-P_9|$$

We will pad the perimeter of the image matrix with zeros to ensure that the above formula can be computed at each pixel.

We can threshold the value of $|G|$ to determine whether to color the pixel black (when $|G| \leq$ threshold) or
white (when $|G| >$ threshold).
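
To see the zero padding and the threshold working together, here is a small sketch in plain Python (lists instead of NumPy arrays). The function name `sobel_edges` and the 3 x 6 test image `tiny` are assumptions made up for this illustration.

```python
def sobel_edges(A, threshold):
    # A is a list of rows of grayscale values; returns a matrix of 0/255 values
    rows, cols = len(A), len(A[0])
    # pad the perimeter with zeros so every pixel has a full 3x3 neighborhood
    P = [[0] * (cols + 2)] + [[0] + list(r) + [0] for r in A] + [[0] * (cols + 2)]
    E = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            P1, P2, P3 = P[i][j], P[i][j+1], P[i][j+2]
            P4, P6 = P[i+1][j], P[i+1][j+2]
            P7, P8, P9 = P[i+2][j], P[i+2][j+1], P[i+2][j+2]
            Gx = P3 + 2*P6 + P9 - P1 - 2*P4 - P7
            Gy = P1 + 2*P2 + P3 - P7 - 2*P8 - P9
            # white when |G| > threshold, otherwise the pixel stays black
            if abs(Gx) + abs(Gy) > threshold:
                E[i][j] = 255
    return E

# hypothetical test image: two dark columns followed by four bright columns
tiny = [[0, 0, 100, 100, 100, 100] for _ in range(3)]
for row in sobel_edges(tiny, 50):
    print(row)
```

In the middle row, the two pixels on either side of the dark-to-bright transition are marked white. The last pixel of that row is also marked white because it is bright and sits next to the zero padding, so the padding itself can produce edges along the image border.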

In [14]:
%%writefile sobel.py
import cv2 # for video processing
import sys # for command line arguments
import numpy as np # for matrix processing
import time # to time part of the code

# make sure command line arguments are provided
if (len(sys.argv) < 4):
    print ('command usage :',sys.argv[0],'infile','outfile','threshold')
    exit(1)
infile = sys.argv[1]
outfile = sys.argv[2]
threshold = int(sys.argv[3])

# open video
input = cv2.VideoCapture(infile)

# Check if the video opened successfully
if not input.isOpened():
    print("Error opening video input file")
    sys.exit(1)

# get information about the video
# (num_frames avoids shadowing the built-in len)
num_frames = int(input.get(cv2.CAP_PROP_FRAME_COUNT))
cols = int(input.get(cv2.CAP_PROP_FRAME_WIDTH))
rows = int(input.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = input.get(cv2.CAP_PROP_FPS)

# print number of frames
print ('number of frames =',num_frames)

# open the output video
output = cv2.VideoWriter(outfile, cv2.VideoWriter_fourcc(*'mp4v'),
                        fps, (cols,rows), False)

# read video frame by frame
# run Sobel edge detection on each frame and write the edges to outfile
while input.isOpened():
    ret, frame = input.read()
    if not ret:
        break
    # Convert frame to grayscale if it's not already
    if frame.ndim == 3:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Convert video frame to matrix
    A = frame.astype(np.int32, copy=False)
    # Add zero padding to matrix
    A_pad = np.pad(A,[1,1],'constant',constant_values = 0)
    # Create the edges matrix
    E = np.zeros((rows,cols),dtype='uint8')

    # start timer
    start = time.process_time()

    # Run Sobel Edge Detector
    for i in range(0,rows):
        for j in range(0,cols):
            P1 = A_pad[i][j]
            P2 = A_pad[i][j+1]
            P3 = A_pad[i][j+2]
            P4 = A_pad[i+1][j]
            P6 = A_pad[i+1][j+2]
            P7 = A_pad[i+2][j]
            P8 = A_pad[i+2][j+1]
            P9 = A_pad[i+2][j+2]
            Gx = P3+2*P6+P9-P1-2*P4-P7
            Gy = P1+2*P2+P3-P7-2*P8-P9
            size = np.abs(Gx)+np.abs(Gy)
            if (size > threshold):
                E[i][j] = 255

    # print how long it took to run sobel on frame
    elapsed = time.process_time()-start
    print ('Time to run Sobel on frame =',np.round(elapsed,4),'seconds')

    # Write the edges as the next video frame
    output.write(E)

input.release()
output.release()

Overwriting sobel.py


## Run the Python Sobel edge detector on a 1 frame per second video

In [15]:
!time python3 sobel.py shuttle1.mp4 edges1.mp4 50

number of frames = 13
Time to run Sobel on frame = 14.687 seconds
Time to run Sobel on frame = 13.9769 seconds
Time to run Sobel on frame = 14.1191 seconds
Time to run Sobel on frame = 13.9155 seconds
Time to run Sobel on frame = 14.0173 seconds
Time to run Sobel on frame = 14.161 seconds
Time to run Sobel on frame = 14.4727 seconds
Time to run Sobel on frame = 14.7105 seconds
Time to run Sobel on frame = 13.9811 seconds
Time to run Sobel on frame = 14.0408 seconds
Time to run Sobel on frame = 14.0485 seconds
Time to run Sobel on frame = 14.0075 seconds
Time to run Sobel on frame = 13.9079 seconds

real	3m6.670s
user	3m4.837s
sys	0m0.235s


## Use **ffmpeg** to add audio to shuttle1 edges video.

In [16]:
!ffmpeg -y -hide_banner -loglevel error -i edges1.mp4 -i shuttle_audio.mp3 -c:v copy -map 0:v:0 -map 1:a:0 -c:a aac -b:a 192k shuttle1_edges.mp4

## About how long would it take to run the Python Sobel edge detector on the 20 fps version with 269 frames, assuming that it takes roughly 14 seconds per frame?

## Answer: Over 1 hour!
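
The estimate follows from multiplying the frame count by the rough per-frame time. A quick check, using only the two numbers quoted above (the variable names are illustrative):

```python
frames = 269             # frame count reported for shuttle20.mp4
seconds_per_frame = 14   # rough per-frame time measured on shuttle1.mp4
total_seconds = frames * seconds_per_frame
print(total_seconds, 'seconds =', round(total_seconds / 60, 1), 'minutes')
```

This gives 3766 seconds, a little under 63 minutes.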

# Part 3 : Sobel Edge Detection using CUDA

## Previously we learned how to replace computationally demanding Python code with C code.

## This allows us to get the ease of using Python along with the high performance of C.

## We also saw that we could have our C functions use OpenMP.

## This allows us to further accelerate computationally demanding Python codes.

## Today we will look at an example where our C functions use CUDA.

## For Sobel edge detection, what is the minimum amount of work we could have each thread do?

## Answer: A single pixel!

## With one thread per pixel, how many threads are needed for a 1984 x 1000 video frame?

## Answer : 1984*1000 = 1984000 threads.


## Complete the following CUDA code for the GPU Sobel Edge detector.

In [17]:
%%writefile gpu_sobel_start.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

typedef unsigned char byte;

__global__ void sobelKernel(byte* A_pad, byte* E, int rows, int cols, int threshold) {

    ////////////
    // add code to finish the Sobel kernel so that each thread handles one pixel
    ////////////

    {
        int i = 0;
        int j = 0;
        int P1 = A_pad[i*(cols+2)+j];
        int P2 = A_pad[i*(cols+2)+j+1];
        int P3 = A_pad[i*(cols+2)+j+2];
        int P4 = A_pad[(i+1)*(cols+2)+j];
        int P6 = A_pad[(i+1)*(cols+2)+j+2];
        int P7 = A_pad[(i+2)*(cols+2)+j];
        int P8 = A_pad[(i+2)*(cols+2)+j+1];
        int P9 = A_pad[(i+2)*(cols+2)+j+2];
        int Gx = P3+2*P6+P9-P1-2*P4-P7;
        int Gy = P1+2*P2+P3-P7-2*P8-P9;
        int size = abs(Gx)+abs(Gy);
        if (size > threshold) {
            E[i*cols+j] = 255;
        }
    }

}

extern "C" void sobel(byte* A_pad, byte* E, int rows, int cols, int threshold) {

    // use 256 threads per thread block
    int B = 256;

    // each thread processes one pixel

    ////////////
    // add code to compute the number of thread blocks G
    ////////////

    // A_pad and E on the device

    ////////////
    // add code to allocate space on device for A_pad and E.
    ////////////

    // start the timer
    clock_t start = clock();

    // copy A_pad from the host to the device

    ////////////
    // add code to copy A_pad from the host to the device
    ////////////

    // Initialize E on the device

    ////////////
    // add code to initialize E on the device
    ////////////

    // launch the Sobel kernel

    ////////////
    // add code to launch the Sobel kernel
    ////////////

    // copy E from the device to the host

    ////////////
    // add code to copy E from the device to the host
    ////////////

    // stop the timer
    clock_t stop = clock();
    double elapsed = (double)(stop-start)/CLOCKS_PER_SEC;

#ifdef DIAG
    // print timing result
    printf ("Time to run Sobel on frame = %.4f seconds\n",elapsed);
#endif

}

Writing gpu_sobel_start.cu


In [None]:
!nvcc -DDIAG -Xcompiler -fPIC -shared -arch=sm_75 -O3 -o gpu_sobel_start.so gpu_sobel_start.cu

## **Change runtime to T4.**

## Clone the materials repo again to access datafiles (the new runtime starts with an empty working directory).

In [1]:
!git clone https://code.vt.edu/jasonwil/cmda3634_materials.git

Cloning into 'cmda3634_materials'...
remote: Enumerating objects: 279, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (235/235), done.
remote: Total 279 (delta 88), reused 9 (delta 2), pack-reused 37 (from 1)
Receiving objects: 100% (279/279), 43.40 MiB | 18.17 MiB/s, done.
Resolving deltas: 100% (93/93), done.


In [2]:
# copy the lecture 17 files to our working directory
!cp cmda3634_materials/L17/* .

In [6]:
%%writefile gpu_sobel.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>

typedef unsigned char byte;

__global__ void sobelKernel(byte* A_pad, byte* E, int rows, int cols, int threshold) {

    int thread_num = blockIdx.x*blockDim.x+threadIdx.x;
    if (thread_num < rows*cols) {
        int i = thread_num/cols;
        int j = thread_num%cols;
        int P1 = A_pad[i*(cols+2)+j];
        int P2 = A_pad[i*(cols+2)+j+1];
        int P3 = A_pad[i*(cols+2)+j+2];
        int P4 = A_pad[(i+1)*(cols+2)+j];
        int P6 = A_pad[(i+1)*(cols+2)+j+2];
        int P7 = A_pad[(i+2)*(cols+2)+j];
        int P8 = A_pad[(i+2)*(cols+2)+j+1];
        int P9 = A_pad[(i+2)*(cols+2)+j+2];
        int Gx = P3+2*P6+P9-P1-2*P4-P7;
        int Gy = P1+2*P2+P3-P7-2*P8-P9;
        int size = abs(Gx)+abs(Gy);
        if (size > threshold) {
            E[i*cols+j] = 255;
        }
    }
}

extern "C" void sobel(byte* A_pad, byte* E, int rows, int cols, int threshold) {

    // use 256 threads per thread block
    int B = 256;

    // each thread processes one pixel
    int G = (rows*cols+B-1)/B;

    // A_pad and E on the device
    byte* d_A_pad = NULL;
    byte* d_E = NULL;
    cudaMalloc (&d_A_pad,(rows+2)*(cols+2)*sizeof(byte));
    if (d_A_pad == NULL) { printf ("cudaMalloc error!\n"); exit(1); }
    cudaMalloc (&d_E,rows*cols*sizeof(byte));
    if (d_E == NULL) { printf ("cudaMalloc error!\n"); exit(1); }

    // copy A_pad from the host to the device
    cudaMemcpy (d_A_pad, A_pad, (rows+2)*(cols+2)*sizeof(byte), cudaMemcpyHostToDevice);

    // Initialize E on the device
    cudaMemset (d_E,0,rows*cols*sizeof(byte));

    // time kernel
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    // launch the Sobel kernel
    sobelKernel<<<G, B>>>(d_A_pad, d_E, rows, cols, threshold);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);  // returns milliseconds
    double seconds = (double)ms / 1000.0;

#ifdef DIAG
    printf("Sobel kernel compute time = %.6f seconds\n", seconds);
#endif

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // copy E from the device to host
    cudaMemcpy (E, d_E, rows*cols*sizeof(byte), cudaMemcpyDeviceToHost);

    // free the device memory (sobel is called once per frame)
    cudaFree (d_A_pad);
    cudaFree (d_E);

}

Overwriting gpu_sobel.cu
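
The index arithmetic in `sobelKernel` and the grid size computed in `sobel` can be traced on the CPU with a short Python sketch. It assumes that the 1984 x 1000 frame mentioned earlier is 1984 pixels wide and 1000 pixels high, and `pixel_of` is a hypothetical helper written only for this illustration.

```python
rows, cols = 1000, 1984   # assumed: 1984 is the frame width, 1000 the height
B = 256                   # threads per thread block, as in sobel()

# ceiling division, same as the C expression (rows*cols+B-1)/B
G = (rows * cols + B - 1) // B
print('thread blocks =', G, 'threads launched =', G * B)

def pixel_of(block_idx, thread_idx):
    # mirrors thread_num = blockIdx.x*blockDim.x+threadIdx.x in sobelKernel
    thread_num = block_idx * B + thread_idx
    if thread_num >= rows * cols:
        return None   # surplus thread: the bounds check makes it do no work
    return thread_num // cols, thread_num % cols

print(pixel_of(0, 0))     # thread 0 handles pixel (0, 0)
print(pixel_of(7, 192))   # thread 7*256+192 = 1984 starts row 1: (1, 0)
```

For this frame size the pixel count is an exact multiple of 256, so 7750 blocks cover the frame with no surplus threads. The bounds check in the kernel still matters for frame sizes that are not a multiple of the block size.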


In [7]:
%%writefile gpu_sobel.py
import cv2 # for video processing
import sys # for command line arguments
import numpy as np # for matrix processing
import ctypes as ct # for calling C from Python
lib = ct.cdll.LoadLibrary("./gpu_sobel.so") # load GPU sobel edge detector

# make sure command line arguments are provided
if (len(sys.argv) < 4):
    print ('command usage :',sys.argv[0],'infile','outfile','threshold')
    exit(1)
infile = sys.argv[1]
outfile = sys.argv[2]
threshold = int(sys.argv[3])

# open video
input = cv2.VideoCapture(infile)

# Check if the video opened successfully
if not input.isOpened():
    print("Error opening video input file")
    sys.exit(1)

# get information about the video
# (num_frames avoids shadowing the built-in len)
num_frames = int(input.get(cv2.CAP_PROP_FRAME_COUNT))
cols = int(input.get(cv2.CAP_PROP_FRAME_WIDTH))
rows = int(input.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = input.get(cv2.CAP_PROP_FPS)

# print number of frames
print ('number of frames =',num_frames)

# open the output video
output = cv2.VideoWriter(outfile, cv2.VideoWriter_fourcc(*'mp4v'),
                        fps, (cols,rows), False)

# read video frame by frame
# run the GPU Sobel edge detector on each frame and write the edges to outfile
while input.isOpened():
    ret, frame = input.read()
    if not ret:
        break
    # Convert frame to grayscale if it's not already
    if frame.ndim == 3:
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Convert video frame to matrix
    A = np.array(frame)
    # Add zero padding to matrix
    A_pad = np.pad(A,[1,1],'constant',constant_values = 0)
    # Create the edges matrix
    E = np.zeros((rows,cols),dtype='uint8')

    # call the GPU sobel edge detector
    A_pad_cptr = A_pad.ctypes.data_as(ct.POINTER(ct.c_uint8))
    E_cptr = E.ctypes.data_as(ct.POINTER(ct.c_uint8))
    lib.sobel(A_pad_cptr,E_cptr,ct.c_int(rows),ct.c_int(cols),ct.c_int(threshold))

    # Write the edges as the next video frame
    output.write(E)

input.release()
output.release()

Overwriting gpu_sobel.py


## First run on 1 frame per second video (with diagnostics).

In [8]:
!nvcc -DDIAG -Xcompiler -fPIC -shared -arch=sm_75 -O3 -o gpu_sobel.so gpu_sobel.cu

In [9]:
!time python3 gpu_sobel.py shuttle1.mp4 edges1.mp4 50

number of frames = 13
Sobel kernel compute time = 0.000195 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000085 seconds
Sobel kernel compute time = 0.000084 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000084 seconds
Sobel kernel compute time = 0.000085 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000084 seconds
Sobel kernel compute time = 0.000086 seconds
Sobel kernel compute time = 0.000085 seconds

real	0m1.039s
user	0m0.944s
sys	0m0.283s
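
As a rough comparison, the per-frame times printed by the two versions can be divided. Treat the result as an order-of-magnitude figure only: the Python number is CPU time for the nested loops, while the CUDA number covers only the kernel and excludes device allocation and the host/device copies. The variable names below are illustrative.

```python
python_seconds = 14.0       # rough per-frame time of the nested Python loops
kernel_seconds = 0.000085   # typical per-frame kernel time printed above
speedup = python_seconds / kernel_seconds
print('kernel speedup is roughly', round(speedup, -3))
```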


## Run on 20 frames per second video (without diagnostics).

In [10]:
!nvcc -Xcompiler -fPIC -shared -arch=sm_75 -O3 -o gpu_sobel.so gpu_sobel.cu

In [11]:
!time python3 gpu_sobel.py shuttle20.mp4 edges20.mp4 50

number of frames = 269

real	0m7.024s
user	0m9.765s
sys	0m0.681s


## Use **ffmpeg** to add audio to our shuttle20 edges video.

In [12]:
!ffmpeg -y -hide_banner -loglevel error -i edges20.mp4 -i shuttle_audio.mp3 -c:v copy -map 0:v:0 -map 1:a:0 -c:a aac -b:a 192k shuttle20_edges.mp4

## After downloading the edges video, **disconnect the T4 runtime and change the runtime type back to CPU.**