---
# OpenCL Programming in C
---

This notebook contains an introduction to OpenCL programming in C. For detailed coverage, The Khronos Group's documentation is a good source:

- [Khronos Group OpenCL](https://www.khronos.org/opencl)
- [OpenCL Specification](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html)
- [OpenCL SDK](https://github.com/KhronosGroup/OpenCL-SDK)
- [OpenCL Guide](https://github.com/KhronosGroup/OpenCL-Guide?tab=readme-ov-file)

**Note! If you don't have an OpenCL-enabled device on your system**

- You can run this notebook in Google CoLab.
  - Skip ahead to [1.1 Running the Notebook Locally or on Google CoLab](#11-running-the-notebook-locally-or-on-google-coLab)

**Note! If you are on Windows**

- Make sure you have installed Visual Studio or the Build Tools for Visual Studio.
- Make sure you have started VSCode (`code .`) from within a `Visual Studio Developer Command Prompt` to set necessary environment variables.
  - This is required when using the MicroSoft Visual C/C++ (MSVC) compiler `cl.exe` in VSCode on Windows.
- If you are using PowerShell as your default shell in VSCode, your default PowerShell profile file `profile.ps1` might not be digitally signed.
  - This will lead to errors when you compile C code in VSCode.
  - If so, you can fix this error by executing either of the two PowerShell commands below:
    - `Rename-Item "$env:USERPROFILE\Documents\WindowsPowerShell\profile.ps1" -NewName "profile.ps1.bak"`
    - `Set-ExecutionPolicy RemoteSigned -Scope CurrentUser`

This notebook covers:

- [1. Prerequisites](#1-prerequisites) 
  - [1.1 Running the Notebook Locally or on Google CoLab](#11-running-the-notebook-locally-or-on-google-coLab)
  - [1.2 Operating System and VSCode Shell](#12-operating-system-and-vscode-shell)
  - [1.3 C Compiler (`gcc`, `clang`, `cl`)](#13-c-compiler-gcc-clang-cl)
  - [1.4 OpenCL Library and Header Files](#14-opencl-library-and-header-files)
  - [1.5 Configuring `tasks.json`, `launch.json` and `c_cpp_properties.json`](#15-configuring-tasksjson-launchjson-and-c_cpp_propertiesjson)
  - [1.6 Create the File `tasks.json`](#16-create-the-file-tasksjson)
  - [1.7 Create the File `launch.json`](#17-create-the-file-launchjson)
  - [1.8 Create the File `c_cpp_properties.json`](#18-create-the-file-c_cpp_propertiesjson)
  - [1.9 VSCode Extensions](#19-vscode-extensions)
  - [1.10 Using Built-in Cell Magic `%%writefile`](#110-using-built-in-cell-magic-writefile)
  - [1.11 Compiling and Executing an OpenCL Program from a Notebook Code Cell](#111-compiling-and-executing-an-opencl-program-from-a-notebook-code-cell)
  - [1.12 Compiling and Debugging a Single-file OpenCL Program](#112-compiling-and-debugging-a-single-file-opencl-program)
  - [1.13 Compiling and Debugging a Multi-file OpenCL Program](#113-compiling-and-debugging-a-multi-file-opencl-program)
- [2. OpenCL Basics](#2-opencl-basics)
  - [2.1 Listing OpenCL-enabled Devices and Properties](#21-listing-opencl-enabled-devices-and-properties)
  - [2.2 Hello World in Host Code (CPU)](#22-hello-world-in-host-code-cpu)
  - [2.3 Hello World in Device Code (GPU)](#23-hello-world-in-device-code-gpu)
  - [2.4 NDRange (Global Size), Work Groups, Work Items, Devices, CUs, and PEs](#24-ndrange-global-size-work-groups-work-items-devices-cus-and-pes)
  - [2.5 Error Checking](#25-error-checking)
  - [2.6 Measuring Execution Time on the Host (CPU) and on the Device (GPU)](#26-measuring-execution-time-on-the-host-cpu-and-on-the-device-gpu)
  - [2.7 Shared Memory and Thread Synchronization on the Device (GPU)](#27-shared-memory-and-thread-synchronization-on-the-device-gpu)
  - [2.8 Constant Memory on the Device (GPU)](#28-constant-memory-on-the-device-gpu)
- [3. Sample Problems](#3-sample-problems)
  - [3.1 1D Vector Addition on the Host (CPU)](#31-1d-vector-addition-on-the-host-cpu)
  - [3.2 1D Vector Addition on the Device (GPU)](#32-1d-vector-addition-on-the-device-gpu)
  - [3.3 1D Convolution on the Host (CPU)](#33-1d-convolution-on-the-host-cpu)
  - [3.4 1D Convolution on the Device (GPU)](#34-1d-convolution-on-the-device-gpu)
  - [3.5 2D Convolution on the Host (CPU)](#35-2d-convolution-on-the-host-cpu)
  - [3.6 2D Convolution on the Device (GPU)](#36-2d-convolution-on-the-device-gpu)
- [4. Cleanup](#4-cleanup)

---
# 1. Prerequisites
---

## 1.1 Running the Notebook Locally or on Google CoLab

- Run the cell below to check if you have an OpenCL-enabled device and an OpenCL SDK on your system.

In [562]:
!clinfo

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 3.0 CUDA 12.8.90
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd cl_khr_semaphore
  Platform Extensions

### Inspect the output from the cell above:
- If you see a `Number of platforms` listed above with a value of at least `1`.
  - Skip to [1.2 Operating System and VSCode Shell](#12-operating-system-and-vscode-shell)
- If you don't see a `Number of platforms` listed above with a value of at least `1`, follow the instructions below.

  1. Click the icon below to open the notebook in Google CoLab.
     
     [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/paga-hb/C1PD2C_2025/blob/main/notebooks/opencl.ipynb)

  2. When the notebook opens in CoLab, choose `File -> Save a copy in Drive` from the main menu.
  3. Choose `Runtime -> Change runtime type` from the main menu, select `TP4 GPU` as the hardware accelerator, and click the `Save` button.
  4. In a notebook cell run the following code:

      ```c
      !sudo apt install -y gdb
      ```

  5. When the cell stops executing:
     - Continue executing each cell below, without changing any values when prompted to, until you reach [1.4 OpenCL Library and Header Files](#14-opencl-library-and-header-files)
     - In [1.4 OpenCL Library and Header Files](#14-opencl-library-and-header-files), when prompted to choose OpenCL paths, enter the following two values:
       - `opencl_include_path = "/usr/local/cuda-12.5/targets/x86_64-linux/include/CL/cl.h"`
       - `opencl_lib_path = "/usr/local/cuda-12.5/targets/x86_64-linux/lib/libOpenCL.so"`
     - Then continue executing the cells from [1.4 OpenCL Library and Header Files](#14-opencl-library-and-header-files) and onwards.

---

## 1.2 Operating System and VSCode Shell

We are going to compose JSON configuration files for VSCode, so let's collect some information about your environment.

- Let's start by finding out what OS you are on and what default shell you are using in VSCode.

**Linux/Mac**

- Run the cell below.

**Windows**

- Find out (or change) which shell you are using in VSCode.
  - Open the Command Palette: `Ctrl + Shift + P`
  - Enter the text (and press `<Enter>`): `Preferences: Open Settings (UI)`
  - In the search field, enter the text: `terminal.integrated.defaultProfile.windows`
  - Choose the tab `User` or `Workspace` (`User` are global settings, `Workspace` only applies to the current workspace)
  - Click the link: `Edit in settings.json`
  - Set your desired shell:
    - `"terminal.integrated.defaultProfile.windows": "PowerShell"`
    - `"terminal.integrated.defaultProfile.windows": "Command Prompt"`
- Choose your VSCode shell in the cell below.
  - If you are using Powershell:
    - Comment the row `windows_shell = "cmd"`
    - Uncomment the row `windows_shell = "powershell"`
- Run the cell below.

In [563]:
windows_shell_name = "cmd"
#windows_shell_name = "powershell"

import platform, os
os_name = platform.system()
if os_name == "Darwin":
    os_name = "osx"
os_name = os_name.lower()

print(f"{'Operating System (OS)':<21} : {os_name}")
if os_name == 'windows':
    windows_shell_path = !where {windows_shell_name}
    windows_shell_path = windows_shell_path[0]
    windows_shell_name = os.path.basename(windows_shell_path)
    print(f"{'Windows Shell Name':<21} : {windows_shell_name}")
    print(f"{'Windows Shell Path':<21} : {windows_shell_path}")

Operating System (OS) : linux


---
## 1.3 C Compiler (`gcc`, `clang`, `cl`)

To avoid full paths to the C compiler and debugger in the JSON configuration files, make sure the path to the C compiler is in your `PATH` environment variable.

- In the cell below, choose the installed C compiler you want to use.
  - If you are using `cl` (the C/C++ compiler, part of Microsoft Visual Studio build tools).
    - Make sure you have launched VSCode from within a `Developer Command Prompt for VS`.
      - Search in your Start Menu for `Developer Command Prompt for VS` (the version depends on your installed Visual Studio version).
      - Open it => it launches a command prompt with all environment variables (paths, includes, libs) configured to run `cl.exe` and other build tools.
      - Open VSCode from the command prompt: `code .`
    - Comment the row `c_compiler = "gcc"`
    - Uncomment the row `c_compiler = "cl"`
  - If you are using `clang` (the C compiler, part of the LLVM project).
    - Comment the row `c_compiler = "gcc"`
    - Uncomment the row `c_compiler = "clang"`
  - If you are using `gcc` (GNU Compiler Collection), you're all set.
- Run the cell below to get the path to the C compiler.
- If nothing shows up, you need to install a C compiler (and/or make sure the C compiler is in your `PATH` environment variable).

In [564]:
c_compiler = "gcc"
#c_compiler = "clang"
#c_compiler = "cl"

import os
if os_name == 'windows':
    c_compiler_path = !where {c_compiler}
else:
    c_compiler_path = !which {c_compiler}
c_compiler_path = c_compiler_path[0]
c_compiler_name = os.path.basename(c_compiler_path)

if c_compiler == 'cl':
    c_debugger_name = "cdb.exe"
    c_debugger_path = "<integrated>"
if c_compiler == "gcc":
    c_debugger_name = "gdb"
if c_compiler == "clang":
    c_debugger_name = "lldb"

if os_name == 'windows':
    if c_compiler != 'cl':
        c_debugger_path = !where {c_debugger_name}
else:
    c_debugger_path = !which {c_debugger_name}

if c_compiler != 'cl':
    c_debugger_path = c_debugger_path[0]
    c_debugger_name = os.path.basename(c_debugger_path)

print(f"{'C Compiler Name':<15} : {c_compiler_name}")
print(f"{'C Compiler Path':<15} : {c_compiler_path}")
print(f"{'C Debugger Name':<15} : {c_debugger_name}")
print(f"{'C Debugger Path':<15} : {c_debugger_path}")

C Compiler Name : gcc
C Compiler Path : /usr/bin/gcc
C Debugger Name : gdb
C Debugger Path : /usr/bin/gdb


---
## 1.4 OpenCL Library and Header Files

Let's find out where your OpenCL library and header files are located on your system.

- Run the cell below to get the path to OpenCL's library and header files.
- If nothing shows up, you need to install the OpenCL SDK for at least one device on your computer (and/or make sure environment variables are set up correctly).

In [565]:
if os_name == "linux":
    !find /usr -name cl.h 2>/dev/null
    !find /usr -name libOpenCL.so 2>/dev/null

if os_name == "osx":
    print("/System/Library/Frameworks/OpenCL.framework/Headers/")
    print("/System/Library/Frameworks/OpenCL.framework/OpenCL/")

if os_name == "windows":
    pass
    #!where cl.h
    #!where OpenCL.lib
    #!echo %INCLUDE%
    #!echo %LIB%
    #!echo %PATH%

/usr/include/CL/cl.h
/usr/local/cuda-12.8/targets/x86_64-linux/include/CL/cl.h
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libOpenCL.so
/usr/lib/x86_64-linux-gnu/libOpenCL.so


### From the paths listed above:
- Choose one path to the header files (`.h`) listed above (enter the full path as listed above).
- Choose one path to the library file (`.so` or `.lib`) listed above (enter the full path as listed above).
- Enter the paths in the cell below.
- Run the cell below.

In [566]:
opencl_include_path = "/usr/local/cuda-12.8/targets/x86_64-linux/include/CL/cl.h"
opencl_lib_path = "/usr/local/cuda-12.8/targets/x86_64-linux/lib/libOpenCL.so"

import os
opencl_include_dir = os.path.dirname(opencl_include_path)
if os.path.basename(opencl_include_dir) == "CL":
    opencl_include_dir = os.path.dirname(opencl_include_dir)

opencl_lib_dir = os.path.dirname(opencl_lib_path)
print(f"{'OpenCL Include Path':<19} : {opencl_include_dir}")
print(f"{'OpenCL Lib Path':<19} : {opencl_lib_dir}")

OpenCL Include Path : /usr/local/cuda-12.8/targets/x86_64-linux/include
OpenCL Lib Path     : /usr/local/cuda-12.8/targets/x86_64-linux/lib


---
## 1.5 Configuring `tasks.json`, `launch.json`, and `c_cpp_properties.json`

- To develop OpenCL programs in C with VSCode, we need to configure three VSCode workspace configuration files.
  - In the file `tasks.json` we can configure various tasks, such as build tasks for compiling OpenCL programs in C with our chosen C compiler.
  - In the file `launch.json` we can configure various debug options, such as debugging C programs with our chosen C debugger.
  - In the file `c_cpp_properties.json` we can configure the compiler to use for linting purposes (intellisense).
    - It isn't strictly necessary to create this configuration file to be able to run and debug C programs in VSCode.
- VSCode workspace configuration files (`.json`) are stored in the subfolder `.vscode`.
- Run the cell below to create the folder `.vscode`.

**Note**

- This notebook doesn't dscribe the contents of these three files in detail. To learn more, visit: 
  - [task.json](https://code.visualstudio.com/docs/debugtest/tasks)
  - [launch.json](https://code.visualstudio.com/docs/debugtest/debugging)
  - [c_cpp_properties.json](https://code.visualstudio.com/docs/cpp/configure-intellisense)

In [567]:
import os
os.makedirs(".vscode", exist_ok=True)

---
## 1.6 Create the File `tasks.json`

- Run the cell below to create the file `tasks.json` in subfolder `.vscode`.

In [568]:
import os, json

src_path = "${workspaceFolder}/src/*.c"
include_path = "${workspaceFolder}/include"
bin_path = "${workspaceFolder}/bin/main.exe"
if os_name == "windows":
    src_path = "${workspaceFolder}\\src\\*.c"
    include_path = "${workspaceFolder}\\include"
    bin_path = "${workspaceFolder}\\bin\\main.exe"

makedir_command = "mkdir"
makedir_args = ["-p", "src", "include", "bin"]
if os_name == "windows":
    makedir_command = windows_shell_path
    if windows_shell_name == "powershell.exe":
        makedir_args = ["-NoProfile", "-ExecutionPolicy", "Bypass", "-Command", "New-Item -ItemType Directory -Path 'src','include','bin' -Force -ErrorAction SilentlyContinue"]
    else:
        makedir_args = ["/c", "if not exist src mkdir src & if not exist include mkdir include & if not exist bin mkdir bin"]

clean_command = "find"
clean_args = ["./bin", "-type", "f", "-name", "*.exe", "-delete"]
if os_name == "windows":
    clean_command = windows_shell_path
    if windows_shell_name == "powershell.exe":
        clean_args = ["-NoProfile", "-ExecutionPolicy", "Bypass", "-Command", "Get-ChildItem -Path .\\bin -Include *.exe, *.ilk, *.pdb, *.obj -Recurse | Remove-Item -Force"]
    else:
        clean_args = ["/c", "del /s /q /f .\\bin\\*.exe 2>nul .\\bin\\*.ilk 2>nul .\\bin\\*.pdb 2>nul .\\bin\\*.obj pdb 2>nul"]

c_build_command = c_compiler_path
c_build_multi_args = ["-std=c17", "-Wall", "-g", src_path, "-I", include_path, "-o", bin_path] 
c_build_active_args = ["-std=c17", "-Wall", "-g", "${file}", "-o", bin_path]
if os_name == "windows" and c_compiler_name == "cl.exe":
    c_build_multi_args = ["/std:c17", "/nologo", "/Zi", "/EHsc", "/Fe:bin\\main.exe", "/Fo:bin\\", "/Fd:bin\\", "src\\*.c", "/I", "include"]
    c_build_active_args = ["/std:c17", "/nologo", "/Zi", "/EHsc", "/Fe:bin\\main.exe", "/Fo:bin\\", "/Fd:bin\\", "${file}"]

c_build_multi_args_opencl = ["-std=c17", "-Wall", "-g", src_path, "-I", include_path, "-I", opencl_include_dir, "-L", opencl_lib_dir, "-l", "OpenCL", "-o", bin_path] 
c_build_active_args_opencl = ["-std=c17", "-Wall", "-g", "${file}", "-I", opencl_include_dir, "-L", opencl_lib_dir, "-l", "OpenCL", "-o", bin_path]
if os_name == "windows" and c_compiler_name == "cl.exe":
    c_build_multi_args_opencl = ["/std:c17", "/nologo", "/Zi", "/EHsc", "/Fe:bin\\main.exe", "/Fo:bin\\", "/Fd:bin\\", "src\\*.c", "/I", "include", "/I", opencl_include_dir, "/link", opencl_lib_dir, "OpenCL.lib"]
    c_build_active_args_opencl = ["/std:c17", "/nologo", "/Zi", "/EHsc", "/Fe:bin\\main.exe", "/Fo:bin\\", "/Fd:bin\\", "${file}", "/I", opencl_include_dir, "/link", opencl_lib_dir, "OpenCL.lib"]
if os_name == "osx" and c_compiler_name == "clang":
    c_build_multi_args_opencl = ["-std=c17", "-Wall", "-g", src_path, "-I", include_path, "-framework", "OpenCL", "-o", bin_path] 
    c_build_active_args_opencl = ["-std=c17", "-Wall", "-g", "${file}", "-framework", "OpenCL", "-o", bin_path]

tasks_json = {
    "version": "2.0.0",
    "tasks": [
        {
            "type": "shell",
            "label": "Make directories",
            "command": makedir_command,
            "args": makedir_args,
            "problemMatcher": []
        },
        {
            "type": "shell",
            "label": "Clean .exe files",
            "dependsOn": ["Make directories"],
            "command": clean_command,
            "args": clean_args,
            "problemMatcher": []
        },
        {
            "type": "shell",
            "label": "opencl: build multi file",
            "dependsOn": ["Clean .exe files"],
            "command": c_build_command,
            "args": c_build_multi_args_opencl,
            "options": {
                "cwd": "${workspaceFolder}"
            },
            "problemMatcher": [
                "$gcc"
            ],
            "group": {
                "kind": "build",
                "isDefault": False
            },
            "detail": f"compiler: {c_compiler_path}"
        },
        {
            "type": "shell",
            "label": "opencl: build active file",
            "dependsOn": ["Clean .exe files"],
            "command": c_build_command,
            "args": c_build_active_args_opencl,
            "options": {
                "cwd": "${fileDirname}"
            },
            "problemMatcher": [
                "$gcc"
            ],
            "group": {
                "kind": "build",
                "isDefault": True
            },
            "detail": f"compiler: {c_compiler_path}"
        }
    ]
}

os.makedirs(".vscode", exist_ok=True)
json_string = json.dumps(tasks_json, indent=4)
with open(".vscode/tasks.json", "w") as f:
    json.dump(tasks_json, f, indent=4)

---
## 1.7 Create the File `launch.json`

- Run the cell below to create the file `launch.json` in subfolder `.vscode`.

In [569]:
import os, json

if "gdb" in c_debugger_name:
    c_debugger_type = "cppdbg"
    c_debugger_mi_mode = "gdb"
    stop_at_entry_name = "stopAtEntry"
    environment = True
    console = False
    setupcommands = [{"description": "Enable pretty-printing for gdb", "text": "-enable-pretty-printing", "ignoreFailures": True}]
if "lldb" in c_debugger_name:
    c_debugger_type = "lldb"
    c_debugger_mi_mode = "lldb"
    stop_at_entry_name = "stopOnEntry"
    environment = False
    console = False
    setupcommands = None
if c_debugger_name == "cdb.exe":
    c_debugger_name = "msvc"
    c_debugger_path = None
    c_debugger_type = "cppvsdbg"
    c_debugger_mi_mode = None
    stop_at_entry_name = "stopAtEntry"
    environment = True
    console = True
    setupcommands = None

launch_json = {
    "version": "0.2.0",
    "configurations": [
        {
            "name": "opencl: launch multi file",
            "preLaunchTask": "opencl: build multi file",
            "type": c_debugger_type,
            "request": "launch",
            "program": bin_path,
            "args": [],
            f"{stop_at_entry_name}": False,
            "cwd": "${workspaceFolder}"
        },
        {
            "name": "opencl: launch active file",
            "preLaunchTask": "opencl: build active file",
            "type": c_debugger_type,
            "request": "launch",
            "program": bin_path,
            "args": [],
            f"{stop_at_entry_name}": False,
            "cwd": "${workspaceFolder}"
        }
    ]
}

if environment:
    for i in range(2):
        launch_json["configurations"][i]["environment"] = []
if c_debugger_name != "lldb":
    if console:
        for i in range(2):
            launch_json["configurations"][i]["console"] = "integratedTerminal" # "externalTerminal"
    else:
        for i in range(2):
            launch_json["configurations"][i]["externalConsole"] = False # True
    if c_debugger_mi_mode:
        for i in range(2):
            launch_json["configurations"][i]["MIMode"] = c_debugger_mi_mode
    if c_debugger_path:
        for i in range(2):
            launch_json["configurations"][i]["miDebuggerPath"] = c_debugger_path
    if setupcommands:
        for i in range(2):
            launch_json["configurations"][i]["setupCommands"] = setupcommands


os.makedirs(".vscode", exist_ok=True)
with open(".vscode/launch.json", "w") as f:
    json.dump(launch_json, f, indent=4)

---
## 1.8 Create the File `c_cpp_properties.json`

- Run the cell below to create the file `c_cpp_properties.json` in subfolder `.vscode`.

In [570]:
import os, json

if os_name == "linux" and c_compiler_name == "gcc":
    intelliSenseMode = "linux-gcc-x64"
    # intelliSenseMode = "linux-gcc-arm64"
if os_name == "linux" and c_compiler_name == "clang":
    intelliSenseMode = "linux-clang-x64"
    # intelliSenseMode = "linux-clang-arm64"
if os_name == "osx" and c_compiler_name == "gcc":
    intelliSenseMode = "macos-gcc-x64"
    # intelliSenseMode = "macos-gcc-arm64"
if os_name == "osx" and c_compiler_name == "clang":
    intelliSenseMode = "macos-clang-x64"
    # intelliSenseMode = "macos-clang-arm64"
if os_name == "windows" and c_compiler_name == "gcc.exe":
    intelliSenseMode = "windows-gcc-x64"
if os_name == "windows" and c_compiler_name == "clang.exe":
    intelliSenseMode = "windows-clang-x64"
if os_name == "windows" and c_compiler_name == "cl.exe":
    intelliSenseMode = "windows-msvc-x64"

launch_json = {
    "configurations": [
        {
            "name": "Linter",
            "includePath": [
                "${workspaceFolder}/**"
            ],
            "defines": [],
            "compilerPath": c_compiler_path,
            "cStandard": "c17",
            "cppStandard": "c++17",
            "intelliSenseMode": intelliSenseMode
        }
    ],
    "version": 4
}

os.makedirs(".vscode", exist_ok=True)
with open(".vscode/c_cpp_properties.json", "w") as f:
    json.dump(launch_json, f, indent=4)

---
## 1.9 VSCode Extensions

- To develop OpenCL programs in C with VSCode, we need a few VSCode extensions (the last two are only needed for Jupyter Notebooks).
  - C/C++ Extension Pack: https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools-extension-pack
  - CodeLLDB: https://marketplace.visualstudio.com/items?itemName=vadimcn.vscode-lldb
  - Makefile Tools: https://marketplace.visualstudio.com/items?itemName=ms-vscode.makefile-tools
  - Jupyter: https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter
  - Python: https://marketplace.visualstudio.com/items?itemName=ms-python.python

- Run the cell below to install any missing VSCode extensions.

In [571]:
!code --install-extension ms-vscode.cpptools-extension-pack --force
!code --install-extension vadimcn.vscode-lldb --force
!code --install-extension ms-vscode.makefile-tools --force
!code --install-extension ms-toolsai.jupyter --force
!code --install-extension ms-python.python --force

Installing extensions...
Extension 'ms-vscode.cpptools-extension-pack' is already installed.
Installing extensions...
Extension 'vadimcn.vscode-lldb' is already installed.
Installing extensions...
Extension 'ms-vscode.makefile-tools' is already installed.
Installing extensions...
Extension 'ms-toolsai.jupyter' is already installed.
Installing extensions...
Extension 'ms-python.python' is already installed.


---
## 1.10 Using Built-in Cell Magic `%%writefile`

- The cell magic `%%writefile filename`, writes the contents of a notebook cell to the specified `filename` (or file path).
  - This functionality is built-in to Jupyter Notebooks (it's not an extension).
  - We can use it to write any code cell contents to a file in the file system.
  - Let's write some OpenCL code to the file `main.c` with a kernel in `kernel.c`.
- Run the two cells below.
- Then inspect the resulting files `kernel.cl` and `main.c`, which contain each code cell's contents (except for the cell magic command `%%writefile filename`).

In [572]:
%%writefile main.c
#include <stdio.h>
#include <stdlib.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

const char *source =
"__kernel void mykernel()\n"
"{\n"
"    printf(\"Hello World!\\n\");\n"
"}\n";

int main()
{
    cl_int err;

    cl_platform_id platform;
    cl_uint num_platforms;
    err = clGetPlatformIDs(1, &platform, &num_platforms);

    cl_device_id device;
    cl_uint num_devices;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &num_devices);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, 0, &err);

    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    cl_kernel kernel = clCreateKernel(program, "mykernel", &err);

    size_t global_size = 1;
    size_t local_size = 1;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    clFinish(queue);

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

Writing main.c


---
## 1.11 Compiling and Executing an OpenCL Program From a Notebook Code Cell

- We can compile and execute an OpenCL program from a notebook code cell using the syntax `!<shell command>`, where:
  - `!` indicates that the succeeding text on the same row should be sent to the shell (terminal).
  - `<shell command>` is the shell (terminal) command we want to execute.
  - Standard output is redirected to the cell output.

- Run the cell below to see what the build command and execute command is in your shell:
  - The *build single file* command compiles and links the file `main.c` in your workspace folder and places the executable file `main.exe` in the `bin` folder.
  - The *build multi file* command compiles and links all `.c` files in the `src` and `.h` files in the `include` folder and places the executable file `main.exe` in the `bin` folder.
  - The *execute* command executes the file `main.exe` in the `bin` folder.

In [573]:
build_single_file_command = [f'"{c_build_command}"'] + c_build_active_args_opencl
build_single_file_command = " ".join(build_single_file_command).replace('${file}', 'main.c').replace('${workspaceFolder}', '.')

build_multi_file_command = [f'"{c_build_command}"'] + c_build_multi_args_opencl
build_multi_file_command = " ".join(build_multi_file_command).replace('${file}', 'main.c').replace('${workspaceFolder}', '.')

execute_command = bin_path.replace('${workspaceFolder}', '.')

print(f'Build single file command : {build_single_file_command}')
print(f'Build multi file command  : {build_multi_file_command}')
print(f'Execute command           : {execute_command}')

Build single file command : "/usr/bin/gcc" -std=c17 -Wall -g main.c -I /usr/local/cuda-12.8/targets/x86_64-linux/include -L /usr/local/cuda-12.8/targets/x86_64-linux/lib -l OpenCL -o ./bin/main.exe
Build multi file command  : "/usr/bin/gcc" -std=c17 -Wall -g ./src/*.c -I ./include -I /usr/local/cuda-12.8/targets/x86_64-linux/include -L /usr/local/cuda-12.8/targets/x86_64-linux/lib -l OpenCL -o ./bin/main.exe
Execute command           : ./bin/main.exe


- Run the cell below to:
  - Create the folder `bin` in your workspace folder if it doesn't already exist.
  - Build the single source code file `main.c` in your workspace folder into the executable file `main.exe` in the `bin` folder.
  - Run the executable file `main.exe` in the `bin` folder.
- Notice the file `main.exe` has been created in the file system (in the `bin` folder), and the program's output is shown as the cell's output in the notebook. 

In [574]:
import os
os.makedirs("bin", exist_ok=True)

!{build_single_file_command}
!{execute_command}

Hello World!


---
## 1.12 Compiling and Debugging a Single-file OpenCL Program

Let's see `tasks.json` and `launch.json` in action for a single-file (`.c`) OpenCL program.

- First, let's create the file `main.c` in the cell below.

In [575]:
%%writefile main.c
#include <stdio.h>
#include <stdlib.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

const char *source =
"__kernel void mykernel()\n"
"{\n"
"    printf(\"Hello World!\\n\");\n"
"}\n";

void checkOpenCL(cl_int err, const char *msg)
{
    if (err != CL_SUCCESS)
    {
        fprintf(stderr, "%s failed: %d\n", msg, err);
        exit(EXIT_FAILURE);
    }
}

int main()
{
    cl_int err;

    // 1. Platform
    cl_platform_id platform;
    cl_uint num_platforms;
    err = clGetPlatformIDs(1, &platform, &num_platforms);

    // 2. Device
    cl_device_id device;
    cl_uint num_devices;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &num_devices);

    // 3. Context
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    // 4. Command queue
    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, 0, &err);

    // 5. Program
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    checkOpenCL(err, "clCreateProgramWithSource");
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // (Optional) Check build log if needed
    if (err != CL_SUCCESS)
    {
        char log[2048];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        printf("Build log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }

    // 6. Kernel
    cl_kernel kernel = clCreateKernel(program, "mykernel", &err);
    checkOpenCL(err, "clCreateKernel");

    // 7. Launch
    size_t global_size = 1;
    size_t local_size = 1;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    // 8. Wait and finish
    clFinish(queue);

    // 9. Cleanup
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

Overwriting main.c


- Now, let's debug the file `main.c`.
  - Open the file `main.c` in VSCode's editor.
  - Set a breakpoint on the two `return` statements (`F9`).
  - Switch to the `Run and Debug` view (Linux/Windows: `Ctrl + Shift + D`, Mac: `Cmd + Shift + D`).
  - In the drop-down combobox, select the launch configuration `<COMPILER>: launch active file`, where `<COMPILER>`is the name of your C compiler.
  - Click the green `Play` icon.
  - Use the debug toolbar in the top-middle of VSCode to debug the code.
    - Notice the debugger stops at the breakpoints.
      - This is because we are using a C debugger compatible with your chosen C compiler.
    - Notice you can view variables (local, registers), watch variables, view the call stack, and toggle breakpoints in the `Run and Debug` view.
  - Stop debugging (red `Square` icon in the debug toolbar).
- Next, look at the status bar (at the bottom of VSCode) where you will see the name of the launch configuration `<COMPILER>: launch active file`.
  - Click on it, and select `<COMPILER>: launch active file` again (make sure `main.c` is the active file in the editor, not the notebook).
    - This is an alternative method to start a debug session.
  - Stop debugging.
- Press `F5` (make sure `main.c` is active in VSCode's editor), which is a third alternative to launch the debugger.
  - This launches the debug configuration with `preLaunchTask` set to the default task (in `tasks.json`).
  - Stop debugging.
- Press (Linux/Windows: `Ctrl + Shift + B`, Mac: `Ctrl + Shift + B`) to execute the default build task (in `tasks.json`).
  - Make sure `main.c` is active in VSCode's editor (since the default build task is set to the active file task).
  - Notice the compiled executable `main.exe` is placed in the subfolder `bin` (configured in the default build task).
    - This is also where the debugger finds the executable `main.exe` (configured in `launch.json`).

- Remeber, you can always compile a single `main.c` file in your workspace folder and run it using the commands below.

In [576]:
!{build_single_file_command}
!{execute_command}

Hello World!


---
## 1.13 Compiling and Debugging a Multi-file OpenCL Program

- Let's see `tasks.json` and `launch.json` in action for a multi-file (`*.c`) OpenCL program.
  - First We will create two source code files `.c` in the `src` folder, and one header file `.h` in the `include` folder.
    - We will use the same code as before, but will place the OpenCL kernel (function) code in its own `.cl` file, together with a source code file `.c` for loading it and its prototype in a header file `.h`.
  - Then we will use:
    - The other (non-default) build task in `tasks.json` to build the executable.
    - The other launch configuration (linked to the non-default build task) in `launch.json` to debug it.

- Run the four cells below to create:
  - The folder structure `src`, `include`, and `bin` (if it hasn't already been created).
  - The main source code file `main.c` in the folder `src`.
  - The kernel source code file `helloKernel.cl` in the folder `src`.
  - The source code file `utils.c` in the folder `src`.
  - The header file `utils.h` in the folder `include`.

In [577]:
import os
os.makedirs("src", exist_ok=True)
os.makedirs("include", exist_ok=True)
os.makedirs("bin", exist_ok=True)

In [578]:
%%writefile src/kernel.cl
__kernel void mykernel()
{
    printf("Hello World!\n");
}

Writing src/kernel.cl


In [579]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

#define KERNEL_FILE "src/kernel.cl"

char* read_kernel_source(const char* filename)
{
    FILE* fp = fopen(filename, "r");
    if (!fp)
    {
        fprintf(stderr, "Failed to open kernel file: %s\n", filename);
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    char* src = (char*)malloc(size + 1);
    fread(src, 1, size, fp);
    src[size] = '\0';
    fclose(fp);
    return src;
}

int main()
{
    cl_int err;

    // 1. Platform
    cl_platform_id platform;
    cl_uint num_platforms;
    err = clGetPlatformIDs(1, &platform, &num_platforms);

    // 2. Device
    cl_device_id device;
    cl_uint num_devices;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &num_devices);

    // 3. Context
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    // 4. Command queue
    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, 0, &err);

    // 5. Program
    char* source = read_kernel_source(KERNEL_FILE); // Read kernel source
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // (Optional) Check build log if needed
    if (err != CL_SUCCESS)
    {
        char log[2048];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        printf("Build log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }

    // 6. Kernel
    cl_kernel kernel = clCreateKernel(program, "mykernel", &err);

    // 7. Launch
    size_t global_size = 1;
    size_t local_size = 1;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

    // 8. Wait and finish
    clFinish(queue);

    // 9. Cleanup
    free(source);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

Writing src/main.c


- Build the multi-file C program.
  - Notice the multi-file build task in `tasks.json` isn't the default build task (`isDefault` is not set to `true` under `group`).
  - Therefore, we can't use (Linux/Windows: `Ctrl + Shift + B`, mac: `Cmd + Shift + B`).
  - Instead we can:
    - Bring up the Command Palette (Linux/Windows: `Ctrl + Shift + P`, mac: `Cmd + Shift + P`).
    - Choose `Tasks: Run Task` and select the task `<COMPILER>: build multi file`, where `<COMPILER>` is the name of your chosen C compiler.
  - The executable `main.exe` is placed in the `bin` folder (as configured in `tasks.json`).
- Debug the multi-file C program.
  - Open `main.c` and `math_utils.c` in the `src` folder and set breakpoints on the two `return` statements.
  - Notice the multi-file launch task in `launch.json` isn't linked to the default build task (in `tasks.json`).
    - Therefore, we can't use `F5`.
    - Instead we can:
      - Switch the the `Run and Debug` view, select `<COMPILER>: launch multi file` from the drop-down list, and click the green `Play` icon.
      - Or select `<COMPILER>: launch multi file` from the status bar (at the bottom of VSCode).
    - The C program is built and the debugger lauched, attaching to the executable `main.exe` in the `bin` folder.
  - Stop debugging.

- Remeber, you can always compile a multi-file C program (`src/*.c`, `include/*.h`) and run it using the commands below.

In [580]:
!{build_multi_file_command}
!{execute_command}

Hello World!


---
# 2. OpenCL Basics
---

- Now we know how to create CUDA programs in C (single-file, multi-file, and in a Jupyter Notebook cell).
- Going forward, we will explore fundamental CUDA programming concepts as single-file programs in notebook cells using the `%%cuda` cell magic command.
  - The code in each `%%cuda` cell can be placed in a `main.cu` file by:
    - Manually copying the cell contents and removing the `%%cuda` row.
    - Replacing the `%%cuda` row with `%%writefile main.cu` and running the cell.
  - Then you can manually compile it to `main.exe` with `nvcc main.cu -o main.exe`.
- NVidia's documentation is a good source to learn more about CUDA:
    - [Cuda C Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/contents.html)

---
## 2.1 Listing OpenCL-enabled Devices and Properties

- First, let's find out what OpenCL-enabled devices are available on your computer.

### Using `clinfo`

- The simplest way to list OpenCL-enabled devices is using the tool `clinfo`.
  - The top row shows you the number of platforms on your system:
    - Example platforms: Nvidia, AMD, Intel, etc.
  - For each platform, it shows you which devices are available:
    - Example devices: An Nvidia GPU, an AMD CPU, an AMD GPU, an Intel CPU, etc.
  - For each device, it shows you information about that device.
- Run the cell below to see the CUDA-enabled devices on your system.

In [581]:
!clinfo

Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 3.0 CUDA 12.8.90
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd cl_khr_semaphore
  Platform Extensions

### Using C Code

- We can also find out what Platforms and OpenCL-enabled devices are available using C code.
  - The code below lists important properties that will become familiar the more you learn about OpenCL (for optimization purposes).
- Run the cell below to list platforms, OpenCL-enabled devices, and their properties.

In [582]:
%%writefile main.c
#include <stdio.h>
#include <stdlib.h>

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

void checkOpenCL(cl_int err, const char *msg)
{
    if (err != CL_SUCCESS)
    {
        fprintf(stderr, "%s failed: %d\n", msg, err);
        exit(EXIT_FAILURE);
    }
}

int main()
{
    cl_uint numPlatforms;
    cl_int err;

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    checkOpenCL(err, "clGetPlatformIDs (count)");

    if (numPlatforms == 0)
    {
        puts("No OpenCL platforms found.");
        return 1;
    }

    cl_platform_id *platforms = (cl_platform_id *) malloc(sizeof(cl_platform_id) * numPlatforms);
    err = clGetPlatformIDs(numPlatforms, platforms, NULL);
    checkOpenCL(err, "clGetPlatformIDs");

    for (cl_uint p = 0; p < numPlatforms; ++p)
    {
        char platformName[128];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(platformName), platformName, NULL);
        printf("Platform %u: %s\n", p, platformName);

        cl_uint numDevices = 0;
        err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
        if (err != CL_SUCCESS || numDevices == 0)
        {
            puts("  No devices found.");
            continue;
        }

        cl_device_id *devices = (cl_device_id *) malloc(sizeof(cl_device_id) * numDevices);
        err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
        checkOpenCL(err, "clGetDeviceIDs");

        for (cl_uint i = 0; i < numDevices; ++i)
        {
            char name[128];
            cl_uint compute_units, max_work_group_size, clock_frequency;
            cl_ulong global_mem, local_mem, constant_mem;
            cl_device_type type;

            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_work_group_size), &max_work_group_size, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(clock_frequency), &clock_frequency, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(constant_mem), &constant_mem, NULL);

            printf("  - Device %u: %s\n", i, name);
            printf("    - %-31s : %s\n", "Type",
                (type == CL_DEVICE_TYPE_GPU) ? "GPU" :
                (type == CL_DEVICE_TYPE_CPU) ? "CPU" :
                (type == CL_DEVICE_TYPE_ACCELERATOR) ? "Accelerator" : "Other");
            printf("    - %-31s : %u\n", "Compute Units (CU)", compute_units);
            printf("    - %-31s : %u\n", "Max Work Group Size", max_work_group_size);
            printf("    - %-31s : %.2f MHz\n", "Clock Frequency", clock_frequency * 1.0);
            printf("    - %-31s : %lu bytes\n", "Global Memory Size", global_mem);
            printf("    - %-31s : %lu bytes\n", "Local Memory Size (shared)", local_mem);
            printf("    - %-31s : %lu bytes\n", "Constant Memory Size", constant_mem);
            printf("\n");
        }

        free(devices);
    }

    free(platforms);
    return 0;
}

Overwriting main.c


In [583]:
!{build_single_file_command}
!{execute_command}

Platform 0: NVIDIA CUDA
  - Device 0: NVIDIA RTX 2000 Ada Generation Laptop GPU
    - Type                            : GPU
    - Compute Units (CU)              : 24
    - Max Work Group Size             : 32767
    - Clock Frequency                 : 1455.00 MHz
    - Global Memory Size              : 8198619136 bytes
    - Local Memory Size (shared)      : 49152 bytes
    - Constant Memory Size            : 65536 bytes



---
## 2.2 Hello World in Host Code (CPU)

- The program in the cell below is a simple C program (no CUDA code) that runs on the host (CPU).
  - We include the necessary header files:
    - `stdio.h` for `printf`.

      ```c
      #include <stdio.h>
      ```
  - Then we define the `main()` function:
    - We print out the text `Hello World!`.
    - Then we return the exit code `0` to the operating system.
    
      ```c
      int main(void)
      {
        printf("Hello World!\n");
        
        return 0;
      }
      ```
- Run the cell below to see the output.

In [584]:
%%writefile main.c
#include <stdio.h>

// Host entry point (a normal C main function)
int main(void)
{
   printf("Hello World!\n");
   
   return 0;
}

Overwriting main.c


In [585]:
!{build_single_file_command}
!{execute_command}

Hello World!


---
## 2.3 Hello World in Device Code (GPU)

- The program in the cell below is a simple CUDA program that runs code on the host (CPU) and the device (GPU).
  - We include the necessary header files:
    - `stdio.h` for `printf`.
    - `cuda_runtime.h` for `cudaDeviceSynchronize`.

      ```c
      #include <stdio.h>
      #include <cuda_runtime.h> // contains CUDA function prototypes
      ```
  - Then we define a CUDA kernel function:
    - `mykernel()` is the name of the **kernel function** (it can be any name we like).
    - A **kernel function** runs on the device (GPU) and is **launched** (called) from the host (CPU).
    - `__global__` is a CUDA **qualifier** (qualifying a function with `__global__` makes the function a **kernel function**.
    - The kernel function takes no arguments (`void`), prints the text `Hello World!`, and does NOT return a value (`void`).
    - All **kernel functions** MUST have a `void` return type.

      ```c
      __global__ void mykernel(void) // the __global__ qualifier makes this a kernel function
      {
          printf("Hello World!\n");
      }
      ```
  - Lastly we define the `main()` function:
    - First we **launch** (call) the **kernel function** `mykernel()` using the syntax `kernel<<<n_blocks, n_threads>>>()`.
      - `kernel` is the name of the **kernel function** (`mykernel` in our code), which is called from the host (CPU) and runs on the device (GPU).
      - `<<<n_blocks, n_threads>>>` is a CUDA **launch configuration** (`<<<1, 1>>>` in our code), which makes this a **kernel launch** (kernel function call).
      - `n_blocks` is the number of `blocks` to use when launching the kernel function.
        - In Nvidia terminology, a CUDA-enabled GPU contains a `grid` of `blocks`, where each `block` contains a number of `threads`.
      - `n_threads` is the number of `threads` to use when launching the kernel function.
        - Each `block` contains this number of `threads`.
      - The code `mykernel<<<1,1>>>()` launches the kernel (calls the function) `mykernel()` on the GPU with 1 `block` containing 1 `thread` in that `block`.
        - It's the *equivalent* of creating 1 new thread and running the function on that thread in a traditional C program running on the host (CPU).
        - The kernel launch is an asynchronous function call, so it immediately returns control to the host (CPU).
    - Then we call the CUDA function `cudaDeviceSynchronize()`.
      - This function call is a synchronous call, which blocks the host's (CPU's) main thread until the kernel function completes (returns) on the device (GPU).
      - This is necessary, otherwise the `main()` function would go out of scope (terminate) before we have a chance to retrieve any results from the kernel function.
    - Finally we return the exit code `0` to the operating system.
    
      ```c
      int main(void)
      {
        mykernel<<<1, 1>>>();    // kernel launch (calls the kernel function mykernel, and is asynchronous)
        cudaDeviceSynchronize(); // blocks the host's main thread (CPU) until the kernel function completes (returns)
        
        return 0;
      }
      ```
- Run the cell below to see the output.

**TL;DR**

- The CUDA function prototypes are declared in header file `cuda_runtime.h`.
- A function with the `__global__` qualifier is called a **kernel function** that **runs on the device (GPU)** and is **called from the host (CPU)**.
- A **kernel launch** calls a **kernel function** using the syntax `kernelfunction<<<n_blocks, n_threads>>>()` and is an **asynchrounous call**.
- The function `cudaDeviceSynchronize()` is a **synchronous** call that blocks the host code until the **kernel function** is complete (returns).
- The Nvidia compiler `nvcc` separates source code into **host** and **device** components.
  - Device functions (e.g. `mykernel()`) is processed by the NVIDIA compiler `nvcc`.
  - Host functions (e.g. `main()`) are processed by a standard host C compiler (e.g. `gcc`, `clang`, or `cl.exe`).
    - NVCC instructs the underlying C compiler to compile host code during the compilation process.

In [586]:
%%writefile main.c
#include <stdio.h>
#include <stdlib.h>
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

const char *source =
"__kernel void mykernel()\n"
"{\n"
"    printf(\"Hello World!\\n\");\n"
"}\n";

void checkOpenCL(cl_int err, const char *msg)
{
    if (err != CL_SUCCESS)
    {
        fprintf(stderr, "%s failed: %d\n", msg, err);
        exit(EXIT_FAILURE);
    }
}

void checkOpenCLBuildLog(cl_int err, cl_program program, cl_device_id device)
{
    if (err != CL_SUCCESS)
    {
        char log[2048];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        printf("Build log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }
}

int main()
{
    cl_int err;

    // 1. Platform
    cl_platform_id platform;
    cl_uint num_platforms;
    err = clGetPlatformIDs(1, &platform, &num_platforms);
    checkOpenCL(err, "clGetPlatformIDs");

    // 2. Device
    cl_device_id device;
    cl_uint num_devices;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &num_devices);
    checkOpenCL(err, "clGetDeviceIDs");

    // 3. Context
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    checkOpenCL(err, "clCreateContext");

    // 4. Command queue
    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, 0, &err);
    checkOpenCL(err, "lCreateCommandQueueWithProperties");

    // 5. Program
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    checkOpenCL(err, "clCreateProgramWithSource");
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    checkOpenCLBuildLog(err, program, device);

    // 6. Kernel
    cl_kernel kernel = clCreateKernel(program, "mykernel", &err);
    checkOpenCL(err, "clCreateKernel");

    // 7. Launch
    size_t global_size = 1;
    size_t local_size = 1;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    checkOpenCL(err, "clEnqueueNDRangeKernel");

    // 8. Wait and finish
    clFinish(queue);

    // 9. Cleanup
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    return 0;
}

Overwriting main.c


In [587]:
!{build_single_file_command}
!{execute_command}

Hello World!


### Creating a Utility Module

In [588]:
%%writefile src/utils.c
#include <stdio.h>
#include <stdlib.h>
#include "utils.h"

#define KERNEL_FILE "src/kernel.cl"
#define KERNEL_NAME "mykernel"

char* read_kernel_source(const char* filename)
{
    FILE* fp = fopen(filename, "r");
    if (!fp)
    {
        fprintf(stderr, "Failed to open kernel file: %s\n", filename);
        exit(1);
    }
    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);
    char* src = (char*)malloc(size + 1);
    fread(src, 1, size, fp);
    src[size] = '\0';
    fclose(fp);
    return src;
}

const char* openclGetErrorString(cl_int err)
{
    switch (err)
    {
        case CL_SUCCESS:                                  return "CL_SUCCESS";
        case CL_DEVICE_NOT_FOUND:                         return "CL_DEVICE_NOT_FOUND";
        case CL_DEVICE_NOT_AVAILABLE:                     return "CL_DEVICE_NOT_AVAILABLE";
        case CL_COMPILER_NOT_AVAILABLE:                   return "CL_COMPILER_NOT_AVAILABLE";
        case CL_MEM_OBJECT_ALLOCATION_FAILURE:            return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case CL_OUT_OF_RESOURCES:                         return "CL_OUT_OF_RESOURCES";
        case CL_OUT_OF_HOST_MEMORY:                       return "CL_OUT_OF_HOST_MEMORY";
        case CL_PROFILING_INFO_NOT_AVAILABLE:             return "CL_PROFILING_INFO_NOT_AVAILABLE";
        case CL_MEM_COPY_OVERLAP:                         return "CL_MEM_COPY_OVERLAP";
        case CL_IMAGE_FORMAT_MISMATCH:                    return "CL_IMAGE_FORMAT_MISMATCH";
        case CL_IMAGE_FORMAT_NOT_SUPPORTED:               return "CL_IMAGE_FORMAT_NOT_SUPPORTED";
        case CL_BUILD_PROGRAM_FAILURE:                    return "CL_BUILD_PROGRAM_FAILURE";
        case CL_MAP_FAILURE:                              return "CL_MAP_FAILURE";
        case CL_MISALIGNED_SUB_BUFFER_OFFSET:             return "CL_MISALIGNED_SUB_BUFFER_OFFSET";
        case CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST:return "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST";
        case CL_COMPILE_PROGRAM_FAILURE:                  return "CL_COMPILE_PROGRAM_FAILURE";
        case CL_LINKER_NOT_AVAILABLE:                     return "CL_LINKER_NOT_AVAILABLE";
        case CL_LINK_PROGRAM_FAILURE:                     return "CL_LINK_PROGRAM_FAILURE";
        case CL_DEVICE_PARTITION_FAILED:                  return "CL_DEVICE_PARTITION_FAILED";
        case CL_KERNEL_ARG_INFO_NOT_AVAILABLE:            return "CL_KERNEL_ARG_INFO_NOT_AVAILABLE";

        // clCreateProgramWithSource, clBuildProgram, etc.
        case CL_INVALID_VALUE:                            return "CL_INVALID_VALUE";
        case CL_INVALID_DEVICE_TYPE:                      return "CL_INVALID_DEVICE_TYPE";
        case CL_INVALID_PLATFORM:                         return "CL_INVALID_PLATFORM";
        case CL_INVALID_DEVICE:                           return "CL_INVALID_DEVICE";
        case CL_INVALID_CONTEXT:                          return "CL_INVALID_CONTEXT";
        case CL_INVALID_QUEUE_PROPERTIES:                 return "CL_INVALID_QUEUE_PROPERTIES";
        case CL_INVALID_COMMAND_QUEUE:                    return "CL_INVALID_COMMAND_QUEUE";
        case CL_INVALID_HOST_PTR:                         return "CL_INVALID_HOST_PTR";
        case CL_INVALID_MEM_OBJECT:                       return "CL_INVALID_MEM_OBJECT";
        case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:          return "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR";
        case CL_INVALID_IMAGE_SIZE:                       return "CL_INVALID_IMAGE_SIZE";
        case CL_INVALID_SAMPLER:                          return "CL_INVALID_SAMPLER";
        case CL_INVALID_BINARY:                           return "CL_INVALID_BINARY";
        case CL_INVALID_BUILD_OPTIONS:                    return "CL_INVALID_BUILD_OPTIONS";
        case CL_INVALID_PROGRAM:                          return "CL_INVALID_PROGRAM";
        case CL_INVALID_PROGRAM_EXECUTABLE:               return "CL_INVALID_PROGRAM_EXECUTABLE";
        case CL_INVALID_KERNEL_NAME:                      return "CL_INVALID_KERNEL_NAME";
        case CL_INVALID_KERNEL_DEFINITION:                return "CL_INVALID_KERNEL_DEFINITION";
        case CL_INVALID_KERNEL:                           return "CL_INVALID_KERNEL";
        case CL_INVALID_ARG_INDEX:                        return "CL_INVALID_ARG_INDEX";
        case CL_INVALID_ARG_VALUE:                        return "CL_INVALID_ARG_VALUE";
        case CL_INVALID_ARG_SIZE:                         return "CL_INVALID_ARG_SIZE";
        case CL_INVALID_WORK_DIMENSION:                   return "CL_INVALID_WORK_DIMENSION";
        case CL_INVALID_WORK_GROUP_SIZE:                  return "CL_INVALID_WORK_GROUP_SIZE";
        case CL_INVALID_WORK_ITEM_SIZE:                   return "CL_INVALID_WORK_ITEM_SIZE";
        case CL_INVALID_GLOBAL_OFFSET:                    return "CL_INVALID_GLOBAL_OFFSET";
        case CL_INVALID_EVENT_WAIT_LIST:                  return "CL_INVALID_EVENT_WAIT_LIST";
        case CL_INVALID_EVENT:                            return "CL_INVALID_EVENT";
        case CL_INVALID_OPERATION:                        return "CL_INVALID_OPERATION";
        case CL_INVALID_GL_OBJECT:                        return "CL_INVALID_GL_OBJECT";
        case CL_INVALID_BUFFER_SIZE:                      return "CL_INVALID_BUFFER_SIZE";
        case CL_INVALID_MIP_LEVEL:                        return "CL_INVALID_MIP_LEVEL";
        case CL_INVALID_GLOBAL_WORK_SIZE:                 return "CL_INVALID_GLOBAL_WORK_SIZE";        
        default:
        {
            static char unknown[64];
            snprintf(unknown, sizeof(unknown), "Unknown OpenCL error code %d", err);
            return unknown;
        }
    }
}

void checkOpenCL(cl_int err, const char *msg)
{
    if (err != CL_SUCCESS)
    {
        printf("Error: %s (%s)\n", msg, openclGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

void checkOpenCLBuildLog(cl_int err, cl_program program, cl_device_id device)
{
    if (err != CL_SUCCESS)
    {
        char log[2048];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        printf("Build log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }
}

void setupOpenCL(cl_context *context, cl_command_queue *queue, cl_program *program, cl_kernel *kernel)
{
    cl_int err;
    
    // 1. Platform
    cl_platform_id platform;
    cl_uint num_platforms;
    err = clGetPlatformIDs(1, &platform, &num_platforms);
    checkOpenCL(err, "clGetPlatformIDs");

    // 2. Device
    cl_device_id device;
    cl_uint num_devices;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, &num_devices);
    checkOpenCL(err, "clGetDeviceIDs");

    // 3. Context
    /*cl_context*/ *context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    checkOpenCL(err, "clCreateContext");

    // 4. Command queue
    const cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
    /*cl_command_queue*/ *queue = clCreateCommandQueueWithProperties(*context, device, props, &err);
    checkOpenCL(err, "lCreateCommandQueueWithProperties");

    // 5. Program
    char* source = read_kernel_source(KERNEL_FILE); // Read kernel source
    /*cl_program*/ *program = clCreateProgramWithSource(*context, 1, (const char**)&source, NULL, &err);
    checkOpenCL(err, "clCreateProgramWithSource");
    err = clBuildProgram(*program, 1, &device, NULL, NULL, NULL);
    checkOpenCLBuildLog(err, *program, device);

    // 6. Kernel
    /*cl_kernel*/ *kernel = clCreateKernel(*program, KERNEL_NAME, &err);
    checkOpenCL(err, "clCreateKernel");
}

void teardownOpenCL(cl_context *context, cl_command_queue *queue, cl_program *program, cl_kernel *kernel)
{
    clReleaseKernel(*kernel);
    clReleaseProgram(*program);
    clReleaseCommandQueue(*queue);
    clReleaseContext(*context);
}

Writing src/utils.c


In [589]:
%%writefile include/utils.h
#pragma once
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

char* read_kernel_source(const char* filename);
const char* openclGetErrorString(cl_int err);
void checkOpenCL(cl_int err, const char *msg);
void checkOpenCLBuildLog(cl_int err, cl_program program, cl_device_id device);
void setupOpenCL(cl_context *context, cl_command_queue *queue, cl_program *program, cl_kernel *kernel);
void teardownOpenCL(cl_context *context, cl_command_queue *queue, cl_program *program, cl_kernel *kernel);

Writing include/utils.h


### Using the Utility Module

In [590]:
%%writefile src/kernel.cl
__kernel void mykernel()
{
    printf("Hello World!\n");
}

Overwriting src/kernel.cl


In [591]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include "utils.h"

int main()
{
    cl_int err;

    // 1. Platform, 2. Device, 3. Context, 4. Command queue, 5. Program, 6. Kernel
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    // 7. Launch
    size_t global_size = 1;
    size_t local_size = 1;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    checkOpenCL(err, "clEnqueueNDRangeKernel");

    // 8. Wait and finish
    clFinish(queue);

    // 9. Cleanup
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [592]:
!{build_multi_file_command}
!{execute_command}

Hello World!


---
## 2.4 NDRange (Global Size), Work Groups, Work Items, Devices, CUs, and PEs

- In the previous CUDA code, we saw:
  - A CUDA **kernel**, where:
    - `__global__` is a CUDA-specific qualifier that makes a function a **kernel function**.
      - A **kernel function** is executed on the host (GPU), and called from the host (GPU).
    - `kernel` is the name of the **kernel function**.
    - `parameterlist` is the kernel function's comma-separated list of parameters.

      ```c
      __global__ kernel(parameterlist)
      {

      }
      ```
  - A CUDA **kernel launch**, where:
    - `kernel` is the name of the **kernel** function.
    - `<<<  >>>` is a special syntax used to **launch** (call) a **kernel function** and contains **launch parameters**.
    - `blocks` is a **launch parameter** with the number of **blocks per grid**.
    - `threads` is a **launch parameter** with the number of **threads per block**.
    - `argumentlist` are the arguments passed to the kernel function (and must match its parameterlist above).
  
      ```bash
      kernel<<<blocks, threads>>>(argumentlist);
      ```

<img src="images/opencl.png" width="500" style="float: right; margin-right: 50px;" />

- A CUDA-enabled GPU:
  - Is called a `Device`.
  - Has a number of `Streaming Multiprocessors (SMs)`.
  - Each `SM` has a number of `Streaming Processors (SPs)`.
- During a **kernel launch** `kernel<<<blocks, threads>>>(argumentlist)`:
  - CUDA lauches a `grid`. 
  - A `grid` contains a number of `blocks` (specified as `blocks` within `<<<blocks, threads>>>`).
  - Each `block` contains a number of `threads` (specified as `threads` within `<<<blocks, threads>>>`).
- CUDA maps:
  - A `block` to run on an `SM` (multiple `blocks` can be assigned to the same `SM`).
  - Each `thread` within a `block` to an `SP`.
- So, a `block` runs on an `SM`, and each `thread` within that `block` runs on a `SP` within that `SM`.

- For each **kernel launch**:
  - CUDA creates a `grid` containing `blocks` containing `threads`.
  - Each `thread` runs a copy of the same `kernel` function with the same `argumentlist`.
  - 4 CUDA-specific global variables are available within each copy of the `kernel` function:
    - `gridDim` of type `dim3`, which is a `struct` with three member variables `int x`, `int y`, and `int z`.
      - This is the number of `blocks` in a `grid`, and can be specified as a 1D (`x`), 2D (`y`), or 3D (`z`) set of `blocks`.
    - `blockDim` of type `dim3`, which is a `struct` with three member variables `int x`, `int y`, and `int z`.
      - This is the number of `threads` in a `block`, and can be specified as a 1D (`x`), 2D (`y`), or 3D (`z`) set of `threads`.
    - `blockIdx` of type `dim3`, which is a `struct` with three member variables `int x`, `int y`, and `int z`.
      - This is unique ID of a `block` within the `grid`, and has a 1D (`x`), 2D (`y`), and 3D (`z`) ID.
    - `threadIdx` of type `dim3`, which is a `struct` with three member variables `int x`, `int y`, and `int z`.
      - This is unique ID of a `thread` within a `block`, and has a 1D (`x`), 2D (`y`), and 3D (`z`) ID.

      ```c
      typedef struct
      {
        int x;
        int y;
        int z;
      } dim3;
      ```
    
  - In the figure, we see that each `SM` contains:
    - A `register file` (the blue rectangle).
      - The `register file` is divided into `**chunks**, where each `thread` (running in the `block` assigned to the `SM`) is assigned one **chunk**.
      - Each `thread`'s chunk of the `register file` is referred to as the `thread`'s `private memory` (only that `thread` can access that `private memory`).
      - In a `kernel` function, we can declare a local variable with the qualifier `__private__` which stores that variable in a `thread`'s `private memory`.
        - If we don't specify a qualifier, the variable is by default stored in the `thread`'s `private memory`.
    - A `shared memory` buffer (the green rectangle).
      - The `shared memory` is shared by all `threads` running in the `block` assigned to that `SM`.
      - To store a variable in `shared memory`, we declare the variable with the `__shared__` qualifier.
  - There is also `global memory`, which is declared using the qualifier `__global__`.
    - If we don't explicitly qualify a parameter in the kernel function's parameterlist with a qualifier, it is by default `__global__`, i.e. referring to `global memory`.
  - Finally, there is `constant memory` stored outside an `SM` (just like `global memory`), and is `read-only` memory (it can we written to once before a kernel launch).
    - `constant memory` is small, but very effient (the CUDA compiler can optimize access to it since it is `read-only`).

**TL;DR**
- A CUDA-enabled GPU is referred to as a `device`, and has an array of `SMs`, each with a number of `SPs`.
- A kernel launch specifies the number of `blocks` and `threads` to use within `<<<blocks, threads>>>`.
- CUDA maps a `block` to an `SM`.
- Each `thread` within a `block` is run on an `SP` within that `SM`.
- A CUDA-enabled GPU has `global memory` and `constant memory` located outside of any `SM`.
- Each `SM` has `shared memory` (shared by all `threads` running in the `block` assigned to an `SM`).
- Each `SM` has a `register file`, divided into chunks, where each `thread` is a assigned a chunk referred to as `private memory` (private to that `thread`).
- Each `thread` runs a **copy** of the same `kernel` function with the same `argumentlist`.

### Code Demonstrating NDRanges, Work Groups, and Work Items

- Let's write a CUDA program that demonstrates running a `grid` of `blocks`, each with a number of `threads`, on the device (GPU).
- The program copies the elements from one 1D `int` array `input` to another 1D `int` array `output`.
  - We include the necessary header files:
    - `stdio.h` for `printf`
    - `stdlib.h` for `malloc` and `free`
    - `time.h` for `srand` and `rand`
    - `cuda_runtime.h` for `cudaMalloc`, `cudaMemcpy`, and `cudaFree`
    
    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cuda_runtime.h>
    ```
  
  - We define symbolic constants:
    - `N` with the value `5` for the number of elements in each array
    - `THREADS_PER_BLOCK_X` with the value `2` for the number of `threads` in each `block`

    ```c
    #define N 5
    #define THREADS_PER_BLOCK_X 2
    ```
  
  - We define a kernel function called `kernel`.
    - It has the `__global__` qualifier, making it a kernel function, returns `void`, and has three parameters:
      - An `int *` pointer `input` which points to the `input` array in the device's `global memory`.
      - An `int *` pointer `output` which points to the `output` array in the device's `global memory`.
      - An `int` variable `n` with the number of elements in each array.
    - In the function's body, we:
      - Use the CUDA-specific global variables `gridDim`, `blockDim`, `blockIdx`, and `threadIdx`, all of type `dim3` (with `int` member variables `x`, `y`, `z`).
        - CUDA sets these as global variables, behind the scenes, during a kernel launch.
        - They are available to each thread within each copy of the kernel function.
        - `gridDim.x` is the number of `blocks` in a `grid`.
          - In our case we have `(5 + 2 - 1) / 2 = 3` since we calculate it as `(N + THREADS_PER_BLOCK_X - 1) / THREADS_PER_BLOCK_X` in the `main` function.
        - `blockDim.x` is the number of `threads` in a `block`.
          - In our case we have `2`, defined by `THREADS_PER_BLOCK_X`.
        - `blockIdx.x` is the unique `ID` of a `block` within the `grid`.
          - Since `gridDim.x` is `3`, `blockIdx.x` ranges from `0` to `2` (zero-based, i.e. from `0` to `gridDim.x - 1`).
        - `threadIdx.x` is the unique `ID` of a `thread` within a `block`.
          - Since `blockDim.x` is `2`, `threadIdx.x` ranges from `0` to `1` (zero-based, i.e. from `0` to `blockDim.x.x - 1`).
      - We calculate the global ID for a `thread` as `int idx = threadIdx.x + blockIdx.x * blockDim.x`.
        - We have `gridDim.x * blockDim.x = 3 * 2 = 6` threads in total, so `idx` ranges from `0` to `5` (zero-based).
      - We print out the values of `gridDim.x`, `blockDim.x`, `blockIdx.x`, `threadIdx.x`, and `idx`.
        - This shows us which `thread` is running this specific copy of the kernel function, which `block` it is in, etc.
      - We have a `boundary guard`, i.e. `if(idx >= n) return;`
        - The ensures we don't index into an array with `idx` if `idx` is out of bounds.
        - We have `6` threads in total, and `5` (defined by `N`) elements in each array, so `idx = 5` is out of bounds.
      - Finally, we copy one element from the `input` array into the `output` using `idx` as the index.
        - Remember, each thread runs its own `copy` of the kernel function (in parallel), each with the same set of kernel function `arguments`.

    ```c
    __global__ void kernel(int *input, int *output, int n)
    {       
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        
        printf("gridDim.x = %d, blockDim.x = %d, blockIdx.x = %d, threadIdx.x = %d, idx = %d\n", gridDim.x, blockDim.x, blockIdx.x, threadIdx.x, idx);
        
        if(idx >= n)
        {
            printf("Boundary checking avoided indexing outside of the arrays [idx = %d]\n", idx);
            return;
        }

        output[idx] = input[idx];
    }
    ```
- In the `main()` function:
  - We seed the pseudorandom number generator with the value `0` so the random numers we create will be the same every time we run the program.

    ```c
    srand(0);
    ```

  - We declare:
    - `int` pointer variables `h_input` and `h_output` for the two arrays, which will point to heap memory (RAM) on the host (CPU).
    - `int` pointer variables `d_input` and `d_output` for the two arrays, which will point to global memory on the device (GPU).
    - `int` variable `data_size` and initialize it to `N * sizeof(int)`, i.e. the total size of each array in bytes (with `N` elements of type `int` in each).
    
    ```c
    int *h_input, *h_output;
    int *d_input, *d_output;
    int data_size = N * sizeof(int);
    ```
  - We allocate memory on the host (GPU) with `malloc`, storing the pointers to the memory in variables `h_input` and `h_output`.

    ```c
    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);
    ```
  - We allocate memory on the device (GPU) with `cudaMalloc` storing the pointers to the memory in variables `d_input` and `d_output`.

    ```c
    cudaMalloc((void **)&d_input, data_size);
    cudaMalloc((void **)&d_output, data_size);
    ```
  - We initialize the `h_input` array on the host (CPU) with random values using the `rand()` function.

    ```c
    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100; // random integers between 0 and 99
    }
    ```
  - We copy the elements of both arrays stored in host (GPU) memory (RAM) to device (GPU) global memory with `cudaMemcpy`.
    - Its first argument is a pointer to the memory to `copy to`.
    - Its second argument is a pointer to the memory to `copy from`.
    - Its third argument is the `size (in bytes)` of memory to copy (all `N` elements in our case).
    - Its fourth argument is a symbolic constant which determines the direction of the copy operation.
      - `cudaMemcpyHostToDevice` copies memory from the host (CPU) to the device (GPU).
      - `cudaMemcpyDeviceToHost` copies memory from the device (GPU) to the host (CPU).
    - Here we are copying the `input` and `output` arrays from the host (`h_input`, `h_output`) to the device (`d_input`, `d_output`).

    ```c
    cudaMemcpy(d_input, h_input, data_size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_output, h_output, data_size, cudaMemcpyHostToDevice);
    ```
  - We are now using CUDA's `dim3` struct to define the `gridDim` (number of blocks) and `blockDim` (number of threads in each block).
    - Before we just used `int` litterals in the launch configuration `<<<1, 1>>>`, but we can also use `dim3` variables `<<<gridDim, blockDim>>>`.
    - The launch configuration supports launching `1D`, `2D`, and `3D` blocks and threads, depending on the problem we want to solve, e.g:
      - For a `1D` problem such as copying elements between two 1D arrays, we only need to use 1 dimension, which is the `x` member variable in `dim3` structs.
      - For a `2D` problem such as filtering a 2D image, we might need to use 2 dimensions, which are the `x` and `y` member variables in `dim3` structs.
      - For a `3D` problem such as filtering a 3D MRI-scan volume, we might need to use 3 dimensions, which are the `x`, `y` and `z` member variables in `dim3` structs.
      - For our 1D problem, we are only using the `x` member variable, which means the other dimensions `y` and `z` will be set to the value `1`.
    - We create a `dim3` variable `blockDim` and initialize its `x` member variable to `THREADS_PER_BLOCK_X` (member variables `y` and `z` will be set to `1`).
      - This means we have `2` threads per block since `THREADS_PER_BLOCK_X` is defined with the value `2`.
    - We create a `dim3` variable `gridDim` and initialize its `x` member variable to `(N + THREADS_PER_BLOCK_X - 1) / THREADS_PER_BLOCK_X`.
      - This means we have `3` blocks per grid since `(N + THREADS_PER_BLOCK_X - 1) / THREADS_PER_BLOCK_X = (5 + 2 -1) / 2 = 3` (and there is only ever 1 grid).
      - This (standard) construct is commonly used to ensure enough `threads` are launched to solve a problem (but can launch more `threads` than data elements).

    ```c
    dim3 blockDim(THREADS_PER_BLOCK_X);
    dim3 gridDim((N + THREADS_PER_BLOCK_X - 1) / THREADS_PER_BLOCK_X);
    ```
  - Next, we launch the kernel with:
    - Launch configuration `<<<gridDim, blockDim>>>`, where `gridDim` and `blockDim` are our two `dim3` variables.
    - Argument list `d_input, d_output, N`  `d_input`, where `d_input` and `d_output` are the `int` pointers to the arrays on the device (GPU), and `N` the number of elements.

    ```c
    kernel<<<gridDim, blockDim>>>(d_input, d_output, N);
    ```
  - We copy the elements in the `d_output` array on the device (GPU) back to the array `h_output` on the host (CPU) using `cudaMemcpy`.
    - Notice the final argument is now `cudaMemcpyDeviceToHost`, i.e. the direction of the copy operation is from the device (GPU) to the host (CPU).

    ```c
    cudaMemcpy(h_output, d_output, data_size, cudaMemcpyDeviceToHost);   
    ```
  - Then we print out the elements in the two arrays `h_input` and `h_output` on the host (CPU).
   
    ```c
    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }
    ```
  - Finally, we free the mmeory allocated for the arrays:
    - We free the `int` pointers (`d_input` and `d_output`) pointing to memory on the device (GPU) with `cudaFree`.
    - We free the `int` pointers (`h_input` and `h_output`) pointing to memory on the host (CPU) with `free`.
    - Both functions take a pointer to the memory 
    - Notice the naming convention used in this program for pointers to memory on the host (`h_` prefix) and the device (`d_` prefix).

    ```c
    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    free(h_output);
    ```
- Run the cell below to see the output from the program.

**TL;DR**
- A kernel launch `kernel<<<blocks, threads>>>(argumentlist)` has:
  - A launch configuration `<<<gridDim, blockDim>>>` that specifies how man `blocks` (`gridDim`) and `threads` per `block` (`blockDim`) to launch.
    - It accepts `int` parameters, e.g. `<<<3, 2>>>` or `dim3` parameters, e.g. `<<<gridDim, blockDim>>>`.
    - `dim3` is a struct containing `int` member variables `x`, `y`, and `z`, used to structure `blocks` and `threads` for `1D`, `2D`, or `3D` problems.
  - An argument list `argumentlist` which must match the kernel function's parameter list.
    - Each `thread` runs a `copy` of the same kernel function, with the exact same `argumentlist`, in parallel (at the same time).
- A kernel function is run for each `thread`, where each `thread` has access to 4 global `dim3` variables `gridDim`, `blockDim`, `blockIdx`, and `threadIdx`
  - `gridDim` and `blockDim` are from the launch configuration and contain the number of `blocks` (`gridDim`) and number of `threads` per `block` (`blockDim`).
  - `blockIdx` and `threadIdx` contain unique `block` IDs within a grid (`blockIdx`) and unique `thread` IDs within a `block` (`threadIdx`).
- The construct `(N + THREADS_PER_BLOCK_X - 1) / THREADS_PER_BLOCK_X`:
  - Is commonly used ensure enough (at least as many) `threads` are launched needed to solve a problem (cover all data elements).
  - But can launch more `threads` than the total number of data elements.
- Since we can have more `threads` than data elements, we **always use bounday guards in CUDA kernels** to avoid out-of-bounds indexing.
- We use `malloc` and `free` for managing memory on the host (CPU).
- We use `cudaMalloc` and `cudaFree` for managing memory on the device (GPU).
- We use `cudaMemcpy` to copy memory between the host (CPU) and device (GPU), where the fourth argument determines the direction of the copy operation.
  - `cudaMemcpyHostToDevice` copies memory from the `host` (CPU) to the `device` (GPU).
  - `cudaMemcpyDeviceToHost` copies memory from the `device` (GPU) to the `host` (CPU).

In [593]:
%%writefile src/kernel.cl
__kernel void mykernel(__global const int *input, __global int *output, int n)
{
    int num_work_groups = get_num_groups(0); // equivalent to CUDA's gridDim.x
    int work_group_size = get_local_size(0); // equivalent to CUDA's blockDim.x
    int work_group_id = get_group_id(0);     // equivalent to CUDA's blockIdx.x
    int work_item_id = get_local_id(0);      // equivalent to CUDA's threadIdx.x
    int idx = get_global_id(0);              // equivalent to CUDA's threadIdx.x + blockIdx.x * blockDim.x

    printf("num_work_groups = %d, work_group_size = %d, work_group_id = %d, work_item_id = %d, idx = %d\n", num_work_groups, work_group_size, work_group_id, work_item_id, idx);

    if (idx >= n) {
        printf("Boundary checking avoided indexing outside of the arrays [idx = %d]\n", idx);
        return;
    }

    output[idx] = input[idx];
}

Overwriting src/kernel.cl


In [594]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "utils.h"

#define N 5
#define WORKITEMS_PER_WORKGROUP_0 2

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    
    int *h_input, *h_output;
    int data_size = N * sizeof(int);

    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);

    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100;
    }

    // Device buffers
    cl_mem d_input, d_output;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N * sizeof(int), h_input, &err); // CL_MEM_COPY_HOST_PTR copies h_input values to device buffer
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &err);

    // Set kernel arguments
    cl_int n = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(int), &n);

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // Enqueue kernel
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);

    // Read result back (CL_TRUE makes this a synchronous (blocking) call)
    err = clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, N * sizeof(int), h_output, 0, NULL, NULL);

    // Wait for all queued operations to finish (not really needed here because of CL_TRUE in clEnqueueReadBuffer above)
    err = clFinish(queue);

    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }

    // Cleanup
    free(h_input);
    free(h_output);
    err = clReleaseMemObject(d_input);
    err = clReleaseMemObject(d_output);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [595]:
!{build_multi_file_command}
!{execute_command}

num_work_groups = 3, work_group_size = 2, work_group_id = 2, work_item_id = 0, idx = 4
num_work_groups = 3, work_group_size = 2, work_group_id = 2, work_item_id = 1, idx = 5
num_work_groups = 3, work_group_size = 2, work_group_id = 1, work_item_id = 0, idx = 2
num_work_groups = 3, work_group_size = 2, work_group_id = 1, work_item_id = 1, idx = 3
num_work_groups = 3, work_group_size = 2, work_group_id = 0, work_item_id = 0, idx = 0
num_work_groups = 3, work_group_size = 2, work_group_id = 0, work_item_id = 1, idx = 1
Boundary checking avoided indexing outside of the arrays [idx = 5]

input   output
83      83    
86      86    
77      77    
15      15    
93      93    


### Inspecting the Output

- In the output we see:
  - `gridDim.x` is `3`, i.e. there a `3` `blocks` in the grid.
  - `blockDim.x` is `2`, i.e. there a `2` `threads` in each `block`.
  - `blockIdx.x` varies from `0` to `2`, i.e. from `0` to `gridDim.x - 1`, and is a `block`'s unique ID (i.e. unique within a kernel launch).
  - `threadIdx.x` varies from `0` to `1`, i.e. from `0` to `blockDim.x - 1`, and is a `thread`'s unique block ID (i.e. unique within a block).
  - `idx` varies from `0` to `5`, i.e. from `0` to `gridDim.x * blockDim.x`, and is a `thread`'s unique global ID (i.e. unique within a kernel launch).
  - The boundary guard was triggered for one thread, i.e. the thread with `idx = 5`, because we only have `N = 5` elements in each array.
    - So, we can have more threads running than elements in our data/arrays, why **we should always make use of boundary guards in our CUDA kernels**.
  - The `input` and `output` arrays have the same element values, so our CUDA kernel's logic is functionally correct.

---
## 2.5 Error Checking

- CUDA supports checking for errors in device (GPU) code and from calling any CUDA function.
- The `#include`s, `#define`s, and the kernel function are the same as in the previous example.
  - `stdlib.h` also includes the function prototype for `exit` and the symbolic constant `EXIT_FAILURE` used in this example.
  - `cuda_runtime` also includes:
    - The the function prototypes for `cudaMalloc`, `cudaFree`, and `cudaMemcpy`, since we are back to NOT using `unified memory`.
    - The function prototypes for `cudaGetLastError` and `cudaGetErrorString`, which we use for checking errors from CUDA function calls.
    - The typedef `cudaError_t` and symbolic constant `cudaSuccess`, also used for checking CUDA errors.
- The only modifications in the `main()` function are:
  - We are using the code from the example we used before the `unified memory` example, where we don't use `unified memory`.
  - We have wrapped all CUDA function calls as arguments in a function called `checkCuda()`, explained below, e.g.

    ```c
    checkCuda(cudaMalloc((void **)&d_input, data_size), "cudaMalloc");
    ```
  - After the kernel launch, we use the code below to check for CUDA errors in the kernel function.

    ```c
    checkCuda(cudaGetLastError(), "kernel");
    ```
  - Then, after freeing all allocated memory, we deliberately produce an error:
    - We call `cudaFree()` on the `d_output` pointer twice, which produces an error the second time since that memory has already been freed.
      - All CUDA functions return a value of type `cudaError_t` which can be checked to see if an error occured in CUDA code on the device (GPU).
      - The only CUDA operation that doesn't have a return value of type `cudaError_t` is the kernel launch.
      - For that reason, CUDA provides the function `cudaGetLastError()` which will return the latest `cudaError_t` (we can use it after any CUDA function call). 
    ```c
    checkCuda(cudaFree(d_output), "cudaFree");
    ```
- We have wrapped all CUDA function calls, returning a value of type `cudaError_t`, in the function `checkCuda()`.
  - We pass the `cudaError_t` value as the first argument, and a string message as the second (the wrapped function name has been used).
- The function `checkCuda()` is defined by ourselves (it's been placed between the kernel function and the `main()` function the sample code):
  - It takes a CUDA error (`cudaError_t`) as its first argument and a message (string) as its second argument, returning `void`.
  - It checks if the value of the `cudaError_t` error is different from the symbolic constant `cudaSuccess` (`cudaSuccess` means there is no error).
    - If so, it retrieves a string-representation of the error by calling the function `cudaGetErrorString()`, which takes the error as an argument.
    - Then the error string is printed out together with an optional message (second argument to `checkCuda()`).
    - Finally, it terminates the program by calling the `exit()` function, passing in the symbolic constant `EXIT_FAILURE` as the return value to the operating system.
  - We can use `checkCuda()` by wrapping it around any CUDA function call, e.g. `checkCuda(cudaFree(d_output), "cudaFree")`.
    - Error checking will not be used going forward in this notebook to make the examples clearer, but good practice is to check for errors after each CUDA function call.

    ```c
    void checkCuda(cudaError_t err, const char *msg)
    {
        if (err != cudaSuccess)
        {
            printf("Error: %s (%s)\n", msg, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }
    ```
- Run the cell below to see the output (it will be the same as before, except for the error message that we deliberately produced).

**TL:DR**

- Each CUDA function returns a value of type `cudaError_t`.
  - If it's value is different from the symbolic constant `cudaSuccess`, and error occurred.
  - We can retrieve a string representation of a CUDA error by passing a `cudaError_t` instance as an argument the the function `cudaGetErrorString`.
  - We can retrieve the last CUDA error after a CUDA function call, including the kernel launch, using the function `cudaGetLastError`.
- We can exit from a C program prematurely, by calling the C function `exit`, passing in an exit code to the operating system.
  - The symbolic constant `EXIT_FAILURE` can be used as an exit code, representing a general error.

In [596]:
%%writefile src/kernel.cl
__kernel void mykernel(__global const int *input, __global int *output, int n)
{
    int idx = get_global_id(0);
    output[idx] = input[idx];
}

Overwriting src/kernel.cl


In [597]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "utils.h"

#define N 5
#define WORKITEMS_PER_WORKGROUP_0 2

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    
    int *h_input, *h_output;
    int data_size = N * sizeof(int);

    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);

    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100;
    }

    // Device buffers
    cl_mem d_input, d_output;
    
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N * sizeof(int), h_input, &err); // CL_MEM_COPY_HOST_PTR copies h_input values to device buffer
    checkOpenCL(err, "clCreateBuffer");
    
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &err);
    checkOpenCL(err, "clCreateBuffer");

    // Set kernel arguments
    cl_int n = N;
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    checkOpenCL(err, "clSetKernelArg");
    
    err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    checkOpenCL(err, "clSetKernelArg");
    
    err = clSetKernelArg(kernel, 2, sizeof(int), &n);
    checkOpenCL(err, "clSetKernelArg");

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // Enqueue kernel
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
    checkOpenCL(err, "clEnqueueNDRangeKernel");

    // Read result back (CL_TRUE makes this a synchronous (blocking) call)
    err = clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, N * sizeof(int), h_output, 0, NULL, NULL);
    checkOpenCL(err, "clEnqueueReadBuffer");

    // Wait for all queued operations to finish (not really needed here because of CL_TRUE in clEnqueueReadBuffer above)
    err = clFinish(queue);
    checkOpenCL(err, "clFinish");

    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }
    printf("\n");

    // Cleanup
    free(h_input);
    free(h_output);

    err = clReleaseMemObject(d_input);
    checkOpenCL(err, "clReleaseMemObject");
    
    err = clReleaseMemObject(d_output);
    checkOpenCL(err, "clReleaseMemObject");

    err = clSetKernelArg(kernel, 99, sizeof(cl_mem), &d_input);        // Intentionally set a kernel argument with invalid arg index
    checkOpenCL(err, "Intentional clSetKernelArg with invalid index");
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [598]:
!{build_multi_file_command}
!{execute_command}


input   output
83      83    
86      86    
77      77    
15      15    
93      93    

Error: Intentional clSetKernelArg with invalid index (CL_INVALID_ARG_INDEX)


### Inspecting the Output

- In the output we see that the result is the same as before (the only difference is that we are NOT using `unified memory`).
- We also see the error message `invalid argument` returned from `cudaFree()` when we try to deallocate device (GPU) memory that has already been freed.

---
## 2.6 Measuring Execution Time on the Host (CPU) and on the Device (GPU)

- A common workflow is to first implement an algorithm in a function on the host (CPU), and then in a kernel on the device (GPU).
  - The CPU version can act as a baseline benchmark for GPU kernel performance.
  - The CPU version can be used to verify the results of a GPU kernel.
  - For inexperienced manycore programmers, it's often easier to start with a CPU version, and then convert it into a GPU version.
- Let's use the same code as before, but instrument it with timing code, wrapped around the CPU function call and around the GPU kernel launch.
- The imported header files are the same as before:
  - `stdlib.h` also contains the function prototype for `clock`, the `clock_t` typedef, and the symbolic constant `CLOCKS_PER_SEC`.
    - `clock()` is a parameterless function returning a value of type `clock_t`.
    - `clock_t` contains the number of `ticks` elapsed since the program started.
    - `CLOCKS_PER_SEC` is defined as the number of `ticks` in a second (`ticks / CLOCKS_PER_SEC * 1000.0` converts `ticks` to milliseconds).
  - `cuda_runtime.h` contains a typedef `cudaEvent_t` and prototypes `cudaEventCreate`, `cudaEventRecord`, `cudaEventElapsedTime`, `cudaEventSynchronize`, and `cudaEventDestroy`.
    - `cudaEvent_t` represent a CUDA event, e.g. `cudaEvent_t start` (we won't explore CUDA events (or CUDA streams) in detail in this notebook).
    - `cudaEventCreate` is used to initialize a CUDA event, e.g. `cudaEventCreate(&start)`
    - `cudaEventRecord` is used to start recording (monitoring) a CUDA event, e.g. `cudaEventRecord(start)`
    - `cudaEventElapsedTime` is used to compute and return the elapsed time in milliseconds between to CUDA events, e.g. `cudaEventElapsedTime(&elapsed_ms, start, stop)`
    - `cudaEventSynchronize` blocks the CPU's main thread until an event has completed (in our code, when the kernel is done), e.g. `cudaEventSynchronize(stop)`
    - `cudaEventDestroy` frees (destroys) a CUDA event, e.g. `cudaEventDestroy(start)`
    
- We define a host (CPU) function `copy()`, equivalent to the device (GPU) kernel function `kernel()`
  - The GPU kernel function is the same as before.
    
    ```c
    void copy(int *input, int *output, int n)
    {
        for(int idx = 0; idx < n; idx++)
        {
            output[idx] = input[idx];
        }
    }
    ```
- In the `main()` function:
  - We wrap the code below around the device (GPU) kernel launch `kernel()`.
    - First we declare two CUDA event variables `start` and `stop`, and initialize them with `cudaEventCreate(&start)` and `cudaEventCreate(&stop)`.
    - Then we record the `start` event with `cudaEventRecord(start)` (this records the current time in the kernel and stores it in the `start` event).
    - Next, the device (GPU) kernel is launched as usual.
    - Then we record the `stop` event with `cudaEventRecord(stop)` (this records the current time in the kernel and stores it in the `stop` event).
    - We call `cudaEventSynchronize(stop)` to block the host (CPU) main thread until the `stop`event is done (i.e. until the kernel is done).
    - Finally, we declare a variable `float gpu_elapsed_ms` and pass it to the function `cudaEventElapsedTime(&gpu_elapsed_ms, start, stop);`
      - We also pass in `start`and `stop`, where the function will store the elapsed time in milliseonds in the variable `gpu_elapsed_ms`.

    ```c
    // --------------------------------------------------------------
    // Timing the device (GPU) kernel execution time
    // --------------------------------------------------------------
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
  
    // Device kernel() launch
    kernel<<<gridDim, blockDim>>>(d_input, d_output, N);
  
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_elapsed_ms;
    cudaEventElapsedTime(&gpu_elapsed_ms, start, stop);
    // --------------------------------------------------------------
    ```
  - We wrap the code below around the host (CPU) function call `copy()`.
    - We call the function `clock()` to record the current number of `ticks` since the program started, and store the result in a variable `cpu_start` of type `clock_t`.
    - Then we call the host (CPU) function `copy()`.
    - Next, we call the function `clock()` again to record the current number of `ticks` again and store the result in a variable `cpu_stop` of type `clock_t`.
    - Finally, we calculate the elapsed number of milliseconds as `float cpu_elapsed_ms = (double)(cpu_stop - cpu_start) / CLOCKS_PER_SEC * 1000.0`.
  
    ```c
    // --------------------------------------------------------------
    // Timing the host (CPU) function execution time
    // --------------------------------------------------------------
    clock_t cpu_start = clock();

    // Host function call
    copy(h_input, h_output_cpu, N);
    
    clock_t cpu_stop = clock();
    float cpu_elapsed_ms = (double)(cpu_stop - cpu_start) / CLOCKS_PER_SEC * 1000.0;
    // --------------------------------------------------------------
    ```
  - We print out the execution time for the device (GPU) kernel and host (GPU) function.

    ```c
    printf("GPU execution time  : %f ms\n", gpu_elapsed_ms);
    printf("CPU execution time  : %f ms\n", cpu_elapsed_ms);
    ```
  - We print out the execution time for the device (GPU) kernel and host (GPU) function.

    ```c
    printf("GPU execution time  : %f ms\n", gpu_elapsed_ms);
    printf("CPU execution time  : %f ms\n", cpu_elapsed_ms);
    ```
  - We use a separate `int` pointer variable `h_output_cpu` for storing the output from the host (CPU) function call.
    - We verify the output results from the device (GPU) kernel and host (CPU) function are the same.
    - This is a common best practice when verifying the correct functionality of an algorithm implemented in a device (GPU) kernel.
      - We use the `abs()` function to compute the absolute difference between each eleement pair in the two arrays.
      - If we were using `float`s instead of `int`s, we can use the `fabs()` function and compare the difference to e.g. `1e-5`.

    ```c
    int errorsum = 0;
    for (int i = 0; i < N; i++)
    {
        int error = abs(h_output[i] - h_output_cpu[i]);
        if (error > 0)
        {
            //printf("Result verification failed for element with index %d!\n", i);
            errorsum += error;
        }
    }
    // Print verification result
    printf("\nVerification : %s\n", (errorsum > 0) ? "FAILED" : "PASSED");
    ```
  - We also print out the two arrays as before (same code) after launching the device (GPU) kernel and after calling the host (CPU) function.
  - Lastly, we also free all memory, including the two CUDA events.

    ```c
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    ```
- Run the cell below to see the output.
  - We won't record time in this notebook going forward to make the example code clearer, but now you know how to do it yourself-

**TL:DR**

- We can measure the execution time for a CUDA kernel with types and function prototypes declared in `cuda_runtime.h`.
- We can measure the execution time for a C function (or any C code) with types and function prototypes declared in `stdlib.h`.
- We can cmopute the absolute difference between two results to determine if they are correct (given at least one of the results is correct).

In [599]:
%%writefile src/kernel.cl
__kernel void mykernel(__global const int *input, __global int *output, int n)
{
    int idx = get_global_id(0);
    output[idx] = input[idx];
}

Overwriting src/kernel.cl


In [600]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "utils.h"

#define N 5
#define WORKITEMS_PER_WORKGROUP_0 2

// Host function
void copy(int *input, int *output, int n)
{
    for(int idx = 0; idx < n; idx++)
    {
        output[idx] = input[idx];
    }
}

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    
    int *h_input, *h_output, *h_output_cpu;
    int data_size = N * sizeof(int);

    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);
    h_output_cpu = (int *)malloc(data_size);

    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100;
    }

    // Device buffers
    cl_mem d_input, d_output;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, N * sizeof(int), h_input, &err); // CL_MEM_COPY_HOST_PTR copies h_input values to device buffer
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(int), NULL, &err);

    // Set kernel arguments
    cl_int n = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(int), &n);

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // --------------------------------------------------------------
    // Timing the device (GPU) kernel execution time
    // --------------------------------------------------------------
    // Enqueue kernel with event
    cl_event kernel_event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &kernel_event);

    // Wait for the kernel to finish
    clWaitForEvents(1, &kernel_event);

    // Query profiling info
    cl_ulong time_start, time_end;
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    double gpu_elapsed_ms = (time_end - time_start) * 1e-6;  // Convert nanoseconds to milliseconds
    // --------------------------------------------------------------

    // Read result back (CL_TRUE makes this a synchronous (blocking) call)
    clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, N * sizeof(int), h_output, 0, NULL, NULL);
    
    // Print measured device kernel execution time
    printf("GPU execution time  : %f ms\n", gpu_elapsed_ms);

    // Print elements in both arrays
    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }
    printf("\n");

    // --------------------------------------------------------------
    // Timing the host (CPU) function execution time
    // --------------------------------------------------------------
    clock_t cpu_start = clock();

    // Host function call
    copy(h_input, h_output_cpu, N);
    
    clock_t cpu_stop = clock();
    float cpu_elapsed_ms = (double)(cpu_stop - cpu_start) / CLOCKS_PER_SEC * 1000.0;
    // --------------------------------------------------------------

    // Print measured host function execution time
    printf("CPU execution time  : %f ms\n", cpu_elapsed_ms);

    // Print elements in both arrays
    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output_cpu[i]);
    }

    // Verify the results in the GPU output with the CPU output
    int errorsum = 0;
    for (int i = 0; i < N; i++)
    {
        int error = abs(h_output[i] - h_output_cpu[i]);
        if (error > 0)
        {
            //printf("Result verification failed for element with index %d!\n", i);
            errorsum += error;
        }
    }
    
    // Print verification result
    printf("\nVerification : %s\n", (errorsum > 0) ? "FAILED" : "PASSED");

    // Cleanup
    free(h_input);
    free(h_output);
    free(h_output_cpu);

    clReleaseMemObject(d_input);
    clReleaseMemObject(d_output);
    clReleaseEvent(kernel_event); // Release the event

    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [601]:
!{build_multi_file_command}
!{execute_command}

GPU execution time  : 0.010240 ms

input   output
83      83    
86      86    
77      77    
15      15    
93      93    

CPU execution time  : 0.001000 ms

input   output
83      83    
86      86    
77      77    
15      15    
93      93    

Verification : PASSED


### Inspecting the Output

- In the output we see that the execution time on the GPU is slower than on the CPU.
- This is expected since copying 5 elements from one array to another is just a waste of time on a GPU.
  - **Not all problems are suitable for a GPU, in which case we should use the CPU instead**.
- We also see the results verification `PASSED` so we can rest assured that the kernel function is correct (if the CPU function correct, of course).

---
## 2.7 Shared Memory and Thread Synchronization on the Device (GPU)

- `Shared memory` is a fast, low-latency memory located on-chip, accessible by all `threads` in a `block`.
  - **Location**: On-chip, accessible by all `threads` in a `block`.
  - **Access**: Readable and writeable by all `threads` in a `block` (also `writable` from the `host` before kernel launch).
  - **Size limit**: Typically `48 KB` per `SM` (Streaming Multiprocessor).
  - **Speed**: Very fast, much faster than `global memory`.
  - **Scope**: Shared only among `threads` in the same `block`.
  - **Lifetime**: Exists for the duration of the `block`.
- Use `shared memory` when:
  - Threads need to cooperate, such as tiling, caching, or communication between `threads`.
    - `Shared memory` is specified within a kernel function with the qualifier `__shared__`.
    - It can be initialized `statically` or `dynamically`.
- `Thread synchronization` is used to synchronize `threads`, especially `threads`in a `block` when using `shared memory`:
  - Purpose: Barrier synchronization — all `threads` in the `block` must reach it before any can proceed.
    - A barrier (thread synchronization) is done with the command `__syncthreads()` in kernel function.
  - Ensures all `shared memory` reads/writes are complete before continuing.
  - Used to prevent `race conditions`.
- Issues to be aware of when using `shared memory`:
  - `Non-coalesced memory access` also applies when accessing `global memory` (actually more important in that case).
  - `Wavefront (thread) divergence` applies to `threads` within the same `wavefront` (a `wavefront` is a group of `32` threads that are scheduled to run on `SPs` within a `SM`/`block`).
  - `Low occupancy` is related to `shared memory` but applies more generally to a `kernel launch`.

  | Issue                       | Consequence                        | Fix                                |
  | --------------------------- | ---------------------------------- | ---------------------------------- |
  | Race conditions             | Wrong results                      | Use `__syncthreads()` or atomics   |
  | No synchronization          | Inconsistent reads/writes          | Use `__syncthreads()`              |
  | [Bank conflicts](https://www.youtube.com/watch?v=CZgM3DEBplE)              | Performance slowdown               | Pad arrays, restructure access     |
  | Exceeding memory limit      | Kernel launch fails or runs slower | Reduce usage, use fewer threads    |
  | Wrong indexing              | Wrong data or crash                | Use `threadIdx` properly           |
  | Uninitialized/out-of-bounds | Undefined behavior                 | Always initialize and guard bounds |
  | [Non-coalesced memory access](https://www.youtube.com/watch?v=mLxZyWOI340&list=PLAwxTw4SYaPnFKojVQrmyOGFCqHTxfdv2&index=97)| Slower execution speed | Coalesce memory access |
  | [Wavefront (thread) divergence](https://www.youtube.com/watch?v=bHkFV-YMxxY&list=PLAwxTw4SYaPnFKojVQrmyOGFCqHTxfdv2&index=106) | Slower execution speed | Avoid branches and loops |
  | [Low occupancy](https://www.youtube.com/watch?v=2NGQvnT_3gU) | Slower execution speed | Increase occupancy |  

<br />  

- Now, let's look at a simple example of using `shared memory`.
- The code is the same as before, but with the following modifications:
  - In the `kernel` function, we declare a buffer (array) with the `__shared__` qualifier.
  - The `shared memory` can be declared with a `static` size or with a `dynamic size`.
  - In the sample code, we are using a `dynamic size`, where
    - the size isn't provided within the square brackets `[]`
    - the keyword `extern` is used infront of the `__static__` qualifier
      - this means the size id declared elsewhere (as an additional launch configuration parameter)
  - If we wanted a `static` size, we could use the commented-out row below instead, where
    - the sizs is provided within the square brackets `[THREADS_PER_BLOCK_X]` (`THREADS_PER_BLOCK_X` in this case).

  ```c
  extern __shared__ int shared[];               // dynamic size
  //__shared__ int shared[THREADS_PER_BLOCK_X]; // static size
  ```
- Let's look at the complete kernel function:
  - At the top, we decalare `shared memory` with a dynamic size.
  - Then we calculate a `thread`'s global index/ID (`g_idx`) and a `thread`'s local/shared index/ID (`s_idx`).
    - We have to be careful in how we use the threads for indexing (`g_idx` is unique within a kernel launch, `s_id` is unique within a `block` on the same `SM`).
    - Remember, if we have `blockDim.x` `threads` per `block` (with a `s_id` ranging from `0` to `blockDim.x` - 1).
  - Our usual `boundary guard` comes next `if(g_idx >0 n)`.
  - Then we copy elements from the `input` array into `shared` memory.
    - The index into the `input` array is `g_idx`.
    - The index into the `shared` array is `s_idx`.
    - Different indexing schemes might ne necessary depending on the problem/algorithm.
  - Next, we have a thread barrier `__syncthreads()`.
    - This ensures no `thread` within the `block` can continue past this row until all `threads` in the `block` have completed the code above this row.
      - This is important, since some `threads`might not have copied their element from the `input` array into the `shared` array yet.
      - In this example, it isn't an issue, because no other `thread` will read another `thread`'s element in the `shared` array in the code below the barrier `__syncthreads`.
      - For other problems, this might not be the case, so if `threads` aren't synchronized, they might continue and read stale data from the `shared` array.
  - Lastly, when all `threads` are synchronized, a `thread` copies an element from the `shared` array into the `output` array.
    - The index into the `output` array is `g_idx`.
    - The index into the `shared` array is `s_idx`.
    - Different indexing schemes might ne necessary depending on the problem/algorithm.

  ```C
  __global__ void kernel(int *input, int *output, int n)
  {
      // Shared memory
      extern __shared__ int shared[];               // dynamic size
      //__shared__ int shared[THREADS_PER_BLOCK_X]; // static size
      
      int g_idx = threadIdx.x + blockIdx.x * blockDim.x; // index in global memory (globally unique)
      int s_idx = threadIdx.x;                           // index in shared memory (unique within a block)

      if(g_idx >= n) return; // boundary guard

      // Copy elements in global memory (input) to shared memory (shared)
      shared[s_idx] = input[g_idx];

      // Synchronize threads
      __syncthreads();       // all threads in the same block must be done with the operations above before any thread can continue

      // Copy elements in shared memory (shared) to global memory (output)
      output[g_idx] = shared[s_idx];
  }
  ```
- Now, let's look at modifications in the `main()` function (most of the code is the same as before, but with the timing removed for clarity).
  - In fact, there is only one modification:
    - Since we are using a dynamic size for our `shared memory`, we first define the size of the memory with `int shared_size = THREADS_PER_BLOCK_X * sizeof(int)`.
    - Then we supply the size `shared_size` (in bytes) as a third parameter in the launch configuration `<<<gridDim, blockDim, shared_size>>>`.
    - If we were using a static size, we would comment these two rows, uncomment the last row, and use the same launch configuration as before (i.e. no change).
  - Best practice is to use a dynamic size, since we can determine a variable size in the code (without relying on e.g. a `#define` preprocessing directive).

    ```c
    // Device kernel() launch
    int shared_size = THREADS_PER_BLOCK_X * sizeof(int);
    kernel<<<gridDim, blockDim, shared_size>>>(d_input, d_output, N);
    //kernel<<<gridDim, blockDim>>>(d_input, d_output, N);

    ```
- Run the cell below to see the output (which is exactly the same as before).

**TL;DR**

- `Shared memory` can be declared using either:
  - A dynamic size
    - We use the keyword `extern`and the qualifier `__shared__` infront of the local variable in the kernel function.
    - We don't specify the size when declaring the variable in the kernel function, e.g. `extern __shared__ int shared[]`
    - We pass the size (in bytes) of the `shared memory` as a third parameter in the launch configuration `<<<blocks, threads, shared_size>>>`.
  - A static size
    - We use the qualifier `__shared__` infront of the local variable in the kernel function.
    - We include the size when declaring the variable in the kernel function, e.g. `__shared__ int shared[THREADS_PER_BLOCK_X]`
    - We call the kernel function without passing a third parameter to the launch configuration (or the value `0`) `<<<blocks, threads, shared_size>>>`.
- `Thread synchronization` is important when using `shared memory`.
  - We can synchronize `threads` with the statement `__syncthreads();`
    - No `thread` in a `block` can continue past that row until all `threads` have completed their tasks in the code above that row.
- Remember this regarding indexing:
  - A `thread`'s unique ID within a `block` is `threadIdx.x` (a specific `block` only runs on one `SM`, the `SM` with the `shared memory`).
  - A `thread`'s globally unique ID within a grid is calculated as `blockIdx.x * blockDim.x + threadIdx.x`.
- Multiple issues are related to `shared memory` (is one isn't aware of them).
  - We won't explore these issues (e.g. memory coalescence, warp divergence, bank conflicts, occupancy, etc.) in detail in this notebook.

In [602]:
%%writefile src/kernel.cl
//#define WORKITEMS_PER_WORKGROUP_0 2 // when using static local (shared) memory size

//__kernel void mykernel(__global const int *input, __global int *output, const int n)                    // when using static local (shared) memory size
__kernel void mykernel(__global const int *input, __global int *output, const int n, __local int *shared) // dynamic local (shared) memory size
{
    //__local int shared[WORKITEMS_PER_WORKGROUP_0]; // when using static local (shared) memory size

    int g_idx = get_global_id(0); // index in global memory (globally unique)
    int s_idx = get_local_id(0);  // index in local (shared) memory (unique within a workgroup)

    if (g_idx >= n) return; // boundary guard

    // Copy elements in global memory (input) to local (shared) memory
    shared[s_idx] = input[g_idx];

    // Synchronize workitems (threads)
    barrier(CLK_LOCAL_MEM_FENCE); // all workitems (threads) in the same workgroup must be done with the operations above before any workitem (thread) can continue

    // Copy elements in local (shared) memory to global memory (output)
    output[g_idx] = shared[s_idx];
}


Overwriting src/kernel.cl


In [603]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "utils.h"

#define N 5
#define WORKITEMS_PER_WORKGROUP_0 2

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    
    int *h_input, *h_output;
    int data_size = N * sizeof(int);

    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);

    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100;
    }

    // Device buffers
    cl_mem d_input, d_output;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, h_input, &err);
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, data_size, NULL, &err);    

    // Set kernel arguments
    cl_int n = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(int), &n);
    clSetKernelArg(kernel, 3, WORKITEMS_PER_WORKGROUP_0 * sizeof(int), NULL); // dynamic local (shared) memory size (remove when using static)

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // Enqueue kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);

    // Read result back
    clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, data_size, h_output, 0, NULL, NULL);

    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }
    printf("\n");

    // Cleanup
    free(h_input);
    free(h_output);
    clReleaseMemObject(d_input);
    clReleaseMemObject(d_output);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [604]:
!{build_multi_file_command}
!{execute_command}


input   output
83      83    
86      86    
77      77    
15      15    
93      93    



### Inspecting the Output

- The output is exactly the same as before (same algorithm, just using different type of memory).

---
## 2.8 Constant Memory on the Device (GPU)

- `Constant memory` is a special type of GPU memory optimized for cases where many `threads` read the same values.
  - **Location**: On-device, separate from `global memory`.
  - **Access**: Readable by all `threads` and is `read-only` from the `device`, but `writable` from `host`.
  - **Size limit**: `64 KB` (per `device`).
  - **Speed**: Very fast if all `threads` access the same address.
  - **Scope**: Globally accessible (like global variables).
  - **Lifetime**: Exists for the duration of the `kernel launch`.
- Use `constant memory` when:
  - All or most `threads` access the same data (e.g., coefficients, transformation matrices, filters).
  - The data is known before kernel launch and doesn't change during execution.
  - The data is small (<= `64 KB`).
- Let's look at a simple example using `constant memory`.
- It's the same code as before, but with the `shared memory` removed, and with the following modifications:
  - Above the kernel function (not inside it), we declare a constant memory buffer (array) using the `__constant__` qualifier.
  - In the kernel function:
    - We multiply an element in the `input` array with an elements in the `constant` array, both with the same index.
    - Then we assigning the product to the `output` array using the same index.
    - Note that we have declared the size of the `constant memory` to be the same as the number of elements `N`.
      - This is perfectly fine for this example where `N = 5`, but `constant memory` is extremely limited (small).
      - We wouldn't be able to use `N` as the `constant memory`'s size if, say, `N` was `1000000` (a million elements).

      ```c
      // constant memory
      __constant__ int constant[N];
      
      // Device kernel
      __global__ void kernel(int *input, int *output, int n)
      {   
          int idx = threadIdx.x + blockIdx.x * blockDim.x;
          
          if(idx >= n) return; // boundary guard
          
          // Multiply input elements with coefficients in constant memory and store the product in output
          output[idx] = input[idx] * constant[idx];
      }
      ```
  - In the main() function, the code is the same as before (but with `shared memory` removed), but with the following modifications:
    - We declare an `int` pointer variable on the host (CPU) to define the contents to be copied to the `constant memory`.

      ```c
      int *h_coefficients;
      ```
    - We create a variable with the same size (but in bytes) as the statically defined `constant memory`.

      ```c
      int constant_size = N * sizeof(int);
      ```
    - We allocate space in host (CPU) memory (RAM) the data with will be copying to the `constant memory`.

      ```c
      h_coefficients = (int *)malloc(constant_size);
      ```
    - We initialize the data we will be copying tp `constant memory`.
      - Notice, all the elements in `h_coefficients` are two (so the elements in the `output` array from the kernel function will be twice as large as in the `input` array).

      ```c
      for(int i = 0; i<N; i++)
      {
         h_coefficients[i] = 2;
      } 
      ```
    - Then we copy the host (CPU) memory to the device (GPU) `constant mmeory` using the CUDA function `cudaMemcpyToSymbol`.
      - Notice that we aren't using the `cudaMemcpy` function when copying host (CPU) memory to device (GPU) `constant memory`.
      - We pass in a pointer to the `constant` memory as the first argument.
      - We pass in a pointer to the host (CPU) memory `h_coefficients` as the second argument.
      - We pass in the size (in bytes) of the `constant memory` as the third argument.
    
      ```c
      cudaMemcpyToSymbol(constant, h_coefficients, constant_size); // copy host (CPU) memory to device (GPU) constant memory
      ```
    - At the very end of the `main()` function, we free the memory on the host (CPU), allocated to store the values copied to `constant memory`.

      ```c
      free(h_coefficients);
      ```
- Run the cell below to see the output.

In [605]:
%%writefile src/kernel.cl
__kernel void mykernel(__global const int *input, __global int *output, const int n, __constant const int *coefficients) // constant memory
{
    int idx = get_global_id(0); // index in global memory (globally unique)

    if (idx >= n) return; // boundary guard

    // Multiply input elements with coefficients in constant memory and store the product in output
    output[idx] = input[idx] * coefficients[idx];
}

Overwriting src/kernel.cl


In [606]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "utils.h"

#define N 5
#define WORKITEMS_PER_WORKGROUP_0 2

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    
    int *h_input, *h_output, *h_coefficients;
    int data_size = N * sizeof(int);
    int constant_size = N * sizeof(int);

    h_input = (int *)malloc(data_size);
    h_output = (int *)malloc(data_size);
    h_coefficients = (int *)malloc(constant_size);

    for(int i = 0; i<N; i++)
    {
        h_input[i] = rand() % 100;
    }

    for(int i = 0; i<N; i++)
    {
        h_coefficients[i] = 2;
    }

    // Device buffers
    cl_mem d_input, d_output;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, h_input, &err);
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, data_size, NULL, &err);    
    cl_mem d_coefficients = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, constant_size, h_coefficients, &err);

    // Set kernel arguments
    cl_int n = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(int), &n);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &d_coefficients);

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // Enqueue kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);

    // Read result back
    clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, data_size, h_output, 0, NULL, NULL);

    printf("\n%-5s   %-6s\n", "input", "output");
    for(int i = 0; i<N; i++)
    {
        printf("%-5d   %-6d\n", h_input[i], h_output[i]);
    }
    printf("\n");

    // Cleanup
    free(h_input);
    free(h_output);
    free(h_coefficients);
    clReleaseMemObject(d_input);
    clReleaseMemObject(d_output);
    clReleaseMemObject(d_coefficients);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [607]:
!{build_multi_file_command}
!{execute_command}


input   output
83      166   
86      172   
77      154   
15      30    
93      186   



### Inspecting the Output

- We see that the values in the `output` array are twice as large compared to the `input` array,

---
# 3. Sample Problems
---

## 3.1 1D Vector Addition on the Host (CPU)

<img src="images/vectoradd_cpu.png" width="500" style="float: right; margin-right: 50px;" />

Let's start with a simple problem.

Problem
  - We have three vectors (arrays) `A`, `B`, and `C`, all with `N` elements each.
  - We want to compute the elementwise sum of `A` and `B`, and store the sum in `C`.

Solution
1. Define number of elements `N=1048576`
2. Create a host function `void vectorAdd(float *A, float *B, float *C, int n)`
    - Loop through vectors `A` and `B` with `idx=0..N-1`
    - Compute `C[idx] = A[idx] + B[idx]`
3. Create a host function `main(void)`
    - Declare and allocate memory for vectors `h_A`, `h_B`, and `h_C`.
    - Initialize vectors `h_A` and `h_B` with `N` random floats each.
    - Call function `vectorAdd` with `h_A`, `h_B`, `h_C`, `N`, and measure the execution time for `vectorAdd`.
    - Verify result is correct.
    - Print execution time, verification result, and sample elements in vectors `h_A`, `h_B`, and `h_C`.
    - Free memory allocated for vectors `h_A`, `h_B`, and `h_C`.

In [608]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

// Number of elements (1048576)
#define N (1 << 20) 

// Host function (elementwise addition of vectors A and B, placing the sum in vector C)
void vectorAdd(float *A, float *B, float *C, int n)
{   
    // Loop through vectors and compute sum C = A + B
    for (int idx = 0; idx < n; idx++)
    {
        C[idx] = A[idx] + B[idx];
    }
}

// Host main routine
int main(void)
{
    // Seed pseudorandom number generator
    srand(0);

    // Compute the size of the vectors (in bytes)
    size_t size = N * sizeof(float);

    // Declare and allocate host vectors A, B, and C   
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize host input vectors A and B with random values between 0 and 1.0
    for (int i = 0; i < N; ++i)
    {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Call function vectorAdd with timing
    clock_t start = clock();

    vectorAdd(h_A, h_B, h_C, N); // function call
    
    clock_t end = clock();
    double elapsed_ms = (double)(end - start) / CLOCKS_PER_SEC * 1000.0;
    
    // Verify results in ouput vector C is correct
    float errorsum = 0.0f;
    for (int i = 0; i < N; ++i)
    {
        float error = fabs(h_A[i] + h_B[i] - h_C[i]);
        if (error > 1e-5)
        {
            //printf("Result verification failed for element with index %d!\n", i);
            errorsum += error;
        }
    }

    // Print measured function execution time, verification result, and sample elements from each vector
    printf("CPU execution time  : %f ms\n", elapsed_ms);
    printf("Verification result : %s\n", (errorsum > 1e-5) ? "FAILED" : "PASSED");
    printf("Vector samples      : A[0]=%f, B[0]=%f, C[0]=%f\n", h_A[0], h_B[0], h_C[0]);
    
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

Overwriting src/main.c


In [609]:
!{build_multi_file_command}
!{execute_command}

CPU execution time  : 2.569000 ms
Verification result : PASSED
Vector samples      : A[0]=0.840188, B[0]=0.394383, C[0]=1.234571


---
## 3.2 1D Vector Addition on the Device (GPU)

<img src="images/vectoradd_gpu1.png" width="450" style="float: right; margin-right: 50px;" />

Problem
  - We have three vectors (arrays) `A`, `B`, and `C`, all with `N` elements each.
  - We want to compute the elementwise sum of `A` and `B`, and store the sum in `C`.

We Know
- In CUDA, we have access to many `threads`, where `threads` are organized into `blocks`, and `blocks` are organized into a `grid`.
  - `threadIdx.x` represents a `thread’s index` along the `x` dimension within a `block`. 
  - `blockIdx.x` represents a `block’s index` along the `x` dimension within the `grid`.
  - `blockDim.x` represents the `number of threads` along the `x` dimension with a `block`.
- To get a `thread's global index` on the GPU:
  - `int index = blockDim.x * blockIdx.x + threadIdx.x`
- `Blocks` are assigned to a Streaming Multiprocessor (SM) that has a number of Streaming Processors (SPs).
  - Each `thread` executes its own copy of the `kernel function`, in parallel, with the same parameter values.
  - Each `thread` should process only one element in the arrays using the `index`.
  - If there are more threads than elements (`index >= N`), those threads should `return` immediately from the `kernel function`
- There can be a maximum of `1024` threads in a block.
  - If we have `N = 1048576` elements,
  - and `THREADS_PER_BLOCK = 1024`,
  - we get `BLOCKS = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK = (1048576+1024-1) / 1024 = 1024`.
  - And if we have `24` SMs, each will be assigned roughly `1024 / 24 = 42` blocks for maximum efficiency.

<img src="images/vectoradd_gpu2.png" width="450" style="float: right; margin-right: 50px;" />

Solution
1. Define number of elements `N=1048576` and `THREADS_PER_BLOCK=1024`
2. Create a kernel `__global__ void vectorAdd(float *A, float *B, float *C, int n)`
    - Compute global thread ID `idx = blockDim.x * blockIdx.x + threadIdx.x`
    - Return if index is out of bounds (`idx >= n`) which means we have more threads than elements `n`.
      - In this case we won't since `N` is evenly divisible by `THREADS_PER_BLOCK`.
    - Compute `C[idx] = A[idx] + B[idx]`.

3. Create a host function `main(void)`
    - Declare and allocate memory for host vectors `h_A`, `h_B`, and `h_C`.
    - Initialize host vectors `h_A` and `h_B` with `N` random floats each.
    - Declare and allocate memory for device vectors `d_A`, `d_B`, and `d_C`.
    - Copy contents of host vectors `h_A` and `h_B` to device vectors `d_A` and `d_B`.
    - Launch kernel `vectorAdd` with `d_A`, `d_B`, `d_C`, `N`, and measure the execution time for `vectorAdd`.
    - Copy contents of device vector `d_C` to host vector `h_C`.
    - Verify result is correct.
    - Print execution time, verification result, and sample elements in host vectors `h_A`, `h_B`, and `h_C`.
    - Free memory allocated for device vectors `d_A`, `d_B`, and `d_C`.
    - Free memory allocated for host vectors `h_A`, `h_B`, and `h_C`.

<img src="images/coalesced_memory_access.png" width="450" style="float: right; margin-right: 50px;" />

No need for shared or constant memory, and the global memory access pattern is **coalesced** in the code, (a) in the figure.

In [610]:
%%writefile src/kernel.cl
// Device kernel (elementwise addition of vectors A and B, placing the sum in vector C)
__kernel void mykernel(
    __global const float *A,
    __global const float *B,
    __global float *C,
    const int n)
{
    // Compute index (idx) from global workitem (thread) ID
    int idx = get_global_id(0);

    // Return if index is out of bounds (means we have more workitems (threads) than elements)
    if (idx >= n)
        return;

    // Compute the sum C = A + B for the element with index idx
    C[idx] = A[idx] + B[idx];
}

Overwriting src/kernel.cl


In [611]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "utils.h"

// Number of elements (1048576)
#define N (1 << 20)

// Number of threads
#define WORKITEMS_PER_WORKGROUP_0 1024

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    // Seed pseudorandom number generator
    srand(0);
    
    // Compute the size of the vectors (in bytes)
    size_t size = N * sizeof(float);

    // Declare and allocate host vectors A, B, and C   
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);

    // Initialize host input vectors A and B with random values between 0 and 1.0
    for (int i = 0; i < N; ++i)
    {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    // Allocate the device input vectors A, B, and copy data from host vectors A, B
    // Allocate the device output vector C
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, h_A, &err);
    d_B = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, h_B, &err);
    d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, &err);

    // Set kernel arguments
    cl_int n = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);
    clSetKernelArg(kernel, 3, sizeof(int), &n);

    // Kernel launch configuration
    size_t localSize = WORKITEMS_PER_WORKGROUP_0;
    size_t globalSize = ((N + WORKITEMS_PER_WORKGROUP_0 - 1) / WORKITEMS_PER_WORKGROUP_0) * WORKITEMS_PER_WORKGROUP_0;

    // Enqueue kernel with timing event
    cl_event kernel_event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &kernel_event);
    
    // Wait for kernel to finish and compute execution time
    clWaitForEvents(1, &kernel_event);
    cl_ulong time_start, time_end;
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    double elapsed_ms = (time_end - time_start) * 1e-6;  // Convert nanoseconds to milliseconds

    // Read result back from device output vector C to host output vector C
    clEnqueueReadBuffer(queue, d_C, CL_TRUE, 0, size, h_C, 0, NULL, NULL);

    // Verify the result vector is correct
    float errorsum = 0.0f;
    for (int i = 0; i < N; ++i)
    {
        float error = fabs(h_A[i] + h_B[i] - h_C[i]);
        if (error > 1e-5)
        {
            //fprintf(stderr, "Result verification failed at element %d!\n", i);
            errorsum += error;
        }
    }
  
    // Print measured kernel execution time, verification result, and sample elements from each vector
    printf("GPU execution time  : %f ms\n", elapsed_ms);
    printf("Verification result : %s\n", (errorsum > 1e-5) ? "FAILED" : "PASSED");
    printf("Vector samples      : A[0]=%f, B[0]=%f, C[0]=%f\n", h_A[0], h_B[0], h_C[0]);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    // Free device global memory and event
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_B);
    clReleaseMemObject(d_C);
    clReleaseEvent(kernel_event);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [612]:
!{build_multi_file_command}
!{execute_command}

GPU execution time  : 0.032768 ms
Verification result : PASSED
Vector samples      : A[0]=0.840188, B[0]=0.394383, C[0]=1.234571


---
## 3.3 1D Convolution on the Host (CPU)

<img src="images/1dconvolution.gif" width="600" style="float: right; margin-right: 50px;" />

Next, let's tackle the problem of a 1-dimensional (1D) convolution.

Problem
- We have an `input` vector, a `kernel` (filter), and an `output` vector.
- We want to slide the `kernel` (filter) over each element in the `input` vector.
- The `kernel` (filter) will be centered over each element in the `input` vector.
- So the `kernel`'s (filter's) width has to be odd, e.g. `1x3`, `1x5`, `1x7`.
- We multiply each element under the `kernel` (filter) in the `input` vector with `kernel`'s (filter's) elements.
- We sum the products, and assign the sum to the `output` vector with the same `index` as the current `input` vector.
- Since the `kernel` (filter) can't be centered over the boundary elements in the `input` vector, we use `zero-padding`.

Solution
1. Define number of elements `N=1048576`
2. Create a function:
   - `void convolve1D(float *input, float *output, float *filter)`
   - Loop through `input` vector.
   - Compute `output[idx] = input[idx + offset] = filter[FILTER_WIDTH/2 + offset]`
     - Only if `if(idx + offset >= 0 && idx + offset < DATA_WIDTH)`
     - Where `offset` ranges from `-FILTER_WIDTH/2` to `+FILTER_WIDTH/2`.
   - This computation is equivalent to
     - Looping through the `input` vector, zero-padded with `FILTER_WIDTH/2` elements on both sides.
     - Centering the `filter` over each original element in the zero-padded `input` vector.
     - Computing the weighted sum and storing it in the `output` vector.
4. Create a function `main(void)`
   - Define a `DATA_WIDTH`, `FILTER_WIDTH` and `FILTER_WIDTH_OFFSET` (which is `FILTER_WIDTH/2`).
   - Declare and allocate memory for vectors `input`, `ouput`, and `filter`.
   - Initialize vector `input` with `DATA_WIDTH` random floats.
   - Initialize vector `filter` with `weights` where each weight is `1.0 / FILTER_WIDTH` (averaging filter).
   - Call function `convolve1D` with `input`, `ouput`, and `filter`.
   - Measure the execution time for `convolve1D`.
   - Print execution time and sample elements in vectors `input` and `output`.
   - Free memory allocated for vectors `input`, `output`, and `filter`.

In [613]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

// Number of data elements (1048576)
#define DATA_WIDTH (1 << 20)

// Number of filter elements 
#define FILTER_WIDTH 3

// Number of elements on each side of a centered filter
#define FILTER_WIDTH_OFFSET (FILTER_WIDTH / 2)

void convolve1D(float *input, float *output, float *filter)
{
    // Loop through all elements
    for(int d_col = 0; d_col < DATA_WIDTH; d_col++)
    {
        // Apply filter (slide filter over data and compute weighted sum)
        float sum = 0.0f;
        for (int offset_col = -FILTER_WIDTH_OFFSET; offset_col <= FILTER_WIDTH_OFFSET; offset_col++)
        {
            int f_col = FILTER_WIDTH_OFFSET + offset_col; // f_col: 0..FILTER_WIDTH-1
            int i_col = d_col + offset_col;               // i_col: 0-FILTER_WIDTH_OFFSET..DATA_WIDTH-1+FILTER_WIDTH_OFFSET
            
            if(i_col >= 0 && i_col < DATA_WIDTH)
            {
                sum += input[i_col] * filter[f_col];
            }
        }
        
        // Store the weighted sum in the output array
        output[d_col] = sum;
    }
}

int main(void)
{
    // Seed the random number generator
    srand(0);            // use this for same set of random numbers each time the program is run
    //srand(time(NULL)); // use this for different set of random numbers each time the program is run

    // Declare variables
    float *h_input, *h_output, *h_filter; // host copies of input, output, filter
    int data_size = DATA_WIDTH * sizeof(float);     // size of data in bytes
    int filter_size = FILTER_WIDTH * sizeof(float); // size of filter in bytes
   
    // Allocate space for host copies of input, output, filter
    h_input = (float *)malloc(data_size);
    h_output = (float *)malloc(data_size);
    h_filter = (float *)malloc(filter_size);
      
    // Setup input values
    for (int col = 0; col < DATA_WIDTH; col++)
    {
        h_input[col] = (float)rand() / RAND_MAX; // Random floats between 0 and 1.0
    }

    // Setup filter
    for (int col = 0; col < FILTER_WIDTH; col++)
    {
        h_filter[col] = 1.0f / FILTER_WIDTH; // averaging filter
    }
   
    // Call convolve1D() with timing
    clock_t start = clock();                                              // record the start time
    convolve1D(h_input, h_output, h_filter);                              // call convolve1D()
    clock_t stop = clock();                                               // record the stop time
    double elapsed_ms = (double)(stop - start) / CLOCKS_PER_SEC * 1000.0; // calculate the elapsed time in millisecond

    // Print measured calculation execution time
    printf("Calculation (%d elements, 1x%d filter) took %.2f ms\n", DATA_WIDTH, FILTER_WIDTH, elapsed_ms);
   
    // Print out the FILTER_WIDTH number of elements in the two arrays
    printf("Vector samples:\n");
    for(int i = 0; i < FILTER_WIDTH; i++)
    {
        printf("h_input[%d]=%.2f, h_output[%d]=%.2f\n", i, h_input[i], i, h_output[i]);
    }

    // Cleanup
    free(h_input);
    free(h_output);
    free(h_filter);
    
    return 0;
 }

Overwriting src/main.c


In [614]:
!{build_multi_file_command}
!{execute_command}

Calculation (1048576 elements, 1x3 filter) took 7.21 ms
Vector samples:
h_input[0]=0.84, h_output[0]=0.41
h_input[1]=0.39, h_output[1]=0.67
h_input[2]=0.78, h_output[2]=0.66


- The output shows:
  - Given an element with index `idx` in the `output` array.
  - It's value is the average of the elements with indices `-FILTER_WIDTH_OFFSET..+FILTER_WIDTH_OFFSET+1` in the `input` array.
    - Since an averaging filter was used.
  - For example
    - If the `FILTER_WIDTH` is `3`, we have `FILTER_WIDTH_OFFSET = FILTER_WIDTH / 2 = 1`.
    - The value of an element with index `idx` in the `output` array is the average of the elements with indices `idx-1`, `idx`, and `idx+1` in the `input` array.
      - `output[idx] = (input[idx-1] +  nput[idx] + nput[idx+1]) / 3`
      - If it's a bounday element, the out-of-bounds indices have zero-padded elements with a value of `0`.

---
## 3.4 1D Convolution on the Device (GPU)

<img src="images/tiled_convolution_1d.png" width="600" style="float: right; margin-right: 50px;" />

Problem
- We have an `input` vector, a `kernel` (filter), and an `output` vector.
- We want to slide the `kernel` (filter) over each element in the `input` vector.
- The `kernel` (filter) will be centered over each element in the `input` vector.
- So the `kernel`'s (filter's) width has to be odd, e.g. `1x3`, `1x5`, `1x7`.
- We multiply each element under the `kernel` (filter) in the `input` vector with `kernel`'s (filter's) elements.
- We sum the products, and assign the sum to the `output` vector with the same `index` as the current `input` vector.
- Since the `kernel` (filter) can't be centered over the boundary elements in the `input` vector, we use `zero-padding`.

Solution
- We have a 1D `input` vector with `N` elements (vector marked with `N` in the figure).
- A `block` of `threads` will process `blockDim` number of elements (top row in figure).
- We don't want to load elements multiple times from `global memory` during the calulation.
  - So each `thread`in a `block` loads its `input` element into `shared memory` (called `Tile` in the figure).
  - The `shared_memory` size needs to be `TILE_BASE_WITH + 2 * FILTER_WIDTH_OFFSET`, where
    - `TILE_BASE_WITH` is the number of original `input` elements in a `block` (highlighted elements in figure).
    - `FILTER_WIDTH_OFFSET` is `FILTER_WIDTH / 2` (called `halo` elements in the figure).
    - `FILTER_WIDTH` is `5` (in the figure).

<img src="images/block_tile_loading_1d.png" width="400" style="float: right; margin-right: 50px;" />

  - This ensures the `filter`, when centered on an element, covers all neighbouring elements, e.g.
    - In `Block 0` the threads use `Tile 0`, where the original elements are `0`, `1`, `2`, `3` (see figure).
    - The `filter` is centered on `0` covering `FILTER_WIDTH_OFFSET` neighbouring elements on each side.
    - For border elements we use zero-padding (called `ghost` elements in the figure for the left-most elements).
    - So the elements included in the first convolution are `ghost`, `ghost`, `0`, `1`, `2` (where `ghost = 0`).
      - When processing element `3`, the `filter` covers elements `1`, `2`, `3`, `4`, `5`.

- For these extra `2 * FILTER_WIDTH_OFFSET` elements to be available in a `block`:
    - The `shared memory`, called `Tile`, needs a size of `TILE_BASE_WITH + 2 * FILTER_WIDTH_OFFSET` (see above).
      - This is the actual size need for a `block` of `threads`, i.e. `blockDim` which includes:
      - Threads for loading the original elements that the `filter` will center on.
      - Threads for the extra `2 * FILTER_WIDTH_OFFSET` border elements.
      - This is illustrated in the bottom figure.
- We also want to load the `filter` elements into `constant` mmeory to avoid hitting global mmeory when accessing them.

- So this is what we'll do:
1. Define:
  - `DATA_WIDTH=1048576` (number of elements in data vectors `input` and `output`)
  - `FILTER_WIDTH=3` (number of elements in the `filter` vector)
  - `FILTER_WIDTH_OFFSET=FILTER_WIDTH/2` (number of elements on each size of a centered `filter`)
  - `TILE_WIDTH_BASE=16` (number original elements in a `block` of `threads`)
    - Where the final tile size is `TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET` to cover border elements.
    - This is also the size we will use for the `shared memory` and `block` size, i.e. `blockDim.x`.
    - So we have these many `threads` in each `block` and we now a `block` is assigned to an `SM`.
2. Define `constant` memory of size `FILTER_WIDTH` for the `filter`.
3. Create a kernel function `void convolve1D(float *input, float *output)`:
   - Define `shared` memory of size `TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET`.
   - Let the `threads` in a `block` load their `input` elements into `shared` memory.
     - For border elements, we load the value `0` into `shared` memory (zero-padding).
   - Synchronize `threads`to ensure each `thread`in a `block` has loaded its element into `shared` memory.
   - Compute the convolution as in the CPU solution, but now using `shared` memory (input) and `constant` memory (filter).
   - Store the result in the `ouput` vector.
5. Create a function `main(void)`
   - Declare and allocate memory for vectors `input`, `ouput`, and `filter` on the host (CPU).
   - Declare and allocate memory for vectors `input` and `ouput` on the device (GPU).
   - Initialize vector `input` with `DATA_WIDTH` random floats on the host (CPU).
   - Initialize vector `filter` with `weights` on the host (CPU).
     - Each weight is `1.0 / FILTER_WIDTH` (averaging filter).
   - Copy `input` vector in host (CPU) memory to device (GPU) global memory.
   - Copy `filter` vector in host (CPU) memory to `constant` device (GPU) memory.
   - Launch kernel `convolve1D` with device (GPU) `input` and `ouput` vectors as arguments.
     - Use `gridDim`, `blockDim`, and `shared_memory_size` as launch parameters, where
       - `blockDim = TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET`
       - `gridDim = (DATA_WIDTH + block_width - 1) / block_width`
       - `shared_mmeory_size = TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET` (`* sizeof(float)`)
   - Measure the execution time for `convolve1D`.
   - Copy `output` vector in device (GPU) memory to host (CPU) memory.
   - Print execution time and sample elements in vectors `input` and `output`.
   - Free memory allocated for vectors `input`, `output`, and `filter` on the host (CPU).
   - Free memory allocated for vectors `input` and `output` on the device (GPU).

In [615]:
%%writefile src/kernel.cl
__kernel void mykernel(
    __global const float *input,
    __global float *output,
    __constant const float *filter,
    __local float *shared,
    const int data_width,
    const int filter_width_offset)
{
    int s_col = get_local_id(0);             // Workitem's (thread's) index in shared memory
    int d_col = get_global_id(0);            // Workitem's (thread's) index in global memory
    int i_col = d_col - filter_width_offset; // Workitem's (thread's) offset index in global memory

    // Guard against workitems (threads) with IDs that would index outside the arrays
    if (d_col >= data_width) return;

    // Fill local (shared) memory with elements in global memory
    if (i_col >= 0 && i_col < data_width)
    {
        shared[s_col] = input[i_col];
    }
    else
    {
        shared[s_col] = 0.0f; // zero-padding
    }

    // Make sure each workitem (thread) in the workgroup has entered its element
    // into local (shared) memory before any workitem (thread) continues
    barrier(CLK_LOCAL_MEM_FENCE);

    // Apply filter
    float sum = 0.0f;
    for (int offset_col = -filter_width_offset; offset_col <= filter_width_offset; offset_col++)
    {
        int f_col = filter_width_offset + offset_col;
        int i_col = s_col + f_col;
        
        if(i_col >= 0 && i_col < get_local_size(0))
        {
            sum += shared[i_col] * filter[f_col]; // data elements in local (shared) memory + filter weights in constant memory = super fast computation
        }
    }

    // Store the weighted sum in the output array
    output[d_col] = sum;
}

Overwriting src/kernel.cl


In [616]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "utils.h"

// Number of data elements (1048576)
#define DATA_WIDTH (1 << 20)

// Number of filter elements
#define FILTER_WIDTH 3

// Number of elements on each side of a centered filter
#define FILTER_WIDTH_OFFSET (FILTER_WIDTH / 2)

// Number of elements in local (shared) memory
#define TILE_WIDTH_BASE 16

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    //srand(time(NULL));
    
    // Declare variables
    float *h_input, *h_output, *h_filter; // host copies of input, output, filter
    int data_size = DATA_WIDTH * sizeof(float);
    int filter_size = FILTER_WIDTH * sizeof(float);

    // Allocate space for host (CPU) copies of input, output, filter
    h_input = (float *)malloc(data_size);
    h_output = (float *)malloc(data_size);
    h_filter = (float *)malloc(filter_size);

    // Setup input values
    for (int col = 0; col < DATA_WIDTH; col++)
    {
        h_input[col] = (float)rand() / RAND_MAX; // Random floats between 0 and 1.0
    }

    // Setup filter
    for (int col = 0; col < FILTER_WIDTH; col++)
    {
        h_filter[col] = 1.0f / FILTER_WIDTH; // averaging filter
    }

    // Device buffers
    cl_mem d_input, d_output, d_filter;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, h_input, &err);
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, data_size, NULL, &err);    
    d_filter = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, filter_size, h_filter, &err);

    // Set kernel arguments
    cl_int data_width = DATA_WIDTH;
    cl_int filter_width_offset = FILTER_WIDTH_OFFSET;
    int workgroup_width = TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET;
    int shared_size = workgroup_width * sizeof(float);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_filter);
    clSetKernelArg(kernel, 3, shared_size, NULL);         // dynamic local (shared) memory size
    clSetKernelArg(kernel, 4, sizeof(int), &data_width);
    clSetKernelArg(kernel, 5, sizeof(int), &filter_width_offset);  

    // Kernel launch configuration
    size_t localSize = workgroup_width;
    size_t globalSize = ((DATA_WIDTH + workgroup_width - 1) / workgroup_width) * workgroup_width;

    // Enqueue kernel with timing event
    cl_event kernel_event;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, &kernel_event);

    // Wait for kernel to finish and compute execution time
    clWaitForEvents(1, &kernel_event);
    cl_ulong time_start, time_end;
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    double elapsed_ms = (time_end - time_start) * 1e-6;  // Convert nanoseconds to milliseconds

    // Copy result back to host
    clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, data_size, h_output, 0, NULL, NULL);

    // Print measured calculation execution time
    printf("Calculation (%d elements, 1x%d filter) took %.2f ms\n", DATA_WIDTH, FILTER_WIDTH, elapsed_ms);
   
    // Print out the FILTER_WIDTH number of elements in the two arrays
    printf("Vector samples:\n");
    for(int i = 0; i < FILTER_WIDTH; i++)
    {
        printf("h_input[%d]=%.2f, h_output[%d]=%.2f\n", i, h_input[i], i, h_output[i]);
    }

    // Cleanup
    free(h_input);
    free(h_output);
    free(h_filter);
    clReleaseMemObject(d_input);
    clReleaseMemObject(d_output);
    clReleaseMemObject(d_filter);
    clReleaseEvent(kernel_event);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [617]:
!{build_multi_file_command}
!{execute_command}

Calculation (1048576 elements, 1x3 filter) took 0.14 ms
Vector samples:
h_input[0]=0.84, h_output[0]=0.41
h_input[1]=0.39, h_output[1]=0.67
h_input[2]=0.78, h_output[2]=0.66


In the output we see:
- The results are the same for the GPU solution as for the CPU solution.
- The execution time for the GPU solution is significantly fast than the CPU solution.

---
## 3.5 2D Convolution on the Host (CPU)

<img src="images/2dconvolution.gif" width="600" style="float: right; margin-right: 50px;" />

**Note**
  - This is really just the same problem as a 1D convolution, but with an added second dimension.
  - Therefore the problem and solution will be the same, but with the second dimension accounted for.

Problem
- We have an `input` matrix, a `kernel` (filter), and an `output` matrix.
- We want to slide the `kernel` (filter) over each element in the `input` matrix.
- The `kernel` (filter) will be centered over each element in the `input` matrix.
- So the `kernel`'s (filter's) width and height has to be odd, e.g. `3x3`, `5x5`, `7x7`.
- We multiply each element under the `kernel` (filter) in the `input` matrix with the `kernel`'s (filter's) elements.
- We sum the products, and assign the sum to the `output` matrix with the same `index` as the current `input` matrix.
- Since the `kernel` (filter) can't be centered over the boundary elements in the `input` matrix, we use `zero-padding`.

Solution
1. Define:
   - `DATA_WIDTH=32` (number of elements in the `col` dimension for the `input` and `output`)
   - `DATA_HEIGHT=32` (number of elements in the `row` dimension for the `input` and `output`)
   - `FILTER_WIDTH=3` (number of elements int the `col` dimension for the `filter`)
   - `FILTER_HEIGHT=3` (number of elements int the `row` dimension for the `filter`)
   - `FILTER_WIDTH_OFFSET=FILTER_WIDTH/2` (number of elements to the left and right of the centered `filter`)
   - `FILTER_HEIGHT_OFFSET=FILTER_HEIGHT/2` (number of elements above and below the centered `filter`)
3. Create a function:
   - `void convolve2D(float *input, float *output, float *filter)`
   - Loop through `input` matrix.
   - Compute convolution. Store result in `output` matrix.
   - The only difference in the "D convolution compared to the 1D convolution is the additional dimension.
4. Create a function `main(void)`
   - Declare and allocate memory for matrices `input`, `ouput`, and `filter`.
   - Initialize matrix `input` with `DATA_HEIGHT * DATA_WIDTH` random floats.
   - Initialize matrix `filter` with `weights` where each weight is `1.0 / FILTER_HEIGHT * FILTER_WIDTH` (averaging filter).
   - Call function `convolve2D` with `input`, `ouput`, and `filter`.
   - Measure the execution time for `convolve2D`.
   - Print execution time and sample elements in matrices `input` and `output`.
   - Free memory allocated for matrices `input`, `output`, and `filter`.

In [618]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define DATA_WIDTH 32
#define DATA_HEIGHT 32
#define FILTER_WIDTH 3
#define FILTER_HEIGHT 3
#define FILTER_WIDTH_OFFSET (FILTER_WIDTH/2)
#define FILTER_HEIGHT_OFFSET (FILTER_HEIGHT/2)

void convolve2D(float *input, float *output, float *filter)
{
    for(int d_row = 0; d_row < DATA_HEIGHT; d_row++)
    {
        for(int d_col = 0; d_col < DATA_WIDTH; d_col++)
        {
            float sum = 0.0f;
            for (int offset_row = -FILTER_HEIGHT_OFFSET; offset_row <= FILTER_HEIGHT_OFFSET; offset_row++)
            {
                for (int offset_col = -FILTER_WIDTH_OFFSET; offset_col <= FILTER_WIDTH_OFFSET; offset_col++)
                {
                    int f_row = FILTER_HEIGHT_OFFSET + offset_row;
                    int f_col = FILTER_WIDTH_OFFSET + offset_col;
                    int i_row = d_row + offset_row;
                    int i_col = d_col + offset_col;

                    if(i_row >= 0 && i_row < DATA_HEIGHT && i_col >= 0 && i_col < DATA_WIDTH)
                    {
                        sum += input[i_row * DATA_WIDTH + i_col] * filter[f_row * FILTER_WIDTH + f_col];
                    }
                }
            }

            output[d_row * DATA_WIDTH + d_col] = sum;
        }
    }
}

int main(void)
{
    srand(0);

    float *h_input = (float *)malloc(DATA_WIDTH * DATA_HEIGHT * sizeof(float));
    float *h_output = (float *)malloc(DATA_WIDTH * DATA_HEIGHT * sizeof(float));
    float *h_filter = (float *)malloc(FILTER_WIDTH * FILTER_HEIGHT * sizeof(float));

    for(int row = 0; row < DATA_HEIGHT; row++)
    {
        for(int col = 0; col < DATA_WIDTH; col++)
        {
            h_input[row * DATA_WIDTH + col] = (float)rand() / RAND_MAX;
        }
    }

    for(int row = 0; row < FILTER_HEIGHT; row++)
    {
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            h_filter[row * FILTER_WIDTH + col] = 1.0f / (FILTER_WIDTH * FILTER_HEIGHT);
        }
    }

    // Call convolve2D() with timing
    clock_t start = clock();
    convolve2D(h_input, h_output, h_filter);
    clock_t stop = clock();
    double elapsed_ms = (double)(stop - start) / CLOCKS_PER_SEC * 1000.0;

    printf("Calculation (%d elements, %dx%d filter) took %.2f ms\n", DATA_HEIGHT * DATA_WIDTH, FILTER_HEIGHT, FILTER_WIDTH, elapsed_ms);
    printf("\nMatrix samples:\n");
    printf("h_input %-12s h_output\n", "");
    for(int row = 0; row < FILTER_HEIGHT; row++)
    {
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            printf("%.3f ", h_input[row * DATA_WIDTH + col]);
        }
        printf("%-3s","");
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            printf("%.3f ", h_output[row * DATA_WIDTH + col]);
        }
        printf("\n");
    }

    free(h_input);
    free(h_output);
    free(h_filter);

    return 0;
}

Overwriting src/main.c


In [619]:
!{build_multi_file_command}
!{execute_command}

Calculation (1024 elements, 3x3 filter) took 0.03 ms

Matrix samples:
h_input              h_output
0.840 0.394 0.783    0.238 0.396 0.382 
0.613 0.296 0.638    0.328 0.527 0.568 
0.267 0.540 0.375    0.325 0.514 0.616 


- The output shows:
  - Given an element with index `[row, col]` in the `output` matrix.
  - It's value is the average of the elements with indices:
    - `-FILTER_HEIGHT_OFFSET..+FILTER_HEIGHT_OFFSET+1` in the `input` matrix's `row`.
    - `-FILTER_WIDTH_OFFSET..+FILTER_WIDTH_OFFSET+1` in the `input` matrix's `col`.
    - Since an averaging filter was used.
  - For example
    - If the `FILTER_HEIGHT` is `3`, we have `FILTER_HEIGHT_OFFSET = FILTER_HEIGHT / 2 = 1`.
    - If the `FILTER_WIDTH` is `3`, we have `FILTER_WIDTH_OFFSET = FILTER_WIDTH / 2 = 1`.
    - The value of an element with index `[row, col]` in the `output` matrix is the average of the elements in the `input` matrix with indices:

      ```c
      [row-1, col-1]  [row-1, col]  [row-1, col+1]
      [row  , col-1]  [row  , col]  [row  , col+1]
      [row+1, col-1]  [row+1, col]  [row+1, col+1]
      ```
    - If it's a bounday element, the out-of-bounds indices have zero-padded elements with a value of `0`.

---
## 3.6 2D Convolution on the Device (GPU)

**Note**
  - This is really just the same problem as a 1D convolution, but with an added second dimension.
  - Therefore the problem and solution will be the same, but with the second dimension accounted for.

Problem
- We have an `input` matrix, a `kernel` (filter), and an `output` matrix.
- We want to slide the `kernel` (filter) over each element in the `input` matrix.
- The `kernel` (filter) will be centered over each element in the `input` matrix.
- So the `kernel`'s (filter's) height and width has to be odd, e.g. `3x3`, `5x5`, `7x7`.
- We multiply each element under the `kernel` (filter) in the `input` matrix with the `kernel`'s (filter's) elements.
- We sum the products, and assign the sum to the `output` matrix with the same `index` as the current `input` matrix.
- Since the `kernel` (filter) can't be centered over the boundary elements in the `input` matrix, we use `zero-padding`.

Solution
1. Define:
  -  `DATA_WIDTH=32` (number of elements in the `col` dimension for the `input` and `output`)
   - `DATA_HEIGHT=32` (number of elements in the `row` dimension for the `input` and `output`)
   - `FILTER_WIDTH=3` (number of elements int the `col` dimension for the `filter`)
   - `FILTER_HEIGHT=3` (number of elements int the `row` dimension for the `filter`)
   - `FILTER_WIDTH_OFFSET=FILTER_WIDTH/2` (number of elements to the left and right of the centered `filter`)
   - `FILTER_HEIGHT_OFFSET=FILTER_HEIGHT/2` (number of elements above and below the centered `filter`) 
   - `TILE_WIDTH_BASE=16` (number original elements in a `block` of `threads` in the `col` dimension)
   - `TILE_HEIGHT_BASE=16` (number original elements in a `block` of `threads` in the `row` dimension)
     - Where the final tile size in the `col` dimension is `TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET` to cover left and right border elements.
       - This is also the size we will use for the `col` dimension in `shared memory` and `block` size, i.e. `blockDim.x`.
     - Where the final tile size in the `row` dimension is `TILE_HEIGHT_BASE + 2 * FILTER_HEIGHT_OFFSET` to cover top and bottom border elements.
       - This is also the size we will use for the `row` dimension in `shared memory` and `block` size, i.e. `blockDim.y`.
     - So we have these many 2D `threads` in each `block` and we know a `block` is assigned to an `SM`.
3. Define `constant` memory of size `FILTER_HEIGHT * FILTER_WIDTH` for the `filter`.
4. Create a kernel function `void convolve2D(float *input, float *output)`:
   - Define `shared` memory of size `(TILE_HEIGHT_BASE + 2 * FILTER_HEIGHT_OFFSET) * (TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET)`.
   - Let the `threads` in a `block` load their `input` elements into `shared` memory.
     - For border elements, we load the value `0` into `shared` memory (zero-padding).
   - Synchronize `threads`to ensure each `thread`in a `block` has loaded its element into `shared` memory.
   - Compute the convolution as in the CPU solution, but now using `shared` memory (input) and `constant` memory (filter).
   - Store the result in the `ouput` matrix.
5. Create a function `main(void)`
   - Declare and allocate memory for matrices `input`, `ouput`, and `filter` on the host (CPU).
   - Declare and allocate memory for matrices `input` and `ouput` on the device (GPU).
   - Initialize matrix `input` with `DATA_HEIGHT * DATA_WIDTH` random floats on the host (CPU).
   - Initialize matrix `filter` with `weights` on the host (CPU).
     - Each weight is `1.0 / (FILTER_HEIGHT * FILTER_WIDTH)` (averaging filter).
   - Copy `input` matrix in host (CPU) memory to device (GPU) global memory.
   - Copy `filter` matrix in host (CPU) memory to `constant` device (GPU) memory.
   - Launch kernel `convolve2D` with device (GPU) `input` and `ouput` matrices as arguments.
     - Use `gridDim`, `blockDim`, and `shared_memory_size` as launch parameters, where
       - `blockDim.x = TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET`
       - `blockDim.y = TILE_HEIGHT_BASE + 2 * FILTER_HEIGHT_OFFSET`
       - `gridDim.x = (DATA_WIDTH + block_width - 1) / block_width`
       - `gridDim.y = (DATA_HEIGHT + block_height - 1) / block_height`
       - `shared_mmeory_size = (TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET) * (TILE_HEIGHT_BASE + 2 * FILTER_HEIGHT_OFFSET)` (`* sizeof(float)`)
   - Measure the execution time for `convolve2D`.
   - Copy `output` matrix in device (GPU) memory to host (CPU) memory.
   - Print execution time and sample elements in matrices `input` and `output`.
   - Free memory allocated for matrices `input`, `output`, and `filter` on the host (CPU).
   - Free memory allocated for matrices `input` and `output` on the device (GPU).

In [620]:
%%writefile src/kernel.cl
__kernel void mykernel(
    __global const float *input,
    __global float *output,
    __constant const float *filter,
    __local float *shared,
    const int data_height,
    const int data_width,
    const int filter_height_offset,
    const int filter_width_offset,
    const int filter_width)
{
    int s_row = get_local_id(1);
    int s_col = get_local_id(0);
    
    int d_row = get_global_id(1);
    int d_col = get_global_id(0);
    
    int i_row = d_row - filter_height_offset;
    int i_col = d_col - filter_width_offset;

    if (d_col >= data_width || d_row >= data_height) return;

    if (i_row >= 0 && i_row < data_height && i_col >= 0 && i_col < data_width)
    {
        shared[s_row * get_local_size(0) + s_col] = input[i_row * data_width + i_col];
    }
    else
    {
        shared[s_row * get_local_size(0) + s_col] = 0.0f; // zero-padding
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    float sum = 0.0f;
    for (int offset_row = -filter_height_offset; offset_row <= filter_height_offset; offset_row++)
    {
        for (int offset_col = -filter_width_offset; offset_col <= filter_width_offset; offset_col++)
        {
            int f_row = filter_height_offset + offset_row;
            int f_col = filter_width_offset + offset_col;
            int i_row = s_row + f_row;
            int i_col = s_col + f_col;
            
            if(i_row >= 0 && i_row < get_local_size(1) && i_col >= 0 && i_col < get_local_size(0))
            {
                sum += shared[i_row * get_local_size(0) + i_col] * filter[f_row * filter_width + f_col];
            }
        }
    }

    output[d_row * data_width + d_col] = sum;
}

Overwriting src/kernel.cl


In [621]:
%%writefile src/main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include "utils.h"

// Number of data elements in each dimension
#define DATA_WIDTH 32
#define DATA_HEIGHT 32

// Number of filter elements in each dimension
#define FILTER_WIDTH 3
#define FILTER_HEIGHT 3

// Number of elements on each side of a centered filter in each dimension
#define FILTER_WIDTH_OFFSET (FILTER_WIDTH / 2)
#define FILTER_HEIGHT_OFFSET (FILTER_HEIGHT / 2)

// Number of elements in local (shared) memory in each dimension
#define TILE_WIDTH_BASE 16
#define TILE_HEIGHT_BASE 16

int main(void)
{
    // Setup OpenCL
    cl_int err; cl_context context; cl_command_queue queue; cl_program program; cl_kernel kernel;
    setupOpenCL(&context, &queue, &program, &kernel);

    srand(0);
    //srand(time(NULL));
    
    // Declare variables
    float *h_input, *h_output, *h_filter; // host copies of input, output, filter
    int data_size = DATA_WIDTH * DATA_HEIGHT * sizeof(float);
    int filter_size = FILTER_WIDTH * FILTER_HEIGHT * sizeof(float);

    // Allocate space for host (CPU) copies of input, output, filter
    h_input = (float *)malloc(data_size);
    h_output = (float *)malloc(data_size);
    h_filter = (float *)malloc(filter_size);

    // Setup input values
    for(int row = 0; row < DATA_HEIGHT; row++)
    {
        for(int col = 0; col < DATA_WIDTH; col++)
        {
            h_input[row * DATA_WIDTH + col] = (float)rand() / RAND_MAX;
        }
    }

    // Setup filter
    for(int row = 0; row < FILTER_HEIGHT; row++)
    {
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            h_filter[row * FILTER_WIDTH + col] = 1.0f / (FILTER_WIDTH * FILTER_HEIGHT);
        }
    }

    // Device buffers
    cl_mem d_input, d_output, d_filter;
    d_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, data_size, h_input, &err);
    d_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, data_size, NULL, &err);    
    d_filter = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, filter_size, h_filter, &err);

    // Set kernel arguments
    cl_int data_height = DATA_HEIGHT;
    cl_int data_width = DATA_WIDTH;
    cl_int filter_height_offset = FILTER_HEIGHT_OFFSET;
    cl_int filter_width_offset = FILTER_WIDTH_OFFSET;
    cl_int filter_width = FILTER_WIDTH;
    int workgroup_height = TILE_HEIGHT_BASE + 2 * FILTER_HEIGHT_OFFSET;
    int workgroup_width = TILE_WIDTH_BASE + 2 * FILTER_WIDTH_OFFSET;
    int shared_size = workgroup_height * workgroup_width * sizeof(float);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_output);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_filter);
    clSetKernelArg(kernel, 3, shared_size, NULL);
    clSetKernelArg(kernel, 4, sizeof(int), &data_height);
    clSetKernelArg(kernel, 5, sizeof(int), &data_width);
    clSetKernelArg(kernel, 6, sizeof(int), &filter_height_offset);
    clSetKernelArg(kernel, 7, sizeof(int), &filter_width_offset);
    clSetKernelArg(kernel, 8, sizeof(int), &filter_width);

    // Kernel launch configuration
    size_t localSize[2] = { workgroup_width, workgroup_height };
    size_t globalSize[2] = {
        ((data_width + workgroup_width - 1) / workgroup_width) * workgroup_width,
        ((data_height + workgroup_height - 1) / workgroup_height) * workgroup_height
    };

    // Enqueue kernel with timing event
    cl_event kernel_event;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize, localSize, 0, NULL, &kernel_event);

    // Wait for kernel to finish and compute execution time
    clWaitForEvents(1, &kernel_event);
    cl_ulong time_start, time_end;
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    double elapsed_ms = (time_end - time_start) * 1e-6;  // Convert nanoseconds to milliseconds

    // Copy result back to host
    clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, data_size, h_output, 0, NULL, NULL);

    // Print measured calculation execution time
    printf("Calculation (%d elements, %dx%d filter) took %.2f ms\n", DATA_HEIGHT * DATA_WIDTH, FILTER_HEIGHT, FILTER_WIDTH, elapsed_ms);
   
    // Print out the FILTER_WIDTH number of elements in the two arrays
    printf("\nMatrix samples:\n");
    printf("h_input %-12s h_output\n", "");
    for(int row = 0; row < FILTER_HEIGHT; row++)
    {
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            printf("%.3f ", h_input[row * DATA_WIDTH + col]);
        }
        printf("%-3s","");
        for(int col = 0; col < FILTER_WIDTH; col++)
        {
            printf("%.3f ", h_output[row * DATA_WIDTH + col]);
        }
        printf("\n");
    }

    // Cleanup
    free(h_input);
    free(h_output);
    free(h_filter);
    clReleaseMemObject(d_input);
    clReleaseMemObject(d_output);
    clReleaseMemObject(d_filter);
    clReleaseEvent(kernel_event);
    
    // Teardown OpenCL
    teardownOpenCL(&context, &queue, &program, &kernel);

    return 0;
}

Overwriting src/main.c


In [622]:
!{build_multi_file_command}
!{execute_command}

Calculation (1024 elements, 3x3 filter) took 0.01 ms

Matrix samples:
h_input              h_output
0.840 0.394 0.783    0.238 0.396 0.382 
0.613 0.296 0.638    0.328 0.527 0.568 
0.267 0.540 0.375    0.325 0.514 0.616 


In the output we see:
- The results are the same for the GPU solution as for the CPU solution.
- The execution time for the GPU solution is significantly fast than the CPU solution.

---
# 4. Cleanup
---

- Let's remove all files that have been created by this notebook.

In [623]:
import os, shutil

dirs = ["src", "include", "bin", ".vscode"]
files = ["kernel.cl", "main.c", "main.exe"]

for d in dirs:
    if os.path.exists(d):
        shutil.rmtree(d)

for f in files:
    if os.path.exists(f):
        os.remove(f)