# nvcc4Jupyter

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/paga-hb/C1PD2C_2025/blob/main/notebooks/nvcc4jupyter.ipynb)

This notebook shows how to use the nvcc4jupyter extension to compile and run C, C++, and CUDA code from Jupyter cells.

- GitHub: https://github.com/andreinechaev/nvcc4jupyter
- Docs: https://nvcc4jupyter.readthedocs.io/en/latest

Requirements for nvcc4jupyter to work:

- NVIDIA GPU + Drivers (runtime)
  - You need a supported NVIDIA GPU and the appropriate driver version installed to actually run the CUDA code.
- CUDA Toolkit
  - Includes nvcc (the NVIDIA CUDA Compiler).
  - Provides headers and libraries needed for compiling GPU code.
  - Must be compatible with your system's GPU drivers.
- C/C++ Compiler
  - On Linux: usually g++ or clang++.
  - On Windows: MSVC (Microsoft Visual C++) or WSL with Linux tools.
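  - On macOS: CUDA is no longer supported (CUDA Toolkit 10.2 was the last release with macOS support), so a Linux or Windows machine, or a hosted environment such as Google Colab, is the practical choice.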
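
The requirements above can be checked from a notebook cell before installing anything. The following is a minimal sketch, not part of nvcc4jupyter: the helper `find_tools` and the `REQUIRED_TOOLS` table are hypothetical names, and the check only looks for executables on `PATH`. Finding `nvidia-smi` does not prove that a working GPU is attached.

```python
import shutil

# Hypothetical table (an assumption, not part of nvcc4jupyter): each
# requirement mapped to executables that usually indicate it is installed.
REQUIRED_TOOLS = {
    "NVIDIA driver": ["nvidia-smi"],
    "CUDA Toolkit": ["nvcc"],
    "C/C++ compiler": ["g++", "clang++", "cl"],
}


def find_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return {requirement: path of the first candidate found, or None}."""
    found = {}
    for name, candidates in required.items():
        paths = (which(tool) for tool in candidates)
        found[name] = next((path for path in paths if path), None)
    return found


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Any requirement reported as `NOT FOUND on PATH` is worth fixing before loading the extension.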

What nvcc4jupyter does:

- It lets you write CUDA C++ code in a Jupyter cell using a magic like `%%cuda`.
- It compiles the code using nvcc on your local machine.
- It then runs the compiled binary and shows the output in the notebook.
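
The three steps above can be sketched in plain Python. This is an illustration of the write, compile, run cycle under stated assumptions, not the extension's actual implementation: the function names `build_commands` and `compile_and_run`, the file name `cell.cu`, and the `.out` suffix are all hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(source_path, compiler="nvcc"):
    """Return (compile_cmd, run_cmd) for a single source file."""
    source_path = Path(source_path)
    binary_path = source_path.with_suffix(".out")  # hypothetical output name
    compile_cmd = [compiler, str(source_path), "-o", str(binary_path)]
    run_cmd = [str(binary_path)]
    return compile_cmd, run_cmd


def compile_and_run(code, compiler="nvcc"):
    """Write `code` to a temporary file, compile it, run it, return stdout.

    Raises subprocess.CalledProcessError if compilation or execution fails.
    """
    with tempfile.TemporaryDirectory() as workdir:
        source_path = Path(workdir) / "cell.cu"  # hypothetical file name
        source_path.write_text(code)
        compile_cmd, run_cmd = build_commands(source_path, compiler)
        subprocess.run(compile_cmd, check=True, capture_output=True, text=True)
        result = subprocess.run(run_cmd, check=True, capture_output=True, text=True)
        return result.stdout
```

Calling `compile_and_run(source_text)` on a machine with `nvcc` on `PATH` would return whatever the compiled program prints, which is roughly what the `%%cuda` magic displays under a cell.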

That means all compilation and execution happen locally, so you need the full CUDA development environment.

Common issues if something is missing:

- `nvcc: command not found` → CUDA Toolkit isn't installed or `nvcc` isn't in PATH.
- `iostream: No such file or directory` → C++ compiler (or its standard library headers) is missing.
- `cuda_runtime.h: No such file or directory` → CUDA Toolkit headers not found.
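
The error messages listed above can be turned into a small lookup that scans compiler output for known patterns. This is a hedged sketch: `KNOWN_ERRORS` and `diagnose` are hypothetical names, and the table only covers the three messages mentioned in this notebook.

```python
# Hypothetical lookup table built from the three messages listed above.
KNOWN_ERRORS = [
    ("nvcc: command not found",
     "CUDA Toolkit isn't installed or nvcc isn't in PATH."),
    ("iostream: No such file or directory",
     "C++ compiler (or its standard library headers) is missing."),
    ("cuda_runtime.h: No such file or directory",
     "CUDA Toolkit headers not found."),
]


def diagnose(stderr_text):
    """Return the hints whose pattern occurs in the given error output."""
    return [hint for pattern, hint in KNOWN_ERRORS if pattern in stderr_text]


print(diagnose("/bin/bash: line 1: nvcc: command not found"))
```

An empty list means the output matched none of the known patterns, so the full error text still needs to be read.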

---
## Install and Load `nvcc4jupyter`

In [22]:
%pip install nvcc4jupyter

Note: you may need to restart the kernel to use updated packages.


In [24]:
%load_ext nvcc4jupyter
%reload_ext nvcc4jupyter

The nvcc4jupyter extension is already loaded. To reload it, use:
  %reload_ext nvcc4jupyter
Source files will be saved in "/tmp/tmp4my8x1vf".


---
## C

In [48]:
%%cuda
#include <stdio.h>

int main()
{
    printf("Hello World!");
    return 0;
}

Hello World!


---
## C++

In [26]:
%%cuda
#include <iostream>

using namespace std;

int main()
{
    cout << "Hello World!" << endl;
    return 0;
}

Hello World!



---
## CUDA C

In [43]:
%%cuda
#include <stdio.h>

__global__ void hello()
{
    // CUDA supports the printf() function on the device
    printf("Hello from device -> block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main()
{
    printf("Hello from host\n");
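    hello<<<2, 2>>>();        // launch 2 blocks of 2 threads each
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed before exit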
    return 0;
}

Hello from host
Hello from device -> block: 1, thread: 0
Hello from device -> block: 1, thread: 1
Hello from device -> block: 0, thread: 0
Hello from device -> block: 0, thread: 1



---
## CUDA C++

In [50]:
%%cuda
#include <iostream>

using namespace std;

__global__ void hello()
{
    // Note! CUDA does NOT support cout on the device (so use printf() for debugging)
    printf("Hello from device block: %u, thread: %u\n", blockIdx.x, threadIdx.x);
}

int main()
{
    // It's fine to use cout on the host
    cout << "Hello from host" << endl;
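    hello<<<2, 2>>>();        // launch 2 blocks of 2 threads each
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed before exit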
    return 0;
}

Hello from host
Hello from device block: 1, thread: 0
Hello from device block: 1, thread: 1
Hello from device block: 0, thread: 0
Hello from device block: 0, thread: 1



In [18]:
%%cuda
#include <cmath>    // ceil
#include <cstdio>
#include <cstdlib>  // rand
#include <iostream>

using namespace std;

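// Each block scans one 256-element chunk of `a` serially and writes the
// chunk's maximum to b[blockIdx.x].
__global__ void maxi(int* a, int* b, int n)
{
    int block = 256 * blockIdx.x;  // index of the first element in this block's chunk
    int max = 0;                   // safe starting value only because the inputs are non-negative

    for (int i = block; i < min(256 + block, n); i++) {
        if (max < a[i]) {
            max = a[i];
        }
    }
    b[blockIdx.x] = max;
}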

int main()
{
    int n = 3 << 2;  // 12 elements
    int a[3 << 2];   // fixed-size array; `int a[n]` would be a non-standard variable-length array

    cout << "Elements: ";
    for (int i = 0; i < n; i++) {
        a[i] = rand() % n;
        cout << a[i] << "\t";
    }

    cudaEvent_t start, end;
    int *ad, *bd;
    int size = n * sizeof(int);
    cudaMalloc(&ad, size);
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    int grids = ceil(n * 1.0f / 256.0f);
    cudaMalloc(&bd, grids * sizeof(int));

    dim3 block(1, 1);  // one thread per block; each thread scans its chunk serially

    cudaEventCreate(&start);
    cudaEventCreate(&end);
    cudaEventRecord(start);

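    // Reduce repeatedly until a single value (the overall maximum) remains.
    while (n > 1) {
        grids = ceil(n * 1.0f / 256.0f);  // one block per 256-element chunk of the current input
        maxi<<<grids, block>>>(ad, bd, n);
        n = grids;                        // the next pass reduces the per-block maxima
        cudaMemcpy(ad, bd, n * sizeof(int), cudaMemcpyDeviceToDevice);
    }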

    cudaEventRecord(end);
    cudaEventSynchronize(end);

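    float time = 0;
    cudaEventElapsedTime(&time, start, end);  // elapsed time is reported in milliseconds

    int ans = 0;
    cudaMemcpy(&ans, ad, sizeof(int), cudaMemcpyDeviceToHost);

    cout << "\nThe largest element is: " << ans << endl;

    cout << "The time required: " << time << " ms" << endl;

    // Release device memory and timing events
    cudaFree(ad);
    cudaFree(bd);
    cudaEventDestroy(start);
    cudaEventDestroy(end);
    return 0;
}

Elements: 7	10	9	7	5	7	10	0	9	1	2	7	
The largest element is: 10
The time required: 0.098304 ms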
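
The reduction strategy used by the `maxi` kernel can be mirrored on the CPU to sanity-check results. The sketch below is a plain Python model of the same idea, assuming a non-empty input: split the data into 256-element chunks, keep each chunk's maximum, and repeat on the shortened list until one value remains. The function name `blockwise_max` and its `chunk` parameter are hypothetical and are not part of the notebook's CUDA code.

```python
import math


def blockwise_max(values, chunk=256):
    """Return (maximum, number_of_passes) using chunk-wise reduction.

    Assumes `values` is non-empty. Each pass replaces the list with the
    maxima of its chunks, like one kernel launch in the CUDA version.
    """
    current = list(values)
    passes = 0
    while len(current) > 1:
        num_chunks = math.ceil(len(current) / chunk)
        current = [
            max(current[i * chunk:(i + 1) * chunk]) for i in range(num_chunks)
        ]
        passes += 1
    return current[0], passes


elements = [7, 10, 9, 7, 5, 7, 10, 0, 9, 1, 2, 7]
print(blockwise_max(elements))  # (10, 1): 12 elements fit in one chunk
```

With the 12 elements printed above, a single pass is enough. An input of 1000 elements would need two passes: 1000 values become 4 chunk maxima, which then reduce to 1.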

