<a href="https://colab.research.google.com/github/imrickysu/cuda_on_colab/blob/master/00_CUDA_env.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enable GPU Mode

In the Colab menu, choose Runtime -> Change runtime type and select GPU as the hardware accelerator.

Then check that the GPU is available with the following commands.

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'
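
The checks above can also be wrapped in a cell that returns a boolean instead of printing driver text. This is a minimal sketch, not part of the original notebook: the helper name `gpu_visible` is hypothetical, and it assumes that `nvidia-smi` exits with a non-zero status when it cannot reach the driver.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if nvidia-smi exists and exits successfully (hypothetical helper)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not on PATH, so no NVIDIA driver stack is visible
    # Assumption: a non-zero exit status means the driver could not be reached.
    result = subprocess.run([exe], capture_output=True, text=True)
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

If this prints `False` after switching the runtime type, reconnecting the runtime and re-running the cell is a reasonable first step.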

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
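
When later cells need to branch on the CUDA version, the release number can be read out of the `nvcc --version` text. The sketch below is an illustration only; `parse_nvcc_release` is a hypothetical helper, and it assumes the output contains a `release X.Y` fragment like the one shown above.

```python
import re

def parse_nvcc_release(version_text):
    """Return (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Line copied from the nvcc output above.
sample = "Cuda compilation tools, release 11.1, V11.1.105"
print(parse_nvcc_release(sample))  # (11, 1)
```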


# Install NVCC for Jupyter

The nvcc4jupyter plugin lets you run CUDA C code in a Jupyter notebook.

The version installed below is my fork of the plugin, updated so that it works with the Tesla K80 under CUDA 11.

In [None]:
!pip install git+git://github.com/imrickysu/nvcc4jupyter.git

Collecting git+git://github.com/imrickysu/nvcc4jupyter.git
  Cloning git://github.com/imrickysu/nvcc4jupyter.git to /tmp/pip-req-build-q189h4ws
  Running command git clone -q git://github.com/imrickysu/nvcc4jupyter.git /tmp/pip-req-build-q189h4ws
Building wheels for collected packages: NVCCPlugin
  Building wheel for NVCCPlugin (setup.py) ... done
  Created wheel for NVCCPlugin: filename=NVCCPlugin-0.0.2-py3-none-any.whl size=4286 sha256=fa9580ac7cc09ea35c6dbae330e023114b096c6335698951f304fa3d9d0523df
  Stored in directory: /tmp/pip-ephem-wheel-cache-k61sfjz7/wheels/77/ca/04/be7399dd9623f64729e513d1c6082358e0c24a309100064304
Successfully built NVCCPlugin
Installing collected packages: NVCCPlugin
Successfully installed NVCCPlugin-0.0.2


In [None]:
%load_ext nvcc_plugin

created output directory at /content/src
Out bin /content/result.out


# Execute Single CUDA file with nvcc4jupyter

If your CUDA application consists of a single source file, you can use the `%%cu` cell magic to compile the cell contents and run the resulting binary.

In [None]:
%%cu
#include <stdio.h>
#include <stdlib.h>
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main() {
    int a, b, c;           // host copies of variables a, b & c
    int *d_a, *d_b, *d_c;  // device copies of variables a, b & c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    c = 0;
    a = 3;
    b = 5;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaError_t err = cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error copying to Host: %s\n", cudaGetErrorString(err));
    }
    printf("result is %d\n", c);

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}

result is 8



If something goes wrong, for example the printed result is incorrect, try running the following host-only code to check that the plugin itself compiles and runs.

In [None]:
%%cu

#include <iostream>
int main()
{
    std::cout << "Your host code test works fine.";
    return 0;
}

Your host code test works fine.


# Compiling and Running Multiple Files

nvcc4jupyter does not handle multiple source files properly. Instead, write the `.cu` file to disk with `%%writefile` and call nvcc manually to compile the CUDA program.

In [None]:
%%writefile vectorAdd.cu

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include "helper_cuda.h"
/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}



Writing vectorAdd.cu
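
The launch configuration in `vectorAdd.cu` rounds the block count up so that every element gets a thread. The following sketch reproduces that arithmetic in Python; `blocks_per_grid` is a hypothetical helper name that mirrors the `blocksPerGrid` expression in the sample.

```python
def blocks_per_grid(num_elements, threads_per_block):
    """Smallest block count whose threads cover num_elements (integer ceiling division)."""
    return (num_elements + threads_per_block - 1) // threads_per_block

blocks = blocks_per_grid(50000, 256)
print(blocks)        # 196
print(blocks * 256)  # 50176 threads in total
```

Because 196 blocks of 256 threads give 50176 threads for 50000 elements, the last 176 threads fall outside the arrays, which is why the kernel guards its body with `if (i < numElements)`.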


In [None]:
!wget https://raw.githubusercontent.com/NVIDIA/cuda-samples/master/Common/helper_cuda.h
!wget https://raw.githubusercontent.com/NVIDIA/cuda-samples/master/Common/helper_string.h

--2021-10-09 14:54:24--  https://raw.githubusercontent.com/NVIDIA/cuda-samples/master/Common/helper_cuda.h
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27706 (27K) [text/plain]
Saving to: ‘helper_cuda.h.1’


2021-10-09 14:54:24 (12.6 MB/s) - ‘helper_cuda.h.1’ saved [27706/27706]

--2021-10-09 14:54:24--  https://raw.githubusercontent.com/NVIDIA/cuda-samples/master/Common/helper_string.h
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11242 (11K) [text/plain]
Saving to: ‘helper_string.h’


2021-10-09 14:54:25 (80.6 MB/s) - ‘help

If the GPU allocated to you is a K80, calling CUDA 11's nvcc without extra flags results in an error.

In [None]:
!nvcc vectorAdd.cu -o vectorAdd && ./vectorAdd

In file included from vectorAdd.cu:26:0:
helper_cuda.h:41:10: fatal error: helper_string.h: No such file or directory
 #include <helper_string.h>
          ^~~~~~~~~~~~~~~~~
compilation terminated.


Add `-arch=sm_37` to target the K80 explicitly: its compute capability (3.7) is below the default architecture that nvcc targets in CUDA 11.
The warning that nvcc prints for this architecture can be ignored in CUDA 11. It only means that a future CUDA release may drop K80 support.

In [None]:
!nvcc vectorAdd.cu -arch=sm_37 -o vectorAdd && ./vectorAdd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
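
The nvcc command line can also be assembled from Python, so the architecture flag follows whichever GPU was allocated. This is a sketch under stated assumptions: `nvcc_command` is a hypothetical helper, and the compute capability is passed in as a `(major, minor)` tuple. On a runtime with PyTorch and a GPU, `torch.cuda.get_device_capability()` should be able to supply such a tuple.

```python
def nvcc_command(sources, output, capability=None):
    """Build an nvcc argument list (hypothetical helper).

    capability is an optional (major, minor) tuple such as (3, 7) for a K80.
    """
    cmd = ["nvcc", *sources]
    if capability is not None:
        major, minor = capability
        cmd.append(f"-arch=sm_{major}{minor}")
    cmd += ["-o", output]
    return cmd

print(" ".join(nvcc_command(["vectorAdd.cu"], "vectorAdd", capability=(3, 7))))
# nvcc vectorAdd.cu -arch=sm_37 -o vectorAdd
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting problems when file names contain spaces.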


The `&& ./vectorAdd` part of the command executes the compiled binary once the build succeeds.

# Write and Compile C++ Code

In [None]:
%%writefile welcome.cpp

#include <iostream>
int main()
{
    std::cout << "Hello World\n";
    return 0;
}

Writing welcome.cpp


In [None]:
!g++ welcome.cpp -o welcome && ./welcome

Hello World
