<a href="https://colab.research.google.com/github/jenesias/Blog-List/blob/main/TensorFlow_with_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow with GPU

This notebook provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. You will connect to a GPU, confirm that TensorFlow can see it, and then run the same vector addition on both the CPU and the GPU, observing the speedup provided by the GPU.


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [None]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

TensorFlow 2.x selected.
Found GPU at: /device:GPU:0


## Observe speedup on GPU relative to CPU

This example adds two vectors of one million floats each, first with a CUDA
kernel on the GPU and then with a sequential loop on the CPU, to compare
execution speed. The first cell checks which CUDA toolkit version is installed.

In [2]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [21]:
%%writefile vector_addition_parallel.cu

#include <iostream>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

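// Each thread adds one pair of elements; a and b are read-only inputs
__global__ void vectorAddParallel(const float *a, const float *b, float *c, int n) {
    // Global index of this thread across the whole grid
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // Guard: the last block may contain threads past the end of the arrays
    if (i < n) {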
        c[i] = a[i] + b[i];
    }
}

int main() {
    int size = 1000 * 1000;
    size_t size_bytes = size * sizeof(float);

    std::vector<float> h_a(size, 1.0f);
    std::vector<float> h_b(size, 2.0f);
    std::vector<float> h_c(size);

    float *d_a, *d_b, *d_c;

    // Allocate memory on the device
    cudaMalloc(&d_a, size_bytes);
    cudaMalloc(&d_b, size_bytes);
    cudaMalloc(&d_c, size_bytes);

    // Copy vectors from host to device
    cudaMemcpy(d_a, h_a.data(), size_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), size_bytes, cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 gridSize((size + 255) / 256, 1, 1);
    dim3 blockSize(256, 1, 1);

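    // Start the timer; the measured interval covers the kernel launch,
    // its execution, and the device-to-host copy of the result
    auto start_time = std::chrono::high_resolution_clock::now();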

    // Launch the vector addition kernel
    vectorAddParallel<<<gridSize, blockSize>>>(d_a, d_b, d_c, size);

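    // Check for launch errors, then wait for the kernel to finish
    // and check for errors raised during execution
    cudaError_t cudaError = cudaGetLastError();
    if (cudaError == cudaSuccess) {
        cudaError = cudaDeviceSynchronize();
    }
    if (cudaError != cudaSuccess) {
        std::cerr << "CUDA error: " << cudaGetErrorString(cudaError) << std::endl;
        return 1;
    }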

    // Copy result from device to host
    cudaMemcpy(h_c.data(), d_c, size_bytes, cudaMemcpyDeviceToHost);

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Parallel Vector Addition took " << duration.count() << " milliseconds.\n";

    // Free memory on the device
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}


Overwriting vector_addition_parallel.cu
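
To see why the kernel's index formula and bounds check touch every element exactly once, the indexing can be emulated on the CPU. The following is a minimal sketch in plain C++, not part of the notebook; `countIndexVisits`, `numBlocks`, and `threadsPerBlock` are hypothetical names introduced for illustration, and the two nested loops stand in for what the GPU runs in parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only: emulates the kernel's indexing
// on the CPU and counts how many times each element index is visited.
std::vector<int> countIndexVisits(int n, int numBlocks, int threadsPerBlock) {
    assert(n >= 0 && numBlocks >= 0 && threadsPerBlock > 0);
    std::vector<int> visits(n, 0);
    for (int block = 0; block < numBlocks; ++block) {              // plays the role of blockIdx.x
        for (int thread = 0; thread < threadsPerBlock; ++thread) { // plays the role of threadIdx.x
            int i = thread + threadsPerBlock * block;              // same formula as the kernel
            if (i < n) {                                           // same bounds guard as the kernel
                ++visits[i];
            }
        }
    }
    return visits;
}
```

With `n = 1000`, 4 blocks, and 256 threads per block, every entry of the returned vector is 1. With only 3 blocks, indices from 768 onward are never visited, which is why the grid size has to be rounded up.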


In [22]:
!nvcc vector_addition_parallel.cu -o vector_addition_parallel
!./vector_addition_parallel


Parallel Vector Addition took 1 milliseconds.
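
The grid size used in the kernel cell, `(size + 255) / 256`, is integer ceiling division: it rounds the number of blocks up so that every element gets a thread. The arithmetic can be sketched as below; `blocksNeeded` and `idleThreads` are hypothetical helpers, not part of the notebook.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// blocks * threadsPerBlock >= n (ceiling division on integers).
int blocksNeeded(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: threads in the last block that fall past the end of
// the data and are skipped by the kernel's `if (i < n)` guard.
int idleThreads(int n, int threadsPerBlock) {
    return blocksNeeded(n, threadsPerBlock) * threadsPerBlock - n;
}
```

For the notebook's values, `blocksNeeded(1000000, 256)` is 3907 and `idleThreads(1000000, 256)` is 192: the launch creates 1,000,192 threads, and the last 192 do nothing.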


In [17]:
%%writefile vector_addition_sequential.cpp

#include <iostream>
#include <vector>
#include <chrono>

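// Add a and b element by element on a single CPU thread;
// c must already hold a.size() elements
void vectorAddSequential(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {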
        c[i] = a[i] + b[i];
    }
}

int main() {
    int size = 1000 * 1000;

    std::vector<float> h_a(size, 1.0f);
    std::vector<float> h_b(size, 2.0f);
    std::vector<float> h_c(size);

    auto start_time = std::chrono::high_resolution_clock::now();

    vectorAddSequential(h_a, h_b, h_c);

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Sequential Vector Addition took " << duration.count() << " milliseconds.\n";

    return 0;
}


Overwriting vector_addition_sequential.cpp
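
Neither program inspects the output vector, so a wrong result would go unnoticed. A minimal check might look like the sketch below; `addVectors` and `allEqual` are hypothetical helpers, not part of the notebook. The exact floating-point comparison is an assumption that holds here only because `1.0f + 2.0f` is exactly `3.0f`; other inputs would generally need a tolerance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise sum of two equally sized vectors.
std::vector<float> addVectors(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Hypothetical helper: true when every element equals `expected` exactly.
bool allEqual(const std::vector<float>& values, float expected) {
    return std::all_of(values.begin(), values.end(),
                       [expected](float v) { return v == expected; });
}
```

In either program, calling `allEqual(h_c, 3.0f)` after the timed section would confirm the result without affecting the measurement.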


In [18]:
!g++ vector_addition_sequential.cpp -o vector_addition_sequential
!./vector_addition_sequential


Sequential Vector Addition took 7 milliseconds.
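
Both timings above come from a single run and are truncated to whole milliseconds by `duration_cast`, so the 1 ms versus 7 ms comparison is coarse. One way to get a steadier number is to repeat the measurement and report the median in microseconds. This is a sketch under stated assumptions: `medianMicroseconds` is a hypothetical helper, and it uses `std::chrono::steady_clock` rather than the notebook's `high_resolution_clock`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `work` `repeats` times and returns the median
// wall-clock time of a single run, in microseconds.
template <typename Work>
long long medianMicroseconds(Work work, int repeats) {
    assert(repeats > 0);
    std::vector<long long> samples;
    samples.reserve(repeats);
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // middle element after sorting
}
```

For the CPU version, `work` could be a lambda that calls `vectorAddSequential(h_a, h_b, h_c)`. For the GPU version, the lambda would also need to call `cudaDeviceSynchronize()`, because kernel launches return before the kernel has finished.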
