<a href="https://colab.research.google.com/github/kt-chan/cuda-demo/blob/main/cuda_cplusplus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 使用Google Colab寫C++程式並運行

Create sample c++ code with %%writefile filename.cpp

In [None]:
%%writefile demo.cpp

#include <iostream>
using namespace std;
int main()
{
    string text = "world2";
    cout << "hello, " + text;
}

Compile the code with %%shell command

In [None]:
%%shell

g++ demo.cpp -o demo



Execution by just run it, with %%shell command.

In [None]:
%%shell
./demo

Hello World!




# 配置 CUDA Environment

In [None]:
# check nvidia card info
!nvidia-smi

Thu May 23 04:01:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# check os info
!cat /etc/*release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


In [None]:
#get current working directory
!pwd

/content


Remote all legacy cuda framework, and update to latest version Go here: https://developer.nvidia.com/cuda-downloads

In [None]:
!apt-get --purge remove cuda nvidia* libnvidia-*
!dpkg -l | grep cuda- | awk '{print $2}' | xargs -n1 dpkg --purge
!apt-get remove cuda-*
!apt autoremove
!apt-get update

In [None]:
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
!sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
!wget https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
!sudo dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
!sudo cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
!sudo apt-get update
!sudo apt-get -y install cuda-toolkit-12-5

--2024-05-22 09:49:35--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2204.pin’


2024-05-22 09:49:36 (4.09 MB/s) - ‘cuda-ubuntu2204.pin’ saved [190/190]

--2024-05-22 09:49:36--  https://developer.download.nvidia.com/compute/cuda/12.5.0/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.0-555.42.02-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3302514250 (3.1G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2204

In [None]:
#After refresh the cuda framework, check version info

!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


# Setup Google Colab for cuda c++, rerun this for new session.

set your runtime to cuda by click "runtime" -> "change runtime type" in above toolbar, and select T4 GPU.

First, you have to install nvcc plugin for cuda compiler

In [None]:
!pip install nvcc4jupyter

Collecting nvcc4jupyter
  Downloading nvcc4jupyter-1.2.1-py3-none-any.whl (10 kB)
Installing collected packages: nvcc4jupyter
Successfully installed nvcc4jupyter-1.2.1


then, Load the plugin



In [None]:
%load_ext nvcc4jupyter

Detected platform "Colab". Running its setup...
Source files will be saved in "/tmp/tmpfhvxfg1x".


In [None]:
# check nvidia card info
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found


# Demo Code - Simple Loop

In [None]:
%%writefile SimpleLogger.h
#include <iostream>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <cstdarg>  // Correct header for va_list and related macros
#include <cstdio>

enum LogLevel {
	DEBUG,
	INFO,
	WARN,
	ERROR,
	FATAL
};

class SimpleLogger {
private:
	LogLevel currentLevel;
	std::string logFormat; // Stores the format for log messages

	// Helper function to convert LogLevel to string
	std::string LogLevelToString(LogLevel level) {
		switch (level) {
		case DEBUG: return "DEBUG";
		case INFO: return "INFO";
		case WARN: return "WARN";
		case ERROR: return "ERROR";
		case FATAL: return "FATAL";
		default: return "UNKNOWN";
		}
	}

public:
	// Constructor that accepts an initial log level and an optional log format
	SimpleLogger(LogLevel level, const std::string& format = "[%Y-%m-%d %H:%M:%S] [%L] %M")
		: currentLevel(level), logFormat(format) {}

	void SetLogLevel(LogLevel level) {
		currentLevel = level;
	}

	// Function to set the log format
	void SetLogFormat(const std::string& format) {
		logFormat = format;
	}

	// Printf-style log function
	void Logf(LogLevel level, const char* format, ...) {
		if (level < currentLevel) return;

		char buffer[1024];
		std::va_list args;
		va_start(args, format);
		// Use std::vsnprintf to safely format the string into the buffer
		std::vsnprintf(buffer, sizeof(buffer), format, args);
		va_end(args);

		// Now log the formatted message using the existing Log method
		Log(level, buffer);
	}

	// Function to log messages with the given log level
	void Log(LogLevel level, const int& message) {
		Log(level, std::to_string(message));
	}

	// Function to log messages with the given log level
	void Log(LogLevel level, const std::string& message) {
		if (level < currentLevel) return; // Skip messages below the current log level

		// Get the current time
		std::time_t now = std::time(nullptr);
		std::tm timeInfo;
		// Use localtime_s on MSVC or localtime on other platforms
#ifdef _MSC_VER
		localtime_s(&timeInfo, &now);
#else
		std::tm* timeInfoPtr = std::localtime(&now);
		if (!timeInfoPtr) return; // If localtime fails, return
		timeInfo = *timeInfoPtr;
#endif

		// Replace format specifiers with actual values
		std::string formattedMessage = logFormat;
		size_t pos = 0;
		if ((pos = formattedMessage.find("%Y")) != std::string::npos) {
			formattedMessage.replace(pos, 3, std::to_string(timeInfo.tm_year + 1900));
		}
		if ((pos = formattedMessage.find("%m")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_mon + 1));
		}
		if ((pos = formattedMessage.find("%d")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_mday));
		}
		if ((pos = formattedMessage.find("%H")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_hour));
		}
		if ((pos = formattedMessage.find("%M")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_min));
		}
		if ((pos = formattedMessage.find("%S")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_sec));
		}
		if ((pos = formattedMessage.find("%L")) != std::string::npos) {
			formattedMessage.replace(pos, 2, LogLevelToString(level));
		}
		if ((pos = formattedMessage.find("%M")) != std::string::npos) {
			size_t end = pos + 2;
			formattedMessage.replace(pos, end - pos, message);
		}

		// Output the formatted log message to the console with endl;
		std::cout << formattedMessage << std::endl;

	}
};

// Declare a static instance of SimpleLogger
static SimpleLogger& GlobalLogger(LogLevel  level = INFO) {
	static SimpleLogger logger(level);
	return logger;
}


Simple 1 layer loop demo

// %%cuda_group_save -n demo.cu -g share

In [6]:
%%writefile demo.cpp

#define _CRT_SECURE_NO_WARNINGS
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <device_launch_parameters.h>
#include "SimpleLogger.h"


#define Debug false
#define K 8

using namespace std;

class demo {


public:

	void genRandomNumber(int* a, int n) {
		for (int i = 0; i < n; i++) {
			a[i] = rand() % n;
			GlobalLogger().Log(DEBUG, a[i]);
		}
		GlobalLogger().Log(DEBUG, "\n");
		GlobalLogger().Log(INFO, "first elment: " + to_string(a[1]) + "...");
	}


	void cpucal(int* a, int* b, long long& c, int n)
	{
		long long  sum = 0;
		// calculate the dot product of two array
		for (int i = 0; i < n; i++)
		{
			sum += (a[i] * b[i]);
		}
		c = sum;
	}



	bool checkGPU()
	{
		const char* gpu_env = std::getenv("COLAB_GPU");
		if (gpu_env && atoi(gpu_env) > 0)
		{
			GlobalLogger().Log(INFO, "A GPU is connected.");
			return true;
		}
		else
		{
			GlobalLogger().Log(INFO, "No accelerator is connected.");
			return false;
		}
	}

	int run(void) {

		checkGPU();

		unsigned int N = 1 << K;
		GlobalLogger().Logf(INFO, "Total array size: %d\n", N);

		int* a, * b;
		long long cpu_output;

		a = (int*)malloc(N * sizeof(int));
		b = (int*)malloc(N * sizeof(int));

		GlobalLogger().Log(INFO, "value of array a:\t");
		genRandomNumber(a, N);
		GlobalLogger().Log(INFO, "value of array b:\t");
		genRandomNumber(b, N);

		GlobalLogger().Log(INFO, "@CPU, summing value...");

		clock_t t;

		// calling cpu
		t = clock();//start time
		cpucal(a, b, cpu_output, N);
		t = clock() - t;//total time = end time - start time
		GlobalLogger().Log(INFO, "result:  " + to_string(cpu_output));
		GlobalLogger().Logf(INFO, "CPU Avg time = %lf ms.\n", ((((float)t) / CLOCKS_PER_SEC) * 1000));

		cudaFree(a);
		cudaFree(b);

		return 0;
	}
};



int main()
{
	GlobalLogger(INFO);
	GlobalLogger().Log(INFO, "Application started. This is info level log");
	GlobalLogger().Log(DEBUG, "This is a debug message."); // This will not be shown because it's below the INFO level

	demo demoapp;
	demoapp.run();
}

Writing demo.cpp


In [7]:
%%shell
nvcc -o demo demo.cpp
./demo

[20245-28 8:6:59] [INFO] Application started. This is info level log
[20245-28 8:6:59] [INFO] A GPU is connected.
[20245-28 8:6:59] [INFO] Total array size: 256

[20245-28 8:6:59] [INFO] value of array a:	
[20245-28 8:6:59] [INFO] first elment: 198...
[20245-28 8:6:59] [INFO] value of array b:	
[20245-28 8:6:59] [INFO] first elment: 112...
[20245-28 8:6:59] [INFO] @CPU, summing value...
[20245-28 8:6:59] [INFO] result:  4214798
[20245-28 8:6:59] [INFO] CPU Avg time = 0.002000 ms.





# Demo Code - Complex nested loop

In [4]:
%%writefile SimpleLogger.h
#include <iostream>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <cstdarg>  // Correct header for va_list and related macros
#include <cstdio>

enum LogLevel {
	DEBUG,
	INFO,
	WARN,
	ERROR,
	FATAL
};

class SimpleLogger {
private:
	LogLevel currentLevel;
	std::string logFormat; // Stores the format for log messages

	// Helper function to convert LogLevel to string
	std::string LogLevelToString(LogLevel level) {
		switch (level) {
		case DEBUG: return "DEBUG";
		case INFO: return "INFO";
		case WARN: return "WARN";
		case ERROR: return "ERROR";
		case FATAL: return "FATAL";
		default: return "UNKNOWN";
		}
	}

public:
	// Constructor that accepts an initial log level and an optional log format
	SimpleLogger(LogLevel level, const std::string& format = "[%Y-%m-%d %H:%M:%S] [%L] %M")
		: currentLevel(level), logFormat(format) {}

	void SetLogLevel(LogLevel level) {
		currentLevel = level;
	}

	// Function to set the log format
	void SetLogFormat(const std::string& format) {
		logFormat = format;
	}

	// Printf-style log function
	void Logf(LogLevel level, const char* format, ...) {
		if (level < currentLevel) return;

		char buffer[1024];
		std::va_list args;
		va_start(args, format);
		// Use std::vsnprintf to safely format the string into the buffer
		std::vsnprintf(buffer, sizeof(buffer), format, args);
		va_end(args);

		// Now log the formatted message using the existing Log method
		Log(level, buffer);
	}

	// Function to log messages with the given log level
	void Log(LogLevel level, const int& message) {
		Log(level, std::to_string(message));
	}

	// Function to log messages with the given log level
	void Log(LogLevel level, const std::string& message) {
		if (level < currentLevel) return; // Skip messages below the current log level

		// Get the current time
		std::time_t now = std::time(nullptr);
		std::tm timeInfo;
		// Use localtime_s on MSVC or localtime on other platforms
#ifdef _MSC_VER
		localtime_s(&timeInfo, &now);
#else
		std::tm* timeInfoPtr = std::localtime(&now);
		if (!timeInfoPtr) return; // If localtime fails, return
		timeInfo = *timeInfoPtr;
#endif

		// Replace format specifiers with actual values
		std::string formattedMessage = logFormat;
		size_t pos = 0;
		if ((pos = formattedMessage.find("%Y")) != std::string::npos) {
			formattedMessage.replace(pos, 3, std::to_string(timeInfo.tm_year + 1900));
		}
		if ((pos = formattedMessage.find("%m")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_mon + 1));
		}
		if ((pos = formattedMessage.find("%d")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_mday));
		}
		if ((pos = formattedMessage.find("%H")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_hour));
		}
		if ((pos = formattedMessage.find("%M")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_min));
		}
		if ((pos = formattedMessage.find("%S")) != std::string::npos) {
			formattedMessage.replace(pos, 2, std::to_string(timeInfo.tm_sec));
		}
		if ((pos = formattedMessage.find("%L")) != std::string::npos) {
			formattedMessage.replace(pos, 2, LogLevelToString(level));
		}
		if ((pos = formattedMessage.find("%M")) != std::string::npos) {
			size_t end = pos + 2;
			formattedMessage.replace(pos, end - pos, message);
		}

		// Output the formatted log message to the console with endl;
		std::cout << formattedMessage << std::endl;

	}
};

// Declare a static instance of SimpleLogger
static SimpleLogger& GlobalLogger(LogLevel  level = INFO) {
	static SimpleLogger logger(level);
	return logger;
}


Overwriting SimpleLogger.h


Complex N layer loop demo

In [5]:
%%writefile demo.cu

#define _CRT_SECURE_NO_WARNINGS
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <device_launch_parameters.h>
#include "SimpleLogger.h"

#define K 8 // set value to 2^K, K should less than 30


using namespace std;


__global__ void gpucal_partial_kernel(int* a, int* b, long long* c, int n)
{
	int threadId = threadIdx.x + blockDim.x * blockIdx.x;
	c[threadId] += (a[threadId] * b[threadId]);
}


class DemoCuda {
private:
	void checkCudaError(cudaError_t error) {
		if (error != cudaSuccess) {
			GlobalLogger().Logf(ERROR, "CUDA Error: %s", cudaGetErrorString(error));
		}
	}

	void checkCudaError(cudaError_t error, string stmt) {
		if (error != cudaSuccess) {
			GlobalLogger().Logf(ERROR, "CUDA Statement: %s", stmt);
			checkCudaError(error);
		}
	}

	void genRandomNumber(int* a, int n) {
		for (int i = 0; i < n; i++) {
			a[i] = rand() % n;
			GlobalLogger().Log(DEBUG, a[i]);
		}
		GlobalLogger().Log(DEBUG, "\n");
		GlobalLogger().Log(INFO, "first elment: " + to_string(a[1]) + "...");
	}

	void cpucal(int* a, int* b, long long& c, int n)
	{
		long long  sum = 0;
		// calculate the dot product of two array
		for (int i = 0; i < n; i++)
		{
			sum += (a[i] * b[i]);
		}
		c = sum;
	}



	void gpucal(int* a, int* b, long long* c, int n) {
		int* d_a, * d_b;
		long long* d_c;

		// Attempt to allocate memory on the host
		checkCudaError(cudaMalloc(&d_a, n * sizeof(int)), "cudaMalloc(&d_a, N * sizeof(int))");
		cudaMemcpy(d_a, a, n * sizeof(int), cudaMemcpyHostToDevice);

		checkCudaError(cudaMalloc(&d_b, n * sizeof(int)), "cudaMalloc(&d_b, N * sizeof(int))");
		cudaMemcpy(d_b, b, n * sizeof(int), cudaMemcpyHostToDevice);

		checkCudaError(cudaMalloc(&d_c, n * sizeof(long long)), "cudaMalloc(&d_c, n * sizeof(int))");
		cudaMemset(d_c, 0, n * sizeof(long long));


		// define kernel call
		int grids = max(1, (n + 255) / 256);
		int blocks = max(1, min(n, 256));

		GlobalLogger().Logf(INFO, "start to run gpucal_partial_kernel with <<<%d, %d>>>:  ", grids, blocks);
		//gpucal_partial_kernel <<<grids, blocks>>> (d_a, d_b, d_c, n);
		gpucal_partial_kernel <<<grids, blocks>>> (d_a, d_b, d_c, n);

		// Sync Device to Host
		GlobalLogger().Logf(INFO, "start to run cudaDeviceSynchronize ... ");
		cudaDeviceSynchronize();

		cudaMemcpy(c, d_c, n * sizeof(long long), cudaMemcpyDeviceToHost);
		GlobalLogger().Logf(INFO, "start to run cudaMemcpy with c array first element: %s.", to_string(c[0]));


		// Sum up the partial sums on the host to get the final result
		GlobalLogger().Logf(DEBUG, "Sum up the partial sums  ....");
		long long sum = 0;
		for (int i = 0; i < n; ++i) {
			GlobalLogger().Logf(DEBUG, "c[i]: %d", c[i]);
			sum += c[i];
			GlobalLogger().Logf(DEBUG, "sum: %lld", sum);
		}

		GlobalLogger().Logf(INFO, "Sum value: %lld", sum);

		// Write the final sum to c[0]
		c[0] = sum;
		GlobalLogger().Logf(INFO, "gpucal return value: %lld", c[0]);

		checkCudaError(cudaFreeHost(a), "cudaFreeHost(a)");
		checkCudaError(cudaFreeHost(b), "cudaFreeHost(b)");
		checkCudaError(cudaFree(d_a), "cudaFree(d_a)");
		checkCudaError(cudaFree(d_b), "cudaFree(d_b)");
		checkCudaError(cudaFree(d_c), "cudaFree(d_c)");
	}

	bool checkGPU()
	{
		const char* gpu_env = getenv("COLAB_GPU");
		if (gpu_env && atoi(gpu_env) > 0)
		{
			GlobalLogger().Log(INFO, "A GPU is connected.");
			return true;
		}
		else
		{
			GlobalLogger().Log(INFO, "No accelerator is connected.");
			return false;
		}
	}
public:

	int run(void)
	{

		bool GPU = checkGPU();

		unsigned int N = 1 << K;
		GlobalLogger().Logf(INFO, "Total array size: %d\n", N);

		int* a, * b;
		long long cpu_output;

		if (!GPU)
		{
			GlobalLogger().Log(INFO, "allocating cpu memory ... ");
			a = (int*)malloc(N * sizeof(int));
			b = (int*)malloc(N * sizeof(int));
		}
		else
		{
			GlobalLogger().Log(INFO, "allocating gpu memory ... ");
			// Attempt to allocate memory on the host
			checkCudaError(cudaMallocHost(&a, N * sizeof(int)));
			checkCudaError(cudaMallocHost(&b, N * sizeof(int)));
		}

		GlobalLogger().Log(INFO, "value of array a:\t");
		genRandomNumber(a, N);
		GlobalLogger().Log(INFO, "value of array b:\t");
		genRandomNumber(b, N);

		GlobalLogger().Log(INFO, "@CPU, finding dot product for value of size(n) * size(n) ... \n");

		clock_t t;

		// calling cpu
		t = clock(); // start time
		cpucal(a, b, cpu_output, N);
		t = clock() - t; // total time = end time - start time

		GlobalLogger().Log(INFO, "result: " + to_string(cpu_output));

		GlobalLogger().Logf(INFO, "@CPU Avg time = %lf ms.\n", ((((float)t) / CLOCKS_PER_SEC) * 1000));
		if (GPU) {
			GlobalLogger().Log(INFO, "@GPU, finding dot product for value of size(n) * size(n) ... ");

			// reset value c[0]
			long long* gpu_output;
			checkCudaError(cudaMallocHost(&gpu_output, N * sizeof(long long)));
			memset(gpu_output, 0, N * sizeof(long long));

			// calling cpu
			t = clock(); // start time
			gpucal(a, b, gpu_output, N);
			t = clock() - t; // total time = end time - start time

			GlobalLogger().Log(INFO, "result: " + to_string(gpu_output[0]));
			GlobalLogger().Logf(INFO, "@CPU Avg time = %lf ms.\n", ((((float)t) / CLOCKS_PER_SEC) * 1000));
			cudaFreeHost(gpu_output);

		}

		if (!GPU)
		{
			free(a);
			free(b);
		}
		else
		{
			cudaFreeHost(a);
			cudaFreeHost(b);
		}

		return 0;
	}
};

int main()
{
	GlobalLogger(INFO);
	GlobalLogger().Log(INFO, "Application started. Logging at LogLevel: INFO");
	DemoCuda app;
	app.run();
}

Overwriting demo.cu


In [6]:
%%shell
nvcc -o demo demo.cu
./demo

[20245-28 8:16:21] [INFO] Application started. Logging at LogLevel: INFO
[20245-28 8:16:21] [INFO] A GPU is connected.
[20245-28 8:16:21] [INFO] Total array size: 256

[20245-28 8:16:21] [INFO] allocating gpu memory ... 
[20245-28 8:16:21] [INFO] value of array a:	
[20245-28 8:16:21] [INFO] first elment: 198...
[20245-28 8:16:21] [INFO] value of array b:	
[20245-28 8:16:21] [INFO] first elment: 112...
[20245-28 8:16:21] [INFO] @CPU, finding dot product for value of size(n) * size(n) ... 

[20245-28 8:16:21] [INFO] result: 4214798
[20245-28 8:16:21] [INFO] @CPU Avg time = 0.003000 ms.

[20245-28 8:16:21] [INFO] @GPU, finding dot product for value of size(n) * size(n) ... 
[20245-28 8:16:21] [INFO] start to run gpucal_partial_kernel with <<<1, 256>>>:  
[20245-28 8:16:21] [INFO] start to run cudaDeviceSynchronize ... 
[20245-28 8:16:21] [INFO] start to run cudaMemcpy with c array first element: 9O��.
[20245-28 8:16:21] [INFO] Sum value: 4214798
[20245-28 8:16:21] [INFO] gpucal return v

