JG.TechLearning.CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by Nvidia.

https://en.wikipedia.org/wiki/CUDA

  • Code written using the NVIDIA CUDA Runtime 10.2

  • Tested on an NVIDIA Quadro M2000M

Implemented & measured:

  1. Polynomial calculation on CPU & GPU (part of the code was provided by an academic teacher)
  2. Image grayscale conversion on CPU & GPU
  3. Matrix multiplication on CPU & GPU (examining how loop unrolling performed by the NVCC compiler affects computation time)

The tasks completed here, together with the theoretical introduction to the laboratories, were developed by Slawomir Wernikowski, an IT engineer and academic teacher at the West Pomeranian University of Technology.
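The measurement code itself is not reproduced in this README; a minimal sketch of how such CPU/GPU durations are commonly taken (CUDA events for the kernel time, std::chrono for the CPU path) is shown here. The kernel launch is only a placeholder.

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    void measure()
    {
        // GPU timing with CUDA events (milliseconds).
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // someKernel<<<blocks, threadsPerBlock>>>(...);   // placeholder kernel launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float gpuMs = 0.0f;
        cudaEventElapsedTime(&gpuMs, start, stop);
        printf("Duration GPU: <%g> [ms]\n", gpuMs);

        // CPU timing with std::chrono (nanoseconds).
        auto t0 = std::chrono::high_resolution_clock::now();
        // ... CPU reference computation ...
        auto t1 = std::chrono::high_resolution_clock::now();
        auto cpuNs = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        printf("Duration CPU: %lld [ns]\n", (long long)cpuNs);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }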

Polynomial calculation results

Test 1: polynomial calculated on GPU and CPU for 1,000,000 values

  • GPU no. of blocks: 16
  • Duration GPU: 3 ms
  • Duration CPU: 2 ms

Test 2: polynomial calculated on GPU and CPU for 10,000,000 values

  • GPU no. of blocks: 16
  • Duration GPU: 28 ms
  • Duration CPU: 31 ms

Test 2.1: polynomial calculated on GPU and CPU for 10,000,000 values

  • GPU no. of blocks: 1024
  • Duration GPU: 17 ms
  • Duration CPU: 32 ms

Test 3: polynomial calculated on GPU and CPU for 100,000,000 values

  • GPU no. of blocks: 16
  • Duration GPU: 290 ms
  • Duration CPU: 348 ms

Test 4: polynomial calculated on GPU and CPU for 100,000,000 values

  • GPU no. of blocks: 1024
  • Duration GPU: 189 ms
  • Duration CPU: 362 ms

Conclusions:

  • When the vector of input values grows beyond 10^6 elements, the GPU computation becomes clearly faster than the CPU (up to roughly twice as fast in these tests)
  • When the available blocks (and the threads inside them) are fully utilized, the result arrives much sooner; reducing the number of blocks/threads used lengthens the computation (compare Tests 3 and 4), as illustrated by the kernel sketch below
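The actual polynomial kernel lives in the repository sources; the following is only a rough sketch of the approach, assuming Horner's scheme and hypothetical names (polynomialKernel, coeffs, NUM_COEFFS). The grid-stride loop is what lets the same kernel run correctly with 16 or 1024 blocks.

    #include <cuda_runtime.h>

    #define NUM_COEFFS 8   // assumed: polynomial degree + 1

    // Each thread evaluates the polynomial for one or more input values
    // using Horner's scheme; the grid-stride loop covers n > gridDim * blockDim.
    __global__ void polynomialKernel(const float* in, float* out, const float* coeffs, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < n; i += stride)
        {
            float x = in[i];
            float acc = coeffs[NUM_COEFFS - 1];
            for (int c = NUM_COEFFS - 2; c >= 0; c--)
                acc = acc * x + coeffs[c];
            out[i] = acc;
        }
    }

    // Example launch, matching the block counts used in the tests above:
    // polynomialKernel<<<1024, 256>>>(d_in, d_out, d_coeffs, 100000000);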

Image grayscale conversion on CPU & GPU: output files & measurements

Sample image pairs: Input 1 / Output 1 and Input 2 / Output 2 (image files not reproduced here).


Results

CPU: Intel64 Family 6 Model 94 Stepping 3

GPU: NVIDIA Quadro M2000M

Test 1: image 640 × 480 (307200 pixels), 256 colors

  • GPU duration: 2370833 ns (~2.4 ms)
  • CPU duration: 69864629 ns (~69.9 ms)

Test 2: image 640 × 480 (307200 pixels), 256 colors

  • GPU duration: 2420297 ns (~2.4 ms)
  • CPU duration: 68725466 ns (~68.7 ms)

Test 3: image 1419 × 1001 (1420419 pixels), 16777216 colors

  • GPU duration: 11482626 ns (~11.5 ms)
  • CPU duration: 346865093 ns (~346.9 ms)

Test 4: image 1419 × 1001 (1420419 pixels), 16777216 colors

  • GPU duration: 11523404 ns (~11.5 ms)
  • CPU duration: 343563144 ns (~343.6 ms)
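The grayscale kernel itself is part of the repository sources; a rough, self-contained sketch of a per-pixel RGB-to-gray conversion is given below. The names (grayscaleKernel, rgb, gray) and the luminance weights are assumptions, not the repository's exact code.

    // One thread per pixel; maps 3-channel RGB to a single luminance value.
    __global__ void grayscaleKernel(const unsigned char* rgb, unsigned char* gray,
                                    int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height)
            return;

        int idx = y * width + x;
        unsigned char r = rgb[3 * idx + 0];
        unsigned char g = rgb[3 * idx + 1];
        unsigned char b = rgb[3 * idx + 2];

        // Standard luminance weights (an assumption; a plain average also works).
        gray[idx] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }

    // Example launch for the 1419x1001 test image with 16x16 thread blocks:
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // grayscaleKernel<<<grid, block>>>(d_rgb, d_gray, 1419, 1001);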

Matrix multiplication on CPU & GPU: results

a) Matrix A = 300x300, Matrix B = 300x300

  • CPU (avg) = 133 ms
  • GPU without #pragma unroll (avg) = 123 ms
  • GPU with #pragma unroll (avg) = 111 ms (about 10% less time)

b) Matrix A = 900x900, Matrix B = 900x900

  • CPU (avg) = 4839 ms
  • GPU without #pragma unroll (avg) = 1646 ms
  • GPU with #pragma unroll, widthC computed at runtime (avg) = 1633 ms (no measurable gain)
  • GPU with #pragma unroll and a compile-time CONST_WIDTH_C (avg) = 400 ms (roughly a quarter of the time of the other GPU versions)

The loop below is the one that was tested with #pragma unroll:

    for (int i = 0; i < CONST_WIDTH_C; i++)
    {
        tmp_sum += in_tabA[row * CONST_WIDTH_C + i] * in_tabB[i * CONST_WIDTH_C + col];
    }
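For context, here is a hedged sketch of how that loop might sit inside a naive matrix-multiplication kernel. The kernel name matMulKernel and the launch configuration are assumptions; CONST_WIDTH_C is fixed via #define because, as noted in the conclusions, the unrolling only paid off when the width was known at compile time.

    #define CONST_WIDTH_C 900   // must be known at compile time for #pragma unroll to help

    // Naive matrix multiply: one thread computes one element of C = A * B
    // for square matrices of width CONST_WIDTH_C.
    __global__ void matMulKernel(const float* in_tabA, const float* in_tabB, float* out_tabC)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= CONST_WIDTH_C || col >= CONST_WIDTH_C)
            return;

        float tmp_sum = 0.0f;
        #pragma unroll
        for (int i = 0; i < CONST_WIDTH_C; i++)
        {
            tmp_sum += in_tabA[row * CONST_WIDTH_C + i] * in_tabB[i * CONST_WIDTH_C + col];
        }
        out_tabC[row * CONST_WIDTH_C + col] = tmp_sum;
    }

    // Example launch with 16x16 thread blocks:
    // dim3 block(16, 16);
    // dim3 grid((CONST_WIDTH_C + 15) / 16, (CONST_WIDTH_C + 15) / 16);
    // matMulKernel<<<grid, block>>>(d_A, d_B, d_C);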

Conclusions:

  • The performance benefit of the #pragma unroll directive depends strongly on what is computed inside the loop
  • During the tests it turned out that CONST_WIDTH_C has to be a constant or a #define: when it was computed at runtime from the size of a dynamically allocated matrix (i.e. not known at compilation time), #pragma unroll brought no benefit
  • Unrolling the loop at compilation time removes loop-control work (counter increments and end-of-loop comparisons) that would otherwise have to be executed at runtime for every iteration
