# Running OpenACC on Google Colab

To check that the procedure works, I will test it with the program `task2_solution.c`, which was part of an NVIDIA course.

## Runtime environment

You must select a runtime with a **GPU** in Colab to be able to run the program.

The next two cells write `timer.h` and `task2_solution.c` to Colab's default directory so that we can compile them.

In [1]:
%%writefile timer.h

/*
 *  Copyright 2012 NVIDIA Corporation
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

#ifndef TIMER_H
#define TIMER_H

#include <stdlib.h>

#ifdef WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#else
#include <sys/time.h>
#endif

#ifdef WIN32
double PCFreq = 0.0;
__int64 timerStart = 0;
#else
struct timeval timerStart;
#endif

void StartTimer()
{
#ifdef WIN32
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        printf("QueryPerformanceFrequency failed!\n");

    PCFreq = (double)li.QuadPart/1000.0;

    QueryPerformanceCounter(&li);
    timerStart = li.QuadPart;
#else
    gettimeofday(&timerStart, NULL);
#endif
}

// time elapsed in ms
double GetTimer()
{
#ifdef WIN32
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return (double)(li.QuadPart-timerStart)/PCFreq;
#else
    struct timeval timerStop, timerElapsed;
    gettimeofday(&timerStop, NULL);
    timersub(&timerStop, &timerStart, &timerElapsed);
    return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0;
#endif
}

#endif // TIMER_H


Writing timer.h


In [2]:
%%writefile task2_solution.c
#include <stdio.h>
#include <math.h>
#include <string.h>
#include "timer.h"

#define NN 1024
#define NM 1024

float A[NN][NM];
float Anew[NN][NM];

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error     = 1.0;

    memset(A, 0, n * m * sizeof(float));
    memset(Anew, 0, n * m * sizeof(float));

    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

    while ( error > tol && iter < iter_max )
    {
        #pragma acc kernels
        {
            error = 0.0;

            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                        + A[j-1][i] + A[j+1][i]);
                    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
                }
            }

            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    A[j][i] = Anew[j][i];
                }
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

        iter++;

    }

    double runtime = GetTimer();

    printf(" total: %f s\n", runtime / 1000);

    return 0;
}


Writing task2_solution.c


In [3]:
!ls -l

total 12
drwxr-xr-x 1 root root 4096 Nov 12 14:25 sample_data
-rw-r--r-- 1 root root 1450 Nov 13 23:43 task2_solution.c
-rw-r--r-- 1 root root 1573 Nov 13 23:43 timer.h


# NVIDIA HPC SDK

To install the **nvc** compiler, which compiles OpenACC code, you must install the NVIDIA **HPC SDK**.

The download is per user: you will need to fill in your name, email, and country to get the download links.

Below are the commands that the NVIDIA page provides for the download. These are from Nov/2024; a newer version may be available by now, so it is worth checking the NVIDIA site.

In [4]:
!curl https://developer.download.nvidia.com/hpc-sdk/ubuntu/DEB-GPG-KEY-NVIDIA-HPC-SDK | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1626  100  1626    0     0  12846      0 --:--:-- --:--:-- --:--:-- 12904


In [5]:
!echo 'deb [signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /' | sudo tee /etc/apt/sources.list.d/nvhpc.list

deb [signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /


In [2]:
!sudo apt full-upgrade

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  libcudnn8 libcudnn8-dev libnccl-dev libnccl2
The following packages will be upgraded:
  base-files bash binutils binutils-common binutils-x86-64-linux-gnu bsdutils
  coreutils cuda-compat-12-2 cuda-keyring cuda-toolkit-12-config-common
  cuda-toolkit-config-common dpkg dpkg-dev e2fsprogs libbinutils libblkid1
  libc-bin libctf-nobfd0 libctf0 libdpkg-perl libext2fs2 libgnutls30
  libldap-2.5-0 libmount1 libpam-modules libpam-modules-bin libpam-runtime
  libpam0g libperl5.34 libprocps8 libsmartcols1 libss2 libudev1 linux-libc-dev
  login logsave mount openssl passwd perl perl-base perl-modules-5.34 procps
  tar util-linux
45 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
Need to get 60.4 MB of archives.
After this operation, 196 kB of additional disk space will be used.
Get:1 https://developer.download.nvi

In [6]:
!sudo apt-get update -y
!sudo apt-get install -y nvhpc-24-9

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64  InRelease [2,126 B]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease

In [21]:
!sudo apt install cuda-toolkit-12-2

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  cuda-documentation-12-2 cuda-nsight-12-2 cuda-nsight-systems-12-2
  cuda-nvvp-12-2 cuda-tools-12-2 cuda-visual-tools-12-2 default-jre
  default-jre-headless fonts-dejavu-core fonts-dejavu-extra gds-tools-12-2
  libatk-wrapper-java libatk-wrapper-java-jni libfontenc1 libtinfo5
  libxcb-icccm4 libxcb-image0 libxcb-keysyms1 libxcb-render-util0 libxcb-util1
  libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxkbcommon-x11-0 libxkbfile1
  libxtst6 libxxf86dga1 nsight-systems-2023.2.3 openjdk-11-jre x11-utils
Suggested packages:
  mesa-utils
The following NEW packages will be installed:
  cuda-documentation-12-2 cuda-nsight-12-2 cuda-nsight-systems-12-2
  cuda-nvvp-12-2 cuda-toolkit-12-2 cuda-tools-12-2 cuda-visual-tools-12-2
  default-jre default-jre-headless fonts-dejavu-core fonts-dejavu-extra
  gds-tools-12-2 libatk-wrapper-java libatk-wrapp

## Downloading and installing the NVIDIA HPC SDK packages manually

In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9_20.9_amd64.deb

--2020-11-13 22:59:06--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2117119120 (2.0G) [application/x-deb]
Saving to: ‘nvhpc-20-9_20.9_amd64.deb’


2020-11-13 22:59:27 (96.2 MB/s) - ‘nvhpc-20-9_20.9_amd64.deb’ saved [2117119120/2117119120]



In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-2020_20.9_amd64.deb

--2020-11-13 22:59:32--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-2020_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1272 (1.2K) [application/x-deb]
Saving to: ‘nvhpc-2020_20.9_amd64.deb’


2020-11-13 22:59:32 (73.0 MB/s) - ‘nvhpc-2020_20.9_amd64.deb’ saved [1272/1272]



In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9-cuda-multi_20.9_amd64.deb

--2020-11-13 22:59:32--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9-cuda-multi_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1558347920 (1.5G) [application/x-deb]
Saving to: ‘nvhpc-20-9-cuda-multi_20.9_amd64.deb’


2020-11-13 23:00:07 (42.1 MB/s) - ‘nvhpc-20-9-cuda-multi_20.9_amd64.deb’ saved [1558347920/1558347920]



In [None]:
!sudo apt-get install ./nvhpc-20-9_20.9_amd64.deb ./nvhpc-2020_20.9_amd64.deb ./nvhpc-20-9-cuda-multi_20.9_amd64.deb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'nvhpc-20-9' instead of './nvhpc-20-9_20.9_amd64.deb'
Note, selecting 'nvhpc-2020' instead of './nvhpc-2020_20.9_amd64.deb'
Note, selecting 'nvhpc-20-9-cuda-multi' instead of './nvhpc-20-9-cuda-multi_20.9_amd64.deb'
The following NEW packages will be installed:
  nvhpc-20-9 nvhpc-20-9-cuda-multi nvhpc-2020
0 upgraded, 3 newly installed, 0 to remove and 12 not upgraded.
Need to get 0 B/3,675 MB of archives.
After this operation, 10.1 GB of additional disk space will be used.
Get:1 /content/nvhpc-2020_20.9_amd64.deb nvhpc-2020 amd64 20.9 [1,272 B]
Get:2 /content/nvhpc-20-9_20.9_amd64.deb nvhpc-20-9 amd64 20.9 [2,117 MB]
Get:3 /content/nvhpc-20-9-cuda-multi_20.9_amd64.deb nvhpc-20-9-cuda-multi amd64 20.9 [1,558 MB]
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/

## Compiling

Unfortunately, I was not able to change the PATH to add the directory where the compilers are installed. I tried several approaches and none worked; I believe this is a restriction of the Colab environment.

For that reason, we have to invoke the compiler by its full path.
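One likely explanation, which this notebook did not verify, is that each `!` line in a notebook runs in its own subshell, so an `export` made on one line is gone by the next. Inside a single `%%bash` cell all lines share one shell, so the short name can be used there. The sketch below assumes the 24.9 install directory listed in the cells that follow; the variable name `NVHPC_BIN` is only illustrative.

```shell
# Assumed install location (SDK 24.9); adjust for another SDK version.
NVHPC_BIN=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin

# Prepend it to PATH. This lasts only for the rest of this shell,
# i.e. for the remaining lines of the same %%bash cell.
export PATH="$NVHPC_BIN:$PATH"

# Report whether the compiler is now reachable by its short name.
if command -v nvc >/dev/null 2>&1; then
    echo "nvc found at: $(command -v nvc)"
else
    echo "nvc not found; check that $NVHPC_BIN exists"
fi
```

Any compile command placed after the `export` in the same cell could then call `nvc` directly; other cells would still need the full path.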

In [7]:
!ls -l /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvaccelinfo

-rwxr-xr-x 1 root root 62192 Sep 23 23:31 /opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvaccelinfo


In [8]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvaccelinfo


CUDA Driver Version:           12020
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  535.104.05  Sat Aug 19 01:15:15 UTC 2023

Device Number:                 0
Device Name:                   Tesla T4
Device Revision Number:        7.5
Global Memory Size:            15835660288
Number of Multiprocessors:     40
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1590 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate: 

In [9]:
!nvidia-smi

Wed Nov 13 23:48:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [10]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -Minfo -target=gpu -fast -o task2_out task2_solution.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     24, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     32, StartTimer inlined, size=2 (inline) file task2_solution.c (38)
     35, Loop not vectorized/parallelized: potential early exits
     38, Generating implicit copyin(A[:][:]) [if not already present]
         Generating implicit copyout(Anew[1:1022][1:1022]) [if not already present]
         Generating implicit copyout(A[1:1022][1:1022]) [if not already present]
     41, Loop is parallelizable
     43, Loop is parallelizable
         Generating NVIDIA GPU code
         41, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(max:error)
         43,   /* blockIdx.x threadIdx.x auto-collapsed */
     51, Loop is paralleliz

In [11]:
!./task2_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.288734 s


# Running the examples from the zip file `openacc_files.zip`

You need to upload each file to Colab, or upload the .zip and extract it.

To compile, use the same parameters as above, where I compiled `task2_solution.c`.
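As a rough sketch of those two steps in one `%%bash` cell: the block below extracts the archive if it has been uploaded, then builds the same command line used for `task2_solution.c`. It assumes the sources sit at the top level of the archive, and the `_out` output name is just a convention chosen here for illustration.

```shell
# Compiler path as used throughout this notebook (SDK 24.9).
NVC=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc

# Extract the archive if it has been uploaded (-o overwrites without prompting).
if [ -f openacc_files.zip ]; then
    unzip -o openacc_files.zip
fi

# Build the command line for one source file; the output name is derived from it.
SRC=task1.c
OUT="${SRC%.c}_out"
CMD="$NVC -acc -Minfo -target=gpu -fast -o $OUT $SRC"
echo "$CMD"

# Run the compiler only when both it and the source file are present.
if [ -x "$NVC" ] && [ -f "$SRC" ]; then
    $CMD
fi
```

Changing `SRC` to another file from the archive reuses the same flags; the OpenMP variants in the sections below are built with `-mp` instead of `-acc -target=gpu`.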


# Task 1

## Benchmarking

In [16]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -fast -Minfo -target=gpu -acc -o task1_pre_out task1.c

GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     26, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     34, StartTimer inlined, size=2 (inline) file task1.c (38)
     43, Generated vector simd code for the loop containing reductions
     53, Recognized memory copy idiom
     63, FMA (fused multiply-add) instruction(s) generated
     64, GetTimer inlined, size=9 (inline) file task1.c (55)


In [17]:
!./task1_pre_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 1.199594 s


In [18]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -fast -mp -Minfo -o task1_omp task1_omp.c

GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     25, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     33, StartTimer inlined, size=2 (inline) file task1_omp.c (38)
     36, Loop not vectorized/parallelized: potential early exits
     41, #omp parallel
         41, Generating reduction(max:error)
     43, Loop not vectorized/parallelized: not countable
     49, Loop not vectorized/parallelized: not countable
     52, #omp parallel
     54, Loop not vectorized/parallelized: not countable
     63, FMA (fused multiply-add) instruction(s) generated
     65, GetTimer inlined, size=9 (inline) file task1_omp.c (55)


In [19]:
!./task1_omp

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.619319 s


In [20]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -fast -Minfo -o task1_simple task1_simple.c

GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     25, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     33, StartTimer inlined, size=2 (inline) file task1_simple.c (38)
     42, Generated vector simd code for the loop containing reductions
     52, Recognized memory copy idiom
     63, GetTimer inlined, size=9 (inline) file task1_simple.c (55)
     63, FMA (fused multiply-add) instruction(s) generated


In [21]:
!./task1_simple

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 1.221149 s


# Task 2

In [25]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc --help


Overall switches:
-###                Display commands that would be executed
-#                  Display each command as it is run
-c                  Stop after assemble; output in object file
-[no]defaultoptions Use default options from configuration file
--diag_error<arg>   Override the severity of a diagnostic
--diag_remark<arg>  Override the severity of a diagnostic
--diag_suppress<arg>
                    Override the severity of a diagnostic
--display_error_number
                    Display error message numbers
-dryrun             Display commands that would be executed
-drystdinc          Display standard include directories and exit
-dumpversion        Display compiler short version
-echo[=go|stop]     Echo the command line flags and stop (default) or continue (=go). This is useful when the compiler is invoked by a script.
--flagcheck         Don't compile anything, just emit error messages for command-line switches
-flags              Show all compiler switches
-f[no-]str

In [26]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -Minfo -target=gpu -fast -o task2_out task2.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     23, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     31, StartTimer inlined, size=2 (inline) file task2.c (38)
     34, Loop not vectorized/parallelized: potential early exits
     36, Generating implicit copyin(A[:][:]) [if not already present]
         Generating implicit copyout(Anew[1:1022][1:1022]) [if not already present]
         Generating implicit copy(error) [if not already present]
     39, Loop is parallelizable
     41, Loop is parallelizable
         Generating NVIDIA GPU code
         39, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(max:error)
         41,   /* blockIdx.x threadIdx.x auto-collapsed */
     47, Generating implicit copyin(Anew[1:1022][1:

In [27]:
!./task2_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 4.622393 s


# Task 3 - Data movement

In [29]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -fast -target=gpu  -Minfo=accel -o task3_out task3.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


main:
     37, Generating copy(Anew[:][:],A[:][:]) [if not already present]
     43, Loop is parallelizable
     45, Loop is parallelizable
         Generating NVIDIA GPU code
         43, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(max:error)
         45,   /* blockIdx.x threadIdx.x auto-collapsed */
     53, Loop is parallelizable
     55, Loop is parallelizable
         Generating NVIDIA GPU code
         53, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         55,   /* blockIdx.x threadIdx.x auto-collapsed */


In [30]:
!./task3_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.467812 s


# Task 4

In [36]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -fast -target=gpu -Minfo=accel -o task4_out task4.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


main:
     35, Generating copyin(Anew[:][:]) [if not already present]
         Generating copy(A[:][:]) [if not already present]
     40, Generating implicit copy(error) [if not already present]
     42, Loop is parallelizable
     44, Loop is parallelizable
         Generating NVIDIA GPU code
         42, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
             Generating reduction(max:error)
         44, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     53, Loop is parallelizable
     55, Loop is parallelizable
         Generating NVIDIA GPU code
         53, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */


In [37]:
!./task4_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.401029 s


# Task 4 - comparing with OpenMP - 4096 x 4096

In [27]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -fast -mp -Minfo -o task4_4096_omp task4_4096_omp.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


GetTimer:
     64, FMA (fused multiply-add) instruction(s) generated
main:
     23, Loop not fused: function call before adjacent loop
         Loop not vectorized: unprofitable for target
         Loop unrolled 8 times
     31, StartTimer inlined, size=2 (inline) file task4_4096_omp.c (38)
     34, Loop not vectorized/parallelized: potential early exits
     39, #omp parallel
         39, Generating reduction(max:error)
     41, Generated vector simd code for the loop containing reductions
     47, Loop not vectorized/parallelized: not countable
     50, #omp parallel
     52, Recognized memory copy idiom
     63, GetTimer inlined, size=9 (inline) file task4_4096_omp.c (55)
     63, FMA (fused multiply-add) instruction(s) generated


In [28]:
!./task4_4096_omp

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 24.557304 s


## Task 4 with OpenACC - 4096 x 4096

In [22]:
%%writefile task4_4096acc.c
#include <stdio.h>
#include <math.h>
#include <string.h>
#include "timer.h"

#define NN 4096
#define NM 4096

float A[NN][NM];
float Anew[NN][NM];

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;

    const double tol = 1.0e-6;
    double error     = 1.0;

    memset(A, 0, n * m * sizeof(float));
    memset(Anew, 0, n * m * sizeof(float));

    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }

    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    StartTimer();
    int iter = 0;

    #pragma acc data copy(A) copyin(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

        #pragma acc kernels
        {
            #pragma acc loop gang(4) vector(4) reduction(max:error)
            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                        + A[j-1][i] + A[j+1][i]);
                    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
                }
            }

            #pragma acc loop gang(4) vector(4)
            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    A[j][i] = Anew[j][i];
                }
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

        iter++;

    }

    double runtime = GetTimer();

    printf(" total: %f s\n", runtime / 1000);

    return 0;
}


Writing task4_4096acc.c


In [23]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin/nvc -acc -fast -target=gpu -Minfo=accel -o task4_4096acc_out task4_4096acc.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^


main:
     35, Generating copyin(Anew[:][:]) [if not already present]
         Generating copy(A[:][:]) [if not already present]
     40, Generating implicit copy(error) [if not already present]
     42, Loop is parallelizable
     44, Loop is parallelizable
         Generating NVIDIA GPU code
         42, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
             Generating reduction(max:error)
         44, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     53, Loop is parallelizable
     55, Loop is parallelizable
         Generating NVIDIA GPU code
         53, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */


In [24]:
!./task4_4096acc_out

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 2.054522 s


In [None]:
!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2200.000
cache size	: 56320 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips	: 4400.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 b

In [30]:
!OMP_NUM_THREADS=4 ./task4_4096_omp

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 25.949361 s


In [33]:
%%bash
export NVC_ACC_TIME=1
./task4_4096acc_out

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 2.080366 s


# Speedup calculation


Build a table showing the execution time of each program and the speedup achieved through parallelization.
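As a starting point for the table, speedup is usually taken as the ratio of a reference time to the parallel time. The two worked values below use times recorded in this notebook. They assume that `task1.c` is the unparallelized 1024 x 1024 baseline (its compiler output shows no GPU code being generated) and that, for the 4096 x 4096 mesh, the OpenMP run serves as the reference, since no serial run of that size was recorded.

```latex
\begin{align*}
S &= \frac{T_{\text{reference}}}{T_{\text{parallel}}} \\[4pt]
S_{1024} &= \frac{T_{\texttt{task1}}}{T_{\texttt{task4}}}
          = \frac{1.199594\ \text{s}}{0.401029\ \text{s}} \approx 2.99 \\[4pt]
S_{4096} &= \frac{T_{\text{OpenMP}}}{T_{\text{OpenACC}}}
          = \frac{24.557304\ \text{s}}{2.054522\ \text{s}} \approx 11.95
\end{align*}
```

A value below 1 means a slowdown: by the same formula, Task 2 (4.622393 s) comes out at roughly 0.26 relative to the Task 1 time.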