# Running OpenACC on Google Colab

To check that the procedure works, I will test it with the program `task2_solution.c`, which was part of an NVIDIA course.

## Runtime environment

You must select a runtime with a **GPU** in Colab to be able to run the program.
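
A quick way to confirm that the selected runtime exposes a GPU is to look for the `nvidia-smi` tool, which this notebook calls later on. The sketch below assumes that finding the tool on the `PATH` is a good enough proxy for a GPU runtime; the helper name `has_nvidia_smi` is made up for illustration.

```python
import shutil


def has_nvidia_smi(which=shutil.which):
    """Return True if the nvidia-smi executable can be found on the PATH.

    `which` is injectable so the check can be exercised without a GPU.
    """
    return which("nvidia-smi") is not None


if __name__ == "__main__":
    if has_nvidia_smi():
        print("nvidia-smi found: this looks like a GPU runtime.")
    else:
        print("nvidia-smi not found: select a GPU runtime before continuing.")
```

If the second message appears, change the runtime type in Colab and rerun the cell before going further.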

The following cells write the files `timer.h` and `task2_solution.c` to Colab's default directory so that we can compile them.

In [None]:
%%writefile timer.h

/*
 *  Copyright 2012 NVIDIA Corporation
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

#ifndef TIMER_H
#define TIMER_H

#include <stdio.h>
#include <stdlib.h>
#ifdef WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#else
#include <sys/time.h>
#endif

#ifdef WIN32
double PCFreq = 0.0;
__int64 timerStart = 0;
#else
struct timeval timerStart;
#endif

void StartTimer()
{
#ifdef WIN32
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        printf("QueryPerformanceFrequency failed!\n");

    PCFreq = (double)li.QuadPart/1000.0;

    QueryPerformanceCounter(&li);
    timerStart = li.QuadPart;
#else
    gettimeofday(&timerStart, NULL);
#endif
}

// time elapsed in ms
double GetTimer()
{
#ifdef WIN32
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return (double)(li.QuadPart-timerStart)/PCFreq;
#else
    struct timeval timerStop, timerElapsed;
    gettimeofday(&timerStop, NULL);
    timersub(&timerStop, &timerStart, &timerElapsed);
    return timerElapsed.tv_sec*1000.0+timerElapsed.tv_usec/1000.0;
#endif
}

#endif // TIMER_H


Writing timer.h


In [None]:
%%writefile task2_solution.c

#include <math.h>
#include <string.h>
#include "timer.h"

#define NN 1024
#define NM 1024

float A[NN][NM];
float Anew[NN][NM];

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;
    
    const double tol = 1.0e-6;
    double error     = 1.0;
    
    memset(A, 0, n * m * sizeof(float));
    memset(Anew, 0, n * m * sizeof(float));
        
    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }
    
    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
    
    StartTimer();
    int iter = 0;
    
    while ( error > tol && iter < iter_max )
    {
        #pragma acc kernels
        {
            error = 0.0;
          
            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                        + A[j-1][i] + A[j+1][i]);
                    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
                }
            }
            
            for( int j = 1; j < n-1; j++)
            {
                for( int i = 1; i < m-1; i++ )
                {
                    A[j][i] = Anew[j][i];    
                }
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        
        iter++;
     
    }

    double runtime = GetTimer();

    printf(" total: %f s\n", runtime / 1000);

    return 0;
}


Overwriting task2_solution.c


In [None]:
!ls -l

total 8
drwxr-xr-x 1 root root 4096 Nov  6 17:30 sample_data
-rw-r--r-- 1 root root 1516 Nov 15 18:57 task2_solution.c


## Downloading and installing the NVIDIA HPC SDK packages

In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9_20.9_amd64.deb

--2020-11-13 22:59:06--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2117119120 (2.0G) [application/x-deb]
Saving to: ‘nvhpc-20-9_20.9_amd64.deb’


2020-11-13 22:59:27 (96.2 MB/s) - ‘nvhpc-20-9_20.9_amd64.deb’ saved [2117119120/2117119120]



In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-2020_20.9_amd64.deb

--2020-11-13 22:59:32--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-2020_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1272 (1.2K) [application/x-deb]
Saving to: ‘nvhpc-2020_20.9_amd64.deb’


2020-11-13 22:59:32 (73.0 MB/s) - ‘nvhpc-2020_20.9_amd64.deb’ saved [1272/1272]



In [None]:
!wget https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9-cuda-multi_20.9_amd64.deb

--2020-11-13 22:59:32--  https://developer.download.nvidia.com/hpc-sdk/20.9/nvhpc-20-9-cuda-multi_20.9_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1558347920 (1.5G) [application/x-deb]
Saving to: ‘nvhpc-20-9-cuda-multi_20.9_amd64.deb’


2020-11-13 23:00:07 (42.1 MB/s) - ‘nvhpc-20-9-cuda-multi_20.9_amd64.deb’ saved [1558347920/1558347920]



In [None]:
!sudo apt-get install ./nvhpc-20-9_20.9_amd64.deb ./nvhpc-2020_20.9_amd64.deb ./nvhpc-20-9-cuda-multi_20.9_amd64.deb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'nvhpc-20-9' instead of './nvhpc-20-9_20.9_amd64.deb'
Note, selecting 'nvhpc-2020' instead of './nvhpc-2020_20.9_amd64.deb'
Note, selecting 'nvhpc-20-9-cuda-multi' instead of './nvhpc-20-9-cuda-multi_20.9_amd64.deb'
The following NEW packages will be installed:
  nvhpc-20-9 nvhpc-20-9-cuda-multi nvhpc-2020
0 upgraded, 3 newly installed, 0 to remove and 12 not upgraded.
Need to get 0 B/3,675 MB of archives.
After this operation, 10.1 GB of additional disk space will be used.
Get:1 /content/nvhpc-2020_20.9_amd64.deb nvhpc-2020 amd64 20.9 [1,272 B]
Get:2 /content/nvhpc-20-9_20.9_amd64.deb nvhpc-20-9 amd64 20.9 [2,117 MB]
Get:3 /content/nvhpc-20-9-cuda-multi_20.9_amd64.deb nvhpc-20-9-cuda-multi amd64 20.9 [1,558 MB]
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/

## Compiling

Unfortunately, I could not change the PATH to add the directory where the compilers are installed. I tried several approaches and none of them worked; I believe this is a restriction of the Colab environment.

For that reason, we have to invoke the compiler by its full path.
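
One likely explanation, offered as an assumption and not verified in this notebook: each `!` command runs in its own temporary shell, so an `export PATH=...` made in one cell is gone by the next. Changing `os.environ` from a Python cell modifies the notebook process itself, and later `!` commands inherit that environment. The sketch below uses the compiler directory from the commands in this section; `prepend_to_path` is a hypothetical helper.

```python
import os

# Directory taken from the full compiler path used in this notebook.
NVHPC_BIN = "/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin"


def prepend_to_path(directory, env=os.environ):
    """Put `directory` at the front of env["PATH"] unless it is already listed."""
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in entries:
        entries.insert(0, directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


if __name__ == "__main__":
    prepend_to_path(NVHPC_BIN)
    print(NVHPC_BIN in os.environ["PATH"].split(os.pathsep))
```

If this works in your session, a later cell such as `!nvc --version` should resolve without the full path. The full-path commands used in the rest of the notebook work either way.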

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc --version


nvc 20.9-0 LLVM 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.


In [None]:
!nvidia-smi

Fri Nov 13 23:09:35 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -Minfo -gpu=cuda10.1 -fast -o task2_out task2_solution.c

  #endif // TIMER_H
                   ^

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

  }
   ^

GetTimer:
      4, include "timer.h"
          64, FMA (fused multiply-add) instruction(s) generated
main:
     24, Loop not fused: function call before adjacent loop
         Loop unrolled 8 times
     32, StartTimer inlined, size=2 (inline) file task2_solution.c (38)
     38, Generating implicit copyout(Anew[1:1022][1:1022]) [if not already present]
         Generating implicit copyin(A[:][:]) [if not already present]
         Generating implicit copyout(A[1:1022][1:1022]) [if not already present]
     41, Loop is parallelizable
     43, Loop is parallelizable
         Generating Tesla code
         41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     43, Loop not vectorized: mixed data types
     51, Loop is p

In [None]:
!./task2_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 4.547713 s


# Task 1

## Benchmarking

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -fast -o task1_pre_out task1/task1.c

  }
   ^



In [None]:
!./task1_pre_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.460064 s


In [None]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -fast -mp -Minfo -o task1_omp task1/task1_omp.c

GetTimer:
      3, include "timer.h"
          63, FMA (fused multiply-add) instruction(s) generated
main:
     25, Loop not fused: function call before adjacent loop
         Loop unrolled 8 times
     33, StartTimer inlined, size=2 (inline) file task1/task1_omp.c (37)
     36, Loop not vectorized/parallelized: potential early exits
     41, Parallel region activated
         Parallel loop activated with static block schedule
     43, Loop not vectorized/parallelized: not countable
     49, Loop not vectorized/parallelized: contains a parallel region
     52, Parallel region activated
         Parallel loop activated with static block schedule
     54, Loop not vectorized/parallelized: not countable
     62, FMA (fused multiply-add) instruction(s) generated
     65, GetTimer inlined, size=9 (inline) file task1/task1_omp.c (54)


In [None]:
!./task1_omp

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.336994 s


In [None]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -fast -Minfo -o task1_simple task1/task1_simple.c

          newline
  }
   ^

GetTimer:
      3, include "timer.h"
          63, FMA (fused multiply-add) instruction(s) generated
main:
     25, Loop not fused: function call before adjacent loop
         Loop unrolled 8 times
     33, StartTimer inlined, size=2 (inline) file task1/task1_simple.c (37)
     42, Loop not vectorized: mixed data types
     52, Memory copy idiom, loop replaced by call to __c_mcopy4
     62, FMA (fused multiply-add) instruction(s) generated
     63, GetTimer inlined, size=9 (inline) file task1/task1_simple.c (54)


In [None]:
!./task1_simple

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 3.545271 s


# Task 2

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -Minfo -gpu=cuda10.1 -fast -o task2_out task2/task2_solution.c

          implicitly
      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

GetTimer:
      3, include "timer.h"
          63, FMA (fused multiply-add) instruction(s) generated
main:
     23, Loop not fused: function call before adjacent loop
         Loop unrolled 8 times
     31, StartTimer inlined, size=2 (inline) file task2/task2_solution.c (37)
     37, Generating implicit copyout(Anew[1:1022][1:1022]) [if not already present]
         Generating implicit copyin(A[:][:]) [if not already present]
         Generating implicit copyout(A[1:1022][1:1022]) [if not already present]
     40, Loop is parallelizable
     42, Loop is parallelizable
         Generating Tesla code
         40, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         42, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     42, Loop not vectorized: mixed data types
     50, Loop is parallelizable
     52, Lo

In [None]:
!./task2_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 4.490865 s


# Task 3 - Data Movement

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -fast -gpu=cuda10.1  -Minfo=accel -o task3_out task3/task3_solution.c

          implicitly
      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

main:
     35, Generating copyin(Anew[:][:]) [if not already present]
         Generating copy(A[:][:]) [if not already present]
     41, Loop is parallelizable
     43, Loop is parallelizable
         Generating Tesla code
         41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     51, Loop is parallelizable
     53, Loop is parallelizable
         Generating Tesla code
         51, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         53, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */


In [None]:
!./task3_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.431402 s


# Task 4

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -fast -gpu=cuda10.1 -Minfo=accel -o task4_out_task3 task4/task4.c

      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

main:
     35, Generating copy(A[:][:]) [if not already present]
         Generating copyin(Anew[:][:]) [if not already present]
     40, Generating implicit copy(error) [if not already present]
     42, Loop is parallelizable
     44, Loop is parallelizable
         Generating Tesla code
         42, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
             Generating reduction(max:error)
         44, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     53, Loop is parallelizable
     55, Loop is parallelizable
         Generating Tesla code
         53, #pragma acc loop gang(4), vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */


In [None]:
!./task4_out_task3

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.390342 s


### With `gang` information to improve performance

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -fast -gpu=cuda10.1 -Minfo=accel -o task4_out task4/task4_solution.c

          implicitly
      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

main:
     35, Generating create(Anew[:][:]) [if not already present]
         Generating copy(A[:][:]) [if not already present]
     41, Loop is parallelizable
     44, Loop is parallelizable
         Generating Tesla code
         41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         44, #pragma acc loop gang(8), vector(32) /* blockIdx.x threadIdx.x */
     52, Loop is parallelizable
     55, Loop is parallelizable
         Generating Tesla code
         52, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang(8), vector(32) /* blockIdx.x threadIdx.x */


In [None]:
!./task4_out

Jacobi relaxation Calculation: 1024 x 1024 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.362704 s


# Task 4 - comparing with OpenMP - 4096 x 4096

In [None]:
%%bash
/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -fast -mp -Minfo -o task4_4096_omp task4/task4_4096_omp.c

          implicitly
      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

GetTimer:
      3, include "timer.h"
          63, FMA (fused multiply-add) instruction(s) generated
main:
     23, Loop not fused: function call before adjacent loop
         Loop unrolled 8 times
     31, StartTimer inlined, size=2 (inline) file task4/task4_4096_omp.c (37)
     34, Loop not vectorized/parallelized: potential early exits
     39, Parallel region activated
         Parallel loop activated with static block schedule
     41, Loop not vectorized: mixed data types
     47, Loop not vectorized/parallelized: contains a parallel region
     50, Parallel region activated
         Parallel loop activated with static block schedule
     52, Memory copy idiom, loop replaced by call to __c_mcopy4
     62, FMA (fused multiply-add) instruction(s) generated
     63, GetTimer inlined, size=9 (inline) file task4/task4_4096_omp.c (54)


In [None]:
!./task4_4096_omp

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 53.128543 s


In [None]:
!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2200.000
cache size	: 56320 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa
bogomips	: 4400.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 b

In [None]:
!OMP_NUM_THREADS=8 ./task4_4096_omp

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 56.965805 s


# Task 4 with OpenACC - 4096 x 4096

In [None]:
!/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc -acc -fast -gpu=cuda10.1 -Minfo=accel -o task4_4096_out task4/task4_4096_solution.c

          implicitly
      printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
      ^

main:
     35, Generating copyin(Anew[:][:]) [if not already present]
         Generating copy(A[:][:]) [if not already present]
     41, Loop is parallelizable
     44, Loop is parallelizable
         Generating Tesla code
         41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
             Generating implicit reduction(max:error)
         44, #pragma acc loop gang(8), vector(32) /* blockIdx.x threadIdx.x */
     52, Loop is parallelizable
     55, Loop is parallelizable
         Generating Tesla code
         52, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang(8), vector(32) /* blockIdx.x threadIdx.x */


In [None]:
!./task4_4096_out

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 1.764997 s


In [None]:
%%bash
export NVC_ACC_TIME=1
./task4_4096_out

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 1.754473 s


# Speedup calculation


Build a table showing the execution time of each program and the speedup achieved through parallelization.
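
One possible way to build such a table is sketched below. Speedup is taken as the baseline time divided by the program's time. The times are copied from the ` total:` lines printed earlier in this notebook, so they will differ between sessions. The choice of baselines is an assumption: the serial `task1.c` run for the 1024 x 1024 mesh and, because no serial 4096 x 4096 run was recorded, the default OpenMP run for the larger mesh. The helper names are hypothetical.

```python
# Times in seconds, copied from the " total:" lines of the runs above.
TIMES_1024 = {
    "task1.c": 3.460064,
    "task1_omp.c": 3.336994,
    "task1_simple.c": 3.545271,
    "task2_solution.c": 4.490865,
    "task3_solution.c": 0.431402,
    "task4.c": 0.390342,
    "task4_solution.c": 0.362704,
}
TIMES_4096 = {
    "task4_4096_omp.c": 53.128543,
    "task4_4096_omp.c (OMP_NUM_THREADS=8)": 56.965805,
    "task4_4096_solution.c": 1.764997,
}


def speedup(baseline_s, time_s):
    """Speedup of a run relative to a baseline: values above 1 mean faster."""
    return baseline_s / time_s


def speedup_table(times, baseline_key):
    """Return (program, seconds, speedup) rows relative to times[baseline_key]."""
    baseline = times[baseline_key]
    return [(name, t, speedup(baseline, t)) for name, t in times.items()]


def print_table(title, rows):
    """Print the rows as a fixed-width text table."""
    print(title)
    print(f"{'program':<40}{'time (s)':>12}{'speedup':>10}")
    for name, seconds, ratio in rows:
        print(f"{name:<40}{seconds:>12.3f}{ratio:>10.2f}")
    print()


if __name__ == "__main__":
    print_table("1024 x 1024, relative to task1.c",
                speedup_table(TIMES_1024, "task1.c"))
    print_table("4096 x 4096, relative to task4_4096_omp.c",
                speedup_table(TIMES_4096, "task4_4096_omp.c"))
```

Replacing the dictionary values with the times from your own session keeps the rest of the sketch unchanged.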