## Reduction: the sum of the elements of an array

In [1]:
#3.3c
import sys

value = 5 * 10**7  # default value

for arg in sys.argv[1:]:
    if arg.isdigit():
        value = int(arg)
        break

print(f"Array size (value) = {value}")


Array size (value) = 50000000


In [2]:
import numpy as np

def reduc_operation(A):
    """Compute the sum of the elements of Array A."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

# Sequential

value = 5*10**7  # note: this overrides any size parsed from sys.argv in the previous cell

X = np.random.rand(value)

# To print the first values of the array

#print(X[0:12])

# Using IPython magic commands

tiempo = %timeit -r 2 -o -q reduc_operation(X)

print("Time taken by reduction operation using a function:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation(X)}\n")


# Using numpy.sum()

tiempo = %timeit -r 2 -o -q np.sum(X)

print("Time taken by reduction operation using numpy.sum():", tiempo)

print("Now, the result using numpy.sum():", np.sum(X),"\n ")


Time taken by reduction operation using a function: 5.21 s ± 218 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999484.57012388

Time taken by reduction operation using numpy.sum(): 18.8 ms ± 36.3 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Now, the result using numpy.sum(): 24999484.570128664 
 


In [3]:
# multiprocessing
from multiprocessing import Pool

def reduc_operation(A):
    s = 0.0
    for i in range(A.size):
        s += A[i]
    return s

def parallel_reduc_operation(A, nprocs):
    # Split the array into nprocs blocks
    chunks = np.array_split(A, nprocs)

    with Pool(processes=nprocs) as pool:
        partial_sums = pool.map(reduc_operation, chunks)

    return sum(partial_sums)

# Tests
for nprocs in [2, 4]:
    print(f"\nRunning with {nprocs} processes")
    tiempo = %timeit -r 2 -o -q parallel_reduc_operation(X, nprocs)
    print(tiempo)



Running with 2 processes
3.25 s ± 29.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
1.8 s ± 17.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)


In [4]:
#numba
from numba import njit, prange

@njit
def reduc_operation_numba(A):
    s = 0.0
    for i in range(A.size):
        s += A[i]
    return s

# Warm-up (needed to trigger JIT compilation)
reduc_operation_numba(X)

tiempo = %timeit -r 2 -o -q reduc_operation_numba(X)
print("Time taken by Numba (sequential):", tiempo)


Time taken by Numba (sequential): 49.1 ms ± 109 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)


In [7]:
# Numba, parallel
@njit(parallel=True)
def reduc_operation_numba_par(A):
    s = 0.0
    for i in prange(A.size):
        s += A[i]
    return s

# Warm-up
reduc_operation_numba_par(X)

tiempo = %timeit -r 2 -o -q reduc_operation_numba_par(X)
print("Time taken by Numba (parallel):", tiempo)


Time taken by Numba (parallel): 11.4 ms ± 6.56 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)


In [6]:
import time

start = time.time()
res = reduc_operation_numba_par(X)
end = time.time()

print("Result:", res)
print("Elapsed time (s):", end - start)


Result: 24999484.570128884
Elapsed time (s): 0.01178884506225586


CPUs per task: 1
NUMBA_NUM_THREADS: 1
------------------------------------
Array size (value) = 100000000
Time taken by reduction operation using a function: 8.66 s ± 93.9 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25001918.175571993

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 1.04 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25001918.17556623


Running with 2 processes
5.55 s ± 2.11 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.06 s ± 3.38 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.7 ms ± 8.51 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 57.7 ms ± 582 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25001918.175571993
Elapsed time (s): 0.057888031005859375
Array size (value) = 1000000000
Time taken by reduction operation using a function: 8.68 s ± 82.5 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25000045.127364773

Time taken by reduction operation using numpy.sum(): 32.2 ms ± 53.2 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25000045.127357215


Running with 2 processes
5.58 s ± 2.17 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.09 s ± 3.9 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.7 ms ± 17 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 57.7 ms ± 15.7 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25000045.127364773
Elapsed time (s): 0.057996273040771484
------------------------------------
finish


CPUs per task: 2
NUMBA_NUM_THREADS: 2
------------------------------------
Array size (value) = 100000000
Time taken by reduction operation using a function: 8.78 s ± 39.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25000627.59099791

Time taken by reduction operation using numpy.sum(): 32.2 ms ± 47.2 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25000627.59099589


Running with 2 processes
5.63 s ± 599 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.15 s ± 46.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.7 ms ± 1.51 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 28.9 ms ± 13 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25000627.590997588
Elapsed time (s): 0.0290985107421875
Array size (value) = 1000000000
Time taken by reduction operation using a function: 8.68 s ± 30.3 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999272.046989053

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 5.72 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24999272.046985712


Running with 2 processes
5.99 s ± 252 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.32 s ± 9.25 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.7 ms ± 3.43 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 28.9 ms ± 2.11 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 24999272.046983533
Elapsed time (s): 0.029165267944335938
------------------------------------
finish


CPUs per task: 4
NUMBA_NUM_THREADS: 4
------------------------------------
Array size (value) = 100000000
Time taken by reduction operation using a function: 8.63 s ± 34.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999704.8633083

Time taken by reduction operation using numpy.sum(): 32 ms ± 1.36 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24999704.863297828


Running with 2 processes
5.98 s ± 2.6 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.3 s ± 12.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.8 ms ± 147 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 17.9 ms ± 2 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24999704.863299146
Elapsed time (s): 0.017487049102783203
Array size (value) = 1000000000
Time taken by reduction operation using a function: 8.62 s ± 155 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24998245.09202447

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 18.2 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24998245.092027344


Running with 2 processes
6.02 s ± 33.8 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.29 s ± 989 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.8 ms ± 145 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 17.4 ms ± 5.63 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24998245.09202436
Elapsed time (s): 0.018390417098999023
------------------------------------
finish


CPUs per task: 8
NUMBA_NUM_THREADS: 8
------------------------------------
Array size (value) = 100000000
Time taken by reduction operation using a function: 8.69 s ± 58.2 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999095.600928925

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 812 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24999095.600928698


Running with 2 processes
5.98 s ± 1.65 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.28 s ± 2.38 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.7 ms ± 4.84 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 12.9 ms ± 290 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24999095.600928318
Elapsed time (s): 0.012570619583129883
Array size (value) = 1000000000
Time taken by reduction operation using a function: 8.52 s ± 40.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25002938.526482098

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 2.52 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25002938.52648666


Running with 2 processes
5.53 s ± 2.46 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

Running with 4 processes
3.06 s ± 1.38 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Time taken by Numba (sequential): 57.9 ms ± 67.3 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Time taken by Numba (parallel): 12.6 ms ± 28.3 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25002938.52648706
Elapsed time (s): 0.014859914779663086
------------------------------------
finish


In section 3.3.a, using the multiprocessing package with Pool reduces the execution time relative to the sequential pure-Python code. For an array of 5⋅10^7 elements, the interactive run goes from 5.21 s in the sequential version to 3.25 s with 2 processes and 1.8 s with 4 processes. On the cluster the same steps take about 8.5 to 8.8 s sequentially, 5.5 to 6.0 s with 2 processes and 3.1 to 3.3 s with 4 processes. The speedup is sublinear because of the overhead of creating the processes and of transferring the array chunks to them.
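
To make the sublinear speedup concrete, the following minimal sketch computes speedup and efficiency from the interactive timings reported above (5.21 s, 3.25 s, 1.8 s). The function name `scaling` and the cost model `T_p = T_1 / p + overhead` are assumptions introduced here for illustration. They are not part of the original exercise, and the model lumps process creation, data transfer and any load imbalance into a single overhead term.

```python
def scaling(t_seq, t_par):
    """Return {p: (speedup, efficiency, overhead_s)} for each process count p.

    Assumed model (illustrative only): T_p = T_1 / p + overhead.
    """
    results = {}
    for p, t in t_par.items():
        speedup = t_seq / t
        efficiency = speedup / p
        overhead = t - t_seq / p  # time not explained by ideal scaling
        results[p] = (speedup, efficiency, overhead)
    return results


# Timings in seconds, copied from the interactive run above.
results = scaling(5.21, {2: 3.25, 4: 1.8})

for p, (speedup, efficiency, overhead) in results.items():
    print(f"{p} processes: speedup {speedup:.2f}, "
          f"efficiency {efficiency:.2f}, overhead ~{overhead:.2f} s")
```

Under this simple model the part of the run time not explained by ideal scaling is roughly half a second in both cases, which is small next to the sequential time but large next to the few milliseconds that the compiled versions need.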

In section 3.3.b, optimizing with Numba gives a much larger improvement. The sequential version compiled with @njit brings the execution time down to about 50 to 60 ms (49.1 ms interactively, 57.7 ms on the cluster). Enabling parallelism with @njit(parallel=True) and prange reduces it further, to about 10 to 15 ms in the interactive run with several cores (11.4 ms in the run above).
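
Conceptually, a parallel reduction of this kind has each worker accumulate a partial sum over its own block of the array, and the partial sums are combined at the end. The pure-Python sketch below imitates that pattern sequentially. It is not Numba's actual implementation, and `chunked_sum` is a hypothetical helper named here for illustration. It also shows why the sums printed by the different versions above agree only to a limited number of digits: floating-point addition is not associative, so changing the order of the additions can change the last bits of the result.

```python
import math
import random


def chunked_sum(values, nchunks):
    """Sum `values` by reducing `nchunks` contiguous blocks, then combining."""
    n = len(values)
    # Block boundaries: block k covers values[bounds[k]:bounds[k + 1]].
    bounds = [n * k // nchunks for k in range(nchunks + 1)]
    partials = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        s = 0.0
        for x in values[lo:hi]:  # the work one thread would do
            s += x
        partials.append(s)
    total = 0.0
    for s in partials:  # final combination of the partial sums
        total += s
    return total


random.seed(0)
data = [random.random() for _ in range(100_000)]

one_block = chunked_sum(data, 1)    # plain left-to-right sum
four_blocks = chunked_sum(data, 4)  # same data, different summation order

print(one_block, four_blocks)
print("difference:", four_blocks - one_block)
```

The two results may differ in their last digits, but they agree to well within the rounding error expected for a sum of this length.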

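Finally, in section 3.3.c, running on the mendel queue through SLURM confirms that the parallel Numba version scales well. The job logs report nominal array sizes of 10^8 and 10^9, but the code shown in In [2] resets value to 5⋅10^7. That would explain why every sum is close to 2.5⋅10^7 and why the timings are the same for both nominal sizes, so the measurements most likely correspond to 5⋅10^7 elements. The execution time drops from ~0.058 s with 1 core to ~0.029 s with 2 cores, ~0.018 s with 4 cores and ~0.013 s with 8 cores, that is, speedups of about 2.0, 3.2 and 4.5. Scaling is close to ideal for few cores and degrades as the number of CPUs grows, because the reduction is memory-bound.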
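
The scaling figures can be checked with a short calculation. The sketch below takes the parallel Numba timings from the SLURM logs above and derives speedup, parallel efficiency and an effective read bandwidth. The helper name `scaling_table` is hypothetical. The array length of 5⋅10^7 float64 elements is an assumption inferred from the reported sums, not a value stated in the job logs.

```python
def scaling_table(t_ms, bytes_read):
    """Return rows (threads, speedup, efficiency, GB/s) from timings in ms."""
    t1 = t_ms[1]
    rows = []
    for p in sorted(t_ms):
        t = t_ms[p]
        speedup = t1 / t
        efficiency = speedup / p
        bandwidth_gbs = bytes_read / (t * 1e-3) / 1e9  # bytes/s -> GB/s
        rows.append((p, speedup, efficiency, bandwidth_gbs))
    return rows


# Numba parallel timings (ms) from the first size of each SLURM run above.
t_ms = {1: 57.7, 2: 28.9, 4: 17.9, 8: 12.9}

# Assumption: 5 * 10**7 float64 values (8 bytes each) are read once per sum.
n_elements = 5 * 10**7
bytes_read = n_elements * 8

rows = scaling_table(t_ms, bytes_read)

print("threads  speedup  efficiency  GB/s")
for p, speedup, efficiency, bw in rows:
    print(f"{p:7d}  {speedup:7.2f}  {efficiency:10.2f}  {bw:5.1f}")
```

Efficiency is the speedup divided by the number of threads, so a value near 1 means near-ideal scaling. The bandwidth column is only an estimate of how fast the array is streamed from memory, and it is meaningful only if the assumed array length is correct.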