In [6]:
import sys
import numpy as np

# Read the array size from the command line
if len(sys.argv) > 1:
    value = int(sys.argv[1])
else:
    value = 5*10**7

print(f"Executing with value = {value}\n")

ValueError: invalid literal for int() with base 10: '-f'
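The ValueError above happens because, inside Jupyter, `sys.argv[1]` is not a size. It is the kernel's `-f` option, which is followed by the path of the kernel's connection file. Below is a minimal sketch of a more tolerant parser. It assumes the size is the first positive integer on the command line; `get_size` and `DEFAULT_SIZE` are hypothetical names.

```python
import sys

DEFAULT_SIZE = 5 * 10**7  # hypothetical default, same value the reduction cell hard-codes

def get_size(argv, default=DEFAULT_SIZE):
    """Return the first argument that parses as a positive integer, else `default`.

    Under Jupyter, argv looks like [launcher, '-f', '<connection file>'], so no
    argument parses and the default is returned instead of raising ValueError.
    """
    for arg in argv[1:]:
        try:
            n = int(arg)
        except ValueError:
            continue  # skip options such as '-f' and file paths
        if n > 0:
            return n
    return default

value = get_size(sys.argv)
print(f"Executing with value = {value}\n")
```

When run as a script with an argument (for example `100000000`), the function returns that number. In a notebook it falls back to the default.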

## Reduction: the sum of the elements of an array

In [1]:
import numpy as np

def reduc_operation(A):
    """Compute the sum of the elements of Array A."""
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

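# Sequential version

# Note: this reassignment overrides any size read from the command line in the first cell
value = 5*10**7

# X holds `value` uniform samples from [0, 1), so the expected sum is about value/2
X = np.random.rand(value)

# Uncomment to print the first values of the array

#print(X[0:12])

# Using IPython's %timeit magic (-r 2: two runs, -o: return the result object, -q: quiet)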

tiempo = %timeit -r 2 -o -q reduc_operation(X)

print("Time taken by reduction operation using a function:", tiempo)

print(f"And the result of the sum of numbers in the range [0, value) is: {reduc_operation(X)}\n")


# Using numpy.sum()

tiempo = %timeit -r 2 -o -q np.sum(X)

print("Time taken by reduction operation using numpy.sum():", tiempo)

print("Now, the result using numpy.sum():", np.sum(X),"\n ")


Time taken by reduction operation using a function: 4.86 s ± 34.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24998422.587098632

Time taken by reduction operation using numpy.sum(): 19 ms ± 18.1 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Now, the result using numpy.sum(): 24998422.58709813 
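The loop and numpy.sum() agree only up to the last few digits (24998422.587098632 vs 24998422.58709813). Floating-point addition is not associative, so the result depends on the order of the additions. The loop adds strictly left to right, whereas NumPy's documentation notes that `np.sum` often uses partial pairwise summation. The sketch below compares both with `math.fsum`, which returns the correctly rounded sum. It uses a smaller array, and the seed and size are arbitrary choices.

```python
import math

import numpy as np

rng = np.random.default_rng(0)  # fixed (arbitrary) seed so the run is repeatable
X = rng.random(10**6)           # 10**6 uniform samples in [0, 1)

def loop_sum(A):
    """Strict left-to-right accumulation, like reduc_operation."""
    s = 0.0
    for x in A.tolist():  # tolist() yields Python floats, which iterate faster
        s += x
    return s

naive = loop_sum(X)
pairwise = float(np.sum(X))  # NumPy may use partial pairwise summation here
exact = math.fsum(X)         # correctly rounded reference value

print(f"loop   : {naive!r}  abs. error {abs(naive - exact):.2e}")
print(f"np.sum : {pairwise!r}  abs. error {abs(pairwise - exact):.2e}")
```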
 


## Multiprocessing with Pool

In [2]:
from multiprocessing import Pool
import os

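# Number of cores assigned by SLURM, read from OMP_NUM_THREADS (defaults to 1)
num_threads = int(os.environ.get('OMP_NUM_THREADS', 1))

def reduc_operation_chunk(A):
    """Sequential sum of one chunk; runs inside a worker process."""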
    s = 0
    for i in range(A.size):
        s += A[i]
    return s

def parallel_reduction(A, num_processes):
    chunk_size = A.size // num_processes
    chunks = [A[i*chunk_size:(i+1)*chunk_size] for i in range(num_processes)]
    
    # If the size is not divisible, the last chunk also takes the remaining elements
    if A.size % num_processes != 0:
        chunks[-1] = A[(num_processes-1)*chunk_size:]
    
    with Pool(num_processes) as pool:
        results = pool.map(reduc_operation_chunk, chunks)
    
    return sum(results)

print("Multiprocessing with {} processes:".format(num_threads))
tiempo = %timeit -r 2 -o -q parallel_reduction(X, num_threads)
print("Time taken:", tiempo)
print("Result: {}\n".format(parallel_reduction(X, num_threads)))

Multiprocessing with 1 processes:
Time taken: 6.43 s ± 56.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 24998422.587098632
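Pool relies on pickling. Each chunk is serialized and copied to a worker, and the worker function must be importable by the child processes. The multiprocessing documentation warns that Pool examples may not work in the interactive interpreter for this reason. With the `spawn` start method, which is the default on Windows and macOS, scripts also need an `if __name__ == "__main__":` guard.

Below is a sketch of the same reduction as a standalone script. It also uses `np.array_split` to spread the remainder over the chunks instead of adding it all to the last one. The names `chunk_sum` and `n_proc`, the array size and the seed are illustrative choices.

```python
import os
from multiprocessing import Pool

import numpy as np

def chunk_sum(A):
    """Sequential sum of one chunk, executed in a worker process."""
    s = 0.0
    for i in range(A.size):
        s += A[i]
    return s

def parallel_reduction(A, num_processes):
    # array_split gives chunk sizes that differ by at most one element
    chunks = np.array_split(A, num_processes)
    with Pool(num_processes) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    # Guard required with the spawn start method: children re-import this module
    n_proc = max(1, int(os.environ.get("OMP_NUM_THREADS", 1)))
    X = np.random.default_rng(0).random(10**6)
    total = parallel_reduction(X, n_proc)
    print(f"{n_proc} processes: {total!r} (np.sum gives {np.sum(X)!r})")
```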



## Numba, sequential and parallel

In [4]:
from numba import njit

@njit
def reduc_operation_numba(A):
    s = 0.0  # float accumulator, matching the float64 input
    for i in range(A.size):
        s += A[i]
    return s

# The first call triggers JIT compilation, so it is kept out of the timing
_ = reduc_operation_numba(X)

# Measure the execution time
print("Numba with @njit (sequential):")
tiempo = %timeit -r 2 -o -q reduc_operation_numba(X)
print("Time taken:", tiempo)
print("Result: {}\n".format(reduc_operation_numba(X)))

Numba with @njit (sequential):
Time taken: 50.1 ms ± 71.3 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 24998422.587098632
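`%timeit` is an IPython magic, so it is not valid syntax in a plain `.py` script run with `python`. Below is a rough standard-library equivalent, built around a hypothetical `benchmark` helper. It makes one untimed warm-up call, which for an `@njit` function absorbs the JIT compilation, and then uses `timeit.repeat`. Unlike `%timeit`, it returns the best time per call rather than mean ± std. dev., because the `timeit` documentation suggests the minimum is the more meaningful figure.

```python
import timeit

import numpy as np

def benchmark(func, *args, repeat=2, number=10):
    """Return the best mean time per call (seconds) over `repeat` runs of `number` calls.

    One untimed warm-up call comes first; for a Numba @njit function it absorbs the
    JIT compilation, like the "_ = ..." calls in the notebook.
    """
    func(*args)  # warm-up, not timed
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(totals) / number

if __name__ == "__main__":
    X = np.random.rand(10**6)
    print(f"np.sum: {benchmark(np.sum, X) * 1e3:.3f} ms per call")
```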



## Numba with prange

In [3]:
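import os

import numba
from numba import njit, prange

@njit(parallel=True)
def reduc_operation_numba_parallel(A):
    s = 0.0  # float accumulator; prange turns `s += ...` into a parallel reduction
    for i in prange(A.size):
        s += A[i]
    return s

# Limit Numba to the assigned cores. Numba sizes its thread pool from NUMBA_NUM_THREADS
# (default: all CPU cores), independently of OMP_NUM_THREADS, so setting
# os.environ['OMP_NUM_THREADS'] here would not change it; set_num_threads() does.
numba.set_num_threads(min(num_threads, numba.config.NUMBA_NUM_THREADS))

# The first call triggers JIT compilation, so it is kept out of the timing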
_ = reduc_operation_numba_parallel(X)

print("Numba with @njit(parallel=True) and prange - {} cores:".format(num_threads))
tiempo = %timeit -r 2 -o -q reduc_operation_numba_parallel(X)
print("Time taken:", tiempo)
print("Result: {}\n".format(reduc_operation_numba_parallel(X)))

Numba with @njit(parallel=True) and prange - 1 cores:
Time taken: 11.4 ms ± 19.8 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24998422.5870984
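The prange result (24998422.5870984) also differs in the last digits from the sequential ones. In a parallel reduction, each thread accumulates its own partial sum and the partial sums are then combined, which changes the order of the additions. The sketch below emulates that with contiguous chunks from `np.array_split`. This chunking is an assumption made for illustration and is not Numba's actual scheduler.

```python
import math

import numpy as np

def sequential_sum(values):
    """Left-to-right accumulation, as in the sequential loops."""
    s = 0.0
    for x in values:
        s += x
    return s

def chunked_sum(A, n_chunks):
    """Sum contiguous chunks independently, then add the partial sums.

    Assumption: this only mimics a per-thread reduction; it is not Numba's scheduler.
    """
    partials = [sequential_sum(chunk.tolist()) for chunk in np.array_split(A, n_chunks)]
    return sequential_sum(partials)

X = np.random.default_rng(1).random(10**5)
for n in (1, 2, 4, 8):
    print(n, repr(chunked_sum(X, n)))
```

With one chunk the result is bit-for-bit identical to the sequential loop. With more chunks it usually differs only in the last digits.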



## Analysis of the results

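Comparing the runs labelled 10^8 and 10^9 elements, the time of the original Python loop barely changes. In both cases it stays at about 8.5–8.7 seconds regardless of the number of cores assigned. For example, with 1 core the sequential sum takes 8.57 s ± 69.4 ms in the 10^9 run and 8.65 s ± 24 ms in the 10^8 run.

This should not be read as the cost being independent of the problem size. A Python loop does a fixed amount of interpreter work per element, so its time grows linearly with the number of elements. The printed sums point to a simpler explanation. Every run reports a sum of about 2.5×10^7, which is what 5×10^7 samples from [0, 1) produce, since the expected sum is N/2. By contrast, 10^8 or 10^9 samples would give about 5×10^7 or 5×10^8. This suggests that the line `value = 5*10**7` in the reduction cell overrode the size passed on the command line, so all of these runs most likely summed 5×10^7 elements.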
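Each element is drawn uniformly from [0, 1), so a sum of N elements has mean N/2 and standard deviation sqrt(N/12). The printed sums therefore reveal how many elements were actually summed. The small check below uses the four sums printed by the plain loop in the runs labelled 10^9 elements; `z_score` is a hypothetical helper. For N = 5×10^7 all four z-scores are below 2 in absolute value, while for 10^8 or 10^9 they run into the thousands or tens of thousands.

```python
import math

# Sums printed by the plain-Python loop in the runs labelled 10**9 elements (1, 2, 4, 8 cores)
reported = [24999221.68358894, 25001805.154178515, 25003219.737389997, 25000134.178840052]

def z_score(total, n):
    """Standard deviations between `total` and the mean of a sum of n U[0, 1) samples."""
    mean = n / 2              # E[U] = 1/2
    std = math.sqrt(n / 12)   # Var[U] = 1/12; variances add for independent samples
    return (total - mean) / std

for n in (5 * 10**7, 10**8, 10**9):
    print(f"n = {n:>10}: z-scores {[round(z_score(t, n), 1) for t in reported]}")
```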

For the vectorized numpy.sum() operation, the times are practically identical in the 10^8 and 10^9 runs, around 32 ms (32 ms ± 498 ns with 1 core in the 10^9 run). That is roughly 270 times faster than the Python loop. It does not change with the number of cores because np.sum() runs on a single thread. NumPy's compiled reduction does very little work per element, so it is most likely limited by memory bandwidth rather than by arithmetic.

The multiprocessing version does scale with the number of processes. In the 10^9 run it takes 11.2 s ± 299 μs with 1 process, 5.54 s ± 15.7 ms with 2, 3.1 s ± 213 μs with 4 and 1.87 s ± 3.39 ms with 8. Each doubling of the number of processes therefore cuts the time by a factor of about 1.7 to 2, which means the dominant cost is still the Python loop executed inside each worker.

The overhead of creating the pool and pickling the chunks to send them to the workers shows up as the gap between the 1-process run (10.4–11.2 s) and the plain sequential function (about 8.6 s). Because of this overhead, 8 processes reach a 4.6-fold speed-up over the sequential loop rather than the ideal 8-fold.
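The scaling can be quantified from the times reported for the runs labelled 10^9 elements. The sketch below computes three figures with a hypothetical `scaling_table` helper: the speed-up over the sequential loop (8.57 s), the speed-up over the 1-process pool, and the parallel efficiency, defined as the speed-up divided by the number of processes.

```python
def scaling_table(sequential, pool_times):
    """Return {processes: (speed-up vs loop, speed-up vs 1-process pool, efficiency)}."""
    rows = {}
    for p, t in sorted(pool_times.items()):
        vs_loop = sequential / t
        vs_pool1 = pool_times[1] / t
        rows[p] = (vs_loop, vs_pool1, vs_pool1 / p)  # efficiency = speed-up / processes
    return rows

# Mean times (s) reported for the runs labelled 10**9 elements
sequential_loop = 8.57
pool_times = {1: 11.2, 2: 5.54, 4: 3.1, 8: 1.87}

for p, (vs_loop, vs_pool1, eff) in scaling_table(sequential_loop, pool_times).items():
    print(f"{p} processes: {vs_loop:.2f}x vs loop, {vs_pool1:.2f}x vs 1-process pool, efficiency {eff:.0%}")
```

With these numbers, the 8-process pool is about 4.6 times faster than the loop and about 6 times faster than the 1-process pool, which corresponds to roughly 75% parallel efficiency.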

The sequential Numba version (@njit) is equally stable between the 10^8 and 10^9 runs. It takes 57.6 ms ± 2–7 μs in the 10^9 run and 57.7–57.8 ms in the 10^8 run, about 150 times faster than the Python loop, because JIT compilation removes the interpreter overhead and produces native machine code. It is nevertheless slower than numpy.sum() (32 ms). A likely reason is that, without fastmath, the compiler must keep the floating-point additions in their original order. Each addition then waits for the previous one and the loop cannot be vectorized, so it is limited by the latency of the additions rather than by memory access.

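Finally, the parallel Numba version with prange is the fastest: about 16–17 ms in the 10^9 run and 15.5–20 ms in the 10^8 run. However, its time does not improve as cores are added. The 1-core run (16.3 ms) is already about 3.5 times faster than the sequential @njit version and as fast as the 8-core run (16.5 ms). Its result also differs in the last digits from the sequential one, which indicates that the additions were done in a different order, as happens when the sum is split into per-thread partial sums.

Both observations suggest that Numba ran with more threads than the requested number of cores. Numba sizes its thread pool from NUMBA_NUM_THREADS (by default, all CPU cores), independently of OMP_NUM_THREADS, and numba.set_num_threads() is the way to limit it at run time. The scaling of this version with the number of cores therefore cannot be assessed from these runs.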

Running with 100000000 elements

1 core:

Executing with value = 100000000

Time taken by reduction operation using a function: 8.65 s ± 24 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24996695.356471933

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 8.2 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24996695.356475852

Multiprocessing with 1 processes:

Time taken: 10.4 s ± 16.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 24996695.356471933

Numba with @njit (sequential):

Time taken: 57.7 ms ± 18.6 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 24996695.356471933

Numba with @njit(parallel=True) and prange - 1 cores:

Time taken: 18.8 ms ± 98.3 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24996695.356476042


2 cores:

Executing with value = 100000000

Time taken by reduction operation using a function: 8.65 s ± 41.3 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25001358.169585597

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 2.2 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25001358.169586267

Multiprocessing with 2 processes:

Time taken: 5.55 s ± 17.8 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 25001358.169587146

Numba with @njit (sequential):

Time taken: 57.8 ms ± 129 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25001358.169585597

Numba with @njit(parallel=True) and prange - 2 cores:

Time taken: 17.6 ms ± 1.63 ms per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25001358.169586223


4 cores:

Executing with value = 100000000

Time taken by reduction operation using a function: 8.65 s ± 36 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999714.413260628

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 4.66 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24999714.413268164

Multiprocessing with 4 processes:

Time taken: 3.1 s ± 1.45 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 24999714.41326747

Numba with @njit (sequential):

Time taken: 57.8 ms ± 141 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 24999714.413260628

Numba with @njit(parallel=True) and prange - 4 cores:

Time taken: 15.5 ms ± 2.08 ms per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24999714.413268108


8 cores:

Executing with value = 100000000

Time taken by reduction operation using a function: 8.52 s ± 47.3 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25004113.502496395

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 1.41 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25004113.502498336

Multiprocessing with 8 processes:

Time taken: 1.98 s ± 6.88 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 25004113.502499215

Numba with @njit (sequential):

Time taken: 57.7 ms ± 8.95 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25004113.502496395

Numba with @njit(parallel=True) and prange - 8 cores:

Time taken: 20 ms ± 3.2 ms per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25004113.50249822
_____________________________________________________________________________________________________________________________
Running with 1000000000 elements

1 core:

Executing with value = 1000000000

Time taken by reduction operation using a function: 8.57 s ± 69.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 24999221.68358894

Time taken by reduction operation using numpy.sum(): 32 ms ± 498 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 24999221.68358432

Multiprocessing with 1 processes:

Time taken: 11.2 s ± 299 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 24999221.68358894

Numba with @njit (sequential):

Time taken: 57.6 ms ± 2.16 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 24999221.68358894

Numba with @njit(parallel=True) and prange - 1 cores:

Time taken: 16.3 ms ± 63.4 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 24999221.683584187


2 cores:

Executing with value = 1000000000

Time taken by reduction operation using a function: 8.64 s ± 23.2 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25001805.154178515

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 619 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25001805.15417737

Multiprocessing with 2 processes:

Time taken: 5.54 s ± 15.7 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 25001805.154175416

Numba with @njit (sequential):

Time taken: 57.6 ms ± 5.88 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25001805.154178515

Numba with @njit(parallel=True) and prange - 2 cores:

Time taken: 16.4 ms ± 136 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25001805.15417727


4 cores:

Executing with value = 1000000000

Time taken by reduction operation using a function: 8.63 s ± 24.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25003219.737389997

Time taken by reduction operation using numpy.sum(): 32 ms ± 468 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25003219.737390496

Multiprocessing with 4 processes:

Time taken: 3.1 s ± 213 μs per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 25003219.737391993

Numba with @njit (sequential):

Time taken: 57.6 ms ± 4.97 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25003219.737389997

Numba with @njit(parallel=True) and prange - 4 cores:

Time taken: 16.6 ms ± 92 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25003219.737390526


8 cores:

Executing with value = 1000000000

Time taken by reduction operation using a function: 8.64 s ± 25.4 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
And the result of the sum of numbers in the range [0, value) is: 25000134.178840052

Time taken by reduction operation using numpy.sum(): 32.1 ms ± 147 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)
Now, the result using numpy.sum(): 25000134.17884272

Multiprocessing with 8 processes:

Time taken: 1.87 s ± 3.39 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Result: 25000134.17884267

Numba with @njit (sequential):

Time taken: 57.6 ms ± 6.81 μs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Result: 25000134.178840052

Numba with @njit(parallel=True) and prange - 8 cores:

Time taken: 16.5 ms ± 79.2 μs per loop (mean ± std. dev. of 2 runs, 100 loops each)
Result: 25000134.178842828