
# **Introduction**

This notebook multiplies two square matrices on the GPU using multiple thread blocks and shared memory.
Each thread block is assigned one tile of the result matrix and is responsible for computing every element in that tile. Each thread within the block computes a single element of the tile.
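The tiling idea can be sketched on the CPU in plain Python. This is only an illustration of the scheme, not the kernel itself: `tiled_matmul` and `naive_matmul` are hypothetical names, the matrices are small lists of lists, and the local lists `a_s` and `b_s` stand in for the shared-memory tiles.

```python
def tiled_matmul(a, b, tile):
    """Multiply two square matrices (lists of lists) one tile at a time."""
    n = len(a)
    assert n % tile == 0, "matrix size must be a multiple of the tile size"
    c = [[0.0] * n for _ in range(n)]
    # Each (by, bx) pair plays the role of one thread block / one output tile.
    for by in range(0, n, tile):
        for bx in range(0, n, tile):
            # Walk along the shared dimension one tile at a time.
            for k0 in range(0, n, tile):
                # Stand-ins for the shared-memory tiles of A and B.
                a_s = [row[k0:k0 + tile] for row in a[by:by + tile]]
                b_s = [row[bx:bx + tile] for row in b[k0:k0 + tile]]
                # Each (ty, tx) pair plays the role of one thread.
                for ty in range(tile):
                    for tx in range(tile):
                        c[by + ty][bx + tx] += sum(
                            a_s[ty][k] * b_s[k][tx] for k in range(tile)
                        )
    return c


def naive_matmul(a, b):
    """Reference triple-loop product used to check the tiled version."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Small integer-valued inputs, so both orders of summation are exact.
a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
b = [[float((i + j) % 3) for j in range(4)] for i in range(4)]
print(tiled_matmul(a, b, tile=2) == naive_matmul(a, b))
```

On the GPU the two outer loops run in parallel as blocks and the two inner loops run in parallel as threads; only the loop over `k0` remains sequential within a block.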

# **Environment Setup**

In [43]:
!pip install pycuda



In [58]:
import numpy as np
from numpy import linalg as la
from pycuda import driver, compiler, gpuarray, tools
from datetime import datetime
# -- initialize the device
import pycuda.autoinit


# define the (square) matrix size
MATRIX_SIZE = 4000

# define size of blocks and tiles sub-matrix
# (we assume that the block size is same as tile size)
# MATRIX_SIZE must be a multiple of TILE_SIZE: the kernel has no bounds checks
TILE_SIZE = 20
BLOCK_SIZE = TILE_SIZE

# --------------------------------------------
# Helper that converts a timedelta to milliseconds
tiempo_en_ms = lambda dt:(dt.days * 24 * 60 * 60 + dt.seconds) * 1000 + dt.microseconds / 1000.0
# --------------------------------------------

a_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)
b_cpu = np.random.randn(MATRIX_SIZE, MATRIX_SIZE).astype(np.float32)


# **CPU Implementation**

In [59]:
tiempo_img = datetime.now()
# compute reference on the CPU to verify GPU computation
c_cpu = np.dot(a_cpu, b_cpu)
tiempo_CPU = datetime.now() - tiempo_img
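
The timing pattern used here (take a timestamp, run the work, subtract) can also be written with the standard library alone. The sketch below is an assumption-laden alternative, not the notebook's method: it checks that `tiempo_en_ms` agrees with `timedelta.total_seconds()`, and it introduces a hypothetical helper `time_ms` built on `time.perf_counter()`, a monotonic clock usually preferred for measuring short intervals.

```python
import math
import time
from datetime import timedelta

# Same conversion as the notebook's helper, repeated so this block runs alone.
tiempo_en_ms = lambda dt: (dt.days * 24 * 60 * 60 + dt.seconds) * 1000 + dt.microseconds / 1000.0

# timedelta.total_seconds() yields the same duration without the manual arithmetic.
sample = timedelta(days=1, seconds=2, microseconds=3500)
same = math.isclose(tiempo_en_ms(sample), sample.total_seconds() * 1000.0)
print("helpers agree:", same)


def time_ms(fn, *args):
    """Hypothetical helper: run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


total, elapsed = time_ms(sum, range(1_000_000))
print(f"sum took {elapsed:.3f} ms")
```

In the notebook, a call such as `time_ms(np.dot, a_cpu, b_cpu)` would presumably replace the pair of `datetime.now()` calls around the CPU product.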

# **GPU Implementation**

In [60]:
tiempo_total_GPU = datetime.now()
def matmul(a_gpu,b_gpu,MATRIX_SIZE=MATRIX_SIZE):
    kernel_code_template = """
    __global__ void MatrixMulKernel(float *A, float *B, float *C)
    {

      const uint wA = %(MATRIX_SIZE)s;
      const uint wB = %(MATRIX_SIZE)s;

      // Block index
      const uint bx = blockIdx.x;
      const uint by = blockIdx.y;

      // Thread index
      const uint tx = threadIdx.x;
      const uint ty = threadIdx.y;

      // Index of the first sub-matrix of A processed by the block
      const uint aBegin = wA * %(BLOCK_SIZE)s * by;
      // Index of the last sub-matrix of A processed by the block
      const uint aEnd = aBegin + wA - 1;
      // Step size used to iterate through the sub-matrices of A
      const uint aStep = %(BLOCK_SIZE)s;

      // Index of the first sub-matrix of B processed by the block
      const uint bBegin = %(BLOCK_SIZE)s * bx;
      // Step size used to iterate through the sub-matrices of B
      const uint bStep = %(BLOCK_SIZE)s * wB;

      // The element of the block sub-matrix that is computed
      // by the thread
      float Csub = 0;
      // Loop over all the sub-matrices of A and B required to
      // compute the block sub-matrix
      for (int a = aBegin, b = bBegin;
           a <= aEnd;
           a += aStep, b += bStep)
        {
          // Shared memory for the sub-matrix of A
          __shared__ float As[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];
          // Shared memory for the sub-matrix of B
          __shared__ float Bs[%(BLOCK_SIZE)s][%(BLOCK_SIZE)s];

          // Load the matrices from global memory to shared memory
          // each thread loads one element of each matrix
          As[ty][tx] = A[a + wA * ty + tx];
          Bs[ty][tx] = B[b + wB * ty + tx];
          // Synchronize to make sure the matrices are loaded
          __syncthreads();

          // Multiply the two matrices together;
          // each thread computes one element
          // of the block sub-matrix
          for (int k = 0; k < %(BLOCK_SIZE)s; ++k)
            Csub += As[ty][k] * Bs[k][tx];

          // Synchronize to make sure that the preceding
          // computation is done before loading two new
          // sub-matrices of A and B in the next iteration
          __syncthreads();
        }

      // Write the block sub-matrix to global memory;
      // each thread writes one element
      const uint c = wB * %(BLOCK_SIZE)s * by + %(BLOCK_SIZE)s * bx;
      C[c + wB * ty + tx] = Csub;
    }
    """

    # get the kernel code from the template
    # by specifying the constants MATRIX_SIZE and BLOCK_SIZE
    kernel_code = kernel_code_template % {
        'MATRIX_SIZE': MATRIX_SIZE,
        'BLOCK_SIZE': BLOCK_SIZE,
        }

    # compile the kernel code
    mod = compiler.SourceModule(kernel_code)
    
    # create empty gpu array for the result (C = A * B)
    c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

    # get the kernel function from the compiled module
    matrixmul = mod.get_function("MatrixMulKernel")

    # call the kernel on the card

    matrixmul(
        # inputs
        a_gpu, b_gpu,
        # output
        c_gpu,
        # grid of multiple blocks
        grid = (MATRIX_SIZE // TILE_SIZE, MATRIX_SIZE // TILE_SIZE),
        # block of multiple threads
        block = (TILE_SIZE, TILE_SIZE, 1),
        )

    return c_gpu



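# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
tiempo_img2 = datetime.now()
c_gpu = matmul(a_gpu,b_gpu)
# NOTE: PyCUDA launches kernels asynchronously, so this interval covers kernel
# compilation and the launch but may end before the kernel has finished;
# the result is only guaranteed to be complete once c_gpu.get() returns
tiempo_GPU = datetime.now() - tiempo_img2

tiempo_total_GPU = datetime.now() - tiempo_total_GPU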
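
The index arithmetic at the end of the kernel can be checked in plain Python. The sketch below mirrors the final write (`c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx`, then `C[c + wB * ty + tx]`) and the integer division used for the grid. The function names `flat_index` and `covered_indices` are hypothetical, and the small sizes are chosen only to keep the example fast.

```python
def flat_index(bx, by, tx, ty, width, block):
    """Flat offset into C written by thread (tx, ty) of block (bx, by),
    using the same arithmetic as the kernel's final write."""
    c = width * block * by + block * bx
    return c + width * ty + tx


def covered_indices(width, block):
    """Sorted list of every offset written by the whole grid."""
    grid = width // block  # same integer division as the launch configuration
    return sorted(
        flat_index(bx, by, tx, ty, width, block)
        for by in range(grid) for bx in range(grid)
        for ty in range(block) for tx in range(block)
    )


# Width divisible by the block size: every element is written exactly once.
full = covered_indices(width=8, block=4)
print(full == list(range(8 * 8)))

# Width not divisible: the grid is truncated and part of C is never written.
partial = covered_indices(width=10, block=4)
print(len(partial), "of", 10 * 10, "elements written")
```

With the notebook's values, 4000 is a multiple of 20, so the grid of 200 x 200 blocks covers the whole result matrix.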

# **Metrics**

In [61]:
# print the results

print("\n")
print( "CPU time:", tiempo_en_ms( tiempo_CPU), "[ms]" )
print("-" * 8)
print( "GPU time:", tiempo_en_ms(tiempo_GPU), "[ms]" )
print("Total GPU time: ", tiempo_en_ms( tiempo_total_GPU ), "[ms]" )
print("\n")
print("-" * 80)
print("Matrix C (CPU):")
print(c_cpu)

print("-" * 80)
print("Matrix C (GPU):")
print(c_gpu.get())



CPU time: 1096.717 [ms]
--------
GPU time: 270.895 [ms]
Total GPU time:  305.312 [ms]


--------------------------------------------------------------------------------
Matrix C (CPU):
[[  63.54331    -29.097454   -10.1282425 ...   -8.06781    -40.361073
     4.9087973]
 [ -48.93467    -56.057816  -134.8667    ...  -96.78839     25.50584
  -108.41404  ]
 [ 103.30497    -11.911066    21.215227  ...   23.942415   -50.717957
   -32.108284 ]
 ...
 [ -84.739075   -71.62523     -2.7986305 ...   19.566782    63.95124
    11.666954 ]
 [  54.322556    53.524776    62.9317    ...  -50.570236    15.705193
    82.816666 ]
 [ -75.878395   -57.258205   -63.2155    ...   -1.0665531   83.2072
   -17.44794  ]]
--------------------------------------------------------------------------------
Matrix C (GPU):
[[  63.543385   -29.097387   -10.128312  ...   -8.067886   -40.361214
     4.908819 ]
 [ -48.934696   -56.057835  -134.8668    ...  -96.7883      25.505804
  -108.414154 ]
 [ 103.30487    -11.
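
The two printed matrices agree only to a few significant digits, which is expected when float32 sums are accumulated in a different order. A minimal sketch of a numerical check follows, using the first printed row of each result as sample data. The helper `max_relative_error` and the `1e-4` tolerance are assumptions for illustration, not part of the notebook.

```python
def max_relative_error(reference, result):
    """Largest |result - reference| / |reference| over paired values."""
    return max(abs(r - x) / abs(x) for x, r in zip(reference, result))


# Values printed in the first row of C by the CPU and GPU runs above.
c_cpu_row = [63.54331, -29.097454, -10.1282425, -8.06781, -40.361073, 4.9087973]
c_gpu_row = [63.543385, -29.097387, -10.128312, -8.067886, -40.361214, 4.908819]

error = max_relative_error(c_cpu_row, c_gpu_row)
print(f"max relative error: {error:.1e}")
print("within tolerance:", error < 1e-4)  # tolerance is an assumed value
```

Over the full matrices, `np.allclose(c_cpu, c_gpu.get())` with tolerances suited to float32 would be the natural NumPy equivalent. Elements close to zero may need an absolute tolerance rather than a relative one.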

# **Conclusions**

For linear algebra operations such as matrix multiplication, the matrices must be large before using a GPU pays off. For small matrices the CPU alone finishes much faster, while for large ones the GPU wins:
400 x 400 matrix: the CPU is faster than the GPU
4000 x 4000 matrix: the GPU is faster than the CPU

In the 4000 x 4000 run recorded above, the CPU took about 1097 ms and the GPU about 271 ms (305 ms in total), a speedup of roughly 4x, or about 3.6x when the total GPU time is counted.
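
One way to see why size matters is to compare how the arithmetic and the data volume grow. The sketch below assumes the usual estimate of about 2n³ floating-point operations for an n x n product and three float32 matrices moved between host and device (A and B in, C out). `matmul_cost` is a hypothetical helper, and fixed costs such as kernel compilation and launch are not modeled.

```python
def matmul_cost(n, bytes_per_element=4):
    """Return (floating-point operations, bytes moved) for C = A * B with n x n matrices."""
    flops = 2 * n ** 3  # about n multiplies and n adds per output element
    bytes_moved = 3 * n ** 2 * bytes_per_element  # A and B to the device, C back
    return flops, bytes_moved


for n in (400, 4000):
    flops, moved = matmul_cost(n)
    print(f"n={n}: {flops:.2e} flops, {moved:.2e} bytes, "
          f"{flops / moved:.1f} flops per byte")
```

Work per transferred byte grows linearly with n, so a matrix ten times larger has ten times more arithmetic available to amortize each byte moved and each fixed overhead.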

# **References**

https://www.researchgate.net/publication/247933772_Multilevel_Optimization_of_Matrix_Multiplication_for_GPU-equipped_Systems
https://colab.research.google.com/github/cbernet/maldives/blob/master/numba/numba_cuda_kernel.ipynb
https://shephexd.github.io/development/2017/02/19/pycuda.html

