# GPU Computing - Exercise 1
## Eetu Knutars
## 13.1.2025

## Task 1
Create a custom kernel that takes a zero-initialized vector as input; each thread writes its own thread index into the corresponding element of that vector.

The input is a vector of zeros:

[0, 0, 0, 0, 0, 0, ...]

And the output is a vector of thread indices:

[0, 1, 2, 3, 4, 5, ...]


Importing libraries

In [6]:
import cupy as cp
from math import ceil
import numpy as np

Defining the kernel, which takes a pointer to the vector as input and modifies the vector's values in place

In [7]:
index_kernel = cp.RawKernel(r''' extern "C"
__global__ void thread_indexing(float* C, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid may contain more threads than the vector has elements
    if (index < n) {
        C[index] = index;
    }
}''', 'thread_indexing')


Initializing the vector to be modified

In [8]:
vector_size = 20
result = cp.zeros(vector_size, dtype=np.float32)
print("Input vector:", cp.asnumpy(result))
numThreadsPerBlock = 1024
numBlocks = ceil(vector_size/numThreadsPerBlock)

Input vector: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


Modifying the vector with the kernel

In [9]:
# The kernel arguments must be a tuple; np.int32 matches the kernel's `int n` parameter
index_kernel((numBlocks,1,1),(numThreadsPerBlock,1,1),(result, np.int32(vector_size)))

Printing the results

In [10]:
print("Output vector:", cp.asnumpy(result))

Output vector: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19.]
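
The index formula `blockIdx.x * blockDim.x + threadIdx.x` only shows its full effect when more than one block is launched, which the 20-element example with 1024 threads per block does not exercise. The following is a minimal CPU-only sketch of that formula, not the kernel itself: the function name `emulate_thread_indexing` and the block size of 8 are hypothetical choices made for illustration. It skips indices past the end of the vector, which is what a bounds check inside a kernel would do.

```python
import numpy as np

def emulate_thread_indexing(n, threads_per_block):
    """Emulate a 1-D launch on the CPU; return the filled vector and the thread count."""
    num_blocks = -(-n // threads_per_block)  # ceiling division
    out = np.zeros(n, dtype=np.float32)
    launched = 0
    for block_idx in range(num_blocks):          # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            launched += 1
            index = block_idx * threads_per_block + thread_idx
            if index < n:  # threads past the end of the vector do nothing
                out[index] = index
    return out, launched

# Hypothetical block size of 8, so that 20 elements need three blocks
out, launched = emulate_thread_indexing(n=20, threads_per_block=8)
print(out)
print("threads launched:", launched, "idle:", launched - out.size)
```

With 20 elements and 8 threads per block, three blocks are needed, so some threads in the last block have no element to write to.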


## Task 2
Implement a kernel which takes two vectors A and B and adds them together to form a vector C.

Importing libraries

In [62]:
import cupy as cp
import numpy as np
from math import ceil

Kernel function

In [63]:
addition_kernel = cp.RawKernel(r''' extern "C"
__global__ void vector_add(const float* A, const float* B, float* C, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: skip threads whose index falls past the end of the vectors
    if (index < n) {
        C[index] = A[index] + B[index];
    }
}''', 'vector_add')


Initializing the input vectors and the output vector

In [64]:
vector_size = 20

# Use random numbers in the range [0, 10)
A = 10 * cp.random.random(size=vector_size, dtype=np.float32)
B = 10 * cp.random.random(size=vector_size, dtype=np.float32)
C = cp.zeros(vector_size, dtype=np.float32)
print("Input A:", cp.asnumpy(A))
print("Input B:", cp.asnumpy(B))
numThreadsPerBlock = 1024
numBlocks = ceil(vector_size/numThreadsPerBlock)

Input A: [4.6482882  9.66821    7.405357   9.7463     0.6994483  3.6750507
 9.093338   8.524483   3.4437883  3.0960646  0.58507067 5.754448
 9.654255   9.9412     1.0858561  2.5014603  8.6135235  8.316432
 1.014083   5.006768  ]
Input B: [5.8154526  4.5723495  7.4977293  5.328376   0.10819334 1.336961
 2.957413   9.306805   0.45186752 2.7307093  3.8831053  0.10689315
 8.907053   3.9420607  3.6646419  7.9856114  6.811051   6.448777
 3.1091628  1.3445395 ]


Using the kernel function to modify the output vector

In [65]:
# np.int32 matches the kernel's `int n` parameter
addition_kernel((numBlocks,1,1),(numThreadsPerBlock,1,1),(A,B,C,np.int32(vector_size)))
print("Output C:", cp.asnumpy(C))

Output C: [10.463741  14.24056   14.903086  15.074676   0.8076416  5.0120115
 12.050751  17.831287   3.8956559  5.8267736  4.468176   5.861341
 18.561308  13.883261   4.750498  10.487072  15.424574  14.765209
  4.1232457  6.351308 ]
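
The launch configuration rounds the block count up with `ceil(vector_size/numThreadsPerBlock)`, so the grid always holds at least as many threads as there are elements. A small sketch of how that plays out for a few sizes is given below; the helper name `grid_size` and the sizes other than 20 are assumptions chosen for illustration.

```python
from math import ceil

def grid_size(n, threads_per_block=1024):
    """Number of blocks needed so that blocks * threads_per_block >= n."""
    return ceil(n / threads_per_block)

# 20 is the size used in this notebook; the other sizes are hypothetical
for n in (20, 1024, 1025, 1_000_000):
    blocks = grid_size(n)
    idle = blocks * 1024 - n  # threads that have no element to process
    print(f"n={n:>9}  blocks={blocks:>4}  idle threads={idle}")
```

Whenever the size is not a multiple of the block size, the last block contains threads without a matching element, so the kernel should not let those threads read or write memory.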


## Task 3
Implement a kernel that takes vectors A, B, and C, adds A and B element-wise, and multiplies the resulting values element-wise by the values of C to form a vector D.

Importing libraries

In [70]:
import cupy as cp
import numpy as np
from math import ceil

Kernel function that takes pointers to the input vectors and the output vector and writes the result into the output vector

In [71]:
multiplicative_kernel = cp.RawKernel(r''' extern "C"
__global__ void add_multiply(const float* A, const float* B, const float* C, float* D, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: skip threads whose index falls past the end of the vectors
    if (index < n) {
        D[index] = (A[index] + B[index]) * C[index];
    }
}''', 'add_multiply')


Initializing input vectors and the output vector

In [72]:
vector_size = 20

# Use random numbers in the range [0, 10)
A = 10 * cp.random.random(size=vector_size, dtype=np.float32)
B = 10 * cp.random.random(size=vector_size, dtype=np.float32)
C = 10 * cp.random.random(size=vector_size, dtype=np.float32)
D = cp.zeros(vector_size, dtype=np.float32)
print("Input A:", cp.asnumpy(A))
print("Input B:", cp.asnumpy(B))
print("Input C:", cp.asnumpy(C))
numThreadsPerBlock = 1024
numBlocks = ceil(vector_size/numThreadsPerBlock)

Input A: [2.6577296 8.799604  1.272084  3.797549  7.1444206 5.5970592 0.9424808
 2.3478973 8.631707  3.911983  9.2830715 7.9083185 5.0087523 7.8522873
 5.7825785 7.870675  2.8631237 8.066626  6.551564  5.072396 ]
Input B: [5.4199276  8.313455   0.44178367 1.3221414  4.3011703  6.5970507
 1.5361555  9.016922   2.7818105  3.8396125  9.339224   4.7920184
 8.526458   4.28263    8.297712   1.0424354  7.399351   9.022134
 9.745493   8.906015  ]
Input C: [0.48554027 8.152894   6.184269   9.971737   7.243426   1.0893439
 4.517618   4.767298   7.679718   0.16753924 2.2954173  7.4177413
 5.5070643  4.0653577  0.45902196 0.18310072 5.7900467  8.463949
 7.859826   5.1542053 ]


Creating the output vector D

In [73]:
# np.int32 matches the kernel's `int n` parameter
multiplicative_kernel((numBlocks,1,1),(numThreadsPerBlock,1,1),(A,B,C,D,np.int32(vector_size)))
print("Output D:", cp.asnumpy(D))

Output D: [  3.922028  139.52097    10.599019   51.052204   82.90529    13.283579
  11.197533   54.179485   87.6526      1.2986964  42.745937   94.20781
  74.53927    49.33278     6.4631624   1.631997   59.420208  144.6384
 128.09204    72.0476   ]
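
One way to sanity-check the kernel output is to compare it with a NumPy reference computed on the host. The sketch below does this for the first three elements only, copied from the printed vectors above; the variable names `D_printed` and `D_reference` and the relative tolerance of 1e-5 are assumptions, the latter chosen to absorb the rounding in the printed values.

```python
import numpy as np

# First three elements, copied from the printed inputs and output
A = np.array([2.6577296, 8.799604, 1.272084], dtype=np.float32)
B = np.array([5.4199276, 8.313455, 0.44178367], dtype=np.float32)
C = np.array([0.48554027, 8.152894, 6.184269], dtype=np.float32)
D_printed = np.array([3.922028, 139.52097, 10.599019], dtype=np.float32)

# Host-side reference for the same element-wise computation
D_reference = (A + B) * C

matches = np.allclose(D_reference, D_printed, rtol=1e-5)
print("Kernel output matches NumPy reference:", matches)
```

In the notebook itself, the same comparison could be made on the full vectors by converting the device arrays with `cp.asnumpy` first.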
