<a href="https://colab.research.google.com/github/lsteffenel/NumbaCuda/blob/main/Effective_Memory_Use.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before You Begin
Running these notebooks on Colab requires two things (as of 4/2/2025):

1. GPU resources
  * Menu "Runtime" -> "Change runtime type"
2. An older version of the Colab runtime, because of some incompatibilities with the Nvidia driver
  * Connect to the runtime
  * Menu "Tools" -> "Command palette". Search for "version" in the bar and select the option "Use fallback runtime version"


# Effective Use of the Memory Subsystem

Now that you can write CUDA kernels and understand the importance of launching grids that give the GPU enough work to hide latency, you are going to learn techniques for using the GPU's memory effectively. These techniques are widely applicable to a variety of CUDA applications and are among the most important when it comes to making your CUDA code go fast.

You will begin by learning about memory coalescing (the grouping of the memory accesses of neighbouring threads into contiguous blocks). To test your ability to reason about coalescing, you will then learn about 2-dimensional grids and thread blocks. Next you will learn how to use shared memory, which will be used to enable coalescing where it would not otherwise have been possible. Finally, you will learn about the conflicts that can occur in shared memory and a technique to resolve them.


## The Problem: Scattered Memory Access Hurts Performance

Before you learn the details of coalescing, run the following cells to observe the performance implications of a seemingly trivial change to the data access pattern.

### Imports

In [1]:
import numpy as np
from numba import cuda

### Data Creation

In this cell we define `n` and create a grid with `n` threads. We also create an output vector of length `n`. For the inputs, we create vectors of size `stride * n` for reasons that are explained below:

In [2]:
n = 1024*1024 # 1M

threads_per_block = 1024
blocks = int(n / threads_per_block)

stride = 16

# Input Vectors of length stride * n
a = np.ones(stride * n).astype(np.float32)
b = a.copy().astype(np.float32)

# Output Vector
out = np.zeros(n).astype(np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)

### Kernel Definition

In `add_experiment`, each thread in the grid adds one element of `a` and one element of `b` and writes the result to `out`. The kernel is written so that we can pass a `coalesced` value of either `True` or `False` to change how it indexes into the `a` and `b` vectors. You will see the performance of the two modes compared below.

In [3]:
@cuda.jit
def add_experiment(a, b, out, stride, coalesced):
    i = cuda.grid(1)
    # The above line is equivalent to
    # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if coalesced == True:
        out[i] = a[i] + b[i]
    else:
        out[i] = a[stride*i] + b[stride*i]

### Launching the Kernel with Coalesced Access

Here we pass `True` as the `coalesced` value and observe the kernel's performance over several runs:

In [4]:
%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, True); cuda.synchronize()

91.5 µs ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's check that the kernel runs as expected:

In [5]:
result = d_out.copy_to_host()
truth = a[:n] + b[:n]

In [6]:
np.array_equal(result, truth)

True

### Launching the Kernel with Uncoalesced Access

In this cell we pass `False` to observe the performance of the uncoalesced data access pattern:

In [7]:
%timeit add_experiment[blocks, threads_per_block](d_a, d_b, d_out, stride, False); cuda.synchronize()

536 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Let's check that the kernel runs as expected:

In [8]:
result = d_out.copy_to_host()
truth = a[::stride] + b[::stride]

In [9]:
np.array_equal(result, truth)

True

### Results

The uncoalesced access pattern performs far worse (536 µs versus 91.5 µs per launch in the runs above). You will now learn why, and how to reason about data access patterns in order to write high-performing kernels.
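
As a rough intuition for the gap, the sketch below counts how many memory segments one warp touches under each indexing mode. It is a host-side model, not a measurement: the 32-thread warp, the 128-byte segment size, and the helper name `segments_touched` are assumptions made for illustration, and actual transaction sizes vary by GPU generation.

```python
WARP_SIZE = 32        # assumption: threads that issue a memory request together
SEGMENT_BYTES = 128   # assumption: size of one aligned memory segment
ITEM_BYTES = 4        # float32, as in the vectors `a` and `b`

def segments_touched(stride):
    """Count distinct segments read by one warp when thread i reads element stride*i."""
    segments = set()
    for i in range(WARP_SIZE):
        byte_address = stride * i * ITEM_BYTES
        segments.add(byte_address // SEGMENT_BYTES)
    return len(segments)

print(segments_touched(1))   # coalesced mode: a[i]
print(segments_touched(16))  # strided mode: a[stride*i] with stride = 16
```

Under these assumptions the warp touches 1 segment in the coalesced mode and 16 segments with a stride of 16, so most of every fetched segment goes unused in the strided case.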

## Presentation: Global Memory Coalescing

Watch the presentation below:

In [10]:
from IPython.display import IFrame
IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/coalescing-v3.pptx', 800, 450)

## Exercise: Column and Row Sums

For this exercise you will be asked to write a column sums kernel that uses coalesced memory access patterns. To begin, you will observe the performance of a row sums kernel that does not use this access pattern.

### Row Sums

**Imports**

In [11]:
import numpy as np
from numba import cuda

**Data Creation**

In this cell we create an input matrix as well as a vector for storing the solution, and transfer each of them to the device. We also define the grid and block dimensions to be used when we launch the kernel.


In [12]:
n = 16384 # matrix side size
threads_per_block = 256
blocks = int(n / threads_per_block)

# Input Matrix
a = np.ones(n*n).reshape(n, n).astype(np.float32)
# Here we set an arbitrary row to an arbitrary value to facilitate a check for correctness below.
a[3] = 9

# Output vector
sums = np.zeros(n).astype(np.float32)

d_a = cuda.to_device(a)
d_sums = cuda.to_device(sums)

**Kernel Definition**

`row_sums` uses each thread to iterate over one row of data, summing it, and then stores that row's sum in `sums`.

In [14]:
@cuda.jit
def row_sums(a, sums, n):
    idx = cuda.grid(1)
    sum = 0.0

    for i in range(n):
        # Each thread will sum a row of `a`
        sum += a[idx][i]

    sums[idx] = sum

**Performance**

In [15]:
%timeit row_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()



12.6 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


**Check Correctness**

In [16]:
result = d_sums.copy_to_host()
truth = a.sum(axis=1)

In [17]:
np.array_equal(truth, result)

True
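
The access pattern of `row_sums` can be examined on the host by computing the flat element offsets that neighbouring threads read in the same loop iteration. This sketch assumes the matrix is stored in row-major (C) order, which is NumPy's default; the helper names `row_sums_offset` and `thread_spacing` are hypothetical.

```python
n = 16384  # matrix side size, as in the cell above

def row_sums_offset(idx, i, n):
    """Flat element offset of a[idx][i] in a row-major n x n matrix."""
    return idx * n + i

def thread_spacing(offset_fn, i, n):
    """Distance in elements between what thread 0 and thread 1 read at iteration i."""
    return offset_fn(1, i, n) - offset_fn(0, i, n)

spacing = thread_spacing(row_sums_offset, 0, n)
print(spacing)      # elements between neighbouring threads
print(spacing * 4)  # the same distance in bytes for float32
```

In this model neighbouring threads are a full row apart (16384 elements, or 65536 bytes) at every iteration, so their reads cannot be combined. Coalesced access needs that spacing to be a single element.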

### Column Sums

**Imports**

In [18]:
import numpy as np
from numba import cuda

**Data Creation**

We reuse the same setup as before, but this time the marker value is placed in a column:

In [None]:
n = 16384 # matrix side size
threads_per_block = 256
blocks = int(n / threads_per_block)

a = np.ones(n*n).reshape(n, n).astype(np.float32)
# Here we set an arbitrary column to an arbitrary value to facilitate a check for correctness below.
a[:, 3] = 9
sums = np.zeros(n).astype(np.float32)

d_a = cuda.to_device(a)
d_sums = cuda.to_device(sums)

**Kernel Definition**

`col_sums` uses each thread to iterate over one column of data, summing it, and then stores that column's sum in `sums`. Complete the kernel definition to accomplish this (this part is up to you 😀).

In [None]:
@cuda.jit
def col_sums(a, sums, ds):
    # TODO: Write this kernel to store the sum of each column in matrix `a` to the `sums` vector.
    pass

**Check Performance**

Assuming you have written `col_sums` to use coalesced access patterns, you should see a significant speedup (almost 2x) compared to the uncoalesced `row_sums` you ran above:

In [19]:
%timeit col_sums[blocks, threads_per_block](d_a, d_sums, n); cuda.synchronize()

NameError: name 'col_sums' is not defined

**Check Correctness**

Confirm your kernel is working as expected.

In [None]:
result = d_sums.copy_to_host()
truth = a.sum(axis=0)

In [None]:
np.array_equal(truth, result)

## 2- and 3-Dimensional Blocks and Grids

Both grids and blocks can be configured to contain a 2- or 3-dimensional collection of blocks or threads, respectively. This is done mostly as a convenience for programmers who work with 2- or 3-dimensional data. Here is a very simple example to highlight the syntax. You will need to read both the kernel definition and its launch for the concept to make sense.

In [20]:
import numpy as np
from numba import cuda

In [21]:
A = np.zeros((4,4)) # A 4x4 Matrix of 0's
d_A = cuda.to_device(A)

# Here we create a 2D grid with 4 blocks in a 2x2 structure, each with 4 threads in a 2x2 structure
# by using a Python tuple to signify grid and block dimensions.
blocks = (2, 2)
threads_per_block = (2, 2)

This kernel takes a zero-filled input matrix and writes into each of its elements that element's own coordinates in the grid, in the format `X.Y`:

In [22]:
@cuda.jit
def get_2D_indices(A):
    # By passing `2`, we get the thread's unique x and y coordinates in the 2D grid
    x, y = cuda.grid(2)
    # The above is equivalent to the following 2 lines of code:
    # x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    # y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    # Write the x index followed by a decimal and the y index.
    A[x][y] = x + y / 10

In [23]:
get_2D_indices[blocks, threads_per_block](d_A)



In [24]:
result = d_A.copy_to_host()
result

array([[0. , 0.1, 0.2, 0.3],
       [1. , 1.1, 1.2, 1.3],
       [2. , 2.1, 2.2, 2.3],
       [3. , 3.1, 3.2, 3.3]])
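
To make the index arithmetic concrete, this host-side sketch loops over every block and thread of the launch configuration above and evaluates the same formula as the commented-out lines in the kernel. It is plain Python, not Numba, and `emulate_grid_2d` is a hypothetical helper; the GPU runs these threads in parallel and in no guaranteed order.

```python
def emulate_grid_2d(blocks, threads_per_block):
    """Return the (x, y) pair each thread would get from cuda.grid(2)."""
    coords = []
    for block_x in range(blocks[0]):
        for block_y in range(blocks[1]):
            for thread_x in range(threads_per_block[0]):
                for thread_y in range(threads_per_block[1]):
                    x = block_x * threads_per_block[0] + thread_x
                    y = block_y * threads_per_block[1] + thread_y
                    coords.append((x, y))
    return coords

coords = emulate_grid_2d((2, 2), (2, 2))
print(len(coords), len(set(coords)))  # number of threads, number of distinct cells
```

Each of the 16 threads receives a distinct `(x, y)` pair, and together the pairs cover every cell of the 4x4 matrix exactly once.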

## Exercise: Coalesced 2D Matrix Add

### Imports

In [25]:
import numpy as np
from numba import cuda

### Data Creation

In this cell we define 2048x2048 input matrices `a` and `b`, as well as a 2048x2048 zero-initialized output matrix. We copy these matrices to the GPU.

We also define 2-dimensional block and grid dimensions. Note that we create a grid with the same total number of threads as there are input and output elements, so that each thread in the grid computes the sum for a single element of the output matrix.

In [26]:
n = 2048*2048 # 4M

# 2D blocks
threads_per_block = (32, 32)
# 2D grid
blocks = (64, 64)

# 2048x2048 input matrices
a = np.arange(n).reshape(2048,2048).astype(np.float32)
b = a.copy().astype(np.float32)

# 2048x2048 0-initialized output matrix
out = np.zeros_like(a).astype(np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)
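
The grid dimensions above follow from dividing each matrix dimension by the matching block dimension. A minimal sketch, assuming a hypothetical helper `grid_dims` that rounds up so that shapes which are not exact multiples of the block size would still be covered (a kernel launched that way would also need a bounds check, which the kernels in this notebook omit):

```python
def grid_dims(shape, threads_per_block):
    """Blocks needed per dimension so that blocks * threads >= shape."""
    return tuple(
        (size + threads - 1) // threads  # ceiling division
        for size, threads in zip(shape, threads_per_block)
    )

print(grid_dims((2048, 2048), (32, 32)))  # the configuration used above
print(grid_dims((2050, 2048), (32, 32)))  # a shape that is not a multiple of 32
```

For the 2048x2048 matrices this gives the `(64, 64)` grid used in the cell above; the second call shows the extra block needed to cover the remainder.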

### 2D Matrix Add

Your task is to complete the TODOs in `matrix_add` so that it correctly sums `a` and `b` into `out`. To help you reason about access patterns, `matrix_add` accepts a boolean `coalesced` indicating whether the access pattern should be coalesced or not. Both modes (coalesced and uncoalesced) should produce correct results; however, you should observe significant speedups below when running with `coalesced` set to `True`.

In [None]:
@cuda.jit
def matrix_add(a, b, out, coalesced):
    # TODO: set x and y to index correctly such that each thread
    # accesses one element in the data.
    x, y = 0, 0  # placeholder so the cell parses; replace with the correct indices

    if coalesced == True:
        # TODO: write the sum of one element in `a` and `b` to `out`
        # using a coalesced memory access pattern.
        pass
    else:
        # TODO: write the sum of one element in `a` and `b` to `out`
        # using an uncoalesced memory access pattern.
        pass

### Check Performance

Run both cells below to launch `matrix_add` with each of the access patterns you wrote, and observe the difference in performance. Additional cells are provided to confirm the correctness of your kernel.

**Coalesced**

In [None]:
%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, True); cuda.synchronize()

In [None]:
result = d_out.copy_to_host()
truth = a+b

In [None]:
np.array_equal(result, truth)

**Uncoalesced**

In [None]:
%timeit matrix_add[blocks, threads_per_block](d_a, d_b, d_out, False); cuda.synchronize()

In [None]:
result = d_out.copy_to_host()
truth = a+b

In [None]:
np.array_equal(result, truth)

## Shared Memory

So far we have been differentiating between host and device memory, as if device memory were a single kind of memory. But in fact, CUDA has an even more fine-grained [memory hierarchy](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy). The device memory we have been utilizing thus far is called **global memory** which is available to any thread or block on the device, can persist for the lifetime of the application, and is a relatively large memory space.

We will now discuss how to utilize a region of on-chip device memory called **shared memory**. Shared memory is a programmer defined cache of limited size that [depends on the GPU](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities) being used and is **shared** between all threads in a block. It is a scarce resource, cannot be accessed by threads outside of the block where it was allocated, and does not persist after a kernel finishes executing. Shared memory however has a much higher bandwidth than global memory and can be used to great effect in many kernels, especially to optimize performance.

Here are a few common use cases for shared memory:

 * Caching memory read from global memory that will need to be read multiple times within a block.
 * Buffering output from threads so it can be coalesced before writing it back to global memory.
 * Staging data for scatter/gather operations within a block.

### Shared Memory Syntax

Numba provides [functions](https://numba.pydata.org/numba-doc/dev/cuda/memory.html#shared-memory-and-thread-synchronization) for allocating shared memory as well as for synchronizing between threads in a block, which is often necessary after parallel threads read from or write to shared memory.

When declaring shared memory, you provide the shape of the shared array, as well as its type, using a [Numba type](https://numba.pydata.org/numba-doc/dev/reference/types.html#numba-types). **The shape of the array must be a constant value**, so you cannot use arguments passed into the function, built-in variables like `numba.cuda.blockDim.x`, or values calculated from `cuda.gridDim`. Here is a contrived example to demonstrate the syntax, with comments pointing out the movement from host memory to global device memory, to shared memory, back to global device memory, and finally back to host memory:

**Imports**

We will use `numba.types` to define the types of values in shared memory.

In [None]:
import numpy as np
from numba import types, cuda

**Swap Elements Using Shared Memory**

The following kernel takes an input vector, where each thread will first write one element of the vector to shared memory, and then, after syncing such that all elements have been written to shared memory, will write one element out of shared memory into the swapped output vector.

Worth noting is that each thread will be writing a swapped value from shared memory that was written into shared memory by another thread.

In [None]:
@cuda.jit
def swap_with_shared(vector, swapped):
    # Allocate a 4 element vector containing int32 values in shared memory.
    temp = cuda.shared.array(4, dtype=types.int32)

    idx = cuda.grid(1)

    # Move an element from global memory into shared memory
    temp[idx] = vector[idx]

    # cuda.syncthreads will force all threads in the block to synchronize here, which is necessary because...
    cuda.syncthreads()
    #...the following operation is reading an element written to shared memory by another thread.

    # Move an element from shared memory back into global memory
    swapped[idx] = temp[3 - cuda.threadIdx.x] # swap elements

**Data Creation**

In [None]:
vector = np.arange(4).astype(np.int32)
swapped = np.zeros_like(vector)

# Move host memory to device (global) memory
d_vector = cuda.to_device(vector)
d_swapped = cuda.to_device(swapped)

In [None]:
vector

**Run Kernel**

In [None]:
swap_with_shared[1, 4](d_vector, d_swapped)

**Check Results**

In [None]:
# Move device (global) memory back to the host
result = d_swapped.copy_to_host()
result
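
The two phases of `swap_with_shared` can be traced on the host. This is a plain-Python emulation under the assumption of a single block of 4 threads, as in the launch above; `emulate_swap` is a hypothetical helper, and the two loops stand in for the code before and after `cuda.syncthreads()`.

```python
def emulate_swap(vector):
    """Emulate one 4-thread block running swap_with_shared."""
    block_size = 4
    temp = [0] * block_size       # stands in for the shared array
    swapped = [0] * block_size    # stands in for the output in global memory

    # Phase 1: every thread copies its element into shared memory.
    for thread_idx in range(block_size):
        temp[thread_idx] = vector[thread_idx]

    # cuda.syncthreads() sits here: no thread starts phase 2
    # until all threads have finished phase 1.

    # Phase 2: every thread reads an element written by a different thread.
    for thread_idx in range(block_size):
        swapped[thread_idx] = temp[3 - thread_idx]
    return swapped

print(emulate_swap([0, 1, 2, 3]))
```

The emulation returns the input in reverse order. On the GPU, without the barrier, a thread could reach phase 2 and read a slot of `temp` that its owner has not written yet.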

## Presentation: Shared Memory for Memory Coalescing

Execute the following cell to load the slides, then click on "Start Slide Show" to make them full screen.

In [None]:
from IPython.display import IFrame
IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/shared_coalescing.pptx', 800, 450)

## Exercise: Use Shared Memory for Coalesced Reads and Writes With Matrix Transpose

In this exercise you will implement what was just demonstrated in the presentation by writing a matrix transpose kernel which, using shared memory, makes coalesced reads and writes to the output matrix in global memory.

### Coalesced Reads, Uncoalesced Writes

As reference, and for performance comparison, here is a naive matrix transpose kernel that makes coalesced reads from input, but uncoalesced writes to output.

**Imports**

In [None]:
from numba import cuda
import numpy as np

**Data Creation**

Here we create a 4096x4096 input matrix `a` as well as a 4096x4096 output matrix `transposed`, and copy them to the device.

We also define a 2-dimensional grid with 2-dimensional blocks to be used below. Note that we have created a grid with a total number of threads equal to the number of elements in the input matrix.

In [None]:
n = 4096*4096 # 16M

# 2D blocks
threads_per_block = (32, 32)
#2D grid
blocks = (128, 128)

# 4096x4096 input and output matrices
a = np.arange(n).reshape((4096,4096)).astype(np.float32)
transposed = np.zeros_like(a).astype(np.float32)

d_a = cuda.to_device(a)
d_transposed = cuda.to_device(transposed)

**Naive Matrix Transpose Kernel**

This kernel correctly transposes `a`, writing the transposition to `transposed`. It makes reads from `a` in a coalesced fashion, however, its writes to `transposed` are uncoalesced.

In [None]:
@cuda.jit
def transpose(a, transposed):
    x, y = cuda.grid(2)

    transposed[x][y] = a[y][x]

**Check Performance**

In [None]:
%timeit transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()

**Check Correctness**

In [None]:
result = d_transposed.copy_to_host()
expected = a.T

In [None]:
np.array_equal(result, expected)

### Refactor for Coalesced Reads and Writes

Your job will be to refactor the `transpose` kernel to use shared memory and make both reads from and writes to global memory in a coalesced fashion.

**Imports**

In [None]:
import numpy as np
from numba import cuda, types as numba_types

**Data Creation**

In [None]:
n = 4096*4096 # 16M

# 2D blocks
threads_per_block = (32, 32)
#2D grid
blocks = (128, 128)

# 4096x4096 input and output matrices
a = np.arange(n).reshape((4096,4096)).astype(np.float32)
transposed = np.zeros_like(a).astype(np.float32)

d_a = cuda.to_device(a)
d_transposed = cuda.to_device(transposed)

**Write a Transpose Kernel that Uses Shared Memory**

Complete the TODOs inside the `tile_transpose` kernel definition.

If you get stuck, feel free to check out [the solution](section3/solutions/tile_transpose_solution.py).

In [None]:
@cuda.jit
def tile_transpose(a, transposed):
    # `tile_transpose` assumes it is launched with a 32x32 block dimension,
    # and that `a` is a multiple of these dimensions.

    # 1) Create 32x32 shared memory array.

    # TODO: Your code here.

    # Compute offsets into global input array. Recall for coalesced access we want to map threadIdx.x increments to
    # the fastest changing index in the data, i.e. the column in our array.
    # Note: `a_col` and `a_row` are already correct.
    a_col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    a_row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    # 2) Make coalesced read from global memory (using grid indices)
    # into shared memory array (using thread indices).

    # TODO: Your code here.

    # 3) Wait for all threads in the block to finish updating shared memory.

    # TODO: Your code here.

    # 4) Calculate transposed location for the shared memory array tile
    # to be written back to global memory. Note that blockIdx.y*blockDim.y
    # and blockIdx.x* blockDim.x are swapped (because we want to write to the
    # transpose locations), but we want to keep access coalesced, so match up the
    # threadIdx.x to the fastest changing index, i.e. the column.
    # Note: `t_col` and `t_row` are already correct.
    t_col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x
    t_row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y

    # 5) Write from shared memory (using thread indices)
    # back to global memory (using grid indices)
    # transposing each element within the shared memory array.

    # TODO: Your code here.

**Check Performance**

Check the performance of your refactored transpose kernel. You should see a speedup compared to the baseline transpose performance above.

In [None]:
%timeit tile_transpose[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()

**Check Correctness**

In [None]:
result = d_transposed.copy_to_host()
expected = a.T

In [None]:
np.array_equal(result, expected)

### Why Such a Small Improvement?

While this is a significant speedup for only a few lines of code, you might think that the performance improvement is not as stark as you expected, given the earlier gains from switching to coalesced access patterns. There are 2 main reasons for this:

1. The naive transpose kernel was making coalesced reads, so, your refactored version only optimized half of the global memory access throughout the execution of the kernel.
2. Your code as written suffers from something called shared memory bank conflicts, a topic to which we will now turn our attention.
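
As a preview of the second point, the sketch below models which shared memory bank each thread of a warp hits when indexing a 2D tile. It assumes 32 banks that are each 4 bytes wide, so consecutive 4-byte elements fall in consecutive banks, and it assumes a warp is the 32 threads with consecutive `threadIdx.x` and a common `threadIdx.y`; `conflict_degree` and its parameters are hypothetical names.

```python
NUM_BANKS = 32   # assumption: 32 banks, each 4 bytes wide
WARP_SIZE = 32

def conflict_degree(row_width, by_column, thread_y=0):
    """Largest number of threads in one warp that hit the same bank.

    The tile is stored row-major with `row_width` 4-byte elements per row.
    by_column=False models tile[threadIdx.y, threadIdx.x] (threads walk along a row).
    by_column=True  models tile[threadIdx.x, threadIdx.y] (threads walk down a column).
    """
    hits = [0] * NUM_BANKS
    for thread_x in range(WARP_SIZE):
        if by_column:
            row, col = thread_x, thread_y
        else:
            row, col = thread_y, thread_x
        hits[(row * row_width + col) % NUM_BANKS] += 1
    return max(hits)

print(conflict_degree(32, by_column=False))  # row-wise access
print(conflict_degree(32, by_column=True))   # column-wise access
```

In this model the row-wise access is conflict free (degree 1), while the column-wise access sends all 32 threads of the warp to the same bank (degree 32), and such accesses are serialized.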

## Presentation: Memory Bank Conflicts

Execute the following cell to load the slides, then click on "Start Slide Show" to make them full screen.

In [None]:
from IPython.display import IFrame
IFrame('https://view.officeapps.live.com/op/view.aspx?src=https://developer.download.nvidia.com/training/courses/C-AC-02-V1/bank_conflicts.pptx', 800, 450)

## Exercise: Resolve Memory Bank Conflicts

As a final exercise, you will refactor the transpose kernel utilizing shared memory to be shared memory bank conflict free.

### Imports

In [None]:
import numpy as np
from numba import cuda, types as numba_types

### Data Creation

In [None]:
n = 4096*4096 # 16M
threads_per_block = (32, 32)
blocks = (128, 128)

a = np.arange(n).reshape((4096,4096)).astype(np.float32)
transposed = np.zeros_like(a).astype(np.float32)

d_a = cuda.to_device(a)
d_transposed = cuda.to_device(transposed)

### Make the Kernel Bank Conflict Free

The `tile_transpose_conflict_free` kernel is a working matrix transpose kernel which utilizes shared memory so that both reads from and writes to global memory are coalesced. Your job is to refactor the kernel so that it does not suffer from memory bank conflicts.

**Note:**  a solution will not be provided.

In [None]:
@cuda.jit
def tile_transpose_conflict_free(a, transposed):
    # `tile_transpose_conflict_free` assumes it is launched with a 32x32 block dimension,
    # and that the dimensions of `a` are multiples of 32.

    # 1) Create 32x32 shared memory array (float32, matching the dtype of `a`).
    tile = cuda.shared.array((32, 32), numba_types.float32)

    # Compute offsets into global input array.
    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    # 2) Make coalesced read from global memory into shared memory array.
    # Note the use of local thread indices for the shared memory write,
    # and global offsets for global memory read.
    tile[cuda.threadIdx.y, cuda.threadIdx.x] = a[y, x]

    # 3) Wait for all threads in the block to finish updating shared memory.
    cuda.syncthreads()

    # 4) Calculate transposed location for the shared memory array tile
    # to be written back to global memory.
    t_x = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.x
    t_y = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.y

    # 5) Write back to global memory,
    # transposing each element within the shared memory array.
    transposed[t_y, t_x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]

### Check Performance

Assuming you have correctly resolved the bank conflicts, this kernel should run significantly faster than both the naive transpose kernel, and, the shared memory (with bank conflicts) transpose kernel. In order to pass the assessment, your kernel will need to run on average in less than 840 µs.

The first value printed by running the following cell will give you the average run time of your kernel.

In [None]:
%timeit tile_transpose_conflict_free[blocks, threads_per_block](d_a, d_transposed); cuda.synchronize()

## Summary