
[Docs] Mojo manual gpu basics exercise does not compile #4245

@Str-Gen

Description

Where is the problem?

https://docs.modular.com/mojo/manual/gpu/basics/#exercise

What can we do better?

The exercise at the end of the Mojo manual's GPU basics page does not compile on MAX/Mojo 25.3.

The code snippet below is the official solution. When you try to compile or run it, you get the following error:

gpu-basic-exercise-solution.mojo:43:26: error: 'DeviceBuffer[float32]' is not subscriptable, it does not implement the __getitem__/__setitem__ methods
output_buffer[block_idx.x] = value
~~~~~~~~~~~~~^
mojo: error: failed to parse the provided Mojo source module

If you downgrade to Mojo 25.2, it works. It seems strange that DeviceBuffer would lose its item-access operations (__getitem__/__setitem__) in the next version.

from gpu import thread_idx, block_idx, warp, barrier
from gpu.host import DeviceContext, DeviceBuffer
from gpu.memory import AddressSpace
from memory import stack_allocation
from layout import Layout, LayoutTensor
from math import iota
from sys import sizeof


def main():
    ctx = DeviceContext()

    alias dtype_f32 = DType.float32
    alias elements_f32 = 32
    alias blocks_f32 = 8
    alias threads_f32 = elements_f32 // blocks_f32

    # Create buffers
    var in_buffer_host = ctx.enqueue_create_host_buffer[dtype_f32](elements_f32)
    var in_buffer_device = ctx.enqueue_create_buffer[dtype_f32](elements_f32)
    var out_buffer_host = ctx.enqueue_create_host_buffer[dtype_f32](blocks_f32)
    var out_buffer_device = ctx.enqueue_create_buffer[dtype_f32](blocks_f32)

    # Zero output buffer values
    ctx.enqueue_memset(out_buffer_device, 0)
    ctx.synchronize()

    # Fill in input values sequentially and copy to device
    iota(in_buffer_host.unsafe_ptr(), elements_f32)
    in_buffer_host.enqueue_copy_to(in_buffer_device)

    # Create the LayoutTensor
    alias LayoutF32 = Layout.row_major(blocks_f32, threads_f32)
    alias InputTensorF32 = LayoutTensor[dtype_f32, LayoutF32, MutableAnyOrigin]
    var float_tensor = InputTensorF32(in_buffer_device)

    fn reduce_sum_f32(
        in_tensor: InputTensorF32, output_buffer: DeviceBuffer[dtype_f32]
    ):
        var value = in_tensor.load[1](block_idx.x, thread_idx.x)
        value = warp.sum(value)
        if thread_idx.x == 0:
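            # This is the line the 25.3 compiler error above points at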
            output_buffer[block_idx.x] = value

    ctx.enqueue_function[reduce_sum_f32](
        float_tensor, out_buffer_device, grid_dim=8, block_dim=4
    )

    out_buffer_device.enqueue_copy_to(out_buffer_host)

    ctx.synchronize()

    print(out_buffer_host)
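
For reference, here is a sketch of a possible workaround (not the documented fix): pass a raw pointer into the kernel instead of the DeviceBuffer and write through that. The kernel name, the output_ptr parameter, and the unsafe_ptr() call at the launch site are my own additions, and UnsafePointer would need to be imported from memory at the top of the file.

    fn reduce_sum_f32_ptr(
        in_tensor: InputTensorF32,
        output_ptr: UnsafePointer[Scalar[dtype_f32]],
    ):
        var value = in_tensor.load[1](block_idx.x, thread_idx.x)
        value = warp.sum(value)
        if thread_idx.x == 0:
            # Index through the raw pointer rather than the DeviceBuffer
            output_ptr[block_idx.x] = value

    ctx.enqueue_function[reduce_sum_f32_ptr](
        float_tensor, out_buffer_device.unsafe_ptr(), grid_dim=8, block_dim=4
    )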

Anything else?

There is a mismatch between the exercise description and the implementation.

Create an host and device buffer for the output of DType Int64, with 8 elements, don’t forget to zero the values with enqueue_memset()

But the implementation still uses DType.float32 for the output buffers.
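
For comparison, if the code matched the exercise text, the output buffers would presumably be created along these lines (sketch only; the dtype_i64 alias is mine, and the kernel would then also need to cast the float32 warp sum before storing it):

    alias dtype_i64 = DType.int64

    var out_buffer_host = ctx.enqueue_create_host_buffer[dtype_i64](blocks_f32)
    var out_buffer_device = ctx.enqueue_create_buffer[dtype_i64](blocks_f32)
    ctx.enqueue_memset(out_buffer_device, 0)
    # ...and in the kernel, e.g. value.cast[dtype_i64]() before the store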
