
[Docs] Mojo manual gpu basics exercise does not compile #4245

@Str-Gen

Description

Where is the problem?

https://docs.modular.com/mojo/manual/gpu/basics/#exercise

What can we do better?

The exercise at the end of the Mojo manual's GPU basics page does not compile on MAX/Mojo 25.3.

The code snippet below is the official solution. When you try to compile or run it, you get the following error:

gpu-basic-exercise-solution.mojo:43:26: error: 'DeviceBuffer[float32]' is not subscriptable, it does not implement the __getitem__/__setitem__ methods
output_buffer[block_idx.x] = value
~~~~~~~~~~~~~^
mojo: error: failed to parse the provided Mojo source module

If you downgrade to Mojo 25.2, it works. It seems strange that DeviceBuffer would lose its item-access operations (__getitem__/__setitem__) in the next version.

from gpu import thread_idx, block_idx, warp, barrier
from gpu.host import DeviceContext, DeviceBuffer
from gpu.memory import AddressSpace
from memory import stack_allocation
from layout import Layout, LayoutTensor
from math import iota
from sys import sizeof


def main():
    ctx = DeviceContext()

    alias dtype_f32 = DType.float32
    alias elements_f32 = 32
    alias blocks_f32 = 8
    alias threads_f32 = elements_f32 // blocks_f32

    # Create buffers
    var in_buffer_host = ctx.enqueue_create_host_buffer[dtype_f32](elements_f32)
    var in_buffer_device = ctx.enqueue_create_buffer[dtype_f32](elements_f32)
    var out_buffer_host = ctx.enqueue_create_host_buffer[dtype_f32](blocks_f32)
    var out_buffer_device = ctx.enqueue_create_buffer[dtype_f32](blocks_f32)

    # Zero output buffer values
    ctx.enqueue_memset(out_buffer_device, 0)
    ctx.synchronize()

    # Fill in input values sequentially and copy to device
    iota(in_buffer_host.unsafe_ptr(), elements_f32)
    in_buffer_host.enqueue_copy_to(in_buffer_device)

    # Create the LayoutTensor
    alias LayoutF32 = Layout.row_major(blocks_f32, threads_f32)
    alias InputTensorF32 = LayoutTensor[dtype_f32, LayoutF32, MutableAnyOrigin]
    var float_tensor = InputTensorF32(in_buffer_device)

    fn reduce_sum_f32(
        in_tensor: InputTensorF32, output_buffer: DeviceBuffer[dtype_f32]
    ):
        var value = in_tensor.load[1](block_idx.x, thread_idx.x)
        value = warp.sum(value)
        if thread_idx.x == 0:
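            # This is the line the 25.3 compiler error above points at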
            output_buffer[block_idx.x] = value

    ctx.enqueue_function[reduce_sum_f32](
        float_tensor, out_buffer_device, grid_dim=8, block_dim=4
    )

    out_buffer_device.enqueue_copy_to(out_buffer_host)

    ctx.synchronize()

    print(out_buffer_host)
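
For reference, here is a sketch of a possible workaround (not the documented fix): pass a raw pointer into the kernel instead of the DeviceBuffer and write through that. The kernel name, the output_ptr parameter, and the unsafe_ptr() call at the launch site are my own additions, and UnsafePointer would need to be imported from memory at the top of the file.

    fn reduce_sum_f32_ptr(
        in_tensor: InputTensorF32,
        output_ptr: UnsafePointer[Scalar[dtype_f32]],
    ):
        var value = in_tensor.load[1](block_idx.x, thread_idx.x)
        value = warp.sum(value)
        if thread_idx.x == 0:
            # Index through the raw pointer rather than the DeviceBuffer
            output_ptr[block_idx.x] = value

    ctx.enqueue_function[reduce_sum_f32_ptr](
        float_tensor, out_buffer_device.unsafe_ptr(), grid_dim=8, block_dim=4
    )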

Anything else?

There is a mismatch between the exercise description and the implementation.

Create an host and device buffer for the output of DType Int64, with 8 elements, don’t forget to zero the values with enqueue_memset()

But the implementation still uses DType.float32 for the output buffers.
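
For comparison, if the code matched the exercise text, the output buffers would presumably be created along these lines (sketch only; the dtype_i64 alias is mine, and the kernel would then also need to cast the float32 warp sum before storing it):

    alias dtype_i64 = DType.int64

    var out_buffer_host = ctx.enqueue_create_host_buffer[dtype_i64](blocks_f32)
    var out_buffer_device = ctx.enqueue_create_buffer[dtype_i64](blocks_f32)
    ctx.enqueue_memset(out_buffer_device, 0)
    # ...and in the kernel, e.g. value.cast[dtype_i64]() before the store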
