CUDA crashes when passed complex record array #9471

Open

shaunc opened this issue Feb 28, 2024 · 13 comments
Labels
bug - incorrect behavior (Bugs: incorrect behavior) · bug - segfault (Bugs that cause SIGSEGV, SIGABRT, SIGILL, SIGBUS) · CUDA (CUDA related issue/PR)

Comments

@shaunc (Contributor) commented Feb 28, 2024

Passing sufficiently complex data to a CUDA kernel, then passing that data to a device function and reading a piece of it, causes the kernel to crash.

Code to reproduce
from collections import namedtuple
from collections.abc import Sequence
from typing import Any, NamedTuple, Self, cast

import numpy as np
from numba import cuda  # type: ignore[import]
from numba.cuda.random import (  # type: ignore[import]
    create_xoroshiro128p_states,
)
from numpy import typing as npt

Sit = namedtuple("Sit", ["a", "b", "c"])

TrainContextHostNT = namedtuple(
    "TrainContextHostNT", ["config", "graph", "overlay"]
)
TrainOverlayNT = namedtuple("TrainOverlayNT", ["infer", "assess", "adapt"])

LevelAdaptNT = namedtuple(
    "LevelAdaptNT", ["cluster", "cluster_sit", "cluster_params"]
)
TrainConfigNT = namedtuple(
    "TrainConfigNT",
    [
        "n_samples",
        "n_features",
        "batch_size",
        "max_threads",
        "shared_bytes",
        "params",
        "dvs",
    ],
)
TrainGraphDataNT = namedtuple(
    "TrainGraphDataNT", ["model", "data", "state", "prev_state"]
)
NodeStateNT = namedtuple(
    "NodeStateNT",
    [
        "occurrence",
        "value",
        "relevance",
        "significance",
        "feature_value",
        "sample_relevance",
        "final_mem_avg",
        "final_mem_relevance",
    ],
)
CGDataNT = namedtuple(
    "CGDataNT",
    ["occurrence", "relevance", "significance", "value", "dv_params"],
)
_1 = np.float32(1)
DVParamsNT = namedtuple(
    "DVParamsNT", ["dv_definitions", "first_dv_feature", "level_err_range"]
)
DVDefinitionNT = namedtuple(
    "DVDefinitionNT",
    ["dv_idx", "threshold", "sharpness", "sig_scale", "sig_offset", "level"],
)
EMMClusterOverlayNT = namedtuple(
    "EMMClusterOverlayNT",
    ["cluster", "optimize", "util", "constraint"],
)
ClusterOverlayNT = namedtuple(
    "ClusterOverlayNT",
    [
        "min_sum_matrix",
        "max_sum_matrix",
        "rel_in",
        "rel_out",
        "dist_matrix",
        "stack",
        "i_matrix",
        "closest_out_dist",
        "max_i_array",
        "in_edges_out",
        "new_edge_count",
        "new_edge_start",
        "new_edge_source_ranked",
        "new_edge_source",
        "new_edge_target",
        "new_edge_weight",
        "new_edge_weight_min",
        "new_edge_weight_max",
    ],
)
OptimizeOverlayNT = namedtuple(
    "OptimizeOverlayNT",
    [
        "param_node_values",
        "param_edge_values",
        "set_edge_weight",
        "set_a_relevance",
        "set_value",
        "set_relevance",
        "set_significance",
        "set_mem_value",
        "set_mem_relevance",
        "set_mem_exp_avg",
        "edge_sum",
        "neg_edge_sum",
        "entropies",
        "i_min_param_set",
        "hist_lc_states",
        "hist_lc_buffer",
        "hist_diff_est",
        "back_significance",
        "back_cat_significance",
        "significance",
        "cat_significance",
        "relevance",
        "hist_node_sum",
        "hist_node_weight",
        "hist_mem_avg",
        "hist_sample_bins",
        "hist_combination",
    ],
)
UtilOverlayNT = namedtuple(
    "UtilOverlayNT",
    ["stack", "rng_states"],
)

ConstraintOverlayNT = namedtuple(
    "ConstraintOverlayNT",
    [
        "f_preserve",
        "param_node_old",
        "param_edge_old",
        "sum_nearest",
        "in_assigned",
        "sum_edge_delta",
    ],
)

ClusterSituationNT = namedtuple(
    "ClusterSituationNT",
    [
        "i_in",
        "in_level",
        "in_type",
        "i_out",
        "out_level",
        "out_type",
        "last_trained",
        "mb_samples",
        "a_weight",
        "is_objective",
        "n_iv_features",
        "n_dv_features",
    ],
)
PairNT = namedtuple("PairNT", ["a", "b"])

ClusterParamsNT = namedtuple(
    "ClusterParamsNT",
    [
        "min_relevance",
        "min_significance",
        "n_cluster_passes",
        "dist_passes",
        "dist_max_inputs",
        "dist_max_outputs",
        "dist_threshold",
        "n_spare_dist_threshold",
        "n_opt_passes",
        "n_opt_param_sets",
        "n_opt_best",
        "dim_relevance",
        "mem_ema_decay",
        "mem_relevance",
        "history_filter_bytes",
        "history_bf_thresholds",
        "seed",
    ],
)
ModelNT = namedtuple(
    "ModelNT",
    [
        "n_levels",
        "n_iv_features",
        "n_dv_objectives",
        "i_level_start",
        "i_level_edge_start",
        "i_level_edge_end",
        "i_edge_start",
        "i_edge_count",
        "threshold",
        "sharpness",
        "decay",
        "mem_threshold",
        "mem_sharpness",
        "mem_decay",
        "mem_out_weight",
        "mem_in_weight",
        "i_objective_node",
        "objective_cat",
        "param_node_min",
        "edge_target",
        "edge_source",
        "edge_weight",
        "param_node_max",
        "param_edge_min",
        "param_edge_max",
    ],
)
ClusterParamsNT = namedtuple(
    "ClusterParamsNT",
    [
        "min_relevance",
        "min_significance",
        "n_cluster_passes",
        "dist_passes",
        "dist_max_inputs",
        "dist_max_outputs",
        "dist_threshold",
        "n_spare_dist_threshold",
        "n_opt_passes",
        "n_opt_param_sets",
        "n_opt_best",
        "dim_relevance",
        "mem_ema_decay",
        "mem_relevance",
        "history_filter_bytes",
        "history_bf_thresholds",
        "seed",
    ],
)
AdaptParamsNT = namedtuple("AdaptParamsNT", ["f_preserve", "f_discard"])
TrainParamsNT = namedtuple(
    "TrainParamsNT",
    ["gpu", "model", "data", "dv", "cluster", "adapt", "a_weight", "seed"],
)
GPUParamsNT = namedtuple(
    "GPUParamsNT", ["mini_batch_size", "leave_free_frac", "leave_free_bytes"]
)
ModelParamsNT = namedtuple(
    "ModelParamsNT",
    [
        "level_nodes",
        "level_edges",
        "subj_cuts",
        "obj_cuts",
        "min_halflife",
        "max_halflife",
    ],
)
FeatureRangesNT = namedtuple(
    "FeatureRangesNT",
    [
        "n_dv_features",
        "n_iv_features",
        "feature_min",
        "feature_max",
        "feature_median",
    ],
)


class LCState(NamedTuple):
    n_threshold: np.float32
    p_threshold: np.float32
    lc_threshold: np.float32
    diff_est: np.float32

    @staticmethod
    def dtype() -> np.dtype[Any]:
        """
        Returns data type for LCState.
        """
        return np.dtype(
            dict(
                names=(
                    "n_threshold",
                    "p_threshold",
                    "lc_threshold",
                    "diff_est",
                ),
                formats=(
                    np.float32,
                    np.float32,
                    np.float32,
                    np.float32,
                ),
                aligned=True,
            ),
        )

    @classmethod
    def create_array(cls, *dims: int) -> Any:  # noqa: ANN401
        """
        Create LCState array.
        """
        return np.zeros(dims, dtype=cls.dtype())  # type: ignore[no-any-return]


U4_MAX = np.uint32(0xFFFFFFFF)


class LCBuffer(NamedTuple):
    p_threshold_bin: npt.NDArray[np.uint32]
    lc_threshold_bin: npt.NDArray[np.float32]

    @staticmethod
    def dtype(n_bin_buffer: int) -> np.dtype[Any]:
        n_bin_words = (n_bin_buffer + 31) // 32
        return np.dtype(
            dict(
                names=(
                    "p_threshold_bin",
                    "lc_threshold_bin",
                ),
                formats=(
                    np.dtype((np.uint32, n_bin_words)),
                    np.dtype((np.float32, n_bin_buffer)),
                ),
                aligned=True,
            ),
        )

    @classmethod
    def create(cls, buffer_bins: int) -> Self:
        n_buffer_words = (buffer_bins + 31) // 32
        return cls(
            p_threshold_bin=np.full(n_buffer_words, U4_MAX, dtype=np.uint32),
            lc_threshold_bin=np.zeros(buffer_bins, dtype=np.float32),
        )


def test_mem() -> None:
    """
    Test setup cluster situation.
    """
    out_nodes = in_nodes = dist_max_inputs = block_edges = level_edges = 1
    buffer_bins = thresholds = seed = threads = 1
    stream = 0
    param_sets = 4
    model_params = ModelParamsNT(
        level_nodes=((1, 2), (3, 4)),
        level_edges=((5, 6), (7, 8)),
        subj_cuts=2,
        obj_cuts=2,
        min_halflife=np.float32(0.5),
        max_halflife=np.float32(1.5),
    )
    data_params = FeatureRangesNT(
        n_dv_features=1,
        n_iv_features=1,
        feature_min=np.zeros(1, dtype=np.float32),
        feature_max=np.zeros(1, dtype=np.float32),
        feature_median=np.zeros(1, dtype=np.float32),
    )
    adapt_params_1 = AdaptParamsNT(
        f_preserve=np.float32(0.2),
        f_discard=np.float32(0.2),
    )

    dv_definitions = (
        DVDefinitionNT(
            dv_idx=0,
            threshold=_1,
            sharpness=_1,
            sig_scale=_1,
            sig_offset=_1,
            level=-1,
        ),
    )
    dv_params = DVParamsNT(
        dv_definitions=dv_definitions,
        first_dv_feature=0,
        level_err_range=PairNT(0, 1),
    )
    c_params = ClusterParamsNT(
        min_relevance=np.float32(1.0),
        min_significance=np.float32(1.0),
        n_cluster_passes=1,
        dist_passes=1,
        dist_max_inputs=2,
        dist_max_outputs=2,
        dist_threshold=np.float32(0.5),
        n_spare_dist_threshold=1,
        n_opt_passes=1,
        n_opt_param_sets=5,
        n_opt_best=2,
        dim_relevance=np.float32(0.9),
        mem_ema_decay=np.float32(0.8),
        mem_relevance=np.float32(0.9),
        history_filter_bytes=100,
        history_bf_thresholds=np.array([0.5, 0.9], dtype=np.float32),
        seed=35,
    )
    train_params_1 = TrainParamsNT(
        gpu=GPUParamsNT(10, 0.3, 0.3),
        model=model_params,
        data=data_params,
        dv=dv_params,
        cluster=c_params,
        adapt=adapt_params_1,
        a_weight=1.0,
        seed=32,
    )
    data = CGDataNT(
        occurrence=np.zeros(1, dtype=np.float64),
        relevance=np.zeros(1, dtype=np.float32),
        significance=np.zeros(1, dtype=np.float32),
        value=np.zeros((1, 1), dtype=np.float32),
        dv_params=dv_params,
        # last_occurrence=np.float64(0.0),
    )
    model = create_empty(1, 1, ((3, 2), (3, 2)), ((5, 2), (5, 2)))

    params = train_params_1
    config = TrainConfigNT(
        n_samples=1,
        n_features=1,
        batch_size=1,
        max_threads=1,
        shared_bytes=1,
        params=params,
        dvs=params.dv,
    )

    rng_states = create_xoroshiro128p_states(
        threads,
        seed,
        stream=stream,
    )
    util = UtilOverlayNT(
        stack=np.zeros((out_nodes, 1), dtype=np.int32),
        rng_states=rng_states.copy_to_host(),
    )
    mb_samples = 11
    state = NodeStateNT(
        occurrence=np.zeros(1, dtype=np.float64),
        value=np.zeros((1, 1), dtype=np.float32),
        relevance=np.zeros((1, 1), dtype=np.float32),
        significance=np.zeros((1, 1), dtype=np.float32),
        feature_value=np.zeros((1, 1), dtype=np.float32),
        sample_relevance=np.zeros(1, dtype=np.float32),
        final_mem_avg=1,
        final_mem_relevance=1,
    )
    graph = TrainGraphDataNT(
        model=model,
        data=data,
        state=state,
        prev_state=state,
    )
    cluster = (
        EMMClusterOverlayNT(
            cluster=ClusterOverlayNT(
                max_sum_matrix=np.zeros(
                    (in_nodes, out_nodes), dtype=np.float32
                ),
                min_sum_matrix=np.zeros(
                    (in_nodes, out_nodes), dtype=np.float32
                ),
                rel_in=np.zeros((in_nodes,), dtype=np.float32),
                rel_out=np.zeros((out_nodes,), dtype=np.float32),
                dist_matrix=np.zeros((out_nodes, in_nodes), dtype=np.float32),
                stack=np.zeros((out_nodes, in_nodes), dtype=np.int32),
                i_matrix=np.zeros((out_nodes, in_nodes), dtype=np.int32),
                closest_out_dist=np.zeros((in_nodes,), dtype=np.float32),
                max_i_array=np.zeros((out_nodes,), dtype=np.int32),
                in_edges_out=np.zeros((in_nodes,), dtype=np.int32),
                new_edge_count=np.zeros((out_nodes,), dtype=np.int32),
                new_edge_start=np.zeros((out_nodes,), dtype=np.int32),
                new_edge_source_ranked=np.zeros(
                    (out_nodes, dist_max_inputs),
                    dtype=np.int32,
                ),
                new_edge_source=np.zeros((block_edges,), dtype=np.int32),
                new_edge_target=np.zeros((block_edges,), dtype=np.int32),
                new_edge_weight=np.zeros((block_edges, 2), dtype=np.float32),
                new_edge_weight_min=np.zeros(
                    (block_edges, 2), dtype=np.float32
                ),
                new_edge_weight_max=np.zeros(
                    (block_edges, 2), dtype=np.float32
                ),
            ),
            optimize=OptimizeOverlayNT(
                param_node_values=np.zeros(
                    (param_sets, out_nodes, 1),
                    dtype=np.float32,
                ),
                param_edge_values=np.zeros(
                    (max(param_sets, 4), level_edges, 1),
                    dtype=np.float32,
                ),
                set_edge_weight=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_a_relevance=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_value=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_relevance=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_significance=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_mem_value=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_mem_relevance=np.zeros(
                    (param_sets, out_nodes, mb_samples),
                    dtype=np.float32,
                ),
                set_mem_exp_avg=np.zeros(
                    (param_sets, out_nodes), dtype=np.float32
                ),
                edge_sum=np.zeros((param_sets, out_nodes, 2), dtype=np.float32),
                neg_edge_sum=np.zeros(
                    (param_sets, out_nodes, 2), dtype=np.float32
                ),
                entropies=np.zeros((out_nodes, param_sets), dtype=np.float32),
                i_min_param_set=np.zeros(
                    (out_nodes, param_sets), dtype=np.int32
                ),
                hist_lc_states=(
                    LCState.create_array(param_sets, out_nodes, thresholds)
                ),
                hist_lc_buffer=(LCBuffer.create(buffer_bins)),
                hist_diff_est=np.zeros(
                    (param_sets, out_nodes, thresholds),
                    dtype=np.float32,
                ),
                hist_mem_avg=np.zeros(
                    (param_sets, out_nodes), dtype=np.float32
                ),
                hist_sample_bins=np.zeros(mb_samples, dtype=np.uint32),
                hist_combination=np.zeros(out_nodes, dtype=np.int32),
                back_significance=np.zeros(
                    (param_sets, out_nodes),
                    dtype=np.float32,
                ),
                back_cat_significance=np.zeros(
                    (param_sets, out_nodes),
                    dtype=np.float32,
                ),
                significance=np.zeros(
                    (param_sets, out_nodes), dtype=np.float32
                ),
                cat_significance=np.zeros(
                    (param_sets, out_nodes),
                    dtype=np.float32,
                ),
                relevance=np.zeros((param_sets, out_nodes), dtype=np.float32),
                hist_node_sum=np.zeros(
                    (param_sets, out_nodes), dtype=np.float32
                ),
                hist_node_weight=np.zeros(
                    (param_sets, out_nodes),
                    dtype=np.float32,
                ),
            ),
            util=util,
            constraint=ConstraintOverlayNT(
                f_preserve=np.zeros(out_nodes, dtype=np.float32),
                param_node_old=np.zeros(
                    (out_nodes, 1),
                    dtype=np.float32,
                ),
                param_edge_old=np.zeros(
                    (level_edges, 1),
                    dtype=np.float32,
                ),
                sum_nearest=np.zeros(1, dtype=np.float32),
                in_assigned=np.zeros(in_nodes, dtype=np.int32),
                sum_edge_delta=np.zeros(
                    (out_nodes, 1),
                    dtype=np.float32,
                ),
            ),
        ),
    )
    overlay = TrainOverlayNT(
        infer=1,
        assess=1,
        adapt=LevelAdaptNT(
            cluster=cluster,
            cluster_sit=ClusterSituationNT(
                i_in=(PairNT(0, 0), PairNT(0, 0)),
                in_level=(0, 0),
                in_type=(1, 1),
                i_out=PairNT(0, 0),
                out_level=1,
                out_type=1,
                last_trained=np.float32(0),
                mb_samples=1,
                a_weight=np.float32(1),
                n_dv_features=1,
                n_iv_features=1,
                is_objective=np.bool_(False),  # noqa: FBT003
            ),
            # initialize empty - this is filled in per-call to cluster
            cluster_params=ClusterParamsNT(
                np.float32(0),
                np.float32(0),
                0,
                0,
                0,
                0,
                np.float32(0),
                0,
                0,
                0,
                0,
                np.float32(0),
                np.float32(0),
                np.float32(0),
                0,
                np.zeros(1, dtype=np.float32),
                seed=1,
            ),
        ),
    )
    host_context = TrainContextHostNT(config, graph, overlay)

    final_args = named_tuples_to_records(
        host_context,
        records_to_device=True,
        align=True,
        contiguous=True,
    )
    mem_test_kernel[1, 1](*final_args)  # type: ignore[call-overload]


@cuda.jit(debug=True, opt=False)
def mem_test_kernel(hc: Any) -> None:  # noqa: ANN401
    context = Sit(hc, hc, 1)
    foo_device(context)


@cuda.jit(device=True, debug=True, opt=False)
def foo_device(context: Sit) -> Any:  # noqa: ANN401
    return context.a


def create_empty(
    n_iv_features: int,
    n_dv_objectives: int,
    node_count: Sequence[Sequence[int]],
    edge_count: Sequence[Sequence[int]],
) -> ModelNT:
    n_types = 2
    node_count_ = np.asarray(node_count)
    edge_count_ = np.asarray(edge_count)
    ls_node_count = node_count_[:, 0]
    lc_node_count = node_count_[:, 1]
    ls_edge_count = edge_count_[:, 0]
    lc_edge_count = edge_count_[:, 1]
    levels = node_count_.shape[0]
    levels_p = levels + 1
    if (edge_count_.shape[0] != levels) or (edge_count_.shape[1] != n_types):
        raise ValueError("Number of levels not consistent")
    i_level_start = np.zeros((levels_p, n_types), np.int32)
    i_level_edge_start = np.zeros((levels_p, n_types), np.int32)
    i_level_edge_end = np.zeros((levels_p, n_types), np.int32)
    level_node_count = ls_node_count + lc_node_count
    level_edge_count = ls_edge_count + lc_edge_count

    i_level_start[1:, 0] = np.cumsum(level_node_count)
    i_level_start[:-1, 1] = i_level_start[1:, 0] - lc_node_count
    i_level_start[-1, 1] = i_level_start[-1, 0]
    i_level_edge_start[1:, 0] = np.cumsum(level_edge_count)
    i_level_edge_start[:-1, 1] = i_level_edge_start[1:, 0] - lc_edge_count
    i_level_edge_start[-1, 1] = i_level_edge_start[-1, 0]

    # in empty model, no edges assigned
    i_level_edge_end[:, :] = i_level_edge_start
    n_nodes = i_level_start[-1, 0]
    i_level_start[-1, 1] = n_nodes
    n_edges = i_level_edge_start[-1, 0]
    i_level_edge_start[-1, 1] = n_edges

    i_edge_start = np.zeros(n_nodes, dtype=np.int32)
    i_edge_count = np.zeros(n_nodes, dtype=np.int32)

    # Point node edge starts to start of level block they belong to.
    for i_level in range(levels):
        for node_type in range(2):
            i_start = i_level_start[i_level, node_type]
            i_end_level = i_level if node_type == 0 else i_level + 1
            i_end = i_level_start[i_end_level, 1 - node_type]
            i_edge_start[i_start:i_end] = i_level_edge_start[i_level, node_type]
    n_p = 8
    param_node_min = np.zeros((n_nodes, n_p), dtype=np.float32)
    param_node_max = np.zeros((n_nodes, n_p), dtype=np.float32)
    return ModelNT(
        n_levels=levels,
        n_iv_features=n_iv_features,
        n_dv_objectives=n_dv_objectives,
        i_level_start=i_level_start,
        i_level_edge_start=i_level_edge_start,
        i_level_edge_end=i_level_edge_end,
        i_edge_start=i_edge_start,
        i_edge_count=i_edge_count,
        threshold=np.full(n_nodes, np.nan, dtype=np.float32),
        sharpness=np.zeros(n_nodes, dtype=np.float32),
        decay=np.zeros(n_nodes, dtype=np.float32),
        mem_threshold=np.zeros(n_nodes, dtype=np.float32),
        mem_sharpness=np.zeros(n_nodes, dtype=np.float32),
        mem_decay=np.zeros(n_nodes, dtype=np.float32),
        mem_out_weight=np.zeros(n_nodes, dtype=np.float32),
        mem_in_weight=np.zeros(n_nodes, dtype=np.float32),
        i_objective_node=np.zeros(n_nodes, dtype=np.int32),
        objective_cat=np.zeros(n_nodes, dtype=np.int8),
        param_node_min=param_node_min,
        param_node_max=param_node_max,
        edge_target=np.full(n_edges, -1, dtype=np.int32),
        edge_source=np.full(n_edges, -n_iv_features - 1, dtype=np.int32),
        edge_weight=np.zeros((n_edges, 2), dtype=np.float32),
        param_edge_min=np.zeros((n_edges, 2), dtype=np.float32),
        param_edge_max=np.zeros((n_edges, 2), dtype=np.float32),
    )


# ----------------------------------------------------------------------------
# from record.py
from collections import abc
from collections.abc import Iterator
from typing import Any

_EXCLUDED_SEQ_TYPES = (str, bytes, bytearray, memoryview)
TField = tuple[str, Any] | tuple[str, Any, Any]


def named_tuples_to_records(
    *args: Any,
    records_to_device: bool = False,
    align: bool = False,
    contiguous: bool = False,
) -> tuple[Any, ...]:
    """
    Convert named tuples in args to records.

    if records_to_device is True, records corresponding to named tuples are
    copied to device.

    Returns arg list with named tuples converted to records.
    """
    arg_list: list[Any] = list(args).copy()
    for i, arg in enumerate(arg_list):
        if hasattr(arg, "_asdict"):
            arg = to_record(arg, align=align)[0]
            if contiguous:
                old_ndim = arg.ndim
                arg = np.ascontiguousarray(arg)
                # if arg is a scalar, ascontiguousarray returns a 0-d array
                if old_ndim < arg.ndim:
                    arg = arg[0]

            if records_to_device:
                arg = cuda.to_device(arg)
            arg_list[i] = arg
        elif isinstance(arg, np.ndarray) and contiguous:
            old_ndim = arg.ndim
            arg = np.ascontiguousarray(arg)
            # if arg is a scalar, ascontiguousarray returns a 0-d array
            if old_ndim < arg.ndim:
                arg = arg[0]
            arg_list[i] = arg
    return tuple(arg_list)


def to_record(obj: Any, align: bool = True) -> np.record:
    """
    Convert object to record array.

    We copy an object to a single-element record array. This is useful for
    passing complex objects to CUDA kernels. NamedTuples passed to kernels
    are passed by value, not by reference, so they are copied on every call,
    and their complexity is limited to 512 elements (including all nested
    elements). Record arrays are passed by reference, so they are not copied.

    Currently a limited number of types are supported.

    Can be used like:
    ```
    rec = record.to_record(obj)
    ```

    Then for to send, for one-way trip:
    ```
    rec_dev = cuda.to_device(rec)
    some_kernel[grid, block](rec_dev[0])
    ```
    or for round trip:
    ```
    with cuda.pinned(rec):
        some_kernel[grid, block](rec[0])
    ```

    Args:
        obj: object to convert

    Returns:
        record array
    """
    fields: list[TField] = []
    values: list[Any] = []
    if hasattr(obj, "_asdict"):
        # assume NamedTuple
        items = cast(Iterator[tuple[str, Any]], obj._asdict().items())
    elif hasattr(obj, "__slots__"):
        items = cast(
            Iterator[tuple[str, Any]],
            ((name, getattr(obj, name)) for name in obj.__slots__),
        )
    elif isinstance(obj, abc.Sequence) and not isinstance(
        obj, _EXCLUDED_SEQ_TYPES
    ):
        # convert sequence to array and return directly, adding dimension
        # as caller expects a record
        value = _convert_seq_to_array(obj, align=align)
        return value[None]
    else:
        items = cast(Iterator[tuple[str, Any]], vars(obj).items())
    for name, value in items:
        if (
            isinstance(value, abc.Sequence)
            and not isinstance(value, _EXCLUDED_SEQ_TYPES)
            and not hasattr(value, "_asdict")
        ):
            # convert sequence to sub-array
            value = _convert_seq_to_array(value, align=align)

        if isinstance(value, np.ndarray):
            fields.append((name, value.dtype, value.shape))
            values.append(value)
        elif np.isscalar(value):
            fields.append((name, type(value)))
            values.append(value)
        elif isinstance(value, object):
            arr = to_record(value, align=align)
            value = arr[0]
            items = value.dtype.fields.items()
            if len(items) == 0:
                # TODO: empty sub-record is not supported by numba
                # also, an empty value has shape (0,) while the dtype
                # has shape (), which causes problems in numpy
                raise RuntimeError(
                    "empty sub-record is not supported by numba, "
                    "and numpy has creation glitches"
                )
                # fields.append((name, value.dtype))
                # values.append(arr[0])
            else:
                sub_fields = list[TField]()
                for sub_name, p_sub_dtype in items:
                    # in field list, 2nd element is dtype w/ position info
                    # we want dtype w/o position info
                    sub_dtype = p_sub_dtype[0]
                    if sub_dtype.shape == ():
                        sub_type = (
                            sub_dtype.type
                            if sub_dtype.type != np.void
                            else sub_dtype
                        )
                        sub_fields.append((sub_name, sub_type))
                    else:
                        sub_type = sub_dtype.subdtype[0].type
                        if sub_type == np.record or sub_type == np.void:
                            sub_type = sub_dtype.subdtype[0]
                        sub_fields.append((sub_name, sub_type, sub_dtype.shape))
                fields.append((name, sub_fields))
                values.append(value)
        else:
            fields.append((name, type(value)))
            values.append(value)
    if len(values) == 0:
        return np.rec.recarray(0, dtype=[])[None]
    dtype = np.dtype(fields)
    if align:
        dtype = fixup_dtype_alignment(dtype)
    rec = cast(
        np.record,
        np.rec.fromarrays(values, dtype=dtype)[None],
    )
    if align and not is_aligned(rec):
        raise RuntimeError("record is not aligned")
    return rec


def copy_from_record(out: Any, rec: np.record) -> None:
    """
    Copy from structured record to object with same shape.

    Copies values in sub-arrays in rec to sub-arrays in `out`. Individual
    non-array values cannot be copied into rec if it is a namedtuple, as
    they are immutable.

    Args:
        out: object to copy to
        rec: record to copy from
    """
    if hasattr(out, "_asdict"):
        # assume NamedTuple
        for name in out._asdict().keys():
            _copy_from_record_at(out, rec, name)
    elif hasattr(out, "__slots__"):
        for name in out.__slots__:
            _copy_from_record_at(out, rec, name)
    elif isinstance(out, abc.Sequence) and not isinstance(
        out, _EXCLUDED_SEQ_TYPES
    ):
        if len(rec.shape) > 1:
            rec = rec[0]
        for i, value in enumerate(out):
            copy_from_record(value, rec[i])
    elif hasattr(out, "__dict__"):
        # NOTE: putatively writable structure; we could set elements
        # here, unlike namedtuple case.
        for name in vars(out).keys():
            _copy_from_record_at(out, rec, name)


def _convert_seq_to_array(
    seq: Sequence[Any], align: bool = False
) -> np.ndarray[Any, Any]:
    """
    Convert sequence to array.

    Args:
        seq: sequence to convert

    Returns:
        array
    """
    try:
        sub_values = [to_record(v, align=align)[0] for v in seq]  # type: ignore[assignment]
    except TypeError:
        sv = np.array(list(seq))
        sub_values = sv.astype(fixup_dtype_alignment(sv.dtype)) if align else sv

    if len(sub_values) > 0:
        value_dtype = sub_values[0].dtype
        value_shape = (len(sub_values),) + sub_values[0].shape
        value = np.array(sub_values, dtype=value_dtype).reshape(value_shape)
    else:
        value = np.zeros(0, dtype=np.int32)
    return value


def is_aligned(rec: np.record) -> bool:
    """
    Check alignment of record.

    Args:
        rec: record to check

    Returns: True if record is aligned
    """
    if rec.dtype.names is None:
        return rec.flags["ALIGNED"]

    for name in rec.dtype.names:
        field = rec.dtype.fields[name]
        offset = field[1]
        elt_size = _max_element_size(rec[name].dtype)
        if offset % min(elt_size, 8) != 0:
            return False
        if not is_aligned(rec[name]):
            return False
    return True


def _max_element_size(dtype: np.dtype[Any]) -> int:
    """
    Calculate maximum element size of dtype.

    This determines how the dtype must be aligned.
    """
    if dtype.names is None:
        if dtype.subdtype is not None:
            sub_dtype, _ = dtype.subdtype[:2]
            return _max_element_size(sub_dtype)
        return dtype.itemsize
    return max(_max_element_size(dtype.fields[name][0]) for name in dtype.names)


def fixup_dtype_alignment(dtype: np.dtype[Any]) -> np.dtype[Any]:
    """
    Fix up alignment of dtype.

    A workaround for: https://github.com/numpy/numpy/issues/24339

    Args:
        dtype: dtype to fix up

    Returns: dtype with alignment fixed up
    """
    return _fixup_dtype_alignment(dtype)[0]


def _fixup_dtype_alignment(dtype: np.dtype[Any]) -> tuple[np.dtype[Any], int]:
    """
    Fix up alignment of dtype recursively.

    Aligns each element so that its contents are aligned, given what
    is before it in the structure.

    Args:
        dtype: dtype to fix up

    Returns fixed dtype and largest individual element size.

    """
    if dtype.names is None:
        if dtype.subdtype is not None:
            sub_dtype, layout = dtype.subdtype[:2]
            sub_dtype, size = _fixup_dtype_alignment(sub_dtype)
            return np.dtype((sub_dtype, layout)), size
        return dtype, dtype.itemsize
    names = list[str]()
    formats = list[Any]()
    offsets = list[int]()
    element_size = 0
    offset = 0
    for name in dtype.names:
        field = dtype.fields[name]  # type: ignore[assignment]
        sub_dtype = cast(np.dtype[Any], field[0])
        sub_dtype, size = _fixup_dtype_alignment(sub_dtype)
        element_size = max(element_size, size)
        if offset % size != 0:
            offset += size - offset % size
        names.append(name)
        formats.append(sub_dtype)
        offsets.append(offset)
        offset += sub_dtype.itemsize
    if offset % element_size != 0:
        offset += element_size - offset % element_size
    if dtype.type == np.record:
        new_dtype: np.dtype[Any] = np.dtype(
            (
                np.record,
                dict(
                    names=names,
                    formats=formats,
                    offsets=offsets,
                    itemsize=offset,
                ),
            ),
            align=True,
        )
    else:
        new_dtype = np.dtype(
            dict(
                names=names, formats=formats, offsets=offsets, itemsize=offset
            ),
            align=True,
        )
    return (new_dtype, element_size)


def _copy_from_record_at(out: Any, rec: np.record, name: str) -> None:
    """
    Copy from structured record to object at named field.

    Copies values in sub-arrays in rec to sub-arrays in `out`. Individual
    non-array values cannot be copied into rec if it is a namedtuple, as
    they are immutable.

    Args:
        out: object to copy to
        rec: record to copy from
        name: name of field to copy
    """
    target = getattr(out, name)
    if isinstance(target, np.ndarray):
        target[:] = rec[name]
    else:
        copy_from_record(target, rec[name])
Error message on running
$ py.test --last-failed --new-first -s --cuda --x-slow -vv tests/context/adapt/cluster/test_mem.py::test_mem -x
================================================= test session starts ==================================================
platform linux -- Python 3.11.8, pytest-7.4.4, pluggy-1.4.0 -- /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/bin/python
cachedir: .pytest_cache
rootdir: /home/shauncutts/src/factfiber.ai/learn/timbuktu
configfile: pyproject.toml
plugins: timeout-2.2.0, snapshot-0.9.0, cov-4.1.0, cases-3.8.2
timeout: 2s
timeout method: signal
timeout func_only: True
collected 1 item                                                                                                       
run-last-failure: rerun previous 1 failure

tests/context/adapt/cluster/test_mem.py::test_mem Fatal Python error: Segmentation fault

Current thread 0x00007f4e1226b740 (most recent call first):
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 326 in safe_cuda_api_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 2824 in add_ptx
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/codegen.py", line 170 in get_cubin
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/codegen.py", line 195 in get_cufunc
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 197 in bind
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 926 in compile
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 681 in _compile_for_args
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 673 in call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 539 in __call__
  File "/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/adapt/cluster/test_mem.py", line 638 in test_mem
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/python.py", line 1792 in runtest
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 325 in _main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/bin/py.test", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, yaml._yaml, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, numba.mviewbuf, numba.types.itertools, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cffi_backend, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering (total: 91)

With apologies for the size of the example - as it is, it took me a whole day to isolate (after a couple of days trying to debug). There may be ways to simplify it, but if I delete almost any field at the moment, it tends to pass rather than fail, and it doesn't seem to matter which field. I suspect there is some buffer somewhere - or the equivalent - that the shape of the structure is overflowing.

I have been struggling with similar problems for at least a few months now. Usually I have been able to fiddle with the data structures to make the problem go away, and the complexity of the reproducers has dissuaded me from submitting them. However, the continual fiddling with data structures is becoming a long-term maintenance issue for the code base, and this time the error seems more recondite. (I have trimmed a few things out of the real structures to create this example.)

numba -s output
$ poetry run numba -s
System info:

--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2024-02-28 11:53:15.089183
UTC start time                                : 2024-02-28 16:53:15.089186
Running time (s)                              : 0.443454

__Hardware Information__
Machine                                       : x86_64
CPU Name                                      : znver2
CPU Count                                     : 48
Number of accessible CPUs                     : 48
List of accessible CPUs cores                 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 64bit adx aes avx avx2 bmi bmi2
                                                clflushopt clwb clzero cmov crc32
                                                cx16 cx8 f16c fma fsgsbase fxsr
                                                lzcnt mmx movbe mwaitx pclmul
                                                popcnt prfchw rdpid rdrnd rdseed
                                                sahf sha sse sse2 sse3 sse4.1
                                                sse4.2 sse4a ssse3 wbnoinvd xsave
                                                xsavec xsaveopt xsaves

Memory Total (MB)                             : 128727
Memory Available (MB)                         : 121209

__OS Information__
Platform Name                                 : Linux-5.4.0-172-generic-x86_64-with-glibc2.31
Platform Release                              : 5.4.0-172-generic
OS Name                                       : Linux
OS Version                                    : #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
OS Specific Version                           : ?
Libc Version                                  : glibc 2.31

__Python Information__
Python Compiler                               : GCC 12.3.0
Python Implementation                         : CPython
Python Version                                : 3.11.8
Python Locale                                 : en_US.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.59.0
llvmlite Version                              : 0.42.0

__LLVM Information__
LLVM Version                                  : 14.0.6

__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : 12.3
CUDA Runtime Version                          : 12.3
CUDA NVIDIA Bindings Available                : False
CUDA NVIDIA Bindings In Use                   : False
CUDA Minor Version Compatibility Available    : False
CUDA Minor Version Compatibility Needed       : False
CUDA Minor Version Compatibility In Use       : False
CUDA Detect Output:
Found 1 CUDA devices
id 0    b'NVIDIA GeForce RTX 3070'                              [SUPPORTED]
                      Compute Capability: 8.6
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-c61e09df-dc55-c4cc-3740-3cd5d5b27f81
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported

CUDA Libraries Test Output:
Finding driver from candidates:
        libcuda.so
        libcuda.so.1
        /usr/lib/libcuda.so
        /usr/lib/libcuda.so.1
        /usr/lib64/libcuda.so
        /usr/lib64/libcuda.so.1
Using loader <class 'ctypes.CDLL'>
        Trying to load driver...        ok
                Loaded from libcuda.so
        Mapped libcuda.so paths:
                /usr/lib/x86_64-linux-gnu/libcuda.so.545.23.08
Finding nvvm from CUDA_HOME
        Located at /usr/local/cuda/nvvm/lib64/libnvvm.so.4.0.0
        Trying to open library...       ok
Finding nvrtc from CUDA_HOME
        Located at /usr/local/cuda/lib64/libnvrtc.so.12.3.107
        Trying to open library...       ok
Finding cudart from CUDA_HOME
        Located at /usr/local/cuda/lib64/libcudart.so.12.3.101
        Trying to open library...       ok
Finding cudadevrt from CUDA_HOME
        Located at /usr/local/cuda/lib64/libcudadevrt.a
        Checking library...     ok
Finding libdevice from CUDA_HOME
        Located at /usr/local/cuda/nvvm/libdevice/libdevice.10.bc
        Checking library...     ok


__NumPy Information__
NumPy Version                                 : 1.26.4
NumPy Supported SIMD features                 : ('MMX', 'SSE', 'SSE2', 'SSE3', 'SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2')
NumPy Supported SIMD dispatch                 : ('SSSE3', 'SSE41', 'POPCNT', 'SSE42', 'AVX', 'F16C', 'FMA3', 'AVX2', 'AVX512F', 'AVX512CD', 'AVX512_KNL', 'AVX512_KNM', 'AVX512_SKX', 'AVX512_CLX', 'AVX512_CNL', 'AVX512_ICL')
NumPy Supported SIMD baseline                 : ('SSE', 'SSE2', 'SSE3')
NumPy AVX512_SKX support detected             : False

__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : False
llvmlite Using SVML Patched LLVM              : True
SVML Operational                              : False

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : True
+-->Vendor: GNU
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda not available.

__Installed Packages__
Package                   Version     Editable project location
------------------------- ----------- ------------------------------------------------
altair                    4.2.2
asciitree                 0.3.3
asttokens                 2.4.1
attrs                     23.2.0
black                     24.2.0
cachetools                5.3.2
cairocffi                 1.6.1
CairoSVG                  2.7.1
cffi                      1.16.0
cfgv                      3.4.0
click                     8.1.7
colorama                  0.4.6
comm                      0.2.1
contourpy                 1.2.0
coverage                  7.4.3
cssselect2                0.7.0
cycler                    0.12.1
dacite                    1.8.1
debugpy                   1.8.1
decopatch                 1.4.10
decorator                 5.1.1
defusedxml                0.7.1
distlib                   0.3.8
docopt                    0.6.2
entrypoints               0.4
executing                 2.0.1
fasteners                 0.19
fastjsonschema            2.19.1
filelock                  3.13.1
fire                      0.5.0
fonttools                 4.49.0
identify                  2.5.35
importlib-metadata        7.0.1
ingolstadt                0.1.0
iniconfig                 2.0.0
ipykernel                 6.29.2
ipython                   8.22.1
jedi                      0.19.1
Jinja2                    3.1.3
joblib                    1.3.2
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
jupyter_client            8.6.0
jupyter_core              5.7.1
jupytext                  1.16.1
kiwisolver                1.4.5
lamarck                   0.1.0
llvmlite                  0.42.0
makefun                   1.15.2
markdown-it-py            3.0.0
MarkupSafe                2.1.5
matplotlib                3.8.3
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mypy                      1.2.0
mypy-extensions           1.0.0
nbformat                  5.9.2
nest-asyncio              1.6.0
nodeenv                   1.8.0
numba                     0.59.0
numcodecs                 0.12.1
numpy                     1.26.4
packaging                 23.2
pandas                    2.2.1
parso                     0.8.3
pathspec                  0.12.1
pdoc                      13.1.1
pexpect                   4.9.0
pillow                    10.2.0
pip                       23.3.1
platformdirs              4.2.0
pluggy                    1.4.0
pre-commit                2.21.0
prince                    0.7.1
prompt-toolkit            3.0.43
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   15.0.0
pycparser                 2.21
pygal                     3.0.4
Pygments                  2.17.2
pyhumps                   3.8.0
pyparsing                 3.1.1
pytest                    7.4.4
pytest-cases              3.8.2
pytest-cov                4.1.0
pytest-snapshot           0.9.0
pytest-timeout            2.2.0
pytest-watch              4.2.0
python-dateutil           2.8.2
pytz                      2024.1
PyYAML                    6.0.1
pyzmq                     25.1.2
randomname                0.1.5
referencing               0.33.0
rpds-py                   0.18.0
ruff                      0.2.2
scikit-learn              1.4.1.post1
scipy                     1.12.0
seaborn                   0.11.2
setuptools                69.1.1
six                       1.16.0
stack-data                0.6.3
termcolor                 2.4.0
threadpoolctl             3.3.0
timbuktu                  0.1.0       /home/shauncutts/src/factfiber.ai/learn/timbuktu
tinycss2                  1.2.1
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.1
tornado                   6.4
traitlets                 5.14.1
types-PyYAML              6.0.12.12
typing_extensions         4.10.0
tzdata                    2022.7
virtualenv                20.25.1
watchdog                  4.0.0
wcwidth                   0.2.13
webencodings              0.5.1
xarray                    2023.12.0
xarray-dataclasses        0.3.0
zarr                      2.17.0
zipp                      3.17.0

No errors reported.


__Warning log__
Warning: Conda not available.
 Error was [Errno 2] No such file or directory: 'conda'

--------------------------------------------------------------------------------
If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as
appropriate.

=============================================================
IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.
=============================================================

Update: I noticed that newer NVIDIA libraries were available, and updated Python 3.11.3 => 3.11.8. No change to results. I've updated the numba -s output and crash report to the latest.

@gmarkall (Member)

Many thanks for taking the time to capture what's needed for this bug report - I appreciate this has been a bit of a pain and difficult to nail down.

In my attempts so far I haven't seen a crash, but running under Valgrind does suggest some incorrect behaviour in NVVM:

==804545== Conditional jump or move depends on uninitialised value(s)
==804545==    at 0xDD572C52: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDD534220: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDD51B569: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDD51617E: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDD256B78: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDCEB5C17: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDC4DD51F: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDC4DD572: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDC4DC77E: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDC141087: ??? (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0xDC16833A: nvvmCompileProgram (in /home/gmarkall/miniforge3/envs/numbadev/nvvm/lib64/libnvvm.so.4.0.0)
==804545==    by 0x558C9A49: ffi_call_unix64 (in /home/gmarkall/miniforge3/envs/numbadev/lib/libffi.so.8.1.0)
==804545== 

Often this sort of thing can be triggered by invalid IR that gets past the nvvmVerifyProgram() checks.

The next step is to identify whether we're producing invalid IR, or triggering an NVVM bug.
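One way to check the first of those, where compilation completes (as it does in my setup), is to pull the IR and PTX out of the dispatcher for offline inspection. A minimal sketch, using a trivial stand-in kernel rather than the reproducer:

# Minimal sketch: dump the LLVM/NVVM IR and PTX for a compiled CUDA
# kernel so they can be checked outside the crashing process.
# The kernel here is a trivial stand-in, not the reproducer.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

x = cuda.to_device(np.zeros(16, dtype=np.float32))
add_one[1, 16](x)

# Both inspection methods return dicts keyed by the compiled signature;
# with a single specialization there is exactly one entry.
(sig, llvm_ir), = add_one.inspect_llvm().items()
with open("add_one.ll", "w") as f:
    f.write(llvm_ir)
(sig, ptx), = add_one.inspect_asm().items()
with open("add_one.ptx", "w") as f:
    f.write(ptx)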

@gmarkall gmarkall added CUDA CUDA related issue/PR bug - segfault Bugs that cause SIGSEGV, SIGABRT, SIGILL, SIGBUS bug - incorrect behavior Bugs: incorrect behavior labels Feb 28, 2024
@shaunc
Contributor Author

shaunc commented Feb 28, 2024

Thank you!

For me it crashes all the time. Also, the real data structures are somewhat more complex (this is probably 75% of them) - I deleted fields somewhat randomly until I got to the point where, if I deleted any more, it would not crash. (Also, I didn't try valgrind - with compute-sanitizer, when it crashed I wouldn't get any data.)

I'll be quite happy to see if any fix for what you see also fixes my problem. If not, I can also send a version with MOAR fields.


NB - as far as best practices go, I realize that most kernels simply aren't so large and complex. I'm taking advantage of cooperative groups to do "many different shapes of (parallel) operation" in sequence, as in the sketch below. As a baseline I would expect this to save a bit of time by avoiding extra communication with the CPU at the end of each kernel (even if the data stays on the device), and I hope that, longer term, when more cooperative-group primitives are available, I can save even more time (e.g. by splitting the groups). Besides, ahem, crashing - I would be curious to know if there are any other reasons to avoid this approach. It seems there aren't that many people using cooperative groups, despite Numba as a whole (rightfully) being quite popular.
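In case it helps to picture it, the pattern is roughly the following - a minimal, simplified sketch; the real kernel has many more phases and far larger argument structures:

# Minimal sketch of the "many phases in one kernel" pattern using
# cooperative groups: grid.sync() stands in for what would otherwise be
# a kernel boundary (and a round trip to the host to launch the next one).
import numpy as np
from numba import cuda

@cuda.jit
def multi_phase(inp, out):
    grid = cuda.cg.this_grid()
    i = cuda.grid(1)

    # Phase 1: elementwise transform, in place.
    if i < inp.size:
        inp[i] *= 2.0

    # All blocks must finish phase 1 before any block starts phase 2.
    grid.sync()

    # Phase 2: neighbour reads that are only safe after the grid-wide sync.
    if i < inp.size - 1:
        out[i] = inp[i] + inp[i + 1]

inp = cuda.to_device(np.arange(32, dtype=np.float32))
out = cuda.device_array_like(inp)
multi_phase[4, 8](inp, out)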

@gmarkall
Member

I just realised, looking at your traceback, that we're observing issues in different places - yours is in the linker and mine is in NVVM. I tried with a later version of NVVM and didn't see the corruption, but I also didn't get a crash. So I need to go back and attempt to reproduce your issue again.

@gmarkall
Member

Is there any chance you could run again with the environment variables

NUMBA_CUDA_LOG_API_ARGS=1
NUMBA_CUDA_LOG_LEVEL=DEBUG

set and provide the output please?

Note the output will go to stderr and will be around 1MB, so you may need to do whatever is necessary to capture that in pytest - I modified the reproducer so I could run it outside of pytest, by calling test_mem() at the bottom of the file and running it directly (roughly as in the sketch below), but I guess there may be a better way to do this within pytest.
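Concretely, what I did was roughly the following - a sketch, appended to the bottom of the reproducer so it can be run directly with the variables set and stderr redirected to a file:

# Appended to the bottom of tests/context/adapt/cluster/test_mem.py so
# it can be run outside pytest, e.g.:
#   NUMBA_CUDA_LOG_API_ARGS=1 NUMBA_CUDA_LOG_LEVEL=DEBUG \
#       python tests/context/adapt/cluster/test_mem.py 2> cuda_api.log
if __name__ == "__main__":
    test_mem()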

@shaunc
Contributor Author

shaunc commented Feb 29, 2024

Ok - less output than expected but that may be due to the crash itself.

logs
$ poetry run env NUMBA_CUDA_LOG_API_ARGS=1 NUMBA_CUDA_LOG_LEVEL=DEBUG py.test --last-failed --new-first -s --cuda --x-slow -vv tests/context/adapt/cluster/test_mem.py::test_mem -x
== CUDA [128]  INFO -- init
== CUDA [128] DEBUG -- call driver api: cuInit(0)
== CUDA [140] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [502]  INFO -- init
== CUDA [502] DEBUG -- call driver api: cuInit(0)
== CUDA [529] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f88a704cc70)>)
== CUDA [529] DEBUG -- call driver api: cuDeviceGetCount(<cparam 'P' (0x7f88a704cc70)>)
== CUDA [529] DEBUG -- call driver api: cuDeviceGet(<cparam 'P' (0x7f88a704cc70)>, 0)
== CUDA [529] DEBUG -- call driver api: cuDeviceGetAttribute(<cparam 'P' (0x7f88a704cd00)>, 75, 0)
== CUDA [529] DEBUG -- call driver api: cuDeviceGetAttribute(<cparam 'P' (0x7f88a704cd00)>, 76, 0)
== CUDA [529] DEBUG -- call driver api: cuDeviceGetName(<numba.cuda.cudadrv.driver.c_char_Array_128 object at 0x7f88a704ccb0>, 128, 0)
== CUDA [529] DEBUG -- call driver api: cuDeviceGetUuid_v2(<cparam 'P' (0x7f88a704cd90)>, 0)
== CUDA [529] DEBUG -- call driver api: cuDevicePrimaryCtxRetain(<cparam 'P' (0x7f88a704cd90)>, 0)
== CUDA [588] DEBUG -- call driver api: cuCtxPushCurrent_v2(c_void_p(94564165097248))
== CUDA [588] DEBUG -- call driver api: cuMemGetInfo_v2(<cparam 'P' (0x7f88a704cd00)>, <cparam 'P' (0x7f88a704cc70)>)
================================================= test session starts ==================================================
platform linux -- Python 3.11.8, pytest-7.4.4, pluggy-1.4.0 -- /home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/bin/python
cachedir: .pytest_cache
rootdir: /home/shauncutts/src/factfiber.ai/learn/timbuktu
configfile: pyproject.toml
plugins: timeout-2.2.0, snapshot-0.9.0, cov-4.1.0, cases-3.8.2
timeout: 2s
timeout method: signal
timeout func_only: True
collected 1 item                                                                                                       
run-last-failure: rerun previous 1 failure

tests/context/adapt/cluster/test_mem.py::test_mem == CUDA [788] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f8871955f00)>)
== CUDA [788] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f8871957220)>)
== CUDA [788] DEBUG -- call driver api: cuMemAlloc_v2(<cparam 'P' (0x7f8871957220)>, 16)
== CUDA [1575] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f45281c0)>)
== CUDA [1575] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f452a140)>)
== CUDA [1575] DEBUG -- call driver api: cuMemcpyHtoD_v2(140224185237504, 94564158032080, 16)
== CUDA [1575] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f45281c0)>)
== CUDA [1575] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f452a140)>)
== CUDA [1575] DEBUG -- call driver api: cuMemcpyDtoH_v2(94564158032080, 140224185237504, 16)
== CUDA [1584] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aec0)>)
== CUDA [1584] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f452a020)>)
== CUDA [1584] DEBUG -- call driver api: cuMemAlloc_v2(<cparam 'P' (0x7f87f452a020)>, 4304)
== CUDA [1584] DEBUG -- call driver api: cuMemcpyHtoD_v2(140224185238016, 94564199546432, 4304)
== CUDA [1595] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452ad10)>)
== CUDA [1595] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f452ada0)>)
== CUDA [1635] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f4528be0)>)
== CUDA [1635] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529f90)>)
== CUDA [1648] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1648] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1660] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1663] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1669] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1669] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1676] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1677] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1680] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1680] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1681] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f4529960)>)
== CUDA [1681] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529b10)>)
== CUDA [1681] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1684] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1698] DEBUG -- call runtime api: cudaRuntimeGetVersion
== CUDA [1699] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1699] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1699] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1699] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1699] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1700] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1701] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1702] DEBUG -- call driver api: cuCtxGetCurrent(<cparam 'P' (0x7f87f452aad0)>)
== CUDA [1703] DEBUG -- call driver api: cuCtxGetDevice(<cparam 'P' (0x7f87f4529c30)>)
== CUDA [1703] DEBUG -- call driver api: cuLinkCreate_v2(6, <numba.cuda.cudadrv.driver.c_int_Array_6 object at 0x7f87f45e39b0>, <numba.cuda.cudadrv.driver.c_void_p_Array_6 object at 0x7f87f45e2450>, <cparam 'P' (0x7f87f45e34f0)>)
== CUDA [1703] DEBUG -- call driver api: cuLinkAddData_v2(c_void_p(94564198354352), 1, c_char_p(94564204172288), 825114, c_char_p(140221892965344), 0, None, None)
Fatal Python error: Segmentation fault

Current thread 0x00007f892b0fa740 (most recent call first):
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 321 in verbose_cuda_api_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/cudadrv/driver.py", line 2824 in add_ptx
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/codegen.py", line 170 in get_cubin
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/codegen.py", line 195 in get_cufunc
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 197 in bind
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 926 in compile
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 681 in _compile_for_args
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 673 in call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/numba/cuda/dispatcher.py", line 539 in __call__
  File "/home/shauncutts/src/factfiber.ai/learn/timbuktu/tests/context/adapt/cluster/test_mem.py", line 638 in test_mem
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/python.py", line 1792 in runtest
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 261 in call_runtest_hook
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 350 in pytest_runtestloop
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 325 in _main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 271 in wrap_session
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/main.py", line 318 in pytest_cmdline_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 169 in main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 192 in console_main
  File "/home/shauncutts/.cache/pypoetry/virtualenvs/timbuktu-6r8-oA4b-py3.11/bin/py.test", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, yaml._yaml, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, numba.mviewbuf, numba.types.itertools, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cffi_backend, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering (total: 91)
Segmentation fault (core dumped)

@gmarkall
Member

gmarkall commented Mar 1, 2024

Many thanks! Could you now please run with:

NUMBA_DUMP_ASSEMBLY=1

set and provide the output please? (We are homing in on the problem, I apologise for the back-and-forth though)

@gmarkall
Member

gmarkall commented Mar 1, 2024

And you are correct, there's not so much output because of the crash; the majority of it in my setup came from a dump of the cubin, which your run never reaches the point of producing.

@shaunc
Contributor Author

shaunc commented Mar 1, 2024

Ok - you wanted a huge amount of text - this time you got it! :)

out.txt

@gmarkall
Member

gmarkall commented Mar 1, 2024

Thanks! Using the PTX you provided in that dump, I can get:

double free or corruption (!prev)
Aborted (core dumped)

when linking it with the driver, which is now a crash at the same point you saw it.

Similarly ptxas crashes with it:

$ ptxas -arch sm_89 mem_test_kernel.ptx 
malloc(): corrupted top size
Aborted (core dumped)
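As an aside, one way to get the PTX into a file without going through the driver linker at all (so the linker crash can't get in the way) is cuda.compile_ptx(). A sketch with a trivial stand-in kernel and signature rather than the reproducer's:

# Sketch: compile a kernel to PTX without invoking the driver linker,
# write it out, and hand it to ptxas. The kernel and signature here are
# trivial stand-ins for the reproducer's kernel.
import subprocess
from numba import cuda, float32

def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

ptx, resty = cuda.compile_ptx(add_one, (float32[:],), cc=(8, 9))
with open("kernel.ptx", "w") as f:
    f.write(ptx)

subprocess.run(["ptxas", "-arch", "sm_89", "kernel.ptx"])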

@shaunc
Contributor Author

shaunc commented Mar 1, 2024

Excellent - glad we are zeroing in. Odd that you get different PTX, though (I presume). Are you indeed seeing a difference in a diff?

@gmarkall
Member

gmarkall commented Mar 1, 2024

I haven't looked yet; my first hypothesis is that maybe our file paths differ, leading to longer variable names, or something like that. I will resume looking at this on Monday.

@shaunc
Contributor Author

shaunc commented Mar 5, 2024

Any luck?

file paths differ leading to longer variable names

That sounds worrying - if data structures rely on concatenated names (rather than mangling, say), then there is an implicit limit on their complexity, and one that varies from platform to platform and maybe even from one installation path to another. (?)
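For what it's worth, the type names Numba builds for nested namedtuples on the host side do spell out the whole structure - a toy sketch below; whether and how those names feed into the symbols in the generated PTX is exactly what I'd like to understand:

# Toy illustration (host side only) of how the Numba type names for
# nested namedtuples grow with the structure.
from collections import namedtuple
import numpy as np
from numba import typeof

Inner = namedtuple("Inner", ["a", "b"])
Outer = namedtuple("Outer", ["x", "y"])

val = Outer(
    x=Inner(a=np.zeros(3, dtype=np.float32), b=np.zeros(3, dtype=np.int32)),
    y=Inner(a=np.zeros(3, dtype=np.float32), b=np.zeros(3, dtype=np.int32)),
)
print(typeof(val))            # the type name spells out every nested member
print(len(str(typeof(val))))  # and grows with every field added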

@shaunc
Contributor Author

shaunc commented Mar 21, 2024

👀 @gmarkall - any chance you'll have time to look into this? I assume you got absorbed in the release (which incorporated a few other things I was waiting for - thank you!).

I've split up my kernel and data structures into pieces, but it's an ad hoc, messy solution. I'm eager to learn at least what might be going on here and what the prospects of fixing it would be.
