
Native API returns: 39 #854

@JialeSeeWorld

RuntimeError: Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
The resource monitor showed no significant GPU memory usage when this happened. Could something other than memory exhaustion be causing the error? Here is my code:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import numpy as np
# ---------------------------
# Transformer model
# ---------------------------
class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, seq_len, d_model=64, nhead=4, num_layers=2, dim_feedforward=128, dropout=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.input_dim = input_dim

        # Project the input to d_model
        self.input_projection = nn.Linear(input_dim, d_model)

        # Positional Encoding
        self.pos_embedding = nn.Parameter(torch.randn(seq_len, d_model))

        # Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True  # inputs are shaped (batch, seq_len, d_model)
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Project the output to the prediction dimension
        self.output_layer = nn.Linear(d_model, input_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        x = self.input_projection(x)  # (batch, seq_len, d_model)
        x = x + self.pos_embedding.unsqueeze(0)  # add the positional encoding
        x = self.transformer_encoder(x)  # (batch, seq_len, d_model)
        x = x[:, -1, :]  # take the last time step of the sequence
        out = self.output_layer(x)  # (batch, input_dim)
        return out


# NOTE: `config` is defined elsewhere and is not shown in this issue
device = torch.device(config["device"])
print(device)

# ---------------------------
# Model, loss, optimizer
# ---------------------------
model = TimeSeriesTransformer(input_dim=18, seq_len=50).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

The error is thrown as soon as the model is initialized; no training step has run yet.
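To put a number on how little memory this model should need, here is a rough back-of-the-envelope sketch. It is plain Python (no device required) and assumes float32 weights and the default parameter layout of `nn.TransformerEncoderLayer` (a fused Q/K/V projection, an output projection, two feed-forward linears, two LayerNorms). The helper name `transformer_param_count` is hypothetical and not part of the code above.

```python
def transformer_param_count(input_dim, seq_len, d_model=64, num_layers=2, dim_feedforward=128):
    """Hypothetical helper: hand-count the learnable parameters of TimeSeriesTransformer."""

    def linear(n_in, n_out):
        # nn.Linear stores a weight matrix plus a bias vector
        return n_in * n_out + n_out

    input_projection = linear(input_dim, d_model)
    pos_embedding = seq_len * d_model

    # One encoder layer, assuming the default nn.TransformerEncoderLayer layout.
    # nhead does not change the count: the heads split d_model between them.
    attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)
    feedforward = linear(d_model, dim_feedforward) + linear(dim_feedforward, d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with weight and bias
    encoder = num_layers * (attention + feedforward + layer_norms)

    output_layer = linear(d_model, input_dim)
    return input_projection + pos_embedding + encoder + output_layer


n_params = transformer_param_count(input_dim=18, seq_len=50)
weights_bytes = n_params * 4  # assumption: 4 bytes per float32 parameter
print(f"{n_params} parameters, about {weights_bytes / 1024:.0f} KiB of weights")
```

If this estimate matches `sum(p.numel() for p in model.parameters())`, the weights occupy well under 1 MiB, which would be consistent with the low usage seen in the resource monitor and would suggest the out-of-memory code is not coming from the model's size.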
