
Inception_v3 multi-GPU error #10531

Open
LuckySJTU opened this issue May 24, 2024 · 0 comments
Labels: bug, community events from community

Comments


LuckySJTU commented May 24, 2024

Summary

When running evaluation with Inception_v3, launching multi-GPU parallelism via oneflow.distributed.launch causes a crash with signal 11 (SIGSEGV).

Code to reproduce bug

First, build an inception_v3 model:

import oneflow as flow
import flowvision

model = flowvision.models.inception_v3(pretrained=False)
model = flow.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False)

Then run evaluation:

import time

import oneflow as flow
import torch
import torchmetrics

import utils  # local helper module providing MetricLogger

@flow.no_grad()
def evaluate(model, data_loader, device, print_freq=4, eval_max_steps=-1, num_threads=1, model_name="Inception"):
    cpu_device = flow.device("cpu")
    flow.set_num_threads(num_threads)
    model.eval()
    metric_logger = utils.MetricLogger(delimiter="  ")

    num_classes, task, average = 1000, "multiclass", "macro"
    metric_collection = torchmetrics.MetricCollection({ 
            'Accuracy': torchmetrics.Accuracy(task=task, num_classes=num_classes, average=average).to('cpu'),
            'Precision': torchmetrics.Precision(task=task, num_classes=num_classes, average=average).to('cpu'), 
            'Recall': torchmetrics.Recall(task=task, num_classes=num_classes, average=average).to('cpu'),
            "AUROC": torchmetrics.AUROC(task=task, num_classes=num_classes, average=average).to('cpu'),
        }) 

    for (images, labels), i, global_step in metric_logger.log_every(data_loader, print_freq, 0, is_eval=True):
        images, labels = images.to(device), labels.to(device)
        model_time = time.time()
        preds = model(images)  # the SIGSEGV is raised here
        model_time = time.time() - model_time
        if model_name == "Inception":
            preds = preds[0]
        preds = preds.softmax(dim=1).cpu()
        evaluator_time = time.time()
        # only for oneflow: convert OneFlow tensors to PyTorch tensors for torchmetrics
        preds = torch.from_numpy(preds.numpy())
        labels = torch.from_numpy(labels.numpy())

        batch_metrics = metric_collection.forward(preds, labels)
        evaluator_time = time.time() - evaluator_time

        metric_logger.update(model_time=model_time, evaluator_time=evaluator_time)
        if 0 < eval_max_steps <= i:
            break

    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    val_metrics = metric_collection.compute()
    eval_res = {
        "Accuracy": val_metrics["Accuracy"].item(),
        "Precision": val_metrics["Precision"].item(),
        "Recall": val_metrics["Recall"].item(),
        "AUROC": val_metrics["AUROC"].item(),
    }
    print("Averaged stats:", metric_logger)

    model.train()
    return eval_res

Finally, run the test with python -m oneflow.distributed.launch --nproc_per_node 2 --master_port 12345 eval.py. At preds = model(images) it fails with subprocess.CalledProcessError: Command [xxxxx] died with <Signals.SIGSEGV: 11>.
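For context on the error message itself: the traceback comes from the launcher's use of the standard subprocess module, which reports a child killed by a signal as a negative return code. A minimal stdlib-only sketch (unrelated to OneFlow internals, just illustrating how a segfaulting worker surfaces as CalledProcessError):

```python
import signal
import subprocess
import sys

# Child process deliberately raises SIGSEGV against itself,
# standing in for a crashing worker process.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

try:
    subprocess.run([sys.executable, "-c", child_code], check=True)
except subprocess.CalledProcessError as e:
    # On Linux, a signal-terminated child is reported as -signum,
    # so -11 here corresponds to <Signals.SIGSEGV: 11> in the launcher's message.
    print(e.returncode)
    print(signal.Signals(-e.returncode).name)
```

So `died with <Signals.SIGSEGV: 11>` only tells us a worker segfaulted; the actual fault is inside the worker (here, inside the forward pass), not in the launcher.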

System Information

  • What is your OneFlow installation (pip, source, dockerhub): pip
  • OS: Ubuntu 22.04.4 LTS
  • OneFlow version (run python3 -m oneflow --doctor): 0.9.0 (git_commit: 381b12c)
  • Python version: 3.8.19
  • CUDA driver version: 11.8
  • GPU models: RTX2080Ti
  • Other info: flowvision == 0.2.1; the error occurs only with multiple GPUs, single-GPU runs work fine