
Inception_v3 multi-GPU error #10531

Open
LuckySJTU opened this issue May 24, 2024 · 0 comments
Labels: bug, community events from community

Comments


LuckySJTU commented May 24, 2024

Summary

When running evaluation with Inception_v3, launching multi-GPU parallelism via oneflow.distributed.launch causes a crash with signal 11 (SIGSEGV).

Code to reproduce bug

First, build an inception_v3 model:

import oneflow as flow
import flowvision

model = flowvision.models.inception_v3(pretrained=False)
model = flow.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False)

Then run evaluation:

import time

import oneflow as flow
import torch
import torchmetrics

import utils  # local helper module providing MetricLogger

@flow.no_grad()
def evaluate(model, data_loader, device, print_freq=4, eval_max_steps=-1, num_threads=1, model_name="Inception"):
    cpu_device = flow.device("cpu")
    flow.set_num_threads(num_threads)
    model.eval()
    metric_logger = utils.MetricLogger(delimiter="  ")

    num_classes, task, average = 1000, "multiclass", "macro"
    metric_collection = torchmetrics.MetricCollection({ 
            'Accuracy': torchmetrics.Accuracy(task=task, num_classes=num_classes, average=average).to('cpu'),
            'Precision': torchmetrics.Precision(task=task, num_classes=num_classes, average=average).to('cpu'), 
            'Recall': torchmetrics.Recall(task=task, num_classes=num_classes, average=average).to('cpu'),
            "AUROC": torchmetrics.AUROC(task=task, num_classes=num_classes, average=average).to('cpu'),
        }) 

    for (images, labels), i, global_step in metric_logger.log_every(data_loader, print_freq, 0, is_eval=True):
        images, labels = images.to(device), labels.to(device)
        model_time = time.time()
        preds = model(images)  # the SIGSEGV is raised here
        model_time = time.time() - model_time
        if model_name == "Inception":
            preds = preds[0]
        preds = preds.softmax(dim=1).cpu()
        evaluator_time = time.time()
        # only for oneflow: convert OneFlow tensors to PyTorch tensors for torchmetrics
        preds = torch.from_numpy(preds.numpy())
        labels = torch.from_numpy(labels.numpy())

        batch_metrics = metric_collection.forward(preds, labels)
        evaluator_time = time.time() - evaluator_time

        metric_logger.update(model_time=model_time, evaluator_time=evaluator_time)
        if 0 < eval_max_steps <= i:
            break

    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    val_metrics = metric_collection.compute()
    eval_res = {
        "Accuracy": val_metrics["Accuracy"].item(),
        "Precision": val_metrics["Precision"].item(),
        "Recall": val_metrics["Recall"].item(),
        "AUROC": val_metrics["AUROC"].item(),
    }
    print("Averaged stats:", metric_logger)

    model.train()
    return eval_res

Finally, run the test with python -m oneflow.distributed.launch --nproc_per_node 2 --master_port 12345 eval.py. At preds = model(images) it fails with subprocess.CalledProcessError: Command [xxxxx] died with <Signals.SIGSEGV: 11>.
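For context on the error message itself: the traceback comes from the launcher's use of the standard subprocess module, which reports a child killed by a signal as a negative return code. A minimal stdlib-only sketch (unrelated to OneFlow internals, just illustrating how a segfaulting worker surfaces as CalledProcessError):

```python
import signal
import subprocess
import sys

# Child process deliberately raises SIGSEGV against itself,
# standing in for a crashing worker process.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

try:
    subprocess.run([sys.executable, "-c", child_code], check=True)
except subprocess.CalledProcessError as e:
    # On Linux, a signal-terminated child is reported as -signum,
    # so -11 here corresponds to <Signals.SIGSEGV: 11> in the launcher's message.
    print(e.returncode)
    print(signal.Signals(-e.returncode).name)
```

So `died with <Signals.SIGSEGV: 11>` only tells us a worker segfaulted; the actual fault is inside the worker (here, inside the forward pass), not in the launcher.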

System Information

  • What is your OneFlow installation (pip, source, dockerhub): pip
  • OS: Ubuntu 22.04.4 LTS
  • OneFlow version (run python3 -m oneflow --doctor): 0.9.0 (git_commit: 381b12c)
  • Python version: 3.8.19
  • CUDA driver version: 11.8
  • GPU models: RTX2080Ti
  • Other info: flowvision == 0.2.1; the error occurs only with multiple GPUs, single-GPU runs work fine