support for nvidia-docker GPU container sandboxing #14

Open
thomasjungblut opened this issue May 2, 2018 · 9 comments
Labels
area: container runtime (Issue related to docker, kubernetes, OCI runtime)
area: integration (Issue related to third party integrations)
priority: p2 (Normal priority)
type: enhancement (New feature or request)

Comments

@thomasjungblut

In order to expose GPUs in K8s, you'll have to install nvidia-docker as an additional container runtime. A lot of people would surely love to run sandboxed containers with GPU support, though.

Do you guys see an easy way to layer one over the other, maybe?

@resouer

resouer commented May 3, 2018

@thomasjungblut If you are using the latest Device Plugin + CRI-based Kubernetes GPU support (e.g. 1.10), nvidia-docker should not be a dependency. So docker + gVisor, or even cri-containerd + gVisor, would be the solution.

Though it seems the current gVisor sandbox does not work well with GPU devices (correct me if I'm wrong).
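
As a minimal sketch of the non-GPU baseline mentioned here (docker + gVisor), assuming a `runsc` binary is already installed on the host:

```bash
# Register runsc with Docker and reload the daemon.
sudo runsc install
sudo systemctl reload docker

# Ordinary (non-GPU) containers can then run sandboxed:
docker run --runtime=runsc hello-world
```

GPU access is the missing piece that the rest of this thread tracks.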

@hugelgupf
Collaborator

We don't expose access to GPUs at the moment. It's an open problem for us, too.

@flx42

flx42 commented May 9, 2018

We have an OCI prestart hook here: https://github.com/NVIDIA/nvidia-container-runtime/tree/master/hook
It leverages libnvidia-container. But of course that's probably not sufficient, depending on the actual capabilities/blocked syscalls at runtime.
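
For reference, a hedged sketch of inspecting what libnvidia-container would inject, assuming its CLI (`nvidia-container-cli`, which the hook drives) is installed on the host:

```bash
# Driver and device information libnvidia-container detects on the host.
nvidia-container-cli info

# Device nodes, binaries, and libraries it would expose to a container.
nvidia-container-cli list
```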

@kratan

kratan commented May 14, 2018

Supporting GPUs would be a really nice feature.
Greets

@JorgeCeja

Any updates? Thanks!

@fvoznika
Member

fvoznika commented Jun 5, 2018

Yes, support for GPUs would be really nice!

The work here involves exposing a passthrough device directly from the host, as there are no device drivers in gVisor. The challenge is how to make this access secure.

At the moment though, we have a few other things to work on, and GPU support is not in our short list (yet).

@fvoznika fvoznika added the type: enhancement New feature or request label Jan 11, 2019
@ianlewis ianlewis added the area: container runtime Issue related to docker, kubernetes, OCI runtime label Jan 21, 2019
@ianlewis ianlewis added the priority: p2 Normal priority label May 31, 2019
@maxlouthain-unity

Hi, I see this is tagged with a priority label now; is GPU support on the roadmap? Thanks!

amscanne pushed a commit to amscanne/gvisor that referenced this issue May 6, 2020
* Update containerd to 1.2.2

Signed-off-by: Lantao Liu <lantaol@google.com>

* Port containerd/containerd#2803.

Signed-off-by: Lantao Liu <lantaol@google.com>
@ianlewis ianlewis added the area: integration Issue related to third party integrations label Aug 14, 2020
@zvonkok

zvonkok commented Nov 18, 2021

/cc @zvonkok

@yu-alvin

Found this project recently and thought it might be helpful to reference: https://docs.vaccel.org/

copybara-service bot pushed a commit that referenced this issue May 4, 2023
Updates #14

PiperOrigin-RevId: 529411547
copybara-service bot pushed a commit that referenced this issue May 5, 2023
Updates #14

PiperOrigin-RevId: 529803365
copybara-service bot pushed a commit that referenced this issue May 9, 2023
Very few ioctls are initially implemented.

Updates #14

PiperOrigin-RevId: 529511917
copybara-service bot pushed a commit that referenced this issue May 23, 2023
Currently, version 525.60.13 of the open-source driver is required; each driver
version needs to be individually qualified since the kernel driver's ABI is
unstable.
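
Not part of the original commit message, but as a quick, hedged check that the host matches the qualified driver version before trying runsc with GPUs:

```bash
# Prints the installed NVIDIA driver version (e.g. 525.60.13).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```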

In conjunction with cl/529511919, on T4, A100, or L4 GPUs:

```
$ sudo docker run --gpus all --runtime=runsc nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

$ sudo docker run --gpus all --runtime=runsc -it nvcr.io/nvidia/pytorch:23.04-py3
...
root@ca01b7709883:/workspace# cd examples/upstream/word_language_model/ # see https://github.com/pytorch/examples/tree/main/word_language_model
root@ca01b7709883:/workspace/examples/upstream/word_language_model# python main.py --cuda --epochs 6 --model Transformer --lr 5
| epoch   1 |   200/ 2983 batches | lr 5.00 | ms/batch 10.52 | loss  7.60 | ppl  2003.10
| epoch   1 |   400/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  6.80 | ppl   895.15
| epoch   1 |   600/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  6.50 | ppl   664.17
| epoch   1 |   800/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  6.36 | ppl   576.66
| epoch   1 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.61 | loss  6.26 | ppl   522.67
| epoch   1 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  6.22 | ppl   504.51
| epoch   1 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  6.15 | ppl   466.58
| epoch   1 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  6.15 | ppl   470.48
| epoch   1 |  1800/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  6.03 | ppl   415.41
| epoch   1 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.72 | loss  6.02 | ppl   412.43
| epoch   1 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.93 | loss  5.93 | ppl   374.53
| epoch   1 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.80 | loss  5.93 | ppl   377.23
| epoch   1 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.74 | loss  5.93 | ppl   375.84
| epoch   1 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  5.84 | ppl   343.92
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 19.08s | valid loss  5.75 | valid ppl   313.70
-----------------------------------------------------------------------------------------
| epoch   2 |   200/ 2983 batches | lr 5.00 | ms/batch  5.61 | loss  5.80 | ppl   329.43
| epoch   2 |   400/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  5.77 | ppl   319.79
| epoch   2 |   600/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  5.62 | ppl   276.16
| epoch   2 |   800/ 2983 batches | lr 5.00 | ms/batch  5.72 | loss  5.63 | ppl   277.32
| epoch   2 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  5.60 | ppl   270.96
| epoch   2 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  5.61 | ppl   273.71
| epoch   2 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  5.62 | ppl   275.38
| epoch   2 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.70 | loss  5.66 | ppl   286.58
| epoch   2 |  1800/ 2983 batches | lr 5.00 | ms/batch  5.74 | loss  5.54 | ppl   255.62
| epoch   2 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  5.58 | ppl   264.36
| epoch   2 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  5.48 | ppl   240.27
| epoch   2 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  5.52 | ppl   248.69
| epoch   2 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  5.53 | ppl   251.46
| epoch   2 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.78 | loss  5.45 | ppl   233.75
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 18.00s | valid loss  5.53 | valid ppl   252.16
-----------------------------------------------------------------------------------------
| epoch   3 |   200/ 2983 batches | lr 5.00 | ms/batch  5.72 | loss  5.46 | ppl   235.25
| epoch   3 |   400/ 2983 batches | lr 5.00 | ms/batch  5.69 | loss  5.46 | ppl   234.59
| epoch   3 |   600/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  5.29 | ppl   197.90
| epoch   3 |   800/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  5.32 | ppl   204.71
| epoch   3 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  5.31 | ppl   201.70
| epoch   3 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.70 | loss  5.33 | ppl   205.88
| epoch   3 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.59 | loss  5.35 | ppl   211.48
| epoch   3 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  5.40 | ppl   220.79
| epoch   3 |  1800/ 2983 batches | lr 5.00 | ms/batch  6.03 | loss  5.29 | ppl   198.28
| epoch   3 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.63 | loss  5.33 | ppl   206.45
| epoch   3 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  5.23 | ppl   186.28
| epoch   3 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.77 | loss  5.27 | ppl   194.13
| epoch   3 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  5.29 | ppl   199.08
| epoch   3 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.75 | loss  5.22 | ppl   184.77
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 18.10s | valid loss  5.45 | valid ppl   232.50
-----------------------------------------------------------------------------------------
| epoch   4 |   200/ 2983 batches | lr 5.00 | ms/batch  5.71 | loss  5.24 | ppl   189.07
| epoch   4 |   400/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  5.25 | ppl   190.61
| epoch   4 |   600/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  5.07 | ppl   159.83
| epoch   4 |   800/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  5.13 | ppl   168.20
| epoch   4 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  5.12 | ppl   166.87
| epoch   4 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.61 | loss  5.13 | ppl   169.07
| epoch   4 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.60 | loss  5.17 | ppl   175.87
| epoch   4 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.70 | loss  5.22 | ppl   184.63
| epoch   4 |  1800/ 2983 batches | lr 5.00 | ms/batch  5.69 | loss  5.12 | ppl   166.77
| epoch   4 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  5.16 | ppl   173.80
| epoch   4 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.71 | loss  5.05 | ppl   155.82
| epoch   4 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.76 | loss  5.10 | ppl   163.49
| epoch   4 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.71 | loss  5.12 | ppl   167.32
| epoch   4 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  5.05 | ppl   155.76
-----------------------------------------------------------------------------------------
| end of epoch   4 | time: 18.03s | valid loss  5.42 | valid ppl   225.19
-----------------------------------------------------------------------------------------
| epoch   5 |   200/ 2983 batches | lr 5.00 | ms/batch  5.83 | loss  5.08 | ppl   160.77
| epoch   5 |   400/ 2983 batches | lr 5.00 | ms/batch  5.70 | loss  5.09 | ppl   163.02
| epoch   5 |   600/ 2983 batches | lr 5.00 | ms/batch  5.60 | loss  4.92 | ppl   137.13
| epoch   5 |   800/ 2983 batches | lr 5.00 | ms/batch  5.58 | loss  4.97 | ppl   143.72
| epoch   5 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  4.96 | ppl   142.78
| epoch   5 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.76 | loss  4.98 | ppl   146.04
| epoch   5 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  5.03 | ppl   153.23
| epoch   5 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  5.08 | ppl   160.29
| epoch   5 |  1800/ 2983 batches | lr 5.00 | ms/batch  5.67 | loss  4.98 | ppl   145.06
| epoch   5 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  5.02 | ppl   151.17
| epoch   5 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  4.90 | ppl   134.86
| epoch   5 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.61 | loss  4.96 | ppl   142.85
| epoch   5 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  4.98 | ppl   145.94
| epoch   5 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  4.92 | ppl   136.60
-----------------------------------------------------------------------------------------
| end of epoch   5 | time: 17.99s | valid loss  5.39 | valid ppl   218.33
-----------------------------------------------------------------------------------------
| epoch   6 |   200/ 2983 batches | lr 5.00 | ms/batch  5.60 | loss  4.95 | ppl   140.86
| epoch   6 |   400/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  4.97 | ppl   143.35
| epoch   6 |   600/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  4.79 | ppl   120.55
| epoch   6 |   800/ 2983 batches | lr 5.00 | ms/batch  5.65 | loss  4.85 | ppl   127.48
| epoch   6 |  1000/ 2983 batches | lr 5.00 | ms/batch  5.64 | loss  4.84 | ppl   126.87
| epoch   6 |  1200/ 2983 batches | lr 5.00 | ms/batch  5.60 | loss  4.86 | ppl   129.41
| epoch   6 |  1400/ 2983 batches | lr 5.00 | ms/batch  5.66 | loss  4.91 | ppl   135.84
| epoch   6 |  1600/ 2983 batches | lr 5.00 | ms/batch  5.82 | loss  4.96 | ppl   143.08
| epoch   6 |  1800/ 2983 batches | lr 5.00 | ms/batch  5.68 | loss  4.86 | ppl   129.64
| epoch   6 |  2000/ 2983 batches | lr 5.00 | ms/batch  5.57 | loss  4.91 | ppl   134.98
| epoch   6 |  2200/ 2983 batches | lr 5.00 | ms/batch  5.80 | loss  4.79 | ppl   120.01
| epoch   6 |  2400/ 2983 batches | lr 5.00 | ms/batch  5.89 | loss  4.84 | ppl   126.87
| epoch   6 |  2600/ 2983 batches | lr 5.00 | ms/batch  5.79 | loss  4.87 | ppl   130.53
| epoch   6 |  2800/ 2983 batches | lr 5.00 | ms/batch  5.62 | loss  4.81 | ppl   122.77
-----------------------------------------------------------------------------------------
| end of epoch   6 | time: 18.09s | valid loss  5.37 | valid ppl   214.45
-----------------------------------------------------------------------------------------
| End of training | test loss  5.28 | test ppl   195.78

root@ca01b7709883:/workspace/examples/upstream/word_language_model# python generate.py --cuda
| Generated 0/1000 words
| Generated 100/1000 words
| Generated 200/1000 words
| Generated 300/1000 words
| Generated 400/1000 words
| Generated 500/1000 words
| Generated 600/1000 words
| Generated 700/1000 words
| Generated 800/1000 words
| Generated 900/1000 words
```

Updates #14

PiperOrigin-RevId: 534515559
copybara-service bot pushed a commit that referenced this issue May 24, 2023
With this change, we can now run simple CUDA applications on H100 GPUs.

```
$ docker run --runtime=runsc --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Note that this was tested with driver version 525.60.13.

Updates #14

PiperOrigin-RevId: 534607359
copybara-service bot pushed a commit that referenced this issue May 24, 2023
The --nvproxy flag allows container GPU usage to be specified via device nodes
and mounts provided in the runtime spec, as when using Kubernetes with GKE's
Nvidia GPU device plugin
(https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu).

The --nvproxy-docker flag additionally allows container GPU usage to be
specified via the NVIDIA_VISIBLE_DEVICES container environment variable, as
when using `docker --gpus`. This does not require the Nvidia Container Toolkit
(or the Nvidia Container Runtime [Hook], which are part of the Toolkit), but
does require libnvidia-container, which is typically installed as a dependency
of the Nvidia Container Toolkit.
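
To make this concrete, a hedged usage sketch reusing the image from the earlier commit message: with `--nvproxy --nvproxy-docker` in the runsc runtime's `runtimeArgs`, `docker --gpus` works as usual, since it sets `NVIDIA_VISIBLE_DEVICES` for the container.

```bash
# Assumes the Docker "runsc" runtime entry carries --nvproxy --nvproxy-docker.
# `--gpus all` populates NVIDIA_VISIBLE_DEVICES, which runsc then consumes
# via libnvidia-container.
docker run --runtime=runsc --gpus all \
  nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
```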

Updates #14

PiperOrigin-RevId: 535002602
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing ioctl `UVM_UNMAP_EXTERNAL`, allowing certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) and fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gVisor**
```bash
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU-enabling gVisor options**

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
	    "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]

        }
    }
}
```
Reload configs with `sudo systemctl reload docker`.
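
As a quick sanity check (not part of the original steps), confirm that Docker picked up the `runsc` entry:

```bash
# The "Runtimes:" line in the output should list runsc alongside nvidia and runc.
sudo docker info | grep -i runtimes
```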

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1, got {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```
Build image with:

```
docker build -f Dockerfile .
```

Then run it with:
```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```
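
As an aside (not in the original repro), tagging the image avoids copying the image ID from the `docker build` output; the tag name here is arbitrary:

```bash
# "nccl-repro" is a hypothetical tag; the original instead runs the untagged
# image by ID and pins four specific GPU UUIDs with --gpus.
docker build -f Dockerfile -t nccl-repro .
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus all nccl-repro
```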

#### Failure (truncated)
```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix
gVisor debug logs show:

```
W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] nvproxy: unknown uvm ioctl 66 = 0x42
```
I've implemented that ioctl in this PR. This is the output after the fix:

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
copybara-service bot pushed a commit that referenced this issue Jul 8, 2024