Partially address Issue #10413 by adding NV0000_CTRL_CMD_OS_UNIX_GET_EXPORT_OBJECT_INFO, NV0000_CTRL_CMD_OS_UNIX_IMPORT_OBJECT_FROM_FD, NV0041_CTRL_CMD_GET_SURFACE_INFO #10434

Merged 1 commit on May 21, 2024

thundergolfer
Contributor

Following up on #10413 (comment).

Ayush's fix revealed more missing commands, which this PR adds. Even with these changes, the reproduction in #10413 still does not work. Here's an updated reproduction Dockerfile: with the SIGCHLD handler installed the program crashes, and without it the program hangs.

FROM python:3.11-slim-bookworm

RUN apt-get update && apt-get install --yes python3 python3-distutils clang wget vim
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3 get-pip.py
RUN python3 -m pip install clang~=10.0.1 # must match version of `clang` installed above.
RUN python3 -m pip install --ignore-installed torch torchvision lightning numpy memory_profiler

COPY <<EOF repro.py
print("Hello from inside container.")
import psutil
current_process = psutil.Process()
parent_process = current_process.parent()
print(f"Processes: {current_process=} {parent_process=}")

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L

from memory_profiler import profile

from torchvision.datasets import CIFAR100
from torchvision import transforms
from torchvision import models
from torch.utils.data import DataLoader

import os
import signal
import pathlib

def handler(signum, frame):
    print('Signal handler called with signal', signum)
    os.waitpid(-1, 0)
    raise KeyboardInterrupt()

# gVisor is ignoring the SIGCHLD: 'Discarding ignored signal 17'
signal.signal(signal.SIGCHLD, handler)

class MagixNet(L.LightningModule):
    def __init__(self, nbr_cat):
        super().__init__()

        module = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        module.fc = nn.Linear(2048, nbr_cat)

        self.module = module

    def forward(self, x):
        return self.module(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

def prepare_data():
    pipeline = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    train_ds = CIFAR100('data', train=True, download=True, transform=pipeline)
    train_dl = DataLoader(train_ds, batch_size=128, num_workers=4)

    val_ds = CIFAR100('data', train=False, download=True, transform=pipeline)
    val_dl = DataLoader(val_ds, batch_size=128, num_workers=4)

    return train_dl, val_dl

if __name__ == "__main__":
    torch.set_float32_matmul_precision('medium')
    train_dl, val_dl = prepare_data()
    model = MagixNet(100)
    trainer = L.Trainer(max_epochs=1, strategy="ddp_notebook")

    start = time.time()
    trainer.fit(model, train_dl, val_dl)
    print(f"Training duration (seconds): {time.time() - start:.2f}")
    nccl_debug_file = pathlib.Path("/tmp/runsc-nccl.txt")
    if nccl_debug_file.exists():
        print("NCCL Debugging")
        print(nccl_debug_file.read_text())
EOF

ENTRYPOINT ["python3", "repro.py"]

Run like this:

sudo docker run --runtime=runsc-2 --shm-size=1000GB --gpus '"device=GPU-48070a35-b2ea-643c-eebe-0c55d2a541a4,GPU-8061048a-aa0f-76bd-457b-71c6be60386e"' -e NCCL_DEBUG=INFO -e NCCL_DEBUG_FILE="/tmp/runsc-nccl.txt" sha256:1c1fc535214ec1111b46a87fe20558e7c078185e4158c3ce253dc56a5a9be628

/etc/docker/daemon.json

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc-2": {
            "path": "/home/modal/runsc2",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc-2/", "-debug", "-strace"]
        }
    }
}
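
To separate the signal problem from PyTorch and NCCL, a much smaller check may help. The sketch below is not part of this PR: it forks one child and waits for a SIGCHLD handler to reap it. Unlike the handler in `repro.py`, it calls `os.waitpid` with `os.WNOHANG` and does not raise `KeyboardInterrupt`. The function name `sigchld_delivered` and the exit code 7 are made up for illustration.

```python
import os
import signal
import time


def sigchld_delivered(timeout=5.0):
    """Fork one child and return the (pid, status) pairs reaped from a SIGCHLD handler."""
    reaped = []

    def on_sigchld(signum, frame):
        # Reap every child that has already exited, without blocking.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children left at all
            if pid == 0:
                break  # children exist, but none has exited yet
            reaped.append((pid, status))

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        child = os.fork()
        if child == 0:
            os._exit(7)  # hypothetical exit code, chosen so it is easy to recognise
        deadline = time.monotonic() + timeout
        while not reaped and time.monotonic() < deadline:
            time.sleep(0.05)
    finally:
        # Restore whatever handler was installed before the check.
        signal.signal(signal.SIGCHLD, previous if previous is not None else signal.SIG_DFL)
    return reaped


if __name__ == "__main__":
    print(f"Reaped via SIGCHLD: {sigchld_delivered()}")
```

An empty list after the timeout would suggest the signal never reached the handler; running the same script under `runc` and under `runsc` gives a direct comparison.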

@thundergolfer
Contributor Author

Also worth noting: this implementation targets driver version 535. The latest driver uses different parameters for NV0000_CTRL_CMD_OS_UNIX_GET_EXPORT_OBJECT_INFO.
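
Because the parameter layout depends on the driver branch, it can be worth checking the host driver before running the reproduction. The following is only a sketch of such a check, not how nvproxy itself selects an ABI. `TARGET_DRIVER_MAJOR`, `driver_major` and `matches_target` are hypothetical names, and the version string is assumed to come from something like `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
import re

# Assumption: the driver branch this PR's parameter struct was written against.
TARGET_DRIVER_MAJOR = 535


def driver_major(version: str) -> int:
    """Return the major component of a driver version such as '535.129.03'."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognised driver version: {version!r}")
    return int(match.group(1))


def matches_target(version: str) -> bool:
    """True when the host driver is on the branch this implementation targets."""
    return driver_major(version) == TARGET_DRIVER_MAJOR


if __name__ == "__main__":
    host_version = "535.129.03"  # the driver version reported in this thread
    print(host_version, "matches target branch:", matches_target(host_version))
```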

@thundergolfer
Contributor Author

thundergolfer commented May 12, 2024

Debug logs of the program above (~150MiB): https://modal-public-assets.s3.amazonaws.com/runsc.log.20240512-202107.171204.boot.txt.zip

  • uname -a: Linux gcp-a100-80gb-spot-europe-west4-a-0-b819afa2-755d-47d0-b84d-667 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
  • instance type: a2-ultragpu-8g
  • runsc version: runsc version release-20240506.0-43-g8d9c53ec6be8-dirty
    • Was running a binary based off this PR.
  • NVIDIA A100-SXM4-80GB
  • Driver Version: 535.129.03
  • CUDA Version: 12.2
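
At ~150MiB the debug logs are awkward to read by hand. A rough way to confirm how often the `Discarding ignored signal 17` message quoted in `repro.py` shows up is sketched below (signal 17 is SIGCHLD on x86-64 Linux). The helper names are hypothetical, and the sketch assumes the unzipped logs are `.txt` files in one directory such as the `/tmp/runsc-2/` path passed to `-debug-log`.

```python
import re
from collections import Counter
from pathlib import Path

# The message text is taken from the comment in repro.py.
DISCARDED = re.compile(r"Discarding ignored signal (\d+)")


def count_discarded_signals(lines):
    """Count 'Discarding ignored signal N' messages, keyed by signal number."""
    counts = Counter()
    for line in lines:
        match = DISCARDED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def scan_log_dir(log_dir):
    """Aggregate counts over every .txt file in log_dir (file extension is an assumption)."""
    totals = Counter()
    for path in sorted(Path(log_dir).glob("*.txt")):
        # Stream each file line by line so large logs are not loaded into memory at once.
        with path.open(errors="replace") as handle:
            totals.update(count_discarded_signals(handle))
    return totals


if __name__ == "__main__":
    print(scan_log_dir("/tmp/runsc-2/"))
```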

copybara-service bot pushed a commit that referenced this pull request May 21, 2024
…EXPORT_OBJECT_INFO, NV0000_CTRL_CMD_OS_UNIX_IMPORT_OBJECT_FROM_FD, NV0041_CTRL_CMD_GET_SURFACE_INFO
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10434 from thundergolfer:master 76bf495
PiperOrigin-RevId: 635812044
copybara-service bot pushed a commit that referenced this pull request May 21, 2024
…EXPORT_OBJECT_INFO, NV0000_CTRL_CMD_OS_UNIX_IMPORT_OBJECT_FROM_FD, NV0041_CTRL_CMD_GET_SURFACE_INFO
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10434 from thundergolfer:master bf18079
PiperOrigin-RevId: 635812044
copybara-service bot merged commit 9911927 into google:master on May 21, 2024
5 checks passed