Sometimes running on device cuda and sometimes running on device cpu automatically, why? #1153

Closed
icejean opened this issue May 7, 2024 · 1 comment

icejean commented May 7, 2024

Hi, I'm new to torch and to torch for R, and I'm just trying to set up the environment and run the classic MNIST example.
Something strange happens: the 1st network below runs on cuda automatically, while the 2nd runs on cpu.
But if I run the 1st network first and then switch to the 2nd network, it runs on cuda too.
Any idea why?

# 1. Set up running environment.
#    WSL2 Ubuntu22 behind the GFW
Sys.setenv(http_proxy = "http://127.0.0.1:7890")
Sys.setenv(https_proxy = "http://127.0.0.1:7890")
# 2. Point to CUDA 11.8 + cuDNN 8.9.2, as supported by R torch.
Sys.setenv(CUDA_HOME = "/usr/local/cuda-11")
Sys.setenv(LD_LIBRARY_PATH = "/usr/local/lib:/usr/local/cuda-11/lib64")
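# R does not expand "$PATH" inside a string, so prepend to the current value explicitly.
Sys.setenv(PATH = paste0("/usr/local/cuda-11/bin:", Sys.getenv("PATH")))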
# 3. Have a check.
Sys.getenv("http_proxy")
Sys.getenv("https_proxy")
Sys.getenv("CUDA_HOME")
Sys.getenv("LD_LIBRARY_PATH")
Sys.getenv("PATH")
# 4. Path to MNIST dataset.
getwd()
dir <- "./dataset/mnist"

# 5. Load the libraries.
# install.packages("torch")
# install.packages("torchvision")
# install.packages("luz")
library(torch)
library(torchvision)
library(luz)
library(reshape2)
library(ggplot2)

# 6. Check whether CUDA is available.
cuda_available <- torch::cuda_is_available()
device <- if (cuda_available) torch_device("cuda:0") else torch_device("cpu")


# 6. Load MNIST dataset.
train_ds <- mnist_dataset(
  dir,
  download = TRUE,
  transform = transform_to_tensor
)

test_ds <- mnist_dataset(
  dir,
  train = FALSE,
  transform = transform_to_tensor
)

train_dl <- dataloader(train_ds, batch_size = 128, shuffle = TRUE)
test_dl <- dataloader(test_ds, batch_size = 128)

# 7. Check the first image.
image <- train_ds$data[1,1:28,1:28]
image_df <- melt(image)
ggplot(image_df, aes(x=Var2, y=Var1, fill=value))+
  geom_tile(show.legend = FALSE) + 
  xlab("") + ylab("") +
  scale_fill_gradient(low="white", high="black")

# 8. Define a network.
net <- nn_module(
  "Net",
  ## The 1st network will be loaded to cuda device automatically.
  # initialize = function() {
  #   self$conv1 <- nn_conv2d(1, 32, 3, 1)
  #   self$conv2 <- nn_conv2d(32, 64, 3, 1)
  #   self$dropout1 <- nn_dropout2d(0.25)
  #   self$dropout2 <- nn_dropout2d(0.5)
  #   self$fc1 <- nn_linear(9216, 128)
  #   self$fc2 <- nn_linear(128, 10)
  # },
  # 
  # forward = function(x) {
  #   x %>%                                  # N * 1 * 28 * 28
  #     self$conv1() %>%                     # N * 32 * 26 * 26
  #     nnf_relu() %>%
  #     self$conv2() %>%                     # N * 64 * 24 * 24
  #     nnf_relu() %>%
  #     nnf_max_pool2d(2) %>%                # N * 64 * 12 * 12
  #     self$dropout1() %>%
  #     torch_flatten(start_dim = 2) %>%     # N * 9216
  #     self$fc1() %>%                       # N * 128
  #     nnf_relu() %>%
  #     self$dropout2() %>%
  #     self$fc2()                           # N * 10
  # }
  ## Epoch 10/10
  ## Train metrics: Loss: 0.0375 - Acc: 0.988                                                                    
  ## Valid metrics: Loss: 0.0272 - Acc: 0.9912
  
  ## The 2nd network will be loaded to the cpu device automatically.
  ## If I run the 1st network first, then switch to the 2nd, it will run on the cuda device.
  ## Why?
  
  initialize = function() {
    self$layer1 <- nn_linear(in_features = 784, out_features = 512)
    self$layer2 <- nn_linear(in_features = 512, out_features = 10)
  },
  
  forward = function(x) {
    x %>%
      torch_flatten(start_dim = 2) %>% # start_dim = 2
      self$layer1() %>%
      nnf_relu() %>%
      self$layer2() %>%
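      # Note: nn_cross_entropy_loss() already applies log-softmax to its input,
      # so this extra softmax is probably why the loss below stalls near 1.48.
      nnf_softmax(dim = 2)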
  }
  ## Epoch 10/10
  ## Train metrics: Loss: 1.476 - Acc: 0.9871                                                                     
  ## Valid metrics: Loss: 1.4838 - Acc: 0.9782  
)


# 9. Train
fitted <- net %>%
  setup(
    loss = nn_cross_entropy_loss(),
    optimizer = optim_adam,
    metrics = list(
      luz_metric_accuracy()
    )
  ) %>%
  fit(train_dl, epochs = 10, valid_data = test_dl)

# 10. Predict.
preds <- predict(fitted, test_dl)
preds$shape

icejean commented May 8, 2024

Well, it is my fault. Actually everything is O.K.: both models are running on the cuda device.
The cause is that I'm running R torch on WSL2 Ubuntu 22 and watching the GPU load with the Windows Task Manager, which led to the misunderstanding.
The CUDA toolkit on WSL2 Ubuntu is a special version that doesn't include the Ubuntu Nvidia driver. Instead, Windows exports the Windows Nvidia driver to Ubuntu at /usr/lib/wsl/lib/, and the Windows Task Manager doesn't seem to reflect all loads on the Nvidia GPU, especially those coming from WSL2 Ubuntu. The main load Task Manager shows for the Nvidia GPU is Copy, which doesn't reflect what is really happening.
When I run nvidia-smi -l 1 on WSL2 Ubuntu, I can see the actual load: the GPU-Util and Compute M. columns show that it's really running on the cuda device. nvidia-smi also reports 'No running processes found', but that isn't true either; maybe it only looks for processes running on Windows and not those inside WSL2 Ubuntu.
The following is the nvidia-smi output and Windows Task Manager snapshot for the 1st model:
R-13
The following is the snapshot for the 2nd model:
R-14
After all, the GPU fan is running fast, which means that the GPU is working hard.
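
For anyone hitting the same confusion, the device can also be checked from inside R rather than from a GPU monitor. The following is only a minimal sketch, not part of the original script: the small `nn_linear` model is a hypothetical stand-in for the networks above, and the `nvidia-smi` query at the end assumes the tool is on the PATH inside WSL2.

```r
library(torch)

# Pick the device the same way the script above does.
device <- if (cuda_is_available()) torch_device("cuda:0") else torch_device("cpu")

# Hypothetical stand-in model, used only for this check.
model <- nn_linear(784, 10)
model$to(device = device)

# Each parameter reports the device it lives on ("cuda" or "cpu").
param_devices <- vapply(model$parameters, function(p) p$device$type, character(1))
print(param_devices)

# Run one forward pass and see where the output tensor ends up.
x <- torch_randn(4, 784, device = device)
out <- model(x)
print(out$device)

# Ask the driver directly, if nvidia-smi is reachable from this R session.
if (nzchar(Sys.which("nvidia-smi"))) {
  gpu_stats <- system2(
    "nvidia-smi",
    c("--query-gpu=utilization.gpu,memory.used", "--format=csv"),
    stdout = TRUE
  )
  print(gpu_stats)
}
```

If the parameters and the output both report `cuda`, the work is happening on the GPU no matter what the Windows Task Manager graphs show.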

@icejean icejean closed this as completed May 8, 2024