The code runs correctly, but the GPU out-of-memory problem is still unsolved #5

Closed
ZorrowHu opened this issue Nov 10, 2020 · 12 comments

Comments

@ZorrowHu

Thanks for sharing your code! I had long been stuck on GPU out-of-memory errors that kept me from running experiments on large datasets. After learning about the DataParallel approach I found your Zhihu post, but in my own implementation the out-of-memory problem is still not solved. The program itself should be correct, since it produces the right experimental results on a small dataset.
I integrated your code into my own project; it looks like the example in your documentation:

os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1'      # two GPUs, numbered 0 and 1
batch_size = 100
gpu0_bsz = 3500
acc_grad = 1
model = BalancedDataParallel(gpu0_bsz // acc_grad, SessionGraph(opt, n_node), dim=0)   # SessionGraph is the model I actually use
model = model.cuda()

I set batch_size and gpu0_bsz this way because, on a single GPU with batchSize = 100 and 7,500 samples, training used 9.4/12 GB of memory and completed fine; but the out-of-memory error appeared right afterwards, when the model was used to predict on the test data.
While running with BalancedDataParallel I watched the memory usage of both GPUs and found that GPU 1 was hardly being used at all:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81       Driver Version: 456.81       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp           WDDM  | 00000000:AF:00.0  On |                  N/A |
| 51%   81C    P2   103W / 250W |   9294MiB / 12288MiB |     44%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:D8:00.0 Off |                  N/A |
| 26%   48C    P2    61W / 250W |   1612MiB / 12288MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+


The model SessionGraph I use looks roughly like this:

class SessionGraph(Module):
    def __init__(self, opt, n_node):
        ...
        self.loss_function = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=opt.lr, weight_decay=opt.l2)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=opt.lr_dc_step, gamma=opt.lr_dc)

Could my mistake be somewhere else, for example in the training or prediction loop?
The output printed while the program runs is basically always the same, as shown below:

...
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
...
@Link-Li
Owner

Link-Li commented Nov 10, 2020

I don't quite understand why you set gpu0_bsz = 3500. The gpu0_bsz parameter is the number of samples to place on GPU 0, so it should normally be smaller than the batch size.
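
For reference, a minimal sketch of a configuration consistent with that advice (SessionGraph, opt, n_node and BalancedDataParallel are the names from the original post; the concrete numbers are only illustrative):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

batch_size = 100   # total samples per step, split across both GPUs
gpu0_bsz = 40      # samples kept on GPU 0; must be smaller than batch_size
acc_grad = 1       # gradient accumulation steps

# GPU 0 receives gpu0_bsz samples per step, GPU 1 the remaining 60
model = BalancedDataParallel(gpu0_bsz // acc_grad,
                             SessionGraph(opt, n_node),
                             dim=0).cuda()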

@ZorrowHu
Author

I have now set:

batch_size = 100
gpu0_bsz = 50

But the GPU usage is still the same... roughly GPU 0: 9.4/12 GB, GPU 1: 1.5/12 GB.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81       Driver Version: 456.81       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp           WDDM  | 00000000:AF:00.0  On |                  N/A |
| 50%   81C    P2   253W / 250W |   9340MiB / 12288MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:D8:00.0 Off |                  N/A |
| 26%   49C    P2    61W / 250W |   1649MiB / 12288MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

At the end of the run, the code fails with this error:

len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
        Loss:   55605.012
start predicting:  2020-11-10 15:07:37.934568
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
Traceback (most recent call last):
  File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 104, in <module>    
    main()
  File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 80, in main
    hit, mrr = train_test(model, train_data, test_data)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 278, in train_test 
    targets, scores = forward(model, i, test_data)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 238, in forward    
    return targets, model.module.compute_scores(seq_hidden, mask)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 108, in compute_scores
    scores = torch.sum(a * b, -1)  # b,n
RuntimeError: CUDA out of memory. Tried to allocate 1.61 GiB (GPU 0; 12.00 GiB total capacity; 7.49 GiB already allocated; 820.44 MiB free; 8.73 GiB reserved in total by PyTorch)

@Link-Li
Owner

Link-Li commented Nov 10, 2020

I'm not sure how large your model is; you could try setting gpu0_bsz to 1 and the batch size to 200.

@ZorrowHu
Author

With gpu0_bsz set to 1 and batch size set to 200, GPU 0 usage is still very high and GPU 1 barely moves.

@Link-Li
Owner

Link-Li commented Nov 10, 2020

Before you run the Python code, prepend CUDA_VISIBLE_DEVICES=0,1 to the python command and try again. It seems that setting os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1' inside the code no longer reliably selects the GPUs.
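
As a side note, the environment variable only takes effect if it is set before the CUDA runtime is initialized, so the safe options are the command line or the very top of the script. A minimal sketch:

# set the variable before torch touches the GPU, i.e. before importing torch
# (or at least before the first CUDA call), otherwise it is silently ignored
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # no space after the comma, to be safe

import torch
print(torch.cuda.device_count())   # should print 2 when both GPUs are visible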

@Link-Li
Owner

Link-Li commented Nov 10, 2020

I'd also suggest running on a single GPU first to see the largest batch size a single GPU can handle.

@ZorrowHu
Author

Do you mean entering the commands like this?

set CUDA_VISIBLE_DEVICES=0,1   # set the variable
python main.py                 # then run the program

I tried it and the result still didn't change...
Interestingly, when I swapped the order of the two GPUs in the code:

os.environ["CUDA_VISIBLE_DEVICES"] = '1, 0'

this time GPU 1 had high usage and GPU 0 had low usage.
After fiddling with this for a long time I still couldn't solve it; I'll probably just have to reduce the batch size.

@Link-Li
Owner

Link-Li commented Nov 10, 2020

CUDA_VISIBLE_DEVICES=0,1 python main.py
It really feels like your model is too large and the data itself is too small. On a single GPU, how much memory does batch size 1 use?

@ZorrowHu
Author

On a single GPU, batch size = 1 uses 0.9/12 GB and batch size = 50 uses 5.9/12 GB.

@Link-Li
Owner

Link-Li commented Nov 10, 2020

I'm not sure what's going on in your case; none of the setups I've used have had this problem. You could try the official multi-GPU DataParallel (DP).
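
For comparison, a minimal sketch of the official torch.nn.DataParallel wrapper (SessionGraph, opt and n_node are the names from the original post):

import torch
import torch.nn as nn

# official DataParallel: splits each batch evenly across the listed devices
# and gathers the outputs back on device 0
model = SessionGraph(opt, n_node)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

# the wrapped module stays reachable as model.module, e.g. for the custom
# optimizer/scheduler attributes from the original post
optimizer = model.module.optimizer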

@ZorrowHu
Author

Coming back to your reply after many days, I think you were right that "the model is too large and the data itself is too small", since my dataset is only 15.8 MB. From searching online, the cause may be that the model uses too many nn.Linear() layers, which makes it huge.... So my situation is probably not something data parallelism can solve, right?

@Link-Li
Owner

Link-Li commented Nov 21, 2020

If a single copy of the model doesn't even fit on one GPU, I'd suggest splitting the model: put one part on GPU 0 and the other part on GPU 1. But that will be quite inefficient. Also, is there much point in stacking that many Linear layers in the model?
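
A minimal sketch of that kind of manual split (model parallelism); the layer sizes here are purely illustrative:

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    # toy example: the first half of the layers lives on GPU 0, the second half on GPU 1
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(1024, 256), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # move the intermediate activations to the second GPU before continuing
        return self.part2(x.to("cuda:1"))

model = SplitModel()
out = model(torch.randn(8, 512))   # the output ends up on cuda:1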

Link-Li closed this as completed Dec 2, 2020