
Deepspeed and deepspeed.zero.Init() broken for big models. #1041

Closed
Vbansal21 opened this issue May 4, 2021 · 7 comments


My System:
Ubuntu groovy 20.10
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2
torch 1.8.1
deepspeed 0.3.16

For models with many different modules and deeply nested submodules (modules inside modules), the deepspeed.zero.Init() model builder breaks.
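For context, the construction pattern that triggers this is roughly the following (a minimal sketch only: TransformerModel and its arguments are taken from the traceback below, while ds_config and the zero.Init keyword arguments are assumptions and their names may differ between DeepSpeed versions):

```python
import deepspeed

# ds_config is assumed to be the same dict/JSON later passed to deepspeed.initialize();
# remote_device="nvme" matches the NVMe parameter offloading shown in the log below.
with deepspeed.zero.Init(remote_device="nvme", config=ds_config):
    # Every nn.Module subclass defined while this context is active has its
    # __init__ patched so parameters are partitioned/offloaded as they are
    # created. On __exit__ the patch is undone for each subclass, and that is
    # where the AttributeError below is raised for the 'Deterministic' class.
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers,
                             num_parallel_layers, dropout)
```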

```
[2021-05-04 17:45:45,998] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-05-04 17:45:46,354] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.29.173, master_port=29500
[2021-05-04 17:45:46,354] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-05-04 17:45:49,834] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_params ............. set()
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 100000000.0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_config .................. {'device': 'nvme', 'nvme_path': '/mnt/nvme0n1p3/', 'buffer_count': 5, 'buffer_size': 100000000.0, 'max_in_cpu': 1000000000.0, 'pin_memory': False}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_folder .................. /mnt/nvme0n1p3/zero_stage_3/fp16params/rank0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_out_params .............. []
nn.functional.linear has been overridden with a more memory efficient version. This will persist unless manually reset.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/usr/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 573, in <module>
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers,num_parallel_layers, dropout)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 260, in __exit__
    _disable_class(subclass)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 256, in _disable_class
    cls.__init__ = cls._old_init
AttributeError: type object 'Deterministic' has no attribute '_old_init'
```
One temporary workaround was to change the __exit__ method of InsertPostInitMethodToModuleSubClasses in partition_parameters.py; to be precise, changing
```python
def _disable_class(cls):
    cls.__init__ = cls._old_init  # approx. line 255
```

to

```python
def _disable_class(cls):
    try:
        cls.__init__ = cls._old_init
    except:
        pass
```
but it broke the whole execution, resulting in an error in the initialisation of the weights of a linear layer deep inside the model:
```
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003387928009033203 seconds
torch.Size([1, 10, 512]) torch.Size([4096, 512])
torch.Size([1, 10, 4096]) torch.Size([512, 4096])
torch.Size([1, 10, 512]) torch.Size([4096, 512])
torch.Size([1, 10, 4096]) torch.Size([512, 4096])
torch.Size([1, 10, 512]) torch.Size([1])
Traceback (most recent call last):
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 597, in <module>
    out = model(inp,assign_to_alt_mem=False)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 943, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 139, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 400, in forward
    output = self.transformer_encoder(output)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 180, in forward
    out2 = self.norm2(enc(i)) + i
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 132, in forward
    output = self.ffd(output) + self.pkm(output)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 209, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 61, in forward
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
```
(Here the sizes of the input received by the linear layer and of the linear layer's weight are printed; I modified the code slightly to do so.)
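For reference, a slightly safer form of that temporary patch guards the restore instead of swallowing every exception (still only a sketch of a workaround, not a fix, since it leaves unexplained why some nn.Module subclasses never received _old_init in the first place):

```python
# Guarded variant of the workaround in
# deepspeed/runtime/zero/partition_parameters.py,
# InsertPostInitMethodToModuleSubClasses.__exit__ (around line 255):
def _disable_class(cls):
    # restore __init__ only on classes that were actually patched on __enter__
    if hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
```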

Vbansal21 changed the title from "deepspeed.zero.Init() error : not attribute _old_init." to "deepspeed.zero.Init() error : no attribute _old_init." on May 4, 2021
janEbert (Contributor) commented May 5, 2021

This might help: #907 (comment)

Vbansal21 (Author)

> This might help: #907 (comment)

This led to the second problem, i.e. "RuntimeError: mat1 dim 1 must match mat2 dim 0", which earlier was caused by my changes to the main code; this time it happened on its own.

Vbansal21 (Author) commented May 28, 2021

I have tried many different combinations of settings, but the problem still exists. It appears when the model is somewhat complex, e.g. Lucidrains' Performer implementation in PyTorch, and the number of parameters and/or the size of the biggest layer is large (above some threshold I have not been able to identify or change). I tried changing the bucket size, buffer size, and many other settings, but the final error generally occurs at a linear layer (both the memory-efficient version and the vanilla one); the error is a cuBLAS GemmEx runtime error. I don't know whether other layer types would cause this or a similar problem, since I cannot get past this one.

System: Ubuntu 20.04 (freshly installed a few days ago)
CUDA: 11.3
Driver version: 465
GPU: GTX 1660 Ti Max-Q (laptop)
CPU: i7-9750H
RAM: 24GB DDR4
NVMe: Samsung Evo 970 Pro 512GB
Primary focus: ZeRO-Infinity stage-3 NVMe offloading
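For reference, the kind of configuration being varied looks roughly like the sketch below (illustrative only: the nvme_path, buffer and aio values mirror the swapper log in the first post, while the bucket/threshold values are placeholders that were being tuned, not a known-good setting):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/",
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9,
            "pin_memory": False,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/",
        },
        # knobs that were varied while trying to get past the error
        "stage3_max_live_parameters": 1e9,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 5e7,
        "sub_group_size": 1e9,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}
```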

Vbansal21 changed the title from "deepspeed.zero.Init() error : no attribute _old_init." to "Deepspeed and deepspeed.zero.Init() broken for big models." on May 28, 2021
xinj7 commented Mar 3, 2023

@Vbansal21 did you manage to solve the issue?

Vbansal21 (Author) commented Mar 3, 2023

@lavaaa7
Well, nope, I had to work around it by trial and error.

tjruwase (Contributor) commented Mar 3, 2023

@Vbansal21, @lavaaa7 do you want to re-open this and provide a newer stack trace? We have recently been looking into similar issues concerning zero.Init: #2811 and #2812.

dumpmemory

@tjruwase and this one #2637
