
Deepspeed and deepspeed.zero.Init() broken for big models. #1041

Closed
Vbansal21 opened this issue May 4, 2021 · 7 comments


My System:
Ubuntu groovy 20.10
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2
torch 1.8.1
deepspeed 0.3.16

For models with many different modules and deeply nested submodules (modules inside modules), the deepspeed.zero.Init() model builder breaks.
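For context, the construction pattern that triggers this is roughly the following (a minimal sketch only: TransformerModel and its arguments are taken from the traceback below, while ds_config and the zero.Init keyword arguments are assumptions and their names may differ between DeepSpeed versions):

```python
import deepspeed

# ds_config is assumed to be the same dict/JSON later passed to deepspeed.initialize();
# remote_device="nvme" matches the NVMe parameter offloading shown in the log below.
with deepspeed.zero.Init(remote_device="nvme", config=ds_config):
    # Every nn.Module subclass defined while this context is active has its
    # __init__ patched so parameters are partitioned/offloaded as they are
    # created. On __exit__ the patch is undone for each subclass, and that is
    # where the AttributeError below is raised for the 'Deterministic' class.
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers,
                             num_parallel_layers, dropout)
```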

```
[2021-05-04 17:45:45,998] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2021-05-04 17:45:46,354] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.29.173, master_port=29500
[2021-05-04 17:45:46,354] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-05-04 17:45:49,834] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] available_params ............. set()
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 100000000.0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2021-05-04 17:45:49,834] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_config .................. {'device': 'nvme', 'nvme_path': '/mnt/nvme0n1p3/', 'buffer_count': 5, 'buffer_size': 100000000.0, 'max_in_cpu': 1000000000.0, 'pin_memory': False}
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_folder .................. /mnt/nvme0n1p3/zero_stage_3/fp16params/rank0
[2021-05-04 17:45:49,835] [INFO] [utils.py:34:print_object] swap_out_params .............. []
nn.functional.linear has been overridden with a more memory efficient version. This will persist unless manually reset.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/vbansal21/.vscode-insiders/extensions/ms-python.python-2021.4.765268190/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/usr/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/usr/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 573, in <module>
    model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers,num_parallel_layers, dropout)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 260, in __exit__
    _disable_class(subclass)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 256, in _disable_class
    cls.__init__ = cls._old_init
AttributeError: type object 'Deterministic' has no attribute '_old_init'
```
One temporary workaround was to change the __exit__ method of InsertPostInitMethodToModuleSubClasses in partition_parameters.py; to be precise, changing
```python
def _disable_class(cls):
    cls.__init__ = cls._old_init  # approx. line 255
```

to

```python
def _disable_class(cls):
    try:
        cls.__init__ = cls._old_init
    except:
        pass
```
but it broke the whole execution, resulting in an error in the initialisation of the weights of a linear layer deep inside the model:
```
Using /home/vbansal21/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003387928009033203 seconds
torch.Size([1, 10, 512]) torch.Size([4096, 512])
torch.Size([1, 10, 4096]) torch.Size([512, 4096])
torch.Size([1, 10, 512]) torch.Size([4096, 512])
torch.Size([1, 10, 4096]) torch.Size([512, 4096])
torch.Size([1, 10, 512]) torch.Size([1])
Traceback (most recent call last):
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 597, in <module>
    out = model(inp,assign_to_alt_mem=False)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 943, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 139, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 400, in forward
    output = self.transformer_encoder(output)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 180, in forward
    out2 = self.norm2(enc(i)) + i
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/Documents/Custom_Architecture/Tfn_torch_custom/Transformer_vanilla_torch.py", line 132, in forward
    output = self.ffd(output) + self.pkm(output)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 209, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/vbansal21/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/linear.py", line 61, in forward
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
```
(Here the sizes of the input received by the linear layer and of the linear layer's weight are printed; I modified the code slightly to do so.)
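For reference, a slightly safer form of that temporary patch guards the restore instead of swallowing every exception (still only a sketch of a workaround, not a fix, since it leaves unexplained why some nn.Module subclasses never received _old_init in the first place):

```python
# Guarded variant of the workaround in
# deepspeed/runtime/zero/partition_parameters.py,
# InsertPostInitMethodToModuleSubClasses.__exit__ (around line 255):
def _disable_class(cls):
    # restore __init__ only on classes that were actually patched on __enter__
    if hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
```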

Vbansal21 changed the title from "deepspeed.zero.Init() error : not attribute _old_init." to "deepspeed.zero.Init() error : no attribute _old_init." on May 4, 2021
janEbert (Contributor) commented May 5, 2021

This might help: #907 (comment)

Vbansal21 (Author)

> This might help: #907 (comment)

This led to the second problem, i.e. "RuntimeError: mat1 dim 1 must match mat2 dim 0", which earlier was caused by my changes to the main code; this time it happened on its own.

Vbansal21 (Author) commented May 28, 2021

I have tried many different combinations of settings, but the problem still exists. It appears when the model is somewhat complex, e.g. Lucidrains' Performer implementation in PyTorch, and the number of parameters and/or the size of the biggest layer is large (above some threshold I have not been able to identify or change). I tried changing the bucket size, buffer size, and many other settings, but the final error generally occurs at a linear layer (both the memory-efficient version and the vanilla one); the error is a cuBLAS GemmEx runtime error. I don't know whether other layer types would cause this or a similar problem, since I cannot get past this one.

System: Ubuntu 20.04 (freshly installed a few days ago)
CUDA: 11.3
Driver version: 465
GPU: GTX 1660 Ti Max-Q (laptop)
CPU: i7-9750H
RAM: 24GB DDR4
NVMe: Samsung Evo 970 Pro 512GB
Primary focus: ZeRO-Infinity stage-3 NVMe offloading
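For reference, the kind of configuration being varied looks roughly like the sketch below (illustrative only: the nvme_path, buffer and aio values mirror the swapper log in the first post, while the bucket/threshold values are placeholders that were being tuned, not a known-good setting):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/",
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9,
            "pin_memory": False,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/mnt/nvme0n1p3/",
        },
        # knobs that were varied while trying to get past the error
        "stage3_max_live_parameters": 1e9,
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 5e7,
        "sub_group_size": 1e9,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}
```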

Vbansal21 changed the title from "deepspeed.zero.Init() error : no attribute _old_init." to "Deepspeed and deepspeed.zero.Init() broken for big models." on May 28, 2021
xinj7 commented Mar 3, 2023

@Vbansal21 did you manage to solve the issue?

Vbansal21 (Author) commented Mar 3, 2023

@lavaaa7
Well, nope, I had to work around it by trial and error.

tjruwase (Contributor) commented Mar 3, 2023

@Vbansal21, @lavaaa7 do you want to re-open this and provide a newer stack trace? We have recently been looking into similar issues concerning zero.Init: #2811 and #2812.

dumpmemory

@tjruwase and this one #2637
