
fix: set use_reentrant to True to fix Mixtral-7b bug #3928

Merged 1 commit into master on Feb 9, 2024

Conversation

@geoffreyangus (Collaborator) commented Feb 9, 2024

Got the following error when training Mixtral-7b in a multi-GPU setting:

tensor at position 96:
saved metadata: {'shape': torch.Size([142]), 'dtype': torch.int64, 'device': device(type='cuda', index=1)}
recomputed metadata: {'shape': torch.Size([143]), 'dtype': torch.int64, 'device': device(type='cuda', index=1)}
tensor at position 97:
saved metadata: {'shape': torch.Size([142]), 'dtype': torch.int64, 'device': device(type='cuda', index=1)}
recomputed metadata: {'shape': torch.Size([143]), 'dtype': torch.int64, 'device': device(type='cuda', index=1)}
...

This can be resolved by setting use_reentrant to True, which ensures that:

Reentrant checkpoint always recomputes function in its entirety during the backward pass.
Source: https://pytorch.org/docs/stable/checkpoint.html

So shape mismatches should be prevented.
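For reference, a minimal sketch of how the flag might be passed when enabling gradient checkpointing on a Hugging Face model; the model id and the exact plumbing are illustrative here, not the precise change made in this PR:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative model id; this PR concerns Mixtral training in Ludwig.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    torch_dtype=torch.bfloat16,
)

# Reentrant checkpointing recomputes the wrapped function in its entirety
# during the backward pass, so the recomputed activations match the saved
# metadata and the shape-mismatch error above is avoided.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": True}
)
```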

@arnavgarg1 (Contributor) left a comment


Can we also bump the minimum torch version as a follow-up PR?

@geoffreyangus geoffreyangus merged commit ea890d9 into master Feb 9, 2024
14 of 17 checks passed
@geoffreyangus geoffreyangus deleted the use-reentrant-fix branch February 9, 2024 22:03