Add try-except block for deepspeed import handling#267
Harthi7 wants to merge 21 commits into instructlab:main from Harthi7:feature/add-try-except-import-to-deepspeed
Conversation
|
Hey @RobotSail, can you please review my PR? |
| except ImportError: | ||
| DeepSpeedCPUAdam = None | ||
| FusedAdam = None | ||
| print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.") |
I like where you're headed with this. One thing we should do though is guard this print statement to the first rank to prevent it from appearing everywhere in a multi-GPU setting:
| print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.") | |
| local_rank = os.getenv('LOCAL_RANK', None) | |
| if __name__ == '__main__' and (local_rank is None or local_rank == '0'): | |
| print("DeepSpeed Adam optimizers are not available. Some features may be unavailable.") |
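For illustration, the rank-0 guard pattern above can be sketched as a small self-contained helper (the name `warn_on_rank0` is hypothetical, not part of the PR; `torchrun` sets `LOCAL_RANK` as a string for each worker):

```python
import os

def warn_on_rank0(message: str) -> None:
    # torchrun populates LOCAL_RANK for each worker process; when it is
    # unset we assume we are the only (main) process and still print.
    local_rank = os.getenv("LOCAL_RANK")
    if local_rank is None or local_rank == "0":
        print(message)

# Only the process with LOCAL_RANK=0 (or no LOCAL_RANK at all) prints:
os.environ["LOCAL_RANK"] = "1"
warn_on_rank0("DeepSpeed Adam optimizers are not available.")  # suppressed
os.environ["LOCAL_RANK"] = "0"
warn_on_rank0("DeepSpeed Adam optimizers are not available.")  # printed
```

Note that the environment variable is a string, so the comparison must be against `"0"`, not the integer `0`.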
| from deepspeed.runtime.zero.utils import ZeRORuntimeException | ||
| except ImportError: | ||
| ZeRORuntimeException = None | ||
| print("ZeRORuntimeException is not available. Some features may be unavailable.") |
In which cases would the ZeRORuntimeException not be available for import?
From my understanding, the only reason ZeRORuntimeException could not be imported is a general issue that could affect any other import as well. I wrapped it in a try/except block because I assumed it was part of the issue's goal of making the DeepSpeed imports optional.
@Harthi7 I see. In that case, could you please group them all into a single section?
|
Thank you for the PR @Harthi7! This is a great change. I've left a few comments, nothing major. Mainly, I'm not familiar with the ZeRORuntimeException being unavailable for import: when would this happen? And would this be limited to just the ZeRORuntimeException, or is it part of a greater issue where parts of DeepSpeed are not available for import? Rather than emitting a log about being unable to import that specific exception, it may be better to indicate that an entire section of DeepSpeed can't be imported due to a piece not being built properly. |
|
Hello @RobotSail, I updated my PR, please review the new changes. |
| from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam | ||
| from deepspeed.runtime.zero.utils import ZeRORuntimeException | ||
| try: | ||
| from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam |
So we'll need to actually split this into two different try/except blocks: one that imports DeepSpeedCPUAdam, and another that imports everything else.
That is, we'll need to do the following:
try:
    # handle the CPU optimizer here; it may not be available
    # if DeepSpeed was built without the CPU Adam extension
    from deepspeed.ops.adam import DeepSpeedCPUAdam
except ImportError:
    DeepSpeedCPUAdam = None
    local_rank = os.getenv('LOCAL_RANK', None)
    if __name__ == '__main__' and (local_rank is None or local_rank == '0'):
        print("DeepSpeed CPU Optimizer is not available. Some features may be unavailable.")

# handle everything else separately
try:
    from deepspeed.ops.adam import FusedAdam
    from deepspeed.runtime.zero.utils import ZeRORuntimeException
except ImportError:
    FusedAdam = None
    ZeRORuntimeException = None
    local_rank = os.getenv('LOCAL_RANK', None)
    if __name__ == '__main__' and (local_rank is None or local_rank == '0'):
        print("DeepSpeed is not available. Some features may be unavailable.")

Apologies for any confusion this may have caused on your part.
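As a general illustration of this optional-import pattern, here is a runnable sketch that uses a deliberately nonexistent package instead of DeepSpeed, so it works anywhere (the names `some_missing_package`, `FancyOptimizer`, and `pick_optimizer` are made up for the example):

```python
# Optional-import pattern: bind the name to None when the dependency is
# missing, so later code can check availability with a simple truthiness test.
try:
    from some_missing_package.ops import FancyOptimizer  # assumed absent here
except ImportError:
    FancyOptimizer = None

def pick_optimizer() -> str:
    # Fall back gracefully when the optional dependency could not be imported.
    return "fancy" if FancyOptimizer else "fallback"

print(pick_optimizer())  # -> fallback
```

The key point is that the module still imports cleanly; only code paths that actually need the missing pieces are affected.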
| DeepSpeedCPUAdam = None | ||
| FusedAdam = None | ||
| ZeRORuntimeException = None | ||
| local_rank = int(os.getenv('LOCAL_RANK', -1)) |
LOCAL_RANK is typically populated by torchrun, so you'll want to handle the case where it may not be set - in which case we should just assume that we are the main process:
| local_rank = int(os.getenv('LOCAL_RANK', -1)) | |
| local_rank = os.getenv('LOCAL_RANK', None) | |
| if __name__ == '__main__' and (local_rank is None or local_rank == '0'): |
|
Thank you for updating this PR @Harthi7. Since we're trying to handle the case when DeepSpeed cannot be imported, we should also add some safeguards to prevent the code from proceeding when the necessary imports are not available. So we'll want to do this in two places.
Basically, in either of these places you'll want to check which distributed training framework is being used and whether the options we want are available. For the first piece, you'll want to add something like the following:
if args.distributed_training_framework == 'deepspeed' and not FusedAdam:
    # means we can't import anything from deepspeed
    raise ImportError("DeepSpeed was selected but we cannot import the `FusedAdam` optimizer")
if args.distributed_training_framework == 'deepspeed' and args.cpu_offload_optimizer and not DeepSpeedCPUAdam:
    raise ImportError("DeepSpeed was selected and CPU offloading was requested, but DeepSpeedCPUAdam could not be imported. This likely means you need to build DeepSpeed with the CPU adam flags.")
And then we'll want to do something similar for the second piece:
# at the top of the file:
from instructlab.training.config import DistributedBackend

# inside of `run_training`:
if train_args.distributed_backend == DistributedBackend.DeepSpeed:
    if not FusedAdam:
        raise ImportError("DeepSpeed was selected as the distributed backend, but FusedAdam could not be imported. Please double-check that DeepSpeed is installed correctly.")
    if train_args.deepspeed_options.cpu_offload_optimizer and not DeepSpeedCPUAdam:
        raise ImportError("DeepSpeed CPU offloading was enabled, but DeepSpeedCPUAdam could not be imported. This is most likely because DeepSpeed was not built with CPU Adam. Please rebuild DeepSpeed with CPU Adam, or disable CPU offloading.") |
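To see how such fail-fast guards behave, here is a runnable toy version; the module-level names stand in for the optional imports (in the real code they are None exactly when the corresponding DeepSpeed import failed), and `check_deepspeed` is a hypothetical helper, not the library's API:

```python
# Toy stand-ins for the optional imports; None means the import failed.
FusedAdam = None
DeepSpeedCPUAdam = None

def check_deepspeed(framework: str, cpu_offload: bool) -> None:
    # Fail fast with a clear message instead of crashing later
    # inside the training loop with an opaque AttributeError.
    if framework == "deepspeed" and not FusedAdam:
        raise ImportError("DeepSpeed was selected but FusedAdam could not be imported")
    if framework == "deepspeed" and cpu_offload and not DeepSpeedCPUAdam:
        raise ImportError("CPU offloading requested but DeepSpeedCPUAdam is unavailable")

try:
    check_deepspeed("deepspeed", cpu_offload=False)
except ImportError as e:
    print(f"refusing to start training: {e}")
```

A non-DeepSpeed backend passes the check untouched, so the guards add no cost to the common path.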
Currently, the training library does not exit when an error is encountered within the training loop (invoked through torchrun). This commit updates that functionality so we correctly return an exit code of 1 on child failure. Additionally, this commit also adds the `make fix` command which automatically fixes all trivial issues picked up on by ruff Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com> Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com> Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com> Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com> Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…github.com/Harthi7/training into feature/add-try-except-import-to-deepspeed
this commit adds a new E2E job meant to test integration of training library changes with the CLI's "full" train pipeline to prevent any regressions it also updates the relevant mergify configuration Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was being incorrectly labeled as 'small' Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old AMI from the previous job Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
was still using the old instance type Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Currently, the training library does not exit when an error is encountered within the training loop (invoked through torchrun). This commit updates that functionality so we correctly return an exit code of 1 on child failure. Additionally, this commit also adds the `make fix` command which automatically fixes all trivial issues picked up on by ruff Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Harthi7 <abdullah-harthi7@live.com> Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…nto a single section Signed-off-by: abdullah-ibm <abdullah@ibm.com>
Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…ssary imports are not available Signed-off-by: abdullah-ibm <abdullah@ibm.com>
…github.com/Harthi7/training into feature/add-try-except-import-to-deepspeed
| HF_TOKEN: ${{ secrets.HF_TOKEN }} | ||
| run: | | ||
| . venv/bin/activate | ||
| <<<<<<< HEAD |
Did you mean to include this?
|
This pull request has merge conflicts that must be resolved before it can be merged. |
|
@Harthi7 Should we close this PR then? |
Context Issue:
#250 (comment)
Implemented error handling for the import of deepspeed to prevent runtime crashes when the module is unavailable. This improves the robustness of the application.
Issue: #250