Zero infinity xpu support #4130

Merged
merged 19 commits into microsoft:master on Oct 3, 2023

Conversation

@Liangliang-Ma (Contributor) commented Aug 10, 2023:

Currently the AIO kernel has gaps when used on an XPU device.
This PR fixes the following issues:

  1. the CPU buffer path only detects CUDA tensors in the AIO threads
  2. make is_pinned() more general across all kinds of accelerators (a sketch follows below)
  3. io_getevents() being interrupted by a signal on XPU

Developed and tested on Megatron-DeepSpeed.
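
As a rough illustration of point 2, here is a minimal Python sketch of routing the pinned-memory check through the accelerator abstraction instead of assuming CUDA. get_accelerator().pin_memory() already exists in DeepSpeed; an accelerator-level is_pinned() is the generalization this PR is about, so it is probed defensively here.

    import torch
    from deepspeed.accelerator import get_accelerator

    # Pin a CPU staging buffer through the accelerator abstraction rather than
    # calling tensor.pin_memory() directly (which assumes a CUDA build).
    buffer = torch.empty(1 << 20, dtype=torch.float32)
    buffer = get_accelerator().pin_memory(buffer)

    # Prefer an accelerator-level pinned check when it exists; fall back to
    # torch's CUDA-only check otherwise.
    accel = get_accelerator()
    pinned = accel.is_pinned(buffer) if hasattr(accel, "is_pinned") else buffer.is_pinned()
    print(pinned)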

@Liangliang-Ma (Contributor, Author):

Some changes may not be necessary; I wonder if they are model related. For example, assert(_num_pending_ops > 0); should not fail if the thread's queue works correctly, but in Megatron-DeepSpeed it does. As for io_pgetevents, I will run more tests and get back to you. Thanks.

@Liangliang-Ma (Contributor, Author):

Hi @tjruwase, I made some modifications to the current PR and removed the environment-variable dependence. Let me explain the main changes in this PR.

  1. Use io_pgetevents instead of io_getevents
     It is more general and it keeps signals from interrupting the I/O wait. The performance of the two is the same.

  2. More copying on XPU
     This is because async I/O requires aligned memory to be used as the buffer. Currently an aligned allocation has to be set up manually, otherwise the I/O fails. This could be optimized in the future (see the padding sketch after this list).

  3. Changes in _swap_out_unpinned_tensors
     In deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py, swap_in_optimizer_state reads data into buffers whose sizes are all swap_info.numel(). But for _swap_out_unpinned_tensors in swap_out_optimizer_state, the write buffer sizes are the numel of each tensor. This can be a mismatch in some cases.
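
As a sketch of the padding point 2 refers to (illustrative only; the 4096-byte alignment and the helper name are not taken from this PR), the idea is to round a bounce buffer's byte size up to the I/O alignment and hand the unpadded view back once the I/O completes:

    import torch

    def padded_numel(numel: int, elem_size: int, align_bytes: int = 4096) -> int:
        # Round the byte footprint up to a multiple of align_bytes and
        # return the corresponding element count.
        nbytes = numel * elem_size
        padded_bytes = (nbytes + align_bytes - 1) // align_bytes * align_bytes
        return padded_bytes // elem_size

    # Allocate the bounce buffer with the padded length, then expose only the
    # original (unpadded) slice to the rest of the pipeline.
    param_numel = 22843136
    buffer = torch.zeros(padded_numel(param_numel, elem_size=4), dtype=torch.float32)
    usable = buffer.narrow(0, 0, param_numel)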

@tjruwase (Contributor):

2) More copying on XPU: This is because async I/O requires aligned memory to be used as the buffer. Currently an aligned allocation has to be set up manually, otherwise the I/O fails. This could be optimized in the future.

Requiring aligned buffers is a deliberate design decision to simplify the library and ensure high performance. We prefer for the alignment issues to be addressed on the client side where there is more flexibility. For example, clients can easily add padding in a way that minimizes perf impact. However, I think the library needs to be improved to assert on unaligned buffers to make this requirement more explicit.

@tjruwase (Contributor):

3. Changes in _swap_out_unpinned_tensors
In deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py, swap_in_optimizer_state reads data into buffers whose sizes are all swap_info.numel(). But for _swap_out_unpinned_tensors in swap_out_optimizer_state, the write buffer sizes are the numel of each tensor. This can be a mismatch in some cases.

Can you please share an example of this failure case, perhaps in the form of a unit test?

@Liangliang-Ma (Contributor, Author):

2) More copying on XPU: This is because async I/O requires aligned memory to be used as the buffer. Currently an aligned allocation has to be set up manually, otherwise the I/O fails. This could be optimized in the future.

Requiring aligned buffers is a deliberate design decision to simplify the library and ensure high performance. We prefer for the alignment issues to be addressed on the client side where there is more flexibility. For example, clients can easily add padding in a way that minimizes perf impact. However, I think the library needs to be improved to assert on unaligned buffers to make this requirement more explicit.

The buffers used in swap-in and swap-out are all created and initialized in the DeepSpeed Python code; on the client side we cannot affect them by changing the model code. By the way, I see that the linked DeepSpeed test has a flag that creates an aligned tensor, which also calls posix_memalign to initialize the buffer, and this flag does not seem to be used inside DeepSpeed. What about adding it so that clients can choose how to create their buffers?
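
For illustration, a posix_memalign-style allocation can be approximated in pure Python by over-allocating and offsetting to the next aligned address (the helper name and the 4096-byte alignment are invented for this example; the actual test flag does this in C via posix_memalign):

    import numpy as np
    import torch

    def aligned_cpu_tensor(numel: int, align: int = 4096) -> torch.Tensor:
        # Return a float32 CPU tensor whose data pointer is align-byte aligned.
        itemsize = 4  # float32
        raw = np.empty(numel * itemsize + align, dtype=np.uint8)
        offset = (-raw.ctypes.data) % align        # bytes to the next aligned address
        window = raw[offset:offset + numel * itemsize]
        return torch.from_numpy(window.view(np.float32))

    t = aligned_cpu_tensor(1 << 20)
    assert t.data_ptr() % 4096 == 0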

@Liangliang-Ma (Contributor, Author):

  3. Changes in _swap_out_unpinned_tensors
     In deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py, swap_in_optimizer_state reads data into buffers whose sizes are all swap_info.numel(). But for _swap_out_unpinned_tensors in swap_out_optimizer_state, the write buffer sizes are the numel of each tensor. This can be a mismatch in some cases.

Can you please share an example of this failure case, perhaps in the form of a unit test?

In my Megatron-DeepSpeed environment with Adam on XPU devices, when execution reaches _update_param_state_info, the optimizer state added to swap_info can have shapes [1, 22843136, 22843136], which results in a mismatch between swapping the optimizer states in and out: in swap_in the buffer sizes are set to [22843136, 22843136, 22843136], while in swap_out (aligned) they are [256, 22843136, 22843136]. There is an assert that catches this mismatch in the linked code.
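
To make the mismatch concrete, here is a small sketch of the two size computations using the shapes from this run (the variable names are illustrative, not the actual ones in partitioned_optimizer_swapper.py; the 256-element granularity is inferred from the reported swap_out sizes):

    # Optimizer state tensors for one parameter, as observed in this run:
    # a scalar state plus two full-size states.
    state_numels = [1, 22843136, 22843136]
    param_numel = 22843136   # swap_info.numel()
    align_numel = 256        # illustrative alignment granularity

    def round_up(n, a):
        return (n + a - 1) // a * a

    # swap-in sizes every read buffer by the parameter's numel ...
    swap_in_sizes = [param_numel for _ in state_numels]
    # ... while swap-out (aligned) sizes each write buffer by its own numel.
    swap_out_sizes = [round_up(n, align_numel) for n in state_numels]

    print(swap_in_sizes)    # [22843136, 22843136, 22843136]
    print(swap_out_sizes)   # [256, 22843136, 22843136]
    assert swap_in_sizes != swap_out_sizes   # the mismatch the assert catches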

@Liangliang-Ma (Contributor, Author):

@tjruwase Any comments? Thanks!

@tjruwase (Contributor):

The buffers used in swap-in and swap-out are all created and initialized in the DeepSpeed Python code; on the client side we cannot affect them by changing the model code.

Actually, the expectation is that the client side sends aligned buffers to the library. This is why the tensor-swapping utility of DeepSpeed has a lot of alignment code to guarantee this condition. Can you share a bit about your client side? Perhaps DeepSpeed can expose another wrapper to provide this alignment. Would that help?

@tjruwase (Contributor):

By the way, I see that the linked DeepSpeed test has a flag that creates an aligned tensor, which also calls posix_memalign to initialize the buffer, and this flag does not seem to be used inside DeepSpeed. What about adding it so that clients can choose how to create their buffers?

Yes, good observation. Our plan is to use this functionality within DeepSpeed and also expose it to clients. We hope to do that soon.

@tjruwase (Contributor):

In my Megatron-DeepSpeed environment with Adam on XPU devices, when execution reaches _update_param_state_info, the optimizer state added to swap_info can have shapes [1, 22843136, 22843136], which results in a mismatch between swapping the optimizer states in and out: in swap_in the buffer sizes are set to [22843136, 22843136, 22843136], while in swap_out (aligned) they are [256, 22843136, 22843136].

I am very curious about how this mismatch could happen. It concerns me that our swapping logic is making wrong assumptions. Is it possible to share more details of the training run? Or is it possible to create a unit test? Thanks!

@Liangliang-Ma (Contributor, Author) commented Sep 19, 2023:

Actually, the expectation is that the client side sends aligned buffers to the library. This is why the tensor-swapping utility of DeepSpeed has a lot of alignment code to guarantee this condition. Can you share a bit about your client side? Perhaps DeepSpeed can expose another wrapper to provide this alignment. Would that help?

I added the alignment in get_accelerator(), which makes it easy for different platforms to create or adjust their buffers without affecting the current library. Exposing another wrapper to clients also works for me; I think both can exist together.
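
A hedged sketch of what an alignment-aware pin_memory hook on the accelerator could look like; the align_bytes parameter name comes from the commit titles in this PR, but the exact signature and default DeepSpeed settles on may differ. The toy version below assumes align_bytes is a multiple of the element size and that the build supports pinned host memory.

    import torch

    class ExampleAccelerator:
        # Toy stand-in for a DeepSpeed accelerator implementation.

        def pin_memory(self, tensor: torch.Tensor, align_bytes: int = 1) -> torch.Tensor:
            # align_bytes == 1 keeps the current behaviour; larger values pad
            # the pinned allocation so async I/O sees an aligned buffer length.
            if align_bytes > 1:
                elem = tensor.element_size()
                nbytes = tensor.numel() * elem
                padded = (nbytes + align_bytes - 1) // align_bytes * align_bytes
                storage = torch.empty(padded // elem, dtype=tensor.dtype).pin_memory()
                return storage.narrow(0, 0, tensor.numel()).copy_(tensor)
            return tensor.pin_memory()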

@Liangliang-Ma (Contributor, Author) commented Sep 19, 2023:

I am very curious about how this mismatch could happen. It concerns me that our swapping logic is making wrong assumptions. Is it possible to share more details of the training run? Or is it possible to create a unit test? Thanks!

I have found the root cause of this mismatch. DeepSpeed assumes that optimizer state tensors have the same numel as the parameter, so swap_in and swap_out can use either each tensor's numel or the parameter's numel as the length. In reality, different optimizers behave differently and the numels can differ.

For apex.optimizers.FusedAdam, the optimizer states have the same numel as the parameter (apex optimizer state init).
For torch.optim.Adam, the optimizer state has three tensors and the first one is a scalar (torch Adam); see the check below.

My previous change was naive and not sufficient, so I fell back to the SGD optimizer, whose state has the same numel as the parameter, to bypass this issue. Should this part go into another PR, or do you want to fix it? @tjruwase
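
For reference, the torch.optim.Adam shape difference mentioned above can be checked directly from its state dict (depending on the torch version, 'step' is a Python number or a scalar tensor, but either way it does not match the parameter's numel):

    import torch

    p = torch.nn.Parameter(torch.randn(1024))
    opt = torch.optim.Adam([p])
    p.grad = torch.randn_like(p)
    opt.step()

    for name, value in opt.state[p].items():
        numel = value.numel() if torch.is_tensor(value) else 1
        print(name, numel)
    # step       1      <- scalar, unlike the parameter's 1024 elements
    # exp_avg    1024
    # exp_avg_sq 1024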

@tjruwase (Contributor):

I added the alignment in get_accelerator(), which makes it easy for different platforms to create or adjust their buffers ...

This solution is fine with me. Thanks!

@tjruwase (Contributor):

so that a -1 value for alignment_bytes means using the system default, such as sysconf(_SC_PAGESIZE)? We will document this behavior. What do you think?

Yes, I agree. It will be cleaner.

Sorry, one more adjustment to this. Perhaps we can use a value of 0 as the system default instead of -1. I don't think this change affects the PR, but it will affect the documentation.
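
A quick sketch of the 0-means-system-default convention being discussed, assuming sysconf(_SC_PAGESIZE) as the fallback (illustrative, not necessarily the exact code that lands):

    import os

    def resolve_alignment(align_bytes: int) -> int:
        # 0 selects the system default page size; any other value is used as given.
        if align_bytes == 0:
            return os.sysconf("SC_PAGE_SIZE")  # typically 4096 on Linux
        return align_bytes

    print(resolve_alignment(0))     # system page size
    print(resolve_alignment(1024))  # caller-supplied value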

@tjruwase (Contributor):

@Liangliang-Ma, this PR looks good to me after addressing the final suggestions. Thanks so much for this great work.

@Liangliang-Ma (Contributor, Author):

@Liangliang-Ma, this PR looks good to me after addressing the final suggestions. Thanks so much for this great work.

@tjruwase I committed the change as you suggested. Thanks for your reviews over these past few days!

@tjruwase (Contributor):

Another thing: in the previous workflow checks, I found that io_pgetevents could not be found in your unit-test environment when building the AIO ops.

I will pay attention to the new workflow for this issue.

I think this is failing because the unit-test environment is Ubuntu 18.04, which does not have io_pgetevents available. Is it possible for the code to use io_getevents or io_pgetevents as appropriate? I will also investigate whether the Ubuntu version of the unit-test environment can be updated. Thanks!

@loadams (Contributor) commented Sep 27, 2023:

Another thing: in the previous workflow checks, I found that io_pgetevents could not be found in your unit-test environment when building the AIO ops.

I will pay attention to the new workflow for this issue.

I think this is failing because the unit-test environment is Ubuntu 18.04, which does not have io_pgetevents available. Is it possible for the code to use io_getevents or io_pgetevents as appropriate? I will also investigate whether the Ubuntu version of the unit-test environment can be updated. Thanks!

This should be updated to be 20.04 now. cc: @tjruwase and @Liangliang-Ma

Is there somewhere we should specify a min libaio version/Ubuntu version required for this?

@tjruwase (Contributor):

@loadams, thanks for updating the CI Ubuntu image. I think the package version is tied to the OS version, so we need a way to specify a minimum Ubuntu version. Unfortunately, I don't know how to do that, and I am not sure we have done this previously.

@mrwyattii, @jeffra any thoughts here? Thanks!

@tjruwase (Contributor):

@Liangliang-Ma, it looks like CI issues are resolved, except for formatting. Can you please address that? Thanks!

@loadams (Contributor) commented Sep 27, 2023:

@loadams, thanks for updating the CI Ubuntu image. I think the package version is tied to the OS version, so we need a way to specify a minimum Ubuntu version. Unfortunately, I don't know how to do that, and I am not sure we have done this previously.

@mrwyattii, @jeffra any thoughts here? Thanks!

I think we should at a minimum put this in the README in the requirements section? Also perhaps list the failure mode if one doesn't have this?

@tjruwase (Contributor):

I think we should at a minimum put this in the README in the requirements section? Also perhaps list the failure mode if one doesn't have this? Thanks!

Great point, @loadams. @Liangliang-Ma, could you please replace io_submit with io_pgetevents in the following compatibility checker function?

def is_compatible(self, verbose=True):
    # Check for the existence of libaio by using distutils
    # to compile and link a test program that calls io_submit,
    # which is a function provided by libaio that is used in the async_io op.
    # If needed, one can define -I and -L entries in CFLAGS and LDFLAGS
    # respectively to specify the directories for libaio.h and libaio.so.
    aio_compatible = self.has_function('io_submit', ('aio', ))
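
The requested change would roughly be the following one-line swap, probing io_pgetevents so that an old libaio (such as the one shipped with Ubuntu 18.04) is reported as incompatible up front:

    # Probe for io_pgetevents instead of io_submit, since the op now relies on it.
    aio_compatible = self.has_function('io_pgetevents', ('aio', ))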

@Liangliang-Ma (Contributor, Author):

@tjruwase Thanks for solving the CI issue! The checker function and formatting modifications have been committed.

@tjruwase added this pull request to the merge queue Sep 29, 2023
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 29, 2023
@Liangliang-Ma (Contributor, Author):

@tjruwase it seems this PR is blocked by a CPU-inference unit test in the merge queue, which is out of this PR's scope. Can you please help check it? Thanks!

@loadams (Contributor) commented Oct 2, 2023:

@tjruwase it seems this PR is blocked by a CPU-inference unit test in the merge queue, which is out of this PR's scope. Can you please help check it? Thanks!

Hi @Liangliang-Ma - this is a known issue; it should be resolved in the linked PR, and then we can unblock the merge queue.

@tjruwase added this pull request to the merge queue Oct 3, 2023
Merged via the queue into microsoft:master with commit 1760627 Oct 3, 2023
16 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Oct 9, 2023
* zero infinity xpu support

* remove env var depends

* client align mem

* sync with all accelerators'

* format fix

* add align in pin_memory api

* add missing brackets

* remove align

* modify pin_memory api

* modify pin_memory api to use only on align para

* change value of align bytes

* Update csrc/aio/common/deepspeed_aio_common.cpp

* add version check and change format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>