[REQUEST] Hey, Microsoft...Could you PLEASE Support Your Own OS? #2427
Comments
+1 DeepSpeed is nearly (if not entirely) impossible to install on Windows.
We hear you. Please try #2428.
Hi @n00mkrad and @d8ahazard, I wonder if you have any update on whether this PR solved the Windows installation issue?
Nope. Trying to run it in VS PowerShell:
Trying to run it in CMD:
Solved this by installing the Windows 10 SDK... but this is also precisely what I'm grumbling about. Even after getting it to compile, there's no /dist folder and no .whl file, despite the setup.py file clearly indicating this is what should happen. The .bat file is calling python setup.py bdist_wheel... yet we get a .egg-info file. If I edit the bat to call pip install setup.py, it gets really mad at me; can't find the error it throws ATM.

Like, within the app where I'm trying to use DeepSpeed, I can easily do a try: / import deepspeed to determine whether that dependency exists. Why can't the setup.py script do the same for ops that may be unavailable on Windoze?

Last: when I do finally jump through all the hoops and get setup.py to create something in the /build folder, I have to manually spoof the whl-info directory in order for accelerate to recognize it, and even then it refuses to load due to a missing module: "Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed."
@tjruwase @RezaYazdaniAminabadi Hi
@d8ahazard, yes, DeepSpeed can work without MPI.
@tjruwase thanks ❤️ if we don't need MPI ...
Did Microsoft really consider Windows when developing this? When I start PyTorch, it forces linking a GPU with NCCL even though I'm training on CPU only. As we all know, NCCL can't fucking be used on Windows at all.
working with WSL 🎉
How did you resolve the error?
So it's still not working on Windows. WSL is not always an option depending on the use case.
@tjruwase I can't manage to run it on native Windows. 😭 And Ubuntu already comes with ...
@camenduru, can you share the log of the link error? Thanks!
@tjruwase https://gist.github.com/camenduru/c9a2d97f229b389fed0b1ad561a243d3 pytorch/pytorch#81642 (this one looks serious) 🥵
is this one necessary?
@camenduru for WSL2, is it passing pytest-3 tests/unit and the other tests? I got it compiled on WSL2 but it is failing almost every test due to NCCL issues. If you could provide details about your installation and whether you are passing the unit tests, that would be appreciated.
@Thomas-MMJ DeepSpeed was very slow with WSL2 and I deleted everything, sorry I can't help 😞 We need a working DeepSpeed on native Windows; maybe in 1 year, idk. Also, why are we putting a Linux KVM between the GPU and CPU? We'll lose ~5%, right?
I think the problem is that it is trying to build all the ops because of the following environment variable setting. Can you try setting that env var to zero?
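For anyone reading along, a minimal sketch of what that might look like in a Windows cmd session. The variable name DS_BUILD_OPS is an assumption, since the quoted setting was lost from the comment above:

```bat
REM Sketch: assuming the variable in question is DS_BUILD_OPS (DeepSpeed's flag
REM for pre-building all ops); with it set to 0, ops are built just-in-time
REM at runtime instead of being compiled up front during the install.
set DS_BUILD_OPS=0
pip install deepspeed
```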
Have you tried using ChatGPT-3 to solve it? One of the other requirements is Triton, and a Russian developer managed to build a working 2.0 version for Windows a couple of days ago; ChatGPT could likely find the other holes keeping it from building properly.
Well, if anyone feels like tinkering around with this, here's a .whl that installs DeepSpeed version 0.8.0 on Windows.
It'll throw c10d errors looking for NCCL (which is Linux-only) when turned on, but this is an issue with either accelerate or my computer, because I get the same error when trying to turn on any sort of distributed training at all on Windows. I don't know if I possess the coding knowledge to fix it, so I leave it up to y'all.
Oh, and it'll error out during accelerate config after you say no to the DeepSpeed JSON file question, but I got around this by replacing the accelerate config file on Windows with a config file I made in WSL (roughly as in the sketch below).
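For anyone trying the same workaround, a rough sketch of copying a WSL-generated config over the Windows one. The paths assume accelerate's default cache location and a distro named Ubuntu, and "your-wsl-user" is a placeholder:

```bat
REM Sketch: overwrite the Windows accelerate config with one generated inside WSL.
REM Both paths assume default locations and may differ on your machine.
copy \\wsl$\Ubuntu\home\your-wsl-user\.cache\huggingface\accelerate\default_config.yaml ^
     "%USERPROFILE%\.cache\huggingface\accelerate\default_config.yaml"
```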
I must point out that those wheel links redirect to ...
Wait, so DeepSpeed is a Microsoft project, and it can't be used on Windows?
Not without compiling it yourself, sacrificing three chickens to the dark lord Cthulhu, and playing "Hit me baby one more time" in reverse.
Oh no 😐 I was playing the wrong song.
So, on Windows 10, when I do:
When I set DS_BUILD_AIO=0, I get a bunch of "lscpu command is not available" messages. I suppose for now it's not getting any better with DS_BUILD_SPARSE_ATTN=0 either:
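A sketch of the combination being described, for anyone trying to reproduce it; the flag names come from this thread, everything else is assumption:

```bat
REM Sketch: disable the async I/O and sparse-attention op builds (the two flags
REM named above) in the same cmd session before attempting the install.
set DS_BUILD_AIO=0
set DS_BUILD_SPARSE_ATTN=0
pip install deepspeed
```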
Same problem; there seems to be no way to solve it, but it works fine on Linux...
What if we all custom-build a branch supporting Windows? I'm honestly tired of so, so many things not being supported on Windows, not allowing me to work with certain packages. Unless we all keep bugging Microsoft about it, they won't really support it on Windows; not sure why, though. I can only assume it's something about backwards compatibility and trying to make it work on Win 95.
(Note: these steps are for the inference-only mode)
To install the generated .whl, just use:
Extra notes: about the replacement of the pt_binding.cpp file, all I did was change lines 531, 532, 539, and 540. New lines 539 and 540:
For anyone who just wants the final .whl to install using Python, here it is (no prayers needed):
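Since the exact install command was lost in formatting, here is a sketch of installing a locally built wheel; the filename below is a placeholder for whatever actually lands in the dist folder:

```bat
REM Sketch: install the wheel produced by the build. The filename is illustrative;
REM substitute the .whl that actually appears under .\dist after a successful build.
pip install dist\deepspeed-X.Y.Z-cp310-cp310-win_amd64.whl
```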
The wheels worked for me with PyTorch 1.13.1, CUDA 11.7, and Python 3.10.9. Thank you. Although, when running a command like
Windows tries to open deepspeed with an application and asks what app it should use to open it. But when importing and running DeepSpeed code in Python, it works.
Thank you for the method you provided, but it doesn't work for me with v0.9.2 (Win10 + Python 3.10 + VS2019). Could you please provide a solution or a .whl file for v0.9.2?
Does DeepSpeed training work with WSL2? I've been going around in circles and have heard 3 different things. I ran into my own errors while installing it on WSL2, but I don't know whether I should expect success with a few more hours of work or whether it's a hopeless cause. I'm also fine using a Docker container if that's what it takes; I just can't find a straightforward answer on whether training with DeepSpeed is reasonably expected to work on WSL2 at all.
Yeah, having the same problem. I thought that giving up and switching to WSL might solve it, but when running, it just fails with: "FAILED: custom_cuda_kernel.cuda.o".
DeepSpeed v0.11.1: patch release cloned from https://github.com/microsoft/DeepSpeed on 10-28-2023. Compiled for Windows, Torch 2.1.0 and CUDA 12.1. Shipped as a .rar because the .whl was slightly too big for github.com. Includes 4 fixes described here microsoft/DeepSpeed#2427 (comment) and 4 fixes in other places shown below. I know nothing about C++; I just asked ChatGPT to fix the errors.

diff --git a/build_win.bat b/build_win.bat
index ec8c8a36..f21d79cc 100644
--- a/build_win.bat
+++ b/build_win.bat
@@ -1,5 +1,10 @@
 @echo off
 
+REM begin-KAS
+set DS_BUILD_EVOFORMER_ATTN=0
+set DISTUTILS_USE_SDK=1
+REM end-KAS
+
 set DS_BUILD_AIO=0
 set DS_BUILD_SPARSE_ATTN=0
 
diff --git a/csrc/quantization/pt_binding.cpp b/csrc/quantization/pt_binding.cpp
index a4210897..12777603 100644
--- a/csrc/quantization/pt_binding.cpp
+++ b/csrc/quantization/pt_binding.cpp
@@ -241,11 +241,12 @@ std::vector<at::Tensor> quantized_reduction(at::Tensor& input_vals,
                                   .device(at::kCUDA)
                                   .requires_grad(false);
 
-    std::vector<long int> sz(input_vals.sizes().begin(), input_vals.sizes().end());
-    sz[sz.size() - 1] = sz.back() / devices_per_node;  // num of GPU per nodes
-    const int elems_per_in_tensor = at::numel(input_vals) / devices_per_node;
+    std::vector<int64_t> sz_vector(input_vals.sizes().begin(), input_vals.sizes().end());
+    sz_vector[sz_vector.size() - 1] = sz_vector.back() / devices_per_node;  // num of GPU per nodes
+    at::IntArrayRef sz(sz_vector);
     auto output = torch::empty(sz, output_options);
 
+    const int elems_per_in_tensor = at::numel(input_vals) / devices_per_node;
     const int elems_per_in_group = elems_per_in_tensor / (in_groups / devices_per_node);
     const int elems_per_out_group = elems_per_in_tensor / out_groups;
 
diff --git a/csrc/transformer/inference/csrc/pt_binding.cpp b/csrc/transformer/inference/csrc/pt_binding.cpp
index b7277d1e..a26eaa40 100644
--- a/csrc/transformer/inference/csrc/pt_binding.cpp
+++ b/csrc/transformer/inference/csrc/pt_binding.cpp
@@ -538,8 +538,8 @@ std::vector<at::Tensor> ds_softmax_context(at::Tensor& query_key_value,
     if (layer_id == num_layers - 1) InferenceContext::Instance().advance_tokens();
     auto prev_key = torch::from_blob(workspace + offset,
                                      {bsz, heads, all_tokens, k},
-                                     {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
-                                      k * InferenceContext::Instance().GetMaxTokenLength(),
+                                     {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
+                                      static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),
                                       k,
                                       1},
                                      options);
@@ -547,8 +547,8 @@ std::vector<at::Tensor> ds_softmax_context(at::Tensor& query_key_value,
     auto prev_value =
         torch::from_blob(workspace + offset + value_offset,
                          {bsz, heads, all_tokens, k},
-                         {hidden_dim * InferenceContext::Instance().GetMaxTokenLength(),
-                          k * InferenceContext::Instance().GetMaxTokenLength(),
+                         {static_cast<unsigned>(hidden_dim * InferenceContext::Instance().GetMaxTokenLength()),
+                          static_cast<unsigned>(k * InferenceContext::Instance().GetMaxTokenLength()),
                           k,
                           1},
                          options);
@@ -1578,7 +1578,7 @@ std::vector<at::Tensor> ds_rms_mlp_gemm(at::Tensor& input,
     auto output = at::from_blob(output_ptr, input.sizes(), options);
     auto inp_norm = at::from_blob(inp_norm_ptr, input.sizes(), options);
     auto intermediate_gemm =
-        at::from_blob(intermediate_ptr, {input.size(0), input.size(1), mlp_1_out_neurons}, options);
+        at::from_blob(intermediate_ptr, {input.size(0), input.size(1), static_cast<int64_t>(mlp_1_out_neurons)}, options);
 
     auto act_func_type = static_cast<ActivationFuncType>(activation_type);
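A sketch of how one might apply the above rather than downloading the archive, assuming the diff is saved as windows-fixes.patch (an illustrative name) inside a DeepSpeed checkout:

```bat
REM Sketch: apply the patch above, then run the repo's Windows build script.
git apply windows-fixes.patch
build_win.bat
```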
I compiled DeepSpeed v0.11.1 for Windows, CUDA 12.1 [Python 3.10 + Torch 2.1.0+cu121]: pip install deepspeed-0.11.2+244040c1-cp310-cp310-win_amd64.whl. I had to use these settings:
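The list of settings was stripped from the comment above; a best-guess sketch based on the flags that appear elsewhere in this thread (treat anything beyond those four variables as an assumption):

```bat
REM Best guess at "these settings", reusing the flags from the patch earlier in
REM this thread; the original list was lost in formatting, so this is not a quote.
set DS_BUILD_EVOFORMER_ATTN=0
set DS_BUILD_AIO=0
set DS_BUILD_SPARSE_ATTN=0
set DISTUTILS_USE_SDK=1
```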
This shit is still open!!!??? Bye bye, Microsoft.
Why the heck is this still open?
NOTE: Training will not work on Windows AT ALL, not even with WSL/WSL2, and not by running Linux in a Virtual Machine. Since my last post I learned some relevant things:
Fix #2427
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Nice!
While "I get it"...I really don't get why this still doesn't even have BASIC Windows support.
It is published by Microsoft, right?
Compiling from source on Windoze doesn't actually seem to generate a .whl file that could be redistributed or anything (sketch of the expected step below).
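For context, a sketch of the build step being described and the output one would expect on a working setup:

```bat
REM Sketch: the standard wheel-building invocation; on a working setup this
REM should leave a redistributable .whl under .\dist, which is what fails to appear here.
python setup.py bdist_wheel
dir dist
```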
Pulling from pip throws any number of errors, from Adam not being supported because it requires 'lscpu', to just failing because libaio.so can't be found.
Meaning that, for the past several years, this M$-produced piece of software has been mostly useless on the OS they create.
This is one of the most annoying things about Python in general. "It's soooo cross-platform". Until you need a specific library, and realize it was really only ever developed for Linux users until someone threw a slug in the readme about how it MIGHT work with windows, but only if you do a hundred backflips while wearing a blue robe and sacrifice a chicken to Cthulhu.
Python does still support releasing different packages for different operating systems, right?
If that's still true, then it would be fantastic if someone out there could release a proper .whl to PyPI for us second-class Windoze users. I really don't feel like spending the next several hours trying to upgrade my instance of WSL2 to the right version that won't lose its mind if I try to use a specific amount of RAM...