-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPEX 2.1.10+xpu regression with Stable Diffusion workloads that were working with IPEX 2.0.120+xpu on WSL2/Linux #483
Comments
@simonlui I just picked up your issue and will work on reproducing it first |
Let me know if you need help with the Docker images setup but it should be self explanatory to set up. If you use the base Dockerfile, you will get the currently working version using IPEX 2.0.120+xpu and using the Dockerfile.latest file should get you a version using IPEX 2.1.10+xpu which doesn't work. |
I'm running into an issue trying to build the docker image. I've set my proxy settings but it's complaining about the public key being not available. See the attached log. |
Sorry for the late reply. The issue seems to be with the GPG key you have retrieved which is from Intel itself, and I believe the command that fetches and installs it didn't run correctly. Can you clean out your cache from building the image and try rebuilding it again to see what you get? I checked and can access and use the GPG key without issue here when I rebuilt the image from scratch just a while ago in the United States where I live. That to me means that it's not an outage from fetching the GPG key which was an issue a month ago. I'm not sure if your geographic region, HTTPS proxy settings, or corporate firewall/policy has anything to do with not being able to access it but I do find it weird you aren't able to access your company's own domain here for something like this. I am pulling the key using this line in the Dockerfile which I derived from Intel's own documentation for oneAPI installation for APT Linux system in Step 3 here. Here's the line I used which I believe you are running into an issue with.
Besides a few added arguments, this should be verbatim the commands from that page chained together. Maybe try without proxy? Just for reference, I just use the following when I build my image here locally: Unrelated to the above, but there are a few other pieces of information I want to mention because I forgot to include in my initial bug report which I believe you will find helpful with solving this bug. I built a container using the following arguments on my Linux system to generate the bug report's output.
Obviously, adapt the above to your system but I believe you should be able to replicate it using the same arguments I did. If you want to experiment with the image and installation with a container, you need to override the entrypoint startup.sh script I set. You can do this by adding in
But you should be able to quit out into the container bash shell using Crtl-C and do whatever instead of the normally intended usage of starting and stopping the container. |
The only system I have available with an Arc A770 GPU to work with at this time is a system based outside of the US. Because I'm in the US, and I need VPN to access it, there are proxy settings to be configured. I resolved that, but I might have hit a firewall issue. I'm currently working with my team to resolve this before I can proceed with building the docker image. This is not an issue on US-based systems. And thank you for the additional notes. Your sample docker run command will help me run mine with changes to the arguments. |
Any updates with this issue? |
The fact that it is taking weeks to locate an Intel GPU at Intel is really not inspiring much confidence and AMD is coming out with a cheap 16GB VRAM card this month. My A770 is still in the return window until the end of the month and I am seriously considering taking advantage of it. I got this card to do SD and right now it is a serious struggle. Is there a reason to believe its going to get better? |
You can decide whatever you want to do but there are paths forward. This is only an issue with using the latest code and ComfyUI. Using the previous version of IPEX, IPEX 2.0.120+xpu, SD works correctly with ComfyUI and quite a few extensions too. It should work with no issue with other frontends like stable-diffusion-webui and SD.Next which patched workarounds for IPEX using the latest version if you must have it. |
Yes, I know it's possible to get it working currently after a fashion. I am actually maintaining my own set of patches based on SD.next's whole CUDA emulation layer: comfyanonymous/ComfyUI#476 (comment) - so it is possible to hack it into a working state (with no help from Intel as far as I know) but that is obviously not an idea scenario. I think we have different definitions of "path forward". You are saying the path forward is to stay in the past and just run an old version. Also, the community hacked together something that works around the issues this time but what about the next? I really don't think I am unique in wanting to know that my hardware is going to continue to see support, that problems will be fixed/responded to in a relatively timely way. Running old versions and losing access to performance/security/QoL improvements is not a solution. |
I would not have filed this issue the very day IPEX 2.1.10+xpu released if I didn't have a vested interest in getting it fixed properly. However, with things being at the "fixing" phase of things, "moving forward" is meant to mean being able to still run Stable Diffusion in this scenario although with less than ideal setup, workarounds and hacks to make it work and I wasn't aware if you were stuck at that part so it's great you know how to get it working. But I do agree with your definition too that the status quo is not acceptable going forward, and should be fixed before the next release of IPEX.
Given the way Intel's GPUs have launched thus far as a new player in the field and knowing where they are prioritizing things, I can't say I am surprised at this speed which this is getting addressed and I will say it is frustrating for me too. But the facts also mean that users are going to have to put up with hassle from various fronts like this and people who have reviewed these cards have said as much and various other issues have popped up about these GPUs. Again, you can decide what you want to do based on that and express displeasure at it but I don't believe it is helpful to solving the issue. I actually do want to put in another issue for enhancement talking about the root of it which is the fact most Stable Diffusion front-ends need to implement a workaround layer to implement operations that should be natively supported in IPEX but are not. |
Hi @simonlui thanks for your patience. The team has been supporting issues that came in over the last month and I just came back from a business trip. Now that I'm back, I will be prioritizing your issue and get back to you with an update soon. The first step of any debug is to reproduce the issue, which includes gathering the exact hardware and software resources. We don't have access to your setup, which means there is additional overhead needed before the issue is reproduced. If you have an easier way to reproduce the issue, that would be most helpful and can speed up the time to resolve it. |
The main issue seems to be your access to hardware which I don't know if you have resolved your GPG key access issues yet. Technically, you should be able to reproduce the issue as long as it's the same Arc Alchemist GPU family since the sample workflow in ComfyUI that would reproduce this error is a simple SD 1.5 workflow which should run on about everything as long as you have an Arc family graphics card and IPEX so even an A310 would work here. If you have to be even more strict there, you could possibly use a GPU using the same chip of which I think includes the other A770 8GB, the A750 and the A580. I think the Flex 170 is the only card Intel produces though that reasonably gets close but not exactly to the A770 16 GB since it uses the same chip and memory configuration but not firmware so I guess you can try it with that if you have access to that. |
@simonlui - I took over this ticket recently and have been able to reproduce this issue. q,k,v all have dtype of torch.float16. Just for kicks I made an addition in /deps_latest/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py
After making the above change, the model does run, but I do see numerous counts of data_type mismatches further down the lane. In the mean time, I will talk to IPEX engineering team and bring to their attention. |
This makes sense as the backtrace does complain about this but the strange thing is that ComfyUI does work with the previous version of IPEX somehow if you run my non-latest docker image that does the setup for that. This either seems like a restriction that was put in or regression from a change in IPEX. This may belong in a different issue for this and please tell me if you think so but currently, other IPEX implementations have workarounds that works around functions IPEX does not implement correctly so it works with everything needed to run Stable Diffusion. Those workarounds shouldn't really exist if IPEX was doing everything correctly. ComfyUI's implementation worked with IPEX without those workarounds last version because some things were not being run GPU like the Text Encoder. If you run ComfyUI with
It depends on the model used but the default Stable Diffusion checkpoint ComfyUI tells you to use and downloads for the default workflow is FP16 by default from RunwayML's release. And ComfyUI is not my application, this is an open source frontend for Stable Diffusion at https://github.com/comfyanonymous/ComfyUI. The code you are referring to is at https://github.com/comfyanonymous/ComfyUI/blob/d76a04b6ea61306349861a7c4657567507385947/comfy/ldm/modules/attention.py#L382 so I hope that helps with you tracking down this issue more. |
Thanks @simonlui . I do understand that newer version does not work, but both PyTorch and IPEX both are upgraded. Some comments here - b. Yes the function "torch.nn.functional.scaled_dot_product_attention" is in question. Please refer to similar issue tagged here - pytorch/pytorch#110213 . Perhaps related? and pointing towards a PyTorch difference rather than IPEX. c. I tested fp16 models of CompVis/stablediffusionv1.4 and https://huggingface.co/runwayml/stable-diffusion-v1-5 on the latest PyTorch and IPEX released and they both worked fine. |
This is an autocast issue rather than sdpa issue. sdpa is working as intended, autocast should've catched it before it hit sdpa. There are more autocast issues i have worked around here: |
Sorry, I didn't have time until now to check and reply to this until now. a.) The docker arrangement works with a Nvidia GPU on another machine I had access to, but was using WSL2 through Windows. |
Diffusers also catches it with this line: |
Has there been any updates with this issue? It seems like per the prior release cycles for IPEX on XPU, a new version will be coming soon, so I would like to know if this has been addressed or if we will need to keep working around the issue. Thanks. |
@simonlui - This task is on priority list on our engineering team side and they are actively working on it. However, it will likely not get fixed in the upcoming release. So the workarounds on the application side will be needed for some more time. |
@simonlui - I received word from the engineering team that a public commit addressing the issue has been merged. I have given a test on our internal branch and the issue seems to go away. Can you please build the source on your side and verify the functional correctness on your side. I also tried batch sizes > 1 till 6 and they all run fine. |
@intel-ravig I can confirm it does indeed work after I compiled xpu-main in this repository's Docker.compile image process, although I had difficulty proceeding past a certain point because of deepspeed issues I think your development team still have yet to resolve. But since it still split out the main packages and I don't think that matters too much in this scenario as that is a separate acceleration unit from the GPUs themselves, I forced installed them with pip and tried out ComfyUI and it seems to work fine without issue. I even tried out offloading everything to the GPU itself and I have had issues with the past with the CLIP encoder crashing also and that was also solved by this SDP fix. Thank you and thanks to the engineering team for getting this in. This will help us all in lowering our maintenance burden hacking fixes to work around this issue. Edit: I should note this is with the base SDXL models being loaded with torch.xpu.optimize too so that also works well here. |
@simonlui - Okay great! Thanks for confirming. If you are satisfied with the fix, please close this issue. |
Sorry for the delay, I am closing the issue as it is fixed. The deepspeed issue I am pretty sure you guys will rectify at some point since it does seem the Docker image is updated often. |
Describe the bug
Running ComfyUI, a Stable Diffusion frontend, with my Docker image here, I am getting issues with the simplest default workflow which uses Stable Diffusion 1.5. I am getting the following even after disabling ipex.optimize().
This and other complex processing workloads for most part had no issues with IPEX 2.0.120+xpu. I know also this isn't a Docker issue either from the image or Docker itself because doing the installation sanity check yields the following.
There are similar but probably different problems with other Stable Diffusion front ends, where I also tested out SD.Next and the dev branch of stable-diffusion-webui to sanity check if it was just a ComfyUI issue and although not in Docker, I got similar issues with similar simple workloads and setup with the latest IPEX on both frontends. SD.Next just says loading IPEX fails and doesn't do anything while stable-diffusion-webui does have a backtrace but IPEX support is in beta and it has to do with Generators not existing which is a separate issue from this one which I can file if someone requests it.
Versions
Ran inside Ubuntu 22.04 Docker container with aforementioned image with Fedora 39 host
PyTorch version: 2.1.0a0+cxx11.abi
PyTorch CXX11 ABI: Yes
IPEX version: 2.1.10+xpu
IPEX commit: a12f9f6
Build type: Release
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: N/A
Clang version: N/A
IGC version: N/A
CMake version: N/A
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.6.6-200.fc39.x86_64-x86_64-with-glibc2.35
Is XPU available: True
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration:
[0] _DeviceProperties(name='Intel(R) Arc(TM) A770 Graphics', platform_name='Intel(R) Level-Zero', dev_type='gpu, support_fp64=1, total_memory=16288MB, max_compute_units=512, gpu_eu_count=512)
Intel OpenCL ICD version: 23.17.26241.33-647~22.04
Level Zero version: 1.3.26241.33-647~22.04
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5950X 16-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 5084.0000
CPU min MHz: 550.0000
BogoMIPS: 6800.05
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization: AMD-V
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 8 MiB (16 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.1.10+xpu
[pip3] numpy==1.26.2
[pip3] torch==2.1.0a0+cxx11.abi
[pip3] torchaudio==2.1.0a0+cxx11.abi
[pip3] torchvision==0.16.0a0+cxx11.abi
[conda] N/A
The text was updated successfully, but these errors were encountered: