Model does not run correctly on CUDA Capability 7.x GPUs #59

Open
joshabramson opened this issue Nov 15, 2024 · 36 comments
Labels
bug Something isn't working

Comments

@joshabramson
Collaborator

A note from us at Google DeepMind:

We have now tested accuracy on V100 and there are serious issues with the output (it looks like random noise). Users have reported similar issues with RTX 2060S and Quadro RTX 4000.

For now the only supported and tested devices are A100 and H100.

@aozalevsky

if you're interested in stats, works fine on

NVIDIA GeForce RTX 3080, Driver Version: 535.183.01

@smg3d

smg3d commented Nov 16, 2024

Could you be specific about what was not working on your side on the V100? How can we recognize that there is a problem? Is a nonsense structure THE indicator of numerical inaccuracies?

The mentioned posts with nonsense structures on Quadro RTX 4000 and RTX 2060S were not done in your Docker environment...

On my side, predictions look perfect on an old Quadro P3000 6GB for several proteins and ligand complexes (i.e. on a 6-year-old ThinkPad laptop with a mobile GPU with compute capability < 8.0). It also works great on an RTX 3090.

Other than nonsense structures, what other observations could indicate that we have numerical inaccuracy? Is there a controlled test we could do to identify potential numerical inaccuracy in our setup?

@joshabramson
Collaborator Author

joshabramson commented Nov 16, 2024

The nonsense structure is the indicator of the problem here: the output will look almost random. The problem appears related to bfloat16, which is not supported on older GPUs. We will continue to investigate next week.
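
For readers unfamiliar with the format: bfloat16 keeps float32's 8 exponent bits but only 7 explicit mantissa bits. The sketch below is plain Python, not AlphaFold 3 code. It emulates the conversion by zeroing the low 16 bits of a float32 (real hardware typically rounds to nearest rather than truncating), just to show how much precision is lost. `to_bfloat16_truncated` is a hypothetical helper name.

```python
import struct


def to_bfloat16_truncated(x: float) -> float:
    """Emulate float32 -> bfloat16 by zeroing the low 16 bits (truncation)."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # bfloat16 keeps the sign bit, 8 exponent bits and the top 7 mantissa bits.
    bits &= 0xFFFF0000
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y


for value in (1.0, 3.14159, 257.0):
    approx = to_bfloat16_truncated(value)
    print(f"{value!r:>10} -> {approx!r} (abs error {abs(value - approx):.4g})")
```

This only illustrates the number format; it says nothing about where in the model the bfloat16 handling goes wrong on a given GPU.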

Interesting to know that it does work on some older GPUs, thanks for the report. Even if the major issue under investigation here isn't present, please note we have not done any large-scale numerical verification of outputs on devices other than A100/H100.

@smg3d

smg3d commented Nov 16, 2024

Thank you for the clarification @joshabramson. I will watch for "exploded" structures, and report the specifics if it ever happens on one of my GPUs.

The P3000 definitely does not natively support BF16 (CUDA capability 6.1). I guess it emulates it via float32 compute.

Since it is quite probable that several people will try to run AF3 on their available hardware, here are some details of my setup, where it has worked perfectly so far.

Number of tokens (12 runs so far on that GPU) : 167-334 tokens, so largest bucket size tested was 512.

Largest test:
334 tokens (Running model inference for seed 1 took 618.33 seconds.) (yup... slow, but it works... while driving two 4K monitors)

Typical inference speed for < 256 tokens : 150-190 seconds per seed (so typically less than 3 minutes for < 256 tokens)

GPU : Quadro P3000, Pascal architecture, Compute Capability = 6.1 (ThinkPad P71 laptop)

Docker : default setup, NOT using unified memory

run_alphafold.py option : --flash_attention_implementation=xla

nvidia-smi

Sat Nov 16 20:53:47 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P3000                   Off |   00000000:01:00.0  On |                  N/A |
| N/A   59C    P0             52W /   75W |    5593MiB /   6144MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1672      G   /usr/lib/Xorg                                 519MiB |
|    0   N/A  N/A      1963      G   cinnamon                                      195MiB |
|    0   N/A  N/A      2581      G   ...seed-version=20241115-050104.422000         90MiB |
|    0   N/A  N/A     52689      C   python                                       4750MiB |
+-----------------------------------------------------------------------------------------+

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0

deviceQuery

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro P3000"
  CUDA Driver Version / Runtime Version          12.6 / 12.6
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6032 MBytes (6325141504 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1215 MHz (1.22 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.6, CUDA Runtime Version = 12.6, NumDevs = 1, Device0 = Quadro P3000
Result = PASS

neofetch

                   -`                    ***@neo
                  .o+`                   -------- 
                 `ooo/                   OS: Arch Linux x86_64 
                `+oooo:                  Host: 20HKCTO1WW ThinkPad P71 
               `+oooooo:                 Kernel: 6.6.60-1-lts 
               -+oooooo+:                Uptime: 3 hours, 39 mins 
             `/:-:++oooo+:               Packages: 1900 (pacman), 9 (flatpak) 
            `/++++/+++++++:              Shell: bash 5.2.37 
           `/++++++++++++++:             Resolution: 3840x2160, 3840x2160 
          `/+++ooooooooooooo/`           DE: Cinnamon 6.2.9 
         ./ooosssso++osssssso+`          WM: Mutter (Muffin) 
        .oossssso-````/ossssss+`         WM Theme: cinnamon (Adapta) 
       -osssssso.      :ssssssso.        Theme: Adwaita-dark [GTK2/3] 
      :osssssss/        osssso+++.       Icons: Faenza [GTK2/3] 
     /ossssssss/        +ssssooo/-       Terminal: terminator 
   `/ossssso+/:-        -:/+osssso+-     CPU: Intel i7-7820HQ (8) @ 2.900GHz 
  `+sso+:-`                 `.-/+oso:    GPU: NVIDIA Quadro P3000 Mobile 
 `++:.                           `-/+/   Memory: 8323MiB / 64093MiB 
 .`                                 `/

@Augustin-Zidek Augustin-Zidek pinned this issue Nov 18, 2024
@Augustin-Zidek Augustin-Zidek added the bug Something isn't working label Nov 18, 2024
@jurgjn

jurgjn commented Nov 18, 2024

We ran the "2PV7" example from the docs on all GPU models available on our cluster with the following results:

gpu              ranking_score  driver_ver  cuda
rtx_2080_ti      -99.68         535.183.06  12.2
rtx_3090           0.67         550.127.05  12.4
rtx_4090           0.67         550.127.05  12.4
titan_rtx        -99.78         550.127.05  12.4
quadro_rtx_6000  -99.74         550.90.07   12.4
v100             -99.78         550.127.05  12.4
a100_pcie_40gb     0.67         550.127.05  12.4
a100_80gb          0.67         550.127.05  12.4

Specifically, a ranking score of -99 corresponds to noise/explosion, and a ranking score of 0.67 corresponds to a visually compelling output structure.
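
That rule of thumb can be turned into a quick triage script. This is a minimal sketch using the numbers from the table above. The cutoff of -50 is an arbitrary assumption chosen to separate the two clusters seen here (about -99 versus 0.67), not a value taken from the AlphaFold 3 code, and `looks_exploded` is a hypothetical helper.

```python
# Ranking scores for the 2PV7 example, copied from the table above.
RANKING_SCORES = {
    "rtx_2080_ti": -99.68,
    "rtx_3090": 0.67,
    "rtx_4090": 0.67,
    "titan_rtx": -99.78,
    "quadro_rtx_6000": -99.74,
    "v100": -99.78,
    "a100_pcie_40gb": 0.67,
    "a100_80gb": 0.67,
}

# Assumed cutoff: anything far below zero is treated as an exploded structure.
EXPLODED_CUTOFF = -50.0


def looks_exploded(ranking_score: float, cutoff: float = EXPLODED_CUTOFF) -> bool:
    """Flag a run whose ranking score falls in the 'noise/explosion' cluster."""
    return ranking_score < cutoff


bad_gpus = sorted(gpu for gpu, score in RANKING_SCORES.items() if looks_exploded(score))
print("Exploded on:", ", ".join(bad_gpus))
```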

Update (20.11): added driver/cuda versions reported by nvidia-smi.

@Augustin-Zidek
Collaborator

Augustin-Zidek commented Nov 18, 2024

Thanks @jurgjn, this is incredibly useful information!

These are the GPU capabilities (see https://developer.nvidia.com/cuda-gpus) for the GPUs mentioned:

rtx_2080_ti      7.5  (bad)
rtx_3090         8.6
rtx_4090         8.9
titan_rtx        7.5  (bad)
quadro_rtx_6000  7.5  (bad)
v100             7.0  (bad)
a100_pcie_40gb   8.0
a100_80gb        8.0

Looks like anything with GPU capability < 8.0 produces bad results.

@Augustin-Zidek
Collaborator

Quick update: I've pushed a08cffd that makes AlphaFold 3 fail if run on a GPU with capability < 8.0 to prevent people getting bad results by surprise.

In case you want to test with old GPUs, just remove the check added in a08cffd.

@lucajovine

Just to add one more piece of info, I am using a RTX A6000 (capability 8.6) and so far all looks well.

@gwirn

gwirn commented Nov 19, 2024

RTX A5000 (capability 8.6) works well too

@Augustin-Zidek
Collaborator

Could more people test with capability 6.x?

Based on the result above from @smg3d, it looks like maybe only capability 7.x is broken, while 6.x (and >= 8.0) might be fine.

I.e. current theory:

6.0 <= Capability < 7.0  Maybe OK? We need more data.
7.0 <= Capability < 8.0  BAD
8.0 <= Capability        OK
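
The working theory can be written down as a small lookup. This is a sketch of the hypothesis as stated in this thread, not the check added in a08cffd. `classify_capability` is a hypothetical name, and capabilities below 6.0 are reported as untested because nobody here has tried them.

```python
def classify_capability(capability: str) -> str:
    """Map a CUDA compute capability such as "7.5" to the thread's current theory."""
    major, minor = (int(part) for part in capability.split("."))
    version = (major, minor)  # compare as integer tuples, not floats
    if version < (6, 0):
        return "untested"
    if version < (7, 0):
        return "maybe ok"
    if version < (8, 0):
        return "bad"
    return "ok"


# Capabilities reported earlier in this thread.
for gpu, cc in [("Quadro P3000", "6.1"), ("V100", "7.0"), ("RTX 2080 Ti", "7.5"),
                ("A100", "8.0"), ("RTX 4090", "8.9")]:
    print(f"{gpu:<12} {cc}  {classify_capability(cc)}")
```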

@smg3d

smg3d commented Nov 19, 2024

I wonder if it could be a driver effect? I noticed several people mention that they are using older drivers. It might be useful to know which driver and CUDA version @jurgjn was using on his system.

I was using driver 560.35.03 and CUDA V12.6.77 (actually, I just upgraded to driver 565 today).

@georgkempf

I could now try AF3 on a Quadro P4000 (Pascal) and like @smg3d reported for P3000, on this GPU it works. This test was done with the same driver and cuda versions (565.57.01, cuda_12.5.r12.5) as the tests on RTX 2060S (Turing) and Quadro RTX 4000 (Turing).

@zhanglzu

zhanglzu commented Nov 22, 2024

V100 also produces "exploded" structures

NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6

@recorderlegend

Quadro RTX 8000 also got exploding structures

Driver Version 555.42.06, CUDA Version 12.5

@smg3d

smg3d commented Nov 24, 2024

Could more people test with capability 6.x?

Based on the result above from @smg3d, it looks like maybe only capability 7.x is broken, while 6.x (and >= 8.0) might be fine.

I.e. current theory:

6.0 <= Capability < 7.0  Maybe OK? We need more data.
7.0 <= Capability < 8.0  BAD
8.0 <= Capability        OK

I can confirm that it runs well on P100 (capability 6.0).

So far it has been confirmed that it runs well on the following 6.x Capability:

  • P3000
  • P4000
  • P100

And so far there have been no reports of "exploded structures" on 6.x Capability.

smg3d added a commit to smg3d/alphafold3 that referenced this issue Nov 24, 2024
- Add explicit check for compute capability < 6.0
- Keep check for range [7.0, 8.0)
- Update error message to clarify working versions (6.x and 8.x)
- Addresses issue google-deepmind#59
@smg3d

smg3d commented Nov 24, 2024

I think it would be good for users to be able to use AlphaFold3 on Pascal GPUs (without requiring them to modify code). The data on this issue strongly suggest that the "exploded structures" problem does not affect Pascal GPUs (compute capability 6.x).

Moreover, there are still several clusters with P100s, and these often have 0 or very short wait time (compared to the A100s). For example, on one of the Canadian national clusters, AF3 jobs on P100 currently start immediately, whereas jobs on the A100 (on the same cluster) often have 10-30 minutes wait time in the queue. So for a single inference job on small-medium size protein complexes, we get our predictions back much faster with the P100, despite the inference being ~5x slower (358 sec vs 73 sec on the tested dimer).
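
The trade-off is simple arithmetic: turnaround is queue wait plus inference time. A quick sketch with the numbers quoted above (the 10 and 30 minute waits are the ends of the reported range, and actual queue times will vary):

```python
P100_INFERENCE_S = 358  # measured on the tested dimer
A100_INFERENCE_S = 73   # same dimer on A100


def turnaround_s(queue_wait_min: float, inference_s: float) -> float:
    """Total time from job submission to finished prediction, in seconds."""
    return queue_wait_min * 60 + inference_s


slowdown = P100_INFERENCE_S / A100_INFERENCE_S
print(f"P100 is {slowdown:.1f}x slower at inference")

p100_total = turnaround_s(0, P100_INFERENCE_S)  # P100 jobs start immediately
for wait_min in (10, 30):
    a100_total = turnaround_s(wait_min, A100_INFERENCE_S)
    print(f"A100 with {wait_min} min wait: {a100_total:.0f} s vs P100: {p100_total:.0f} s")

# Queue wait (in minutes) at which the two options break even.
break_even_min = (P100_INFERENCE_S - A100_INFERENCE_S) / 60
print(f"A100 only wins if its queue wait is under {break_even_min:.2f} min")
```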

I tested and submitted a small PR to allow Pascal GPUs to run without raising the error message.

@phbradley

I got a nice looking structure for the 2PV7 example on an NVIDIA GeForce GTX 1080 Ti, compute capability 6.1.

$ nvidia-smi
Mon Nov 25 05:58:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:5E:00.0 Off |                  N/A |
| 23%   28C    P8               8W / 250W |    140MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     63790      C   python                                      138MiB |
+---------------------------------------------------------------------------------------+

@phbradley

Following up on the previous comment, I ran some docking simulations on our old cluster, which is a mix of "RTX 2080 Ti" and "GTX 1080 Ti" nodes. All the ~20 jobs on the 1080s worked OK, all of the ~20 jobs on the 2080s gave exploded structures and ranking_scores of -99. Looks like the 2080s have compute capability 7.5 and the 1080s have compute capability 6.1, so this fits with the "7.0 <= CC < 8.0 is bad" theory.

@joshabramson
Collaborator Author

Thanks for all the reports and suggestions here. Update from our side:

We identified where the bfloat16 vs. float32 issue is for V100. After fixing that, structures are no longer exploded, but

  • accuracy drops for larger bucket sizes (some targets, e.g. PDB ID 7s4a, are consistently worse with V100 compared to A100)
  • compilation fails with a segfault before we get close to large enough buckets for device OOM

We are investigating these issues with the XLA team, but in the meantime we do not believe V100s are safe to use even without exploding structures.

We also tested P100s, which have capability less than 7, and there we can run without any changes (other than turning flash attention implementation to 'xla') up to 1024 tokens, and with no regression in accuracy compared to A100. However, given the issues we see on V100, we have reservations about removing any restrictions on gpu versions just yet. Users are free to remove the hard error from the code themselves.

@phbradley

Awesome, thank you for the update and for digging into this! Is it easy to say what the bfloat16 "partial fix" for V100s is? In case we wanted to try doing some testing on other 7<=CC<8 GPUs?

@joshabramson
Collaborator Author

The partial fix is to convert any bfloat16 params to float32 directly after loading them, and to set bfloat16 to 'none' in the global config. But as mentioned above, beware, this can result in structures that look reasonable enough, but turn out to be less accurate than our verified models.
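
For anyone experimenting on a 7.x GPU, the first half of that partial fix amounts to a walk over the parameter tree. The sketch below is not the maintainers' patch. It assumes the parameters are a nested dict whose leaves expose NumPy-style `.dtype` and `.astype` (as NumPy and JAX arrays do), and `cast_bfloat16_to_float32` is a hypothetical helper. The config change is not shown.

```python
from typing import Any


def cast_bfloat16_to_float32(params: Any) -> Any:
    """Return a copy of a nested dict of arrays with bfloat16 leaves cast to float32."""
    if isinstance(params, dict):
        return {name: cast_bfloat16_to_float32(value) for name, value in params.items()}
    # Leaf: only touch arrays whose dtype is bfloat16; leave everything else as is.
    if str(getattr(params, "dtype", "")) == "bfloat16":
        return params.astype("float32")
    return params
```

With JAX installed, `jax.tree_util.tree_map` performs the same kind of traversal over arbitrary pytrees, so the explicit recursion here is only to keep the sketch dependency-free.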

@gotero

This comment was marked as off-topic.

@jsspencer

This comment was marked as off-topic.

@ai-bits

ai-bits commented Nov 28, 2024

Originally I came here to make sure the code / model can run distributed over 2 or more GPUs, because my 2 RTX 4000 Ada "only" have 20GB each (combined, the 40GB of an A100).
If I understood correctly, simpler tasks can already be run on "ancient" (consumer) GPUs with only 6GB of VRAM, so I should be able to run AF at least on one GPU and see if it can be distributed later.

That leaves the question of whether the installation documentation could please be updated to be less intimidating, by mentioning lower-end hardware and not only $30,000+ machines, e.g. for students to get their feet wet.

Thanks
G.

@joshabramson joshabramson changed the title from "Model does not run correctly on non A100/H100 GPUs" to "Model does not run correctly on CUDA Capability 7.x GPUs" Nov 29, 2024
@Augustin-Zidek
Collaborator

Updates (and some good news):

  • Firstly, thanks everyone who tested on their GPUs and reported their results!
  • GPUs with compute capability 6.x won't raise an error after 2eb6555.
  • We've updated the documentation to better reflect the situation in e56abb7.
  • We are testing a fix proposed by the XLA team for the issue on V100. We will post an update here once we know more.

@samuelmf1

Does anyone know if inference works on the Apple M4 chip? (Or any Apple M series GPU, for that matter.)

@OrangeyO2

I appreciate all the helpful resources on this thread.
I was hoping I could get some help on the following questions (sorry if they are naive!)

  1. I do not have H100/A100 available for use. Inference speed aside, is A40 equally good in terms of prediction accuracy as the tested ones (H100/A100/P100)?
  2. Out of P100, A40, L40, is P100 still the most recommended one to use? (I have seen in the documentation that only H100/A100/P100 are officially tested but still wanted to ask since A40/L40 are newer products and may be faster...?)

Thank you!!

@Augustin-Zidek
Collaborator

Hi @OrangeyO2,

I do not have H100/A100 available for use. Inference speed aside, is A40 equally good in terms of prediction accuracy as the tested ones (H100/A100/P100)?

A40 has compute capability 8.6 and uses the Ampere architecture (A100 also uses Ampere) so it should be fine. That being said, we haven't done any large-scale tests on that particular GPU type.

Out of P100, A40, L40, is P100 still the most recommended one to use? (I have seen in the documentation that only H100/A100/P100 are officially tested but still wanted to ask since A40/L40 are newer products and may be faster...?)

P100 is compute capability 6.0 (Pascal), A40 is 8.6 (Ampere), L40 is 8.9 (Ada Lovelace). As such, I would recommend A40 or L40 as they will be significantly faster than the P100. They are likely to be ok, but I recommend you run some accuracy tests.

@YoavShamir5

Thanks for the info and for keeping us updated with the status!

Are there some generic accuracy tests one can run on different GPU types (that were not specified above) to make sure that this V100 issue is not taking place? Does the V100 issue basically lead to random-looking output no matter the input, or just in specific cases?

@OrangeyO2

@Augustin-Zidek
Thank you for your reply and suggestions! I really appreciate all the help provided here.
Seconding @YoavShamir5's comment, it would be great if you could help us determine whether a particular GPU is accurate enough for use (even if it is not a full large-scale test).
I wonder if the predicted structure(s) from the example sequence(s) (like 2PV7) run on A100/H100 can be shared so that we can compare them to the results from our favourite GPUs.
For my personal use case, I could compare A40 vs P100 results though.
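
A coarse first pass at such a comparison could look at ranking scores from two runs of the same input. The sketch below assumes each run produces a ranking-scores CSV with a `ranking_score` column (the `seed,sample,ranking_score` layout is an assumption here), and the 0.05 tolerance is an arbitrary placeholder. The two scores are the 2PV7 values reported earlier in this thread for A100 and V100; `best_ranking_score` and `scores_agree` are hypothetical helpers.

```python
import csv
import io


def best_ranking_score(csv_text: str) -> float:
    """Return the highest ranking_score found in a ranking-scores CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return max(float(row["ranking_score"]) for row in rows)


def scores_agree(reference_csv: str, candidate_csv: str, tolerance: float = 0.05) -> bool:
    """True if the best scores of two runs differ by at most `tolerance`."""
    difference = abs(best_ranking_score(reference_csv) - best_ranking_score(candidate_csv))
    return difference <= tolerance


# Assumed CSV layout; scores taken from the 2PV7 table earlier in the thread.
reference = "seed,sample,ranking_score\n1,0,0.67\n"   # A100
candidate = "seed,sample,ranking_score\n1,0,-99.78\n"  # V100
print("Scores agree:", scores_agree(reference, candidate))
```

Matching scores would only rule out the obvious explosion; they would not catch the subtler accuracy drop described above for V100 with the partial fix.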

@shuibizai

Has anyone succeeded on a Tesla T4 (capability 7.5)?

Driver Version: 550.54.15, CUDA Version: 12.4

--flash_attention_implementation=xla

Is there any way to keep it from predicting "exploded structures"?

@smg3d

smg3d commented Dec 8, 2024

Has anyone succeeded on a Tesla T4 (capability 7.5)?
Driver Version: 550.54.15, CUDA Version: 12.4
--flash_attention_implementation=xla
Is there any way to keep it from predicting "exploded structures"?

I did test it on our T4 cluster (with CUDA 12.6 and --flash_attention_implementation=xla), and it explodes. At the moment, I do not think there is a fix (although a "partial fix" is mentioned above).

@joshabramson
Collaborator Author

Please avoid the partial fix mentioned above if possible as it can give less accurate output than expected. We are working on a complete fix and will update on timelines very soon.

@YoavShamir5

Hi, thanks again for this great tool. Is there any news on how we users can make sure that our GPUs are OK accuracy-wise? Is the issue discussed here related just to random/obviously wrong predicted structures, or is this GPU accuracy issue more nuanced than that? I am looking for a benchmark to verify the validity of different GPU models.

@joshabramson
Collaborator Author

We are pretty sure CUDA capability 7 GPUs all face the same issue, and should not currently be used. CUDA capability 6 or >=8 are fine. As per comments above, there are some hacks that can move away from exploding structures for cc 7 gpus, but then numerical accuracy is not on par with what we expect.

Please await the full fix for cc 7 gpus, which is coming soon.

@YoavShamir5

Great, thanks @joshabramson
