
How fast should dorado simplex (modified bases) basecalling be on a PromethION data tower with 4 A100s? #252

Closed
rainwala opened this issue Jun 20, 2023 · 28 comments


rainwala commented Jun 20, 2023

I am basecalling pod5 directories on a PromethION data tower (with 4 A100s), using dorado version 0.3.0.

On a pod5 directory with 760GB of pod5s, I had changed the CUDA version to either 11.4 or 12 (I can't remember which now), which stopped one of the 4 A100s from being recognised, but I basecalled anyway, and this took 5 hours! I then did a factory reset on the PromethION data tower to make sure all 4 A100s were recognised again.

After the factory reset, this machine has the following software and driver versions: Ubuntu 20.04, NVIDIA driver 515.65.01, and CUDA 11.7. On this setup, with 4 A100s, dorado basecalling is taking 19 hours on a 780GB directory.

The specific command in both cases was:
dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ --modified-bases 5mCG_5hmCG -x "cuda:1,2,3,4" > mod_bases.bam

I had made sure that the CUDA device numbering matched the system device IDs, using this:
export CUDA_DEVICE_ORDER=PCI_BUS_ID

Could you please tell me what might explain the discrepancy, and whether there are any benchmarks for what speed to expect with 4 A100s?

iiSeymour self-assigned this on Jun 20, 2023
iiSeymour added the question label on Jun 20, 2023
@iiSeymour (Member)

Hey @rainwala, the driver version on the PromethION is recent enough, and the CUDA version doesn't need changing to run dorado.

You should expect hac@v4.1.0 with 5mCG_5hmCG to run at around 1.5e8 Samples/s which is ~55 Gbases an hour.
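
For a rough sense of how those two numbers relate (assuming roughly 10 raw samples per base, i.e. ~4 kHz sampling over a ~400 bases/s translocation speed, which is an approximation on my part):

# back-of-envelope samples/s -> Gbases/hour conversion, assuming ~10 samples per base
awk 'BEGIN { printf "%.1f Gbases/hour\n", 1.5e8 / 10 * 3600 / 1e9 }'
# prints 54.0, in line with the ~55 Gbases an hour above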

Two things I would check if you are seeing much lower performance than this:

  • Check nvidia-smi to confirm dorado is running on the 4x A100s and not the T1000.
  • Check nvidia-smi to confirm guppy_basecaller_server is not running. The basecall server is always active on the device via a systemd service and will be holding on to GPU memory, which will affect dorado performance. To temporarily stop the basecall server, run systemctl stop guppyd (a query that lists these compute processes is sketched below).
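
For reference, one quick way to list exactly which compute processes are holding GPU memory (standard nvidia-smi query options):

# list every compute process with its PID and GPU memory usage
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv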


rainwala commented Jun 20, 2023

Hi @iiSeymour, nvidia-smi confirms that dorado is running on all 4 A100s and not the T1000, but so is guppy_basecaller_server. The former is taking ~40GB of GPU memory, and the latter around 2GB. I have stopped guppy_basecaller_server using the systemctl command you specified (how do I restart it for all 4 GPUs -- is that just systemctl start guppyd?). The predicted run time for dorado is still ~20 hours, so no improvement there. How do I get dorado to output the basecalling speed?

@iiSeymour (Member)

Yes, systemctl start guppyd restarts the service. Dorado will not allocate more memory dynamically at runtime, so you will need to restart dorado to take advantage of the freed memory.
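
As a concrete sequence, something along these lines (a sketch using the paths from your earlier command; prepend sudo if your account needs it for systemctl):

# free the GPU memory held by the basecall server, rerun dorado, then bring the service back
systemctl stop guppyd
dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ --modified-bases 5mCG_5hmCG -x "cuda:1,2,3,4" > mod_bases.bam
systemctl start guppyd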

@rainwala (Author)

I did restart dorado, and there was no improvement in the predicted time (sorry, that wasn't clear).

@rainwala (Author)

Also, how does one see the speed for dorado? Is there an option to display that?

@iiSeymour (Member)

Dorado will report the speed at the end of calling; during a run, you only have the ETA from the progress bar.

Is pod5_all/ on the PromethION's /data volume?

Can you copy and paste the nvidia-smi output here (with dorado running)?
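
If you want a number comparable across runs, one rough option (a sketch, assuming samtools is available) is to time the run and pull the total called bases out of the BAM afterwards:

# time the run, then read the total called bases from the output BAM
time dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -x "cuda:1,2,3,4" > basecalled.bam
samtools stats basecalled.bam | grep "total length"
# total bases divided by the wall-clock seconds gives bases/s; multiply by 3600 for bases/hour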


rainwala commented Jun 20, 2023

Yes pod5_all/ is on the /data volume. Here is the output of nvidia-smi with dorado running:

nvidia-smi
Tue Jun 20 13:18:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T1000 8GB    On   | 00000000:17:00.0 Off |                  N/A |
| 35%   35C    P8    N/A /  50W |    220MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   51C    P0    72W / 300W |  37608MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   48C    P0    66W / 300W |  37616MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   49C    P0    85W / 300W |  37608MiB / 81920MiB |      1%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   48C    P0    73W / 300W |  37616MiB / 81920MiB |     32%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11410      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A     31717      G   /usr/lib/xorg/Xorg                119MiB |
|    0   N/A  N/A     31871      G   /usr/bin/gnome-shell               14MiB |
|    0   N/A  N/A     32436      G   ...AAAAAAAAA= --shared-files       19MiB |
|    1   N/A  N/A    304210      C   dorado                          37603MiB |
|    2   N/A  N/A    304210      C   dorado                          37611MiB |
|    3   N/A  N/A    304210      C   dorado                          37603MiB |
|    4   N/A  N/A    304210      C   dorado                          37611MiB |
+-----------------------------------------------------------------------------+


Kirk3gaard commented Jun 20, 2023

> Dorado will report the speed at the end of calling; during a run, you only have the ETA from the progress bar.
>
> Is pod5_all/ on the PromethION's /data volume?
>
> Can you copy and paste the nvidia-smi output here (with dorado running)?

It would actually be pretty cool to get the current speed reported somewhat live (Gbp/hour), to estimate whether the GPU resource being used is suitable for the amount of data + patience that the user has.

@iiSeymour (Member)

@rainwala the GPU power draw and utilisation are really bad 🤔 What are the CPU load and available system memory like?

Can you test the performance on this set without calling mods?


rainwala commented Jun 20, 2023

@iiSeymour this is what it looks like with this command:
dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -x "cuda:1,2,3,4" > basecalled.bam

nvidia-smi
Tue Jun 20 14:48:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T1000 8GB    On   | 00000000:17:00.0 Off |                  N/A |
| 35%   36C    P8    N/A /  50W |    220MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   53C    P0    77W / 300W |  37330MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   51C    P0    71W / 300W |  37330MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   48C    P0    72W / 300W |  37330MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   51C    P0    77W / 300W |  37330MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11410      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A     31717      G   /usr/lib/xorg/Xorg                119MiB |
|    0   N/A  N/A     31871      G   /usr/bin/gnome-shell               14MiB |
|    0   N/A  N/A     32436      G   ...AAAAAAAAA= --shared-files       19MiB |
|    1   N/A  N/A    309836      C   dorado                          37325MiB |
|    2   N/A  N/A    309836      C   dorado                          37325MiB |
|    3   N/A  N/A    309836      C   dorado                          37325MiB |
|    4   N/A  N/A    309836      C   dorado                          37325MiB |
+-----------------------------------------------------------------------------+
free -g
              total        used        free      shared  buff/cache   available
Mem:            503          16          13           0         473         483
Swap:            76           3          72

@rainwala (Author)

Would the nvidia-smi output for an analogous call to guppy be instructive? Also, what is the best command to check CPU load? I'm not currently running anything else on that machine.

@iiSeymour (Member)

You can use uptime or htop for the CPU load average. I'm still stumped, so yes, a guppy comparison would be good. Also, can you add -v to your command and paste the extra debug output from dorado?

$ dorado basecaller  /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -v -x "cuda:1,2,3,4"  > basecalled.bam
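
The debug messages go to stderr, so they can be captured to a file while the BAM still goes to stdout, e.g. (dorado_debug.log is just an arbitrary file name):

$ dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -v -x "cuda:1,2,3,4" > basecalled.bam 2> dorado_debug.log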

@rainwala (Author)

uptime
 15:02:45 up 6 days, 23:54,  2 users,  load average: 2.83, 3.04, 4.04
dorado basecaller  /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -v -x "cuda:1,2,3,4"  > basecalled.bam
[2023-06-20 15:03:37.055] [info] > Creating basecall pipeline
[2023-06-20 15:03:39.136] [debug] Auto batch size: GPU memory available: 84.005093376GB
[2023-06-20 15:03:39.136] [debug] Auto batch size: testing up to 2880 in steps of 64
[2023-06-20 15:03:40.672] [debug] Auto batchsize: 64, time per chunk 23.934502 ms
[2023-06-20 15:03:40.685] [debug] Auto batchsize: 128, time per chunk 0.097016 ms
[2023-06-20 15:03:40.697] [debug] Auto batchsize: 192, time per chunk 0.063674666 ms
[2023-06-20 15:03:40.709] [debug] Auto batchsize: 256, time per chunk 0.048292 ms
[2023-06-20 15:03:40.722] [debug] Auto batchsize: 320, time per chunk 0.039056 ms
[2023-06-20 15:03:40.735] [debug] Auto batchsize: 384, time per chunk 0.033664 ms
[2023-06-20 15:03:40.748] [debug] Auto batchsize: 448, time per chunk 0.029757714 ms
[2023-06-20 15:03:40.762] [debug] Auto batchsize: 512, time per chunk 0.026712 ms
[2023-06-20 15:03:40.776] [debug] Auto batchsize: 576, time per chunk 0.024108445 ms
[2023-06-20 15:03:40.793] [debug] Auto batchsize: 640, time per chunk 0.027188798 ms
[2023-06-20 15:03:40.811] [debug] Auto batchsize: 704, time per chunk 0.025173819 ms
[2023-06-20 15:03:40.829] [debug] Auto batchsize: 768, time per chunk 0.023344 ms
[2023-06-20 15:03:40.847] [debug] Auto batchsize: 832, time per chunk 0.021720616 ms
[2023-06-20 15:03:40.865] [debug] Auto batchsize: 896, time per chunk 0.020315427 ms
[2023-06-20 15:03:40.884] [debug] Auto batchsize: 960, time per chunk 0.019246934 ms
[2023-06-20 15:03:40.903] [debug] Auto batchsize: 1024, time per chunk 0.018334 ms
[2023-06-20 15:03:40.921] [debug] Auto batchsize: 1088, time per chunk 0.017387293 ms
[2023-06-20 15:03:40.941] [debug] Auto batchsize: 1152, time per chunk 0.017018666 ms
[2023-06-20 15:03:40.965] [debug] Auto batchsize: 1216, time per chunk 0.019384421 ms
[2023-06-20 15:03:40.989] [debug] Auto batchsize: 1280, time per chunk 0.018668 ms
[2023-06-20 15:03:41.013] [debug] Auto batchsize: 1344, time per chunk 0.018083047 ms
[2023-06-20 15:03:41.037] [debug] Auto batchsize: 1408, time per chunk 0.017295273 ms
[2023-06-20 15:03:41.062] [debug] Auto batchsize: 1472, time per chunk 0.016562087 ms
[2023-06-20 15:03:41.086] [debug] Auto batchsize: 1536, time per chunk 0.016073333 ms
[2023-06-20 15:03:41.112] [debug] Auto batchsize: 1600, time per chunk 0.01591168 ms
[2023-06-20 15:03:41.138] [debug] Auto batchsize: 1664, time per chunk 0.015662154 ms
[2023-06-20 15:03:41.164] [debug] Auto batchsize: 1728, time per chunk 0.015246223 ms
[2023-06-20 15:03:41.194] [debug] Auto batchsize: 1792, time per chunk 0.016740572 ms
[2023-06-20 15:03:41.225] [debug] Auto batchsize: 1856, time per chunk 0.016388414 ms
[2023-06-20 15:03:41.256] [debug] Auto batchsize: 1920, time per chunk 0.015994133 ms
[2023-06-20 15:03:41.287] [debug] Auto batchsize: 1984, time per chunk 0.015632516 ms
[2023-06-20 15:03:41.318] [debug] Auto batchsize: 2048, time per chunk 0.0153555 ms
[2023-06-20 15:03:41.350] [debug] Auto batchsize: 2112, time per chunk 0.015092364 ms
[2023-06-20 15:03:41.382] [debug] Auto batchsize: 2176, time per chunk 0.014870588 ms
[2023-06-20 15:03:41.415] [debug] Auto batchsize: 2240, time per chunk 0.014634514 ms
[2023-06-20 15:03:41.449] [debug] Auto batchsize: 2304, time per chunk 0.014512001 ms
[2023-06-20 15:03:41.486] [debug] Auto batchsize: 2368, time per chunk 0.01579027 ms
[2023-06-20 15:03:41.524] [debug] Auto batchsize: 2432, time per chunk 0.015402106 ms
[2023-06-20 15:03:41.561] [debug] Auto batchsize: 2496, time per chunk 0.015052717 ms
[2023-06-20 15:03:41.599] [debug] Auto batchsize: 2560, time per chunk 0.0148652 ms
[2023-06-20 15:03:41.638] [debug] Auto batchsize: 2624, time per chunk 0.014742634 ms
[2023-06-20 15:03:41.677] [debug] Auto batchsize: 2688, time per chunk 0.014545143 ms
[2023-06-20 15:03:41.717] [debug] Auto batchsize: 2752, time per chunk 0.01448186 ms
[2023-06-20 15:03:41.757] [debug] Auto batchsize: 2816, time per chunk 0.014286909 ms
[2023-06-20 15:03:41.798] [debug] Auto batchsize: 2880, time per chunk 0.0142350225 ms
[2023-06-20 15:03:41.866] [debug] - set batch size for cuda:1 to 2880
[2023-06-20 15:03:43.018] [debug] Auto batch size: GPU memory available: 84.005093376GB
[2023-06-20 15:03:43.018] [debug] Auto batch size: testing up to 2880 in steps of 64
[2023-06-20 15:03:44.110] [debug] Auto batchsize: 64, time per chunk 17.056538 ms
[2023-06-20 15:03:44.122] [debug] Auto batchsize: 128, time per chunk 0.09708 ms
[2023-06-20 15:03:44.135] [debug] Auto batchsize: 192, time per chunk 0.06344 ms
[2023-06-20 15:03:44.147] [debug] Auto batchsize: 256, time per chunk 0.047424 ms
[2023-06-20 15:03:44.159] [debug] Auto batchsize: 320, time per chunk 0.038704 ms
[2023-06-20 15:03:44.172] [debug] Auto batchsize: 384, time per chunk 0.032954667 ms
[2023-06-20 15:03:44.185] [debug] Auto batchsize: 448, time per chunk 0.029206857 ms
[2023-06-20 15:03:44.199] [debug] Auto batchsize: 512, time per chunk 0.026314 ms
[2023-06-20 15:03:44.212] [debug] Auto batchsize: 576, time per chunk 0.023884444 ms
[2023-06-20 15:03:44.230] [debug] Auto batchsize: 640, time per chunk 0.026993599 ms
[2023-06-20 15:03:44.247] [debug] Auto batchsize: 704, time per chunk 0.024773818 ms
[2023-06-20 15:03:44.265] [debug] Auto batchsize: 768, time per chunk 0.023160001 ms
[2023-06-20 15:03:44.283] [debug] Auto batchsize: 832, time per chunk 0.02176246 ms
[2023-06-20 15:03:44.301] [debug] Auto batchsize: 896, time per chunk 0.020273143 ms
[2023-06-20 15:03:44.320] [debug] Auto batchsize: 960, time per chunk 0.019012267 ms
[2023-06-20 15:03:44.338] [debug] Auto batchsize: 1024, time per chunk 0.017951 ms
[2023-06-20 15:03:44.357] [debug] Auto batchsize: 1088, time per chunk 0.017162353 ms
[2023-06-20 15:03:44.376] [debug] Auto batchsize: 1152, time per chunk 0.016718222 ms
[2023-06-20 15:03:44.399] [debug] Auto batchsize: 1216, time per chunk 0.01911579 ms
[2023-06-20 15:03:44.423] [debug] Auto batchsize: 1280, time per chunk 0.018365601 ms
[2023-06-20 15:03:44.446] [debug] Auto batchsize: 1344, time per chunk 0.017573334 ms
[2023-06-20 15:03:44.470] [debug] Auto batchsize: 1408, time per chunk 0.016996363 ms
[2023-06-20 15:03:44.494] [debug] Auto batchsize: 1472, time per chunk 0.016381914 ms
[2023-06-20 15:03:44.519] [debug] Auto batchsize: 1536, time per chunk 0.016021334 ms
[2023-06-20 15:03:44.544] [debug] Auto batchsize: 1600, time per chunk 0.01572928 ms
[2023-06-20 15:03:44.570] [debug] Auto batchsize: 1664, time per chunk 0.015275077 ms
[2023-06-20 15:03:44.596] [debug] Auto batchsize: 1728, time per chunk 0.015051852 ms
[2023-06-20 15:03:44.626] [debug] Auto batchsize: 1792, time per chunk 0.016623428 ms
[2023-06-20 15:03:44.656] [debug] Auto batchsize: 1856, time per chunk 0.016238345 ms
[2023-06-20 15:03:44.686] [debug] Auto batchsize: 1920, time per chunk 0.015859732 ms
[2023-06-20 15:03:44.717] [debug] Auto batchsize: 1984, time per chunk 0.015399226 ms
[2023-06-20 15:03:44.748] [debug] Auto batchsize: 2048, time per chunk 0.0151765 ms
[2023-06-20 15:03:44.780] [debug] Auto batchsize: 2112, time per chunk 0.014999273 ms
[2023-06-20 15:03:44.812] [debug] Auto batchsize: 2176, time per chunk 0.014755295 ms
[2023-06-20 15:03:44.844] [debug] Auto batchsize: 2240, time per chunk 0.014609829 ms
[2023-06-20 15:03:44.877] [debug] Auto batchsize: 2304, time per chunk 0.014315999 ms
[2023-06-20 15:03:44.914] [debug] Auto batchsize: 2368, time per chunk 0.0156112425 ms
[2023-06-20 15:03:44.952] [debug] Auto batchsize: 2432, time per chunk 0.015391579 ms
[2023-06-20 15:03:44.990] [debug] Auto batchsize: 2496, time per chunk 0.015180308 ms
[2023-06-20 15:03:45.028] [debug] Auto batchsize: 2560, time per chunk 0.01493 ms
[2023-06-20 15:03:45.066] [debug] Auto batchsize: 2624, time per chunk 0.014632585 ms
[2023-06-20 15:03:45.105] [debug] Auto batchsize: 2688, time per chunk 0.014421334 ms
[2023-06-20 15:03:45.145] [debug] Auto batchsize: 2752, time per chunk 0.014245953 ms
[2023-06-20 15:03:45.184] [debug] Auto batchsize: 2816, time per chunk 0.0140629085 ms
[2023-06-20 15:03:45.225] [debug] Auto batchsize: 2880, time per chunk 0.014019555 ms
[2023-06-20 15:03:45.294] [debug] - set batch size for cuda:2 to 2880
[2023-06-20 15:03:46.369] [debug] Auto batch size: GPU memory available: 84.005093376GB
[2023-06-20 15:03:46.369] [debug] Auto batch size: testing up to 2880 in steps of 64
[2023-06-20 15:03:47.517] [debug] Auto batchsize: 64, time per chunk 17.935799 ms
[2023-06-20 15:03:47.529] [debug] Auto batchsize: 128, time per chunk 0.096976 ms
[2023-06-20 15:03:47.542] [debug] Auto batchsize: 192, time per chunk 0.064106666 ms
[2023-06-20 15:03:47.554] [debug] Auto batchsize: 256, time per chunk 0.047732 ms
[2023-06-20 15:03:47.566] [debug] Auto batchsize: 320, time per chunk 0.039350398 ms
[2023-06-20 15:03:47.579] [debug] Auto batchsize: 384, time per chunk 0.033544 ms
[2023-06-20 15:03:47.593] [debug] Auto batchsize: 448, time per chunk 0.029504 ms
[2023-06-20 15:03:47.606] [debug] Auto batchsize: 512, time per chunk 0.026706 ms
[2023-06-20 15:03:47.620] [debug] Auto batchsize: 576, time per chunk 0.024090666 ms
[2023-06-20 15:03:47.638] [debug] Auto batchsize: 640, time per chunk 0.0272912 ms
[2023-06-20 15:03:47.655] [debug] Auto batchsize: 704, time per chunk 0.025134545 ms
[2023-06-20 15:03:47.673] [debug] Auto batchsize: 768, time per chunk 0.023203999 ms
[2023-06-20 15:03:47.691] [debug] Auto batchsize: 832, time per chunk 0.021965537 ms
[2023-06-20 15:03:47.710] [debug] Auto batchsize: 896, time per chunk 0.020472001 ms
[2023-06-20 15:03:47.728] [debug] Auto batchsize: 960, time per chunk 0.019220266 ms
[2023-06-20 15:03:47.747] [debug] Auto batchsize: 1024, time per chunk 0.01829 ms
[2023-06-20 15:03:47.766] [debug] Auto batchsize: 1088, time per chunk 0.017434353 ms
[2023-06-20 15:03:47.786] [debug] Auto batchsize: 1152, time per chunk 0.016886223 ms
[2023-06-20 15:03:47.809] [debug] Auto batchsize: 1216, time per chunk 0.019280842 ms
[2023-06-20 15:03:47.833] [debug] Auto batchsize: 1280, time per chunk 0.0185752 ms
[2023-06-20 15:03:47.857] [debug] Auto batchsize: 1344, time per chunk 0.01782019 ms
[2023-06-20 15:03:47.881] [debug] Auto batchsize: 1408, time per chunk 0.017187636 ms
[2023-06-20 15:03:47.905] [debug] Auto batchsize: 1472, time per chunk 0.016608 ms
[2023-06-20 15:03:47.930] [debug] Auto batchsize: 1536, time per chunk 0.01616 ms
[2023-06-20 15:03:47.956] [debug] Auto batchsize: 1600, time per chunk 0.015733121 ms
[2023-06-20 15:03:47.981] [debug] Auto batchsize: 1664, time per chunk 0.0154412305 ms
[2023-06-20 15:03:48.007] [debug] Auto batchsize: 1728, time per chunk 0.015094519 ms
[2023-06-20 15:03:48.037] [debug] Auto batchsize: 1792, time per chunk 0.016684571 ms
[2023-06-20 15:03:48.067] [debug] Auto batchsize: 1856, time per chunk 0.016217379 ms
[2023-06-20 15:03:48.098] [debug] Auto batchsize: 1920, time per chunk 0.015837334 ms
[2023-06-20 15:03:48.129] [debug] Auto batchsize: 1984, time per chunk 0.015494194 ms
[2023-06-20 15:03:48.160] [debug] Auto batchsize: 2048, time per chunk 0.015186 ms
[2023-06-20 15:03:48.191] [debug] Auto batchsize: 2112, time per chunk 0.014938667 ms
[2023-06-20 15:03:48.223] [debug] Auto batchsize: 2176, time per chunk 0.014673412 ms
[2023-06-20 15:03:48.256] [debug] Auto batchsize: 2240, time per chunk 0.014528457 ms
[2023-06-20 15:03:48.289] [debug] Auto batchsize: 2304, time per chunk 0.01435289 ms
[2023-06-20 15:03:48.326] [debug] Auto batchsize: 2368, time per chunk 0.015494486 ms
[2023-06-20 15:03:48.363] [debug] Auto batchsize: 2432, time per chunk 0.01525221 ms
[2023-06-20 15:03:48.400] [debug] Auto batchsize: 2496, time per chunk 0.015081846 ms
[2023-06-20 15:03:48.438] [debug] Auto batchsize: 2560, time per chunk 0.0148072 ms
[2023-06-20 15:03:48.477] [debug] Auto batchsize: 2624, time per chunk 0.014563122 ms
[2023-06-20 15:03:48.515] [debug] Auto batchsize: 2688, time per chunk 0.014437334 ms
[2023-06-20 15:03:48.554] [debug] Auto batchsize: 2752, time per chunk 0.014119443 ms
[2023-06-20 15:03:48.594] [debug] Auto batchsize: 2816, time per chunk 0.014046908 ms
[2023-06-20 15:03:48.634] [debug] Auto batchsize: 2880, time per chunk 0.013978667 ms
[2023-06-20 15:03:48.709] [debug] - set batch size for cuda:3 to 2880
[2023-06-20 15:03:49.906] [debug] Auto batch size: GPU memory available: 84.005093376GB
[2023-06-20 15:03:49.906] [debug] Auto batch size: testing up to 2880 in steps of 64
[2023-06-20 15:03:50.939] [debug] Auto batchsize: 64, time per chunk 16.144989 ms
[2023-06-20 15:03:50.952] [debug] Auto batchsize: 128, time per chunk 0.098088 ms
[2023-06-20 15:03:50.964] [debug] Auto batchsize: 192, time per chunk 0.063450664 ms
[2023-06-20 15:03:50.976] [debug] Auto batchsize: 256, time per chunk 0.047488 ms
[2023-06-20 15:03:50.989] [debug] Auto batchsize: 320, time per chunk 0.0389824 ms
[2023-06-20 15:03:51.001] [debug] Auto batchsize: 384, time per chunk 0.033514667 ms
[2023-06-20 15:03:51.014] [debug] Auto batchsize: 448, time per chunk 0.029154286 ms
[2023-06-20 15:03:51.028] [debug] Auto batchsize: 512, time per chunk 0.026256 ms
[2023-06-20 15:03:51.042] [debug] Auto batchsize: 576, time per chunk 0.023742221 ms
[2023-06-20 15:03:51.059] [debug] Auto batchsize: 640, time per chunk 0.026743999 ms
[2023-06-20 15:03:51.076] [debug] Auto batchsize: 704, time per chunk 0.024583273 ms
[2023-06-20 15:03:51.094] [debug] Auto batchsize: 768, time per chunk 0.023158668 ms
[2023-06-20 15:03:51.112] [debug] Auto batchsize: 832, time per chunk 0.021790769 ms
[2023-06-20 15:03:51.130] [debug] Auto batchsize: 896, time per chunk 0.020232001 ms
[2023-06-20 15:03:51.148] [debug] Auto batchsize: 960, time per chunk 0.018919466 ms
[2023-06-20 15:03:51.167] [debug] Auto batchsize: 1024, time per chunk 0.017942 ms
[2023-06-20 15:03:51.185] [debug] Auto batchsize: 1088, time per chunk 0.017100235 ms
[2023-06-20 15:03:51.205] [debug] Auto batchsize: 1152, time per chunk 0.016607111 ms
[2023-06-20 15:03:51.228] [debug] Auto batchsize: 1216, time per chunk 0.019087158 ms
[2023-06-20 15:03:51.251] [debug] Auto batchsize: 1280, time per chunk 0.0183072 ms
[2023-06-20 15:03:51.275] [debug] Auto batchsize: 1344, time per chunk 0.017591618 ms
[2023-06-20 15:03:51.299] [debug] Auto batchsize: 1408, time per chunk 0.017016001 ms
[2023-06-20 15:03:51.323] [debug] Auto batchsize: 1472, time per chunk 0.016354783 ms
[2023-06-20 15:03:51.348] [debug] Auto batchsize: 1536, time per chunk 0.015932666 ms
[2023-06-20 15:03:51.373] [debug] Auto batchsize: 1600, time per chunk 0.0156064 ms
[2023-06-20 15:03:51.398] [debug] Auto batchsize: 1664, time per chunk 0.015152001 ms
[2023-06-20 15:03:51.424] [debug] Auto batchsize: 1728, time per chunk 0.014886518 ms
[2023-06-20 15:03:51.453] [debug] Auto batchsize: 1792, time per chunk 0.016473142 ms
[2023-06-20 15:03:51.483] [debug] Auto batchsize: 1856, time per chunk 0.016115861 ms
[2023-06-20 15:03:51.513] [debug] Auto batchsize: 1920, time per chunk 0.015751468 ms
[2023-06-20 15:03:51.544] [debug] Auto batchsize: 1984, time per chunk 0.015448258 ms
[2023-06-20 15:03:51.575] [debug] Auto batchsize: 2048, time per chunk 0.015164 ms
[2023-06-20 15:03:51.607] [debug] Auto batchsize: 2112, time per chunk 0.014888242 ms
[2023-06-20 15:03:51.638] [debug] Auto batchsize: 2176, time per chunk 0.014634353 ms
[2023-06-20 15:03:51.671] [debug] Auto batchsize: 2240, time per chunk 0.014399086 ms
[2023-06-20 15:03:51.704] [debug] Auto batchsize: 2304, time per chunk 0.014253333 ms
[2023-06-20 15:03:51.740] [debug] Auto batchsize: 2368, time per chunk 0.015570594 ms
[2023-06-20 15:03:51.778] [debug] Auto batchsize: 2432, time per chunk 0.015290948 ms
[2023-06-20 15:03:51.815] [debug] Auto batchsize: 2496, time per chunk 0.014902564 ms
[2023-06-20 15:03:51.852] [debug] Auto batchsize: 2560, time per chunk 0.0146612 ms
[2023-06-20 15:03:51.891] [debug] Auto batchsize: 2624, time per chunk 0.014552195 ms
[2023-06-20 15:03:51.929] [debug] Auto batchsize: 2688, time per chunk 0.014444191 ms
[2023-06-20 15:03:51.969] [debug] Auto batchsize: 2752, time per chunk 0.014331163 ms
[2023-06-20 15:03:52.009] [debug] Auto batchsize: 2816, time per chunk 0.014204364 ms
[2023-06-20 15:03:52.049] [debug] Auto batchsize: 2880, time per chunk 0.014007823 ms
[2023-06-20 15:03:52.128] [debug] - set batch size for cuda:4 to 2880
[log truncated]

@rainwala (Author)

In case it helps (while I work out the best analogous guppy command), here is the usage and GPU power draw with nothing running:

uptime
 15:15:23 up 7 days, 6 min,  2 users,  load average: 1.49, 2.63, 3.37
nvidia-smi
Tue Jun 20 15:13:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T1000 8GB    On   | 00000000:17:00.0 Off |                  N/A |
| 35%   36C    P8    N/A /  50W |    217MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   46C    P0    47W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   45C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   43C    P0    47W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   43C    P0    47W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11410      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A     31717      G   /usr/lib/xorg/Xorg                119MiB |
|    0   N/A  N/A     31871      G   /usr/bin/gnome-shell               14MiB |
|    0   N/A  N/A     32436      G   ...AAAAAAAAA= --shared-files       19MiB |
+-----------------------------------------------------------------------------+


rainwala commented Jun 20, 2023

@iiSeymour OK, and here is the analogous guppy command and nvidia-smi output:

/opt/ont/guppy/bin/guppy_basecaller -i fast5_all/ -s fastq -c /opt/ont/guppy/data/dna_r10.4.1_e8.2_400bps_hac.cfg -x "cuda:1,2,3,4"
ONT Guppy basecalling software version 6.2.11+e17754edc, minimap2 version 2.22-r1101
config file:        /opt/ont/guppy/data/dna_r10.4.1_e8.2_400bps_hac.cfg
model file:         /opt/ont/guppy/data/template_r10.4.1_e8.2_400bps_hac.jsn
input path:         fast5_all/
save path:          fastq
chunk size:         2000
chunks per runner:  256
minimum qscore:     9
records per file:   4000
num basecallers:    4
gpu device:         cuda:1,2,3,4
kernel path:        
runners per device: 4

Use of this software is permitted solely under the terms of the end user license agreement (EULA).By running, copying or accessing this software, you are demonstrating your acceptance of the EULA.
The EULA may be found in /opt/ont/guppy/bin
Found 16981 fast5 files to process.
Init time: 4075 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
nvidia-smi
Tue Jun 20 15:20:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T1000 8GB    On   | 00000000:17:00.0 Off |                  N/A |
| 48%   55C    P8    N/A /  50W |    220MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   41C    P0    65W / 300W |   2364MiB / 81920MiB |     92%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   44C    P0    65W / 300W |   2364MiB / 81920MiB |     79%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   44C    P0   115W / 300W |   2364MiB / 81920MiB |     47%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   45C    P0   186W / 300W |   2364MiB / 81920MiB |     23%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11410      G   /usr/lib/xorg/Xorg                 39MiB |
|    0   N/A  N/A     31717      G   /usr/lib/xorg/Xorg                119MiB |
|    0   N/A  N/A     31871      G   /usr/bin/gnome-shell               14MiB |
|    0   N/A  N/A     32436      G   ...AAAAAAAAA= --shared-files       19MiB |
|    1   N/A  N/A    311623      C   ...uppy/bin/guppy_basecaller     2361MiB |
|    2   N/A  N/A    311623      C   ...uppy/bin/guppy_basecaller     2361MiB |
|    3   N/A  N/A    311623      C   ...uppy/bin/guppy_basecaller     2361MiB |
|    4   N/A  N/A    311623      C   ...uppy/bin/guppy_basecaller     2361MiB |
+-----------------------------------------------------------------------------+


@iiSeymour (Member)

@rainwala thanks, can you also check the ETA with dorado for -x cuda:1, -x cuda:1,2, and -x cuda:1,2,3?


rainwala commented Jun 21, 2023

@iiSeymour the guppy run took 12 hours, so just over half the time of the analogous dorado run.
Caller time: 44512640 ms, Samples called: 867668096134, samples/s: 1.94926e+07
As for the ETA with the GPU combinations for dorado, here they are:

-x cuda:1 = 5h 20min
-x cuda:1,2 = 5h
-x cuda:1,2,3 = 19h
-x cuda:3,4 = 5h 30min

I'm not sure how to interpret this, except that it's good I can get 5 hours again, but only with a subset of GPUs. I don't seem to get any improvement going from 1 GPU to 2 GPUs, and then performance drops off a cliff with 3 and 4 GPUs.

I get similar results with modified bases (except that 2 GPUs does seem to be better than 1, with a 4h 30min vs 6h ETA), so it's not about modified bases; rather it seems to be about dorado's performance with multiple A100s on a PromethION data tower.

On modified_bases, with -x cuda:1,2, I got this speed: Basecalled @ Samples/s: 4.830830e+07
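
Using the same rough conversion as earlier in the thread (~10 samples per base, my assumption), that works out to:

awk 'BEGIN { printf "%.1f Gbases/hour\n", 4.830830e7 / 10 * 3600 / 1e9 }'
# ~17.4 Gbases/hour on two A100s, well below the ~55 Gbases/hour mentioned above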

@iiSeymour (Member)

Thanks @rainwala, I have a theory. Can you try the above test again, but with CUDA_DEVICE_ORDER=FASTEST_FIRST and cuda:0, cuda:0,1, cuda:0,1,2 & cuda:0,1,2,3?
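
Something along these lines would cover the sweep (a sketch; note the progress-bar ETA for each device set and then interrupt the run):

export CUDA_DEVICE_ORDER=FASTEST_FIRST
for devs in cuda:0 cuda:0,1 cuda:0,1,2 cuda:0,1,2,3; do
    echo "=== $devs ==="
    dorado basecaller /home/prom/dorado-0.3.0-linux-x64/models/dna_r10.4.1_e8.2_400bps_hac@v4.1.0 pod5_all/ -x "$devs" > /dev/null   # watch the ETA, then Ctrl-C
done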

@rainwala (Author)

Hi @iiSeymour, I set
export CUDA_DEVICE_ORDER=FASTEST_FIRST

and then the ETAs were as follows:
cuda:0 = 6h 30min
cuda:0,1 = 4h 50min
cuda:0,1,2 = 21h
cuda:0,1,2,3 = 16h

@iiSeymour (Member)

@rainwala thanks for providing all the information - we are putting together a fix now.

@rainwala (Author)

Thanks @iiSeymour !

@iiSeymour (Member)

@rainwala this should be resolved in v0.3.1.


rainwala commented Jun 27, 2023 via email

@iiSeymour (Member)


rainwala commented Jun 28, 2023 via email


homeveg commented Jun 28, 2023

I have a similar problem with slightly different initial system settings.
I am trying to run dorado basecalling on a Rocky Linux server with 4 x A100 (80GB of GPU memory per card), a 64-core CPU, and 256GB RAM.
As input, I have about 150GB of pod5 data, generated with the latest MinKNOW software from a MinION run.
I first used dorado-0.3.0, and later dorado-0.3.1.
In both cases the problem is very similar: when basecalling starts, it shows an ETA of about 2h 30m for v0.3.0 and 2h 10m for v0.3.1, and after some time the ETA was already >1 day for v0.3.0 and >6h for v0.3.1.

The v0.3.1 run (currently running) finished 9% within 13 minutes and showed a 2h 08m estimated run time, but then the ETA started to grow rapidly. After 25 min of run time, it still showed 9% and about 3h 35m; after 30 min, 10% progress and a 4h 35m ETA.
Update: it's still running, with 17h of run time and a 2d 21h estimated run time...

I tried running it both with fully automatic settings and with slightly customised values for "batchsize", "chunksize" and "overlap", which gives a ~5-7% lower ETA at the beginning compared to auto mode.
With the optimised settings, I have about 44% GPU memory usage, no more than 10% CPU usage with some small peaks, and system RAM usage below 5%.
GPU utilisation during the initialisation phase is not very stable, jumping between 2-3% and 20-25%; after 7-8% progress the GPU usage increases to 40-50% with some 100% spikes, and this is the point when the ETA starts increasing.

Here is my command:
dorado basecaller --verbose --batchsize 3100 --chunksize 12000 --overlap 750 --device "cuda:all" --min-qscore 10 --recursive --emit-fastq $ConfPath/dna_r10.4.1_e8.2_400bps_hac@v4.2.0 $WD/$Project/$FCID/ > $WD/$Project/$FCID/fastq_dorado/$Project.hac.fastq

Below is the nvidia-smi output:

Wed Jun 28 16:03:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           On | 00000000:17:00.0 Off |                    0 |
| N/A   49C    P0               71W / 300W|  36165MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           On | 00000000:31:00.0 Off |                    0 |
| N/A   54C    P0              309W / 300W|  36165MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           On | 00000000:B1:00.0 Off |                    0 |
| N/A   48C    P0              104W / 300W|  36165MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe           On | 00000000:CA:00.0 Off |                    0 |
| N/A   49C    P0               71W / 300W|  36165MiB / 81920MiB |      4%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2688830      C   dorado                                    36160MiB |
|    1   N/A  N/A   2688830      C   dorado                                    36160MiB |
|    2   N/A  N/A   2688830      C   dorado                                    36160MiB |
|    3   N/A  N/A   2688830      C   dorado                                    36160MiB |
+---------------------------------------------------------------------------------------+


rainwala commented Jul 6, 2023

@iiSeymour, wow, the ETA for basecalling (no modified bases) with 4 A100s is now ~1 hour 15 minutes! It's around 3 hours with modified bases. What did you fix in version 0.3.1 to make this possible?


rainwala commented Jul 6, 2023

I consider this issue closed now. @homeveg, maybe you would like to open a new issue for the problem you are facing.
