[doc] profiling NVMe and configuring aio param section #998
Benchmark results from a DGX-2 node, which has 8 Micron 9200 NVMe drives RAIDed into a single volume.
The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to the deepspeed config:

aio: {
  "block_size": 262144,
  "queue_depth": 32,
  "thread_count": 1,
  "single_submit": false,
  "overlap_events": true
}
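For context, here is a minimal sketch of where such an aio block sits in a full DeepSpeed config for NVMe offload; the zero_optimization values and the nvme_path below are illustrative placeholders, not taken from this benchmark:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  },
  "aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
  }
}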
Unfortunately, I just noticed a bug in the write sweep script, which may lower the write perf. Basically, it is not doing the multi-process sweep because of this oversight.

Oh, I missed that new config section! So how does a user correlate their benchmark results to the above config that you prepared based on my benchmark results? The description of each param in the asynchronous I/O module docs is very terse, and ideally we need a few paragraphs explaining how to choose those values - which are good defaults, and which numbers should be changed according to one's results. Thank you!

I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVMe<->CPU bandwidth per GPU, so really my NVMe setup is only barely good enough for a single GPU to meet the efficiency standard you define in the paper. Am I wrong? Especially given your report earlier:

So practically it's not enough to just have a single NVMe drive to benefit from ZeRO-Infinity, if I understand it correctly.
aio: {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
The rest are based on the results of your sweep as follows. Your best read config was Your best write config was Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar. Another challenge that is more obvious to me now is that reasonable defaults are hard to set because of device and system differences. Prior to your experiments, Does this help?
This helps a lot, thank you! Can we make
Can you clarify the efficiency standard you are referring to in the paper? Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:
And it was mentioned several times in previous sections.
Thank you for sharing these considerations / questions to ask, @tjruwase. How do we translate these into something actionable by the user? That is, what exact steps do they follow to set up each of these values:

besides And you addressed how to get these numbers/flags, which is great!
With the updated benchmark I get slightly worse results than before for write (was 2.59), and it has now switched to
The read benchmark hasn't changed the best throughput, but the winning config has changed too!
So am I correct that I now need to change my config to:
But what should I do with Also, why does the read benchmark run

@tjruwase, what do you think about putting the
@stas00, regarding the changing best configs and results, I think since the perf differences are so small I would put it down to noise. Also, as you noticed, the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are Regarding the write benchmark not dropping the caches: I did not do that because I thought it was more significant for reads as opposed to writes. If possible, can you please check if adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.
That sounds fantastic! Thank you, @tjruwase. I'd also add that since we currently have a single config, perhaps the final script should take the output of both parsers? Or take the read and write log dirs, run the parser on them and dump the recommended config, so the user will need to run:

and of course the first 2 can also be merged into the 3rd item as the next stage. But this would be a great start.
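For reference, the two parser invocations being discussed would look roughly like this (the write-side log dir and metric name are my assumptions, mirroring the read command quoted elsewhere in this thread):

# best read result (highest GB/s last)
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1
# best write result (log dir and metric name assumed)
python parse_aio_stats.py --log_dir write-logs/aio_perf_sweep --metric write_speed | sort -k9 -n | tail -1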
I will test and get back to you.

Oh, and we probably should have the instruction

OK, so adding cache invalidation for the write benchmark had a negligible impact, a difference of 1e-2. Hence, it's probably safe to skip it (it also slows the overall run time, I think).
Here are results from an A100 hyperplane server from Lambda, using DeepSpeed master and the instructions collected above!

Micron 7300 2TB NVMe (Max Read 3GB/s, Max Write 1.9 GB/s)

Read

For most of the configurations the difference is negligible. Range of 0.33-2.07 GB/s

Write

Again, looking at the first 100ish tail outputs, the difference is negligible. Range of 1.23-1.995 GB/s

I think we can potentially reduce the grid search space to hone in on suggestions initially. It might also be a good idea to compare the same configurations across our 3 environments to see what the differential compared to the max throughput is.
@SeanNaren, thanks for sharing your results. This is great since it is a different device from @stas00 and myself, so this is helping to evaluate the perf portability. Please see a few observations and questions below.
Regarding further investigation of the relatively poor read performance, can you help with the following:
[global]
bs=1M
iodepth=32
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=2
name=raw-read
rw=read
directory=/local_nvme/
thread
[job1]
filename=random_400MB.pt
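If it helps, this job file can be saved under any name (e.g. seq-read.fio; the filename here is arbitrary) and run with:

fio seq-read.fio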
Perhaps we should try and find a few more devices to acquire more data before we prune? Currently Sean's device seems to be a one-off - basing a choice on 3 inputs is a bit hard and we might miss something.

@stas00, I completely agree. It would be awesome to find additional devices. Unfortunately, I don't have access to such diversity here. Also, by pruning I meant only the default setting of the search script; users will have the option of defining their own space.

I added "4. Contribute your data" instructions to the OP - let's see if we can get some contributions. I made a call to the community inviting contributions:
I need to use Setup is OK (as per below), but none of the DS optimisers will JIT compile (they all return an error at runtime re: the sm_86 arch). Pretty sure the problem is still outstanding for this card.

Thank you for wanting to help us gather the data, @thefazzer! I have the same card; it works without problems if you have the right torch/cuda setup. Let's not derail this thread - can we discuss this in a separate issue here on Deepspeed? Please post the output of your Thank you!
I looked all over the internet for this specific NVMe, and the reported max write was 1.9GB/s, and these were the numbers collated from my run! I can confirm with Lambda on this.

I haven't looked into this, but cloud providers may provide another standard for NVMe if we're trying to just collect data numbers.

This is shared with the OS as a root drive; there was nothing else running on the node (other than sys processes) at the time of running this benchmark.

More than happy to, however I'm unsure how to run this! Maybe I missed something, but if you could explain, I can run it!
@SeanNaren, we have seen that sharing the device with the OS as a root drive does impact performance, especially reads, even if nothing else is running on the node. For fio, update the directory and filename fields of the config file (e.g.,
@tjruwase, could we please change the benchmark to bail out with the error message if it can't run? Otherwise it dumps the error into the benchmark files, and it will do so for all 400 files without the user knowing it's not really running...

Also, as I mentioned earlier, the WARNING:

should probably be an assert. Thank you!

Hmm, let me take a look at both issues today.

Oh, and btw, my earlier suggestion to move Perhaps we should just warn the user with an echo once explaining why we are asking to
@stas00, two points on error handling.
A bit late to the party, but my machine has 2 NVMe SSDs and an RTX 3090 24GB. Happy to contribute some data if there is still a need!

@alexcoca, thanks! Please contribute when you can.

Here are my results. Much more consumer-PC level, but I have a Kioxia KXG6AZNV512G paired with an RX 5600XT.

('write', 'single', 'overlap', 1, 1, 4, 1048576) = 2.7205507521200336
('read', 'single', 'overlap', 1, 1, 16, 262144) = 3.0350070577251524
@stanleyshly, excellent! Thanks so much for contributing. Do you mind updating with a description of your NVMe SSDs and the peak read/write speeds for sequential access?

@tjruwase I'm running a single 512GB NVMe SSD (Kioxia KXG6AZNV512G). It looks to have TLC flash.

Thanks. By the way, I was concerned about the fact that the reported write speed exceeded the peak. So I did a search and found this link, which indicates the write peak is 2800MB/s. Do you have any thoughts on this? Either way, it looks like we are pretty much saturating the SSD.

I'm not exactly sure why. Could it be the DRAM cache?

It is unlikely to be the DRAM cache because the library uses O_DIRECT. Can you share the source of your peak numbers? Does the link I shared look believable to you?

I'm using the script right here: https://unix.stackexchange.com/a/480191

Got it. In general, when it comes to peak numbers, I tend to trust the manufacturer specs over software (benchmarks).

I see, I'll use those next time.

@tjruwase, how should the benchmarking be approached on a distributed FS? It'd be pointless to flush the local cache. Should the benchmark be changed to create a new file before each measurement?

@stas00, that is a good point. I have a POC creating a new file for each measurement. I will try to create a PR for that.
Do we know how the caching mechanism works? e.g., would it be enough to do:

before each measurement to invalidate the cache, rather than the much slower recreation with
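If the command in question is the standard Linux page-cache drop (an assumption on my part), it would be something like the following, run as root:

# flush dirty pages, then drop page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches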
I was told that this is the tool used for profiling multi-node IO https://github.com/hpc/ior - I tried it on a single node so far and it seems to work quite well. |
See here: microsoft/DeepSpeed#998
@gaowayne, there are no capacity requirements. Are you experiencing issues?
Results from my Crucial T705, as requested by @tjruwase :)

Note I had issues with this version, so I wound up running @stas00's benchmark here: https://github.com/stas00/ml-engineering/tree/master/storage#fio

Advertises 14.5GB/s read / 12.7GB/s write

fio benchmark results for workhorse on 2024-07-08-13:05:50
partition /mnt/superfast/fio-test

filesize=16k read
filesize=16k write
filesize=1m read
filesize=1m write
filesize=1g read
filesize=1g write
filesize=5g read
filesize=5g write
@muellerzr, can you please share some details, or a stack trace of the issue encountered? We need to fix it. Thanks!

@tjruwase when I ran the benchmark, rather than GB/s all values were “None” :)
@tjruwase I think so, I ran the following:

cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1
The output of which gave:
Can you dump the contents of one of the files in
@muellerzr, I just realized/remembered that those are the old steps. Sorry for not catching that earlier. Below are links to the new instructions. Whenever you get a chance to try, please let me know of any issues.
Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the aio param section.

The following NVMe benchmark measures the end-to-end performance of how fast it can read/write CPU<->NVMe, so make sure to run it on the actual system that you intend to use.
For this demonstration we are going to use:
1. Preparation
You may also have to install libaio-dev if the Deepspeed NVMe driver fails to build. On Ubuntu it's just a one-liner, shown below.

Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
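The Ubuntu one-liner is presumably just the standard package install:

sudo apt install libaio-dev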
Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.
2. Run Read Benchmark
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the csrc/aio/py_test folder to your NVMe drive and run the test there. You can, of course, use it to test non-NVMe drives (e.g. SSD).
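The command sequence, as quoted elsewhere in this thread (script names and paths may have changed in newer DeepSpeed versions), looks roughly like this:

cd csrc/aio/py_test
# create a 400MB test file of random data
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
# sweep the read configurations, logging into read-logs/
./run_read_sweep.sh input.file read-logs
# print the best (highest GB/sec) read result
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1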
The tail of the list should show the fastest speeds.
Here is the best result for the read benchmark:
3. Run Write Benchmark
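Assuming the write sweep mirrors the read one (the run_write_sweep.sh name, its arguments, and the write_speed metric are my assumptions here - check csrc/aio/py_test for the actual names), the commands would look roughly like:

mkdir write-logs
# sweep the write configurations, logging into write-logs/
./run_write_sweep.sh input.file write-logs
# print the best (highest GB/sec) write result
python parse_aio_stats.py --log_dir write-logs/aio_perf_sweep --metric write_speed | sort -k9 -n | tail -1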
The best result from the write report:
4. Contribute your data
We need more read/write data for various devices to figure out how to make the configuration process automated.
If you're contributing your data, please post:
Important: please make sure not to do any other I/O on the device under benchmark.
5. Derive the aio params block

Now we need to figure out how to use the results of the benchmark to configure aio.

Here is the final result:
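For this setup it is presumably the same block recommended earlier in the thread:

aio: {
  "block_size": 262144,
  "queue_depth": 32,
  "thread_count": 1,
  "single_submit": false,
  "overlap_events": true
}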
Most of the config block values come from the benchmark's best results for read and write - i.e., whichever configuration gives us the highest GB/sec throughput (the higher the number the better).

The schema of each line in the results is as follows:
read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec
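For example, a results line like the one contributed elsewhere in this thread,

('read', 'single', 'overlap', 1, 1, 16, 262144) = 3.035

reads as: a read benchmark, single event completion, overlapped event submission, 1 process, intra-process parallelism of 1, queue depth of 16, 256KB block size, achieving ~3.0 GB/sec.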
The best read config was:
which corresponds to:

single_submit=false, overlap_events=true, queue_depth=32, block_size=262144

It would be single_submit=true if the 2nd column is single instead of block, and overlap_events=false if the 3rd column is sequential instead of overlap.
The best write config was:
which corresponds to:
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144
Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. On many setups we tested, block_size=1M had consistently seemed optimal across two clusters, but in this particular setup block_size=256K seems to be optimal.

Finally, the last remaining config value, thread_count=1, is a reasonable default, since this is a per-rank configuration.

TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmarks don't agree.
Sample stats for an XPG Gammix S11 Pro 2TB NVMe drive with published specs of:
The benchmark records throughput for ~400 different configuration combinations
and so now we can choose a single configuration that will lead to the highest throughput for read and write
I tried my 860 Evo SSD and I'm getting ~0.5 GB/s read throughput, so about ~6x slower.
TODO/Questions to @tjruwase:
[ ] so we have a huge range of numbers - e.g. for read, 1 to 3GB/s - so I suppose this is the effective range depending on the kind of task, and the low and the high should both be considered - but how does this correlate to training? Which of the 400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and max?
[ ] what are the good numbers? So that the users will know if their NVMe is fast enough? I'm thinking the numbers from the paper?