[doc] profiling NVMe and configuring aio param section #998

Open · stas00 opened this issue Apr 23, 2021 · 78 comments

Comments
@stas00
Collaborator

stas00 commented Apr 23, 2021

Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.

(@tjruwase and I have been editing this post)

You need to do this on every new CPU/NVMe setup in order to configure the aio param section.

The following NVMe benchmark measures the end-to-end throughput of CPU<->NVMe reads and writes, so make sure to run it on the actual system you intend to use.

For this demonstration we are going to use:

  1. XPG Gammix s11 pro 2tb NVMe drive
  2. Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz setup.

1. Preparation

cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed

You may also have to install libaio-dev if the DeepSpeed NVMe driver fails to build. On Ubuntu it's just:

sudo apt install libaio-dev
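
On RPM-based distros the equivalent package is usually libaio-devel, e.g.:

sudo dnf install libaio-devel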

Depending on the speed of your NVMe, each benchmark could run for 30min or longer.

Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.

2. Run Read Benchmark

cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1

This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the csrc/aio/py_test folder to your NVMe drive and run the test there.
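
For example (the mount point below is illustrative - adjust it to wherever your NVMe drive is mounted):

# assuming the NVMe drive is mounted at /mnt/nvme
cp -r csrc/aio/py_test /mnt/nvme/aio_py_test
cd /mnt/nvme/aio_py_test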

You can, of course, also use it to test non-NVMe drives (e.g. a SATA SSD).

The tail of the list should show the fastest speeds.

Here is the best result for the read benchmark:

('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208

3. Run Write Benchmark

# cd csrc/aio/py_test
mkdir write-test-data
mkdir write-logs
./run_write_sweep.sh 400 write-test-data write-logs
python parse_aio_stats.py --log_dir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -1

The write report best result:

('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324

4. Contribute your data

We need more read/write data for various devices to figure out how to make the configuration process automated.

If you're contributing your data, please post:

  1. Your NVMe device name/size
  2. advertised max read/write spec (google: "device name spec")
  3. the results of the last 10 lines, i.e.:
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
python parse_aio_stats.py --log_dir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10

Important: please make sure not to do any other I/O on the device under benchmark.

5. Derive the aio params block

Now we need to figure out how to use the results of the benchmark to configure aio.

Here is the final result:

            "aio": {
                "block_size": 262144,
                "queue_depth": 32,
                "thread_count": 1,
                "single_submit": false,
                "overlap_events": true
            }

Most of the values in this config block come from the benchmark's best results for read and write - i.e. the configuration that gives the highest GB/s throughput (the higher the number, the better).

Schema of each line in results is as follows:

  • read: read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec
  • write: the same as read, except that the 2nd column is the size of the written data.

The best read config was:

('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208

which corresponds to single_submit=false, overlap_events=true, queue_depth=32, block_size=262144

single_submit=true if the 2nd column is single instead of block.
overlap_events=false if the 3rd column is sequential instead of overlap.

The best write config was:

('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324

which corresponds to: single_submit=false, overlap_events=true, queue_depth=32, block_size=262144

Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.

Reasonable defaults are hard to set because of device and system differences. Across the two clusters we tested, block_size=1M had consistently seemed optimal, but on this particular setup block_size=256K seems to be optimal.

Finally, the last remaining config value, thread_count=1, is a reasonable default, since this is a per-rank configuration.

TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmark results don't agree (a sketch of what this could look like is below).
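
A minimal sketch of what such automation might look like (hypothetical helper, not part of the repo; preferring the read result on a conflict is just one possible policy):

# Hypothetical sketch: map the best read and write sweep results to an "aio" block.
def result_to_params(result):
    # read rows:  ('read',  submit, overlap, procs, parallelism, queue_depth, block_size)
    # write rows: ('write', size, submit, overlap, procs, parallelism, queue_depth, block_size)
    submit, overlap = (result[1], result[2]) if result[0] == "read" else (result[2], result[3])
    return {
        "single_submit": submit == "single",
        "overlap_events": overlap == "overlap",
        "queue_depth": result[-2],
        "block_size": result[-1],
    }

def make_aio_config(best_read, best_write):
    read_p, write_p = result_to_params(best_read), result_to_params(best_write)
    if read_p != write_p:
        print(f"warning: best read {read_p} != best write {write_p}; preferring read")
    return {"aio": dict(read_p, thread_count=1)}

print(make_aio_config(
    ("read", "block", "overlap", 1, 1, 32, 262144),
    ("write", "400MB", "block", "overlap", 8, 1, 32, 262144),
))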


Sample stats for the XPG Gammix s11 pro 2tb NVMe drive with published specs of:

  • max read speed of up to 3500 MB/s
  • max write speed of up to 3000 MB/s

The benchmark records throughput for ~400 different configuration combinations:

  • read between 1.0-3.17 GB/s
  • write between 1.2-2.59 GB/s

so now we can choose a single configuration that will lead to the highest throughput for both read and write.

I tried my 860 Evo SSD and got ~0.5 GB/s read throughput, so about 6x slower.


TODO/Questions to @tjruwase:

[ ] so we have a huge range of numbers - e.g. for read, 1 to 3GB/s - I suppose this is the effective range depending on the kind of task, so the low and the high ends should both be considered - but how does this correlate to training? Which of the ~400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and the max?

[ ] what are the good numbers? So that users will know whether their NVMe is fast enough. I'm thinking the numbers from the paper?

@tjruwase
Contributor

Benchmark results from a DGX-2 node, which has 8 Micron 9200 NVMes raided into a single volume (GB/s):

|       | Peak | 1-process | multi-process |
|-------|------|-----------|---------------|
| Read  | 28   | 25.3      | 25.6          |
| Write | 24.8 | 19.2      | 21.7          |

@tjruwase
Contributor

tjruwase commented Apr 23, 2021

The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to your deepspeed config:

"aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
}
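
For context, a minimal sketch of where this block sits in a full DeepSpeed config file - the surrounding zero_optimization values are illustrative placeholders, not recommendations:

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme" },
        "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": false,
        "overlap_events": true
    }
}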

@tjruwase
Contributor

Unfortunately, I just noticed a bug in the write sweep script, which may lower the write perf. Basically, it is not doing the multi-process sweep because of this oversight.
I will merge #1001 to address this issue asap. This PR also avoids deleting the log folder, but rather creates an aio_perf_sweep subfolder. This subfolder is deleted on reruns though.

@stas00
Collaborator Author

stas00 commented Apr 23, 2021

The sweep results suggest that ZeRO-Infinity can be configured to do offloading at a read rate of 3GB/sec and a write rate of 2.6GB/sec. So you want to configure the asynchronous I/O module similarly to the configurations that achieve these numbers. Specifically, you want to add the following to your deepspeed config:

"aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
}

Oh, I missed that new config section!

So how does a user correlate their benchmark results to the above config that you prepared based on my benchmark results?

The description of each param in the asynchronous I/O module docs is very terse, and ideally we need a few paras explaining how to choose those values - which are good defaults, and which numbers should be changed according to one's results.

Thank you!

@stas00
Collaborator Author

stas00 commented Apr 23, 2021

I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong?

Especially given your report earlier:

Benchmark results from a DGX-2 node, which has 8 Micron 9200 NVMes raided into a single volume (GB/s):

|       | Peak | 1-process | multi-process |
|-------|------|-----------|---------------|
| Read  | 28   | 25.3      | 25.6          |
| Write | 24.8 | 19.2      | 21.7          |

So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.

@tjruwase
Contributor

  1. I just added the section into the docs based on your feedback, so you did not miss it.

  2. Sorry, I was not clear on how I came up with

"aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
}

thread_count=1 is a reasonable default, since this is a per-rank configuration.

The rest are based on the results of your sweep as follows:

Your best read config was ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208, which corresponds to
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.

Your best write config was ('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324, corresponding to
single_submit=false, overlap_events=true, queue_depth=32, block_size=262144.

Unfortunately, users don't currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.

Another challenge that is more obvious to me now is that reasonable defaults are hard to set because of device and system differences. Prior to your experiments, block_size=1M had consistently seemed optimal across two clusters, but in your case, block_size=256K seems to be optimal.

Does this help?

@stas00
Collaborator Author

stas00 commented Apr 23, 2021

This helps a lot, thank you!

Can we make parse_aio_stats.py take in both read and write reports and generate the recommended config for the user?

@tjruwase
Contributor

tjruwase commented Apr 23, 2021

I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong?
So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.

Can you clarify the efficiency standard you are referring to in the paper?

Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:

  1. What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.

  2. How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.

  3. Model and batch sizes matter, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.

  4. If possible, scaling to more nodes linearly increases the aggregate NVMe bandwidth.

@stas00
Collaborator Author

stas00 commented Apr 23, 2021

I looked again through your paper - please correct me if I'm wrong, but it looks like we need at least 3GB/s NVME<->CPU bandwidth per GPU, so really my NVME setup is only barely good for a single GPU to meet the efficiency standard you define in the paper. Am I wrong?
So practically it's not enough to just have a single NVMe to benefit from ZeRO-Infinity if I understand it correctly.

Can you clarify the efficiency standard you are referring to in the paper?

[screenshot of the relevant paper excerpt omitted]

And it was mentioned several times in previous sections.

Whether or not a single 3GB/sec NVMe is sufficient for ZeRO-Infinity depends on a number of factors including:

1. What is offloaded to NVMe. For example if only optimizer state is offloaded, then only the CPU would access NVMe.

2. How well the asynchronous NVMe accesses overlap with forward and backward (and optimizer update). ZeRO-Infinity leverages asynchrony to hide/minimize the NVMe latencies.

3. Model and batch sizes matter, since they increase forward and backward time more than they increase optimizer and NVMe time. We have seen cases where optimizer time (which includes NVMe) is much smaller than forward/backward time.

4. If possible, scaling to more nodes linearly increases the aggregate NVMe bandwidth.

Thank you for sharing these considerations / questions to ask, @tjruwase.

How do we translate these into something actionable by the user? That is, what exact steps should they follow to set up each of these values:

            "offload_optimizer": {​​​​​
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 4,
                "fast_init": false
            }​​​​​,
            "offload_param": {​​​​​
                "device": "nvme",
                "nvme_path": "/local_nvme",
                "pin_memory": true,
                "buffer_count": 5,
                "buffer_size": 1e8,
                "max_in_cpu": 1e9
            }​​​​​

besides device, nvme_path and pin_memory which don't need explanation.

And you addressed how to get these numbers/flags - which is great!

            "aio": {​​​​​
                "block_size": 262144,
                "queue_depth": 32,
                "thread_count": 1,
                "single_submit": false,
                "overlap_events": true
            }​​​​​

stas00 changed the title from "[doc] compile NVMe profiling docs" to "[doc] profiling NVMe and configuring aio param section" on Apr 24, 2021
@stas00
Collaborator Author

stas00 commented Apr 24, 2021

With the updated benchmark I get slightly worse results than before for write (was 2.59), and the best config has now switched to single_submit = true:

python parse_aio_stats.py --log_dir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
('write', '400MB', 'single', 'overlap', 1, 1, 4, 1048576) = 2.542780607573091
('write', '400MB', 'block', 'overlap', 1, 1, 32, 1048576) = 2.549606370281151
('write', '400MB', 'single', 'overlap', 1, 1, 16, 524288) = 2.560568126052968
('write', '400MB', 'block', 'overlap', 1, 1, 16, 1048576) = 2.5607282070838893
('write', '400MB', 'single', 'overlap', 1, 1, 8, 524288) = 2.569547474836188
('write', '400MB', 'block', 'overlap', 1, 1, 8, 524288) = 2.577944913420765
('write', '400MB', 'block', 'overlap', 1, 1, 4, 262144) = 2.580567932852312
('write', '400MB', 'single', 'overlap', 1, 1, 4, 262144) = 2.584932481576203
('write', '400MB', 'block', 'overlap', 1, 1, 32, 262144) = 2.5864627469800396
('write', '400MB', 'single', 'overlap', 1, 1, 32, 262144) = 2.586675086832965

The read benchmark's best throughput hasn't changed, but the winning config has changed too!

('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.159043494691866
('read', 'block', 'overlap', 1, 1, 4, 1048576) = 3.1590617679099946
('read', 'block', 'overlap', 1, 1, 8, 1048576) = 3.1595369457938087
('read', 'single', 'overlap', 1, 1, 8, 262144) = 3.1604938271604937
('read', 'block', 'overlap', 1, 1, 8, 262144) = 3.1612316918107815
('read', 'block', 'overlap', 1, 1, 16, 524288) = 3.1612926877741097
('read', 'single', 'overlap', 1, 1, 8, 524288) = 3.1613170868185194
('read', 'block', 'overlap', 1, 1, 8, 524288) = 3.1615855011664906
('read', 'single', 'overlap', 1, 1, 16, 131072) = 3.1634717867128006
('read', 'single', 'overlap', 1, 1, 32, 131072) = 3.1637100215689946

So am I correct that I now need to change my config to:

"single_submit": true,

But what should I do with block_size - the read benchmark is at 131072, whereas the write one is at 262144 - how do we reconcile this?

Also, why does the read benchmark run echo 1 > /proc/sys/vm/drop_caches, but the write one doesn't? Is it not necessary because it writes, and the cache is then always fully invalidated?

@stas00
Collaborator Author

stas00 commented Apr 24, 2021

@tjruwase, what do you think about putting the 400MB column last or removing it completely - since it's always the same number it doesn't tell the user anything? Then it'd be easier to see the 2 sets aligned. Or alternatively, have the same column for read too?

@tjruwase
Contributor

@stas00, regarding the changing best configs and results: since the perf differences are so small I would put it down to noise. Also, as you noticed, the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are overlap=true, queue_depth >= 4, block_size >= 256K. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?

Regarding the write benchmark not dropping the caches: I did not do that because I thought it was more significant for reads than for writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.

@stas00
Collaborator Author

stas00 commented Apr 24, 2021

@stas00, regarding the changing best configs and results: since the perf differences are so small I would put it down to noise. Also, as you noticed, the top 10 (or more) perfs are very similar, so it seems to me that the most significant factors are overlap=true, queue_depth >= 4, block_size >= 256K. All of these observations strengthen your call for users to be shielded from this information overload. So I will focus on having the script generate an optimal aio config setting, while hiding the optimality logic in the script. Does that sound reasonable?

That sounds fantastic! Thank you, @tjruwase

I'd also add that, since we currently have a single config, perhaps the final script should take the output of both parsers? Or take the read and write log dirs, run the parser on them and dump the recommended config, so the user will need to run:

  1. read benchmark
  2. write benchmark
  3. create_config read-logs write-logs

and of course the first 2 can also be merged into the 3rd item as the next stage. But this would be a great start.

Regarding the write benchmark not dropping the caches: I did not do that because I thought it was more significant for reads than for writes. If possible, can you please check whether adding cache dropping to the write benchmark makes much difference? I will update the script in my next PR.

I will test and get back to you.

@stas00
Collaborator Author

stas00 commented Apr 24, 2021

oh and we probably should have the instruction sudo ./run_read_sweep.sh input.file read-logs so the sudo prompt doesn't come as a strange surprise after the script started.

@stas00
Collaborator Author

stas00 commented Apr 24, 2021

OK, so adding cache invalidation for write had a negligible impact of ~1e-2 difference. Hence, it's probably safe to skip it (it also slows down the overall run time, I think).

('write', '400MB', 'block', 'overlap', 1, 1, 2, 524288) = 2.5284379827435344
('write', '400MB', 'single', 'overlap', 1, 1, 8, 1048576) = 2.536109060119592
('write', '400MB', 'block', 'overlap', 1, 1, 4, 524288) = 2.5423465809286765
('write', '400MB', 'block', 'overlap', 1, 1, 8, 1048576) = 2.551528129258322
('write', '400MB', 'single', 'overlap', 1, 1, 32, 524288) = 2.5574265894943213
('write', '400MB', 'single', 'overlap', 1, 1, 4, 524288) = 2.572638084590551
('write', '400MB', 'block', 'overlap', 1, 1, 32, 524288) = 2.575145071954432
('write', '400MB', 'block', 'overlap', 1, 1, 16, 262144) = 2.5767529201574613
('write', '400MB', 'single', 'overlap', 1, 1, 16, 262144) = 2.577214990758583
('write', '400MB', 'block', 'overlap', 1, 1, 8, 262144) = 2.583110769162854

@SeanNaren
Contributor

Here are results from an A100 hyperplane server from Lambda, using DeepSpeed master and the instructions collected above!

Micron 7300 2TB NVMe (Max Read 3GB/s, Max Write 1.9 GB/s)

Read

('read', 'block', 'overlap', 2, 1, 8, 524288) = 2.0683565061893523
('read', 'block', 'overlap', 2, 1, 16, 524288) = 2.0690931110843103
('read', 'block', 'overlap', 2, 1, 16, 1048576) = 2.071279891429738
('read', 'block', 'sequential', 2, 1, 16, 1048576) = 2.0751389262701263
('read', 'block', 'sequential', 2, 1, 32, 524288) = 2.0761578914021417
('read', 'block', 'sequential', 2, 1, 32, 1048576) = 2.0790717269594086

For most of the configurations the difference is negligible. Range of 0.33-2.07 GB/s

Write

('write', '400MB', 'block', 'sequential', 1, 1, 4, 131072) = 1.9950197565890813

Again looking at the first 100ish tail outputs, the difference is negligible. Range of 1.23-1.995 GB/s

I think we can potentially reduce the grid search space to home in on suggestions initially. It might also be a good idea to compare the same configurations across our 3 environments to see what the differential compared to the max throughput is.

@tjruwase
Contributor

tjruwase commented Apr 26, 2021

@SeanNaren, thanks for sharing your results. This is great since it is a different device from @stas00 and myself, so this is helping to evaluate the perf portability. Please see a few observations and questions below.

  1. Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?

  2. The read perf of 67% of peak rate is much lower than what @stas00 and I have observed, and I would like some help further understanding what is going on. I will defer the questions to the end of this post.

  3. I agree that reducing the search space is critical, as @stas00 already noted. However, your results, which show sequential > overlap, deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?
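
For illustration only, a reduced read sweep over just that space might look roughly like this (a sketch - the test_ds_aio.py flags are the ones visible in the logs above; the log-file naming is arbitrary):

# Hypothetical reduced read sweep over: block submission, overlap|sequential,
# queue_depth in {4,8,16}, block_size in {256K,512K,1M}. Adjust paths as needed.
INPUT=input.file
LOGS=read-logs-reduced
mkdir -p "$LOGS"
for ovl in overlap sequential; do
  for qd in 4 8 16; do
    for bs in 256K 512K 1M; do
      opts="--handle --io_parallel 1 --threads 1 --queue_depth $qd --block_size $bs"
      if [ "$ovl" = "overlap" ]; then opts="$opts --overlap_events"; fi
      sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'
      python ./test_ds_aio.py --read_file "$INPUT" $opts \
        &> "$LOGS/read_block_${ovl}_d${qd}_bs${bs}.txt"
    done
  done
done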

Regarding further investigation of the relatively poor read performance, can you help with the following:

  1. Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?

  2. Can you run an equivalent fio experiment to the best read configuration using the following fio config file:

[global]
bs=1M
iodepth=32
direct=1
ioengine=libaio
group_reporting
time_based
runtime=120
numjobs=2
name=raw-read
rw=read
directory=/local_nvme/
thread

[job1]
filename=random_400MB.pt

@stas00
Collaborator Author

stas00 commented Apr 26, 2021

3. I agree that reducing the search space is critical, as @stas00 already noted. However, your results, which show sequential > overlap, deviate from our observations, making it harder for me to confidently propose things to prune. So far the following seems like a reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?

Perhaps we should try and find a few more devices to acquire more data before we prune? Currently Sean's device seems to be one-off - basing a choice on 3 inputs is a bit hard and we might miss something.

@tjruwase
Contributor

@stas00, I completely agree. It would be awesome to find additional devices. Unfortunately, I don't have access to such diversity here. Also by pruning, I meant only the default setting of the search script, users will have the option of defining their own space.

@stas00
Collaborator Author

stas00 commented Apr 26, 2021

I added "4. Contribute your data" instructions to the OP - let's see if we can get some contributions.

I put out a call to the community inviting contributions:
https://discuss.huggingface.co/t/deepspeed-zero-infinity-looking-for-nvme-device-benchmarks/5787

@thefazzer

I need to use sparse-attn/support-latest-triton as I'm running an RTX 3090

Setup is OK (as per below) but none of the DS optimisers will JIT compile (they all return an error at runtime re: the sm_86 arch).

Pretty sure problem is still outstanding for this card

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

@stas00
Collaborator Author

stas00 commented Apr 27, 2021

Thank you for wanting to help us to gather the data, @thefazzer!

I have the same card, it works without problems if you have the right torch/cuda setup.

Let's not derail this thread - how about we discuss this in a separate issue here on DeepSpeed? Please post the output of python -m torch.utils.collect_env there and tag @stas00 so I won't miss it - I will help you get it working and then we can get back here.

Thank you!

@SeanNaren
Contributor

  1. Am I reading correctly that the peak hardware write rate is 1.9GB/s, while the benchmark is reporting 1.99?

I looked all over the internet for this specific NVMe, and the reported max write was 1.9GB/s and these were the numbers collated from my run! I can confirm with Lambda on this.

  1. I agree that reducing the search space is critical, as @stas00 already noted. However, your results which show sequential > overlap deviates from our observations, making it harder for me to confidently propose things to prune. So far the following seems like the reasonable reduced space: block, [sequential|overlap], queue_depth=[4,8,16], block_size>=[256K,512K,1M]. What do you guys (@stas00) think?

I haven't looked into this, but cloud providers may provide another standard for NVMe if we're trying to just collected data numbers.

Regarding further investigation of the relatively poor read performance, can you help with the following:

  1. Did the benchmark have exclusive access to the device, or is the device shared with other system components, such as the OS?

This is shared with the OS as a root drive, there was nothing else running on the node (other than sys processes) at the time of running this benchmark.

  1. Can you run an equivalent fio experiment to the best read configuration using the following fio config file:

More than happy to, however unsure how to run this! Maybe I missed something but if you could explain, can run it!

@tjruwase
Contributor

@SeanNaren, we have seen that device sharing with OS as a root drive does impact performance, especially read, even if nothing else is running on the node.

For fio, update the directory and filename fields of the config file (e.g., config.fio), then install and run as follows:
setup: sudo apt-get install fio
run: fio config.fio

@stas00
Collaborator Author

stas00 commented Apr 27, 2021

@tjruwase, could we please change the benchmark to bail out with the error message if it can't run? Otherwise it dumps the error into the benchmark log files, and it will do so for all ~400 files without the user knowing it's not really running...

cat write-logs/aio_perf_sweep/write_400MB_single_overlap_t1_p1_d1_bs128K.txt
Testing deepspeed_aio python frontend
args = Namespace(block_size='128K', gpu=False, handle=True, io_parallel=1, loops=1, overlap_events=True, queue_depth=1, read_file=None, single_submit=True, threads=1, validate=False, write_file='write-test-data/ds_aio_write_400MB.pt', write_size='400M')
tid 0: schedule = {'pre': <function pre_handle_write at 0x7f517fcfe790>, 'post': <function post_handle at 0x7f517fcfe820>, 'main': <function main_parallel_write at 0x7f517fcfe940>}
tid 0: running pre-task
tid 0: Allocate tensor of size 419430400 bytes
tid 0: Write file write-test-data/ds_aio_write_400MB.pt.0 of size 419430400 bytes from buffer on device cpu
 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 144, in _aio_handle_tasklet
    ctxt = schedule["pre"]((args, tid))
  File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 57, in pre_handle_write
    ctxt = pre_handle(args, tid, False)
  File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 32, in pre_handle
    handle = AsyncIOBuilder().load().aio_handle(args.block_size,
  File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 215, in load
    return self.jit_load(verbose)
  File "/home/stas/hf/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 219, in jit_load
    raise RuntimeError(
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./test_ds_aio.py", line 117, in <module>
    main()
  File "./test_ds_aio.py", line 113, in main
    multiprocess_function(args, False)
  File "/home/stas/hf/DeepSpeed/csrc/aio/py_test/ds_aio_handle.py", line 174, in aio_handle_multiprocessing
    pool_results = p.map(_aio_handle_tasklet, pool_params)
  File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/stas/anaconda3/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
RuntimeError: Unable to JIT load the async_io op due to it not being compatible due to hardware/software issue.

Also as I mentioned earlier the WARNING:

 [WARNING]  async_io requires the libraries: ['libaio-dev'] but are missing.

should probably be an assert

Thank you!

@tjruwase
Contributor

Hmm, let me take a look at both issues today.

@stas00
Collaborator Author

stas00 commented Apr 27, 2021

oh, and btw, my earlier suggestion to move sudo outside the shell script was a bad idea - since it then tries to run from a different environment and chances are it won't work. So sudo inside the read benchmark it is.

Perhaps we should just warn the user once with an echo explaining why we are asking for sudo, e.g.:

echo "FYI: if a sudo password prompt pops up, it is to enable flushing of the I/O cache"

@tjruwase
Contributor

@stas00, two points on error handling.

  1. I am able to make the benchmark break early with something like below. Is this sufficient?
bash run_write_sweep.sh 400 /local_nvme/aio_test_write /tmp/write_400MB_1p
sync; sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'
python ./test_ds_aio.py --write_file /local_nvme/aio_test_write/ds_aio_write_400MB.pt --write_size 400M --io_parallel 1 --queue_depth 1 --block_size 128K --single_submit --overlap_events --handle --threads 1 &> /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
sync
Benchmark failed - for error details examine /tmp/write_400MB_1p/aio_perf_sweep/write_single_overlap_t1_p1_d1_bs128K.txt
  2. Turning the missing libaio-dev into an assert is problematic because there are higher-level codes (e.g., unit tests) that are able to handle this situation without failing. On the other hand, the perf script (test_ds_aio.py) is not one of such higher-level codes, so perhaps we could make it fail gracefully in such situations. Then again, perhaps (1) should take care of this for the purpose of perf sweep runs. What do you think?

@alexcoca

alexcoca commented Mar 3, 2023

A bit late to the party, but I think my machine has 2 NVMe SSDs and an RTX 3090 24GB. Happy to contribute some data if there is still a need!

@tjruwase
Contributor

tjruwase commented Mar 3, 2023

@alexcoca, thanks! Please contribute when you can.

@stanleyshly

stanleyshly commented Mar 13, 2023

Here are my results. Much more consumer pc level, but I have Kioxia KXG6AZNV512G paired with a RX 5600XT.

('write', 'single', 'overlap', 1, 1, 4, 1048576) = 2.7205507521200336
('write', 'single', 'overlap', 1, 1, 4, 131072) = 2.7087618128071
('write', 'single', 'overlap', 1, 1, 4, 262144) = 2.76478876804563
('write', 'single', 'overlap', 1, 1, 4, 524288) = 2.768320438497632
('write', 'single', 'overlap', 1, 1, 8, 1048576) = 2.759545308318217
('write', 'single', 'overlap', 1, 1, 8, 131072) = 2.746375937026458
('write', 'single', 'overlap', 1, 1, 8, 262144) = 2.7491408934707904
('write', 'single', 'overlap', 1, 1, 8, 524288) = 2.5591842600171195
('write', 'single', 'overlap', 2, 1, 1, 131072) = 2.0435386995764873
('write', 'single', 'overlap', 2, 1, 1, 262144) = 2.461640166923964

('read', 'single', 'overlap', 1, 1, 16, 262144) = 3.0350070577251524
('read', 'single', 'overlap', 1, 1, 16, 524288) = 3.0359068622909384
('read', 'single', 'overlap', 1, 1, 8, 524288) = 3.038226316706907
('read', 'single', 'overlap', 1, 1, 32, 524288) = 3.0395960068197967
('read', 'single', 'overlap', 1, 1, 32, 262144) = 3.0459079906748814
('read', 'single', 'overlap', 1, 1, 8, 1048576) = 3.047913860715694
('read', 'single', 'overlap', 1, 1, 16, 1048576) = 3.0482484292701804
('read', 'single', 'overlap', 1, 1, 4, 1048576) = 3.05166068155742
('read', 'single', 'overlap', 1, 1, 32, 131072) = 3.0525988265827997
('read', 'single', 'overlap', 1, 1, 32, 1048576) = 3.0528547844331237

@tjruwase
Contributor

@stanleyshly, excellent! Thanks so much for contributing. Do you mind updating with a description of your NVMe SSDs and the peak read/write speeds for sequential access?

@stanleyshly

@tjruwase I'm running a single 512GB NVMe SSD (Kioxia KXG6AZNV512G). It looks to have TLC flash.
With a queue depth of 32 and one thread, the max sequential read looks to be 3190 MB/s, with a max sequential write of 2458 MB/s.

@tjruwase
Contributor

Thanks.

By the way, I was concerned about the fact that reported write speed exceeded the peak. So, I did a search and found this link which indicates the write peak is 2800MB/s. Do you have any thoughts on this?

Either way, it looks like we are pretty much saturating the SSD.

@stanleyshly

I'm not exactly sure why. Could it be the DRAM cache?

@tjruwase
Contributor

It is unlikely DRAM cache because the library uses O_DIRECT. Can you share the source of your peak numbers? Does the link I shared look believable to you?

@stanleyshly

stanleyshly commented Mar 14, 2023

I'm using the script right here: https://unix.stackexchange.com/a/480191

@tjruwase
Contributor

Got it. In general, when it comes to peak numbers, I tend to trust the manufacturer specs over software (benchmarks).

@stanleyshly

I see, I'll use those next time.

@stas00
Collaborator Author

stas00 commented Mar 23, 2023

@tjruwase, how should the benchmarking be approached on distributed FS? It'd be pointless to flush the local cache.

Should the benchmark be changed to create a new file before each measurement?

@tjruwase
Contributor

@stas00, that is a good point. I have a POV creating a new file for each measurement. I will try to create a PR for that.

@stas00
Collaborator Author

stas00 commented Mar 23, 2023

Do we know how the caching mechanism works? E.g. would it be enough to do:

echo 1 >> input.file

before each measurement to invalidate the cache, rather than the much slower recreation with dd, or does it cache each section of the file separately?

@stas00
Collaborator Author

stas00 commented Mar 27, 2023

I was told that this is the tool used for profiling multi-node IO https://github.com/hpc/ior - I tried it on a single node so far and it seems to work quite well.

eisene added a commit to CompassionAI/garland that referenced this issue Mar 30, 2023
@gaowayne

@tjruwase @stas00 guys, I am new to this. May I know what the NVMe drive capacity requirement is, and whether DeepSpeed will need higher-density NVMe SSDs like QLC SSDs in the future?

@tjruwase
Contributor

@gaowayne, there are no capacity requirements. Are you experiencing issues?

@muellerzr

Results from my Crucial T705, as requested by @tjruwase :)

Note I had issues with this version, so I wound up running @stas00's benchmark here: https://github.com/stas00/ml-engineering/tree/master/storage#fio

Advertises 14.5GB/s read / 12.7GB/s write
fio benchmark results (I also added in a 5g file size, considering all sharded model checkpoints are 5GB)

fio benchmark results for workhorse on 2024-07-08-13:05:50

partition /mnt/superfast/fio-test

| filesize | op    | lat msec | bw MBps | IOPS    | jobs |
|----------|-------|----------|---------|---------|------|
| 16k      | read  | 3.6      | 1097.9  | 281061  | 16   |
| 16k      | write | 0.4      | 9597.0  | 2456838 | 16   |
| 1m       | read  | 0.8      | 4772.6  | 1221783 | 16   |
| 1m       | write | 0.6      | 6794.8  | 1739458 | 16   |
| 1g       | read  | 1.0      | 4203.4  | 1076057 | 16   |
| 1g       | write | 0.6      | 6763.6  | 1731488 | 16   |
| 5g       | read  | 0.9      | 4696.8  | 1202384 | 16   |
| 5g       | write | 2.2      | 1791.5  | 458622  | 16   |

@tjruwase
Contributor

tjruwase commented Jul 8, 2024

Results from my Crucial T705, as requested by @tjruwase :)

Note I had issues with this version,

@muellerzr, can you please share some details, or a stack trace, of the issue encountered? We need to fix it. Thanks!

@muellerzr

@tjruwase when I ran the benchmark, rather than GB/s all values were “None” :)

@tjruwase
Contributor

tjruwase commented Jul 8, 2024

@tjruwase when I ran the benchmark, rather than GB/s all values were “None” :)

@muellerzr, did the following script generate a folder of log files?

[screenshot of the script omitted]

@muellerzr

@tjruwase I think so, I ran the following:

cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1

@muellerzr

The output of which gave:

('read', 'single', 'sequential', 8, 1, 8, 524288) = None

@tjruwase
Contributor

tjruwase commented Jul 8, 2024

Can you dump contents of one of the files in read-logs/aio_perf_sweep?

@tjruwase
Contributor

tjruwase commented Jul 9, 2024

@tjruwase I think so, I ran the following:

cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --log_dir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1

@muellerzr, I just realized/remembered that those are the old steps. Sorry for not catching that earlier. Below are links to the new instructions. Whenever you get a chance to try, please let me know of any issues.

  1. [doc] profiling NVMe and configuring aio param section #998 (comment)
  2. [doc] profiling NVMe and configuring aio param section #998 (comment)
