IO Performance Issues #4

Open
kylegenova opened this issue Aug 1, 2020 · 2 comments

kylegenova commented Aug 1, 2020

Previously (see #3), a user reported poor IO performance that was bottlenecking training. In response, a recent commit (801f5b1) added two new flags to meshes2dataset.py: --optimize and --optimize_only. These flags generate a sharded, compressed tfrecords dataset to reduce IO overhead. The files are written to a subdirectory inside the dataset_directory path. The train.py script looks for that directory and, if it exists, trains from it rather than from the existing files (which remain, because they are useful for interactive visualization and the evaluation scripts).
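For readers unfamiliar with why sharding and compression cut IO overhead, here is a minimal tf.data sketch of the general reading pattern. The shard filename pattern, feature names, and shapes below are illustrative assumptions, not the actual schema produced by meshes2dataset.py.

```python
# Minimal sketch (not the repository's actual input pipeline) of reading a
# sharded, GZIP-compressed TFRecord dataset with tf.data.
import tensorflow as tf

def make_dataset(tfrecord_dir, batch_size=24):
  # Each shard is a separate file; interleaving reads across shards in
  # parallel is what hides per-file IO latency on slow storage.
  files = tf.data.Dataset.list_files(tfrecord_dir + '/*.tfrecords', shuffle=True)
  dataset = files.interleave(
      lambda path: tf.data.TFRecordDataset(path, compression_type='GZIP'),
      cycle_length=8,  # read up to 8 shards concurrently
      num_parallel_calls=tf.data.experimental.AUTOTUNE)

  def parse(example_proto):
    # Hypothetical feature spec, for illustration only.
    features = {'sample_points': tf.io.FixedLenFeature([10000, 3], tf.float32)}
    return tf.io.parse_single_example(example_proto, features)

  return (dataset
          .map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
          .shuffle(1000)
          .batch(batch_size)
          .prefetch(tf.data.experimental.AUTOTUNE))
```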

Commit f78dbc4 enables this behavior by default. If you are an existing user seeing less than 100% GPU utilization, please Ctrl+C training, git pull, rerun the meshes2dataset.py script with the flags --optimize and --optimize_only (the latter flag skips the first part of dataset creation, which has already been run), and rerun the training command (no change to the training command is required; it will resume using the new tfrecords data). Unfortunately, this new meshes2dataset.py step can take several hours on ShapeNet, and it also consumes ~3 MB of extra disk space per dataset element (totaling 129 GB extra on ShapeNet). However, in the tested cases it has resulted in 100% GPU utilization. With this change, I see ~3.5 steps/sec with a batch size of 24 on a V100, and ~2 steps/sec with a batch size of 24 on a P100.
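To illustrate why the training command does not need to change, here is a rough sketch of the fallback behavior described above. The subdirectory name and function name are placeholders for illustration, not the actual ldif code.

```python
# Sketch of the described fallback: prefer the optimized tfrecords
# subdirectory inside dataset_directory when it exists, otherwise use the
# original per-element files. 'optimized' is a hypothetical directory name.
import os

def select_data_source(dataset_directory):
  optimized_dir = os.path.join(dataset_directory, 'optimized')  # hypothetical name
  if os.path.isdir(optimized_dir) and os.listdir(optimized_dir):
    # Sharded + compressed tfrecords are present: train from them.
    return ('tfrecords', optimized_dir)
  # Fall back to the original (slower to read) per-element files.
  return ('original', dataset_directory)

# Example: source_kind, path = select_data_source('/path/to/dataset_directory')
```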

The size of the shards and their contents could be further optimized, but without a failing example I'm not sure what the optimal settings are. If you experience less than 100% GPU utilization after this change, please comment below and I will do my best to address your issue. Similarly, if you can confirm 100% utilization on a networked HDD, that would be highly appreciated, since I can't easily test that setup.

One other minor note: a byproduct of this change is that the 10K points per sample are no longer randomly re-drawn from the 100K each time a mesh is seen; instead, the same 10K points are used every time. Because those 10K points are never seen by the network directly, but are instead used to generate local pointclouds based on the SIF elements, I anticipate no effect from this change unless the dataset is extremely small. However, I will verify this quantitatively before closing the issue.
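For concreteness, here is a small numpy sketch of the difference; the array contents, sizes, and function names are illustrative stand-ins, not the repository's sampling code.

```python
# Minimal numpy sketch of the sampling change. 'all_points' stands in for a
# mesh's 100K precomputed surface samples.
import numpy as np

rng = np.random.default_rng(0)
all_points = rng.standard_normal((100_000, 3)).astype(np.float32)

# Old behavior: draw a fresh random 10K subset every time the mesh is seen.
def sample_old(points, k=10_000):
  idx = rng.choice(points.shape[0], size=k, replace=False)
  return points[idx]

# New behavior: one fixed 10K subset is chosen at dataset-creation time,
# stored in the tfrecord, and reused on every visit to the mesh.
fixed_subset = all_points[rng.choice(all_points.shape[0], size=10_000, replace=False)]

def sample_new():
  return fixed_subset
```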

@chengzhag

The tfrecords dataset significantly improved training speed on a machine with limited storage bandwidth! Thanks for the update!

With a storage server connected over GbE, a single 1080 Ti GPU can train at 1.5 steps/sec at almost full utilization:
[screenshot of GPU utilization]
Previously, the same GPU could only achieve less than 0.6 steps/sec with the same network storage server, or 1.0 steps/sec with an NVMe SSD (tested today).

@chengzhag

PS: with the tfrecords dataset and an NVMe SSD, a 1080 Ti can achieve 1.8 steps/sec at full utilization:
[screenshot of GPU utilization]
