Training a ResNet50 on ImageNet for a project and noticed the following issues:
Top-1 accuracy drops significantly with FFCV; the PyTorch DataLoader run is almost always better on the test set.
The difference in training-time metrics is less drastic, but accuracy is still around 2% lower at each epoch in comparison.
There is little to no speedup from FFCV on an RTX A6000 GPU.
For some context:
The dataset was generated with the ffcv-imagenet repository using the standard parameters 500, 0.50, 90. The dataloader object I created is provided below in case there is an issue with it.
Every other component is exactly the same for the two runs.
With FFCV, an epoch takes 38 minutes starting from the first epoch, and this does not improve at all. With the PyTorch DataLoader an epoch takes 41 minutes, so the difference is barely noticeable.
Hi @andrewilyas, thanks for the prompt response. This is a custom harness we wrote ourselves for training and pruning networks. Could you let me know which specific parts I should send across?
Btw, I found one interesting result that seems to have almost fixed the problem: adding shuffle_indices=True when regenerating the beton.
The loss/accuracy curves now look like this:
Blue - FFCV (without shuffle_indices=True when creating the beton)
Orange - PyTorch DataLoader
Maroon - FFCV (with shuffle_indices=True when creating the beton)
Hope this helps people out. I think this is related to the use of OrderOption.QUASI_RANDOM, as indicated in issue #304.
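To illustrate why shuffling at write time could matter, here is a rough sketch of a page-aware shuffle on a class-sorted file. This is a simplified toy model and not FFCV's actual QUASI_RANDOM implementation; the function names, page size, and number of pages held in memory are all made up for illustration.

```python
import random

def quasi_random_order(n_samples, page_size, pages_in_memory, rng):
    """Toy page-aware shuffle (an assumption, not FFCV's real algorithm):
    shuffle the page order, then shuffle samples only within each group
    of pages that is held in memory together."""
    pages = [list(range(s, min(s + page_size, n_samples)))
             for s in range(0, n_samples, page_size)]
    rng.shuffle(pages)
    order = []
    for i in range(0, len(pages), pages_in_memory):
        group = [idx for page in pages[i:i + pages_in_memory] for idx in page]
        rng.shuffle(group)
        order.extend(group)
    return order

def mean_classes_per_batch(labels, order, batch_size):
    """Average number of distinct classes seen in each batch."""
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return sum(len({labels[j] for j in b}) for b in batches) / len(batches)

rng = random.Random(0)
n_classes, per_class = 100, 100

# File written in dataset order: all samples of class 0, then class 1, ...
sorted_labels = [c for c in range(n_classes) for _ in range(per_class)]
# Same data, but with sample positions shuffled when the file is written.
shuffled_labels = sorted_labels[:]
rng.shuffle(shuffled_labels)

order = quasi_random_order(len(sorted_labels), page_size=50,
                           pages_in_memory=4, rng=rng)
diversity_sorted = mean_classes_per_batch(sorted_labels, order, batch_size=64)
diversity_shuffled = mean_classes_per_batch(shuffled_labels, order, batch_size=64)
print(f"classes per batch, class-sorted file: {diversity_sorted:.1f}")
print(f"classes per batch, shuffled file:     {diversity_shuffled:.1f}")
```

In this toy setup, batches drawn from the class-sorted file contain only a handful of classes, while the same read order over a shuffled file gives far more diverse batches. If something similar happens with the real loader, it could explain the accuracy gap.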
Regarding the speedup:
I also tested this on the CelebA dataset and can provide code to reproduce. There was little or no speedup from FFCV there, and the throughput gain on ImageNet is very small (38 mins vs. 41 mins per epoch) [Device: NVIDIA A6000].
Is this speedup only to be expected with mixed-precision training?
(Please let me know if it's better to open a separate issue for the speedup-related part.)
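For reference, a small sketch of how the numbers could be checked. It computes the relative speedup from the epoch times quoted above and times a loader on its own, with no model work, to see whether data loading is the bottleneck at all. The helper names are hypothetical, and `loader` can be any iterable of batches.

```python
import time

def relative_speedup(baseline_minutes, new_minutes):
    """How many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes

def loader_throughput(loader, max_batches=100):
    """Iterate the loader without any model work; return batches per second."""
    start = time.perf_counter()
    n = 0
    for n, _batch in enumerate(loader, start=1):
        if n >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return n / elapsed if elapsed > 0 else float("inf")

# Epoch times from this thread: 41 min (PyTorch DataLoader) vs 38 min (FFCV).
speedup = relative_speedup(41, 38)
print(f"FFCV speedup: {speedup:.3f}x")

# Stand-in loader for demonstration; replace with the real loader objects.
fake_loader = [[0] * 64 for _ in range(10)]
print(f"{loader_throughput(fake_loader, max_batches=5):.0f} batches/s")
```

If both loaders, timed alone, deliver batches much faster than the full training loop consumes them, then the GPU compute is probably the limiting factor and a faster loader would not be expected to change the epoch time much.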
Hello,
As you can see above, the performance is far worse when FFCV is used.
I would appreciate any insight into why this is happening and what could be done to improve it.
Thanks!