Make CNN w/pooling propagation use depthwise separable CNN with a small but > 1 #channels.

This triggers a different CuDNN routine that leads to a >4X speedup! For TPU, use 128 channels; the speedup is ~3X. TPU remains slow though, due to padding elsewhere. Namely, as Blake explained to me in the chat, for TPU we need to have a >=128-sized dimension in the covariance tensor _throughout_ the code (at least - there might be other padding issues). This should be doable by having batches of sizes e.g. (128, 4), but I think batching currently only works with square batches. For CPU there is no change.

New benchmark for float32, 21-layer 3x3-SAME-CNN-ReLU-GAP: 0.0027001 seconds per NTK entry per V100 (0.0029 with 8-GPU batching), which brings us down to a theoretical 937 hours for the lower triangle of a 50K kernel (+ batching/beam overhead). float64 is ~0.0065 seconds per entry (2.3X slower than float32).

PiperOrigin-RevId: 307117825
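The depthwise trick the commit describes can be sketched in JAX. This is an illustrative sketch, not the actual neural-tangents change: the function name, shapes, and the 4-channel choice are assumptions; the real point is that `feature_group_count` equal to the channel count makes the convolution depthwise, which can route to a different (grouped) CuDNN kernel.

```python
import jax.numpy as jnp
from jax import lax

def depthwise_conv(x, filters):
    """Depthwise 3x3 SAME convolution: one filter per input channel.

    x:       (batch, height, width, channels), NHWC layout
    filters: (fh, fw, 1, channels), HWIO layout with
             in_channels // feature_group_count == 1
    """
    channels = x.shape[-1]
    return lax.conv_general_dilated(
        x, filters,
        window_strides=(1, 1),
        padding='SAME',
        dimension_numbers=('NHWC', 'HWIO', 'NHWC'),
        # One group per channel makes the conv depthwise, which can hit a
        # faster (grouped) CuDNN routine than a full dense convolution.
        feature_group_count=channels)

# Example with a "small but > 1" channel count of 4 (hypothetical numbers).
x = jnp.ones((2, 8, 8, 4))
filters = jnp.ones((3, 3, 1, 4))
y = depthwise_conv(x, filters)
print(y.shape)  # (2, 8, 8, 4)
```

Each output channel only sums over its own input channel, so the per-channel cost does not scale with the total channel count the way a dense convolution's does.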
Showing 1 changed file with 55 additions and 25 deletions.
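As a sanity check on the benchmark arithmetic in the commit message (assuming "50K kernel" means a 50,000 x 50,000 NTK matrix and the lower triangle includes the diagonal):

```python
# Check the quoted ~937-hour estimate for the lower triangle of a 50K x 50K
# NTK at 0.0027001 s per entry per V100 (the float32 figure above).
n = 50_000
entries = n * (n + 1) // 2            # lower-triangular entries, incl. diagonal
sec_per_entry = 0.0027001
hours = entries * sec_per_entry / 3600.0
print(f"{hours:.1f} hours")           # ~937.6 hours, before batching/beam overhead
```

This matches the "theoretical 937 hours" claim; the 8-GPU batching figure (0.0029 s per entry) would be slightly higher per entry but parallelized across devices.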