Hi,
My name is Alexander; we mostly work on high-quality, fast Speech Recognition. We mainly use strided 1D convolutions and Transformer modules, typically with progressive strides (i.e. 2 - 2 - 2). I have tried overall strides of 2, 4, and 8, and 8 was best (on a limited compute budget, of course). But I have not looked into other configurations, like 2 - 4 for example (very similar to an example you show in your blog post).
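For concreteness, here is a minimal sketch of the kind of progressive-stride frontend I mean (layer widths and kernel sizes are hypothetical, not our production values): three stride-2 `Conv1d` layers give an overall stride of 2 × 2 × 2 = 8 ahead of the Transformer blocks.

```python
import torch
import torch.nn as nn

# Hypothetical progressive-stride 1D conv frontend (2 - 2 - 2 = overall stride 8).
frontend = nn.Sequential(
    nn.Conv1d(80, 256, kernel_size=3, stride=2, padding=1),   # overall stride 2
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1),  # overall stride 4
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=3, stride=2, padding=1),  # overall stride 8
    nn.ReLU(),
)

x = torch.randn(4, 80, 800)   # (batch, features, time)
y = frontend(x)
print(y.shape)                # time axis reduced 8x: (4, 256, 100)
```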
As you have noted in your blog post, having an expressive stride module is of utmost importance. Your post made me think about improving our models further and I took the liberty of looking through your code briefly.
Though the idea in itself is fairly simple (just apply an RNN to a batched input; in the 1D case it is even easier), a series of questions arose after I took a look at your code, because it departs rather heavily from standard PyTorch practices. So I would like to know whether each of these is a bug or a feature, so to say:
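To make sure I understand the idea correctly, here is how I would sketch the 1D case (my own illustration, not your code): replace a strided convolution with an RNN that summarizes each non-overlapping window, by folding the windows into the batch dimension and keeping the last hidden state.

```python
import torch
import torch.nn as nn

class RNNStride1d(nn.Module):
    """Illustrative RNN-as-stride module: one GRU summary per window of `stride` steps."""
    def __init__(self, in_dim, hidden_dim, stride):
        super().__init__()
        self.stride = stride
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):                       # x: (batch, time, features)
        b, t, f = x.shape
        t = (t // self.stride) * self.stride    # drop the ragged tail
        # fold each window into the batch dimension: (b * windows, stride, f)
        windows = x[:, :t].reshape(b * (t // self.stride), self.stride, f)
        _, h = self.rnn(windows)                # h: (1, b * windows, hidden)
        return h.squeeze(0).reshape(b, t // self.stride, -1)

pool = RNNStride1d(in_dim=80, hidden_dim=128, stride=8)
y = pool(torch.randn(4, 803, 80))
print(y.shape)  # (4, 100, 128)
```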
(0)
The module contains `.to(torch.device("cuda"))`, but it does not inherit from `FastGRNNCUDA` (it inherits from `FastGRNN`). This seems a bit strange; why is that the case?
(1)
You use `.to(torch.device("cuda"))`, though it is standard practice in PyTorch to write device-agnostic code. Does this imply that:
- This code is NOT meant for multi-node (or multi-device) parallelization (e.g. DP, DDP)?
- This code is NOT meant to be run later on x86 (quantized or pruned) inference afterwards?
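For reference, the device-agnostic pattern I have in mind is the usual one: a module never calls `.to(...)` on itself; the caller moves the whole model once, and tensors created inside `forward()` follow the device of the input (a generic sketch, not your code).

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Toy cell illustrating device-agnostic state allocation."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x):
        # allocate state on whatever device the input already lives on
        h = torch.zeros(x.shape[0], self.w.shape[0], device=x.device)
        return torch.tanh(x @ self.w + h)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Cell(16).to(device)                # single .to() at the call site
out = model(torch.randn(2, 16, device=device))
print(out.shape)                           # (2, 16), on whichever device is available
```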
(2)
I saw some pruning and low-rank snippets in your code.
Low-rank does not seem to be used in RNNPool.
(3)
I took a quick look at `RNNCell` and `FastGRNNCell`. Apart from some utilities for model size estimation, seemingly unused low-rank options and pruning, and parameter initialization, I cannot really see why you implemented these classes from scratch instead of just using the standard PyTorch ones. Is there some reasoning behind it?
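For comparison, the stock PyTorch cells already cover the plain recurrence out of the box; as far as I can tell, a custom cell would only be needed for the FastGRNN-specific gating or low-rank factorization:

```python
import torch
import torch.nn as nn

# Standard PyTorch cell: one recurrence step for a batch of 32.
cell = nn.GRUCell(input_size=80, hidden_size=128)
x = torch.randn(32, 80)          # one timestep
h = torch.zeros(32, 128)         # initial hidden state
h = cell(x, h)
print(h.shape)                   # (32, 128)
```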
(4)
You seem to apply RNNPool to a VGG encoder. VGG is usually slow and large; is there a reason you do not apply this scheme to a MobileNet?