Can the TCN module be made faster? #49
Thanks for your interest in our work! The longer the sequence, the better TCN's efficiency relative to an LSTM. This is because a TCN still has depth (e.g., 8 layers), and each layer processes its T tokens in parallel. So one option is to train with a longer sequence length. The other way to speed up a TCN is to play with the kernel size and the dilation factor: once you have a larger receptive field per layer, you can reduce the number of layers. The transpose operation is indeed not optimal, but I definitely don't think it is responsible for the efficiency problem here. You can also profile to see which part of the model takes the most time: https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.profile
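The kernel-size/dilation/depth tradeoff can be made concrete. In a TCN whose residual blocks each stack two dilated causal convolutions (as in this repo's tcn.py), the receptive field works out to 1 + 2·(k−1)·sum(dilations). A quick sketch (the helper name is mine, not from the repo):

```python
def tcn_receptive_field(kernel_size, num_levels, dilation_base=2):
    """Receptive field of a TCN whose residual blocks each contain
    two causal convolutions, with dilations 1, b, b**2, ..."""
    dilations = [dilation_base ** i for i in range(num_levels)]
    return 1 + 2 * (kernel_size - 1) * sum(dilations)

# A larger kernel covers similar context with fewer layers:
print(tcn_receptive_field(kernel_size=3, num_levels=8))  # 1 + 4 * 255 = 1021
print(tcn_receptive_field(kernel_size=7, num_levels=5))  # 1 + 12 * 31 = 373
```

Fewer levels means fewer sequential kernel launches per forward pass, which is where the depth cost shows up.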
Thanks for your response! I'm benchmarking a TCN model that is identical to your model in char_cnn_test.py versus a two-layer RNN model that I created. The models have roughly the same number of parameters:
and
The sequence length is 320 and I'm benchmarking on Google Colab with the GPU option. I'm using the Penn Treebank dataset that you used. Logs for the TCN:
So as you can see, each epoch takes 114 seconds. Logs for the RNN:
Only 33 seconds per epoch. Here is my benchmark program: https://github.com/bjourne/python3-libs/blob/master/tools/char_lm.py
The profiles of the training loops for the two networks follow. For the TCN:
and for the RNN:
As you can see, the TCN makes many more (though individually cheap) addition and normalization calls, whereas the RNN makes far fewer. The most straightforward way to speed up the TCN is therefore to reduce its depth or increase the dilation factor. That said, when you compare a TCN with a multi-layer LSTM, I don't think the TCN would generally be that slow.
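To make "reduce its depth or increase the dilation factor" concrete for the 320-token sequences above: with exponentially growing dilations and two convolutions per residual block, the depth needed to cover the whole sequence drops as the kernel grows. A small sketch (helper name is mine, not from the repo):

```python
def min_levels(kernel_size, seq_len, dilation_base=2):
    """Smallest number of residual blocks (two causal convs each,
    dilations 1, b, b**2, ...) whose receptive field covers seq_len."""
    levels, rf = 0, 1
    while rf < seq_len:
        rf += 2 * (kernel_size - 1) * dilation_base ** levels
        levels += 1
    return levels

print(min_levels(kernel_size=3, seq_len=320))  # 7
print(min_levels(kernel_size=7, seq_len=320))  # 5
```

So on this task, moving from kernel size 3 to 7 would let the network shed two residual blocks while still seeing the full 320-token context.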
I also have benchmarks comparing the Keras TCN (https://github.com/philipperemy/keras-tcn) with LSTMs, and in those the TCN wins. So I don't think the problem is inherent in the TCN structure; rather, something must be wrong with your implementation.
Possibly, yeah, but if you compare keras-tcn with the tcn.py in this repo, they have essentially the same structure, built on stacks of padded Conv1d layers (cf. https://github.com/philipperemy/keras-tcn/blob/master/tcn/tcn.py#L111 and https://github.com/philipperemy/keras-tcn/blob/master/tcn/tcn.py#L145). The TCN implementation in this repo is a very simple one (as you have probably already noticed), and I can hardly see where it "must be wrong". There is also likely a framework-level difference in how these modules are implemented (e.g., different frameworks usually implement RNN modules in different ways). In case you are interested, this page finds that PyTorch's RNNs (e.g., GRUs) are a lot faster than Keras's (see the orange bars), whereas CNNs are about as fast. Thanks for the insightful & careful benchmarking! I think it's very useful to investigate the most efficient way of implementing and deploying TCNs in general :-)
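For readers comparing the two implementations: the shared building block is a causally padded dilated convolution, i.e. the input is zero-padded on the left by (k−1)·d so that output t only depends on inputs at positions ≤ t. A dependency-free sketch of that one operation (illustrative only, not the repo's code):

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """y[t] = sum_j w[j] * x[t - j*dilation], zero-padded on the left
    so the output never looks into the future."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = [0.0] * pad + list(x)  # left padding preserves causality
    return [sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
            for t in range(len(x))]

# An impulse exposes the taps: kernel [1, 1] with dilation 2 mixes x[t] and x[t-2].
print(causal_dilated_conv1d([1.0, 0.0, 0.0, 0.0], [1.0, 1.0], dilation=2))
# [1.0, 0.0, 1.0, 0.0]
```

Both repos realize this with framework Conv1d ops (PyTorch pads both sides and chops the tail; keras-tcn uses causal padding directly), but the math per layer is the same.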
Have you looked at https://arxiv.org/abs/1611.09482? Although I think the performance increase there only occurs when you're predicting an autoregressive process, which is quite inefficient with the default conv structure. I've looked at the PyTorch LSTM source, and it seems to be almost entirely implemented in C++, so you cannot really compare performance directly. In my mind, at least, a TCN is an RNN (at least when you're modeling an autoregressive process), but it's trained using teacher forcing, in the same way that Matlab trains a NARX model in "open loop" but then predicts in "closed loop". You can write a NARX(p) model as an RNN whose state is the p previous inputs, with the state update being to shift the previous state vector by one and append the new "previous output". Similarly, you can express a TCN as a series of convolutions, or as a recursive filter on a moving window.
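The moving-window view above can be sketched directly: the "state" is the window of the last p inputs, the update shifts the window by one and appends the newest value, and the network only ever reads that window. A pure-Python illustration (names and the stand-in predictor are mine):

```python
from collections import deque

def make_stream_state(receptive_field):
    # State = the last `receptive_field` inputs; deque drops the oldest for us.
    return deque([0.0] * receptive_field, maxlen=receptive_field)

def step(state, x_t, predict):
    """One closed-loop step: shift-and-append the state, then apply the
    network (here any function of the window) to produce the next output."""
    state.append(x_t)            # NARX-style state update
    return predict(list(state))  # the TCN sees only the current window

state = make_stream_state(receptive_field=4)
mean = lambda window: sum(window) / len(window)  # stand-in for a real TCN
outs = [step(state, x, mean) for x in [4.0, 8.0, 0.0, 0.0]]
print(outs)  # [1.0, 3.0, 3.0, 3.0]
```

This is also the intuition behind the caching trick in the paper linked above: during autoregressive generation you only need to recompute what the newest sample touches, not rerun the convolution over the whole history.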
I went with a Transformer model instead. For my problem I get just as low a loss as with a TCN, but it trains much faster.
I'm using your TCN module for a language modeling task. My code follows the structure of your char_cnn code. It works, but the performance is very bad compared to an LSTM network: each epoch with the TCN takes about 10 times longer. Do you know if the performance can be improved? Here is the forward method from the TCN class:
Perhaps it is the transpose calls that are making the code slow?
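On the transpose question: in PyTorch, `Tensor.transpose` returns a view and copies no data, so the call itself is essentially free; a real copy only happens if something later forces a contiguous layout (e.g. `.contiguous()` or `view()`). A quick check (the shapes here are just illustrative):

```python
import torch

x = torch.randn(8, 320, 50)   # (batch, seq_len, channels), as fed to the model
y = x.transpose(1, 2)         # (batch, channels, seq_len), the layout Conv1d wants

print(y.shape)                       # torch.Size([8, 50, 320])
print(y.data_ptr() == x.data_ptr())  # True: same storage, no copy made
print(y.is_contiguous())             # False: only the strides changed
z = y.contiguous()                   # this is where an actual copy happens
print(z.data_ptr() == x.data_ptr())  # False: new storage
```

So the transposes are an unlikely culprit; the profiler output above points at the sheer number of per-layer ops instead.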