Encoder-decoder fails at KMeans attention #4
I haven't been able to dig into the root cause here yet, but I'm getting the following error when trying to run an encoder-decoder:

Here are my model params:

Comments
@tomweingarten ohh, that's weird, I tried it on some random input tensors and it worked. What are the shapes of your source and target tensors?
I have a working script for encoder / decoder here that may help: https://github.com/lucidrains/routing-transformer/blob/master/examples/toy_tasks/enc_dec_copy_task.py
@tomweingarten feel free to send me your full script if you have trouble debugging it. From the trace, it looks to be related to an interaction between the k-means attention and the reversible network. Also, how is this architecture working out for you? What kind of results are you seeing on your end?
After some more testing, it looks like this only happens if I run generate() before the first call of the model. Something seems to go wrong with initializing the k-means under those circumstances. I'd like to try this with your script as well to verify it isn't my script, but I haven't gotten to do that yet.
@tomweingarten oh! Yes, that makes sense, because the means are initialized on the first backward pass during training. I'll put in some better asserts!
You can't evaluate if there are no means from which to do the clustering!
@tomweingarten ok, I've put in a fix in 0.7.2 that will stop the error, but it is still best to train before evaluating with the encoder / decoder, or the decoder means will be initialized from only a few samples.
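For anyone hitting the same thing, here is a minimal sketch of the recommended ordering (train at least once, then generate), modeled loosely on the linked enc_dec_copy_task.py script. The constructor arguments, the `return_loss` keyword, and the `generate()` signature are assumptions taken from that example and may differ slightly between versions; the sizes are illustrative, not the reporter's actual configuration.

```python
import torch
from routing_transformer import RoutingTransformerEncDec

# illustrative sizes only
model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 256, enc_depth = 3, enc_max_seq_len = 1024,
    dec_num_tokens = 256, dec_depth = 3, dec_max_seq_len = 1024
)

src = torch.randint(0, 256, (1, 1024))
tgt = torch.randint(0, 256, (1, 1024))

# run at least one training step first: the k-means centroids are
# initialized on this first backward pass
out = model(src, tgt, return_loss = True)
# depending on the version this may be a single loss tensor or (loss, aux_loss)
loss = sum(out) if isinstance(out, tuple) else out
loss.backward()

# only then sample from the decoder, so clustering has centroids to route with
start_tokens = torch.ones((1, 1), dtype = torch.long)  # illustrative start token
generated = model.generate(src, start_tokens, seq_len = 1024)
```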
@tomweingarten thanks for reporting this, btw
Thanks! You're right that it is silly to run the generate() method before fitting. I do it just as a last check to make sure I haven't done anything weird, like accidentally loading a checkpoint when I shouldn't have. Thanks for the fix!
@tomweingarten no problem! Are you seeing good results? Anything interesting you found exploring the hyperparameter space?
I've been struggling to get the encoder-decoder model to converge. I can get the loss down to about 2 for my model, but based on the Reformer (trax implementation) and Transformer-XL (Hugging Face implementation) I'm expecting something like 0.5. With routing-transformer I see the loss plateau at that level for some time, then explode. Have you experienced something like this before? Here are my latest parameters:

# Set up Encoder-Decoder model
model = RoutingTransformerEncDec(
@tomweingarten turns out there was an issue with un-shared QK and causal networks, which has been fixed in the latest minor version bump. Your settings look fine, except it is best to mix local attention heads with the global k-means attention heads. Both Aurko and Aran, researchers who work on this variant of sparse attention, advise using all local attention except for the last couple of layers: https://github.com/lucidrains/routing-transformer/blob/master/examples/enwik8_simple/train.py#L47. This paper has more experimental results on how best to distribute the local inductive bias: https://arxiv.org/abs/2004.05150
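As an illustration of that advice, here is a hedged sketch of what mixing local and global heads might look like. The `n_local_attn_heads` keyword, the enc_/dec_ prefix routing in the encoder-decoder wrapper, and per-layer tuple support are assumptions based on the linked enwik8 example and lucidrains' related repositories, so verify them against the installed version; all numbers are illustrative.

```python
from routing_transformer import RoutingTransformerEncDec

model = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 256,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 1024,
    # e.g. 6 of 8 heads use local attention; some versions may also accept a
    # per-layer tuple like (8, 8, 8, 8, 4, 2) to keep early layers fully local
    # (an assumption to verify)
    enc_n_local_attn_heads = 6,
    dec_num_tokens = 256,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 1024,
    dec_n_local_attn_heads = 6
)
```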
@tomweingarten how high is your learning rate? Do you have gradient clipping turned on?
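For reference, a generic PyTorch training step with gradient clipping (not specific to this library); `model`, `src`, and `tgt` are assumed to be defined as in the earlier sketch, and the learning rate and clip value are purely illustrative.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)  # illustrative learning rate

out = model(src, tgt, return_loss = True)
loss = sum(out) if isinstance(out, tuple) else out
loss.backward()

# clip the global gradient norm before stepping, to guard against loss spikes
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
optimizer.step()
optimizer.zero_grad()
```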
@tomweingarten Aurko sent this to me: https://www.aclweb.org/anthology/2020.acl-main.672.pdf
@lucidrains Quick update: Running with the new version fixed my training loss problem! Unfortunately I'm seeing some weird results for predictions that I can't quite explain yet, but it's going to take me a bit longer to dig into why that is. I'm also going to play around with mixed attention head locality too, thanks for the tip!
@tomweingarten awesome! Glad to hear it is converging!