WIP: Faster GatedRecurrent #655
Conversation
```python
gate_values = self.gate_activation.apply(
    states.dot(self.state_to_gates) + gate_inputs)
update_values = gate_values[:, :self.dim]
reset_values = gate_values[:, self.dim:]
```
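For context, here is a self-contained sketch of the two formulations under discussion. The names, shapes, and the use of sigmoid in place of `gate_activation` are illustrative assumptions, not the PR's exact code:

```python
import numpy
import theano
import theano.tensor as tt

dim = 250
rng = numpy.random.RandomState(1)
states = tt.matrix('states')            # (batch, dim)
gate_inputs = tt.matrix('gate_inputs')  # (batch, 2 * dim)
W = theano.shared(rng.randn(dim, 2 * dim).astype('float32'), name='W')
W_u = theano.shared(rng.randn(dim, dim).astype('float32'), name='W_u')
W_r = theano.shared(rng.randn(dim, dim).astype('float32'), name='W_r')

# Fused: one big GEMM, then two slices of the result.
fused = tt.nnet.sigmoid(states.dot(W) + gate_inputs)
update_f, reset_f = fused[:, :dim], fused[:, dim:]

# Separate: two smaller GEMMs, no slicing of the result.
update_s = tt.nnet.sigmoid(states.dot(W_u) + gate_inputs[:, :dim])
reset_s = tt.nnet.sigmoid(states.dot(W_r) + gate_inputs[:, dim:])
```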
Did you benchmark whether this is faster than two separate matrix multiplications? I remember @pbrakel saying that in Theano the benefit of performing a single GEMM is actually cancelled out by the slicing operations needed.
Yes I did, and it is faster. I will make a proper benchmark for the record tomorrow (all GPUs are busy right now); so far my rough estimate is that you gain about 20% with 250 GRU units.
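For the record, a minimal timing harness along these lines, continuing the hypothetical sketch above (illustrative only; the actual benchmark script is linked later in this thread):

```python
# Compile and time the two graphs built in the sketch above.
import timeit

f_fused = theano.function([states, gate_inputs], [update_f, reset_f])
f_sep = theano.function([states, gate_inputs], [update_s, reset_s])

s = rng.randn(10, dim).astype('float32')      # batch size 10
g = rng.randn(10, 2 * dim).astype('float32')

for name, f in [('fused', f_fused), ('separate', f_sep)]:
    seconds = timeit.timeit(lambda f=f: f(s, g), number=1000)
    # total seconds for 1000 calls == milliseconds per call
    print('%s: %.3f ms per call' % (name, seconds))
```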
Can you test the backward pass as well? Theano may need to reallocate memory for a merge op (which is the gradient of split).
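Continuing the sketch above, one hedged way to include the backward pass in the comparison:

```python
# Compile gradient functions so the backward pass is part of the
# comparison. In the fused variant the gradient flows through the two
# slices of `fused`, so Theano inserts an inc_subtensor -- the "merge"
# that is the gradient of split.
cost_fused = update_f.sum() + reset_f.sum()
cost_sep = update_s.sum() + reset_s.sum()
grad_fused = theano.function([states, gate_inputs],
                             tt.grad(cost_fused, wrt=W))
grad_sep = theano.function([states, gate_inputs],
                           tt.grad(cost_sep, wrt=[W_u, W_r]))
```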
The 20% estimate is for both forward and backward. I got a 10% speedup for a big model, in which the GRU takes roughly half of the time.
But I will take actual profiles tomorrow.
An Easter egg: it becomes non-trivial to initialize both the reset and update matrices orthogonally, since they are one matrix now.
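One way around that (a sketch of the general technique, not necessarily what this PR does): initialize each dim-by-dim block orthogonally on its own, then concatenate the blocks.

```python
import numpy

def block_orthogonal(dim, n_blocks, rng=numpy.random):
    """Build (dim, n_blocks * dim) with each dim x dim block orthogonal."""
    blocks = []
    for _ in range(n_blocks):
        q, r = numpy.linalg.qr(rng.standard_normal((dim, dim)))
        q *= numpy.sign(numpy.diag(r))  # sign fix: uniform over orthogonal matrices
        blocks.append(q)
    return numpy.concatenate(blocks, axis=1)

state_to_gates = block_orthogonal(250, 2)  # update and reset blocks
```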
And one more thing: it may be more efficient to store the transpose of the weight matrix, since it is faster to split along the first dimension. I'm not sure that the GPU stores matrices in C order, though.
AFAIK it does store arrays in C order.
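A quick host-side check of that claim in numpy (whether Theano's GPU arrays follow the same layout is the open question here):

```python
import numpy

x = numpy.zeros((500, 250))
print(x.flags['C_CONTIGUOUS'])           # True: numpy defaults to C order
print(x[:250].flags['C_CONTIGUOUS'])     # True: first-axis slice stays contiguous
print(x[:, :125].flags['C_CONTIGUOUS'])  # False: last-axis slice is strided
```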
Results: columns stand for the number of neurons, and the units are milliseconds. The batch size is 10. Depending on the number of neurons, the new implementation can be quite a bit better. And apparently, our LSTM implementation is slow.
The code can be found at https://github.com/rizar/blocks-benchmarks
Cool. Did you try @dmitriy-serdyuk's suggestion to see if it made a difference? (Also, you need a rebase.)
I am not sure I understand it, to be honest. It is not the weight matrix but the dot product results that are split. I guess I can switch the order of the terms on line 531 and transpose the subtensors.
I guess the idea would be to check if this makes a difference:

```python
gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim]
reset_values = gate_values[self.dim:]
```
To make it work I had to add a few more transpositions, and as a result it got worse:

```python
gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim].T
reset_values = gate_values[self.dim:].T
```
I guess it's because
Right, and I guess the following is the reason why:

```python
def slice_last(x, no):
    return x.T[no*self.dim: (no+1)*self.dim].T
```

As soon as tests pass, I think this PR will be ready to merge.
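For comparison, a hypothetical transpose-free equivalent that slices the last axis directly (assuming, as in the quoted helper, that it lives in a method where `self.dim` is in scope):

```python
def slice_last(x, no):
    # Slice columns directly instead of transposing twice.
    return x[:, no * self.dim: (no + 1) * self.dim]
```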
I changed the initialization procedure so that it is identical to the earlier version of `GatedRecurrent`.
Removing the slicing from the LSTM has been on our todo list for a while now, because it is indeed the most likely cause of the terrible speed. Of course, it has a lot more parameters than the other models, but it shouldn't be that much slower. I'll see if I can find some time to look at it today.
@pbrakel, let me make a blind guess: I think it is not the slicing but the transposition that hurts so much. You can use my benchmarking script, by the way.
@rizar, that script would be very helpful. I think the slicing might actually not be the main reason for the slow simulation time, but mainly for the slowdown of the gradient computations.
@rizar Thanks, Dima!
@bartvm, do you know why Scrutinizer was not even run on this PR?
It says "Status: Ignored; this pull-request is not mergeable." because GitHub sent the following payload:
Which is odd... Because
Scrutinizer says OK. Merging?
Fewer multiplications in the inner graph should be faster.
Also, I removed the flags for switching the gates on and off. Anyone who wants to research custom recurrent transitions will need more flexibility than those flags provide anyway.