
WIP: Faster GatedRecurrent #655

Merged: 2 commits, merged into mila-iqia:master on May 22, 2015
Conversation

@rizar (Contributor) commented May 21, 2015

Fewer multiplications in the inner graph should be faster.

Also, I removed the flags for switching the gates on and off. Anyone who wants to research custom recurrent transitions will need more flexibility than those flags provide anyway.

# One fused GEMM for both gates; slice the result into the two gates.
gate_values = self.gate_activation.apply(
    states.dot(self.state_to_gates) + gate_inputs)
update_values = gate_values[:, :self.dim]
reset_values = gate_values[:, self.dim:]
A Member commented on the diff:

Did you benchmark whether this is faster than two separate matrix multiplications? I remember @pbrakel saying that in Theano the benefit of performing a single GEMM is actually cancelled out by the slicing operations needed.
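
For reference, a minimal Theano sketch of the two variants under discussion (the W_* names are hypothetical; the PR itself uses self.state_to_gates, and gate_inputs stands for the precomputed input contribution):

import theano.tensor as T

dim = 250
states = T.matrix('states')            # (batch, dim)
gate_inputs = T.matrix('gate_inputs')  # (batch, 2 * dim)

# This PR: one fused GEMM with a (dim, 2 * dim) matrix, then slice.
W_gates = T.matrix('W_gates')
gates = T.nnet.sigmoid(states.dot(W_gates) + gate_inputs)
update_fused, reset_fused = gates[:, :dim], gates[:, dim:]

# Old implementation: two separate (dim, dim) GEMMs, no slicing.
W_update, W_reset = T.matrix('W_update'), T.matrix('W_reset')
update_split = T.nnet.sigmoid(states.dot(W_update) + gate_inputs[:, :dim])
reset_split = T.nnet.sigmoid(states.dot(W_reset) + gate_inputs[:, dim:])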

@rizar (Contributor Author) replied:

Yes I did, and it is faster. I will make a proper benchmark for the record tomorrow (all GPUs are busy); so far my rough estimate is that you gain about 20% with 250 GRU units.

A Contributor replied:

Can you test the backward pass as well? Theano may need to reallocate memory for a merge op (which is the gradient of split).
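
A quick way to see the merge op in question, as a sketch assuming only a working Theano install:

import theano
import theano.tensor as T

x = T.matrix('x')
# Build a cost from two slices of x, then take the gradient.
cost = x[:, :5].sum() + 2 * x[:, 5:].sum()
grad = theano.grad(cost, x)
# The printed graph contains IncSubtensor nodes: the merge that is the
# gradient of the split, which may require a fresh allocation.
theano.printing.debugprint(grad)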

@rizar (Contributor Author) replied:

The 20% estimate is for both forward and backward passes. I got a 10% speedup for a big model in which the GRU accounts for roughly half of the running time.

But I will take actual profiles tomorrow.

@rizar (Contributor Author) commented May 21, 2015

An Easter egg: it becomes non-trivial to initialize both the reset and update matrices orthogonally, since they are one matrix now.
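
One way around it, sketched in numpy; orthogonal_block is a hypothetical helper, not the initialization code used in Blocks:

import numpy

def orthogonal_block(dim, rng):
    # QR-decompose a random Gaussian matrix; Q is orthogonal.
    Q, R = numpy.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix the column signs to sample uniformly from orthogonal matrices.
    return Q * numpy.sign(numpy.diag(R))

rng = numpy.random.default_rng(2015)
dim = 250
# Initialize each dim x dim half orthogonally, then concatenate the halves
# into the single (dim, 2 * dim) state-to-gates matrix.
state_to_gates = numpy.hstack(
    [orthogonal_block(dim, rng), orthogonal_block(dim, rng)])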

@dmitriy-serdyuk (Contributor) commented:

And one more thing: it may be more efficient to store the transpose of the weight matrix, since it is faster to split along the first dimension. I'm not sure that the GPU stores matrices in C order, though.

@dwf (Contributor) commented May 21, 2015 via email

@rizar (Contributor Author) commented May 22, 2015

Results: the columns give the number of neurons; times are in milliseconds. The batch size is 10.

Depending on the number of neurons, the new implementation can be quite a bit better. And apparently, our LSTM implementation is slow.

                      100   250   1000   2000
SimpleRecurrent         8    14     68     96
new GatedRecurrent     16    31    152    270
old GatedRecurrent     16    41    199    280
LSTM                   37    89    303   1032

The code can be found at https://github.com/rizar/blocks-benchmarks
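
In the spirit of that script, a hypothetical harness that times forward plus backward through a scan; this is a sketch, not the actual benchmark code:

import time
import numpy
import theano
import theano.tensor as T

def benchmark(step_fn, dim, batch=10, length=100, trials=50):
    # step_fn(x_t, h_tm1) -> h_t is the recurrent transition under test.
    x = T.tensor3('x')                  # (time, batch, dim)
    h0 = T.zeros((batch, dim))
    states, _ = theano.scan(step_fn, sequences=x, outputs_info=h0)
    cost = states[-1].sum()
    grad = theano.grad(cost, x)         # force the backward pass as well
    f = theano.function([x], [cost, grad])
    data = numpy.random.rand(length, batch, dim).astype(theano.config.floatX)
    f(data)                             # warm-up call triggers compilation
    start = time.time()
    for _ in range(trials):
        f(data)
    return (time.time() - start) / trials * 1000.0  # milliseconds per call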

@bartvm (Member) commented May 22, 2015

Cool. Did you try @dmitriy-serdyuk's suggestion to see if it made a difference? (Also, you need a rebase.)

@rizar (Contributor Author) commented May 22, 2015

I am not sure I understand it, to be honest. It is not the weight matrix but the dot product result that is split. I guess I could switch the order of the terms in line 531 and transpose the subtensors to get reset_values and update_values, but it is not obvious to me that it would be faster that way.

@bartvm (Member) commented May 22, 2015

I guess the idea would be to check if this makes a difference:

gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim]
reset_values = gate_values[self.dim:]

@rizar (Contributor Author) commented May 22, 2015

To make it work I had to add a few more transpositions, and as a result it got worse:

gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim].T
reset_values = gate_values[self.dim:].T

                      100   250   1000   2000
this GatedRecurrent    88   104    230    388

@bartvm (Member) commented May 22, 2015

I guess it's because states needs to be transposed; that makes sense.

@rizar (Contributor Author) commented May 22, 2015

Right, and I guess the following is the reason why the LSTM is slow:

def slice_last(x, no):
    # Two extra transpositions for every slice.
    return x.T[no * self.dim: (no + 1) * self.dim].T

As soon as the tests pass, I think this PR will be ready to merge.
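
If the transpositions are indeed the culprit, slicing the last axis directly would avoid them; a sketch of a possible fix, not a change made in this PR:

def slice_last(x, no):
    # Equivalent to x.T[no * self.dim: (no + 1) * self.dim].T for a matrix,
    # but without the two transpositions.
    return x[:, no * self.dim: (no + 1) * self.dim]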

@rizar (Contributor Author) commented May 22, 2015

I changed the initialization procedure so that it is identical to that of the earlier version of GatedRecurrent.

@pbrakel (Contributor) commented May 22, 2015

Removing the slicing from the LSTM has been on our to-do list for a while now, because it is indeed the most likely cause of its terrible speed. Of course it has a lot more parameters than the other models, but it shouldn't be that much slower. I'll see if I can find some time to look at it today.

@rizar (Contributor Author) commented May 22, 2015

@pbrakel, let me make a blind guess: I think it is not the slicing but the transposition that hurts so much. You can use my benchmarking script, by the way.

@pbrakel (Contributor) commented May 22, 2015

@rizar that script would be very helpful. I think the slicing might actually not be the main reason for the slow simulation time, but mainly for the slowdown of the gradient computations.

@rizar (Contributor Author) commented May 22, 2015

@pbrakel (Contributor) commented May 22, 2015

@rizar Thanks, Dima!

@rizar (Contributor Author) commented May 22, 2015

Rebased to account for #600

Also, fixes #399

@rizar (Contributor Author) commented May 22, 2015

@bartvm, do you know why Scrutinizer was not even run on this PR?

@bartvm (Member) commented May 22, 2015

It says "Status: Ignored; this pull-request is not mergeable." because GitHub sent the following payload:

        "mergeable": null,
        "mergeable_state": "unknown",

which is odd, because mergeable is supposed to be a Boolean value.

@rizar (Contributor Author) commented May 22, 2015

Scrutinizer says OK. Merging?

bartvm added a commit referencing this pull request on May 22, 2015: "WIP: Faster GatedRecurrent"
@bartvm merged commit e0067a2 into mila-iqia:master on May 22, 2015
@lamblin mentioned this pull request on Jul 7, 2015