WIP: Faster GatedRecurrent #655
Conversation
```python
gate_values = self.gate_activation.apply(
    states.dot(self.state_to_gates) + gate_inputs)
update_values = gate_values[:, :self.dim]
reset_values = gate_values[:, self.dim:]
```
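For context, here is a self-contained sketch of the two formulations under discussion. The names, shapes, and the use of sigmoid in place of `gate_activation` are illustrative assumptions, not the PR's exact code:

```python
import numpy
import theano
import theano.tensor as tt

dim = 250
rng = numpy.random.RandomState(1)
states = tt.matrix('states')            # (batch, dim)
gate_inputs = tt.matrix('gate_inputs')  # (batch, 2 * dim)
W = theano.shared(rng.randn(dim, 2 * dim).astype('float32'), name='W')
W_u = theano.shared(rng.randn(dim, dim).astype('float32'), name='W_u')
W_r = theano.shared(rng.randn(dim, dim).astype('float32'), name='W_r')

# Fused: one big GEMM, then two slices of the result.
fused = tt.nnet.sigmoid(states.dot(W) + gate_inputs)
update_f, reset_f = fused[:, :dim], fused[:, dim:]

# Separate: two smaller GEMMs, no slicing of the result.
update_s = tt.nnet.sigmoid(states.dot(W_u) + gate_inputs[:, :dim])
reset_s = tt.nnet.sigmoid(states.dot(W_r) + gate_inputs[:, dim:])
```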
Did you benchmark whether this is faster than two separate matrix multiplications? I remember @pbrakel saying that in Theano the benefit of performing a single GEMM is actually cancelled out by the slicing operations needed.
Yes I did, and it is faster. I will make a proper benchmark for the record tomorrow (all GPUs are busy right now); so far my rough estimate is that you gain about 20% with 250 GRU units.
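For the record, a minimal timing harness along these lines, continuing the hypothetical sketch above (illustrative only; the actual benchmark script is linked later in this thread):

```python
# Compile and time the two graphs built in the sketch above.
import timeit

f_fused = theano.function([states, gate_inputs], [update_f, reset_f])
f_sep = theano.function([states, gate_inputs], [update_s, reset_s])

s = rng.randn(10, dim).astype('float32')      # batch size 10
g = rng.randn(10, 2 * dim).astype('float32')

for name, f in [('fused', f_fused), ('separate', f_sep)]:
    seconds = timeit.timeit(lambda f=f: f(s, g), number=1000)
    # total seconds for 1000 calls == milliseconds per call
    print('%s: %.3f ms per call' % (name, seconds))
```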
Can you test the backward pass as well? Theano may need to reallocate memory for a merge op (which is the gradient of split).
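Continuing the sketch above, one hedged way to include the backward pass in the comparison:

```python
# Compile gradient functions so the backward pass is part of the
# comparison. In the fused variant the gradient flows through the two
# slices of `fused`, so Theano inserts an inc_subtensor -- the "merge"
# that is the gradient of split.
cost_fused = update_f.sum() + reset_f.sum()
cost_sep = update_s.sum() + reset_s.sum()
grad_fused = theano.function([states, gate_inputs],
                             tt.grad(cost_fused, wrt=W))
grad_sep = theano.function([states, gate_inputs],
                           tt.grad(cost_sep, wrt=[W_u, W_r]))
```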
The 20% estimate is for both forward and backward. I got a 10% speedup for a big model, in which the GRU takes roughly half of the time.
But I will take actual profiles tomorrow.
An Easter egg: it becomes non-trivial to initialize both the reset and update matrices orthogonally, since they are one matrix now.
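One way around that (a sketch of the general technique, not necessarily what this PR does): initialize each dim-by-dim block orthogonally on its own, then concatenate the blocks.

```python
import numpy

def block_orthogonal(dim, n_blocks, rng=numpy.random):
    """Build (dim, n_blocks * dim) with each dim x dim block orthogonal."""
    blocks = []
    for _ in range(n_blocks):
        q, r = numpy.linalg.qr(rng.standard_normal((dim, dim)))
        q *= numpy.sign(numpy.diag(r))  # sign fix: uniform over orthogonal matrices
        blocks.append(q)
    return numpy.concatenate(blocks, axis=1)

state_to_gates = block_orthogonal(250, 2)  # update and reset blocks
```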
And one more thing: it may be more efficient to store the transpose of the weight matrix, since it is faster to split along the first dimension. I'm not sure that the GPU stores matrices in C order, though.
AFAIK it does store arrays in C order.
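A quick host-side check of that claim in numpy (whether Theano's GPU arrays follow the same layout is the open question here):

```python
import numpy

x = numpy.zeros((500, 250))
print(x.flags['C_CONTIGUOUS'])           # True: numpy defaults to C order
print(x[:250].flags['C_CONTIGUOUS'])     # True: first-axis slice stays contiguous
print(x[:, :125].flags['C_CONTIGUOUS'])  # False: last-axis slice is strided
```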
Results: columns stand for the number of neurons, and the units are milliseconds. The batch size is 10. Depending on the number of neurons, the new implementation can be quite a bit better. And apparently, our LSTM implementation is slow.
The code can be found at https://github.com/rizar/blocks-benchmarks
Cool. Did you try @dmitriy-serdyuk's suggestion to see if it made a difference? (Also, you need a rebase.)
I am not sure I understand it, to be honest. It is not the weight matrix but the dot product results that are split. I guess I can switch the order of the terms on line 531 and transpose the subtensors.
I guess the idea would be to check if this makes a difference:

```python
gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim]
reset_values = gate_values[self.dim:]
```
To make it work I had to add a few more transpositions, and as a result it got worse:

```python
gate_values = self.gate_activation.apply(
    self.state_to_gates.dot(states.T) + gate_inputs.T)
update_values = gate_values[:self.dim].T
reset_values = gate_values[self.dim:].T
```
I guess it's because
Right, and I guess the following is the reason why:

```python
def slice_last(x, no):
    return x.T[no*self.dim: (no+1)*self.dim].T
```

As soon as tests pass, I think this PR will be ready to merge.
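For comparison, a hypothetical transpose-free equivalent that slices the last axis directly (assuming, as in the quoted helper, that it lives in a method where `self.dim` is in scope):

```python
def slice_last(x, no):
    # Slice columns directly instead of transposing twice.
    return x[:, no * self.dim: (no + 1) * self.dim]
```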
I changed the initialization procedure so that it is identical to the earlier version of `GatedRecurrent`.
Removing the slicing from the LSTM has been on our todo list for a while now, because it is indeed the most likely cause of the terrible speed. Of course, it has a lot more parameters than the other models, but it shouldn't be that much slower. I'll see if I can find some time to look at it today.
@pbrakel, let me make a blind guess: I think it is not the slicing but the transposition that hurts so much. You can use my benchmarking script, by the way.
@rizar, that script would be very helpful. I think the slicing might actually not be the main reason for the slow simulation time, but mainly for the slowdown of the gradient computations.
@rizar Thanks, Dima!
@bartvm, do you know why Scrutinizer was not even run on this PR?
It says "Status: Ignored; this pull-request is not mergeable." because GitHub sent the following payload:
Which is odd... Because
Scrutinizer says OK. Merging?
Fewer multiplications in the inner graph should be faster.
Also, I removed the flags for switching the gates on and off. Anyone who wants to research custom recurrent transitions will need more flexibility than those flags provide anyway.