There are a few packages that can help with implementing RNNs and their need for high-performance computation.  I like Caffe the most, but it can be challenging, especially when it comes to adding new code, as you need to deal with C++ and CUDA.  There is also Theano, but I am not a great fan of its heavy computational graph optimisation, especially during the evaluation stage.  There is also Torch, based on Lua, which is ... well, I don't know what Torch can do at this stage ...
  
Hence this post is about implementing linear regression using CUDA tensors and Torch 7.  The example is based on 

In [34]:
require 'cutorch';
require 'cunn';
require 'optim';

torch.setdefaulttensortype( 'torch.FloatTensor' )

For this exercise we will use a fairly large data set: one million rows with two features each.

In [35]:
x_len = 1000000
x_width = 2

X = torch.CudaTensor( x_len, x_width ):normal()
A = torch.CudaTensor{ {1}, {2} }
Y = torch.mm( X, A ) + torch.CudaTensor( x_len, 1 ):normal( 3.0, 1.0 )
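
Written out, the cell above generates the data as follows. This is only a restatement of the code, with $\varepsilon_i$ standing for the noise term and assuming that `normal()` without arguments draws from a standard normal:

$$
y_i = 1 \cdot x_{i1} + 2 \cdot x_{i2} + \varepsilon_i, \qquad x_{ij} \sim \mathcal{N}(0, 1), \qquad \varepsilon_i \sim \mathcal{N}(3, 1)
$$

Since the noise has mean 3, this is the same as a regression with intercept 3 and zero-mean, unit-variance noise, so even a perfect fit leaves a mean squared error of about 1.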

Let's define a linear layer to express our regression.  The nn package will take care of gradient derivation as well as the forward and backward passes.

In [36]:
lin_layer = nn.Linear( (#X)[2], (#Y)[2] )
model = nn.Sequential()
model:add( lin_layer )
model:cuda()
criterion = nn.MSECriterion()
criterion:cuda()
params, dl_dparams = model:getParameters()
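
For reference, the layer and criterion above correspond to the formulas below, assuming the default averaging behaviour of `nn.MSECriterion`. Here $W$ and $b$ are the weight matrix and bias of `nn.Linear`, and $N$ is the number of rows in a batch:

$$
\hat{y}_i = W x_i + b, \qquad L = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
$$

With two inputs and one output, $W$ is $1 \times 2$ and $b$ is a single number, so `params` is a flat vector of three values.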

In [37]:
sgd_params = {
    learningRate = 1e-3,
    learningRateDecay = 1e-4,
    weightDecay = 0,
    momentum = 0
}
epochs = 100
batch_size = 50000
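
As far as I can tell from the `optim.sgd` source, `learningRateDecay` shrinks the step size as `lr / (1 + n * decay)`, where `n` is the number of updates made so far. Here is a minimal plain Lua sketch of that schedule, under that assumption. `effective_lr` is a hypothetical helper, not part of optim, and the sizes are copied from this post:

```lua
-- Hypothetical helper (not part of optim): step size after n_evals updates,
-- assuming the lr / (1 + n_evals * decay) schedule
function effective_lr( lr, decay, n_evals )
    return lr / ( 1 + n_evals * decay )
end

-- 1e6 rows in batches of 50000 give 20 updates per epoch, 2000 over 100 epochs
n_updates = ( 1000000 / 50000 ) * 100

print( effective_lr( 1e-3, 1e-4, 0 ) )          -- first update uses the full rate
print( effective_lr( 1e-3, 1e-4, n_updates ) )  -- rate after all updates
```

So with these settings the learning rate only drops from 1e-3 to roughly 8.3e-4 over the whole run, which is a fairly gentle decay.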

In [39]:
function train( X, Y )
    
    local current_loss = 0

    -- for each mini batch
    for t = 1,(#X)[1], batch_size do
        -- prepare mini batch; the last one may be shorter than batch_size
        local x_start = t
        local x_end = math.min( t + batch_size - 1, (#X)[1] )

        -- row slices are views into X and Y, so no extra allocation or copy is needed
        local inputs = X[ {{x_start, x_end}} ]
        local targets = Y[ {{x_start, x_end}} ]
        
        -- eval function to minimise
        local feval = function( params_new )
            -- clean up 
            collectgarbage()

            if params ~= params_new then
                params:copy( params_new )
            end

            -- reset gradients (gradients are always accumulated, to accommodate batch methods)
            dl_dparams:zero()

            -- evaluate the loss function and its derivative wrt the parameters, for this mini batch
            local outputs = model:forward( inputs )
            local loss = criterion:forward( outputs, targets )
            local backprop = criterion:backward( outputs, targets )
            model:backward( inputs, backprop )

            -- return loss and dloss/dparams
            return loss, dl_dparams
        end

        -- run SGD; optim.sgd returns the updated parameters and a table holding the loss
        local _, fs = optim.sgd( feval, params, sgd_params )
        current_loss = current_loss + fs[1]
    end
    
    return current_loss
end
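
The loop in `train` steps through the rows in strides of `batch_size` and clips the last index with `math.min`. A small plain Lua sketch of just that index arithmetic shows what happens when the row count is not a multiple of the batch size. `batch_ranges` is a hypothetical helper for illustration and needs no Torch:

```lua
-- Hypothetical helper: list the {first, last} row indices of each mini batch
function batch_ranges( n_rows, batch_size )
    local ranges = {}
    for first = 1, n_rows, batch_size do
        local last = math.min( first + batch_size - 1, n_rows )
        ranges[ #ranges + 1 ] = { first, last }
    end
    return ranges
end

-- 10 rows in batches of 4: the last batch holds only rows 9 and 10
for _, r in ipairs( batch_ranges( 10, 4 ) ) do
    print( r[1], r[2] )
end
```

With the sizes used in this post (1,000,000 rows in batches of 50,000) every batch is full, which gives 20 updates per epoch.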

In [41]:
time = sys.clock()
local cumm_loss = 0.
for i = 1, epochs do
    cumm_loss = train( X, Y )
--    io.write( '.' )
end

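-- note: cumm_loss is a sum of per-batch mean losses, and it is divided here by
-- the number of rows rather than the number of batches
print( 'Final loss = ' .. cumm_loss / (#X)[1] )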

-- time taken (total over all epochs, since the clock runs across the whole loop)
time = sys.clock() - time
print( "Time per epoch = " .. (time*1) .. '[s]')


Final loss = 2.0011432409287e-05	
Time per epoch = 37.918411016464[s]	
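
A quick sanity check on these two numbers, going by what the cell above actually computes: `cumm_loss` is the sum of the per-batch losses of the last epoch divided by the number of rows, and `time` covers the whole loop over all epochs. The plain Lua sketch below rescales both, with the figures copied from the output above and assuming the criterion averages over each batch:

```lua
reported_loss = 2.0011432409287e-05
reported_time = 37.918411016464

n_rows, batch_size, epochs = 1000000, 50000, 100
n_batches = n_rows / batch_size

-- undo the division by n_rows, then average over the batches instead
mean_batch_loss = reported_loss * n_rows / n_batches
-- spread the total wall-clock time over the epochs
seconds_per_epoch = reported_time / epochs

print( mean_batch_loss )     -- roughly 1.0006
print( seconds_per_epoch )   -- roughly 0.38
```

A mean squared error of about 1 is what the unit-variance noise in `Y` allows, so the fit is about as good as it can get.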


Let's take a look at the recovered parameters.  They should be close to the two weights used to generate the data, followed by the bias (the mean of the noise):  
```
1  
2  
3
```

In [42]:
print( params )

 0.9987
 1.9987
 2.9975
[torch.CudaTensor of size 3]



Not bad.