Enhancement of linear regression library #286

Closed
rcurtin opened this Issue Dec 29, 2014 · 14 comments


@rcurtin
Member
rcurtin commented Dec 29, 2014

Reported by sumedhghaisas
I have been working on a project that uses linear regression on a small scale, and for that I modified mlpack's linear regression files.

Things I have added:

  1. Regularization - a ridge regularization option (lambda) is provided, and a new constructor is implemented for that purpose. The equation implemented to compute theta is
    theta = inv(trans(X) * X + lambda * I) * trans(X) * y;
    This implementation gives comparable, and on average better, timing performance. With lambda = 0, the resulting theta is the same as the theta obtained from the QR decomposition.

  2. Cost function calculation - a function implemented in the LinearRegression class gives the cost of the current parameters on the points and responses passed to it (see the sketch after this list):
    cost = (1 / m) * sum((y - X * theta)^2)
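For illustration, here is a rough Armadillo sketch of the two additions; it assumes X is stored with one observation per row and y as a column vector, and the names RidgeTheta and Cost are illustrative rather than the actual mlpack API:

#include <armadillo>

// Ridge solution described above: theta = inv(X'X + lambda * I) * X'y.
// X is n x d (one observation per row), y is n x 1.
arma::vec RidgeTheta(const arma::mat& X, const arma::vec& y, const double lambda)
{
  const arma::mat I = arma::eye<arma::mat>(X.n_cols, X.n_cols);
  return arma::inv(X.t() * X + lambda * I) * X.t() * y;
}

// Cost described above: (1 / m) * sum((y - X * theta)^2), with m = X.n_rows.
double Cost(const arma::mat& X, const arma::vec& y, const arma::vec& theta)
{
  const arma::vec residuals = y - X * theta;
  return arma::dot(residuals, residuals) / X.n_rows;
}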

Things I modified:

  1. The Predict() function was implemented with a nested loop, which carries a lot of time overhead. I have replaced it with a matrix implementation, which should (maybe) improve the timing; see the sketch below.
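As a rough sketch of that matrix form (column-major layout with one point per column, intercept handling omitted; this is not the exact mlpack signature):

// Vectorized prediction: one matrix-vector product instead of a nested loop.
// points is d x n (one point per column), parameters is d x 1.
void Predict(const arma::mat& points, const arma::vec& parameters,
             arma::vec& predictions)
{
  predictions = points.t() * parameters;
}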

I have attached both linear_regression.cpp and linear_regression.hpp to my ticket. Both files are properly documented with a detailed description of the input formats.

It's not much, but these new functions can be very helpful in certain situations. Maybe you can include them in newer releases.

@rcurtin rcurtin self-assigned this Dec 29, 2014
@rcurtin rcurtin added this to the mlpack 1.0.7 milestone Dec 29, 2014
@rcurtin rcurtin closed this Dec 29, 2014
@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Hello Sumedh,

Thank you very much for the contribution! I will be taking a look through it shortly and integrating your improvements. The improvement of Predict() looks good; I will have to make sure it still passes the test. I do think the equation for theta, as implemented, is not correct because Armadillo uses column-major memory ordering. See http://www.mlpack.org/doxygen.php?doc=matrices.html for a better explanation (and let me know if that doesn't clarify).

Are you willing to write a test for the ridge regression and the cost function calculation? Everything that mlpack provides is tested; you can find the tests in src/mlpack/tests/. We already have one for linear regression in src/mlpack/tests/linear_regression_test.cpp. A good test for ridge regression might be to give a dataset where the inversion will fail if lambda = 0 (i.e. pass in a data matrix X where (X * X^T) is not invertible), but will give good results for some nonzero lambda. You could use the current test as an example. A test for the cost should not be hard; give an example dataset or two, and compare the cost with the cost that is expected for that dataset.
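A rough sketch of what such a test could look like, using the column-major convention and the lambda constructor argument discussed in this ticket (the exact constructor signature, dataset, and tolerance here are assumptions, not the final test):

BOOST_AUTO_TEST_CASE(RidgeRegressionSingularCovarianceTest)
{
  // Two identical dimensions, so the covariance matrix is rank-deficient and a
  // plain inversion would fail; a nonzero lambda makes the problem well-posed.
  arma::mat points("0 1 2 3 4;"
                   "0 1 2 3 4");
  arma::vec responses = { 0, 2, 4, 6, 8 };

  // Hypothetical ridge constructor: predictors, responses, lambda.
  LinearRegression lr(points, responses, 0.5);

  arma::vec predictions;
  lr.Predict(points, predictions);

  // The fit on the training data should still be reasonable.
  for (size_t i = 0; i < responses.n_elem; ++i)
    BOOST_REQUIRE_SMALL(predictions[i] - responses[i], 1.0);
}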

Let me know if you have any questions. Again, thanks for the contribution. I will work in at least the modifications to Predict() over the next day or two, and once I have a test for the new functionality I am happy to add that too.

Thanks!

Ryan

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
Yeah, I got to know about the column-major implementation, but as far as the functionality is concerned it doesn't affect anything. The data loading function actually loads the data after transposing the matrix, and the same goes for saving (the data is saved after transposing). Check the test.cpp I have implemented: if you transpose the loaded data then everything works fine, except of course, as you mentioned, when trans(X) * X is not invertible. I will think of some better implementation for that.
Other than that, everything is fine. I have checked my answers against Octave modules and they both give the same results. Just check test.cpp and pass a CSV file to it.
Yeah, I will try to write the test function; these implementations pass the current tests, though. I will add the cost calculation and everything.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
I have changed my implementation to adjust to the current column-major matrix convention. Column-major storage will speed up the calculations, which is why in the implementation I multiply trans(trans(parameters) * points) and not trans(points) * points, so that the main multiplication benefits from the column-major storage. I have also used arma::solve() in place of the inverse; does that solve the problem? I don't know the exact implementation of arma::solve().

I have returned the predictions as a column matrix, but looking at your code I am confused about the return convention. If your code wants to return the predictions as a row matrix, just remove the trans() call in the implementation.

About the test: these files pass the current test, and I am confused about what to implement in the test function.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin

Other than that, everything is fine. I have checked my answers against Octave modules and they both give the same results.
Just check test.cpp and pass a CSV file to it.

So, the idea behind the test is that you've tested it by hand, and I trust you that it's right, but two years from now, someone might change a line in there and everything might break -- and they're not going to take the time to run the tests again. So writing a test just makes the testing you were doing automatic. That way, if someone makes a change, all they have to do is 'make mlpack_test' and then run the test to know that everything is still working right. If you send me the Octave numbers you were comparing against, I can show you an example of what I mean. There is already a linear regression test for standard linear regression you can run; if you 'make mlpack_test' you can then run 'mlpack_test -t LinearRegressionTest' and see if it still passes the test.

Thanks for changing the implementation to work with column-major matrices. data::Load() automatically transposes data (usually, data is stored in row-major format, but we need it in column-major format), so your code was calling data::Load(), which transposes, and then transposing again. arma::solve() should work in place of arma::inv(), to my knowledge. If it still passes your tests then there is no problem. :)
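As a tiny illustration of the difference (a sketch only, where A stands for trans(X) * X + lambda * I and b for trans(X) * y from the earlier discussion):

#include <armadillo>

// arma::inv() forms an explicit inverse; arma::solve() factorizes the system
// directly, which is generally faster and more numerically stable.
arma::vec SolveNormalEquations(const arma::mat& A, const arma::vec& b)
{
  // Equivalent result, up to floating-point error: arma::inv(A) * b.
  return arma::solve(A, b);
}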

I'll take a look into the return type for the predictions and let you know what that ought to be.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
I calculated the cost of the model on the test case of linear regression and it turns out to be

      cost is 9.07497e-05

So I thought of adding the lines

double cost = lr.ComputeCost(points, responses);

BOOST_REQUIRE_SMALL(cost, 0.01);

Is that correct? That way we can test the cost function... but how do we test the lambda input and output?

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Hello Sumedh,

Sorry for the slow response. I have been busy. I would suggest this as a check for the cost:

const double cost = lr.ComputeCost(points, responses);

// 0.001% tolerance.
BOOST_REQUIRE_CLOSE(cost, 9.07497e-5, 0.001);

It might be easy to write tests for some other simple corner cases for ComputeCost(). For instance, if the responses fit the model perfectly, then the cost should be zero. It shouldn't be too hard to make a dataset like that... here's one in two dimensions, with the responses as the one-dimensional column vector to the right:

[ 0 0 ]    [ 0 ]
[ 1 1 ]    [ 2 ]
[ 2 2 ]    [ 4 ]
[ 1 2 ]    [ 3 ]
[ 6 2 ]    [ 8 ]
[ 2 6 ]    [ 8 ]

That's the relation y = x_1 + x_2, which the linear regression model should fit perfectly (giving a cost of 0). You could write a test function to run linear regression on that dataset then ensure the cost of that dataset is 0.
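A rough sketch of such a test, written in the style of the existing linear_regression_test.cpp (the function name ComputeCost() and the tolerance follow this ticket's discussion; the details are illustrative):

BOOST_AUTO_TEST_CASE(LinearRegressionPerfectFitCostTest)
{
  // The dataset above, stored column-major (one point per column); y = x_1 + x_2.
  arma::mat points("0 1 2 1 6 2;"
                   "0 1 2 2 2 6");
  arma::vec responses = { 0, 2, 4, 3, 8, 8 };

  LinearRegression lr(points, responses);

  // A perfect linear fit should give a cost of (essentially) zero.
  BOOST_REQUIRE_SMALL(lr.ComputeCost(points, responses), 1e-25);
}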

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Ok; I've incorporated some of your changes in r15650, and added you as a contributor in r15651. If you'd prefer that I remove your name or change the email, let me know. When we work out the tests for the other functionality (the ridge regression and the cost computation), I'll add them in.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
The 0.01 tolerance test works for the cost computation; it would be better than a rigid 0 test, I guess. To test ridge regression we can check the value with lambda = 0 and with lambda = 0.1; it comes out as
lambda = 0 :: 6.28998e-06
lambda = 0.1 :: 8.36135e-06

We can add a range restriction on the variation with lambda, as we did for the cost computation.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Hello Sumedh,

I apologize again for the slow response. I had a paper deadline yesterday and was very busy preparing for it...

Which dataset are you referring to for your check for ridge regression? Also, I think another good test for ridge regression is to make sure that it solves the problem it was intended to solve: when (X'X) isn't invertible. So we should design some data matrix that has a non-invertible covariance matrix, and then choose a ridge regression parameter large enough so that adding to the diagonal is enough to make it invertible and the regression will be successful. Does that make sense?

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas

[ 0 0 ]    [ 0 ]
[ 1 1 ]    [ 2 ]
[ 2 2 ]    [ 4 ]
[ 1 2 ]    [ 3 ]
[ 6 2 ]    [ 8 ]
[ 2 6 ]    [ 8 ]

This dataset gives a cost of 1.88099e-31.
So this dataset can surely be used for checking ComputeCost(). Designing a matrix such that X * X' is non-invertible makes sense, but that way we need different datasets to check each piece of functionality.
After verifying the cost function we can just apply a lambda value of 10 to this dataset and check that the cost is high enough. With a lambda value of 10 this dataset gives a cost of 0.3278. Can this be used as the check for ridge regression?
If the cost function is verified and it gives high results with a high lambda, can we assume ridge regression is correct?

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Ok, that looks good for a test for ComputeCost(). I've committed r15835, r15836, and r15837, where I added your implementation of ComputeCost(), although I renamed it to ComputeError() to be more in line with the terms used elsewhere in mlpack. I modified it slightly to avoid a copy of the predictors matrix.

I found that the dataset you gave can be predicted by linear regression perfectly, so it should have an error of 0. The test checks that ComputeError() returns a number smaller than 1e-25 (since floating-point error means it won't be exactly 0).

I also added another dataset that linear regression can't predict perfectly:

[  0  16 ]    [ 0 ]
[  1   8 ]    [ 2 ]
[  2   4 ]    [ 4 ]
[  4   2 ]    [ 3 ]
[  8   1 ]    [ 8 ]
[ 16   0 ]    [ 8 ]

I hand-calculated the error in Octave and it gives 1.189500337, so the test makes sure ComputeError() returns a value close to that.
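In test form, that check might look roughly like this (column-major layout, one point per column; the 0.1% tolerance is an illustrative choice):

BOOST_AUTO_TEST_CASE(LinearRegressionComputeErrorTest)
{
  // The second dataset above, which linear regression cannot fit perfectly.
  arma::mat points(" 0 1 2 4 8 16;"
                   "16 8 4 2 1  0");
  arma::vec responses = { 0, 2, 4, 3, 8, 8 };

  LinearRegression lr(points, responses);

  // Compare against the hand-computed Octave value.
  BOOST_REQUIRE_CLOSE(lr.ComputeError(points, responses), 1.189500337, 0.1);
}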

These datasets could be used to check that ridge regression is successful; however, the best test for ridge regression will be a matrix where lambda must be added to the diagonal for the matrix to be invertible. If you can provide a matrix like this, and the value of lambda that makes it invertible, I'm happy to write the test and integrate the last bit of support. Then, I will also write a test that runs ridge regression with some lambda on some test dataset, and make sure the results are similar to those given by MATLAB's implementation of ridge regression.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
It sounds very trivial, but can we consider a matrix with all entries zero as our test for ridge regression? X * X' will be non-invertible, but any nonzero lambda will make that matrix invertible. It doesn't make much sense from a conceptual point of view, since there can't be any dataset with all values zero, but the test can be considered purely mathematical.

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by rcurtin
Clever! I've written a test for that and another test to ensure ridge regression with a very small lambda is equivalent to linear regression.

I also refactored the computation of the model to use the QR factorization even when ridge regression is used, and I've minimized copies by avoiding insert_rows() and shed_rows(). A call to insert_rows() and shed_rows() is two memory allocations and copies, but if we just copy the matrix once, it ends up being faster.

Here's a link to more information on using the QR decomposition for ridge regression, if you are interested:
http://math.stackexchange.com/questions/299481/
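For reference, a minimal sketch of the augmentation trick described at that link, with X stored one observation per row; this is illustrative only and not the exact mlpack refactoring:

#include <armadillo>
#include <cmath>

// Ridge regression as an ordinary least-squares problem: stack sqrt(lambda) * I
// under X and zeros under y, then solve the stacked system, so lambda == 0 and
// lambda > 0 are handled by the same least-squares (e.g. QR-based) path.
arma::vec RidgeViaQR(const arma::mat& X, const arma::vec& y, const double lambda)
{
  const arma::uword d = X.n_cols;

  arma::mat Xaug = arma::join_cols(X, std::sqrt(lambda) * arma::eye<arma::mat>(d, d));
  arma::vec yaug = arma::join_cols(y, arma::zeros<arma::vec>(d));

  return arma::solve(Xaug, yaug);
}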

I'm going to close this ticket since all of the functionality you originally submitted is now implemented. You can use ridge regression with the 'linear_regression' executable by specifying a parameter to --lambda.

Thanks again for your contribution. The features you've designed will be available with the 1.0.7 release, which should happen by early October. :)

@rcurtin
Member
rcurtin commented Dec 30, 2014

Commented by sumedhghaisas
Looking forward to that release :)
