
Multivariate Linear Regression support #260

Open · sbrbot opened this issue Nov 26, 2017 · 19 comments

@sbrbot

sbrbot commented Nov 26, 2017

There is support for regression in MathPHP, but only simple linear regression. Is there an intention to support multivariate (linear) regression in the future, or other types of regression (like simple exponential or second-degree polynomial regression)?

@markrogoyski
Owner

Hi, thanks for your interest in MathPHP.
Yes, the intention is to support all the useful statistical regression methods. Since there is now a feature request for these, we can prioritize them in the backlog.

I'll update this issue thread when there is progress on implementing these features. Thanks again for your input.

@Beakerboy
Contributor

Beakerboy commented Nov 27, 2017

I’m pretty sure the linear regression trait is written in a generic enough way that it will work for multiple linear regression. All it would take is a polynomial model and a class that joins the trait with the model.

Edit: Looking at the code, I think it’s good for polynomial, but not multiple linear. However, only some small tweaks would be needed.
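
For illustration, the kind of wiring being described might look like the sketch below. The trait and model names here are assumptions based on MathPHP's Regression namespace, not confirmed API:

<?php
// Hypothetical sketch only: a class that joins MathPHP's least-squares
// method trait with a (hypothetical) polynomial model trait. The exact
// trait names and namespaces are assumptions, not confirmed API.
namespace MathPHP\Statistics\Regression;

class PolynomialRegression extends Regression
{
    use Methods\LeastSquares;     // parameter fitting by ordinary least squares
    use Models\PolynomialModel;   // hypothetical model: y = β₀ + β₁x + β₂x² + …
}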

@sbrbot
Author

sbrbot commented Nov 28, 2017

I agree, only small tweaks would be needed to implement multivariate linear regression, since it is just a matter of matrix calculation, β = (X'X)⁻¹X'Y, and MathPHP does have support for matrices. I do have one PHP library for such a regression, but it uses an external library for matrices which does not implement Gauss-Jordan elimination for inverse matrix calculation, and the calculation is too slow for bigger matrices. Instead I could use MathPHP's matrix calculator. And since MathPHP intends to cover a broad spectrum of mathematical areas, support for multivariate linear regression could be implemented.

In MathPHP syntax, calculating the multivariate linear regression coefficients would be:

$beta = $X->transpose()->multiply($X)->inverse()->multiply($X->transpose())->multiply($Y);
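
Fleshed out a bit, assuming MathPHP's MatrixFactory and the Matrix methods used above (the data values are made up; X carries a leading column of ones for the intercept and Y is an n×1 column matrix):

<?php
use MathPHP\LinearAlgebra\MatrixFactory;

// Design matrix: an intercept column of ones followed by two predictors.
$X = MatrixFactory::create([
    [1, 2.0, 3.0],
    [1, 4.0, 1.0],
    [1, 5.0, 7.0],
    [1, 8.0, 2.0],
]);

// Observations as a 4×1 column matrix.
$Y = MatrixFactory::create([[10.0], [12.0], [21.0], [15.0]]);

// Normal equations: β = (X'X)⁻¹X'Y
$Xt   = $X->transpose();
$beta = $Xt->multiply($X)->inverse()->multiply($Xt)->multiply($Y);

// β is a 3×1 column matrix: the intercept, then one coefficient per predictor.
echo $beta;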

@sbrbot sbrbot changed the title No Multiple Linear Regression No Multivariate Linear Regression Nov 28, 2017
@sbrbot sbrbot changed the title No Multivariate Linear Regression Multivariate Linear Regression support Nov 28, 2017
@sbrbot
Author

sbrbot commented Nov 28, 2017

I intend to switch from the old regression library to MathPHP, but first let me check whether MathPHP's matrix->inverse() method implements Gauss-Jordan elimination (or any other better optimization if there is one). :-)

I checked inside the code, but no, unfortunately MathPHP does not implement Gauss-Jordan elimination for calculating the inverse matrix; it uses the standard calculation with augmented submatrices. For a matrix with more variables, the recursive calculation of augmented submatrices grows exponentially, and it becomes slow.

Hm, maybe I could do something about it inside MathPHP...
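
For reference, a textbook Gauss-Jordan inversion over plain PHP arrays (not MathPHP code, just a minimal sketch of the algorithm being discussed; partial pivoting only, no conditioning checks beyond singularity):

<?php
/**
 * Invert a square matrix via Gauss-Jordan elimination on [A | I].
 */
function gaussJordanInverse(array $A): array
{
    $n = count($A);

    // Augment A with the identity matrix: [A | I].
    for ($i = 0; $i < $n; $i++) {
        $A[$i] = array_merge($A[$i], array_fill(0, $n, 0.0));
        $A[$i][$n + $i] = 1.0;
    }

    for ($col = 0; $col < $n; $col++) {
        // Partial pivoting: move the largest entry into the pivot position.
        $pivotRow = $col;
        for ($r = $col + 1; $r < $n; $r++) {
            if (abs($A[$r][$col]) > abs($A[$pivotRow][$col])) {
                $pivotRow = $r;
            }
        }
        if (abs($A[$pivotRow][$col]) < 1e-12) {
            throw new RuntimeException('Matrix is singular');
        }
        [$A[$col], $A[$pivotRow]] = [$A[$pivotRow], $A[$col]];

        // Scale the pivot row so the leading entry becomes 1 ...
        $pivot = $A[$col][$col];
        for ($j = 0; $j < 2 * $n; $j++) {
            $A[$col][$j] /= $pivot;
        }

        // ... then eliminate the pivot column from every other row.
        for ($r = 0; $r < $n; $r++) {
            if ($r === $col) {
                continue;
            }
            $factor = $A[$r][$col];
            for ($j = 0; $j < 2 * $n; $j++) {
                $A[$r][$j] -= $factor * $A[$col][$j];
            }
        }
    }

    // The right half of the augmented matrix is now A⁻¹.
    return array_map(function (array $row) use ($n) {
        return array_slice($row, $n);
    }, $A);
}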

@Beakerboy
Contributor

If you have the knowledge, willingness, and time to refactor the inverse method to an algorithm that is faster, we would love to have it. Thanks for any help!

@markrogoyski
Owner

@sbrbot
Can you explain what optimization you think the MathPHP inverse is not using?

The basic strategy is to reduce to RREF and then find the inverse. We use Gaussian elimination to reduce to REF, and then continue on all the way to RREF. My understanding of the terms is like this:

  • Gaussian elimination: process to reduce a matrix to REF
  • Gauss-Jordan elimination: Continue with Gaussian elimination from REF until you get to RREF.

Based on my understanding that is what we are doing. Is there something else in the meaning of Gauss-Jordan elimination that we are not doing?

@sbrbot
Author

sbrbot commented Dec 6, 2017

Thanks Mark, you're right, that's what I need. I see now that MathPHP does use Gauss-Jordan elimination! I was wrong; I did not see it at first sight. However, reducing a matrix to RREF means that the matrix should have only ones as its leading (diagonal) elements (https://en.wikipedia.org/wiki/Row_echelon_form), and MathPHP does not seem to normalize rows (dividing them by their first non-zero element) in its ->rref() function.
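
A quick way to check that claim against the library itself (assuming MatrixFactory and the rref() method as exercised by the test suite, and that rref() returns a printable Matrix):

<?php
use MathPHP\LinearAlgebra\MatrixFactory;

$A = MatrixFactory::create([
    [ 2,  4, -2],
    [ 4,  9, -3],
    [-2, -3,  7],
]);

// If rref() normalizes rows, every leading coefficient printed here is 1.
echo $A->rref();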

@Beakerboy
Contributor

@sbrbot Do you have an example of some input and expected output that we can compare with what the function currently produces?

@markrogoyski
Owner

markrogoyski commented Dec 7, 2017

@sbrbot Can you explain what you mean by not normalizing rows vs what it is currently doing?

Please take a look at the unit test cases for the rref method. There are 55 test cases on rref alone, each with an input matrix and the expected RREF output.
https://github.com/markrogoyski/math-php/blob/develop/tests/LinearAlgebra/MatrixDecompositionsTest.php#L300

Look at the data provider immediately below the test to see the inputs and expected outputs. All test cases were generated with authoritative third-party sources. Are there any results there that you think should be computed differently?

For reference, to be in REF, a matrix must satisfy:

  • All non-zero rows are above any zero rows
  • The leading coefficient of a non-zero row is always to the right of the leading coefficient of the row above it.

Then, to be in RREF, a matrix must satisfy:

  • Be in REF
  • Every leading coefficient is 1
  • The leading coefficient is the only non-zero value in its column

If you can provide some sample data and output where MathPHP's rref method is not doing that I can take a look at it. If you think there are some other criteria that the output matrix must fulfill in order to be considered in RREF, please let me know what you think those are and what the documented source of the definition is.

I'm happy that people look thoroughly at MathPHP's code and hold it up to scrutiny. I want the code to be correct. That's why I've written thousands and thousands of tests. However, in this case, I'm not convinced there is any issue. But if you can provide some data that shows otherwise I will definitely take a look!

Thanks.

@sbrbot
Author

sbrbot commented Dec 7, 2017 via email

@Beakerboy
Contributor

Beakerboy commented Dec 7, 2017

If you need a matrix of complex numbers, you can use an instance of the ObjectMatrix and populate the elements with Complex objects. The ObjectMatrix can be populated with any PHP object that implements the ObjectArithmetic interface, which means the object supports addition, subtraction, and multiplication. With the ObjectMatrix, one can do matrix algebra on the Polynomial, Complex, or Rational number objects in MathPHP. I am working on 128-bit integer and floating point objects as well.

Edit: It looks like the ObjectMatrix is only for square matrices at this time. It’s used internally to ease calculation of eigenvalues.
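
For illustration, usage might look like the sketch below, assuming the ObjectMatrix and Complex classes in MathPHP's LinearAlgebra and Number namespaces; the exact constructor and method signatures here are assumptions:

<?php
use MathPHP\LinearAlgebra\ObjectMatrix;
use MathPHP\Number\Complex;

// A 2×2 matrix whose elements all implement the ObjectArithmetic interface.
$A = new ObjectMatrix([
    [new Complex(1, 2), new Complex(0, -1)],
    [new Complex(3, 0), new Complex(2, 2)],
]);

// Matrix algebra dispatches to the elements' add/subtract/multiply methods.
$sum     = $A->add($A);
$product = $A->multiply($A);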

@markrogoyski
Owner

The ObjectMatrix is a work in progress, but it could eventually be used to find things like the inverse of a matrix composed of Complex numbers, once all the matrix methods are reimplemented within ObjectMatrix to use ObjectArithmetic.

Speaking of SquareMatrix, I remember we talked about removing that, and you convinced me we should do it. I have it in the backlog of items to do, but I don't remember what the reason for removing it was.

@sbrbot
Author

sbrbot commented Dec 8, 2017 via email

@Beakerboy
Contributor

I think my opinion was to move all the square methods to a trait. Then, if the Factory determines the matrix is square, it can add the trait to the class. This would prevent us from having square and rectangular versions of every matrix type.
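
One hypothetical shape for that idea (since PHP traits are composed at class-definition time, the factory would return a square subclass that uses the trait; all names here are illustrative, not MathPHP's actual classes):

<?php
trait SquareMatrixMethods
{
    // Methods that only make sense for square matrices, e.g. trace().
    public function trace(): float
    {
        $trace = 0.0;
        for ($i = 0; $i < $this->n; $i++) {
            $trace += $this->A[$i][$i];
        }
        return $trace;
    }
}

class Matrix
{
    protected $A;
    protected $m; // rows
    protected $n; // columns

    public function __construct(array $A)
    {
        $this->A = $A;
        $this->m = count($A);
        $this->n = count($A[0]);
    }
}

class SquareMatrix extends Matrix
{
    use SquareMatrixMethods;
}

class MatrixFactory
{
    public static function create(array $A): Matrix
    {
        // Traits bind at class definition, so "adding the trait" here means
        // returning the square subclass when rows == columns.
        return count($A) === count($A[0]) ? new SquareMatrix($A) : new Matrix($A);
    }
}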

@keikland

New to this lib, but getting into it now. Multiple linear regression is a feature I certainly need. The discussion on implementation thus far has been based on the textbook approach, which for serious use is slow and can have low accuracy. Since a full-rank X'X is a positive definite symmetric matrix, a QR (or Cholesky, preferably scale-stabilized) factorization is the normal inversion/equation-solving mechanism.

There are other methods, too, for extended functionality, like SVD and the rank-deficient Moore-Penrose inverse (with solution-vector norm minimization), that could be added as a selectable solution class (perhaps including Gauss-Jordan elimination/explicit inverse, and complex versions). sbrbot's advice to use the DFT_matrix approach is highly nonstandard, probably numerically unstable (e.g. for calculation of the normal equations), and not efficient for general use.

In most cases, a sparse implementation in column-major format speeds up calculations dramatically and also opens the way to bias-reducing extensions (2- and 3-stage least squares), confidence intervals, and advanced hypothesis testing.
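
As a sketch of what column-major sparse storage could mean concretely, here is a compressed-sparse-column (CSC) layout in plain PHP arrays; the field names are illustrative, not any library's format:

<?php
// CSC layout for the 3×3 matrix
//   [2 0 0]
//   [0 0 1]
//   [4 0 3]
// Only non-zero values are stored, grouped by column.
$sparse = [
    'rows'   => 3,
    'cols'   => 3,
    'values' => [2.0, 4.0, 1.0, 3.0], // non-zeros, column by column
    'rowIdx' => [0,   2,   1,   2  ], // row index of each stored value
    'colPtr' => [0, 2, 2, 4],         // values[colPtr[j] .. colPtr[j+1]-1] belong to column j
];

// Sparse matrix-vector product y = A·x touches only the stored non-zeros.
function cscMatVec(array $A, array $x): array
{
    $y = array_fill(0, $A['rows'], 0.0);
    for ($j = 0; $j < $A['cols']; $j++) {
        for ($k = $A['colPtr'][$j]; $k < $A['colPtr'][$j + 1]; $k++) {
            $y[$A['rowIdx'][$k]] += $A['values'][$k] * $x[$j];
        }
    }
    return $y;
}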

Anyway, for the first step, definitely go with a sparse model and Cholesky factorization (QR up front if an efficient PHP code class can be found). Generally, for the select cases where access to the explicit inverse is useful, e.g. the normal equations, it can always be computed in an additional step.
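
For concreteness, a minimal pure-PHP sketch (not MathPHP API) of solving the normal equations X'Xβ = X'Y by Cholesky factorization instead of forming an explicit inverse. Dense arrays, no pivoting, and it assumes X'X is positive definite:

<?php
/**
 * Solve A·x = b for symmetric positive definite A via Cholesky (A = L·Lᵀ).
 * Minimal dense sketch; no rank-deficiency handling.
 * For regression: $beta = choleskySolve($XtX, $Xty);
 */
function choleskySolve(array $A, array $b): array
{
    $n = count($A);

    // Factor A = L·Lᵀ (lower-triangular L).
    $L = array_fill(0, $n, array_fill(0, $n, 0.0));
    for ($i = 0; $i < $n; $i++) {
        for ($j = 0; $j <= $i; $j++) {
            $sum = $A[$i][$j];
            for ($k = 0; $k < $j; $k++) {
                $sum -= $L[$i][$k] * $L[$j][$k];
            }
            $L[$i][$j] = ($i === $j) ? sqrt($sum) : $sum / $L[$j][$j];
        }
    }

    // Forward substitution: L·y = b.
    $y = array_fill(0, $n, 0.0);
    for ($i = 0; $i < $n; $i++) {
        $sum = $b[$i];
        for ($k = 0; $k < $i; $k++) {
            $sum -= $L[$i][$k] * $y[$k];
        }
        $y[$i] = $sum / $L[$i][$i];
    }

    // Back substitution: Lᵀ·x = y.
    $x = array_fill(0, $n, 0.0);
    for ($i = $n - 1; $i >= 0; $i--) {
        $sum = $y[$i];
        for ($k = $i + 1; $k < $n; $k++) {
            $sum -= $L[$k][$i] * $x[$k];
        }
        $x[$i] = $sum / $L[$i][$i];
    }

    return $x; // here, the regression coefficients β
}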

Kjell

@markrogoyski
Owner

Hi @keikland
Thanks for your interest in MathPHP.

Some of the next things I am going to investigate implementing are matrix QR decomposition and SVD. Basically, build up the basic tools that can be used and applied to implement other things. If you have thoughts or opinions on prioritizing features, please voice them. And if you can offer any expertise in some of the complex matrix operations, don't hesitate to contribute if you'd like to.

Mark

@keikland

Hi Mark,
QR would be great, and I agree that it is essential to establish a strong/fast basic toolset with key matrix-matrix and matrix-vector operations. I don't know MathPHP well enough yet, but the structure of the lib suggests that much of the logic is already present. It would be great to interface with a fast BLAS library implementation, but then the issue of the packing-analysis-repacking loop arises, much like CPU vs GPU. Probably the wrong choice, but it should be considered.

Array operations already seem to be a lot faster with PHP 7 than with earlier versions, and with clang-like features and performance coming in v7.3 or v8, a numerically indexed native PHP array may actually become quite good. Compatibility and maintainability would also benefit. What I am basically saying is that a highly optimized pure PHP implementation may be the best way to go, while creating a modular structure now that makes it easy to implement parallel processing later (e.g. matrix-matrix multiplication and Schur-complement-like factorization).

On the feature set, I would suggest including (provision for) automatic branching to rank-deficient operations if rank deficiency is detected by the "naive" QR factorization. That would mean recasting as a Moore-Penrose inverse plus additional matrix-matrix multiplications. However, the matrix/vector operations involved are simple enough. My preferred reference here is the old Rao book on statistical inference, which is also a great guide for implementing parametric (composite/multi-factor) hypothesis testing.

Almost forgot to mention that fast matrix transposition and/or explicitly maintaining a separate row-major version of key sparse matrices can be key to performance. Memory is cheap these days, and PHP's new arrays also require a lot less memory than before.

As a side note, it is interesting that, with the focus these days on machine learning, multiple linear regression is not more prominent. It is basically the earliest form of big-data inference. Many neural network implementations are incredibly rank-deficient, but the resolution of rank deficiency is not transparent. Usually it is ignored, and it translates into very weak forecasts if the underlying factors change only slightly. With a rank-deficient multiple regression approach, whether alone or integrated into an ML framework, it is both easier to diagnose and has a transparent, principled "cost of error" resolution mechanism.

In summary: PHP arrays + QR + fast matrix/vector operations (+ rank-deficient MLR + parallelism). These features give the foundation for two/three-stage least squares extensions and maximum likelihood estimation too. Note also that there is a lot of focus in the academic and advanced-user communities on the numerical accuracy of the algorithms. It would be very good if MathPHP could include some of the standard test sets, both for development purposes and to document reliability.

Kjell

@markrogoyski
Owner

Hi Kjell,

Thanks again for your interest and comments about improving MathPHP.

The basic premise of the library is that this would be all you need to include to get all the common math functionality you would use, so the goal is to do everything in PHP without other external dependencies.

You mentioned some standard tests. Can you provide some links or resources for those tests? MathPHP has 100% unit test coverage and strives for correctness by going beyond 100% test coverage on important features. I'd be happy to add any additional tests if you can provide a source or pull request.

Thanks again!
Mark

@keikland

keikland commented Feb 26, 2018 via email
