Skip to content

mcordingley/Regression

Repository files navigation

Regression

Build Status Code Climate Code Climate

What Is This All About?

This is a library for regression analysis of data. That is, it attempts to find the line of best fit to describe a relationship within the data. It takes in a series of training observations, each consisting of features and an outcome, and finds how much each feature contributes to the outcome.

As a concrete example, consider house prices. Square footage, the number of bathrooms, the age of the house, and whether or not the house has a finished basement may all affect the final sale price of a home. For thoroughly contrived reasons, you want to start pricing houses in your area. You'd find a bunch of homes that had already sold and enter their square footage, etc. as features and their sale prices as outcomes and then run a regression with that data. You'd get on the other end of the process how much each square foot is worth, each bathroom is worth, etc. With that information, you could then start to predict the price of new homes that come onto the market and have not sold. If your model is solid, you'll find out which houses are overpriced or underpriced!

This library also handles logistic regression, in which the outcomes are booleans. In this case, the regression would give you how much each feature contributes to the probability of the outcome and the prediction process would give you the probability of the outcome for a given new example.

Quick Start

As always, start with Composer:

composer require mcordingley/Regression

For those who cannot or do not want to use Composer in a given project, you can pull down a copy of this library and run composer install followed by php build-phar.php to generate a PHAR archive that can be included into your project.

Your first step in running a regression will be to load your data into an Observations object. This can be done either with individual training examples with $observations->add($exampleFeatures, $outcome); or in bulk with Observations::fromArray($arrayOfExampleFeatures, $arrayOfOutcomes). For most uses, you will want to add one additional feature to the beginning of your feature list for each training example. This will be the number 1.0, which represents the y-intercept term. If omitted, the regression line will be forced through the origin. Note that you can also create derived features, such as the square or log of some feature, if its contribution to the outcome is non-linear.

You then can create an instance of LeastSquares and call regress on it with your collection of observations. Depending on the size of your dataset, this make take some time to execute, but it will return an array of coefficients representing the relative effect of each feature on the outcomes. If you included 1.0 as your first feature for each training example, then the first coefficient will be the y-intercept. Pass these coefficients into a Predictor object to immediately start predicting the outcomes for new data or store them for later use.

Warning: Regression can be computationally expensive, especially if you're using gradient descent. Always run your regressions off-line, either as nightly batch jobs or through some queue service. Predictions are cheap and can be done on-line with the coefficients generated by an off-line regression.

Putting it all together:

use MCordingley\Regression\Algorithm\LeastSquares;
use MCordingley\Regression\Observations;
use MCordingley\Regression\Predictor\Linear;

$observations = new Observations;

// Load the data
foreach ($data as $datum) {
    // Note addition of a constant for the first feature.
    $observations->add(array_merge([1.0], $datum->features), $datum->outcome);
}

$algorithm = new LeastSquares;
$coefficients = $algorithm->regress($observations);

$predictor = new Linear($coefficients);
$predictedOutcome = $predictor->predict(array_merge([1.0], $hypotheticalFeatures));

Gathering Regression Statistics

For linear regression, it's possible to obtain detailed statistics about how well the regression fits the data. Doing so is relatively simple and best if done immediately after performing a regression. Details on what each term means and how to interpret them is a bigger subject than can be covered in this documentation, but the there is an entry on the Minitab blog that provides a good start on interpreting your regression.

use MCordingley\Regression\StatisticsGatherer\Linear;

$gatherer = new Linear($observations, $coefficients, $predictor);

$gatherer->getFStatistic(); // etc.

Logistic Regression

Logistic regression is implemented by way of gradient descent, which is detailed below. The key things when doing a logistic regression are that you use an instance of the GradientDescent algorithm with the Logistic gradient to perform the regression. Your Schedule and StoppingCriteria should be picked to best match your data and which descent algorithm you've chosen.

Given below is an example with what should be your default setup. This configuration is appropriate for most logistic regressions. Note that you will want to normalize your features before feeding them in.

use MCordingley\Regression\Algorithm\GradientDescent\Batch;
use MCordingley\Regression\Algorithm\GradientDescent\Schedule\Adam;
use MCordingley\Regression\Algorithm\GradientDescent\Gradient\Logistic as LogisticGradient;
use MCordingley\Regression\Algorithm\GradientDescent\StoppingCriteria\GradientNorm;
use MCordingley\Regression\Observations;
use MCordingley\Regression\Predictor\Logistic as LogisticPredictor;

$algorithm = new Batch(new LogisticGradient, new Adam, new GradientNorm);
$coefficients = $algorithm->regress(Observations::fromArray($features, $outcomes));

$predictor = new LogisticPredictor($coefficients);
$predictedOutcomeProbability = $predictor->predict($novelFeatures);

Gradient Descent

Sometimes, LeastSquares regression is not a viable option. This can happen if the data set becomes too large to be run through LeastSquares in a reasonable amount of time or if performing logistic regression, though certainly other, more esoteric, reasons may exist. In these cases, we find an approximate solution through an iterative numeric process called "gradient descent". Putting together an effective descent regression can be a complicated process with many different options. These options are detailed below.

Normalizing Features

Most of the time, you will want to normalize your features before feeding them in to the Observations class. What this means is altering your data so that each feature has an average of zero and unit variance. Intuitively, this "straightens" the path of the descent process, leading to a much quicker convergence on a result. Sometimes, this can be the difference between a rapid convergence and a regression that fails to converge.

While it isn't necessary to have the average and variance brought exactly to zero and one, respectively, it helps to bring them within an order of magnitude of these values. In the GradientDescent tests, for example, the GRE scores are divided by 100 to bring them within the range of zero to ten. Boolean features are allowed to remain as 0.0 or 1.0, as those values are very close, as is.

Fully normalizing a feature can be achieved by this formula: ($value - $averageOfValue) / $standardDeviationOfValue, though if calculating the standard deviation is too much trouble, then ($value - $averageOfValue) / ($maxOfValue - $minOfValue) can work just as well. More details can be found on this blog post.

Choice of Algorithm

Currently, there are three main descent algorithms to choose from: Batch, Stochastic, and MiniBatch. Batch will go through all of the data for each iteration. This can take longer, but leads to much more stable descent processes and should be your default choice. Stochastic uses just a single, randomly-drawn example from the training data for each iteration. For very large data sets, this can lead to faster convergence than the Batch process, but has the disadvantage of being much noisier on a per-iteration basis. MiniBatch is a blend of the other two approaches in which random batches of a specified size are drawn from the set of training data. This leads to somewhat more stable data on each iteration than Stochastic, but still avoids having to deal with the entire data set with each iteration.

The Adam step schedule is a good default for all gradient descents. The GradientNorm stopping criteria works well for Batch descents. Convergence is considerably trickier for Stochastic and MiniBatch descents. Right now, the recommendation is to run enough iterations to bring the descent close to convergence and then halt it with MaxIterations. To get the descent to settle as it converges, wrap your descent schedule with ExponentialDecay, giving it the same value for its scale as you did for the max iterations. The logistic test data used to test Batch settles near convergence after 10,000,000 iterations with the Stochastic descent and the recommended schedule.

When starting with a new project, it helps to tinker with the different options to find the best fit for your data. The DescentSpy stopping criteria is supplied to aid in this process. It decorates another stopping criteria and will call a specified callback on each iteration before delegating to the decorated stopping criteria. There is an example use of this class in the GradientDescent test folder with the DescentDebugger trait used to tune the descent test cases.

Over-Fitting and Regularization

It's possible for a regression to select coefficients that more accurately describe the training data at the cost of accuracy against novel data from the same process being modeled. This is known as "over-fitting". There are a few different ways to combat this. One method is "cross-validation" in which a portion of the training data is kept aside from the regression and is used to check how accurately the resulting regression model describes novel data.

Another tool to fight over-fitting is called "regularization" and involves building a penalty against each coefficient that scales with how far the coefficient strays from zero. The Regularized class decorates another Gradient instance and provides this functionality to the gradient descent process. Pass 1 into its constructor for L1 regularization or 2 for L2 regularization. Regularization for LeastSquares is planned for when an elegant implementation can be found that works for both L1 and L2 regularization.

L2 regularization spreads the penalty across coefficients, penalizing larger coefficients more heavily than small ones. This is good at reducing overall over-fitting and should be the default choice. L1 regularization penalizes coefficients equally no matter their size. This tends to drive the coefficients for unneeded features down to zero.

These concepts are discussed in more detail on MSDN. Scroll down to "Understanding Regularization".

Extending the Library

The entire library is written against interfaces with as much functionality as possible pulled out into collaborating objects. This means that you can easily swap in your own classes in place of the provided ones. In particular, the Gradient, Schedule, and StoppingCriteria interfaces are intended points of extension. If you have written an implementation of one of these that you think would be of use to others, please submit it with accompanying tests in a pull request.

Change Log

2.2.0

  • Extract interfaces from Observations and Observation.

2.1.0

  • Add InverseRootDecay
  • Add ExponentialDecay

2.0.0

  • Require PHP 7.0 or higher.
  • Require LinearAlgebra version 2.
  • Capitalize namespaces.
  • Add scalar and return type hints.
  • Make setters on Observations fluent.

1.1.0

  • Add Adam update schedule.
  • Tweak default values for RmpsProp.
  • Add NthIteration criteria decorator.
  • Add SteppedCriteria criteria decorator.

1.0.0

  • First stable release.