# Item: Linear Regression

## Contents
* [Overview](#overview) 
    * [Linear Regression](#linear_regression)
    * [MLE](#mle)
    * [Normal equations](#normal_equations)
    * [How Good Is The Fit?](#how_good_is_the_fit)
      * [```R^2``` Coefficient](#r2_coefficient)
* [Include files](#include_files)
* [The main function](#m_func)
* [Results](#results)
* [Source Code](#source_code)
* [References](#refs)

## <a name="overview"></a> Overview

<a href="https://en.wikipedia.org/wiki/Linear_regression">Linear regression</a> is the workhorse of statistics and supervised machine learning [1]. Although the basic model is linear (hence the name), it can also model non-linear relationships when augmented with kernels or other forms of basis function expansion.

### <a name="linear_regression"></a> Linear Regression

In statistics, <a href="https://en.wikipedia.org/wiki/Linear_regression">linear regression</a> is a mathematical approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. With more than one explanatory variable, the process is called multiple linear regression. The figure below shows a linear regression model fitted to a data set.


<figure>
<img src="linear_regression.png" alt="linear regression model"
	title="linear regression model" width="400" height="350" />
<figcaption>Figure: Linear regression model. Image from [2].</figcaption>
</figure>

In contrast to classification, which predicts class indexes, the outcome of a linear regression model, or more generally of any regression model, is a real number.

The basic model has the following form

$$\hat{y} = w_0 + w_1x_1 + \dots + w_mx_m = \sum_{i=0}^{m}w_ix_i, ~~x_0 = 1$$

where $x_i$ are the explanatory or independent variables and $w_i$ are the parameters of the model.
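
The model above is a dot product between the weights and the feature vector. The following is a minimal sketch in plain C++, independent of the library used later in this item; the function name `predict_linear` is hypothetical, and the constant feature $x_0 = 1$ is handled implicitly by treating the first weight as the intercept.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluate y_hat = w_0 + w_1*x_1 + ... + w_m*x_m.
// `w` holds the m+1 weights (w[0] is the intercept) and `x` holds the
// m features, so the constant feature x_0 = 1 is implicit.
// Hypothetical helper, not part of the library used in this item.
double predict_linear(const std::vector<double>& w,
                      const std::vector<double>& x){

    assert(w.size() == x.size() + 1);

    double y_hat = w[0];
    for(std::size_t i = 0; i < x.size(); ++i){
        y_hat += w[i + 1] * x[i];
    }
    return y_hat;
}
```

For instance, with the intercept and slope reported in the Results section, `predict_linear({0.385994, 0.415595}, {5.7})` evaluates to approximately 2.75489.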

### <a name="mle"></a> MLE

Linear regression is a rather simple mathematical model, and it owes its widespread use to this simplicity. We still need a way to estimate the parameter vector $\mathbf{w}$. The most common way is to compute the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">MLE</a> (maximum likelihood estimate). This is defined as [1]

$$\mathbf{\hat{w}}=\text{argmax}_{\mathbf{w}} \log p(D|\mathbf{w})$$

where $D$ is the set of observations we have at hand. Let's assume that these observations are independent and identically distributed (**iid**). Hence we can write [1]

$$l(\mathbf{w}) = \log p(D|\mathbf{w}) = \sum_{i=1}^{N} \log p(y_i | \mathbf{x}_i, \mathbf{w})$$

In order to proceed, we need a model for $p(y_i | \mathbf{x}_i, \mathbf{w})$. A convenient assumption is the Gaussian distribution with mean $\mathbf{w}^T\mathbf{x}_i$ and variance $\sigma^2$. This assumption yields the following [1]

$$l(\mathbf{w}) = \frac{-1}{2\sigma^2}RSS(\mathbf{w}) - \frac{N}{2} \log(2\pi \sigma^2)$$

where $RSS$ stands for the residual sum of squares, also known as the sum of squared errors (SSE):

$$RSS(\mathbf{w}) = \sum_{i=1}^{N}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$ 

Defining the residual as 

$$\epsilon_i = y_i - \mathbf{w}^T\mathbf{x}_i$$

we can write the $RSS$ as

$$RSS(\mathbf{w}) = ||\boldsymbol{\epsilon}||_{2}^{2} = \sum_{i=1}^{N}\epsilon_{i}^2$$

Hence, the MLE of $\mathbf{w}$ is the vector that minimizes the RSS: for a fixed $\sigma^2$, the last expression for $l(\mathbf{w})$ depends on $\mathbf{w}$ only through $RSS$. This method is therefore known as <a href="https://en.wikipedia.org/wiki/Least_squares">least squares</a> [1].
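
The link between the Gaussian log-likelihood and the RSS can be checked numerically. The sketch below, in plain C++, evaluates both expressions as written above; the function names `rss` and `gaussian_log_likelihood` are hypothetical, each row of `X` is assumed to already contain the constant feature $x_0 = 1$, and `sigma2` stands for $\sigma^2$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RSS(w) = sum_i (y_i - w^T x_i)^2.
// Assumption: each row of X already includes the constant feature x_0 = 1.
double rss(const std::vector<std::vector<double>>& X,
           const std::vector<double>& y,
           const std::vector<double>& w){

    assert(X.size() == y.size());

    double sum = 0.0;
    for(std::size_t i = 0; i < X.size(); ++i){
        assert(X[i].size() == w.size());

        // w^T x_i
        double prediction = 0.0;
        for(std::size_t j = 0; j < w.size(); ++j){
            prediction += w[j] * X[i][j];
        }

        const double residual = y[i] - prediction;
        sum += residual * residual;
    }
    return sum;
}

// l(w) = -RSS(w) / (2 sigma^2) - (N / 2) log(2 pi sigma^2)
double gaussian_log_likelihood(const std::vector<std::vector<double>>& X,
                               const std::vector<double>& y,
                               const std::vector<double>& w,
                               double sigma2){

    assert(sigma2 > 0.0);

    const double pi = std::acos(-1.0);
    const double n = static_cast<double>(y.size());
    return -rss(X, y, w) / (2.0 * sigma2) - 0.5 * n * std::log(2.0 * pi * sigma2);
}
```

For a fixed `sigma2`, any weight vector with a smaller RSS has a larger log-likelihood, which is why maximizing one is equivalent to minimizing the other.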

### <a name="normal_equations"></a> Normal equations

For simple problems, the solution to the minimization problem can be obtained using the normal equations [1]

$$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

which, provided that $\mathbf{X}^T\mathbf{X}$ is invertible, results in the following estimate

$$\mathbf{\hat{w}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
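
For simple linear regression (one explanatory variable plus the intercept), the normal equations reduce to a 2x2 linear system that can be solved by hand. The sketch below is an illustration in plain C++ under that assumption; the name `fit_simple_ols` is hypothetical, and the system is solved with Cramer's rule rather than by forming a general matrix inverse.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Solve X^T X w = X^T y for the model y_hat = w_0 + w_1 * x.
// With rows (1, x_i) the system reads
//   [ N       sum x   ] [w_0]   [ sum y  ]
//   [ sum x   sum x^2 ] [w_1] = [ sum xy ]
// Returns {w_0, w_1}. Hypothetical helper, not part of the library.
std::pair<double, double> fit_simple_ols(const std::vector<double>& x,
                                         const std::vector<double>& y){

    assert(x.size() == y.size() && x.size() >= 2);

    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for(std::size_t i = 0; i < x.size(); ++i){
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }

    // determinant of X^T X; it is zero when all x_i are identical
    const double det = n * sxx - sx * sx;
    assert(det != 0.0);

    const double w0 = (sy * sxx - sx * sxy) / det;
    const double w1 = (n * sxy - sx * sy) / det;
    return {w0, w1};
}
```

For larger problems it is generally preferable to solve the normal equations with a matrix factorization (for example QR or Cholesky) instead of computing the inverse explicitly.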

### <a name="how_good_is_the_fit"></a> How Good Is The Fit?

We have established the linear regression model, but how can we measure how good it is?
One such metric is the ```R^2``` coefficient, also called the coefficient of determination.

#### <a name="r2_coefficient"></a> $R^2$ coefficient

The coefficient is defined as

![R2](r2.gif)

where SSR and SST are defined, respectively, as


![SSR](ssr.gif)

![SST](sst.gif)
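
As an illustration, the sketch below computes the coefficient of determination in plain C++ using one common convention, $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the observations from their mean. This convention and the names `r2_score`, `ss_res` and `ss_tot` are assumptions made for the example; naming conventions for these sums vary between texts and may differ from the ones used in the formulas above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coefficient of determination under the assumed convention
//   R^2 = 1 - ss_res / ss_tot
// ss_res = sum_i (y_i - y_hat_i)^2, ss_tot = sum_i (y_i - mean(y))^2.
// Hypothetical helper, not part of the library used in this item.
double r2_score(const std::vector<double>& y,
                const std::vector<double>& y_hat){

    assert(y.size() == y_hat.size() && !y.empty());

    double mean = 0.0;
    for(double value : y){
        mean += value;
    }
    mean /= static_cast<double>(y.size());

    double ss_res = 0.0;
    double ss_tot = 0.0;
    for(std::size_t i = 0; i < y.size(); ++i){
        ss_res += (y[i] - y_hat[i]) * (y[i] - y_hat[i]);
        ss_tot += (y[i] - mean) * (y[i] - mean);
    }

    // ss_tot is zero only when all observations are identical
    assert(ss_tot > 0.0);
    return 1.0 - ss_res / ss_tot;
}
```

Under this convention a perfect fit gives a value of 1, while a model that always predicts the mean of the observations gives 0.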


## <a name="include_files"></a> Include files

```
#include "cubic_engine/base/cubic_engine_types.h"
#include "cubic_engine/ml/supervised_learning/regressor.h"
#include "cubic_engine/optimization/serial_gradient_descent.h"
#include "cubic_engine/optimization/utils/gd_control.h"

#include "kernel/maths/functions/real_vector_polynomial.h"
#include "kernel/maths/errorfunctions/mse_function.h"
#include "kernel/utilities/data_set_loaders.h"

#include <exception>
#include <iostream>
```

## <a name="m_func"></a> The main function

```
int main(){

    using cengine::uint_t;
    using cengine::real_t;
    using cengine::DynMat;
    using cengine::DynVec;
    using cengine::GDControl;
    using cengine::Gd;
    using cengine::LinearRegression;
    using kernel::RealVectorPolynomialFunction;
    using kernel::MSEFunction;

    try{

        auto dataset = kernel::load_car_plant_dataset();

        // The regressor to use. use a hypothesis of the form
        // f = w_0 + w_1*x_1
        // set initial weights to 0
        LinearRegression regressor({0.0, 0.0});

        // the error function to use for measuring the error
        MSEFunction<RealVectorPolynomialFunction,
                    DynMat<real_t>,
                    DynVec<uint_t>> mse(regressor.get_model());

        GDControl control(10000, kernel::KernelConsts::tolerance(), GDControl::DEFAULT_LEARNING_RATE);
        control.show_iterations = false;
        Gd gd(control);

        auto result = regressor.train(dataset.first, dataset.second, gd, mse);
        std::cout<<result<<std::endl;

        std::cout<<"Intercept: "<<regressor.coeff(0)<<" slope: "<<regressor.coeff(1)<<std::endl;

        DynVec<real_t> point{1.0, 5.7};
        auto value = regressor.predict(point);

        std::cout<<"Value: "<<value<<std::endl;
    }
    catch(const std::exception& e){

        std::cerr<<e.what()<<std::endl;
    }
    catch(...){

        std::cerr<<"Unknown exception occurred"<<std::endl;
    }

    return 0;
}

```
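
The driver above delegates the optimization to the library's `Gd` object, whose internals are not shown in this item. To illustrate the idea, here is a minimal sketch of batch gradient descent on the mean squared error for the hypothesis `f = w_0 + w_1*x_1`, written in plain C++. It is not the library's implementation: the names `GdResult`, `mean_squared_error` and `gradient_descent_fit` are hypothetical, and the stopping rule (change in error between iterations below the tolerance) is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Outcome of the sketch optimizer (hypothetical type).
struct GdResult{
    double w0;
    double w1;
    std::size_t iterations;
    bool converged;
};

// MSE(w) = (1/N) sum_i (y_i - (w_0 + w_1 * x_i))^2
double mean_squared_error(const std::vector<double>& x,
                          const std::vector<double>& y,
                          double w0, double w1){
    double sum = 0.0;
    for(std::size_t i = 0; i < x.size(); ++i){
        const double residual = y[i] - (w0 + w1 * x[i]);
        sum += residual * residual;
    }
    return sum / static_cast<double>(x.size());
}

// Batch gradient descent starting from w_0 = w_1 = 0.
// Assumed stopping rule: |previous error - current error| < tolerance.
GdResult gradient_descent_fit(const std::vector<double>& x,
                              const std::vector<double>& y,
                              std::size_t max_iterations,
                              double tolerance,
                              double learning_rate){

    assert(x.size() == y.size() && !x.empty());

    const double n = static_cast<double>(x.size());
    GdResult result{0.0, 0.0, 0, false};
    double previous_error = mean_squared_error(x, y, result.w0, result.w1);

    for(std::size_t itr = 1; itr <= max_iterations; ++itr){

        // gradient of the MSE with respect to w_0 and w_1
        double grad_w0 = 0.0;
        double grad_w1 = 0.0;
        for(std::size_t i = 0; i < x.size(); ++i){
            const double residual = y[i] - (result.w0 + result.w1 * x[i]);
            grad_w0 += -2.0 * residual / n;
            grad_w1 += -2.0 * residual * x[i] / n;
        }

        // step against the gradient
        result.w0 -= learning_rate * grad_w0;
        result.w1 -= learning_rate * grad_w1;
        result.iterations = itr;

        const double error = mean_squared_error(x, y, result.w0, result.w1);
        if(std::abs(previous_error - error) < tolerance){
            result.converged = true;
            break;
        }
        previous_error = error;
    }
    return result;
}
```

The three numeric arguments play the same roles as the maximum number of iterations, the tolerance and the learning rate passed to `GDControl` in the listing above.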

## <a name="results"></a> Results

```
# iterations:..7322
# processors:..1
# threads:.....1
Residual:......9.99585e-09
Tolerance:.....1e-08
Convergence:...Yes
Total time:....0.118251
Learning rate:..0.01

Intercept: 0.385994 slope: 0.415595
Value: 2.75489
```

## <a name="source_code"></a> Source code

<a href="../exe.cpp">exe.cpp</a>

## <a name="refs"></a> References

1. Kevin P. Murphy, ```Machine Learning: A Probabilistic Perspective```, The MIT Press, 2012
2. <a href="https://en.wikipedia.org/wiki/Linear_regression">Linear regression</a>
3. <a href="https://en.wikipedia.org/wiki/Least_squares">Least squares</a>