In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from latools import *
from sympy import *
init_printing(use_latex=True)

# Fitting a linear function to data

Suppose that we have a set of data:
$$
\begin{array}{c|c|c}
i & \alpha_i & \beta_i\\\hline
1 & 2 & 3\\
2 & 1 & 1\\
3 & 4 & 6\\
4 & 3 & 3\\
5 & 7 & 8
\end{array}
$$
Scientists suspect that there is a linear relationship between the variables $\alpha$ and $\beta$:
$$
\beta_i = \alpha_i m+d
$$
where $m$ and $d$ are constants to be determined. We can set up the problem of finding $m$ and $d$ as a linear system, with one equation per data point:
\begin{align*}
2m+d &=3\\
1m+d &=1\\
4m+d &=6\\
3m+d &=3\\
7m+d &=8
\end{align*}
We can formulate this in matrix form:
$$
\begin{bmatrix}2&1\\1&1\\4&1\\3&1\\7&1\end{bmatrix}
\begin{bmatrix}m\\d\end{bmatrix}=
\begin{bmatrix}3\\1\\6\\3\\8\end{bmatrix}
$$
Let's try to solve it using our standard method: build the augmented matrix $[A\,|\,\mathbf{b}]$ and row reduce it.

In [None]:
A = rational_matrix([[2,1],[1,1],[4,1],[3,1],[7,1]])
A

In [None]:
b = rational_matrix([[3],[1],[6],[3],[8]])
b

In [None]:
M = Matrix.hstack(A,b)
M

In [None]:
R = reduced_row_echelon_form(M)
R

Thus, the system is equivalent to:
\begin{align*}
m&=0\\
d&=0\\
0&=1
\end{align*}
Due to the last equation, this system is inconsistent. This happens because _there is no straight line that goes through all the points_:

In [None]:
plt.plot(A[:,0], b[:], 'o')
plt.axis([-1,8,-1,9])
None

In [None]:
A[:,0]
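
The inconsistency can also be seen without row reduction. The sketch below is an illustration only: it uses Python's standard `fractions` module instead of the notebook's `latools` helpers, and the names `points`, `m_line`, `d_line` and `mismatches` are introduced here. It solves the first two equations exactly and then tests the resulting line against the remaining three data points.

```python
from fractions import Fraction

# Data from the table, as (alpha_i, beta_i) pairs
points = [(2, 3), (1, 1), (4, 6), (3, 3), (7, 8)]

# Solve the first two equations exactly:
#   2m + d = 3
#   1m + d = 1
(a_first, b_first), (a_second, b_second) = points[0], points[1]
m_line = Fraction(b_first - b_second, a_first - a_second)  # slope through the first two points
d_line = b_first - m_line * a_first                        # matching intercept

# Collect the remaining points that this line misses,
# together with the value the line predicts for them
mismatches = [
    (a, b, m_line * a + d_line)
    for a, b in points[2:]
    if m_line * a + d_line != b
]
print(m_line, d_line)
print(mismatches)
```

The line through the first two points misses every one of the other three, so no single choice of $m$ and $d$ satisfies all five equations.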

The fact that the data does not perfectly fit a straight line is not surprising. Two factors may be at play:

- The data contains measurement errors. Even if the "actual" values fall on a straight line, the measured values will not.
- The straight line model is not completely accurate. It may be valid as a first approximation, but it will be necessary to adjust it with a more refined model.

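This is a situation where we can find a _Least Squares Solution_: instead of solving $A\mathbf{x}=\mathbf{b}$ exactly, we look for the vector $\mathbf{x}=\begin{bmatrix}m\\d\end{bmatrix}$ that makes $\|\mathbf{b}-A\mathbf{x}\|^2$ as small as possible. Such a vector satisfies the normal equations $A^TA\mathbf{x}=A^T\mathbf{b}$. To solve them, we first compute $A_1=A^TA$ and $\mathbf{b}_1=A^T\mathbf{b}$: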

In [None]:
A1 = A.T * A
A1

In [None]:
b1 = A.T * b
b1
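
As a worked check, computed by hand from the data table, the entries of these two products are sums over the data columns:
$$
A^TA=\begin{bmatrix}\sum\alpha_i^2&\sum\alpha_i\\\sum\alpha_i&5\end{bmatrix}
=\begin{bmatrix}79&17\\17&5\end{bmatrix},
\qquad
A^T\mathbf{b}=\begin{bmatrix}\sum\alpha_i\beta_i\\\sum\beta_i\end{bmatrix}
=\begin{bmatrix}96\\21\end{bmatrix}
$$
so the system $A_1\mathbf{x}=\mathbf{b}_1$ reads $79m+17d=96$ and $17m+5d=21$.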

In [None]:
M = Matrix.hstack(A1, b1)
M

In [None]:
R = reduced_row_echelon_form(M)
R

So, we get the solution:
$$
\begin{bmatrix}m\\d\end{bmatrix}=\begin{bmatrix}\frac{123}{106}\\\frac{27}{106}\end{bmatrix}
$$
That is, the line that best fits our data (in the least squares sense) is:
$$
\beta=\frac{123}{106}\alpha + \frac{27}{106}
$$
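The fractions above can be double-checked independently of `latools`. The following is a minimal sketch, assuming nothing beyond Python's standard `fractions` module: it builds the entries of $A^TA$ and $A^T\mathbf{b}$ as plain sums and solves the resulting $2\times 2$ system with Cramer's rule. Cramer's rule is used here only because the system is tiny; it is not the method the notebook itself uses, and all variable names are illustrative.

```python
from fractions import Fraction

alphas = [2, 1, 4, 3, 7]
betas = [3, 1, 6, 3, 8]
count = len(alphas)

# Entries of A^T A and A^T b, written as sums over the data
s_aa = sum(a * a for a in alphas)                  # sum of alpha_i^2
s_a = sum(alphas)                                  # sum of alpha_i
s_ab = sum(a * b for a, b in zip(alphas, betas))   # sum of alpha_i * beta_i
s_b = sum(betas)                                   # sum of beta_i

# Solve [[s_aa, s_a], [s_a, count]] [m, d]^T = [s_ab, s_b]^T by Cramer's rule
det = s_aa * count - s_a * s_a
m_check = Fraction(s_ab * count - s_a * s_b, det)
d_check = Fraction(s_aa * s_b - s_a * s_ab, det)
print(m_check, d_check)
```
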
Let's now display the data points again, together with the linear approximation:

In [None]:
xlss = R[:,2]
m, d = xlss
m, d

In [None]:
plt.plot(A[:,0], b[:], 'o')
xvalues = np.linspace(0,7,300)
# Convert the exact sympy values to floats before combining them with a NumPy array
yvalues = float(m) * xvalues + float(d)
plt.plot(xvalues, yvalues, '--', color='red', lw=2)
plt.axis([-1,8,-1,9])
None

To estimate the error in the approximation, we can compute the residuals:

In [None]:
r = (b - A * xlss)
r

In [None]:
[float(vv) for vv in r]
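
A useful sanity check on the residuals is that they are orthogonal to both columns of $A$, which is what the equation $A^TA\mathbf{x}=A^T\mathbf{b}$ says once it is rewritten as $A^T(\mathbf{b}-A\mathbf{x})=\mathbf{0}$. The sketch below verifies this with exact arithmetic from the standard `fractions` module; the names `m_fit`, `d_fit`, `residuals`, `dot_alpha` and `dot_ones` are introduced here for illustration.

```python
from fractions import Fraction

alphas = [2, 1, 4, 3, 7]
betas = [3, 1, 6, 3, 8]
m_fit, d_fit = Fraction(123, 106), Fraction(27, 106)  # least squares solution found above

# Residuals r_i = beta_i - (m * alpha_i + d), kept as exact fractions
residuals = [b - (m_fit * a + d_fit) for a, b in zip(alphas, betas)]

# Dot products of r with the two columns of A: (alpha_i) and (1, ..., 1)
dot_alpha = sum(a * r for a, r in zip(alphas, residuals))
dot_ones = sum(residuals)
print(residuals)
print(dot_alpha, dot_ones)
```

Both dot products come out as exactly zero. In particular, the residuals sum to zero, so the fitted line overshoots and undershoots the data by the same total amount.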

The squared length of the residual vector measures how well the linear model fits the data (the smaller it is, the better the fit):

In [None]:
float(r.norm()**2)

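The interpretation of this number is the following: it is the smallest value that $\|\mathbf{r}\|^2$ can take for this data set. Because the columns of $A$ are linearly independent, the least squares solution is unique, so any other pair $(m,d)$ yields a strictly larger value of $\|\mathbf{r}\|^2$.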
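
This minimality can be spot-checked numerically. The sketch below is not a proof, only an illustration: it evaluates the squared residual at the least squares solution and at eight nearby pairs $(m,d)$. The helper `squared_residual` and the step size of $1/10$ are arbitrary choices made here, and only the standard `fractions` module is used.

```python
from fractions import Fraction

alphas = [2, 1, 4, 3, 7]
betas = [3, 1, 6, 3, 8]


def squared_residual(slope, intercept):
    """Return the sum of (beta_i - (slope * alpha_i + intercept))^2."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(alphas, betas))


m_fit, d_fit = Fraction(123, 106), Fraction(27, 106)
best = squared_residual(m_fit, d_fit)

# Nudge the slope and intercept in every combination of -step, 0, +step,
# skipping the unperturbed pair itself
step = Fraction(1, 10)
nearby = [
    squared_residual(m_fit + dm, d_fit + dd)
    for dm in (-step, 0, step)
    for dd in (-step, 0, step)
    if (dm, dd) != (0, 0)
]
print(best, float(best))
print(all(value > best for value in nearby))
```

The exact minimum is $\frac{239}{106}\approx 2.25$, and every perturbed pair gives a larger squared residual.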