Let $D = \{(\vec{x}_i, y_i)\}_{i=1}^N$ be a dataset, where each $\vec{x}_i \in \mathbb{R}^{d-1}$ is a feature vector and each $y_i \in \mathbb{R}$ is its label. The goal is to approximate the data or to make predictions, so we fit a linear model to $D$; that is, we seek a hyperplane given by

\begin{equation*}
y = \vec{w}^T \vec{x} + b
\end{equation*}

The prediction error on the $i$-th data point is $\epsilon_i = \vec{x}_i^T \vec{w} + b - y_i$ for $i = 1, \ldots, N$. Stacking these errors into a vector $\vec{\epsilon}$, we can write them in matrix form as follows

$$
  \vec{\epsilon} = \begin{pmatrix}
     1 & x_1^{(1)} & \cdots & x_{1}^{(d-1)} \\
     \vdots & \vdots & \ddots & \vdots \\
     1 & x_N^{(1)} & \cdots & x_N^{(d-1)} 
  \end{pmatrix}
  \begin{pmatrix}
    b  \\
   w_1 \\
   \vdots \\
   w_{d-1}
   \end{pmatrix} -
  \begin{pmatrix}
   y_1 \\
   \vdots \\
   y_N
  \end{pmatrix}
$$
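
As a minimal sketch of this matrix form, the NumPy snippet below prepends a column of ones to the data and evaluates the error vector for a given intercept and weight vector. The helper names `design_matrix` and `residuals` and the small demo arrays are illustrative assumptions, not part of the original text.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones to the (N, d-1) data array X, giving an (N, d) matrix."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

def residuals(X, y, b, w):
    """Error vector: design matrix times (b, w_1, ..., w_{d-1}) minus the labels y."""
    A = design_matrix(X)
    params = np.concatenate([[b], w])  # stack the intercept on top of the weights
    return A @ params - np.asarray(y, dtype=float)

# Hypothetical toy data: N = 4 points with d - 1 = 2 features each.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
y_demo = np.array([1.0, 2.0, 3.0, 5.0])
w_demo = np.array([1.0, 0.5])
eps_demo = residuals(X_demo, y_demo, b=0.5, w=w_demo)
```

Each entry of `eps_demo` agrees with the componentwise definition $\epsilon_i = \vec{x}_i^T \vec{w} + b - y_i$.
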
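To obtain an accurate model we seek to minimize the error, which we do by minimizing the squared norm of the error vector $\vec{\epsilon}$. Denoting by $A$ the $N \times d$ matrix above, by $\vec{x} = (b, w_1, \ldots, w_{d-1})^T$ the vector of unknown parameters (not to be confused with the data points $\vec{x}_i$), and by $\vec{y}$ the vector of labels, we pose the following optimization problem: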

\begin{equation*}
   \mbox{arg min}_{\vec{x}} \|\epsilon \|_2^2 = \mbox{arg min}_{\vec{x}} \| Ax - y \|_2^2
\end{equation*}


By doing this we arrive at a least squares problem, whose objective function is $f(\vec{x})  = \| Ax - y \|_2^2 $. We now solve this problem to see which conditions $A$ and $\vec{y}$ must satisfy for a solution to exist. First we rewrite the function in a form that is easier to work with.

\begin{align*}
 f(\vec{x}) & = \| Ax - y \|_2^2 \\
              & = (Ax-y)^{T}(Ax-y) \\
              & = (x^{T}A^T-y^T)(Ax-y) \\
              & = x^{T}A^{T}Ax-x^{T}A^{T}y-y^{T}Ax+ y^{T}y \\
              & = x^{T}A^{T}Ax-2x^{T}A^{T}y+ y^{T}y
\end{align*}

Now we compute the gradient of $f$ with respect to $\vec{x}$:

\begin{align*}
\frac{\partial}{\partial \vec{x}}f(\vec{x})  & = \frac{\partial}{\partial \vec{x}}(\vec{x}^{T}A^{T}A\vec{x}-2\vec{x}^{T}A^{T}y+ y^{T}y) \\
                                 & = 2A^{T}A\vec{x}-2A^{T}y
\end{align*}
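
The gradient formula can be sanity-checked numerically. The sketch below compares $2A^TA\vec{x} - 2A^Ty$ against a central finite-difference approximation on random data; the function names, the step size `h`, and the random test matrices are assumptions made only for this illustration.

```python
import numpy as np

def f(A, y, x):
    """Least squares objective ||Ax - y||_2^2."""
    r = A @ x - y
    return r @ r

def grad_f(A, y, x):
    """Closed-form gradient 2 A^T A x - 2 A^T y."""
    return 2 * A.T @ (A @ x) - 2 * A.T @ y

def numerical_grad(A, y, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h  # perturb only the j-th coordinate
        g[j] = (f(A, y, x + e) - f(A, y, x - e)) / (2 * h)
    return g

# Random test problem with N = 6 rows and d = 3 columns (illustrative sizes).
rng = np.random.default_rng(0)
A_chk = rng.normal(size=(6, 3))
y_chk = rng.normal(size=6)
x_chk = rng.normal(size=3)

# Largest componentwise gap between the two gradients; expected to be tiny.
grad_gap = np.max(np.abs(grad_f(A_chk, y_chk, x_chk) - numerical_grad(A_chk, y_chk, x_chk)))
```

Because $f$ is quadratic, the central difference agrees with the closed form up to rounding error, so `grad_gap` should be close to zero.
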

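Setting the gradient to zero, the critical points of $f$ are the solutions of the normal equations $A^{T}A\vec{x} = A^{T}y$. Since $f$ is convex, every critical point is a global minimizer. We now determine which conditions $A$ must satisfy for the problem to have a unique solution: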

* If the columns of the matrix $A$ are linearly independent, then $A$ has full column rank, that is, $\operatorname{rank}(A) = d$, and thus the $d \times d$ matrix $A^TA$ also has full rank. Hence $(A^TA)^{-1}$ exists, which implies that $\vec{x} = (A^TA)^{-1}A^T\vec{y} = A^+\vec{y}$, where $A^+$ is known as the Moore-Penrose pseudoinverse of $A$.

* In the opposite case, $A^TA$ is a singular $d \times d$ matrix. Consider its $QR$ factorization: $R$ is upper triangular but not invertible, since it does not have all of its pivots. The system is still consistent, because $A^T\vec{y}$ always lies in the column space of $A^TA$, so it has at least one free variable and therefore infinitely many solutions. Since the minimizer is not unique, we discard this case.
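
The two cases above can be illustrated with a short NumPy sketch. In the full-rank case it solves the normal equations directly and compares the result with `np.linalg.pinv`; in the rank-deficient case it duplicates a column (an artificial construction chosen for this example) and shows two different parameter vectors with the same residual norm. The function name `solve_normal_equations` and all data are assumptions for illustration.

```python
import numpy as np

def solve_normal_equations(A, y):
    """Solve A^T A x = A^T y; assumes the columns of A are linearly independent."""
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(1)

# Full column rank: a column of ones plus two random feature columns.
A_full = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])
y_full = rng.normal(size=8)
x_ne = solve_normal_equations(A_full, y_full)
x_pinv = np.linalg.pinv(A_full) @ y_full  # same solution via the pseudoinverse

# Rank-deficient: append a copy of column 1, so the columns are dependent.
A_def = np.hstack([A_full, A_full[:, [1]]])
rank_def = np.linalg.matrix_rank(A_def)  # smaller than the number of columns

# lstsq returns one minimizer (the minimum-norm one) even when A^T A is singular.
x_min, *_ = np.linalg.lstsq(A_def, y_full, rcond=None)

# Moving along a null-space direction of A_def gives another minimizer.
null_dir = np.array([0.0, 1.0, 0.0, -1.0])  # A_def @ null_dir is the zero vector
x_other = x_min + 2.5 * null_dir
res_min = np.linalg.norm(A_def @ x_min - y_full)
res_other = np.linalg.norm(A_def @ x_other - y_full)
```

Here `x_ne` and `x_pinv` coincide, while `x_min` and `x_other` differ yet attain the same residual, which is the non-uniqueness described in the second case. In practice, forming $A^TA$ explicitly can be numerically less stable than calling `np.linalg.lstsq` directly.
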

Hence, as a first condition, $A$ must have linearly independent columns, which directly implies that $N \geq d$. This means that the number of data points must be at least $d$, the number of features plus one for the intercept; otherwise, there would be $d > N$ linearly independent vectors in $\mathbb{R}^N$, which is impossible.
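
A quick numerical illustration of this condition, as a sketch on random data with illustrative sizes: when there are fewer rows than columns, the rank of $A$ is bounded by the number of rows, so the columns cannot be linearly independent and $A^TA$ is singular.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 3, 5  # fewer data points than parameters (illustrative sizes)

# Design matrix with a column of ones and d - 1 random feature columns.
A_wide = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])

rank_wide = np.linalg.matrix_rank(A_wide)             # cannot exceed N
gram_rank = np.linalg.matrix_rank(A_wide.T @ A_wide)  # rank of the d x d matrix A^T A
```

Both ranks are at most `N`, hence strictly below `d`, so the normal equations cannot have a unique solution in this setting.
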

On the other hand, the values of the $y_i$ do not affect the solvability of the problem, so we can label the data with arbitrary real numbers and still solve the least squares minimization problem.

