# Real Estate estimator

In this challenge, we will try to determine whether a linear relationship exists between the **price** of a flat and a few common criteria such as its surface.

⚠️ Pandas is forbidden in this challenge. The [NumPy documentation](https://docs.scipy.org/doc/numpy/reference/) will be your friend throughout this exercise. You can also find help in this [NumPy cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf).

In [1]:
# Load the NumPy library
import numpy as np

We have collected data for the 4 flats below. Their `surface` (square feet), number of `bedrooms` and number of `floors` are the 3 **features** of our problem, and the `price` (in thousands of $) is our **target**:

|flats |surface|bedrooms|floors|price|
|------|-------------|--------|------|------------|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

A first approach is to try to find a **linear** relation between the `price` and the 3 features, by solving this system of equations:

$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3
\end{cases}$$

This system can be written as a matrix equation:

$$Y = X\theta$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$$

where $Y$ is the vector of `Price`, $X$ is the matrix of features and $\theta$ (theta) is the vector of coefficients to be found.

If $\theta$ is found, the price of any new flat could be estimated using $$Y_{flat5} = X_{flat5}\theta$$

## 1. Define the matrix `X` of features

❓ Create a (4,3) `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. Double-check its `shape`, `size` and `ndim`.

In [3]:
features = [[620,1,1], [3280,4,2], [1900,2,2], [1320,3,3]]
X = np.array(features)
X

array([[ 620,    1,    1],
       [3280,    4,    2],
       [1900,    2,    2],
       [1320,    3,    3]])

In [4]:
print(X.shape)
print(X.ndim)
print(X.dtype)

(4, 3)
2
int64
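
The cell above prints `shape`, `ndim` and `dtype`. As a small complement, here is a sketch that also prints `size`, the third attribute mentioned in the question. The name `X_features` is used here only for illustration.

```python
import numpy as np

# Same (4, 3) feature matrix as above: surface, bedrooms, floors
X_features = np.array([[620, 1, 1], [3280, 4, 2], [1900, 2, 2], [1320, 3, 3]])

print(X_features.shape)  # (rows, columns)
print(X_features.size)   # total number of elements: rows * columns
print(X_features.ndim)   # number of axes
```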


❓ Add a "constant" vector of 1's to create the (4,4) matrix `X` representing the linear system of equations

🤔 As you probably noticed, the linear system of equations includes a $\theta_0$ coefficient that appears in all 4 equations. We need an additional feature, constant and equal to 1, to represent the y-intercept of the linear regression line. (Strictly speaking, this is an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a linear relation between the `price` and the features. More on that next week.)

In [5]:
# Define x0 as a (4,1) vector filled with 1 with the fastest NumPy method
x0 = np.ones((4,1))
x0

array([[1.],
       [1.],
       [1.],
       [1.]])

In [6]:
# Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 
# to your previous (4,3) matrix
X = np.hstack((x0,X))
X

array([[1.00e+00, 6.20e+02, 1.00e+00, 1.00e+00],
       [1.00e+00, 3.28e+03, 4.00e+00, 2.00e+00],
       [1.00e+00, 1.90e+03, 2.00e+00, 2.00e+00],
       [1.00e+00, 1.32e+03, 3.00e+00, 3.00e+00]])
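
The scientific notation in the output above comes from a type change: `np.ones` returns floats by default, so stacking it with the integer features upcasts the whole matrix to `float64`. The sketch below illustrates this, with `X_features`, `X_float` and `X_int` as illustrative names. Either variant works for the rest of the challenge.

```python
import numpy as np

X_features = np.array([[620, 1, 1], [3280, 4, 2], [1900, 2, 2], [1320, 3, 3]])

# np.ones returns floats by default, so stacking upcasts the integer features to float64
X_float = np.hstack((np.ones((4, 1)), X_features))

# Passing dtype=int keeps the stacked matrix as integers
X_int = np.hstack((np.ones((4, 1), dtype=int), X_features))

print(X_float.dtype, X_int.dtype)
print(np.array_equal(X_float, X_int))  # same values, different dtype
```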

## 2. Define the vector `Y` of `Price`s

$Y = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

To match our matrix representation $Y = X\theta$, what should the shape of $Y$ be? Define $Y$ below.

<details>
    <summary>Hint</summary>

Y should be a (4,1) array: a column vector, i.e. the 4 prices stacked vertically
</details>

In [7]:
# Define Y here
prices = [[244], [671], [504], [510]]
y = np.array(prices)
print(y.shape)
y

(4, 1)


array([[244],
       [671],
       [504],
       [510]])
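
If the prices start as a flat list, one possible way to get the (4,1) column is `reshape`. This is only a sketch with illustrative names (`prices_flat`, `y_col`). The nested list used above gives the same result.

```python
import numpy as np

prices_flat = np.array([244, 671, 504, 510])  # shape (4,): a 1-D array
y_col = prices_flat.reshape(-1, 1)            # shape (4, 1): -1 lets NumPy infer the row count

print(prices_flat.shape, y_col.shape)
```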

## 3. Find the solution of the system

Now is the time to find the vector of coefficients $\theta$ !

The solution of the equation is:
 
$$ X\theta = Y \\
\Leftrightarrow X^{-1}X\theta = X^{-1}Y \\
\Leftrightarrow \theta = X^{-1}Y$$

where $X^{-1}$ is the inverse of $X$.

In [8]:
# Compute the inverse of the matrix X with the right NumPy method
Xinv = np.linalg.inv(X)
Xinv

array([[ 1.64516129e+00,  4.42419702e-17, -2.90322581e-01,
        -3.54838710e-01],
       [-5.37634409e-04, -2.50426246e-19,  1.07526882e-03,
        -5.37634409e-04],
       [ 3.70967742e-01,  5.00000000e-01, -1.24193548e+00,
         3.70967742e-01],
       [-6.82795699e-01, -5.00000000e-01,  8.65591398e-01,
         3.17204301e-01]])

You can check if the inversion worked by testing:

$$X^{-1}X = I_4$$
where $I_4$ is the 4 by 4 identity matrix.

In [9]:
# Define I4 using the right NumPy method
I4 = np.eye(4)
I4

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Now compute $X^{-1}X$:

In [11]:
# Your code
XinvX = np.dot(Xinv,X)
XinvX

array([[ 1.00000000e+00, -5.10702591e-14, -5.55111512e-17,
        -2.77555756e-16],
       [-3.25260652e-19,  1.00000000e+00, -8.67361738e-19,
        -4.33680869e-19],
       [ 3.33066907e-16,  5.07149878e-13,  1.00000000e+00,
         3.33066907e-16],
       [-2.77555756e-16, -6.10622664e-13, -7.21644966e-16,
         1.00000000e+00]])

Does it look like $I_4$?

⛔️ If it doesn't, you probably used the `*` operator, which multiplies the two arrays element by element. Here we want the matrix product, so look for the right NumPy function to compute it.

✅ If it does, you noticed that you do not get exactly zeros and ones in the resulting product, because of floating-point rounding. To be sure, you can use the [`numpy.allclose()`](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html?highlight=allclose#numpy.allclose) function to check your result:

In [12]:
# Your code
np.allclose(XinvX, I4)

True
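
To make the difference between `*` and the matrix product concrete, here is a minimal sketch on a small 2 by 2 matrix `A`, chosen only for illustration and not part of the flat data.

```python
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 1.0]])
A_inv = np.linalg.inv(A)

elementwise = A_inv * A   # multiplies entries position by position
matrix_prod = A_inv @ A   # matrix product, same as np.dot(A_inv, A) for 2-D arrays

print(np.allclose(elementwise, np.eye(2)))  # False
print(np.allclose(matrix_prod, np.eye(2)))  # True
```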

You are finally able to find $\theta = X^{-1}Y$:

In [14]:
# Compute theta
theta = np.dot(Xinv, y)
theta

array([[ 74.12903226],
       [  0.13655914],
       [-10.72580645],
       [ 95.93010753]])

In [20]:
# Using linalg.solve
np.linalg.solve(X, y)

array([[ 74.12903226],
       [  0.13655914],
       [-10.72580645],
       [ 95.93010753]])
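
Both approaches give the same $\theta$ here. The sketch below rebuilds `X` and `y` so that it runs on its own, compares the two results, and checks that $X\theta$ reproduces the 4 observed prices. `np.linalg.solve` does not need the explicit inverse, which is why it is often preferred in practice.

```python
import numpy as np

X = np.array([[1, 620, 1, 1], [1, 3280, 4, 2], [1, 1900, 2, 2], [1, 1320, 3, 3]], dtype=float)
y = np.array([[244], [671], [504], [510]], dtype=float)

theta_inv = np.linalg.inv(X) @ y      # explicit inverse, as in the derivation
theta_solve = np.linalg.solve(X, y)   # solves X @ theta = y without forming the inverse

print(np.allclose(theta_inv, theta_solve))
print(np.allclose(X @ theta_solve, y))  # theta reproduces the 4 observed prices
```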

## 4. Estimation of a new flat price

Now that you have solved the system and found $\theta$, you can estimate the `Price` (in thousands of $) of a 5th flat with these characteristics:

- `Surface`: 3000 $ft^2$
- `Bedrooms`: 5 
- `Floors`: 1

with the following formula:

$$Y_{flat5} = X_{flat5}\theta$$

In [15]:
# Define X5
flat_5 = [[1,3000,5,1]]
X5 = np.array(flat_5)

# Compute Y5
y5 = np.dot(X5, theta)

# You should find about 526 (in thousands of $), i.e. a Price of 526,000 $
y5

array([[526.10752688]])
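
Written out term by term with the coefficients found above (rounded to a few decimals), the estimate is:

$$\begin{aligned}
Y_{flat5} &= \theta_0 + 3000\theta_1 + 5\theta_2 + 1\theta_3 \\
&\approx 74.129 + 3000 \times 0.136559 - 5 \times 10.7258 + 95.930 \\
&\approx 74.129 + 409.677 - 53.629 + 95.930 \\
&\approx 526.107
\end{aligned}$$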

## 5. Reality check

In reality, the price of a flat is never entirely determined by its surface, number of bedrooms and number of floors.

Let's imagine that we measure the real price $Y_{flat5}$ at $700,000$ instead of $526,000$ as predicted. Could we take this new information into account to improve our model?

Update the linear system of equations $X\theta = Y$ to reflect this newly measured data point.

In [16]:
# Create the new matrix of features X2 of shape (5,4)
X2 = np.vstack((X, X5))
print(X2.shape)
X2

(5, 4)


array([[1.00e+00, 6.20e+02, 1.00e+00, 1.00e+00],
       [1.00e+00, 3.28e+03, 4.00e+00, 2.00e+00],
       [1.00e+00, 1.90e+03, 2.00e+00, 2.00e+00],
       [1.00e+00, 1.32e+03, 3.00e+00, 3.00e+00],
       [1.00e+00, 3.00e+03, 5.00e+00, 1.00e+00]])

In [17]:
# Create new Y of shape (5,1)
flat_5_price = np.array([[700]])
y2 = np.vstack((y, flat_5_price))
print(y2.shape)
y2

(5, 1)


array([[244],
       [671],
       [504],
       [510],
       [700]])

Let's try to predict the price of a 6th flat from our updated model.  
To do so, try to solve $X\theta = Y$ for $\theta$ using [`numpy.linalg.solve`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html).

What is going on? What can you conclude?

In [18]:
# Your code
np.linalg.solve(X2, y2)

LinAlgError: Last 2 dimensions of the array must be square

<details>
    <summary>👉 Explanations</summary>

$X$ is now a (5,4) matrix. It is not square, therefore it is not invertible: $X^{-1}$ does not exist, and $\theta$ cannot be computed as $X^{-1}Y$. With 5 equations and only 4 unknowns, the system generally has no exact solution.
    
Our initial approach, which consists of finding a closed-form formula that computes an exact flat price as a linear combination of only 3 features, **does not hold** for our 5 observed flats.

Instead, we will learn in the coming weeks to find ways to **approximate** a flat price based on these features.

For instance, instead of solving $Y = X\theta$ exactly, we could find the $\hat{\theta}$ that minimizes the size of the error $e = X\hat{\theta} - Y$ (typically the sum of its squared components). This approach is called a **linear regression model**.

This new estimator can then be used to give an **approximate** estimate of the price of any new flat with $Y_{flat_6} = X_{flat_6} \hat{\theta}$

</details>
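
As a preview of the approximation idea described in the explanations, here is a minimal sketch using `numpy.linalg.lstsq`, which returns the least-squares solution of a non-square system. It is one possible way to compute $\hat{\theta}$ and not necessarily the method used in the coming weeks. `theta_hat` is an illustrative name.

```python
import numpy as np

X2 = np.array([[1, 620, 1, 1], [1, 3280, 4, 2], [1, 1900, 2, 2],
               [1, 1320, 3, 3], [1, 3000, 5, 1]], dtype=float)
y2 = np.array([[244], [671], [504], [510], [700]], dtype=float)

# Least-squares fit: theta_hat minimizes the sum of squared errors ||X2 @ theta - y2||^2
theta_hat, residuals, rank, singular_values = np.linalg.lstsq(X2, y2, rcond=None)

print(theta_hat.shape)      # one coefficient per column of X2
print(X2 @ theta_hat - y2)  # remaining error for each of the 5 flats
```

Unlike the exact solution found with 4 flats, the errors printed here are generally not zero: the fitted coefficients are a compromise across all 5 observations.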

🏁 Congratulations! Don't forget to commit and push your notebook before moving on to the next challenge! 