## Review (2)

Thanks for the update. It looks correct now so I'm accepting your project. Good luck.

---

## Review

Hi Jing. This is Soslan. As always I've added all my comments to new cells with different coloring.

<div class="alert alert-success" role="alert">
  If you did something well, I'll use green for my comment.
</div>

<div class="alert alert-warning" role="alert">
If I want to give you advice or think that something can be improved, then I'll use yellow. This is an optional recommendation.
</div>

<div class="alert alert-danger" role="alert">
  If something needs extra work before I can accept the project, I'll use red.
</div>

I have just one issue, in the theoretical part. There was a mistake in your argument: when you looked at the angle between the two vectors, you forgot to consider their magnitudes, which are also small. I think you should use different reasoning. I left a comment for you. Good luck.

---



# Background Information

An insurance company wants to protect its clients' data. The task is to develop a data-transformation algorithm that makes it hard to recover personal information from the transformed data. This is called data masking. We are also expected to prove that the algorithm works correctly. Additionally, the data should be protected in such a way that the quality of machine learning models doesn't suffer. Follow these steps to develop a new algorithm:
- Construct a theoretical proof using properties of models and the given task;
- Formulate an algorithm for this proof;
- Check that the algorithm is working correctly when applied to real data.

# Project Instruction

1. Download and look into the data.
2. Provide a theoretical proof based on the equation of linear regression. The features are multiplied by an invertible matrix. Show that the quality of the model is the same for both sets of parameters: the original features and the features after multiplication. How are the weight vectors from MSE minimums for these models related?
3. State an algorithm for data transformation to solve the task. Explain why the linear regression quality won't change based on the proof above.
4. Program your algorithm using matrix operations. Make sure that the quality of linear regression from sklearn is the same before and after transformation. Use the R2 metric.

### Import dataset

In [88]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score

data = pd.read_csv('/datasets/insurance_us.csv')
data.head()

Unnamed: 0,Gender,Age,Salary,Family members,Insurance benefits
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


<div class="alert alert-success" role="alert">
Correct start. Data was opened correctly.</div>

### Provide a theoretical proof based on the equation of linear regression. 
The following equations form the basis of our analysis. We use MSE as the loss function for linear regression, so training means finding the weights that minimize it:

    w = arg_w min MSE(Xw, y)
    
The minimum MSE is obtained when the weights are equal to this value:

    w = (X.t * X)^-1 * X.t * y
    
where 
- the transposed feature matrix is multiplied by itself;
- the matrix inverse to the result is calculated;
- the inverse matrix is multiplied by the transposed feature matrix;
- the result is multiplied by the vector of the target feature values.
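
As a minimal sketch of the four steps above, the snippet below applies the formula to synthetic data (an assumption made for illustration; it is not the insurance dataset) and cross-checks the result against NumPy's least-squares solver. The names `gram`, `gram_inv`, and `w_lstsq` are introduced here only for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the insurance dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Normal equation, following the four steps listed above
gram = X.T @ X                    # transposed feature matrix times itself
gram_inv = np.linalg.inv(gram)    # inverse of the result
w = gram_inv @ X.T @ y            # times X.T, then times the target vector

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```

For poorly conditioned feature matrices, `np.linalg.lstsq` is usually the numerically safer choice. The explicit inverse is used here only to mirror the formula.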

### The features are multiplied by an invertible matrix.

In [89]:
# feature matrix

matrix_original = data.drop('Insurance benefits', axis=1).values
matrix_original.shape

(5000, 4)

In [122]:
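# find a random invertible matrix of corresponding size

tol = 1e-12
n_features = matrix_original.shape[1]
while True:
    # A is (n_features x n_rows), so B = A @ A.T is a symmetric n_features x n_features matrix
    A = np.random.rand(n_features, matrix_original.shape[0])
    B = A @ A.T
    # accept B only if B @ inv(B) matches the identity element-wise
    err = np.abs(B @ np.linalg.inv(B) - np.identity(n_features))
    if (err < tol).all():
        break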

# feature matrix multiplied by an invertible matrix

matrix_new = matrix_original @ B
matrix_new.shape

(5000, 4)

<div class="alert alert-success" role="alert">
Correct</div>
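
An alternative way to pick the matrix, sketched below, is to draw a square random matrix directly and accept it only if its condition number is moderate. The helper `random_invertible_matrix` and the threshold `max_cond` are hypothetical names and values chosen for this example, not part of the project code.

```python
import numpy as np

def random_invertible_matrix(n, rng, max_cond=1e6):
    """Hypothetical helper: draw n x n matrices until one is well conditioned."""
    while True:
        P = rng.normal(size=(n, n))
        # A moderate condition number means P is safely far from singular
        if np.linalg.cond(P) < max_cond:
            return P

rng = np.random.default_rng(42)
P = random_invertible_matrix(4, rng)

# Largest element-wise deviation of P @ inv(P) from the identity
residual = np.abs(P @ np.linalg.inv(P) - np.eye(4)).max()
print(residual < 1e-6)
```

A matrix drawn this way is generally not symmetric, unlike `A @ A.T`.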


### Show that the quality of the model is the same for both sets of parameters: the original features and the features after multiplication. 

In [123]:
# determine the set of parameters for the original features

y = data['Insurance benefits'].values
w = np.linalg.inv(matrix_original.T @ matrix_original) @ matrix_original.T @ y
display(w)

# determine the set of parameters for the new features

v = np.linalg.inv(matrix_new.T @ matrix_new) @ matrix_new.T @ y
display(v)

array([-4.43854686e-02,  2.33356224e-02, -1.17739038e-05, -4.55168125e-02])

array([-6.54841463e-05,  9.01471566e-05,  3.56435998e-05, -7.15728779e-05])

In [124]:
# show that the quality of the model is the same before and after

def mse(target, predictions):
    return((target - predictions)**2).mean()

pred_original = matrix_original @ w
display(mse(y, pred_original))

pred_new = matrix_new @ v
display(mse(y, pred_new))

0.1494551172773669

0.1494551172778526

The two MSE values agree to 12 decimal places, so they are the same up to floating-point error.

### How are the weight vectors from MSE minimums for these models related?

In [125]:
# This is what we are comparing against

w

array([-4.43854686e-02,  2.33356224e-02, -1.17739038e-05, -4.55168125e-02])

In [126]:
# We found that w matches B @ v up to numerical error

B@v

array([-4.43852530e-02,  2.33356184e-02, -1.17738878e-05, -4.55167757e-02])

In [127]:
# Next, we express v in terms of w (w @ inv(B) equals inv(B) @ w here because B is symmetric)

display(v)
display(w@np.linalg.inv(B))

array([-6.54841463e-05,  9.01471566e-05,  3.56435998e-05, -7.15728779e-05])

array([-6.54845156e-05,  9.01472933e-05,  3.56437333e-05, -7.15728238e-05])

From the above, v can be expressed in terms of w. We can now move on to step 3: stating an algorithm for data transformation.

### State an algorithm for data transformation to solve the task.

In [128]:
# Note both B and B inverse are symmetric

display(B)
display(np.linalg.inv(B))

array([[1686.06778671, 1255.37879644, 1245.09216   , 1278.73772363],
       [1255.37879644, 1691.51482931, 1252.25039352, 1279.49217018],
       [1245.09216   , 1252.25039352, 1651.48707909, 1260.6693517 ],
       [1278.73772363, 1279.49217018, 1260.6693517 , 1705.35412254]])

array([[ 0.0018012 , -0.00050437, -0.00053569, -0.00057618],
       [-0.00050437,  0.00180493, -0.00055843, -0.0005632 ],
       [-0.00053569, -0.00055843,  0.00185075, -0.00054749],
       [-0.00057618, -0.0005632 , -0.00054749,  0.00184571]])

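- Transform the features as X_new = X*B, where B is a random invertible matrix
- v_min = B^-1 * w_min, where v_min are the weights for the transformed features and w_min the weights for the original features
- MSE(X*B*v_min, y) = MSE(X*B*B^-1*w_min, y) = MSE(X*w_min, y)
- Since both B and B^-1 are symmetric, w_min @ B^-1 and B^-1 @ w_min give the same 1-D result in NumPy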
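
The steps above can be sketched end to end on synthetic data. Everything in the snippet is an assumption for illustration: the data, the helper `fit_normal_equation`, and the key matrix `P`, which is deliberately not symmetric so the check does not rely on symmetry.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the feature matrix and target (assumption)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.3, -1.0, 2.0, 0.7]) + rng.normal(scale=0.5, size=500)

def fit_normal_equation(features, target):
    """Least-squares weights via the normal equation."""
    return np.linalg.inv(features.T @ features) @ features.T @ target

# Step 1: draw a well-conditioned key matrix P (generally not symmetric)
while True:
    P = rng.normal(size=(4, 4))
    if np.linalg.cond(P) < 1e3:
        break

# Step 2: mask the features
X_masked = X @ P

# Step 3: fit on the original and on the masked features
w = fit_normal_equation(X, y)
v = fit_normal_equation(X_masked, y)

# The weights are related by v = P^-1 w, so the predictions coincide
print(np.allclose(v, np.linalg.inv(P) @ w))
print(np.allclose(X @ w, X_masked @ v))
```

Whoever holds `P` can recover the original features with `X_masked @ np.linalg.inv(P)`, so the matrix acts as the key and should be stored separately from the masked data.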

### Explain why the linear regression quality won't change based on the proof above.

In [132]:
# Let's examine the dot product of w and v, which gives insight into how similar the two vectors are

v@w

8.267534130756125e-06

- We noticed that both w and v are non-zero vectors but their dot product is an almost-zero scalar. This suggests that the cosine of the angle (theta) between the two vectors is 0.
- Note that w dot v equals the magnitude of w times the magnitude of v times cosine theta. The only way to get zero is for cosine theta to be 0, because both vectors have non-zero magnitude.
- A result of zero means the vectors are perpendicular to each other.
- So, with two vectors orthogonal to each other, multiplying the feature matrix by one of them spans the entire matrix and "rotates" it to another direction rather than scaling the original matrix. This is why our linear regression quality does not change.

In [147]:
# reviewer's code for calculation of angle cos between two vectors

(v@w)/(np.linalg.norm(v)*np.linalg.norm(w))

0.890165298678149

<div class="alert alert-danger" role="alert">
No, unfortunately this is not the reason. Please look at the sizes of your vectors w and v. They are very small, which is why the dot product is almost zero. You chose $B$ with large values, so the vector v is close to zero. If you use the inverse of B, or some other invertible matrix with smaller values, you will get a different result.</div>

Here it is better to compare two formulas:

$$
w = (X^T X)^{-1} X^T y
$$

and

$$
v = ((XB)^T XB)^{-1} (XB)^T y
$$

and try to find some connections.

Edited:

Thanks for the hint. To explain why the regression quality won't change, let's take one step back and restate the relations we need:
- First, we know that w = arg_w min MSE(Xw, y).
- To explain why the regression quality doesn't change, we need to show that MSE(Xw, y) = MSE(XBv, y).
- From the previous step, we found that v = w * B_inverse = B_inverse * w. The two orderings agree because B is symmetric, so B_inverse is symmetric too, and multiplying a symmetric matrix by a vector gives the same result from either side.
- Substituting this into the MSE, the effect of multiplying the feature matrix X by an invertible B cancels out: X * B * B_inverse * w = X * Identity * w = X * w.
- Therefore MSE(Xw, y) = MSE(XBv, y), which means the quality of the regression won't change.

<div class="alert alert-warning" role="alert">
OK, your reasoning looks correct, but this fact actually holds in the general case, when B isn't symmetric. Try to think about how to set this up. It is a good exercise.</div>
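
One possible way to set up the general case is sketched below. It assumes only that $B$ is square and invertible and that $X^T X$ is invertible (as the formula for $w$ already requires). It uses the identities $(XB)^T = B^T X^T$ and $(PQR)^{-1} = R^{-1} Q^{-1} P^{-1}$ for invertible square matrices.

$$
v = ((XB)^T XB)^{-1} (XB)^T y = (B^T X^T X B)^{-1} B^T X^T y = B^{-1} (X^T X)^{-1} (B^T)^{-1} B^T X^T y = B^{-1} (X^T X)^{-1} X^T y = B^{-1} w
$$

The predictions on the transformed features are then

$$
XBv = X B B^{-1} w = Xw
$$

so MSE(XBv, y) = MSE(Xw, y) without any symmetry assumption on $B$.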

### Program your algorithm using matrix operations. Make sure that the quality of linear regression from sklearn is the same before and after transformation. Use the R2 metric.

In [148]:
# We previously constructed the predictions using matrix operations; now we score them with R2

display(r2_score(y, pred_original))
display(r2_score(y, pred_new))

0.30322655304822976

0.30322655304596546

The two R2 scores agree to 11 decimal places, so they are the same up to floating-point error.
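
The same check can be sketched with `LinearRegression` from sklearn, which is what the task statement asks for. The snippet uses synthetic data and a freshly drawn key matrix `P`; both are assumptions for illustration, not the notebook's `matrix_original` and `B`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

# Synthetic stand-in for features and target (assumption)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.5, 1.0, -0.2, 0.0]) + rng.normal(scale=1.0, size=1000)

# Well-conditioned random key matrix
while True:
    P = rng.normal(size=(4, 4))
    if np.linalg.cond(P) < 1e3:
        break

X_masked = X @ P

# Fit and score on the original and on the masked features
r2_original = r2_score(y, LinearRegression().fit(X, y).predict(X))
r2_masked = r2_score(y, LinearRegression().fit(X_masked, y).predict(X_masked))

print(np.isclose(r2_original, r2_masked))
```

`LinearRegression` fits an intercept by default. The result should still hold, because the columns of `X @ P` together with a constant column span the same space as the columns of `X` with a constant column.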