The Sure Tomorrow insurance company wants to protect its clients' data. Your task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. Prove that the algorithm works correctly.

The data should be protected in such a way that the quality of machine learning models doesn't suffer. You don't need to pick the best model.

## 1. Data downloading

In [5]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [6]:
df = pd.read_csv('/datasets/insurance_us.csv')

In [7]:
df.head()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [9]:
df.columns = [x.lower() for x in df.columns]
df.columns = [x.replace(' ', '_') for x in df.columns]

In [10]:
df.describe()

| | gender | age | salary | family_members | insurance_benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


## 2. Multiplication of matrices

In this task, you can write formulas in *Jupyter Notebook.*

To write a formula inline, wrap it in dollar signs \$; to display it on its own line, wrap it in double dollar signs \$\$. These formulas are written in the *LaTeX* markup language.

For example, we wrote down linear regression formulas. You can copy and edit them to solve the task.

You don't have to use *LaTeX*.

Denote:

- $X$: feature matrix (column zero consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (element zero is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$

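**Answer:**

Assume $P$ is square and invertible. Replacing $X$ with $XP$ in the training formula gives:

$$
w' = ((XP)^T XP)^{-1}(XP)^T y
$$
$$
w' = (P^T X^T X P)^{-1} P^T X^T y
$$

The matrices $P^T$, $X^T X$ and $P$ are all square and invertible, so the inverse of their product is the product of their inverses in reverse order. ($X$ itself is generally not square, so $X^T X$ must stay together.)

$$
w' = P^{-1}(X^T X)^{-1}(P^T)^{-1} P^T X^T y
$$
$$
w' = P^{-1}(X^T X)^{-1} X^T y
$$
$$
w' = P^{-1}w
$$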

**Justification:**
Substituting the transformed features $XP$ into the training formula gives $w' = P^{-1}w$, provided $P$ is square and invertible. The predictions on the transformed data are then $a' = XPw' = XPP^{-1}w = Xw = a$, so multiplying the features by $P$ does not change the predictions, and the quality of the linear regression stays the same.
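The result $w' = P^{-1}w$ can also be checked numerically. The sketch below is not part of the original analysis: it uses small synthetic data, an arbitrary seed, and a hypothetical helper `normal_equation`, and it assumes the randomly drawn $P$ is invertible. Both checks are expected to hold up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Synthetic stand-in for the feature matrix: a column of ones plus 3 features
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
P = rng.normal(size=(4, 4))  # random square matrix, assumed invertible


def normal_equation(features, target):
    """Solve (F^T F) w = F^T y for w (hypothetical helper)."""
    return np.linalg.solve(features.T @ features, features.T @ target)


w = normal_equation(X, y)            # weights on the original features
w_prime = normal_equation(X @ P, y)  # weights on the transformed features

# w' should equal P^{-1} w, and the predictions should coincide
weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_match = np.allclose(X @ w, (X @ P) @ w_prime)
print(weights_match, predictions_match)
```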

## 3. Transformation algorithm

**Algorithm**

$$
a' = X'w'
$$
$$
X' = XP
$$
$$
w' = P^{-1}w
$$
$$
a' = XPP^{-1}w = Xw = a
$$

**Justification**

Because the predictions on the transformed data equal the predictions on the original data, we can run regression on the masked features $XP$ without losing accuracy. In our case, $P$ is a randomly generated square matrix, which must be invertible; we confirm this by computing its inverse. Fixing the random seed makes the masking reproducible.
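One way to package the masking step is sketched below. The helper name `make_invertible_matrix` and its `max_cond` threshold are illustrative assumptions, not part of the original notebook; the three sample rows are the first rows shown by `df.head()`. The sketch also shows that whoever holds $P$ can recover the original features by multiplying by $P^{-1}$.

```python
import numpy as np


def make_invertible_matrix(n, seed, max_cond=1e3):
    """Draw random n x n matrices until one is well-conditioned (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    while True:
        P = rng.normal(size=(n, n))
        # A finite, moderate condition number implies P is safely invertible
        if np.linalg.cond(P) < max_cond:
            return P


# First three rows of the dataset: gender, age, salary, family members
features = np.array([[1.0, 41.0, 49600.0, 1.0],
                     [0.0, 46.0, 38000.0, 1.0],
                     [0.0, 29.0, 21000.0, 0.0]])

P = make_invertible_matrix(features.shape[1], seed=47)
masked = features @ P                   # obfuscated features X' = XP
recovered = masked @ np.linalg.inv(P)   # X' P^{-1} gives back X

restored = np.allclose(recovered, features, atol=1e-6)
print(restored)
```

The condition-number check is a design choice: a matrix that is technically invertible but badly conditioned would amplify rounding errors when the data are recovered.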

## 4. Algorithm test

In [11]:
X = df.drop('insurance_benefits', axis=1)
y = df['insurance_benefits']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [13]:
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [14]:
r2 = r2_score(y_test, predictions)
print('R2 Score:', r2)

R2 Score: 0.42096019377164573


In [15]:
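np.random.seed(47)  # fix the seed so the masking matrix is reproducible
new_matrix = np.random.normal(size=(4, 4))
# np.linalg.inv raises LinAlgError if the matrix is singular
inverse_of_matrix = np.linalg.inv(new_matrix)
# The product should be the identity matrix, up to floating-point error
print(new_matrix @ inverse_of_matrix)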

[[ 1.00000000e+00 -2.27096519e-16 -2.65170542e-17  5.32922142e-18]
 [ 1.03253931e-16  1.00000000e+00 -1.11195754e-16  1.21495686e-16]
 [ 2.92626867e-16 -1.03771921e-16  1.00000000e+00 -1.62305695e-16]
 [ 2.49844119e-16 -7.26113207e-17 -1.61950698e-16  1.00000000e+00]]
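The off-diagonal entries above are on the order of 1e-16, so the product is the identity matrix up to floating-point error. A more compact way to express the same check, sketched here with `np.allclose` and the same seed, is:

```python
import numpy as np

np.random.seed(47)
new_matrix = np.random.normal(size=(4, 4))
inverse_of_matrix = np.linalg.inv(new_matrix)

# Compare the product with the 4 x 4 identity matrix within default tolerances
is_identity = np.allclose(new_matrix @ inverse_of_matrix, np.eye(4))
print(is_identity)
```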


In [17]:
X_new = X.dot(new_matrix)

In [18]:
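# Fit and score on the transformed features, reusing the original train/test split
model.fit(X_new.loc[X_train.index], y_train)
predictions_new = model.predict(X_new.loc[X_test.index])
r2 = r2_score(y_test, predictions_new)
print('R2 Score:', r2)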

R2 Score: 0.42096019377164573


With the features transformed, the model achieves the same R2 score as on the original data, up to floating-point rounding.

This confirms that we can obfuscate the data and still train an equally accurate linear regression model.
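In the test above, `LinearRegression` fits the intercept separately, so $P$ is applied only to the four feature columns and not to a column of ones. The invariance still holds, because the design matrix $[1, XP]$ equals $[1, X]$ multiplied by an invertible block-diagonal matrix with blocks $1$ and $P$. The sketch below illustrates this with NumPy only, on synthetic data with an arbitrary seed; `fit_predict` and `r2` are hypothetical helpers, and the R2 here is computed in-sample rather than on a held-out split.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability only
n_rows, n_features = 200, 4

# Synthetic features and target with a nonzero intercept
X = rng.normal(size=(n_rows, n_features))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 3.0 + rng.normal(scale=0.5, size=n_rows)
P = rng.normal(size=(n_features, n_features))  # assumed invertible


def fit_predict(features, target):
    """Least-squares fit with a separate intercept column; returns predictions."""
    design = np.hstack([np.ones((features.shape[0], 1)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


def r2(target, pred):
    """Coefficient of determination."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_original = r2(y, fit_predict(X, y))
r2_masked = r2(y, fit_predict(X @ P, y))
scores_match = np.isclose(r2_original, r2_masked)
print(scores_match)
```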