<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-description" data-toc-modified-id="Project-description-1">Project description</a></span><ul class="toc-item"><li><span><a href="#Download-and-look-into-data" data-toc-modified-id="Download-and-look-into-data-1.1">Download and look into data</a></span></li><li><span><a href="#Mathematical-proof-of-transformation-algorithm" data-toc-modified-id="Mathematical-proof-of-transformation-algorithm-1.2">Mathematical proof of transformation algorithm</a></span></li><li><span><a href="#Proving-the-transformation-algorithm-with-Linear-Regression" data-toc-modified-id="Proving-the-transformation-algorithm-with-Linear-Regression-1.3">Proving the transformation algorithm with Linear Regression</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-2">Conclusion</a></span></li></ul></div>

# Project description

The Sure Tomorrow insurance company wants to protect its clients' data.<br>

<b>The task</b> is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data. The data should be protected in such a way that the quality of the machine learning model does not suffer.<br>

<b>Data description</b><br>
- <b>Features</b>: insured person's gender, age, salary, and number of family members<br>
- <b>Target</b>: number of insurance benefits received by the insured person over the last five years.<br>

## Download and look into data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('insurance_us.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Gender              5000 non-null   int64  
 1   Age                 5000 non-null   float64
 2   Salary              5000 non-null   float64
 3   Family members      5000 non-null   int64  
 4   Insurance benefits  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [4]:
df.describe()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [5]:
df.head(5)

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [6]:
features = df.drop('Insurance benefits', axis=1)
target = df['Insurance benefits']

In [7]:
X0_train, X0_test, y0_train, y0_test = train_test_split(features, target, test_size=0.25, random_state=123)

## Mathematical proof of transformation algorithm

Predictions:<br>
$$
a = Xw
$$

The task of training linear regression is:<br>
$$
w = \arg\min_w MSE(Xw, y)
$$

The MSE reaches its minimum when the weights are given by the closed-form solution:<br>
$$
w = (X^T X)^{-1} X^T y
$$
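As a quick sanity check of the closed-form formula, the sketch below computes the weights on synthetic data and compares them with NumPy's least-squares solver. The data, the seed, and the names `true_w`, `w_normal`, and `w_lstsq` are assumptions made for illustration and are not part of the project.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
X = rng.normal(size=(100, 3))             # synthetic feature matrix
true_w = np.array([2.0, -1.0, 0.5])       # hypothetical "true" weights
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Closed-form solution: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Both approaches minimize the same MSE, so on well-conditioned data they should agree up to floating-point rounding.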

To prove that the transformation preserves the predictions, we replace the matrix $X$ with the matrix $XZ$ ($X$ multiplied on the right by an invertible square matrix $Z$):

$$
a_{new} = XZw_{new}
$$

$$
w_{new} = ((XZ)^T XZ)^{-1}(XZ)^T y
$$

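Substitute $w_{new}$ into $a_{new}$ and simplify, using $(AB)^T = B^T A^T$ and, for invertible square matrices, $(AB)^{-1} = B^{-1} A^{-1}$ ($E$ denotes the identity matrix):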

$$
\begin{aligned}
a_{new} &= X Z ((XZ)^T XZ)^{-1}(XZ)^T y \\
&= X Z (Z^T X^T X Z)^{-1} Z^T X^T y \\
&= X Z (X^T X Z)^{-1} (Z^T)^{-1} Z^T X^T y \\
&= X Z Z^{-1} (X^T X)^{-1} (Z^T)^{-1} Z^T X^T y \\
&= X E (X^T X)^{-1} E X^T y \\
&= X (X^T X)^{-1} X^T y = X w
\end{aligned}
$$

Therefore, $a_{new} = a$.
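The identity can also be checked numerically. The sketch below is a minimal illustration on synthetic data, without an intercept, exactly as in the derivation. The helper `fit_weights`, the seed, and the way `Z` is built are assumptions: adding `4 * np.eye(4)` to a matrix with entries in [0, 1) makes it strictly diagonally dominant and therefore invertible. The same derivation also implies $w_{new} = Z^{-1} w$, which the last line checks.

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed seed, illustrative only
X = rng.normal(size=(200, 4))             # synthetic features
y = rng.normal(size=200)                  # synthetic target

# Strictly diagonally dominant matrix, hence guaranteed to be invertible
Z = rng.random((4, 4)) + 4 * np.eye(4)

def fit_weights(features, target):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    return np.linalg.solve(features.T @ features, features.T @ target)

w = fit_weights(X, y)                     # weights on the original data
w_new = fit_weights(X @ Z, y)             # weights on the transformed data

a = X @ w                                 # original predictions
a_new = (X @ Z) @ w_new                   # predictions after transformation

same_predictions = np.allclose(a, a_new)
weights_related = np.allclose(w_new, np.linalg.inv(Z) @ w)
print(same_predictions, weights_related)
```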

## Proving the transformation algorithm with Linear Regression

Let's create a 4x4 square matrix "q" (it plays the role of $Z$ in the proof above) whose elements are randomly generated numbers.

In [8]:
q = np.random.rand(4,4) # Obfuscation matrix
q

array([[0.67423488, 0.30692777, 0.77369117, 0.49016026],
       [0.71870266, 0.03408503, 0.79948296, 0.09257557],
       [0.56202021, 0.70224244, 0.4680993 , 0.0050344 ],
       [0.28085794, 0.55871047, 0.63622126, 0.13571152]])

In [9]:
# Checking that the matrix is invertible: np.linalg.inv raises LinAlgError for a singular matrix
np.linalg.inv(q)

array([[ 0.62420171,  0.43057608,  1.78637441, -2.61446499],
       [ 0.0506053 , -1.06712933,  1.05096615,  0.50617976],
       [-0.8521951 ,  1.10182695, -1.57844922,  2.38488562],
       [ 2.4949903 , -1.66323472, -0.62382537, -0.48507011]])
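Calling `np.linalg.inv` only fails for an exactly singular matrix. A matrix that is nearly singular still inverts but can amplify rounding errors. One possible complement, sketched below, is to look at the condition number. The helper `is_safely_invertible` and the threshold `1e8` are hypothetical choices for illustration, not part of the project.

```python
import numpy as np

def is_safely_invertible(matrix, max_condition=1e8):
    """Return True if the matrix is square and not close to singular."""
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    # A large (or infinite) condition number signals a near-singular matrix
    return bool(np.linalg.cond(matrix) < max_condition)

rng = np.random.default_rng(0)            # fixed seed, illustrative only
candidate = rng.random((4, 4))
# Regenerate until the candidate passes the check
while not is_safely_invertible(candidate):
    candidate = rng.random((4, 4))

print(is_safely_invertible(candidate))
```

A random matrix with continuous entries is singular with probability zero, so the loop is expected to exit almost immediately.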

In [10]:
X0_train_q = X0_train@q
X0_train_q.shape

(3750, 4)

In [11]:
X0_test_q = X0_test@q
X0_test_q.shape

(1250, 4)

Multiplying the original feature matrix by the matrix "q" gives the following obfuscated data:

In [12]:
X0_train_q.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 2413 | 23964.858168 | 29917.974933 | 19967.060953 | 218.004086 |
| 1471 | 14016.819513 | 17487.724997 | 11681.067059 | 128.759579 |
| 1196 | 25645.370644 | 32023.0734 | 21364.515832 | 231.790235 |
| 1509 | 19874.81086 | 24791.387083 | 16563.716301 | 182.386065 |
| 4110 | 21041.791336 | 26265.196813 | 17531.672132 | 191.553808 |
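The obfuscated values are hard to interpret without the matrix, but whoever holds the matrix can undo the transformation by multiplying by its inverse. The sketch below illustrates this on the first three rows printed by `df.head(5)`. The matrix `key` is a separately generated, hypothetical stand-in for "q" (made strictly diagonally dominant so that it is certainly invertible), and the names `original`, `masked`, and `recovered` are assumptions.

```python
import numpy as np
import pandas as pd

# First three rows of the features, as printed by df.head(5)
original = pd.DataFrame({
    'Gender': [1, 0, 0],
    'Age': [41.0, 46.0, 29.0],
    'Salary': [49600.0, 38000.0, 21000.0],
    'Family members': [1, 1, 0],
})

rng = np.random.default_rng(1)            # fixed seed, illustrative only
key = rng.random((4, 4)) + 4 * np.eye(4)  # hypothetical obfuscation matrix

masked = original.to_numpy() @ key        # obfuscated features
# Multiplying by the inverse of the key restores the original values
recovered = pd.DataFrame(masked @ np.linalg.inv(key), columns=original.columns)

print(np.allclose(recovered.to_numpy(), original.to_numpy()))
```

This suggests that the protection depends on keeping the matrix secret.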


Now let's check that the transformation works correctly, that is, that masking did not change the quality of a machine learning model trained on this data.

We will define a LinearRegression class that will train the linear regression model and make predictions.

In [13]:
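class LinearRegression:
    def fit(self, train_features, train_target):
        # Prepend a column of ones so that the first weight acts as the intercept
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        # Closed-form solution: w = (X^T X)^{-1} X^T y
        w = np.linalg.inv(X.T@X)@X.T@y
        self.w = w[1:]  # feature weights
        self.w0 = w[0]  # intercept

    def predict(self, test_features):
        # Linear prediction: features times weights plus the intercept
        return test_features.dot(self.w) + self.w0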

In [14]:
# Original data
model_1 = LinearRegression()
# Training model on the train data
model_1.fit(X0_train, y0_train)
# Making predictions on the test set
predictions_1 = model_1.predict(X0_test)
r2_score(y0_test, predictions_1)

0.4301846999093346

In [15]:
# Obfuscated data
model_2 = LinearRegression()
# Training the model on the obfuscated train data
model_2.fit(X0_train_q, y0_train)
# Making predictions on the test set
predictions_2 = model_2.predict(X0_test_q)
r2_score(y0_test, predictions_2)

0.430184710963604

# Conclusion

We can see that the coefficient of determination, R2, is practically the same in both cases: about 0.4301847, with a difference of roughly 1e-8 that is consistent with floating-point rounding. We can therefore conclude that the obfuscation did not change the data in a way that would affect model quality.
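Because the two scores are floating-point numbers, it is safer to compare them with a tolerance than with exact equality. The sketch below uses the two R2 values printed by cells In [14] and In [15]. The relative tolerance `1e-6` is an assumed, illustrative threshold.

```python
import math

r2_original = 0.4301846999093346     # output of cell In [14]
r2_obfuscated = 0.430184710963604    # output of cell In [15]

# Absolute gap between the two scores
difference = abs(r2_original - r2_obfuscated)

# Tolerance-based comparison instead of ==
scores_match = math.isclose(r2_original, r2_obfuscated, rel_tol=1e-6)

print(difference, scores_match)
```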