# Protecting Client Personal Data

We need to protect the data of clients of the insurance company "Even if there is a flood". The task is to develop a data transformation method that makes it difficult to recover personal information from the transformed data, and to justify that the method works correctly.

The transformation must not degrade the quality of machine learning models. Selecting the best model is not required.

## Data loading

In [1]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('/datasets/insurance.csv')

In [3]:
df.head()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
df.describe()

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


The dataset contains 5,000 objects and has no missing values. It has five numeric columns (integer and floating-point).

* Features: gender (`Пол`), age (`Возраст`), and salary (`Зарплата`) of the insured, and number of family members (`Члены семьи`).
* Target feature: number of insurance payments made to the client over the past five years (`Страховые выплаты`).

## Matrix multiplication

Notation:

- $a$ — model predictions

- $X$ — feature matrix (column zero consists of ones)

- $y$ — target feature vector

- $P$ — matrix by which features are multiplied

- $w$ — vector of linear regression weights (element zero equals the shift)

Predictions:

$$
a = Xw
$$

Training objective:

$$
w = \arg\min_w MSE(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
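
The training formula can be checked numerically. The sketch below is illustrative only: the synthetic data and the `fit_normal_equation` helper are assumptions, not part of the notebook. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return w minimizing MSE(Xw, y) by solving (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n_objects, n_features = 200, 3

# Column zero of X consists of ones, so w[0] is the shift (intercept).
X = np.hstack([np.ones((n_objects, 1)),
               rng.normal(size=(n_objects, n_features))])
true_w = np.array([0.5, 1.0, -2.0, 3.0])  # hypothetical "true" weights
y = X @ true_w + rng.normal(scale=0.1, size=n_objects)

w = fit_normal_equation(X, y)
a = X @ w  # model predictions

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))
```

Solving the linear system is generally preferred over computing $(X^T X)^{-1}$ directly, since it tends to be more accurate numerically.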

### Task statement

Features are multiplied by an invertible matrix. Will the quality of the linear regression change?

1. It will change. In this case, please provide examples of matrices.
2. It will not change. In this case, it is necessary to indicate how the linear regression parameters in the original problem and in the transformed one are related.

### Task description

We need to show that $a = a'$, where

$a = X w$

$a' = X P w'$

The weight vector for the transformed features is computed as:

$$
w' = ((XP)^T XP)^{-1} (XP)^T y
$$


### Task solution

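Substitute the training formula into the prediction formula for the original features:

$a = X w$

$a = X (X^T X)^{-1} X^T y$

Do the same for the transformed features, using $(XP)^T = P^T X^T$ and $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$ (valid because $P^T$, $X^T X$, and $P$ are square and invertible):

$a' = X P w'$

$a' = X P ((XP)^T XP)^{-1} (XP)^T y$

$a' = X P (P^T (X^T X) P)^{-1} P^T X^T y$

$a' = X P P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y$

$a' = X E (X^T X)^{-1} E X^T y$

$a' = X (X^T X)^{-1} X^T y$

Therefore $a = a'$, since multiplying by the identity matrix $E$ leaves a matrix unchanged. The quality of the linear regression does not change, and the weights are related by $w' = P^{-1} w$, because $w' = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} (X^T X)^{-1} X^T y$.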
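
The equality $a = a'$, and the relation $w' = P^{-1} w$ that follows from the same derivation, can also be checked numerically. The sketch below uses synthetic data and an arbitrary invertible matrix `P` chosen for illustration (its determinant is 17); neither comes from the notebook, and the `fit` helper is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic feature matrix with a leading column of ones (illustrative data).
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = rng.normal(size=100)

# Arbitrary invertible matrix (det = 17), assumed for this example only.
P = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 3.0, 0.0],
              [1.0, 0.0, 1.0, 1.0],
              [0.0, 2.0, 0.0, 1.0]])

def fit(X, y):
    """Least-squares weights from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

w = fit(X, y)            # weights on the original features
w_prime = fit(X @ P, y)  # weights on the transformed features

a = X @ w
a_prime = (X @ P) @ w_prime

# Predictions coincide, and w' equals P^{-1} w up to rounding.
print(np.allclose(a, a_prime))
print(np.allclose(w_prime, np.linalg.solve(P, w)))
```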

## Conversion algorithm

**Algorithm**

1. The feature matrix $X$ can have any number of rows. The matrix $P$ must be square, with a side equal to the number of features (columns of $X$), and invertible.
2. The `DataEncryption` class is used for encryption.
3. Instantiate the `DataEncryption` class and pass it the training and test feature matrices.
4. Call the `encrypt_data()` method of the instance.
    1. The method generates a random matrix $P$ by which the features will be multiplied. $P$ is sized so that the product $XP$ is defined and has the same shape as $X$.
    2. The matrix $P$ is checked for invertibility. If the matrix is not invertible, the generation function is called again with a new random seed.
5. Obtain the encrypted feature matrices.
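
The steps above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the notebook's implementation: `generate_invertible_matrix`, `max_cond`, and `max_tries` are hypothetical names, the data is synthetic, and a condition-number check stands in for the invertibility test. The last lines show that whoever holds $P$ can recover the original features.

```python
import numpy as np

def generate_invertible_matrix(n_features, rng, max_cond=1e6, max_tries=100):
    """Draw random square matrices until one is safely invertible."""
    for _ in range(max_tries):
        P = rng.normal(size=(n_features, n_features))
        # A moderate condition number means P is invertible and that
        # inverting it will not amplify rounding errors too much.
        if np.linalg.cond(P) < max_cond:
            return P
    raise RuntimeError("could not generate a well-conditioned matrix")

rng = np.random.default_rng(7)
X = rng.uniform(0, 100, size=(5, 4))  # stand-in for a feature matrix

P = generate_invertible_matrix(X.shape[1], rng)
X_encrypted = X @ P  # same shape as X, values no longer readable

# Decryption: multiply by the inverse of P.
X_recovered = X_encrypted @ np.linalg.inv(P)
print(np.allclose(X_recovered, X))
```

A loop with a retry limit avoids unbounded recursion, and the condition-number threshold also rejects matrices that are technically invertible but numerically close to singular.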

**Justification**

The derivation in the "Task solution" section shows that multiplying the feature matrix by an invertible matrix $P$ does not change the model predictions $a$, so the quality of the linear regression is preserved.

## Verification of the algorithm

### Feature extraction

In [6]:
# Separate the target column from the features used for training.
# The column names in the dataset are in Russian (see df.info() above).
features = df.drop('Страховые выплаты', axis=1)
target = df['Страховые выплаты']

In [7]:
X = features.values
X

array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
       [0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
       [0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
       ...,
       [0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
       [1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
       [1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])

In [8]:
y = target.values
y

array([0, 1, 0, ..., 0, 0, 0])

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123456)

### Implementation of the algorithm

In [10]:
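# Class that encrypts features by multiplying them by a random invertible matrix
class DataEncryption:
    def __init__(self, X_train, X_test):
        self.X_train = X_train
        self.X_test = X_test

    def generate_random_normal(self):
        # Reseed so that a retry produces a different matrix
        np.random.seed(np.random.randint(1, 100000))

        # Square matrix whose side equals the number of features,
        # drawn from a normal distribution with a random mean and std
        n_features = self.X_train.shape[1]
        self.random_matrix = np.random.normal(
            np.random.randint(1, 10),
            np.random.randint(1, 5),
            size=(n_features, n_features)
        ).round(1)

        # np.linalg.inv raises LinAlgError for a singular matrix;
        # in that case generate a new matrix
        try:
            np.linalg.inv(self.random_matrix)
        except np.linalg.LinAlgError:
            self.generate_random_normal()

    def encrypt_data(self):
        # The same matrix must be applied to the train and test sets
        self.generate_random_normal()
        self.X_encrypted_train = self.X_train @ self.random_matrix
        self.X_encrypted_test = self.X_test @ self.random_matrix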

### Testing the algorithm's operation

In [11]:
# Train the model using the original features and check the R2 metric
model = LinearRegression()

model.fit(X_train, y_train)
predictions = model.predict(X_test)

r2_score(y_test, predictions)

0.4192116037042798

In [12]:
# Train the model on the transformed features and check the R2 metric
model = LinearRegression()

encrypter = DataEncryption(X_train, X_test)
encrypter.encrypt_data()

model.fit(encrypter.X_encrypted_train, y_train)
predictions = model.predict(encrypter.X_encrypted_test)

r2_score(y_test, predictions)

0.41921160370362653
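
The derivation assumes a column of ones in $X$, whereas the cells above pass the raw features to `LinearRegression`, which fits the intercept separately. The sketch below suggests the result carries over to that setting: fitting the intercept amounts to centering the features, and centered $XP$ equals centered $X$ times $P$. The synthetic data and the upper-triangular matrix `P` (invertible because its diagonal has no zeros) are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic features without a column of ones, plus a noisy linear target.
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 4.0 + rng.normal(scale=0.1, size=300)

# Upper-triangular matrix with a nonzero diagonal, hence invertible (det = 30).
P = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0, 0.0],
              [0.0, 0.0, 2.0, 4.0],
              [0.0, 0.0, 0.0, 5.0]])

original = LinearRegression().fit(X, y)
transformed = LinearRegression().fit(X @ P, y)

# Predictions and intercept are unchanged; coefficients become P^{-1} w.
same_predictions = np.allclose(original.predict(X), transformed.predict(X @ P))
same_intercept = np.isclose(original.intercept_, transformed.intercept_)
weights_related = np.allclose(transformed.coef_,
                              np.linalg.solve(P, original.coef_))
print(same_predictions, same_intercept, weights_related)
```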

**Conclusion:**

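Multiplying the features by an invertible matrix masks the original feature values and does not affect model quality: the R2 values on the original and transformed features (0.4192116037042798 and 0.41921160370362653) differ only at the 12th decimal place, which is consistent with floating-point rounding.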