# Protection of personal data of clients

We need to protect the personal data of clients of the insurance company "Though the flood". The task is to develop a data transformation that makes it difficult to recover personal information from the transformed data.

The transformation must not degrade the quality of machine learning models.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import RandomState
from scipy import stats as st
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

## Data exploration

In [2]:
try:
    df = pd.read_csv('insurance.csv')
except FileNotFoundError:
    # Fall back to the alternative dataset location
    df = pd.read_csv('/datasets/insurance.csv')

In [3]:
df.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
0,1,41.0,49600.0,1,0
1,0,46.0,38000.0,1,1
2,0,29.0,21000.0,0,0
3,0,21.0,41700.0,2,0
4,1,28.0,26100.0,0,0


Let's examine the information in the dataset.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Пол                5000 non-null   int64  
 1   Возраст            5000 non-null   float64
 2   Зарплата           5000 non-null   float64
 3   Члены семьи        5000 non-null   int64  
 4   Страховые выплаты  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


In [5]:
df.describe()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.499,30.9528,39916.36,1.1942,0.148
std,0.500049,8.440807,9900.083569,1.091387,0.463183
min,0.0,18.0,5300.0,0.0,0.0
25%,0.0,24.0,33300.0,0.0,0.0
50%,0.0,30.0,40200.0,1.0,0.0
75%,1.0,37.0,46600.0,2.0,0.0
max,1.0,65.0,79000.0,6.0,5.0


The output above shows that the dataset has no missing values.

Let's take a closer look at the insurance payments column. The maximum number of insurance payments per client is 5. Let's list the fully duplicated rows and then check the distribution of the number of payments.

In [9]:
df[df.duplicated()]

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи,Страховые выплаты
281,1,39.0,48100.0,1,0
488,1,24.0,32900.0,1,0
513,0,31.0,37400.0,2,0
718,1,22.0,32600.0,1,0
785,0,20.0,35800.0,0,0
...,...,...,...,...,...
4793,1,24.0,37800.0,0,0
4902,1,35.0,38700.0,1,0
4935,1,19.0,32700.0,0,0
4945,1,21.0,45800.0,0,0


In [6]:
df['Страховые выплаты'].unique()

array([0, 1, 2, 3, 5, 4])

In [7]:
df['Страховые выплаты'].value_counts()

0    4436
1     423
2     115
3      18
4       7
5       1
Name: Страховые выплаты, dtype: int64

In [9]:
# Share of clients with at least one insurance payment
ratio = df.loc[df['Страховые выплаты'] != 0].count() / df.count()
ratio

Пол                  0.1128
Возраст              0.1128
Зарплата             0.1128
Члены семьи          0.1128
Страховые выплаты    0.1128
dtype: float64

**Conclusion.**

1) The data contains no missing values and is ready for modeling.

2) About 11.3% of clients received at least one insurance payment.

## Matrix multiplication (Theory. Proof of the correct operation of the encryption algorithm).

Designations:

- $X$: feature matrix (the zero column consists of ones)

- $y$: target vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zero element is the intercept)

Predictions:

$$
a = Xw
$$

The task of training:

$$
w = \arg\min_w MSE(Xw, y)
$$

Learning formula:

$$
w = (X^T X)^{-1} X^T y
$$
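As a quick illustration of the learning formula, the sketch below computes $w$ on a small synthetic dataset and compares it with NumPy's least-squares solver. The data, the random generator seed, and the variable names are assumptions made for illustration only; they are not the insurance dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the insurance dataset)
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
X = np.hstack([np.ones((100, 1)), features])   # zero column consists of ones
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Learning formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, w_ref))
```

The explicit inverse mirrors the formula; for ill-conditioned data a least-squares solver such as `np.linalg.lstsq` is usually the more stable choice.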

According to the task, we need to encrypt the source data while preserving the quality of the model.

Simply multiplying the feature matrix by a number $k$ does not make sense, since a third party could easily guess how to decrypt the data.
Instead, we can encrypt the data by multiplying the original matrix by a random invertible matrix, which acts as a key. Only a limited number of users have the key.

Let's make the following assumption.

**Assumption:** Let $X$ be the original feature matrix of size (m,n). The target is predicted by the formula:

$$
a = X w
$$

Take a random invertible matrix $P$ of size (k,k) with $k = n$.

Then the predictions computed from the original and from the encrypted feature matrix are equal:
$$
a = a_1
$$

where $a_1 = X_1 w_1$ and $X_1 = XP$.


**Justification:** We show that the predictions $a_1$ computed from the transformed matrix $X_1 = XP$ coincide with the predictions $a$ computed from the original matrix $X$, assuming that $X^T X$ is invertible.

Substituting the learning formula into both predictions gives:

$$
a = X w = X (X^T X)^{-1} X^T y
$$

$$
a_1 = X_1 w_1 = X_1 (X_1^T X_1)^{-1} X_1^T y
$$

Given the conditions of the problem, we substitute $X_1 = XP$:

$$
a_1 = X P ((X P)^T X P)^{-1} (X P)^T y
$$

Since $(XP)^T = P^T X^T$:

$$
a_1 = X P (P^T X^T X P)^{-1} P^T X^T y
$$

The matrices $P^T$, $X^T X$ and $P$ are square and invertible, so $(P^T (X^T X) P)^{-1} = P^{-1} (X^T X)^{-1} (P^T)^{-1}$:

$$
a_1 = X P P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y
$$

Since $P P^{-1} = E$ and $(P^T)^{-1} P^T = E$:

$$
a_1 = X E (X^T X)^{-1} E X^T y
$$

Given the property of multiplication by the identity matrix, $A E = E A = A$, we get

$$
a_1 = X (X^T X)^{-1} X^T y = a
$$

The same derivation shows how the weights are related: $w_1 = P^{-1} (X^T X)^{-1} X^T y = P^{-1} w$.

**The assumption is proven**
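The result can also be checked numerically. The sketch below uses synthetic data and a hypothetical helper `fit_weights` (both are assumptions for illustration) to confirm that the predictions coincide and that the weights satisfy $w_1 = P^{-1} w$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))   # synthetic feature matrix (assumption)
y = rng.normal(size=50)        # synthetic target (assumption)
P = rng.normal(size=(4, 4))    # random key; a Gaussian matrix is almost surely invertible

def fit_weights(features, target):
    """Learning formula w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target

w = fit_weights(X, y)
X1 = X @ P
w1 = fit_weights(X1, y)

a = X @ w      # predictions on the original features
a1 = X1 @ w1   # predictions on the transformed features

print(np.allclose(a, a1, atol=1e-6))                       # predictions coincide
print(np.allclose(w1, np.linalg.inv(P) @ w, atol=1e-6))    # w1 = P^{-1} w
```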

## Matrix Multiplication (Project)

For convenience, we will use the notation given in the theory above.

- $X$: feature matrix (the zero column consists of ones)

- $y$: target vector

- $P$: the matrix by which the features are multiplied

- $w$: vector of linear regression weights (the zero element is the intercept).

At the first stage, we select the features that will be used for prediction.

In [10]:
features = df.drop(['Страховые выплаты'], axis = 1)
target = df['Страховые выплаты']

features.shape

(5000, 4)

To encrypt the data, let's multiply the original features by an invertible matrix.

Let's create a random matrix and make sure it is invertible, so that the features can then be multiplied by it.

In [11]:
np.random.seed(4)
P = np.random.normal(3, 2.5, size=(4, 4))

Let's check that the matrix is invertible.

A matrix is invertible if and only if it is nonsingular, that is, its determinant (|P|) is not zero.

In [12]:
np.linalg.det(P)

143.91547866794653
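A nonzero determinant is the textbook criterion, but its magnitude depends on how the entries are scaled, so in floating-point arithmetic it is often complemented by the condition number. The sketch below rebuilds `P` with the same seed and distribution as in the cell above and, as an additional sanity check that is an assumption of this example rather than part of the original notebook, verifies that `P @ inv(P)` is numerically the identity matrix.

```python
import numpy as np

# Rebuild the key matrix exactly as in the notebook cell
np.random.seed(4)
P = np.random.normal(3, 2.5, size=(4, 4))

# A moderate condition number indicates that P is safely invertible
cond = np.linalg.cond(P)
print(cond)

# Direct check: multiplying by the inverse should give the identity matrix
P_inv = np.linalg.inv(P)
print(np.allclose(P @ P_inv, np.eye(4)))
```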

Hence the matrix is invertible. Now we multiply the matrix of original features by the invertible matrix.

In [13]:
X = np.array(features)
Z = X @ P
Z.shape

(5000, 4)

In [14]:
Z

array([[190086.31722309,   6481.55885936, 225572.26968272,
        138082.58633029],
       [145657.71110413,   4949.324121  , 172837.24213868,
        105851.95544556],
       [ 80499.79985182,   2729.59492082,  95520.19993329,
         58511.08296225],
       ...,
       [129905.40054968,   4439.78776184, 154160.08672537,
         94340.79156763],
       [125319.7480295 ,   4288.3752857 , 148707.45731147,
         91022.35671767],
       [155585.28628877,   5312.28186077, 184634.25214955,
        113003.83852479]])

In [15]:
X

array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
       [0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
       [0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
       ...,
       [0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
       [1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
       [1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])

To check the effect of changing features, we will create a separate dataset with Z features.

In [16]:
df_Z = pd.DataFrame(data = Z, columns = ['Пол', 'Возраст', 'Зарплата', 'Члены семьи'])
df_Z.head()

Unnamed: 0,Пол,Возраст,Зарплата,Члены семьи
0,190086.317223,6481.558859,225572.269683,138082.58633
1,145657.711104,4949.324121,172837.242139,105851.955446
2,80499.799852,2729.594921,95520.199933,58511.082962
3,159786.230432,5463.031976,189625.526394,116029.542903
4,100037.159926,3404.479172,118707.367963,72689.487184


Let's create a LinReg_manual class: a linear regression written manually.

First, let's check it on the source data.

In [17]:
features_Z = df_Z

class LinReg_manual:
    def fit(self, train_features, train_target):
        X = np.concatenate((np.ones((train_features.shape[0], 1)), train_features), axis=1)
        y = train_target
        w = ((np.linalg.inv(X.T @ X)) @ X.T) @ y
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        return test_features.dot(self.w) + self.w0
    
model = LinReg_manual()

model.fit(features, target)

predictions = model.predict(features)

r2_score(target, predictions)

0.42494550286668

In [18]:
model.fit(features_Z, target)
predictions_Z = model.predict(features_Z)
r2_score(target, predictions_Z)

0.4249455028658199

**Conclusion**

The quality of linear regression has not changed: the $R^2$ scores agree up to floating-point rounding (0.42494550286668 versus 0.4249455028658199).

The reason is that multiplying the feature matrix by an invertible matrix $P$ replaces each feature with a linear combination of the original features. Linear regression compensates for this by learning correspondingly transformed weights ($w_1 = P^{-1} w$ in the notation of the theory section), so the predictions, and therefore the quality of the model, stay the same.
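The manual model adds the column of ones after the features have been multiplied by $P$. The argument from the theory section still applies, because the matrix $[1, XP]$ equals $[1, X]$ multiplied by an invertible block-diagonal matrix with blocks $1$ and $P$. The sketch below illustrates this on synthetic data; the helper names `fit_predict` and `r2` and the data itself are assumptions for illustration.

```python
import numpy as np

def fit_predict(features, target):
    """Fit least squares with an intercept and return in-sample predictions."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])
    w = np.linalg.inv(X.T @ X) @ X.T @ target
    return X @ w

def r2(target, predictions):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(2)
features = rng.normal(size=(200, 4))   # synthetic data (assumption)
target = features @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=200)
P = rng.normal(size=(4, 4))            # random key matrix

r2_original = r2(target, fit_predict(features, target))
r2_encrypted = r2(target, fit_predict(features @ P, target))

print(r2_original, r2_encrypted)
print(np.isclose(r2_original, r2_encrypted))
```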

## Algorithm

Based on the above, the algorithm for solving the problem is as follows:

1) The input is the original dataset. Let's call it ***df***.

$$
df
$$

2) Separate the target from the features that will be used for prediction.

$$
features
$$

$$
target
$$

3) Create the feature matrix ***X*** for prediction and check its dimensions.

$$
X = np.array(features)
$$

$$
(m, n)
$$

4) Create a square matrix ***P*** of random elements with dimensions (k,k), such that the number of columns of the first matrix ***X*** (m×n) is equal to the number of rows of the second matrix ***P*** (k×k):

$$
k = n
$$

The matrix ***P*** must be invertible (the determinant of ***P*** must not be equal to 0).

5) Multiply the matrices ***X*** and ***P***; their product is the new matrix ***Z***.

$$
Z = X P
$$
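The five steps above can be sketched as a single helper. The function name `encrypt_features`, the use of `np.random.default_rng`, and the condition-number threshold are assumptions made for illustration; the notebook itself uses `np.random.seed(4)` and a determinant check.

```python
import numpy as np

def encrypt_features(features, seed=None):
    """Return (Z, P): the features multiplied by a random invertible key P."""
    X = np.asarray(features, dtype=float)   # step 3: feature matrix of shape (m, n)
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    while True:
        P = rng.normal(size=(n, n))         # step 4: square (k, k) matrix with k = n
        # Reject numerically singular keys (the threshold is an assumption)
        if np.linalg.cond(P) < 1e8:
            break
    Z = X @ P                               # step 5: Z = XP
    return Z, P

# Toy rows taken from the first lines of df.head()
X_demo = np.array([[1, 41.0, 49600.0, 1],
                   [0, 46.0, 38000.0, 1],
                   [0, 29.0, 21000.0, 0]])
Z_demo, P_demo = encrypt_features(X_demo, seed=0)
print(Z_demo.shape, P_demo.shape)
```

Returning `P` together with `Z` keeps the key available for later decryption; in practice the key would be stored separately from the encrypted data.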

**Justification**

The essence of encryption is that only a limited circle of people has access to the source information, so that there are no leaks.

In this case, the invertible matrix is the key: the data is encrypted with $P$ and decrypted with $P^{-1}$.

At the same time, as verified earlier and stated in the conclusion of the section **Matrix Multiplication (Project)**, the quality of the linear regression model that predicts the number of insurance payments does not deteriorate.
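A minimal sketch of the round trip, assuming a key generated with `np.random.default_rng` (an illustrative choice, not the notebook's key): the holder of $P$ restores the original features by multiplying the encrypted matrix by $P^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy rows taken from the first lines of df.head()
X = np.array([[1, 41.0, 49600.0, 1],
              [0, 46.0, 38000.0, 1],
              [0, 29.0, 21000.0, 0]])

P = rng.normal(size=(4, 4))           # key, known only to authorized users
Z = X @ P                             # encryption
X_restored = Z @ np.linalg.inv(P)     # decryption with the inverse key

# Relative reconstruction error; it is not exactly zero because of rounding
rel_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X)
print(rel_error < 1e-9)
```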

## Checking the algorithm

Let's prepare the LinReg_2 model and include the encryption of the original features in it.

In [19]:
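class LinReg_2:
    def fit(self, train_features, train_target):
        # Generate the key matrix P with a fixed seed and keep it for prediction
        np.random.seed(4)
        self.P = np.random.normal(3, 2.5, size=(4, 4))
        X = np.array(train_features)
        Z = X @ self.P  # encrypted features
        # Prepend a column of ones to the encrypted features so that w[0] is the intercept
        Z_1 = np.concatenate((np.ones((Z.shape[0], 1)), Z), axis=1)
        y = np.array(train_target)
        w = ((np.linalg.inv(Z_1.T @ Z_1)) @ Z_1.T) @ y
        self.w = w[1:]
        self.w0 = w[0]

    def predict(self, test_features):
        # Encrypt the test features with the same key before applying the weights
        Z = np.array(test_features) @ self.P
        return Z.dot(self.w) + self.w0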

Let's compare the quality of the models.

In [20]:
model = LinReg_2()

model.fit(features, target)

predictions = model.predict(features)

r2_score(target, predictions)

0.42494550286668

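## Conclusion

The $R^2$ score of the model stays at the same level (about 0.425) when the encryption step is built into the class, so the features can be protected with an invertible key matrix without degrading the quality of linear regression.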