We need to protect the client data of an insurance company by developing a data transformation method. The transformation should make it difficult to recover personal information.

Use linear regression. The R2 score (`r2_score`) must be the same for both variants, with and without the transformation.
You don't need to choose the best model.

## 1. About data

In [1]:
#library import
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [2]:
#check data
insurance = pd.read_csv('insurance.csv')
display(insurance.head())

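|   | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|-----|---------|----------|-------------|-------------------|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |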


In [3]:
#rename columns
insurance.rename(columns={'Пол': 'gender', 
                   'Возраст': 'age',
                   'Зарплата': 'salary',
                   'Члены семьи': 'family_members',
                   'Страховые выплаты': 'insurance_payments'}, inplace=True)

#change type
insurance[['age','salary']] = insurance[['age','salary']].astype('int')

insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gender              5000 non-null   int64
 1   age                 5000 non-null   int32
 2   salary              5000 non-null   int32
 3   family_members      5000 non-null   int64
 4   insurance_payments  5000 non-null   int64
dtypes: int32(2), int64(3)
memory usage: 156.4 KB


In [4]:
#check nulls
insurance.isnull().sum()

gender                0
age                   0
salary                0
family_members        0
insurance_payments    0
dtype: int64

In [5]:
#check duplicates
insurance = insurance.drop_duplicates().reset_index(drop=True)
insurance.duplicated().sum()

0

### Conclusion 

The dataset has 5000 entries and 5 columns:
- gender
- age
- salary
- family_members
- insurance_payments

No nulls.
About 3% of the rows were duplicates, and they have been removed. Also, age and salary were converted to int.
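
The cell above drops the duplicates before counting them, so the share of duplicates is not visible in the output. A minimal sketch of how it could be measured first, using a small made-up frame (`df` and its values are illustrative assumptions, not the insurance data):

```python
import pandas as pd

# Tiny made-up frame standing in for the insurance table (assumption)
df = pd.DataFrame({
    'gender': [1, 0, 0, 1, 0],
    'age': [41, 46, 29, 41, 46],
    'salary': [49600, 38000, 21000, 49600, 38000],
})

# Count rows that repeat an earlier row, before removing them
n_duplicates = df.duplicated().sum()
share = n_duplicates / len(df)

# Drop the repeats and rebuild a clean 0..n-1 index
df = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, share, len(df))
```

Counting before dropping keeps the reported share verifiable from the notebook output itself.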

In [6]:
#features & target
features = insurance.drop(['insurance_payments'], axis=1)
target = insurance['insurance_payments']

#train & test
features_insurance_train, features_insurance_test, target_insurance_train, target_insurance_test = train_test_split(
    features, target, test_size=0.25, random_state=12345)

In [7]:
#LinearRegression on the original features
model = LinearRegression()
model.fit(features_insurance_train, target_insurance_train)
predictions = model.predict(features_insurance_test)
print(r2_score(target_insurance_test, predictions))

0.42307727615837565


In [8]:
#random matrix:
state = np.random.RandomState(12345)
random_matrix = state.normal(size=(4, 4))
random_matrix

array([[-0.20470766,  0.47894334, -0.51943872, -0.5557303 ],
       [ 1.96578057,  1.39340583,  0.09290788,  0.28174615],
       [ 0.76902257,  1.24643474,  1.00718936, -1.29622111],
       [ 0.27499163,  0.22891288,  1.35291684,  0.88642934]])

In [9]:
#check invertibility
inverse_matrix = np.linalg.inv(random_matrix)
random_matrix.dot(inverse_matrix)

array([[ 1.00000000e+00, -5.55111512e-17,  0.00000000e+00,
         1.11022302e-16],
       [ 0.00000000e+00,  1.00000000e+00, -5.55111512e-17,
        -2.22044605e-16],
       [ 0.00000000e+00, -1.38777878e-17,  1.00000000e+00,
        -2.77555756e-17],
       [ 1.11022302e-16,  5.55111512e-17,  0.00000000e+00,
         1.00000000e+00]])
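
Reading off-diagonal values such as `1.1e-16` by eye works for a 4x4 matrix, but the same check can be made programmatic. A minimal sketch, assuming the same seed and shape as above; the names `is_identity` and `cond` are introduced here for illustration:

```python
import numpy as np

# Same seed and shape as in the notebook
state = np.random.RandomState(12345)
random_matrix = state.normal(size=(4, 4))

# np.linalg.inv raises LinAlgError if the matrix is exactly singular
inverse_matrix = np.linalg.inv(random_matrix)

# The product should equal the identity up to floating-point rounding
is_identity = np.allclose(random_matrix @ inverse_matrix, np.eye(4))

# A very large condition number would signal a numerically unstable inverse
cond = np.linalg.cond(random_matrix)

print(is_identity, cond)
```

`np.allclose` tolerates rounding errors of the size seen in the output above, so it is a safer test than comparing with `==`.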

In [10]:
#apply the random matrix to the features
new_features_train = features_insurance_train.dot(random_matrix)
new_features_test = features_insurance_test.dot(random_matrix)

#LinearRegression on the transformed features (r2_score is already imported in In [1])
model = LinearRegression()
model.fit(new_features_train, target_insurance_train)
pred_test = model.predict(new_features_test)

print("Test")
print("R2 =", r2_score(target_insurance_test, pred_test))

Test
R2 = 0.42307727615811896
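
The two scores agree to about 12 decimal places, which is what the algebra predicts. As a sketch, assume $X$ is the training feature matrix with a leading column of ones for the intercept, $y$ is the target, $X^T X$ is invertible, and $P$ is the invertible matrix applied to the features, extended by a 1 in the top-left corner so that the ones column is left untouched. These symbols are introduced here for illustration and do not appear in the notebook.

$$
w = (X^T X)^{-1} X^T y
$$

$$
w' = \big((XP)^T XP\big)^{-1} (XP)^T y = P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y = P^{-1} w
$$

$$
a' = X_{new} P w' = X_{new} P P^{-1} w = X_{new} w = a
$$

For any rows $X_{new}$, the predictions $a'$ on the transformed features equal the predictions $a$ on the original ones, so R2 can differ only by floating-point rounding.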


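#### Conclusion

We multiplied the features by a random invertible matrix, because a random transformation is hard to reverse without knowing the matrix.

The result confirms the approach: R2 is about 0.42 for both models (0.423077 before and after the transformation), so the masking does not change the quality of linear regression.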
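
Both halves of the idea can be checked on synthetic data: the score is preserved, and the original features can be restored only with the matrix. This is a minimal sketch under stated assumptions; `X`, `y` and `P` are made-up stand-ins, not the insurance data or the notebook's matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)

# Synthetic stand-in for four feature columns and a noisy linear target (assumption)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.5, size=200)

# Random key matrix; a Gaussian matrix is almost surely invertible
P = rng.normal(size=(4, 4))
X_masked = X @ P

# Fit one model on the original features and one on the masked features
r2_plain = r2_score(y, LinearRegression().fit(X, y).predict(X))
r2_masked = r2_score(y, LinearRegression().fit(X_masked, y).predict(X_masked))

# Whoever holds P can undo the transformation, so P must be kept secret
X_recovered = X_masked @ np.linalg.inv(P)

print(np.isclose(r2_plain, r2_masked))
print(np.allclose(X_recovered, X))
```

The recovery step is the reason the matrix has to be stored separately from the masked data: the protection is only as strong as the secrecy of `P`.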