# Creating a data transformation algorithm

The Sure Tomorrow insurance company wants to protect its clients' data. The task is to develop a data transformation algorithm that makes it hard to recover personal information from the transformed data.

The data should be protected in such a way that the quality of machine learning models doesn't suffer.

## 1. Loading the data

In [None]:
import pandas as pd
import numpy as np
from numpy.linalg import inv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
df = pd.read_csv('/datasets/insurance_us.csv')

In [None]:
df.head()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| 0 | 1 | 41.0 | 49600.0 | 1 | 0 |
| 1 | 0 | 46.0 | 38000.0 | 1 | 1 |
| 2 | 0 | 29.0 | 21000.0 | 0 | 0 |
| 3 | 0 | 21.0 | 41700.0 | 2 | 0 |
| 4 | 1 | 28.0 | 26100.0 | 0 | 0 |


In [None]:
df.describe()

| | Gender | Age | Salary | Family members | Insurance benefits |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.499 | 30.9528 | 39916.36 | 1.1942 | 0.148 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
Gender                5000 non-null int64
Age                   5000 non-null float64
Salary                5000 non-null float64
Family members        5000 non-null int64
Insurance benefits    5000 non-null int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB


#### Converting the data to the necessary types

In [None]:
df['Age'] = df['Age'].astype('int')

In [None]:
df['Salary'] = df['Salary'].astype('int')

#### Checking for duplicate rows

In [None]:
df.duplicated().sum()

153

In [None]:
df = df.drop_duplicates().reset_index(drop=True)

In [None]:
df.shape

(4847, 5)

#### Setting target and features

In [None]:
target = df['Insurance benefits']

In [None]:
features = df.drop('Insurance benefits', axis=1)

### Conclusion

`/datasets/insurance_us.csv` was loaded and examined for general information.

There are 5 columns and 5000 rows in the file.

The data types of the `Age` and `Salary` columns were converted from float to int, since both hold whole numbers.

153 duplicated rows were found using `.duplicated()`. The `drop_duplicates()` and `reset_index()` methods were used to remove these rows and renumber the index, leaving 4847 rows.

Lastly, the target and features were set in accordance with the project guidelines:
- Features: insured person's gender, age, salary, and number of family members.
- Target: number of insurance benefits received by the insured person over the last five years.
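
The cleaning steps above can be sketched on a tiny, made-up frame. The helper `to_int_if_whole` and the `demo` data are hypothetical and not part of the project; the helper only guards against `astype` silently truncating a value that is not actually whole.

```python
import pandas as pd


def to_int_if_whole(series):
    """Hypothetical helper: cast a float column to int64 only if every value is whole."""
    if not (series % 1 == 0).all():
        raise ValueError(f"column {series.name!r} contains non-whole values")
    return series.astype("int64")


# Made-up rows for illustration; the first and third rows are identical.
demo = pd.DataFrame({
    "Age": [41.0, 46.0, 41.0],
    "Salary": [49600.0, 38000.0, 49600.0],
})

demo["Age"] = to_int_if_whole(demo["Age"])
demo["Salary"] = to_int_if_whole(demo["Salary"])

# Count exact duplicate rows, then drop them and renumber the index.
n_dupes = int(demo.duplicated().sum())
clean = demo.drop_duplicates().reset_index(drop=True)
```

On the real dataset the same calls would be applied to `df`, as in the cells above.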

## 2. Multiplication of matrices

- $X$: feature matrix (the zeroth column consists of ones)

- $y$: target vector

- $P$: matrix by which the features are multiplied

- $w$: linear regression weight vector (the zeroth element is the intercept)

Predictions:

$$
a = Xw
$$

Training objective:

$$
\min_w d_2(Xw, y)
$$

Training formula:

$$
w = (X^T X)^{-1} X^T y
$$
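
As a rough illustration of the training formula, the sketch below evaluates it with NumPy on synthetic data (not the insurance dataset) and compares the result with `np.linalg.lstsq`. The sizes, the seed, and `true_w` are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, assumed for illustration

n_samples, n_features = 200, 4
raw = rng.normal(size=(n_samples, n_features))

# Prepend a column of ones so that w[0] plays the role of the intercept.
X = np.hstack([np.ones((n_samples, 1)), raw])

true_w = np.array([0.5, 1.0, -2.0, 0.3, 0.0])  # made-up weights
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# Training formula: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both approaches should give the same weights up to rounding. In practice a least-squares solver is usually preferred over forming the inverse explicitly, but the explicit form mirrors the formula.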

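**Answer:** Multiplying the features by an invertible matrix $P$ does not change the predictions, so the quality of linear regression stays the same. The new weights are $w' = P^{-1} w$.

Let $X' = XP$, where $P$ is square and invertible, and assume $X^T X$ is invertible. Using $(AB)^T = B^T A^T$ and $(ABC)^{-1} = C^{-1} B^{-1} A^{-1}$:

$$
\begin{aligned}
w' &= ((XP)^T XP)^{-1} (XP)^T y \\
   &= (P^T X^T (XP))^{-1} (XP)^T y \\
   &= (P^T (X^T X) P)^{-1} (XP)^T y \\
   &= P^{-1} (X^T X)^{-1} (P^T)^{-1} (XP)^T y \\
   &= P^{-1} (X^T X)^{-1} (P^T)^{-1} P^T X^T y \\
   &= P^{-1} (X^T X)^{-1} X^T y \\
   &= P^{-1} w
\end{aligned}
$$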

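**Justification:**

The original predictions are $a = Xw$. Substituting $X' = XP$ and $w' = P^{-1} w$ into the predictions of the transformed model:

$$
\begin{aligned}
a' &= X'w' \\
   &= XP P^{-1} w \\
   &= Xw \\
   &= a
\end{aligned}
$$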
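
The result can be checked numerically. The sketch below uses synthetic data and a hypothetical helper `fit` that applies the training formula directly; for brevity it omits the column of ones, which does not affect the identity being checked.

```python
import numpy as np

rng = np.random.default_rng(1)  # synthetic data, assumed for illustration

X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
P = rng.normal(size=(4, 4))  # random square matrix, assumed invertible


def fit(features, target):
    """Hypothetical helper: w = (X^T X)^{-1} X^T y."""
    return np.linalg.inv(features.T @ features) @ features.T @ target


w = fit(X, y)
w_prime = fit(X @ P, y)

a = X @ w                    # predictions on the original features
a_prime = (X @ P) @ w_prime  # predictions on the transformed features

weights_match = np.allclose(w_prime, np.linalg.inv(P) @ w)
predictions_match = np.allclose(a, a_prime)
```

Both flags should be `True` up to floating-point tolerance, matching $w' = P^{-1} w$ and $a' = a$.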

## 3. Transformation algorithm

1. Create a 4 x 4 invertible matrix.
    - The dimensions are 4 x 4 because we are working with 4 features.
2. Confirm invertibility using `numpy.linalg.inv()`.
3. Transform the features by multiplying the features matrix by the invertible matrix.
4. Run two linear regressions, one with the untransformed features and the other with the transformed features.
5. Compare the quality of the two linear regressions using the R2 metric.
    - The qualities should be the same, as shown by the proof in section 2.
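
The steps above can be sketched as a single function. This is only an outline under stated assumptions: `compare_r2` is a hypothetical name, the demo data are synthetic, and the seed is reused for both the matrix and the split purely for convenience.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def compare_r2(features, target, seed=12345):
    """Hypothetical helper: R2 on original vs. transformed features."""
    rng = np.random.default_rng(seed)
    n = features.shape[1]

    # Steps 1-2: random square matrix; inv() raises LinAlgError if it is singular.
    P = rng.normal(size=(n, n))
    np.linalg.inv(P)

    # Step 3: transform the features.
    transformed = features @ P

    # Steps 4-5: same split and same model type for both feature sets.
    scores = []
    for X in (features, transformed):
        X_train, X_test, y_train, y_test = train_test_split(
            X, target, test_size=0.25, random_state=seed
        )
        model = LinearRegression().fit(X_train, y_train)
        scores.append(r2_score(y_test, model.predict(X_test)))
    return tuple(scores)


# Synthetic demo data (not the insurance dataset).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(400, 4))
demo_target = demo_features @ np.array([1.0, -0.5, 2.0, 0.0]) + rng.normal(scale=0.5, size=400)

r2_plain, r2_masked = compare_r2(demo_features, demo_target)
```

The two scores should agree up to floating-point rounding.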

## 4. Algorithm test

#### Create the invertible matrix

In [None]:
np.random.seed(12345)
P = np.random.normal(size=(features.shape[1], features.shape[1]))

#### Confirm invertibility

In [None]:
inv(P)

array([[-1.31136747,  0.3921804 ,  0.18868055, -0.67088287],
       [ 1.75872714,  0.14106138, -0.17773045,  0.79787127],
       [-0.41702659, -0.22854768,  0.3550602 ,  0.33039819],
       [ 0.58912996,  0.19073027, -0.5545481 ,  0.6259302 ]])

`inv(P)` did not raise a `LinAlgError` here, so `P` is not singular and can be used as the transformation matrix.
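
A successful `inv()` call only rules out a matrix that NumPy detects as singular; a nearly singular matrix may still pass while giving a numerically unreliable inverse. A stricter check, sketched below, looks at the condition number. The helper name `is_safely_invertible` and the `1e8` threshold are assumptions chosen for illustration.

```python
import numpy as np


def is_safely_invertible(matrix, max_cond=1e8):
    """Hypothetical helper: square matrix with a moderate condition number."""
    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        return False
    return bool(np.linalg.cond(matrix) < max_cond)


# Same seed and shape as the matrix created above.
np.random.seed(12345)
P = np.random.normal(size=(4, 4))

# A deliberately singular matrix: the second row is twice the first.
singular = np.array([[1.0, 2.0], [2.0, 4.0]])

p_ok = is_safely_invertible(P)
singular_ok = is_safely_invertible(singular)
```

Here `p_ok` should be `True` and `singular_ok` should be `False`.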

#### Transform the features matrix with the invertible matrix

In [None]:
transformed_features = features @ P
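
To illustrate what the transformation does to individual records, the sketch below applies the same `P` to the first three feature rows shown in the `df.head()` output and then recovers them with `inv(P)`. It assumes that only the holder of `P` can undo the transformation in this way.

```python
import numpy as np

# Same seed and shape as the matrix created above.
np.random.seed(12345)
P = np.random.normal(size=(4, 4))

# Gender, Age, Salary, Family members for the first three rows of df.head().
sample = np.array([
    [1, 41, 49600, 1],
    [0, 46, 38000, 1],
    [0, 29, 21000, 0],
], dtype=float)

masked = sample @ P                     # what would be shared or stored
recovered = masked @ np.linalg.inv(P)   # what the holder of P can restore

round_trip_ok = np.allclose(recovered, sample)
```

The masked values are linear mixtures of all four columns, so they no longer read as a gender, an age, or a salary, while `recovered` matches `sample` up to rounding.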

#### Linear regression before transformation

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

In [None]:
print("--- Train Sizes (Rows, Columns) ---")
print("target_train:", target_train.shape)
print("features_train:", features_train.shape)
print("")
print("--- Test Sizes (Rows, Columns) ---")
print("target_test:", target_test.shape)
print("features_test:", features_test.shape)

--- Train Sizes (Rows, Columns) ---
target_train: (3635,)
features_train: (3635, 4)

--- Test Sizes (Rows, Columns) ---
target_test: (1212,)
features_test: (1212, 4)


In [None]:
model = LinearRegression()
model.fit(features_train, target_train)
predictions = model.predict(features_test)
r2_score(target_test, predictions)

0.4230772761583642

#### Linear regression after transformation

In [None]:
transformed_features_train, transformed_features_test = train_test_split(
    transformed_features, test_size=0.25, random_state=12345
)

In [None]:
print("--- Train Size (Rows, Columns) ---")
print("transformed_features_train:", transformed_features_train.shape)
print("")
print("--- Test Size (Rows, Columns) ---")
print("transformed_features_test:", transformed_features_test.shape)

--- Train Size (Rows, Columns) ---
transformed_features_train: (3635, 4)

--- Test Size (Rows, Columns) ---
transformed_features_test: (1212, 4)


In [None]:
transformed_model = LinearRegression()
transformed_model.fit(transformed_features_train, target_train)
transformed_predictions = transformed_model.predict(transformed_features_test)
r2_score(target_test, transformed_predictions)

0.4230772761581383

### Conclusion

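Both linear regression models, with and without the transformation, produce an R2 score of about `0.423`. The two values agree to 12 decimal places (`0.4230772761583642` vs. `0.4230772761581383`); the remaining difference is consistent with floating-point rounding.

Thus, we can conclude that the quality of the linear regression remains the same after transformation.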
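
Because the two scores are not bit-for-bit identical, a tolerance-based comparison is a safer way to state the check than `==`. The sketch below uses the two values reported above; the `1e-9` relative tolerance is an assumption.

```python
import math

# R2 scores copied from the outputs above.
r2_original = 0.4230772761583642
r2_transformed = 0.4230772761581383

difference = abs(r2_original - r2_transformed)

# Relative tolerance chosen for illustration; exact equality would fail here.
same_quality = math.isclose(r2_original, r2_transformed, rel_tol=1e-9)
```

With these values `same_quality` is `True`, while `r2_original == r2_transformed` is `False`.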