# Personal Data Protection for Insurance Company Clients

**Project Description**

You need to protect the data of the clients of the "Even in case of flood" insurance company. Develop a method of data transformation so that it is difficult to restore personal information from it. Justify the correctness of its operation.

The goal is to protect the data in such a way that the quality of machine learning models does not deteriorate after transformation. It is not necessary to select the best model.

Work plan:

1. Load and examine the data.
2. Multiply the features by an invertible matrix. Check whether the quality of linear regression changes.
   1. If it changes, provide examples of such matrices.
   2. If it does not change, indicate how the parameters of linear regression in the original problem are related to those in the transformed one.
3. Propose a data transformation algorithm to solve the problem. Show why the quality of linear regression will not change.
4. Implement this algorithm by applying matrix operations. Verify that the quality of linear regression from `sklearn` does not differ before and after transformation using the `R2` metric.

**Data Description**

The dataset is located in the file `/datasets/insurance.csv`.

1. **Features:** gender, age, insured's salary, number of family members.
2. **Target feature:** the number of insurance claims made by the client in the last 5 years.

In [1]:
import pandas as pd
import numpy as np

from collections import defaultdict
from IPython.display import display

from fast_ml import eda
from ydata_profiling import ProfileReport

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

In [2]:
FIG_WIDTH = 9 * 100
FIG_HEIGHT = 5 * 100
RANDOM_SEED = 42

In [3]:
try:
    raw_claims = pd.read_csv('insurance.csv')
except FileNotFoundError:
    # Fall back to the absolute path given in the data description
    raw_claims = pd.read_csv('/datasets/insurance.csv')

## Exploratory Data Analysis

### Data Description

Let's explore the main dependencies in the data before using them in machine learning algorithms.

Summary Tables:

In [4]:
display(eda.df_info(raw_claims))

| | data_type | data_type_grp | num_unique_values | sample_unique_values | num_missing | perc_missing |
|---|---|---|---|---|---|---|
| Пол | int64 | Numerical | 2 | [1, 0] | 0 | 0.0 |
| Возраст | float64 | Numerical | 46 | [41.0, 46.0, 29.0, 21.0, 28.0, 43.0, 39.0, 25.... | 0 | 0.0 |
| Зарплата | float64 | Numerical | 524 | [49600.0, 38000.0, 21000.0, 41700.0, 26100.0, ... | 0 | 0.0 |
| Члены семьи | int64 | Numerical | 7 | [1, 0, 2, 4, 3, 5, 6] | 0 | 0.0 |
| Страховые выплаты | int64 | Numerical | 6 | [0, 1, 2, 3, 5, 4] | 0 | 0.0 |


In [5]:
display(round(raw_claims.describe(), 2))

| | Пол | Возраст | Зарплата | Члены семьи | Страховые выплаты |
|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.5 | 30.95 | 39916.36 | 1.19 | 0.15 |
| std | 0.5 | 8.44 | 9900.08 | 1.09 | 0.46 |
| min | 0.0 | 18.0 | 5300.0 | 0.0 | 0.0 |
| 25% | 0.0 | 24.0 | 33300.0 | 0.0 | 0.0 |
| 50% | 0.0 | 30.0 | 40200.0 | 1.0 | 0.0 |
| 75% | 1.0 | 37.0 | 46600.0 | 2.0 | 0.0 |
| max | 1.0 | 65.0 | 79000.0 | 6.0 | 5.0 |


Detailed overview:

In [6]:
ProfileReport(raw_claims).to_widgets()

Let's tidy up the dataset before analysis.

In [7]:
df_claims = (
    raw_claims
    .copy()
    .rename(columns={
        'Пол': 'is_male', 'Возраст': 'age', 'Зарплата': 'income',
        'Члены семьи': 'family_members_count', 'Страховые выплаты': 'claims_count'
    })
    .astype({
        'age': 'int64', 'income': 'int64'
    })
)

Preliminary observations:

1. There are no missing values in the dataset, but there are duplicate rows. They should not affect the solution of the business task, so we will leave them as they are.

2. The dataset is balanced with respect to the `is_male` feature, with equal representation between the two genders. This is good as it reduces the risk of bias in the data.

3. The dataset covers a wide range of ages, from 18 to 65 years old with a mean value of around 31 years. This means that the data covers a broad range of clients.

4. In the `claims_count` column the mean is `0.15`, close to the minimum of `0`, and both the median and the 75th percentile are `0`. Most clients in the dataset therefore made no insurance claims.

## Feature Transformation

Let's check what transformations we can apply to the original features without deteriorating the quality of the ML models.

### Multiplication by an Invertible Matrix

First, let's see what happens if we multiply the features by an invertible matrix. This task can be solved explicitly. For this purpose, let's assume that:

1. $X$ - feature matrix
2. $y$ - target vector
3. $w$ - vector of linear regression coefficients (weights)
4. $w_0$ - constant coefficient
5. $A$ - invertible matrix we multiply by

Then the linear regression model (1) is:

$$
y = X \cdot w + w_0
$$

The new feature matrix (2) is:

$$
X' = X \cdot A
$$

And the new model (3) becomes:

$$
y = X' \cdot w' + w_0
$$

Then substituting (2) into (3) gives (4):

$$
y = X \cdot (Aw') + w_0
$$

Here we denote $w'' = Aw'$ - a weight vector that is applied to the original features. With this notation, (4) can be rewritten as (5):

$$
y = X \cdot w'' + w_0
$$

This means that every model on the transformed features is equivalent to a model on the original features with weights $w'' = Aw'$, and vice versa: because $A$ is invertible, the mapping $w' \mapsto Aw'$ is one-to-one. Both problems therefore minimize the same mean squared error over the same set of possible prediction vectors, so the optimal predictions, and with them the R2 score, do not change.

Note that $w'$ is generally not equal to $w$. Assuming the least-squares solution is unique, the optimum of the original problem is reached at $w'' = w$, that is, at $Aw' = w$.

Thus, the weights of the model in the transformed feature space ($w'$) are related to the weights of the model in the original feature space ($w$) by the equation $w' = A^{-1}w$ (obtained by multiplying $Aw' = w$ by $A^{-1}$ on the left), while the constant $w_0$ stays the same. It is worth noting that all this is possible only because $A$ is invertible.
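
The relation $w' = A^{-1}w$ can also be checked numerically. The sketch below is an illustration on synthetic data and is not part of the original notebook: the names `X`, `y`, `A`, `fit_linear` and the generated values are assumptions made for the example. It fits least squares with an intercept on the original and on the transformed features and compares the results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 objects, 4 features, noisy linear target.
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + 3.0 + rng.normal(scale=0.1, size=200)

# Random square matrix; a Gaussian matrix is invertible with probability 1.
A = rng.normal(size=(4, 4))


def fit_linear(features, target):
    """Least-squares fit with an intercept; returns (weights, intercept)."""
    design = np.column_stack([features, np.ones(len(features))])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:-1], solution[-1]


w, w0 = fit_linear(X, y)              # original problem
w_new, w0_new = fit_linear(X @ A, y)  # transformed problem

# Weights in the transformed space should equal A^{-1} w,
# and the intercept should be unchanged.
print(np.allclose(w_new, np.linalg.solve(A, w)))
print(np.isclose(w0_new, w0))

# Predictions of both models should coincide as well.
print(np.allclose(X @ w + w0, (X @ A) @ w_new + w0_new))
```

`np.linalg.solve(A, w)` is used instead of an explicit inverse because it computes $A^{-1}w$ directly and is usually more accurate.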

### Implementation of Transformation

The algorithm we will use is as follows:

1. Generating an invertible matrix:
    1. The matrix is generated randomly with dimensions matching the number of features in the $X$ matrix.
    2. The matrix must be invertible, which is checked by attempting to calculate the inverse matrix. If the matrix is non-invertible, a new random matrix is generated until an invertible matrix is found.

2. Data transformation:
    1. The feature matrix $X$ is multiplied by the generated invertible matrix. This transformation does not change the target vector $y$.
    2. The features and target values are split into training and validation sets; the same split is used for the original and the transformed features.
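
The steps above can be sketched on a toy feature matrix. The rows, the helper name `random_invertible_matrix` and the seed are assumptions made for illustration. The point is that the transformed rows no longer resemble the original records, while anyone who holds $A$ can restore them with $A^{-1}$.

```python
import numpy as np


def random_invertible_matrix(size, rng):
    """Draw random square matrices until one can be inverted (hypothetical helper)."""
    while True:
        candidate = rng.random((size, size))
        try:
            np.linalg.inv(candidate)
        except np.linalg.LinAlgError:
            continue  # singular draw, try again
        return candidate


rng = np.random.default_rng(42)

# Toy feature matrix (made-up rows): is_male, age, income, family_members_count.
X = np.array([
    [1, 41, 49600, 1],
    [0, 46, 38000, 1],
    [0, 29, 21000, 0],
], dtype=float)

A = random_invertible_matrix(X.shape[1], rng)

X_protected = X @ A                          # transformed features
X_restored = X_protected @ np.linalg.inv(A)  # possible only when A is known

print(X_protected.round(1))
print(np.allclose(X_restored, X))
```

This also shows that the matrix $A$ acts as a secret key: it has to be stored separately from the transformed data.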

Further work will be standard for an ML project.

Let's write the functions that will perform the transformations for us:

In [8]:
def get_transform_matrix(features: np.ndarray) -> np.ndarray:
    """
    Generates an invertible transformation matrix with the same number of columns as the input features.

    Parameters:
    features (np.ndarray): The feature matrix.

    Returns:
    np.ndarray: An invertible transformation matrix.
    """
    n_features = features.shape[1]
    while True:
        transform = np.random.rand(n_features, n_features)
        try:
            # Inversion raises LinAlgError for a singular matrix
            np.linalg.inv(transform)
        except np.linalg.LinAlgError:
            continue
        return transform

In [9]:
transform = get_transform_matrix(df_claims.drop('claims_count', axis=1).values)
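
An exact singularity test almost never fails for a random floating-point matrix, so it does not protect against nearly singular draws, which can amplify rounding errors. As an optional, stricter variant (an assumption of this sketch, not part of the original notebook), the condition number can be bounded. The name `get_well_conditioned_matrix`, the threshold `max_cond=1e6` and the seed are hypothetical.

```python
import numpy as np


def get_well_conditioned_matrix(n_features, max_cond=1e6, seed=42):
    """Return a random square matrix whose condition number is below max_cond."""
    rng = np.random.default_rng(seed)
    while True:
        candidate = rng.random((n_features, n_features))
        # np.linalg.cond is large (or inf) for nearly singular matrices.
        if np.linalg.cond(candidate) < max_cond:
            return candidate


transform_checked = get_well_conditioned_matrix(4)
print(np.linalg.cond(transform_checked))
```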

### Model Evaluation

In the last section, let's check what we've achieved. First, let's split the datasets:

In [10]:
def get_data_splits(data: pd.DataFrame, target: str, test_size=0.25):
    """
    Splits input features and target into training and validation sets and
    returns them in a dictionary format for easy access.
    
    Parameters:
    data (pd.DataFrame): The dataframe with features and target.
    target (str): The target column name.
    test_size (float): Proportion of the dataset to include in the test split.

    Returns:
    dict: Dictionary containing split data.
    """
    
    X_train, X_valid, y_train, y_valid = train_test_split(
        data.drop(target, axis=1), data[target],
        test_size=test_size, random_state=RANDOM_SEED
    )
    
    dct_splits = {
        'train': {'features': X_train, 'target': pd.DataFrame(y_train, columns=[target])},
        'valid': {'features': X_valid, 'target': pd.DataFrame(y_valid, columns=[target])}
    }
    
    return dct_splits

And let's write a function to apply the transformation:

In [11]:
def apply_transform(dct_split: dict, transform: np.ndarray) -> dict:
    """
    Apply a transformation to the feature matrix in a split data dictionary.
    
    This function multiplies the feature matrix in the input dictionary by a given
    transformation matrix using the dot product. The resulting transformed features and
    the original target values are then returned as a new dictionary.
    
    Parameters:
    dct_split (dict): A dictionary containing the data split.
    Expected keys are 'features' and 'target' with array-like values
    (pandas DataFrames or numpy arrays).
    
    transform (numpy.ndarray): A transformation matrix that will be used
    to transform the feature matrix by matrix multiplication.

    Returns:
    dict: A new dictionary with the same structure as the input, but with the feature
    matrix transformed by the given transformation matrix.
    """
    return {
        'features': np.dot(dct_split['features'], transform),
        'target': dct_split['target']
    }

Let's record the original and transformed data:

In [12]:
# Split once and reuse the same rows for both feature spaces
original_splits = get_data_splits(df_claims, 'claims_count')

dct_splits = {
    'original': original_splits,
    'transformed': {
        split: apply_transform(original_splits[split], transform)
        for split in ['train', 'valid']
    }
}

Let's create linear models and train them:

In [13]:
models = {
    name: LinearRegression().fit(
        dct_splits[name]['train']['features'],
        dct_splits[name]['train']['target']
    )
    for name in dct_splits
}

Let's look at the R2 metric:

In [14]:
for name in models:
    print(f"{name} R2 score: ", end="")
    print(round(
        r2_score(
            dct_splits[name]['valid']['target'],
            models[name].predict(dct_splits[name]['valid']['features'])
        ), 5
    ))

original R2 score: 0.42548
transformed R2 score: 0.42548
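
The same comparison can be reproduced without `sklearn`. The sketch below uses synthetic data and hypothetical helper names (`r2`, `fit_predict`), so its numbers differ from the output above. It is meant to show only that R2 on a held-out part is the same for the original and the transformed features, up to rounding error.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


def fit_predict(X_train, y_train, X_valid):
    """Fit least squares with an intercept on the training part, predict on the validation part."""
    def add_ones(matrix):
        return np.column_stack([matrix, np.ones(len(matrix))])

    coef, *_ = np.linalg.lstsq(add_ones(X_train), y_train, rcond=None)
    return add_ones(X_valid) @ coef


rng = np.random.default_rng(42)

# Synthetic stand-in for the insurance data (hypothetical values).
X = rng.normal(size=(400, 4))
y = X @ np.array([0.0, 0.4, 0.0, -0.1]) + rng.normal(scale=0.5, size=400)
A = rng.random((4, 4))

# The same 75/25 split is used for both feature spaces.
X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

r2_original = r2(y_valid, fit_predict(X_train, y_train, X_valid))
r2_transformed = r2(y_valid, fit_predict(X_train @ A, y_train, X_valid @ A))

print(round(r2_original, 5), round(r2_transformed, 5))
```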


## General Conclusions

In this project, we explored methods for protecting client data for the insurance company "Even in case of flood". We used an approach based on transforming the data by multiplying the features by an invertible matrix.

1. **Data Analysis**: Initial data analysis showed that the provided data does not contain explicit missing values or anomalies, allowing us to quickly proceed to the next steps.

2. **Data Transformation**: We examined the claim that multiplying features by an invertible matrix does not change the quality of linear regression. The derivation showed that the weights in the transformed space are $w' = A^{-1}w$ and that the predictions stay the same, which confirms the claim.

3. **Development of Data Transformation Algorithm**: We developed an algorithm for data transformation, which was then implemented using matrix operations.

4. **Quality Check of Linear Regression**: After applying the data transformation, we trained two linear regression models: one on the original data and the other on the transformed data. Evaluating the quality of these models using the R2 metric showed that the quality of linear regression did not change after the transformation.

As a result, we can conclude that the proposed data transformation makes clients' personal data difficult to restore without the transformation matrix, while the quality of linear regression stays the same.