## Imports

In [23]:
# Scikit-learn
from sklearn.datasets import fetch_california_housing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, OneHotEncoder, OrdinalEncoder, LabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Data manipulation and math
import pandas as pd
import numpy as np
import math

# Visualization
import matplotlib.pyplot as plt

## Fetching the datasets
**TODO:** short description of datasets

In [16]:
!kaggle datasets download -p data ranja7/vehicle-insurance-customer-data
insurance_df = pd.read_csv('data/vehicle-insurance-customer-data.zip')
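# NOTE: despite the variable name, this loads the California housing data;
# with as_frame=True the result is a Bunch whose DataFrame is available as `.frame`
cancer_df = fetch_california_housing(as_frame=True)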

vehicle-insurance-customer-data.zip: Skipping, found more recently modified local copy (use --force to force download)


# How does satisfying group and individual fairness notions affect the performance of machine learning models and how do they compare?
In this section we explore how satisfying group and individual fairness notions affects the performance of a machine learning model. First, we motivate this research question by placing it in the context of practical machine learning problems. We then discuss each notion separately, briefly look at its theoretical background and predict its expected effect on performance. We conclude by putting these notions to the test.

## Motivation
Machine learning is increasingly being applied in industry. Legislators, insurers and banks are playing catch-up to integrate this technology into their decision-making processes. Supervised learning often relies on historical data, which means that bias present in the data is transferred to the model. Perpetuating this bias is not only unfair, but often unlawful or contrary to company policy. Chouldechova & Roth (2018) identify three causes of unfairness:
* **Bias in training data**: historical data often has human bias embedded in it. A classic example is the disproportionate amount of crime attributed to some marginalized and ostracized communities. This can often be explained by the socio-economic situation of these communities. In addition, these areas might be policed at a higher rate, which further skews crime-prediction models.
* **Minimizing average error**: a model tends to represent a majority group more accurately than a minority group. When the error of each individual carries the same weight, the larger group contributes more to the average error, so minimizing that error mainly benefits the majority.
* **Related to exploration**: online learning models, which are updated with new information while in use, can benefit greatly from the information gained by taking suboptimal decisions. Such exploration can be unethical (e.g. for medical procedures) or can benefit or disadvantage certain groups.

Different fairness notions have been introduced in the literature to mitigate problems like these. Which notion to satisfy depends on the problem at hand.

## Fairness notions
We now introduce each fairness notion that we will apply.

### Statistical parity
There exist different group fairness notions (for a full list, see Barocas et al., 2017). Group fairness notions seek to be fair towards a protected group. In this work, we only consider **statistical parity** (also known as **demographic parity**). It is satisfied for a given (sensitive) group attribute when the rate of *positive* classifications for each group is identical to that of the entire population (Barocas et al., 2017). This means that predictions need to be statistically independent of the group attribute. For a group attribute $G$ and an attribute to predict $Y$ (with "$+$" denoting a *positive* prediction), it must hold that:
$$\forall a,b \in G: \mathbb{P}(Y = + \mid G = a) = \mathbb{P}(Y = + \mid G = b)$$
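The condition above can be checked empirically by comparing the positive-prediction rate of each group. The following is a minimal sketch on made-up predictions; the helper name `positive_rates` and the toy data are hypothetical and are not part of the insurance dataset.

```python
import pandas as pd

def positive_rates(y_pred, group, positive=1):
    """Empirical P(Y = + | G = g) for every group value g."""
    df = pd.DataFrame({"pred": y_pred, "group": group})
    return (df["pred"] == positive).groupby(df["group"]).mean()

# Hypothetical toy predictions for two groups 'a' and 'b'
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rates(y_pred, group)
# Statistical parity holds exactly when this gap is zero
gap = rates.max() - rates.min()
print(rates.to_dict(), gap)
```

A gap of zero means all groups receive positive predictions at the same rate; in practice a small tolerance is usually allowed.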

### Individual fairness
As opposed to group fairness notions, **individual fairness** does not look at any sensitive attribute in particular. Individual fairness is satisfied when similar individuals are classified similarly, in proportion to their degree of similarity (Dwork et al., 2012). To accomplish this, we need a distance metric $d$ to quantify the similarity between individuals and a distance metric $D$ to compare the output distributions of the classifier. Individual fairness is satisfied for a classifier $M$ if it satisfies the $(D, d)$-**Lipschitz property**. It entails that for any two individuals $x$, $y$ it holds that:
$$D(M(x), M(y)) \le d(x, y)$$
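To make the Lipschitz property concrete, the sketch below lists every pair of individuals that violates it. The choice of metrics is an assumption made for illustration: Euclidean distance for $d$ and total variation distance for $D$. The function name `lipschitz_violations` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np

def euclidean(x, y):
    """Assumed similarity metric d between two individuals."""
    return float(np.linalg.norm(x - y))

def total_variation(p, q):
    """Assumed metric D between two predicted class distributions."""
    return 0.5 * float(np.abs(p - q).sum())

def lipschitz_violations(X, probs, d=euclidean, D=total_variation):
    """Index pairs (i, j) with D(M(x_i), M(x_j)) > d(x_i, x_j)."""
    return [
        (i, j)
        for i, j in combinations(range(len(X)), 2)
        if D(probs[i], probs[j]) > d(X[i], X[j])
    ]

# Hypothetical individuals and the class distributions M assigns to them
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])

violations = lipschitz_violations(X, probs)
print(violations)
```

Individuals 0 and 1 are very close to each other yet receive clearly different output distributions, so this pair is reported. The check is quadratic in the number of individuals, so on larger datasets it would need to be applied to a sample of pairs.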

## Fulfilling notions

### Preprocessing
We start by preprocessing the dataset. Skewed data that follows an exponential-like distribution is transformed using the logarithm. Nominal categories are one-hot encoded using dummy columns for each possible category. Ordinal categories are mapped to integers that preserve their order.

In [36]:
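def log_positive(x):
    # Log transform to remove data skew; non-positive entries are set to 0
    # instead of being left uninitialized
    x = np.asarray(x, dtype=float)
    return np.log(x, out=np.zeros_like(x), where=x > 0)

def to_thousands(x):
    # Rescale income to thousands
    return x / 1000

ctf = ColumnTransformer(
    transformers=[
        (
            'exp_dist',
            FunctionTransformer(log_positive),
            ['Customer Lifetime Value', 'Total Claim Amount']
        ), (
            'scale_income',
            FunctionTransformer(to_thousands),
            ['Income'] # list selector so the transformer receives 2D input
        ), (
            'encode_gender',
            # LabelEncoder only accepts targets, so encode the feature with OrdinalEncoder
            OrdinalEncoder(categories=[['M', 'F']]), # M=0, F=1
            ['Gender']
        ), (
            'encode_nominal',
            OneHotEncoder(),
            ['EmploymentStatus', 'Marital Status', 'Policy Type', 'Policy', 'Vehicle Class']
        ), (
            'encode_coverage',
            # categories must be a list with one list of categories per column
            OrdinalEncoder(categories=[['Basic', 'Extended', 'Premium']]),
            ['Coverage']
        ), (
            'encode_education',
            OrdinalEncoder(categories=[['High School or Below', 'College', 'Bachelor', 'Master', 'Doctor']]),
            ['Education']
        ), (
            'encode_size',
            OrdinalEncoder(categories=[['Small', 'Medsize', 'Large']]),
            ['Vehicle Size']
        ), (
            'pass',
            'passthrough',
            ['Number of Open Complaints', 'Number of Policies']
        )
    ],
    remainder='drop'
)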

# Select the K best features based on the F-test (f_regression)
# NOTE: the pipeline has no final estimator yet, so it can be fit but cannot predict
insurance_pipe = Pipeline([
    ("col_trans", ctf),
    ("k_best", SelectKBest(f_regression))
])

insurance_ttr = TransformedTargetRegressor(
    regressor=insurance_pipe,
    transformer=LabelBinarizer()
)
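The difference between the ordinal and the one-hot encoding used in the preprocessing can be seen on a tiny example. This is a sketch on a hypothetical three-row frame; the column names mirror the insurance data, but the rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy rows, not taken from the insurance dataset
toy = pd.DataFrame({
    "Coverage": ["Premium", "Basic", "Extended"],
    "Policy Type": ["Personal Auto", "Corporate Auto", "Personal Auto"],
})

# Ordinal: one integer per category, following the given order
ordinal = OrdinalEncoder(categories=[["Basic", "Extended", "Premium"]])
coverage_codes = ordinal.fit_transform(toy[["Coverage"]])

# Nominal: one dummy column per category (categories are sorted alphabetically)
onehot = OneHotEncoder()
policy_dummies = onehot.fit_transform(toy[["Policy Type"]]).toarray()

print(coverage_codes.ravel())
print(onehot.categories_[0], policy_dummies)
```

The ordinal encoding keeps the ranking Basic < Extended < Premium in a single column, whereas the one-hot encoding avoids imposing any order on the policy types at the cost of one column per category.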

In [38]:
insurance_y = insurance_df['Response']
insurance_X = insurance_df.drop('Response', axis=1)

In [None]:
num_splits = 5
cv = KFold(n_splits=num_splits, shuffle=True, random_state=0)
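As an illustration of what such a splitter produces, the sketch below applies the same `KFold` settings to a hypothetical array of ten samples and records the size of each train/test split. The names `X_toy` and `cv_toy` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
cv_toy = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
test_indices = []
for train_idx, test_idx in cv_toy.split(X_toy):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_indices.extend(test_idx.tolist())

# Each sample appears in exactly one test fold
print(fold_sizes, sorted(test_indices))
```

With ten samples and five splits, every fold trains on eight samples and tests on the remaining two. Fixing `random_state` makes the shuffled folds reproducible across runs.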

## Sources
* Chouldechova, A., & Roth, A. (2018). The frontiers of fairness in machine learning. *arXiv preprint arXiv:1810.08810*.
* Barocas, S., Hardt, M., & Narayanan, A. (2017). Fairness in machine learning. *NIPS Tutorial*, 1.
* Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012, January). Fairness through awareness. In *Proceedings of the 3rd innovations in theoretical computer science conference* (pp. 214-226).