# Examples and exercises for the lecture "Adversarial Regularization Regimes for Classification Tasks"

In [1]:
import os
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import numpy as np

from risk_learning.arr import (
    convert_to_categorical,
    make_feature_combination_array,
    make_feature_combination_score_array,
    make_trend_reports,
    make_data_trend_reports,
)

## Example Simpson's Paradox Data

In [2]:
datadir = Path(os.getcwd()) / 'data'
data_path = datadir / 'adversarial-default-for-x-validation.csv'

df = pd.read_csv(data_path)
df

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
...,...,...,...
595,0,0,0
596,0,0,1
597,1,0,0
598,1,0,0


In [3]:
label_mapping_values = dict(gender=[0, 1], occupation=[0, 1])
data_categories = label_mapping_values.copy()
data_categories['default'] = [0, 1]
df = convert_to_categorical(df, data_categories)
df.head(10)

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
5,1,0,0
6,0,0,0
7,1,0,0
8,0,0,0
9,1,0,0


## Exercise: Simpson or not?

Difficulty: (*)

Prove that this dataset exhibits Simpson's paradox.
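One possible starting point, sketched under assumptions: compare the default rate per `gender` over the whole dataset with the rate per `gender` inside each `occupation` subgroup, and check whether the direction of the difference flips. The helper names `default_rates` and `shows_simpson_reversal` and the `toy` table are hypothetical and made up for illustration; `toy` is not the lecture's dataset, so its numbers say nothing about `df`.

```python
import numpy as np
import pandas as pd


def default_rates(df, group_col="gender", stratum_col="occupation", target_col="default"):
    """Return (overall, stratified) mean default rates per group."""
    # Cast to int so that categorical 0/1 columns can be averaged.
    data = df[[group_col, stratum_col, target_col]].astype(int)
    overall = data.groupby(group_col)[target_col].mean()
    stratified = (
        data.groupby([stratum_col, group_col])[target_col].mean().unstack(group_col)
    )
    return overall, stratified


def shows_simpson_reversal(overall, stratified):
    """True if every stratum trend has the opposite sign of the overall trend."""
    overall_sign = np.sign(overall.loc[1] - overall.loc[0])
    strata_signs = np.sign(stratified[1] - stratified[0])
    return bool(overall_sign != 0 and (strata_signs == -overall_sign).all())


# Hypothetical toy counts (occupation, gender, group size, number of defaults),
# chosen by hand so that the trend reverses. NOT the lecture's data.
counts = [(0, 0, 80, 40), (0, 1, 20, 12), (1, 0, 20, 2), (1, 1, 80, 16)]
rows = []
for occupation, gender, n, n_default in counts:
    rows += [(1, gender, occupation)] * n_default
    rows += [(0, gender, occupation)] * (n - n_default)
toy = pd.DataFrame(rows, columns=["default", "gender", "occupation"])

overall, stratified = default_rates(toy)
print(overall)
print(stratified)
print(shows_simpson_reversal(overall, stratified))
```

Calling `default_rates(df)` on the dataset loaded above should give the two tables needed for the argument; the proof itself still has to explain why the aggregated and the per-subgroup trends disagree.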

## Exercises: non-trivial regularization regime

* Which optimizer ("solver") for logistic regression seems best suited to the above dataset? See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Difficulty: (*)
* Calculate the "true" trends for female default within each occupation subgroup. Note that sklearn uses the inverse regularization parameter $C$, so to approximate the usual unregularized setting $c=0$, set $C$ to a large value. Difficulty: (**)
* Show that this dataset is adversarial for logistic regression with inverse regularization parameter $C=0.05$. Difficulty: (**)
* Show that this dataset is still adversarial for k-fold cross-validated logistic regression (`LogisticRegressionCV`) with $k=5$, the default setting.
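The three regularization regimes in the exercises above can be set up side by side roughly as follows. This is a minimal sketch, not the lecture's reference solution: the helper name `gender_coefficients`, the dictionary keys, and the `toy` table are hypothetical, and the model uses only the two main effects `gender` and `occupation` without an interaction term. It does not use the `risk_learning.arr` helpers imported at the top.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV


def gender_coefficients(df, features=("gender", "occupation"), target="default"):
    """Fit three logistic regressions and return the fitted gender coefficient of each."""
    # Cast to int so that categorical 0/1 columns become plain numeric arrays.
    X = df[list(features)].astype(int).to_numpy()
    y = df[target].astype(int).to_numpy()
    models = {
        # Large C means weak regularization, approximating the unregularized fit.
        "large_C": LogisticRegression(C=1e6, max_iter=1000),
        # Small C means strong regularization.
        "C_0.05": LogisticRegression(C=0.05, max_iter=1000),
        # C is chosen by 5-fold cross-validation over sklearn's default grid.
        "cv5": LogisticRegressionCV(cv=5, max_iter=1000),
    }
    gender_idx = list(features).index("gender")
    return {
        name: float(model.fit(X, y).coef_[0, gender_idx])
        for name, model in models.items()
    }


# Hypothetical toy counts (occupation, gender, group size, number of defaults).
# NOT the lecture's data; replace `toy` with the `df` loaded above.
counts = [(0, 0, 80, 40), (0, 1, 20, 12), (1, 0, 20, 2), (1, 1, 80, 16)]
rows = []
for occupation, gender, n, n_default in counts:
    rows += [(1, gender, occupation)] * n_default
    rows += [(0, gender, occupation)] * (n - n_default)
toy = pd.DataFrame(rows, columns=["default", "gender", "occupation"])

coefs = gender_coefficients(toy)
for name, value in coefs.items():
    print(f"{name}: gender coefficient = {value:+.3f}")
```

The idea is to compare the sign of each fitted `gender` coefficient with the within-occupation trend obtained from the nearly unregularized fit; a regime whose coefficient points the other way is a candidate for the "adversarial" behaviour the exercises ask about.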