## Library Imports

In [9]:
# Data manipulation
import pandas as pd
import numpy as np

# ML tools
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fairness
from fairlearn.metrics import MetricFrame, demographic_parity_difference, equalized_odds_difference, selection_rate

## Data Loading and Preprocessing


The **UCI Adult Income dataset** is used to predict whether a person earns more than \$50K/year.  


The raw dataset requires preprocessing because:  
- It **has no headers**, so column names must be assigned.  
- It contains **categorical variables** (e.g. `workclass`, `education`, `occupation`), but the model expects numerical input.  
- It uses `?` for **missing values** that the model cannot handle directly.  


Preprocessing steps:  
1. Assign column names.  
2. Drop rows with missing values.  
3. Convert the target variable `income` to **binary classification** (0 = ≤50K, 1 = >50K).  
4. One-hot encode categorical features.  
5. Separate the features `X` (excluding `fnlwgt`) from the target `y`.  


The dataset is then ready for modelling.


In [10]:
# Define column names as UCI dataset has no headers
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

# Load dataset
data = pd.read_csv(
    'adult.data', 
    header=None, 
    names=column_names, 
    na_values='?', 
    sep=r',\s*', 
    engine='python'
)

# Drop rows with missing values
data.dropna(inplace=True)

# Convert target variable to binary format
data['income'] = (data['income'] == '>50K').astype(int)

# One-hot encode categorical variables
categorical_cols = [
    'workclass', 'education', 'marital-status',
    'occupation', 'relationship', 'race',
    'sex', 'native-country'
]
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# Split data into features and target variable
# ('fnlwgt', the census sampling weight, is excluded from the features)
X = data.drop(['income', 'fnlwgt'], axis=1)
y = data['income']

print("Data preparation complete.")


Data preparation complete.
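To make the preprocessing steps easier to inspect, here is a minimal sketch that applies the same `read_csv` options to three made-up rows. The rows, the reduced four-column layout, and the names `raw`, `toy` and `n_missing` are illustrative assumptions, not part of the real dataset.

```python
import io
import pandas as pd

# Three made-up rows in the same "comma + space" layout as adult.data,
# cut down to four columns for illustration.
raw = io.StringIO(
    "39, State-gov, Male, <=50K\n"
    "50, ?, Female, <=50K\n"
    "38, Private, Female, >50K\n"
)

toy = pd.read_csv(
    raw,
    header=None,
    names=["age", "workclass", "sex", "income"],
    na_values="?",     # treat '?' as missing
    sep=r",\s*",       # a comma followed by optional whitespace
    engine="python",   # regex separators need the python engine
)

# The '?' entry is read as NaN, so that row is removed by dropna()
n_missing = int(toy["workclass"].isna().sum())
toy = toy.dropna()

# Binary target: 1 for '>50K', 0 otherwise
toy["income"] = (toy["income"] == ">50K").astype(int)

# One-hot encode, dropping the first level of each categorical column
toy = pd.get_dummies(toy, columns=["workclass", "sex"], drop_first=True)

print(n_missing)
print(list(toy.columns))
```

With `drop_first=True`, the alphabetically first level of each column (here `Female` for `sex`) becomes the implicit reference category, which is why the encoded data contains a `sex_Male` column but no `sex_Female` column.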


## Model Training


A **Logistic Regression classifier** is used as a starting point because it is simple, fast to train, and well suited to binary classification.


The dataset is split into **80% training data** and **20% testing data**.


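**Feature scaling** standardises each feature to zero mean and unit variance, which helps the solver converge faster and stops large-valued features such as `capital-gain` from dominating. The scaler is fitted on the training data only and then applied to the test data, so no test-set information leaks into training.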
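As a rough illustration of what standardisation does to a single column, the sketch below uses only the standard library. The values and the names `train_hours`, `test_hours` and `standardise` are hypothetical. `StandardScaler` performs the equivalent calculation per column, using the population standard deviation.

```python
from statistics import fmean, pstdev

# Hypothetical toy values for one numeric column (e.g. hours per week)
train_hours = [20.0, 40.0, 40.0, 60.0]
test_hours = [40.0, 50.0]

# "Fit": learn the mean and population standard deviation from training data only
mean = fmean(train_hours)
std = pstdev(train_hours)

def standardise(values):
    """Apply the training-set statistics to any list of values."""
    return [(v - mean) / std for v in values]

train_scaled = standardise(train_hours)
test_scaled = standardise(test_hours)  # reuses the training mean and std

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1. The test column generally does not, because it is transformed with the training statistics rather than its own.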


The model is trained on the scaled training data, and predictions (`y_pred`) are made on the test set.


The model's performance is measured using **accuracy**.


In [11]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialise and train Logistic Regression model
model = LogisticRegression(max_iter=5000, solver='lbfgs')
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)

# Evaluate model
accuracy = model.score(X_test_scaled, y_test)

print(f"Model accuracy: {accuracy:.2f}")


Model accuracy: 0.85


The Logistic Regression model achieves around **85% accuracy** on the test set.
This serves as a baseline for testing other models and potential improvements.

## Fairness Audit

The model is audited for bias using the **Fairlearn** library, starting with the sensitive feature `sex`.

- **MetricFrame** computes **accuracy** and **selection_rate** (the share of people predicted to earn >\$50K) separately for males and females, exposing gaps in both performance and outcomes.

- **Demographic Parity Difference** is the gap in selection rate between the groups; 0 means both groups receive positive predictions equally often.

- **Equalised Odds Difference** is the larger of the between-group gaps in true positive rate and false positive rate; 0 means the model makes errors at the same rates for both groups.
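The two difference metrics can be worked through by hand on a tiny example. The sketch below is pure Python on made-up labels. The helper names `rate` and `group_rates` are hypothetical, and the code mirrors the metric definitions rather than Fairlearn's actual implementation.

```python
def rate(flags):
    """Share of 1s in a list of 0/1 values."""
    return sum(flags) / len(flags)

def group_rates(y_true, y_pred, groups, group):
    """Selection rate, true positive rate and false positive rate for one group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    preds = [y_pred[i] for i in idx]
    preds_for_actual_pos = [y_pred[i] for i in idx if y_true[i] == 1]
    preds_for_actual_neg = [y_pred[i] for i in idx if y_true[i] == 0]
    return {
        "selection_rate": rate(preds),
        "tpr": rate(preds_for_actual_pos),
        "fpr": rate(preds_for_actual_neg),
    }

# Made-up labels: four people in group "F", then four in group "M"
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
y_true = [1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]

f = group_rates(y_true, y_pred, groups, "F")
m = group_rates(y_true, y_pred, groups, "M")

# Demographic parity difference: gap in selection rates
dpd = abs(m["selection_rate"] - f["selection_rate"])

# Equalised odds difference: the larger of the TPR gap and the FPR gap
eod = max(abs(m["tpr"] - f["tpr"]), abs(m["fpr"] - f["fpr"]))

print(f"Demographic parity difference: {dpd:.2f}")
print(f"Equalised odds difference: {eod:.2f}")
```

In this toy data, group "M" is selected twice as often as group "F" (0.50 vs 0.25), giving a demographic parity difference of 0.25. The true positive and false positive rates each differ by 0.50, giving an equalised odds difference of 0.50.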

In [13]:
# Extract sensitive feature for fairness analysis
sensitive_sex = data.loc[X_test.index, 'sex_Male']

# Compute metrics by group
metrics_by_sex = MetricFrame(metrics={'accuracy': accuracy_score, 'selection_rate': selection_rate},
                             y_true=y_test,
                             y_pred=y_pred,
                             sensitive_features=sensitive_sex)

# Compute main fairness difference metrics
dpd_sex = demographic_parity_difference(y_true=y_test, y_pred=y_pred, sensitive_features=sensitive_sex)
eod_sex = equalized_odds_difference(y_true=y_test, y_pred=y_pred, sensitive_features=sensitive_sex)

# Print a clear report
print("--- Fairness Audit: Sex ----\n")
print("Performance and Selection Rates by Group:\n", metrics_by_sex.by_group)
print(f"\nDemographic Parity Difference (bias in outcomes): {dpd_sex:.3f}")
print(f"Equalised Odds Difference (bias in error rates): {eod_sex:.3f}\n")

--- Fairness Audit: Sex ----

Performance and Selection Rates by Group:
           accuracy  selection_rate
sex_Male                          
False     0.928608        0.080571
True      0.815570        0.265963

Demographic Parity Difference (bias in outcomes): 0.185
Equalised Odds Difference (bias in error rates): 0.086



## Fairness Audit Results (Sex)

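The results show clear differences between the sexes in both model performance and predicted outcomes.

- **Accuracy:** The model is more accurate for females (92.9%) than for males (81.6%). Part of this gap may come from class balance rather than model quality: if most women in the data earn ≤\$50K, predicting the lower bracket is right more often.

- **Outcomes:** The Demographic Parity Difference is **high** (0.185): the model predicts a high income (>\$50K) for 26.6% of men but only 8.1% of women.

- **Error rates:** The Equalised Odds Difference is smaller (0.086), so the true positive and false positive rates differ by at most 8.6 percentage points between the groups.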

**Conclusion:** With respect to sex, the model is not fair by the demographic parity criterion, as it appears to replicate historical income disparities found in the 1994 census data.
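As a quick arithmetic check, the reported Demographic Parity Difference can be recomputed from the two selection rates printed in the audit output. The variable names below are illustrative, and the ratio is shown only as an additional way to read the same gap.

```python
# Selection rates copied from the audit output (sex_Male False / True)
selection_rate_female = 0.080571
selection_rate_male = 0.265963

# Difference between the groups, matching the reported metric
gap = selection_rate_male - selection_rate_female

# How many times more often the model predicts >50K for men
times_more_often = selection_rate_male / selection_rate_female

print(f"Difference: {gap:.3f}")
print(f"Men are predicted >50K about {times_more_often:.1f}x as often as women")
```

The difference rounds to 0.185, in line with the value Fairlearn reported, and corresponds to men receiving a positive prediction roughly 3.3 times as often as women.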
