# Task 3: Baseline Model Comparison (Logistic Regression)

In this notebook, you will train a Logistic Regression model to predict the `prior_hiring_decision` target variable. You will also begin your fairness analysis by establishing a baseline for accuracy and fairness metrics.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

# Load Data
train_df = pd.read_csv('../data/train.csv')
val_df = pd.read_csv('../data/val.csv')
test_df = pd.read_csv('../data/test.csv')

print("Training Shape:", train_df.shape)
print("Validation Shape:", val_df.shape)

## 1. Feature Preprocessing
We will use all available features for this baseline model.
- Categorical features will be one-hot encoded.
- Numerical features will be standardized (zero mean, unit variance).

In [None]:
# Define features
target = 'prior_hiring_decision'
features = [c for c in train_df.columns if c != target]

# Identify column types. include='number' catches every numeric dtype
# (int32, float32, ...), not only int64 and float64.
numerical_cols = train_df[features].select_dtypes(include='number').columns.tolist()
categorical_cols = train_df[features].select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)

# Preprocessing Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ])

# Full Pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter=1000))])

## 2. Model Training

In [None]:
X_train = train_df[features]
y_train = train_df[target]

X_val = val_df[features]
y_val = val_df[target]

# Train
clf.fit(X_train, y_train)

## 3. Evaluation (Accuracy)
Report the accuracy on the validation set.

In [None]:
y_pred = clf.predict(X_val)
acc = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {acc:.4f}")
print("Confusion Matrix (rows = true class, columns = predicted class):")
print(confusion_matrix(y_val, y_pred))
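
Accuracy is only one summary of the confusion matrix. As a rough sketch, assuming a binary target encoded as 0/1 (an assumption, check how `prior_hiring_decision` is stored), the four cells can be unpacked to recover accuracy along with the true and false positive rates. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
toy_true = [1, 0, 1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# For binary labels [0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

toy_accuracy = (tp + tn) / (tp + tn + fp + fn)
toy_tpr = tp / (tp + fn)  # share of actual positives predicted positive
toy_fpr = fp / (fp + tn)  # share of actual negatives predicted positive

print(f"accuracy={toy_accuracy:.2f}, TPR={toy_tpr:.2f}, FPR={toy_fpr:.2f}")
```

The same unpacking should work on `confusion_matrix(y_val, y_pred)` if the target is 0/1.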

## 4. Fairness Metric Implementation
**TODO:** Select one fairness metric (Demographic Parity, Equal Opportunity, or Equalized Odds) and calculate it for this model. Compute the metric separately for each group of a sensitive attribute (e.g., Race) and compare the values across groups.

In [None]:
# TODO: Implement fairness metric calculation here
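
One possible starting point, offered as a sketch rather than the expected solution: the hypothetical helper `group_fairness_table` below computes the per-group selection rate (the quantity compared under Demographic Parity) and the per-group true positive rate (the quantity compared under Equal Opportunity). It assumes a binary target encoded as 0/1. The toy data is made up.

```python
import numpy as np
import pandas as pd


def group_fairness_table(y_true, y_pred, group):
    """Per-group selection rate and true positive rate for 0/1 labels."""
    df = pd.DataFrame({
        "y_true": np.asarray(y_true),
        "y_pred": np.asarray(y_pred),
        "group": np.asarray(group),
    })
    rows = {}
    for name, part in df.groupby("group"):
        positives = part[part["y_true"] == 1]
        rows[name] = {
            "n": len(part),
            # Demographic Parity compares P(y_pred = 1) across groups.
            "selection_rate": part["y_pred"].mean(),
            # Equal Opportunity compares P(y_pred = 1 | y_true = 1) across groups.
            "tpr": positives["y_pred"].mean() if len(positives) else np.nan,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up data for illustration only.
toy_true = [1, 0, 1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 1, 1, 0, 1, 0, 0]
toy_group = ["A", "A", "A", "A", "B", "B", "B", "B"]

table = group_fairness_table(toy_true, toy_pred, toy_group)
dp_gap = table["selection_rate"].max() - table["selection_rate"].min()
eo_gap = table["tpr"].max() - table["tpr"].min()

print(table)
print(f"Demographic parity difference: {dp_gap:.2f}")
print(f"Equal opportunity difference: {eo_gap:.2f}")
```

In this notebook the call would look something like `group_fairness_table(y_val, y_pred, val_df["Race"])`, where the column name `Race` is a guess based on the prompt above. Equalized Odds would additionally need the per-group false positive rate, which can be added the same way using the rows where `y_true == 0`.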

## 5. ROC Curve Analysis
**TODO:** Plot the ROC curve for this model on the validation set and calculate the AUC. Discuss what the curve tells you about how well the model separates the two classes as the decision threshold varies.

In [None]:
# TODO: Plot ROC Curve here
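
A minimal sketch of one way to approach this, not a reference solution: the hypothetical helper `plot_roc` uses scikit-learn's `roc_curve` and `roc_auc_score`, and is demonstrated on made-up scores. It assumes a binary target where the positive class is labeled 1.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve


def plot_roc(y_true, y_score, label="Logistic Regression"):
    """Plot the ROC curve for binary labels and return (auc, figure)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")
    # Diagonal reference line: a classifier that scores at random.
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curve")
    ax.legend(loc="lower right")
    return auc, fig


# Made-up labels and scores for illustration only.
toy_auc, _ = plot_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], label="toy scores")
print(f"Toy AUC: {toy_auc:.2f}")

# In this notebook, assuming class 1 is the positive class:
# y_score = clf.predict_proba(X_val)[:, 1]
# val_auc, _ = plot_roc(y_val, y_score)
```

Note that the curve needs predicted probabilities (or decision scores), not the hard 0/1 output of `clf.predict`, because each point on the curve corresponds to a different threshold on the score.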