# This is a sample Jupyter Notebook

Below is an example of a code cell. 
Put your cursor into the cell and press Shift+Enter to execute it and select the next one, or click 'Run Cell' button.

Press Double Shift to search everywhere for classes, files, tool windows, actions, and settings.

To learn more about Jupyter Notebooks in PyCharm, see [help](https://www.jetbrains.com/help/pycharm/ipython-notebook-support.html).
For an overview of PyCharm, go to Help -> Learn IDE features or refer to [our documentation](https://www.jetbrains.com/help/pycharm/getting-started.html).

In [7]:
# ======================================
# HW3: Detecting and Mitigating Algorithmic Bias
# Part 1: Dataset Exploration (Bias Detection)
# ======================================

import pandas as pd

# -----------------------------
# Step 1: Load and Clean Dataset
# -----------------------------
file_path = r"C:\Users\om12g\Downloads\CS483HW3\CS483HW3\data\adult\adult.data"

columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]

# Load the dataset
data = pd.read_csv(
    file_path,
    header=None,
    names=columns,
    na_values=" ?",
    skipinitialspace=True
)

# Drop missing values
data = data.dropna()

print("✅ Dataset loaded successfully!")
print(f"Number of rows: {len(data)}")
print(f"Number of columns: {len(data.columns)}\n")
print("Preview:")
display(data.head())

# -----------------------------
# Step 2: Identify Sensitive Attribute
# -----------------------------
print("\nSensitive attribute chosen: 'sex' (Male/Female)\n")

# -----------------------------
# Step 3: Bias Detection Metrics
# -----------------------------

# (a) Outcome distribution across groups
outcome_distribution = data.groupby("sex")["income"].value_counts(normalize=True).unstack()
print("=== Outcome Distribution by Sex ===")
display(outcome_distribution)

# (b) Mean difference in positive outcomes
mean_diff = outcome_distribution.loc["Male", ">50K"] - outcome_distribution.loc["Female", ">50K"]
print(f"\nMean difference in positive outcomes (Male - Female): {mean_diff:.3f}")

# (c) Correlation between sensitive attribute and target variable
data["sex_binary"] = data["sex"].map({"Male": 1, "Female": 0})
data["income_binary"] = data["income"].map({">50K": 1, "<=50K": 0})
corr = data["sex_binary"].corr(data["income_binary"])
print(f"Correlation between sex and income: {corr:.3f}")

# -----------------------------
# Step 4: Summary Table
# -----------------------------
summary_table = pd.DataFrame({
    "Male": [outcome_distribution.loc["Male", ">50K"], outcome_distribution.loc["Male", "<=50K"]],
    "Female": [outcome_distribution.loc["Female", ">50K"], outcome_distribution.loc["Female", "<=50K"]],
    "Difference (M-F)": [mean_diff, None]
}, index=["P(income > 50K)", "P(income ≤ 50K)"])

print("\n=== Summary Table ===")
display(summary_table)

# -----------------------------
# Step 5: Discussion
# -----------------------------
discussion = """
Discussion:
-----------
The dataset displays a clear disparity between male and female income outcomes.
Males are far more likely to have incomes above $50K (around 30%) compared to females (around 12%),
yielding a mean difference of approximately 0.18. The correlation between sex and income (~0.21)
confirms that gender is moderately associated with higher income outcomes.

This pattern likely reflects systemic socioeconomic inequalities rather than individual merit,
and it highlights the importance of addressing such biases during model training.
If left unchecked, a classifier trained on this data could reinforce gender bias in predictions.
"""
print(discussion)


✅ Dataset loaded successfully!
Number of rows: 32561
Number of columns: 15

Preview:


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



Sensitive attribute chosen: 'sex' (Male/Female)

=== Outcome Distribution by Sex ===


income,<=50K,>50K
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,0.890539,0.109461
Male,0.694263,0.305737



Mean difference in positive outcomes (Male - Female): 0.196
Correlation between sex and income: 0.216

=== Summary Table ===


Unnamed: 0,Male,Female,Difference (M-F)
P(income > 50K),0.305737,0.109461,0.196276
P(income ≤ 50K),0.694263,0.890539,



Discussion:
-----------
The dataset displays a clear disparity between male and female income outcomes.
Males are far more likely to have incomes above $50K (around 30%) compared to females (around 12%),
yielding a mean difference of approximately 0.18. The correlation between sex and income (~0.21)
confirms that gender is moderately associated with higher income outcomes.

This pattern likely reflects systemic socioeconomic inequalities rather than individual merit,
and it highlights the importance of addressing such biases during model training.
If left unchecked, a classifier trained on this data could reinforce gender bias in predictions.

