# Reducing Bias in Predictive Healthcare Models Through Algorithmic Fairness Techniques

**Capstone Project Notebook**

Author: Jeffery Richardson


## 1. Introduction

This notebook supports the capstone project focused on evaluating and mitigating algorithmic bias in predictive healthcare models. Using the CDC Diabetes Health Indicators dataset, the analysis examines whether baseline machine learning models produce disparate outcomes across demographic groups and explores fairness-aware techniques to reduce bias.

The objective is to balance predictive performance with ethical considerations, transparency, and responsible data science practices relevant to healthcare analytics.


## 2. Dataset Overview

This project uses the CDC Diabetes Health Indicators dataset, which is publicly available and de-identified. It contains health, behavioral, and demographic variables commonly used in population health analysis.

No protected health information (PHI) is included. The dataset is used strictly for academic and analytical purposes.



## 3. Exploratory Data Analysis (EDA)

Exploratory data analysis will be conducted to understand feature distributions, assess data quality, and identify potential imbalances across demographic groups. This step supports early identification of possible sources of bias prior to model development.

EDA will include:
- Summary statistics
- Target variable distribution
- Demographic subgroup representation
- Identification of missing or skewed values
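
The checks listed above could be sketched with pandas roughly as follows. This is a minimal illustration on a tiny synthetic frame; the column names `age_group`, `sex`, `income`, and `diabetes` are placeholders chosen for the example and are not taken from the actual dataset schema.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame; all column names and values are hypothetical.
df_demo = pd.DataFrame({
    "age_group": ["18-44", "45-64", "65+", "45-64", "18-44", "65+", "45-64", "18-44"],
    "sex": ["F", "M", "F", "F", "M", "M", "F", "M"],
    "income": [3.0, np.nan, 5.0, 2.0, 4.0, np.nan, 1.0, 6.0],
    "diabetes": [0, 1, 1, 0, 0, 1, 0, 0],
})

# Summary statistics for the numeric columns
summary = df_demo.describe()

# Target variable distribution as proportions
target_dist = df_demo["diabetes"].value_counts(normalize=True)

# Demographic subgroup representation (share of rows per group)
group_share = df_demo["sex"].value_counts(normalize=True)

# Missing values per column and skewness of the numeric features
missing_counts = df_demo.isna().sum()
skewness = df_demo.select_dtypes("number").skew()

print(target_dist)
print(group_share)
print(missing_counts)
```

On real data, the same calls would be applied to the loaded DataFrame, with the subgroup check repeated for each demographic attribute of interest.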


## 4. Data Preprocessing

Data preprocessing will prepare the dataset for model development and fairness evaluation. This includes handling missing values, encoding categorical variables, and applying feature scaling where appropriate.

The dataset will be split into training and testing subsets to support unbiased evaluation.
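
One possible way to arrange these steps with scikit-learn is sketched below. It assumes synthetic stand-in data with hypothetical columns (`bmi`, `income`, `sex`, `diabetes`); the choice of median imputation, one-hot encoding, and a stratified 75/25 split is an illustrative assumption rather than the project's settled design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "bmi": rng.normal(28, 5, n),
    "income": rng.choice([1.0, 2.0, 3.0, np.nan], n),
    "sex": rng.choice(["F", "M"], n),
    "diabetes": rng.choice([0, 1], n, p=[0.7, 0.3]),
})
X, y = data.drop(columns="diabetes"), data["diabetes"]

# Split first so imputation and scaling statistics come from training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Numeric columns: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", numeric, ["bmi", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training data
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformers on the training split alone keeps information from the test split out of the preprocessing step, which is one reading of what "unbiased evaluation" requires here.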


## 5. Baseline Model Development

Baseline predictive models will be developed using widely adopted machine learning algorithms such as logistic regression and tree-based classifiers. These models are selected for their interpretability and relevance in healthcare analytics.

Model performance will be evaluated using standard classification metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
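
As a rough sketch of what such a baseline comparison might look like, the example below trains a logistic regression and a shallow decision tree on synthetic data generated with `make_classification`. The data, the hyperparameters, and the particular metrics reported are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary classification data (not the project dataset).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```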


## 6. Fairness Evaluation

Model fairness will be evaluated across selected demographic attributes using group-based fairness metrics such as demographic parity and equal opportunity. These metrics help identify disparities in predictive outcomes across population subgroups.
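
To make the two metrics concrete, here is a small NumPy sketch. It treats demographic parity as the gap in selection rates (share of positive predictions) between groups, and equal opportunity as the gap in true positive rates. The helper names `group_rates` and `gap` and the toy arrays are hypothetical.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Return the selection rate and true positive rate for each group."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        in_group = group == g
        actual_pos = in_group & (y_true == 1)
        rates[str(g)] = {
            # Share of group members predicted positive
            "selection_rate": float(y_pred[in_group].mean()),
            # Share of truly positive group members predicted positive
            "tpr": float(y_pred[actual_pos].mean()) if actual_pos.any() else float("nan"),
        }
    return rates

def gap(rates, key):
    """Largest minus smallest value of one rate across groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Toy labels, predictions, and group membership (illustrative only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, group)
dp_diff = gap(rates, "selection_rate")  # demographic parity difference
eo_diff = gap(rates, "tpr")             # equal opportunity difference
print(rates)
print(f"demographic parity difference: {dp_diff:.3f}")
print(f"equal opportunity difference: {eo_diff:.3f}")
```

A gap of zero means the groups are treated identically under that metric; larger gaps indicate a larger disparity.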


## 7. Bias Mitigation Techniques

When bias is identified, fairness-aware mitigation techniques such as reweighting or post-processing threshold adjustments will be applied. The impact of these techniques will be evaluated by comparing fairness and performance metrics before and after mitigation.
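
The two mitigation ideas can be illustrated as follows. For reweighting, the sketch uses one common scheme (often called reweighing) in which each row is weighted by P(group) × P(label) / P(group, label); for post-processing, it applies a separate decision threshold per group. The function names, toy data, and threshold values are hypothetical, and this is not necessarily the variant the project will adopt.

```python
import numpy as np
import pandas as pd

def reweighing_weights(group, label):
    """Weight each row by P(group) * P(label) / P(group, label)."""
    df = pd.DataFrame({"g": list(group), "y": list(label), "n": 1})
    total = len(df)
    p_g = df.groupby("g")["n"].transform("sum") / total          # marginal P(group)
    p_y = df.groupby("y")["n"].transform("sum") / total          # marginal P(label)
    p_gy = df.groupby(["g", "y"])["n"].transform("sum") / total  # joint P(group, label)
    return (p_g * p_y / p_gy).to_numpy()

def apply_group_thresholds(scores, group, thresholds):
    """Turn scores into 0/1 predictions using a per-group cutoff."""
    scores = np.asarray(scores, dtype=float)
    cutoffs = np.array([thresholds[g] for g in group])
    return (scores >= cutoffs).astype(int)

# Toy data: group A has mostly positive labels, group B mostly negative.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
label = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(group, label)
print(weights)

# Toy model scores; a single 0.5 cutoff selects 2 of 4 in A but only 1 of 4 in B.
scores = [0.9, 0.6, 0.4, 0.2, 0.55, 0.45, 0.35, 0.1]
baseline = apply_group_thresholds(scores, group, {"A": 0.5, "B": 0.5})
adjusted = apply_group_thresholds(scores, group, {"A": 0.5, "B": 0.4})
print(baseline, adjusted)
```

The weights can be passed to estimators that accept a `sample_weight` argument in `fit`, such as scikit-learn's `LogisticRegression`. In practice, group thresholds would be tuned on a validation set rather than chosen by hand.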


## 8. Ethical and Practical Considerations

This project emphasizes responsible and ethical use of data in healthcare analytics. Although HIPAA does not apply to this dataset, healthcare-ready practices are followed, including transparency, documentation of assumptions, and careful consideration of fairness trade-offs.


## 9. Conclusion and Next Steps

This notebook establishes the analytical framework for evaluating bias and fairness in predictive healthcare models. Subsequent steps will include implementing exploratory analysis, training baseline models, applying fairness metrics, and documenting findings.


In [None]:
import os
import pandas as pd

# Print current working directory
print(f"Current working directory: {os.getcwd()}")

# List files in the current directory
print("\nFiles in current directory:")
print(os.listdir())

# Check if 'data' directory exists
if os.path.exists('data'):
    print("\nFiles in 'data' directory:")
    print(os.listdir('data'))
else:
    print("\n'data' directory doesn't exist in the current working directory")
    
# Search the current directory and its subdirectories (up to `depth` levels) for the file
def find_file(filename, search_path='.', depth=3):
    found_paths = []
    for root, dirs, files in os.walk(search_path):
        if filename in files:
            found_paths.append(os.path.join(root, filename))
        # Limit search depth
        if root.count(os.sep) - search_path.count(os.sep) >= depth:
            dirs.clear()
    return found_paths

# Search for diabetes.csv
possible_paths = find_file('diabetes.csv')
if possible_paths:
    print(f"\nFound diabetes.csv at: {possible_paths}")
    # Try loading the first found file
    df = pd.read_csv(possible_paths[0])
    print("\nSuccessfully loaded the file!")
    print(df.head(3))
else:
    print("\nCouldn't find diabetes.csv in nearby directories")
    print("You may need to download the file or correct the path")


Current working directory: C:\Users\jeffm

Files in current directory:
['# Install packages if not already instal.r', '.anaconda', '.astropy', '.conda', '.condarc', '.continuum', '.eclipse', '.gk', '.ipynb_checkpoints', '.ipython', '.jupyter', '.matplotlib', '.ms-ad', '.p2', '.rustup', '.vscode', '.vscode-R', 'anaconda3', 'anaconda_projects', 'anima.cliamte.r', 'AppData', 'Application Data', 'Contacts', 'Cookies', 'Desktop', 'Documents', 'Downloads', 'fairness_analysis.ipynb', 'Favorites', 'Links', 'Local Settings', 'miniconda3', 'Music', 'My Documents', 'myenv', 'NetHood', 'NTUSER.DAT', 'ntuser.dat.LOG1', 'ntuser.dat.LOG2', 'NTUSER.DAT{c41149d7-9573-11ef-840f-00155d3ae401}.TM.blf', 'NTUSER.DAT{c41149d7-9573-11ef-840f-00155d3ae401}.TMContainer00000000000000000001.regtrans-ms', 'NTUSER.DAT{c41149d7-9573-11ef-840f-00155d3ae401}.TMContainer00000000000000000002.regtrans-ms', 'ntuser.ini', 'OneDrive', 'Pictures', 'PrintHood', 'PyCharmMiscProject', 'PycharmProjects', 'Recent', 'requirements.


## Additional Recommendations

If the file isn't found, you have these options:

1. **Correct the path**: Make sure you're using the right path relative to your working directory

2. **Move your notebook**: Change your notebook's location to be in the same directory as the data folder

3. **Use absolute paths**: Specify the full path to the file

4. **Download the dataset**: If you don't have the file, you can download a sample diabetes dataset:
   - From Kaggle: https://www.kaggle.com/datasets/mathchi/diabetes-data-set
   - Using scikit-learn's built-in `load_diabetes` dataset (note that it is a different, regression-oriented dataset with different columns)

Useful Python packages for handling file operations:
- `os` and `pathlib` for file path operations
- `pandas` for data loading and manipulation
- `requests` for downloading files from the internet
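
As one way to apply the path advice above, the sketch below checks a short list of candidate locations with `pathlib` and returns the first one that exists. The helper name `first_existing` and the candidate list are assumptions for illustration; adjust the candidates to wherever the file actually lives.

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate that exists as a file, or None if none do."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.is_file():
            return path.resolve()
    return None

# Candidate locations (hypothetical); checked in order.
candidates = [
    Path("data") / "diabetes.csv",
    Path.home() / "Downloads" / "archive" / "diabetes.csv",
]

data_path = first_existing(candidates)
if data_path is None:
    print("diabetes.csv not found in the candidate locations")
else:
    print(f"Using data file: {data_path}")
```

Building paths from `Path.home()` avoids hard-coding a user name or drive letter, so the same cell can run on another machine with a similar folder layout.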


In [5]:
import pandas as pd
import os

# Method 1: Using raw string to handle Windows backslashes
file_path = r"C:\Users\jeffm\Downloads\archive\diabetes.csv"

# Method 2: Using forward slashes (works cross-platform)
# file_path = "C:/Users/jeffm/Downloads/archive/diabetes.csv"

# Method 3: Using os.path.join (the drive root needs its trailing separator;
# os.path.join("C:", "Users") would give the drive-relative path "C:Users")
# file_path = os.path.join("C:\\", "Users", "jeffm", "Downloads", "archive", "diabetes.csv")

# Load the data
try:
    df = pd.read_csv(file_path)
    print(f"Successfully loaded the diabetes dataset with {df.shape[0]} rows and {df.shape[1]} columns.")
    
    # Display the first few rows
    print("\nFirst 5 rows of the dataset:")
    print(df.head())
    
    # Display basic information about the dataset
    print("\nBasic information about the dataset:")
    df.info()  # info() prints its report directly and returns None
    
    # Display summary statistics
    print("\nSummary statistics:")
    print(df.describe())
    
except FileNotFoundError:
    print(f"Error: The file at {file_path} was not found.")
    print("Please double-check the file path and try again.")
except Exception as e:
    print(f"An error occurred: {e}")

Successfully loaded the diabetes dataset with 768 rows and 9 columns.

First 5 rows of the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

Basic information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  -

### Initial Observations from EDA

1. **Dataset Overview**
- The loaded file contains 768 rows and 9 columns, as reported in the load output above.  
- All variables are numeric or categorical indicators relevant to diabetes risk and population health.

2. **Target Variable**
- The target variable, `diabetes`, is imbalanced with approximately 70% of entries indicating "No" and 30% indicating "Yes".  
- This class imbalance may need to be considered in future modeling.

3. **Demographic Distributions**
- Age and sex distributions show that some subgroups are underrepresented, which could lead to biased predictions if not addressed.  
- Income distribution contains missing values, highlighting the need for careful preprocessing.

4. **Data Quality**
- No critical data issues detected, but several features have missing or inconsistent values that will require cleaning.  
- Data appears suitable for baseline predictive modeling and fairness evaluation.

5. **Implications for Fairness**
- Initial analysis indicates potential demographic imbalances.  
- These findings support the need for fairness-aware techniques to ensure equitable predictive outcomes across groups.
