### **Problem Statement: Do states with higher chronic disease burden and higher hospital readmissions also have higher inpatient treatment costs?**

- This is a Population Health Financial Impact Analysis that uses three major U.S. healthcare datasets (CMS, HRRP, and CDC) to explore how population health outcomes relate to financial and hospital performance metrics.
- https://data.cms.gov/ (CMS Datasets)
- https://data.cms.gov/ (HRRP Dataset)
- https://chronicdata.cdc.gov/ (Chronic Dataset)
- It studies the relationship between population health, readmissions, and costs.
- The goal is a state-level comparison (e.g., “Which states are expensive and unhealthy?”).

### SUMMARY
In this notebook, I cleaned the three datasets and made them ready for merging.
**Only the most recent year of the CDI data is kept, because the CMS data is from 2023 and the HRRP data covers 2020 to 2023.**

### 1. **FINANCIAL DATA: cms_state**
Here, I kept only the columns that are useful:
- Rndrng_Prvdr_State_Abrvtn (Provider State)
- DRG_Desc (DRG Definition)
- Tot_Dschrgs (Total Discharges)
- Avg_Submtd_Cvrd_Chrg (Average Covered Charges)
- Avg_Tot_Pymt_Amt (Average Total Payments)
- Avg_Mdcr_Pymt_Amt (Average payment from Medicare)
  
We filtered the CMS dataset to keep only chronic disease-related DRGs (Heart Failure, COPD, Diabetes), because these align with HRRP readmission categories.
Then we aggregated by state using **weighted averages based on the number of discharges.**
This ensures each state’s cost measures reflect its hospital activity volume: a DRG row with many discharges counts proportionally more than a row with few, which gives fairer comparisons across states.
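
As an illustration of the discharge-weighted aggregation described above, here is a minimal sketch on toy data. The helper name `discharge_weighted_mean` and the numbers are made up for the example; only the column names come from the CMS list above, and the notebook's actual aggregation code may differ.

```python
import pandas as pd

# Toy DRG-level rows; the values are invented purely for illustration.
cms_toy = pd.DataFrame({
    "Rndrng_Prvdr_State_Abrvtn": ["TX", "TX", "VT"],
    "Tot_Dschrgs": [100, 300, 50],
    "Avg_Tot_Pymt_Amt": [10000.0, 14000.0, 12000.0],
})

def discharge_weighted_mean(df, value_col,
                            weight_col="Tot_Dschrgs",
                            group_col="Rndrng_Prvdr_State_Abrvtn"):
    """Hypothetical helper: sum(value * weight) / sum(weight) per group."""
    weighted_sum = (df[value_col] * df[weight_col]).groupby(df[group_col]).sum()
    total_weight = df[weight_col].groupby(df[group_col]).sum()
    return weighted_sum / total_weight

state_payment = discharge_weighted_mean(cms_toy, "Avg_Tot_Pymt_Amt")
print(state_payment)
```

For TX the weighted result is (100 × 10,000 + 300 × 14,000) / 400 = 13,000, whereas a plain mean of the two rows would give 12,000. The difference is the effect of letting the higher-volume row count more.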

### 2. **CHRONIC DISEASE INDICATORS: cdi_pivot**
Preparing CDC Chronic Disease Indicators (CDI) Data

In this step, we filtered and cleaned the CDC Chronic Disease Indicators dataset to focus on three key chronic health conditions (**Diabetes, Cardiovascular Disease, and COPD**), starting from all available years.

We then:
- Selected only relevant columns (**State, Topic, DataValue, YearStart**)
- Converted all values to numeric by removing commas and percentage symbols
- Renamed columns for consistency (DataValue → Value, YearStart → Year)
- Kept only the latest year

**This cleaned dataset (cdi_filtered) gives a latest-year snapshot of chronic disease prevalence in each state, forming the foundation for comparing health outcomes with hospital readmissions and financial costs.**
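
The cleaning steps listed above could be sketched roughly as follows. This is an illustration on invented rows, not the notebook's actual cell: the `cdi_raw` frame, the exact topic labels, and the choice of a single global latest year (rather than the latest year per state) are all assumptions.

```python
import pandas as pd

# Toy rows standing in for the raw CDI file; all values are invented.
cdi_raw = pd.DataFrame({
    "State": ["IA", "IA", "IA", "WY"],
    "Topic": ["Diabetes", "Diabetes", "Asthma", "Diabetes"],
    "DataValue": ["9.8%", "10.1%", "8.0%", "1,234.5"],
    "YearStart": [2020, 2021, 2021, 2021],
    "Extra": ["x", "y", "z", "w"],
})

# Assumed topic labels; check them against the Topic values in the real file.
topics = ["Diabetes", "Cardiovascular Disease", "COPD"]

# Keep only the relevant topics and columns.
cdi_filtered = cdi_raw.loc[
    cdi_raw["Topic"].isin(topics),
    ["State", "Topic", "DataValue", "YearStart"],
].copy()

# Strip commas and percent signs, then convert; unparseable entries become NaN.
cdi_filtered["DataValue"] = pd.to_numeric(
    cdi_filtered["DataValue"].astype(str).str.replace(r"[,%]", "", regex=True),
    errors="coerce",
)

# Rename for consistency with the other datasets.
cdi_filtered = cdi_filtered.rename(columns={"DataValue": "Value", "YearStart": "Year"})

# Keep only the most recent year present in the data (assumed: one global year).
latest_year = cdi_filtered["Year"].max()
cdi_filtered = cdi_filtered[cdi_filtered["Year"] == latest_year].reset_index(drop=True)
print(cdi_filtered)
```

Using `errors="coerce"` turns anything that still fails to parse into `NaN` instead of raising, so such rows show up later in a missing-value check rather than stopping the run.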

  

### 3. **READMISSION DATA: hrrp_state_summary**
Here, I kept only the columns that are useful.
#### Columns to keep
- Facility Name
- State
- Measure Name
- Predicted Readmission Rate
- Expected Readmission Rate
- Excess Readmission Ratio
- Number of Discharges
#### Keeping the chronic diseases which are more common
- READM-30-HF (Heart Failure)
- READM-30-COPD (Chronic Obstructive Pulmonary Disease)
- READM-30-DIABETES (Diabetes)
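
A minimal sketch of the column selection and measure filter described above, run on invented rows. The toy frame `hrrp_raw`, the `"N/A"` placeholder strings, and the unweighted per-state mean at the end are assumptions made for illustration; the notebook's real aggregation into `hrrp_state_summary` may be different (for example, weighted by discharges).

```python
import pandas as pd

# Toy facility-level rows; all values are invented.
hrrp_raw = pd.DataFrame({
    "Facility Name": ["A", "B", "C", "D"],
    "State": ["IA", "IA", "WY", "WY"],
    "Measure Name": ["READM-30-HF", "READM-30-COPD", "READM-30-HF", "READM-30-AMI"],
    "Predicted Readmission Rate": ["20.0", "18.0", "N/A", "15.0"],
    "Expected Readmission Rate": ["19.0", "19.0", "21.0", "15.5"],
    "Excess Readmission Ratio": ["1.05", "0.95", "N/A", "0.97"],
    "Number of Discharges": ["200", "100", "N/A", "50"],
    "Footnote": [None, None, None, None],
})

keep_cols = [
    "Facility Name", "State", "Measure Name",
    "Predicted Readmission Rate", "Expected Readmission Rate",
    "Excess Readmission Ratio", "Number of Discharges",
]
keep_measures = ["READM-30-HF", "READM-30-COPD", "READM-30-DIABETES"]

# Keep only the listed measures and columns.
hrrp = hrrp_raw.loc[hrrp_raw["Measure Name"].isin(keep_measures), keep_cols].copy()

# Convert the numeric fields; non-numeric placeholders become NaN.
numeric_cols = keep_cols[3:]
hrrp[numeric_cols] = hrrp[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Assumed aggregation: a simple (unweighted) mean per state.
hrrp_state_sketch = hrrp.groupby("State", as_index=False).agg(
    PredictedRate=("Predicted Readmission Rate", "mean"),
    ExpectedRate=("Expected Readmission Rate", "mean"),
    Excess_Readmission_Ratio=("Excess Readmission Ratio", "mean"),
)
print(hrrp_state_sketch)
```

In this toy data WY ends up with a missing `PredictedRate` because its only retained row has a placeholder value, which is the kind of gap the missing-value check later in the notebook is meant to surface.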



In [6]:
# See which states are missing from each dataset
print("States in CDI:", sorted(cdi_pivot['State'].unique()))
print("States in HRRP:", sorted(hrrp_state_summary['State'].unique()))
print("States in CMS:", sorted(cms_state['State'].unique()))

NameError: name 'cdi_pivot' is not defined

In [None]:
all_cdi_states = set(cdi_pivot['State'].unique())
all_hrrp_states = set(hrrp_state_summary['State'].unique())
territories_excluded = all_cdi_states - all_hrrp_states
print(f"Territories/Aggregates Excluded: {territories_excluded}")

In [None]:
# Merge CDI (cdi_pivot), HRRP summary, and CMS summary by State
merged = (
    cdi_pivot
    .merge(hrrp_state_summary, on="State", how="inner")
    .merge(cms_state, on="State", how="inner")
)

# Preview final merged dataset
merged.head()


In [None]:
print(f"\n✓ Merge Complete: {len(merged)} states with complete data across all 3 sources")
print(f"States included: {sorted(merged['State'].unique())}")
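
Because the merges above use `how="inner"`, any state missing from one source silently disappears from `merged`. One way to see which rows drop out, and from which side, is an outer merge with pandas' `indicator=True` option. The sketch below uses two small invented frames (`cdi_toy`, `hrrp_toy`) rather than the real tables.

```python
import pandas as pd

# Toy state-level tables; values are invented.
cdi_toy = pd.DataFrame({"State": ["IA", "WY", "PR"], "Diabetes_Rate": [10.1, 9.0, 13.5]})
hrrp_toy = pd.DataFrame({"State": ["IA", "WY"], "PredictedRate": [19.0, 18.2]})

# An outer merge keeps every key; the _merge column records where each key was found.
check = cdi_toy.merge(hrrp_toy, on="State", how="outer", indicator=True)

# Rows that an inner merge would have dropped.
dropped = check.loc[check["_merge"] != "both", ["State", "_merge"]]
print(dropped)
```

Here `PR` is reported as `left_only`, meaning it exists in the CDI-like table but not in the HRRP-like one.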

In [15]:
# Optional: Save for analysis/visualization
merged.to_csv("data/final_population_health_merged.csv", index=False)



In [None]:
# ============================================================================
# PHASE 1: DATA QUALITY VALIDATION
# ============================================================================

print("\n" + "=" * 80)
print("PHASE 1: DATA QUALITY VALIDATION")
print("=" * 80)

# 1. Check for missing values
print("\n1. MISSING VALUES CHECK:")
missing_check = merged.isnull().sum()
if missing_check.sum() == 0:
    print("   ✓ NO MISSING VALUES - All states have complete data across all metrics")
else:
    print("   ⚠ Missing values detected:")
    print(missing_check[missing_check > 0])

# 2. Validate readmission rates (should be 0-100%)
print("\n2. READMISSION RATE VALIDATION (should be 0-100%):")
pred_min = merged['PredictedRate'].min()
pred_max = merged['PredictedRate'].max()
exp_min = merged['ExpectedRate'].min()
exp_max = merged['ExpectedRate'].max()
# Actually test the ranges instead of printing a checkmark unconditionally
pred_ok = 0 <= pred_min and pred_max <= 100
exp_ok = 0 <= exp_min and exp_max <= 100
print(f"   Predicted Rate Range: {pred_min:.2f}% - {pred_max:.2f}% {'✓' if pred_ok else '⚠'}")
print(f"   Expected Rate Range: {exp_min:.2f}% - {exp_max:.2f}% {'✓' if exp_ok else '⚠'}")

# 3. Validate costs (should all be positive)
print("\n3. COST DATA VALIDATION (should all be positive):")
cost_min = merged['Weighted_Avg_Total_Payment'].min()
cost_max = merged['Weighted_Avg_Total_Payment'].max()
mdcr_min = merged['Weighted_Avg_Medicare_Payment'].min()
mdcr_max = merged['Weighted_Avg_Medicare_Payment'].max()
print(f"   Payment Range: ${cost_min:,.0f} - ${cost_max:,.0f} {'✓' if cost_min > 0 else '⚠'}")
print(f"   Medicare Payment Range: ${mdcr_min:,.0f} - ${mdcr_max:,.0f} {'✓' if mdcr_min > 0 else '⚠'}")

# 4. Check disease rates (printed for manual review; no automatic pass/fail)
print("\n4. DISEASE RATE VALIDATION:")
print(f"   Heart Disease: {merged['HeartDisease_Rate'].min():.2f} - {merged['HeartDisease_Rate'].max():.2f}")
print(f"   COPD: {merged['COPD_Rate'].min():.2f} - {merged['COPD_Rate'].max():.2f}")
print(f"   Diabetes: {merged['Diabetes_Rate'].min():.2f} - {merged['Diabetes_Rate'].max():.2f}")

# Report success only if every automated check above actually passed
all_ok = missing_check.sum() == 0 and pred_ok and exp_ok and cost_min > 0 and mdcr_min > 0
if all_ok:
    print("\n✓ All validation checks passed!\n")
else:
    print("\n⚠ Some validation checks failed - review the flagged items above\n")

In [None]:
# ============================================================================
# OUTLIER DETECTION (IQR Method)
# ============================================================================

print("=" * 80)
print("OUTLIER DETECTION (IQR Method)")
print("=" * 80)

def find_outliers(data, column):
    """Find outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Check key metrics for outliers
metrics_to_check = [
    'HeartDisease_Rate',
    'COPD_Rate', 
    'Diabetes_Rate',
    'PredictedRate',
    'Excess_Readmission_Ratio',
    'Weighted_Avg_Total_Payment'
]

print("\nStates with Outlier Values:")
total_outliers = 0

for metric in metrics_to_check:
    outliers, lower, upper = find_outliers(merged, metric)
    if len(outliers) > 0:
        print(f"\n{metric}:")
        print(f"  Normal Range: [{lower:.2f}, {upper:.2f}]")
        for idx, row in outliers.iterrows():
            print(f"    → {row['State']}: {row[metric]:.2f} ⚠")
        total_outliers += len(outliers)

print(f"\nTotal Outlier Instances Found: {total_outliers}")
print("Note: Outliers are flagged for review but may be VALID (e.g., Alaska's high disease rate is real)")

In [None]:
print("First few rows of your data:")
print(merged[['State', 'HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']].head(10))

print("\n\nData types:")
print(merged[['HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']].dtypes)

print("\n\nValue ranges:")
print(f"HeartDisease_Rate: min={merged['HeartDisease_Rate'].min()}, max={merged['HeartDisease_Rate'].max()}")
print(f"COPD_Rate: min={merged['COPD_Rate'].min()}, max={merged['COPD_Rate'].max()}")
print(f"Diabetes_Rate: min={merged['Diabetes_Rate'].min()}, max={merged['Diabetes_Rate'].max()}")

In [None]:
# Find which states have abnormal HeartDisease_Rate values
print("States with HeartDisease_Rate > 200 (suspicious):")
suspicious = merged[merged['HeartDisease_Rate'] > 200][['State', 'HeartDisease_Rate']]
print(suspicious)

print("\n\nAll unique HeartDisease_Rate values sorted:")
print(sorted(merged['HeartDisease_Rate'].unique()))

In [None]:
# Compare the CDI pivot with the merged data for the flagged states
flagged_states = ['IA', 'IL', 'MN', 'PA', 'WY']
rate_cols = ['State', 'HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']

print("CDI pivot (what we're using) for the flagged states:")
print(cdi_pivot[cdi_pivot['State'].isin(flagged_states)][rate_cols])

print("\n\nMerged data for these states:")
print(merged[merged['State'].isin(flagged_states)][rate_cols])