### **Problem Statement: Do states with higher chronic disease burden and higher hospital readmissions also have higher inpatient treatment costs?**

- This project is a Population Health Financial Impact Analysis that uses three major U.S. healthcare datasets (CMS, HRRP, and CDC) to explore how population health outcomes relate to financial and hospital performance metrics.
- https://data.cms.gov/ (CMS Datasets)
- https://data.cms.gov/ (HRRP Dataset)
- https://chronicdata.cdc.gov/ (Chronic Dataset)
- It studies the relationship between population health, readmissions, and costs.
- The goal is a state-level comparison (e.g., “Which states are expensive and unhealthy?”).

#### Financial Data

### CMS Data Cleaning and State-Level Aggregation

This step processes the **Medicare Inpatient Hospital dataset** to align it with our Population Health project.  
Each hospital record lists average costs per DRG (Diagnosis Related Group).  
To merge it meaningfully with HRRP (readmission) and CDC (chronic disease) datasets — both aggregated by **state** —  
we need to summarize CMS data to the **state level**.

We calculate **weighted averages** for financial metrics (covered charges, total payments, Medicare payments)  
using the number of discharges as the weight.  
This ensures that procedures with higher patient volume contribute proportionally more to the state’s average cost.  
The result is a single record per state showing the overall financial impact of inpatient care.
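
Written out, the discharge-weighted average for a state $s$ can be expressed as follows, where $c_i$ is the average cost reported on row $i$ and $d_i$ is that row's `Tot_Dschrgs` (the symbols are introduced here for illustration only):

$$
\bar{c}_s = \frac{\sum_{i \in s} c_i \, d_i}{\sum_{i \in s} d_i}
$$

In the code that follows, the numerator corresponds to the summed `w_cov`, `w_tot`, or `w_mcr` column and the denominator to the summed `Total_Discharges`.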


Columns to keep:
- Rndrng_Prvdr_State_Abrvtn (Provider State)
- DRG_Desc (DRG Definition)
- Tot_Dschrgs (Total Discharges)
- Avg_Submtd_Cvrd_Chrg (Average Covered Charges)
- Avg_Tot_Pymt_Amt (Average Total Payments)
- Avg_Mdcr_Pymt_Amt (Average payment from Medicare)




In [None]:
import pandas as pd

columns_to_keep = [
    "Rndrng_Prvdr_State_Abrvtn",
    "DRG_Desc",
    "Tot_Dschrgs",
    "Avg_Submtd_Cvrd_Chrg",
    "Avg_Tot_Pymt_Amt",
    "Avg_Mdcr_Pymt_Amt"
]

cms = pd.read_csv(
    "data/inpatient.csv",
    usecols=columns_to_keep,
    encoding="windows-1252",
    low_memory=False
)

cms = cms.rename(columns={
    "Rndrng_Prvdr_State_Abrvtn": "State",
    "DRG_Desc": "DRG_Def",
    "Avg_Submtd_Cvrd_Chrg": "Avg_Covered_Charges",
    "Avg_Tot_Pymt_Amt": "Avg_Total_Payment",
    "Tot_Dschrgs": "Total_Discharges",
    "Avg_Mdcr_Pymt_Amt":"Avg_Medicare_Payment"
})
cms.isnull().sum()


In [None]:
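# Filter to DRGs for the chronic conditions of interest
# (no trailing space in the patterns, so descriptions that end with the condition name still match)
chronic_drgs = ["HEART FAILURE", "CHRONIC OBSTRUCTIVE PULMONARY DISEASE", "DIABETES"]
# .copy() avoids a SettingWithCopyWarning when the weighted columns are added below
cms = cms[cms["DRG_Def"].str.contains('|'.join(chronic_drgs), case=False, na=False)].copy()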

# Step 4 — Create weighted totals
cms["w_cov"] = cms["Avg_Covered_Charges"] * cms["Total_Discharges"]
cms["w_tot"] = cms["Avg_Total_Payment"] * cms["Total_Discharges"]
cms["w_mcr"] = cms["Avg_Medicare_Payment"] * cms["Total_Discharges"]

# Step 5 — Aggregate by state (weighted averages)
cms_state = cms.groupby("State", as_index=False).agg({
    "Total_Discharges": "sum",
    "w_cov": "sum",
    "w_tot": "sum",
    "w_mcr": "sum"
})

cms_state["Weighted_Avg_Covered_Charges"] = cms_state["w_cov"] / cms_state["Total_Discharges"]
cms_state["Weighted_Avg_Total_Payment"] = cms_state["w_tot"] / cms_state["Total_Discharges"]
cms_state["Weighted_Avg_Medicare_Payment"] = cms_state["w_mcr"] / cms_state["Total_Discharges"]

# Step 6 — Final cleaned output
cms_state = cms_state[[
    "State", "Total_Discharges",
    "Weighted_Avg_Covered_Charges",
    "Weighted_Avg_Total_Payment",
    "Weighted_Avg_Medicare_Payment"
]]
cms_state.head()


We filtered the CMS dataset to keep only chronic disease-related DRGs (Heart Failure, COPD, and Diabetes), chosen to align with the HRRP readmission measures used below.
Then, we aggregated by state using weighted averages based on the number of discharges.
This ensures each state’s cost measures reflect its hospital activity volume, giving fair comparisons across states.

## Patient Readmission Data

In [None]:
hrrp = pd.read_csv("data/readmission_hrrp.csv")
columns_to_keep = ["Facility Name", "State", "Measure Name", "Predicted Readmission Rate", "Expected Readmission Rate","Excess Readmission Ratio", "Number of Discharges"]
hrrp = hrrp[columns_to_keep]
hrrp.isnull().sum()


In [None]:
# Keep only rows where both the predicted and the expected readmission rate are available
hrrp = hrrp.dropna(subset=["Predicted Readmission Rate", "Expected Readmission Rate"])
# Drop rows with missing or zero discharges, since they cannot serve as weights
hrrp = hrrp[hrrp["Number of Discharges"].fillna(0) > 0]
hrrp.head()


#### Keeping the more common chronic diseases
- "READM-30-HF-HRRP" (Heart Failure)
- "READM-30-COPD-HRRP" (Chronic Obstructive Pulmonary Disease)
- "READM-30-DIABETES-HRRP" (Diabetes)
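
Because `isin` silently ignores names that do not occur in the data, a misspelled or non-existent measure name simply yields zero rows. It may be worth confirming which of the requested measures are actually present (in particular the diabetes measure, which may not be part of the published HRRP measure set). The sketch below uses a made-up toy frame and a hypothetical helper, `split_measures`, that is not part of the original notebook:

```python
import pandas as pd

def split_measures(df, wanted, column="Measure Name"):
    """Return (found, missing) lists for the requested measure names."""
    available = set(df[column].dropna().unique())
    found = [m for m in wanted if m in available]
    missing = [m for m in wanted if m not in available]
    return found, missing

# Toy stand-in for the HRRP file; the rows are invented for illustration.
hrrp_toy = pd.DataFrame({
    "Measure Name": ["READM-30-HF-HRRP", "READM-30-COPD-HRRP", "READM-30-AMI-HRRP"]
})

wanted = ["READM-30-HF-HRRP", "READM-30-COPD-HRRP", "READM-30-DIABETES-HRRP"]
found, missing = split_measures(hrrp_toy, wanted)
print("Found:", found)
print("Missing:", missing)
```

Running the same helper on the real `hrrp` frame before filtering would show whether any of the three names needs to be corrected or dropped.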



In [None]:
chronic_measures = [
    "READM-30-HF-HRRP",      # Heart Failure
    "READM-30-COPD-HRRP",    # Chronic Obstructive Pulmonary Disease
    "READM-30-DIABETES-HRRP",# Diabetes
]

# .copy() avoids a SettingWithCopyWarning when weighted columns are added later
hrrp = hrrp[hrrp["Measure Name"].isin(chronic_measures)].copy()
hrrp.head()

### Weighted + State Aggregation of readmission dataset

In [8]:
# Step 1 — create weighted columns
hrrp["w_pred"] = hrrp["Predicted Readmission Rate"] * hrrp["Number of Discharges"]
hrrp["w_exp"]  = hrrp["Expected Readmission Rate"]  * hrrp["Number of Discharges"]

# Step 2 — get weighted averages by State and Measure Name
hrrp_state = (
    hrrp.groupby(["State", "Measure Name"], as_index=False)
    .agg({
        "w_pred": "sum",
        "w_exp": "sum",
        "Number of Discharges": "sum"
    })
)


In [None]:
# Step 3 — calculate weighted averages and ratio
hrrp_state["PredictedRate"] = hrrp_state["w_pred"] / hrrp_state["Number of Discharges"]
hrrp_state["ExpectedRate"]  = hrrp_state["w_exp"]  / hrrp_state["Number of Discharges"]
hrrp_state["Excess_Readmission_Ratio"] = hrrp_state["PredictedRate"] / hrrp_state["ExpectedRate"]

# Step 4 (optional): aggregate to one row per State
# Note: the rates are combined with a simple (unweighted) mean across measures
hrrp_state_summary = (
    hrrp_state.groupby("State", as_index=False)
    .agg({
        "PredictedRate": "mean",
        "ExpectedRate": "mean",
        "Excess_Readmission_Ratio": "mean",
        "Number of Discharges": "sum"
    })
)

# Step 5 — save or check results
hrrp_state_summary.to_csv("data/hrrp_state_summary.csv", index=False)
hrrp_state_summary.head()

After cleaning the HRRP data, I aggregated it by **state**, computing **discharge-weighted averages of the predicted and expected readmission rates** and deriving the excess readmission ratio from them.

#### Why Use a Weighted Average for Readmission Rates?

Each hospital reports its **Predicted** and **Expected Readmission Rates**,  
but hospitals differ greatly in size — some treat thousands of patients,  
while others handle only a few hundred.

If we took a *simple average*, a small hospital and a large hospital  
would have the **same influence** on the overall state rate.  
That would make the comparison unfair and not reflect the true population outcome.

A **weighted average** fixes this by giving more importance to hospitals  
that handle more discharges (i.e., treat more patients).


### Why It Matters
- Ensures that larger hospitals contribute proportionally to state-level metrics.  
- Produces a more accurate and fair representation of real-world outcomes.  
- Prevents small hospitals from skewing the overall average.
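
To make the difference concrete, here is a minimal sketch with two invented hospitals in one state (the numbers are illustrative only, not taken from the HRRP file):

```python
import pandas as pd

# Two hypothetical hospitals: a small one with a high rate, a large one with a low rate.
hospitals = pd.DataFrame({
    "State": ["TX", "TX"],
    "Predicted Readmission Rate": [20.0, 10.0],
    "Number of Discharges": [100, 900],
})

rate = hospitals["Predicted Readmission Rate"]
n = hospitals["Number of Discharges"]

# Simple mean treats both hospitals equally.
simple_mean = rate.mean()

# Weighted mean scales each hospital's rate by its discharges.
weighted_mean = (rate * n).sum() / n.sum()

print(simple_mean)    # 15.0
print(weighted_mean)  # 11.0
```

The simple mean (15.0) sits halfway between the two hospitals, while the weighted mean (11.0) stays close to the large hospital that treats 90% of the patients.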



### SUMMARY
Here, I cleaned the datasets and made them ready to merge.
**For the CDI, only the most recent year was kept, since the CMS data is from 2023 and the HRRP data covers 2020 to 2023.**

### 1. **FINANCIAL DATA: cms_state**
Here, I kept only the columns that are useful:
- Rndrng_Prvdr_State_Abrvtn (Provider State)
- DRG_Desc (DRG Definition)
- Tot_Dschrgs (Total Discharges)
- Avg_Submtd_Cvrd_Chrg (Average Covered Charges)
- Avg_Tot_Pymt_Amt (Average Total Payments)
- Avg_Mdcr_Pymt_Amt (Average payment from Medicare)
  
We filtered the CMS dataset to keep only chronic disease-related DRGs (Heart Failure, COPD, and Diabetes);
these align with the HRRP readmission categories.
Then, we aggregated by state using **weighted averages based on the number of discharges.**
This ensures each state’s cost measures reflect its hospital activity volume, giving fair comparisons across states. 

### 2. **CHRONIC DISEASE INDICATORS: cdi_pivot**
Preparing CDC Chronic Disease Indicators (CDI) Data

In this step, we filtered and cleaned the CDC Chronic Disease Indicators dataset to focus on three key chronic health conditions (**Diabetes, Cardiovascular Disease, and COPD**) across all available years.

We then:
- Selected only relevant columns (**State, Topic, DataValue, YearStart**)
- Converted all values to numeric by removing commas and percentage symbols
- Renamed columns for consistency (DataValue → Value, YearStart → Year)
- Kept only the latest year
  
**This cleaned dataset (cdi_filtered) gives the most recent chronic disease prevalence for each state, forming the foundation for comparing health outcomes with hospital readmissions and financial costs.**
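
The CDI cleaning code itself is not shown in this section, so the following is only a sketch of the steps listed above under stated assumptions: the toy rows are invented, `clean_cdi` is a hypothetical helper, "latest year" is taken to mean the single maximum year in the data, and the final pivot to one row per state is an assumed way of arriving at something like `cdi_pivot`.

```python
import pandas as pd

# Toy stand-in for the raw CDI extract; all values are made up for illustration.
raw_cdi = pd.DataFrame({
    "State": ["TX", "TX", "TX", "OH", "OH"],
    "Topic": ["Diabetes", "Diabetes", "COPD", "Diabetes", "COPD"],
    "DataValue": ["11.5%", "12.1%", "6.0%", "10.2%", "7.5%"],
    "YearStart": [2020, 2021, 2021, 2021, 2021],
})

def clean_cdi(df):
    """Select columns, convert values to numeric, rename, keep the latest year."""
    out = df[["State", "Topic", "DataValue", "YearStart"]].copy()
    # Strip commas and percent signs before converting to numbers.
    stripped = out["DataValue"].astype(str).str.replace(r"[,%]", "", regex=True)
    out["DataValue"] = pd.to_numeric(stripped, errors="coerce")
    out = out.rename(columns={"DataValue": "Value", "YearStart": "Year"})
    # Assumption: "latest year" is the overall maximum year in the data.
    return out[out["Year"] == out["Year"].max()]

cdi_clean = clean_cdi(raw_cdi)

# Assumed pivot: one row per state, one column per topic.
cdi_wide = (
    cdi_clean.pivot_table(index="State", columns="Topic", values="Value", aggfunc="mean")
    .reset_index()
)
print(cdi_wide)
```

Using `errors="coerce"` turns any value that still cannot be parsed into `NaN` instead of raising, which makes leftover non-numeric entries easy to spot with `isnull().sum()`.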

  

### 3. **READMISSION DATA: hrrp_state_summary**
Here, I kept only the columns that are useful:
#### Columns to keep
- "Facility Name"
- "State"
- "Measure Name"
- "Predicted Readmission Rate"
- "Expected Readmission Rate"
- "Excess Readmission Ratio"
- "Number of Discharges"
#### Keeping the more common chronic diseases
- "READM-30-HF-HRRP" (Heart Failure)
- "READM-30-COPD-HRRP" (Chronic Obstructive Pulmonary Disease)
- "READM-30-DIABETES-HRRP" (Diabetes)



In [None]:
# --- Top disease topics in CDI ---
print("Top 5 Topics in CDI:")
print(cdi["Topic"].value_counts().head(5))
print("\n")

# --- Top readmission measures in HRRP ---
print("Top 5 Conditions in HRRP:")
print(hrrp["Measure Name"].value_counts().head(5))
print("\n")




In [None]:
# See which states are missing from each dataset
print("States in CDI:", sorted(cdi_pivot['State'].unique()))
print("States in HRRP:", sorted(hrrp_state_summary['State'].unique()))
print("States in CMS:", sorted(cms_state['State'].unique()))

In [None]:
all_cdi_states = set(cdi_pivot['State'].unique())
all_hrrp_states = set(hrrp_state_summary['State'].unique())
territories_excluded = all_cdi_states - all_hrrp_states
print(f"Territories/Aggregates Excluded: {territories_excluded}")
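
The set difference above compares only CDI against HRRP. As a complementary check, an outer merge with pandas' `indicator=True` option shows, for every state, which side it came from, so anything an inner merge would drop becomes visible. This is a sketch on invented toy frames (`cdi_toy`, `hrrp_toy`), not the real data:

```python
import pandas as pd

# Toy frames; the states and values are illustrative only.
cdi_toy = pd.DataFrame({"State": ["TX", "OH", "PR"], "Diabetes_Rate": [12.1, 10.2, 15.0]})
hrrp_toy = pd.DataFrame({"State": ["TX", "OH", "AK"], "PredictedRate": [17.2, 18.1, 16.5]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = cdi_toy.merge(hrrp_toy, on="State", how="outer", indicator=True)

# States that an inner merge would silently drop.
dropped = audit.loc[audit["_merge"] != "both", ["State", "_merge"]]
print(dropped)
```

The same pattern can be repeated for the CMS summary to see which source is responsible for each excluded state.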

In [None]:
# Merge CDI (cdi_pivot), HRRP summary, and CMS summary by State
merged = (
    cdi_pivot
    .merge(hrrp_state_summary, on="State", how="inner")
    .merge(cms_state, on="State", how="inner")
)

# Preview final merged dataset
merged.head()


In [None]:
print(f"\n✓ Merge Complete: {len(merged)} states with complete data across all 3 sources")
print(f"States included: {sorted(merged['State'].unique())}")

In [15]:
# Optional: Save for analysis/visualization
merged.to_csv("data/final_population_health_merged.csv", index=False)


In [None]:
# ============================================================================
# PHASE 1: DATA QUALITY VALIDATION
# ============================================================================

print("\n" + "=" * 80)
print("PHASE 1: DATA QUALITY VALIDATION")
print("=" * 80)

# 1. Check for missing values
print("\n1. MISSING VALUES CHECK:")
missing_check = merged.isnull().sum()
if missing_check.sum() == 0:
    print("   ✓ NO MISSING VALUES - All states have complete data across all metrics")
else:
    print("   ⚠ Missing values detected:")
    print(missing_check[missing_check > 0])

# 2. Validate readmission rates (should be 0-100%)
print("\n2. READMISSION RATE VALIDATION (should be 0-100%):")
pred_min = merged['PredictedRate'].min()
pred_max = merged['PredictedRate'].max()
exp_min = merged['ExpectedRate'].min()
exp_max = merged['ExpectedRate'].max()
pred_ok = 0 <= pred_min and pred_max <= 100
exp_ok = 0 <= exp_min and exp_max <= 100
print(f"   Predicted Rate Range: {pred_min:.2f}% - {pred_max:.2f}% {'✓' if pred_ok else '⚠'}")
print(f"   Expected Rate Range: {exp_min:.2f}% - {exp_max:.2f}% {'✓' if exp_ok else '⚠'}")

# 3. Validate costs (should all be positive)
print("\n3. COST DATA VALIDATION (should all be positive):")
cost_min = merged['Weighted_Avg_Total_Payment'].min()
cost_max = merged['Weighted_Avg_Total_Payment'].max()
mcr_min = merged['Weighted_Avg_Medicare_Payment'].min()
mcr_max = merged['Weighted_Avg_Medicare_Payment'].max()
cost_ok = cost_min > 0
mcr_ok = mcr_min > 0
print(f"   Payment Range: ${cost_min:,.0f} - ${cost_max:,.0f} {'✓' if cost_ok else '⚠'}")
print(f"   Medicare Payment Range: ${mcr_min:,.0f} - ${mcr_max:,.0f} {'✓' if mcr_ok else '⚠'}")

# 4. Check disease rates (ranges are printed for manual review only)
print("\n4. DISEASE RATE VALIDATION:")
print(f"   Heart Disease: {merged['HeartDisease_Rate'].min():.2f} - {merged['HeartDisease_Rate'].max():.2f}")
print(f"   COPD: {merged['COPD_Rate'].min():.2f} - {merged['COPD_Rate'].max():.2f}")
print(f"   Diabetes: {merged['Diabetes_Rate'].min():.2f} - {merged['Diabetes_Rate'].max():.2f}")

# Report success only if every automated check above actually passed
all_ok = missing_check.sum() == 0 and pred_ok and exp_ok and cost_ok and mcr_ok
if all_ok:
    print("\n✓ All validation checks passed!\n")
else:
    print("\n⚠ Some validation checks failed; review the output above\n")

In [None]:
# ============================================================================
# OUTLIER DETECTION (IQR Method)
# ============================================================================

print("=" * 80)
print("OUTLIER DETECTION (IQR Method)")
print("=" * 80)

def find_outliers(data, column):
    """Find outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Check key metrics for outliers
metrics_to_check = [
    'HeartDisease_Rate',
    'COPD_Rate', 
    'Diabetes_Rate',
    'PredictedRate',
    'Excess_Readmission_Ratio',
    'Weighted_Avg_Total_Payment'
]

print("\nStates with Outlier Values:")
total_outliers = 0

for metric in metrics_to_check:
    outliers, lower, upper = find_outliers(merged, metric)
    if len(outliers) > 0:
        print(f"\n{metric}:")
        print(f"  Normal Range: [{lower:.2f}, {upper:.2f}]")
        for idx, row in outliers.iterrows():
            print(f"    → {row['State']}: {row[metric]:.2f} ⚠")
        total_outliers += len(outliers)

print(f"\nTotal Outlier Instances Found: {total_outliers}")
print("Note: Outliers are flagged for review but may be VALID (e.g., Alaska's high disease rate is real)")

In [None]:
print("First few rows of your data:")
print(merged[['State', 'HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']].head(10))

print("\n\nData types:")
print(merged[['HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']].dtypes)

print("\n\nValue ranges:")
print(f"HeartDisease_Rate: min={merged['HeartDisease_Rate'].min()}, max={merged['HeartDisease_Rate'].max()}")
print(f"COPD_Rate: min={merged['COPD_Rate'].min()}, max={merged['COPD_Rate'].max()}")
print(f"Diabetes_Rate: min={merged['Diabetes_Rate'].min()}, max={merged['Diabetes_Rate'].max()}")

In [None]:
# Find which states have abnormal HeartDisease_Rate values
print("States with HeartDisease_Rate > 200 (suspicious):")
suspicious = merged[merged['HeartDisease_Rate'] > 200][['State', 'HeartDisease_Rate']]
print(suspicious)

print("\n\nAll unique HeartDisease_Rate values sorted:")
print(sorted(merged['HeartDisease_Rate'].unique()))

In [None]:
# Check the raw CDC data to see what's in there
print("Looking at the CDC data for these states:")
print("\nCDI pivot (what we're using):")
print(cdi_pivot[cdi_pivot['State'].isin(['IA', 'IL', 'MN', 'PA', 'WY'])][['State', 'HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']])

print("\n\nMerged data for these states:")
print(merged[merged['State'].isin(['IA', 'IL', 'MN', 'PA', 'WY'])][['State', 'HeartDisease_Rate', 'COPD_Rate', 'Diabetes_Rate']])
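
One possible explanation for a handful of states showing `HeartDisease_Rate` values above 200 is that rows measured in different units (for example percentages and rates per 100,000) were combined for the same topic. The source does not confirm this, so the sketch below is only a way to test the hypothesis. It assumes the raw CDI extract carries a unit column named `DataValueUnit`, and the toy rows are invented:

```python
import pandas as pd

# Toy stand-in for raw CDI rows; values and units are made up for illustration.
cdi_raw_toy = pd.DataFrame({
    "State": ["IA", "IA", "TX", "TX"],
    "Topic": ["Cardiovascular Disease"] * 4,
    "DataValueUnit": ["%", "cases per 100,000", "%", "%"],
    "Value": [6.1, 310.4, 5.8, 6.3],
})

# Count distinct units per state and topic; more than one suggests mixed scales.
units_per_state = (
    cdi_raw_toy.groupby(["State", "Topic"])["DataValueUnit"]
    .nunique()
    .reset_index(name="n_units")
)
mixed = units_per_state[units_per_state["n_units"] > 1]
print(mixed)
```

If the real data shows mixed units for the flagged states, restricting the CDI filter to a single unit before pivoting would be one way to make the rates comparable.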