### **Problem Statement: Do states with higher chronic disease burden and higher hospital readmissions also have higher inpatient treatment costs?**

- This is a Population Health Financial Impact Analysis that uses three major U.S. healthcare datasets (CMS, HRRP, and CDC) to explore how population health outcomes relate to financial and hospital performance metrics.
- https://data.cms.gov/ (CMS Datasets)
- https://data.cms.gov/ (HRRP Dataset)
- https://chronicdata.cdc.gov/ (Chronic Dataset)
- The analysis studies the relationship between population health, readmissions, and costs.
- The goal is a state-level comparison (e.g., “Which states are expensive and unhealthy?”).

## Financial Data

### CMS Data Cleaning and State-Level Aggregation

This step processes the **Medicare Inpatient Hospital dataset** to align it with our Population Health project.  
Each hospital record lists average costs per DRG (Diagnosis Related Group).  
To merge it meaningfully with HRRP (readmission) and CDC (chronic disease) datasets — both aggregated by **state** —  
we need to summarize CMS data to the **state level**.

We calculate **weighted averages** for financial metrics (covered charges, total payments, Medicare payments)  
using the number of discharges as the weight.  
This ensures that procedures with higher patient volume contribute proportionally more to the state’s average cost.  
The result is a single record per state showing the overall financial impact of inpatient care.
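
As a small illustration of the discharge-weighted average described above, the sketch below compares it with a simple mean on toy data. The state codes `XX` and `YY`, the dollar amounts, and the helper name `weighted_state_average` are made up for this example and are not part of the CMS file or the notebook's pipeline.

```python
import pandas as pd

# Toy rows (made-up numbers): one row per hospital-DRG record
toy = pd.DataFrame({
    "State": ["XX", "XX", "YY"],
    "Total_Discharges": [10, 90, 50],
    "Avg_Total_Payment": [20000.0, 10000.0, 12000.0],
})

def weighted_state_average(df, value_col, weight_col="Total_Discharges"):
    """Hypothetical helper: discharge-weighted mean of value_col per state."""
    weighted_sum = (df[value_col] * df[weight_col]).groupby(df["State"]).sum()
    total_weight = df.groupby("State")[weight_col].sum()
    return weighted_sum / total_weight

weighted = weighted_state_average(toy, "Avg_Total_Payment")
simple = toy.groupby("State")["Avg_Total_Payment"].mean()

print(weighted["XX"])  # 11000.0: the 90-discharge row dominates
print(simple["XX"])    # 15000.0: both rows count equally
```

In state `XX` the high-volume row pulls the weighted figure toward its own cost, which is the behavior the aggregation relies on.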


Columns to keep:
- Rndrng_Prvdr_State_Abrvtn (Provider State)
- DRG_Desc (DRG Definition)
- Tot_Dschrgs (Total Discharges)
- Avg_Submtd_Cvrd_Chrg (Average Covered Charges)
- Avg_Tot_Pymt_Amt (Average Total Payments)
- Avg_Mdcr_Pymt_Amt (Average payment from Medicare)




In [1]:
import pandas as pd

columns_to_keep = [
    "Rndrng_Prvdr_State_Abrvtn",
    "DRG_Desc",
    "Tot_Dschrgs",
    "Avg_Submtd_Cvrd_Chrg",
    "Avg_Tot_Pymt_Amt",
    "Avg_Mdcr_Pymt_Amt"
]

cms = pd.read_csv(
    "data/inpatient.csv",
    usecols=columns_to_keep,
    encoding="windows-1252",
    low_memory=False
)

cms = cms.rename(columns={
    "Rndrng_Prvdr_State_Abrvtn": "State",
    "DRG_Desc": "DRG_Def",
    "Avg_Submtd_Cvrd_Chrg": "Avg_Covered_Charges",
    "Avg_Tot_Pymt_Amt": "Avg_Total_Payment",
    "Tot_Dschrgs": "Total_Discharges",
    "Avg_Mdcr_Pymt_Amt":"Avg_Medicare_Payment"
})
cms.isnull().sum()


State                   0
DRG_Def                 0
Total_Discharges        0
Avg_Covered_Charges     0
Avg_Total_Payment       0
Avg_Medicare_Payment    0
dtype: int64

In [2]:
# Keep only DRGs whose description mentions one of the chronic conditions
chronic_drgs = ["HEART FAILURE", "CHRONIC OBSTRUCTIVE PULMONARY DISEASE ", "DIABETES"]
# .copy() avoids a possible SettingWithCopyWarning when columns are added below
cms = cms[cms["DRG_Def"].str.contains('|'.join(chronic_drgs), case=False, na=False)].copy()

# Weighted totals: average amount times discharges for each hospital-DRG row
cms["w_cov"] = cms["Avg_Covered_Charges"] * cms["Total_Discharges"]
cms["w_tot"] = cms["Avg_Total_Payment"] * cms["Total_Discharges"]
cms["w_mcr"] = cms["Avg_Medicare_Payment"] * cms["Total_Discharges"]

# Aggregate by state (these sums feed the weighted averages below)
cms_state = cms.groupby("State", as_index=False).agg({
    "Total_Discharges": "sum",
    "w_cov": "sum",
    "w_tot": "sum",
    "w_mcr": "sum"
})

cms_state["Weighted_Avg_Covered_Charges"] = cms_state["w_cov"] / cms_state["Total_Discharges"]
cms_state["Weighted_Avg_Total_Payment"] = cms_state["w_tot"] / cms_state["Total_Discharges"]
cms_state["Weighted_Avg_Medicare_Payment"] = cms_state["w_mcr"] / cms_state["Total_Discharges"]

# Keep only the final state-level columns
cms_state = cms_state[[
    "State", "Total_Discharges",
    "Weighted_Avg_Covered_Charges",
    "Weighted_Avg_Total_Payment",
    "Weighted_Avg_Medicare_Payment"
]]
cms_state.head()


Unnamed: 0,State,Total_Discharges,Weighted_Avg_Covered_Charges,Weighted_Avg_Total_Payment,Weighted_Avg_Medicare_Payment
0,AK,768,82337.066405,16435.385417,13868.738281
1,AL,6587,42388.66692,9199.165781,7245.591468
2,AR,4831,33611.826744,8638.456841,7248.582074
3,AZ,6609,56103.267665,10456.280375,8696.659706
4,CA,37528,92480.146531,13992.673524,12016.404764


We filtered the CMS dataset to keep only chronic disease-related DRGs (Heart Failure, COPD, and Diabetes), chosen to align with the HRRP readmission categories.
Then, we aggregated by state using weighted averages based on the number of discharges.
This ensures each state’s cost measures reflect its hospital activity volume, giving fair comparisons across states.
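
Because the filter is a substring match on the DRG description, it can be worth checking which descriptions each keyword actually captures. The sketch below is one way to do that. The helper name `matches_per_keyword` and the sample descriptions are illustrative assumptions, not values taken from the CMS file.

```python
import re
import pandas as pd

def matches_per_keyword(descriptions, keywords):
    """Hypothetical helper: map each keyword to the distinct descriptions it matches."""
    out = {}
    for kw in keywords:
        key = kw.strip()
        # re.escape keeps the keyword literal, since str.contains treats it as a regex
        mask = descriptions.str.contains(re.escape(key), case=False, na=False)
        out[key] = sorted(descriptions[mask].unique())
    return out

# Illustrative descriptions only (assumed, not read from the dataset)
sample = pd.Series([
    "HEART FAILURE AND SHOCK WITH MCC",
    "CHRONIC OBSTRUCTIVE PULMONARY DISEASE WITH MCC",
    "DIABETES WITH CC",
    "SEPTICEMIA OR SEVERE SEPSIS WITH MCC",
])

found = matches_per_keyword(
    sample, ["HEART FAILURE", "CHRONIC OBSTRUCTIVE PULMONARY DISEASE", "DIABETES"]
)
for keyword, hits in found.items():
    print(keyword, "->", hits)
```

Running the same helper on `cms["DRG_Def"]` before the filter would show whether a keyword also catches DRGs that were not intended.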

## Chronic Disease Data

In [3]:
cdi =  pd.read_csv("data/Chronic_Disease.csv")
cdi.head()


Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,...,TopicID,QuestionID,ResponseID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2020,2020,US,United States,BRFSS,Health Status,Recent activity limitation among adults,,Number,Age-adjusted Mean,...,HEA,HEA04,,AGEADJMEAN,SEX,SEXF,,,,
1,2015,2019,AR,Arkansas,US Cancer DVT,Cancer,"Invasive cancer (all sites combined), incidence",,Number,Number,...,CAN,CAN07,,NMBR,SEX,SEXM,,,,
2,2015,2019,CA,California,US Cancer DVT,Cancer,"Cervical cancer mortality among all females, u...",,Number,Number,...,CAN,CAN03,,NMBR,OVERALL,OVR,,,,
3,2015,2019,CO,Colorado,US Cancer DVT,Cancer,"Invasive cancer (all sites combined), incidence",,Number,Number,...,CAN,CAN07,,NMBR,RACE,HIS,,,,
4,2015,2019,GA,Georgia,US Cancer DVT,Cancer,"Prostate cancer mortality among all males, und...",,Number,Number,...,CAN,CAN05,,NMBR,RACE,WHT,,,,


In [4]:
# Select the three topics of interest; all years are kept here and narrowed to the latest year further down
selected_topics = ["Diabetes", "Cardiovascular Disease", "Chronic Obstructive Pulmonary Disease"]

cdi_filtered = cdi.loc[
    cdi["Topic"].isin(selected_topics),
    ["LocationAbbr", "Topic", "DataValue", "YearStart"]
].copy()

# Convert DataValue to numeric
cdi_filtered["DataValue"] = (
    cdi_filtered["DataValue"]
        .astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("%", "", regex=False)
)
cdi_filtered["DataValue"] = pd.to_numeric(cdi_filtered["DataValue"], errors="coerce")

# Clean column names for consistency
cdi_filtered.rename(columns={
    "LocationAbbr": "State",
    "DataValue": "Value",
    "YearStart": "Year"
}, inplace=True)

cdi_filtered = cdi_filtered.dropna(subset=["Value"])

# Keep only the latest year available per State-Topic pair (CDI does not have 2023)
cdi_latest = (
    cdi_filtered
    .sort_values(["State", "Topic", "Year"])
    .drop_duplicates(subset=["State", "Topic"], keep="last")
)
cdi_pivot = (
    cdi_latest.pivot_table(
        index="State",
        columns="Topic",
        values="Value",
        aggfunc="mean"
    )
    .reset_index()
)

cdi_pivot.columns.name = None
cdi_pivot.rename(columns={
    "Diabetes": "Diabetes_Rate",
    "Cardiovascular Disease": "HeartDisease_Rate",
    "Chronic Obstructive Pulmonary Disease": "COPD_Rate",
}, inplace=True)

cdi_pivot.head()
#cdi_filtered.head(20)


Unnamed: 0,State,HeartDisease_Rate,COPD_Rate,Diabetes_Rate
0,AK,112.9,5.0,20.6
1,AL,10.02,9.4,21.5
2,AR,72.8,13.3,5.4
3,AZ,19.38,2.6,10.5
4,CA,13.2,3.3,24.4


### Preparing CDC Chronic Disease Indicators (CDI) Data

In this step, we filtered and cleaned the CDC Chronic Disease Indicators dataset to focus on three key chronic health conditions: Diabetes, Cardiovascular Disease, and COPD.

We then:

- Selected only relevant columns (State, Topic, DataValue, YearStart)
- Converted all values to numeric by removing commas and percentage symbols
- Renamed columns for consistency (DataValue → Value, YearStart → Year)

**Kept only the most recent year available in the CDI for each state and topic, because the CMS data is from 2023 and the HRRP data covers 2020 to 2023.**

**The intermediate dataset (cdi_filtered) still holds every year, so it can be used to analyze how chronic disease prevalence has changed over time in each state. The latest-year pivot (cdi_pivot) is the table used for comparing health outcomes with hospital readmissions and financial costs.**
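
The cell above keeps a single row per State-Topic pair through `drop_duplicates`. If several indicator rows share the latest year, one possible variation (not the method used in this notebook) is to keep every row from that latest year and average them in the pivot. The sketch below shows the idea on made-up values. Whether averaging is meaningful depends on the rows sharing a unit, which is an assumption here.

```python
import pandas as pd

# Made-up rows: AL has two indicator rows in its latest year (2021)
toy = pd.DataFrame({
    "State": ["AL", "AL", "AL", "AK"],
    "Topic": ["Diabetes"] * 4,
    "Year": [2020, 2021, 2021, 2019],
    "Value": [10.0, 12.0, 14.0, 8.0],
})

# Latest year per State-Topic pair, broadcast back to every row
latest_year = toy.groupby(["State", "Topic"])["Year"].transform("max")
latest_rows = toy[toy["Year"] == latest_year]

pivot = (
    latest_rows.pivot_table(index="State", columns="Topic", values="Value", aggfunc="mean")
    .reset_index()
)
print(pivot)
```

Here `AL` ends up with the mean of its two 2021 rows, while the 2020 row is ignored.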

## Patient Readmission Data

In [5]:
hrrp = pd.read_csv("data/readmission_hrrp.csv")
columns_to_keep = ["Facility Name", "State", "Measure Name", "Predicted Readmission Rate", "Expected Readmission Rate","Excess Readmission Ratio", "Number of Discharges"]
hrrp = hrrp[columns_to_keep]
hrrp.isnull().sum()


Facility Name                     0
State                             0
Measure Name                      0
Predicted Readmission Rate     6583
Expected Readmission Rate      6583
Excess Readmission Ratio       6583
Number of Discharges          10170
dtype: int64

In [6]:
hrrp = hrrp.dropna(subset=["Predicted Readmission Rate", "Expected Readmission Rate"])
#keeping the rows where the predicted and the expected readmission rate is available.
hrrp = hrrp[hrrp["Number of Discharges"].fillna(0) > 0]
hrrp.head()


Unnamed: 0,Facility Name,State,Measure Name,Predicted Readmission Rate,Expected Readmission Rate,Excess Readmission Ratio,Number of Discharges
0,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-AMI-HRRP,13.0146,13.7235,0.9483,296.0
1,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-CABG-HRRP,9.6899,10.1898,0.9509,151.0
2,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-HF-HRRP,21.5645,20.3495,1.0597,681.0
4,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-PN-HRRP,16.1137,16.5863,0.9715,490.0
5,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-COPD-HRRP,15.4544,16.5637,0.933,130.0


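#### Keeping the chronic diseases which are more common
- READM-30-HF-HRRP (Heart Failure)
- READM-30-COPD-HRRP (Chronic Obstructive Pulmonary Disease)
- READM-30-DIABETES-HRRP (Diabetes)

Note that no rows in this HRRP file carry the diabetes measure name (the value counts printed later show only HF and COPD), so only heart failure and COPD readmissions enter the state summary.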



In [7]:
chronic_measures = [
    "READM-30-HF-HRRP",      # Heart Failure
    "READM-30-COPD-HRRP",    # Chronic Obstructive Pulmonary Disease
    "READM-30-DIABETES-HRRP",# Diabetes
]

# .copy() avoids a possible SettingWithCopyWarning when weighted columns are added later
hrrp = hrrp[hrrp["Measure Name"].isin(chronic_measures)].copy()
hrrp.head()

Unnamed: 0,Facility Name,State,Measure Name,Predicted Readmission Rate,Expected Readmission Rate,Excess Readmission Ratio,Number of Discharges
2,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-HF-HRRP,21.5645,20.3495,1.0597,681.0
5,SOUTHEAST HEALTH MEDICAL CENTER,AL,READM-30-COPD-HRRP,15.4544,16.5637,0.933,130.0
8,MARSHALL MEDICAL CENTERS,AL,READM-30-HF-HRRP,20.1511,20.2835,0.9935,176.0
11,MARSHALL MEDICAL CENTERS,AL,READM-30-COPD-HRRP,15.5737,17.909,0.8696,144.0
12,NORTH ALABAMA MEDICAL CENTER,AL,READM-30-COPD-HRRP,17.788,18.7982,0.9463,154.0


### Weighted State-Level Aggregation of the Readmission Dataset

In [8]:
# Step 1 — create weighted columns
hrrp["w_pred"] = hrrp["Predicted Readmission Rate"] * hrrp["Number of Discharges"]
hrrp["w_exp"]  = hrrp["Expected Readmission Rate"]  * hrrp["Number of Discharges"]

# Step 2: sum the weighted columns and the discharges by State and Measure Name
hrrp_state = (
    hrrp.groupby(["State", "Measure Name"], as_index=False)
    .agg({
        "w_pred": "sum",
        "w_exp": "sum",
        "Number of Discharges": "sum"
    })
)


In [9]:
# Step 3 — calculate weighted averages and ratio
hrrp_state["PredictedRate"] = hrrp_state["w_pred"] / hrrp_state["Number of Discharges"]
hrrp_state["ExpectedRate"]  = hrrp_state["w_exp"]  / hrrp_state["Number of Discharges"]
hrrp_state["Excess_Readmission_Ratio"] = hrrp_state["PredictedRate"] / hrrp_state["ExpectedRate"]

# Step 4: collapse to one row per State (simple mean across measures); the merge below uses this table
hrrp_state_summary = (
    hrrp_state.groupby("State", as_index=False)
    .agg({
        "PredictedRate": "mean",
        "ExpectedRate": "mean",
        "Excess_Readmission_Ratio": "mean",
        "Number of Discharges": "sum"
    })
)

# Step 5 — save or check results
hrrp_state_summary.to_csv("data/hrrp_state_summary.csv", index=False)
hrrp_state_summary.head()

Unnamed: 0,State,PredictedRate,ExpectedRate,Excess_Readmission_Ratio,Number of Discharges
0,AK,18.697698,18.873068,0.990856,1686.0
1,AL,18.753182,18.76816,0.999165,16154.0
2,AR,18.961932,18.743701,1.011618,13341.0
3,AZ,17.953091,18.036221,0.994925,16706.0
4,CA,19.870804,19.477511,1.020225,78653.0


After cleaning the HRRP data, I aggregated it by **state** and measure, computing the **discharge-weighted averages of the predicted and expected readmission rates** and deriving the excess readmission ratio from them. I then averaged across measures to get one row per state.

#### Why Use a Weighted Average for Readmission Rates?

Each hospital reports its **Predicted** and **Expected Readmission Rates**,  
but hospitals differ greatly in size — some treat thousands of patients,  
while others handle only a few hundred.

If we took a *simple average*, a small hospital and a large hospital  
would have the **same influence** on the overall state rate.  
That would make the comparison unfair and not reflect the true population outcome.

A **weighted average** fixes this by giving more importance to hospitals  
that handle more discharges (i.e., treat more patients).


### Why It Matters
- Ensures that larger hospitals contribute proportionally to state-level metrics.  
- Produces a more accurate and fair representation of real-world outcomes.  
- Prevents small hospitals from skewing the overall average.
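
A two-hospital example makes the difference concrete. The rates and discharge counts below are invented for illustration, and `numpy.average` is used here only as a compact stand-in for the multiply-and-sum steps in the cells above.

```python
import numpy as np

# Invented figures: a small hospital with a high rate, a large one with a lower rate
rates = np.array([25.0, 15.0])     # readmission rate (%) per hospital
discharges = np.array([100, 900])  # discharges per hospital

simple_mean = rates.mean()
weighted_mean = np.average(rates, weights=discharges)

print(simple_mean)    # 20.0: each hospital counts equally
print(weighted_mean)  # 16.0: (25*100 + 15*900) / 1000
```

The weighted figure sits much closer to the large hospital's rate, since that hospital treats nine times as many patients.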



### Summary
Here I cleaned the three datasets and prepared them for merging.
**Kept only the most recent year for the CDI, because the CMS data is from 2023 and the HRRP data covers 2020 to 2023.**

### 1. **FINANCIAL DATA: cms_state**
Here I kept only the columns that are useful:
- Rndrng_Prvdr_State_Abrvtn (Provider State)
- DRG_Desc (DRG Definition)
- Tot_Dschrgs (Total Discharges)
- Avg_Submtd_Cvrd_Chrg (Average Covered Charges)
- Avg_Tot_Pymt_Amt (Average Total Payments)
- Avg_Mdcr_Pymt_Amt (Average payment from Medicare)

We filtered the CMS dataset to keep only chronic disease-related DRGs (Heart Failure, COPD, Diabetes);
these align with HRRP readmission categories.
Then, we aggregated by state using **weighted averages based on the number of discharges.**
This ensures each state’s cost measures reflect its hospital activity volume, giving fair comparisons across states.

### 2. **CHRONIC DISEASE INDICATORS: cdi_pivot**

In this step, we filtered and cleaned the CDC Chronic Disease Indicators dataset to focus on three key chronic health conditions: **Diabetes, Cardiovascular Disease, and COPD**.

We then:
- Selected only relevant columns (**State, Topic, DataValue, YearStart**)
- Converted all values to numeric by removing commas and percentage symbols
- Renamed columns for consistency (DataValue → Value, YearStart → Year)
- Kept the latest year available for each state and topic

**The intermediate dataset (cdi_filtered) still holds every year, so it can be used to analyze how chronic disease prevalence has changed over time in each state. The latest-year pivot (cdi_pivot) is the table used for comparing health outcomes with hospital readmissions and financial costs.**

  

### 3. **READMISSION DATA: hrrp_state_summary**
Here I kept only the columns that are useful:
#### Columns to keep
- Facility Name
- State
- Measure Name
- Predicted Readmission Rate
- Expected Readmission Rate
- Excess Readmission Ratio
- Number of Discharges
#### Keeping the chronic diseases which are more common
- READM-30-HF-HRRP (Heart Failure)
- READM-30-COPD-HRRP (Chronic Obstructive Pulmonary Disease)
- READM-30-DIABETES-HRRP (Diabetes; no matching rows in this HRRP file)



In [10]:
# --- Top disease topics in CDI ---
print("Top 5 Topics in CDI:")
print(cdi["Topic"].value_counts().head(5))
print("\n")

# --- Top readmission measures in HRRP ---
print("Top 5 Conditions in HRRP:")
print(hrrp["Measure Name"].value_counts().head(5))
print("\n")




Top 5 Topics in CDI:
Topic
Cardiovascular Disease                             30709
Chronic Obstructive Pulmonary Disease              26951
Nutrition, Physical Activity, and Weight Status    26069
Health Status                                      25612
Alcohol                                            25321
Name: count, dtype: int64


Top 5 Conditions in HRRP:
Measure Name
READM-30-HF-HRRP      2342
READM-30-COPD-HRRP    1550
Name: count, dtype: int64




In [11]:
# Merge CDI (cdi_pivot), HRRP summary, and CMS summary by State
merged = (
    cdi_pivot
    .merge(hrrp_state_summary, on="State", how="inner")
    .merge(cms_state, on="State", how="inner")
)

# Preview final merged dataset
merged.head()



Unnamed: 0,State,HeartDisease_Rate,COPD_Rate,Diabetes_Rate,PredictedRate,ExpectedRate,Excess_Readmission_Ratio,Number of Discharges,Total_Discharges,Weighted_Avg_Covered_Charges,Weighted_Avg_Total_Payment,Weighted_Avg_Medicare_Payment
0,AK,112.9,5.0,20.6,18.697698,18.873068,0.990856,1686.0,768,82337.066405,16435.385417,13868.738281
1,AL,10.02,9.4,21.5,18.753182,18.76816,0.999165,16154.0,6587,42388.66692,9199.165781,7245.591468
2,AR,72.8,13.3,5.4,18.961932,18.743701,1.011618,13341.0,4831,33611.826744,8638.456841,7248.582074
3,AZ,19.38,2.6,10.5,17.953091,18.036221,0.994925,16706.0,6609,56103.267665,10456.280375,8696.659706
4,CA,13.2,3.3,24.4,19.870804,19.477511,1.020225,78653.0,37528,92480.146531,13992.673524,12016.404764


In [12]:
# Optional: Save for analysis/visualization
merged.to_csv("data/final_population_health_merged.csv", index=False)
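
Since both joins are inner joins, any state missing from one table silently drops out of `merged`. A quick check along the following lines can list such states before the analysis. This is a minimal sketch on toy frames; the helper name `unmatched_states` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy tables with made-up values; "PR" and "AR" each appear on one side only
left = pd.DataFrame({"State": ["AK", "AL", "PR"], "Diabetes_Rate": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"State": ["AK", "AL", "AR"], "PredictedRate": [4.0, 5.0, 6.0]})

def unmatched_states(a, b, key="State"):
    """Hypothetical helper: keys present in only one of the two tables."""
    check = (
        a[[key]].drop_duplicates()
        .merge(b[[key]].drop_duplicates(), on=key, how="outer", indicator=True)
    )
    return check.loc[check["_merge"] != "both", [key, "_merge"]].reset_index(drop=True)

dropped = unmatched_states(left, right)
print(dropped)
```

Applied to `cdi_pivot`, `hrrp_state_summary`, and `cms_state` in pairs, the same check would show which states or territories an inner join excludes.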
