## HRRP Data Cleaning - Step by Step Summary

### Problem
The raw HRRP (Hospital Readmissions Reduction Program) data contained facility-level (hospital) readmission records for six different measures. To analyze readmission rates at the state level for our two diseases, we needed to:
- Filter to only Heart Failure and COPD measures
- Aggregate from hospital level to state level
- Create weighted averages (so each hospital counts in proportion to its discharge volume, and small hospitals don't have the same influence as large ones)
- Calculate the Excess Readmission Ratio from aggregated rates
- Organize by disease for comparison

**Step 1: Load Data**
- Loaded 18,510 rows from HRRP readmission dataset
- 12 columns including facility info, readmission measures, and rates

**Step 2: Filter to HF and COPD Only**
- Kept only: READM-30-HF-HRRP and READM-30-COPD-HRRP
- Removed other measures (AMI, CABG, HIP-KNEE, PN)
- Result: 6,170 rows (3,085 Heart Failure + 3,085 COPD)

**Step 3: Create Disease Column**
- Created new "Disease" column by parsing measure names
- Mapped measures to two diseases: Heart_Failure, COPD
- This allows us to track readmission rates by disease type

**Step 4: Keep Only Required Columns**
- Kept: State, Disease, Number of Discharges (renamed to Total_Discharges), Predicted Readmission Rate, Expected Readmission Rate
- Removed all other columns

**Step 5: Remove Missing Values**
- Removed rows where Predicted Rate or Expected Rate was null/empty
- These metrics are essential for the analysis
- Result: 4,962 rows (1,208 rows removed)

**Step 6: Remove Zero Discharge Rows**
- Removed rows where Total_Discharges is 0 or missing (records with no reported patient volume); the `> 0` filter in the code excludes NaN values as well
- Result: 3,892 rows (1,070 rows removed)
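
The filter for this step is written as `Total_Discharges > 0`. A minimal sketch of how that comparison treats missing counts, using a made-up three-row frame (the values are hypothetical, not taken from the HRRP file): in pandas, a comparison against NaN evaluates to False, so rows with a missing discharge count fail the mask along with the zero rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one normal row, one zero count, one missing count
toy = pd.DataFrame({
    "State": ["AL", "AL", "AK"],
    "Total_Discharges": [681.0, 0.0, np.nan],
})

mask = toy["Total_Discharges"] > 0   # NaN > 0 evaluates to False
kept = toy[mask]

print(mask.tolist())
print(len(kept))
```

Only the first row passes the mask here, so a row count that drops at this step can reflect missing discharge counts as well as true zeros.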

**Step 7: Create Weighted Columns**
- Why weighted? Hospitals vary in size. A small hospital shouldn't have the same influence as a large one.
- Formula: Weighted_Rate = Rate × Total_Discharges
- Example: rates are stored as percentages, so a hospital with a 20% readmission rate and 500 discharges gets a weighted value of 20 × 500 = 10,000, which corresponds to about 100 readmissions

**Step 8: Aggregate by State and Disease**
- Grouped all hospital records by state and disease
- Summed total discharges and weighted rates for each state-disease combination
- Result: 102 rows (50 states plus DC = 51 jurisdictions × 2 diseases = 102 combinations)

**Step 9: Calculate Weighted Averages**
- Formula: Weighted_Avg_Rate = Total_Weighted_Rate ÷ Total_Discharges
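- This gives the discharge-weighted average readmission rate per state, so each hospital counts in proportion to its size
- Example: if a state's hospitals report 10,000 discharges in total, the result is the average rate across those 10,000 discharges rather than the average across hospitals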
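
To make the arithmetic in Steps 7 and 9 concrete, here is a small worked sketch with two hypothetical hospitals in one state (the rates and discharge counts are invented for illustration). It compares the discharge-weighted average with a simple unweighted mean.

```python
# Hypothetical toy hospitals in one state: (rate in percent, discharges)
hospitals = [(20.0, 500), (10.0, 100)]

# Step 7: weight each rate by the hospital's discharge count
weighted_sum = sum(rate * n for rate, n in hospitals)   # 20*500 + 10*100
total_discharges = sum(n for _, n in hospitals)         # 500 + 100

# Step 9: divide the summed weighted rates by total discharges
weighted_avg = weighted_sum / total_discharges

# For comparison: the unweighted mean treats both hospitals equally
simple_avg = sum(rate for rate, _ in hospitals) / len(hospitals)

print(round(weighted_avg, 2), simple_avg)
```

The weighted result sits closer to the larger hospital's rate (about 18.33 versus 15.0 for the simple mean), which is the behavior the weighting is meant to produce.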

**Step 10: Calculate Excess Readmission Ratio**
- Formula: Excess_Ratio = Predicted_Rate ÷ Expected_Rate
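- Predicted = the state's discharge-weighted predicted rate (the risk-adjusted, model-based estimate of each hospital's own readmission rate, not a raw observed rate)
- Expected = the discharge-weighted rate we'd expect from an average hospital with the same patient case mix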
- Ratio interpretation:
  - 1.0 = performing as expected (no excess)
  - >1.0 = worse than expected (excess readmissions)
  - <1.0 = better than expected (fewer readmissions)

**Step 11: Keep Only Final Columns**
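- Kept: State, Disease, Total_Discharges, Predicted_Readmission_Rate, Expected_Readmission_Rate, Excess_Readmission_Ratio
- Removed: the intermediate Weighted_Predicted and Weighted_Expected columns (Total_Discharges is kept here but is not carried into the pivoted table in Step 12)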

**Step 12: Pivot to Wide Format**
- Changed from long format (one row per state-disease) to wide format (one row per state)
- Created separate columns for each disease:
  - COPD_Predicted_Rate, Heart_Failure_Predicted_Rate
  - COPD_Expected_Rate, Heart_Failure_Expected_Rate
  - COPD_Excess_Ratio, Heart_Failure_Excess_Ratio
- Result: 51 rows (one per state, counting DC)
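
The notebook builds the wide table from three separate pivot tables that are then merged on State. As an alternative sketch only (not the method used in the cells), a single `DataFrame.pivot` call over several value columns gives the same shape. The frame `long_df`, its values, and the `short` name mapping are hypothetical stand-ins for `hrrp_state`.

```python
import pandas as pd

# Hypothetical long-format rows shaped like hrrp_state (values are made up)
long_df = pd.DataFrame({
    "State": ["AK", "AK", "AL", "AL"],
    "Disease": ["COPD", "Heart_Failure", "COPD", "Heart_Failure"],
    "Predicted_Readmission_Rate": [18.0, 19.0, 17.0, 20.0],
    "Expected_Readmission_Rate": [18.5, 19.5, 17.5, 19.0],
})

# One pivot over several value columns yields a two-level column index
wide = long_df.pivot(
    index="State",
    columns="Disease",
    values=["Predicted_Readmission_Rate", "Expected_Readmission_Rate"],
)

# Flatten (metric, disease) pairs into names like COPD_Predicted_Rate
short = {
    "Predicted_Readmission_Rate": "Predicted_Rate",
    "Expected_Readmission_Rate": "Expected_Rate",
}
wide.columns = [f"{disease}_{short[metric]}" for metric, disease in wide.columns]
wide = wide.reset_index()

print(wide.columns.tolist())
```

Unlike `pivot_table` with `aggfunc="first"`, `pivot` raises an error if a State and Disease pair appears more than once, which can act as a sanity check that the aggregation in Step 8 left exactly one row per pair.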


**Why Calculate Excess Ratio Ourselves?**

We calculated Excess_Readmission_Ratio = Predicted_Rate ÷ Expected_Rate (rather than averaging the facility-level ratios) because:
1. **Mathematical correctness** - The ratio must be calculated from properly aggregated rates, not averaged from individual facility ratios
2. **Data integrity** - Ensures the ratio accurately reflects the state's true performance vs expected
3. **Consistency** - Uses the same Predicted ÷ Expected definition that CMS applies to each facility's ratio
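
A small worked sketch of the difference between the two approaches, using two hypothetical hospitals (all numbers are invented for illustration): the ratio computed from discharge-weighted rates is compared with a plain average of the facility-level ratios.

```python
# Hypothetical toy hospitals: (predicted %, expected %, discharges)
hospitals = [
    (24.0, 20.0, 900),   # large hospital, facility ratio 1.20
    (18.0, 20.0, 100),   # small hospital, facility ratio 0.90
]

total = sum(n for _, _, n in hospitals)

# Discharge-weighted state rates, as in Steps 7 to 9
state_predicted = sum(p * n for p, _, n in hospitals) / total
state_expected = sum(e * n for _, e, n in hospitals) / total

# Approach used here: ratio of the aggregated rates
ratio_of_aggregates = state_predicted / state_expected

# Alternative being avoided: unweighted mean of facility ratios
mean_of_ratios = sum(p / e for p, e, _ in hospitals) / len(hospitals)

print(round(ratio_of_aggregates, 4), round(mean_of_ratios, 4))
```

In this toy case the aggregated ratio is about 1.17 while the simple mean of ratios is about 1.05, because the unweighted mean gives the 100-discharge hospital the same say as the 900-discharge one.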



In [1]:
import pandas as pd
hrrp = pd.read_csv("data/readmission.csv")
print(f"Total rows: {len(hrrp)}")
hrrp.head()

Total rows: 18510


Unnamed: 0,Facility Name,Facility ID,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST HEALTH MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,296.0,,0.9483,13.0146,13.7235,36,07/01/2020,06/30/2023
1,SOUTHEAST HEALTH MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,151.0,,0.9509,9.6899,10.1898,13,07/01/2020,06/30/2023
2,SOUTHEAST HEALTH MEDICAL CENTER,10001,AL,READM-30-HF-HRRP,681.0,,1.0597,21.5645,20.3495,151,07/01/2020,06/30/2023
3,SOUTHEAST HEALTH MEDICAL CENTER,10001,AL,READM-30-HIP-KNEE-HRRP,,,0.9654,4.268,4.4211,Too Few to Report,07/01/2020,06/30/2023
4,SOUTHEAST HEALTH MEDICAL CENTER,10001,AL,READM-30-PN-HRRP,490.0,,0.9715,16.1137,16.5863,77,07/01/2020,06/30/2023


In [2]:
# STEP 2: Keep only Heart Failure and COPD measures
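# .copy() makes the filtered frame independent, so adding columns later does not trigger SettingWithCopyWarning
hrrp = hrrp[hrrp["Measure Name"].isin(["READM-30-HF-HRRP", "READM-30-COPD-HRRP"])].copy()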
print(f"Rows after filtering: {len(hrrp)}")

print(hrrp["Measure Name"].value_counts())

Rows after filtering: 6170
Measure Name
READM-30-HF-HRRP      3085
READM-30-COPD-HRRP    3085
Name: count, dtype: int64


In [3]:
# STEP 3: Create disease column
def get_disease_hrrp(measure_name):
    if "HF" in measure_name:
        return "Heart_Failure"
    elif "COPD" in measure_name:
        return "COPD"
    return None

hrrp["Disease"] = hrrp["Measure Name"].apply(get_disease_hrrp)

print(hrrp["Disease"].value_counts())

Disease
Heart_Failure    3085
COPD             3085
Name: count, dtype: int64


In [4]:
# STEP 4: Keep only required columns
hrrp = hrrp[[
    "State",
    "Disease",
    "Number of Discharges",
    "Predicted Readmission Rate",
    "Expected Readmission Rate",
]].copy()

hrrp.rename(columns={
    "Number of Discharges": "Total_Discharges"
}, inplace=True)

hrrp.head()

Unnamed: 0,State,Disease,Total_Discharges,Predicted Readmission Rate,Expected Readmission Rate,Excess_Readmission_Ratio
2,AL,Heart_Failure,681.0,21.5645,20.3495,1.0597
5,AL,COPD,130.0,15.4544,16.5637,0.933
8,AL,Heart_Failure,176.0,20.1511,20.2835,0.9935
11,AL,COPD,144.0,15.5737,17.909,0.8696
12,AL,COPD,154.0,17.788,18.7982,0.9463


In [5]:
# STEP 5: Remove missing values
print("\nREMOVE MISSING VALUES")
print(f"Rows before: {len(hrrp)}")
hrrp = hrrp.dropna(subset=["Predicted Readmission Rate", "Expected Readmission Rate"])
print(f"Rows after: {len(hrrp)}")

# STEP 6: Remove zero discharge rows
print("\nREMOVE ZERO DISCHARGE ROWS")
print(f"Rows before: {len(hrrp)}")
# NaN fails the > 0 comparison, so rows with a missing discharge count are dropped here too
hrrp = hrrp[hrrp["Total_Discharges"] > 0].copy()
print(f"Rows after: {len(hrrp)}")



REMOVE MISSING VALUES
Rows before: 6170
Rows after: 4962

REMOVE ZERO DISCHARGE ROWS
Rows before: 4962
Rows after: 3892


In [6]:
# STEP 7: Create weighted columns
hrrp["Weighted_Predicted"] = hrrp["Predicted Readmission Rate"] * hrrp["Total_Discharges"]
hrrp["Weighted_Expected"] = hrrp["Expected Readmission Rate"] * hrrp["Total_Discharges"]

# STEP 8: Aggregate by State and Disease
print(f"Rows before aggregation: {len(hrrp)}")

hrrp_state = hrrp.groupby(["State", "Disease"], as_index=False).agg({
    "Total_Discharges": "sum",
    "Weighted_Predicted": "sum",
    "Weighted_Expected": "sum"
})
print(f"Rows after aggregation: {len(hrrp_state)}")

# STEP 9: Calculate weighted averages
hrrp_state["Predicted_Readmission_Rate"] = hrrp_state["Weighted_Predicted"] / hrrp_state["Total_Discharges"]
hrrp_state["Expected_Readmission_Rate"] = hrrp_state["Weighted_Expected"] / hrrp_state["Total_Discharges"]


Rows before aggregation: 3892
Rows after aggregation: 102


In [8]:
# STEP 10: Calculate Excess Readmission Ratio
# Recompute the ratio from the aggregated state-level rates; the facility-level
# ratio in the raw data would be distorted if it were averaged across hospitals
hrrp_state["Excess_Readmission_Ratio"] = hrrp_state["Predicted_Readmission_Rate"] / hrrp_state["Expected_Readmission_Rate"]

In [9]:
# STEP 11: Keep only final columns (drops the intermediate weighted sums)
hrrp_state = hrrp_state[[
    "State",
    "Disease",
    "Total_Discharges",
    "Predicted_Readmission_Rate",
    "Expected_Readmission_Rate",
    "Excess_Readmission_Ratio"
]]

In [10]:
# STEP 12: Pivot to wide format
print(f"Rows before pivot: {len(hrrp_state)}")

# Pivot for predicted rates
hrrp_pivot_predicted = hrrp_state.pivot_table(
    index="State",
    columns="Disease",
    values="Predicted_Readmission_Rate",
    aggfunc="first"
).reset_index()
hrrp_pivot_predicted.columns.name = None

# Pivot for expected rates
hrrp_pivot_expected = hrrp_state.pivot_table(
    index="State",
    columns="Disease",
    values="Expected_Readmission_Rate",
    aggfunc="first"
).reset_index()
hrrp_pivot_expected.columns.name = None

# Pivot for excess ratios
hrrp_pivot_excess = hrrp_state.pivot_table(
    index="State",
    columns="Disease",
    values="Excess_Readmission_Ratio",
    aggfunc="first"
).reset_index()
hrrp_pivot_excess.columns.name = None

# Rename columns
hrrp_pivot_predicted.rename(columns={
    "COPD": "COPD_Predicted_Rate",
    "Heart_Failure": "Heart_Failure_Predicted_Rate"
}, inplace=True)

hrrp_pivot_expected.rename(columns={
    "COPD": "COPD_Expected_Rate",
    "Heart_Failure": "Heart_Failure_Expected_Rate"
}, inplace=True)

hrrp_pivot_excess.rename(columns={
    "COPD": "COPD_Excess_Ratio",
    "Heart_Failure": "Heart_Failure_Excess_Ratio"
}, inplace=True)


Rows before pivot: 102


In [13]:

# Merge all three pivots on State
hrrp_pivot = hrrp_pivot_predicted.merge(hrrp_pivot_expected, on="State").merge(hrrp_pivot_excess, on="State")
print(f"Rows after pivot: {len(hrrp_pivot)}")
hrrp_pivot.head(10)

Rows after pivot: 51


Unnamed: 0,State,COPD_Predicted_Rate,Heart_Failure_Predicted_Rate,COPD_Expected_Rate,Heart_Failure_Expected_Rate,COPD_Excess_Ratio,Heart_Failure_Excess_Ratio
0,AK,18.730869,18.664527,18.658266,19.087871,1.003891,0.977821
1,AL,17.872373,19.633992,17.901635,19.634685,0.998365,0.999965
2,AR,18.326514,19.59735,18.129895,19.357507,1.010845,1.01239
3,AZ,16.737891,19.168292,16.955757,19.116684,0.987151,1.0027
4,CA,19.645774,20.095834,19.210772,19.74425,1.022644,1.017807
5,CO,17.651251,17.595898,18.073222,18.829706,0.976652,0.934475
6,CT,18.510919,19.823882,18.342643,19.335565,1.009174,1.025255
7,DC,22.357205,19.649238,21.272567,20.571871,1.050988,0.955151
8,DE,17.950422,18.143992,18.292941,18.914121,0.981276,0.959283
9,FL,18.842651,20.795498,18.601526,20.215305,1.012963,1.028701
