In [1]:
import pandas as pd

**This script loads and cleans the Table 2 data (Standardised death rates – diseases of the circulatory system, residents, 2022) so that it can be processed in later steps. It follows these steps:**
* Step 1: Load sheet with 3-level header
* Step 2: Drop the two empty columns
* Step 3: Flatten column names; join non-empty parts, strip spaces
* Step 4: Rename the first column to Country
* Step 5: Drop metadata rows; remove rows like "Bookmark", "Source", or codes
* Step 6: Reset index
* Step 7: Convert numeric columns
* Step 8: Rename the data columns
* Step 9: Remove footnote markers like (¹) from country names and drop the footnote rows
* Step 10: Save cleaned dataset
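
As an illustration of Step 3, the sketch below flattens a 3-level header on a toy frame. The toy labels and values are made up for the example, and `flatten_drop_unnamed` is a hypothetical variant that skips pandas' `Unnamed: ...` placeholders by prefix. It is not the function used in this notebook, and swapping it in would change the keys needed in Step 8.

```python
import pandas as pd

# Toy frame with a 3-level header; labels and values are illustrative only.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Unnamed: 0_level_1", "Unnamed: 0_level_2"),
    ("Cerebrovascular diseases", "Unnamed: 1_level_1", "Males"),
    ("of which:", "Other ischaemic\nheart diseases", "Females"),
])
toy = pd.DataFrame([["Belgium", 50.1, 30.2]], columns=columns)


def flatten_drop_unnamed(col_tuple):
    """Join header levels with "_", skipping pandas' "Unnamed: ..." placeholders."""
    parts = [str(c).replace("\n", " ").strip() for c in col_tuple]
    return "_".join(p for p in parts if p and not p.startswith("Unnamed"))


flat = [flatten_drop_unnamed(c) for c in toy.columns]
print(flat)
```

A column whose three levels are all placeholders flattens to an empty string here, which is why the first column still needs an explicit rename (Step 4).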

**Notes on Table 2:** 
- These are age-standardised death rates (SDR) per 100,000 inhabitants.
- The rates are broken down by disease type and sex.
- The column structure follows: Disease group → (subcategory, if applicable) → Sex.
- Ischaemic heart diseases are further broken down into two subcategories, whose sum equals the parent category "Ischaemic heart diseases":
  * Acute myocardial infarction (including subsequent myocardial infarction)
  * Other ischaemic heart diseases
  
Please refer to the data description in the README file for details.
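
The parent/subcategory relationship noted above can be used as a sanity check after cleaning. The sketch below assumes the renamed column names from Step 8. The toy values, the helper `subcategory_gap`, and the 0.1 rounding tolerance are all assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Ischaemic heart diseases_Males": [100.0, 80.0],
    "Acute myocardial infarction_Males": [40.0, 30.0],
    "Other ischaemic heart diseases_Males": [60.0, 45.0],
})


def subcategory_gap(df, sex):
    """Absolute difference between the parent rate and the sum of its two subcategories."""
    parent = df[f"Ischaemic heart diseases_{sex}"]
    parts = (
        df[f"Acute myocardial infarction_{sex}"]
        + df[f"Other ischaemic heart diseases_{sex}"]
    )
    return (parent - parts).abs()


gap = subcategory_gap(toy, "Males")
# Flag rows where the gap exceeds an assumed rounding tolerance of 0.1.
mismatched = toy.loc[gap > 0.1, "Country"].tolist()
print(mismatched)
```

On the real data, the same check could be run on `df_clean` for both "Males" and "Females"; rows with missing values produce NaN gaps and are not flagged by the comparison.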

In [2]:
# Step 1: load the third sheet (index 2), skipping 9 rows and reading a 3-level header
file_path = "Cardiovascular_diseases_Health2025.xlsx"
df_raw = pd.read_excel(file_path, sheet_name=2, skiprows=9, header=[0, 1, 2])


# Step 2: drop the two empty leading columns
df_raw = df_raw.drop(columns=df_raw.columns[:2])


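# Step 3: flatten the 3-level column names into single strings
def flatten_cols(col_tuple):
    # Join the non-empty header levels with "_" and replace line breaks with spaces.
    # Note: pandas labels empty header cells "Unnamed: N_level_M", so the equality
    # check against "Unnamed" does not filter them out; those placeholders remain
    # in the flattened names and are handled by the rename in Step 8.
    return "_".join([str(c).strip() for c in col_tuple if c and c != "Unnamed"]).replace("\n", " ")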

df_raw.columns = [flatten_cols(col) for col in df_raw.columns]

print("Columns after flattening:")
print(df_raw.columns.tolist())

# Step 4: rename the first column to "Country"
df_raw = df_raw.rename(columns={df_raw.columns[0]: "Country"})


# Step 5: drop metadata rows (empty labels, "Source", "Bookmark", codes containing "hlth")
df_raw = df_raw[df_raw["Country"].notna()]
df_raw = df_raw[~df_raw["Country"].astype(str).str.contains("Source|Bookmark|hlth", na=False)]


# Step 6: reset the index after filtering
df_clean = df_raw.reset_index(drop=True)


# Step 7: convert every data column to numeric; unparseable values become NaN
for col in df_clean.columns:
    if col != "Country":
        df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")


# Step 8: map the flattened names (including "Unnamed: ..." placeholders) to short names
rename_dict = {
    "Ischaemic  heart diseases_Unnamed: 3_level_1_Males": "Ischaemic heart diseases_Males",
    "Ischaemic  heart diseases_Unnamed: 4_level_1_Females": "Ischaemic heart diseases_Females",
    "of which:_Acute myocardial infarction including subsequent myocardial infarction_Males": "Acute myocardial infarction_Males",
    "of which:_Acute myocardial infarction including subsequent myocardial infarction_Females": "Acute myocardial infarction_Females",
    "of which:_Other ischaemic  heart diseases_Males": "Other ischaemic heart diseases_Males",
    "of which:_Other ischaemic  heart diseases_Females": "Other ischaemic heart diseases_Females",
    "Other heart diseases_Unnamed: 9_level_1_Males": "Other heart diseases_Males",
    "Other heart diseases_Unnamed: 10_level_1_Females": "Other heart diseases_Females",
    "Cerebrovascular diseases_Unnamed: 11_level_1_Males": "Cerebrovascular diseases_Males",
    "Cerebrovascular diseases_Unnamed: 12_level_1_Females": "Cerebrovascular diseases_Females",
    "Other diseases of the circulatory system_Unnamed: 13_level_1_Males": "Other circulatory diseases_Males",
    "Other diseases of the circulatory system_Unnamed: 14_level_1_Females": "Other circulatory diseases_Females",
}


df_clean = df_clean.rename(columns=rename_dict)


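# Step 9a: remove footnote markers such as (¹) from country names.
# The pattern is greedy: it removes everything from the first "(" to the last ")".
df_clean["Country"] = df_clean["Country"].str.replace(r"\(.*\)", "", regex=True).str.strip()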


# Step 9b: drop the footnote rows (labels containing "Definition")
df_clean = df_clean[~df_clean["Country"].str.contains("Definition", na=False)]


# Step 10: save the cleaned dataset (the data/processed/ directory must already exist)
out_path = "data/processed/cvd_rates_by_type_clean.csv"
df_clean.to_csv(out_path, index=False)

print(f"Cleaned Table 2 saved to: {out_path}")
print("Preview of cleaned Table 2:")
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
print(df_clean.head(10))

Columns after flattening:
['Unnamed: 2_level_0_Unnamed: 2_level_1_Unnamed: 2_level_2', 'Ischaemic  heart diseases_Unnamed: 3_level_1_Males', 'Ischaemic  heart diseases_Unnamed: 4_level_1_Females', 'of which:_Acute myocardial infarction including subsequent myocardial infarction_Males', 'of which:_Acute myocardial infarction including subsequent myocardial infarction_Females', 'of which:_Other ischaemic  heart diseases_Males', 'of which:_Other ischaemic  heart diseases_Females', 'Other heart diseases_Unnamed: 9_level_1_Males', 'Other heart diseases_Unnamed: 10_level_1_Females', 'Cerebrovascular diseases_Unnamed: 11_level_1_Males', 'Cerebrovascular diseases_Unnamed: 12_level_1_Females', 'Other diseases of the circulatory system_Unnamed: 13_level_1_Males', 'Other diseases of the circulatory system_Unnamed: 14_level_1_Females']
Cleaned Table 2 saved to: data/processed/cvd_rates_by_type_clean.csv
Preview of cleaned Table 2:
    Country  Ischaemic heart diseases_Males  Ischaemic heart diseas