# Exploratory Data Analysis (EDA) - Heart Disease Risk Dataset

This notebook explores the Behavioral Risk Factor Surveillance System (BRFSS) dataset used for heart disease risk prediction.

## Objectives:
1. Understand the dataset structure and content
2. Analyze missing-value patterns
3. Extract feature labels from the HTML codebook and link them to the dataset columns
4. Explore feature correlations with the target variable
5. Identify relevant feature groups for modeling


## 1. Import Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import bs4
import re

# Configure plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)


## 2. Load Dataset


In [None]:
# Load training data
df = pd.read_csv("data/train.csv")

# Basic information
target_col = "TARGET"
print(f"Dataset shape: {df.shape}")
print(f"Number of columns (including the target): {df.shape[1]}")
print(f"Number of samples: {df.shape[0]}")
print(f"\nTarget variable: {target_col}")
print("\nTarget distribution (counts):")
print(df[target_col].value_counts())
print("\nClass balance (proportions):")
print(df[target_col].value_counts(normalize=True))
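
Class proportions matter later when choosing evaluation metrics, so it can help to collect them in one small table. The sketch below is only an illustration: `summarize_target` is a hypothetical helper, and the toy series stands in for `df['TARGET']`, whose real class ratio is not assumed here.

```python
import pandas as pd


def summarize_target(target: pd.Series) -> pd.DataFrame:
    """Return per-class counts and proportions, most frequent class first."""
    counts = target.value_counts()  # sorted by frequency, descending
    return pd.DataFrame({
        "count": counts,
        "proportion": counts / counts.sum(),
    })


# Toy stand-in for df["TARGET"] (hypothetical values)
toy_target = pd.Series([0, 0, 0, 0, 1], name="TARGET")
summary = summarize_target(toy_target)

# Ratio of the most frequent class to the least frequent one
imbalance_ratio = summary["count"].iloc[0] / summary["count"].iloc[-1]
# Accuracy obtained by always predicting the most frequent class
majority_baseline = summary["proportion"].iloc[0]

print(summary)
print(f"Imbalance ratio (majority:minority): {imbalance_ratio:.1f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

If the real target turns out to be heavily imbalanced, the majority-class baseline is a useful floor: a model whose accuracy does not clearly exceed it has learned little.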


## 3. Extract Feature Labels from HTML Codebook

The dataset comes with an HTML codebook containing detailed descriptions of each feature. We parse this to understand what each column represents.
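
To make the field extraction concrete, here is a minimal sketch that applies the same regular expressions used in the parsing cell to a single flattened cell. The `sample` string is hypothetical, written to resemble one codebook entry after whitespace normalization; the real entries may contain more fields.

```python
import re

# Hypothetical cell text resembling one flattened codebook entry
sample = (
    "Label: General Health Section Name: Health Status "
    "Core Section Number: 1 Question Number: 1 "
    "SAS Variable Name: GENHLTH"
)

# Each lazy group (.*?) stops at the first occurrence of the next field tag
label = re.search(
    r"Label:\s(.*?)\s+(?:Section Name:|Core Section Name:)", sample, re.IGNORECASE
)
section = re.search(
    r"(?:Section Name:|Core Section Name:)\s(.*?)\s+"
    r"(?:Section Number:|Core Section Number:|Module Number:)",
    sample,
    re.IGNORECASE,
)
number = re.search(
    r"(?:Section Number:|Core Section Number:|Module Number:)\s*([0-9A-Za-z]+)",
    sample,
    re.IGNORECASE,
)
sas = re.search(r"SAS Variable Name:\s*([A-Za-z0-9_]+)", sample, re.IGNORECASE)

# Fall back to an empty string when a field is absent, as the notebook does
fields = {
    "Label": label.group(1).strip() if label else "",
    "Section Name": section.group(1).strip() if section else "",
    "Section Number": number.group(1).strip() if number else "",
    "SAS_Variable_Name": sas.group(1).strip() if sas else "",
}
print(fields)
```

Because every field falls back to an empty string, a cell that does not follow this layout yields a row of blanks rather than an error, so empty rows in the result are worth checking for.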


In [None]:
html_file = "data/USCODE22_LLCP_102523.HTML"

# Read and parse HTML
with open(html_file, encoding="latin-1") as f:
    html = f.read()

soup = bs4.BeautifulSoup(html, "html.parser")

# Extract question information
questions = []

for td in soup.find_all("td", class_="l m linecontent"):
    # Flatten the cell to a single line and normalize whitespace
    text = td.get_text(separator=" ", strip=True)
    text = text.replace("\xa0", " ")  # non-breaking spaces
    text = re.sub(r"\s+", " ", text)

    # Each field runs from its own tag up to the next tag in the cell
    label_match = re.search(r"Label:\s(.*?)\s+(?:Section Name:|Core Section Name:)", text, re.IGNORECASE)
    section_name_match = re.search(r"(?:Section Name:|Core Section Name:)\s(.*?)\s+(?:Section Number:|Core Section Number:|Module Number:)", text, re.IGNORECASE)
    section_number_match = re.search(r"(?:Section Number:|Core Section Number:|Module Number:)\s*([0-9A-Za-z]+)", text, re.IGNORECASE)
    sas_match = re.search(r"SAS Variable Name:\s*([A-Za-z0-9_]+)", text, re.IGNORECASE)

    # Fields that are not found are stored as empty strings
    questions.append({
        "Label": label_match.group(1).strip() if label_match else "",
        "Section Name": section_name_match.group(1).strip() if section_name_match else "",
        "Section Number": section_number_match.group(1).strip() if section_number_match else "",
        "SAS_Variable_Name": sas_match.group(1).strip() if sas_match else ""
    })

# Create DataFrame
labels_df = pd.DataFrame(questions)

# Save for later use
labels_df.to_csv("data/labels_questions.csv", index=False, encoding="utf-8")

print(f"Extracted {len(labels_df)} feature descriptions")
print(f"\nNumber of unique sections: {labels_df['Section Name'].nunique()}")
print("\nSample of extracted data:")
labels_df.head(10)
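
Once the labels are extracted, they can be linked to the dataset columns through the SAS variable name. The following is a sketch under stated assumptions: it assumes the column names in the training data match the codebook's SAS variable names exactly, and `toy_df`, `toy_labels`, and their contents are hypothetical stand-ins for `df` and `labels_df`.

```python
import pandas as pd

# Toy stand-ins for df and labels_df; names and labels are illustrative only
toy_df = pd.DataFrame({"GENHLTH": [1, 2], "_BMI5": [2500, 3100], "TARGET": [0, 1]})
toy_labels = pd.DataFrame({
    "Label": ["General Health", "Computed body mass index", "Unused question"],
    "Section Name": ["Health Status", "Calculated Variables", "Other"],
    "SAS_Variable_Name": ["GENHLTH", "_BMI5", "NOTINDATA"],
})

# Drop rows without a variable name and keep one label per variable
lookup = (
    toy_labels[toy_labels["SAS_Variable_Name"] != ""]
    .drop_duplicates(subset="SAS_Variable_Name")
    .set_index("SAS_Variable_Name")
)

# Left-join on the dataset's columns so unmatched columns stay visible
column_info = pd.DataFrame({"column": toy_df.columns}).merge(
    lookup, left_on="column", right_index=True, how="left"
)

# Columns with no codebook entry end up with a missing label
unmatched = column_info.loc[column_info["Label"].isna(), "column"].tolist()
print(column_info)
print(f"Columns without a codebook label: {unmatched}")
```

A left join is used so that columns missing from the codebook, such as the target, remain listed instead of silently disappearing. If many real columns come back unmatched, differences in case or leading underscores are worth checking first.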


## 4. Missing Values Analysis

Which columns have missing values, how often, and whether they go missing together determines whether to drop, impute, or flag them during preprocessing.
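
As an illustration of what "patterns" can mean here, the sketch below computes the per-column missing rate and then correlates the missingness indicators to see which columns tend to be missing on the same rows. `toy_df` is a hypothetical stand-in for `df`, and this is one possible approach rather than the notebook's definitive analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; columns "a" and "d" are missing on the same rows
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, 2.0, 3.0, 4.0],
    "d": [5.0, np.nan, 6.0, np.nan],
})

missing = toy_df.isna()

# Per-column missing count and percentage, worst columns first
missing_summary = pd.DataFrame({
    "n_missing": missing.sum(),
    "pct_missing": 100 * missing.mean(),
}).sort_values("pct_missing", ascending=False)

# Correlate missingness indicators; columns that are never or always
# missing have zero variance and are excluded to avoid NaN correlations
partially_missing = missing.loc[:, missing.any() & ~missing.all()]
co_missing = partially_missing.astype(int).corr()

print(missing_summary)
print(co_missing.round(2))
```

A correlation near 1 between two indicators suggests the columns are skipped together, for example because they belong to the same optional survey module, which argues for handling them as a group.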
