# Exploratory Data Analysis

This notebook explores triage patterns using only `cleaned.csv`, with KTAS expert classification as the main focus.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import so display() also works outside Jupyter

sns.set_theme(style="whitegrid")
pd.set_option("display.max_columns", None)

### Findings Notes
This cell configures libraries and display settings. There are no data findings yet.

In [None]:
DATA_PATH = "cleaned.csv"

df = pd.read_csv(DATA_PATH)
if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

print(f"Loaded: {DATA_PATH}")
print(f"Shape: {df.shape}")
df.head()

### Findings Notes
Data source is strictly `cleaned.csv`. Expected shape from this dataset is around 1.2k rows with 14 features after dropping `Unnamed: 0`.

In [None]:
quality = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "unique_values": df.nunique()
}).sort_values(["missing", "unique_values"], ascending=[False, False])

print("Duplicate rows:", int(df.duplicated().sum()))
quality

### Findings Notes
- No column in this file has heavy missingness.
- There are 5 duplicate rows, which is small but worth keeping in mind for modeling.
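
One way to inspect and drop exact duplicates before modeling is sketched below. It runs on a made-up toy frame (`toy` and its values are illustrative, not taken from `cleaned.csv`); whether duplicates should actually be removed depends on whether they are true repeats or distinct patients with identical values.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df` (values are made up).
toy = pd.DataFrame({
    "Age": [34, 34, 71, 50],
    "HR": [80, 80, 110, 95],
    "KTAS_expert": [4, 4, 2, 3],
})

# keep=False flags every member of a duplicated group, not only the later copies.
dup_mask = toy.duplicated(keep=False)
print(toy[dup_mask])

# Keep the first occurrence of each duplicated row and renumber the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```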

In [None]:
target_col = "KTAS_expert"

if target_col in df.columns:
    dist = df[target_col].value_counts().sort_index()
    pct = (dist / len(df) * 100).round(2)
    display(pd.DataFrame({"count": dist, "percent": pct}))

    fig, ax = plt.subplots(figsize=(7, 4))
    sns.barplot(x=dist.index.astype(str), y=dist.values, ax=ax, color="#3b82f6")
    ax.set_title("KTAS Expert Class Distribution")
    ax.set_xlabel("KTAS_expert")
    ax.set_ylabel("Count")
    plt.show()
else:
    print(f"Column {target_col} not found in dataset.")

### Findings Notes
- Cases are concentrated in KTAS 3 and 4 (about 38% and 36%).
- KTAS 1 is rare (about 2%), so high-acuity cases are limited in count.
- This imbalance matters for downstream modeling and evaluation.
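
As a rough illustration of how the imbalance could be handled later, the sketch below computes inverse-frequency class weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The label counts are hypothetical: only the shares for KTAS 1, 3, and 4 loosely follow the notes above, and the rest are invented to make the example sum to 100.

```python
import pandas as pd

# Hypothetical labels: 100 cases with a distribution loosely like the one above.
labels = pd.Series(
    [1] * 2 + [2] * 18 + [3] * 38 + [4] * 36 + [5] * 6,
    name="KTAS_expert",
)

counts = labels.value_counts().sort_index()
n_classes = counts.size

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weight = (len(labels) / (n_classes * counts)).round(3)
print(class_weight)
```

With weights like these, a rare class such as KTAS 1 contributes more per case to a weighted loss or metric than the dominant KTAS 3 and 4 classes.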

In [None]:
candidate_num_cols = [
    "Age", "NRS_pain", "SBP", "DBP", "HR", "RR",
    "BT", "Saturation", "KTAS_RN", "Mental_ord",
    "Sex_Male", "Injury_Yes", "Pain_Yes"
]
num_cols = [c for c in candidate_num_cols if c in df.columns]

if target_col in df.columns:
    by_target = df.groupby(target_col)[num_cols].mean().round(2)
    # A bare expression inside an if-block is not auto-rendered, so display it explicitly.
    display(by_target)
else:
    print("Target column missing; skipping grouped means.")

### Findings Notes
- Patients in higher-acuity classes (KTAS 1-2) are older on average than those in KTAS 4-5.
- RR tends to be higher and oxygen saturation lower in higher-acuity groups.
- `Mental_ord` is notably higher in KTAS 1, consistent with altered mental status in critical presentations.

In [None]:
corr_cols = [c for c in num_cols + [target_col] if c in df.columns]
corr = df[corr_cols].corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", center=0, ax=ax)
ax.set_title("Correlation Heatmap")
plt.tight_layout()
plt.show()

if target_col in corr.columns:
    # A bare expression inside an if-block is not auto-rendered, so display it explicitly.
    display(corr[target_col].sort_values(key=lambda s: s.abs(), ascending=False))

### Findings Notes
- `KTAS_RN` has the strongest linear correlation with `KTAS_expert` (high and positive).
- `Mental_ord`, `RR`, and `Age` are negatively correlated with the KTAS number. Because a lower KTAS number means higher acuity, more abnormal values go with higher acuity.
- Correlation here indicates the direction of association, not causality.
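
Because KTAS is an ordinal scale, a rank-based coefficient is a reasonable cross-check on the Pearson values. The sketch below compares the two using the `method` argument of `DataFrame.corr`; the `toy` frame and its numbers are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Made-up values: RR falls as the KTAS number rises (lower acuity).
toy = pd.DataFrame({
    "KTAS_expert": [1, 2, 2, 3, 3, 4, 4, 5],
    "KTAS_RN":     [1, 2, 3, 3, 3, 4, 5, 5],
    "RR":          [30, 24, 22, 20, 20, 18, 18, 16],
})

# Correlation of every other column with the target, under both methods.
pearson = toy.corr(method="pearson")["KTAS_expert"].drop("KTAS_expert")
spearman = toy.corr(method="spearman")["KTAS_expert"].drop("KTAS_expert")

comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(3)
print(comparison)
```

If the two columns disagree noticeably on the real data, the relationship is probably monotone but not linear, and the Spearman value is the safer summary.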

In [None]:
if {"KTAS_RN", "KTAS_expert"}.issubset(df.columns):
    agreement = (df["KTAS_RN"] == df["KTAS_expert"]).mean() * 100
    abs_diff = (df["KTAS_RN"] - df["KTAS_expert"]).abs()
    print(f"RN vs Expert exact agreement: {agreement:.2f}%")
    display(abs_diff.value_counts().sort_index().rename("count").to_frame())
else:
    print("KTAS_RN and/or KTAS_expert missing; skipping agreement check.")

### Findings Notes
- RN and expert match exactly in about 85% of cases.
- Most mismatches are only one KTAS level apart, with very few large disagreements.
- This indicates strong but imperfect consistency between bedside and expert labeling.
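
Exact agreement does not correct for matches expected by chance, and it treats a one-level miss the same as a four-level miss. A quadratic-weighted Cohen's kappa is one common summary that addresses both. The sketch below is a hand-written NumPy version run on made-up ratings; the helper name `quadratic_weighted_kappa` is hypothetical and not part of this notebook.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, levels):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    levels = list(levels)
    k = len(levels)
    idx = {lvl: i for i, lvl in enumerate(levels)}

    # Joint distribution of the two ratings (rows: rater a, columns: rater b).
    observed = np.zeros((k, k))
    for x, y in zip(a, b):
        observed[idx[x], idx[y]] += 1
    observed /= observed.sum()

    # Distribution expected if the raters were independent with the same marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Penalty grows with the squared distance between the two assigned levels.
    i, j = np.indices((k, k))
    weights = (i - j) ** 2 / (k - 1) ** 2

    # Undefined (division by zero) if both raters only ever use a single level.
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Made-up ratings with two one-level disagreements out of eight cases.
rn_toy = [1, 2, 3, 3, 4, 4, 5, 3]
expert_toy = [1, 2, 3, 4, 4, 3, 5, 3]
kappa = quadratic_weighted_kappa(rn_toy, expert_toy, levels=range(1, 6))
print(f"Quadratic-weighted kappa: {kappa:.3f}")
```

On the real data the call would presumably take `df["KTAS_RN"]` and `df["KTAS_expert"]` with the same `levels`.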

## KTAS Expert-Focused EDA

In [None]:
if target_col in df.columns:
    class_order = sorted(df[target_col].dropna().unique())
    vitals_cols = ["Age", "HR", "RR", "Saturation", "SBP", "DBP", "NRS_pain", "BT"]
    vitals_cols = [c for c in vitals_cols if c in df.columns]

    if vitals_cols:
        n_cols = 3
        n_rows = int(np.ceil(len(vitals_cols) / n_cols))
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows))
        axes = np.array(axes).reshape(-1)

        for i, col in enumerate(vitals_cols):
            sns.boxplot(data=df, x=target_col, y=col, order=class_order, ax=axes[i], color="#93c5fd", fliersize=2)
            axes[i].set_title(f"{col} by KTAS_expert")
            axes[i].set_xlabel("KTAS_expert")

        for j in range(len(vitals_cols), len(axes)):
            axes[j].axis("off")

        plt.tight_layout()
        plt.show()
    else:
        print("No vital-sign columns found for boxplots.")
else:
    print("KTAS_expert missing; skipping vital-sign distributions by class.")

### Findings Notes
- Compare medians and spread across KTAS classes to identify physiologic separation.
- RR and saturation tend to show clearer class separation than blood pressure in this dataset.
- Wide overlaps between adjacent classes explain why perfect classification is difficult.

In [None]:
binary_cols = ["Injury_Yes", "Pain_Yes", "Sex_Male"]
available_binary_cols = [c for c in binary_cols if c in df.columns]

if target_col in df.columns and available_binary_cols:
    binary_rates = (df.groupby(target_col)[available_binary_cols].mean() * 100).round(2)
    display(binary_rates)

    plot_df = binary_rates.reset_index().melt(id_vars=target_col, var_name="feature", value_name="percent")
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.lineplot(data=plot_df, x=target_col, y="percent", hue="feature", marker="o", ax=ax)
    ax.set_title("Binary Feature Rates by KTAS_expert")
    ax.set_ylabel("Percent")
    ax.set_xlabel("KTAS_expert")
    plt.tight_layout()
    plt.show()
else:
    print("KTAS_expert and/or binary columns missing; skipping binary trend analysis.")

### Findings Notes
- Pain-related presentations tend to peak in mid-to-lower acuity groups rather than KTAS 1.
- Injury prevalence varies by class and can help explain class composition shifts.
- Sex proportion differences are present but usually smaller than clinical-vital differences.

In [None]:
if {"KTAS_RN", target_col}.issubset(df.columns):
    cm_percent = pd.crosstab(df[target_col], df["KTAS_RN"], normalize="index") * 100
    display(cm_percent.round(2))

    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(cm_percent, annot=True, fmt=".1f", cmap="Blues", ax=ax)
    ax.set_title("KTAS_RN vs KTAS_expert (Row-Normalized %)")
    ax.set_xlabel("KTAS_RN")
    ax.set_ylabel("KTAS_expert")
    plt.tight_layout()
    plt.show()

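    # KTAS 1 is the most urgent level, so RN > expert means the RN rated the case
    # as less urgent than the expert did (under-triage); RN < expert is over-triage.
    diff = df["KTAS_RN"] - df[target_col]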
    disagreement_summary = pd.DataFrame(
        {
            "percent": [
                (diff == 0).mean() * 100,
                (diff > 0).mean() * 100,
                (diff < 0).mean() * 100,
            ]
        },
        index=["exact_match", "under_triage_by_RN", "over_triage_by_RN"],
    ).round(2)
    display(disagreement_summary)

    by_expert_error = (
        df.assign(_diff=diff)
        .groupby(target_col)
        .agg(
            exact_match_pct=("_diff", lambda s: (s == 0).mean() * 100),
            under_triage_pct=("_diff", lambda s: (s > 0).mean() * 100),
            over_triage_pct=("_diff", lambda s: (s < 0).mean() * 100),
        )
        .round(2)
    )
    display(by_expert_error)
else:
    print("KTAS_RN and/or KTAS_expert missing; skipping confusion and triage error analysis.")

### Findings Notes
- The confusion matrix is strongest on the diagonal, indicating good RN-expert consistency.
- Off-diagonal errors are mostly adjacent levels, not extreme jumps.
- In this dataset, under-triage by RN occurs more often than over-triage, which is important for safety review.
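
The direction labels depend on the KTAS convention that 1 is the most urgent level: an RN score above the expert score means the RN judged the case less urgent, which is under-triage. The sketch below makes that mapping explicit on a small made-up frame (`toy` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# Made-up paired ratings.
toy = pd.DataFrame({
    "KTAS_RN":     [3, 2, 4, 3, 5],
    "KTAS_expert": [3, 3, 3, 2, 5],
})

diff = toy["KTAS_RN"] - toy["KTAS_expert"]

# Positive diff: RN assigned a higher (less urgent) level than the expert.
toy["direction"] = np.select(
    [diff > 0, diff < 0],
    ["under_triage_by_RN", "over_triage_by_RN"],
    default="exact_match",
)

print(toy)
print(toy["direction"].value_counts(normalize=True).mul(100).round(1))
```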

## Key Findings (cleaned.csv)

- KTAS classes are imbalanced toward 3-4; KTAS 1 cases are limited.
- Older age, higher RR, lower saturation, and mental-status abnormality track with higher acuity.
- RN labels align closely with expert labels overall, with mostly one-level differences.
- The remaining disagreements lean toward under-triage by RN rather than over-triage, which matters for safety review.