# Lab 2 – Exploratory Data Analysis (EDA) and Cleaning
**Author:** Anam Ayyub  
**Dataset:** Derm7pt (meta.csv)  
**Goal:** Explore metadata, clean features, prepare for baselines


In [None]:

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", 100)
sns.set(style="whitegrid", context="notebook")



In [None]:

DATASET_PATH = r"C:\Users\anama\Documents\Group_8\Dataset\DERM7PT"
META_FILE = os.path.join(DATASET_PATH, "meta", "meta.csv")
CLEAN_FILE = os.path.join(DATASET_PATH, "meta", "metadata_cleaned.csv")

# Sanity checks
assert os.path.exists(DATASET_PATH), f"Dataset path not found: {DATASET_PATH}"
assert os.path.exists(META_FILE), f"Metadata file not found: {META_FILE}"

print("Paths confirmed.")
print("Dataset path:", DATASET_PATH)
print("Meta file:", META_FILE)



In [None]:

df = pd.read_csv(META_FILE)
print("Shape:", df.shape)
display(df.head())
df.info()


In [None]:
# Column overview and basic checks
cols = list(df.columns)
print("Columns:", cols)

# Expected core columns
expected_present = [
    "case_num", "diagnosis", "seven_point_score", "pigment_network", "streaks",
    "pigmentation", "regression_structures", "dots_and_globules", "blue_whitish_veil",
    "vascular_structures", "level_of_diagnostic_difficulty", "elevation", "location",
    "sex", "management", "clinic", "derm"
]
missing_expected = [c for c in expected_present if c not in df.columns]
print("Missing expected columns (if any):", missing_expected)


# Check how sparsely populated the suspected near-empty columns are
for c in ["case_id", "notes"]:
    if c in df.columns:
        non_null = df[c].notna().sum()
        print(f"{c}: non-null count = {non_null} of {len(df)}")


In [None]:
# Class balance
plt.figure(figsize=(10,4))
sns.countplot(x="diagnosis", data=df, order=df["diagnosis"].value_counts().index)
plt.title("Diagnosis (class) distribution")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

print("Diagnosis counts:")
display(df["diagnosis"].value_counts())


In [None]:
# Missing values overview
missing = df.isnull().sum().sort_values(ascending=False)
print("Missing values per column:")
display(missing)

plt.figure(figsize=(8,4))
sns.barplot(x=missing.index, y=missing.values, color="steelblue")
plt.title("Missing values per column")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


In [None]:
# Numeric feature overview
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
print("Numeric columns:", numeric_cols)

if "seven_point_score" in df.columns:
    plt.figure(figsize=(8,4))
    sns.histplot(df["seven_point_score"], bins=10, kde=True)
    plt.title("Seven Point Score distribution")
    plt.tight_layout()
    plt.show()

# Correlation of seven_point_score with numeric and target-encoded features.
# Work on a copy so the helper columns (target_binary, *_te) never reach df
# and cannot leak into the feature matrix built in later cells.
df_corr = df.copy()
categorical_cols = [
    c for c in df_corr.select_dtypes(include=["object"]).columns
    if c != "diagnosis"
]

# Binary helper target: 1 if the diagnosis label mentions melanoma
df_corr["target_binary"] = (
    df_corr["diagnosis"].str.contains("melanoma", na=False).astype(int)
)

# Target-encode each categorical column with its mean melanoma rate
for col in categorical_cols:
    means = df_corr.groupby(col)["target_binary"].mean()
    df_corr[col + "_te"] = df_corr[col].map(means)

# Numeric columns now include target_binary and the *_te columns, each exactly once
all_numeric = df_corr.select_dtypes(include="number").columns.tolist()

# Compute full correlation matrix
full_corr = df_corr[all_numeric].corr()

# Show top correlations with seven_point_score
top_corr = full_corr["seven_point_score"].drop("seven_point_score").sort_values(ascending=False)
print("Top correlations with seven_point_score:")
display(top_corr.head(5))

# Heatmap of the five strongest positive correlations
top_features = top_corr.head(5).index.tolist() + ["seven_point_score"]
sns.heatmap(df_corr[top_features].corr(), annot=True, cmap="Blues", fmt=".2f")
plt.title("Top Correlations with Seven Point Score")
plt.tight_layout()
plt.show()



In [None]:
# Key categorical distributions (sex, location, clinic, derm, management, diagnostic difficulty)
cat_to_plot = [
    "sex", "location", "clinic", "derm", "management", "level_of_diagnostic_difficulty",
    "pigment_network", "streaks", "pigmentation", "regression_structures",
    "dots_and_globules", "blue_whitish_veil", "vascular_structures", "elevation"
]
for c in cat_to_plot:
    if c in df.columns:
        plt.figure(figsize=(10,4))
        order = df[c].value_counts().index
        sns.countplot(x=c, data=df, order=order)
        plt.title(f"{c} distribution")
        plt.xticks(rotation=45, ha="right")
        plt.tight_layout()
        plt.show()


In [None]:
# Cleaning and Target Preparation

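# Step 1: Drop sparse/unusable columns
# Rationale: case_id and notes are near-empty and non-predictive for baselines.
# Also drop EDA helper columns (target_binary, *_te) if an earlier cell added
# them to df: they are derived from diagnosis and would leak the target.
helper_cols = [c for c in df.columns if c == "target_binary" or c.endswith("_te")]
drop_cols = [c for c in ["case_id", "notes"] if c in df.columns] + helper_cols
df_clean = df.drop(columns=drop_cols)
print("Dropped columns:", drop_cols)
print("Shape after drop:", df_clean.shape)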

# Step 2: Define binary and multiclass targets
melanoma_labels = [
    "melanoma",
    "melanoma (less than 0.76 mm)",
    "melanoma (in situ)",
    "melanoma (0.76 to 1.5 mm)",
    "melanoma (more than 1.5 mm)",
    "melanoma metastasis"
]

# Binary target: 1 = melanoma, 0 = benign/other
# Named target_binary so the saved CSV header is not the misleading "diagnosis"
y_binary = df_clean["diagnosis"].isin(melanoma_labels).astype(int).rename("target_binary")
print("Binary target distribution:\n", y_binary.value_counts())

# Multiclass target: one-hot encode diagnosis
y_multiclass = pd.get_dummies(df_clean["diagnosis"])
print("Multiclass target shape:", y_multiclass.shape)

# Step 3: Encode categorical features (exclude diagnosis)
categorical_cols = df_clean.select_dtypes(include=["object"]).columns.tolist()
categorical_to_encode = [c for c in categorical_cols if c != "diagnosis"]

X = pd.get_dummies(
    df_clean.drop(columns=["diagnosis"]),
    columns=categorical_to_encode,
    drop_first=True
)

print("Features shape:", X.shape)
print("Binary target shape:", y_binary.shape)
print("Multiclass target shape:", y_multiclass.shape)

# Step 4: Scale numeric features (exclude identifiers like case_num)
from sklearn.preprocessing import StandardScaler

numeric_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
scale_cols = [c for c in numeric_cols if c != "case_num"]

print("Numeric columns:", numeric_cols)
print("Columns to scale:", scale_cols)

if scale_cols:
    scaler = StandardScaler()
    X[scale_cols] = scaler.fit_transform(X[scale_cols])

# Step 5: Save outputs
X.to_csv(os.path.join(DATASET_PATH, "meta", "features.csv"), index=False)
y_binary.to_csv(os.path.join(DATASET_PATH, "meta", "target_binary.csv"), index=False)
y_multiclass.to_csv(os.path.join(DATASET_PATH, "meta", "target_multiclass.csv"), index=False)

print("Saved:")
print(" - features.csv")
print(" - target_binary.csv")
print(" - target_multiclass.csv")


In [None]:
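# Save cleaned metadata (the table before encoding and scaling;
# the encoded, scaled matrix is already saved as features.csv)
df_clean.to_csv(CLEAN_FILE, index=False)
print(f"Cleaned metadata saved to: {CLEAN_FILE}")
print("Final columns:", list(df_clean.columns))
print("Final shape:", df_clean.shape)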


In [None]:
# Verification of cleaned outputs

X = pd.read_csv(os.path.join(DATASET_PATH, "meta", "features.csv"))
y_binary = pd.read_csv(os.path.join(DATASET_PATH, "meta", "target_binary.csv"))
y_multiclass = pd.read_csv(os.path.join(DATASET_PATH, "meta", "target_multiclass.csv"))

print("Shapes:")
print("Features:", X.shape)
print("Binary target:", y_binary.shape)
print("Multiclass target:", y_multiclass.shape)

# Quick sanity check: all should have the same number of rows
assert len(X) == len(y_binary) == len(y_multiclass), "Row mismatch detected!"
print("All aligned correctly.")


In [None]:
print(df["diagnosis"].value_counts())


# Lab 2 – EDA and Cleaning (Summary & Executive Notes)

---

### Step‑by‑Step Summary

1. **Setup and Imports**  
   - Loaded pandas, numpy, seaborn, matplotlib.  
   - Defined dataset paths for reproducibility.  

2. **Data Loading**  
   - Read `meta.csv` into `df`.  
   - Inspected shape, dtypes, and non‑null counts.  
   - *Why:* Confirm schema and data quality.  

3. **Exploratory Data Analysis (EDA)**  
   - Checked class balance (`diagnosis`).  
   - Visualized distributions of numeric and categorical features.  
   - Plotted missing values.  
   - Plotted a correlation heatmap of `seven_point_score` against numeric and target‑encoded categorical features.  
   - *Why:* Understand imbalance, feature coverage, and guide cleaning.  

4. **Cleaning Decisions**  
   - **Dropped Columns:**  
     - `case_id` → ~97% missing, identifier only.  
     - `notes` → ~99% missing, free‑text, not usable.  
   - Preserved `case_num` but excluded from scaling.  
   - *Why:* Avoid noise and target leakage.  

5. **Target Preparation**  
   - Kept raw `diagnosis`.  
   - Created `y_binary` (melanoma=1, others=0).  
   - Created `y_multiclass` (one‑hot encoded diagnosis).  
   - *Why:* Supports both binary and multiclass experiments.  

6. **Feature Engineering**  
   - One‑hot encoded categorical features (excluding `diagnosis`).  
   - Scaled numeric features (`seven_point_score`, etc.), excluding `case_num`.  
   - *Why:* Ensure features are numeric and standardized for models.  

7. **Saving Outputs**  
   - Saved:  
     - `features.csv` – cleaned, encoded, scaled feature matrix.  
     - `target_binary.csv` – binary labels (melanoma vs. benign/other).  
     - `target_multiclass.csv` – multiclass labels (20 categories).  
     - `metadata_cleaned.csv` – cleaned metadata table.  
   - Verified alignment: all outputs have consistent row counts.  
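
The dropping decision in step 4 can be expressed as a threshold on each column's missing fraction. The following is a minimal sketch on made-up toy data; the helper name `sparse_columns` and the 0.9 threshold are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def sparse_columns(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return the missing fraction of each column whose NaN share exceeds threshold."""
    missing_frac = frame.isna().mean()  # fraction of NaN values per column
    return missing_frac[missing_frac > threshold].sort_values(ascending=False)


# Toy data standing in for meta.csv (values are invented for illustration)
toy = pd.DataFrame({
    "diagnosis": ["label_a"] * 20,
    "case_id": [np.nan] * 19 + ["A1"],  # 95% missing
    "notes": [np.nan] * 20,             # 100% missing
})

to_drop = sparse_columns(toy, threshold=0.9)
print(to_drop)

toy_clean = toy.drop(columns=to_drop.index)
print("Remaining columns:", list(toy_clean.columns))
```

Reporting the fraction rather than the raw count makes a figure such as "~97% missing" directly checkable against the data.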

---

## Executive Summary – Lab 2 (EDA & Cleaning)

- **Objective:** Prepare Derm7pt metadata for modeling by exploring, cleaning, and transforming features into a machine‑learning‑ready format.  

- **Key Steps:**  
  - Inspected dataset structure, distributions, and missing values.  
  - Dropped sparse/non‑predictive columns (`case_id`, `notes`).  
  - One‑hot encoded categorical features; standardized numeric features.  
  - Created two target sets:  
    - **Binary target:** grouped all melanoma subtypes into class **1 (melanoma)**, and all other diagnoses into class **0 (benign/other)**.  
    - **Multiclass target:** one‑hot encoded all 20 diagnosis categories.  

- **Outputs:**  
  - `features.csv` – cleaned, encoded, scaled feature matrix.  
  - `target_binary.csv` – binary labels.  
  - `target_multiclass.csv` – multiclass labels.  
  - `metadata_cleaned.csv` – cleaned metadata table.  

- **Why it matters:** Establishes a reproducible, clean foundation for Lab 3 baselines (logistic regression, decision trees, etc.) and future multimodal fusion with images.  
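
Before one‑hot encoding, it may be worth checking how many distinct values each categorical column holds. A column with nearly one unique value per row (for example a file path or a free‑text field) would expand into almost as many dummy columns as there are rows. The sketch below runs on toy data; `cardinality_report`, `max_ratio`, and the `image_path` column are hypothetical names used only for illustration.

```python
import pandas as pd


def cardinality_report(frame: pd.DataFrame, max_ratio: float = 0.5) -> pd.DataFrame:
    """Summarise distinct-value counts for the non-numeric columns of frame."""
    cat = frame.select_dtypes(exclude=["number", "bool"])
    n_unique = cat.nunique()
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(frame),
    })
    # Flag columns where more than max_ratio of the rows carry a distinct value
    report["high_cardinality"] = report["unique_ratio"] > max_ratio
    return report.sort_values("n_unique", ascending=False)


# Toy data (invented): one ordinary category and one per-row identifier
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "image_path": ["a/1.jpg", "a/2.jpg", "b/3.jpg", "b/4.jpg"],
})

report = cardinality_report(toy)
print(report)
```

Columns flagged here are candidates for exclusion from `pd.get_dummies`, though whether to drop them remains a judgment call for the dataset at hand.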

---

**Status:** Lab 2 is complete. Next step: use `X` + `y_binary` for binary baselines, and `X` + `y_multiclass` for multiclass experiments in Lab 3.
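
Many estimators expect a one‑dimensional label vector rather than a one‑hot matrix, so the multiclass target may need converting back before Lab 3. A minimal sketch of one way to do this, assuming each row has exactly one active class; the diagnosis values below are toy stand‑ins, not the real label set.

```python
import pandas as pd

# Toy diagnosis column (invented values)
diagnosis = pd.Series(["nevus", "melanoma", "nevus", "keratosis"], name="diagnosis")

# One-hot matrix, built the same way as y_multiclass in the notebook
y_onehot = pd.get_dummies(diagnosis)

# Recover the label of the single active column in each row.
# astype(int) keeps this working whether the dummies are bool or 0/1.
y_labels = y_onehot.astype(int).idxmax(axis=1)

# Integer codes plus the lookup table mapping code -> class name
codes, classes = pd.factorize(y_labels, sort=True)
print(list(codes))
print(list(classes))
```

Keeping `classes` alongside the codes makes it possible to map predictions back to diagnosis names later.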


---

## Key Findings – Lab 2 (EDA)

- **Class imbalance:** Melanoma cases form a minority compared to benign/other diagnoses, confirming the need for careful handling of imbalance in downstream models.  
- **Missingness concentrated:** `case_id` (~97% missing) and `notes` (~99% missing) were essentially unusable, while most other features had good coverage.  
- **Numeric features:** `seven_point_score` showed a skewed distribution, with most cases clustered at lower scores. Correlations among numeric features were weak, suggesting limited redundancy.  
- **Categorical distributions:** Features like `sex`, `location`, and `management` were unevenly distributed, with some categories dominating (e.g., certain body locations more frequent).  
- **Data integrity:** After cleaning and encoding, all feature and target files aligned perfectly in row counts, ensuring reproducibility for Lab 3.  
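
One common way to account for the class imbalance noted above is to weight each class inversely to its frequency. The sketch below computes weights as `n_samples / (n_classes * class_count)`, the same heuristic scikit‑learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; the notebook itself does not apply any weighting.

```python
import pandas as pd


def balanced_class_weights(y: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy binary target (invented): 8 negatives, 2 positives
y_toy = pd.Series([0] * 8 + [1] * 2)
weights = balanced_class_weights(y_toy)
print(weights)
```

The minority class receives the larger weight, so each class contributes equally in total to a weighted loss.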

---

**Conclusion:** The dataset is imbalanced but structurally sound after cleaning. With binary and multiclass targets prepared, the project is ready to progress into baseline modeling (Lab 3).
