# Exploratory Data Analysis: Wisconsin Diagnostic Breast Cancer (WDBC)

### 1.1 Introduction
This report analyzes the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to identify key features distinguishing malignant from benign tumors. The features were computed from digitized images of fine needle aspirates (FNA) of breast masses and describe characteristics of the cell nuclei present in each image (Wolberg et al., 1995).

### 1.2 Data Acquisition
The raw data was retrieved directly from the UCI Machine Learning Repository to ensure reproducibility. The dataset consists of 569 instances with 30 real-valued input features and one binary target variable (`Diagnosis`).

In [1]:
# Imports (only those necessary for EDA)
import pandas as pd
import numpy as np

import altair_ally as aly
import altair as alt
alt.data_transformers.enable('vegafusion')

from ucimlrepo import fetch_ucirepo

In [2]:
# import the data
# Code from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# Need ucimlrepo package to load the data
raw_data = fetch_ucirepo(id=17)

raw_X = raw_data.data.features
raw_y = raw_data.data.targets

raw_df = pd.concat([raw_X, raw_y], axis=1)
raw_df.to_csv("../data/raw/breast_cancer_raw.csv", index=False)

### 2. Data Cleaning and Schema Mapping
The raw dataset lacks semantic column headers. To facilitate analysis, we implemented a schema mapping strategy based on the `wdbc.names` metadata. The 30 features represent ten distinct cell nucleus characteristics (e.g., Radius, Texture) computed in three statistical forms.

We applied the following suffix mapping transformation:
* **Mean Value:** Suffix `1` -> `_mean`
* **Standard Error:** Suffix `2` -> `_se`
* **Worst (Max) Value:** Suffix `3` -> `_max`

This step ensures all features are semantically interpretable for the subsequent EDA.
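
The suffix mapping above can be sketched as a small lookup-based helper. The names `SUFFIX_MAP` and `rename_wdbc_column` are illustrative and not part of the original notebook; the sketch assumes raw column names end in a single digit (for example `radius1`), as in the `ucimlrepo` export.

```python
# Illustrative helper (hypothetical names): maps the numeric suffixes used in the
# raw column names to the descriptive suffixes listed above.
SUFFIX_MAP = {"1": "_mean", "2": "_se", "3": "_max"}


def rename_wdbc_column(col: str) -> str:
    """Return the descriptive name for a raw column such as 'radius1'."""
    suffix = SUFFIX_MAP.get(col[-1:])
    # Columns without a recognized suffix are returned unchanged.
    return col[:-1] + suffix if suffix else col


raw_names = ["radius1", "texture2", "concave_points3", "Diagnosis"]
print([rename_wdbc_column(name) for name in raw_names])
```

Keeping the mapping in a dictionary makes it easy to audit against `wdbc.names` and leaves unrelated columns untouched.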

In [3]:
# Data Cleaning
# There are no missing values in the dataset

# Clean the column names based on description
clean_columns = []
for col in raw_X.columns:
    if col.endswith('1'):
        clean_name = col[:-1] + '_mean'
    elif col.endswith('2'):
        clean_name = col[:-1] + '_se'
    elif col.endswith('3'):
        clean_name = col[:-1] + '_max'
    else:
        clean_name = col
    
    clean_columns.append(clean_name)
# Rename on a copy so the raw feature frame keeps its original column names
X = raw_X.copy()
X.columns = clean_columns

# Clean the target column
y = raw_y.copy()
y['Diagnosis'] = y['Diagnosis'].map({'M': 'Malignant', 'B': 'Benign'})
clean_df = pd.concat([X, y], axis=1)

# Export the cleaned data
clean_df.to_csv('../data/processed/breast_cancer_cleaned.csv', index=False)

clean_df

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,...,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max,Diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,Malignant
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,Malignant
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,Malignant
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,Malignant
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,Malignant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,Malignant
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,Malignant
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,Malignant
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,Malignant


### 3. Data Profiling: Structure and Statistics
**Purpose:**
* **`df.info()`:** Used to verify data integrity by checking for null values and ensuring all feature columns are of `float64` type.
* **`df.describe()`:** Used to examine the central tendency and spread of numeric features. This highlights differences in **magnitude** (scales) across variables.

**Observation:**
The dataset is complete (no missing values). However, `describe()` reveals large scale disparities (e.g., `area_mean` ranges up to about 2,500, while `smoothness_mean` never exceeds 0.17), confirming the need for **Feature Scaling** (Standardization) before modeling.
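
To make the scale disparity concrete, the sketch below standardizes the maxima of `area_mean` and `smoothness_mean` by hand, using the mean and standard deviation reported by `describe()` in this section. The helper name `z_score` is illustrative. Note that `describe()` reports the sample standard deviation, whereas `StandardScaler` divides by the population standard deviation, so values from a fitted scaler would differ slightly.

```python
def z_score(value: float, mean: float, std: float) -> float:
    """Standardize a value: its distance from the mean in units of standard deviation."""
    return (value - mean) / std


# Summary statistics copied from the describe() output in this section.
area_max_z = z_score(2501.0, mean=654.889104, std=351.914129)
smooth_max_z = z_score(0.1634, mean=0.09636, std=0.014064)

# The raw maxima differ by roughly four orders of magnitude; the z-scores are comparable.
print(f"area_mean max:       raw=2501.0  z={area_max_z:.2f}")
print(f"smoothness_mean max: raw=0.1634  z={smooth_max_z:.2f}")
```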

In [4]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   radius_mean             569 non-null    float64
 1   texture_mean            569 non-null    float64
 2   perimeter_mean          569 non-null    float64
 3   area_mean               569 non-null    float64
 4   smoothness_mean         569 non-null    float64
 5   compactness_mean        569 non-null    float64
 6   concavity_mean          569 non-null    float64
 7   concave_points_mean     569 non-null    float64
 8   symmetry_mean           569 non-null    float64
 9   fractal_dimension_mean  569 non-null    float64
 10  radius_se               569 non-null    float64
 11  texture_se              569 non-null    float64
 12  perimeter_se            569 non-null    float64
 13  area_se                 569 non-null    float64
 14  smoothness_se           569 non-null    fl

In [5]:
clean_df.describe()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


9. Target/response variable follows expected distribution
* We validate that the target variable `Diagnosis` is not severely imbalanced. If one class is much rarer than the other, this can hurt model performance and may require special handling (for example, resampling or adjusting evaluation metrics).

10. No anomalous correlations between target and features
* We check how strongly each feature is associated with `Diagnosis`. Extremely high predictive power for a single feature can indicate data leakage or unexpected dependencies that should be investigated.

11. No anomalous correlations between features
* We examine pairwise correlations between features. If many feature pairs are highly correlated, this suggests redundancy or multicollinearity, which may require feature selection or dimensionality reduction.
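
As a rough illustration of check 9, the minority-to-majority ratio can be computed directly with pandas. The class counts below (357 benign, 212 malignant) are the distribution documented for WDBC by the UCI repository rather than output from this notebook, and `class_ratio` is a hypothetical helper name.

```python
import pandas as pd


def class_ratio(labels: pd.Series) -> float:
    """Return the count of the rarest class divided by the count of the most common one."""
    counts = labels.value_counts()
    return counts.min() / counts.max()


# Assumed class counts (documented WDBC distribution, not computed in this notebook).
labels = pd.Series(["Benign"] * 357 + ["Malignant"] * 212, name="Diagnosis")
ratio = class_ratio(labels)
print(f"minority / majority = {ratio:.3f}")
```

A ratio near 0.59 indicates a moderate imbalance, well above the 0.2 threshold used in the check that follows.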

In [6]:
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ClassImbalance, FeatureLabelCorrelation
from deepchecks.tabular.checks.data_integrity import FeatureFeatureCorrelation

bc_dataset = Dataset(
    clean_df,
    label='Diagnosis'
)

# 9. Target/response variable follows expected distribution
class_imbalance_check = ClassImbalance().add_condition_class_ratio_less_than(
    class_imbalance_ratio_th=0.2  # flag if minority / majority < 0.2
)

class_imbalance_result = class_imbalance_check.run(bc_dataset)

class_imbalance_result



In [25]:
# 10. Feature–target correlations
flc_check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(
    threshold=0.8  # flag features that are *too* predictive of the label
)

flc_result = flc_check.run(bc_dataset)
flc_result


In [None]:
# 11. Feature–feature correlations
ffc_check = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(
    0.95,
    10 
)

ffc_result = ffc_check.run(bc_dataset)
ffc_result


The `FeatureFeatureCorrelation` check fails the condition we set. In our data there are many feature pairs with very high correlation, for example `radius_mean`, `perimeter_mean` and `area_mean`, as well as their corresponding `_max` and `_se` versions. This pattern is expected for this dataset because these variables all describe related geometric properties of the tumor, so strong correlations are not a data quality error but a sign of redundancy and multicollinearity.
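
The redundancy described above follows from geometry: for a roughly circular nucleus, perimeter and area are functions of the radius. The sketch below uses synthetic radii (an assumption, not WDBC data) to show how pairs above a correlation threshold can be listed with pandas; `high_corr_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """List (feature_a, feature_b, |r|) for each pair whose absolute Pearson r exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, round(float(corr.loc[a, b]), 3)))
    return pairs


# Synthetic example: idealized circles plus an unrelated noise feature.
rng = np.random.default_rng(0)
radius = rng.uniform(7, 28, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "noise": rng.normal(size=200),
})
print(high_corr_pairs(toy))
```

On the real data, the same helper could be applied to the feature columns of `clean_df` to see which pairs drive the failed condition.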


### 4. Correlation Analysis: Pearson vs. Spearman
**Method:**
* **Pearson Correlation:** Measures linear relationships.
* **Spearman Correlation:** Measures monotonic relationships using ranks, so it also captures non-linear trends.

Comparing the two helps identify whether a relationship is strictly linear or merely monotonic.
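
A small synthetic example (idealized circle areas, not WDBC data) illustrates the difference. Spearman correlation is computed here as the Pearson correlation of the ranks; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic radii and the corresponding circle areas (monotonic but curved relationship).
radius = pd.Series(np.linspace(1.0, 30.0, 50))
area = np.pi * radius ** 2

# Pearson: linear association on the raw values (the default method).
pearson = radius.corr(area)

# Spearman: Pearson correlation applied to the ranks of each variable.
spearman = radius.rank().corr(area.rank())

print(f"Pearson:  {pearson:.4f}")   # below 1 because the relationship is curved
print(f"Spearman: {spearman:.4f}")  # 1 because area always increases with radius
```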

**Purpose:**
To detect **Multicollinearity**: redundant features that increase model complexity without adding information.

**Results:**
Both metrics show near-perfect correlation ($>0.95$) between `Radius`, `Perimeter`, and `Area`. This confirms these features are geometrically redundant. We should retain only one (e.g., Radius) and drop the others to improve model stability.

In [9]:
# Multicollinearity

corr_chart = aly.corr(clean_df)

corr_chart.save('../results/images/corr_chart.png')
corr_chart.save('../results/images/corr_chart.svg')

corr_chart





### 5. Pairwise Separability Analysis
**Purpose:**
To visualize 2D decision boundaries. We look for feature combinations where the **Benign (Blue)** and **Malignant (Orange)** clusters are clearly distinct with minimal overlap.

**Results:**
* **High Separability:** Features related to size (`radius_mean`) and shape complexity (`concavity_mean`) separate the classes well.
* **Non-linear patterns:** The curved relationship between `area` and `radius` is clearly visible, reinforcing the geometric redundancy found in the correlation analysis.

In [10]:
# Only include the mean features, as they provide most of the information
cols_mean = [c for c in clean_df.columns if '_mean' in c] + ['Diagnosis']
pair_chart = aly.pair(clean_df[cols_mean], color='Diagnosis:N')

pair_chart.save('../results/images/pair_chart.png')
pair_chart.save('../results/images/pair_chart.svg')

pair_chart





### 6. Distribution Analysis
**Purpose:**
To inspect the univariate "shape" of the data. We look for **Skewness** (asymmetry) and **Outliers** that could bias linear models.

**Results:**
* **Skewness:** Features like `area_se` and `concavity_mean` are heavily **right-skewed** (long tail to the right). This suggests that a **Log Transformation** could help bring these distributions closer to normal.
* **Overlap:** "Texture" and "Smoothness" show high overlap between classes, suggesting they are less informative on their own compared to "Size" features.
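
The effect of a log transform on a right-skewed feature can be checked on synthetic data. The sketch below draws log-normal samples as a stand-in for a skewed feature (an assumption, not WDBC values) and uses `np.log1p`, which remains defined at zero; this matters here because `concavity_mean` has a minimum of 0.0 in the `describe()` output.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature; a zero is appended to mimic features with a minimum of 0.0.
rng = np.random.default_rng(42)
skewed = pd.Series(np.append(rng.lognormal(mean=0.0, sigma=1.0, size=1000), 0.0))

# log1p(x) = log(1 + x), so zeros map to zero instead of -inf.
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```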

In [11]:
dist_chart = aly.dist(clean_df, color='Diagnosis')

dist_chart.save('../results/images/dist_chart.png')
dist_chart.save('../results/images/dist_chart.svg')

dist_chart

### EDA Findings

* **Class Separation:**
    * **High Separability:** Features related to **size** (`radius`, `perimeter`, `area`) and **concavity** (`concave_points`, `concavity`) show clear distinction between Benign and Malignant classes (Malignant samples generally have higher values).
    * **Low Separability:** Texture, Smoothness, and Fractal Dimension show significant overlap, indicating they are weaker individual predictors.
* **Distributions:**
    * **Skewness:** "Area" and "Concavity" features (both `_mean` and `_se`) are heavily **right-skewed**.
    * **Outliers:** Visible in the upper tails of `area_max` and `perimeter_se`.
* **Correlations (Multicollinearity):**
    * **Severe Multicollinearity:** `radius`, `perimeter`, and `area` are almost perfectly correlated ($R \approx 1$). This is expected geometrically but redundant for models.
    * `concavity`, `concave_points`, and `compactness` also exhibit very high positive correlation.

### Preprocessing Recommendations

Based on the above, the following pipeline is suggested:

1.  **Feature Selection / Drop:**
    * Remove redundant features to reduce multicollinearity. **Keep `radius`**, but **drop `perimeter`** and **`area`** as they duplicate information.
2.  **Transformation:**
    * Apply **Log Transformation** to skewed features (e.g., `area`, `concavity`) to reduce skewness. Because `concavity` contains zeros, a variant such as `log1p` is needed.
3.  **Scaling:**
    * Features vary vastly in scale (e.g., `area` values in the thousands vs. `smoothness` < 0.3). Use **`StandardScaler`** to standardize all features to zero mean and unit variance.
4.  **Imputation:**
    * None needed (Data is clean).
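
One way to wire these recommendations together in scikit-learn is sketched below on a toy frame. The column groupings (`skewed_cols`, `other_cols`) and the toy values are assumptions for illustration; they are not the feature lists used in the modeling section of this notebook, which does not apply a log transform.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data with WDBC-style column names (values are made up for illustration).
toy = pd.DataFrame({
    "radius_mean": [12.0, 14.5, 20.1, 11.3, 17.8],
    "perimeter_mean": [78.0, 94.0, 132.0, 73.0, 117.0],  # redundant with radius
    "concavity_mean": [0.0, 0.03, 0.30, 0.02, 0.15],     # right-skewed, contains a zero
    "smoothness_mean": [0.09, 0.10, 0.12, 0.08, 0.11],
})

skewed_cols = ["concavity_mean"]                 # log1p, then scale
other_cols = ["radius_mean", "smoothness_mean"]  # scale only

preprocess = make_column_transformer(
    (make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), skewed_cols),
    (StandardScaler(), other_cols),
    # Columns not listed (here perimeter_mean) are dropped: remainder="drop" is the default.
)

X_prepared = preprocess.fit_transform(toy)
print(X_prepared.shape)
```

Putting the transforms inside a column transformer means they are fitted only on the training folds when the pipeline is cross-validated.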

### On to Creating a Classification Model

In [12]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = clean_df.drop('Diagnosis', axis=1)
y = clean_df['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [13]:
X_train.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 'perimeter_max',
       'area_max', 'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max'],
      dtype='object')

In [14]:
# Features that are standardized and passed to the model
numeric_feats = ['radius_mean', 'texture_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 
       'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max']

# Redundant size features excluded from the model.
# This list must not overlap with numeric_feats: a column listed in both
# would still be scaled and used, so the drop would have no effect.
drop_feats = [
    'perimeter_mean',
    'area_mean',
    'perimeter_se',
    'area_se',
    'perimeter_max',
    'area_max'
]

In [15]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

ct = make_column_transformer(    
    (StandardScaler(), numeric_feats), 
    ("drop", drop_feats)
)

pipe = Pipeline([
    ("preprocess", ct),
    ("svc", SVC())
])

param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}

gs = GridSearchCV(
    estimator = pipe,
    param_grid = param_grid,
    cv = 15,
    n_jobs = -1,
    return_train_score = True
)

gs.fit(X_train, y_train)

In [16]:
results = pd.DataFrame(gs.cv_results_)

best_performing = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].sort_values(
    by='mean_test_score', ascending=False
).head(10)

heatmap_data = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].copy()
heatmap_data['C'] = heatmap_data['param_svc__C'].astype(str)
heatmap_data['gamma'] = heatmap_data['param_svc__gamma'].astype(str)

heatmap = alt.Chart(heatmap_data).mark_rect().encode(
    x = alt.X('gamma:N', title='gamma'),
    y = alt.Y('C:N', title='C'),
    color = alt.Color('mean_test_score:Q', scale=alt.Scale(scheme='viridis')),
    tooltip = ['C', 'gamma', 'mean_test_score']
).properties(
    width = 400,
    height = 400,
    title = 'SVM GridSearchCV Mean Test Scores'
)

In [17]:
best_performing

Unnamed: 0,param_svc__C,param_svc__gamma,mean_test_score
25,10.0,0.01,0.969176
31,100.0,0.01,0.966667
30,100.0,0.001,0.960287
19,1.0,0.01,0.955986
24,10.0,0.001,0.955914
20,1.0,0.1,0.955914
26,10.0,0.1,0.95362
32,100.0,0.1,0.95147
18,1.0,0.001,0.931613
14,0.1,0.1,0.927455


In [18]:
heatmap.display()

In [19]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = gs.predict(X_test)

report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop('support', axis = 1).drop(['macro avg', 'weighted avg'])
report_df

Unnamed: 0,precision,recall,f1-score
Benign,0.986486,1.0,0.993197
Malignant,1.0,0.97561,0.987654
accuracy,0.991228,0.991228,0.991228


In [20]:

# Pass the label order explicitly so rows and columns match gs.classes_
cm = confusion_matrix(y_test, y_pred, labels=gs.classes_)
cm_df = pd.DataFrame(cm, index = gs.classes_, columns = gs.classes_)

cm_melted = cm_df.reset_index().melt(id_vars='index')
cm_melted.columns = ['Actual', 'Predicted', 'Count']

heatmap = alt.Chart(cm_melted).mark_rect().encode(
    x = alt.X('Predicted:N', title = 'Predicted'),
    y = alt.Y('Actual:N', title = 'Actual'),
    color = alt.Color('Count:Q', scale = alt.Scale(scheme ='viridis'))
).properties(
    width = 400,
    height = 400,
    title = 'Confusion Matrix Heatmap'
)

text = alt.Chart(cm_melted).mark_text(color = 'white').encode(
    x = 'Predicted:N',
    y = 'Actual:N',
    text = 'Count:Q'
)

heatmap + text

### Discussion:
Our model performed very well, achieving high accuracy on the test set and correctly classifying nearly all cases. This result was generally expected given the strong feature patterns observed during EDA, which suggested clear separation between benign and malignant tumours.

The main concern is the single false negative, where a malignant tumour was predicted as benign. Even though such errors are rare, each one carries significant clinical risk, which shows that the model, while strong, is not yet reliable enough for real-world medical use.
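
The metrics in the classification report can be reproduced from four confusion-matrix counts. The counts below are inferred from the reported precision, recall, and accuracy (the confusion matrix output itself is not shown in this notebook), so treat them as an assumption consistent with the report rather than as recorded output.

```python
# Counts inferred from the classification report (assumption, see the note above).
# "Positive" here means Malignant.
tp, fn = 40, 1   # malignant cases: correctly flagged / missed
tn, fp = 73, 0   # benign cases: correctly cleared / falsely flagged

recall_malignant = tp / (tp + fn)            # share of malignant tumours detected
precision_benign = tn / (tn + fn)            # share of "Benign" predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct

print(f"Malignant recall:  {recall_malignant:.6f}")
print(f"Benign precision:  {precision_benign:.6f}")
print(f"Accuracy:          {accuracy:.6f}")
```

Framing the result this way makes clear that malignant recall, not overall accuracy, is the metric affected by the missed case.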

These results suggest future work should explore methods aimed at reducing false negatives, such as adjusting class weights, using cost-sensitive training, or validating on external datasets to assess robustness.
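
As one possible starting point for the class-weight idea, the sketch below fits an `SVC` with a heavier penalty on malignant errors and reports malignant recall. It runs on synthetic data from `make_classification` (an assumption, not WDBC), and the weight of 5 is an arbitrary illustrative value that would need tuning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data standing in for the tumour features.
X_syn, y_num = make_classification(
    n_samples=400, n_features=10, weights=[0.63, 0.37], random_state=0
)
y_syn = np.where(y_num == 1, "Malignant", "Benign")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# class_weight scales C per class, so misclassified malignant samples cost more.
weighted_svc = make_pipeline(
    StandardScaler(),
    SVC(C=10, gamma=0.01, class_weight={"Benign": 1, "Malignant": 5}),
)
weighted_svc.fit(X_tr, y_tr)

recall = recall_score(y_te, weighted_svc.predict(X_te), pos_label="Malignant")
print(f"Malignant recall with class weights: {recall:.3f}")
```

In practice the weight could be added to the grid search and selected with a recall-oriented scorer, at the cost of some additional false positives.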

### References

1. Reitz, Kenneth. 2011. “Requests: HTTP for Humans.” https://requests.readthedocs.io/en/master/.

2. American Cancer Society. 2024. “Breast Cancer Facts & Figures.” https://www.cancer.org/cancer/types/breast-cancer.html.

3. National Cancer Institute. 2024. “Breast Cancer Treatment (PDQ).” https://www.cancer.gov/types/breast/patient/breast-treatment-pdq.

4. UCI Machine Learning Repository. 2017. “Breast Cancer Wisconsin (Diagnostic) Data Set.” https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
