# 🧬 Breast Cancer Subtype Classification using Deep Learning (METABRIC Dataset)


This notebook uses the METABRIC breast cancer dataset from Kaggle to build a deep learning model that classifies tumors into PAM50 + claudin-low subtypes based on gene expression profiles.

The notebook covers:
- Loading and preprocessing the data
- Building a deep neural network (DNN)
- Training and evaluating the model
- Visualizing the performance
- Saving the trained model


## 📥 Load & Preprocess Data

In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

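# Load the dataset (low_memory=False avoids chunk-wise dtype guessing on columns that mix numbers and strings)
df = pd.read_csv("METABRIC_RNA_Mutation.csv", low_memory=False)

# Drop rows with missing subtype labels
df = df.dropna(subset=["pam50_+_claudin-low_subtype"])

# Encode target variable
le = LabelEncoder()
df["subtype_encoded"] = le.fit_transform(df["pam50_+_claudin-low_subtype"])

# Select gene expression features (drop non-feature columns).
# Dropping only the first 30 columns left string-valued columns in X, which made
# StandardScaler fail with: ValueError: could not convert string to float: 'Living'.
# Keep numeric columns only, and exclude mutation columns
# (assumption: their names end in "_mut").
non_feature_cols = df.columns[:30].tolist() + ["pam50_+_claudin-low_subtype", "subtype_encoded"]
X = df.drop(columns=non_feature_cols).select_dtypes(include="number")
X = X.loc[:, ~X.columns.str.endswith("_mut")]
y = df["subtype_encoded"]

# Train-test split (before scaling, so test-set statistics do not leak into the scaler)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the gene expression data, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)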
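
Before scaling, it can help to confirm that the feature matrix holds only numeric columns, because `StandardScaler` cannot convert strings such as 'Living' to floats. The following is a minimal sketch on a made-up toy frame; the column names and values are illustrative assumptions, not read from the actual CSV.

```python
import pandas as pd

# Toy frame standing in for the feature matrix (hypothetical columns and values)
toy = pd.DataFrame({
    "brca1": [0.12, -1.30, 0.85],                                   # numeric expression z-scores
    "death_from_cancer": ["Living", "Died of Disease", "Living"],   # string-valued
    "tp53_mut": ["0", "R248Q", "0"],                                # string-valued mutation calls
})

# Columns whose dtype is not numeric would make StandardScaler raise a ValueError
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)

# Keep only the numeric columns for scaling
numeric_only = toy.select_dtypes(include="number")
print("Columns kept:", numeric_only.columns.tolist())
```

Running the same `select_dtypes(exclude="number")` check on the real feature matrix should list any columns that still need to be dropped or encoded.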

## 🤖 Build & Train Deep Learning Model

In [None]:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

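# One-hot encode target (num_classes keeps both arrays the same width even if a class is absent from one split)
num_classes = len(le.classes_)
y_train_cat = to_categorical(y_train, num_classes=num_classes)
y_test_cat = to_categorical(y_test, num_classes=num_classes)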

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(256, activation='relu'),
    Dropout(0.2),
    Dense(y_train_cat.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train_cat, validation_split=0.2, epochs=50, batch_size=32)
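
To make the output layer and loss more concrete: the final `Dense` layer applies softmax to turn raw scores into class probabilities, and `categorical_crossentropy` penalizes low probability on the true class. Below is a small NumPy sketch with made-up scores for three hypothetical classes. It illustrates the math only and is not Keras's internal implementation.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true_onehot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-(y_true_onehot * np.log(probs)).sum(axis=-1).mean())

# Made-up raw scores for two samples and three classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
# One-hot true labels: sample 0 is class 0, sample 1 is class 2
y_true = np.array([[1, 0, 0],
                   [0, 0, 1]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.round(3))
print(f"loss = {loss:.4f}")
```

The loss shrinks toward zero as the probability on the true class approaches 1, which is what training with `adam` tries to achieve.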


## 📏 Evaluate Model

In [None]:

from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Evaluate on test set
loss, acc = model.evaluate(X_test, y_test_cat)
print(f"Test Accuracy: {acc:.4f}")

# Predict on test set
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

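# Classification report (explicit labels keep target_names aligned even if a class is missing from the test split)
print(classification_report(
    y_test, y_pred_classes,
    labels=np.arange(len(le.classes_)),
    target_names=le.classes_,
    zero_division=0,
))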
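
The `np.argmax` step picks the highest-probability class index for each sample; those indices can be mapped back to subtype names by indexing the array of class labels (in the cell above, `le.classes_`). A minimal sketch with made-up probabilities and an illustrative label list (the names and values here are assumptions, not model output):

```python
import numpy as np

# Illustrative class labels in encoder order (hypothetical)
class_names = np.array(["Basal", "Her2", "LumA"])

# Made-up softmax outputs for three samples
probs = np.array([[0.70, 0.20, 0.10],
                  [0.05, 0.15, 0.80],
                  [0.30, 0.45, 0.25]])

pred_idx = np.argmax(probs, axis=1)   # index of the most probable class per row
pred_names = class_names[pred_idx]    # map indices back to label strings
confidence = probs.max(axis=1)        # probability assigned to the predicted class

for name, conf in zip(pred_names, confidence):
    print(f"{name}: {conf:.2f}")
```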


## 📊 Visualize Performance

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Plot accuracy over epochs
plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Confusion matrix
conf = confusion_matrix(y_test, y_pred_classes)
sns.heatmap(conf, annot=True, fmt='d', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title("Confusion Matrix")
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.show()
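
Raw counts can be hard to compare when subtypes have very different sample sizes. One option is to divide each row by its total so that the diagonal shows per-class recall. A sketch with a made-up 3×3 matrix (the counts are illustrative, not results from this model):

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes
conf = np.array([[40,  5,  5],
                 [ 2, 16,  2],
                 [ 1,  1,  8]])

row_totals = conf.sum(axis=1, keepdims=True)

# Divide each row by its total; rows with no samples stay at zero instead of dividing by zero
conf_norm = np.divide(
    conf, row_totals,
    out=np.zeros(conf.shape, dtype=float),
    where=row_totals != 0,
)

recall_per_class = np.diag(conf_norm)
print(conf_norm.round(2))
print("Per-class recall:", recall_per_class.round(2))
```

The normalized matrix can be passed to `sns.heatmap` with `fmt='.2f'` in place of the raw counts.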


## 💾 Save Model

In [None]:

# Save the trained model
model.save("metabric_subtype_classifier.h5")
