# Breast Cancer Prediction Project

Welcome to the Breast Cancer Prediction project! 

## Objective

Our goal is to build a machine learning model that can accurately predict whether a tumor is **malignant (M)** or **benign (B)** based on a set of measurements.

We'll use the dataset `Cancer_Data.csv`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

##  Step 1: Load the Dataset

In [None]:
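# File name as given in the Objective section; it must match exactly on case-sensitive file systems
df = pd.read_csv('Cancer_Data.csv')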

This dataset comes from breast cancer diagnosis records. It contains data from `569` patients, where each row holds the measurements from a test done on a tumor sample. The values describe properties such as the `size`, `texture`, and `shape` of the cells.

It has `30` features (all numeric) that describe the tumor.

The target column, `diagnosis`, tells whether the tumor was:

`M` = malignant (cancerous)

`B` = benign (non-cancerous)

The goal is to use these features to predict whether a tumor is benign or malignant.

##  Step 2: Explore the Dataset

In [None]:
print("shape: ", df.shape)

print("\n columns: ", df.columns.tolist())

print("\n Info:")
print(df.info())

In [None]:
print("\n diagnosis values:", df['diagnosis'].unique())

df.describe()

We found that the dataset has 569 rows and 33 columns, and that the target column is `diagnosis` with values 'M' and 'B'. The `id` column is just an identifier and `Unnamed: 32` is completely empty, so neither is useful as a feature.
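One way to spot a column like `Unnamed: 32` without reading through `df.info()` is to ask pandas which columns are entirely missing. The sketch below runs on a small made-up frame; the `toy` name, the `feature_1` column, and all values are illustrative and not taken from `Cancer_Data.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; every value here is made up.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "feature_1": [17.9, 12.4, 11.8],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print("Completely empty columns:", empty_cols)

# Missing-value count per column, useful for partially empty columns too
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, the same two lines can be applied to `df` instead of `toy`.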

##  Step 3: Clean the Data

We'll remove any unnecessary columns and handle missing values.

Drop the `id` and `Unnamed: 32` columns:

In [None]:

df = df.drop(columns=['id', 'Unnamed: 32'])

print(df.shape)

A cleaned copy was saved separately as `Cancer_Data_Cleaned.csv`, but since it's a small dataset, we clean it at runtime here.

##  Step 4: Visualize the Data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='diagnosis', data=df)
plt.title("Count of Diagnosis (M = Malignant, B = Benign)")
plt.show()

The correlation matrix needs numeric columns, so it is computed in the preprocessing section, after `diagnosis` has been encoded as numbers.
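To illustrate why the encoding has to come first: a text column cannot take part in a correlation (depending on the pandas version it is either rejected or left out), so the target only shows up in the matrix once it is numeric. This is a minimal sketch on a made-up frame; `toy`, `feature_1`, and `feature_2` are hypothetical names and values.

```python
import pandas as pd

# Made-up data with a text target, mimicking the structure of the real frame
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B"],
    "feature_1": [20.0, 11.0, 18.5, 12.5],
    "feature_2": [0.30, 0.10, 0.25, 0.12],
})

# Encode the target on a copy so the original frame is left untouched
encoded = toy.copy()
encoded["diagnosis"] = encoded["diagnosis"].map({"M": 1, "B": 0})

# Now every column is numeric and the target appears in the matrix
corr = encoded.corr()

# Correlation of each feature with the target, strongest first
target_corr = corr["diagnosis"].drop("diagnosis").sort_values(ascending=False)
print(target_corr)
```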

##  Step 5: Preprocess the Data

In [None]:

df['diagnosis'] = df['diagnosis'].map({'M' : 1, 'B' : 0})
print("Encoded values in diagnosis:", df['diagnosis'].unique())


X = df.drop('diagnosis' , axis=1)
Y = df['diagnosis']

print(X.shape , Y.shape) #gives (569, 30) (569,) 

Convert the `diagnosis` column to `0` (benign) and `1` (malignant). \
Split the data into features and labels (`X` and `Y`). \
The resulting shapes are `(569, 30)` and `(569,)`.
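A quick sanity check after this kind of encoding is to look for labels that the mapping did not cover, because `Series.map` turns any value missing from the dictionary into `NaN`. The sketch below uses a made-up label series with a deliberate typo; it is not data from the notebook.

```python
import pandas as pd

# Made-up labels; the lowercase "m" is a deliberate typo
labels = pd.Series(["M", "B", "B", "m", "M"])

encoded = labels.map({"M": 1, "B": 0})

# Values that were not in the mapping come back as NaN
n_unmapped = int(encoded.isna().sum())
unmapped_values = labels[encoded.isna()].tolist()

print("Unmapped labels:", n_unmapped)
print("Offending values:", unmapped_values)
```

If `n_unmapped` is greater than zero on the real data, the raw labels need cleaning before training.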

In [None]:
# Correlation matrix
corr = df.corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr, annot=False, cmap='coolwarm', linewidths=0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=42)

print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("Y_train:", Y_train.shape)
print("Y_test:", Y_test.shape, "\n")


# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)



The split gives `455` training samples and `114` test samples, each with 30 features.
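The two sizes follow from simple arithmetic on the 569 rows. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of test samples, which is consistent with the shapes reported here.

```python
import math

n_samples = 569
test_size = 0.2

# 0.2 * 569 = 113.8, rounded up to a whole number of test rows
n_test = math.ceil(test_size * n_samples)

# Everything that is not in the test set is used for training
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```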

The scaler is fitted on the training set only and then applied to both sets, so the features are ready for model training.
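For intuition, here is roughly what `StandardScaler` does with its default settings, written in plain NumPy: each feature is shifted and divided using the mean and standard deviation of the training rows, and the same numbers are reused for the test rows. The arrays are made up and `X_tr` / `X_te` are hypothetical names.

```python
import numpy as np

# Made-up values standing in for two features on very different scales
X_tr = np.array([[10.0, 200.0],
                 [12.0, 220.0],
                 [14.0, 260.0],
                 [16.0, 280.0]])
X_te = np.array([[13.0, 240.0]])

# Statistics come from the training rows only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std

print("train mean per feature:", X_tr_scaled.mean(axis=0))
print("train std per feature:", X_tr_scaled.std(axis=0))
print("scaled test row:", X_te_scaled)
```

Reusing the training statistics on the test set keeps information about the test rows out of the preprocessing step.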

##  Step 6: Train a Machine Learning Model

In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score


Using `PyTorch`, since TensorFlow didn't support Python 3.13 at the time of writing.

In [None]:
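# Convert the scaled features to float tensors (used by the training and evaluation cells)
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)

# Convert the labels to float column vectors of shape (N, 1) to match the model output
Y_train_tensor = torch.tensor(Y_train.values, dtype=torch.float32).view(-1, 1)
Y_train_tensor = (Y_train_tensor >= 0.5).float()  # makes sure values are 0.0 or 1.0

Y_test_tensor = torch.tensor(Y_test.values, dtype=torch.float32).view(-1, 1)
Y_test_tensor = (Y_test_tensor >= 0.5).float()


`Tensor Conversion` - This block converts the scaled features and the `Y_train` and `Y_test` labels into PyTorch float tensors. The labels are reshaped to `(N, 1)` and forced to be either 0.0 or 1.0, which is required for binary classification with a sigmoid output.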
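The reshape matters because `BCELoss` compares predictions and targets element by element, so both need the same shape. A minimal sketch with made-up labels and probabilities (none of these numbers come from the notebook):

```python
import numpy as np
import torch

# Made-up encoded labels
labels = np.array([1, 0, 0, 1])

flat = torch.tensor(labels, dtype=torch.float32)  # shape (4,)
column = flat.view(-1, 1)                         # shape (4, 1)

print(flat.shape)
print(column.shape)

# Made-up model outputs shaped like a batch of sigmoid predictions
probs = torch.tensor([[0.9], [0.2], [0.4], [0.7]])

# Predictions and targets now have matching (4, 1) shapes
loss = torch.nn.BCELoss()(probs, column)
print(loss.item())
```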

In [None]:
# Define the model
class CancerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(30, 16)   
        self.fc2 = nn.Linear(16, 8)    
        self.out = nn.Linear(8, 1)    
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.out(x))
        return x
model = CancerNet()


`Model Defining` - Defines a simple neural network called `CancerNet` using PyTorch. It has 3 fully connected layers (layer sizes 30-16-8-1) with `ReLU` activations after the first two and a final `sigmoid`, so the output is the predicted probability that the tumor is malignant.
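To get a feel for how small this network is, we can count its trainable parameters and check the output shape. The sketch below rebuilds the same layer sizes with `nn.Sequential` purely for brevity; `net` is a stand-in, not the `CancerNet` instance trained in this notebook.

```python
import torch
import torch.nn as nn

# Stand-in with the same layer sizes as CancerNet
net = nn.Sequential(
    nn.Linear(30, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

# Weights plus biases: (30*16 + 16) + (16*8 + 8) + (8*1 + 1)
n_params = sum(p.numel() for p in net.parameters())
print("trainable parameters:", n_params)

# A random batch of 5 samples with 30 features each
batch = torch.randn(5, 30)
with torch.no_grad():
    out = net(batch)

# One probability per sample
print(out.shape)
```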

In [None]:
# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


`binary cross-entropy` - the standard loss for binary classification when the model outputs a probability

`Adam` - an adaptive gradient-based optimizer that updates the weights to reduce the loss
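To see what binary cross-entropy rewards and punishes, here is the formula written out by hand in plain Python, averaged over the samples as `nn.BCELoss` does by default. The helper name `binary_cross_entropy` and the probabilities are made up for illustration, and the sketch skips the numerical safeguards a library implementation would have for probabilities of exactly 0 or 1.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)

y_true = [1, 0, 1]

# Predictions that lean the right way vs. the wrong way
confident_right = [0.9, 0.1, 0.8]
confident_wrong = [0.1, 0.9, 0.2]

good = binary_cross_entropy(y_true, confident_right)
bad = binary_cross_entropy(y_true, confident_wrong)

print(f"loss when mostly right: {good:.4f}")
print(f"loss when mostly wrong: {bad:.4f}")
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded, which is what pushes the network toward well-separated probabilities.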

In [None]:
# Training loop
epochs = 100

for epoch in range(epochs):
    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, Y_train_tensor)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")


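The printed loss decreases from one checkpoint to the next (every 10 epochs), which indicates that training is converging. This is the training loss only; performance on unseen data is measured in Step 7.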

In [None]:
# Using Logistic Regression
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, Y_train)

log_preds = log_reg.predict(X_test_scaled)

from sklearn.metrics import classification_report, confusion_matrix
print("Logistic Regression Classification Report:")
print(classification_report(Y_test, log_preds))
print("Confusion Matrix:")
print(confusion_matrix(Y_test, log_preds))


Logistic regression performs well and is balanced across both classes: it reaches `97%` accuracy, and the confusion matrix is nearly perfect.
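The numbers in the classification report can be derived directly from a confusion matrix. The sketch below uses hypothetical counts (they are NOT this notebook's output) laid out the way `confusion_matrix` prints them for labels 0 and 1: rows are true classes, columns are predicted classes.

```python
# Hypothetical counts for illustration only, not the notebook's results
# Layout: [[TN, FP],
#          [FN, TP]]
cm = [[50, 2],
      [3, 45]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the tumors flagged malignant, how many really are
recall = tp / (tp + fn)     # of the malignant tumors, how many were caught

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

For a cancer screen, recall on the malignant class is usually the number to watch, since a false negative means a missed malignant tumor.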

##  Step 7: Evaluate the Model

In [None]:

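model.eval()  # switch to evaluation mode
with torch.no_grad():
    test_outputs = model(X_test_tensor)
    predicted = (test_outputs >= 0.5).float()  # threshold the probabilities at 0.5

# Accuracy (tensors are flattened to NumPy arrays for scikit-learn)
accuracy = accuracy_score(Y_test_tensor.numpy().ravel(), predicted.numpy().ravel())
print(f"Test Accuracy: {accuracy:.4f}")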


In [None]:
# Save the model weights
torch.save(model.state_dict(), 'cancer_model.pth')


Since the dataset is small and clean, the model achieves over `95%` accuracy on the test set.
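A saved `state_dict` holds only the weights, so loading it later means rebuilding the same architecture first and then restoring the weights into it. This is a minimal sketch of that round trip, using a stand-in network with the same layer sizes and a temporary file; `make_net` and `demo_model.pth` are hypothetical names, and in this notebook the equivalents would be `CancerNet()` and `cancer_model.pth`.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_net():
    # Stand-in with the same layer sizes as CancerNet
    return nn.Sequential(
        nn.Linear(30, 16), nn.ReLU(),
        nn.Linear(16, 8), nn.ReLU(),
        nn.Linear(8, 1), nn.Sigmoid(),
    )

trained = make_net()  # pretend this one has been trained

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pth")  # hypothetical file name
    torch.save(trained.state_dict(), path)

    # Rebuild the architecture, then restore the saved weights into it
    restored = make_net()
    restored.load_state_dict(torch.load(path))

restored.eval()

# Both networks should now give identical predictions
x = torch.randn(3, 30)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print("identical predictions:", same)
```

For predictions on new samples, the fitted `scaler` has to be kept as well, since the network only ever saw scaled features.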

## `Finally`
We built a breast cancer classification model using `PyTorch` and `scikit-learn`, achieving `96%` test accuracy with a neural network and `97%` with Logistic Regression on a well-processed medical dataset. Along the way we explored data preprocessing, evaluation, and model benchmarking.