# Breast Cancer Prediction System - Model Development

**Student Name:** Oluwalase Soboyejo  
**Matric Number:** 23CD034363  

This notebook develops a machine learning model to predict whether a breast tumor is benign or malignant using the Breast Cancer Wisconsin (Diagnostic) dataset.

**Note:** This system is strictly for educational purposes and must not be presented as a medical diagnostic tool.

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for ML
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

# Model persistence
import joblib
import os

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

## 2. Load the Breast Cancer Wisconsin Dataset

In [None]:
# Load the dataset from sklearn
breast_cancer = load_breast_cancer()

# Create a DataFrame
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)

# Add target variable
df['diagnosis'] = breast_cancer.target

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFeature Names:")
print(breast_cancer.feature_names)
print("\nTarget Names:", breast_cancer.target_names)  # 0 = malignant, 1 = benign

In [None]:
# Display first few rows
df.head()

In [None]:
# Display dataset info
df.info()

In [None]:
# Statistical summary
df.describe()

## 3. Data Preprocessing

### 3.1 Check for Missing Values

In [None]:
# Check for missing values
print("Missing Values per Column:")
print(df.isnull().sum())
print("\nTotal Missing Values:", df.isnull().sum().sum())

In [None]:
# Check for duplicates
print("Number of duplicate rows:", df.duplicated().sum())

### 3.2 Feature Selection

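Selecting **5 input features** from the recommended list. The list uses underscore-style names; scikit-learn's `load_breast_cancer` exposes the same columns as `mean radius`, `mean texture`, and so on, which is what the code uses: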
1. radius_mean
2. texture_mean
3. perimeter_mean
4. area_mean
5. concavity_mean
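
The two naming styles differ only in word order and separator, so the mapping can be sketched as a small helper. This is an illustration only: `to_sklearn_name` is a hypothetical function, not part of scikit-learn, and it assumes every name has the form `<measurement>_<statistic>`.

```python
def to_sklearn_name(underscore_name: str) -> str:
    """Convert e.g. 'radius_mean' to 'mean radius' (hypothetical helper)."""
    # Split off the trailing statistic ('mean') from the measurement ('radius')
    measurement, statistic = underscore_name.rsplit("_", 1)
    return f"{statistic} {measurement.replace('_', ' ')}"

recommended = ["radius_mean", "texture_mean", "perimeter_mean",
               "area_mean", "concavity_mean"]
converted = [to_sklearn_name(name) for name in recommended]
print(converted)
```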

In [None]:
# Define the 5 selected features
selected_features = [
    'mean radius',
    'mean texture',
    'mean perimeter',
    'mean area',
    'mean concavity'
]

# Create feature matrix with selected features only
X = df[selected_features].copy()

# Target variable
y = df['diagnosis'].copy()

print("Selected Features:")
for i, feature in enumerate(selected_features, 1):
    print(f"  {i}. {feature}")

print(f"\nFeature Matrix Shape: {X.shape}")
print(f"Target Vector Shape: {y.shape}")

In [None]:
# Display selected features
X.head()

### 3.3 Encode Target Variable

In [None]:
# The target is already encoded in sklearn's dataset:
# 0 = Malignant, 1 = Benign
print("Target Variable Distribution:")
print(y.value_counts())
print("\nMapping: 0 = Malignant, 1 = Benign")

# Visualize the distribution
plt.figure(figsize=(8, 5))
ax = sns.countplot(x=y, palette=['#FF6B6B', '#4ECDC4'])
plt.title('Distribution of Diagnosis', fontsize=14)
plt.xlabel('Diagnosis (0 = Malignant, 1 = Benign)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks([0, 1], ['Malignant (0)', 'Benign (1)'])

# Add count labels on bars
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', 
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.show()

### 3.4 Exploratory Data Analysis

In [None]:
# Correlation heatmap for selected features
plt.figure(figsize=(10, 8))
correlation_matrix = X.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Selected Features', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Box plots for each feature by diagnosis
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, feature in enumerate(selected_features):
    sns.boxplot(x=y, y=X[feature], ax=axes[idx], palette=['#FF6B6B', '#4ECDC4'])
    axes[idx].set_title(f'{feature} by Diagnosis', fontsize=12)
    axes[idx].set_xlabel('Diagnosis (0=Malignant, 1=Benign)')
    axes[idx].set_ylabel(feature)

# Remove the extra subplot
axes[-1].set_visible(False)

plt.tight_layout()
plt.show()

### 3.5 Split Data into Training and Testing Sets

In [None]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training Set Size: {X_train.shape[0]} samples")
print(f"Testing Set Size: {X_test.shape[0]} samples")
print(f"\nTraining Set Distribution:")
print(y_train.value_counts())
print(f"\nTesting Set Distribution:")
print(y_test.value_counts())

### 3.6 Feature Scaling (Mandatory for KNN)

KNN is a distance-based algorithm, so feature scaling is essential to ensure all features contribute equally to the distance calculations.
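
To see why this matters, the sketch below compares nearest neighbors before and after z-score scaling on a tiny, made-up dataset. The four `(mean area, mean concavity)` pairs are hypothetical values chosen for illustration, and `standardize` and `nearest_neighbor` are helper names introduced here. The scaling formula is the same one `StandardScaler` applies: subtract the column mean and divide by the population standard deviation.

```python
import math
from statistics import mean, pstdev

# Hypothetical mini-dataset: (mean area, mean concavity) for four tumors
samples = [(1001.0, 0.3001), (475.9, 0.0484), (520.0, 0.2900), (980.0, 0.0500)]

def standardize(rows):
    """Z-score each column: (value - column mean) / column standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def nearest_neighbor(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))

raw_nn = nearest_neighbor(samples, 0)
scaled_nn = nearest_neighbor(standardize(samples), 0)
print("Nearest neighbor of sample 0 on raw values:   ", raw_nn)
print("Nearest neighbor of sample 0 on scaled values:", scaled_nn)
```

On the raw values the area column, with its range in the hundreds, decides the neighbor almost by itself. After scaling, concavity carries comparable weight and a different sample becomes the closest one.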

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to DataFrame for better visualization
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=selected_features)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=selected_features)

print("Feature Scaling Applied Successfully!")
print("\nScaled Training Data Statistics:")
print(X_train_scaled_df.describe())

## 4. Model Implementation - K-Nearest Neighbors (KNN)

KNN is a simple yet effective classification algorithm: it assigns each new data point the majority class among its k nearest neighbors in the training set.
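
The majority-vote idea can be sketched in a few lines of plain Python. This is a teaching sketch, not how scikit-learn's `KNeighborsClassifier` is implemented: `knn_predict` is a hypothetical helper, and the toy points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy 2-D data: label 1 clusters near the origin, label 0 near (1, 1)
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
train_labels = [1, 1, 1, 0, 0]

print(knn_predict(train_points, train_labels, query=(0.9, 1.0), k=3))
print(knn_predict(train_points, train_labels, query=(0.05, 0.1), k=3))
```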

### 4.1 Find Optimal K Value

In [None]:
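# Test different values of k with 5-fold cross-validation on the training set.
# The test set is not used here, so the final evaluation stays unbiased.
from sklearn.model_selection import cross_val_score

k_range = range(1, 31)
accuracy_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
    accuracy_scores.append(cv_scores.mean())

# Find the best k (np.argmax returns the first maximum, i.e. the smallest best k)
best_k = k_range[np.argmax(accuracy_scores)]
best_accuracy = max(accuracy_scores)

print(f"Best K value: {best_k}")
print(f"Best Cross-Validation Accuracy: {best_accuracy:.4f}")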

In [None]:
# Visualize accuracy vs k values
plt.figure(figsize=(12, 6))
plt.plot(k_range, accuracy_scores, 'b-', marker='o', markersize=5)
plt.axvline(x=best_k, color='r', linestyle='--', label=f'Best K = {best_k}')
plt.xlabel('Number of Neighbors (K)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN Accuracy vs Number of Neighbors', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
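
The summary of this notebook attributes the choice of K to cross-validation. As a rough illustration of what k-fold cross-validation does, the sketch below splits sample indices into folds so that each sample is used for validation exactly once. `k_fold_indices` is a hypothetical helper written for this example. It does not shuffle or stratify, which scikit-learn's own splitters can do.

```python
def k_fold_indices(n_samples: int, n_folds: int = 5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds each take one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Ten hypothetical samples split into three folds
folds = list(k_fold_indices(10, n_folds=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

For each candidate K, a model would be trained on every `train_idx` set and scored on the matching `val_idx` set, and the scores averaged.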

### 4.2 Train the Final Model

In [None]:
# Initialize and train the KNN model with optimal k
knn_model = KNeighborsClassifier(n_neighbors=best_k, metric='euclidean')
knn_model.fit(X_train_scaled, y_train)

print(f"KNN Model trained successfully with K = {best_k}")
print(f"\nModel Parameters:")
print(knn_model.get_params())

## 5. Model Evaluation

In [None]:
# Make predictions on test set
y_pred = knn_model.predict(X_test_scaled)

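# Calculate evaluation metrics
# Note: precision, recall, and F1 use scikit-learn's default pos_label=1,
# so they describe the Benign class (1), not the Malignant class (0)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)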

print("=" * 50)
print("MODEL EVALUATION METRICS")
print("=" * 50)
print(f"\nAccuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print("=" * 50)

In [None]:
# Detailed classification report
print("\nDetailed Classification Report:")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.title('Confusion Matrix', fontsize=14)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

print(f"\nTrue Negatives (TN): {cm[0][0]} - Correctly predicted Malignant")
print(f"False Positives (FP): {cm[0][1]} - Malignant predicted as Benign")
print(f"False Negatives (FN): {cm[1][0]} - Benign predicted as Malignant")
print(f"True Positives (TP): {cm[1][1]} - Correctly predicted Benign")
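
The four cells above are enough to recompute the headline metrics by hand, which makes it explicit which class each metric refers to. The counts in this sketch are hypothetical and are not results from this notebook. The layout follows scikit-learn's `confusion_matrix` convention: rows are true labels, columns are predicted labels, index 0 is Malignant and index 1 is Benign.

```python
# Hypothetical counts, NOT the notebook's actual results
cm = [[40, 2],   # true Malignant: 40 predicted Malignant, 2 predicted Benign
      [3, 69]]   # true Benign:    3 predicted Malignant, 69 predicted Benign

tn, fp = cm[0]
fn, tp = cm[1]

# With Benign (1) as the positive class, as in the default scikit-learn metrics
precision_benign = tp / (tp + fp)
recall_benign = tp / (tp + fn)

# Share of malignant tumors that were correctly flagged as malignant
recall_malignant = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision (Benign): {precision_benign:.4f}")
print(f"Recall (Benign):    {recall_benign:.4f}")
print(f"Recall (Malignant): {recall_malignant:.4f}")
print(f"Accuracy:           {accuracy:.4f}")
```

In a screening context the malignant recall is usually the figure to watch, because a malignant tumor predicted as benign is the costlier error.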

## 6. Save the Trained Model

In [None]:
# Create a dictionary containing all necessary components
model_components = {
    'model': knn_model,
    'scaler': scaler,
    'feature_names': selected_features,
    'best_k': best_k
}

# Save using joblib
model_path = 'breast_cancer_model.pkl'
joblib.dump(model_components, model_path)

print(f"Model saved successfully to: {model_path}")
print(f"File size: {os.path.getsize(model_path) / 1024:.2f} KB")

## 7. Demonstrate Model Loading and Prediction

In [None]:
# Load the saved model
loaded_components = joblib.load(model_path)

# Extract components
loaded_model = loaded_components['model']
loaded_scaler = loaded_components['scaler']
loaded_features = loaded_components['feature_names']

print("Model loaded successfully!")
print(f"\nFeatures expected: {loaded_features}")

In [None]:
# Test prediction with sample data
# Sample 1: Typical Malignant tumor characteristics (larger values)
sample_malignant = np.array([[17.99, 10.38, 122.8, 1001, 0.3001]])

# Sample 2: Typical Benign tumor characteristics (smaller values)
sample_benign = np.array([[12.46, 24.04, 83.97, 475.9, 0.0484]])

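# Scale the samples (wrapped in DataFrames so the column names match
# the ones the scaler saw when it was fitted)
sample_malignant_scaled = loaded_scaler.transform(
    pd.DataFrame(sample_malignant, columns=loaded_features))
sample_benign_scaled = loaded_scaler.transform(
    pd.DataFrame(sample_benign, columns=loaded_features))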

# Make predictions
pred_malignant = loaded_model.predict(sample_malignant_scaled)
pred_benign = loaded_model.predict(sample_benign_scaled)

print("=" * 60)
print("DEMONSTRATION: Prediction without Retraining")
print("=" * 60)

print("\n--- Sample 1 (Expected: Malignant) ---")
print(f"Input Features:")
for feature, value in zip(loaded_features, sample_malignant[0]):
    print(f"  {feature}: {value}")
print(f"Prediction: {'Malignant' if pred_malignant[0] == 0 else 'Benign'}")

print("\n--- Sample 2 (Expected: Benign) ---")
print(f"Input Features:")
for feature, value in zip(loaded_features, sample_benign[0]):
    print(f"  {feature}: {value}")
print(f"Prediction: {'Malignant' if pred_benign[0] == 0 else 'Benign'}")

print("\n" + "=" * 60)

In [None]:
# Verify model consistency - predictions on test set should be the same
loaded_predictions = loaded_model.predict(X_test_scaled)
original_predictions = y_pred

predictions_match = np.array_equal(loaded_predictions, original_predictions)
print(f"Loaded model predictions match original: {predictions_match}")
print(f"Loaded model accuracy: {accuracy_score(y_test, loaded_predictions):.4f}")

## 8. Summary

### Model Development Summary

| Aspect | Details |
|--------|----------|
| **Dataset** | Breast Cancer Wisconsin (Diagnostic) |
| **Total Samples** | 569 |
| **Algorithm** | K-Nearest Neighbors (KNN) |
| **Selected Features** | radius_mean, texture_mean, perimeter_mean, area_mean, concavity_mean |
| **Optimal K Value** | Determined through cross-validation |
| **Feature Scaling** | StandardScaler |
| **Model Persistence** | Joblib |

### Performance Metrics

Accuracy, precision, recall, and F1-score on the held-out test set (20% of the data) are reported in Section 5, together with the classification report and confusion matrix.

### Next Steps

The saved model (`breast_cancer_model.pkl`) will be used in the Flask web application (`app.py`) to provide predictions through a user-friendly interface.
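
One step such an application would likely need is turning submitted form fields into a numeric list in the same order the scaler was fitted on. The sketch below is a minimal illustration under assumptions: `parse_features` is a hypothetical helper, the form keys are assumed to equal the scikit-learn feature names, and the non-negativity check is a choice made for this example. The actual `app.py` may handle input differently.

```python
# Feature order used during training (matches selected_features in this notebook)
FEATURE_ORDER = ["mean radius", "mean texture", "mean perimeter",
                 "mean area", "mean concavity"]

def parse_features(form: dict) -> list:
    """Return feature values as floats in training order; raise ValueError on bad input."""
    values = []
    for name in FEATURE_ORDER:
        raw = form.get(name)
        if raw is None or str(raw).strip() == "":
            raise ValueError(f"Missing value for '{name}'")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f"'{name}' must be a number, got {raw!r}") from None
        if value < 0:
            raise ValueError(f"'{name}' must be non-negative")
        values.append(value)
    return values

# Form data arrives as strings in a typical web request
example_form = {"mean radius": "17.99", "mean texture": "10.38",
                "mean perimeter": "122.8", "mean area": "1001",
                "mean concavity": "0.3001"}
parsed = parse_features(example_form)
print(parsed)
```

The resulting list can be wrapped in a one-row array or DataFrame, passed through the saved scaler, and then given to the saved model, as demonstrated in Section 7.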

---

**Disclaimer:** This system is strictly for educational purposes and must not be presented as a medical diagnostic tool.

In [None]:
print("\n" + "="*60)
print("MODEL DEVELOPMENT COMPLETED SUCCESSFULLY!")
print("="*60)
print(f"\nSaved Model: {model_path}")
print(f"Algorithm: K-Nearest Neighbors (K={best_k})")
print(f"Features: {selected_features}")
print(f"Model Persistence: Joblib")
print("\nReady for deployment with Flask web application!")
print("="*60)