# Breast Cancer Classification Using Supervised Learning Techniques

## Objective
The objective of this assessment is to evaluate various supervised learning techniques on the breast cancer dataset and compare their performances.

---

## 1. Loading and Preprocessing 
### Steps:
1. Load the dataset from `sklearn.datasets`.
2. Handle any missing values.
3. Perform feature scaling.

In [3]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Check for missing values
print("Missing Values:\n", X.isnull().sum())

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

Missing Values:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64
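
Note that the cell above fits `StandardScaler` on the full dataset before splitting, so the test rows contribute to the scaling statistics. On this dataset the effect is probably small, but a leak-free variant splits first and fits the scaler on the training rows only. The following is a minimal sketch of that ordering; the variable names (`X_raw`, `scaler_alt`, `X_tr`, `X_te`) are illustrative and not part of the original notebook, and the reported results below were produced with the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative names; the original notebook uses X, y, X_scaled instead
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split first so the test rows never influence the scaling statistics
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

scaler_alt = StandardScaler()
X_tr = scaler_alt.fit_transform(X_tr_raw)  # learn mean/std from training rows only
X_te = scaler_alt.transform(X_te_raw)      # reuse the training statistics

print(X_tr.shape, X_te.shape)
```

With this ordering the test-set columns are no longer exactly zero-mean, which is expected: they are scaled with the statistics of the training set.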


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression:\n", classification_report(y_test, y_pred_log_reg))


Logistic Regression:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [7]:
from sklearn.tree import DecisionTreeClassifier

# Train Decision Tree (no random_state is set, so results can vary slightly between runs)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Predict and evaluate
y_pred_dt = dt.predict(X_test)
print("Decision Tree Classifier:\n", classification_report(y_test, y_pred_dt))


Decision Tree Classifier:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



In [9]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest (no random_state is set, so results can vary slightly between runs)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred_rf = rf.predict(X_test)
print("Random Forest Classifier:\n", classification_report(y_test, y_pred_rf))


Random Forest Classifier:
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [11]:
from sklearn.svm import SVC

# Train SVM
svm = SVC()
svm.fit(X_train, y_train)

# Predict and evaluate
y_pred_svm = svm.predict(X_test)
print("Support Vector Machine:\n", classification_report(y_test, y_pred_svm))


Support Vector Machine:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [13]:
from sklearn.neighbors import KNeighborsClassifier

# Train k-NN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knn.predict(X_test)
print("k-Nearest Neighbors:\n", classification_report(y_test, y_pred_knn))


k-Nearest Neighbors:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



## 1. Logistic Regression
### How it Works:
Logistic Regression is a linear model for binary classification. It computes a weighted sum of the input features and passes it through the logistic (sigmoid) function to obtain the probability that a sample belongs to the positive class. The decision boundary is linear in the feature space, so the model works best when the classes are approximately linearly separable.
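
To make the role of the sigmoid concrete, the sketch below refits a model on the scaled data and recomputes its probabilities by hand from the learned coefficients. The names `sigmoid`, `X_demo`, and `clf` are illustrative, and `max_iter=1000` is an assumption added only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score w.x + b, then squashed into a probability for class 1
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = sigmoid(z)
p_sklearn = clf.predict_proba(X_demo)[:, 1]

# Check that the hand-computed probabilities agree with scikit-learn's
print(np.allclose(p_manual, p_sklearn))
```

A sample is assigned to class 1 when its probability exceeds 0.5, which is the same as its linear score `z` being positive.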

### Why Suitable:
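The breast cancer dataset is small (569 samples, 30 numeric features), and the test accuracy above suggests that its two classes are close to linearly separable after scaling, which makes Logistic Regression a good fit. The model is also computationally efficient, and its coefficients are easy to interpret.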

---

## 2. Decision Tree Classifier
### How it Works:
A Decision Tree Classifier splits the data into subsets based on feature values, creating a tree structure. Each node in the tree represents a decision rule, and the leaves represent class labels. It recursively partitions the data to minimize impurity (like Gini impurity or entropy).
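
The impurity measures mentioned above can be written in a few lines. This is a minimal sketch with hypothetical helper names (`gini`, `entropy`, `weighted_impurity`); it illustrates the criterion a tree evaluates for a candidate split and is not scikit-learn's internal implementation.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(gini(parent))     # 0.5 for a balanced two-class node
print(entropy(parent))  # 1.0 bit for a balanced two-class node

# A split that leaves one sample on the wrong side of each child
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])
print(weighted_impurity(left, right))  # 0.375, lower than the parent's 0.5
```

At each node the tree chooses the feature and threshold whose split gives the lowest weighted impurity.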

### Why Suitable:
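Decision Trees are interpretable and need no feature scaling, since each split compares a single feature against a threshold. scikit-learn's `DecisionTreeClassifier` expects numeric input, so categorical features would have to be encoded first; the breast cancer features are all numeric. However, an unpruned tree is prone to overfitting, particularly on a small dataset with 30 correlated features like this one.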
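
One way to see the overfitting tendency is to compare training and test accuracy for an unrestricted tree and a depth-limited one. The sketch below assumes `max_depth=3` as an arbitrary illustrative limit and fixes `random_state=42` for repeatability; neither setting comes from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_d, y_d = load_breast_cancer(return_X_y=True)
X_d_tr, X_d_te, y_d_tr, y_d_te = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42
)

results_by_depth = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_d_tr, y_d_tr)
    results_by_depth[depth] = {
        "train": tree.score(X_d_tr, y_d_tr),
        "test": tree.score(X_d_te, y_d_te),
        "depth": tree.get_depth(),
    }
    print(f"max_depth={depth}: {results_by_depth[depth]}")
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; limiting depth trades some training accuracy for a simpler model.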

---

## 3. Random Forest Classifier
### How it Works:
Random Forest is an ensemble method that trains many decision trees and aggregates their predictions to improve accuracy and robustness. Each tree is trained on a bootstrap sample of the training data (drawn with replacement), and only a random subset of the features is considered at each split, which decorrelates the trees.
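
The two sources of randomness can be reproduced by hand with plain decision trees. This is a simplified sketch, not the internals of `RandomForestClassifier`; the tree count of 25 and all variable names are assumptions chosen for illustration. Like scikit-learn's forest, it averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_f, y_f = load_breast_cancer(return_X_y=True)
X_f_tr, X_f_te, y_f_tr, y_f_te = train_test_split(
    X_f, y_f, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative; scikit-learn's default forest uses 100
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_f_tr), size=len(X_f_tr))
    # max_features="sqrt" considers a random feature subset at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_f_tr[idx], y_f_tr[idx]))

# Aggregate by averaging the per-tree class probabilities
proba = np.mean([t.predict_proba(X_f_te) for t in trees], axis=0)
y_pred_manual = proba.argmax(axis=1)
manual_accuracy = (y_pred_manual == y_f_te).mean()
print(f"Hand-built forest accuracy: {manual_accuracy:.4f}")
```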

### Why Suitable:
Because it averages many decorrelated trees, Random Forest is less prone to overfitting than a single decision tree. It also captures feature interactions and non-linear relationships without manual feature engineering, which suits the 30 correlated features of the breast cancer data.

---

## 4. Support Vector Machine (SVM)
### How it Works:
SVM finds the hyperplane that separates the classes with the largest margin, where the margin is the distance from the hyperplane to the closest training points of each class; those points are the support vectors. The kernel trick lets SVM learn non-linear boundaries by implicitly mapping the data to a higher-dimensional space without ever computing that mapping explicitly. `SVC()` as used above defaults to the RBF kernel.
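
The fitted model depends only on its support vectors, which can be checked by rebuilding the decision function from them. In the sketch below, `gamma=0.05` is an assumed value fixed purely so the kernel can be recomputed by hand (the notebook's `SVC()` uses the default `gamma="scale"`), and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_s, y_s = load_breast_cancer(return_X_y=True)
X_s = StandardScaler().fit_transform(X_s)

gamma = 0.05  # assumed value for the illustration, not the notebook's setting
svc = SVC(kernel="rbf", gamma=gamma).fit(X_s, y_s)

# RBF kernel between every sample and every support vector:
# K(x, s) = exp(-gamma * ||x - s||^2), shape (n_samples, n_support_vectors)
K = rbf_kernel(X_s, svc.support_vectors_, gamma=gamma)

# f(x) = sum_i (alpha_i * y_i) * K(x, s_i) + b; dual_coef_ stores alpha_i * y_i
f_manual = K @ svc.dual_coef_.ravel() + svc.intercept_[0]

print(len(svc.support_vectors_), "support vectors out of", len(X_s), "samples")
print(np.allclose(f_manual, svc.decision_function(X_s)))
```

The sign of `f_manual` gives the predicted class, and samples that are not support vectors play no part in the sum.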

### Why Suitable:
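SVM is effective when the number of features is large relative to the number of samples, and it can model complex decision boundaries. With 30 scaled features, the breast cancer dataset is a good match; scaling matters here because the RBF kernel depends on distances between samples.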

---

## 5. k-Nearest Neighbors (k-NN)
### How it Works:
k-NN is a non-parametric, distance-based algorithm. It classifies a data point based on the majority class of its k nearest neighbors. The distance metric (usually Euclidean) is used to find the closest neighbors.
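
The whole algorithm fits in one short function. The sketch below is a brute-force version for binary 0/1 labels with an odd `k`; the helper `knn_predict` and the other names are hypothetical, and the result is compared against `KNeighborsClassifier` as a sanity check.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_k, y_k = load_breast_cancer(return_X_y=True)
X_k_tr, X_k_te, y_k_tr, y_k_te = train_test_split(
    X_k, y_k, test_size=0.2, random_state=42
)
scaler_k = StandardScaler().fit(X_k_tr)
X_k_tr, X_k_te = scaler_k.transform(X_k_tr), scaler_k.transform(X_k_te)

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force k-NN for binary 0/1 labels; k should be odd to avoid ties."""
    # Euclidean distance from every query point to every training point
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points
    votes = y_train[nearest]                    # their labels, shape (n_query, k)
    return (votes.mean(axis=1) > 0.5).astype(int)  # majority vote

y_pred_manual_knn = knn_predict(X_k_tr, y_k_tr, X_k_te, k=5)
knn_ref = KNeighborsClassifier(n_neighbors=5).fit(X_k_tr, y_k_tr)
agreement = (y_pred_manual_knn == knn_ref.predict(X_k_te)).mean()
print(f"Agreement with scikit-learn: {agreement:.4f}")
```

There is no training step beyond storing the data, which is why all the cost of k-NN falls on prediction.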

### Why Suitable:
k-NN is simple and works well on small datasets. However, distances become less informative as the number of features grows (the "curse of dimensionality"), and the method is sensitive to feature scale, so its performance here depends on the scaling applied earlier and on a well-chosen `k`.
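
The value `n_neighbors=5` used earlier is the library default rather than a tuned choice. One possible way to tune it is a cross-validated grid search over a pipeline that includes the scaler, sketched below; the candidate values for `k` and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_g, y_g = load_breast_cancer(return_X_y=True)
X_g_tr, X_g_te, y_g_tr, y_g_te = train_test_split(
    X_g, y_g, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}  # illustrative candidates

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_g_tr, y_g_tr)

print("Best k:", grid.best_params_["knn__n_neighbors"])
print(f"Cross-validated accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_g_te, y_g_te):.4f}")
```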

---

## Model Performance Analysis

In [15]:
from sklearn.metrics import accuracy_score

# Calculate accuracy for each model
models = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM', 'k-NN']
accuracies = [
    accuracy_score(y_test, y_pred_log_reg),
    accuracy_score(y_test, y_pred_dt),
    accuracy_score(y_test, y_pred_rf),
    accuracy_score(y_test, y_pred_svm),
    accuracy_score(y_test, y_pred_knn)
]

# Display results
results = pd.DataFrame({'Model': models, 'Accuracy': accuracies})
print(results.sort_values(by='Accuracy', ascending=False))


                 Model  Accuracy
0  Logistic Regression  0.973684
3                  SVM  0.973684
2        Random Forest  0.964912
1        Decision Tree  0.947368
4                 k-NN  0.947368



### Best Performing Algorithm:
- **Logistic Regression** and **SVM** tied for the highest accuracy, **97.37%** (111 of 114 test samples correct). This suggests that the classes are close to linearly separable in the scaled feature space: the linear model matches the RBF-kernel SVM, so the kernel's added flexibility brings no gain on this split.

### Worst Performing Algorithm:
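- **Decision Tree** and **k-NN** tied for the lowest accuracy, **94.74%** (108 of 114 test samples correct). A likely explanation is that the unpruned Decision Tree overfits the training data, while k-NN's distance computation weights all 30 features equally, including the less informative ones. The gap to the best models is only three test samples on a single split, so the ranking should be read with caution.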
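
Because the comparison above rests on a single 114-sample test set, a cross-validated comparison gives a steadier picture. The sketch below is one possible setup and not part of the original analysis: the 10-fold stratified scheme, the fixed seeds, and `max_iter=1000` are all assumptions, and its scores will differ somewhat from the table above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_breast_cancer(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

cv_summary = {}
for name, model in candidates.items():
    # The scaler is refit inside each fold, so no test fold leaks into scaling
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name:20s} {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the standard deviations overlap between two models, the difference between them on a single split is probably not meaningful.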
