<a href="https://colab.research.google.com/github/edaaydinea/machine-learning/blob/master/Breast%20Cancer%20Classification/%20Breast_Cancer_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Day 08 - AIWC

# **Breast Cancer Classification using Machine Learning**

Breast cancer is the most common cancer among women worldwide, accounting for **25% of all cancer cases** and affecting **2.1 million people in 2015**. Early diagnosis significantly increases the chances of survival.

## **Challenges in Breast Cancer Detection**
The primary challenge in cancer detection is classifying tumors as **malignant** (cancerous) or **benign** (non-cancerous). Machine learning techniques can dramatically enhance diagnostic accuracy.

Research indicates that even experienced physicians achieve **79% accuracy** in cancer diagnosis. By leveraging machine learning, we can improve classification accuracy and assist doctors in making more precise decisions.

---

## **Stages of Cancer Detection**

### **First Stage: Cell Extraction**
This process involves extracting a small sample of cells from the tumor for analysis.

- **Benign tumors** – These do not spread across the body, meaning the patient is generally safe.
- **Malignant tumors** – These are cancerous and require immediate medical intervention to prevent further growth and spread.

---

## **Machine Learning for Breast Cancer Classification**

### **Objective**
Our goal is to train a machine learning model to classify breast cancer tumors as **malignant or benign** based on extracted features from medical images.

### **Process**
1. **Image Processing:** Extract images of tumors.
2. **Feature Extraction:** Identify key characteristics from the images, such as:
   - **Radius**
   - **Cell count**
   - **Texture**
   - **Perimeter**
   - **Area**
   - **Smoothness**
3. **Model Training:** Feed these extracted features into a machine learning model.
4. **Prediction:** The trained model classifies tumors as **malignant or benign** with high accuracy.

### **Conclusion**
By teaching machines to classify tumors effectively, we enhance early detection, improving patient outcomes and supporting medical professionals in making better-informed decisions.

# **The Problem in Machine Learning Terms**

## **Dataset Overview**
### **Input Features:** (30 Features)
- **Radius**
- **Texture**
- **Perimeter**
- **Area**
- **Smoothness**
- *(...other features included in the dataset)*

### **Target Classes:** (Binary Classification)
- **0 - Malignant (Cancerous)**
- **1 - Benign (Non-cancerous)**

### **Dataset Information**
- **Number of Instances:** 569
- **Class Distribution:**
  - **Malignant:** 212 cases
  - **Benign:** 357 cases
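
The figures above can be checked directly against the copy of the dataset bundled with scikit-learn. This is a minimal sketch; the variable names `counts` and `class_counts` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Feature matrix: one row per tumor sample, one column per feature
n_samples, n_features = cancer.data.shape

# target_names is ordered by label value: index 0 -> 'malignant', index 1 -> 'benign'
counts = np.bincount(cancer.target)
class_counts = {str(name): int(count) for name, count in zip(cancer.target_names, counts)}

print(n_samples, n_features)
print(class_counts)
```

Counting labels with `np.bincount` works here because the targets are the small non-negative integers 0 and 1.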

### **Data Sources:**
- [Breast Cancer Wisconsin (Diagnostic)](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
- [Breast Cancer Detection with Reduced Feature Set](https://www.researchgate.net/publication/271907638_Breast_Cancer_Detection_with_Reduced_Feature_Set)

---

## **Binary Classification Representation**
When analyzing the **30 features**, our goal is to classify the cancer type:
- **0 → Malignant**
- **1 → Benign**

This is a **binary classification problem**, meaning the model predicts **either 0 or 1**, corresponding to malignant or benign tumors.

---

# **Support Vector Machine (SVM) Classifier**
### **Why Use SVM?**
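Tumors whose features place them near the boundary between the two classes are the hardest to label as malignant or benign. **Support Vector Machines (SVMs)** are well suited to this situation because they build the **maximum margin hyperplane**, the boundary that stays as far as possible from the closest samples of each class.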

### **How Does SVM Work?**
- SVM finds the **optimal boundary** between classes by using **support vectors**—the most relevant data points that define the decision boundary.
- It creates a **maximum margin hyperplane** that separates malignant and benign tumors.
- **Support vectors** are the key points that lie closest to the decision boundary and influence its position.
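
For readers who want the idea stated precisely, the textbook hard-margin formulation is sketched below. It assumes the data are linearly separable and that the labels are recoded as $y_i \in \{-1, +1\}$ rather than the 0/1 encoding used in this notebook; $w$ and $b$ denote the hyperplane's normal vector and offset.

$$
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n
$$

The margin between the two classes has width $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. The support vectors are the samples for which the constraint holds with equality.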

### **Why is SVM Powerful?**
- SVM is a **margin-based learning algorithm** whose decision boundary is determined **only by the critical data points** (support vectors).
- Training points that lie far from the boundary do not affect where it is placed; **only those on or near the margin** do, which makes SVM highly effective in classification tasks.
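
The claim that only the support vectors matter can be illustrated on a small example. The sketch below uses hypothetical 2-D points (not the cancer data) and a linear kernel with a large `C` to approximate a hard margin. It fits one model on all points and a second model on the support vectors alone, then compares the two boundaries.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (hypothetical values, not from the cancer dataset)
X = np.array([[0, 0], [0, 1], [1, 0], [-1, -1],
              [3, 3], [4, 3], [3, 4], [5, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C approximates a hard margin on separable data
full_model = SVC(kernel="linear", C=1000.0).fit(X, y)
sv_idx = full_model.support_  # indices of the support vectors in X

# Refit using only the support vectors found above
sv_model = SVC(kernel="linear", C=1000.0).fit(X[sv_idx], y[sv_idx])

print("support vectors:", len(sv_idx), "of", len(X))
print("full model    w, b:", full_model.coef_[0], full_model.intercept_[0])
print("SV-only model w, b:", sv_model.coef_[0], sv_model.intercept_[0])
```

Both fits should report essentially the same `w` and `b`, up to solver tolerance, even though the second model saw only a subset of the points.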

---

## **Conclusion**
Using **Support Vector Machines (SVMs)**, we can classify breast cancer cases more effectively by focusing on the **key features** and **critical boundary points**, leading to **higher accuracy in distinguishing malignant and benign tumors**.

### **IMPORTING DATA**

In [None]:
# import libraries 
import pandas as pd # Import Pandas for data manipulation using dataframes
import numpy as np # Import NumPy for numerical computing on arrays
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import seaborn as sns # Statistical data visualization
# %matplotlib inline

In [None]:
# Import the breast cancer data from the scikit-learn library
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [None]:
cancer

In [None]:
# List the keys available in the dataset object
cancer.keys()

In [None]:
# print them one by one
print(cancer['DESCR'])

In [None]:
print(cancer['target'])

In [None]:
print(cancer['target_names'])

In [None]:
print(cancer['feature_names'])

In [None]:
print(cancer['data'])

In [None]:
cancer['data'].shape

In [None]:
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))

In [None]:
df_cancer.head(5)

In [None]:
df_cancer.tail(5)
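
The DataFrame above is assembled by hand with `np.c_`. As an alternative sketch, assuming scikit-learn 0.23 or newer with pandas installed, the loader can return a ready-made DataFrame. The name `df_alt` is illustrative only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final 'target' column
df_alt = cancer_frame.frame

print(df_alt.shape)
print(df_alt.columns[-1])
```

One difference to keep in mind: in this version the `target` column stays an integer, whereas stacking it next to the float features with `np.c_` converts it to float.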

### **VISUALIZING THE DATA**

In [None]:
sns.pairplot(df_cancer,vars= ['mean radius','mean texture', 'mean area', 'mean perimeter', 'mean smoothness'])

The only problem is that this plot doesn't show the target class: it doesn't tell us which samples are malignant and which are benign.

In [None]:
sns.pairplot(df_cancer,hue = 'target', vars= ['mean radius','mean texture', 'mean area', 'mean perimeter', 'mean smoothness'])

With seaborn's default palette, the blue points are the malignant cases (target 0) and the orange points are the benign cases (target 1).

In [None]:
# Pass the column by name; recent seaborn versions expect keyword arguments here
sns.countplot(x='target', data=df_cancer)

Let's take one pair of features from the pair plot above and look at it more closely.

In [None]:
sns.scatterplot(x='mean area', y='mean smoothness', hue='target', data=df_cancer)

Let's check the correlation between the variables.

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df_cancer.corr(), annot=True)
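
With 31 columns the annotated heatmap is dense, so it can help to rank the strongest feature-to-feature correlations numerically as well. The following is a sketch under the assumption that absolute Pearson correlation is the quantity of interest; `features`, `upper`, and `top_pairs` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Absolute Pearson correlation between every pair of features
corr = features.corr().abs()

# Keep each pair once: mask out the diagonal and the lower triangle
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Flatten to (feature_a, feature_b) -> correlation and show the strongest pairs
top_pairs = upper.stack().dropna().sort_values(ascending=False).head(5)
print(top_pairs)
```

Size-related measurements such as radius, perimeter, and area are strongly correlated with one another, which is expected since they describe the same geometry.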

### **MODEL TRAINING (FINDING A PROBLEM SOLUTION)**

In [None]:
# Let's drop the target label column
x = df_cancer.drop(['target'],axis=1)

In [None]:
x

In [None]:
y = df_cancer['target']
y

# **Model Training and Testing Process**

In order to build an effective breast cancer classification model, we follow a **train-test split** approach:

1. **Training Phase:**  
   - We use a **subset of the dataset** for training the machine learning model.
   - The model learns patterns and relationships between **features** and their corresponding **labels** (Malignant or Benign).

2. **Testing Phase:**  
   - After training, we evaluate the model using a **testing dataset**.
   - The testing dataset contains data that the model **has never seen before**.
   - This ensures that the model’s predictions are **generalizable** and not just memorized from the training data.

By following this approach, we can measure the model's performance and ensure it works effectively on new, unseen cases.
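
As an optional variant of the split described above, `train_test_split` accepts a `stratify` argument that keeps the class ratio roughly equal in both subsets. The notebook's own split does not use it. The sketch below reuses the same `test_size` and `random_state` purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the malignant/benign ratio approximately equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5, stratify=y
)

print(X_train.shape, X_test.shape)
# Because benign is encoded as 1, the mean of y is the share of benign cases
print("benign share overall / train / test:", y.mean(), y_train.mean(), y_test.mean())
```

Stratifying matters most when classes are imbalanced or the test set is small, since a purely random split can then leave one class under-represented.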

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state= 5)

In [None]:
x_train

In [None]:
x_train.shape

In [None]:
x_test

In [None]:
x_test.shape

In [None]:
y_train

In [None]:
y_train.shape

In [None]:
y_test

In [None]:
y_test.shape

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
svc_model = SVC()

In [None]:
svc_model.fit(x_train, y_train)

### **EVALUATING THE MODEL**

We evaluate the model on the testing data, which it has never seen before.

In [None]:
y_predict = svc_model.predict(x_test)

In [None]:
y_predict

Next we plot a confusion matrix, which compares the true labels with the predicted labels.

In [None]:
cm = confusion_matrix(y_test, y_predict)

In [None]:
sns.heatmap(cm, annot=True)

In [None]:
print(classification_report(y_test, y_predict))
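
To make the confusion matrix easier to read, here is a small sketch with hypothetical labels (not the notebook's predictions). In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, both ordered `[0, 1]`, which here means `[malignant, benign]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = malignant, 1 = benign)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)

# Row 0: true malignant cases; row 1: true benign cases
malignant_correct, malignant_missed = cm[0]
benign_flagged, benign_correct = cm[1]

# Accuracy: share of all samples that were classified correctly
accuracy = (malignant_correct + benign_correct) / cm.sum()

# Recall for the malignant class: share of true malignant cases the model caught
malignant_recall = malignant_correct / (malignant_correct + malignant_missed)

print(cm)
print("accuracy:", accuracy, "malignant recall:", malignant_recall)
```

In a screening setting, the top-right cell (malignant tumors predicted as benign) is usually the most costly error, so malignant recall deserves attention alongside overall accuracy.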

### **IMPROVING THE MODEL**

In [None]:
min_train = x_train.min()
min_train

In [None]:
range_train = (x_train - min_train).max()
range_train

In [None]:
x_train_scaled = (x_train - min_train)/range_train
x_train_scaled
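
The manual calculation above is min-max normalization: each feature is mapped to `(value - min) / (max - min)`. The same transformation is available as `MinMaxScaler` in scikit-learn. The sketch below assumes the same split parameters as the notebook and reuses the scaler fitted on the training data for the test data, which is the common practice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn per-feature min and range from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training min and range

# Manual min-max formula for comparison
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
manual = (X_train - train_min) / train_range

print(np.allclose(X_train_scaled, manual))
```

Because the test data are transformed with training statistics, a few test values may fall slightly outside the `[0, 1]` interval; that is expected.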

In [None]:
sns.scatterplot(x = x_train['mean area'], y= x_train['mean smoothness'], hue= y_train)

In [None]:
sns.scatterplot(x= x_train_scaled['mean area'], y= x_train_scaled['mean smoothness'], hue= y_train)

In [None]:
# Scale the test set with the minimum and range learned from the training set,
# so that both sets go through exactly the same transformation
x_test_scaled = (x_test - min_train) / range_train

In [None]:
svc_model.fit(x_train_scaled, y_train)

In [None]:
y_predict = svc_model.predict(x_test_scaled)

In [None]:
cm = confusion_matrix(y_test, y_predict)

In [None]:
sns.heatmap(cm, annot=True, fmt = 'd')

In [None]:
print(classification_report(y_test, y_predict))

### **IMPROVING THE MODEL - PART 2**

In [None]:
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} 
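
The two parameters being searched play different roles. `C` controls how strongly misclassified training points are penalized, and `gamma` controls how quickly the RBF kernel's similarity $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$ falls off with distance. The sketch below uses two hypothetical, already-scaled feature vectors to show the effect of the `gamma` values from the grid, and compares the manual formula with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical, already-scaled feature vectors (illustration only)
a = np.array([[0.2, 0.4, 0.1]])
b = np.array([[0.3, 0.1, 0.5]])

manual_similarity = {}
library_similarity = {}
for gamma in [1, 0.1, 0.01, 0.001]:
    # RBF kernel written out by hand
    manual_similarity[gamma] = float(np.exp(-gamma * np.sum((a - b) ** 2)))
    # Same quantity computed by scikit-learn
    library_similarity[gamma] = float(rbf_kernel(a, b, gamma=gamma)[0, 0])
    print(f"gamma={gamma}: similarity={manual_similarity[gamma]:.4f}")
```

A larger `gamma` makes the similarity drop faster, so each training point influences a smaller neighborhood and the boundary can become more intricate. A smaller `gamma` gives a smoother boundary.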

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = GridSearchCV(SVC(), param_grid, refit= True, verbose= 4)

In [None]:
grid.fit(x_train_scaled, y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_prediction = grid.predict(x_test_scaled)

In [None]:
cm = confusion_matrix(y_test, grid_prediction)

In [None]:
sns.heatmap(cm, annot=True)

In [None]:
print(classification_report(y_test,grid_prediction ))
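
One possible refinement, shown here only as a sketch and not as the notebook's method, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` and run the grid search over the pipeline. The scaling statistics are then recomputed inside each cross-validation fold. The step names `scaler` and `svc` are arbitrary choices, and the grid mirrors the one used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

# Scaling and classification are chained so they are always fitted together
pipeline = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1, 0.1, 0.01, 0.001]}

grid = GridSearchCV(pipeline, param_grid, refit=True)
grid.fit(X_train, y_train)

# score() on a classifier pipeline reports accuracy on the given data
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

The exact best parameters and accuracy depend on the split and library version, so they are printed rather than stated here.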

# **Conclusion**

- Machine Learning techniques, particularly **Support Vector Machines (SVM)**, successfully classified tumors as **Malignant or Benign** with **97% accuracy**.
- This approach enables the **rapid evaluation** of breast masses, allowing for **automated classification** with high precision.
- **Early breast cancer detection** can significantly improve survival rates, especially in **developing regions** where access to expert medical diagnosis is limited.
- The technique can be further enhanced by integrating **Computer Vision** and **Machine Learning** to directly classify cancer using **tissue images**, leading to even more accurate and automated detection methods.