
**K Nearest Neighbors (KNN) & SVM Project**

1. Introduction

1.1 Problem Statement

Breast cancer is a leading cause of death among women worldwide. In this dataset, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The task is to classify each mass as benign or malignant from these features.

1.2 Objectives

- Apply the KNN and SVM algorithms to the Breast Cancer Wisconsin dataset to classify tumors as benign or malignant.

- Perform data preprocessing, including cleaning and feature selection.

- Optimize hyperparameters for both KNN and SVM to achieve the best performance.

- Compare the performance of KNN and SVM using accuracy, precision, recall, and F1 score.



2. Data Preparation

2.1 Data Collection

- Dataset: Breast Cancer Wisconsin (Diagnostic) dataset, available from the UCI Machine Learning Repository and loaded here through scikit-learn's `load_breast_cancer`.

Summary:

- Number of Features: 30 numeric features describing characteristics of the cell nuclei in each image.

- Number of Samples: 569 samples.

- Feature Description: Measurements such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, each reported as a mean, an error, and a worst value.

- Label Description: The target variable indicates whether the tumor is malignant (label 0) or benign (label 1). This is the encoding used by scikit-learn's `load_breast_cancer`.
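As a quick sanity check on the summary above, the following sketch reads the sample count, feature count, and label names directly from the object returned by `load_breast_cancer`. The variable names (`dataset`, `label_names`, `class_counts`) are illustrative and not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Shape of the feature matrix: (n_samples, n_features)
n_samples, n_features = dataset.data.shape

# target_names is indexed by label value, so position 0 names label 0
label_names = {int(i): str(name) for i, name in enumerate(dataset.target_names)}

# Number of samples per class, keyed by class name
class_counts = Counter(label_names[int(t)] for t in dataset.target)

print(f"samples={n_samples}, features={n_features}")
print(f"label encoding: {label_names}")
print(f"class counts: {dict(class_counts)}")
```

Checking the encoding this way matters later on, because precision, recall, and F1 all depend on which label is treated as the positive class.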




In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

In [None]:
# Load the dataset from sklearn
data = load_breast_cancer()

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(data.data, columns=data.feature_names)
df

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [None]:
# Checking for missing values
print(df.isnull().sum())

# Handling missing values (if any, although the dataset is clean)
df.fillna(df.mean(), inplace=True)

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64


In [None]:
# Standardizing the features
scaler = StandardScaler()
X = df
X_scaled = scaler.fit_transform(X)

# Target labels are already numeric in scikit-learn: 0 = malignant, 1 = benign
y = data.target

3. Model Training

3.1 Splitting the Data

The dataset was split into training and test sets using an 80-20 ratio (`test_size=0.2`, `random_state=42`) so that model performance is measured on data the models were not trained on.

3.2 Training the KNN and SVM Models


In [None]:
from sklearn.model_selection import train_test_split

# Split the scaled features and labels into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
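Note that the scaler above is fitted on the full dataset before the split, so the test samples contribute to the mean and variance used for scaling. One common way to avoid this is to split the raw features first and let a scikit-learn pipeline fit the scaler on the training portion only. The sketch below is an alternative arrangement, not the procedure used for the results reported in this notebook; the names `X_raw`, `Xr_train`, and `knn_pipe` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_all = load_breast_cancer(return_X_y=True)

# Split the unscaled data first (same ratio and seed as the cell above)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_all, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training set only, then reuses
# those training statistics when transforming the test set
knn_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
)
knn_pipe.fit(Xr_train, yr_train)

print(f"KNN accuracy with train-only scaling: {knn_pipe.score(Xr_test, yr_test):.2f}")
```

On a dataset of this size the difference in scores is likely to be small, but the pipeline form also carries over directly to cross-validation.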

KNN Algorithm

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# Train the model
knn.fit(X_train, y_train)

# Predictions
y_pred_knn = knn.predict(X_test)

# Evaluation
accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
f1_knn = f1_score(y_test, y_pred_knn)

print(f'KNN Accuracy: {accuracy_knn:.2f}')
print(f'KNN Precision: {precision_knn:.2f}')
print(f'KNN Recall: {recall_knn:.2f}')
print(f'KNN F1 Score: {f1_knn:.2f}')


KNN Accuracy: 0.95
KNN Precision: 0.96
KNN Recall: 0.96
KNN F1 Score: 0.96


SVM Algorithm

In [None]:
from sklearn.svm import SVC

# Initialize the SVM classifier
svm = SVC(kernel='rbf', C=1.0, gamma='scale')

# Train the model
svm.fit(X_train, y_train)

# Predictions
y_pred_svm = svm.predict(X_test)

# Evaluation
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)

print(f'SVM Accuracy: {accuracy_svm:.2f}')
print(f'SVM Precision: {precision_svm:.2f}')
print(f'SVM Recall: {recall_svm:.2f}')
print(f'SVM F1 Score: {f1_svm:.2f}')


SVM Accuracy: 0.97
SVM Precision: 0.97
SVM Recall: 0.99
SVM F1 Score: 0.98


4. Model Evaluation

4.1 Performance Metrics

Accuracy: The proportion of correctly classified instances.

Precision: The proportion of true positives among all positive predictions.

Recall: The proportion of true positives among all actual positives.

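F1 Score: The harmonic mean of precision and recall, balancing the two.

Note: precision, recall, and F1 are computed with scikit-learn's default `pos_label=1`, so the positive class in the reported results is benign (label 1 in `load_breast_cancer`), not malignant.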
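To make these definitions concrete, the sketch below computes each metric by hand from confusion-matrix counts and shows how `pos_label` switches the class being scored. The labels are toy values made up for illustration, not predictions from the models in this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (not taken from the dataset)
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp with 1 as positive
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn scores label 1 by default; pos_label=0 scores the other
# class instead (malignant, in load_breast_cancer's encoding)
recall_for_0 = recall_score(y_true, y_pred, pos_label=0)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}, recall(label 0)={recall_for_0:.2f}")
```

For a screening task, recall on the malignant class is often the figure of most interest, so passing `pos_label=0` to `recall_score` on the real test predictions may be a useful addition to the comparison.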

In [None]:
print("KNN vs SVM Performance")
print(f'KNN - Accuracy: {accuracy_knn:.2f}, Precision: {precision_knn:.2f}, Recall: {recall_knn:.2f}, F1 Score: {f1_knn:.2f}')
print(f'SVM - Accuracy: {accuracy_svm:.2f}, Precision: {precision_svm:.2f}, Recall: {recall_svm:.2f}, F1 Score: {f1_svm:.2f}')


KNN vs SVM Performance
KNN - Accuracy: 0.95, Precision: 0.96, Recall: 0.96, F1 Score: 0.96
SVM - Accuracy: 0.97, Precision: 0.97, Recall: 0.99, F1 Score: 0.98
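The models above use fixed hyperparameters (`n_neighbors=5` for KNN; `C=1.0` and `gamma='scale'` for SVM). The objective of optimizing hyperparameters could be approached with a cross-validated grid search, sketched below. The search grids, the 5-fold setting, and the F1 scoring choice are assumptions made for illustration and are not values from the original project; the names `X_full`, `X_tune_train`, `candidates`, and `best` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_full, y_full = load_breast_cancer(return_X_y=True)
X_tune_train, X_tune_test, y_tune_train, y_tune_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42
)

# Candidate estimators with illustrative (assumed) search grids
candidates = {
    "KNN": (
        KNeighborsClassifier(),
        {"clf__n_neighbors": [3, 5, 7, 9, 11],
         "clf__weights": ["uniform", "distance"]},
    ),
    "SVM": (
        SVC(kernel="rbf"),
        {"clf__C": [0.1, 1, 10, 100],
         "clf__gamma": ["scale", 0.01, 0.001]},
    ),
}

best = {}
for name, (clf, grid) in candidates.items():
    # Scaling sits inside the pipeline so each CV fold is scaled
    # using statistics from its own training portion only
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    # scoring="f1" uses label 1 (benign) as the positive class
    search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
    search.fit(X_tune_train, y_tune_train)
    best[name] = search
    # search.score() applies the same F1 scorer to the held-out test set
    print(name, search.best_params_,
          f"CV F1: {search.best_score_:.3f}",
          f"test F1: {search.score(X_tune_test, y_tune_test):.3f}")
```

Selecting hyperparameters by cross-validation on the training set, and touching the test set only once at the end, keeps the final comparison between KNN and SVM from being biased by the tuning itself.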


