<a href="https://www.kaggle.com/code/manishkr1754/breast-cancer-classification?scriptVersionId=144981123" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Breast Cancer Classification</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

Breast cancer is a widespread and potentially life-threatening medical condition that affects a significant portion of the population, predominantly women. Timely and precise diagnosis of breast cancer plays a crucial role in determining treatment options and improving patient outcomes. In this context, the application of machine learning offers a promising avenue to tackle this healthcare challenge.

This project belongs to the domain of **Medical Diagnosis and Classification using Machine Learning**. The primary goal is **to develop a predictive model for the classification of breast cancer by analyzing a comprehensive dataset that includes various clinical attributes, mammography findings and patient demographics**.

## 2) Understanding Data
---

The project uses **Breast Cancer Data**, which contains several predictor variables (independent variables) and one outcome variable (dependent variable), `diagnosis`.

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Loading Data

In [None]:
breast_cancer_data = pd.read_csv('Datasets/Day19_Breast_Cancer_Data.csv') 

In [None]:
breast_cancer_data

In [None]:
print('The size of Dataframe is: ', breast_cancer_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
breast_cancer_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in breast_cancer_data.columns if breast_cancer_data[feature].dtype != 'O']
categorical_features = [feature for feature in breast_cancer_data.columns if breast_cancer_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing values per column of the DataFrame are as follows:')
print('-'*100)
# count of missing values per column, and the same as a percentage of all rows
total = breast_cancer_data.isnull().sum().sort_values(ascending=False)
percent = (breast_cancer_data.isnull().mean() * 100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
breast_cancer_data.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
breast_cancer_data.describe(include='object')

In [None]:
breast_cancer_data['diagnosis'].value_counts() # diagnosis is the target variable

## 5) Data Cleaning and Preprocessing
---

### Dropping unwanted columns

In [None]:
breast_cancer_data = breast_cancer_data.drop(columns=['Unnamed: 32'])

In [None]:
breast_cancer_data

### Encoding 'M'(Malignant) as 0 and 'B'(Benign) as 1

In [None]:
breast_cancer_data['diagnosis'] = breast_cancer_data['diagnosis'].map({'M':0,'B':1})

In [None]:
breast_cancer_data
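
`Series.map` with a dict returns `NaN` for any value that is missing from the dict, so a quick check after encoding can catch unexpected labels. A minimal sketch on a toy Series (the values and the name `toy_diagnosis` are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy labels (hypothetical); 'X' stands in for an unexpected value
toy_diagnosis = pd.Series(['M', 'B', 'B', 'X', 'M'])

# Same mapping as above: malignant -> 0, benign -> 1
encoded = toy_diagnosis.map({'M': 0, 'B': 1})

# Labels that are missing from the mapping become NaN
n_unmapped = int(encoded.isna().sum())
print('Unmapped labels:', n_unmapped)
```

On the real DataFrame the same check would be `breast_cancer_data['diagnosis'].isna().sum()` right after the mapping step.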

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = breast_cancer_data.drop(columns=['id', 'diagnosis']) # Feature matrix
y = breast_cancer_data['diagnosis'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
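
Fitting the scaler on the full feature matrix, as above, lets statistics of the rows that later form the test set influence preprocessing. One common alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler is fitted on the training rows only. A minimal sketch, assuming scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV (it uses the same 0 = malignant, 1 = benign encoding); the `_demo` and short split names are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Split before any scaling so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# The pipeline fits StandardScaler on X_tr only, then reuses it for X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print('Test accuracy:', round(demo_accuracy, 4))
```

The scores from this variant may differ slightly from those reported later in this notebook, because the scaling statistics come from the training split alone.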

### Train-Test Split

In [None]:
# train_test_split is already imported above; stratify=y keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
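
To confirm that `stratify=y` preserved the class balance, the share of the positive class can be compared across the full, training and test labels. A small sketch on synthetic labels (the counts 212 and 357 and the `_toy` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels (hypothetical counts): 212 zeros and 357 ones
y_toy = pd.Series([0] * 212 + [1] * 357)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, unused by the check

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=45
)

# With 0/1 labels, the mean is the share of class 1
full_share = y_toy.mean()
train_share = y_tr_toy.mean()
test_share = y_te_toy.mean()
print(round(full_share, 3), round(train_share, 3), round(test_share, 3))
```

The three shares should agree to within about one sample's worth of rounding.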

### Model Comparison : Training & Evaluation

In [None]:
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

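for model in models:
    # instantiate each classifier with default hyperparameters and fit it on the training split
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # precision, recall and F1 use the default pos_label=1, i.e. benign under the encoding above
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))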

In [None]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df
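
`precision_score`, `recall_score` and `f1_score` default to `pos_label=1`, which under the encoding used here is the benign class. To report how well malignant cases are caught, `pos_label=0` can be passed, and `confusion_matrix` (imported earlier) shows the raw counts. A small sketch on made-up labels (the arrays are illustrative, not model output):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 0 = malignant, 1 = benign
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm)

# Default pos_label=1: share of benign cases that were predicted benign
benign_recall = recall_score(y_true_toy, y_pred_toy)
# pos_label=0: share of malignant cases that were predicted malignant
malignant_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
print(benign_recall, malignant_recall)
```

In this toy case 3 of 4 malignant cases are caught (recall 0.75) while 5 of 6 benign cases are caught, which shows how the two views can differ.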

### Inference

In the context of breast cancer classification, with benign encoded as the positive class (1):

- The Logistic Regression and Support Vector Machine (SVM) models achieve the highest accuracy, 96.49%, so they separate benign (B) from malignant (M) cases best on this test split. Both also have precision and recall above 95%, which means few false positives and few false negatives.

- The Decision Tree model reaches an accuracy of 91.23% but lags behind in precision and recall, so it classifies cases less reliably than the other models.

- The Random Forest model, with an accuracy of 94.74%, balances precision and recall. Because benign is encoded as the positive class (1), its precision of 97.14% means few malignant cases were predicted as benign, and its recall of 94.44% means most benign cases were correctly identified.

In conclusion, the Logistic Regression and SVM models exhibit the best overall performance, emphasizing their potential for breast cancer classification. However, the Random Forest model also proves to be a reliable choice with a well-rounded performance. The Decision Tree, while decent, may benefit from further optimization to enhance its predictive capabilities.
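
One way to pursue the further optimization suggested for the Decision Tree is a randomized hyperparameter search with `RandomizedSearchCV`, which is imported at the top of this notebook but not used. A minimal sketch, assuming scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV; the parameter ranges are illustrative guesses, not tuned recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# Hypothetical search space for the tree
param_distributions = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy'],
}

# Sample 20 parameter combinations and score each with 5-fold cross-validation
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=45),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=45,
)
search.fit(X_tr, y_tr)

print('Best parameters:', search.best_params_)
print('Cross-validated accuracy:', round(search.best_score_, 4))
print('Test accuracy:', round(search.score(X_te, y_te), 4))
```

Fixing `random_state` on both the tree and the search keeps the result reproducible between runs.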