# Breast Cancer Classification Project

## Project Overview

In this project, I build a logistic regression model to predict breast cancer diagnoses using the widely used Breast Cancer Wisconsin (Diagnostic) dataset. The goal is to demonstrate an understanding of logistic regression, data preprocessing, model development, and evaluation.

## 1. Machine Learning Algorithm (20 points)

### Chosen Algorithm: Logistic Regression

#### Explanation
Logistic regression is a classification method that estimates the probability of a binary outcome. It applies the logistic (sigmoid) function to a linear combination of the input features to model the probability that a given input belongs to a specific class.

#### Underlying Principles
* Binary Classification: Logistic regression is used for binary classification tasks.
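* Logistic (Sigmoid) Function: It uses the logistic function, which is the inverse of the logit (log-odds) function, to map a linear combination of the features to a probability between 0 and 1.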
* Decision Boundary: It defines a decision boundary to separate classes based on the probability threshold (typically 0.5).
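
The principles above can be illustrated with a small NumPy sketch. The `sigmoid` and `logit` helpers and the example scores are hypothetical and chosen only for illustration; they are not part of the project code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

# Hypothetical linear scores (w . x + b) for four inputs
scores = np.array([-3.0, -0.5, 0.0, 2.0])
probs = sigmoid(scores)

# Predict the positive class when the probability reaches the 0.5 threshold,
# which is equivalent to the linear score being >= 0.
labels = (probs >= 0.5).astype(int)

print(probs.round(3))
print(labels)
```

Applying `logit` to `probs` recovers the original scores, which is why the decision boundary at probability 0.5 corresponds to a linear boundary in feature space.
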
#### Applications
* Medical diagnosis (e.g., cancer detection)
* Spam detection
* Credit scoring


### Suitability for Dataset
Logistic regression is suitable for this dataset because the target variable is binary: each tumor is labeled as either malignant or benign.

## 2. Dataset (20 points)

### Chosen Dataset: Breast Cancer Wisconsin (Diagnostic) Dataset
#### Description
* **Features:** The dataset contains 30 numeric features describing characteristics of cell nuclei, computed from digitized images of fine needle aspirate (FNA) samples of breast masses.
* **Target Variable:** `diagnosis` - a binary variable where 'M' indicates malignant and 'B' indicates benign.
* **Preprocessing Steps:**
  * Dropping irrelevant columns (`id`, `Unnamed: 32`)
  * Encoding the target variable (`diagnosis`) to numerical values
  * Standardizing the features
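
As a rough illustration of the standardization step, the sketch below checks that `StandardScaler` computes `z = (x - mean) / std` per feature, using the population standard deviation. The toy matrix `X_toy` is made up for this example and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix (3 samples, 2 features), for illustration only
X_toy = np.array([[10.0, 200.0],
                  [12.0, 260.0],
                  [14.0, 320.0]])

scaler_toy = StandardScaler()
X_scaled = scaler_toy.fit_transform(X_toy)

# Same computation by hand: z = (x - mean) / std, with the population std (ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and unit variance, so features measured on large scales (such as area) do not dominate those measured on small scales (such as smoothness).
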
#### Dataset Information
* **Source:** [Kaggle - Breast Cancer Wisconsin (Diagnostic) Dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download)
* **Number of Instances:** 569
* **Number of Features:** 30
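
For readers without the Kaggle CSV, scikit-learn bundles a copy of this dataset that can be used to sanity-check the figures above. This is a sketch under the assumption that the bundled copy matches the Kaggle file. Note that its column names differ (for example `mean radius` rather than `radius_mean`) and that it encodes malignant as 0 and benign as 1, the opposite of the encoding used in this project.

```python
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the dataset as pandas objects
bunch = load_breast_cancer(as_frame=True)
X_sk, y_sk = bunch.data, bunch.target

print(X_sk.shape)                      # (number of instances, number of features)
print(list(bunch.target_names))        # class names in label order (0, 1)
print(y_sk.value_counts().to_dict())   # instances per class label
```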

## 3. Implementation (30 points)

### Preprocessing

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Load the dataset
data = pd.read_csv('data.csv')

# Inspect the dataset
print(data.head())
print(data.columns)

# Drop unnecessary columns
data.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)

# Encode the target variable
data['diagnosis'] = data['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

# Separate features and target
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  texture_worst  perimeter_worst  area_worst  smoothness

### Model Training

In [14]:
# Implement Logistic Regression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
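
To show what the fitted model contains, the sketch below trains the same kind of classifier and checks that the predicted probability is the sigmoid of the linear score. It assumes scikit-learn's bundled copy of the dataset stands in for `data.csv`; the names `X_sk`, `clf`, and `sc` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data.csv
X_sk, y_sk = load_breast_cancer(return_X_y=True)
y_sk = 1 - y_sk  # flip labels so that 1 = malignant, as in this project

X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)
sc = StandardScaler().fit(X_tr)
clf = LogisticRegression(random_state=42).fit(sc.transform(X_tr), y_tr)

# One learned weight per feature, plus a single intercept
print(clf.coef_.shape, clf.intercept_.shape)

# For a binary problem, P(class 1) is the sigmoid of the linear score w . x + b
z = clf.decision_function(sc.transform(X_te))
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(sc.transform(X_te))[:, 1]
print(np.allclose(p_manual, p_sklearn))
```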

### Model Evaluation

In [15]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'Confusion Matrix:\n {conf_matrix}')


Accuracy: 0.9736842105263158
Precision: 0.9761904761904762
Recall: 0.9534883720930233
Confusion Matrix:
 [[70  1]
 [ 2 41]]
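
The reported metrics can be recomputed by hand from the confusion matrix, which helps clarify what each one measures. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, in label order (0 = benign, 1 = malignant).

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[70, 1],
               [2, 41]])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

acc = (tp + tn) / cm.sum()   # share of all test samples classified correctly
prec = tp / (tp + fp)        # share of predicted malignant cases that are malignant
rec = tp / (tp + fn)         # share of malignant cases that were detected

print(f"Accuracy: {acc:.4f}, Precision: {prec:.4f}, Recall: {rec:.4f}")
```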


## 4. Model Training and Evaluation (20 points)


### Data Preprocessing
* Dropped unnecessary columns (`id`, `Unnamed: 32`)
* Encoded the target variable `diagnosis` from categorical to numerical
* Standardized the features using `StandardScaler`
### Model Training
* Implemented logistic regression using `LogisticRegression` from sklearn
### Model Evaluation
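* Accuracy: 0.9737 (111 of 114 test samples classified correctly)
* Precision: 0.9762 (41 of 42 predicted malignant cases are malignant)
* Recall: 0.9535 (41 of 43 malignant cases are detected)
* Confusion Matrix (actual class vs. predicted class):
  * True negatives (benign predicted as benign): 70
  * False positives (benign predicted as malignant): 1
  * False negatives (malignant predicted as benign): 2
  * True positives (malignant predicted as malignant): 41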


## 5. Interpretation
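* High Accuracy and Precision: The model classifies 111 of the 114 test samples correctly (accuracy of about 0.974), and only 1 benign tumor is flagged as malignant (precision of about 0.976), so false positives are rare.
* High Recall: The model identifies 41 of the 43 malignant cases in the test set (recall of about 0.953), which is crucial for timely medical intervention. The 2 missed cases are false negatives.
* Confusion Matrix: Only 3 of the 114 test samples are misclassified, which indicates that the model performs reliably on this held-out split.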
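
Because recall matters most for malignant cases, one option is to lower the probability threshold below the default 0.5. The sketch below is only an illustration of that trade-off: it assumes scikit-learn's bundled copy of the dataset in place of `data.csv`, and the alternative threshold of 0.3 is an arbitrary example value, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for data.csv
X_sk, y_sk = load_breast_cancer(return_X_y=True)
y_sk = 1 - y_sk  # flip labels so that 1 = malignant, as in this project

X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)

# Scaling and the classifier are chained so the scaler is fit on training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_tr, y_tr)
proba = pipe.predict_proba(X_te)[:, 1]  # estimated P(malignant)

results = {}
for threshold in (0.5, 0.3):  # 0.3 is a hypothetical alternative threshold
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, pred), recall_score(y_te, pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of more false positives, so the right value depends on how the two kinds of error are weighed.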
