<a href="https://www.kaggle.com/code/nhanbaoho/logistic-regression-accuracy-97?scriptVersionId=99003489" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **<center>Classification Model for Breast Cancer Wisconsin</center>**

---
---

# **Dataset**

https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

"Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant"
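The field numbering in the description above (10 measurements times 3 statistics, starting at field 3) can be sketched as follows. The labels built here are illustrative only and are an assumption; they are not the exact column names of the CSV file.

```python
# Ten base measurements, in the order listed in the dataset description.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Each measurement is summarized by three statistics per image.
statistics = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID number and the diagnosis, so features start at field 3.
field_names = {}
field = 3
for stat in statistics:
    for feature in base_features:
        field_names[field] = f"{feature} {stat}"
        field += 1

print(len(field_names))                                   # number of feature fields
print(field_names[3], "|", field_names[13], "|", field_names[23])
```

This reproduces the example given in the description: field 3 is the mean radius, field 13 the radius standard error, and field 23 the worst radius.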

# **Goal**
**We will build a classification model to predict whether a breast mass is malignant or benign, based on features of the cell nuclei in its FNA image.**


    

---
<a id='0'></a>
# **Table of Contents**
1. [Exploratory Data Analysis (EDA)](#1)
2. [Feature Selection](#2)
3. [Choosing Model](#3)
4. [Training Model](#4)
5. [Evaluating model performance](#5)


---
<a id = '1'></a>
# **1. Exploratory Data Analysis**
[Table of Contents](#0)

## **1.1. Import packages for exploring data**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **1.2. Read data into a dataframe named "df"**

In [None]:
df=pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df

## **1.3. Clean data: drop irrelevant columns**

**We can drop "id" and "Unnamed: 32" columns as they are not features**

In [None]:
df = df.drop(["id", "Unnamed: 32"], axis = 1)
df

## **1.4. Explore information about dataframe: index dtype, columns, non-null values, memory usage**

In [None]:
df.info()

## **1.5. Explore descriptive statistics about dataframe**

In [None]:
df.describe()

In [None]:
df.describe().transpose()

## **1.6. Explore correlation between features**

**Matrix of correlation**

In [None]:
# matrix of correlation
# numeric_only=True (pandas >= 1.5) skips the string column "diagnosis"
df.corr(numeric_only=True)

**Create a heatmap that displays the correlation between all the columns, showing only the lower triangle**

In [None]:
# figure size
plt.figure(figsize=(20,12))
# correlation matrix of the numeric columns
corr = df.corr(numeric_only=True)
# boolean mask that hides the upper triangle (including the diagonal)
marked_matrix = np.triu(np.ones_like(corr, dtype=bool))
# plot heatmap
sns.heatmap(data = corr, cmap='viridis', annot=True, mask = marked_matrix)
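The `mask` argument of `sns.heatmap` hides every cell where the mask is true. A minimal sketch of an upper-triangle boolean mask on a made-up 3x3 correlation matrix (the values are assumptions for illustration, not taken from the dataset):

```python
import numpy as np

# Made-up symmetric correlation matrix, for illustration only.
corr_demo = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.3],
    [0.0, 0.3, 1.0],
])

# True on and above the diagonal: these cells would be hidden in the heatmap.
mask_demo = np.triu(np.ones_like(corr_demo, dtype=bool))
print(mask_demo)

# Cells that stay visible (strict lower triangle); hidden cells become NaN.
visible = np.where(mask_demo, np.nan, corr_demo)
print(visible)
```

Because the correlation matrix is symmetric, the lower triangle alone carries every pairwise value once.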

---
<a id = '2'></a>
# **2. Preprocessing data: Feature selection**
[Table of Contents](#0)

**Many features are highly correlated, and we aim to remove them in this step. The comparison in the code comments below suggests that dropping features with an absolute correlation above 0.85 gives the best combination of accuracy and running efficiency.**

## **2.1. Feature selection**


**Features that are highly correlated with another feature (absolute correlation > 0.85) are removed.**

In [None]:
# correlation matrix of the numeric columns (skips the string column "diagnosis")
corr = df.corr(numeric_only=True)
corr_abs = corr.abs()
# select the strict upper triangle so that each feature pair is considered once
# (np.bool was removed in NumPy 1.24; the built-in bool is used instead)
upper_triangle = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# columns with high correlation to be dropped
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.75)]  # this gives accuracy 95%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.8)]     # this gives accuracy 95%
dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.85)]  # accuracy 97%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.9)]     # accuracy 97%

# drop columns from dataframe
df = df.drop(dropped_columns, axis = 1)
df

**Explore features that were dropped**

In [None]:
dropped_columns

**There are 13 features removed, as shown below.**

In [None]:
len(dropped_columns)
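The threshold comparison behind this choice could be reproduced with a loop like the one below. This is only a sketch under several assumptions: it uses scikit-learn's bundled copy of the Wisconsin diagnostic data instead of the Kaggle CSV, 5-fold cross-validation instead of a single train/test split, and a hypothetical helper `correlated_columns`. Its numbers will therefore not match the accuracies quoted in the comments above exactly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def correlated_columns(features: pd.DataFrame, threshold: float) -> list:
    """Hypothetical helper: columns whose absolute correlation with an
    earlier column exceeds `threshold`."""
    corr_abs = features.corr().abs()
    upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Assumption: the bundled dataset stands in for the Kaggle CSV.
bunch = load_breast_cancer(as_frame=True)
X_all, y_all = bunch.data, bunch.target

results = {}
for threshold in (0.75, 0.80, 0.85, 0.90):
    kept = X_all.drop(columns=correlated_columns(X_all, threshold))
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipeline, kept, y_all, cv=5)
    results[threshold] = (kept.shape[1], scores.mean())

for threshold, (n_features, accuracy) in results.items():
    print(f"threshold {threshold:.2f}: {n_features} features, CV accuracy {accuracy:.3f}")
```

A lower threshold removes more features, so the loop makes the trade-off between model size and accuracy visible in one table.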

## **2.2. Display the relationships between selected features in a pair plot**

In [None]:
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(df, hue='diagnosis')

## **2.3. Exploring features X and target y**

### **We first drop the column "diagnosis" to obtain X = features. The column y = "diagnosis" is target.**

In [None]:
# Features
X = df.drop("diagnosis", axis = 1)

# Target
y = df["diagnosis"]

### **Explore target values**

In [None]:
y.unique()

### **Count target values**

In [None]:
y.value_counts()

**The binary target values suggest a Logistic Regression model.**
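Since the classes are not perfectly balanced, it helps to know what accuracy a trivial classifier would reach. A small sketch using the class counts from the dataset description (357 benign, 212 malignant); the variable names are assumptions for illustration:

```python
# Class counts as stated in the dataset description.
benign, malignant = 357, 212
total = benign + malignant

# A classifier that always predicts the majority class reaches this accuracy.
majority_baseline = max(benign, malignant) / total

print(f"{total} samples, majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be compared against this baseline rather than against 50%.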

### **Plot total count of target values**

In [None]:
# figure size
plt.figure(figsize=(8, 5))
# recent seaborn versions require the data to be passed by keyword
sns.countplot(x=y)

### ***Plot a pie chart of target***

In [None]:
plt.figure(figsize=(8,8))
# declaring data
data = y.value_counts()
# take the labels from the counts themselves so each slice gets the right label
# (y.unique() returns labels in order of appearance, which can differ)
keys = data.index

# define Seaborn color palette to use
#palette_color = sns.color_palette('bright')

# plotting data on chart
#plt.pie(data, labels=keys, colors=palette_color, autopct='%.0f%%')
plt.pie(data, labels=keys, autopct='%.0f%%')

# displaying chart
plt.show()


---
<a id = '3'></a>
# **3. Building the model: Logistic Regression Model**
[Table of Contents](#0)

* **One option for this dataset of binary target values is Logistic Regression**

## **3.1. Import the model**

In [None]:
from sklearn.linear_model import LogisticRegression

## **3.2. Create an instance of LogisticRegression model**

In [None]:
log_model = LogisticRegression()
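For intuition about what the model computes, the sketch below fits a logistic regression on synthetic data (an assumption, generated with `make_classification`, not the cancer data) and checks that the predicted probability is the sigmoid of a linear score. The `demo_` names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid maps it to a probability.
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
p_sklearn = demo_model.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

The predicted class is the positive one whenever this probability exceeds 0.5, that is, whenever the linear score is positive.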

## **3.3. Splitting data**

#### **Import libraries**

In [None]:
from sklearn.model_selection import train_test_split  # train - test data splitting
from sklearn.preprocessing import StandardScaler      # scaling data

#### **Splitting for training and testing. We use 30% of data for testing**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

## **3.4. Scaling data**

#### **Create a StandardScaler object and standardize the X train and test feature data. Each feature is rescaled to zero mean and unit variance (this does not make the data normally distributed). The scaler is fit on the training data only.**

In [None]:
# Create an object of StandardScaler
scaler = StandardScaler()

# We only fit to the training data, not test data.
scaled_X_train = scaler.fit_transform(X_train)

# We transform but not fit the test data.
scaled_X_test = scaler.transform(X_test)
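To make the scaling step concrete, here is a small sketch on made-up, deliberately skewed data (an assumption; it does not use the cancer features). It checks that `StandardScaler` matches the manual formula `(x - mean) / std` using the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up skewed data, for illustration only.
rng = np.random.default_rng(0)
train_demo = rng.exponential(scale=3.0, size=(100, 2))
test_demo = rng.exponential(scale=3.0, size=(40, 2))

# Fit on the training data only, then apply the same shift and scale to both sets.
demo_scaler = StandardScaler().fit(train_demo)
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

# Manual standardization with the training mean and (population) standard deviation.
manual = (train_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1, while the test columns are only close to that, because they reuse the training statistics. The shape of each distribution is unchanged, so skewed data stays skewed.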

---
<a id = '4'></a>
# **4. Training the model on the data**
[Table of Contents](#0)

## **4.1. Train the model on the data**

In [None]:
log_model.fit(scaled_X_train,y_train)

## **4.2. Predict on test data**

In [None]:
y_pred = log_model.predict(scaled_X_test)
y_pred

---
<a id = '5'></a>
# **5. Evaluating model performance**
[Table of Contents](#0)

## **5.1. Import libraries for evaluation**

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,ConfusionMatrixDisplay

## **5.2. Measuring Model Performance**
* **The model accuracy on the test set is 97%**

In [None]:
score = accuracy_score(y_test,y_pred, normalize=True)
score

## **5.3. Confusion Matrix**

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

#### **Visualization of confusion_matrix**

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(log_model,scaled_X_test,y_test)
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 10);

#### **Normalizing the plot over the true labels gives row-wise frequencies. This gives us better insight.**

In [None]:
ConfusionMatrixDisplay.from_estimator(log_model,scaled_X_test,y_test,normalize='true')

## **5.4. Classification report**

In [None]:
print(classification_report(y_test,y_pred))
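The numbers in a classification report follow directly from the confusion matrix. A minimal sketch with made-up labels (an assumption; these are not the model's predictions), treating "M" as the positive class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_true_demo = np.array(["B", "B", "B", "B", "B", "B", "M", "M", "M", "M"])
y_pred_demo = np.array(["B", "B", "B", "B", "B", "M", "M", "M", "M", "B"])

# With labels=["B", "M"], rows are true classes and columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=["B", "M"]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted malignant cases, how many are malignant
recall = tp / (tp + fn)      # of the malignant cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

For a cancer screening task, recall on the malignant class is often the metric to watch, since a false negative is a missed diagnosis.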

---
## **5.5. Plotting performance curves**

**Insight**
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html


### **Import library**

In [None]:
from sklearn.metrics import PrecisionRecallDisplay,RocCurveDisplay

### **Plot the Precision-Recall curve**

* **The precision-recall curve below shows both high recall and high precision.**

In [None]:
# plot_precision_recall_curve was removed in scikit-learn 1.2
PrecisionRecallDisplay.from_estimator(log_model,scaled_X_test,y_test)

### **Plot Receiver operating characteristic (ROC) curve**
* **The ROC curve below shows that the area under the curve is close to 1.**

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(log_model,scaled_X_test,y_test)
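Both curves can also be summarized as single numbers. The sketch below uses four made-up samples and scores (assumptions for illustration, not model output) to show how the area under the ROC curve and the average precision are computed from scores rather than from hard predictions.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels and predicted scores for the malignant class, for illustration only.
labels_demo = np.array(["B", "B", "M", "M"])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Encode the positive class (malignant) as 1.
y_binary = (labels_demo == "M").astype(int)

# Area under the ROC curve and the precision-recall summary (average precision).
auc_demo = roc_auc_score(y_binary, scores_demo)
ap_demo = average_precision_score(y_binary, scores_demo)

print(auc_demo, ap_demo)
```

With a fitted model, the scores would typically come from `predict_proba(...)[:, 1]` or `decision_function(...)`.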

---
---
# ***Thanks for your interest and your feedback!***