<a href="https://www.kaggle.com/code/nhanbaoho/logistic-regression-accuracy-97?scriptVersionId=98728361" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **<center>Classification Model for Breast Cancer Wisconsin**

# **Dataset**

https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data

"Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

Also can be found on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant"
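
The field numbering in the description above can be reproduced with a short sketch. The base names mirror the column headers in `data.csv` (for example `radius_mean`); the `_se` suffix and the mean / SE / worst ordering are assumptions taken from the quoted description, not something verified against the file.

```python
# Ten base measurements, in the order listed in the dataset description.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics per measurement (suffix names are an assumption).
statistics = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
fields = {}
field_number = 3
for stat in statistics:
    for name in base_features:
        fields[field_number] = f"{name}_{stat}"
        field_number += 1

print(len(fields))                          # number of feature columns
print(fields[3], fields[13], fields[23])    # the three radius fields
```

With this layout, fields 3, 13 and 23 are the mean, standard error and worst value of the radius, which matches the example given in the description.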

# **Goal**
We will build a classification model to predict whether a breast mass is malignant or benign, based on physical features of its cell nuclei.


    

---
# **Contents**
1. [Exploratory Data Analysis (EDA)](#1)
2. [Feature Selection](#2)
3. [Choosing Model](#3)
4. [Training Data](#4)
5. [Evaluating model performance](#5)


---
# **1. Exploratory Data Analysis**

## **1.a. Import packages for exploring data**

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## **1.b. Read data into a dataframe named "df"**

In [8]:
df=pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


## **1.c. Clean data**

We can drop the "id" and "Unnamed: 32" columns because they are not features: "id" is only an identifier, and "Unnamed: 32" is empty in every row shown above.

In [11]:
df = df.drop(["id", "Unnamed: 32"], axis = 1)
df

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400
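
Before dropping a column, it can help to confirm that it really carries no information. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `df`) to show one way of detecting columns that are entirely missing; it is an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: an ID, one feature, one empty column.
toy = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
empty_cols = [col for col in toy.columns if toy[col].isna().all()]
print(empty_cols)

# Drop the empty columns together with the identifier column.
cleaned = toy.drop(columns=empty_cols + ["id"])
print(list(cleaned.columns))
```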


---
# 3. Analysing and Visualising data

## 3.1. Explore information about dataframe: index dtype, columns, non-null values, memory usage

In [None]:
df.info()

## 3.2. Explore descriptive statistics about dataframe.

In [None]:
df.describe()

In [None]:
df.describe().transpose()

## 3.3. Explore correlation between features.

In [None]:
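# correlation matrix of the numeric features
# ("diagnosis" is a string column; recent pandas versions raise an error if it is included)
df.select_dtypes(include="number").corr()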

## 3.4. Create a heatmap that displays the correlation between all the columns

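Plot the heatmap showing only the lower triangle; the upper triangle (including the diagonal) is masked.

In [None]:
# figure size
plt.figure(figsize=(20,12))
# correlation matrix of the numeric features ("diagnosis" is a string column)
corr = df.select_dtypes(include="number").corr()
# boolean mask that hides the upper triangle, including the diagonal
mask_upper = np.triu(np.ones_like(corr, dtype=bool))
# plot heatmap
sns.heatmap(data = corr, cmap='viridis', annot=True, mask = mask_upper)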

---
# 4. Preprocessing data 
* Many features are highly correlated with one another. We aim to remove the redundant ones in the next step. The comparison in the code comments below suggests that dropping features whose absolute correlation exceeds 0.85 gives the best balance between accuracy and running time.

## 4.1. Feature selection


### Features that are highly correlated (absolute correlation > 0.85) will be removed.

In [None]:
# correlation matrix of the numeric features ("diagnosis" is a string column)
corr = df.select_dtypes(include="number").corr()
corr_abs = corr.abs()
# select upper triangle of correlation matrix
# (np.bool was removed in NumPy 1.24, so the built-in bool is used instead)
upper_triangle = corr_abs.where(np.triu(np.ones(corr_abs.shape), k=1).astype(bool))

# columns with high correlation to be dropped
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.75)]  # this gives accuracy 95%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.8)]     # this gives accuracy 95%
dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.85)]  # accuracy 97%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.9)]     # accuracy 97%

# drop columns from dataframe
df = df.drop(dropped_columns, axis = 1)
df
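
To see what the upper-triangle filter does, here is a minimal sketch on a made-up three-column frame. The names `toy`, `a`, `b`, `c` and `to_drop` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],   # exactly 2 * a, so |corr(a, b)| = 1
    "c": [3, 1, 2, 6, 4, 5],     # only moderately correlated with a and b
})

corr_abs = toy.corr().abs()

# Keep only the entries strictly above the diagonal, so that every pair of
# columns is inspected once and a column is never compared with itself.
keep_upper = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(keep_upper)

threshold = 0.85
# A column is flagged if it correlates strongly with any EARLIER column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Only `b` is flagged here. Of each highly correlated pair, the column that appears later in the frame is dropped and the earlier one is kept, so the result depends on column order.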

### Explore features that were dropped

In [None]:
dropped_columns

#### 13 features were removed, as counted below.

In [None]:
len(dropped_columns)

## 4.2. Display the relationships between selected features in a pair plot.

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call
# would only leave an empty figure behind
sns.pairplot(df, hue='diagnosis')

## 4.3. Features X and target y

### We first drop the column "diagnosis" to obtain X = features. The column y = "diagnosis" is target.

In [None]:
# Features
X = df.drop("diagnosis", axis = 1)

# Target
y = df["diagnosis"]

### Explore target values

In [None]:
y.unique()

### Count target values
* The target is binary, which suggests a logistic regression model.

In [None]:
y.value_counts()

### Plot total count of target values

In [None]:
# figure size
plt.figure(figsize=(8, 5))
# pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=y)

### Plot a piechart of target

In [None]:
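# declaring data
data = y.value_counts()
# take the labels from the counts themselves so that each label matches its slice
# (y.unique() returns labels in order of appearance, which can differ from the count order)
keys = data.index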
  
# define Seaborn color palette to use
palette_color = sns.color_palette('bright')
  
# plotting data on chart
plt.pie(data, labels=keys, colors=palette_color, autopct='%.0f%%')
  
# displaying chart
plt.show()

---
# 5. Splitting and Scaling data

## 5.1. Import libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 5.2. Splitting for training and testing. We use 30% of data for testing

In [None]:
# Hold out 30% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
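
As a rough check of what a 30% split means for a dataset of this size, the sketch below splits placeholder data with the same number of rows as the class distribution quoted in the dataset description (357 + 212 = 569). `X_demo` and `y_demo` are made-up stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 357 + 212  # class counts quoted in the dataset description
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features
y_demo = np.array(["B"] * 357 + ["M"] * 212)             # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(len(X_tr), len(X_te))

# The same random_state gives the same split on every run.
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(np.array_equal(X_tr, X_tr_again))
```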

### 5.3 Standardising and Training
#### Create a StandardScaler object and standardise the X train and test feature data. Each feature is rescaled to zero mean and unit variance using statistics computed from the training data; this does not make the features normally distributed. The scaler is fitted to the training data only.


In [None]:
# Create an object of StandardScaler
scaler = StandardScaler()

# We only fit to the training data, not test data.
scaled_X_train = scaler.fit_transform(X_train)

# We transform but not fit the test data.
scaled_X_test = scaler.transform(X_test)
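
The following sketch, on made-up numbers, shows what the scaler computes: the test data are transformed with the mean and standard deviation of the training data. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learn mean/std from train, then scale
test_scaled = scaler_demo.transform(test)        # reuse the training statistics

# The same transformation written out by hand.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns generally do not, because they are shifted and scaled with the training statistics.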

---
# 6. Building the model: Logistic Regression Model
* One option for this dataset, which has a binary target, is logistic regression.

### 6.1. Import the model

In [None]:
from sklearn.linear_model import LogisticRegression

### 6.2. Create an instance of LogisticRegression model

In [None]:
log_model = LogisticRegression()

### 6.3. Training the model on the data

In [None]:
log_model.fit(scaled_X_train,y_train)

### 6.4. Predict on test data

In [None]:
y_pred = log_model.predict(scaled_X_test)
y_pred
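
To illustrate how `predict` arrives at a label, the sketch below fits a logistic regression on synthetic two-class data and reproduces the predicted probability by applying the sigmoid to the linear score. The data and the names `X_demo`, `y_demo` and `model` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two overlapping Gaussian clouds labelled "B" and "M".
rng = np.random.default_rng(0)
X_b = rng.normal(loc=-1.0, scale=1.0, size=(50, 2))
X_m = rng.normal(loc=1.0, scale=1.0, size=(50, 2))
X_demo = np.vstack([X_b, X_m])
y_demo = np.array(["B"] * 50 + ["M"] * 50)

model = LogisticRegression().fit(X_demo, y_demo)

z = model.decision_function(X_demo)          # linear score w.x + b
p_m = 1.0 / (1.0 + np.exp(-z))               # sigmoid of the score
proba = model.predict_proba(X_demo)[:, 1]    # column 1 belongs to model.classes_[1]

# A probability above 0.5 corresponds to the second class in model.classes_.
labels_from_threshold = np.where(p_m > 0.5, model.classes_[1], model.classes_[0])

print(model.classes_)
print(np.allclose(p_m, proba))
print(np.array_equal(labels_from_threshold, model.predict(X_demo)))
```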

---
# 7. Evaluating model performance

## 7.1. Import

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

## 7.2. Measuring Model Performance
* The model reaches an accuracy of about 97% on the test set.

In [None]:
score = accuracy_score(y_test,y_pred, normalize=True)
score

### 7.3. Confusion Matrix

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

#### Visualization of confusion_matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(log_model, scaled_X_test, y_test)
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 10);

#### Normalise the plot over the true labels, so each cell shows a rate instead of a count. This gives us better insight.

In [None]:
ConfusionMatrixDisplay.from_estimator(log_model, scaled_X_test, y_test, normalize='true')

## 7.4. Classification report

In [None]:
print(classification_report(y_test,y_pred))
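
The numbers in the classification report can be derived from the confusion matrix. The sketch below does this by hand on eight made-up labels (`y_true` and `y_hat` are hypothetical), treating "M" as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_hat = ["M", "M", "B", "B", "B", "B", "B", "M"]

# Rows are true labels, columns are predicted labels, in the order B, M.
cm = confusion_matrix(y_true, y_hat, labels=["B", "M"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_m = tp / (tp + fp)   # of the samples predicted M, how many are M
recall_m = tp / (tp + fn)      # of the samples that are M, how many were found

print(cm)
print(accuracy, precision_m, recall_m)
print(precision_score(y_true, y_hat, pos_label="M"),
      recall_score(y_true, y_hat, pos_label="M"))
```

In a screening setting, recall for the malignant class is often the figure to watch, since a false negative means a missed malignant case.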

---
# 8. Plotting performance curves

## Insight
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html


## 8.1. Import

In [None]:
# plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

## 8.2. Plot the Precision-Recall curve
* The precision-recall curve below shows that the model keeps precision high even at high recall.

In [None]:
PrecisionRecallDisplay.from_estimator(log_model, scaled_X_test, y_test)

## 8.3 Plot Receiver operating characteristic (ROC) curve
* The ROC curve below shows an area under the curve (AUC) close to 1.

In [None]:
RocCurveDisplay.from_estimator(log_model, scaled_X_test, y_test)
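
One way to read the area under the ROC curve: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this against `roc_auc_score` on six made-up scores; the arrays are hypothetical and 1 stands for the malignant class.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])               # 1 = positive class
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.9])  # predicted probabilities

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half a win.
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_pairs = wins / (len(pos) * len(neg))

auc_sklearn = roc_auc_score(y_true, scores)
print(auc_pairs, auc_sklearn)
```

Here 7 of the 9 pairs are ranked correctly, so both values equal 7/9. An AUC close to 1 means almost every malignant sample is scored above almost every benign one.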

---
# Thanks for your interest and your feedback!