<a href="https://www.kaggle.com/code/nhanbaoho/logistic-regression-accuracy-97?scriptVersionId=98212468" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
# Contents
1. Importing libraries
2. Reading data
3. Analysing and Visualising data
4. Preprocessing data
5. Splitting and Scaling data
6. Building the model
7. Evaluating model performance
8. Plotting performance curves

---
# 1. Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

---
# 2. Read data
### Read data into a dataframe named "df"

In [None]:
df=pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df

In [None]:
df.transpose()

### We can drop "id" and "Unnamed: 32" columns as they are not features

In [None]:
df = df.drop(["id", "Unnamed: 32"], axis = 1)
df

---
# 3. Analysing and Visualising data

## 3.1. Explore information about dataframe: index dtype, columns, non-null values, memory usage

In [None]:
df.info()

## 3.2. Explore descriptive statistics about dataframe

In [None]:
df.describe()

In [None]:
df.describe().transpose()

## 3.3. Explore correlation between features.

In [None]:
# matrix of correlation (numeric columns only; "diagnosis" holds string labels)
df.select_dtypes(include="number").corr()

## 3.4. Create a heatmap that displays the correlation between all the columns

Plot heatmap with lower triangle

In [None]:
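# figure size
plt.figure(figsize=(20,12))
# correlation matrix (numeric columns only; "diagnosis" holds string labels)
corr = df.select_dtypes(include="number").corr()
# mask the upper triangle (including the diagonal) so only the lower triangle is drawn
marked_matrix = np.triu(np.ones_like(corr, dtype=bool))
# plot heatmap
sns.heatmap(data = corr, cmap='viridis', annot=True, mask = marked_matrix)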

---
# 4. Preprocessing data
* Many features are highly correlated. We aim to remove them in the next step. The comparison in the code comments below suggests that dropping features with an absolute correlation above 0.85 gives the best combination of accuracy and running efficiency.

## 4.1. Feature selection


### Features that are highly correlated (absolute correlation > 0.85) are removed.

In [None]:
# correlation matrix (numeric columns only; "diagnosis" holds string labels)
corr = df.select_dtypes(include="number").corr()
corr_abs = corr.abs()
# select upper triangle of correlation matrix
# (np.bool was removed in NumPy 1.24, so the built-in bool is used instead)
upper_triangle = corr_abs.where(np.triu(np.ones(corr_abs.shape), k=1).astype(bool))

# columns with high correlation to be dropped
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.75)]  # this gives accuracy 95%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.8)]     # this gives accuracy 95%
dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.85)]  # this gives accuracy 97%
# dropped_columns = [col for col in upper_triangle.columns if any(upper_triangle[col] > 0.9)]     # this gives accuracy 97%

# drop columns from dataframe
df = df.drop(dropped_columns, axis = 1)
df

### Explore features that were dropped

In [None]:
dropped_columns

#### 13 features were removed, as counted below.

In [None]:
len(dropped_columns)
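
The upper-triangle filter used above can be easier to follow on a tiny table. The sketch below uses a hypothetical three-column dataframe (`toy`, with made-up values) in which column `b` is an exact multiple of column `a`; it is an illustration of the technique, not a rerun of the analysis on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is an exact multiple of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

toy_corr_abs = toy.corr().abs()

# Keep only the entries strictly above the diagonal, so each pair appears once
# and a column is never compared with itself.
keep = np.triu(np.ones(toy_corr_abs.shape, dtype=bool), k=1)
toy_upper = toy_corr_abs.where(keep)

# A column is dropped when it correlates strongly with any earlier column.
toy_dropped = [col for col in toy_upper.columns if (toy_upper[col] > 0.85).any()]
toy_reduced = toy.drop(columns=toy_dropped)

print(toy_dropped)
print(list(toy_reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped, so the result depends on column order.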

## 4.2. Display the relationships between selected features in a pair plot.

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() call
# has no effect and only leaves an empty figure behind
sns.pairplot(df, hue='diagnosis')

## 4.3. Features X and target y

### We first drop the column "diagnosis" to obtain X = features. The column y = "diagnosis" is target.

In [None]:
# Features
X = df.drop("diagnosis", axis = 1)

# Target
y = df["diagnosis"]

### Explore target values

In [None]:
y.unique()

### Count target values
* The binary target values suggest a Logistic Regression model.

In [None]:
y.value_counts()

### Plot total count of target values

In [None]:
# figure size
plt.figure(figsize=(8, 5))
# pass the series by keyword; recent seaborn versions require x= or y=
sns.countplot(x=y)

### Plot a piechart of target

In [None]:
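# declaring data
data = y.value_counts()
# take the labels from the counts themselves so that each label matches its slice
# (y.unique() returns values in order of first appearance, which can differ)
keys = data.index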
  
# define Seaborn color palette to use
palette_color = sns.color_palette('bright')
  
# plotting data on chart
plt.pie(data, labels=keys, colors=palette_color, autopct='%.0f%%')
  
# displaying chart
plt.show()
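
Pie labels have to follow the same order as the counts they describe. A minimal sketch on a hypothetical label series (`toy_y`, made-up values) shows that `value_counts()` and `unique()` can order the classes differently, which is why pairing the counts with their own index is the safer choice.

```python
import pandas as pd

# Hypothetical labels: "M" appears first, but "B" is more frequent.
toy_y = pd.Series(["M", "B", "B", "M", "B"])

counts = toy_y.value_counts()          # sorted by frequency, most common first
first_seen = list(toy_y.unique())      # sorted by order of first appearance
labels = list(counts.index)            # always aligned with the counts

print(counts.tolist(), labels, first_seen)
```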

---
# 5. Splitting and Scaling data

## 5.1. Import libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## 5.2. Splitting for training and testing. We use 30% of data for testing

In [None]:
# We use 30% of data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
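
To see what `test_size` and `random_state` do, here is a small sketch on hypothetical data (`X_toy`, `y_toy`: ten made-up samples). It only checks the split sizes and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# 30% of 10 samples go to the test set, the remaining 7 to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)

# The same random_state gives the same split on every call.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=101)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```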

## 5.3. Standardising and Training
#### Create a StandardScaler object and standardise the X train and test feature data. Standardisation rescales each feature to zero mean and unit variance; it does not make the data normally distributed. The scaler is fitted on the training data only, and the same transformation is then applied to the test data.


In [None]:
# Create an object of StandardScaler
scaler = StandardScaler()

# We only fit to the training data, not test data.
scaled_X_train = scaler.fit_transform(X_train)

# We transform but not fit the test data.
scaled_X_test = scaler.transform(X_test)
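
The fit-on-train, transform-on-test pattern can be checked by hand. The sketch below uses hypothetical arrays (`train_toy`, `test_toy`) and compares the scaler output with the formula (x - mean) / std, where mean and std come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows and one test row, two features.
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_toy = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_toy)  # learns mean and std from train
test_scaled = toy_scaler.transform(test_toy)        # reuses the training statistics

# The same transformation written out by hand (population std, ddof=0).
manual = (test_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled, manual)
```

The scaled test row does not have zero mean, which is expected: it is measured against the training distribution.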

---
# 6. Building the model: Logistic Regression Model
* One option for this dataset with binary target values is Logistic Regression.

### 6.1. Import the model

In [None]:
from sklearn.linear_model import LogisticRegression

### 6.2. Create an instance of LogisticRegression model

In [None]:
log_model = LogisticRegression()

### 6.3. Training the model on the data

In [None]:
log_model.fit(scaled_X_train,y_train)

### 6.4. Predict on test data

In [None]:
y_pred = log_model.predict(scaled_X_test)
y_pred
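
For a binary target, logistic regression turns a linear score into a probability with the sigmoid function and predicts the second class when that probability exceeds 0.5. The sketch below reproduces this by hand on hypothetical one-feature data (`X_toy`, `y_toy`) and compares it with `predict_proba` and `predict`; it assumes the default `LogisticRegression()` settings used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two overlapping classes.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array(["B", "B", "M", "B", "M", "M"])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = w * x + b, then the sigmoid gives P(class == classes_[1]).
z = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = toy_model.predict_proba(X_toy)[:, 1]

# predict() chooses classes_[1] when that probability is above 0.5.
labels_manual = np.where(p_manual > 0.5, toy_model.classes_[1], toy_model.classes_[0])

print(toy_model.classes_)
print(np.round(p_manual, 3))
print(labels_manual)
```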

---
# 7. Evaluating model performance

## 7.1. Import

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

## 7.2. Measuring Model Performance
* The model reaches about 97% accuracy on the test set.

In [None]:
score = accuracy_score(y_test,y_pred, normalize=True)
score

## 7.3. Confusion Matrix

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

#### Visualization of confusion_matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(log_model, scaled_X_test, y_test)
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 10);

#### Normalise each row by the number of true samples in that class. The cells then show proportions instead of counts, which makes the per-class error rates easier to compare.

In [None]:
ConfusionMatrixDisplay.from_estimator(log_model, scaled_X_test, y_test, normalize='true')

## 7.4. Classification report

In [None]:
print(classification_report(y_test,y_pred))
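
The numbers in a classification report follow directly from the confusion matrix. As a worked sketch on hypothetical labels (`y_true_toy`, `y_pred_toy`, made-up values, with "M" treated as the positive class), precision, recall and accuracy can be computed by hand and compared with scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

# Hypothetical true and predicted labels.
y_true_toy = ["B", "B", "B", "M", "M", "M", "M", "B"]
y_pred_toy = ["B", "B", "M", "M", "M", "B", "M", "M"]

# Rows are true classes, columns are predicted classes, in the order ["B", "M"].
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["B", "M"])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of everything predicted "M", how much really is "M"
recall = tp / (tp + fn)      # of all real "M", how much was found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(precision, recall, accuracy)
print(precision_score(y_true_toy, y_pred_toy, pos_label="M"),
      recall_score(y_true_toy, y_pred_toy, pos_label="M"),
      accuracy_score(y_true_toy, y_pred_toy))
```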

---
# 8. Plotting performance curves

## Insight
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html


## 8.1. Import

In [None]:
# plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay

## 8.2. Plot the Precision-Recall curve
* The precision-recall curve below shows both high recall and high precision.

In [None]:
PrecisionRecallDisplay.from_estimator(log_model, scaled_X_test, y_test)

## 8.3. Plot the receiver operating characteristic (ROC) curve
* The ROC curve below has an area under the curve (AUC) close to 1.

In [None]:
RocCurveDisplay.from_estimator(log_model, scaled_X_test, y_test)
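
The area under the ROC curve can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on four hypothetical scores (`y_true_toy`, `scores_toy`, made-up values) by counting correctly ordered pairs and comparing the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = positive) and model scores.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true_toy, scores_toy)

# Count the (positive, negative) pairs in which the positive has the higher score.
pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)
```

Here three of the four pairs are ordered correctly, so both values are 0.75; a value close to 1 means almost every positive is ranked above almost every negative.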

---
# Thanks for your interest and your feedback!