<a href="https://colab.research.google.com/github/mustafabozkaya/PythonDataScienceHandbooks/blob/main/breast-cancer-eda-pca-svm-lr-rf-dt-gb-gridsearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Breast Cancer EDA, PCA, SVM, LR, RF, DTree, Gradient Boosting, GridSearchCV Param Tuning - 99% Accuracy

**Breast Cancer Data Analysis**


![](https://www.sysmex-europe.com/fileadmin/_processed_/a/9/csm_LifeScience_StageImage_BreastCancer_1500x600-01_2498abd1e0.jpg)

In this tutorial, we use the data to predict whether a tumor is benign or malignant. We use Python libraries such as NumPy, Pandas and Plotly, and apply classification techniques to predict the label (1 or 0) for each sample in our dataset.

**Source** : https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's start by importing the required libraries.

In [48]:
import numpy as np
import pandas as pd 
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import missingno as msn
import sklearn.datasets as sdata
%matplotlib inline

In [49]:
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (18, 12)

`sklearn` ships a copy of this dataset, so when the Kaggle CSV is not available we load the data from `sklearn.datasets` and build a dataframe with the help of the `Pandas` library.

In [50]:
import os 

try:
  for file in os.listdir("../input/"):
    print(file)
except FileNotFoundError:
  # Not running on Kaggle: list the current directory instead
  for file in os.listdir("./"):
    print(file)

.config
sample_data


In [51]:
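try:
  # On Kaggle: read the CSV that ships with the dataset
  Data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
  Df = Data.copy()
except FileNotFoundError:
  # Elsewhere: fall back to the copy bundled with scikit-learn
  Data = sdata.load_breast_cancer(return_X_y=False)
  col = list(Data.feature_names) + ["target"]
  Data = np.concatenate((Data.data, Data.target.reshape(-1, 1)), axis=1)
  Df = pd.DataFrame(Data, columns=col)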
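
As an alternative to assembling the frame by hand, scikit-learn can return the dataset as a pandas DataFrame directly. This is a minimal sketch, assuming scikit-learn 0.23 or later (where `as_frame=True` is available); the names `bunch` and `frame` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds the 30 features plus 'target'
bunch = load_breast_cancer(as_frame=True)
frame = bunch.frame

# Match the notebook's convention of underscores in column names
frame.columns = frame.columns.str.replace(" ", "_")

print(frame.shape)
print(list(bunch.target_names))  # label 0 is the first name, label 1 the second
```

With this variant the `target` column keeps an integer dtype instead of being cast to float by `np.concatenate`.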

In [52]:
Df[:1] 

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0


In [53]:
Df.columns = Df.columns.str.replace(" ", "_")

In [54]:
Df.columns

Index(['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area',
       'mean_smoothness', 'mean_compactness', 'mean_concavity',
       'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension',
       'radius_error', 'texture_error', 'perimeter_error', 'area_error',
       'smoothness_error', 'compactness_error', 'concavity_error',
       'concave_points_error', 'symmetry_error', 'fractal_dimension_error',
       'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area',
       'worst_smoothness', 'worst_compactness', 'worst_concavity',
       'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension',
       'target'],
      dtype='object')

In [59]:
Df[:1]["mean_radius"] 

0    17.99
Name: mean_radius, dtype: float64

In [57]:
Df.iloc[:,1:]

Unnamed: 0,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,radius_error,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
0,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,1.0950,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0.0
1,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0.5435,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0.0
2,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0.7456,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0.0
3,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0.4956,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0.0
4,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0.7572,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,1.1760,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0.0
565,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0.7655,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0.0
566,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,0.4564,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0.0
567,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0.7260,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0.0


In [60]:
Df.head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0


In [61]:
Df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean_radius              569 non-null    float64
 1   mean_texture             569 non-null    float64
 2   mean_perimeter           569 non-null    float64
 3   mean_area                569 non-null    float64
 4   mean_smoothness          569 non-null    float64
 5   mean_compactness         569 non-null    float64
 6   mean_concavity           569 non-null    float64
 7   mean_concave_points      569 non-null    float64
 8   mean_symmetry            569 non-null    float64
 9   mean_fractal_dimension   569 non-null    float64
 10  radius_error             569 non-null    float64
 11  texture_error            569 non-null    float64
 12  perimeter_error          569 non-null    float64
 13  area_error               569 non-null    float64
 14  smoothness_error         5

In [62]:
Df.describe()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


## Exploratory Data Analysis aka EDA

Exploratory Data Analysis refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

In [63]:
# Column names follow the scikit-learn copy of the data loaded above
fig = px.bar(Df, 
            x='target', 
            y='mean_radius', 
            color='mean_perimeter',
            hover_data=["mean_radius"], 
            title='Mean Radius vs Diagnosis (target)')
fig.update_xaxes(showgrid=False)   #Turning the grid off
fig.update_yaxes(showgrid=False)   #Turning the grid off
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})  #removing the background color
fig.show()

In [64]:
# Column names follow the scikit-learn copy of the data loaded above
fig = px.bar(Df,
             x="mean_compactness", 
             y="target", 
             color="mean_compactness",
             log_x=True, 
             template="none", 
             title="Mean Compactness vs Diagnosis (target)")
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

In [None]:
fig = px.histogram(Df, 
                   x='target',   # label column of the dataframe built above
                   color_discrete_sequence=['red'],
                   title='Diagnosis Count')
fig.update_layout(bargap=0.3)
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)','paper_bgcolor': 'rgba(0, 0, 0, 0)'})
fig.show()

In [None]:
sns.heatmap(Df.corr(),annot=True,fmt=".2g")

From the analysis above, there are 357 benign cases and 212 malignant cases. Mean compactness is higher in the malignant cases than in the benign cases.
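
The class counts and the compactness comparison can also be checked numerically. The sketch below assumes the scikit-learn copy of the dataset, where label 0 is malignant and label 1 is benign; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)
frame = bunch.frame

# In scikit-learn's copy, target 0 means malignant and 1 means benign
label_names = {i: str(name) for i, name in enumerate(bunch.target_names)}
labels = frame["target"].map(label_names)

# Number of samples per class and average 'mean compactness' per class
class_counts = labels.value_counts()
mean_compactness = frame.groupby(labels)["mean compactness"].mean()

print(class_counts)
print(mean_compactness)
```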

## Data Pre-processing
Now let's process the data so that it meets the requirements of the machine learning algorithms. In particular, we want to make sure there is no categorical data, since most machine learning algorithms expect numeric input.

Fortunately, there are no categorical columns in our dataset, so we simply use slicing to build the list of columns for the independent (input) variables and to pick the dependent (output) variable.
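
For comparison, the Kaggle CSV stores the label as text ("M" or "B"), which would need encoding before modelling. A minimal sketch of that step, using a made-up three-row sample rather than real records:

```python
import pandas as pd

# Hypothetical sample in the style of the Kaggle CSV, where the label
# column holds "M" (malignant) or "B" (benign); the values are made up
sample = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 13.5, 12.0],
})

# Map the two categories to integers. Which class gets 1 is a modelling
# choice; scikit-learn's copy of the data uses 0 for malignant instead.
sample["diagnosis_code"] = sample["diagnosis"].map({"M": 1, "B": 0})
print(sample)
```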

In [None]:
Df.columns

In [None]:
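# Every column except the last one ('target') is an input feature
input_cols = Df.columns[:-1]
input_cols

In [None]:
# The label is the last column of the dataframe built above
target_col = Df.columns[-1]
target_col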

In [None]:
inputs_df = Df[list(input_cols)].copy()
inputs_df

In [None]:
targets = Df[(target_col)]
targets

### Data Scaling

We scale the data because many machine learning algorithms train faster and more reliably when the features share a similar range. When feature values differ by orders of magnitude, the largest features tend to dominate, so training takes longer and accuracy can suffer.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Write the scaled values back into inputs_df, the frame used by the following steps
inputs_df[input_cols] = scaler.fit_transform(inputs_df[input_cols])
inputs_df[input_cols].describe().loc[['min', 'max'],:]
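
Min-max scaling maps every column to the range [0, 1] using (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on a small made-up matrix; `X_demo` is illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up matrix: two features on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# Column-wise (x - min) / (max - min)
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The scaler should produce the same values as the manual formula
scaled = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```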

## Principal Component Analysis (PCA)
Principal Component Analysis is a way to reduce the number of variables while maintaining the majority of the important information. It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables, known as principal components.

The main objective of PCA is to simplify your model features into fewer components, which helps you visualize patterns in your data and helps your model run faster. PCA can also reduce the risk of overfitting, because highly correlated features are combined into a smaller number of uncorrelated components.

In [None]:
from sklearn.preprocessing import scale
from sklearn import decomposition
X = scale(inputs_df)
pca = decomposition.PCA(n_components=10)
pca.fit(X)

In [None]:
scores = pca.transform(X)
cols=[f'PCA{i+1}' for i in range(10)]
scores_df = pd.DataFrame(scores, columns=cols)
scores_df.head()

In [None]:
sns.heatmap(scores_df.corr(),annot=True,fmt=".2g")
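
The heatmap of the component scores should be close to an identity matrix, because principal components are uncorrelated by construction. A small numerical check, assuming the scikit-learn copy of the data and a full SVD solver (chosen here so the result is exact up to floating-point error):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X_raw)
pc_scores = PCA(n_components=10, svd_solver="full").fit_transform(X_std)

# Correlation matrix of the component scores (columns are variables)
corr = np.corrcoef(pc_scores, rowvar=False)
off_diagonal = corr - np.eye(corr.shape[0])
print(np.abs(off_diagonal).max())  # expected to be numerically close to zero
```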

In [None]:
target = pd.Series(targets, name='target')
target

In [None]:
result_df = pd.concat([scores_df, target], axis=1)
result_df.head()

In [None]:
type(scores),scores.shape,scores_df.dtypes,scores_df.shape

In [None]:
sns.scatterplot(data=result_df, x="PCA1", y="PCA2",size="PCA3",style="target", hue="PCA4");

### Explained Variance Ratio

The explained variance ratio is the fraction of the total variance attributed to each of the selected components. A common rule of thumb is to keep adding components until their cumulative explained variance ratio reaches a chosen threshold, for example around 0.8 (80%), so that the model keeps most of the information without carrying every dimension.

In [None]:
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(pca.explained_variance_ratio_.sum() * 100, 2))

Our first 10 principal components explain the majority of the variance in this dataset (about 95%). This indicates how much of the information in the original data the components retain.
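
Choosing the number of components from a variance threshold can be sketched with a cumulative sum. The example assumes standardized features from the scikit-learn copy of the data; the thresholds 0.80 and 0.95 are only examples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit all 30 components so the full variance spectrum is available
full_pca = PCA(svd_solver="full").fit(X_std)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches each threshold
for threshold in (0.80, 0.95):
    n_needed = int(np.searchsorted(cumulative, threshold) + 1)
    print(f"{threshold:.0%} of variance: {n_needed} components")
```

scikit-learn can make the same selection itself: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that share of the variance.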

## Splitting Data
Now that preprocessing is done, we can start training models. Let's split the data into two parts: training data and validation data. The training data is used to fit each model, and the validation data is used to score it.

We use a test size of 0.25 so that a quarter of the samples are held out. Scoring on samples the model has never seen shows how well it generalizes to new data, which training on the entire dataset could not tell us.

In [None]:
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_test = train_test_split(scores,
                                                    targets,
                                                    test_size=0.25,
                                                    random_state=42)

In [None]:
X_train.shape, y_train.shape, x_test.shape, y_test.shape
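
Because the two classes are not equally frequent, a stratified split is one option for keeping the class ratio similar in both parts. This is a sketch on the raw scikit-learn data, not the split used in this notebook; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the benign/malignant ratio similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=42
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # share of label 1 in each split
```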

## Training Models to find the best one

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
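from sklearn.metrics import accuracy_score,adjusted_rand_score,auc,f1_score,roc_auc_score,roc_curve,RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator  # plot_roc_curve itself was removed in scikit-learn 1.2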

In [None]:
names = ['LR', "KNN", "SVM","GBoost", "DT", "RF"]
classifiers = [
    LogisticRegression(solver='liblinear'),
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel="linear", C=0.025),
    GradientBoostingClassifier(n_estimators=100),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=100)]

In [None]:
scores = []
accuracies = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(x_test)
    # For classifiers, clf.score and accuracy_score both report test accuracy
    score = clf.score(x_test, y_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(score)
    accuracies.append(accuracy)

In [None]:
scores_df = pd.DataFrame()
scores_df['name'] = names
scores_df['score'] = scores
scores_df["accuracy"] = accuracies
scores_df.sort_values('score', ascending= False)

In [None]:
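# Plot ROC curves on the held-out data rather than on the training data
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(classifiers[0], x_test, y_test)
RocCurveDisplay.from_estimator(classifiers[1], x_test, y_test)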

We will not pick Gradient Boosting or Decision Tree, since their test/validation scores are below 96%. Let's go ahead and tune some hyperparameters for the remaining models.
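
A single train/validation split gives one score per model, and preprocessing fitted on all rows can leak information into that score. One way to get a steadier estimate is to wrap scaling, PCA and the classifier in a pipeline and cross-validate it. This is a sketch under that assumption, not the procedure used above; `pipe` and `fold_scores` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)

# Scaling and PCA are re-fitted on the training part of every fold,
# so no information from the held-out fold leaks into preprocessing
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(solver="liblinear"),
)

fold_scores = cross_val_score(pipe, X_all, y_all, cv=5)
print(fold_scores.mean(), fold_scores.std())
```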

## Hyperparameter Tuning for all the models

When creating a machine learning model, you are presented with design choices about how to define the model architecture. Often we don't know in advance what the optimal architecture is for a given model, so we would like to explore a range of possibilities.

In true machine learning fashion, we ideally ask the machine to perform this exploration and select the optimal model architecture automatically. Parameters that define the model architecture are referred to as hyperparameters, and the process of searching for the ideal model architecture is referred to as hyperparameter tuning, which is what we do here using `GridSearchCV` and `RandomizedSearchCV`.

## GridSearch Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
C_range = np.arange(1,10,2)
max_iter_range = np.arange(50,500,20)
# Pair each penalty only with solvers that support it: 'l2' works with all three solvers,
# 'l1' only with liblinear here; 'elasticnet' needs the saga solver, so it is left out
param_grid = [
    dict(C=C_range, penalty=['l2'], solver=['newton-cg', 'lbfgs', 'liblinear'], max_iter=max_iter_range),
    dict(C=C_range, penalty=['l1'], solver=['liblinear'], max_iter=max_iter_range),
]
model = LogisticRegression()

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))

In [None]:
neighbors_range = np.arange(1,7,1)
leaf_size_range = np.arange(10,40,10)
param_grid = dict(n_neighbors=neighbors_range, leaf_size=leaf_size_range)
model = KNeighborsClassifier(n_jobs=-1)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train )
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

In [None]:
Kernel_range = ['linear','rbf']
C_range = np.arange(1,15,1)
param_grid = dict(kernel=Kernel_range, C= C_range)
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=10)
grid.fit(X_train, y_train )
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
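
Beyond `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which helps judge how close the runners-up are. A minimal sketch with a deliberately small grid on the raw scikit-learn data; the grid values are illustrative, not the ones tuned above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_breast_cancer(return_X_y=True)

# Deliberately small grid so the sketch runs quickly
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]},
    cv=5,
)
search.fit(X_all, y_all)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["param_svc__kernel", "param_svc__C", "mean_test_score", "std_test_score"]
print(results.sort_values("rank_test_score")[columns])
```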

## RandomizedSearchCV Hyperparameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
max_depth_range = np.arange(1,8,1)
max_features_range= np.arange(1,6,1)
max_leaf_nodes_range = np.arange(2,100,10)
distributions = dict(max_depth=max_depth_range, max_features=max_features_range, max_leaf_nodes=max_leaf_nodes_range)
model = RandomForestClassifier(n_jobs=-1, random_state=42)
clf = RandomizedSearchCV(model, distributions, random_state=42)
# Fit on the training split created earlier
clf.fit(X_train, y_train)
print("The best parameters are %s with a score of %0.2f"
      % (clf.best_params_, clf.best_score_))
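
`RandomizedSearchCV` also accepts distributions to sample from, with `n_iter` controlling how many combinations are tried. The sketch below uses `scipy.stats.randint` on the raw scikit-learn data; the ranges, `n_iter=5` and `cv=3` are illustrative choices to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_all, y_all = load_breast_cancer(return_X_y=True)

# randint(low, high) samples integers from low to high - 1
distributions = {
    "max_depth": randint(1, 8),
    "max_features": randint(1, 6),
    "max_leaf_nodes": randint(2, 100),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    distributions,
    n_iter=5,        # number of sampled combinations, kept small for speed
    cv=3,
    random_state=42,
)
search.fit(X_all, y_all)
print(search.best_params_, round(search.best_score_, 3))
```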

Now that we are done tuning the models, let's plug the tuned values in and see which model performs best on this dataset.

In [None]:
names = ['LogisticRegression', "Nearest_Neighbors", "Linear_SVM", "Random_Forest"]
classifiers = [
    LogisticRegression(C=2,max_iter=100, penalty='l2',solver='liblinear'),
    KNeighborsClassifier(leaf_size=10, n_neighbors=5),
    SVC(kernel="linear", C=4),
    RandomForestClassifier(max_leaf_nodes=82,max_features=4, max_depth=5)]

In [None]:
scores = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train,y_train)
    score = clf.score(x_test, y_test)
    scores.append(score)

In [None]:
scores_df = pd.DataFrame()
scores_df['name'] = names
scores_df['score'] = scores
scores_df.sort_values('score', ascending= False)

We can see that **Logistic Regression** and **SVM** give us the best accuracy scores.
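
Accuracy alone does not show which kind of mistake a model makes, and for a diagnosis task a missed malignant case matters more than a false alarm. A confusion matrix makes that visible. This is a sketch on the raw scikit-learn data with a scaled logistic regression, not a reproduction of the models above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(C=2, solver="liblinear"))
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(y_te, y_hat)
print(cm)
print(classification_report(y_te, y_hat, target_names=data.target_names))
```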

**SUMMARY OF THE NOTEBOOK**

1. There are 357 benign cases and 212 malignant cases. Mean compactness is higher in the malignant cases than in the benign cases.
2. Depending on the data and the available computational power, use either GridSearch or RandomizedSearch for hyperparameter tuning.
3. PCA is a great way to move from high dimensionality to low dimensionality. If we have more features than observations, then we run the risk of massively overfitting our model, which generally results in terrible out-of-sample performance.
4. Complex algorithms should not always be the first choice. Sometimes even a simpler algorithm can work wonders.

**Resources**
- https://scikit-learn.org/stable/modules/svm.html
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
