# Cancer Tumor Detection using KNN Algorithm

Consider the Wisconsin Breast Cancer Database.

The class attribute records the diagnosis, that is, whether the patient's tumor is benign or malignant.

Benign tumors do not spread to other parts of the body, while malignant tumors are cancerous.

### Detailed field descriptions
1. Number of records: 569
2. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
3. Attribute information:
   1. ID number
   2. Diagnosis (M = malignant, B = benign)
   3. Fields 3-32: ten real-valued features are computed for each cell nucleus:
      - radius (mean of distances from center to points on the perimeter)
      - texture (standard deviation of gray-scale values)
      - perimeter
      - area
      - smoothness (local variation in radius lengths)
      - compactness (perimeter^2 / area - 1.0)
      - concavity (severity of concave portions of the contour)
      - concave points (number of concave portions of the contour)
      - symmetry
      - fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.  For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
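The field numbering described above can be reproduced with a few lines of Python. This is a minimal sketch: the base names follow the short column names used in the CSV (for example `points_mean` and `dimension_mean`), while the `_worst` suffix and the variable names are assumptions made for illustration.

```python
# Ten base measurements, using the short names that appear in the CSV columns
base_features = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "points", "symmetry", "dimension"]
# Three statistics per measurement; the "worst" suffix is an assumption
statistics = ["mean", "se", "worst"]

# All ten means first, then all ten standard errors, then all ten worst values
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]
all_fields = ["id", "diagnosis"] + feature_names

print(len(feature_names))  # 30
# Fields are numbered from 1 in the description, so subtract 1 when indexing
print(all_fields[3 - 1], all_fields[13 - 1], all_fields[23 - 1])
# radius_mean radius_se radius_worst
```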

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

# Problem Statement:
### Build different classifiers on the Breast Cancer data to predict whether a patient has a benign or a malignant tumor
### Optimize each model's performance by fine-tuning its hyperparameters

In [2]:
import numpy as np 
import pandas as pd

In [3]:
# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [4]:
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
# calculate accuracy measures and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [5]:
# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [6]:
data = pd.read_csv("wisc_bc_data.csv")

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
id                   569 non-null int64
diagnosis            569 non-null object
radius_mean          569 non-null float64
texture_mean         569 non-null float64
perimeter_mean       569 non-null float64
area_mean            569 non-null float64
smoothness_mean      569 non-null float64
compactness_mean     569 non-null float64
concavity_mean       569 non-null float64
points_mean          569 non-null float64
symmetry_mean        569 non-null float64
dimension_mean       569 non-null float64
radius_se            569 non-null float64
texture_se           569 non-null float64
perimeter_se         569 non-null float64
area_se              569 non-null float64
smoothness_se        569 non-null float64
compactness_se       569 non-null float64
concavity_se         569 non-null float64
points_se            569 non-null float64
symmetry_se          569 non-null float64
dimension_se    

In [8]:
data.head(5).T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 87139402 | 8910251 | 905520 | 868871 | 9012568 |
| diagnosis | B | B | B | B | B |
| radius_mean | 12.32 | 10.6 | 11.04 | 11.28 | 15.19 |
| texture_mean | 12.39 | 18.95 | 16.83 | 13.39 | 13.21 |
| perimeter_mean | 78.85 | 69.28 | 70.92 | 73 | 97.65 |
| area_mean | 464.1 | 346.4 | 373.2 | 384.8 | 711.8 |
| smoothness_mean | 0.1028 | 0.09688 | 0.1077 | 0.1164 | 0.07963 |
| compactness_mean | 0.06981 | 0.1147 | 0.07804 | 0.1136 | 0.06934 |
| concavity_mean | 0.03987 | 0.06387 | 0.03046 | 0.04635 | 0.03393 |
| points_mean | 0.037 | 0.02642 | 0.0248 | 0.04796 | 0.02657 |


In [9]:
#Delete the "id" column
data.drop("id",axis=1,inplace=True)

In [10]:
#Count the diagnosis variable
data.diagnosis.value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [11]:
#Diagnosis variable is a target variable for the classification
#Replace M and B with 1 and 0 respectively
data.diagnosis=data.diagnosis.map({'M':1,'B':0})

In [12]:
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

In [13]:
#Split dataset into train set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=7)

In [14]:
#Size of train and test data sets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(398, 30)
(398,)
(171, 30)
(171,)
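The split above is purely random, so the benign/malignant ratio in the test set can drift from the overall 357/212 ratio. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with the same class counts and a placeholder feature column; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (357 benign, 212 malignant)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print("malignant share overall:", y_demo.mean())
print("malignant share in test:", y_te.mean())
```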


### Without Hyperparameter Tuning
1. Logistic Regression
2. Naive Bayes
3. kNearestNeighbors
4. Decision Tree
5. AdaBoost
6. GradientBoosting
7. RandomForest

In [15]:
# Compare results without and with hyper parameters
resultsDf = pd.DataFrame(index=['Logistic Regression', 'Naive Bayes', 'KNN', 'DecisionTree', 
                                'AdaBoost', 'GradientBoost', 'RandomForest'])
resultsWOHP = []
resultsWHP = []

In [16]:
##### 1. Logistic Regression
# Make the instance
model = LogisticRegression()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9473684210526315
Confusion Matrix:   
 [[106   6]
 [  3  56]]
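Accuracy alone hides which kind of error the model makes. As a worked example, the counts printed for logistic regression can be turned into recall and precision for the malignant class. Because the notebook calls `confusion_matrix(prediction, y_test)`, rows are read here as predicted labels and columns as true labels; the variable names are illustrative.

```python
# Counts copied from the logistic regression output above.
# Rows are predicted labels, columns are true labels (0 = benign, 1 = malignant).
matrix = [[106, 6],
          [3, 56]]

true_negative = matrix[0][0]   # predicted benign, actually benign
false_negative = matrix[0][1]  # predicted benign, actually malignant
false_positive = matrix[1][0]  # predicted malignant, actually benign
true_positive = matrix[1][1]   # predicted malignant, actually malignant

total = true_negative + false_negative + false_positive + true_positive
accuracy = (true_positive + true_negative) / total
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
# accuracy=0.9474 recall=0.9032 precision=0.9492
```

In this reading, 6 of the 62 malignant cases in the test set were predicted as benign, which is the costlier mistake in a screening setting.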


In [17]:
##### 2. Naive Bayes
# Make the instance
model = GaussianNB()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9415204678362573
Confusion Matrix:   
 [[106   7]
 [  3  55]]


In [18]:
##### 3. kNearestNeighbors
# Make the instance
model = KNeighborsClassifier()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9298245614035088
Confusion Matrix:   
 [[104   7]
 [  5  55]]
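KNN is distance-based, so features with large numeric ranges (such as `area_mean`) can dominate features with small ranges (such as `smoothness_mean`). The notebook imports `zscore` but the cells shown here do not apply it. A minimal sketch of one way to standardize features before KNN follows. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for `wisc_bc_data.csv`, and it makes no claim about the accuracy this would give on the notebook's own file.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled 569 x 30 breast cancer dataset
# (assumption: used only because wisc_bc_data.csv is not available here)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)

# The pipeline fits the scaler on the training split only,
# then applies the same transformation to the test split
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)
knn_accuracy = scaled_knn.score(X_te, y_te)
print("Accuracy with scaling:", knn_accuracy)
```

Wrapping the scaler in a pipeline keeps test-set statistics out of the training step, which matters once cross-validation is added.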


In [19]:
##### 4. DecisionTree
# Make the instance
model = DecisionTreeClassifier(random_state=7)
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9298245614035088
Confusion Matrix:   
 [[105   8]
 [  4  54]]


In [20]:
##### 5. AdaBoost
# Make the instance
model = AdaBoostClassifier()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9532163742690059
Confusion Matrix:   
 [[106   5]
 [  3  57]]


In [20]:
##### 6. GradientBoosting
# Make the instance
model = GradientBoostingClassifier()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.9649122807017544
Confusion Matrix:   
 [[109   6]
 [  0  56]]


In [21]:
##### 7. RandomForest
# Make the instance
model = RandomForestClassifier()
# Train the model
model.fit(X_train, y_train)
# Prediction
prediction = model.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:", metrics.accuracy_score(prediction,y_test))
resultsWOHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Accuracy: 0.935672514619883
Confusion Matrix:   
 [[105   7]
 [  4  55]]
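The cells in this section repeat the same fit, predict, and score steps for every classifier. As a hedged sketch, the pattern could be written once as a loop over a dictionary of estimators. Only four of the seven models are included to keep it short, and scikit-learn's bundled dataset stands in for `wisc_bc_data.csv` (an assumption), so the printed numbers need not match the outputs above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the bundled copy of the breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)

# Estimators with default settings, keyed by the labels used in resultsDf
models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=7),
    "RandomForest": RandomForestClassifier(random_state=7),
}

scores = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    # accuracy_score expects the true labels first, then the predictions
    scores[name] = accuracy_score(y_te, estimator.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```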


### With Hyperparameter Tuning
1. Logistic Regression
2. Naive Bayes
3. kNearestNeighbors
4. Decision Tree
5. AdaBoost
6. GradientBoosting
7. RandomForest

In [22]:
#Import GridSearch module
from sklearn.model_selection import GridSearchCV

In [23]:
##### 1. Logistic Regression
#Make ML model the instance
model= LogisticRegression()
#Hyper Parameters Set
params = {} ### Empty grid: nothing is tuned here (LogisticRegression does accept hyperparameters such as C and penalty)
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {}
Accuracy: 0.9473684210526315
Confusion Matrix:   
 [[106   6]
 [  3  56]]
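With an empty grid, `GridSearchCV` only cross-validates the default model, which is why the accuracy matches the untuned run. A non-empty grid for logistic regression might look like the sketch below. The values chosen for `C`, the scaling step, `max_iter=1000`, and `cv=5` are all illustrative assumptions, and scikit-learn's bundled dataset stands in for `wisc_bc_data.csv`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the bundled copy of the breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7)

# Scaling helps the solver converge; make_pipeline names the
# second step "logisticregression" (lowercased class name)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C
c_values = [0.01, 0.1, 1, 10, 100]
param_grid = {"logisticregression__C": c_values}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)
tuned_accuracy = search.score(X_te, y_te)
print("Best Hyper Parameters:", search.best_params_)
print("Accuracy:", tuned_accuracy)
```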


In [24]:
##### 2. Naive Bayes
#Make ML model the instance
model = GaussianNB()
#Hyper Parameters Set
params = {}  ## Empty grid: nothing is tuned here (GaussianNB does accept var_smoothing)
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {}
Accuracy: 0.9415204678362573
Confusion Matrix:   
 [[106   7]
 [  3  55]]


In [25]:
##### 3. kNearestNeighbors
#Make ML model the instance
model = KNeighborsClassifier()
#Hyper Parameters Set
params = {'n_neighbors':[5,6,7,8,9,11, 13, 15, 17, 19, 21, 23]}
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {'n_neighbors': 15}
Accuracy: 0.935672514619883
Confusion Matrix:   
 [[106   8]
 [  3  54]]


In [26]:
##### 4. DecisionTree
#Import GridSearch module
from sklearn.model_selection import GridSearchCV
#Make ML model the instance
model= DecisionTreeClassifier(random_state=7)
#Hyper Parameters Set
params = {'min_samples_split': [2,3,4,5,6,7,8,9,10,11,12,13,14,15], 
          'min_samples_leaf':[1,2,3,4,5,6,7,8,9,10,11]}
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {'min_samples_leaf': 1, 'min_samples_split': 4}
Accuracy: 0.9239766081871345
Confusion Matrix:   
 [[105   9]
 [  4  53]]


In [27]:
##### 5. AdaBoost
#Make ML model the instance
model= AdaBoostClassifier(random_state=7)
#Hyper Parameters Set
params = {'n_estimators':[10,15,20,25,30, 100], 
          'learning_rate' : [0.01, 0.1, 0.9]}
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {'learning_rate': 0.1, 'n_estimators': 100}
Accuracy: 0.9649122807017544
Confusion Matrix:   
 [[108   5]
 [  1  57]]


In [28]:
##### 6. GradientBoosting
#Make ML model the instance
model= GradientBoostingClassifier(random_state=7)
#Hyper Parameters Set
params = {'n_estimators':[10, 15, 20, 25, 30, 100],
          'max_depth': [3, 5,25], 
          'min_samples_leaf':[1,3],
          'min_samples_split':[2,5,7],
          'learning_rate' : [0.01, 0.1,0.9],
          'subsample':[0.6,0.7, 1.0], 
          'max_features':[3, 5, 10, 20]}
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {'learning_rate': 0.9, 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 25, 'subsample': 0.7}
Accuracy: 0.9707602339181286
Confusion Matrix:   
 [[108   4]
 [  1  58]]
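An exhaustive grid search fits one model per parameter combination per cross-validation fold, so the cost of the GradientBoosting grid grows multiplicatively with each added parameter. The sketch below counts the candidates. The fold count `n_folds = 5` is an assumption, since the cells above do not set `cv` and the default depends on the scikit-learn version.

```python
import math

# Same grid as the GradientBoosting cell above
params = {'n_estimators': [10, 15, 20, 25, 30, 100],
          'max_depth': [3, 5, 25],
          'min_samples_leaf': [1, 3],
          'min_samples_split': [2, 5, 7],
          'learning_rate': [0.01, 0.1, 0.9],
          'subsample': [0.6, 0.7, 1.0],
          'max_features': [3, 5, 10, 20]}

# Number of combinations = product of the list lengths: 6*3*2*3*3*3*4
n_candidates = math.prod(len(values) for values in params.values())
n_folds = 5  # assumption: depends on the cv setting / scikit-learn version

print(n_candidates)            # 3888
print(n_candidates * n_folds)  # 19440 fits, plus one final refit on the full training set
```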


In [29]:
##### 7. RandomForest
#Make ML model the instance
model = RandomForestClassifier(random_state=7)
#Hyper Parameters Set
params = {'n_estimators':[10,15,20,25,30],
          'max_depth': [5, 15, 25, 50], 
          'min_samples_leaf':[1,2,3],
          'min_samples_split':[3,4,5,6,7], 
          'max_features':[5, 10, 20]}
#Make ML model with hyper parameters sets
model1 = GridSearchCV(model, param_grid=params, n_jobs=-1)
#Train the model
model1.fit(X_train, y_train)
#The best hyper parameters set
print("Best Hyper Parameters:",model1.best_params_)
#Prediction
prediction=model1.predict(X_test)
## Evaluation
# Accuracy
print("Accuracy:",metrics.accuracy_score(prediction,y_test))
resultsWHP.append(metrics.accuracy_score(prediction,y_test))
# Confusion matrix (rows: predicted, columns: true)
print("Confusion Matrix:   \n", metrics.confusion_matrix(prediction,y_test))

Best Hyper Parameters: {'max_depth': 15, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 20}
Accuracy: 0.9649122807017544
Confusion Matrix:   
 [[108   5]
 [  1  57]]


### Compare results

In [30]:
resultsDf['accuracyWithOutHPTuning'] = resultsWOHP
resultsDf['accuracyWithHPTuning'] = resultsWHP
resultsDf

| | accuracyWithOutHPTuning | accuracyWithHPTuning |
|---|---|---|
| Logistic Regression | 0.947368 | 0.947368 |
| Naive Bayes | 0.94152 | 0.94152 |
| KNN | 0.929825 | 0.935673 |
| DecisionTree | 0.935673 | 0.923977 |
| AdaBoost | 0.953216 | 0.964912 |
| GradientBoost | 0.964912 | 0.97076 |
| RandomForest | 0.935673 | 0.964912 |
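To read the comparison more easily, the difference between the two columns can be computed directly. The sketch below rebuilds the table from the values shown above. The `gain` column name is a hypothetical addition.

```python
import pandas as pd

# Accuracies copied from the comparison table above
results = pd.DataFrame(
    {"accuracyWithOutHPTuning": [0.947368, 0.94152, 0.929825, 0.935673,
                                 0.953216, 0.964912, 0.935673],
     "accuracyWithHPTuning": [0.947368, 0.94152, 0.935673, 0.923977,
                              0.964912, 0.97076, 0.964912]},
    index=["Logistic Regression", "Naive Bayes", "KNN", "DecisionTree",
           "AdaBoost", "GradientBoost", "RandomForest"])

# Positive values mean tuning helped on this particular test split
results["gain"] = results["accuracyWithHPTuning"] - results["accuracyWithOutHPTuning"]
print(results.sort_values("gain", ascending=False))
print("Largest gain:", results["gain"].idxmax())
```

Since the test set holds 171 samples, one prediction is worth about 0.0058 accuracy. The KNN and GradientBoost gains therefore correspond to a single additional correct prediction, and the DecisionTree score drops after tuning on this split.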
