# Breast Cancer Wisconsin (Diagnostic) Prediction
#### Predict whether the cancer is benign or malignant

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: `ftp ftp.cs.wisc.edu`, then `cd math-prog/cpo-dataset/machine-learn/WDBC/`

It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

- Field 1: ID number
- Field 2: Diagnosis (M = malignant, B = benign)
- Fields 3-32: ten real-valued features are computed for each cell nucleus:

- a) radius (mean of distances from center to points on the perimeter)
- b) texture (standard deviation of gray-scale values)
- c) perimeter
- d) area
- e) smoothness (local variation in radius lengths)
- f) compactness (perimeter^2 / area - 1.0)
- g) concavity (severity of concave portions of the contour)
- h) concave points (number of concave portions of the contour)
- i) symmetry
- j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
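As a quick check of the field numbering, the sketch below rebuilds the 30 feature names from the ten base features and the three statistics. The name format (`radius_mean`, `radius_se`, `radius_worst`) follows the column names of the CSV used later in this notebook; the variable names are illustrative.

```python
# The ten base features, in the order listed above.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics per feature: mean, standard error, and "worst".
statistics = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]
field_number = {name: index + 3 for index, name in enumerate(feature_names)}

print(len(feature_names))            # 30
print(field_number["radius_mean"])   # 3
print(field_number["radius_se"])     # 13
print(field_number["radius_worst"])  # 23
```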

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
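The class counts give a useful reference point for the accuracies reported later: a model that always predicts the majority class (benign) would already be right for roughly 63% of the samples. A small sketch of that arithmetic, using only the counts above:

```python
benign, malignant = 357, 212
total = benign + malignant

# Accuracy of always predicting the majority class (benign).
baseline_accuracy = max(benign, malignant) / total

print(total)                              # 569
print(round(baseline_accuracy * 100, 2))  # 62.74
```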

In [0]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [0]:
# Import Dataset
dataset = pd.read_csv("/content/drive/My Drive/Web Data/Breast_Cancer.csv")

In [3]:
dataset.head() # Show the first 5 rows of the dataset

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [0]:
dataset.drop(columns=['id'], inplace=True) # Drop the id column because it carries no predictive information

In [5]:
dataset.head() # After Deleting Id Column

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [6]:
# The dataset has two diagnosis classes:
# 1) M = malignant
# 2) B = benign
# The models need numeric input, so the strings are mapped to 1 (malignant) and 0 (benign).
dataset['diagnosis'] = dataset['diagnosis'].map({'M': 1, 'B': 0})

In [7]:
dataset

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,1.0950,0.9053,8.589,153.40,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.01860,0.01340,0.01389,0.003532,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.006150,0.04006,0.03832,0.02058,0.02250,0.004571,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0.4956,1.1560,3.445,27.23,0.009110,0.07458,0.05661,0.01867,0.05963,0.009208,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.011490,0.02461,0.05688,0.01885,0.01756,0.005115,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,1.1760,1.2560,7.673,158.70,0.010300,0.02891,0.05198,0.02454,0.01114,0.004239,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0.7655,2.4630,5.203,99.04,0.005769,0.02423,0.03950,0.01678,0.01898,0.002498,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,0.4564,1.0750,3.425,48.55,0.005903,0.03731,0.04730,0.01557,0.01318,0.003892,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0.7260,1.5950,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [8]:
dataset.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [0]:
# The independent variables (features) go into X
X = dataset[['texture_mean','perimeter_mean','smoothness_mean','compactness_mean','symmetry_mean','radius_se','area_se']]
# y is the dependent variable (target)
y = dataset['diagnosis']

In [0]:
# Split the dataset into two parts: train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
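The split sizes reported further down (379 training rows and 190 test rows) follow from `test_size=1/3` on 569 samples. The sketch below reproduces that arithmetic, assuming the test count is the fraction rounded up and the remainder goes to training, which matches the lengths shown in this notebook.

```python
import math

n_samples = 569
test_size = 1 / 3

# Assumption: the test-set size is the fraction rounded up,
# and everything else is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 379 190
```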

In [11]:
X.shape

(569, 7)

In [12]:
y.shape

(569,)

In [13]:
y

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64

In [14]:
X_train # This is our Independent train dataset

Unnamed: 0,texture_mean,perimeter_mean,smoothness_mean,compactness_mean,symmetry_mean,radius_se,area_se
60,14.88,64.55,0.11340,0.08061,0.2743,0.5158,34.620
6,19.98,119.60,0.09463,0.10900,0.1794,0.4467,53.910
8,21.82,87.50,0.12730,0.19320,0.2350,0.3063,24.320
474,15.62,70.41,0.10070,0.10690,0.1861,0.1482,9.597
320,16.18,66.52,0.10610,0.11110,0.1743,0.3677,22.680
...,...,...,...,...,...,...,...
277,19.98,120.90,0.08923,0.05884,0.1550,0.3283,36.740
9,24.04,83.97,0.11860,0.23960,0.2030,0.2976,23.940
359,18.32,59.82,0.10090,0.05956,0.1506,0.5079,30.480
192,18.22,60.73,0.06950,0.02344,0.1653,0.3539,21.690


In [15]:
y_train   # This is our Dependent train dataset

60     0
6      1
8      1
474    0
320    0
      ..
277    1
9      1
359    0
192    0
559    0
Name: diagnosis, Length: 379, dtype: int64

In [16]:
X_test # This is our Independent test dataset

Unnamed: 0,texture_mean,perimeter_mean,smoothness_mean,compactness_mean,symmetry_mean,radius_se,area_se
512,20.52,88.64,0.11060,0.14690,0.2116,0.3906,33.67
457,25.25,84.10,0.08791,0.05205,0.1619,0.2084,17.58
439,15.66,89.59,0.07966,0.05581,0.1589,0.2142,19.25
298,18.17,91.22,0.06576,0.05220,0.1635,0.2300,20.56
37,18.42,82.61,0.08983,0.03766,0.1467,0.1839,14.16
...,...,...,...,...,...,...,...
299,23.09,66.85,0.10150,0.06797,0.1695,0.2868,20.56
347,14.74,94.87,0.08875,0.07780,0.1521,0.3428,29.06
502,16.32,81.25,0.11580,0.10850,0.1943,0.2577,18.49
56,18.57,125.50,0.10530,0.12670,0.1917,0.7275,102.50


In [17]:
y_test # This is our Dependent test dataset

512    1
457    0
439    0
298    0
37     0
      ..
299    0
347    0
502    0
56     1
144    0
Name: diagnosis, Length: 190, dtype: int64

# Simple Linear Regression
### (Just for fun) Don't do this

In [18]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
y_pred = model.predict(X_test)

In [0]:
from sklearn import metrics as mt
mse = mt.mean_squared_error(y_test,y_pred)
RMSE = np.sqrt(mse)

In [21]:
RMSE

0.2837701063165978
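The RMSE above is the square root of the mean squared difference between the 0/1 labels and the regression outputs. Here is a minimal sketch of the same computation on made-up values (the labels and predictions below are illustrative, not taken from the dataset). It also shows that a linear regression output is not confined to the 0 to 1 range, which is one reason it is a poor fit for classification.

```python
import math

# Hypothetical labels and regression outputs, for illustration only.
y_true = [1, 0, 0, 1, 0]
y_hat = [1.2, 0.1, -0.2, 0.7, 0.4]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_hat)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(rmse)

# Values below 0 or above 1 cannot be read as class probabilities.
out_of_range = [p for p in y_hat if p < 0 or p > 1]
print(out_of_range)  # [1.2, -0.2]
```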

***There are 6 steps***

1. Import the algorithm
2. Create an object of the algorithm
3. Fit the training data to it
4. Predict values with the fitted model
5. Create the confusion matrix
6. Use the confusion matrix to find the accuracy
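The six steps can be wrapped in one helper so that every model is handled the same way. This is a sketch, not part of the original notebook: `run_six_steps` and `MajorityClassifier` are hypothetical names, and the tiny stand-in classifier exists only so the example runs without the dataset. Any scikit-learn estimator with `fit` and `predict` could be passed instead.

```python
class MajorityClassifier:
    """Hypothetical stand-in for step 1: always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def run_six_steps(model, X_train, y_train, X_test, y_test):
    """Fit, predict, build a 2x2 confusion matrix, and return it with the accuracy."""
    model.fit(X_train, y_train)      # step 3
    y_pred = model.predict(X_test)   # step 4
    # Steps 5 and 6: rows are actual classes, columns are predicted classes.
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_test, y_pred):
        cm[actual][predicted] += 1
    accuracy = (cm[0][0] + cm[1][1]) / len(y_pred)
    return cm, accuracy


# Toy data, for illustration only (0 = benign, 1 = malignant).
X_train, y_train = [[1.0], [2.0], [3.0], [4.0]], [0, 0, 0, 1]
X_test, y_test = [[1.5], [3.5], [4.5]], [0, 0, 1]

model = MajorityClassifier()         # step 2
cm, accuracy = run_six_steps(model, X_train, y_train, X_test, y_test)
print(cm)  # [[2, 0], [1, 0]]
print(accuracy)
```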


# Congratulations! In just 6 steps you'll build your breast cancer detection machine learning models


# Logistic Regression

In [22]:
from sklearn.linear_model import LogisticRegression  # Step 1
modelLR = LogisticRegression(random_state=0)  # Step 2
modelLR.fit(X_train, y_train)  # Step 3



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
y_predLR = modelLR.predict(X_test) # Step 4

In [0]:
from sklearn.metrics import confusion_matrix # Step 5
cmlr = confusion_matrix(y_test,y_predLR) # step 6

In [25]:
cmlr

array([[117,   5],
       [ 17,  51]])

In [0]:
# 117+51 = 168 (Right Prediction)
# 17+5 = 22 (Wrong Prediction)
# Accuracy of Logistic Regression is : 88.42
# [(168*100) / 190] = 88.42
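The arithmetic in the comments above can be checked directly from the confusion matrix, because the correct predictions sit on its diagonal. A small sketch using the values printed above (the helper name `accuracy_from_cm` is illustrative):

```python
def accuracy_from_cm(cm):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Confusion matrix printed above for Logistic Regression.
cmlr = [[117, 5],
        [17, 51]]

print(round(accuracy_from_cm(cmlr) * 100, 2))  # 88.42
```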

# Random Forest Classifier

In [27]:
from sklearn.ensemble import RandomForestClassifier
modelR = RandomForestClassifier(n_estimators = 10,criterion = 'entropy',random_state = 0)
modelR.fit(X_train,y_train)
 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
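With `criterion='entropy'`, each tree chooses splits that reduce the Shannon entropy of the class labels in a node. As a rough illustration, the sketch below computes the entropy of the full dataset's class distribution (357 benign, 212 malignant). The `entropy` helper is an illustrative name, not a scikit-learn function.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a node with the given class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


print(entropy([357, 212]))  # mixed node, a little below 1 bit
print(entropy([100, 0]))    # pure node, no uncertainty left
```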

In [0]:
y_predR = modelR.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_predR)

In [30]:
cm

array([[119,   3],
       [ 10,  58]])

In [0]:
# 119+58 = 177 (Right Prediction)
# 10+3 = 13 (Wrong Prediction)
# Random Forest Accuracy Is : 93.15 (truncated; 93.16 when rounded)
# [(177*100)/190] = 93.1579

# Support Vector Classifier

In [32]:
from sklearn.svm import  SVC
modelSVM = SVC(kernel = 'linear',random_state = 0)
modelSVM.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)

In [0]:
y_predSVM = modelSVM.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cmsvm = confusion_matrix(y_test,y_predSVM)

In [35]:
cmsvm

array([[113,   9],
       [  7,  61]])

In [0]:
# 113+61 = 174 (Right Prediction)
# 7+9 = 16 (Wrong Prediction)
# SVM Accuracy Is : 91.57 (truncated; 91.58 when rounded)
# [(174*100)/190] = 91.5789

# Decision Tree Classifier

In [37]:
from sklearn.tree import DecisionTreeClassifier
modelDTC = DecisionTreeClassifier(criterion = 'entropy')
modelDTC.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [0]:
y_predDTC = modelDTC.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cmdtc = confusion_matrix(y_test,y_predDTC)

In [40]:
cmdtc

array([[109,  13],
       [  9,  59]])

In [0]:
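# 109+59 = 168 (Right Prediction)
# 13+9 = 22 (Wrong Prediction)
# Decision Tree Accuracy Is : 88.42 for the matrix shown above
# No random_state is set for this model, so results vary between runs;
# an earlier run gave 110+60 = 170 right predictions, i.e. 89.47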

# K-NN Classifier

In [42]:
from sklearn.neighbors import KNeighborsClassifier
modelKNN  = KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
modelKNN.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
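The `metric='minkowski'` and `p=2` settings mean that neighbours are ranked by Euclidean distance (with `p=1` it would be Manhattan distance). Here is a minimal sketch of that distance, applied to the first two feature values (`texture_mean`, `perimeter_mean`) of two rows from the test set shown earlier. The `minkowski` helper is an illustrative name.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# texture_mean and perimeter_mean of test rows 512 and 457.
row_512 = [20.52, 88.64]
row_457 = [25.25, 84.10]

print(minkowski(row_512, row_457, p=2))  # Euclidean distance
print(minkowski(row_512, row_457, p=1))  # Manhattan distance
```

Because the distance adds up raw feature differences, features with larger numeric ranges contribute more to it than features with small ranges.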

In [0]:
y_predKNN = modelKNN.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cmknn = confusion_matrix(y_test,y_predKNN)

In [45]:
cmknn

array([[113,   9],
       [  4,  64]])

In [0]:
# 113+64 = 177 (Right prediction)
# 4+9 = 13 (Wrong Prediction)
# K-NN Accuracy is : 93.15 (truncated; 93.16 when rounded)
# [(177*100)/190] = 93.1579

# Gaussian Naive Bayes

In [47]:
from sklearn.naive_bayes import GaussianNB
modelGNB = GaussianNB()
modelGNB.fit(X_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [0]:
y_predGNB = modelGNB.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix
cmgnb = confusion_matrix(y_test,y_predGNB)

In [50]:
cmgnb

array([[113,   9],
       [ 13,  55]])

In [0]:
# 113+55 = 168 (Right Prediction)
# 13+9 = 22 (Wrong Prediction)
# Accuracy of Gaussian Naive Bayes is : 88.42

In [0]:
from IPython.display import HTML, display
import tabulate
table = [
         ["Random Forest",(93.15)],
         ["K-Nearest Neighbors",(93.15)],
         ["Support Vector Classifier",(91.57)],
         ["Decision Tree",(89.47)],
         ["Logistic Regression",(88.42)],
         ["Gaussian Naive Bayes",(88.42)] ]


# Observation

In [53]:
display(HTML(tabulate.tabulate(table, tablefmt='html')))

0,1
Random Forest,93.15
K-Nearest Neighbors,93.15
Support Vector Classifier,91.57
Decision Tree,89.47
Logistic Regression,88.42
Gaussian Naive Bayes,88.42
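The same comparison can be derived from the confusion matrices printed in each section, which avoids copying numbers by hand. The sketch below recomputes the accuracy of every model and also reports recall on the malignant class (the share of malignant cases that were caught), since the two models tied on accuracy differ there. The matrices are copied from the outputs above; the dictionary and variable names are illustrative.

```python
# Confusion matrices as printed above: rows = actual, columns = predicted,
# class order [benign (0), malignant (1)].
matrices = {
    "Logistic Regression": [[117, 5], [17, 51]],
    "Random Forest": [[119, 3], [10, 58]],
    "Support Vector Classifier": [[113, 9], [7, 61]],
    "Decision Tree": [[109, 13], [9, 59]],
    "K-Nearest Neighbors": [[113, 9], [4, 64]],
    "Gaussian Naive Bayes": [[113, 9], [13, 55]],
}

results = []
for name, ((tn, fp), (fn, tp)) in matrices.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)  # malignant cases correctly identified
    results.append((name, accuracy, recall))

# Highest accuracy first.
for name, accuracy, recall in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{name:<26} accuracy={accuracy:.2%}  malignant recall={recall:.2%}")
```

The percentages here are rounded rather than truncated, so they can differ from the table in the last digit. The Decision Tree matrix shown in its section gives 168 right predictions, not the 170 behind the table's 89.47; no `random_state` was set for that model, so its results can change between runs.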
