# Breast Cancer Wisconsin dataset: k-NN algorithm

The following exercise is taken from <b>Machine Learning with R</b> by <b>Brett Lantz</b> (Third Edition).

The dataset used in the exercise is the <b>Breast Cancer Wisconsin (Diagnostic)</b> dataset, which originates from the <b>UCI Machine Learning Repository</b>. The copy used here is downloaded from the textbook's GitHub page.

## Step 1: Collecting the dataset

In [3]:
wbcd <- read.csv("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-with-R-Third-Edition/master/Chapter03/wisc_bc_data.csv", 
                stringsAsFactors = FALSE)

## Step 2: Exploring and preparing the data

In [2]:
str(wbcd) 

'data.frame':	569 obs. of  32 variables:
 $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
 $ diagnosis        : chr  "B" "B" "B" "B" ...
 $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
 $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
 $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
 $ area_mean        : num  464 346 373 385 712 ...
 $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
 $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
 $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
 $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
 $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
 $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
 $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
 $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
 $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
 $ area_se          : num  1

In [4]:
# The data contains an ID column. Because an ID uniquely identifies each example, a model that includes it
# can appear to predict every example correctly without learning anything useful. Hence, the column must be excluded.
wbcd <- wbcd[-1]

# column 'diagnosis' is of particular interest!
table(wbcd$diagnosis)


  B   M 
357 212 

In [5]:
# The diagnosis variable was read in as a character vector (because of stringsAsFactors = FALSE in the read.csv call).
# However, it is a categorical outcome and should be coded as a factor. Also, the labels 'M' and 'B' are not particularly informative.
wbcd$diagnosis <- factor(wbcd$diagnosis, levels =c("B", "M"), 
                         labels = c("Benign", "Malignant"))

In [6]:
# Checking the prop table

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)


   Benign Malignant 
     62.7      37.3 

In [7]:
# Checking the measurement scales of the variables
summary(wbcd)

     diagnosis    radius_mean      texture_mean   perimeter_mean  
 Benign   :357   Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
 Malignant:212   1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
                 Median :13.370   Median :18.84   Median : 86.24  
                 Mean   :14.127   Mean   :19.29   Mean   : 91.97  
                 3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
                 Max.   :28.110   Max.   :39.28   Max.   :188.50  
   area_mean      smoothness_mean   compactness_mean  concavity_mean   
 Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
 1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
 Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
 Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
 3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
 Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
  points_mean      symmetry

### Transformation - normalizing numeric data

In [8]:
# The k-NN algorithm is heavily dependent on the measurement scale of the input features.
# Since the features are measured on different scales, they must be rescaled. Here the features will be min-max normalized.

normalize <- function(x){
    return ((x - min(x)) / (max(x) - min(x)))
}

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

# Check whether normalization worked correctly
summary(wbcd_n)

  radius_mean      texture_mean    perimeter_mean     area_mean     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2233   1st Qu.:0.2185   1st Qu.:0.2168   1st Qu.:0.1174  
 Median :0.3024   Median :0.3088   Median :0.2933   Median :0.1729  
 Mean   :0.3382   Mean   :0.3240   Mean   :0.3329   Mean   :0.2169  
 3rd Qu.:0.4164   3rd Qu.:0.4089   3rd Qu.:0.4168   3rd Qu.:0.2711  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
 smoothness_mean  compactness_mean concavity_mean     points_mean    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.3046   1st Qu.:0.1397   1st Qu.:0.06926   1st Qu.:0.1009  
 Median :0.3904   Median :0.2247   Median :0.14419   Median :0.1665  
 Mean   :0.3948   Mean   :0.2606   Mean   :0.20806   Mean   :0.2431  
 3rd Qu.:0.4755   3rd Qu.:0.3405   3rd Qu.:0.30623   3rd Qu.:0.3678  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
 symmetry_mean    dimension
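
As a quick sanity check of the min-max formula, the following sketch applies the same `normalize()` function to two small made-up vectors (they are illustrative assumptions, not values from the dataset). The second vector is ten times larger than the first, yet both should map to the same values between 0 and 1.

```r
# Min-max normalization, same definition as in the cell above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Two toy vectors (illustrative only) that differ just by scale
small <- c(1, 2, 3, 4, 5)
large <- c(10, 20, 30, 40, 50)

normalize(small)   # 0.00 0.25 0.50 0.75 1.00
normalize(large)   # 0.00 0.25 0.50 0.75 1.00

# Both results are identical, so the original scale no longer matters
all.equal(normalize(small), normalize(large))
```

Note that a constant column would make `max(x) - min(x)` zero, so this simple version assumes every feature takes at least two distinct values.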

### Data preparation - creating training and test datasets

In [25]:
# Dividing the dataset into training and test data. Typically, the data must be shuffled first.
# However, in this case, the data is already randomly ordered and not arranged chronologically or in groups.
# The labels will be stored in separate variables.

wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
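
If the rows were not already in random order, they could be shuffled before splitting. The sketch below shows one common way to do this with `sample()`; the data frame `toy`, the 7/3 split, and the seed value are hypothetical choices made only for illustration.

```r
set.seed(123)  # arbitrary seed, used only to make the shuffle reproducible

# Toy data frame standing in for a dataset that is sorted by label (hypothetical)
toy <- data.frame(id = 1:10, label = rep(c("B", "M"), each = 5))

# sample(n) returns a random permutation of 1..n; use it to reorder the rows
shuffled <- toy[sample(nrow(toy)), ]

# Split only after shuffling: first 7 rows for training, last 3 for testing
toy_train <- shuffled[1:7, ]
toy_test  <- shuffled[8:10, ]
```

Without the shuffle, the test set in this toy case would contain only "M" rows, which is the kind of ordering problem the comment above warns about.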

## Step 3: Training a model on the data

In [26]:
# A good first estimate for k is the square root of the number of training examples
sqrt(dim(wbcd_train)[1])
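
The square root is rarely a whole number, so it has to be turned into a usable k. The sketch below shows one possible convention, assumed here for illustration: round down and, if the result is even, move to the next odd number so that a two-class vote cannot end in a tie.

```r
n_train <- 469            # number of training examples in this exercise
k_raw   <- sqrt(n_train)  # roughly 21.66
k_floor <- floor(k_raw)   # 21

# Prefer an odd k so that a vote between two classes cannot tie
k <- if (k_floor %% 2 == 0) k_floor + 1 else k_floor
k
```

For 469 training examples this yields 21.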

In [27]:
library("class")

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                     cl = wbcd_train_labels, k = 21)

## Step 4: Evaluating model performance

In [28]:
# Cross tabulation comparing predicted and actual label vectors

library("gmodels")
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                 | wbcd_test_pred 
wbcd_test_labels |    Benign | Malignant | Row Total | 
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 | 
                 |     1.000 |     0.000 |     0.610 | 
                 |     0.968 |     0.000 |           | 
                 |     0.610 |     0.000 |           | 
-----------------|-----------|-----------|-----------|
       Malignant |         2 |        37 |        39 | 
                 |     0.051 |     0.949 |     0.390 | 
                 |     0.032 |     1.000 |           | 
                 |     0.020 |     0.370 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |        63 |        37 |       100 | 
           

- The top-left cell contains the true negatives: 61 of the 100 test cases were benign and were correctly classified as benign.

- The bottom-right cell contains the true positives: of the 39 malignant tumors, 37 were correctly classified as malignant.

- The lower-left cell contains the false negatives: here, the k-NN prediction erroneously classified 2 malignant tumors as benign.

- In total, the algorithm has an accuracy of 98%: (61 + 37) of the 100 test cases were classified correctly.
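
The figures in the list above can be recomputed directly from the four cell counts of the cross table. The sketch below treats "Malignant" as the positive class; the variable names are chosen here for illustration.

```r
# Counts copied from the cross table above (Malignant = positive class)
tn <- 61  # benign tumors predicted as benign
fp <- 0   # benign tumors predicted as malignant
fn <- 2   # malignant tumors predicted as benign
tp <- 37  # malignant tumors predicted as malignant

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of malignant cases detected
specificity <- tn / (tn + fp)                   # share of benign cases recognized as benign

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.980, sensitivity 0.949, specificity 1.000
```

The sensitivity of 0.949 matches the row proportion shown in the Malignant row of the table.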

## Step 5: Improving model performance

### Transformation - z-score standardization

In [21]:
# The min-max normalization compresses outliers toward the center.
# There might be reason to weight outliers more heavily (e.g. uncontrollably growing tumors)
wbcd_z <- as.data.frame(scale(wbcd[-1]))

# Check whether the transformation occurred correctly (the mean of every column must be zero):
summary(wbcd_z)

  radius_mean       texture_mean     perimeter_mean      area_mean      
 Min.   :-2.0279   Min.   :-2.2273   Min.   :-1.9828   Min.   :-1.4532  
 1st Qu.:-0.6888   1st Qu.:-0.7253   1st Qu.:-0.6913   1st Qu.:-0.6666  
 Median :-0.2149   Median :-0.1045   Median :-0.2358   Median :-0.2949  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4690   3rd Qu.: 0.5837   3rd Qu.: 0.4992   3rd Qu.: 0.3632  
 Max.   : 3.9678   Max.   : 4.6478   Max.   : 3.9726   Max.   : 5.2459  
 smoothness_mean    compactness_mean  concavity_mean     points_mean     
 Min.   :-3.10935   Min.   :-1.6087   Min.   :-1.1139   Min.   :-1.2607  
 1st Qu.:-0.71034   1st Qu.:-0.7464   1st Qu.:-0.7431   1st Qu.:-0.7373  
 Median :-0.03486   Median :-0.2217   Median :-0.3419   Median :-0.3974  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.63564   3rd Qu.: 0.4934   3rd Qu.: 0.5256   3rd Qu.: 0.6464  
 Max.   : 4.76672   Max.   : 4.5644   Max.   
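
To make explicit what `scale()` does in the call above, the sketch below compares it with a manual z-score on a small made-up vector (the values are an illustrative assumption). With its default arguments, `scale()` subtracts the mean and divides by the sample standard deviation.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # toy values, illustrative only

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; convert it to a plain vector to compare
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)  # TRUE
mean(z_scale)                 # numerically zero
sd(z_scale)                   # 1
```

Unlike min-max normalization, the result has no fixed minimum or maximum, which is why extreme values remain visible in the summary output.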

In [31]:
# Same steps as before: Divide into test and train, classify instances with knn(), compare predicted vs. actual labels

wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]

In [32]:
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                     cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  100 

 
                 | wbcd_test_pred 
wbcd_test_labels |    Benign | Malignant | Row Total | 
-----------------|-----------|-----------|-----------|
          Benign |        61 |         0 |        61 | 
                 |     1.000 |     0.000 |     0.610 | 
                 |     0.924 |     0.000 |           | 
                 |     0.610 |     0.000 |           | 
-----------------|-----------|-----------|-----------|
       Malignant |         5 |        34 |        39 | 
                 |     0.128 |     0.872 |     0.390 | 
                 |     0.076 |     1.000 |           | 
                 |     0.050 |     0.340 |           | 
-----------------|-----------|-----------|-----------|
    Column Total |        66 |        34 |       100 | 
           

- There is no improvement in accuracy. In fact, the z-score standardization lowered accuracy from 98% to 95%.
- The number of false negatives even increased, from 2 to 5.
- Hence, min-max normalization is better suited in this case.