# Predicting malignant or benign prostate cancer with the KNN classification algorithm

The data set contains patients who have been diagnosed with either Malignant (M) or Benign (B) cancer.

In [1]:
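# Load the packages we need, installing them first only if they are missing.
for (pkg in c("class", "gmodels")) {
    if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(class)    # provides knn()
library(gmodels)  # provides CrossTable()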


The downloaded binary packages are in
	/var/folders/xs/2xmt88qs64s3nhph7yjcxgth0000gn/T//RtmpluJsyc/downloaded_packages


In [2]:
# Load the data and check its characteristics.

prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors = FALSE)
summary(prc)

       id         diagnosis_result       radius         texture     
 Min.   :  1.00   Length:100         Min.   : 9.00   Min.   :11.00  
 1st Qu.: 25.75   Class :character   1st Qu.:12.00   1st Qu.:14.00  
 Median : 50.50   Mode  :character   Median :17.00   Median :17.50  
 Mean   : 50.50                      Mean   :16.85   Mean   :18.23  
 3rd Qu.: 75.25                      3rd Qu.:21.00   3rd Qu.:22.25  
 Max.   :100.00                      Max.   :25.00   Max.   :27.00  
   perimeter           area          smoothness      compactness    
 Min.   : 52.00   Min.   : 202.0   Min.   :0.0700   Min.   :0.0380  
 1st Qu.: 82.50   1st Qu.: 476.8   1st Qu.:0.0935   1st Qu.:0.0805  
 Median : 94.00   Median : 644.0   Median :0.1020   Median :0.1185  
 Mean   : 96.78   Mean   : 702.9   Mean   :0.1027   Mean   :0.1267  
 3rd Qu.:114.25   3rd Qu.: 917.0   3rd Qu.:0.1120   3rd Qu.:0.1570  
 Max.   :172.00   Max.   :1878.0   Max.   :0.1430   Max.   :0.3450  
    symmetry      fractal_dimensio

Remove the 'id' column from the data set because it is not needed to build this model.

In [3]:
prc <- prc[-1]

In [4]:
sum(is.na(prc))

This counts the missing values in the whole data set.
Good news: there are no NAs.
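
A single total does not show which column any missing values sit in. As a minimal sketch, using a small made-up data frame (`toy` is hypothetical and is not the prostate data), `colSums(is.na(...))` gives a count per column:

```r
# Hypothetical toy data frame with two missing values (not the prostate data).
toy <- data.frame(
    radius  = c(9, NA, 17),
    texture = c(11, 14, NA),
    area    = c(202, 476, 644)
)

total_na <- sum(is.na(toy))            # overall count, as in the cell above
na_per_column <- colSums(is.na(toy))   # one count per column

total_na
na_per_column
```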

Since the dependent variable "diagnosis_result" is coded as M or B, I add another column, "diagnosis", that holds descriptive labels (Malignant, Benign).

Then I normalize the numeric variables and randomly divide the data into training and testing sets in a 65:35 ratio.

In [5]:
# Add a factor column with descriptive labels for the diagnosis.
prc$diagnosis <- factor(prc$diagnosis_result, levels = c("B", "M"), labels = c("Benign", "Malignant"))

# Min-max normalize the numeric variables to the range [0, 1].
normalize <- function(x) {
    (x - min(x)) / (max(x) - min(x))
}

prc_normalized <- as.data.frame(lapply(prc[2:9], normalize))
prc_normalized <- cbind(diagnosis = prc$diagnosis, prc_normalized)

# Randomly divide the data into train (65%) and test (35%) sets.
# No seed is set, so the split differs from run to run.
train_index <- sample(1:nrow(prc_normalized), 0.65 * nrow(prc_normalized))
prc_train <- prc_normalized[train_index, ]
prc_test <- prc_normalized[-train_index, ]

# Extract the class labels (column 1) of the train and test sets.
train_label <- prc_train[, 1]
test_label <- prc_test[, 1]
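
KNN ranks neighbours by distance, so a feature with a wide range (area runs from 202 to 1878 in the summary above) would swamp one with a narrow range (smoothness, 0.07 to 0.143) unless both are rescaled. The following is a small sketch of what `normalize` does, applied to the minimum, median and maximum of `area` taken from that summary; the names `area_values` and `area_scaled` are only for illustration.

```r
# Same min-max formula as in the cell above.
normalize <- function(x) {
    (x - min(x)) / (max(x) - min(x))
}

# Minimum, median and maximum of 'area' from the summary output.
area_values <- c(202, 644, 1878)
area_scaled <- normalize(area_values)

area_scaled
```

The minimum maps to 0 and the maximum to 1, with every other value placed proportionally in between, so each feature contributes to the distance on the same scale.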


Apply the KNN classification algorithm with several values of k and record the accuracy of the model on the test set for each value.

In this case I tried k = 3 to 12.


In [6]:
# Fit KNN for each k from 3 to 12 and record the test-set accuracy.

result <- data.frame()
for (i in 3:12)
{
    prc_pred <- knn(train = prc_train[, -1], test = prc_test[, -1], cl = train_label, k = i)
    confusion_matrix <- table(test_label, prc_pred)

    # Accuracy = correct predictions (diagonal) / all predictions.
    result <- rbind(result, c(i, sum(diag(confusion_matrix)) / sum(confusion_matrix)))
}
names(result) <- c("k", "accuracy")

result

  k   accuracy
  3  0.8285714
  4  0.7428571
  5  0.8285714
  6  0.8571429
  7  0.8285714
  8  0.8571429
  9  0.8571429
 10  0.8
 11  0.9142857
 12  0.8285714
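
To read the best value off the table programmatically, `which.max` can pick the row with the highest accuracy. The sketch below re-enters the accuracies printed above by hand so that it runs on its own; in the notebook, the `result` data frame built by the loop would be used directly. The name `best` is only for illustration.

```r
# Accuracies copied from the output table above.
result <- data.frame(
    k = 3:12,
    accuracy = c(0.8285714, 0.7428571, 0.8285714, 0.8571429, 0.8285714,
                 0.8571429, 0.8571429, 0.8, 0.9142857, 0.8285714)
)

# Row with the highest accuracy.
best <- result[which.max(result$accuracy), ]
best
```

Note that `which.max` returns the first maximum, so if several values of k tie, the smallest one is reported.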


The highest accuracy in this run, 0.914, was achieved for k = 11.
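
With only 35 test rows, one misclassified patient shifts the accuracy by almost three percentage points, so a single random split gives a noisy ranking of k. One possible cross-check, which is not part of the original analysis, is leave-one-out cross-validation with `knn.cv` from the `class` package. The sketch below uses the built-in `iris` data as a stand-in so that it runs without `Prostate_Cancer.csv`; the names `features`, `labels` and `cv_result` are illustrative, and the seed value is arbitrary.

```r
library(class)

set.seed(42)  # arbitrary seed; knn.cv breaks distance ties at random

normalize <- function(x) {
    (x - min(x)) / (max(x) - min(x))
}

# Stand-in data: iris replaces the prostate data set in this sketch.
features <- as.data.frame(lapply(iris[1:4], normalize))
labels <- iris$Species

# Leave-one-out accuracy for each k: every row is predicted from all other rows.
k_values <- 3:12
cv_accuracy <- sapply(k_values, function(k) {
    pred <- knn.cv(train = features, cl = labels, k = k)
    mean(pred == labels)
})

cv_result <- data.frame(k = k_values, accuracy = cv_accuracy)
cv_result
```

For the prostate data, `features` would presumably be the normalized numeric columns and `labels` the `diagnosis` factor; every row then serves as a test case once, which makes the comparison between values of k less dependent on one particular split.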