# K-means in R

First, let us look at the data. The `iris` data set ships with R, so nothing needs to be loaded:

In [37]:
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


I am going to use the first 4 columns to cluster the data, and then compare these clusters with the actual species of the flower.

In [58]:
model1 <- kmeans(iris[,1:4],3,algorithm="Lloyd")
table1 <- table(model1$cluster,iris[,5])
table1

   
    setosa versicolor virginica
  1      0         46        50
  2     32          0         0
  3     18          4         0

The k-means algorithm is not deterministic, since it starts from a random guess as to where the centers should be. We can, however, also provide an initial set of centers.

In [59]:
model2 <- kmeans(iris[,1:4],iris[c(11,62,112),1:4],algorithm="Lloyd")
table2 <- table(model2$cluster,iris[,5])
table2

   
    setosa versicolor virginica
  1     50          0         0
  2      0         47        14
  3      0          3        36
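
Because the random start changes the result from run to run, it can help to fix the random seed and let `kmeans` try several random starts, keeping the best one. The following is a minimal sketch under stated assumptions: the seed `42`, `nstart = 25`, and `iter.max = 100` are arbitrary illustrative choices, not values from the analysis above.

```r
# Fix the random seed so the random initial centers are reproducible.
set.seed(42)  # arbitrary seed, chosen for illustration
fit_a <- kmeans(iris[, 1:4], centers = 3, nstart = 25,
                iter.max = 100, algorithm = "Lloyd")

# Same seed and same call: the second run repeats the first.
set.seed(42)
fit_b <- kmeans(iris[, 1:4], centers = 3, nstart = 25,
                iter.max = 100, algorithm = "Lloyd")

identical(fit_a$cluster, fit_b$cluster)

# Total within-cluster sum of squares of the best start (lower means tighter clusters).
fit_a$tot.withinss
```

With `nstart` greater than 1, `kmeans` runs the algorithm from several random initializations and returns the run with the smallest total within-cluster sum of squares.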

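We must also decide which of these models is better. Looking at the confusion matrices and comparing the results by eye is not a statistically quantifiable process, so we need a better measure. For that we use the $\chi^2$-test of independence between cluster assignment and species. Since both tables have the same dimensions and the same number of observations, a larger statistic indicates a stronger association between clusters and species: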

In [60]:
print(chisq.test(table1))
print(chisq.test(table2))


	Pearson's Chi-squared test

data:  table1
X-squared = 136.61, df = 4, p-value < 2.2e-16


	Pearson's Chi-squared test

data:  table2
X-squared = 218.66, df = 4, p-value < 2.2e-16
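
To see what the test computes, here is a sketch that rebuilds the statistic for `table2` by hand. The counts are copied from the `table2` output above; the names `tab`, `expected`, and `stat` are introduced here for illustration only.

```r
# Counts copied from table2: rows are clusters, columns are species.
tab <- matrix(c(50,  0,  0,
                 0, 47, 14,
                 0,  3, 36),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              species = c("setosa", "versicolor", "virginica")))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected.
stat <- sum((tab - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

round(stat, 2)                             # 218.66, matching the output above
pchisq(stat, df = df, lower.tail = FALSE)  # upper-tail p-value
```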



# K-nearest neighbor in R

The k-means algorithm is a *clustering* algorithm, and clustering belongs to the class of machine learning algorithms called **unsupervised**. In an unsupervised learning algorithm, we do not have a preconceived notion of what the clusters are. The algorithm decides where and how the clusters should form by using the internal structure of the data alone.

In a **supervised** machine learning setup, we have a fixed set of labels and a set of examples for which we know the correct label. The k-nearest neighbor algorithm is one of these: it assigns a new instance the label most common among its k closest labeled examples.
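
To make the idea concrete, here is a minimal hand-written sketch of the simplest case, k = 1, using Euclidean distance. The helper `nn1_predict` and the toy data are hypothetical and exist only for this illustration; they are not part of any library.

```r
# Hypothetical helper: classify one query point by its single nearest neighbor.
nn1_predict <- function(train_x, train_labels, query) {
  # Subtract the query from every training row, column by column.
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(query))
  # Squared Euclidean distance from the query to each training row.
  dists <- rowSums(diffs^2)
  # The label of the closest training example is the prediction.
  train_labels[which.min(dists)]
}

# Tiny made-up training set with two features and two labels.
toy_x <- data.frame(x = c(1, 1.2, 5, 5.1), y = c(1, 0.9, 5, 4.8))
toy_labels <- factor(c("a", "a", "b", "b"))

nn1_predict(toy_x, toy_labels, c(4.9, 5.2))  # nearest point is (5, 5), labeled "b"
```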

Each supervised machine learning algorithm has two phases: 

1. training
2. testing

So, we must split our data set into two pieces:


In [48]:
N <- nrow(iris)
xs <- sample(1:N, floor(0.75*N))
train <- iris[xs,]
test <- iris[-xs,]
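
Since `sample` draws different rows on every run, a seed makes the split repeatable. The sketch below repeats the split with an arbitrary seed and checks that the two pieces are disjoint; the names `idx`, `train_chk`, and `test_chk` are illustrative and chosen so they do not overwrite the variables above.

```r
set.seed(1)  # arbitrary seed so the same rows are drawn on every run
N <- nrow(iris)
idx <- sample(1:N, floor(0.75 * N))  # row indices for the training set
train_chk <- iris[idx, ]
test_chk  <- iris[-idx, ]            # negative indexing keeps the remaining rows

# The two pieces are disjoint and together cover every row.
nrow(train_chk)  # 112
nrow(test_chk)   # 38
length(intersect(rownames(train_chk), rownames(test_chk)))  # 0

# Species counts in each piece; a purely random split need not be balanced.
table(train_chk$Species)
table(test_chk$Species)
```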

Now, we are ready to use k-nearest neighbor.

Let us load the necessary library:


In [49]:
library(class)

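The k-nearest neighbor algorithm has no separate model-fitting step: `knn` takes the training data and their labels and classifies the test data in a single call: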

In [62]:
result <- knn(train[,1:4],test[,1:4],cl=train[,5],k=1)
table3 <- table(real=test[,5],predicted=result)
table3

            predicted
real         setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         12         0
  virginica       0          1        11
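
A confusion matrix like this also yields a direct accuracy figure. The sketch below rebuilds the matrix from the `table3` counts shown above, since the random split means a fresh run would give different numbers; `conf` and `accuracy` are names introduced here for illustration.

```r
# Counts copied from table3: rows are true species, columns are predictions.
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(14,  0,  0,
                  0, 12,  0,
                  0,  1, 11),
               nrow = 3, byrow = TRUE,
               dimnames = list(real = species, predicted = species))

# Overall accuracy: correctly classified flowers (the diagonal) over all flowers.
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 37 / 38, about 0.974

# Per-species recall: fraction of each true species that was predicted correctly.
diag(conf) / rowSums(conf)
```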

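And the required $\chi^2$-test. R warns that the chi-squared approximation may be incorrect, because with only 38 test flowers some expected cell counts are small: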

In [63]:
chisq.test(table3)

"Chi-squared approximation may be incorrect"


	Pearson's Chi-squared test

data:  table3
X-squared = 70.154, df = 4, p-value = 2.106e-14
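
The warning can be investigated by looking at the expected counts, which a common rule of thumb wants to be at least 5 in every cell. The sketch below rebuilds `table3` from the counts shown above and tries two alternatives that do not rely on the chi-squared approximation; the name `tab3` and the seed are illustrative assumptions.

```r
# Counts copied from table3: rows are true species, columns are predictions.
species <- c("setosa", "versicolor", "virginica")
tab3 <- matrix(c(14,  0,  0,
                  0, 12,  0,
                  0,  1, 11),
               nrow = 3, byrow = TRUE,
               dimnames = list(real = species, predicted = species))

# Expected counts under independence; several cells fall below 5.
expected <- suppressWarnings(chisq.test(tab3))$expected
round(expected, 2)
any(expected < 5)

# Alternative 1: Monte Carlo p-value instead of the chi-squared approximation.
set.seed(1)  # arbitrary seed; the simulated p-value depends on it
chisq.test(tab3, simulate.p.value = TRUE, B = 10000)

# Alternative 2: Fisher's exact test, which also handles tables larger than 2 x 2.
fisher.test(tab3)
```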
