# Project Title
A comparison of the k-means clustering algorithm and the k-nearest neighbors (k-NN) classification algorithm

## Members

1. First member: Kübra Zeynep Zor  [`zork@itu.edu.tr`] (k-means algorithm)
2. Second member: Nur Yılmaz [`yilmaznur@itu.edu.tr`] (k-NN algorithm)

## Description of the project
The goal is to build statistical models on two data sets, the Iris data set and the Wine data set, and to make predictions by applying the k-means clustering and k-NN classification algorithms. We will use Python's scikit-learn library.

### The methods to be used

**k-means**: The algorithm divides n observations into k disjoint sets so that similar elements end up in the same cluster; every element belongs to exactly one cluster. It aims to minimize the sum of the squared distances from each point to the center of its cluster. First, the initial cluster centers are chosen. Then the distance from each sample to each center is calculated, and each sample is assigned to the cluster with the closest center. After that, each cluster center is updated to the average of the samples assigned to it. These steps are repeated until the center points no longer change.
The Euclidean distance can be used to measure how far the points are from the cluster centers. For two points $p=(p_1,p_2,...,p_n)$ and $q=(q_1,q_2,...,q_n)$, the Euclidean distance is: $$d(p,q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2}=\sqrt{(p_1-q_1)^2+(p_2-q_2)^2+...+(p_n-q_n)^2}$$
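
As a small worked example of the formula, the sketch below computes the Euclidean distance between the first two Iris rows that appear in the table later in this notebook. The function name `euclidean_distance` is only an illustrative choice, not part of scikit-learn.

```python
import math

def euclidean_distance(p, q):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of components")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# First two rows of the Iris table (sepal length, sepal width, petal length, petal width).
p = (5.1, 3.5, 1.4, 0.2)
q = (4.9, 3.0, 1.4, 0.2)

# Only the first two components differ: sqrt(0.2^2 + 0.5^2) = sqrt(0.29)
print(euclidean_distance(p, q))
```

The two flowers differ only in their sepal measurements, so the distance reduces to $\sqrt{0.29}\approx 0.54$.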

Algorithmic steps for k-means clustering:

Let $X=(x_1,x_2,...,x_n)$ be the set of data points and $V=(v_1,v_2,...,v_c)$ be the set of centers.

1. Randomly place $c$ cluster centers.
2. Calculate the distance from each data point $x_i$ to each cluster center.
3. Find the nearest cluster center and assign the data point to that cluster.
4. Recalculate each cluster center using:$$v_i=\frac{1}{c_i}\sum_{j=1}^{c_i}x_j$$ where $c_i$ is the number of data points in the $i$th cluster and the sum runs over the points $x_j$ assigned to that cluster.
5. Recalculate the distance between each data point and the new cluster centers.
6. Stop if no data point was reassigned to a different cluster; otherwise go back to step 3.
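
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the scikit-learn implementation: it assumes the initial centers are drawn from the data points themselves (the steps only say they are placed randomly), and the names `k_means`, `mean_point` and the toy `points` are hypothetical.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, c, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick c distinct data points as initial centers (an assumption).
    centers = rng.sample(points, c)
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3: assign every point to its nearest center.
        new_assignment = [
            min(range(c), key=lambda i: euclidean(x, centers[i])) for x in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned points.
        for i in range(c):
            members = [x for x, a in zip(points, assignment) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = mean_point(members)
    return centers, assignment

# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, assignment = k_means(points, c=2)
print(centers)
print(assignment)
```

Which group receives label 0 and which receives label 1 depends on the random initial centers, so only the grouping itself is meaningful.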

This algorithm aims at minimizing an objective function known as the squared error function, given by: $$J(V)=\sum_{i=1}^{c}\sum_{j=1}^{c_i}(||x_j-v_i||)^2$$ where $||x_j-v_i||$ is the Euclidean distance between the data point $x_j$ and the center $v_i$ of its cluster,
$c_i$ is the number of data points in the $i$th cluster,
and $c$ is the number of cluster centers.
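
To make the objective concrete, the sketch below evaluates $J(V)$ for a fixed assignment and two hypothetical choices of centers. The helper name `squared_error` and the toy values are assumptions made for illustration. In scikit-learn, a fitted `KMeans` exposes this quantity as `inertia_`, and `KMeans.score` returns its negative, which is why scores printed by that method are negative.

```python
def squared_error(points, centers, assignment):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for x, i in zip(points, assignment):
        total += sum((a - b) ** 2 for a, b in zip(x, centers[i]))
    return total

# Four points on a line, split into two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
assignment = [0, 0, 1, 1]

mean_centers = [(1.0, 0.0), (11.0, 0.0)]     # the cluster means
shifted_centers = [(0.0, 0.0), (10.0, 0.0)]  # centers moved away from the means

print(squared_error(points, mean_centers, assignment))     # 4.0
print(squared_error(points, shifted_centers, assignment))  # 8.0
```

For a fixed assignment the cluster means give the smaller value, which is the reason step 4 of the algorithm moves each center to the mean of its cluster.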

The k-means algorithm is fast and easy to understand. It gives the best results when the clusters in the data set are well separated from each other. One disadvantage is that an appropriate number of clusters must be chosen before the algorithm starts. Another is that a single extreme value can significantly shift the mean of a cluster and therefore its center, so the algorithm is sensitive to noisy data and outliers. It also performs poorly on data sets whose clusters are not linearly separable.

**k-NN**: The nearest neighbors algorithm is an instance-based learning method for classification. It is one of the simplest of all machine learning algorithms. The general idea is that the sample under investigation is assigned to the class that is most common among its nearest neighbors. The choice of k is important. Generally, larger values of k reduce the effect of noise on the classification but make the boundaries between classes less distinct. A suitable k can be selected with techniques such as the bootstrap method. Using every observation in a data set for estimation costs time and resources, so samples that represent the population well are needed. Starting from the observations in a data set of any size, the data is resampled at random with replacement, which creates new data sets of various numbers and sizes. The algorithms are then applied to these new data sets. This method was developed by Bradley Efron in 1979 and is called the bootstrap (resampling) method.
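
One possible way to use bootstrap resampling for choosing k is sketched below. It is an illustration under stated assumptions, not the procedure this project has fixed: each resample is scored on the observations that were not drawn (the "out-of-bag" points), the data is a hypothetical one-dimensional toy set, and the names `knn_predict` and `bootstrap_accuracy` are made up for the example.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Classify a 1-D query by majority vote among its k nearest (value, label) pairs."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def bootstrap_accuracy(data, k, n_rounds=30, seed=0):
    """Average out-of-bag accuracy of k-NN over n_rounds bootstrap resamples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        # Resample with replacement, keeping the original data set size.
        drawn = [rng.randrange(len(data)) for _ in range(len(data))]
        drawn_set = set(drawn)
        train = [data[i] for i in drawn]
        # Observations that were never drawn act as the test set.
        left_out = [data[i] for i in range(len(data)) if i not in drawn_set]
        if not left_out:
            continue
        hits = sum(knn_predict(train, value, k) == label for value, label in left_out)
        scores.append(hits / len(left_out))
    return sum(scores) / len(scores)

# Hypothetical toy data: two overlapping 1-D classes.
data = [(v / 10, "a") for v in range(0, 25)] + [(v / 10, "b") for v in range(15, 40)]

candidates = (1, 3, 5, 7)
results = {k: bootstrap_accuracy(data, k) for k in candidates}
best_k = max(results, key=results.get)
print(results)
print("selected k:", best_k)
```

Odd values of k are used so that a two-class vote cannot end in a tie.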

The steps of the algorithm are as follows:

1. Decide how many nearest neighbors, i.e. k, will be considered.
2. Choose n known (labeled) samples against which the test sample will be classified.
3. Calculate the distances from the test sample to the known samples and choose the k nearest ones.
4. Determine the most frequent class among these k samples.
5. Assign the test sample to that most frequent class.
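
The steps above can be sketched as a short plain-Python function. The training points and labels are hypothetical, and `knn_classify` is an illustrative name rather than a scikit-learn function. The distance defaults to `math.dist` (Euclidean, available from Python 3.8); any other metric, such as the Manhattan distance, could be passed through the `distance` argument instead.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, sample, k, distance=math.dist):
    # Step 3: rank the known samples by their distance to the test sample
    # and keep the k nearest ones.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], sample))
    nearest_labels = [label for _, label in ranked[:k]]
    # Steps 4-5: the most frequent class among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D training samples with made-up class labels.
train_X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (1.1, 1.1), k=3))  # a
print(knn_classify(train_X, train_y, (3.9, 4.1), k=3))  # b
```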

The distance between samples can be measured with metrics such as the L1 norm (Manhattan or taxicab distance) or the L2 norm (Euclidean distance). The Euclidean distance may become less effective as the data size increases, so we will use the Manhattan distance. The name comes from the grid-like street layout of the Manhattan borough of New York: the taxicab distance is the distance a car has to travel to carry passengers between two points in a city laid out in square blocks.

The taxicab distance $d_1$ is calculated using: $$d_1(p,q) = ||p-q||_1 = \sum_{i=1}^{n}|p_i-q_i|$$ where $p$ and $q$ are the vectors $p=(p_1,p_2,...,p_n)$ and $q=(q_1,q_2,...,q_n)$.
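
A short worked example of the taxicab formula, with the illustrative helper name `manhattan_distance`: going from $(0,0)$ to $(3,4)$ along a grid takes $3+4=7$ units, whereas the straight-line Euclidean distance is 5.

```python
def manhattan_distance(p, q):
    """Return the taxicab (L1) distance between two equal-length vectors."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(manhattan_distance((0, 0), (3, 4)))  # 7

# First two rows of the Iris table: |5.1 - 4.9| + |3.5 - 3.0| = 0.7 (up to rounding)
print(manhattan_distance((5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2)))
```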


 
The algorithm is resistant to noisy data but is not very effective on high-dimensional data sets. It can be used with qualitative as well as quantitative data.

### The data

The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the statistician and biologist Ronald Fisher. It consists of 50 samples from each of three species: Iris setosa, Iris virginica and Iris versicolor. Four attributes were measured for each sample: the length and the width of the sepals and petals, in centimetres. Fisher developed a linear discriminant model to classify the species according to these features.

Data size: 150 entries

Data distribution: 50 entries for each class


The Wine data set contains the results of a chemical analysis of three kinds of wine produced in Italy. The analysis gives the quantities of the same 13 constituents in each wine.

Data size: 178 entries

3 classes

Data distribution: 59, 71, and 48 entries for each class

### References

1.  https://en.wikipedia.org/wiki/K-means_clustering
2.  https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
3.  http://mirlab.org/jang/books/dcpr/dataSetIris.asp?title=2-2%20Iris%20Dataset
4.  http://mirlab.org/jang/books/dcpr/dataSetWine.asp?title=2-3%20Wine%20Dataset
5.  http://www.emo.org.tr/ekler/8c1874c96244659_ek.pdf
6.  https://www.youtube.com/watch?v=hd1W4CyPX58
7.  https://en.wikipedia.org/wiki/Taxicab_geometry
8.  https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
9.  http://acikerisim.ticaret.edu.tr:8080/xmlui/bitstream/handle/11467/208/M00041.pdf?sequence=1&isAllowed=y
10. http://dergipark.ulakbim.gov.tr/egeziraat/article/viewFile/5000154071/5000139378




In [15]:
#import numpy as np

#from sklearn.datasets import load_iris
#iris = load_iris()

#X = iris.data
#y = iris.target

#k_means = cluster.KMeans(n_clusters=3)
#k_means.fit(X)
#y_pred = KMeans.predict(X)

#print (X.shape)
#print (y.shape)

#from sklearn.neighbors import KNeighborsClassifier

#knn = KNeighborsClassifier(n_neighbors=1)

#print (knn)

#n_samples, n_features=iris.data.shape
#print (n_samples, n_features) 

#from sklearn import cluster
#from sklearn.neighbors import NearestNeighbors

#from sklearn.datasets import load_iris
#iris = load_iris()

In [16]:
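# Load the libraries used in the Iris experiments
from sklearn.datasets import load_iris
import pandas as pd
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module has been removed from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.cluster import KMeans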




In [17]:
iris=load_iris()
#X=iris.data
#y=iris.target
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)



In [18]:
ir=pd.DataFrame(iris.data)
ir.columns=iris.feature_names
ir['CLASS']=iris.target
ir.head(150)

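| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | CLASS |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |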


In [19]:
#clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
#clf.score(X_test, y_test) 

In [20]:
# k-means for iris

km = KMeans(n_clusters=3, max_iter=1000)
# KMeans is unsupervised: fit() ignores class labels, so only the features are passed
km.fit(X_train)



KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [71]:
from sklearn import cluster, datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data before splitting it
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, random_state=0, test_size=0.40)

# KMeans is unsupervised: the class labels are not used for fitting
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_train)
predict_values = k_means.predict(X_test)
print(y_test)
print(predict_values)
# Cluster ids are arbitrary (cluster 0 need not be class 0), so this raw score
# is not a true accuracy until the clusters are matched to the classes
accuracy_score(y_test, predict_values)

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 1 1 1 2 0 2 0 0 1 2 2 2 2 1 2 1 1 2 2 2 2 1 2]
[2 2 1 0 1 0 1 2 2 2 0 2 2 2 2 1 2 2 1 1 2 2 1 1 2 1 1 2 2 1 0 2 1 2 0 2 1
 2 2 2 0 1 0 1 1 2 0 0 2 0 2 0 2 2 2 2 2 2 0 0]


0.14999999999999999
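
The score of 0.15 is low because k-means numbers its clusters arbitrarily: a cluster id of 1 says nothing about class 1. The sketch below, written in plain Python and using the two arrays exactly as printed above, tries every possible cluster-to-class relabeling and reports the most accurate one. The helper names `accuracy` and `best_relabeling` are illustrative assumptions, and the exhaustive search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations

# y_test and predict_values, copied from the printed output above.
y_true = [2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0,
          2, 1, 0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0,
          1, 1, 1, 2, 0, 2, 0, 0, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2]
y_pred = [2, 2, 1, 0, 1, 0, 1, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 1, 1,
          2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 0, 2, 1, 2, 0, 2, 1,
          2, 2, 2, 0, 1, 0, 1, 1, 2, 0, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0]

def accuracy(a, b):
    """Fraction of positions where the two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_relabeling(y_true, y_pred, n_clusters=3):
    """Try every cluster-id -> class-id permutation and keep the most accurate."""
    best_map, best_acc = None, -1.0
    for perm in permutations(range(n_clusters)):
        remapped = [perm[c] for c in y_pred]
        acc = accuracy(y_true, remapped)
        if acc > best_acc:
            best_map, best_acc = perm, acc
    return best_map, best_acc

raw = accuracy(y_true, y_pred)
mapping, remapped_acc = best_relabeling(y_true, y_pred)
print(raw)           # the raw score, about 0.15
print(mapping)       # (2, 0, 1): cluster 0 -> class 2, cluster 1 -> class 0, cluster 2 -> class 1
print(remapped_acc)  # 50 of 60 test samples agree after relabeling, about 0.83
```

After relabeling, the clusters agree with the true classes on 50 of the 60 test samples, which is a fairer basis for comparing k-means with k-NN than the raw score.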

In [69]:
print(k_means.labels_[::10])

[1 2 2 2 0 0 0 2 1]


In [52]:
print (y_iris[::10])

[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]


In [53]:
km = KMeans(n_clusters=3, max_iter=1000)
# fit() ignores class labels, so only the features are passed
km.fit(X_train)
y_pred = km.predict(X_test)
y_test

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0,
       0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1, 1, 1, 2, 0, 2, 0, 0])

In [25]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,y_pred))


[[ 0 13  0]
 [ 0  0 16]
 [ 5  0  4]]


In [26]:
km.score(X_test,y_test)

-19.947347409930195

In [27]:
#predictions['KMeans predicted classes'] = km.labels_
#predictions

In [28]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [29]:
print (iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [30]:
#from sklearn.cluster import KMeans

In [31]:
import math

In [32]:
import sklearn.metrics as sm

In [33]:
from sklearn.metrics import pairwise_distances
# X was never defined in this session (hence the earlier NameError);
# use the Iris feature matrix directly
pairwise_distances(iris.data, metric='euclidean')

In [None]:
km.cluster_centers_

In [None]:
km.labels_

In [None]:
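# km was fitted on the training split only, so km.labels_ has fewer than 150 entries;
# predict a cluster for every row of the full data set instead
ir['KMeans predicted classes'] = km.predict(iris.data)
ir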

In [None]:
sm.confusion_matrix(ir['CLASS'],ir['KMeans predicted classes'])

# knn for iris

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
# sklearn.cross_validation has been removed; train_test_split is in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
ir2=pd.DataFrame(iris.data)
ir2.columns=iris.feature_names
ir2['CLASS']=iris.target
ir2.head(150)

In [72]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# 3 nearest neighbors, Manhattan (L1) distance, closer neighbors weighted more heavily.
# p is omitted because it only applies to the Minkowski metric.
knn = KNeighborsClassifier(n_neighbors=3, metric='manhattan', weights='distance')

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))

[[16  0  0]
 [ 0 22  1]
 [ 0  2 19]]
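
The confusion matrix above can be summarized with a few numbers. The sketch below copies the printed matrix into a plain Python list and derives the overall accuracy together with per-class recall and precision; the variable names are illustrative. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[16, 0, 0],
      [0, 22, 1],
      [0, 2, 19]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: correct predictions divided by the number of true samples of the class (row total).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: correct predictions divided by the number of predictions of the class (column total).
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(accuracy)  # 57 correct out of 60 -> 0.95
print(recall)
print(precision)
```

Only three of the 60 test samples are misclassified, all of them confusions between classes 1 and 2.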
