We are given a data set of items, each described by values for a set of numeric features, so each item can be treated as a vector. The task is to categorize those items into groups.

The k-means algorithm categorizes the items into k groups of similar items. To measure similarity, we use the Euclidean distance.

The algorithm works as follows:

1. First, we initialize k points, called means, randomly.
2. We assign each item to its closest mean and update that mean’s coordinates, which are the averages of the items assigned to it so far.
3. We repeat the process for a given number of iterations; at the end, we have our clusters.

The “points” mentioned above are called means because each one holds the mean values of the items categorized in its cluster. There are many ways to initialize these means. An intuitive method is to place them at random items in the data set. Another method is to place them at random values within the boundaries of the data set (if, for a feature x, the items have values in [0,3], we initialize the means with values for x in [0,3]).
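
The first option, starting the means at random items from the data set, is not implemented in this notebook. A minimal sketch of it is shown here; the helper name `initialize_means_from_items` and the toy `items` list are assumptions made for illustration only.

```python
from random import sample

def initialize_means_from_items(items, k):
    """Pick k distinct items and copy them to use as the initial means."""
    # sample() draws without replacement, so no item is picked twice.
    # Each item is copied so later mean updates do not modify the data.
    return [list(item) for item in sample(items, k)]

# Toy data, assumed for illustration only
items = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
means = initialize_means_from_items(items, 2)
print(means)
```

Copying each chosen item matters because the means are later updated in place; without the copy, the update would overwrite rows of the data set.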

In [3]:
import math #For pow and sqrt
import sys
from random import shuffle, uniform

In [4]:
def ReadData(filename):
    # Read the file, splitting by lines
    with open(filename, 'r') as f:
        lines = f.read().splitlines()

    items = []
    # Start at index 1 to skip the first line (presumably a header row)
    for i in range(1, len(lines)):
        line = lines[i].split(',')
        itemFeatures = []

        # Skip the last column (presumably the class label)
        for j in range(len(line)-1):
            v = float(line[j])  # Convert feature value to float
            itemFeatures.append(v)  # Add feature value to the list

        items.append(itemFeatures)

    # Shuffle so the items are visited in random order
    shuffle(items)

    return items

In [5]:
data = ReadData('data.txt')

In [6]:
data

[[6.0, 3.0, 4.8, 1.8],
 [5.0, 3.4, 1.6, 0.4],
 [5.4, 3.4, 1.5, 0.4],
 [5.8, 2.8, 5.1, 2.4],
 [5.0, 3.3, 1.4, 0.2],
 [6.5, 3.0, 5.8, 2.2],
 [5.7, 2.6, 3.5, 1.0],
 [4.9, 3.1, 1.5, 0.1],
 [5.7, 2.8, 4.1, 1.3],
 [6.5, 3.2, 5.1, 2.0],
 [4.6, 3.2, 1.4, 0.2],
 [4.4, 3.0, 1.3, 0.2],
 [7.4, 2.8, 6.1, 1.9],
 [5.8, 2.7, 3.9, 1.2],
 [5.1, 3.5, 1.4, 0.3],
 [6.0, 3.4, 4.5, 1.6],
 [5.0, 2.3, 3.3, 1.0],
 [6.3, 2.8, 5.1, 1.5],
 [5.7, 2.9, 4.2, 1.3],
 [5.6, 3.0, 4.5, 1.5],
 [5.1, 3.3, 1.7, 0.5],
 [6.8, 2.8, 4.8, 1.4],
 [7.7, 2.6, 6.9, 2.3],
 [6.8, 3.2, 5.9, 2.3],
 [7.1, 3.0, 5.9, 2.1],
 [5.9, 3.0, 5.1, 1.8],
 [4.4, 3.2, 1.3, 0.2],
 [6.0, 2.7, 5.1, 1.6],
 [5.9, 3.0, 4.2, 1.5],
 [6.3, 2.7, 4.9, 1.8],
 [4.9, 3.1, 1.5, 0.1],
 [6.5, 3.0, 5.5, 1.8],
 [6.3, 2.9, 5.6, 1.8],
 [5.8, 2.7, 5.1, 1.9],
 [5.4, 3.4, 1.7, 0.2],
 [6.3, 3.3, 6.0, 2.5],
 [6.7, 3.1, 4.4, 1.4],
 [6.6, 2.9, 4.6, 1.3],
 [5.2, 2.7, 3.9, 1.4],
 [6.3, 2.5, 4.9, 1.5],
 [6.0, 2.2, 4.0, 1.0],
 [7.9, 3.8, 6.4, 2.0],
 [6.4, 3.2, 4.5, 1.5],
 [6.6, 3.0,

We want to initialize each mean’s values in the range of the feature values of the items. For that, we need to find the min and max for each feature.

In [11]:
def FindColMinMax(items):
    """Return the per-feature minima and maxima of items (a list of lists)."""
    n = len(items[0])  # number of features
    minima = [sys.maxsize for i in range(n)]
    maxima = [-sys.maxsize - 1 for i in range(n)]
    for item in items:
        for f in range(len(item)):
            if item[f] < minima[f]:
                minima[f] = item[f]
                print("updating minima", minima[f])
            if item[f] > maxima[f]:
                maxima[f] = item[f]
                print("updating maxima", maxima[f])
    return minima, maxima
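
The same per-feature minima and maxima can also be computed by transposing the rows into columns. The sketch below is an alternative formulation, not the function used in the rest of this notebook; the name `find_col_min_max` and the three sample rows are assumptions for illustration.

```python
def find_col_min_max(items):
    """Return per-feature minima and maxima for a list of equal-length rows."""
    columns = list(zip(*items))  # transpose: one tuple per feature
    minima = [min(col) for col in columns]
    maxima = [max(col) for col in columns]
    return minima, maxima

# Three sample rows, assumed for illustration
rows = [[6.0, 3.0, 4.8, 1.8],
        [5.0, 3.4, 1.6, 0.4],
        [5.4, 3.4, 1.5, 0.4]]
c_min, c_max = find_col_min_max(rows)
print(c_min, c_max)
# [5.0, 3.0, 1.5, 0.4] [6.0, 3.4, 4.8, 1.8]
```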

In [12]:
data[0]

[6.0, 3.0, 4.8, 1.8]

In [13]:
cMin, cMax = FindColMinMax(data)
print(cMin,cMax)

updating minima 6.0
updating maxima 6.0
updating minima 3.0
updating maxima 3.0
updating minima 4.8
updating maxima 4.8
updating minima 1.8
updating maxima 1.8
updating minima 5.0
updating maxima 3.4
updating minima 1.6
updating minima 0.4
updating minima 1.5
updating minima 2.8
updating maxima 5.1
updating maxima 2.4
updating minima 1.4
updating minima 0.2
updating maxima 6.5
updating maxima 5.8
updating minima 2.6
updating minima 4.9
updating minima 0.1
updating minima 4.6
updating minima 4.4
updating minima 1.3
updating maxima 7.4
updating maxima 6.1
updating maxima 3.5
updating minima 2.3
updating maxima 7.7
updating maxima 6.9
updating maxima 2.5
updating minima 2.2
updating maxima 7.9
updating maxima 3.8
updating maxima 3.9
updating maxima 4.1
updating minima 1.2
updating maxima 4.2
updating minima 2.0
updating maxima 4.4
updating minima 4.3
updating minima 1.1
updating minima 1.0
[4.3, 2.0, 1.0, 0.1] [7.9, 4.4, 6.9, 2.5]


The variables minima and maxima are lists containing the minimum and maximum value of each feature, respectively. We initialize each mean’s feature values randomly between the corresponding minimum and maximum in those two lists. Note that the code below narrows that range by 1 on each side.

In [15]:
def InitializeMeans(items, k, cMin, cMax): 
  
    # Initialize means to random numbers between 
    # the min and max of each column/feature     
    f = len(items[0]) # number of features 
    means = [[0 for i in range(f)] for j in range(k)]
    print("initial means",means)
      
    for mean in means: 
        for i in range(len(mean)): 
  
            # Set value to a random float 
            # (adding +-1 to avoid a wide placement of a mean) 
            mean[i] = uniform(cMin[i]+1, cMax[i]-1)
            print("mean[i]",mean[i])
  
    return means

In [16]:
means = [[0 for i in range(4)] for j in range(3)]
means

[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

In [17]:
means = InitializeMeans(data, 3,cMin,cMax)
print("updated means", means)

initial means [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
mean[i] 6.236353280465993
mean[i] 3.2896146272603928
mean[i] 4.30074858770792
mean[i] 1.421308960910105
mean[i] 6.088771268986619
mean[i] 3.1262916221857298
mean[i] 3.7692894623981656
mean[i] 1.1405459177580908
mean[i] 5.697993103681773
mean[i] 3.291187044404643
mean[i] 3.5281608129700874
mean[i] 1.4416517327915115
updated means [[6.236353280465993, 3.2896146272603928, 4.30074858770792, 1.421308960910105], [6.088771268986619, 3.1262916221857298, 3.7692894623981656, 1.1405459177580908], [5.697993103681773, 3.291187044404643, 3.5281608129700874, 1.4416517327915115]]


We will use the Euclidean distance as the similarity metric for our data set.

In [24]:
def EuclideanDistance(x, y): 
    S = 0 #  The sum of the squared differences of the elements 
    for i in range(len(x)): 
        S += math.pow(x[i]-y[i], 2)
  
    return math.sqrt(S) #The square root of the sum 
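
As a quick sanity check of the formula, the sketch below computes the distance between (0, 0) and (3, 4), which form a 3-4-5 right triangle. The function is a compact restatement written for this example, and its name `euclidean_distance` is an assumption; on Python 3.8 and later, `math.dist` should give the same result.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d)  # 5.0
```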

To update a mean, we need the average value of each feature over all the items in its cluster. We could re-add all the values and divide by the number of items, but there is a more elegant solution: compute the new average from the old one without re-adding all the values. If m is the mean of the previous n-1 values, then the mean after adding a new value x is (m*(n-1) + x)/n.

In [25]:
def UpdateMean(n, mean, item):
    for i in range(len(mean)):
        m = mean[i]
        # New mean of n values from the old mean of n-1 values
        m = (m*(n-1) + item[i]) / float(n)
        mean[i] = round(m, 3)

    return mean

In [26]:
UpdateMean(4,[6.608595605040486, 3.0556152772369045, 3.291870881974839, 1.3796958473151308],[7.2, 3.0, 5.8, 1.6])

[6.756, 3.042, 3.919, 1.435]
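
To see that the incremental update gives the same result as re-adding everything, the sketch below applies it item by item and compares against the plain average. It is an illustration only: the name `update_mean` and the toy `values` are assumptions, and unlike `UpdateMean` above it does not round to three decimals.

```python
def update_mean(n, mean, item):
    """Mean of n values, given the mean of the first n-1 and the n-th value."""
    return [(m * (n - 1) + x) / n for m, x in zip(mean, item)]

# Toy data, assumed for illustration
values = [[1.0, 10.0], [2.0, 20.0], [6.0, 30.0]]

# Incremental: fold the items in one at a time
running = [0.0, 0.0]
for n, item in enumerate(values, start=1):
    running = update_mean(n, running, item)

# Batch: add everything up and divide by the count
batch = [sum(col) / len(values) for col in zip(*values)]

print(running, batch)
# [3.0, 20.0] [3.0, 20.0]
```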

Now we need to write a function to classify an item to a group/cluster. For the given item, we will find its similarity to each mean, and we will classify the item to the closest one.

In [27]:
def Classify(means,item): 
  
    # Classify item to the mean with minimum distance     
    minimum = sys.maxsize
#     print(minimum)
    index = -1
  
    for i in range(len(means)): 
  
        # Find distance from item to mean 
        dis = EuclideanDistance(item, means[i])
        print(dis)
  
        if (dis < minimum): 
            minimum = dis
            index = i
      
    return index

In [28]:
Classify(means,data[0])

0.729656344639607
1.2333182115499546
1.3863572622268456


0

In [38]:
data[0]

[6.7, 3.0, 5.0, 1.7]

To actually find the means, we loop through all the items, classify each one to its nearest cluster, and update that cluster’s mean. We repeat the process for a fixed maximum number of iterations. If no item changes classification between two iterations, we stop early because the algorithm has converged. Note that the result is a local optimum that depends on the random initialization, not necessarily the best possible clustering.

The function below takes as input k (the number of desired clusters), the items, and the maximum number of iterations, and returns the means. The cluster each item is assigned to is stored in the array belongsTo, and the number of items folded into each cluster’s mean so far is stored in clusterSizes.

In [32]:
def CalculateMeans(k,items,maxIterations=100): 
  
    # Find the minima and maxima for columns 
    cMin, cMax = FindColMinMax(items)
    print("cMin",cMin)
    print("cMax",cMax)
      
    # Initialize means at random points 
    means = InitializeMeans(items,k,cMin,cMax) 
    print("means",means)
      
    # Initialize clusterSizes, the array holding the number
    # of items folded into each cluster's mean so far
    clusterSizes = [0 for i in range(len(means))]
    print("clusterSizes",clusterSizes)
  
    # An array to hold the cluster an item is in
    # (-1 means "not assigned yet", so the first pass always counts as a change)
    belongsTo = [-1 for i in range(len(items))]
    print("belongsTo",belongsTo)
  
    # Calculate means 
    for e in range(maxIterations): 
  
        # If no change of cluster occurs, halt 
        noChange = True
        for i in range(len(items)): 
  
            item = items[i]
  
            # Classify item into a cluster and update the 
            # corresponding means.         
            index = Classify(means,item)
  
            clusterSizes[index] += 1
            cSize = clusterSizes[index]
            means[index] = UpdateMean(cSize,means[index],item)
  
            # Item changed cluster
            if index != belongsTo[i]:
                noChange = False

            belongsTo[i] = index

        # Nothing changed, stop iterating
        if noChange:
            break
  
    return means
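
The function above updates a mean immediately after each item is classified. For comparison, here is a minimal sketch of the batch variant (often called Lloyd's algorithm), in which every item is assigned first and each mean is then recomputed from its members. This is a swapped-in technique, not this notebook's method; the name `batch_kmeans`, its arguments, and the toy `points` are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

def batch_kmeans(items, means, max_iterations=100):
    """Batch k-means: assign all items, then recompute every mean."""
    means = [list(m) for m in means]   # work on a copy of the initial means
    belongs_to = [-1] * len(items)     # -1 = not assigned yet

    for _ in range(max_iterations):
        changed = False

        # Assignment step: attach each item to its nearest mean.
        for i, item in enumerate(items):
            index = min(range(len(means)),
                        key=lambda c: math.dist(item, means[c]))
            if index != belongs_to[i]:
                belongs_to[i] = index
                changed = True

        # No item moved, so the means would not move either.
        if not changed:
            break

        # Update step: each mean becomes the average of its members.
        for c in range(len(means)):
            members = [items[i] for i in range(len(items))
                       if belongs_to[i] == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]

    return means, belongs_to

# Toy data and starting means, assumed for illustration
points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [9.0, 8.0]]
final_means, labels = batch_kmeans(points, means=[[0.0, 0.0], [10.0, 10.0]])
print(final_means, labels)
# [[1.0, 1.5], [9.0, 8.5]] [0, 0, 1, 1]
```

In the batch form the stopping rule is easy to reason about: if no assignment changed, recomputing the means would give the same values, so further passes cannot change anything.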

In [33]:
def FindClusters(means,items): 
    clusters = [[] for i in range(len(means))] # Init clusters 
#     print(clusters)
      
    for item in items: 
  
        # Classify item into a cluster 
        index = Classify(means,item)
#         print(index)
  
        # Add item to cluster 
        clusters[index].append(item)
  
    return clusters

In [34]:
means = CalculateMeans(3, data)
print(means)

updating minima 6.0
updating maxima 6.0
updating minima 3.0
updating maxima 3.0
updating minima 4.8
updating maxima 4.8
updating minima 1.8
updating maxima 1.8
updating minima 5.0
updating maxima 3.4
updating minima 1.6
updating minima 0.4
updating minima 1.5
updating minima 2.8
updating maxima 5.1
updating maxima 2.4
updating minima 1.4
updating minima 0.2
updating maxima 6.5
updating maxima 5.8
updating minima 2.6
updating minima 4.9
updating minima 0.1
updating minima 4.6
updating minima 4.4
updating minima 1.3
updating maxima 7.4
updating maxima 6.1
updating maxima 3.5
updating minima 2.3
updating maxima 7.7
updating maxima 6.9
updating maxima 2.5
updating minima 2.2
updating maxima 7.9
updating maxima 3.8
updating maxima 3.9
updating maxima 4.1
updating minima 1.2
updating maxima 4.2
updating minima 2.0
updating maxima 4.4
updating minima 4.3
updating minima 1.1
updating minima 1.0
cMin [4.3, 2.0, 1.0, 0.1]
cMax [7.9, 4.4, 6.9, 2.5]
initial means [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0

In [36]:
clusters = FindClusters(means, data)
print(clusters)

1.0775778394157889
0.7413872132698266
3.830796131354421
4.629597606704064
3.046777806142089
0.19748164471666732
4.584165572926004
3.054808504636584
0.4214249636649452
1.1372660198915643
1.3097537936574182
4.338847658076968
4.865693578514784
3.271613516294369
0.14628397041371313
0.38311094998707607
1.8730870241395616
4.998679725687574
2.552287993154377
0.8878372598624145
2.4188011493299735
4.848337240745532
3.220039596029838
0.37255737813120804
1.9371561630390048
0.24424373072814007
2.98368882425765
0.5500672686135762
1.3170630205119263
4.305972480172162
5.017207789199088
3.378883691398685
0.4679732898360762
5.1843393021676345
3.519070189695
0.7555124088987549
0.8909399530832595
2.4633422417520467
5.505687877095831
2.0950355605573856
0.42432888188290974
2.8234728615660534
4.810381897521236
3.254482293698954
0.14628397041371263
1.4041987038877366
0.7601677446458772
3.4658330888835374
3.1074063139538093
1.4034439782192951
2.2745546816904625
0.8377195234683267
0.9568986362201581
4.10204814

In [40]:
len(clusters[0])

45

In [41]:
len(clusters[1])

55

In [42]:
len(clusters[2])

49

In [39]:
len(data)

149

In [43]:
from sklearn.cluster import KMeans

In [44]:
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)

In [45]:
kmeans.labels_

array([1, 0, 0, 1, 0, 2, 1, 0, 1, 2, 0, 0, 2, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       2, 2, 2, 1, 0, 1, 1, 1, 0, 2, 2, 1, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 2, 1, 1, 0, 0, 2, 0, 1, 2, 0, 0, 0, 2, 2,
       2, 2, 1, 1, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0, 2, 2, 1, 1, 2, 1, 0,
       0, 1, 2, 2, 1, 1, 1, 2, 0, 1, 2, 2, 1, 2, 2, 0, 1, 2, 0, 0, 0, 0,
       0, 1, 1, 2, 0, 0, 1, 2, 2, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 1, 1, 2,
       2, 0, 2, 1, 0, 0, 0, 1, 2, 2, 1, 1, 1, 1, 1, 0, 0], dtype=int32)

In [46]:
kmeans.predict([[0,0,0,0], [12,3,5,7]])

array([0, 2], dtype=int32)