### MY470 Computer Programming

### Problem Set 2

#### \*\*\* Example Answers \*\*\*

---
### Writing your own k-means clustering algorithm

K-means clustering is a simple unsupervised machine-learning method for cluster analysis. The aim of the method is to partition a set of points into k clusters, such that each point is assigned to the cluster with the nearest centroid (the mean of the points in that cluster). The algorithm iterates through two steps:

1. Assign each data point to the cluster with the nearest centroid
2. Update the centroids of the clusters given the new assignment

The algorithm converges when the assignments no longer change. Since the initial assignment to clusters is largely random, there is no guarantee that the optimal assignment is found, so it is common to run the algorithm multiple times with different starting conditions.
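
As a rough illustration of the convergence criterion, the two steps can be sketched with a loop that stops as soon as no assignment changes. This is not the simplified version requested in this problem set: the names `nearest`, `kmeans_until_converged`, and `max_iter` are hypothetical, and keeping the old centroid for an empty cluster is an assumption made only for this sketch.

```python
import math
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
             for c in centroids]
    return dists.index(min(dists))


def kmeans_until_converged(points, k, max_iter=1000):
    """Hypothetical variant that stops once assignments no longer change."""
    centroids = random.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the cluster with the nearest centroid
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Step 2: update each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Two well-separated groups of two points each
toy = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans_until_converged(toy, 2)
print(labels, centroids)
```

On this toy input the two groups end up in separate clusters whichever points are drawn as initial centroids, although the cluster numbering can differ between runs.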

In this problem set, we will implement a much simplified version of the k-means clustering algorithm. Rather than running the algorithm until convergence, we will repeat the above two steps a large but fixed number of times. In addition, we will initialize only once, using a naive method according to which we randomly choose k points from the data to use as initial cluster centroids. 

(In real life, you will of course use a library to implement such an algorithm. In Python, you can do this using [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).)

For the problem set, we will additionally use data from the file `Wholesale customers data.csv`, which you can find in the `data` repository. The file contains information on the annual spending on diverse product categories for the clients of a wholesale distributor. The data are obtained from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php) and you can find more information about them [here](http://archive.ics.uci.edu/ml/datasets/Wholesale+customers#).

#### Hints

Use docstrings to describe your functions. We will subtract points from your mark if you do not describe your code appropriately.

There are many different implementations of the k-means algorithm you can find online. However, this problem set expects you to follow the instructions and algorithms below precisely.  

In [7]:
# We will first import the modules we need
# You are expected to solve the problem set with these modules only
# Do not import and use any other ones 

# You will need the math module to compute the square root.
# To get the square root of num, use math.sqrt(num)
import math
import csv
import random

### Problem 1: Function to estimate Euclidean distance between two points

Write a function called `get_distance` that calculates the Euclidean distance between two n-dimensional points. The function should take two lists as arguments, where each list contains the n coordinates of each of the two points. 

Test your function for the points [0, 3, 0] and [4, 0, 0].

#### Hints

You can read about the definition of Euclidean distance on [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance).
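
For reference, the standard definition for two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ is shown below, followed by the calculation for the two test points above.

$$
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

$$
d\big([0, 3, 0], [4, 0, 0]\big) = \sqrt{(0-4)^2 + (3-0)^2 + (0-0)^2} = \sqrt{25} = 5
$$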


In [8]:
def get_distance(x, y):
    """Estimates the Euclidean distance between two n-dimensional points.
    Assumes x and y are lists of numerical values (the point coordinates).
    Returns float (the Euclidean distance between x and y).
    """
    
    sqrs = [(x[i] - y[i])**2 for i in range(len(x))]
    return math.sqrt(sum(sqrs))

print(get_distance([0, 3, 0], [4, 0, 0]))


5.0


### Problem 2: Function to estimate the centroid of a collection of points

Write a function called `get_centroid` that estimates the centroid of a collection of n-dimensional points. The function should take one list as an argument, which contains each of the points entered as a list of n coordinates. The function should return a list with the coordinates of the virtual center point.

Test your function for the points in `test_lst` entered below.

#### Hints

The coordinate of the centroid in each dimension is the mean of the coordinates of all the points in that dimension.
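
A small worked example may make the hint concrete. The points in `pts` are hypothetical and chosen only for illustration. The use of `zip` to group coordinates by dimension is one possible approach, not a required one.

```python
# Hypothetical sample points, chosen only for illustration
pts = [[0, 0], [2, 4], [4, 2]]

# zip(*pts) groups the coordinates by dimension: (0, 2, 4) and (0, 4, 2)
by_dim = list(zip(*pts))

# The centroid coordinate in each dimension is the mean of that group
center = [sum(dim) / len(dim) for dim in by_dim]
print(center)  # [2.0, 2.0]
```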


In [9]:
test_lst = [[0,0,0], [0,0,1], [0,1,0], [1,0,0], 
            [0,1,1], [1,0,1], [1,1,0], [1,1,1]]

def get_centroid(points):
    """Estimates the centroid for a collection of n-dimensional points.
    Assumes points is a collection of lists of numerical values.
    Returns a list of numerical values (the coordinates of the centroid).
    """
    
    centroid = []
    num_points = len(points)
    num_dims = len(points[0])
    for dim in range(num_dims):
        coord = [i[dim] for i in points]
        centroid.append(sum(coord)/num_points)
        
    return centroid

print(get_centroid(test_lst))


[0.5, 0.5, 0.5]


---
### Problem 3: Function to read data

Write a function called `get_data` that opens the file `../data/Wholesale customers data.csv` and returns all the data in a list. 

Use the csv module to read the file. You can read how to do this [here](https://docs.python.org/3/library/csv.html). Make sure you do not include the column names in the data. 

Each element in the list you return should be a list of each customer's annual spending on fresh products, milk products, grocery products, frozen products, detergents and paper products, and delicatessen products. In other words, your list should contain 440 elements (customers), each of which contains six numeric elements (amounts spent on products). The function does not need to take any arguments.

Test your function by saving the data it returns in a variable called `data`. Then print the first two elements of `data`.
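
The sketch below shows one way to skip a header row with the csv module. It reads from an in-memory string rather than the real file. The sample rows are made up, and the column names are an assumption about the file layout (two categorical columns followed by six spending columns).

```python
import csv
import io

# Hypothetical two-row sample; the header names are assumed, the values invented
sample = io.StringIO(
    "Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen\n"
    "2,3,100,200,300,400,500,600\n"
    "1,3,10,20,30,40,50,60\n"
)

reader = csv.reader(sample)
header = next(reader)  # consume the column names so they are not parsed as data

# Keep only the six spending columns and convert them from strings to ints
rows = [[int(value) for value in row[2:]] for row in reader]
print(rows)
```

Calling `next(reader)` once before the loop avoids comparing every row against a header label.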

In [10]:
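def get_data():
    """Reads the file Wholesale customers data.csv and returns 
    the six spending columns for each customer as a list of lists of ints.
    """
    
    # newline='' is the setting the csv module documentation recommends
    with open('../data/Wholesale customers data.csv', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the row with the column names
        data = [[int(i) for i in row[2:]] for row in reader]
    return data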

data = get_data()
print(data[:2])

[[12669, 9656, 7561, 214, 2674, 1338], [7057, 9810, 9568, 1762, 3293, 1776]]


---
### Problem 4: Function to implement k-means algorithm

Write a function called `kmeans` that clusters a collection of points into k clusters using a simplified version of the k-means algorithm. The function should take two arguments: 

1. `points` – a list of n-dimensional points, and
2. `k` – an integer that defines the number of desired clusters. 

The function should return two things: 

1. A clustering – a list of `k` clusters, each of which is a list of points (each of which is a list of coordinates)
2. A list of the centroids for each of the `k` clusters. Each centroid is essentially a point, so it should be presented as a list of coordinates.

Write your code around the detailed comments and the helping code below.

Test your function on the data from Problem 3 for k = 3. For each of the three clusters, print the number of customers assigned to it and the coordinates of its centroid.


In [11]:
#random.seed(1) # Set the seed to replicate exactly, see below

def kmeans(points, k):
    """Clusters data using a naive implementation of the k-means 
    clustering algorithm. Assumes points is a list of lists 
    of numerical values (point coordinates) and k is 
    an integer > 0 specifying the number of clusters to be used.
    Returns the k-means clustering after 100 iterations 
    and a single initialization as a list of k lists (clusters) 
    of points and a list of k lists of numerical values 
    (the coordinates of the cluster centroids).
    """
    
    # Select k random points to use as initial centroids
    init = random.sample(points, k)

    # Create a list of k lists to contain the points assigned to each cluster.  
    clusters = [[] for i in init]
    
    # Create a list to keep the centroids of the k clusters. 
    # For now, this list will contain the points from init.
    centroids = [i for i in init]
    
    # Repeat the clustering for 100 iterations.
    # The idea is that each new repetition refines the clustering 
    # because it starts from the centroids of the previous clustering.     
    for _ in range(100):
        # Create a list of lists for the new clustering
        new_clustering = [[] for i in range(k)]
        
        # Assign each point to the cluster with the closest centroid.
        for p in points:
            # Start by setting the closest cluster to be the first one
            min_dist = get_distance(p, centroids[0])
            closest_clust = 0
            # Now find the actual closest cluster
            for i in range(1, k):
                dist = get_distance(p, centroids[i])
                if dist < min_dist:
                    min_dist = dist
                    closest_clust = i                    
            # Add the point to the closest cluster
            new_clustering[closest_clust].append(p)
            
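        # Now update the clusters and the centroids.
        # An empty cluster keeps its previous centroid, because
        # get_centroid cannot handle an empty list.
        clusters = new_clustering
        centroids = [get_centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]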
    
    return clusters, centroids
    
        
clusters, centroids = kmeans(data, 3)
for i in range(3):
    print('***Cluster ' + str(i+1) + '***')
    print('Number of customers:', len(clusters[i]))
    print('Centroid:', centroids[i])
    print()
    
# Note that your answers are likely to be different due to the random
# initialization of the algorithm. In fact, the answers are likely to be
# different every time you run this code.
# For testing purposes, if you want to replicate specific results, you
# should fix the random seed of the pseudo-random number generator.
# For this, see the commented line at the beginning of the cell.


***Cluster 1***
Number of customers: 50
Centroid: [8723.78, 19220.54, 27604.86, 2724.7, 12277.34, 3195.36]

***Cluster 2***
Number of customers: 73
Centroid: [33111.69863013698, 4918.465753424657, 5847.54794520548, 5554.027397260274, 1097.876712328767, 2097.123287671233]

***Cluster 3***
Number of customers: 317
Centroid: [7655.482649842272, 3881.0157728706627, 5335.79810725552, 2555.11356466877, 1810.236593059937, 1129.6056782334385]
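
To illustrate the note on seeding in the cell above, the sketch below shows that resetting the seed makes `random.sample` return the same draw. The list `pool` is a hypothetical stand-in for the data points.

```python
import random

pool = list(range(100))  # hypothetical stand-in for the data points

random.seed(1)
first = random.sample(pool, 3)

random.seed(1)  # resetting the seed restarts the same pseudo-random sequence
second = random.sample(pool, 3)

print(first == second)  # True
```

For the same reason, the seed has to be set before `kmeans` is called, so that the initial centroids are drawn from the same starting state each time.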



---

### Evaluation

| Problem | Mark     | Comment   
|:-------:|:--------:|:----------------------
| 1       |   /2    |              
| 2       |   /2    | 
| 3       |   /2    | 
| 4       |   /6    | 
| Legibility      |   /2    | 
| Modularity      |   /2    | 
| Efficiency      |   /4    | 
|**Total**|**/20**  | 
