### MY470 Computer Programming

### Problem Set 2, AT 2023

#### \*\*\* Due 12:00 noon on Monday, October 23 \*\*\*

---
### Writing your own k-means clustering algorithm

K-means clustering is a simple unsupervised machine-learning method for cluster analysis. The aim of the method is to partition a set of points into k clusters, such that each point is assigned to the cluster with the nearest centroid. The algorithm iterates through two steps:

1. Assign each data point to the cluster with the nearest centroid
2. Update the centroids of the clusters given the new assignment

The algorithm converges when the assignments no longer change. Since the initial assignment to clusters is largely random, there is no guarantee that the optimal assignment is found, so it is common to run the algorithm multiple times with different starting conditions.
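
The stopping rule described above can be sketched as follows. This is a minimal illustration, not the implementation this problem set asks for: the function name `kmeans_until_stable`, the explicit `initial_centroids` argument (used so that the example is deterministic), and the `max_iter` safety cap are all assumptions introduced here, and `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans_until_stable(points, initial_centroids, max_iter=1000):
    """Run the two k-means steps until the assignments stop changing.

    Hypothetical helper for illustration only.
    """
    centroids = [list(c) for c in initial_centroids]
    assignment = None

    for _ in range(max_iter):
        # Step 1: index of the nearest centroid for every point
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Converged: no point changed cluster since the previous pass
        if new_assignment == assignment:
            break
        assignment = new_assignment

        # Step 2: recompute each centroid as the mean of its members
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]

    return assignment, centroids


toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans_until_stable(toy_points, [[0, 0], [10, 10]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```

On this toy input the second pass produces the same assignment as the first, so the loop stops after two passes rather than running to the cap.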

In this problem set, we will implement a much simplified version of the k-means clustering algorithm. Rather than running the algorithm until convergence, we will repeat the above two steps a large but fixed number of times. In addition, we will initialize only once, using a naive method according to which we randomly choose k points from the data to use as initial cluster centroids. 

(In real life, you will of course use a library to implement such an algorithm. In Python, you can do this using [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).)

For the problem set, we will additionally use data from the file `Wholesale customers data.csv`, which you can find in the `data` repository. The file contains information on the annual spending on diverse product categories for the clients of a wholesale distributor. The data are obtained from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php) and you can find more information about them [here](http://archive.ics.uci.edu/ml/datasets/Wholesale+customers#).

#### Hints

Use docstrings to describe your functions. We will subtract points from your mark if you do not describe your code appropriately.

There are many different implementations of the k-means algorithm you can find online. However, this problem set expects you to follow the instructions and algorithms below precisely.  

In [1]:
# We will first import the modules we need
# You are expected to solve the problem set with these modules only
# Do not import and use any other ones 

# You will need the math module to estimate the square root.
# To get the square root of num, use math.sqrt(num)
import math
import csv
import random 

### Problem 1: Function to estimate Euclidean distance between two points

Write a function called `get_distance` that calculates the Euclidean distance between two n-dimensional points. The function should take two lists as arguments, where each list contains the n coordinates of one of the two points. 

Test your function for the points [0, 3, 0] and [4, 0, 0].

#### Hints

You can read about the definition of Euclidean distance on [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance).
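
For reference, the standard definition can be written as below. The symbols $p$, $q$, and $d$ are notation chosen here for illustration, with $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ standing for the two points.

$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

For the test points $[0, 3, 0]$ and $[4, 0, 0]$ this gives $d = \sqrt{(0-4)^2 + (3-0)^2 + (0-0)^2} = \sqrt{25} = 5$.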


In [2]:
# Enter your answer to Problem 1 below. 
def get_distance(point_1,point_2):
    """Calculates the Euclidean distance between two n-dimensional points.

    Parameters:
        point_1 (list): n coordinates of first point. 
        point_2 (list): n coordinates of second point.

    Returns:
        dist (float): Euclidean distance between point_1 and point_2.

    This function returns the Euclidean distance from the 
    Cartesian coordinates of the points using Pythagoras' theorem. 
    
    """
    # Calculate Euclidean distance by summing the squared differences between 
    # corresponding coordinates of point_1 and point_2 in each dimension.
    dist = math.sqrt(sum((point_1[i]-point_2[i])**2 for i in range(len(point_1)))) 
    return dist

get_distance([0,3,0],[4,0,0])

5.0
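
One possible way to cross-check the result is sketched below. The helper name `distance_zip` is hypothetical; it pairs the coordinates with `zip` instead of indexing. The comparison uses `math.dist`, which is available in the standard library from Python 3.8 onwards and is shown only as a check, since the problem set asks you to write the calculation yourself.

```python
import math


def distance_zip(point_1, point_2):
    """Hypothetical variant that pairs coordinates with zip instead of indexing."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point_1, point_2)))


p, q = [0, 3, 0], [4, 0, 0]
print(distance_zip(p, q))                                  # 5.0
print(math.isclose(distance_zip(p, q), math.dist(p, q)))   # True
```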

### Problem 2: Function to estimate the centroid of a collection of points

Write a function called `get_centroid` that estimates the centroid of a collection of n-dimensional points. The function should take one list as an argument, which contains each of the points entered as a list of n coordinates. The function should return a list with the coordinates of the virtual center point.

Test your function for the points in `test_lst` entered below.

#### Hints

The coordinate of the centroid in each dimension is the mean of the coordinates of all the points in that dimension.


In [3]:
test_lst = [[0,0,0], [0,0,1], [0,1,0], [1,0,0], 
            [0,1,1], [1,0,1], [1,1,0], [1,1,1]]

# Enter your answer to Problem 2 below. 
def get_centroid(points):
    """Calculates centroid of a collection of n-dimensional points.

    Parameters:
        points (list): Each of the points entered as a list of n coordinates.

    Returns:
        centroid (list): Coordinates of the virtual center point.

    The function calculates the centroid by summing up coordinates of 
    all n-dimensional points in list and dividing each sum by the 
    total number of points in the list. 
    """
    # Determine dimensions of points
    n = len(points[0])

    # Initialise centroid as a list containing n elements 
    centroid = [0] * n

    # Iterate over each point and add its coordinates to the running sum per dimension
    for point in points:
        for i in range(n):
            centroid[i] += point[i]
    
    # Divide each sum by the number of points to get the mean in each dimension
    centroid = [coord/len(points) for coord in centroid]

    return centroid

get_centroid(test_lst)

[0.5, 0.5, 0.5]
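
The hint that each centroid coordinate is the mean over one dimension can also be expressed by looping over dimensions first. The sketch below is one possible variant under that reading; the name `centroid_by_dimension` is hypothetical and it assumes a non-empty list of points that all have the same length.

```python
def centroid_by_dimension(points):
    """Hypothetical variant: average each dimension in a single pass."""
    # zip(*points) regroups the coordinates so that each tuple holds
    # one dimension across all points
    return [sum(dim) / len(points) for dim in zip(*points)]


cube = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0],
        [0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
print(centroid_by_dimension(cube))  # [0.5, 0.5, 0.5]
```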

---
### Problem 3: Function to read data

Write a function called `get_data` that opens the file `../data/Wholesale customers data.csv` and returns all the data in a list. 

Use the csv module to read the file. You can read how to do this [here](https://docs.python.org/3/library/csv.html). Make sure you do not include the column names in the data. 

Each element in the list you return should be a list of each customer's annual spending on fresh products, milk products, grocery products, frozen products, detergents and paper products, and delicatessen products. In other words, your list should contain 440 elements (customers), each of which contains six numeric elements (amounts spent on products). The function does not need to take any arguments.

Test your function by saving the data it returns in a variable called `data`. Then print the first two elements of `data`.

In [4]:
# Enter your answer to Problem 3 here. 

def get_data():
    """
    This function reads the Wholesale customers file using the csv module
    and returns a nested list where the inner list contains each customer's 
    annual spending on six different products.
    """
    data = []
    with open('../data/Wholesale customers data.csv', newline='') as f:
        customer_data = csv.reader(f, delimiter=',')
        # Skip the header row so that column names are not included
        next(customer_data)
        # Iterate over each remaining row, convert its values to integers,
        # and append the result to the data list
        for row in customer_data:
            row = row[2:]
            row = [int(i) for i in row]
            data.append(row)
    
    return data

data = get_data()
data[:2]

[[12669, 9656, 7561, 214, 2674, 1338], [7057, 9810, 9568, 1762, 3293, 1776]]
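
The header-skipping and conversion steps can be tried without the file by reading from an in-memory string. This is a sketch under stated assumptions: the header names and the values in the two leading columns of `sample` are illustrative guesses, and only the six spending values per row are taken from the output above.

```python
import csv
import io

# Assumed sample: the header names and the two leading columns are
# illustrative; only the six spending values come from the output above.
sample = io.StringIO(
    "Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen\n"
    "2,3,12669,9656,7561,214,2674,1338\n"
    "2,3,7057,9810,9568,1762,3293,1776\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row through the reader itself
# Drop the two leading columns and convert the rest to integers in one step
rows = [[int(value) for value in row[2:]] for row in reader]
print(rows)
```

Calling `next` on the reader rather than on the file object lets the csv module decide where the header record ends.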

---
### Problem 4: Function to implement k-means algorithm

Write a function called `kmeans` that clusters a collection of points into k clusters using a simplified version of the k-means algorithm. The function should take two arguments: 

1. `points` – a list of n-dimensional points, and
2. `k` – an integer that defines the number of desired clusters. 

The function should return two things: 

1. A clustering – a list of `k` clusters, each of which is a list of points (each of which is a list of coordinates)
2. A list of the centroids for each of the `k` clusters. Each centroid is essentially a point, so it should be presented as a list of coordinates.

Write your code around the detailed comments and the helping code below.

Test your function on the data from Problem 3 for k = 3. For each of the three clusters, print the number of customers assigned to it and the coordinates of its centroid.


In [5]:
# Enter your answer to Problem 4 in-between the code and comments below.

def kmeans(points, k):
    """Clusters a collection of points into k clusters.

    Parameters:
        points (list): A list of n-dimensional points
        k (int): An integer that defines the number of desired clusters
    
    Returns:
        clusters (list): A list of k clusters, each of which is a list 
        of points (each of which is a list of coordinates)
        centroids (list): A list of the centroids for each of the k clusters

    This function uses a simplified version of the k-means algorithm to cluster 
    a collection of points into k clusters. It assigns each point to the cluster 
    with the closest centroid using the get_distance() function, then recomputes 
    the centroids of the new clusters with the get_centroid() function. 
    The process is repeated for 100 iterations.
    """
    
    # Select k random points to use as initial centroids
    init = random.sample(points, k) 

    # Create a list of k lists to contain the points assigned to each cluster.  
    clusters = [[] for i in init]
    
    # Create a list to keep the centroids of the k clusters. 
    # For now, this list will contain the points from init.
    centroids = [i for i in init]
    
    # You now need to assign each point to the cluster 
    # with the closest centroid. Use the get_distance function 
    # you wrote in Problem 1 for this.
    max_iterations = 100
    for _ in range(max_iterations):
        # Create a new list of k lists to store points for each cluster
        new_clusters = [[] for i in init]

        for point in points:
            # Initialise minimum distance to be between point and first centroid
            min_dist = get_distance(point, centroids[0])
            closest_centroid_index = 0

            for j, centroid in enumerate(centroids):
                dist = get_distance(point, centroid)
                if dist < min_dist:
                    min_dist = dist
                    closest_centroid_index = j

            # Append unconditionally so that duplicate data points are kept
            new_clusters[closest_centroid_index].append(point)

    # You should then update the variable "clusters" to be
    # the new clustering and update the variable "centroids"
    # to contain the centroids of the clusters in this new clustering.
    # Use the function you wrote in Problem 2 to estimate the centroids.
        clusters = new_clusters
        new_centroids = [get_centroid(cluster) for cluster in clusters]
        centroids = new_centroids
        
    return clusters, centroids

    # Repeat the process described above for 100 iterations. 
    # The idea is that each new repetition refines the clustering 
    # because it starts from the centroids of the previous clustering. 
    # If we repeat the process long enough, the assignment to 
    # clusters and the centroids will become stable.


In [6]:
# Test function on data for 3 clusters
clusters, centroids = kmeans(data, 3)

# Print number of customers assigned to each cluster
for i in range(len(clusters)):
    print(f'There are {len(clusters[i])} customers in Cluster {i+1}.')

# Print coordinates of centroid of each cluster
for i in range(len(centroids)):
    print(f'Cluster {i+1} centroid: {centroids[i]}')

There are 60 customers in Cluster 1.
There are 53 customers in Cluster 2.
There are 327 customers in Cluster 3.
Cluster 1 centroid: [35941.4, 6044.45, 6288.616666666667, 6713.966666666666, 1039.6666666666667, 3049.4666666666667]
Cluster 2 centroid: [7751.981132075472, 17910.509433962263, 27037.905660377357, 1970.9433962264152, 12104.867924528302, 2185.735849056604]
Cluster 3 centroid: [8296.0, 3787.256880733945, 5162.80122324159, 2582.1162079510705, 1724.5229357798164, 1138.0152905198777]
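
Because the initial centroids are chosen at random, separate runs can return different clusterings. One common way to compare them is the total squared distance from each point to the centroid of its cluster, where a lower value indicates tighter clusters. The helper below is a hypothetical sketch and not part of the problem set; the name `within_cluster_ss` and the toy inputs are assumptions made for illustration.

```python
def within_cluster_ss(clusters, centroids):
    """Hypothetical helper: total squared distance of points to their centroid."""
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


toy_clusters = [[[0, 0], [0, 2]], [[10, 10], [10, 14]]]
toy_centroids = [[0.0, 1.0], [10.0, 12.0]]
print(within_cluster_ss(toy_clusters, toy_centroids))  # 10.0
```

The same call would accept the `clusters` and `centroids` returned by `kmeans`, since they have the same nested-list shape as the toy inputs.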


---

### Evaluation

| Problem | Mark | Comment |
|:-------:|:--------:|:----------------------|
| 1 | 2/2 | Good |
| 2 | 2/2 | Good |
| 3 | 2/2 | Good |
| 4 | 6/6 | Good |
| Legibility | 2/2 | Align comments with code in P4. |
| Modularity | 2/2 | Good |
| Efficiency | 2/4 | P2: the division can be folded into the loop if you loop through the dimensions first and then through the points. P3: the separate slicing step (`[2:]`) loops over the list to collect the elements; it can be combined with the type conversion in the list comprehension on the next line. |
| **Total** | **18/20** | Well done! |