
# Lab: Clustering using CURE

Data Mining 2019/2020

By Jordi Smit and Gosia Migut

**WHAT** This _optional_ lab consists of several programming and insight exercises/questions. These exercises are meant to let you practice with the theory covered in: "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use Mattermost to discuss questions with your peers. For additional questions and feedback please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.

**SUMMARY**
In the following exercises you will implement the CURE algorithm. This is a clustering algorithm designed for very large data sets that don't fit into memory. In this exercise we will simulate the limited amount of memory by dividing the data into sub batches.

**Requirements**
 - Python 3.6 or higher
 - numpy
 - scipy
 - ipython
 - jupyter
 - matplotlib
 - tqdm

In [None]:
import sys
!{sys.executable} -m pip install tqdm


from tqdm import trange
import matplotlib.pyplot as plt
import numpy as np

import random

%matplotlib inline

# 0 The problem
K-means and hierarchical clustering are two very well-known clustering algorithms. Both algorithms only work if the entire data set is in main memory, which means that there is an upper limit on the amount of data they can cluster. If we want to go beyond this upper limit, we need an algorithm that doesn't need the entire data set to be in main memory.


In this exercise we look at the approach of the CURE algorithm. The idea is rather simple: we don't need the entire data set to find the clusters, since most of the data is very similar and a random sample usually shows the same cluster structure. So we take a random sample of the data set that fits into memory and cluster it. We then go through the remaining data and assign each point to the closest cluster.


The CURE algorithm has the following pseudo code:
```
data_samples = sample_m_data_point_from_the_data_set()
k_sample_cluster = cluster(data_samples, k)
cure_clusters = []
foreach cluster in k_sample_cluster:
	points = find_k_most_representative_points(cluster)
	center = find_center(points)
	foreach point in points:
		move point x% towards the center
	add cure_cluster(points) to cure_clusters 

foreach dp in unseen data:
	assign dp to cure_cluster of the closest representative point
```

If you want more explanation, see [this online video lecture](https://www.youtube.com/watch?v=JrOJspZ1CUw) from the authors of the book.
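
The "move point x% towards the center" step of the pseudo code can be written as $p' = p + \alpha (c - p)$, where $c$ is the centroid of the representative points and $\alpha$ is the fraction to move. The following is a minimal sketch of that formula on a plain NumPy array. The function name `shrink_towards_centroid` and the example points are made up for illustration and are not part of `cure_helper.py`.

```python
import numpy as np


def shrink_towards_centroid(points, fraction):
    """Move every row of `points` the given fraction of the way to their mean.

    Hypothetical helper for illustration; not part of the lab's helper module.
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # p' = p + fraction * (centroid - p)
    return points + fraction * (centroid - points)


# Three made-up representative points with centroid (2, 2).
representatives = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 6.0]])
shrunk = shrink_towards_centroid(representatives, 0.2)
print(shrunk)
```

Note that the centroid of the points does not change when they are shrunk this way; only their spread around it does.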

# 1 Setting up
Let's get started by creating the data structures for this problem. We have already created the *Cluster* class for you. This class stores its centroid and the data points that have been assigned to it, and it will be used for the traditional hierarchical clustering.
You can see a summary of its class signature and documentation using `help(Cluster)`, or you can look at its implementation by opening `cure_helper.py`.

Next let's define the CureCluster class. This class has two attributes, namely the `k_most_representative_points` and `data` (the clusters that have been assigned to it). The class is almost finished. The only thing left to do is to finish the distance function.
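
To make the idea behind the distance function concrete: the distance from a point to a CURE cluster is the distance to the nearest of its representative points. Below is a minimal sketch on plain NumPy arrays, assuming Euclidean distance. It does not use the lab's `Cluster` class, and the function name is hypothetical.

```python
import numpy as np


def distance_to_closest_representative(point, representative_points):
    """Euclidean distance from `point` to the nearest row of `representative_points`.

    Illustrative only; the lab's `distance` method works on a `Cluster` instead.
    """
    point = np.asarray(point, dtype=float)
    reps = np.asarray(representative_points, dtype=float)
    # One distance per representative point, then keep the smallest.
    return float(np.min(np.linalg.norm(reps - point, axis=1)))


reps = np.array([[0.0, 0.0], [3.0, 4.0]])
d = distance_to_closest_representative([3.0, 0.0], reps)
print(d)
```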

In [None]:
from cure_helper import Cluster

#help(Cluster)

class CureCluster:
    def __init__(self, k_most_representative_points):
        self.k_most_representative_points = k_most_representative_points
        self.data = None
        
    def distance(self, cluster):
        """
        Calculates the distance between the centroid of the cluster and the closest representative point.

        Parameters:
        cluster: Cluster: The cluster with the data points we are interested in.

        Returns:
        float: Returns the distance as a float.
        """
        min_dist = sys.float_info.max
        #Student start
        
        
        #Student end
        return min_dist
    
    def append(self, cluster):
        """
        Adds the data points of a cluster to this CURE cluster.
        Note: this method is stateful; it modifies `self.data`.

        Parameters:
        cluster: Cluster: A cluster that contains the datapoints we want to add.
        """
        if self.data is None:
            self.data = cluster
        else:
            self.data = self.data.merge(cluster)
    
    def __repr__(self):
        return f"CureCluster(\nrepresentative_points:\n{self.k_most_representative_points},\ndata: \n{self.data}\n)\n"

In the next cell we import some helper functions we have already created for you: 
 - `load_data`;
 - `plot_clusters`;
 - `plot_data`;
 - `plot_cure_clusters` ;
 - `hierarchical_clustering`;
 - `find_two_closest`;
 - `find_centroid`.

You can read their documentation using Python's `help` function, as shown below.

In [None]:
from cure_helper import load_data
from cure_helper import plot_clusters
from cure_helper import plot_data
from cure_helper import plot_cure_clusters
from cure_helper import hierarchical_clustering
from cure_helper import find_two_closest
from cure_helper import find_centroid

help(load_data)
help(plot_clusters)
help(plot_data)
help(plot_cure_clusters)
help(hierarchical_clustering)
help(find_two_closest)
help(find_centroid)

# 2 CURE
Next, let's define the `find_k_most_representative_points` function. We'll use this function to find the $k$ most representative points in a cluster that was found with the `hierarchical_clustering` function. It is your job to find the $k$ most representative points in the data of this cluster.
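
For intuition, one common heuristic for "representative" points is to pick points that are as far apart from each other as possible. The sketch below implements that greedy farthest-point selection on a plain NumPy array. It is an illustration under that assumption and not necessarily the approach the exercise scaffold expects; the function name `pick_dispersed_points` is made up.

```python
import numpy as np


def pick_dispersed_points(points, k):
    """Greedily pick up to k rows of `points` that are far apart from each other.

    Hypothetical helper for illustration; not part of the lab's helper module.
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # Start with the point that lies farthest from the centroid.
    first = int(np.argmax(np.linalg.norm(points - centroid, axis=1)))
    chosen = [first]
    # For every point: distance to its nearest already chosen point.
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(chosen) < min(k, len(points)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]


toy_cluster = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 5.0], [5.0, 1.0]])
dispersed = pick_dispersed_points(toy_cluster, 3)
print(dispersed)
```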

In [None]:
def find_k_most_representative_points(cluster, k):
    """
    Finds the k most representative points.
    
    Parameters:
    cluster: Cluster: The cluster we are interested in.
    k: int: The number of representative points.

    Returns:
    np.ndarray: A k x 2 matrix in which each row contains a representative point.
    """
    
    # Turn each data point in the cluster into its own singleton cluster.
    datapoints_as_singleton_clusters = cluster.data_points_as_cluster()
    # Student start
    # Find the k most representative points among the data points of the cluster.
    k_most_representative_points = None
    # Student end
    
    return np.array([c.centroid for c in k_most_representative_points])

In this part we combine the previously defined functions to transform a cluster, found with the `hierarchical_clustering` function, into a CURE cluster. It's your job to find the `k_most_representative_points` and to prepare them before we create a new instance of `CureCluster`.

**Hint**: Carefully look at the pseudo code of the algorithm.

In [None]:
def tranform_into_cure_cluster(cluster, representative_points, move_to_center_precentage):
    """
    Transforms a given cluster into a CURE cluster by
    - selecting k representative points from the data assigned to this cluster.
    - moving the k representative points a given percentage towards their centroid.
    
    Parameters:
    cluster: Cluster: The cluster we want to transform.
    representative_points: int: The amount of representative_points.
    move_to_center_precentage: float: How far the k points should be moved towards their centroid, as a fraction in (0, 1).

    Returns:
    CureCluster: A new CureCluster with its k representative points.
    """
    
    assert 0 < move_to_center_precentage < 1, "The value of move_to_center_precentage must be in the range (0,1)"
    if representative_points > len(cluster):
        print(f"[Warning] This cluster only has {len(cluster)} data points, but you requested {representative_points} representative points. representative_points has been changed to {len(cluster)} for this cluster.")
        representative_points = len(cluster)
    
    #Student start
    # Find the k_most_representative_points
    k_most_representative_points = None
    #Calc the centroid of the k_most_representative_points
    centroid = None
    #Move the k_most_representative_points a bit to the centroid
    k_most_representative_points = None
    #Student end
    
    return CureCluster(k_most_representative_points)

Next, let's define the `find_cure_cluster_with_min_dist` function. It's your job to find and return the `CureCluster` in the input list that is closest to the input `Cluster`.

In [None]:
def find_cure_cluster_with_min_dist(cure_clusters, cluster):
    """
    Finds the CURE cluster that is closest to the given cluster.

    Parameters:
    cure_clusters: List[CureCluster]: The CURE clusters we want to compare against.
    cluster: Cluster: The cluster we are interested in.

    Returns:
    CureCluster: The CURE cluster with the minimal distance to the cluster.
    """
    
    #Student start
    cure_cluster_with_min_dist = None

    # Student end
    return cure_cluster_with_min_dist

# 3 Results

These are the hyperparameters of the algorithm:

 - seed: A random seed that ensures the random sampling returns the same result between runs;
 - sample_size: The number of random samples we use to find the k clusters;
 - representative_points: The number of representative points we take from each of the k clusters;
 - n_clusters: The number of clusters we want to find;
 - move_to_center_precentage: How far the k representative points are moved towards their centroid.
 
 
There are two data sets: *'data/cluster.txt'* and *'data/cluster_lines.txt'*.
Try to find suitable hyperparameters for both data sets.

In [None]:
#CURE parameters
seed = 42
#Student start
sample_size = 35
representative_points = 3
n_clusters = 3
move_to_center_precentage = 0.2
#Student end
# data set 1
file_path = "data/cluster.txt"
# data set 2
#file_path = "data/cluster_lines.txt"

# Select the distance measure that suits the current problem. If you don't know, pick one at random and see what happens.
distance_measure="mean_sqaured_distance"
#distance_measure="closests_point"


First, let's see what the data looks like. (You might want to change your distance measure in the *Cluster* class after seeing this.)

In [None]:
data = load_data(file_path)
plot_clusters(data)

Next, let's sample some random points. Make sure that you have enough samples in each cluster. If this is not the case, you might want to change your hyperparameters.

In [None]:
random.seed(seed)
data_sample = random.sample(data, sample_size)
plot_clusters(data_sample)

Let's assume we have a good, well-distributed random sample of the data. Now let's perform traditional hierarchical clustering on this sample. The resulting clusters should match the clusters we saw visually in the original data set.
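
As a point of comparison outside the lab's own helper, SciPy ships an agglomerative clustering implementation. The sketch below runs `scipy.cluster.hierarchy.linkage` on a small synthetic sample with two linkage methods and cuts the tree into two clusters with `fcluster`. The synthetic blobs are made up, and how these linkage methods relate to the distance measures in `cure_helper.py` is an assumption, not something this notebook states.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic sample: two well separated blobs of 20 points each (made-up data).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
sample = np.vstack([blob_a, blob_b])

labels_per_method = {}
for method in ("single", "average"):
    # Build the full merge tree, then cut it so that at most 2 clusters remain.
    merges = linkage(sample, method=method)
    labels_per_method[method] = fcluster(merges, t=2, criterion="maxclust")
    print(method, labels_per_method[method])
```

On blobs that are this far apart both methods recover the same two groups; differences between linkage methods tend to show up on elongated or touching clusters.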

In [None]:
#cluster samples
sample_clusters = hierarchical_clustering(data_sample, k=n_clusters, distance_measure=distance_measure)
print("The resulting clusters of the sample data:")
plot_clusters(sample_clusters)

Now let's use these clusters to create $k$ CURE clusters with the functions you have written. Then we loop through all the data and assign each data point to the closest CURE cluster.
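
The assignment pass only needs the representative points and one batch of data at a time, which is what makes the approach usable when the data does not fit into memory. Here is a minimal sketch of such a batched pass on plain NumPy arrays. The function name, the label array, and the toy data are hypothetical, and slicing an in-memory array only simulates reading batches from disk.

```python
import numpy as np


def assign_in_batches(data, representatives, labels_of_representatives, batch_size):
    """Label each row of `data` with the cluster of its nearest representative point.

    Hypothetical helper for illustration; not part of the lab's helper module.
    """
    data = np.asarray(data, dtype=float)
    reps = np.asarray(representatives, dtype=float)
    rep_labels = np.asarray(labels_of_representatives)
    assigned = np.empty(len(data), dtype=rep_labels.dtype)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Pairwise Euclidean distances, shape (len(batch), len(reps)).
        dists = np.linalg.norm(batch[:, None, :] - reps[None, :, :], axis=2)
        assigned[start:start + batch_size] = rep_labels[np.argmin(dists, axis=1)]
    return assigned


# Two clusters with two representative points each (made-up values).
reps = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [9.0, 8.0]])
rep_labels = np.array([0, 0, 1, 1])
points = np.array([[0.5, 0.2], [8.0, 8.0], [11.0, 9.0], [2.0, 2.0], [-1.0, 0.0]])
assigned = assign_in_batches(points, reps, rep_labels, batch_size=2)
print(assigned)
```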

In [None]:
# Create CURE Clusters
cure_clusters = [tranform_into_cure_cluster(cluster, representative_points, move_to_center_precentage) for cluster in sample_clusters]

# Assign remaining data to the clusters
for dp in data:
    cure_cluster = find_cure_cluster_with_min_dist(cure_clusters, dp)
    cure_cluster.append(dp)
    

print("The resulting clusters on all the data. The dots are the data points and the diamonds are the representative points of each cluster.")
plot_cure_clusters(cure_clusters)

If you have implemented everything correctly and you have chosen some good hyperparameters, then your results from the CURE version should be very similar to the result of the traditional hierarchical clustering function you see below.

In [None]:
full_data_set_results = hierarchical_clustering(data, k=n_clusters, distance_measure=distance_measure)
plot_clusters(full_data_set_results)

# 4 Questions
**What is the advantage this algorithm has over the BFR algorithm?**


**What happens if the *sample_size* hyperparameter is too high or too low?**


**What happens if the *representative_points* hyperparameter is too high or too low?**


**What is the effect of different distance measures on the final result?**

