
# Lab: Clustering using BFR

Data Mining 2019/2020

By Jordi Smit and Gosia Migut

**WHAT** This _optional_ lab consists of several programming and insight exercises/questions. These exercises are meant to let you practice with the theory covered in: "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use Mattermost to discuss questions with your peers. For additional questions and feedback please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.

**SUMMARY**
In the following exercises you will implement the BFR algorithm. This is a clustering algorithm designed for very large data sets that don't fit into memory. In this exercise we will simulate the limited amount of memory by dividing the data into sub batches.

**Requirements**
 - Python 3.6 or higher
 - numpy
 - scipy
 - ipython
 - jupyter
 - matplotlib

In [None]:
from uuid import UUID
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import sys
import uuid

%matplotlib inline

# 0 The problem
K-means and hierarchical clustering are two very well-known clustering algorithms. Both only work if the entire data set fits in main memory, which puts an upper limit on the amount of data they can cluster. To go beyond this limit we need an algorithm that doesn't need the entire data set to be in main memory. In this exercise we look at the approach of the BFR algorithm. We will simulate the lack of memory by dividing the data into a list of lists, where each sublist is a different batch that has 'supposedly' been read from disk or some other storage server.
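To make the "list of lists" idea concrete, here is a minimal sketch of how an in-memory array could be cut into consecutive batches. The function name `split_into_chunks` is hypothetical and is not part of the lab's helper code; the lab's own `load_data` may work differently.

```python
import numpy as np

def split_into_chunks(points, chunk_size):
    """Split a 2-D array of points into a list of consecutive batches.

    Hypothetical helper for illustration only; the last batch may be smaller.
    """
    return [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))   # 100 toy points in 2-D
chunks = split_into_chunks(points, chunk_size=35)
print([len(c) for c in chunks])      # [35, 35, 30]
```

Each element of `chunks` then plays the role of one read from disk: the algorithm only ever looks at one batch at a time.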



BFR works by summarizing the clustering data into statistics: the sum, the sum of squares and the number of data points per cluster. This means that it has to read each data point only once. The algorithm uses three sets that contain clustering summarizations:
- **Discard Set**:
Contains the summarizations of the data points that are *close enough* (we'll define this later on) to one of the main clusters.
- **Compressed Set** (a.k.a. mini clusters):
Contains the summarizations of the data points that are not *close enough* to one of the main clusters but form mini clusters with other points that are not *close enough* to one of the main clusters.
- **Retained Set**: 
Contains data points that are not *close enough* to one of the main clusters and not *close enough* to one of the mini clusters. (These are summarizations of a single data point.)
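The reason these three statistics are enough is that they are additive: two summaries can be merged by adding their components, and the centroid and per-axis variance can be recovered afterwards. The sketch below illustrates this on plain NumPy arrays. The function names are hypothetical, it does not use the lab's `Cluster` classes, and it assumes the population variance $\frac{SUMSQ}{N} - (\frac{SUM}{N})^2$.

```python
import numpy as np

def summarize(points):
    """Return (N, SUM, SUMSQ) for a 2-D array with one row per point."""
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

def merge(a, b):
    """Merge two summaries by adding their components."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def centroid_and_variance(summary):
    """Recover the per-axis mean and population variance from a summary."""
    n, s, sq = summary
    mean = s / n
    variance = sq / n - mean ** 2
    return mean, variance

rng = np.random.default_rng(1)
first, second = rng.normal(size=(40, 2)), rng.normal(size=(60, 2))

# Summarize each batch separately, then merge the two summaries.
merged = merge(summarize(first), summarize(second))
mean, variance = centroid_and_variance(merged)

# Compare against the statistics computed from all the raw points at once.
everything = np.vstack([first, second])
print(np.allclose(mean, everything.mean(axis=0)))
print(np.allclose(variance, everything.var(axis=0)))
```

Because the merged summary matches the statistics of the raw data, the raw points never have to be kept in memory once they have been summarized.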

BFR uses the first chunk of data to find the $k$ main clusters and puts them into the **Discard Set**. Then it loops through the remaining chunks of data. For each data point in a chunk it checks whether the data point is *close enough* to one of the main clusters. If it is, it is added to the **Discard Set**; if not, it is added to the **Retained Set**. After we have sorted the data in this chunk, we check whether we can find any new mini clusters in the **Retained Set**. All the new non-singleton clusters are added to the **Compressed Set**, while all the singleton clusters stay in the **Retained Set**. Before we continue to the next chunk, we have to check that we don't have too many mini clusters in the **Compressed Set**. We can reduce the number of mini clusters by combining them through clustering. After we have gone through all the data we end up with $k$ main clusters, $m$ mini clusters and $n$ retained data points. Because we only want $k$ clusters we need to combine all these summarizations, which can also be done using clustering.


After we have done all this we end up with $k$ cluster summarizations, which can be used to assign future data to the closest clusters.

If you want more explanation, see [this online video lecture](https://www.youtube.com/watch?v=NP1Zk8MY08k) from the authors of the book.

# 1 Setting up
Let's get started by creating the data structures for this problem. First of all we need to create a class for the DataPoint. This class stores the vector location and the cluster to which the point has been assigned.

In [None]:
class DataPoint(object):
    """
    A datapoint that can be clustered.
    """

    def __init__(self, vector):
        self.vector = vector
        self.cluster_id = None

    def to_singleton_cluster(self):
        """
        Returns:
        Cluster: A cluster with a single data point.
        """
        sum_v = self.vector
        squared_sum = sum_v ** 2
        n_data_points = 1
        self.cluster_id = uuid.uuid4()
        return BFRCluster(sum_v, squared_sum, n_data_points, set([self.cluster_id]))

    def __repr__(self):
        return f"DataPoint(vector: {self.vector}, cluster_id: {self.cluster_id})"

Next, let's create a class for the BFR cluster. This class must store the statistical summarization of the data and be usable with hierarchical clustering. All the hierarchical clustering related logic has already been implemented in its parent class `Cluster`. You can read its documentation using `help(Cluster)` or see its implementation in `bfr_helper.py`.

However the statistical summarization and BFR related logic must still be implemented. Now it is your job to:
 - Define the ***mean*** attribute;
 - Define the ***variance*** attribute;
 - Define the ***std*** attribute;
 - Finish the ***is_data_point_sufficiently_close*** method, used to  determine if a *DataPoint* is close enough to be added to the discard set;
 - Finish the ***mahalanobis_distance*** method, the distance measure used by the ***is_data_point_sufficiently_close*** function;

We define a *DataPoint* as close enough if $MD < 3 \cdot \sigma_i$ for at least one axis $i$, where $MD$ is the *Mahalanobis distance* and $\sigma_i$ is the standard deviation of the cluster along axis $i$.

**Hints:**
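 - ${\sigma_i}^2 = \frac{SUMSQ_i}{N} - \left(\frac{SUM_i}{N}\right)^2$  
 
 - $\bar{x_i} = \frac{SUM_i}{N}$ 
 
 - $MD = \sum_{i=1}^{d} {\left(\frac{x_i - \bar{x_i}}{\sigma_i}\right)^2}$, where $d$ is the number of dimensions and $N$ is the number of data points in the cluster. Note that no square root is taken in this lab.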
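As a worked example of the distance and threshold described above, the sketch below evaluates them on a toy cluster of four points using plain NumPy arrays. It is only an illustration, not the `BFRCluster` implementation you are asked to write. The function names are hypothetical, and it assumes the population variance $\frac{SUMSQ_i}{N} - (\frac{SUM_i}{N})^2$ and the distance as a plain sum of squared normalized offsets.

```python
import numpy as np

# Toy cluster: four corners of a 2 x 4 rectangle.
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0]])
n = len(points)
sum_v = points.sum(axis=0)
sumsq = (points ** 2).sum(axis=0)

mean = sum_v / n                       # [1. 2.]
std = np.sqrt(sumsq / n - mean ** 2)   # [1. 2.]

def normalized_distance(x):
    """Sum over axes of the squared, std-normalized offsets (no square root)."""
    return float(np.sum(((x - mean) / std) ** 2))

def close_enough(x):
    """True if the distance is below 3 * std on at least one axis."""
    return bool(np.any(normalized_distance(x) < 3 * std))

print(normalized_distance(np.array([3.0, 2.0])))  # 4.0
print(close_enough(np.array([3.0, 2.0])))         # True, since 4 < 3 * 2
print(close_enough(np.array([10.0, 2.0])))        # False, distance is 81
```

Dividing by the standard deviation per axis means that an offset of 2 along a wide axis counts for less than the same offset along a narrow axis.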

In [None]:
from bfr_helper import Cluster
# help(Cluster)

In [None]:
class BFRCluster(Cluster):
    """
    A summarization of multiple data points.
    """
    def __init__(self, sum_v, squared_sum, n_data_points, cluster_ids):
        # Student start
        mean = None
        variance = None
        std = None
        # Student end
        
        super().__init__(sum_v, squared_sum, n_data_points, cluster_ids, mean, variance, std)
        
    def is_singleton(self):
        """
        Returns:
        bool: True if the cluster contains only a single data point.
        """
        return self.n_data_points == 1

    def mahalanobis_distance(self, dp):
        """
        Parameters:
        dp: DataPoint: The DataPoint we are interested in.

        Returns:
        float: The Mahalanobis distance between the centroid of this cluster and a data point.
        """
        # Student start
        return 0
        # Student end
    
    def is_data_point_sufficiently_close(self, dp):
        """
        Parameters:
        dp: DataPoint: The DataPoint we are interested in.

        Returns:
        bool: True iff the Mahalanobis distance is less than 3 times the std on at least one axis.
        """
        # Student start
        
        # Student end
        return False

Run the code below to verify that the functions were implemented correctly:

In [None]:
np.random.seed(42)
v = np.random.rand(3,2)
cluster = BFRCluster(np.sum(v, axis=0, keepdims=True), np.sum(v ** 2, axis=0, keepdims=True), len(v), [uuid.uuid4()])

# The checks use np.isclose so that equivalent implementations that differ
# only by floating-point rounding still pass.

# Verify that mean is implemented correctly
assert cluster.mean.shape == (1,2)
assert np.isclose(cluster.mean[0][0], 0.420850900367068)

# Verify that variance is implemented correctly
assert cluster.variance.shape == (1,2)
assert np.isclose(cluster.variance[0][1], 0.10571935835911339)

# Verify that std is implemented correctly
assert cluster.std.shape == (1,2)
assert np.isclose(cluster.std[0][0], 0.23741019819501288)

# Verify that mahalanobis_distance is implemented correctly
dp = DataPoint(np.random.rand(1,2))
assert np.isclose(cluster.mahalanobis_distance(dp), 3.1732638628025542)

inpoint = DataPoint(cluster.mean)
outpoint = DataPoint(2 * cluster.mean)
assert cluster.is_data_point_sufficiently_close(inpoint)
assert not cluster.is_data_point_sufficiently_close(outpoint)

In the next cell we import some helper functions we have already created for you:
 - `load_data`;
 - `hierarchical_clustering`;

You can read their documentation using Python's `help` function, as shown below.

In [None]:
from bfr_helper import hierarchical_clustering
from bfr_helper import load_data

help(hierarchical_clustering)
help(load_data)


# 2 BFR
In this section we'll use the previously defined data structures and functions to create the BFR algorithm. Let's get started by defining the *find_index_sufficiently_close_cluster* function. This function needs to return the index of a cluster that is sufficiently close to the data point. (**Hint:** *we have already defined a method for this*.) If no cluster is close enough it should return `None`.

In [None]:
def find_index_sufficiently_close_cluster(k_points, dp):
    """
    Finds the index of the most representative cluster from the given k points.

    Parameters:
    k_points List[Cluster]: The K clusters in the discard set.
    dp: DataPoint: The data point we are interested in.

    Returns:
    Optional[int]: Returns the index of the representative cluster. Returns None if no cluster is representative.

    """
    # Student start
    
    # Student end
    return None

These are the hyperparameters of the algorithm:

 - chunk_size: How much data we can store in a single memory scan;
 - k_clusters: The final number of clusters we want;
 - n_discard_clusters: The number of discard clusters we'll have in the algorithm;
 - n_mini_clusters: The number of mini clusters we keep between memory scans;
 - n_new_mini_clusters: The number of new mini clusters we create per memory scan;

In [None]:
#Algo hyperparameters
file_path = "data/cluster.txt"
chunk_size = 35
data = load_data(file_path, chunk_size, create_data_point_func=DataPoint)
k_clusters = 3
n_discard_clusters = 3
n_mini_clusters = 25
n_new_mini_clusters = 25

In this cell we'll implement the BFR algorithm. It is your job to:

- For the first chunk:
	 - Fill the discard set using the first chunk;
- For the remaining chunks:
	 - Add each point that is sufficiently close to a cluster to that cluster;
	 - Add each point that is not sufficiently close to any cluster to the retained set;
	 - Combine the points in the retained set that are closest to each other into mini clusters, while keeping points that are not close to any other point in the retained set;
- After all chunks:
	 - Combine the discard, compressed and retained sets into the wanted number of $K$ clusters.

**Hints**

 - You can combine the clusters that are closest to each other using `hierarchical_clustering`;
 - Carefully look at the functions we have defined in the previous part. Most of the logic is already defined there;
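To illustrate what "combining the clusters that are closest to each other" can look like on summaries, here is a toy agglomerative merge over `(N, SUM, SUMSQ)` tuples. This is a sketch under assumptions: `merge_until`, `singleton`, `centroid` and `merge` are hypothetical names, the distance between two summaries is taken to be the Euclidean distance between their centroids, and none of this is the lab's `hierarchical_clustering` helper, whose interface is documented by `help(hierarchical_clustering)`.

```python
import numpy as np

def centroid(summary):
    n, s, _ = summary
    return s / n

def merge(a, b):
    """Merge two (N, SUM, SUMSQ) summaries by adding their components."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def merge_until(summaries, k):
    """Greedily merge the two summaries with the closest centroids until k remain."""
    summaries = list(summaries)
    while len(summaries) > max(k, 1):
        best = None
        for i in range(len(summaries)):
            for j in range(i + 1, len(summaries)):
                d = np.linalg.norm(centroid(summaries[i]) - centroid(summaries[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge(summaries[i], summaries[j])
        # Drop the two merged summaries and append their combination.
        summaries = [s for idx, s in enumerate(summaries) if idx not in (i, j)]
        summaries.append(merged)
    return summaries

def singleton(p):
    """Summary of a single point."""
    p = np.asarray(p, dtype=float)
    return 1, p, p ** 2

toy = [singleton(p) for p in [(0, 0), (0, 1), (10, 10), (10, 11), (5, 20)]]
result = merge_until(toy, k=3)
print(sorted(s[0] for s in result))  # [1, 2, 2]
```

The same routine works whether the inputs are singletons, mini clusters or main clusters, which is why one clustering step can serve both for shrinking the compressed set and for the final combination.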


In [None]:
discard = []
compressed = []
retained = []

for dp in data[0]:
    # Student start
    singleton_cluster = None
    
    # Student end
    
# Student start
# Fill discard with k representative points
discard = None
# Student end


for chunk in data[1:]:
    for dp in chunk:
        index_sufficiently_close_cluster = find_index_sufficiently_close_cluster(discard, dp)
        if index_sufficiently_close_cluster is not None:
            # Student start
            # Replace the sufficiently_close_cluster with the merged cluster
            pass
            # Student end
        else:
            # Student start
            # transform datapoint into singleton cluster
            pass
            # add the singleton cluster to the retained set
            pass
            # Student end
   
    # Student start  
    # find new mini clusters in the retained set
    new_mini_clusters = None
    retained = None
    new_mini_clusters = None
    compressed = None
    # Student end

# Combine the remaining summarizations of the clusters.
combine_summarization = discard + compressed + retained
# Combine the summarizations until there are only k_clusters left
# Student start  
resulting_clusters = None       
# Student end

# 3 Results
And we are done! The only thing left to do is to look at the final result. Run the cell below to visualize the resulting clusters. The small dots are the data points, while the diamonds are the centroids of the clusters.

In [None]:
marker_dp = "."
marker_cluster = "D"
k = len(resulting_clusters)
colors = cm.rainbow(np.linspace(0,1,k))

# Plot the centroids of the clusters
for i, cluster in enumerate(resulting_clusters):
    x = cluster.mean[:, 0]
    y = cluster.mean[:, 1]
    plt.scatter(x, y, marker=marker_cluster, c='k')

# Plot the assigned data
for chunk in data:
    for dp in chunk:
        x = dp.vector[:, 0]
        y = dp.vector[:, 1]
        color = None
        for i, cluster in enumerate(resulting_clusters):
            if cluster.contains(dp):
                color = colors[i]
                break
        assert color is not None
        plt.scatter(x, y, marker=marker_dp, c=[color])

plt.show()

# 4 Questions

### **4.1 This algorithm has one major assumption. What is this assumption?**

*(hint it has something to do with data distribution.)*

### **4.2 What is the major disadvantage of this assumption?**

### **4.3 How many passes over secondary memory does this algorithm have to make?**


### **4.4 Let's say we have a dataset with 3 clusters A, B & C. What happens if the first chunk only has data from the A cluster?**