 <img src="uva_seal.png"> 

## MLlib Clustering

### University of Virginia
### DS 5559: Big Data Analytics
### Last Updated: Feb 26, 2020

---  


### SOURCES  
- Learning Spark, Chapter 11: Machine Learning with MLlib  

- https://spark.apache.org/docs/latest/mllib-clustering.html  


### OBJECTIVES
Introduction to some of the major clustering techniques in MLlib  

### CONCEPTS

- Unsupervised learning
- K-means
- Mixture of Gaussians

---

**Unsupervised Learning**  
In this task, labels are unknown and the analyst wishes to segment the observations into groups of high similarity, where similarity is defined in terms of the feature space.

Common use cases are:
- Data exploration to discover the properties of similar observations  
- Outlier detection; outliers will generally form their own group (e.g., singletons)  

**K-Means**  
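K-means is among the most popular clustering algorithms, with widespread use in industry. It is relatively simple, has a single main parameter (the number of clusters $K$), and always converges on a solution, though possibly a local minimum of the within-cluster sum of squares rather than the global minimum.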
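
The iteration behind k-means can be sketched in plain Python as alternating assignment and update steps (often called Lloyd's algorithm). This is a minimal illustration under simplifying assumptions (tiny in-memory data, random initialization from the data points); the helper names `squared_distance`, `closest`, and `kmeans` are hypothetical and this is not MLlib's implementation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))

def kmeans(points, k, max_iterations=10, seed=0):
    """Plain-Python k-means loop; illustrative only, not MLlib's code."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initialization from the data
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center)
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:           # assignments are stable: converged
            break
        centers = new_centers
    return centers

# Hypothetical toy data: two well-separated groups
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers = kmeans(points, k=2)
labels = [closest(p, centers) for p in points]
print(centers, labels)
```

On well-separated data like this the loop settles on one center per group. On harder data, different seeds can end in different local minima, which is why the result depends on initialization.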

The following models are supported in `spark.mllib`:

- K-means
- Gaussian mixture
- Power iteration clustering (PIC)
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Streaming k-means

**<center>K-Means Specs</center>**

| Item   | Description |
| -------- | ----------- |
| Supervised/Unsupervised | Unsupervised |
| Initialization | Random Assignment |
| Assumptions | Euclidean Distance |
| Preprocessing | Scaling |
| Parameters | $K:$ number of clusters |
| Metrics | Inertia |
| Strengths | One parameter, relatively simple |
| Weaknesses | 1. May not find the global optimum <br> 2. Can't handle non-quantitative data (e.g., categorical)<br> 3. Assumes spherical cluster shape|

**K-Means Sample 2D Visualization ($K=3$)**

<img src="k_means_before_after.png">

| K-Means Sample Workflow | 
| -------- | 
| 1. feature selection | 
| 2. feature standardization | 
| 3. run algo for sequence of $K$ |  
| 4. examine results and remediate outliers <br> <span style="color:red">loop on 3-4 as needed</span>| 
| 5. select $K^*$, extract labels | 
| 6. enrich with domain knowledge | 
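
Step 2 matters because Euclidean distance is dominated by whichever feature has the largest numeric range. A minimal plain-Python sketch of z-score standardization follows; the function name `standardize` and the income/age data are hypothetical. In a Spark job, MLlib's `StandardScaler` from `pyspark.mllib.feature` is a natural tool for the same step.

```python
from math import sqrt

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(columns, means)]
    # Guard against constant columns, which have zero spread
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: income in dollars and age in years, on very different scales
rows = [[30000.0, 25.0], [60000.0, 35.0], [90000.0, 45.0]]
scaled = standardize(rows)
for row in scaled:
    print(row)
```

After scaling, both columns contribute on the same footing to the distance calculation, so income no longer swamps age.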

**K-Means: Selecting $K^*$**

One method for selecting $K^*$ is to identify the elbow in a scree plot of WSS against $K$.  Beyond the elbow, adding more clusters reduces WSS only marginally.  Generally, the additional clusters are created by splitting apart clusters that were already well formed.

$$ \frac{\text{Within Sum of Squares (WSS)}}{\text{Total Sum of Squares}} = 1 - \frac{\text{Between Sum of Squares}}{\text{Total Sum of Squares}} $$

<img src='scree_plot_k_means2.png'>
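
The sums of squares are linked by the identity total = within + between. The plain-Python sketch below checks this on a small, hypothetical labeled dataset; the names `sq_dist`, `mean`, and `groups` are illustrative only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(c) for c in zip(*points))

# Hypothetical clustering: two groups of 2-D points, keyed by cluster label
groups = {
    0: [(1.0, 1.0), (1.0, 3.0)],
    1: [(5.0, 5.0), (7.0, 5.0)],
}
all_points = [p for pts in groups.values() for p in pts]
grand_mean = mean(all_points)

# WSS: squared distance of each point to its own cluster center
wss = sum(sq_dist(p, mean(pts)) for pts in groups.values() for p in pts)
# BSS: squared distance of each center to the grand mean, weighted by cluster size
bss = sum(len(pts) * sq_dist(mean(pts), grand_mean) for pts in groups.values())
# TSS: squared distance of each point to the grand mean
tss = sum(sq_dist(p, grand_mean) for p in all_points)

print(wss, bss, tss)  # 4.0 34.0 38.0
```

Because the total is fixed by the data, every reduction in WSS from adding clusters shows up as an equal increase in the between-cluster term.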

**K-Means Implementation**

`MLlib` contains an implementation of `K-means`, including `K-means||`, a parallelized variant of k-means++ initialization.  
`K-means||` typically provides better starting centers than random initialization in parallel environments.  

Among the parameters,  
`initializationMode` specifies either `random` initialization or initialization via `k-means||`.

`K-means` takes an RDD of `Vectors` as input.  

**Methods:**  
The trained model exposes `clusterCenters`, an array of vectors.  
Calling `predict()` on a new vector returns its assigned cluster, which is the cluster whose center is closest.
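
The nearest-center rule behind `predict()` can be sketched in a few lines of plain Python. The function below is a hypothetical stand-in, not the MLlib method, and the `centers` list is made-up data playing the role of a trained model's `clusterCenters`.

```python
def predict(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(point, center))
    return min(range(len(centers)), key=lambda j: sq_dist(centers[j]))

# Hypothetical centers standing in for a trained model's clusterCenters
centers = [(0.1, 0.1, 0.1), (9.1, 9.1, 9.1)]
print(predict((0.0, 0.5, 0.2), centers))   # 0
print(predict((8.0, 9.0, 10.0), centers))  # 1
```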

**K-Means Example**

In [2]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

In [3]:
from numpy import array
from math import sqrt

from pyspark.mllib.clustering import KMeans, KMeansModel

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

In [8]:
parsedData.take(2)

In [5]:
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")

In [7]:
clusters

In [9]:
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

In [13]:
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

In [11]:
# Save and load model
clusters.save(sc, "KMeansModel")
sameModel = KMeansModel.load(sc, "KMeansModel")

In [14]:
sameModel

Fitting other clustering models requires calling the appropriate training function.  

**Gaussian Mixture Model**    
The *Gaussian Mixture Model* is a weighted combination of underlying Gaussian distributions, each with its own mixing weight (the probability that an observation is drawn from that component).  The *expectation-maximization algorithm* is used in `spark.mllib` to estimate the parameters.  
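
Written out, the standard form of the mixture density with $K$ components is shown below. The symbols $w_k$, $\mu_k$, and $\Sigma_k$ are notation introduced here for the component weights, means, and covariance matrices; they are not taken from the course materials.

$$ p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 $$

In the expectation step, each observation $x_i$ receives a soft assignment (responsibility) to every component, and the maximization step re-estimates $w_k$, $\mu_k$, and $\Sigma_k$ from those soft assignments:

$$ \gamma_{ik} = \frac{w_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} $$

Unlike k-means, which assigns each point to exactly one cluster, these responsibilities are fractional and sum to 1 across components for each observation.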

**Mixture of Gaussians, Basic Operations**

In [28]:
# Code for illustration only, does not run

from pyspark.mllib.clustering import GaussianMixture, GaussianMixtureModel

# Build the model (cluster the data)
gmm = GaussianMixture.train(parsedData, 2)


# output parameters of model
for i in range(2):
    print("weight = ", gmm.weights[i], "mu = ", gmm.gaussians[i].mu,
          "sigma = ", gmm.gaussians[i].sigma.toArray())

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) **K-Means: Loop over *maxIterations***  
i. Copy the k-means example into the cell below  
ii. Remove the "save and load model" parts of the code  
iii. Modify the code to loop over *maxIterations* for values 10, 20, 30 and print the WSSSE  
iv. Run the code and note how WSSSE varies as a function of *maxIterations*

2) **K-Means: Loop over *K***  
i. Copy the k-means example into the cell below  
ii. Remove the "save and load model" parts of the code  
iii. Modify the code to loop over the number of clusters *K* for values 2, 3, 4, 5 and print the WSSSE  
iv. Run the code and note how WSSSE varies as a function of *K*