K-means and Hierarchical Clustering

* Objectives:
    * Explain the difference between supervised and unsupervised learning
    * Implement a k-means algorithm for clustering
    * Discuss how curse of dimensionality affects clustering
    * Choose the best k using the elbow method or silhouette scores
    * Implement and interpret hierarchical clustering

1) Supervised vs Unsupervised Learning
* Unsupervised Learning Properties:
    * No response variable, $y$
        * Just based on predictors: $X_1,X_2,X_3,\dots,X_p$
    * No cross-validating to choose best "model" in usual sense
    * No cross-validating to know how well you're doing
    * Can be useful as **preprocessing step** for supervised learning
    * Can be useful for better **understanding features**
* Most Common Unsupervised Techniques
    1. **PCA** - **low-dimensional representation** of data that explains good fraction of variance
    ![pca](pca.png)
    2. **Clustering** - finding **homogeneous subgroups** among data
    ![clustering](clustering.png)
* Unsupervised learning algorithms:
    * K-means, Hierarchical Clustering (can be used in supervised learning)
    * PCA (can be used in supervised learning)
* Supervised learning algorithms:
    * Linear, Logistic, Lasso, Ridge
    * Decision Trees, Bagging, Random Forest, Boosting
    * SVM
    * kNN

2) Curse of Dimensionality Review - points are "far away" in high dimensions, and it's easy to overfit small datasets (sparsity of sample data points)
* Linear models vs k-Nearest Neighbor
    * Linear Models:
        * Very structured
        * Stable, but possibly inaccurate
        * Low variance, High bias
    * k-Nearest Neighbor:
        * Very mildly structured
        * Often accurate, but unstable
        * High variance, Low bias
    * kNN is problematic in high-dimensional spaces
        * Usually pretty good for $p \leq 4$ and $N$ on the large side
        * Need to get a reasonable fraction of the $N$ values of $y_i$ to average to bring down the variance
        * Nearest neighbors can be "far" in high dimensions
            * Let's consider 10% to be a reasonable fraction of distance:
            ![knn_radius](knn_radius.png)
            * $p=1$ involves variable x1
            * $p=2$ involves variable x1 and x2
                * Radius of circle in 2 dimensions is much bigger than radius in 1 dimension
            ![dim_radius](dim_radius.png)
        * Hyper-cubical neighborhood about target point to capture fraction $v$ of the unit volume
            * expected edge length: $e_p(v)=v^{\frac{1}{p}}$
            * sampling density proportional to: $N^{\frac{1}{p}}$
                * $p$ = dimensions of input space
                * $N$ = number of points
            ![volume_edge_length](volume_edge_length.png)
            * edge length example:
                * Suppose interested in a $v=10\%$ neighborhood
                * $p=1 \rightarrow e_p(v)=(0.1)^{\frac{1}{1}}=0.1$
                * $p=10 \rightarrow e_p(v)=(0.1)^{\frac{1}{10}}=0.794$
            * sampling density example:
                * How to achieve equivalent density in higher dimensions
                * Suppose $N_1=100^{1}$ represents a dense sample for a one-dimensional feature space
                * To achieve the same density for 10 inputs, we need $N_{10}=100^{10}$ points
        * kNN, or **any method involving this sort of distancing**, suffers majorly from curse of dimensionality
            * With $p=10$, even the nearest neighbors are "far" away
            * K-means and hierarchical clustering also have issues in high dimensions
        * Idea of "far" and sparsity of points in high dimensions can be thought of in terms of the **radii approach** and the **hypercube approach**
        * It takes a lot of data to make up for an increase in dimensions
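
The edge-length and sampling-density formulas above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names `edge_length` and `samples_for_same_density` are made up for illustration and are not from any library.

```python
def edge_length(v, p):
    """Edge length of a hypercube capturing fraction v of the unit volume in p dimensions."""
    return v ** (1.0 / p)


def samples_for_same_density(n_one_dim, p):
    """Sample size in p dimensions matching the density of n_one_dim points in one dimension."""
    return n_one_dim ** p


# A 10% neighborhood needs a longer and longer edge as p grows.
for p in (1, 2, 3, 10):
    print(f"p={p:2d}  edge length for v=10%: {edge_length(0.1, p):.3f}")

# Matching the density of 100 points on a line takes 100**10 points with 10 inputs.
print(f"points needed for p=10: {samples_for_same_density(100, 10):.0e}")
```

With $p=10$, the neighborhood has to span roughly 80% of the range of every input to capture 10% of the data, so it is no longer "local" in any useful sense.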

2) K-means Clustering
* What is clustering? Divide data into distinct **subgroups** such that observations **within each group are quite similar**
![uncluster_to_cluster](uncluster_to_cluster.png)
* **K-means clustering** - partitions data into $K$ subgroups while **minimizing** within-cluster variation
        * Example: With a fixed $K=3$, assign each of $n$ data points to one of 3 clusters, such that **within-cluster variation (WCV)** is smallest
        * There are $K^n$ possible choices
    * Equation: $$\text{minimize}_{C_1,\dots,C_K}\Big\{\sum_{k=1}^K \mathbf{WCV}(C_k)\Big\}$$ where **WCV** for the k$^{th}$ cluster is the sum of all pairwise squared Euclidean distances between its observations, divided by the cluster size
        * $$\mathbf{WCV}(C_k)=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2$$ where $|C_k|$ is the number of observations in the k$^{th}$ cluster
    * Problem is that there are $K^n$ ways (too many!)
    * K-means clustering algorithm:
    ![kmeans_alg](kmeans_alg.png)
        1. Randomly assign number, from 1 to $K$, to each data point
        2. Repeat until cluster assignments stop changing
            1. For each of $K$ clusters, compute cluster **centroid** by taking vector of $p$ feature means
            2. Assign data point to cluster for which centroid is closest (Euclidean distance)
    * Issues in K-means:
        * Can converge to a local optimum that depends on the **random initialization**
        * Solution: Try **multiple initializations** and pick the one with the lowest total **WCV**
            * Also consider **K-means++** method that allows for smarter initializations
    * Choosing $K$:
        * Issues with choosing $K$:
            * No easy answer
            * May just want $K$ similar groups
            * But more often, want something **interpretable** that exposes some interesting aspect of data
                * Presence/absence of natural distinct groups
                * Descriptive statistics about groups
            * Example: Are there certain segments of my market that tend to be alike?
                * e.g. middle-aged living in suburbs who log-in infrequently
        * Methods for choosing $K$:
            1. **"Elbow" method** - choose a number of clusters so that adding another cluster doesn't reduce **WCV** much more
                ![elbow_method](elbow_method.png)
                * **Within Cluster Point Scatter (WCPS)** - a natural loss function is the sum of pairwise distances between points within each cluster, summed over all clusters, for a chosen dissimilarity $d(x_i,x_{i'})$
                * WCPS Equation:
                    * $$W(C)=\frac{1}{2}\sum_{k=1}^K\sum_{C(i)=k}\sum_{C(i')=k}d(x_i,x_{i'})$$
                    * Let $d_{ii'}=d(x_i,x_{i'})$:
                    * **Total Point Scatter**: $$\begin{align}
                    T & = \frac{1}{2}\sum_{i=1}^N\sum_{i'=1}^N d_{ii'} \\
                    & = \frac{1}{2}\sum_{k=1}^K\sum_{C(i)=k}\Big( \sum_{C(i')=k}d_{ii'}+\sum_{C(i')\neq k}d_{ii'} \Big) \\
                    & = W(C)+B(C)
                    \end{align}$$
                    * **Between Cluster Point Scatter**: $$B(C)=\frac{1}{2}\sum_{k=1}^K\sum_{C(i)=k}\sum_{C(i')\neq k}d_{ii'}$$
                * Alternative Form: 
                    * $$\begin{align}
                    W(C) & = \frac{1}{2}\sum_{k=1}^K\sum_{C(i)=k}\sum_{C(i')=k}\Vert x_i-x_{i'} \Vert^2 \\
                    & = \sum_{k=1}^K N_k \sum_{C(i)=k}\Vert x_i-\bar{x}_k \Vert^2
                    \end{align}$$
                        * $\bar{x}_k=(\bar{x}_{1k},\dots,\bar{x}_{pk})$ is mean vector associated with k$^{th}$ cluster
                        * $N_k=\sum_{i=1}^N I(C(i)=k)$
            2. **GAP statistic** - compare within-cluster scatter, $W_1,\dots,W_K$, to that expected for data uniformly distributed over a rectangle containing the data. Find largest gap.
                ![gap_statistic](gap_statistic.png)
                * For each $K$, compare $W_k$ (within-cluster sum of squares) with that of randomly generated "reference distributions"
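                    * Generate $B$ reference distributions: $Gap(K)=\frac{1}{B}\sum_{b=1}^B \log W_{Kb}-\log W_K$, where $W_{Kb}$ is the within-cluster sum of squares for the $b^{th}$ reference dataset
                    * Choose smallest $K$ such that $Gap(K) \geq Gap(K+1)-s_{K+1}$ where $s_K$ is the standard error of $Gap(K)$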
                * GAP Statistics Steps:
                    1. Observed vs. Expected value of $log(W_k)$ over 20 simulations from uniform data
                    2. Translate curves so that $log(W_k)=0$ for $k=1$
                    3. GAP statistic $K^*$ is smallest $K$ producing gap within one standard deviation of gap at $K+1$
                * Arguably best method!
                * Notice as number of clusters increases, within cluster scatter decreases
                * What happens when number of clusters is number of points?
            3. **Silhouette Coefficient** - general method for interpreting and validating clusters of data
                * For each observation $i$:
                    * $a_i$ = average dissimilarity of $i$ with all other data points **within same cluster**
                        * a measure of how well $i$ is assigned to the cluster
                        * the **smaller** $a_i$ is, the better the assignment
                    * $b_i$ = lowest average dissimilarity of $i$ to any other cluster, of which $i$ is not a member
                        * other cluster can be thought of as a "neighboring cluster"
                * Equation:
                    * $S_i = \frac{b_i-a_i}{max(a_i, b_i)}$
                    * range: $-1 \leq S_i \leq 1$
                * Want $a_i$ small, $b_i$ large $\rightarrow$ want silhouette, $S_i$, large
                    * near 1 $\rightarrow$ dense and well separated
                    * near 0 $\rightarrow$ overlapping clusters; could well belong to another cluster
                    * near -1 $\rightarrow$ misclustered
                * Example: 38 data points, 3 clusters
                ![silhouette](silhouette.png)
                    * 1st cluster - 8 data points and avg silhouette of 0.78
                    * 2nd cluster - 19 data points and avg silhouette of 0.64
                    * 3rd cluster - 11 data points and avg silhouette of 0.51
                    * Overall avg silhouette of **0.63**
                * Guidelines for Overall Avg Silhouette:

|   Range  |       Interpretation      |
|:--------:|:-------------------------:|
| 0.71-1.0 | Strong structure found    |
| 0.51-0.7 | Reasonable structure      |
| 0.26-0.5 | Structure weak/artificial |
| ≤ 0.25   | No substantial structure  |
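
The K-means steps and the multiple-initialization fix described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the toy data is made up, the helper names (`kmeans`, `total_wcv`, `best_of`) are hypothetical, and an empty cluster is re-seeded with a random data point, which the notes do not specify. Total WCV is computed through the identity $\mathbf{WCV}(C_k)=2\sum_{i\in C_k}\Vert x_i-\bar{x}_k\Vert^2$, which equals the pairwise form given earlier.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Vector of feature means for a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, rng, max_iter=100):
    """One K-means run: random labels, then alternate centroid and assignment steps."""
    labels = [rng.randrange(k) for _ in points]  # step 1: random assignment
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centers.append(centroid(members) if members else rng.choice(points))
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


def total_wcv(points, labels, k):
    """Total WCV, using WCV(C_k) = 2 * sum of squared distances to the centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            center = centroid(members)
            total += 2.0 * sum(squared_distance(p, center) for p in members)
    return total


def best_of(points, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the labelling with the lowest total WCV."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda labels: total_wcv(points, labels, k))


# Toy data (made up for illustration): three loose groups in two dimensions.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
        (1.0, 9.0), (1.5, 8.5), (2.0, 9.5)]

# Total WCV for K = 1..5; a plot of these values is what the elbow method inspects.
wcv_by_k = {k: total_wcv(data, best_of(data, k), k) for k in range(1, 6)}
for k, value in wcv_by_k.items():
    print(f"K={k}: total WCV = {value:.2f}")
```

Plotting `wcv_by_k` against $K$ gives the elbow curve: the value to look for is the $K$ after which the drop in total WCV flattens out.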

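The silhouette definition above can also be computed directly. The sketch below assumes Euclidean distance as the dissimilarity and uses made-up toy points; `silhouette_values` and `mean_distance` are hypothetical helper names. The last lines recompute the overall average of the 38-point example as a size-weighted mean of the three cluster averages.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance(point, others):
    """Average dissimilarity between one point and a non-empty list of points."""
    return sum(euclidean(point, o) for o in others) / len(others)


def silhouette_values(points, labels):
    """S_i = (b_i - a_i) / max(a_i, b_i) per observation; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [points[j] for j in clusters[lab] if j != i]
        if not own:
            # Assumption (a common convention): a singleton cluster gets S_i = 0.
            values.append(0.0)
            continue
        a_i = mean_distance(points[i], own)
        b_i = min(
            mean_distance(points[i], [points[j] for j in members])
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b_i - a_i) / max(a_i, b_i))
    return values


# Toy data (made up): two tight, well separated pairs.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(pts, [0, 0, 1, 1])   # sensible clustering
bad = silhouette_values(pts, [0, 1, 0, 1])    # deliberately poor clustering
print(sum(good) / len(good), sum(bad) / len(bad))

# Overall average for the 38-point example: cluster averages weighted by cluster size.
overall = (8 * 0.78 + 19 * 0.64 + 11 * 0.51) / (8 + 19 + 11)
print(round(overall, 2))
```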
3) **Hierarchical Clustering** - method of cluster analysis which seeks to build a hierarchy of clusters
* Hierarchical Clustering Algorithm:
    ![hierarchical_clustering](hierarchical_clustering.png)
    1. Start with each point as its own cluster
    2. Merge the two closest clusters
    3. End when all points are in a single cluster
* Still need to account for "distance" between clusters
* **Height of Fusion** - indicates the proximity of clusters
    * Example (from above): A & C are close (height: 1.2)
    * Red and Green clusters are not close (height: ~4.1)
* Varying $K$
![varying_k](varying_k.png)
    * In contrast to K-means, it is not necessary to choose $K$ from the start
        * Depending on where the cut is precisely, there could be **1 to $n$ clusters**
    * Choosing $K$: Can again use Elbow Method, GAP Statistic, Silhouette Coefficient
        * But, notice the **height of dendrogram** gives a **sense of separation of clusters** depending on the cut
* Distance Between Two Clusters:

|  Linkage |                                                                                                                                                                 Description                                                                                                                                                                |                          Usage                          |
|:--------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------:|
| Complete | Maximal intercluster dissimilarity. Compute all pairwise dissimilarities<br/> between the observations in cluster A and the observations in cluster B, and<br/> record the **largest** of these dissimilarities                                                                                                                            | - More commonly used<br/>- Tends to be balanced         |
| Single   | Minimal intercluster dissimilarity. Compute all pairwise dissimilarities<br/> between the observations in cluster A and the observations in cluster B, and<br/> record the **smallest** of these dissimilarities. Single linkage can<br/> result in extended, trailing clusters in which single observations are<br/> fused one-at-a-time. | - Less commonly used<br/>- Extended trailing clusters   |
| Average  | Mean intercluster dissimilarity. Compute all pairwise dissimilarities between<br/> the observations in cluster A and the observations in cluster B,<br/> and record the **average** of these dissimilarities                                                                                                                               | - More commonly used<br/>- Tends to be balanced         |
| Centroid | Dissimilarity between the centroid for cluster A (a mean vector of length $p$)<br/> and the centroid for cluster B. Centroid linkage can result in undesirable<br/> **inversions**.                                                                                                                                                        | - Less commonly used<br/>- Although popular in Genomics |
![linkages](linkages.png)
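
The merge loop and three of the linkages from the table can be sketched as a naive agglomerative procedure. This is an illustrative sketch only: it recomputes every cluster distance at each step (fine for a handful of points, far too slow for real data), uses Euclidean distance as the dissimilarity, and runs on made-up points. The names `agglomerate`, `cluster_distance`, and `LINKAGES` are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


LINKAGES = {
    "complete": max,                        # largest pairwise dissimilarity
    "single": min,                          # smallest pairwise dissimilarity
    "average": lambda d: sum(d) / len(d),   # mean pairwise dissimilarity
}


def cluster_distance(points, a, b, linkage):
    """Dissimilarity between clusters a and b, each a list of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in a for j in b]
    return LINKAGES[linkage](pairwise)


def agglomerate(points, linkage="complete"):
    """Naive agglomerative clustering; returns a list of (cluster_a, cluster_b, height)."""
    clusters = [[i] for i in range(len(points))]  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(points, pair[0], pair[1], linkage),
        )
        height = cluster_distance(points, a, b, linkage)  # height of fusion
        merges.append((a, b, height))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges


# Toy data (made up): two tight pairs and one distant point.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (12.0, 0.0)]
for a, b, h in agglomerate(pts, "complete"):
    print(f"merge {a} + {b} at height {h:.2f}")
```

The recorded heights are what a dendrogram draws on its vertical axis; cutting at a given height keeps only the merges below it, which is how a single run yields anywhere from 1 to $n$ clusters.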

4) Clustering Intuitions
* Is it important to standardize features?
    * Yes, most likely
    * How do we deal with categorical features?
* Outliers can be problematic
    * Especially using squared Euclidean as a distance metric
    * What if small subset of observations is very different from all others?
        * K-means and hierarchical clustering **force** every data point into a cluster, potentially **distorting** clusters
        * Mixture models (a form of **soft clustering**) are an attractive alternative as they can accommodate outliers
* Generally **not** very robust
    * Can test by clustering subsets of data
* Clustering is a simple, elegant method, but can be problematic in a lot of ways
    * Only intended for **quantitative** features (think centroid calculation for categorical data) and squared **Euclidean** distance (which is not robust to outliers)
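
The sensitivity of squared Euclidean distance to outliers can be seen with a tiny one-dimensional example. The values below are made up, and `centroid_1d` and `squared_scatter` are hypothetical helper names; the point is only to show how much a single extreme observation moves the centroid and dominates the within-cluster scatter.

```python
def centroid_1d(values):
    """Mean of a non-empty list of numbers (the centroid in one dimension)."""
    return sum(values) / len(values)


def squared_scatter(values):
    """Squared deviation from the centroid, one term per observation."""
    center = centroid_1d(values)
    return [(v - center) ** 2 for v in values]


cluster = [1.0, 2.0, 3.0, 4.0]
with_outlier = cluster + [100.0]

print(centroid_1d(cluster))        # centroid of the tight group
print(centroid_1d(with_outlier))   # centroid pulled toward the outlier

terms = squared_scatter(with_outlier)
print(max(terms) / sum(terms))     # share of the total scatter due to the outlier
```

One point out of five accounts for most of the squared scatter here, which is why a centroid-based method can end up shaping a cluster around a single extreme observation.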

5) Alternative Clustering Methods
* **K-medoids** - minimizes pairwise dissimilarities and chooses one of the data points as center, or "medoid"
    ![kmedoids](https://i.stack.imgur.com/wBlqF.png)
    * (-) Computationally more intensive (large proximity matrix computation)
    * (+) Handles **categorical features** more naturally (though still must define distance metric for mixed data carefully), and **more robust to outliers**
* **DBSCAN** - density-based spatial clustering of applications with noise
    ![dbscan](https://camo.githubusercontent.com/08d18a2ecf4bc19cc496c73867f540e4e54334c5/68747470733a2f2f64337676366c703535716a6171632e636c6f756466726f6e742e6e65742f6974656d732f30483068336932773176307a33653136306c31462f64627363616e2d7072696e6369706c652e706e673f582d436c6f75644170702d56697369746f722d49643d31303934343231)
    * Two parameters (number of clusters not specified)
        * $\epsilon$ - maximum distance between two points for them to be connected
        * $minPts$ - minimum number of connected points for a point to be a "core" point
        * A cluster is all connected core points, plus others within $\epsilon$ of one of those. Other points are noise.
    * very popular clustering algorithm
    * groups together closely packed points, and marks points in low-density regions as outliers
* Distribution-based clustering:
    ![dbscan_2](dbscan_2.png)
    * Assume clusters follow some (generally Gaussian) distribution
    * Find distributions with the **maximum likelihood** to produce this result
    * Except, don't know which point is part of which cluster, so need to add some hidden variables and follow an **expectation-maximization (EM)** algorithm
* Example: Cluster similar shoppers to show items and ads they'll like

| Shopper | Computers | Keyboards | Peanut Butter | Oreos |
|:-------:|:---------:|:---------:|:-------------:|:-----:|
| Aditi   | 1         | 2         | 0             | 0     |
| Rohit   | 0         | 0         | 30            | 50    |
| Aaron   | 0         | 0         | 50            | 50    |
| Jia     | 0         | 0         | 0             | 1     |
| Jack    | 2         | 4         | 10            | 20    |
| William | 3         | 6         | 0             | 0     |
| ...     | ...       | ...       | ...           | ...   |
* Who is Aditi most "similar" to in Euclidean distance?
* Who is Jack most "similar" to? 
    * Do we care more about selling a jar of Peanut Butter or a Computer?
* What can we do so that distance isn't just based on Peanut Butter and Oreos?
    * But, William is still far from Aditi
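
The questions above can be explored numerically. The sketch below uses only the six shoppers listed in the table (the elided rows are unknown, so results could differ with the full data), Euclidean distance, and z-score standardization with the population standard deviation, which is one assumed answer to the scaling question. The helper names `nearest` and `standardize` are hypothetical.

```python
import math

# Columns: Computers, Keyboards, Peanut Butter, Oreos (rows copied from the table).
shoppers = {
    "Aditi":   (1, 2, 0, 0),
    "Rohit":   (0, 0, 30, 50),
    "Aaron":   (0, 0, 50, 50),
    "Jia":     (0, 0, 0, 1),
    "Jack":    (2, 4, 10, 20),
    "William": (3, 6, 0, 0),
}


def euclidean(a, b):
    """Euclidean distance between two purchase vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name, table):
    """Name of the other shopper closest to `name` in Euclidean distance."""
    return min(
        (other for other in table if other != name),
        key=lambda other: euclidean(table[name], table[other]),
    )


def standardize(table):
    """Z-score each column using its mean and population standard deviation."""
    columns = list(zip(*table.values()))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return {
        name: tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for name, row in table.items()
    }


print("raw:    Aditi ->", nearest("Aditi", shoppers))
print("raw:    Jack  ->", nearest("Jack", shoppers))

scaled = standardize(shoppers)
print("scaled: Aditi ->", nearest("Aditi", scaled))
print("scaled: Aditi to William =", euclidean(scaled["Aditi"], scaled["William"]))
print("scaled: Aditi to Jia     =", euclidean(scaled["Aditi"], scaled["Jia"]))
```

On the raw counts the grocery columns dominate every distance because their values are so much larger. Standardizing puts the columns on a comparable scale, but William, who buys the same products as Aditi in larger quantities, still ends up farther from Aditi than Jia does.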