###### Dimension Reduction and Clustering
The goal of this notebook is to explore a select few methods of dimension reduction and clustering methods that form a basis for you to apply to different methods. In particular, we will examine PCA and NMF as dimension reduction methods followed by K-means and NMF (again, yes) as very practical methods of clustering the data. \
\
PCA: This method is very intuitive. Let us start with a high-level explanation. Let us say that we would like to summarise what goes on during Halloween in the US. We start with a massive data set consisting of the prices spent on say grilling, candy, the candy dietary information, costumes, movies, etc. The goal with PCA is to simply reduce the number of features as many overlap and correlate with one another to explain the most variance of the Halloween data. If that insulted your intelligence, then let us finally explain the inner-workings of PCA starting with covariance matrices. Let us start with a data matrix $X\in\mathbb{R}^{n\times p}$ where $n$ denotes the number of observations and $p$ denotes the number of features we observe per observation. We can think of the features as being pulled from some data-generating process that has a joint distribution specified over $\mathbb{R}^p$ i.e. each of our observations maybe follows a multivariate normal. Given we have a sample of the distribution, we can estimate the mean by the sample mean:

$$\hat{\mu}=\frac{1}{n}X^T\begin{bmatrix}1\\\vdots\\1\end{bmatrix}.$$
All we did above is take the average over the $n$ observations for each column. We can similarly compute the sample covariance by subtracting off the column means and multiplying the matrix by its transpose:

$$\hat{\sigma^2_X}=\left(X-\frac{1}{n}\begin{bmatrix}1&\cdots&1\\\vdots&\vdots&\vdots\\1&\cdots&1\end{bmatrix}X\right)\left(X-\frac{1}{n}\begin{bmatrix}1&\cdots&1\\\vdots&\vdots&\vdots\\1&\cdots&1\end{bmatrix}X\right)^T.$$\
This matrix is trivially symmetric - letting $Z$ denote the first term yields $(ZZ^T)^T=ZZ^T$. The matrix is also positive semi-definite. Recall that a semi-definite positive matrix is a Hermitian matrix with positive eigenvalues. By the spectral theorem, we know that proving all eigenvalues are positive is equivalent to showing that for any $x\in\mathbb{R}^n$, $x^T(ZZ^T)z>0$. Prove this! Thus let $x\in\mathbb{R}^n$, then

$$x^TZZ^Tx=(xZ)^2\geq 0$$ which proves that the covariance matrix is semi-positive definite. This implies  
\
In simple terms, the goal in clustering is to subset the data into partitions that have meaning. Within these clusters, we sometimes are also interested in the intersection of clusters - perhaps the clusters are not disjoint and share a continuum of information (linguistic patterns while separable, are often overlapping for example). Clustering hinges on the measure of similarity we use based off domain knowledge to group objects. \
\
As an example of a measure of similarity used correctly, consider the problem of deciphering linguistic trends from survey data. The data would consist of responses labeled a through say i with choices of sayings for the participant to select. One method that would not make sense is if we performed one-hot encoding and utilized like the Euclidean or the Manhattan distance. Why? Well, let us say one of the responses is 1) you all, 2) you guys, 3) yinglings, and 4) ya'll. Clearly the first two are much more similar to each other than the latter two. However, with the Euclidean distance, we are treating the disimilarity between these guys as the same which is preposterous. A much better approach would be to either consult with linguistics to develop some metric based on maybe phonetic differences or apply string dissimilarity measures between the questions - like Jaccard, Cosine, etc.\
\
Most clustering algorithms (aside from association rules and hierarchical clustering) rely on the construction of what is known as a dissimilarity matrix. To quote ESL on the importance of the dissimilarity measure, "Specifying an appropriate dissimilary measure is far more important in obtaining success with clustering than choice of clustering algorithm". Let $X$ denote our $N\times p$ data matrix where $N$ represents the number of observations and $p$, the number features. Then we express the dissimilarity between vectors by $$D(x_i,x_j)=\sum\limits_{k=1}^p w_id(x_{ik},x_{jk})$$ where $\sum w_i=1$ usually. The choice of weightings for the $k$th attribute is again reliant heavily on domain knowledge rather than some omniscient mathematical algorithm. If $w_i=1$, then our terms do not necessarily equally contribute to the dissimilarity. If the scales are not the same, then we would have to set $w_i=N^2\times(\sum d(x_{ik},x_{jk}))$ to ensure equal weighting. This is not often desired as for instance in the case of home prices - square footage most likely plays a larger role in forming the price than the number of washers and dryers the home holds (albeit a weak correlation between the two).\
\
As pointed above, clustering methods are usually less important than the metric. There are two primary classes of clustering methods: 1) combinatorial algorithms which use the observed data without assuming a probability model (non-Bayesian), and 2) mixture modeling which supposes the data is i.i.d. from some population described by a predetermined density function characterized by mixtures of densities. \
\
Combinatorial Approach: We pre-specify a number of clusters $k\in\{2,\dots,N-1\}$ and attempt to develop a many-to-one encoder $k=C(i)$ which assigns the $i$th observation vector to one of the clusters $1,\dots, k$. The encoder is generally found by means of minimizing a loss function as in supervised learning. A natural first choice is to minimize the within-cluster distance over all features $1,\dots,K$ with $w_i=1$: $$\sum\limits_{C(i=k),C(j)=k} D(x_i,x_j).$$ To enumerate all possible assignments of $N$ points to $k$ clusters is given by $\frac{1}{k!}\sum (-1)^{k-i}\binom{k}{i}i^N$ which provides computationally infeasible for $N=30$. Instead, prototypical methods involve what is known as iterative greedy descent.\
\
K-means: K-means is arguably the most popular of the iterative greedy descent clustering methods. What are the assumptions we need to understand before performing K-means? We need to choose $K$, the number of clusters to be created - as we are usually interested in unsupervised clustering methods, this seems a bit out of the ordinary and adds a level of arbitrariness and instability to the cluster results. Another assumption is that we require the variables to be continuous. This essentially boils down to being able to establish different relative distances to compare and cluster (for categorical data, we cannot clearly establish a difference between options 1 and 7 on the basis of the numbers). Usually, if one has categorical data, we perform one-hot encoding (mapping to columns of zeros and ones) to try to remedy the problem - however in the process, we are assuming uniform differences between the options which often times is what we don't want. K-means works when the clusters we output are spherical - as an example where K-means would fail, consider the points assigned on an archery target which is a collection of concentric circles - K-means wouldn't not be able to separate the rings. Another assumption is that the clusters are relatively similar in size (if we cluster linguistic patterns in the US and add say Tokyo to the mix, despite there being a large difference, for small $K$, K-means wouldn't separate the two correctly). Now most assumptions out of the way, let us go through the algorithm. We assume the data is continuous and thus we typically define the squared Euclidean distance as the dissimilarity measure. \
\
Intermezzo: ESL Ex. 14.1: Show that the weighted Euclidean distance

$$d^w(x_i,x_j)=\frac{\sum\limits_{l=1}^p w_l(x_{il}-x_{jl})^2}{\sum\limits_{l=1}^p w_l}$$ 

is equivalent to the unweighted Euclidean distance based on $z$ where

$$z_{il}=x_{il}\cdot \left(\frac{w_l}{\sum\limits_{l=1}^p w_l}\right)^{1/2}.$$

Answer: Starting with the unweighted Euclidean distance, we trivially plug in the equality and use algebra to get the LHS: $$d^{uw}(z_i,z_j)=\sum\limits_{l=1}^p \left(x_{il}\times\left(\frac{w_l}{\sum\limits_{l=1}^p w_l}\right)^{1/2}-x_{jl}\times\left(\frac{w_l}{\sum\limits_{l=1}^p w_l}\right)^{1/2}\right)^2=\sum\limits_{l=1}^p \frac{w_l}{\sum\limits_{l=1}^p w_l}\left(x_{il}-x-{jl}\right)^2=d^w(x_i,x_j).$$

Back to the algorithm, K-means attempts to minimize the within-point scatter which is equivalent to minimizing the distance from the centroids: 

$$W(C)=\sum\limits_{k=1}^K\sum\limits_{C(i)=k}\sum\limits_{C(j)=k}d(x_i,x_j)=\sum\limits_{k=1}^K \sum\limits_{C(i)=k}\left(\sum\limits_{j=1}^p (x_{ij}-x_{k_1j})^2+\cdots +\sum\limits_{j=1}^p (x_{ij}-x_{k_lj})^2\right)$$ 

$$=\sum\limits_{k=1}^K\sum\limits_{C(i)=k}\left(\sum\limits_{j=1}^p nx_{ij}^2-2x_{ij}\sum\limits_{m=1}^l x_{k_mj}+\sum\limits_{m=1}^l x_{k_mj}^2\right)=\sum\limits_{k=1}^K \sum\limits_{i=1}^N \mathbb{I}_{C(i)=k}\sum\limits_{C(i)=k} \lVert x_i-\bar{x}_k\rVert^2$$

where $\bar{x}_k$ is the centroid for cluster $k$. The algorithm is then characterized by:
1) For an encoder $C$, the total cluster variance, $$\min\limits_{C,m_1,\dots,m_K} \sum\limits_{k=1}^K N_k\sum\limits_{C(i)=k}\lVert x_i-m_k\rVert^2,$$ where $N_k$ is the number of points in cluster $k$ is minimized with respect to the centroids $m_1,\dots,m_K$.

2) Given the current set of means $m_1,\dots,m_k$, we minimize the total cluster variance by assigning each observation to the closest current cluster mean. That is, $$C(i)=\arg\min\limits_{k\leq K} \lVert x_i-m_k\rVert^2.$$

3) Repeat 1 and 2 until assignments do not change.

This algorithm is highly susceptible to getting stuck in a local minimum. As a result, many stability methods must be considered - for example:
* Test out many different initializations
* Add noise to the data (I don't think it has to be Gaussian per say)
* Some combination of the above and/or the bootstrap: https://people.eecs.berkeley.edu/~jordan/sail/readings/luxburg_ftml.pdf


Intermezzo 2: ESL Problem 14.6: Write programs to implement K-means clustering and a self-organizing map (SOM), with the prototype lying on a two-dimensional grid. 


In [75]:
import numpy as np
import pandas as pd

def KMeans(X, k, mediods = np.array([]), tol = 1e-4, max_iter = 100):
    """
    :param: X: dataframe
    :param: k: choice of cluster, integer
    """
    # initialize membership dataframe
    mediod_membership = np.zeros(len(X.index))

    # initialize mediods (what exactly is this - numbers or vectors??)
    if mediods.shape != (k, len(X.columns)):
        mediods = np.random.normal(0, 2, size = (k, len(X.columns)))
    
    # initialize the param for deciding convergence of the mediods and iteration number
    mediod_dist = np.ones(len(X.index))
    iter = 0
    
    # here we perform the k-means algorithm - looping until max_inter or convergence
    while np.mean(np.abs(mediod_dist)) > tol and iter <= max_iter:
        # Membership Update: assign the points to the clostest clusters
        for i in range(len(X.index)):
            # compute distances to mediods
            distances = []
            for j in range(k):
                distances.append(np.linalg.norm(X.iloc[i,] - mediods[j,]))
                
            # take minimum distance as new membership for i
            new_dist = distances.index(min(distances))
            # update distance from the previous centroid
            mediod_dist[i] = new_dist - mediod_membership[i]
            mediod_membership[i] = new_dist
            print(new_dist)
        
        # Centroid Update: mean of points under each membership is the new mediod 
        for i in range(k):
            indices = [index for index, element in enumerate(distances) if element == i]
            mediods[i] = X.iloc[indices,].mean(axis = 0)        
            
    return [mediod_membership, mediods]    
     
# get in the housing data - only the numeric columns can really be justified with the euclidean distance
df = pd.read_csv("/Users/marko/MJDR/texas_counties1.csv")[["sales", "dollar_vol", "avg_price", "med_price", "total_listings", "month_inventory"]]
df = df.replace(',', '', regex=True)
df = df.apply(pd.to_numeric, errors='ignore')

# run K-means on the housing data
res = KMeans(X = df, k = 3, tol = 1e-3, max_iter = 50)

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


In [74]:
df = pd.read_csv("/Users/marko/MJDR/texas_counties1.csv")[["sales", "dollar_vol", "avg_price", "med_price", "total_listings", "month_inventory"]]
df = df.replace(',', '', regex=True)
df = df.apply(pd.to_numeric, errors='ignore')

res

[array([0., 0., 0., ..., 0., 0., 0.]),
 array([[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]])]

Sources:
* http://www.columbia.edu/~jwp2128/Teaching/W4721/Spring2017/slides/lecture_4-6-17.pdf
* https://www.cs.columbia.edu/~djhsu/AML/lectures/notes-pca.pdf
* https://people.math.harvard.edu/~knill/teaching/math22b2019/handouts/lecture17.pdf
* http://pillowlab.princeton.edu/teaching/statneuro2018/slides/notes05_PCA2.pdf
* https://cs.nju.edu.cn/_upload/tpl/01/0b/267/template267/zhouzh.files/course/dm/reading/reading03/fodor_techrep02.pdf
*

SVD:
* http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)_MVsa_.pdf