\section{Machine learning: unsupervised learning}
In the supervised learning setting, we typically have access to a set of $p$ features $X_1, X_2, \dots, X_p$ measured on $n$ observations, and a response $Y$ also measured on those same $n$ observations. The goal is then to predict $Y$ using $X_1, X_2, \dots, X_p$.

In unsupervised learning, there is no response variable.

\subsection{The challenge of unsupervised learning}
Unsupervised learning is more challenging than supervised learning. It tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. In unsupervised learning, there is no way to check our work because we do not know the true answer. 

\subsection{Principal components analysis}
Principal components analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data.

\subsubsection{What are principal components?}
When $p$ is large, examining two-dimensional scatterplots of all ${p \choose 2}$ pairs of features is impractical, so a better method is required to visualize the $n$ observations. The first principal component of a set of features $X_1, X_2, \dots, X_p$ is the normalized linear combination of the features
\[
Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p
\]
that has the largest variance. By normalized, we mean that $\sum_{j=1}^p \phi_{j1}^2 = 1$. We refer to the elements $\phi_{11}, \dots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})^T$. 

There is a nice geometric interpretation of the first principal component. The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \dots, \phi_{p1}$ defines a direction in the feature space along which the data vary the most. If we project the $n$ data points $x_1, \dots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \dots, z_{n1}$.
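
To make the definition concrete, and assuming each feature has been centered to have mean zero, the first loading vector can be written as the solution of the optimization problem
\[
\max_{\phi_{11}, \dots, \phi_{p1}} \left\{ \frac{1}{n}\sum_{i=1}^n \bigg(\sum_{j=1}^p \phi_{j1}x_{ij}\bigg)^2 \right\} \quad \mbox{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1,
\]
where $z_{i1} = \sum_{j=1}^p \phi_{j1}x_{ij}$ is the score of the $i$th observation on the first principal component. Because the features are centered, the scores also have mean zero, so the objective is exactly their sample variance.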

Once we have computed the principal components, we can plot them against each other in order to produce low-dimensional views of the data. 

A biplot displays both the principal component scores and the principal component loadings in a single plot.

If two original variables have similar loadings, so that their loading vectors point in nearly the same direction in the biplot, they are correlated with each other.

\subsubsection{Another interpretation of principal components}
Together the first $M$ principal component score vectors and the first $M$ principal component loading vectors provide the best $M$-dimensional approximation (in terms of Euclidean distance) to the $i$th observation, whose elements are $x_{ij}$.
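
Written out, and assuming the data matrix has been column-centered, this approximation takes the form
\[
x_{ij} \approx \sum_{m=1}^M z_{im}\phi_{jm},
\]
where $z_{im}$ is the score of the $i$th observation on the $m$th component and $\phi_{jm}$ is the $j$th loading of that component. When $M = \min(n-1, p)$ the representation is exact, that is, $x_{ij} = \sum_{m=1}^M z_{im}\phi_{jm}$.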

\subsubsection{More on PCA}
Before PCA is performed, the variables should be centred to have mean zero. Furthermore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled. 
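
One common way to standardize can be sketched as follows, with $\bar{x}_j$ and $s_j$ denoting the sample mean and standard deviation of the $j$th variable (notation introduced only for this illustration):
\[
\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}, \qquad s_j^2 = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2.
\]
After this transformation every variable has mean zero and variance one, so no variable dominates the principal components merely because it is measured in larger units. If the variables are already measured in the same units, one may prefer to center them without scaling.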

Each principal component loading vector is unique, up to a sign flip.

We are interested in knowing the proportion of variance explained (PVE) by each principal component. Assuming the variables have been centered to have mean zero, the total variance in a data set is defined as
\[
\sum_{j=1}^p \mathrm{Var}(X_j) = \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2
\]
and the variance explained by the $m$th principal component is
\[
\frac{1}{n}\sum_{i=1}^n z_{im}^2 = \frac{1}{n}\sum_{i=1}^n \bigg(\sum_{j=1}^p \phi_{jm}x_{ij}\bigg)^2
\]
Therefore, the PVE of the $m$th principal component is given by
\[
\frac{\sum_{i=1}^n (\sum_{j=1}^p \phi_{jm}x_{ij})^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2}
\]
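
As a small illustration with made-up numbers, suppose $p = 3$ centered variables have total variance $3$, and the three principal components explain variances $2$, $0.75$, and $0.25$ respectively. Writing $\mathrm{PVE}_m$ for the PVE of the $m$th component (a label used only here), we get
\[
\mathrm{PVE}_1 = \frac{2}{3} \approx 0.667, \qquad \mathrm{PVE}_2 = \frac{0.75}{3} = 0.25, \qquad \mathrm{PVE}_3 = \frac{0.25}{3} \approx 0.083.
\]
The PVEs are nonnegative and sum to one over all components, so the cumulative PVE of the first two components in this example is about $0.917$.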

\subsection{Clustering methods}
Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other. 

PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance. Clustering looks to find homogeneous subgroups among the observations.

The two best-known clustering approaches are $K$-means clustering and hierarchical clustering.

\subsubsection{K-means clustering}
To perform $K$-means clustering, we must specify the desired number of clusters $K$; then the $K$-means algorithm will assign each observation to exactly one of the $K$ clusters. The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.
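
One common way to make ``within-cluster variation'' precise, assuming squared Euclidean distance is the measure used, is to write $C_1, \dots, C_K$ for the sets of observation indices in each cluster (notation introduced here for illustration) and define
\[
W(C_k) = \frac{1}{|C_k|}\sum_{i, i' \in C_k}\sum_{j=1}^p (x_{ij} - x_{i'j})^2,
\]
where $|C_k|$ denotes the number of observations in the $k$th cluster. Under this choice, $K$-means clustering seeks the partition that solves
\[
\min_{C_1, \dots, C_K} \left\{ \sum_{k=1}^K W(C_k) \right\}.
\]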

A very simple algorithm can be shown to provide a local optimum (a pretty good solution) to the $K$-means optimization problem.
\begin{enumerate}
    \item Randomly assign a number, from 1 to $K$, to each of the observations. These serve as initial cluster assignments for the observations.
    \item Iterate until the cluster assignments stop changing:
    \begin{enumerate}
        \item For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster.
        \item Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
    \end{enumerate}
\end{enumerate}
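
Why the iterations above cannot increase the within-cluster variation can be sketched with the following identity. It assumes the variation is measured by squared Euclidean distance, and writes $C_k$ for the set of observations in cluster $k$ and $\bar{x}_{kj} = \frac{1}{|C_k|}\sum_{i \in C_k} x_{ij}$ for the mean of feature $j$ in that cluster (notation introduced here for illustration):
\[
\frac{1}{|C_k|}\sum_{i, i' \in C_k}\sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2\sum_{i \in C_k}\sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2.
\]
For fixed assignments, the cluster means minimize the right-hand side over all choices of center, and reassigning each observation to its closest centroid can only reduce its squared distance to a center. The objective therefore never increases from one iteration to the next, which is why the algorithm converges, though only to a local optimum.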

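Because the algorithm finds a local rather than a global optimum, the result depends on the initial random cluster assignment. It is therefore important to run the algorithm multiple times from different random initial configurations and to select the solution with the smallest within-cluster variation.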

\subsubsection{Hierarchical clustering}
One potential disadvantage of K-means clustering is that it requires us to pre-specify the number of clusters $K$. In contrast, hierarchical clustering results in a tree-based representation of the observations, called a dendrogram.

The most common type of hierarchical clustering is bottom-up, or agglomerative, clustering.

We begin by defining some sort of dissimilarity measure between each pair of observations. Starting out at the bottom of the dendrogram, each of the $n$ observations is treated as its own cluster. The two clusters that are most similar to each other are then fused so that there now are $n-1$ clusters. 
\begin{enumerate}
    \item Begin with $n$ observations and a measure (such as Euclidean distance) of all the ${n \choose 2}$ pairwise dissimilarities. Treat each observation as its own cluster.
    \item For $i = n, n-1, \dots, 2$:
    \begin{enumerate}
        \item Examine all pairwise inter-cluster dissimilarities among the $i$ clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
        \item Compute the new pairwise inter-cluster dissimilarities among the $i-1$ remaining clusters.
    \end{enumerate}
\end{enumerate}

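The dissimilarity between two groups of observations is defined by the linkage. Four common types of linkage are:
\begin{itemize}
    \item Complete: maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster $A$ and the observations in cluster $B$, and record the largest.
    \item Single: minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster $A$ and the observations in cluster $B$, and record the smallest.
    \item Average: mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster $A$ and the observations in cluster $B$, and record the average.
    \item Centroid: dissimilarity between the centroid for cluster $A$ and the centroid for cluster $B$.
\end{itemize}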
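
In symbols, writing $d(x_i, x_{i'})$ for the chosen dissimilarity between observations $i$ and $i'$ and $|A|$, $|B|$ for the cluster sizes (notation assumed here for illustration), complete, single, and average linkage between clusters $A$ and $B$ can be sketched as
\[
d_{\mathrm{complete}}(A, B) = \max_{i \in A,\, i' \in B} d(x_i, x_{i'}), \qquad
d_{\mathrm{single}}(A, B) = \min_{i \in A,\, i' \in B} d(x_i, x_{i'}),
\]
\[
d_{\mathrm{average}}(A, B) = \frac{1}{|A|\,|B|}\sum_{i \in A}\sum_{i' \in B} d(x_i, x_{i'}).
\]
As a small made-up example, take three observations on a line at $0$, $1$, and $4$ with Euclidean distance. The pairwise dissimilarities are $1$, $3$, and $4$, so the first fusion joins $\{0, 1\}$ at height $1$ under any of these linkages. The remaining fusion with $\{4\}$ occurs at height $\max(4, 3) = 4$ under complete linkage, $\min(4, 3) = 3$ under single linkage, and $(4 + 3)/2 = 3.5$ under average linkage. Centroid linkage also gives $3.5$ here, since the centroid of $\{0, 1\}$ is $0.5$.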

Correlation-based distance considers two observations to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance. 
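
One common form of correlation-based distance, given here as an assumption since the text does not fix a formula, is $d(x_i, x_{i'}) = 1 - r_{ii'}$, where $r_{ii'}$ is the sample correlation between the feature profiles of observations $i$ and $i'$:
\[
r_{ii'} = \frac{\sum_{j=1}^p (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'})}{\sqrt{\sum_{j=1}^p (x_{ij} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^p (x_{i'j} - \bar{x}_{i'})^2}}, \qquad \bar{x}_i = \frac{1}{p}\sum_{j=1}^p x_{ij}.
\]
For instance, the profiles $(1, 2, 3)$ and $(10, 20, 30)$ are far apart in Euclidean distance but have $r_{ii'} = 1$, so their correlation-based distance under this definition is $0$.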

\subsubsection{Practical issues in clustering}
Should the observations or features first be standardized in some way? Maybe the variables should be centered to have mean zero and scaled to have standard deviation one. 

Hierarchical clustering: what dissimilarity measure should be used? What type of linkage should be used? Where should we cut the dendrogram in order to obtain clusters?


