# Dimension Reduction and unsupervised learning

I was hoping that Jarvist would teach you about dimensionality reduction last week, and I am no great expert in unsupervised learning. This means that I have taken much of this from other sources, mostly {homl}: I had to write about one topic in a hurry, and for the other topic this is as good a source as any. As I have said before, I recommend {homl} as a very good practical guide.

# Dimensionality Reduction

So far we have only looked at data sets with a relatively small number of features, with MNIST being the largest at 784 features. In many cases you will have data sets with thousands and possibly millions of features. You have already seen that models built on data sets with many features train much more slowly than those built on data with just a few features (just compare the iris data with the MNIST data). When you get to data sets with a great many features training can be very slow.

But there are other reasons as well. Even if you scale all your data so that every feature lies between 0 and 1, as you add features you are adding dimensions to the hyperspace that you need to fill and characterise/model. If you pick two points at random from a unit square (i.e. a two-feature space) their separation is, on average, $\approx$0.52. If you go to a 3D cube it grows to $\approx$0.66, but if you go to a million features, i.e. a 1000000D hypercube, it has grown to $\approx$408.25. This means that your model is now training on very sparse data which may not be representative. The obvious answer is to use more data to train the model. However (according to {homl}) you would need more data than there are atoms in the observable universe to have an average separation of 0.1 for just 100 dimensions (I am not sure how they calculated the actual number of atoms in the observable universe). This is sometimes called [*the curse of dimensionality*](https://en.wikipedia.org/wiki/Curse_of_dimensionality).
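
If you want to check these separations yourself, here is a minimal Monte Carlo sketch. The function name `mean_pair_distance` and the sample sizes are my own choices for illustration, not anything from {homl}. For large numbers of dimensions the mean separation approaches $\sqrt{d/6}$, because each coordinate contributes an expected squared difference of 1/6.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs=2000, seed=0):
    """Estimate the mean distance between two uniform random points in a unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))  # first point of each pair
    b = rng.random((n_pairs, n_dims))  # second point of each pair
    return np.linalg.norm(a - b, axis=1).mean()

for n_dims in (2, 3, 100, 1000):
    estimate = mean_pair_distance(n_dims)
    large_d_limit = np.sqrt(n_dims / 6)  # only a good approximation for large n_dims
    print(n_dims, round(estimate, 2), round(large_d_limit, 2))
```

The million-dimensional case is left out only because the random arrays would not fit comfortably in memory; $\sqrt{1000000/6}$ gives the 408.25 quoted above.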

So some form of dimensionality reduction can often be very helpful. However, in any such reduction you will lose some information and you want to minimise this. A good way to approach this is to try to retain the maximum variance in your reduced data set. That way you will (generally) lose the smallest amount of discrimination.


## Principal Component Analysis

While there are many different forms of dimensionality reduction PCA is by far the most common and so it is the only one that we will really cover. In PCA your data are projected along axes that retain the greatest variance.

Consider the diagrams below -- don't worry about code that generates them, this is taken directly from {homl}

In [None]:
# first some basics
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt


In [None]:


angle = np.pi / 5
stretch = 5
m = 200

np.random.seed(3)
X = np.random.randn(m, 2) / 10
X = X.dot(np.array([[stretch, 0],[0, 1]])) # stretch
X = X.dot([[np.cos(angle), np.sin(angle)], [-np.sin(angle), np.cos(angle)]]) # rotate

u1 = np.array([np.cos(angle), np.sin(angle)])
u2 = np.array([np.cos(angle - 2 * np.pi/6), np.sin(angle - 2 * np.pi/6)])
u3 = np.array([np.cos(angle - np.pi/2), np.sin(angle - np.pi/2)])

X_proj1 = X.dot(u1.reshape(-1, 1))
X_proj2 = X.dot(u2.reshape(-1, 1))
X_proj3 = X.dot(u3.reshape(-1, 1))

plt.figure(figsize=(8,4))
plt.subplot2grid((3,2), (0, 0), rowspan=3)
plt.plot([-1.4, 1.4], [-1.4*u1[1]/u1[0], 1.4*u1[1]/u1[0]], "k-", linewidth=1)
plt.plot([-1.4, 1.4], [-1.4*u2[1]/u2[0], 1.4*u2[1]/u2[0]], "k--", linewidth=1)
plt.plot([-1.4, 1.4], [-1.4*u3[1]/u3[0], 1.4*u3[1]/u3[0]], "k:", linewidth=2)
plt.plot(X[:, 0], X[:, 1], "bo", alpha=0.5)
plt.axis([-1.4, 1.4, -1.4, 1.4])
plt.arrow(0, 0, u1[0], u1[1], head_width=0.1, linewidth=5, length_includes_head=True, head_length=0.1, fc='k', ec='k')
plt.arrow(0, 0, u3[0], u3[1], head_width=0.1, linewidth=5, length_includes_head=True, head_length=0.1, fc='k', ec='k')
plt.text(u1[0] + 0.1, u1[1] - 0.05, r"$\mathbf{c_1}$", fontsize=22)
plt.text(u3[0] + 0.1, u3[1], r"$\mathbf{c_2}$", fontsize=22)
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", fontsize=18, rotation=0)
plt.grid(True)

plt.subplot2grid((3,2), (0, 1))
plt.plot([-2, 2], [0, 0], "k-", linewidth=1)
plt.plot(X_proj1[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.gca().get_xaxis().set_ticklabels([])
plt.axis([-2, 2, -1, 1])
plt.grid(True)

plt.subplot2grid((3,2), (1, 1))
plt.plot([-2, 2], [0, 0], "k--", linewidth=1)
plt.plot(X_proj2[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.gca().get_xaxis().set_ticklabels([])
plt.axis([-2, 2, -1, 1])
plt.grid(True)

plt.subplot2grid((3,2), (2, 1))
plt.plot([-2, 2], [0, 0], "k:", linewidth=2)
plt.plot(X_proj3[:, 0], np.zeros(m), "bo", alpha=0.3)
plt.gca().get_yaxis().set_ticks([])
plt.axis([-2, 2, -1, 1])
plt.xlabel("$z_1$", fontsize=18)
plt.grid(True)


plt.show()

This shows the projection of the data along the axes shown. When taking the PCA the first component is the one with the greatest variance -- in this case **C1**. The second principal component is the one with the greatest remaining variance which is **C2**. **C2** is perpendicular to **C1**. If we had more dimensions then we could define more vectors.

So how do we find these? Well, there is a well-known linear algebra technique for factorising matrices called *Singular Value Decomposition* (SVD).

Numpy has a function which will return these (although the input must be centered around zero)

In [None]:
X_centered = X-X.mean(axis=0)

U,s, Vh=np.linalg.svd(X_centered) # docs at https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
c1=Vh.T[:,0]
c2=Vh.T[:,1]

## Projecting Down to d Dimensions

Once the principal components have been found you can project down onto a d-dimensional hyperplane defined by the first d principal components. This means that as little variance as possible is lost.
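
As a rough sketch of what that projection looks like in numpy: you take the first d rows of `Vh` as columns of a matrix and multiply the centred data by it. The toy 3D data set below (`X_toy`, built from a made-up `mixing` matrix) is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 3D that lie close to a 2D plane
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
mixing = np.array([[1.0, 0.5, 0.2],
                   [-0.3, 1.0, 0.4]])
X_toy = latent @ mixing + rng.normal(scale=0.05, size=(200, 3))

# PCA via SVD needs centred data
X_toy_centered = X_toy - X_toy.mean(axis=0)
U, s, Vh = np.linalg.svd(X_toy_centered, full_matrices=False)

d = 2
W_d = Vh[:d].T                     # shape (3, d): first d principal components as columns
X_toy_2d = X_toy_centered @ W_d    # shape (200, d): the projected data

# Fraction of the total variance kept by the first d components
retained = (s[:d] ** 2).sum() / (s ** 2).sum()
print(X_toy_2d.shape, retained)
```

Because the toy data were built to be almost flat, nearly all of the variance should survive the projection here; real data will rarely be so kind.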

## Looking at MNIST (again)

So let's return to our old friend MNIST.

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
print(mnist.keys())

In [None]:
X=mnist['data']
y=mnist['target']

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Using PCA is even easier in sklearn than it is in numpy, as it does all the centring for you.

In [None]:
from sklearn.decomposition import PCA
pca=PCA(n_components=200) # choose the first 200 components out of 784
X_reduced=pca.fit_transform(X_train)

In [None]:
# the components will then just be stored in pca.components_.T i.e. the transpose
print(len(pca.components_)) # should be 200 vectors
print(len(pca.components_.T[:,0])) # in 784 dimensions
print(len(X_reduced)) #should be 60000 from all the original training data
print(len(X_reduced[0,:])) # should now only be 200 not 784

## Explained Variance Ratio

The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component. So if we look at (say) the first 10 you can see how quickly/slowly it drops off. You can also see how much variance you have lost (in this case around 3%, so not much).

In [None]:
print(pca.explained_variance_ratio_[0:10]) # the first 10 components
print("lost=",1-pca.explained_variance_ratio_.sum())
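
To connect this back to the SVD: the explained variance ratio is just each squared singular value divided by the sum of all of them. Here is a small sketch that checks this. I have used sklearn's little 8x8 `load_digits` data set rather than MNIST only so that it runs quickly, and I have forced `svd_solver="full"` so that the comparison is exact rather than approximate.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data            # 1797 samples, 64 features
X_digits_centered = X_digits - X_digits.mean(axis=0)

# Singular values only; we do not need U or Vh here
sing_vals = np.linalg.svd(X_digits_centered, compute_uv=False)
ratio_from_svd = sing_vals ** 2 / (sing_vals ** 2).sum()

pca_digits = PCA(n_components=10, svd_solver="full").fit(X_digits)

# The two should agree to within floating point precision
print(np.allclose(ratio_from_svd[:10], pca_digits.explained_variance_ratio_))
```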

## But how many dimensions to choose?

Rather than plucking a number out of a hat for the number of dimensions you wish to project onto, you could decide how much variance you wish to retain and use this to choose the dimensionality. You could compute the cumulative explained variance for every number of dimensions and see which is the first above your threshold.



In [None]:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(d)

However, sklearn has done this for you: rather than setting `n_components` to a number of principal components, you can give it a number between 0 and 1, which is the fraction of the variance that you want to keep.

In [None]:
pca=PCA(n_components=0.95)
X_reduced=pca.fit_transform(X_train)
pca.n_components_

You could also plot this number as a function of the number of dimensions.

In [None]:
plt.figure(figsize=(6,4))
plt.plot(cumsum, linewidth=3)
plt.axis([0, 400, 0, 1])
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.plot([d, d], [0, 0.95], "k:")
plt.plot([0, d], [0.95, 0.95], "k:")
plt.plot(d, 0.95, "ko")
plt.annotate("Elbow", xy=(65, 0.85), xytext=(70, 0.7),
             arrowprops=dict(arrowstyle="->"), fontsize=16)
plt.grid(True)

plt.show()

The value of 154 means that you only need to store $\approx$20% of the data for 95% of the information so you can use this as a form of compression. Look at the example below and note how you can reverse the transformation but that you have lost some information.

In [None]:
import matplotlib as mpl
def plot_digits(instances, images_per_row=5, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

In [None]:
pca = PCA(n_components = 154) # try varying this 154 (start at 1) and see what difference it makes to the images below
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

In [None]:
plt.figure(figsize=(7, 4))
plt.subplot(121)
plot_digits(X_train[::2100])
plt.title("Original", fontsize=16)
plt.subplot(122)
plot_digits(X_recovered[::2100])
plt.title("Compressed", fontsize=16)
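
If you would rather have a number than a picture for how much information is lost, you can compare the recovered data with the original. The sketch below is only an illustration under my own assumptions: it uses the small `load_digits` data set (64 features) instead of MNIST so that it runs in a second or so, and the list of component counts is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data   # 1797 samples, 64 features

errors = {}
for n in (2, 10, 30, 64):
    pca_n = PCA(n_components=n, svd_solver="full")
    X_back = pca_n.inverse_transform(pca_n.fit_transform(X_digits))
    # mean squared difference between original and recovered pixels
    errors[n] = np.mean((X_digits - X_back) ** 2)
    print(n, round(pca_n.explained_variance_ratio_.sum(), 3), errors[n])
```

The error should shrink as you keep more components and be essentially zero when you keep all 64.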


## Exercise

As the comment suggests, try varying the number of dimensions (154) and see what difference it makes to the image. Start with very low numbers and build up.

## Exercise

Use your favourite classifier (maybe SVC or BDT) to classify the MNIST data set as well as you can. Then see how the timing and accuracy change if you use versions with reduced dimensions, and plot your results (say accuracy against number of dimensions). This is what you are really interested in.

## Other sorts of PCA

We don't have time here, but you should be aware that there are also variants of PCA that can be useful. These include:
* **Kernel PCA** where you use a similar kernel trick as with SVMs to introduce nonlinear features (without really doing so).
* **Randomised PCA** that generates good approximations to the principal components in a semi-random way and is very much faster for large feature sets.

* ...
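
For reference, here is a minimal sketch of how the two variants named above are called in sklearn. The data sets (`make_circles` and a block of random numbers) and the hyperparameters such as `gamma=10` are arbitrary choices of mine for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Kernel PCA: an RBF kernel on two concentric circles, which linear PCA cannot untangle
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X_circ)

# Randomised PCA: the same PCA class with a different solver
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(500, 300))
rnd_pca = PCA(n_components=20, svd_solver="randomized", random_state=0)
X_wide_reduced = rnd_pca.fit_transform(X_wide)

print(X_kpca.shape, X_wide_reduced.shape)
```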

## Other forms of dimensionality reduction

There exist many other forms of dimensionality reduction, but none is anywhere near as popular as PCA. These include Locally Linear Embedding, Random Projections, Linear Discriminant Analysis, ... They all have their place and you should know that they exist.

# Unsupervised learning 

I was rather hoping to spend this entire session on unsupervised learning, but figured that dimensionality reduction was so important that it had to be covered explicitly.

As it is I can only give you a taste of this and show you an example of a commonly used approach (and give you an exercise of course).

You should note, though, that dimensionality reduction is in itself a form of unsupervised learning, as you only consider the variance of the data, not the targets.

## So what is unsupervised learning and what is it used for?

As you might have guessed, unsupervised learning is where you don't have the answer as to what you are looking for, i.e. you don't have the target, so you cannot train your favourite classifier. Somehow you need an algorithm that will train a model to pick out things that are "the same". This could be trying to distinguish coins by weight and diameter without knowing which coin is which (this is often used as an example).

The most common form is clustering (although fault detection and density estimation are other common forms) and the most common form of clustering is K-Means, so this is the example that we will look at. Again this is an example taken from {homl}.

## K-Means

K-Means takes an unlabelled data set and groups the instances into clusters. So let's generate some random data made up of clusters.

In [None]:
import numpy as np
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import make_blobs
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])


In [None]:
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

In [None]:
def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

In [None]:
plt.figure(figsize=(8, 4))
plot_clusters(X)
#print(X)
plt.show()

OK so we can see by eye that there are 5 blobs of data here. However, how can a model work out which is which? If you could tell the model the centroid of each blob then this would be easy, or if you could tell it the identity of each entry (the y from the `make_blobs` command), but in unsupervised learning you can do neither.

K-Means starts off by picking random centroids and then classifies each instance according to its nearest centroid. It then recalculates each centroid from the data associated with it, followed by a reclassification of the data according to which of the new centroids each instance is nearest to, and so it continues until the centroids stop moving. Yep, it really is that simple. So let's try it out here:

In [None]:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)

Unfortunately you have to tell it how many clusters to look for, but hey. Now each of the data points has been assigned to one of the 5 clusters:

In [None]:
print(y_pred[0:20])

In [None]:
kmeans.cluster_centers_ # to find the centres of the clusters

Let's look at how well it did:

In [None]:
plt.figure(figsize=(8, 4))
plot_clusters(X)
#print(X)
plt.plot(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],"ro")
#print(kmeans.inertia_)
plt.show() # pretty good
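
To show that the assign, recompute, repeat loop described earlier really is that simple, here is a from-scratch sketch in numpy. This is not how sklearn implements it (sklearn is smarter about initialisation and speed); the function `simple_kmeans` and the three-blob demo data are my own assumptions for illustration. A single run from a random start like this can settle on a poor solution, which is why sklearn can repeat the whole thing from several starts.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones K-Means: random start, then assign and recompute until nothing moves."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every instance to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # new centroid = mean of its instances (keep the old one if it has none)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # the centroids have stopped moving
        centroids = new_centroids
    return centroids, labels

# Hypothetical demo data: three tight blobs
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2))
                    for c in ([0, 0], [3, 3], [0, 3])])
centroids, labels = simple_kmeans(X_demo, k=3)
print(centroids)
```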

You can use it to predict the label for new data

In [None]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]]) #try adding some more to see what happens
kmeans.predict(X_new)

You can plot the decision boundaries as a Voronoi plot (code taken straight from {homl}):

In [None]:
def plot_data(X):
    plt.plot(X[:, 0], X[:, 1], 'k.', markersize=2)

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max() / 10]
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='o', s=30, linewidths=8,
                color=circle_color, zorder=10, alpha=0.9)
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=10, linewidths=10,
                color=cross_color, zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                cmap="Pastel2")
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)

    if show_xlabels:
        plt.xlabel("$x_1$", fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel("$x_2$", fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)

In [None]:
plt.figure(figsize=(8, 4))
plot_decision_boundaries(kmeans, X)
plt.show()

If you happen to have a vague idea of where the centroids are you can tell it where to start:
```python
init_guess = np.array([[-3, 1.0], [-3, 2], [-3, 3], [-1, 2], [0, 2]])  # one row per starting centroid
kmeans = KMeans(n_clusters=5, init=init_guess, n_init=1)  # a single run is enough when the start is fixed
```

### Inertia

When the K-Means algorithm runs it can actually run several times from different random starting centroids. The number of runs is set by `n_init`; the default was 10 for a long time, although recent versions of sklearn default to `'auto'`, so it is worth setting it explicitly. It then uses a performance metric to determine which run is the best and keeps that one. The performance metric is called *inertia* and is the sum of the squared distances between each instance and its closest centroid. This is available to you.

In [None]:
kmeans.inertia_
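
As a sanity check on that definition, you can compute the inertia by hand from the fitted centroids and labels and compare it with what sklearn reports. This is a minimal sketch on a throwaway `make_blobs` data set of my own choosing, so the names `X_b` and `km` are just for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_b, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_b)

# For every instance, look up the centroid it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]
# Sum of squared distances from each instance to its centroid
manual_inertia = ((X_b - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```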

## Exercise

Plot what happens to the inertia score as you change the number of centroids in your algorithm. Can you use this to determine how many centroids you should have?

## Exercise

Try to use K-Means on the iris data to separate them out without using labels. Even though we have only been using 2D to show plots the algorithm will happily run in multiple dimensions.


In [None]:
import numpy as np
import scipy as sp 
from sklearn.datasets import load_iris
iris=load_iris()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'],iris['target'], test_size=0.2) 
k = 4
kmeans = KMeans(n_clusters=k, random_state=42,n_init=10)
y_pred = kmeans.fit_predict(X_train)

In [None]:
yp=kmeans.predict(X_test)

In [None]:
print(yp)

In [None]:
print(y_test)

In [None]:
#0->2, 1->0, 2->1?, 3->1


There are many other forms of unsupervised learning. If I had more time the next one that I would introduce you to is *Density-based spatial clustering of applications with noise* (DBSCAN), which I think is more powerful than K-Means but not as popular (there are a couple of reasons for this). Really you should know that there are a variety of tools out there and you can then search for them if you need them.
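
Just so that you have a starting point if you do go looking, here is a minimal sketch of calling DBSCAN in sklearn. The `make_moons` data and the values `eps=0.2` and `min_samples=5` are assumptions chosen for illustration, not recommendations. Note that you do not tell it how many clusters to find, and that it labels instances it considers to be noise as -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape that K-Means handles badly
X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5)
moon_labels = dbscan.fit_predict(X_moons)

n_found = len(set(moon_labels.tolist()) - {-1})   # clusters found, ignoring noise
n_noise = int((moon_labels == -1).sum())          # instances flagged as noise
print(n_found, n_noise)
```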