[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek11.ipynb) 

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week11.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> 

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week11.ipynb)

# Week 11: Unsupervised learning

![](https://miro.medium.com/v2/resize:fit:587/1*J82yf-YU7Ryhh5BdhyOdcw.png)

**What is the difference between supervised and unsupervised learning?**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Unsupervised learning can highlight intrinsic information about the data.

One example is **dimension reduction**:
- take high-dimensional data and return lower-dimensional data that still retains most of the information (usually measured as total variance)
- used for:
  - visualization (from 10 dimensions to 2)
  - speed up model training (from 2000 dimensions to 400)
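As a rough sketch of what "retaining most of the information" looks like, here is the `PCA` class imported above applied to made-up data. The data, the seed, and every variable name below are assumptions for illustration only, not part of the course data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples in 10 dimensions that really live
# near a 2-dimensional plane (2 hidden factors plus a little noise).
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_high = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# Reduce from 10 dimensions to 2
reducer = PCA(n_components=2)
X_low = reducer.fit_transform(X_high)

# Fraction of the total variance that survives the reduction
retained = reducer.explained_variance_ratio_.sum()
print(X_high.shape, "->", X_low.shape)
print(f"Fraction of total variance retained: {retained:.3f}")
```

Because this data was built from only two hidden factors, two dimensions are enough to keep nearly all of the variance. Real data is rarely this tidy.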

## Principal component analysis

Principal component analysis (PCA) is perhaps one of the most broadly used unsupervised algorithms. 

(My personal favorite)

PCA is fundamentally a **dimensionality reduction** algorithm, but it can be used for (among other things)
- visualization
- noise filtering
- feature extraction

The mathematics around PCA is super fun. 

In [Geometric Foundations for Data Analysis](https://joshmaglione.com/2023CS4102.html) we go through PCA in detail. 

### Toy example

We compute the two principal components, which are just vectors.

We'll plot both principal components on the plot of the data.

In [None]:
# Load the data
df = pd.read_csv("data/pcadata.csv")

# Helper function
def draw_vector(v0, v1, color='red'):
    ax = plt.gca()
    arrowprops = dict(arrowstyle='->', linewidth=2, shrinkA=0, shrinkB=0, color=color)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# Perform PCA
model = PCA(n_components=2)
model.fit(df)

# Plot the principal components
plt.grid()
plt.axis('equal')
plt.scatter(df.x, df.y, c='blue', alpha=0.5, zorder=2)
mu = model.mean_
PC1, PC2 = model.components_
var1, var2 = model.explained_variance_
draw_vector(mu, mu + 3*PC1, color='red')
draw_vector(mu, mu + 3*PC2, color='orange')
plt.text(mu[0] + 3*PC1[0] - 1, mu[1] + 3*PC1[1] + 0.5, 'PC1', color='red', fontsize=12, ha='right')
plt.text(mu[0] + 3*PC2[0] + 0.5, mu[1] + 3*PC2[1] + 0.75, 'PC2', color='orange', fontsize=12, ha='right')
plt.show()

In [None]:
model = PCA(n_components=2)  # Refit here so the components are easy to inspect
model.fit(df)
print(f"First component: PC1 = {model.components_[0]}")
print(f"Second component: PC2 = {model.components_[1]}")

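Each principal component is paired with a value (together they form an eigenvector and eigenvalue pair of the covariance matrix), and this value measures the variance of the data along that component.

The first principal component points in the direction of largest variance.

The variance then decreases, in order, for the remaining principal components.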
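The eigenvalue and eigenvector pairing can be checked by hand. The sketch below uses synthetic data as a stand-in for `data/pcadata.csv` (the data, seed, and variable names are assumptions) and compares NumPy's eigen-decomposition of the covariance matrix with what `PCA` reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical correlated 2D data (not the course's pcadata.csv)
toy = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=300)

# Eigen-decomposition of the sample covariance matrix
cov = np.cov(toy, rowvar=False)          # n - 1 denominator, same as scikit-learn
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

toy_pca = PCA(n_components=2).fit(toy)

print("Eigenvalues:        ", eigvals)
print("explained_variance_:", toy_pca.explained_variance_)
print("Eigenvectors (rows):\n", eigvecs.T)
print("components_:\n", toy_pca.components_)
```

The eigenvectors may differ from `components_` by a sign, since a direction and its opposite describe the same axis.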

In [None]:
# Project the data onto the first 2 principal components
projected_data = model.transform(df)

# Create a scatter plot of the projected data
plt.scatter(projected_data[:, 0], projected_data[:, 1], c='blue', alpha=0.5, zorder=2)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Data on First 2 Principal Components')
plt.grid()
plt.axis('equal')
plt.show()

Notice that PC1 is not just one variable $x$ or $y$. It is a *linear combination* of the two.

This can make it challenging to interpret the components, but for many scenarios this is OK.
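To make the "linear combination" concrete, here is a small sketch on synthetic data (the data, seed, and names are assumptions). It rebuilds the PC1 coordinate from $x$ and $y$ by hand and compares it with `transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical correlated 2D data (not the course's pcadata.csv)
pts = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.8], [0.8, 0.5]], size=100)

lin_pca = PCA(n_components=2).fit(pts)
a, b = lin_pca.components_[0]   # weights of x and y in PC1

# PC1 coordinate = a * (x - mean_x) + b * (y - mean_y)
centered = pts - lin_pca.mean_
pc1_by_hand = a * centered[:, 0] + b * centered[:, 1]

# What scikit-learn computes for the first column
pc1_sklearn = lin_pca.transform(pts)[:, 0]
print(np.allclose(pc1_by_hand, pc1_sklearn))
```

The weights `a` and `b` say how much each original variable contributes to PC1, which is often the best handle we have for interpreting a component.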

We can get the vector of the explained variance. 

In [None]:
model.explained_variance_ratio_

This tells us:
- $97.6\%$ of the variance is seen in PC1
- $2.4\%$ of the variance (the rest) is seen in PC2.
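Where do these ratios come from? A minimal sketch, again on synthetic data rather than `data/pcadata.csv` (the data, seed, names, and the 95% threshold are all assumptions): each ratio is one component's variance divided by the total, and a running sum tells us how many components we need.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical strongly correlated 2D data (not the course's pcadata.csv)
demo = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.9], [1.9, 1.0]], size=500)

ratio_pca = PCA(n_components=2).fit(demo)

# Each ratio is that component's variance divided by the total variance
manual_ratio = ratio_pca.explained_variance_ / ratio_pca.explained_variance_.sum()
print(np.allclose(manual_ratio, ratio_pca.explained_variance_ratio_))

# Smallest number of components whose ratios add up to at least 95%
cumulative = np.cumsum(ratio_pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% of the variance: {n_keep}")
```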

So really, we could just keep one dimension.

In [None]:
pca = PCA(n_components=1)	# Only use 1 component
X = df.to_numpy()
pca.fit(X)
X_pca = pca.transform(X)
X_new = pca.inverse_transform(X_pca)
plt.grid()
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], alpha=0.25)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.75)
plt.title('PCA with 1 Component')
plt.show()

The light blue points are the original data, while the orange points are their projections onto PC1, mapped back into the original coordinates.

The key to PCA:
- information along the least important principal axes is removed!
- The reduced dataset is, in some sense, good enough to encode the most important relationships between the points.

This is a toy example, so it's not easy to see the benefit, but it is easy to see the impact.
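We can also put a number on the impact. The sketch below uses synthetic data in place of `data/pcadata.csv` (the data, seed, and names are assumptions) and measures how far the points move when we keep only one component. That amount should match the variance along the component we dropped.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical correlated 2D data (not the course's pcadata.csv)
cloud = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.9], [1.9, 1.0]], size=300)

# Keep one component, then map back to the original 2D space
pca_1 = PCA(n_components=1).fit(cloud)
cloud_back = pca_1.inverse_transform(pca_1.transform(cloud))

# Total squared distance the points moved, divided by n - 1
# (the same denominator scikit-learn uses for variances)
lost = np.sum((cloud - cloud_back) ** 2) / (len(cloud) - 1)

# Variance along the component we threw away
pc2_variance = PCA(n_components=2).fit(cloud).explained_variance_[1]

print(f"Information lost: {lost:.4f}")
print(f"Variance of PC2:  {pc2_variance:.4f}")
```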

## Back to Iris

Let's load up the Iris data set again and plot the basic data.

In [None]:
from sklearn.datasets import load_iris
import seaborn as sns

iris = load_iris()
ser = pd.Series(iris.target_names[iris.target], name='species')
df_labeled = pd.DataFrame(
	iris.data, 
	columns=iris.feature_names, 
)
df_labeled = pd.concat([ser, df_labeled], axis=1)

_ = sns.pairplot(
	df_labeled, 
	hue='species'
)	

(This is just an excuse to enjoy a nice image 🫠)

In [None]:
df_labeled.head()

Notice that the scale of petal width is much smaller than that of the other features.

This is a problem for PCA, *which is sensitive to the scale of the features*. 
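To see the sensitivity, here is a small experiment (a sketch; the choice to rescale petal width and the variable names are assumptions for illustration). We record the same flowers twice, once with every feature in centimetres and once with petal width in millimetres, and compare PC1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_cm = load_iris().data   # all four features in centimetres

# Same measurements, but petal width (column 3) expressed in millimetres
X_mixed = X_cm.copy()
X_mixed[:, 3] *= 10

pc1 = {}
for label, data in [("all cm", X_cm), ("petal width in mm", X_mixed)]:
    pc1[label] = PCA(n_components=1).fit(data).components_[0]
    print(f"{label:>18}: PC1 = {np.round(pc1[label], 3)}")
```

Nothing about the flowers changed, only the unit of one column, yet PC1 swings toward the feature with the largest numbers.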

We can fix this by standardizing the data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_labeled.drop('species', axis=1))
df_scaled = pd.DataFrame(df_scaled, columns=df_labeled.columns[1:])
df_scaled = pd.concat([df_labeled['species'], df_scaled], axis=1)
df_scaled.head()

Umm what?!

![](https://media.giphy.com/media/dAVLtOPb0JeIE/giphy.gif?cid=ecf05e476jqe1aysalqrm12p61ybtfb2bo4aeg27gpsrxng4&ep=v1_gifs_search&rid=giphy.gif&ct=g)