# Class 3 - Unsupervised learning models

Unsupervised learning is an area of machine learning focused on detecting patterns in the data and **modelling without an explicitly set label/target variable**. In contrast, supervised learning techniques predict a known target, either nominal (classification) or continuous (regression).

The main tasks in the area of unsupervised learning are:
- **dimensionality reduction**
- **clustering**
- anomaly detection

**Dimensionality reduction** algorithms aim to represent high-dimensional input data in the output space with lower dimensionality. The approach is useful for:
- visualization of high-dimensional data
- removing noise
- lowering the volume of the dataset, hence improving performance of other algorithms
- obfuscating and anonymizing the data

**Clustering** aims to identify groups within the data, usually based on the distance between observations. It is a common task for customer or product datasets: segments created from clustering results may be used in marketing activities or as an input to a supervised machine learning model.

In [None]:
#!pip install umap-learn

In [None]:
import random
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import umap

## Dimensionality reduction

In [None]:
digits = load_digits()
digits.keys()

In [None]:
digits['images'][0]

In [None]:
fig, ax_array = plt.subplots(1, 5)
fig.set_dpi(200)
axes = ax_array.flatten()
rand = random.sample(range(len(digits['images'])),5)
for i,ax in enumerate(axes):
    ax.imshow(digits.images[rand[i]], cmap='summer')
plt.setp(axes, xticks=[], yticks=[], frame_on=False);

In [None]:
#What is the dimensionality of the digits?
print(digits.data.shape)

We'll use the PCA (Principal Component Analysis) technique to represent the 64-dimensional digits data in 2-dimensional space and plot the result.

PCA is a popular dimensionality reduction algorithm based on linear algebra. For the covariance matrix of the (centered) input dataset we calculate eigenvectors (principal components) and eigenvalues. The eigenvectors determine the directions of projection in the new feature space, and the eigenvalues determine the magnitude ('importance') of those directions, i.e. the variance of the data along each of them.
![PCA](https://jakevdp.github.io/PythonDataScienceHandbook/figures/05.09-PCA-rotation.png)
[source](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html)
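The eigenvector description above can be sketched directly in NumPy. This is a minimal illustration, not how `sklearn.decomposition.PCA` is implemented (scikit-learn uses an SVD-based solver); the function name `pca_eig` and the synthetic `X_demo` data are assumptions made for this example only.

```python
import numpy as np

def pca_eig(X, n_components=2):
    """Project X onto its top principal components via eigendecomposition."""
    # Center the data so the covariance matrix describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    return X_centered @ components, eigenvalues[order]

# Hypothetical demo data: 5 features with very different variances.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
embedding_demo, explained_variance = pca_eig(X_demo, n_components=2)
print(embedding_demo.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as the 'importance' of a component. The sign of each eigenvector is arbitrary, so axes may be mirrored compared to the scikit-learn result.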

In [None]:
pca = PCA(2)  # PCA model reducing data to 2 dimensions (2 principal components)
pca_embedding = pca.fit_transform(digits.data)
print(pca_embedding.shape)

In [None]:
def plot_reduced_data(embedding, color_col):
    plt.figure(dpi=150)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=color_col, cmap='rainbow', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    n = len(np.unique(color_col))
    plt.colorbar(boundaries=np.arange(n+1)-0.5).set_ticks(np.arange(n)) 

In [None]:
plot_reduced_data(pca_embedding, digits.target)

Now let's try a modern dimensionality reduction algorithm called **UMAP** (Uniform Manifold Approximation and Projection). It is rooted in Riemannian geometry and algebraic topology; details can be found in the [paper](https://arxiv.org/abs/1802.03426). In practice UMAP often gives very good results and is widely considered state-of-the-art for visualization.

In [None]:
model = umap.UMAP(random_state=42)
model.fit(digits.data)
umap_embedding = model.transform(digits.data)
umap_embedding.shape

In [None]:
plot_reduced_data(umap_embedding, digits.target)

## Clustering

K-means algorithm in a nutshell:
1. Randomly pick 'k' observations from the dataset as the initial centroids
2. Assign each observation to its nearest centroid
3. Average the coordinates of each cluster's members to get the new coordinates of its centroid
4. Repeat steps 2 and 3 until a stopping criterion is reached (e.g. the centroids no longer move)

![kmeans](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_digits_001.png)
[source](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)
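The four steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration and not the scikit-learn implementation (which, among other things, uses the smarter `k-means++` initialization by default); the function name `kmeans_sketch`, its parameters and the two-blob `X_demo` data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical demo data: two blobs of 50 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(10.0, 0.5, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

On well-separated blobs like these the sketch will typically place one centroid in each blob. In general the result depends on the random initial centroids, which is why the cells below fix `random_state`.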

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, names = ['sepal_length','sepal_width','petal_length','petal_width','species'])
df

In [None]:
species = df.species
df = df.drop('species', axis = 1)

In [None]:
df = pd.DataFrame(StandardScaler().fit(df).transform(df))
# df = pd.DataFrame(StandardScaler().fit_transform(df))

In [None]:
df

In [None]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(df)
kmeans.labels_

In [None]:
model = umap.UMAP(random_state=42)
umap_embedding = model.fit(df).transform(df)

In [None]:
plot_reduced_data(umap_embedding, kmeans.labels_)

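**Elbow** method to pick k: the inertia of a given k-means clustering is the sum of squared distances between each observation and the center of its cluster. We plot inertia against k and pick the value at which the curve bends (the 'elbow'), i.e. where adding more clusters stops reducing inertia substantially.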
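To make the definition of inertia concrete, it can be computed by hand on a tiny example. The helper name `inertia` and the toy arrays below are assumptions made for this illustration; on a fitted model the same quantity should correspond to `kmeans.inertia_` computed from `kmeans.labels_` and `kmeans.cluster_centers_`.

```python
import numpy as np

def inertia(X, labels, centers):
    # Difference between every observation and the center of its own cluster.
    diffs = X - centers[labels]
    # Sum of squared Euclidean distances over all observations.
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two clusters of two points each.
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels_demo = np.array([0, 0, 1, 1])
centers_demo = np.array([[0.0, 1.0], [10.0, 2.0]])
print(inertia(X_demo, labels_demo, centers_demo))  # 1 + 1 + 4 + 4 = 10.0
```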

In [None]:
kmeans.inertia_

In [None]:
x = range(2,10)
inertias = [KMeans(n_clusters=k, random_state=0).fit(df).inertia_ for k in x]
plt.figure(dpi = 150)
plt.plot(x, inertias,'.-')
plt.ylabel('Inertia')
plt.xlabel('Number of clusters');

In [None]:
k3_labels = KMeans(n_clusters=3, random_state=0).fit(df).labels_
plot_reduced_data(umap_embedding, k3_labels)

But how do we measure the quality of the clustering if we have true labels to compare to?

In [None]:
from sklearn.metrics import adjusted_rand_score

In [None]:
adjusted_rand_score(species, k3_labels)
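The adjusted Rand index compares two partitions by counting pairs of observations that are grouped together in both, corrected for the agreement expected by chance. It does not depend on how clusters are named, which is why species names can be compared with integer cluster ids. Below is a minimal pure-Python sketch of the standard contingency-table formula; the function name `adjusted_rand_index` and the toy labels are assumptions for this example, and in practice `sklearn.metrics.adjusted_rand_score` should be used.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement
    # Pairs placed together in both partitions (cells of the contingency table).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance and the maximum attainable agreement.
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)

# Same grouping under different names gives a perfect score.
true_demo = ['a', 'a', 'b', 'b', 'c', 'c']
print(adjusted_rand_index(true_demo, [2, 2, 0, 0, 1, 1]))  # 1.0
# Merging three groups into two lowers the score.
print(adjusted_rand_index(true_demo, [0, 0, 0, 1, 1, 1]))
```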

In [None]:
adrs = [adjusted_rand_score(species,KMeans(n_clusters=k, random_state=0).fit(df).labels_) for k in x]
plt.figure(dpi = 150)
plt.plot(x,adrs,'.-')
plt.ylabel('Adjusted Rand index')
plt.xlabel('Number of clusters');

## Homework (5 pts)

A) Use 2 other dimensionality reduction techniques (other than PCA and UMAP) on the digits dataset (2 pts)

B) Use 2 other dimensionality reduction techniques (other than PCA and UMAP) on a dataset other than digits (2 pts)

C) Use 1 other clustering technique on the Iris dataset and plot the results with UMAP as above (1 pt)

Please prepare the code in a Jupyter notebook and send the notebook, including the output of its execution, to lkrain@sgh.waw.pl.