# Unsupervised Learning

- Dimensionality reduction (curse of dimensionality, performance, etc.)

- Clustering

- Others: Visualization, Finding Association Rules, Anomaly Detection


![Image](./img/clustering.png)

In [None]:
# imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA
import umap   # $pip install umap-learn

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from yellowbrick.cluster import KElbowVisualizer   # $pip install yellowbrick

from sklearn.metrics import mean_squared_error

---

# Dimensionality reduction (a.k.a. Projection)

- Principal Component Analysis (PCA)

- Uniform Manifold Approximation and Projection (UMAP)

![Image](./img/projection.JPG)

---

### [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

- Good computational performance

- Less sparsity (fewer, denser features)

The idea is to combine multiple numeric predictor variables into a smaller set of variables, which are __weighted linear combinations__ of the original set (i.e.: the principal components).
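To make "weighted linear combinations" concrete, the sketch below uses synthetic data (not the Higgs dataset) and checks that the projection returned by scikit-learn's `PCA` matches multiplying the centered data by the weight matrix `components_`. The array `X`, its shape, and the name `pca_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 5 numeric features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

pca_demo = PCA(n_components=2).fit(X)

# Each row of components_ holds the weights of one principal component
weights = pca_demo.components_          # shape (2, 5)

# A principal component score is a weighted sum of the centered features
manual_scores = (X - pca_demo.mean_) @ weights.T
sklearn_scores = pca_demo.transform(X)

print(np.allclose(manual_scores, sklearn_scores))
```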

In [None]:
# The most fancy dataset!!! 

boson = pd.read_csv('./datasets/higgs-boson.csv')
boson['Label'] = boson['Label'].map({'s': 1, 'b': 0})
boson.info()

In [None]:
# 31-D dataset

boson

In [None]:
boson['Label'].unique()

#### Maximum amount of variation

![Image](./img/max_variance.JPG)

In [None]:
# Scaling

boson_pca = boson[[x for x in boson.columns if x != 'Label']]
scaler = StandardScaler()
boson_scaled_pca = scaler.fit_transform(boson_pca)
boson_scaled_pca
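As a rough illustration of what `StandardScaler` does, the sketch below compares it with a manual z-score on synthetic data. The array `X` is an assumption made up for the example; the real notebook applies the scaler to the Higgs features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and a non-unit spread
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the variance only because of its units.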

In [None]:
# Model training (with all components)

pca = PCA().fit(boson_scaled_pca)

In [None]:
# Relative importance of PCs (i.e.: the fraction of variance explained by each component)

pca.explained_variance_ratio_

In [None]:
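# Cumulative explained variance plot

fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
# x starts at 1 so the axis reads as a number of components
plt.plot(np.arange(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')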
plt.ylim([0,1])
plt.title('Higgs Boson Dataset Explained Variance')
plt.show()

#### The cumulative explained-variance curve shows how many components are needed to keep most of the information, while the weights used to form the principal components reveal the relative contributions of the original variables. In this case roughly 20 components is a reasonable cut-off.

In [None]:
# We build our Principal Components (20, as read off the curve above)

pca_optimum = PCA(n_components=20)
boson_scaled_pca_optimum = pca_optimum.fit_transform(boson_scaled_pca)
pd.DataFrame(boson_scaled_pca_optimum)
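Instead of reading the cut-off from the plot, the number of components can also be derived from a variance threshold. The sketch below is one possible approach on synthetic data; the 0.95 threshold and the names `target`, `n_needed` and `pca_95` are assumptions for illustration, not values taken from the Higgs dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with decreasing variance per feature (illustrative only)
X = rng.normal(size=(500, 10)) * np.linspace(5.0, 0.5, 10)

target = 0.95  # hypothetical threshold: keep 95% of the variance

cumulative = np.cumsum(PCA(svd_solver='full').fit(X).explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches the target
n_needed = int(np.searchsorted(cumulative, target) + 1)

# Passing a float in (0, 1) asks scikit-learn to pick the number itself
pca_95 = PCA(n_components=target, svd_solver='full').fit(X)

print(n_needed, pca_95.n_components_)
```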

#### Why scaling?

In [None]:
pca_no_scaling = PCA().fit(boson_pca)
pca_no_scaling.explained_variance_ratio_

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
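# x starts at 1 so the axis reads as a number of components
plt.plot(np.arange(1, len(pca_no_scaling.explained_variance_ratio_) + 1), np.cumsum(pca_no_scaling.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative explained variance (fraction)')
plt.ylim([0,1])
plt.title('Higgs Boson Dataset Explained Variance (no scaling)')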
plt.show()

In [None]:
np.cumsum(pca_no_scaling.explained_variance_ratio_)
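The effect of skipping the scaling step is easier to see on a toy case. The sketch below assumes two independent synthetic features on very different scales (not the Higgs data) and compares the explained variance ratios with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on very different scales (illustrative only)
X = np.column_stack([
    rng.normal(scale=1000.0, size=1000),   # large-unit feature
    rng.normal(scale=1.0, size=1000),      # small-unit feature
])

ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_scaled = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print(ratio_raw)      # the first component takes almost all the variance
print(ratio_scaled)   # the variance is shared roughly evenly
```

Without scaling, PCA mostly rediscovers the feature with the largest units, which says little about the structure of the data.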

---

### [Uniform Manifold Approximation and Projection (UMAP)](https://pair-code.github.io/understanding-umap/)

- Captures non-linear structure while preserving much of the data's global structure; because it accepts many distance metrics, it can also be applied to categorical and mixed data.

- More complex mathematically (black box)

In [None]:
# Image data

digits = load_digits()
print(digits.data.shape)
print(digits.images.shape)
print(digits.target.shape)
print(digits.target_names.shape)
print(digits.DESCR)
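The shapes printed above are related: each row of `data` is an 8x8 image flattened into 64 values. A quick check, using a separate variable name (`digits_demo`) so it does not interfere with the notebook's own objects:

```python
import numpy as np
from sklearn.datasets import load_digits

digits_demo = load_digits()

# Flatten every 8x8 image into a 64-value row
flattened = digits_demo.images.reshape(len(digits_demo.images), -1)

print(flattened.shape)
print(np.array_equal(flattened, digits_demo.data))
```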

In [None]:
fig, ax_array = plt.subplots(20, 20, figsize=(13,8))
axes = ax_array.flatten()
for i, ax in enumerate(axes):
    ax.imshow(digits.images[i], cmap='gray_r')
plt.setp(axes, xticks=[], yticks=[], frame_on=False)
plt.tight_layout(h_pad=0.5, w_pad=0.01)

In [None]:
digits_df = load_digits(as_frame=True)
digits_df.data

In [None]:
reducer = umap.UMAP(random_state=42, 
                    n_neighbors=15, 
                    min_dist=0,
                    n_components=2,
                    metric='euclidean')
reducer.fit(digits.data)

#### [Hyperparameters](https://umap-learn.readthedocs.io/en/latest/parameters.html)

- __n_neighbors:__ balances local versus global structure in the data. This means that low values of n_neighbors will force UMAP to concentrate on very local structure.

- __min_dist:__ controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. A low value is ideal for clustering purposes.

- __n_components:__ the dimensionality of the reduced dimension space we will be embedding the data into.

- __metric:__ how distance is computed in the input data space (euclidean, cosine, chebyshev...).
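To give a feel for the `metric` option, the sketch below computes the three distances named above for two made-up vectors using plain NumPy. It only illustrates the definitions; it is not how UMAP evaluates them internally.

```python
import numpy as np

# Two illustrative points in a 3-D input space
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Largest absolute difference along any single axis
chebyshev = np.max(np.abs(a - b))
# One minus the cosine of the angle between the vectors
cosine = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, chebyshev, cosine)
```

Euclidean and Chebyshev distances depend on the magnitude of the differences, while the cosine distance only depends on direction, which is why the choice of metric can change which points count as neighbors.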

In [None]:
embedding = reducer.transform(digits.data)

embedding_df = pd.DataFrame(embedding)
embedding_df

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10,8))
plt.scatter(embedding[:, 0], embedding[:, 1])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Digits dataset', fontsize=18);

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset', fontsize=18);

---

## Clustering

![Image](./img/clustering_types.png)


- __Retail Marketing:__ Retail companies often use clustering to identify groups of similar households. The company can then send each group personalized advertisements or sales letters, based on how likely its households are to respond to each type of advertisement.

- __Streaming Services:__ Streaming services often use clustering to identify viewers with similar behavior. Using usage metrics, a service can separate high-usage from low-usage users and decide where to spend most of its advertising budget.

- __Sports Science:__ Data scientists for sports teams often use clustering to identify similar players. They feed player variables into a clustering algorithm, so that similar players can practice together and perform specific drills based on their strengths and weaknesses.

- __Other Examples:__ Investment portfolio, mental health assessment, music preferences, etc.

---

### [K-Means](https://scikit-learn.org/stable/modules/clustering.html#k-means)

- K-means clustering minimizes the within-cluster variance (inertia). The number of clusters must be defined _a priori_. [Here](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/) you can find a tool to visualize how the centroids are generated.
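As a sketch of what "inertia" means, the example below fits K-Means on three synthetic blobs (an assumption made for illustration) and recomputes the inertia by hand as the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each (illustrative only)
centers_true = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers_true])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia: sum of squared distances from each point to its assigned centroid
assigned = km.cluster_centers_[km.labels_]
manual_inertia = np.sum((X - assigned) ** 2)

print(manual_inertia, km.inertia_)
```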

In [None]:
kmeans = KMeans(n_clusters=10, random_state=42).fit(digits.data)
kmeans

In [None]:
kmeans.predict(digits.data)

In [None]:
kmeans.labels_

In [None]:
kmeans.cluster_centers_[0]

In [None]:
check = pd.DataFrame({'Ground truth':digits.target, 'Inferred Labels':kmeans.labels_})
check.head(50)
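Cluster ids are arbitrary, so comparing them with the ground truth row by row can be misleading: cluster 2 may simply correspond to digit 0. The toy sketch below (made-up label arrays, not the digits results) shows one possible way to compare groupings with a contingency table and the adjusted Rand index, which ignores how the clusters are numbered.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

truth = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([2, 2, 0, 0, 1, 1])   # same grouping, different cluster ids

# Row-by-row agreement is zero even though the grouping is identical
naive_accuracy = np.mean(truth == pred)

# The adjusted Rand index only looks at which points are grouped together
ari = adjusted_rand_score(truth, pred)

# Contingency table: rows are true classes, columns are cluster ids
table = pd.crosstab(truth, pred, rownames=['truth'], colnames=['cluster'])

print(naive_accuracy, ari)
print(table)
```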

---

In [None]:
kmeans_emb = KMeans(n_clusters=10, random_state=42).fit(embedding)
kmeans_emb

In [None]:
kmeans_emb.predict(embedding)

In [None]:
kmeans_emb.labels_

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10,8))
plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.title('Embedding projection', fontsize=18);

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=kmeans_emb.labels_, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(kmeans_emb.n_clusters + 1) - 0.5).set_ticks(np.arange(kmeans_emb.n_clusters))
plt.title('K-Means over the Digits dataset', fontsize=24);

In [None]:
check_emb = pd.DataFrame({'Ground truth':digits.target, 'Inferred Labels':kmeans_emb.labels_})
check_emb.head(20)

#### [Elbow Method](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html)

In [None]:
# Pass a fresh estimator; the visualizer refits it for every k in the range
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(6,14))

visualizer.fit(embedding)
visualizer.show();
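The elbow curve can also be sketched without yellowbrick by fitting K-Means for several values of k and recording the inertia. The example below uses four synthetic blobs (an assumption for illustration), so the elbow is expected around k=4.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four well-separated synthetic blobs of 100 points each (illustrative only)
centers_true = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers_true])

# Inertia for each candidate number of clusters
inertias = {}
for k in range(1, 8):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Reduction in inertia gained by going from k-1 to k clusters
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 8)}

for k in range(1, 8):
    print(k, round(inertias[k], 1))
```

The elbow is the value of k after which the drops become small: adding more clusters keeps lowering the inertia, but only marginally.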

---

### [DBSCAN (Density-Based Spatial Clustering of Applications with Noise)](https://scikit-learn.org/stable/modules/clustering.html#dbscan)

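- Defines clusters as areas of high density separated by areas of low density. Points that fall in low-density areas are labelled as noise (`-1`), and the number of clusters is not set in advance.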
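A minimal sketch of the density idea, on made-up points rather than the digits embedding: two dense groups become clusters and one isolated point is labelled as noise. The values of `eps` and `min_samples` here are assumptions chosen to fit the toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 5x5 grids of points (spacing 0.1) plus one isolated point
grid = np.mgrid[0:5, 0:5].reshape(2, -1).T * 0.1
X = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# Label -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```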

In [None]:
dbscan_c = DBSCAN(eps=0.5,
                  min_samples=25).fit(embedding)

dbscan_c

In [None]:
fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(13,8))
plt.scatter(embedding[:, 0], embedding[:, 1], c=dbscan_c.labels_, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
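# Labels start at -1 when DBSCAN finds noise points, so build the colorbar from the actual label range
plt.colorbar(boundaries=np.arange(dbscan_c.labels_.min(), dbscan_c.labels_.max() + 2) - 0.5)\
.set_ticks(np.arange(dbscan_c.labels_.min(), dbscan_c.labels_.max() + 1))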
plt.title('DBSCAN over the Digits dataset', fontsize=24);

In [None]:
check_emb_dbscan = pd.DataFrame({'Ground truth':digits.target, 'Inferred Labels':dbscan_c.labels_})
unique_clusters = np.unique(dbscan_c.labels_)
print(unique_clusters)
# Label -1 marks noise points, so it is not counted as a cluster
print(f'Total clusters inferred: {np.sum(unique_clusters != -1)}')
check_emb_dbscan.head(20)

---