<center>
<img src="./images/EAN.jpg" style="width:1200px">
</center>

<center>
<img src="./images/0_intro_ml.jpg" style="width:1200px">
</center>

# Lecture 3: Unsupervised Learning 

## Instructors:

>Leonardo A. Espinosa, PhD. Instructor.
(*email*: leonardo.espinosaleal@arcada.fi)

> Ruben D. Acosta, MSc. Instructor.
(*email*:  rdacostav@universidadean.edu.co)

# Goal for today
* Understand the basic principles of unsupervised learning.
* Identify the pros and cons of the main algorithms for preprocessing, scaling and clustering.

# Unsupervised Learning

* Unsupervised learning subsumes all kinds of machine learning where there is no known output, no teacher to instruct the learning algorithm. 

* In unsupervised learning, the learning algorithm is just shown the input data and asked to extract knowledge from this data.

## Dataset transformations 
* Summarize the essential characteristics of high-dimensional data with fewer features.
* Find the parts or components that 'make up' the data.

## Clustering
* Partition data into distinct groups of similar items.

1. <a href="#/57/1">Preprocessing and Scaling</a>:
   
2. <a href="#/60/1">Dimensionality Reduction, Feature Extraction, and Manifold Learning</a>:
   * Principal Component Analysis (PCA).
   * Non-Negative Matrix Factorization (NMF)
   * Manifold Learning with t-SNE

3. <a href="#/75/1">Clustering</a>:
   * k-Means Clustering.
   * Agglomerative Clustering.
   * DBSCAN.

# Preprocessing and Scaling

>Some algorithms are very sensitive to the scaling of data.

>For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order.

* *StandardScaler*: rescales each feature to mean 0 and variance 1, so all features have the same magnitude.
* *MinMaxScaler*: shifts and rescales the data so that each feature lies exactly between 0 and 1.
* *MaxAbsScaler*: divides each feature by its maximum absolute value, so values lie between -1 and 1. It does not shift the data; on positive-only data it behaves similarly to MinMaxScaler.
* *RobustScaler*: similar to StandardScaler, but it centers and scales with the median and quartiles instead of the mean and variance, which makes it robust to outliers.
* *PowerTransformer*: applies a power transformation to each feature to make the data more Gaussian-like (Box-Cox can only be applied to strictly positive data).
* *QuantileTransformer*: maps each feature through its quantiles to either a uniform or a Gaussian output distribution.
* *Normalizer*: It scales each data point such that the feature vector has a Euclidean length of 1. 
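To make the list above concrete, here is a minimal NumPy sketch of what several of these scalers compute. The toy array `X_toy` is made up for illustration, and the formulas are a simplified view of the scikit-learn classes, not their actual implementation.

```python
import numpy as np

# Toy data (illustration only): two features on different scales;
# the last row is an outlier in the first feature.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0],
                  [100.0, 500.0]])

# StandardScaler: subtract the per-feature mean, divide by the per-feature std.
X_standard = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# MinMaxScaler: map the per-feature minimum to 0 and the maximum to 1.
X_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# MaxAbsScaler: divide by the per-feature maximum absolute value (no shift).
X_maxabs = X_toy / np.abs(X_toy).max(axis=0)

# RobustScaler: subtract the median, divide by the interquartile range.
q25, q50, q75 = np.percentile(X_toy, [25, 50, 75], axis=0)
X_robust = (X_toy - q50) / (q75 - q25)

# Normalizer: rescale each row (sample) to unit Euclidean length.
X_normalized = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

print(X_standard.round(2))
print(X_robust.round(2))
```

Comparing the first column of `X_standard` and `X_robust` shows how the outlier squeezes the other values together under standard scaling, while the median and quartiles are barely affected by it.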

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import mglearn

In [None]:
mglearn.plots.plot_scaling()

## A more descriptive example

In [None]:
import numpy as np

import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import cm

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer      # requires scikit-learn >= 0.20

from sklearn.datasets import fetch_california_housing


dataset = fetch_california_housing()
X_full, y_full = dataset.data, dataset.target

# Take only 2 features to make visualization easier.
# Feature 0 (median income) has a long-tailed distribution.
# Feature 5 (average occupancy) has a few but very large outliers.

In [None]:
X = X_full[:, [0, 5]]

distributions = [
    ('Unscaled data', X),
    ('Data after standard scaling',
        StandardScaler().fit_transform(X)),
    ('Data after min-max scaling',
        MinMaxScaler().fit_transform(X)),
    ('Data after max-abs scaling',
        MaxAbsScaler().fit_transform(X)),
    ('Data after robust scaling',
        RobustScaler(quantile_range=(25, 75)).fit_transform(X)),
    ('Data after power transformation (Yeo-Johnson)',
     PowerTransformer(method='yeo-johnson').fit_transform(X)),
    ('Data after power transformation (Box-Cox)',
     PowerTransformer(method='box-cox').fit_transform(X)),
    ('Data after quantile transformation (gaussian pdf)',
        QuantileTransformer(output_distribution='normal')
        .fit_transform(X)),
    ('Data after quantile transformation (uniform pdf)',
        QuantileTransformer(output_distribution='uniform')
        .fit_transform(X)),
    ('Data after sample-wise L2 normalizing',
        Normalizer().fit_transform(X)),
]

# scale the output between 0 and 1 for the colorbar
y = minmax_scale(y_full)

# plasma does not exist in matplotlib < 1.5
cmap = getattr(cm, 'plasma_r', cm.hot_r)

In [None]:
def create_axes(title, figsize=(20, 8)):
    fig = plt.figure(figsize=figsize)
    fig.suptitle(title)

    # define the axis for the first plot
    left, width = 0.1, 0.22
    bottom, height = 0.1, 0.7
    bottom_h = height + 0.15
    left_h = left + width + 0.02

    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]

    ax_scatter = plt.axes(rect_scatter)
    ax_histx = plt.axes(rect_histx)
    ax_histy = plt.axes(rect_histy)

    # define the axis for the zoomed-in plot
    left = width + left + 0.2
    left_h = left + width + 0.02

    rect_scatter = [left, bottom, width, height]
    rect_histx = [left, bottom_h, width, 0.1]
    rect_histy = [left_h, bottom, 0.05, height]

    ax_scatter_zoom = plt.axes(rect_scatter)
    ax_histx_zoom = plt.axes(rect_histx)
    ax_histy_zoom = plt.axes(rect_histy)

    # define the axis for the colorbar
    left, width = width + left + 0.13, 0.01

    rect_colorbar = [left, bottom, width, height]
    ax_colorbar = plt.axes(rect_colorbar)

    return ((ax_scatter, ax_histy, ax_histx),
            (ax_scatter_zoom, ax_histy_zoom, ax_histx_zoom),
            ax_colorbar)

In [None]:
def plot_distribution(axes, X, y, hist_nbins=50, title="",
                      x0_label="", x1_label=""):
    ax, hist_X1, hist_X0 = axes

    ax.set_title(title)
    ax.set_xlabel(x0_label)
    ax.set_ylabel(x1_label)

    # The scatter plot
    colors = cmap(y)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.5, marker='o', s=5, lw=0, c=colors)

    # Removing the top and the right spine for aesthetics
    # make nice axis layout
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))

    # Histogram for axis X1 (feature 5)
    hist_X1.set_ylim(ax.get_ylim())
    hist_X1.hist(X[:, 1], bins=hist_nbins, orientation='horizontal',
                 color='grey', ec='grey')
    hist_X1.axis('off')

    # Histogram for axis X0 (feature 0)
    hist_X0.set_xlim(ax.get_xlim())
    hist_X0.hist(X[:, 0], bins=hist_nbins, orientation='vertical',
                 color='grey', ec='grey')
    hist_X0.axis('off')

In [None]:
def make_plot(item_idx):
    title, X = distributions[item_idx]
    ax_zoom_out, ax_zoom_in, ax_colorbar = create_axes(title)
    axarr = (ax_zoom_out, ax_zoom_in)
    plot_distribution(axarr[0], X, y, hist_nbins=200,
                      x0_label="Median Income",
                      x1_label="Average occupancy",
                      title="Full data")

    # zoom-in
    zoom_in_percentile_range = (0, 99)
    cutoffs_X0 = np.percentile(X[:, 0], zoom_in_percentile_range)
    cutoffs_X1 = np.percentile(X[:, 1], zoom_in_percentile_range)

    non_outliers_mask = (
        np.all(X > [cutoffs_X0[0], cutoffs_X1[0]], axis=1) &
        np.all(X < [cutoffs_X0[1], cutoffs_X1[1]], axis=1))
    plot_distribution(axarr[1], X[non_outliers_mask], y[non_outliers_mask],
                      hist_nbins=50,
                      x0_label="Median Income",
                      x1_label="Average occupancy",
                      title="Zoom-in")

    norm = mpl.colors.Normalize(y_full.min(), y_full.max())
    mpl.colorbar.ColorbarBase(ax_colorbar, cmap=cmap,
                              norm=norm, orientation='vertical',
                              label='Color mapping for values of y')


### California housing

In [None]:
print(dataset.feature_names)
print(dataset.target_names)

### Original data

In [None]:
make_plot(0)

### StandardScaler
Rescales each feature to mean 0 and variance 1, so all features have the same magnitude.

In [None]:
make_plot(1)

### MinMaxScaler
Shifts the data such that all features are exactly between 0 and 1.

In [None]:
make_plot(2)

### MaxAbsScaler
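Divides each feature by its maximum absolute value, so values lie between -1 and 1 without shifting the data. On positive-only data it behaves similarly to *MinMaxScaler*.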

In [None]:
make_plot(3)

### RobustScaler
Similar to *StandardScaler*, but it centers and scales with the median and quartiles instead of the mean and variance, which makes it robust to outliers.

In [None]:
make_plot(4)

### PowerTransformer (Yeo-Johnson)
Applies a power transformation to each feature to make the data more Gaussian-like. Can be applied to both positive and negative data.
*I.K. Yeo and R.A. Johnson, “A New Family of Power Transformations to Improve Normality or Symmetry”, Biometrika 87.4 (2000)*

In [None]:
make_plot(5)

### PowerTransformer (Box-Cox) 
Applies a power transformation to each feature to make the data more Gaussian-like.  Can only be applied to strictly positive data.
*G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).*
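
For reference, the one-parameter Box-Cox transformation of a strictly positive value $x$ with parameter $\lambda$ is usually written as shown here. The logarithm in the second branch is the reason the data has to be strictly positive. In scikit-learn's *PowerTransformer*, $\lambda$ is estimated separately for each feature by maximum likelihood.

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}
$$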

In [None]:
make_plot(6)

### QuantileTransformer (Gaussian output)
Maps each feature through its quantiles so that the output follows a Gaussian distribution.

In [None]:
make_plot(7)

### QuantileTransformer (uniform output)
Maps each feature through its quantiles so that the output follows a uniform distribution between 0 and 1.

In [None]:
make_plot(8)

### Normalizer
It scales each data point such that the feature vector has a Euclidean length of 1. 

In [None]:
make_plot(9)

## Applying Data Transformations

In [None]:
file_path='https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
names_list = ['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
df = pd.read_csv(file_path,header=None,names=names_list)

# We add a Years column  
df['Years'] = df['Rings'] + 1.5

# We encode the categorical values M, F and I as the integers 0, 1 and 2.
replace_list = {"Sex" : {"M": 0, "F" : 1, "I": 2}}
df.replace(replace_list,inplace=True)
# If we want, we can inspect the dataset.

In [None]:
# Here we turn into numpy arrays
X_cls = df.iloc[:,1:8].values  # dataset
y_cls = df['Sex'].values   # target classification
y_reg = df['Years'].values   # target regression

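# Reformulate the problem as binary classification: the target is Male (0) or Female (1).
# First remove the rows with Sex == 2 (Infant).

df_bin = df[df.Sex != 2]

# Here we turn into numpy arrays.
# Columns 1-8 are the eight measurements (Length ... Rings); the derived
# 'Years' column is left out because it duplicates 'Rings'.
X_bin = df_bin.iloc[:, 1:9].values
y_bin = df_bin.iloc[:, 0].values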

In [None]:
df

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 25}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 5.0
plt.rcParams['lines.markersize'] = 15.0

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_bin,y_bin, random_state=42)
svm = SVC()
svm.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

In [None]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)

# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))

In [None]:
# preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("SVM test accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))

### Another example

UCI ML Breast Cancer Wisconsin (Diagnostic) dataset (https://goo.gl/U2Uwz2).

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
random_state=1)
print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

In [None]:
# transform train data
X_train_scaled = scaler.transform(X_train)
# print dataset properties before and after scaling
print("transformed shape: {}".format(X_train_scaled.shape))
print("\n")
print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0)))
print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0)))
print("\n")
print("per-feature minimum after scaling:\n {}".format(X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n {}".format(X_train_scaled.max(axis=0)))

In [None]:
# transform test data
X_test_scaled = scaler.transform(X_test)
# print test data properties after scaling
print("per-feature minimum after scaling:\n{}".format(X_test_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_test_scaled.max(axis=0)))

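*Some of the test-set features are even outside the 0–1 range!* This is expected, not an error: the scaler was fit on the training set, and `transform` applies that same shift and scale to the test set, whose minimum and maximum differ from those of the training set. What would be wrong is fitting a separate scaler on the test set, as the third panel of the plot below illustrates.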
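
A small numeric sketch of why scaled test values can leave the 0–1 range. The two one-feature arrays are made up for illustration, and the min-max formula is written out by hand rather than calling *MinMaxScaler*.

```python
import numpy as np

# Hypothetical one-feature training and test sets, for illustration only.
X_train_toy = np.array([[2.0], [4.0], [6.0]])
X_test_toy = np.array([[1.0], [5.0], [8.0]])

# "Fit": learn the minimum and the range from the training set only.
train_min = X_train_toy.min(axis=0)
train_range = X_train_toy.max(axis=0) - train_min

# "Transform": apply the same shift and scale to both sets.
X_train_toy_scaled = (X_train_toy - train_min) / train_range
X_test_toy_scaled = (X_test_toy - train_min) / train_range

print(X_train_toy_scaled.ravel())  # training values span exactly [0, 1]
print(X_test_toy_scaled.ravel())   # test values can fall below 0 or above 1
```

Because both sets go through the same transformation, the relative position of training and test points is preserved, which is what a model trained on the scaled training data relies on.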

In [None]:
from sklearn.datasets import make_blobs

def get_plot1():
    # make synthetic data
    X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
    # split it into training and test sets
    X_train, X_test = train_test_split(X, random_state=5, test_size=.1)
    # plot the training and test sets
    fig, axes = plt.subplots(1, 3, figsize=(22, 8))

    axes[0].scatter(X_train[:, 0], X_train[:, 1],
        c=mglearn.cm2(0), label="Training set", s=60)
    axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^',
        c=mglearn.cm2(1), label="Test set", s=60)
    axes[0].legend(loc='upper left')
    axes[0].set_title("Original Data")

    # scale the data using MinMaxScaler
    scaler = MinMaxScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # visualize the properly scaled data
    axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
        c=mglearn.cm2(0), label="Training set", s=60)
    axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',
        c=mglearn.cm2(1), label="Test set", s=60)
    axes[1].set_title("Scaled Data")

    # rescale the test set separately
    # so test set min is 0 and test set max is 1
    # DO NOT DO THIS! For illustration purposes only.
    test_scaler = MinMaxScaler()
    test_scaler.fit(X_test)
    X_test_scaled_badly = test_scaler.transform(X_test)

    # visualize wrongly scaled data
    axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],
        c=mglearn.cm2(0), label="training set", s=60)
    axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1],
        marker='^', c=mglearn.cm2(1), label="test set", s=60)
    axes[2].set_title("Improperly Scaled Data")
    for ax in axes:
        ax.set_xlabel("Feature 0")
        ax.set_ylabel("Feature 1")

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 15}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 10.0

In [None]:
get_plot1()

## Shortcuts and Efficient Alternatives

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# calling fit and transform in sequence (using method chaining)
X_scaled = scaler.fit(X).transform(X)

# same result, but more efficient computation
X_scaled_d = scaler.fit_transform(X)

### The Effect of Preprocessing on Supervised Learning

In [None]:
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,random_state=0)
svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))

### Using a scaler before fitting the SVC

In [None]:
# preprocessing using 0-1 scaling
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))

In [None]:
# preprocessing using zero mean and unit variance scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("SVM test accuracy: {:.2f}".format(svm.score(X_test_scaled, y_test)))

# Conclusion

>As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn't involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn instead of reimplementing them yourself, as it's easy to make mistakes even in these simple computations.

>You can also easily replace one preprocessing algorithm with another by changing the class you use, as all of the preprocessing classes have the same interface, consisting of the `fit` and `transform` methods.
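
As a minimal sketch of that shared interface, the loop below runs three scalers through the same two calls. The synthetic array `X_demo` and its train/test split are assumptions made for this example, not data used elsewhere in the lecture.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic data (an assumption for this sketch): 100 samples, 3 features
# on very different scales.
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0],
                    size=(100, 3))
X_demo_train, X_demo_test = X_demo[:75], X_demo[75:]

scaled = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Same two calls for every scaler: fit on the training data only ...
    scaler.fit(X_demo_train)
    # ... then transform the training and the test data with the fitted scaler.
    X_tr = scaler.transform(X_demo_train)
    X_te = scaler.transform(X_demo_test)
    scaled[type(scaler).__name__] = (X_tr, X_te)
    print(type(scaler).__name__, X_tr.mean(axis=0).round(2))
```

Swapping in another preprocessing class only changes the tuple in the `for` statement; the rest of the code stays the same.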

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

# Exercise

Let's fit a couple of models to the wine dataset to predict:
1. The amount of alcohol
2. The quality of the wine 

For both targets, compare the test-set score obtained with two different scaling methods.

In [None]:
df_red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",sep=';')
df_white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",sep=';')

df_red['type'] = 0
df_white['type'] = 1

df_allwine = pd.concat([df_red, df_white])

# Here we turn into numpy arrays
Xw = df_allwine.iloc[:,:10].values  # dataset
yw_bin = df_allwine['type'].values   # target classification
yw_reg = df_allwine['alcohol'].values   # target regression
yw_mul = df_allwine['quality'].values   # target multiclass classification

# Dimensionality Reduction, Feature Extraction, and Manifold Learning

The most common motivations for transforming data using unsupervised learning are visualization, compressing the data, and finding a representation that is more informative for further processing.

* Principal component analysis (PCA). 
* Non-negative matrix factorization (NMF), commonly used for feature extraction, and 
* t-SNE, commonly used for visualization using two-dimensional scatter plots.

# Principal Component Analysis (PCA)

* Principal component analysis is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated. 
* This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data.
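
The two steps above (rotate, then keep a subset) can be sketched in plain NumPy. This is an illustrative version based on the eigendecomposition of the covariance matrix, with synthetic data assumed for the example; scikit-learn's `PCA` arrives at the same directions through a singular value decomposition.

```python
import numpy as np

# Synthetic 2-D data with strongly correlated features (illustration only).
rng = np.random.RandomState(0)
x0 = rng.normal(size=500)
X_corr = np.column_stack([x0, 0.8 * x0 + 0.3 * rng.normal(size=500)])

# 1. Center the data.
X_centered = X_corr - X_corr.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix: the eigenvectors are the
#    principal directions, the eigenvalues the variance along each of them.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort the directions by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Rotate the data onto the principal directions.
X_rotated = X_centered @ eigvecs

# The rotated features are uncorrelated: the off-diagonal covariance is ~0.
print(np.cov(X_rotated, rowvar=False).round(6))

# Keeping only the first column is the "select a subset" step.
X_reduced = X_rotated[:, :1]
```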

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 8}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 10.0

In [None]:
mglearn.plots.plot_pca_illustration()

In [None]:
new_names = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']


fig, axes = plt.subplots(4, 2, figsize=(10, 20))
Male = X_bin[y_bin == 0]
Female = X_bin[y_bin == 1]
ax = axes.ravel()
for i in range(8):
    _, bins = np.histogram(X_bin[:, i], bins=50)
    ax[i].hist(Male[:, i], bins=bins, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(Female[:, i], bins=bins, color=mglearn.cm3(2), alpha=.5)
    ax[i].set_title(new_names[i])
    ax[i].set_yticks(())
    ax[i].set_xlabel("Feature magnitude")
    ax[i].set_ylabel("Frequency")
ax[0].legend(["Male", "Female"], loc="best")
fig.tight_layout()

In [None]:
#scaler = StandardScaler()
scaler = MinMaxScaler()
scaler.fit(X_bin)
X_scaled = scaler.transform(X_bin)

In [None]:
from sklearn.decomposition import PCA
# keep the first two principal components of the data
pca = PCA(n_components=2)
# fit PCA model to the scaled abalone data
pca.fit(X_scaled)
# transform data onto the first two principal components
X_pca = pca.transform(X_scaled)
print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 22}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 10.0

### PCA can be used for visualization of features.

In [None]:
# plot first vs. second principal component, colored by class
plt.figure(figsize=(10, 8))
mglearn.discrete_scatter(X_pca[:, 0], X_pca[:, 1], y_bin)
plt.legend(['Male','Female'], loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

In [None]:
print("PCA component shape: {}".format(pca.components_.shape))

In [None]:
print("PCA components:\n{}".format(pca.components_))

In [None]:
plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(new_names)), new_names, rotation=60, ha='left')
plt.xlabel("Feature")
plt.ylabel("Principal components")

### Eigenfaces for feature extraction

Another application of PCA is feature extraction. The idea behind feature extraction is that it is possible to find a representation of your data that is better suited to analysis than the raw representation you were given. This is particularly useful for images.

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 12}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 10.0

In [None]:
from sklearn.datasets import fetch_lfw_people
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape
fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})

for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])

There are 3,023 images, each 87×65 pixels large, belonging to 62 different people:

In [None]:
print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))

In [None]:
# count how often each target appears
counts = np.bincount(people.target)
# print counts next to target names
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='')
    if (i + 1) % 3 == 0:
        print()

In [None]:
# np.bool was removed in NumPy 1.24; the built-in bool works in all versions
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# scale the grayscale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability
X_people = X_people / 255.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X_people, y_people, stratify=y_people, random_state=0)
# build a KNeighborsClassifier using one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score of 1-nn: {:.2f}".format(knn.score(X_test, y_test)))

We obtain an accuracy of 23%, which is not actually that bad for a 62-class classification problem (random guessing would give you around 1/62 = 1.5% accuracy), but is also not great. We only correctly identify a person every fourth time.

This is where PCA comes in.

*Here we use the whitening option of PCA, which rescales the principal components to have the same scale. This is the same as using StandardScaler after the transformation.*
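
A rough NumPy sketch of what whitening does, assuming synthetic correlated data: after projecting onto the principal axes, each component is divided by its own standard deviation so that all components end up with the same spread. This is an illustration of the idea, not scikit-learn's implementation.

```python
import numpy as np

# Synthetic correlated data (illustration only).
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(300, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Plain PCA via the SVD of the centered data: project onto the principal axes.
X_demo_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_demo_centered, full_matrices=False)
X_pca_plain = X_demo_centered @ Vt.T

# Whitening: divide each component by its standard deviation.
X_pca_white = X_pca_plain / X_pca_plain.std(axis=0)

print(X_pca_plain.std(axis=0))  # the components have different spreads
print(X_pca_white.std(axis=0))  # after whitening, every component has spread 1
```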

In [None]:
mglearn.plots.plot_pca_whitening()

We fit the PCA object to the training data and extract the first 100 principal components. Then we transform the training and test data:

In [None]:
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("X_train_pca.shape: {}".format(X_train_pca.shape))

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
print("Test set accuracy: {:.2f}".format(knn.score(X_test_pca, y_test)))

In [None]:
print("pca.components_.shape: {}".format(pca.components_.shape))

In [None]:
fig, axes = plt.subplots(3, 5, figsize=(15, 12),
                         subplot_kw={'xticks': (), 'yticks': ()})
for i, (component, ax) in enumerate(zip(pca.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape),
    cmap='viridis')
    ax.set_title("{}. component".format((i + 1)))

### Reconstructing three face images using increasing numbers of principal components

In [None]:
mglearn.plots.plot_pca_faces(X_train, X_test, image_shape)

### Scatter plot of the faces dataset using the first two principal components

In [None]:
mglearn.discrete_scatter(X_train_pca[:, 0], X_train_pca[:, 1], y_train)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

### A longer exercise
Let's fit different models to the wine dataset using the components of PCA (2, 3 and 4):

* for regression (amount of alcohol) and 
* classification (quality of wine) 
* create a visualization in 2D of the components

Explore different values of hyperparameters.

# Non-Negative Matrix Factorization (NMF)

* As in PCA, we are trying to write each data point as a weighted sum of some components. 

* But whereas in PCA we wanted components that were orthogonal and that explained as much variance of the data as possible, in NMF, we want the components and the coefficients to be non-negative.

* We want both the components and the coefficients to be greater than or equal to zero. Consequently, this method can only be applied to data where each feature is non-negative.

* Helpful for data that is created as the addition (or overlay) of several independent sources such as music or pictures.
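
To illustrate the non-negativity constraint, here is a small sketch of NMF using the classic multiplicative-update rule (Lee and Seung) on synthetic data. Both the data and the choice of update rule are assumptions made for this example; scikit-learn's `NMF` may use a different solver.

```python
import numpy as np

# Synthetic non-negative data built as a sum of two non-negative "sources"
# (illustration only).
rng = np.random.RandomState(0)
true_W = rng.uniform(size=(50, 2))
true_H = rng.uniform(size=(2, 8))
X_nonneg = true_W @ true_H

# Random non-negative starting point for the coefficients W and components H.
n_components = 2
W = rng.uniform(size=(X_nonneg.shape[0], n_components))
H = rng.uniform(size=(n_components, X_nonneg.shape[1]))

eps = 1e-10  # avoids division by zero
errors = []
for _ in range(200):
    # Multiplicative updates only rescale entries by non-negative factors,
    # so W and H stay non-negative throughout.
    H *= (W.T @ X_nonneg) / (W.T @ W @ H + eps)
    W *= (X_nonneg @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(X_nonneg - W @ H))

print("reconstruction error: {:.4f} -> {:.4f}".format(errors[0], errors[-1]))
```

Each data point (row of `X_nonneg`) is approximated by a weighted sum of the rows of `H`, with the weights given by the corresponding row of `W`.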

#### Components found by non-negative matrix factorization with two components (left) and one component (right)

In [None]:
mglearn.plots.plot_nmf_illustration()

### Applying NMF to face images

In [None]:
mglearn.plots.plot_nmf_faces(X_train, X_test, image_shape)

* The quality of the back-transformed data is similar to when using PCA, but slightly worse. 
* PCA finds the optimum directions in terms of reconstruction. 
* NMF is usually not used for its ability to reconstruct or encode data, but rather for finding interesting patterns within the data.

In [None]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=15, random_state=0)
nmf.fit(X_train)

X_train_nmf = nmf.transform(X_train)
X_test_nmf = nmf.transform(X_test)

fig, axes = plt.subplots(3, 5, figsize=(15, 12), subplot_kw={'xticks': (), 'yticks': ()})

for i, (component, ax) in enumerate(zip(nmf.components_, axes.ravel())):
    ax.imshow(component.reshape(image_shape))
    ax.set_title("{}. component".format(i))

In [None]:
compn = 3
# sort by component 3 (largest coefficients first), plot the first 10 images
inds = np.argsort(X_train_nmf[:, compn])[::-1]
fig, axes = plt.subplots(2, 5, figsize=(15, 8),subplot_kw={'xticks': (), 'yticks': ()})
for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
    ax.imshow(X_train[ind].reshape(image_shape))

In [None]:
compn = 7
# sort by component 7 (largest coefficients first), plot the first 10 images
inds = np.argsort(X_train_nmf[:, compn])[::-1]
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks': (), 'yticks': ()})
for i, (ind, ax) in enumerate(zip(inds, axes.ravel())):
    ax.imshow(X_train[ind].reshape(image_shape))

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

### Exercise

Let's fit different models to the wine dataset using the components of NMF (2,3 and 4):

*hint*: First do a transformation where all the data is larger than zero (*MinMaxScaler*).

* for regression (amount of alcohol) and 
* classification (quality of wine) 
* create a visualization in 2D of the components

Explore different values of hyperparameters.

# Manifold Learning with t-SNE

* t-distributed Stochastic Neighbor Embedding, developed by Geoffrey Hinton and Laurens van der Maaten.

* PCA is often a good first approach for transforming your data so that you might be able to visualize it using a scatter plot.

* Manifold learning algorithms are mainly aimed at visualization.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
fig, axes = plt.subplots(2, 5, figsize=(10, 5),
subplot_kw={'xticks':(), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img)

In [None]:
# build a PCA model
def get_pca_fig(cp=2):
    pca = PCA(n_components=cp)
    pca.fit(digits.data)
    # transform the digits data onto the first two principal components
    digits_pca = pca.transform(digits.data)
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
    "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]
    plt.figure(figsize=(10, 10))
    plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max())
    plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max())
    for i in range(len(digits.data)):
        # actually plot the digits as text instead of using scatter
        plt.text(digits_pca[i, 0], digits_pca[i, 1], str(digits.target[i]),
        color = colors[digits.target[i]],
        fontdict={'weight': 'bold', 'size': 9})
    plt.xlabel("First principal component")
    plt.ylabel("Second principal component")

In [None]:
get_pca_fig()

In [None]:
from sklearn.manifold import TSNE

def get_tsne_fig():
    
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
    "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]
    
    tsne = TSNE(random_state=42)
    # use fit_transform instead of fit, as TSNE has no transform method
    digits_tsne = tsne.fit_transform(digits.data)

    plt.figure(figsize=(10, 10))
    plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)
    plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)
    for i in range(len(digits.data)):
        # actually plot the digits as text instead of using scatter
        plt.text(digits_tsne[i, 0], digits_tsne[i, 1], str(digits.target[i]),
        color = colors[digits.target[i]],
        fontdict={'weight': 'bold', 'size': 9})
    plt.xlabel("t-SNE feature 0")
    plt.ylabel("t-SNE feature 1")

In [None]:
get_tsne_fig()

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Clustering

* Clustering is the task of partitioning the dataset into groups, called clusters.

* The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different.

## k-Means Clustering

In [None]:
mglearn.plots.plot_kmeans_algorithm()


In [None]:
mglearn.plots.plot_kmeans_boundaries()
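
The two alternating steps illustrated by the plots above (assign each point to its nearest center, then move each center to the mean of its assigned points) can be sketched in plain NumPy. This is a minimal illustration under simple assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy array `X_toy` are made up for this example, and the centers are initialized by picking random data points.

```python
import numpy as np


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Illustrative k-means (hypothetical helper, not the scikit-learn code)."""
    rng = np.random.default_rng(seed)
    # initialize the centers with randomly chosen, distinct data points
    start = rng.choice(len(X), size=n_clusters, replace=False)
    centers = X[start].astype(float)
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    # final assignment with respect to the returned centers
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers


# toy data (assumption): two well-separated groups of three points each
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)
toy_labels, toy_centers = kmeans_sketch(X_toy, n_clusters=2)
print("labels:", toy_labels)
print("centers:\n", toy_centers)
```

On data this cleanly separated, the sketch should place one center in each group. Which group receives label 0 depends on the random initialization.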

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def get_kmeans_figs():
    X, y = make_blobs(random_state=1)
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    # using two cluster centers:
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(X)
    assignments = kmeans.labels_
    mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[0])
    # using five cluster centers:
    kmeans = KMeans(n_clusters=5)
    kmeans.fit(X)
    assignments = kmeans.labels_
    mglearn.discrete_scatter(X[:, 0], X[:, 1], assignments, ax=axes[1])

In [None]:
get_kmeans_figs()

# Cluster assignments found by k-means using two clusters (left) and five clusters (right)

### Failure cases of k-means

Even if you know the “right” number of clusters for a given dataset, k-means might not always be able to recover them.

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 25}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 15.0
plt.rcParams['lines.markersize'] = 15.0

In [None]:
X_varied, y_varied = make_blobs(n_samples=200,
cluster_std=[1.0, 2.5, 0.5],
random_state=170)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X_varied)
mglearn.discrete_scatter(X_varied[:, 0], X_varied[:, 1], y_pred)
plt.legend(["cluster 0", "cluster 1", "cluster 2"], loc='best')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

In [None]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(X)
y_pred = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=60, cmap='Paired')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=60,
marker='^', c=range(kmeans.n_clusters), linewidth=2, cmap='Paired')
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
print("Cluster memberships:\n{}".format(y_pred))

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

<center>
<img src="./images/00_hands-on.jpg" style="width:1200px">
</center>

## Exercises 

Let's try to cluster the wine dataset! Use k-means to group its samples into clusters.
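
One possible starting point for the exercise is sketched below. It is only a suggestion: it assumes the wine dataset bundled with scikit-learn (`load_wine`), picks three clusters because that dataset has three cultivars, and standardizes the features first because their scales differ widely. The variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
# rescale every feature to zero mean and unit variance before clustering
X_wine = StandardScaler().fit_transform(wine.data)

# n_clusters=3 is an assumption based on the three cultivars in the dataset
kmeans_wine = KMeans(n_clusters=3, n_init=10, random_state=0)
wine_labels = kmeans_wine.fit_predict(X_wine)

print("Cluster sizes:", np.bincount(wine_labels))
# the true labels are used only to judge the result, never for fitting
print("ARI vs. cultivars: {:.2f}".format(
    adjusted_rand_score(wine.target, wine_labels)))
```

Try changing `n_clusters`, or skipping the scaling step, and compare the cluster sizes and the adjusted Rand index.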

# Agglomerative Clustering

* Agglomerative clustering refers to a collection of clustering algorithms that all build upon the same principle: the algorithm starts by declaring each point its own cluster, and then merges the two most similar clusters until some stopping criterion is satisfied.

* There are several linkage criteria that specify how exactly the “most similar cluster” is measured.
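     * ward: merges the two clusters whose merger increases the within-cluster variance the least (the default in scikit-learn).
     * average: merges the two clusters with the smallest average distance between all their points.
     * complete: merges the two clusters with the smallest maximum distance between their points.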
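
The effect of the linkage criterion can be checked by running scikit-learn's `AgglomerativeClustering` with each option on the same data. The snippet below is a small sketch: the blob dataset, its size, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# synthetic data (assumption): 60 points drawn around three centers
X_blobs, _ = make_blobs(n_samples=60, centers=3, random_state=0)

linkage_labels = {}
for linkage in ["ward", "average", "complete"]:
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    # agglomerative clustering has no predict method, so use fit_predict
    linkage_labels[linkage] = agg.fit_predict(X_blobs)
    sizes = np.bincount(linkage_labels[linkage])
    print("{:>8}: cluster sizes {}".format(linkage, sizes))
```

On well-separated blobs the three criteria tend to agree. Differences usually show up when clusters have very different sizes or shapes.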

In [None]:
mglearn.plots.plot_agglomerative_algorithm()

### Hierarchical clustering and dendrograms

In [None]:
mglearn.plots.plot_agglomerative()

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 15}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 8.0
plt.rcParams['lines.markersize'] = 5.0

In [None]:
# Import the dendrogram function and the ward clustering function from SciPy
from scipy.cluster.hierarchy import dendrogram, ward

#def plot_dendro():
X, y = make_blobs(random_state=0, n_samples=12)
# Apply the ward clustering to the data array X
# The SciPy ward function returns an array that specifies the distances
# bridged when performing agglomerative clustering
linkage_array = ward(X)

# Now we plot the dendrogram for the linkage_array containing the distances
# between clusters
dendrogram(linkage_array)
# Mark the cuts in the tree that signify two or three clusters
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [7.25, 7.25], '--', c='k')
ax.plot(bounds, [4, 4], '--', c='k')
ax.text(bounds[1], 7.25, ' two clusters', va='center', fontdict={'size': 15})
ax.text(bounds[1], 4, ' three clusters', va='center', fontdict={'size': 15})
plt.xlabel("Sample index")
plt.ylabel("Cluster distance")
plt.show()
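
The cuts marked on the dendrogram can also be turned into actual cluster labels. A minimal sketch, assuming SciPy's `fcluster` with the `maxclust` criterion (the cell above does not use it), applied to the same 12-point blob data:

```python
from scipy.cluster.hierarchy import fcluster, ward
from sklearn.datasets import make_blobs

# same small dataset as in the dendrogram cell
X_small, _ = make_blobs(random_state=0, n_samples=12)
linkage_small = ward(X_small)

# cut the tree so that at most two (or three) flat clusters remain
labels_two = fcluster(linkage_small, t=2, criterion="maxclust")
labels_three = fcluster(linkage_small, t=3, criterion="maxclust")

# fcluster numbers the clusters starting from 1, not 0
print("two clusters:  ", labels_two)
print("three clusters:", labels_three)
```

Passing `criterion="distance"` with a height read from the y-axis is another way to cut the tree, for example at the heights of the dashed lines drawn above.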

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# DBSCAN

* Density-Based Spatial Clustering of Applications with Noise.
* It does not require the user to set the number of clusters a priori.
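
A small sketch of both points, on a hand-made dataset (an assumption for illustration): the number of clusters is never passed to `DBSCAN`, and a point that has no dense neighborhood receives the noise label `-1`. The values of `eps` and `min_samples` are chosen to fit this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy data (assumption): two dense 2x2 squares plus one isolated point
X_dense = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [10, 10], [10, 11], [11, 10], [11, 11],
                    [5, 50]], dtype=float)

# eps: neighborhood radius; min_samples: neighbors needed for a core point
dbscan_toy = DBSCAN(eps=1.5, min_samples=3)
dbscan_labels = dbscan_toy.fit_predict(X_dense)

# the label -1 marks noise, so it is not counted as a cluster
n_found = len(set(dbscan_labels.tolist()) - {-1})
print("labels:", dbscan_labels)
print("clusters found:", n_found)
```

Here the two squares become clusters 0 and 1, and the isolated point is flagged as noise.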

In [None]:
from matplotlib import rc
font = {'family' : 'monospace', 'weight' : 'bold', 'size'   : 8}
rc('font', **font) 

plt.rcParams['figure.figsize'] = [20, 10]
plt.rcParams['lines.linewidth'] = 8.0
plt.rcParams['lines.markersize'] = 5.0

In [None]:
mglearn.plots.plot_dbscan()

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
# rescale the data to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
dbscan = DBSCAN()
clusters = dbscan.fit_predict(X_scaled)
# plot the cluster assignments
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap=mglearn.cm2, s=60)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

<center>
<img src="./images/00_questions.jpg" style="width:1200px">
</center>

# Conclusions about Clustering Methods

* k-means, DBSCAN, and agglomerative clustering each offer a way of controlling the granularity of clustering.
* k-means and agglomerative clustering allow you to specify the number of desired clusters, while DBSCAN lets you define proximity using the eps parameter, which indirectly influences cluster size.
* All three methods can be used on large, real-world datasets, are relatively easy to understand, and allow for clustering into many clusters.
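
The granularity controls mentioned above can be compared side by side. The sketch below uses a tiny hand-made dataset and the helper `n_clusters_found`, both assumptions for illustration: `n_clusters` sets the number of groups directly for k-means and agglomerative clustering, while for DBSCAN the number of groups follows from `eps`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# toy data (assumption): two tight 2x2 squares of points
X_knobs = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)


def n_clusters_found(labels):
    """Count the distinct labels, ignoring DBSCAN's noise label -1."""
    return len(set(labels.tolist()) - {-1})


kmeans_counts, agglo_counts, dbscan_counts = {}, {}, {}
for k in [2, 4]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans_counts[k] = n_clusters_found(km.fit_predict(X_knobs))
    ag = AgglomerativeClustering(n_clusters=k)
    agglo_counts[k] = n_clusters_found(ag.fit_predict(X_knobs))

for eps in [0.5, 1.5, 13.0]:
    db = DBSCAN(eps=eps, min_samples=3)
    dbscan_counts[eps] = n_clusters_found(db.fit_predict(X_knobs))

print("k-means:      ", kmeans_counts)
print("agglomerative:", agglo_counts)
print("DBSCAN by eps:", dbscan_counts)
```

With a very small `eps` every point is treated as noise, a moderate `eps` recovers the two squares, and a large `eps` merges everything into a single cluster.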