# Unsupervised Learning

Many instances of unsupervised learning, such as dimensionality reduction, manifold learning, and feature extraction, find a new representation of the input data without any additional input. In contrast to supervised learning, unsupervised algorithms don't require or consider target variables, unlike the classification and regression examples seen previously.


## Data Transformation

A very basic example is the rescaling of the data, which is a requirement for many machine learning algorithms because they are not scale-invariant. Rescaling falls into the category of data pre-processing and can barely be called *learning*. There exist many different rescaling techniques, and in the following example, we will take a look at a particular method that is commonly called "standardization." Here, we will rescale the data so that each feature is centered at zero (mean = 0) with unit variance (standard deviation = 1).

For example, if we have a 1D dataset with the values [1, 2, 3, 4, 5], the standardized values are

- 1 -> -1.41
- 2 -> -0.71
- 3 -> 0.0
- 4 -> 0.71
- 5 -> 1.41

computed via the equation $x_{standardized} = \frac{x - \mu_x}{\sigma_x}$,
where $\mu_x$ is the sample mean and $\sigma_x$ the standard deviation of $x$.
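The arithmetic behind the list above can be checked with a few lines of NumPy. This is a minimal sketch; it assumes the population standard deviation (`ddof=0`), which is the convention that reproduces the values shown.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)

mu = x.mean()            # 3.0
sigma = x.std(ddof=0)    # population standard deviation, sqrt(2)

# Apply x_standardized = (x - mu) / sigma element-wise
x_standardized = (x - mu) / sigma
print(np.round(x_standardized, 2))
```

After the transformation, the values have mean 0 and standard deviation 1.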

scikit-learn implements a `StandardScaler` class for this computation. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Applying such a preprocessing step uses an interface very similar to that of the supervised learning algorithms we have seen so far.
To get some more practice with scikit-learn's "Transformer" interface, let's start by loading the iris dataset and rescaling it:


In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd


In [6]:
#Load the iris dataset and convert it to a pandas DataFrame
#Split the dataset into a training set (70%) and a test set (30%)
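One possible way to approach this step is sketched below. The `random_state` value and the use of `stratify` are assumptions made here for reproducibility and balanced classes; the exercise does not prescribe them.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Features as a DataFrame, class labels as a Series
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# 70% training, 30% test; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(X_train.describe().loc[["mean", "std"]])
```

The last line prints the per-feature mean and standard deviation of the training set, which is also what the next cell asks about.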



In [7]:
#Inspect the mean and standard deviation of the features. 
#Is the dataset already "centered", or are the means non-zero and the standard deviations different for each feature?


In [3]:
#Use an unsupervised preprocessing method, such as scikit-learn's StandardScaler, to standardize the dataset
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
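A minimal sketch of the transformer interface follows: `fit` learns the per-feature mean and standard deviation, and `transform` applies them. The split parameters are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scaler = StandardScaler()
scaler.fit(X_train)                          # learn mean and std from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuse the training statistics

print("train mean:", X_train_scaled.mean(axis=0).round(2))
print("train std: ", X_train_scaled.std(axis=0).round(2))
print("test mean: ", X_test_scaled.mean(axis=0).round(2))
```

The training features end up with mean 0 and standard deviation 1. The test features are only approximately centered, because they are rescaled with statistics learned from the training set.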



In [8]:
#Inspect the mean and standard deviation of the rescaled dataset.



## Dimensionality Reduction: Principal Component Analysis


An unsupervised transformation that is somewhat more interesting is Principal Component Analysis (PCA).
It is a technique to reduce the dimensionality of the data, by creating a linear projection.
That is, we find new features to represent the data that are linear combinations of the old features. Thus, we can think of PCA as a projection of our data onto a *new* feature space.

The way PCA finds these new directions is by looking for the directions of maximum variance.
Usually only a few components that explain most of the variance in the data are kept. Here, the premise is to reduce the size (dimensionality) of a dataset while capturing most of its information. There are many reasons why dimensionality reduction can be useful: It can reduce the computational cost when running learning algorithms, decrease the storage space, and may help with the so-called "curse of dimensionality," which we will discuss in greater detail later.
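To make "directions of maximum variance" concrete, here is a NumPy sketch of one standard formulation: the principal directions are the eigenvectors of the covariance matrix of the centered data, ordered by eigenvalue. It illustrates the idea and is not meant to mirror how scikit-learn computes PCA internally.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Covariance matrix of the features (4 x 4 for iris)
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the two directions with the largest variance and project onto them
W = eigvecs[:, :2]
Z = X_centered @ W

explained_ratio = eigvals / eigvals.sum()
print("explained variance ratio:", explained_ratio.round(3))
```

Each column of `Z` is a linear combination of the original features, and its variance equals the corresponding eigenvalue.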


In [1]:
#Perform PCA on the Iris dataset and choose two principal components to reduce the dataset's dimensionality to 2.
#https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html


In [None]:
#Which is the correct method: conducting PCA on the complete dataset or applying PCA to the training set and then projecting the test set onto the derived principal components?

In [None]:
#Following the correct approach, transform the data by projecting it on the 2 principal components
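A sketch of the train-then-project workflow: the principal components are learned from the training set, and the test set is projected onto those same components. The split parameters are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)   # learn components from the training set
X_test_pca = pca.transform(X_test)         # project the test set onto the same components

print(X_train_pca.shape, X_test_pca.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Fitting on the training set only keeps information from the test set out of the learned transformation.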

In [None]:
#Visualize the projected data using a scatter plot, coloring data points from different classes differently.
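One way to draw such a plot with matplotlib is sketched below, using one `scatter` call per class so that each class gets its own color and legend entry. The axis labels and split parameters are choices made for this illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

pca = PCA(n_components=2).fit(X_train)
Z_train = pca.transform(X_train)

fig, ax = plt.subplots(figsize=(6, 4))
for label, name in enumerate(iris.target_names):
    mask = y_train == label            # select the samples of one class
    ax.scatter(Z_train[mask, 0], Z_train[mask, 1], label=name, alpha=0.8)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.legend()
```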




#Do the classes in this example exhibit linear separability? Is there any structure in the dataset that can be exploited for classification?



In [None]:
#Perform feature Standardization on the dataset. After standardization, apply PCA to reduce the dimensionality of the data.
#The correct way is to learn a scaling transformation from the training set and then apply that transformation to 
#both the training and testing sets before performing PCA for dimensionality reduction.
#Plot the reduced-dimensionality data using a scatter plot.
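The scale-then-project sequence can be chained with a scikit-learn pipeline, as in this sketch. Using `make_pipeline` is a convenience chosen here; calling the scaler and PCA separately works equally well.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Both steps are fitted on the training set only
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
Z_train = pipe.fit_transform(X_train)
Z_test = pipe.transform(X_test)    # applies the training-set scaling and components

print(Z_train.shape, Z_test.shape)
```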




In [None]:
#Perform Min-Max Scaling on the dataset and then apply PCA to reduce the dimensionality of the data.
#Plot the reduced-dimensionality data using a scatter plot.
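To compare the two normalization techniques without relying on plots alone, one option is to look at how much variance each principal component explains after each kind of scaling. A sketch under the same illustrative split as before:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    pipe = make_pipeline(scaler, PCA(n_components=2))
    pipe.fit(X_train)
    # make_pipeline names each step after its lowercased class name
    results[name] = pipe.named_steps["pca"].explained_variance_ratio_

for name, ratio in results.items():
    print(f"{name:>8}: {np.round(ratio, 3)}")
```

Because PCA looks for directions of maximum variance, changing the relative scale of the features changes the components it finds.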

In [None]:
#Observe the two scatter plots in the above steps. Does the choice of different feature normalization techniques have an impact on the output of PCA?

## Clustering

Clustering is the task of gathering samples into groups of similar
samples according to some predefined similarity or distance (dissimilarity)
measure, such as the Euclidean distance.
Some common applications of clustering algorithms include:
- Compression for data reduction
- Summarizing data as a preprocessing step for recommender systems
- Similarly:
   - grouping related web news (e.g. Google News) and web search results
   - grouping related stock quotes for investment portfolio management
   - building customer profiles for market analysis
- Building a code book of prototype samples for unsupervised feature extraction

One of the simplest clustering algorithms is K-means.
This is an iterative algorithm that searches for a predefined number of cluster
centers such that the distance from each point to its cluster center is
minimized. The standard implementation of K-means uses the Euclidean distance, which is why we want to make sure that all our variables are measured on the same scale if we are working with real-world datasets. Standardization, introduced earlier, is one technique to achieve this.
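The iteration can be sketched in NumPy as two alternating steps: assign each point to its nearest center, then move each center to the mean of its assigned points. The function names `assign_labels` and `kmeans_sketch` are hypothetical helpers written for this illustration; scikit-learn's `KMeans` adds smarter initialization and multiple restarts.

```python
import numpy as np

def assign_labels(X, centers):
    """Return the index of the nearest center (squared Euclidean distance) per sample."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_labels(X, centers)
        # Move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign_labels(X, centers)

# Two well-separated synthetic blobs as toy data
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(centers.round(2))
```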


In [4]:
# Perform KMeans clustering on the reduced dimensionality data obtained in the previous step (Output of the PCA)
#https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
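A sketch of K-means applied to standardized, PCA-reduced iris data is shown below. The choices `n_clusters=3`, `n_init=10`, and `random_state=0` are assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Reduce to two dimensions first, then cluster in the reduced space
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
Z_train = reducer.fit_transform(X_train)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(Z_train)

print(kmeans.cluster_centers_.round(2))
print(cluster_labels[:10])
```

The cluster indices are arbitrary: cluster 0 need not correspond to class 0, so comparing them directly with `y_train` is not meaningful without matching the labels first.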

In [5]:
#Choose different values of K and visualize the clusters for each K


In [6]:
#Suppose you have no prior knowledge of the number of classes (clusters) in the iris dataset.

#How do you find the optimal value of parameter K (the number of clusters)?

#Hint: The Elbow method is a "rule-of-thumb" approach to finding the optimal number of clusters. 

#Here, we compute the cluster dispersion for different values of k (it can be obtained from the inertia_ attribute of the fitted KMeans object)

#Next, we plot the dispersion values against the k values.

#Then, we pick the value that resembles the "pit of an elbow." 
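The elbow method described in the comments above could be sketched as follows. The range of k values (1 to 8) and the clustering settings are arbitrary choices made for this illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)   # labels are deliberately ignored
Z = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    inertias.append(km.inertia_)    # within-cluster sum of squared distances

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Inertia (cluster dispersion)")
```

The dispersion keeps shrinking as k grows, so the goal is not the minimum but the point where adding more clusters stops paying off noticeably.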



**Clustering comes with assumptions**: A clustering algorithm finds clusters by making assumptions about which samples should be grouped together. Each algorithm makes different assumptions, and the quality and interpretability of your results will depend on whether those assumptions are satisfied for your goal. K-means clustering, for example, assumes that all clusters have equal, spherical variance.

**In general, there is no guarantee that structure found by a clustering algorithm has anything to do with what you were interested in**.


### Some Notable Clustering Algorithms

The following are some well-known clustering algorithms available in the scikit-learn library:

- `sklearn.cluster.KMeans`: <br/>
    The simplest, yet effective clustering algorithm. Needs to be provided with the
    number of clusters in advance, and assumes that the input data is normalized
    (a PCA model can be used as a preprocessor).
- `sklearn.cluster.MeanShift`: <br/>
    Can find better-looking clusters than KMeans but does not scale to a large number of samples.
- `sklearn.cluster.DBSCAN`: <br/>
    Can detect irregularly shaped clusters based on density, i.e. sparse regions in
    the input space are likely to become inter-cluster boundaries. Can also detect
    outliers (samples that are not part of a cluster).
- `sklearn.cluster.AffinityPropagation`: <br/>
    Clustering algorithm based on message passing between data points.
- `sklearn.cluster.SpectralClustering`: <br/>
    KMeans applied to a projection of the normalized graph Laplacian: finds
    normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.
- `sklearn.cluster.AgglomerativeClustering` with `linkage="ward"` (formerly `sklearn.cluster.Ward`): <br/>
    Implements hierarchical clustering based on the Ward algorithm,
    a variance-minimizing approach. At each step, it minimizes the sum of
    squared differences within all clusters (inertia criterion).

Of these, SpectralClustering, DBSCAN, and AffinityPropagation can also work with precomputed similarity or distance matrices.
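As an illustration of the precomputed-matrix option, the sketch below passes a pairwise distance matrix to DBSCAN via `metric="precomputed"`. The `eps` and `min_samples` values are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Square matrix of Euclidean distances between all pairs of samples
D = pairwise_distances(X, metric="euclidean")

db = DBSCAN(eps=0.6, min_samples=5, metric="precomputed").fit(D)
labels = db.labels_

# DBSCAN marks outliers with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int(np.sum(labels == -1))
print("clusters:", n_clusters, "outliers:", n_outliers)
```

Any distance could be substituted for the Euclidean one here, which is the main reason to precompute the matrix.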

<img src="cluster_comparison.png" width="900">