Yann LeCun: "if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake"

The most common unsupervised learning task: **dimensionality reduction**

Other types of unsupervised learning tasks and algorithms:
- Clustering
    - goal: group similar instances into "clusters"
    - Uses: data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, and more.
- Anomaly detection
    - goal: learn what "normal" data looks like, and then use that to detect abnormal instances (anomalies) 
    - uses: detecting defective items on a production line or a new trend on a time series.
- Density estimation
    - goal: estimate the probability density function of the **random process** that generated the dataset.
    - uses: most commonly used in anomaly detection (instances in low density regions are anomalies)

# Clustering

Example situation
- Finding different but similar plants in the forest, but not knowing what species they are. I can group (cluster) them together because they are similar

Clustering is like classification, except that it is unsupervised (there are no labels).
It can use all of the features to decide which instances are similar.

Applications:
- Customer Segmentation
    - Cluster customers based on their purchases and activity on a website
    - Useful to understand who the customers are and what they need
    - Recommender systems can be built on this: recommend to a user content that other users in the same cluster enjoyed
- Data Analysis
    - Run a clustering algorithm on a dataset and analyze each cluster separately
- Dimensionality Reduction
    - Measure the **affinity** of each instance to each cluster in the dataset (affinity is any measure of how well an instance fits into a cluster). Replace the instance's feature vector with its affinity vector of length k, which normally has far fewer dimensions than the original feature vector, without losing much information. Overall, the dataset is reduced to k dimensions.
- Anomaly Detection
    - Detect unusual behavior of customers (fraud detection)
- Semi-supervised Learning
    - If only a few labels are available, propagate the labels to all instances in the same cluster. Run a supervised learning algorithm on the fully-labeled dataset
- Search Engines
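    - Searching for similar images
        - run a clustering algo on all images in the database so that similar images end up in the same cluster; when a user provides a reference image, find its cluster with the trained model and return the images from that cluster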
- Image Segmentation
    - Cluster pixels according to color and replace each pixel with the mean color of its cluster to reduce the number of different colors in the image
    - Makes it easier to detect the contour of each object
    - Used in object detection and tracking systems
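
As an illustration of the image segmentation use case, here is a minimal sketch of color-based segmentation with K-Means. It assumes a small synthetic random image as a stand-in for a real photo, and the number of colors (`n_colors = 8`) is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for a real photo: a 40x60 RGB image with values in [0, 1).
rng = np.random.default_rng(0)
image = rng.random((40, 60, 3))

# Treat every pixel as one instance with three features (R, G, B).
pixels = image.reshape(-1, 3)

n_colors = 8  # assumed number of color clusters
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the mean color (centroid) of its cluster.
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
```

The segmented image has the same shape as the original but contains at most `n_colors` distinct colors.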


 

## K-Means

Simple algorithm that clusters datasets with clear blobs (clusters) very quickly and efficiently.
Proposed by Stuart Lloyd at Bell Labs in 1957

In [1]:
from sklearn.cluster import KMeans
k = 5  # number of clusters; it helps to plot the dataset first to choose it
kmeans = KMeans(n_clusters=k)  # each predicted label is the index of the cluster the instance is assigned to (an integer from 0 to k - 1)
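
A minimal end-to-end sketch of fitting the model and reading its results. The dataset is an assumption for illustration (synthetic blobs from `make_blobs`), as are the two new points in `X_new`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with 5 blobs (assumed here only for illustration).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)

k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)  # cluster index of every training instance

# The learned centroids, one row per cluster.
print(kmeans.cluster_centers_)

# New instances are assigned to the cluster with the closest centroid.
X_new = np.array([[0.0, 2.0], [3.0, 3.0]])
new_labels = kmeans.predict(X_new)
```

The labels returned by `fit_predict()` are also stored in `kmeans.labels_`.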

K-Means does not behave well when the blobs have very different diameters, because it assigns each instance to a cluster based only on its distance to the centroid (the center of the cluster)
 
 Types of clustering:
 - hard clustering: only assigning an instance to one cluster
 - soft clustering: compute a score for each instance with respect to every cluster (the distance to the centroid, or a similarity/affinity score)
     - **can be very efficient to reduce dimensions**
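
A short sketch of hard vs. soft clustering in Scikit-Learn, assuming a synthetic 10-dimensional dataset. `predict()` gives the hard assignment, while `transform()` returns the distance from each instance to every centroid, which can serve as a k-dimensional representation of the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 10-dimensional dataset (assumed for illustration).
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

hard = kmeans.predict(X)         # hard clustering: one cluster index per instance
distances = kmeans.transform(X)  # soft clustering: distance to each of the 4 centroids

print(X.shape, "->", distances.shape)  # 10 features reduced to 4 distances
```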
     
     
     
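#### K-means algorithm:
- pick k random centroids -> label each instance with its closest centroid -> move each centroid to the mean of its instances -> repeat the last two steps until the centroids stop moving (the algorithm has converged)
- Time complexity: 
    - data has clustering structure: linear with regard to m, k, and n, where m = # of instances, k = # of clusters, n = # of dimensions
    - data doesn't have clustering structure: can increase exponentially with m in the worst case. *rarely happens in practice*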
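
The loop above can be sketched from scratch in NumPy. This is a simplified illustration, not Scikit-Learn's implementation: the function name `kmeans_lloyd`, the iteration cap, and the toy dataset are all assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Simplified K-Means loop (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: label each instance with the index of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy dataset: three Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans_lloyd(X, k=3)
```

Because the initial centroids are random, a different `seed` can lead to a different (possibly worse) solution.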

**note: the algorithm is guaranteed to converge, but it may converge to a local optimum rather than the optimal clusters, depending on the centroid initialization**

To increase the chance of reaching the optimal clustering:
##### Centroid initialization methods:
- if the approximate centroids are known (from a previous clustering run or from a visualization), pass them to the KMeans object in Scikit through the init hyperparameter and set n_init=1
- Run K-Means multiple times with different random initializations and keep the best solution. Controlled by the n_init hyperparameter (default=10 in older Scikit releases; newer releases default to n_init="auto")
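
A minimal sketch of both initialization options. The blob centers and the approximate centroids in `good_init` are made-up values for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data around three known centers (assumed for illustration).
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Option 1: approximate centroids are known, so pass them and run once.
good_init = np.array([[-4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])  # hypothetical guesses
kmeans_known = KMeans(n_clusters=3, init=good_init, n_init=1).fit(X)

# Option 2: run 10 random initializations and keep the lowest-inertia model.
kmeans_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

print(kmeans_known.inertia_, kmeans_random.inertia_)
```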

How is "best solution" measured between K-Means models?
- Metric **inertia**: the sum of the squared distances between each instance and its closest centroid
- The model with the lowest inertia is kept
- *a model's score() is the negative inertia in Scikit because of Scikit's rule "greater is better"*
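
To make the metric concrete, this sketch recomputes inertia by hand and compares it with the `inertia_` attribute and the `score()` method. The dataset is synthetic and only serves as an example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Distance from every instance to its closest centroid.
closest = kmeans.transform(X).min(axis=1)

# Inertia as Scikit-Learn reports it: the sum of the squared distances.
manual_inertia = np.sum(closest ** 2)

print(manual_inertia, kmeans.inertia_, kmeans.score(X))
```

The manual value should match `kmeans.inertia_` up to floating-point error, and `kmeans.score(X)` should be its negative.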

#### K-means++ (improves K-means initialization):
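Picks initial centroids that tend to be far from one another: the first centroid is chosen uniformly at random from the dataset, and each following one is sampled with probability proportional to its squared distance to the closest centroid already chosen. This makes a poor solution much less likely. It is the default initialization of KMeans in Scikit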
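
A simplified sketch of the K-means++ seeding idea in NumPy. The helper name `kmeans_plus_plus_init` is hypothetical, and the sketch leaves out refinements that a library implementation may add.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    """Simplified K-means++ seeding (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    # First centroid: one instance chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every instance to its closest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away instances are more likely to be picked.
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
init_centroids = kmeans_plus_plus_init(X, k=4)
```

The resulting array could then be passed to KMeans through `init=init_centroids` together with `n_init=1`.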

Other improvements for K-means:
- Accelerated K-means
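    - avoids many unnecessary distance calculations by exploiting the triangle inequality (a straight line is always the shortest distance between two points) and by keeping track of lower and upper bounds for the distances between instances and centroids
    - available in Scikit as algorithm="elkan"; it was the default for dense data in older releases, while newer releases default to algorithm="lloyd"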
- Mini-batch K-means:
    - Instead of using the full dataset at each iteration, use mini-batches to move the centroids slightly at each iteration
    - Typically speeds up the algorithm by 3x-4x
    - Makes it possible to cluster huge datasets that won't fit in memory
    - Implemented in Scikit as MiniBatchKMeans
    - downside: inertia is generally somewhat worse, especially as the number of clusters increases
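
A small sketch comparing the two estimators on the same data. The dataset size and `batch_size=256` are arbitrary assumptions; actual speed and inertia differences depend on the data.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

# Regular K-Means uses the full dataset at every iteration.
full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Mini-batch K-Means moves the centroids using one small batch at a time.
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3, random_state=0).fit(X)

print("KMeans inertia:         ", full.inertia_)
print("MiniBatchKMeans inertia:", mini.inertia_)
```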
    
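What if the dataset does not fit in memory (not just for K-means)?
- Option 1: use the memmap class in NumPy, which lets you work with an array stored on disk as if it were in memory
- Option 2: pass one mini-batch at a time to *partial_fit()* in MiniBatchKMeans
    - much more tedious: you have to run multiple initializations and keep the best one yourself. Prefer memmap when it is an option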
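
A minimal sketch of both options, assuming a small synthetic dataset written to a temporary file named `blobs.dat` (a hypothetical name). In a real out-of-memory setting the file would already exist on disk and be far larger than this.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Write the data to a binary file on disk (hypothetical temporary file).
path = os.path.join(tempfile.mkdtemp(), "blobs.dat")
X_disk = np.memmap(path, dtype="float64", mode="w+", shape=X.shape)
X_disk[:] = X
X_disk.flush()

# Option 1: reopen the file as a memmap and fit on it like a regular array.
X_mm = np.memmap(path, dtype="float64", mode="r+", shape=X.shape)
mbk = MiniBatchKMeans(n_clusters=4, batch_size=200, n_init=3, random_state=0).fit(X_mm)

# Option 2: feed one mini-batch at a time with partial_fit().
mbk_manual = MiniBatchKMeans(n_clusters=4, n_init=3, random_state=0)
for start in range(0, X_mm.shape[0], 200):
    mbk_manual.partial_fit(X_mm[start:start + 200])
```

With option 2 there is a single initialization taken from the first batch, so trying several initializations and keeping the best model is left to you.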
    
    
### Methods for choosing the right number of clusters k:
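*Picking the k with the lowest inertia will not work: inertia keeps decreasing as k increases, because the more clusters there are, the closer each instance is to its nearest centroid. A k higher than the optimal number therefore still yields a lower inertia.*
1. Manual way: plot inertia vs. k and find the "elbow," the point where inertia stops dropping quickly. Use the k at that elbow
2. More precise way: **silhouette score**, the mean over all instances of the silhouette coefficient (b-a)/max(a,b). Plot the silhouette score vs. k. The k with the higher score is better
    - a is the mean distance to the other instances in the same cluster
    - b is the mean distance to the instances of the next closest cluster (the one that minimizes b, excluding the instance's own cluster)
    - the coefficient ranges from -1 to +1: close to +1 means the instance is well inside its own cluster, close to 0 means it is near a cluster boundary, and close to -1 means it may be in the wrong cluster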

In [3]:
from sklearn.metrics import silhouette_score
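
A sketch of how inertia and the silhouette score could be compared across several values of k. The dataset is synthetic, with four well-separated blobs chosen so that the expected answer is clear, and the range of k is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated blobs (assumed for illustration).
centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_                         # keeps dropping as k grows
    silhouettes[k] = silhouette_score(X, model.labels_)  # peaks at a good k

best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

Plotting `inertias` and `silhouettes` against k gives the elbow curve and the silhouette curve described above.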

### Limits of clustering
- K-means does not perform well on elliptical (non-spherical) clusters. For such data, use Gaussian mixture models instead

### Make K-means perform better
- scale the input features before running K-means; otherwise the clusters may be very stretched. Scaling does not guarantee spherical clusters, but it makes them more likely
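
One way to apply this is to chain a scaler and K-Means in a pipeline. The sketch below assumes StandardScaler as the scaling method and uses a synthetic dataset in which one feature is deliberately inflated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X[:, 1] *= 1000  # exaggerate the second feature's scale (assumed unit mismatch)

# Standardize each feature, then cluster the scaled data.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)
```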