# Unsupervised Learning in Python

Unsupervised learning is so called because you do not guide, or supervise, the pattern discovery with a prediction task; instead, you uncover hidden structure in unlabeled data. It encompasses a variety of machine learning techniques, from clustering to dimension reduction to matrix factorization. In this course, you'll learn the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy. You will learn how to cluster, transform, visualize, and extract insights from unlabeled datasets, and end the course by building a recommender system to recommend popular musical artists.

Outline:
1. Clustering for dataset exploration
2. Visualization with hierarchical clustering and t-SNE
3. Decorrelating your data and dimension reduction - PCA
4. Discovering interpretable features

# 1. Clustering for dataset exploration
Discover underlying groups (or "clusters") in a dataset.

k-means clustering
- finds clusters of samples

In [None]:
# k-means clustering in sklearn
print(samples)

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
# Output of model.fit(samples): KMeans(algorithm='auto', ...)
labels = model.predict(samples)
print(labels)

### Cluster labels for new samples
- new samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- find the nearest centroid to each new sample
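
The nearest-centroid step above can be sketched with plain NumPy. This is only an illustration of the idea, not scikit-learn's internal code; the `centroids` and `new_samples` values are made up for the example.

```python
import numpy as np

# Hypothetical centroids and new samples, invented for illustration
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
new_samples = np.array([[0.2, -0.1], [4.8, 5.3], [0.5, 4.0]])

# Euclidean distance from every new sample to every centroid,
# giving an array of shape (n_samples, n_centroids)
diffs = new_samples[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Each sample is labeled with the index of its nearest centroid
new_labels = distances.argmin(axis=1)
print(new_labels)  # [0 1 2]
```

For a fitted model, `model.predict(new_samples)` performs this assignment using `model.cluster_centers_`.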

In [None]:
# cluster labels for new samples
print(new_samples)

new_labels = model.predict(new_samples)
print(new_labels)

### Scatter plots to visualize
Use the Iris dataset:
- scatter plot of sepal length vs petal length
- each point represents an iris sample
- color points by cluster labels
- use pyplot (matplotlib.pyplot)

In [None]:
# scatter plot
import matplotlib.pyplot as plt
# sepal length is at 0
xs = samples[:,0]
# petal length is at 2
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()

### How many clusters?
You are given an array 'points' of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.

matplotlib.pyplot has already been imported as plt. In the IPython Shell:

- Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
- Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
- Make a scatter plot by passing xs and ys to the plt.scatter() function.
- Call the plt.show() function to show your plot.
How many clusters do you see?

In [None]:
In [1]: xs=points[:,0]

In [2]: ys=points[:,1]

In [3]: plt.scatter(xs,ys)
Out[3]: <matplotlib.collections.PathCollection at 0x7f41fd3c0d30>

In [4]: plt.show()

# looks like 3 clusters

### Clustering 2D points
From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given the array points from the previous exercise, and also an array new_points.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)

# [2 1 0 2 1 2 1 1 1 0 2 1 1 0 0 1 0 0 1 1 0 1 2 1 2 0 1 0 0 2 2 1 1 1 0 2 1
#  1 2 1 0 2 2 0 2 1 0 0 1 1 1 1 0 0 2 2 0 0 0 2 2 1 1 1 2 1 0 1 2 0 2 2 2 1
#  2 0 0 2 1 0 2 0 2 1 0 1 0 2 1 1 1 2 1 1 2 0 0 0 0 2 1 2 0 0 2 2 1 2 0 0 2
#  0 0 0 1 1 1 1 0 0 1 2 1 0 1 2 0 1 0 0 1 0 1 0 2 1 2 2 1 0 2 1 2 2 0 1 1 2
#  0 2 0 1 2 0 0 2 0 1 1 0 1 0 0 1 1 2 1 1 0 2 0 2 2 1 2 1 1 2 2 0 2 2 2 0 1
#  1 2 0 2 0 0 1 1 1 2 1 1 1 0 0 2 1 2 2 2 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0
#  1 1 2 0 2 2 0 2 0 2 0 1 1 0 1 1 1 0 2 2 0 1 1 0 1 0 0 1 0 0 2 0 2 2 2 1 0
#  0 0 2 1 2 0 2 0 0 1 2 2 2 0 1 1 1 2 1 0 0 1 2 2 0 2 2 0 2 1 2 0 0 0 0 1 0
#  0 1 1 2]

### Inspect your clustering with scatter plot
Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so 'new_points' is an array of points and 'labels' is the array of their cluster labels.

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.5)

# Assign the coordinates of the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
# D marker = diamonds, size of markers 50
plt.scatter(centroids_x, centroids_y,marker='D',s=50)
plt.show()

### Evaluating a clustering
- How can you check the quality of the clustering?
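
One common answer, offered here as an assumption since this section does not name a measure, is inertia: the sum of squared distances from each sample to its nearest centroid. Lower inertia means tighter clusters. The sketch below uses synthetic data as a stand-in for `points` and reads the `inertia_` attribute that scikit-learn's `KMeans` sets after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for `points`: three blobs of 100 samples each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Fit k-means for several cluster counts and record the inertia of each fit
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    inertias.append(model.inertia_)
    print(k, round(model.inertia_, 1))
```

Inertia keeps falling as clusters are added, so a frequent rule of thumb is to pick the number of clusters where the decrease starts to level off.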

# 2. Visualization with hierarchical clustering and t-SNE
Two unsupervised learning techniques for data visualization:
1. hierarchical clustering
2. t-SNE

Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. 
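
A minimal sketch of this merging process, assuming SciPy's `linkage` and `fcluster` functions and a small synthetic `samples` array invented for the example. The choice of `method='complete'` is also an assumption; other linkage methods exist.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic samples: two well-separated groups of 5 points each
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(5, 2)),
])

# Each row of `mergings` records one merge of two clusters into a coarser one
mergings = linkage(samples, method='complete')
print(mergings.shape)  # (9, 4): 10 samples need 9 merges

# Cut the hierarchy to recover 2 flat clusters
labels = fcluster(mergings, t=2, criterion='maxclust')
print(labels)
```

Passing `mergings` to `scipy.cluster.hierarchy.dendrogram` draws the tree with matplotlib.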

t-SNE maps the data samples into 2d space so that the proximity of the samples to one another can be visualized.
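
As a rough sketch, scikit-learn's `TSNE` can produce such a 2D map. The synthetic 4-dimensional `samples` and the `perplexity` and `learning_rate` values below are assumptions chosen for the example, not recommended settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic 4-dimensional samples: two groups of 30 points each
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=8.0, scale=0.5, size=(30, 4)),
])

# Map the samples to 2 dimensions; TSNE has no separate transform() method,
# so fitting and transforming happen together in fit_transform()
model = TSNE(n_components=2, perplexity=10, learning_rate=100, random_state=0)
tsne_features = model.fit_transform(samples)
print(tsne_features.shape)  # (60, 2)
```

The two columns of `tsne_features` can be passed to `plt.scatter()` as xs and ys. The axes of a t-SNE plot have no interpretable meaning; only the relative proximity of points does.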

# 3. Decorrelating your data and dimension reduction - PCA
Dimension reduction summarizes a dataset using its commonly occurring patterns.

Learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, you'll employ a variant of PCA that will allow you to cluster Wikipedia articles by their content!
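
To illustrate the decorrelating effect of PCA, here is a small sketch using scikit-learn's `PCA` on synthetic data; the two correlated features are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second feature is strongly correlated with the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.3, size=200)
samples = np.column_stack([x, y])

# Fit PCA and rotate the samples onto the principal components
model = PCA()
pca_features = model.fit_transform(samples)

# Correlation between the two columns, before and after the transformation
corr_before = np.corrcoef(samples, rowvar=False)[0, 1]
corr_after = np.corrcoef(pca_features, rowvar=False)[0, 1]
print(corr_before, corr_after)

# Share of the total variance carried by each principal component
print(model.explained_variance_ratio_)
```

Because the first component carries nearly all the variance here, keeping only that component (`PCA(n_components=1)`) would reduce the dimension with little loss of information.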

# 4. Discovering interpretable features
Learn about a dimension reduction technique called "Non-negative matrix factorization" ("NMF") that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images in terms of commonly occurring visual patterns. You'll also learn to use NMF to build recommender systems that can find you similar articles to read, or musical artists that match your listening history!
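
A minimal sketch of the idea, assuming scikit-learn's `NMF` and a tiny made-up word-count matrix (rows are documents, columns are words). The `init` and `max_iter` settings are choices for this example only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical word counts: 4 documents x 5 words, invented for illustration.
# Documents 0-1 use mostly words 0-1; documents 2-3 use mostly words 2-3.
articles = np.array([
    [3, 2, 0, 0, 1],
    [4, 3, 0, 0, 0],
    [0, 0, 5, 4, 1],
    [0, 0, 3, 5, 0],
], dtype=float)

# Factor the matrix into 2 non-negative "topics"
model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
nmf_features = model.fit_transform(articles)  # documents expressed in topics
components = model.components_                # topics expressed in words

print(nmf_features.shape, components.shape)  # (4, 2) (2, 5)

# The product of the two factors approximates the original matrix
reconstruction = nmf_features @ components
print(np.round(reconstruction, 1))
```

NMF requires all sample features to be non-negative, which is why it suits data such as word counts or pixel intensities.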
