# Unsupervised Learning in Python

Unsupervised learning is so called because you are not guiding, or supervising, the pattern discovery with a prediction task; instead, you uncover hidden structure in unlabeled data. It encompasses a variety of machine learning techniques, from clustering to dimension reduction to matrix factorization. In this course, you'll learn the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy. You will learn how to cluster, transform, visualize, and extract insights from unlabeled datasets, and end the course by building a recommender system to recommend popular musical artists.

Outline:
1. Clustering for dataset exploration
2. Visualization with hierarchical clustering and t-SNE
3. Decorrelating your data and dimension reduction - PCA
4. Discovering interpretable features

# 1. Clustering for dataset exploration
Discover underlying groups (or "clusters") in a dataset.

k-means clustering
- finds clusters of samples

In [None]:
# k-means clustering in sklearn
print(samples)

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
# Output: KMeans(algorithm='auto', ...)
labels = model.predict(samples)
print(labels)

### 1.1 Cluster labels for new samples
- new samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- find the nearest centroid to each new sample

In [None]:
# cluster labels for new samples
print(new_samples)

new_labels = model.predict(new_samples)
print(new_labels)
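
The nearest-centroid rule described above can be checked by hand. The following is a minimal sketch, assuming scikit-learn's bundled iris measurements as a stand-in for `samples` and `new_samples` (the even/odd split is an arbitrary choice for illustration, not part of the original exercise). It compares a manual nearest-centroid assignment with what `predict()` returns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: iris data stands in for `samples` and `new_samples`
data = load_iris().data
samples, new_samples = data[::2], data[1::2]

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# Distance from every new sample to every centroid: shape (n_new, 3)
diffs = new_samples[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Index of the nearest centroid for each new sample
manual_labels = distances.argmin(axis=1)

# Compare the manual assignment with the model's own prediction
print(np.array_equal(manual_labels, model.predict(new_samples)))
```

Fixing `random_state` only makes the run repeatable; the cluster numbering itself is arbitrary.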

### 1.2 Scatter plots to visualize
Use Iris dataset
- scatter plot of sepal length vs petal length
- each point represents an iris sample
- color points by cluster labels
- Use PyPlot (matplotlib.pyplot)

In [None]:
# scatter plot
import matplotlib.pyplot as plt
# sepal length is at 0
xs = samples[:,0]
# petal length is at 2
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()

### 1.3 How many clusters?
You are given an array 'points' of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.

matplotlib.pyplot has already been imported as plt. In the IPython Shell:

Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
Make a scatter plot by passing xs and ys to the plt.scatter() function.
Call the plt.show() function to show your plot.
How many clusters do you see?

In [None]:
In [1]: xs=points[:,0]

In [2]: ys=points[:,1]

In [3]: plt.scatter(xs,ys)
Out[3]: <matplotlib.collections.PathCollection at 0x7f41fd3c0d30>

In [4]: plt.show()

# looks like 3 clusters

### 1.4 Clustering 2D points
From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given the array points from the previous exercise, and also an array new_points.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)

# [2 1 0 2 1 2 1 1 1 0 2 1 1 0 0 1 0 0 1 1 0 1 2 1 2 0 1 0 0 2 2 1 1 1 0 2 1
#  1 2 1 0 2 2 0 2 1 0 0 1 1 1 1 0 0 2 2 0 0 0 2 2 1 1 1 2 1 0 1 2 0 2 2 2 1
#  2 0 0 2 1 0 2 0 2 1 0 1 0 2 1 1 1 2 1 1 2 0 0 0 0 2 1 2 0 0 2 2 1 2 0 0 2
#  0 0 0 1 1 1 1 0 0 1 2 1 0 1 2 0 1 0 0 1 0 1 0 2 1 2 2 1 0 2 1 2 2 0 1 1 2
#  0 2 0 1 2 0 0 2 0 1 1 0 1 0 0 1 1 2 1 1 0 2 0 2 2 1 2 1 1 2 2 0 2 2 2 0 1
#  1 2 0 2 0 0 1 1 1 2 1 1 1 0 0 2 1 2 2 2 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0
#  1 1 2 0 2 2 0 2 0 2 0 1 1 0 1 1 1 0 2 2 0 1 1 0 1 0 0 1 0 0 2 0 2 2 2 1 0
#  0 0 2 1 2 0 2 0 0 1 2 2 2 0 1 1 1 2 1 0 0 1 2 2 0 2 2 0 2 1 2 0 0 0 0 1 0
#  0 1 1 2]

### 1.5 Inspect your clustering with scatter plot
Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so 'new_points' is an array of points and 'labels' is the array of their cluster labels.

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.5)

# Assign the cluster centers: centroids
# Compute the coordinates of the centroids 
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
# D marker = diamonds, size of markers 50
plt.scatter(centroids_x, centroids_y,marker='D',s=50)
plt.show()

### 1.6 Evaluating a clustering
- How can you check the quality of the clustering?
    - One option: check correspondence with known labels, e.g. the iris species
- If there are no species to check against, measure the quality of the clustering directly
- A quality measure also informs the choice of how many clusters to look for

Iris: clusters vs species
- k-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?

Cross tabulation with pandas
- clusters vs species is a "cross-tabulation"
- Use pandas library


In [None]:
# Build a crosstab

# align labels and species
import pandas as pd
df = pd.DataFrame({'labels':labels, 'species':species})
print(df)

# crosstab of labels and species
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
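
The cell above relies on `labels` and `species` already being defined. As a self-contained sketch, assuming scikit-learn's bundled iris dataset supplies both the samples and the species names, the whole cross-tabulation can be reproduced like this:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the bundled iris dataset provides samples and species
iris = load_iris()
samples = iris.data
species = iris.target_names[iris.target]  # species name for each sample

# Cluster into 3 groups and get one label per sample
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Rows are cluster labels, columns are species, cells are sample counts
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

A row dominated by a single column indicates a cluster that corresponds closely to one species.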


### 1.7 Measuring clustering quality
- Using only samples and their cluster labels
- a good clustering has tight clusters

Inertia measures clustering quality
- measures how spread out the clusters are (lower is better)
- sum of squared distances from each sample to the centroid of its cluster
- after fit(), available as attribute inertia_


In [None]:
# inertia measure for clustering quality
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
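
To make the definition concrete, here is a minimal sketch that recomputes inertia by hand as the sum of squared distances from each sample to its assigned centroid. It assumes the bundled iris data as a stand-in for `samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: iris measurements stand in for `samples`
samples = load_iris().data

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Centroid assigned to each sample: shape (n_samples, n_features)
assigned_centroids = model.cluster_centers_[labels]

# Sum of squared distances from each sample to its own centroid
manual_inertia = ((samples - assigned_centroids) ** 2).sum()

# The two values should agree up to floating-point error
print(manual_inertia, model.inertia_)
```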


### 1.8 How many clusters to choose
- a good clustering has tight clusters (so low inertia)
- but not too many clusters
- choose an "elbow" in the inertia plot
    - elbow = where inertia begins to decrease more slowly
    - e.g. for the iris dataset, 3 is a good choice
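
One way to look for the elbow without a plot is to print how much the inertia falls each time a cluster is added. This is a sketch under the assumption that the bundled iris data stands in for `samples`; the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: iris measurements stand in for `samples`
samples = load_iris().data

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# Drop in inertia gained by adding one more cluster
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks, drops):
    print(f"{k} -> {k + 1} clusters: inertia falls by {drop:.1f}")
```

The elbow is the value of k after which the drops become comparatively small.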

### 1.9 How many clusters of grain?
You have seen how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

KMeans and PyPlot (plt) have already been imported for you.

This dataset was sourced from the UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/seeds

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

# The inertia decreases very slowly from 3 clusters to 4, 
# so it looks like 3 clusters would be a good choice for this data.

### 1.10 Evaluating the grain clustering
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". 

In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array samples of grain samples, and a list varieties giving the grain variety for each sample. Pandas (pd) and KMeans have already been imported for you.

In [None]:
# create clusters and compare in cross-tab

# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])

# Display ct
print(ct)


### 1.11 Transforming features for better clusterings
- Piedmont wines dataset
- 178 samples from 3 distinct varieties of red wine
- features measure chemical composition (e.g. alcohol content) and color intensity

In [None]:
# Clustering the wines
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)
# clusters vs varieties cross-tab
df = pd.DataFrame({'labels':labels,
                  'varieties':varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

# Feature variances are very different, so the crosstab shows the clustering is not good


#### 1.11.a StandardScaler
- In kmeans, feature variance = feature influence
- the features need to be transformed to have equal variance
- StandardScaler transforms each feature to have mean 0 and variance 1
- Features are "standardized"

In [None]:
# sklearn StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
# Output: StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
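
The effect of standardization can be verified directly. The sketch below assumes scikit-learn's bundled wine dataset as a stand-in for `samples`, and prints the per-feature variance before scaling and the mean and variance after.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset stands in for `samples`
samples = load_wine().data

# Variances differ by orders of magnitude before scaling
print(samples.var(axis=0).round(2))

scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# After scaling, every feature has mean ~0 and variance ~1
scaled_means = samples_scaled.mean(axis=0)
scaled_vars = samples_scaled.var(axis=0)
print(scaled_means.round(6))
print(scaled_vars.round(6))
```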

In [None]:
# 2 steps with pipeline: StandardScaler, then KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
# fit scaler and kmeans
pipeline.fit(samples)
# get cluster labels
labels = pipeline.predict(samples)

# Feature standardization improves clustering
df = pd.DataFrame({'labels':labels,
                  'varieties':varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

# big improvement in clustering after standardization
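
For a side-by-side comparison that runs on its own, the following sketch clusters the wines with and without standardization. It assumes scikit-learn's bundled wine dataset (178 samples, 3 classes) as a stand-in for `samples` and `varieties`; the class names it prints are the generic ones shipped with that dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset stands in for samples/varieties
wine = load_wine()
samples = wine.data
varieties = wine.target_names[wine.target]

# KMeans on the raw features
raw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(samples)
ct_raw = pd.crosstab(raw_labels, varieties,
                     rownames=['labels'], colnames=['varieties'])

# StandardScaler followed by KMeans in a single pipeline
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=3, n_init=10, random_state=0))
scaled_labels = pipeline.fit_predict(samples)
ct_scaled = pd.crosstab(scaled_labels, varieties,
                        rownames=['labels'], colnames=['varieties'])

print(ct_raw)
print(ct_scaled)
```

In the second table each cluster should be dominated far more clearly by a single variety.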

####  1.11.b sklearn preprocessing steps - few options
- StandardScaler is a "preprocessing" step
- MaxAbsScaler
- Normalizer
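
As a small illustration of one of these alternatives, the sketch below applies `MaxAbsScaler` to a made-up array (the numbers are hypothetical). It divides each feature by its largest absolute value, so every column ends up in the range [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical data: two features on very different scales
X = np.array([[1.0, -200.0],
              [2.0, 400.0],
              [-4.0, 100.0]])

# Each column is divided by its maximum absolute value (4 and 400 here)
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled)
```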

####  1.11.c Practice: Scaling fish data for clustering
You are given an array samples giving measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, you'll need to standardize these features first. In this exercise, you'll build a pipeline to standardize and cluster the data.

These fish measurement data were sourced from the Journal of Statistics Education.
http://ww2.amstat.org/publications/jse/jse_data_archive.htm

In [None]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)


####  1.11.d Practice: Clustering the fish data
You'll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

As before, samples is the 2D array of fish measurements. Your pipeline is available as pipeline, and the species of every fish sample is given by the list species.


In [None]:
# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels,'species':species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)

# species  Bream  Pike  Roach  Smelt
# labels                            
# 0           33     0      1      0
# 1            1     0     19      1
# 2            0    17      0      0
# 3            0     0      0     13

###  1.12 Clustering stocks using Normalizer and KMeans
In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). You are given a NumPy array movements of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, include a Normalizer at the beginning of your pipeline. The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.

Note that Normalizer() is different to StandardScaler(), which you used in the previous exercise. 
- While StandardScaler() standardizes features (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, 
- Normalizer() rescales each sample - here, each company's stock price - independently of the others.
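
The difference between the two transformers can be seen on a tiny made-up array. This is a sketch with hypothetical numbers: the second "company" moves exactly like the first, only at 100 times the price.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical price movements: rows = companies, columns = trading days
movements = np.array([[1.0, -2.0, 2.0],
                      [100.0, -200.0, 200.0],
                      [3.0, 0.0, -4.0]])

# Normalizer works row by row: each company is rescaled to unit L2 norm
normalized = Normalizer().fit_transform(movements)
row_norms = np.linalg.norm(normalized, axis=1)
print(normalized)
print(row_norms)

# StandardScaler works column by column: each day gets mean 0, variance 1
standardized = StandardScaler().fit_transform(movements)
column_means = standardized.mean(axis=0)
print(column_means)
```

After normalization the first two rows are identical, which is why the cheap and the expensive stock can land in the same cluster.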

KMeans and make_pipeline have already been imported for you.

In [None]:
# make pipeline

# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)


####  1.12.a Which stocks move together?
In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.

Your solution to the previous exercise has already been run. Recall that you constructed a Pipeline pipeline containing a KMeans model and fit it to the NumPy array movements of daily stock movements. In addition, a list companies of the company names is available.

In [None]:
# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))

                             companies  labels
45                                Sony       0
34                          Mitsubishi       0
48                              Toyota       0
7                                Canon       0
21                               Honda       0
32                                  3M       1
35                            Navistar       1
44                        Schlumberger       1
13                   DuPont de Nemours       1
12                             Chevron       1
0                                Apple       1
47                            Symantec       1
8                          Caterpillar       1
50  Taiwan Semiconductor Manufacturing       1
51                   Texas instruments       1
53                       Valero Energy       1
57                               Exxon       1
10                      ConocoPhillips       1
23                                 IBM       1
6             British American Tobacco       2
41                       Philip Morris       2
19                     GlaxoSmithKline       2
36                    Northrop Grumman       3
29                     Lookheed Martin       3
4                               Boeing       3
15                                Ford       4
5                      Bank of America       4
26                      JPMorgan Chase       4
1                                  AIG       4
58                               Xerox       4
16                   General Electrics       4
55                         Wells Fargo       4
3                     American express       4
18                       Goldman Sachs       4
54                            Walgreen       5
39                              Pfizer       5
56                            Wal-Mart       5
27                      Kimberly-Clark       5
25                   Johnson & Johnson       5
40                      Procter Gamble       5
2                               Amazon       6
59                               Yahoo       6
42                   Royal Dutch Shell       7
30                          MasterCard       7
31                           McDonalds       7
20                          Home Depot       7
52                            Unilever       7
49                               Total       7
37                            Novartis       7
46                      Sanofi-Aventis       7
43                                 SAP       7
22                                  HP       8
17                     Google/Alphabet       8
33                           Microsoft       8
11                               Cisco       8
14                                Dell       8
24                               Intel       8
9                    Colgate-Palmolive       9
28                           Coca Cola       9
38                               Pepsi       9

# 2. Visualization with hierarchical clustering and t-SNE
This chapter covers two unsupervised learning techniques for data visualization:
1. hierarchical clustering
2. t-SNE

Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. 

t-SNE maps the data samples into 2d space so that the proximity of the samples to one another can be visualized.
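
Both techniques can be sketched in a few lines. The example below assumes the bundled iris data as a stand-in for the samples, and the choices of `method='complete'` and three flat clusters are illustrative assumptions rather than anything prescribed above. Plotting is omitted: `dendrogram` is called with `no_plot=True`, which only computes the tree layout.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Assumption: iris measurements stand in for the samples
samples = load_iris().data

# Hierarchical clustering: each row of `mergings` records one merge
mergings = linkage(samples, method='complete')

# Tree layout without drawing it; drop no_plot to render with matplotlib
tree = dendrogram(mergings, no_plot=True)

# Cut the hierarchy into at most 3 flat clusters
labels = fcluster(mergings, 3, criterion='maxclust')

# t-SNE: map the 4-dimensional samples to 2 dimensions
tsne = TSNE(n_components=2, random_state=0)
embedded = tsne.fit_transform(samples)

print(mergings.shape, embedded.shape)
```

The two columns of `embedded` can be passed to `plt.scatter` to see which samples lie close together.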

# 3. Decorrelating your data and dimension reduction - PCA
Dimension reduction summarizes a dataset using its commonly occurring patterns.

Learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, you'll employ a variant of PCA that will allow you to cluster Wikipedia articles by their content!
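
As a rough sketch of what "decorrelating" and "dimension reduction" mean in practice, the example below runs PCA on the bundled iris data (an assumption made so the code is self-contained). It checks that the transformed features are uncorrelated and then keeps only the first two components.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: iris measurements stand in for the samples
samples = load_iris().data

# Full PCA: rotate the data onto its principal components
pca = PCA()
pca_features = pca.fit_transform(samples)

# Decorrelation: covariance between different PCA features is ~0
cov = np.cov(pca_features, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print(np.abs(off_diagonal).max())

# Share of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Dimension reduction: keep only the 2 highest-variance components
reduced = PCA(n_components=2).fit_transform(samples)
print(reduced.shape)
```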

# 4. Discovering interpretable features
Learn about a dimension reduction technique called "Non-negative matrix factorization" ("NMF") that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images in terms of commonly occurring visual patterns. You'll also learn to use NMF to build recommender systems that can find you similar articles to read, or musical artists that match your listening history!
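
The idea of "combinations of parts" can be sketched on a tiny made-up matrix. The word counts below are hypothetical, and the choices of `n_components=2`, `init='nndsvda'`, and `max_iter=1000` are illustrative assumptions. NMF factors the non-negative sample matrix into non-negative features and components whose product approximates the original.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical word counts: rows = documents, columns = words
samples = np.array([[3.0, 2.0, 0.0, 0.0],
                    [6.0, 4.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 5.0],
                    [0.0, 0.0, 2.0, 10.0],
                    [3.0, 2.0, 1.0, 5.0]])

model = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)

# One row of features per document: how much of each "topic" it contains
nmf_features = model.fit_transform(samples)

# One row per "topic": how strongly it uses each word
components = model.components_

# Multiplying the two factors gives an approximation of the samples
reconstruction = nmf_features @ components
print(nmf_features.round(2))
print(components.round(2))
print(np.abs(samples - reconstruction).max())
```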
