# Unsupervised Learning in Python

This is the world of unsupervised learning, so called because you are not guiding, or supervising, the pattern discovery by some prediction task, but instead uncovering hidden structure from unlabeled data. Unsupervised learning encompasses a variety of techniques in machine learning, from clustering to dimension reduction to matrix factorization. In this course, you'll learn the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy. You will learn how to cluster, transform, visualize, and extract insights from unlabeled datasets, and end the course by building a recommender system to recommend popular musical artists.

Outline:
1. Clustering for dataset exploration
2. Visualization with hierarchical clustering and t-SNE
3. Decorrelating your data and dimension reduction - PCA
4. Discovering interpretable features

# 1. Clustering for dataset exploration
Discover underlying groups (or "clusters") in a dataset.

k-means clustering
- finds clusters of samples

In [None]:
# k-means clustering in sklearn
print(samples)

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
# Output: KMeans(algorithm='auto', ...)
labels = model.predict(samples)
print(labels)

### 1.1 Cluster labels for new samples
- new samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- find the nearest centroid to each new sample
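
The nearest-centroid rule above can be checked by hand. The sketch below is only an illustration: it uses synthetic blobs as stand-ins for `samples` and `new_samples` (the data, seed, and `n_init` value are assumptions, not part of the course material) and compares a manual nearest-centroid lookup with `predict()`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for `samples` and `new_samples` (hypothetical data)
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
samples = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])
new_samples = np.array([[0.2, -0.1], [4.8, 5.3], [0.1, 4.7]])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# Distance from every new sample to every centroid: shape (3 samples, 3 centroids)
diffs = new_samples[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# Index of the closest centroid for each new sample
nearest = distances.argmin(axis=1)
predicted = model.predict(new_samples)
print(nearest, predicted)
```

Both arrays should agree, since `predict()` assigns each sample to its closest centroid.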

In [None]:
# cluster labels for new samples
print(new_samples)

new_labels = model.predict(new_samples)
print(new_labels)

### 1.2 Scatter plots to visualize
Use Iris dataset
- scatter plot of sepal length vs petal length
- each point represents an iris sample
- color points by cluster labels
- Use PyPlot (matplotlib.pyplot)

In [None]:
# scatter plot
import matplotlib.pyplot as plt
# sepal length is at 0
xs = samples[:,0]
# petal length is at 2
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()

### 1.3 How many clusters?
You are given an array 'points' of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.

matplotlib.pyplot has already been imported as plt. In the IPython Shell:

Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
Make a scatter plot by passing xs and ys to the plt.scatter() function.
Call the plt.show() function to show your plot.
How many clusters do you see?

In [None]:
In [1]: xs=points[:,0]

In [2]: ys=points[:,1]

In [3]: plt.scatter(xs,ys)
Out[3]: <matplotlib.collections.PathCollection at 0x7f41fd3c0d30>

In [4]: plt.show()

# looks like 3 clusters

### 1.4 Clustering 2D points
From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.

You are given the array points from the previous exercise, and also an array new_points.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)

# [2 1 0 2 1 2 1 1 1 0 2 1 1 0 0 1 0 0 1 1 0 1 2 1 2 0 1 0 0 2 2 1 1 1 0 2 1
#  1 2 1 0 2 2 0 2 1 0 0 1 1 1 1 0 0 2 2 0 0 0 2 2 1 1 1 2 1 0 1 2 0 2 2 2 1
#  2 0 0 2 1 0 2 0 2 1 0 1 0 2 1 1 1 2 1 1 2 0 0 0 0 2 1 2 0 0 2 2 1 2 0 0 2
#  0 0 0 1 1 1 1 0 0 1 2 1 0 1 2 0 1 0 0 1 0 1 0 2 1 2 2 1 0 2 1 2 2 0 1 1 2
#  0 2 0 1 2 0 0 2 0 1 1 0 1 0 0 1 1 2 1 1 0 2 0 2 2 1 2 1 1 2 2 0 2 2 2 0 1
#  1 2 0 2 0 0 1 1 1 2 1 1 1 0 0 2 1 2 2 2 0 1 1 1 1 1 1 0 0 1 0 0 0 0 1 0 0
#  1 1 2 0 2 2 0 2 0 2 0 1 1 0 1 1 1 0 2 2 0 1 1 0 1 0 0 1 0 0 2 0 2 2 2 1 0
#  0 0 2 1 2 0 2 0 0 1 2 2 2 0 1 1 1 2 1 0 0 1 2 2 0 2 2 0 2 1 2 0 0 0 0 1 0
#  0 1 1 2]

### 1.5 Inspect your clustering with scatter plot
Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so 'new_points' is an array of points and 'labels' is the array of their cluster labels.

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs,ys,c=labels,alpha=0.5)

# Assign the cluster centers: centroids
# Compute the coordinates of the centroids 
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
# D marker = diamonds, size of markers 50
plt.scatter(centroids_x, centroids_y,marker='D',s=50)
plt.show()

### 1.6 Evaluating a clustering
- How can you check the quality of the clustering?
    - Can check correspondence with known labels, e.g. iris species
- If there are no species to check against, you need another way to measure the quality of the clustering
- Such a measure also informs the choice of how many clusters to look for

Iris: clusters vs species
- k-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?

Cross tabulation with pandas
- clusters vs species is a "cross-tabulation"
- Use pandas library
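
A tiny worked cross-tabulation may help before the full example. The labels and species below are made up for illustration (7 hypothetical samples), so the counts are easy to verify by eye.

```python
import pandas as pd

# Hypothetical cluster labels and species for 7 samples
labels = [0, 0, 1, 1, 2, 2, 1]
species = ['setosa', 'setosa', 'versicolor', 'versicolor',
           'virginica', 'virginica', 'virginica']

df = pd.DataFrame({'labels': labels, 'species': species})

# Rows are cluster labels, columns are species, cells are counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

Here cluster 1 contains two versicolor samples and one virginica sample, which is the kind of mixing a crosstab makes visible.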


In [None]:
# Build a crosstab

# align labels and species
import pandas as pd
df = pd.DataFrame({'labels':labels, 'species':species})
print(df)

# crosstab of labels and species
ct = pd.crosstab(df['labels'], df['species'])
print(ct)


### 1.7 Measuring clustering quality
- Using only samples and their cluster labels
- a good clustering has tight clusters

Inertia measures clustering quality
- measures how spread out the clusters are (lower is better)
- sum of squared distances from each sample to the centroid of its cluster
- after fit(), available as attribute inertia_
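
In scikit-learn, inertia is the sum of squared distances from each sample to the centroid of its own cluster. The sketch below recomputes it by hand on random data (the data, seed, and `n_init` value are assumptions for illustration) and compares it with `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 120 samples with 4 features
rng = np.random.default_rng(1)
samples = rng.normal(size=(120, 4))

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid of each sample's own cluster
own_centroids = model.cluster_centers_[labels]

# Sum of squared distances to those centroids
manual_inertia = ((samples - own_centroids) ** 2).sum()
print(manual_inertia, model.inertia_)
```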


In [None]:
# inertia measure for clustering quality
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)


### 1.8 How many clusters to choose
- a good clustering has tight clusters (so low inertia)
- but not too many clusters
- choose an "elbow" in the inertia plot
    - elbow = where inertia begins to decrease more slowly
    - e.g. for the iris dataset, 3 is a good choice

### 1.9 How many clusters of grain?
In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array samples containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

KMeans and PyPlot (plt) have already been imported for you.

This dataset was sourced from the UCI Machine Learning Repository.
https://archive.ics.uci.edu/ml/datasets/seeds

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

# The inertia decreases very slowly from 3 clusters to 4, 
# so it looks like 3 clusters would be a good choice for this data.

### 1.10 Evaluating the grain clustering
In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". 

In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array samples of grain samples, and a list varieties giving the grain variety for each sample. Pandas (pd) and KMeans have already been imported for you.

In [None]:
# create clusters and compare in cross-tab

# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])

# Display ct
print(ct)


### 1.11 Transforming features for better clusterings
- Piedmont wines dataset
- 178 samples from 3 distinct varieties of red wine
- features measure chemical composition (e.g. alcohol content) and color intensity

In [None]:
# Clustering the wines
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)
# clusters vs varieties cross-tab
df = pd.DataFrame({'labels':labels,
                  'varieties':varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

# Feature variances are very different, so the crosstab shows the clustering is not good


#### 1.11.a StandardScaler
- In kmeans, feature variance = feature influence
- the features need to be transformed to have equal variance
- StandardScaler transforms each feature to have mean 0 and variance 1
- Features are "standardized"
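
To see what "standardized" means concretely, the sketch below scales three made-up features with very different variances (the data is an assumption for illustration) and checks that the result matches subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
rng = np.random.default_rng(2)
samples = np.column_stack([
    rng.normal(loc=500.0, scale=200.0, size=100),
    rng.normal(loc=30.0, scale=5.0, size=100),
    rng.normal(loc=0.4, scale=0.05, size=100),
])

scaler = StandardScaler()
samples_scaled = scaler.fit_transform(samples)

# Manual standardization: (x - column mean) / column standard deviation
manual = (samples - samples.mean(axis=0)) / samples.std(axis=0)

print(samples.var(axis=0))         # very different variances before scaling
print(samples_scaled.var(axis=0))  # approximately 1 for every feature
```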

In [None]:
# sklearn StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
# Output: StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)

In [None]:
# 2 steps with pipeline: StandardScaler, then KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)

from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
# fit scaler and kmeans
pipeline.fit(samples)
# get cluster labels
labels = pipeline.predict(samples)

# Feature standardization improves clustering
df = pd.DataFrame({'labels':labels,
                  'varieties':varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

# big improvement in clustering after standardization

####  1.11.b sklearn preprocessing steps - few options
- StandardScaler is a "preprocessing" step
- MaxAbsScaler
- Normalizer

####  1.11.c Practice: Scaling fish data for clustering
You are given an array samples giving measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, you'll need to standardize these features first. In this exercise, you'll build a pipeline to standardize and cluster the data.

These fish measurement data were sourced from the Journal of Statistics Education.
http://ww2.amstat.org/publications/jse/jse_data_archive.htm

In [None]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)


####  1.11.d Practice: Clustering the fish data
You'll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

As before, samples is the 2D array of fish measurements. Your pipeline is available as pipeline, and the species of every fish sample is given by the list species.


In [None]:
# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels,'species':species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)

# species  Bream  Pike  Roach  Smelt
# labels                            
# 0           33     0      1      0
# 1            1     0     19      1
# 2            0    17      0      0
# 3            0     0      0     13

###  1.12 Clustering stocks using Normalizer and KMeans
In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). You are given a NumPy array movements of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, include a Normalizer at the beginning of your pipeline. The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.

Note that Normalizer() is different to StandardScaler(), which you used in the previous exercise. 
- While StandardScaler() standardizes features (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, 
- Normalizer() rescales each sample - here, each company's stock price - independently of the other.
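
The difference can be seen on a toy array. In this sketch (the `movements` values are invented), rows 0 and 1 have the same pattern at different price scales. `Normalizer()` with its default L2 norm maps both rows to the same unit-length vector, while `StandardScaler()` works column by column instead.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical price movements: row 1 is row 0 at ten times the scale
movements = np.array([
    [1.0, -2.0, 2.0],
    [10.0, -20.0, 20.0],
    [3.0, 0.0, -4.0],
])

# Normalizer rescales each row (sample) to unit Euclidean length
normalized = Normalizer().fit_transform(movements)
row_norms = np.linalg.norm(normalized, axis=1)

# StandardScaler rescales each column (feature) to mean 0, variance 1
standardized = StandardScaler().fit_transform(movements)
col_means = standardized.mean(axis=0)

print(normalized)
print(row_norms)
```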

KMeans and make_pipeline have already been imported for you.

In [None]:
# make pipeline

# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)


####  1.12.a Which stocks move together?
In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.

Your solution to the previous exercise has already been run. Recall that you constructed a Pipeline pipeline containing a KMeans model and fit it to the NumPy array movements of daily stock movements. In addition, a list companies of the company names is available.

In [None]:
# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))

                             companies  labels
45                                Sony       0
34                          Mitsubishi       0
48                              Toyota       0
7                                Canon       0
21                               Honda       0
32                                  3M       1
35                            Navistar       1
44                        Schlumberger       1
13                   DuPont de Nemours       1
12                             Chevron       1
0                                Apple       1
47                            Symantec       1
8                          Caterpillar       1
50  Taiwan Semiconductor Manufacturing       1
51                   Texas instruments       1
53                       Valero Energy       1
57                               Exxon       1
10                      ConocoPhillips       1
23                                 IBM       1
6             British American Tobacco       2
41                       Philip Morris       2
19                     GlaxoSmithKline       2
36                    Northrop Grumman       3
29                     Lookheed Martin       3
4                               Boeing       3
15                                Ford       4
5                      Bank of America       4
26                      JPMorgan Chase       4
1                                  AIG       4
58                               Xerox       4
16                   General Electrics       4
55                         Wells Fargo       4
3                     American express       4
18                       Goldman Sachs       4
54                            Walgreen       5
39                              Pfizer       5
56                            Wal-Mart       5
27                      Kimberly-Clark       5
25                   Johnson & Johnson       5
40                      Procter Gamble       5
2                               Amazon       6
59                               Yahoo       6
42                   Royal Dutch Shell       7
30                          MasterCard       7
31                           McDonalds       7
20                          Home Depot       7
52                            Unilever       7
49                               Total       7
37                            Novartis       7
46                      Sanofi-Aventis       7
43                                 SAP       7
22                                  HP       8
17                     Google/Alphabet       8
33                           Microsoft       8
11                               Cisco       8
14                                Dell       8
24                               Intel       8
9                    Colgate-Palmolive       9
28                           Coca Cola       9
38                               Pepsi       9

# 2. Visualization with hierarchical clustering and t-SNE
2 unsupervised learning techniques for data visualization
1. hierarchical clustering
2. t-SNE

Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. 

t-SNE maps the data samples into 2d space so that the proximity of the samples to one another can be visualized.
- t-SNE: creates a 2D map of dataset

## Hierarchical clustering
- Eurovision scoring dataset example
    - countries gave scores to songs performed at Eurovision 2006
    - 2D array of scores
    - rows: countries
    - columns: songs
- Dendrogram visualization
    - read bottom up
    - vertical lines represent clusters
- Hierarchical clustering steps
    - every country (row) begins in a separate cluster
    - at each step, the 2 closest clusters are merged
    - continue until all countries in a single cluster
    - = "agglomerative" hierarchical clustering
    - Divisive clustering is the opposite
- Hierarchical clustering with SciPy
    - given samples (the array of scores), and country_names
    

In [None]:
# Hierarchical clustering with SciPy

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
# linkage() performs the hierarchical clustering
mergings = linkage(samples, method='complete')
dendrogram(mergings,
          labels=country_names,
          leaf_rotation=90,
          leaf_font_size=6)
plt.show()

How many merges?
If there are 5 data samples, how many merge operations will occur in a hierarchical clustering? To help answer this question, think back to the video, in which Ben walked through an example of hierarchical clustering using 6 countries. How many merge operations did that example have?

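Answer: 4 merges.
Each merge reduces the number of clusters by one, so n samples always need n-1 merges to end up in a single cluster.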
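
The merge count can be read off the array returned by `linkage()`, which has one row per merge. A small sketch with 5 random samples (the data is an assumption for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 5 samples with 2 features
rng = np.random.default_rng(3)
samples = rng.normal(size=(5, 2))

mergings = linkage(samples, method='complete')

# One row per merge; the last column is the size of the merged cluster
print(mergings.shape)
print(mergings[-1, 3])
```

The final row reports a cluster of size 5, meaning the last merge joined everything into a single cluster.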

### 2.1 Hierarchical clustering of the grain data
In the video, you learned that the SciPy linkage() function performs hierarchical clustering on an array of samples. Use the linkage() function to obtain a hierarchical clustering of the grain samples, and use dendrogram() to visualize the result. A sample of the grain measurements is provided in the array samples, while the variety of each grain sample is given by the list varieties.

In [None]:
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(samples,method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()


### 2.2 Hierarchies of stocks
In chapter 1, you used k-means clustering to cluster companies according to their stock price movements. Now, you'll perform hierarchical clustering of the companies. You are given a NumPy array of price movements movements, where the rows correspond to companies, and a list of the company names companies. SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so you'll need to use the normalize() function from sklearn.preprocessing instead of Normalizer.

linkage and dendrogram have already been imported from scipy.cluster.hierarchy, and PyPlot has been imported as plt.

In [None]:
# Import normalize
from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
normalized_movements = normalize(movements)

# Calculate the linkage: mergings
mergings = linkage(normalized_movements,
                    method='complete')

# Plot the dendrogram
dendrogram(mergings,
            labels=companies,
            leaf_rotation=90,
            leaf_font_size=6)
plt.show()


### 2.3 Cluster labels in hierarchical clustering
- more than just a visualization tool
- cluster labels recovered at any intermediate stage
- intermediate cluster labels can be used in cross tabulations

Intermediate clusterings and height on dendrogram
- e.g. at height 15: Bulgaria, Cyprus, Greece are one cluster
- Height on dendrogram = distance b/n merging clusters
- A chosen height on the dendrogram specifies the max distance b/n merging clusters
    - Don't merge clusters further apart than this (e.g. 15)
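
To make the role of the height concrete, here is a minimal sketch on four invented points on a line. The merge heights are stored in the third column of the linkage array, and cutting at different heights with `fcluster()` gives different numbers of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: four points on a line at x = 0, 1, 3, 5
samples = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])

mergings = linkage(samples, method='complete')

# Column 2 holds the height (distance) of each merge
heights = mergings[:, 2]
print(heights)

# Stop merging at height 2.5: the final merge (height 5) is not applied
labels_low = fcluster(mergings, 2.5, criterion='distance')

# Stop merging at height 6: every merge is applied
labels_high = fcluster(mergings, 6, criterion='distance')
print(labels_low, labels_high)
```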

Distance b/n clusters
- defined by a "linkage method"
- specified via method parameter, e.g. linkage(samples, method='complete')
- in "complete" linkage: distance b/n clusters is max distance b/n their samples
- different linkage methods yield different hierarchical clustering
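
As a rough illustration of how a linkage method turns sample distances into a cluster distance, the sketch below takes two invented clusters and computes the "complete" distance (maximum over all pairs) next to the "single" distance (minimum over all pairs). It uses `scipy.spatial.distance.cdist` for the pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2D samples
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between samples of the two clusters
pairwise = cdist(cluster_a, cluster_b)

# Complete linkage: distance between the furthest pair of samples
complete_distance = pairwise.max()

# Single linkage: distance between the closest pair of samples
single_distance = pairwise.min()

print(complete_distance, single_distance)
```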

Extracting cluster labels
- use the fcluster() function
- returns a NumPy array of cluster labels


#### 2.3.a Extracting cluster labels using fcluster 

In [None]:
from scipy.cluster.hierarchy import linkage
mergings = linkage(samples, method='complete')
from scipy.cluster.hierarchy import fcluster
labels = fcluster(mergings, 15, criterion='distance')
print(labels)

#### 2.3.b Aligning cluster labels with country names
- given a list of strings: country_names

In [None]:
import pandas as pd
pairs = pd.DataFrame({'labels':labels,
                      'countries':country_names})
# sort by cluster label and print
print(pairs.sort_values('labels'))

### 2.4 Which clusters are closest?
In the video, you learned that the linkage method defines how the distance between clusters is measured. In complete linkage, the distance between clusters is the distance between the furthest points of the clusters. In single linkage, the distance between clusters is the distance between the closest points of the clusters.

Consider the three clusters in the diagram. Which of the following statements are true?

A. In single linkage, cluster 3 is the closest to cluster 2.

B. In complete linkage, cluster 1 is the closest to cluster 2.

Answer: both - but need plot to answer this

### 2.5 Different linkage, different hierarchical clustering!
In the video, you saw a hierarchical clustering of the voting countries at the Eurovision song contest using 'complete' linkage. Now, perform a hierarchical clustering of the voting countries with 'single' linkage, and compare the resulting dendrogram with the one in the video. Different linkage, different hierarchical clustering!

You are given an array samples. Each row corresponds to a voting country, and each column corresponds to a performance that was voted for. 

The list country_names gives the name of each voting country. This dataset was obtained from Eurovision.
https://eurovision.tv/history/full-split-results

In [None]:
# method:"single" for single linkage

# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
mergings = linkage(samples,method='single')

# Plot the dendrogram
dendrogram(mergings,
            labels=country_names,
            leaf_rotation=90,
            leaf_font_size=6)
plt.show()


### 2.6 Intermediate clusterings
Displayed on the right is the dendrogram for the hierarchical clustering of the grain samples that you computed earlier. If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?
answer: 3 - need to see plot

### 2.7 Extracting the cluster labels
In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has 3 clusters. Now, use the fcluster() function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.

The hierarchical clustering has already been performed and mergings is the result of the linkage() function. The list varieties gives the variety of each grain sample.

In [None]:
# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings,6,criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['varieties'])

# Display ct
print(ct)

# varieties  Canadian wheat  Kama wheat  Rosa wheat
# labels                                           
# 1                      14           3           0
# 2                       0           0          14
# 3                       0          11           0

### 2.8 t-SNE for 2-dimensional maps
- t-SNE = t-distributed stochastic neighbor embedding
- maps samples from high-dimensional space to 2D space (or 3D)
- approximately preserves nearness of samples
- great for inspecting datasets

Iris dataset has 4 measurements, so samples are 4-dimensional
- t-SNE maps samples to 2D space
- t-SNE didn't know there were different species
    - yet kept the species mostly separate
- we learn 2 species (versicolor and virginica) have samples close together in space
    - consistent with k-means inertia plot (tight clusters): could argue for 2 clusters, or for 3
    
t-SNE in sklearn
- print(samples)
    - 2D NumPy array samples
    - List species giving the species of each sample as a number (0, 1, or 2)

In [None]:
# t-SNE in sklearn
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=species)
plt.show()

#### t-SNE has only fit_transform()
- has a fit_transform() method
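- simultaneously fits the model and transforms the data
- has no transform() method to apply a fitted model to other data (scikit-learn's TSNE does define fit(), but it only computes the embedding of the data it is given)
    - so, cannot extend the map to include NEW data samples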
    - must start over each time
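
A quick way to confirm this is to inspect a TSNE instance. The sketch below runs t-SNE on a small random array (the data and the `perplexity`, `learning_rate`, and `random_state` values are assumptions chosen so the example runs quickly) and checks for a `transform` attribute.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: 40 samples with 4 features
rng = np.random.default_rng(4)
samples = rng.normal(size=(40, 4))

# perplexity must be smaller than the number of samples
model = TSNE(n_components=2, perplexity=10, learning_rate=100, random_state=0)
embedding = model.fit_transform(samples)

print(embedding.shape)
# No transform() method, so the map cannot be applied to new samples
print(hasattr(model, 'transform'))
```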

#### t-SNE learning rate
- choose learning rate for the dataset
- if wrong choice: points bunch together
- advice: normally enough to try values b/n 50 and 200

#### Different every time
- t-SNE features are different every time
- the axes of the plot do not have any interpretable meaning
- example: on the Piedmont wines data, 3 runs of t-SNE yield 3 different scatter plots
    - however, the wine varieties (=colors) have same position relative to one another


#### 2.8.a t-SNE visualization of grain dataset
In the video, you saw t-SNE applied to the iris dataset. In this exercise, you'll apply t-SNE to the grain samples data and inspect the resulting t-SNE features using a scatter plot. You are given an array samples of grain samples and a list variety_numbers giving the variety number of each grain sample.

In [None]:
# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers)
plt.show()

# the t-SNE visualization manages to separate the 3 varieties
# of grain samples

#### 2.8.b A t-SNE map of the stock market
t-SNE provides great visualizations when the individual samples can be labeled. In this exercise, you'll apply t-SNE to the company stock price data. A scatter plot of the resulting t-SNE features, labeled by the company names, gives you a map of the stock market! The stock price movements for each company are available as the array normalized_movements (these have already been normalized for you). The list companies gives the name of each company. PyPlot (plt) has been imported for you. 

In [None]:
# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot
plt.scatter(xs, ys, alpha=0.5)

# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()


# 3. Decorrelating your data and dimension reduction - PCA
Dimension reduction summarizes a dataset using its commonly occurring patterns.

Learn about the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, you'll employ a variant of PCA that will allow you to cluster Wikipedia articles by their content!

## Dimension reduction
- more efficient storage and computation
- most important function: remove less-informative "noise" features
    - those "noise" features cause problems for prediction tasks like classification and regression

## Principal Component Analysis
- PCA = "Principal Component Analysis"
- Fundamental dimension reduction technique
- 2 steps
    - first step: "decorrelation" 
    - second step: reduces dimension
- Decorrelation:
    - PCA rotates data samples to be aligned with the axes
    - shifts data samples to have mean=0
    - No information is lost
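
These three properties can be checked numerically. The sketch below builds two correlated made-up features, applies PCA with all components kept, and verifies that the transformed data has mean 0, that the two PCA features are uncorrelated, and that `inverse_transform()` recovers the original samples.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: two strongly correlated features
rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)
samples = np.column_stack([x, y])

model = PCA()
transformed = model.fit_transform(samples)

# Shifted: each PCA feature has mean 0
feature_means = transformed.mean(axis=0)

# Rotated: the PCA features are no longer linearly correlated
correlation = np.corrcoef(transformed, rowvar=False)[0, 1]

# Nothing lost: undoing the rotation and shift gives back the samples
recovered = model.inverse_transform(transformed)

print(feature_means, correlation)
```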
    
## PCA follows the fit/transform pattern
- PCA is a scikit-learn component like KMeans or StandardScaler
- fit() learns the transformation from given data
- transform() applies the learned transformation
- transform() can also be applied to NEW unseen data

## 3.1 Using sklearn PCA
- samples = array of two wine features (total_phenols & od280)

## PCA features
- Rows of transformed correspond to samples
- Columns of transformed are the "PCA features"
- Row gives PCA feature values of corresponding sample

## PCA features are not correlated - due to rotation performed
- resulting PCA features are not linearly correlated ("decorrelation")

## Pearson correlation
- measures linear correlation of features
- value b/n -1 and 1
    - larger absolute values indicate stronger correlation
- value of 0 means NO linear correlation
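
For reference, the Pearson correlation of two features is the sum of products of their centered values divided by the square root of the product of their centered sums of squares. The sketch below implements that formula in a hypothetical helper `pearson()` and compares it with `scipy.stats.pearsonr` on small made-up arrays.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson(a, b):
    """Pearson correlation computed directly from the definition."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0                            # perfect increasing line
y_down = -x                                     # perfect decreasing line
y_none = np.array([1.0, -1.0, 0.0, -1.0, 1.0])  # no linear trend in x

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
r_none = pearson(x, y_none)

# Same value from SciPy, which also returns a p-value
r_scipy, pvalue = pearsonr(x, y_up)
print(r_up, r_down, r_none, r_scipy)
```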

## Principal components
- "Principal components" = directions of variance
- PCA aligns principal components with the axes
- available as components_ attribute of PCA object
- each row defines displacement from mean
    - numpy array with 1 row for each principal component
    - print(model.components_)
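
A short sketch of what `components_` holds, using made-up data that varies mostly along the direction (1, 2). The rows are unit-length and mutually orthogonal, and with the default settings `transform()` amounts to subtracting `mean_` and projecting onto those rows.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data stretched along the direction (1, 2)
rng = np.random.default_rng(6)
x = rng.normal(size=300)
samples = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=300)])

model = PCA().fit(samples)
components = model.components_

# Rows are orthonormal, so this product is the identity matrix
gram = components @ components.T

# Manual transform: shift by the mean, then project onto each component
manual = (samples - model.mean_) @ components.T

# The first component should be nearly parallel to (1, 2); sign is arbitrary
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
alignment = abs(components[0] @ direction)

print(components)
print(alignment)
```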

In [None]:
# PCA
print(samples)

from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
# new array of transformed samples
transformed = model.transform(samples)


### 3.2 Correlated data in nature
You are given an array grains giving the width and length of samples of grain. You suspect that width and length will be correlated. To confirm this, make a scatter plot of width vs length and measure their Pearson correlation.

In [None]:
# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assign the 0th column of grains: width
width = grains[:,0]

# Assign the 1st column of grains: length
length = grains[:,1]

# Scatter plot width vs length
plt.scatter(width, length)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation
correlation, pvalue = pearsonr(width,length)

# Display the correlation
print(correlation)

# 0.860414937714
# which is highly correlated

### 3.3 Decorrelating the grain measurements with PCA
- You observed in the previous exercise that the width and length measurements of the grain are correlated. 
- Now, use PCA to decorrelate these measurements,
- then plot the decorrelated points and measure their Pearson correlation.

In [None]:
# Import PCA
from sklearn.decomposition import PCA

# Create PCA instance: model
model = PCA()

# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)

# Assign 0th column of pca_features: xs
xs = pca_features[:,0]

# Assign 1st column of pca_features: ys
ys = pca_features[:,1]

# Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)

# Display the correlation
print(correlation)

# 0.0

### 3.4 Intrinsic dimension
Intrinsic dimension of a flight path
- consider 2 features: longitude and latitude at points along a flight path
- but can be approximated using 1 feature: displacement along flight path
- data is intrinsically 1 dimensional

Intrinsic dimension = number of features needed to approximate the dataset
- essential idea behind dimension reduction
- What is the most compact representation of the samples?
- Can be detected with PCA

Example from Versicolor dataset
- versicolor is one of the iris species
- only 3 features: sepal length, sepal width, and petal width
- samples are points in 3D space
- if you make a 3D scatterplot, samples lie close to a flat 2D sheet
    - so can be approximated using 2 features
    - intrinsic dimension = 2
    
PCA identifies intrinsic dimension
- scatter plots work only if samples have 2 or 3 features
- PCA identifies intrinsic dimension when samples have any number of features
- Intrinsic dimension = number of PCA features with significant variance
- PCA applied to versicolor samples
    - PCA rotates and shifts the samples to align with the coordinate axes
    - 3 features expressed
    - use bar graph to see variance of the 3 features

Variance and intrinsic dimension
- intrinsic dimension is the number of PCA features with significant variance
- for versicolor example, only first 2 features (of 3) have significant variance
    - so, intrinsic dimension = 2
    - this agrees with scatter plot observation
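
The flight-path idea can be checked numerically. The sketch below assumes a made-up straight path with a little noise (the coordinates are invented for illustration) and looks at how the variance is shared between the two PCA features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical flight path: positions along a straight line, plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                   # displacement along the path
longitude = -0.1 + 8.0 * t + rng.normal(scale=0.01, size=t.size)
latitude = 51.5 - 3.0 * t + rng.normal(scale=0.01, size=t.size)
samples = np.column_stack([longitude, latitude])

pca = PCA()
pca.fit(samples)

# Fraction of the total variance carried by each PCA feature.
print(pca.explained_variance_ratio_)
```

Almost all of the variance falls on the first PCA feature, which is what an intrinsic dimension of 1 looks like in practice.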


#### 3.4.a Plotting the variances of PCA features
- samples = array of versicolor samples

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)
features = range(pca.n_components_)

# make bar plot of variances
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()

#### 3.4.b Intrinsic dimension can be ambiguous
- intrinsic dimension is an idealization
- ...there is not always one correct answer
    - depends on the threshold you choose
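
A short sketch of how the answer moves with the threshold. The function `intrinsic_dimension` and the data are hypothetical, written only to illustrate the point; the cut-off values are arbitrary.

```python
import numpy as np

def intrinsic_dimension(samples, threshold):
    """Count PCA features whose share of the total variance exceeds `threshold`."""
    # The variances of the PCA features are the eigenvalues of the covariance matrix.
    variances = np.linalg.eigvalsh(np.cov(samples, rowvar=False))
    ratios = variances / variances.sum()
    return int((ratios > threshold).sum())

# Hypothetical data whose three directions have variances of about 10, 1 and 0.1.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 3)) * np.sqrt([10.0, 1.0, 0.1])

for threshold in (0.2, 0.05, 0.001):
    print(threshold, intrinsic_dimension(samples, threshold))
```

The same dataset is judged to have a different intrinsic dimension for each of the three thresholds, so the choice of threshold should be stated alongside the result.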
    

#### 3.4.c The first principal component - find and draw arrow on scatter plot
The first principal component of the data is the direction in which the data varies the most. 

In this exercise, your job is to 
- use PCA to find the first principal component of the length and width measurements of the grain samples, and 
- represent it as an arrow on the scatter plot.

The array grains gives the length and width of the grain samples. PyPlot (plt) and PCA have already been imported for you.

In [None]:
# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])

# Create a PCA instance: model
model = PCA()

# Fit model to points
model.fit(grains)

# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0,:]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()

#### 3.4.d  Variance of the PCA features
The fish dataset is 6-dimensional. But what is its intrinsic dimension? 
- Make a plot of the variances of the PCA features to find out. 
- As before, samples is a 2D array, where each row represents a fish. 
- You'll need to standardize the features first.

In [None]:
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to 'samples'
pipeline.fit(samples)

# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()

# PCA features 0 and 1 have significant variance.
# intrinsic dimension = 2

### 3.5 Dimension Reduction with PCA
Dimension reduction
- represents same data, using fewer features
- important part of machine-learning pipelines
- PCA features are in decreasing order of variance
    - assumes low variance features are "noise"
    - ...and high variance features are informative
- specify how many features to keep
    - e.g. PCA(n_components=2)
        - keeps the first 2 PCA features
    - intrinsic dimension is a good choice


#### 3.5.a Example: dimension reduction of iris dataset
- samples=array of iris measurements (4 features)
- species=list of iris species numbers
- Try to reduce to 2 features

In [None]:
# PCA on iris dataset
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(samples)
transformed = pca.transform(samples)
print(transformed.shape)
# (150, 2)
# you see 2 features

In [None]:
# Iris dataset in 2 dimensions
# PCA reduced the dimension to 2
# Retained the 2 PCA features with highest variance
# These 2 features are very informative as important info preserved
#  since the 3 species remain distinct

import matplotlib.pyplot as plt

xs = transformed[:,0]
ys = transformed[:,1]

plt.scatter(xs, ys, c=species)
plt.show()

# with 2 features, you can still distinguish the 3 species in the scatter plot

#### 3.5.b Dimension reduction with PCA...
- discards low variance PCA features
- assumes the high variance features are informative
- Assumption typically holds in practice

#### Some cases where this does NOT hold, so use an alternative form of PCA
- Word frequency arrays ("tf-idf") - great example
    - what is it? each row a document, each column a word from a fixed vocabulary
    - this is a sparse array example
- Sparse arrays and csr_matrix
    - sparse arrays: most entries are 0
    - Can use scipy.sparse.csr_matrix instead of NumPy array
    - csr_matrix remembers only non-zero entries (and saves space)
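
A minimal sketch of the space saving, using a tiny made-up array in place of a real word-frequency array:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero array standing in for a word-frequency array.
dense = np.array([
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.0],
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
])

sparse = csr_matrix(dense)

print(dense.size)     # number of entries in the dense array
print(sparse.nnz)     # number of entries actually stored
print(sparse.data)    # the stored non-zero values, row by row

# Converting back gives the original array.
print(np.array_equal(sparse.toarray(), dense))
```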
    
#### TruncatedSVD and csr_matrix
- scikit-learn PCA doesn't support csr_matrix
- use TruncatedSVD instead in sklearn
- performs same transformation as PCA, but able to accept csr_matrix as input

#### 3.5.c TruncatedSVD and csr_matrix

In [1]:
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
# documents is a csr_matrix
model.fit(documents)
transformed = model.transform(documents)

#### 3.5.d Dimension reduction of the fish measurements
In a previous exercise, you saw that 2 was a reasonable choice for the "intrinsic dimension" of the fish measurements. Now use PCA for dimensionality reduction of the fish measurements, retaining only the 2 most important components.

The fish measurements have already been scaled for you, and are available as scaled_samples. 

In [None]:
# Import PCA
from sklearn.decomposition import PCA

# Create a PCA model with 2 components: pca
pca = PCA(n_components=2)

# Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)

# Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)

# Print the shape of pca_features
print(pca_features.shape)

# (85, 2)
# reduced dimensionality from 6 to 2 (see columns)

#### 3.5.e A tf-idf word-frequency array
In this exercise, you'll create a tf-idf word frequency array for a toy collection of documents. 
- use the TfidfVectorizer from sklearn. 
    - It transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. 
    - It has fit() and transform() methods like other sklearn objects.

You are given a list 'documents' of toy documents about pets.
- ['cats say meow', 'dogs say woof', 'dogs chase cats']

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer() 

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words = tfidf.get_feature_names_out().tolist()

# Print words
print(words)

# [[ 0.51785612  0.          0.          0.68091856  0.51785612  0.        ]
#  [ 0.          0.          0.51785612  0.          0.51785612  0.68091856]
#  [ 0.51785612  0.68091856  0.51785612  0.          0.          0.        ]]
# ['cats', 'chase', 'dogs', 'meow', 'say', 'woof']

#### 3.5.f Clustering Wikipedia part I
You saw in the video that TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. 

Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. 

In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).

The Wikipedia dataset you will be working with was obtained from here.
https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia/

In [None]:
# setup pipeline for TruncatedSVD

# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)


#### 3.5.g Clustering Wikipedia part II
It is now time to put your pipeline from the previous exercise to work! You are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Use your pipeline to cluster the Wikipedia articles.

A solution to the previous exercise has been pre-loaded for you, so a Pipeline 'pipeline' chaining TruncatedSVD with KMeans is available.

In [None]:
# use pipeline above to cluster 

# Import pandas
import pandas as pd

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))

#                                           article  label
# 29                               Jennifer Aniston      0
# 22                              Denzel Washington      0
# 23                           Catherine Zeta-Jones      0
# 24                                   Jessica Biel      0
# 25                                  Russell Crowe      0
# 26                                     Mila Kunis      0
# 27                                 Dakota Fanning      0
# 28                                  Anne Hathaway      0
# 21                             Michael Fassbender      0
# 20                                 Angelina Jolie      0
# 40                                    Tonsillitis      1
# 43                                       Leukemia      1
# 44                                           Gout      1
# 45                                    Hepatitis C      1
# 46                                     Prednisone      1
# 47                                          Fever      1
# 48                                     Gabapentin      1
# 49                                       Lymphoma      1
# 42                                    Doxycycline      1
# 41                                    Hepatitis B      1
# 0                                        HTTP 404      1
# 1                                  Alexa Internet      1
# 2                               Internet Explorer      1
# 3                                     HTTP cookie      1
# 4                                   Google Search      1
# 5                                          Tumblr      1
# 6                     Hypertext Transfer Protocol      1
# 7                                   Social search      1
# 8                                         Firefox      1
# 9                                        LinkedIn      1
# 18  2010 United Nations Climate Change Conference      2
# 19  2007 United Nations Climate Change Conference      2
# 10                                 Global warming      2
# 14                                 Climate change      2
# 15                                 Kyoto Protocol      2
# 13                               Connie Hedegaard      2
# 12                                   Nigel Lawson      2
# 11       Nationally Appropriate Mitigation Action      2
# 16                                        350.org      2
# 17  Greenhouse gas emissions by the United States      2
# 36              2014 FIFA World Cup qualification      3
# 35                Colombia national football team      3
# 30                  France national football team      3
# 53                                   Stevie Nicks      4
# 55                                  Black Sabbath      4
# 52                                     The Wanted      4
# 56                                       Skrillex      4
# 51                                     Nate Ruess      4
# 57                          Red Hot Chili Peppers      4
# 54                                 Arctic Monkeys      4
# 59                                    Adam Levine      4
# 58                                         Sepsis      4
# 50                                   Chad Kroeger      4
# 32                                   Arsenal F.C.      5
# 31                              Cristiano Ronaldo      5
# 39                                  Franck Ribéry      5
# 38                                         Neymar      5
# 37                                       Football      5
# 34                             Zlatan Ibrahimović      5
# 33                                 Radamel Falcao      5

# 4. NMF - Discovering interpretable features
NMF = non-negative matrix factorization
- Dimension reduction technique that expresses samples as combinations of interpretable parts. 
- NMF models are INTERPRETABLE (unlike PCA)
    - easy to interpret and easy to explain
- However, all sample features must be non-negative (>=0)

Interpretable parts
- interpretable by decomposing samples as sums of their parts
- examples
    - it expresses documents as combinations of topics
    - expresses images in terms of combos of visual patterns
    
- You'll also learn to use NMF to build recommender systems that can find you similar articles to read, or musical artists that match your listening history!

## Using scikit-learn NMF
- follows fit() and transform() pattern
- BUT must specify number of components
    - NMF(n_components=2)
- works with both NumPy arrays and with csr_matrix

## Example word-frequency array
- word frequency array, 4 words, many documents
- measure presence of words in each document using "tf-idf"
    - recall tf-idf measures word frequency
- "tf" = frequency of word in document
    - if 10% of the words are "DataCamp" then value = 0.1
- "idf" = reduces influence of frequent words (e.g. "the")
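
The two ingredients can be sketched in plain Python on the toy pet documents from section 3.5.e. This uses the textbook definitions, with helper functions `tf` and `idf` written for illustration. scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed idf and normalises each row, so its numbers differ from these.

```python
import math

# Toy documents from the pets example, tokenised by splitting on spaces.
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
tokenised = [doc.split() for doc in documents]

def tf(word, tokens):
    """Term frequency: the fraction of the document's words equal to `word`."""
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    """Textbook inverse document frequency: log(N / number of documents containing the word)."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

first = tokenised[0]   # 'cats say meow'
for word in ('say', 'meow'):
    print(word, tf(word, first), idf(word, tokenised), tf(word, first) * idf(word, tokenised))
```

Both words make up a third of the first document, but "meow" appears in only one document while "say" appears in two, so "meow" ends up with the larger tf-idf value.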



### 4.1 Example using NMF
- samples is the word-frequency array

In [None]:
from sklearn.decomposition import NMF
# remember you need to specify number of components for NMF
model = NMF(n_components=2)
model.fit(samples)
nmf_features = model.transform(samples)

print(model.components_)


### 4.2 NMF components: model.components_
- NMF has components (like PCA)
- Dimension of components = dimension of samples
    - above example has 2 components in a 4-dimensional space (4 words)
- Entries are non-negative

### 4.3 NMF features: nmf_features
- NMF feature values are non-negative
    - following above example, 2 features columns due to 2 components
- Features can be used to reconstruct the samples
    - ...combine feature values with components
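
The shapes and the non-negativity described in 4.2 and 4.3 can be checked on a small example. This is a sketch on random placeholder data; the `init`, `random_state` and `max_iter` settings are chosen here only to make the run repeatable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 6 samples, 4 features.
rng = np.random.default_rng(0)
samples = rng.random((6, 4))

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
nmf_features = model.fit_transform(samples)

print(model.components_.shape)   # one row per component, one column per feature
print(nmf_features.shape)        # one row per sample, one column per component

# Both the components and the features are non-negative.
print((model.components_ >= 0).all(), (nmf_features >= 0).all())
```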

### 4.4 Reconstruction of a sample
print(samples[i,:])

print(nmf_features[i,:])

#### Sample reconstruction
- multiply components by features values, and add up
- can also be expressed as a product of matrices
    - that's why it's called "matrix factorization"
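
A minimal sketch of the reconstruction, with invented numbers for the components and feature values (2 components in a 4-word space):

```python
import numpy as np

# Hypothetical NMF output: 2 components in a 4-word space, and the
# feature values of a single sample.
components = np.array([
    [0.5, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.4],
])
features = np.array([0.2, 0.5])

# Multiply each component by its feature value, then add up.
by_hand = features[0] * components[0] + features[1] * components[1]

# The same computation as a matrix product.
as_product = features @ components

print(by_hand)
print(np.allclose(by_hand, as_product))
```

Stacking the feature rows of all samples into one array turns this into a single product of two matrices, features times components, which approximates the original array of samples.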

#### NMF fits to non-negative data, only
- word frequencies in each document
- images encoded as arrays
- audio spectrograms
- purchase histories on e-commerce sites


### 4.5 NMF applied to Wikipedia articles
In the video, you saw NMF applied to transform a toy word-frequency array. Now it's your turn to apply NMF, this time using the tf-idf word-frequency array of Wikipedia articles, given as a csr matrix articles. Here, fit the model and transform the articles. In the next exercise, you'll explore the result.

In [None]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF instance: model
model = NMF(n_components=6)

# Fit the model to articles, which is the tf-idf word-frequency data
model.fit(articles)

# Transform the articles: nmf_features
nmf_features = model.transform(articles)

# Print the NMF features
print(nmf_features)

# [[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   4.40506373e-01]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   5.66659188e-01]
#  [  3.82050268e-03   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   3.98683767e-01]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   3.81775228e-01]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   4.85564507e-01]
#  [  1.29287389e-02   1.37901036e-02   7.76210545e-03   3.34474992e-02
#     0.00000000e+00   3.34554761e-01]
#  [  0.00000000e+00   0.00000000e+00   2.06716073e-02   0.00000000e+00
#     6.04727411e-03   3.59094015e-01]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   4.91024756e-01]
#  [  1.54269946e-02   1.42828850e-02   3.76586104e-03   2.37106658e-02
#     2.62721416e-02   4.80819557e-01]
#  [  1.11735727e-02   3.13703233e-02   3.09443015e-02   6.56980805e-02
#     1.96750899e-02   3.38321972e-01]
#  [  0.00000000e+00   0.00000000e+00   5.30647953e-01   0.00000000e+00
#     2.83788404e-02   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   3.56461373e-01   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  1.20124018e-02   6.50088586e-03   3.12203197e-01   6.09756724e-02
#     1.13905116e-02   1.92621126e-02]
#  [  3.93475069e-03   6.24484146e-03   3.42327174e-01   1.10766620e-02
#     0.00000000e+00   0.00000000e+00]
#  [  4.63808495e-03   0.00000000e+00   4.34856437e-01   0.00000000e+00
#     3.84422147e-02   3.08165696e-03]
#  [  0.00000000e+00   0.00000000e+00   4.83224134e-01   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  5.65001825e-03   1.83547742e-02   3.76482282e-01   3.25453318e-02
#     0.00000000e+00   1.13346596e-02]
#  [  0.00000000e+00   0.00000000e+00   4.80849077e-01   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   9.01924448e-03   5.50933818e-01   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   4.65906981e-01   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   1.14088541e-02   2.08627572e-02   5.17755511e-01
#     5.81629743e-02   1.37868947e-02]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   5.10463728e-01
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   5.60142127e-03   0.00000000e+00   4.22370287e-01
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   4.36741361e-01
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   4.98080756e-01
#     0.00000000e+00   0.00000000e+00]
#  [  9.88366761e-02   8.60100620e-02   3.90983687e-03   3.81008870e-01
#     4.39421028e-04   5.22211692e-03]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   5.72156883e-01
#     0.00000000e+00   7.13627097e-03]
#  [  1.31465075e-02   1.04860315e-02   0.00000000e+00   4.68895412e-01
#     0.00000000e+00   1.16322294e-02]
#  [  3.84539282e-03   0.00000000e+00   0.00000000e+00   5.75697492e-01
#     0.00000000e+00   0.00000000e+00]
#  [  2.25239129e-03   1.38747181e-03   0.00000000e+00   5.27933729e-01
#     1.20310496e-02   1.49500333e-02]
#  [  0.00000000e+00   4.07574718e-01   1.85689474e-03   0.00000000e+00
#     2.96723410e-03   4.52365662e-04]
#  [  1.53416989e-03   6.08212641e-01   5.22205076e-04   6.24838316e-03
#     1.18489465e-03   4.40109910e-04]
#  [  5.38804369e-03   2.65034312e-01   5.38434864e-04   1.86921587e-02
#     6.38895284e-03   2.90132437e-03]
#  [  0.00000000e+00   6.44957896e-01   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   6.08946624e-01   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   3.43707631e-01   0.00000000e+00   0.00000000e+00
#     3.97946096e-03   0.00000000e+00]
#  [  6.10492008e-03   3.15333353e-01   1.54859137e-02   0.00000000e+00
#     5.06437136e-03   4.74379589e-03]
#  [  6.47354568e-03   2.13342445e-01   9.49367504e-03   4.56970890e-02
#     1.71980262e-02   9.52150420e-03]
#  [  7.99125163e-03   4.67625618e-01   0.00000000e+00   2.43419796e-02
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   6.42861972e-01   0.00000000e+00   2.35849326e-03
#     0.00000000e+00   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     4.77261103e-01   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     4.94440481e-01   0.00000000e+00]
#  [  0.00000000e+00   2.99068686e-04   2.14451635e-03   0.00000000e+00
#     3.81921702e-01   5.83830661e-03]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   5.64675940e-03
#     5.42444351e-01   0.00000000e+00]
#  [  1.78052695e-03   7.84456293e-04   1.41608159e-02   4.59787746e-04
#     4.24461218e-01   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     5.11582652e-01   0.00000000e+00]
#  [  0.00000000e+00   0.00000000e+00   3.28335229e-03   0.00000000e+00
#     3.73026564e-01   0.00000000e+00]
#  [  0.00000000e+00   2.62097705e-04   3.61054985e-02   2.32322492e-04
#     2.30597045e-01   0.00000000e+00]
#  [  1.12514342e-02   2.12340752e-03   1.60950068e-02   1.02482072e-02
#     3.25583605e-01   3.75915749e-02]
#  [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     4.19115076e-01   3.57704275e-04]
#  [  3.08364955e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  3.68171423e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  3.97942238e-01   2.81721472e-02   3.66962687e-03   1.70062680e-02
#     1.96040569e-03   2.11664796e-02]
#  [  3.75792129e-01   2.07534363e-03   0.00000000e+00   3.72145807e-02
#     0.00000000e+00   5.85982994e-03]
#  [  4.38025314e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  4.57877996e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     0.00000000e+00   0.00000000e+00]
#  [  2.75475412e-01   4.46985945e-03   0.00000000e+00   5.29643478e-02
#     0.00000000e+00   1.91015485e-02]
#  [  4.45190987e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
#     5.48904081e-03   0.00000000e+00]
#  [  2.92738455e-01   1.33673503e-02   1.14247805e-02   1.05197510e-02
#     1.87766704e-01   9.24051901e-03]
#  [  3.78264000e-01   1.43979717e-02   0.00000000e+00   9.85216875e-02
#     1.35951375e-02   0.00000000e+00]]

### 4.6 NMF features of the Wikipedia articles (prev example cont'd)
Now you will explore the NMF features you created in the previous exercise. A solution to the previous exercise has been pre-loaded, so the array 'nmf_features' is available. Also available is a list 'titles' giving the title of each Wikipedia article.

When investigating the features, notice that for both actors, the NMF feature 3 has by far the highest value. This means that both articles are reconstructed using mainly the 3rd NMF component. In the next video, you'll see why: NMF components represent topics (for instance, acting!).

In [None]:
# Import pandas
import pandas as pd

# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)

# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])

# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])

# 0    0.003845
# 1    0.000000
# 2    0.000000
# 3    0.575711
# 4    0.000000
# 5    0.000000
# Name: Anne Hathaway, dtype: float64
# 0    0.000000
# 1    0.005601
# 2    0.000000
# 3    0.422380
# 4    0.000000
# 5    0.000000
# Name: Denzel Washington, dtype: float64

# Notice that for both actors, the NMF feature 3 has by far the 
# highest value. This means that both articles are reconstructed 
# using mainly the 3rd NMF component.
# You'll see why next.

### 4.7 NMF reconstructs samples
In this exercise, you'll check your understanding of how NMF reconstructs samples from its components using the NMF feature values. Below are the components of an NMF model. If the NMF feature values of a sample are [2, 1], then which of the following is most likely to represent the original sample? A pen and paper will help here! You have to apply the same technique Ben used in the video to reconstruct the sample [0.1203 0.1764 0.3195 0.141].

components of a NMF model
[[ 1.   0.5  0. ]
 [ 0.2  0.1  2.1]]

NMF feature values of a sample
[2, 1]

original sample = features * components
1*2+0.2*1, 0.5*2+0.1*1, 0*2+2.1*1
2.2, 1.1, 2.1

Answer: [2.2, 1.0, 2.0] is most likely to represent the original sample

### 4.8 NMF learns interpretable parts
Example: NMF learns interpretable parts
- Word-frequency array articles (tf-idf)
- 20,000 scientific articles (rows)
- 800 words (columns)


In [None]:
# Applying NMF to the articles
print(articles.shape)
# (20000, 800)
from sklearn.decomposition import NMF
nmf = NMF(n_components=10)
nmf.fit(articles)
print(nmf.components_.shape)
# (10, 800)


### 4.9 NMF components
- For documents:
    - NMF components represent topics
    - NMF features combine topics into documents
- For images, NMF components are parts of images

Grayscale images
- "Grayscale" image = no colors, only shades of gray
- measure pixel brightness
- Represent with value between 0 and 1 (0 is black)
- Convert to 2D array
- Example:
    - 8x8 grayscale image of the moon in an array
- Grayscale images as flat arrays
    - enumerate the entries
    - row-by-row
    - from left to right
    - from top to bottom

Encoding a collection of images
- collection of images of the same size
- encode as 2D array
- each row corresponds to an image - as a flattened array
- each column corresponds to a pixel
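
The encoding can be sketched with NumPy. The three 2x3 images below are invented for illustration.

```python
import numpy as np

# Three hypothetical 2x3 grayscale images with brightness between 0 and 1.
images = [
    np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 1.0]]),
    np.array([[1.0, 1.0, 0.0], [0.0, 0.5, 0.0]]),
    np.array([[0.5, 0.5, 0.5], [0.0, 0.0, 1.0]]),
]

# Flatten each image row by row, then stack: one row per image, one column per pixel.
samples = np.array([image.flatten() for image in images])
print(samples.shape)
print(samples[0])

# reshape() with the original dimensions recovers the image.
print(np.array_equal(samples[0].reshape((2, 3)), images[0]))
```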
    

### 4.10 Visualizing samples

In [None]:
print(sample)
# To recover the image, use the reshape method of the sample,
# specifying the dimensions of the original image as a tuple.
bitmap = sample.reshape((2,3))
# This yields a 2D array of pixel brightness
print(bitmap)

# To display the corresponding image
from matplotlib import pyplot as plt
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()

### 4.11 NMF learns topics of documents
In the video, you learned when NMF is applied to documents, the components correspond to topics of documents, and the NMF features reconstruct the documents from the topics. Verify this for yourself for the NMF model that you built earlier using the Wikipedia articles. Previously, you saw that the 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington. In this exercise, identify the topic of the corresponding NMF component.

The NMF model you built earlier is available as model, while words is a list of the words that label the columns of the word-frequency array.

After you are done, take a moment to recognise the topic that the articles about Anne Hathaway and Denzel Washington have in common!

In [None]:
# Import pandas
import pandas as pd

# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_,columns=words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3: component
component = components_df.iloc[3]

# Print result of nlargest
# This gives the five words with the highest values for that component.
print(component.nlargest())

# (6, 13125)
# film       0.627877
# award      0.253131
# starred    0.245284
# role       0.211451
# actress    0.186398
# Name: 3, dtype: float64

### 4.12 Explore the LED digits dataset
In the following exercises, you'll use NMF to decompose grayscale images into their commonly occurring patterns. 

Firstly, explore the image dataset and see how it is encoded as an array. You are given 100 images as a 2D array 'samples', where each row represents a single 13x8 image. The images in your dataset are pictures of an LED digital display.

In [None]:
# Import pyplot
from matplotlib import pyplot as plt

# Select the 0th row: digit
digit = samples[0,:]

# Print digit - it is a 1D array of 0s and 1s
print(digit)

# Reshape digit to a 13x8 2D array: bitmap
bitmap = digit.reshape(13,8)

# Print bitmap
print(bitmap)

# Use plt.imshow to display bitmap as an image
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()

# when you print bitmap, notice how the 1s show the digit 7
# [[ 0.  0.  0.  0.  0.  0.  0.  0.]
#  [ 0.  0.  1.  1.  1.  1.  0.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  0.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  1.  0.]
#  [ 0.  0.  0.  0.  0.  0.  0.  0.]
#  [ 0.  0.  0.  0.  0.  0.  0.  0.]]

### 4.13 NMF learns the parts of images
Now use what you've learned about NMF to decompose the digits dataset. You are again given the digit images as a 2D array 'samples'. This time, you are also provided with a function show_as_image() that displays the image encoded by any 1D array:

def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()
    
After you are done, take a moment to look through the plots and notice how NMF has expressed the digit as a sum of the components!

In [None]:
# Import NMF
from sklearn.decomposition import NMF

# Create an NMF model: model
model = NMF(n_components=7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)

# Assign the 0th row of features: digit_features
digit_features = features[0]

# Print digit_features
print(digit_features)

# [  4.76823559e-01   0.00000000e+00   0.00000000e+00   5.90605054e-01
#    4.81559442e-01   0.00000000e+00   7.37551667e-16]
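
The "sum of the components" idea can be checked numerically. The sketch below is only an illustration: it uses a synthetic non-negative array as a stand-in for 'samples' (the real LED digits are not included here), and the `init`, `max_iter` and `random_state` settings are assumptions chosen to make the run repeatable. It rebuilds one sample as the weighted sum of the NMF components, using that sample's feature values as the weights.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for 'samples' (an assumption, not the LED digits):
# 100 non-negative "images" of 13*8 = 104 pixels built from 7 random parts.
rng = np.random.default_rng(0)
true_parts = rng.random((7, 104))
weights = rng.random((100, 7))
samples = weights @ true_parts

model = NMF(n_components=7, init='nndsvda', max_iter=1000, random_state=0)
features = model.fit_transform(samples)

# Rebuild sample 0 as a weighted sum of the learned components:
# each feature value is the weight of the matching component.
reconstruction = features[0] @ model.components_

print(reconstruction.shape)
# Largest pixel-wise difference between the sample and its reconstruction
print(np.abs(reconstruction - samples[0]).max())
```

Both the feature values and the components are non-negative, so the reconstruction only ever adds parts together and never subtracts them.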

### 4.14 PCA doesn't learn parts
Unlike NMF, PCA doesn't learn the parts of things. Its components do not correspond to topics (in the case of documents) or to parts of images, when trained on images. Verify this for yourself by inspecting the components of a PCA model fit to the dataset of LED digit images from the previous exercise. The images are available as a 2D array 'samples'. 

Also available is a modified version of the show_as_image() function which colors a pixel red if the value is negative.

After submitting the answer, notice that the components of PCA do not represent meaningful parts of images of LED digits!

In [None]:
# repeat previous exercise with PCA components
# Note that PCA components don't represent meaningful parts the way NMF components do

# Import PCA
from sklearn.decomposition import PCA

# Create a PCA instance: model
model = PCA(n_components=7)

# Apply fit_transform to samples: features
features = model.fit_transform(samples)

# Call show_as_image on each component
for component in model.components_:
    show_as_image(component)
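
One way to see why PCA components are harder to read as "parts" is to look at their signs. The following sketch assumes a synthetic 0/1 array in place of the LED digit 'samples', and the NMF settings (`init`, `max_iter`, `random_state`) are illustrative choices. It compares the two models: NMF components contain only non-negative entries, while PCA components mix positive and negative entries (the pixels the modified show_as_image() colors red).

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

# Synthetic stand-in for 'samples' (an assumption): 100 random 0/1 "images"
rng = np.random.default_rng(0)
samples = (rng.random((100, 104)) > 0.7).astype(float)

pca = PCA(n_components=7).fit(samples)
nmf = NMF(n_components=7, init='nndsvda', max_iter=500, random_state=0).fit(samples)

# Count the components of each model that contain at least one negative entry
pca_negative = int((pca.components_ < 0).any(axis=1).sum())
nmf_negative = int((nmf.components_ < 0).any(axis=1).sum())

print("PCA components with negative entries:", pca_negative)
print("NMF components with negative entries:", nmf_negative)
```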
    

### 4.15 Building recommender systems using NMF
Example: Finding similar articles
- Engineer at a large online newspaper
- Task: recommend articles similar to article being read by customer
- Similar articles should have similar topics
- Strategy:
    - Apply NMF to the word-frequency array
    - NMF feature values describe the topics
    - ...so similar documents have similar NMF feature values
    - Compare NMF feature values?
    

#### Apply NMF to 'articles', the word-frequency array

In [None]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)


#### Versions of articles
- Different versions of the same document have the same topic proportions
- ...but the exact feature values may differ!
    - e.g. because one version uses many meaningless words
        - "Dog bites man"
        - vs. "It seems that a dog has perhaps bitten a man"
- On a scatter plot of the NMF features, the weak and strong versions all lie on a single line passing through the origin
- When comparing documents, we therefore need to compare these lines, which is what cosine similarity does
- Cosine similarity
    - uses the angle between the two lines
    - higher values mean more similar
    - max value = 1, when the angle is 0 degrees
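
The points above can be illustrated with a few lines of NumPy. This is a minimal sketch: the feature vectors are made-up values, not output from a real NMF model, and `cosine_similarity` here is a small helper written for the example. A "weak" version of a document is modeled as a scaled-down copy of the "strong" one, so both lie on the same line through the origin.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical NMF feature vectors (illustrative values only)
strong = np.array([0.8, 0.1, 0.0])   # concise version of an article
weak = 0.25 * strong                 # same topic proportions, diluted by filler words
other = np.array([0.0, 0.1, 0.9])    # an article about a different topic

print(cosine_similarity(strong, weak))    # same direction, so the angle is 0
print(cosine_similarity(strong, other))   # nearly perpendicular directions
```

The feature values of `strong` and `weak` differ by a factor of four, yet their cosine similarity is 1 (up to floating-point rounding), because only the direction matters.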

#### Calculating the cosine similarities

In [None]:
# calculate cosine similarities
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
current_article = norm_features[23,:]  # assuming the current article has index 23
# compute cosine similarities
similarities = norm_features.dot(current_article)
print(similarities)
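
The normalize-then-dot approach works because, once every row has unit length, the dot product of two rows equals the cosine of the angle between them. The sketch below checks this against scikit-learn's own `cosine_similarity` function, assuming a small random array as a stand-in for 'nmf_features'.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Random stand-in for 'nmf_features' (an assumption): 5 articles, 6 features
rng = np.random.default_rng(0)
nmf_features = rng.random((5, 6))

# normalize() rescales each row to unit L2 norm by default
norm_features = normalize(nmf_features)
current_article = norm_features[2, :]

# Dot products of unit-length rows are cosine similarities
similarities = norm_features.dot(current_article)

# Reference values computed directly from the unnormalized features
reference = cosine_similarity(nmf_features, nmf_features[2:3]).ravel()

print(np.allclose(similarities, reference))
```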

#### DataFrames and labels
- label similarities with the article titles, using a DataFrame
- Titles given as a list: titles


In [None]:
# create dataframe of similarities

import pandas as pd
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
# select normalized features of the current article
current_article = df.loc['Dog bites man']
# calculate the cosine similarities
similarities = df.dot(current_article)
# using .nlargest() find the articles with the highest cosine similarity
print(similarities.nlargest())

### 4.15.a Which articles are similar to 'Cristiano Ronaldo'?
### Using cosine similarity

In the video, you learned how to use NMF features and the cosine similarity to find similar articles. Apply this to your NMF model for popular Wikipedia articles, by finding the articles most similar to the article about the footballer Cristiano Ronaldo. The NMF features you obtained earlier are available as 'nmf_features', while 'titles' is a list of the article titles.

In [None]:
# Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize

# Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)

# Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo']

# Compute the dot products: similarities
similarities = df.dot(article)

# Display those with the largest cosine similarity
print(similarities.nlargest())

# Cristiano Ronaldo                1.000000
# Franck Ribéry                    0.999972
# Radamel Falcao                   0.999942
# Zlatan Ibrahimović               0.999942
# France national football team    0.999923
# dtype: float64

### 4.15.b Recommend musical artists part I
In this exercise and the next, you'll use what you've learned about NMF to recommend popular music artists! You are given a sparse array 'artists' whose rows correspond to artists and whose columns correspond to users. The entries give the number of times each artist was listened to by each user.

In this exercise, build a pipeline and transform the array into normalized NMF features. The first step in the pipeline, 'MaxAbsScaler', transforms the data so that all users have the same influence on the model, regardless of how many different artists they've listened to. In the next exercise, you'll use the resulting normalized NMF features for recommendation!

This data is part of a larger dataset. (Note: the link was removed because the page was flagged as dangerous.)


In [None]:
# part 1: compute the normalized NMF features

# Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline

# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()

# Create an NMF model: nmf
nmf = NMF(n_components=20)

# Create a Normalizer: normalizer
normalizer = Normalizer()

# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)

# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)
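
To see what the scaling step does on its own, here is a small sketch with a made-up 3x3 play-count matrix (the values are assumptions, not the real dataset). MaxAbsScaler divides each column by its largest absolute value, so every user's column ends up with a maximum of 1, and it accepts sparse input without densifying it.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Toy play counts (illustrative): rows = artists, columns = users
artists = csr_matrix(np.array([
    [100.0, 2.0, 0.0],
    [ 50.0, 0.0, 4.0],
    [  0.0, 1.0, 8.0],
]))

# Each column (user) is divided by its own maximum absolute value
scaled = MaxAbsScaler().fit_transform(artists)

print(scaled.toarray())
```

After scaling, the heavy listener in the first column no longer dominates the other two users, since all three columns span the same 0 to 1 range.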


### 4.16.a Recommend musical artists part II
Suppose you were a big fan of Bruce Springsteen - which other musical artists might you like? Use your NMF features from the previous exercise and the cosine similarity to find similar musical artists. A solution to the previous exercise has been run, so 'norm_features' is an array containing the normalized NMF features as rows. The names of the musical artists are available as the list 'artist_names'.

In [None]:
# part 2: use the normalized NMF features to recommend musical artists

# Import pandas
import pandas as pd

# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)

# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']

# Compute cosine similarities: similarities
similarities = df.dot(artist)

# Display those with highest cosine similarity
print(similarities.nlargest())

# Bruce Springsteen    1.000000
# Neil Young           0.957043
# Van Morrison         0.875627
# Leonard Cohen        0.867585
# Bob Dylan            0.863132
# dtype: float64
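
The same steps can be wrapped in a small reusable function. This is a sketch rather than part of the course material: `most_similar` is a hypothetical helper, the toy features and names are made up, and it assumes the names are unique. It also drops the query itself from the results, since an item is always most similar to itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

def most_similar(norm_features, names, query, n=5):
    """Return the n items with the highest cosine similarity to `query`.

    Assumes the rows of norm_features have unit length and names are unique.
    """
    df = pd.DataFrame(norm_features, index=names)
    similarities = df.dot(df.loc[query])
    # Drop the query itself, which always has similarity 1 with itself
    return similarities.drop(query).nlargest(n)

# Toy features for four hypothetical items (illustrative values only)
toy_features = normalize(np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
]))
toy_names = ['A', 'B', 'C', 'D']

print(most_similar(toy_features, toy_names, 'A', n=2))
```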