**Unsupervised learning**
___
- Unsupervised learning finds patterns in data without using labels
- Dimension = number of features
- k-means clustering
    - finds clusters of samples
    - number of clusters must be specified
    - implemented in sklearn ("scikit-learn")
    - new samples can be assigned to existing clusters
        - k-means remembers the mean of each cluster (the "centroids")
        - finds the nearest centroid to each new sample
___
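
The last two bullets can be checked directly. The sketch below is a minimal illustration on synthetic data: the `points` and `new_points` arrays are made up with NumPy and are not the course data. It fits a KMeans model, then compares `predict()` with a manual nearest-centroid lookup; the two should agree.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data (an assumption for illustration): three well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
new_points = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Fit on the original points, then label samples the model has not seen
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(points)
labels = model.predict(new_points)

# Distance from every new sample to every centroid, shape (30, 3)
distances = np.linalg.norm(
    new_points[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :],
    axis=2,
)
nearest = distances.argmin(axis=1)

# True when predict() matches the index of the nearest centroid
print(np.array_equal(labels, nearest))
```

The label numbers themselves are arbitrary: which blob is called 0, 1, or 2 depends on the initialization.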

In [None]:
#clustering 2D points

# Import pyplot
#import matplotlib.pyplot as plt

# Import KMeans
#from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
#model = KMeans(n_clusters = 3)

# Fit model to points
#model.fit(points)

# Determine the cluster labels of new_points: labels
#labels = model.predict(new_points)

# Print cluster labels of new_points
#print(labels)

#################################################
#<script.py> output:
#    [1 2 0 1 2 1 2 2 2 0 1 2 2 0 0 2 0 0 2 2 0 2 1 2 1 0 2 0 0 1 1 2 2 2 0 1 2
#     2 1 2 0 1 1 0 1 2 0 0 2 2 2 2 0 0 1 1 0 0 0 1 1 2 2 2 1 2 0 2 1 0 1 1 1 2
#     1 0 0 1 2 0 1 0 1 2 0 2 0 1 2 2 2 1 2 2 1 0 0 0 0 1 2 1 0 0 1 1 2 1 0 0 1
#     0 0 0 2 2 2 2 0 0 2 1 2 0 2 1 0 2 0 0 2 0 2 0 1 2 1 1 2 0 1 2 1 1 0 2 2 1
#     0 1 0 2 1 0 0 1 0 2 2 0 2 0 0 2 2 1 2 2 0 1 0 1 1 2 1 2 2 1 1 0 1 1 1 0 2
#     2 1 0 1 0 0 2 2 2 1 2 2 2 0 0 1 2 1 1 1 0 2 2 2 2 2 2 0 0 2 0 0 0 0 2 0 0
#     2 2 1 0 1 1 0 1 0 1 0 2 2 0 2 2 2 0 1 1 0 2 2 0 2 0 0 2 0 0 1 0 1 1 1 2 0
#     0 0 1 2 1 0 1 0 0 2 1 1 1 0 2 2 2 1 2 0 0 2 1 1 0 1 1 0 1 2 1 0 0 0 0 2 0
#     0 2 2 1]

# Assign the columns of new_points: xs and ys
#xs = new_points[:,0]
#ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
#plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
#centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
#centroids_x = centroids[:,0]
#centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
#plt.scatter(centroids_x, centroids_y, marker ='D', s=50)
#plt.show()

![images/8.1.svg](images/8.1.svg)

**Evaluating a clustering**
___
- compare against known clustering
    - using crosstabs
- inertia measures clustering quality
    - how spread out the clusters are (lower is better)
    - sum of squared distances from each sample to the centroid of its cluster
    - a good choice of k is the "elbow" in the inertia plot, where inertia starts to decrease more slowly
___
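
To make the definition of inertia concrete, here is a small sketch that recomputes it by hand and compares it with the `inertia_` attribute of a fitted model. The `samples` array is synthetic (an assumption for illustration), not the grain data used in the cells below.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic samples (an assumption): three blobs in 2D
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid of the cluster each sample was assigned to
own_centroids = model.cluster_centers_[labels]

# Inertia by hand: squared distance of each sample to its own centroid, summed
manual_inertia = ((samples - own_centroids) ** 2).sum()

# The two numbers should agree up to floating point error
print(manual_inertia, model.inertia_)
```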

In [None]:
# How many clusters?

# Import pyplot
#import matplotlib.pyplot as plt

# Import KMeans
#from sklearn.cluster import KMeans

#ks = range(1, 6)
#inertias = []

#for k in ks:
    # Create a KMeans instance with k clusters: model
#    model = KMeans(n_clusters=k)

    # Fit model to samples
#    model.fit(samples)

    # Append the inertia to the list of inertias
#    inertias.append(model.inertia_)

# Plot ks vs inertias
#plt.plot(ks, inertias, '-o')
#plt.xlabel('number of clusters, k')
#plt.ylabel('inertia')
#plt.xticks(ks)
#plt.show()

![images/8.2.svg](images/8.2.svg)

In [1]:
# Evaluating clusters

#import libraries
from sklearn.cluster import KMeans
import pandas as pd

# Create a KMeans model with 3 clusters: model
#model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
#labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
#df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
#ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
#print(ct)

#################################################
#<script.py> output:
#    varieties  Canadian wheat  Kama wheat  Rosa wheat
#    labels
#    0                       0           1          60
#    1                      68           9           0
#    2                       2          60          10

**Transforming features for better clusterings**
___
- when features have very different variances, clustering can give poor results
    - for k-means clustering, variance = influence: high-variance features dominate the distances
- StandardScaler from sklearn.preprocessing transforms each feature to have mean 0 and variance 1
    - StandardScaler has fit() and transform()
    - KMeans has fit() and predict()
- Normalizer rescales each sample (rather than each feature) independently of the others
___
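
The difference between the two transformers is easiest to see on a small array. The sketch below uses synthetic data (an assumption for illustration) where the second feature has a much larger variance than the first. StandardScaler works column by column; Normalizer with its default L2 norm works row by row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Synthetic data (an assumption): the second feature has a far larger variance
rng = np.random.default_rng(0)
samples = np.column_stack([
    rng.normal(loc=10.0, scale=1.0, size=100),
    rng.normal(loc=500.0, scale=200.0, size=100),
])

# StandardScaler: each feature (column) ends up with mean 0 and variance 1
scaled = StandardScaler().fit_transform(samples)
print(scaled.mean(axis=0))
print(scaled.var(axis=0))

# Normalizer: each sample (row) is rescaled to unit length
normalized = Normalizer().fit_transform(samples)
print(np.linalg.norm(normalized, axis=1)[:5])
```

This is why StandardScaler suits the fish measurements, where features are on different scales, while Normalizer suits the stock movements, where each company's row is rescaled on its own.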

In [None]:
#scaling data for clustering

# Perform the necessary imports
#from sklearn.pipeline import make_pipeline
#from sklearn.preprocessing import StandardScaler
#from sklearn.cluster import KMeans

#import pandas as pd

# Create scaler: scaler
#scaler = StandardScaler()

# Create KMeans instance: kmeans
#kmeans = KMeans(n_clusters=4)

# Create pipeline: pipeline
#pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline to samples
#pipeline.fit(samples)

# Calculate the cluster labels: labels
#labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
#df = pd.DataFrame({'labels' : labels, 'species' : species})

# Create crosstab: ct
#ct = pd.crosstab(df['labels'], df['species'])

# Display ct
#print(ct)

#################################################
#<script.py> output:
#    species  Bream  Pike  Roach  Smelt
#    labels
#    0            0     0      0     13
#    1           33     0      1      0
#    2            0    17      0      0
#    3            1     0     19      1

In [None]:
#Clustering stocks using KMeans and Normalizer

# Perform the necessary imports
#from sklearn.pipeline import make_pipeline
#from sklearn.preprocessing import Normalizer
#from sklearn.cluster import KMeans

#import pandas as pd

# Create a normalizer: normalizer
#normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
#kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
#pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
#pipeline.fit(movements)

# Predict the cluster labels: labels
#labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
#df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
#print(df.sort_values('labels'))

#################################################
#<script.py> output:
#        labels                           companies
#    59       0                               Yahoo
#    15       0                                Ford
#    35       0                            Navistar
#    26       1                      JPMorgan Chase
#    16       1                   General Electrics
#    58       1                               Xerox
#    11       1                               Cisco
#    18       1                       Goldman Sachs
#    20       1                          Home Depot
#    5        1                     Bank of America
#    3        1                    American express
#    55       1                         Wells Fargo
#    1        1                                 AIG
#    38       2                               Pepsi
#    40       2                      Procter Gamble
#    28       2                           Coca Cola
#    27       2                      Kimberly-Clark
#    9        2                   Colgate-Palmolive
#    54       3                            Walgreen
#    36       3                    Northrop Grumman
#    29       3                     Lookheed Martin
#    4        3                              Boeing
#    0        4                               Apple
#    47       4                            Symantec
#    33       4                           Microsoft
#    32       4                                  3M
#    31       4                           McDonalds
#    30       4                          MasterCard
#    50       4  Taiwan Semiconductor Manufacturing
#    14       4                                Dell
#    17       4                     Google/Alphabet
#    24       4                               Intel
#    23       4                                 IBM
#    2        4                              Amazon
#    51       4                   Texas instruments
#    43       4                                 SAP
#    45       5                                Sony
#    48       5                              Toyota
#    21       5                               Honda
#    22       5                                  HP
#    34       5                          Mitsubishi
#    7        5                               Canon
#    56       6                            Wal-Mart
#    57       7                               Exxon
#    44       7                        Schlumberger
#    8        7                         Caterpillar
#    10       7                      ConocoPhillips
#    12       7                             Chevron
#    13       7                   DuPont de Nemours
#    53       7                       Valero Energy
#    39       8                              Pfizer
#    41       8                       Philip Morris
#    25       8                   Johnson & Johnson
#    49       9                               Total
#    46       9                      Sanofi-Aventis
#    37       9                            Novartis
#    42       9                   Royal Dutch Shell
#    19       9                     GlaxoSmithKline
#    52       9                            Unilever
#    6        9            British American Tobacco

**Visualizing hierarchies**
___
- t-SNE
    - creates a 2D map of a dataset
- Hierarchical clustering
    - 2D array of scores
    - dendrogram
    - number of merge operations = number of samples - 1
    - agglomerative clustering
        - each row begins in a separate cluster, at each step the two closest clusters are merged
        - continues until all rows are in a single cluster
    - divisive clustering
        - works the other way around: all rows begin in a single cluster, which is split step by step
___
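
The merge count for agglomerative clustering can be read off the array that `linkage()` returns. This is a minimal sketch on random data (the `samples` array is an assumption for illustration): with 8 samples there should be 7 merges, and the last merge should contain all 8 samples.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Random samples (an assumption for illustration): 8 rows, 3 features
rng = np.random.default_rng(0)
samples = rng.normal(size=(8, 3))

mergings = linkage(samples, method='complete')

# One row per merge: [cluster_i, cluster_j, merge distance, size of new cluster]
print(mergings.shape)

# Size of the cluster created by the final merge
print(mergings[-1, 3])
```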

In [None]:
#Hierarchical clustering of grain data

# Perform the necessary imports
#from scipy.cluster.hierarchy import linkage, dendrogram
#import matplotlib.pyplot as plt

# Calculate the linkage: mergings
#mergings = linkage(samples, method='complete')

# Plot the dendrogram, using varieties as labels
#dendrogram(mergings,
#           labels=varieties,
#           leaf_rotation=90,
#           leaf_font_size=6,
#)
#plt.show()

![images/8.3.svg](images/8.3.svg)

In [None]:
#Hierarchical clustering of stock data with normalize()

# Perform the necessary imports
#from scipy.cluster.hierarchy import linkage, dendrogram
#import matplotlib.pyplot as plt
#from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
#normalized_movements = normalize(movements)

# Calculate the linkage: mergings
#mergings = linkage(normalized_movements, method='complete')

# Plot the dendrogram
#dendrogram(mergings,
#            labels=companies,
#            leaf_rotation=90,
#            leaf_font_size=6
#)
#plt.show()

![images/8.4.svg](images/8.4.svg)

**Cluster labels in hierarchical clustering**
___
- cluster labels from intermediate stages can be recovered and crosstabulated
- y axis of a dendrogram indicates height
    - height = distance between merging clusters
    - cluster labels at a given height are extracted using fcluster() in scipy.cluster.hierarchy
- linkage method defines how the distance between clusters is measured
    - **complete** - distance between clusters is the distance between the furthest points of the clusters
    - **single** - distance between clusters is the distance between the closest points of the clusters
___
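
A tiny worked example may help with the two linkage definitions and with `fcluster()`. The four points below are hand-made (an assumption for illustration) so that the distances are easy to verify by eye.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, fcluster

# Two tiny hand-made clusters on a line (an assumption for illustration)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All point-to-point distances between the two clusters
pairwise = cdist(cluster_a, cluster_b)
complete_distance = pairwise.max()  # furthest pair: (0, 0) and (6, 0)
single_distance = pairwise.min()    # closest pair: (1, 0) and (4, 0)
print(complete_distance, single_distance)

# The height of the final merge matches the chosen linkage distance
samples = np.vstack([cluster_a, cluster_b])
mergings_complete = linkage(samples, method='complete')
mergings_single = linkage(samples, method='single')
print(mergings_complete[-1, 2], mergings_single[-1, 2])

# Cut the complete-linkage dendrogram at height 2.5 to get flat cluster labels
labels = fcluster(mergings_complete, t=2.5, criterion='distance')
print(labels)
```

Cutting at height 2.5 keeps the two original pairs apart, because their final merge only happens at height 6 under complete linkage.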

In [None]:
#single linkage, different dendrogram

# Perform the necessary imports
#import matplotlib.pyplot as plt
#from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
#mergings = linkage(samples, method='single')

# Plot the dendrogram
#dendrogram(mergings,
#            labels=country_names,
#            leaf_rotation=90,
#            leaf_font_size=6
#)
#plt.show()

![images/8.5.svg](images/8.5.svg)

In [None]:
#extracting cluster labels

# Perform the necessary imports
#import matplotlib.pyplot as plt
#from scipy.cluster.hierarchy import linkage, dendrogram
#import pandas as pd
#from scipy.cluster.hierarchy import fcluster

# Calculate the linkage: mergings
#mergings = linkage(samples, method='single')

# Use fcluster to extract labels: labels
#labels = fcluster(mergings, t=6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
#df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
#ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
#print(ct)

#################################################
#<script.py> output:
#    varieties  Canadian wheat  Kama wheat  Rosa wheat
#    labels
#    1                      14           3           0
#    2                       0           0          14
#    3                       0          11           0

**t-SNE for 2-dimensional maps**
___
"t-distributed stochastic neighbor embedding"
- maps samples to 2D or 3D space
- map approximately preserves nearness of samples
- great for inspecting datasets
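- has fit_transform() but no separate transform() method, so new samples cannot be added to an existing map
- t-SNE learning rate - try values between 50 and 200
    - a bad value shows up as all points bunched together in the plot
- axis values are not interpretable and differ from run to run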
___
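
The points about `fit_transform()` can be seen in a short sketch. The data here is synthetic (an assumption for illustration): two groups of 30 samples in 5 dimensions. `random_state` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data (an assumption): two groups of samples in 5 dimensions
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, size=(30, 5)),
    rng.normal(loc=6.0, size=(30, 5)),
])

model = TSNE(learning_rate=200, random_state=0)

# Fitting and transforming happen in a single step
tsne_features = model.fit_transform(samples)

# One 2D point per sample
print(tsne_features.shape)

# Check whether the model offers a separate transform() method
print(hasattr(model, 'transform'))
```

Because there is no `transform()`, mapping additional samples means running `fit_transform()` again on the full dataset, which produces a new map.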

In [None]:
#t-SNE visualization of grain dataset

#import matplotlib.pyplot as plt

# Import TSNE
#from sklearn.manifold import TSNE

# Create a TSNE instance: model
#model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
#tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
#xs = tsne_features[:,0]

# Select the 1st feature: ys
#ys = tsne_features[:,1]

# Scatter plot, coloring by variety_numbers
#plt.scatter(xs, ys, c=variety_numbers)
#plt.show()

![images/8.6.svg](images/8.6.svg)

In [None]:
#t-SNE map of the stock market

#import matplotlib.pyplot as plt

# Import TSNE
#from sklearn.manifold import TSNE

# Create a TSNE instance: model
#model = TSNE(learning_rate=50)

# Apply fit_transform to normalized_movements: tsne_features
#tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
#xs = tsne_features[:,0]

# Select the 1st feature: ys
#ys = tsne_features[:,1]

# Scatter plot
#plt.scatter(xs, ys, alpha=0.5)

# Annotate the points
#for x, y, company in zip(xs, ys, companies):
#    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
#plt.show()

![images/8.7.svg](images/8.7.svg)