# Unsupervised Learning Model Evaluation Lab

Complete the exercises below to solidify your knowledge and understanding of unsupervised learning model evaluation.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn import datasets

data = datasets.load_wine()

X = pd.DataFrame(data["data"], columns=data["feature_names"])
y = pd.Series(data["target"])

In [11]:
print(data.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

## 1. Train a KMeans clustering model on the data set using 8 clusters and compute the silhouette score for the model.

In [17]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

In [55]:
# n_clusters=8 is also KMeans' default; it is stated explicitly to match the exercise
kmeans = KMeans(n_clusters=8).fit_predict(X)

In [28]:
# labels
kmeans

array([3, 3, 2, 6, 4, 6, 2, 2, 3, 3, 6, 2, 2, 3, 6, 2, 2, 3, 6, 0, 4, 4,
       3, 3, 0, 0, 2, 2, 0, 3, 2, 6, 3, 2, 3, 0, 0, 3, 3, 4, 0, 3, 3, 4,
       0, 3, 3, 3, 3, 2, 3, 2, 2, 2, 3, 3, 3, 2, 2, 1, 4, 1, 7, 5, 5, 4,
       1, 1, 4, 4, 0, 5, 1, 3, 0, 1, 5, 1, 4, 1, 5, 4, 7, 1, 1, 1, 1, 7,
       4, 7, 1, 1, 1, 5, 5, 0, 7, 1, 4, 5, 4, 7, 1, 5, 4, 5, 1, 1, 5, 4,
       7, 5, 7, 1, 5, 5, 1, 5, 5, 7, 7, 1, 5, 5, 5, 5, 5, 1, 5, 7, 7, 1,
       7, 7, 7, 4, 4, 1, 7, 7, 7, 4, 1, 7, 0, 0, 5, 7, 7, 7, 1, 1, 1, 4,
       7, 4, 1, 0, 4, 7, 1, 4, 7, 4, 7, 1, 4, 4, 4, 7, 1, 1, 4, 4, 4, 0,
       0, 7], dtype=int32)

In [56]:
s_score = silhouette_score(X, kmeans)
s_score

0.5398971441034137

## 2. Train a KMeans clustering model on the data set using 5 clusters and compute the silhouette score for the model.

In [61]:
kmeans_2 = KMeans(n_clusters=5).fit_predict(X)

In [62]:
s_score2 = silhouette_score(X, kmeans_2)
s_score2

0.5489993239795691

## 3. Train a KMeans clustering model on the data set using 3 clusters and compute the silhouette score for the model.

In [31]:
kmeans_3 = KMeans(n_clusters=3).fit_predict(X)

In [54]:
s_score3 = silhouette_score(X, kmeans_3)
s_score3

0.5711381937868844
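The three runs above can be folded into one loop. This is a minimal sketch, not part of the original lab: it assumes a fixed `random_state=0` and `n_init=10` (both choices made here for reproducibility), so the printed values may differ slightly from the unseeded scores shown above.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same 178 x 13 wine feature matrix used throughout the lab
X = datasets.load_wine().data

scores = {}
for k in (8, 5, 3):
    # random_state and n_init are assumptions added for repeatable results
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.4f}")
```

A higher silhouette score indicates tighter, better separated clusters, so the cluster counts can be ranked directly from `scores`.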

## 4. Use elbow curve visualizations to see if you can determine the best number of clusters to use.

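Yellowbrick's `KElbowVisualizer` can plot three metrics, selected with the `metric` parameter:

- **distortion**: squared distances from points to their assigned cluster centers (lower is better)
- **silhouette**: mean silhouette coefficient, which compares each point's mean intra-cluster distance with its mean distance to the nearest other cluster (ranges from -1 to 1; higher is better)
- **calinski_harabasz**: ratio of between-cluster to within-cluster dispersion (higher is better); older Yellowbrick releases spell it `calinski_harabaz`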

In [63]:
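# Elbow plots live in yellowbrick.cluster, not yellowbrick.features.
# Yellowbrick must be installed first (e.g. `pip install yellowbrick`).
from yellowbrick.cluster import KElbowVisualizer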

ModuleNotFoundError: No module named 'yellowbrick'
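Because Yellowbrick is not installed in this environment, the three curves can be approximated with scikit-learn alone. The sketch below is an assumption-laden stand-in rather than Yellowbrick's own output: it uses KMeans' `inertia_` in place of the distortion metric, a range of 2 to 10 clusters, and a fixed `random_state=0`, none of which come from the original lab.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X = datasets.load_wine().data
ks = list(range(2, 11))  # candidate cluster counts (an assumed range)

inertia, silhouette, calinski = [], [], []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia.append(model.inertia_)  # sum of squared distances to centers
    silhouette.append(silhouette_score(X, model.labels_))
    calinski.append(calinski_harabasz_score(X, model.labels_))

# One panel per metric, all sharing the same x axis of cluster counts
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
panels = zip(
    axes,
    (inertia, silhouette, calinski),
    ("inertia (distortion)", "silhouette", "calinski_harabasz"),
)
for ax, values, name in panels:
    ax.plot(ks, values, marker="o")
    ax.set_xlabel("number of clusters (k)")
    ax.set_title(name)
fig.tight_layout()
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` afterwards. Look for the bend in the inertia curve and the peaks in the other two.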

## 5. Try performing the same elbow tests with an AgglomerativeClustering model and compare the results you get to the KMeans results.
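One possible approach is sketched below. `AgglomerativeClustering` has no cluster centers and therefore no `inertia_`, so this comparison relies on the silhouette and Calinski-Harabasz scores only. The cluster range, the default Ward linkage, and the fixed `random_state=0` for KMeans are assumptions made for this example.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X = datasets.load_wine().data

results = []
for k in range(2, 11):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # AgglomerativeClustering uses Ward linkage by default and is deterministic
    ag_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    results.append({
        "k": k,
        "kmeans_silhouette": silhouette_score(X, km_labels),
        "agglo_silhouette": silhouette_score(X, ag_labels),
        "kmeans_ch": calinski_harabasz_score(X, km_labels),
        "agglo_ch": calinski_harabasz_score(X, ag_labels),
    })

for row in results:
    print(
        f"k={row['k']:>2}  "
        f"silhouette: KMeans={row['kmeans_silhouette']:.3f} "
        f"Agglo={row['agglo_silhouette']:.3f}  "
        f"CH: KMeans={row['kmeans_ch']:.1f} Agglo={row['agglo_ch']:.1f}"
    )
```

Plotting the two silhouette columns against `k` gives curves that can be read the same way as the KMeans elbow plots.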

## 6. Create and plot a scatter matrix showing how the clusters are grouped across all the different combinations of variables in the data.

Use the model and number of clusters that returned the best result above.
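A minimal sketch using pandas' built-in `scatter_matrix`, assuming KMeans with 3 clusters (the configuration with the highest silhouette score in exercises 1 to 3). The figure size, marker size, colormap, and `random_state=0` are illustrative choices.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

data = datasets.load_wine()
X = pd.DataFrame(data["data"], columns=data["feature_names"])

# Cluster labels from the best-scoring configuration above (3 clusters)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One scatter panel per feature pair, colored by cluster label;
# histograms of each feature go on the diagonal
axes = pd.plotting.scatter_matrix(
    X,
    c=labels,
    cmap="viridis",
    figsize=(20, 20),
    diagonal="hist",
    s=10,
    alpha=0.8,
)
```

With 13 features this draws a 13 x 13 grid, so it can take a few seconds to render. Passing a subset of columns, such as `X[["alcohol", "proline", "hue"]]`, makes individual panels easier to read.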

## 7. Apply a PCA transform and plot the first two principal components with the plot point colors determined by cluster.
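A minimal sketch of this step, assuming the 3-cluster KMeans labels and the raw (unscaled) features used in the earlier exercises. The colormap and `random_state=0` are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = datasets.load_wine().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project the 13 features onto the two directions of greatest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X)

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(components[:, 0], components[:, 1], c=labels, cmap="viridis")
ax.set_xlabel("principal component 1")
ax.set_ylabel("principal component 2")
ax.set_title("KMeans clusters in PCA space")
fig.colorbar(points, ax=ax, label="cluster")

print(pca.explained_variance_ratio_)
```

On unscaled data the first component is dominated by the feature with the largest variance (proline here). Applying `StandardScaler` before both the clustering and the PCA is a common variation worth trying.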

## 8. Generate a series of t-SNE plots showing the clusters at a variety of perplexities.
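One way to sketch this, assuming the same 3-cluster KMeans labels as the earlier exercises. The perplexity values (5, 15, 30, 50) and `random_state=0` are illustrative choices, not values prescribed by the lab.

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

X = datasets.load_wine().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

perplexities = (5, 15, 30, 50)  # each must be smaller than the 178 samples
embeddings = {}

fig, axes = plt.subplots(1, len(perplexities), figsize=(5 * len(perplexities), 5))
for ax, perplexity in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    ax.scatter(
        embeddings[perplexity][:, 0],
        embeddings[perplexity][:, 1],
        c=labels,
        cmap="viridis",
        s=15,
    )
    ax.set_title(f"perplexity = {perplexity}")
    ax.set_xticks([])
    ax.set_yticks([])
fig.tight_layout()
```

Low perplexities emphasize local neighborhoods and tend to fragment the data into small islands, while higher values give more weight to global structure. Distances between t-SNE clusters are not directly meaningful, so the panels are best compared for how consistently the colored groups stay together.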