## Applications of unsupervised learning

* **Recommender systems**, which involve grouping together users with similar viewing patterns in order to recommend similar content.
* **Customer segmentation**, or understanding different customer groups around which to build marketing or other business strategies.
* **Genetics**, for example clustering DNA patterns to analyze evolutionary biology.
* **Anomaly detection**, including fraud detection or detecting defective mechanical parts (as in predictive maintenance).
* **Outlier detection** within a data science / data analytics workflow.

# Let's do some unsupervised learning!

**We're doing KMeans clustering**

In [2]:
# let's get some data

In [3]:
import pandas as pd

In [4]:
from sklearn import datasets

In [5]:
data = datasets.load_wine()

In [6]:
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [7]:
print(data['DESCR'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [8]:
data['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [9]:
data['data']

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [10]:
data['data'].shape

(178, 13)

In [11]:
X = pd.DataFrame(data['data'], columns=data['feature_names'])

y = pd.Series(data['target'])

In [12]:
X.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [13]:
y.unique()

array([0, 1, 2])

In [14]:
from sklearn.preprocessing import StandardScaler
X_prep = StandardScaler().fit_transform(X)

In [15]:
# dataframe of scaled features
X_prep_df = pd.DataFrame(X_prep, columns=data['feature_names'])
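
Scaling matters here because KMeans relies on Euclidean distance: in the raw data, proline takes values in the hundreds to thousands while hue stays close to 1, so proline would dominate every distance. As a minimal sketch of what `StandardScaler` does to each column (the names `X_raw` and `X_manual` are illustrative, not part of the notebook), the transform can be reproduced by hand:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_raw = load_wine()["data"]

# Subtract each column's mean and divide by its population
# standard deviation (ddof=0, which is NumPy's default for std).
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_raw)

# The manual version and the library version should agree.
print(np.allclose(X_manual, X_scaled))

# Raw per-feature spread, to show how different the original scales are.
print(X_raw.std(axis=0).round(2))
```

After scaling, every feature has mean 0 and unit variance, so each one contributes on a comparable scale to the distances KMeans computes.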

In [16]:
# Image and HTML are both exposed by IPython.display
from IPython.display import Image, HTML

In [17]:
Image("k_means.gif")

FileNotFoundError: No such file or directory: 'k_means.gif'

<IPython.core.display.Image object>
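
Since the animation did not load, the iterations it is meant to illustrate can be sketched in plain NumPy. This is a simplified version of the classic assign-then-update loop (often called Lloyd's algorithm), not scikit-learn's implementation. The function names `assign_labels` and `lloyd_kmeans` and the two-blob toy data are hypothetical, and the starting centroids are passed in explicitly rather than chosen with `k-means++`.

```python
import numpy as np

def assign_labels(points, centroids):
    # Squared Euclidean distance from every point to every centroid,
    # shape (n_points, k); each point takes the index of the nearest one.
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def lloyd_kmeans(points, init_centroids, n_iter=100):
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid.
        labels = assign_labels(points, centroids)
        # Step 2: move each centroid to the mean of its assigned points
        # (a centroid with no points stays where it is).
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign_labels(points, centroids)

# Toy data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
toy = np.vstack([blob_a, blob_b])

# Start with one centroid taken from each blob.
centroids, labels = lloyd_kmeans(toy, init_centroids=toy[[0, -1]])
print(centroids)
print(np.bincount(labels))
```

The result depends on the starting centroids, which is why `KMeans` defaults to the `k-means++` initialization and repeats the procedure `n_init` times.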

In [25]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=8, random_state=1234)
kmeans.fit(X_prep_df)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1234, tol=0.0001, verbose=0)
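
Because the wine data happens to come with class labels, one way to sanity-check a clustering is to cross-tabulate cluster assignments against those labels. The sketch below is self-contained and makes one assumption that differs from the cell above: it uses `n_clusters=3` to match the three known classes. The names `X_std`, `km3`, and `table` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine["data"])

# n_clusters=3 is an assumption, chosen to match the three known classes.
km3 = KMeans(n_clusters=3, n_init=10, random_state=1234).fit(X_std)

# Rows: true class, columns: cluster id. Cluster ids are arbitrary,
# so cluster 0 does not have to correspond to class 0.
table = pd.crosstab(
    pd.Series(wine["target"], name="class"),
    pd.Series(km3.labels_, name="cluster"),
)
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean agreement no better than chance.
print(adjusted_rand_score(wine["target"], km3.labels_))
```

In a truly unsupervised setting no such labels exist, so this kind of check is only available on benchmark data like this one.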

In [26]:
kmeans.cluster_centers_

array([[ 0.59147756, -0.32436906,  1.34490077,  0.25175579,  0.99333031,
         0.84371458,  0.92884469, -0.12237604,  0.13761844,  0.02557888,
         0.70828955,  0.64185034,  1.00327853],
       [ 0.02877558,  1.05666617,  0.08901746,  0.39972231, -0.26270834,
        -1.18764705, -1.29132829,  0.73478767, -1.2028952 ,  0.05630806,
        -0.8777228 , -1.16979521, -0.49796412],
       [-0.68815199, -0.86587116, -1.69237694, -0.58138787, -0.79550369,
        -0.22120754, -0.07426899, -0.46048791, -0.31385814, -0.79739259,
         0.71832595,  0.34325254, -0.7236641 ],
       [-0.69634533,  0.10356862, -0.05533288,  0.28730677, -0.40313502,
         0.45372181,  0.45840358, -0.56787084,  0.40200861, -0.83045392,
         0.14734897,  0.61860727, -0.66950187],
       [-1.08833361, -0.52400154,  0.07630323,  0.46369607, -0.65956635,
        -0.60314437, -0.42529786,  1.07812028, -0.52034416, -0.86374948,
         0.55789073, -0.24986507, -0.6679884 ],
       [ 0.3384594 ,  0.846390

In [28]:
def get_inertia(n_clusters):
    # fit KMeans with the requested number of clusters (not a hard-coded 8)
    kmeans = KMeans(n_clusters=n_clusters, random_state=1234)
    kmeans.fit(X_prep_df)
    # fitted attributes end with an underscore: inertia_, not inertia
    return kmeans.inertia_

cluster_range = range(1, 11)
dct = {cluster_number: get_inertia(cluster_number) for cluster_number in cluster_range}
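
Inertia tends to fall as clusters are added, so it is usually read as a curve rather than a single number: the "elbow" where the drop flattens out suggests a reasonable number of clusters. The following is a minimal, self-contained sketch of that idea. It prints the values instead of plotting them, and the names `X_std`, `inertias`, and `drops` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine()["data"])

# Inertia (within-cluster sum of squared distances) for k = 1..10.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1234).fit(X_std)
    inertias[k] = model.inertia_

# How much inertia is gained by going from k-1 to k clusters.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 11)}

for k in range(1, 11):
    print(k, round(inertias[k], 1), round(drops.get(k, 0.0), 1))
```

With standardized features, the value at `k=1` equals the number of samples times the number of features (178 × 13), which gives a convenient baseline for judging how much each extra cluster helps.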