## Applications of unsupervised learning

* **Recommender systems**, which involve grouping together users with similar viewing patterns in order to recommend similar content.
* **Customer segmentation**, or understanding different customer groups around which to build marketing or other business strategies.
* **Genetics**, for example clustering DNA patterns to analyze evolutionary biology.
* **Anomaly detection**, including fraud detection or detecting defective mechanical parts (e.g., for predictive maintenance).
* **Outlier detection** within a data science / data analytics workflow.

# Let's do some unsupervised learning!

**We're doing KMeans clustering**

In [1]:
# let's get some data

In [2]:
import pandas as pd

In [None]:
from sklearn import datasets

In [None]:
data = datasets.load_wine()

In [None]:
data.keys()

In [None]:
print(data['DESCR'])

In [None]:
data['target']

In [None]:
data['data']

In [None]:
data['data'].shape

In [None]:
X = pd.DataFrame(data['data'], columns=data['feature_names'])

y = pd.Series(data['target'])

In [None]:
X.head()

In [None]:
y.unique()

In [None]:
from sklearn.preprocessing import StandardScaler
X_prep = StandardScaler().fit_transform(X)

In [None]:
# dataframe of scaled features
X_prep_df = pd.DataFrame(X_prep, columns=data['feature_names'])

In [None]:
# Image and HTML both live in IPython.display (IPython.core.display is deprecated for this)
from IPython.display import Image, HTML

In [None]:
Image("k_means.gif")

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=8, random_state=1234)
kmeans.fit(X_prep_df)

In [None]:
kmeans.cluster_centers_

## What makes a cluster a "good" cluster?

* **Inertia**: the sum of squared distances between each point and the centroid of its cluster. Intuitively, inertia tells how far the points within a cluster are from their centroid, so a small inertia is what we aim for. Its value starts at zero and has no upper bound.

* **Silhouette score** (discussed later): ranges from -1 to 1; higher is better.
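
To make the definition of inertia concrete, here is a minimal sketch that recomputes it by hand and compares it with the value scikit-learn reports. The names `X_demo` and `km_demo` are made up for this illustration, and `n_clusters=3` and `n_init=10` are arbitrary choices, not settings taken from this notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same wine data and scaling as used in this notebook
X_demo = StandardScaler().fit_transform(datasets.load_wine()['data'])

# n_clusters=3 and n_init=10 are arbitrary choices for this illustration
km_demo = KMeans(n_clusters=3, n_init=10, random_state=1234).fit(X_demo)

# Centroid assigned to each sample, shape (n_samples, n_features)
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]

# Inertia: sum of squared Euclidean distances to the assigned centroid
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(manual_inertia, km_demo.inertia_)
```

The two printed numbers should agree up to floating-point rounding.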

In [None]:
Image("inertia_plate.jpg")

In [None]:
Image("inertia_scale.jpg")

In [None]:
Image("inertia_sum_of_squares.png")

In [None]:
# total inertia: sum of squared distances of all samples to their closest centroid
kmeans.inertia_

In [None]:
clusters = kmeans.predict(X_prep)
clusters

In [None]:
pd.Series(clusters).value_counts().sort_index()

In [None]:
kmeans.cluster_centers_

In [None]:
X_df = X.copy()  # work on a copy so the cluster column is not added to X itself
X_df['cluster'] = clusters
X_df.head()

In [None]:
X_df['cluster'].plot(kind='hist')

```python
def get_inertia(n_clusters):
    # use the function argument instead of a hard-coded number of clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=1234)

    # train the model on the scaled features
    kmeans.fit(X_prep_df)
    # return the resulting inertia
    return kmeans.inertia_

cluster_range = range(1, 11)

dct = {cluster_number: get_inertia(cluster_number) for cluster_number in cluster_range}

```

In [None]:
# I want to iterate over a range of n_clusters and for every value, I want to return the inertia
def get_kmeans_inertia_varying_cluster_n(n_clusters):
    
    # setup the model
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=1234,
                    n_init=3,
                    #algorithm='elkan',
                   )
    # train the model
    kmeans.fit(X_prep_df)
    
    # return the resulting inertia
    return kmeans.inertia_

# Plot for a range of cluster numbers
import matplotlib.pyplot as plt

cluster_range = range(1,20)

plt.plot(cluster_range,
         [get_kmeans_inertia_varying_cluster_n(c_number) for c_number in cluster_range],
         marker="o",
         ms=10,
        )
plt.xlabel('Cluster Number')
plt.ylabel('inertia')

In [None]:
# I want to iterate over a range of max_iter and for every value, I want to return the inertia
def get_kmeans_inertia_varying_max_iter(max_iter):
    # setup the model with a fixed number of clusters and a varying iteration cap
    kmeans = KMeans(n_clusters=5,
                    random_state=1234,
                    n_init=3,
                    algorithm='elkan',
                    max_iter=max_iter,
                   )
    kmeans.fit(X_prep_df)

    return kmeans.inertia_

max_iter_list = [1, 5, 10, 20, 30, 40, 50, 100]

plt.plot(max_iter_list,
         [get_kmeans_inertia_varying_max_iter(x) for x in max_iter_list],
        )
plt.xlabel('Max iter')
plt.ylabel('inertia')

* **Inertia** (recap): the sum of squared distances between each point and the centroid of its cluster. A small inertia is what we aim for; its value starts at zero and has no upper bound.

* **Silhouette score**: ranges from -1 to 1; higher is better.

* The scikit-learn documentation explains how to read [silhouette plots](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
* Inertia is the metric that KMeans minimizes to find the optimal centroids
* but it does **not have a limited range** (like the Mean Squared Error, it is unbounded and expressed in units of squared distance)
* it ranges from 0 upwards, with no upper limit
* so it is a score that is **not really comparable** across datasets, feature scales, or numbers of clusters
* what the silhouette score measures: **how similar an observation is to its own cluster compared to other clusters**
* $S_i = \frac{b_i - a_i}{\max(a_i, b_i)}$
    * `a`: the mean intra-cluster distance (the average distance between the i-th observation and every other observation in the cluster that i belongs to)
    * `b`: the mean **nearest** inter-cluster distance (the average distance between the i-th observation and every observation in the nearest cluster that i is **not part of**)

* The **silhouette score for the whole model** is the **average** of the silhouette scores of all instances.
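
The formula above can be checked by hand for a single observation. The following is a minimal sketch, assuming Euclidean distances (the default metric of scikit-learn's silhouette functions) and a 3-cluster KMeans fit on the scaled wine data; `X_demo`, `labels`, and the choice `i = 0` are illustrative names and values, not part of the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(datasets.load_wine()['data'])
labels = KMeans(n_clusters=3, n_init=10, random_state=1234).fit_predict(X_demo)

i = 0  # index of the observation to inspect (arbitrary)

# Euclidean distance from observation i to every observation (0 to itself)
dist_i = np.linalg.norm(X_demo - X_demo[i], axis=1)

# a_i: mean distance to the *other* members of i's own cluster
own = labels == labels[i]
a_i = dist_i[own].sum() / (own.sum() - 1)

# b_i: smallest mean distance to the members of any other cluster
b_i = min(dist_i[labels == c].mean()
          for c in np.unique(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)

per_sample = silhouette_samples(X_demo, labels)
print(s_i, per_sample[i])                                  # hand-computed vs scikit-learn
print(per_sample.mean(), silhouette_score(X_demo, labels)) # model score is the average
```

The first line compares the hand-computed value with scikit-learn's per-sample score; the second shows that the model-level score is the mean of the per-sample scores.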

Well-separated clusters:
* `a` - the mean intra-cluster distance is small relative to
* `b` - the mean distance to the nearest cluster that the point is not part of
* that means $S = (b - a) / \max(a,b)$ approaches 1

Not so well-separated clusters:
* `a` - the mean intra-cluster distance is not so small relative to
* `b` - the mean distance to the nearest cluster that the point is not part of
* that means $S = (b - a) / \max(a,b)$ becomes smaller and smaller (it approaches 0 when b = a)
* S becomes negative for a point that is, on average, closer to another cluster than to its own, i.e. a point that is not (yet) in the right cluster (too few iterations? Play with the tolerance. Or a random effect: increase `n_init`?)
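
A small synthetic experiment can illustrate the two cases. This is only a sketch: the blob centers, the two `cluster_std` values, and the helper name `blob_silhouette` are assumptions chosen for illustration, and the exact scores depend on them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [(-5, -5), (0, 0), (5, 5)]  # synthetic blob centers (illustrative)

def blob_silhouette(cluster_std):
    """Silhouette score of a 3-cluster KMeans fit on synthetic blobs."""
    X_blobs, _ = make_blobs(n_samples=300, centers=centers,
                            cluster_std=cluster_std, random_state=1234)
    labels = KMeans(n_clusters=3, n_init=10,
                    random_state=1234).fit_predict(X_blobs)
    return silhouette_score(X_blobs, labels)

tight = blob_silhouette(0.5)  # small spread: a is much smaller than b
loose = blob_silhouette(3.0)  # large spread: a and b become similar

print(f"tight blobs: {tight:.2f}, overlapping blobs: {loose:.2f}")
```

The tight blobs should score close to 1, while the overlapping blobs score noticeably lower.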

In [None]:
from sklearn.metrics import silhouette_score

K = range(2, 20)

silhouettes = []

for k in K:
    kmeans = KMeans(n_clusters=k,
                   random_state=1234)
    kmeans.fit(X_prep)
    silhouettes.append(silhouette_score(X_prep, kmeans.predict(X_prep)))


In [None]:
import matplotlib.pyplot as plt


plt.figure(figsize=(16,8))
plt.plot(K, silhouettes, 'bo-')
plt.xlabel('k (number of clusters)')
plt.ylabel('silhouette score')

In [None]:
""" doesn't work, check the environment

from yellowbrick.cluster import SilhouetteVisualizer

fig, ax = plt.subplots(2, 2, figsize=(15,8))
for k in [2, 3, 4, 5]:
    '''
    Create KMeans instance for different number of clusters
    '''
    km = KMeans(n_clusters=k,
                random_state=1234)
    q, mod = divmod(k, 2)
    '''
    Create SilhouetteVisualizer instance with KMeans instance
    Fit the visualizer
    '''
    
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X_prep)
"""

# Let's visualize the result!

In [None]:
kmeans = KMeans(n_clusters=3,
             random_state=1234)

kmeans.fit(X_prep)

clusters = kmeans.predict(X_prep)
clusters

In [None]:
clusters.shape

In [None]:
wines_clustered = pd.DataFrame(X_prep, columns=data['feature_names'])

In [None]:
wines_clustered['cluster_id'] = clusters

In [None]:
wines_clustered.head()

In [None]:
wines_clustered['cluster_id'].value_counts()

In [None]:
kmeans.cluster_centers_

In [None]:
cluster_centers_df = pd.DataFrame(kmeans.cluster_centers_, columns=data['feature_names'])

In [None]:
cluster_centers_df

In [None]:
cluster_centers_df['cluster_id'] = range(0,3)

In [None]:
cluster_centers_df

In [None]:
# this contains my cluster centers
cluster_center_sub_df = cluster_centers_df[['alcohol', 'color_intensity', 'cluster_id']]

# this contains my data points with their assigned cluster
wines_clustered_sub_df = wines_clustered[['alcohol', 'color_intensity', 'cluster_id']]

In [None]:
cluster_center_sub_df

In [None]:
wines_clustered_sub_df

In [None]:
import seaborn as sns

sns.scatterplot(data=wines_clustered_sub_df,
               x='alcohol',
               y='color_intensity',
               hue='cluster_id')

# plot centroids
sns.scatterplot(data=cluster_center_sub_df,
               x="alcohol",
               y="color_intensity",
               hue='cluster_id',
                legend=False,
                # marker=u'8',
                marker='+',
                s=500,
               )

In [None]:
wines2 = wines_clustered[wines_clustered['cluster_id']==2]

In [None]:
X['cluster_id'] = wines_clustered['cluster_id']

In [None]:
for col in X.columns:
    plt.figure()  # new figure per column so the box plots do not overlap
    X[col].plot(kind='box')