# Clustering

**Dataset:** Penguins

# INSTALL AND IMPORT LIBRARIES
This demonstration requires the `palmerpenguins` library, which can be installed with Python's `pip` command. The installation only needs to run once per environment.

The standard, shorter approach may work:

In [None]:
%pip install palmerpenguins

If the above command didn't work, it may be necessary to be more explicit, in which case you could run the code below.

In [None]:
import sys
!{sys.executable} -m pip install palmerpenguins

Once `palmerpenguins` is installed, load the libraries below.

In [17]:
from palmerpenguins import load_penguins  # For penguins dataset
import pandas as pd                       # For dataframes
import matplotlib.pyplot as plt           # For plotting functions
import seaborn as sns                     # For additional plotting functions
from sklearn.cluster import KMeans                # For k-Means
from sklearn.model_selection import GridSearchCV  # For grid search
from sklearn.metrics import silhouette_score      # For metrics and scores
from sklearn.preprocessing import StandardScaler  # For standardizing data

# LOAD AND PREPARE DATA
For this demonstration of clustering, we'll use the `penguins` dataset, which is available in the `palmerpenguins` package. It is also described at [https://pypi.org/project/palmerpenguins/](https://pypi.org/project/palmerpenguins/).

The following steps prepare the data:

1. Load the `penguins` dataset into the variable `df`
1. Remove the `island`, `year`, and `sex` variables
1. Rename the class variable `species` as `y`
1. Drop all rows with `NaN`
1. Display the first 5 rows of `df`

In [None]:
# Loads the penguins dataset
df = load_penguins()

# Drop variables and NaN cases, rename variable
df = df.drop(['island', 'year', 'sex'], axis=1) \
    .dropna() \
    .rename(columns={'species': 'y'})

# Displays the first 5 rows of data
df.head()

In [None]:
len(df.bill_depth_mm)

In [None]:
len(df.bill_length_mm)




# EXPLORE THE DATA
Visualize various aspects of the penguins dataset.

## Bar Plot of Class Variable
Use Seaborn's `countplot` function to create a bar plot and look at the distribution of the different species.

In [None]:
sns.countplot(x='y', data=df)

## Scatter Plots and Density Plots for Feature Pairs
Plot the relationships between all features using `PairGrid`. In particular, notice how the `bill_length_mm` and `bill_depth_mm` variables, taken together, separate the species well.

In [None]:
# Creates a grid using Seaborn's PairGrid()
g = sns.PairGrid(
    df, 
    vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], 
    hue='y', 
    diag_sharey=False, 
    palette=["red", "green", "blue"])

# Adds histograms on the diagonal
g.map_diag(plt.hist)

# Adds density plots above the diagonal
g.map_upper(sns.kdeplot)

# Adds scatterplots below the diagonal
g.map_lower(sns.scatterplot)

# Adds a legend
g.add_legend()

# PREPARE DATA

In [None]:
df = df.reset_index(drop=True)

# Separates the class variable in y
y = df.y

# Removes the y column from df
df = df.drop('y', axis=1)

# Standardizes df
df = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns)

# Displays the first 5 rows of df
df.head()

In [None]:
len(df.bill_depth_mm)

In [None]:
len(df.bill_length_mm)

In [None]:
df.shape
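
As a sanity check on the standardization step above, the sketch below applies `StandardScaler` to a small, hypothetical table (the column names and values are made up, not taken from the penguins data) and compares the result with z-scores computed by hand. It assumes that each column should end up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not taken from the penguins dataset
toy = pd.DataFrame({
    'length_mm': [39.1, 46.5, 50.0, 42.3],
    'mass_g': [3750.0, 5400.0, 3700.0, 4200.0]})

# Standardizes the toy data the same way as the cell above
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns)

# Manual z-scores; StandardScaler divides by the population
# standard deviation, which corresponds to ddof=0 in pandas
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Column means should be ~0 and standard deviations ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
```

Standardizing matters for k-means because the algorithm relies on Euclidean distances, so a feature measured in grams would otherwise dominate features measured in millimeters.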

# RUNNING k-MEANS

## k-Means: Train the Model
We'll set up a `KMeans` object with the following parameters:

- `n_clusters`: Total number of clusters to make.
- `random_state`: Seed for the random number generator; set to 1 to reproduce these results.
- `init`: How to initialize the k-means centers; we'll use `k-means++`.
- `n_init`: Number of times k-means is run with different initial centers; the run with the lowest `inertia_` is kept.

A fitted `KMeans` object exposes several attributes, some of which are used in this demo:
- `cluster_centers_`: Stores the discovered cluster centers.
- `labels_`: Cluster label of each instance.
- `inertia_`: Sum of squared distances from each instance to its assigned center.
- `n_iter_`: Number of iterations run to find the centers.
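
To make these attributes concrete, here is a minimal sketch on hypothetical toy data (six hand-picked points, not the penguins data). It reads the fitted attributes, which scikit-learn names with a trailing underscore, and recomputes the inertia by hand as the sum of squared distances to the assigned centers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of three points
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Same style of setup as in this demo, but with two clusters
toy_km = KMeans(n_clusters=2, random_state=1, init='k-means++', n_init=10)
toy_km.fit(X)

# Looks up the center assigned to each point, then sums squared distances
assigned_centers = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(toy_km.cluster_centers_)             # one row per cluster
print(toy_km.labels_)                      # one label per point
print(toy_km.inertia_, manual_inertia)     # the two values should agree
print(toy_km.n_iter_)                      # iterations of the best run
```

Which group receives label 0 and which receives label 1 is arbitrary, so comparisons between runs should not rely on the label values themselves.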

In [None]:
# Sets up the kMeans object
km = KMeans(
    n_clusters=3,
    random_state=1,
    init='k-means++',
    n_init=10)

# Fits the model to the data
km.fit(df)

# Displays the parameters of the fitted model
km.get_params()

## k-Means: Visualize the Clusters
The code below creates a scatterplot of the first two features, `bill_length_mm` and `bill_depth_mm`, in standardized units. Each point is colored according to its actual species label. For comparison, the marker style of each point shows the cluster label found by k-means, and the red crosses mark the cluster centers.

In [None]:
# Creates a scatter plot
sns.scatterplot(
    x='bill_length_mm', 
    y='bill_depth_mm',
    data=df, 
    hue=y,
    style=km.labels_,
    palette=["orange", "green", "blue"])

# Adds cluster centers to the same plot
plt.scatter(
    km.cluster_centers_[:,0],
    km.cluster_centers_[:,1],
    marker='x',
    s=200,
    c='red')
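
A plot like this can be complemented with a cross-tabulation of actual labels against cluster labels. The sketch below is one possible way to do that and is not part of the original demo; it uses synthetic data from `make_blobs` (an assumption, chosen so the example runs without the penguins package), and the names `true_labels`, `toy_km`, and `table` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated groups with known labels
X, true_labels = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=1)

# Fits k-means with three clusters
toy_km = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)

# Counts how many instances of each actual group fall in each cluster
table = pd.crosstab(
    true_labels,
    toy_km.labels_,
    rownames=['actual'],
    colnames=['cluster'])

print(table)
```

With the penguins data, the same idea would presumably be `pd.crosstab(y, km.labels_)`. Cluster numbers are arbitrary, so a good clustering shows one dominant cell per row rather than counts on the diagonal.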

# k-MEANS: OPTIMIZE VIA SILHOUETTE SCORES
The main challenge in k-means is choosing the number of clusters. We can set up a `GridSearchCV` object to search for the best value. For k-means, we use a custom scorer that computes the silhouette score for each candidate value of `n_clusters`. The custom scorer is called `s2()` in the code below; it uses `silhouette_score()` from the `sklearn.metrics` module to score the cluster labels that the fitted estimator predicts for the data `X`.

A silhouette value lies in [-1, +1]. For each instance, it compares the mean distance to the other members of its own cluster (cohesion) with the mean distance to the members of the nearest other cluster (separation). `silhouette_score()` returns the mean of these values over all instances; scores close to +1 indicate compact, well-separated clusters.
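
The definition can be checked by hand. The sketch below uses hypothetical toy points and labels and computes each silhouette value as `(b - a) / max(a, b)`, where `a` is the mean distance to the other members of the same cluster and `b` is the mean distance to the members of the nearest other cluster. It then compares the result with scikit-learn's `silhouette_samples()` and `silhouette_score()`. The helper `manual_silhouette()` is illustrative and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import (
    pairwise_distances, silhouette_samples, silhouette_score)

# Hypothetical toy data: two groups of three points with fixed labels
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Euclidean distances between all pairs of points
D = pairwise_distances(X)

def manual_silhouette(i):
    # a: mean distance to the other members of the same cluster
    same = labels == labels[i]
    same[i] = False  # excludes the point itself
    a = D[i, same].mean()
    # b: smallest mean distance to the members of any other cluster
    b = min(
        D[i, labels == other].mean()
        for other in np.unique(labels) if other != labels[i])
    return (b - a) / max(a, b)

manual = np.array([manual_silhouette(i) for i in range(len(X))])

print(manual.round(3))
print(np.allclose(manual, silhouette_samples(X, labels)))
print(manual.mean(), silhouette_score(X, labels))
```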

In [None]:
# Sets up the custom scorer; GridSearchCV calls it with the held-out fold as X
def s2(estimator, X):
    return silhouette_score(X, estimator.predict(X))

# List of values for the parameter `n_clusters`
param = range(2, 10)

# KMeans object; n_init is set explicitly because its default
# differs across scikit-learn versions
km = KMeans(random_state=0, init='k-means++', n_init=10)

# Sets up GridSearchCV object and stores in grid variable
grid = GridSearchCV(
    km,
    {'n_clusters': param},
    scoring=s2,
    cv=2)

# Fits the grid object to data
grid.fit(df)

# Accesses the optimum model
best_km = grid.best_estimator_

# Displays the optimum model
best_km.get_params()
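
Because `GridSearchCV` with `cv=2` fits each candidate on one half of the data and scores it on the other half, it can be useful to cross-check the result without cross-validation. The sketch below is one simple alternative, not the method used in this demo: it loops over `n_clusters`, fits on all of the data, and scores the resulting labels directly. It runs on synthetic `make_blobs` data (an assumption, so the example is self-contained); `scores` and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0)

# Fits k-means for each candidate k and scores the labels on all data
scores = {}
for k in range(2, 10):
    labels = KMeans(
        n_clusters=k,
        random_state=0,
        init='k-means++',
        n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Picks the k with the highest silhouette score
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print('best k:', best_k)
```

On data with less clearly separated groups, such as the penguins measurements, the two approaches may disagree, and the highest silhouette score need not match the number of actual classes.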

## Plot of Scores for Different Number of Clusters
The `grid` object has an attribute `cv_results_` through which the scores for different `n_clusters` can be accessed.

In [None]:
# Plot mean_test_scores vs. n_clusters
plt.plot(
    param,
    grid.cv_results_['mean_test_score'])

# Draw a vertical line, where the best model is
plt.axvline(
    x=best_km.n_clusters, 
    color='red',
    ls='--')

# Adds labels to the plot
plt.xlabel('Total Centers')
plt.ylabel('Silhouette Score')

## Visualize the Best Model
The code below visualizes the clusters stored in the optimum model `best_km`.

In [None]:
# Creates a scatter plot
sns.scatterplot(
    x='bill_length_mm', 
    y='bill_depth_mm',
    data=df, 
    hue=y,
    style=best_km.labels_,
    palette=['orange', 'green', 'blue'])

# Adds cluster centers to the same plot
plt.scatter(
    best_km.cluster_centers_[:, 0],
    best_km.cluster_centers_[:, 1],
    marker='x',
    s=200,
    c='red')